# 👉 First, you need some documents! 📑

This notebook takes documents _(from whichever sources you specify)_ and loads
them as LangChain `Document` objects, ready for processing in the next notebooks.

Options available for loading documents:
- **Local Files**: You can load documents from your local filesystem.
- **LangChain Retrievers**: Various built-in retrievers can load
  documents from the web, such as Wikipedia.  
  _This notebook includes a few examples: Wikipedia, song lyrics (from AZLyrics), movie scripts (from IMSDb), and YouTube transcripts._

> ℹ️ **Note**: This step is optional if you are exclusively using search indexes such as Azure Cognitive Search or Elasticsearch.  
> _These manage their own document storage._

> ℹ️ **References**
> - https://python.langchain.com/docs/integrations/retrievers/


## 📂 Local Files

The process of loading local files is:
1. Specify path to folder with the desired documents
2. Load each document using a [LangChain Document Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
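
Since loading large folders can take a long time, it can help to check which files a glob pattern would pick up before running a loader. The following is a minimal sketch using only the standard library; `preview_matching_files` and `total_size_mb` are hypothetical helper names, and the matching here only looks at files directly inside the folder, so it may not exactly mirror how a given LangChain loader walks directories.

```python
from pathlib import Path


def preview_matching_files(folder: str, pattern: str = "*.pdf") -> list[Path]:
    """Return the files directly inside `folder` that match `pattern`, sorted by path."""
    return sorted(p for p in Path(folder).glob(pattern) if p.is_file())


def total_size_mb(paths: list[Path]) -> float:
    """Return the combined size of `paths` in megabytes."""
    return sum(p.stat().st_size for p in paths) / (1024 * 1024)
```

For example, `preview_matching_files(path_to_folder_with_documents, "*.pdf")` lists the candidate files, and passing that list to `total_size_mb` gives a rough idea of how heavy the load will be.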

The simplest way to load a document is to use the [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).

This extracts the text from the desired document. By default, it extracts the entire text into a single string, but it can keep "elements" of the document together by using `mode="elements"`.
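
As a rough illustration of the difference, the sketch below loads a file with `mode="elements"` and then counts how many documents fall into each element category. The helper names are hypothetical, and it assumes each element's metadata carries a `"category"` key; documents without one are counted as `"unknown"`.

```python
from collections import Counter


def load_as_elements(path: str):
    """Load one file with UnstructuredFileLoader, yielding one Document per element."""
    # Imported lazily so the counting helper below works without LangChain installed.
    from langchain.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()


def count_element_categories(docs) -> Counter:
    """Count documents by their "category" metadata (assumed to be set in elements mode)."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)
```

Calling `count_element_categories(load_as_elements(path_to_single_file))` gives a quick view of how the document was partitioned.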

> _The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, unstructured uses different methods for partitioning a document._
> 
> ~ https://unstructured-io.github.io/unstructured/getting_started.html

If you want more fine-grained control over how documents are loaded, you can browse and use specific [Document loaders](https://python.langchain.com/docs/integrations/document_loaders/), _e.g. you can find multiple ways of loading Word documents._

For CSV/HTML/JSON/Markdown/PDF, you can find detailed information [in this other section](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

> ℹ️ **References**
> - https://python.langchain.com/docs/modules/data_connection/document_loaders/
> - https://python.langchain.com/docs/integrations/document_loaders/
> - https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
> - https://pypi.org/project/unstructured/
> - https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory

In [1]:
from langchain.schema import Document

documents: list[Document] = []

In [None]:
path_to_folder_with_documents = "/path/to/folder/with/documents"
glob_pattern = path_to_single_file = None  # optional; overridden by the example cells below

In [None]:
# Delete or comment out this cell to prevent these example science documents from being loaded
# Note: these documents are over 100MB and may take 10+ minutes to load
path_to_folder_with_documents = "../Resources/Example Documents/Science Translational Medicine"
glob_pattern = "*.pdf" # optional

In [10]:
# If you only want to load a single file (e.g. one of these large .pdf) then use this
# Note: this single document is 35MB and takes approx. 1 minute to load
path_to_single_file = "../Resources/Example Documents/Science Translational Medicine/2020_Neuroscience.pdf"

In [12]:
import os

if path_to_single_file and os.path.isfile(path_to_single_file):
    from langchain.document_loaders import UnstructuredFileLoader
    print(f"Loading single document: {path_to_single_file}")
    documents.extend(UnstructuredFileLoader(path_to_single_file).load())

elif path_to_folder_with_documents and os.path.isdir(path_to_folder_with_documents):
    from langchain.document_loaders import DirectoryLoader
    print(f"Loading all documents from folder: {path_to_folder_with_documents}")
    if glob_pattern:
        # Use the configured pattern rather than a hard-coded "*.pdf"
        documents.extend(DirectoryLoader(path_to_folder_with_documents, glob=glob_pattern, show_progress=True).load())
    else:
        documents.extend(DirectoryLoader(path_to_folder_with_documents, show_progress=True).load())


Loading single document: ../Resources/Example Documents/Science Translational Medicine/2020_Neuroscience.pdf


## 🌐 Files From The Interwebs

The following cells load LangChain `Document`s from various places on the internet using various [Document loaders](https://python.langchain.com/docs/integrations/document_loaders/).

_You can also make your own document loaders_.
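
A minimal sketch of what a home-made loader could look like: a class with a `load()` method that returns a list of `Document` objects. `PlainTextFolderLoader` is a hypothetical name, and for brevity it does not subclass LangChain's `BaseLoader`, which a real loader would normally do. The fallback `Document` dataclass is only a stand-in so the sketch runs without LangChain installed.

```python
from dataclasses import dataclass, field
from pathlib import Path

try:
    from langchain.schema import Document
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without LangChain.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


class PlainTextFolderLoader:
    """Hypothetical loader: one Document per .txt file directly inside a folder."""

    def __init__(self, folder: str, encoding: str = "utf-8"):
        self.folder = Path(folder)
        self.encoding = encoding

    def load(self) -> list:
        docs = []
        for path in sorted(self.folder.glob("*.txt")):
            text = path.read_text(encoding=self.encoding)
            # Record where the text came from, like the built-in loaders do.
            docs.append(Document(page_content=text, metadata={"source": str(path)}))
        return docs
```

It can then be used like the other loaders in this notebook, e.g. `documents.extend(PlainTextFolderLoader(path_to_folder_with_documents).load())`.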

Other cool ones to call out are:
- [Azure Blob Storage Container](https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container) document loader
- [Azure Blob Storage File](https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_file) document loader

> ℹ️ **References**
> - https://python.langchain.com/docs/integrations/document_loaders/
> - https://python.langchain.com/docs/integrations/document_loaders/azlyrics
> - https://python.langchain.com/docs/integrations/document_loaders/imsdb
> - https://python.langchain.com/docs/integrations/document_loaders/youtube_transcript
> - https://python.langchain.com/docs/integrations/document_loaders/wikipedia
> - https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container
> - https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_file

In [2]:
# This cell loads a few lyrics documents from AZLyrics.com
from langchain.document_loaders import AZLyricsLoader

song_urls = [
    "https://www.azlyrics.com/lyrics/taylorswift/lovestory.html",
    "https://www.azlyrics.com/lyrics/alanismorissette/ironic.html",
    "https://www.azlyrics.com/lyrics/savestheday/thisisnotanexit.html",
    "https://www.azlyrics.com/lyrics/thrice/stareatthesun.html",
]

for song_url in song_urls:
    print(f"Loading {song_url}")
    documents.extend(AZLyricsLoader(song_url).load())

Loading https://www.azlyrics.com/lyrics/taylorswift/lovestory.html...
Loading https://www.azlyrics.com/lyrics/alanismorissette/ironic.html...
Loading https://www.azlyrics.com/lyrics/savestheday/thisisnotanexit.html...
Loading https://www.azlyrics.com/lyrics/thrice/stareatthesun.html...


In [3]:
# This cell loads the scripts from a couple of my favorite movies from IMSDb.com
from langchain.document_loaders import IMSDbLoader

movie_script_urls = [
    "https://imsdb.com/scripts/Hackers.html",
    "https://imsdb.com/scripts/Princess-Bride,-The.html"
]

for movie_script_url in movie_script_urls:
    print(f"Loading {movie_script_url}")
    documents.extend(IMSDbLoader(movie_script_url).load())


Loading https://imsdb.com/scripts/Hackers.html
Loading https://imsdb.com/scripts/Princess-Bride,-The.html


In [4]:
# This cell loads the transcript from a YouTube music video
from langchain.document_loaders import YoutubeLoader

youtube_video_urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
]

for youtube_video_url in youtube_video_urls:
    print(f"Loading {youtube_video_url}")
    documents.extend(YoutubeLoader.from_youtube_url(youtube_video_url).load())


Loading https://www.youtube.com/watch?v=dQw4w9WgXcQ


In [6]:
# This cell loads a few pages from Wikipedia
from langchain.document_loaders import WikipediaLoader

wikipedia_searches = [
    "Large language model",
    "Generative AI"
]

for search in wikipedia_searches:
    print(f"Searching for {search}")
    documents.extend(WikipediaLoader(search, load_max_docs=1).load())
    

Searching for Large language model
Searching for Generative AI


Loaded 10 documents
 - Source: https://www.azlyrics.com/lyrics/taylorswift/lovestory.html
 - Source: https://www.azlyrics.com/lyrics/alanismorissette/ironic.html
 - Source: https://www.azlyrics.com/lyrics/savestheday/thisisnotanexit.html
 - Source: https://www.azlyrics.com/lyrics/thrice/stareatthesun.html
 - Source: https://imsdb.com/scripts/Hackers.html
 - Source: https://imsdb.com/scripts/Princess-Bride,-The.html
 - Source: dQw4w9WgXcQ
 - Source: https://en.wikipedia.org/wiki/Large_language_model
 - Source: https://en.wikipedia.org/wiki/Generative_artificial_intelligence
 - Source: ../Resources/Example Documents/Science Translational Medicine/2020_Neuroscience.pdf


# Store Documents _(for use in other notebooks)_

Run the next cell to print out a summary of the documents you have stored.

It will also `%store` the documents _(in your IPython database)_ for use in the next notebooks, where they can be restored with `%store -r documents`.
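
Because every loading cell appends to the same `documents` list, re-running a cell adds the same documents a second time. The sketch below shows one way to drop such repeats before storing. `deduplicate_documents` is a hypothetical helper, and it assumes that two documents with the same `source` metadata and the same text are duplicates.

```python
def deduplicate_documents(docs: list) -> list:
    """Return docs without repeats, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for doc in docs:
        # Assumption: source plus text is enough to identify a document.
        key = (doc.metadata.get("source"), doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Running `documents = deduplicate_documents(documents)` before the `%store` line keeps the stored list free of accidental repeats.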

In [None]:
print(f"Loaded {len(documents)} documents")
for document in documents:
    # Fall back to "unknown" in case a loader did not set a source
    print(f" - Source: {document.metadata.get('source', 'unknown')}")

print(f"Storing {len(documents)} documents for the next notebooks...")
%store documents
print("Done!")

## Once you are ready, head over to [Notebook #2](#) 👉