# 👉 First, you need some documents! 📑

This notebook takes documents _(from whichever sources you specify)_ and loads
them as LangChain `Document` objects, ready for processing in the next notebooks.

Options available for loading documents:
- **Local Files**: You can load documents from your local filesystem.
- **LangChain Retrievers**: There are various built-in retrievers that can load
  documents from the web, such as Wikipedia, etc.  
  _This notebook includes a few examples: Wikipedia, Song Lyrics (from AZLyrics), Movie Scripts Database (from IMSDb)_

> ℹ️ **Note**: This step is optional if you exclusively use search indexes such as Azure Cognitive Search, Elasticsearch, etc.  
> _These manage their own document storage._

> ℹ️ **References**
> - https://python.langchain.com/docs/integrations/retrievers/


## 📂 Local Files

The process of loading local files is:
1. Specify the path to the folder containing the desired documents
2. Load each document using a [LangChain Document Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
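
Because loading a large folder can be slow, it may help to preview which files a glob pattern selects before running any loader. The sketch below uses only the standard library. `preview_matching_files` and `total_size_mb` are hypothetical helpers (not part of LangChain), and the sketch assumes non-recursive matching with `Path.glob`, which may differ from how a particular loader walks a directory.

```python
import tempfile
from pathlib import Path


def preview_matching_files(folder: str, glob_pattern: str = "*") -> list[Path]:
    """Return the regular files in `folder` that match `glob_pattern`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        # Nothing to preview if the folder does not exist
        return []
    return sorted(p for p in root.glob(glob_pattern) if p.is_file())


def total_size_mb(files: list[Path]) -> float:
    """Sum the on-disk size of `files` in megabytes, as a rough hint of loading time."""
    return sum(p.stat().st_size for p in files) / 1_000_000


# Small demonstration on a throwaway folder (file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(tmp) / name).write_text("example")
    matched = preview_matching_files(tmp, "*.pdf")
    matched_names = [p.name for p in matched]
    size_mb = total_size_mb(matched)

print(matched_names)  # ['a.pdf', 'b.pdf']
```

In the notebook, the same helper could be called as `preview_matching_files(path_to_folder_with_documents, "*.pdf")` to check the selection before committing to a long load.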

The simplest way to load a document is using the [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).

This extracts the text from the desired document. By default, it extracts the entire text into a single string, but it can keep "elements" of the document together by using `mode="elements"`.
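
As a rough illustration of `mode="elements"`, the sketch below wraps the loader call in a function and then counts how many documents fall into each element category. `load_as_elements` and `count_categories` are hypothetical helper names, and the assumption that each element's type is stored under `metadata["category"]` should be checked against your installed versions. The demonstration uses stand-in objects rather than real loader output, so it runs without `langchain` or `unstructured` installed.

```python
from collections import Counter
from types import SimpleNamespace


def load_as_elements(path: str):
    """Load one file with mode="elements", giving one Document per detected element."""
    # Imported lazily: actually calling this requires `langchain` and `unstructured`.
    from langchain.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()


def count_categories(docs) -> Counter:
    """Count documents per element category (assumed to be in metadata["category"])."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Stand-in objects shaped like Documents, used here instead of real loader output
fake_docs = [
    SimpleNamespace(page_content="Results", metadata={"category": "Title"}),
    SimpleNamespace(page_content="First paragraph", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Second paragraph", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="", metadata={}),
]
category_counts = count_categories(fake_docs)
print(category_counts.most_common())
```

With real data, `count_categories(load_as_elements(path_to_single_file))` would give a quick overview of how a document was partitioned.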

> _The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, unstructured uses different methods for partitioning a document._
> 
> ~ https://unstructured-io.github.io/unstructured/getting_started.html

If you want more fine-grained control over how documents are loaded, you can browse and use specific [Document loaders](https://python.langchain.com/docs/integrations/document_loaders/). _For example, there are multiple ways of loading Word documents._

For CSV/HTML/JSON/Markdown/PDF, you can find detailed information [in this other section](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

> ℹ️ **References**
> - https://python.langchain.com/docs/modules/data_connection/document_loaders/
> - https://python.langchain.com/docs/integrations/document_loaders/
> - https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
> - https://pypi.org/project/unstructured/
> - https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory

In [2]:
from langchain.schema import Document

documents: list[Document] = []

In [None]:
path_to_folder_with_documents = "/path/to/folder/with/documents"

In [None]:
# Delete or comment out this cell to prevent these example science documents from being loaded
# Note: these documents are over 100MB and may take 10+ minutes to load
path_to_folder_with_documents = "../Resources/Example Documents/Science Translational Medicine"
glob_pattern = "*.pdf" # optional

In [None]:
# If you only want to load a single file (e.g. one of these large .pdf) then use this
# Note: this single document is 35MB and takes approx. 1 minute to load
path_to_single_file = "../Resources/Example Documents/Science Translational Medicine/2020_Neuroscience.pdf"

In [None]:
import os

# Fall back to None for any variable whose cell above was deleted or commented out
path_to_single_file = globals().get("path_to_single_file")
glob_pattern = globals().get("glob_pattern")

if path_to_single_file and os.path.isfile(path_to_single_file):
    from langchain.document_loaders import UnstructuredFileLoader
    print(f"Loading single document: {path_to_single_file}")
    documents = UnstructuredFileLoader(path_to_single_file).load()

elif path_to_folder_with_documents and os.path.isdir(path_to_folder_with_documents):
    from langchain.document_loaders import DirectoryLoader
    print(f"Loading all documents from folder: {path_to_folder_with_documents}")
    if glob_pattern:
        # Use the configured pattern rather than a hardcoded "*.pdf"
        documents.extend(DirectoryLoader(path_to_folder_with_documents, glob=glob_pattern, show_progress=True).load())
    else:
        documents.extend(DirectoryLoader(path_to_folder_with_documents, show_progress=True).load())


## 🌐 Files From The Interwebs

In [None]:
# This cell loads a few lyrics documents from AZLyrics.com


# Save so we can use it in the next one, with link to next one...

In [None]:
print(f"Storing {len(documents)} documents for the next notebook!")
%store documents