# Data Loaders
* Load all kinds of data and then ask the LLM questions about it.
* Connect with data sources and load private documents.

## LangChain built-in data loaders.
* Labeled as "integrations".
* Most of them require to install the corresponding libraries.

## LangChain documentation on Document Loaders
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/).
* See the list of built-in document loaders [here](https://python.langchain.com/v0.1/docs/integrations/document_loaders/).
* See the latest documentation for document loaders [here](https://python.langchain.com/docs/integrations/document_loaders/#all-document-loaders)

In [None]:
#pip install python-dotenv

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [3]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [4]:
#!pip install langchain-openai

* NOTE: Since right now is the best LLM in the market, we will use OpenAI by default. You will see how to connect with other Open Source LLMs like Llama3 or Mistral in a next lesson.

## Simple data loading

#### Loading a .txt file

In [2]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain-community

In [3]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/be-good.txt")

loaded_data = loader.load()

* If you uncomment and execute the next cell you will see the contents of the loaded document.

In [4]:
loaded_data

[Document(metadata={'source': './data/be-good.txt'}, page_content='Be good\n\nApril 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we started Y Combinator we came up with the\nphrase that became our motto: Make something people want.  We\'ve\nlearned a lot since then, but if I were choosing now that\'s still\nthe one I\'d pick.Another thing we tell founders is not to worry too much about the\nbusiness model, at least at first.  Not because making money is\nunimportant, but because it\'s so much easier than building something\ngreat.A couple weeks ago I realized that if you put those two ideas\ntogether, you get something surprising.  Make something people want.\nDon\'t worry too much about making money.  What you\'ve got is a\ndescription of a charity.When you get an unexpected result like this, it could either be a\nbug or a new discovery.  Either businesses aren\'t supposed to be\nlike charities, and we\'ve proven by reductio ad absurdum that o

#### Loading a CSV file

In [5]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('./data/Street_Tree_List.csv')

loaded_data = loader.load()

In [7]:
# loaded_data

#### Loading an .html file

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install bs4

In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader('./data/100-startups.html')

loaded_data = loader.load()

In [23]:
#loaded_data

#### Loading a .pdf file

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install pypdf

In [9]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./data/5pages.pdf')

loaded_data = loader.load_and_split()

In [10]:
loaded_data[0].page_content

'Page 1 of 4 PDF Files \nScan – Create – Reduce File Size  \n \n \nIt is recommended that you purchase an Adobe Acrobat product that \nallows you to read, create and manipulate PDF documents.  Go to http://www.adobe.com/products/acrobat/matrix.html\n to compare \nAdobe products and features –Adobe  Acrobat Standard is sufficient. \n \n \nScanning Documents \n \nYou should only have to scan docu ments that are not electronic, and \nwhen you are unable to create a PDF using PDFMaker or the Print \nCommand from the applicat ion you are using.   \n \nSignature Pages If you have a document such as a CV that requires a signature on a page only print the page that re quires the signature –printing the \nentire document and scanning it is not\n necessary or desired.  Once you \nsign and scan the signature page you can combine it with the original \ndocument using the Create PDF From Multiple Files feature. \n Scanner Settings Before scanning documents rememb er to make certain that the \nfollo

#### Loading a Wikipedia page and asking questions about it

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install wikipedia

In [None]:
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader('query=name, load_max_docs=1')

loaded_data = loader.load()[0].page_content

In [18]:
loaded_data

'Tesla, Inc. ( TESS-lə or  TEZ-lə) is an American multinational automotive and clean energy company. Headquartered in Austin, Texas, it designs, manufactures and sells battery electric vehicles (BEVs), stationary battery energy storage devices from home to grid-scale, solar panels and solar shingles, and related products and services.\nTesla was founded in July 2003 by Martin Eberhard and Marc Tarpenning as Tesla Motors. Its name is a tribute to inventor and electrical engineer Nikola Tesla. In February 2004, Elon Musk joined as Tesla\'s largest shareholder; in 2008, he was named chief executive officer. In 2008, the company began production of its first car model, the Roadster sports car, followed by the Model S sedan in 2012, the Model X SUV in 2015, the Model 3 sedan in 2017, the Model Y crossover in 2020, the Tesla Semi truck in 2022 and the Cybertruck pickup truck in 2023. In June 2021 the Model 3 became the first electric car to sell 1 million units globally. In 2023, the Model Y

In [13]:
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("human", "Answer this {question}, here is some extra {context}"),
    ]
)

messages = chat_template.format_messages(
    name="JFK",
    question="Where was JFK born?",
    context=loaded_data
)

In [14]:
response = chatModel.invoke(messages)

In [16]:
response.content

'y" before ultimately leaving the company. Musk became CEO and product architect of Tesla. The first Tesla Roadster was delivered to Musk in February 2008.\nProduction of the Tesla Roadster began in March 2008. It was the first highway-legal all-electric vehicle to use lithium-ion battery cells. In July 2009, Tesla unveiled its Model S sedan prototype. The Model S was released in June 2012.'

## cusomized data loaders

In [1]:
import glob
import os
from typing import List
from multiprocessing import Pool
from tqdm import tqdm

In [2]:

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import (
    EverNoteLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredEmailLoader,
    UnstructuredEPubLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredODTLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

In [4]:
LOADER_MAPPING = {
    ".csv": (CSVLoader, {}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".enex": (EverNoteLoader, {}),
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".odt": (UnstructuredODTLoader, {}),
    ".pdf": (PyMuPDFLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
}


In [8]:
from langchain.docstore.document import Document
from typing import List
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain_text_splitters import RecursiveCharacterTextSplitter

In [6]:
embeddings_model_name = 'BAAI/bge-large-en'
chunk_size = 256
chunk_overlap = 32

In [9]:
def load_single_document(file_path: str) -> List[Document]:
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        loader = loader_class(file_path, **loader_args)
        return loader.load()

    raise ValueError(f"Unsupported file extension '{ext}'")

def load_documents(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """
    Loads all documents from the source documents directory, ignoring specified files
    """
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(
            glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
        )
    filtered_files = [file_path for file_path in all_files if file_path not in ignored_files]

    with Pool(processes=os.cpu_count()) as pool:
        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
                results.extend(docs)
                pbar.update()

    return results

def process_documents(source_directory,ignored_files: List[str] = []) -> List[Document]:
    logger.info(f"Loading documents from {source_directory}")
    documents = load_documents(source_directory, ignored_files)
    logger.info(f"Loaded {len(documents)} new documents from {source_directory}")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    logger.info(f"Split into {len(texts)} chunks of text (max. {chunk_size} tokens each)")
    return texts




In [None]:
source_directory = './data1'
ignored_files = []
documents = process_documents(source_directory,ignored_files)

2025-03-21 11:02:54,897 - INFO - Loading documents from ./data1
Loading new documents:   0%|                              | 0/1 [00:00<?, ?it/s]

In [19]:
print(documents)

[]


## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 001-data-loaders.py
* In terminal, make sure you are in the directory of the file and run:
    * python 001-data-loaders.py