# Document loaders

Document loaders read data from a variety of sources and return it as `Document` objects (text plus metadata). This notebook shows how to:

- load PDF files
- load web pages
- load CSV data
- load data from a directory
- load HTML data
- load JSON data
- load Markdown data
- load Microsoft Office data
- write a custom document loader
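
As a rough picture of what every loader in this notebook does, here is a minimal, library-free sketch. `Document` and `LineLoader` are hypothetical stand-ins written for illustration, not LangChain's classes; they only mirror the `page_content` / `metadata` / `lazy_load` shape used in the cells below.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that yields one Document per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so a large file is never held fully in memory.
        with Path(self.file_path).open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                text = line.strip()
                if text:
                    yield Document(text, {"source": self.file_path, "line": line_number})

    def load(self) -> list[Document]:
        # Eager convenience wrapper around lazy_load.
        return list(self.lazy_load())
```

Calling `LineLoader("notes.txt").load()` (with any text file of your own) returns a list of documents whose metadata records the source path and line number.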


## Load PDFs

### Read PDF text

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../assets/2302.13971v1.pdf"

loader = PyPDFLoader(file_path)
pages = []
# Top-level `async for` works in Jupyter/IPython; in a plain script, wrap it in an async function.
async for page in loader.alazy_load():
    pages.append(page)

# print(f"{pages[0].metadata}\n")
print(pages[0].page_content)

LLaMA: Open and Efﬁcient Foundation Language Models
Hugo Touvron∗, Thibaut Lavril∗, Gautier Izacard ∗, Xavier Martinet
Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin
Edouard Grave∗, Guillaume Lample∗
Meta AI
Abstract
We introduce LLaMA, a collection of founda-
tion language models ranging from 7B to 65B
parameters. We train our models on trillions
of tokens, and show that it is possible to train
state-of-the-art models using publicly avail-
able datasets exclusively, without resorting
to proprietary and inaccessible datasets. In
particular, LLaMA-13B outperforms GPT-3
(175B) on most benchmarks, and LLaMA-
65B is competitive with the best models,
Chinchilla-70B and PaLM-540B. We release
all our models to the research community1.
1 Introduction
Large Languages Models (LLMs) trained on mas-
sive corpora of texts have shown their ability to per-
form new tasks from textual instructions or from a
few examples 
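
The extracted text above keeps the PDF's typographic ligatures (the "ﬁ" in "Efﬁcient") and its end-of-line hyphenation ("founda-" / "tion"). If that gets in the way of search or display, a small cleanup pass can help. The sketch below is one possible approach, not part of the loader; `clean_pdf_text` is a hypothetical helper name.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    # NFKC folds compatibility characters such as the "ﬁ" ligature into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Join words split by a hyphen at a line break: "founda-\ntion" -> "foundation".
    # Requiring lowercase letters on both sides leaves "LLaMA-\n65B" untouched.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
```

Usage: `clean_pdf_text(pages[0].page_content)`. The heuristic will also merge genuine compounds that happen to break at a line end (for example "state-of-the-" followed by "art"), so treat it as a starting point.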

### Vector search over PDFs

In [23]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import OllamaEmbeddings

# Assumes a local Ollama server with the llama3.2:3b model already pulled.
embedding = OllamaEmbeddings(base_url="http://localhost:11434", model="llama3.2:3b")

vector_store = InMemoryVectorStore.from_documents(documents=pages[0:2], embedding=embedding)

search_text = "LLaMA is a collection of foundation language models, released by Meta. Unlike GPT models, \
LLaMA models are open-source, i.e., model weights are released to the research community under a noncommercial \
license. Thus, the LLaMA family grows rapidly as these models are widely used by many research groups to develop \
better open-source LLMs to compete the closed-source ones or to develop task-specific LLMs for mission-critical applications."

docs = vector_store.similarity_search(search_text, k=2)

for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:300]}\n")
    print()

Page 1: 2 Approach
Our training approach is similar to the methods
described in previous work (Brown et al., 2020;
Chowdhery et al., 2022), and is inspired by the
Chinchilla scaling laws (Hoffmann et al., 2022).
We train large transformers on a large quantity of
textual data using a standard optimizer.
2.1 


Page 0: LLaMA: Open and Efﬁcient Foundation Language Models
Hugo Touvron∗, Thibaut Lavril∗, Gautier Izacard ∗, Xavier Martinet
Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin
Edouard Grave∗, Guillaume Lample∗
Meta AI
Abstract
W
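
Conceptually, the search above embeds the query, compares that vector with the vector of each stored page, and returns the `k` closest pages. The sketch below illustrates the idea with cosine similarity and a toy word-count "embedding". Both are assumptions chosen to keep the example dependency-free; they are not what Ollama or `InMemoryVectorStore` actually compute internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding (an assumption for illustration): lowercase word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared tokens divided by the product of vector lengths.
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_search(query: str, texts: list[str], k: int = 2) -> list[str]:
    # Score every stored text against the query and keep the k best.
    query_vector = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vector, embed(t)), reverse=True)
    return ranked[:k]
```

A real embedding model captures meaning rather than exact word overlap, which is why results from the cell above can differ from what keyword matching would suggest.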




### Layout analysis and extraction of text from images

In [None]:
# Requires an Unstructured.io account and the UNSTRUCTURED_API_KEY environment variable
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
    
print(len(docs))

first_page_docs = [doc for doc in docs if doc.metadata.get("page_number") == 1]

for doc in first_page_docs:
    print(doc.page_content)
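
Because the cell above calls a hosted API, it can be worth checking for the key before constructing the loader. The helper below is a small sketch; `require_env` is a hypothetical name, and it assumes the key is supplied through the `UNSTRUCTURED_API_KEY` environment variable mentioned in the comment above.

```python
import os


def require_env(name: str) -> str:
    # Fail early with a clear message instead of a late authentication error.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return value
```

Usage: call `require_env("UNSTRUCTURED_API_KEY")` at the top of the cell, before creating the loader.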

### Extracting tables and other structures

In [None]:
import fitz
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image
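from langchain_unstructured import UnstructuredLoader

file_path = "../assets/layout-parser-paper.pdf"

# Parse this file with Unstructured so that `docs` carries the page_number,
# category, and coordinates metadata that render_page expects.
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)
docs = list(loader.lazy_load())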

def plot_pdf_with_boxes(pdf_page, segments):
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    fig, ax = plt.subplots(1, figsize=(10, 10))
    ax.imshow(pil_image)
    categories = set()
    category_to_color = {
        "Title": "orchid",
        "Image": "forestgreen",
        "Table": "tomato",
    }
    for segment in segments:
        points = segment["coordinates"]["points"]
        layout_width = segment["coordinates"]["layout_width"]
        layout_height = segment["coordinates"]["layout_height"]
        scaled_points = [
            (x * pix.width / layout_width, y * pix.height / layout_height)
            for x, y in points
        ]
        box_color = category_to_color.get(segment["category"], "deepskyblue")
        categories.add(segment["category"])
        rect = patches.Polygon(
            scaled_points, linewidth=1, edgecolor=box_color, facecolor="none"
        )
        ax.add_patch(rect)

    # Make legend
    legend_handles = [patches.Patch(color="deepskyblue", label="Text")]
    for category in ["Title", "Image", "Table"]:
        if category in categories:
            legend_handles.append(
                patches.Patch(color=category_to_color[category], label=category)
            )
    ax.axis("off")
    ax.legend(handles=legend_handles, loc="upper right")
    plt.tight_layout()
    plt.show()


def render_page(doc_list: list, page_number: int, print_text=True) -> None:
    pdf_page = fitz.open(file_path).load_page(page_number - 1)
    page_docs = [
        doc for doc in doc_list if doc.metadata.get("page_number") == page_number
    ]
    segments = [doc.metadata for doc in page_docs]
    plot_pdf_with_boxes(pdf_page, segments)
    if print_text:
        for doc in page_docs:
            print(f"{doc.page_content}\n")
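
The only arithmetic in `plot_pdf_with_boxes` is the rescaling of each bounding box from the layout's coordinate system to the rendered image's pixel grid. Pulled out on its own, it looks roughly like this (`scale_points` is a hypothetical helper name, and the sizes in the usage note are made-up values):

```python
def scale_points(points, layout_size, image_size):
    """Map (x, y) points from layout coordinates to image pixel coordinates."""
    layout_width, layout_height = layout_size
    image_width, image_height = image_size
    return [
        (x * image_width / layout_width, y * image_height / layout_height)
        for x, y in points
    ]
```

For example, with a 1700 x 2200 layout drawn on a 612 x 792 pixmap, `scale_points([(850, 1100)], (1700, 2200), (612, 792))` gives `[(306.0, 396.0)]`, the centre of the image.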

In [None]:
render_page(docs, 5)
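
Once the elements are rendered, pulling out just the tables is a matter of filtering on the same metadata that `render_page` uses. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for the loaded documents so that it runs without an API key; the sample contents are invented.

```python
from types import SimpleNamespace


def select_by_category(elements, category, page_number=None):
    """Return elements whose metadata matches the category (and page, if given)."""
    selected = []
    for element in elements:
        metadata = element.metadata
        if metadata.get("category") != category:
            continue
        if page_number is not None and metadata.get("page_number") != page_number:
            continue
        selected.append(element)
    return selected


# Hypothetical sample data shaped like the metadata used by render_page.
sample_docs = [
    SimpleNamespace(page_content="A heading", metadata={"category": "Title", "page_number": 5}),
    SimpleNamespace(page_content="A table", metadata={"category": "Table", "page_number": 5}),
    SimpleNamespace(page_content="Another table", metadata={"category": "Table", "page_number": 6}),
]

tables_on_page_5 = select_by_category(sample_docs, "Table", page_number=5)
```

With the real `docs` list, `select_by_category(docs, "Table", page_number=5)` should return only the table elements on that page, assuming the parser tagged them with the `Table` category.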

## Load webpages
