# Load and analyze PDF files

PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model. In this tutorial, you'll walk through some workflows that allow you to use an LLM to glean information from your PDF files. More specifically:

- First, you'll learn how to use a [Document Loader](/docs/concepts/#document-loaders) to load text in a format usable by an LLM.
- Next, you'll use an LLM to extract specific information from a loaded page in the PDF.
- Finally, you'll build a retrieval-augmented generation (RAG) pipeline to answer more general natural-language questions about the PDF, including citations from the source material.

This tutorial glosses over some concepts covered in more depth in our [RAG](/docs/tutorials/rag/) and [extraction](/docs/tutorials/extraction/) tutorials, so you may want to go through those first if you haven't already.

Let's dive in!

## Loading documents

First, you'll need to choose a PDF to load. We'll use a document from [Nike's annual public SEC report](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf). It's over 100 pages long and contains some crucial data mixed with longer explanatory text. Feel free to use a PDF of your own choosing instead.

Once you've chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. LangChain has a few different [built-in document loaders](/docs/how_to/document_loader_pdf/) for this purpose which you can experiment with, but for now, let's try a standard one powered by `pypdf` that loads data from a filepath:

In [None]:
%pip install -qU pypdf langchain_community

In [4]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

107


In [8]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K

{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}


So what just happened?

- The loader reads the PDF at the specified path into memory.
- It then extracts text data using the `pypdf` package.
- Finally, it creates a LangChain [Document](/docs/concepts/#documents) for each page of the PDF with the page's content and some metadata about where in the document the text came from.

LangChain has [many other document loaders](/docs/integrations/document_loaders/) for other data sources, or you can create a [custom document loader](/docs/how_to/document_loader_custom/).
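
To make the per-page output described in the list above concrete, here is a minimal sketch of the shape the loader produces. It is not LangChain's implementation: `PageDocument` and `pages_to_documents` are hypothetical stand-ins, and the page texts are made up rather than parsed from a real PDF.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageDocument:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts: List[str], source: str) -> List[PageDocument]:
    """Wrap each page's text in a document that records its source and page index."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": index})
        for index, text in enumerate(page_texts)
    ]


# Made-up page texts standing in for what a PDF parser would extract.
sample_pages = ["Cover page text", "Second page text"]
sample_docs = pages_to_documents(sample_pages, "example.pdf")

print(len(sample_docs))
print(sample_docs[0].metadata)
```

As in the loader output shown earlier, the `page` value is zero-based, so the first page of the file carries `'page': 0`.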

## Analyzing data

Now that you've loaded your PDF into an LLM-readable format with a document loader, the next step is to analyze the content using the LLM. You'll learn about two ways to do this next.

### Extracting specific fields

Extraction is useful if you know exactly what you want from a longer block of text. For example, if you know that the first loaded page contains an address somewhere in its content and you want to save it to a database, you can create a [Pydantic](https://pydantic.dev/) schema like the following and bind it to [a chat model capable of reliably outputting structured data](/docs/integrations/chat/):

```{=mdx}
import ChatModelTabs from "@theme/ChatModelTabs";

<ChatModelTabs openaiParams={`model="gpt-4o"`} hideTogether={true} hideFireworks={true} />
```

In [None]:
# | output: false
# | echo: false

%pip install langchain_anthropic

import getpass
import os

from langchain_anthropic import ChatAnthropic

os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Anthropic API Key:")

llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0)

In [16]:
from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Address(BaseModel):
    """Fields related to a physical address."""

    line1: str = Field(description="The first line of the address")
    line2: Optional[str] = Field(
        default=None, description="The second line of the address, if present"
    )
    city: str = Field(description="The city the address is located within")
    state: str = Field(
        description="The state (or province) the address is located within"
    )
    postal_code: str = Field(description="The postal code of the address")


class Addresses(BaseModel):
    addresses: List[Address]


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an optional attribute you've been asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

chain = prompt | llm.with_structured_output(Addresses)

chain.invoke({"text": docs[0].page_content})

Addresses(addresses=[Address(line1='One Bowerman Drive', line2=None, city='Beaverton', state='Oregon', postal_code='97005-6453')])

You could then repeat this process for each loaded page to obtain all mentioned addresses in the PDF.
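
One way to sketch that loop is shown below. The helper name `extract_addresses_from_pages` is hypothetical, and it assumes the chain behaves like the `chain` built above: invoking it with `{"text": ...}` returns an object with an `addresses` list. It makes one model call per page, so running it over the full report means over a hundred calls.

```python
from typing import Any, Iterable, List


def extract_addresses_from_pages(chain: Any, pages: Iterable[Any]) -> List[Any]:
    """Invoke the extraction chain once per page and collect unique addresses.

    Assumes `chain.invoke({"text": ...})` returns an object exposing an
    `addresses` list, and that each page exposes `page_content`.
    """
    collected: List[Any] = []
    for page in pages:
        result = chain.invoke({"text": page.page_content})
        for address in result.addresses:
            # Skip addresses that an earlier page already produced.
            if address not in collected:
                collected.append(address)
    return collected


# Usage with the objects defined earlier in this tutorial:
# all_addresses = extract_addresses_from_pages(chain, docs)
```

Deduplication here relies on equality between extracted objects, which Pydantic models support by comparing field values. Addresses that differ only in formatting would still be kept as separate entries.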

:::info
For a deeper dive into extraction, see [this more focused tutorial](/docs/tutorials/extraction/) or [our how-to guides](/docs/how_to/#extraction).
:::

### Question answering with RAG

RAG is a useful technique when you want to answer more general questions about the document. Use a [text splitter](/docs/concepts/#text-splitters) to split your loaded documents into smaller documents that can more easily fit into an LLM's context window, then load them into a [vector store](/docs/concepts/#vector-stores). You can then create a [retriever](/docs/concepts/#retrievers) from the vector store for use in your RAG chain:

In [None]:
%pip install langchain_chroma langchain_openai

In [20]:
# | output: false
# | echo: false

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [24]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
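
To see what the `chunk_size` and `chunk_overlap` arguments above control, consider the simplified sketch below. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraph and line breaks. The hypothetical `split_with_overlap` helper only cuts fixed-size character windows, which is enough to show how neighbouring chunks share text.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows whose neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.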

Finally, use some built-in helpers to construct the final `rag_chain`:

In [28]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})

results

{'input': "What was Nike's revenue in 2023?",
 'context': [Document(page_content='Table of Contents\nFISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\nFISCAL 2023 COMPARED TO FISCAL 2022\n•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\nThe increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\nincrease was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, re

You can see that you get both a final answer in the `answer` key of the results dict, and the `context` the LLM used to generate an answer.

Examining the values under the `context` further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from way back when you first loaded them. You can use this data to show which page in the PDF the answer came from, allowing users to check answers:

In [32]:
print(results["context"][0].page_content)

Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.


In [33]:
print(results["context"][0].metadata)

{'page': 35, 'source': '../example_data/nke-10k-2023.pdf'}
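
A small helper can turn that metadata into citations to display alongside the answer. This is a sketch: `format_sources` is a hypothetical name, and it assumes each retrieved document carries the `source` and `page` keys shown above.

```python
from typing import Any, Dict, List


def format_sources(results: Dict[str, Any]) -> List[str]:
    """Build one readable citation per unique (source, page) pair in the retrieved context."""
    citations: List[str] = []
    for doc in results["context"]:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # The loader's page index is zero-based, so add 1 for display.
        label = f"{source}, page {page + 1}" if page is not None else source
        if label not in citations:
            citations.append(label)
    return citations


# Usage with the `results` dict returned by `rag_chain.invoke(...)`:
# for citation in format_sources(results):
#     print(citation)
```

Because the index counts pages from the start of the file, the displayed number matches the position in a PDF viewer, not necessarily the page number printed in the document.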


:::info
For a deeper dive into RAG, see [this more focused tutorial](/docs/tutorials/rag/) or [our how-to guides](/docs/how_to/#qa-with-rag).
:::

## Next steps

You've now learned how to load documents from a PDF file with a Document Loader and some techniques you can use to analyze those documents with an LLM.

For more on document loaders, you can check out:

- [The entry in the conceptual guide](/docs/concepts/#document-loaders)
- [Related how-to guides](/docs/how_to/#document-loaders)
- [Available integrations](/docs/integrations/document_loaders/)
- [How to create a custom document loader](/docs/how_to/document_loader_custom/)

For more on RAG, see:

- [Build a Retrieval Augmented Generation (RAG) App](/docs/tutorials/rag/)
- [Related how-to guides](/docs/how_to/#qa-with-rag)

For more on extraction, see:

- [Build an Extraction Chain](/docs/tutorials/extraction/)
- [How to return structured data from a model](/docs/how_to/structured_output/)
- [Related how-to guides](/docs/how_to/#extraction)