In [None]:
!pip install -qU pypdf langchain langchain-openai langchain-community

# Build a PDF ingestion and Question/Answering system

In this session, we will create a system that can answer questions about PDF files.

In [None]:
import os

langchain_api_key = 'your_langchain_api_key_here'  # Replace with your actual LangChain API key
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = langchain_api_key

openai_api_key = 'your_openai_api_key_here'  # Replace with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = openai_api_key

## Loading documents

In this session, we will use a PDF of Nike's annual report filed with the SEC (Form 10-K). LLMs generally require text inputs, so we first need to load the PDF into a text-based format that an LLM can handle.

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./nike-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

len(docs)

106

In [7]:
print(docs[1].page_content[0:100])
print(docs[1].metadata)

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K 
(Mark One)
☑ ANNU
{'source': './nike-10k-2023.pdf', 'page': 1}


## Question answering with RAG

Next, we will prepare the loaded documents for retrieval. We split them into smaller chunks that fit more easily into an LLM's context window, then load those chunks into a vector store. Finally, we create a retriever from the vector store for use in our RAG chain.
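To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of a fixed-width splitter. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on natural separators); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2,500-character sample so the chunk boundaries are easy to check.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])              # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])   # True: neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.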

In [8]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo')

In [9]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = InMemoryVectorStore.from_documents(
    documents=splits, embedding=OpenAIEmbeddings(),
)

retriever = vectorstore.as_retriever()
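
Conceptually, a retriever over a vector store embeds the query and returns the chunks whose embeddings are most similar to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, chunk labels, and the helper names `cosine_similarity` and `top_k` are all made up for illustration; real embeddings come from the embedding model and have many more dimensions, and the metric a given store uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (chunk text, embedding) pairs with invented values.
toy_store = [
    ("chunk about revenue", [0.9, 0.1, 0.0]),
    ("chunk about employees", [0.1, 0.9, 0.0]),
    ("chunk about legal proceedings", [0.0, 0.2, 0.9]),
]

query = [1.0, 0.2, 0.0]  # pretend embedding of a revenue question
print(top_k(query, toy_store, k=2))  # ['chunk about revenue', 'chunk about employees']
```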

Finally, we will use some built-in helpers to construct the final `rag_chain`:

In [10]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ('system', system_prompt),
        ('human', '{input}'),
    ]
)


question_answer_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt,
)

rag_chain = create_retrieval_chain(
    retriever,
    question_answer_chain,
)

results = rag_chain.invoke({'input': "What was Nike's revenue in 2023?"})

results

{'input': "What was Nike's revenue in 2023?",
 'context': [Document(id='2eebc3b6-7c37-4824-9355-1c4989d27bc1', metadata={'source': './nike-10k-2023.pdf', 'page': 36}, page_content='FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and \nmajor product line:\nFISCAL 2023 COMPARED TO FISCAL 2022\n• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported \nand currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & \nAfrica ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. \nRevenues, respectively. \n• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and \ncurrency-neutral basis, respectively. This increase was primarily due to higher rev

The `results` dict holds the final answer under the `answer` key, and the documents the LLM used to generate that answer under the `context` key.
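
To see how retrieved chunks can reach the model, here is a rough sketch of the "stuff" approach: the chunk texts are concatenated and substituted into the `{context}` slot of the system prompt. It assumes the chunks are joined with a blank line, and the helpers `stuff_documents` and `build_messages` are hypothetical stand-ins rather than LangChain functions.

```python
# Shortened version of the system prompt used above; {context} is the slot to fill.
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(chunks, separator="\n\n"):
    """Concatenate all retrieved chunk texts into a single context string."""
    return separator.join(chunks)


def build_messages(question, chunks):
    """Return (role, text) pairs with the context filled into the system message."""
    context = stuff_documents(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_messages(
    "What was the revenue?",
    ["first retrieved chunk", "second retrieved chunk"],  # placeholder chunk texts
)
print(messages[0][1])
```

Because every retrieved chunk is placed in a single prompt, the combined length of the chunks has to stay within the model's context window.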

Each item under `context` is a document containing one chunk of the ingested page content. These documents also preserve the metadata attached when we first loaded the PDF:

In [11]:
results['context'][0].page_content

'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and \nmajor product line:\nFISCAL 2023 COMPARED TO FISCAL 2022\n• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported \nand currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & \nAfrica ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. \nRevenues, respectively. \n• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and \ncurrency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, \nWomen\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.'

In [12]:
results['context'][0].metadata

{'source': './nike-10k-2023.pdf', 'page': 36}

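This particular chunk came from the page labeled 36 in the original PDF. The `page` value appears to be a zero-based index, since `docs[1]` above has `'page': 1`. We can use this metadata to verify that answers are based on the source material.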
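
One way to make that check routine is to collect the distinct `(source, page)` pairs from the retrieved documents. The sketch below uses a hypothetical `FakeDocument` stand-in and placeholder data so that it runs without an API key; the chunk texts and the second page number are invented. With the real chain, `list_sources(results['context'])` should work the same way, assuming each document exposes a `metadata` dict as shown above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_sources(context_docs):
    """Return unique (source, page) pairs in the order they first appear."""
    seen = []
    for doc in context_docs:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:
            seen.append(key)
    return seen


# Placeholder results dict shaped like the chain output; values are illustrative only.
fake_results = {
    "input": "What was Nike's revenue in 2023?",
    "context": [
        FakeDocument("chunk text A", {"source": "./nike-10k-2023.pdf", "page": 36}),
        FakeDocument("chunk text B", {"source": "./nike-10k-2023.pdf", "page": 36}),
        FakeDocument("chunk text C", {"source": "./nike-10k-2023.pdf", "page": 35}),
    ],
}

for source, page in list_sources(fake_results["context"]):
    print(f"{source}, page {page}")
```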