# Q&A over a PDF file

## Intro
* We will create a Q&A app that can answer questions about PDF files.
* We will use a document loader to load the text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [2]:
MODEL_GPT = 'gpt-4o-mini'

## Connect with LLM

In [3]:
from langchain_openai import ChatOpenAI

# llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
llm = ChatOpenAI(model=MODEL_GPT)

## Load PDF file
* The loader reads the PDF at the specified path into memory.
* It then extracts the text using the `pypdf` package.
* Finally, it creates one LangChain `Document` per page of the PDF, holding the page's content plus metadata about where in the document the text came from (the source path and a zero-based page index).

In [4]:
#!pip install langchain-community

In [5]:
#!pip install pypdf

In [6]:
from langchain_community.document_loaders import PyPDFLoader

# file_path = "./data/Be_Good.pdf"
file_path = "../../data/Be_Good.pdf"

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

11


In [7]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

Be Good - Essay by Paul Graham
Be Good
Be good
April 2008(This essay is derived from a talk at the 2
{'source': '../../data/Be_Good.pdf', 'page': 0}


## RAG
* We will use Chroma as our vector database (vector store).
* Using a text splitter, we will split the loaded PDF into smaller documents that fit more easily into the LLM's context window, then load them into the vector store.
* We can then create a retriever from the vector store for use in our RAG chain.
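
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width splitting with overlap in plain Python. The function `split_with_overlap` is a hypothetical helper, not part of LangChain, and it is deliberately simpler than `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries rather than at arbitrary character offsets.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive splitter (illustrative only): each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy text
chunks = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
print([len(c) for c in chunks])         # chunk lengths
print(chunks[0][-5:] == chunks[1][:5])  # neighbouring chunks share 5 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.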

In [8]:
#!pip install langchain_chroma

In [9]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
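
Conceptually, the retriever embeds the question, compares it with the embedded chunks, and returns the closest matches. The sketch below imitates that idea with word counts and cosine similarity in plain Python. Everything in it (`embed`, `cosine`, `retrieve`, `toy_chunks`) is a hypothetical stand-in: the real pipeline uses dense vectors from `OpenAIEmbeddings` and Chroma's own index, not word counts.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real model returns dense vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "make something people want",
    "do not worry too much about the business model",
    "a startup with users is like a tamagotchi",
]
top = retrieve("what do people want", toy_chunks, k=1)
print(top)
```

In the notebook itself, the number of chunks returned can be adjusted when creating the retriever, for example `vectorstore.as_retriever(search_kwargs={"k": 2})`.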

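* We will then use some built-in helpers to construct the final `rag_chain`: `create_stuff_documents_chain` inserts the retrieved documents into the prompt's `{context}` placeholder, and `create_retrieval_chain` connects the retriever to that question-answering chain.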

In [10]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What is this article about?"})

results["answer"]

'The article "Be Good" by Paul Graham discusses the importance of creating products that people want and not overly focusing on business models in the early stages of startups. It emphasizes the value of user feedback and the "tamagotchi effect," where having users to care for can drive a startup\'s development and success. Additionally, it draws parallels between startups and charitable endeavors, highlighting the significance of good intentions and accountability.'
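
The "stuff" strategy used above amounts to concatenating the retrieved chunks and substituting them into the `{context}` placeholder of the system prompt. A minimal sketch of that step in plain Python follows. The names `stuff_context` and `build_messages` are hypothetical, the template is shortened, and the blank-line separator between chunks is an assumption about the helper's default rather than something shown in this notebook.

```python
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_context(chunks: list[str], template: str = SYSTEM_TEMPLATE) -> str:
    """Join all retrieved chunks and place them into the {context} slot."""
    context = "\n\n".join(chunks)  # assumed separator: one blank line
    return template.format(context=context)


def build_messages(question: str, chunks: list[str]) -> list[tuple[str, str]]:
    """Build (role, text) pairs in the same shape passed to ChatPromptTemplate.from_messages."""
    return [("system", stuff_context(chunks)), ("human", question)]


messages = build_messages(
    "What is this article about?",
    ["Make something people want.", "Don't worry too much about making money."],
)
for role, text in messages:
    print(f"[{role}] {text}")
```

Because every retrieved chunk goes into a single prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the PDF is split into smaller pieces first.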

* If you print the whole `results` dict, you will see both the final answer under the `answer` key and the context the LLM used to generate it under the `context` key:

In [11]:
results

{'input': 'What is this article about?',
 'context': [Document(metadata={'page': 0, 'source': '../../data/Be_Good.pdf'}, page_content="Be Good - Essay by Paul Graham\nBe Good\nBe good\nApril 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we\nstarted Y Combinator we came up with the\nphrase that became our motto: Make something people want.  We've\nlearned a lot since then, but if I were choosing now that's still\nthe one I'd pick.Another thing we tell founders is not to worry too much about the\nbusiness model, at least at first.  Not because making money is\nunimportant, but because it's so much easier than building something\ngreat.A couple weeks ago I realized that if you put those two ideas\ntogether, you get something surprising.  Make something people want.\nDon't worry too much about making money.  What you've got is a\ndescription of a charity.When you get an unexpected result like this, it could either be a\nbug or a new discovery.  Eith

In [12]:
print(results["context"][0].metadata)

{'page': 0, 'source': '../../data/Be_Good.pdf'}
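
Since each retrieved document carries `source` and `page` metadata, the context list can be turned into human-readable citations. The sketch below shows one way to do that. `format_citations` is a hypothetical helper, and `FakeDocument` is a stand-in for LangChain's `Document` so the example runs without the notebook's data. It assumes the page index is zero-based, as the output above suggests for the first page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citations(context) -> list[str]:
    """Return one 'source, page N' label per distinct (source, page) pair."""
    seen = set()
    citations = []
    for doc in context:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        if (source, page) in seen:
            continue  # several chunks can come from the same page
        seen.add((source, page))
        # Convert the zero-based page index to a 1-based page number for readers.
        citations.append(f"{source}, page {page + 1}" if page is not None else source)
    return citations


fake_context = [
    FakeDocument("Be Good ...", {"source": "../../data/Be_Good.pdf", "page": 0}),
    FakeDocument("Make something ...", {"source": "../../data/Be_Good.pdf", "page": 0}),
    FakeDocument("Later text ...", {"source": "../../data/Be_Good.pdf", "page": 3}),
]
for line in format_citations(fake_context):
    print(line)
```

With the real chain, the same helper could be called as `format_citations(results["context"])`, because the retrieved documents expose the same `metadata` attribute.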
