# RAG Q&A over a PDF

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/question_answering/quickstart.ipynb)

LangChain has a number of components designed to help build question-answering applications, and RAG applications more generally. This is a simple Q&A application over the [2022 Wells Fargo Annual Report](https://www08.wellsfargomedia.com/assets/pdf/about/investor-relations/annual-reports/2022-annual-report.pdf), using a PDF as the text data source.


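This notebook uses:
1. The Unstructured library (via `OnlinePDFLoader`) to parse the report into raw text
2. `RecursiveCharacterTextSplitter` to split the text into smaller chunks
3. OpenAI embeddings (`OpenAIEmbeddings`) to embed the text chunks as vectors
4. A Chroma vector store, used as a retriever to fetch the chunks most relevant to the user's question
5. An LLM (`gpt-3.5-turbo`) to generate an answer from a prompt that includes the question and the retrieved chunks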
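To make the embed, retrieve, and prompt steps concrete, here is a minimal standard-library sketch. It is only an illustration: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the sample chunks are invented, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical rather than LangChain APIs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Hypothetical chunks, invented for illustration (not taken from the report).
chunks = [
    "The bank reported net income for the year.",
    "Risk management remains a top priority for the company.",
    "The board of directors met several times during the year.",
]
question = "What is the top priority for risk management?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline, the embedding model and vector store replace `embed` and `retrieve`, and the prompt is sent to the LLM instead of being printed.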

## Setup

### Dependencies


```
!pip list

langchain            0.0.353
langchain-community  0.0.7
langchain-core       0.1.4
langsmith            0.0.76
```

Note: if you have `pdfminer` installed, uninstall it first:

```
!pip uninstall -y pdfminer
```
`pdfminer` and `pdfminer.six` cannot be installed side by side, because both provide a package named `pdfminer`.
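One way to check for this conflict before installing anything is to inspect the installed distributions. The sketch below uses only the standard library; the helper names (`normalize`, `has_pdfminer_conflict`, `installed_distribution_names`) are hypothetical and not part of any package.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name the way package indexes do (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def has_pdfminer_conflict(installed_names):
    """Return True if both pdfminer and pdfminer.six appear in the given names."""
    names = {normalize(name) for name in installed_names}
    return "pdfminer" in names and "pdfminer-six" in names


def installed_distribution_names():
    """List the names of distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name:
            names.append(name)
    return names


if has_pdfminer_conflict(installed_distribution_names()):
    print("Both pdfminer and pdfminer.six are installed; uninstall pdfminer.")
```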

We'll install the packages for the specific integrations separately:

In [None]:
!pip install -U langchain langchain-community langchainhub openai chromadb bs4
!pip install langchain_openai
!pip install pypdf
!pip install unstructured




We need to set the `OPENAI_API_KEY` environment variable, either directly or by loading it from a `.env` file:

In [None]:

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# import dotenv

# dotenv.load_dotenv()
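
Because indexing a long PDF takes a while, it can help to confirm that the key is actually set before going further. A small sketch, assuming a hypothetical helper named `missing_env_vars` (not part of LangChain or the OpenAI client):

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```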



### LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with [LangSmith](https://smith.langchain.com).

LangSmith is optional, but helpful. If you do want to use it, sign up at the link above, then set these environment variables to start logging traces:

In [None]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()



## Indexing Pipeline
The indexing pipeline precedes retrieval and generation and usually runs offline. Because the source data is a public online PDF file, we use LangChain's `OnlinePDFLoader` to load it into LangChain's document format.

In [None]:
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [None]:
!pip install pdf2image



In [None]:
!pip install pdfminer.six



In [None]:
!pip install pdfplumber
!pip install unstructured_inference
!pip install pikepdf



In [None]:
# Load the annual report PDF, then chunk and index its contents.
loader = OnlinePDFLoader("https://www08.wellsfargomedia.com/assets/pdf/about/investor-relations/annual-reports/2022-annual-report.pdf")
data = loader.load()

# Split the text into chunks of up to 1000 characters, with 200 characters of overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(data)
# Embed each chunk with OpenAI and store the vectors in Chroma.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve the most relevant chunks of the report and generate an answer.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
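
The `chunk_size` and `chunk_overlap` arguments used above control how long each chunk is and how much text adjacent chunks share. The sketch below illustrates the idea with a plain sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks and spaces, so its chunks will not match this output exactly. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one chunk as long as it is shorter than the overlap, which can help retrieval at the cost of storing some text twice.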

In [None]:
rag_chain.invoke("Summarize the article with precision")

'The article discusses the view that organizations have limits to their manageability based on size and suggests that simplification is the most effective solution. However, the speaker disagrees and believes that Wells Fargo is not too big or complex to manage, but rather its shortcomings are due to ineffective management and lack of prioritization. The speaker also mentions that while they are prepared for a potential economic downturn, they may miss some interim milestones in their transformation process.'

In [None]:
# cleanup
vectorstore.delete_collection()