# Question Answering over Documents with LangChain

In [1]:
from langchain.document_loaders import TextLoader, DirectoryLoader, PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
import os

In [5]:
# ONLY USE IF KEY IS SAVED IN FILE

# Change this path to your key location
path_to_key = "../openai-key.txt"

with open(path_to_key) as fo:
    key = fo.readline()
    
os.environ["OPENAI_API_KEY"] = key.strip()
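
The cell above assumes the key file exists. As a minimal sketch, the same step can prefer a key that is already set in the environment and fall back to the file only when needed. The helper name `load_api_key` and its `env_var` parameter are hypothetical and not part of LangChain or the OpenAI client.

```python
import os
from pathlib import Path


def load_api_key(path_to_key, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, else from the first line of a file.

    Hypothetical helper for illustration only.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    key_file = Path(path_to_key)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"{env_var} is not set and no key file exists at {key_file}"
        )

    # Only the first line is used, mirroring fo.readline() in the cell above.
    lines = key_file.read_text().splitlines()
    key = lines[0].strip() if lines else ""
    if not key:
        raise ValueError(f"Key file {key_file} is empty")
    return key


# Usage (commented out so the sketch runs without a key file):
# os.environ["OPENAI_API_KEY"] = load_api_key("../openai-key.txt")
```

Raising an explicit error here makes a missing or empty key fail at setup time rather than at the first API call.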

## Q/A with a single document

In [26]:
path_to_file = "./state_of_the_union.txt"
loader = TextLoader(path_to_file)

In [27]:
index = VectorstoreIndexCreator().from_loaders([loader])

Using embedded DuckDB without persistence: data will be transient


In [11]:
query = "What is the author's main thesis in this document? Support your summary with direct quotes from the document."
index.query(query)

' The author\'s main thesis in this document is that the government should take action to protect the rights of citizens and to provide economic relief. They should do this by providing a pathway to citizenship for immigrants, protecting access to health care, supporting veterans, and demanding more competition from corporations. Direct quotes from the document include: "Provide a pathway to citizenship for Dreamers, those on temporary status, farm workers, and essential workers" and "It’s time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children."'

In [12]:
index.query_with_sources(query)

{'question': "What is the author's main thesis in this document? Support your summary with direct quotes from the document.",
 'answer': ' The author\'s main thesis in this document is that the government should take action to protect the rights of citizens, including providing a pathway to citizenship for Dreamers, strengthening privacy protections, providing mental health services, supporting veterans, and increasing competition in the market. \n\n"Provide a pathway to citizenship for Dreamers, those on temporary status, farm workers, and essential workers...Let’s get it done once and for all...The constitutional right affirmed in Roe v. Wade—standing precedent for half a century—is under attack as never before...If we want to go forward—not backward—we must protect access to health care. Preserve a woman’s right to choose. And let’s continue to advance maternal health care in America...It’s time to strengthen privacy protections, ban targeted advertising to children, demand tech com

This seems to be working well, but we want to check that the model does not use any information outside the document.

In [14]:
# Test that it only queries the document
query = "According to the paper, what is a transformer model and what can they be used for? Use only the given document for your answers."
index.query(query)

' This document does not mention a transformer model.'

In [15]:
query = "What is a transformer model and what can they be used for?"
index.query(query)

" I don't know."

In [29]:
query = "What is linear regression?"
index.query(query)

" I don't know."

Great! For this document, the model does not attempt to guess or hallucinate an answer when it cannot find one in the text.

### Q/A over PDFs

Being able to read in PDFs would also be useful. LangChain accomplishes this through PyPDFLoader, which relies on the pypdf library.

In [16]:
path_to_pdf = "./research_papers/attention_is_all_you_need.pdf"
loader = PyPDFLoader(path_to_pdf)
index = VectorstoreIndexCreator().from_loaders([loader])


Using embedded DuckDB without persistence: data will be transient


In [17]:
query = "According to the paper, what is a transformer model and what can they be used for? Use only the given document for your answers."
index.query(query)

' A transformer model is a sequence transduction model based entirely on attention mechanisms, which can be used for machine translation tasks and English constituency parsing.'

This answer looks correct, but we need to check that the model is drawing only on the document.

In [21]:
# Test that it only queries the document
query = "What is linear regression?"
index.query(query)

' Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.'

In [22]:
index.query_with_sources(query)

{'question': 'What is linear regression?',
 'answer': ' Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.\n',
 'sources': 'https://en.wikipedia.org/wiki/Linear_regression'}

When we ask the model for its source on the previous answer, it lists Wikipedia! This is not what we want. Let's ask it to only use the given document for its answer.

In [24]:
# Test that it only queries the document
query = "What is linear regression? Use only the given document to construct your answer."
index.query(query)

' Linear regression is not mentioned in the given document.'

In [25]:
index.query_with_sources(query)

{'question': 'What is linear regression? Use only the given document to construct your answer.',
 'answer': ' Linear regression is not mentioned in the given document.\n',
 'sources': './research_papers/attention_is_all_you_need.pdf'}

Now the model admits that the information is not given in the document.
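
Since the restriction has to be repeated in every query, one option is to append it automatically. The sketch below is a hypothetical helper (the names `restrict_to_document` and `DOCUMENT_ONLY_SUFFIX` are not from LangChain) that reuses the exact wording from the cells above. It only shapes the prompt. It does not guarantee that the model ignores outside knowledge, so spot checks like the ones above are still worthwhile.

```python
DOCUMENT_ONLY_SUFFIX = "Use only the given document to construct your answer."


def restrict_to_document(question, suffix=DOCUMENT_ONLY_SUFFIX):
    """Append the document-only instruction unless the question already ends with it."""
    question = question.strip()
    if question.endswith(suffix):
        return question
    return f"{question} {suffix}"


query = restrict_to_document("What is linear regression?")
print(query)

# Usage with the index built above (requires an API key, so left commented out):
# index.query(restrict_to_document("What is linear regression?"))
```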

## Q/A with multiple documents using VectorStore

We can also use LangChain to query over several documents in a directory. First, load the directory using the DirectoryLoader class (here with PyPDFLoader as the per-file loader), then split the pages into chunks, embed the chunks with OpenAI embeddings, and store them in a Chroma vector store to search over.

In [3]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.indexes.vectorstore import VectorstoreIndexCreator
from langchain.chains import RetrievalQA

In [4]:
directory = "./research_papers"
loader = DirectoryLoader(directory, glob="*.pdf", loader_cls=PyPDFLoader)  # set loader_cls=TextLoader (and adjust glob) to load text files instead
documents = loader.load()
print(len(documents))  # total number of pages across all PDFs (not the number of PDF files)

27
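
To see how those pages are distributed across files, the loaded documents can be grouped by their source path. This is a minimal sketch that assumes each document carries a `metadata` dict with a `"source"` entry, which is what PyPDFLoader is expected to set. The `sample_docs` list is a made-up stand-in so that the block runs without any PDFs.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents):
    """Count how many loaded pages come from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Stand-in objects with the same .metadata shape as loaded documents (illustrative only).
sample_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "b.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
print(f"{len(counts)} files, {sum(counts.values())} pages")
for source, n_pages in counts.items():
    print(source, n_pages)

# With the notebook's data: pages_per_source(documents)
```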


In [5]:
text_splitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
texts = text_splitter.split_documents(documents)

In [6]:
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

Using embedded DuckDB without persistence: data will be transient


In [40]:
qa = RetrievalQA.from_chain_type(llm = OpenAI(), chain_type = "stuff", retriever = docsearch.as_retriever())

In [41]:
query = "According to the given documents, what is a GPT model? Use only the given documents to construct your answer."
qa.run(query)

' GPT (Generative Pre-Training) is a language model developed by OpenAI that uses a diverse corpus of unlabeled text to pre-train a language model, followed by discriminative fine-tuning on specific tasks.'

### Adding Sources

Next, we want to modify the chain so that the model lists its sources.

*Current problem*: this does not work yet. The request exceeds the model's maximum context length, so it may need shorter documents or fewer retrieved chunks.

In [None]:
# Adding sources - this does not work because the prompt (question plus retrieved documents) is too long

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
chain = load_qa_with_sources_chain(llm = OpenAI(), chain_type="stuff")
query = "What is a GPT model?"
docs = docsearch.similarity_search(query)
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

In [7]:
from langchain.chains import RetrievalQAWithSourcesChain
chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever())

query = "What is a GPT model?"
chain({"question": query}, return_only_outputs = True)

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 5318 tokens (5062 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
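
The numbers in the error message can be checked directly. The sketch below only does the arithmetic on the values reported above. The function name `prompt_token_budget` is hypothetical.

```python
def prompt_token_budget(max_context_tokens, completion_tokens):
    """Tokens left for the prompt once the completion has been reserved."""
    return max_context_tokens - completion_tokens


# Values copied from the InvalidRequestError above.
max_context_tokens = 4097
completion_tokens = 256
prompt_tokens = 5062

budget = prompt_token_budget(max_context_tokens, completion_tokens)
overflow = prompt_tokens - budget

print(f"prompt budget: {budget} tokens")      # 3841
print(f"prompt is over by: {overflow} tokens")  # 1221
```

So the stuffed prompt would need to shrink by roughly a quarter. Two options that may help, neither tested in this notebook: retrieving fewer chunks with `docsearch.as_retriever(search_kwargs={"k": 2})`, or using a chain type other than "stuff", such as "map_reduce", so that the retrieved chunks are not all placed in a single prompt.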

## Additional Applications

In addition to asking questions about a document, we can use LangChain to generate questions from a document. This use case is well suited to generating homework or quizzes quickly. Below we use the Q/A chain to generate questions, but a future notebook will contain examples using the QAGenerationChain.

In [6]:
path_to_pdf = "./research_papers/attention_is_all_you_need.pdf"
loader = PyPDFLoader(path_to_pdf)
index = VectorstoreIndexCreator().from_loaders([loader])

Using embedded DuckDB without persistence: data will be transient


In [7]:
query = "Generate a multiple choice quiz from this document over the contents of the document. Each question should have four plausible answers, with only one answer being correct."
index.query(query)

'\nQ: What type of data was used to train the 4-layer transformer with dmodel = 1024?\nA. WMT 2014 English-German dataset\nB. Wall Street Journal (WSJ) portion of the Penn Treebank\nC. high-conﬁdence and BerkleyParser corpora\nD. Section 22 development set\n\nCorrect Answer: B. Wall Street Journal (WSJ) portion of the Penn Treebank'

Great! But we would like to see the justification for the correct answer. Let's engineer our prompt to reflect this.

In [8]:
query = "Generate a multiple choice quiz from this document over the contents of the document. Each question should have four plausible answers, with only one answer being correct. For each question, provide a justification as to why an answer is correct or incorrect"
index.query(query)

'\n\nQuestion: What type of attention function is used in this document?\nA. Additive Attention\nB. Dot-Product Attention\nC. Scaled Dot-Product Attention\nD. Multi-Head Attention\n\nCorrect Answer: C. Scaled Dot-Product Attention\nJustification: The document states that "We call our particular attention \'Scaled Dot-Product Attention\' (Figure 2)."'
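
To reuse the generated quiz, for example in a homework sheet, the returned string can be split into its parts. This is a rough sketch with a hypothetical helper, `parse_quiz_item`, written against the two output formats seen above ("Q:" or "Question:", options A to D, "Correct Answer:", optional "Justification:"). The model's output format is not guaranteed, so missing fields come back as None.

```python
import re


def _first_group(pattern, text, flags=0):
    """Return the first capture group of the first match, stripped, or None."""
    match = re.search(pattern, text, flags)
    return match.group(1).strip() if match else None


def parse_quiz_item(text):
    """Split one generated quiz item into question, options, answer, and justification."""
    options = dict(re.findall(r"^([A-D])\.[ \t]*(.+)$", text, flags=re.MULTILINE))
    return {
        "question": _first_group(r"^(?:Q|Question):[ \t]*(.+)$", text, re.MULTILINE),
        "options": options,
        "answer": _first_group(r"Correct Answer:\s*([A-D])", text),
        "justification": _first_group(r"Justification:\s*(.+)", text, re.DOTALL),
    }


# The output of the last cell above, reproduced as a Python string.
sample = (
    "\n\nQuestion: What type of attention function is used in this document?\n"
    "A. Additive Attention\nB. Dot-Product Attention\n"
    "C. Scaled Dot-Product Attention\nD. Multi-Head Attention\n\n"
    "Correct Answer: C. Scaled Dot-Product Attention\n"
    "Justification: The document states that \"We call our particular attention "
    "'Scaled Dot-Product Attention' (Figure 2).\""
)

item = parse_quiz_item(sample)
print(item["question"])
print(item["answer"], "->", item["options"][item["answer"]])
```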