## Tutorial: question answering (QA) over your own PDF file (the approach extends to a collection of PDFs)

Acknowledgement: Sophia Yang's article on the functionality of LangChain => https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

In [1]:
import os
from google.colab import drive
import colab_env

Mounted at /content/gdrive


In [5]:
!pip install langchain openai chromadb tiktoken pypdf colab-env



### Before starting, use colab-env to store your OpenAI API key (`OPENAI_API_KEY`) in the `vars.env` file
See this example: https://colab.research.google.com/github/apolitical/colab-env/blob/master/colab_env_testbed.ipynb#scrollTo=4LMqPJ9i5OZo
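
For orientation, a `vars.env` file is a plain list of `KEY=VALUE` lines. The sketch below shows, under that assumption, how such a file can end up in `os.environ`. `load_env_file` is a hypothetical helper written for illustration, not part of colab-env, and the demo uses a placeholder variable rather than a real key.

```python
import os
import tempfile

def load_env_file(path):
    """Read KEY=VALUE lines from a file and copy them into os.environ.

    Hypothetical helper for illustration only; colab-env is expected
    to take care of this step in the notebook.
    """
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded

# Demo with a throwaway file and a placeholder value (not a real key).
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# example vars.env\nDEMO_API_KEY=placeholder\n")
    demo_path = tmp.name

demo_vars = load_env_file(demo_path)
os.remove(demo_path)
print(sorted(demo_vars))
```

Once the real key is loaded this way, `os.environ["OPENAI_API_KEY"]` is available to the cells that build the models.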

## A short LangChain tutorial on QA over PDF files

### Instantiate the LLM


*   `OpenAI` defaults to `text-davinci-003`
*   `ChatOpenAI` defaults to `gpt-3.5-turbo`



In [2]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
os.chdir('/content/drive/My Drive/langchain')

In [5]:
llm = OpenAI(openai_api_key=os.environ["OPENAI_API_KEY"], temperature=1)


In [6]:
chatgpt = ChatOpenAI(model_name='gpt-3.5-turbo', openai_api_key=os.environ["OPENAI_API_KEY"], temperature=1)


In [9]:
# try a simple prompt
text = "does pineapple belong on a thin crust pizza?"

In [12]:
chatgpt([HumanMessage(content=text)])

AIMessage(content='As an AI language model, I do not have personal preferences. However, it is a matter of personal taste whether one prefers pineapple on a thin crust pizza. Some people enjoy the combination of sweet and savory flavors, while others do not like fruit on their pizza. Ultimately, it is up to individual taste buds to decide what toppings they prefer on their pizza.', additional_kwargs={}, example=False)

In [13]:
llm(text)

'\n\nNo, pineapple does not belong on a thin crust pizza. It is much more common to see pineapple on a thicker, deep dish crust pizza.'

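## Load the PDF file
### `loader.load()` returns the pages of the PDF as a list of `Document` objects (one per page). `load_qa_chain` then feeds the whole document, or set of documents, to the LLM; with `chain_type="map_reduce"` each document is queried separately and the partial answers are combined.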

In [14]:
loader = PyPDFLoader("text_files/automated speech-based screening of depression.pdf")
documents = loader.load()

In [15]:
documents[8]

Document(page_content=' Karol Chlasta et al. / Procedia Com puter Science  00 (2019) 000 –000  9 \nTable 2. Summar y of other classification results for different CNN architectures . \nModel  Type of Input  Hyperparameters  (LR; EP) Accuracy  F1 Score  Precision  Recall  \nResNet 18  Image (224x224px)  0.01; 3 78% 0.0000  - 0.0000  \nResNet 34 \nResNet 50  \nResNet 50  Image (224x224px)  \nImage ( 224x224px)  \nImage (512x512px)  0.001 ; 3 \n0.01; 3 \n0.01; 3  81% \n67% \n78% 0.6154  \n0.3077  \n0.5714  0.5714  \n0.3333  \n0.5714  0.6667  \n0.2857  \n0.5714  \nResNet 101  \nResNet 101  \nResNet 152  \nResNet 152  Image (512x512px)  \nImage  (1024x1024 px) \nImage (512x512px)  \nImage (1024x1024px)  0.001 ; 4 \n0.0044; 4  \n0.0001 ; 3 \n0.003 ; 3 81% \n78% \n63% \n74% 0.2857  \n0.4000  \n0.3750  \n0.5333  0.2500  \n0.2500  \n0.3750  \n0.5000  0.3333  \n1.0000  \n0.3750  \n0.5714  \n \nThe ResNet -34 system using a smaller Set A and TTA classified 21 voice sam ples correctly, with only s

In [16]:
chain = load_qa_chain(llm=ChatOpenAI(model_name='gpt-3.5-turbo'), chain_type="map_reduce")
query = "what is the finding of this experiment?"
chain.run(input_documents=documents, question=query)

'The experiment proposed a method that uses deep convolutional neural networks for depression detection in speech. The proposed method produced a promising classification accuracy of around 70% for a ResNet-34 model, and 71% for a ResNet-50 model, both trained on spectrograms of 224x224 px. This result can be improved to 77% with Test Time Augmentation (TTA). The full summary of the results is presented in Table 2. Therefore, the finding of the experiment is that the proposed method achieved a promising accuracy of depression detection in speech using deep convolutional neural networks.'
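
The `map_reduce` chain type used above follows a simple pattern: ask the question against each page on its own, then ask once more over the collected partial answers. The sketch below illustrates that pattern in plain Python. The prompt wording and the `map_reduce_answer` and `fake_llm` names are assumptions made for this example, not LangChain's actual templates or API.

```python
def map_reduce_answer(pages, question, ask):
    """Answer `question` over `pages` with the map-reduce pattern.

    `ask` is any callable that takes a prompt string and returns text;
    in the notebook that role is played by the chat model.
    """
    # Map step: query each page independently.
    partial_answers = [
        ask(f"Context:\n{page}\n\nQuestion: {question}") for page in pages
    ]
    # Reduce step: combine the per-page answers in one final call.
    summaries = "\n".join(partial_answers)
    return ask(f"Summaries:\n{summaries}\n\nQuestion: {question}")

calls = []

def fake_llm(prompt):
    """Stand-in for a model call: records the prompt and returns a tag."""
    calls.append(prompt)
    return f"answer-{len(calls)}"

final = map_reduce_answer(
    ["page one", "page two", "page three"], "what is the finding?", fake_llm
)
print(len(calls), final)
```

Three pages cost four model calls here (three map calls plus one reduce call), which is the trade-off of this chain type: no single prompt has to hold the whole PDF, but the number of API calls grows with the page count.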

## RetrievalQA
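#### Retrieve the most relevant chunks of text and feed only those to the language model.

#### Note that `chain_type="stuff"` places ALL of the text it receives into a single prompt, which can exceed the token limit and trigger API errors. Retrieving only the top `k` chunks first keeps that prompt small.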
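
To see why stuffing can overflow, here is a rough sketch that builds a single "stuff" prompt and estimates its size. The figures are assumptions for illustration: about 4 characters per token, a 4096-token context window, and 256 tokens reserved for the answer. For exact counts you would use a real tokenizer such as `tiktoken` instead of this estimate.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every chunk into a single prompt (the "stuff" pattern)."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def rough_token_count(text, chars_per_token=4):
    """Crude estimate; assumes about 4 characters per token."""
    return len(text) // chars_per_token + 1

def fits_context(prompt, limit=4096, reserved_for_answer=256):
    """Check the estimated prompt size against an assumed context window."""
    return rough_token_count(prompt) + reserved_for_answer <= limit

question = "what is the finding of this experiment?"
two_chunks = ["x" * 1000] * 2   # e.g. k=2 retrieved chunks of ~1000 chars
whole_pdf = ["x" * 3000] * 10   # e.g. ten full pages stuffed at once

small_ok = fits_context(build_stuff_prompt(two_chunks, question))
large_ok = fits_context(build_stuff_prompt(whole_pdf, question))
print(small_ok, large_ok)
```

Under these assumptions the two retrieved chunks fit comfortably, while ten full pages do not.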

In [17]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()
# create the vector store to use as the index
db = Chroma.from_documents(texts, embeddings)
# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})
# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo'), chain_type="stuff", retriever=retriever, return_source_documents=True)
query = "what is the finding of this experiment?"
result = qa({"query": query})
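
To make the retrieval step concrete, here is a minimal sketch of top-k similarity search in plain Python. It is an illustration only: `embed` is a toy word-count embedding standing in for `OpenAIEmbeddings`, the list scan stands in for the Chroma index, and the sentences are made-up sample text rather than content from the PDF.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(chunks, query, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "cats purr when content",
    "dogs bark at strangers",
    "pizza dough needs yeast",
    "cats sleep most of the day",
]
hits = top_k(chunks, "why do cats purr", k=2)
print(hits)
```

Only the chunks returned by `top_k` would be passed on to the model, which is the role `search_kwargs={"k":2}` plays in the retriever above.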


In [18]:
result

{'query': 'what is the finding of this experiment?',
 'result': 'The experiment presented in this paper proposes a novel method for automated speech-based screening of depression using deep convolutional neural networks. The proposed method produced a promising classification accuracy of around 70% for a ResNet-34 model, and 71% for a ResNet-50 model, both trained on spectrograms of 224x224 px. This result can be improved to 77% with Test Time Augmentation (TTA). The overall finding suggests that deep convolutional neural networks have potential for automated speech-based screening of depression.',
 'source_documents': [Document(page_content='Karol Chlasta et al. / Procedia Com puter Science  00 (2019) 000 –000  3 \n3.980, RMSE 4.653). Multimodal approaches based on different neural network  architectures  appear to be more \neffective than those based on a single modality, and they are an interesting a rea of furth er research.  \nAnother set of interesting results was presented durin

In [19]:
# try a different doc

In [20]:
loader_HOA_doc = PyPDFLoader("text_files/HOA_rules_exterior.pdf")
hoa_documents = loader_HOA_doc.load()
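
The next cell lowers `chunk_size` from 1000 to 200. The sketch below gives a feel for what that changes. It is a simplified illustration, assuming separator-based splitting on blank lines with no overlap; `split_on_separator` is a hypothetical helper and not LangChain's implementation.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedy splitter: cut on `separator`, then pack pieces up to chunk_size.

    Simplified illustration with no overlap; a piece longer than
    chunk_size is kept whole rather than cut mid-paragraph.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Four synthetic paragraphs of 120, 60, 90 and 300 characters.
sample = "\n\n".join(["a" * 120, "b" * 60, "c" * 90, "d" * 300])
small = split_on_separator(sample, chunk_size=200)
large = split_on_separator(sample, chunk_size=1000)
print([len(c) for c in small], [len(c) for c in large])
```

Smaller chunks give the retriever finer-grained passages to match against a specific question, at the cost of less surrounding context per chunk.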

In [22]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
texts = text_splitter.split_documents(hoa_documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()
# create the vector store to use as the index
db = Chroma.from_documents(texts, embeddings)
# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})
# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo'), chain_type="stuff", retriever=retriever, return_source_documents=True)
query_1 = "Can people have Wall Mounted Basketball Hoop on the porch or balcony?"
query_2 = "Can people have satellite dishes on the front porch"

result_1 = qa({"query": query_1})
result_2 = qa({"query": query_2})
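
Because `return_source_documents=True` is set, each result is a dict with `query`, `result` and `source_documents` keys. The raw output is long, so a small helper can make it easier to read. The sketch below is one possible way to do that; `summarize_result` is a hypothetical helper, and the demo dict is placeholder data with the same keys, not output from the real chain. It assumes each source document exposes `page_content` and a `metadata` dict that may contain a `page` entry.

```python
from types import SimpleNamespace

def summarize_result(result, preview_chars=80):
    """Return the answer plus a short preview of each source chunk."""
    lines = [f"Q: {result['query']}", f"A: {result['result']}"]
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so the preview stays on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] page {page}: {preview}")
    return "\n".join(lines)

# Stand-in with the same keys as the notebook's result dict; the
# content is placeholder text, not output from the real chain.
fake_result = {
    "query": "example question",
    "result": "example answer",
    "source_documents": [
        SimpleNamespace(page_content="first   chunk\nof text", metadata={"page": 3}),
        SimpleNamespace(page_content="second chunk", metadata={}),
    ],
}
report = summarize_result(fake_result)
print(report)
```

In the notebook you could call `print(summarize_result(result_1))` to check which passages an answer was based on.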

In [23]:
result_1

{'query': 'Can people have Wall Mounted Basketball Hoop on the porch or balcony?',
 'result': "The Village West at Centennial's Rules and Regulations do not specifically mention whether or not wall-mounted basketball hoops are allowed on porches or balconies. However, it does state that all improvements made to a lot, including outdoor structures, even temporary, require design review and approval from the Design Review Committee. Therefore, if a resident wishes to install a wall-mounted basketball hoop on their porch or balcony, they will need to submit complete plans and specifications to the DRC for approval.",
 'source_documents': [Document(page_content='VILLAGE WEST at CENTENNIAL\nRULES and REGULATIONS\n11/15/2013 Page 4 of 6In order to ensure continued property value and community appeal, please consider the\nhigh standard of the Community when selecting décor for your Front Porches, and\nAttached Decks. For example, upholstered furniture or camping equipment is not\nappropriate 

In [24]:
result_2

{'query': 'Can people have satellite dishes on the front porch',
 'result': "The rules and regulations of Village West at Centennial state that satellite dishes must be installed in the preferred manner and a DRC review form showing the location must be on file for each satellite dish. While it does not specifically say whether or not satellite dishes are allowed on the front porch, it is recommended that owners consider the high standard of the community when selecting décor for their front porches and attached decks. Therefore, it is best to consult the guidelines for satellite installation and the DRC form available on the Association's website or from the property management company to find out more information about the preferred locations for satellite dishes.",
 'source_documents': [Document(page_content='VILLAGE WEST at CENTENNIAL\nRULES and REGULATIONS\n11/15/2013 Page 5 of 6SATELLITE DISHES\n\uf0b7\uf020To ensure all satellite equipment is installed in the preferred manner, a