## Experimenting with `LangChain` & `OpenAI` for Document QA

Leveraging the LangChain framework to build document QA tools that use ChatGPT to extract information and present it in human-readable form.

**Disclaimer**: all information extracted from the documents is made up.

In [19]:
import credentials
import time
import re
import os
os.environ["OPENAI_API_KEY"] = credentials.openai_api

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

### 1. QA (without `retriever` object)

With a `retriever` object: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html

https://python.langchain.com/en/latest/use_cases/question_answering.html
https://python.langchain.com/en/latest/modules/chains/index_examples/question_answering.html

In [2]:
loader = PyPDFLoader("../docs/contract.pdf")
docs = loader.load_and_split()

In [7]:
chain = load_qa_chain(ChatOpenAI(temperature=0.0), chain_type="stuff")

In [17]:
queries = ['Who are going to be working on the project?',
           'What is the hourly rate for the most senior colleagues?',
           'Overall what will be the weekly cost, if one week consists of 40 hours?',
           'Who are the agreeing parties?',
           'Where will the colleagues work?',
           'Which days can the project team take off?',
           'Is there anything that this contract forbids someone from doing?',
           'What are the most important aspects to working conditions?',
           'Which law governs the contract?',
           'When was Michael Jordan born?']

In [18]:
for query in queries:
    print('Question:', query)
    print('Answer:')
    print(chain.run(input_documents = docs, question = query))
    print('\n', '-------' * 10)
    time.sleep(20)  # pause between calls (presumably to stay within API rate limits)

Question: Who are going to be working on the project?
Answer:
The following personnel from the CONTRACTOR will be working on the project:

- Márton Biró, Senior, Half-time, $43
- Kristóf Rábay, Senior, Half-time, $43
- Bence Molnár, Medior, Half-time, $25
- Áron Fellegi, Medior, Full-time, $25

 ----------------------------------------------------------------------
Question: What is the hourly rate for the most senior colleagues?
Answer:
The hourly rate for the most senior colleagues is $43.

 ----------------------------------------------------------------------
Question: Overall what will be the weekly cost, if one week consists of 40 hours?
Answer:
The weekly cost will depend on the number of hours worked by each team member. According to the provided information, the hourly rates for the team members are as follows:

- Márton Biró (Senior, Half-time): $43
- Kristóf Rábay (Senior, Half-time): $43
- Bence Molnár (Medior, Half-time): $25
- Áron Fellegi (Medior, Full-time): $25

Assumi
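
The printed output is cut off partway through the weekly-cost answer, so the model's own figure is not visible here. As a rough cross-check, the arithmetic can be done by hand. The sketch below takes the rates and workloads from the first answer and assumes that full-time means the 40-hour week from the question and half-time means 20 hours; that split is an assumption, not something stated in the output above.

```python
# Hours per week for each workload type.
# ASSUMPTION: full-time = 40 h (from the question), half-time = half of that.
HOURS_PER_WEEK = {"Full-time": 40, "Half-time": 20}

# (name, hourly rate in USD, workload) as listed in the first answer.
team = [
    ("Márton Biró", 43, "Half-time"),
    ("Kristóf Rábay", 43, "Half-time"),
    ("Bence Molnár", 25, "Half-time"),
    ("Áron Fellegi", 25, "Full-time"),
]

# Weekly cost = sum over the team of (hourly rate * hours worked per week).
weekly_cost = sum(rate * HOURS_PER_WEEK[load] for _, rate, load in team)
print(weekly_cost)  # 3220
```

Under those assumptions the total comes to $3,220 per week, which gives a reference value to compare the model's answer against.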

All answers are great, but Michael Jordan is not referenced in the contract, so the last question should not have been answered. We can steer the model with a custom prompt that tells it not to answer when the information cannot be extracted from the context.

Create a custom `PromptTemplate`

In [27]:
prompt_template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If the question cannot be answered based on the context, say you cannot determine the answer based on the given text.

{context}

Question: {question}
Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [28]:
chain_with_custom_prompt = load_qa_chain(ChatOpenAI(temperature=0.0), chain_type="stuff", prompt=PROMPT)

In [29]:
print('Question:', queries[-1])
print('Answer:')
print(chain_with_custom_prompt.run(input_documents = docs, question = queries[-1]))

Question: When was Michael Jordan born?
Answer:
Cannot determine the answer based on the given text.


Now it works as intended: for this question the model relies only on the supplied context, not on the information it was trained on.
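
To make it clearer why the custom prompt has this effect, here is a minimal plain-Python sketch of what a "stuff"-style chain does conceptually: it joins the text of all documents into one string and substitutes it for `{context}`. The helper `build_stuff_prompt`, the separator, and the two toy page texts are hypothetical and only for illustration; the actual LangChain internals may differ in detail.

```python
# Shortened version of the custom template used above.
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If the question cannot be answered based on the context, say you cannot determine the answer based on the given text.

{context}

Question: {question}
Answer:"""


def build_stuff_prompt(page_texts, question, separator="\n\n"):
    """Hypothetical helper: 'stuff' all page texts into a single prompt."""
    context = separator.join(page_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Toy stand-ins for the loaded PDF pages (not the real contract text).
toy_pages = [
    "Senior personnel are billed at $43 per hour.",
    "Medior personnel are billed at $25 per hour.",
]

final_prompt = build_stuff_prompt(toy_pages, "When was Michael Jordan born?")
print(final_prompt)
```

The model only ever sees this single filled-in string, so the instruction to decline sits right next to the context it is supposed to rely on. It also shows the limit of the "stuff" approach: everything has to fit into one prompt.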

Before using the official summarization chain, let's see what QA can do when asked to summarize the text

In [43]:
q = 'Please summarize the entire contract as consicely as possible. Include the hourly rates.'
print('Question:', q)
print('Answer:')
a = chain.run(input_documents = docs, question = q)
print(re.sub("(.{128})", "\\1\n", a, 0, re.DOTALL))

Question: Please summarize the entire contract as consicely as possible. Include the hourly rates.
Answer:
The contract is between Software Guidance and Assistance, Inc. (SGA) and Csocsobajnok ZRT. The contractor will provide services 
to the client, Lexion Inc. The project is open-ended and the contractor personnel will continue to be the same. The hourly rates
 are $43 for senior personnel and $25 for medior personnel. The service hours may change upon request from the client. The tasks
 and deliverables will be defined by the client's CTO or co-founders. The contractor must inform SGA about any foreseeable chang
e in personnel availability. The contractor personnel will be subject to the client's security procedures while working on the p
roject. The contract may be terminated by either party with a notice period of 30 days. The contractor shall submit an invoice t
o SGA for services provided to the client. The contractor's fee is confidential and shall not be divulged to any other 

Pretty good! Now let's see how the official summarization chain does.

### 2. Summarization

https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

In [46]:
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), chain_type="stuff")
summary = chain.run(docs)

print(re.sub("(.{128})", "\\1\n", summary, 0, re.DOTALL))

Software Guidance & Assistance, Inc. and CSOCSOBAJNOK ZRT. have entered into a Contractor Agreement for the provision of service
s to Lexion Inc. The project is open-ended and will be performed by CONTRACTOR personnel, with rates adjusted for inflation annu
ally. The tasks and deliverables will be defined by the Client's CTO or co-founders. The CONTRACTOR personnel will be subject to
 CLIENT's security procedures while working on the project. Both parties may terminate the schedule with a notice period of 30 d
ays. The CONTRACTOR fee is confidential and shall not be divulged to any other party, including the Client. The CONTRACTOR may n
ot list the client or project name nor use the client logo without the express written permission of the Client.
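
A side note on the output formatting: the `re.sub("(.{128})", ...)` call used in these cells inserts a newline after every 128 characters, which is why words get split across lines in the printed output. A possible alternative, sketched below, is the standard library's `textwrap.fill`, which breaks at whitespace instead. The sample text reuses one sentence from the output above, and the narrower width of 60 is only chosen to keep the demo short.

```python
import re
import textwrap

# One sentence from the QA summary above, repeated to get a longer string.
sample = ("The contractor must inform SGA about any foreseeable change "
          "in personnel availability. " * 3).strip()
width = 60  # the notebook uses 128; smaller here for a compact demo

# Approach used in the notebook: hard break after every `width` characters.
regex_wrapped = re.sub("(.{%d})" % width, "\\1\n", sample, flags=re.DOTALL)

# Alternative: break only at whitespace, never inside a word.
text_wrapped = textwrap.fill(sample, width=width)

print(regex_wrapped)
print("-" * width)
print(text_wrapped)
```

Both versions keep every line within the width limit; the difference is only where the breaks fall.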


Add a custom `prompt`

In [47]:
prompt_template = """Write a concise summary of the following:

{text}

CONCISE SUMMARY:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

In [49]:
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), chain_type="stuff", prompt=PROMPT)
summary = chain.run(docs)

print(re.sub("(.{128})", "\\1\n", summary, 0, re.DOTALL))

The document outlines a project and fee schedule between Software Guidance & Assistance, Inc. (SGA) and CSOCSOBAJNOK ZRT. The pr
oject involves providing services to Lexion Inc. and may be extended beyond June 30, 2022. The personnel, rates, and tasks are d
efined, and the hours worked are subject to Hungarian workdays until 12 pm U.S. Eastern Time. The contractor must register their
 hours in SGA's electronic timesheet system and submit weekly timesheets. The parties may terminate the schedule with a 30-day n
otice period, and the contractor must submit invoices for payment within five days after the end of each monthly period. The con
tractor's fee is confidential, and they may not use the client's name or logo without permission.


In [50]:
prompt_template = """Write a concise summary of the following:

{text}

CONCISE SUMMARY IN HUNGARIAN:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), chain_type="stuff", prompt=PROMPT)
summary = chain.run(docs)

print(re.sub("(.{128})", "\\1\n", summary, 0, re.DOTALL))

A Software Guidance & Assistance, Inc. és a CSOCSOBAJNOK ZRT. közötti szerződés értelmében a CONTRACTOR szolgáltatásokat nyújt a
 Lexion Inc. ügyfélnek. A projekt nyitott végű, és a személyzetet a CONTRACTOR biztosítja. Az óradíjakat évente egyszer az inflá
cióhoz igazítják, és a szolgáltatási órák változhatnak az ügyfél kérésére. A feladatokat és a teljesítendő feladatokat az ügyfél
 CTO-ja határozza meg. A CONTRACTOR személyzete a CLIENT biztonsági eljárásainak alá van vetve. Minden hónap végén a CONTRACTOR 
számlát küld az SGA-nak a szolgáltatásokért, és az SGA 15 munkanapon belül kifizeti a számlát. A CONTRACTOR díja bizalmas, és ne
m hozzáférhető harmadik fél számára, beleértve az ügyfelet is. A szerződést mindkét fél 30 napos értesítési idővel bármikor felm
ondhatja.


What intermediate steps does a different chain type, such as `map_reduce`, produce?

In [56]:
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), 
                             chain_type="map_reduce", 
                             return_intermediate_steps=True)

map_reduce_summary = chain({"input_documents": docs}, return_only_outputs=True)

In [58]:
map_reduce_summary.keys()

dict_keys(['intermediate_steps', 'output_text'])

In [60]:
print(re.sub("(.{128})", "\\1\n", map_reduce_summary['output_text'], 0, re.DOTALL))

Software Guidance & Assistance, Inc. and CSOCSOBAJNOK ZRT. have signed a Contractor Agreement to provide services to Lexion Inc.
 The proof-of-concept project was successful, and the Contractor has been offered an open-ended project starting from July 1, 20
22. The project tasks and deliverables will be defined by the client's CTO or co-founders, and the Contractor must inform SGA ab
out any changes in personnel availability. The Contractor personnel work half- or full-time on Hungarian workdays until 12 pm U.
S. Eastern Time, and overtime charges apply when the client makes a written demand for work outside normal working hours. Both p
arties can terminate the schedule with 30 days' notice, and the contractor cannot use the client's name or logo without permissi
on.


In [62]:
len(map_reduce_summary['intermediate_steps']) == len(docs)

True
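
The check above shows one intermediate step per input document. As a rough illustration of why, here is a plain-Python sketch of the map-reduce idea: each chunk is summarized on its own (map), then the partial summaries are combined and summarized once more (reduce). The function `map_reduce_summarize`, the stand-in summarizer `first_sentence`, and the toy chunks are all hypothetical; the real chain calls the LLM at each step and may do additional work, such as collapsing overly long intermediate results.

```python
from typing import Callable, Dict, List, Union


def map_reduce_summarize(
    chunks: List[str], summarize: Callable[[str], str]
) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical sketch of a map-reduce summary over text chunks."""
    # Map: summarize every chunk independently, one result per chunk.
    intermediate_steps = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries.
    output_text = summarize("\n\n".join(intermediate_steps))
    return {"intermediate_steps": intermediate_steps, "output_text": output_text}


def first_sentence(text: str) -> str:
    """Stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


# Toy chunks standing in for the loaded pages (not the real contract text).
toy_chunks = [
    "The project is open-ended. Personnel stay the same.",
    "Either party may terminate with 30 days' notice. Invoices are monthly.",
    "The fee is confidential. The client logo may not be used.",
]

result = map_reduce_summarize(toy_chunks, first_sentence)
print(result.keys())
print(len(result["intermediate_steps"]) == len(toy_chunks))
```

Because each chunk is processed separately, this style of chain can handle documents that would not fit into a single prompt, at the cost of more model calls.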