## Experimenting with `LangChain` & `OpenAI` for Document QA within long documents

Using the LangChain framework to build document QA tools that rely on ChatGPT to extract information and present it in human-readable form.

In [17]:
import credentials
import time
import re
import os
os.environ["OPENAI_API_KEY"] = credentials.openai_api

import tiktoken

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.vectorstores import FAISS

from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

### 1. QA (without `retriever` object)

For the variant that uses a retriever object, see: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html

https://python.langchain.com/en/latest/use_cases/question_answering.html
https://python.langchain.com/en/latest/modules/chains/index_examples/question_answering.html

In [33]:
loader = PyPDFLoader("../docs/chatgpt_info_extraction.pdf")
docs = loader.load_and_split(text_splitter = TokenTextSplitter(encoding_name = 'cl100k_base'))

In [34]:
len(docs)

18

Count the tokens in the document itself (excluding the prompt and the model output)

In [36]:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
whole_doc = ' '.join([i.page_content for i in docs])
tokens_in_doc = encoding.encode(whole_doc)

print(len(tokens_in_doc))

22284
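
To put this number in context, here is a rough sketch of how many separate calls would be needed to show the model every token once. The 4,096-token window is the limit of the original `gpt-3.5-turbo`; the 1,000 tokens reserved for the prompt and the answer are an arbitrary assumption for illustration, and `min_calls_needed` is a hypothetical helper, not part of the notebook.

```python
import math

# Assumptions for illustration (not taken from the notebook):
CONTEXT_WINDOW = 4096    # token limit of the original gpt-3.5-turbo
RESERVED_TOKENS = 1000   # hypothetical budget for prompt template + answer


def min_calls_needed(doc_tokens, context_window=CONTEXT_WINDOW, reserved=RESERVED_TOKENS):
    """Return the minimum number of LLM calls needed to cover doc_tokens once."""
    usable = context_window - reserved
    if usable <= 0:
        raise ValueError("reserved tokens must be smaller than the context window")
    return math.ceil(doc_tokens / usable)


# 22284 is the token count printed above
print(min_calls_needed(22284))
```

Under these assumptions the document cannot fit into a single prompt, which is why the following cells either pre-select chunks or chain several calls.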


Store the document chunks in a vector DB, so that only the relevant parts get passed to the LLM

In [40]:
# OpenAIEmbeddings needs an embedding model; 'gpt-3.5-turbo' is a chat model
db = FAISS.from_documents(docs, OpenAIEmbeddings(model = 'text-embedding-ada-002'))
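
What the vector store does at query time can be sketched in a few lines of plain Python. This is only a toy illustration: the three-dimensional vectors are made up, and cosine similarity is used as a simple stand-in for whatever distance the FAISS index applies internally. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings" for four chunks and one query
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.2, 0.0]
print(top_k(query, chunks, k=2))
```

Only the chunks at the returned indices would be placed in the prompt; the rest of the document is never seen by the model for that query.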

Create the chain, wrapped in a function that pre-selects the relevant parts

In [58]:
chain = load_qa_chain(ChatOpenAI(temperature=0.0), chain_type="stuff")

def run_query(query, max_num_rel_docs = 2):
    # Retrieve only the chunks most similar to the query, then pass them to the chain
    relevant_docs = db.similarity_search(query, k = max_num_rel_docs)
    answer = chain.run(input_documents = relevant_docs, question = query)
    return answer

In [59]:
queries = ["Who are the authors?",
           "What was the main experiment?",
           "What were the main outcomes?",
           "Other than ChatGPT, which other language models did the researchers experiment with?",
           "What is the comparison between models in terms of performance?",
           "What are the most important takeaways?",
           "How many papers are referenced?",
           "What are some links or URLs present in the paper?"]

In [60]:
for query in queries:
    print('Question:', query)
    print('Answer:')
    print(run_query(query))
    print('\n', '-------' * 10)
    time.sleep(20)  # stay under the 3 requests/min rate limit

Question: Who are the authors?
Answer:
There are multiple authors listed for different papers in this context. Please specify which paper you are referring to.

 ----------------------------------------------------------------------
Question: What was the main experiment?
Answer:
The main experiment was to evaluate the performance of ChatGPT, a language model, on a diverse range of Information Extraction (IE) tasks, including 7 fine-grained tasks spanning 14 datasets. The study collected 15 keys for each dataset from both ChatGPT and domain experts and compared ChatGPT's performance with several popular baselines. The aim was to analyze ChatGPT's abilities without any training and to understand its performance on different IE tasks. The study also compared ChatGPT's performance on both Standard-IE and OpenIE settings.

 ----------------------------------------------------------------------
Question: What were the main outcomes?
Answer:
The text provides several outcomes of the research

Most of these answers are satisfactory, but because only a limited number of chunks is passed as input, ChatGPT does not see the whole context. A paid plan lifts the rate limit of 3 requests per minute, which would make it practical to pass the whole document through the model (with map-reduce chaining).
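
The map-reduce idea can be sketched without any API access. In the sketch below `ask_llm` is a hypothetical callable standing in for a real model call, and the prompt wording is invented for illustration; it is not the prompt LangChain uses.

```python
def map_reduce_answer(chunks, question, ask_llm):
    """Answer a question over many chunks with the map-reduce pattern.

    ask_llm is a hypothetical callable (prompt: str) -> str that stands in
    for a real model call.
    """
    # Map step: ask the question against each chunk separately
    partial_answers = [
        ask_llm(f"Context:\n{chunk}\n\nQuestion: {question}")
        for chunk in chunks
    ]
    # Reduce step: combine the partial answers in one final call
    combined = "\n".join(partial_answers)
    return ask_llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")


# Stand-in "model" that records its prompts so the call pattern can be inspected
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer {len(calls)}"


result = map_reduce_answer(["chunk A", "chunk B", "chunk C"], "Who are the authors?", fake_llm)
print(len(calls), result)
```

The pattern needs one call per chunk plus one combining call, so the 18 chunks of this document would mean 19 requests. At 3 requests per minute that is more than six minutes per question, which is why the rate limit matters here. In LangChain, `load_qa_chain` accepts `chain_type="map_reduce"` for this purpose.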

### 2. Summarization

https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

In [None]:
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), chain_type="map_reduce")
summary = chain.run(docs)

In [67]:
print(re.sub("(.{128})", "\\1\n", summary, count=0, flags=re.DOTALL))

The articles and papers listed cover various topics related to natural language processing, including the evaluation of ChatGPT'
s performance in information extraction tasks, the use of language models for entity and relation extraction, and the potential 
impact of AI on online exam integrity. The studies explore the capabilities and limitations of language models, as well as their
 potential applications in fields such as medicine and translation. The articles also discuss techniques for improving the accur
acy and efficiency of natural language processing models.
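
The regex above inserts a newline after every 128 characters, which splits words in the middle (for example "ChatGPT'" / "s"). A possible alternative, sketched here with the standard library's `textwrap` module, breaks only at whitespace. `wrap_for_display` is a hypothetical helper name and the sample text is a stand-in for `summary`.

```python
import textwrap


def wrap_for_display(text, width=128):
    """Wrap text to at most `width` characters per line, breaking at whitespace."""
    return textwrap.fill(text, width=width)


# Stand-in for the `summary` string produced by the chain
sample = "The studies explore the capabilities and limitations of language models. " * 4
wrapped = wrap_for_display(sample, width=40)
print(wrapped)
```

Note that `textwrap.fill` also collapses runs of whitespace, so the output is not byte-for-byte identical to the regex version.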


In [None]:
prompt_template = """Write a concise summary of the following:

{text}

CONCISE SUMMARY IN HUNGARIAN:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(ChatOpenAI(temperature=0.0), chain_type="map_reduce", map_prompt=PROMPT, combine_prompt=PROMPT)
summary = chain.run(docs)

In [70]:
print(re.sub("(.{128})", "\\1\n", summary, count=0, flags=re.DOTALL))

A cikkgyűjtemény bemutatja a mesterséges intelligencia és a természetes nyelv feldolgozásának fejlesztésére és javítására összpo
ntosító kutatásokat. Az ACL 2020 és az ACL 2022 konferenciákon bemutatott kutatások között szerepelnek olyan témák, mint az absz
trakt összefoglalások hűséges és valósághűségi kérdései, a chatbotok döntéshozatalának magyarázata, a nyelvi homályosság elemzés
e, a nagy nyelvi modellek külső tudás és automatikus visszajelzés felhasználásával történő javítása, valamint a nyelvi modellek 
által tükrözött vélemények kérdése. Az A mellékletben bemutatják az egyes feladatokhoz használt adathalmazokat, valamint a legúj
abb módszereket is bemutatják minden adathalmazon, amelyek a legjobb eredményeket érték el az adott feladatokban.


Check API usage?

In [71]:
from openai.api_requestor import APIRequestor

In [84]:
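r = APIRequestor()
# Query the usage endpoint for a single day
resp = r.request("GET", '/usage?date=2023-04-25')
# request() returns a tuple; the first element is the response object
resp_object = resp[0]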

In [85]:
resp_object.data.keys()

dict_keys(['object', 'data', 'ft_data', 'dalle_api_data', 'whisper_api_data', 'current_usage_usd'])