
# **Quickstart Guide to Document QA with OpenAI and LangChain**
*Reading and analyzing a journal paper*

In [None]:
! pip install langchain openai chromadb pypdf tiktoken

Collecting langchain
  Downloading langchain-0.0.267-py3-none-any.whl (1.5 MB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting chromadb
  Downloading chromadb-0.4.6-py3-none-any.whl (405 kB)
Collecting pypdf
  Downloading pypdf-3.15.1-py3-none-any.whl (271 kB)
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

https://www.scientifica.uk.com/neurowire/gradhacks-a-guide-to-reading-research-papers

Mount Google Drive, where the data is stored

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


Set the OpenAI API key

In [None]:
import openai

# Placeholder: replace "your_key" with your own OpenAI API key
openai.api_key = "your_key"
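
A key pasted into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` is hypothetical, and the sketch assumes the key was exported as `OPENAI_API_KEY` before the notebook was started.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration; it is not part of openai or langchain.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With this in place, the assignment in the cell above would become `openai.api_key = load_api_key()`.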

Use the ChatOpenAI chat model from LangChain

In [None]:
from langchain.chat_models import ChatOpenAI

# temperature=0 keeps the answers as repeatable as possible
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key=openai.api_key)

PyPDFLoader extracts the text and metadata from each page of the PDF.

In [None]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("journal paper.pdf")

<b>loader.load() method </b>
<li>Returns a list of Document objects</li><li>Each Document contains the page_content field which is the text content of the document</li>

We can loop through the list of Documents and print out the page_content to access the text from each document loaded by the loader.

In [None]:
pages = loader.load()

In [None]:
print(len(pages))
print(pages)
#print(pages[0].page_content)

12
[Document(page_content='   \n     \n   \n    \n   \n    \n   Int. J. Data Analysis Techniques and Strategies, Vol. 10, No. 4, 2018 369    \n \n   Copyright © 2018 Inderscience Enterprises Ltd. \n     \n   \n    \n   \n    \n       \n A lexicon-based term weighting scheme for emotion \nidentification of tweets \nS. Lovelyn Rose* and R. Venkatesan \nDepartment of CSE, \nPSG College of Technology, Coimbatore, India Email: lovelyndavid@gmail.com Email: ramanvenkatesan@yahoo.com \n*Corresponding author \nGirish Pasupathy \nDepartment of Windows Server and System Center (WSSC), \nMicrosoft India Development Center, India \nEmail: girishsub485@gmail.com \nP. Swaradh \nDepartment of CSE, \nKMCT College of Engineering, Calicut, India \nEmail: swaradh@gmail.com \nAbstract:  Detecting emotions in tweets is  a huge challenge due to its li mited \n140 characters and extensive use of twitter language with evolv ing terms and \nslangs. This paper uses variou s preprocessing techniques, forms  a  f
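
Printing the whole list is hard to read because the extracted text is full of stray whitespace. The sketch below shows one way to loop over the documents and print a short, cleaned preview per page. To stay runnable without the PDF it uses a small stand-in class, `PageStub` (hypothetical), that only mimics the `page_content` and `metadata` fields; the sample strings are borrowed from the output above and the page numbers are made up. In the notebook, `preview_pages(pages)` would be called on the loaded list instead.

```python
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Stand-in for a loaded document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview_pages(docs, max_chars=60):
    """Return one short preview line per document."""
    previews = []
    for i, doc in enumerate(docs):
        # Collapse the runs of spaces and newlines left behind by PDF extraction.
        text = " ".join(doc.page_content.split())
        # Fall back to the list position when no page number is stored.
        page = doc.metadata.get("page", i)
        previews.append(f"page {page}: {text[:max_chars]}")
    return previews


sample = [
    PageStub("   \n  Int. J. Data Analysis Techniques and Strategies, Vol. 10, No. 4, 2018", {"page": 0}),
    PageStub("A lexicon-based term weighting scheme for emotion \nidentification of tweets", {"page": 1}),
]
for line in preview_pages(sample):
    print(line)
```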

Initialize the OpenAIEmbeddings class<br>
The OpenAIEmbeddings class allows you to generate vector embeddings for text using OpenAI's embedding models.



In [None]:
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=openai.api_key)


Document QA<br>
VectorstoreIndexCreator does the following automatically:
<li>Splits the documents into chunks
<li>Embeds the chunks with OpenAIEmbeddings
<li>Stores them in a Vectorstore index<br><br>

The key steps are:

<li>Load the PDF using PyPDFLoader into a list of Document objects

<li>Initialize the OpenAIEmbeddings model

<li>Pass the embeddings model to the VectorstoreIndexCreator

<li>Call .from_loaders() on the index creator, passing in the PyPDFLoader
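
To make the split step concrete, here is a simplified sketch of chunking with overlap. It cuts fixed-size character windows, whereas LangChain's text splitters are more careful about where they cut; the function name `split_text` and the parameter values are illustrative assumptions, not the settings the index creator uses.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into fixed-size character windows with optional overlap.

    Simplified illustration only; not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

Overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.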

A Vectorstore index is a search index that allows fast semantic search over text documents. It is created using the VectorstoreIndexCreator class in LangChain.
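
The idea behind such an index can be sketched in a few lines: every chunk is stored next to a vector, and a query returns the chunks whose vectors are closest to the query vector. The toy below uses word counts and cosine similarity in place of learned embeddings, so it only matches shared words; it is a stand-in to show the mechanics, not how OpenAI embeddings or Chroma work internally. `ToyIndex`, `embed`, and `cosine` are hypothetical names, and the sample chunks are adapted from answers shown later in this notebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (not a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Stores (chunk, vector) pairs and ranks them against a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "S. Lovelyn Rose and R. Venkatesan, Department of CSE, PSG College of Technology",
    "The classifiers used were naive Bayes, SVM, and random forests",
    "The training dataset has 3,946 tweets and the testing dataset has 1,965 tweets",
]
toy_index = ToyIndex(chunks)
print(toy_index.search("Which classifiers were used?", k=1))
```

In the real pipeline the retrieved chunks are then handed to the LLM as context for the answer.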

In [None]:
from langchain.indexes import VectorstoreIndexCreator

# Split, embed, and index the PDF pages in one step
index = VectorstoreIndexCreator(embedding=embeddings_model).from_loaders([loader])

Use the resulting index object to ask questions about the PDF contents. Each query retrieves the most relevant chunks and passes them to the LLM, which composes the answer.

In [None]:
index.query(llm=llm,question="Who are the authors of the paper")

'The authors of the paper are S.L. Rose, R. Venkatesan, G. Pasupathy, and P. Swaradh.'

In [None]:
index.query(llm=llm,question="What are their affiliations")

'S. Lovelyn Rose is an Associate Professor at the Department of CSE in PSG College of Technology, Coimbatore. \nR. Venkatesan is currently a Professor and the Head of the Department of CSE at PSG College of Technology, Coimbatore. \nGirish Pasupathy is a Software Engineer at Microsoft India Development Center, Hyderabad. \nP. Swaradh is the Head of the Department of Computer Science and Engineering at KMCT College of Engineering, Calicut.'

In [None]:
index.query(llm=llm,question="Summarize the methodology employed")

'The methodology employed in this study involved the use of classification techniques to identify emotions in tweets. The classifiers used were naive Bayes, SVM, and random forests. The feature vector used in the classification was high-dimensional and sparse. The WEKA software, which consists of machine learning algorithms, was used for implementing these classifiers. The experimental setup involved using an Intel Xeon E3 processor with 16 GB RAM and the Windows 7 platform. The evaluation metrics used to assess the performance of the classifiers included precision score, recall score, and F1 score.'

In [None]:
index.query(llm=llm,question="List the reference papers between the years 2012 to 2015")

'The reference papers between the years 2012 to 2015 are:\n\n1. Ide, N. and Suttles, J. (2013) ‘Distant supervision for emotion classification with discrete binary values’, Proc. 14th International Conference on Computational Linguistics and Intelligent Text Processing, pp.121–136.\n\n2. Jahnavi, Y. and Radhika, Y. (2015) ‘FPST: a new term weighting algorithm for long running and short-lived events’, International Journal of Data Analysis Techniques and Strategies, Vol. 7, No. 4, pp.366–383.\n\n3. Mohammad, S. (2012a) ‘#Emotional tweets’, Proc. 1st Joint Conference on Lexical and Computational Semantics, pp.246–255.\n\nPlease note that these are the only reference papers mentioned in the given context between the years 2012 to 2015.'
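
As the closing note in the answer hints, the model only sees the chunks that were retrieved, so a list like this may be incomplete. A simple deterministic cross-check is to filter reference strings by their year with a regular expression. The sketch below assumes each reference carries its year in parentheses, as in the answer above; the function name is hypothetical, the first three strings are shortened from that answer, and the last entry is a made-up out-of-range example.

```python
import re


def references_in_range(references, start_year, end_year):
    """Keep references whose parenthesised year, e.g. (2012a), lies in the range."""
    year_pattern = re.compile(r"\((\d{4})[a-z]?\)")
    selected = []
    for ref in references:
        match = year_pattern.search(ref)
        if match and start_year <= int(match.group(1)) <= end_year:
            selected.append(ref)
    return selected


references = [
    "Ide, N. and Suttles, J. (2013) 'Distant supervision for emotion classification with discrete binary values'",
    "Jahnavi, Y. and Radhika, Y. (2015) 'FPST: a new term weighting algorithm for long running and short-lived events'",
    "Mohammad, S. (2012a) '#Emotional tweets'",
    "Example, A. (2005) 'Hypothetical out-of-range entry'",  # made up for illustration
]
for ref in references_in_range(references, 2012, 2015):
    print(ref)
```

Running the same filter over the text of the reference pages would show whether the model's list missed anything.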

In [None]:
index.query(llm=llm,question="What is the sample used")

'The sample used in this study consists of 3,946 tweets in the training dataset and 1,965 tweets in the testing dataset. The tweets are categorized into different emotions such as Happy, Sad, Anger, Fear, Disgust, Surprise, and No_Emotion.'

In [None]:
index.query(llm=llm,question="What is the research investigating")

'The research is investigating the classification of emotions in tweets.'

In [None]:
index.query(llm=llm,question="What are the implications of the results? Give in 2 sentences")

'The results suggest that using n-gram models does not improve the accuracy of emotion identification and may not be suitable for new domains. Additionally, the inclusion of interjections, synonyms, negations, and punctuation marks can improve the accuracy of machine learning classifiers for emotion identification.'

In [None]:
index.query(llm=llm,question="What experiments could be carried out to answer any further questions?")

'Based on the given context, some possible experiments that could be carried out to answer further questions could include:\n\n1. Experiment to compare different classification techniques: Conduct a study to compare the performance of different classifiers (e.g., naive Bayes, SVM, random forests) on classifying emotions in textual content. This could involve using a dataset with labeled emotions and evaluating the accuracy, precision, recall, and F-measure of each classifier.\n\n2. Experiment to evaluate the impact of preprocessing: Investigate the effect of extensive preprocessing on the classification accuracy of emotions. This could involve comparing the performance of classifiers on datasets with and without preprocessing steps such as removing stopwords, stemming, or normalizing text.\n\n3. Experiment to assess the addition of a new emotion class: Explore the impact of adding a new emotion class (e.g., No_Emotion) on the classification accuracy. This could involve comparing the pe

In [None]:
index.query(llm=llm,question="What work have researchers previously done on this topic? Format the output as a bulleted list of points")

"- Researchers have used the affective nature of hashtags and emoticons in classification (Ide and Suttles, 2013; Mohammad, 2012a; Qadir and Riloff, 2014; Hoecke et al., 2014).\n- Supervised machine learning-based classification techniques such as naive Bayes, maximum entropy, SVM, k-NN, DNN, CNN, liblinear, multinomial naive Bayes, and logistic regression have been used (Cambria et al., 2013; Ide and Suttles, 2013; Cardie et al., 2005; Mohammad, 2012b; Qadir and Riloff, 2014; Cambria et al., 2013).\n- Mohammad (2012b) found that supervised methods perform better than unsupervised methods for emotion classification.\n- Previous work has used term weighting algorithms based on frequency, position, scattering, and topicality (Jahnavi and Radhika, 2015).\n- A novel term weighting scheme based on punctuations and negation has been proposed to improve the feature vector.\n- Studies on the effect of the n-gram model on emotion identification found that it does not improve classification accu