<a href="https://colab.research.google.com/github/nbeaudoin/LangChain-Experimentation-Zone/blob/main/PDF_Chat_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tutorial Source: https://www.youtube.com/watch?v=ZzgUqFtxgXI&list=PL8motc6AQftk1Bs42EW45kwYbyJ4jOdiZ&index=18

In [1]:
!pip -q install openai langchain
!pip install python-dotenv
!pip install tiktoken
!pip install faiss-gpu

# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

import os
from dotenv import load_dotenv

# Load variables from .env file
load_dotenv('/content/drive/MyDrive/Projects/keys/secrets.json')

# Use variables
openai_api = os.getenv('OPENAI_API_KEY')
huggingface_api = os.getenv('HUGGINGFACE_API_KEY')
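
A quick way to confirm that the keys loaded without echoing them into the notebook output is to print a masked version. This is only a minimal sketch: `mask_secret` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os

def mask_secret(value, visible=4):
    """Return a display-safe version of a secret, keeping only the last few characters."""
    if not value:
        return "<missing>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Report whether each key is present without printing it in full.
for name in ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY"):
    print(name, "->", mask_secret(os.getenv(name)))
```
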


# Basic Chat PDF

In [2]:
!pip install PyPDF2

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS



# Reading in the PDF

In [3]:
# location of the PDF file
doc_reader = PdfReader('/content/drive/MyDrive/Projects/docs/impromptu-rh.pdf')

In [4]:
doc_reader

<PyPDF2._reader.PdfReader at 0x7a5d7673bf10>

In [5]:
# Extract the text of every page and concatenate it into raw_text
raw_text = ''
for page in doc_reader.pages:
  text = page.extract_text()
  if text:  # skip pages with no extractable text
    raw_text += text

In [6]:
len(raw_text)

371090

In [7]:
raw_text[:100]

'Impromptu\nAmplifying Our Humanity \nThrough AI\nBy Reid Hoffman  \nwith GPT-4Impromptu: AmplIfyIng our '

# Text Splitter

This step splits the raw text into chunks. Because `length_function` is `len`, `chunk_size` and `chunk_overlap` are measured in characters, not tokens.

In [8]:
# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap = 200,  # up to 200 characters shared between consecutive chunks
    length_function = len
)

texts = text_splitter.split_text(raw_text)
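
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window sketch in plain Python. It is not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges pieces, so its chunk boundaries fall on newlines. The function name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# The tail of each chunk is repeated at the head of the next one.
print(chunks[0][-4:], chunks[1][:4])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of indexing some text twice.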

In [9]:
len(texts)

466

In [10]:
texts[20]

'Because, really, an AI book? When things are moving so \nquickly? Even with a helpful AI on hand to speed the process, \nany such book would be obsolete before we started to write it—\nthat’s how fast the industry is moving.\nSo I hemmed and hawed for a bit. And then I thought of a frame \nthat pushed me into action.\nThis didn’t have to be a comprehensive “book” book so much as \na travelog, an informal exercise in exploration and discovery, \nme (with GPT-4) choosing one path among many. A snapshot \nmemorializing—in a subjective and decidedly not definitive \nway—the AI future we were about to experience.\nWhat would we see? What would impress us most? What would \nwe learn about ourselves in the process? Well aware of the brief \nhalf-life of this travelog’s relevance, I decided to press ahead.\nA month later, at the end of November 2022, OpenAI released \nChatGPT, a “conversational agent,” aka chatbot, a modified \nversion of GPT-3.5 that they had fine-tuned through a process'

# Making the embeddings

In [11]:
# Create the OpenAI embeddings client (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [12]:
docsearch = FAISS.from_texts(texts, embeddings)

In [13]:
docsearch.embedding_function

<bound method OpenAIEmbeddings.embed_query of OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False)>

In [14]:
query = "how does GPT-4 change social media?"
docs = docsearch.similarity_search(query)

In [15]:
len(docs)

4
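
The search returns the `k` chunks whose embeddings lie closest to the query embedding, and four results came back here. The idea can be sketched with toy 2-D vectors, assuming Euclidean (L2) distance as the metric. The vectors and the `top_k` helper below are invented for illustration; real embeddings have far more dimensions and FAISS uses an optimized index rather than a full sort.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors nearest to query_vec."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]

# Toy "embeddings" standing in for the chunk vectors stored in the index.
toy_docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5], [5.0, 0.0]]
toy_query = [0.4, 0.4]

nearest = top_k(toy_query, toy_docs, k=4)
print(nearest)
```
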

In [16]:
docs[0]

Document(page_content='cian, GPT-4 and ChatGPT are not only able but also incredi-\nbly willing to focus on whatever you want to talk about.4 This \nsimple dynamic creates a highly personalized user experience. \nAs an exchange with GPT-4 progresses, you are continuously \nfine-tuning it to your specific preferences in that moment. \nWhile this high degree of personalization informs whatever \nyou’re using GPT-4 for, I believe it has special salience for the \nnews media industry.\nImagine a future where you go to a news website and use \nqueries like these to define your experience there:\n4  Provided it doesn’t violate the safety restrictions OpenAI has put on \nthem.93Journalism\n● Hey, Wall Street Journal, give me hundred-word summa-\nries of your three most-read tech stories today.\n● Hey, CNN, show me any climate change stories that hap-\npened today involving policy-making.\n● Hey, New York Times, can you create a counter-argument \nto today’s Paul Krugman op-ed, using only news

# Plain QA Chain

In [17]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [18]:
chain = load_qa_chain(OpenAI(),
                      chain_type="stuff")

In [20]:
# check the prompt
chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

In [23]:
query = "who are the authors of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Reid Hoffman and Sam Altman.'
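
The "stuff" chain type places all retrieved chunks into a single prompt. A rough sketch of that assembly, using the template printed above, might look like the following. The `build_stuff_prompt` helper and the blank-line separator between chunks are assumptions for illustration, not LangChain's code.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\n"
    "Question: {question}\nHelpful Answer:"
)

def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join every chunk into one context string and fill in the template."""
    context = separator.join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)

prompt = build_stuff_prompt(
    ["First chunk.", "Second chunk."],
    "who are the authors of the book?",
)
print(prompt)
```

Because every chunk lands in one prompt, the total size grows with `k` and with `chunk_size`, which is what eventually runs into the model's context limit.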

In [26]:
query = "who are the authors of the book?"
query_02 = "has it rained this week?"
# Note: the context is retrieved with query_02, while the question asked is still `query`
docs = docsearch.similarity_search(query_02)
chain.run(input_documents=docs, question=query)

' The authors of the book are Reid Abbasi and GPT-4.'

In [30]:
query = "who is the booked authored by?"
docs = docsearch.similarity_search(query, k=4)
chain.run(input_documents=docs, question=query)

' The book is authored by Reid Hoffman and GPT-4.'

In [31]:
# To stay within the context length, switch to a chain type that answers and scores each chunk separately
chain = load_qa_chain(OpenAI(),
                      chain_type="map_rerank",
                      return_intermediate_steps=True
                      )

query = "who are openai?"
docs = docsearch.similarity_search(query, k=4)
results = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
results



{'intermediate_steps': [{'answer': ' OpenAI is an AI research and deployment company that was founded in 2015. It is known for its text-to-image generation tool DALL-E 2 and ChatGPT, as well as Midjourney and Stable Diffusion.',
   'score': '80'},
  {'answer': ' OpenAI is an organization founded in 2015 with the goal of developing technologies that put the power of AI directly into the hands of millions of people.',
   'score': '100'},
  {'answer': ' OpenAI is an AI research and development company based in San Francisco, California.',
   'score': '90'},
  {'answer': ' OpenAI is a research company focused on developing artificial intelligence (AI) technology.',
   'score': '100'}],
 'output_text': ' OpenAI is an organization founded in 2015 with the goal of developing technologies that put the power of AI directly into the hands of millions of people.'}

In [32]:
results['output_text']

' OpenAI is an organization founded in 2015 with the goal of developing technologies that put the power of AI directly into the hands of millions of people.'

In [35]:
results['intermediate_steps']

[{'answer': ' OpenAI is an AI research and deployment company that was founded in 2015. It is known for its text-to-image generation tool DALL-E 2 and ChatGPT, as well as Midjourney and Stable Diffusion.',
  'score': '80'},
 {'answer': ' OpenAI is an organization founded in 2015 with the goal of developing technologies that put the power of AI directly into the hands of millions of people.',
  'score': '100'},
 {'answer': ' OpenAI is an AI research and development company based in San Francisco, California.',
  'score': '90'},
 {'answer': ' OpenAI is a research company focused on developing artificial intelligence (AI) technology.',
  'score': '100'}]
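
The final answer is the intermediate step with the highest score. A minimal sketch of that selection is below. The scores come back as strings, so they are converted to integers first. Keeping the first answer on a tie is an assumption about the tie-breaking rule, although it agrees with the run above. `pick_best` and the placeholder answers are hypothetical.

```python
def pick_best(steps):
    """Return the step with the highest integer score (first one wins on a tie)."""
    return max(steps, key=lambda step: int(step["score"]))

# Placeholder answers with the same score pattern as the run above.
steps = [
    {"answer": "answer A", "score": "80"},
    {"answer": "answer B", "score": "100"},
    {"answer": "answer C", "score": "90"},
    {"answer": "answer D", "score": "100"},
]

best = pick_best(steps)
print(best["answer"])
```
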

In [36]:
# check the prompt
chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nIn addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:\n\nQuestion: [question here]\nHelpful Answer: [answer here]\nScore: [score between 0 and 100]\n\nHow to determine the score:\n- Higher is a better answer\n- Better responds fully to the asked question, with sufficient level of detail\n- If you do not know the answer based on the context, that should be a score of 0\n- Don't be overconfident!\n\nExample #1\n\nContext:\n---------\nApples are red\n---------\nQuestion: what color are apples?\nHelpful Answer: red\nScore: 100\n\nExample #2\n\nContext:\n---------\nit was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv\n---------\nQuestion: what type was the car?\nHelpful Answer: a sports car or an su

# RetrievalQA

The RetrievalQA chain wraps load_qa_chain and combines it with a retriever (in our case, the FAISS index), so retrieval and answering happen in a single call.

In [41]:
from langchain.chains import RetrievalQA

# set up FAISS as a generic retriever
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# create the chain to answer questions
rqa = RetrievalQA.from_chain_type(llm=OpenAI(),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
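
Conceptually, the chain retrieves documents for the question and then passes them to the QA step. The sketch below shows that composition in plain Python, without LangChain. `retrieval_qa`, `toy_retrieve`, and `toy_answer` are hypothetical stand-ins, and the keyword-overlap retriever is only a placeholder for the FAISS similarity search.

```python
def retrieval_qa(question, retrieve, answer):
    """Fetch documents for the question, then answer using only those documents."""
    docs = retrieve(question)
    return {
        "query": question,
        "result": answer(docs, question),
        "source_documents": docs,
    }

# Two toy "chunks" standing in for the indexed text.
CORPUS = [
    "OpenAI released ChatGPT in November 2022.",
    "LinkedIn was co-founded in 2002.",
]

def toy_retrieve(question, k=1):
    """Rank chunks by the number of words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def toy_answer(docs, question):
    """Stand-in for the LLM call: echo the best chunk, or admit ignorance."""
    return docs[0] if docs else "I don't know."

out = retrieval_qa("what did openai release?", toy_retrieve, toy_answer)
print(out["result"])
```

Returning the source documents alongside the result makes it possible to check which chunks an answer was based on.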


In [40]:
rqa("What is OpenAI?")

{'query': 'What is OpenAI?',
 'result': ' OpenAI is a research organization that develops and shares artificial intelligence tools for the benefit of humanity. It does not claim any ownership or rights over the content that its tools produce or help produce.',
 'source_documents': [Document(page_content='ing to their own lives however they best saw fit.\nEspecially in the realm of work, I realized, AI deployed in this \nway could give individuals incredibly versatile new tools to \napply to their careers, professional development, and economic \nautonomy. So when I had a chance to become one of OpenAI’s \ninitial funders in 2015, I took it. The vision of AI that it was \nplanning to pursue felt like a natural extension of the goals that \nhad inspired me to co-found LinkedIn in 2002.\nWhen OpenAI released its text-to-image generation tool, \nDALL-E 2, in April 2022, and then followed up six months later \nwith ChatGPT, the organization’s mission to give millions of \nusers hands-on acc

In [45]:
query = "How does Gen AI impact journalism?"
rqa(query)['result']

' Gen AI can help journalists work more productively and effectively, as well as provide opportunities for novel ways of finding and telling stories. Additionally, it provides an opportunity to create a more visible culture of informational transparency and accountability.'

In [46]:
query = "How is GPT-4 different than GPT-3?"
rqa(query)['result']

' GPT-4 has been demonstrated to have greater proficiency than GPT-3, including being able to generate better lightbulb jokes, prose of all kinds, emails, poetry, essays, summaries of documents, translations of languages, and computer code. Additionally, GPT-4 is designed to focus on whatever the user wants to talk about, creating a personalized user experience.'

In [47]:
query = "Who is Nicholas Beaudoin?"
rqa(query)['result']

" I don't know."