# **Custom Knowledge ChatGPT with LangChain - Chat with PDFs**

**By Liam Ottley:**  [YouTube](https://youtube.com/@LiamOttley)





0.   Installs, Imports and API Keys
1.   Loading PDFs and chunking with LangChain
2.   Embedding text and storing embeddings
3.   Creating retrieval function
4.   Creating chatbot with chat memory (OPTIONAL) 








# 0. Installs, Imports and API Keys

In [16]:
# Import packages
import os
import re
import openai
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import textract
import tiktoken
from IPython.display import Markdown, display
import ipywidgets as widgets
from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import AzureChatOpenAI
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain_google_genai import GoogleGenerativeAI, GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
import plotly.graph_objects as go
from langchain.prompts import PromptTemplate

In [17]:
# Assign Azure OpenAI and Google API credentials (placeholders shown; supply your own values)
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://<your-resource-name>.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-api-key>"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["GOOGLE_API_KEY"] = "<your-google-api-key>"
# api_key = st.secrets["GOOGLE_API_KEY"]
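
Credentials written directly into a notebook are easy to leak when the notebook is shared. As a minimal sketch, the keys can instead be read from variables set outside the notebook, failing early when one is missing. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and exist only for this illustration; they are not part of LangChain or either SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a dummy variable (hypothetical name, not a real credential)
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this would be called once per key, for example `require_env("GOOGLE_API_KEY")`, after setting the variable in the shell or operating system.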

# 1. Loading PDFs and chunking with LangChain

In [18]:
import os
files = filter(lambda f: f.lower().endswith(".pdf"), os.listdir(".\\Samples"))
file_list = list(files)
file_list

['Attention is All You Need.pdf']

In [19]:
os.listdir(".\\Samples")

['Attention is All You Need.pdf']

In [20]:
path = ".\\Samples"
for file_name in os.listdir(path):
    # construct full file path
    file = path + '\\' + file_name
    print(f'file: {file}')
    if os.path.isfile(file) and file.endswith('.pdf'):
        print('Deleting file:', file)
        os.remove(file)

file: .\Samples\Attention is All You Need.pdf
Deleting file: .\Samples\Attention is All You Need.pdf


In [21]:
# The PDF must be in the current working directory (os.getcwd()) for the loader below to find it

# Simple method - load the PDF page by page, then split each page into chunks of at most 512 characters
loader = PyPDFLoader(f"{os.getcwd()}\\Attention is All You Need.pdf")
pages = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap = 20,
    length_function = len,
    separators= ["\n\n", "\n", ".", " "]
))
pages[1]

Document(page_content='convolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring signiﬁcantly', metadata={'source': 'c:\\Users\\ILLEGEAR\\OneDrive\\Desktop\\Personal Project\\RAG Projects\\PDF Chatbot\\Attention is All You Need.pdf', 'page': 0})
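
The `separators` list passed to the splitter is tried in order: coarse separators first, finer ones only for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is a simplified stand-in written for this note, not LangChain's actual implementation, and it ignores `chunk_overlap` entirely. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into pieces of at most `chunk_size` characters,
    preferring the earliest separator in `separators` that works."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every `chunk_size` characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces together
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa bbbb\n\ncccc dddd eeee"   # made-up text
print(recursive_split(sample, ["\n\n", "\n", ".", " "], chunk_size=10))
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split on spaces instead.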

In [22]:
# Advanced method - Split by chunk

# Source: https://github.com/deanmalmgren/textract/issues/241

filename = 'Attention is All You Need'

# Step 1: Convert PDF to text
doc = textract.process(f"{os.getcwd()}\\{filename}.pdf")

# Step 2: Save the raw text to .txt, reopen it for cleaning, then save the cleaned text to a second .txt (helps prevent issues)
with open(f'{filename}.txt', 'w', encoding='utf-8') as f:
  f.write(doc.decode('utf-8'))

## Step 2.1: Read the text file
with open(f'{filename}.txt', 'r', encoding='utf-8') as f:
  text = f.read()

display(Markdown(f'**<u>Before Clean:</u>**\n'))
print(f'{text}\n\n')

## Step 2.2: Collapse runs of three or more newlines (including lines that hold only punctuation) into one blank line. E.g. Convert "\n\n\n\n" -> "\n\n"
clean_text = re.sub(r'\n\s*[^\w\s]*\s*\n\s*[^\w\s]*\s*\n+', '\n\n', text)

## Step 2.3: Save the cleaned text into .txt again
with open(f'{filename} Cleaned.txt', 'w', encoding='utf-8') as f:
    f.write(clean_text)

display(Markdown(f'**<u>After Clean:</u>**\n'))
print(clean_text)

**<u>Before Clean:</u>**


arXiv:1706.03762v4 [cs.CL] 30 Jun 2017



Attention Is All You Need



Ashish Vaswani∗

Google Brain

avaswani@google.com

Llion Jones∗

Google Research

llion@google.com



Noam Shazeer∗

Google Brain

noam@google.com



Niki Parmar∗

Google Research

nikip@google.com



Aidan N. Gomez∗ †

University of Toronto

aidan@cs.toronto.edu



Jakob Uszkoreit∗

Google Research

usz@google.com



Łukasz Kaiser∗

Google Brain

lukaszkaiser@google.com



Illia Polosukhin∗ ‡

illia.polosukhin@gmail.com



Abstract

The dominant sequence transduction models are based on complex recurrent or

convolutional neural networks that include an encoder and a decoder. The best

performing models also connect the encoder and decoder through an attention

mechanism. We propose a new simple network architecture, the Transformer,

based solely on attention mechanisms, dispensing with recurrence and convolutions

entirely. Experiments on two machine translation tasks show these models to

be superior in quality

**<u>After Clean:</u>**


arXiv:1706.03762v4 [cs.CL] 30 Jun 2017

Attention Is All You Need

Ashish Vaswani∗

Google Brain

avaswani@google.com

Llion Jones∗

Google Research

llion@google.com

Noam Shazeer∗

Google Brain

noam@google.com

Niki Parmar∗

Google Research

nikip@google.com

Aidan N. Gomez∗ †

University of Toronto

aidan@cs.toronto.edu

Jakob Uszkoreit∗

Google Research

usz@google.com

Łukasz Kaiser∗

Google Brain

lukaszkaiser@google.com

Illia Polosukhin∗ ‡

illia.polosukhin@gmail.com

Abstract

The dominant sequence transduction models are based on complex recurrent or

convolutional neural networks that include an encoder and a decoder. The best

performing models also connect the encoder and decoder through an attention

mechanism. We propose a new simple network architecture, the Transformer,

based solely on attention mechanisms, dispensing with recurrence and convolutions

entirely. Experiments on two machine translation tasks show these models to

be superior in quality while being more 
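
To see what the cleaning pattern does in isolation, the sketch below applies the same regular expression to two short made-up strings. The constant name `BLANK_RUN` is introduced here for illustration only.

```python
import re

# Same pattern as in the cleaning cell above
BLANK_RUN = re.compile(r'\n\s*[^\w\s]*\s*\n\s*[^\w\s]*\s*\n+')

# Four newlines collapse to two; an existing double newline is left alone
sample = "Title\n\n\n\nAuthor\n\nAffiliation"
cleaned = BLANK_RUN.sub("\n\n", sample)
print(repr(cleaned))

# A line containing only punctuation is swallowed along with the blank run
with_bullet = "Heading\n-\n\nBody"
cleaned_bullet = BLANK_RUN.sub("\n\n", with_bullet)
print(repr(cleaned_bullet))
```

Because the pattern needs at least three newlines to match, text that is already separated by a single blank line passes through unchanged, which matches the before and after output shown above.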

In [23]:
# # with open(f'{filename}.txt', 'r') as f:
# #   clean_text = f.read()

# # Step 3: Create function to count tokens
# encoder = tiktoken.encoding_for_model("gpt-4") # tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# def count_tokens(text: str) -> int:
#     return len(encoder.encode(text))  # encode the argument, not the global clean_text

# Step 4: Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap = 20,
    length_function = len,
    separators= ["\n\n", "\n", ".", " "]
)
chunks = text_splitter.create_documents([clean_text]) # split_text(clean_text)

# Display first chunk
chunks[0]

Document(page_content='arXiv:1706.03762v4 [cs.CL] 30 Jun 2017\n\nAttention Is All You Need\n\nAshish Vaswani∗\n\nGoogle Brain\n\navaswani@google.com\n\nLlion Jones∗\n\nGoogle Research\n\nllion@google.com\n\nNoam Shazeer∗\n\nGoogle Brain\n\nnoam@google.com\n\nNiki Parmar∗\n\nGoogle Research\n\nnikip@google.com\n\nAidan N. Gomez∗ †\n\nUniversity of Toronto\n\naidan@cs.toronto.edu\n\nJakob Uszkoreit∗\n\nGoogle Research\n\nusz@google.com\n\nŁukasz Kaiser∗\n\nGoogle Brain\n\nlukaszkaiser@google.com\n\nIllia Polosukhin∗ ‡\n\nillia.polosukhin@gmail.com\n\nAbstract')
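
The `chunk_overlap` setting makes neighbouring chunks share some text, so a sentence cut at a boundary still appears whole in at least one chunk. The sketch below shows the idea with fixed-size character windows. This is a deliberate simplification: the splitter used above cuts at separators rather than at fixed offsets, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut `text` into windows of `chunk_size` characters,
    each starting `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # the last window already reached the end
        start += step
    return chunks

windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)
```

Each window begins with the final character of the previous one, which is the one-character overlap requested.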

In [24]:
# Result is a list of LangChain 'Documents' of about 512 characters or less (length_function=len counts characters, not tokens; the recursive splitter sometimes allows more to retain context)
display(Markdown(f'**<u>Type of _`chunks`_ variable:</u>**\n'))
print(f'{type(chunks)}\n')

if hasattr(chunks, "__iter__"):

  display(Markdown(f'**<u>Length of _`chunks`_ variable:</u>**\n'))
  print(f'{len(chunks)}\n')

  display(Markdown(f'**<u>Type of a single _`chunk`_ variable:</u>**\n'))
  print(f'{type(chunks[0])}\n')

display(Markdown(f'**<u>Single _`chunks`_ variable output:</u>**\n'))
print(f'{chunks[6]}\n')

**<u>Type of _`chunks`_ variable:</u>**


<class 'list'>



**<u>Length of _`chunks`_ variable:</u>**


84



**<u>Type of a single _`chunk`_ variable:</u>**


<class 'langchain_core.documents.base.Document'>



**<u>Single _`chunks`_ variable output:</u>**


page_content='computation [31], while also improving model performance in case of the latter. The fundamental\n\nconstraint of sequential computation, however, remains.\n\nAttention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in\n\nthe input or output sequences [2, 18]. In all but a few cases [26], however, such attention mechanisms\n\nare used in conjunction with a recurrent network.'



In [25]:
# Visualise the data

import random
# Quick data visualization to ensure chunking was successful

def num_tokens_from_string(string: str, model: str) -> int:
    """Returns the number of tokens in a text string."""
    encoder = tiktoken.encoding_for_model(model)
    num_tokens = len(encoder.encode(string))
    return num_tokens

# Create a list of token counts
token_counts = [num_tokens_from_string(chunk.page_content, "gpt-3.5-turbo") for chunk in chunks]
token_counts = [str(x) for x in token_counts]
df = pd.DataFrame({'Token Count': token_counts})

# Count the occurrences of each unique value in 'Token Count'
value_counts = df['Token Count'].value_counts()

# Define the first 8 colors
color_palette = ['Blue', 'Red', 'Green', 'Yellow', 'Purple', 'Orange', 'LightBlue', 'LightGreen']

# Generate colors for the remaining values
remaining_colors = ['rgb({},{},{})'.format(*np.random.randint(0, 256, size=3)) for _ in range(len(value_counts) - len(color_palette))]

# Concatenate the color palette and remaining colors
colors = color_palette + remaining_colors

# Create the hover template
hover_template = '<b>Token Numbers:</b> %{x}<br><b>Frequency:</b> %{y}'

# Create the bar chart
bar_chart = go.Bar(
    x=value_counts.index,
    y=value_counts.values,
    marker=dict(color=colors),
    hovertemplate=hover_template
)

fig = go.Figure(bar_chart)

# Add a legend with color and label for each value
for value, color in zip(value_counts.index, colors):
    fig.add_trace(go.Bar(name=value, marker=dict(color=color)))

fig.update_layout(
    title={'text': 'Bar Chart: Quantity of each Token Number', 'x': 0.5, 'xanchor': 'center'},
    xaxis_title='Tokens Numbers',
    yaxis_title='Quantity of Chunks (Sentences)',
    # legend_title='Values'
)

fig.show()
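
When only the distribution matters, the same frequency table can be produced without pandas or Plotly. The sketch below uses `collections.Counter` and prints a text histogram. The tokenizer here is a stand-in that counts whitespace-separated words, not real model tokens, and both `count_tokens_stub` and the sample data are hypothetical.

```python
from collections import Counter

def count_tokens_stub(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, NOT real model tokens."""
    return len(text.split())

# Made-up chunks for illustration
sample_chunks = ["the cat sat", "attention is all you need", "a b c"]

# Map each token count to the number of chunks that have it
frequency = Counter(count_tokens_stub(chunk) for chunk in sample_chunks)

# Keys stay numeric, so sorting is by value rather than by string
for n_tokens, n_chunks in sorted(frequency.items()):
    print(f"{n_tokens:>4} tokens: {'#' * n_chunks} ({n_chunks})")
```

In the notebook, `count_tokens_stub` would be swapped for `num_tokens_from_string` and `sample_chunks` for the `page_content` of each chunk.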

# 2. Embed text and store embeddings

In [26]:
# Tutorial: https://clemenssiebler.com/posts/chatting-private-data-langchain-azure-openai-service/, https://youtu.be/kvdVduIJsc8

# # Initialize gpt-35-turbo
# llm = AzureChatOpenAI(deployment_name="Test", 
#                       openai_api_version = os.getenv("OPENAI_API_VERSION"), 
#                       openai_api_key = os.getenv("OPENAI_API_KEY"), 
#                       openai_api_base = os.getenv("OPENAI_API_BASE"),
#                       openai_api_type = os.getenv("OPENAI_API_TYPE")
#                       )

# Initialize Gemini Pro model
llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key=os.getenv("GOOGLE_API_KEY"), temperature=0.5, convert_system_message_to_human=True)
llm

ChatGoogleGenerativeAI(model='gemini-pro', client= genai.GenerativeModel(
   model_name='models/gemini-pro',
   generation_config={}.
   safety_settings={}
), google_api_key=SecretStr('**********'), temperature=0.5, convert_system_message_to_human=True)

In [27]:
# Source: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#embeddings-models, https://github.com/hwchase17/langchain/issues/1560

# Get embedding model
# embeddings = OpenAIEmbeddings(deployment="Xpose_pdf", model="text-embedding-ada-002", chunk_size=1)
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", task_type="retrieval_query", google_api_key=os.getenv("GOOGLE_API_KEY"))

# Create vector database
db = FAISS.from_documents(documents=chunks, embedding=embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x2400fe0fd00>

# 3. Setup retrieval function

In [28]:
# Check similarity search is working
query = "Who created transformers?"
docs = db.similarity_search(query=query)
docs[0]

Document(page_content='Pdrop = 0.1.\n\n7\n\n\x0cTable 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\n\nEnglish-to-German and English-to-French newstest2014 tests at a fraction of the training cost.\n\nBLEU\n\nModel\n\nByteNet [17]\n\nDeep-Att + PosUnk [37]\n\nGNMT + RL [36]\n\nConvS2S [9]\n\nMoE [31]\n\nDeep-Att + PosUnk Ensemble [37]\n\nGNMT + RL Ensemble [36]\n\nConvS2S Ensemble [9]\n\nTransformer (base model)\n\nTransformer (big)\n\nEN-DE\n\n23.75\n\n24.6\n\n25.16\n\n26.03\n\n26.30\n\n26.36\n\n27.3\n\n28.4\n\nEN-FR\n\n39.2\n\n39.92')
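
Conceptually, a similarity search embeds the query, compares it with every stored chunk vector, and returns the closest chunks. The sketch below reproduces that flow with a toy embedding (word counts) and cosine similarity. It is an illustration under those assumptions only: the real notebook uses a learned embedding model and a FAISS index, and the names `embed`, `cosine` and `top_k` are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, documents: list, k: int = 2) -> list:
    """Return the `k` documents most similar to `query`."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]

# Made-up documents for illustration
documents = [
    "the transformer relies on attention mechanisms",
    "bleu scores on translation tasks",
    "recurrent networks process tokens sequentially",
]
print(top_k("what is attention", documents, k=1))
```

A word-count embedding only matches on shared words, whereas a learned embedding is meant to capture meaning, so results from the real index will differ.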

In [29]:
# Create QA chain to integrate similarity search with user queries (answer query from knowledge base)

chain = load_qa_chain(llm=llm, chain_type="stuff")

query = "Who created transformers?"
docs = db.similarity_search(query=query)

chain.run(input_documents=docs, question=query)

'The provided context does not contain any information about who created transformers, so I cannot answer this question from the provided context.'
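
The `"stuff"` chain type places all retrieved documents into a single prompt and sends it to the model in one call. The sketch below shows that assembly step in plain Python. The prompt wording and the function name `build_stuff_prompt` are hypothetical and are not the template LangChain ships.

```python
def build_stuff_prompt(question: str, documents: list) -> str:
    """Concatenate every retrieved document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up retrieved passages for illustration
retrieved = ["Passage one about attention.", "Passage two about BLEU scores."]
stuff_prompt = build_stuff_prompt("Who created transformers?", retrieved)
print(stuff_prompt)
```

Since the model sees only what is in the context block, an answer such as the one above is expected whenever the retrieved chunks do not contain the relevant passage.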

# 4. Create chatbot with chat memory (OPTIONAL) 

In [31]:
# Adapt if needed
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(
    """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone question:""")
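
To make the role of this prompt concrete, the sketch below fills the same template with a one-turn history and a follow-up question using plain `str.format`. The `Human:` / `Assistant:` layout of the history and the helper name `format_history` are assumptions made for this illustration, and the sample turn is made up.

```python
CONDENSE_TEMPLATE = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

def format_history(history: list) -> str:
    """Render (question, answer) tuples as alternating speaker lines (assumed layout)."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)

# Made-up conversation for illustration
history = [("What is the Transformer?", "A network architecture based solely on attention.")]
condense_prompt = CONDENSE_TEMPLATE.format(
    chat_history=format_history(history),
    question="Who proposed it?",
)
print(condense_prompt)
```

The model's reply to this prompt is the rewritten question, which no longer depends on the pronoun "it" and can be sent to the retriever on its own.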

In [32]:
# Create a conversation chain that uses our vector DB as the retriever; this also allows for chat history management
qa = ConversationalRetrievalChain.from_llm(llm=llm, 
                                           retriever=db.as_retriever(), 
                                           condense_question_prompt=CONDENSE_QUESTION_PROMPT, 
                                           return_source_documents=True,
                                           chain_type="stuff"
                                           )

In [None]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""
    
    if query.lower() == 'exit':
        print("Thank you for using the Transformers chatbot!")
        return
    
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    
    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> <font color="blue">{result["answer"]}</font>')) # #6495ED

print("Welcome to the Transformers chatbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Welcome to the Transformers chatbot! Type 'exit' to stop.



on_submit is deprecated. Instead, set the .continuous_update attribute to False and observe the value changing with: mywidget.observe(callback, 'value').



Text(value='', placeholder='Please enter your question:')

HTML(value='<b>User:</b> Hi, what is Transformers?')

HTML(value='<b><font color="blue">Chatbot:</font></b> <font color="blue">I cannot find the answer to your ques…

HTML(value='<b>User:</b> Who created Transformers?')

HTML(value='<b><font color="blue">Chatbot:</font></b> <font color="blue">The provided context does not mention…

Thank you for using the State of the Union chatbot!


In [36]:
# Display Chat History
chat_history

[('Hi, what is Transformers?',
  'I cannot find the answer to your question in the context you provided.'),
 ('Who created Transformers?',
  'The provided context does not mention the creator of Transformers, so I cannot answer this question from the provided context.')]
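
The chat memory in this section is just a list of `(question, answer)` tuples that is passed back to the chain on every call. The sketch below isolates that loop without widgets or a model. `fake_qa` is a hypothetical stand-in for the real chain and only reports how many earlier turns it received; `run_chat` is likewise a name introduced for this illustration.

```python
def run_chat(questions: list, qa_fn) -> list:
    """Feed questions one by one, passing the accumulated history on every call."""
    chat_history = []
    for question in questions:
        if question.lower() == "exit":
            break                                    # same stop word as the widget version
        result = qa_fn({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history

def fake_qa(inputs: dict) -> dict:
    """Hypothetical stand-in for the real chain."""
    return {"answer": f"(stub) seen {len(inputs['chat_history'])} earlier turn(s)"}

transcript = run_chat(["First question", "Second question", "exit", "ignored"], fake_qa)
for question, answer in transcript:
    print(f"User: {question}\nChatbot: {answer}")
```

Replacing `fake_qa` with the `qa` chain built above would give a console version of the same chatbot.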