# **Custom Knowledge ChatGPT with LangChain - Chat with PDFs**

**By Liam Ottley:**  [YouTube](https://youtube.com/@LiamOttley)





0.   Installs, Imports and API Keys
1.   Loading PDFs and chunking with LangChain
2.   Embedding text and storing embeddings
3.   Creating retrieval function
4.   Creating chatbot with chat memory (OPTIONAL)








# 0. Installs, Imports and API Keys

In [3]:
#!pip install -r requirements.txt

Collecting textract
  Using cached textract-1.6.5-py3-none-any.whl (23 kB)
Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting langchain
  Downloading langchain-0.0.304-py3-none-any.whl (1.7 MB)
Collecting ipywidgets
  Downloading ipywidgets-8.1.1-py3-none-any.whl (139 kB)
Collecting argcomplete~=1.10.0
  Using cached argcomplete-1.10.3-py2.py3-none-any.whl (36 kB)
Collecting beautifulsoup4~=4.8.0
  Using cached beautifulsoup4-4.8.2-py3-none-any.whl (106 kB)
Collecting chardet==3.*
  Using cached chardet-3.0.4-py2.py3-none-any.w

In [6]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
import textract


In [8]:
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # placeholder: paste your own key here and never commit a real one

# openai.api_key  = os.environ['OPENAI_API_KEY']
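
A key pasted into a cell is easy to leak when the notebook is shared. As a minimal sketch of an alternative, assuming the key was exported in the shell before Jupyter started, the cell can read it from the environment and fail early if it is missing. `require_api_key` is a hypothetical helper written for this example, not part of LangChain or the OpenAI client.

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY", env=None) -> str:
    """Return the key stored under `name`, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return key
```

Calling `require_api_key()` at the top of the notebook would surface a missing key immediately, rather than at the first embedding call.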

# 1. Loading PDFs and chunking with LangChain

In [11]:
# You MUST add your PDF to local files in this notebook (folder icon on left hand side of screen)
pdf_path = "./Dupre_economy_as_science.pdf"

# Simple method - Split by pages
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
print(pages[0])

# SKIP TO STEP 2 IF YOU'RE USING THIS METHOD
chunks = pages

page_content='MIDWEST STUDIES IN PHILOSOPHY, XVIII (1993) \nCould There Be a Science of Economics? \nJOHN DUPa \nuch scientific thinking and thinking about science involves the assump- M tion that there is a deep and pervasive order to the world that it is the \nbusiness of science to disclose. A paradigmatic statement of such a view can be \nfound in a widely discussed paper by a prominent economist, Milton Friedman \n(a paper which will be discussed in more detail shortly): \nA fundamental hypothesis of science is that appearances are deceptive and \nthat there is a way of looking at or interpreting or organizing the evidence \nthat will reveal superficially disconnected and diverse phenomena to be \nmanifestations of a more fundamental and relatively simple structure. \n(195311984. 231) \nOn the other hand, the person sometimes described as the father of modern \nscience, Francis Bacon, wrote: \nThe human understanding is of its own nature prone to suppose the \nexistence of more or

In [10]:
# Advanced method - Split by chunk

# Step 1: Convert PDF to text
doc = textract.process(pdf_path)

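# Step 2: Save to .txt and reopen (helps prevent issues)
# Write to a separate .txt file; opening pdf_path with 'w' would overwrite the PDF itself
txt_path = os.path.splitext(pdf_path)[0] + ".txt"
with open(txt_path, 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open(txt_path, 'r', encoding='utf-8') as f:
    text = f.read()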

# Step 3: Create function to count tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Step 4: Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 512,
    chunk_overlap  = 24,
    length_function = count_tokens,
)

chunks = text_splitter.create_documents([text])

KeyboardInterrupt: 
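
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch over a list of tokens. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter recursively tries separators such as paragraphs and lines before falling back to smaller units); `chunk_with_overlap` is a hypothetical stand-in that only illustrates the size and overlap arithmetic.

```python
def chunk_with_overlap(tokens, chunk_size=512, chunk_overlap=24):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Toy input: whitespace "tokens" instead of GPT-2 tokens
words = "a b c d e f g h i j".split()
demo_chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
```

With these toy settings each chunk starts on the last token of the previous one, which is the same idea as the 24-token overlap used in the cell above.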

In [12]:
# Result is many LangChain 'Documents' of around 500 tokens or fewer (the recursive splitter sometimes allows more tokens to retain context)
type(chunks[0])

langchain.schema.document.Document

In [13]:
# Quick data visualization to ensure chunking was successful

# Create a list of token counts
token_counts = [count_tokens(chunk.page_content) for chunk in chunks]

# Create a DataFrame from the token counts
df = pd.DataFrame({'Token Count': token_counts})

# Create a histogram of the token count distribution
df.hist(bins=40)

# Show the plot
plt.show()

NameError: name 'count_tokens' is not defined

# 2. Embed text and store embeddings

In [25]:
# Get embedding model
embeddings = OpenAIEmbeddings()

# Create vector database
db = FAISS.from_documents(chunks, embeddings)

# 3. Set up retrieval function

In [19]:
# Check similarity search is working
query = "does dupre agree with friedman?"
docs = db.similarity_search(query)
docs[0]

Document(page_content='366 JOHN DUPRE \nit is striking how little he says in this article to justify complacence about \nthe predictive powers of economics. Friedman compares, for example, the \nhypothesis that businessmen act as if they aimed to maximize profits with \nthe hypothesis that expert billiard players act as if they were able to make \nall the appropriate mathematical calculations of the trajectory of a billiard ball. \nThe value of this latter hypothesis is sufficiently demonstrated by the successful \nshots of the expert billiard player, and the psychology of neither the billiard- \nplayer nor the businessman is of any relevance to the evaluation of either \nhypothesis. But remarkably, rather than offer any empirical evidence that, just \nas expert billiard players generally make their shots, businessmen, whatever \nthey may be intending, do in fact maximize profits, Friedman offers nothing but \na broken-backed a priori argument for this conclusion. “[Ulnless the behavio
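
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with toy 2-D vectors and plain Python, assuming Euclidean (L2) distance, which I believe is the default for LangChain's FAISS wrapper. `top_k` is a hypothetical helper, and real embeddings have far more dimensions.

```python
import math

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec by Euclidean distance."""
    scored = [(math.dist(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort()  # smallest distance first
    return [index for _, index in scored[:k]]

# Toy "embeddings" for four chunks (made-up numbers, for illustration only)
toy_vectors = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
nearest = top_k((1.0, 0.0), toy_vectors, k=2)
```

Here `nearest` holds the positions of the two chunks closest to the query; a vector store does the same lookup, just with an index built for speed.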

In [20]:
# Create QA chain to integrate similarity search with user queries (answer query from knowledge base)

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

query = "Who created transformers?"
docs = db.similarity_search(query)

chain.run(input_documents=docs, question=query)

" I don't know."

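The "stuff" chain type places all retrieved documents into a single prompt together with the question. A rough sketch of that assembly is below; `build_stuff_prompt` is a hypothetical helper and the instruction wording is an assumption for illustration, not LangChain's actual template. It also suggests why the chain answers "I don't know" here: the PDF is about economics, so the retrieved context says nothing about who created transformers.

```python
def build_stuff_prompt(passages, question):
    """Concatenate all retrieved passages into one context block and append the question."""
    context = "\n\n".join(passages)
    # Assumed wording; the real LangChain prompt template differs.
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(["Passage one.", "Passage two."], "Who created transformers?")
```

Because every passage goes into one prompt, this approach only works while the combined chunks fit in the model's context window.
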
# 4. Create chatbot with chat memory (OPTIONAL)

In [21]:
from IPython.display import display
import ipywidgets as widgets

# Create a conversation chain that uses our vector DB as the retriever; this also allows for chat history management
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

In [22]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'exit':
        print("Thank you for using the Transformers chatbot!")
        return

    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome to the Transformers chatbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Welcome to the Transformers chatbot! Type 'exit' to stop.



Text(value='', placeholder='Please enter your question:')

HTML(value='<b>User:</b> summarize the given text')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The text discusses the possibility of economics being a…

HTML(value='<b>User:</b> give me the main arguments for his claims')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The author argues that economics provides an example of…

HTML(value='<b>User:</b> what, according to the author, would be genuinely scientific?')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The author considers the identification of causal influ…

HTML(value='<b>User:</b> how would the author define science?')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The author\'s definition of science is that it involves…
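
Follow-up questions like "give me the main arguments for his claims" only make sense alongside the earlier turns. As I understand it, a conversational retrieval chain first rewrites the follow-up into a standalone question using the chat history, then retrieves with that rewritten question. The sketch below shows how the `(question, answer)` tuples in `chat_history` could be turned into such a rewriting prompt; both helpers and the prompt wording are hypothetical, not LangChain's own.

```python
def format_history(chat_history):
    """Render (user, bot) tuples as alternating Human/Assistant lines."""
    lines = []
    for user_msg, bot_msg in chat_history:
        lines.append(f"Human: {user_msg}")
        lines.append(f"Assistant: {bot_msg}")
    return "\n".join(lines)

def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking the model to rewrite follow_up as a standalone question."""
    # Assumed wording, for illustration only.
    return (
        "Rephrase the follow-up question as a standalone question.\n\n"
        f"Chat history:\n{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )

history = [("first question", "first answer")]
condense_prompt = build_condense_prompt(history, "give me the main arguments for his claims")
```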

In [24]:
import random
import gradio as gr

def random_response(message, history):
    return random.choice(["Yes", "No"])

demo = gr.ChatInterface(random_response)

demo.launch()


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
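
The demo above returns random answers. A sketch of how the retrieval chain could be plugged in instead is below. It assumes Gradio passes `history` as a list of `[user, bot]` pairs (newer Gradio versions may pass message dictionaries instead) and that the chain accepts a dict with `question` and `chat_history` and returns a dict with `answer`, as `qa` does earlier in this notebook. `make_chat_fn` and `fake_chain` are hypothetical names; `fake_chain` stands in for `qa` so the example runs without an API key.

```python
def make_chat_fn(qa_chain):
    """Wrap a question-answering chain in the (message, history) signature a chat UI expects."""
    def respond(message, history):
        # Convert the UI's [user, bot] pairs into the tuples the chain expects
        chat_history = [(user, bot) for user, bot in history]
        result = qa_chain({"question": message, "chat_history": chat_history})
        return result["answer"]
    return respond

def fake_chain(inputs):
    """Stand-in for the real chain: reports how much history it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"{turns} earlier turns; you asked: {inputs['question']}"}

respond = make_chat_fn(fake_chain)
reply = respond("hello", [["hi", "hey"]])
```

Under those assumptions, `gr.ChatInterface(make_chat_fn(qa)).launch()` would serve the PDF chatbot in place of `random_response`.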


