# Conversations with LangChain Chatbot

> Ming Zhao
> 
> Dec, 2023

The goal of this project is to develop a chatbot capable of engaging in interactive conversations, remembering previous interactions, and generating responses based on the referenced documents.

### Contents

1. Document Loading    
2. Document Splitting      
3. Embedding   
4. Retrieval
5. Question Answering
6. Conversational Chat
7. Create A Chatbot 

In [1]:
# ! pip install langchain
# ! pip install pypdf
# ! pip install chromadb
# ! pip install -U langchain-openai
# ! pip install lark
# ! pip install "langchain[docarray]"

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import DocArrayInMemorySearch

## 1. Document Loading

In [4]:
# load documents
# loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
# pages = loader.load()

In [5]:
# load PDF
loaders = [
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture02.pdf"),
    PyPDFLoader("MachineLearning-Lecture03.pdf"),
    PyPDFLoader("MachineLearning-Lecture04.pdf")
]
pages = []
for loader in loaders:
    pages.extend(loader.load())

In [6]:
len(pages)

76

In [7]:
print(pages[0].page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [8]:
for page in pages[20:24]:
    print(page.metadata)

{'source': 'MachineLearning-Lecture01.pdf', 'page': 20}
{'source': 'MachineLearning-Lecture01.pdf', 'page': 21}
{'source': 'MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'MachineLearning-Lecture02.pdf', 'page': 1}


## 2. Document Splitting

In [9]:
# split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=150)

docs = text_splitter.split_documents(pages)

In [10]:
len(docs)

282

In [11]:
docs[281]

Document(page_content='[End of Audio] \nDuration: 76 minutes', metadata={'source': 'MachineLearning-Lecture04.pdf', 'page': 19})
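
The effect of `chunk_size` and `chunk_overlap` can be pictured with a simplified fixed-width splitter. This is only a sketch: `split_with_overlap` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` additionally tries to cut on separators such as paragraph and line breaks before falling back to raw character positions.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


toy_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.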

## 3. Embedding

In [12]:
OPENAI_API_KEY = "OPENAI_API_KEY"  # placeholder: replace with your own OpenAI API key

In [13]:
# define embedding
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [14]:
# store vectors
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embeddings)

In [15]:
print(vectordb._collection.count())

282


In [16]:
question1 = "is there an email i can ask for help"
ans1 = vectordb.similarity_search(question1, k=3) # len(ans1) == 3
print(ans1[0].page_content[:500])
print("-"*50)
print(ans1[1].page_content[:500])

newsgroup that's sort of a forum for people in  the class to get to  know each other and 
have whatever discussions you want to ha ve amongst yourselves. So the class newsgroup 
will not be monitored by the TAs and me. But this is a place for you to form study groups 
or find project partners or discuss homework problems and so on, and it's not monitored 
by the TAs and me. So feel free to ta lk trash about this class there.  
If you want to contact the teaching staff, pl ease use the email addr
--------------------------------------------------
sort of just fishing around for ideas or l ooking for ideas of projects to do, please, be 
strongly encouraged to come to  my office hours on Friday mornings, or go to any of the 
TA’s office hours to tell us a bout your project ideas, and we can help brainstorm with 
you.  
I also have a list of project ideas that I so rt of collected from my colleagues and from 
various senior PhD students working with me  or with other professors. And so if 

In [17]:
question2 = "what did they say about matlab?"
ans2 = vectordb.similarity_search(question2, k=3)
print(ans2[0].page_content[:500])
print("-"*50)
print(ans2[1].page_content[:500])

everything.  
So actually I, well, so yeah, just a side comment for those of you that haven't seen 
MATLAB before I guess, once a colleague of mine at a different university, not at 
Stanford, actually teaches another machine l earning course. He's taught it for many years. 
So one day, he was in his office, and an old student of his from, lik e, ten years ago came 
into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from
--------------------------------------------------
those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. An

In [18]:
question3 = "what did they say about regression in the third lecture?"
ans3 = vectordb.similarity_search(question3, k=5)
print(ans3[0].page_content[:500])
print("-"*50)
print(ans3[1].page_content[:500])

MachineLearning-Lecture03  
Instructor (Andrew Ng) :Okay. Good morning and welcome b ack to the third lecture of 
this class. So here’s what I want to do t oday, and some of the topics I do today may seem 
a little bit like I’m jumping, sort  of, from topic to topic, but here’s, sort of, the outline for 
today and the illogical flow of ideas. In the last lecture, we  talked about linear regression 
and today I want to talk about sort of an  adaptation of that called locally weighted 
regression.
--------------------------------------------------
Instructor (Andrew Ng) :All right, so who thought driving could be that dramatic, right? 
Switch back to the chalkboard, please. I s hould say, this work was done about 15 years 
ago and autonomous driving has come a long way. So many of you will have heard of the 
DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, the winning 
team's drive a car across a desert by itself.  
So Alvin was, I think, absolutely amazing wo rk for i

## 4. Retrieval

**Addressing Diversity: Maximum Marginal Relevance**

`Maximum Marginal Relevance` strives to achieve both *relevance* to the query and *diversity* among the results.
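
The trade-off can be sketched in plain Python on toy 2-D vectors. This is a simplified illustration of the greedy selection idea, not the code Chroma runs; the function names and the example vectors are made up for the demonstration, and `lambda_mult` is named after the LangChain parameter that weights relevance against diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of k vectors chosen greedily by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise similarity to anything that was already picked.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
toy_vectors = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # the first two are near-duplicates

print(mmr(toy_query, toy_vectors, k=2))                   # [0, 2]
print(mmr(toy_query, toy_vectors, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the score reduces to plain similarity and both near-duplicates are returned; at `0.5` the second pick skips the duplicate in favour of the less similar but still relevant vector.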

In [19]:
ans2_mmr = vectordb.max_marginal_relevance_search(question2, k=3)
print(ans2_mmr[0].page_content[:500])
print("-"*50)
print(ans2_mmr[1].page_content[:500])

everything.  
So actually I, well, so yeah, just a side comment for those of you that haven't seen 
MATLAB before I guess, once a colleague of mine at a different university, not at 
Stanford, actually teaches another machine l earning course. He's taught it for many years. 
So one day, he was in his office, and an old student of his from, lik e, ten years ago came 
into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from
--------------------------------------------------
those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. An

**Addressing Specificity: Working with Metadata**

`Metadata` provides context for each embedded chunk.

In [20]:
ans3_meta = vectordb.similarity_search(
    question3, k=3,
    filter={"source":"MachineLearning-Lecture03.pdf"})

for d in ans3_meta:
    print(d.metadata)

{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 13, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'MachineLearning-Lecture03.pdf'}


**Addressing Specificity: Working with Metadata Using Self-query Retriever**

`SelfQueryRetriever` uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
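
The two outputs can be illustrated with a toy parser. In this sketch a regular expression stands in for the LLM (an assumption made so the example runs offline), `toy_self_query` is a hypothetical helper, and only the lecture ordinals "first" to "fourth" are recognised; the real retriever lets the model build the filter from the attribute descriptions.

```python
import re


def toy_self_query(question):
    """Split a question into a search string and a metadata filter."""
    ordinals = {"first": 1, "second": 2, "third": 3, "fourth": 4}
    match = re.search(r"\bin the (first|second|third|fourth) lecture\b", question)
    if match is None:
        return question, None  # nothing to filter on
    number = ordinals[match.group(1)]
    # Remove the filter phrase so only the semantic part is embedded.
    remainder = question[:match.start()] + question[match.end():]
    query = re.sub(r"\s+", " ", remainder).strip(" ?")
    metadata_filter = {"source": f"MachineLearning-Lecture{number:02d}.pdf"}
    return query, metadata_filter


print(toy_self_query("what did they say about regression in the third lecture?"))
# ('what did they say about regression', {'source': 'MachineLearning-Lecture03.pdf'})
```

The resulting dictionary has the same shape as the `filter` argument passed by hand to `similarity_search` in the previous cell.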

In [21]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of  \
                    `MachineLearning-Lecture01.pdf`,  \
                    `MachineLearning-Lecture02.pdf`,  \
                    `MachineLearning-Lecture03.pdf`,  \
                    `MachineLearning-Lecture04.pdf`", 
        type="string"),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer")]

document_content_description = "Lecture notes"

In [22]:
llm = OpenAI(model='gpt-3.5-turbo-instruct', 
             temperature=0,
             openai_api_key=OPENAI_API_KEY)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True)

In [23]:
ans3_self = retriever.get_relevant_documents(question3)

In [24]:
for d in ans3_self:
    print(d.metadata)

{'page': 2, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 1, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'MachineLearning-Lecture03.pdf'}


**Additional Tricks: Compression**

Information most relevant to a query may be buried in a document with a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this.
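
The idea can be sketched without an LLM. The example below keeps only the sentences that share a keyword with the query; the keyword match is a crude stand-in for the extraction that `LLMChainExtractor` delegates to a model, and `toy_compress`, its stop-word list, and the sample text are all made up for the illustration.

```python
import re


def toy_compress(document, query):
    """Keep only the sentences that share a non-trivial word with the query."""
    stopwords = {"what", "did", "they", "say", "about", "the", "a", "is", "of"}
    keywords = {w for w in re.findall(r"\w+", query.lower()) if w not in stopwords}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


toy_document = (
    "The class newsgroup is not monitored. "
    "Homework will be done in MATLAB or Octave. "
    "Office hours are on Friday mornings."
)
print(toy_compress(toy_document, "what did they say about matlab?"))
# Homework will be done in MATLAB or Octave.
```

Only the compressed text would then be passed to the answering model, which is where the savings in prompt length come from.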

In [25]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever())

In [26]:
ans2_comp = compression_retriever.get_relevant_documents(question2)

for d in ans2_comp:
    print(d.metadata)

{'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 9, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}


In [27]:
print(ans2_comp[0])
print("-"*50)
print(ans2_comp[1])
print("-"*50)
print(ans2_comp[2])
print("-"*50)
print(ans2_comp[3])

page_content='- "So actually I, well, so yeah, just a side comment for those of you that haven\'t seen MATLAB before I guess"\n- "So one day, he was in his office, and an old student of his from, lik e, ten years ago came into his office and he said, "Oh, professo r, professor, thank you so much for your machine learning class."\n- "I learned so much from it. There\'s this stuff that I learned in your class, and I now use every day. And it\'s help ed me make lots of money, and here\'s a picture of my big house."\n- "So my friend was very excited. He said, "W ow. That\'s great. I\'m glad to hear this machine learning stuff was actually useful. So what was it that you learned? Was it logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you learned that was so helpful?" And the student said, "Oh, it was the MATLAB."' metadata={'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
--------------------------------------------------
page_content='MATLAB, Octave, f

**Combining Various Techniques**

In [28]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr"))

In [29]:
ans2_comb = compression_retriever.get_relevant_documents(question2)

for d in ans2_comb:
    print(d.metadata)

{'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
{'page': 7, 'source': 'MachineLearning-Lecture01.pdf'}


In [30]:
print(ans2_comb[0])
print("-"*50)
print(ans2_comb[1])
print("-"*50)
print(ans2_comb[2])

page_content='- "So actually I, well, so yeah, just a side comment for those of you that haven\'t seen MATLAB before I guess"\n- "So one day, he was in his office, and an old student of his from, lik e, ten years ago came into his office and he said, "Oh, professo r, professor, thank you so much for your machine learning class."\n- "I learned so much from it. There\'s this stuff that I learned in your class, and I now use every day. And it\'s help ed me make lots of money, and here\'s a picture of my big house."\n- "So my friend was very excited. He said, "W ow. That\'s great. I\'m glad to hear this machine learning stuff was actually useful. So what was it that you learned? Was it logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you learned that was so helpful?" And the student said, "Oh, it was the MATLAB."' metadata={'page': 8, 'source': 'MachineLearning-Lecture01.pdf'}
--------------------------------------------------
page_content='MATLAB, Octave, f

**Other Types of Retrieval**

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
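
As a rough picture of what a TF-IDF retriever does, the sketch below ranks a few toy sentences against a query using term counts weighted by a smoothed inverse document frequency. It is a minimal standard-library illustration, not the implementation behind `TFIDFRetriever`; `tokenize`, `tfidf_rank`, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, texts):
    """Return text indices sorted from most to least similar to the query."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(counts)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + freq)) + 1 for term, freq in df.items()}

    def vectorize(c):
        return {t: tf * idf[t] for t, tf in c.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(Counter(tokenize(query)))
    scores = [cosine(q, vectorize(c)) for c in counts]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


toy_texts = [
    "homework will be done in matlab or octave",
    "office hours are on friday mornings",
    "the class newsgroup is not monitored",
]
ranking = tfidf_rank("which days are office hours", toy_texts)
print(ranking[0])  # 1
```

Unlike the embedding search used earlier, this ranking depends on exact word overlap, so it needs no API calls but cannot match paraphrases.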

In [31]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever

In [32]:
# load PDF
# loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
# pages = loader.load()
# all_page_text=[p.page_content for p in pages]
# joined_page_text=" ".join(all_page_text)

# split
# text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
# splits = text_splitter.split_text(joined_page_text)

# Retrieve
# svm_retriever = SVMRetriever.from_texts(splits,embeddings)
# tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [33]:
# question = "What are major topics for this class?"
# docs_svm=svm_retriever.get_relevant_documents(question)
# docs_svm[0]

In [34]:
# question = "What are major topics for this class?"
# docs_tfidf=tfidf_retriever.get_relevant_documents(question)
# docs_tfidf[0]

## 5. Question Answering

**RetrievalQA Chain**

In [35]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", 
                 temperature=0,
                 openai_api_key=OPENAI_API_KEY)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever())

In [58]:
question = "What are major topics for this class?"
result = qa_chain.invoke(question)
print(result["result"])

The major topics for this class are machine learning and its extensions. Thanks for asking!


**Prompt**

In [37]:
# build prompt
template = """Use the following pieces of context to answer the question at the end. 
              If you don't know the answer, just say that you don't know, don't try to make up an answer. 
              Use three sentences maximum. Keep the answer as concise as possible. 
              Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [38]:
# run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})

In [39]:
question = "Is probability a class topic?"
result = qa_chain.invoke(question)
print(result["result"])

Yes, probability is a topic in this class. Thanks for asking!


In [40]:
result["source_documents"][0]

Document(page_content="of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what ra ndom variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a while \nsince you've seen some of this material. At some of the discussion sections, we'll actually \ngo over some of the prerequisites, sort of as  a refresher course under prerequisite class. \nI'll say a bit more about that later as well.  \nLastly, I also assume familiarity with basi c linear algebra. And again, most undergraduate \nlinear algebra courses are more than enough. So if you've taken courses like Math 51,", metadata={'page':

**RetrievalQA Chain Types**

- `map_reduce`
- `refine`
- `map_rerank`
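
The chain types differ in how the retrieved chunks reach the model. The sketch below contrasts the default "stuff" strategy with map-reduce and refine by counting model calls. `fake_llm` is a hypothetical stand-in that only records prompts, and the prompt wording is invented, so treat this as an outline of the control flow rather than LangChain's actual prompts. Map-rerank, which scores each per-chunk answer and returns the best one, is left out.

```python
calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for a chat model; it only records the prompt."""
    calls.append(prompt)
    return f"answer{len(calls)}"


def stuff(question, chunks):
    # One call: every chunk is placed in a single prompt.
    return fake_llm(f"Context: {' '.join(chunks)}\nQuestion: {question}")


def map_reduce(question, chunks):
    # Map: answer from each chunk independently; reduce: merge the partial answers.
    partial = [fake_llm(f"Context: {c}\nQuestion: {question}") for c in chunks]
    return fake_llm(f"Combine: {' | '.join(partial)}\nQuestion: {question}")


def refine(question, chunks):
    # Answer from the first chunk, then revise that answer with each later chunk.
    answer = fake_llm(f"Context: {chunks[0]}\nQuestion: {question}")
    for c in chunks[1:]:
        answer = fake_llm(f"Existing answer: {answer}\nNew context: {c}\nQuestion: {question}")
    return answer


toy_chunks = ["chunk A", "chunk B", "chunk C"]
for strategy in (stuff, map_reduce, refine):
    calls.clear()
    strategy("Is probability a class topic?", toy_chunks)
    print(strategy.__name__, len(calls))
# stuff 1
# map_reduce 4
# refine 3
```

The extra calls are the price of handling more chunks than fit in one prompt: map-reduce calls can run independently of each other, while refine must run in sequence because each step depends on the previous answer.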

In [41]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce")

In [42]:
result = qa_chain_mr.invoke(question)
print(result["result"])

Yes, probability is a class topic.


In [43]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine")

In [44]:
result = qa_chain_mr.invoke(question)
print(result["result"])

Yes, probability is a class topic.


### RetrievalQA Limitations

#### QA fails to preserve conversational history

In [45]:
question = "Is probability a class topic?"
result = qa_chain.invoke(question)
print(result["result"])

Yes, probability is a topic in this class. Thanks for asking!


In [46]:
question = "why are those prerequisites needed?"
result = qa_chain.invoke(question)
print(result["result"])

The prerequisites are needed because the class assumes that students have basic knowledge of computer science, computer skills, and programming skills. This knowledge is necessary to understand and apply the concepts taught in the class. Thanks for asking!


## 6. Conversational Chat

In [47]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", 
                 temperature=0,
                 openai_api_key=OPENAI_API_KEY)

In [48]:
llm.invoke("Hello world!")

AIMessage(content='Hello! How can I assist you today?')

### Memory

In [49]:
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    input_key='question', 
    output_key='answer')

### Conversational Retrieval Chain

In [50]:
retriever=vectordb.as_retriever()

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    return_generated_question=True)

In [51]:
question = "Is probability a class topic?"
result = qa.invoke({"question": question})
print(result['answer'])

Yes, probability is a topic covered in the class. The instructor assumes familiarity with basic probability and statistics. Additionally, there is a discussion section dedicated to reviewing probability for those who need a refresher.


In [52]:
question = "why are those prerequisites needed?"
result = qa.invoke({"question": question})
print(result['answer'])

Familiarity with basic probability and statistics is assumed because these concepts are fundamental to understanding and applying machine learning algorithms. Probability and statistics provide the foundation for understanding uncertainty, making predictions, and evaluating the performance of models. 

The discussion section dedicated to reviewing probability is offered for those who may not have a strong background in probability and statistics or who may need a refresher. This is to ensure that all students have the necessary prerequisite knowledge to fully understand and engage with the material covered in the class. The review session aims to provide a solid foundation in probability concepts before diving into more advanced topics in machine learning.
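
The follow-up works here because the conversational chain first rewrites it, together with the stored chat history, into a standalone question, and only then runs retrieval. The sketch below outlines that condensing step. `condense` and `fake_llm` are hypothetical, the prompt wording is invented, and the canned reply stands in for a real model call.

```python
def condense(chat_history, follow_up, llm):
    """Ask the model to rewrite a follow-up as a standalone question."""
    if not chat_history:
        return follow_up  # first turn: there is nothing to resolve
    transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    prompt = (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n"
        f"{transcript}\nFollow-up: {follow_up}\nStandalone question:"
    )
    return llm(prompt)


def fake_llm(prompt):
    # Hypothetical canned model so the example runs offline.
    return "Why are basic probability and statistics prerequisites for the class?"


history = []
first = condense(history, "Is probability a class topic?", fake_llm)
history.append((first, "Yes, basic probability and statistics are assumed."))
second = condense(history, "why are those prerequisites needed?", fake_llm)

print(first)   # Is probability a class topic?
print(second)  # Why are basic probability and statistics prerequisites for the class?
```

In the notebook, the rewritten question is available as `result['generated_question']` because the chain was built with `return_generated_question=True`.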


## 7. Create A Chatbot

In [53]:
def load_db(files, chain_type="map_reduce", temperature=0, k=2):
    '''
    Build a conversational retrieval chain over one or more PDF files.

    files: a list of file names or a single file name
    chain_type: stuff, map_reduce, refine, or map_rerank
    temperature: controls the randomness of the text generated by the OpenAI API.
                 A higher temperature will produce more varied text,
                 while a lower temperature will produce more predictable text.
    k: number of chunks the retriever returns for each question
    '''

    # accept a single file name as well as a list
    if isinstance(files, str):
        files = [files]

    # load documents
    pages = []
    for file in files:
        pages.extend(PyPDFLoader(file).load())

    # split documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(pages)

    # define embedding (uses the OPENAI_API_KEY set in section 3)
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

    # store vectors in memory
    db = Chroma.from_documents(documents=docs, embedding=embeddings)
    # alternative store: db = DocArrayInMemorySearch.from_documents(docs, embeddings)

    # define retriever: MMR search over the new store, followed by compression
    # (built from db, not the notebook-level vectordb, so the function is self-contained)
    compressor = LLMChainExtractor.from_llm(
            OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY))
    retriever = ContextualCompressionRetriever(
            base_compressor=compressor,
            base_retriever=db.as_retriever(search_type="mmr", search_kwargs={"k": k}))
    # alternative: retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})

    # define memory
    memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            input_key='question',
            output_key='answer')

    # define LLM
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo", temperature=temperature)

    # create a chatbot chain
    qa = ConversationalRetrievalChain.from_llm(
            llm,
            chain_type=chain_type,
            retriever=retriever,
            memory=memory,
            return_source_documents=True,
            return_generated_question=True)

    return qa

In [54]:
qa = load_db(["MachineLearning-Lecture01.pdf", "MachineLearning-Lecture02.pdf",
              "MachineLearning-Lecture03.pdf", "MachineLearning-Lecture04.pdf"])

In [59]:
question = "Is this class useful?"
result = qa.invoke(question)

In [60]:
print(result['answer'])
# print(result['generated_question'])
# print(result['source_documents'])

Yes, based on the given information, the class is described as being useful for doing many things and the things learned in the class will be useful no matter what the student ends up doing later in life.
