Chatbot Memory: Retrieval Augmented Generation (RAG) Chain | LangChain | Python | Ask PDF Documents
https://www.youtube.com/watch?v=PtO44wwqi0M

In [71]:
!pip install pypdf



In [72]:
!pip install langchain_community



In [1]:
from langchain_community.document_loaders import PyPDFLoader

Load the PDF from the documents folder

In [2]:
# loader = PyPDFLoader("documents/JTBD_Book.pdf")
# loader = PyPDFLoader("documents/Case_Study_FIH.pdf")
# loader = PyPDFLoader("documents/3_javabook.pdf")
loader = PyPDFLoader("documents/PHD_PROGRAMME.pdf")

Load the PDF with pypdf into the "pages" variable.
Each page is stored as a separate chunk, and its page number is kept in the metadata.

In [3]:
pages = loader.load_and_split()
pages  # display the loaded pages without parsing the PDF a second time
# print(pages)

[Document(metadata={'source': 'documents/PHD_PROGRAMME.pdf', 'page': 0}, page_content='1  \n \n \n \nRegulations  for \nDoctor of Philosophy   \n2024 \n \n \n(As per UGC  Regulations  2022 )'),
 Document(metadata={'source': 'documents/PHD_PROGRAMME.pdf', 'page': 1}, page_content='2 NATIONAL INSTITUTE OF  \nTECHNICAL TEACHERS TRAINING AND RESEARCH  \nCHENNAI  \nTable of Contents  \nDEFINITIONS  AND  NOMENCLATURE  3 \n1. GENERAL  ELIGIBILITY  1 \n2. EDUCATIONAL  QUALIFICATIONS  1 \n3. Ph.D. PROGRAMME  2 \n3.1 Full-time Ph.D. Programme  2 \n3.2 Part-Time Ph.D. Programme  3 \n3.3 Executive Ph.D. Programme  3 \n3.4 Change of Category  3 \n4. DURATION OF THE PROGRAMME  3 \n5. ADMISSION  4 \n6. SUPERVISOR RECOGNITION  5 \n7. CHANGE OF SUPERVISOR  6 \n8. NUMBER OF SCHOLARS  8 \n9. COURSE WORKS  8 \n10. RESEARCH ADVISORY COMMITTEE  10 \n11. EVALUATION AND ASSESSMENT METHODS  11 \n12. SUBMISSION OF SYNOPSIS  11 \n13. SUBMISSION OF THESIS  12 \n14. THESIS EVALUATION  13 \n15. ORAL EXAMINATION  15

In [76]:
#pages[8:30]

In [77]:
# for i in range (20):
#     print(pages[i].metadata)

Each page of the PDF is still quite long, so we break the pages into smaller pieces.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(pages)

In [5]:
print(f"{len(pages)} vs {len(documents)}")

22 vs 60
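
The count grows from 22 pages to 60 chunks because each page is cut into pieces of at most 1000 characters, with neighbouring pieces sharing up to 20 characters. The idea can be sketched in plain Python, assuming a simple fixed-width window. This is a simplification: RecursiveCharacterTextSplitter prefers to cut at paragraph, line and word boundaries, so its chunks will differ. `split_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see; the notebook uses 1000 and 20.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence that straddles a cut still appears whole, or nearly whole, in at least one chunk.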


Load the Ollama Embeddings to convert each chunk of text to numeric vectors.

In [6]:
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama3")
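
To make "text to numeric vector" concrete, here is a deliberately naive sketch that counts how often each word of a small vocabulary appears in a text. It is only an illustration of the input and output shapes: the llama3 model produces dense learned vectors, not word counts, and `toy_embed` and the vocabulary are hypothetical.

```python
def toy_embed(text: str, vocabulary: list[str]) -> list[int]:
    """Map a text to one count per vocabulary word (a toy stand-in for an embedding)."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


vocabulary = ["course", "thesis", "supervisor"]  # assumed toy vocabulary
vec_a = toy_embed("Course work and course credits", vocabulary)
vec_b = toy_embed("Thesis evaluation by the supervisor", vocabulary)
print(vec_a, vec_b)
```

Texts about similar topics end up with similar vectors, which is the property the retriever relies on.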

Chroma vector database - stores all the numeric vectors. Chroma is a vector database that can run locally.

In [7]:
from langchain_community.vectorstores import Chroma
vector = Chroma.from_documents(documents, embeddings)

Use the llama3 LLM (Large Language Model) served by Ollama

In [8]:
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")

Output Parser - converts the output of the chat model into plain text

In [9]:
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()

Retrievers - take a question, compare it with all the numeric vectors in the database, and return the most similar chunks of text

In [10]:
retriever = vector.as_retriever()
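
What "most similar" means can be sketched with cosine similarity over toy vectors. This is an assumption made for illustration: the vector store may use a different distance metric, the three-dimensional vectors are invented, and `cosine_similarity` and `top_k` are hypothetical helpers rather than Chroma or LangChain calls.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Invented toy data: (chunk text, vector) pairs standing in for the database.
toy_store = [
    ("admission rules", [0.9, 0.1, 0.0]),
    ("course work credits", [0.1, 0.9, 0.1]),
    ("thesis evaluation", [0.0, 0.2, 0.9]),
]
query_vector = [0.2, 0.8, 0.0]  # pretend this is the embedded question
results = top_k(query_vector, toy_store, k=2)
print(results)
```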

Adding Memory
Here we use the chat history to reformulate the user's follow-up question into a standalone question.
user's followup question ==> LLM ==> Reformulated question (with history)

In [15]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

instruction_to_system ="""
Given a chat history and the latest user question
which might reference context in the chat history, formulate a standalone question
which can be understood without the chat history. Do NOT answer the question,
just reformulate it if needed and otherwise return it as is.
"""

question_maker_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", instruction_to_system),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human","{question}"),
    ]
)

question_chain = question_maker_prompt | llm | StrOutputParser()

Example to illustrate the reformulation of the question

In [16]:
from langchain_core.messages import AIMessage, HumanMessage
question_chain.invoke({"question":"Can you explain more",
        "chat_history": [HumanMessage(content="you explained that the moon is round")]})

"Here's the formulated standalone question:\n\nWhat are some additional details about the shape of the moon?"

Prompt
Building the prompt for question answering.
The prompt consists of:
    - a system instruction
    - a placeholder that takes the chat history (a Python list of messages)
    - the user's question

In [17]:
qa_system_prompt = """

You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, provide a summary of the context. Do not generate your answer.
{context}
"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human","{question}"),
    ]
)
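
The order in which the final messages are laid out can be sketched without LangChain: the system instruction (with the context filled in) comes first, then the chat history, then the new question. `build_messages` is a hypothetical helper and the (role, content) tuples are a simplification of LangChain's message objects.

```python
def build_messages(system_template: str, chat_history: list, question: str, **variables) -> list:
    """Fill the system template, then place history and the new question after it."""
    system_text = system_template.format(**variables)
    return [("system", system_text), *chat_history, ("human", question)]


# Placeholder history and context, invented for illustration only.
history = [("human", "Explain the Course works."), ("ai", "(previous answer)")]
messages = build_messages(
    "Use the following context to answer the question.\n{context}",
    history,
    "can you explain more?",
    context="<retrieved chunks go here>",
)
for role, content in messages:
    print(f"{role}: {content}")
```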

Which question to pass to the LLM?
We define a function that looks at the chat history:
    - if there is a history, it returns the question chain (which reformulates the user's question);
    - if the chat history is empty, it returns the user's question unchanged.

In [20]:
def contextualize_question(input:dict):
    if input.get("chat_history"):
        return question_chain
    else:
        return input["question"]
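
Note that the function returns the chain itself rather than its result. When a function used inside a LangChain chain returns another runnable, that runnable is then run on the same input. The sketch below mimics this with plain callables; `fake_reformulate` is a hypothetical stand-in for the LLM-backed question chain, and `resolve` plays the role of the surrounding chain.

```python
def fake_reformulate(payload: dict) -> str:
    """Stand-in for the question chain: a real one would ask the LLM to rewrite the question."""
    last_turn = payload["chat_history"][-1]
    return f"{payload['question']} (regarding: {last_turn})"


def choose_question(payload: dict):
    """Return a callable when there is history, otherwise the raw question string."""
    if payload.get("chat_history"):
        return fake_reformulate
    return payload["question"]


def resolve(payload: dict) -> str:
    """If the chosen value is callable, run it on the same payload; else use it as is."""
    chosen = choose_question(payload)
    return chosen(payload) if callable(chosen) else chosen


first = resolve({"question": "Explain the Course works.", "chat_history": []})
followup = resolve({"question": "can you explain more?",
                    "chat_history": ["Explain the Course works."]})
print(first)
print(followup)
```

Skipping the reformulation step on the first turn avoids one LLM call when there is nothing to resolve.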

In [89]:
!pip install langchain_core



Retriever Chain
We need a chain that passes the following to the LLM:
    - context: the most relevant chunks of the PDF, fetched by the vector retriever using the reformulated or the original question, depending on the history
    - question: the user's question as asked
    - chat_history: a Python list of the chat messages

We use RunnablePassthrough.assign, which adds the context to whatever it receives as input and passes the result to the next link of the chain.

In [23]:
from langchain_core.runnables import RunnablePassthrough
retriever_chain = RunnablePassthrough.assign(
    context = contextualize_question | retriever
)
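
The behaviour of assign can be sketched in plain Python: keep every key of the incoming dictionary and add new keys computed from it. `assign` and `fake_retrieve` here are hypothetical stand-ins, not the LangChain implementation.

```python
def assign(**computed):
    """Build a step that copies its input dict and adds one key per given function."""
    def step(payload: dict) -> dict:
        extra = {key: fn(payload) for key, fn in computed.items()}
        return {**payload, **extra}  # original keys are preserved
    return step


def fake_retrieve(payload: dict) -> list[str]:
    """Stand-in for 'contextualize_question | retriever'."""
    return [f"chunk relevant to: {payload['question']}"]


add_context = assign(context=fake_retrieve)
enriched = add_context({"question": "can you explain more?", "chat_history": []})
print(enriched)
```

Because the original keys pass through untouched, the prompt that follows still receives question and chat_history alongside the new context.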

In [24]:
retriever_chain.invoke({
    "chat_history":[HumanMessage(content="you explained that the moon is round")],
    "question": "can you explain more?"
})

{'chat_history': [HumanMessage(content='you explained that the moon is round')],
 'question': 'can you explain more?',
 'context': [Document(metadata={'page': 8, 'source': 'documents/PHD_PROGRAMME.pdf'}, page_content='and/or  \nii. NITTTR Chennai admit students through an Entrance Test conducted at the  \ninstitute . The Entrance Test syllabus shall consist of 50% of research \nmethodology, and 50% shall be subject - specific.  \niii. Students who have secured 50% marks in the entrance test are eligible to be \ncalled for the interview.  \niv. A relaxation of 5% marks will be allowed in the entrance examination for the \ncandidates belonging to SC/ST/OBC/Differently abled category, Economically \nweaker section (EWS) and other categories of candidates as per the decision of \nthe commission from time to time.'),
  Document(metadata={'page': 18, 'source': 'documents/PHD_PROGRAMME.pdf'}, page_content='recommendation for th e award or for rejection.  \n14.6 If one examiner recommends the 

Retrieval-Augmented Generation (RAG) Chain - the main chain, which produces the final answer.

In [25]:
rag_chain = (
    retriever_chain | qa_prompt | llm #| output_parser
)

In [40]:
# question = "How would decision-making improve if everybody in your organization had knowledge of all your customers needs? "
#question="How a company should go about and define the core functional job?"
# question = "Explain in detail about the Innovation Process."
# question = "Explain about the TreeExpansionEvent class"
question = "Explain the Course works."

In [41]:
chat_history = []
ai_msg = rag_chain.invoke({"question":question, "chat_history":chat_history})
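# The LLM returns a plain string, so wrap it in AIMessage to record it as the assistant's turn.
chat_history.extend([HumanMessage(content=question), AIMessage(content=ai_msg)])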
ai_msg

'Based on the provided documents, I can summarize the course work for the Ph.D. Programme:\n\nThe programme has two main categories of students:\n\n1. Regular Scholars:\n\t* Must give an undertaking to abide by the Ph.D. regulations.\n\t* Will work under the guidance of NITTTR Chennai faculty members.\n\t* Will have a research problem assigned to them.\n2. Executive Scholars (Working Professionals):\n\t* Sponsored by their employers and work on research problems related to their organizations.\n\t* Collaborate with NITTTR Chennai faculty members on funded projects.\n\nThe course work for both categories involves the following:\n\n1. Research methodology: 50% of the Entrance Test syllabus.\n2. Subject-specific knowledge: 50% of the Entrance Test syllabus.\n\nStudents who secure 50% marks in the entrance test are eligible to be called for an interview. A relaxation of 5% marks is allowed for SC/ST/OBC/Differently abled category, Economically Weaker section (EWS), and other categories as 

In [42]:
# print(ai_msg,'content')
print(ai_msg)

Based on the provided documents, I can summarize the course work for the Ph.D. Programme:

The programme has two main categories of students:

1. Regular Scholars:
	* Must give an undertaking to abide by the Ph.D. regulations.
	* Will work under the guidance of NITTTR Chennai faculty members.
	* Will have a research problem assigned to them.
2. Executive Scholars (Working Professionals):
	* Sponsored by their employers and work on research problems related to their organizations.
	* Collaborate with NITTTR Chennai faculty members on funded projects.

The course work for both categories involves the following:

1. Research methodology: 50% of the Entrance Test syllabus.
2. Subject-specific knowledge: 50% of the Entrance Test syllabus.

Students who secure 50% marks in the entrance test are eligible to be called for an interview. A relaxation of 5% marks is allowed for SC/ST/OBC/Differently abled category, Economically Weaker section (EWS), and other categories as per the decision of the

In [44]:
question = "can you explain more?"
ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
# Wrap the string answer in AIMessage so it is stored as the assistant's turn.
chat_history.extend([HumanMessage(content=question), AIMessage(content=ai_msg)])
# print(ai_msg,"content")
# print(f"Title: {question.metadata['title']}, Source:{question.metadata['source']}")
print(ai_msg)

Based on the provided documents, here is a summary of the course works for the Ph.D. Programme:

For Regular Scholars:
* The total credit requirement is 24 credits.
* In case of candidates with B.E./B.Tech., a minimum course work for 24 credits shall be completed within two years from the date of admission.

For Executive Scholars (Working Professionals):
* The total credit requirement varies depending on their background and qualifications.


In [97]:
# for doc in ai_msg["context"]:
#     print(f"Title: {doc.metadata['title']}, Source:{doc.metadata['source']}")