# Notebook for building a RAG pipeline using LLMs: chat history and citations

### This notebook covers:
- A RAG QA pipeline using LLMs
- Dataset: recent articles on a single topic, Christopher Nolan movies
- Citations, added to reduce hallucinations
- Chat history

## 0. Installations

### Required installations
- pip install langchain
- pip install 'langchain[llms]' (for Mac)
- pip install tiktoken (OpenAI tokenizer)
- pip install pypdf faiss-cpu (needed by PyPDFLoader and the FAISS vector store)
- export OPENAI_API_KEY=""

### Import Python packages

In [68]:
from pathlib import Path

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS

## 1. DATA LOAD

### Load PDF data with LangChain's PyPDFLoader

- PyPDFLoader loads a PDF using pypdf into a list of documents, where each document contains the page content and metadata with the page number. This is the loader used below.
- PyMuPDFLoader is an alternative loader that returns one document per page.
- Refer: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

### Function to read PDF input

In [69]:
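def read_pdf_input(file_path):
    """Load every PDF found in the directory file_path and return the documents."""
    pdf_search = Path(file_path).glob("*.pdf")
    pdf_files = [str(file.absolute()) for file in pdf_search]
    print('Total PDF files', len(pdf_files))
    pages = []
    for pdf in pdf_files:
        loader = PyPDFLoader(pdf)
        # load_and_split() may break a long page into several chunks,
        # so len(pages) can be larger than the number of PDF pages
        pages.extend(loader.load_and_split())
    return pages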

In [70]:
file_path = '<data-path>'  # directory that contains the PDF files
pages = read_pdf_input(file_path)
print('Length of pages', len(pages))

Total PDF files 5
Length of pages 82
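
To see how the 82 documents are spread over the 5 files, the loaded list can be grouped by source file. The snippet below is a minimal sketch: `chunks_per_file` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `metadata` keys (`source`, `page`) used elsewhere in this notebook. With real data, `chunks_per_file(pages)` would be called instead.

```python
from collections import Counter
from pathlib import Path
from types import SimpleNamespace


def chunks_per_file(docs):
    """Count how many loaded documents came from each source file."""
    return Counter(Path(doc.metadata["source"]).name for doc in docs)


# Stand-in documents (assumption: same metadata keys as the real loader output)
sample_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "data/article1.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "data/article1.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "data/article2.pdf", "page": 0}),
]

counts = chunks_per_file(sample_docs)
print(dict(counts))  # {'article1.pdf': 2, 'article2.pdf': 1}
```

A file with far more entries than it has pages would suggest that long pages were split into several chunks.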


## 2. DATA STORE

### Embed the documents

In [71]:
embeddings_model = OpenAIEmbeddings(openai_api_key="<open-ai-key>")

### Vector Store: FAISS

In [72]:
db = FAISS.from_documents(pages, embeddings_model)
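
Conceptually, the vector store keeps one embedding vector per document and answers a query by returning the documents whose vectors are closest to the query vector. The following is a toy sketch of that idea in plain Python, not the FAISS API: the three-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is chosen for readability (the real index may use a different metric, for example L2 distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings; real ones have hundreds or thousands of dimensions
toy_vectors = {
    "article1.pdf_Page8": [0.9, 0.1, 0.0],
    "article2.pdf_Page7": [0.7, 0.3, 0.1],
    "article3.pdf_Page1": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in results])  # ['article1.pdf_Page8', 'article2.pdf_Page7']
```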

## 3. RETRIEVER

In [73]:
retriever = db.as_retriever()
#retriever = db.as_retriever(search_type="mmr")
#retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})

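- retriever = db.as_retriever(search_kwargs={"k": 1}) -> retrieves only the top k=1 result
- retriever = db.as_retriever(search_type="mmr") -> uses maximal marginal relevance, which trades off relevance against diversity of the results
- retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5}) -> returns only results whose similarity score is at least 0.5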
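
The effect of `k` and `score_threshold` can be illustrated on a list of already scored results. This is a sketch under stated assumptions, not the retriever's implementation: `filter_results` is a hypothetical helper, the scores are invented, and the default of `k=4` is chosen only for this example.

```python
def filter_results(scored_docs, k=4, score_threshold=None):
    """Keep the k best (doc_id, score) pairs, optionally dropping low scores."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= score_threshold]
    return ranked[:k]


# Invented similarity scores for four candidate pages
scored = [("page_a", 0.82), ("page_b", 0.47), ("page_c", 0.65), ("page_d", 0.51)]

top_one = filter_results(scored, k=1)
above_half = filter_results(scored, score_threshold=0.5)

print(top_one)     # [('page_a', 0.82)]
print(above_half)  # [('page_a', 0.82), ('page_c', 0.65), ('page_d', 0.51)]
```

A threshold can return fewer documents than `k`, or none at all, so the reader prompt should be able to handle an empty dictionary.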




# 5. QA 

## 5.1 Adding Citations

### Prompt for adding citations

In [75]:
reader_template = """You are a QA assistant. You are given a question and a dictionary.
You need to generate the answer and cite the sentences used in generating the answer.
Each key in the dictionary is a citation, and its value is the context to use when answering the question.

- Try your best to list the citations
- If you do not find the answer, say politely that you don't know.
- Do not generate false information.
- Do not combine multiple sources, list them separately like [src_1][src_2]

Below is an example:
question : 'When did Roman Empire fall?'
dictionary: article4.pdf_Page2: The western empire suffered several Gothic invasions and, in AD 455, was sacked by Vandals. Rome continued to decline after that until AD 476 when the western Roman Empire came to an end.
Answer: 476 CE [article4.pdf_Page2]

Use the dictionary: {dictionary} for the Question: {question} to generate the answer.
Helpful Answer:
"""
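
The two placeholders, `{dictionary}` and `{question}`, are filled in before the text is sent to the model. To my understanding, a `PromptTemplate` with the default format substitutes them much like Python's `str.format`, so the effect can be previewed without LangChain. The shortened `toy_template` and the one-entry dictionary below are assumptions made for illustration.

```python
# Shortened stand-in for the last lines of reader_template
toy_template = (
    "Use the dictionary: {dictionary} for the Question: {question} "
    "to generate the answer.\nHelpful Answer:"
)

toy_dictionary = {"article4.pdf_Page2": "Rome continued to decline until AD 476."}

prompt_text = toy_template.format(
    dictionary=toy_dictionary,
    question="When did Roman Empire fall?",
)
print(prompt_text)
```

The dictionary is inserted as its Python `repr`, so the model sees the citation keys and their context exactly as they are stored.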

In [76]:
# temperature=0 keeps the answers as deterministic as possible
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<open-ai-key>")



## Utility Functions

### Function to create key: value pair of source and context for citations

In [77]:
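def data_preprocess(docs):
    """Map each citation id (<file name>_Page<n>) to the text retrieved for that page."""
    result_dict = {}
    for doc in docs:
        # Page numbers in the metadata are 0-based; add 1 for readable citations
        source_page = Path(doc.metadata['source']).name + '_Page' + str(doc.metadata['page'] + 1)
        # Chunks from the same page share a key, so append instead of overwriting
        if source_page in result_dict:
            result_dict[source_page] += ' ' + doc.page_content
        else:
            result_dict[source_page] = doc.page_content
    return result_dict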
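
The citation ids have the form `<file name>_Page<page number>`, as in `article4.pdf_Page2`. The sketch below reproduces that key construction on plain tuples so it can be tried without loading any PDF. `citation_key` and `build_citation_dict` are hypothetical helpers, and the `(metadata, text)` tuples are stand-ins for the retrieved documents.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def citation_key(metadata):
    """Build '<file name>_Page<n>' from loader-style metadata (page is 0-based)."""
    file_name = PurePosixPath(metadata["source"]).name
    return f"{file_name}_Page{metadata['page'] + 1}"


def build_citation_dict(docs):
    """Group texts by citation key; chunks from the same page are joined."""
    grouped = defaultdict(list)
    for metadata, text in docs:
        grouped[citation_key(metadata)].append(text)
    return {key: " ".join(texts) for key, texts in grouped.items()}


# Stand-in (metadata, text) pairs; two of them come from the same page
toy_docs = [
    ({"source": "/data/article4.pdf", "page": 1}, "first chunk"),
    ({"source": "/data/article4.pdf", "page": 1}, "second chunk"),
    ({"source": "/data/article1.pdf", "page": 7}, "another page"),
]
citations = build_citation_dict(toy_docs)
print(citations)
```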
        

### Function to generate answer and give citations 

In [78]:
def qa_reader(ques, llm, reader_template, retriever):
    """Answer ques from the retrieved documents and cite the source pages."""
    # 1. Build the prompt from the prompt template
    reader_prompt = PromptTemplate(template=reader_template, input_variables=["dictionary", "question"])

    # 2. Create an LLM chain from the LLM model and the prompt
    llm_chain = LLMChain(prompt=reader_prompt, llm=llm)

    # 3. Retrieve the documents relevant to the input query
    docs = retriever.get_relevant_documents(ques)

    # 4. Build the citation dictionary: source id as key, page content as value
    preprocess_docs = data_preprocess(docs)

    # 5. Pass the dictionary and the input query to the LLM chain created in step 2
    result = llm_chain.predict(dictionary=preprocess_docs, question=ques)
    return result
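
The five steps can be exercised without an API key by passing in stand-ins for the retriever and the model. This is only a sketch of the control flow: `answer_with_citations`, `fake_retrieve`, `fake_llm` and `toy_template` are hypothetical names, and the canned reply is not real model output.

```python
def answer_with_citations(question, retrieve, build_dictionary, call_llm, template):
    """Retrieve, build the citation dictionary, fill the prompt, call the model."""
    docs = retrieve(question)
    dictionary = build_dictionary(docs)
    prompt = template.format(dictionary=dictionary, question=question)
    return call_llm(prompt)


# Hypothetical stand-ins so the flow runs offline
seen_prompts = []


def fake_retrieve(question):
    return [("article1.pdf_Page8", "Dunkirk is about the British evacuation of France in 1940.")]


def fake_llm(prompt):
    seen_prompts.append(prompt)  # record what the model would have received
    return "canned answer [article1.pdf_Page8]"


toy_template = "Use the dictionary: {dictionary} for the Question: {question}"
reply = answer_with_citations(
    "What is the movie Dunkirk about?", fake_retrieve, dict, fake_llm, toy_template
)
print(reply)
```

Inspecting `seen_prompts` shows exactly what text would reach the model, which helps when debugging the template.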
    

In [79]:
print('---------------- Answer from Reader ----------------')
questions = [
    "Who is Christopher Nolan?",
    "Which is Christopher Nolan's latest movie?",
    "What actors did he work with?",
    "Can you provide list of more actors he has worked with?",
    "How did he begin his career?",
    "How is he as a person?",
]
# Each question is answered independently; no history is used in this cell
for ques in questions:
    ans = qa_reader(ques, llm, reader_template, retriever)
    print('Question: ', ques)
    print('Answer: ', ans)
    print('\n')

---------------- Answer from Reader ----------------
Question:  Who is Christopher Nolan?
Answer:  Christopher Nolan is a British film director and writer known for his noirish visual aesthetic and unconventional narratives [article3.pdf_Page1]. He was raised by an American mother and a British father and spent time in both Chicago and London [article3.pdf_Page1]. Nolan became interested in moviemaking from a young age and would use his father's Super-8 camera to make shorts [article3.pdf_Page1]. His breakthrough film was "Memento" in 2000, which used a reverse-order story line to mirror the fractured mental state of its protagonist [article3.pdf_Page1]. Nolan gained further recognition with his film "Batman Begins" in 2005, which focused on the origins of the superhero and had a darker and more realistic tone compared to previous Batman films [article3.pdf_Page1].


Question:  Which is Christopher Nolan's latest movie?
Answer:  I'm sorry, but I don't have the information to answer you
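
Since every citation should be a key of the dictionary that was passed to the model, one inexpensive sanity check is to extract the bracketed ids from an answer and compare them with the retrieved keys. The sketch below assumes citations always appear as `[key]`; `check_citations` is a hypothetical helper, and the sample answer is invented for the example.

```python
import re

# Matches text inside one pair of square brackets, e.g. [article3.pdf_Page1]
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer, dictionary):
    """Split cited ids into those present in the dictionary and those that are not."""
    cited = CITATION_PATTERN.findall(answer)
    known = [c for c in cited if c in dictionary]
    unknown = [c for c in cited if c not in dictionary]
    return known, unknown


retrieved = {"article3.pdf_Page1": "..."}
sample_answer = "Claim one [article3.pdf_Page1]. Claim two [article9.pdf_Page4]."

known, unknown = check_citations(sample_answer, retrieved)
print(known)    # ['article3.pdf_Page1']
print(unknown)  # ['article9.pdf_Page4']
```

A non-empty `unknown` list means the answer cites a source that was never retrieved.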

## 5.2 Adding Conversational History

### Prompt for adding conversational history

In [80]:
conversational_template = """As a question answering assistant, generate a new question based on the asked question and the conversational history (History).
History is passed as a list of strings. Each string has the format:
'Question: ' + previous question + " Answer: " + answer
In each History string, the text after the keyword 'Question: ' and before the keyword "Answer: " is a question the user asked earlier.
Generate a new question using the conversational history and the guidelines below:
- Conversational history is ordered from least recent to most recent. Weight most recent history the most.
- If the asked question is not related to the conversational history, do not consider the history and return the original question.
- Only generate a single query.
- If multiple options exist for the query, do not separate them by OR and do not ask the user to select. Just pick any single query.
- Do not ask clarifying questions, if you are not sure about the new query, just output the original question.

History : {history}
asked question : {ques}
New Question: 
"""
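
Each history entry is a single string of the form `'Question: ' + question + " Answer: " + answer`. The sketch below shows that format and how one entry could be split back into its two parts with a regular expression. `format_entry` and `parse_entry` are hypothetical helpers, and the split assumes the question itself does not contain the text " Answer: ".

```python
import re

ENTRY_PATTERN = re.compile(
    r"^Question: (?P<question>.*?) Answer: (?P<answer>.*)$", re.DOTALL
)


def format_entry(question, answer):
    """Build one history string in the format the prompt describes."""
    return 'Question: ' + question + " Answer: " + answer


def parse_entry(entry):
    """Recover (question, answer) from one history string."""
    match = ENTRY_PATTERN.match(entry)
    if match is None:
        raise ValueError("entry does not follow the 'Question: ... Answer: ...' format")
    return match.group("question"), match.group("answer")


entry = format_entry(
    "What is the movie Dunkirk about?",
    "The movie Dunkirk is about the British evacuation of France in 1940 "
    "during World War II [article1.pdf_Page8].",
)
question, answer = parse_entry(entry)
print(question)
```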

### Utility function for adding conversational history

In [81]:
def qa_conversationHistory(ques, llm, conversational_template, history):
    """Rewrite ques into a standalone question using the conversation history."""
    conversational_prompt = PromptTemplate(template=conversational_template, input_variables=["history", "ques"])
    llm_chain = LLMChain(prompt=conversational_prompt, llm=llm)
    new_ques = llm_chain.predict(history=history, ques=ques)
    return new_ques

### Pipeline for adding:
 - Conversational History
 - Citations while generating answer

In [82]:
# Global variable to store history
history = []

def pipeline(ques, llm, reader_template, conversational_template, retriever, history):
    """Rewrite ques using the history, answer it with citations, and record the turn."""
    # Generate a new question based on the previous history
    new_ques = qa_conversationHistory(ques, llm, conversational_template, history)

    # Generate an answer for the new question and add citations
    ans = qa_reader(new_ques, llm, reader_template, retriever)

    # Add the new question and its answer to the history (the list is modified in place)
    inp = 'Question: ' + new_ques + " Answer: " + ans
    history.append(inp)
    return ques, new_ques, ans
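
The way history carries context from one turn to the next can be tried offline with stand-ins for the two LLM calls. This is a sketch under stated assumptions: `run_turn`, `fake_rewrite` and `fake_answer` are hypothetical, and the hard-coded rewrite rule only imitates what the model is asked to do.

```python
def run_turn(question, rewrite, answer, history):
    """One conversational turn: rewrite, answer, then record the exchange."""
    new_question = rewrite(question, history)
    reply = answer(new_question)
    history.append('Question: ' + new_question + " Answer: " + reply)
    return question, new_question, reply


def fake_rewrite(question, history):
    # Hypothetical rule standing in for the LLM: resolve "this movie" from the last turn
    if history and "this movie" in question and "Dunkirk" in history[-1]:
        return question.replace("this movie", "the movie Dunkirk")
    return question


def fake_answer(question):
    return "stub answer to: " + question


toy_history = []
run_turn("What is the movie Dunkirk about?", fake_rewrite, fake_answer, toy_history)
_, rewritten, _ = run_turn(
    "Who was the main actor in this movie?", fake_rewrite, fake_answer, toy_history
)
print(rewritten)         # Who was the main actor in the movie Dunkirk?
print(len(toy_history))  # 2
```

Because the rewritten question, not the original one, is stored, later turns see questions that already stand on their own.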

In [85]:
history = []
print('---------------- Answer with conversation history and citations ----------------')
print('Initial History ', history)

questions = [
    "What is the movie Dunkirk about?",
    "Who was the main actor in this movie?",
    "Who is Christopher Nolan?",
    "What is his latest movie?",
    "How did he begin his career?",
    "Has he won any awards?",
    "How is he as a person?",
    "What was his first movie?",
]
# Run every question through the pipeline so that follow-up questions
# ("he", "this movie") are rewritten using the conversation history
for ques in questions:
    ques, new_ques, ans = pipeline(ques, llm, reader_template, conversational_template, retriever, history)
    print('Question: ', ques)
    print('New Question: ', new_ques)
    print('Answer: ', ans)
    print('\n')



---------------- Answer with conversation history and citations ----------------
Initial History  []
Question:  What is the movie Dunkirk about?
New Question:  What is the movie Dunkirk about?
Answer:  The movie Dunkirk is about the British evacuation of France in 1940 during World War II [article1.pdf_Page8]. It depicts the efforts to rescue 338,000 Allied soldiers who were trapped on the beaches of Dunkirk [article2.pdf_Page7]. The film intercuts three narrative timelines of different lengths, resulting in surprising twists and turns in the story [article1.pdf_Page8]. It immerses audiences into the overpowering desperation to survive during war [article2.pdf_Page7]. The movie showcases the visual poetry of a pilot standing before the burning wreckage of his plane at twilight on a beach, symbolizing British resolve in the face of defeat [article2.pdf_Page7].


Question:  Who was the main actor in this movie?
New Question:  Who was the main actor in the movie Dunkirk?
Answer:  Harry St