# Notebook for building a RAG pipeline using LLMs

### This notebook covers:
- A RAG question-answering (QA) pipeline built with LLMs
- Dataset: recent articles on a single topic (Christopher Nolan movies)
    

### 0. Required Installations
- `pip install langchain`
- `pip install 'langchain[llms]'` (quoted form, for Mac)
- `pip install tiktoken` (OpenAI tokenizer)
- `pip install pypdf faiss-cpu` (needed by the PyPDFLoader and FAISS classes used below)
- `export OPENAI_API_KEY=""`

## 0. Installations

### Import Python packages

In [28]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from pathlib import Path

## 1. DATA LOAD

### Load PDF data and convert it using the PyPDFLoader from LangChain

- PyPDFLoader (used below) loads a PDF with pypdf into a list of documents, where each document contains the page content and metadata with the page number.
- PyMuPDFLoader is an alternative loader that also returns one document per page.
- Reference: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

### Function to read PDF input

In [29]:
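def read_pdf_input(file_path):
    """Load every PDF found directly inside the folder `file_path`."""
    # Collect the PDF files in the folder (non-recursive)
    pdf_search = Path(file_path).glob("*.pdf")
    pdf_files  = [str(file.absolute()) for file in pdf_search]
    print('Total PDF files',len(pdf_files))
    pages = []
    for pdf in pdf_files:
        loader = PyPDFLoader(pdf)
        # load_and_split() loads the pages and splits them into chunks with the default text splitter
        pages.extend(loader.load_and_split())
    return pages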
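
`Path.glob("*.pdf")` only matches files directly inside the given directory, so `file_path` is expected to be a folder rather than a single PDF. Below is a minimal, stdlib-only sketch of that discovery step. The helper name `list_pdf_files`, the explicit directory check, and the sorting are assumptions added for illustration and are not part of the notebook or of LangChain.

```python
import tempfile
from pathlib import Path


def list_pdf_files(folder):
    """Return absolute paths of the PDFs directly inside `folder`, sorted by name.

    Hypothetical helper for illustration only.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Expected a directory of PDFs, got: {folder}")
    # glob("*.pdf") is non-recursive; rglob("*.pdf") would also search subfolders.
    return sorted(str(p.resolve()) for p in folder.glob("*.pdf"))


# Try it on a throwaway directory holding two PDFs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = list_pdf_files(tmp)
    found_names = [Path(p).name for p in found]
    print(found_names)
```

Sorting the paths keeps the order of the loaded documents stable across runs, which may make retrieval results easier to compare.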

In [42]:
file_path='<enter your file path>'  # folder that contains the PDF files
pages = read_pdf_input(file_path)
print('Length of pages', len(pages))

Total PDF files 5
Length of pages 82


## 2. DATA STORE

###  Embed the documents

In [32]:

embeddings_model = OpenAIEmbeddings(openai_api_key="<enter open ai key>")

### Vector Store : FAISS 

In [34]:
db = FAISS.from_documents(pages, embeddings_model)
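
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The toy class below sketches that idea in plain Python with a brute-force Euclidean distance search. `TinyVectorStore` and the 2-D vectors are made up for illustration; a real FAISS index is far more efficient and works on the much longer vectors produced by an embedding model.

```python
import math


class TinyVectorStore:
    """Toy in-memory store with brute-force nearest-neighbour search (illustrative only)."""

    def __init__(self):
        self._texts = []
        self._vectors = []

    def add(self, text, vector):
        """Store a chunk of text together with its embedding vector."""
        self._texts.append(text)
        self._vectors.append(list(vector))

    def search(self, query_vector, k=4):
        """Return the k (text, distance) pairs closest to query_vector."""
        scored = [
            (text, math.dist(query_vector, vec))
            for text, vec in zip(self._texts, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        return scored[:k]


store = TinyVectorStore()
# Made-up 2-D "embeddings" standing in for real model output.
store.add("Memento review", [0.9, 0.1])
store.add("Dunkirk review", [0.1, 0.9])
store.add("Inception review", [0.8, 0.3])

hits = store.search([1.0, 0.0], k=2)
hit_texts = [text for text, _ in hits]
print(hit_texts)
```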

## 3. RETRIEVER

In [35]:
retriever = db.as_retriever()
#retriever = db.as_retriever(search_type="mmr")
#retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})

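- `retriever = db.as_retriever(search_kwargs={"k": 1})` retrieves only the top k=1 result.
- `retriever = db.as_retriever(search_type="mmr")` uses maximal marginal relevance, which balances relevance against diversity among the results.
- `retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})` returns only documents with a similarity score greater than 0.5.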
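
The `k` and `score_threshold` options differ in how they cut the ranked list: `k` keeps a fixed number of results, while a score threshold keeps however many results clear the cutoff, possibly none. A minimal sketch over made-up (chunk, score) pairs follows. The function names, chunk labels, and scores are hypothetical, the scoring itself is left out, and treating the cutoff as inclusive is an assumption.

```python
def top_k(scored_chunks, k):
    """Keep the k highest-scoring (chunk, score) pairs."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def above_threshold(scored_chunks, score_threshold):
    """Keep every pair whose score reaches the threshold (inclusive by assumption)."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked if pair[1] >= score_threshold]


# Made-up similarity scores for four chunks.
scores = [("chunk_a", 0.82), ("chunk_b", 0.47), ("chunk_c", 0.65), ("chunk_d", 0.30)]

best_one = top_k(scores, k=1)
over_half = above_threshold(scores, score_threshold=0.5)
none_left = above_threshold(scores, score_threshold=0.9)

print(best_one)
print(over_half)
print(none_left)
```

With a strict threshold the retriever can return an empty list, so the QA step should be prepared to answer with an empty context.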




## 4. QA

### Prompt tuning

In [36]:
reader_template = """As a question-answering assistant, generate an answer to the input question using the context provided.
Follow the below guidelines while answering the question.
- Use the context to answer the question. Do not answer out of the context available.
- Be concise and clear in your language.
- If you do not know the answer, just say: "Sorry, I do not know this!"
Use the context: {context} for the question: {question} to generate the answer.
Helpful Answer:"""

In [37]:
# PromptTemplate and ChatOpenAI are already imported above; only LLMChain is still needed
from langchain.chains import LLMChain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="<enter open ai key>")


In [38]:
def qa_reader(ques, llm, reader_template, retriever):

    # 1. Build the prompt from the prompt template
    reader_prompt = PromptTemplate(template=reader_template, input_variables=["context", "question"])

    # 2. Create an LLMChain that combines the LLM with the prompt
    llm_chain = LLMChain(prompt=reader_prompt, llm=llm)

    # 3. Retrieve the documents relevant to the input query
    docs = retriever.get_relevant_documents(ques)

    # 4. Pass the retrieved documents as the context, together with the input query, to the LLMChain from step 2
    #    (the list of documents is converted to a string when the prompt is formatted)
    result = llm_chain.predict(context=docs, question=ques)

    # 5. Return the output generated by the LLM
    return result
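
Step 4 places the retrieved documents into the `{context}` slot of the prompt. One way to control exactly what text the model sees is to join the chunk texts explicitly before filling the template. The sketch below uses a stand-in `Doc` dataclass (an assumption that mirrors the `page_content` and `metadata` attributes of a LangChain document), a hypothetical `format_context` helper, and plain `str.format`, so it runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (illustrative, not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs):
    """Join chunk texts, tagging each with its page number when one is available."""
    parts = []
    for doc in docs:
        page = doc.metadata.get("page", "?")
        parts.append(f"[page {page}] {doc.page_content.strip()}")
    return "\n\n".join(parts)


template = "Use the context: {context} for the question: {question} to generate the answer."
docs = [
    Doc("Nolan directed Memento.  ", {"page": 3}),
    Doc("Inception was released in 2010."),  # no page metadata
]
prompt = template.format(
    context=format_context(docs),
    question="Which films did Nolan direct?",
)
print(prompt)
```

The same string could be passed as `context` to the chain's `predict` call in place of the raw document list.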
    

## Success Cases

In [39]:
print('---------------- Answer from Reader ----------------')
questions = [
    "Who is Christopher Nolan?",
    "What are some of Nolan's best movies?",
    "What actors has he worked with?",
    "Explain the plot of movie Oppenheimer",
]
# Ask each question independently and print the answer
for ques in questions:
    ans = qa_reader(ques, llm, reader_template, retriever)
    print('Question: ',ques)
    print('Answer: ',ans)
    print('\n')


---------------- Answer from Reader ----------------
Question:  Who is Christopher Nolan?
Answer:  Christopher Nolan is a British film director and writer known for his noirish visual aesthetic and unconventional narratives. He gained recognition with his film "Memento" in 2000 and achieved further success with "Batman Begins" in 2005. Nolan's works often focus on realistic and gritty portrayals of characters and settings.


Question:  What are some of Nolan's best movies?
Answer:  Some of Christopher Nolan's best movies include The Dark Knight, Memento, and Inception.


Question:  What actors has he worked with?
Answer:  Christopher Nolan has worked with actors such as Leonardo DiCaprio, Joseph Gordon-Levitt, Matthew McConaughey, Heath Ledger, and Cillian Murphy.


Question:  Explain the plot of movie Oppenheimer
Answer:  The plot of the movie Oppenheimer revolves around the creation of the atomic bomb during World War II. It specifically focuses on J. Robert Oppenheimer, the scientis

## Failure Cases

In [41]:
print('---------------- Answer from Reader ----------------')
questions = [
    "What is the movie Dunkirk about?",
    "Who was the main actor in this movie?",
    "Who is Christopher Nolan?",
    "What is his latest movie?",
    "How is he as a person?",
    "What was his first movie?",
]
# Ask each question independently and print the answer
for ques in questions:
    ans = qa_reader(ques, llm, reader_template, retriever)
    print('Question: ',ques)
    print('Answer: ',ans)
    print('\n')



---------------- Answer from Reader ----------------
Question:  What is the movie Dunkirk about?
Answer:  The movie Dunkirk is about the evacuation of Allied troops from the beaches of Dunkirk during World War II. Germany had advanced into France, trapping the troops, and the movie depicts their struggle to escape under air and ground cover. It is a fact-based story that showcases emotionally satisfying spectacle and a talented ensemble cast.


Question:  Who was the main actor in this movie?
Answer:  The main actor in the movie "Oppenheimer" is Cillian Murphy, who plays the titular character, J. Robert Oppenheimer.


Question:  Who is Christopher Nolan?
Answer:  Christopher Nolan is a British film director and writer known for his noirish visual aesthetic and unconventional narratives. He gained recognition with his film "Memento" in 2000 and achieved further success with "Batman Begins" in 2005. Nolan's works often focus on realistic and gritty portrayals of characters and settings.
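
In the run above, the follow-up "Who was the main actor in this movie?" was answered for Oppenheimer rather than Dunkirk. A likely cause is that `qa_reader` handles each question independently, so neither the retriever nor the LLM sees the earlier turn that "this movie" refers to. One common mitigation is to rewrite a follow-up into a standalone question before retrieval. The sketch below only builds such a rewrite prompt from a chat history. `build_condense_prompt` and the prompt wording are hypothetical, and sending the prompt to an LLM is left out.

```python
CONDENSE_TEMPLATE = """Given the conversation so far and a follow-up question, rewrite the follow-up as a standalone question.

Conversation:
{history}

Follow-up question: {question}
Standalone question:"""


def build_condense_prompt(history, question):
    """Build a rewrite prompt; `history` is a list of (question, answer) tuples."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"User: {past_question}")
        lines.append(f"Assistant: {past_answer}")
    return CONDENSE_TEMPLATE.format(history="\n".join(lines), question=question)


history = [
    (
        "What is the movie Dunkirk about?",
        "It is about the evacuation of Allied troops from Dunkirk during World War II.",
    )
]
condense_prompt = build_condense_prompt(history, "Who was the main actor in this movie?")
print(condense_prompt)
```

The rewritten question returned by the LLM would then replace `ques` in the call to `qa_reader`, so that retrieval and answering both see the movie title.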
