<a href="https://colab.research.google.com/github/lakhanrajpatlolla/aiml-learning/blob/master/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

### **Installing and importing packages**

In [None]:
!pip -q install openai
!pip -q install langchain
!pip -q install langchain-openai
!pip -q install pypdf
!pip -q install chromadb
!pip -q install tiktoken

In [None]:
import os
import openai
import numpy as np

#### **Authentication for OpenAI API**

In [None]:
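# Read the key with a context manager so the file is closed afterwards,
# and strip the trailing newline that text files usually end with
with open('/content/openapi_key.txt') as f:
    api_key = f.read().strip()
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv('OPENAI_API_KEY')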

### **Loading the documents**

[PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose
    PyPDFLoader("/content/pca_d1.pdf"),
    PyPDFLoader("/content/ens_d2.pdf"),
    PyPDFLoader("/content/ens_d2.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
print(docs[0].page_content)

### **Splitting of document**

[Recursively split by character](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)

[Split by character](https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter)


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Split
#from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

In [None]:
splits = text_splitter.split_documents(docs)
print(len(splits))
splits
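
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch ignores. The helper name `sliding_window_chunks` and the toy text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Toy input: 1200 characters of repeating digits (illustrative only)
toy_text = "".join(str(i % 10) for i in range(1200))
toy_chunks = sliding_window_chunks(toy_text, chunk_size=500, chunk_overlap=50)
print(len(toy_chunks), [len(c) for c in toy_chunks])
```

The last 50 characters of one chunk reappear as the first 50 characters of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.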

### **Embeddings**

Let's take our splits and embed them.

In [None]:
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [None]:
embedding

### **Understanding similarity search with a toy example**

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like cats"
sentence3 = "the weather is ugly, too hot outside"

In [None]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [None]:
len(embedding1), len(embedding2), len(embedding3)

In [None]:
# Pairwise dot products: a larger value means the two sentences are more similar
np.dot(embedding1, embedding2), np.dot(embedding1, embedding3), np.dot(embedding2, embedding3)
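
The dot product is a fair similarity score only when the vectors have comparable length; you can check this for the embeddings above with `np.linalg.norm(embedding1)`. Cosine similarity divides out the lengths and works for any vectors. A minimal sketch with made-up 3-dimensional vectors (the values are illustrative, not real embeddings, and the names are hypothetical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for the three sentence embeddings
toy_dogs = [0.9, 0.1, 0.0]
toy_cats = [0.8, 0.2, 0.0]
toy_weather = [0.0, 0.1, 0.9]

sim_dogs_cats = cosine_similarity(toy_dogs, toy_cats)
sim_dogs_weather = cosine_similarity(toy_dogs, toy_weather)
print(sim_dogs_cats, sim_dogs_weather)
```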

### **Vectorstores**

In [None]:
from langchain_community.vectorstores import Chroma # Light-weight and in memory

In [None]:
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
vectordb = Chroma.from_documents(
    documents=splits, # splits we created earlier
    embedding=embedding,
    persist_directory=persist_directory # save the directory
)

In [None]:
vectordb.persist() # Let's **save vectordb** so we can use it later!

In [None]:
print(vectordb._collection.count()) # same as the number of splits

### **Similarity Search**

In [None]:
question = "how does pca reduce the dimension?"

In [None]:
docs = vectordb.similarity_search(question, k=3) # k = number of chunks to return
print(len(docs))
print(docs[0].page_content)
print(docs[1].page_content)
print(docs[2].page_content)
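
Conceptually, a similarity search embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below does this by exhaustive scan over toy 2-dimensional vectors; it is an assumption-laden illustration, since real vector stores typically use an approximate nearest-neighbour index rather than scanning everything. The function name `top_k_by_cosine` is hypothetical.

```python
import numpy as np

def top_k_by_cosine(query_vec, chunk_vecs, k=3):
    """Return (indices, scores) of the k chunk vectors most similar to query_vec, best first."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    q = q / np.linalg.norm(q)                          # normalize the query
    m = m / np.linalg.norm(m, axis=1, keepdims=True)   # normalize each chunk vector
    scores = m @ q                                     # cosine similarity per chunk
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores[order].tolist()

toy_chunk_vecs = [
    [1.0, 0.0],   # chunk 0
    [0.9, 0.1],   # chunk 1
    [0.9, 0.1],   # chunk 2: exact duplicate of chunk 1
    [0.0, 1.0],   # chunk 3
]
toy_query = [0.8, 0.2]
top_idx, top_scores = top_k_by_cosine(toy_query, toy_chunk_vecs, k=3)
print(top_idx)
```

Because chunks 1 and 2 are identical, both land in the top results, which is exactly how a duplicated PDF crowds out other content.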

### **Edge case where failure may happen**

1. Lack of diversity: semantic search returns the most similar chunks, but it does not enforce diversity among them.

    - Notice that we're getting duplicate chunks (because of the duplicate `ens_d2.pdf` in the index). `docs[0]` and `docs[1]` are identical.

  **Addressing Diversity - MMR-Maximum Marginal Relevance**

2. Lack of specificity: the question may concern one particular document, but the retrieved chunks may include information from other documents.

  **Addressing Specificity: Working with metadata - Manually**

  **Working with metadata using self-query retriever -Automatically**

  **Example 1. Addressing Diversity - MMR-Maximum Marginal Relevance**

In [None]:
question= 'how ensemble method works?'
docs = vectordb.similarity_search(question, k=4) # Without MMR; k=4 so that docs[0] to docs[3] below all exist

In [None]:
docs[0]

In [None]:
docs[1]

In [None]:
docs[2]

In [None]:
docs[3]

Addressing Diversity - MMR-Maximum Marginal Relevance

In [None]:
docs_with_mmr=vectordb.max_marginal_relevance_search(question, k=3, fetch_k=6) # With MMR

In [None]:
docs_with_mmr[0]

In [None]:
docs_with_mmr[1]

In [None]:
docs_with_mmr[2]
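
To see why MMR avoids duplicates, here is a minimal sketch of the greedy selection rule on toy vectors. At each step it picks the candidate with the best trade-off between similarity to the query and similarity to what is already selected. The function `mmr_select` and the vectors are hypothetical; `lambda_mult` is named after the LangChain argument, but this is an illustration under those assumptions, not the library's code.

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: choose k candidate indices balancing relevance and novelty."""
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(cand_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    relevance = c @ q        # similarity of each candidate to the query
    pairwise = c @ c.T       # similarity between candidates
    selected = [int(np.argmax(relevance))]  # start with the most relevant one
    while len(selected) < min(k, len(c)):
        best_idx, best_score = -1, -np.inf
        for i in range(len(c)):
            if i in selected:
                continue
            # Penalize candidates that resemble something already chosen
            redundancy = max(pairwise[i, j] for j in selected)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

toy_candidates = [
    [1.0, 0.8, 0.0],   # 0: relevant
    [1.0, 0.8, 0.0],   # 1: exact duplicate of 0
    [0.6, 1.0, 0.5],   # 2: relevant, but a different direction
    [0.0, 0.0, 1.0],   # 3: irrelevant
]
toy_mmr_query = [1.0, 1.0, 0.0]
mmr_picks = mmr_select(toy_mmr_query, toy_candidates, k=2)
print(mmr_picks)
```

A plain top-2 search would return the two identical candidates; MMR keeps one of them and then prefers the distinct candidate 2.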

 **Example 2. Addressing Specificity: Working with metadata - Manually**

In [None]:
# Without metadata information
question = "what is the role of variance in pca?"
docs = vectordb.similarity_search(question,k=5)
for doc in docs:
    print(doc.metadata) # metadata shows which source document each chunk was fetched from

Notice that the last result above comes from the 'ens_d2' doc, even though the question is about PCA.

In [None]:
# With metadata information
question = "what is the role of variance in pca?"
docs = vectordb.similarity_search(
    question,
    k=5,
    filter={"source":'/content/pca_d1.pdf'} # manually passing metadata, using metadata filter.
)

for doc in docs:
    print(doc.metadata)
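
The effect of a metadata filter can be sketched in plain Python: discard every chunk whose metadata does not match before any ranking happens. The records and the helper `filter_by_metadata` below are hypothetical stand-ins for what the vector store holds.

```python
# Toy records shaped like retrieved chunks: text plus a metadata dict
toy_records = [
    {"text": "PCA keeps the directions of maximum variance.",
     "metadata": {"source": "/content/pca_d1.pdf", "page": 0}},
    {"text": "Bagging reduces the variance of a model.",
     "metadata": {"source": "/content/ens_d2.pdf", "page": 1}},
    {"text": "Eigenvalues measure the variance along each component.",
     "metadata": {"source": "/content/pca_d1.pdf", "page": 2}},
]

def filter_by_metadata(records, where):
    """Keep only records whose metadata equals every key/value pair in `where`."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]

pca_only = filter_by_metadata(toy_records, {"source": "/content/pca_d1.pdf"})
for record in pca_only:
    print(record["metadata"])
```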

[**Addressing Specificity -Automatically: Working with metadata using self-query retriever**](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query)



### **Additional tricks: Compression**

Another approach for improving the quality of retrieved docs is compression. Information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

[Contextual compression](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression) is meant to fix this.
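
As a rough illustration of the idea, the sketch below keeps only the sentences of a chunk that share at least one word with the query. This keyword overlap is a deliberately crude stand-in chosen so the example runs offline; actual contextual compression typically relies on an LLM or on embeddings to judge relevance. The function name `compress_chunk` and the toy text are hypothetical.

```python
import re

def compress_chunk(chunk_text, query, min_shared_words=1):
    """Keep the sentences of chunk_text that share enough words with the query."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", chunk_text.strip())
    kept = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if len(words & query_words) >= min_shared_words:
            kept.append(sentence)
    return " ".join(kept)

toy_chunk = (
    "The lecture started at nine. "
    "PCA projects data onto directions of maximum variance. "
    "Lunch was served afterwards."
)
compressed = compress_chunk(toy_chunk, "pca variance")
print(compressed)
```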

### **Retrieval + Question Answering :  Connecting with LLMs**

In [None]:
llm_name = "gpt-3.5-turbo"
print(llm_name)

In [None]:
question = "What is principal component analysis?"
docs = vectordb.max_marginal_relevance_search(question, k=2, fetch_k=3)
len(docs)

In [None]:
docs[0]

In [None]:
docs[1]

In [None]:
#docs[2]

In [None]:
#docs[3]

#### **[RetrievalQA chain](https://docs.smith.langchain.com/cookbook/hub-examples/retrieval-qa-chain)**

#### **[Vector store-backed retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)**

In [None]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)

In [None]:
from langchain.chains import RetrievalQA

In [None]:
question = "What is principal component analysis?"

qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever(), return_source_documents=True)

result = qa_chain.invoke({"query": question})

In [None]:
result["result"]

In [None]:
result["source_documents"]

### **Under the hood? --> Understanding RAG Prompt**

In [None]:
!pip install langchainhub

In [None]:
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
prompt

Use three sentences maximum. Keep the answer as concise as possible.

In [None]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

In [None]:
QA_CHAIN_PROMPT
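
What the chain does with such a template can be sketched without any LLM: the retrieved chunks are concatenated into one string that replaces `{context}`, and the user question replaces `{question}`. The helper `build_stuffed_prompt`, the shortened template, and the blank-line separator are assumptions made for illustration.

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

def build_stuffed_prompt(template, chunk_texts, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunk_texts)  # separator is an assumption
    return template.format(context=context, question=question)

toy_retrieved = [
    "PCA finds orthogonal directions of maximum variance.",
    "The first principal component explains the most variance.",
]
stuffed_prompt = build_stuffed_prompt(
    toy_template, toy_retrieved, "What is principal component analysis?"
)
print(stuffed_prompt)
```

The final string is what the LLM actually sees, which is why long or numerous chunks directly increase the cost of each call.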

In [None]:
# Run chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(search_type="mmr",search_kwargs={"k": 2, "fetch_k":6} ), # "k":2, "fetch_k":3
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                                       return_source_documents=True
                                       )

In [None]:
qa_chain

**Example 1**

In [None]:
question = "What is principal component analysis?"
result = qa_chain.invoke({"query": question})
result["source_documents"]

In [None]:
result["result"]

**Example 2**

In [None]:
question = "What does it say about variance in context of both PCA and Ensemble?"
result = qa_chain.invoke({"query": question})  # use invoke, as in Example 1; calling the chain directly is deprecated
result["source_documents"]

In [None]:
result["result"]

### **RetrievalQA chain types : [Map reduce, Refine, Map rerank(Legacy)](https://python.langchain.com/docs/modules/chains/)**

- Everything we have used so far is the stuff method (the default, `chain_type="stuff"`), which makes only one call to the LLM.

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr",search_kwargs={"k": 4, "fetch_k":8}),
    chain_type="map_reduce"
)

In [None]:
question = "What is principal component analysis?"
result = qa_chain_mr.invoke({"query": question})
result["result"]
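
The control flow of map-reduce can be sketched with a fake LLM: one call per retrieved chunk (map), then one more call that combines the partial answers (reduce). The names `map_reduce_answer` and `fake_llm` and the prompt wording are hypothetical; the point is only the number and order of calls.

```python
llm_call_log = []

def fake_llm(prompt):
    """Stand-in for an LLM: records the prompt and returns a numbered answer."""
    llm_call_log.append(prompt)
    return f"answer#{len(llm_call_log)}"

def map_reduce_answer(chunk_texts, question, llm_call):
    # Map step: ask the question against each chunk separately
    partial_answers = [
        llm_call(f"Context: {text}\nQuestion: {question}") for text in chunk_texts
    ]
    # Reduce step: combine the partial answers in one final call
    combined = "\n".join(partial_answers)
    return llm_call(f"Summaries:\n{combined}\nQuestion: {question}")

toy_map_chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
final_answer = map_reduce_answer(
    toy_map_chunks, "What is principal component analysis?", fake_llm
)
print(len(llm_call_log), final_answer)
```

With four chunks this makes five calls instead of one, so map-reduce trades extra cost and latency for the ability to handle more context than fits in a single prompt.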

### **Make it like Chatbot : Adding Memory**

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [None]:
# Run chain
from langchain.chains import ConversationalRetrievalChain
qa= ConversationalRetrievalChain.from_llm(llm,
                                       retriever=vectordb.as_retriever(search_type="mmr",search_kwargs={"k": 4, "fetch_k":8} ), # "k":2, "fetch_k":3
                                       memory=memory
                                       )

In [None]:
question = "tell me something about PCA"
result = qa.invoke({"question": question})

In [None]:
result['answer']

In [None]:
question = "please list point-wise, how does PCA work?"
result = qa.invoke({"question": question})

In [None]:
print(result['answer'])

In [None]:
question = "what do we get from covariance matrix for doing PCA?"
result = qa.invoke({"question": question})
print(result['answer'])
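
Memory matters because a follow-up question often makes sense only alongside earlier turns. A conversational chain can therefore build a prompt from the chat history and the follow-up, asking the LLM to rewrite it as a standalone question before retrieval. The sketch below shows only that bookkeeping; the class `ToyChatBuffer`, the helper `build_condense_prompt`, and the prompt wording are hypothetical.

```python
class ToyChatBuffer:
    """Keeps every (question, answer) turn, like a simple conversation buffer."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

def build_condense_prompt(history_text, follow_up):
    """Prompt asking an LLM to rewrite the follow-up as a standalone question."""
    return (
        "Given the conversation below, rephrase the follow-up "
        "as a standalone question.\n"
        f"{history_text}\nFollow-up: {follow_up}\nStandalone question:"
    )

toy_buffer = ToyChatBuffer()
toy_buffer.add("tell me something about PCA", "PCA is a dimensionality reduction method.")
condense_prompt = build_condense_prompt(toy_buffer.as_text(), "how does it work?")
print(condense_prompt)
```

Without the history, "it" in the follow-up has no referent and retrieval would search for the wrong thing.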

### **Download the vector DB**

In [None]:
# Zip the entire folder
!zip -r /content/docs.zip /content/docs

In [None]:
from google.colab import files
files.download("/content/docs.zip")

### **Upload the vector db from previous step and unzip**

In [None]:
!unzip /content/docs.zip  -d /