**What is RAG?**
Retrieval-augmented generation (RAG) is a technique that combines information retrieval with text generation. It modifies interactions with a large language model (LLM) so that the model answers queries with reference to a specified set of documents, using them in preference to information drawn from its own vast, static training data.
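
As a rough, library-free illustration of this idea, the flow can be sketched as "retrieve first, then prompt". The word-overlap score, the sample passages, and the helper names `score`, `retrieve`, and `build_prompt` are assumptions made for illustration only; the notebook below uses embeddings and Gemini instead.

```python
def score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words)


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest toy relevance score."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Put the retrieved passages ahead of the question so the model answers from them."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n\n{joined}\n\nQuestion: {query}"


# Made-up mini corpus standing in for the document chunks
passages = [
    "The Senate shall be composed of two Senators from each State.",
    "No Soldier shall, in time of peace be quartered in any house.",
    "The President shall be Commander in Chief of the Army and Navy.",
]
query = "how many senators does each state have"
top = retrieve(query, passages, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Only the passage about the Senate reaches the prompt, so the model is steered toward the supplied document rather than its training data.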


[Google API key](https://aistudio.google.com/app/apikey)

[Document](https://constitutioncenter.org/media/files/constitution.pdf)

In [1]:
# Install the required libraries
!pip install langchain_google_genai
!pip install langchain_community
!pip install langchain_text_splitters
!pip install pypdf
!pip install chromadb

Collecting langchain_google_genai
  Downloading langchain_google_genai-2.0.9-py3-none-any.whl.metadata (3.6 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain_google_genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading langchain_google_genai-2.0.9-py3-none-any.whl (41 kB)
Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Installing collected packages: filetype, langchain_google_genai
Successfully installed filetype-1.2.0 langchain_google_genai-2.0.9
Collecting langchain_community
  Downloading langchain_community-0.3.17-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.34 (from langchain_community)
  Downloading langchain_core-0.3.35-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.18 (from langchain_community)
  Downloading langchain-0.3.18-py3-none-any.whl.metadata (7.8 kB)
Coll

In [2]:
# Import the libraries used to load the PDF document and build the RAG chain

from langchain_community.document_loaders import PyPDFLoader
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_community.vectorstores import Chroma

In [3]:
path="constitution.pdf"
loader = PyPDFLoader(path)
data = loader.load()

In [4]:
data

[Document(metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'constitution.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}, page_content='NATIONAL CONSTITUTION CENTER \n \n \n \n \n \n \n \n \n \n \n \nTHE \nCONSTITUTION  \nof the United States'),
 Document(metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'constitution.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}, page_content='C O N S T I T U T I O N O F T H E U N I T E D S T A 

In [5]:
len(data)

19

In [6]:
# Split the document into chunks of up to 1000 characters, overlapping by 50

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
docs = text_splitter.split_documents(data)

docs

[Document(metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'constitution.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}, page_content='NATIONAL CONSTITUTION CENTER \n \n \n \n \n \n \n \n \n \n \n \nTHE \nCONSTITUTION  \nof the United States'),
 Document(metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'constitution.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}, page_content='C O N S T I T U T I O N O F T H E U N I T E D S T A 

In [7]:
len(docs)

64
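
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch using a fixed-size sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph and line breaks); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays above: text that straddles a boundary still appears whole in at least one chunk.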

In [None]:
# Set up the embeddings and create a vector store with Google Gemini

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create a Google API key (see the link at the top of the notebook) and paste it here
google_api_key = ""
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=google_api_key)

# Embed every chunk and index the vectors in an in-memory Chroma collection
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
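
Conceptually, a vector store keeps one embedding per chunk and answers a query by ranking chunks by vector similarity. The sketch below shows that idea with cosine similarity. The 3-dimensional vectors are made up and stand in for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers; Chroma's actual index and default distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs; real embeddings have hundreds of dimensions
store = [
    ("chunk about the Senate", [0.9, 0.1, 0.0]),
    ("chunk about the President", [0.1, 0.9, 0.0]),
    ("chunk about amendments", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "tell me about the senate"
results = top_k(query_vec, store, k=2)
print(results)
```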



In [9]:
# Retrieve the 3 chunks most similar to the query from the vector store

retriever = vectorstore.as_retriever(search_type = "similarity", search_kwargs = {"k": 3})
retrieved_docs = retriever.invoke("tell me about the senate")

In [10]:
retrieved_docs

[Document(metadata={'company': '', 'created': 'D:20030612', 'creationdate': '2023-04-10T12:53:44-04:00', 'creator': 'Acrobat PDFMaker 23 for Word', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'page': 14, 'page_label': '15', 'producer': 'Adobe PDF Library 23.1.125', 'source': 'constitution.pdf', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'total_pages': 19}, page_content='the 17th Amendment.)  \nThe Senate of the United States shall be composed of two \nSenators from each State, elected by the people thereof, for \nsix years; and each Senator shall have one vote. The electors \nin each State shall have the qualifications requisite for elec- \ntors of the most numerous branch of the State legislatures. \nWhen vacancies happen in the representation of any State \nin the Senate, the executive authority of such State shall \nissue writs of election to fill such vacancies: Provided, That \nthe legislature of any State may empower the executive \

In [11]:
len(retrieved_docs)

3

In [12]:
print(retrieved_docs[2])

page_content='Plantations one, Connecticut five, New-York six, New 
Jersey four, Pennsylvania eight, Delaware one, Maryland 
six, Virginia ten, North Carolina five, South Carolina five, 
and Georgia three. 
When vacancies happen in the Representation from any 
State, the Executive Authority thereof shall issue Writs of 
Election to fill such Vacancies. 
The House of Representatives shall chuse their 
Speaker and other Officers; and shall have the sole 
Power of Impeachment. 
SECTION. 3 
The Senate of the United States shall be composed of two 
Senators from each State, [chosen by the Legislature there- 
of,]* for six Years; and each Senator shall have one Vote. 
Immediately after they shall be assembled in Consequence 
of the first Election, they shall be divided as equally as may 
be into three Classes. The Seats of the Senators of the first 
Class shall be vacated at the Expiration of the second Year, 
of the second Class at the Expiration of the fourth Year, and' metadata={'company'

In [13]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model = "gemini-1.5-pro", temperature = 0, max_tokens = 500, google_api_key = google_api_key)

In [14]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use only the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say 'The information is not available in the documents provided.' "
    "Do not use any outside knowledge or make assumptions. "
    "Use three sentences maximum and keep the answer concise."
    "\n\n"
    "{context}"
)

# Construct the prompt template with the system and human messages
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [15]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)
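
The "stuff" strategy simply places the text of all retrieved documents into the `{context}` slot of the prompt before the model is called. A plain-Python approximation of that step is sketched here; `stuff_documents` and `format_messages` are hypothetical helpers, the blank-line separator is an assumption, and the template is a shortened version of the system prompt above.

```python
SYSTEM_TEMPLATE = (
    "Use only the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(page_contents: list[str], separator: str = "\n\n") -> str:
    """Concatenate the retrieved chunks into a single context string."""
    return separator.join(page_contents)


def format_messages(question: str, page_contents: list[str]) -> list[tuple[str, str]]:
    """Build the (role, text) pairs that would be sent to the chat model."""
    context = stuff_documents(page_contents)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = format_messages(
    "tell me about the senate",
    [
        "The Senate of the United States shall be composed of two Senators from each State.",
        "Each Senator shall have one Vote.",
    ],
)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```

Because every retrieved chunk goes into one prompt, this approach suits a small `k`; larger result sets can exceed the model's context window.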

In [16]:
response = rag_chain.invoke({"input":"tell me about the senate?"})

print(response["answer"])

The Senate is composed of two senators from each state, elected by the people for six-year terms. Each senator has one vote, and electors must meet the same requirements as electors for the most numerous branch of the state legislature.  When vacancies occur, the state's executive authority issues writs of election to fill them, but the state legislature may empower the executive to make temporary appointments.


In [17]:
response = rag_chain.invoke({"input":" what is the third amendment is talking about?"})

print(response)

{'input': ' what is the third amendment is talking about?', 'context': [Document(metadata={'company': '', 'created': 'D:20030612', 'creationdate': '2023-04-10T12:53:44-04:00', 'creator': 'Acrobat PDFMaker 23 for Word', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'page': 11, 'page_label': '12', 'producer': 'Adobe PDF Library 23.1.125', 'source': 'constitution.pdf', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'total_pages': 19}, page_content='shall not be infringed.  \nAmendment  III. \nNo Soldier shall, in time of peace be quartered in any \nhouse, without the consent of the Owner, nor in time of \nwar, but in a manner to be prescribed by law. \nAmendment  IV. \nThe right of the people to be secure in their persons, hous- \nes, papers, and effects, against unreasonable searches and \nseizures, shall not be violated, and no Warrants shall issue, \nbut upon probable cause, supported by Oath or affirma- \ntion, and particularly describing the 

In [18]:
response = rag_chain.invoke({"input":"what is the role of the president ?"})

print(response["answer"])

The president is Commander in Chief of the Army and Navy, and of the state militias when called into actual service of the United States.  They can grant reprieves and pardons for offenses against the United States, except in cases of impeachment.  Additionally, the president receives ambassadors and other public ministers and ensures laws are faithfully executed.


**LLM Evaluation:**
Evaluating large language models (LLMs) is crucial to ensure their performance, reliability, and alignment with user expectations.
A variety of metrics is used to assess different aspects of their performance. Some of the most common evaluation metrics are:
1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
2. BLEU (Bilingual Evaluation Understudy)
3. Perplexity
4. Human evaluation
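
As one concrete case from the list above, perplexity is the exponential of the average negative log-probability a model assigns to each token; lower values mean the model was less "surprised" by the text. A minimal sketch follows, with made-up token probabilities rather than real model output:

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# A model that spreads probability evenly over 4 options at every step
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is fairly sure of each token (illustrative numbers)
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

The uniform case gives a perplexity of about 4, matching the intuition of "choosing among 4 equally likely tokens".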

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
- **Definition:** ROUGE measures overlap between the generated text and reference text, focusing on recall (how much of the reference text is captured by the generated text).
- **Variants:** ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-W (weighted longest common subsequence).
- **Usage:** Commonly used in summarization tasks.
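
The recall-oriented definitions above can be sketched in a few lines. This is a simplified illustration that tokenizes on whitespace and reports recall only; the function names are hypothetical, the example sentences are made up, and established ROUGE packages typically add stemming and also report precision and F-measure.

```python
from collections import Counter


def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of the reference n-grams that also appear in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference)
    cand_counts = ngrams(candidate)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """LCS length divided by the number of reference tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0


reference = "the senate has two senators from each state"
candidate = "each state has two senators in the senate"
r1 = rouge_n_recall(reference, candidate, n=1)
rl = rouge_l_recall(reference, candidate)
print(r1, rl)  # 0.875 0.375
```

The candidate reuses 7 of the 8 reference words, so ROUGE-1 recall is high, but the words come in a different order, so only "has two senators" forms a common subsequence and ROUGE-L recall is much lower.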



