# **RAG Implementation**

# Install Required Libraries

LangChain is a framework for building applications that integrate language models (such as OpenAI's GPT models) with external tools and data sources. It simplifies development by providing modular components for:

* [Prompt Engineering](https://python.langchain.com/docs/concepts/prompt_templates/): Easily create and manage prompts to guide the language model’s responses.

* Chains: Combine multiple steps (e.g., fetching data, processing it, and generating output) into workflows.

* [Tools](https://python.langchain.com/docs/concepts/tools/) & Agents: Extend model capabilities using tools (e.g., search engines, APIs) and empower dynamic decision-making with agents.

* Memory: Maintain state and context in conversations or across tasks.

* Data Augmentation: Integrate retrieval systems (like Pinecone) for Retrieval-Augmented Generation (RAG).

*Official LangChain Documentation: https://python.langchain.com/docs/introduction/*

In [None]:
pip install langchain



In [None]:
pip install langchain-community


Groq builds super-fast processors designed specifically for artificial intelligence (AI) and machine learning tasks. Their technology focuses on delivering high performance for AI applications, such as natural language processing, computer vision, and large-scale data analysis.

In simple terms, Groq's processors are like turbocharged engines for AI workloads: they run model inference faster and more efficiently than general-purpose processors. This makes them well suited to real-time AI applications, such as serving the chat model that this notebook calls through the Groq API.

*Groq Website: https://console.groq.com/login*

In [None]:
pip install -qU langchain-groq



PyPDF is a Python library used for working with PDF files. It allows you to read, manipulate, and extract information from PDF documents. It's lightweight, easy to use, and supports a variety of PDF-related tasks.

In [None]:
%pip install -qU pypdf


Astra DB is a cloud-based, fully managed database-as-a-service powered by Apache Cassandra, known for its high scalability, reliability, and performance. It is designed for modern applications that require real-time data handling across distributed systems. In this notebook it serves as the vector store for the script embeddings.

*Astra DB Website: https://www.datastax.com/products/datastax-astra*

In [None]:
pip install langchain_astradb


Mistral AI is a company focused on developing large language models (LLMs), many of them released with open weights. It aims to provide powerful, efficient, and transparent AI tools that developers and businesses can use and adapt.

Key Highlights:
* Open-Source Models: Mistral AI releases several of its models under permissive licenses such as Apache 2.0, allowing customization and integration with few restrictions.
* High Performance: Their models are designed to offer competitive performance in natural language processing tasks while being efficient in resource usage.

Mistral AI is part of the growing movement to democratize AI by providing cutting-edge, open-source alternatives to proprietary models.

*Mistral AI Website: https://mistral.ai/*

In [None]:
pip install -qU langchain-mistralai

# Set Up the Environment

In [None]:
import os

from google.colab import userdata  # Colab secret storage for API keys

from langchain_astradb import AstraDBVectorStore
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_groq import ChatGroq
from langchain_mistralai import MistralAIEmbeddings

**Generate API Keys**:

* Groq: https://console.groq.com/keys

* Astra DB: https://accounts.datastax.com/session-service/v1/login

* Mistral AI: https://console.mistral.ai/api-keys/

In [None]:
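# Read the secrets stored in Colab and expose them as environment variables.
# The names passed to userdata.get() are the secret names used in this Colab account.
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = userdata.get('ASTRA_DB_TOKEN')
os.environ["ASTRA_DB_API_ENDPOINT"] = userdata.get('ASTRA_ENDPOINT')
os.environ["ASTRA_DB_KEYSPACE"] = "transcripts"
# Keep the Astra DB credentials in variables for the vector store constructor.
ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"]
# The Mistral key is stored under the Colab secret name 'assign_key'.
os.environ["MISTRAL_API_KEY"] = userdata.get('assign_key')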
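
Before running the rest of the notebook, it can help to confirm that every key actually made it into the environment. The sketch below is one possible check; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here, and the variable list simply mirrors the cell above.

```python
import os

# Names taken from the setup cell; adjust them if your variables are named differently.
REQUIRED_VARS = [
    "GROQ_API_KEY",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "MISTRAL_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Failing early here is usually easier to diagnose than an authentication error raised later by one of the client libraries.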

# RAG Code

**Dataset:**

Bollywood Movie Script: https://www.filmcompanion.in/companion-zone/download-the-script-of-3-idiots

In [None]:
file_path = "/content/3_idiots_SCRIPT.pdf"

In [None]:
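loader = PyPDFLoader(file_path)
pages = []
# Top-level `async for` works in notebook cells because IPython supports top-level await.
# Each item yielded by the loader is a Document holding one PDF page.
async for page in loader.alazy_load():
    pages.append(page)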

In [None]:
pages[8]

Document(metadata={'source': '/content/3_idiots_SCRIPT.pdf', 'page': 8}, page_content='INTRODUCTION\nINTRODUCTION\n03_Introduction.indd   903_Introduction.indd   9 5/31/2010   1:47:14 AM5/31/2010   1:47:14 AM')

In [None]:
llm = ChatGroq(model="llama3-8b-8192")
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x79d0b47d8670>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x79d0b47d9a20>, model_name='llama3-8b-8192', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [None]:
embedding_model = MistralAIEmbeddings(model="mistral-embed")
embedding_model



MistralAIEmbeddings(client=<httpx.Client object at 0x79d0b47d87c0>, async_client=<httpx.AsyncClient object at 0x79d0b47daa70>, mistral_api_key=SecretStr('**********'), endpoint='https://api.mistral.ai/v1/', max_retries=5, timeout=120, wait_time=30, max_concurrent_requests=64, tokenizer=<langchain_mistralai.embeddings.DummyTokenizer object at 0x79d0a3ec87f0>, model='mistral-embed')

In [None]:
vstore = AstraDBVectorStore(
    collection_name="transcriptStorage",
    embedding=embedding_model,
    token=ASTRA_DB_APPLICATION_TOKEN,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    namespace='transcripts',
)

In [None]:
len(pages)

426

In [None]:
inserted_ids = vstore.add_documents(pages[:30])
print(f"Inserted {len(inserted_ids)} documents.")

Inserted 30 documents.


In [None]:
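# Note: this slice starts at index 31, so pages[30] is never inserted; pages[30:50] would include it.
inserted_ids = vstore.add_documents(pages[31:50])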
print(f"Inserted {len(inserted_ids)} documents.")

Inserted 19 documents.


In [None]:
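# Note: this slice starts at index 51, so pages[50] is never inserted; pages[50:70] would include it.
inserted_ids = vstore.add_documents(pages[51:70])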
print(f"Inserted {len(inserted_ids)} documents.")

Inserted 19 documents.
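
The cells above choose each batch with hand-written slice boundaries. Python slices are half-open, so a boundary that starts one index late silently drops a page. One way to avoid this is to generate consecutive batches programmatically. The sketch below uses a hypothetical helper, `contiguous_batches`, and a list of integers as a stand-in for the loaded pages.

```python
def contiguous_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the first 70 loaded pages; in the notebook this would be pages[:70].
demo_pages = list(range(70))
batches = list(contiguous_batches(demo_pages, 25))
print([len(batch) for batch in batches])  # [25, 25, 20]
```

In the notebook, each batch could then be passed to `vstore.add_documents(batch)` inside a loop. The batch size of 25 is arbitrary.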


In [None]:
results = vstore.similarity_search("How does the motor starts", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* 56
PROFESSOR POTDAR 
What’re you smiling for?
RANCHO
Sir, to study engineering was a childhood 
dream. I’m so happy to be here ﬁ nally.
PROFESSOR POTDAR
No need to be so happy. 
PROFESSOR POTDAR
Deﬁ ne a machine.
RANCHO 
Anything that simpliﬁ  es work, or saves 
time, is a machine. 
RANCHO 
It’s a warm day, press a button, get a blast 
of air. The fan ... a machine!
izksQslj iksíkj 
vki eqLdqjk D;ksaa jgs gSa\ ugha
jSUpks
,sDpoyh lj] cpiu ls pkgrk Fkk fd 
bUt+hfu;fjax dkWyst esa i<awA vkt ;gk¡ cSBk 
gw¡A cgqr et+k vk jgk gS ljA
izksQslj iksíkj
T;knk et+k ysus dh t:jr ughaa gSaA 
Taken aback, Rancho stops smiling.
izksQslj iksíkj
cksyks] e'khu dk M~;fQus'ku cksyksA
RANCHO 
A machine is anything that reduces human 
effort.
PROFESSOR POTDAR 
Will you please elaborate?
Rancho stands up and starts to explain.
jSUpks
lj] gj oks pht+ tks bUlku dk dke vklku 
djs ;k oD+r cpk;s, oks e'khu gS] ljA
Chatur frowns at Rancho in disdain.
jSUpks 
xjeh yx jgh gSa\ cVu nck;k] gok pkyw && 
QSu --- e'khu
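
`similarity_search` embeds the query and returns the `k` stored documents whose vectors are closest to it. The plain-Python sketch below illustrates only that ranking step. It assumes cosine similarity as the metric (the metric actually used depends on how the Astra DB collection is configured), and the helpers `cosine_similarity` and `top_k` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(top_k([0.8, 0.6, 0.0], doc_vecs, k=2))  # [1, 0]
```

A vector database performs the same kind of ranking over its stored vectors, typically with an approximate index rather than the exhaustive scan shown here.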

In [None]:
retriever = vstore.as_retriever(search_kwargs={"k": 5})

In [None]:
retriever

VectorStoreRetriever(tags=['AstraDBVectorStore', 'MistralAIEmbeddings'], vectorstore=<langchain_astradb.vectorstores.AstraDBVectorStore object at 0x79d0b47d56f0>, search_kwargs={'k': 5})

In [None]:
template = """
You are an expert in answering questions based on the provided movie script {context}.
The script contains the following:
1. Character names and their dialogues.
2. Scene descriptions.
3. Hindi translations (which you must skip and focus only on the English transcript).
4. Songs (which you can ignore unless explicitly asked about them).

Your task is to:
- Provide concise and accurate answers to user questions based on the entire movie script.
- Skip irrelevant details like Hindi translations unless specified otherwise.

Only answer with respect to script. And do not mention other things like I think, The answer to this question is and etc.

Input:
- Question from the user: {input}

Output:
- Your response:
"""

prompt_for_movie = PromptTemplate(
    template=template,
    input_variables=["context", "input"]
)
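
By default, `PromptTemplate` treats the template as an f-string-style format string, so `{context}` and `{input}` are placeholders that are replaced when the chain runs. The standard-library sketch below shows the same substitution on a shortened stand-in template and checks which placeholder names it contains; `placeholder_names` and `demo_template` are hypothetical names, not part of LangChain.

```python
from string import Formatter


def placeholder_names(template):
    """Return the set of {name} fields that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Shortened stand-in for the notebook's template string.
demo_template = "Script excerpts: {context}\nQuestion from the user: {input}\nYour response:"

print(sorted(placeholder_names(demo_template)))  # ['context', 'input']
print(demo_template.format(context="RANCHO: Aal izz well.", input="What does Rancho say?"))
```

A check like this can catch a mismatch between the names listed in `input_variables` and the placeholders that actually appear in the template text.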

In [None]:
rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt_for_movie
    | llm
    | StrOutputParser()
)
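
The chain reads left to right: the dict computes both prompt inputs from the question, the prompt template fills its placeholders, the model generates a reply, and the output parser returns it as a plain string. The sketch below traces that data flow in plain Python. Everything in it is a stand-in: `fake_retriever` and `fake_llm` are hypothetical replacements for the real retriever and chat model, and `format_docs` is an assumed helper that joins the retrieved text.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Stand-in for the vector-store retriever: always returns the same two "documents".
    return [
        SimpleNamespace(page_content="RANCHO: A machine is anything that reduces human effort."),
        SimpleNamespace(page_content="PROFESSOR POTDAR: Will you please elaborate?"),
    ]


def fake_llm(prompt_text):
    # Stand-in for the chat model: reports the prompt length instead of generating text.
    return f"(model output for a prompt of {len(prompt_text)} characters)"


def run_chain(question):
    # Step 1: the dict at the head of the chain computes both prompt inputs.
    inputs = {"context": format_docs(fake_retriever(question)), "input": question}
    # Step 2: the prompt template fills its placeholders.
    prompt_text = "Script excerpts:\n{context}\n\nQuestion: {input}\nAnswer:".format(**inputs)
    # Step 3: the model responds, and the output parser returns a plain string.
    return prompt_text, fake_llm(prompt_text)


prompt_text, answer = run_chain("Define a machine.")
print(prompt_text)
print(answer)
```

In the chain above, the retriever's list of `Document` objects is placed into the prompt as-is. A helper like `format_docs` could be piped after the retriever (`retriever | format_docs`), a pattern used in LangChain's RAG tutorials, so that the prompt receives only the page text.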

In [None]:
rag_chain

{
  context: VectorStoreRetriever(tags=['AstraDBVectorStore', 'MistralAIEmbeddings'], vectorstore=<langchain_astradb.vectorstores.AstraDBVectorStore object at 0x79d0b47d56f0>, search_kwargs={'k': 5}),
  input: RunnablePassthrough()
}
| PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template='\nYou are an expert in answering questions based on the provided movie script {context}.\nThe script contains the following:\n1. Character names and their dialogues.\n2. Scene descriptions.\n3. Hindi translations (which you must skip and focus only on the English transcript).\n4. Songs (which you can ignore unless explicitly asked about them).\n\nYour task is to:\n- Provide concise and accurate answers to user questions based on the entire movie script.\n- Skip irrelevant details like Hindi translations unless specified otherwise.\n\nOnly answer with respect to script. And do not mention other things like I think, The answer to this question is and etc.\n

In [None]:
out = rag_chain.invoke("Who is rancho")

In [None]:
out

'Rancho is the main character in the movie.'

In [None]:
out = rag_chain.invoke("Can name any song in the movie")

In [None]:
out

'Give me some sunshine and Aal izz Well are the two songs mentioned in the script.'

In [None]:
out = rag_chain.invoke("what is the movie name")
out

'The movie name is 3 Idiots.'

In [None]:
out = rag_chain.invoke("who are these 3 idiots refering in the movie")
out

'The term "3 Idiots" refers to the three main characters of the movie, Farhan, Raju, and Rancho.'

In [None]:
out = rag_chain.invoke("explain the all is well song in short")
out

'The song "Aal Izz Well" is a mantra that Rancho teaches to his friends, Farhan and Raju. It\'s a way to trick their hearts into believing that everything is fine, even when it\'s not. Rancho explains that when our heart is scared, we need to con it into believing that everything is well, and this mantra helps to do just that.'