<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/langchain_masterclass/04_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG - Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG) is an AI framework that enhances the accuracy and relevance of responses generated by large language models (LLMs). It combines two techniques:

1. Retrieval: The model first retrieves relevant information from external sources, such as databases, documents, or the web.
2. Generation: It then uses this retrieved information to inform and enhance the generation of responses.

This approach helps keep responses accurate and relevant by grounding them in the most up-to-date and specific information available. RAG is particularly useful for building more reliable and effective AI systems across various applications.
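The two steps can be sketched without any library. The snippet below is a toy illustration only, not how LangChain implements RAG: `retrieve` ranks passages by word overlap with the question (a crude stand-in for embedding similarity), and `build_prompt` places the retrieved text in front of the question, which is what would then be sent to an LLM. The function names and the sample passages are made up for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=1):
    """Step 1 (retrieval): rank passages by the number of words shared with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_passages):
    """Step 2 (input to generation): put the retrieved context in front of the question."""
    context = "\n".join(context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


passages = [
    "Lagaan is a 2001 film directed by Ashutosh Gowariker.",
    "Ishaan is an 8-year-old boy living in Mumbai.",
]
question = "Who directed Lagaan?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the word-overlap ranking is replaced by a vector search over embeddings, and the prompt is passed to the chat model.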

# Setup environment

In [150]:
!pip install -q langchain langchain_community langchain-openai chromadb randomname



In [161]:
import os
import shutil
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from dotenv import load_dotenv

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
import randomname
from glob import glob

In [74]:
if os.path.exists(".env"):
    os.remove(".env")

from google.colab import files
uploaded = files.upload()
if uploaded:
    if load_dotenv(".env"):
        print("Uploaded and Loaded Successfully")

Saving .env to .env
Uploaded and Loaded Successfully


## 1. Load LLM Model

In [75]:
model = ChatOpenAI(model='gpt-3.5-turbo-0125')
model

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7b875a7bada0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7b875a7a83a0>, root_client=<openai.OpenAI object at 0x7b8727724c40>, root_async_client=<openai.AsyncOpenAI object at 0x7b875a7b8b50>, model_name='gpt-3.5-turbo-0125', model_kwargs={}, openai_api_key=SecretStr('**********'))

# High-level RAG Pipeline using LangChain

## Load the source text

In [175]:
!wget -q https://raw.githubusercontent.com/kavyajeetbora/nlp_rag/refs/heads/master/data/taare_zameen_par.txt -O taare_zameen_par.txt
!wget -q https://raw.githubusercontent.com/kavyajeetbora/nlp_rag/refs/heads/master/data/swades.txt -O swades.txt
!wget -q https://raw.githubusercontent.com/kavyajeetbora/nlp_rag/refs/heads/master/data/munna_bhai.txt -O munna_bhai.txt
!wget -q https://raw.githubusercontent.com/kavyajeetbora/nlp_rag/refs/heads/master/data/lagaan.txt -O lagaan.txt

In [176]:
text_file_path = "taare_zameen_par.txt"

if os.path.exists(text_file_path):
    loader = TextLoader(text_file_path)
    documents = loader.load()
    print("Source Document Loaded Successfully")
else:
    print(f"No file called {text_file_path} found")

Source Document Loaded Successfully


In [177]:
documents[0].metadata

{'source': 'taare_zameen_par.txt'}

## Split the text into Chunks

Why chunking? An LLM can only accept a limited number of tokens in a single request, so we cannot pass it the whole source text. Instead, we break the large text into smaller pieces that can be embedded and retrieved individually.
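To make the idea concrete, here is a minimal sketch of greedy chunking in plain Python. It is a simplified stand-in for what a character-based splitter does, not the actual `CharacterTextSplitter` implementation: it ignores `chunk_overlap`, and the function name `split_into_chunks` is made up for this example.

```python
def split_into_chunks(text, chunk_size=300, separator=" "):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = ("Ishaan is an 8-year-old boy living in Mumbai, "
        "who has trouble following school.")
chunks = split_into_chunks(text, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Because the split happens only at the separator, chunks end on word boundaries and are usually a little shorter than `chunk_size`.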

In [178]:
text_splitter = CharacterTextSplitter(separator = " ", chunk_size=300, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
len(docs)

14

In [179]:
## Here is the first chunk
docs[0]

Document(metadata={'source': 'taare_zameen_par.txt'}, page_content="Ishaan is an 8-year-old boy living in Mumbai, who has trouble following school. He is assumed by all to simply hate learning and deemed a troublemaker, and is belittled for it. He has even repeated the 3rd standard due to his academic failures from the previous year. Ishaan's imagination,")

## Load Embedding Model

Embedding models are mostly encoder-based models (for example, the BERT and RoBERTa architectures).

Here is a brief overview of encoder-only models:

1. What they are

    An encoder-only model is a type of machine learning model that focuses on understanding text rather than generating new text.
2. What they do

    Encoder-only models analyze the meaning of words and sentences in a text and produce task-specific outputs such as labels or token predictions.
3. What they're good for

    They are well suited to tasks that require understanding text, such as text classification, extractive question answering, and sentiment analysis.
4. How they work

    Encoder-only models process input text bidirectionally, considering the context of each word from both its left and right sides. This lets them capture the meaning of each word in its full context.
5. Examples

    BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa are examples of encoder-only models.

Encoder-only models differ from decoder-only models, which generate text token by token and are used for generative tasks such as open-ended Q&A.
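An embedding model turns each piece of text into a vector, and texts with similar meaning should end up with vectors that point in similar directions. A common way to compare two such vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; the variable names are hypothetical and no embedding model is called.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real models return far more dimensions.
query_vec = [0.9, 0.1, 0.0]
chunk_about_ishaan = [0.8, 0.2, 0.1]
chunk_about_cricket = [0.1, 0.1, 0.9]

sim_ishaan = cosine_similarity(query_vec, chunk_about_ishaan)
sim_cricket = cosine_similarity(query_vec, chunk_about_cricket)
print(sim_ishaan, sim_cricket)
```

The chunk whose vector points in nearly the same direction as the query gets the higher score, which is the basis for ranking chunks at retrieval time.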

In [180]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

## Create a Vector Database

We will store the embeddings of all the source chunks in a vector database.

To build it, we will use [`langchain_community.Chroma.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma.from_documents).

In [181]:
os.makedirs("db", exist_ok=True)

random_suffix = randomname.get_name()

persistent_directory = f"db/chroma-({random_suffix})"

## If already there, delete and create a new one
for folder in glob("db/chroma*"):
    if os.path.exists(folder):
        shutil.rmtree(folder)

os.mkdir(persistent_directory)

vector_db = Chroma.from_documents(
    documents = docs,
    collection_name = "movie_embeddings_v1",
    embedding = embeddings,
    persist_directory = persistent_directory
)

## Retrieving Relevant Chunks based on a query

In [186]:
query = "Who was Ishaan ?"

retriever = vector_db.as_retriever(
    search_type = "similarity_score_threshold",
    search_kwargs = {"k": 2, "score_threshold": 0.5}
)

relevant_docs = retriever.invoke(query)

In [187]:
for doc in relevant_docs:
    print("Source:",doc.metadata['source'])
    print(doc.page_content)
    print("-"*30)

Source: taare_zameen_par.txt
Ishaan is an 8-year-old boy living in Mumbai, who has trouble following school. He is assumed by all to simply hate learning and deemed a troublemaker, and is belittled for it. He has even repeated the 3rd standard due to his academic failures from the previous year. Ishaan's imagination,
------------------------------
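Conceptually, a score-threshold search ranks the chunks by relevance to the query, keeps at most `k` of them, and drops any whose score is below `score_threshold`. This is one reason a query can return fewer than `k` chunks. The sketch below illustrates that logic with hypothetical scores; it is not LangChain's or Chroma's implementation, and `retrieve_with_threshold` is a made-up name.

```python
def retrieve_with_threshold(scored_chunks, k=2, score_threshold=0.5):
    """Return up to k (chunk, score) pairs whose score is at least score_threshold."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    top_k = ranked[:k]
    return [(chunk, score) for chunk, score in top_k if score >= score_threshold]


# Hypothetical relevance scores in [0, 1]; a real vector store derives them from embeddings.
scored_chunks = [
    ("Ishaan is an 8-year-old boy living in Mumbai", 0.62),
    ("chunk B", 0.41),
    ("chunk C", 0.18),
]
results = retrieve_with_threshold(scored_chunks, k=2, score_threshold=0.5)
print(results)
```

Lowering the threshold or raising `k` returns more chunks, at the cost of including less relevant ones.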


## Creating Metadata for documents

When embedding documents, it is a good idea to attach metadata to each one, such as its title, filename, file size, page count, or author.

This matters at retrieval time: when the LLM generates text from a retrieved chunk, the metadata tells us which source to check in order to validate the answer.
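As a small illustration of why this helps, the sketch below formats retrieved chunks together with a reference to where each came from. `Doc` is a hypothetical stand-in for a LangChain `Document` (text plus a metadata dict), and `format_with_sources` is a made-up helper, not a library function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_with_sources(docs):
    """Render each retrieved chunk with a numbered reference to its source file."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        # Fall back to "unknown" when no source was recorded for the chunk.
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[{i}] ({source}) {doc.page_content}")
    return "\n".join(lines)


retrieved = [
    Doc("Ishaan is an 8-year-old boy living in Mumbai", {"source": "taare_zameen_par.txt"}),
    Doc("A chunk whose origin was never recorded"),
]
report = format_with_sources(retrieved)
print(report)
```

A chunk without metadata can still be retrieved, but there is no way to trace it back to a file, which is what the second entry shows.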

In [188]:
## creating a vector database from many documents
documents = []
for txt_file_path in glob("*.txt"):

    loader = TextLoader(txt_file_path)
    docs = loader.load()

    for doc in docs:
        doc.metadata = {"source": txt_file_path}
        documents.append(doc)

In [194]:
## Split the documents into chunks
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=500, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)
len(chunks)

62

In [196]:
len(chunks[0].page_content)

494

In [197]:
os.makedirs("db", exist_ok=True)

random_suffix = randomname.get_name()

persistent_directory = f"db/chroma-({random_suffix})"

## If already there, delete and create a new one
for folder in glob("db/chroma*"):
    if os.path.exists(folder):
        shutil.rmtree(folder)

os.mkdir(persistent_directory)

vector_db = Chroma.from_documents(
    documents = chunks,
    collection_name = "movie_embeddings_v2",
    embedding = embeddings,
    persist_directory = persistent_directory
)

Retrieve the text using a query:

In [216]:
retriever = vector_db.as_retriever(
    search_type = "similarity_score_threshold",
    search_kwargs = {"k": 3, "score_threshold": 0.1}
)

In [217]:
query = "Explain the character of Munna"
retriever.invoke(query)

[Document(metadata={'source': 'munna_bhai.txt'}, page_content='and forgive him. Munna ends up marrying Suman after learning of her true identity as "Chinki", and together, they open a real hospital in Munna\'s family village. Circuit also gets married a year later and has a son nicknamed "Short Circuit". Asthana resigns as the dean and becomes the head doctor, employing Munna\'s methods, while Rustom succeeds him. As the film concludes, Anand, restored to normal mental health, narrates the story to a few children at the hospital as he is about to leave for'),
 Document(metadata={'source': 'munna_bhai.txt'}, page_content="Munna Bhai film series, the film follows Munna Bhai, a don in the Mumbai underworld, trying to please his father by pretending to be a doctor, but when a doctor, Asthana (Irani), exposes his lies and tarnishes his father's honor, Munna enrolls in a medical college. Chaos ensues when Munna, upon finding that Asthana is the dean of the college, vows revenge, while also spa

In [218]:
query = "A.R Rahman composed the music of which movies ?"
retriever.invoke(query)

[Document(metadata={'source': 'swades.txt'}, page_content='was composed by A. R. Rahman, with lyrics penned by Javed Akhtar.\n\nSwades was theatrically released on 17 December 2004, and it opened to rave reviews from critics, with praise for the performances of Khan, Joshi and Ballal, and the story, screenplay, and soundtrack. However, it emerged as a commercial failure at the box office.\n\nAt the 50th Filmfare Awards, Swades received 8 nominations, including Best Film, Best Director (Gowarikar) and Best Music Director (Rahman), and won Best Actor (Khan)'),
 Document(metadata={'source': 'swades.txt'}, page_content="and Best Background Score (Rahman).\n\nIt was dubbed in Tamil as Desam and released on 26 January 2005, coinciding with Indian Republic Day. Despite its commercial failure, Swades is regarded ahead of its time and is now considered a cult classic of Hindi cinema and one of the best films in Shah Rukh Khan's filmography. [10][11] The film is owned by Red Chillies Entertainment

In [222]:
query = "In which movies the director was Ashutosh Gowariker ?"
retriever.invoke(query)

[Document(metadata={'source': 'lagaan.txt'}, page_content='Lagaan: Once Upon a Time in India, or simply Lagaan, (transl.\u2009Land tax) is a 2001 Indian Hindi-language epic period musical[5] sports drama film written and directed by Ashutosh Gowariker. The film was produced by Aamir Khan, who stars alongside debutant Gracy Singh and British actors Rachel Shelley and Paul Blackthorne. Set in 1893, during the late Victorian period of British colonial rule in India, the film follows the inhabitants of a village in Central India, who, burdened by high taxes and'),
 Document(metadata={'source': 'swades.txt'}, page_content="Swades: We, the People (transl.\u2009Homeland) is a 2004 Indian Hindi-language drama film co-written, directed and produced by Ashutosh Gowariker.[3] The film stars Shah Rukh Khan, Gayatri Joshi and Kishori Ballal while Daya Shankar Pandey, Rajesh Vivek, Lekh Tandon appear in supporting roles.\n\nThe plot was based on two episodes of the series Vaapsi on Zee TV's Yule Love 