
# Exercise Notebook: Implementing RAG (Retrieval-Augmented Generation)

In this exercise notebook, you will go through the steps required to implement Retrieval-Augmented Generation (RAG).
The notebook will guide you through each step, providing explanations and asking you to fill in the code.

Please fill in the code cells where prompted to complete the implementation.

**Let's get started!**



## Installing Required Libraries

Before starting, ensure you have all the necessary libraries installed.
Install the following libraries by running the appropriate command below.

- `langchain`
- `langchain_community`
- `unstructured`
- `sentence_transformers`
- `tiktoken`
- `chromadb`
- `langchain_chroma`
- `langchain_groq`

Fill in the installation command in the code cell below:


In [1]:
!pip install langchain
!pip install langchain_community
!pip install unstructured
!pip install sentence_transformers
!pip install tiktoken
!pip install chromadb
!pip install langchain_chroma
!pip install langchain_groq


In [2]:
!kaggle datasets download -d khaledzsa/example

Dataset URL: https://www.kaggle.com/datasets/khaledzsa/example
License(s): unknown
Downloading example.zip to /content
 93% 10.0M/10.7M [00:01<00:00, 14.5MB/s]
100% 10.7M/10.7M [00:01<00:00, 8.71MB/s]


In [3]:
! unzip /content/example.zip

Archive:  /content/example.zip
  inflating: example.csv             



## Import Necessary Modules

Now, you need to import the necessary modules to build the RAG system.
Write the import statements for the libraries required in the following code cell.


In [8]:
import os
import re

import markdown
import nltk
import pandas as pd
# Import from the maintained packages installed above rather than the
# deprecated `langchain.embeddings` / `langchain.vectorstores` paths.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from nltk.tokenize import word_tokenize

nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data Pre-processing and Preparation

In this section, we will focus on preparing the dataset for retrieval-based models. The steps involve cleaning the text, tokenizing it, and vectorizing it for further use in our model. These steps are essential for efficient retrieval and generation.

In [13]:
df = pd.read_csv('/content/example.csv')
df.head()

Unnamed: 0,title,content
0,Geographic distribution and population,"According to the Indian census of 2001, there ..."
1,Language and literature,Malayalam is the language spoken by the Malaya...
2,Arrival of Cove Reber and Saosin EP (2004-2006),After the audition process and several guest v...
3,Formation and Translating the Name (2003-2004),"The original lineup for Saosin, consisting of ..."
4,Red Hot Organization and Tommy Boy Records dis...,"In 1996, Coolio appeared on the Red Hot Organi..."


In [14]:
# Clean the text: normalize whitespace, lowercase, then strip links and punctuation.
def clean_text(text):
    # Replace newlines with spaces and collapse repeated whitespace
    text = text.replace('\n', ' ')
    text = ' '.join(text.split())

    # Lowercase
    text = text.lower()

    # Remove links and punctuation
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

df['content'] = df['content'].apply(clean_text)
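
As a quick check of what this cleaning does, the sketch below applies the same steps to a made-up sentence. The sample string and its URL are illustrative only and do not come from the dataset.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell: whitespace, lowercase, links, punctuation
    text = text.replace('\n', ' ')
    text = ' '.join(text.split())
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

sample = "The Beatles' folk-rock\nstyle: see http://example.com/x now!"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Two side effects are worth noticing. Removing punctuation fuses hyphenated and apostrophized words (`folk-rock` becomes `folkrock`), and because whitespace is collapsed before the link is removed, a double space is left where the URL used to be.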

In [11]:
# # tokenize
# df['content'] = df['content'].apply(word_tokenize)

In [15]:
df.head()

Unnamed: 0,title,content
0,Geographic distribution and population,according to the indian census of 2001 there w...
1,Language and literature,malayalam is the language spoken by the malaya...
2,Arrival of Cove Reber and Saosin EP (2004-2006),after the audition process and several guest v...
3,Formation and Translating the Name (2003-2004),the original lineup for saosin consisting of b...
4,Red Hot Organization and Tommy Boy Records dis...,in 1996 coolio appeared on the red hot organiz...


## Read Files from the Directory

In this step, we will read all text-based files from a specified directory. Because the dataset arrives as a CSV, we first write each row to its own Markdown file and then read those files back. In general the files could be in various formats such as Markdown (`.md`) or plain text (`.txt`); we handle each file based on its extension and process it accordingly.

### Steps to Follow:

1. **Specify the directory**: Define the directory from which to load the files.
2. **Read files by extension**: Filter files based on their extensions (e.g., `.md`, `.txt`, etc.).
3. **Convert or process content**: For each file, load the content. For markdown files, we will convert them into HTML using the `markdown` module. For other text formats, we will simply read the content as plain text.
4. **Store the processed content**: The result of each file’s content will be stored in a list for further use.
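
The steps above can be sketched with the standard library alone. `load_text_files` is a hypothetical helper, the temporary directory and file names are made up for the demonstration, and the Markdown-to-HTML conversion is left out so that the sketch has no third-party dependency.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, extensions=(".md", ".txt")):
    """Return the contents of every file in `directory` with a matching extension."""
    texts = []
    # Sort so the result order does not depend on the filesystem
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            texts.append(path.read_text(encoding="utf-8"))
    return texts

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.md").write_text("# Title\n\nbody", encoding="utf-8")
    Path(tmp, "b.txt").write_text("plain text", encoding="utf-8")
    Path(tmp, "c.png").write_bytes(b"\x89PNG")  # skipped: not a text extension
    loaded = load_text_files(tmp)

print(loaded)
```

Sorting the paths is a design choice: `os.listdir` and `iterdir` make no ordering guarantee, so without it the index of a document in the list can differ from its file name.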

In [16]:
directory = 'data/markdown_files'
os.makedirs(directory, exist_ok=True)

In [17]:
# Write at most the first 5,000 rows; min() avoids an IndexError on smaller datasets.
for i in range(min(5_000, len(df))):

    title = df['title'].iloc[i]
    content = df['content'].iloc[i]

    markdown_content = f"# {title}\n\n"
    markdown_content += f"{content}\n\n"

    with open(f'{directory}/{i}.md', 'w', encoding='utf-8') as file:
        file.write(markdown_content)

In [18]:
markdown_texts = []
for filename in os.listdir(directory):
    if filename.endswith(".md"):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            markdown_content = file.read()
        # Convert the Markdown source to HTML before storing it
        html_content = markdown.markdown(markdown_content)
        markdown_texts.append(html_content)

In [19]:
markdown_texts[1]

'<h1>Musical style and development</h1>\n<p>in icons of rock an encyclopedia of the legends who changed music forever scott schinder and andy schwartz describe the beatles musical evolution in their initial incarnation as cheerful wisecracking moptops the fab four revolutionised the sound style and attitude of popular music and opened rock and rolls doors to a tidal wave of british rock acts their initial impact would have been enough to establish the beatles as one of their eras most influential cultural forces but they didnt stop there although their initial style was a highly original irresistibly catchy synthesis of early american rock and roll and rb the beatles spent the rest of the 1960s expanding rocks stylistic frontiers consistently staking out new musical territory on each release the bands increasingly sophisticated experimentation encompassed a variety of genres including folkrock country psychedelia and baroque pop without sacrificing the effortless mass appeal of their e

## Split the Text into Chunks

In this step, we will split the text into manageable chunks. This is important for tasks such as document retrieval and text generation, where large bodies of text need to be broken down for efficient processing.

### Why Split Text into Chunks?

- **Memory Efficiency**: Working with smaller pieces of text is more memory efficient.
- **Improved Retrieval**: Splitting long documents into smaller sections can improve the relevance of retrieval tasks.
- **Better Generation**: For text generation, smaller chunks help models focus on a specific context.

### Steps to Follow:

1. **Specify the chunk size**: Define the maximum number of words or characters per chunk.
2. **Split the text**: Split each document or file content into chunks based on the defined size.
3. **Handle incomplete chunks**: If a document ends with a chunk that is smaller than the chunk size, include it as a valid chunk.
4. **Store the chunks**: Store all chunks in a list for further processing.
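
To make the steps concrete, here is a minimal sketch of fixed-size character chunking with overlap. It assumes plain slicing; `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. `chunk_text` is a hypothetical helper, not part of LangChain.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters,
    where consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The final chunk is shorter than `chunk_size` and is kept as a valid chunk, as step 3 describes.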

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = text_splitter.create_documents(markdown_texts)

## Initialize the Embedding Model & Create a Vector Store Using Chroma

In this step, we will initialize an embedding model to convert text chunks into numerical vectors. These embeddings will be used to measure the similarity between different chunks of text. After generating the embeddings, we will store them using Chroma, a vector store designed to efficiently manage and retrieve embeddings.

### Steps to Follow:

1. **Initialize the embedding model**: Choose an embedding model (e.g., Sentence Transformers or OpenAI embeddings) to convert text into vectors.
2. **Generate embeddings**: Convert each text chunk into its corresponding embedding.
3. **Create a vector store**: Use Chroma to store the embeddings and their associated metadata (e.g., the original text chunk).
4. **Verify the store**: Ensure that the embeddings are stored correctly and that you can retrieve them based on similarity.
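
The idea behind these steps can be sketched without any model or database. In the toy example below, `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), the in-memory list stands in for Chroma, and the three chunks are illustrative sentences.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative chunks; a real store would hold the split documents
chunks = [
    "malayalam is the language spoken by the malayali people",
    "the original lineup for saosin formed in 2003",
    "coolio appeared on a compilation album in 1996",
]
# "Vector store": each entry pairs a vector with its original text
store = [(toy_embed(chunk), chunk) for chunk in chunks]

def search(query, store, top_k=1):
    query_vec = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:top_k]]

best = search("which language is spoken", store)
print(best)
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval loop has the same shape: embed the query, score it against every stored vector, and return the texts of the best matches.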

In [21]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(documents, embedding_function, persist_directory="./chroma_db")


## Load the Persistent Directory for Chroma DB

In this step, we will focus on **loading** the persistent storage for Chroma DB. This allows us to access previously stored embeddings and metadata without recomputing them. By setting up persistent storage, we ensure that the vector database can be saved to disk and loaded again when needed.

### Steps to Follow:

1. **Specify the persistent directory**: Identify the directory where the Chroma DB is stored.
2. **Load the vector store**: Use Chroma to load the embeddings and metadata from this directory.
3. **Verify the loaded data**: Ensure that the embeddings and associated data have been correctly loaded and can be queried.
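
The benefit of persistence is that vectors are computed once and reused. The sketch below illustrates that save-then-reload pattern with a JSON file; it is only an analogy for what Chroma does with its own on-disk format, and `save_store`, `load_store`, the file name, and the vectors are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_store(store, path):
    Path(path).write_text(json.dumps(store), encoding="utf-8")

def load_store(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Made-up entries: each pairs a text chunk with its (toy) vector
store = [
    {"text": "chunk one", "vector": [0.1, 0.2, 0.3]},
    {"text": "chunk two", "vector": [0.4, 0.5, 0.6]},
]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "toy_store.json"
    save_store(store, store_path)
    # A later session can reload the vectors instead of re-embedding the text
    reloaded = load_store(store_path)

print(reloaded == store)
```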

In [53]:
import os
import json
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [54]:
# Must match the `persist_directory` used when the store was created
PERSIST_DIRECTORY = "./chroma_db"
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embedding_function)



In [55]:
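def query_chroma_db(query, db, top_k=4):
    # Pass top_k through to the search; 4 matches the number of results
    # the search returned when no k was given.
    docs = db.similarity_search(query, k=top_k)
    results = [doc.page_content for doc in docs]
    return results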

In [74]:
query_chroma_db("how is quayle?", db)

['he has also been on the board of directors of other companies including k2 sports amtran inc central newspapers inc btc inc and carvana co his son ben quayle was elected to the united states house of representatives in 2010 but failed to win reelection in 2012 cannotanswer</p>',
 '<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999',
 'of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a member of th

## Create & Test the Retrieval with a Sample Query

In this step, we will set up the retrieval process using the embeddings stored in Chroma DB. Retrieval is a key part of the Retrieval-Augmented Generation (RAG) pipeline, allowing us to find relevant documents or text chunks based on a query. After setting up the retrieval system, we will test it with a sample query to ensure that it returns the most relevant chunks.

### Steps to Follow:

1. **Set up the retrieval system**: Using the Chroma DB with the stored embeddings, create a retrieval function that can match a query to relevant text chunks.
2. **Prepare a sample query**: Define a query that you want to search for in the stored text chunks.
3. **Retrieve relevant chunks**: Use the query to search the vector store and retrieve the most similar chunks.
4. **Test the results**: Check that the returned chunks are relevant to the query and adjust the retrieval system if needed.

In [57]:
PROMPT_TEMPLATE="""
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

prompt_template = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
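
To see what the model will actually receive, the template can be filled by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`, an assumption that holds for simple `{name}` placeholders like these. The two chunk strings are placeholders for retrieved text.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

# Placeholder chunks standing in for the retrieved documents
retrieved_chunks = ["first retrieved chunk", "second retrieved chunk"]

# Chunks are joined with a blank line between them, as in the RAG function
context_text = "\n\n".join(retrieved_chunks)
filled_prompt = PROMPT_TEMPLATE.format(
    context=context_text, question="how is quayle?"
)
print(filled_prompt)
```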

In [58]:
# Read the Groq API key from the environment instead of hard-coding it in the notebook.
groq_api_key = os.environ["GROQ_API_KEY"]
llm = ChatGroq(temperature=0, groq_api_key=groq_api_key, model_name="llama3-8b-8192")

In [59]:
MODEL = LLMChain(llm=llm,
                 prompt=prompt_template,
                 verbose=True)

In [60]:
def query_rag(query: str):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    context_text = "\n\n".join([doc.page_content for doc, _score in similarity_search_results])

    rag_response = MODEL.invoke({"context": context_text, "question": query})

    return rag_response

In [75]:
response = query_rag("how is quayle?")
response



Prompt after formatting:
Answer the question based only on the following context:
Context: he has also been on the board of directors of other companies including k2 sports amtran inc central newspapers inc btc inc and carvana co his son ben quayle was elected to the united states house of representatives in 2010 but failed to win reelection in 2012 cannotanswer</p>

<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999

of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus 

{'context': 'he has also been on the board of directors of other companies including k2 sports amtran inc central newspapers inc btc inc and carvana co his son ben quayle was elected to the united states house of representatives in 2010 but failed to win reelection in 2012 cannotanswer</p>\n\n<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999\n\nof new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a mem

In [76]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:
he has also been on the board of directors of other companies including k2 sports amtran inc central newspapers inc btc inc and carvana co his son ben quayle was elected to the united states house of representatives in 2010 but failed to win reelection in 2012 cannotanswer</p>

<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999

of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a member of t

In [77]:
query = "how is quayle?"
similarity_search_results = db.similarity_search_with_score(query, k=4)

In [78]:
print("First: ", similarity_search_results[0][0].page_content)
print("Second: ", similarity_search_results[1][0].page_content)
print("Third: ", similarity_search_results[2][0].page_content)
print("Fourth: ", similarity_search_results[3][0].page_content)

First:  he has also been on the board of directors of other companies including k2 sports amtran inc central newspapers inc btc inc and carvana co his son ben quayle was elected to the united states house of representatives in 2010 but failed to win reelection in 2012 cannotanswer</p>
Second:  <p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999
Third:  of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been

In [79]:
print(similarity_search_results[0][1])
print(similarity_search_results[1][1])
print(similarity_search_results[2][1])
print(similarity_search_results[3][1])

0.7989358901977539
1.030447006225586
1.052169919013977
1.0779919624328613
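
The four scores above increase from the first result to the fourth. Assuming they are distances (LangChain's Chroma wrapper documents that a lower score means a closer match), the direction of a threshold comparison matters. The sketch below replays the printed scores with placeholder chunk names and compares the two directions.

```python
# Scores copied from the output above; chunk names are placeholders
scored_chunks = [
    ("chunk 1", 0.7989358901977539),
    ("chunk 2", 1.030447006225586),
    ("chunk 3", 1.052169919013977),
    ("chunk 4", 1.0779919624328613),
]
threshold = 0.80

# Keeps chunks FARTHER from the query than the threshold
kept_above = [name for name, score in scored_chunks if score > threshold]

# Keeps chunks at least as CLOSE to the query as the threshold
kept_below = [name for name, score in scored_chunks if score <= threshold]

print(kept_above)
print(kept_below)
```

With a distance metric, `score > threshold` discards the closest match and keeps the weaker ones, while `score <= threshold` keeps only the closest one. Which direction is appropriate depends on whether the goal is to drop weak matches or to exclude near-duplicates of the query.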


In [80]:
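def query_rag_with_threshold(query: str, threshold: float):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    # Note: Chroma scores are distances (lower = more similar), so `score > threshold`
    # keeps the chunks that are farther from the query than `threshold`.
    # Use `score <= threshold` to keep only the closest matches instead.
    context_text = "\n\n".join([doc.page_content for doc, score in similarity_search_results if score > threshold])
    rag_response = MODEL.invoke({"context": context_text, "question": query})
    return rag_response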

In [81]:
response = query_rag_with_threshold("how is quayle?", 0.80)
response



Prompt after formatting:
Answer the question based only on the following context:
Context: <p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999

of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a member of the board of directors of heckmann corporation a watersector company since the companys inception and serves as chairman of the companys compensation and nominating  governance committ

{'context': '<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999\n\nof new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a member of the board of directors of heckmann corporation a watersector company since the companys inception and serves as chairman of the companys compensation and nominating  governance committees quayle is a director of aozora bank tokyo japan he has also been on the board of direc

In [82]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:
<p>quayle lives with his wife marilyn quayle in paradise valley arizona quayle authored a 1994 memoir standing firm which became a bestseller his second book the american family discovering the values that make us strong was published in 1996 and a third book worth fighting for in 1999 quayle writes a nationally syndicated newspaper column serves on a number of corporate boards chairs several business ventures and was chairman of campaign america a national political action committee in 1999

of new york as potentially a very serious misuse of the vice presidents office quayle is an honorary trustee emeritus of the hudson institute and is president of quayle and associates he has also been a member of the board of directors of heckmann corporation a watersector company since the companys inception and serves as chairman of the companys compensation and nominating  governance committees quayle is a director of aozora bank tokyo japan he has also been on the board of directors o