# **1️⃣ Why is Data Preparation Important for RAG?**
RAG systems combine retrieval and generation, so poorly prepared data leads to irrelevant retrievals, and a generator working from the wrong context is more likely to hallucinate.

## **📌 Good Data Preparation Ensures:**
✅ Efficient Retrieval → Finds the most relevant documents.

✅ Reduced Latency → Optimized indexing speeds up search.

✅ Better Generation Quality → Provides accurate and context-rich responses.

✅ Improved Scalability → Handles large-scale corpora effectively.

# **2️⃣ Key Steps in Data Preparation for RAG**
RAG applications need structured and indexed data for effective retrieval. The data pipeline consists of:

1️⃣ Data Collection & Preprocessing → Cleaning raw text, removing noise.

2️⃣ Text Chunking & Segmentation → Breaking documents into meaningful parts.

3️⃣ Metadata Enrichment → Adding tags for better retrieval.

4️⃣ Embedding & Indexing → Converting text into searchable vectors.

5️⃣ Storage & Retrieval Optimization → Efficiently managing vector databases.

# **3️⃣ Data Collection & Preprocessing**
Before documents are indexed for retrieval, the raw data must be cleaned and normalized.

## **🔹 Steps in Preprocessing**
✅ Remove HTML, Special Characters, and Formatting Issues

✅ Tokenization & Lowercasing

✅ Stopword Removal (Optional)

✅ Lemmatization or Stemming

✅ Handling Missing Data & Duplicates

In [3]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample raw text
text = "<p>This is an <b>example</b> document for RAG applications.</p>"

# Step 1: Remove HTML tags (replace each tag with a space so adjacent words do not merge)
clean_text = re.sub(r'<[^>]+>', ' ', text)

# Step 2: Tokenization
tokens = word_tokenize(clean_text.lower())

# Step 3: Stopword Removal & Lemmatization
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalnum() and word not in stop_words]

print("Processed Tokens:", processed_tokens)



Processed Tokens: ['example', 'document', 'rag', 'application']
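
The cell above does not cover the last preprocessing step, handling missing data and duplicates. A minimal sketch of that step, using only the standard library, could look like the following. The helper names `normalize` and `drop_empty_and_duplicates` are hypothetical, and the sketch only catches copies that are identical after whitespace and case normalization, not paraphrases.

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def drop_empty_and_duplicates(docs):
    """Drop None/blank entries and keep only the first copy of each normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        if doc is None:
            continue  # missing document
        key_text = normalize(doc)
        if not key_text:
            continue  # whitespace-only document
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier document
        seen.add(digest)
        unique.append(doc)
    return unique

raw_docs = [
    "RAG models retrieve data to improve AI.",
    None,
    "   ",
    "rag models   retrieve data to improve ai.",
    "They enhance response accuracy.",
]
cleaned_docs = drop_empty_and_duplicates(raw_docs)
print(cleaned_docs)
```

Hashing the normalized text keeps memory use small on large corpora, since only the digests are stored rather than the documents themselves.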


# **4️⃣ Text Chunking & Segmentation**
RAG models retrieve chunks of documents rather than entire texts. Proper chunking improves retrieval accuracy and reduces irrelevant context.

**🔹 Chunking Strategies**

✅ Fixed-Length Chunks (e.g., 512 tokens)

✅ Sentence-Based Splitting (Splitting on periods/full stops)

✅ Sliding Window Technique (Overlapping chunks for context retention)

✅ Semantic Chunking (Splitting at logical breaks using NLP)
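
As an illustration of the sentence-based strategy, here is a minimal sketch that splits on sentence-ending punctuation and then packs whole sentences into chunks. The function names and the `max_chars` parameter are assumptions made for this example, and the regex is deliberately naive (it will mis-split abbreviations such as "e.g."), so a proper sentence tokenizer would be a better fit for real data.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def group_sentences(sentences, max_chars=100):
    """Pack whole sentences into chunks, starting a new chunk when max_chars would be exceeded."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = ("RAG models enhance text generation by retrieving external data. "
          "They improve response accuracy and reduce hallucinations. "
          "RAG is widely used in AI applications.")
sentence_chunks = group_sentences(split_sentences(sample), max_chars=100)

for i, chunk in enumerate(sentence_chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

Because sentences are never cut in the middle, a single sentence longer than `max_chars` still becomes its own oversized chunk in this sketch.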

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Retrieval-Augmented Generation (RAG) models enhance text generation by retrieving external data.
They improve response accuracy and reduce hallucinations. RAG is widely used in AI applications."""

# Define chunk size & overlap (measured in characters, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")


Chunk 1: Retrieval-Augmented Generation (RAG) models
Chunk 2: models enhance text generation by retrieving
Chunk 3: external data.
Chunk 4: They improve response accuracy and reduce
Chunk 5: reduce hallucinations. RAG is widely used in AI
Chunk 6: in AI applications.
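
The sliding window technique can also be written out by hand to make the overlap explicit. The sketch below is an assumption-laden illustration rather than a library API: `sliding_window_chunks` is a hypothetical helper, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def sliding_window_chunks(tokens, window_size, overlap):
    """Return overlapping windows; consecutive windows share `overlap` tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace words are a stand-in for real tokens in this sketch
words = "RAG models retrieve chunks of documents rather than entire texts".split()
windows = sliding_window_chunks(words, window_size=4, overlap=2)

for i, window in enumerate(windows, start=1):
    print(f"Window {i}: {' '.join(window)}")
```

A larger overlap preserves more context across chunk boundaries but increases the number of chunks to embed and store.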


# **5️⃣ Metadata Enrichment for Better Retrieval**
Adding metadata improves retrieval performance by enabling filtering and ranking of results.

**🔹 Common Metadata for RAG Applications**

📌 Document Titles → Identify key topics.

📌 Authors & Sources → Verify credibility.

📌 Timestamp → Ensure freshness of data.

📌 Categories & Tags → Improve filtering.

📌 Named Entities (NER) → Extract important names (e.g., "GPT-4").

In [5]:
from langchain.schema import Document

# Sample document with metadata
doc = Document(
    page_content="RAG models improve generative AI by retrieving relevant information.",
    metadata={"source": "AI Research", "category": "NLP", "date": "2025-01-30"}
)

print("Document Metadata:", doc.metadata)


Document Metadata: {'source': 'AI Research', 'category': 'NLP', 'date': '2025-01-30'}
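
To show how metadata enables filtering, here is a small sketch that selects documents by category and date before any similarity search runs. `SimpleDoc` is a hypothetical stand-in that mirrors the `page_content` and `metadata` fields used above, and `filter_docs` and the two extra sample documents are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SimpleDoc:
    """Stand-in for a document object with text and a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_docs(docs, category=None, published_after=None):
    """Keep docs that match the category and are dated strictly after published_after."""
    selected = []
    for doc in docs:
        if category is not None and doc.metadata.get("category") != category:
            continue
        if published_after is not None:
            raw_date = doc.metadata.get("date")
            # Documents without a date cannot be shown to be fresh, so skip them
            if raw_date is None or date.fromisoformat(raw_date) <= published_after:
                continue
        selected.append(doc)
    return selected

corpus = [
    SimpleDoc("RAG models improve generative AI by retrieving relevant information.",
              {"source": "AI Research", "category": "NLP", "date": "2025-01-30"}),
    SimpleDoc("An older overview of retrieval methods.",
              {"source": "Archive", "category": "NLP", "date": "2021-06-01"}),
    SimpleDoc("Notes on image segmentation.",
              {"source": "Vision Blog", "category": "CV", "date": "2025-02-10"}),
]

recent_nlp = filter_docs(corpus, category="NLP", published_after=date(2024, 1, 1))
print([doc.metadata["source"] for doc in recent_nlp])
```

Storing dates in ISO format (`YYYY-MM-DD`) keeps them both parseable and sortable as plain strings.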


# **6️⃣ Embedding & Indexing for Vector Search**
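RAG systems use embeddings to represent text as high-dimensional vectors, so that semantically similar passages end up close together and can be found by nearest-neighbor search.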


**🔹 Steps in Embedding & Indexing**

1️⃣ Convert text chunks into vector embeddings.

2️⃣ Store embeddings in vector databases (FAISS, Pinecone, Weaviate).

3️⃣ Use similarity search to retrieve relevant chunks.
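
The three steps can be illustrated without any external service. In the sketch below, `toy_embed` is a deliberately crude stand-in for an embedding model (a bag-of-words count vector), and `top_k` is a brute-force search. Both are assumptions for illustration only: a real pipeline would use learned embeddings and a vector index.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=1):
    """Brute-force search: score every chunk against the query and return the k best."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

toy_chunks = [
    "RAG models retrieve data to improve AI.",
    "They enhance response accuracy and reduce hallucinations.",
]
best = top_k("How do RAG models improve AI?", toy_chunks, k=1)
print(best[0][1])
```

Brute-force scoring grows linearly with the number of chunks, which is the cost that vector databases such as FAISS are designed to reduce.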

In [7]:
!pip install langchain-community


In [9]:
from google.colab import userdata

# Read the OpenAI API key from Colab's secret store (the secret must be named OPENAI_API_KEY)
openai_api = userdata.get("OPENAI_API_KEY")

In [11]:
!pip install tiktoken



In [13]:
!pip install faiss-cpu



In [14]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Sample text chunks
chunks = [
    "RAG models retrieve data to improve AI.",
    "They enhance response accuracy and reduce hallucinations.",
]

# Embed the chunks and build an in-memory FAISS index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=openai_api)
vector_db = FAISS.from_texts(chunks, embeddings)

# Search for relevant content (the retriever returns up to 4 documents by default,
# so with only two chunks indexed, both come back, ranked by similarity)
query = "How do RAG models improve AI?"
retriever = vector_db.as_retriever()
results = retriever.invoke(query)

print("Retrieved Documents:", [doc.page_content for doc in results])




Retrieved Documents: ['RAG models retrieve data to improve AI.', 'They enhance response accuracy and reduce hallucinations.']


# **7️⃣ Summary & Takeaways**
✅ Clean & preprocess data → Remove noise, normalize text.

✅ Use smart chunking → Keeps related context together so retrieved passages stay coherent.

✅ Enrich metadata → Improves searchability and ranking.

✅ Embed & index efficiently → Vector search enables fast and relevant retrieval.