<a href="https://colab.research.google.com/github/nitishnarayanan002/RAG_on_Myntra_dataset/blob/main/RAG_Nitish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Building a RAG pipeline on the Myntra fashion dataset from Kaggle and showing how the bot responds to user queries**

# 1. Setup

## 1.1 Install Libraries

In [None]:
# Install required libraries
!pip install pandas openpyxl langchain pydantic chromadb sentence-transformers transformers accelerate rank_bm25
# Install the Cohere SDK (optional; Section 4.2 re-ranks with a Hugging Face cross-encoder instead)
!pip install --upgrade cohere




## 1.2 Mount Google Drive and API key

In [None]:
from google.colab import drive
from google.colab import userdata  # Accessor for Colab Secrets
import os
import json
import requests
from getpass import getpass

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Get Perplexity API Key from Colab Secrets using `userdata`
# NOTE: The secret name is case-sensitive and must match what you set in Colab Secrets.
SECRET_NAME = 'PERPLEXITY_API_KEY'

try:
    # Use the correct method to retrieve the key
    PPLX_API_KEY = userdata.get(SECRET_NAME)

    if PPLX_API_KEY is None:
        print(f"Error: Secret '{SECRET_NAME}' not found. Please ensure it is saved correctly in Colab Secrets.")
        # Fallback for manual entry if the secret is not set
        PPLX_API_KEY = getpass(f"Enter your {SECRET_NAME} manually: ")
except Exception as e:
    print(f"An error occurred while loading the secret: {e}")
    PPLX_API_KEY = getpass(f"Enter your {SECRET_NAME} manually: ")

# Check if key is available
if PPLX_API_KEY:
    print("Perplexity API Key loaded successfully for RAG project.")
else:
    # This should stop execution if the key is mandatory for the project
    raise ValueError("FATAL ERROR: Perplexity API Key is required but could not be loaded or entered.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Perplexity API Key loaded successfully for RAG project.


# 2. Data Preparation and Document Creation

The goal is to convert the tabular data into text documents suitable for RAG.

In [None]:
import pandas as pd
import numpy as np  # Not used in this cell; missing-value checks below use pd.notna

# Define the path to your dataset
DATASET_PATH = "/content/drive/MyDrive/Fashion Dataset v2.xlsx"

try:
    df = pd.read_excel(DATASET_PATH)
except FileNotFoundError:
    print(f"Error: File not found at {DATASET_PATH}. Please check your path.")
    raise # Stop execution if the file isn't found

# --- COLUMN MAPPING AND CLEANING ---
# 1. Clean the column names by stripping surrounding whitespace
df.columns = df.columns.str.strip()

# 2. Map the required RAG columns to the actual dataset columns
# We will use 'name', 'brand', and 'p_attributes' as the core descriptive fields.
CRITICAL_COLUMNS = ['name', 'brand', 'p_attributes']

# Check if all critical columns exist after cleaning
if not all(col in df.columns for col in CRITICAL_COLUMNS):
    # This should now never run if the column list above is correct
    missing_cols = [col for col in CRITICAL_COLUMNS if col not in df.columns]
    raise KeyError(f"FATAL: Missing required columns: {missing_cols}.")

# Drop rows with any missing values that are critical for search
df.dropna(subset=CRITICAL_COLUMNS, inplace=True)

# Create a 'document' column by concatenating relevant product attributes
def create_document_text(row):
    """Formats a single product entry into a comprehensive text document using actual column names."""

    # Use 'colour' for color and 'products' for category/type
    color = row['colour'] if 'colour' in row and pd.notna(row.get('colour')) else 'N/A'
    category = row['products'] if 'products' in row and pd.notna(row.get('products')) else 'General'
    product_id = row['p_id'] if 'p_id' in row and pd.notna(row.get('p_id')) else 'N/A'

    # Use 'description' if 'p_attributes' is missing, though we dropped NaN rows for 'p_attributes'
    attributes = row['p_attributes'] if 'p_attributes' in row and pd.notna(row.get('p_attributes')) else row.get('description', 'N/A')

    doc_text = (
        f"Product Name: {row['name']}\n"
        f"Brand: {row['brand']}\n"
        f"Category: {category}\n"
        f"Product ID: {product_id}\n"
        f"Color: {color}\n"
        f"Attributes/Features: {attributes}\n"
        f"Price: {row.get('price', 'N/A')}\n"
        f"Average Rating: {row.get('avg_rating', 'N/A')}"
    )
    return doc_text

df['document_content'] = df.apply(create_document_text, axis=1)

# The list of documents to be processed by the embedding model
documents = df['document_content'].tolist()

print(f"\nLoaded and processed {len(documents)} documents for RAG system using columns: {CRITICAL_COLUMNS}")


Loaded and processed 14214 documents for RAG system using columns: ['name', 'brand', 'p_attributes']
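
The `p_attributes` values that appear in the search output later in this notebook look like stringified Python dictionaries, and they are embedded as-is, including entries whose value is `'NA'`. A minimal sketch of one way to flatten them into plainer text before building `document_content` is shown here. `flatten_attributes` is a hypothetical helper, not part of the notebook, and it assumes each value is a dict literal that `ast.literal_eval` can parse; anything else is returned unchanged.

```python
import ast


def flatten_attributes(raw, skip_values=("NA", "")):
    """Turn a stringified dict into 'key: value; key: value' text.

    Hypothetical helper: returns the input unchanged (as a string)
    whenever it cannot be parsed as a dict literal.
    """
    if not isinstance(raw, str):
        return str(raw)
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if not isinstance(parsed, dict):
        return raw
    # Keep only entries that carry information
    kept = [
        f"{key}: {value}"
        for key, value in parsed.items()
        if str(value).strip() not in skip_values
    ]
    return "; ".join(kept)


sample = "{'Fabric': 'Silk', 'Closure': 'Slip-On', 'Character': 'NA'}"
flattened = flatten_attributes(sample)
print(flattened)  # Fabric: Silk; Closure: Slip-On
```

If this were adopted, it could be applied to `row['p_attributes']` inside `create_document_text`. Whether shorter attribute text actually improves retrieval on this dataset would need to be checked empirically.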


# 3. The Embedding Layer: Experimentation

This layer is about converting your text documents into numerical vectors.
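
To make the idea concrete, here is a toy sketch of what a vector search does once text has become vectors: score every stored vector against the query vector and keep the closest ones. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration; the notebook itself uses 384-dimensional embeddings and delegates the search to ChromaDB.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Brute-force search: score every document, return the k best as (index, score)."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for document embeddings (illustrative values only)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_vectors, k=2)
for index, score in ranking:
    print(f"doc {index}: similarity {score:.4f}")
```

A real vector store replaces the brute-force loop with an approximate nearest-neighbour index, but the ranking principle is the same.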

## 3.1 Embedding Model


In [None]:
!pip install --upgrade sentence-transformers



In [None]:
!pip install "numpy<2"



In [None]:
import torch
from sentence_transformers import SentenceTransformer
import chromadb
from typing import List, Dict, Any

# --- CHOOSE YOUR MODEL FOR IMPLEMENTATION ---
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

print(f"Loading Sentence Transformer: {EMBEDDING_MODEL_NAME}")

# Use GPU if available for faster embedding generation
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)

# --- 1. Generate Embeddings for All Documents ---
print(f"Generating embeddings for {len(documents)} chunks...")

# Efficient encoding with batching, GPU, and progress bar
embeddings = model.encode(
    documents,
    convert_to_tensor=True,
    device=device,
    show_progress_bar=True,
    batch_size=64  # Adjust if you face memory errors
).cpu().numpy()

print(f" Embeddings generated successfully: shape = {embeddings.shape}")

# --- 2. Prepare Metadata and IDs ---
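# Ensure 'p_id' exists and has no NaNs.
# Rows must not be dropped at this point: `documents` and `embeddings` were
# built from the current rows, so filtering here would break their alignment.
if 'p_id' not in df.columns:
    raise KeyError("FATAL: 'p_id' column missing from dataframe.")
if df['p_id'].isna().any():
    raise ValueError("FATAL: 'p_id' has missing values; add 'p_id' to CRITICAL_COLUMNS and re-run Section 2.")

ids = [str(pid) for pid in df['p_id']]
metadatas = [{"source": "Myntra Dataset", "p_id": str(pid)} for pid in df['p_id']]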

# Sanity check: lengths must match
assert len(ids) == len(documents) == len(embeddings) == len(metadatas), \
    f"Length mismatch: ids={len(ids)}, docs={len(documents)}, emb={len(embeddings)}, meta={len(metadatas)}"

# --- 3. Initialize and Populate ChromaDB ---
PERSIST_DIRECTORY = "./chroma_db_direct"
COLLECTION_NAME = "myntra_products"

# Use in-memory client first for speed; switch to PersistentClient once stable
# chroma_client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)
chroma_client = chromadb.Client()  # Faster for Colab testing

collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
)

# --- 4. Add Data to the Collection in Batches ---
batch_size = 200
print("Adding data to ChromaDB in batches...")

for i in range(0, len(ids), batch_size):
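    # Ceiling division so the total is also right when len(ids) is a multiple of batch_size
    print(f"→ Adding batch {i // batch_size + 1} / {(len(ids) + batch_size - 1) // batch_size}")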
    collection.add(
        embeddings=embeddings[i:i + batch_size],
        documents=documents[i:i + batch_size],
        metadatas=metadatas[i:i + batch_size],
        ids=ids[i:i + batch_size],
    )

print(f"ChromaDB collection '{COLLECTION_NAME}' created successfully!")

# --- 5. Helper Function for Direct Search ---
def direct_chroma_search(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """Performs vector search directly against the ChromaDB collection."""
    # Embed the query using the same model
    query_embedding = model.encode([query], convert_to_tensor=False, show_progress_bar=False)

    # Perform vector similarity search
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k,
        include=['documents', 'distances', 'metadatas']
    )

    # Format output for readability
    formatted_results = []
    if results.get('documents') and results.get('distances'):
        for i in range(len(results['documents'][0])):
            formatted_results.append({
                "content": results['documents'][0][i],
                "score": results['distances'][0][i],
                "metadata": results['metadatas'][0][i],
            })

    return formatted_results

print(" direct_chroma_search() ready for use!")


Loading Sentence Transformer: all-MiniLM-L6-v2
Using device: cpu
Generating embeddings for 14214 chunks...


Batches:   0%|          | 0/223 [00:00<?, ?it/s]

 Embeddings generated successfully: shape = (14214, 384)
Adding data to ChromaDB in batches...
→ Adding batch 1 / 72
→ Adding batch 2 / 72
→ Adding batch 3 / 72
→ Adding batch 4 / 72
→ Adding batch 5 / 72
→ Adding batch 6 / 72
→ Adding batch 7 / 72
→ Adding batch 8 / 72
→ Adding batch 9 / 72
→ Adding batch 10 / 72
→ Adding batch 11 / 72
→ Adding batch 12 / 72
→ Adding batch 13 / 72
→ Adding batch 14 / 72
→ Adding batch 15 / 72
→ Adding batch 16 / 72
→ Adding batch 17 / 72
→ Adding batch 18 / 72
→ Adding batch 19 / 72
→ Adding batch 20 / 72
→ Adding batch 21 / 72
→ Adding batch 22 / 72
→ Adding batch 23 / 72
→ Adding batch 24 / 72
→ Adding batch 25 / 72
→ Adding batch 26 / 72
→ Adding batch 27 / 72
→ Adding batch 28 / 72
→ Adding batch 29 / 72
→ Adding batch 30 / 72
→ Adding batch 31 / 72
→ Adding batch 32 / 72
→ Adding batch 33 / 72
→ Adding batch 34 / 72
→ Adding batch 35 / 72
→ Adding batch 36 / 72
→ Adding batch 37 / 72
→ Adding batch 38 / 72
→ Adding batch 39 / 72
→ Adding batch 40

# 4. The Search Layer: Confirmation & Query Refinement

This layer handles retrieving relevant chunks and refining the results.

## 4.1 Implementing Caching for Search

In [None]:
from typing import List, Dict, Any

# Simple in-memory cache
search_cache = {}

def get_cached_or_search(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """
    Checks cache first; if not found, performs vector search on ChromaDB.
    Uses direct_chroma_search() defined in Section 3.1.
    """
    cache_key = (query.lower().strip(), k)

    if cache_key in search_cache:
        print(f" Cache Hit for query: '{query}'")
        return search_cache[cache_key]

    print(f" Cache Miss for query: '{query}'. Performing vector search...")
    results = direct_chroma_search(query, k=k)

    # Cache the results
    search_cache[cache_key] = results
    return results

# --- Example Queries for Testing ---
queries = {
    "Query 1": "Do you have ethnic wear?",
    "Query 2": "I'm looking for Kurtas and Trousers",
    "Query 3": "Describe the fabric and features of W Women."
}

for qid, qtext in queries.items():
    print(f"\n {qid}: {qtext}")
    res = get_cached_or_search(qtext, k=5)
    for idx, r in enumerate(res, 1):
        print(f"\nResult {idx}:")
        print(f"→ Score: {r['score']:.4f}")
        print(f"→ Product ID: {r['metadata'].get('p_id', 'N/A')}")
        print(f"→ Content: {r['content'][:250]}...")



 Query 1: Do you have ethnic wear?
 Cache Miss for query: 'Do you have ethnic wear?'. Performing vector search...

Result 1:
→ Score: 0.9415
→ Product ID: 15723640
→ Content: Product Name: Ethnicity Gold-Toned Ethnic Embellished Regular Top
Brand: Ethnicity
Category: Top
Product ID: 15723640
Color: Gold
Attributes/Features: {'Body Shape ID': '443,333,424', 'Body or Garment Size': 'Garment Measurements in', 'Center Front O...

Result 2:
→ Score: 0.9517
→ Product ID: 17577456
→ Content: Product Name: Indo Era Women Classic Off-White Ethnic Motifs Nuovo Sleeves Top
Brand: Indo Era
Category: Top
Product ID: 17577456
Color: Off White
Attributes/Features: {'Body Shape ID': '443,333,424', 'Body or Garment Size': 'Garment Measurements in'...

Result 3:
→ Score: 0.9523
→ Product ID: 16588088
→ Content: Product Name: Ancestry Pink & Black Ethnic Motifs Print Bishop Sleeves Modal Regular Top
Brand: Ancestry
Category: Top
Product ID: 16588088
Color: Pink
Attributes/Features: {'Body Shape ID': '33
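
The `score` values printed above are distances, so lower means closer. The sketch below shows how such a distance relates to cosine similarity under two assumptions that the notebook does not state explicitly and that should be verified: that the collection uses ChromaDB's default `l2` space (squared Euclidean distance, since no space is configured), and that the embeddings are unit length. For unit vectors, the squared distance equals `2 - 2 * cosine`. All helper names here are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def distance_to_cosine(squared_distance):
    """Invert d = 2 - 2*cos; valid only for unit-length vectors."""
    return 1.0 - squared_distance / 2.0


u = normalize([3.0, 4.0, 0.0])
v = normalize([4.0, 3.0, 0.0])

d = squared_l2(u, v)
recovered = distance_to_cosine(d)
print(f"squared L2 = {d:.4f}, cosine = {cosine(u, v):.4f}, recovered = {recovered:.4f}")
```

Under those assumptions, the first result's distance of 0.9415 would correspond to a cosine similarity of about 0.53.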

## 4.2 Implementing Re-ranking

We will use a cross-encoder from Hugging Face to re-rank the top 10 chunks returned by the initial vector search and keep the best 3 as context.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from typing import List, Dict, Any
import numpy as np

# --- Reranker Model (Cross-Encoder) ---
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # Lightweight and effective

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(RERANKER_MODEL_NAME)
reranker_model = AutoModelForSequenceClassification.from_pretrained(RERANKER_MODEL_NAME)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
reranker_model.to(device)

# --- Re-ranking Function ---
def rerank_documents(query: str, initial_results: List[Dict[str, Any]], top_n: int = 3) -> List[Dict[str, Any]]:
    """
    Re-ranks retrieved documents using a cross-encoder (semantic re-ranking).
    """
    if not initial_results:
        print("No documents to rerank.")
        return []

    # Form query-document pairs
    pairs = [[query, doc['content']] for doc in initial_results]

    # Tokenize and infer scores
    with torch.no_grad():
        inputs = tokenizer(
            pairs,
            padding=True,
            truncation=True,
            return_tensors='pt'
        ).to(device)
        scores = reranker_model(**inputs).logits.squeeze(-1).cpu().numpy()

    # Attach re-rank scores
    for i, doc in enumerate(initial_results):
        doc['rerank_score'] = float(scores[i])

    # Sort and return top_n
    reranked_results = sorted(initial_results, key=lambda x: x['rerank_score'], reverse=True)
    return reranked_results[:top_n]

# --- Updated Caching Layer (works with direct_chroma_search) ---
search_cache = {}

def get_cached_or_search(query: str, k: int = 5) -> List[Dict[str, Any]]:
    """Checks cache first, otherwise performs ChromaDB vector search."""
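    # Normalize the key as in Section 4.1 so case and whitespace variants share one entry
    cache_key = (query.lower().strip(), k)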

    if cache_key in search_cache:
        print(f"Cache Hit for query: '{query}'")
        return search_cache[cache_key]

    print(f"Cache Miss for query: '{query}'. Performing vector search...")
    results = direct_chroma_search(query, k=k)  # Uses your ChromaDB retrieval

    # Save to cache
    search_cache[cache_key] = results
    return results

# --- Run Test ---
TEST_QUERY = "What are some options of Libas?"
K_INITIAL_RETRIEVAL = 10  # Retrieve top 10 from Chroma
TOP_N_RERANKED = 3        # Keep the top 3 after re-ranking

# Initial retrieval
initial_results = get_cached_or_search(TEST_QUERY, k=K_INITIAL_RETRIEVAL)

# Reranking
final_context = rerank_documents(TEST_QUERY, initial_results, top_n=TOP_N_RERANKED)

print("\n--- RERANKED TOP 3 CONTEXT CHUNKS ---")
for i, chunk in enumerate(final_context):
    print(f"\nRANK {i+1} (Rerank Score: {chunk['rerank_score']:.4f}):")
    print(chunk['content'][:300], "...")


Cache Miss for query: 'What are some options of Libas?'. Performing vector search...

--- RERANKED TOP 3 CONTEXT CHUNKS ---

RANK 1 (Rerank Score: -1.1954):
Product Name: Libas Women Stylish Black Embellished Tiered Skirt
Brand: Libas
Category: Skirt
Product ID: 16872736
Color: Black
Attributes/Features: {'Add-Ons': 'NA', 'Body Shape ID': '443,324,333,424', 'Body or Garment Size': 'To-Fit Denotes Body Measurements in', 'Character': 'NA', 'Closure': 'Dra ...

RANK 2 (Rerank Score: -1.5515):
Product Name: Libas Women Blue Floral Printed Ethnic Palazzos
Brand: Libas
Category: Palazzos
Product ID: 17644444
Color: Blue
Attributes/Features: {'Body or Garment Size': 'To-Fit Denotes Body Measurements in', 'Care for me': 'NA', 'Closure': 'Slip-On', 'Fabric': 'Silk', 'Fabric 2': 'Blended', 'Fit ...

RANK 3 (Rerank Score: -1.6717):
Product Name: Libas Women Pink Floral Printed Ethnic Palazzos
Brand: Libas
Category: Palazzos
Product ID: 17644516
Color: Pink
Attributes/Features: {'Body or Garment S
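
The re-rank scores above are negative because `rerank_documents` returns the raw logits of the model's single output, not probabilities. A small sketch of mapping them into the range 0 to 1 with a sigmoid follows. The mapping is monotonic, so the ranking does not change; whether the resulting numbers are well-calibrated relevance probabilities is an assumption this notebook does not test.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Logits copied from the re-ranked output above
rerank_logits = [-1.1954, -1.5515, -1.6717]
rerank_probs = [sigmoid(value) for value in rerank_logits]

for logit, prob in zip(rerank_logits, rerank_probs):
    print(f"logit {logit:+.4f} -> sigmoid {prob:.3f}")
```

Squashed scores can be easier to threshold, for example to drop chunks below a chosen cut-off before they reach the generation step.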

# 5. Generation Layer

This layer uses the retrieved context to generate the final, coherent answer.

## 5.1 Perplexity API Function

In [None]:
def generate_answer_with_pplx(query: str, context: List[Dict[str, Any]]) -> str:
    """
    Calls the Perplexity API with the user query and retrieved context.
    """
    if not context:
        return "I could not find any relevant information in the Myntra dataset to answer your query."

    # 1. Structure the Context for the LLM
    context_text = "\n---\n".join([doc['content'] for doc in context])

    # 2. Design the Exhaustive Prompt Template
    SYSTEM_PROMPT = (
        "You are an expert Myntra fashion product knowledge assistant. "
        "Your task is to answer the user's question based ONLY on the provided context. "
        "If the context does not contain the answer, state clearly that the information is unavailable in the provided product catalog. "
        "Format the output clearly using a bulleted or numbered list of products and their key details (Name, Brand, ID, Color, and a summary of the Attributes). "
        "Ensure the final answer is coherent and directly addresses the user's request."
    )

    # 3. Create the Full Prompt (Including Few-shot Example - Optional but recommended)
    # Few-shot example structure:
    # Example Query, Example Context, Example Answer (not implemented here for brevity, but recommended for report)

    USER_MESSAGE = (
        f"Based on the Myntra Product Catalog data provided below, answer the following question:\n\n"
        f"--- USER QUERY ---\n{query}\n\n"
        f"--- RELEVANT PRODUCT CATALOG DATA ---\n{context_text}\n"
        f"--- END OF DATA ---"
    )

    # 4. API Call Configuration
    url = "https://api.perplexity.ai/chat/completions"

    # --- Experimentation Block: Change the model here ---
    # Experiment 1 (Online): "sonar-medium-online"
    # Experiment 2 (Chat): "sonar-small-chat"
    MODEL_NAME = "sonar"

    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_MESSAGE}
        ]
    }

    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "authorization": f"Bearer {PPLX_API_KEY}"
    }

    try:
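        # A timeout keeps the notebook from hanging if the API never responds (60 s is an arbitrary choice)
        response = requests.post(url, headers=headers, data=json.dumps(payload), timeout=60)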
        response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)

        response_data = response.json()

        if 'choices' in response_data and len(response_data['choices']) > 0:
            return response_data['choices'][0]['message']['content']
        else:
            return "Error: Perplexity API did not return a valid response."

    except requests.exceptions.RequestException as e:
        return f"An error occurred during the API call: {e}"
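
The comments in `generate_answer_with_pplx` recommend a few-shot example but leave it unimplemented. One possible shape for it is sketched here: a worked example exchange is placed between the system prompt and the real user message. `build_few_shot_messages` is a hypothetical helper, the example strings are placeholders rather than real catalog data, and how the Perplexity endpoint treats prior turns should be confirmed against its API documentation.

```python
from typing import Dict, List


def build_few_shot_messages(
    system_prompt: str,
    example_query: str,
    example_context: str,
    example_answer: str,
    user_message: str,
) -> List[Dict[str, str]]:
    """Return a chat message list with one example exchange before the real query."""
    # Mirror the delimiter layout used for the real user message
    example_user = (
        f"--- USER QUERY ---\n{example_query}\n\n"
        f"--- RELEVANT PRODUCT CATALOG DATA ---\n{example_context}\n"
        f"--- END OF DATA ---"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_user},
        {"role": "assistant", "content": example_answer},
        {"role": "user", "content": user_message},
    ]


# Placeholder strings only; substitute a real query, context, and ideal answer
messages = build_few_shot_messages(
    system_prompt="Answer ONLY from the provided context.",
    example_query="<example question>",
    example_context="Product Name: <placeholder>\nBrand: <placeholder>",
    example_answer="1. **<placeholder>**\n   - Brand: <placeholder>",
    user_message="<user message built as in generate_answer_with_pplx>",
)
print([m["role"] for m in messages])
```

The returned list could replace the two-element `messages` value in `payload`.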

## 5.2 Final Testing and Output

Run the three test queries through retrieval, re-ranking, and generation to get the final answers.
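
The retrieve, re-rank, and generate steps can also be wrapped in a single function, which makes the flow easier to test without a vector store or an API key. The sketch below is one possible arrangement, not the notebook's own code: `answer_query` and the three stub functions are hypothetical, and the stubs only imitate the call shapes of `get_cached_or_search`, `rerank_documents`, and `generate_answer_with_pplx`.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[Doc]],
    rerank: Callable[[str, List[Doc], int], List[Doc]],
    generate: Callable[[str, List[Doc]], str],
    k: int = 10,
    top_n: int = 3,
) -> Dict[str, Any]:
    """Run retrieval, re-ranking, and generation for one query."""
    candidates = retrieve(query, k)
    context = rerank(query, candidates, top_n)
    answer = generate(query, context)
    return {"query": query, "context": context, "answer": answer}


# Stubs standing in for the real components (illustrative only)
def fake_retrieve(query: str, k: int) -> List[Doc]:
    return [{"content": f"doc {i}", "score": float(i)} for i in range(k)]


def fake_rerank(query: str, docs: List[Doc], top_n: int) -> List[Doc]:
    return docs[:top_n]


def fake_generate(query: str, context: List[Doc]) -> str:
    return f"{len(context)} chunks used"


result = answer_query("test", fake_retrieve, fake_rerank, fake_generate, k=5, top_n=2)
print(result["answer"])  # 2 chunks used
```

With the notebook's own functions, the call would look like `answer_query(query, get_cached_or_search, rerank_documents, generate_answer_with_pplx)`.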

In [None]:
# List of your 3 self-designed queries
all_test_queries = [
    "Do you have ethnic wear?",
    "I'm looking for Kurtas and Trousers",
    "Describe the fabric and features of W Women."
]

for i, query in enumerate(all_test_queries):
    print(f"\n\n=======================================================")
    print(f"TESTING QUERY {i+1}: {query}")
    print("=======================================================")

    # 1. Search Layer (Retrieval)
    initial_results = get_cached_or_search(query, k=10) # Using k=10 for initial search

    # 2. Search Layer (Re-ranking)
    # Get the final 3 best chunks to pass as context
    final_context = rerank_documents(query, initial_results, top_n=3)

    print("\n[CONTEXT]:")
    for j, chunk in enumerate(final_context):
        print(f"  Chunk {j+1} (Rerank Score: {chunk['rerank_score']:.4f}): {chunk['content'][:50]}...")

    # 3. Generation Layer (LLM Call)
    final_answer = generate_answer_with_pplx(query, final_context)

    print("\n[FINAL GENERATED ANSWER (Perplexity Sonar)]:")
    print(final_answer)



TESTING QUERY 1: Do you have ethnic wear?
Cache Hit for query: 'Do you have ethnic wear?'

[CONTEXT]:
  Chunk 1 (Rerank Score: -2.0691): Product Name: Ethnicity Gold-Toned Ethnic Embellis...
  Chunk 2 (Rerank Score: -2.1206): Product Name: Indo Era Women Classic Off-White Eth...
  Chunk 3 (Rerank Score: -2.1978): Product Name: Indo Era Women Bright Orange Ethnic ...

[FINAL GENERATED ANSWER (Perplexity Sonar)]:
Yes, the Myntra product catalog includes ethnic wear options. Here are some ethnic wear tops available:

1. **Ethnicity Gold-Toned Ethnic Embellished Regular Top**  
   - Brand: Ethnicity  
   - Product ID: 15723640  
   - Color: Gold  
   - Features: Sleeveless, round neck, embellished with ethnic motifs, cotton blend fabric, regular length, suitable for casual occasions.  
   - Price: ₹999  

2. **Indo Era Women Classic Off-White Ethnic Motifs Nuovo Sleeves Top**  
   - Brand: Indo Era  
   - Product ID: 17577456  
   - Color: Off White  
   - Features: Short puff sleeves, r