1) Load Chunks

In [1]:
import pandas as pd

chunks_df = pd.read_csv("../data/processed/qa_chunks.csv")
print(f"Loaded {len(chunks_df)} chunks")
chunks_df.head()

Loaded 1881 chunks


Unnamed: 0,chunk_id,text,question,answer,issue_area,issue_category,product_category,customer_sentiment,issue_complexity,source_row,chunk_type
0,qa_0_0,Question: How can I log in to my account to pu...,How can I log in to my account to purchase an ...,After confirming the customer's registered ema...,Login and Account,Mobile Number and Email Verification,Appliances,neutral,medium,0,qa_pair
1,qa_1_1,Question: Why am I being asked to ship back th...,Why am I being asked to ship back the computer...,The monitor has been recalled by the manufactu...,Cancellations and returns,Pickup and Shipping,Electronics,neutral,less,1,qa_pair
2,qa_1_2,Question: Can you guide me through the process...,Can you guide me through the process of return...,A prepaid shipping label will be sent to you v...,Cancellations and returns,Pickup and Shipping,Electronics,neutral,less,1,qa_pair
3,qa_2_3,Question: I am unable to click the 'Cancel' bu...,I am unable to click the 'Cancel' button for m...,The 'Cancel' button might not be working due t...,Cancellations and returns,Replacement and Return Process,Appliances,neutral,medium,2,qa_pair
4,qa_3_4,Question: What is the issue I am facing?\nAnsw...,What is the issue I am facing?,The agent understood that the customer was fac...,Login and Account,Login Issues and Error Messages,Appliances,neutral,less,3,qa_pair
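
The `chunk_id` column is passed to ChromaDB as `ids` in step 6, and ChromaDB expects ids within a collection to be unique. A minimal sketch of a check that could run before indexing, using a plain list in place of `chunks_df['chunk_id'].tolist()`. The helper name `find_duplicate_ids` and the sample ids are hypothetical.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [chunk_id for chunk_id, n in counts.items() if n > 1]

# Hypothetical sample; in the notebook this would be chunks_df['chunk_id'].tolist()
sample_ids = ["qa_0_0", "qa_1_1", "qa_1_2", "qa_1_1"]
duplicates = find_duplicate_ids(sample_ids)
print(duplicates)
```

An empty list means every id is unique and the data can be added as-is.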


2) Load Embedding Model

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 385.79it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✅ Model loaded!
Embedding dimension: 384


3) Test Single Embedding

In [4]:
test_text = "How can I return my product?"
vector = model.encode(test_text)

print(f"Input text: {test_text}")
print(f"Vector shape: {vector.shape}")
print(f"First 10 values: {vector[:10]}")

Input text: How can I return my product?
Vector shape: (384,)
First 10 values: [-0.00238579 -0.00435041  0.02813725 -0.04858989  0.0357636   0.04144794
  0.01438444  0.0586939  -0.02398882 -0.07622898]
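
Each text maps to one 384-dimensional vector, and texts with similar meaning should map to vectors that point in similar directions. As an illustration of how that closeness can be measured, here is a minimal pure-Python cosine similarity. The function name and the toy 3-dimensional vectors are assumptions for illustration, not output from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With real embeddings, the same function could be called on `vector.tolist()` for two encoded texts.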


4) Embed All Chunks

In [3]:
texts = chunks_df['text'].tolist()

print(f"Embedding {len(texts)} chunks...")
embeddings = model.encode(texts, show_progress_bar=True)

print(f"✅ Done! Shape: {embeddings.shape}")

Embedding 1881 chunks...


Batches: 100%|██████████| 59/59 [00:03<00:00, 17.61it/s]

✅ Done! Shape: (1881, 384)
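
The progress bar reports 59 batches. That is consistent with `encode` splitting the 1881 texts into batches of 32, which is its documented default `batch_size`. Treat that default as an assumption if your sentence-transformers version differs. The arithmetic, as a small sketch with a hypothetical helper name:

```python
import math

def num_batches(n_items, batch_size):
    """Number of batches needed to cover n_items; the last one may be partial."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_items / batch_size)

# 1881 chunks at an assumed batch size of 32
print(num_batches(1881, 32))  # 59
```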





5) Store in ChromaDB

In [5]:
import chromadb

# Create ChromaDB client (saves to disk)
client = chromadb.PersistentClient(path="../data/vector_db")

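# Delete the collection if it already exists (for a clean start)
try:
    client.delete_collection("qa_chunks")
except Exception:
    # Nothing to delete on a first run; a bare `except:` would also
    # swallow KeyboardInterrupt, so catch Exception instead
    pass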

# Create collection
collection = client.create_collection(
    name="qa_chunks",
    metadata={"description": "Customer support QA pairs"}
)

print("✅ ChromaDB collection created!")

✅ ChromaDB collection created!
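
The collection above is created without naming a distance metric. ChromaDB's documented default is squared L2 distance, with an `hnsw:space` metadata key available to select cosine instead. Both points are assumptions worth checking against the installed version. For unit-length vectors, which all-MiniLM-L6-v2 is generally understood to produce, squared L2 distance d and cosine similarity are related by d = 2 - 2·cos. A minimal sketch of that relationship, with hypothetical function names and toy vectors:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0

# Two unit-length toy vectors at a 60 degree angle (cosine 0.5)
u = [1.0, 0.0]
v = [0.5, 3 ** 0.5 / 2]

d = squared_l2(u, v)
print(round(d, 6))                          # 1.0
print(round(cosine_from_squared_l2(d), 6))  # 0.5
```

Under these assumptions, the distance of 0.5879 reported for the top result in step 7 would correspond to a cosine similarity of about 0.706.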


6) Add Chunks to ChromaDB

In [6]:
# Prepare data for ChromaDB
ids = chunks_df['chunk_id'].tolist()
documents = chunks_df['text'].tolist()
metadatas = []

for _, row in chunks_df.iterrows():
    metadatas.append({
        "issue_area": str(row['issue_area']),
        "issue_category": str(row['issue_category']),
        "product_category": str(row['product_category']),
        "customer_sentiment": str(row['customer_sentiment']),
        "issue_complexity": str(row['issue_complexity']),
        "chunk_type": str(row['chunk_type']),
        "source_row": int(row['source_row'])
    })

# Add to collection in batches (ChromaDB has a limit per batch)
batch_size = 500
for i in range(0, len(ids), batch_size):
    end = min(i + batch_size, len(ids))
    collection.add(
        ids=ids[i:end],
        documents=documents[i:end],
        embeddings=embeddings[i:end].tolist(),
        metadatas=metadatas[i:end]
    )
    print(f"Added batch {i//batch_size + 1}: {i} to {end}")

print(f"\n✅ Total chunks in ChromaDB: {collection.count()}")

Added batch 1: 0 to 500
Added batch 2: 500 to 1000
Added batch 3: 1000 to 1500
Added batch 4: 1500 to 1881

✅ Total chunks in ChromaDB: 1881
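
The batching loop above can be factored into a small generator so the slice boundaries are computed in one place. A minimal sketch; the helper name `batch_ranges` is hypothetical, and the numbers mirror the 1881 chunks and batch size of 500 used above:

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

for number, (start, end) in enumerate(batch_ranges(1881, 500), start=1):
    print(f"Batch {number}: {start} to {end}")
```

Each `(start, end)` pair can then be used to slice `ids`, `documents`, `embeddings`, and `metadatas` for one `collection.add` call.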


7) Test Search

In [7]:
# Ask a question
query = "I want to return a product"

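# Search ChromaDB. With query_texts, ChromaDB embeds the query using the
# collection's embedding function (by default an ONNX build of all-MiniLM-L6-v2),
# not the SentenceTransformer `model` object loaded in step 2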
results = collection.query(
    query_texts=[query],
    n_results=3
)

print(f"Query: '{query}'\n")
print("Top 3 Results:")
for i in range(len(results['ids'][0])):  # robust if fewer than 3 results come back
    print(f"\n{'='*50}")
    print(f"Rank {i+1}")
    print(f"Chunk ID: {results['ids'][0][i]}")
    print(f"Distance: {results['distances'][0][i]:.4f}")
    print(f"Category: {results['metadatas'][0][i]['issue_area']}")
    print(f"Text: {results['documents'][0][i][:300]}...")

/home/ahmed/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [05:00<00:00, 277kiB/s]    


Query: 'I want to return a product'

Top 3 Results:

Rank 1
Chunk ID: qa_959_1836
Distance: 0.5879
Category: Cancellations and returns
Text: Question: How can I return a product I am not satisfied with and get a refund?
Answer: You can return the product and initiate a refund by following the instructions provided on our website. Once the return request is approved, ship the product back to us, and we will initiate the refund process. Th...

Rank 2
Chunk ID: qa_336_654
Distance: 0.6106
Category: Cancellations and returns
Text: Question: What do I need to return this wrong product and get a refund?
Answer: Initiate a return through the website, securely pack the product, ship it back, and wait for the refund once the product is inspected....

Rank 3
Chunk ID: qa_555_1096
Distance: 0.6558
Category: Cancellations and returns
Text: Question: Can you tell me how to proceed with the return process?
Answer: We will provide you with a prepaid shipping label that you can use to ship the item ba
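
Because `query_texts` relies on the collection's own embedding function, one way to guarantee that queries and stored chunks are embedded by the same model is to encode the query yourself and pass `query_embeddings`. A minimal sketch; the helper name `search_with_model` is hypothetical, and `model` and `collection` are assumed to be the objects created in steps 2 and 5:

```python
def search_with_model(collection, model, query, n_results=3):
    """Embed the query with `model`, then search `collection` by vector."""
    # encode() on a list returns one row per input; tolist() gives nested lists
    query_vectors = model.encode([query]).tolist()
    return collection.query(query_embeddings=query_vectors, n_results=n_results)
```

In the notebook this would be called as `results = search_with_model(collection, model, "I want to return a product")`, and the returned dictionary can be printed with the same loop as above.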

8) Test More Queries

In [8]:
test_queries = [
    "My order hasn't arrived yet",
    "How do I change my password?",
    "I want a refund",
    "Product is not working properly",
    "How to cancel my subscription"
]

for query in test_queries:
    results = collection.query(query_texts=[query], n_results=1)
    print(f"\n❓ Query: {query}")
    print(f"✅ Best match: {results['documents'][0][0][:150]}...")
    print(f"   Category: {results['metadatas'][0][0]['issue_area']}")
    print(f"   Distance: {results['distances'][0][0]:.4f}")


❓ Query: My order hasn't arrived yet
✅ Best match: Question: What is the status of my order #987654321?
Answer: The order has been confirmed and shipped. It should be delivered within the next 2-3 busi...
   Category: Order
   Distance: 0.7437

❓ Query: How do I change my password?
✅ Best match: Question: Can you help me change the password for my account?
Answer: I have updated your password....
   Category: Login and Account
   Distance: 0.4919

❓ Query: I want a refund
✅ Best match: Question: Can you refund my order?
Answer: I have transferred your call to our senior team members, who will be able to assist you further with your r...
   Category: Shopping
   Distance: 0.7949

❓ Query: Product is not working properly
✅ Best match: Question: What is the issue with the product I received?
Answer: The product is defective and will be replaced....
   Category: Cancellations and returns
   Distance: 0.7621

❓ Query: How to cancel my subscription
✅ Best match: Question: What actions does 