# Amazon Product Semantic Search Pipeline

This notebook implements a semantic search pipeline using the Amazon Sales Dataset. The goal is to build a system that:
- **Constructs a combined text representation** for each product using product name, description, and customer review content.
- **Generates text embeddings** using a lightweight SentenceTransformer model (e.g., `all-MiniLM-L6-v2`).
- **Builds a FAISS index** for fast similarity search.
- **Performs query-based searches**, returning the most relevant products.
- **Applies filtering** (by price, category, and embedding distance) and **re-ranks results** using a re-ranking model (e.g., DeepSeek R1 Distill).

This pipeline demonstrates a real-world solution for e-commerce platforms such as Shopify, where high-quality product search is critical. In addition to manual inspection of top search results, we evaluate the system using quantitative metrics (like precision@K) and enhance the presentation of results through custom formatting and export options.

---

**Project Structure:**

- **Data Files:**  
  The raw dataset is stored in `E-commerce_Analysis/data/raw/amazon.csv`.

- **Python Modules:**  
  Reusable functions for vector search are located in `vectorshop/embedding/vector_search.py`.  
  Re-ranking functions using DeepSeek are in `vectorshop/data/language/utils/deepseek_rerank.py` (or in the `utils` folder).

- **Notebooks:**  
  This notebook (`04_amazon_dataset_vector_search.ipynb`) demonstrates the end-to-end pipeline from data loading to evaluation.

---

**Evaluation Strategy:**

1. **Manual Inspection:** We will review the top search results for various queries to assess their relevance.
2. **Quantitative Metrics:** We will compute metrics such as precision@K, if ground-truth labels or user feedback are available.
3. **Enhanced Presentation:** We will adjust display options (e.g., using Pandas HTML formatting) to improve the readability of our output for stakeholder review.

Let's begin by loading the data and constructing our combined text representation.


In [None]:
!pip install pandas numpy sentence-transformers faiss-cpu torch bitsandbytes transformers google-cloud-translate
!pip install requests beautifulsoup4 Pillow
!pip install tenacity
!pip install nltk
%pip install scikit-learn
!pip install accelerate
!pip install tqdm




In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os
import pandas as pd
import sys
from pathlib import Path

# Add project root directory to sys.path
project_root = Path("/content/drive/My Drive/E-commerce_Analysis")
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("Python path updated:")
print(sys.path[:5])

# Set display option for long text
pd.set_option('display.max_colwidth', None)

# Define the file path to your Amazon dataset with images
amazon_data_path = "/content/drive/My Drive/E-commerce_Analysis/data/raw/amazon_with_images.csv"

# Verify file existence
if os.path.exists(amazon_data_path):
    print("Amazon dataset with images found!")
else:
    print("Amazon dataset with images not found. Check the file path.")

# Load the dataset
amazon_df = pd.read_csv(amazon_data_path)
print("Amazon Sales Dataset Columns:")
print(amazon_df.columns.tolist())

print("\nSample data:")
print(amazon_df.head())

# Set device explicitly
device = "cpu"
print(f"Device set to: {device}")

Mounted at /content/drive
Python path updated:
['/content/drive/My Drive/E-commerce_Analysis', '/content', '/env/python', '/usr/lib/python311.zip', '/usr/lib/python3.11']
Amazon dataset with images found!
Amazon Sales Dataset Columns:
['product_id', 'product_name', 'category', 'discounted_price', 'actual_price', 'discount_percentage', 'rating', 'rating_count', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'product_link', 'image_url']

Sample data:
   product_id  \
0  B07JW9H4J1   
1  B098NS6PVG   
2  B096MSW6CT   
3  B08HDJ86NZ   
4  B08CF3B7N1   

                                                                                                                                                                                              product_name  \
0                                       Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, 

In [2]:
# Print project directory structure
from pathlib import Path

# Check if the project root exists
if not project_root.exists():
    print(f"Project root not found: {project_root}")
else:
    print(f"Directory structure of {project_root}:")

    # Function to print directory tree with indentation and depth limit
    def print_directory_tree(path, indent="", depth=0, max_depth=2):
        if depth > max_depth:
            return
        for item in path.iterdir():
            if item.is_dir():
                print(f"{indent}Directory: {item.name}")
                print_directory_tree(item, indent + "  ", depth + 1, max_depth)
            else:
                print(f"{indent}File: {item.name}")

    # Print the directory structure up to 2 levels deep
    print_directory_tree(project_root, max_depth=2)

Directory structure of /content/drive/My Drive/E-commerce_Analysis:
Directory: vectorshop
  Directory: embedding
    Directory: __pycache__
    File: vector_search.py
    File: deepseek_embeddings.py
    File: bm25_search.py
    File: embedding_tracker.py
    File: __init__.py
    File: hybrid_search.py
  Directory: search
    File: __init__.py
  File: __init__.py
  Directory: storage
    File: __init__.py
  Directory: __pycache__
    File: __init__.cpython-311.pyc
    File: config.cpython-311.pyc
  Directory: data
    Directory: __pycache__
    Directory: language
    File: sentiment.py
    File: rerank_utils.py
    File: extraction.py
    File: language_detection.py
    File: category_utils.py
    File: preprocessing.py
    File: load.py
    File: multimodal.py
    File: image_analyzer.py
    File: review_analyzer.py
    File: __init__.py
  File: config.py
Directory: notebooks
  File: 00_environment_test.ipynb
  File: 01_data_exploration.ipynb.ipynb
  File: 02_embeddings_test.ipynb
 

## Download Images
### Download images from the extracted URLs and save them locally.

In [None]:
import os
import requests

# Create directory for images
image_dir = "/content/drive/My Drive/E-commerce_Analysis/data/images"
os.makedirs(image_dir, exist_ok=True)

def download_image(url, product_id):
    """Download the image and save it with the product ID."""
    if pd.isna(url):
        return None
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            image_path = os.path.join(image_dir, f"{product_id}.jpg")
            with open(image_path, 'wb') as f:
                f.write(response.content)
            return image_path
        else:
            print(f"Failed to download {url} (Status: {response.status_code})")
            return None
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return None

# Apply to the dataset
amazon_df['image_path'] = amazon_df.apply(lambda row: download_image(row['image_url'], row['product_id']), axis=1)

# Save the updated dataset
amazon_df.to_csv("/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_images.csv", index=False)
print("Images downloaded and paths saved to amazon_with_images.csv")

Images downloaded and paths saved to amazon_with_images.csv


In [None]:
import pandas as pd

def create_product_text(product_name, about_product, category, discounted_price, reviews=None):
    parts = [str(product_name), str(about_product)]
    # Split category terms
    category_terms = str(category).split('>')
    parts.extend([term.strip() for term in category_terms])
    # Add price (converted to USD for consistency)
    exchange_rate = 83  # INR to USD
    price_usd = float(str(discounted_price).replace('₹', '').replace(',', '')) / exchange_rate
    parts.append(f"Price: {price_usd:.2f} USD")
    if reviews and isinstance(reviews, list) and reviews[0]:
        parts.append(str(reviews[0]))
    return " ".join(parts)

# Create combined_text if not already present
if 'combined_text' not in amazon_df.columns:
    amazon_df['combined_text'] = amazon_df.apply(
        lambda row: create_product_text(
            row['product_name'],
            row['about_product'],
            row['category'],
            row['discounted_price'],
            [row['review_content']] if pd.notna(row['review_content']) else None
        ),
        axis=1
    )

# Save updated dataset
amazon_df.to_csv("/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_images.csv", index=False)
print("Updated combined_text with price and category terms.")
print(amazon_df.columns)

Updated combined_text with price and category terms.
Index(['product_id', 'product_name', 'category', 'discounted_price',
       'actual_price', 'discount_percentage', 'rating', 'rating_count',
       'about_product', 'user_id', 'user_name', 'review_id', 'review_title',
       'review_content', 'product_link', 'image_url', 'combined_text'],
      dtype='object')


## Generate Embeddings with CLIP
### Use the CLIP model to generate embeddings for both text (combined_text) and images (image_path).

In [None]:
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import numpy as np
import pandas as pd
import faiss

# Lazy loading for CLIP model and processor
_clip_model = None
_clip_processor = None

def get_clip_model(device="cpu"):
    global _clip_model, _clip_processor
    if _clip_model is None:
        _clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
        _clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    return _clip_model, _clip_processor

# Load CLIP model on CPU
model, processor = get_clip_model(device=device)

def generate_text_embedding(text):
    """Generate text embedding using CLIP."""
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True, max_length=77).to(device)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs).cpu().numpy()[0]
    # Normalize the embedding
    embedding /= np.linalg.norm(embedding)
    return embedding

def generate_image_embedding(image_path):
    """Generate image embedding using CLIP."""
    if pd.isna(image_path):
        return np.zeros(512)
    try:
        image = Image.open(image_path).convert('RGB')
        inputs = processor(images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            embedding = model.get_image_features(**inputs).cpu().numpy()[0]
        # Normalize the embedding
        embedding /= np.linalg.norm(embedding)
        return embedding
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return np.zeros(512)

# Generate embeddings in batches
batch_size = 100
for start in range(0, len(amazon_df), batch_size):
    end = min(start + batch_size, len(amazon_df))

    # Generate text embeddings
    text_embeddings_batch = amazon_df.loc[start:end-1, 'combined_text'].apply(generate_text_embedding)
    amazon_df.loc[start:end-1, 'text_embedding'] = text_embeddings_batch

    # Generate image embeddings
    image_embeddings_batch = amazon_df.loc[start:end-1, 'image_path'].apply(generate_image_embedding)
    amazon_df.loc[start:end-1, 'image_embedding'] = image_embeddings_batch

# After generating normalized embeddings
text_embeddings = np.vstack(amazon_df['text_embedding'].values)
image_embeddings = np.vstack(amazon_df['image_embedding'].values)

# Build indexes with inner product (cosine similarity)
text_index = faiss.IndexFlatIP(512)
text_index.add(text_embeddings)

image_index = faiss.IndexFlatIP(512)
image_index.add(image_embeddings)

# Check shapes before saving
print("Text embeddings shape before saving:", text_embeddings.shape)
print("Image embeddings shape before saving:", image_embeddings.shape)

# Save embeddings and indexes
np.save("/content/drive/My Drive/E-commerce_Analysis/data/processed/text_embeddings.npy", text_embeddings)
np.save("/content/drive/My Drive/E-commerce_Analysis/data/processed/image_embeddings.npy", image_embeddings)
faiss.write_index(text_index, "/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss")
faiss.write_index(image_index, "/content/drive/My Drive/E-commerce_Analysis/data/processed/image_index.faiss")
print("Text and image embeddings generated and saved successfully")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/4.19k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/605M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/605M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/592 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/862k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.22M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/389 [00:00<?, ?B/s]

Text embeddings shape before saving: (1465, 512)
Image embeddings shape before saving: (1465, 512)
Text and image embeddings generated and saved successfully


In [None]:
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch
import pandas as pd

# Define device as CPU explicitly
device = torch.device("cpu")

# Load processor and model on CPU
processor_blip = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model_blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def describe_image(image_path):
    if pd.isna(image_path):
        return ""
    try:
        image = Image.open(image_path).convert("RGB")
        inputs = processor_blip(image, return_tensors="pt").to(device)
        with torch.no_grad():
            generated_ids = model_blip.generate(**inputs, max_length=50)
        return processor_blip.batch_decode(generated_ids, skip_special_tokens=True)[0]
    except Exception as e:
        print(f"Error describing {image_path}: {e}")
        return ""

# Generate descriptions for all images (batch processing for efficiency)
batch_size = 50
for start in range(0, len(amazon_df), batch_size):
    end = min(start + batch_size, len(amazon_df))
    amazon_df.loc[start:end-1, 'image_desc'] = amazon_df.loc[start:end-1, 'image_path'].apply(describe_image)

# Append image descriptions to combined_text
amazon_df['combined_text'] = amazon_df.apply(
    lambda row: row['combined_text'] + " Image Description: " + row['image_desc'] if pd.notna(row['image_desc']) else row['combined_text'],
    axis=1
)

# Save updated dataset
amazon_df.to_csv("/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_images.csv", index=False)
print("Image descriptions generated and appended to combined_text.")

# Regenerate text embeddings with updated combined_text
for start in range(0, len(amazon_df), batch_size):
    end = min(start + batch_size, len(amazon_df))
    text_embeddings_batch = amazon_df.loc[start:end-1, 'combined_text'].apply(generate_text_embedding)
    amazon_df.loc[start:end-1, 'text_embedding'] = text_embeddings_batch

# Update text embeddings and index
text_embeddings = np.vstack(amazon_df['text_embedding'].values)
text_index = faiss.IndexFlatIP(512)
text_index.add(text_embeddings)
np.save("/content/drive/My Drive/E-commerce_Analysis/data/processed/text_embeddings.npy", text_embeddings)
faiss.write_index(text_index, "/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss")
print("Text embeddings updated with image descriptions.")

preprocessor_config.json:   0%|          | 0.00/432 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/882 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/548 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/68.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/122k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/10.0G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

Image descriptions generated and appended to combined_text.
Text embeddings updated with image descriptions.


In [None]:
print(amazon_df.head())
print(device)
print(amazon_df.info())

   product_id  \
0  B07JW9H4J1   
1  B098NS6PVG   
2  B096MSW6CT   
3  B08HDJ86NZ   
4  B08CF3B7N1   

                                                                                                                                                                                              product_name  \
0                                       Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)   
1        Ambrane Unbreakable 60W / 3A Fast Charging 1.5m Braided Type C Cable for Smartphones, Tablets, Laptops & other Type C devices, PD Technology, 480Mbps Data Sync, Quick Charge 3.0 (RCT15A, Black)   
2                                                                 Sounce Fast Phone Charging Cable & Data Sync USB Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini & iOS Devices   
3  boAt Deuce USB 300 2 in 1 Type-C & Micro USB Stress Resistant, Tangle-

## Store Embeddings in Faiss Indexes
### Create separate Faiss indexes for text and image embeddings.

In [None]:
import faiss
import numpy as np

# Load embeddings
text_embeddings = np.load("/content/drive/My Drive/E-commerce_Analysis/data/processed/text_embeddings.npy")
image_embeddings = np.load("/content/drive/My Drive/E-commerce_Analysis/data/processed/image_embeddings.npy")

# Print shapes to verify
print("Loaded text embeddings shape:", text_embeddings.shape)  # Should be (65, 512)
print("Loaded image embeddings shape:", image_embeddings.shape)  # Should be (65, 512)

# Build Faiss indexes
embedding_dim = 512  # CLIP embedding dimension
text_index = faiss.IndexFlatL2(embedding_dim)
text_index.add(text_embeddings)

image_index = faiss.IndexFlatL2(embedding_dim)
image_index.add(image_embeddings)

# Save indexes
faiss.write_index(text_index, "/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss")
faiss.write_index(image_index, "/content/drive/My Drive/E-commerce_Analysis/data/processed/image_index.faiss")
print("Faiss indexes created and saved")

Loaded text embeddings shape: (1465, 512)
Loaded image embeddings shape: (1465, 512)
Faiss indexes created and saved


In [None]:
# Print shapes to verify
print("Loaded text embeddings shape:", text_embeddings.shape)  # Should be (65, 512)
print("Loaded image embeddings shape:", image_embeddings.shape)  # Should be (65, 512)

Loaded text embeddings shape: (1465, 512)
Loaded image embeddings shape: (1465, 512)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(amazon_df['combined_text'])

In [None]:
# Find a product with "earphones" in its name
earphone_idx = amazon_df[amazon_df['product_name'].str.contains("earphones", case=False)].index[0]
earphone_emb = text_embeddings[earphone_idx]  # Use the text_embeddings array instead of DataFrame

# Search the text index
_, indices = text_index.search(np.expand_dims(earphone_emb, 0), 1)
retrieved_product = amazon_df.iloc[indices[0][0]]

print(f"Known product: {amazon_df.iloc[earphone_idx]['product_name']}")
print(f"Retrieved product: {retrieved_product['product_name']}")

Known product: boAt Bassheads 100 in Ear Wired Earphones with Mic(Taffy Pink)
Retrieved product: boAt Bassheads 100 in Ear Wired Earphones with Mic(Taffy Pink)


## Integrate multi-modal search into Notebook
### Test the multi-modal search in your notebook.

In [None]:
"""
Test script for DeepSeek-enhanced vector search.
"""

import faiss
import numpy as np
import pandas as pd
import torch
from transformers import CLIPProcessor, CLIPModel
from sklearn.feature_extraction.text import TfidfVectorizer
import time

# Import both search functions for comparison
from vectorshop.embedding.vector_search import search_multi_modal, improved_search_multi_modal

# Path configuration
DATA_PATH = "/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_images.csv"
TEXT_INDEX_PATH = "/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss"
IMAGE_INDEX_PATH = "/content/drive/My Drive/E-commerce_Analysis/data/processed/image_index.faiss"

def run_test():
    """Run test comparing original and improved search functions."""

    print("Loading dataset and indexes...")
    # Load dataset
    amazon_df = pd.read_csv(DATA_PATH)

    # Load FAISS indexes
    text_index = faiss.read_index(TEXT_INDEX_PATH)
    image_index = faiss.read_index(IMAGE_INDEX_PATH)

    # Set device
    device = "cpu"  # Use "cuda" if GPU is available

    # Load CLIP model and processor
    print("Loading CLIP model...")
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Create TF-IDF vectorizer and matrix
    print("Computing TF-IDF matrix...")
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(amazon_df['combined_text'])

    # Test queries with expected product IDs
    test_queries = [
        "good quality of fast charging Cable for iPhone under 5 USD",
        "good quality headset with Noise Cancelling for computer and have warranty"
    ]
    expected_product_ids = ["B08CF3B7N1", "B009LJ2BXA"]

    # Run tests for each query
    for i, query in enumerate(test_queries):
        expected_id = expected_product_ids[i]
        print(f"\n{'-'*80}")
        print(f"Testing query: {query}")
        print(f"Expected product: {expected_id}")
        print(f"{'-'*80}\n")

        # Test original search function
        print("\n--- ORIGINAL SEARCH FUNCTION ---\n")
        start_time = time.time()
        original_results = search_multi_modal(
            query=query,
            text_index=text_index,
            image_index=image_index,
            df=amazon_df,
            model=clip_model,
            processor=clip_processor,
            tfidf=tfidf,
            tfidf_matrix=tfidf_matrix,
            device=device,
            top_k=5,
            exchange_rate=83
        )
        original_time = time.time() - start_time

        # Check if expected product is in top results
        if expected_id in original_results['product_id'].values:
            original_rank = original_results[original_results['product_id'] == expected_id].index[0] + 1
            print(f"✅ Original search found {expected_id} at rank {original_rank}")
        else:
            print(f"❌ Original search did not find {expected_id} in top results")

        print(f"Original search time: {original_time:.2f} seconds")

        # Test improved search function
        print("\n--- IMPROVED SEARCH FUNCTION (WITH DEEPSEEK) ---\n")
        start_time = time.time()
        improved_results = improved_search_multi_modal(
            query=query,
            text_index=text_index,
            image_index=image_index,
            df=amazon_df,
            model=clip_model,
            processor=clip_processor,
            tfidf=tfidf,
            tfidf_matrix=tfidf_matrix,
            device=device,
            top_k=5,
            exchange_rate=83,
            use_deepseek=True
        )
        improved_time = time.time() - start_time

        # Check if expected product is in top results
        if expected_id in improved_results['product_id'].values:
            improved_rank = improved_results[improved_results['product_id'] == expected_id].index[0] + 1
            print(f"✅ Improved search found {expected_id} at rank {improved_rank}")
        else:
            print(f"❌ Improved search did not find {expected_id} in top results")

        print(f"Improved search time: {improved_time:.2f} seconds")

        # Compare results
        print("\n--- COMPARISON ---\n")
        print("Original top 5 results:")
        print(original_results[['product_id', 'product_name', 'price_usd', 'score']])

        print("\nImproved top 5 results:")
        if 'final_score' in improved_results.columns:
            print(improved_results[['product_id', 'product_name', 'price_usd', 'final_score']])
        else:
            print(improved_results[['product_id', 'product_name', 'price_usd', 'score']])

if __name__ == "__main__":
    run_test()

Loading dataset and indexes...
Loading CLIP model...
Computing TF-IDF matrix...

--------------------------------------------------------------------------------
Testing query: good quality of fast charging Cable for iPhone under 5 USD
Expected product: B08CF3B7N1
--------------------------------------------------------------------------------


--- ORIGINAL SEARCH FUNCTION ---

Running original search_multi_modal function
Initial text indices: [ 118  117  133    2  379  623   97  486   58  925    7  422  668   62
  938  207  538   23  478  727  277  234   44  238  253  139  162   73
  564  983  205  172  196  195  333   81  316    4  393  632  300   83
  426  672  176  699  164  178  208   74  985  322   69  562  974  832
  235   15  464  185  547  204  219  602   59  928  181  140   29  503
  771   34  836    6  418  658  282   76  570  992   66  968   21  722
  156  229  201  137  407  644   10  428  673  240  223   45  857  331
  309  900  246  174   18  472  285  695  227  153  31

  combined_results.at[index, 'score'] += category_boost


DeepSeek model loaded successfully
Error initializing DeepSeek: The current model class (Qwen2Model) is not compatible with `.generate()`, as it doesn't have a language model head. Classes that support generation often end in one of these names: ['ForCausalLM', 'ForConditionalGeneration', 'ForSpeechSeq2Seq', 'ForVision2Seq'].
Initial rank of B08CF3B7N1: 38
B009LJ2BXA not in initial text results
Applying price filter: < 5.0 USD
Relevant categories: ['USBCables', 'Cable', 'Charger']
Before boosting:
   product_id    score
4  B08CF3B7N1  0.81735
After boosting:
   product_id     score
4  B08CF3B7N1  4.478692
Search results (without DeepSeek reranking):
     product_id  price_usd     score
151  B08QSDKFGQ   4.084337  5.064581
107  B0981XSZJ7   3.602410  4.911961
88   B0BMXMLSMM   2.397590  4.824315
199  B08XMG618K   2.710843  4.807991
75   B09CMP1SC8   2.397590  4.803131


  combined_results.at[index, 'score'] += category_boost


❌ Improved search did not find B08CF3B7N1 in top results
Improved search time: 7.60 seconds

--- COMPARISON ---

Original top 5 results:
     product_id  \
1    B098NS6PVG   
88   B0BMXMLSMM   
17   B082LSVT4B   
324  B0BQRJ3C47   
248  B09BW2GP18   

                                                                                                                                                                                                       product_name  \
1                 Ambrane Unbreakable 60W / 3A Fast Charging 1.5m Braided Type C Cable for Smartphones, Tablets, Laptops & other Type C devices, PD Technology, 480Mbps Data Sync, Quick Charge 3.0 (RCT15A, Black)   
88   Lapster 65W compatible for OnePlus Dash Warp Charge Cable , type c to c cable fast charging Data Sync Cable Compatible with One Plus 10R / 9RT/ 9 pro/ 9R/ 8T/ 9/ Nord & for All Type C Devices – Red, 1 Meter   
17                        Ambrane Unbreakable 60W / 3A Fast Charging 1.5m Braided Type C to Type C Cabl

  combined_results.at[index, 'score'] += category_boost


❌ Improved search did not find B009LJ2BXA in top results
Improved search time: 1.99 seconds

--- COMPARISON ---

Original top 5 results:
     product_id  \
906  B009LJ2BXA   
552  B08BCKN299   
800  B07JF9B592   
347  B01DEWVZ2C   
547  B0BMM7R92G   

                                                                                                                                                                product_name  \
906         Hp Wired On Ear Headphones With Mic With 3.5 Mm Drivers, In-Built Noise Cancelling, Foldable And Adjustable For Laptop/Pc/Office/Home/ 1 Year Warranty (B4B09Pa)   
552  Sounce Gold Plated 3.5 mm Headphone Splitter for Computer 2 Male to 1 Female 3.5mm Headphone Mic Audio Y Splitter Cable Smartphone Headset to PC Adapter – (Black,20cm)   
800                                                                                                       MAONO AU-400 Lavalier Auxiliary Omnidirectional Microphone (Black)   
347                                  JBL C10

In [None]:
# Import necessary module for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Load Faiss indexes
text_index = faiss.read_index("/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss")
image_index = faiss.read_index("/content/drive/My Drive/E-commerce_Analysis/data/processed/image_index.faiss")

# Load CLIP model and processor using lazy loading
model, processor = get_clip_model(device=device)

# Clear any residual GPU memory (just in case)
torch.cuda.empty_cache()

# Compute TF-IDF vectorizer and matrix
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(amazon_df['combined_text'])

# Test queries
queries = [
    "good quality of fast charging Cable for iPhone under 5 USD",
    "good quality headset with Noise Cancelling for computer and have warranty"
]

expected_product_ids = ["B08CF3B7N1", "B009LJ2BXA"]

for query, expected_id in zip(queries, expected_product_ids):
    print(f"\nTesting query: {query}")
    results = search_multi_modal(
        query=query,
        text_index=text_index,
        image_index=image_index,
        df=amazon_df,
        model=model,
        processor=processor,
        tfidf=tfidf,  # Pass the TF-IDF vectorizer
        tfidf_matrix=tfidf_matrix,  # Pass the precomputed TF-IDF matrix
        device=device,
        top_k=5,
        exchange_rate=83
    )
    print("Search results:")
    print(results[['product_id', 'product_name', 'price_usd', 'score']])
    # Check if expected product ID is in top results
    if expected_id in results['product_id'].values:
        rank = results[results['product_id'] == expected_id].index[0] + 1
        print(f"Expected product {expected_id} found at rank {rank}")
    else:
        print(f"Expected product {expected_id} not found in top results")


Testing query: good quality of fast charging Cable for iPhone under 5 USD
Running updated search_multi_modal - version check
Initial text indices: [ 118  117  133    2  379  623   97  486   58  925    7  422  668   62
  938  207  538   23  478  727  277  234   44  238  253  139  162   73
  564  983  205  172  196  195  333   81  316    4  393  632  300   83
  426  672  176  699  164  178  208   74  985  322   69  562  974  832
  235   15  464  185  547  204  219  602   59  928  181  140   29  503
  771   34  836    6  418  658  282   76  570  992   66  968   21  722
  156  229  201  137  407  644   10  428  673  240  223   45  857  331
  309  900  246  174   18  472  285  695  227  153  314  113  304  115
  111  256   14  456  692  505  109  173  328  519  252  582  755  483
  158  324   93  217  187  198  149  228  313   92  287  460  713  101
  136  248  120  131   33  833  107   27  768  374  780   28  504  784
  388  552  962  809  451   47   78 1000  975   71  957   75  569  990


  combined_results.at[index, 'score'] += category_boost
  combined_results.at[index, 'score'] += category_boost


In [None]:
print(amazon_df[amazon_df['product_id'] == 'B08CF3B7N1']['category'])
print(amazon_df[amazon_df['product_id'] == 'B009LJ2BXA']['category'])

4      Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables
393    Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables
632    Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables
Name: category, dtype: object
906    Computers&Accessories|Accessories&Peripherals|Audio&VideoAccessories|PCHeadsets
Name: category, dtype: object


In [None]:
# Assuming 'results' is the DataFrame returned by search_multi_modal with top_k=100

# Add a 'rank' column based on the order of results (1 = highest score)
results['rank'] = range(1, len(results) + 1)

# Filter for the products of interest
products_of_interest = ['B08CF3B7N1', 'B009LJ2BXA']
filtered_results = results[results['product_id'].isin(products_of_interest)]

# Select relevant columns: product_id, score, rank
relevant_info = filtered_results[['product_id', 'score', 'rank']]

# Display the results
print(relevant_info)

     product_id     score  rank
906  B009LJ2BXA  2.518048     1


In [None]:
import numpy as np

# Check a few text embeddings
for i in range(3):
    text_emb_norm = np.linalg.norm(amazon_df['text_embedding'].iloc[i])
    print(f"Text embedding {i} norm: {text_emb_norm:.4f}")

# Check a few image embeddings
for i in range(3):
    image_emb_norm = np.linalg.norm(amazon_df['image_embedding'].iloc[i])
    print(f"Image embedding {i} norm: {image_emb_norm:.4f}")

# Check query embedding normalization
query = "wireless earphones with excellent noise cancelling below 200USD"
inputs = processor(text=[query], return_tensors="pt").to(device)
query_embedding = model.get_text_features(**inputs).detach().cpu().numpy()[0]  # Detach the tensor
query_embedding /= np.linalg.norm(query_embedding)  # Normalize it
print(f"Query embedding norm: {np.linalg.norm(query_embedding):.4f}")

In [None]:
# Check similarity between two text embeddings
emb1 = amazon_df['text_embedding'].iloc[0]
emb2 = amazon_df['text_embedding'].iloc[1]
similarity = np.dot(emb1, emb2)
print(f"Cosine similarity between text embeddings 0 and 1: {similarity:.4f}")

# Check similarity between two image embeddings
img_emb1 = amazon_df['image_embedding'].iloc[0]
img_emb2 = amazon_df['image_embedding'].iloc[1]
img_similarity = np.dot(img_emb1, img_emb2)
print(f"Cosine similarity between image embeddings 0 and 1: {img_similarity:.4f}")

In [None]:
# Assuming query_embedding is defined (if not, regenerate it as in Step 1)
query = "wireless earphones with excellent noise cancelling below 200USD"
inputs = processor(text=[query], return_tensors="pt").to(device)
query_embedding = model.get_text_features(**inputs).detach().cpu().numpy()[0]  # Detach and convert
query_embedding /= np.linalg.norm(query_embedding)  # Normalize

# Text-only search
text_scores, text_indices = text_index.search(np.expand_dims(query_embedding, 0), 5)
text_results = amazon_df.iloc[text_indices[0]].copy()
text_results['score'] = text_scores[0]
print("Text-only search results:")
print(text_results[['product_id', 'product_name', 'score']])

# Image-only search
image_scores, image_indices = image_index.search(np.expand_dims(query_embedding, 0), 5)
image_results = amazon_df.iloc[image_indices[0]].copy()
image_results['score'] = image_scores[0]
print("Image-only search results:")
print(image_results[['product_id', 'product_name', 'score']])

# Test New Search System

# Import the New Search Components

In [None]:
# Import our new search components
from vectorshop.embedding.deepseek_embeddings import DeepSeekEmbeddings, create_product_text
from vectorshop.embedding.bm25_search import ProductBM25Search
from vectorshop.embedding.hybrid_search import HybridSearch

print("Successfully imported new search modules!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


Successfully imported new search modules!


# Regenerate combined_text using the improved function

In [None]:
def create_robust_product_text(row):
    """
    Create a comprehensive text representation of a product with proper error handling.
    """
    parts = []

    # Add product name
    if 'product_name' in row and not pd.isna(row['product_name']):
        parts.append(f"Product: {row['product_name']}")

    # Add category with hierarchy
    if 'category' in row and not pd.isna(row['category']):
        category = str(row['category'])
        # Handle different category separators
        if '|' in category:
            category_parts = category.split('|')
        elif '>' in category:
            category_parts = category.split('>')
        else:
            category_parts = [category]

        # Add category information
        parts.append(f"Category: {' > '.join(category_parts)}")

        # Add primary category separately
        if len(category_parts) > 0:
            parts.append(f"Primary Category: {category_parts[0].strip()}")

    # Add product description
    if 'about_product' in row and not pd.isna(row['about_product']):
        parts.append(f"Description: {row['about_product']}")

    # Add rating information with careful error handling
    if 'rating' in row and not pd.isna(row['rating']):
        try:
            # Clean the rating string by keeping only digits and decimal point
            if isinstance(row['rating'], str):
                import re
                cleaned_rating = re.sub(r'[^\d.]', '', row['rating'])
                if cleaned_rating:
                    rating = float(cleaned_rating)
                else:
                    rating = None
            else:
                rating = float(row['rating'])

            # Add rating information if valid
            if rating is not None and rating > 0:
                if rating >= 4.0:
                    parts.append("Quality: High Rating")
                parts.append(f"Rating: {rating}")
        except:
            # Skip rating if conversion fails
            pass

    # Add price information
    if 'discounted_price' in row and not pd.isna(row['discounted_price']):
        try:
            price_str = str(row['discounted_price']).replace('₹', '').replace(',', '')
            price_inr = float(price_str)
            price_usd = price_inr / 83  # Convert to USD
            parts.append(f"Price: {price_usd:.2f} USD")
        except:
            # Skip price if conversion fails
            pass

    # Add review content if available
    if 'review_content' in row and not pd.isna(row['review_content']):
        parts.append(f"Reviews: {row['review_content']}")

    # Add image description if available
    if 'image_desc' in row and not pd.isna(row['image_desc']):
        parts.append(f"Image: {row['image_desc']}")

    # Join all parts with line breaks for better tokenization
    return "\n".join(parts)

# Use the robust function to regenerate combined_text
print("Regenerating combined_text with improved structure and error handling...")
amazon_df['combined_text_improved'] = amazon_df.apply(create_robust_product_text, axis=1)

# Keep the original combined_text (we'll need it for the original CLIP embeddings)
# and save the improved version to a new column

# Save the updated dataset
amazon_df.to_csv("/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_improved_text.csv", index=False)
print("Updated dataset saved with improved text structure")

# Print a sample of the improved text
print("\nSample of improved text representation:")
print(amazon_df['combined_text_improved'].iloc[0])

Regenerating combined_text with improved structure and error handling...
Updated dataset saved with improved text structure

Sample of improved text representation:
Product: Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)
Category: Computers&Accessories > Accessories&Peripherals > Cables&Accessories > Cables > USBCables
Primary Category: Computers&Accessories
Description: High Compatibility : Compatible With iPhone 12, 11, X/XsMax/Xr ,iPhone 8/8 Plus,iPhone 7/7 Plus,iPhone 6s/6s Plus,iPhone 6/6 Plus,iPhone 5/5s/5c/se,iPad Pro,iPad Air 1/2,iPad mini 1/2/3,iPod nano7,iPod touch and more apple devices.|Fast Charge&Data Sync : It can charge and sync simultaneously at a rapid speed, Compatible with any charging adaptor, multi-port charging station or power bank.|Durability : Durable nylon braided design with premium aluminum housing and toughened nylon fiber wound tightly aroun

# Initialize the Hybrid Search System

In [None]:
amazon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   product_id              1465 non-null   object
 1   product_name            1465 non-null   object
 2   category                1465 non-null   object
 3   discounted_price        1465 non-null   object
 4   actual_price            1465 non-null   object
 5   discount_percentage     1465 non-null   object
 6   rating                  1465 non-null   object
 7   rating_count            1463 non-null   object
 8   about_product           1465 non-null   object
 9   user_id                 1465 non-null   object
 10  user_name               1465 non-null   object
 11  review_id               1465 non-null   object
 12  review_title            1465 non-null   object
 13  review_content          1465 non-null   object
 14  product_link            1465 non-null   object
 15  imag

In [None]:
!pip install --user -U nltk
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download with a specific download directory
nltk.download('punkt', download_dir='/root/nltk_data')
nltk.download('stopwords', download_dir='/root/nltk_data')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
import torch

# Now you can use torch.cuda.is_available()
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# Import our new search components
from vectorshop.embedding.bm25_search import ProductBM25Search
from vectorshop.embedding.hybrid_search import HybridSearch

# Initialize the hybrid search system with the improved text column
print("Initializing hybrid search system...")
search_system = HybridSearch(
    df=amazon_df,
    device="cuda" if torch.cuda.is_available() else "cpu",  # Use GPU if available
    use_deepseek_reranking=True,
    exchange_rate=83
)

# We'll use the existing CLIP embeddings first to test the system
# But we'll tell the system to use our improved text for BM25 search
search_system.bm25_search = ProductBM25Search(amazon_df, text_column='combined_text_improved')

print("Search system initialized with improved text for BM25!")

Initializing hybrid search system...
Fitting BM25 to 1465 product descriptions...
BM25 fitted successfully
Fitting BM25 to 1465 product descriptions...
BM25 fitted successfully
Search system initialized with improved text for BM25!


In [None]:
import os
import torch
import numpy as np
import faiss
from tqdm.notebook import tqdm  # For progress bars

# Define paths
DATA_PATH = "/content/drive/My Drive/E-commerce_Analysis/data/processed/amazon_with_images.csv"
OUTPUT_DIR = "/content/drive/My Drive/E-commerce_Analysis/data/search_system"
EMBEDDINGS_DIR = os.path.join(OUTPUT_DIR, "embeddings_chunks")
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

# Initialize the search system
print(f"Using {'GPU' if torch.cuda.is_available() else 'CPU'} for processing")
print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB") if torch.cuda.is_available() else print("No GPU available")

# Process in chunks to enable restart capability
CHUNK_SIZE = 100  # Process 100 products at a time
total_products = len(amazon_df)
chunks = [(i, min(i+CHUNK_SIZE, total_products)) for i in range(0, total_products, CHUNK_SIZE)]

# Check for existing embeddings chunks
completed_chunks = []
for start, end in chunks:
    chunk_path = os.path.join(EMBEDDINGS_DIR, f"embeddings_{start}_{end}.npy")
    if os.path.exists(chunk_path):
        completed_chunks.append((start, end))

print(f"Found {len(completed_chunks)} completed chunks out of {len(chunks)} total")

# Only process remaining chunks
remaining_chunks = [chunk for chunk in chunks if chunk not in completed_chunks]

if not remaining_chunks:
    print("All chunks already processed!")
else:
    # Initialize search system for chunk processing
    search_system = HybridSearch(
        df=amazon_df,
        device="cuda" if torch.cuda.is_available() else "cpu",
        use_deepseek_reranking=False,  # Turn off during embedding generation to save memory
        exchange_rate=83
    )

    # Process each chunk
    for start, end in tqdm(remaining_chunks, desc="Processing chunks"):
        print(f"Processing products {start} to {end-1}")

        # Get subset of DataFrame
        chunk_df = amazon_df.iloc[start:end].copy()

        # Generate embeddings for this chunk
        embeddings = search_system.embeddings_generator.generate_product_embeddings(
            df=chunk_df,
            text_column='combined_text_improved',
            batch_size=16  # Adjust based on GPU memory
        )

        # Save this chunk's embeddings
        chunk_path = os.path.join(EMBEDDINGS_DIR, f"embeddings_{start}_{end}.npy")
        np.save(chunk_path, embeddings)
        print(f"Saved embeddings chunk to {chunk_path}")

        # Clear GPU memory after each chunk
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

# After all chunks are processed, combine them
if os.path.exists(os.path.join(OUTPUT_DIR, "combined_embeddings.npy")):
    print("Using existing combined embeddings")
    combined_embeddings = np.load(os.path.join(OUTPUT_DIR, "combined_embeddings.npy"))
else:
    # Combine all embedding chunks
    embedding_chunks = []
    for start, end in chunks:
        chunk_path = os.path.join(EMBEDDINGS_DIR, f"embeddings_{start}_{end}.npy")
        if os.path.exists(chunk_path):
            embedding_chunks.append(np.load(chunk_path))

    combined_embeddings = np.vstack(embedding_chunks)
    np.save(os.path.join(OUTPUT_DIR, "combined_embeddings.npy"), combined_embeddings)
    print(f"Combined embeddings saved with shape {combined_embeddings.shape}")

# Build the FAISS index from combined embeddings
vector_index_path = os.path.join(OUTPUT_DIR, "deepseek_vector_index.faiss")
if os.path.exists(vector_index_path):
    print(f"Using existing index from {vector_index_path}")
else:
    print("Building FAISS index from combined embeddings...")
    dimension = combined_embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)

    # Normalize embeddings for cosine similarity
    faiss.normalize_L2(combined_embeddings)

    # Add to index
    index.add(combined_embeddings)

    # Save index
    faiss.write_index(index, vector_index_path)
    print(f"Index saved to {vector_index_path}")

# Initialize the final search system with the built index
search_system = HybridSearch(
    df=amazon_df,
    vector_index_path=vector_index_path,
    device="cuda" if torch.cuda.is_available() else "cpu",
    use_deepseek_reranking=True,
    exchange_rate=83
)

print("Search system initialized successfully!")

Using GPU for processing
Available GPU memory: 15.83 GB
Found 15 completed chunks out of 15 total
All chunks already processed!
Using existing combined embeddings
Using existing index from /content/drive/My Drive/E-commerce_Analysis/data/search_system/deepseek_vector_index.faiss
Loading vector index from /content/drive/My Drive/E-commerce_Analysis/data/search_system/deepseek_vector_index.faiss
Fitting BM25 to 1465 product descriptions...
BM25 fitted successfully
Search system initialized successfully!


# Run Test Queries Including Target Products

In [None]:
# Test queries with the new hybrid search system
test_queries = [
    "good quality of fast charging Cable for iPhone under 5 USD",
    "good quality headset with Noise Cancelling for computer and have warranty",
    "bluetooth wireless earbuds with long battery life",
    "premium gaming mouse with RGB lighting"
]

target_product_ids = {
    "good quality of fast charging Cable for iPhone under 5 USD": "B08CF3B7N1",
    "good quality headset with Noise Cancelling for computer and have warranty": "B009LJ2BXA"
}

# Test each query
for query in test_queries:
    print(f"\n{'='*80}")
    print(f"Test Query: {query}")
    print(f"{'='*80}")

    # Check if this is a target query
    target_id = target_product_ids.get(query, None)
    if target_id:
        print(f"Target product ID: {target_id}")

    # Run the search
    results = search_system.search(query, top_k=5, debug=True)

    # Print results
    print("\nTop 5 Results with Details:")
    display_cols = ['product_id', 'product_name', 'category', 'price_usd']
    if 'hybrid_score' in results.columns:
        display_cols.append('hybrid_score')
    if 'bm25_score' in results.columns:
        display_cols.append('bm25_score')
    if 'vector_score' in results.columns:
        display_cols.append('vector_score')
    if 'semantic_score' in results.columns:
        display_cols.append('semantic_score')
    if 'final_score' in results.columns:
        display_cols.append('final_score')

    print(results[display_cols])

    # Check if target product is in results
    if target_id and target_id in results['product_id'].values:
        rank = results[results['product_id'] == target_id].index.tolist()[0] + 1
        print(f"\n✅ Target product {target_id} found at rank {rank}")
    elif target_id:
        print(f"\n❌ Target product {target_id} not found in top 5 results")


Test Query: good quality of fast charging Cable for iPhone under 5 USD
Target product ID: B08CF3B7N1
Searching for: good quality of fast charging Cable for iPhone under 5 USD
Loading deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


DeepSeek model loaded successfully
Query analysis: {'product_type': 'cable', 'key_features': ['warranty'], 'price_constraint': 'under 5 USD'}
Unable to parse price constraint: under 5 USD
Loading deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B for embeddings...


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Model loaded successfully on cuda
Found target product B08CF3B7N1 at index 632
Current score: 2.431146264076233


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for

DeepSeek reranking applied successfully
Search completed in 25.23 seconds

Top results:
     product_id  \
111  B0974G5Q2Y   
245  B08NCKT9FG   
258  B07CRL2GY6   
113  B09RX1FK54   
985  B09RWZRCP1   

                                                                                                                                                                                product_name  \
111                                                                                                                       boAt Laptop, Smartphone Type-c A400 Male Data Cable (Carbon Black)   
245                                                                                                                                                  Boat A 350 Type C Cable 1.5m(Jet Black)   
258                                                                                                                                     boAt Rugged V3 Braided Micro USB Cable (Pearl White)   
113       boAt Type C A750 St

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Query analysis: {'product_type': 'headset', 'key_features': ['good quality', 'headset', 'Noise Cancelling', 'computer', 'warranty'], 'price_constraint': None}
Category match for B009LJ2BXA: headset in ['Computers&Accessories', 'Accessories&Peripherals', 'Audio&VideoAccessories', 'PCHeadsets']
Found target product B009LJ2BXA at index 906
Current score: 2.7570901919201183


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for

B009LJ2BXA semantic score: 0.0
B009LJ2BXA final score: 2.909370380934455
B009LJ2BXA rank after reranking: 2
DeepSeek reranking applied successfully
Search completed in 15.17 seconds

Top results:
     product_id  \
969  B079S811J3   
906  B009LJ2BXA   
942  B07T9FV9YP   
932  B09MDCZJXS   
785  B098R25TGC   

                                                                                                                                                                                            product_name  \
969                                                            Redgear Cosmo 7,1 Usb Gaming Wired Over Ear Headphones With Mic With Virtual Surround Sound,50Mm Driver, Rgb Leds & Remote Control(Black)   
906                                     Hp Wired On Ear Headphones With Mic With 3.5 Mm Drivers, In-Built Noise Cancelling, Foldable And Adjustable For Laptop/Pc/Office/Home/ 1 Year Warranty (B4B09Pa)   
942                                                                          

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Query analysis: {'product_type': 'headphones', 'key_features': ['long battery life', 'wireless charging', 'protection'], 'price_constraint': 'no price constraints'}
Unable to parse price constraint: no price constraints
Found target product B08CF3B7N1 at index 632
Current score: 0.0


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for

DeepSeek reranking applied successfully
Search completed in 14.87 seconds

Top results:
     product_id  \
656  B07LG59NPV   
690  B09PL79D2X   
405  B08FN6WGDQ   
655  B0B1F6GQPS   
813  B09GFWJDY1   

                                                                                                                                                                                   product_name  \
656                                                     Boult Audio Probass Curve Bluetooth Wireless in Ear Earphones with Mic with Ipx5 Water Resistant, 12H Battery Life & Extra Bass (Black)   
690  boAt Airdopes 181 in-Ear True Wireless Earbuds with ENx  Tech, Beast  Mode(Low Latency Upto 60ms) for Gaming, with Mic, ASAP  Charge, 20H Playtime, Bluetooth v5.2, IPX4 & IWP (Cool Grey)   
405                                                                             Samsung Galaxy Buds Live Bluetooth Truly Wireless in Ear Earbuds with Mic, Upto 21 Hours Playtime, Mystic Black   
655            Bo

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Query analysis: {'product_type': '', 'key_features': [], 'price_constraint': None}


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for

DeepSeek reranking applied successfully
Search completed in 18.39 seconds

Top results:
     product_id  \
880  B07YWS9SP9   
775  B09ZHCJDP1   
940  B09GBBJV72   
258  B07CRL2GY6   
897  B08LT9BMPP   

                                                                                                                                                                   product_name  \
880                                                                       Zebronics, ZEB-NC3300 USB Powered Laptop Cooling Pad with Dual Fan, Dual USB Port and Blue LED Lights   
775          Amazon Basics Wireless Mouse | 2.4 GHz Connection, 1600 DPI | Type - C Adapter | Upto 12 Months of Battery Life | Ambidextrous Design | Suitable for PC/Mac/Laptop   
940                    HP 330 Wireless Black Keyboard and Mouse Set with Numeric Keypad, 2.4GHz Wireless Connection and 1600 DPI, USB Receiver, LED Indicators , Black(2V9E6AA)   
258                                                                              

# Compare Old vs. New Search Systems Side by Side

In [None]:
# Compare the old and new search systems side by side
from prettytable import PrettyTable
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import CLIPProcessor, CLIPModel
from vectorshop.embedding.vector_search import search_multi_modal

# Define CLIP model loading function
def get_clip_model(device="cpu"):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    return model, processor

# Load existing indexes
text_index_clip = faiss.read_index("/content/drive/My Drive/E-commerce_Analysis/data/processed/text_index.faiss")
image_index_clip = faiss.read_index("/content/drive/My Drive/E-commerce_Analysis/data/processed/image_index.faiss")

# Initialize TF-IDF for the original search system
print("Initializing TF-IDF vectorizer...")
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(amazon_df['combined_text_improved'])
print("TF-IDF matrix created with shape:", tfidf_matrix.shape)

# Get CLIP model
print("Loading CLIP model...")
clip_model, clip_processor = get_clip_model(device=device)

# Create comparison table
comparison_table = PrettyTable()
comparison_table.field_names = ["Query", "Target Product ID", "Original System Rank", "Hybrid System Rank", "Speed Improvement"]

# Test queries
test_queries = [
    "good quality of fast charging Cable for iPhone under 5 USD",
    "good quality headset with Noise Cancelling for computer and have warranty"
]

for query in test_queries:
    print(f"\nComparing search systems for: {query}")
    target_id = target_product_ids.get(query)

    # Test original system
    start_time = time.time()
    original_results = search_multi_modal(
        query=query,
        text_index=text_index_clip,
        image_index=image_index_clip,
        df=amazon_df,
        model=clip_model,
        processor=clip_processor,
        tfidf=tfidf,
        tfidf_matrix=tfidf_matrix,
        device=device,
        top_k=10,
        exchange_rate=83
    )
    original_time = time.time() - start_time

    original_rank = "Not Found"
    if target_id in original_results['product_id'].values:
        original_rank = original_results[original_results['product_id'] == target_id].index[0] + 1

    # Test hybrid system
    start_time = time.time()
    hybrid_results = search_system.search(query, top_k=10, debug=False)
    hybrid_time = time.time() - start_time

    hybrid_rank = "Not Found"
    if target_id in hybrid_results['product_id'].values:
        hybrid_rank = hybrid_results[hybrid_results['product_id'] == target_id].index[0] + 1

    # Speed comparison
    speed_ratio = original_time / hybrid_time
    speed_improvement = f"{speed_ratio:.2f}x"

    # Add to table
    comparison_table.add_row([
        query[:30] + "...",
        target_id,
        original_rank,
        hybrid_rank,
        speed_improvement
    ])

print("\nSearch System Comparison:")
print(comparison_table)

Initializing TF-IDF vectorizer...
TF-IDF matrix created with shape: (1465, 19120)
Loading CLIP model...


  combined_results.at[index, 'score'] += category_boost
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.



Comparing search systems for: good quality of fast charging Cable for iPhone under 5 USD
Running original search_multi_modal function
Initial text indices: [ 118  117  133    2  379  623   97  486   58  925    7  422  668   62
  938  207  538   23  478  727  277  234   44  238  253  139  162   73
  564  983  205  172  196  195  333   81  316    4  393  632  300   83
  426  672  176  699  164  178  208   74  985  322   69  562  974  832
  235   15  464  185  547  204  219  602   59  928  181  140   29  503
  771   34  836    6  418  658  282   76  570  992   66  968   21  722
  156  229  201  137  407  644   10  428  673  240  223   45  857  331
  309  900  246  174   18  472  285  695  227  153  314  113  304  115
  111  256   14  456  692  505  109  173  328  519  252  582  755  483
  158  324   93  217  187  198  149  228  313   92  287  460  713  101
  136  248  120  131   33  833  107   27  768  374  780   28  504  784
  388  552  962  809  451   47   78 1000  975   71  957   75  

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for


Comparing search systems for: good quality headset with Noise Cancelling for computer and have warranty
Running original search_multi_modal function
Initial text indices: [ 883  773  312  879  729  655  780  932  906  585  344  594  631  649
  172  512  706  809  466  969  716  647  680  920  415  867  139   62
  938  620  775  437  664  973  891  251  697  486  942  117  118  221
  133  609  137   74  985  167  656  204  874  523   21  722  708  227
  346  600  991  599  747    7  422  668  760  939  670  234   76  570
  992  636  457  435  772  832 1050  521   81 1157 1001  460  713  971
  349 1430  455  929  790  579  335  385  508  586 1002  824  202  821
  841  687  474  704  378  448  694  597  375  574  619 1448   83  662
  207  877  734  477  226  986  811  495  766   47  277  219   44  224
  113  445   18  472  827 1286  164  336  587  496  725  538  785  176
  240  999  426  672  196  755  142 1015   34  836  726  849  458  402
  859  791  516  229  627  584  552  962  926  

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:151643 for


Search System Comparison:
+-----------------------------------+-------------------+----------------------+--------------------+-------------------+
|               Query               | Target Product ID | Original System Rank | Hybrid System Rank | Speed Improvement |
+-----------------------------------+-------------------+----------------------+--------------------+-------------------+
| good quality of fast charging ... |     B08CF3B7N1    |      Not Found       |        633         |       0.01x       |
| good quality headset with Nois... |     B009LJ2BXA    |         907          |        907         |       0.00x       |
+-----------------------------------+-------------------+----------------------+--------------------+-------------------+


In [None]:
# Correct check
target_in_df = any(amazon_df['product_id'] == "B009LJ2BXA")
# Or better:
target_product = amazon_df[amazon_df['product_id'] == "B009LJ2BXA"]
if not target_product.empty:
    print("Target product found:", target_product['product_name'].values[0])
    print("Category:", target_product['category'].values[0])

Target product found: Hp Wired On Ear Headphones With Mic With 3.5 Mm Drivers, In-Built Noise Cancelling, Foldable And Adjustable For Laptop/Pc/Office/Home/ 1 Year Warranty (B4B09Pa)
Category: Computers&Accessories|Accessories&Peripherals|Audio&VideoAccessories|PCHeadsets


# Create Demo Function for Shopify Stakeholders

In [None]:
def demo_search_for_stakeholders(query, top_k=5):
    """
    Demonstration function that shows the power of the hybrid search system.
    This function is designed to showcase the system to Shopify stakeholders.

    Args:
        query: The search query from the user
        top_k: Number of results to return
    """
    print(f"\n{'='*80}")
    print(f"🔍 SEARCH QUERY: {query}")
    print(f"{'='*80}")

    # Start timing
    start_time = time.time()

    # Step 1: Query Analysis
    print("\n🧠 QUERY ANALYSIS:")
    query_analysis = search_system.reranker.analyze_query(query)

    # Display extracted information
    print(f"• Product Type: {query_analysis.get('product_type', 'General')}")
    print(f"• Key Features: {', '.join(query_analysis.get('key_features', ['None detected']))}")
    if query_analysis.get('price_constraint'):
        print(f"• Price Constraint: Under ${query_analysis.get('price_constraint')} USD")

    # Step 2: Run Search
    results = search_system.search(query, top_k=top_k, debug=False)

    # Calculate search time
    elapsed_time = time.time() - start_time

    # Step 3: Show Results and Explanations
    print(f"\n📊 TOP {top_k} RESULTS (found in {elapsed_time:.2f} seconds):")

    for i, (idx, row) in enumerate(results.iterrows()):
        print(f"\n{i+1}. {row['product_name']}")
        print(f"   Product ID: {row['product_id']}")
        print(f"   Category: {row['category']}")
        print(f"   Price: ${row['price_usd']:.2f} USD")

        # Show relevance explanation
        print("   Relevance Factors:")
        if 'bm25_score' in row and not pd.isna(row['bm25_score']):
            print(f"   • Keyword Match: {'High' if row['bm25_score'] > 5 else 'Medium' if row['bm25_score'] > 2 else 'Low'}")
        if 'vector_score' in row and not pd.isna(row['vector_score']):
            print(f"   • Semantic Similarity: {'High' if row['vector_score'] > 0.8 else 'Medium' if row['vector_score'] > 0.5 else 'Low'}")
        if 'semantic_score' in row and not pd.isna(row['semantic_score']) and row['semantic_score'] > 0:
            print(f"   • DeepSeek Rating: {row['semantic_score']:.1f}/10")

        # Show matching features if using DeepSeek reranker
        if query_analysis and 'key_features' in query_analysis and query_analysis['key_features']:
            matches = []
            product_text = str(row['combined_text_improved']).lower()
            for feature in query_analysis['key_features']:
                if feature.lower() in product_text:
                    matches.append(feature)
            if matches:
                print(f"   • Matching Features: {', '.join(matches)}")

    # Comparison with old system
    if query in target_product_ids:
        target_id = target_product_ids[query]
        if target_id in results['product_id'].values:
            target_rank = results[results['product_id'] == target_id].index.tolist()[0] + 1
            print(f"\n✅ IMPROVEMENT: Target product {target_id} found at rank {target_rank}")
            print(f"   (Previous system: rank {old_system_ranks.get(query, 'Not Found')})")
        else:
            print(f"\n❌ Target product {target_id} not found in top {top_k} results")
            print(f"   (Previous system: rank {old_system_ranks.get(query, 'Not Found')})")

    return results

# Define comparison data
target_product_ids = {
    "good quality of fast charging Cable for iPhone under 5 USD": "B08CF3B7N1",
    "good quality headset with Noise Cancelling for computer and have warranty": "B009LJ2BXA"
}

old_system_ranks = {
    "good quality of fast charging Cable for iPhone under 5 USD": 73,
    "good quality headset with Noise Cancelling for computer and have warranty": 907
}

# Sample usage with visually appealing output formatting
demo_search_for_stakeholders("wireless earbuds with long battery life and noise cancellation")