<pre>Please make sure your runtime has a GPU;
otherwise the code will take a long time to run, because it uses deep learning models.</pre>

## Install dependencies

In [None]:
%%capture
!pip install faiss-gpu googletrans==4.0.0-rc1 fuzzywuzzy rank_bm25 sentence-transformers

In [None]:
import pandas as pd
import torch
import faiss
import numpy as np
from googletrans import Translator
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from rank_bm25 import BM25Okapi
import json
from sentence_transformers import SentenceTransformer, util

## Upload the `data.zip` file containing the product `csv` file

In [None]:
from google.colab import files
files.upload()

In [None]:
!unzip /content/data.zip

Archive:  /content/data.zip
  inflating: flipkart_com-ecommerce_sample.csv  


## Readying the dataset for retrieval

In [None]:

df = pd.read_csv('flipkart_com-ecommerce_sample.csv')


We need a set of unique products for retrieval and recommendation, so we keep the first row for each `product_name` and drop the remaining duplicates.

In [None]:
unique_products_df = df.drop_duplicates(subset='product_name')

final_df = unique_products_df[['product_name', 'description', 'product_specifications', 'image','product_category_tree']]
final_df.columns = ['product_name', 'description', 'product_specification', 'image','product_category_tree']

final_df.to_csv('unique_products.csv', index=False)

final_df


Unnamed: 0,product_name,description,product_specification,image,product_category_tree
0,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,"{""product_specification""=>[{""key""=>""Number of ...","[""http://img5a.flixcart.com/image/short/u/4/a/...","[""Clothing >> Women's Clothing >> Lingerie, Sl..."
1,FabHomeDecor Fabric Double Sofa Bed,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,"{""product_specification""=>[{""key""=>""Installati...","[""http://img6a.flixcart.com/image/sofa-bed/j/f...","[""Furniture >> Living Room Furniture >> Sofa B..."
2,AW Bellies,Key Features of AW Bellies Sandals Wedges Heel...,"{""product_specification""=>[{""key""=>""Ideal For""...","[""http://img5a.flixcart.com/image/shoe/7/z/z/r...","[""Footwear >> Women's Footwear >> Ballerinas >..."
4,Sicons All Purpose Arnica Dog Shampoo,Specifications of Sicons All Purpose Arnica Do...,"{""product_specification""=>[{""key""=>""Pet Type"",...","[""http://img5a.flixcart.com/image/pet-shampoo/...","[""Pet Supplies >> Grooming >> Skin & Coat Care..."
5,Eternal Gandhi Super Series Crystal Paper Weig...,Key Features of Eternal Gandhi Super Series Cr...,"{""product_specification""=>[{""key""=>""Model Name...","[""http://img5a.flixcart.com/image/paper-weight...","[""Eternal Gandhi Super Series Crystal Paper We..."
...,...,...,...,...,...
19948,Uberlyfe Large Vinyl Sticker,Buy Uberlyfe Large Vinyl Sticker for Rs.595 on...,"{""product_specification""=>[{""key""=>""Sales Pack...","[""http://img5a.flixcart.com/image/sticker/k/h/...","[""Baby Care >> Baby & Kids Gifts >> Stickers >..."
19958,We Witches Comfy Hues Women Wedges,Flipkart.com: Buy We Witches Comfy Hues Women ...,"{""product_specification""=>[{""key""=>""Occasion"",...","[""http://img5a.flixcart.com/image/sandal/m/y/z...","[""Footwear >> Women's Footwear >> Wedges""]"
19962,Stylistry Women Heels,Flipkart.com: Buy Stylistry Women Heels only f...,"{""product_specification""=>[{""key""=>""Occasion"",...","[""http://img6a.flixcart.com/image/sandal/z/s/2...","[""Footwear >> Women's Footwear >> Heels""]"
19976,Uberlyfe Extra Large Vinyl Sticker,Uberlyfe Extra Large Vinyl Sticker (Pack of 2)...,"{""product_specification""=>[{""key""=>""Number of ...","[""http://img6a.flixcart.com/image/sticker/f/r/...","[""Baby Care >> Baby & Kids Gifts >> Stickers >..."


Remove rows that contain a null value in any of the selected columns.

In [None]:
final_df = final_df.dropna()
final_df

Unnamed: 0,product_name,description,product_specification,image,product_category_tree
0,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,"{""product_specification""=>[{""key""=>""Number of ...","[""http://img5a.flixcart.com/image/short/u/4/a/...","[""Clothing >> Women's Clothing >> Lingerie, Sl..."
1,FabHomeDecor Fabric Double Sofa Bed,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,"{""product_specification""=>[{""key""=>""Installati...","[""http://img6a.flixcart.com/image/sofa-bed/j/f...","[""Furniture >> Living Room Furniture >> Sofa B..."
2,AW Bellies,Key Features of AW Bellies Sandals Wedges Heel...,"{""product_specification""=>[{""key""=>""Ideal For""...","[""http://img5a.flixcart.com/image/shoe/7/z/z/r...","[""Footwear >> Women's Footwear >> Ballerinas >..."
4,Sicons All Purpose Arnica Dog Shampoo,Specifications of Sicons All Purpose Arnica Do...,"{""product_specification""=>[{""key""=>""Pet Type"",...","[""http://img5a.flixcart.com/image/pet-shampoo/...","[""Pet Supplies >> Grooming >> Skin & Coat Care..."
5,Eternal Gandhi Super Series Crystal Paper Weig...,Key Features of Eternal Gandhi Super Series Cr...,"{""product_specification""=>[{""key""=>""Model Name...","[""http://img5a.flixcart.com/image/paper-weight...","[""Eternal Gandhi Super Series Crystal Paper We..."
...,...,...,...,...,...
19936,Purple Women Heels,Flipkart.com: Buy Purple Women Heels only for ...,"{""product_specification""=>[{""key""=>""Occasion"",...","[""http://img6a.flixcart.com/image/sandal/u/9/w...","[""Footwear >> Women's Footwear >> Heels""]"
19948,Uberlyfe Large Vinyl Sticker,Buy Uberlyfe Large Vinyl Sticker for Rs.595 on...,"{""product_specification""=>[{""key""=>""Sales Pack...","[""http://img5a.flixcart.com/image/sticker/k/h/...","[""Baby Care >> Baby & Kids Gifts >> Stickers >..."
19958,We Witches Comfy Hues Women Wedges,Flipkart.com: Buy We Witches Comfy Hues Women ...,"{""product_specification""=>[{""key""=>""Occasion"",...","[""http://img5a.flixcart.com/image/sandal/m/y/z...","[""Footwear >> Women's Footwear >> Wedges""]"
19962,Stylistry Women Heels,Flipkart.com: Buy Stylistry Women Heels only f...,"{""product_specification""=>[{""key""=>""Occasion"",...","[""http://img6a.flixcart.com/image/sandal/z/s/2...","[""Footwear >> Women's Footwear >> Heels""]"


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12664 entries, 0 to 19976
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   product_name           12664 non-null  object
 1   description            12664 non-null  object
 2   product_specification  12664 non-null  object
 3   image                  12664 non-null  object
 4   product_category_tree  12664 non-null  object
dtypes: object(5)
memory usage: 593.6+ KB


<pre>The main idea is to use all the metadata columns as strings,
so we preprocess them into a clean text format
that can be passed to text encoder models.</pre>

In [None]:
def convert_category(category_string):
    return ','.join([x.strip() for x in category_string.split('>>')])

final_df['product_category_tree'] = final_df['product_category_tree'].apply(convert_category)
final_df['product_category_tree']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['product_category_tree'] = final_df['product_category_tree'].apply(convert_category)


Unnamed: 0,product_category_tree
0,"[""Clothing,Women's Clothing,Lingerie, Sleep & ..."
1,"[""Furniture,Living Room Furniture,Sofa Beds & ..."
2,"[""Footwear,Women's Footwear,Ballerinas,AW Bell..."
4,"[""Pet Supplies,Grooming,Skin & Coat Care,Shamp..."
5,"[""Eternal Gandhi Super Series Crystal Paper We..."
...,...
19936,"[""Footwear,Women's Footwear,Heels""]"
19948,"[""Baby Care,Baby & Kids Gifts,Stickers,Uberlyf..."
19958,"[""Footwear,Women's Footwear,Wedges""]"
19962,"[""Footwear,Women's Footwear,Heels""]"
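
The `convert_category` output above still carries the `["` and `"]` wrappers from the raw column. Below is a minimal sketch of one way to strip them, assuming every value is a JSON list holding one `>>`-separated path. The helper name `category_to_text` is hypothetical and not part of the notebook.

```python
import json

def category_to_text(category_string):
    """Turn '["A >> B >> C"]' into 'A, B, C'; fall back to plain splitting."""
    try:
        # Assumption: the raw value is a JSON list with one path string inside
        paths = json.loads(category_string)
    except (TypeError, ValueError):
        paths = [category_string]
    if not isinstance(paths, list):
        paths = [str(paths)]
    # Split every path on '>>' and drop empty levels
    levels = [level.strip() for path in paths for level in str(path).split('>>')]
    return ', '.join(level for level in levels if level)

print(category_to_text('["Footwear >> Women\'s Footwear >> Heels"]'))
```

Whether the cleaner text improves the category embeddings is not something the notebook measures, so treat this as an optional variation.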


In [None]:
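import json

def parse_specifications(spec):
    """Flatten the '=>'-style specification string into 'key:value, key:value' text."""
    try:
        # The raw string uses '=>' between keys and values; swap it for ':' to get JSON
        spec_dict = json.loads(spec.replace("=>", ":"))
        specs = ", ".join([f"{item['key']}:{item['value']}" for item in spec_dict['product_specification'] if 'key' in item and 'value' in item])
        return specs.lower()
    except (AttributeError, KeyError, TypeError, ValueError):
        # Malformed or missing specifications become an empty string
        return ""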


final_df['product_specification'] = final_df['product_specification'].apply(parse_specifications)

final_df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['product_specification'] = final_df['product_specification'].apply(parse_specifications)


Unnamed: 0,product_name,description,product_specification,image,product_category_tree
0,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,"number of contents in sales package:pack of 3,...","[""http://img5a.flixcart.com/image/short/u/4/a/...","[""Clothing,Women's Clothing,Lingerie, Sleep & ..."
1,FabHomeDecor Fabric Double Sofa Bed,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,installation & demo details:installation and d...,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...","[""Furniture,Living Room Furniture,Sofa Beds & ..."
2,AW Bellies,Key Features of AW Bellies Sandals Wedges Heel...,"ideal for:women, occasion:casual, color:red, o...","[""http://img5a.flixcart.com/image/shoe/7/z/z/r...","[""Footwear,Women's Footwear,Ballerinas,AW Bell..."
4,Sicons All Purpose Arnica Dog Shampoo,Specifications of Sicons All Purpose Arnica Do...,"pet type:dog, brand:sicons, quantity:500 ml, m...","[""http://img5a.flixcart.com/image/pet-shampoo/...","[""Pet Supplies,Grooming,Skin & Coat Care,Shamp..."
5,Eternal Gandhi Super Series Crystal Paper Weig...,Key Features of Eternal Gandhi Super Series Cr...,"model name:gandhi paper weight mark v, weight:...","[""http://img5a.flixcart.com/image/paper-weight...","[""Eternal Gandhi Super Series Crystal Paper We..."
...,...,...,...,...,...
19936,Purple Women Heels,Flipkart.com: Buy Purple Women Heels only for ...,"occasion:casual, ideal for:women, tip shape:ro...","[""http://img6a.flixcart.com/image/sandal/u/9/w...","[""Footwear,Women's Footwear,Heels""]"
19948,Uberlyfe Large Vinyl Sticker,Buy Uberlyfe Large Vinyl Sticker for Rs.595 on...,"sales package:sticker, brand:uberlyfe, type:vi...","[""http://img5a.flixcart.com/image/sticker/k/h/...","[""Baby Care,Baby & Kids Gifts,Stickers,Uberlyf..."
19958,We Witches Comfy Hues Women Wedges,Flipkart.com: Buy We Witches Comfy Hues Women ...,"occasion:casual, ideal for:women, type:wedges,...","[""http://img5a.flixcart.com/image/sandal/m/y/z...","[""Footwear,Women's Footwear,Wedges""]"
19962,Stylistry Women Heels,Flipkart.com: Buy Stylistry Women Heels only f...,"occasion:party, ideal for:women, type:heels, h...","[""http://img6a.flixcart.com/image/sandal/z/s/2...","[""Footwear,Women's Footwear,Heels""]"
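
To make the specification preprocessing concrete, here is a small worked example on a hand-written string in the same `=>` style as the raw column. The sample value and the name `specs_to_text` are made up for illustration; the logic mirrors `parse_specifications` without the error handling.

```python
import json

# Hand-written sample in the dataset's '=>' style (not a real row)
raw = ('{"product_specification"=>['
       '{"key"=>"Pet Type", "value"=>"Dog"}, '
       '{"key"=>"Brand", "value"=>"Sicons"}, '
       '{"value"=>"entry without a key"}]}')

def specs_to_text(spec):
    # Replacing '=>' with ':' turns the string into valid JSON
    data = json.loads(spec.replace("=>", ":"))
    pairs = [f"{item['key']}:{item['value']}"
             for item in data["product_specification"]
             if "key" in item and "value" in item]  # skip incomplete entries
    return ", ".join(pairs).lower()

flat = specs_to_text(raw)
print(flat)
```

One caveat of the string replacement: a specification value that itself contains `=>` would also be rewritten.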


In [None]:
final_df['description'] = final_df['description'].astype(str)

## Retrieval Modelling Approach

Tokenize the product names for BM25 keyword search by splitting them on whitespace into single words.

In [None]:
# Lowercase the product names so they match the lowercased query tokens at search time
tokenized_descriptions = [name.lower().split() for name in final_df['product_name']]

In [None]:
bm25 = BM25Okapi(tokenized_descriptions)
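
BM25 only rewards exact token matches, and the retrieval function lowercases the query, so the case of the corpus tokens matters. The sketch below is a textbook BM25 scorer in plain Python, written only to illustrate this. It is not `rank_bm25`'s implementation (its IDF handling may differ), and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (textbook BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

names = ["Qbee Women's Solid Casual Shirt", "AW Bellies", "Stylistry Women Heels"]
query = "casual shirt".lower().split()

raw_scores = bm25_scores(query, [n.split() for n in names])            # case kept
lower_scores = bm25_scores(query, [n.lower().split() for n in names])  # case folded
print(raw_scores, lower_scores)
```

With the original casing no document matches the lowercased query; after lowercasing both sides, only the shirt gets a positive score.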

Load the embedding model used to embed the text query and the metadata columns.

In [None]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2').to('cuda')

Load a bi-encoder model specifically for comparing the query against documents (here, the `product_specification` column).

In [None]:
bi_encoder_model_name = 'sentence-transformers/msmarco-distilbert-base-v4'
bi_encoder = SentenceTransformer(bi_encoder_model_name).to('cuda')

In [None]:

def get_bi_encoder_embeddings(texts, model):
    return model.encode(texts, convert_to_tensor=True)

def get_encoder_embeddings(texts, model):
    return model.encode(texts, convert_to_tensor=True)

We also try to build a multimodal retrieval system, so we need functions that load the product images and generate their embeddings using a pre-trained ViT model that was fine-tuned on an e-commerce dataset.

In [None]:
from PIL import Image
import requests
from io import BytesIO
import torch
import numpy as np

def load_image_from_url(image_url):
    try:
        response = requests.get(image_url)
        response.raise_for_status()  # Raise an HTTPError for bad responses

        image = Image.open(BytesIO(response.content))
        image = image.convert('RGB')  # Ensure the image is in RGB format
        image = image.resize((224, 224))  # Resize the image to 224x224

        # Convert the image to a tensor
        image_array = np.array(image)
        image_tensor = torch.tensor(image_array).permute(2, 0, 1).float().div(255)  # Normalizing to [0, 1]
        return image_tensor
    except Exception as e:
        print(f"Error loading image from {image_url}: {e}")
        # Create a zero tensor of the same shape as the expected image tensor
        zero_tensor = torch.zeros((3, 224, 224))
        return zero_tensor


In [None]:
img_url = 'http://img5a.flixcart.com/image/short/u/4/a/altht-3p-21-alisha-38-original-imaeh2d5vm5zbtgg.jpeg'
img = load_image_from_url(img_url)
img.shape

torch.Size([3, 224, 224])

In [None]:
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("kraftman1/image_ecommerce_classifier_v3")
img_model = AutoModelForImageClassification.from_pretrained("kraftman1/image_ecommerce_classifier_v3").to('cuda')

In [None]:
img_model.classifier = torch.nn.Identity()

In [None]:
img_model

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTSdpaAttention(
            (attention): ViTSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_fe

Each product has multiple images, but we index only one image embedding per product, so we take the average of the embeddings of all images given for that product.

Note: the cell below takes a long time to run because there are around 30K images, so feel free to skip image embeddings. If you want to use them, uncomment the image-embedding code in the cells below and in the final retrieval function.

Please uncomment if image embeddings are used.

In [None]:
# def get_image_embeddings(image_urls, embedding_model):
#     embeddings = []
#     for image_url in image_urls:
#         image = load_image_from_url(image_url).to('cuda')
#         embedding = embedding_model(image.unsqueeze(0)).logits
#         embeddings.append(embedding)
#     return torch.stack(embeddings).mean(dim=0)


# final_df['image_embeddings'] = final_df['image'].apply(lambda x: get_image_embeddings(eval(x), img_model))
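
The averaging step can be sketched without any model, assuming each image has already been turned into a fixed-length vector. The name `mean_embedding` and the toy vectors are illustrative only.

```python
def mean_embedding(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Two toy 3-dimensional "image embeddings" for one product
product_vector = mean_embedding([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
print(product_vector)
```

Note that `load_image_from_url` returns an all-zero image when a download fails, so the embedding of that blank placeholder also enters the average.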

Convert our metadata to embeddings.

Please uncomment if image embeddings are used.

In [None]:
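# Each per-product embedding has shape (1, dim); squeeze to (num_products, dim) and detach before converting to NumPy
# image_embeddings = torch.stack(final_df['image_embeddings'].tolist()).squeeze(1).detach().cpu().numpy()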

In [None]:
categories = final_df['product_category_tree'].tolist()
categories_embeddings = get_encoder_embeddings(categories, embedding_model)

In [None]:
descriptions = final_df['description'].tolist()
description_embeddings = get_encoder_embeddings(descriptions, embedding_model)

In [None]:
specifications = final_df['product_specification'].tolist()
specification_embeddings = get_bi_encoder_embeddings(specifications, bi_encoder)

Next we need a vector store to hold these embeddings so that we can run a vector similarity search between the query vector and the stored embedding vectors. We use the FAISS library from Facebook for this.

In [None]:
index_categories = faiss.IndexFlatL2(categories_embeddings.shape[1])
index_categories.add(categories_embeddings.cpu().numpy())

In [None]:
index_descriptions = faiss.IndexFlatL2(description_embeddings.shape[1])
index_descriptions.add(description_embeddings.cpu().numpy())

In [None]:
index_specifications = faiss.IndexFlatL2(specification_embeddings.shape[1])
index_specifications.add(specification_embeddings.cpu().numpy())
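
`IndexFlatL2` performs an exact, brute-force search and reports squared L2 distances. A plain-Python sketch of the same idea for a single query follows; the function name `flat_l2_search` and the toy vectors are assumptions for illustration, and FAISS itself works on batches of queries, which is why the notebook indexes the result with `[0]`.

```python
def flat_l2_search(stored, query, k):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(vec, query)) for vec in stored]
    # Positions of the k smallest distances, closest first
    order = sorted(range(len(stored)), key=lambda i: distances[i])[:k]
    return [distances[i] for i in order], order

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
dists, ids = flat_l2_search(stored, [0.9, 0.1], k=2)
print(ids, dists)
```

The returned positions are row positions in the order the vectors were added, which is why the notebook looks them up with `df.iloc`.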

Please uncomment if image embeddings are used.

In [None]:
# index_images = faiss.IndexFlatL2(image_embeddings.shape[1])
# index_images.add(image_embeddings)  # image_embeddings is already a NumPy array

We use the Google Translate API (through `googletrans`) to handle multilingual queries.

In [None]:
translator = Translator()

In [None]:
%%capture
!pip install langdetect

In [None]:
from langdetect import detect

def translate_query(query):
    if detect(query) == 'en':
        return query
    else:
        translated = translator.translate(query, dest='en')
        return translated.text
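
Language detection and translation can both fail at run time (for example, on an empty query or a network error). The sketch below shows one defensive pattern: fall back to the original text. The detector and translator are passed in as functions so the example runs without network access. `translate_query_safe` and the fake functions are hypothetical.

```python
def translate_query_safe(query, detect_fn, translate_fn):
    """Return an English query, or the original text if anything fails."""
    text = query.strip()
    if not text:
        return text  # nothing to detect or translate
    try:
        if detect_fn(text) == 'en':
            return text
        return translate_fn(text)
    except Exception:
        # Broad on purpose: detection and network errors both fall back
        return text

# Stand-ins for langdetect.detect and translator.translate(...).text
def fake_detect(text):
    return 'es' if text == 'camisa' else 'en'

def fake_translate(text):
    return 'shirt'

print(translate_query_safe('camisa', fake_detect, fake_translate))
print(translate_query_safe('shirts for women', fake_detect, fake_translate))
```

In the notebook this could be called as `translate_query_safe(query, detect, lambda t: translator.translate(t, dest='en').text)`.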


We use the `fuzzywuzzy` library to handle incomplete queries.

In [None]:
def fuzzy_match(query, choices, scorer=fuzz.token_sort_ratio, limit=5):
    results = process.extract(query, choices, scorer=scorer, limit=limit)
    return [result[0] for result in results]
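
A token-sort comparison ignores word order: both strings are lowercased, split into words, sorted, and rejoined before they are compared. The standard-library sketch below approximates that idea with `difflib`; it is not `fuzzywuzzy`'s exact scoring, and the function names are made up for illustration.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Similarity in [0, 100] after sorting the words of both strings."""
    norm_a = ' '.join(sorted(a.lower().split()))
    norm_b = ' '.join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())

def best_matches(query, choices, limit=5):
    """Return the `limit` choices most similar to the query, best first."""
    ranked = sorted(choices, key=lambda c: token_sort_similarity(query, c), reverse=True)
    return ranked[:limit]

names = ["AW Bellies", "Stylistry Women Heels", "Purple Women Heels"]
print(best_matches("heels women stylistry", names, limit=1))
```

Because the score compares whole strings, a short query tends to score low against a long description; matching against product names may be a better fit, though the notebook does not evaluate this.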

## Main retrieval function

In [None]:
def retrieve_candidates(query, df, bm25, index_descriptions, index_specifications, embedding_model, bi_encoder, top_n=10):
    query = translate_query(query)

    # Perform fuzzy matching for incomplete terms
    descriptions = df['description'].tolist()
    fuzzy_matches = fuzzy_match(query, descriptions, limit=top_n)

    # Perform BM25 search on product names
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_indices = bm25_scores.argsort()[-top_n:][::-1]
    bm25_candidates = df.iloc[bm25_top_indices]['product_name'].tolist()

    query_embedding = get_encoder_embeddings([query], embedding_model)
    bi_query_embedding = get_bi_encoder_embeddings([query], bi_encoder)

    # Perform semantic search on descriptions
    _, description_indices = index_descriptions.search(query_embedding.cpu().numpy(), top_n)
    semantic_candidates_descriptions = df.iloc[description_indices[0]]['description'].tolist()

    # Perform semantic search on product specifications
    distances, specification_indices = index_specifications.search(bi_query_embedding.cpu().numpy(), top_n)
    semantic_candidates_specifications = df.iloc[specification_indices[0]]['product_specification'].tolist()

    # Perform semantic search on category tree
    _, category_indices = index_categories.search(query_embedding.cpu().numpy(), top_n)
    semantic_candidates_categories = df.iloc[category_indices[0]]['product_category_tree'].tolist()

    #Please uncomment if image embeddings are used.
    # #Perform semantic search on images
    # _, image_indices = index_images.search(query_embedding.cpu().numpy(), top_n)
    # semantic_candidates_images = df.iloc[image_indices[0]]['image_embeddings'].tolist()

    # add semantic_candidates_images in the list below
    combined_candidates = list(set(fuzzy_matches+ bm25_candidates + semantic_candidates_descriptions + semantic_candidates_specifications + semantic_candidates_categories )) # +gpt2 candidates + semantic_candidates_images

    # Get embeddings for the combined candidates
    candidate_embeddings = get_encoder_embeddings(combined_candidates, embedding_model)
    query_embedding = get_encoder_embeddings([query], embedding_model)

    # Rank candidates based on cosine similarity
    scores = util.pytorch_cos_sim(query_embedding, candidate_embeddings).flatten()
    ranked_indices = scores.argsort(descending=True)
    ranked_candidates = [combined_candidates[i] for i in ranked_indices]

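    # Keep rows whose description is among the candidates. Note: isin() returns rows
    # in DataFrame order, so the similarity ranking above does not set the final order.
    # ranked_results = df[df['product_name'].isin(ranked_candidates)].head(top_n)
    ranked_results = df[df['description'].isin(ranked_candidates)].head(top_n)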

    return ranked_results
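
The function above merges candidate strings from different columns and then filters with `isin`, which keeps DataFrame order. An alternative is to merge candidate row positions and sort those positions by similarity to the query. The sketch below shows the idea in plain Python; `fuse_and_rank`, the toy vectors, and the hit lists are assumptions, not the notebook's method.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse_and_rank(candidate_lists, query_vec, row_vecs, top_n):
    """Merge row positions from several retrievers and rank them by similarity."""
    seen = []
    for positions in candidate_lists:
        for pos in positions:
            if pos not in seen:  # de-duplicate while keeping first-seen order
                seen.append(pos)
    ranked = sorted(seen, key=lambda pos: cosine(query_vec, row_vecs[pos]), reverse=True)
    return ranked[:top_n]

# Toy embeddings for four products and hits from two retrievers
row_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
bm25_hits, semantic_hits = [3, 1], [1, 2]
top = fuse_and_rank([bm25_hits, semantic_hits], [0.0, 1.0], row_vecs, top_n=2)
print(top)
```

With positions in hand, `df.iloc[top]` would return the rows in ranked order.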


In [None]:
# example
query = "shirts for women"
results = retrieve_candidates(query, final_df, bm25, index_descriptions, index_specifications, embedding_model, bi_encoder, top_n=3)


In [None]:
results

Unnamed: 0,product_name,description,product_specification,image,product_category_tree
1996,Ploomz Women's T-Shirt Bra,Ploomz Women's T-Shirt Bra\n ...,"brand color:light pink, color:pink, pattern:so...","[""http://img5a.flixcart.com/image/bra/s/h/v/md...","[""Clothing,Women's Clothing,Lingerie, Sleep & ..."
9743,Qbee Women's Solid Casual Shirt,Qbee Women's Solid Casual Shirt\n ...,"ideal for:women's, occasion:casual, pattern:so...","[""http://img5a.flixcart.com/image/shirt/f/w/q/...","[""Clothing,Women's Clothing,Western Wear,Shirt..."
10523,Yepme Graphic Print Women's V-neck Orange T-Shirt,Key Features of Yepme Graphic Print Women's V-...,"sleeve:short sleeve, number of contents in sal...","[""http://img6a.flixcart.com/image/t-shirt/b/t/...","[""Clothing,Women's Clothing,Western Wear,Shirt..."


## Visualizing results using Streamlit

In [None]:
%%capture
!pip install streamlit

In [None]:
%%writefile app.py
import streamlit as st
import pandas as pd
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModel
import faiss
import numpy as np
from googletrans import Translator
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from rank_bm25 import BM25Okapi
import json
from sentence_transformers import SentenceTransformer, util


df = pd.read_csv('flipkart_com-ecommerce_sample.csv')
unique_products_df = df.drop_duplicates(subset='product_name')

final_df = unique_products_df[['product_name', 'description', 'product_specifications', 'image','product_category_tree']]
final_df.columns = ['product_name', 'description', 'product_specification', 'image','product_category_tree']
final_df = final_df.dropna()
def convert_category(category_string):
    return ','.join([x.strip() for x in category_string.split('>>')])
# check this function
final_df['product_category_tree'] = final_df['product_category_tree'].apply(convert_category)
import json

def parse_specifications(spec):
    try:
        spec_dict = json.loads(spec.replace("=>", ":"))
        specs = ", ".join([f"{item['key']}:{item['value']}" for item in spec_dict['product_specification'] if 'key' in item and 'value' in item])
        return specs.lower()
    except:
        return ""


final_df['product_specification'] = final_df['product_specification'].apply(parse_specifications)
final_df['description'] = final_df['description'].astype(str)
# Lowercase the product names so they match the lowercased query tokens at search time
tokenized_descriptions = [name.lower().split() for name in final_df['product_name']]
bm25 = BM25Okapi(tokenized_descriptions)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
bi_encoder_model_name = 'sentence-transformers/msmarco-distilbert-base-v4'
bi_encoder = SentenceTransformer(bi_encoder_model_name).to('cuda')

def get_bi_encoder_embeddings(texts, model):
    return model.encode(texts, convert_to_tensor=True)

def get_encoder_embeddings(texts, model):
    return model.encode(texts, convert_to_tensor=True)

categories = final_df['product_category_tree'].tolist()
categories_embeddings = get_encoder_embeddings(categories, embedding_model)
descriptions = final_df['description'].tolist()
description_embeddings = get_encoder_embeddings(descriptions, embedding_model)

specifications = final_df['product_specification'].tolist()
specification_embeddings = get_bi_encoder_embeddings(specifications, bi_encoder)
index_categories = faiss.IndexFlatL2(categories_embeddings.shape[1])
index_categories.add(categories_embeddings.cpu().numpy())
index_descriptions = faiss.IndexFlatL2(description_embeddings.shape[1])
index_descriptions.add(description_embeddings.cpu().numpy())
index_specifications = faiss.IndexFlatL2(specification_embeddings.shape[1])
index_specifications.add(specification_embeddings.cpu().numpy())
translator = Translator()
from langdetect import detect

def translate_query(query):
    if detect(query) == 'en':
        return query
    else:
        translated = translator.translate(query, dest='en')
        return translated.text
def fuzzy_match(query, choices, scorer=fuzz.token_sort_ratio, limit=5):
    results = process.extract(query, choices, scorer=scorer, limit=limit)
    return [result[0] for result in results]
def retrieve_candidates(query, df, bm25, index_descriptions, index_specifications, embedding_model, bi_encoder, top_n=10):
    query = translate_query(query)

    # Perform fuzzy matching for incomplete terms
    descriptions = df['description'].tolist()
    fuzzy_matches = fuzzy_match(query, descriptions, limit=top_n)

    # Perform BM25 search
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_indices = bm25_scores.argsort()[-top_n:][::-1]
    bm25_candidates = df.iloc[bm25_top_indices]['product_name'].tolist()

    query_embedding = get_encoder_embeddings([query], embedding_model)
    bi_query_embedding = get_bi_encoder_embeddings([query], bi_encoder)

    # Perform semantic search on descriptions
    _, description_indices = index_descriptions.search(query_embedding.cpu().numpy(), top_n)
    semantic_candidates_descriptions = df.iloc[description_indices[0]]['description'].tolist()

    # Perform semantic search on product specifications
    distances, specification_indices = index_specifications.search(bi_query_embedding.cpu().numpy(), top_n)
    semantic_candidates_specifications = df.iloc[specification_indices[0]]['product_specification'].tolist()

    # Perform semantic search on category tree
    _, category_indices = index_categories.search(query_embedding.cpu().numpy(), top_n)
    semantic_candidates_categories = df.iloc[category_indices[0]]['product_category_tree'].tolist()

    #Please uncomment if image embeddings are used.
    # #Perform semantic search on images
    # _, image_indices = index_images.search(query_embedding.cpu().numpy(), top_n)
    # semantic_candidates_images = df.iloc[image_indices[0]]['image_embeddings'].tolist()

    # add semantic_candidates_images in the list below
    combined_candidates = list(set(fuzzy_matches+ bm25_candidates + semantic_candidates_descriptions + semantic_candidates_specifications + semantic_candidates_categories )) # +gpt2 candidates + semantic_candidates_images

    # Get embeddings for the combined candidates
    candidate_embeddings = get_encoder_embeddings(combined_candidates, embedding_model)
    query_embedding = get_encoder_embeddings([query], embedding_model)

    # Rank candidates based on cosine similarity
    scores = util.pytorch_cos_sim(query_embedding, candidate_embeddings).flatten()
    ranked_indices = scores.argsort(descending=True)
    ranked_candidates = [combined_candidates[i] for i in ranked_indices]

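    # Keep rows whose product name is among the candidates. Note: isin() returns rows
    # in DataFrame order, so the similarity ranking above does not set the final order.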
    ranked_results = df[df['product_name'].isin(ranked_candidates)].head(top_n)
    # ranked_results = df[df['description'].isin(ranked_candidates)].head(top_n)

    return ranked_results


def display_products(results_df):
    for index, row in results_df.iterrows():
        st.subheader(row['product_name'])
        # The image column holds a JSON list of URLs; parse it instead of using eval()
        st.image(json.loads(row['image'])[0], caption=row['product_name'], use_column_width=True)  # Use the first image
        st.write(row['description'])

st.title('Product Search')

query = st.text_input("Enter your search query:")

results_placeholder = st.empty()

if query:

    result_df = retrieve_candidates(query, final_df, bm25, index_descriptions, index_specifications, embedding_model, bi_encoder, top_n=5)


    results_placeholder.dataframe(result_df)

    display_products(result_df)


Writing app.py


Increase the `top_n` argument to get more recommendations.

In [None]:
! wget -q -O - ipv4.icanhazip.com

34.83.121.249


In [None]:
!streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.83.121.249:8501

Need to install the following packages:
  localtunnel@2.0.2
Ok to proceed? (y) y
your url is: https://tiny-heads-judge.loca.lt
2024-08-09 03:36:56.281380: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-09 03:36:56.333398: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-09 03:36:56.351347: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452

After running the cell above, you will get a URL ending in `.loca.lt`. Open that link; a password prompt will appear. Paste the output of the `! wget -q -O - ipv4.icanhazip.com` cell into the password field, and the Streamlit app will open.