# *🛍️ Fashion Sense AI*

## 📂 Listing Files in a Directory using `os` Module

This code snippet demonstrates how to use Python’s built-in `os` module to interact with the file system. Specifically, it lists the contents of a given directory in the Kaggle notebook environment.

In [1]:
# Importing the os module which provides functions to interact with the operating system
import os

# Using os.listdir to list all files and directories in the specified path
# This will print the contents of the "/kaggle/input/dataset-ecomerce/" directory
print(os.listdir("/kaggle/input/dataset-ecomerce/"))
print(os.listdir("/kaggle/input/hf-token"))

# Two input sources are attached to this notebook: the e-commerce dataset and the Hugging Face token file

['dresses_bd_processed_data.csv', 'jeans_bd_processed_data.csv', 'Images']
['hf_token.json']
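
If an input dataset is not attached, `os.listdir` raises `FileNotFoundError`. A minimal sketch of a more forgiving listing helper is shown here; the name `list_dir` is hypothetical and not part of the original notebook.

```python
import os

def list_dir(path):
    """Return the sorted entries of `path`, or an empty list if it is not a directory."""
    # os.listdir raises FileNotFoundError for a missing path, so check first.
    if not os.path.isdir(path):
        return []
    return sorted(os.listdir(path))

for folder in ("/kaggle/input/dataset-ecomerce/", "/kaggle/input/hf-token"):
    print(folder, list_dir(folder))
```

Outside Kaggle both paths are usually absent, so the loop simply prints empty lists instead of failing.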


## 📦 Installing Essential Libraries for ML, NLP, and App Development

In this step, we are preparing the Python environment by installing all the necessary libraries required for:

### 🔢 Data Processing & Analysis
- **`pandas`, `numpy`**: To handle structured data and perform numerical computations efficiently.

### 📊 Data Visualization
- **`matplotlib`, `seaborn`**: For creating insightful visualizations and exploratory data analysis.

### 🤖 Machine Learning & Preprocessing
- **`scikit-learn`**: For traditional ML models, feature scaling, model evaluation, and utilities.
- **`tqdm`, `requests`**: To enhance user experience with progress bars and handle API requests.

### 🧠 Transformers & Model Acceleration
- **`transformers`, `accelerate`**: From Hugging Face, to load and run large language models efficiently.
- **`bitsandbytes`, `flash-attn`**: To support low-bit quantization and fast attention for efficient model inference. Both are commented out in the install command, so they are optional and not installed in this run.

### ✍️ Text Processing & Embeddings
- **`ftfy`, `regex`**: To clean and normalize messy Unicode text.
- **`sentence-transformers`**: For generating sentence embeddings used in similarity search or semantic understanding.

### 🔍 Vector Search
- **`faiss-gpu-cu12`**: Facebook’s library for efficient similarity search over vector embeddings using GPU (CUDA 12).

### 🧠 LLM Integration & LangChain
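- **`langchain`, `langchain-community`, `openai`**: To build applications that integrate with language models like GPT using LangChain. Note that `openai` is not part of the install command in this notebook and has to be installed separately if it is needed.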

### 🌐 App Interface
- **`streamlit`**: To create and run a web-based interactive application with Python scripts.

### 🔥 PyTorch Ecosystem
- **`torch`, `torchvision`, `torchaudio`**: For deep learning, and to work with image and audio data in PyTorch-based projects.

> ✅ These installations set up your environment for an end-to-end ML/NLP pipeline, including data handling, model deployment, LLM integration, and app development using Streamlit.
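
After installing, it can help to confirm which versions actually ended up in the environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` and the choice of packages to check are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

for name in ("pandas", "transformers", "sentence-transformers", "faiss-gpu-cu12", "streamlit"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```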


In [2]:
!pip install pandas numpy matplotlib seaborn scikit-learn requests tqdm transformers accelerate ftfy regex sentence-transformers faiss-gpu-cu12 langchain langchain-community streamlit # bitsandbytes # flash-attn
!pip install torch torchvision torchaudio

Collecting ftfy
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting faiss-gpu-cu12
  Downloading faiss_gpu_cu12-1.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.26-py3-none-any.whl.metadata (2.9 kB)
Collecting streamlit
  Downloading streamlit-1.46.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-core<1.0.0,>=0.3.49 (from langchain)
  Downloading langchain_core-0.3.66-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain
  Downloading langchain-0.3.26-py3-none-any.whl.metadata (7.8 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.0-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Downloading langchain_text_splitters

## 📄 Loading Fashion Product Data with pandas

In this step, we are importing structured product data related to dresses and jeans from CSV files. These datasets are stored in the Kaggle environment and will be used for downstream tasks like visual search, recommendation, or metadata-based filtering.

### 🔍 Breakdown:

- **`import pandas as pd`**: Imports the `pandas` library, which is essential for working with tabular data (e.g., CSV files).
- **`import numpy as np`**: Imports NumPy for numerical operations, often used later for feature engineering or matrix computations.

### 📥 Reading Product Datasets:

- **`dress = pd.read_csv(...)`**: Loads the **dresses** dataset into a DataFrame. This file likely contains features such as product IDs, categories, colors, brands, prices, and possibly image paths or descriptions.
- **`jeans = pd.read_csv(...)`**: Loads the **jeans** dataset in a similar fashion.

### 👁️ Displaying the Data:

- **`display(dress)` and `display(jeans)`**: Render each DataFrame as a table within a Jupyter or Kaggle notebook. For large DataFrames, pandas truncates the view to the first and last rows, which is enough to confirm that the data has been loaded correctly and to get a quick overview of the available columns and values.

> ✅ These datasets form the foundation for performing visual similarity search, outfit recommendation, and trend analysis in the fashion domain.

In [3]:
# Importing the pandas library as 'pd' for data manipulation and analysis
import pandas as pd

# Importing the numpy library as 'np' for numerical operations
import numpy as np

# Reading the processed dress dataset CSV file into a pandas DataFrame named 'dress'
dress = pd.read_csv(r'/kaggle/input/dataset-ecomerce/dresses_bd_processed_data.csv')

# Reading the processed jeans dataset CSV file into a pandas DataFrame named 'jeans'
jeans = pd.read_csv(r'/kaggle/input/dataset-ecomerce/jeans_bd_processed_data.csv')

# Displaying the contents of the 'dress' DataFrame in a readable table format (Kaggle/Notebook compatible)
display(dress)

# Displaying the contents of the 'jeans' DataFrame similarly
display(jeans)

Unnamed: 0,selling_price,discount,category_id,meta_info,product_id,pdp_url,sku,brand,department_id,last_seen_date,launch_on,mrp,product_name,feature_image_s3,channel_id,feature_list,description,style_attributes,pdp_images_s3
0,{'INR': 474848.9539},0.0,30,Slim fit. Designed to hit at the ankle. UK siz...,b613d7b5dfe86f3e695d931d31fd729fdf44e181f14079...,https://www.ralphlauren.co.uk/en/kristian-silk...,479495,RALPH LAUREN,2,2025-05-01,2020-02-29,{'INR': 474848.9539},Kristian Silk Tuxedo Dress,https://gallery.stylumia.com/originals/2020/02...,14,"['Slim fit. Designed to hit at the ankle.', 'U...",The Kristian evening dress is informed by the ...,{},['https://gallery.stylumia.com/originals/2020/...
1,{'INR': 464648.6919},0.0,30,Slim fit. Designed to hit at the ankle. UK siz...,482b10a23f8d00cfc7c9bbeeac4e26d25dd303d8e62e97...,https://www.ralphlauren.co.uk/en/kristian-silk...,502670,RALPH LAUREN,2,2025-05-08,2020-02-29,{'INR': 464648.6919},Kristian Silk Tuxedo Dress,https://gallery.stylumia.com/originals/2020/02...,14,"['Slim fit. Designed to hit at the ankle.', 'U...",The Kristian evening dress is informed by the ...,{},['https://gallery.stylumia.com/originals/2020/...
2,{'INR': 29496.0812},0.0,30,Fit-and-flare silhouette. Intended to hit at t...,3508b052ef7a5eea820423b97713612bc92a3f2301a3d3...,https://www.ralphlauren.co.uk/en/fit-and-flare...,478766,RALPH LAUREN,2,2025-05-08,2020-02-29,{'INR': 29496.0812},Fit-and-Flare Shirtdress,https://gallery.stylumia.com/originals/2020/02...,14,['Fit-and-flare silhouette. Intended to hit at...,Airy georgette and a flattering fit-and-flare ...,{},['https://gallery.stylumia.com/originals/2020/...
3,{'INR': 17156.9392},0.0,30,"Fits true to size; take your normal size, Stra...",6360245240b68885bd4dbcef8d8856c0fb13f1314769f5...,https://www.anthropologie.com/shop/adena-crepe...,50297209_041,BHLDN,2,2025-01-31,2020-05-27,{'INR': 17156.9392},Adena Crepe Dress,https://gallery.stylumia.com/originals/2020/05...,48,"['Back zip', 'Polyester; polyester lining', 'P...","A sleek square neckline tops this stretchy, bo...","{'modelNotes': '', 'dimensions': 'Fits true to...",['https://gallery.stylumia.com/originals/2020/...
4,{'INR': 26079.5467},0.0,30,Fit-and-flare silhouette Designed to hit at th...,5d07037957e64d1e218499cb7d7a8e5e57aa59249bb806...,https://www.ralphlauren.co.uk/en/belted-cotton...,478750,RALPH LAUREN,2,2025-05-08,2021-02-12,{'INR': 26079.5467},Belted Cotton-Blend Shirtdress,https://gallery.stylumia.com/originals/2021/02...,14,['Fit-and-flare silhouette Designed to hit at ...,This iteration of Lauren's iconic shirtdress i...,{},['https://gallery.stylumia.com/originals/2021/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14604,{'INR': 34051.4605},0.0,30,"Main 74% TENCEL™ lyocell, 26% Nylon. Lining 10...",d71430db83e8a11f846346898ea7f8dd96417a8c0e329e...,https://www.next.co.uk/style/su724260/h28422#h...,H28422,All Saints,2,2025-05-08,2025-04-29,{'INR': 34051.4605},AllSaints Cream Sienna Dress,https://gallery.stylumia.com/originals/2025/04...,240,[],Slim fit. V-Neck. Shoulder straps. Maxi length.,"{'Composition': 'Main 74% TENCEL™ lyocell, 26%...",['https://gallery.stylumia.com/originals/2025/...
14605,{'INR': 20385.3225},0.0,30,100% Recycled polyester. Composition Machine w...,007f57fb3e8400242e620ce37becada006306e68df606e...,https://www.next.co.uk/style/su700558/h14377#h...,H14377,All Saints,2,2025-05-08,2025-05-01,{'INR': 20385.3225},AllSaints White Rosie Dress,https://gallery.stylumia.com/originals/2025/05...,240,"['Regular', 'Cowl neck', 'Short sleeve', 'Midi...",,"{'Composition': '100% Recycled polyester.', 'W...",['https://gallery.stylumia.com/originals/2025/...
14606,{'INR': 15829.9432},0.0,30,"73% LENZING™ ECOVERO™ Viscose, 27% Nylon. Comp...",70f3e724958742d76ca2dff07b563bdfc43eca78d88b5c...,https://www.next.co.uk/style/su747355/h45925#h...,H45925,All Saints,2,2025-05-08,2025-05-01,{'INR': 15829.9432},AllSaints Brown Arwen Dress,https://gallery.stylumia.com/originals/2025/05...,240,"['Round neck', 'Sleeveless', 'Maxi length']",,{'Composition': '73% LENZING™ ECOVERO™ Viscose...,['https://gallery.stylumia.com/originals/2025/...
14607,{'INR': 20141.6973},0.0,30,"63% Organic cotton, 37% TENCEL™ lyocell. Compo...",b5598a5048205634fc7465aae62dc24fefd9fc61ab1b1d...,https://www.next.co.uk/style/su747618/w52702#w...,W52702,All Saints,2,2025-05-01,2025-05-01,{'INR': 20141.6973},AllSaints Brown Faye Shacket Dress,https://gallery.stylumia.com/originals/2025/05...,240,[],Collar. Long sleeve. Short length.,"{'Composition': '63% Organic cotton, 37% TENCE...",['https://gallery.stylumia.com/originals/2025/...


Unnamed: 0,selling_price,discount,category_id,meta_info,product_id,pdp_url,sku,brand,department_id,last_seen_date,launch_on,mrp,product_name,feature_image_s3,channel_id,feature_list,description,style_attributes,pdp_images_s3
0,{'USD': 285.9978},0.000000,56,Skinny Fit: Mid-rise. Sits at the hip. Skinny ...,f4d992cf595405c44737ad1ff406360c9e7af5dc521020...,https://www.ralphlauren.co.uk/en/skinny-stretc...,398475,RALPH LAUREN,2,2025-05-15,2020-02-29,{'USD': 285.9978},Skinny Stretch Jeans,https://gallery.stylumia.com/originals/2020/02...,14,['Skinny Fit: Mid-rise. Sits at the hip. Skinn...,Skinny-fitting jeans made from 11.25 oz Japane...,{},['https://gallery.stylumia.com/originals/2020/...
1,{'USD': 285.9978},0.000000,56,Skinny Fit: mid-rise. Sits at the hip. Skinny ...,4c50e2a967da813d5b55452883e47ea7635e6f1972f1c9...,https://www.ralphlauren.co.uk/en/stretch-skinn...,398474,RALPH LAUREN,2,2025-05-15,2020-02-29,{'USD': 285.9978},Stretch Skinny Jeans,https://gallery.stylumia.com/originals/2020/02...,14,['Skinny Fit: mid-rise. Sits at the hip. Skinn...,Skinny-fitting jeans made from 11.25 oz Japane...,{},['https://gallery.stylumia.com/originals/2020/...
2,{'USD': 392.4156},0.000000,56,Boy Fit: mid-rise. Sits at the hip. Relaxed th...,2e37dd2d1b28c4df27175c39bb01f13320be5267cd7e24...,https://www.ralphlauren.co.uk/en/boy-fit-strai...,537869,RALPH LAUREN,2,2025-05-15,2020-08-13,{'USD': 392.4156},Boy Fit Straight Jeans,https://gallery.stylumia.com/originals/2020/08...,14,['Boy Fit: mid-rise. Sits at the hip. Relaxed ...,Straight-fitting jeans made from 320 g Japanes...,{},['https://gallery.stylumia.com/originals/2020/...
3,{'USD': 170.4379},40.174672,56,The Jenn Flare: Our flare sits at your true wa...,264387126c9841fe4bd9607926c3e6155bbefd53badbfc...,https://www.ralphlauren.co.uk/en/jenn-flare-je...,563799,RALPH LAUREN,2,2025-01-28,2021-01-17,{'USD': 284.8925},Jenn Flare Jean,https://gallery.stylumia.com/originals/2021/01...,14,['The Jenn Flare: Our flare sits at your true ...,"Cut for a flared, wide-leg silhouette, our Jen...",{},['https://gallery.stylumia.com/originals/2021/...
4,{'USD': 184.6675},0.000000,56,High-Rise Skinny Ankle: sits above the natural...,06ebbee1d18f36444c87fd746cd7ce5ff24b943a89f0f6...,https://www.ralphlauren.co.uk/en/high-rise-ski...,561065,RALPH LAUREN,2,2025-05-01,2021-02-16,{'USD': 184.6675},High-Rise Skinny Ankle Jean,https://gallery.stylumia.com/originals/2021/02...,14,['High-Rise Skinny Ankle: sits above the natur...,"Part of our superstretch collection, these hig...",{},['https://gallery.stylumia.com/originals/2021/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2869,{'USD': 420.0},0.000000,56,"Main material: 100% cotton, Secondary material...",f9fbee3a68a110f65c2dcaa01d00ed3fc787d3b367c519...,https://us.sandro-paris.com/en/p/wide-leg-stri...,d15edac77f1eadddd847303ce2ce10cf,Sandro Paris,2,2025-05-09,2025-04-30,{'USD': 420.0},Wide-leg striped jeans,https://gallery.stylumia.com/originals/2025/04...,383,[],SANDRO pays tribute to the artistic world of L...,"{'Composition': 'Main material: 100% cotton, S...",['https://gallery.stylumia.com/originals/2025/...
2870,{'USD': 92.8447},0.000000,56,Exterior: 100% cotton Composition MACHINE WASH...,a99c8e83ff86ce01228b54b0c8640d99adfca0a7755550...,https://www.massimodutti.com/gb/midrise-straig...,48640767/407,Massimo Dutti,2,2025-05-02,2025-04-30,{'USD': 92.8447},Mid-rise straight-leg jeans,https://gallery.stylumia.com/originals/2025/04...,7,['Straight-leg Jeans: straight silhouette from...,,"{'Composition': 'Exterior: 100% cotton', 'care...",['https://gallery.stylumia.com/originals/2025/...
2871,{'USD': 79.7574},0.000000,56,"Exterior: 50% cotton, Exterior: 50% lyocell Co...",fcf79b834d61cd5ce9430a807ef218a2b9a8b5a906f309...,https://www.massimodutti.com/gb/highwaist-wide...,49228723/800,Massimo Dutti,2,2025-05-09,2025-04-30,{'USD': 79.7574},High-waist wide-leg jeans with seam details,https://gallery.stylumia.com/originals/2025/04...,7,['Wide-leg trousers: this cut features a wide ...,,"{'Composition': 'Exterior: 50% cotton, Exterio...",['https://gallery.stylumia.com/originals/2025/...
2872,{'USD': 185.9961},0.000000,56,"Petite Fit 100% Cotton., Machine washable. Car...",05bb642a205ba1a24261cc662557cc9cdd1fc732d2664d...,https://www.reiss.com/style/su699374/h14421#h1...,H14421,Belinda,2,2025-05-01,2025-05-01,{'USD': 185.9961},Petite Straight-Leg Turn-Up Jeans in Mid Blue,https://gallery.stylumia.com/originals/2025/05...,299,[],The Belinda jeans offer a flattering straight-...,"{'Fit': 'Petite', 'Care & Fabric': '100% Cotto...",['https://gallery.stylumia.com/originals/2025/...
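
Both tables start with an `Unnamed: 0` column, which is what pandas produces when a CSV was saved with its row index and then read back without `index_col`. A small sketch of the difference, using a made-up in-memory CSV rather than the real files:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real files (made-up values).
csv_text = ",product_id,brand\n0,abc,RALPH LAUREN\n1,def,BHLDN\n"

default = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())  # ['Unnamed: 0', 'product_id', 'brand']
print(indexed.columns.tolist())  # ['product_id', 'brand']
```

In this notebook the column is dropped later when only the relevant fields are kept, so passing `index_col=0` is a cosmetic choice.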


## ✅ Verifying Column Consistency Between Datasets

After loading the `dress` and `jeans` datasets, it's important to ensure they share the same schema — i.e., identical column names and order. This check is crucial when you plan to:

- Concatenate the datasets into a single product catalog
- Apply uniform processing or modeling pipelines
- Perform similarity search or clustering across categories

### 🔍 Code Purpose:
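```python
dress.columns.to_list() == jeans.columns.to_list()
```

The comparison returns `True` only when both lists contain the same column names in the same order.
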
In [4]:
dress.columns.to_list() == jeans.columns.to_list()

True
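
When the check returns `False`, it is useful to see which columns differ. A minimal sketch, assuming a hypothetical helper `column_diff` and toy DataFrames in place of `dress` and `jeans`:

```python
import pandas as pd

def column_diff(left, right):
    """Return the columns found only in `left` and only in `right`, preserving order."""
    only_left = [c for c in left.columns if c not in right.columns]
    only_right = [c for c in right.columns if c not in left.columns]
    return only_left, only_right

a = pd.DataFrame(columns=["product_id", "brand", "mrp"])
b = pd.DataFrame(columns=["product_id", "mrp", "discount"])
print(column_diff(a, b))  # (['brand'], ['discount'])
```

Two empty lists with a failing equality check would indicate that only the column order differs.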

## 🧩 Merging Datasets: Dresses and Jeans into One DataFrame

Once we've confirmed that both `dress` and `jeans` datasets share the same structure, we can safely combine them into a single unified DataFrame for further processing.

### 🔄 Code Breakdown:
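```python
df = pd.concat([dress, jeans], ignore_index=True)
```

`pd.concat` stacks the two DataFrames row-wise, and `ignore_index=True` discards the original row labels in favor of a fresh `0..n-1` index.
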
In [5]:
df = pd.concat([dress, jeans], ignore_index=True)
# df = df.iloc[0:10, :]
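
The effect of `ignore_index=True` is easiest to see on a toy example. The two one-row DataFrames below are made-up stand-ins for `dress` and `jeans`:

```python
import pandas as pd

dress_demo = pd.DataFrame({"product_name": ["Adena Crepe Dress"], "category_id": [30]})
jeans_demo = pd.DataFrame({"product_name": ["Jenn Flare Jean"], "category_id": [56]})

kept = pd.concat([dress_demo, jeans_demo])
reset = pd.concat([dress_demo, jeans_demo], ignore_index=True)

print(kept.index.tolist())   # [0, 0]  (duplicate labels)
print(reset.index.tolist())  # [0, 1]  (fresh index)
```

Duplicate index labels make label-based lookups ambiguous, which is why the merged catalog is re-indexed.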

## 🧼 Cleaning Price Data and Extracting Relevant Fields

This step focuses on transforming and filtering the raw dataset to retain only the most essential information needed for modeling, recommendations, or visualization.

---

### 💰 Step 1: Extracting Prices from Stringified Dictionaries

Some price fields (`selling_price` and `mrp`) are stored as stringified dictionaries, e.g., `"{'INR': 1299}"`. The goal is to extract the INR value, falling back to USD when INR is absent. In the rows displayed earlier, dresses are priced in INR and jeans in USD, so the extracted columns can mix two currencies unless the values are converted.

```python
df["selling_price"] = df["selling_price"].apply(lambda x: eval(x).get("INR") or eval(x).get("USD"))
df["mrp"] = df["mrp"].apply(lambda x: eval(x).get("INR") or eval(x).get("USD"))
```

In [6]:
# Clean and extract required fields

# Extracting the "INR" value from the 'selling_price' dictionary; if not present, fallback to "USD"
# 'eval' is used to convert the string representation of a dictionary into an actual Python dictionary
df["selling_price"] = df["selling_price"].apply(lambda x: eval(x).get("INR") or eval(x).get("USD"))

# Same operation for 'mrp' column to get the actual numerical price in INR or fallback to USD
df["mrp"] = df["mrp"].apply(lambda x: eval(x).get("INR") or eval(x).get("USD"))

# Selecting and keeping only the relevant columns for further processing
# These include identifiers, image link, product info, pricing, category, and metadata
df = df[[
    "product_id", "feature_image_s3", "product_name", "brand",
    "description", "category_id", "style_attributes", "mrp",
    "selling_price", "meta_info"
]]
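
`eval` executes whatever Python expression the string contains, and the lambda above parses each value twice. A safer sketch uses `ast.literal_eval`, which only accepts literals. The helper name `extract_price` and its `currencies` parameter are assumptions for illustration, not the notebook's own code.

```python
import ast

def extract_price(raw, currencies=("INR", "USD")):
    """Parse a stringified price dict and return the first available currency value."""
    # ast.literal_eval only accepts Python literals, so it cannot run arbitrary code.
    prices = ast.literal_eval(raw)
    for currency in currencies:
        # Compare against None so a legitimate price of 0 is not skipped.
        if prices.get(currency) is not None:
            return prices[currency]
    return None

print(extract_price("{'INR': 474848.9539}"))  # 474848.9539
print(extract_price("{'USD': 285.9978}"))     # 285.9978
```

It could be applied with `df["selling_price"].apply(extract_price)`. Unlike the `or` fallback, an INR value of `0` is returned as-is rather than being replaced by the USD value.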

## 🖼️ Downloading Product Images from URLs

This optional block of code is used to download product images from their URLs and save them locally for tasks like:

- Running visual search or similarity matching
- Building a local dataset for training image-based models
- Creating an offline demo or image inspection interface

---

### 🔄 Workflow Overview:

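```python
# import os, requests
# from PIL import Image
# from io import BytesIO
```

The loop creates an `Images` directory, requests each `feature_image_s3` URL with a 10-second timeout, converts the response to RGB, and saves it as `Images/<product_id>.jpg`. Failures are printed and skipped.
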
In [7]:
# # Image download
# import os, requests
# from PIL import Image
# from io import BytesIO

# os.makedirs("Images", exist_ok=True)
# for _, row in df.iterrows():
#     try:
#         response = requests.get(row["feature_image_s3"], timeout=10)
#         image = Image.open(BytesIO(response.content)).convert("RGB")
#         image.save(f"Images/{row['product_id']}.jpg")
#     except Exception as e:
#         print(f"Failed for {row['product_id']}: {e}")
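
Re-running a download loop over thousands of products is slow, so skipping files that already exist is a common refinement. The sketch below is an assumption-laden variant: it uses the standard library's `urllib` instead of `requests`, writes the raw bytes without the PIL RGB conversion, and the helper name `download_file` is hypothetical.

```python
import os
import urllib.request

def download_file(url, dest_path, timeout=10):
    """Download `url` to `dest_path` unless the file already exists.

    Returns True if a download happened, False if the file was already present.
    """
    if os.path.exists(dest_path):
        return False  # already downloaded, skip the request
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    with open(dest_path, "wb") as f:
        f.write(data)
    return True
```

Inside the loop this would be called as `download_file(row["feature_image_s3"], f"Images/{row['product_id']}.jpg")`, still wrapped in `try`/`except` so that one bad URL does not stop the run.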

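## 🔐 Hugging Face Authentication using a Token File

This block loads a Hugging Face token from a JSON file in the attached `hf-token` input dataset (rather than the Kaggle Secrets add-on), exports it as the `HF_TOKEN` environment variable, and logs into the Hugging Face Hub, enabling access to private or gated models and datasets.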

---

### 🔍 Code Breakdown:

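```python
import os
import json
from huggingface_hub import login
```

`json` parses the token file, `os` sets the environment variable, and `login` authenticates the session.
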
In [8]:
# Importing required libraries
import os  # Provides functions to interact with the operating system
import json  # Used to parse JSON files
from huggingface_hub import login  # Used to authenticate with Hugging Face Hub

# Opening and reading the Hugging Face token from a JSON file located in the input directory
with open("/kaggle/input/hf-token/hf_token.json") as f:
    secrets = json.load(f)  # Loading the JSON content into a Python dictionary named 'secrets'

# Setting the HF_TOKEN environment variable with the token from the secrets dictionary
os.environ["HF_TOKEN"] = secrets["HF_TOKEN"]

# Logging into Hugging Face Hub using the loaded token
login(os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
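
If the JSON file lacks the expected key, `secrets["HF_TOKEN"]` fails with a bare `KeyError`. A minimal sketch of a loader with a clearer error message follows; the helper name `load_token` and its `key` parameter are assumptions.

```python
import json

def load_token(path, key="HF_TOKEN"):
    """Read `key` from a JSON file and return it, raising a clear error if it is missing or empty."""
    with open(path) as f:
        secrets = json.load(f)
    token = secrets.get(key)
    if not token:
        raise KeyError(f"'{key}' not found or empty in {path}")
    return token
```

The result can be passed straight to `login(...)`, as in the cell above. The token itself should never be printed or committed to a public notebook.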


## 🧠 CLIP-Based Image Embedding Pipeline for Fashion Products

This code cell performs the following steps to compute visual embeddings for each product image using OpenAI's CLIP model:

---

### Load CLIP Model and Processor

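We load the `clip-vit-base-patch32` checkpoint through `AutoProcessor` and `AutoModelForZeroShotImageClassification`. The larger `clip-vit-large-patch14` variant is left commented out in the code.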

- The model is moved to the selected device and set to evaluation mode.
- The processor normalizes and resizes images as required by the model.

---

### Define `load_and_embed` Function

This function:

- Builds the image path from the `product_id`
- Loads and preprocesses the image
- Uses the CLIP model to extract a 512-dimensional image feature vector
- Applies L2 normalization to enable cosine similarity comparisons
- Returns a `(product_id, embedding)` pair, or `(None, None)` if the image file is missing or fails to load (the error is printed)

---

### Create List of Product IDs

We extract the list of `product_id`s from the `df` DataFrame for which embeddings need to be generated.

---

### Multithreaded Embedding Generation

- We use `ThreadPoolExecutor` to parallelize image loading and embedding (max 3 threads).
- For each valid image, we store its normalized embedding in the `image_embeddings` dictionary using `product_id` as the key.

---

### Final Status Output

After processing, we print how many product embeddings were successfully generated.

✅ This setup creates a dictionary of normalized image embeddings that can be used for visual similarity search, clustering, or hybrid recommendations.
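
The reason for L2 normalization is that the dot product of two unit-length vectors equals their cosine similarity. The sketch below checks this with NumPy on random 512-dimensional stand-ins for CLIP features; the helper `l2_normalize` is illustrative and mirrors what `torch.nn.functional.normalize(..., p=2, dim=-1)` does in the cell.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each vector (along the last axis) to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    # Clip the norm to avoid division by zero for all-zero vectors.
    return vectors / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 512))  # stand-ins for two raw 512-dim image features

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = l2_normalize(a) @ l2_normalize(b)

print(np.isclose(cosine, dot_of_normalized))  # True
```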


In [10]:
# Importing required libraries
import torch  # PyTorch for tensor computations and model inference
from transformers import CLIPProcessor, CLIPModel  # Hugging Face Transformers for CLIP model and preprocessing
from PIL import Image  # Python Imaging Library to handle image loading
import numpy as np  # For numerical operations and array manipulation
import os  # OS operations for file checking
import concurrent.futures  # For multithreaded execution to speed up embedding
from tqdm.notebook import tqdm  # For progress bar in Jupyter/Kaggle notebooks

# Setup device
device = "cuda" if torch.cuda.is_available() else "cpu"  # Use GPU if available, else fallback to CPU

# # Load model and processor
# # Loading the pretrained CLIP model (ViT-Large Patch14 variant) and moving it to the selected device
# clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

# # Initializing the corresponding processor for CLIP to handle image preprocessing
# clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14", use_fast=True)

from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32", use_fast=False)
clip_model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()

# Function to load and embed a single image
def load_and_embed(product_id):
    # Construct image file path using product_id
    image_path = f"/kaggle/input/dataset-ecomerce/Images/Images/{product_id}.jpg"
    
    # If image file doesn't exist, return None
    if not os.path.exists(image_path):
        return None, None
    try:
        # Load and convert the image to RGB format
        image = Image.open(image_path).convert("RGB")
        
        # Preprocess the image using CLIP processor
        inputs = clip_processor(images=image, return_tensors="pt").to(device)
        
        # Disable gradient calculation and extract image features
        with torch.no_grad():
            features = clip_model.get_image_features(**inputs)
            
            # Normalize the embedding vector and convert to NumPy
            embedding = torch.nn.functional.normalize(features, p=2, dim=-1).cpu().numpy()[0]
        
        # Return product ID and its corresponding embedding
        return product_id, embedding
    except Exception as e:
        # Handle exceptions and print error message if any image fails to process
        print(f"❌ Error processing {product_id}: {e}")
        return None, None

# List of product IDs extracted from the DataFrame
product_ids = df["product_id"].tolist()

# Dictionary to store image embeddings for each product ID
image_embeddings = {}

# Multithreaded processing (loading + inference)
# Using ThreadPoolExecutor to parallelize the image loading and embedding process
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(load_and_embed, product_ids)
    
    # Iterate through the results with a progress bar
    for pid, emb in tqdm(results, total=len(product_ids)):
        # Store the embedding if it's not None
        if pid and emb is not None:
            image_embeddings[pid] = emb

# Print completion status with number of successfully embedded products
print(f"✅ Embedding completed for {len(image_embeddings)} products.")

  0%|          | 0/17483 [00:00<?, ?it/s]

✅ Embedding completed for 17470 products.


In [11]:
len(list(image_embeddings.values())[0])

512
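
The run embedded 17470 products out of 17483 rows. Given how the cell is written, the gap can come from missing image files, images that failed to load, or duplicate `product_id` values (which collapse into a single dictionary key). A small sketch for listing the IDs without an embedding, using a hypothetical helper `find_missing` and toy data:

```python
def find_missing(product_ids, embeddings):
    """Return the unique product IDs that have no entry in `embeddings`, in first-seen order."""
    seen = set()
    missing = []
    for pid in product_ids:
        if pid not in embeddings and pid not in seen:
            seen.add(pid)
            missing.append(pid)
    return missing

demo_ids = ["a", "b", "b", "c", "d"]
demo_embeddings = {"a": [0.1], "c": [0.3]}
print(find_missing(demo_ids, demo_embeddings))  # ['b', 'd']
```

In the notebook this would be called as `find_missing(product_ids, image_embeddings)`.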

## 📝 Generating Text Embeddings with SentenceTransformer

This code generates text-based semantic embeddings for each fashion product using the `all-MiniLM-L6-v2` model from the SentenceTransformers library. These embeddings capture the textual meaning of product details like name, description, and attributes, and can later be combined with image embeddings for multimodal search or recommendations.

---

### 🔍 Step-by-Step Explanation:

1. **Load Pretrained Model**
   - We load the `all-MiniLM-L6-v2` model using `SentenceTransformer`, which is optimized for speed and semantic similarity tasks.

2. **Initialize Storage**
   - An empty dictionary `text_embeddings` is used to store product ID → text embedding mappings.

3. **Iterate Over DataFrame**
   - For each row in the DataFrame `df`, we:
     - Concatenate relevant text fields: `product_name`, `description`, `meta_info`, and `style_attributes` to form a descriptive input text.
     - Print the first concatenated text example (for sanity check).
     - Generate a 384-dimensional embedding using `text_model.encode(...)`.
     - Store the embedding using `product_id` as the key in `text_embeddings`.

4. **tqdm Progress Bar**
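   - `tqdm` wraps the loop for real-time feedback on encoding progress. Because `df.iterrows()` is a generator without a length, the bar shows only an iteration count; passing `total=len(df)` would add a percentage.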

---

> ✅ These semantic embeddings can be used for:
> - Text-based product search
> - Matching similar items using text descriptions
> - Fusing with image embeddings for hybrid retrieval systems
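
The f-string in the cell interpolates every field as-is, so an empty `description` (NaN in pandas) appears as the literal text `nan`, and an empty `style_attributes` adds `{}` (visible at the end of the printed example). A sketch of a cleaner text builder, with the hypothetical helper `build_text` and a made-up row:

```python
import math

def build_text(row, fields=("product_name", "description", "meta_info", "style_attributes")):
    """Join the non-empty text fields of `row` (dict-like) with single spaces."""
    parts = []
    for field in fields:
        value = row.get(field)
        # Skip missing values: None and float NaN (how pandas marks empty cells).
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        # Skip empty strings and empty stringified containers.
        if text and text not in ("{}", "[]"):
            parts.append(text)
    return " ".join(parts)

row = {
    "product_name": "AllSaints White Rosie Dress",
    "description": float("nan"),
    "meta_info": "100% Recycled polyester.",
    "style_attributes": "{}",
}
print(build_text(row))  # AllSaints White Rosie Dress 100% Recycled polyester.
```

A pandas `Series` also supports `.get`, so the same function works on rows produced by `df.iterrows()`.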

In [13]:
# Importing SentenceTransformer for generating text embeddings
from sentence_transformers import SentenceTransformer

# tqdm for displaying a progress bar in notebooks
from tqdm.notebook import tqdm

# Loading the 'all-MiniLM-L6-v2' model, a lightweight sentence embedding model from Sentence-Transformers
text_model = SentenceTransformer('all-MiniLM-L6-v2')

# Dictionary to store the resulting text embeddings for each product
text_embeddings = {}

# Counter to print the first example text for inspection
i = 0

# Iterating through each row in the DataFrame
for _, row in tqdm(df.iterrows()):
    # Concatenating product-related text fields to form a combined input sentence
    text = f"{row['product_name']} {row['description']} {row['meta_info']} {row['style_attributes']}"
    
    # Print the first text input to verify formatting
    if i == 0:
        print(text)
    
    i += 1  # Increment the counter

    # Generating text embedding using the sentence-transformer model
    embedding = text_model.encode(text, show_progress_bar=False)
    
    # Storing the embedding in the dictionary using product_id as the key
    text_embeddings[row["product_id"]] = embedding


0it [00:00, ?it/s]

Kristian Silk Tuxedo Dress The Kristian evening dress is informed by the tuxedo-inspired gown Mr Lauren custom-designed for Rosie Huntington-Whiteley for our 50th anniversary runway show in Central Park. Transforming a timeless pillar of menswear into the ultimate in feminine elegance, this American-made dress is realised in silk cady and distinguished by formal suiting details, such as silk-satin-covered peak lapels and buttons. Slim fit. Designed to hit at the ankle. UK size 12 has a 153 cm body length and an 84 cm sleeve length. Body length and sleeve length are taken from the centre back of the neck and change 0.5 cm between sizes. Peak lapels. Double-breasted silhouette. Silk-covered buttons. Long sleeves. Left chest welt pocket. Two front waist welt pockets. Full stretch silk lining. Shell: 100% silk. Lining: 91% silk, 9% elastane. Dry clean. Made in USA. Imported materials. Model is 1.78 m and wears a UK size 8. {}


In [14]:
print(list(text_embeddings.keys())[1:10])

['482b10a23f8d00cfc7c9bbeeac4e26d25dd303d8e62e97ba5ba74653f80ca72e', '3508b052ef7a5eea820423b97713612bc92a3f2301a3d342f44b5cec1fe013ef', '6360245240b68885bd4dbcef8d8856c0fb13f1314769f5273904a6eac26fb452', '5d07037957e64d1e218499cb7d7a8e5e57aa59249bb8065283612dbc23260621', '1ccbcbec76de6407bc85e62549e5272d9ecd9bb770ce3bf00b1863f07425b9ed', '5972c0b7f32ec835378a787150ee7d250ddc71cab21102d9e8cf2f40619a2cea', 'be8ce31a6c52deb536c50c20c2b3c623b80aef9ae6053fc0f73ed617aea49c7c', '9ecc179568163f14f1a834185be1c9b0c0400f06deb2c8e298684dd21a63a0d4', 'c0e3f743b45a208bd2c45f80874a41d3030c48e57712238510a5e3898f700202']


## 🔗 Combining Image and Text Embeddings into Unified Vectors

This step merges the previously generated visual and textual embeddings for each product into a single high-dimensional representation. This unified embedding captures both **visual appearance** and **semantic meaning**, enabling more accurate product similarity and recommendation.

---

### 🔍 Step-by-Step Explanation:

1. **Initialize `combined_embeddings` Dictionary**  
   A new dictionary to store the final multimodal embedding for each product, indexed by `product_id`.

2. **Iterate Over Product IDs**
   We loop through all `product_id`s in the DataFrame and:
   - Retrieve the corresponding **image embedding** from `image_embeddings`
   - Retrieve the corresponding **text embedding** from `text_embeddings`

3. **Concatenate Image and Text Embeddings**
   - If both embeddings are available for the product, we use `np.concatenate([...])` to combine them into a single vector.
   - The resulting vector will be of size `512 + 384 = 896` dimensions (CLIP base image + MiniLM text).

4. **Store in Dictionary**
   - Each combined embedding is stored in `combined_embeddings` with its `product_id` as the key.

---

> ✅ These fused embeddings provide a rich, multimodal representation that can power hybrid search engines, personalized recommendations, or clustering algorithms.
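
The concatenation can be sketched with NumPy on stand-in vectors. The `fuse` helper and its weight parameters are assumptions added for illustration; the notebook itself concatenates without weights.

```python
import numpy as np

def fuse(img_emb, txt_emb, image_weight=1.0, text_weight=1.0):
    """Concatenate an image and a text embedding, scaling each part by its weight."""
    return np.concatenate([
        image_weight * np.asarray(img_emb),
        text_weight * np.asarray(txt_emb),
    ])

img = np.ones(512) / np.sqrt(512)  # unit-norm stand-in for a CLIP image vector
txt = np.ones(384) / np.sqrt(384)  # unit-norm stand-in for a MiniLM text vector

combined = fuse(img, txt)
print(combined.shape)  # (896,)
```

With both weights at 1, the inner product of two fused vectors is the sum of their image inner product and their text inner product, so each modality contributes on its own scale. Adjusting the weights is one way to shift the balance between visual and textual similarity.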


In [15]:
# Dictionary to store combined image and text embeddings for each product
combined_embeddings = {}

# Iterating through each product ID in the DataFrame with a progress bar
for pid in tqdm(df["product_id"]):
    # Retrieve image embedding for the current product ID
    img_emb = image_embeddings.get(pid)

    # Retrieve text embedding for the current product ID
    txt_emb = text_embeddings.get(pid)

    # If both embeddings are available, concatenate them into a single vector
    if img_emb is not None and txt_emb is not None:
        combined_embeddings[pid] = np.concatenate([img_emb, txt_emb])

  0%|          | 0/17483 [00:00<?, ?it/s]

## 🔍 Hybrid Visual + Textual Search with CLIP and SentenceTransformer

This function enables **multi-modal similarity search** by combining image and/or text queries. It computes a unified embedding for the query, searches against a pre-built FAISS index of product embeddings, and returns the IDs of the most similar products.

---

### 🧠 Function: `search_similar(query_image_path=None, query_text=None, top_k=5)`

This function supports the following search modes:
- **Image-only search** (query by example)
- **Text-only search** (semantic keyword search)
- **Hybrid search** (image + text fusion)

---

### 🧩 Steps Explained:

1. **Image Embedding (`img_vec`)**
   - If an image path is provided, the image is opened and passed through the CLIP processor and model.
   - The resulting feature vector is extracted and converted to a NumPy array.
   - If no image is provided, a zero vector of shape `(512,)` is used as a placeholder.

2. **Text Embedding (`txt_vec`)**
   - If a query text is provided (e.g., "shorts with side pockets"), it's encoded using the `SentenceTransformer` model.
   - If not, a zero vector of shape `(384,)` is used.

3. **Combine Embeddings**
   - Both vectors are concatenated to form a single `(896,)` vector representing the multi-modal query.
   - The combined vector must match the dimensionality of the FAISS index (checked explicitly for safety).

4. **Similarity Search with FAISS**
   - The combined query embedding is passed to `faiss_index.search(...)`.
   - The `top_k` nearest neighbors are returned; the index in this notebook is an `IndexFlatL2`, so "nearest" means smallest L2 (Euclidean) distance.

5. **Return Results**
   - The function returns the list of product IDs corresponding to the most similar results from the index.

---

### 💬 Example Usage:
```python
search_similar(
    query_image_path='/kaggle/input/dataset-ecomerce/Images/Images/00029897a5....jpg',
    query_text='shorts',
    top_k=3
)
```

This would return the top 3 products visually and semantically similar to the given image and description.

> ✅ This hybrid search approach makes your system flexible and robust — users can search using text, images, or both.
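
The zero-filling and dimension check described in the steps above can be sketched without loading CLIP, the text model, or FAISS. The helper `build_query` is a hypothetical name, and the input vectors here are dummy arrays standing in for real model outputs.

```python
import numpy as np

IMG_DIM, TXT_DIM = 512, 384  # sizes used by the code in this notebook

def build_query(img_vec=None, txt_vec=None, index_dim=IMG_DIM + TXT_DIM):
    """Zero-fill the missing modality, concatenate, and check the size."""
    if img_vec is None:
        img_vec = np.zeros(IMG_DIM, dtype=np.float32)
    if txt_vec is None:
        txt_vec = np.zeros(TXT_DIM, dtype=np.float32)
    query = np.concatenate([img_vec, txt_vec]).astype(np.float32)
    if query.shape[0] != index_dim:
        raise ValueError(f"query has {query.shape[0]} dims, index expects {index_dim}")
    return query

# Text-only query: the image half stays all zeros
text_only = build_query(txt_vec=np.ones(TXT_DIM))
print(text_only.shape)            # (896,)
print(text_only[:IMG_DIM].any())  # False
```

One consequence of zero-filling with an L2 index: for a text-only query, the squared distance to a product is the squared norm of that product's image half plus the squared distance between the two text halves, so the image part of each product still influences the ranking.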


In [28]:
# Importing PIL Image class with alias to avoid conflict with torchvision's Image
from PIL import Image as PILImage

# Function to search for visually and textually similar products using combined embeddings
def search_similar(query_image_path=None, query_text=None, top_k=5):
    # If an image path is provided, process the image
    if query_image_path:
        try:
            # Load and convert the image to RGB format
            image = PILImage.open(query_image_path).convert("RGB")
        except Exception as e:
            # Raise an error if the image could not be loaded
            raise ValueError(f"Could not open image: {query_image_path} — {e}")
        
        # Preprocess the image for CLIP model
        inputs = clip_processor(images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            # Get image features using the CLIP model and convert to NumPy array
            img_vec = clip_model.get_image_features(**inputs).cpu().numpy()[0]
    else:
        # If no image is provided, initialize a zero vector of the same dimension
        img_vec = np.zeros(512, dtype=np.float32)
    
    # If a query text is provided, encode it using the text model
    if query_text:
        txt_vec = text_model.encode(query_text)
    else:
        # If no text is provided, use a zero vector for the text part
        txt_vec = np.zeros(384)

    # Concatenate image and text embeddings to create a single query vector
    combined = np.concatenate([img_vec, txt_vec]).astype("float32")

    # Validate that the query embedding dimension matches the FAISS index dimension
    if combined.shape[0] != faiss_index.d:
        raise ValueError(
            f"Embedding dimension mismatch. Combined shape: {combined.shape} vs FAISS index dim: {faiss_index.d}"
        )

    # Perform a similarity search on the FAISS index using the query embedding
    D, I = faiss_index.search(np.array([combined]), top_k)

    # Map FAISS row indices back to product IDs. `ids` must be the list built together
    # with the index (same row order as the vectors that were added to it).
    return [ids[i] for i in I[0]]

# Example usage (commented out): search for 3 items similar to both a given image and the keyword "shorts"
# search_similar('/kaggle/input/dataset-ecomerce/Images/Images/00029897a53a74bd8cce87c9c6711c83fecb010497ee30cfab271060ee93fcec.jpg',
#               'shorts',
#               3)

## 👗 Outfit Recommendation via KMeans Clustering on Text Embeddings

This section groups fashion products based on their **textual semantics** and uses these clusters to recommend similar items (e.g., to complete an outfit). It enables personalized or style-aware recommendations based on product descriptions and attributes.

---

### 🧠 Step-by-Step Breakdown:

1. **Prepare Text Embedding Matrix**
   ```python
   ids = list(text_embeddings.keys())
   vectors = [text_embeddings[pid] for pid in ids]
   text_matrix = np.array(vectors, dtype=np.float64)
   ```
   - Extracts all product IDs and their corresponding text embeddings.
   - Forms a NumPy matrix `text_matrix` for clustering.

2. **Fit KMeans Clustering**
   ```python
   kmeans = KMeans(n_clusters=15, random_state=42).fit(text_matrix)
   ```
   - Applies the **KMeans algorithm** to group similar product embeddings into `15` clusters.
   - Each cluster groups products with similar semantic descriptions (e.g., similar types, styles, or use cases).

3. **Assign Cluster Labels to Products**
   ```python
   cluster_labels = {
       pid: kmeans.predict([np.array(text_embeddings[pid], dtype=np.float64)])[0]
       for pid in ids
   }
   ```
   - Predicts the cluster label for each product and stores it in a dictionary `cluster_labels`.

4. **Define Outfit Recommendation Function**
   ```python
   def recommend_outfits(base_product_id, top_k=2):
   ```
   - Given a `base_product_id`, finds all other products in the **same semantic cluster**.
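   - Returns up to `top_k` other products from that cluster (excluding the base item), in the order they appear in `ids`; they are not ranked by similarity.
   - Products in the same cluster have similar descriptions, so these are related items rather than guaranteed complementary pieces (such as a top to go with a skirt).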

---

### 💡 Example Use Case:
```python
recommend_outfits("000a3f4dce44e1c1e67199826dfbc672f398067a62f03fc9a7a1e8fa4bde3aa4", top_k=2)
```
This would return 2 product IDs from the same cluster as the given product, ideal for use in outfit suggestion modules.

> ✅ This approach is unsupervised and scalable, making it ideal for cold-start recommendations or clustering large fashion inventories.
> 
> ⚠️ Note: You can tune `n_clusters` to better match the number of distinct style or category types in your catalog.
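
Because `recommend_outfits` returns cluster members in list order, one possible refinement is to rank the members of the cluster by cosine similarity to the base product. The sketch below is an illustration under assumptions, not part of the notebook: `rank_within_cluster` is a hypothetical helper, the IDs, 2-D vectors, and labels are hand-made toy data, and every embedding is assumed to be non-zero.

```python
import numpy as np

def rank_within_cluster(base_id, ids, labels, matrix, top_k=2):
    """Return same-cluster IDs ordered by cosine similarity to base_id."""
    ids = list(ids)
    labels = np.asarray(labels)
    base_idx = ids.index(base_id)
    # Candidate rows: same cluster label, excluding the base product itself
    candidates = [
        i for i in range(len(ids))
        if labels[i] == labels[base_idx] and i != base_idx
    ]
    if not candidates:
        return []
    # Normalise rows so that a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit[candidates] @ unit[base_idx]
    order = np.argsort(-sims)[:top_k]
    return [ids[candidates[j]] for j in order]

# Toy data: hypothetical IDs, hand-made 2-D "embeddings" and cluster labels
ids = ["a", "b", "c", "d"]
matrix = np.array([[1.0, 0.0], [0.6, 0.8], [0.9, 0.1], [0.0, 1.0]])
labels = [0, 0, 0, 1]

print(rank_within_cluster("a", ids, labels, matrix))  # ['c', 'b']
```

With the notebook's objects, `matrix` would correspond to `text_matrix` and `labels` to the fitted model's cluster assignments in the same row order as `ids`.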


In [29]:
# Importing necessary libraries
import numpy as np  # For numerical operations
from sklearn.cluster import KMeans  # For clustering the text embeddings

# Convert embeddings to matrix
# Extract product IDs from text embeddings
ids = list(text_embeddings.keys())

# Create a list of embedding vectors corresponding to each product ID
vectors = [text_embeddings[pid] for pid in ids]

# Convert the list of vectors into a NumPy array (2D matrix) for clustering
text_matrix = np.array(vectors, dtype=np.float64)

# Fit KMeans
# Performing KMeans clustering on the text embedding matrix with 15 clusters
kmeans = KMeans(n_clusters=15, random_state=42).fit(text_matrix)

# Predict cluster labels
# Mapping each product ID to its predicted cluster label
cluster_labels = {
    pid: kmeans.predict([np.array(text_embeddings[pid], dtype=np.float64)])[0]
    for pid in ids
}

# Function to recommend outfits based on products from the same cluster
def recommend_outfits(base_product_id, top_k=2):
    # Get the cluster label for the base product
    base_cluster = cluster_labels[base_product_id]
    
    # Find other products in the same cluster, excluding the base product itself
    similar = [pid for pid in ids if cluster_labels[pid] == base_cluster and pid != base_product_id]
    
    # Return the top_k similar product IDs from the same cluster
    return similar[:top_k]



## 👗 Extracting Fashion Trend Keywords using Gemma 3 (4B) and Web Scraping

This code uses an instruction-tuned LLM (`google/gemma-3-4b-it`, the 4B-parameter Gemma 3 model) to extract the **top 50 trending fashion keywords** (e.g., clothing types, styles, fabrics, silhouettes) from both:
- Your **local product dataset (`df`)**
- An **online fashion catalog** (via web scraping)

The result is a rich, unified string of trend keywords that can be used for:
- Generating dynamic fashion recommendations
- Displaying current trends in a UI
- Filtering inventory or training models on style relevance

---

### 🔍 Step-by-Step Workflow

---

### 1. **Load the Gemma 3 4B Model and Processor**
```python
model_id = "google/gemma-3-4b-it"
```
- Loads a 4B-parameter chat-tuned Gemma model.
- Uses `AutoProcessor` to prepare chat-style prompts for the model.
- Loads the model in `bfloat16` for memory efficiency, using `device_map="auto"` to utilize GPU.

---

### 2. **Function: `extract_trend_keywords_with_gemma(description_text)`**
- Crafts a structured system + user message prompt.
- Instructs the model to **only extract keywords from input**, avoiding hallucinated items.
- Asks for:
  - Types of clothing
  - Styles, fabrics, patterns, silhouettes
  - Descriptive adjectives (e.g., “off-shoulder sleeveless tops”)
  - Unique, non-repetitive English terms
- Uses `generate()` to produce up to 300 new tokens, then decodes only the newly generated part.
- Returns the result as a comma-separated keyword list.

---

### 3. **Function: `scrape_product_names(url)`**
- Scrapes up to 5000 product names from a given fashion website (e.g., FWRD).
- Uses `BeautifulSoup` to find product name divs and extract their clean text.
- Acts as a source of external web data to increase trend diversity.

---

### 4. **Function: `get_combined_trend_string(df, use_internet=True)`**
- Takes the first 100 non-null product descriptions and truncates each one to 50 words.
- Extracts trends locally using `extract_trend_keywords_with_gemma(...)`.
- If `use_internet=True`, also:
  - Scrapes online catalog
  - Extracts web trends using the same Gemma model
- Combines local + web trends into a **de-duplicated, comma-separated list** of keywords.

---

### ✅ Final Output
```python
trend_string = get_combined_trend_string(df)
print("\n🔥 Combined Trending Styles:\n", trend_string)
```
Returns a rich string of current fashion trends by analyzing both internal product data and external online catalogs.

---

> 💡 This technique ensures **trend-awareness** in your fashion system without manual labeling or static rules.
> 
> ⚠️ Make sure to follow site scraping guidelines and respect scraping limits for public websites.
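
The merge step in `get_combined_trend_string` de-duplicates with a plain `set`, which is case-sensitive, so `mini dress` and `Mini Dress` would both survive. A case-insensitive variant could look like the sketch below. `merge_keywords` is a hypothetical helper and the two input strings are made-up examples, not model output.

```python
def merge_keywords(*keyword_strings):
    """Merge comma-separated keyword strings, dropping case-insensitive duplicates."""
    seen = {}
    for s in keyword_strings:
        for kw in s.split(","):
            kw = kw.strip()
            if kw and kw.lower() not in seen:
                seen[kw.lower()] = kw  # keep the first spelling encountered
    return ", ".join(sorted(seen.values(), key=str.lower))

local = "mini dress, wrap dress, mini dress"
web = "Mini Dress, Maxi Dress, Lace"

print(merge_keywords(local, web))
# Lace, Maxi Dress, mini dress, wrap dress
```

Empty inputs are handled the same way as in the notebook: blank entries are skipped, so passing an empty string for the web trends contributes nothing.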


In [30]:
# Importing required libraries
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content
from transformers import AutoProcessor, Gemma3ForConditionalGeneration  # For loading Gemma model
import torch  # For tensor operations and model inference

# Load Gemma model and processor
# Model ID for the 4B instruction-tuned variant of Google's Gemma
model_id = "google/gemma-3-4b-it"

# Load processor to tokenize and format prompts for Gemma
gemma_processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

# Load the model in bfloat16 precision (no quantization) and let device_map place it on the available device(s)
gemma_model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()

# 🧠 Extract 50 fashion keywords using Gemma
def extract_trend_keywords_with_gemma(description_text):
    # Prompt construction using system and user roles for instruction following
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a fashion data analyst."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "From the text below, extract the top 50 trending fashion-related keywords, such as Types of clothing, clothing styles, colors, fabrics, patterns, silhouettes, or themes. "
                    "Avoid brand names. Return only the keywords as a comma-separated list.\n"
                    "Just give response from the given context only donot add things randomly from your side.\n"
                    "For Example if text is Silk Knit Tank Top, One shoulder top, sleevless Mini Dress, Off Shoulder Evening top, Mini tank top, Straped Off Shoulder Gown then trending cloth will be Sleevless Off shoulder tank tops. Like this we need to find trending clothes also add adjectives of clothes if present."
                    " And donot repeat the trend clothes give unique ones. And Give keywords in English only\n\n"
                    f"{description_text}"
                }
            ]
        }
    ]

    # Convert structured message into model-readable chat prompt
    prompt = gemma_processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

    # Tokenize the prompt
    tokenized = gemma_processor(text=prompt, return_tensors="pt").to(gemma_model.device)

    # Generate output using the model without tracking gradients
    with torch.no_grad():
        output = gemma_model.generate(
            input_ids=tokenized["input_ids"],
            attention_mask=tokenized["attention_mask"],
            max_new_tokens=300
        )

    # Decode the model's response, skipping special tokens
    response = gemma_processor.decode(output[0][tokenized["input_ids"].shape[-1]:], skip_special_tokens=True)
    return response.strip()

# Function to scrape product names from a fashion retail webpage
def scrape_product_names(url, max_items=5000):
    headers = {"User-Agent": "Mozilla/5.0"}  # Set user agent to mimic a browser
    
    try:
        # Fetch the page and fail on HTTP error statuses (4xx/5xx)
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")

        # Extract product names from specific div classes
        name_divs = soup.find_all("div", class_="product-grids__copy-item js-plp-name")
        names = [div.get_text(strip=True) for div in name_divs]

        # Limit the number of items returned
        return names[:max_items]
    
    except Exception as e:
        # Catch and print any errors during scraping
        print(f"❌ Error: {e}")
        return []

# 🧠 Final function to get combined trends
def get_combined_trend_string(df, use_internet=True):
    # Inner function to limit number of words per description
    def limit_words(text, max_words=50):
        return " ".join(text.split()[:max_words])

    # Extract and clean description texts from DataFrame
    descriptions = df["description"].dropna().astype(str).tolist()
    limited_descriptions = [limit_words(desc) for desc in descriptions[:100]]  # Limit to first 100 entries
    local_text = "\n".join(limited_descriptions)

    # Generate fashion keywords using local data
    local_trends = extract_trend_keywords_with_gemma(local_text)
    print("🧵 Local Trends:", local_trends)

    # If enabled, scrape website and generate additional keywords
    if use_internet:
        url = "https://www.fwrd.com/fw/content/products/lazyLoadProductsForward?currentPlpUrl=https%3A%2F%2Fwww.fwrd.com%2Ffwpage%2Fcategory-clothing%2F3699fc%2F&currentPageSortBy=featuredF&useLargerImages=false&outfitViewSession=false&showBagSize=false&lookfwrd=false&backinstock=false&preorder=false&_=1749445996960"
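        # scrape_product_names returns a list of names; join it into one text block for the prompt
        web_text = "\n".join(scrape_product_names(url))
        # Skip the model call if nothing was scraped, so it cannot invent keywords from empty input
        web_trends = extract_trend_keywords_with_gemma(web_text) if web_text else ""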
        print("🌐 Web Trends:", web_trends)
    else:
        web_trends = ""

    # Combine both local and web trend lists
    combined_keywords = set(local_trends.split(",") + web_trends.split(","))
    
    # Clean and sort the final list of unique keywords
    combined_clean = [kw.strip() for kw in combined_keywords if kw.strip()]
    return ", ".join(sorted(combined_clean))

# ✅ Usage
# Call function to extract combined trending fashion keywords from both local and web sources
trend_string = get_combined_trend_string(df)
print("\n🔥 Combined Trending Styles:\n", trend_string)




🧵 Local Trends: dress, shirtdress, midi dress, mini dress, flow dress, sweater dress, knit dress, silk dress, satin dress, jersey dress, crepe dress, tweed dress, floral dress, printed dress, ruffled dress, pleated dress, flared dress, sheath dress, A-line dress, fit-and-flare dress, wrap dress, bodycon dress, shirtdress, dress, gown, maxi dress, mini dress, shirt dress, midi dress
🌐 Web Trends: Embroidery, Mini Dress, Maxi Dress, Tank Top, Dress, Skirt, Top, Pant, Gown, Sweater, Shirt, Vest, Jacket, Legging, Bodysuit, Bra, Shorts, Sleeve, Halter, Spaghetti Strap, Off Shoulder, Lace, Knit, Denim, Silk, Cotton, Linen, Jersey, Stripe, Floral, Animal Print, Ruched, Pleated, A-line, Bell-sleeve, Crewneck, Fitted, Maxi, Midi, Mini, Long Sleeve, Short Sleeve, Crop Top, Button-down, Wrap, Puffer, Trench Coat, Cardigan, Sweatshirt, Blouse

🔥 Combined Trending Styles:
 A-line, A-line dress, Animal Print, Bell-sleeve, Blouse, Bodysuit, Bra, Button-down, Cardigan, Cotton, Crewneck, Crop Top, Deni

## 👤 Summarizing User Fashion Preferences from Interaction History

This function analyzes a user’s interaction history and summarizes their preferences in terms of:
- **Top Brands**
- **Preferred Styles**
- **Common Product Descriptions**

It helps personalize recommendations, outfit suggestions, or display a style profile.

---

### 🧠 Function: `summarize_user_preferences(user_id, df, history_dict, top_k=5)`

---

### 🔍 Step-by-Step Explanation:

1. **Nested Function: `clean_style_attr(style_attr)`**
   - Handles inconsistent formats in the `style_attributes` field.
   - If it's a dictionary, flattens it into a readable string (`"fit: slim, sleeve: full"`).
   - If it's a string, strips whitespace.
   - Returns `"Unknown"` for unexpected types.

2. **Retrieve User History**
   ```python
   pids = history_dict.get(user_id, [])
   rows = df[df["product_id"].isin(pids)]
   ```
   - Looks up the product IDs the user interacted with from `history_dict`.
   - Filters the dataset to get all matching rows.

3. **Handle Empty History**
   - If no products are found for the user, return `"No Brands"`, `"No Styles"`, and `"No Description"`.

4. **Extract Top Brands**
   ```python
   brands = rows["brand"].dropna().astype(str).value_counts().index.tolist()[:top_k]
   ```
   - Counts the most frequent brands and returns the top `k`.

5. **Extract Top Style Attributes**
   ```python
   styles_cleaned = rows["style_attributes"].apply(clean_style_attr)
   styles = styles_cleaned.value_counts().index.tolist()[:top_k]
   ```
   - Applies the cleaner to each style attribute and selects the most common ones.

6. **Extract Descriptive Summary**
   ```python
   descriptions = rows["meta_info"].dropna().astype(str).tolist()
   merged_desc = " ".join(descriptions[:top_k * 2]) if descriptions else "No Description"
   ```
   - Concatenates up to `2 * top_k` product descriptions to build a rough summary of user interests.

7. **Return Summary**
   ```python
   return ", ".join(brands), ", ".join(styles), merged_desc
   ```
   - Returns the user's **preferred brands**, **style types**, and **sample descriptive text**.

---

### 📌 Usage Example:
```python
brands, styles, summary = summarize_user_preferences(user_id="user_42", df=df, history_dict=user_history)
print("👕 Brands:", brands)
print("🎨 Styles:", styles)
print("📝 Summary:", summary)
```

> ✅ This function enables **personalized recommendations** and profile generation for any fashion user by leveraging historical behavior.
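
The style-cleaning and frequency-counting steps can be checked without a DataFrame. In this sketch `clean_style_attr` mirrors the nested helper described above, while `top_values` is a hypothetical stand-in for `value_counts().index.tolist()[:top_k]` built on `collections.Counter`. The sample values are invented.

```python
from collections import Counter

def clean_style_attr(style_attr):
    """Flatten dicts, strip strings, and map anything else to 'Unknown'."""
    if isinstance(style_attr, dict):
        return ", ".join(f"{k}: {v}" for k, v in style_attr.items())
    if isinstance(style_attr, str):
        return style_attr.strip()
    return "Unknown"

def top_values(values, top_k=5):
    """Most frequent values first; a stand-in for pandas' value_counts()."""
    return [value for value, _ in Counter(values).most_common(top_k)]

raw_styles = [{"fit": "slim", "sleeve": "full"}, " casual ", None, "casual"]
styles = top_values(clean_style_attr(s) for s in raw_styles)

print(styles)  # ['casual', 'fit: slim, sleeve: full', 'Unknown']
```

Note that missing values are mapped to `"Unknown"` rather than dropped, so `"Unknown"` can show up among a user's top styles when many rows lack style attributes.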


In [31]:
# Function to summarize a user's fashion preferences based on their viewing or purchase history
def summarize_user_preferences(user_id, df, history_dict, top_k=5):
    
    # Helper function to clean and format the 'style_attributes' field
    def clean_style_attr(style_attr):
        if isinstance(style_attr, dict):
            # Convert dictionary to a comma-separated key-value string
            return ", ".join(f"{k}: {v}" for k, v in style_attr.items())
        elif isinstance(style_attr, str):
            # Strip whitespace if it's a plain string
            return style_attr.strip()
        else:
            # Handle missing or unknown format
            return "Unknown"

    # Get list of product IDs associated with the given user
    pids = history_dict.get(user_id, [])

    # Filter DataFrame to include only rows matching those product IDs
    rows = df[df["product_id"].isin(pids)]

    # If no matching rows found, return default placeholders
    if rows.empty:
        return "No Brands", "No Styles", "No Description"

    # Extract and clean top brands based on frequency
    brands = rows["brand"].dropna().astype(str).value_counts().index.tolist()[:top_k]

    # Clean style attributes and get most frequent ones
    styles_cleaned = rows["style_attributes"].apply(clean_style_attr)
    styles = styles_cleaned.value_counts().index.tolist()[:top_k]

    # Combine top meta_info descriptions into a single string
    descriptions = rows["meta_info"].dropna().astype(str).tolist()
    merged_desc = " ".join(descriptions[:top_k * 2]) if descriptions else "No Description"

    # Return summarized user preferences
    return ", ".join(brands), ", ".join(styles), merged_desc

In [32]:
df['meta_info'].isna().sum()

71

## ⚡ Building a FAISS Index for Fast Similarity Search

This section creates a **FAISS (Facebook AI Similarity Search)** index from the combined (image + text) embeddings of all products. The index enables **real-time nearest neighbor search** to support features like:

- Visual + text-based product retrieval
- Hybrid similarity recommendations
- Style or trend-aware fashion matching

---

### 🧠 Step-by-Step Explanation:

---

### Step 1: Prepare Embedding Matrix
```python
ids = list(combined_embeddings.keys())
vectors = np.stack([combined_embeddings[pid] for pid in ids]).astype("float32")
```
- Extracts all product IDs and their corresponding 896-dimensional combined embeddings.
- Converts them into a NumPy array of shape `(N, 896)` where `N` is the number of products.
- Casts to `float32`, the required format for FAISS.

---

### Step 2: Initialize and Build FAISS Index
```python
faiss_index = faiss.IndexFlatL2(vectors.shape[1])
faiss_index.add(vectors)
```
- **`IndexFlatL2`**: An exact, brute-force FAISS index using **L2 (Euclidean) distance**.
  - Suited to small and medium-sized datasets where exact results matter more than query speed.
- Adds all vectors to the index for later querying.

---

### 📌 Notes:
- The dimension must match the combined embedding size: `512 (image) + 384 (text) = 896`.
- This index can now be used with:
  ```python
  faiss_index.search(query_vector, top_k)
  ```
  to retrieve the top `k` most similar products.

> ✅ FAISS enables scalable, low-latency vector search across thousands or millions of fashion items.
> 
> 🧠 You can switch to `IndexIVFFlat` or `IndexHNSWFlat` for approximate search in large-scale deployments.
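
To make the behaviour of a flat L2 index concrete, the following NumPy sketch performs the same kind of exhaustive search: compute the squared L2 distance from each query to every stored vector and keep the `top_k` smallest. It is an illustration only, not the FAISS implementation; `flat_l2_search` is a hypothetical name and the vectors are toy 2-D points.

```python
import numpy as np

def flat_l2_search(vectors, queries, top_k):
    """Exact nearest neighbours by squared L2 distance (brute force)."""
    diffs = queries[:, None, :] - vectors[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)          # shape (n_queries, n_vectors)
    order = np.argsort(dists, axis=1)[:, :top_k]  # indices of the closest rows
    return np.take_along_axis(dists, order, axis=1), order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
query = np.array([[0.9, 0.0]], dtype=np.float32)

D, I = flat_l2_search(vectors, query, top_k=2)
print(I[0].tolist())  # [1, 0]
```

As with `faiss_index.search`, the result is a pair of arrays with one row per query: distances first, row indices second. The indices refer to positions in the stacked matrix, which is why the notebook keeps the `ids` list in the same order.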


In [33]:
# Importing necessary libraries
import faiss  # Facebook AI Similarity Search for fast nearest neighbor retrieval
import numpy as np  # For numerical operations

# Step 1: Create vectors
# Extract product IDs from the combined_embeddings dictionary
ids = list(combined_embeddings.keys())

# Stack all combined embeddings into a 2D NumPy array of shape (n_samples, 896)
# Each row corresponds to a product's [image + text] embedding
vectors = np.stack([combined_embeddings[pid] for pid in ids]).astype("float32")

# Step 2: Create a new FAISS index (dim = 896, taken from the stacked matrix)
# Create a flat L2 (Euclidean distance) index with dimensionality matching the embedding size
faiss_index = faiss.IndexFlatL2(vectors.shape[1])

# Add all product vectors to the FAISS index for similarity search
faiss_index.add(vectors)

In [34]:
len(list(combined_embeddings.values())[0])

896

In [35]:
512 + 384  # image embedding dim + text embedding dim

896

## *🧠 Personalized Outfit Suggestions using Gemma 3 (4B) and Multimodal Prompting*

This function uses Google's instruction-tuned **Gemma 3 4B** model (`google/gemma-3-4b-it`) to generate **personalized and visually-aware outfit suggestions**. It combines the uploaded product image, product metadata, user preference history, and trend signals to simulate an expert fashion stylist's response.

---

### 🧩 Key Components:

1. **Model Initialization**
   - Loads `gemma-3-4b-it`, an instruction-tuned LLM capable of understanding structured prompts and multimodal inputs.
   - The processor formats image + text into a chat-compatible structure.
   - The model runs on GPU with `bfloat16` precision using `device_map="auto"`.

2. **Function: `generate_outfit_gemma(...)`**

#### 🔍 Inputs:
- `image_path`: Path to the product image
- `row`: A row from the product DataFrame containing all metadata
- `user_id`: The user for whom personalized suggestions are generated
- `number_of_suggestions`: Number of outfit pieces to be recommended

---

### 🧵 Prompt Construction

A structured multimodal prompt is created with the following components:

- **Image**: Shown to the model via PIL (converted to RGB)
- **Product Metadata**: Includes name, brand, style attributes, description, and price
- **User Preferences**: Top brands, style features, and liked descriptions from past activity
- **Trend Context**: A precomputed list of trending styles (e.g., "off-shoulder tops", "cropped denim")

The model is asked to:
- Suggest `n` matching outfit items
- Justify each recommendation briefly
- Format the response as a clean bullet list
- Avoid questions or irrelevant output

---

### ⚙️ Model Inference

- The `messages` list is passed through `apply_chat_template()` to produce the formatted input.
- Tokenized inputs are sent to the model for generation (`max_new_tokens=700`).
- The final output is decoded, trimmed, and returned as a user-ready stylist response.

---

### ✅ Output

Returns a stylized and structured list of outfit suggestions such as:
```
- Cropped white blazer – Adds structure to the soft silhouette while keeping the look modern.
- Gold layered necklaces – Accentuate the neckline and align with user’s preference for elegant accessories.
```

> 📌 This function is central to building an **Intelligent Styling Assistant**, combining **vision, text, personalization, and generative AI**.


In [36]:
# Import required libraries for model loading and image processing
from transformers import AutoProcessor, Gemma3ForConditionalGeneration  # Hugging Face tools for the Gemma model
from PIL import Image as PILImage  # PIL for image loading
import torch  # PyTorch for tensor manipulation and inference

# Define the model ID for the instruction-tuned version of Gemma
model_id = "google/gemma-3-4b-it"

# Load the processor for formatting inputs to the model
gemma_processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

# Load the Gemma model using bfloat16 precision and automatic device placement
gemma_model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use float16 if bfloat16 is unsupported
    device_map="auto"
).eval()

# Function to generate outfit suggestions based on a product image and user preferences
def generate_outfit_gemma(image_path, row, user_id="user123", number_of_suggestions=5):
    # Get summarized user preferences (brands, styles, and descriptions)
    user_brands, user_styles, user_description = summarize_user_preferences(user_id, df, user_history, top_k=3)

    # Refined Prompt
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a highly experienced fashion stylist and personal shopper."}]
        },
        {
            "role": "user",
            "content": [
                # Image input to condition the generation
                {"type": "image", "image": PILImage.open(image_path).convert("RGB")},

                # Textual prompt combining product info, user style preferences, and trending styles
                {"type": "text", "text":
                    "Using the image above and the following product and user profile information, "
                    f"please suggest {number_of_suggestions} specific and stylish outfit items that would complement this product perfectly.\n\n"

                    "🎯 **Product Details**:\n"
                    f"- **Name**: {row['product_name']}\n"
                    f"- **Brand**: {row['brand']}\n"
                    f"- **Style Attributes**: {row['style_attributes']}\n"
                    f"- **Description**: {row['description']}\n"
                    f"- **Price**: ₹{row['selling_price']}\n\n"

                    "🧍‍♀️ **User Style Preferences**:\n"
                    f"- **Favorite Brands**: {user_brands}\n"
                    f"- **Preferred Style Features**: {user_styles}\n"
                    f"- **Liked Descriptions**: {user_description}\n\n"

                    "🔥 **Trending Styles Right Now**:\n"
                    f"{trend_string}\n\n"

                    "💡 Provide specific clothing or accessory suggestions that:\n"
                    "- Match both the product’s style and user's preferences\n"
                    "- Reflect the current trends\n"
                    "- Include a short reason for each suggestion\n\n"
                    
                    "Format your answer as a bullet list with names and explanations and also just give response don't ask any further question."
                }
            ]
        }
    ]

    # Format input for Gemma
    inputs = gemma_processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(gemma_model.device)  # Move input tensors to model device

    input_len = inputs["input_ids"].shape[-1]  # Capture input length to slice model output later

    # Generate model response without gradient tracking
    with torch.no_grad():
        outputs = gemma_model.generate(**inputs, max_new_tokens=700)
        response = gemma_processor.decode(outputs[0][input_len:], skip_special_tokens=True)

    return response  # Return the final generated outfit suggestions




## *🛍️ Full Pipeline: Fashion Visual Search + Personal Styling Assistant*

This section brings together all core components of the fashion intelligence system and simulates a real-world user session. It performs **visual search, user preference modeling, outfit generation**, and **text-only inventory exploration** — all within a cohesive, interactive loop.

---

### 🧩 Step-by-Step Workflow:

---

### 1. **Seed User History from Images**
- A list of file IDs (`file_ids`) simulates browsing behavior.
- Each image is passed through `search_similar()` to retrieve the top visually similar product.
- These retrieved `product_id`s are added to `user_history["user123"]`.

---

### 2. **Summarize User Style Profile**
```python
summarize_user_preferences(user_id, df, user_history)
```
- Extracts user's favorite **brands**, **style attributes**, and **descriptions** based on their viewed history.
- Used to condition LLM-based outfit generation.

---

### 3. **Set Final Query Image & Text**
- A target image is selected to represent a final product of interest.
- A free-text query (e.g., `"just looking for mini shorts"`) mimics how users refine their intent.

---

### 4. **Visual + Text Similarity Search**
```python
search_similar(query_image_path=eval_image, query_text=text_query, top_k=10)
```
- Retrieves visually and semantically similar items from the FAISS index.
- Displays them with product name, brand, and price.

---

### 5. **Generate AI-Based Outfit Suggestions**
```python
generate_outfit_gemma(image_path, row, user_id, number_of_suggestions)
```
- Combines:  
  - The query product’s image  
  - Product metadata  
  - User preference profile  
  - Real-time fashion trend string  
- Uses Gemma 3 (4B) to generate `number_of_suggestions` (5 by default) personalized outfit components with explanations.

---

### 6. **Text-Only Inventory Search**
```python
search_similar(query_text=text_query)
```
- Allows retrieval of products **without uploading an image**, ideal for mobile/web users or voice-based input.

---

### 7. **Cluster-Based Personalized Recommendations**
```python
recommend_outfits(base_product_id, top_k)
```
- Recommends stylistically similar items from the same KMeans cluster as the top match.
- Tailors suggestions based on unsupervised style grouping from your dataset.
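
As a sketch of the cluster lookup, assuming `cluster_labels` is a dict of product_id → cluster label (the layout described for the saved assets), the function below returns other products from the same cluster. `recommend_from_cluster` is a hypothetical name; the notebook's `recommend_outfits` may additionally rank items within the cluster.

```python
def recommend_from_cluster(base_product_id, cluster_labels, top_k=10):
    """Return up to top_k other products that share the base product's cluster."""
    if base_product_id not in cluster_labels:
        return []  # unknown product: nothing to recommend
    target = cluster_labels[base_product_id]
    same_cluster = [
        pid for pid, label in cluster_labels.items()
        if label == target and pid != base_product_id
    ]
    return same_cluster[:top_k]

# Toy mapping with made-up IDs
cluster_labels = {"p1": 0, "p2": 1, "p3": 0, "p4": 0}
recs = recommend_from_cluster("p1", cluster_labels, top_k=2)
print(recs)
```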

---

### ✅ Output Summary:
- 🔍 Visually + semantically similar product list
- 🧠 AI-generated outfit items tailored to user style and current trends
- 🧾 Search results for user-entered fashion queries
- 📦 Recommendations drawn from the style cluster of the top match

> 💡 This pipeline shows how image, text, trend, and user data can be combined into a fashion recommendation engine built on multimodal retrieval and generative AI.


In [37]:
# Importing display tools and image handling libraries
from IPython.display import display, Markdown, Image as ColabImage  # For displaying images and markdown in notebooks
from PIL import Image as PILImage  # PIL for opening and manipulating images
import os  # OS operations (e.g., path handling)
import gc  # Python garbage collector for memory cleanup
import torch  # PyTorch for tensor operations

# Base directory of images in the dataset
base_dir = "/kaggle/input/dataset-ecomerce/Images/Images/"

# Filenames only
# A manually curated list of image filenames from the dataset, used below to simulate the user's browsing history
file_ids = [
    "001cc7734a6ded96796a018bbc477f2c02c591b349276d0cba02aa2ff5ac5643.jpg",
    "d2375eda5a1e7d3b70928f62f5d5544b4acbac3066b3ce81b69054f94a4b4699.jpg",
    "2d79fbf8a8d16367fb50a2d901ce156dcbf9d576cefd416a5704d9ce56c1a163.jpg", 
    "f51af2ef89785fc6bb9022e51638473b4c9d5502cdc1d646f6174ac2230a7dda.jpg",
    "7200c1cc52432f6162f49450d11b8aaa0fb20c7d4bdecaf78221724ef4e65842.jpg", # Pant
    "cf4929378f6df845c0586297e187ba83d0233132eb16ea5ef8e725a561dca350.jpg",
    "3912d625079ca4451b69f8bf05b49a366d18a5d19ccc66927ddfe8472b7e238c.jpg",
    "5bb24d735d0d883b74d54eca4259cda99d33b1b2276a8af368f8ed0b74434dbc.jpg",
    "41124e644edefe8f62183e4e272b0e1875b25f27a6921db1fd18f04227d28f33.jpg",
    "6ae3eab605f8fce59bac14c9683967e2a92b2810e0f5ad628770de1ed2c89335.jpg",
    "1f8858ffec5e9bb3080efd79a897f1c29b107331a727a025d9656bb159d1c210.jpg",
    "b9a31680489fd7d86c4e63db69f478d88ec29c94c7c14e1c6194bfb2f22dc451.jpg", 
    "a679c7224741a85a360c15433d3270b274ad2be6762964a3016de5e2cf977396.jpg",
    "2bfd2ecf2b3dcb0456f38e90d8eac7959a86d62bf391196052d716558ca3892a.jpg",
    "50b9e4c78cd22ccd2f51749f005933e225ec28745677c763bae9db6a1df0c699.jpg",
    "3b95bcf6ee388620dbefc4b157b3062b8817f5c58300d4dad5085f53788b07a4.jpg",
    "004eb122c5f855688d88bc67e1040b68649ebf006a524847381899af841106a2.jpg", # only dress
    "f47545349516bd64e6a235b7433618a7d56a89b9a314b8dd2b5559cb97ab5927.jpg", # only dress
    "a326c2ebae5f65dc52df9ef9cbe0174b7a6d0eebb6ef113031c6fa17d716e87e.jpg",
    "80e1cec6743c039088e55dc972614c004e084214878eb77263a8bf31d81abf7a.jpg",
    "d71430db83e8a11f846346898ea7f8dd96417a8c0e329ed9e1947f498105b427.jpg",
    "ae8e6b75dd7cdc91028c52a21da3a893addef0b6a26331b706be37edb6cffa7c.jpg",
    "6ad111071f0abd165ffb969e884143053af793a08b0ec5cfd7bbb336756076a9.jpg",
    "004e598c1e8b7f6960d98e18c72f672ad51213fba53daed642226a1a0beeb6ba.jpg",
    "1ea243e3fe3adb01f636837f4a0508c4e5fd0105ede18b1a6dbcc4833cd3754b.jpg",
    "e6dd290aa9ccf67afbcfafda532a780dd399cedd11b8609418512b55e57ffd8a.jpg",
    "48c364ae27afe5c90d18f034b54939b64b81e1edc2f9c39477cd39517d70de74.jpg",
    "fb269300b89a93ca40e4852b5538cb55c69b2d7210630c4534e80fc418875d66.jpg",
    "6f049551f8254f207ba91307cac602b4dd4f59bb9ea752531b782bfee143592b.jpg",
    "f98b809c263a7afcf9599af980dd001afa2d455993bdcea406655d88fb8c2ddb.jpg",
    "23ba3751946b5c1aeb2c26ecee35dbd110f77034912cceec37793b2582c2d249.jpg",
    "e32cdcaa1cc0bb2c45d7c87af5b4bfc9cbac9f3e2cfdab6d82d5068219dbc094.jpg",
    "b021027979aa958c026cbb6a7aafe3fe142d8c7e9146eba6effc092f155b76e8.jpg",
    "da95d235d355966b809f0d2a5565e2c44deecfd272908862196676e0da70cf37.jpg",
    "7c89760ba9ce4067d12e867ac43b53dbbc2930b5cbbb2e18705ba1b8f361de13.jpg",
    "7a762dedfa44e60e1827412d6ea1a583a0784ff8390d49261651f0bb7b5248b7.jpg",
    "e50759c449086b3b1b5ce74c2387af27719cee7f3c8f9c7898e864f80557154e.jpg",
    "76f7d48b979291b2bd59ad50b9b9f0d91aa7313c79b0664c1c3d8e9bff64d9ca.jpg",
    "1363a3681f397ec57fe076589a761a5527da8647551c06d3adb19875c62bccc4.jpg",
    "ca35331d6e4ab68e4ef28fca64f32b532dc29bc91b83d31257b8f5c1cc1f591a.jpg",
    "33eab84b80f35180bb7d28de928749093ffeaee000ecefb2a1aa9abe12b5d3d4.jpg",
    "106ea31260ca214d881d99b5add1cede7368c0dc3a14bae47ed7b200563539b2.jpg",
    "d19183e13cca52fdecb11c71b35689cb31ded1e10ec9a8813ea63ef4c66e04c1.jpg", # Pant
    "b03cdf3521c8aaadc9bbb220a27d637938b623e00ba6f7b228544c7cfaeb4503.jpg",
    "4065bdeaf7d6383b531c3bdb98e0fff8a5da18fe9b51c024a8ade189de9d5694.jpg",
    "70e40a189be184226083b19a9d7f78bd0ad04bb77dd2311ae410992ffb80c5f4.jpg",
    "1dd2334c0934cb3d3d77bfa89edf95744a726b7885ab4e1dc82a740f7252b434.jpg",
    "f11de4958958cd329871f79b24800cbaf9ca5e9ba7519d239c65cfff17c690f9.jpg",
    "7df2df33cbbe2465ab0d6bd9328edb35ddce05ad2116a3d96a5050f08ab47628.jpg",
    "59508ef2f9cc0a3e52071e12616a78e71c6e0f9786eec00435a0eff1f97cb3f0.jpg",
    "25d3c9f71e66be484e659a15c68ca6c54b1227fc038eda66a0aff2ba6e5dbd79.jpg",
    "d2b9ad7636a9861cd3987719c1b93f78d0aa295c799f71139decc9fc10930d6c.jpg",
    "532c1a5b7076eea7910de17c3360505dab1a54d762dae1e39f57897fa63a019e.jpg",
    "07211beaef746d5fb44bae6904afa700f290ff2c259124a49c2995f26a4a3642.jpg",
    "77a0a0be70b306aee1d77dfe8f45a421bd9fab024c057c3f4a761c8544aa5ed9.jpg",
]

# Initialize history
# Creating a dictionary to track product interaction history for a specific user
# This will be used to personalize recommendations later
user_history = {"user123": []}

# Step 1: Fill user history from seed images
# Iterating over the list of sample image filenames to simulate user browsing history
for file_id in file_ids:
    full_path = os.path.join(base_dir, file_id)  # Construct full image path
    print(f"\n📥 Browsing: {full_path}")
    try:
        # Search visually similar product using only image (text query is empty here)
        results = search_similar(full_path, query_text="", top_k=1)
        for pid in results:
            # If the product exists in the dataset, add it to the user's history
            if pid in df["product_id"].values:
                user_history["user123"].append(pid)
            else:
                # Skip unknown product IDs that may not exist in the DataFrame
                print(f"⚠️ Skipping unknown product_id: {pid}")
    except Exception as e:
        # Handle failures (e.g., unreadable image file or inference error)
        print(f"❌ Failed to process {file_id}: {e}")

# Step 2: Summarize user preferences
# Extract top 5 brands, styles, and description tokens from the user’s browsing history
brands, styles, desc = summarize_user_preferences("user123", df, user_history, top_k=5)
print("\n🧍‍♀️ User Preference Summary:")
print(f"Brands: {brands}")
print(f"Styles: {styles}")
print(f"Description Summary: {desc}")

# Step 3: Final evaluation with query image
# Multiple image paths provided for evaluation; only one is active at a time
# Final image to use for visual query + text query

# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/00086378cb48ebe15ed73d4ce4bc54cbe813cf13d2c8e1b31dcb9e2570aff771.jpg"
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/dc97bca7bd1adfc74d1d0e77d4adfe96d064aa98b5505a2985fdacf2c6e9bec9.jpg" # pant
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/5d01f17ecba0cd84b368f0776c43aee94b23dedce3d37de9244c512abf6f1814.jpg" 
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/d2afb90fef7c4a9cbd95df9c9a139141b319dba0aa90c342aec592eca25aba40.jpg" 
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/affad4a2a4fb56ca087361e3dc72fb17f6c579ded0751e6e9e45263524ae6596.jpg" 
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/788e92783fb7fe6230bbf000c7f6f89321f2be4cad2eb5d8e3fa16ac203c5969.jpg" 
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/ba805b1b37e2e3d2373017902660416c568ae8629a3aa5af88a0c2216d27a77b.jpg" 
eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/ddf5723e1553ce6c5840545e5876208e5013135fba4f52b87fc76933f768ed7a.jpg" 
# eval_image = "/kaggle/input/dataset-ecomerce/Images/Images/7594c9b2adcf5d0aad971af4d361b1cc77dcc5703506011d2ca96c4a0254116d.jpg" 
text_query = "just looking for mini shorts"
print(f"\n🎯 Final Query Image: {eval_image}")

# Step 4: Visually similar items
# Retrieve top-10 visually + textually similar products using FAISS search
similar_pids = search_similar(eval_image, query_text=text_query, top_k=10)
print("\n🔍 Step 1: Visually Similar Products")
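for pid in similar_pids:
    match = df[df["product_id"] == pid]
    if match.empty:
        # Guard against IDs missing from the DataFrame (.iloc[0] would raise IndexError)
        print(f"⚠️ Skipping unknown product_id: {pid}")
        continue
    row = match.iloc[0]  # Get product info row
    image_url = row["feature_image_s3"]  # URL to display image from S3
    display(ColabImage(url=image_url))  # Show product image
    print(f"{row['product_name']}")
    print(f"Brand: {row['brand']} | Price: ₹{row['selling_price']}\n")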

# Step 5: Outfit suggestion (Gemma on top-1)
print("\n🧠 Step 2: Outfit Suggestions using Vision-Language Model")
top_pid = similar_pids[0]  # Take the top-1 most similar product
top_row = df[df["product_id"] == top_pid].iloc[0]  # Fetch corresponding row

try:
    # Generate personalized outfit suggestions using Gemma model
    outfit_suggestions = generate_outfit_gemma(eval_image, top_row, user_id="user123", number_of_suggestions=5)
except Exception as e:
    # Handle model generation error
    outfit_suggestions = f"❌ Error generating outfit suggestions: {e}"

# Display outfit recommendations in Markdown format
display(Markdown(f"**🧠 Suggested Items to Complete Your Outfit:**\n\n{outfit_suggestions}"))

# Step 6: Text-Only Search (e.g., user types what they want)
print("\n🧾 Step 4: Inventory Search Based on Text Query Only")

# Perform similarity search using only the text query, no image
text_only_results = search_similar(query_image_path=None, query_text=text_query, top_k=10)

# Display the search results from inventory
for pid in text_only_results:
    if pid in df["product_id"].values:
        row = df[df["product_id"] == pid].iloc[0]
        image_url = row["feature_image_s3"]
        display(ColabImage(url=image_url))
        print(f"{row['product_name']}")
        print(f"Brand: {row['brand']} | Price: ₹{row['selling_price']}\n")
    else:
        print(f"⚠️ Skipping unknown product_id: {pid}")

# Step 7: Cluster-based suggestions from the top match's style cluster
print("\n📦 Step 3: Personalized Recommendations Based on Your Style History")
# Recommend products from the same style cluster as the top match
history_recs = recommend_outfits(top_pid, top_k=10)

# Display the personalized cluster-based recommendations
for pid in history_recs:
    if pid in df["product_id"].values:
        row = df[df["product_id"] == pid].iloc[0]
        image_url = row["feature_image_s3"]
        display(ColabImage(url=image_url))
        print(f"{row['product_name']}")
        print(f"Brand: {row['brand']} | Price: ₹{row['selling_price']}\n")
    else:
        print(f"⚠️ Skipping unknown product_id: {pid}")


📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/001cc7734a6ded96796a018bbc477f2c02c591b349276d0cba02aa2ff5ac5643.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/d2375eda5a1e7d3b70928f62f5d5544b4acbac3066b3ce81b69054f94a4b4699.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/2d79fbf8a8d16367fb50a2d901ce156dcbf9d576cefd416a5704d9ce56c1a163.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/f51af2ef89785fc6bb9022e51638473b4c9d5502cdc1d646f6174ac2230a7dda.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/7200c1cc52432f6162f49450d11b8aaa0fb20c7d4bdecaf78221724ef4e65842.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/cf4929378f6df845c0586297e187ba83d0233132eb16ea5ef8e725a561dca350.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/3912d625079ca4451b69f8bf05b49a366d18a5d19ccc66927ddfe8472b7e238c.jpg

📥 Browsing: /kaggle/input/dataset-ecomerce/Images/Images/5bb24d735d0d883b74d54eca4259cda99d33b1b2276a8af368f8e



🔍 Step 1: Visually Similar Products


LSPACE Ringside Mini Dress
Brand: LSPACE | Price: ₹13041.975



LSPACE Kelsey Mini Dress
Brand: LSPACE | Price: ₹10233.3455



LSPACE Corsica Strapless Cutout Midi Dress
Brand: LSPACE | Price: ₹11915.9746



BHLDN Deco Square-Neck Mini Dress
Brand: BHLDN | Price: ₹25364.4333



BHLDN Sleeveless Lace Fit & Flare Mini Dress
Brand: BHLDN | Price: ₹22117.4205



AGOLDE Parker Denim Long Shorts
Brand: AGOLDE | Price: ₹148.0



LSPACE Calla Midi Dress
Brand: LSPACE | Price: ₹11915.9746



Lovella Linen Dress
Brand: Reformation | Price: ₹7895.9541



Sundays Tara Dress
Brand: Sundays | Price: ₹13663.32



Forever That Girl Airy Babydoll Mini Dress
Brand: Forever That Girl | Price: ₹12687.5125


🧠 Step 2: Outfit Suggestions using Vision-Language Model


**🧠 Suggested Items to Complete Your Outfit:**

Here are 5 outfit items that would complement the LSPACE Ringside Mini Dress, considering the user's preferences and current trends:

*   **Reformation Olivia Slip Dress (Color: Ivory)** -  This slip dress in a similar ivory hue would create a beautiful monochromatic look with the mini dress, aligning with Reformation’s brand aesthetic and the user's preference for a clean, minimalist style. The slip dress offers a luxurious feel, complementing the crepe fabric of the mini dress.

*   **By Anthropologie Bell Sleeve Cardigan (Color: Dusty Rose)** - A dusty rose bell sleeve cardigan adds a touch of romanticism and warmth. The bell sleeves fit within the user's favorite style features (like the halter neckline of the mini dress) and would create a sophisticated layered look. 

*   **PAIGE High-Waisted Skinny Jeans (Color: Light Wash Denim)** -  Pairing the mini dress with light wash, high-waisted skinny jeans is a current trend that offers a chic contrast. The denim complements the dress’s lightness and creates a versatile, effortless outfit perfect for casual outings.

*   **Sam Edelman Stevie Pointed-Toe Flats (Color: Blush)** - These flats are a sleek and comfortable option, aligning with the user’s fondness for polished footwear. Blush is a neutral that will work well with the ivory dress and add a feminine touch. 

*   **A Animal Print Scarf (Color: Leopard Print)** - Adding a delicate leopard print scarf tied loosely around the neck adds a playful pop of color and texture, embracing the current trend of animal prints while remaining within the user’s style preferences for sophisticated, elevated pieces.


🧾 Step 4: Inventory Search Based on Text Query Only



Denim shorts
Brand: Sandro Paris | Price: ₹182.0



AGOLDE Parker Denim Long Shorts
Brand: AGOLDE | Price: ₹148.0



Petite Compact Stretch Waist Tab Detail Tailored Mini Dress
Brand: KarenMillen | Price: ₹8766.8194



Petite Compact Stretch Waist Tab Detail Tailored Mini Dress
Brand: KarenMillen | Price: ₹13367.2806



Petite Compact Stretch Waist Tab Detail Tailored Mini Dress
Brand: KarenMillen | Price: ₹8772.7626



PLEATED MINI TUNIC DRESS
Brand: COS | Price: ₹10519.7752



PLEATED MINI TUNIC DRESS
Brand: COS | Price: ₹10519.7752



long-sleeved pleated minidress
Brand: JIL SANDER | Price: ₹164700.613



Petite Soft Tailored Crepe Tab Detail Pleated Mini Dress
Brand: KarenMillen | Price: ₹6458.1044



Petite Compact Stretch One Shoulder Pleated Waist Mini Dress
Brand: KarenMillen | Price: ₹6342.3996


📦 Step 3: Personalized Recommendations Based on Your Style History


Sweaty Betty Explorer Midi Dress
Brand: Sweaty Betty | Price: ₹10766.8292



Peixoto Parker Tie Midi Dress
Brand: Peixoto | Price: ₹13587.131



Beyond Yoga Featherweight At The Ready Square-Neck Midi Dress
Brand: BEYOND YOGA | Price: ₹11867.241



PQ Victoria Maxi Dress
Brand: PQ Swim | Price: ₹13243.153



Sundays Shawn Mini Dress
Brand: sundays | Price: ₹12157.969



Hanky Panky Retro Chemise
Brand: hanky panky | Price: ₹7629.6528



Sundays Tara Dress
Brand: Sundays | Price: ₹13716.2298



Eberjey Mademoisella Chemise
Brand: Eberjey | Price: ₹10972.9838



Eberjey Summer Of Love Elba Cover-Up Dress
Brand: Eberjey | Price: ₹12469.0813



Peixoto Serena Dress
Brand: Peixoto | Price: ₹16973.8343



## 💾 Saving All Models Locally for Deployment or Reuse

To ensure fast reloading and offline compatibility, this step saves all key models and processors used in the system to the local working directory. The corresponding code cell is commented out in this run, so it produces no output.

---

### 🔐 1. Save CLIP Model and Processor
```python
clip_model.save_pretrained("clip-vit-large-patch14")
clip_processor.save_pretrained("clip-vit-large-patch14")
```
- Saves the **CLIP model** (`clip-vit-large-patch14`) and its associated processor, which handles image and text preprocessing, to a folder named after the model.
- Enables reuse in future sessions or deployment to inference environments.

---

### ✍️ 2. Save SentenceTransformer Text Model
```python
text_model.save("all-MiniLM-L6-v2")
```
- Saves the **text embedding model** (`all-MiniLM-L6-v2`) used for semantic understanding of product descriptions, styles, and queries.

---

### 🧠 3. Save Gemma LLM and Processor
```python
gemma_model.save_pretrained("gemma-3-4b-it")
gemma_processor.save_pretrained("gemma-3-4b-it")
```
- Saves the **Gemma 3 4B instruction-tuned model** (`gemma-3-4b-it`) and its processor for generating personalized outfit recommendations.
- Useful for loading the same model in production or fine-tuning scenarios.

---

### ✅ Final Confirmation
```python
print("✅ All models saved locally in current working directory.")
```
- Prints a confirmation once the three save steps have run; it does not verify the written files.

> 📦 These saved models can now be uploaded to Hugging Face, integrated into APIs, or bundled in a Streamlit/HF Spaces app.


In [38]:
# # === 1. Save CLIP Model and Processor ===
# # Save the CLIP vision-language model to a local directory for reuse or deployment
# clip_model.save_pretrained("clip-vit-large-patch14")
# # Save the associated processor (handles image + text preprocessing)
# clip_processor.save_pretrained("clip-vit-large-patch14")

# # === 2. Save SentenceTransformer Model ===
# # Save the text embedding model (MiniLM) locally for future loading without internet
# text_model.save("all-MiniLM-L6-v2")

# # === 3. Save Gemma Model and Processor ===
# # Save the multimodal Gemma model to the local filesystem
# gemma_model.save_pretrained("gemma-3-4b-it")
# # Save the processor used to tokenize and format Gemma inputs
# gemma_processor.save_pretrained("gemma-3-4b-it")

# # Confirmation message
# print("✅ All models saved locally in current working directory.")

## 💾 Save & Load All Fashion Assistant Assets (Models, Embeddings, Indexes)

This section defines utility functions to **persist and reload** all key components of the fashion visual search and recommendation system, including:

- Image, text, and combined embeddings
- FAISS index for vector similarity search
- User history and trend strings
- Product metadata and cluster labels

This enables checkpointing, deployment, or sharing the pipeline across environments (local or cloud).

---

### 🧠 Function: `save_all_assets(...)`

Saves all the following components to a specified folder (`Assets/` by default):

#### ✅ Embeddings
- **`image_embeddings.pkl`**: Dict of image vectors by `product_id`
- **`text_embeddings.pkl`**: Dict of text vectors by `product_id`
- **`combined_vectors.npy`**: NumPy array of all combined vectors
- **`product_ids.pkl`**: List of product IDs used for indexing

#### 🔍 FAISS Index
- **`faiss_index.index`**: Saved similarity search index built from combined vectors

#### 📄 Metadata and User State
- **`product_metadata_df.pkl`**: Pandas DataFrame of the entire product catalog
- **`user_history.pkl`**: Dict of user → product_id history
- **`trend_string.pkl`**: Precomputed string of trending fashion keywords
- **`cluster_labels.pkl`**: Dict of product_id → KMeans cluster label

This ensures you can fully resume the session or deploy to another environment without recomputation.

---

### 🔁 Function: `load_all_assets(...)`

Reads all saved files and restores the following components into memory:

- Embedding dictionaries (`image_embeddings`, `text_embeddings`)
- Combined embedding matrix and product ID list
- FAISS index (`faiss_index`)
- Product metadata DataFrame
- User interaction history (`user_history`)
- Fashion trend string (`trend_string`)
- Cluster labels from KMeans

---

### 🧪 Final Check

After loading, a quick sanity check prints the key dimensions and a preview of the trend string:
```python
print(f"Combined vector dimension: {combined_vectors.shape[1]}")
print(f"FAISS index dimension: {faiss_index.d}")
print(f"Trend string:\n{trend_string[:300]}...")
```
Matching dimensions indicate that the restored vectors are compatible with the FAISS index used by the search pipeline; the check does not compare the stored contents.
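
A slightly stricter check could also compare counts, not just dimensions. The sketch below is an assumption-labeled example: `check_asset_consistency` is a hypothetical helper that takes plain numbers so it can run without the real assets. With the loaded objects, the arguments would come from `len(ids)`, `combined_vectors.shape`, `faiss_index.d`, and `faiss_index.ntotal`.

```python
def check_asset_consistency(num_ids, vector_shape, index_dim, index_size):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    num_vectors, vector_dim = vector_shape
    if num_ids != num_vectors:
        problems.append(f"{num_ids} product IDs but {num_vectors} vectors")
    if vector_dim != index_dim:
        problems.append(f"vector dim {vector_dim} != index dim {index_dim}")
    if index_size != num_vectors:
        problems.append(f"index holds {index_size} vectors, expected {num_vectors}")
    return problems

# With the real objects this would be called roughly as:
#   check_asset_consistency(len(ids), combined_vectors.shape,
#                           faiss_index.d, faiss_index.ntotal)
ok = check_asset_consistency(3, (3, 896), 896, 3)
bad = check_asset_consistency(3, (4, 896), 896, 4)
print(ok)
print(bad)
```

Returning a list of problems instead of raising on the first one lets a single run report every mismatch.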

---

> ✅ These functions are essential for deploying the system in production, caching training results, or packaging the app for platforms like Hugging Face Spaces or Streamlit.


In [39]:
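# Import required libraries
import os, pickle, json  # For file I/O operations and serialization
import numpy as np  # For handling numerical arrays
import pandas as pd  # For reading the pickled product metadata DataFrame
import faiss  # For writing and reading the similarity search index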

# Function to save all necessary components to disk
def save_all_assets(
    image_embeddings,
    text_embeddings,
    combined_embeddings,
    faiss_index,
    df,
    user_history,
    trend_string,
    cluster_labels,
    save_dir="Assets"
):
    os.makedirs(save_dir, exist_ok=True)  # Create the save directory if it doesn't exist

    # Save image embeddings to a pickle file
    with open(os.path.join(save_dir, "image_embeddings.pkl"), "wb") as f:
        pickle.dump(image_embeddings, f)

    # Save text embeddings to a pickle file
    with open(os.path.join(save_dir, "text_embeddings.pkl"), "wb") as f:
        pickle.dump(text_embeddings, f)

    # Save the product IDs used in combined embeddings
    with open(os.path.join(save_dir, "product_ids.pkl"), "wb") as f:
        pickle.dump(list(combined_embeddings.keys()), f)

    # Save the actual combined embedding vectors as a .npy file
    np.save(os.path.join(save_dir, "combined_vectors.npy"), np.stack([combined_embeddings[pid] for pid in combined_embeddings]))

    # Save the FAISS index for fast similarity search
    faiss.write_index(faiss_index, os.path.join(save_dir, "faiss_index.index"))

    # Save the full product metadata DataFrame
    df.to_pickle(os.path.join(save_dir, "product_metadata_df.pkl"))

    # Save user history (dictionary) as pickle
    with open(os.path.join(save_dir, "user_history.pkl"), "wb") as f:
        pickle.dump(user_history, f)

    # Save the trend string (fashion keywords) as a pickle
    with open(os.path.join(save_dir, "trend_string.pkl"), "wb") as f:
        pickle.dump(trend_string, f)

    # Save the cluster labels (product_id to cluster mapping)
    with open(os.path.join(save_dir, "cluster_labels.pkl"), "wb") as f:
        pickle.dump(cluster_labels, f)

    print("✅ All embeddings, metadata, user history, and trend string saved.")

# Function to load all previously saved assets from disk
def load_all_assets(load_dir="Assets"):
    with open(os.path.join(load_dir, "image_embeddings.pkl"), "rb") as f:
        image_embeddings = pickle.load(f)

    with open(os.path.join(load_dir, "text_embeddings.pkl"), "rb") as f:
        text_embeddings = pickle.load(f)

    with open(os.path.join(load_dir, "product_ids.pkl"), "rb") as f:
        ids = pickle.load(f)

    # Load combined vectors and FAISS index
    combined_vectors = np.load(os.path.join(load_dir, "combined_vectors.npy"))
    faiss_index = faiss.read_index(os.path.join(load_dir, "faiss_index.index"))

    # Load the product metadata DataFrame
    df = pd.read_pickle(os.path.join(load_dir, "product_metadata_df.pkl"))

    with open(os.path.join(load_dir, "user_history.pkl"), "rb") as f:
        user_history = pickle.load(f)

    with open(os.path.join(load_dir, "trend_string.pkl"), "rb") as f:
        trend_string = pickle.load(f)

    with open(os.path.join(load_dir, "cluster_labels.pkl"), "rb") as f:
        cluster_labels = pickle.load(f)

    print(f"✅ Loaded assets from {load_dir}")
    return image_embeddings, text_embeddings, combined_vectors, ids, faiss_index, df, user_history, trend_string, cluster_labels

# Save all assets to disk
save_all_assets(image_embeddings, text_embeddings, combined_embeddings, faiss_index, df, user_history, trend_string, cluster_labels)

# Load all assets back from disk
image_embeddings, text_embeddings, combined_vectors, ids, faiss_index, df, user_history, trend_string, cluster_labels = load_all_assets()

# Display basic confirmation and validation info
print(f"Combined vector dimension: {combined_vectors.shape[1]}")
print(f"FAISS index dimension: {faiss_index.d}")
print(f"Trend string:\n{trend_string[:300]}...")

✅ All embeddings, metadata, user history, and trend string saved.
✅ Loaded assets from Assets
Combined vector dimension: 896
FAISS index dimension: 896
Trend string:
A-line, A-line dress, Animal Print, Bell-sleeve, Blouse, Bodysuit, Bra, Button-down, Cardigan, Cotton, Crewneck, Crop Top, Denim, Dress, Embroidery, Fitted, Floral, Gown, Halter, Jacket, Jersey, Knit, Lace, Legging, Linen, Long Sleeve, Maxi, Maxi Dress, Midi, Mini, Mini Dress, Off Shoulder, Pant, Pl...
