**Multimodal Embedding Generator**

This notebook implements a **multimodal embedding** pipeline that combines textual and visual information into a single representation per item. The goal is to generate 128-dimensional embeddings for the 91,718 rows of the item table in three steps:

* Text embeddings from item titles using Sentence-BERT (all-MiniLM-L6-v2)
* Image embeddings from product photos using CLIP (ResNet-50)
* Dimensionality reduction via PCA to compress from 1,408 to 128 dimensions
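
The three steps above can be sketched on random stand-in data. The dimensions follow from the models named in the list: all-MiniLM-L6-v2 produces 384-dimensional sentence vectors and CLIP RN50 produces 1,024-dimensional image vectors, so concatenation gives 384 + 1,024 = 1,408 columns. The item count `n_items` and the random arrays are placeholders chosen for illustration, not the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_items = 500  # placeholder; the notebook processes 91,717 items plus one padding row

# Random stand-ins for the real model outputs
text_emb = rng.normal(size=(n_items, 384)).astype(np.float32)    # Sentence-BERT width
image_emb = rng.normal(size=(n_items, 1024)).astype(np.float32)  # CLIP RN50 width

# Fuse by concatenating along the feature axis, then compress with PCA
fused = np.concatenate([text_emb, image_emb], axis=1)
reduced = PCA(n_components=128).fit_transform(fused)

print(fused.shape, reduced.shape)
```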

In [1]:
!pip uninstall -y protobuf
!pip install protobuf==3.20.3


Found existing installation: protobuf 6.33.0
Uninstalling protobuf-6.33.0:
  Successfully uninstalled protobuf-6.33.0
Collecting protobuf==3.20.3
  Downloading protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Downloading protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.1/162.1 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: protobuf
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
opentelemetry-proto 1.37.0 requires protobuf<7.0,>=5.0, but you have protobuf 3.20.3 which is incompatible.
onnx 1.18.0 requires protobuf>=4.25.1, but you have protobuf 3.20.3 which is incompatible.
a2a-sdk 0.3.10 requires protobuf>=5.29.5, but you have protobuf 3

In [2]:
!pip install git+https://github.com/openai/CLIP.git


Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-2k7mclun
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-2k7mclun
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting 

In [3]:
# Imports
import os
import pandas as pd
import numpy as np
from PIL import Image
import torch
import clip  # OpenAI CLIP package, installed from GitHub in the previous cell
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import warnings

warnings.filterwarnings("ignore")

2025-11-25 12:51:14.296486: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764075074.494035      47 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764075074.550344      47 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


**Load the Microlens dataset containing item metadata and features.**
*Note*: I have created a Kaggle dataset named **"microlens"** that contains all the files this notebook needs. Add it as a data source to your notebook before running.

Data structure:

* item_info.parquet (91,718 rows): base item information with item_id, item_tags, and an existing 128-d embedding column (item_emb_d128)
* item_feature.parquet (91,717 rows): extended features, including item_title and other attributes
* item_images/: folder containing product images named image{item_id}.jpg
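
Because item_feature has one row fewer than item_info, the later cells rely on the first row of item_info being a padding entry and on the remaining rows following the same item_id order as item_feature. A minimal sketch of how that assumption could be checked, using tiny made-up frames in place of the real parquet files:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for item_info / item_feature (all values are made up)
item_info = pd.DataFrame({"item_id": [0, 1, 2, 3]})
item_feature = pd.DataFrame({"item_id": [1, 2, 3], "item_title": ["a", "b", "c"]})

# item_info should have exactly one extra row (assumed to be the padding item)
has_one_extra_row = len(item_info) == len(item_feature) + 1

# After skipping the first row, the ids should match position by position
ids_aligned = np.array_equal(
    item_info["item_id"].iloc[1:].to_numpy(),
    item_feature["item_id"].to_numpy(),
)

print(has_one_extra_row, ids_aligned)
```

If either check failed on the real data, joining the two tables on item_id would be a safer way to attach embeddings than relying on row position.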

In [4]:
# ==========================
# Paths
# ==========================
item_info_path = "/kaggle/input/microlens/item_info.parquet"
item_feature_path = "/kaggle/input/microlens/item_feature.parquet"
image_folder = "/kaggle/input/microlens/item_images/item_images/"

output_path = "/kaggle/working/item_info_fused_multimodal.parquet"

# ==========================
# Load data
# ==========================
item_info = pd.read_parquet(item_info_path)
item_feature = pd.read_parquet(item_feature_path)

print(f"item_info shape: {item_info.shape}")
print(f"item_feature shape: {item_feature.shape}")

item_info shape: (91718, 3)
item_feature shape: (91717, 7)


In [5]:
# ==========================
# Load models
# ==========================
device = "cuda" if torch.cuda.is_available() else "cpu"

# Sentence-BERT for text
text_model = SentenceTransformer('all-MiniLM-L6-v2')
text_model.eval()

# CLIP-RN50 for images
clip_model, preprocess = clip.load("RN50", device=device)
clip_model.eval()



CLIP(
  (visual): ModifiedResNet(
    (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu1): ReLU(inplace=True)
    (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu2): ReLU(inplace=True)
    (conv3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu3): ReLU(inplace=True)
    (avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (layer1): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace=True)
     

In [6]:
# ==========================
# Encode text
# ==========================
titles = item_feature['item_title'].tolist()
text_emb = text_model.encode(titles, batch_size=256, show_progress_bar=True)

# ==========================
# Encode images
# ==========================
def encode_image(img_path):
    """Returns CLIP embedding for a single image"""
    try:
        img = preprocess(Image.open(img_path).convert('RGB')).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = clip_model.encode_image(img)
        return emb.cpu().numpy().flatten()
    except Exception:
        # If the image is missing or corrupt, fall back to a zero vector
        return np.zeros(clip_model.visual.output_dim)

image_embs = []
for item_id in item_feature['item_id']:
    img_file = os.path.join(image_folder, f"image{item_id}.jpg")  # adjust filename if needed
    image_embs.append(encode_image(img_file))

image_embs = np.array(image_embs)
print(f"Image embeddings shape: {image_embs.shape}")


Image embeddings shape: (91717, 1024)
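
Since `encode_image` returns a zero vector when an image is missing or unreadable, failures are silent and the printed shape looks the same either way. A small sketch of one way to record which items fell back to zeros; `encode_all` and `fake_encoder` are hypothetical helpers written for illustration and are not part of the notebook:

```python
import numpy as np

def encode_all(paths, encode_one, dim):
    """Encode every path; return (embeddings, indices that fell back to zeros)."""
    embs = np.zeros((len(paths), dim), dtype=np.float32)
    failed = []
    for i, path in enumerate(paths):
        try:
            embs[i] = encode_one(path)
        except Exception:
            failed.append(i)  # row i stays all zeros
    return embs, failed

# Fake encoder standing in for the CLIP call; it fails on one path
def fake_encoder(path):
    if path == "missing.jpg":
        raise FileNotFoundError(path)
    return np.ones(4, dtype=np.float32)

embs, failed = encode_all(["a.jpg", "missing.jpg", "b.jpg"], fake_encoder, dim=4)
print(f"{len(failed)} of {len(embs)} images fell back to zeros: {failed}")
```

To use this pattern with the real model, the per-image encoder would need to raise on failure instead of catching the exception itself.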


In [7]:
print(item_info.columns)


Index(['item_id', 'item_tags', 'item_emb_d128'], dtype='object')


In [8]:
# ==========================
# Combine text + image embeddings
# ==========================
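# Re-encode all rows of item_feature. This repeats the work of the previous cell
# and can be skipped if text_emb and image_embs are still in memory.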
text_emb = text_model.encode(item_feature['item_title'].tolist(), batch_size=256, show_progress_bar=True)

image_embs = []
for item_id in item_feature['item_id']:
    img_file = os.path.join(image_folder, f"image{item_id}.jpg")
    image_embs.append(encode_image(img_file))
image_embs = np.array(image_embs)

multimodal_emb = np.concatenate([text_emb, image_embs], axis=1)
print(f"Combined embeddings shape: {multimodal_emb.shape}")  # should match item_info rows minus padding

# ==========================
# Add padding row at top
# ==========================
padding_row = np.zeros(multimodal_emb.shape[1])  # zeros for first row
multimodal_emb_full = np.vstack([padding_row, multimodal_emb])
print(f"After padding, shape: {multimodal_emb_full.shape}")  # now matches item_info rows

# ==========================
# PCA to 128-d
# ==========================
pca = PCA(n_components=128)
multimodal_emb_128 = pca.fit_transform(multimodal_emb_full)
print(f"PCA-reduced embeddings shape: {multimodal_emb_128.shape}")

# ==========================
# Update item_info and save
# ==========================
item_info['item_emb_d128'] = list(multimodal_emb_128)
item_info.to_parquet(output_path, index=False)
print(f"Multimodal item_info saved to {output_path}")

Combined embeddings shape: (91717, 1408)
After padding, shape: (91718, 1408)
PCA-reduced embeddings shape: (91718, 128)
Multimodal item_info saved to /kaggle/working/item_info_fused_multimodal.parquet
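
One detail worth checking: PCA subtracts the column means before projecting, so an all-zero padding row generally does not stay zero after `fit_transform`. If downstream code expects the padding item to remain a zero vector, one option is to fit PCA on the real items only and prepend the zero row afterwards. The sketch below uses random stand-in data with smaller, made-up dimensions; it is an alternative to consider, not what the cell above does.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
real_items = rng.normal(loc=1.0, size=(200, 32))  # stand-in for the 1,408-d fused embeddings

# Variant used in the notebook: the padding row is part of the PCA input
with_padding = np.vstack([np.zeros(32), real_items])
reduced_a = PCA(n_components=8).fit_transform(with_padding)

# Alternative: fit on real items only, then prepend an all-zero row
reduced_real = PCA(n_components=8).fit_transform(real_items)
reduced_b = np.vstack([np.zeros(8), reduced_real])

print("padding row norm, notebook variant:", np.linalg.norm(reduced_a[0]))
print("padding row norm, alternative:     ", np.linalg.norm(reduced_b[0]))
```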


In [9]:

# Load the fused multimodal item_info
item_info_fused = pd.read_parquet("/kaggle/working/item_info_fused_multimodal.parquet")

# See the first few rows
print(item_info_fused.head())


   item_id        item_tags                                      item_emb_d128
0        0  [0, 0, 0, 0, 0]  [-0.0930395362406103, -0.023277998403371275, 0...
1        1  [0, 0, 0, 0, 1]  [-0.012815042788755983, -0.09739979563827816, ...
2        2  [0, 0, 2, 3, 4]  [-0.04312385186445468, 0.007002069472291313, -...
3        3  [0, 0, 5, 6, 7]  [0.11733471775338727, -0.07496686939378722, -0...
4        4  [0, 0, 0, 8, 9]  [0.029414119909678172, -0.028581324888330612, ...


In [10]:
# Check the shape
print("Shape:", item_info_fused.shape)

Shape: (91718, 3)
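
The table still has three columns because each 128-d vector is stored as a single array-valued cell in item_emb_d128. To work with the embeddings as a matrix, the column can be stacked into a 2-D array. A sketch on a toy frame with made-up values (the real frame would come from the parquet file loaded above):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the saved layout: one array per row in the embedding column
toy = pd.DataFrame({
    "item_id": [0, 1, 2],
    "item_emb_d128": list(np.zeros((3, 128), dtype=np.float32)),
})

# Stack the per-row arrays into an (n_items, 128) matrix
emb_matrix = np.stack(toy["item_emb_d128"].to_numpy())
print(emb_matrix.shape)
```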


In [11]:
# Inspect the embedding column (first row only, as it's large)
print("First embedding vector:", item_info_fused['item_emb_d128'].iloc[0])

First embedding vector: [-9.30395362e-02 -2.32779984e-02  2.04750604e-01 -1.36218245e-01
  4.18094425e-02 -5.29150630e-02  4.71030199e-02  4.25416613e-02
 -4.98953277e-02 -5.23277428e-02  2.04498765e-02 -1.21913106e-02
  1.11224409e-02  2.01866605e-02  5.27886647e-03  1.03064325e-02
 -4.86732067e-02  7.22326592e-02  1.53338971e-02 -4.80879004e-02
 -4.73850613e-02 -1.37578837e-02  1.77517524e-02  2.07183565e-02
  7.84526965e-03 -7.24601752e-02  3.07343365e-04  5.18291762e-02
  8.13975265e-03 -2.34206490e-03 -4.06716391e-02  1.89334248e-02
  2.35710335e-02  2.93670449e-03  1.17712800e-02 -3.46356395e-02
 -4.27335892e-03  2.18394862e-02  1.59552658e-02  2.42888382e-02
  6.87711665e-03 -7.01798813e-03  4.11477817e-03 -1.97578338e-02
 -1.13116516e-02 -4.18479367e-02 -1.24200766e-02  5.30512977e-02
 -1.86224218e-02  6.18654530e-03  2.16350757e-02  1.46057335e-02
 -1.53896270e-03 -9.80813580e-03 -2.42925596e-02  9.57079043e-03
  1.41696380e-02  2.36932926e-02  1.97235621e-02  3.95591688e-03
 