## 07: Text Vectorization
Convert each product title (e.g., "Logitech C920x HD Pro Webcam") into a vector of 384 numbers that captures its meaning.

This is what allows our "Content Tower" to understand that "Webcam" and "Camera" are similar, even if they have different item IDs.
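
To make "similar" concrete: closeness between two vectors is commonly measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors in place of real 384-dimensional embeddings. The vector values and the `cosine_similarity` helper are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors standing in for real title embeddings.
webcam = [0.9, 0.1, 0.3, 0.0]
camera = [0.8, 0.2, 0.4, 0.1]
blender = [0.0, 0.9, 0.0, 0.7]

sim_related = cosine_similarity(webcam, camera)
sim_unrelated = cosine_similarity(webcam, blender)

print(f"webcam vs camera:  {sim_related:.3f}")
print(f"webcam vs blender: {sim_unrelated:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0. The Content Tower can use this property regardless of item IDs.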

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np
import torch
import warnings
import pandas as pd

In [3]:
# Suppress warnings
warnings.filterwarnings("ignore")

# --- 1. Load the Checkpoint Data ---
print("Loading checkpoint data...")
train_full = pd.read_parquet('../data/train_categorical.parquet')
val_full = pd.read_parquet('../data/val_categorical.parquet')
test_full = pd.read_parquet('../data/test_categorical.parquet')
print("Data loaded successfully!")


Loading checkpoint data...
Data loaded successfully!


Reduce the dataset size. At this stage, proving the concept and building the pipeline matters more than data volume, so we drop from about 2 million rows to 100,000 rows (a 5% sample of each split).

Run vectorization on the 100K-row dataset.

In [4]:
# --- 2. THE FIX: Create a "Mini" Dataset (5% of data) ---
print("\nCreating a MINI dataset for rapid prototyping...")
# We take 5% of the data (approx 80k train, 10k val, 10k test)
# random_state=42 ensures we get the same random rows every time
train_df = train_full.sample(frac=0.05, random_state=42)
val_df = val_full.sample(frac=0.05, random_state=42)
test_df = test_full.sample(frac=0.05, random_state=42)

print(f"New Training Size: {len(train_df)} rows (was {len(train_full)})")
print(f"New Validation Size: {len(val_df)} rows")
print(f"New Test Size: {len(test_df)} rows")


Creating a MINI dataset for rapid prototyping...
New Training Size: 80000 rows (was 1600000)
New Validation Size: 10000 rows
New Test Size: 10000 rows
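
The effect of `frac` and `random_state` can be checked on a small stand-in frame. This is a minimal sketch on a synthetic DataFrame. The `item_id` and `title` columns here are hypothetical and not the project schema.

```python
import pandas as pd

# Synthetic stand-in for train_full; column names are illustrative only.
full = pd.DataFrame({
    "item_id": range(1000),
    "title": [f"Product {i}" for i in range(1000)],
})

# Same frac and same random_state select the same rows in the same order.
sample_a = full.sample(frac=0.05, random_state=42)
sample_b = full.sample(frac=0.05, random_state=42)

print(len(sample_a))                          # 5% of the 1000 rows
print(sample_a.index.equals(sample_b.index))  # True when the selection is identical
print(sample_a.index[:3].tolist())            # original row labels are kept
```

Because `sample` keeps the original index labels, the sampled frames carry non-contiguous indices. Call `reset_index(drop=True)` if positional indices are needed later.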


In [5]:
# --- 3. Setup Model ---
print("\nLoading Model...")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

# --- 4. Vectorize ---
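def encode_titles_column(df, column_name='title'):
    """Encode one text column into a list of embedding vectors (one per row)."""
    print(f"Encoding {len(df)} titles...")
    # Defensive: replace missing titles with "" so encode() only receives strings
    titles_list = df[column_name].fillna("").astype(str).tolist()
    vectors = model.encode(titles_list, show_progress_bar=True, batch_size=64)
    # Split the 2-D result into one 1-D array per row so it fits in a DataFrame column
    return list(vectors)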

print("\n--- Starting Vectorization on MINI dataset ---")

print("Processing Train Set...")
train_df['title_vector'] = encode_titles_column(train_df)

print("Processing Val Set...")
val_df['title_vector'] = encode_titles_column(val_df)

print("Processing Test Set...")
test_df['title_vector'] = encode_titles_column(test_df)


Loading Model...
Using device: cpu

--- Starting Vectorization on MINI dataset ---
Processing Train Set...
Encoding 80000 titles...


Batches:   0%|          | 0/1250 [00:00<?, ?it/s]

Processing Val Set...
Encoding 10000 titles...


Batches:   0%|          | 0/157 [00:00<?, ?it/s]

Processing Test Set...
Encoding 10000 titles...


Batches:   0%|          | 0/157 [00:00<?, ?it/s]
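
The batch totals in the progress bars follow from `batch_size=64`: the number of batches is the title count divided by 64, rounded up to cover a final partial batch. A quick check of that arithmetic (the `num_batches` helper is illustrative, not a library function):

```python
import math

def num_batches(n_rows, batch_size=64):
    """Batches needed to cover n_rows, counting a final partial batch."""
    return math.ceil(n_rows / batch_size)

print(num_batches(80_000))  # 80000 / 64 divides evenly
print(num_batches(10_000))  # 10000 / 64 = 156.25, so one extra partial batch
```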

In [7]:
# --- 5. Save the FINAL Mini Files ---
print("\nSaving MINI files...")
# We save these as 'mini' so we know they are the small version
train_df.to_parquet('../data/100k/train_final_mini.parquet', index=False)
val_df.to_parquet('../data/100k/val_final_mini.parquet', index=False)
test_df.to_parquet('../data/100k/test_final_mini.parquet', index=False)


Saving MINI files...


In [10]:
train_df['title_vector'].head()

541200    [-0.055413358, -0.030980006, 0.054996975, 0.00...
750       [0.021162007, 0.03409558, -0.048250135, -0.000...
766711    [0.026660616, 0.035521336, -0.058954343, -0.00...
285055    [-0.08367935, 0.09948159, -0.027692752, -0.016...
705995    [-0.010657759, 0.0041562063, 0.040858876, -0.0...
Name: title_vector, dtype: object
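
Each cell of `title_vector` holds its own array, which is why the column dtype is `object`. Downstream code that needs a single 2-D matrix can stack the cells. This is a minimal sketch using random stand-in vectors. It assumes every row holds a 384-dimensional `float32` array, which should be verified against the real column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for train_df['title_vector'].tolist(): one 1-D array per row.
vector_cells = [rng.standard_normal(384).astype(np.float32) for _ in range(5)]

# Stack the per-row arrays into one (n_rows, 384) matrix.
title_matrix = np.stack(vector_cells)

print(title_matrix.shape)
print(title_matrix.dtype)
```

`np.stack` raises an error if any row has a different length, which makes it a cheap sanity check before feeding the matrix to a model.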