# Data Preparation for Semantic ID Generation

1. Configuration Setup
   - Set category (e.g., `Video_Games`)
   - Define sequence length parameters (`MIN_SEQUENCE_LENGTH=3`, `MAX_SEQUENCE_LENGTH=100`)
2. Download and Load Data
   - Download 3 data files from the Amazon 2023 dataset:
     - Item metadata (`meta_{CATEGORY}.jsonl.gz`)
     - Review data (`review_{CATEGORY}.jsonl.gz`)
     - Sequences data (`{CATEGORY}.train.csv.gz`)
   - Unzip and save to the local data directory
3. Prepare Item Metadata
   - Load item metadata from JSONL
   - Process list columns into text fields:
     - `description_text` (join description list)
     - `features_text` (join features list)
     - `categories_text` (join categories with " > ")
   - Filter items requiring both title (>20 chars) and description (>100 chars)
   - Fill null values for all metadata fields
   - Create `item_context`, a concatenated string combining: product title, description, features, category, store, average rating, rating count, price
4. Load User Sequences
   - Load CSV with user interaction sequences (`user_id`, `parent_asin`, `rating`, `timestamp`, `history`)
   - Deduplicate by `user_id`, keeping the row with the longest history per user
   - Create the `sequence` column by appending the target item (`parent_asin`) to `history`
5. Filter Items Without Metadata
   - Remove items from sequences that don't exist in the valid items set
   - Filter out sequences shorter than `MIN_SEQUENCE_LENGTH`
6. Truncate Long Sequences
   - Truncate sequences longer than `MAX_SEQUENCE_LENGTH` (keep last N items)
   - Calculate sequence length statistics
7. Save Processed Data
   - Save filtered sequences to `{CATEGORY}_sequences.parquet` (`user_id`, `sequence`, `sequence_length`)
   - Save valid item metadata to `{CATEGORY}_items.parquet` (with `item_context` for semantic embedding)

In [1]:
# CATEGORY = "Baby_Products"
CATEGORY = "Video_Games"

# Define sequence lengths
MIN_SEQUENCE_LENGTH = 3
MAX_SEQUENCE_LENGTH = 100  # Adjust as needed

In [2]:
import sys
from pathlib import Path

NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent
sys.path.append(str(PROJECT_ROOT))

# Data directory
DATA_DIR = Path(PROJECT_ROOT, "data")
DATA_DIR.mkdir(exist_ok=True)

In [3]:
import gzip
import shutil
import urllib.request

import polars as pl

from src.logger import setup_logger

logger = setup_logger("dataprep")

In [4]:
# URLs for the data files
ITEMS_URL = f"https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/meta_categories/meta_{CATEGORY}.jsonl.gz"  # fmt: off
REVIEWS_URL = f"https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/{CATEGORY}.jsonl.gz"  # fmt: off
SEQUENCES_URL = f"https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/5core/timestamp_w_his/{CATEGORY}.train.csv.gz"  # fmt: off

## Download and load data

First, we download the gzipped files from the Amazon 2023 dataset and unzip them to the data directory. The JSONL files are then loaded with Polars' `read_ndjson` function, which reads newline-delimited JSON; the sequences CSV is loaded later with `read_csv`.
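To make the file format concrete, here is a minimal sketch that writes two made-up records to a gzipped NDJSON file, decompresses it with `gzip` and `shutil.copyfileobj`, and parses one JSON object per line. The record contents and the `gunzip_file` helper name are assumptions for illustration only; the real metadata files contain many more fields.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path


def gunzip_file(gz_path: Path, output_path: Path) -> None:
    """Stream-decompress gz_path into output_path without reading it all into memory."""
    with gzip.open(gz_path, "rb") as f_in, open(output_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Hypothetical records, used only to illustrate the one-object-per-line layout.
records = [
    {"parent_asin": "X1", "title": "Example item one"},
    {"parent_asin": "X2", "title": "Example item two"},
]

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "toy.jsonl.gz"
    jsonl_path = Path(tmp) / "toy.jsonl"

    # Write NDJSON: each line is a complete JSON object.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    gunzip_file(gz_path, jsonl_path)

    # Parse the decompressed file line by line.
    with open(jsonl_path, encoding="utf-8") as f:
        parsed = [json.loads(line) for line in f]

print(parsed == records)
```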

In [5]:
# Download and unzip the data files
def download_and_unzip(url, output_path):
    """Download a gzipped file and unzip it."""
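    # Temporary path for the compressed download. Appending to the full file name
    # (rather than swapping the suffix) keeps multi-suffix names such as
    # `.train.csv.gz` intact.
    gz_path = output_path.with_name(output_path.name + ".download")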

    logger.info(f"Downloading {url}...")
    urllib.request.urlretrieve(url, gz_path)
    logger.info(f"Downloaded to {gz_path}")

    # Unzip the file
    logger.info(f"Unzipping {gz_path}...")
    with gzip.open(gz_path, "rb") as f_in:
        with open(output_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    logger.info(f"Unzipped to {output_path}")

    # Remove the gzipped file
    gz_path.unlink()
    logger.info(f"Removed {gz_path}\n")


# Download item metadata
item_output_path = DATA_DIR / f"meta_{CATEGORY}.jsonl"
if not item_output_path.exists():
    download_and_unzip(ITEMS_URL, item_output_path)
else:
    logger.info(f"Item data already exists at {item_output_path}")

# Download review data
review_output_path = DATA_DIR / f"review_{CATEGORY}.jsonl"
if not review_output_path.exists():
    download_and_unzip(REVIEWS_URL, review_output_path)
else:
    logger.info(f"Review data already exists at {review_output_path}")

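# Download sequences data. Note: download_and_unzip decompresses the file, so it is
# stored as plain CSV even though the local name keeps the .csv.gz extension.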
sequences_output_path = DATA_DIR / f"{CATEGORY}.train.csv.gz"
if not sequences_output_path.exists():
    download_and_unzip(SEQUENCES_URL, sequences_output_path)
else:
    logger.info(f"Sequences data already exists at {sequences_output_path}")

18:08:21 - Item data already exists at /Users/eugeneyan/projects/semantic-id/data/meta_Video_Games.jsonl


18:08:21 - Review data already exists at /Users/eugeneyan/projects/semantic-id/data/review_Video_Games.jsonl


18:08:21 - Sequences data already exists at /Users/eugeneyan/projects/semantic-id/data/Video_Games.train.csv.gz


## Prepare item metadata

In [5]:
# Load item metadata (filtering on title and description happens in a later cell)
item_df = pl.read_ndjson(DATA_DIR / f"meta_{CATEGORY}.jsonl", ignore_errors=True)
logger.info(f"Total items in metadata: {len(item_df):,}")

# Check what columns are available
logger.info(f"Item metadata columns: {item_df.columns}")

# Join the list columns into single text fields; a null list becomes an empty string
item_df = item_df.with_columns(
    pl.col("description").list.join(" ").fill_null("").alias("description_text"),
    pl.col("features").list.join(" ").fill_null("").alias("features_text"),
    pl.col("categories").list.join(" > ").fill_null("").alias("categories_text"),
)

16:59:43 - Total items in metadata: 137,269
16:59:43 - Item metadata columns: ['main_category', 'title', 'average_rating', 'rating_number', 'features', 'description', 'price', 'images', 'videos', 'store', 'categories', 'details', 'parent_asin', 'bought_together']


In [None]:
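# Downsample to 1,000 random items for a quick test run. Because most items are
# dropped here, only a handful of sequences survive the metadata filter further down.
# Skip this cell for a full run.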
item_df = item_df.sample(1000)

In [9]:
item_df.shape

(1000, 17)

In [10]:
item_df.head(1).select("features").item()

"""Charge and Play: This joycon c…"
"""Easy Sliding in and out: Easy …"
"""Fast and Safe Charging: This J…"
"""LED Charging Indicator: When t…"
"""What You Get: 1 x Joycon grip,…"
"""Compatible devices: Game Conso…"


In [11]:
item_df.head(1).select("description_text").item()

'Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size controller, allows you to control at will while playing the game. The v shaped handle design is ergonomic, and more comfortable to hold.?Lightweight and Easy to Carry?This Joy Con charging grip is small and light, and easy to carry without gravity. With a card slot for easy installation of the wrist strap, more robust and stable. It is resistant to fall and wear, and comfortable to use.?Type C Interface?Using type C interface, it can be used with the charging cable of Switch. No need to prepare extra charging cables.?Multiple Colors to Choose from ?You can choose gradient colors, such as red blue, blue green and pink blue, or pure white. No matter what style your joycon is, you can find a color that matches it.'

In [12]:
item_df.head(1).select("features_text").item()

'Charge and Play: This joycon charging grip is designed for all joycons. It charges the joycon through the Type-C charging cable, so you can keep playing while charging Easy Sliding in and out: Easy sliding in ensures the safety of each Joy-Con. It can be installed without removing the protective cover of Joycon. The original size chute will not scratch the Joycon Fast and Safe Charging: This Joycon charger handle has a built-in smart security chip, which will automatically power off after fully charged in about 30 seconds, ensuring that your Joycon controller will not be overcharged LED Charging Indicator: When the LED charging indicator shows red, it means the Joycon is charging, and the indicator shows green when fully charged What You Get: 1 x Joycon grip, 2 x thumb grips Compatible devices: Game Consoles'

In [13]:
item_df.head(1).select("categories_text").item()

'Video Games > Nintendo Switch > Accessories'

In [14]:
# Keep items whose title is longer than 20 characters and whose joined
# description text is longer than 100 characters
original_num_items = len(item_df)
logger.info(f"Initial number of items in metadata: {original_num_items:,}")

item_df = item_df.filter(
    (pl.col("title").is_not_null())
    & (pl.col("title").str.len_chars() > 20)
    & (pl.col("description_text").is_not_null())
    & (pl.col("description_text").str.len_chars() > 100)
)

# Create set of valid item IDs
valid_items = set(item_df["parent_asin"].to_list())
logger.info(f"Items with valid metadata (title + description): {len(valid_items):,}")
logger.info(
    f"Items without valid metadata: {original_num_items - len(valid_items):,} ({(original_num_items - len(valid_items)) / original_num_items * 100:.1f}%)"
)

17:00:35 - Initial number of items in metadata: 1,000
17:00:35 - Items with valid metadata (title + description): 465
17:00:35 - Items without valid metadata: 535 (53.5%)


In [15]:
pl.Config.set_fmt_str_lengths(500)
item_df.select("description_text").head(10)

description_text
str
"""Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size controller, allows you to control at will while playing the game. The v shaped handle design is ergonomic, and more comfortable to hold.?Lightweight and Easy to Carry?This Joy Con charging grip is small and light, and easy to carry without gravity. With a card slot for easy installation of the wrist strap, more robust and stable. It is resistant to fall and wear, and comfortable to use.?Type C Interf…"
"""This Metal Hard Cover Aluminum is made by hard solid plastic while looks like aluminum metal to prevent your NDSi XL/LL console from damages in daily use. It Provides the maximum protection available while allowing easy access to all buttons and ports."""
"""Never miss a chance to dance! 40 new must-dance songs are coming to Just Dance 2021. (Nintendo Switch, Xbox One, PS4, Google Stadia, Xbox Series X, PS5) Whether they are chart-topping hits, great classics for families, viral internet phenomena or emerging artists you need to know: there is enough music for everyone in Just Dance 2021! Immerse yourself in a multitude of creative universes, including some created with innovative and original production techniques. Fancy a quick session? Use the qu…"
"""Product description Professor Theo has been kidnapped by the imperial forces of the evil empire. Now his robotic personal assistant, Marina Liteyears, must save the day! Marina will get some help from the troops who have remained loyal to King Aster. With their help, she'll grab, shake and throw her way to the Professor! Besides rescuing the Professor, Marina will also join in the effort to rebel against that nasty, horrible and evil empire. So, Grab a controller and help our Super Heroine shake…"
"""Product Features: Compatible: Xbox OneWorldwide voltage: 100-240v AC cable: 3.93 ft / 1.2mDC cable: 3.28 ft / 1.0mLess Noise: built-in cooling fan. Less noise, More quite and longer life. Product Specifications: Input Voltage: AC 100-240V/1.78A, 47-63Hz Output Voltage: DC 12V, 10.83A/220W; 5Vsb/1A Product Dimension(L*W*H): 170*75*50(mm) Package include: 1x Power Adapter 1x Power Cable (US Version)"""
"""A cursed gem that has left a trail of smoldering bodies in its wake is threatening the residents of the House of 1000 Doors, prompting the head of the mystical dwelling to summon Kate Reed to its aid. The House needs you. Will you answer its call?PC Minimum System Requirements: PC Recommended System Requirements: Processor: 1.5 GHz Processor: 1.5 GHz RAM: 512 MB RAM: 512 MB Hard Disk: 790 MB Hard Disk: 790 MB Video Card: 256 MB 3D video card Video Card: 256 MB 3D video card Suppo…"
"""Next Level Racing Challenger Simulator Cockpit (NLR-S016)Bring the race track to your home with the Next Level Racing Challenger Simulator Cockpit. The Challenger cockpit is the perfect solution to mount your steering wheel, pedals and even shifter in an authentic racing position to give you a true race car driving experience from the comfort of your home. The minimalist design of the Challenger cockpit provides you with a rigid and realistic racing experience without a huge footprint. The steer…"
"""Parrot Pals is the newest title in the Discovery Kids games line-up. Choose your favorite breed, give it a name, and start to take care of it. You can train your parrot and teach it new tricks. You can even speak to your parrot and it will repeat your words back to you."""
"""THEATRHYTHM FINAL FANTASY CURTAIN CALL expands on the original in every way with new gameplay modes, over 200 songs, and 60 playable characters. Featuring music that spans the full breadth of the FINAL FANTASY franchise, players tap along to the beats and harmonies as the adorable characters battle and quest through their worlds. The music of over 20 titles is brought together in one package, fusing together the moving scores, cinematic visuals, and role-playing elements the series is known for.…"
"""HyperX Cloud MIX Rose Gold is a wired Gaming Headset that pumps out rich hi-res audio at frequencies from 10Hz to 40kHz. Switch to Bluetooth mode for a lightweight, wireless headset for when you're on the go. HyperX custom-designed 40mm Dual chamber drivers separate the bass from the mids and highs for more distinction and clarity by reducing distortion. Cloud Mix is compatible with PC and console platforms with 3. 5mm Ports and Bluetooth Ready media devices."""


In [16]:
pl.Config.set_fmt_str_lengths(100)
item_df = item_df.select(
    "parent_asin",
    "title",
    "description_text",
    "features_text",
    "main_category",
    "categories_text",
    "store",
    "average_rating",
    "rating_number",
    "price",
)

item_df.head()

parent_asin,title,description_text,features_text,main_category,categories_text,store,average_rating,rating_number,price
str,str,str,str,str,str,str,f64,i64,f64
"""B097C888TS""","""ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joycon Charg…","""Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size con…","""Charge and Play: This joycon charging grip is designed for all joycons. It charges the joycon throug…","""Computers""","""Video Games > Nintendo Switch > Accessories""","""ThreeShip""",4.5,195,
"""B008BM78CC""","""Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple""","""This Metal Hard Cover Aluminum is made by hard solid plastic while looks like aluminum metal to prev…","""Classy and attractive Red Nintendo Ndsill Case protects your device against dirt, dust and bumps It …","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo DS > Accessories > Cases & Storage""","""uTrusted""",5.0,1,
"""B08GZY3VTZ""","""Just Dance 2021 (PS4)""","""Never miss a chance to dance! 40 new must-dance songs are coming to Just Dance 2021. (Nintendo Switc…","""Never miss a chance to dance 40 new must-dance songs are coming to Just Dance 2021 Exercising has ne…","""Video Games""","""Video Games > PlayStation 4 > Games""","""Ubisoft""",4.4,1329,19.99
"""B00002STF6""","""Mischief Makers - Nintendo 64""","""Product description Professor Theo has been kidnapped by the imperial forces of the evil empire. Now…","""Puzzle/action starring a robotic cleaning maid named Marina on a mission to rescue her kidnapped cre…","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo 64 > Games""","""Nintendo""",4.0,76,68.26
"""B0894MXDDS""","""Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable for Micros…","""Product Features: Compatible: Xbox OneWorldwide voltage: 100-240v AC cable: 3.93 ft / 1.2mDC cable: …","""[Durable &Less Noise] Great improvements have been made on the cooling fan and the inner structure o…","""Home Audio & Theater""","""Video Games > Xbox One > Accessories > Cables & Adapters > Adapters""","""PEOTURE""",3.5,134,


In [17]:
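# Fill nulls so the string concatenation in the next cell never propagates a null.
# The float columns (average_rating, price) are cast to strings first so that the
# dtype change is explicit.
item_df = item_df.with_columns(
    [
        pl.col("title").fill_null(""),
        pl.col("description_text").fill_null(""),
        pl.col("features_text").fill_null(""),
        pl.col("main_category").fill_null(""),
        pl.col("categories_text").fill_null(""),
        pl.col("store").fill_null(""),
        pl.col("average_rating").cast(pl.String).fill_null(""),
        pl.col("rating_number").fill_null(0),
        pl.col("price").cast(pl.String).fill_null(""),
    ]
)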

item_df.head()

parent_asin,title,description_text,features_text,main_category,categories_text,store,average_rating,rating_number,price
str,str,str,str,str,str,str,str,i64,str
"""B097C888TS""","""ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joycon Charg…","""Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size con…","""Charge and Play: This joycon charging grip is designed for all joycons. It charges the joycon throug…","""Computers""","""Video Games > Nintendo Switch > Accessories""","""ThreeShip""","""4.5""",195,""""""
"""B008BM78CC""","""Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple""","""This Metal Hard Cover Aluminum is made by hard solid plastic while looks like aluminum metal to prev…","""Classy and attractive Red Nintendo Ndsill Case protects your device against dirt, dust and bumps It …","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo DS > Accessories > Cases & Storage""","""uTrusted""","""5.0""",1,""""""
"""B08GZY3VTZ""","""Just Dance 2021 (PS4)""","""Never miss a chance to dance! 40 new must-dance songs are coming to Just Dance 2021. (Nintendo Switc…","""Never miss a chance to dance 40 new must-dance songs are coming to Just Dance 2021 Exercising has ne…","""Video Games""","""Video Games > PlayStation 4 > Games""","""Ubisoft""","""4.4""",1329,"""19.99"""
"""B00002STF6""","""Mischief Makers - Nintendo 64""","""Product description Professor Theo has been kidnapped by the imperial forces of the evil empire. Now…","""Puzzle/action starring a robotic cleaning maid named Marina on a mission to rescue her kidnapped cre…","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo 64 > Games""","""Nintendo""","""4.0""",76,"""68.26"""
"""B0894MXDDS""","""Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable for Micros…","""Product Features: Compatible: Xbox OneWorldwide voltage: 100-240v AC cable: 3.93 ft / 1.2mDC cable: …","""[Durable &Less Noise] Great improvements have been made on the cooling fan and the inner structure o…","""Home Audio & Theater""","""Video Games > Xbox One > Accessories > Cables & Adapters > Adapters""","""PEOTURE""","""3.5""",134,""""""


In [18]:
item_df = item_df.with_columns(
    pl.concat_str(
        [
            pl.lit("Product: "),
            pl.col("title"),
            pl.lit("\n\nDescription: "),
            pl.col("description_text"),
            pl.lit("\n\nFeatures: "),
            pl.col("features_text"),
            pl.lit("\n\nCategory: "),
            pl.col("main_category"),
            pl.lit(", Category tree: "),
            pl.col("categories_text"),
            pl.lit("\n\nStore: "),
            pl.col("store"),
            pl.lit("\n\nAverage rating: "),
            pl.col("average_rating"),
            pl.lit(", Rating count: "),
            pl.col("rating_number"),
            pl.lit("\n\nPrice: "),
            pl.col("price"),
        ]
    ).alias("item_context")
)

item_df.head()

parent_asin,title,description_text,features_text,main_category,categories_text,store,average_rating,rating_number,price,item_context
str,str,str,str,str,str,str,str,i64,str,str
"""B097C888TS""","""ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joycon Charg…","""Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size con…","""Charge and Play: This joycon charging grip is designed for all joycons. It charges the joycon throug…","""Computers""","""Video Games > Nintendo Switch > Accessories""","""ThreeShip""","""4.5""",195,"""""","""Product: ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joy…"
"""B008BM78CC""","""Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple""","""This Metal Hard Cover Aluminum is made by hard solid plastic while looks like aluminum metal to prev…","""Classy and attractive Red Nintendo Ndsill Case protects your device against dirt, dust and bumps It …","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo DS > Accessories > Cases & Storage""","""uTrusted""","""5.0""",1,"""""","""Product: Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple Description: This Metal Hard …"
"""B08GZY3VTZ""","""Just Dance 2021 (PS4)""","""Never miss a chance to dance! 40 new must-dance songs are coming to Just Dance 2021. (Nintendo Switc…","""Never miss a chance to dance 40 new must-dance songs are coming to Just Dance 2021 Exercising has ne…","""Video Games""","""Video Games > PlayStation 4 > Games""","""Ubisoft""","""4.4""",1329,"""19.99""","""Product: Just Dance 2021 (PS4) Description: Never miss a chance to dance! 40 new must-dance songs a…"
"""B00002STF6""","""Mischief Makers - Nintendo 64""","""Product description Professor Theo has been kidnapped by the imperial forces of the evil empire. Now…","""Puzzle/action starring a robotic cleaning maid named Marina on a mission to rescue her kidnapped cre…","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo 64 > Games""","""Nintendo""","""4.0""",76,"""68.26""","""Product: Mischief Makers - Nintendo 64 Description: Product description Professor Theo has been kid…"
"""B0894MXDDS""","""Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable for Micros…","""Product Features: Compatible: Xbox OneWorldwide voltage: 100-240v AC cable: 3.93 ft / 1.2mDC cable: …","""[Durable &Less Noise] Great improvements have been made on the cooling fan and the inner structure o…","""Home Audio & Theater""","""Video Games > Xbox One > Accessories > Cables & Adapters > Adapters""","""PEOTURE""","""3.5""",134,"""""","""Product: Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable f…"


In [19]:
logger.info(item_df.slice(0, 1).select("item_context").item())

17:00:37 - Product: ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joycon Charger for Switch, Switch Joycon Grip with 2 Thumb Grip Caps(Blue and Green)

Description: Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size controller, allows you to control at will while playing the game. The v shaped handle design is ergonomic, and more comfortable to hold.?Lightweight and Easy to Carry?This Joy Con charging grip is small and light, and easy to carry without gravity. With a card slot for easy installation of the wrist strap, more robust and stable. It is resistant to fall and wear, and comfortable to use.?Type C Interface?Using type C interface, it can be used with the charging cable of Switch. No need to prepare extra charging cables.?Multiple Colors to Choose from ?You can choose gradient colors, such as red blue, blue green and pink blue, or pure white. No matter what style your joycon is, you can find 

## Load sequences

In [20]:
# Load the sequences CSV (Polars' read_csv also handles gzip-compressed input)
df = pl.read_csv(DATA_DIR / f"{CATEGORY}.train.csv.gz")

# Display basic information about the dataset
logger.info(f"Dataset shape: {df.shape}")
logger.info(f"Columns: {df.columns}")

df.head()

17:00:38 - Dataset shape: (736827, 5)
17:00:38 - Columns: ['user_id', 'parent_asin', 'rating', 'timestamp', 'history']


user_id,parent_asin,rating,timestamp,history
str,str,f64,i64,str
"""AEVPPTMG43C6GWSR7I2UGRQN7WFQ""","""B08R5B7YS4""",1.0,1611459666223,
"""AEVPPTMG43C6GWSR7I2UGRQN7WFQ""","""B0863MT183""",4.0,1613701986538,"""B08R5B7YS4"""
"""AEVPPTMG43C6GWSR7I2UGRQN7WFQ""","""B08P8P7686""",5.0,1613702112995,"""B08R5B7YS4 B0863MT183"""
"""AEVPPTMG43C6GWSR7I2UGRQN7WFQ""","""B0B7LV3DN2""",4.0,1617641445475,"""B08R5B7YS4 B0863MT183 B08P8P7686"""
"""AEVPPTMG43C6GWSR7I2UGRQN7WFQ""","""B09WMQ6DXG""",5.0,1620231368468,"""B08R5B7YS4 B0863MT183 B08P8P7686 B0B7LV3DN2"""


In [21]:
# Deduplicate by user_id, keeping the row with the longest history
# First, calculate the length of each history
df = df.with_columns(
    pl.when(pl.col("history").is_null())
    .then(0)
    .otherwise(pl.col("history").str.count_matches(r"\S+"))
    .alias("history_length")
)

logger.info(f"Original dataset num rows: {df.shape[0]:,}")

# Sort by user_id and history_length (descending), then keep the first row per user.
# maintain_order=True keeps the result sorted by user_id.
df = (
    df.sort(["user_id", "history_length"], descending=[False, True])
    .group_by("user_id", maintain_order=True)
    .first()
    .drop("history_length")
)

logger.info(f"Deduplicated dataset num rows: {df.shape[0]:,}")
logger.info(f"Number of unique users: {df.n_unique('user_id'):,}")

17:00:38 - Original dataset num rows: 736,827


17:00:39 - Deduplicated dataset num rows: 91,562
17:00:39 - Number of unique users: 91,562


In [22]:
df

user_id,parent_asin,rating,timestamp,history
str,str,f64,i64,str
"""AE222HFZDH6BPTYFOUWGGU63YSIQ""","""B0BW17W9GM""",5.0,1593366227132,"""B082R1RGZF B07SNN8GV5 B01GY35T4S B07QX99XJJ"""
"""AE2252DKW4XJIZP5QPFMQVJBVRTA""","""B07JH3LSHN""",5.0,1562210954523,"""B0050SX4CI B002ORTCAQ B0090ECASW B004OYV7ZU B00F27JGVA B01N6N3J8D B01KV3BB0S B0C5K2TWD8"""
"""AE225O22SA7DLBOGOEIFL7FT5VYQ""","""B0053BCML6""",4.0,1370812046000,"""B00005YTYC B00029QOQS B0006B7DXA B001LETH2Q B0009XEC02 B000NNDN1M B00136MBHA B007VTVRFA"""
"""AE227CCN4C37WTOB3J2TZPOKLEQQ""","""B001QCWRWK""",1.0,1442558286000,"""B0049U4DXM B002V8KA72 B0000NSZMM B000KQLDP0 B0032C9V7G B001EYUPHO B000WCCURW B00LZVNWIA B00AYZMZ9K B…"
"""AE22BPPZGGRTSYOHK2J3LCG5HGAQ""","""B0053BCML6""",5.0,1418058672000,"""B00KVP3OY8 B07K3KHFSY B00KVP76G0 B00KVOVBGM"""
…,…,…,…,…
"""AHZZXIAECG2VICHCSJEF5DRYKKSQ""","""B0C1W4WV2B""",4.0,1609599510847,"""B004DGIYEG B01D8H09TS B0B7BSBCSC B09GYN7T42"""
"""AHZZXP52C2AFASKIZR44MMSPNNNA""","""B0BN942894""",5.0,1547328847950,"""B000S6AG9G B006PP3YMU B0129D5C8K B00897Z27C B000N5Z2L4 B00VULDPCI B073J375DQ"""
"""AHZZY32XFW5RNYNV3KCKCPQHODSA""","""B07F33X1M7""",5.0,1540016552944,"""B01IC2A28C B00YQ74LVM B00L59D9HG B07DPK5NPD B0748N6796"""
"""AHZZYR74YSTBTJMWGZFMU5B6HWZA""","""B0748N6796""",5.0,1565464967770,"""B00E369SDM B003O6G114 B0072HYRNK B002I0K6X6 B0050SX9I2 B00N4OBA8K B0BN2FNKLM B01D63UU52 B01H2YOA3O B…"


In [23]:
# Create sequences column by appending parent_asin to history as a list
df = df.with_columns(pl.col("history").str.split(" ").list.concat([pl.col("parent_asin")]).alias("sequence"))

df.head()

user_id,parent_asin,rating,timestamp,history,sequence
str,str,f64,i64,str,list[str]
"""AE222HFZDH6BPTYFOUWGGU63YSIQ""","""B0BW17W9GM""",5.0,1593366227132,"""B082R1RGZF B07SNN8GV5 B01GY35T4S B07QX99XJJ""","[""B082R1RGZF"", ""B07SNN8GV5"", … ""B0BW17W9GM""]"
"""AE2252DKW4XJIZP5QPFMQVJBVRTA""","""B07JH3LSHN""",5.0,1562210954523,"""B0050SX4CI B002ORTCAQ B0090ECASW B004OYV7ZU B00F27JGVA B01N6N3J8D B01KV3BB0S B0C5K2TWD8""","[""B0050SX4CI"", ""B002ORTCAQ"", … ""B07JH3LSHN""]"
"""AE225O22SA7DLBOGOEIFL7FT5VYQ""","""B0053BCML6""",4.0,1370812046000,"""B00005YTYC B00029QOQS B0006B7DXA B001LETH2Q B0009XEC02 B000NNDN1M B00136MBHA B007VTVRFA""","[""B00005YTYC"", ""B00029QOQS"", … ""B0053BCML6""]"
"""AE227CCN4C37WTOB3J2TZPOKLEQQ""","""B001QCWRWK""",1.0,1442558286000,"""B0049U4DXM B002V8KA72 B0000NSZMM B000KQLDP0 B0032C9V7G B001EYUPHO B000WCCURW B00LZVNWIA B00AYZMZ9K B…","[""B0049U4DXM"", ""B002V8KA72"", … ""B001QCWRWK""]"
"""AE22BPPZGGRTSYOHK2J3LCG5HGAQ""","""B0053BCML6""",5.0,1418058672000,"""B00KVP3OY8 B07K3KHFSY B00KVP76G0 B00KVOVBGM""","[""B00KVP3OY8"", ""B07K3KHFSY"", … ""B0053BCML6""]"


## Filter items without metadata

We need to filter out items from the history that don't have valid metadata (both title and description). This ensures we only work with items that have sufficient information for generating semantic representations.

In [24]:
# Function to filter sequence items based on valid metadata
def filter_sequence_items(sequence_list, valid_items_set):
    if sequence_list is None:
        return None

    # Filter the list to keep only items with valid metadata
    filtered_items = [item for item in sequence_list if item in valid_items_set]

    return filtered_items if filtered_items else None


# Filter sequences where metadata is missing
df = df.with_columns(
    pl.col("sequence")
    .map_elements(lambda x: filter_sequence_items(x, valid_items), return_dtype=pl.List(pl.String))
    .alias("sequence")
)

# Filter out rows where the filtered sequence is:
# 1. Null (no items with valid metadata remained)
# 2. Shorter than MIN_SEQUENCE_LENGTH
rows_before_filtering = df.shape[0]
logger.info(f"Rows before filtering: {rows_before_filtering:,}")

df = df.filter((pl.col("sequence").is_not_null()) & (pl.col("sequence").list.len() >= MIN_SEQUENCE_LENGTH))

# Log statistics
logger.info(f"Rows after filtering: {df.shape[0]:,}")
logger.info(
    f"Rows removed: {rows_before_filtering - df.shape[0]:,} ({(rows_before_filtering - df.shape[0]) / rows_before_filtering * 100:.1f}%)"
)

17:00:41 - Rows before filtering: 91,562
17:00:41 - Rows after filtering: 12
17:00:41 - Rows removed: 91,550 (100.0%)


## Truncate long sequences

For users whose sequences exceed `MAX_SEQUENCE_LENGTH`, we keep only the last `MAX_SEQUENCE_LENGTH` items. This bounds the sequence length and focuses on recent interactions.

In [25]:
# Calculate sequence lengths before truncation
df = df.with_columns(pl.col("sequence").list.len().alias("sequence_length_before"))

# Using Polars expressions for efficient truncation - take last N items
df = df.with_columns(pl.col("sequence").list.tail(MAX_SEQUENCE_LENGTH).alias("sequence"))

# Update sequence length for truncated sequences
df = df.with_columns(pl.col("sequence").list.len().alias("sequence_length"))

# Calculate truncation statistics
sequences_truncated = (df["sequence_length_before"] > MAX_SEQUENCE_LENGTH).sum()
pct_truncated = sequences_truncated / len(df) * 100

logger.info(f"Sequences truncated: {sequences_truncated:,} ({pct_truncated:.1f}%)")

# Drop the helper column now that the truncation statistics are computed
df = df.drop(["sequence_length_before"])

logger.info(
    f"Sequence lengths - Min: {df['sequence_length'].min()}, Max: {df['sequence_length'].max()}, Mean: {df['sequence_length'].mean():.1f}, Median: {df['sequence_length'].median()}"
)

17:00:42 - Sequences truncated: 0 (0.0%)
17:00:42 - Sequence lengths - Min: 3, Max: 5, Mean: 3.6, Median: 3.0


In [26]:
(
    df.group_by("sequence_length")
    .len()
    .with_columns((pl.col("len") / pl.sum("len")).alias("probability"))
    .sort("sequence_length")
    .with_columns(pl.col("probability").cum_sum().alias("cumulative_probability"))
    .head(10)
)

sequence_length,len,probability,cumulative_probability
u32,u32,f64,f64
3,8,0.666667,0.666667
4,1,0.083333,0.75
5,3,0.25,1.0


In [27]:
df = df.select(["user_id", "sequence", "sequence_length"])
df.head()

user_id,sequence,sequence_length
str,list[str],u32
"""AEEATBN5EQ44LM65IAU6S4GSBWHA""","[""B001EYUQAU"", ""B003GDJ07W"", ""B001E8WQGI""]",3
"""AENQ4UD7LRPE5DW5BFOTE2UCJRWA""","[""B00006GSNZ"", ""B0006VGY26"", … ""B001EYUQAU""]",5
"""AEWLQYBQDYWWUWK6UHHTNWO5AHYA""","[""B00008RUYZ"", ""B000GIXIPK"", ""B001EHD9A6""]",3
"""AFL2R6GJV4EBL4MEYY5QNV4RLWGQ""","[""B00006GSNZ"", ""B000GIXIPK"", ""B001G6062O""]",3
"""AFU5GOJMXLDLFVJYBNFFPZYYSLOA""","[""B004SL3LLW"", ""B002SUTB28"", ""B003LPTAL6""]",3


## Save the processed data

Now we'll save the filtered data for use in subsequent steps of the semantic ID generation pipeline.

In [28]:
df.head()

user_id,sequence,sequence_length
str,list[str],u32
"""AEEATBN5EQ44LM65IAU6S4GSBWHA""","[""B001EYUQAU"", ""B003GDJ07W"", ""B001E8WQGI""]",3
"""AENQ4UD7LRPE5DW5BFOTE2UCJRWA""","[""B00006GSNZ"", ""B0006VGY26"", … ""B001EYUQAU""]",5
"""AEWLQYBQDYWWUWK6UHHTNWO5AHYA""","[""B00008RUYZ"", ""B000GIXIPK"", ""B001EHD9A6""]",3
"""AFL2R6GJV4EBL4MEYY5QNV4RLWGQ""","[""B00006GSNZ"", ""B000GIXIPK"", ""B001G6062O""]",3
"""AFU5GOJMXLDLFVJYBNFFPZYYSLOA""","[""B004SL3LLW"", ""B002SUTB28"", ""B003LPTAL6""]",3


In [29]:
item_df.head()

parent_asin,title,description_text,features_text,main_category,categories_text,store,average_rating,rating_number,price,item_context
str,str,str,str,str,str,str,str,i64,str,str
"""B097C888TS""","""ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joycon Charg…","""Features:?V shaped Handle Design?The handle combines the left and right Joycons into a full size con…","""Charge and Play: This joycon charging grip is designed for all joycons. It charges the joycon throug…","""Computers""","""Video Games > Nintendo Switch > Accessories""","""ThreeShip""","""4.5""",195,"""""","""Product: ThreeShip Kawvisy Joycon Charging Grip Compatible with Nintendo Switch and Switch OLED, Joy…"
"""B008BM78CC""","""Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple""","""This Metal Hard Cover Aluminum is made by hard solid plastic while looks like aluminum metal to prev…","""Classy and attractive Red Nintendo Ndsill Case protects your device against dirt, dust and bumps It …","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo DS > Accessories > Cases & Storage""","""uTrusted""","""5.0""",1,"""""","""Product: Metal Hard Cover Aluminum for Nintendo Ndsi Dsi Xl LL Purple Description: This Metal Hard …"
"""B08GZY3VTZ""","""Just Dance 2021 (PS4)""","""Never miss a chance to dance! 40 new must-dance songs are coming to Just Dance 2021. (Nintendo Switc…","""Never miss a chance to dance 40 new must-dance songs are coming to Just Dance 2021 Exercising has ne…","""Video Games""","""Video Games > PlayStation 4 > Games""","""Ubisoft""","""4.4""",1329,"""19.99""","""Product: Just Dance 2021 (PS4) Description: Never miss a chance to dance! 40 new must-dance songs a…"
"""B00002STF6""","""Mischief Makers - Nintendo 64""","""Product description Professor Theo has been kidnapped by the imperial forces of the evil empire. Now…","""Puzzle/action starring a robotic cleaning maid named Marina on a mission to rescue her kidnapped cre…","""Video Games""","""Video Games > Legacy Systems > Nintendo Systems > Nintendo 64 > Games""","""Nintendo""","""4.0""",76,"""68.26""","""Product: Mischief Makers - Nintendo 64 Description: Product description Professor Theo has been kid…"
"""B0894MXDDS""","""Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable for Micros…","""Product Features: Compatible: Xbox OneWorldwide voltage: 100-240v AC cable: 3.93 ft / 1.2mDC cable: …","""[Durable &Less Noise] Great improvements have been made on the cooling fan and the inner structure o…","""Home Audio & Theater""","""Video Games > Xbox One > Accessories > Cables & Adapters > Adapters""","""PEOTURE""","""3.5""",134,"""""","""Product: Xbox One Power Supply Brick, Peoture Xbox AC Adapter Replacement Charger Power Cord Cable f…"


In [30]:
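# Save the filtered, truncated sequences
output_dir = DATA_DIR / "output"
output_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
output_path = output_dir / f"{CATEGORY}_sequences.parquet"
df.write_parquet(output_path)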
logger.info(f"Saved filtered sequences to: {output_path} (rows = {df.shape[0]:,})")

# Save the valid items metadata
metadata_output_path = DATA_DIR / "output" / f"{CATEGORY}_items.parquet"
item_df.write_parquet(metadata_output_path)
logger.info(f"Saved valid item metadata to: {metadata_output_path} (rows = {len(item_df):,})")

17:00:53 - Saved filtered sequences to: /Users/amirmasoud/personal/v_tests/semantic-ids-llm/data/output/Video_Games_sequences.parquet (rows = 12)
17:00:53 - Saved valid item metadata to: /Users/amirmasoud/personal/v_tests/semantic-ids-llm/data/output/Video_Games_items.parquet (rows = 465)


In [31]:
df.shape

(12, 3)

In [32]:
item_df.shape

(465, 11)