# AmazonReviews2023 - Handmade_Products (Filtered) - UniSRec + BLAIR

This notebook implements a BLAIR-based recommendation pipeline on the Handmade_Products domain of AmazonReviews2023, targeting a sequential / item recommendation task.
The dataset has been pre-filtered externally to this domain and is provided as local CSV files.

After cloning the repository, the workflow is as follows:

1. Upload the filtered review and metadata CSV files.
2. Create dataset/process_local_csv.py to convert the local CSV data into RecBole-compatible format.
3. Run training and evaluation with RecBole, saving models and metrics.
4. Export all relevant artefacts (processed data, logs, results) for reporting and submission.

## 0. Setup
Environment setup and dependency installation.

In [1]:
!pip install datasets==2.19.1 pyarrow_hotfix
!pip install recbole transformers sentencepiece tqdm

Collecting datasets==2.19.1
  Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow_hotfix
  Downloading pyarrow_hotfix-0.7-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec<=2024.3.1,>=2023.1.0 (from fsspec[http]<=2024.3.1,>=2023.1.0->datasets==2.19.1)
  Downloading fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Downloading pyarrow_hotfix-0.7-py3-none-any.whl (7.9 kB)
Downloading fsspec-2024.3.1-py3-none-any.whl (171 kB)
Installing collected packages: pyarrow_hotfix, fsspec, datasets
  Attemp

## 1. Imports
Import the core libraries, print their versions for reproducibility, and create a timestamped export directory for the results of this run.

In [2]:
import numpy as np
import pyarrow as pa
import pandas as pd
import datasets
import torch
import recbole

print("numpy:", np.__version__)
print("pyarrow:", pa.__version__)
print("pandas:", pd.__version__)
print("datasets:", datasets.__version__)
print("torch:", torch.__version__)
print("recbole:", recbole.__version__)

import os, json, time
from datetime import datetime

def now_id():
    return datetime.now().strftime('%Y%m%d_%H%M%S')

RUN_ID = now_id()
EXPORT_DIR = f"exports/{RUN_ID}"
os.makedirs(EXPORT_DIR, exist_ok=True)
print('RUN_ID =', RUN_ID)
print('EXPORT_DIR =', EXPORT_DIR)


numpy: 2.0.2
pyarrow: 18.1.0
pandas: 2.2.2
datasets: 2.19.1
torch: 2.9.0+cu126
recbole: 1.2.0
RUN_ID = 20251228_063300
EXPORT_DIR = exports/20251228_063300


## 2. Project repo layout
Clone the AmazonReviews2023 repository and navigate to the seq_rec_results directory for subsequent preprocessing and training steps.

In [3]:
%cd /content
!rm -rf AmazonReviews2023

!git clone https://github.com/hyp1231/AmazonReviews2023.git
%cd /content/AmazonReviews2023/seq_rec_results


/content
Cloning into 'AmazonReviews2023'...
remote: Enumerating objects: 127, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 127 (delta 25), reused 20 (delta 20), pack-reused 87 (from 1)
Receiving objects: 100% (127/127), 344.61 KiB | 15.66 MiB/s, done.
Resolving deltas: 100% (47/47), done.
/content/AmazonReviews2023/seq_rec_results


<div style="border-left: 6px solid #d93025; background: #fdecea; padding: 12px 14px; border-radius: 10px;">
  <b>🚨 IMPORTANT</b>
  <ul style="margin: 8px 0 0 18px;">
    <li><b>Upload</b> the filtered review CSV to <code>/content/AmazonReviews2023/seq_rec_results/local_data/review_filtered.csv</code> and the metadata CSV to <code>/content/AmazonReviews2023/seq_rec_results/local_data/metadata.csv</code>.</li>
    <li><b>Create</b> <code>dataset/process_local_csv.py</code> to convert the local CSV data into <b>RecBole-compatible format</b>.</li>
  </ul>
</div>


## 3. Train-Validation-Test Split
Because the data is already filtered before it is loaded into the notebook, only the split remains to be done: the interactions are sorted by timestamp and divided chronologically into training, validation, and test sets at an 8:1:1 ratio.


### Data sanity check

In [5]:
from pathlib import Path
import os
import pandas as pd

# You are currently in: /content/AmazonReviews2023/seq_rec_results
BASE = Path(".").resolve()

REVIEW_CSV = str(BASE / "local_data" / "review_filtered.csv")
META_CSV   = str(BASE / "local_data" / "metadata.csv")

assert os.path.exists(REVIEW_CSV), f"Missing: {REVIEW_CSV}"
assert os.path.exists(META_CSV),   f"Missing: {META_CSV}"

df_r = pd.read_csv(REVIEW_CSV)
df_m = pd.read_csv(META_CSV)

display(df_r.head())
display(df_m.head())

print("reviews shape :", df_r.shape)
print("meta shape    :", df_m.shape)

# best-effort columns
uid_col = "user_id" if "user_id" in df_r.columns else None
item_col = "parent_asin" if "parent_asin" in df_r.columns else ("asin" if "asin" in df_r.columns else None)
rating_col = "rating" if "rating" in df_r.columns else None

if uid_col and item_col:
    print("n_users:", df_r[uid_col].nunique())
    print("n_items:", df_r[item_col].nunique())
    # interactions per user/item
    u_cnt = df_r.groupby(uid_col)[item_col].size()
    i_cnt = df_r.groupby(item_col)[uid_col].size()
    print("user interactions (min/median/max):", int(u_cnt.min()), float(u_cnt.median()), int(u_cnt.max()))
    print("item interactions (min/median/max):", int(i_cnt.min()), float(i_cnt.median()), int(i_cnt.max()))
else:
    print("⚠️ Could not infer user/item columns automatically. Please check df_r.columns:", list(df_r.columns))

if rating_col:
    display(df_r[rating_col].value_counts().sort_index())


Unnamed: 0,user_id,asin,parent_asin,rating,title,text,helpful_vote,verified_purchase,timestamp
0,AFZUK3MTBIBEDQOPAK3OATUOUKLA,B07PWBRXJG,B07PWBRXJG,5.0,fair trade necklace,"This is a very lovely, fair trade necklace mad...",1,True,2020-12-05 19:21:29
1,AFZUK3MTBIBEDQOPAK3OATUOUKLA,B0855GMD9K,B0855GMD9K,5.0,Set of 2 Fair Trade bracelets,I purchased about 5 different color combos of ...,7,True,2020-12-04 19:15:16
2,AFZUK3MTBIBEDQOPAK3OATUOUKLA,B07HN8PP1V,B07HN8PP1V,5.0,Green Fair Trade African bracelet,I purchased about 5 different color combos of ...,2,True,2020-12-04 19:14:01
3,AFZUK3MTBIBEDQOPAK3OATUOUKLA,B07PVC79VH,B07PVC79VH,5.0,Recycled fair trade African bracelet,"Very lovely, VERY easy to take on and off beca...",1,True,2020-12-04 19:11:46
4,AFZUK3MTBIBEDQOPAK3OATUOUKLA,B07T947ZG1,B07T947ZG1,5.0,African fair trade bracelets,I purchased about 5 different color combos of ...,2,True,2020-12-04 19:09:30


Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together
0,Handmade,Daisy Keychain Wristlet Gray Fabric Key fob La...,4.5,12,"['High Quality Fabrics', 'Antique Brass Metall...",['This charming Daisy Fabric Keychain wristlet...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],Generic,"['Handmade Products', 'Clothing, Shoes & Acces...",{'Package Dimensions': '8 x 4 x 0.85 inches; 0...,B07NTK7T5P,
1,Handmade,Anemone Jewelry Beauteous November Birthstone ...,4.1,10,"['Stunning gemstone and detailed design', 'Ban...",['Anemone brings this November birthstone ring...,69.0,[{'thumb': 'https://m.media-amazon.com/images/...,[],Anemone Jewelry,"['Handmade Products', 'Jewelry', 'Rings', 'Sta...","{'Department': 'womens', 'Date First Available...",B0751M85FV,
2,Handmade,Silver Triangle Earrings with Chevron Pattern,5.0,1,[],['These large silver triangles are stamped wit...,,[{'thumb': 'https://m.media-amazon.com/images/...,[],Zoë Noelle Designs,"['Handmade Products', 'Jewelry', 'Earrings', '...","{'Department': 'Women', 'Date First Available'...",B01HYNE114,
3,Handmade,"Las Vegas Subway Sign Print - Caesar's, Freemo...",1.0,1,[],"[""Subway Signs make fabulous wall art. A bit o...",19.95,[{'thumb': 'https://m.media-amazon.com/images/...,[],Generic,"['Handmade Products', 'Home & Kitchen', 'Artwo...","{'Date First Available': 'June 14, 2018'}",B07TKZF3Z1,
4,Handmade,Round Cut Cubic Zirconia Stud Earrings Fashion...,4.2,2,['⭐【Size】The remarkable cubic zirconia studs h...,[],14.99,[{'thumb': 'https://m.media-amazon.com/images/...,[],VDKIDKT,"['Handmade Products', 'Jewelry', 'Earrings', '...",{'Package Dimensions': '2.36 x 2.05 x 1.65 inc...,B0BKBJT5MM,


reviews shape : (32510, 9)
meta shape    : (164817, 14)
n_users: 11180
n_items: 15810
user interactions (min/median/max): 1 3.0 30
item interactions (min/median/max): 1 1.0 60


rating,count
1.0,1110
2.0,662
3.0,1224
4.0,2350
5.0,27164
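
The counts printed above imply a very sparse interaction matrix. As a quick worked check (a sketch that only reuses the printed figures; the variable names are illustrative and not part of the notebook), the density and the average number of interactions can be computed as follows:

```python
# Figures copied from the sanity-check output above.
n_interactions = 32510
n_users = 11180
n_items = 15810

# Density = observed interactions / all possible user-item pairs.
# This is an upper bound if the reviews contain repeated user-item pairs.
density = n_interactions / (n_users * n_items)
avg_per_user = n_interactions / n_users
avg_per_item = n_interactions / n_items

print(f"density      : {density:.4%}")
print(f"avg per user : {avg_per_user:.2f}")
print(f"avg per item : {avg_per_item:.2f}")
```

With roughly 0.018% of the user-item pairs observed and a median of one interaction per item, most items are cold, which is the setting where text-based item embeddings are expected to help.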


### Train-validation-test split

In [10]:
import os
import pandas as pd

# ====== EDIT THESE ======
INPUT_CSV = "/content/AmazonReviews2023/seq_rec_results/local_data/review_filtered.csv"   # <-- edit this path
OUT_DIR   = "dataset/processed/Handmade_Products_local/_raw_time_split_8_1_1"  # <-- edit if desired
# ========================

os.makedirs(OUT_DIR, exist_ok=True)

df = pd.read_csv(INPUT_CSV)

# sort by time only (so that the temporal split is correct)
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)

n = len(df)
n_train = int(n * 0.8)
n_valid = int(n * 0.1)

train_df = df.iloc[:n_train].copy()
valid_df = df.iloc[n_train:n_train + n_valid].copy()
test_df  = df.iloc[n_train + n_valid:].copy()

train_path = os.path.join(OUT_DIR, "train.csv")
valid_path = os.path.join(OUT_DIR, "valid.csv")
test_path  = os.path.join(OUT_DIR, "test.csv")

train_df.to_csv(train_path, index=False)
valid_df.to_csv(valid_path, index=False)
test_df.to_csv(test_path, index=False)

print("[DONE] temporal split 8/1/1")
print("rows:", n, "| train:", len(train_df), "| valid:", len(valid_df), "| test:", len(test_df))
print("saved:", train_path, valid_path, test_path, sep="\n- ")


[DONE] temporal split 8/1/1
rows: 32510 | train: 26008 | valid: 3251 | test: 3251
saved:
- dataset/processed/Handmade_Products_local/_raw_time_split_8_1_1/train.csv
- dataset/processed/Handmade_Products_local/_raw_time_split_8_1_1/valid.csv
- dataset/processed/Handmade_Products_local/_raw_time_split_8_1_1/test.csv
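
A global temporal split guarantees that no training interaction is later than any validation or test interaction. The following is a minimal sketch of that property on toy data, written in plain Python; `temporal_split` is a hypothetical helper that mirrors the `int(n * 0.8)` / `int(n * 0.1)` arithmetic of the cell above and is not part of the repository.

```python
from datetime import datetime, timedelta

def temporal_split(records, train_frac=0.8, valid_frac=0.1):
    """Sort records by timestamp and cut them chronologically (hypothetical helper)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_valid],
        ordered[n_train + n_valid:],
    )

# Toy interactions: 20 events on distinct days, deliberately out of order.
start = datetime(2020, 1, 1)
toy = [
    {"user_id": f"u{i % 3}", "timestamp": start + timedelta(days=(7 * i) % 20)}
    for i in range(20)
]

train, valid, test = temporal_split(toy)

# No leakage: every split ends before the next one begins.
train_end = max(r["timestamp"] for r in train)
valid_start = min(r["timestamp"] for r in valid)
valid_end = max(r["timestamp"] for r in valid)
test_start = min(r["timestamp"] for r in test)

print(len(train), len(valid), len(test))
print(train_end <= valid_start, valid_end <= test_start)
```

Because the cut is made over all interactions rather than per user, one user's history can be spread across the three splits. The preprocessing step rebuilds sequences inside each split separately, so users with fewer than two interactions in a given split contribute no samples there.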


## 4. Build RecBole dataset
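Write the script dataset/process_local_csv.py (following the logic of dataset/process_amazon_2023.py, adapted to read the local CSV splits), then run it to generate:

- *.train/valid/test.inter (interaction samples: user, item history, target item)
- *.item and *.data_maps (item information and the user/item ID mappings)
- *.feature (PLM embeddings of the item metadata text)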
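
Each row of an `.inter` file pairs an item history with the next item to predict. The sketch below illustrates, on a toy sequence, how one chronological item list expands into such rows and how the history is truncated to the most recent items. `expand_sequence` is an illustrative name chosen for this example; it is not a function from the repository.

```python
def expand_sequence(user_id, items, max_his_len=50):
    """Turn one chronological item list into (user, history, target) rows."""
    rows = []
    for i in range(1, len(items)):
        history = items[:i][-max_his_len:]  # keep only the most recent items
        rows.append((user_id, " ".join(history), items[i]))
    return rows

rows = expand_sequence("U1", ["A", "B", "C", "D"], max_his_len=2)

# Same header as the .inter files written by the preprocessing cell.
header = "user_id:token\titem_id_list:token_seq\titem_id:token"
print(header)
for user, history, target in rows:
    print(f"{user}\t{history}\t{target}")
```

A sequence of n items yields n - 1 rows, and a user with a single interaction yields none, which is why the number of samples is smaller than the number of raw interactions.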

In [11]:
import os, re, html, json
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel

# ================== CONFIG ==================
DOMAIN = "Handmade_Products_local"

SPLIT_DIR = "/content/AmazonReviews2023/seq_rec_results/dataset/processed/Handmade_Products_local/_raw_time_split_8_1_1"
TRAIN_CSV = os.path.join(SPLIT_DIR, "train.csv")
VALID_CSV = os.path.join(SPLIT_DIR, "valid.csv")
TEST_CSV  = os.path.join(SPLIT_DIR, "test.csv")

META_CSV = "/content/AmazonReviews2023/seq_rec_results/local_data/metadata.csv"

OUT_DIR = "/content/AmazonReviews2023/seq_rec_results/dataset/processed/Handmade_Products_local"
MAX_HIS_LEN = 50

PLM = "hyp1231/blair-roberta-base"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 16
# ===========================================

os.makedirs(OUT_DIR, exist_ok=True)

# --------- helpers (in the spirit of the All_Beauty preprocessing) ---------
def list_to_str(x):
    if isinstance(x, list):
        return list_to_str(", ".join(x))
    return x

def clean_text(raw_text):
    text = list_to_str(raw_text)
    if pd.isna(text):
        return ""
    text = html.unescape(str(text)).strip()
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"[\n\t]", " ", text)
    text = re.sub(r" +", " ", text)
    text = re.sub(r"[^\x00-\x7F]", " ", text)
    return text

def feature_process(feature):
    # handles the value types found in the local metadata: float/list/str/NaN
    if isinstance(feature, list):
        # check lists first: pd.isna() on a list returns an array, not a bool
        if len(feature) > 0:
            return ", ".join(clean_text(v) for v in feature) + ". "
        return clean_text(feature) + ". "
    if pd.isna(feature):
        return ""
    if isinstance(feature, float):
        return f"{feature}. "
    return clean_text(feature) + ". "

def build_item2meta(meta_df):
    features_needed = ["title", "features", "categories", "description"]
    item2meta = {}
    for _, row in meta_df.iterrows():
        pid = row["parent_asin"]
        meta_text = ""
        for f in features_needed:
            meta_text += feature_process(row.get(f, ""))
        item2meta[pid] = meta_text
    return item2meta

def truncate_history_str(hist_str, max_his_len):
    parts = str(hist_str).split(" ")
    parts = [p for p in parts if p]
    return " ".join(parts[-max_his_len:])

def filter_df_by_metadata(df, item2meta):
    # drop targets that have no metadata
    df = df[df["parent_asin"].isin(item2meta)].copy()

    # remove history items that have no metadata
    def _filt_hist(h):
        items = str(h).split(" ")
        items = [it for it in items if it in item2meta]
        return " ".join(items)

    df["history"] = df["history"].apply(_filt_hist)
    df = df[df["history"].str.len() > 0].copy()
    return df

def remap_id(datasets):
    user2id = {"[PAD]": 0}
    id2user = ["[PAD]"]
    item2id = {"[PAD]": 0}
    id2item = ["[PAD]"]

    for split in ["train", "valid", "test"]:
        df = datasets[split]
        for u, tgt, hist in zip(df["user_id"], df["parent_asin"], df["history"]):
            if u not in user2id:
                user2id[u] = len(id2user)
                id2user.append(u)
            if tgt not in item2id:
                item2id[tgt] = len(id2item)
                id2item.append(tgt)
            for it in str(hist).split(" "):
                if it and it not in item2id:
                    item2id[it] = len(id2item)
                    id2item.append(it)

    return {"user2id": user2id, "id2user": id2user, "item2id": item2id, "id2item": id2item}

def build_samples_from_interactions(df_inter):
    # df_inter: user_id, parent_asin, timestamp
    df_inter = df_inter.copy()
    df_inter["timestamp"] = pd.to_datetime(df_inter["timestamp"], errors="coerce")
    df_inter = df_inter.dropna(subset=["timestamp"])
    df_inter = df_inter.sort_values(["user_id", "timestamp"])

    rows = []
    for u, g in df_inter.groupby("user_id"):
        seq = g["parent_asin"].tolist()
        if len(seq) < 2:
            continue
        for i in range(1, len(seq)):
            hist = " ".join(seq[:i])
            rows.append((u, hist, seq[i]))
    return pd.DataFrame(rows, columns=["user_id", "history", "parent_asin"])

# ================== LOAD META ==================
meta_df = pd.read_csv(META_CSV)
if "parent_asin" not in meta_df.columns:
    raise ValueError("metadata.csv is missing the parent_asin column")

item2meta = build_item2meta(meta_df)

# ================== LOAD SPLITS (interaction raw) ==================
train_i = pd.read_csv(TRAIN_CSV)
valid_i = pd.read_csv(VALID_CSV)
test_i  = pd.read_csv(TEST_CSV)

need_cols = {"user_id", "parent_asin", "timestamp"}
for name, df in [("train", train_i), ("valid", valid_i), ("test", test_i)]:
    miss = need_cols - set(df.columns)
    if miss:
        raise ValueError(f"{name}.csv is missing columns {miss}. It must contain raw interactions.")

# ================== BUILD samples per split ==================
datasets = {
    "train": build_samples_from_interactions(train_i),
    "valid": build_samples_from_interactions(valid_i),
    "test":  build_samples_from_interactions(test_i),
}

# ================== FILTER + TRUNCATE ==================
for split in ["train", "valid", "test"]:
    df = datasets[split]
    df = filter_df_by_metadata(df, item2meta)
    df["history"] = df["history"].apply(lambda h: truncate_history_str(h, MAX_HIS_LEN))
    df = df[df["history"].str.len() > 0].copy()
    datasets[split] = df.reset_index(drop=True)

# ================== WRITE .inter ==================
for split in ["train", "valid", "test"]:
    out_path = os.path.join(OUT_DIR, f"{DOMAIN}.{split}.inter")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("user_id:token\titem_id_list:token_seq\titem_id:token\n")
        for u, hist, tgt in zip(datasets[split]["user_id"], datasets[split]["history"], datasets[split]["parent_asin"]):
            f.write(f"{u}\t{hist}\t{tgt}\n")

# ================== DATA MAPS + id2meta ==================
data_maps = remap_id(datasets)

id2meta = {0: "[PAD]"}
for asin, meta_text in item2meta.items():
    if asin not in data_maps["item2id"]:
        continue
    item_id = data_maps["item2id"][asin]
    id2meta[item_id] = meta_text
data_maps["id2meta"] = id2meta

with open(os.path.join(OUT_DIR, f"{DOMAIN}.data_maps"), "w", encoding="utf-8") as f:
    json.dump(data_maps, f)

# ================== GENERATE ITEM FEATURES (.feature) ==================
device = torch.device(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(PLM)
model = AutoModel.from_pretrained(PLM).to(device)
model.eval()

# 1-base: sorted_text[0] -> item_id=1
sorted_text = [data_maps["id2meta"].get(i, "") for i in range(1, len(data_maps["item2id"]))]

all_embeddings = []
for i in tqdm(range(0, len(sorted_text), BATCH_SIZE)):
    batch = sorted_text[i:i + BATCH_SIZE]
    inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    all_embeddings.append(outputs.last_hidden_state[:, 0, :].cpu().numpy())

all_embeddings = np.concatenate(all_embeddings, axis=0).astype(np.float32)
feature_path = os.path.join(OUT_DIR, f"{DOMAIN}.{PLM.split('/')[-1]}.feature")
all_embeddings.tofile(feature_path)

# ================== STATS ==================
print("DONE FULL PREPROCESS + FEATURE")
print("OUT:", OUT_DIR)
print("train samples:", len(datasets["train"]))
print("valid samples:", len(datasets["valid"]))
print("test  samples:", len(datasets["test"]))
print("#Users:", len(data_maps["user2id"]) - 1)
print("#Items:", len(data_maps["item2id"]) - 1)
print("feature:", feature_path)
print("avg meta char len:", float(np.mean([len(x) for x in sorted_text])) if len(sorted_text) else 0.0)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
100%|██████████| 900/900 [07:01<00:00,  2.14it/s]


DONE FULL PREPROCESS + FEATURE
OUT: /content/AmazonReviews2023/seq_rec_results/dataset/processed/Handmade_Products_local
train samples: 16278
valid samples: 1607
test  samples: 1749
#Users: 9398
#Items: 14388
feature: /content/AmazonReviews2023/seq_rec_results/dataset/processed/Handmade_Products_local/Handmade_Products_local.blair-roberta-base.feature
avg meta char len: 1105.6536697247707
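
The `.feature` file is written with `ndarray.tofile()`, so it is a headerless dump of float32 values and its shape has to be restored when it is read back. The sketch below shows one way to do that and to sanity-check the row count. It assumes an embedding width of 768 (the hidden size of a RoBERTa-base encoder, which is an assumption about the BLAIR checkpoint rather than something printed above) and uses a temporary file with random data instead of the real path; `load_feature_file` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np

HIDDEN_SIZE = 768  # assumption: embedding width of a RoBERTa-base encoder

def load_feature_file(path, hidden_size=HIDDEN_SIZE):
    """Read a raw float32 dump written with ndarray.tofile() back into a 2-D array."""
    flat = np.fromfile(path, dtype=np.float32)
    if flat.size % hidden_size != 0:
        raise ValueError(
            f"{flat.size} values do not divide evenly into rows of {hidden_size}"
        )
    return flat.reshape(-1, hidden_size)

# Round-trip demo with random data standing in for real embeddings.
n_items = 5
fake = np.random.rand(n_items, HIDDEN_SIZE).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.feature")
    fake.tofile(demo_path)
    restored = load_feature_file(demo_path)

print(restored.shape)
```

Applied to the real file, the number of rows should equal the reported `#Items` (14388), with row `i` holding the embedding of item ID `i + 1` because ID 0 is reserved for `[PAD]`.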


## 5. Training / Evaluation (UniSRec)
**The notebook uses a patched copy of run.py (written to run_with_user_pred.py) with:**

- NumPy 2.x compatibility fixes
- safe torch.load handling
- automatic PLM embedding attachment (*.feature) during dataset loading
- a per-user CSV export of recommendations and metrics

**After execution, it produces:**

- training/evaluation logs
- resolved config snapshots
- model checkpoints, all saved under saved/.

In [31]:
%%writefile run_with_user_pred.py
# ==== NumPy 2.0 compatibility patch for older RecBole ====
import numpy as np

# scalar-type aliases removed/changed in NumPy 2.0
if not hasattr(np, "float_"):
    np.float_ = np.float64
if not hasattr(np, "complex_"):
    np.complex_ = np.complex128
if not hasattr(np, "unicode_"):
    # NumPy 2.0: use str_ instead
    np.unicode_ = np.str_
if not hasattr(np, "bool_"):
    # np.bool8 is a deprecated alias that NumPy 2.x no longer provides; use the builtin bool
    np.bool_ = bool
if not hasattr(np, "int_"):
    np.int_ = np.int64

# deprecated short names that older libs assign/use
if not hasattr(np, "float"):
    np.float = float
if not hasattr(np, "complex"):
    np.complex = complex
if not hasattr(np, "unicode"):
    np.unicode = str
if not hasattr(np, "bool"):
    np.bool = bool
if not hasattr(np, "int"):
    np.int = int
# =========================================================

import os
import argparse
from logging import getLogger
from collections import defaultdict
import math

import torch
import pandas as pd

from recbole.config import Config
from recbole.data import data_preparation
from recbole.utils import init_seed, init_logger, set_color, get_trainer

from utils import get_model, create_dataset

# ==== PyTorch 2.6 compatibility patch for RecBole checkpoint loading ====
import inspect

_torch_load_orig = torch.load
def _torch_load_compat(*args, **kwargs):
    # PyTorch 2.6 changed the default of weights_only to True, which rejects RecBole checkpoints.
    # If torch.load accepts weights_only, force it to False unless the caller specified it.
    if "weights_only" in inspect.signature(_torch_load_orig).parameters and "weights_only" not in kwargs:
        kwargs["weights_only"] = False
    return _torch_load_orig(*args, **kwargs)

torch.load = _torch_load_compat
# =========================================================================


def _ndcg_at_k(rec_list, gt_set, k: int) -> float:
    k = min(k, len(rec_list))
    if k <= 0:
        return 0.0
    if not gt_set:
        return 0.0

    dcg = 0.0
    for i in range(k):
        if rec_list[i] in gt_set:
            dcg += 1.0 / math.log2(i + 2)

    ideal_hits = min(len(gt_set), k)
    idcg = 0.0
    for i in range(ideal_hits):
        idcg += 1.0 / math.log2(i + 2)

    return float(dcg / idcg) if idcg > 0 else 0.0


def _id2token_safe(dataset, field, ids):
    """
    Always return:
    - if ids is list/tuple/ndarray -> list[str]
    - if ids is scalar -> str
    """
    import numpy as np

    def _to_list_str(x):
        # x can be list/tuple/np.ndarray/torch tensor
        if isinstance(x, np.ndarray):
            x = x.tolist()
        if isinstance(x, torch.Tensor):
            x = x.detach().cpu().tolist()
        # now x should be list/tuple
        return [str(t) for t in list(x)]

    # list-like ids -> list[str]
    if isinstance(ids, (list, tuple, np.ndarray, torch.Tensor)):
        tokens = dataset.id2token(field, list(ids) if not isinstance(ids, torch.Tensor) else ids.detach().cpu().tolist())
        return _to_list_str(tokens)

    # scalar -> str
    tokens = dataset.id2token(field, [ids])
    if isinstance(tokens, np.ndarray):
        tokens = tokens.tolist()
    return str(tokens[0])


def save_user_evaluation_details(trainer, test_data, config: Config, out_csv_path: str, ks=(10, 50)):
    """
    Export in the same format as user_evaluation_details.csv:
    userID, recommended_items, groundtruth_items, recall@10, precision@10, ndcg@10, recall@50, precision@50, ndcg@50
    """
    logger = getLogger()
    device = config["device"]
    model = trainer.model.to(device)
    model.eval()

    dataset = test_data.dataset
    uid_field = dataset.uid_field
    iid_field = dataset.iid_field

    # history field used to mask items the user has already interacted with (if present)
    item_list_field = config["ITEM_LIST_FIELD"] if "ITEM_LIST_FIELD" in config else "item_id_list"

    max_k = max(ks)

    # per-user
    user2_recs = {}                  # uid(int) -> list[item_id(int)] top-max_k (stored once)
    user2_gt_items = defaultdict(list)  # uid(int) -> list[item_id(int)] (accumulated)

    with torch.no_grad():
        for batch in test_data:
            # with RecBole's sequential dataloader, a batch is usually a tuple: (interaction, ..., positive_u, positive_i)
            interaction = batch[0].to(device)
            positive_u = batch[2].to(device)
            positive_i = batch[3].to(device)

            scores = model.full_sort_predict(interaction)

            # mask PAD item id=0
            if scores.dim() == 2 and scores.size(1) > 0:
                scores[:, 0] = -1e12

            # mask history items, if present
            if item_list_field in interaction:
                hist = interaction[item_list_field].long()  # [B, L]
                B, L = hist.shape
                flat_hist = hist.view(-1)
                flat_row = torch.arange(B, device=device).unsqueeze(1).expand(B, L).reshape(-1)
                mask_ok = flat_hist > 0
                scores[flat_row[mask_ok], flat_hist[mask_ok]] = -1e12

            topk_items = torch.topk(scores, k=max_k, dim=1).indices  # [B, max_k]

            u_list = positive_u.detach().cpu().tolist()
            gt_list = positive_i.detach().cpu().tolist()
            recs_list = topk_items.detach().cpu().tolist()

            for u_id, gt_id, rec_ids in zip(u_list, gt_list, recs_list):
                u_id = int(u_id)
                gt_id = int(gt_id)
                if u_id == 0:
                    continue

                # store recs once per user (first time seen)
                if u_id not in user2_recs:
                    user2_recs[u_id] = rec_ids

                user2_gt_items[u_id].append(gt_id)

    # Build rows
    rows = []
    for u_id, rec_ids in user2_recs.items():
        gt_ids = user2_gt_items.get(u_id, [])

        # deduplicate ground-truth items while preserving order
        seen = set()
        gt_unique = []
        for x in gt_ids:
            if x not in seen:
                seen.add(x)
                gt_unique.append(x)

        user_token = _id2token_safe(dataset, uid_field, u_id)
        rec_tokens = _id2token_safe(dataset, iid_field, rec_ids)
        gt_tokens = _id2token_safe(dataset, iid_field, gt_unique)
        # _id2token_safe returns list[str] for list inputs; normalize defensively anyway
        rec_tokens = [str(x) for x in rec_tokens] if isinstance(rec_tokens, (list, tuple)) else [str(rec_tokens)]
        gt_tokens = [str(x) for x in gt_tokens] if isinstance(gt_tokens, (list, tuple)) else [str(gt_tokens)]

        gt_set = set(gt_tokens)

        row = {
            "userID": user_token,
            "recommended_items": ",".join(rec_tokens),
            "groundtruth_items": ",".join(gt_tokens),
        }

        for k in ks:
            rec_k = rec_tokens[:k]
            hits = sum(1 for x in rec_k if x in gt_set)
            denom_gt = len(gt_set)

            row[f"recall@{k}"] = float(hits / denom_gt) if denom_gt > 0 else 0.0
            row[f"precision@{k}"] = float(hits / k) if k > 0 else 0.0
            row[f"ndcg@{k}"] = _ndcg_at_k(rec_tokens, gt_set, k)

        rows.append(row)

    df = pd.DataFrame(rows)

    # fixed column order
    ordered_cols = ["userID", "recommended_items", "groundtruth_items"]
    for k in ks:
        ordered_cols += [f"recall@{k}", f"precision@{k}", f"ndcg@{k}"]
    df = df[ordered_cols]

    os.makedirs(os.path.dirname(out_csv_path) or ".", exist_ok=True)
    df.to_csv(out_csv_path, index=False, encoding="utf-8")

    logger.info(set_color("[SAVED user evaluation details]", "green") + f" -> {out_csv_path}")
    logger.info(set_color("Columns", "yellow") + f": {list(df.columns)}")
    logger.info(set_color("Rows", "yellow") + f": {len(df)}")

    # optional dict of overall means (one number per metric)
    overall = {c: float(df[c].mean()) for c in df.columns if "@" in c}
    logger.info(set_color("[OVERALL mean metrics]", "cyan") + f": {overall}")


def run_single(model_name, dataset_name, pretrained_file='', save_user_csv=True, **kwargs):
    props = ['config/overall.yaml', f'config/{model_name}.yaml']
    print(props)

    model_class = get_model(model_name)

    config = Config(
        model=model_class,
        dataset=dataset_name,
        config_file_list=props,
        config_dict=kwargs
    )

    init_seed(config['seed'], config['reproducibility'])
    init_logger(config)
    logger = getLogger()
    logger.info(config)

    dataset = create_dataset(config)
    logger.info(dataset)

    train_data, valid_data, test_data = data_preparation(config, dataset)

    model = model_class(config, train_data.dataset).to(config['device'])

    if pretrained_file != '':
        checkpoint = torch.load(pretrained_file, map_location=config['device'])
        logger.info(f'Loading from {pretrained_file}')
        model.load_state_dict(checkpoint['state_dict'], strict=False)

    logger.info(model)

    trainer = get_trainer(config['MODEL_TYPE'], config['model'])(config, model)

    best_valid_score, best_valid_result = trainer.fit(
        train_data,
        valid_data,
        saved=True,
        show_progress=config['show_progress']
    )

    test_result = trainer.evaluate(
        test_data,
        load_best_model=True,
        show_progress=config['show_progress']
    )

    logger.info(set_color('best valid ', 'yellow') + f': {best_valid_result}')
    logger.info(set_color('test result', 'yellow') + f': {test_result}')

    # === NEW: export in the same format as user_evaluation_details.csv ===
    if save_user_csv:
        out_dir = os.path.join(config['checkpoint_dir'], "per_user_outputs")
        out_csv = os.path.join(out_dir, f"{dataset_name}__{model_name}__user_evaluation_details.csv")
        save_user_evaluation_details(trainer, test_data, config, out_csv, ks=(10, 50))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', type=str, default='UniSRec')
    parser.add_argument('-d', type=str, default='Handmade_Products_local')
    parser.add_argument('-p', type=str, default='')
    parser.add_argument('--no_user_csv', action='store_true')

    args, _ = parser.parse_known_args()
    print(args)

    run_single(
        args.m,
        args.d,
        pretrained_file=args.p,
        save_user_csv=(not args.no_user_csv)
    )


Overwriting run_with_user_pred.py
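
To make the per-user columns of the exported CSV easier to interpret, here is a small worked example of Recall@k, Precision@k, and NDCG@k with binary relevance. It is a standalone sketch in plain Python that follows the same formulas as `_ndcg_at_k` and the metric loop in the script above; `user_metrics` is an illustrative name, not a RecBole function.

```python
import math

def user_metrics(recommended, ground_truth, k):
    """Recall@k, Precision@k and NDCG@k for one user with binary relevance."""
    gt = set(ground_truth)
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in gt)

    recall = hits / len(gt) if gt else 0.0
    precision = hits / k if k > 0 else 0.0

    # DCG rewards hits near the top: a hit at 0-based rank i adds 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(top_k) if item in gt)
    # Ideal DCG places all relevant items (at most len(top_k) of them) first.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(gt), len(top_k))))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, precision, ndcg

# One relevant item ("B"), recommended at rank 2 of the top 3.
recall, precision, ndcg = user_metrics(["A", "B", "C", "D"], ["B"], k=3)
print(f"recall@3={recall:.4f} precision@3={precision:.4f} ndcg@3={ndcg:.4f}")
```

In this example the single relevant item is found, so recall is 1, precision is 1/3, and NDCG is 1/log2(3) because the hit sits at the second position instead of the first.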


### Run training

In [36]:
!python run_with_user_pred.py -m UniSRec -d Handmade_Products_local


2025-12-28 08:19:43.455702: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1766909983.491995   28106 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1766909983.502880   28106 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1766909983.525908   28106 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1766909983.525944   28106 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1766909983.525952   28106 computation_placer.cc:177] computation placer alr

## 6. Collect results & export
The cell below attempts to collect commonly required artifacts for reporting or submission purposes, including:

- the most recent log file
- the configuration YAML files
- the model checkpoint
- the metrics/result files generated by RecBole



In [34]:
import shutil, glob, json
from pathlib import Path
from datetime import datetime

# ===== Required identifiers =====
DOMAIN = "Handmade_Products"
DOMAIN_LOCAL = "Handmade_Products_local"   # dataset name under dataset/processed
RUN_ID = RUN_ID  # already defined in an earlier cell
EXPORT_DIR = EXPORT_DIR  # already defined in an earlier cell

# ===== Export root =====
export_root = Path(EXPORT_DIR)
export_root.mkdir(parents=True, exist_ok=True)

# ===== Copy processed dataset =====
proc_src = Path("dataset/processed") / DOMAIN_LOCAL
if proc_src.exists():
    shutil.make_archive(
        str(export_root / f"processed_{DOMAIN_LOCAL}"),
        "zip",
        root_dir=str(proc_src)
    )
    print("✅ zipped processed dataset ->", export_root / f"processed_{DOMAIN_LOCAL}.zip")
else:
    print("⚠️ processed dataset dir not found:", proc_src)

# ===== Locate RecBole saved outputs =====
saved = Path("saved")
if saved.exists():
    candidates = [p for p in saved.rglob("*") if p.is_file()]
    candidates.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    print("saved files:", len(candidates))

    keep_ext = {".log", ".txt", ".yaml", ".yml", ".pth", ".pt", ".ckpt", ".json"}
    kept = 0
    for p in candidates:
        if p.suffix.lower() in keep_ext:
            rel = p.relative_to(saved)
            dst = export_root / "saved_subset" / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(p, dst)
            kept += 1
            if kept >= 50:
                break

    print(f"✅ copied {kept} saved files into", export_root / "saved_subset")
    shutil.make_archive(
        str(export_root / "saved_subset"),
        "zip",
        root_dir=str(export_root / "saved_subset")
    )
    print("✅ zipped saved_subset ->", export_root / "saved_subset.zip")
else:
    print("⚠️ RecBole saved/ folder not found. Maybe training did not run.")

# ===== Write run manifest =====
manifest = {
    "run_id": RUN_ID,
    "domain": DOMAIN,
    "dataset_name": DOMAIN_LOCAL,
    # the preprocessing cell stores the encoder name in PLM (there is no PLM_NAME variable)
    "plm": PLM if "PLM" in globals() else None,
    "review_csv": REVIEW_CSV,
    "meta_csv": META_CSV,
    "timestamp": datetime.now().isoformat(),
}
with open(export_root / "manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, ensure_ascii=False, indent=2)

print("✅ wrote manifest ->", export_root / "manifest.json")


✅ zipped processed dataset -> exports/20251228_063300/processed_Handmade_Products_local.zip
saved files: 6
✅ copied 5 saved files into exports/20251228_063300/saved_subset
✅ zipped saved_subset -> exports/20251228_063300/saved_subset.zip
✅ wrote manifest -> exports/20251228_063300/manifest.json
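
Before submitting, it can be useful to confirm what actually ended up in the export folder. The following sketch reads the manifest and lists the contents of every zip archive next to it. `summarize_export` is a hypothetical helper, and the demo runs against a temporary stand-in folder so that it works without the real artefacts.

```python
import json
import tempfile
import zipfile
from pathlib import Path

def summarize_export(export_dir):
    """Return the manifest plus the member names of every zip in export_dir."""
    export_dir = Path(export_dir)
    manifest = json.loads((export_dir / "manifest.json").read_text(encoding="utf-8"))
    archives = {}
    for zip_path in sorted(export_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            archives[zip_path.name] = sorted(zf.namelist())
    return manifest, archives

# Stand-in export folder so the sketch runs without the real artefacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "manifest.json").write_text(json.dumps({"run_id": "demo"}), encoding="utf-8")
    with zipfile.ZipFile(root / "saved_subset.zip", "w") as zf:
        zf.writestr("UniSRec.log", "placeholder")
    manifest, archives = summarize_export(root)

print(manifest)
print(archives)
```

In the notebook, the same function could be called as `summarize_export(EXPORT_DIR)` to check that the checkpoint, logs, and processed dataset are present in the archives.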
