# Rank-Based Collaborative Filtering (5-core • TRAIN)

**Goal**
- Build a rank-based CF recommender that aggregates neighbors' positive preferences
  instead of predicting absolute ratings.

**What this notebook does**
1. Load 5-core **TRAIN** from `PROCESSED_DIR`.
2. Build sparse matrices: ratings (CSR) and binary preferences (P: rating ≥ threshold).
3. Fit user–user neighbors (cosine).
4. Rank items for a target user by *vote aggregation* from neighbors.
5. Return Top-N items (and provide a simple unit test).

**Why rank-based?**
- More robust to per-user rating-scale bias: only thresholded positive votes are aggregated, not absolute ratings.
- Simple, explainable, fast to deploy as a baseline.
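
The vote aggregation described in step 4 can be sketched on a toy example. This is a minimal illustration rather than the notebook's full pipeline: the matrix values and the similarities in `sims` are made up, and the only assumption is that neighbors vote with their cosine similarity as the weight.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary preference matrix P (3 neighbors x 4 items); 1 = rating >= threshold.
P = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
], dtype=np.float32))

# Hypothetical cosine similarities of the three neighbors to the target user.
sims = np.array([0.9, 0.5, 0.2], dtype=np.float32)

# Each neighbor casts a similarity-weighted vote for every item they liked.
scores = P.T.dot(sims)          # shape (n_items,)
ranking = np.argsort(-scores)   # best item first
print(scores, ranking)
```

Item 2 wins here because the two most similar neighbors both liked it, even though no absolute rating is ever predicted.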

### Task: Import modules and libraries

In [12]:
import os, sys, numpy as np, polars as pl, pickle, json
from pathlib import Path
from scipy.sparse import csr_matrix, save_npz, load_npz
from sklearn.neighbors import NearestNeighbors

# Add utilities to PYTHONPATH
module_path = os.path.abspath(os.path.join('..', '../utilities'))
if module_path not in sys.path:
    sys.path.append(module_path)

from logger import Logger
from configurations import Configurations

# Logger + paths + params (kept minimal)
logger = Logger(process_name="rank_based", log_file=Configurations.LOG_PATH)
PROCESSED_DIR = Path(Configurations.DATA_PROCESSED_PATH)
MODELS_DIR = Path(Configurations.MODELS_PATH)

CATEGORY       = Configurations.CATEGORIES
K_NEIGHBORS    = 30
N_RECS         = 10
POS_THRESHOLD  = 4.0
MEAN_CENTER    = True
MAX_USERS      = None
MAX_ITEMS      = None

logger.log_info(f"[Init] PROCESSED_DIR={PROCESSED_DIR} | models_dir={MODELS_DIR}")
logger.log_info(f"[Params] category={CATEGORY} | k={K_NEIGHBORS} | n_recs={N_RECS} | thr={POS_THRESHOLD} | mean_center={MEAN_CENTER}")


2025-09-28 10:41:38,259 - INFO - [Init] PROCESSED_DIR=/Users/kevin/Documents/GitHub/Python/VESKL/Personal/NEU/NEU/NEU_7275/Prj/Prj_1/APRS_7275_G6/Amazon-Product-Recommendation-System/data/processed | models_dir=/Users/kevin/Documents/GitHub/Python/VESKL/Personal/NEU/NEU/NEU_7275/Prj/Prj_1/APRS_7275_G6/Amazon-Product-Recommendation-System/models
2025-09-28 10:41:38,262 - INFO - [Params] category=['Electronics', 'Beauty_and_Personal_Care'] | k=30 | n_recs=10 | thr=4.0 | mean_center=True


### Define functions

#### Data Loader

In [19]:
def _candidate_filenames(category: str):
    safe = category.replace('/', '-')
    return [PROCESSED_DIR / f"{safe}.5core.train.parquet"]

def _coerce_timestamp_seconds(ts):
    """Accept pl.Expr or pl.Series. Return Expr or Series of Int64 seconds (None for invalid)."""
    def _conv(x):
        if x is None or x == "":
            return None
        try:
            v = float(x)
        except Exception:
            return None
        if v > 1e12:
            v = v // 1000
        return int(v)

    if isinstance(ts, pl.Expr):
        return ts.map_elements(lambda x: _conv(x), return_dtype=pl.Int64)
    if isinstance(ts, pl.Series):
        out = [_conv(x) for x in ts.to_list()]
        return pl.Series(out).cast(pl.Int64)
    try:
        return _coerce_timestamp_seconds(pl.Series(ts))
    except Exception:
        raise TypeError("ts must be a polars.Expr or polars.Series")

def load_5core_train(category: str) -> pl.DataFrame:
    expected = ['user_id', 'parent_asin', 'rating', 'timestamp', 'history']
    for p in _candidate_filenames(category):
        if p.exists() and p.stat().st_size > 0:
            logger.log_info(f"[Load] {category} ← {p.name}")
            df = pl.read_parquet(p)
            miss = [c for c in expected if c not in df.columns]
            if miss:
                raise ValueError(f"Missing {miss} in {p}")
            df = df.select(expected)
            # normalize & clip ratings to [1,5]
            df = df.with_columns([
                pl.col("rating").map_elements(
                    lambda x: None if x is None or x == "" else float(max(1.0, min(5.0, float(x)))),
                    return_dtype=pl.Float64
                ),
                _coerce_timestamp_seconds(pl.col("timestamp")).alias("timestamp"),
                pl.col("user_id").cast(pl.Utf8),
                pl.col("parent_asin").cast(pl.Utf8),
            ])
            logger.log_info(f"[Load] Done: shape={df.shape} | users={len(df.select('user_id').unique())} | items={len(df.select('parent_asin').unique())}")
            return df
    raise FileNotFoundError(f"5-core TRAIN not found for {category}")
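
The timestamp handling in the loader relies on a simple heuristic: values above `1e12` are treated as milliseconds and divided down to seconds. A standalone sketch of that rule, using a hypothetical helper name `to_epoch_seconds` and made-up sample values, might look like this:

```python
def to_epoch_seconds(x):
    """Return integer epoch seconds, or None when the value cannot be parsed."""
    if x is None or x == "":
        return None
    try:
        v = float(x)
    except (TypeError, ValueError):
        return None
    # Values above 1e12 are assumed to be milliseconds (same cutoff as the loader).
    if v > 1e12:
        v = v // 1000
    return int(v)

# Made-up inputs: seconds, milliseconds as a string, and three invalid values.
samples = [1695897698, "1695897698000", "", None, "n/a"]
converted = [to_epoch_seconds(s) for s in samples]
print(converted)
```

The second and first samples map to the same second, while unparseable inputs become `None` instead of raising.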


#### Build matrices (ratings CSR + binary P)

In [4]:
def build_matrices(df_train: pl.DataFrame,
                   pos_threshold: float = 4.0,
                   mean_center: bool = True,
                   max_users=None,
                   max_items=None):
    df = df_train
    # optional down-sample by first-seen unique ids
    # (maintain_order=True keeps first-seen order; plain unique() does not guarantee it)
    if max_users is not None:
        keep_u = df['user_id'].unique(maintain_order=True).to_list()[:max_users]
        df = df.filter(pl.col('user_id').is_in(keep_u))
    if max_items is not None:
        keep_i = df['parent_asin'].unique(maintain_order=True).to_list()[:max_items]
        df = df.filter(pl.col('parent_asin').is_in(keep_i))

    df = df.drop_nulls(subset=['user_id', 'parent_asin', 'rating'])
    df = df.with_columns(pl.col('rating').cast(pl.Float32))

    # canonical lists + indexers
    user_rev = np.array(df['user_id'].unique().to_list(), dtype=object)
    item_rev = np.array(df['parent_asin'].unique().to_list(), dtype=object)
    user_indexer = {u: i for i, u in enumerate(user_rev)}
    item_indexer = {a: i for i, a in enumerate(item_rev)}

    # codes + values
    u_codes = np.array([user_indexer[x] for x in df['user_id'].to_list()], dtype=np.int32)
    i_codes = np.array([item_indexer[x] for x in df['parent_asin'].to_list()], dtype=np.int32)
    vals = np.array(df['rating'].to_list(), dtype=np.float32)

    n_users = int(user_rev.size)
    n_items = int(item_rev.size)

    R = csr_matrix((vals, (u_codes, i_codes)), shape=(n_users, n_items), dtype=np.float32)
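    # Below-threshold entries are stored as explicit zeros, so P.nnz equals R.nnz;
    # call P.eliminate_zeros() if only positive entries should be counted.
    P = csr_matrix(((vals >= pos_threshold).astype(np.float32), (u_codes, i_codes)),
                   shape=(n_users, n_items), dtype=np.float32)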

    user_means = np.zeros(n_users, dtype=np.float32)
    Rc = None
    if mean_center and n_users > 0:
        Rc = R.copy().astype(np.float32)
        row_sums = np.array(R.sum(axis=1)).ravel().astype(np.float32)
        row_counts = np.diff(R.indptr).astype(np.int32)
        with np.errstate(divide='ignore', invalid='ignore'):
            user_means = np.where(row_counts > 0, row_sums / row_counts, 0.0).astype(np.float32)
        for uu in range(n_users):
            s, e = Rc.indptr[uu], Rc.indptr[uu + 1]
            if s < e:
                Rc.data[s:e] -= user_means[uu]

    user_rev_indexer = user_rev
    item_rev_indexer = item_rev

    logger.log_info(f"[Matrix] R shape={R.shape} nnz={R.nnz} | P nnz={P.nnz} | users={n_users} items={n_items}")
    return R, P, Rc, user_indexer, item_indexer, user_rev_indexer, item_rev_indexer, user_means
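
The per-user mean-centering step can also be written without a Python loop. The sketch below is one possible alternative, shown on a toy matrix; it relies on the fact that CSR keeps each row's stored values contiguous in `data`, so repeating each user mean by that user's rating count aligns with `data` element by element.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings: 3 users x 3 items; zeros in the dense input are not stored.
R = csr_matrix(np.array([
    [5, 0, 3],
    [0, 0, 0],   # user with no ratings
    [4, 4, 0],
], dtype=np.float32))

row_counts = np.diff(R.indptr)                       # ratings per user
row_sums = np.asarray(R.sum(axis=1)).ravel().astype(np.float32)
counts_f = row_counts.astype(np.float32)
# Users without ratings keep a mean of 0 instead of triggering a division by zero.
user_means = np.divide(row_sums, counts_f,
                       out=np.zeros_like(row_sums), where=counts_f > 0)

Rc = R.copy()
# One vectorized subtraction replaces the per-user loop.
Rc.data -= np.repeat(user_means, row_counts)
print(user_means, Rc.data)
```

Note that the last user rated everything identically, so their centered row consists of explicit zeros and carries no direction for cosine similarity.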


#### Fit neighbors (cosine) 

In [5]:
def fit_user_neighbors_for_rank(X: csr_matrix, k_neighbors: int = 30) -> NearestNeighbors:
    """Fit NearestNeighbors (cosine, brute) on user vectors (Rc or R)."""
    k = min(k_neighbors + 1, X.shape[0])
    nn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=k).fit(X)
    logger.log_info(f"[Neighbors] fitted on {X.shape} | k={k_neighbors}")
    return nn
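
To make the cosine neighbor search concrete, here is a small NumPy-only stand-in for what a brute-force cosine search computes. It is not the `NearestNeighbors` implementation; the function name `cosine_neighbors` and the toy vectors are assumptions for illustration.

```python
import numpy as np

# Toy user vectors (4 users x 3 items).
X = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],   # same direction as user 0
    [0.0, 1.0, 0.0],   # orthogonal to user 0
    [1.0, 1.0, 1.0],
])

def cosine_neighbors(X, uidx, k):
    """Return (indices, similarities) of the k users most similar to uidx."""
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0                 # avoid dividing by zero for empty rows
    sims = (X @ X[uidx]) / (norms * norms[uidx])
    sims[uidx] = -np.inf                    # never return the query user itself
    order = np.argsort(-sims)[:k]
    return order, sims[order]

idx, sims = cosine_neighbors(X, uidx=0, k=2)
print(idx, sims)
```

Cosine distance as reported by scikit-learn is `1 - similarity`, which is why the recommender converts distances back with `1.0 - dists`.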


#### Rank aggregation recommender 

In [7]:
def recommend_for_user_rank_based(
    user_id,
    R: csr_matrix,
    P: csr_matrix,
    Rc: csr_matrix,
    user_indexer: dict,
    item_indexer: dict,
    user_rev_indexer: np.ndarray,
    item_rev_indexer: np.ndarray,
    user_means: np.ndarray,
    nn_model: NearestNeighbors,
    top_k_neighbors: int = 30,
    top_n_items: int = 10
):
    if user_id not in user_indexer:
        raise KeyError(f"user_id {user_id} not found")

    uidx = user_indexer[user_id]
    X = Rc if Rc is not None else R

    # neighbors
    dists, inds = nn_model.kneighbors(X.getrow(uidx), return_distance=True)
    dists, inds = dists.ravel(), inds.ravel()
    mask = inds != uidx
    inds, dists = inds[mask], dists[mask]
    if inds.size > top_k_neighbors:
        inds, dists = inds[:top_k_neighbors], dists[:top_k_neighbors]
    sims = np.clip(1.0 - dists, 0.0, 1.0)

    # aggregate neighbor votes
    scores = P[inds, :].T.dot(sims)  # shape (n_items,)

    # candidates = items not rated by user
    n_items = R.shape[1]
    rated = set(R.getrow(uidx).indices.tolist())
    mask_rated = np.zeros(n_items, dtype=bool)
    if rated:
        mask_rated[list(rated)] = True
    cand_idx = np.where(~mask_rated)[0]
    if cand_idx.size == 0:
        return pl.DataFrame({"parent_asin": [], "score": []})

    cand_scores = scores[cand_idx]
    kth = min(top_n_items, cand_idx.size - 1)
    top_pos = np.argpartition(-cand_scores, kth)[:top_n_items]
    top_pairs = sorted(((int(cand_idx[i]), float(cand_scores[i])) for i in top_pos), key=lambda x: -x[1])

    rec_asins = [item_rev_indexer[i] for i, _ in top_pairs]
    rec_scores = [s for _, s in top_pairs]
    return pl.DataFrame({"parent_asin": rec_asins, "score": rec_scores})
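
The Top-N selection at the end of the recommender combines a partial sort with a mask for already-rated items. A compact sketch of that idea, with a hypothetical helper `top_n_unseen` and made-up scores, could be:

```python
import numpy as np

def top_n_unseen(scores, rated_idx, n):
    """Indices of the n highest-scoring items the user has not rated, best first."""
    cand = np.setdiff1d(np.arange(scores.size), rated_idx)
    if cand.size == 0:
        return cand
    n = min(n, cand.size)
    cand_scores = scores[cand]
    # argpartition places the n largest scores in the first n slots (unordered) ...
    part = np.argpartition(-cand_scores, n - 1)[:n]
    # ... so only those n entries need a full sort.
    part = part[np.argsort(-cand_scores[part], kind="stable")]
    return cand[part]

scores = np.array([0.2, 1.4, 0.0, 0.9, 1.1])   # made-up aggregated votes
top = top_n_unseen(scores, rated_idx=np.array([1]), n=2)
print(top)
```

Item 1 has the highest score but is excluded because the user already rated it, so the result starts with item 4.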


#### run_rank_base_CF

In [8]:
def run_rank_base_CF(
    category: str = None,
    k_neighbors: int = None,
    n_recs: int = None,
    pos_threshold: float = None,
    mean_center: bool = None,
    max_users: int = None,
    max_items: int = None,
    target_user=None
):
    """
    Compact end-to-end runner for rank-based CF using polars-only loader/matrices.
    Returns: (recs: pl.DataFrame, artifacts: dict)
    """
    _category = category if category is not None else (CATEGORY[0] if isinstance(CATEGORY, (list, tuple)) else CATEGORY)
    _k = k_neighbors if k_neighbors is not None else K_NEIGHBORS
    _n = n_recs if n_recs is not None else N_RECS
    _th = pos_threshold if pos_threshold is not None else POS_THRESHOLD
    _mc = mean_center if mean_center is not None else MEAN_CENTER

    logger.log_info(f"[Run-Rank] cat={_category} k={_k} n={_n} thr={_th} mean_center={_mc} max_users={max_users} max_items={max_items}")

    # load + build
    df_train = load_5core_train(_category)
    R, P, Rc, user_indexer, item_indexer, user_rev_indexer, item_rev_indexer, user_means = build_matrices(
        df_train, pos_threshold=_th, mean_center=_mc, max_users=max_users, max_items=max_items
    )

    # fit neighbors on mean-centered matrix when available
    X = Rc if (_mc and Rc is not None) else R
    nn_model = fit_user_neighbors_for_rank(X, k_neighbors=_k)

    # pick target user (polars -> python value)
    if target_user is None:
        try:
            target_user = df_train['user_id'].to_list()[0]
        except Exception:
            target_user = user_rev_indexer[0]

    # if not found, try string key then fallback to first
    if target_user not in user_indexer:
        tu = str(target_user)
        if tu in user_indexer:
            target_user = tu
        else:
            logger.log_warning("[Run-Rank] Provided target_user not found; using first user.")
            target_user = user_rev_indexer[0]

    logger.log_info(f"[Run-Rank] target_user={target_user}")

    recs = recommend_for_user_rank_based(
        user_id=target_user,
        R=R, P=P, Rc=Rc,
        user_indexer=user_indexer,
        item_indexer=item_indexer,
        user_rev_indexer=user_rev_indexer,
        item_rev_indexer=item_rev_indexer,
        user_means=user_means,
        nn_model=nn_model,
        top_k_neighbors=_k,
        top_n_items=_n
    )
    logger.log_info(f"[Run-Rank] Got {len(recs)} recs")

    artifacts = dict(
        R=R, P=P, Rc=Rc, user_indexer=user_indexer, item_indexer=item_indexer,
        user_rev_indexer=user_rev_indexer, item_rev_indexer=item_rev_indexer,
        user_means=user_means, nn_model=nn_model, df_train=df_train, target_user=target_user
    )
    return recs, artifacts


#### Recommend n product(s) for the user at index idx

In [17]:
def unit_test_rank_recommend_user_at_index(user_index: int, n_recs: int, category: str = None, k_neighbors: int = 30):
    """
    Recommend `n_recs` products for the user at position `user_index` (0-based)
    using rank-based CF. Uses polars only.
    """
    try:
        _, art = run_rank_base_CF(
            category=category,
            k_neighbors=k_neighbors,
            n_recs=n_recs,
            pos_threshold=POS_THRESHOLD,
            mean_center=MEAN_CENTER,
            max_users=MAX_USERS,
            max_items=MAX_ITEMS,
            target_user=None
        )
    except Exception as e:
        logger.log_exception(f"[UnitTest-Rank@index={user_index}] Build artifacts failed: {e}")
        raise

    user_rev_indexer = art["user_rev_indexer"]
    if not (0 <= user_index < len(user_rev_indexer)):
        msg = f"[UnitTest-Rank@index={user_index}] Invalid index. Range: [0, {len(user_rev_indexer)-1}]"
        logger.log_error(msg)
        raise IndexError(msg)

    target_user = user_rev_indexer[user_index]
    logger.log_info(f"[UnitTest-Rank@index={user_index}] target_user={target_user}")

    recs = recommend_for_user_rank_based(
        user_id=target_user,
        R=art["R"], P=art["P"], Rc=art["Rc"],
        user_indexer=art["user_indexer"],
        item_indexer=art["item_indexer"],
        user_rev_indexer=art["user_rev_indexer"],
        item_rev_indexer=art["item_rev_indexer"],
        user_means=art["user_means"],
        nn_model=art["nn_model"],
        top_k_neighbors=k_neighbors,
        top_n_items=n_recs
    )

    # validate (polars-only)
    if not isinstance(recs, pl.DataFrame):
        raise AssertionError("recs must be a polars.DataFrame")
    if not {"parent_asin", "score"}.issubset(set(recs.columns)):
        raise AssertionError("recs requires ['parent_asin','score']")
    if len(recs) > n_recs:
        recs = recs.head(n_recs)

    logger.log_info(f"[UnitTest-Rank@index={user_index}] {len(recs)} recs ✅")
    display(recs)
    return recs, target_user


#### Train models for a list of categories

In [18]:
def build_rank_matrices(df_train: pl.DataFrame, pos_threshold: float = 4.0, mean_center: bool = True):
    df = df_train.drop_nulls(subset=['user_id','parent_asin','rating'])
    users = df.select('user_id').unique().to_series().to_list()
    items = df.select('parent_asin').unique().to_series().to_list()
    user_idx = {u: idx for idx,u in enumerate(users)}
    item_idx = {i: idx for idx,i in enumerate(items)}
    u = np.array([user_idx[x] for x in df.select('user_id').to_series().to_list()], dtype=np.int32)
    i = np.array([item_idx[x] for x in df.select('parent_asin').to_series().to_list()], dtype=np.int32)
    v = np.array([None if x is None else float(x) for x in df.select('rating').to_series().to_list()], dtype=np.float32)
    nU, nI = len(users), len(items)
    R = csr_matrix((v, (u, i)), shape=(nU, nI), dtype=np.float32)
    P = csr_matrix(((v >= pos_threshold).astype(np.float32), (u, i)), shape=(nU, nI), dtype=np.float32)
    user_means = np.zeros(nU, dtype=np.float32)
    Rc = None
    if mean_center:
        Rc = R.copy().astype(np.float32)
        row_sums = np.array(R.sum(axis=1)).ravel().astype(np.float32)
        row_cnts = np.diff(R.indptr).astype(np.int32)
        with np.errstate(divide='ignore', invalid='ignore'):
            user_means = np.where(row_cnts>0, row_sums/row_cnts, 0.0).astype(np.float32)
        for uu in range(nU):
            s, e = Rc.indptr[uu], Rc.indptr[uu+1]
            if s < e: Rc.data[s:e] -= user_means[uu]
    user_rev = np.array(users, dtype=object)
    item_rev = np.array(items, dtype=object)
    logger.log_info(f"[Matrix-Rank] R{R.shape} nnz={R.nnz} | P nnz={P.nnz}")
    return R, P, Rc, user_idx, item_idx, user_rev, item_rev, user_means

def build_user_matrices(df_train: pl.DataFrame, mean_center: bool = True):
    df = df_train.drop_nulls(subset=['user_id','parent_asin','rating'])
    users = df.select('user_id').unique().to_series().to_list()
    items = df.select('parent_asin').unique().to_series().to_list()
    user_idx = {u: idx for idx,u in enumerate(users)}
    item_idx = {i: idx for idx,i in enumerate(items)}
    u = np.array([user_idx[x] for x in df.select('user_id').to_series().to_list()], dtype=np.int32)
    i = np.array([item_idx[x] for x in df.select('parent_asin').to_series().to_list()], dtype=np.int32)
    v = np.array([None if x is None else float(x) for x in df.select('rating').to_series().to_list()], dtype=np.float32)
    nU, nI = len(users), len(items)
    R = csr_matrix((v, (u, i)), shape=(nU, nI), dtype=np.float32)
    user_means = np.zeros(nU, dtype=np.float32)
    Rc = None
    if mean_center:
        Rc = R.copy().astype(np.float32)
        row_sums = np.array(R.sum(axis=1)).ravel().astype(np.float32)
        row_cnts = np.diff(R.indptr).astype(np.int32)
        with np.errstate(divide='ignore', invalid='ignore'):
            user_means = np.where(row_cnts>0, row_sums/row_cnts, 0.0).astype(np.float32)
        for uu in range(nU):
            s, e = Rc.indptr[uu], Rc.indptr[uu+1]
            if s < e: Rc.data[s:e] -= user_means[uu]
    user_rev = np.array(users, dtype=object)
    item_rev = np.array(items, dtype=object)
    logger.log_info(f"[Matrix-User] R{R.shape} nnz={R.nnz}")
    return R, Rc, user_idx, item_idx, user_rev, item_rev, user_means

def fit_neighbors(X: csr_matrix, k: int = 30) -> NearestNeighbors:
    nn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=min(k+1, X.shape[0]))
    nn.fit(X)
    logger.log_info(f"[NN] fitted on {X.shape} | k={k}")
    return nn

def _save_rank_artifacts(out_dir: Path, R, P, Rc, user_means, user_rev, item_rev, user_idx, item_idx, nn):
    out_dir.mkdir(parents=True, exist_ok=True)
    save_npz(out_dir / "R.npz", R); save_npz(out_dir / "P.npz", P)
    if Rc is not None: save_npz(out_dir / "Rc.npz", Rc)
    np.save(out_dir / "user_means.npy", user_means)
    with open(out_dir / "user_rev.pkl", "wb") as f: pickle.dump(user_rev, f)
    with open(out_dir / "item_rev.pkl", "wb") as f: pickle.dump(item_rev, f)
    with open(out_dir / "user_idx.json", "w") as f: json.dump({str(k): int(v) for k,v in user_idx.items()}, f)
    with open(out_dir / "item_idx.json", "w") as f: json.dump({str(k): int(v) for k,v in item_idx.items()}, f)
    with open(out_dir / "nn_model.pkl", "wb") as f: pickle.dump(nn, f)

def _save_user_artifacts(out_dir: Path, R, Rc, user_means, user_rev, item_rev, user_idx, item_idx, nn):
    out_dir.mkdir(parents=True, exist_ok=True)
    save_npz(out_dir / "R.npz", R)
    if Rc is not None: save_npz(out_dir / "Rc.npz", Rc)
    np.save(out_dir / "user_means.npy", user_means)
    with open(out_dir / "user_rev.pkl", "wb") as f: pickle.dump(user_rev, f)
    with open(out_dir / "item_rev.pkl", "wb") as f: pickle.dump(item_rev, f)
    with open(out_dir / "user_idx.json", "w") as f: json.dump({str(k): int(v) for k,v in user_idx.items()}, f)
    with open(out_dir / "item_idx.json", "w") as f: json.dump({str(k): int(v) for k,v in item_idx.items()}, f)
    with open(out_dir / "nn_model.pkl", "wb") as f: pickle.dump(nn, f)

def train_models_for_categories(
    categories: list[str],
    algo: str = "rank",
    k_neighbors: int = 30,
    pos_threshold: float = 4.0,
    mean_center: bool = True,
    models_path: str | Path | None = None,
    max_users: int | None = None,
    max_items: int | None = None
) -> pl.DataFrame:
    algo = algo.lower().strip()
    if algo not in {"rank"}:
        raise ValueError("algo must be 'rank'")
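    # honor the models_path argument; fall back to the configured MODELS_DIR
    base_out = Path(models_path) if models_path is not None else MODELS_DIR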
    base_out.mkdir(parents=True, exist_ok=True)
    out_algo = base_out / algo
    out_algo.mkdir(parents=True, exist_ok=True)
    rows = []
    for cat in categories:
        try:
            logger.log_info(f"=== [{algo.upper()}] Training category: {cat} ===")
            df_train = load_5core_train(cat)
            if max_users is not None or max_items is not None:
                if max_users is not None:
                    keep_users = df_train['user_id'].unique(maintain_order=True).to_list()[:max_users]
                    df_train = df_train.filter(pl.col('user_id').is_in(keep_users))
                if max_items is not None:
                    keep_items = df_train['parent_asin'].unique(maintain_order=True).to_list()[:max_items]
                    df_train = df_train.filter(pl.col('parent_asin').is_in(keep_items))
                logger.log_info(f"[Sample] {cat} → shape={df_train.shape}")
            out_dir = out_algo / cat
            if algo == "rank":
                R, P, Rc, user_idx, item_idx, user_rev, item_rev, user_means = build_rank_matrices(
                    df_train, pos_threshold=pos_threshold, mean_center=mean_center
                )
                X = Rc if Rc is not None else R
                nn = fit_neighbors(X, k=k_neighbors)
                _save_rank_artifacts(out_dir, R, P, Rc, user_means, user_rev, item_rev, user_idx, item_idx, nn)
                stats = dict(R_nnz=int(R.nnz), P_nnz=int(P.nnz), users=len(user_rev), items=len(item_rev))
            logger.log_info(f"[Saved] {algo} → {out_dir} | stats={stats}")
            rows.append({
                "category": cat, "algo": algo, "models_dir": str(out_dir),
                "k_neighbors": k_neighbors, "mean_center": mean_center, "pos_threshold": pos_threshold if algo=="rank" else None,
                **stats
            })
        except Exception as e:
            logger.log_exception(f"[Error] Training failed for {cat}: {e}")
            rows.append({"category": cat, "algo": algo, "models_dir": None, "k_neighbors": k_neighbors, "mean_center": mean_center, "pos_threshold": pos_threshold if algo=="rank" else None, "error": str(e)})
    summary = pl.from_pandas(__import__("pandas").DataFrame(rows)) if rows else pl.DataFrame(rows)
    logger.log_info(f"[Summary] Trained {len(categories)} categories. OK={(len([r for r in rows if r.get('models_dir')] ))} | FAIL={(len(rows)-len([r for r in rows if r.get('models_dir')]))}")
    return summary

summary_rank = train_models_for_categories(CATEGORY, algo="rank", k_neighbors=K_NEIGHBORS, pos_threshold=POS_THRESHOLD, mean_center=MEAN_CENTER, max_users=MAX_USERS, max_items=MAX_ITEMS)
display(summary_rank)
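
The artifact layout written by the training step (sparse matrices as `.npz`, index maps as JSON) can be checked with a small round trip. This is a sketch under assumptions: the user ids are hypothetical and a temporary directory stands in for the real models folder.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

R = csr_matrix(np.array([[5, 0], [0, 3]], dtype=np.float32))
user_idx = {"U1": 0, "U2": 1}   # hypothetical user ids

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Save the matrix and the id -> row index map side by side.
    save_npz(out_dir / "R.npz", R)
    with open(out_dir / "user_idx.json", "w") as f:
        json.dump(user_idx, f)

    # Load both back while the temporary directory still exists.
    R_loaded = load_npz(out_dir / "R.npz")
    with open(out_dir / "user_idx.json") as f:
        user_idx_loaded = json.load(f)

same_matrix = np.array_equal(R.toarray(), R_loaded.toarray())
print(same_matrix, user_idx_loaded)
```

JSON object keys are always strings, so this storage format only round-trips cleanly when ids are strings, as they are after the loader casts `user_id` and `parent_asin` to `Utf8`.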


2025-09-28 10:46:20,302 - INFO - === [RANK] Training category: Electronics ===
2025-09-28 10:46:20,303 - INFO - [Load] Electronics ← Electronics.5core.train.parquet
2025-09-28 10:46:26,679 - INFO - [Load] Done: shape=(12191484, 5) | users=1641026 | items=367052
2025-09-28 10:46:37,065 - INFO - [Matrix-Rank] R(1641026, 367052) nnz=12191484 | P nnz=12191484
2025-09-28 10:46:37,082 - INFO - [NN] fitted on (1641026, 367052) | k=30
2025-09-28 10:46:52,719 - INFO - [Saved] rank → /Users/kevin/Documents/GitHub/Python/VESKL/Personal/NEU/NEU/NEU_7275/Prj/Prj_1/APRS_7275_G6/Amazon-Product-Recommendation-System/models/rank/Electronics | stats={'R_nnz': 12191484, 'P_nnz': 12191484, 'users': 1641026, 'items': 367052}
2025-09-28 10:46:52,724 - INFO - === [RANK] Training category: Beauty_and_Personal_Care ===
2025-09-28 10:46:52,725 - INFO - [Load] Beauty_and_Personal_Care ← Beauty_and_Personal_Care.5core.train.parquet
2025-09-28 10:46:55,746 - INFO - [Load] Done: shape=(5165289, 5) | users=729576 | 

category,algo,models_dir,k_neighbors,mean_center,pos_threshold,R_nnz,P_nnz,users,items
str,str,str,i64,bool,f64,i64,i64,i64,i64
"""Electronics""","""rank""","""/Users/kevin/Documents/GitHub/…",30,True,4.0,12191484,12191484,1641026,367052
"""Beauty_and_Personal_Care""","""rank""","""/Users/kevin/Documents/GitHub/…",30,True,4.0,5165289,5165289,729576,207385


#### Serve a recommendation request from saved models

In [21]:
def _load_rank_artifacts(model_dir: str | Path):
    md = Path(model_dir)
    R = load_npz(md / "R.npz"); P = load_npz(md / "P.npz")
    Rc = load_npz(md / "Rc.npz") if (md / "Rc.npz").exists() else None
    user_means = np.load(md / "user_means.npy")
    with open(md / "user_rev.pkl", "rb") as f: user_rev = pickle.load(f)
    with open(md / "item_rev.pkl", "rb") as f: item_rev = pickle.load(f)
    with open(md / "user_idx.json", "r") as f: user_idx = {k: int(v) for k, v in json.load(f).items()}
    with open(md / "item_idx.json", "r") as f: item_idx = {k: int(v) for k, v in json.load(f).items()}
    with open(md / "nn_model.pkl", "rb") as f: nn_model = pickle.load(f)
    return dict(R=R, P=P, Rc=Rc, user_means=user_means,
                user_rev=user_rev, item_rev=item_rev,
                user_idx=user_idx, item_idx=item_idx, nn_model=nn_model)


def recommend_rank_ui(user_id: str,
                      n_recs: int = 5,
                      k_neighbors: int | None = None,
                      model_dir: str | Path | None = None) -> pl.DataFrame:
    base = MODELS_DIR / "rank"  # same layout that train_models_for_categories writes
    cat = CATEGORY[0] if isinstance(CATEGORY, (list, tuple)) else CATEGORY
    md = Path(model_dir) if model_dir is not None else (base / cat)
    art = _load_rank_artifacts(md)
    R, P, Rc = art["R"], art["P"], art["Rc"]
    nn_model = art["nn_model"]
    user_idx, item_rev = art["user_idx"], art["item_rev"]

    if user_id not in user_idx:
        logger.log_warning(f"[UI-Rank] user_id={user_id} not found.")
        return pl.DataFrame({"parent_asin": [], "score": []})

    uidx = user_idx[user_id]
    X = Rc if Rc is not None else R

    distances, indices = nn_model.kneighbors(X.getrow(uidx), return_distance=True)
    distances, indices = distances.ravel(), indices.ravel()
    mask = indices != uidx
    indices, distances = indices[mask], distances[mask]
    if k_neighbors is not None and indices.size > k_neighbors:
        indices, distances = indices[:k_neighbors], distances[:k_neighbors]
    sims = np.clip(1.0 - distances, 0.0, 1.0)

    scores = P[indices, :].T.dot(sims)

    rated_idx = R.getrow(uidx).indices
    cand_mask = np.ones(R.shape[1], dtype=bool)
    if rated_idx.size:
        cand_mask[rated_idx] = False
    cand_idx = np.where(cand_mask)[0]
    if cand_idx.size == 0:
        return pl.DataFrame({"parent_asin": [], "score": []})

    cand_scores = scores[cand_idx]
    kth = min(n_recs, cand_idx.size - 1)
    top_pos = np.argpartition(-cand_scores, kth)[:n_recs]
    top_pairs = sorted(((int(cand_idx[i]), float(cand_scores[i])) for i in top_pos), key=lambda x: -x[1])

    rec_asins = [item_rev[i] for i, _ in top_pairs]
    rec_scores = [s for _, s in top_pairs]
    return pl.DataFrame({"parent_asin": rec_asins, "score": rec_scores})


def unit_test_ui_rank_recommend(user_id: str,
                                n_recs: int = 5,
                                k_neighbors: int | None = None,
                                model_dir: str | Path | None = None):
    base = MODELS_DIR / "rank"  # same layout that train_models_for_categories writes
    cat = CATEGORY[0] if isinstance(CATEGORY, (list, tuple)) else CATEGORY
    md = Path(model_dir) if model_dir is not None else (base / cat)
    logger.log_info(f"[UnitTest-UI-Rank] model_dir={md} | user_id={user_id} | n_recs={n_recs} | k={k_neighbors}")

    recs = recommend_rank_ui(user_id=user_id, n_recs=n_recs, k_neighbors=k_neighbors, model_dir=md)

    assert isinstance(recs, pl.DataFrame), "recs must be a polars.DataFrame"
    assert {"parent_asin", "score"}.issubset(set(recs.columns)), "missing columns in recs"
    assert len(recs) <= n_recs, f"recs should have at most {n_recs} rows"

    logger.log_info(f"[UnitTest-UI-Rank] returned {len(recs)} items ✅")
    display(recs)
    return recs


### Unit test

#### Recommend 5 products for user at index 3

In [22]:
unit_test_rank_recommend_user_at_index(user_index=3, category=CATEGORY[0], n_recs=5)

2025-09-28 10:48:43,175 - INFO - [Run-Rank] cat=Electronics k=30 n=5 thr=4.0 mean_center=True max_users=None max_items=None
2025-09-28 10:48:43,176 - INFO - [Load] Electronics ← Electronics.5core.train.parquet
2025-09-28 10:48:49,838 - INFO - [Load] Done: shape=(12191484, 5) | users=1641026 | items=367052
2025-09-28 10:49:00,421 - INFO - [Matrix] R shape=(1641026, 367052) nnz=12191484 | P nnz=12191484 | users=1641026 items=367052
2025-09-28 10:49:00,434 - INFO - [Neighbors] fitted on (1641026, 367052) | k=30
2025-09-28 10:49:01,057 - INFO - [Run-Rank] target_user=AGCI7FAH4GL5FI65HYLKWTMFZ2CQ
2025-09-28 10:49:01,258 - INFO - [Run-Rank] Got 5 recs
2025-09-28 10:49:01,258 - INFO - [UnitTest-Rank@index=3] target_user=AHJWNHF3EMMZFAXIDXDEZHLPOBCQ
2025-09-28 10:49:01,419 - INFO - [UnitTest-Rank@index=3] 5 recs ✅


parent_asin,score
str,f64
"""B08RYF42S3""",0.0
"""B07G7T7PDL""",0.0
"""B09KKPTGLF""",0.0
"""B07HZ5L3N5""",0.0
"""B08DR4K78X""",0.0
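
All five scores above are 0.0, which by construction of the score (a sum of similarity-weighted positive votes) means no unrated item received a vote from a neighbor with positive similarity. One possible post-processing step, not part of the notebook, is to drop such unsupported items so that callers can fall back to another strategy. The helper name `drop_unsupported` and the item ids below are hypothetical.

```python
def drop_unsupported(recs, min_score=0.0):
    """Keep only (item, score) pairs whose score is strictly above min_score."""
    return [(item, score) for item, score in recs if score > min_score]

# Made-up recommendation lists: one with real support, one without any.
supported = [("item_a", 0.86), ("item_b", 0.47), ("item_c", 0.0)]
unsupported = [("item_d", 0.0), ("item_e", 0.0)]

kept = drop_unsupported(supported)
empty = drop_unsupported(unsupported)
print(kept, empty)
```

An empty result then signals that the neighborhood carried no usable signal for this user, rather than returning arbitrary zero-score items.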



#### Serve a request from saved models

In [23]:
unit_test_ui_rank_recommend(user_id="AE222HFVDJ4TJ4V2LDRIAMQM2RPA", n_recs=5, k_neighbors=30, model_dir=MODELS_DIR / "rank" / CATEGORY[0])

2025-09-28 10:49:13,457 - INFO - [UnitTest-UI-Rank] model_dir=/Users/kevin/Documents/GitHub/Python/VESKL/Personal/NEU/NEU/NEU_7275/Prj/Prj_1/APRS_7275_G6/Amazon-Product-Recommendation-System/models/rank/Electronics | user_id=AE222HFVDJ4TJ4V2LDRIAMQM2RPA | n_recs=5 | k=30
2025-09-28 10:49:15,994 - INFO - [UnitTest-UI-Rank] returned 5 items ✅


parent_asin,score
str,f64
"""B00L0YLRUW""",0.86388
"""B07H87G3RV""",0.474342
"""B011BRUOMO""",0.474342
"""B073JWXGNT""",0.474342
"""B01J94SWWU""",0.474342


