# Personalized recommendation

## Item-Item Collaborative Filtering on Purchases: predict next order

The item–item similarity matrix is built from the user–item interaction matrix; here the interactions are directed transitions between consecutive orders.

* **Item similarity**: “Find products that behave alike across customers, then recommend them to me based on what I bought.”


### 1) Data → Orders
- Each user $u$ has timestamped baskets $B_{u,t} \subseteq \mathcal{I}$.

**Why:** we predict the **next** basket from the **previous** one.
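
As a minimal sketch of this step, the snippet below groups raw transaction rows into time-ordered baskets per user. The row layout `(user, order, item, timestamp)` and the helper name `build_baskets` are assumptions made for illustration; the notebook itself does this with a pandas `groupby`.

```python
from collections import defaultdict

# Hypothetical toy rows: (user, order, item, timestamp); duplicates are allowed
rows = [
    ("u1", "o1", "A", 1),
    ("u1", "o1", "A", 1),   # same item twice in one order
    ("u1", "o1", "B", 1),
    ("u1", "o2", "C", 5),
    ("u2", "o3", "A", 2),
]

def build_baskets(rows):
    """Return {user: [(timestamp, item_set), ...]} sorted by timestamp."""
    items = defaultdict(set)   # (user, order) -> unique items
    ts = {}                    # (user, order) -> latest timestamp seen
    for user, order, item, t in rows:
        key = (user, order)
        items[key].add(item)
        ts[key] = max(ts.get(key, t), t)

    baskets = defaultdict(list)
    for (user, order), item_set in items.items():
        baskets[user].append((ts[(user, order)], item_set))
    for user in baskets:
        baskets[user].sort(key=lambda pair: pair[0])
    return dict(baskets)

baskets = build_baskets(rows)
```

Each basket is a set, so repeated lines of the same item within one order count once.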

---

### 2) Split (leave-last-order-out)
- Train: all $B_{u,t}$ with $t < t_u^{\max}$.
- Test: the last basket $T_u = B_{u,t_u^{\max}}$, only for users with $\ge 2$ orders.

**Why:** evaluates a real next-order scenario.
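
The split can be sketched in plain Python as follows. The toy `orders` list and the function name `leave_last_order_out` are hypothetical; the sketch assumes timestamps are unique per user, so "last" is unambiguous.

```python
from collections import defaultdict

# Hypothetical toy orders: (user, timestamp, basket)
orders = [
    ("a", 1, {"x", "y"}), ("a", 2, {"y"}), ("a", 3, {"z"}),
    ("b", 1, {"x"}), ("b", 2, {"z", "w"}),
    ("c", 1, {"x"}),   # single order: excluded from the split
]

def leave_last_order_out(orders, min_orders=2):
    """Return (train, test): train[u] is the list of earlier baskets, test[u] the last one."""
    by_user = defaultdict(list)
    for user, t, basket in orders:
        by_user[user].append((t, basket))

    train, test = {}, {}
    for user, history in by_user.items():
        if len(history) < min_orders:
            continue
        history.sort(key=lambda pair: pair[0])
        train[user] = [basket for _, basket in history[:-1]]
        test[user] = history[-1][1]
    return train, test

train, test = leave_last_order_out(orders)
```

Only users with at least `min_orders` orders appear in either dictionary, which mirrors the $\ge 2$ condition above.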

---

### 3) Sequential counts (directional edges)
For each consecutive pair $B_{u,t-1} \to B_{u,t}$, count all item transitions $i \to j$:
$$
n_{ij} \mathrel{+}= 1 \quad \forall\, i\in B_{u,t-1},\; j\in B_{u,t}.
$$

Totals:
$$
\text{out}_i=\sum_j n_{ij}, \qquad \text{in}_j=\sum_i n_{ij}.
$$

**Why:** captures “what people buy **after** what.”
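
A small sketch of the counting rule, assuming each user's history is already a time-ordered list of baskets. The names `count_transitions`, `out_deg`, and `in_deg` are illustrative stand-ins for $n_{ij}$, $\text{out}_i$, and $\text{in}_j$.

```python
from collections import Counter

def count_transitions(histories, allow_self=True):
    """Count i -> j over every consecutive basket pair of every user."""
    n = Counter()        # n[(i, j)]
    out_deg = Counter()  # out_i = sum_j n_ij
    in_deg = Counter()   # in_j  = sum_i n_ij
    for baskets in histories.values():
        for prev, cur in zip(baskets, baskets[1:]):
            for i in prev:
                for j in cur:
                    if i == j and not allow_self:
                        continue
                    n[(i, j)] += 1
                    out_deg[i] += 1
                    in_deg[j] += 1
    return n, out_deg, in_deg

# Hypothetical toy histories
histories = {
    "a": [{"x", "y"}, {"y"}, {"z"}],
    "b": [{"x"}, {"z", "w"}],
}
n, out_deg, in_deg = count_transitions(histories)
```

Because every increment of $n_{ij}$ also increments one out-degree and one in-degree, the three totals always agree.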

---

### 4) Edge weights (neighbors)
Smoothed next-order probability:
$$
p(j \mid i)=\frac{n_{ij}}{\text{out}_i+\lambda}.
$$

Popularity dampening:
$$
\tilde p(j \mid i)=\frac{p(j \mid i)}{\sqrt{\max(1,\text{in}_j)}}.
$$

Keep the top-$K$ targets $j$ per source $i$ (`K_NEIGHBORS` in the code).

**Why:** stable estimates; avoids over-favoring globally popular items.
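
To make the two formulas concrete, here is a small worked sketch. The function names are hypothetical; the inputs ($\text{out}_i = 28$, $\text{in}_j \in \{12, 1232\}$, $\lambda = 50$) are taken from rows of the transition table printed later in this notebook.

```python
from math import sqrt

def edge_weight(n_ij, out_i, in_j, lam=50, pop_discount=True):
    """Smoothed P(j|i), optionally divided by sqrt of the target's in-degree."""
    p = n_ij / (out_i + lam)
    if pop_discount:
        p /= sqrt(max(1, in_j))
    return p

def top_k(edges, k):
    """Keep the k highest-weight (target, weight) pairs."""
    return sorted(edges, key=lambda e: e[1], reverse=True)[:k]

# A rare target seen once vs. a popular target seen twice, same source item
rare = edge_weight(n_ij=1, out_i=28, in_j=12)       # (1/78) / sqrt(12)
popular = edge_weight(n_ij=2, out_i=28, in_j=1232)  # (2/78) / sqrt(1232)

ranked = top_k([("rare_item", rare), ("popular_item", popular)], k=1)
```

Despite having twice the raw count, the popular target ends up with the lower weight once the dampening is applied, which is the intended effect of the $\sqrt{\text{in}_j}$ term.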

---

### 5) Seeding (user context)
Use **last train order** as seed:
$$
S_u = B_{u,t_u^{\max}-1}.
$$

**Why:** aligns context with the very next purchase.

---

### 6) Scoring & ranking
Score each candidate $j$:
$$
s_u(j)=\sum_{i\in S_u}\tilde p(j\mid i).
$$

Allow $j\in S_u$ if repurchases are desired. Recommend the top-$K$ items by $s_u(j)$ (`TOPK` in the code, which is separate from the neighbor cutoff in step 4).

**Why:** combines signals from all items the user just bought.
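
The scoring rule can be sketched as below, assuming `neighbors` maps each source item to a list of `(target, weight)` pairs. The function name `score_candidates` and the toy weights are illustrative only.

```python
from collections import defaultdict

def score_candidates(seed, neighbors, k=10, allow_repurchase=False):
    """Sum edge weights from every seed item, then return the top-k (item, score) pairs."""
    S = set(seed)
    scores = defaultdict(float)
    for i in S:
        for j, w in neighbors.get(i, []):
            if not allow_repurchase and j in S:
                continue
            scores[j] += w
    # Sort by descending score; break ties by item id for determinism
    ranked = sorted(scores.items(), key=lambda x: (-x[1], x[0]))
    return ranked[:k]

# Hypothetical neighbor lists
neighbors = {
    "x": [("z", 0.30), ("w", 0.10), ("y", 0.05)],
    "y": [("z", 0.20), ("x", 0.40)],
}
recs = score_candidates(["x", "y"], neighbors, k=2)
recs_with_repurchase = score_candidates(["x", "y"], neighbors, k=2, allow_repurchase=True)
```

Item `z` receives weight from both seed items, so it outranks anything supported by only one of them.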

---

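### 7) Metrics
With $R_u$ the top-$K$ recommendations for user $u$ and $T_u$ the test basket:
$$
\mathbf{Hit@}K=\frac{\sum_u \lvert R_u\cap T_u\rvert}{\sum_u \lvert T_u\rvert},
\qquad
\mathbf{UserHitRate}=\frac{1}{\lvert U\rvert}\sum_u \mathbf{1}[R_u\cap T_u\neq\varnothing]
$$

Hit@$K$ is the share of test items that were recommended (an item-level recall); UserHitRate is the share of users with at least one hit.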

- **Cold share:** share of test items not seen in train.  
- **Seen-only:** compute metrics on $T_u$ intersected with seen items.

**Why:** checks accuracy overall, per-user success, and coverage.
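
A minimal sketch of these metrics on toy data, assuming `recs` and `targets` are dictionaries keyed by user. The function names and the `seen` set are hypothetical.

```python
def evaluate(recs, targets):
    """Return (Hit@K, UserHitRate) for recommendation lists vs. test baskets."""
    hits = total = users = users_hit = 0
    for u, tgt in targets.items():
        tgt = set(tgt)
        if not tgt:
            continue
        R = set(recs.get(u, []))
        users += 1
        hits += len(R & tgt)
        total += len(tgt)
        users_hit += bool(R & tgt)
    hit_at_k = hits / total if total else 0.0
    user_hit_rate = users_hit / users if users else 0.0
    return hit_at_k, user_hit_rate

def cold_share(targets, seen):
    """Share of test items that never appear in the training data."""
    items = [x for tgt in targets.values() for x in tgt]
    return sum(x not in seen for x in items) / len(items) if items else 0.0

recs = {"a": ["z", "w"], "b": ["q"]}
targets = {"a": {"z", "w", "v"}, "b": {"z"}}
seen = {"z", "w", "q"}

hit_at_k, user_hit_rate = evaluate(recs, targets)
cold = cold_share(targets, seen)
```

Here user `a` recovers two of three test items and user `b` none, so both metrics come out at one half, while the unseen item `v` accounts for a quarter of the test labels.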

---

### One-line intuition
“Learn **what tends to follow what** across users, then—given a user’s **last basket**—sum those **next-item probabilities** and pick the top ones.”


In [44]:
# %% [markdown]
# Penultimate→Last item–item CF (next-basket prediction)

# %%
import pandas as pd
from collections import Counter, defaultdict
from math import sqrt

# Paths
DATA_PATH = "/workspace/data/processed/transactions_clean.parquet"
OUTPUT_PATH = "/workspace/data/processed/personalized_cf_recs.parquet"

# Filters & hyperparams
BAD_GROUP_IDS = {"12025DK","12025FI","12025NO","12025SE","970300","459978"}

LAM_SEQ = 50
POP_DISCOUNT = True
K_NEIGHBORS = 100

TOPK = 10
MIN_SCORE = 0.0
ITEM_SUPPORT_THR = 20

# NEW: robustness guards (only these two added)
EDGE_SUPPORT_THR = 10       # at least 10 observations of i->j
MIN_OUT_DEG_I    = 10      # only trust sources with >=10 total transitions

ALLOW_SELF_TRANSITION = False   # do not learn i->i edges
EVAL_ALLOW_REPURCHASE = False   # eval: recommend only new items
EXPORT_ALLOW_REPURCHASE = False # prod export: no repurchases

USER_MIN_ORDERS = 2


In [45]:
# %%
cols = ["shopUserId", "orderId", "groupId", "created"]
tx = pd.read_parquet(DATA_PATH, columns=cols).copy()

tx["created"] = pd.to_datetime(tx["created"], errors="coerce")
tx["groupId"] = tx["groupId"].astype(str).str.strip()

# drop banned groups (empty or missing group ids are not filtered here)
tx = tx[~tx["groupId"].isin(BAD_GROUP_IDS)].reset_index(drop=True)

print(f"Rows: {len(tx):,} | Users: {tx['shopUserId'].nunique():,} | Orders: {tx['orderId'].nunique():,}")


Rows: 298,972 | Users: 61,452 | Orders: 109,808


Orders & split (leave‑last‑order‑out)

In [46]:
# %%
# One row per (user, order) with unique item set and order timestamp
o0 = (
    tx.groupby(["shopUserId", "orderId"], as_index=False)
      .agg(items=("groupId", lambda s: list(set(map(str, s)))),
           t=("created", "max"))
)

# Keep users with at least USER_MIN_ORDERS orders
user_order_counts = o0.groupby("shopUserId")["orderId"].transform("nunique")
o = o0[user_order_counts >= USER_MIN_ORDERS].copy().reset_index(drop=True)

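# Identify the last order timestamp per user
lt = o.groupby("shopUserId")["t"].transform("max")
# Note: orders that tie on the max timestamp are all kept, so a user can contribute more than one row here
test_orders = o[o["t"] == lt].copy()  # last order(s) per qualified user (evaluation target)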

print(
    "Qualified users:", o["shopUserId"].nunique(),
    "| Qualified orders:", len(o),
    "| Test (last) orders:", len(test_orders)
)


Qualified users: 24632 | Qualified orders: 72988 | Test (last) orders: 24637
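
The output above lists 24,637 test orders for 24,632 qualified users, which suggests that a few users have several orders sharing the same maximal timestamp. The sketch below shows, on made-up data, how the equality filter keeps tied orders and one possible way to pick exactly one last order per user. Breaking ties by `orderId` is an assumption made here for determinism, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy orders; user "a" has two orders with the same latest timestamp
o = pd.DataFrame({
    "shopUserId": ["a", "a", "a", "b", "b"],
    "orderId":    ["o1", "o2", "o3", "o4", "o5"],
    "t": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-02-01",
                         "2024-01-05", "2024-01-09"]),
})

# Equality with the per-user max keeps both tied orders of user "a"
lt = o.groupby("shopUserId")["t"].transform("max")
tied = o[o["t"] == lt]

# Alternative: sort by (user, t, orderId) and keep the final row per user
one_per_user = (
    o.sort_values(["shopUserId", "t", "orderId"])
     .groupby("shopUserId", sort=False)
     .tail(1)
)
```

With the second approach the number of test orders always equals the number of qualified users.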


Penultimate → last pairs (seeds)

In [47]:
# %%
# Take the last two orders per user (penultimate, last)
last_two = (
    o.sort_values("t")
     .groupby("shopUserId", as_index=False, sort=False)
     .tail(2)
     .sort_values(["shopUserId", "t"])
)

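# For each user, prev = penultimate items, cur = last items.
# Note: cur is the same last basket that serves as the evaluation target later on,
# so transitions counted from these pairs are not held out from the test labels.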
pairs = (
    last_two.groupby("shopUserId", sort=False)
            .agg(prev=("items", lambda s: list(map(str, s.iloc[0]))),
                 cur =("items", lambda s: list(map(str, s.iloc[1]))))
)

# Seeds for evaluation = penultimate basket
last_seed_train = dict(zip(pairs.index, pairs["prev"]))

print("Train pairs (prev→cur):", len(pairs))
pairs.head()


Train pairs (prev→cur): 24632


| shopUserId | prev | cur |
|---|---|---|
| 100140 | [264549] | [264549] |
| 100157 | [240279] | [210789] |
| 100208 | [210707] | [210695, 210686, 241562] |
| 100844 | [250122, 260513] | [260513, 541419, 270544, 270696] |
| 100948 | [261890, 266882, 200187, 260205, 270599] | [291088, 293647, 292045, 291054] |


Sequential transitions (i → j)

In [48]:
# %%
# Support counted only from prev (sources of transitions)
isup = Counter()
for items in pairs["prev"]:
    isup.update(set(items))

# Items seen anywhere in last-two (prev ∪ cur)
seen_items = set(isup.keys()) | {x for xs in pairs["cur"] for x in xs}

# Candidate prediction targets must have at least ITEM_SUPPORT_THR occurrences as prev
candidates = {i for i, c in isup.items() if c >= ITEM_SUPPORT_THR}

print(
    "Seen items (prev∪cur):", len(seen_items),
    "| Candidate items (as targets):", len(candidates)
)


Seen items (prev∪cur): 1770 | Candidate items (as targets): 563


In [49]:
# %%
psup_seq = Counter()
isup_in = Counter()
isup_out = Counter()

for _, r in pairs.iterrows():
    prev = set(r["prev"])
    cur  = set(r["cur"])
    for i in prev:
        for j in cur:
            if (i == j) and (not ALLOW_SELF_TRANSITION):
                continue
            psup_seq[(i, j)] += 1
            isup_out[i] += 1
            isup_in[j]  += 1

print("Transitions learned (i→j):", len(psup_seq))


Transitions learned (i→j): 105437


In [50]:
# %%
def build_neighbors_seq(psup_seq, isup_in, isup_out, K=None, lam=None, pop_discount=None):
    if K is None: K = K_NEIGHBORS
    if lam is None: lam = LAM_SEQ
    if pop_discount is None: pop_discount = POP_DISCOUNT

    nei = defaultdict(list)
    for (i, j), n_ij in psup_seq.items():
        # NEW: hard filters to avoid spurious edges dominating
        if n_ij < EDGE_SUPPORT_THR:
            continue
        if isup_out[i] < MIN_OUT_DEG_I:
            continue

        # Smoothed conditional prob P(j|i)
        p = n_ij / (isup_out[i] + lam)
        if pop_discount:
            p /= sqrt(max(1, isup_in[j]))
        nei[i].append((j, p))

    # Sort & keep top-K per source
    for i in list(nei):
        nei[i] = sorted(nei[i], key=lambda x: x[1], reverse=True)[:K]
    return nei

# Build neighbor lists
neighbors = build_neighbors_seq(psup_seq, isup_in, isup_out)
len(neighbors)


154

In [51]:
# %%
from collections import defaultdict as ddict

def rec_seq(seed, k=TOPK, allow_repurchase=EVAL_ALLOW_REPURCHASE,
            thr=MIN_SCORE, use_candidates=True):
    S = set(seed)
    scores = ddict(float)
    for i in S:
        for j, w in neighbors.get(i, []):
            if (not allow_repurchase) and (j in S):
                continue
            if w < thr:
                continue
            if use_candidates and (j not in candidates):
                continue
            scores[j] += w
    ranked = [j for j, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]
    return ranked[:k]


In [52]:
# %%
hits = tot = users = users_hit = 0

# Build a (user -> last basket) map from test_orders
test_last = dict(zip(test_orders["shopUserId"], test_orders["items"]))

for u, seed in last_seed_train.items():
    tgt = set(test_last.get(u, []))
    if not seed or not tgt:
        continue
    users += 1
    R = set(rec_seq(seed, k=TOPK))
    users_hit += int(len(tgt & R) > 0)
    hits += sum(1 for x in tgt if x in R)
    tot  += len(tgt)

print(
    "Hit@10:", round(hits / tot if tot else 0.0, 4),
    "| Users:", users,
    "| UserHitRate:", round(users_hit / users if users else 0.0, 4)
)

# Cold-start share among test labels
tot_items = sum(len(x) for x in test_orders["items"])
cold_items = sum(1 for xs in test_orders["items"] for x in xs if x not in seen_items)
print("Cold share (test):", round(cold_items / tot_items if tot_items else 0.0, 4))


Hit@10: 0.0844 | Users: 24632 | UserHitRate: 0.1762
Cold share (test): 0.0


In [40]:
# %%
hits = tot = users = users_hit = 0

# Build a (user -> last basket) map from test_orders
test_last = dict(zip(test_orders["shopUserId"], test_orders["items"]))

for u, seed in last_seed_train.items():
    tgt = set(test_last.get(u, []))
    if not seed or not tgt:
        continue
    users += 1
    R = set(rec_seq(seed, k=TOPK))
    users_hit += int(len(tgt & R) > 0)
    hits += sum(1 for x in tgt if x in R)
    tot  += len(tgt)

print(
    "Hit@10:", round(hits / tot if tot else 0.0, 4),
    "| Users:", users,
    "| UserHitRate:", round(users_hit / users if users else 0.0, 4)
)

# Cold-start share among test labels
tot_items = sum(len(x) for x in test_orders["items"])
cold_items = sum(1 for xs in test_orders["items"] for x in xs if x not in seen_items)
print("Cold share (test):", round(cold_items / tot_items if tot_items else 0.0, 4))


Hit@10: 0.1618 | Users: 24632 | UserHitRate: 0.3115
Cold share (test): 0.0


In [41]:
# %%
# In production, recommend for the user's current most recent basket (last order)
last_order_all = (
    o.sort_values("t")
     .groupby("shopUserId", as_index=False)
     .tail(1)[["shopUserId","items"]]
)
seed_map_export = dict(zip(last_order_all["shopUserId"], last_order_all["items"]))

# No popularity backfill — only neighbor-based (similar) items
def rec_seq_export(seed, k=TOPK, allow_repurchase=EXPORT_ALLOW_REPURCHASE, thr=MIN_SCORE):
    S = set(seed)
    scores = ddict(float)
    for i in S:
        for j, w in neighbors.get(i, []):
            if w < thr:
                continue
            if (not allow_repurchase) and (j in S):
                continue
            scores[j] += w
    ranked = [j for j, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]
    return ranked[:k]  # may be shorter than k if not enough similar items

# Generate export
rows = []
short_lists = 0
for u, seed in seed_map_export.items():
    recs = rec_seq_export(seed, k=TOPK)
    if len(recs) < TOPK:
        short_lists += 1
    rows.append({"shopUserId": u, "Recent Purchase": seed, "Recommended Items": recs})

df_recs = pd.DataFrame(rows, columns=["shopUserId","Recent Purchase","Recommended Items"])
df_recs.to_parquet(OUTPUT_PATH, engine="pyarrow", index=False)
print(f"Wrote {len(df_recs):,} users → {OUTPUT_PATH} | Users with <{TOPK} recs: {short_lists}")
df_recs.head()


Wrote 24,632 users → /workspace/data/processed/personalized_cf_recs.parquet | Users with <10 recs: 3694


| | shopUserId | Recent Purchase | Recommended Items |
|---|---|---|---|
| 0 | 252934 | [261610, 291294] | [290150, 261608, 260931, 291120, 290149, 21075... |
| 1 | 218595 | [265823] | [266072, 264937, 261318] |
| 2 | 253943 | [280053, 261193] | [210752, 210758, 210765, 210186] |
| 3 | 246488 | [260513] | [265843, 260313, 263343, 261370, 266072, 26454... |
| 4 | 255340 | [291839] | [290290, 262010, 261924, 265041] |


In [42]:
# %%
# Build a transitions dataframe with counts and the exact scoring used in neighbors
rows = []
for (i, j), n_ij in psup_seq.items():
    out_i = isup_out[i]
    in_j  = isup_in[j]
    p = n_ij / (out_i + LAM_SEQ)  # smoothed P(j|i)
    score = p / (sqrt(max(1, in_j)) if POP_DISCOUNT else 1.0)  # same as neighbor weight pre-guards
    rows.append((i, j, n_ij, out_i, in_j, p, score))

transitions_df = pd.DataFrame(
    rows,
    columns=["source_item", "target_item", "count_ij", "out_degree_i", "in_degree_j", "P_j_given_i", "score_used"]
)

# Rank transitions per source by occurrence count (tie-break by score)
transitions_df = transitions_df.sort_values(["source_item", "count_ij", "score_used"], ascending=[True, False, False])
transitions_df["rank_by_count"] = transitions_df.groupby("source_item").cumcount() + 1

# Keep top-10 by count per source (but keep the score columns for visibility)
top10_transitions = transitions_df[transitions_df["rank_by_count"] <= 10].copy()

print(f"Edges total: {len(transitions_df):,} | Top-10 edges kept: {len(top10_transitions):,}")
top10_transitions.head(20)


Edges total: 105,437 | Top-10 edges kept: 12,120


| | source_item | target_item | count_ij | out_degree_i | in_degree_j | P_j_given_i | score_used | rank_by_count |
|---|---|---|---|---|---|---|---|---|
| 97483 | 106065 | 265041 | 2 | 28 | 1232 | 0.025641 | 0.000731 | 1 |
| 83768 | 106065 | 210773 | 2 | 28 | 1434 | 0.025641 | 0.000677 | 2 |
| 83772 | 106065 | 358952 | 1 | 28 | 12 | 0.012821 | 0.003701 | 3 |
| 83777 | 106065 | 290115 | 1 | 28 | 16 | 0.012821 | 0.003205 | 4 |
| 83776 | 106065 | 552181 | 1 | 28 | 40 | 0.012821 | 0.002027 | 5 |
| 83770 | 106065 | 542087 | 1 | 28 | 90 | 0.012821 | 0.001351 | 6 |
| 83774 | 106065 | 576223 | 1 | 28 | 96 | 0.012821 | 0.001308 | 7 |
| 83775 | 106065 | 261012 | 1 | 28 | 118 | 0.012821 | 0.00118 | 8 |
| 83767 | 106065 | 292011 | 1 | 28 | 185 | 0.012821 | 0.000943 | 9 |
| 83778 | 106065 | 292813 | 1 | 28 | 200 | 0.012821 | 0.000907 | 10 |


In [43]:
# %%
# Top-10 by score (same weight used in recommendation), alternative perspective
by_score = transitions_df.sort_values(["source_item", "score_used", "count_ij"], ascending=[True, False, False]).copy()
by_score["rank_by_score"] = by_score.groupby("source_item").cumcount() + 1
top10_by_score = by_score[by_score["rank_by_score"] <= 10].copy()

print(f"Top-10 by score rows: {len(top10_by_score):,}")
top10_by_score.head(10)


Top-10 by score rows: 12,120


| | source_item | target_item | count_ij | out_degree_i | in_degree_j | P_j_given_i | score_used | rank_by_count | rank_by_score |
|---|---|---|---|---|---|---|---|---|---|
| 83772 | 106065 | 358952 | 1 | 28 | 12 | 0.012821 | 0.003701 | 3 | 1 |
| 83777 | 106065 | 290115 | 1 | 28 | 16 | 0.012821 | 0.003205 | 4 | 2 |
| 83776 | 106065 | 552181 | 1 | 28 | 40 | 0.012821 | 0.002027 | 5 | 3 |
| 83770 | 106065 | 542087 | 1 | 28 | 90 | 0.012821 | 0.001351 | 6 | 4 |
| 83774 | 106065 | 576223 | 1 | 28 | 96 | 0.012821 | 0.001308 | 7 | 5 |
| 83775 | 106065 | 261012 | 1 | 28 | 118 | 0.012821 | 0.00118 | 8 | 6 |
| 83767 | 106065 | 292011 | 1 | 28 | 185 | 0.012821 | 0.000943 | 9 | 7 |
| 83778 | 106065 | 292813 | 1 | 28 | 200 | 0.012821 | 0.000907 | 10 | 8 |
| 97483 | 106065 | 265041 | 2 | 28 | 1232 | 0.025641 | 0.000731 | 1 | 9 |
| 83769 | 106065 | 210762 | 1 | 28 | 338 | 0.012821 | 0.000697 | 11 | 10 |


In [31]:
# %%
# Builds a per-user contribution table: for each recommended target j,
# shows which seed items i → j edges contributed and by how much.
from collections import defaultdict as ddict

def explain_user_recs(user_id, k=TOPK):
    seed = seed_map_export.get(user_id, [])
    if not seed:
        return pd.DataFrame(columns=["user","rec_item","from_item","edge_score","share_in_rec"])
    recs = rec_seq_export(seed, k=k)
    contrib_rows = []
    S = set(seed)
    for j in recs:
        total = 0.0
        tmp = []
        for i in S:
            # find edge i->j in neighbors
            for jj, w in neighbors.get(i, []):
                if jj == j:
                    tmp.append((i, w))
                    total += w
                    break
        # normalize to contribution shares
        for i, w in tmp:
            share = (w / total) if total > 0 else 0.0
            contrib_rows.append({
                "user": user_id,
                "rec_item": j,
                "from_item": i,
                "edge_score": w,
                "share_in_rec": share
            })
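    # Pass explicit columns so an empty contribution list still sorts without a KeyError
    df = pd.DataFrame(
        contrib_rows,
        columns=["user","rec_item","from_item","edge_score","share_in_rec"]
    ).sort_values(["rec_item","edge_score"], ascending=[True, False])
    return df

# Example: explain one user (replace "252934" with a real id from your data)
explain_user_recs("252934", k=TOPK).head(10)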


| | user | rec_item | from_item | edge_score | share_in_rec |
|---|---|---|---|---|---|
| 10 | 252934 | 236501 | 291294 | 0.000747 | 1.0 |
| 8 | 252934 | 260683 | 261610 | 0.000751 | 1.0 |
| 6 | 252934 | 261618 | 261610 | 0.000519 | 0.616779 |
| 7 | 252934 | 261618 | 291294 | 0.000323 | 0.383221 |
| 2 | 252934 | 265546 | 291294 | 0.001012 | 1.0 |
| 1 | 252934 | 270054 | 261610 | 0.001063 | 1.0 |
| 3 | 252934 | 290150 | 291294 | 0.000968 | 1.0 |
| 4 | 252934 | 290278 | 291294 | 0.000895 | 1.0 |
| 0 | 252934 | 291612 | 291294 | 0.001065 | 1.0 |
| 5 | 252934 | 292078 | 291294 | 0.000885 | 1.0 |
