# Personalized recommendation

# Project Outline: User–User Collaborative Filtering with Purchases (Implicit Feedback)

## 1) Goal

Recommend items to users, using only **purchase (0/1) histories**.

## 2) Data & Schema

* **Events**: `user_id`, `item_id`, `timestamp` (optional: `quantity`, `price`).
* **Matrices**:

  * **Interaction matrix** $R \in \{0,1\}^{U \times I}$: $R_{u,i}=1$ if purchased, else 0.
  * (Optional) **Confidence weights** $C_{u,i}$ (e.g., >1 for repeated buys).

## 3) Preprocessing

* Deduplicate by user–item; cap to 0/1.
* (Optional) Time decay: recent purchases get higher weight.
* Filter extreme sparsity (rare items, very cold users) for initial model.
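
The deduplication and time-decay steps above can be sketched in a few lines. This is a minimal illustration, not the project's pipeline: the `events` list, the `HALF_LIFE_DAYS` value, and the exponential half-life form of the decay are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical purchase events: (user_id, item_id, age_in_days)
events = [
    ("u1", "a", 0),
    ("u1", "a", 30),   # repeat purchase of the same item
    ("u1", "b", 60),
    ("u2", "a", 90),
]

HALF_LIFE_DAYS = 30.0  # assumed: a purchase loses half its weight every 30 days


def decay_weight(age_days, half_life=HALF_LIFE_DAYS):
    """Exponential time decay: weight 1.0 today, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life)


binary = set()                 # deduplicated 0/1 interactions
decayed = defaultdict(float)   # optional weight per (user, item) pair
for user, item, age in events:
    binary.add((user, item))
    # Keep the weight of the most recent purchase of each pair
    decayed[(user, item)] = max(decayed[(user, item)], decay_weight(age))
```

Keeping the maximum weight per pair means a repeat purchase refreshes the signal instead of adding to it, which keeps the weights in the range (0, 1].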

## 4) Similarity (Users)

* Build **user vectors** from purchases (rows of $R$).
* Compute similarity:

  * **Cosine**: $sim(u,v)=\frac{R_u \cdot R_v}{\|R_u\|\|R_v\|}$ (good default, normalizes heavy buyers).
  * **Jaccard**: $|A\cap B|/|A\cup B|$ where sets = purchased items.
* (Optional) Shrinkage for low co-occurrence (downweight tiny overlaps).
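
As a rough sketch of the two similarity measures and the shrinkage idea, the snippet below works on a toy interaction matrix. The matrix `R`, the helper names, and the shrinkage form `overlap / (overlap + beta)` with `beta=2.0` are assumptions chosen for illustration.

```python
import numpy as np

# Toy interaction matrix: rows are users, columns are items
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def jaccard(a, b):
    """Jaccard similarity of the sets of purchased items."""
    inter = np.sum((a > 0) & (b > 0))
    union = np.sum((a > 0) | (b > 0))
    return float(inter / union) if union > 0 else 0.0


def shrunk(sim, overlap, beta=2.0):
    """Downweight similarities that rest on few co-purchased items."""
    return sim * overlap / (overlap + beta)


sim_01 = cosine(R[0], R[1])           # 2 / sqrt(6)
jac_01 = jaccard(R[0], R[1])          # 2 / 3
sim_01_shrunk = shrunk(sim_01, overlap=2)
```

With an overlap of 2 and `beta=2.0` the similarity is halved; larger overlaps leave it closer to the raw value.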

## 5) Neighborhood Selection

* **Top-k neighbors** per user (k ≈ 30–50).
* Minimum similarity threshold (e.g., ≥ 0.1) to reduce noise.
* Exclude negative similarities (if using centered signals).
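
One possible way to combine the top-k rule with the minimum-similarity threshold is shown below. The function name `top_k_neighbors` and the example similarity row are hypothetical.

```python
import numpy as np


def top_k_neighbors(sim_row, user_index, k=30, min_sim=0.1):
    """Return up to k (neighbor_index, similarity) pairs, best first."""
    sims = sim_row.astype(float).copy()
    sims[user_index] = -np.inf            # a user is not their own neighbor
    order = np.argsort(sims)[::-1][:k]    # k largest similarities, best first
    # The threshold also removes zero and negative similarities
    return [(int(v), float(sims[v])) for v in order if sims[v] >= min_sim]


# Similarities of user 0 to all five users (including itself)
sim_row = np.array([1.0, 0.8, 0.05, 0.3, 0.0])
neighbors = top_k_neighbors(sim_row, user_index=0, k=3, min_sim=0.1)
```

Here user 2 falls inside the top 3 by rank but is dropped by the threshold, so only users 1 and 3 remain.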

## 6) Scoring (Predict Purchasability)

For user $u$ and candidate item $i$ not yet purchased:

$$
score(u,i) \;=\; \sum_{v \in N(u)} sim(u,v) \cdot R_{v,i}
$$

* (Optional) Normalize by $\sum_{v \in N(u)} |sim(u,v)|$.
* (Optional) Add **popularity penalty** or **IDF for items** to reduce blockbuster bias:

  * $idf(i)=1/\log(1+DF_i)$, where $DF_i$ = number of users who bought $i$. This is an inverse-log popularity weight; the textbook IDF is $\log(U/DF_i)$.
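
The scoring formula, the optional normalization, and the popularity weight can be sketched together as follows. The toy matrix, the `neighbor_sims` mapping, and the function name are assumptions; the weight uses the $1/\log(1+DF_i)$ form given above and assumes every item has at least one buyer.

```python
import numpy as np

R = np.array([
    [1, 0, 0],   # target user u (index 0)
    [1, 1, 0],   # neighbor v = 1
    [1, 1, 1],   # neighbor v = 2
], dtype=float)
neighbor_sims = {1: 0.8, 2: 0.4}   # hypothetical sim(u, v) for v in N(u)


def score_items(R, u, neighbor_sims, normalize=True):
    """score(u, i) = sum over neighbors v of sim(u, v) * R[v, i]."""
    scores = np.zeros(R.shape[1])
    for v, s in neighbor_sims.items():
        scores += s * R[v]
    if normalize:
        total = sum(abs(s) for s in neighbor_sims.values())
        if total > 0:
            scores /= total
    scores[R[u] > 0] = -np.inf   # never recommend already purchased items
    return scores


scores = score_items(R, 0, neighbor_sims)

# Popularity weight: DF_i is the number of users who bought item i
df = R.sum(axis=0)
idf = 1.0 / np.log1p(df)
penalized = scores * idf
best_item = int(np.argmax(penalized))
```

In this example item 1 is bought by both neighbors and still ranks first after the penalty, while the already purchased item 0 stays masked.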

## 7) Ranking & Recommendation

* For each user, rank unseen items by `score(u,i)` descending.
* Apply business rules: diversity, freshness, category caps, inventory.

## 8) Evaluation

* **Offline** (temporal split):

  * Metrics: **Recall\@K**, **Precision\@K**, **MAP\@K**, **NDCG\@K**, **Coverage**.
* **Online** (A/B test):

  * CTR, add-to-cart, conversion, revenue per session, retention.
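
For the offline metrics, a minimal sketch of Precision@K and Recall@K for a single user might look like this. The recommendation list and the held-out set are made up; in a temporal split the held-out set would be the purchases made after the cutoff.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: ranked list of item ids, best first
    relevant:    set of held-out items the user actually purchased
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["a", "b", "c", "d", "e"]
held_out = {"b", "e", "z"}
p_at_3, r_at_3 = precision_recall_at_k(recommended, held_out, k=3)
p_at_5, r_at_5 = precision_recall_at_k(recommended, held_out, k=5)
```

Averaging these values over all test users gives the dataset-level metric.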

## 9) Cold-Start Strategies

* **New users**: popularity by segment, onboarding quiz, contextual rules.
* **New items**: content-based fallback (metadata), “related to” pages.

## 10) Scalability & Ops

* Sparse matrix ops (CSR), approximate nearest neighbors (ANN) for user sim.
* Periodic neighbor recomputation; cache top-k neighbors.
* Incremental updates from event stream (batch + mini-batch).
* Guard against popularity feedback loops (add exploration).

**Mean-centering is not useful for 0/1 purchase data**

* Instead, **normalize user vectors to unit length**.
* This prevents users with many interactions from dominating.

**Count data**

* Apply a **log transformation** to dampen the effect of very high counts (e.g., a song played 500 times).
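
Both adjustments are short in NumPy. The sketch below uses a made-up count matrix and `log1p` as the log transformation, which is an assumption that keeps zero counts at zero.

```python
import numpy as np

# Hypothetical interaction counts: rows are users, columns are items
counts = np.array([
    [1, 0, 500],
    [1, 1, 0],
], dtype=float)

# Log transformation: log(1 + count) dampens very high counts
damped = np.log1p(counts)

# Scale each user vector to unit length (rows of all zeros stay zero)
norms = np.linalg.norm(damped, axis=1, keepdims=True)
unit = np.divide(damped, norms, out=np.zeros_like(damped), where=norms > 0)
```

After the transformation the count of 500 contributes about 6.2 instead of 500, and every non-empty row of `unit` has length 1.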

---

## Similarity Computation

* **Cosine similarity** works well.
* Alternative: **conditional probability** (e.g., probability user bought item *i* given they bought item *j*).
* This is similar to **association rules**, but requires care: probabilities are not symmetric.
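
The asymmetry is easy to see on a small example. In this sketch `cond[i, j]` estimates the probability that a user bought item *i* given that they bought item *j*; the toy matrix is an assumption, and the division assumes every item has at least one buyer.

```python
import numpy as np

# Toy 0/1 purchases: item 0 is bought by everyone, item 1 by half the users
R = np.array([
    [1, 1],
    [1, 0],
    [1, 0],
    [1, 1],
], dtype=float)

co = R.T @ R                  # co[i, j] = number of users who bought both i and j
item_counts = np.diag(co)     # number of buyers per item

# cond[i, j] = P(bought i | bought j) = co[i, j] / buyers of j
cond = co / item_counts[np.newaxis, :]

p_0_given_1 = cond[0, 1]
p_1_given_0 = cond[1, 0]
```

Every buyer of item 1 also bought item 0, but only half of the buyers of item 0 bought item 1, so the two directions differ. Popular items tend to get high conditional probabilities from everything, which is one reason the measure needs care.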

---

## Score Aggregation

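* Define the score $S(u, i) = \sum_{j \in I_u} w_{ij}$, where $I_u$ is the set of items purchased by $u$ and $w_{ij}$ is the similarity between items $i$ and $j$.
* Optionally truncate each sum to the **top-k most similar items**.
* Neighborhood selection (choosing the top-k similar items) still applies.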
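
A small sketch of this aggregation with top-k truncation follows. The similarity matrix `W`, the helper names, and the choice to truncate per candidate item (per row) are assumptions for illustration.

```python
import numpy as np

# Hypothetical symmetric item-item similarity matrix
W = np.array([
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.5, 0.2],
    [0.1, 0.5, 1.0, 0.7],
    [0.0, 0.2, 0.7, 1.0],
])


def truncate_top_k(W, k):
    """Keep only the k largest off-diagonal similarities in each row."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)          # an item is not its own neighbor
    pruned = np.zeros_like(W)
    for i in range(W.shape[0]):
        keep = np.argsort(W[i])[-k:]
        pruned[i, keep] = W[i, keep]
    return pruned


def score(purchased, W):
    """S(u, i) = sum of w_ij over the items j purchased by the user."""
    s = W[:, purchased].sum(axis=1)
    s[purchased] = -np.inf            # mask already purchased items
    return s


W_top1 = truncate_top_k(W, k=1)
s = score([0], W_top1)
best = int(np.argmax(s))
```

With `k=1`, a user who bought item 0 only receives a nonzero score for item 1, since item 0 is the single nearest neighbor of item 1 and of no other item.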

Hybrid variant: use content-based similarities (e.g., from item text vectors) in place of purchase-based ones. This effectively builds a hybrid item-item content-based filter.

Learned variant: replace the similarity scores with machine-learned coefficients. An optimization algorithm can fit them by minimizing an error measure (e.g., squared prediction error). This moves item-item CF closer to modern model-based approaches.
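
To make the idea of learned coefficients concrete, the sketch below fits one ridge regression per item, predicting each item's column from the other items' columns by minimizing squared error. This is only one possible formulation, chosen for brevity; the function name, the regularization strength `lam`, and the toy matrix are assumptions.

```python
import numpy as np


def learn_item_weights(R, lam=1.0):
    """Fit W so that R @ W approximates R, with a zero diagonal.

    Column j of W holds ridge-regression coefficients that predict
    item j from all other items (squared error + lam * ||w||^2).
    """
    n_items = R.shape[1]
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        others = [i for i in range(n_items) if i != j]
        X = R[:, others]
        y = R[:, j]
        # Closed-form ridge solution: (X^T X + lam * I) w = X^T y
        w = np.linalg.solve(X.T @ X + lam * np.eye(len(others)), X.T @ y)
        W[others, j] = w
    return W


# Items 0 and 1 are always bought together; item 2 is bought separately
R = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
], dtype=float)

W = learn_item_weights(R, lam=1.0)
predicted = R @ W   # learned scores, used in place of similarity sums
```

The learned weight from item 0 to item 1 is 0.75, while item 2 gets a weight of 0 for predicting item 1, reflecting that it never co-occurs with it.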



In [10]:
import pandas as pd

cols = ['shopUserId', 'quantity', 'groupId']
# Load only the needed columns, plus 'status' for filtering
tx = pd.read_csv('../data/processed/transactions_clean.csv', usecols=cols + ['status'], low_memory=False)
tx = tx[tx['status'] == 'active'].copy()  # keep active transactions only
tx = tx[cols]  # drop the 'status' column after filtering
tx['quantity'] = tx['quantity'].astype(int)
tx


Unnamed: 0,shopUserId,quantity,groupId
0,812427,1,261873
1,831360,4,261745
2,209204,1,265298
4,831340,1,260596
5,831340,1,260596
...,...,...,...
250024,78202,1,221416
250026,78181,1,265843
250038,78145,1,261518
250039,78136,1,542087


In [11]:
# Aggregate in case same user bought the same product multiple times
# now quantity = total number of units this user has ever bought of this product
user_item = tx.groupby(["shopUserId", "groupId"], as_index=False)["quantity"].sum() 
user_item

Unnamed: 0,shopUserId,groupId,quantity
0,78135,291294,1
1,78136,542087,1
2,78145,261518,1
3,78162,291278,1
4,78162,404269,1
...,...,...,...
119915,831187,210765,2
119916,831202,250124,1
119917,831331,270610,1
119918,831340,260596,2


A user gives a stronger signal for items they bought more often, so the summed quantity is damped with `log1p`:

In [12]:
import numpy as np
user_item["interaction"] = np.log1p(user_item["quantity"])
user_item = user_item.drop(columns=["quantity"])

In [13]:
user_item

Unnamed: 0,shopUserId,groupId,interaction
0,78135,291294,0.693147
1,78136,542087,0.693147
2,78145,261518,0.693147
3,78162,291278,0.693147
4,78162,404269,0.693147
...,...,...,...
119915,831187,210765,1.098612
119916,831202,250124,0.693147
119917,831331,270610,0.693147
119918,831340,260596,1.098612


In [14]:
from sklearn.preprocessing import LabelEncoder

user_enc = LabelEncoder()
item_enc = LabelEncoder()

user_item["user_idx"] = user_enc.fit_transform(user_item["shopUserId"])
user_item["item_idx"] = item_enc.fit_transform(user_item["groupId"])

user_item

Unnamed: 0,shopUserId,groupId,interaction,user_idx,item_idx
0,78135,291294,0.693147,0,756
1,78136,542087,0.693147,1,1053
2,78145,261518,0.693147,2,314
3,78162,291278,0.693147,3,755
4,78162,404269,0.693147,3,895
...,...,...,...,...,...
119915,831187,210765,1.098612,57989,36
119916,831202,250124,0.693147,57990,91
119917,831331,270610,0.693147,57991,620
119918,831340,260596,1.098612,57992,186


In [57]:
user_item.sort_values(by="interaction", ascending=False)

Unnamed: 0,shopUserId,groupId,interaction,user_idx,item_idx
4089,126151,261637,3.465736,1475,348
47031,395080,503402,3.433987,20405,974
19098,281155,266072,3.367296,7592,503
67902,528903,218982,3.332205,30577,59
6519,174425,261902,3.258097,2439,428
...,...,...,...,...,...
46029,391124,292813,0.693147,19940,789
46028,391109,290104,0.693147,19939,667
46027,391105,260223,0.693147,19938,136
46026,391105,240166,0.693147,19938,71


In [61]:
occurrences = (user_item["groupId"] == "261637").sum()
total = len(user_item)
percent = occurrences / total * 100
print(f"Occurrences of groupId 261637: {occurrences} ({percent:.2f}% of all data)")

Occurrences of groupId 261637: 2895 (2.41% of all data)


In [51]:
from scipy.sparse import coo_matrix

sparse_matrix = coo_matrix(
    (user_item["interaction"], (user_item["user_idx"], user_item["item_idx"]))
)

print(sparse_matrix.shape)  # (n_users, n_items)

(57994, 1123)


In [52]:
from sklearn.metrics.pairwise import cosine_similarity

# compute item-item cosine similarity
item_similarity = cosine_similarity(sparse_matrix.T)  # transpose = items as rows

print(item_similarity.shape)  # (n_items, n_items)

(1123, 1123)


In [53]:
# Convert once to CSR so individual user rows can be sliced efficiently
sparse_matrix = sparse_matrix.tocsr()

def recommend_for_user(user_id, top_n=5):
    uidx = user_enc.transform([user_id])[0]
    user_row = sparse_matrix[uidx].toarray().ravel()  # dense 1-D interaction vector

    # Score each item as the interaction-weighted sum of its similarities
    # to the items this user has bought
    scores = user_row @ item_similarity

    # Mask items the user has already bought
    scores[user_row > 0] = -np.inf

    # Indices of the top_n highest scores, best first
    top_items_idx = np.argsort(scores)[-top_n:][::-1]
    return item_enc.inverse_transform(top_items_idx)

In [62]:
print(recommend_for_user(395080, top_n=5))

['503380' '445897' '440419' '530335' '350225']


In [63]:
def similar_items(product_id, top_n=5):
    # groupId values are strings in this dataset, so cast before encoding
    pidx = item_enc.transform([str(product_id)])[0]
    sims = item_similarity[pidx].copy()
    # Exclude the item itself explicitly; relying on it being ranked first
    # breaks when another item also has similarity 1.0
    sims[pidx] = -np.inf
    top_idx = np.argsort(sims)[-top_n:][::-1]
    # Return as strings
    return item_enc.inverse_transform(top_idx).astype(str)

print(similar_items("503380", top_n=5))


['503402' '503373' '503407' '503397' '507707']


In [65]:
import numpy as np

target_item = "261637"   # the item to track (as string)
top_n = 5               # Top-N size to evaluate

user_ids = user_item["shopUserId"].unique()  # all users; or sample if large

count = 0
total = 0

for uid in user_ids:
    recs = recommend_for_user(uid, top_n=top_n)   # expects array-like of groupId
    recs = np.asarray(recs).astype(str)           # ensure string dtype for comparison
    total += len(recs)
    count += np.sum(recs == target_item)

prop = (count / total) if total else 0.0
print(f"Item {target_item} appeared {count} times out of {total} total rec slots ({prop:.2%})")


Item 261637 appeared 10348 times out of 289970 total rec slots (3.57%)
