# Project Outline: User–User Collaborative Filtering with Purchases (Implicit Feedback)

## 1) Goal

Recommend items to users using only binary (0/1) **purchase histories**.

## 2) Data & Schema

* **Events**: `user_id`, `item_id`, `timestamp` (optional: `quantity`, `price`).
* **Matrices**:

  * **Interaction matrix** $R \in \{0,1\}^{U \times I}$: $R_{u,i}=1$ if purchased, else 0.
  * (Optional) **Confidence weights** $C_{u,i}$ (e.g., >1 for repeated buys).
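
As an illustration of the interaction matrix above, here is a minimal sketch that builds $R$ from `(user_id, item_id)` pairs. It assumes a dense NumPy array for readability; a real catalog would likely need a sparse format such as CSR. The function name `build_interaction_matrix` is hypothetical, and the toy events reuse a few ids from the sample output further down.

```python
import numpy as np

def build_interaction_matrix(events):
    """Build a binary user x item matrix from (user_id, item_id) pairs.

    Returns R plus the sorted id lists that map row/column positions back to ids.
    """
    users = sorted({u for u, _ in events})
    items = sorted({i for _, i in events})
    user_pos = {u: k for k, u in enumerate(users)}
    item_pos = {i: k for k, i in enumerate(items)}

    R = np.zeros((len(users), len(items)), dtype=np.int8)
    for u, i in events:
        R[user_pos[u], item_pos[i]] = 1  # repeated purchases collapse to 1
    return R, users, items

# Toy events: the second user bought the same item twice.
events = [(78162, 291278), (78162, 404269), (831187, 210765), (831187, 210765)]
R, users, items = build_interaction_matrix(events)
```

Keeping `users` and `items` alongside `R` makes it possible to translate matrix positions back to ids when producing recommendations.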

## 3) Preprocessing

* Deduplicate by user–item; cap to 0/1.
* (Optional) Time decay: recent purchases get higher weight.
* Filter out extremely sparse rows and columns (rare items, users with very few purchases) for the initial model.
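
The optional time decay could take many forms; one common choice is exponential decay with a half-life. The sketch below assumes that form, and the 90-day half-life is an arbitrary placeholder rather than a value from this project.

```python
def decay_weight(age_days, half_life_days=90.0):
    """Exponential time decay: the weight halves every `half_life_days`.

    age_days: days between the purchase and the reference date (>= 0).
    half_life_days: assumed tuning parameter, not taken from the project.
    """
    return 0.5 ** (age_days / half_life_days)

# A purchase made today keeps full weight; one made a half-life ago keeps half.
today_weight = decay_weight(0)
old_weight = decay_weight(90)
```

These weights would replace the 1s in the interaction matrix (or feed the confidence weights) if time decay is enabled.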

## 4) Similarity (Users)

* Build **user vectors** from purchases (rows of $R$).
* Compute similarity:

  * **Cosine**: $sim(u,v)=\frac{R_u \cdot R_v}{\|R_u\|\|R_v\|}$ (good default, normalizes heavy buyers).
  * **Jaccard**: $|A\cap B|/|A\cup B|$ where sets = purchased items.
* (Optional) Shrinkage for low co-occurrence (downweight tiny overlaps).
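
The similarity options above can be sketched as follows. Cosine is computed for all user pairs at once on a dense array (an assumption made for brevity), Jaccard works on Python sets, and `shrink` uses one common shrinkage form, `overlap / (overlap + lam)`, where `lam` is a hypothetical tuning parameter.

```python
import numpy as np

def cosine_similarity_matrix(R):
    """Pairwise cosine similarity between the rows of a binary matrix R."""
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0  # users with no purchases get similarity 0, not NaN
    sim = (R @ R.T) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # a user should not be their own neighbor
    return sim

def jaccard(a, b):
    """Jaccard similarity between two sets of purchased item ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def shrink(sim, overlap, lam=5.0):
    """Downweight a similarity that rests on few co-purchased items."""
    return sim * overlap / (overlap + lam)
```

With `lam=5.0`, a similarity supported by 5 shared items is halved, while one supported by 50 shared items keeps about 91% of its value.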

## 5) Neighborhood Selection

* **Top-k neighbors** per user (k ≈ 30–50).
* Minimum similarity threshold (e.g., ≥ 0.1) to reduce noise.
* Exclude negative similarities. These arise only with mean-centered signals; cosine on raw 0/1 vectors is never negative.
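
A minimal sketch of the neighbor selection described above, for a single user's row of the similarity matrix. It assumes the user's own entry has already been set to 0, and the defaults `k=30` and `min_sim=0.1` simply mirror the example values in the list.

```python
import numpy as np

def top_k_neighbors(sim_row, k=30, min_sim=0.1):
    """Return (neighbor_index, similarity) pairs for one user, best first."""
    sim_row = np.asarray(sim_row, dtype=float)
    # Keep only neighbors at or above the threshold; for min_sim >= 0 this
    # also removes any negative similarities.
    candidates = np.flatnonzero(sim_row >= min_sim)
    order = candidates[np.argsort(-sim_row[candidates], kind="stable")]
    return [(int(v), float(sim_row[v])) for v in order[:k]]

neighbors = top_k_neighbors([0.0, 0.5, 0.05, 0.9, -0.2], k=2)
```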

## 6) Scoring (Predict Purchasability)

For user $u$ and candidate item $i$ not yet purchased:

$$
score(u,i) \;=\; \sum_{v \in N(u)} sim(u,v) \cdot R_{v,i}
$$

* (Optional) Normalize by $\sum_{v \in N(u)} |sim(u,v)|$.
* (Optional) Add **popularity penalty** or **IDF for items** to reduce blockbuster bias:

  * $idf(i)=1/\log(1+DF_i)$, where $DF_i$ = #users who bought $i$.
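
The scoring formula and the optional item weighting can be sketched as below. The function name `score_items` and its arguments are illustrative; `neighbors` is assumed to be a list of row indices for $N(u)$, and dense arrays are used for readability.

```python
import numpy as np

def score_items(u, R, sim, neighbors, use_idf=False):
    """score(u, i) = sum over v in N(u) of sim(u, v) * R[v, i]."""
    R = np.asarray(R, dtype=float)
    sim = np.asarray(sim, dtype=float)
    scores = np.zeros(R.shape[1])
    for v in neighbors:
        scores += sim[u, v] * R[v]
    if use_idf:
        df = R.sum(axis=0)  # DF_i: number of users who bought item i
        idf = np.zeros_like(df)
        bought = df > 0  # items nobody bought keep weight 0
        idf[bought] = 1.0 / np.log1p(df[bought])  # idf(i) = 1 / log(1 + DF_i)
        scores = scores * idf
    return scores
```

The scores here still include items the user has already purchased; those are filtered out at ranking time.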

## 7) Ranking & Recommendation

* For each user, rank unseen items by `score(u,i)` descending.
* Apply business rules: diversity, freshness, category caps, inventory.
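
The ranking step might look like the sketch below, which masks already-purchased items and returns the top `n` item positions. Business rules are not modeled here; the helper name `recommend` is hypothetical.

```python
import numpy as np

def recommend(scores, purchased_mask, n=10):
    """Rank items by score (descending), skipping items already bought."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[np.asarray(purchased_mask, dtype=bool)] = -np.inf
    order = np.argsort(-scores, kind="stable")
    # Drop masked items in case fewer than n unseen items exist.
    return [int(i) for i in order[:n] if np.isfinite(scores[i])]

top_items = recommend([0.5, 0.7, 0.2], [True, False, False], n=2)
```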

## 8) Evaluation

* **Offline** (temporal split):

  * Metrics: **Recall@K**, **Precision@K**, **MAP@K**, **NDCG@K**, **Coverage**.
* **Online** (A/B test):

  * CTR, add-to-cart, conversion, revenue per session, retention.
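
For the offline metrics, here is a minimal per-user sketch of Precision@K and Recall@K. It assumes `relevant` holds the items the user bought in the held-out (later) part of the temporal split; averaging over users is left out.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: ranked list of item ids, best first.
    relevant: set of item ids purchased in the held-out period.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall_at_k([1, 2, 3, 4], {2, 4, 9}, k=2)
```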

## 9) Cold-Start Strategies

* **New users**: popularity by segment, onboarding quiz, contextual rules.
* **New items**: content-based fallback (metadata), “related to” pages.

## 10) Scalability & Ops

* Sparse matrix ops (CSR), approximate nearest neighbors (ANN) for user sim.
* Periodic neighbor recomputation; cache top-k neighbors.
* Incremental updates from event stream (batch + mini-batch).
* Guard against popularity feedback loops (add exploration).

In [14]:
import pandas as pd

cols = ['orderId', 'shopUserId', 'created', 'quantity', 'name', 'status', 'groupId', 'price_sek']
tx = pd.read_csv('../data/processed/transactions_clean.csv', usecols=cols, low_memory=False)
tx = tx[tx['status'] == 'active'].copy()
tx[['price_sek', 'quantity']] = tx[['price_sek', 'quantity']].astype(int)
tx.head()


Unnamed: 0,orderId,shopUserId,created,quantity,name,status,groupId,price_sek
0,785001,812427,2025-08-05 20:14:28,1,Clean Curves Wire bra,active,261873,820
1,784985,831360,2025-08-05 19:55:36,4,Trosa Freedom Skin-Relief,active,261745,342
2,784978,209204,2025-08-05 19:47:22,1,Bambutrosa 2-pack,active,265298,169
4,784977,831340,2025-08-05 19:46:09,1,Bh uten bøyle Stars,active,260596,515
5,784977,831340,2025-08-05 19:46:09,1,Bh uten bøyle Stars,active,260596,515


In [49]:
# Get all groupIds ever purchased by each shopUserId (across all orders)
result = (
    tx.groupby('shopUserId')['groupId']
    .apply(list)
    .reset_index(name='all_groupIds')
)

In [50]:
# Show full column contents (the groupId lists) instead of truncating them
import pandas as pd

pd.set_option('display.max_colwidth', None)
result

Unnamed: 0,shopUserId,all_groupIds
0,78135,[291294]
1,78136,[542087]
2,78145,[261518]
3,78162,"[291278, 404269]"
4,78181,"[260313, 240201, 265823, 270794, 263855, 265843]"
...,...,...
57989,831187,"[210765, 210765]"
57990,831202,[250124]
57991,831331,[270610]
57992,831340,"[260596, 260596]"


In [37]:
# Find users with more than 10 unique orderIds
order_counts = tx.groupby('shopUserId')['orderId'].nunique()
users_over_10_orders = order_counts[order_counts > 10]

print("Users with more than 10 unique orderIds:")
print(users_over_10_orders)


Users with more than 10 unique orderIds:
shopUserId
82751     13
84829     21
141913    14
150286    11
174425    16
229978    13
231831    11
247530    14
260702    18
286951    13
287409    11
308539    11
591887    13
Name: orderId, dtype: int64
