# Step 1: Load and Explore Retailrocket Events Data

In [3]:
# Load the events dataset and convert timestamp for easier analysis
# Each row represents a user-item interaction (e.g., view, add-to-cart, transaction)

import pandas as pd  
# Load the events dataset and convert timestamp for easier analysis
# Each row represents a user-item interaction (e

events = pd.read_csv("../data/events.csv")
events['timestamp'] = pd.to_datetime(events['timestamp'], unit='ms')

# Preview first few rows of the dataset
print("🔹 First 5 rows:")
display(events.head())

# Show data structure and column types
print("🔹 Dataset Info:")
display(events.info())


# Count how many of each interaction type we have
print("🔹 Event Types Distribution:")
display(events['event'].value_counts())

# Count number of unique users and products
print("🔹 Unique Users:", events['visitorid'].nunique())
print("🔹 Unique Items:", events['itemid'].nunique())


🔹 First 5 rows:


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,2015-06-02 05:02:12.117,257597,view,355908,
1,2015-06-02 05:50:14.164,992329,view,248676,
2,2015-06-02 05:13:19.827,111016,view,318965,
3,2015-06-02 05:12:35.914,483717,view,253185,
4,2015-06-02 05:02:17.106,951259,view,367447,


🔹 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   timestamp      datetime64[ns]
 1   visitorid      int64         
 2   event          object        
 3   itemid         int64         
 4   transactionid  float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 105.1+ MB


None

🔹 Event Types Distribution:


event
view           2664312
addtocart        69332
transaction      22457
Name: count, dtype: int64

🔹 Unique Users: 1407580
🔹 Unique Items: 235061


In [None]:
#pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.3.0-cp39-cp39-macosx_11_0_arm64.whl (10.8 MB)
[K     |████████████████████████████████| 10.8 MB 5.1 MB/s eta 0:00:01    |████████████▋                   | 4.2 MB 5.1 MB/s eta 0:00:02
Collecting numpy>=1.22.4
  Downloading numpy-2.0.2-cp39-cp39-macosx_14_0_arm64.whl (5.3 MB)
[K     |████████████████████████████████| 5.3 MB 10.0 MB/s eta 0:00:01
[?25hCollecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
[K     |████████████████████████████████| 509 kB 10.3 MB/s eta 0:00:01
[?25hCollecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
[K     |████████████████████████████████| 347 kB 21.0 MB/s eta 0:00:01
Installing collected packages: tzdata, pytz, numpy, pandas
Successfully installed numpy-2.0.2 pandas-2.3.0 pytz-2025.2 tzdata-2025.2
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/b

##  Step 1: Data Loading and Initial Exploration

In this step, we loaded and inspected the `events.csv` file from the Retailrocket dataset. This dataset contains user interactions with products on an e-commerce platform and is ideal for building a recommendation engine based on implicit feedback (no explicit ratings).

###  Actions Performed:
- Loaded `events.csv` into a Pandas DataFrame
- Converted `timestamp` from UNIX (ms) to human-readable datetime
- Checked data structure using `.info()`, `.head()`
- Counted the number of unique users and items
- Examined the distribution of interaction types (events)

###  Key Observations:
- Total rows: **2,756,101**
- Columns: `timestamp`, `visitorid`, `event`, `itemid`, `transactionid`
- Interaction types:
  - `view`: **2,664,312** (~97%)
  - `addtocart`: **69,332**
  - `transaction`: **22,457**
- Users: **1,407,580** unique `visitorid`s
- Products: **235,061** unique `itemid`s

###  Insights:
- The dataset is **heavily skewed towards 'view' events**, which is realistic in e-commerce.
- **Very few transactions**, which suggests the need to treat events with different weights when building a user-item preference matrix.
- This is a classic **sparse implicit feedback setting**, best suited for algorithms like **ALS (Alternating Least Squares)** or **LightFM**.
- Due to the large number of users and items, **dimensionality reduction and filtering** will be critical in the next steps.

Next, we'll begin **preprocessing** by assigning weights to different event types to build a user-item interaction matrix for training.


# Step 2: Preprocessing and User-Item Matrix Creation

In [4]:

# - Filter useful columns
# - Assign interaction weights based on event type
# - Group to form user-item interaction matrix

import numpy as np

# Select relevant columns
interaction_df = events[['visitorid', 'itemid', 'event']].copy()

# Map events to numeric weights (implicit feedback)
event_weights = {
    'view': 1,
    'addtocart': 3,
    'transaction': 5
}
interaction_df['event_strength'] = interaction_df['event'].map(event_weights)

# Group by user-item and sum weights to form the interaction matrix
user_item_matrix = (
    interaction_df
    .groupby(['visitorid', 'itemid'])['event_strength']
    .sum()
    .reset_index()
)

# Preview
print("🔹 User-Item Interaction Matrix (top rows):")
display(user_item_matrix.head())

# Basic stats
print("🔹 Shape of matrix:", user_item_matrix.shape)
print("🔹 Top users with most activity:")
display(user_item_matrix['visitorid'].value_counts().head())


🔹 User-Item Interaction Matrix (top rows):


Unnamed: 0,visitorid,itemid,event_strength
0,0,67045,1
1,0,285930,1
2,0,357564,1
3,1,72028,1
4,2,216305,2


🔹 Shape of matrix: (2145179, 3)
🔹 Top users with most activity:


visitorid
1150086    3814
530559     2209
892013     1738
895999     1641
152963     1622
Name: count, dtype: int64

## Step 2: Preprocessing and Creating the User-Item Interaction Matrix

In this step, we transformed raw clickstream data into a numerical user-item matrix to prepare it for recommendation modeling.

###  Actions Performed:
- Selected key columns: `visitorid`, `itemid`, and `event`
- Assigned weights to events:
  - `view`: 1
  - `addtocart`: 3
  - `transaction`: 5
- Grouped interactions to compute total `event_strength` per `(user, item)` pair
- Output: ~2.1 million rows representing weighted interaction scores


### Insights:
- The interaction matrix is highly **sparse**, with only a small fraction of possible user-item pairs represented.
- Top users show thousands of interactions — these may include power users or automated behavior.
- We now have a clean dataset of user preferences for training **collaborative filtering** models (e.g., ALS, LightFM).

➡️ Next step: Train a recommendation model using this interaction matrix.



## Step 3: Train Collaborative Filtering Model Using ALS (Implicit Library)

In [None]:
#pip install implicit


Defaulting to user installation because normal site-packages is not writeable
Collecting implicit
  Downloading implicit-0.7.2-cp39-cp39-macosx_11_0_arm64.whl (765 kB)
[K     |████████████████████████████████| 765 kB 4.1 MB/s eta 0:00:01
[?25hCollecting scipy>=0.16
  Downloading scipy-1.13.1-cp39-cp39-macosx_12_0_arm64.whl (30.3 MB)
[K     |████████████████████████████████| 30.3 MB 7.1 MB/s eta 0:00:01
[?25hCollecting tqdm>=4.27
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
[K     |████████████████████████████████| 78 kB 16.2 MB/s eta 0:00:01
Collecting threadpoolctl
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: tqdm, threadpoolctl, scipy, implicit
Successfully installed implicit-0.7.2 scipy-1.13.1 threadpoolctl-3.6.0 tqdm-4.67.1
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [5]:
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares

# Recreate mapping if needed
user_mapping = {user_id: idx for idx, user_id in enumerate(user_item_matrix['visitorid'].unique())}
item_mapping = {item_id: idx for idx, item_id in enumerate(user_item_matrix['itemid'].unique())}

user_item_matrix['user_idx'] = user_item_matrix['visitorid'].map(user_mapping)
user_item_matrix['item_idx'] = user_item_matrix['itemid'].map(item_mapping)

#  Build ITEM x USER sparse matrix for model training
item_user_matrix = sparse.csr_matrix((
    user_item_matrix['event_strength'].astype(float),
    (user_item_matrix['item_idx'], user_item_matrix['user_idx'])
))

# Transpose to USER x ITEM for recommendation
user_item_sparse_matrix = item_user_matrix.T.tocsr()

  from .autonotebook import tqdm as notebook_tqdm


In [6]:
# Train ALS model
model = AlternatingLeastSquares(factors=50, regularization=0.01, iterations=15)
model.fit(item_user_matrix)


100%|██████████| 15/15 [00:35<00:00,  2.40s/it]


In [7]:
user_id = list(user_mapping.keys())[0]
user_idx = user_mapping[user_id]

# Get top-N item recommendations
recommendations = model.recommend(user_idx, user_item_sparse_matrix[user_idx], N=5)

# Safely map internal item indices back to item IDs
reverse_item_mapping = {v: k for k, v in item_mapping.items()}

recommended_items = []
for row in recommendations:
    item_idx = int(row[0])   # ALS internal item index
    score = float(row[1])
    if item_idx in reverse_item_mapping:
        item_id = reverse_item_mapping[item_idx]
        recommended_items.append((item_id, score))

# Print the top 5 recommendations
print(f"Top 5 recommendations for user {user_id}:")
for item_id, score in recommended_items:
    print(f"Item {item_id} (Score: {score:.2f})")


Top 5 recommendations for user 0:
Item 67045 (Score: 0.08)


## Step 3: Train Collaborative Filtering Model (ALS)

In this step, we trained a recommendation model using the **Alternating Least Squares (ALS)** algorithm from the `implicit` library. ALS is ideal for implicit feedback data like clicks, add-to-cart, and transactions.

###  Actions Performed:
- Mapped user and item IDs to unique integer indices
- Created a sparse matrix of `(item × user)` format with event strengths
- Trained ALS model using 50 latent factors and 15 iterations
- Generated top-N recommendations for a sample user

###  Internal Mappings:
- ALS requires internal matrix factorization → item and user IDs must be mapped to index values.
- Used a `reverse_item_mapping` to convert predictions back to original `itemid` values.


This indicates that item `67045` is recommended with the highest confidence to user `0`.

Next: We'll deploy this model behind a FastAPI endpoint to serve real-time recommendations.



Simulated User-Item Interaction Matrix


| Visitor ID | Item A | Item B | Item C | Item D |
| ---------- | ------ | ------ | ------ | ------ |
| **1001**   | 1      | 8      |        |        |
| **1002**   | 1      |        | 1      |        |
| **1003**   |        |        |        | 5      |
| **1004**   |        | 1      |        | 1      |






Key:
User 1001 has the highest engagement with Item B (view + add-to-cart + purchase → total 8).

User 1003 strongly interacted with Item D (purchased).

User 1004 lightly interacted with B and D.

🧠 How ALS Uses This:
Learns latent features from these strengths.

Recommends items to users based on others with similar patterns.

E.g., User 1002 might be recommended Item B next, since similar users 1001/1004 liked it.


In [None]:
# import pickle
# import os

# # Create the 'models' directory in the parent folder, if it doesn't already exist.
# # This assumes your script is currently running from a subdirectory (e.g., 'notebooks')
# # and you want the 'models' folder to be one level up, in the main project directory.
# os.makedirs("../models", exist_ok=True)

# # Save ALS model and sparse matrix to the specified 'models' directory in the parent folder.
# pickle.dump(model, open("../models/als_model.pkl", "wb"))
# pickle.dump(user_item_sparse_matrix, open("../models/user_item_matrix.pkl", "wb"))
# pickle.dump(user_mapping, open("../models/user_mapping.pkl", "wb"))
# pickle.dump(item_mapping, open("../models/item_mapping.pkl", "wb"))


In [None]:
#pip install joblib

Collecting joblib
  Downloading joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Downloading joblib-1.5.1-py3-none-any.whl (307 kB)
Installing collected packages: joblib
Successfully installed joblib-1.5.1
Note: you may need to restart the kernel to use updated packages.


In [10]:
import joblib
import os

# Create models directory if it doesn't exist
os.makedirs("../models", exist_ok=True)

# Save ALS model and artifacts using joblib
joblib.dump(model, "../models/als_model.joblib")
joblib.dump(user_item_sparse_matrix, "../models/user_item_matrix.joblib")
joblib.dump(user_mapping, "../models/user_mapping.joblib")
joblib.dump(item_mapping, "../models/item_mapping.joblib")



['../models/item_mapping.joblib']