<a href="https://colab.research.google.com/github/harishkulkarni10/ecommerce-session-recommender/blob/main/notebooks/2_Session_Engineering_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook 2 — Session Engineering & Feature Construction

**Objective:**  
Transform the cleaned clickstream data (`cleaned_events.csv`) into structured, model-ready session sequences.

### Steps in this Notebook
1. **Load cleaned dataset** from Notebook 1 outputs.  
2. **Group by `sessionId`** → form ordered item sequences.  
3. **Filter sessions** (too short / too long).  
4. **Create chronological train/val/test split**.  
5. **Summarize session-level statistics** (session counts, average and maximum length).  
6. **Save engineered artifacts** (`sessions.pkl`, `split.pkl`, `sessions_summary.json`) for modeling.

Session-based and sequential recommenders such as GRU4Rec and SASRec take sequences of item indices as input.  
This notebook bridges the cleaned event logs and the modeling phase.


In [12]:
# STEP 2.0 — Mount Drive and load cleaned data

from google.colab import drive
import os, pickle, pandas as pd, glob

# Mount Drive
drive.mount('/content/drive', force_remount=True)

# Define project root
PROJECT_ROOT = '/content/drive/MyDrive/Data Science course/Major Projects/Projects/e-commerce recommender/diginetica_recommender_project/'
CLEANED_PATH = os.path.join(PROJECT_ROOT, 'data/cleaned/')
SESSION_PATH = os.path.join(PROJECT_ROOT, 'data/sessions/')

# cleaned dataset
cleaned_fp = os.path.join(CLEANED_PATH, 'cleaned_events.csv')
df = pd.read_csv(cleaned_fp)
print("Cleaned dataset loaded:", df.shape)
display(df.head())

# Load item mappings
with open(os.path.join(CLEANED_PATH, 'item2id.pkl'), 'rb') as f:
    item2id = pickle.load(f)
with open(os.path.join(CLEANED_PATH, 'id2item.pkl'), 'rb') as f:
    id2item = pickle.load(f)

print(f"Mappings loaded: {len(item2id)} items.")

Mounted at /content/drive
Cleaned dataset loaded: (83371, 5)


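| | sessionId | itemId | timeframe | eventdate | item_idx |
|---|-----------|--------|-----------|-----------|----------|
| 0 | 1 | 9654 | 75848 | 2016-05-09 | 0 |
| 1 | 1 | 33043 | 173912 | 2016-05-09 | 1 |
| 2 | 1 | 32118 | 243569 | 2016-05-09 | 2 |
| 3 | 1 | 12352 | 329870 | 2016-05-09 | 3 |
| 4 | 1 | 35077 | 390072 | 2016-05-09 | 4 |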


Mappings loaded: 31399 items.


### Step 2.1 — Group Events by `sessionId` to Form Item Sequences

Each session represents a sequence of items that a user interacted with during a browsing period.

In this step:
- We group all events by `sessionId`.
- Within each session, items stay in **chronological order**. The grouping does not re-sort; it relies on the row order produced in Notebook 1.
- The result is a list of item indices representing each session's browsing path.

This structure will later be used to train models like GRU4Rec and SASRec, which learn to predict the **next item** given previous ones.
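
To make the "next item" objective concrete, the sketch below expands one session into (prefix, target) pairs. The helper name `next_item_pairs` is hypothetical and the pairing scheme is only one common way to frame the task; GRU4Rec and SASRec implementations may batch their inputs differently.

```python
from typing import List, Tuple


def next_item_pairs(seq: List[int]) -> List[Tuple[List[int], int]]:
    """Expand one session into (prefix, next_item) pairs.

    A session of length n yields n - 1 pairs; a single-item session yields none.
    """
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Toy session of item indices (illustrative values, not taken from the dataset)
example_session = [10, 11, 12, 13]
for prefix, target in next_item_pairs(example_session):
    print(prefix, "->", target)
```

A session with a single item produces no pairs at all, which is why such sessions are removed in the next step.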


In [13]:
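# STEP 2.1 - Group by sessionId to form ordered item sequences

# Group interactions by session. groupby keeps the original row order within
# each group, so sequences are chronological only if df is already sorted
# (done in Notebook 1).
session_groups = df.groupby('sessionId')['item_idx'].apply(list)

# Convert to dictionary: {sessionId: [item_idx, ...]}
sessions = session_groups.to_dict()

first_session_id = next(iter(sessions))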
print(f"Example session (ID: {first_session_id})")
print(sessions[first_session_id])

print("\nTotal sessions formed:", len(sessions))
print("Average sequence length:", round(sum(len(v) for v in sessions.values()) / len(sessions), 2))
print("Max sequence length:", max(len(v) for v in sessions.values()))

Example session (ID: 1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Total sessions formed: 26676
Average sequence length: 3.13
Max sequence length: 52
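
The grouping above trusts the row order of `cleaned_events.csv`. If that order is ever in doubt, an explicit sort before grouping is a cheap safeguard. The sketch below uses a toy DataFrame and assumes that `timeframe` orders events within a session; that assumption should be checked against Notebook 1 before applying it to the real data.

```python
import pandas as pd

# Toy events, deliberately out of order (illustrative values only)
toy = pd.DataFrame({
    "sessionId": [2, 1, 1, 2, 1],
    "timeframe": [500, 300, 100, 200, 200],
    "item_idx":  [7, 3, 1, 6, 2],
})

# Sort by session, then by within-session time (assumed meaning of `timeframe`)
ordered = toy.sort_values(["sessionId", "timeframe"])

# groupby preserves the row order inside each group, so the lists are now chronological
toy_sessions = ordered.groupby("sessionId")["item_idx"].apply(list).to_dict()
print(toy_sessions)
```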


In [14]:
session_groups

| sessionId | item_idx |
|-----------|----------|
| 1 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| 2 | [10, 11, 12, 13, 14, 15, 16, 17] |
| 4 | [18] |
| 5 | [19, 20] |
| 6 | [21, 22] |
| ... | ... |
| 38497 | [3079] |
| 38499 | [3314, 31382] |
| 38500 | [5906, 28518] |
| 38501 | [188, 809, 10093, 31383, 1027, 4732, 31384, 4174] |


### Step 2.2 — Filter and Normalize Session Sequences

Not all sessions are equally useful for modeling:

- **Very short sessions (length < 2)**: contain no “next item” prediction target, so they are dropped.  
- **Very long sessions (length > 50)**: outliers that can dominate training time. They are truncated to their first 50 items rather than dropped.

After this step every session has between 2 and 50 items.  
Bounding the sequence length keeps training time predictable.
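
There are two common ways to handle over-long sessions: drop them, or keep only their first `max_len` items. The filtering cell in this notebook takes the second route (`seq[:max_len]`). The sketch below contrasts both on toy data; the function names are hypothetical.

```python
def filter_and_truncate(sessions, min_len=2, max_len=50):
    """Drop sessions shorter than min_len; cut longer ones to their first max_len items."""
    return {sid: seq[:max_len] for sid, seq in sessions.items() if len(seq) >= min_len}


def filter_strict(sessions, min_len=2, max_len=50):
    """Keep only sessions whose length lies in [min_len, max_len]; drop everything else."""
    return {sid: seq for sid, seq in sessions.items() if min_len <= len(seq) <= max_len}


# Toy sessions: one too short, one in range, one too long for max_len=5
toy_sessions = {"a": [1], "b": [1, 2, 3], "c": list(range(6))}

truncated = filter_and_truncate(toy_sessions, max_len=5)
strict = filter_strict(toy_sessions, max_len=5)
print(truncated)
print(strict)
```

Truncation keeps the over-long session `"c"` with five items, while the strict variant discards it. Which behavior is preferable depends on whether the tail of a long session is considered informative.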


In [15]:
# STEP 2.2 — Filter sessions

min_len = 2
max_len = 50

before = len(sessions)

# Drop sessions shorter than min_len; truncate longer ones to their first max_len items
filtered_sessions = {sid: seq[:max_len] for sid, seq in sessions.items() if len(seq) >= min_len}

after = len(filtered_sessions)

print(f"Sessions before filtering: {before}")
print(f"Sessions after filtering:  {after}")
print(f"Filtered out {before - after} short sessions.")
print(f"Average session length (post-filter): {round(sum(len(v) for v in filtered_sessions.values()) / len(filtered_sessions), 2)}")


Sessions before filtering: 26676
Sessions after filtering:  16397
Filtered out 10279 short sessions.
Average session length (post-filter): 4.46


### Step 2.3 — Chronological Train/Validation/Test Split

In real-world recommendation systems, we predict **future behavior** from **past interactions**.

To simulate this, we perform a **chronological split** rather than a random one:
- **Train set:** older sessions (e.g., first 80%)
- **Validation set:** next 10%
- **Test set:** most recent 10%

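This preserves temporal realism and limits data leakage: the model is trained only on sessions that ended on or before the dates covered by the validation and test sets. Because `eventdate` has day granularity, sessions on a boundary date can fall on either side of a cut.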


In [16]:
# STEP 2.3

# Build a per-session table keyed by each session's last event date
session_meta = df.groupby('sessionId')['eventdate'].max().reset_index()
session_meta = session_meta[session_meta['sessionId'].isin(filtered_sessions.keys())]
# Break ties on sessionId so the ordering (and therefore the split) is reproducible
session_meta = session_meta.sort_values(['eventdate', 'sessionId']).reset_index(drop=True)

# Split indices by time
total_sessions = len(session_meta)
train_end = int(total_sessions * 0.8)
val_end = int(total_sessions * 0.9)

train_sessions = session_meta.iloc[:train_end]['sessionId'].tolist()
val_sessions = session_meta.iloc[train_end:val_end]['sessionId'].tolist()
test_sessions = session_meta.iloc[val_end:]['sessionId'].tolist()

print(f"Total sessions: {total_sessions}")
print(f"Train: {len(train_sessions)} | Val: {len(val_sessions)} | Test: {len(test_sessions)}")
print(f"Date ranges — Train: {session_meta.iloc[0]['eventdate']} → {session_meta.iloc[train_end-1]['eventdate']}")
print(f"Val: {session_meta.iloc[train_end]['eventdate']} → {session_meta.iloc[val_end-1]['eventdate']}")
print(f"Test: {session_meta.iloc[val_end]['eventdate']} → {session_meta.iloc[-1]['eventdate']}")


Total sessions: 16397
Train: 13117 | Val: 1640 | Test: 1640
Date ranges — Train: 2016-01-01 → 2016-04-26
Val: 2016-04-26 → 2016-05-07
Test: 2016-05-07 → 2016-06-01
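
The ranges printed above share their boundary dates: 2016-04-26 closes the train range and opens the validation range, and 2016-05-07 does the same for validation and test. If a stricter separation is wanted, one option is to move each cut to a day boundary so that every calendar day belongs to exactly one split. The sketch below shows the idea on toy data; `split_on_whole_days` is a hypothetical helper, not part of this notebook, and the resulting proportions only approximate 80/10/10.

```python
import pandas as pd


def split_on_whole_days(meta, train_frac=0.8, val_frac=0.1):
    """Split sessions chronologically so that no date is shared between splits.

    `meta` needs a `sessionId` column and an `eventdate` column holding ISO
    date strings (or any values that sort chronologically).
    """
    ordered = meta.sort_values(["eventdate", "sessionId"]).reset_index(drop=True)
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)

    # The dates found at the positional cut points become the cutoffs
    train_cut = ordered.loc[train_end, "eventdate"]
    val_cut = ordered.loc[val_end, "eventdate"]

    dates = ordered["eventdate"]
    train = ordered.loc[dates < train_cut, "sessionId"].tolist()
    val = ordered.loc[(dates >= train_cut) & (dates < val_cut), "sessionId"].tolist()
    test = ordered.loc[dates >= val_cut, "sessionId"].tolist()
    return train, val, test


# Toy metadata: a purely positional 80/10/10 cut would split 2016-04-26 across train and val
toy_meta = pd.DataFrame({
    "sessionId": list(range(1, 11)),
    "eventdate": ["2016-04-24"] * 4 + ["2016-04-25"] * 3 + ["2016-04-26"] * 2 + ["2016-04-27"],
})

train_ids, val_ids, test_ids = split_on_whole_days(toy_meta)
print(train_ids, val_ids, test_ids)
```

One caveat: if both cut points land on the same date, the validation set comes out empty, so the split sizes should be inspected after using this approach.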


### Step 2.4 — Save Engineered Session Data

We now save all engineered outputs for downstream modeling.

**Artifacts saved:**
1. `sessions.pkl` — dictionary of session sequences (`{sessionId: [item_idx, ...]}`)
2. `split.pkl` — dictionary containing train/val/test session IDs
3. `sessions_summary.json` — optional metadata summary (counts, lengths, etc.)

These files will be loaded in the modeling notebooks (Notebook 3: Baselines and Notebook 4: Advanced Sequential Models).


In [17]:
# STEP 2.4 — Save engineered session data and splits

import json

# Ensure session folder exists
os.makedirs(SESSION_PATH, exist_ok=True)

# 1. Save session sequences
sessions_fp = os.path.join(SESSION_PATH, 'sessions.pkl')
with open(sessions_fp, 'wb') as f:
    pickle.dump(filtered_sessions, f)
print(f"Saved session sequences to: {sessions_fp}")

# 2. Save split info
split_dict = {
    'train': train_sessions,
    'val': val_sessions,
    'test': test_sessions
}
split_fp = os.path.join(SESSION_PATH, 'split.pkl')
with open(split_fp, 'wb') as f:
    pickle.dump(split_dict, f)
print(f"Saved session split info to: {split_fp}")

# 3. Save summary JSON
summary = {
    'total_sessions': len(filtered_sessions),
    'train_sessions': len(train_sessions),
    'val_sessions': len(val_sessions),
    'test_sessions': len(test_sessions),
    'avg_session_length': round(sum(len(v) for v in filtered_sessions.values()) / len(filtered_sessions), 2),
    'max_session_length': max(len(v) for v in filtered_sessions.values())
}
summary_fp = os.path.join(SESSION_PATH, 'sessions_summary.json')
with open(summary_fp, 'w') as f:
    json.dump(summary, f, indent=2)
print(f"Saved session summary to: {summary_fp}")

# Quick verification
print("\nFiles saved successfully. Example paths:")
!ls -lh "$SESSION_PATH"


Saved session sequences to: /content/drive/MyDrive/Data Science course/Major Projects/Projects/e-commerce recommender/diginetica_recommender_project/data/sessions/sessions.pkl
Saved session split info to: /content/drive/MyDrive/Data Science course/Major Projects/Projects/e-commerce recommender/diginetica_recommender_project/data/sessions/split.pkl
Saved session summary to: /content/drive/MyDrive/Data Science course/Major Projects/Projects/e-commerce recommender/diginetica_recommender_project/data/sessions/sessions_summary.json

Files saved successfully. Example paths:
total 373K
-rw------- 1 root root 325K Oct 17 09:34 sessions.pkl
-rw------- 1 root root  163 Oct 17 09:34 sessions_summary.json
-rw------- 1 root root  48K Oct 17 09:34 split.pkl
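
A downstream notebook can read these artifacts back and run a quick consistency check before training. The sketch below is self-contained: it writes toy stand-ins to a temporary directory instead of touching the real files, and the helper names `load_artifacts` and `check_artifacts` are hypothetical. Note that `pickle.load` should only be used on files you created yourself.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real artifacts (values are illustrative only)
toy_sessions = {1: [0, 1, 2], 2: [3, 4], 5: [5, 6, 7, 8]}
toy_split = {"train": [1, 2], "val": [5], "test": []}


def load_artifacts(session_dir):
    """Load sessions.pkl and split.pkl from a directory."""
    with open(os.path.join(session_dir, "sessions.pkl"), "rb") as f:
        sessions = pickle.load(f)
    with open(os.path.join(session_dir, "split.pkl"), "rb") as f:
        split = pickle.load(f)
    return sessions, split


def check_artifacts(sessions, split):
    """Raise AssertionError if the split and the session dictionary disagree."""
    all_ids = [sid for ids in split.values() for sid in ids]
    assert len(all_ids) == len(set(all_ids)), "a session appears in two splits"
    assert set(all_ids) == set(sessions), "split IDs and session IDs differ"


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the toy artifacts with pickle.dump, as the saving cell does
    for name, obj in [("sessions.pkl", toy_sessions), ("split.pkl", toy_split)]:
        with open(os.path.join(tmp_dir, name), "wb") as f:
            pickle.dump(obj, f)
    loaded_sessions, loaded_split = load_artifacts(tmp_dir)

check_artifacts(loaded_sessions, loaded_split)

# Resolve the train session IDs into their item sequences
train_sequences = [loaded_sessions[sid] for sid in loaded_split["train"]]
print(train_sequences)
```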


## Summary of Session Engineering & Feature Construction

We successfully transformed the cleaned Diginetica event logs into structured, model-ready session data.

### Key Steps Completed
1. **Grouped events by `sessionId`** → created ordered item sequences.  
2. **Filtered sessions** → dropped sessions with fewer than 2 items and truncated longer ones to their first 50 items.  
3. **Chronological Split** → 80/10/10 train/val/test partition based on each session's last event date.  
4. **Saved Artifacts**:
   - `sessions.pkl` — sessionId → item_idx sequences  
   - `split.pkl` — session IDs for each data split  
   - `sessions_summary.json` — metadata summary for reference  

### Insights
- After filtering, **16,397 valid sessions** remain with an **average length of ~4.5 items**.  
- Temporal coverage (Jan–Jun 2016) allows for realistic next-item prediction.  
- Dataset now matches the input requirements for sequential recommenders like **GRU4Rec** and **SASRec**.

### Outputs Ready for Next Notebook
| File | Description |
|------|--------------|
| `data/sessions/sessions.pkl` | Session → item index sequences |
| `data/sessions/split.pkl` | Chronological train/val/test splits |
| `data/sessions/sessions_summary.json` | Basic metadata summary |

---

Next → **Notebook 3: Baseline Models**,  
where we will build popularity-based and item-based collaborative filtering recommenders as performance benchmarks before moving on to deep learning models.
