# Train / Validation Split

Goal of this notebook:

- Load the cleaned `interactions_train` data
- Build a per-user **train / validation** split that respects temporal order
- Test three strategies:
  - *leave-last-N-out* per user
  - *chronological split* by ratio (e.g. 80% train / 20% val)
  - *Leave-One-Out* (leave-last-N-out with N = 1)
- Save:
  - `train_interactions.csv`
  - `val_interactions.csv`

These files will be used to train and evaluate the models in week 3.



In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [None]:
interactions = pd.read_csv(
    'https://raw.githubusercontent.com/halim-y/DSML_Kaggle_Competition/refs/heads/main/data/raw/interactions_train.csv'
)

# rename the columns to the usual names
interactions = interactions.rename(columns={
    "u": "user_id",
    "i": "item_id",
    "t": "timestamp"
})

interactions = interactions.sort_values(["user_id", "timestamp"])
interactions.head()



Unnamed: 0,user_id,item_id,timestamp
21035,0,0,1680191000.0
28842,0,1,1680783000.0
3958,0,2,1680801000.0
29592,0,3,1683715000.0
6371,0,3,1683715000.0


## 2. Method 1: Leave-last-N-out

This strategy uses the **last N interactions** of each user as validation. Users with N or fewer interactions stay entirely in train.
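To make the rule concrete, here is a minimal sketch on a hypothetical six-row frame. The `toy` data and the names `rank_from_end`, `history_len` and `is_val` are illustrative assumptions, not part of the dataset. It marks each user's most recent row as validation with `groupby(...).cumcount(ascending=False)` and leaves users with `n` or fewer rows in train.

```python
import pandas as pd

# Hypothetical toy data: user 0 has three interactions, user 1 has two, user 2 has one
toy = pd.DataFrame({
    "user_id":   [0, 0, 0, 1, 1, 2],
    "item_id":   [10, 11, 12, 10, 13, 14],
    "timestamp": [1.0, 2.0, 3.0, 5.0, 4.0, 6.0],
})

n = 1
toy_sorted = toy.sort_values(["user_id", "timestamp"])
by_user = toy_sorted.groupby("user_id")

# 0 = most recent interaction of the user, 1 = the one before, and so on
rank_from_end = by_user.cumcount(ascending=False)
# Number of interactions of the row's user, repeated on every row
history_len = by_user["timestamp"].transform("size")

# A row goes to validation if it is among the last n AND the user has more than n rows
is_val = (rank_from_end < n) & (history_len > n)

toy_train = toy_sorted[~is_val]
toy_val = toy_sorted[is_val]
print(toy_val)
```

In this toy case, user 2 has a single interaction and therefore contributes nothing to validation, while user 1's held-out row is the one with the larger timestamp, not the one that comes last in the file.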

In [None]:
def leave_last_n_out(df, n=1, user_col="user_id", time_col="timestamp"):
    """Hold out each user's last n interactions (by time) as validation."""
    df_sorted = df.sort_values([user_col, time_col]).copy()
    by_user = df_sorted.groupby(user_col)

    # Position counted from the end of each user's history (0 = most recent)
    rank_from_end = by_user.cumcount(ascending=False)
    # Number of interactions of the row's user
    history_len = by_user[time_col].transform("size")

    # Users with n or fewer interactions stay entirely in train
    is_val = (rank_from_end < n) & (history_len > n)

    train_df = df_sorted[~is_val]
    val_df   = df_sorted[is_val]

    return train_df, val_df


In [None]:
train_ll1, val_ll1 = leave_last_n_out(interactions.copy(), n=1)

print("Total interactions :", len(interactions))
print("Train size :", len(train_ll1))
print("Val size   :", len(val_ll1))


Total interactions : 87047
Train size : 79209
Val size   : 7838




## 3. Method 2: Per-user chronological split

For each user, this method:

- sorts the interactions in chronological order
- uses a fixed proportion (e.g. 80%) as *train*
- uses the remainder (e.g. 20%) as *validation*
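Because histories are short, the per-user ratio is only approximate. The sketch below isolates the cut arithmetic in a hypothetical helper, `cut_index` (not a function from this notebook), assuming the train size is rounded down and at least one row is kept on each side of the split.

```python
import numpy as np

def cut_index(m, val_ratio=0.2):
    """Number of train rows for a user with m interactions (hypothetical helper)."""
    if m < 2:
        return m  # a single interaction cannot be split; it stays in train
    cut = int(np.floor((1 - val_ratio) * m))
    return max(1, min(cut, m - 1))  # keep at least one row on each side

for m in [1, 2, 3, 4, 5, 10]:
    n_train = cut_index(m)
    n_val = m - n_train
    print(f"m={m:>2}  train={n_train}  val={n_val}  val share={n_val / m:.0%}")
```

Rounding the train size down means every user with at least two interactions gives 20% or more of their rows to validation. This is consistent with the run recorded in this notebook, where validation holds 20465 of 87047 rows (about 23.5%).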

In [None]:
def chrono_split(df, val_ratio=0.2, user_col="user_id", time_col="timestamp"):
    """Send the last val_ratio share of each user's interactions to validation."""
    df_sorted = df.sort_values([user_col, time_col]).copy()
    by_user = df_sorted.groupby(user_col)

    # Position within the user's history (0 = oldest) and history length
    pos = by_user.cumcount()
    m = by_user[time_col].transform("size")

    # Number of train rows per user, rounded down
    cut = np.floor((1 - val_ratio) * m)
    # Safety: keep at least one row on each side of the split
    cut = np.maximum(1, np.minimum(cut, m - 1))

    # Users with a single interaction stay entirely in train
    is_val = (pos >= cut) & (m >= 2)

    train_df = df_sorted[~is_val]
    val_df   = df_sorted[is_val]

    return train_df, val_df


In [None]:
train_chrono, val_chrono = chrono_split(interactions.copy(), val_ratio=0.2)

print("Train size :", len(train_chrono))
print("Val size   :", len(val_chrono))
print("Total      :", len(train_chrono) + len(val_chrono))


Train size : 66582
Val size   : 20465
Total      : 87047




## 4. Final Choice

We adopt the chronological leave-last-N-out strategy with N = 1: the most recent interaction of each user is held out for validation, out of a minimum of 3 interactions per user. This approach was chosen because it is:

- **More Realistic**: It simulates a real-world forecasting scenario where we must predict the next likely action based on past history.

- **Temporally Consistent**: It respects the strict timeline of user behavior, preventing future data from leaking into the training process.

- **Robust to Cold-Start Items**: It naturally isolates "vanishing items" (items whose only interaction is a user's most recent one) in the validation set, which lets us test our Content-Based model's ability to handle items with zero training history.
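The cold-start point can be quantified by counting validation items that never occur in train. The following is a minimal sketch on hypothetical toy frames (`toy_train`, `toy_val` and their values are made up for illustration); the same three lines should apply to any pair of frames with an `item_id` column.

```python
import pandas as pd

# Hypothetical toy split: item 99 appears only in validation
toy_train = pd.DataFrame({"user_id": [0, 0, 1], "item_id": [1, 2, 1]})
toy_val = pd.DataFrame({"user_id": [0, 1], "item_id": [99, 2]})

# Validation rows whose item was never seen during training
is_cold = ~toy_val["item_id"].isin(toy_train["item_id"])
cold_items = sorted(toy_val.loc[is_cold, "item_id"].unique().tolist())

print("Cold-start items in validation:", cold_items)
print(f"Share of validation rows with a cold item: {is_cold.mean():.1%}")
```

A collaborative-filtering model has no signal for these rows, so their share gives a rough upper bound on what only a content-based component could recover.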

In [None]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/HEC/Data Science and Machine Learning/Team Project/data/interactions_merged.csv')

Mounted at /content/drive


In [None]:
train_df, val_df = leave_last_n_out(df, n=1)



## Sanity Check

In [None]:
# Use df (the frame that was actually split) as the reference, not the raw interactions
n_rows_total  = len(df)
n_users_total = df['user_id'].nunique()
n_users_train = train_df['user_id'].nunique()
n_users_val   = val_df['user_id'].nunique()

print("\n>>> Split Statistics:")
print(f"   Total Interactions: {n_rows_total}")
print(f"   Train Rows:         {len(train_df)} ({len(train_df)/n_rows_total:.1%})")
print(f"   Validation Rows:    {len(val_df)}   ({len(val_df)/n_rows_total:.1%})")
print("-" * 30)
print(f"   Total Users:        {n_users_total}")
print(f"   Users in Train:     {n_users_train}")
print(f"   Users in Val:       {n_users_val}")

# Every user must appear in validation, with exactly one row each
one_val_row_each = (val_df['user_id'].value_counts() == 1).all()

if n_users_val == n_users_total and one_val_row_each:
    print("\n>>> CHECK PASSED: Every user has exactly 1 validation item.")
else:
    print(f"\n>>> WARNING: {n_users_total - n_users_val} users are missing from validation, "
          "or some users have more than one validation row.")


>>> Split Statistics:
   Total Interactions: 87047
   Train Rows:         79207 (91.0%)
   Validation Rows:    7838   (9.0%)
------------------------------
   Total Users:        7838
   Users in Train:     7838
   Users in Val:       7838

>>> CHECK PASSED: Every user has exactly 1 validation item.
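In the statistics above, 79207 + 7838 = 87045, two rows short of the 87047 reported as the total. The likely cause is that the total is computed on `interactions` while the split was applied to `df`. A partition check against the frame that was actually split catches this kind of mismatch. The sketch below uses a hypothetical helper, `check_partition`, and assumes the index is unique and that the split preserves index labels, as boolean-mask filtering does.

```python
import pandas as pd

def check_partition(full, train, val):
    """Hypothetical helper: train and val must be disjoint and cover `full` exactly."""
    overlap = train.index.intersection(val.index)
    assert len(overlap) == 0, f"{len(overlap)} rows appear in both splits"
    assert len(train) + len(val) == len(full), "row counts do not add up"
    # Assumes unique index labels, e.g. the default RangeIndex from read_csv
    assert set(train.index) | set(val.index) == set(full.index), "rows lost or added"
    return True

# Toy usage: rows 0 and 2 go to train, rows 1 and 3 to validation
toy = pd.DataFrame({"user_id": [0, 0, 1, 1], "timestamp": [1, 2, 3, 4]})
toy_train, toy_val = toy.iloc[[0, 2]], toy.iloc[[1, 3]]
print(check_partition(toy, toy_train, toy_val))
```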


## Data Leakage Check

In [None]:
# We group by user and check the max train time vs min val time
max_train_times = train_df.groupby('user_id')['timestamp'].max()
min_val_times   = val_df.groupby('user_id')['timestamp'].min()

# Align indices
common_users = max_train_times.index.intersection(min_val_times.index)
leakage = max_train_times.loc[common_users] > min_val_times.loc[common_users]

if leakage.sum() == 0:
    print(">>> CHECK PASSED: No temporal leakage detected.")
else:
    print(f">>> CRITICAL FAIL: {leakage.sum()} users have training data that occurred AFTER validation data.")

>>> CHECK PASSED: No temporal leakage detected.
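The check above uses a strict `>`, so a user whose last train row and validation row share a timestamp passes silently. Such ties appear to exist in this data: the `head()` output shows two rows for user 0 with the same displayed timestamp. The sketch below counts ties separately, using hypothetical toy frames and variable names.

```python
import pandas as pd

# Hypothetical toy split: user 0's last train row and validation row share a timestamp
toy_train = pd.DataFrame({"user_id": [0, 0, 1], "timestamp": [10.0, 20.0, 5.0]})
toy_val = pd.DataFrame({"user_id": [0, 1], "timestamp": [20.0, 7.0]})

max_train = toy_train.groupby("user_id")["timestamp"].max().rename("max_train")
min_val = toy_val.groupby("user_id")["timestamp"].min().rename("min_val")

# Keep only users present in both splits, side by side
both = pd.concat([max_train, min_val], axis=1, join="inner")

n_leak = int((both["max_train"] > both["min_val"]).sum())   # strictly out of order
n_tied = int((both["max_train"] == both["min_val"]).sum())  # same timestamp on both sides

print(f"Out-of-order users: {n_leak}, tied users: {n_tied}")
```

Ties are not leakage in the strict sense, but for tied rows the choice of which one lands in validation depends on the sort order rather than on time, which may be worth knowing when interpreting results.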


In [None]:
train_df.to_csv('/content/drive/MyDrive/Colab Notebooks/HEC/Data Science and Machine Learning/Team Project/data/train_merged_interactions.csv', index=False)
val_df.to_csv('/content/drive/MyDrive/Colab Notebooks/HEC/Data Science and Machine Learning/Team Project/data/val_merged_interactions.csv', index=False)