# Temporal Feature Engineering (Leakage-Safe)
This notebook adds time-aware features that capture short-term user and movie dynamics.

**Key rule:** for each interaction (user, movie, timestamp), all temporal features are computed using **only prior events**.
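
A minimal sketch of this rule, assuming a hypothetical toy frame (the values are invented for illustration). Shifting within each user group before aggregating means every row sees only that user's earlier ratings:

```python
import pandas as pd

# Hypothetical toy interactions, already sorted by time.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0],
})

# shift(1) drops the current event, so the expanding mean uses prior events only.
toy["prior_mean"] = toy.groupby("userId")["rating"].transform(
    lambda s: s.shift(1).expanding().mean()
)
print(toy)
```

Each user's first row has no history, so its feature is NaN rather than a value derived from the current rating.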


## 0) Imports
Imports are kept at the top for reproducibility.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

## 1) Load prepared data and rebuild the full chronological stream
We load the raw ratings and join the movie metadata so temporal features can draw on the **full history**.
Then we sort by timestamp to enforce chronological processing.

In [2]:
DATA_DIR = Path("../data/raw/movielens-20m-dataset/")  

In [3]:
paths = {
    "ratings": DATA_DIR / "rating.csv",
    "movies": DATA_DIR / "movie.csv",
    "tags": DATA_DIR / "tag.csv",
    "links": DATA_DIR / "link.csv",
    "genome_tags": DATA_DIR / "genome_tags.csv",
    "genome_scores": DATA_DIR / "genome_scores.csv",
}

In [4]:
genome_pca = pd.read_parquet("../data/processed/genome_pca_50.parquet")

In [13]:
movies = pd.read_csv(paths["movies"])
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)").astype(float)
genre_dummies = movies["genres"].str.get_dummies(sep="|")
movies = pd.concat([movies, genre_dummies], axis=1)
movies = movies.merge(genome_pca, on="movieId", how="left")

In [14]:
ratings_all = pd.read_csv(paths["ratings"])
# unit="s" assumes Unix epoch seconds; drop it if the CSV stores datetime strings.
ratings_all["timestamp"] = pd.to_datetime(ratings_all["timestamp"], unit="s")
# Date-only copy (midnight of each day), used by the date-based split in section 5.
ratings_all["timestamp_dt"] = ratings_all["timestamp"].dt.normalize()
ratings_all = ratings_all.sort_values(["timestamp","userId","movieId"]).reset_index(drop=True)
# `movies` already carries the genome PCA columns, so genome_pca is not merged a second time
# (a second merge would duplicate those columns with _x/_y suffixes).
ratings_all = ratings_all.merge(movies, on="movieId", how="left")

In [None]:
ratings_all = ratings_all.sort_values(["timestamp", "userId", "movieId"]).reset_index(drop=True)

In [None]:
ratings_all["high_rating"] = (ratings_all["rating"] >= 4).astype("int8")

## 2) Parameters
We use interaction-based exponential decay (EWMA) to capture recent behavior.
- `EWMA_SPAN`: how fast older interactions are forgotten (smaller = faster decay)
- `LAST_N`: window size for last-N statistics
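
To make the decay rate concrete: pandas converts a span into a smoothing factor `alpha = 2 / (span + 1)`, and with `adjust=False` the EWMA follows the recursion `y_t = (1 - alpha) * y_{t-1} + alpha * x_t`. The sketch below checks this on a hypothetical rating sequence (the values are illustrative only):

```python
import numpy as np
import pandas as pd

span = 10
alpha = 2.0 / (span + 1)  # pandas' definition of span; about 0.18 for span=10

x = pd.Series([4.0, 2.0, 5.0, 3.0, 1.0])  # hypothetical ratings in time order

# Manual recursion: y_0 = x_0, then y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)

pandas_ewm = x.ewm(span=span, adjust=False).mean()
assert np.allclose(manual, pandas_ewm.to_numpy())
```

An observation `k` interactions back is discounted by `(1 - alpha) ** k` relative to the newest one, so a smaller span forgets faster.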


In [None]:
EWMA_SPAN = 10   # alpha = 2 / (span + 1) ≈ 0.18 (tune later)
LAST_N = 5         # last-N rolling window

## 3) User temporal features
We compute (shifted by 1 to exclude the current event):
1) `user_rating_ewm`  : decay-weighted mean rating
2) `user_like_ewm`    : decay-weighted like rate
3) `user_lastN_mean`  : mean of last N ratings
4) `user_lastN_like`  : like rate over last N ratings
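
As a small illustration of the last-N statistics, the sketch below applies the same shift-then-roll pattern to one hypothetical user's ratings. The window size of 2 is chosen only to keep the example short:

```python
import pandas as pd

# One hypothetical user's ratings, in time order.
ratings = pd.Series([4.0, 2.0, 5.0, 3.0, 1.0])
last_n = 2  # illustrative window; the notebook uses LAST_N

# shift(1) removes the current rating; min_periods=1 allows partial windows early on.
prior_mean = ratings.shift(1).rolling(last_n, min_periods=1).mean()
print(prior_mean)
```

The value at position 3 averages the ratings 2.0 and 5.0 at positions 1 and 2, and never includes the rating 3.0 being predicted.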


In [None]:
g_user = ratings_all.groupby("userId", sort=False)

ratings_all["user_rating_ewm"] = g_user["rating"].transform(
    lambda s: s.shift(1).ewm(span=EWMA_SPAN, adjust=False).mean()
)

ratings_all["user_like_ewm"] = g_user["high_rating"].transform(
    lambda s: s.shift(1).ewm(span=EWMA_SPAN, adjust=False).mean()
)

ratings_all["user_lastN_mean"] = g_user["rating"].transform(
    lambda s: s.shift(1).rolling(LAST_N, min_periods=1).mean()
)

ratings_all["user_lastN_like"] = g_user["high_rating"].transform(
    lambda s: s.shift(1).rolling(LAST_N, min_periods=1).mean()
)

ratings_all[["user_rating_ewm","user_like_ewm","user_lastN_mean","user_lastN_like"]].head(10)
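
One way to sanity-check for leakage is a perturbation test: change each user's most recent rating and confirm that no feature value moves. This is a sketch on a hypothetical toy frame; `user_rating_ewm` is an illustrative helper, not part of the notebook:

```python
import numpy as np
import pandas as pd

def user_rating_ewm(df, span=10):
    """Illustrative helper: prior-only EWMA of ratings per user."""
    return df.groupby("userId", sort=False)["rating"].transform(
        lambda s: s.shift(1).ewm(span=span, adjust=False).mean()
    )

# Hypothetical interactions, already in chronological order.
toy = pd.DataFrame({
    "userId": [1, 2, 1, 1, 2],
    "rating": [4.0, 3.0, 2.0, 5.0, 1.0],
})
baseline = user_rating_ewm(toy)

# Perturb each user's most recent rating; a leakage-safe feature must not change.
perturbed = toy.copy()
last_rows = perturbed.groupby("userId").tail(1).index
perturbed.loc[last_rows, "rating"] = 0.5
after = user_rating_ewm(perturbed)

assert np.allclose(baseline.to_numpy(), after.to_numpy(), equal_nan=True)
# The first event per user has no history, so the feature is NaN there.
assert baseline[~toy.duplicated("userId")].isna().all()
```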


## 4) Movie temporal features
We compute (shifted by 1 to exclude the current event):
5) `movie_rating_ewm` : decay-weighted mean rating
6) `movie_like_ewm`   : decay-weighted like rate
7) `movie_pop_ewm`    : decay-weighted popularity proxy (EWMA of interaction=1)
8) `movie_trend_ewm`  : short-term rating trend (EWMA difference)


In [None]:
g_movie = ratings_all.groupby("movieId", sort=False)

ratings_all["movie_rating_ewm"] = g_movie["rating"].transform(
    lambda s: s.shift(1).ewm(span=EWMA_SPAN, adjust=False).mean()
)

ratings_all["movie_like_ewm"] = g_movie["high_rating"].transform(
    lambda s: s.shift(1).ewm(span=EWMA_SPAN, adjust=False).mean()
)

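# NOTE: an EWMA is a weighted average, so the EWMA of a constant 1.0 is 1.0.
# This column is therefore NaN for a movie's first rating and 1.0 afterwards;
# it flags prior history rather than measuring popularity.
ratings_all["movie_pop_ewm"] = g_movie["rating"].transform(
    lambda s: pd.Series(1.0, index=s.index).shift(1).ewm(span=EWMA_SPAN, adjust=False).mean()
)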

ratings_all["movie_trend_ewm"] = g_movie["rating"].transform(
    lambda s: s.shift(1).ewm(span=EWMA_SPAN, adjust=False).mean().diff()
)

ratings_all[["movie_rating_ewm","movie_like_ewm","movie_pop_ewm","movie_trend_ewm"]].head(10)
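
A sketch of how an EWMA over a constant interaction indicator behaves, using hypothetical data. The `prior_count` column is one possible alternative popularity proxy (an assumption, not the notebook's method): the number of earlier interactions per movie, which is equally leakage-safe because `cumcount` starts at 0 for the first event.

```python
import numpy as np
import pandas as pd

# Hypothetical interactions, already in chronological order.
toy = pd.DataFrame({"movieId": [10, 10, 20, 10, 20]})
g = toy.groupby("movieId", sort=False)

# EWMA of a constant indicator: NaN for the first event, 1.0 afterwards.
toy["pop_ewm"] = g["movieId"].transform(
    lambda s: pd.Series(1.0, index=s.index).shift(1).ewm(span=10, adjust=False).mean()
)

# Alternative (assumption): count of prior interactions for the movie.
toy["prior_count"] = g.cumcount()
print(toy)
```

Unlike the EWMA, the prior count is not decay-weighted; a time-windowed count would be needed to capture recency.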


## 5) Persist updated datasets
We re-split by timestamp boundaries and save the updated parquet files.
Because features were computed on the full history before splitting, validation and test rows keep features built from earlier, training-period events.
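
A minimal sketch of the boundary logic on hypothetical dates, using explicit `pd.Timestamp` boundaries. The column name `timestamp_dt` mirrors the notebook, while the toy rows are invented for illustration:

```python
import pandas as pd

# Hypothetical date-only timestamps around each boundary.
toy = pd.DataFrame({
    "timestamp_dt": pd.to_datetime([
        "2007-06-01", "2008-01-01", "2012-12-31",
        "2013-01-01", "2013-12-31", "2014-01-01",
    ]),
})
train_start = pd.Timestamp("2008-01-01")
train_end = pd.Timestamp("2012-12-31")
val_end = pd.Timestamp("2013-12-31")

ts = toy["timestamp_dt"]
train = toy[ts.between(train_start, train_end)]  # inclusive on both ends
val = toy[(ts > train_end) & (ts <= val_end)]
test = toy[ts > val_end]

# Splits must be disjoint and ordered in time.
assert train["timestamp_dt"].max() < val["timestamp_dt"].min()
assert val["timestamp_dt"].max() < test["timestamp_dt"].min()
print({"train": len(train), "val": len(val), "test": len(test)})
```

Rows before `train_start` fall into none of the splits, although they still contribute to the history behind later rows' features.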


In [None]:
# Explicit timestamps avoid ambiguous day/month parsing of date strings.
train_start = pd.Timestamp("2008-01-01")
train_end = pd.Timestamp("2012-12-31")
val_end = pd.Timestamp("2013-12-31")

# Rows before train_start are excluded from every split.
train2 = ratings_all[(ratings_all['timestamp_dt'] <= train_end) & (ratings_all['timestamp_dt'] >= train_start)]
val2   = ratings_all[(ratings_all['timestamp_dt'] > train_end) & (ratings_all['timestamp_dt'] <= val_end)]
test2  = ratings_all[ratings_all['timestamp_dt'] > val_end]
print("Rows:", {"train": len(train2), "val": len(val2), "test": len(test2)})

train2.to_parquet("../data/processed/train_prepared_v2_temporal.parquet", index=False)
val2.to_parquet("../data/processed/val_prepared_v2_temporal.parquet", index=False)
test2.to_parquet("../data/processed/test_prepared_v2_temporal.parquet", index=False)

print("Saved: *_prepared_v2_temporal.parquet")