# Feature Engineering & Feature Selection (Baseline)
Chronologically safe, no leakage.
This notebook uses the information in its simplest form, as found in the challenge datasets. The idea is to develop simple features in order to train a baseline model to work with.

Output:

- ```train_prepared.parquet```
- ```val_prepared.parquet```
- ```test_prepared.parquet```

## 0. Imports

In [4]:
import pandas as pd
import numpy as np

from pathlib import Path

## 1. Load Data

In [29]:
DATA_DIR = Path("../data/raw/movielens-20m-dataset/")  

In [30]:
paths = {
    "ratings": DATA_DIR / "rating.csv",
    "movies": DATA_DIR / "movie.csv",
    "tags": DATA_DIR / "tag.csv",
    "links": DATA_DIR / "link.csv",
    "genome_tags": DATA_DIR / "genome_tags.csv",
    "genome_scores": DATA_DIR / "genome_scores.csv",
}

In [31]:
ratings_dtypes = {"userId": "int32", "movieId": "int32", "rating": "float32", "timestamp": "string"}
movies_dtypes = {"movieId": "int32", "title": "string", "genres": "string"}
tags_dtypes = {"userId": "int32", "movieId": "int32", "tag": "string", "timestamp": "string"}
links_dtypes = {"movieId": "int32", "imdbId": "Int64", "tmdbId": "Int64"}  # nullable ints
g_tags_dtypes = {"tagId": "int32", "tag": "string"}
g_scores_dtypes = {"movieId": "int32", "tagId": "int32", "relevance": "float32"}

ratings = pd.read_csv(paths["ratings"], dtype=ratings_dtypes)
movies = pd.read_csv(paths["movies"], dtype=movies_dtypes)
tags = pd.read_csv(paths["tags"], dtype=tags_dtypes)
links = pd.read_csv(paths["links"], dtype=links_dtypes)
genome_tags = pd.read_csv(paths["genome_tags"], dtype=g_tags_dtypes)
genome_scores = pd.read_csv(paths["genome_scores"], dtype=g_scores_dtypes)

ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], format="%Y-%m-%d %H:%M:%S")
ratings["timestamp"] = ratings["timestamp"].dt.strftime("%d-%m-%Y")
ratings["timestamp_dt"] = pd.to_datetime(ratings["timestamp"], format="%d-%m-%Y")

tags["timestamp"] = pd.to_datetime(tags["timestamp"], format="%Y-%m-%d %H:%M:%S")
tags["timestamp"] = tags["timestamp"].dt.strftime("%d-%m-%Y")
tags["timestamp_dt"] = pd.to_datetime(tags["timestamp"], format="%d-%m-%Y")

## 2. Binary Target

In [32]:
ratings['high_rating'] = (ratings['rating'] >= 4).astype('int8')

## 3. Chronological ordering

In [33]:
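# Sort on the datetime column: the "dd-mm-YYYY" strings in `timestamp` do not sort chronologically.
ratings = ratings.sort_values('timestamp_dt', kind='stable')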
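
Why the sort key matters: day-first strings such as `31-12-2009` are compared character by character, so sorting them does not give calendar order. A minimal sketch on made-up dates (the `toy` frame is hypothetical, not the challenge data):

```python
import pandas as pd

# Hypothetical toy data, not the MovieLens ratings.
toy = pd.DataFrame({"timestamp": ["15-03-2010", "02-01-2012", "31-12-2009"]})
toy["timestamp_dt"] = pd.to_datetime(toy["timestamp"], format="%d-%m-%Y")

# Same rows, two different sort keys.
by_string = toy.sort_values("timestamp")["timestamp"].tolist()
by_datetime = toy.sort_values("timestamp_dt")["timestamp"].tolist()

print(by_string)    # lexicographic order of the day-first strings
print(by_datetime)  # calendar order
```

Only the datetime key places the 2009 rating first, which is what the cumulative features in the next sections rely on.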

## 4. User features (past only)

In [34]:
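# "Past only" means each row may see the user's earlier ratings but not its own
# rating or label, so the current row is subtracted from the running sums.
user_key = ratings['userId']
n_prior = ratings.groupby('userId').cumcount()   # earlier ratings by this user
denom = n_prior.where(n_prior > 0)               # NaN on a user's first rating
for src, dst in [('rating', 'user_mean_rating'), ('high_rating', 'user_like_rate')]:
    vals = ratings[src].astype('float64')
    ratings[dst] = (vals.groupby(user_key).cumsum() - vals) / denom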
ratings["user_n_ratings"] = ratings.groupby("userId").cumcount().add(1).astype("int32")
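
To make "past only" concrete, here is a minimal sketch on toy data, assuming the rows are already in chronological order. The frame and the column name `user_mean_rating_past` are hypothetical and used only for illustration. Shifting each user's ratings by one row before taking the expanding mean keeps a row's own rating out of its feature.

```python
import pandas as pd

# Hypothetical toy ratings, already in chronological order.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0],
})

# shift() moves each user's ratings down one row, so the expanding mean
# at a given row only covers that user's earlier rows.
toy["user_mean_rating_past"] = (
    toy.groupby("userId")["rating"]
    .transform(lambda s: s.shift().expanding().mean())
)
print(toy)
```

The first rating of each user has no history, so its feature is missing (NaN) rather than equal to the rating being predicted.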

## 5. Movie features (past only)

In [35]:
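# Same idea per movie: running sums minus the current row, divided by the
# number of earlier ratings, so a row never sees its own rating or label.
movie_key = ratings['movieId']
n_prior = ratings.groupby('movieId').cumcount()  # earlier ratings of this movie
denom = n_prior.where(n_prior > 0)               # NaN on a movie's first rating
for src, dst in [('rating', 'movie_mean_rating'), ('high_rating', 'movie_like_rate')]:
    vals = ratings[src].astype('float64')
    ratings[dst] = (vals.groupby(movie_key).cumsum() - vals) / denom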
ratings["movie_n_ratings"] = ratings.groupby("movieId").cumcount().add(1).astype("int32")


## 6. Movie static features

In [36]:
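movies["year"] = movies["title"].str.extract(r"\((\d{4})\)").astype("Int16")
genre_dummies = movies['genres'].str.get_dummies(sep='|')
movies = movies.join(genre_dummies)
# Select the genre columns by name: movies.columns[3:] would start at 'year' and duplicate it.
ratings = ratings.merge(movies[['movieId', 'year', *genre_dummies.columns]], on='movieId', how='left')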
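
A minimal sketch of the two static features on toy data (the frame `toy_movies` and its second title are made up): the regular expression captures a four-digit year in parentheses, and `str.get_dummies` builds one indicator column per genre label in the pipe-separated string.

```python
import pandas as pd

# Hypothetical toy catalogue; the second title deliberately has no year.
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Untitled Short"],
    "genres": ["Adventure|Animation", "(no genres listed)"],
})

# Capture a four-digit year in parentheses; titles without one become <NA>.
year_text = toy_movies["title"].str.extract(r"\((\d{4})\)", expand=False)
toy_movies["year"] = pd.to_numeric(year_text).astype("Int16")

# One indicator column per genre label found in the pipe-separated strings.
genre_dummies = toy_movies["genres"].str.get_dummies(sep="|")
toy_movies = toy_movies.join(genre_dummies)
print(toy_movies.drop(columns=["title", "genres"]))
```

Titles without a parenthesized year end up with a missing `year`, which is why a nullable integer dtype is used.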

In [37]:
movies

Unnamed: 0,movieId,title,genres,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,1995,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy,2007,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy,2002,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
27275,131258,The Pirates (2014),Adventure,2014,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27276,131260,Rentun Ruusu (2001),(no genres listed),2001,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# Safeguard: drop any duplicated column names, keeping the first occurrence.
ratings = ratings.loc[:, ~ratings.columns.duplicated()]

## 7. Final Feature Matrix

In [40]:
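# Assumption: `feature_cols` is not defined in the cells shown; use every column except ids, timestamps, and targets.
feature_cols = [c for c in ratings.columns if c not in ("userId", "movieId", "rating", "timestamp", "timestamp_dt", "high_rating")]
X = ratings[feature_cols]
y = ratings['high_rating']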

## 8. Baseline dataset construction

It's important that we maintain chronological consistency throughout training. Too little information always hurts training, but too much information (irrelevant or historically inaccurate) may also take our model to some weird places. We therefore cut the data down to the last years of the dataset.

In [45]:
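# ISO-formatted timestamps avoid day/month ambiguity when compared with `timestamp_dt`.
train_start = pd.Timestamp("2008-01-01")
train_end = pd.Timestamp("2012-12-31")
val_end = pd.Timestamp("2013-12-31")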

train = ratings[(ratings['timestamp_dt'] <= train_end) & (ratings['timestamp_dt'] >= train_start)]
val   = ratings[(ratings['timestamp_dt'] > train_end) & (ratings['timestamp_dt'] <= val_end)]
test  = ratings[ratings['timestamp_dt'] > val_end]

total = len(train) + len(val) + len(test)
train_rat = round(len(train) / total, 2)
val_rat = round(len(val) / total, 2)
test_rat = round(len(test) / total, 2)
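
The split logic can be sketched as a small helper and checked on toy data. The function name `chronological_split` and the toy frame are hypothetical; the boundaries mirror the ones used in this section, and both `train_end` and `val_end` are treated as inclusive.

```python
import pandas as pd

def chronological_split(df, time_col, train_start, train_end, val_end):
    """Split df into train/val/test by date boundaries (hypothetical helper)."""
    t = df[time_col]
    train = df[(t >= train_start) & (t <= train_end)]
    val = df[(t > train_end) & (t <= val_end)]
    test = df[t > val_end]
    return train, val, test

# Made-up rows: one before the training window, two inside it, one per later split.
toy = pd.DataFrame({
    "timestamp_dt": pd.to_datetime(
        ["2007-06-01", "2008-01-01", "2012-12-31", "2013-07-15", "2014-02-01"]
    ),
    "high_rating": [1, 0, 1, 1, 0],
})
train, val, test = chronological_split(
    toy, "timestamp_dt",
    pd.Timestamp("2008-01-01"), pd.Timestamp("2012-12-31"), pd.Timestamp("2013-12-31"),
)

# Every training row precedes every validation row, which precedes every test row.
assert train["timestamp_dt"].max() < val["timestamp_dt"].min() < test["timestamp_dt"].min()
print(len(train), len(val), len(test))
```

Rows dated before `train_start` fall into none of the three splits, which matches the decision to keep only the last years of the dataset.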

In [44]:
print("Total dataset:", len(train) + len(val) + len(test))

Total dataset: 5936360


## 9. Persist Training Datasets

In [42]:
train.to_parquet("../data/processed/train_prepared.parquet", index=False)
val.to_parquet("../data/processed/val_prepared.parquet", index=False)
test.to_parquet("../data/processed/test_prepared.parquet", index=False)

## 10. Pros and Cons of Time-Based Training (Summary)

### Pros
- **No leakage:**  
  Ensures user and movie features only use past information, matching real-world deployment.

- **Realistic evaluation:**  
  Models are tested on future data (here: train 2008 to 2012 → validate 2013 → test 2014 onward), reflecting actual system behavior.

- **Captures temporal dynamics:**  
  Measures how well the model generalizes when user preferences and movie popularity evolve over time.

- **Essential for cumulative features:**  
  Features like user_mean_rating, movie_like_rate, and n_ratings depend on historical order; random splits would corrupt them.

- **More interpretable:**  
  Model decisions reflect true historical patterns instead of random mixing of past and future events.

---

### Cons
- **Smaller training set:**  
  Restricting training to a window of past years (here 2008 to 2012) reduces the amount of data available for learning.

- **Stronger cold-start:**  
  Newer movies or users have little or no historical data in earlier years.

- **More complex pipeline:**  
  Requires careful chronological sorting, incremental feature computation, and strict handling of time boundaries.

- **Potential feature instability:**  
  User and movie stats in early years can be very noisy when activity volume is low.
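
One common way to dampen the noise and cold-start effects listed above is to shrink low-count averages toward a global mean. This is not something this notebook does; the sketch below is an assumption, with a hypothetical helper `smoothed_mean`, a hypothetical `strength` pseudo-count, and a made-up global mean.

```python
def smoothed_mean(prior_sum, prior_count, global_mean, strength=10.0):
    """Shrink a historical mean toward `global_mean`.

    `strength` is a hypothetical pseudo-count: the number of imaginary
    ratings at the global mean that are mixed into every estimate.
    """
    return (prior_sum + strength * global_mean) / (prior_count + strength)

global_mean = 3.5  # made-up value for illustration

# No history: the estimate falls back to the global mean.
no_history = smoothed_mean(0.0, 0, global_mean)
# A single 5.0 rating only nudges the estimate upward.
one_rating = smoothed_mean(5.0, 1, global_mean)
# With a long history, the observed ratings dominate.
many_ratings = smoothed_mean(5.0 * 1000, 1000, global_mean)

print(no_history, round(one_rating, 3), round(many_ratings, 3))
```

A larger `strength` trusts the global mean longer; a value of 0 recovers the plain historical mean (undefined when there is no history).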

