# Modelos Baseline

En este notebook se implementaran los modelos baseline del proyecto y se guardaran las métricas con el mismo dataset que se utilizará para el modelo principal para hacer bentchmarking

## Lectura de los datos

In [36]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os
from utils.recomender_metrics import prepare_ground_truth, evaluate_recommendations, calculate_item_popularity, get_all_items, print_evaluation_results
import numpy as np
from collections import defaultdict
from surprise.model_selection import PredefinedKFold
from utils.UKnn import UserKNN, calculate_mae, calculate_rmse

In [None]:
"""
data_dir = "../data/raw"
os.makedirs(data_dir, exist_ok=True)

url = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
response = requests.get(url)
with zipfile.ZipFile(BytesIO(response.content)) as z:
    z.extractall(os.path.join(data_dir, "ml-1m"))
"""

In [2]:
# Read ratings.dat
ratings_csv = pd.read_csv("../data/raw/ml-1m/ratings.dat", 
                         sep="::",
                         header=None,
                         names=['userId', 'movieId', 'rating', 'timestamp'],
                         engine='python')

In [3]:
# Read movies.dat
movies_csv = pd.read_csv("../data/raw/ml-1m/movies.dat",
                        sep="::",
                        header=None,
                        names=['movieId', 'title', 'genres'],
                        encoding='latin-1',
                        engine='python')

In [4]:
# Read users.dat
users_csv = pd.read_csv("../data/raw/ml-1m/users.dat",
                       sep="::",
                       header=None,
                       names=['userId', 'gender', 'age', 'occupation', 'zipcode'],
                       engine='python')

In [5]:
ratings_csv

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [6]:
movies_csv

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


### Separamos train/test/val

Los splits seran de tamaño 80, 10, 10. Importante mantener la misma seed random para que la comparación sea válida entre los modelos. Por otro lado, para hacer los splits, nos aseguramos que cada usuario tenga 80% de sus ratings en el split de testing y el 20% restante entre los otros dos. Lo que queremos de nuestro modelo es, teniendo un buen conocimiento del usuario (nuestro trabajo será perfilarlo correctamente), conseguir nuevas recomendaciones relevantes.

In [7]:
def split_by_user(ratings_df, train_size=0.8, val_size=0.1, random_state=42):
    np.random.seed(random_state)
    
    # Initialize empty dataframes for each split
    train_ratings = []
    val_ratings = []
    test_ratings = []
    
    # Group by user
    for _, user_ratings in ratings_df.groupby('userId'):
        # Shuffle user ratings
        user_ratings = user_ratings.sample(frac=1, random_state=random_state)
        
        # Calculate split points
        n_ratings = len(user_ratings)
        train_idx = int(n_ratings * train_size)
        val_idx = int(n_ratings * (train_size + val_size))
        
        # Split user ratings
        train_ratings.append(user_ratings.iloc[:train_idx])
        val_ratings.append(user_ratings.iloc[train_idx:val_idx])
        test_ratings.append(user_ratings.iloc[val_idx:])
    
    # Concatenate all splits
    return (pd.concat(train_ratings), 
            pd.concat(test_ratings), 
            pd.concat(val_ratings))

# Perform the split
ratings_train, ratings_test, ratings_val = split_by_user(
    ratings_csv, 
    train_size=0.8, 
    val_size=0.1, 
    random_state=42
)

Ahora, con las funciones de utils, generamos el ground truth para cada usuario de su split de test y validation. Así, cuando hagamos las recomendaciones, sabremos si el elemento era relevante o no.

In [26]:
testset = list(zip(
    ratings_test['userId'].astype(str), 
    ratings_test['movieId'].astype(str), 
    ratings_test['rating']
))

trainset = list(zip(
    ratings_train['userId'].astype(str), 
    ratings_train['movieId'].astype(str), 
    ratings_train['rating']
))

In [9]:
ground_truth_test = prepare_ground_truth(testset)
print("Ground truth for test set prepared.")

Ground truth for test set prepared.


In [10]:
item_popularity = calculate_item_popularity(trainset)
print("Item popularity calculated.")

Item popularity calculated.


In [11]:
all_items = get_all_items(trainset, testset)
print("All items retrieved.")

All items retrieved.


In [13]:
ground_truth_test['3']

{'1210', '1259', '1304', '2617', '2858', '733'}

## Random model

Generaremos, para cada usuario, aleatoriamente una lista de hasta 50 recomendaciones para poder evaluar las métricas a distintos puntos.

In [14]:
print(len(ratings_train), len(ratings_val), len(ratings_test))

797758 99692 102759


In [15]:
# Como no queremos repetir recomendaciones que el usuario ya ha visto (estan en el train), creamos un set de items vistos por cada usuario
user_seen_items = ratings_train.groupby('userId')['movieId'].apply(set).to_dict()

In [16]:
def generate_random_recommendations(ratings_df, all_movie_ids, user_seen_items, max_recommendations=50, random_state=42):
    np.random.seed(random_state)
    recommendations = {}
    
    for user_id in ratings_df['userId'].unique():
        seen_items = user_seen_items.get(user_id, set())
        candidate_items = list(set(all_movie_ids) - seen_items)
        
        n_recommendations = min(max_recommendations, len(candidate_items))
        recommended_items = np.random.choice(candidate_items, size=n_recommendations, replace=False).tolist()
        
        recommendations[user_id] = recommended_items
    
    return recommendations

In [17]:
recommendations = generate_random_recommendations(ratings_test, all_items, user_seen_items)

In [19]:
# Run evaluation
results = evaluate_recommendations(
    recommendations=recommendations,
    ground_truth=ground_truth_test,
    k_values=[20, 50],
    item_popularity=item_popularity,
    all_items=all_items
)

print_evaluation_results(results)


RECOMMENDATION EVALUATION RESULTS

CATALOG COVERAGE:
  @20: 1.0000
  @50: 1.0000

F1:
  @20: 0.0000
  @50: 0.0000

MAP:
  @20: 0.0000
  @50: 0.0000

MRR:
  @20: 0.0000
  @50: 0.0000

NDCG:
  @20: 0.0000
  @50: 0.0000

NOVELTY:
  @20: 13.3524
  @50: 13.3470

PRECISION:
  @20: 0.0000
  @50: 0.0000

RECALL:
  @20: 0.0000
  @50: 0.0000



## Most popular items

En este caso, todas las recomendaciones seran iguales para todos los usuarios: recomendaremos las 50 películas más populares

In [20]:
# Agarramos el diccionario de popularidad y sacamos las 50 peliculas mas populares
most_popular_items = sorted(item_popularity, key=item_popularity.get, reverse=True)
most_popular_items = most_popular_items[:50]

# Ahora llenamos las recomendaciones con las mismas peliculas para todos los usuarios
recommendations = {user_id: most_popular_items for user_id in ratings_test['userId'].unique()}

In [21]:
# Run evaluation
results = evaluate_recommendations(
    recommendations=recommendations,
    ground_truth=ground_truth_test,
    k_values=[3, 5],
    item_popularity=item_popularity,
    all_items=all_items
)

print_evaluation_results(results)


RECOMMENDATION EVALUATION RESULTS

CATALOG COVERAGE:
  @ 3: 0.0008
  @ 5: 0.0014

F1:
  @ 3: 0.0000
  @ 5: 0.0000

MAP:
  @ 3: 0.0000
  @ 5: 0.0000

MRR:
  @ 3: 0.0000
  @ 5: 0.0000

NDCG:
  @ 3: 0.0000
  @ 5: 0.0000

NOVELTY:
  @ 3: 8.3350
  @ 5: 8.3839

PRECISION:
  @ 3: 0.0000
  @ 5: 0.0000

RECALL:
  @ 3: 0.0000
  @ 5: 0.0000



## User-based KNN

Como modelo informado base, escogemos el user-based KNN. Partiendo siempre desde la hipotesis de que tenemos ya bastentes datos de nuestros usuarios y ahora nuestro objetivo es recomendarles películas relevantes.

In [37]:
myUserKnn = UserKNN(k=7, similarity='cosine')
myUserKnn.fit(trainset)

In [38]:
# Predicciones para todo el testset
predictions = myUserKnn.predict_all(testset)
rmse = calculate_rmse(predictions)
mae = calculate_mae(predictions)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")

RMSE: 0.9376, MAE: 0.7303


In [None]:
# Para todos los usuarios del test
test_users = ratings_test['userId'].unique().astype(str).tolist()
top_n_all = myUserKnn.get_top_n(user_ids=test_users, n=50)

In [None]:
# Convertir top_n al formato correcto (solo item_ids)
recommendations = {
    uid: [item_id for item_id, _ in items]
    for uid, items in top_n_all.items()
}

In [None]:
results = evaluate_recommendations(
    recommendations=recommendations,
    ground_truth=ground_truth_test,
    k_values=[3, 5],
    item_popularity=item_popularity,
    all_items=all_items
)

print_evaluation_results(results)