# Modelos Baseline

En este notebook se implementaran los modelos baseline del proyecto y se guardaran las métricas con el mismo dataset que se utilizará para el modelo principal para hacer bentchmarking

## Lectura de los datos

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from utils.recomender_metrics import prepare_ground_truth, evaluate_recommendations, print_evaluation_results
from collections import defaultdict
from surprise.model_selection import PredefinedKFold
from utils.UKnn import UserKNN, calculate_mae, calculate_rmse

In [2]:
data_folder = '../data/raw/ml-100k/'

In [3]:
r_cols = ['user_id', 'item_id', 'rating', 'timestamp']

# Read the training and testing sets
train_df = pd.read_csv(f'{data_folder}u1.base', sep='\t', names=r_cols, encoding='latin-1')
test_df = pd.read_csv(f'{data_folder}u1.test', sep='\t', names=r_cols, encoding='latin-1')

train_df['rating'] = train_df['rating'].astype(int)
test_df['rating'] = test_df['rating'].astype(int)

train_df['rating'] = pd.to_numeric(train_df['rating'], errors='coerce').astype('Int64')
test_df['rating'] = pd.to_numeric(test_df['rating'], errors='coerce').astype('Int64')

print("Training Data Head:")
print(train_df.head())

Training Data Head:
   user_id  item_id  rating  timestamp
0        1        1       5  874965758
1        1        2       3  876893171
2        1        3       4  878542960
3        1        4       3  876893119
4        1        5       3  889751712


In [4]:
i_cols = [
    'item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL',
    'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy',
    'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
    'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'
]

movies_df = pd.read_csv(f'{data_folder}u.item', sep='|', names=i_cols, encoding='latin-1')

print("\nMovie Data Head:")
print(movies_df.head())


Movie Data Head:
   item_id              title release_date  video_release_date  \
0        1   Toy Story (1995)  01-Jan-1995                 NaN   
1        2   GoldenEye (1995)  01-Jan-1995                 NaN   
2        3  Four Rooms (1995)  01-Jan-1995                 NaN   
3        4  Get Shorty (1995)  01-Jan-1995                 NaN   
4        5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...        0       1   
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)        0       0   

   Adventure  Animation  Children's  ...  Fantasy  Film-Noir  Horror  Musical  \
0          0          1           1  ...        0          0     

In [5]:
u_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users_df = pd.read_csv(f'{data_folder}u.user', sep='|', names=u_cols, encoding='latin-1')

# Display the first few rows of the user data
print("\nUser Data Head:")
print(users_df.head())


User Data Head:
   user_id  age gender  occupation zip_code
0        1   24      M  technician    85711
1        2   53      F       other    94043
2        3   23      M      writer    32067
3        4   24      M  technician    43537
4        5   33      F       other    15213


## Estructuras para métricas

Para calcular las métricas, necesitamos saber, por ejemplo, el `ground truth` o la popularidad de los items

In [6]:
# Agrupamos el DataFrame de test por usuario y convertimos los item_id de cada grupo en un conjunto (set)
ground_truth = test_df[test_df['rating'] > 2].groupby('user_id')['item_id'].apply(set).to_dict()

k_values = [10, 20, 50]

# Contamos popularidad como suma de ratings para dar más peso a los items mejor puntuados
item_popularity = train_df.groupby('item_id')['rating'].sum().to_dict()

# Creamos el diccionario iterando sobre el dataframe de películas
item_features = {}
for index, row in movies_df.iterrows():
    item_id = row['item_id']
    # Creamos un conjunto con los nombres de las columnas de género donde el valor es 1
    genres = {genre for genre in i_cols if row[genre] == 1}
    item_features[item_id] = genres

all_items = set(movies_df['item_id'])

## Random model

Generaremos, para cada usuario, aleatoriamente una lista de hasta 50 recomendaciones para poder evaluar las métricas a distintos puntos.

In [7]:
# Obtener todos los IDs de películas únicos
all_movie_ids = movies_df['item_id'].unique().tolist()

# Crear el diccionario de ítems vistos por usuario (SOLO con datos de entrenamiento)
user_seen_items = train_df.groupby('user_id')['item_id'].apply(set).to_dict()

# Obtener la lista de usuarios para los que generaremos recomendaciones
users_in_train = train_df['user_id'].unique().tolist()

In [8]:
print(f"Total de películas: {len(all_movie_ids)}")
print(f"Total de usuarios en el set de entrenamiento: {len(users_in_train)}")
print(f"Películas vistas por el usuario 1: {user_seen_items[1]}") 

Total de películas: 1682
Total de usuarios en el set de entrenamiento: 943
Películas vistas por el usuario 1: {1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 18, 19, 21, 22, 25, 26, 28, 29, 30, 32, 34, 35, 37, 38, 40, 41, 42, 43, 45, 46, 48, 50, 52, 55, 57, 58, 59, 63, 66, 68, 71, 75, 77, 79, 83, 87, 88, 89, 93, 94, 95, 99, 101, 105, 106, 109, 110, 111, 115, 116, 119, 122, 123, 124, 126, 127, 131, 133, 135, 136, 137, 138, 139, 141, 142, 144, 146, 147, 149, 152, 153, 156, 158, 162, 165, 166, 167, 168, 169, 172, 173, 176, 178, 179, 181, 182, 187, 191, 192, 194, 195, 197, 198, 199, 203, 204, 205, 207, 211, 216, 217, 220, 223, 231, 234, 237, 238, 239, 240, 244, 245, 246, 247, 249, 251, 256, 257, 261, 263, 268, 269, 270, 271}


In [9]:
print(len(train_df), len(test_df))

80000 20000


In [10]:
def generate_random_recommendations(users_to_recommend, all_movie_ids, user_seen_items, max_recommendations=50, random_state=42):
    """
    Genera recomendaciones aleatorias para una lista de usuarios,
    asegurándose de no recomendar ítems que ya han visto.
    """
    np.random.seed(random_state)
    recommendations = {}
    
    for user_id in users_to_recommend:
        # Obtener el conjunto de ítems que el usuario ya ha visto (del diccionario)
        seen_items = user_seen_items.get(user_id, set())
        
        # Calcular los ítems candidatos (todos menos los ya vistos)
        candidate_items = list(set(all_movie_ids) - seen_items)
        
        # Determinar cuántas recomendaciones generar
        n_recommendations = min(max_recommendations, len(candidate_items))
        
        # Si hay candidatos, seleccionar aleatoriamente
        if n_recommendations > 0:
            recommended_items = np.random.choice(candidate_items, size=n_recommendations, replace=False).tolist()
            recommendations[user_id] = recommended_items
        else:
            # En el caso improbable de que un usuario haya visto todo
            recommendations[user_id] = []
            
    return recommendations

In [11]:
random_recs = generate_random_recommendations(
    users_to_recommend=users_in_train,
    all_movie_ids=all_movie_ids,
    user_seen_items=user_seen_items,
    max_recommendations=50, # Generar hasta 50 recomendaciones por usuario
    random_state=42
)

In [12]:
# Run evaluation
results = evaluate_recommendations(
    recommendations=random_recs,
    ground_truth=ground_truth,
    k_values=k_values,
    item_popularity=item_popularity,
    all_items=all_items,
    item_features=item_features
)

print_evaluation_results(results)


RECOMMENDATION EVALUATION RESULTS

CATALOG COVERAGE:
  @10: 0.9970
  @20: 1.0000
  @50: 1.0000

F1:
  @10: 0.0115
  @20: 0.0171
  @50: 0.0263

INTRA LIST SIMILARITY:
  @10: 0.1807
  @20: 0.1807
  @50: 0.1820

MAP:
  @10: 0.0085
  @20: 0.0062
  @50: 0.0049

MRR:
  @10: 0.0651
  @20: 0.0741
  @50: 0.0803

NDCG:
  @10: 0.4785
  @20: 0.3892
  @50: 0.3219

NOVELTY:
  @10: 12.4810
  @20: 12.4729
  @50: 12.4791

PRECISION:
  @10: 0.0244
  @20: 0.0241
  @50: 0.0230

RECALL:
  @10: 0.0075
  @20: 0.0133
  @50: 0.0307



## Most popular items

En este caso, todas las recomendaciones seran iguales para todos los usuarios: recomendaremos las 50 películas más populares

In [13]:
# Agarramos el diccionario de popularidad y sacamos las 50 peliculas mas populares
most_popular_items = sorted(item_popularity, key=item_popularity.get, reverse=True)
most_popular_items = most_popular_items[:50]

# Ahora llenamos las recomendaciones con las mismas peliculas para todos los usuarios
pop_recs = {user_id: most_popular_items for user_id in users_in_train}

In [14]:
# Run evaluation
results = evaluate_recommendations(
    recommendations=pop_recs,
    ground_truth=ground_truth,
    k_values=k_values,
    item_popularity=item_popularity,
    all_items=all_items,
    item_features=item_features
)

print_evaluation_results(results)


RECOMMENDATION EVALUATION RESULTS

CATALOG COVERAGE:
  @10: 0.0059
  @20: 0.0119
  @50: 0.0297

F1:
  @10: 0.1072
  @20: 0.1461
  @50: 0.1740

INTRA LIST SIMILARITY:
  @10: 0.1497
  @20: 0.1935
  @50: 0.1776

MAP:
  @10: 0.0885
  @20: 0.0749
  @50: 0.0726

MRR:
  @10: 0.3726
  @20: 0.3797
  @50: 0.3815

NDCG:
  @10: 0.5838
  @20: 0.5428
  @50: 0.5232

NOVELTY:
  @10: 7.5304
  @20: 7.7083
  @50: 8.0123

PRECISION:
  @10: 0.1743
  @20: 0.1621
  @50: 0.1373

RECALL:
  @10: 0.0774
  @20: 0.1330
  @50: 0.2373



## User-based KNN

Como modelo informado base, escogemos el user-based KNN. Partiendo siempre desde la hipotesis de que tenemos ya bastentes datos de nuestros usuarios y ahora nuestro objetivo es recomendarles películas relevantes.

In [15]:
train_df_str = train_df.copy()
train_df_str['user_id'] = train_df_str['user_id'].astype(str)
train_df_str['item_id'] = train_df_str['item_id'].astype(str)

test_df_str = test_df.copy()
test_df_str['user_id'] = test_df_str['user_id'].astype(str)
test_df_str['item_id'] = test_df_str['item_id'].astype(str)

trainset = [tuple(row) for row in train_df_str[['user_id', 'item_id', 'rating']].values]
testset = [tuple(row) for row in test_df_str[['user_id', 'item_id', 'rating']].values]

In [16]:
myUserKnn = UserKNN(k=7, similarity='cosine')
myUserKnn.fit(trainset)

In [18]:
# Predicciones para todo el testset
predictions = myUserKnn.predict_all(testset)
rmse = calculate_rmse(predictions)
mae = calculate_mae(predictions)
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")

Item 599 unknown, returning user mean.
Item 711 unknown, returning user mean.
Item 814 unknown, returning user mean.
Item 830 unknown, returning user mean.
Item 852 unknown, returning user mean.
Item 857 unknown, returning user mean.
Item 1156 unknown, returning user mean.
Item 1236 unknown, returning user mean.
Item 1309 unknown, returning user mean.
Item 1310 unknown, returning user mean.
Item 1320 unknown, returning user mean.
Item 1343 unknown, returning user mean.
Item 1348 unknown, returning user mean.
Item 1364 unknown, returning user mean.
Item 1373 unknown, returning user mean.
Item 1457 unknown, returning user mean.
Item 1458 unknown, returning user mean.
Item 1492 unknown, returning user mean.
Item 1493 unknown, returning user mean.
Item 1498 unknown, returning user mean.
Item 1505 unknown, returning user mean.
Item 1520 unknown, returning user mean.
Item 1533 unknown, returning user mean.
Item 1536 unknown, returning user mean.
Item 1543 unknown, returning user mean.
Item 1

In [20]:
users_in_train = train_df['user_id'].astype(str).unique().tolist()
top_n_all = myUserKnn.get_top_n(user_ids=users_in_train, n=50)

In [22]:
# Convertir top_n al formato correcto (solo item_ids)
uknn_recs = {
    uid: [item_id for item_id, _ in items]
    for uid, items in top_n_all.items()
}

In [25]:
# Run evaluation
results = evaluate_recommendations(
    recommendations=uknn_recs,
    ground_truth=ground_truth,
    k_values=k_values,
    item_popularity=item_popularity,
    all_items=all_items,
    item_features=item_features
)

print_evaluation_results(results)


RECOMMENDATION EVALUATION RESULTS

CATALOG COVERAGE:
  @10: 0.1367
  @20: 0.2111
  @50: 0.3502

F1:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

INTRA LIST SIMILARITY:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

MAP:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

MRR:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

NDCG:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

NOVELTY:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

PRECISION:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

RECALL:
  @10: 0.0000
  @20: 0.0000
  @50: 0.0000

