
In this Jupyter Notebook, 1 will develop a movie recommendation system that leverages deep learning techniques to predict user preferences based on their past ratings.

I will cover the following steps:

1. **Download and preprocess the MovieLens dataset**

2. **Split the dataset into training, validation, and testing sets**

3. **Implement a neural network architecture for the recommendation system**
    
4. **Train and evaluate the model using different metrics**

So first of all, I will load the data in the different variables links, movies, ratings and tags from the .csv.

I will also display the first 5 rows of each one with the head() function and some metrics with the describe() function.

In [1]:
import pandas as pd

#MOVIES DATASET
movies = pd.read_csv("ml-latest-small/movies.csv")
print("Summary of the dataset: \n", movies.describe())
print("---------------------------------------------")
print("First 5 rows: \n", movies.head())

Summary of the dataset: 
              movieId
count    9742.000000
mean    42200.353623
std     52160.494854
min         1.000000
25%      3248.250000
50%      7300.000000
75%     76232.000000
max    193609.000000
---------------------------------------------
First 5 rows: 
    movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [2]:
#RATINGS DATASET
ratings = pd.read_csv("ml-latest-small/ratings.csv")
print("Summary of the dataset: \n",ratings.describe())
print("---------------------------------------------")
print("First 5 rows: \n", ratings.head())

Summary of the dataset: 
               userId        movieId         rating     timestamp
count  100836.000000  100836.000000  100836.000000  1.008360e+05
mean      326.127564   19435.295718       3.501557  1.205946e+09
std       182.618491   35530.987199       1.042529  2.162610e+08
min         1.000000       1.000000       0.500000  8.281246e+08
25%       177.000000    1199.000000       3.000000  1.019124e+09
50%       325.000000    2991.000000       3.500000  1.186087e+09
75%       477.000000    8122.000000       4.000000  1.435994e+09
max       610.000000  193609.000000       5.000000  1.537799e+09
---------------------------------------------
First 5 rows: 
    userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [3]:
#TAGS DATASET
tags = pd.read_csv("ml-latest-small/tags.csv")
print("Summary of the dataset: \n",tags.describe())
print("---------------------------------------------")
print("First 5 rows: \n", tags.head())

Summary of the dataset: 
             userId        movieId     timestamp
count  3683.000000    3683.000000  3.683000e+03
mean    431.149335   27252.013576  1.320032e+09
std     158.472553   43490.558803  1.721025e+08
min       2.000000       1.000000  1.137179e+09
25%     424.000000    1262.500000  1.137521e+09
50%     474.000000    4454.000000  1.269833e+09
75%     477.000000   39263.000000  1.498457e+09
max     610.000000  193565.000000  1.537099e+09
---------------------------------------------
First 5 rows: 
    userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200


In [4]:
#LINKS DATASET
links = pd.read_csv("ml-latest-small/links.csv")
print("Summary of the dataset: \n",links.describe())
print("---------------------------------------------")
print("First 5 rows: \n", links.head())

Summary of the dataset: 
              movieId        imdbId         tmdbId
count    9742.000000  9.742000e+03    9734.000000
mean    42200.353623  6.771839e+05   55162.123793
std     52160.494854  1.107228e+06   93653.481487
min         1.000000  4.170000e+02       2.000000
25%      3248.250000  9.518075e+04    9665.500000
50%      7300.000000  1.672605e+05   16529.000000
75%     76232.000000  8.055685e+05   44205.750000
max    193609.000000  8.391976e+06  525662.000000
---------------------------------------------
First 5 rows: 
    movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


# **FIRST MODEL**
# Using only original data (genres and ratings)


In [5]:
movies["genres"] = movies["genres"].fillna("").astype(str)
movies["genres"] = movies["genres"].str.split('|')
moviesExploded = movies.explode("genres")
movies_dummies = pd.get_dummies(moviesExploded["genres"], prefix="", prefix_sep="", dtype=int)
movies_dummies = moviesExploded[["movieId"]].join(movies_dummies).groupby("movieId").max()
movies_final = movies.drop(columns=["genres"]).merge(movies_dummies, on="movieId")
print(movies_final.head())
#The feature genres is dummyfied

ratings["rating"] = ratings["rating"].astype(int)
print(ratings.head())

links["tmdbId"] = links["tmdbId"].fillna(-1).astype(int)
print(links.head())

print(tags.head())

   movieId                               title  (no genres listed)  Action  \
0        1                    Toy Story (1995)                   0       0   
1        2                      Jumanji (1995)                   0       0   
2        3             Grumpier Old Men (1995)                   0       0   
3        4            Waiting to Exhale (1995)                   0       0   
4        5  Father of the Bride Part II (1995)                   0       0   

   Adventure  Animation  Children  Comedy  Crime  Documentary  ...  Film-Noir  \
0          1          1         1       1      0            0  ...          0   
1          1          0         1       0      0            0  ...          0   
2          0          0         0       1      0            0  ...          0   
3          0          0         0       1      0            0  ...          0   
4          0          0         0       1      0            0  ...          0   

   Horror  IMAX  Musical  Mystery  Romance  

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import torch
import numpy as np

df = ratings.merge(movies_final, on="movieId", how="left")
#Ratings is merged with the data from movies.csv

df2 = df.copy()
numerical_col = [col for col in df2.columns if col not in ["title"]]
df_shuffle = df2.sample(frac=1, random_state=123).drop(columns=["timestamp", "title"])
#Features timestamp and title are removed

df_train = df_shuffle.iloc[:int(len(df_shuffle) * 0.8), :]
df_val = df_shuffle.iloc[int(len(df_shuffle) * 0.8):int(len(df_shuffle) * 0.9), :]
df_test = df_shuffle.iloc[int(len(df_shuffle) * 0.9):, :]
#Data is splited into training, validation and test sets(80-10-10)

scalers = {}

feature_cols = [col for col in df_shuffle.columns if col not in ["rating", "timestamp", "title"]]
x_train, y_train = df_train[feature_cols].to_numpy(dtype=np.float32), df_train["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_val, y_val = df_val[feature_cols].to_numpy(dtype=np.float32), df_val["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_test, y_test = df_test[feature_cols].to_numpy(dtype=np.float32), df_test["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
#Features transformed into numerical and  divided into input and output/target variable

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

x_train, y_train = torch.tensor(x_train), torch.tensor(y_train).float()
x_val, y_val = torch.tensor(x_val), torch.tensor(y_val).float()
x_test, y_test = torch.tensor(x_test), torch.tensor(y_test).float()
#Sets transformed into tensors

(80668, 22) (80668, 1)
(10084, 22) (10084, 1)
(10084, 22) (10084, 1)


In [7]:
import pandas as pd
#Here, I have created for each set (train, validation and testing) new 6 variables. For training and validation, the data related to the average of the 
#rating is taken from the set itself, but the columns of the testing set can not be created from the average of the rating of the training set (would be considered data leakage)
#So the solution is using the average from the data of the training set
#NAs are also removed

avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")
avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_train = df_train.merge(avg_movie_rating, on="movieId", how="left")
df_train = df_train.merge(count_movie_rating, on="movieId", how="left")
df_train = df_train.merge(std_movie_rating, on="movieId", how="left")
df_train = df_train.merge(avg_user_rating, on="userId", how="left")
df_train = df_train.merge(count_user_rating, on="userId", how="left")
df_train = df_train.merge(std_user_rating, on="userId", how="left")

df_train["count_movie_rating"] = df_train["count_movie_rating"].fillna(0)
df_train["count_user_rating"] = df_train["count_user_rating"].fillna(0)
df_train["avg_movie_rating"] = df_train["avg_movie_rating"].fillna(0)
df_train["avg_user_rating"] = df_train["avg_user_rating"].fillna(0)
df_train["std_movie_rating"] = df_train["std_movie_rating"].fillna(0)
df_train["std_user_rating"] = df_train["std_user_rating"].fillna(0)



avg_movie_rating = df_val.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_val.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_val.groupby("movieId")["rating"].std().rename("std_movie_rating")
avg_user_rating = df_val.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_val.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_val.groupby("userId")["rating"].std().rename("std_user_rating")

df_val = df_val.merge(avg_movie_rating, on="movieId", how="left")
df_val = df_val.merge(count_movie_rating, on="movieId", how="left")
df_val = df_val.merge(std_movie_rating, on="movieId", how="left")
df_val = df_val.merge(avg_user_rating, on="userId", how="left")
df_val = df_val.merge(count_user_rating, on="userId", how="left")
df_val = df_val.merge(std_user_rating, on="userId", how="left")

df_val["count_movie_rating"] = df_val["count_movie_rating"].fillna(0)
df_val["count_user_rating"] = df_val["count_user_rating"].fillna(0)
df_val["avg_movie_rating"] = df_val["avg_movie_rating"].fillna(0)
df_val["avg_user_rating"] = df_val["avg_user_rating"].fillna(0)
df_val["std_movie_rating"] = df_val["std_movie_rating"].fillna(0)
df_val["std_user_rating"] = df_val["std_user_rating"].fillna(0)



avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")
avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_test = df_test.merge(avg_movie_rating, on="movieId", how="left")
df_test = df_test.merge(count_movie_rating, on="movieId", how="left")
df_test = df_test.merge(std_movie_rating, on="movieId", how="left")
df_test = df_test.merge(avg_user_rating, on="userId", how="left")
df_test = df_test.merge(count_user_rating, on="userId", how="left")
df_test = df_test.merge(std_user_rating, on="userId", how="left")

df_test["count_movie_rating"] = df_test["count_movie_rating"].fillna(0)
df_test["count_user_rating"] = df_test["count_user_rating"].fillna(0)
df_test["avg_movie_rating"] = df_test["avg_movie_rating"].fillna(0)
df_test["avg_user_rating"] = df_test["avg_user_rating"].fillna(0)
df_test["std_movie_rating"] = df_test["std_movie_rating"].fillna(0)
df_test["std_user_rating"] = df_test["std_user_rating"].fillna(0)


print(np.isnan(df_test).any())
print(df2.head())


userId                False
movieId               False
rating                False
(no genres listed)    False
Action                False
Adventure             False
Animation             False
Children              False
Comedy                False
Crime                 False
Documentary           False
Drama                 False
Fantasy               False
Film-Noir             False
Horror                False
IMAX                  False
Musical               False
Mystery               False
Romance               False
Sci-Fi                False
Thriller              False
War                   False
Western               False
avg_movie_rating      False
count_movie_rating    False
std_movie_rating      False
avg_user_rating       False
count_user_rating     False
std_user_rating       False
dtype: bool
   userId  movieId  rating  timestamp                        title  \
0       1        1       4  964982703             Toy Story (1995)   
1       1        3       4  96498124

In [8]:
numerical_col = ["movieId", "rating", "userId", "std_movie_rating",  "avg_movie_rating",  "count_movie_rating", "avg_user_rating",  "count_user_rating",  "std_user_rating"]
scaler = MinMaxScaler()
#Only numerical columns are normalized

df_train[numerical_col] = scaler.fit_transform(df_train[numerical_col])
df_val[numerical_col] = scaler.fit_transform(df_val[numerical_col])
df_test[numerical_col] = scaler.fit_transform(df_test[numerical_col])

In [9]:
import torch
from torch.utils.data import TensorDataset, DataLoader

feature_cols = [col for col in df_train.columns if col != "rating"]

x_train = torch.tensor(df_train[feature_cols].values, dtype=torch.float32)
y_train = torch.tensor(df_train["rating"].values, dtype=torch.float32).unsqueeze(1)

x_val = torch.tensor(df_val[feature_cols].values, dtype=torch.float32)
y_val = torch.tensor(df_val["rating"].values, dtype=torch.float32).unsqueeze(1)

x_test = torch.tensor(df_test[feature_cols].values, dtype=torch.float32)
y_test = torch.tensor(df_test["rating"].values, dtype=torch.float32).unsqueeze(1)

torch.manual_seed(123)

train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024, shuffle=False)

#Tensors are transformed into TensorDatasets and then into DataLoaders

In [10]:
import torch.nn as nn
import torch.optim as optim
import torch

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(28, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.model(x)
device = "cuda" 
model = NeuralNetwork().to(device)  #To change to the GPU

lossFunction = torch.nn.HuberLoss() 
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
#The neural network is defined

In [11]:
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# TRAINING FUNCTION
def train_loop(dataloader, model, lossFunction, optimizer):
    train_size = len(dataloader.dataset)    
    nbatches = len(dataloader)  

    model.train()
    loss_train = 0  
    all_preds = []
    all_targets = []

    for nbatch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        logits = model(X)
        
        loss = lossFunction(logits, y)
        loss.backward()   
        optimizer.step()  
        optimizer.zero_grad()

        loss_train += loss.item()

        all_preds.extend(logits.detach().cpu().numpy()) 
        all_targets.extend(y.cpu().numpy())  

    avg_loss = loss_train / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'TRAINING -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


# VALIDATION FUNCTION
def val_loop(dataloader, model, lossFunction):
    val_size = len(dataloader.dataset)
    nbatches = len(dataloader)

    model.eval()

    loss_val = 0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)

            loss_val += lossFunction(logits, y).item()
            
            all_preds.extend(logits.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    avg_loss = loss_val / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'VALIDATION -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


In [12]:
#The model is trained and validated for 50 epochs
for i in range(50): 
    print(f"Iteration {i+1}/50 \n-----------------------------")
    train_loop(train_loader, model, lossFunction, optimizer)
    val_loop(val_loader, model, lossFunction)

Iteration 1/50 
-----------------------------
TRAINING -> Loss: 0.052538, MSE: 0.105261, RMSE: 0.324439, R²: -1.223391
VALIDATION -> Loss: 0.017285, MSE: 0.034591, RMSE: 0.185986, R²: 0.286081
Iteration 2/50 
-----------------------------
TRAINING -> Loss: 0.019013, MSE: 0.038034, RMSE: 0.195022, R²: 0.196624
VALIDATION -> Loss: 0.013096, MSE: 0.026218, RMSE: 0.161921, R²: 0.458879
Iteration 3/50 
-----------------------------
TRAINING -> Loss: 0.017171, MSE: 0.034358, RMSE: 0.185360, R²: 0.274259
VALIDATION -> Loss: 0.011494, MSE: 0.023015, RMSE: 0.151708, R²: 0.524987
Iteration 4/50 
-----------------------------
TRAINING -> Loss: 0.016355, MSE: 0.032714, RMSE: 0.180870, R²: 0.308993
VALIDATION -> Loss: 0.010864, MSE: 0.021754, RMSE: 0.147491, R²: 0.551029
Iteration 5/50 
-----------------------------
TRAINING -> Loss: 0.015886, MSE: 0.031773, RMSE: 0.178251, R²: 0.328862
VALIDATION -> Loss: 0.010596, MSE: 0.021222, RMSE: 0.145676, R²: 0.562010
Iteration 6/50 
-----------------------

In [13]:
#And display the evaluation metrics of the model
import torch
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import binarize
from scipy.stats import rankdata

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_train = x_train.to(device)
y_train = y_train.to(device)
model.to(device)
with torch.no_grad():
    y_train_pred = model(x_train)
y_train_np = y_train.cpu().numpy()
y_train_pred_np = y_train_pred.cpu().numpy()

# --- METRICS ---

# R^2 Score
ss_total = np.sum((y_train_np - np.mean(y_train_np)) ** 2)
ss_residual = np.sum((y_train_np - y_train_pred_np) ** 2)
r2_score = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0

# MAE
mae = np.mean(np.abs(y_train_np - y_train_pred_np))

# MSE
mse = np.mean((y_train_np - y_train_pred_np) ** 2)

# RMSE
rmse = np.sqrt(mse)

# ACCURACY AND RECALL
threshold = np.median(y_train_np)  

y_train_bin = binarize(y_train_np.reshape(-1, 1), threshold=threshold).flatten()
y_train_pred_bin = binarize(y_train_pred_np.reshape(-1, 1), threshold=threshold).flatten()

precision = precision_score(y_train_bin, y_train_pred_bin)
recall = recall_score(y_train_bin, y_train_pred_bin)

# NDCG 
def dcg_score(y_true, y_score, k=10):
    order = np.argsort(y_score)[::-1]  
    y_true_sorted = np.take(y_true, order[:k])
    
    gains = 2 ** y_true_sorted - 1
    discounts = np.log2(np.arange(2, len(y_true_sorted) + 2))
    
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10):
    best_dcg = dcg_score(y_true, y_true, k)  #
    actual_dcg = dcg_score(y_true, y_score, k)
    
    return actual_dcg / best_dcg if best_dcg > 0 else 0

ndcg = ndcg_score(y_train_np, y_train_pred_np)

print(f"R^2 Score: {r2_score:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Precisión: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"NDCG@10: {ndcg:.4f}")


R^2 Score: 0.4452
MAE: 0.1217
MSE: 0.0263
RMSE: 0.1621
Precisión: 0.6199
Recall: 0.9309
NDCG@10: 1.0000


# **SECOND MODEL**
# Adding to the first model the data from TMBD 


In [14]:
tmdb = pd.read_csv("ml-latest-small/movie_info_tmdb.csv", usecols=lambda column: column != "origin_country")
df = tmdb.merge(links, on="tmdbId", how="left")
df = df.merge(ratings, on="movieId", how="left")

df2 = df.copy()
df2 = df2.drop(columns=["title"])
df2["release_date"] = pd.to_datetime(df2["release_date"], errors="coerce").dt.year
df2 = pd.get_dummies(df2, columns=["original_language"], dtype=float)
columns_to_convert = [col for col in df2.columns if col != "title"]
df2[columns_to_convert] = df2[columns_to_convert].apply(pd.to_numeric, errors="coerce")
numerical_col = [col for col in df2.columns if col not in ["title"]]
#Only the year is taken from release_date, title is removed and original_language is dummyfied
#The TMDB data, links.csv and ratings.csv are merged

In [15]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import torch
import numpy as np

df_shuffle = df2.sample(frac=1, random_state=123).drop(columns=["timestamp"])
df_shuffle = df_shuffle.dropna()
#Feature timestamp is removed and NAs are removed

df_train = df_shuffle.iloc[:int(len(df_shuffle) * 0.6), :]
df_val = df_shuffle.iloc[int(len(df_shuffle) * 0.6):int(len(df_shuffle) * 0.8), :]
df_test = df_shuffle.iloc[int(len(df_shuffle) * 0.8):, :]
#Data is split into train, validate and testing sets(60-20-20)

scalers = {}

feature_cols = [col for col in df_shuffle.columns if col not in ["rating", "timestamp", "title"]]
x_train, y_train = df_train[feature_cols].to_numpy(dtype=np.float32), df_train["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_val, y_val = df_val[feature_cols].to_numpy(dtype=np.float32), df_val["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_test, y_test = df_test[feature_cols].to_numpy(dtype=np.float32), df_test["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
#Features transformed into numerical and  divided into input and output/target variable

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

x_train, y_train = torch.tensor(x_train), torch.tensor(y_train).float()
x_val, y_val = torch.tensor(x_val), torch.tensor(y_val).float()
x_test, y_test = torch.tensor(x_test), torch.tensor(y_test).float()



(61074, 58) (61074, 1)
(20358, 58) (20358, 1)
(20358, 58) (20358, 1)


In [16]:
import pandas as pd
#Here, I have created for each set (train, validation and testing) new 6 variables. For training and validation, the data related to the average of the 
#rating is taken from the set itself, but the columns of the testing set can not be created from the average of the rating of the training set (would be considered data leakage)
#So the solution is using the average from the data of the training set
#NAs are also removed
avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_train = df_train.merge(avg_movie_rating, on="movieId", how="left")
df_train = df_train.merge(count_movie_rating, on="movieId", how="left")
df_train = df_train.merge(std_movie_rating, on="movieId", how="left")

df_train = df_train.merge(avg_user_rating, on="userId", how="left")
df_train = df_train.merge(count_user_rating, on="userId", how="left")
df_train = df_train.merge(std_user_rating, on="userId", how="left")

df_train["count_movie_rating"] = df_train["count_movie_rating"].fillna(0)
df_train["count_user_rating"] = df_train["count_user_rating"].fillna(0)
df_train["avg_movie_rating"] = df_train["avg_movie_rating"].fillna(0)
df_train["avg_user_rating"] = df_train["avg_user_rating"].fillna(0)
df_train["std_movie_rating"] = df_train["std_movie_rating"].fillna(0)
df_train["std_user_rating"] = df_train["std_user_rating"].fillna(0)


avg_movie_rating = df_val.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_val.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_val.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_val.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_val.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_val.groupby("userId")["rating"].std().rename("std_user_rating")

df_val = df_val.merge(avg_movie_rating, on="movieId", how="left")
df_val = df_val.merge(count_movie_rating, on="movieId", how="left")
df_val = df_val.merge(std_movie_rating, on="movieId", how="left")

df_val = df_val.merge(avg_user_rating, on="userId", how="left")
df_val = df_val.merge(count_user_rating, on="userId", how="left")
df_val = df_val.merge(std_user_rating, on="userId", how="left")

df_val["count_movie_rating"] = df_val["count_movie_rating"].fillna(0)
df_val["count_user_rating"] = df_val["count_user_rating"].fillna(0)
df_val["avg_movie_rating"] = df_val["avg_movie_rating"].fillna(0)
df_val["avg_user_rating"] = df_val["avg_user_rating"].fillna(0)
df_val["std_movie_rating"] = df_val["std_movie_rating"].fillna(0)
df_val["std_user_rating"] = df_val["std_user_rating"].fillna(0)


avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_test = df_test.merge(avg_movie_rating, on="movieId", how="left")
df_test = df_test.merge(count_movie_rating, on="movieId", how="left")
df_test = df_test.merge(std_movie_rating, on="movieId", how="left")

df_test = df_test.merge(avg_user_rating, on="userId", how="left")
df_test = df_test.merge(count_user_rating, on="userId", how="left")
df_test = df_test.merge(std_user_rating, on="userId", how="left")

df_test["count_movie_rating"] = df_test["count_movie_rating"].fillna(0)
df_test["count_user_rating"] = df_test["count_user_rating"].fillna(0)
df_test["avg_movie_rating"] = df_test["avg_movie_rating"].fillna(0)
df_test["avg_user_rating"] = df_test["avg_user_rating"].fillna(0)
df_test["std_movie_rating"] = df_test["std_movie_rating"].fillna(0)
df_test["std_user_rating"] = df_test["std_user_rating"].fillna(0)


print(np.isnan(df_train).any())
print(df2.head())


tmdbId                False
ratingTmdb            False
release_date          False
votes                 False
budget                False
                      ...  
count_movie_rating    False
std_movie_rating      False
avg_user_rating       False
count_user_rating     False
std_user_rating       False
Length: 65, dtype: bool
   tmdbId  ratingTmdb  release_date    votes      budget      revenue  \
0     862         8.0        1995.0  18705.0  30000000.0  394436586.0   
1     862         8.0        1995.0  18705.0  30000000.0  394436586.0   
2     862         8.0        1995.0  18705.0  30000000.0  394436586.0   
3     862         8.0        1995.0  18705.0  30000000.0  394436586.0   
4     862         8.0        1995.0  18705.0  30000000.0  394436586.0   

   runtime  movieId  imdbId  userId  ...  original_language_sr  \
0     81.0        1  114709     1.0  ...                   0.0   
1     81.0        1  114709     5.0  ...                   0.0   
2     81.0        1  114709    

In [17]:
scaler = MinMaxScaler()
#Only numerical columns are normalized


numerical_col = df_train.select_dtypes(include=['number']).columns
df_train[numerical_col] = scaler.fit_transform(df_train[numerical_col])
df_val[numerical_col] = scaler.transform(df_val[numerical_col])  
df_test[numerical_col] = scaler.transform(df_test[numerical_col]) 

print(df_train.shape)

(61074, 65)


In [18]:
import torch
from torch.utils.data import TensorDataset, DataLoader

feature_cols = [col for col in df_train.columns if col != "rating"]

x_train = torch.tensor(df_train[feature_cols].values, dtype=torch.float32)
y_train = torch.tensor(df_train["rating"].values, dtype=torch.float32).unsqueeze(1)
x_val = torch.tensor(df_val[feature_cols].values, dtype=torch.float32)
y_val = torch.tensor(df_val["rating"].values, dtype=torch.float32).unsqueeze(1)
x_test = torch.tensor(df_test[feature_cols].values, dtype=torch.float32)
y_test = torch.tensor(df_test["rating"].values, dtype=torch.float32).unsqueeze(1)

torch.manual_seed(123)

train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024, shuffle=False)
#Tensors are transformed into TensorDatasets and then into DataLoaders

In [19]:
import torch.nn as nn
import torch.optim as optim
import torch

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(64, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.model(x)
device = "cuda" 
model = NeuralNetwork().to(device)  #To change to the GPU

lossFunction = torch.nn.HuberLoss() 
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
#The neural network is defined

In [20]:
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# TRAINING FUNCTION
def train_loop(dataloader, model, lossFunction, optimizer):
    train_size = len(dataloader.dataset)    
    nbatches = len(dataloader)  

    model.train()
    loss_train = 0  
    all_preds = []
    all_targets = []

    for nbatch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        logits = model(X)
        
        loss = lossFunction(logits, y)
        loss.backward()   
        optimizer.step()  
        optimizer.zero_grad()

        loss_train += loss.item()

        all_preds.extend(logits.detach().cpu().numpy())  
        all_targets.extend(y.cpu().numpy())  


    avg_loss = loss_train / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'TRAINING -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


# VALIDATION FUNCTION
def val_loop(dataloader, model, lossFunction):
    val_size = len(dataloader.dataset)
    nbatches = len(dataloader)

    model.eval()

    loss_val = 0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)

            loss_val += lossFunction(logits, y).item()
            
            all_preds.extend(logits.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    avg_loss = loss_val / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'VALIDATION -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')

In [None]:
#The model is trained and validated for 50 epochs
for i in range(100): 
    print(f"Iteration {i+1}/100 \n-----------------------------")
    train_loop(train_loader, model, lossFunction, optimizer)
    val_loop(val_loader, model, lossFunction)

Iteration 1/50 
-----------------------------
TRAINING -> Loss: 0.080063, MSE: 0.160725, RMSE: 0.400905, R²: -2.385594
VALIDATION -> Loss: 0.018461, MSE: 0.036931, RMSE: 0.192174, R²: 0.216698
Iteration 2/50 
-----------------------------
TRAINING -> Loss: 0.028573, MSE: 0.057176, RMSE: 0.239116, R²: -0.204393
VALIDATION -> Loss: 0.013844, MSE: 0.027692, RMSE: 0.166410, R²: 0.412644
Iteration 3/50 
-----------------------------
TRAINING -> Loss: 0.023259, MSE: 0.046514, RMSE: 0.215672, R²: 0.020196
VALIDATION -> Loss: 0.012744, MSE: 0.025492, RMSE: 0.159662, R²: 0.459316
Iteration 4/50 
-----------------------------
TRAINING -> Loss: 0.021594, MSE: 0.043196, RMSE: 0.207835, R²: 0.090104
VALIDATION -> Loss: 0.012352, MSE: 0.024709, RMSE: 0.157191, R²: 0.475918
Iteration 5/50 
-----------------------------
TRAINING -> Loss: 0.020822, MSE: 0.041662, RMSE: 0.204113, R²: 0.122402
VALIDATION -> Loss: 0.012488, MSE: 0.024980, RMSE: 0.158051, R²: 0.470171
Iteration 6/50 
----------------------

In [22]:
#And display the evaluation metrics of the model
import torch
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import binarize
from scipy.stats import rankdata

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_train = x_train.to(device)
y_train = y_train.to(device)
model.to(device)
with torch.no_grad():
    y_train_pred = model(x_train)
y_train_np = y_train.cpu().numpy()
y_train_pred_np = y_train_pred.cpu().numpy()

# --- METRICS ---

# R^2 Score
ss_total = np.sum((y_train_np - np.mean(y_train_np)) ** 2)
ss_residual = np.sum((y_train_np - y_train_pred_np) ** 2)
r2_score = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0

# MAE
mae = np.mean(np.abs(y_train_np - y_train_pred_np))

# MSE
mse = np.mean((y_train_np - y_train_pred_np) ** 2)

# RMSE
rmse = np.sqrt(mse)

# ACCURACY AND RECALL
threshold = np.median(y_train_np)  

y_train_bin = binarize(y_train_np.reshape(-1, 1), threshold=threshold).flatten()
y_train_pred_bin = binarize(y_train_pred_np.reshape(-1, 1), threshold=threshold).flatten()

precision = precision_score(y_train_bin, y_train_pred_bin)
recall = recall_score(y_train_bin, y_train_pred_bin)

# NDCG 
def dcg_score(y_true, y_score, k=10):
    order = np.argsort(y_score)[::-1]  
    y_true_sorted = np.take(y_true, order[:k])
    
    gains = 2 ** y_true_sorted - 1
    discounts = np.log2(np.arange(2, len(y_true_sorted) + 2))
    
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10):
    best_dcg = dcg_score(y_true, y_true, k)  #
    actual_dcg = dcg_score(y_true, y_score, k)
    
    return actual_dcg / best_dcg if best_dcg > 0 else 0

ndcg = ndcg_score(y_train_np, y_train_pred_np)

print(f"R^2 Score: {r2_score:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Precisión: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"NDCG@10: {ndcg:.4f}")

R^2 Score: 0.4699
MAE: 0.1182
MSE: 0.0252
RMSE: 0.1586
Precisión: 0.6185
Recall: 0.9331
NDCG@10: 1.0000


# **THIRD MODEL**
# Adding to the first model the data from IMDB 

In [23]:
imdb = pd.read_csv("ml-latest-small/movie_info_imdb.csv", usecols=lambda column: column != "origin_country")
imdb['imdbId'] = imdb['imdbId'].str.replace('tt', '', regex=False).astype(int)
columns_to_drop = ['Actors', 'Awards', 'DVD', 'Director', 'Genre', 'Title', 'Type', 'Website', 'Year', 'Poster', 'Production', 'Rated', 'Plot', 'Writer', 'Response', 'Ratings']
imdb = imdb.drop(columns=columns_to_drop)

imdb = imdb.astype({
    "BoxOffice": "string",  
    "Country": "string",
    "Language": "string",
    "Metascore": "float",
    "Released": "string",
    "Runtime": "string",
    "imdbRating": "float",
    "imdbVotes": "string"
})

imdb['BoxOffice'] = imdb['BoxOffice'].replace('[\$,]', '', regex=True).astype(float)
imdb['Released'] = imdb['Released'].str.extract(r'(\d{4})').astype(float)
imdb['Runtime'] = imdb['Runtime'].str.extract(r'(\d+)').astype(float)
imdb['imdbVotes'] = imdb['imdbVotes'].str.replace(',', '', regex=True).astype(float)

imdb = imdb.merge(links, on="imdbId", how="left")
imdb = imdb.merge(ratings, on="movieId", how="left")
#The data from IMDB, links.csv and ratings.csv are merged

imdb = imdb.dropna()
#'imdbId', 'BoxOffice', 'Language', 'Metascore', 'Released', 'Runtime', 'imdbRating', 'imdbVotes' are the selected features from the IMDB data ( I have considered the others not so usefull)
#and are filtered to transform all the data into numeric

       

In [24]:
df2 = imdb
df2 = pd.get_dummies(df2, columns=["Country"], dtype=float)

print(df2.columns)
df2["Language"] = df2["Language"].str.split(",")
moviesExploded = df2.explode("Language")
moviesExploded["Language"] = moviesExploded["Language"].str.strip()
movies_dummies = pd.get_dummies(moviesExploded["Language"], dtype=int)
movies_dummies = moviesExploded[["movieId"]].join(movies_dummies).groupby("movieId").max()
df2 = df2.drop(columns=["Language"]).merge(movies_dummies, on="movieId")
#Country and Language are dummyfied

columns_to_convert = [col for col in df2.columns]
df2[columns_to_convert] = df2[columns_to_convert].apply(pd.to_numeric, errors="coerce")
#All columns are transformed to numeric

df2 = df2.dropna()

Index(['imdbId', 'BoxOffice', 'Language', 'Metascore', 'Released', 'Runtime',
       'imdbRating', 'imdbVotes', 'movieId', 'tmdbId',
       ...
       'Country_United States, United Kingdom, Switzerland, Panama',
       'Country_United States, Venezuela',
       'Country_United States, Vietnam, United Kingdom, Canada, Denmark',
       'Country_United States, West Germany', 'Country_West Germany',
       'Country_West Germany, France',
       'Country_West Germany, France, United Kingdom',
       'Country_West Germany, Italy, France',
       'Country_West Germany, United States',
       'Country_Yugoslavia, United States'],
      dtype='object', length=931)


In [25]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import torch
import numpy as np

df_shuffle = df2.sample(frac=1, random_state=123).drop(columns=["timestamp"])
df_shuffle = df_shuffle.dropna()
#Feature timestamp is removed and NAs are removed

df_train = df_shuffle.iloc[:int(len(df_shuffle) * 0.6), :]
df_val = df_shuffle.iloc[int(len(df_shuffle) * 0.6):int(len(df_shuffle) * 0.8), :]
df_test = df_shuffle.iloc[int(len(df_shuffle) * 0.8):, :]
#Data is splited into training, validation and test sets(60-20-20)

scalers = {}

feature_cols = [col for col in df_shuffle.columns]
x_train, y_train = df_train[feature_cols].to_numpy(dtype=np.float32), df_train["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_val, y_val = df_val[feature_cols].to_numpy(dtype=np.float32), df_val["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_test, y_test = df_test[feature_cols].to_numpy(dtype=np.float32), df_test["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
#Features transformed into numerical and  divided into input and output/target variable

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

x_train, y_train = torch.tensor(x_train), torch.tensor(y_train).float()
x_val, y_val = torch.tensor(x_val), torch.tensor(y_val).float()
x_test, y_test = torch.tensor(x_test), torch.tensor(y_test).float()

(54776, 1087) (54776, 1)
(18259, 1087) (18259, 1)
(18259, 1087) (18259, 1)


In [26]:
import pandas as pd
#Here, I have created for each set (train, validation and testing) new 6 variables. For training and validation, the data related to the average of the 
#rating is taken from the set itself, but the columns of the testing set can not be created from the average of the rating of the training set (would be considered data leakage)
#So the solution is using the average from the data of the training set
#NAs are also removed
avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_train = df_train.merge(avg_movie_rating, on="movieId", how="left")
df_train = df_train.merge(count_movie_rating, on="movieId", how="left")
df_train = df_train.merge(std_movie_rating, on="movieId", how="left")

df_train = df_train.merge(avg_user_rating, on="userId", how="left")
df_train = df_train.merge(count_user_rating, on="userId", how="left")
df_train = df_train.merge(std_user_rating, on="userId", how="left")

df_train["count_movie_rating"] = df_train["count_movie_rating"].fillna(0)
df_train["count_user_rating"] = df_train["count_user_rating"].fillna(0)
df_train["avg_movie_rating"] = df_train["avg_movie_rating"].fillna(0)
df_train["avg_user_rating"] = df_train["avg_user_rating"].fillna(0)
df_train["std_movie_rating"] = df_train["std_movie_rating"].fillna(0)
df_train["std_user_rating"] = df_train["std_user_rating"].fillna(0)



avg_movie_rating = df_val.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_val.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_val.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_val.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_val.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_val.groupby("userId")["rating"].std().rename("std_user_rating")

df_val = df_val.merge(avg_movie_rating, on="movieId", how="left")
df_val = df_val.merge(count_movie_rating, on="movieId", how="left")
df_val = df_val.merge(std_movie_rating, on="movieId", how="left")

df_val = df_val.merge(avg_user_rating, on="userId", how="left")
df_val = df_val.merge(count_user_rating, on="userId", how="left")
df_val = df_val.merge(std_user_rating, on="userId", how="left")

df_val["count_movie_rating"] = df_val["count_movie_rating"].fillna(0)
df_val["count_user_rating"] = df_val["count_user_rating"].fillna(0)
df_val["avg_movie_rating"] = df_val["avg_movie_rating"].fillna(0)
df_val["avg_user_rating"] = df_val["avg_user_rating"].fillna(0)
df_val["std_movie_rating"] = df_val["std_movie_rating"].fillna(0)
df_val["std_user_rating"] = df_val["std_user_rating"].fillna(0)



avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_test = df_test.merge(avg_movie_rating, on="movieId", how="left")
df_test = df_test.merge(count_movie_rating, on="movieId", how="left")
df_test = df_test.merge(std_movie_rating, on="movieId", how="left")

df_test = df_test.merge(avg_user_rating, on="userId", how="left")
df_test = df_test.merge(count_user_rating, on="userId", how="left")
df_test = df_test.merge(std_user_rating, on="userId", how="left")

df_test["count_movie_rating"] = df_test["count_movie_rating"].fillna(0)
df_test["count_user_rating"] = df_test["count_user_rating"].fillna(0)
df_test["avg_movie_rating"] = df_test["avg_movie_rating"].fillna(0)
df_test["avg_user_rating"] = df_test["avg_user_rating"].fillna(0)
df_test["std_movie_rating"] = df_test["std_movie_rating"].fillna(0)
df_test["std_user_rating"] = df_test["std_user_rating"].fillna(0)


print(np.isnan(df_train).any())
print(df2.head())

imdbId                False
BoxOffice             False
Metascore             False
Released              False
Runtime               False
                      ...  
count_movie_rating    False
std_movie_rating      False
avg_user_rating       False
count_user_rating     False
std_user_rating       False
Length: 1093, dtype: bool
   imdbId    BoxOffice  Metascore  Released  Runtime  imdbRating  imdbVotes  \
0  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
1  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
2  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
3  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
4  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   

   movieId  tmdbId  userId  ...  Ungwatsi  Urdu  Vietnamese  Washoe  Welsh  \
0      1.0   862.0     1.0  ...         0     0           0       0      0   
1      1.0   862.0     5.0  ...    

In [27]:
scaler = MinMaxScaler()

numerical_col = df_train.select_dtypes(include=['number']).columns
#Only numerical columns are normalized

df_train[numerical_col] = scaler.fit_transform(df_train[numerical_col])
df_val[numerical_col] = scaler.transform(df_val[numerical_col])  
df_test[numerical_col] = scaler.transform(df_test[numerical_col]) 

print(df_train.shape)

(54776, 1093)


In [28]:
import torch
from torch.utils.data import TensorDataset, DataLoader

feature_cols = [col for col in df_train.columns if col != "rating"]

x_train = torch.tensor(df_train[feature_cols].values, dtype=torch.float32)
y_train = torch.tensor(df_train["rating"].values, dtype=torch.float32).unsqueeze(1)

x_val = torch.tensor(df_val[feature_cols].values, dtype=torch.float32)
y_val = torch.tensor(df_val["rating"].values, dtype=torch.float32).unsqueeze(1)

x_test = torch.tensor(df_test[feature_cols].values, dtype=torch.float32)
y_test = torch.tensor(df_test["rating"].values, dtype=torch.float32).unsqueeze(1)

torch.manual_seed(123)

train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024, shuffle=False)
#Tensors are transformed into TensorDatasets and then into DataLoaders

In [29]:
import torch.nn as nn
import torch.optim as optim
import torch

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(1092, 2048),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 16),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.model(x)
device = "cuda" 
model = NeuralNetwork().to(device)  #To change to the GPU

lossFunction = torch.nn.HuberLoss() 
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
#The neural network is defined

In [30]:
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# TRAINING FUNCTION
def train_loop(dataloader, model, lossFunction, optimizer):
    train_size = len(dataloader.dataset)    
    nbatches = len(dataloader)  

    model.train()
    loss_train = 0  
    all_preds = []
    all_targets = []

    for nbatch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        logits = model(X)
        
        loss = lossFunction(logits, y)
        loss.backward()   
        optimizer.step()  
        optimizer.zero_grad()

        loss_train += loss.item()

        all_preds.extend(logits.detach().cpu().numpy())  
        all_targets.extend(y.cpu().numpy())  


    avg_loss = loss_train / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'TRAINING -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


# VALIDATION FUNCTION
def val_loop(dataloader, model, lossFunction):
    val_size = len(dataloader.dataset)
    nbatches = len(dataloader)

    model.eval()

    loss_val = 0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)

            loss_val += lossFunction(logits, y).item()
            
            all_preds.extend(logits.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    avg_loss = loss_val / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'VALIDATION -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')

In [None]:
#The model is trained and validated for 50 epochs
for i in range(50): 
    print(f"Iteration {i+1}/50 \n-----------------------------")
    train_loop(train_loader, model, lossFunction, optimizer)
    val_loop(val_loader, model, lossFunction)

Iteration 1/50 
-----------------------------
TRAINING -> Loss: 0.084616, MSE: 0.170415, RMSE: 0.412814, R²: -2.624800
VALIDATION -> Loss: 0.019125, MSE: 0.038267, RMSE: 0.195618, R²: 0.188671
Iteration 2/50 
-----------------------------
TRAINING -> Loss: 0.036733, MSE: 0.073477, RMSE: 0.271066, R²: -0.562882
VALIDATION -> Loss: 0.016121, MSE: 0.032258, RMSE: 0.179604, R²: 0.316075
Iteration 3/50 
-----------------------------
TRAINING -> Loss: 0.033069, MSE: 0.066172, RMSE: 0.257240, R²: -0.407513
VALIDATION -> Loss: 0.015273, MSE: 0.030563, RMSE: 0.174821, R²: 0.352012
Iteration 4/50 
-----------------------------
TRAINING -> Loss: 0.031577, MSE: 0.063169, RMSE: 0.251334, R²: -0.343627
VALIDATION -> Loss: 0.014726, MSE: 0.029469, RMSE: 0.171666, R²: 0.375195
Iteration 5/50 
-----------------------------
TRAINING -> Loss: 0.030384, MSE: 0.060795, RMSE: 0.246567, R²: -0.293138
VALIDATION -> Loss: 0.014195, MSE: 0.028405, RMSE: 0.168538, R²: 0.397757
Iteration 6/50 
-------------------

In [32]:
#And display the evaluation metrics of the model
import torch
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import binarize
from scipy.stats import rankdata

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_train = x_train.to(device)
y_train = y_train.to(device)
model.to(device)
with torch.no_grad():
    y_train_pred = model(x_train)
y_train_np = y_train.cpu().numpy()
y_train_pred_np = y_train_pred.cpu().numpy()

# --- METRICS ---

# R^2 Score
ss_total = np.sum((y_train_np - np.mean(y_train_np)) ** 2)
ss_residual = np.sum((y_train_np - y_train_pred_np) ** 2)
r2_score = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0

# MAE
mae = np.mean(np.abs(y_train_np - y_train_pred_np))

# MSE
mse = np.mean((y_train_np - y_train_pred_np) ** 2)

# RMSE
rmse = np.sqrt(mse)

# ACCURACY AND RECALL
threshold = np.median(y_train_np)  

y_train_bin = binarize(y_train_np.reshape(-1, 1), threshold=threshold).flatten()
y_train_pred_bin = binarize(y_train_pred_np.reshape(-1, 1), threshold=threshold).flatten()

precision = precision_score(y_train_bin, y_train_pred_bin)
recall = recall_score(y_train_bin, y_train_pred_bin)

# NDCG 
def dcg_score(y_true, y_score, k=10):
    order = np.argsort(y_score)[::-1]  
    y_true_sorted = np.take(y_true, order[:k])
    
    gains = 2 ** y_true_sorted - 1
    discounts = np.log2(np.arange(2, len(y_true_sorted) + 2))
    
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10):
    best_dcg = dcg_score(y_true, y_true, k)  #
    actual_dcg = dcg_score(y_true, y_score, k)
    
    return actual_dcg / best_dcg if best_dcg > 0 else 0

ndcg = ndcg_score(y_train_np, y_train_pred_np)

print(f"R^2 Score: {r2_score:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Precisión: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"NDCG@10: {ndcg:.4f}")

R^2 Score: 0.4805
MAE: 0.1186
MSE: 0.0244
RMSE: 0.1563
Precisión: 0.6205
Recall: 0.9401
NDCG@10: 1.0000


# **FOURTH MODEL**
# Adding to the first model the data from IMDB and TMDB

In [33]:
imdb = pd.read_csv("ml-latest-small/movie_info_imdb.csv", usecols=lambda column: column != "origin_country")
imdb['imdbId'] = imdb['imdbId'].str.replace('tt', '', regex=False).astype(int)
columns_to_drop = ['Actors', 'Awards', 'DVD', 'Director', 'Genre', 'Title', 'Type', 'Website', 'Year', 'Poster', 'Production', 'Rated', 'Plot', 'Writer', 'Response', 'Ratings']
imdb = imdb.drop(columns=columns_to_drop)

imdb = imdb.astype({
    "BoxOffice": "string",  
    "Country": "string",
    "Language": "string",
    "Metascore": "float",
    "Released": "string",
    "Runtime": "string",
    "imdbRating": "float",
    "imdbVotes": "string"
})

imdb['BoxOffice'] = imdb['BoxOffice'].replace('[\$,]', '', regex=True).astype(float)
imdb['Released'] = imdb['Released'].str.extract(r'(\d{4})').astype(float)
imdb['Runtime'] = imdb['Runtime'].str.extract(r'(\d+)').astype(float)
imdb['imdbVotes'] = imdb['imdbVotes'].str.replace(',', '', regex=True).astype(float)


tmdb = pd.read_csv("ml-latest-small/movie_info_tmdb.csv", usecols=lambda column: column != "origin_country")
tmdb = tmdb.drop(columns=["title", "original_language"])
tmdb["release_date"] = pd.to_datetime(tmdb["release_date"], errors="coerce").dt.year

imdb = imdb.merge(links, on="imdbId", how="left")
imdb = imdb.merge(ratings, on="movieId", how="left")
df2 = imdb.merge(tmdb, on="tmdbId", how="left")

imdb = imdb.dropna()
print(imdb.columns)

print(imdb.describe)
#All the steps from previous models are performed and data from TMDB, IMDB, links and ratings are merged all together

Index(['imdbId', 'BoxOffice', 'Country', 'Language', 'Metascore', 'Released',
       'Runtime', 'imdbRating', 'imdbVotes', 'movieId', 'tmdbId', 'userId',
       'rating', 'timestamp'],
      dtype='object')
<bound method NDFrame.describe of          imdbId    BoxOffice                               Country  \
0        114709  223225679.0                         United States   
1        114709  223225679.0                         United States   
2        114709  223225679.0                         United States   
3        114709  223225679.0                         United States   
4        114709  223225679.0                         United States   
...         ...          ...                                   ...   
100799  4912910  220159104.0  United States, China, France, Norway   
100800  4912910  220159104.0  United States, China, France, Norway   
100801  7690670   20545116.0                         United States   
100803  7349662   49275340.0                  United States

In [34]:
df2 = imdb

df2 = pd.get_dummies(df2, columns=["Country"], dtype=float)

df2["Language"] = df2["Language"].str.split(",")
moviesExploded = df2.explode("Language")
moviesExploded["Language"] = moviesExploded["Language"].str.strip()
movies_dummies = pd.get_dummies(moviesExploded["Language"], dtype=int)
movies_dummies = moviesExploded[["movieId"]].join(movies_dummies).groupby("movieId").max()
df2 = df2.drop(columns=["Language"]).merge(movies_dummies, on="movieId")

columns_to_convert = [col for col in df2.columns]
df2[columns_to_convert] = df2[columns_to_convert].apply(pd.to_numeric, errors="coerce")

df2 = df2.dropna()

print(df2.columns)
print(df2.head)

Index(['imdbId', 'BoxOffice', 'Metascore', 'Released', 'Runtime', 'imdbRating',
       'imdbVotes', 'movieId', 'tmdbId', 'userId',
       ...
       'Ungwatsi', 'Urdu', 'Vietnamese', 'Washoe', 'Welsh', 'Wolof', 'Xhosa',
       'Yiddish', 'Yoruba', 'Zulu'],
      dtype='object', length=1088)
<bound method NDFrame.head of         imdbId    BoxOffice  Metascore  Released  Runtime  imdbRating  \
0       114709  223225679.0       96.0    1995.0     81.0         8.3   
1       114709  223225679.0       96.0    1995.0     81.0         8.3   
2       114709  223225679.0       96.0    1995.0     81.0         8.3   
3       114709  223225679.0       96.0    1995.0     81.0         8.3   
4       114709  223225679.0       96.0    1995.0     81.0         8.3   
...        ...          ...        ...       ...      ...         ...   
91289  4912910  220159104.0       87.0    2018.0    147.0         7.7   
91290  4912910  220159104.0       87.0    2018.0    147.0         7.7   
91291  7690670   2054

In [35]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import torch
import numpy as np

df_shuffle = df2.sample(frac=1, random_state=123).drop(columns=["timestamp"])
df_shuffle = df_shuffle.dropna()
#Feature timestamp is removed and NAs are removed


df_train = df_shuffle.iloc[:int(len(df_shuffle) * 0.6), :]
df_val = df_shuffle.iloc[int(len(df_shuffle) * 0.6):int(len(df_shuffle) * 0.8), :]
df_test = df_shuffle.iloc[int(len(df_shuffle) * 0.8):, :]
#Data is splited into training, validation and test sets(60-20-20)

scalers = {}

feature_cols = [col for col in df_shuffle.columns]
x_train, y_train = df_train[feature_cols].to_numpy(dtype=np.float32), df_train["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_val, y_val = df_val[feature_cols].to_numpy(dtype=np.float32), df_val["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_test, y_test = df_test[feature_cols].to_numpy(dtype=np.float32), df_test["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
#Features transformed into numerical and  divided into input and output/target variable

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

x_train, y_train = torch.tensor(x_train), torch.tensor(y_train).float()
x_val, y_val = torch.tensor(x_val), torch.tensor(y_val).float()
x_test, y_test = torch.tensor(x_test), torch.tensor(y_test).float()

(54776, 1087) (54776, 1)
(18259, 1087) (18259, 1)
(18259, 1087) (18259, 1)


In [36]:
import pandas as pd
#Here, I have created for each set (train, validation and testing) new 6 variables. For training and validation, the data related to the average of the 
#rating is taken from the set itself, but the columns of the testing set can not be created from the average of the rating of the training set (would be considered data leakage)
#So the solution is using the average from the data of the training set
#NAs are also removed
print(df2.columns)

avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_train = df_train.merge(avg_movie_rating, on="movieId", how="left")
df_train = df_train.merge(count_movie_rating, on="movieId", how="left")
df_train = df_train.merge(std_movie_rating, on="movieId", how="left")

df_train = df_train.merge(avg_user_rating, on="userId", how="left")
df_train = df_train.merge(count_user_rating, on="userId", how="left")
df_train = df_train.merge(std_user_rating, on="userId", how="left")

df_train["count_movie_rating"] = df_train["count_movie_rating"].fillna(0)
df_train["count_user_rating"] = df_train["count_user_rating"].fillna(0)
df_train["avg_movie_rating"] = df_train["avg_movie_rating"].fillna(0)
df_train["avg_user_rating"] = df_train["avg_user_rating"].fillna(0)
df_train["std_movie_rating"] = df_train["std_movie_rating"].fillna(0)
df_train["std_user_rating"] = df_train["std_user_rating"].fillna(0)



avg_movie_rating = df_val.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_val.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_val.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_val.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_val.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_val.groupby("userId")["rating"].std().rename("std_user_rating")

df_val = df_val.merge(avg_movie_rating, on="movieId", how="left")
df_val = df_val.merge(count_movie_rating, on="movieId", how="left")
df_val = df_val.merge(std_movie_rating, on="movieId", how="left")

df_val = df_val.merge(avg_user_rating, on="userId", how="left")
df_val = df_val.merge(count_user_rating, on="userId", how="left")
df_val = df_val.merge(std_user_rating, on="userId", how="left")

df_val["count_movie_rating"] = df_val["count_movie_rating"].fillna(0)
df_val["count_user_rating"] = df_val["count_user_rating"].fillna(0)
df_val["avg_movie_rating"] = df_val["avg_movie_rating"].fillna(0)
df_val["avg_user_rating"] = df_val["avg_user_rating"].fillna(0)
df_val["std_movie_rating"] = df_val["std_movie_rating"].fillna(0)
df_val["std_user_rating"] = df_val["std_user_rating"].fillna(0)



avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_test = df_test.merge(avg_movie_rating, on="movieId", how="left")
df_test = df_test.merge(count_movie_rating, on="movieId", how="left")
df_test = df_test.merge(std_movie_rating, on="movieId", how="left")

df_test = df_test.merge(avg_user_rating, on="userId", how="left")
df_test = df_test.merge(count_user_rating, on="userId", how="left")
df_test = df_test.merge(std_user_rating, on="userId", how="left")

df_test["count_movie_rating"] = df_test["count_movie_rating"].fillna(0)
df_test["count_user_rating"] = df_test["count_user_rating"].fillna(0)
df_test["avg_movie_rating"] = df_test["avg_movie_rating"].fillna(0)
df_test["avg_user_rating"] = df_test["avg_user_rating"].fillna(0)
df_test["std_movie_rating"] = df_test["std_movie_rating"].fillna(0)
df_test["std_user_rating"] = df_test["std_user_rating"].fillna(0)


print(np.isnan(df_train).any())
print(df2.head())

Index(['imdbId', 'BoxOffice', 'Metascore', 'Released', 'Runtime', 'imdbRating',
       'imdbVotes', 'movieId', 'tmdbId', 'userId',
       ...
       'Ungwatsi', 'Urdu', 'Vietnamese', 'Washoe', 'Welsh', 'Wolof', 'Xhosa',
       'Yiddish', 'Yoruba', 'Zulu'],
      dtype='object', length=1088)
imdbId                False
BoxOffice             False
Metascore             False
Released              False
Runtime               False
                      ...  
count_movie_rating    False
std_movie_rating      False
avg_user_rating       False
count_user_rating     False
std_user_rating       False
Length: 1093, dtype: bool
   imdbId    BoxOffice  Metascore  Released  Runtime  imdbRating  imdbVotes  \
0  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
1  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
2  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
3  114709  223225679.0       96.0    1995.0     81.0      

In [37]:
scaler = MinMaxScaler()

numerical_col = df_train.select_dtypes(include=['number']).columns
#Only numerical columns are normalized

df_train[numerical_col] = scaler.fit_transform(df_train[numerical_col])
df_val[numerical_col] = scaler.transform(df_val[numerical_col])  
df_test[numerical_col] = scaler.transform(df_test[numerical_col]) 

print(df_train.shape)

(54776, 1093)


In [38]:
import torch
from torch.utils.data import TensorDataset, DataLoader

feature_cols = [col for col in df_train.columns if col != "rating"]

x_train = torch.tensor(df_train[feature_cols].values, dtype=torch.float32)
y_train = torch.tensor(df_train["rating"].values, dtype=torch.float32).unsqueeze(1)

x_val = torch.tensor(df_val[feature_cols].values, dtype=torch.float32)
y_val = torch.tensor(df_val["rating"].values, dtype=torch.float32).unsqueeze(1)

x_test = torch.tensor(df_test[feature_cols].values, dtype=torch.float32)
y_test = torch.tensor(df_test["rating"].values, dtype=torch.float32).unsqueeze(1)

torch.manual_seed(1234)

train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024, shuffle=False)
#Tensors are transformed into TensorDatasets and then into DataLoaders

In [39]:
import torch.nn as nn
import torch.optim as optim
import torch

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(1092, 2048),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 16),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.model(x)
device = "cuda" 
model = NeuralNetwork().to(device)  #To change to the GPU

lossFunction = torch.nn.HuberLoss() 
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
#The neural network is defined

In [40]:
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# TRAINING FUNCTION
def train_loop(dataloader, model, lossFunction, optimizer):
    train_size = len(dataloader.dataset)    
    nbatches = len(dataloader)  

    model.train()
    loss_train = 0  
    all_preds = []
    all_targets = []

    for nbatch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        logits = model(X)
        
        loss = lossFunction(logits, y)
        loss.backward()   
        optimizer.step()  
        optimizer.zero_grad()

        loss_train += loss.item()

        all_preds.extend(logits.detach().cpu().numpy())  
        all_targets.extend(y.cpu().numpy())  

    avg_loss = loss_train / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'TRAINING -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


# VALIDATION FUNCTION
def val_loop(dataloader, model, lossFunction):
    val_size = len(dataloader.dataset)
    nbatches = len(dataloader)

    model.eval()

    loss_val = 0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)

            loss_val += lossFunction(logits, y).item()
            
            all_preds.extend(logits.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    avg_loss = loss_val / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'VALIDATION -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


In [None]:
#The model is trained and validated for 50 epochs
for i in range(50): 
    print(f"Iteration {i+1}/50 \n-----------------------------")
    train_loop(train_loader, model, lossFunction, optimizer)
    val_loop(val_loader, model, lossFunction)

Iteration 1/30 
-----------------------------
TRAINING -> Loss: 0.043108, MSE: 0.086530, RMSE: 0.294160, R²: -0.840525
VALIDATION -> Loss: 0.018207, MSE: 0.036432, RMSE: 0.190873, R²: 0.227559
Iteration 2/30 
-----------------------------
TRAINING -> Loss: 0.023532, MSE: 0.047078, RMSE: 0.216974, R²: -0.001357
VALIDATION -> Loss: 0.015604, MSE: 0.031225, RMSE: 0.176707, R²: 0.337962
Iteration 3/30 
-----------------------------
TRAINING -> Loss: 0.020512, MSE: 0.041026, RMSE: 0.202549, R²: 0.127362
VALIDATION -> Loss: 0.014633, MSE: 0.029281, RMSE: 0.171118, R²: 0.379179
Iteration 4/30 
-----------------------------
TRAINING -> Loss: 0.019468, MSE: 0.038945, RMSE: 0.197344, R²: 0.171634
VALIDATION -> Loss: 0.014468, MSE: 0.028953, RMSE: 0.170155, R²: 0.386144
Iteration 5/30 
-----------------------------
TRAINING -> Loss: 0.018755, MSE: 0.037507, RMSE: 0.193668, R²: 0.202207
VALIDATION -> Loss: 0.013411, MSE: 0.026832, RMSE: 0.163805, R²: 0.431103
Iteration 6/30 
----------------------

In [42]:
#And display the evaluation metrics of the model
import torch
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import binarize
from scipy.stats import rankdata

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_train = x_train.to(device)
y_train = y_train.to(device)
model.to(device)
with torch.no_grad():
    y_train_pred = model(x_train)
y_train_np = y_train.cpu().numpy()
y_train_pred_np = y_train_pred.cpu().numpy()

# --- METRICS ---

# R^2 Score
ss_total = np.sum((y_train_np - np.mean(y_train_np)) ** 2)
ss_residual = np.sum((y_train_np - y_train_pred_np) ** 2)
r2_score = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0

# MAE
mae = np.mean(np.abs(y_train_np - y_train_pred_np))

# MSE
mse = np.mean((y_train_np - y_train_pred_np) ** 2)

# RMSE
rmse = np.sqrt(mse)

# ACCURACY AND RECALL
threshold = np.median(y_train_np)  

y_train_bin = binarize(y_train_np.reshape(-1, 1), threshold=threshold).flatten()
y_train_pred_bin = binarize(y_train_pred_np.reshape(-1, 1), threshold=threshold).flatten()

precision = precision_score(y_train_bin, y_train_pred_bin)
recall = recall_score(y_train_bin, y_train_pred_bin)

# NDCG 
def dcg_score(y_true, y_score, k=10):
    order = np.argsort(y_score)[::-1]  
    y_true_sorted = np.take(y_true, order[:k])
    
    gains = 2 ** y_true_sorted - 1
    discounts = np.log2(np.arange(2, len(y_true_sorted) + 2))
    
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10):
    best_dcg = dcg_score(y_true, y_true, k)  #
    actual_dcg = dcg_score(y_true, y_score, k)
    
    return actual_dcg / best_dcg if best_dcg > 0 else 0

ndcg = ndcg_score(y_train_np, y_train_pred_np)

print(f"R^2 Score: {r2_score:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Precisión: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"NDCG@10: {ndcg:.4f}")

R^2 Score: 0.4629
MAE: 0.1223
MSE: 0.0253
RMSE: 0.1589
Precisión: 0.6534
Recall: 0.9150
NDCG@10: 1.0000


# **FIFTH MODEL**
# Like the fourth one but using the 1M ratings dataset

In [43]:
imdb = pd.read_csv("ml-latest-small/movie_info_imdb.csv", usecols=lambda column: column != "origin_country")
#This is the chang with respect to the previous one. Data is not from the 100k dataset but from the 1M
imdb['imdbId'] = imdb['imdbId'].str.replace('tt', '', regex=False).astype(int)
columns_to_drop = ['Actors', 'Awards', 'DVD', 'Director', 'Genre', 'Title', 'Type', 'Website', 'Year', 'Poster', 'Production', 'Rated', 'Plot', 'Writer', 'Response', 'Ratings']
imdb = imdb.drop(columns=columns_to_drop)

imdb = imdb.astype({
    "BoxOffice": "string", 
    "Country": "string",
    "Language": "string",
    "Metascore": "float",
    "Released": "string",
    "Runtime": "string",
    "imdbRating": "float",
    "imdbVotes": "string"
})

imdb['BoxOffice'] = imdb['BoxOffice'].replace('[\$,]', '', regex=True).astype(float)
imdb['Released'] = imdb['Released'].str.extract(r'(\d{4})').astype(float)
imdb['Runtime'] = imdb['Runtime'].str.extract(r'(\d+)').astype(float)
imdb['imdbVotes'] = imdb['imdbVotes'].str.replace(',', '', regex=True).astype(float)

tmdb = pd.read_csv("ml-latest-small/movie_info_tmdb.csv", usecols=lambda column: column != "origin_country")
tmdb = tmdb.drop(columns=["title", "original_language"])
tmdb["release_date"] = pd.to_datetime(tmdb["release_date"], errors="coerce").dt.year

imdb = imdb.merge(links, on="imdbId", how="left")
imdb = imdb.merge(ratings, on="movieId", how="left")
df2 = imdb.merge(tmdb, on="tmdbId", how="left")

imdb = imdb.dropna()
print(imdb.columns)
print(imdb.describe)
#All the steps from previous models are performed and data from TMDB, IMDB, links and ratings are merged all together

Index(['imdbId', 'BoxOffice', 'Country', 'Language', 'Metascore', 'Released',
       'Runtime', 'imdbRating', 'imdbVotes', 'movieId', 'tmdbId', 'userId',
       'rating', 'timestamp'],
      dtype='object')
<bound method NDFrame.describe of          imdbId    BoxOffice                               Country  \
0        114709  223225679.0                         United States   
1        114709  223225679.0                         United States   
2        114709  223225679.0                         United States   
3        114709  223225679.0                         United States   
4        114709  223225679.0                         United States   
...         ...          ...                                   ...   
100799  4912910  220159104.0  United States, China, France, Norway   
100800  4912910  220159104.0  United States, China, France, Norway   
100801  7690670   20545116.0                         United States   
100803  7349662   49275340.0                  United States

In [44]:
df2 = imdb
df2 = pd.get_dummies(df2, columns=["Country"], dtype=float)

df2["Language"] = df2["Language"].str.split(",")
moviesExploded = df2.explode("Language")
moviesExploded["Language"] = moviesExploded["Language"].str.strip()
movies_dummies = pd.get_dummies(moviesExploded["Language"], dtype=int)
movies_dummies = moviesExploded[["movieId"]].join(movies_dummies).groupby("movieId").max()
df2 = df2.drop(columns=["Language"]).merge(movies_dummies, on="movieId")


columns_to_convert = [col for col in df2.columns]
df2[columns_to_convert] = df2[columns_to_convert].apply(pd.to_numeric, errors="coerce")

df2 = df2.dropna()

print(df2.columns)
print(df2.head)

Index(['imdbId', 'BoxOffice', 'Metascore', 'Released', 'Runtime', 'imdbRating',
       'imdbVotes', 'movieId', 'tmdbId', 'userId',
       ...
       'Ungwatsi', 'Urdu', 'Vietnamese', 'Washoe', 'Welsh', 'Wolof', 'Xhosa',
       'Yiddish', 'Yoruba', 'Zulu'],
      dtype='object', length=1088)
<bound method NDFrame.head of         imdbId    BoxOffice  Metascore  Released  Runtime  imdbRating  \
0       114709  223225679.0       96.0    1995.0     81.0         8.3   
1       114709  223225679.0       96.0    1995.0     81.0         8.3   
2       114709  223225679.0       96.0    1995.0     81.0         8.3   
3       114709  223225679.0       96.0    1995.0     81.0         8.3   
4       114709  223225679.0       96.0    1995.0     81.0         8.3   
...        ...          ...        ...       ...      ...         ...   
91289  4912910  220159104.0       87.0    2018.0    147.0         7.7   
91290  4912910  220159104.0       87.0    2018.0    147.0         7.7   
91291  7690670   2054

In [45]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import torch
import numpy as np

df_shuffle = df2.sample(frac=1, random_state=123).drop(columns=["timestamp"])
df_shuffle = df_shuffle.dropna()
#Feature timestamp is removed and NAs are removed

df_train = df_shuffle.iloc[:int(len(df_shuffle) * 0.6), :]
df_val = df_shuffle.iloc[int(len(df_shuffle) * 0.6):int(len(df_shuffle) * 0.8), :]
df_test = df_shuffle.iloc[int(len(df_shuffle) * 0.8):, :]
#Data is splited into training, validation and test sets(60-20-20)

scalers = {}

feature_cols = [col for col in df_shuffle.columns if col != "rating"]
x_train, y_train = df_train[feature_cols].to_numpy(dtype=np.float32), df_train["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_val, y_val = df_val[feature_cols].to_numpy(dtype=np.float32), df_val["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
x_test, y_test = df_test[feature_cols].to_numpy(dtype=np.float32), df_test["rating"].to_numpy(dtype=np.float32).reshape(-1, 1)
#Features transformed into numerical and  divided into input and output/target variable

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

x_train, y_train = torch.tensor(x_train), torch.tensor(y_train).float()
x_val, y_val = torch.tensor(x_val), torch.tensor(y_val).float()
x_test, y_test = torch.tensor(x_test), torch.tensor(y_test).float()

(54776, 1086) (54776, 1)
(18259, 1086) (18259, 1)
(18259, 1086) (18259, 1)


In [46]:
import pandas as pd
#Here, I have created for each set (train, validation and testing) new 6 variables. For training and validation, the data related to the average of the 
#rating is taken from the set itself, but the columns of the testing set can not be created from the average of the rating of the training set (would be considered data leakage)
#So the solution is using the average from the data of the training set
#NAs are also removed
print(df2.columns)

avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_train = df_train.merge(avg_movie_rating, on="movieId", how="left")
df_train = df_train.merge(count_movie_rating, on="movieId", how="left")
df_train = df_train.merge(std_movie_rating, on="movieId", how="left")

df_train = df_train.merge(avg_user_rating, on="userId", how="left")
df_train = df_train.merge(count_user_rating, on="userId", how="left")
df_train = df_train.merge(std_user_rating, on="userId", how="left")

df_train["count_movie_rating"] = df_train["count_movie_rating"].fillna(0)
df_train["count_user_rating"] = df_train["count_user_rating"].fillna(0)
df_train["avg_movie_rating"] = df_train["avg_movie_rating"].fillna(0)
df_train["avg_user_rating"] = df_train["avg_user_rating"].fillna(0)
df_train["std_movie_rating"] = df_train["std_movie_rating"].fillna(0)
df_train["std_user_rating"] = df_train["std_user_rating"].fillna(0)



avg_movie_rating = df_val.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_val.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_val.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_val.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_val.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_val.groupby("userId")["rating"].std().rename("std_user_rating")

df_val = df_val.merge(avg_movie_rating, on="movieId", how="left")
df_val = df_val.merge(count_movie_rating, on="movieId", how="left")
df_val = df_val.merge(std_movie_rating, on="movieId", how="left")

df_val = df_val.merge(avg_user_rating, on="userId", how="left")
df_val = df_val.merge(count_user_rating, on="userId", how="left")
df_val = df_val.merge(std_user_rating, on="userId", how="left")

df_val["count_movie_rating"] = df_val["count_movie_rating"].fillna(0)
df_val["count_user_rating"] = df_val["count_user_rating"].fillna(0)
df_val["avg_movie_rating"] = df_val["avg_movie_rating"].fillna(0)
df_val["avg_user_rating"] = df_val["avg_user_rating"].fillna(0)
df_val["std_movie_rating"] = df_val["std_movie_rating"].fillna(0)
df_val["std_user_rating"] = df_val["std_user_rating"].fillna(0)



avg_movie_rating = df_train.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
count_movie_rating = df_train.groupby("movieId")["rating"].count().rename("count_movie_rating")
std_movie_rating = df_train.groupby("movieId")["rating"].std().rename("std_movie_rating")

avg_user_rating = df_train.groupby("userId")["rating"].mean().rename("avg_user_rating")
count_user_rating = df_train.groupby("userId")["rating"].count().rename("count_user_rating")
std_user_rating = df_train.groupby("userId")["rating"].std().rename("std_user_rating")

df_test = df_test.merge(avg_movie_rating, on="movieId", how="left")
df_test = df_test.merge(count_movie_rating, on="movieId", how="left")
df_test = df_test.merge(std_movie_rating, on="movieId", how="left")

df_test = df_test.merge(avg_user_rating, on="userId", how="left")
df_test = df_test.merge(count_user_rating, on="userId", how="left")
df_test = df_test.merge(std_user_rating, on="userId", how="left")

df_test["count_movie_rating"] = df_test["count_movie_rating"].fillna(0)
df_test["count_user_rating"] = df_test["count_user_rating"].fillna(0)
df_test["avg_movie_rating"] = df_test["avg_movie_rating"].fillna(0)
df_test["avg_user_rating"] = df_test["avg_user_rating"].fillna(0)
df_test["std_movie_rating"] = df_test["std_movie_rating"].fillna(0)
df_test["std_user_rating"] = df_test["std_user_rating"].fillna(0)


print(np.isnan(df_train).any())
print(df2.head())

Index(['imdbId', 'BoxOffice', 'Metascore', 'Released', 'Runtime', 'imdbRating',
       'imdbVotes', 'movieId', 'tmdbId', 'userId',
       ...
       'Ungwatsi', 'Urdu', 'Vietnamese', 'Washoe', 'Welsh', 'Wolof', 'Xhosa',
       'Yiddish', 'Yoruba', 'Zulu'],
      dtype='object', length=1088)
imdbId                False
BoxOffice             False
Metascore             False
Released              False
Runtime               False
                      ...  
count_movie_rating    False
std_movie_rating      False
avg_user_rating       False
count_user_rating     False
std_user_rating       False
Length: 1093, dtype: bool
   imdbId    BoxOffice  Metascore  Released  Runtime  imdbRating  imdbVotes  \
0  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
1  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
2  114709  223225679.0       96.0    1995.0     81.0         8.3  1112586.0   
3  114709  223225679.0       96.0    1995.0     81.0      

In [47]:
scaler = MinMaxScaler()

numerical_col = df_train.select_dtypes(include=['number']).columns
#Only numerical columns are normalized

df_train[numerical_col] = scaler.fit_transform(df_train[numerical_col])
df_val[numerical_col] = scaler.transform(df_val[numerical_col])  # Usar transform en validación
df_test[numerical_col] = scaler.transform(df_test[numerical_col]) 

print(df_train.shape)

(54776, 1093)


In [48]:
import torch
from torch.utils.data import TensorDataset, DataLoader

feature_cols = [col for col in df_train.columns if col != "rating"]
print(feature_cols)

x_train = torch.tensor(df_train[feature_cols].values, dtype=torch.float32)
y_train = torch.tensor(df_train["rating"].values, dtype=torch.float32).unsqueeze(1)

x_val = torch.tensor(df_val[feature_cols].values, dtype=torch.float32)
y_val = torch.tensor(df_val["rating"].values, dtype=torch.float32).unsqueeze(1)

x_test = torch.tensor(df_test[feature_cols].values, dtype=torch.float32)
y_test = torch.tensor(df_test["rating"].values, dtype=torch.float32).unsqueeze(1)

torch.manual_seed(1234)

train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=1024, shuffle=False)
#Tensors are transformed into TensorDatasets and then into DataLoaders

['imdbId', 'BoxOffice', 'Metascore', 'Released', 'Runtime', 'imdbRating', 'imdbVotes', 'movieId', 'tmdbId', 'userId', 'Country_Afghanistan, Ireland, Japan, Netherlands, Iran', 'Country_Argentina', 'Country_Argentina, Spain', 'Country_Argentina, Spain, France, United Kingdom', 'Country_Argentina, Spain, Germany', 'Country_Argentina, United States', 'Country_Aruba, Hong Kong, United States', 'Country_Australia', 'Country_Australia, Canada', 'Country_Australia, China, Germany, United States', 'Country_Australia, France', 'Country_Australia, Germany', 'Country_Australia, Germany, United States', 'Country_Australia, South Africa', 'Country_Australia, United Kingdom', 'Country_Australia, United Kingdom, France', 'Country_Australia, United Kingdom, United States', 'Country_Australia, United Kingdom, United States, France', 'Country_Australia, United States', 'Country_Australia, United States, South Korea, Taiwan, United Kingdom, France, Canada, Thailand, Denmark', 'Country_Australia, United S

In [49]:
import torch.nn as nn
import torch.optim as optim
import torch

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(1092, 2048),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 16),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.model(x)
device = "cuda" 
model = NeuralNetwork().to(device)  #To change to the GPU

lossFunction = torch.nn.HuberLoss() 
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.001)
#The neural network is defined

In [50]:
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# TRAINING FUNCTION
def train_loop(dataloader, model, lossFunction, optimizer):
    train_size = len(dataloader.dataset)    
    nbatches = len(dataloader)  

    model.train()
    loss_train = 0  
    all_preds = []
    all_targets = []

    for nbatch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        logits = model(X)
        
        loss = lossFunction(logits, y)
        loss.backward()   
        optimizer.step()  
        optimizer.zero_grad()

        loss_train += loss.item()

        all_preds.extend(logits.detach().cpu().numpy())  
        all_targets.extend(y.cpu().numpy())  


    avg_loss = loss_train / nbatches

    
    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'TRAINING -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')


# VALIDATION FUNCTION
def val_loop(dataloader, model, lossFunction):
    val_size = len(dataloader.dataset)
    nbatches = len(dataloader)

    model.eval()

    loss_val = 0
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)

            loss_val += lossFunction(logits, y).item()
            
            all_preds.extend(logits.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    avg_loss = loss_val / nbatches

    all_preds = np.array(all_preds)
    all_targets = np.array(all_targets)

    mse = mean_squared_error(all_targets, all_preds)
    rmse = np.sqrt(mse)
    r2 = r2_score(all_targets, all_preds)

    print(f'VALIDATION -> Loss: {avg_loss:.6f}, MSE: {mse:.6f}, RMSE: {rmse:.6f}, R²: {r2:.6f}')

In [None]:
#The model is trained and validated for 100 epochs
for i in range(100): 
    print(f"Iteration {i+1}/100 \n-----------------------------")
    train_loop(train_loader, model, lossFunction, optimizer)
    val_loop(val_loader, model, lossFunction)

Iteration 1/100 
-----------------------------
TRAINING -> Loss: 0.043108, MSE: 0.086530, RMSE: 0.294160, R²: -0.840525
VALIDATION -> Loss: 0.018207, MSE: 0.036432, RMSE: 0.190873, R²: 0.227559
Iteration 2/100 
-----------------------------
TRAINING -> Loss: 0.023532, MSE: 0.047078, RMSE: 0.216974, R²: -0.001357
VALIDATION -> Loss: 0.015604, MSE: 0.031225, RMSE: 0.176707, R²: 0.337962
Iteration 3/100 
-----------------------------
TRAINING -> Loss: 0.020512, MSE: 0.041026, RMSE: 0.202549, R²: 0.127362
VALIDATION -> Loss: 0.014633, MSE: 0.029281, RMSE: 0.171118, R²: 0.379179
Iteration 4/100 
-----------------------------
TRAINING -> Loss: 0.019468, MSE: 0.038945, RMSE: 0.197344, R²: 0.171634
VALIDATION -> Loss: 0.014468, MSE: 0.028953, RMSE: 0.170155, R²: 0.386144
Iteration 5/100 
-----------------------------
TRAINING -> Loss: 0.018755, MSE: 0.037507, RMSE: 0.193668, R²: 0.202207
VALIDATION -> Loss: 0.013411, MSE: 0.026832, RMSE: 0.163805, R²: 0.431103
Iteration 6/100 
----------------

In [52]:
#And display the evaluation metrics of the model
import torch
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import binarize
from scipy.stats import rankdata

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x_train = x_train.to(device)
y_train = y_train.to(device)
model.to(device)
with torch.no_grad():
    y_train_pred = model(x_train)
y_train_np = y_train.cpu().numpy()
y_train_pred_np = y_train_pred.cpu().numpy()

# --- METRICS ---

# R^2 Score
ss_total = np.sum((y_train_np - np.mean(y_train_np)) ** 2)
ss_residual = np.sum((y_train_np - y_train_pred_np) ** 2)
r2_score = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0

# MAE
mae = np.mean(np.abs(y_train_np - y_train_pred_np))

# MSE
mse = np.mean((y_train_np - y_train_pred_np) ** 2)

# RMSE
rmse = np.sqrt(mse)

# ACCURACY AND RECALL
threshold = np.median(y_train_np)  

y_train_bin = binarize(y_train_np.reshape(-1, 1), threshold=threshold).flatten()
y_train_pred_bin = binarize(y_train_pred_np.reshape(-1, 1), threshold=threshold).flatten()

precision = precision_score(y_train_bin, y_train_pred_bin)
recall = recall_score(y_train_bin, y_train_pred_bin)

# NDCG 
def dcg_score(y_true, y_score, k=10):
    order = np.argsort(y_score)[::-1]  
    y_true_sorted = np.take(y_true, order[:k])
    
    gains = 2 ** y_true_sorted - 1
    discounts = np.log2(np.arange(2, len(y_true_sorted) + 2))
    
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10):
    best_dcg = dcg_score(y_true, y_true, k)  #
    actual_dcg = dcg_score(y_true, y_score, k)
    
    return actual_dcg / best_dcg if best_dcg > 0 else 0

ndcg = ndcg_score(y_train_np, y_train_pred_np)

print(f"R^2 Score: {r2_score:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Precisión: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"NDCG@10: {ndcg:.4f}")

R^2 Score: 0.5611
MAE: 0.1054
MSE: 0.0206
RMSE: 0.1436
Precisión: 0.6502
Recall: 0.9453
NDCG@10: 1.0000


In [53]:
#This is the function I have created to, by giving a userId in the arguments, return the recommended movies for that user
def recommend_movies(user_id, model, df2, top_n=10, device="cuda"):

    avg_movie_rating = df2.groupby("movieId")["rating"].mean().rename("avg_movie_rating")
    count_movie_rating = df2.groupby("movieId")["rating"].count().rename("count_movie_rating")
    std_movie_rating = df2.groupby("movieId")["rating"].std().rename("std_movie_rating")
    avg_user_rating = df2.groupby("userId")["rating"].mean().rename("avg_user_rating")
    count_user_rating = df2.groupby("userId")["rating"].count().rename("count_user_rating")
    std_user_rating = df2.groupby("userId")["rating"].std().rename("std_user_rating")  

    movies_seen = df2[df2["userId"] == user_id]["movieId"].unique()
    movies_seen = df2[df2["userId"] == user_id]["movieId"].unique()
    unseen_movies_df = df2[~df2["movieId"].isin(movies_seen)]
    unseen_movies_df = unseen_movies_df.drop_duplicates(subset="movieId")   

    unseen_movies_df = unseen_movies_df.merge(avg_movie_rating, on="movieId", how="left")
    unseen_movies_df = unseen_movies_df.merge(count_movie_rating, on="movieId", how="left")
    unseen_movies_df = unseen_movies_df.merge(std_movie_rating, on="movieId", how="left")

    unseen_movies_df = unseen_movies_df.merge(avg_user_rating, on="userId", how="left")
    unseen_movies_df = unseen_movies_df.merge(count_user_rating, on="userId", how="left")
    unseen_movies_df = unseen_movies_df.merge(std_user_rating, on="userId", how="left")

    df_copy = unseen_movies_df.copy()

    feature_cols = [col for col in unseen_movies_df.columns if col not in ["rating", "timestamp"]]
    scaler = MinMaxScaler()

    unseen_movies_df[feature_cols] = scaler.fit_transform(unseen_movies_df[feature_cols])

    x_unseen = torch.tensor(unseen_movies_df[feature_cols].to_numpy(dtype=np.float32)).to(device)

    model.eval()
    with torch.no_grad():
        predicted_ratings = model(x_unseen).squeeze() 

    unseen_movies_df["predicted_rating"] = predicted_ratings.cpu().numpy()

    top_unseen = unseen_movies_df.sort_values(by="predicted_rating", ascending=False).head(10)

    df_copy['movieIdNormalized'] = scaler.fit_transform(df_copy[['movieId']])

    filtered_df = df_copy[df_copy['movieIdNormalized'].isin(top_unseen['movieId'])]
    merged_df = filtered_df.merge(movies, on='movieId', how='left')
    result = merged_df[['movieId', 'title']]
    print(result)

print("The films recommended for that user are: ")
print(recommend_movies( 1, model, df2, top_n=10, device="cuda"))


The films recommended for that user are: 
    movieId                                              title
0    1596.0                                Career Girls (1997)
1    2151.0                  Gods Must Be Crazy II, The (1989)
2    2436.0                          Tea with Mussolini (1999)
3    3241.0                           Cup, The (Phörpa) (1999)
4    3266.0  Man Bites Dog (C'est arrivé près de chez vous)...
5    3855.0  Affair of Love, An (Liaison pornographique, Un...
6    4021.0                          Before Night Falls (2000)
7    5135.0                             Monsoon Wedding (2001)
8  110130.0                                Nut Job, The (2014)
9  122912.0             Avengers: Infinity War - Part I (2018)
None


  unseen_movies_df["predicted_rating"] = predicted_ratings.cpu().numpy()
