# Evaluating the accuracy of two different recommendation strategies

The main objective of this project is to develop, implement, and evaluate multiple recommendation models based on collaborative filtering. In particular, the work focuses on comparing traditional methods, such as item-kNN and matrix factorization, with a more advanced neural approach, MultiVAE. The aim is to understand how modeling choices, data preprocessing strategies, and evaluation metrics affect recommendation quality, and to identify the method that performs best on the given dataset.

Importing the necessary libraries:

In [16]:
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats.mstats import winsorize



## DATASET

From the Steam dataset, I use the following files and columns:
- train_interactions.csv and test_interactions_in.csv
    - user_id
    - item_id

In [17]:
dataset_foldername = "~/OneDrive - Università degli Studi di Milano-Bicocca/Magistrale/AI/cleaned_datasets_students"

In [18]:
train_interactions = pd.read_csv(f"{dataset_foldername}/train_interactions.csv")
games = pd.read_csv(f"{dataset_foldername}/games.csv")
test_interactions_in = pd.read_csv(f"{dataset_foldername}/test_interactions_in.csv")

## Pre-processing of the interaction files

In [19]:
train_filtered = train_interactions.drop_duplicates(["user_id", "item_id"])

# Keep users with >= 5 interactions
train_filtered = train_filtered.groupby("user_id").filter(lambda x: len(x) >= 5)

# Keep items with >= 2 interactions (counted after the user filter)
item_freq = train_filtered.groupby("item_id").size()
valid_items = item_freq[item_freq >= 2].index
# .copy() avoids a SettingWithCopyWarning when the "split" column is added below
train_filtered = train_filtered[train_filtered["item_id"].isin(valid_items)].copy()


train_filtered["split"] = "train"
test_interactions_in["split"] = "test"

all_interactions = pd.concat([train_filtered, test_interactions_in], ignore_index=True)


all_interactions = all_interactions.rename(
    columns={
        "user_id": "old_user_id",
        "item_id": "old_item_id"
    }
)
# --- USERS ---
user_id_mapping = {old_id: new_id for new_id, old_id in enumerate(all_interactions['old_user_id'].unique())}
all_interactions['user_id'] = all_interactions['old_user_id'].map(user_id_mapping)
new_to_old_user_id_mapping = {v: k for k, v in user_id_mapping.items()}

# --- ITEMS ---
item_id_mapping = {old_id: new_id for new_id, old_id in enumerate(all_interactions['old_item_id'].unique())}
all_interactions['item_id'] = all_interactions['old_item_id'].map(item_id_mapping)
new_to_old_item_id_mapping = {v: k for k, v in item_id_mapping.items()}


In [20]:
test_mapped  = all_interactions[all_interactions["split"] == "test"].copy()
train_mapped = all_interactions[all_interactions["split"] == "train"].copy()

item_freq = train_mapped.groupby("item_id").size()
valid_items = set(item_freq[item_freq >= 2].index)

train_mapped = train_mapped[train_mapped["item_id"].isin(valid_items)]
test_mapped = test_mapped[test_mapped["item_id"].isin(valid_items)]


In [21]:
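import scipy.sparse as sp

num_users = all_interactions["user_id"].nunique()
# Train rows come first in the concat above, so train items receive the lowest
# ids and the valid item ids are expected to be 0..len(valid_items)-1.
num_items = len(valid_items)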

X_train_binary = sp.csr_matrix(
    (np.ones(len(train_mapped)),
     (train_mapped["user_id"].values, train_mapped["item_id"].values)),
    shape=(num_users, num_items)
)


X_test_in_binary = sp.csr_matrix(
    (np.ones(len(test_mapped)),
     (test_mapped["user_id"].values, test_mapped["item_id"].values)),
    shape=(num_users, num_items)
)



## Recommendation

### ITEM-KNN
Item-kNN is a neighborhood-based collaborative filtering method that recommends items by measuring the similarity between items. The underlying idea is that items consumed by similar groups of users are likely to be related. To generate recommendations, item-kNN computes item-item similarity scores (commonly cosine similarity) from user interaction patterns. For a given user, the model looks up the items most similar to those the user has already interacted with and ranks the candidates by their aggregated similarity.
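
The scoring step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it is not the code in `Components.item_knn`, the function names `item_knn_scores_sketch` and `top_n` are hypothetical, and it uses dense arrays for readability, whereas the notebook works with sparse matrices.

```python
import numpy as np

def item_knn_scores_sketch(X_train, X_in, k):
    """Score items for each user with a cosine item-kNN (illustrative only).

    X_train: (n_train_users, n_items) binary interaction matrix.
    X_in:    (n_eval_users, n_items) binary histories of the users to score.
    k:       number of neighbours kept for each item.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_in = np.asarray(X_in, dtype=float)

    # Cosine similarity between item columns.
    norms = np.linalg.norm(X_train, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items without interactions
    sim = (X_train.T @ X_train) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item is not its own neighbour

    # Keep only the k largest similarities in each row (the item's neighbours).
    if k < sim.shape[1]:
        drop = np.argpartition(sim, -k, axis=1)[:, :-k]
        np.put_along_axis(sim, drop, 0.0, axis=1)

    # Each item in a user's history votes for its neighbours.
    scores = X_in @ sim
    scores[X_in > 0] = -np.inf  # never recommend items the user already has
    return scores

def top_n(scores, n):
    """Return the indices of the n highest-scoring items for each user."""
    return np.argsort(-scores, axis=1)[:, :n]
```

Keeping only the top-k neighbours per item limits the influence of weak, noisy similarities; the value of k is a hyperparameter to tune.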

### MultiVAE
MultiVAE (Multinomial Variational Autoencoder) is a deep generative model for collaborative filtering that uses variational inference to learn latent representations of users. Unlike traditional methods, MultiVAE models each user's interaction vector with a multinomial likelihood and uses an encoder-decoder architecture to reconstruct user preferences, while a KL divergence term regularizes the latent space. The encoder maps each user's interaction history to a distribution over a continuous latent vector, and the decoder turns a latent vector into a probability distribution over all items. This allows MultiVAE to capture nonlinear patterns in the interaction data that similarity-based models cannot represent.
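
To make the objective concrete, here is a minimal NumPy sketch of the loss described above: a multinomial negative log-likelihood plus a KL term weighted by `beta`. It follows the standard formulation and is an assumption about what `model.loss_function` in `Components.multiVAE` computes; that module is not shown here, and the name `multivae_loss_sketch` and the meaning of the three returned values are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the item axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def multivae_loss_sketch(logits, x, mu, logvar, beta):
    """Batch-averaged negative ELBO with a multinomial likelihood.

    logits: (batch, n_items) decoder outputs before the softmax.
    x:      (batch, n_items) binary interaction vectors.
    mu, logvar: (batch, latent_dim) parameters of the approximate posterior.
    beta:   weight of the KL term (annealed during training).
    """
    # Multinomial log-likelihood: log-probabilities of the items the user interacted with.
    neg_ll = -(log_softmax(logits) * x).sum(axis=1).mean()
    # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = (-0.5 * (1.0 + logvar - mu ** 2 - np.exp(logvar)).sum(axis=1)).mean()
    return neg_ll + beta * kl, neg_ll, kl
```

With `beta = 0` the objective reduces to plain reconstruction; larger values pull the latent codes toward the prior.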

#### ITEM-KNN

In [22]:

import torch
from Components.my_cosine_similarity import my_cosine_similarity
from Components.item_knn import item_knn_scores, scores2recommendations, save_user_item

scores = item_knn_scores(X_train_binary, X_test_in_binary, 50)
df_recos = scores2recommendations(scores, X_test_in_binary, 20)
df_recos["user_id"] = df_recos["user_id"].map(new_to_old_user_id_mapping)
df_recos["item_id"] = df_recos["item_id"].map(new_to_old_item_id_mapping)

save_user_item(df_recos, "submission_itemknn.csv")



User-item recommendation file saved to submission_itemknn.csv


In [23]:
import importlib
import Components.generate_recommendations as gr
importlib.reload(gr)


<module 'Components.generate_recommendations' from 'c:\\Users\\matte\\Desktop\\AIProject\\Components\\generate_recommendations.py'>

#### MultiVAE-train

In [24]:
import numpy as np
import scipy.sparse as sp
import torch
from Components.multiVAE import MultiVAE

# ============================================================
# INITIALIZE MODEL
# ============================================================
n_items = X_train_binary.shape[1]
train_user_ids = train_mapped["user_id"].unique()
n_users_train = len(train_user_ids)
X_train_dense = torch.FloatTensor(X_train_binary.toarray())

# row_sums = X_train_dense.sum(1, keepdim=True)
# X_train_dense = X_train_dense / torch.clamp(row_sums, min=1.0)

p_dims = [600, 200, n_items]
model = MultiVAE(p_dims, dropout=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ============================================================
# TRAINING LOOP
# ============================================================

epochs = 30
batch_size = 2000

total_anneal_steps = 200000   # recommended by the paper
anneal_cap = 1.0              # max value for beta
update_count = 0              # global step counter

for epoch in range(epochs):
    perm = torch.randperm(n_users_train)
    epoch_loss = 0.0

    for start in range(0, n_users_train, batch_size):
        end = start + batch_size
        batch_idx = perm[start:end]
        batch = X_train_dense[batch_idx]

        # ===== KL annealing =====
        if total_anneal_steps > 0:
            beta = min(anneal_cap, update_count / total_anneal_steps)
        else:
            beta = anneal_cap

        logits, mu, logvar = model(batch)
        loss, _, _ = model.loss_function(logits, batch, mu, logvar, beta)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        update_count += 1

    print(f"Epoch {epoch+1}/{epochs} - Loss: {epoch_loss:.4f}")


Epoch 1/30 - Loss: 9184.0706
Epoch 2/30 - Loss: 8090.0362
Epoch 3/30 - Loss: 8100.1107
Epoch 4/30 - Loss: 8053.6545
Epoch 5/30 - Loss: 8052.5176
Epoch 6/30 - Loss: 8060.7126
Epoch 7/30 - Loss: 8020.0146
Epoch 8/30 - Loss: 7963.2427
Epoch 9/30 - Loss: 7941.4982
Epoch 10/30 - Loss: 7919.0898
Epoch 11/30 - Loss: 7908.7787
Epoch 12/30 - Loss: 7882.6528
Epoch 13/30 - Loss: 7882.7292
Epoch 14/30 - Loss: 7842.0348
Epoch 15/30 - Loss: 7794.2228
Epoch 16/30 - Loss: 7795.9257
Epoch 17/30 - Loss: 7798.6572
Epoch 18/30 - Loss: 7793.8939
Epoch 19/30 - Loss: 7773.3136
Epoch 20/30 - Loss: 7784.3473
Epoch 21/30 - Loss: 7777.4090
Epoch 22/30 - Loss: 7799.2559
Epoch 23/30 - Loss: 7776.3572
Epoch 24/30 - Loss: 7721.0755
Epoch 25/30 - Loss: 7698.5770
Epoch 26/30 - Loss: 7686.6870
Epoch 27/30 - Loss: 7701.5273
Epoch 28/30 - Loss: 7665.9412
Epoch 29/30 - Loss: 7671.2166
Epoch 30/30 - Loss: 7661.5078
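
The KL annealing rule used inside the training loop can be isolated to check how far `beta` actually rises during a run. The sketch below mirrors the linear schedule in the loop; the helper names `kl_beta` and `final_beta` are hypothetical, and the user count in the usage note is an illustrative assumption, not a property of the Steam dataset.

```python
def kl_beta(update_count, total_anneal_steps, anneal_cap):
    """Linear KL annealing: beta grows from 0 to anneal_cap over total_anneal_steps updates."""
    if total_anneal_steps > 0:
        return min(anneal_cap, update_count / total_anneal_steps)
    return anneal_cap

def final_beta(n_users, batch_size, epochs, total_anneal_steps, anneal_cap):
    """Beta seen by the last batch of a run with the given sizes."""
    steps_per_epoch = -(-n_users // batch_size)  # ceiling division
    last_update = steps_per_epoch * epochs - 1   # update_count when the last batch is processed
    return kl_beta(last_update, total_anneal_steps, anneal_cap)
```

For example, assuming a hypothetical 100,000 training users, `final_beta(100_000, 2000, 30, 200_000, 1.0)` covers only 1,500 updates, so `beta` stays far below `anneal_cap` and the KL term contributes little. Whether this applies to the run above depends on the real number of training users.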



#### MultiVAE-recommendation

In [26]:
from Components.generate_recommendations import multivae_recommend, save_submission

test_users = np.sort(test_mapped["user_id"].unique())
n_test_users = len(test_users)

user_to_row = {u: i for i, u in enumerate(test_users)}
index_to_user = {i: u for i, u in enumerate(test_users)}

rows = test_mapped["user_id"].map(user_to_row).values
cols = test_mapped["item_id"].values
data = np.ones(len(test_mapped))

X_test_in_binaryMV = sp.csr_matrix(
    (data, (rows, cols)),
    shape=(n_test_users, num_items)
)

X_dense_test_in = torch.FloatTensor(X_test_in_binaryMV.toarray())

row_sums_test = X_dense_test_in.sum(1, keepdim=True)
X_dense_test_in = X_dense_test_in / torch.clamp(row_sums_test, min=1.0)

# Items each test user has already interacted with, keyed by matrix row index
known_items = {}

# Interactions from the training split (only for users that also appear in the test split)
for row in train_mapped.itertuples():
    u = row.user_id
    if u in user_to_row:
        known_items.setdefault(user_to_row[u], set()).add(row.item_id)

# Interactions from the test fold-in split
for row in test_mapped.itertuples():
    known_items.setdefault(user_to_row[row.user_id], set()).add(row.item_id)

# Convert the sets to lists
known_items = {k: list(v) for k, v in known_items.items()}

rec_df = multivae_recommend(
    model=model,
    X_dense_test_in=X_dense_test_in,
    index_to_user=index_to_user,
    known_items=known_items,
    top_k=20
)


# Map the internal ids back to the original user and item ids
rec_df["user_id"] = rec_df["user_id"].map(new_to_old_user_id_mapping)
rec_df["item_id"] = rec_df["item_id"].map(new_to_old_item_id_mapping)

In [27]:
save_submission(rec_df, "submission_multivae.csv")
print("MultiVAE recommendations saved to submission_multivae.csv")

File saved to submission_multivae.csv
MultiVAE recommendations saved to submission_multivae.csv
