# Machine Learning for Recommendations
Autoencoders and Neural Networks for Collaborative Filtering

**Dataset:** MovieLens 100K

This notebook implements:
- An autoencoder that reconstructs user rating vectors (masked loss)
- A Neural Collaborative Filtering (MLP with embeddings)
- Evaluation: RMSE and Precision@K


In [1]:
# Install required packages (run once if needed)
!pip install numpy pandas scikit-learn matplotlib tensorflow requests




In [2]:
import os, zipfile, requests
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, models, losses, optimizers, callbacks
print('TensorFlow version:', tf.__version__)

TensorFlow version: 2.19.0


## 1) Download MovieLens 100K and load ratings and item metadata

In [3]:
# download dataset if not present
if not os.path.exists('ml-100k'):
    url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
    r = requests.get(url)
    open('ml-100k.zip','wb').write(r.content)
    with zipfile.ZipFile('ml-100k.zip','r') as z:
        z.extractall()
    print('Downloaded and extracted MovieLens 100K')

# load u.data
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id','item_id','rating','timestamp'], encoding='latin-1')
# load u.item for titles (optional)
items_cols = ['item_id','title','release_date','video_release_date','IMDb_URL'] + [f'genre_{i}' for i in range(19)]
items = pd.read_csv('ml-100k/u.item', sep='|', names=items_cols, usecols=range(5+19), encoding='latin-1')
items['item_id'] = items['item_id'].astype(int)

print('Ratings shape:', ratings.shape)
ratings.head()

Downloaded and extracted MovieLens 100K
Ratings shape: (100000, 4)


Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


## 2) Build user–item matrix and create train/test masks

In [4]:
n_users = ratings['user_id'].nunique()
n_items = ratings['item_id'].nunique()
print('Users:', n_users, 'Items:', n_items)  # expected 943,1682

# map ids to contiguous indices
user_map = {uid: i for i, uid in enumerate(sorted(ratings['user_id'].unique()))}
item_map = {iid: i for i, iid in enumerate(sorted(ratings['item_id'].unique()))}
ratings['user_idx'] = ratings['user_id'].map(user_map)
ratings['item_idx'] = ratings['item_id'].map(item_map)

# build user-item matrix with zeros for missing
R = np.zeros((n_users, n_items), dtype=np.float32)
for r in ratings.itertuples():
    R[r.user_idx, r.item_idx] = r.rating

print('R shape:', R.shape)

Users: 943 Items: 1682
R shape: (943, 1682)


### Create train/test split by masking 20% of each user's ratings (per-user holdout)

In [5]:
# create train matrix and test mask
rng = np.random.default_rng(42)
train_R = R.copy()
test_mask = np.zeros_like(R, dtype=bool)

for u in range(n_users):
    rated_items = np.where(R[u] > 0)[0]
    if len(rated_items) < 5:
        continue
    test_count = max(1, int(0.2 * len(rated_items)))
    test_items = rng.choice(rated_items, size=test_count, replace=False)
    train_R[u, test_items] = 0.0  # hide in train
    test_mask[u, test_items] = True

# verify
print('Total ratings:', (R>0).sum(), 'Train ratings:', (train_R>0).sum(), 'Test ratings:', test_mask.sum())

Total ratings: 100000 Train ratings: 80367 Test ratings: 19633


## 3) Autoencoder model (Keras)
We'll build a simple dense autoencoder that takes a user vector (num_items) and reconstructs it. Loss is masked MSE computed only over observed training entries.

In [7]:
num_items = n_items
input_dim = num_items
latent_dim = 64

# Build model
inputs = layers.Input(shape=(input_dim,), name='ratings_input')
# mask_input is not used in the model definition, it's used in the custom training loop.
# mask_input = layers.Input(shape=(input_dim,), name='mask_input')  # mask to indicate observed entries
x = layers.Dense(512, activation='relu')(inputs)
x = layers.Dense(256, activation='relu')(x)
latent = layers.Dense(latent_dim, activation='relu', name='latent')(x)
x = layers.Dense(256, activation='relu')(latent)
x = layers.Dense(512, activation='relu')(x)
outputs = layers.Dense(input_dim, activation='linear')(x)

# autoencoder = models.Model(inputs=[inputs, mask_input], outputs=outputs, name='autoencoder')
autoencoder = models.Model(inputs=inputs, outputs=outputs, name='autoencoder')


# custom loss: masked MSE (mask==1 means observed entry)
def masked_mse(y_true, y_pred):
    # y_true contains original ratings, but we'll also supply mask as last concatenated part?
    # Instead, use mask_input via model.add_loss in training step; here define a dummy loss
    return tf.reduce_mean(tf.square(y_true - y_pred))

# We'll compile with a placeholder loss and use a custom training loop to apply mask
autoencoder.compile(optimizer=optimizers.Adam(learning_rate=1e-3), loss='mse')
autoencoder.summary()

In [9]:
# Prepare training data: input is train_R, mask indicates observed entries in train_R
X_train = train_R.copy()
mask_train = (train_R > 0).astype(np.float32)

# Normalize ratings to 0-1 range? We keep original scale (1-5), model will predict raw ratings.
# Train with sample-weighting per-element: Keras fit supports sample_weight per sample, not per element.
# We'll train using a custom training loop to compute masked loss.
batch_size = 32
epochs = 20

# Convert to tf datasets
dataset = tf.data.Dataset.from_tensor_slices((X_train, mask_train)).shuffle(buffer_size=n_users, seed=42).batch(batch_size)

optimizer = optimizers.Adam(1e-3)

# Training loop
train_losses = []
for epoch in range(epochs):
    epoch_loss = 0.0
    steps = 0
    for batch_x, batch_mask in dataset:
        with tf.GradientTape() as tape:
            # Pass only batch_x to the autoencoder
            preds = autoencoder(batch_x, training=True)
            # compute masked MSE only where mask==1
            sq_err = tf.square((batch_x - preds)) * batch_mask
            # avoid division by zero: sum mask per sample
            denom = tf.reduce_sum(batch_mask, axis=1)
            # sum sq_err per sample and normalize
            per_sample_loss = tf.math.divide_no_nan(tf.reduce_sum(sq_err, axis=1), denom)
            loss = tf.reduce_mean(per_sample_loss)
        grads = tape.gradient(loss, autoencoder.trainable_weights)
        optimizer.apply_gradients(zip(grads, autoencoder.trainable_weights))
        epoch_loss += loss.numpy()
        steps += 1
    train_losses.append(epoch_loss/steps)
    if (epoch+1) % 5 == 0 or epoch==0:
        print(f'Epoch {epoch+1}/{epochs} - loss: {train_losses[-1]:.4f}')

Epoch 1/20 - loss: 5.0180
Epoch 5/20 - loss: 0.9867
Epoch 10/20 - loss: 0.7995
Epoch 15/20 - loss: 0.6172
Epoch 20/20 - loss: 0.4786


## 4) Evaluate Autoencoder on test entries (RMSE)

In [11]:
# Get reconstructed full matrix
preds_full = autoencoder.predict(R, batch_size=64)
# Evaluate RMSE on test_mask entries (these are hidden during train)
test_indices = np.where(test_mask)
y_true = R[test_indices]
y_pred = preds_full[test_indices]
rmse_auto = np.sqrt(mean_squared_error(y_true, y_pred))
print('Autoencoder RMSE on test entries: {:.4f}'.format(rmse_auto))

[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step
Autoencoder RMSE on test entries: 1.2833


## 5) Generate Top-N recommendations using Autoencoder
We will recommend items with highest predicted rating for each user excluding items they already rated in train_R.

In [12]:
def get_top_n_from_preds(pred_matrix, train_matrix, n=10):
    top_n = {}
    num_users = pred_matrix.shape[0]
    for u in range(num_users):
        already = set(np.where(train_matrix[u]>0)[0])
        scores = list(enumerate(pred_matrix[u]))
        candidates = [(i, s) for (i,s) in scores if i not in already]
        candidates.sort(key=lambda x: x[1], reverse=True)
        top_n[u] = candidates[:n]
    return top_n

topn_auto = get_top_n_from_preds(preds_full, train_R, n=10)
# show sample user recommendations (map item idx back to original item id and title)
sample_u = 0
print('Top-10 for user idx', sample_u)
for idx,score in topn_auto[sample_u]:
    # map idx to original item id
    item_id = list(item_map.keys())[list(item_map.values()).index(idx)]
    title = items[items['item_id']==item_id]['title'].values[0] if item_id in items['item_id'].values else str(item_id)
    print(f'{title} (pred={score:.2f})')

Top-10 for user idx 0
Aiqing wansui (1994) (pred=7.50)
Boys, Les (1997) (pred=6.95)
Someone Else's America (1995) (pred=6.93)
World of Apu, The (Apur Sansar) (1959) (pred=6.67)
Bitter Sugar (Azucar Amargo) (1996) (pred=6.56)
Brothers in Trouble (1995) (pred=6.55)
Pather Panchali (1955) (pred=6.55)
Faust (1994) (pred=6.49)
Saint of Fort Washington, The (1993) (pred=6.39)
Santa with Muscles (1996) (pred=6.36)


## 6) Neural Collaborative Filtering (MLP) using embeddings
Prepare training examples from train_R (observed entries) and train an embedding-based MLP to predict ratings.

In [14]:
# Prepare training data from train_R observed entries
users, items_idx = np.where(train_R>0)
# Get the ratings values using the indices
ratings_vals = train_R[users, items_idx]

X_users = users.astype(np.int32)
X_items = items_idx.astype(np.int32)
y = ratings_vals.astype(np.float32)

# Build MLP model with embeddings
embed_dim = 32
user_input = layers.Input(shape=(), dtype='int32', name='user_input')
item_input = layers.Input(shape=(), dtype='int32', name='item_input')
u_embed = layers.Embedding(input_dim=n_users, output_dim=embed_dim, name='user_emb')(user_input)
i_embed = layers.Embedding(input_dim=n_items, output_dim=embed_dim, name='item_emb')(item_input)
# flatten
u_vec = layers.Flatten()(u_embed)
i_vec = layers.Flatten()(i_embed)
x = layers.Concatenate()([u_vec, i_vec])
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
out = layers.Dense(1, activation='linear')(x)

mlp_model = models.Model(inputs=[user_input, item_input], outputs=out, name='mlp_cf')
mlp_model.compile(optimizer=optimizers.Adam(1e-3), loss='mse')
mlp_model.summary()

In [15]:
# Train-test split for examples
Xu_train, Xu_val, Xi_train, Xi_val, y_train, y_val = train_test_split(X_users, X_items, y, test_size=0.2, random_state=42)
history = mlp_model.fit([Xu_train, Xi_train], y_train, validation_data=([Xu_val, Xi_val], y_val), epochs=10, batch_size=1024)

Epoch 1/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 17ms/step - loss: 11.4822 - val_loss: 1.0974
Epoch 2/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 13ms/step - loss: 1.0050 - val_loss: 0.9154
Epoch 3/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - loss: 0.8701 - val_loss: 0.9014
Epoch 4/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.8566 - val_loss: 0.8982
Epoch 5/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 0.8506 - val_loss: 0.8992
Epoch 6/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.8438 - val_loss: 0.8998
Epoch 7/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.8459 - val_loss: 0.8996
Epoch 8/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.8446 - val_loss: 0.8997
Epoch 9/10
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0

## 7) Evaluate MLP model on held-out test entries (from test_mask)

In [16]:
# Build list of test pairs (from test_mask)
test_users, test_items = np.where(test_mask)
y_test = R[test_users, test_items]
y_pred_mlp = mlp_model.predict([test_users, test_items], batch_size=1024).flatten()
rmse_mlp = np.sqrt(mean_squared_error(y_test, y_pred_mlp))
print('MLP RMSE on test entries: {:.4f}'.format(rmse_mlp))

[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step
MLP RMSE on test entries: 0.9429


## 8) Top-N from MLP predictions
We'll compute predictions for all user-item pairs not rated in train_R using the MLP (may be slow), then show Top‑N.

In [17]:
# For demo, compute top-N for a sample user using MLP predictions
def top_n_mlp_for_user(user_idx, n=10):
    # predict for all items not in train_R[user_idx]
    candidates = [i for i in range(n_items) if train_R[user_idx,i]==0]
    u_arr = np.array([user_idx]*len(candidates))
    i_arr = np.array(candidates)
    preds = mlp_model.predict([u_arr, i_arr], batch_size=1024).flatten()
    pairs = list(zip(candidates, preds))
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs[:n]

sample_user = 0
top_mlp = top_n_mlp_for_user(sample_user, n=10)
for idx, score in top_mlp:
    item_id = list(item_map.keys())[list(item_map.values()).index(idx)]
    title = items[items['item_id']==item_id]['title'].values[0]
    print(f'{title} (pred={score:.2f})')

[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
Pather Panchali (1955) (pred=4.74)
Faust (1994) (pred=4.69)
Close Shave, A (1995) (pred=4.65)
Bitter Sugar (Azucar Amargo) (1996) (pred=4.64)
Raise the Red Lantern (1991) (pred=4.49)
World of Apu, The (Apur Sansar) (1959) (pred=4.48)
Casablanca (1942) (pred=4.46)
Schindler's List (1993) (pred=4.44)
Santa with Muscles (1996) (pred=4.44)
Rear Window (1954) (pred=4.43)


## 9) Comparison & Evaluation Summary
We compare RMSE of autoencoder and MLP on test entries and report Precision@K if desired.

In [18]:
print('Autoencoder RMSE:', round(float(rmse_auto),4))
print('MLP RMSE:', round(float(rmse_mlp),4))

Autoencoder RMSE: 1.2833
MLP RMSE: 0.9429
