# Training a Recommender System with TensorFlow

In this notebook we build and train a collaborative filtering recommender system that predicts movie ratings from user and movie IDs.

In [226]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

## Step 1: Load and preprocess the dataset

We load the preprocessed dataset and make sure that both `userId` and `movieId` are strings.  
This is required since TensorFlow's `StringLookup` layer expects string inputs.

In [227]:
ratings_df = pd.read_csv("ratings_meta_small.csv")
ratings_df["userId"] = ratings_df["userId"].astype(str)
ratings_df["movieId"] = ratings_df["movieId"].astype(str)
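
One caveat worth checking at this step: `astype(str)` formats whatever dtype pandas inferred when reading the CSV. The sketch below uses made-up IDs (not values from `ratings_meta_small.csv`) to show how a column that was read as float ends up with a trailing `.0` in every ID.

```python
import pandas as pd

# Toy ID columns (hypothetical values, not taken from the real dataset).
int_ids = pd.Series([1, 2, 31])
float_ids = pd.Series([1.0, 2.0, 31.0])

# Integer IDs convert cleanly; float IDs keep their decimal point.
int_as_str = int_ids.astype(str).tolist()
float_as_str = float_ids.astype(str).tolist()

print(int_as_str)
print(float_as_str)
```

If the second pattern shows up in `ratings_df`, it is probably worth handling missing values and casting to an integer dtype before converting to strings, so the IDs match other files that store them as integers.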

## Step 2: Extract unique users and movies

We collect all unique users and movies to build vocabularies for the embedding lookups.

In [228]:
unique_users  = ratings_df["userId"].unique()
unique_movies = ratings_df["movieId"].unique()
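
Before building the vocabularies, it can help to look at how many users and movies there are and how sparse the rating matrix is. The following is a minimal sketch on a toy table with the same column names as the notebook. The values are invented, and the density formula assumes each (user, movie) pair appears at most once.

```python
import pandas as pd

# Toy ratings table (made-up values) with the notebook's column names.
toy_df = pd.DataFrame({
    "userId":  ["1", "1", "2", "3", "3", "3"],
    "movieId": ["10", "20", "10", "20", "30", "40"],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = toy_df["userId"].nunique()
n_movies = toy_df["movieId"].nunique()

# Share of the user x movie grid that actually holds a rating.
density = len(toy_df) / (n_users * n_movies)

print(f"{n_users} users, {n_movies} movies, density = {density:.2f}")
```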

## Step 3: Build lookup layers

We use `StringLookup` layers to map each user ID and movie ID to an integer index.  
These indices are what the embedding layers use to select a vector for each user and movie.

In [229]:
user_lookup  = tf.keras.layers.StringLookup(vocabulary=unique_users, mask_token=None)
movie_lookup = tf.keras.layers.StringLookup(vocabulary=unique_movies, mask_token=None)
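
With `mask_token=None` and the default `num_oov_indices=1`, index 0 is reserved for out-of-vocabulary strings and the known IDs start at 1. The sketch below illustrates this on a three-entry toy vocabulary. The IDs and the name `toy_lookup` are made up for the example.

```python
import tensorflow as tf

# Toy vocabulary; the notebook itself passes the unique IDs from ratings_df.
toy_lookup = tf.keras.layers.StringLookup(vocabulary=["u1", "u2", "u3"], mask_token=None)

# "u2" is the second vocabulary entry; "someone_new" is not in the vocabulary.
indices = toy_lookup(tf.constant(["u2", "someone_new"])).numpy().tolist()
vocab_size = toy_lookup.vocabulary_size()

print(indices)     # known ID maps to 2, unknown ID maps to 0 (the OOV slot)
print(vocab_size)  # 3 known IDs + 1 OOV slot
```

The extra slot is why the embedding tables in Step 5 are sized with `vocabulary_size()` rather than with the number of unique IDs.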

## Step 4: Define embedding size

We choose 128 dimensions for both user and movie embeddings.  
Larger embeddings can capture more information, but they also add parameters and make overfitting more likely.

In [230]:
emb_dim = 128
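
To make the cost of the embedding size concrete, the sketch below counts trainable parameters for the architecture defined in Step 5 (two embeddings, `Dense(128)`, `Dense(64)`, `Dense(1)`). The user and movie counts are hypothetical placeholders, since the notebook does not state them, and `count_parameters` is an illustrative helper, not part of any library.

```python
def count_parameters(n_users: int, n_movies: int, emb_dim: int) -> dict:
    """Parameter counts for: two embeddings -> concat -> Dense(128) -> Dense(64) -> Dense(1)."""
    # Each StringLookup adds one out-of-vocabulary row to its embedding table.
    user_emb = (n_users + 1) * emb_dim
    movie_emb = (n_movies + 1) * emb_dim
    # A Dense layer has inputs * units weights plus one bias per unit.
    dense_1 = (2 * emb_dim) * 128 + 128
    dense_2 = 128 * 64 + 64
    output = 64 * 1 + 1
    total = user_emb + movie_emb + dense_1 + dense_2 + output
    return {"embeddings": user_emb + movie_emb, "dense": dense_1 + dense_2 + output, "total": total}


# Hypothetical dataset size: 600 users and 9,000 movies.
for dim in (32, 128):
    print(dim, count_parameters(600, 9000, dim))
```

Under these assumed counts the embedding tables dominate the total, so `emb_dim` is the main lever on model size.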

## Step 5: Build the model

We define a collaborative filtering model using embeddings:

- User and movie IDs are mapped to embeddings.
- The embeddings are concatenated and passed through Dense layers with ReLU activations.
- A final `Dense(1)` layer predicts the rating. It has no activation, so the output can be any real number.

Compared with a plain dot product of the two embeddings, the Dense layers let the model learn non-linear interactions between users and movies.

In [None]:
# Inputs
user_in  = tf.keras.Input(shape=(), dtype=tf.string, name="userId")
movie_in = tf.keras.Input(shape=(), dtype=tf.string, name="movieId")

# Lookups
u_idx = user_lookup(user_in)
m_idx = movie_lookup(movie_in)

# Embeddings
u_emb = tf.keras.layers.Embedding(user_lookup.vocabulary_size(), emb_dim)(u_idx)
m_emb = tf.keras.layers.Embedding(movie_lookup.vocabulary_size(), emb_dim)(m_idx)
u_emb = tf.keras.layers.Flatten()(u_emb)
m_emb = tf.keras.layers.Flatten()(m_emb)

# Concatenate user + movie vectors
x = tf.keras.layers.Concatenate()([u_emb, m_emb])

# Add Dense layers
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)

# Output layer
pred = tf.keras.layers.Dense(1)(x)

# Model
model = tf.keras.Model(inputs=[user_in, movie_in], outputs=pred)
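
To show what the layers compute, here is a NumPy sketch of the same forward pass. It is not the Keras model: the weights are random stand-ins, and the sizes (4-dimensional embeddings, hidden widths 8 and 4) are shrunk for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_users, n_movies = 4, 5, 7  # tiny hypothetical sizes

# Random stand-ins for trained weights.
user_table = rng.normal(size=(n_users, emb_dim))
movie_table = rng.normal(size=(n_movies, emb_dim))
W1, b1 = rng.normal(size=(2 * emb_dim, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)


def relu(z):
    return np.maximum(z, 0.0)


def forward(u_idx, m_idx):
    # An embedding lookup is row selection from the table.
    u = user_table[u_idx]               # (batch, emb_dim)
    m = movie_table[m_idx]              # (batch, emb_dim)
    x = np.concatenate([u, m], axis=1)  # (batch, 2 * emb_dim)
    x = relu(x @ W1 + b1)
    x = relu(x @ W2 + b2)
    return x @ W3 + b3                  # no activation: any real number


preds = forward(np.array([0, 3]), np.array([2, 6]))
print(preds.shape)
```

Because the inputs are scalars per example, each embedding already has shape `(batch, emb_dim)`, so the `Flatten` calls in the model should leave the shapes unchanged.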

## Step 6: Compile the model

We compile the model with:
- **Optimizer**: Adam (adaptive learning rate)
- **Loss**: Mean Squared Error (MSE), since we predict numerical ratings
- **Metric**: Root Mean Squared Error (RMSE), which is easier to interpret because it is on the same scale as the ratings

In [None]:
model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")]
)
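
The loss and the metric are two views of the same quantity: RMSE is the square root of MSE. The sketch below checks this on a few made-up ratings and then on the first-epoch validation figures from the training log in Step 9.

```python
import math

import numpy as np

y_true = np.array([4.0, 3.0, 5.0, 2.0])  # made-up ratings
y_pred = np.array([3.5, 3.0, 4.0, 3.0])  # made-up predictions

mse = float(np.mean((y_true - y_pred) ** 2))
rmse = math.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")

# Epoch 1 of the training log reports val_loss 0.8016 and val_rmse 0.8953.
logged_val_loss = 0.8016
rmse_from_log = math.sqrt(logged_val_loss)
print(f"sqrt(val_loss) = {rmse_from_log:.4f}")
```

An RMSE of about 0.9 means predictions are off by a bit less than one rating point on average, in the root-mean-square sense.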

## Step 7: Train/validation split

We split the data into:
- **Training set** (90%)
- **Validation set** (10%)

This allows us to evaluate generalization performance.

In [233]:
train_df, val_df = train_test_split(ratings_df, test_size=0.1, random_state=42)

X_train = {"userId": train_df["userId"].values,
           "movieId": train_df["movieId"].values}
y_train = train_df["rating"].values.astype("float32")

X_val = {"userId": val_df["userId"].values,
         "movieId": val_df["movieId"].values}
y_val = val_df["rating"].values.astype("float32")
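
A random row-level split does not guarantee that every user and movie in the validation set also appears in the training set. Since the vocabularies in Step 2 are built before the split, such IDs still get an embedding row, but that row receives no training updates. The sketch below shows one way to measure this. `unseen_share` is a hypothetical helper and the tables are toy data.

```python
import pandas as pd


def unseen_share(train_ids: pd.Series, val_ids: pd.Series) -> float:
    """Fraction of validation rows whose ID never occurs in the training rows."""
    return float((~val_ids.isin(train_ids)).mean())


# Toy split (made-up IDs): user "4" and movie "40" occur only in validation.
toy_train = pd.DataFrame({"userId": ["1", "1", "2", "3"], "movieId": ["10", "20", "10", "30"]})
toy_val = pd.DataFrame({"userId": ["1", "4"], "movieId": ["20", "40"]})

user_gap = unseen_share(toy_train["userId"], toy_val["userId"])
movie_gap = unseen_share(toy_train["movieId"], toy_val["movieId"])
print(f"unseen users: {user_gap:.0%}, unseen movies: {movie_gap:.0%}")
```

Applied to the notebook's data, the same call would take `train_df["userId"]` and `val_df["userId"]` (and likewise for `movieId`).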


## Step 8: Create TensorFlow datasets

We convert the NumPy arrays into TensorFlow datasets for efficient training:
- `shuffle` randomizes the data each epoch
- `batch` groups samples into batches of size 256

In [234]:
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(y_train)).batch(256)
val_ds   = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(256)
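
Since `batch(256)` keeps the final partial batch by default, the number of steps per epoch is the row count divided by 256, rounded up. The arithmetic below is a small sketch of that relation, and `steps_per_epoch` is an illustrative helper. The training log in Step 9 shows 351 steps, which bounds the training set size.

```python
import math


def steps_per_epoch(n_rows: int, batch_size: int) -> int:
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(n_rows / batch_size)


# Smallest and largest training-set sizes that give 351 steps at batch size 256.
smallest = 350 * 256 + 1
largest = 351 * 256
print(smallest, steps_per_epoch(smallest, 256))
print(largest, steps_per_epoch(largest, 256))
```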

## Step 9: Train the model

We train for up to 10 epochs with early stopping:
- **Monitor**: validation RMSE
- **Patience**: stop if it does not improve for 2 epochs
- **Restore best weights**: roll back to the weights from the best epoch


In [235]:
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_rmse", patience=2, restore_best_weights=True)]
)

Epoch 1/10
351/351 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - loss: 1.4403 - rmse: 1.2001 - val_loss: 0.8016 - val_rmse: 0.8953
Epoch 2/10
351/351 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - loss: 0.7179 - rmse: 0.8473 - val_loss: 0.7902 - val_rmse: 0.8889
Epoch 3/10
351/351 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 0.6179 - rmse: 0.7861 - val_loss: 0.7981 - val_rmse: 0.8934
Epoch 4/10
351/351 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - loss: 0.5107 - rmse: 0.7146 - val_loss: 0.8284 - val_rmse: 0.9102
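
Training stopped after epoch 4 even though 10 epochs were allowed. The sketch below replays the logged `val_rmse` values through a simplified version of the patience rule to show why. It is an illustrative re-implementation with a hypothetical function name, not the Keras callback's actual code.

```python
def early_stopping_trace(values, patience):
    """Return (epoch at which training stops, epoch with the best value)."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if value < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(values), best_epoch


# val_rmse per epoch, copied from the training log above.
val_rmse_log = [0.8953, 0.8889, 0.8934, 0.9102]
stopped_epoch, best_epoch = early_stopping_trace(val_rmse_log, patience=2)
print(f"stopped after epoch {stopped_epoch}, best epoch was {best_epoch}")
```

Validation RMSE was lowest in epoch 2 (0.8889) and then rose for two epochs while the training loss kept falling, which is the usual sign of overfitting. With `restore_best_weights=True`, the model keeps the epoch 2 weights.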


## Step 10: Save the trained model

We save the trained recommender system to disk so it can be reused for predictions without retraining.

In [None]:
model.save("recommender_model2.keras")