# About

MovieLens 100k dataset

In this notebook, we explore two approaches to predicting movie ratings.
- The first is **collaborative filtering**, where we randomly initialise movie/user feature vectors and then train them to minimise the squared error between predicted and actual ratings (plus regularisation). This method is expected to be less accurate than the second.
- The second is **content-based filtering**, where we start with defined movie/user features and train a movie network and a user network to extract movie/user feature vectors that minimise the squared error between predicted and actual ratings. This method uses user information such as occupation and gender, and movie information such as genre breakdown, which is expected to yield more accurate predictions.

In [None]:
import pickle

import numpy as np
import pandas as pd
import tabulate
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from helpers import *
from variables import *

In [2]:
def print_as_table(data,headers,top_n_rows=5):
    """
    Uses the tabulate module to print array data in a tabular format
    """
    table = tabulate.tabulate(data[:top_n_rows],
                              headers=headers,
                              tablefmt="fancy_grid",
                              numalign="right",
                              stralign="center",
                              colalign=("center", "center", "right"))
    print(table)

# Collaborative Filtering

## Import Data

In [3]:
Y,R,movie_mapping = loadDataCollaborativeFiltering()
print("Y shape:", Y.shape)
print("R shape:", R.shape)
for i in np.random.randint(0,Y.shape[0],5):
    print(f"Average rating for movie {i} is {np.average(Y[i])}")

Y shape: (1682, 943)
R shape: (1682, 943)
Average rating for movie 31 is 0.3255567338282078
Average rating for movie 107 is 0.21208907741251326
Average rating for movie 1447 is 0.01166489925768823
Average rating for movie 227 is 0.9872746553552492
Average rating for movie 1367 is 0.02332979851537646


In [4]:
# normalise ratings by subtracting mean rating for every movie so that mean rating for each movie~0
Ynorm,Ymean = normaliseRatings(Y)
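
The averages printed above are well below 1 for most movies, which suggests that `np.average(Y[i])` is counting unrated entries as zeros. The following is a minimal sketch of mean normalisation that uses only rated entries. `normalise_ratings_sketch` is a hypothetical name, and the real `normaliseRatings` helper (not shown here) may differ, for example in its signature.

```python
import numpy as np

def normalise_ratings_sketch(Y, R):
    """Subtract each movie's mean rating, using only entries where R == 1."""
    rated_counts = R.sum(axis=1, keepdims=True)
    # guard against movies with no ratings to avoid division by zero
    Ymean = (Y * R).sum(axis=1, keepdims=True) / np.maximum(rated_counts, 1)
    # unrated entries stay at 0 after normalisation
    Ynorm = (Y - Ymean) * R
    return Ynorm, Ymean

# toy example: 2 movies x 3 users, 0 means "not rated"
Y_toy = np.array([[5.0, 0.0, 3.0],
                  [0.0, 2.0, 0.0]])
R_toy = (Y_toy != 0).astype(float)
Ynorm_toy, Ymean_toy = normalise_ratings_sketch(Y_toy, R_toy)
print(Ymean_toy.ravel())   # masked means: 4.0 and 2.0
print(Y_toy.mean(axis=1))  # unmasked means are pulled towards 0
```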

## Define Cost Objective

In [None]:
def calcCostObjective(X, W, b, Y, R, lambda_):
    """
    Calculates the cost objective: squared error on rated entries plus L2 regularisation

    X: movie features vector (num_movies, num_features)
    W: user features vector (num_users, num_features)
    b: user bias vector (num_users,)
    Y: ratings matrix (num_movies, num_users)
    R: indicator matrix for ratings (num_movies, num_users)
    lambda_: regularization parameter
    """
    # Prediction error, masked by R so that only rated entries contribute
    j = (tf.matmul(X, tf.transpose(W))+b-Y)*R

    # Compute the cost
    cost = 0.5 * tf.reduce_sum(j**2)

    # Add regularization terms
    cost += (lambda_ / 2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))

    return cost
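
Written out, the objective computed by `calcCostObjective` is shown below. The notation is my own: $\mathbf{x}_i$ is row $i$ of `X`, $\mathbf{w}_j$ is row $j$ of `W`, and $b_j$ is entry $j$ of `b`. Note that the bias is not regularised in the code, so it does not appear in the second term.

$$
J(X, W, b) = \frac{1}{2} \sum_{(i,j):\, R_{ij}=1} \left( \mathbf{x}_i \cdot \mathbf{w}_j + b_j - Y_{ij} \right)^2 + \frac{\lambda}{2} \left( \sum_{i} \lVert \mathbf{x}_i \rVert^2 + \sum_{j} \lVert \mathbf{w}_j \rVert^2 \right)
$$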

## Build & Train

In [40]:
# define model parameters
num_features = 100
num_movies = Y.shape[0]
num_users = Y.shape[1]

# randomly initialise movie and user feature vectors
X = tf.Variable(tf.random.uniform((num_movies,num_features)))
W = tf.Variable(tf.random.uniform((num_users,num_features)))
b = tf.Variable(tf.random.uniform((num_users,)))

In [None]:
# define training hyperparameters
iters = 2000
lambda_ = 1

# train
train_cf_model(X,W,b,Ynorm,R,calcCostObjective,lambda_,iters)

Training loss at iteration 0: 4888692.50 
Training loss at iteration 100: 1181118.88 
Training loss at iteration 200: 382624.03 
Training loss at iteration 300: 179562.73 
Training loss at iteration 400: 117144.17 
Training loss at iteration 500: 93589.85 
Training loss at iteration 600: 82242.39 
Training loss at iteration 700: 75319.00 
Training loss at iteration 800: 70298.96 
Training loss at iteration 900: 66241.76 
Training loss at iteration 1000: 62738.40 
Training loss at iteration 1100: 59591.98 
Training loss at iteration 1200: 56703.28 
Training loss at iteration 1300: 54020.71 
Training loss at iteration 1400: 51515.67 
Training loss at iteration 1500: 49170.25 
Training loss at iteration 1600: 46971.32 
Training loss at iteration 1700: 44907.84 
Training loss at iteration 1800: 42969.78 
Training loss at iteration 1900: 41147.70 
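
The `train_cf_model` helper is not shown in this notebook. As a rough illustration of what such a loop involves, here is a minimal sketch that uses `tf.GradientTape` with plain gradient descent on toy data. The names `train_cf_sketch` and `cost_fn` and the learning rate are assumptions, and the real helper may well use a different optimiser such as Adam.

```python
import tensorflow as tf

def cost_fn(X, W, b, Y, R, lambda_):
    """Squared error on rated entries plus L2 regularisation (same form as above)."""
    err = (tf.matmul(X, tf.transpose(W)) + b - Y) * R
    return 0.5 * tf.reduce_sum(err**2) + (lambda_ / 2) * (
        tf.reduce_sum(X**2) + tf.reduce_sum(W**2))

def train_cf_sketch(X, W, b, Y, R, lambda_, iters, lr=0.01):
    """Hypothetical stand-in for train_cf_model using plain gradient descent."""
    history = []
    for _ in range(iters):
        with tf.GradientTape() as tape:
            cost = cost_fn(X, W, b, Y, R, lambda_)
        grads = tape.gradient(cost, [X, W, b])
        for var, grad in zip([X, W, b], grads):
            var.assign_sub(lr * grad)   # step against the gradient
        history.append(float(cost))
    return history

# toy data: 4 movies x 3 users, 0 means "not rated"
Y_toy = tf.constant([[5, 3, 0], [4, 0, 1], [0, 2, 5], [1, 0, 4]], dtype=tf.float32)
R_toy = tf.cast(Y_toy != 0, tf.float32)
tf.random.set_seed(0)
X_toy = tf.Variable(tf.random.uniform((4, 2)))
W_toy = tf.Variable(tf.random.uniform((3, 2)))
b_toy = tf.Variable(tf.random.uniform((3,)))

losses = train_cf_sketch(X_toy, W_toy, b_toy, Y_toy, R_toy, lambda_=1.0, iters=200)
print(f"first loss {losses[0]:.2f}, last loss {losses[-1]:.2f}")
```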


Compare predicted vs actual ratings for random movies that user 34 has rated

In [71]:
# predictions
p = (tf.matmul(X,tf.transpose(W)) + b)[:,34]
# actual ratings
r = Y[:,34]

# mask for movies that user 34 has actually rated
mask = R[:,34]!=0
# get indices of movies rated by user 34
rated_indices = np.where(mask)

print("User 34's ratings:\n")
for midx in np.random.choice(rated_indices[0],10):
    print(f"Movie {midx}: Predicted Rating {p[midx]:.1f} Actual Rating {r[midx]}")


User 34's ratings:

Movie 326: Predicted Rating 2.9 Actual Rating 3.0
Movie 258: Predicted Rating 2.8 Actual Rating 4.0
Movie 327: Predicted Rating 1.8 Actual Rating 3.0
Movie 325: Predicted Rating 2.3 Actual Rating 3.0
Movie 677: Predicted Rating 2.4 Actual Rating 3.0
Movie 679: Predicted Rating 3.3 Actual Rating 4.0
Movie 320: Predicted Rating 2.8 Actual Rating 3.0
Movie 875: Predicted Rating 1.8 Actual Rating 2.0
Movie 878: Predicted Rating 2.9 Actual Rating 4.0
Movie 325: Predicted Rating 2.3 Actual Rating 3.0
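
The ten comparisons above are only a spot check. A single summary number, such as the root-mean-squared error over rated entries, makes it easier to compare runs. The sketch below assumes predictions are available as a NumPy array (for example via `.numpy()` on the TensorFlow result); `masked_rmse` is a hypothetical helper name. Since it is computed on ratings the model was trained on, it measures fit rather than generalisation.

```python
import numpy as np

def masked_rmse(predictions, Y, R):
    """Root-mean-squared error computed only where R == 1."""
    squared_errors = ((predictions - Y) ** 2) * R
    return np.sqrt(squared_errors.sum() / R.sum())

# toy example: the prediction for the unrated entry (9.9) is ignored
Y_toy = np.array([[3.0, 0.0], [4.0, 2.0]])
R_toy = np.array([[1.0, 0.0], [1.0, 1.0]])
P_toy = np.array([[2.0, 9.9], [4.0, 3.0]])
rmse_toy = masked_rmse(P_toy, Y_toy, R_toy)
print(rmse_toy)
```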


# Content-Based Filtering

## Load and Explore Data

In [3]:
# load raw movie/user features
movie_raw_features,user_raw_features = load_raw_features()

In [4]:
# show movie features
print_as_table(movie_raw_features,movie_raw_features_headers)

╒════════════╤═══════════════════╤════════════════╤══════════════════════╤════════════════════════════════════════════════════════╤═══════════╤══════════╤═════════════╤═════════════╤══════════════╤══════════╤═════════╤═══════════════╤═════════╤═══════════╤═════════════╤══════════╤═══════════╤═══════════╤═══════════╤══════════╤════════════╤═══════╤═══════════╕
│  movie id  │    movie title    │   release date │  video release date  │                        IMDb URL                        │   unknown │   Action │   Adventure │   Animation │   Children's │   Comedy │   Crime │   Documentary │   Drama │   Fantasy │   Film-Noir │   Horror │   Musical │   Mystery │   Romance │   Sci-Fi │   Thriller │   War │   Western │
╞════════════╪═══════════════════╪════════════════╪══════════════════════╪════════════════════════════════════════════════════════╪═══════════╪══════════╪═════════════╪═════════════╪══════════════╪══════════╪═════════╪═══════════════╪═════════╪═══════════╪═════════════╪══════

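In [None]:
# build a DataFrame from the raw movie features so the columns can be inspected
df = pd.DataFrame(movie_raw_features,columns=movie_raw_features_headers)

In [15]:
# 'video release date' is empty for every movie, so it carries no information
df['video release date'].unique()

array([''], dtype=object)

In [None]:
# extract the release year; 'Int64' is pandas' nullable integer dtype, so any missing dates become <NA>
df['Release Year']=pd.to_datetime(df['release date']).dt.year.astype('Int64')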

0       1995
1       1995
2       1995
3       1995
4       1995
        ... 
1677    1998
1678    1998
1679    1998
1680    1994
1681    1996
Name: release date, Length: 1682, dtype: Int64

In [5]:
# show user features
print_as_table(user_raw_features,user_raw_features_headers)

╒═══════════╤═══════╤══════════╤══════════════╤════════════╕
│  user id  │  age  │   gender │  occupation  │   zip code │
╞═══════════╪═══════╪══════════╪══════════════╪════════════╡
│     1     │  24   │        M │  technician  │      85711 │
├───────────┼───────┼──────────┼──────────────┼────────────┤
│     2     │  53   │        F │    other     │      94043 │
├───────────┼───────┼──────────┼──────────────┼────────────┤
│     3     │  23   │        M │    writer    │      32067 │
├───────────┼───────┼──────────┼──────────────┼────────────┤
│     4     │  24   │        M │  technician  │      43537 │
├───────────┼───────┼──────────┼──────────────┼────────────┤
│     5     │  33   │        F │    other     │      15213 │
╘═══════════╧═══════╧══════════╧══════════════╧════════════╛


In [195]:
# the raw features must be processed so that no non-numerical values remain
# this is done in the eda_notebook; here we simply load the cleaned arrays
movie_features,movie_features_headers,user_features,user_features_headers = load_cleaned_features()
targets = np.load("cleaned_data/targets.npy",allow_pickle=True)[...,np.newaxis]

# show cleaned movie features
print_as_table(movie_features,movie_features_headers)
# show cleaned user features
print_as_table(user_features,user_features_headers)

╒════════════╤════════════════════════════╤═════════════════════════════════════════════════════════════════════╤═══════════╤══════════╤═════════════╤═════════════╤══════════════╤══════════╤═════════╤═══════════════╤═════════╤═══════════╤═════════════╤══════════╤═══════════╤═══════════╤═══════════╤══════════╤════════════╤═══════╤═══════════╤════════════════╤══════════════╕
│  movie id  │        movie title         │                                                            IMDb URL │   unknown │   Action │   Adventure │   Animation │   Children's │   Comedy │   Crime │   Documentary │   Drama │   Fantasy │   Film-Noir │   Horror │   Musical │   Mystery │   Romance │   Sci-Fi │   Thriller │   War │   Western │   Release Year │   avg rating │
╞════════════╪════════════════════════════╪═════════════════════════════════════════════════════════════════════╪═══════════╪══════════╪═════════════╪═════════════╪══════════════╪══════════╪═════════╪═══════════════╪═════════╪═══════════╪══════════

We will not be using the following features during training:
- Movie features: movie id, movie title, IMDb URL
- User features: user id, zip code (although potentially useful)

## Standardisation and train-test split

In [196]:
# our model will train better if we have scaled features/targets to have mean of 0 and variance of 1
scalerMovies = StandardScaler()
movie_train = scalerMovies.fit_transform(movie_features[:,3:])

scalerUsers = StandardScaler()
user_train = scalerUsers.fit_transform(user_features[:,2:])

scalerTargets = StandardScaler()
targets_train = scalerTargets.fit_transform(targets)

In [197]:
# visualise scaled feature values
print_as_table(movie_train,movie_features_headers[3:])

╒════════════╤═══════════╤═════════════╤═════════════╤══════════════╤═══════════╤═══════════╤═══════════════╤═══════════╤═══════════╤═════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤════════════╤═══════════╤═══════════╤════════════════╤══════════════╕
│  unknown   │  Action   │   Adventure │   Animation │   Children's │    Comedy │     Crime │   Documentary │     Drama │   Fantasy │   Film-Noir │    Horror │   Musical │   Mystery │   Romance │    Sci-Fi │   Thriller │       War │   Western │   Release Year │   avg rating │
╞════════════╪═══════════╪═════════════╪═════════════╪══════════════╪═══════════╪═══════════╪═══════════════╪═══════════╪═══════════╪═════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪════════════╪═══════════╪═══════════╪════════════════╪══════════════╡
│ -0.0100005 │ -0.586419 │   -0.399325 │   -0.193386 │    -0.278168 │   1.53366 │ -0.295984 │    -0.0873951 │ -0.814712 │  -0.11707 │   -0.132799 │ -0.236972 │ -0.228303

In [87]:
# train-test split: 80-20
movie_train,movie_test, user_train,user_test, targets_train,targets_test =train_test_split(movie_train,user_train,targets_train,test_size=0.2, shuffle=True)
print("Movie training data shape: ",movie_train.shape)
print("Movie test data shape: ",movie_test.shape)
print("User training data shape: ",user_train.shape)
print("User test data shape: ",user_test.shape)
print("Targets training data shape: ",targets_train.shape)
print("Targets test data shape: ",targets_test.shape)

Movie training data shape:  (80000, 21)
Movie test data shape:  (20000, 21)
User training data shape:  (80000, 24)
User test data shape:  (20000, 24)
Targets training data shape:  (80000, 1)
Targets test data shape:  (20000, 1)
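
In the cells above, the scalers are fitted on all rows before the split, so the test rows contribute to the means and variances used for scaling. One common alternative is to split first and fit the scalers on the training rows only. The sketch below shows that order on stand-in data; the array names and sizes are illustrative assumptions, not the notebook's actual variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # stand-in feature rows
targets = rng.integers(1, 6, size=(100, 1)).astype(float)  # stand-in ratings 1-5

# split the raw rows first ...
f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.2, shuffle=True, random_state=0)

# ... then fit the scalers on the training rows only
feature_scaler = StandardScaler().fit(f_train)
target_scaler = StandardScaler().fit(t_train)
f_train_s = feature_scaler.transform(f_train)
f_test_s = feature_scaler.transform(f_test)
t_train_s = target_scaler.transform(t_train)
t_test_s = target_scaler.transform(t_test)

print(f_train_s.shape, f_test_s.shape)
```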


## Build and Train Networks
A movie network maps each row of movie features to a movie vector, and a separate user network maps each row of user features to a user vector. Both networks use the architecture from Andrew Ng's Recommender Systems class and output a 32-dimensional vector. The predicted rating is the dot product of the two L2-normalised vectors.

In [88]:
num_outputs = 32
movie_NN = tf.keras.models.Sequential([   
    tf.keras.layers.Dense(256,activation='relu'),
    tf.keras.layers.Dense(128,activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256,activation='relu'),
    tf.keras.layers.Dense(128,activation='relu'),
    tf.keras.layers.Dense(num_outputs)
])

# forward pass in movieNN
input_movie = tf.keras.Input(shape=(movie_train.shape[1],))
vm = movie_NN(input_movie)
vm = tf.keras.layers.Lambda(lambda x: tf.linalg.l2_normalize(x,axis=1))(vm) # normalise to unit length

# forward pass in userNN
input_user = tf.keras.Input(shape=(user_train.shape[1],))
vu = user_NN(input_user)
vu = tf.keras.layers.Lambda(lambda x: tf.linalg.l2_normalize(x,axis=1))(vu) # normalise to unit length

# dot product
output = tf.keras.layers.Dot(axes=1)([vm,vu])

# build model
model = tf.keras.Model([input_movie,input_user],output)

model.summary()
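
Because both vectors are normalised to unit length before the dot product, the model's output is a cosine similarity and is therefore confined to [-1, 1]. Standardised ratings can fall outside that range, which may limit how well the targets can be fitted. The NumPy sketch below illustrates both points on made-up data; the five ratings are a hypothetical example, not the real distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalise(v):
    """Scale each row to unit length."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# random stand-ins for the 32-dimensional network outputs
vm = l2_normalise(rng.normal(size=(1000, 32)))
vu = l2_normalise(rng.normal(size=(1000, 32)))
scores = np.sum(vm * vu, axis=1)   # row-wise dot product = cosine similarity

# hypothetical example: one rating of each value, standardised
ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
standardised = (ratings - ratings.mean()) / ratings.std()

print(scores.min(), scores.max())   # always within [-1, 1]
print(standardised)                 # extremes lie outside [-1, 1]
```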

In [149]:
# compile the model with mean squared error loss and no regularisation
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.Adam(0.01)) # unusually high lr seems to work

# fit model (targets_train already has shape (n, 1), so no extra axis is needed)
model.fit([movie_train,user_train],targets_train,epochs = 30)

Epoch 1/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - loss: 0.9205
Epoch 2/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - loss: 0.9199
Epoch 3/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.9194
Epoch 4/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - loss: 0.9191
Epoch 5/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - loss: 0.9188
Epoch 6/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - loss: 0.9185
Epoch 7/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.9183
Epoch 8/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.9182
Epoch 9/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.9180
Epoch 10/30
2500/2500 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1e12d3dd340>

In [150]:
model.evaluate([movie_test,user_test],targets_test)

625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.9245


0.9244731664657593

The test loss (0.9245) is comparable to the training loss (about 0.92), which suggests that the model has not overfit.
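
A useful reference point for this loss: on targets standardised to variance 1, a model that ignores its inputs and always predicts the mean scores an MSE of about 1. A loss of about 0.92 is therefore only a modest improvement over that baseline. The sketch below checks the baseline on stand-in ratings (randomly generated, not the real data).

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=10_000).astype(float)   # stand-in ratings, not the real data

# standardise to mean 0 and variance 1, as StandardScaler does
standardised = (ratings - ratings.mean()) / ratings.std()

# baseline: always predict the mean of the standardised targets (which is 0)
baseline_mse = np.mean((standardised - standardised.mean()) ** 2)
print(round(baseline_mse, 6))   # 1.0
```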

In [151]:
# save model
tf.keras.models.save_model(model,"models/content_based/lr1e-2_epoch30.keras")

## Predictions

In [152]:
with open("cleaned_data/movie_dict.pkl","rb") as f: 
    movie_dict = pickle.load(f)

movie_vecs = np.load("cleaned_data/movie_vecs.npy")

In [156]:
# make predictions for a particular user

# get random idx
random_idx = np.random.randint(0,user_train.shape[0])

# get user vec
user_vec = user_train[random_idx]

# tile vector 1682 times because there are 1682 movies to predict ratings for
user_vec=np.tile(user_vec,(len(movie_vecs),1))

# scale movie vecs
movie_vecs_scaled = scalerMovies.transform(movie_vecs)

# make predictions
p_unscaled = model.predict([movie_vecs_scaled,user_vec])
p = scalerTargets.inverse_transform(p_unscaled)



53/53 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


In [193]:
print("Predictions standard deviation: ",p.std())

Predictions standard deviation:  2.3093013e-07


My model is predicting (more or less) the same rating for every movie. Possible causes include:
- The movie feature vectors are too similar to one another
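
One way to test whether the feature vectors are too similar is to look at the per-feature spread and the average pairwise cosine similarity of the rows. The sketch below is a hypothetical diagnostic (`similarity_report` is not part of the notebook's helpers), run here on synthetic vectors. It could be applied to the scaled movie features or to the outputs of the movie network.

```python
import numpy as np

def similarity_report(vectors):
    """Return per-feature std and mean pairwise cosine similarity of the rows."""
    feature_std = vectors.std(axis=0)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    cosine = unit @ unit.T
    n = len(vectors)
    # average over off-diagonal pairs only
    mean_offdiag = (cosine.sum() - np.trace(cosine)) / (n * (n - 1))
    return feature_std, mean_offdiag

rng = np.random.default_rng(0)
varied = rng.normal(size=(50, 8))                                   # well-spread rows
nearly_identical = np.ones((50, 8)) + 1e-3 * rng.normal(size=(50, 8))  # almost the same row repeated

_, sim_varied = similarity_report(varied)
_, sim_identical = similarity_report(nearly_identical)
print(sim_varied, sim_identical)
```

A mean similarity close to 1 would support the hypothesis; a value near 0 would point to a cause elsewhere.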

# Improvements to Make
- User feature vectors can be enriched by including user zip codes as input features in the user network. Users who live close to one another may well have similar taste in movies.
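
As a rough sketch of how zip codes might be turned into a numeric feature, the first character can be one-hot encoded as a coarse region (the leading digit of a US zip code corresponds to a broad group of states). `zip_region_one_hot` is a hypothetical helper, and a finer encoding may work better in practice.

```python
import numpy as np

def zip_region_one_hot(zip_codes):
    """One-hot encode the first character of each zip code as a coarse region."""
    # "?" marks an empty zip code
    regions = [str(z)[0] if str(z) else "?" for z in zip_codes]
    categories = sorted(set(regions))
    index = {c: k for k, c in enumerate(categories)}
    one_hot = np.zeros((len(regions), len(categories)))
    for row, region in enumerate(regions):
        one_hot[row, index[region]] = 1.0
    return one_hot, categories

zips = ["85711", "94043", "32067", "43537", "15213"]   # the five zip codes shown in the user table
encoded, categories = zip_region_one_hot(zips)
print(categories)      # ['1', '3', '4', '8', '9']
print(encoded.shape)   # (5, 5)
```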