|General <br />  Notation  | Description| Python (if any) |
|:-------------|:------------------------------------------------------------|:----------------|
| $r(i,j)$     | scalar; = 1 if user j rated movie i, = 0 otherwise             | |
| $y(i,j)$     | scalar; rating given by user j to movie i (defined only if $r(i,j) = 1$) | |
|$\mathbf{w}^{(j)}$ | vector; parameters for user j | |
|$b^{(j)}$     |  scalar; parameter for user j | |
| $\mathbf{x}^{(i)}$ |   vector; feature ratings for movie i        | |
| $n_u$        | number of users |num_users|
| $n_m$        | number of movies | num_movies |
| $n$          | number of features | num_features                    |
| $\mathbf{X}$ |  matrix of vectors $\mathbf{x}^{(i)}$         | X |
| $\mathbf{W}$ |  matrix of vectors $\mathbf{w}^{(j)}$         | W |
| $\mathbf{b}$ |  vector of bias parameters $b^{(j)}$ | b |
| $\mathbf{R}$ | matrix of elements $r(i,j)$                    | R |
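
To make the notation concrete, here is a minimal sketch that builds small arrays with the shapes the table implies. The sizes (4 movies, 3 users, 2 features) and the random values are made up for illustration only.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
num_movies, num_users, num_features = 4, 3, 2

rng = np.random.default_rng(0)
X = rng.normal(size=(num_movies, num_features))  # row i is x^(i)
W = rng.normal(size=(num_users, num_features))   # row j is w^(j)
b = rng.normal(size=(1, num_users))              # entry j is b^(j)

# Predicted rating of movie i by user j: w^(j) . x^(i) + b^(j)
pred = X @ W.T + b                               # shape (num_movies, num_users)

# The matrix form agrees with the per-pair formula
i, j = 2, 1
single = W[j] @ X[i] + b[0, j]
print(pred.shape, np.isclose(pred[i, j], single))
```

Each column of `pred` corresponds to one user and each row to one movie, which is the same layout used for `R` in the table.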



In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [2]:
df = pd.read_csv('ml-latest-small/movies.csv')
df.set_index('movieId', inplace = True)
df = df.drop('genres', axis =1 )
df # dataframe with titles

movieId,title
1,Toy Story (1995)
2,Jumanji (1995)
3,Grumpier Old Men (1995)
4,Waiting to Exhale (1995)
5,Father of the Bride Part II (1995)
...,...
193581,Black Butler: Book of the Atlantic (2017)
193583,No Game No Life: Zero (2017)
193585,Flint (2017)
193587,Bungo Stray Dogs: Dead Apple (2018)


In [3]:
# Features are learned automatically, so one-hot genre columns are not needed.
# (Running this would also require keeping the 'genres' column dropped above.)

# genres = df['genres'].str.split('|').explode().unique()
# print(genres)

# for genre in genres:
#     df[genre] = np.where(df['genres'].str.contains(genre), 1 ,0) 
# df

In [4]:
# user rating df
urdf = pd.read_csv('ml-latest-small/ratings.csv')
urdf

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [5]:
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

"""
1. go to urdf and loop through all unique user IDs
    - for each UID, we pull the movieID and rating given by the user
2. create a new column for each user ID
    - match movieID and go to that row, fill in the rating
"""

for uid in urdf['userId'].unique():
    df.loc[urdf[urdf['userId'] == uid]['movieId'],'id' + str(uid) ] = urdf[urdf['userId'] == uid ]['rating'].values
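
The loop above adds one column per user. A possible alternative, sketched here on a tiny made-up ratings table, is `DataFrame.pivot`, which builds the movie-by-user matrix in one call. The toy data and the names `toy_ratings` and `wide` are assumptions for illustration, and the sketch assumes each (userId, movieId) pair appears at most once.

```python
import pandas as pd

# Toy stand-in for ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# Rows: movies, columns: users, cells: ratings (NaN where unrated)
wide = toy_ratings.pivot(index='movieId', columns='userId', values='rating')

# Match the 'id<userId>' column naming used in the loop
wide.columns = ['id' + str(uid) for uid in wide.columns]
print(wide)
```

Movies that nobody rated do not appear in the pivoted frame, so it would need a `reindex` against the movies table to line up with `df`.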

In [6]:
# df populated with ratings
df

movieId,title,id1,id2,id3,id4,id5,id6,id7,id8,id9,...,id601,id602,id603,id604,id605,id606,id607,id608,id609,id610
1,Toy Story (1995),4.0,,,,4.0,,4.5,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,Jumanji (1995),,,,,,4.0,,4.0,,...,,4.0,,5.0,3.5,,,2.0,,
3,Grumpier Old Men (1995),4.0,,,,,5.0,,,,...,,,,,,,,2.0,,
4,Waiting to Exhale (1995),,,,,,3.0,,,,...,,,,,,,,,,
5,Father of the Bride Part II (1995),,,,,,5.0,,,,...,,,,3.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,Black Butler: Book of the Atlantic (2017),,,,,,,,,,...,,,,,,,,,,
193583,No Game No Life: Zero (2017),,,,,,,,,,...,,,,,,,,,,
193585,Flint (2017),,,,,,,,,,...,,,,,,,,,,
193587,Bungo Stray Dogs: Dead Apple (2018),,,,,,,,,,...,,,,,,,,,,


### 4.1 Collaborative filtering cost function

The collaborative filtering cost function is given by
$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+ \underbrace{\left[
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
\right]}_{regularization}
$$
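The first summation in the cost function above runs over all pairs $(i, j)$ for which $r(i,j) = 1$. Because $r(i,j)$ is either 0 or 1, it can equivalently be written as a sum over all pairs, weighted by $r(i,j)$: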

$$
= \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+\text{regularization}
$$
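
As a quick numerical check of this equivalence, the sketch below computes the squared-error term both ways on small made-up arrays: once by looping over rated pairs only, and once by summing over all pairs weighted by $r(i,j)$. All sizes and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_m, n_u, n = 3, 2, 2                      # made-up sizes
X = rng.normal(size=(n_m, n))
W = rng.normal(size=(n_u, n))
b = rng.normal(size=(1, n_u))
Y = rng.integers(1, 6, size=(n_m, n_u)).astype(float)
R = np.array([[1, 0], [1, 1], [0, 1]])     # r(i,j)
lambda_ = 1.0

# Sum only over pairs with r(i,j) = 1
J_loop = 0.0
for i in range(n_m):
    for j in range(n_u):
        if R[i, j] == 1:
            J_loop += 0.5 * (W[j] @ X[i] + b[0, j] - Y[i, j]) ** 2

# Sum over all pairs, weighted by r(i,j)
J_mask = 0.5 * np.sum(R * (X @ W.T + b - Y) ** 2)

# The regularization term is the same in both forms
reg = (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))
print(np.isclose(J_loop + reg, J_mask + reg))
```

The masked form is the one that vectorizes cleanly, since it needs no Python-level loop over rated pairs.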


In [7]:
def collab_filtering(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.
    Vectorized for speed. Uses TensorFlow operations to be compatible with a custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user bias parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    # j holds the masked residuals: R * (X (nm, n) @ W.T (n, nu) + b (1, nu) - Y),
    # so entries without a rating (R = 0) contribute nothing to the cost
    
    # # non tf version
    # j = R*(X@W.T + b -Y )
    # J = 0.5 * (j**2).sum() + (lambda_/2) * ((X**2).sum() + (W**2).sum())
    
    
    # tf version
    j = R * (X @ tf.transpose(W) + b - Y) 
    J = 0.5 * tf.reduce_sum(j**2) + lambda_/2 * ( tf.reduce_sum(X**2) + tf.reduce_sum(W**2) )
    return J

In [8]:
# personal ratings of assorted movies I saw a long time ago

df.loc[187541, 'my rating'] =  5 #                                 Incredibles 2 (2018)
df.loc[187593, 'my rating'] =  4.5 #                                    Deadpool 2 (2018)
df.loc[143245, 'my rating'] =  4 #                             The Little Prince (2015)
df.loc[6539, 'my rating'] =  5 #      Pirates of the Caribbean: The Curse of the Bla...
df.loc[86880, 'my rating'] = 4.5  #     Pirates of the Caribbean: On Stranger Tides (2...
df.loc[45722, 'my rating'] =  4.5 #     Pirates of the Caribbean: Dead Man's Chest (2006)
df.loc[53125, 'my rating'] =  4.5 #       Pirates of the Caribbean: At World's End (2007)
df.loc[193583, 'my rating'] = 5  #                         No Game No Life: Zero (2017)
df.loc[8360, 'my rating'] =  4 #                                         Shrek 2 (2004)
df.loc[4306, 'my rating'] =  4 #                                           Shrek (2001)
df.loc[53121, 'my rating'] =  3.7 #                                Shrek the Third (2007)
df.loc[60069, 'my rating'] =  3 #                                         WALL·E (2008)
df.loc[111659, 'my rating'] =  2 #                                    Maleficent (2014)
df.loc[6550, 'my rating'] =  3.5 #                                  Johnny English (2003)
df.loc[90522, 'my rating'] =  4 #                          Johnny English Reborn (2011)
df.loc[88140, 'my rating'] =  4.7 #             Captain America: The First Avenger (2011)
df.loc[67295, 'my rating'] =  2.5 #     Kung Fu Panda: Secrets of the Furious Five (2008)
df.loc[63859, 'my rating'] =  2 #                                           Bolt (2008)
df.loc[60072, 'my rating'] = 1.5  #                                         Wanted (2008)
df.loc[60074, 'my rating'] =  2 #                                        Hancock (2008)
df.loc[122912, 'my rating'] = 4.6  #               Avengers: Infinity War - Part I (2018)
df.loc[89745, 'my rating'] =  4.7 #                                  Avengers, The (2012)
df.loc[122892, 'my rating'] = 4.7  #                       Avengers: Age of Ultron (2015)


In [9]:
# rating df
rdf = df.drop('title', axis = 1) 
rdf

movieId,id1,id2,id3,id4,id5,id6,id7,id8,id9,id10,...,id602,id603,id604,id605,id606,id607,id608,id609,id610,my rating
1,4.0,,,,4.0,,4.5,,,,...,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0,
2,,,,,,4.0,,4.0,,,...,4.0,,5.0,3.5,,,2.0,,,
3,4.0,,,,,5.0,,,,,...,,,,,,,2.0,,,
4,,,,,,3.0,,,,,...,,,,,,,,,,
5,,,,,,5.0,,,,,...,,,3.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,,,,,,,,,,,...,,,,,,,,,,
193583,,,,,,,,,,,...,,,,,,,,,,5.0
193585,,,,,,,,,,,...,,,,,,,,,,
193587,,,,,,,,,,,...,,,,,,,,,,


In [10]:
# boolean mask matrix R (True where a rating exists, False where the cell is NaN)
R = ~rdf.isna().values
R

array([[ True, False, False, ...,  True,  True, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [11]:
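# per-movie mean rating (row mean); added back to the predictions later
Ymean = rdf.mean(axis =1).values
# note: rdf.mean() is the column mean, so this subtracts each user's mean rating,
# not the per-movie mean stored in Ymean; rdf.sub(Ymean, axis=0) would subtract per movie
Ynorm = rdf - rdf.mean()

# replace all NaN values with 0
Ynorm[rdf.isna()] = 0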
Ynorm

movieId,id1,id2,id3,id4,id5,id6,id7,id8,id9,id10,...,id602,id603,id604,id605,id606,id607,id608,id609,id610,my rating
1,-0.366379,0.0,0.0,0.0,0.363636,0.000000,1.269737,0.000000,0.0,0.0,...,0.000000,0.492047,-0.48,0.789593,-1.157399,0.213904,-0.634176,-0.27027,1.311444,0.000000
2,0.000000,0.0,0.0,0.0,0.000000,0.506369,0.000000,0.425532,0.0,0.0,...,0.607407,0.000000,1.52,0.289593,0.000000,0.000000,-1.134176,0.00000,0.000000,0.000000
3,-0.366379,0.0,0.0,0.0,0.000000,1.506369,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,-1.134176,0.00000,0.000000,0.000000
4,0.000000,0.0,0.0,0.0,0.000000,-0.493631,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
5,0.000000,0.0,0.0,0.0,0.000000,1.506369,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,-0.48,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
193583,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,1.178261
193585,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
193587,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
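
In the cell above, `rdf - rdf.mean()` subtracts each column mean, that is, each user's mean rating, while `Ymean` holds the row (per-movie) means. If per-movie normalization is what is intended, one possible sketch, shown on a tiny made-up frame, uses `DataFrame.sub` with `axis=0`. The names `toy`, `movie_mean`, `toy_norm`, and `restored` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy movie-by-user ratings (made-up values), NaN = not rated
toy = pd.DataFrame(
    {'id1': [4.0, np.nan, 2.0], 'id2': [5.0, 3.0, np.nan]},
    index=pd.Index([10, 20, 30], name='movieId'),
)

movie_mean = toy.mean(axis=1)            # one mean per movie (row), NaN skipped
toy_norm = toy.sub(movie_mean, axis=0)   # subtract each row's mean from that row
toy_norm = toy_norm.fillna(0)            # unrated cells become 0

# Adding the movie mean back recovers the original ratings where rated
restored = toy_norm.add(movie_mean, axis=0).where(toy.notna())
print(toy_norm)
```

With this form, the quantity subtracted before training is the same one that is added back at prediction time.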


In [12]:
# Collaborative Filtering Params
num_movies, num_users = rdf.shape
num_features = 80

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

In [13]:
iterations = 500
lambda_ = 1
for it in range(iterations):  # 'it' avoids shadowing the built-in iter()
    # Use TensorFlow’s GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = collab_filtering(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )

    # Log every 20 iterations and stop early once the cost drops by less than 5%
    if it % 20 == 0:
        if it == 0:
            print(f'initial cost is {cost_value:0.2f}')
            cost_prev = cost_value
        else:
            cost_red =  (1 - cost_value/cost_prev)*100
            print(f"Training loss at iteration {it}: {cost_value:0.1f}, cost reduction of: {cost_red:0.2f}%")

            if cost_red < 5:
                print('cost reduction less than 5%, early stopping')
                break
            cost_prev = cost_value


initial cost is 4556431.88
Training loss at iteration 20: 221777.1, cost reduction of: 95.13%
Training loss at iteration 40: 85113.3, cost reduction of: 61.62%
Training loss at iteration 60: 41693.3, cost reduction of: 51.01%
Training loss at iteration 80: 24277.3, cost reduction of: 41.77%
Training loss at iteration 100: 16077.5, cost reduction of: 33.78%
Training loss at iteration 120: 11745.7, cost reduction of: 26.94%
Training loss at iteration 140: 9277.9, cost reduction of: 21.01%
Training loss at iteration 160: 7797.2, cost reduction of: 15.96%
Training loss at iteration 180: 6872.3, cost reduction of: 11.86%
Training loss at iteration 200: 6274.1, cost reduction of: 8.70%
Training loss at iteration 220: 5874.2, cost reduction of: 6.37%
Training loss at iteration 240: 5598.3, cost reduction of: 4.70%
cost reduction less than 5%, early stopping
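
The early-stopping rule compares the cost at each logged step with the cost 20 iterations earlier. As a sanity check, the sketch below recomputes two of the logged percentages from the printed costs. The helper name `cost_reduction_pct` is hypothetical; the numbers are copied from the log above.

```python
def cost_reduction_pct(cost_prev, cost_value):
    """Percentage drop from cost_prev to cost_value."""
    return (1 - cost_value / cost_prev) * 100

# Values copied from the training log above
first = cost_reduction_pct(4556431.88, 221777.1)   # iteration 0 -> 20
last = cost_reduction_pct(5874.2, 5598.3)          # iteration 220 -> 240

print(f"{first:0.2f}% {last:0.2f}%")
print(last < 5)  # True means the early-stopping condition fires
```

Because the threshold is relative, training stops when progress slows, regardless of the absolute size of the cost.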


In [14]:
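# prediction dataframe: model output plus the per-movie mean rating
pred = pd.DataFrame(X @ np.transpose(W) + b + Ymean.reshape(-1,1), columns = rdf.columns)
pred.index = df.index

# keep predictions only where an actual rating exists (zero out the rest),
# then summarize them per user for comparison with the real ratings
pred *= R
pred[pred!=0].describe()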

,id1,id2,id3,id4,id5,id6,id7,id8,id9,id10,...,id602,id603,id604,id605,id606,id607,id608,id609,id610,my rating
count,232.0,29.0,39.0,216.0,44.0,314.0,152.0,47.0,46.0,140.0,...,135.0,943.0,100.0,221.0,1115.0,187.0,831.0,37.0,1302.0,23.0
mean,3.557003,3.943654,3.580309,3.75205,3.711154,3.194684,3.504583,3.558334,3.310612,3.462066,...,3.520731,3.571103,3.285531,3.289325,3.587247,3.508439,3.285478,3.494013,3.459925,3.441616
std,1.033479,0.871187,2.250399,1.464191,1.089686,1.099066,1.603333,1.107605,1.811515,1.316021,...,1.124311,1.514366,0.882029,1.183428,1.110166,1.126141,1.444905,0.748854,1.336323,1.239283
min,-0.241915,1.41377,0.55327,-0.39793,0.991188,-0.904631,-0.807506,0.472217,-0.997162,-2.076476,...,0.4499,-1.422321,0.416715,-1.040211,-1.452105,0.110032,-1.677389,1.783394,-2.535009,1.122523
25%,3.015651,3.493484,1.59102,2.734462,2.954138,2.516398,2.363551,2.898736,1.937826,2.705624,...,2.871679,2.706472,2.689451,2.783022,2.974428,2.6718,2.536686,2.884441,2.619429,2.781877
50%,3.770287,4.077183,2.297439,3.990094,3.708486,3.113965,3.902818,3.480409,3.717633,3.594832,...,3.399598,3.902058,3.205507,3.358342,3.804735,3.637474,3.402824,3.380041,3.557577,3.79302
75%,4.368133,4.526664,5.739227,5.055172,4.465837,3.971743,4.719611,4.318235,4.473728,4.470119,...,4.140466,4.488628,3.959627,3.975089,4.311743,4.269591,4.403686,3.843501,4.338586,4.36099
max,5.031035,5.788553,7.356044,6.368377,5.48438,5.91489,6.004442,5.698433,5.929364,5.662229,...,5.999404,6.445673,5.616507,6.009458,5.996855,5.609879,6.393304,5.13697,6.236011,5.179659


In [15]:
# my ratings next to the model's predictions for the same movies

rated = rdf['my rating'].notna()  # movies I rated
dff = pd.DataFrame()
dff['title'] = df.loc[rated, 'title']
dff['actual'] = rdf.loc[rated, 'my rating']
dff['pred'] = pred.loc[rated, 'my rating'].round(2)
# row mean over all numeric columns, so 'my rating' is included
dff['mean rating'] = df.loc[dff.index].mean(axis=1, numeric_only = True).round(2)
# count(axis=1) counts every non-null cell, including 'title' and 'my rating'
dff['number of ratings'] = df.loc[dff.index].count(axis=1)
dff

movieId,title,actual,pred,mean rating,number of ratings
4306,Shrek (2001),4.0,4.13,3.87,172
6539,Pirates of the Caribbean: The Curse of the Bla...,5.0,4.86,3.79,151
6550,Johnny English (2003),3.5,2.7,3.0,15
8360,Shrek 2 (2004),4.0,3.71,3.58,94
45722,Pirates of the Caribbean: Dead Man's Chest (2006),4.5,4.16,3.52,74
53121,Shrek the Third (2007),3.7,2.86,3.05,23
53125,Pirates of the Caribbean: At World's End (2007),4.5,3.96,3.46,58
60069,WALL·E (2008),3.0,3.46,4.05,106
60072,Wanted (2008),1.5,1.15,3.09,24
60074,Hancock (2008),2.0,1.44,3.0,31
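
One simple way to summarize how close the predictions are is the mean absolute error. The sketch below computes it for the ten rows shown above, with the `actual` and `pred` values copied from that output. Since these ratings were part of the training data, the result describes fit rather than how well the model generalizes. The names `shown`, `abs_err`, and `mae` are hypothetical.

```python
import pandas as pd

# 'actual' and 'pred' columns copied from the rows shown above
shown = pd.DataFrame({
    'actual': [4.0, 5.0, 3.5, 4.0, 4.5, 3.7, 4.5, 3.0, 1.5, 2.0],
    'pred':   [4.13, 4.86, 2.7, 3.71, 4.16, 2.86, 3.96, 3.46, 1.15, 1.44],
})

# Absolute difference per movie, then the average over the shown rows
abs_err = (shown['actual'] - shown['pred']).abs()
mae = abs_err.mean()
print(f"MAE over the {len(shown)} shown rows: {mae:0.3f}")
```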
