# Collaborative Filtering Recommender Systems

This notebook is based on my notes from the course Unsupervised Learning, Recommenders, Reinforcement Learning by DeepLearning.AI on Coursera. I would like to thank Andrew Ng for this great lecture. The notation and the vectorized cost function are copied directly from the lecture. I prepared the data from scratch, collected input data from the user, and made the code modular (app.py, train_predict_page.py and utils.py). I also trained the model with different hyperparameters. You can also run the project as a dashboard built with Streamlit :)


The data set is derived from the [MovieLens "ml-latest-small"](https://grouplens.org/datasets/movielens/latest/) dataset.   
[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>]

## Making recommendations

Movies are recommended to you based on the ratings of users who rated movies similarly to you.

The predicted rating of user j on movie i is: w_j * x_i + b_j
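
As a quick illustration of the formula above, here is a minimal NumPy sketch with made-up numbers. The values of `w_j`, `x_i` and `b_j` are hypothetical and are not taken from the trained model.

```python
import numpy as np

# Hypothetical parameters for one user j and features for one movie i (3 features)
w_j = np.array([0.5, -1.0, 2.0])
b_j = 0.3
x_i = np.array([1.0, 0.5, 1.5])

# Predicted rating: dot product of user parameters and movie features, plus the user bias
pred = np.dot(w_j, x_i) + b_j
print(round(pred, 2))
```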

In [1]:
from platform import python_version
print(python_version())

3.10.0


In [2]:
from datetime import datetime
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from recsys_utils import *

*by DeepLearning.AI*


| General <br /> Notation | Description | Python (if any) |
|:------------------------|:------------|:----------------|
| $r(i,j)$ | scalar; 1 if user j rated movie i, 0 otherwise | |
| $y(i,j)$ | scalar; rating given by user j on movie i (defined only if r(i,j) = 1) | |
| $\mathbf{w}_j$ | vector; parameters for user j | |
| $b_j$ | scalar; parameter for user j | |
| $\mathbf{x}_i$ | vector; feature ratings for movie i | |
| $n_u$ | number of users | num_users |
| $n_m$ | number of movies | num_movies |
| $n$ | number of features | num_features |
| $\mathbf{X}$ | matrix of vectors $\mathbf{x}_i$ | X |
| $\mathbf{W}$ | matrix of vectors $\mathbf{w}_j$ | W |
| $\mathbf{b}$ | vector of bias parameters $b_j$ | b |
| $\mathbf{R}$ | matrix of elements r(i,j) | R |

## Load Datasets

Let's load the datasets and do some exploratory data analysis (EDA).

In [3]:
df_movie = pd.read_csv('./data/ml-latest-small/movies.csv')
df_ratings = pd.read_csv('./data/ml-latest-small/ratings.csv')
df_tags = pd.read_csv('./data/ml-latest-small/tags.csv')
df_links = pd.read_csv('./data/ml-latest-small/links.csv')

In [4]:
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
df_ratings[df_ratings.userId == 3]

Unnamed: 0,userId,movieId,rating,timestamp
261,3,31,0.5,1306463578
262,3,527,0.5,1306464275
263,3,647,0.5,1306463619
264,3,688,0.5,1306464228
265,3,720,0.5,1306463595
266,3,849,5.0,1306463611
267,3,914,0.5,1306463567
268,3,1093,0.5,1306463627
269,3,1124,0.5,1306464216
270,3,1263,0.5,1306463569


In [7]:
df_ratings['datetime'] = pd.to_datetime(df_ratings['timestamp'], unit='s')
df_ratings

Unnamed: 0,userId,movieId,rating,timestamp,datetime
0,1,1,4.0,964982703,2000-07-30 18:45:03
1,1,3,4.0,964981247,2000-07-30 18:20:47
2,1,6,4.0,964982224,2000-07-30 18:37:04
3,1,47,5.0,964983815,2000-07-30 19:03:35
4,1,50,5.0,964982931,2000-07-30 18:48:51
...,...,...,...,...,...
100831,610,166534,4.0,1493848402,2017-05-03 21:53:22
100832,610,168248,5.0,1493850091,2017-05-03 22:21:31
100833,610,168250,5.0,1494273047,2017-05-08 19:50:47
100834,610,168252,5.0,1493846352,2017-05-03 21:19:12


## General Analysis

In [8]:
# How many users in the data?

num_users = df_ratings.userId.nunique()
num_users

610

In [9]:
# How many movies in the data?

num_movies = df_ratings.movieId.nunique()
num_movies

9724

In [10]:
# Control: count the movies listed in movies.csv (this includes movies without any rating)
num_movies_control = df_movie.movieId.nunique()
num_movies_control

9742
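
The control count is larger than `num_movies`, which means `movies.csv` lists some movies that have no ratings. A small sketch of how one might find them, using toy frames with the same column names (the data here is invented):

```python
import pandas as pd

# Toy stand-ins for df_movie and df_ratings (invented data, same column names)
movies = pd.DataFrame({"movieId": [1, 2, 3, 4],
                       "title": ["A", "B", "C", "D"]})
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [1, 3, 3],
                        "rating": [4.0, 3.5, 5.0]})

# Movies that never appear in the ratings table
unrated = movies[~movies["movieId"].isin(ratings["movieId"])]
print(unrated["title"].tolist())
```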

## Average Ratings for each Movie

Calculate the average ratings & number of ratings for each movie. 

In [11]:
df_temp = df_ratings.groupby('movieId').agg({'movieId': 'count', 'rating':'mean'})
df_temp.rename(columns={'movieId':'number_of_ratings', 'rating': 'mean_rating'}, inplace=True)
df_ratings_mean = df_temp.reset_index()

df_ratings_mean = pd.merge(df_ratings_mean, df_movie, on='movieId', how='left')
df_ratings_mean.tail(10)

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres
9714,193565,1,3.5,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
9715,193567,1,3.0,anohana: The Flower We Saw That Day - The Movi...,Animation|Drama
9716,193571,1,4.0,Silver Spoon (2014),Comedy|Drama
9717,193573,1,4.0,Love Live! The School Idol Movie (2015),Animation
9718,193579,1,3.5,Jon Stewart Has Left the Building (2015),Documentary
9719,193581,1,4.0,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9720,193583,1,3.5,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9721,193585,1,3.5,Flint (2017),Drama
9722,193587,1,3.5,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9723,193609,1,4.0,Andrew Dice Clay: Dice Rules (1991),Comedy


The movie ID and the row index do not match. We need to keep the index number, because it is the row position of the movie in the rating matrix and is used for selection later.
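
To illustrate why the positional index matters: the rows of the rating matrix are numbered 0, 1, 2, ... while `movieId` values have gaps. A sketch with invented IDs, where `movie_to_row` is a hypothetical helper name:

```python
import pandas as pd

# Invented movieIds with gaps, sorted as groupby('movieId') would return them
df = pd.DataFrame({"movieId": [1, 3, 47, 50]})

# Positional index of each movie = its row in the rating matrix
df["movie_id_2"] = df.index

# Lookup from movieId to matrix row
movie_to_row = dict(zip(df["movieId"], df["movie_id_2"]))
print(movie_to_row[47])
```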

In [12]:
df_ratings_mean['movie_id_2'] = df_ratings_mean.index

In [13]:
df_ratings_mean.tail()

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
9719,193581,1,4.0,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,9719
9720,193583,1,3.5,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,9720
9721,193585,1,3.5,Flint (2017),Drama,9721
9722,193587,1,3.5,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,9722
9723,193609,1,4.0,Andrew Dice Clay: Dice Rules (1991),Comedy,9723


In [14]:
movieList = list(df_ratings_mean.title)
len(movieList)

9724

In [15]:
movieList_2 = list(set(df_ratings_mean.title))
len(movieList_2)

9719

In [16]:
import collections
print([item for item, count in collections.Counter(movieList).items() if count > 1])

['Emma (1996)', 'Saturn 3 (1980)', 'Confessions of a Dangerous Mind (2002)', 'Eros (2004)', 'War of the Worlds (2005)']


In [17]:
df_ratings_mean = df_ratings_mean.drop_duplicates(subset=["title"], keep="first")

In [18]:
df_ratings_mean[df_ratings_mean.title == "Emma (1996)"]

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
650,838,30,3.916667,Emma (1996),Comedy|Drama|Romance,650


In [19]:
df_ratings_mean[df_ratings_mean.title == "Confessions of a Dangerous Mind (2002)"]

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
4163,6003,15,3.6,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller,4163


In [20]:
df_ratings_mean[df_ratings_mean.title == "Eros (2004)"]

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
5838,32600,1,3.5,Eros (2004),Drama,5838


In [21]:
df_ratings_mean[df_ratings_mean.title == "War of the Worlds (2005)"]

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
5915,34048,50,3.15,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller,5915


In [22]:
xx = df_ratings_mean.drop_duplicates(subset=["title"], keep="first")

In [23]:
xx[xx.title == "Confessions of a Dangerous Mind (2002)"]

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2
4163,6003,15,3.6,Confessions of a Dangerous Mind (2002),Comedy|Crime|Drama|Thriller,4163


## Genre

In [24]:
genre_types = df_movie.genres
genre_types_m = genre_types.str.split('|',expand=True)
all_genres = []
for i in range(10):
    all_genres.append(list(genre_types_m[i].unique()))

# https://stackoverflow.com/questions/952914/how-do-i-make-a-flat-list-out-of-a-list-of-lists
list_genre = [item for sublist in all_genres for item in sublist]
list_genre = list(set(list_genre))
list_genre.remove("(no genres listed)")
list_genre.remove(None)
list_genre

['Thriller',
 'Western',
 'Mystery',
 'Documentary',
 'IMAX',
 'Film-Noir',
 'Crime',
 'Musical',
 'Adventure',
 'Fantasy',
 'Action',
 'War',
 'Drama',
 'Animation',
 'Romance',
 'Horror',
 'Sci-Fi',
 'Children',
 'Comedy']

In [25]:
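# One indicator column per genre: 1 if the movie's genre string contains it, else 0.
# regex=False matches the genre name literally; na=False maps missing genres to 0.
for col in list_genre:
    df_ratings_mean[col] = df_ratings_mean['genres'].str.contains(col, regex=False, na=False).astype(int)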
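
An alternative sketch for the same one-hot encoding: pandas can split the `|`-separated string and build the indicator columns in one call with `Series.str.get_dummies`. The toy frame below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
                    "genres": ["Adventure|Animation|Comedy", "Comedy|Romance"]})

# One indicator column per genre found in the '|'-separated strings
genre_flags = toy["genres"].str.get_dummies(sep="|")
toy = pd.concat([toy, genre_flags], axis=1)
print(toy[["title", "Comedy", "Romance"]])
```

Unlike the loop, this creates a column for every token it finds, so a value such as `(no genres listed)` would get its own column unless it is dropped afterwards.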

## Create matrices

ùëü(ùëñ,ùëó) 	scalar; = 1 if user j rated movie i = 0 otherwise	
ùë¶(ùëñ,ùëó) 	scalar; = rating given by user j on movie i (if r(i,j) = 1 is defined)	

In [26]:
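# Y: rows = movies, columns = users, entries = ratings (0 where no rating exists)
y_matrix = df_ratings.pivot(index='movieId', columns='userId', values='rating')
y_matrix = y_matrix.fillna(0)

# R: same layout, 1 where a rating exists and 0 otherwise
df_ratings['temp'] = 1
r_matrix = df_ratings.pivot(index='movieId', columns='userId', values='temp')
r_matrix = r_matrix.fillna(0)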

In [27]:
y_matrix = y_matrix.reset_index()
y_matrix = y_matrix.drop('movieId', axis=1)

r_matrix = r_matrix.reset_index()
r_matrix = r_matrix.drop('movieId', axis=1)

In [28]:
Y = y_matrix.to_numpy()
R = r_matrix.to_numpy()
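
The same construction on a toy ratings table, as a sketch. Here R is derived from the missing values of the pivot rather than from a helper column; the data is invented.

```python
import pandas as pd

# Invented ratings: 3 movies, 2 users
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [10, 30, 20],
                        "rating": [4.0, 2.5, 5.0]})

# Rows = movies, columns = users; missing (movie, user) pairs become NaN
pivot = ratings.pivot(index="movieId", columns="userId", values="rating")

R = pivot.notna().to_numpy().astype(float)  # 1 where a rating exists
Y = pivot.fillna(0).to_numpy()              # unrated entries set to 0
print(Y)
print(R)
```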

## Collaborative filtering cost function

*by DeepLearning.AI*

The collaborative filtering cost function is given by

$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2
+\underbrace{
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
}_{regularization}
\tag{1}$$
The first summation in (1) is "for all $i$, $j$ where $r(i,j)$ equals $1$" and could be written:

$$
= \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2
+\text{regularization}
$$


In [29]:
# Cost function with loop
def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    for j in range(nu):
        for i in range(nm):
            # Prediction for user j on movie i
            pred = np.dot(W[j,:], X[i,:]) + b[0,j]
            # R[i,j] masks out the pairs without a rating
            squared_error = R[i,j] * np.square(pred - Y[i,j])
            J += squared_error
    J /= 2
    # Regularization on user parameters and movie features (b is not regularized)
    reg_w = lambda_ * np.sum(np.square(W)) * 0.5
    reg_x = lambda_ * np.sum(np.square(X)) * 0.5

    J = J + reg_w + reg_x
    return J

In [30]:
# Vectorized cost function
# *by DeepLearning.AI*
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
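
As a sanity check, the loop and the vectorized formulation should return the same number. The sketch below re-implements both in plain NumPy on a tiny invented example; the function names `cost_loop` and `cost_vec` are hypothetical and not part of the lecture code.

```python
import numpy as np

def cost_loop(X, W, b, Y, R, lambda_):
    # Sum squared errors only over rated (i, j) pairs, then add regularization
    nm, nu = Y.shape
    J = 0.0
    for j in range(nu):
        for i in range(nm):
            if R[i, j] == 1:
                J += (np.dot(W[j], X[i]) + b[0, j] - Y[i, j]) ** 2
    return J / 2 + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

def cost_vec(X, W, b, Y, R, lambda_):
    # Same cost: mask the error matrix with R instead of looping
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

rng = np.random.default_rng(0)
nm, nu, n = 4, 3, 2                      # movies, users, features
X = rng.normal(size=(nm, n))
W = rng.normal(size=(nu, n))
b = rng.normal(size=(1, nu))
R = (rng.random((nm, nu)) > 0.5).astype(float)
Y = rng.integers(1, 6, size=(nm, nu)) * R

print(np.isclose(cost_loop(X, W, b, Y, R, 1.5), cost_vec(X, W, b, Y, R, 1.5)))
```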

## Input From User

Now it is time to add some personal data to the dataset. I am aware that this part is not well designed, but it was fun to write :)

In [31]:
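def get_ratings_from_user(df_ratings_mean, genre):
    """
    Ask the user to rate 5 movies of the given genre.
    The 5 movies are drawn at random from the 30 most-rated movies of that genre.
    Each rated movie is then removed from the returned dataframe, since a movie may
    have more than one genre and should not be offered twice.
    Ratings are written into the global array my_ratings, which must exist before the call.

    df_ratings_mean: (df) per-movie rating statistics
    genre: (str) specific genre
    """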
    selected_movies = df_ratings_mean[df_ratings_mean['genres'].str.contains(genre)].sort_values(
        by='number_of_ratings', ascending=False)
    selected_movies = selected_movies.head(30)
    selected_movies = selected_movies.sample(frac=1)
    for i in range(5):
        print('Movie number ', str(i), ': ', selected_movies.title.iloc[i])
        print('Movie number index', selected_movies.movie_id_2.iloc[i])
        rating_i = int(input('Your rating: '))
        movie_idx = selected_movies.movieId.iloc[i]
        idx = selected_movies.movie_id_2.iloc[i]
        my_ratings[idx] = rating_i
        df_ratings_mean = df_ratings_mean[(df_ratings_mean.movieId  != movie_idx)]
    return my_ratings, df_ratings_mean  

In [32]:
df_ratings_mean_temp = df_ratings_mean.copy()

In [33]:
df_ratings_mean_temp.head()

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2,Thriller,Western,Mystery,Documentary,...,Fantasy,Action,War,Drama,Animation,Romance,Horror,Sci-Fi,Children,Comedy
0,1,215,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,1
1,2,110,3.431818,Jumanji (1995),Adventure|Children|Fantasy,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
2,3,52,3.259615,Grumpier Old Men (1995),Comedy|Romance,2,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
3,4,7,2.357143,Waiting to Exhale (1995),Comedy|Drama|Romance,3,0,0,0,0,...,0,0,0,1,0,1,0,0,0,1
4,5,49,3.071429,Father of the Bride Part II (1995),Comedy,4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [34]:
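# Note: list_genre[0] is a single string, so this loop runs once per character of that
# string and selects the same top-2 movies of the first genre every time (hence the
# repeated rows in the output). To get the top 2 movies per genre, iterate over
# list_genre and filter on df_ratings_mean_temp[genre] == 1 instead.
all_genres = []
for genre in list_genre[0]:
    genre_temp = df_ratings_mean_temp[np.array(df_ratings_mean_temp.filter(regex=list_genre[0]) == 1).reshape
                                      (len(df_ratings_mean_temp),)].sort_values(
        by='number_of_ratings', ascending=False).head(2)
    all_genres.append(genre_temp)

all_genres_df = pd.concat(all_genres)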

In [35]:
all_genres_df

Unnamed: 0,movieId,number_of_ratings,mean_rating,title,genres,movie_id_2,Thriller,Western,Mystery,Documentary,...,Fantasy,Action,War,Drama,Animation,Romance,Horror,Sci-Fi,Children,Comedy
257,296,307,4.197068,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,257,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
510,593,279,4.16129,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,510,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
257,296,307,4.197068,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,257,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
510,593,279,4.16129,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,510,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
257,296,307,4.197068,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,257,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
510,593,279,4.16129,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,510,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
257,296,307,4.197068,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,257,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
510,593,279,4.16129,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,510,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
257,296,307,4.197068,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,257,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
510,593,279,4.16129,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,510,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [36]:
# One entry per movie; 0 means "not rated by the new user"
my_ratings = np.zeros(num_movies)

In [37]:
print(df_ratings_mean.shape)
df_ratings_mean_temp = df_ratings_mean.copy()

print("Give 0, if you haven't seen the movie yet. Give 1-5 ratings. Let's start!:) ")
comedy_b = input('Love comedies? Y/N: ')
if comedy_b == 'Y':
    my_ratings, df_ratings_mean_temp = get_ratings_from_user(df_ratings_mean_temp, "Comedy")

print(df_ratings_mean_temp.shape)

romance_b = input('Love romance? Y/N: ')
if romance_b == 'Y':
    my_ratings, df_ratings_mean_temp = get_ratings_from_user(df_ratings_mean_temp, "Romance")

scific_b = input('Love scifi? Y/N: ')
if scific_b == 'Y':
    my_ratings, df_ratings_mean_temp = get_ratings_from_user(df_ratings_mean_temp, "Sci")

name = input('What is you favourite movie? ')
fav_mov = df_ratings_mean_temp[df_ratings_mean_temp['title'].str.lower().str.contains(name.lower())].sort_values(
    by='number_of_ratings', ascending=False)
# print(fav_mov)
if len(fav_mov)>=1:
    for i in range(min(4, len(fav_mov))):
        print('Movie number ', str(i), ': ', fav_mov.title.iloc[i])
        rating_i = int(input('Your rating: '))
        print('Movie number index', fav_mov.movie_id_2.iloc[i])
        movie_idx = fav_mov.movieId.iloc[i]
        idx = fav_mov.movie_id_2.iloc[i]
        my_ratings[idx] = rating_i
        # Drop the movie that was just rated so it is not offered again
        df_ratings_mean_temp = df_ratings_mean_temp[df_ratings_mean_temp.movieId != movie_idx]
else:
    print('No title matched your search.')

print('Thank you:) Wait for the recommendation!')

(9719, 25)
Give 0, if you haven't seen the movie yet. Give 1-5 ratings. Let's start!:) 
Love comedies? Y/N: Y
Movie number  0 :  Forrest Gump (1994)
Movie number index 314
Your rating: 1
Movie number  1 :  Shrek (2001)
Movie number index 3189
Your rating: 1
Movie number  2 :  True Lies (1994)
Movie number index 337
Your rating: 1
Movie number  3 :  Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Movie number index 4421
Your rating: 1
Movie number  4 :  Fifth Element, The (1997)
Movie number index 1157
Your rating: 5
(9714, 25)
Love romance? Y/N: N
Love scifi? Y/N: N
What is you favourite movie? Matrix
Movie number  0 :  Matrix, The (1999)
Your rating: 4
Movie number index 1938
Movie number  1 :  Matrix Reloaded, The (2003)
Your rating: 4
Movie number index 4345
Movie number  2 :  Matrix Revolutions, The (2003)
Your rating: 4
Movie number index 4631
Movie number  3 :  Animatrix, The (2003)
Your rating: 4
Movie number index 5656
Thank you:) Wait for the recommendation!


In [38]:
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]

print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0 :
        print(i)
        print(f'Rated {my_ratings[i]} for  {df_ratings_mean.loc[i,"title"]}');


New user ratings:

314
Rated 1.0 for  Forrest Gump (1994)
337
Rated 1.0 for  True Lies (1994)
1157
Rated 5.0 for  Fifth Element, The (1997)
1938
Rated 4.0 for  Matrix, The (1999)
3189
Rated 1.0 for  Shrek (2001)
4345
Rated 4.0 for  Matrix Reloaded, The (2003)
4421
Rated 1.0 for  Pirates of the Caribbean: The Curse of the Black Pearl (2003)
4631
Rated 4.0 for  Matrix Revolutions, The (2003)
5656
Rated 4.0 for  Animatrix, The (2003)


In [39]:
movieList[4345]

'Matrix Reloaded, The (2003)'

In [40]:
df_ratings_mean.loc[5150,"title"]

'Shrek 2 (2004)'

In [42]:
my_ratings

array([0., 0., 0., ..., 0., 0., 0.])

In [43]:
my_ratings[314]

1.0

## Add User Inputs to Dataset - example

How do we add the new user's data? Here is a small example dataset that illustrates the calculations. First, let's take the first 10 rows for 2 movies in the dataset:

In [44]:
example_series = df_ratings[(df_ratings.movieId == 1) | (df_ratings.movieId == 2)]
example_series = example_series.head(10)

example_series

Unnamed: 0,userId,movieId,rating,timestamp,datetime,temp
0,1,1,4.0,964982703,2000-07-30 18:45:03,1
516,5,1,4.0,847434962,1996-11-08 06:36:02,1
560,6,2,4.0,845553522,1996-10-17 11:58:42,1
874,7,1,4.5,1106635946,2005-01-25 06:52:26,1
1026,8,2,4.0,839463806,1996-08-08 00:23:26,1
1434,15,1,2.5,1510577970,2017-11-13 12:59:30,1
1667,17,1,4.5,1305696483,2011-05-18 05:28:03,1
1772,18,1,3.5,1455209816,2016-02-11 16:56:56,1
1773,18,2,3.0,1455617462,2016-02-16 10:11:02,1
2274,19,1,4.0,965705637,2000-08-08 03:33:57,1


We will create the Y and R matrices. Remember, Y holds the rating each user gave to each movie, and R is a binary indicator of whether a rating exists. Matrix rows represent the movies and columns represent the users. In our example dataset, there are 2 movies and 9 users.

In [45]:
ex_y = example_series.pivot(index='movieId', columns='userId', values='rating')
ex_y =ex_y.fillna(0)

ex_y = ex_y.reset_index()
ex_y = ex_y.drop('movieId', axis=1)

ex_Y = ex_y.to_numpy()
print(ex_Y.shape)

ex_Y

(2, 9)


array([[4. , 4. , 0. , 4.5, 0. , 2.5, 4.5, 3.5, 4. ],
       [0. , 0. , 4. , 0. , 4. , 0. , 0. , 3. , 0. ]])

In [46]:
ex_r = example_series.pivot(index='movieId', columns='userId', values='temp')
ex_r =ex_r.fillna(0)

ex_r = ex_r.reset_index()
ex_r = ex_r.drop('movieId', axis=1)

ex_R = ex_r.to_numpy()

ex_R

array([[1., 1., 0., 1., 0., 1., 1., 1., 1.],
       [0., 0., 1., 0., 1., 0., 0., 1., 0.]])

These are example ratings from the new user.

In [47]:
my_ratings_ex = np.zeros(2) 
my_ratings_ex[0] = 3
my_ratings_ex[1] = 1

my_ratings_ex

array([3., 1.])

We add the new user's ratings to Y and R as a new first column using NumPy's `np.c_`, which concatenates its arguments column-wise.

In [48]:
ex_Y_2 = np.c_[my_ratings_ex, ex_Y]
ex_Y_2

array([[3. , 4. , 4. , 0. , 4.5, 0. , 2.5, 4.5, 3.5, 4. ],
       [1. , 0. , 0. , 4. , 0. , 4. , 0. , 0. , 3. , 0. ]])

In [49]:
ex_R_2 = np.c_[(my_ratings_ex != 0).astype(int), ex_R]
ex_R_2

array([[1., 1., 1., 0., 1., 0., 1., 1., 1., 1.],
       [1., 0., 0., 1., 0., 1., 0., 0., 1., 0.]])
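
For reference, `np.c_` used this way behaves like `np.column_stack`: a 1-D array is treated as a column and placed in front of the matrix. A sketch with invented numbers:

```python
import numpy as np

new_user = np.array([3.0, 1.0])          # one rating per movie
Y_old = np.array([[4.0, 0.0],
                  [0.0, 5.0]])           # 2 movies x 2 users

Y_a = np.c_[new_user, Y_old]
Y_b = np.column_stack([new_user, Y_old])
print(Y_a.shape, np.array_equal(Y_a, Y_b))
```

Putting the new user first means their ratings and predictions live in column 0 of the enlarged matrix.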

Now we compute the per-movie mean rating and normalize Y with `normalizeRatings` (its definition is shown in the next section). Before adding the new user's ratings:

In [50]:
Ynorm_ex_pre, Ymean_ex_pre = normalizeRatings(ex_Y, ex_R)

Ymean_ex_pre

array([[3.85714286],
       [3.66666667]])

After the new user's ratings:

In [51]:
Ynorm_ex, Ymean_ex = normalizeRatings(ex_Y_2, ex_R_2)

Ymean_ex

array([[3.75],
       [3.  ]])

## Add User Inputs to Dataset 

*by DeepLearning.AI*

In [52]:
def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting the mean rating for every movie (every row).
    Only real ratings, R(i,j)=1, are included.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating of 0.
    Returns the mean rating in Ymean.
    """
    # Mean over rated entries only; the 1e-12 avoids division by zero for unrated movies
    Ymean = (np.sum(Y*R,axis=1)/(np.sum(R, axis=1)+1e-12)).reshape(-1,1)
    # Subtract the mean only where a rating exists
    Ynorm = Y - np.multiply(Ymean, R)
    return Ynorm, Ymean

In [53]:
Y.shape

(9724, 610)

In [54]:
# Add new user ratings to Y
Y = np.c_[my_ratings, Y]

# Add new user indicator matrix to R
R = np.c_[(my_ratings != 0).astype(int), R]

# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)

In [55]:
Y.shape

(9724, 611)

In [56]:
Y[5149,0]

0.0

## Prepare for Training

*by DeepLearning.AI*

In [57]:
#  Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

## Train

*by DeepLearning.AI*

In [58]:
iterations = 100
lambda_ = 1
for iter in range(iterations):
    # Use TensorFlow's GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient(cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )

    # Log periodically.
    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 5540417.9
Training loss at iteration 20: 279376.6
Training loss at iteration 40: 108050.9
Training loss at iteration 60: 53023.5
Training loss at iteration 80: 30312.4
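
To make the update step concrete without TensorFlow, here is a minimal NumPy sketch that minimizes the same cost with hand-derived gradients. It swaps Adam for plain fixed-step gradient descent and uses a tiny random problem, so it only illustrates the mechanics; the learning rate, sizes and variable names are assumptions, not the lecture's settings.

```python
import numpy as np

def cost(X, W, b, Y, R, lambda_):
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))

rng = np.random.default_rng(1234)
nm, nu, n = 6, 4, 3                          # movies, users, features
R = (rng.random((nm, nu)) > 0.3).astype(float)
Y = rng.integers(1, 6, size=(nm, nu)) * R
X = rng.normal(size=(nm, n))
W = rng.normal(size=(nu, n))
b = rng.normal(size=(1, nu))

lambda_, lr = 1.0, 0.01
initial_cost = cost(X, W, b, Y, R, lambda_)
for it in range(200):
    err = (X @ W.T + b - Y) * R              # masked prediction error
    grad_X = err @ W + lambda_ * X           # dJ/dX
    grad_W = err.T @ X + lambda_ * W         # dJ/dW
    grad_b = err.sum(axis=0, keepdims=True)  # dJ/db (b is not regularized)
    X -= lr * grad_X
    W -= lr * grad_W
    b -= lr * grad_b
final_cost = cost(X, W, b, Y, R, lambda_)

print(f"initial cost: {initial_cost:.2f}, after 200 steps: {final_cost:.2f}")
```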


## Recommendations

*by DeepLearning.AI*

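We predict the rating of user j on movie i by computing: w_j * x_i + b_j. Because the model was trained on mean-normalized ratings, the per-movie mean is added back afterwards.

The movies with the highest predicted ratings are the recommendations for user j.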

In [59]:
# Make a prediction using trained weights and biases
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
p.shape

(9724, 611)

In [60]:
#restore the mean
pm = p + Ymean

# Find the predictions for the new user (user id is 0)
my_predictions = pm[:,0]
my_predictions.shape

(9724,)
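
Adding `Ymean` back relies on broadcasting: `p` has shape (num_movies, num_users) and `Ymean` has shape (num_movies, 1), so each movie's mean is added across its whole row. A sketch with invented numbers:

```python
import numpy as np

p = np.array([[0.5, -0.5],
              [1.0,  0.0]])        # normalized predictions: 2 movies x 2 users
Ymean = np.array([[3.0],
                  [4.0]])          # per-movie mean, shape (2, 1)

pm = p + Ymean                     # broadcast across the user axis
print(pm)
```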

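These are the original and predicted ratings for the movies the new user rated. On these nine movies the predictions are within 0.2 of the original ratings. These ratings were part of the training data, so this checks the fit rather than how well the model generalizes.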

In [61]:
print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')



Original vs Predicted ratings:

Original 1.0, Predicted 1.17 for Forrest Gump (1994)
Original 1.0, Predicted 1.03 for True Lies (1994)
Original 5.0, Predicted 4.83 for Fifth Element, The (1997)
Original 4.0, Predicted 3.87 for Matrix, The (1999)
Original 1.0, Predicted 1.09 for Shrek (2001)
Original 4.0, Predicted 3.96 for Matrix Reloaded, The (2003)
Original 1.0, Predicted 1.09 for Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Original 4.0, Predicted 3.90 for Matrix Revolutions, The (2003)
Original 4.0, Predicted 3.86 for Animatrix, The (2003)


These are the highest predicted ratings and the recommendations for the user:

In [62]:
# sort predictions
idx_sorted_pred = tf.argsort(my_predictions, direction='DESCENDING')

for i in range(10):
    j = idx_sorted_pred[i]
    if j not in my_rated:
        print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')

Predicting rating 6.79 for movie 2001: A Space Odyssey (1968)
Predicting rating 5.88 for movie Raising Arizona (1987)
Predicting rating 5.41 for movie Goldfinger (1964)
Predicting rating 5.34 for movie Die Hard (1988)
Predicting rating 5.32 for movie Star Wars: Episode V - The Empire Strikes Back (1980)
Predicting rating 5.30 for movie Monsters, Inc. (2001)
Predicting rating 5.30 for movie The Revenant (2015)
Predicting rating 5.26 for movie Fargo (1996)
Predicting rating 5.19 for movie Brazil (1985)
Predicting rating 5.19 for movie Star Wars: Episode IV - A New Hope (1977)
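
Some of the predictions above exceed 5, the top of the MovieLens rating scale, because nothing in the model bounds its output. The loop also inspects only the ten highest predictions and skips rated movies, so it can print fewer than ten titles. One possible way to handle both is sketched below in NumPy with invented numbers; `top_k`, `ranking` and `shown` are hypothetical names.

```python
import numpy as np

predictions = np.array([6.8, 4.9, 5.4, 2.0, 5.9])
rated = np.array([False, False, True, False, False])  # True = already rated by the user

# Rank on the raw predictions, but never recommend already-rated movies
ranking = predictions.copy()
ranking[rated] = -np.inf

top_k = 3
top_idx = np.argsort(-ranking)[:top_k]

# Clip to the 0.5 to 5 rating scale only for display, so the ranking keeps its order
shown = np.clip(predictions[top_idx], 0.5, 5.0)
print(top_idx, shown)
```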


Here is another way of giving recommendations.

The index "idx_sorted_pred" was created earlier and orders the movies from the highest to the lowest prediction. We can take the 20 movies with the highest predictions, keep those with more than 20 ratings, and then sort them by mean rating.

In [63]:
# Add predictions. Five duplicate titles were dropped from df_ratings_mean earlier, so it has
# 9719 rows while my_predictions has 9724; movie_id_2 maps each row back to its row in Y.
df_ratings_mean["pred"] = my_predictions[df_ratings_mean["movie_id_2"].to_numpy()]

# reorder columns
df_ratings_mean_ = df_ratings_mean.reindex(columns=["pred", "mean_rating", "number_of_ratings", "title"])

# Take the 20 highest predictions (isin tolerates indices of dropped duplicates),
# keep movies with more than 20 ratings, and sort by mean_rating
top20 = df_ratings_mean_[df_ratings_mean_.index.isin(idx_sorted_pred[:20].numpy())]
top20[top20["number_of_ratings"] > 20].sort_values("mean_rating", ascending=False)
