# Recommender Systems (Movie Recommender in Python)

This is a Python implementation of the tutorial we did in R. We look at three recommender system methods: user-based collaborative filtering, item-based collaborative filtering, and matrix factorization. We use these approaches to **build a system for recommending movies (or anything else, for that matter) to users based on their past viewing habits**.

    TLDR; we apply several methods of **collaborative filtering** to build a system for recommending movies to users based on their past viewing habits.

In [410]:
# get modules 
import pandas as pd
import numpy as np
import math

# get data
viewed_movies = pd.read_csv('viewed_movies.csv')
ratings_red = pd.read_csv('ratings_red.csv')

In [411]:
# view viewed movies data
viewed_movies = viewed_movies.set_index('userId')
display(viewed_movies)
print("Shape of Matrix:", viewed_movies.shape) # num users and num movies

userId,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0
20,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0
187,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0,0
198,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0
212,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1
222,0,1,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1
282,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,0,0,0
328,1,1,1,1,0,0,1,1,1,0,1,1,0,0,1,1,0,0,0,1
330,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0
372,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


Shape of Matrix: (15, 20)


In [412]:
# view ratings data
display(ratings_red)
print("Shape of DataFrame:", ratings_red.shape) # num ratings (rows) and num columns

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,349,4.0,964982563,Clear and Present Danger (1994),Action|Crime|Drama|Thriller
1,1,1208,4.0,964983250,Apocalypse Now (1979),Action|Drama|War
2,1,1258,3.0,964983414,"Shining, The (1980)",Horror
3,1,1732,5.0,964981125,"Big Lebowski, The (1998)",Comedy|Crime
4,1,2115,5.0,964982529,Indiana Jones and the Temple of Doom (1984),Action|Adventure|Fantasy
...,...,...,...,...,...,...
111,594,253,0.5,1108950825,Interview with the Vampire: The Vampire Chroni...,Drama|Horror
112,594,329,4.0,1109036731,Star Trek: Generations (1994),Adventure|Drama|Sci-Fi
113,594,539,5.0,1109036787,Sleepless in Seattle (1993),Comedy|Drama|Romance
114,594,1208,3.0,1108798893,Apocalypse Now (1979),Action|Drama|War


Shape of DataFrame: (116, 6)


In [413]:
# rename first harry potter movie
viewed_movies.rename(columns={'Harry Potter and the Sorcerer\'s Stone (a.k.a. Harry Potter and the Philosopher\'s Stone) (2001)':'Harry Potter and the Philosopher\'s Stone (2001)'}, inplace=True)
display(viewed_movies)

userId,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Philosopher's Stone (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0
20,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0
187,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0,0
198,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0
212,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1
222,0,1,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1
282,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,0,0,0
328,1,1,1,1,0,0,1,1,1,0,1,1,0,0,1,1,0,0,0,1
330,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0
372,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


Some of the later functions expect the data in matrix form. In Python we keep the pandas DataFrame for now and convert it to a NumPy array with `to_numpy()` further down, where an array is needed.

In [414]:
viewed_movies

userId,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Philosopher's Stone (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0
20,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0
187,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0,0
198,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0
212,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1
222,0,1,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1
282,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,0,0,0
328,1,1,1,1,0,0,1,1,1,0,1,1,0,0,1,1,0,0,0,1
330,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0
372,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


***
# User-Based Collaborative Filtering

## Basic Idea

A really simple recommender system would just recommend the most popular movies (that a user hasn't seen before). This information is obtained by summing the values of each column of *viewed_movies*:

In [415]:
# sum the columns of dataframe
viewed_movies.sum(axis=0)

# sort it descending
viewed_movies.sum(axis=0).sort_values(ascending=False)

Shining, The (1980)                                          11
Big Lebowski, The (1998)                                     10
Apocalypse Now (1979)                                         9
Kill Bill: Vol. 2 (2004)                                      8
2001: A Space Odyssey (1968)                                  7
Departed, The (2006)                                          7
Ferris Bueller's Day Off (1986)                               7
Green Mile, The (1999)                                        7
Bourne Identity, The (2002)                                   6
Jumanji (1995)                                                6
Indiana Jones and the Temple of Doom (1984)                   5
Harry Potter and the Philosopher's Stone (2001)               5
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)       5
Up (2009)                                                     5
Interview with the Vampire: The Vampire Chronicles (1994)     4
Donnie Darko (2001)                     

This approach has an intuitive appeal but is pretty unsophisticated: everyone gets the same recommendations, barring the filtering out of seen movies. In other words, everyone's vote counts the same.
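As a rough sketch of this popularity baseline, including the filtering of seen movies, the snippet below uses a small made-up viewing matrix. The toy data and the function name `popularity_recommendations` are assumptions for illustration, not part of the tutorial data.

```python
import pandas as pd

# Hypothetical toy viewing matrix (rows: users, columns: movies); not the tutorial data.
toy_viewed = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 1]],
    index=[101, 102, 103],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

def popularity_recommendations(user, viewed):
    """Rank movies by total number of views, dropping those `user` has already seen."""
    popularity = viewed.sum(axis=0)   # one count per movie
    unseen = viewed.loc[user] == 0    # boolean mask over movies
    return popularity[unseen].sort_values(ascending=False)

popular_for_101 = popularity_recommendations(101, toy_viewed)
print(popular_for_101)
```

Every user gets the same ranking here; only the filter on seen movies differs from user to user.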

User-based CF extends the approach by changing how much each person's vote counts. Specifically, when recommending what I should watch next, a user-based CF system will up-weight the votes of people that are "more similar" to me. In this context "similar" means "has seen many of the same movies as me". You can think of this as replacing the 1's in the *viewed_movies* dataframe with a number that increases with similarity to the user we're trying to recommend a movie to.

There are lots of different similarity measures. The one we'll use is called **cosine similarity** and is widely used.

Cosine similarity derives its name from the fact that it measures the cosine of the angle between two non-zero vectors. The closer the vectors lie to each other, the smaller the angle, and the closer the cosine is to 1. It can be shown that for two vectors $\boldsymbol x$ and $\boldsymbol y$:

$$\cos(\theta) = \frac{\boldsymbol x \cdot \boldsymbol y}{||\boldsymbol x|| \ ||\boldsymbol y||} = \frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x^2_i} \sqrt{\sum_{i=1}^{n}y^2_i}}$$

In R we could use the `crossprod()` function to calculate the dot products; in Python we use `np.dot()` for the dot product and `np.linalg.norm()` for the vector lengths.
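As a small worked example (with made-up viewing histories, not rows from our data), take $\boldsymbol x = (1, 1, 0, 1)$ and $\boldsymbol y = (1, 0, 0, 1)$. The two users share two movies, so:

$$\boldsymbol x \cdot \boldsymbol y = 1\cdot1 + 1\cdot0 + 0\cdot0 + 1\cdot1 = 2, \quad ||\boldsymbol x|| = \sqrt{3}, \quad ||\boldsymbol y|| = \sqrt{2}, \quad \cos(\theta) = \frac{2}{\sqrt{3}\sqrt{2}} \approx 0.816$$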

In [416]:
# function calculating cosine similarity
def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)


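For non-negative vectors such as our 0/1 viewing histories, cosine similarity lies between 0 and 1 inclusive and increases with similarity (for general vectors it can range from -1 to 1). Here are a few test cases to get a feel for it: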

In [417]:
# convert dataframe to numpy array
matrix_view_movies = viewed_movies.to_numpy()

In [418]:
# Testing function
display(cosine_similarity(matrix_view_movies[1,],matrix_view_movies[2,])) # row positions 1 and 2 (0-indexed), i.e. users 20 and 187, etc...
display(cosine_similarity(matrix_view_movies[5,],matrix_view_movies[9,]))
display(cosine_similarity(matrix_view_movies[1,],matrix_view_movies[10,]))
display(cosine_similarity(matrix_view_movies[0,],matrix_view_movies[2,]))
display(cosine_similarity(matrix_view_movies[4,],matrix_view_movies[7,]))

0.1889822365046136

0.2182178902359924

0.35355339059327373

0.3086066999241838

0.7071067811865476


Let's get similarities between user pairs. We'll do this with a loop below, because it's easier to see what's going on, but this will be inefficient and very slow for bigger datasets. 

> As an exercise, see if you can do the same without loops.

In [419]:
# get similarity between users using loop for 15 users
user_similarities = np.zeros(shape=(15, 15))
for i in range(0,14):
    for j in range(i+1,15):
        user_similarities[i,j] = cosine_similarity(matrix_view_movies[i,], matrix_view_movies[j,])

user_similarities = user_similarities + user_similarities.transpose()
np.fill_diagonal(user_similarities, 0)
user_similarities = np.around(user_similarities, decimals = 3)
user_similarities

array([[0.   , 0.   , 0.309, 0.667, 0.333, 0.309, 0.68 , 0.471, 0.408,
        0.471, 0.289, 0.594, 0.365, 0.408, 0.167],
       [0.   , 0.   , 0.189, 0.204, 0.204, 0.189, 0.167, 0.289, 0.5  ,
        0.   , 0.354, 0.485, 0.   , 0.   , 0.204],
       [0.309, 0.189, 0.   , 0.617, 0.463, 0.286, 0.378, 0.546, 0.661,
        0.436, 0.401, 0.55 , 0.338, 0.189, 0.154],
       [0.667, 0.204, 0.617, 0.   , 0.5  , 0.154, 0.544, 0.471, 0.51 ,
        0.471, 0.289, 0.594, 0.183, 0.408, 0.   ],
       [0.333, 0.204, 0.463, 0.5  , 0.   , 0.309, 0.408, 0.707, 0.51 ,
        0.471, 0.289, 0.594, 0.365, 0.408, 0.   ],
       [0.309, 0.189, 0.286, 0.154, 0.309, 0.   , 0.504, 0.546, 0.567,
        0.218, 0.535, 0.642, 0.676, 0.   , 0.463],
       [0.68 , 0.167, 0.378, 0.544, 0.408, 0.504, 0.   , 0.77 , 0.667,
        0.385, 0.589, 0.728, 0.745, 0.5  , 0.136],
       [0.471, 0.289, 0.546, 0.471, 0.707, 0.546, 0.77 , 0.   , 0.722,
        0.5  , 0.51 , 0.84 , 0.645, 0.289, 0.118],
       [0.408, 0.5  , 0.
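One possible loop-free version of the calculation above is sketched here on a tiny made-up matrix. The function name `pairwise_cosine` and the toy data are assumptions for illustration. The sketch also assumes every row has at least one non-zero entry, since an all-zero row would lead to a division by zero.

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows, with the diagonal set to 0."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)  # length of each row vector
    # matrix @ matrix.T holds all pairwise dot products;
    # np.outer(norms, norms) holds the matching products of lengths.
    sims = (matrix @ matrix.T) / np.outer(norms, norms)
    np.fill_diagonal(sims, 0)  # ignore self-similarity, as in the loop version
    return sims

# Hypothetical toy data: 3 users, 4 movies
toy = np.array([[1, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 1]])
toy_sims = np.around(pairwise_cosine(toy), decimals=3)
print(toy_sims)
```

Called as `pairwise_cosine(matrix_view_movies)`, this should give the same kind of symmetric matrix as the double loop, up to rounding.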

In [420]:
# Convert the array to a DataFrame
user_sim_matrix = pd.DataFrame(user_similarities, columns=viewed_movies.index, index=viewed_movies.index)
user_sim_matrix.index.name = None
user_sim_matrix


Unnamed: 0,1,20,187,198,212,222,282,328,330,372,432,434,495,562,594
1,0.0,0.0,0.309,0.667,0.333,0.309,0.68,0.471,0.408,0.471,0.289,0.594,0.365,0.408,0.167
20,0.0,0.0,0.189,0.204,0.204,0.189,0.167,0.289,0.5,0.0,0.354,0.485,0.0,0.0,0.204
187,0.309,0.189,0.0,0.617,0.463,0.286,0.378,0.546,0.661,0.436,0.401,0.55,0.338,0.189,0.154
198,0.667,0.204,0.617,0.0,0.5,0.154,0.544,0.471,0.51,0.471,0.289,0.594,0.183,0.408,0.0
212,0.333,0.204,0.463,0.5,0.0,0.309,0.408,0.707,0.51,0.471,0.289,0.594,0.365,0.408,0.0
222,0.309,0.189,0.286,0.154,0.309,0.0,0.504,0.546,0.567,0.218,0.535,0.642,0.676,0.0,0.463
282,0.68,0.167,0.378,0.544,0.408,0.504,0.0,0.77,0.667,0.385,0.589,0.728,0.745,0.5,0.136
328,0.471,0.289,0.546,0.471,0.707,0.546,0.77,0.0,0.722,0.5,0.51,0.84,0.645,0.289,0.118
330,0.408,0.5,0.661,0.51,0.51,0.567,0.667,0.722,0.0,0.433,0.619,0.849,0.559,0.5,0.51
372,0.471,0.0,0.436,0.471,0.471,0.218,0.385,0.5,0.433,0.0,0.204,0.42,0.258,0.289,0.236


In [421]:
# who are the most similar users to user 222?
user_sim_matrix[222].sort_values(ascending=False)


495    0.676
434    0.642
330    0.567
328    0.546
432    0.535
282    0.504
594    0.463
1      0.309
212    0.309
187    0.286
372    0.218
20     0.189
198    0.154
222    0.000
562    0.000
Name: 222, dtype: float64

Let's see if this makes sense from the viewing histories. Below we show user 222's history, together with the user who is most similar to user 222 (user 495) and another user who is very dissimilar (user 562).

In [422]:
viewed_movies.loc[[222,495,562]].transpose()

Unnamed: 0,222,495,562
2001: A Space Odyssey (1968),0,0,0
Apocalypse Now (1979),1,1,0
"Big Lebowski, The (1998)",1,1,0
"Bourne Identity, The (2002)",0,0,0
Clear and Present Danger (1994),0,0,0
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",0,0,0
"Departed, The (2006)",1,1,0
Donnie Darko (2001),0,0,0
Ferris Bueller's Day Off (1986),0,1,1
"Green Mile, The (1999)",0,0,1


### Recommending Movies for a Single User

As an example, let's consider the process of recommending a movie to one user, say user 222. How would we do this with a user-based collaborative filtering system? 

First, we need to know which movies they have already seen (so we don't recommend these).

In [423]:
viewed_movies.loc[222]

2001: A Space Odyssey (1968)                                 0
Apocalypse Now (1979)                                        1
Big Lebowski, The (1998)                                     1
Bourne Identity, The (2002)                                  0
Clear and Present Danger (1994)                              0
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)      0
Departed, The (2006)                                         1
Donnie Darko (2001)                                          0
Ferris Bueller's Day Off (1986)                              0
Green Mile, The (1999)                                       0
Harry Potter and the Philosopher's Stone (2001)              0
Indiana Jones and the Temple of Doom (1984)                  0
Interview with the Vampire: The Vampire Chronicles (1994)    0
Jumanji (1995)                                               1
Kill Bill: Vol. 2 (2004)                                     1
Shining, The (1980)                                    

The basic idea is now to recommend what's popular by adding up the number of users that have seen each movie, but *to weight each user by their similarity to user 222*. 

Let's work through the calculations for one movie, say 2001: A Space Odyssey (movie 1). The table below shows who's seen 2001: A Space Odyssey, and how similar each person is to user 222.

In [424]:
seen_movie  = viewed_movies.loc[:,"2001: A Space Odyssey (1968)"] # who has seen movie Space Odyssey
sim_to_user = user_sim_matrix[222] # similarity of users to user 222
df_222 =  pd.concat([seen_movie, sim_to_user], axis=1)
df_222.columns = ["seen_movie", "sim_to_user"]
df_222

Unnamed: 0,seen_movie,sim_to_user
1,0,0.309
20,0,0.189
187,1,0.286
198,1,0.154
212,1,0.309
222,0,0.0
282,0,0.504
328,1,0.546
330,1,0.567
372,1,0.218


The basic idea in user-based collaborative filtering is that user 372's vote counts less than user 434's, because user 434 is more similar to user 222 (in terms of viewing history). Note that this only means user 434 counts more in the context of making recommendations to user 222. When recommending to users *other than user 222*, user 372 may carry more weight.

We can now work out an overall recommendation score for **2001: A Space Odyssey** by multiplying together the two elements in each row of the table above, and summing these products (taking the dot product):

In [425]:
# overall score for 2001: A Space Odyssey
np.dot(viewed_movies.loc[:,"2001: A Space Odyssey (1968)"], user_sim_matrix[222]) # or np.dot(seen_movie, sim_to_user)

2.722

Note this score will increase with:
* (a) the number of people who've seen the movie (more 1's in the first column above), and
* (b) the similarity to user 222 of the people who've seen it.
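To make the arithmetic concrete, here is a minimal sketch with made-up numbers (five hypothetical users, not the tutorial data). Only users who have seen the movie contribute to the score, each in proportion to their similarity to the target user.

```python
import numpy as np

# Hypothetical toy data: has each of five other users seen the movie (1) or not (0)?
seen_movie = np.array([1, 0, 1, 1, 0])
# Hypothetical similarity of each of those users to the target user
sim_to_user = np.array([0.5, 0.9, 0.2, 0.0, 0.7])

# Weighted vote: sum the similarities of the users who have seen the movie
score = np.dot(seen_movie, sim_to_user)
print(score)  # 0.5 + 0.2 + 0.0 = 0.7
```

The second user is the most similar one (0.9) but has not seen the movie, so they add nothing to its score.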

Let's repeat this calculation for all movies and compare recommendation scores:

In [426]:
#np.dot(viewed_movies, user_sim_matrix[222])
rec_scores = np.matmul(user_sim_matrix[222], matrix_view_movies)
rec_scores = pd.Series(rec_scores)
rec_scores.index = viewed_movies.columns
rec_scores

2001: A Space Odyssey (1968)                                 2.722
Apocalypse Now (1979)                                        3.925
Big Lebowski, The (1998)                                     3.993
Bourne Identity, The (2002)                                  2.983
Clear and Present Danger (1994)                              0.951
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)      1.838
Departed, The (2006)                                         3.470
Donnie Darko (2001)                                          2.041
Ferris Bueller's Day Off (1986)                              3.244
Green Mile, The (1999)                                       2.711
Harry Potter and the Philosopher's Stone (2001)              2.253
Indiana Jones and the Temple of Doom (1984)                  2.155
Interview with the Vampire: The Vampire Chronicles (1994)    1.851
Jumanji (1995)                                               2.396
Kill Bill: Vol. 2 (2004)                                     3

In [427]:
# same as above but in one line
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
np.matmul(user_sim_matrix[222], viewed_movies)


2001: A Space Odyssey (1968)                                 2.722
Apocalypse Now (1979)                                        3.925
Big Lebowski, The (1998)                                     3.993
Bourne Identity, The (2002)                                  2.983
Clear and Present Danger (1994)                              0.951
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)      1.838
Departed, The (2006)                                         3.470
Donnie Darko (2001)                                          2.041
Ferris Bueller's Day Off (1986)                              3.244
Green Mile, The (1999)                                       2.711
Harry Potter and the Philosopher's Stone (2001)              2.253
Indiana Jones and the Temple of Doom (1984)                  2.155
Interview with the Vampire: The Vampire Chronicles (1994)    1.851
Jumanji (1995)                                               2.396
Kill Bill: Vol. 2 (2004)                                     3

To come up with a final recommendation, we just need to remember to remove movies user 222 has already seen, and sort the remaining movies in descending order of recommendation score.

We do that below, after tidying up the results a bit by putting them in a data frame.

In [428]:
# get recommendation scores (matrix product of similarity and viewed status)
score = np.matmul(user_sim_matrix[222], viewed_movies)

# get seen movies
seen = viewed_movies.loc[222]

# get movie titles
title = viewed_movies.columns

# join data in data frame
user_scores = pd.DataFrame([title, score, seen]).transpose()
user_scores.columns = ["title", "score", "seen"]

# Remove rows where movie has been seen by user
user_scores = user_scores[user_scores["seen"] != 1]
user_scores.sort_values(by="score", ascending=False)



Unnamed: 0,title,score,seen
15,"Shining, The (1980)",4.07,0
8,Ferris Bueller's Day Off (1986),3.244,0
3,"Bourne Identity, The (2002)",2.983,0
0,2001: A Space Odyssey (1968),2.722,0
9,"Green Mile, The (1999)",2.711,0
10,Harry Potter and the Philosopher's Stone (2001),2.253,0
11,Indiana Jones and the Temple of Doom (1984),2.155,0
7,Donnie Darko (2001),2.041,0
12,Interview with the Vampire: The Vampire Chroni...,1.851,0
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.838,0


Therefore, our top recommendation for user 222 is "The Shining".

Now that we've understood the calculations, let's get recommendations for one more user, user 372:

In [429]:
# recommendations for user 372
score = np.matmul(user_sim_matrix[372], viewed_movies)
seen = viewed_movies.loc[372]
title = viewed_movies.columns

user_scores = pd.DataFrame([title, score, seen]).transpose()
user_scores.columns = ["title", "score", "seen"]
user_scores = user_scores[user_scores["seen"] != 1]
user_scores.sort_values(by="score", ascending=False)

Unnamed: 0,title,score,seen
2,"Big Lebowski, The (1998)",4.063,0
14,Kill Bill: Vol. 2 (2004),2.854,0
8,Ferris Bueller's Day Off (1986),2.756,0
9,"Green Mile, The (1999)",2.673,0
6,"Departed, The (2006)",2.418,0
11,Indiana Jones and the Temple of Doom (1984),2.247,0
3,"Bourne Identity, The (2002)",1.942,0
10,Harry Potter and the Philosopher's Stone (2001),1.824,0
19,Up (2009),1.813,0
7,Donnie Darko (2001),1.789,0


We would recommend "The Big Lebowski" to user 372.

## Function to Generate a UBCF Recommendation for Any User

In [430]:
def user_based_recommendations(user, user_sim, viewed_mov):
    # user ids are integers in the index, so accept strings such as "222" too
    user = int(user)

    col_names = list(viewed_mov.columns)
    user_scores = pd.DataFrame({'title': col_names,
    'score': np.matmul(user_sim[user], viewed_mov),  # use the argument, not the global viewed_movies
    'seen': viewed_mov.loc[user, :]})

    # keep unseen movies only, sort by descending score and drop the helper column
    return user_scores[user_scores['seen'] == 0].sort_values(by='score', ascending=False).drop(columns=['seen']).reset_index(drop=True)


Let's check the function is working by running it on a user we've used before:

In [431]:
user_based_recommendations(user = 222, user_sim = user_sim_matrix, viewed_mov = viewed_movies)

Unnamed: 0,title,score
0,"Shining, The (1980)",4.07
1,Ferris Bueller's Day Off (1986),3.244
2,"Bourne Identity, The (2002)",2.983
3,2001: A Space Odyssey (1968),2.722
4,"Green Mile, The (1999)",2.711
5,Harry Potter and the Philosopher's Stone (2001),2.253
6,Indiana Jones and the Temple of Doom (1984),2.155
7,Donnie Darko (2001),2.041
8,Interview with the Vampire: The Vampire Chroni...,1.851
9,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.838


## Recommendations for All Users with UBCF

In [432]:
# list of users
all_users = viewed_movies.index

# recommendations for users
all_recommendations = {}

# loop through all users
for user in all_users:
  recommendations = user_based_recommendations(user, user_sim_matrix, viewed_movies)
  all_recommendations[user] = recommendations

# all recommendation results stored in dictionary - all_recommendations
all_recommendations[222]

Unnamed: 0,title,score
0,"Shining, The (1980)",4.07
1,Ferris Bueller's Day Off (1986),3.244
2,"Bourne Identity, The (2002)",2.983
3,2001: A Space Odyssey (1968),2.722
4,"Green Mile, The (1999)",2.711
5,Harry Potter and the Philosopher's Stone (2001),2.253
6,Indiana Jones and the Temple of Doom (1984),2.155
7,Donnie Darko (2001),2.041
8,Interview with the Vampire: The Vampire Chroni...,1.851
9,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.838


## Create Matrix Showing Recommendation Scores for All Users

In [433]:
# New user_based_recommendations function: keeps seen movies and does not sort
# (note that this overrides the earlier definition with the same name)
def user_based_recommendations(user, user_sim, viewed_mov):
    # user ids are integers in the index, so accept strings such as "222" too
    user = int(user)

    col_names = list(viewed_mov.columns)
    user_scores = pd.DataFrame({'title': col_names,
    'score': np.matmul(user_sim[user], viewed_mov)})  # use the argument, not the global viewed_movies

    return user_scores.reset_index(drop=True)


# recommendations with seen movies
user_based_recommendations(222,user_sim_matrix, viewed_movies)

Unnamed: 0,title,score
0,2001: A Space Odyssey (1968),2.722
1,Apocalypse Now (1979),3.925
2,"Big Lebowski, The (1998)",3.993
3,"Bourne Identity, The (2002)",2.983
4,Clear and Present Danger (1994),0.951
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.838
6,"Departed, The (2006)",3.47
7,Donnie Darko (2001),2.041
8,Ferris Bueller's Day Off (1986),3.244
9,"Green Mile, The (1999)",2.711


In [434]:
# get shape: rows of viewed_movies are users, columns are movies
n_users, n_movies = viewed_movies.shape

# create empty array with one row per movie and one column per user
recommendation_scores = np.zeros((n_movies, n_users))

# loop through all users and get recommendations
for i in range(n_users):
    user = user_sim_matrix.index[i]
    user_rec = user_based_recommendations(user = user, user_sim = user_sim_matrix, viewed_mov = viewed_movies)
    recommendation_scores[:,i] = user_rec['score'].values
    # set the score to 0 for movies the user has already seen
    for j in range(n_movies):
        if viewed_movies.iloc[i,j] == 1:
            recommendation_scores[j,i] = 0

recommendation_scores = pd.DataFrame(recommendation_scores, columns=user_sim_matrix.index)
recommendation_scores.index =  viewed_movies.columns
recommendation_scores

Unnamed: 0,1,20,187,198,212,222,282,328,330,372,432,434,495,562,594
2001: A Space Odyssey (1968),3.253,1.871,0.0,0.0,0.0,2.722,3.88,0.0,0.0,0.0,2.912,0.0,2.89,2.447,1.315
Apocalypse Now (1979),0.0,1.834,3.658,3.594,3.697,0.0,0.0,0.0,0.0,0.0,3.951,0.0,0.0,2.778,0.0
"Big Lebowski, The (1998)",0.0,2.227,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.063,4.437,0.0,0.0,3.29,2.028
"Bourne Identity, The (2002)",2.442,0.0,2.725,2.612,2.712,2.983,0.0,0.0,0.0,1.942,0.0,0.0,2.807,2.007,1.554
Clear and Present Danger (1994),0.0,0.485,0.859,1.261,0.927,0.951,1.408,1.311,1.257,0.891,0.889,0.0,0.907,0.772,0.464
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",1.978,0.0,0.0,0.0,2.271,1.838,2.484,2.868,0.0,1.76,2.263,0.0,1.622,1.461,1.165
"Departed, The (2006)",3.116,1.984,3.16,2.745,3.182,0.0,0.0,0.0,0.0,2.418,0.0,0.0,0.0,2.231,1.996
Donnie Darko (2001),1.782,1.463,0.0,2.192,2.274,2.041,2.543,0.0,0.0,1.789,2.13,0.0,2.084,1.342,1.079
Ferris Bueller's Day Off (1986),3.259,1.645,3.125,3.21,0.0,3.244,0.0,0.0,0.0,2.756,3.277,0.0,0.0,0.0,1.448
"Green Mile, The (1999)",0.0,1.71,3.105,0.0,3.042,2.711,0.0,4.073,0.0,2.673,0.0,0.0,2.934,0.0,1.603


Above, we display all these recommendation scores in a $20 \times 15$ matrix with one row per movie and one column per user, with zeros in the cells where a user has already watched the movie.
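The same kind of score matrix can also be built without looping over users, using a single matrix product followed by a mask. The sketch below uses a small made-up viewing matrix and made-up similarities (`toy_viewed` and `toy_sim` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

users = [101, 102, 103]
movies = ["Movie A", "Movie B", "Movie C"]

# Hypothetical toy data: viewing matrix (users x movies) and user similarities
toy_viewed = pd.DataFrame([[1, 1, 0],
                           [1, 0, 1],
                           [0, 1, 1]], index=users, columns=movies)
toy_sim = pd.DataFrame([[0.0, 0.5, 0.2],
                        [0.5, 0.0, 0.4],
                        [0.2, 0.4, 0.0]], index=users, columns=users)

viewed_t = toy_viewed.T.to_numpy()  # movies x users

# scores[m, u] = sum over users v of viewed[v, m] * sim[v, u]
scores = viewed_t @ toy_sim.to_numpy()

# zero out movies each user has already seen
scores[viewed_t == 1] = 0

toy_scores = pd.DataFrame(scores, index=movies, columns=users)
print(toy_scores)
```

Each column of `toy_scores` holds one user's scores, so the layout matches the movies-by-users matrix above.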

>A variant on the above is a *k-nearest-neighbours* approach that bases recommendations *only on the k most similar users*. This is faster when there are many users, and is shown below.

***
## $K$-Nearest Neighbours for User Based Filtering

Using the same data we have been working with, let's implement the `KNN` algorithm for recommendations.

In [435]:
user_sim_matrix

Unnamed: 0,1,20,187,198,212,222,282,328,330,372,432,434,495,562,594
1,0.0,0.0,0.309,0.667,0.333,0.309,0.68,0.471,0.408,0.471,0.289,0.594,0.365,0.408,0.167
20,0.0,0.0,0.189,0.204,0.204,0.189,0.167,0.289,0.5,0.0,0.354,0.485,0.0,0.0,0.204
187,0.309,0.189,0.0,0.617,0.463,0.286,0.378,0.546,0.661,0.436,0.401,0.55,0.338,0.189,0.154
198,0.667,0.204,0.617,0.0,0.5,0.154,0.544,0.471,0.51,0.471,0.289,0.594,0.183,0.408,0.0
212,0.333,0.204,0.463,0.5,0.0,0.309,0.408,0.707,0.51,0.471,0.289,0.594,0.365,0.408,0.0
222,0.309,0.189,0.286,0.154,0.309,0.0,0.504,0.546,0.567,0.218,0.535,0.642,0.676,0.0,0.463
282,0.68,0.167,0.378,0.544,0.408,0.504,0.0,0.77,0.667,0.385,0.589,0.728,0.745,0.5,0.136
328,0.471,0.289,0.546,0.471,0.707,0.546,0.77,0.0,0.722,0.5,0.51,0.84,0.645,0.289,0.118
330,0.408,0.5,0.661,0.51,0.51,0.567,0.667,0.722,0.0,0.433,0.619,0.849,0.559,0.5,0.51
372,0.471,0.0,0.436,0.471,0.471,0.218,0.385,0.5,0.433,0.0,0.204,0.42,0.258,0.289,0.236


In [436]:
def knn(user, user_sim, viewed_mov, k):

    # user ids are integers in the index, so accept strings such as "328" too
    user = int(user)

    # work on a copy so the caller's similarity matrix is not modified
    sims = user_sim[user].copy()

    # keep the top k users most similar to user, set all other similarities to 0
    sim_peeps = sims.sort_values(ascending=False).iloc[:k].index
    sims[~sims.index.isin(sim_peeps)] = 0

    # get dataframe with scores/recommendations
    col_names = list(viewed_mov.columns)
    user_scores = pd.DataFrame({'title': col_names,
    'score': np.matmul(sims, viewed_mov),  # use the argument, not the global viewed_movies
    'seen': viewed_mov.loc[user, :]})

    # remove seen movies, sort by score and drop the seen column
    return user_scores[user_scores['seen'] == 0].sort_values(by='score', ascending=False).drop(columns=['seen']).reset_index(drop=True)

In [437]:
# Apply KNN for UBCF
knn(user=328, user_sim=user_sim_matrix, viewed_mov=viewed_movies, k=5)

Unnamed: 0,title,score
0,"Green Mile, The (1999)",2.332
1,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.562
2,Jumanji (1995),1.562
3,There's Something About Mary (1998),1.562
4,Clear and Present Danger (1994),0.84
5,Interview with the Vampire: The Vampire Chroni...,0.722
6,Sleepless in Seattle (1993),0.722
7,Star Trek: Generations (1994),0.0


***
# Item-Based Collaborative Filtering

## Basic Idea

Item-based collaborative filtering works very similarly to its user-based counterpart, although you might find it slightly less intuitive. It is also based on similarities, but similarities between *movies* rather than *users*.

There are two main conceptual parts to item-based collaborative filtering:

1. One movie is similar to another if many of the same users have seen both movies.
2. When deciding what movie to recommend to a particular user, movies are evaluated on how similar they are to movies *that the user has already seen*.
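The two parts above can be sketched on a small made-up viewing matrix. The toy data and the function name `item_based_scores` are assumptions for illustration. Scoring a movie by summing its similarities to the movies the user has seen is one simple way to implement part 2, not necessarily the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy viewing matrix (rows: users, columns: movies)
toy_viewed = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]],
    index=[101, 102, 103],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Part 1: movie-movie cosine similarity, computed on the columns
m = toy_viewed.to_numpy(dtype=float)
norms = np.linalg.norm(m, axis=0)  # one length per movie column
movie_sim = (m.T @ m) / np.outer(norms, norms)
np.fill_diagonal(movie_sim, 0)
movie_sim = pd.DataFrame(movie_sim, index=toy_viewed.columns, columns=toy_viewed.columns)

# Part 2: score each movie by its total similarity to the movies the user has seen
def item_based_scores(user, viewed, sim):
    seen = viewed.loc[user]
    scores = pd.Series(sim.to_numpy() @ seen.to_numpy(), index=viewed.columns)
    return scores[seen == 0].sort_values(ascending=False)  # unseen movies only

item_scores_101 = item_based_scores(101, toy_viewed, movie_sim)
print(item_scores_101)
```

User 101 has seen Movies A and B. Movie C was watched by a user who also watched both of them, so it ranks above Movie D, which shares no viewers with either.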

Let's start by computing the similarities between all pairs of movies. We can reuse the same code we used to compute user similarities, if we first transpose the *viewed_movies* matrix.

In [452]:
viewed_movies.iloc[:,1]

1      1
20     0
187    0
198    0
212    0
222    1
282    1
328    1
330    1
372    1
432    0
434    1
495    1
562    0
594    1
Name: Apocalypse Now (1979), dtype: int64

In [453]:
# transpose the viewed_movies dataframe so that rows are movies
movies_user = viewed_movies.T
n_movies = movies_user.shape[0]

# get all similarities between movies (not users)
movie_similarities = np.zeros(shape=(n_movies, n_movies))
for i in range(n_movies - 1):
    for j in range(i+1, n_movies):
        movie_similarities[i,j] = cosine_similarity(movies_user.iloc[i], movies_user.iloc[j])


movie_similarities = movie_similarities + movie_similarities.transpose()
np.fill_diagonal(movie_similarities, 0)
movie_similarities = np.around(movie_similarities, decimals = 3)
movie_similarities

array([[0.   , 0.504, 0.717, 0.463, 0.267, 0.676, 0.429, 0.756, 0.571,
        0.429, 0.676, 0.507, 0.378, 0.309, 0.535, 0.798, 0.218, 0.   ,
        0.378, 0.507],
       [0.504, 0.   , 0.738, 0.544, 0.471, 0.298, 0.756, 0.5  , 0.63 ,
        0.504, 0.447, 0.596, 0.333, 0.544, 0.707, 0.603, 0.385, 0.333,
        0.667, 0.447],
       [0.717, 0.738, 0.   , 0.516, 0.447, 0.566, 0.717, 0.632, 0.717,
        0.598, 0.566, 0.707, 0.316, 0.387, 0.783, 0.763, 0.183, 0.   ,
        0.474, 0.566],
       [0.463, 0.544, 0.516, 0.   , 0.289, 0.548, 0.772, 0.612, 0.617,
        0.617, 0.73 , 0.548, 0.408, 0.667, 0.722, 0.615, 0.236, 0.   ,
        0.408, 0.548],
       [0.267, 0.471, 0.447, 0.289, 0.   , 0.316, 0.267, 0.354, 0.267,
        0.535, 0.316, 0.632, 0.   , 0.289, 0.25 , 0.426, 0.   , 0.   ,
        0.354, 0.316],
       [0.676, 0.298, 0.566, 0.548, 0.316, 0.   , 0.338, 0.671, 0.338,
        0.507, 0.6  , 0.4  , 0.447, 0.548, 0.474, 0.539, 0.258, 0.   ,
        0.447, 0.2  ],
       [0.

In [454]:
# convert to data frame for visibility
movie_sim_matrix = pd.DataFrame(movie_similarities, columns=viewed_movies.columns, index=viewed_movies.columns)
movie_sim_matrix.index.name = None
movie_sim_matrix

Unnamed: 0,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Philosopher's Stone (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
2001: A Space Odyssey (1968),0.0,0.504,0.717,0.463,0.267,0.676,0.429,0.756,0.571,0.429,0.676,0.507,0.378,0.309,0.535,0.798,0.218,0.0,0.378,0.507
Apocalypse Now (1979),0.504,0.0,0.738,0.544,0.471,0.298,0.756,0.5,0.63,0.504,0.447,0.596,0.333,0.544,0.707,0.603,0.385,0.333,0.667,0.447
"Big Lebowski, The (1998)",0.717,0.738,0.0,0.516,0.447,0.566,0.717,0.632,0.717,0.598,0.566,0.707,0.316,0.387,0.783,0.763,0.183,0.0,0.474,0.566
"Bourne Identity, The (2002)",0.463,0.544,0.516,0.0,0.289,0.548,0.772,0.612,0.617,0.617,0.73,0.548,0.408,0.667,0.722,0.615,0.236,0.0,0.408,0.548
Clear and Present Danger (1994),0.267,0.471,0.447,0.289,0.0,0.316,0.267,0.354,0.267,0.535,0.316,0.632,0.0,0.289,0.25,0.426,0.0,0.0,0.354,0.316
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",0.676,0.298,0.566,0.548,0.316,0.0,0.338,0.671,0.338,0.507,0.6,0.4,0.447,0.548,0.474,0.539,0.258,0.0,0.447,0.2
"Departed, The (2006)",0.429,0.756,0.717,0.772,0.267,0.338,0.0,0.567,0.714,0.571,0.507,0.507,0.378,0.617,0.935,0.57,0.218,0.0,0.567,0.676
Donnie Darko (2001),0.756,0.5,0.632,0.612,0.354,0.671,0.567,0.0,0.567,0.378,0.671,0.447,0.5,0.408,0.707,0.603,0.289,0.0,0.5,0.447
Ferris Bueller's Day Off (1986),0.571,0.63,0.717,0.617,0.267,0.338,0.714,0.567,0.0,0.571,0.676,0.507,0.189,0.309,0.668,0.684,0.436,0.0,0.378,0.507
"Green Mile, The (1999)",0.429,0.504,0.598,0.617,0.535,0.507,0.571,0.378,0.571,0.0,0.338,0.676,0.378,0.463,0.535,0.798,0.436,0.0,0.378,0.338


We can use the result to see, for example, what movies are most similar to "Apocalypse Now":

In [459]:
movie_sim_matrix.loc[:,"Apocalypse Now (1979)"].sort_values(ascending=False)

Departed, The (2006)                                         0.756
Big Lebowski, The (1998)                                     0.738
Kill Bill: Vol. 2 (2004)                                     0.707
There's Something About Mary (1998)                          0.667
Ferris Bueller's Day Off (1986)                              0.630
Shining, The (1980)                                          0.603
Indiana Jones and the Temple of Doom (1984)                  0.596
Jumanji (1995)                                               0.544
Bourne Identity, The (2002)                                  0.544
2001: A Space Odyssey (1968)                                 0.504
Green Mile, The (1999)                                       0.504
Donnie Darko (2001)                                          0.500
Clear and Present Danger (1994)                              0.471
Harry Potter and the Philosopher's Stone (2001)              0.447
Up (2009)                                                    0

### Recommending Movies for a Single User

Let's again look at a concrete example of recommending a movie to a particular user, say user 372.

User 372 has seen the following movies:

In [468]:
viewed_movies.loc[372,:][viewed_movies.loc[372,:]==1]

2001: A Space Odyssey (1968)    1
Apocalypse Now (1979)           1
Shining, The (1980)             1
Name: 372, dtype: int64

We now implement the main idea behind item-based filtering. For each movie, we find the similarities between that movie and each of the three movies user 372 has seen, and sum up those similarities. The resulting sum is that movie's "recommendation score".
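
Summing the similarity columns of the seen movies is the same as multiplying the similarity matrix by the user's 0/1 viewing vector. Here is a minimal sketch of that idea on made-up numbers; `toy_sim` and `seen` are hypothetical and are not the notebook's data.

```python
import numpy as np
import pandas as pd

movies = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Hypothetical symmetric movie-movie similarity matrix with a zero diagonal
toy_sim = pd.DataFrame(
    [[0.0, 0.5, 0.2, 0.7],
     [0.5, 0.0, 0.4, 0.1],
     [0.2, 0.4, 0.0, 0.6],
     [0.7, 0.1, 0.6, 0.0]],
    index=movies, columns=movies,
)

# Hypothetical user who has seen Movie A and Movie C
seen = pd.Series([1, 0, 1, 0], index=movies)

# Each movie's score is the sum of its similarities to the seen movies
scores = toy_sim.dot(seen)

# Recommend only among unseen movies, best score first
recommendations = scores[seen == 0].sort_values(ascending=False)
print(recommendations)
```

In this toy case Movie D scores 0.7 + 0.6 = 1.3 and Movie B scores 0.5 + 0.4 = 0.9, so Movie D would be recommended first.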

We start by identifying the movies the user has seen:

In [479]:
# movies the user has seen
user_seen = viewed_movies.loc[372,:][viewed_movies.loc[372,:]==1]
user_seen.index

Index(['2001: A Space Odyssey (1968)', 'Apocalypse Now (1979)',
       'Shining, The (1980)'],
      dtype='object')

We then compute the similarities between all movies and these "seen" movies. For example, similarities for the first seen movie, *2001: A Space Odyssey* are:

In [481]:
# get similarity scores of top 10 similar movies
movie_sim_matrix.loc[:,user_seen.index[0]].sort_values(ascending=False).head(10)


Shining, The (1980)                                        0.798
Donnie Darko (2001)                                        0.756
Big Lebowski, The (1998)                                   0.717
Harry Potter and the Philosopher's Stone (2001)            0.676
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)    0.676
Ferris Bueller's Day Off (1986)                            0.571
Kill Bill: Vol. 2 (2004)                                   0.535
Indiana Jones and the Temple of Doom (1984)                0.507
Up (2009)                                                  0.507
Apocalypse Now (1979)                                      0.504
Name: 2001: A Space Odyssey (1968), dtype: float64

We can do the same for each of the three seen movies or, more simply, do all three at once:

In [486]:
movie_sim_matrix.loc[:,user_seen.index]

Unnamed: 0,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Shining, The (1980)"
2001: A Space Odyssey (1968),0.0,0.504,0.798
Apocalypse Now (1979),0.504,0.0,0.603
"Big Lebowski, The (1998)",0.717,0.738,0.763
"Bourne Identity, The (2002)",0.463,0.544,0.615
Clear and Present Danger (1994),0.267,0.471,0.426
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",0.676,0.298,0.539
"Departed, The (2006)",0.429,0.756,0.57
Donnie Darko (2001),0.756,0.5,0.603
Ferris Bueller's Day Off (1986),0.571,0.63,0.684
"Green Mile, The (1999)",0.429,0.504,0.798


Each movie's recommendation score is obtained by summing across columns, each column representing a seen movie:

In [490]:
row_sums = movie_sim_matrix.loc[:,user_seen.index].sum(axis=1)
row_sums.sort_values(ascending=False)

Big Lebowski, The (1998)                                     2.218
Ferris Bueller's Day Off (1986)                              1.885
Kill Bill: Vol. 2 (2004)                                     1.882
Donnie Darko (2001)                                          1.859
Indiana Jones and the Temple of Doom (1984)                  1.777
Departed, The (2006)                                         1.755
Green Mile, The (1999)                                       1.731
Harry Potter and the Philosopher's Stone (2001)              1.662
Bourne Identity, The (2002)                                  1.622
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)      1.513
Up (2009)                                                    1.493
Shining, The (1980)                                          1.401
There's Something About Mary (1998)                          1.347
2001: A Space Odyssey (1968)                                 1.302
Jumanji (1995)                                               1

The preceding explanation hopefully makes the details of the calculations clear, but it is quite unwieldy. We can do all the calculations more neatly as:

In [491]:
# get recommendation score for similar movies given seen movies - user 372
score = movie_sim_matrix.loc[:,user_seen.index].sum(axis=1)
seen = viewed_movies.loc[372]
title = viewed_movies.columns


user_scores = pd.DataFrame([title, score, seen]).transpose()
user_scores.columns = ["title", "score", "seen"]
user_scores = user_scores[user_scores["seen"] != 1]
user_scores.sort_values(by="score", ascending=False)

Unnamed: 0,title,score,seen
2,"Big Lebowski, The (1998)",2.218,0
8,Ferris Bueller's Day Off (1986),1.885,0
14,Kill Bill: Vol. 2 (2004),1.882,0
7,Donnie Darko (2001),1.859,0
11,Indiana Jones and the Temple of Doom (1984),1.777,0
6,"Departed, The (2006)",1.755,0
9,"Green Mile, The (1999)",1.731,0
10,Harry Potter and the Philosopher's Stone (2001),1.662,0
3,"Bourne Identity, The (2002)",1.622,0
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.513,0


Again we will end up recommending "The Big Lebowski" to this particular user.

Let's repeat the process to generate a recommendation for one more user, user 222:

In [492]:
# do for user 222
user_seen =  viewed_movies.loc[222,:][viewed_movies.loc[222,:]==1]

score = movie_sim_matrix.loc[:,user_seen.index].sum(axis=1)
seen = viewed_movies.loc[222]
title = viewed_movies.columns

user_scores = pd.DataFrame([title, score, seen]).transpose()
user_scores.columns = ["title", "score", "seen"]
user_scores = user_scores[user_scores["seen"] != 1]
user_scores.sort_values(by="score", ascending=False)


Unnamed: 0,title,score,seen
3,"Bourne Identity, The (2002)",4.177,0
8,Ferris Bueller's Day Off (1986),3.923,0
15,"Shining, The (1980)",3.786,0
7,Donnie Darko (2001),3.761,0
10,Harry Potter and the Philosopher's Stone (2001),3.589,0
9,"Green Mile, The (1999)",3.387,0
0,2001: A Space Odyssey (1968),3.379,0
11,Indiana Jones and the Temple of Doom (1984),3.091,0
12,Interview with the Vampire: The Vampire Chroni...,2.893,0
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",2.871,0


Here the top recommendation (The Bourne Identity) differs from the one produced by the user-based system.

### Function to Generate an IBCF Recommendation for Any User

In [500]:
def item_based_recommendations(user, movie_sim, viewed_mov):
    # accept the user id as an int or a numeric string
    user = int(user)

    # movies this user has seen, taken from the viewing matrix passed in
    user_seen = viewed_mov.loc[user,:][viewed_mov.loc[user,:]==1]
    # recommendation score: sum of similarities to the seen movies
    score = movie_sim.loc[:,user_seen.index].sum(axis=1)
    seen = viewed_mov.loc[user]
    title = viewed_mov.columns

    user_scores = pd.DataFrame([title, score, seen]).transpose()
    user_scores.columns = ["title", "score", "seen"]
    # keep unseen movies only, best score first
    user_scores = user_scores[user_scores["seen"] != 1]
    return user_scores.sort_values(by="score", ascending=False).drop(columns='seen')

Let's check that it's working with a user we've seen before, user 372:

In [501]:
item_based_recommendations(user = 372, movie_sim = movie_sim_matrix, viewed_mov = viewed_movies)

Unnamed: 0,title,score
2,"Big Lebowski, The (1998)",2.218
8,Ferris Bueller's Day Off (1986),1.885
14,Kill Bill: Vol. 2 (2004),1.882
7,Donnie Darko (2001),1.859
11,Indiana Jones and the Temple of Doom (1984),1.777
6,"Departed, The (2006)",1.755
9,"Green Mile, The (1999)",1.731
10,Harry Potter and the Philosopher's Stone (2001),1.662
3,"Bourne Identity, The (2002)",1.622
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.513


In [502]:
# list of users
all_users = viewed_movies.index

# recommendations for users
all_recommendations = {}

# loop through all users
for user in all_users:
  recommendations = item_based_recommendations(user, movie_sim_matrix, viewed_movies)
  all_recommendations[user] = recommendations

# all recommendation results stored in dictionary - all_recommendations
all_recommendations[372]

Unnamed: 0,title,score
2,"Big Lebowski, The (1998)",2.218
8,Ferris Bueller's Day Off (1986),1.885
14,Kill Bill: Vol. 2 (2004),1.882
7,Donnie Darko (2001),1.859
11,Indiana Jones and the Temple of Doom (1984),1.777
6,"Departed, The (2006)",1.755
9,"Green Mile, The (1999)",1.731
10,Harry Potter and the Philosopher's Stone (2001),1.662
3,"Bourne Identity, The (2002)",1.622
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",1.513


> This would once again be better displayed in a user $\times$ movie matrix, with blanks in the already-seen cells.  

In [513]:
# New Item_Based_Recommendations Function allowing for seen movies and no sorting
def item_based_recommendations(user, movie_sim, viewed_mov):
    # accept the user id as an int or a numeric string
    user = int(user)

    user_seen = viewed_mov.loc[user,:][viewed_mov.loc[user,:]==1]
    score = movie_sim.loc[:,user_seen.index].sum(axis=1)
    seen = viewed_mov.loc[user]
    title = viewed_mov.columns

    # keep every movie (seen or not) in the original column order
    user_scores = pd.DataFrame([title, score, seen]).transpose()
    user_scores.columns = ["title", "score", "seen"]
    return user_scores.drop(columns='seen')

# recommendation with seen movies
item_based_recommendations(222, movie_sim_matrix, viewed_movies)['score'].values

array([3.3790000000000004, 3.859, 3.665, 4.177, 2.3939999999999997,
       2.8710000000000004, 4.268, 3.761, 3.9230000000000005, 3.387, 3.589,
       3.091, 2.8930000000000007, 3.489, 4.164, 3.786, 2.0380000000000003,
       1.241, 3.5010000000000003, 3.3160000000000003], dtype=object)

In [517]:
# User x Movie matrix (or array transformed into dataframe)

# get shape for array
nrows, ncols = viewed_movies.shape

# create empty array
recommendation_scores = np.zeros((nrows, ncols))

# loop through all users and get recommendations
for i in range(nrows):
    user = viewed_movies.index[i]
    user_rec = item_based_recommendations(user = user, movie_sim = movie_sim_matrix, viewed_mov = viewed_movies)
    recommendation_scores[i,:] = user_rec['score'].values
    # zero out the scores of movies the user has already seen
    for j in range(ncols):
        if viewed_movies.iloc[i,j] == 1:
            recommendation_scores[i,j] = 0

recommendation_scores = pd.DataFrame(recommendation_scores, columns=movie_sim_matrix.index)
recommendation_scores.index = viewed_movies.index
recommendation_scores

Unnamed: 0,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Philosopher's Stone (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
1,3.222,0.0,0.0,3.129,0.0,2.626,3.388,2.914,3.376,0.0,2.606,0.0,1.479,2.235,3.389,0.0,1.352,0.333,2.399,2.606
20,2.124,1.833,2.035,0.0,1.21,0.0,2.234,2.362,1.94,1.925,0.0,1.531,1.691,0.0,2.247,2.062,1.223,0.408,2.118,1.896
187,0.0,3.683,0.0,3.884,2.06,0.0,3.934,0.0,3.734,3.623,3.75,3.209,0.0,3.21,0.0,0.0,2.077,0.5,3.131,3.115
198,0.0,3.243,0.0,3.307,2.623,0.0,3.132,3.487,3.388,0.0,3.119,0.0,1.971,2.259,3.441,0.0,1.443,0.0,2.203,2.55
212,0.0,3.369,0.0,3.489,2.039,2.919,3.613,3.676,0.0,3.072,0.0,3.195,1.783,2.47,3.732,0.0,1.443,0.0,2.426,0.0
222,3.379,0.0,0.0,4.177,2.394,2.871,0.0,3.761,3.923,3.387,3.589,3.091,2.893,0.0,0.0,3.786,2.038,1.241,0.0,0.0
282,4.953,0.0,0.0,0.0,3.584,4.008,0.0,5.013,0.0,0.0,4.677,0.0,2.984,4.116,0.0,0.0,2.446,0.333,3.928,4.653
328,0.0,0.0,0.0,0.0,4.302,5.648,0.0,0.0,0.0,6.353,0.0,0.0,3.932,5.466,0.0,0.0,2.775,0.333,5.322,0.0
330,0.0,0.0,0.0,0.0,4.848,0.0,0.0,0.0,0.0,0.0,0.0,6.85,0.0,0.0,0.0,0.0,0.0,2.318,0.0,7.226
372,0.0,0.0,2.218,1.622,1.164,1.513,1.755,1.859,1.885,1.731,1.662,1.777,1.163,1.222,1.882,0.0,0.951,0.333,1.347,1.493
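
A user $\times$ movie score matrix like this one can also be produced without explicit loops. The sketch below is one possible vectorized version on made-up data (`toy_viewed` and `toy_sim` are hypothetical); it blanks the already-seen cells with NaN rather than 0, which is a design choice and not what the cell above does.

```python
import numpy as np
import pandas as pd

movies = ["Movie A", "Movie B", "Movie C"]

# Hypothetical viewing matrix (rows: users, columns: movies)
toy_viewed = pd.DataFrame([[1, 0, 0],
                           [0, 1, 1]], index=[101, 102], columns=movies)

# Hypothetical movie-movie similarities with a zero diagonal
toy_sim = pd.DataFrame([[0.0, 0.5, 0.2],
                        [0.5, 0.0, 0.4],
                        [0.2, 0.4, 0.0]], index=movies, columns=movies)

# One matrix product scores every movie for every user
all_scores = toy_viewed.dot(toy_sim)

# Blank out (NaN) the cells for movies the user has already seen
all_scores = all_scores.mask(toy_viewed == 1)
print(all_scores)
```

Using NaN instead of 0 keeps "already seen" distinct from "genuinely scored zero", at the cost of having to handle missing values downstream.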


***
# Collaborative Filtering with Matrix Factorization 

In this section we're going to look at a different way of doing collaborative filtering, one based on the idea of *matrix factorization*, a topic from linear algebra.

Matrix factorization, also called matrix decomposition, takes a matrix and represents it as a product of other (usually two) matrices. There are many ways to do matrix factorization, and different problems tend to use different methods. Factorization often involves finding underlying **latent factors** containing information about the dataset. 

In recommendation systems, matrix factorization is used to decompose the ratings matrix into the product of two matrices. This is done in such a way that the known ratings are matched as closely as possible. 

The key feature of matrix factorization for recommendation systems is that while the ratings matrix is incomplete (i.e. some entries are blank), the two matrices the ratings matrix is decomposed into are *complete* (no blank entries). This gives a straightforward way of filling in blank spaces in the original ratings matrix, as we'll see.
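
To illustrate the point about completeness, here is a minimal sketch with a made-up ratings matrix and randomly initialized (not fitted) factor matrices. All names and values are hypothetical; the only thing it demonstrates is that the product of two complete matrices has a value in every cell.

```python
import numpy as np

# Hypothetical ratings matrix: 3 users x 4 movies, NaN = not rated
ratings = np.array([[5.0, np.nan, 3.0, np.nan],
                    [np.nan, 4.0, np.nan, 2.0],
                    [1.0, np.nan, np.nan, 5.0]])

n_users, n_movies = ratings.shape
n_factors = 2

rng = np.random.default_rng(0)
user_factors = rng.normal(size=(n_users, n_factors))    # complete: one row per user
movie_factors = rng.normal(size=(n_factors, n_movies))  # complete: one column per movie

# The product has a value in every cell, including those that were blank
predicted = user_factors @ movie_factors

print(predicted.shape)
print(np.isnan(predicted).any())
```

Fitting then consists of choosing the factor values so that the predictions agree with the known ratings; the predictions in the blank cells come along for free.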

In [536]:
# get ratings in wide format
ratings_wide = pd.read_csv("ratings_for_excel_example.csv")
ratings_wide.index = viewed_movies.index
ratings_wide.drop(ratings_wide.columns[0], axis=1, inplace=True)
ratings_wide

Unnamed: 0,2001: A Space Odyssey (1968),Apocalypse Now (1979),"Big Lebowski, The (1998)","Bourne Identity, The (2002)",Clear and Present Danger (1994),"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)","Departed, The (2006)",Donnie Darko (2001),Ferris Bueller's Day Off (1986),"Green Mile, The (1999)",Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),Indiana Jones and the Temple of Doom (1984),Interview with the Vampire: The Vampire Chronicles (1994),Jumanji (1995),Kill Bill: Vol. 2 (2004),"Shining, The (1980)",Sleepless in Seattle (1993),Star Trek: Generations (1994),There's Something About Mary (1998),Up (2009)
1,,4.0,5.0,,4.0,,,,,5.0,,5.0,,,,3.0,,,,
20,,,,3.0,,3.5,,,,,4.5,,,3.0,,,,,,
187,4.0,,5.0,,,3.0,,3.5,,,,,3.5,,4.0,4.5,,,,
198,4.0,,3.0,,,3.0,,,,5.0,,3.0,,,,3.0,,,,
212,4.0,,3.0,,,,,,2.5,,3.5,,,,,4.0,,,,3.0
222,,3.5,3.5,,,,4.5,,,,,,,2.5,4.0,,,,3.0,2.5
282,,4.5,4.0,4.5,,,4.5,,4.5,4.0,,4.0,,,3.5,4.0,,,,
328,4.0,4.0,3.5,4.0,,,3.5,1.0,2.0,,4.0,4.0,,,5.0,4.0,,,,4.5
330,1.5,2.5,5.0,2.0,,4.5,4.5,4.5,4.0,4.0,4.5,,4.5,1.5,5.0,4.0,2.0,,5.0,
372,2.0,5.0,,,,,,,,,,,,,,4.0,,,,


We start by defining a function that computes the root mean squared error (RMSE) between the observed movie ratings and any other set of predicted ratings (for example, ones predicted by matrix factorization). Note that only movies that have already been rated count towards the accuracy calculation.

In [570]:
def recommender_accuracy(x, observed_rating):
    
    # extract user and movie factors from parameter vector
    # (hard-coded for 15 users, 20 movies and 5 latent factors)
    user_factors = x[:75].reshape(15, 5)
    movie_factors = x[75:175].reshape(5, 20)

    # get predictions from dot products of respective user and movie factors
    predicted_ratings = np.dot(user_factors, movie_factors)
    
    # convert ratings matrix to numpy array
    observed_rating = observed_rating.to_numpy()
    
    # squared error for every cell (NaN where the movie is unrated)
    errors = np.power(observed_rating - predicted_ratings, 2)
    
    # only use rated movies
    rated_movies = ~np.isnan(observed_rating)
    mean_error = np.mean(errors[rated_movies])
    
    # model accuracy is the root mean squared error over all rated movies
    return np.sqrt(mean_error)

# See if it works
recommender_accuracy(x=np.random.rand(175), observed_rating=ratings_wide)

2.7816235031804966

The line `rated_movies = ~np.isnan(observed_rating)` builds a boolean mask marking the rated movies in the `observed_rating` matrix. `np.isnan(observed_rating)` is `True` wherever a rating is missing, and the tilde (`~`) operator negates that array element-wise. So `rated_movies` is an array of booleans with the same shape as the ratings matrix, with `True` marking the rated movies and `False` marking the unrated ones.

The next line, `mean_error = np.mean(errors[rated_movies])`, **computes the mean of the squared errors for the rated movies only**. The expression `errors[rated_movies]` is boolean indexing: it keeps only the elements of `errors` whose corresponding entry in `rated_movies` is `True`, and returns them as a one-dimensional array.
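
A tiny worked example may help make the masking concrete. The arrays below are made up for illustration and are not the notebook's ratings.

```python
import numpy as np

# Hypothetical 2 x 2 observed ratings (NaN = unrated) and predictions
observed = np.array([[4.0, np.nan],
                     [np.nan, 2.0]])
predicted = np.array([[3.0, 5.0],
                      [1.0, 4.0]])

errors = (observed - predicted) ** 2   # NaN wherever the rating is missing
rated = ~np.isnan(observed)            # True only for rated cells

print(rated)
print(errors[rated])                   # 1-D array: squared errors of rated cells only
rmse = np.sqrt(np.mean(errors[rated]))
print(rmse)
```

Only the two rated cells contribute, with squared errors 1 and 4, so the RMSE is the square root of 2.5. Without the mask, `np.mean(errors)` would return NaN.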

This function isn't general, because it refers specifically to a ratings matrix with 15 users, 20 movies, and 5 latent factors. Let's make it general by passing these dimensions in as arguments.

In [565]:
def general_recommender_accuracy(x, observed_ratings, n_users, n_movies, n_factors):
    user_factors = x[:n_users * n_factors].reshape(n_users, n_factors)
    movie_factors = x[n_users * n_factors:].reshape(n_factors, n_movies)

    predicted_ratings = np.dot(user_factors, movie_factors)
    observed_ratings = observed_ratings.to_numpy()

    errors = (observed_ratings - predicted_ratings) ** 2
    rated_movies = ~np.isnan(observed_ratings)

    return np.sqrt(np.mean(errors[rated_movies]))

# See if it works
general_recommender_accuracy(x=np.random.rand(175), observed_ratings=ratings_wide, n_users=15, n_movies=20, n_factors=5)

2.8155598795990606


We'll now optimize the values in the user and movie latent factors, choosing them so that the root mean square error (the square root of the average squared difference between observed and predicted ratings) is a minimum.

In [575]:
# Optimize
from scipy.optimize import minimize
np.random.seed(10)

# BFGS Method
n_users, n_factors, n_movies = 15, 5, 20
result = minimize(fun=general_recommender_accuracy,
                  x0=np.random.randn(n_users * n_factors + n_factors * n_movies),
                  args=(ratings_wide, n_users, n_movies, n_factors),
                  method='BFGS',
                  options={"maxiter": 100000})

# see result
print("Convergence using BFGS:", result.success)
print("Minimized Value: ", result.fun)

# different method (note: this overwrites the BFGS output stored in `result`)
result = minimize(fun=general_recommender_accuracy,
                  x0=np.random.randn(n_users * n_factors + n_factors * n_movies),
                  args=(ratings_wide, n_users, n_movies, n_factors),
                  method='Nelder-Mead',
                  options={"maxiter": 100000})

print("\nConvergence using Nelder-Mead:", result.success)
print("Minimized Value: ", result.fun)

Convergence using BFGS: False
Minimized Value:  2.798051690321165e-06

Convergence using Nelder-Mead: False
Minimized Value:  0.35704055022203957


`False` means that the optimizer stopped without meeting its convergence criteria. This does not by itself mean the solution is poor: the run may have hit its iteration limit, or been unable to make further progress at the requested precision (`result.message` gives the reason). It is also worth remembering that even a converged result may be a local rather than a global minimum.
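
The fields of the object returned by `minimize()` help diagnose why a run stopped. The sketch below uses a hypothetical toy objective (`toy_objective`, unrelated to the recommender) and compares a run with a generous iteration budget against one that is deliberately cut short.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(x):
    """Simple convex function with its minimum at x = (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x0 = np.array([3.0, 3.0])

# A generous iteration budget: the optimizer should meet its tolerances
res_ok = minimize(toy_objective, x0=x0, method="Nelder-Mead",
                  options={"maxiter": 1000})

# A budget of one iteration: the optimizer is stopped early
res_short = minimize(toy_objective, x0=x0, method="Nelder-Mead",
                     options={"maxiter": 1})

# success: convergence flag, nit: iterations used, message: reason for stopping
for name, res in [("ok", res_ok), ("short", res_short)]:
    print(name, res.success, res.nit, res.message)
```

Printing `result.message` and `result.nit` for the recommender runs above would show in the same way whether they stopped because of the iteration limit or for some other reason.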

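The best value of the objective function found by the Nelder-Mead run of `minimize()` (the run whose output is stored in `result`), with a limit of 100000 iterations, is `0.357`. Note that it hasn't converged yet, so we should really run for longer or try another optimizer! (The BFGS run reported a far smaller value, about `2.8e-06`, but its output was overwritten by the second call.) Ignoring this for now, we can extract the optimized user and movie factors. With a bit of work, these can be interpreted and often give useful information.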

In [597]:
# user factors
user_factors = np.reshape(result.x[:75], (15, 5))
print("User Factor: \n", pd.DataFrame(user_factors).head(6), "\n Shape:", user_factors.shape, "\n")

# movie factors
movie_factors = np.reshape(result.x[75:], (5,20))
print("Movie Factor: \n", pd.DataFrame(movie_factors).head(6), "\n Shape:", movie_factors.shape, "\n")


User Factor: 
            0         1         2         3         4
0  -1.224750  3.055283  1.278412  0.462291  0.318429
1  -7.511932 -1.979446  2.185415 -1.560631  4.994708
2   0.094348  1.505176 -0.045879 -0.531632  1.568056
3  -2.059086  1.603002  1.218100 -1.336221  0.420934
4   0.010674  5.657869 -0.181747  3.092363 -0.412841
5  -0.708986  2.607733  0.663077  0.092511  0.569372
6  -1.400756  3.163890 -0.117680  0.580854  0.573996
7  -1.498154  2.209820  1.054796 -0.384355  0.293028
8   0.490492  3.086054  1.784994  0.203665  0.651645
9  -2.571154 -1.760446  5.042933 -7.559127  0.485099
10 -0.097378  3.789989  1.211593  1.701005 -0.001484
11 -1.224907  3.198814  0.429032  0.292321  0.408947
12 -2.242870 -1.448053 -3.038783 -3.028022  3.344606
13 -1.072568  0.423909  2.911750 -6.118121 -0.974906
14  0.697647 -0.665813 -2.340890 -1.658454  4.041338 
 Shape: (15, 5) 

Movie Factor: 
          0         1         2         3         4         5         6   \
0 -0.672425 -1.624590 -0.96

Most importantly, we can get **predicted movie ratings** for any user, by taking the appropriate dot product of user and movie factors. Here we show the predictions for user 1:

In [610]:
# Check Predictions for User 1
predicted_ratings = np.dot(user_factors, movie_factors)

# get dataframe for visibility
predicted_ratings = pd.DataFrame(predicted_ratings)
predicted_ratings.index = viewed_movies.index
predicted_ratings.columns = viewed_movies.columns
predicted_ratings = np.round(predicted_ratings,1)

# see for user 1 and compare to observed
pd.DataFrame({"Predicted Rating":predicted_ratings.loc[1,:], "Observed Rating": ratings_wide.loc[1,:]})


Unnamed: 0,Predicted Rating,Observed Rating
2001: A Space Odyssey (1968),3.1,
Apocalypse Now (1979),4.5,4.0
"Big Lebowski, The (1998)",5.0,5.0
"Bourne Identity, The (2002)",4.3,
Clear and Present Danger (1994),3.6,4.0
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",4.3,
"Departed, The (2006)",4.5,
Donnie Darko (2001),3.3,
Ferris Bueller's Day Off (1986),3.0,
"Green Mile, The (1999)",4.9,5.0


In [614]:
def lmf_ratings_prediction(user):
    user = int(user)  # accept an int or a numeric string

    observed_rating = ratings_wide.loc[user,:]
    predicted_ratings = np.dot(user_factors, movie_factors)
    predicted_ratings = pd.DataFrame(predicted_ratings)
    predicted_ratings.index = viewed_movies.index
    predicted_ratings.columns = viewed_movies.columns
    predicted_ratings = np.round(predicted_ratings,1)
    predicted_ratings = predicted_ratings.loc[user,:]

    print("The Predicted and Observed Ratings for User", user)

    return pd.DataFrame({"Predicted Rating":predicted_ratings, "Observed Rating": observed_rating})


# try function out
lmf_ratings_prediction(222)
    

The Predicted and Observed Ratings for User 222


Unnamed: 0,Predicted Rating,Observed Rating
2001: A Space Odyssey (1968),3.6,
Apocalypse Now (1979),3.2,3.5
"Big Lebowski, The (1998)",3.9,3.5
"Bourne Identity, The (2002)",3.1,
Clear and Present Danger (1994),2.4,
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",3.8,
"Departed, The (2006)",3.8,4.5
Donnie Darko (2001),3.7,
Ferris Bueller's Day Off (1986),3.2,
"Green Mile, The (1999)",4.0,


### Adding L2 Regularization

One trick that can improve the performance of matrix factorization collaborative filtering is to add L2 regularization. **L2 regularization adds a penalty term to the function that we're trying to minimize, which penalizes large parameter values**.
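
Written out, the penalized objective used in this notebook's implementation looks as follows. The notation is introduced here for illustration and is an assumption, not taken from the original tutorial: $r_{ij}$ is the observed rating of movie $j$ by user $i$, $\Omega$ is the set of rated (user, movie) pairs, $u_i$ is row $i$ of the user factor matrix $U$, $m_j$ is column $j$ of the movie factor matrix $M$, and $\lambda$ is the penalty weight (`lam` in the code).

$$
J(U, M) = \sqrt{\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left(r_{ij} - u_i^\top m_j\right)^2} \;+\; \lambda \sqrt{\sum_{i,k} U_{ik}^2 + \sum_{k,j} M_{kj}^2}
$$

The first term is the RMSE over rated movies and the second is $\lambda$ times the L2 norm of all parameters. Many formulations penalize the squared norm instead; the form shown here simply mirrors the code in this section.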

We first rewrite the `general_recommender_accuracy` function to make use of L2 regularization:

In [643]:
# Error with Penalty Function
def lmf_with_l2_accuracy(x, observed_ratings, lam, n_users, n_movies, n_factors):

    # extract user and movie factors from parameter vector
    user_factors = x[:n_users * n_factors].reshape(n_users, n_factors)
    movie_factors = x[n_users * n_factors:].reshape(n_factors, n_movies)

    # get predicted ratings
    predicted_ratings = np.dot(user_factors, movie_factors)

    # convert observed ratings to numpy
    observed_ratings = observed_ratings.to_numpy()

    # calculate error
    errors = (observed_ratings - predicted_ratings) ** 2

    # only want to look at rated movies
    rated_movies = ~np.isnan(observed_ratings)

    # L2 norm penalizes large parameter values
    penalty = np.sqrt(np.sum(user_factors ** 2) + np.sum(movie_factors ** 2))

    # model accuracy contains an error term and a weighted penalty
    accuracy = np.sqrt(np.mean(errors[rated_movies])) + (lam * penalty)

    return accuracy
    

# see if function works
lmf_with_l2_accuracy(np.random.randn(175), ratings_wide, 0.03, 15, 20, 5)




4.623011496605781

We now rerun the optimization with this new evaluation function:

In [644]:
# Optimize
np.random.seed(10)

# define parameters
n_users, n_factors, n_movies = 15, 5, 20
lam = 3e-2

# BFGS Method
result_1 = minimize(fun=lmf_with_l2_accuracy,
                    x0=np.random.randn(n_users * n_factors + n_factors * n_movies),
                    args=(ratings_wide, lam, n_users, n_movies, n_factors),
                    method='BFGS',
                    options={"maxiter": 100000})

# see result
print("Convergence using BFGS:", result_1.success)
print("Minimized Value: ", result_1.fun)

# different method
result_2 = minimize(fun=lmf_with_l2_accuracy,
                    x0=np.random.randn(n_users * n_factors + n_factors * n_movies),
                    args=(ratings_wide, lam, n_users, n_movies, n_factors),
                    method='Nelder-Mead',
                    options={"maxiter": 100000})

print("\nConvergence using Nelder-Mead:", result_2.success)
print("Minimized Value: ", result_2.fun)

Convergence using BFGS: False
Minimized Value:  0.4035368891219943

Convergence using Nelder-Mead: False
Minimized Value:  1.0631188352908936


The best value found is **worse** than before, but remember that we changed the objective function to include the L2 penalty term, so the numbers are not comparable. We need to extract just the RMSE that we're interested in. To do that, we first extract the optimized parameter values (user and movie factors) and multiply these matrices together to get predicted ratings. From there, it's easy to calculate the errors.
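
These steps can be wrapped in a small helper. The following is a minimal sketch, assuming the same flat parameter layout as the objective functions (user factors first, then movie factors); `rmse_from_params` is a hypothetical name and the data are made up.

```python
import numpy as np

def rmse_from_params(x, observed, n_users, n_movies, n_factors):
    """Unpenalized RMSE over rated cells for a flat parameter vector x."""
    split = n_users * n_factors
    user_factors = x[:split].reshape(n_users, n_factors)
    movie_factors = x[split:split + n_factors * n_movies].reshape(n_factors, n_movies)
    predicted = user_factors @ movie_factors
    rated = ~np.isnan(observed)          # mask of rated cells
    return np.sqrt(np.mean((observed[rated] - predicted[rated]) ** 2))

# Hypothetical 2 x 2 ratings matrix with one missing entry, one latent factor
observed = np.array([[4.0, 3.0],
                     [np.nan, 1.0]])
x = np.array([2.0, 1.0,    # user factors (2 users x 1 factor)
              2.0, 1.0])   # movie factors (1 factor x 2 movies)

rmse = rmse_from_params(x, observed, n_users=2, n_movies=2, n_factors=1)
print(rmse)
```

Here the predictions are 4, 2 and 1 for the three rated cells, so only one cell is off (by 1) and the RMSE is the square root of 1/3. Applied to a fitted parameter vector, the same helper would return an RMSE that can be compared across models with and without the penalty.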

In [653]:
# user factors
user_factors = np.reshape(result_2.x[:75], (15, 5))
print("User Factor: \n", pd.DataFrame(user_factors).head(6), "\n Shape:", user_factors.shape, "\n")

# movie factors
movie_factors = np.reshape(result_2.x[75:], (5,20))
print("Movie Factor: \n", pd.DataFrame(movie_factors).head(6), "\n Shape:", movie_factors.shape, "\n")


User Factor: 
           0         1         2         3         4
0 -1.037028  2.131542  0.192244  1.150360 -0.498892
1 -2.327359 -0.772815 -0.220758  1.282603 -1.031220
2  0.158328  1.525393  1.862505  0.838169  2.465011
3 -1.485079  0.236429  1.461663 -0.384189  0.497572
4  1.867111  2.348096  0.983295  2.712720  0.977028
5 -0.667403  2.180597 -0.724539  1.125074 -0.691379 
 Shape: (15, 5) 

Movie Factor: 
          0         1         2         3         4         5         6   \
0  0.054152 -1.717805 -2.649813 -1.115366  0.124969 -1.106722 -2.133135   
1  0.405439 -0.006375  0.091965  1.787396  1.080646  0.576360  1.193874   
2  2.484843  0.517761 -0.177487 -1.078013  0.511764  0.984893  0.166716   
3  0.419064  1.039252  2.412912 -1.801490  0.998760  1.287053  0.270388   
4 -0.639493 -2.249350  1.410162 -3.531764 -0.977954 -0.225907  0.297055   

         7         8         9         10        11        12        13  \
0 -1.480052 -0.209929 -2.533031 -1.029542 -1.246948  2.06117

In [654]:
def lmf_with_l2_ratings_prediction(user):
    user = int(user)  # accept an int or a numeric string

    observed_rating = ratings_wide.loc[user,:]
    predicted_ratings = np.dot(user_factors, movie_factors)
    predicted_ratings = pd.DataFrame(predicted_ratings)
    predicted_ratings.index = viewed_movies.index
    predicted_ratings.columns = viewed_movies.columns
    predicted_ratings = np.round(predicted_ratings,1)
    predicted_ratings = predicted_ratings.loc[user,:]

    print("The Predicted and Observed Ratings for User", user)

    return pd.DataFrame({"Predicted Rating":predicted_ratings, "Observed Rating": observed_rating})


# try function out
lmf_with_l2_ratings_prediction(222)
    

The Predicted and Observed Ratings for User 222


Unnamed: 0,Predicted Rating,Observed Rating
2001: A Space Odyssey (1968),-0.0,
Apocalypse Now (1979),3.5,3.5
"Big Lebowski, The (1998)",3.8,3.5
"Bourne Identity, The (2002)",5.8,
Clear and Present Danger (1994),3.7,
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",2.9,
"Departed, The (2006)",4.0,4.5
Donnie Darko (2001),3.4,
Ferris Bueller's Day Off (1986),1.3,
"Green Mile, The (1999)",3.9,


In [659]:
# calculate error: RMSE of (observed - predicted)
predicted_ratings = np.dot(user_factors, movie_factors)
observed_ratings = ratings_wide
observed_ratings = observed_ratings.to_numpy()
errors = (observed_ratings - predicted_ratings) ** 2
rated_movies = ~np.isnan(observed_ratings)  # only score movies the user actually rated
print("RMSE of LMF with L2 Regularisation:", np.round(np.sqrt(np.mean(errors[rated_movies])),4))


RMSE of LMF with L2 Regularisation: 0.4541


Compare this with what we achieved without L2 regularization.

### Adding Bias terms

> Adding bias terms helps the model capture and correct for systematic biases in the data, which can improve its accuracy.

Bias terms are additive factors that model the fact that some users are more generous than others (and so will give higher ratings, on average) and some movies are better than others (and so will get higher ratings, on average). 
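
To make the additive structure concrete, here is a minimal sketch on made-up numbers (the `toy_` arrays and their sizes are illustrative assumptions, not values from this notebook). Each predicted rating is the latent-factor interaction plus one offset for the user and one for the movie. Storing the user biases as a column and the movie biases as a row lets NumPy broadcasting add them to every cell of the prediction matrix.

```python
import numpy as np

# Toy sizes (hypothetical): 3 users, 4 movies, 2 latent factors
rng = np.random.default_rng(0)
toy_user_factors = rng.normal(size=(3, 2))
toy_movie_factors = rng.normal(size=(2, 4))

# One bias per user (column vector) and one bias per movie (row vector)
toy_user_bias = np.array([[0.5], [0.0], [-0.5]])     # shape (3, 1)
toy_movie_bias = np.array([[1.0, 0.0, -1.0, 0.2]])   # shape (1, 4)

# Latent-factor part of the prediction, shape (3, 4)
interaction = toy_user_factors @ toy_movie_factors

# Broadcasting adds user i's bias to row i and movie j's bias to column j
toy_predictions = interaction + toy_user_bias + toy_movie_bias

print(toy_predictions.shape)
```

A generous user (positive user bias) shifts their whole row up, and a well-liked movie (positive movie bias) shifts its whole column up, independently of the latent factors.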

Let's adapt our evaluation function further to include bias terms for both users and movies:

In [661]:
# Error with Penalty Function and Bias Terms
def lmf_with_l2_bias_accuracy(x, observed_ratings, lam, n_users, n_movies, n_factors):

    # get total number of params for slicing purposes
    n_params = n_users * n_factors + n_factors * n_movies + n_users + n_movies

    # extract user and movie factors from parameter vector
    user_factors = x[:n_users * n_factors].reshape(n_users, n_factors)
    movie_factors = x[n_users * n_factors:n_params - n_users - n_movies].reshape(n_factors, n_movies)

    # reshape the biases into a column (users) and a row (movies) so that
    # broadcasting adds them to every entry of the prediction matrix
    user_bias = x[-n_users - n_movies:-n_movies].reshape(n_users, 1)
    movie_bias = x[-n_movies:].reshape(1, n_movies)

    # get predicted ratings
    predicted_ratings = np.dot(user_factors, movie_factors) + user_bias + movie_bias

    # convert observed ratings to numpy
    observed_ratings = observed_ratings.to_numpy()

    # calculate error
    errors = (observed_ratings - predicted_ratings) ** 2

    # only want to look at rated movies
    rated_movies = ~np.isnan(observed_ratings)

    # L2 norm penalizes large factor values (the bias terms are left unpenalized)
    penalty = np.sqrt(np.sum(user_factors ** 2) + np.sum(movie_factors ** 2))

    # model accuracy contains an error term and a weighted penalty
    accuracy = np.sqrt(np.mean(errors[rated_movies])) + (lam * penalty)

    return accuracy
    

# see if the function works
lmf_with_l2_bias_accuracy(np.random.randn(15*5+20*5+20+15), ratings_wide, 0.03, 15, 20, 5) # 210 elements in x


5.658386750856897

The following lines extract the bias terms from the parameter vector `x` and reshape them into a column vector and a row vector:

```py 
user_bias = x[-n_users - n_movies:-n_movies].reshape((n_users, 1))
movie_bias = x[-n_movies:].reshape((1, n_movies))
```

`n_users` and `n_movies` are the number of users and movies respectively, and determine the start and end indices of the bias terms in `x`.

- `x[-n_users - n_movies:-n_movies]` takes the `n_users` elements that sit immediately before the last `n_movies` elements of `x` and returns them as a 1D array.
- `reshape((n_users, 1))` reshapes that array into an `n_users` by 1 matrix, representing the user bias terms.
- `x[-n_movies:]` takes the last `n_movies` elements of `x` and returns them as a 1D array.
- `reshape((1, n_movies))` reshapes that array into a 1 by `n_movies` matrix, representing the movie bias terms.
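
One way to check this layout is to fill a parameter vector with its own positions and see which positions each slice picks up. The sketch below uses the same sizes as the notebook (15 users, 20 movies, 5 factors); the `demo_` names are illustrative and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np

n_users, n_movies, n_factors = 15, 20, 5
n_params = n_users * n_factors + n_factors * n_movies + n_users + n_movies

# Each element holds its own index, so a slice reveals exactly which positions it covers
demo_x = np.arange(n_params)

demo_user_factors = demo_x[:n_users * n_factors].reshape(n_users, n_factors)
demo_movie_factors = demo_x[n_users * n_factors:n_params - n_users - n_movies].reshape(n_factors, n_movies)
demo_user_bias = demo_x[-n_users - n_movies:-n_movies].reshape(n_users, 1)
demo_movie_bias = demo_x[-n_movies:].reshape(1, n_movies)

print(n_params)                                            # 210
print(demo_user_factors.min(), demo_user_factors.max())    # 0 74
print(demo_movie_factors.min(), demo_movie_factors.max())  # 75 174
print(demo_user_bias[0, 0], demo_user_bias[-1, 0])         # 175 189
print(demo_movie_bias[0, 0], demo_movie_bias[0, -1])       # 190 209
```

The four blocks tile the vector without gaps or overlap: user factors, then movie factors, then user biases, then movie biases.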

Again, rerun the optimization:

In [722]:
# Optimize
np.random.seed(10)

# define parameters
n_users, n_factors, n_movies = 15, 5, 20
lam = 3e-2

# BFGS Method
result_1 = minimize(
    fun=lmf_with_l2_bias_accuracy,
    x0=np.random.randn(n_users * n_factors + n_factors * n_movies + n_movies + n_users),
    args=(ratings_wide, lam, n_users, n_movies, n_factors),
    method='BFGS',
    options={"maxiter": 100000},
)

# see result
print("Convergence using BFGS:", result_1.success)
print("Minimized Value: ", np.round(result_1.fun,3))

# different method
result_2 = minimize(
    fun=lmf_with_l2_bias_accuracy,
    x0=np.random.randn(n_users * n_factors + n_factors * n_movies + n_movies + n_users),
    args=(ratings_wide, lam, n_users, n_movies, n_factors),
    method='Nelder-Mead',
    options={"maxiter": 1000000},
)

print("\nConvergence using Nelder-Mead:", result_2.success)
print("Minimized Value: ", np.round(result_2.fun,3))

Convergence using BFGS: False
Minimized Value:  0.20951858049770766

Convergence using Nelder-Mead: False
Minimized Value:  0.8121184507538218


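Neither optimizer reported successful convergence, so treat both solutions as approximate. The minimized values also aren't comparable to either of the previous values, for the same reason as before: the objective function has changed to include bias terms. Extracting just the RMSE, here from the Nelder-Mead result (`result_2`):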

In [723]:
# user and movie factors from optimization
n_params = n_users * n_factors + n_factors * n_movies + n_users + n_movies
user_factors = result_2.x[:n_users * n_factors].reshape(n_users, n_factors)
movie_factors = result_2.x[n_users * n_factors:n_params - n_users - n_movies].reshape(n_factors, n_movies)

# reshape the biases into a column (users) and a row (movies) so that
# broadcasting adds them to every entry of the prediction matrix
user_bias = result_2.x[-n_users - n_movies:-n_movies].reshape(n_users, 1)
movie_bias = result_2.x[-n_movies:].reshape(1, n_movies)

# predicted ratings
predicted_ratings = np.dot(user_factors, movie_factors) + user_bias + movie_bias

# check RMSE
errors = (observed_ratings - predicted_ratings) ** 2
rated_movies = ~np.isnan(observed_ratings)
print("RMSE for LMF with L2 Regularisation and Bias Terms:", np.sqrt(np.mean(errors[rated_movies])))

RMSE for LMF with L2 Regularisation and Bias Terms: 0.3148230273219289
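
The gap between the minimized value and the RMSE comes from the weighted penalty that the objective adds on top of the error. As a rough illustration, the sketch below splits the objective into its two parts on made-up numbers; `split_objective` and the `toy_` arrays are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def split_objective(observed, predicted, user_factors, movie_factors, lam):
    """Hypothetical helper: return (rmse, weighted_penalty) as separate values."""
    rated = ~np.isnan(observed)  # only rated entries contribute to the error
    rmse = np.sqrt(np.mean((observed[rated] - predicted[rated]) ** 2))
    penalty = np.sqrt(np.sum(user_factors ** 2) + np.sum(movie_factors ** 2))
    return rmse, lam * penalty

# Made-up 2x2 example: NaN marks an unrated movie
toy_observed = np.array([[4.0, np.nan], [np.nan, 2.0]])
toy_predicted = np.array([[3.0, 9.9], [9.9, 2.0]])
toy_user_factors = np.array([[3.0], [0.0]])
toy_movie_factors = np.array([[4.0, 0.0]])

toy_rmse, toy_penalty = split_objective(
    toy_observed, toy_predicted, toy_user_factors, toy_movie_factors, lam=0.03
)
print("RMSE:", toy_rmse)
print("Weighted penalty:", toy_penalty)
print("Objective:", toy_rmse + toy_penalty)
```

Note that the predictions for unrated cells (9.9 here) have no effect on the RMSE, because the mask drops them before averaging. Applied to the notebook's `observed_ratings`, `predicted_ratings`, `user_factors`, `movie_factors` and `lam`, the same split should show how much of the minimized value is penalty rather than prediction error.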


We can examine and interpret the user or movie latent factors, or the bias terms, if we want to. Below we show the movie bias terms, which give some reflection of movie quality (with some notable exceptions!).

In [724]:
# make DataFrames and then join them
df1 = pd.DataFrame(viewed_movies.columns).reset_index(drop=True)
df2 = pd.DataFrame(movie_bias).T.reset_index(drop=True)
df_final = pd.concat([df1, df2], axis=1)
df_final.columns = ["Movie", "Bias"]
df_final.sort_values(by="Bias", ascending=False)


Unnamed: 0,Movie,Bias
12,Interview with the Vampire: The Vampire Chroni...,2.135332
1,Apocalypse Now (1979),1.59657
18,There's Something About Mary (1998),1.278623
9,"Green Mile, The (1999)",1.209568
6,"Departed, The (2006)",0.928785
5,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...",0.82536
15,"Shining, The (1980)",0.75645
2,"Big Lebowski, The (1998)",0.564407
0,2001: A Space Odyssey (1968),0.51289
17,Star Trek: Generations (1994),0.432205


Finally, we again get predicted ratings for one user:

In [725]:
# Predicted Ratings for User 1

def lmf_l2_bias_ratings_prediction(user):
    user = int(user)  # accept the user id as either a string or an integer

    observed_rating = ratings_wide.loc[user,:]
    predicted_ratings = np.dot(user_factors, movie_factors) + user_bias + movie_bias
    predicted_ratings = pd.DataFrame(predicted_ratings)
    predicted_ratings.index = viewed_movies.index
    predicted_ratings.columns = viewed_movies.columns
    predicted_ratings = np.round(predicted_ratings,1)
    predicted_ratings = predicted_ratings.loc[user,:]

    print("The Predicted and Observed Ratings for User", user)

    return pd.DataFrame({"Predicted Rating":predicted_ratings, "Observed Rating": observed_rating})


# try function out
lmf_l2_bias_ratings_prediction(1)
    

The Predicted and Observed Ratings for User 1


Unnamed: 0,Predicted Rating,Observed Rating
2001: A Space Odyssey (1968),-4.0,
Apocalypse Now (1979),3.9,4.0
"Big Lebowski, The (1998)",5.1,5.0
"Bourne Identity, The (2002)",7.2,
Clear and Present Danger (1994),4.1,4.0
"Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)",1.4,
"Departed, The (2006)",4.6,
Donnie Darko (2001),1.1,
Ferris Bueller's Day Off (1986),5.7,
"Green Mile, The (1999)",4.8,5.0
