# Matrix Factorization
## Harris Dupre
## Data 612, Summer 2020

### Introduction

In this project we will implement a movie recommender system using the singular value decomposition (SVD) for matrix factorization.

We will use the MovieLens dataset, specifically ml-latest-small.zip, which contains approximately 100,000 ratings and 3,600 tag applications applied to about 9,000 movies by about 600 users.

In [2]:
import pandas as pd 
import numpy as np


ratings_df = pd.read_csv('https://raw.githubusercontent.com/hdupre/rec_sys/master/Project3/ratings.csv')
movies_df = pd.read_csv('https://raw.githubusercontent.com/hdupre/rec_sys/master/Project3/movies.csv')

In [3]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The ratings_df dataframe contains user IDs, movie IDs, ratings, and timestamps. Timestamps will not be used. Repeated user IDs (as seen here) indicate multiple ratings from the same user.

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The movies_df dataframe contains a movie ID, title, and list of genres for each movie. The movie ID matches the movie ID in ratings_df.

### Convert to matrix

In order to use the SVD function we need to transform the ratings data into a matrix, with user ID as the rows, movie ID as the columns, and ratings as the values.

We will then replace all the NaN values with zeroes. Because users have not rated the vast majority of movies, most user-movie pairings are NaN.

In [5]:
# use pandas pivot table function to transform the data
ratings_pivot = ratings_df.pivot_table(index='userId', columns='movieId', values='rating')
ratings_pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,...,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [6]:
# replace NaNs with 0
ratings_mat = ratings_pivot.replace(np.nan, 0)
ratings_mat.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,...,,,,,,,,,,
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
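
To make the sparsity concrete: with roughly 100,000 ratings spread over a 610 x 9,724 grid, about 98% of the user-movie cells are empty. The sketch below shows one way to measure this on a small made-up example. The names `toy_ratings`, `toy_pivot`, `toy_mat`, and `sparsity` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings_df
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Same pivot as above: users as rows, movies as columns, NaN where unrated
toy_pivot = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")

# Fraction of user-movie pairs that have no rating
sparsity = toy_pivot.isna().to_numpy().mean()

# Zero-filled copy, equivalent to replace(np.nan, 0)
toy_mat = toy_pivot.fillna(0)

print(toy_pivot.shape, round(sparsity, 3))
```

Running the same `isna().to_numpy().mean()` expression on `ratings_pivot` should give the corresponding figure for the real data.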


### Use the numpy SVD function

We use numpy's SVD function, np.linalg.svd, to factor the ratings matrix into the U, sigma, and V transposed matrices.

The U matrix can be thought of as the "user-to-concept" matrix. It quantifies how strongly each user leans toward each latent concept, where a concept is a mostly undefinable pattern (or set of patterns) that makes up movie preferences.

The sigma matrix is a diagonal matrix that measures the strength of each concept, independent of individual user preferences. numpy returns it as a one-dimensional array of singular values sorted from largest to smallest.

The V transposed matrix is the "movie-to-concept" matrix, which represents the (again, not strictly definable) attributes of each movie.

By calculating the dot product of these three matrices, we generate a user-by-movie matrix whose entries we will use as each user's predicted rating for each movie.

This topic is covered in this video:
https://www.youtube.com/watch?v=P5mlg91as1c&list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV&index=47
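
As a small illustration of the three factors described above, the following sketch runs np.linalg.svd on a made-up 4-user by 5-movie matrix and checks the shapes and the reconstruction. The matrix `toy` and the names `sigma` and `reconstructed` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical 4-user x 5-movie zero-filled ratings matrix
toy = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

# Reduced SVD: u is (4, 4), s is a 1-D array of 4 singular values, v_t is (4, 5)
u, s, v_t = np.linalg.svd(toy, full_matrices=False)
print(u.shape, s.shape, v_t.shape)

# np.diag turns the 1-D array of singular values into the diagonal sigma matrix
sigma = np.diag(s)

# Multiplying all three factors back together recovers the input matrix
reconstructed = u @ sigma @ v_t
print(np.allclose(reconstructed, toy))
```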

In [7]:
# full_matrices=False returns the reduced SVD, so v_t has shape
# (610, 9724) rather than (9724, 9724) and the three factors can be
# multiplied back together
u, s, v_t = np.linalg.svd(ratings_mat, full_matrices=False)
print(u.shape, s.shape, v_t.shape)


(610, 610) (610,) (610, 9724)


In [None]:
# calculate the dot product of the matrices: U . diag(s) . V^T
prediction_array = np.dot(u, np.dot(np.diag(s), v_t))
# convert back into a dataframe with movie IDs as column names and
# user IDs as the index (reusing the pivot's index avoids assuming
# that user IDs are sequential and start from 1)
prediction_df = pd.DataFrame(prediction_array, index=ratings_mat.index,
                             columns=ratings_mat.columns)
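
One point worth keeping in mind: when all 610 singular values are kept, the product of the three factors reproduces the zero-filled ratings matrix up to floating-point error. A common variant of this technique keeps only the k largest singular values, which gives a lower-rank approximation whose entries for unrated movies are no longer pinned to zero. The sketch below illustrates that variant on a made-up matrix. It is not the method used in this notebook, and `truncated_reconstruction`, `toy`, and the choice of `k` are hypothetical.

```python
import numpy as np

def truncated_reconstruction(matrix, k):
    """Rank-k approximation of `matrix` built from its k largest singular values."""
    u, s, v_t = np.linalg.svd(matrix, full_matrices=False)
    # keep the first k columns of u, the first k singular values,
    # and the first k rows of v_t
    return (u[:, :k] * s[:k]) @ v_t[:k, :]

# Hypothetical 4-user x 5-movie zero-filled ratings matrix
toy = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

full = truncated_reconstruction(toy, k=4)    # every singular value kept
approx = truncated_reconstruction(toy, k=2)  # only the two strongest concepts

print(np.allclose(full, toy))
print(np.allclose(approx, toy))
print(np.round(approx, 2))
```

With k equal to the number of users the result matches the input, while a smaller k smooths the matrix. How to choose k for the MovieLens data is left open here.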

### Recommender system

The recommender system will take a user ID and output the top predicted movies that the user has not yet rated. The prediction matrix, the original ratings pivot (containing the NaNs), the movie dataframe (containing the movie titles), and the number of recommendations desired will all be taken as parameters.

In [8]:
def recommender(user_id, prediction_matrix, ratings_pivot, movies_df, n_recommendations):
    # row of the selected user's predicted ratings
    selected_user = prediction_matrix.loc[user_id, :]
    # sort ratings from highest to lowest
    selected_user = selected_user.sort_values(ascending=False)

    i = 0  # number of recommendations printed so far
    j = 0  # position in the sorted list of predictions

    # stop once enough movies are recommended or the candidates run out
    while i < n_recommendations and j < len(selected_user):
        # if the user rating is NaN in the ratings pivot,
        # the user didn't rate the referenced movie so it
        # can be recommended. Otherwise, skip.
        if np.isnan(ratings_pivot.at[user_id, selected_user.index[j]]):
            title = movies_df[movies_df['movieId'] == selected_user.index[j]]['title']
            print("System recommends", title.to_string(index=False))
            i += 1
        j += 1
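
The same filtering logic can also be written without an explicit loop. The following is a minimal sketch, assuming the same inputs as `recommender`, that returns the titles as a list instead of printing them. The function name `top_n_unrated` and all of the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_n_unrated(user_id, prediction_matrix, ratings_pivot, movies_df, n_recommendations):
    """Return titles of the highest-scoring movies the user has not rated."""
    scores = prediction_matrix.loc[user_id]
    # keep only movies whose original rating is NaN (unrated by this user)
    unrated = ratings_pivot.loc[user_id].isna()
    top_ids = scores[unrated].sort_values(ascending=False).head(n_recommendations).index
    # look titles up by movie ID, preserving the ranked order
    titles = movies_df.set_index("movieId")["title"]
    return titles.reindex(top_ids).tolist()

# Hypothetical toy data: two users, three movies
toy_pivot = pd.DataFrame(
    [[5.0, np.nan, np.nan], [np.nan, 3.0, 4.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
toy_predictions = pd.DataFrame(
    [[4.8, 2.1, 3.9], [1.0, 3.2, 4.1]],
    index=toy_pivot.index,
    columns=toy_pivot.columns,
)
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

print(top_n_unrated(1, toy_predictions, toy_pivot, toy_movies, 2))
```

Returning a list rather than printing makes the result easier to inspect or compare across users, and asking for more recommendations than there are unrated movies simply returns a shorter list.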

### Recommendations and conclusions

We will select a few arbitrary users and see what the recommender returns.

In [9]:
recommender(14, prediction_df, ratings_pivot, movies_df, 10)

System recommends  Usual Suspects, The (1995)
System recommends  American President, The (1995)
System recommends  Firm, The (1993)
System recommends  Heat (1995)
System recommends  Maverick (1994)
System recommends  Taxi Driver (1976)
System recommends  Reservoir Dogs (1992)
System recommends  Terminator 2: Judgment Day (1991)
System recommends  Die Hard: With a Vengeance (1995)
System recommends  Sleepers (1996)


These movies all seem to be action films and dramas, with most being from the mid-90s.

In [14]:
recommender(65, prediction_df, ratings_pivot, movies_df, 10)

System recommends  Interview with the Vampire: The Vampire Chroni...
System recommends  Godfather: Part II, The (1974)
System recommends  Never Been Kissed (1999)
System recommends  Raiders of the Lost Ark (Indiana Jones and the...
System recommends  Snatch (2000)
System recommends  Shine (1996)
System recommends  Stand by Me (1986)
System recommends  Four Weddings and a Funeral (1994)
System recommends  Sense and Sensibility (1995)
System recommends  Godfather, The (1972)


This user is being recommended critically acclaimed dramas, romances, and Brad Pitt movies.

In [15]:
recommender(22, prediction_df, ratings_pivot, movies_df, 10)

System recommends  Sleepless in Seattle (1993)
System recommends  Birdcage, The (1996)
System recommends  Willy Wonka & the Chocolate Factory (1971)
System recommends  Clerks (1994)
System recommends  Shakespeare in Love (1998)
System recommends  Rock, The (1996)
System recommends  Jumanji (1995)
System recommends  Natural Born Killers (1994)
System recommends  Grumpier Old Men (1995)
System recommends  Groundhog Day (1993)


This user appears to be recommended mostly light-hearted comedies and romances. The exceptions are Natural Born Killers, which is more serious and violent (though arguably romantic), and The Rock, which is a standard action blockbuster.