In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.extmath import randomized_svd

## Movie Recommendations 
We'll make movie recommendations using the [MovieLens dataset](http://grouplens.org/datasets/movielens/). A much larger version is available at the same site, but we will use the small version, which contains 100,836 ratings by 610 users of 9,724 movies. Here is what the movies file looks like:

In [2]:
movies = pd.read_csv('data/ml-latest-small/movies.csv')
print(movies.shape)
movies[0:25]

(9742, 3)


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Notice that there are only 9,742 movies in our dataset, but the movieId values go well past 190,000. Many movieIds are skipped because this is just a subset of the original dataset, which contains many more movies:

In [3]:
movies.tail()

,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


Here is what the ratings file looks like:

In [4]:
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
print(ratings.shape)
ratings.head()

(100836, 4)


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


We have not covered the `groupby` method before, but it is very helpful for aggregating data. Here it counts the distinct users and movies in the ratings file:

In [5]:
print('Number of Users:')
print((ratings.groupby(['userId']).count()).shape[0])
print('Number of Movies:')
print((ratings.groupby(['movieId']).count()).shape[0])

Number of Users:
610
Number of Movies:
9724
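
As a small illustration of what `groupby` does, here is a sketch on a made-up ratings table. The `toy` DataFrame and its values are hypothetical; only the column names match `ratings.csv`.

```python
import pandas as pd

# Hypothetical toy ratings table with the same column names as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 40, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.0, 4.0],
})

# Rows per group: how many ratings each user submitted.
ratings_per_user = toy.groupby("userId")["rating"].count()

# Number of distinct groups, the same idea as groupby(...).count().shape[0].
n_users = toy.groupby("userId").ngroups
n_movies = toy["movieId"].nunique()

# Other aggregations work the same way, e.g. the mean rating per movie.
mean_per_movie = toy.groupby("movieId")["rating"].mean()

print(ratings_per_user.to_dict())
print(n_users, n_movies)
print(mean_per_movie.round(2).to_dict())
```

`ngroups` and `nunique()` are shorter ways to get the group counts when the per-group aggregates themselves are not needed.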


We can also use groupby to count how many ratings were given at each rating value:

In [6]:
ratings.groupby(['rating'])['userId'].count()

rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: userId, dtype: int64

We can create a pivot table where the columns correspond to the movieId and the rows correspond to the userId. We will fill in any movies that a user didn't rate with 0's. Below, we see that User #1 rated Movies #1, 3, and 6 with 4 stars, for example:

In [7]:
ratings_pivot = pd.pivot_table(ratings, index='userId', columns='movieId', values='rating', aggfunc='mean')  # string alias avoids the FutureWarning newer pandas raises for np.mean
#ratings_pivot = ratings_pivot.fillna(ratings_pivot.mean())
ratings_pivot = ratings_pivot.fillna(0)
ratings_pivot.head()

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
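
To make the pivot step concrete, here is a minimal sketch on a made-up table (the `toy` data is hypothetical). It also measures how sparse the table is before the zero fill. For the real data, 100,836 ratings spread over 610 × 9,724 cells is roughly 1.7% filled, assuming each (userId, movieId) pair appears only once.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 30, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_pivot = toy.pivot_table(index="userId", columns="movieId",
                            values="rating", aggfunc="mean")

# Share of cells that hold an actual rating (measured before filling).
density = toy_pivot.notna().to_numpy().mean()

# Same zero fill as above. After this, 0 means "not rated", not "rated 0 stars".
toy_filled = toy_pivot.fillna(0)

print(toy_filled)
print(round(density, 3))
```

Because the fill value is indistinguishable from a very low rating, the zeros influence every dot product computed later in this notebook.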


### Homework 1: Implement the Algorithm

Apply the `StandardScaler` to the pivot table and then create a `get_item_recommendations` function that prints the movies most similar to a given movie. Here are some sample outputs to check your work:

A Disney movie should return other Disney movies:
<img src="images/movie1.png" width=500>

A Star Wars movie should return other Star Wars movies:
<img src="images/movie2.png" width=500>

A "Chick Flick" movie like Shakespeare in Love should return other chick flick movies:
<img src="images/movie3.png" width=500>


In [8]:
#insert 1

### Homework 2: Research the Netflix Prize

Read a few articles to answer the following questions:

1.) How much was the Netflix Prize worth?

2.) What platform was the original contest hosted on?

3.) What is an overview of some things that went into the algorithm?

4.) Does Netflix actually use the algorithm?

5.) Are there any current machine learning company-sponsored contests going on that are worth lots of money to solve?

In [9]:
#insert 2

### WITHOUT SCALING, TERRIBLE JOB

In [10]:
def get_item_recommendations(title, df, num_recom):
#    print(movies[movies.title == title].movieId.values[0])
    compare_item = movies[movies.title == title].movieId.values[0]
    recs = []
    for item in df.columns.values:
        if item != compare_item:
            recs.append([item,np.dot(df[compare_item],df[item])])
    final_rec = [i for i in sorted(recs,key=lambda x: x[1],reverse=True)]
    final_rec = final_rec[:num_recom]
    rec_titles = [(movies[movies.movieId == i[0]].title.values[0], i[1]) for i in final_rec]
    for title, product in rec_titles:
        print(title, product)
#    return rec_titles

title = "Toy Story (1995)" #not great job
#title = "Star Wars: Episode I - The Phantom Menace (1999)" # good job
#title = "Waiting to Exhale (1995)" # a couple of chick flicks
print(f"Items similar to movie {title}:\n")
get_item_recommendations(title, ratings_pivot, 20)  # the function prints its results and returns None, so call it on its own line

Items similar to movie Toy Story (1995):

Forrest Gump (1994) 2476.5
Shawshank Redemption, The (1994) 2387.25
Pulp Fiction (1994) 2295.75
Star Wars: Episode IV - A New Hope (1977) 2242.0
Silence of the Lambs, The (1991) 1999.25
Matrix, The (1999) 1987.5
Jurassic Park (1993) 1972.5
Star Wars: Episode V - The Empire Strikes Back (1980) 1887.0
Star Wars: Episode VI - Return of the Jedi (1983) 1882.0
Braveheart (1995) 1804.5
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1770.25
Terminator 2: Judgment Day (1991) 1731.0
Apollo 13 (1995) 1710.0
Independence Day (a.k.a. ID4) (1996) 1690.25
Back to the Future (1985) 1684.0
Lion King, The (1994) 1680.75
Fight Club (1999) 1636.0
Aladdin (1992) 1633.75
Twelve Monkeys (a.k.a. 12 Monkeys) (1995) 1614.75
Shrek (2001) 1608.5


In [11]:
#toy story and forrest gump
print(np.dot(ratings_pivot[1], ratings_pivot[356]))
#toy story and toy story 2
print(np.dot(ratings_pivot[1], ratings_pivot[3114]))

2476.5
1308.75
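
One plausible reading of these two numbers is that a raw dot product over zero-filled columns adds one positive term for every user who rated both movies, so a widely rated movie can outscore a closely related one. The sketch below uses three made-up rating columns (all values hypothetical) to show the effect.

```python
import numpy as np

# Hypothetical zero-filled rating columns for 8 users (0 = not rated).
target  = np.array([5, 4, 0, 0, 5, 0, 4, 0], dtype=float)
sequel  = np.array([5, 4, 0, 0, 5, 0, 0, 0], dtype=float)  # rated only by fans of target
popular = np.array([4, 5, 4, 3, 4, 5, 4, 3], dtype=float)  # rated by everyone

# Raw dot products, as in the unscaled get_item_recommendations.
raw_sequel = np.dot(target, sequel)
raw_popular = np.dot(target, popular)

# Number of users who rated both movies in each pair.
co_raters_sequel = np.count_nonzero(target * sequel)
co_raters_popular = np.count_nonzero(target * popular)

print(raw_sequel, co_raters_sequel)
print(raw_popular, co_raters_popular)
```

Here the popular movie wins on the raw score simply because it shares more raters with the target, even though the sequel's rating pattern matches the target much more closely.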


### WITH SCALING, GOOD

In [12]:
sc_X = StandardScaler()
ratings_pivot_scaled = sc_X.fit_transform(ratings_pivot)

In [13]:
scaled_df = pd.DataFrame(ratings_pivot_scaled, columns = ratings_pivot.columns, index = ratings_pivot.index)

In [14]:
scaled_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.351355,-0.451377,3.877305,-0.102112,-0.282698,2.212582,-0.296949,-0.108158,-0.156937,-0.50648,...,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522
2,-0.713333,-0.451377,-0.289453,-0.102112,-0.282698,-0.437087,-0.296949,-0.108158,-0.156937,-0.50648,...,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522
3,-0.713333,-0.451377,-0.289453,-0.102112,-0.282698,-0.437087,-0.296949,-0.108158,-0.156937,-0.50648,...,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522
4,-0.713333,-0.451377,-0.289453,-0.102112,-0.282698,-0.437087,-0.296949,-0.108158,-0.156937,-0.50648,...,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522
5,1.351355,-0.451377,-0.289453,-0.102112,-0.282698,-0.437087,-0.296949,-0.108158,-0.156937,-0.50648,...,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522,-0.040522
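
It may help to see what the scaling buys us. `StandardScaler` centers each column and divides by its population standard deviation, so the dot product of two scaled columns equals the number of users times the Pearson correlation of the original columns. The sketch below checks this on a small made-up matrix (the values in `X` are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical zero-filled ratings: 8 users (rows) x 3 movies (columns).
# Column 0 is the query movie, column 1 a closely related movie,
# column 2 a movie that everyone rated.
X = np.array([
    [5, 5, 4],
    [4, 4, 5],
    [0, 0, 4],
    [0, 0, 3],
    [5, 5, 4],
    [0, 0, 5],
    [4, 0, 4],
    [0, 0, 3],
], dtype=float)

Z = StandardScaler().fit_transform(X)    # each column: mean 0, unit population std

n_users = X.shape[0]
scaled_dots = Z.T @ Z                    # all pairwise dot products of scaled columns
pearson = np.corrcoef(X, rowvar=False)   # pairwise Pearson correlations of raw columns

print(np.round(scaled_dots / n_users, 3))
print(np.round(pearson, 3))
```

Since the number of users is the same for every pair, ranking by the scaled dot product is the same as ranking by correlation, which no longer rewards a movie just for being rated by many people.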


Here, we use the full scaled matrix (no dimensionality reduction) and rank every other movie by its dot product with the chosen movie:

In [15]:
def get_item_recommendations(title, df, num_recom):
#    print(movies[movies.title == title].movieId.values[0])
    compare_item = movies[movies.title == title].movieId.values[0]
    recs = []
    for item in df.columns.values:
        if item != compare_item:
            recs.append([item,np.dot(df[compare_item],df[item])])
    final_rec = [i for i in sorted(recs,key=lambda x: x[1],reverse=True)]
    final_rec = final_rec[:num_recom]
    rec_titles = [(movies[movies.movieId == i[0]].title.values[0], i[1]) for i in final_rec]
    for title, product in rec_titles:
        print(title, product)
#    return rec_titles

#title = "Toy Story (1995)" 
#title = "Star Wars: Episode I - The Phantom Menace (1999)" 
title = "Shakespeare in Love (1998)"

# title = "Star Wars: Episode I - The Phantom Menace (1999)"
print(f"Items similar to movie {title}: \n")
get_item_recommendations(title,scaled_df,20)

Items similar to movie Shakespeare in Love (1998): 

When Harry Met Sally... (1989) 278.612245152014
Full Monty, The (1997) 261.4128769496032
Big (1988) 242.05116074958872
Moulin Rouge (2001) 238.33523949441582
L.A. Story (1991) 237.63263616362667
Say Anything... (1989) 236.6796536453023
Elizabeth (1998) 235.94385688731032
Thelma & Louise (1991) 235.18730810811212
Lady and the Tramp (1955) 234.12726380160484
Splash (1984) 232.47445689046347
Breakfast Club, The (1985) 231.40261106546814
Jerry Maguire (1996) 228.63524988217574
Fish Called Wanda, A (1988) 223.82947498366525
Ferris Bueller's Day Off (1986) 223.702952153431
Some Like It Hot (1959) 222.62473062546036
My Best Friend's Wedding (1997) 222.3132316920245
Sixteen Candles (1984) 221.74999284228622
Mary Poppins (1964) 217.64458369026204
Pleasantville (1998) 217.1162732210213
Working Girl (1988) 217.05164676823358


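Here, we compute an approximate truncated SVD of the scaled matrix and keep only the top 60 components. Each movie is then represented by its 60-dimensional column of `VT`, and we rank movies by the dot products of those columns: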

In [19]:
n_components=60

U, Sigma, VT = randomized_svd(scaled_df.to_numpy(), n_components)

def get_item_recommendations(title, df, num_recom):
    recs = []
    movieid = movies[movies.title == title].movieId.values[0]              # get the movielens movieid
    compare_item = np.where(ratings_pivot.columns.values == movieid)[0][0] # get the corresponding index of that item in the ratings_pivot table
    for item in range(df.T.shape[0]):                                    # for each item except itself
        if item != compare_item:
            recs.append((np.dot(df.T[compare_item],df.T[item]), item))   # calculate the dot product pairings between each of the items principal components
    recs.sort(reverse = True)                                            # sort in descending order of dot product value
    recs = recs[:num_recom]
    recs = [movies[movies.movieId == ratings_pivot.columns[i[1]]].title.values[0] for i in recs if len(movies[movies.movieId == ratings_pivot.columns[i[1]]].title.values[0]) > 0]
    return recs


#title = "Toy Story (1995)"
#title = "Shakespeare in Love (1998)"
title = "Star Wars: Episode I - The Phantom Menace (1999)"
print(f"Items similar to movie {title} with {n_components} components: \n")
get_item_recommendations(title,VT,20)

Items similar to movie Star Wars: Episode I - The Phantom Menace (1999) with 60 components: 



['Men in Black (a.k.a. MIB) (1997)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Fifth Element, The (1997)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Ghostbusters (a.k.a. Ghost Busters) (1984)',
 'Indiana Jones and the Temple of Doom (1984)',
 'Saving Private Ryan (1998)',
 'Star Wars: Episode II - Attack of the Clones (2002)',
 'Total Recall (1990)',
 'Terminator, The (1984)',
 'Matrix, The (1999)',
 'X-Men (2000)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Forever Young (1992)',
 'Lethal Weapon (1987)',
 'Indiana Jones and the Last Crusade (1989)',
 'Lethal Weapon 2 (1989)',
 'Star Wars: Episode III - Revenge of the Sith (2005)',
 'Back to the Future Part II (1989)',
 'E.T. the Extra-Terrestrial (1982)']
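
To see what `randomized_svd` returns, here is a minimal sketch on a small random matrix. The matrix `M`, the choice `k = 5`, and the fixed `random_state` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 50))            # hypothetical users x movies matrix

k = 5
U, Sigma, VT = randomized_svd(M, n_components=k, random_state=0)

print(U.shape, Sigma.shape, VT.shape)    # (20, 5) (5,) (5, 50)

# Column j of VT is the k-dimensional vector describing movie j.
item_vectors = VT.T                      # shape (50, 5)

# Dot products between movie 0 and every movie in the reduced space.
scores = item_vectors @ item_vectors[0]
ranked = np.argsort(-scores)             # positions sorted by descending score
ranked = ranked[ranked != 0][:3]         # drop the query movie, keep the top 3

# Rank-k approximation of M built from the three factors.
M_k = (U * Sigma) @ VT
print(ranked, M_k.shape)
```

The positions in `ranked` index columns of the matrix, which is why the function above maps them back to movieIds through `ratings_pivot.columns`.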

The original pivot table has this shape and this many individual entries:

In [17]:
print(ratings_pivot.shape, ratings_pivot.shape[0]*ratings_pivot.shape[1])

(610, 9724) 5931640


If we instead keep only 60 components, for example, then we only need to store the three truncated factors: `U` (610 × 60), `Sigma` (60 values), and `VT` (60 × 9,724). That comes to:

In [18]:
n = 60
# entries in U (users x n) + Sigma (n) + VT (n x movies)
ratings_pivot.shape[0]*n + n + n*ratings_pivot.shape[1]

620100
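
The same count can be written out as plain arithmetic and compared with the full table. This sketch only reuses the shapes reported above.

```python
# Shapes taken from the cells above: 610 users, 9,724 movies, 60 components.
n_users, n_movies, k = 610, 9724, 60

full = n_users * n_movies                    # every cell of the pivot table
truncated = n_users * k + k + k * n_movies   # U (610x60) + Sigma (60) + VT (60x9724)

print(full, truncated)
print(f"{truncated / full:.1%} of the original number of entries")
```

So the 60-component factorization needs roughly a tenth of the entries of the full pivot table.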

In [None]:
n_components=60

U, Sigma, VT = randomized_svd(scaled_df.to_numpy(), n_components)

def get_item_recommendations(title, df, num_recom):  # take the title as a parameter instead of reading the global
    recs = []
    movieid = movies[movies.title == title].movieId.values[0]              # get the movielens movieid
    compare_item = np.where(ratings_pivot.columns.values == movieid)[0][0]
    ### ????
    return recs

title = "Star Wars: Episode I - The Phantom Menace (1999)"
print(f"Items similar to movie {title} with {n_components} components: \n")
get_item_recommendations(title,VT,20)