In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.extmath import randomized_svd

## Movie Recommendations 
We'll make movie recommendations from the [movielens dataset](http://grouplens.org/datasets/movielens/). Much larger versions of the dataset are available at the same link, but we will use the small version, which contains just over 100,000 ratings by 610 users covering 9,724 rated movies. Here is what the movie file looks like:

In [30]:
movies = pd.read_csv('data/ml-latest-small/movies.csv')
print(movies.shape)
movies[0:25]

(9742, 3)


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Notice that there are only 9,742 movies in our dataset, but the movieId values go all the way into the hundreds of thousands. This is because many movieId values are skipped: the file is a subset of an original dataset that contains many more movies.

In [31]:
movies.tail()

,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy
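The gap between the number of movies and the largest movieId can be checked directly. The following is a minimal sketch on a hand-built five-row frame (`toy_movies` is a hypothetical stand-in, not the real file); it also shows one possible way to map sparse IDs to contiguous positions with `pd.factorize`, should a dense index ever be needed.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv: a few rows with the same gaps in movieId.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 193581, 193609],
    'title': [
        'Toy Story (1995)',
        'Jumanji (1995)',
        'Grumpier Old Men (1995)',
        'Black Butler: Book of the Atlantic (2017)',
        'Andrew Dice Clay: Dice Rules (1991)',
    ],
})

n_movies = len(toy_movies)            # number of rows (movies)
max_id = toy_movies['movieId'].max()  # largest ID, far bigger than the row count

# factorize assigns 0..n-1 in order of first appearance,
# giving a contiguous index for the sparse IDs.
codes, uniques = pd.factorize(toy_movies['movieId'])

print(n_movies, max_id)
print(list(codes))
```

On the real file, the same comparison would be `len(movies)` against `movies['movieId'].max()`.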


Here is what the ratings file looks like:

In [32]:
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
print(ratings.shape)
ratings.head()

(100836, 4)


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


We have not covered the "groupby" method before, but it is very helpful for aggregating data. Here we group the ratings by user and by movie and count the resulting groups:

In [33]:
print('Number of Users:')
print((ratings.groupby(['userId']).count()).shape[0])
print('Number of Movies:')
print((ratings.groupby(['movieId']).count()).shape[0])

Number of Users:
610
Number of Movies:
9724
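To make the split-then-aggregate idea behind groupby more concrete, here is a small sketch on made-up data (`toy` is a hypothetical frame, and its values are invented for illustration). It also shows that counting groups gives the same answer as counting distinct keys with `nunique`.

```python
import pandas as pd

# Hypothetical toy ratings (made-up values, same column names as ratings.csv).
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 10],
    'rating':  [4.0, 5.0, 3.0, 4.0, 5.0],
})

# groupby splits the rows by key, then aggregates each group separately.
per_movie = toy.groupby('movieId')['rating'].agg(['count', 'mean'])
print(per_movie)

# The number of groups equals the number of distinct keys.
n_users = toy['userId'].nunique()
n_movies = toy['movieId'].nunique()
print(n_users, n_movies)
```

Here movie 10 was rated by three users with a mean of 4.0, while the other two movies each have a single rating.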


We can also use groupby to count how many times each rating value was given:

In [34]:
ratings.groupby(['rating'])['userId'].count()

rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: userId, dtype: int64

We can create a pivot table where the columns correspond to the movieId and the rows correspond to the userId. We will fill in any movies that a user didn't rate with 0s. Below, we see that User #1 rated Movies #1, #3, and #6 with 4 stars, for example:

In [35]:
# aggfunc='mean' (as a string) avoids the deprecation warning newer pandas raises for np.mean
ratings_pivot = pd.pivot_table(ratings, index='userId', columns='movieId', values='rating', aggfunc='mean')
ratings_pivot = ratings_pivot.fillna(0)
ratings_pivot.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
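Most cells in this table are zeros that stand for "not rated" rather than an actual score. Using the counts reported earlier (100,836 ratings, 610 users, 9,724 rated movies), only about 1.7% of the cells hold a real rating, assuming each user rated a given movie at most once. The sketch below shows one way to measure this on a hypothetical toy frame, by checking for missing values before they are filled with 0.

```python
import pandas as pd

# Hypothetical toy ratings (made-up values for illustration).
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 10],
    'rating':  [4.0, 5.0, 3.0, 4.0, 5.0],
})

toy_pivot = pd.pivot_table(toy, index='userId', columns='movieId',
                           values='rating', aggfunc='mean')

# Count observed ratings before filling: NaN marks "not rated".
observed = int(toy_pivot.notna().sum().sum())
density = observed / toy_pivot.size  # fraction of cells with a real rating

toy_filled = toy_pivot.fillna(0)     # same fill step as in the cell above
print(observed, toy_pivot.size, round(density, 3))
```

Keeping this fraction in mind helps when judging similarity scores later: two movies can look alike simply because the same users skipped both.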


### Homework 1: Implement the Algorithm

Notice that if we apply our previous get_item_recommendations to our entire DataFrame (without using SVD) then we get the following recommendations:

A Disney movie should return other Disney movies:
<img src="images/movie1.png" width=500>

A Star Wars movie should return other Star Wars movies:
<img src="images/movie2.png" width=500>

A "Chick Flick" movie like Shakespeare in Love should return other chick flick movies:
<img src="images/movie3.png" width=500>

Apply the StandardScaler to the pivot table, then create a get_item_recommendations function that prints the movies most similar to a given movie. Apply SVD and experiment with the number of components until the movie recommendations make sense.
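As a starting point, here is a rough sketch of the pipeline on a tiny hand-made matrix. Everything in it is an assumption for illustration: the toy ratings, the helper name `sketch_item_recommendations`, and the choice of cosine similarity between item factors (the similarity measure used by the earlier get_item_recommendations may differ). It is not a reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils.extmath import randomized_svd

# Hypothetical toy pivot table: 6 users x 4 movies, 0 means "not rated".
toy_pivot = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [5, 5, 1, 0],
     [0, 1, 5, 4],
     [1, 0, 4, 5],
     [0, 0, 5, 5]],
    index=[1, 2, 3, 4, 5, 6],
    columns=['Action 1', 'Action 2', 'Romance 1', 'Romance 2'],
)

def sketch_item_recommendations(pivot, item, n_components=2, top_n=3):
    """Hypothetical helper: rank the columns of `pivot` by similarity to `item`."""
    # Standardize each movie column (zero mean, unit variance).
    scaled = StandardScaler().fit_transform(pivot.values)
    # Truncated SVD: keep only the strongest n_components directions.
    U, sigma, VT = randomized_svd(scaled, n_components=n_components,
                                  random_state=0)
    # One row of latent factors per movie, weighted by the singular values.
    item_factors = VT.T * sigma
    # Cosine similarity between movies in the reduced space.
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero factor rows
    unit = item_factors / norms
    sims = pd.Series(unit @ unit[pivot.columns.get_loc(item)],
                     index=pivot.columns)
    return sims.drop(item).sort_values(ascending=False).head(top_n)

recs = sketch_item_recommendations(toy_pivot, 'Action 1')
print(recs)
```

On the real data, `n_components` is the knob to experiment with: too few components blur distinct tastes together, while too many bring back the noise of the raw, mostly empty matrix.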

In [None]:
#insert 1

### Homework 2: Research the Netflix Prize

Read a few articles to answer the following questions:

1.) How much was the Netflix Prize worth?

2.) What platform was the original contest hosted on?

3.) What were some of the main techniques that went into the winning algorithm?

4.) Does Netflix actually use the algorithm?

5.) Are there any company-sponsored machine learning contests currently running that offer large cash prizes?

In [50]:
#insert 2