# SVD

In this notebook, we will look at one of the most popular machine learning recommender models, Singular Value Decomposition, also known as SVD.

The idea behind SVD is matrix factorization. Two matrices multiplied together give a third matrix, their product. In reverse, a matrix can be decomposed into two factors. For a recommender, the matrix is the table of user-by-movie ratings, and the goal is to find two factor matrices that, when multiplied together, come as close as possible to the ratings we already know.
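
As a rough illustration of the factorization idea, the sketch below runs NumPy's classical SVD on a tiny, made-up ratings matrix with no missing values and keeps only `k = 2` factors. The matrix `R` and the value of `k` are assumptions chosen for the example. This is not the Surprise algorithm used later in the notebook, which learns its factors from the observed ratings only.

```python
import numpy as np

# A small, fully observed "ratings" matrix (4 users x 3 movies), made up for illustration.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

# Classical SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                              # number of latent factors to keep (assumed)
user_factors = U[:, :k] * s[:k]    # shape (4, k): one row per user
item_factors = Vt[:k, :]           # shape (k, 3): one column per movie

# Multiplying the two factor matrices gives an approximation of R.
R_approx = user_factors @ item_factors
max_error = np.abs(R - R_approx).max()

print(np.round(R_approx, 2))
print("largest absolute error:", round(float(max_error), 3))
```

Keeping all three factors would reproduce `R` exactly; dropping the smallest one trades a little accuracy for a more compact description of users and movies.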

The beauty of this approach is that it fills in ALL the NaN values, resulting in predicted ratings for every movie the user has never seen.
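
To make the "filling in the NaN values" idea concrete, here is a minimal sketch that learns two small factor matrices by stochastic gradient descent on the known ratings only, then multiplies them to get an estimate for every cell. The ratings, the number of factors `k`, the learning rate `lr`, and the regularization `reg` are all made-up assumptions. It leaves out the bias terms that Surprise's `SVD` also learns.

```python
import numpy as np

# Made-up ratings for 3 users x 3 movies; NaN marks a movie the user has not rated.
R = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 5.0]])
observed = [(u, i) for u in range(3) for i in range(3) if not np.isnan(R[u, i])]

rng = np.random.default_rng(0)
k = 2                                    # number of latent factors (assumed)
P = rng.normal(scale=0.1, size=(3, k))   # user factors
Q = rng.normal(scale=0.1, size=(3, k))   # movie factors
lr, reg = 0.05, 0.02                     # learning rate and regularization (assumed)

def squared_error(P, Q):
    """Sum of squared errors over the observed ratings only."""
    return sum((R[u, i] - P[u] @ Q[i]) ** 2 for u, i in observed)

initial_error = squared_error(P, Q)
for epoch in range(500):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
final_error = squared_error(P, Q)

R_filled = P @ Q.T   # every cell now has an estimate, including the former NaNs
print(np.round(R_filled, 2))
print("error on known ratings:", round(float(initial_error), 2), "->", round(float(final_error), 2))
```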

# Load the data

Run the following 2 cells to load the data.

In [1]:
# load movie ratings
url_movie_ratings = 'https://raw.githubusercontent.com/khanhnamle1994/movielens/master/ratings.csv'
import pandas as pd
df_ratings = pd.read_csv(url_movie_ratings, nrows=100000, sep='\\t', engine='python')
del df_ratings['user_emb_id']
del df_ratings['movie_emb_id']
del df_ratings['timestamp']
df_ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [2]:
# load movie titles
url_movies = 'https://raw.githubusercontent.com/khanhnamle1994/movielens/master/movies.csv'
df_movies = pd.read_csv(url_movies, error_bad_lines=False, encoding='latin-1', sep='\\t')
df_movies.head()


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


# Separate Title

In [3]:
# Strip the release year so each title is only a title, e.g. 'Toy Story (1995)' -> 'Toy Story'
df_movies['title'] = [t.split(' (')[0] for t in df_movies['title']]
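
One caveat with splitting on `' ('`: it cuts the title at the first parenthesis, so a title that carries an alternate name in parentheses loses that part too. The sketch below shows a possible alternative that removes only a trailing four-digit year. The helper name `strip_year` and the sample titles are illustrative assumptions, not part of the notebook.

```python
import re

def strip_year(title):
    """Remove a trailing ' (YYYY)' from a title; leave other parentheses alone."""
    return re.sub(r'\s*\(\d{4}\)\s*$', '', title)

print(strip_year('Toy Story (1995)'))          # Toy Story
print(strip_year('Seven (Se7en) (1995)'))      # Seven (Se7en)
print('Seven (Se7en) (1995)'.split(' (')[0])   # Seven
```

Either approach works for this notebook as long as the titles you type into your own ratings match the cleaned titles exactly.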

# Merge/Trim DataFrames

Combine your DataFrames into one.

In [4]:
# join titles to ratings on the shared movie_id column
df = pd.merge(df_movies, df_ratings, on='movie_id')
df = df[['user_id','title', 'rating']]
df.head()

Unnamed: 0,user_id,title,rating
0,1,Toy Story,5
1,6,Toy Story,4
2,8,Toy Story,4
3,9,Toy Story,5
4,10,Toy Story,5
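
For intuition about what the merge does, here is a small sketch on two hypothetical miniature tables. The rows are made up. `pd.merge` performs an inner join by default, so a movie with no ratings and a rating for an unknown movie both drop out of the result.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
movies = pd.DataFrame({'movie_id': [1, 2, 3],
                       'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men']})
ratings = pd.DataFrame({'user_id': [1, 6, 1, 8],
                        'movie_id': [1, 1, 2, 99],
                        'rating': [5, 4, 3, 2]})

# Inner join on movie_id: only movies that appear in both tables survive.
merged = pd.merge(movies, ratings, on='movie_id', how='inner')
merged = merged[['user_id', 'title', 'rating']]
print(merged)
```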


# Problem 1: Add Yourself to the DataFrame

In [26]:
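# 1a. Create a dictionary of movie titles and your ratings
# 1b. Loop through the dictionary and add yourself as user 0 with each movie and rating
myranks = {'Godfather, The': 5, 'Toy Story': 4}
my_rows = pd.DataFrame([{'user_id': 0, 'title': title, 'rating': rating}
                        for title, rating in myranks.items()])
# Append the new rows; df.iloc[0] = ... would overwrite an existing rating instead
df = pd.concat([df, my_rows], ignore_index=True)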

# Build SVD

In [5]:
pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1617778 sha256=e39611a2a4695ab52b853e92ddb8b8426f4c71667c2f6c06fe3bcdb6abbf699c
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [6]:
# import libraries
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Initialize a Reader to transform df.
reader = Reader(rating_scale=(1,5))

# Load the dataset using the reader
data = Dataset.load_from_df(df[['user_id', 'title', 'rating']], reader)

# Initialize SVD
svd = SVD(n_factors=50)

# Get the scores using cross-validation
cross_validate(svd, data, measures=['RMSE', 'MAE'])

{'fit_time': (3.0721633434295654,
  2.9512863159179688,
  2.9419586658477783,
  2.9530818462371826,
  2.954455852508545),
 'test_mae': array([0.73456313, 0.73206018, 0.73143865, 0.73551147, 0.72836544]),
 'test_rmse': array([0.92710701, 0.92596637, 0.92551462, 0.93198735, 0.92307739]),
 'test_time': (0.2234351634979248,
  0.14826416969299316,
  0.29373836517333984,
  0.15049433708190918,
  0.13502216339111328)}
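
`cross_validate` returns one score per fold, so it helps to average them before comparing models. The sketch below does that with the fold scores copied from the output above, using only the standard library.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_validate output above
test_rmse = [0.92710701, 0.92596637, 0.92551462, 0.93198735, 0.92307739]
test_mae = [0.73456313, 0.73206018, 0.73143865, 0.73551147, 0.72836544]

mean_rmse = mean(test_rmse)
mean_mae = mean(test_mae)
print(f"RMSE: {mean_rmse:.4f} (+/- {stdev(test_rmse):.4f})")
print(f"MAE:  {mean_mae:.4f} (+/- {stdev(test_mae):.4f})")
```

The averages come out at roughly 0.93 for RMSE and 0.73 for MAE, meaning that on a 1 to 5 scale the predictions are typically off by somewhat less than one star.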

In [7]:
# train svd on the full data
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f028a2d9c90>

# Predictions

In [10]:
# look at the movies user 2 has already rated, highest ratings first
user_2 = df[df['user_id'] == 2]
user_2.sort_values('rating', ascending=False)[:20]

Unnamed: 0,user_id,title,rating
45650,2,"Hunt for Red October, The",5
16827,2,"Silence of the Lambs, The",5
83695,2,"Green Mile, The",5
28216,2,On Golden Pond,5
29705,2,One Flew Over the Cuckoo's Nest,5
29908,2,Star Wars: Episode V - The Empire Strikes Back,5
34384,2,Dead Poets Society,5
34487,2,"Graduate, The",5
35321,2,Stand by Me,5
37833,2,Gandhi,5


In [11]:
user_2.sort_values('rating')[:15]

Unnamed: 0,user_id,title,rating
1050,2,Get Shorty,1
98834,2,Nurse Betty,1
11757,2,Cliffhanger,2
34347,2,Miller's Crossing,2
66487,2,"Thin Red Line, The",2
27156,2,Platoon,2
3185,2,Broken Arrow,2
21685,2,Breakfast at Tiffany's,2
83064,2,Backdraft,2
53473,2,"Breakfast Club, The",2


In [21]:
# predict user 2's rating for any movie (put a title between the quotes)
svd.predict(2, '').est

3.4210845591970998
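
The Surprise documentation describes the SVD estimate as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, with an unknown user or item contributing zero. That is presumably why a prediction for a title the model has never seen, such as the empty string above, still returns a number. The function below is a simplified sketch of that rule with hypothetical values, not Surprise's actual code.

```python
def estimate(global_mean, user_bias=0.0, item_bias=0.0,
             user_factors=None, item_factors=None, rating_scale=(1, 5)):
    """Sketch of a biased matrix-factorization estimate.

    Unknown users or items contribute nothing: pass no bias/factors for them.
    """
    est = global_mean + user_bias + item_bias
    if user_factors is not None and item_factors is not None:
        est += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return min(high, max(low, est))   # clip to the rating scale

# Hypothetical numbers, not taken from the trained model
known = estimate(3.5, user_bias=-0.1, item_bias=0.4,
                 user_factors=[0.2, -0.1], item_factors=[0.5, 0.3])
unknown_item = estimate(3.5, user_bias=-0.1)
print(round(known, 2), round(unknown_item, 2))
```

With an unknown item, only the global mean and the user's bias remain, so every unseen title gets the same estimate for a given user.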

# Problem 2: Analyze Predictions

In [31]:
# 2a. Put in a variety of predictions.
# 2b. What do you notice about the range of predictions?
# 2c. How good is the model in your estimation?
# 2d. What can you do to improve it?
for i in range(50):
  print(svd.predict(i, 'Toy Story').est)

4.239124103905194
4.503557884453132
3.819262002254969
4.115948804025126
4.440630167541723
3.67317936778398
4.323785383273277
4.623265422208202
4.29484062653146
4.012751029910917
4.786823645705213
3.474163301696102
4.050966535129133
3.776062835690371
3.697248394692642
3.786423974614635
3.8668274134266913
4.506356885559947
4.378268403183349
4.075570464396048
4.432337843329165
3.3604519110152125
3.3356178372736065
3.9816515272639132
4.30421281577501
4.209308466742784
3.7317432731409355
4.334198922791509
3.916826168507244
3.732096939468384
3.6032309053636737
4.440093328199236
3.8256652458571256
3.4319248766996067
4.5804797855695405
3.6800395475038825
4.6362519684257615
4.181410427257102
4.2569725782141505
4.1638989902132595
4.076723996421234
4.167557163102628
4.364521081892413
4.518078205938747
4.640455574026947
3.8553226670240095
4.878596348963088
3.964569464502006
3.842264770330398
4.103160494784565
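
One possible way to approach question 2b is to summarize the estimates rather than read them one by one. The sketch below uses the first ten values copied from the output above; in the notebook you could collect all of them into a list in the same way.

```python
from statistics import mean

# First ten estimates copied from the output above
estimates = [4.239124103905194, 4.503557884453132, 3.819262002254969,
             4.115948804025126, 4.440630167541723, 3.67317936778398,
             4.323785383273277, 4.623265422208202, 4.29484062653146,
             4.012751029910917]

lowest, highest, average = min(estimates), max(estimates), mean(estimates)
print(f"min={lowest:.2f} max={highest:.2f} mean={average:.2f}")
```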


# Problem 3: Increase Volume

In [None]:
# 3a. Add more of your own ratings. How does this change predictions?
# 3b. Add more users to the database using nrows. What's the net effect?



# Top predictions

The following code is slightly modified from the Surprise FAQ documentation page here: https://surprise.readthedocs.io/en/stable/FAQ.html#faq.

Run the code to get all the predictions.

In [22]:
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = svd.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

1 ['Shawshank Redemption, The', 'Wrong Trousers, The', 'Shakespeare in Love', 'When We Were Kings', 'His Girl Friday', 'Monty Python and the Holy Grail', 'Rear Window', 'Matrix, The', 'Usual Suspects, The', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb']
6 ['Saving Private Ryan', 'Wrong Trousers, The', 'Shawshank Redemption, The', 'To Kill a Mockingbird', 'Life Is Beautiful', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', "Schindler's List", 'North by Northwest', 'Matrix, The', 'Grand Day Out, A']
8 ['Seven Samurai', 'Wrong Trousers, The', 'Godfather, The', 'Usual Suspects, The', 'Christmas Story, A', 'Bridge on the River Kwai, The', 'This Is Spinal Tap', 'Maltese Falcon, The', 'Close Shave, A', 'Star Wars: Episode V - The Empire Strikes Back']
9 ['Raiders of the Lost Ark', 'Godfather, The', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Star Wars: Episode V - The Empire Strikes Back', 'Seven Samurai', "One Flew
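
If you only need the list for a single user, one option is to filter the predictions before sorting. The sketch below runs on a few made-up predictions; the `Prediction` namedtuple is a stand-in that mirrors the fields of Surprise's prediction objects, and `top_n_for_user` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for Surprise's predictions (uid, iid, r_ui, est, details); values are made up
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])
predictions = [
    Prediction(0, 'Matrix, The', 3.6, 4.7, {}),
    Prediction(0, 'Jumanji', 3.6, 3.1, {}),
    Prediction(0, 'Rear Window', 3.6, 4.5, {}),
    Prediction(2, 'Matrix, The', 3.6, 4.2, {}),
]

def top_n_for_user(predictions, user_id, n=10):
    """Return the n highest-estimated (item, estimate) pairs for one user."""
    user_preds = [(p.iid, p.est) for p in predictions if p.uid == user_id]
    user_preds.sort(key=lambda pair: pair[1], reverse=True)
    return user_preds[:n]

print(top_n_for_user(predictions, user_id=0, n=2))
```

In the notebook, the same lookup is also available from the dictionary returned by `get_top_n`, for example `top_n[2]` for user 2.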

# Problem 4: Get your recommendations only

In [None]:
#4a. Modify the top n recommendations above to get your recommendations only
#4b. Include the recommendations of a few other users as well.


# Problem 5: Tune SVD?

Check out the following article to determine if you should tune your SVD. You are encouraged to experiment and compare results: https://towardsdatascience.com/svd-where-model-tuning-goes-wrong-61c269402919. What are your thoughts after reading the article? Share in the chat.

# Challenge Problem: Surprise Data

See if you can figure out how to use the following code to load Surprise's built-in 100k data instead of the GitHub link. The catch is that it comes back as Surprise's own Dataset type rather than a DataFrame. Alternatively, see if you can find a way to insert yourself into their data. Keep in mind that ml-100k is the original MovieLens 100K release from 1998, so it does not include recent movies.

In [23]:
from surprise import Dataset

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
