
# Advanced Recommender Systems with Python

Welcome to the code notebook for creating Advanced Recommender Systems with Python. This is an optional lecture notebook for you to check out. Currently there is no video for this lecture because of the level of mathematics used and the heavy use of SciPy here.

Recommendation Systems usually rely on larger data sets and specifically need to be organized in a particular fashion. Because of this, we won't have a project to go along with this topic, instead we will have a more intensive walkthrough process on creating a recommendation system with Python with the same Movie Lens Data Set.

*Note: The actual mathematics behind recommender systems is pretty heavy in Linear Algebra.*
___

## Methods Used

Two most common types of recommender systems are **Content-Based** and **Collaborative Filtering (CF)**. 

* Collaborative filtering produces recommendations based on the knowledge of users’ attitude to items, that is it uses the "wisdom of the crowd" to recommend items. 
* Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.

## Collaborative Filtering

In general, Collaborative filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand (from an overall implementation perspective). The algorithm has the ability to do feature learning on its own, which means that it can start to learn for itself what features to use. 

CF can be divided into **Memory-Based Collaborative Filtering** and **Model-Based Collaborative filtering**. 

In this tutorial, we will implement Model-Based CF by using singular value decomposition (SVD) and Memory-Based CF by computing cosine similarity. 

## The Data

We will use famous MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. It contains 100k movie ratings from 943 users and a selection of 1682 movies.

You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip) or just use the u.data file that is already included in this folder.

____
## Getting Started

Let's import some libraries we will need:

In [1]:
import numpy as np
import pandas as pd

We can then read in the **u.data** file, which contains the full dataset. You can read a brief description of the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).

Note how we specify the separator argument for a Tab separated file.

In [2]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

Let's take a quick look at the data.

In [3]:
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


Note how we only have the item_id, not the movie name. We can use the Movie_ID_Titles csv file to grab the movie names and merge it with this dataframe:

In [4]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


Then merge the dataframes:

In [5]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


Now let's take a quick look at the number of unique users and movies.

In [6]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


## Train Test Split

Recommendation Systems by their very nature are very difficult to evaluate, but we will still show you how to evaluate them in this tutorial. In order to do this, we'll split our data into two sets. However, we won't do our classic X_train,X_test,y_train,y_test split. Instead we can actually just segement the data into two sets of data:

In [10]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

print(type(train_data))

<class 'pandas.core.frame.DataFrame'>


## Memory-Based Collaborative Filtering

Memory-Based Collaborative Filtering approaches can be divided into two main sections: **user-item filtering** and **item-item filtering**. 

A *user-item filtering* will take a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked. 

In contrast, *item-item filtering* will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and outputs other items as recommendations. 

* *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”
* *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”

In both cases, you create a user-item matrix which built from the entire dataset.

Since we have split the data into testing and training we will need to create two ``[943 x 1682]`` matrices (all users by all movies). 

The training matrix contains 75% of the ratings and the testing matrix contains 25% of the ratings.  

Example of user-item matrix:
<img class="aligncenter size-thumbnail img-responsive" src="http://s33.postimg.org/ay0ty90fj/BLOG_CCA_8.png" alt="blog8"/>

After you have built the user-item matrix you calculate the similarity and create a similarity matrix. 

The similarity values between items in *Item-Item Collaborative Filtering* are measured by observing all the users who have rated both items.  

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/i522ma83z/BLOG_CCA_10.png"/>

For *User-Item Collaborative Filtering* the similarity values between users are measured by observing all the items that are rated by both users.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/mlh3z3z4f/BLOG_CCA_11.png"/>

A distance metric commonly used in recommender systems is *cosine similarity*, where the ratings are seen as vectors in ``n``-dimensional space and the similarity is calculated based on the angle between these vectors. 
Cosine similiarity for users *a* and *m* can be calculated using the formula below, where you take dot product of  the user vector *$u_k$* and the user vector *$u_a$* and divide it by multiplication of the Euclidean lengths of the vectors.
<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?s_u^{cos}(u_k,u_a)=\frac{u_k&space;\cdot&space;u_a&space;}{&space;\left&space;\|&space;u_k&space;\right&space;\|&space;\left&space;\|&space;u_a&space;\right&space;\|&space;}&space;=\frac{\sum&space;x_{k,m}x_{a,m}}{\sqrt{\sum&space;x_{k,m}^2\sum&space;x_{a,m}^2}}"/>

To calculate similarity between items *m* and *b* you use the formula:

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?s_u^{cos}(i_m,i_b)=\frac{i_m&space;\cdot&space;i_b&space;}{&space;\left&space;\|&space;i_m&space;\right&space;\|&space;\left&space;\|&space;i_b&space;\right&space;\|&space;}&space;=\frac{\sum&space;x_{a,m}x_{a,b}}{\sqrt{\sum&space;x_{a,m}^2\sum&space;x_{a,b}^2}}
"/>

Your first step will be to create the user-item matrix. Since you have both testing and training data you need to create two matrices.  

In [13]:
for line in train_data.itertuples():
    print(line)

Pandas(Index=11253, user_id=687, item_id=288, rating=4, timestamp=884651576, title='Scream (1996)')
Pandas(Index=78900, user_id=23, item_id=83, rating=4, timestamp=874785926, title='Much Ado About Nothing (1993)')
Pandas(Index=60039, user_id=896, item_id=128, rating=4, timestamp=887159321, title='Supercop (1992)')
Pandas(Index=46647, user_id=727, item_id=188, rating=3, timestamp=883711679, title='Full Metal Jacket (1987)')
Pandas(Index=38223, user_id=543, item_id=204, rating=4, timestamp=874864737, title='Back to the Future (1985)')
Pandas(Index=80163, user_id=535, item_id=848, rating=3, timestamp=879618743, title='Murder, My Sweet (1944)')
Pandas(Index=50907, user_id=291, item_id=475, rating=5, timestamp=874805699, title='Trainspotting (1996)')
Pandas(Index=62031, user_id=618, item_id=28, rating=4, timestamp=891309887, title='Apollo 13 (1995)')
Pandas(Index=56593, user_id=718, item_id=284, rating=4, timestamp=883349191, title='Tin Cup (1996)')
Pandas(Index=53747, user_id=224, item_id=

Pandas(Index=43003, user_id=239, item_id=286, rating=1, timestamp=889178512, title='English Patient, The (1996)')
Pandas(Index=95165, user_id=600, item_id=1188, rating=3, timestamp=888452152, title='Young Guns II (1990)')
Pandas(Index=16146, user_id=25, item_id=23, rating=4, timestamp=885852529, title='Taxi Driver (1976)')
Pandas(Index=87645, user_id=83, item_id=31, rating=5, timestamp=880307751, title='Crimson Tide (1995)')
Pandas(Index=50450, user_id=305, item_id=156, rating=4, timestamp=886323068, title='Reservoir Dogs (1992)')
Pandas(Index=96305, user_id=627, item_id=1136, rating=4, timestamp=879530762, title='Ghosts of Mississippi (1996)')
Pandas(Index=26108, user_id=622, item_id=79, rating=5, timestamp=882591979, title='Fugitive, The (1993)')
Pandas(Index=95359, user_id=102, item_id=1239, rating=2, timestamp=888802319, title='Cutthroat Island (1995)')
Pandas(Index=63424, user_id=145, item_id=877, rating=2, timestamp=885557506, title='Excess Baggage (1997)')
Pandas(Index=45112, us

Pandas(Index=81505, user_id=276, item_id=772, rating=4, timestamp=874790826, title='Kids (1995)')
Pandas(Index=86990, user_id=699, item_id=413, rating=3, timestamp=884152706, title='Tales from the Crypt Presents: Bordello of Blood (1996)')
Pandas(Index=4710, user_id=657, item_id=1, rating=3, timestamp=884239123, title='Toy Story (1995)')
Pandas(Index=58666, user_id=588, item_id=655, rating=3, timestamp=890025864, title='Stand by Me (1986)')
Pandas(Index=31773, user_id=380, item_id=109, rating=2, timestamp=885480093, title='Mystery Science Theater 3000: The Movie (1996)')
Pandas(Index=88559, user_id=354, item_id=606, rating=5, timestamp=891217633, title='All About Eve (1950)')
Pandas(Index=79983, user_id=393, item_id=1044, rating=4, timestamp=889731821, title='Paper, The (1994)')
Pandas(Index=41926, user_id=913, item_id=169, rating=4, timestamp=880757553, title='Wrong Trousers, The (1993)')
Pandas(Index=41184, user_id=374, item_id=197, rating=5, timestamp=882158940, title='Graduate, The

Pandas(Index=15611, user_id=14, item_id=498, rating=5, timestamp=890881384, title='African Queen, The (1951)')
Pandas(Index=60423, user_id=506, item_id=586, rating=2, timestamp=885135882, title='Terminal Velocity (1994)')
Pandas(Index=44730, user_id=535, item_id=612, rating=4, timestamp=879618385, title='Lost Horizon (1937)')
Pandas(Index=7983, user_id=62, item_id=4, rating=4, timestamp=879374640, title='Get Shorty (1995)')
Pandas(Index=4248, user_id=43, item_id=118, rating=4, timestamp=883955546, title='Twister (1996)')
Pandas(Index=90687, user_id=561, item_id=45, rating=3, timestamp=885808716, title='Eat Drink Man Woman (1994)')
Pandas(Index=23561, user_id=250, item_id=117, rating=3, timestamp=878089628, title='Rock, The (1996)')
Pandas(Index=97, user_id=125, item_id=50, rating=5, timestamp=892836362, title='Star Wars (1977)')
Pandas(Index=23978, user_id=624, item_id=273, rating=4, timestamp=879793129, title='Heat (1995)')
Pandas(Index=62840, user_id=666, item_id=566, rating=3, times

Pandas(Index=76055, user_id=339, item_id=428, rating=5, timestamp=891032349, title='Harold and Maude (1971)')
Pandas(Index=99257, user_id=234, item_id=1447, rating=3, timestamp=892336119, title='Century (1993)')
Pandas(Index=82693, user_id=561, item_id=530, rating=4, timestamp=885807547, title='Man Who Would Be King, The (1975)')
Pandas(Index=97979, user_id=369, item_id=890, rating=3, timestamp=889428268, title='Mortal Kombat: Annihilation (1997)')
Pandas(Index=70687, user_id=829, item_id=213, rating=4, timestamp=881698933, title='Room with a View, A (1986)')
Pandas(Index=78665, user_id=90, item_id=499, rating=5, timestamp=891383866, title='Cat on a Hot Tin Roof (1958)')
Pandas(Index=5803, user_id=274, item_id=234, rating=5, timestamp=878946536, title='Jaws (1975)')
Pandas(Index=73448, user_id=721, item_id=331, rating=3, timestamp=877137285, title='Edge, The (1997)')
Pandas(Index=83717, user_id=367, item_id=441, rating=3, timestamp=876690049, title='Amityville Horror, The (1979)')
Pand

Pandas(Index=74352, user_id=339, item_id=435, rating=4, timestamp=891032189, title='Butch Cassidy and the Sundance Kid (1969)')
Pandas(Index=73810, user_id=758, item_id=24, rating=4, timestamp=881979891, title='Rumble in the Bronx (1995)')
Pandas(Index=36405, user_id=331, item_id=198, rating=4, timestamp=877196634, title='Nikita (La Femme Nikita) (1990)')
Pandas(Index=14670, user_id=628, item_id=338, rating=5, timestamp=880776981, title='Bean (1997)')
Pandas(Index=37411, user_id=344, item_id=508, rating=4, timestamp=884814697, title='People vs. Larry Flynt, The (1996)')
Pandas(Index=59603, user_id=22, item_id=692, rating=4, timestamp=878886480, title='American President, The (1995)')
Pandas(Index=38319, user_id=198, item_id=204, rating=3, timestamp=884207584, title='Back to the Future (1985)')
Pandas(Index=7821, user_id=385, item_id=201, rating=4, timestamp=879441982, title='Evil Dead II (1987)')
Pandas(Index=29857, user_id=437, item_id=1098, rating=3, timestamp=880141243, title='Flirt

Pandas(Index=98772, user_id=585, item_id=730, rating=3, timestamp=891285188, title='Queen Margot (Reine Margot, La) (1994)')
Pandas(Index=14616, user_id=250, item_id=338, rating=4, timestamp=883263374, title='Bean (1997)')
Pandas(Index=20461, user_id=201, item_id=750, rating=3, timestamp=884110598, title='Amistad (1997)')
Pandas(Index=97427, user_id=311, item_id=1297, rating=4, timestamp=884365654, title='Love Affair (1994)')
Pandas(Index=65784, user_id=267, item_id=203, rating=5, timestamp=878972241, title='Unforgiven (1992)')
Pandas(Index=10585, user_id=536, item_id=423, rating=4, timestamp=882360601, title='E.T. the Extra-Terrestrial (1982)')
Pandas(Index=20971, user_id=214, item_id=135, rating=3, timestamp=891544175, title='2001: A Space Odyssey (1968)')
Pandas(Index=1968, user_id=218, item_id=265, rating=3, timestamp=881288408, title='Hunt for Red October, The (1990)')
Pandas(Index=94621, user_id=291, item_id=1210, rating=4, timestamp=875087656, title='Virtuosity (1995)')
Pandas(I

Pandas(Index=78881, user_id=815, item_id=671, rating=4, timestamp=878695679, title='Bride of Frankenstein (1935)')
Pandas(Index=12689, user_id=388, item_id=690, rating=5, timestamp=886438540, title='Seven Years in Tibet (1997)')
Pandas(Index=23878, user_id=94, item_id=273, rating=4, timestamp=885872684, title='Heat (1995)')
Pandas(Index=53068, user_id=42, item_id=168, rating=3, timestamp=881107773, title='Monty Python and the Holy Grail (1974)')
Pandas(Index=51741, user_id=674, item_id=289, rating=2, timestamp=887763151, title='Evita (1996)')
Pandas(Index=58532, user_id=894, item_id=1462, rating=3, timestamp=882404642, title='Thieves (Voleurs, Les) (1996)')
Pandas(Index=35433, user_id=224, item_id=468, rating=4, timestamp=888104030, title='Rudy (1993)')
Pandas(Index=82776, user_id=234, item_id=528, rating=4, timestamp=892079689, title='Killing Fields, The (1984)')
Pandas(Index=66767, user_id=293, item_id=421, rating=3, timestamp=888906576, title="William Shakespeare's Romeo and Juliet 

Pandas(Index=91647, user_id=525, item_id=928, rating=3, timestamp=881086586, title='Craft, The (1996)')
Pandas(Index=1717, user_id=640, item_id=346, rating=4, timestamp=886353742, title='Jackie Brown (1997)')
Pandas(Index=84104, user_id=380, item_id=356, rating=2, timestamp=885480064, title='Client, The (1994)')
Pandas(Index=47078, user_id=934, item_id=97, rating=4, timestamp=891192329, title='Dances with Wolves (1990)')
Pandas(Index=42863, user_id=479, item_id=175, rating=4, timestamp=879461102, title='Brazil (1985)')
Pandas(Index=22164, user_id=82, item_id=756, rating=1, timestamp=878768741, title='Father of the Bride Part II (1995)')
Pandas(Index=12394, user_id=849, item_id=15, rating=5, timestamp=879695896, title="Mr. Holland's Opus (1995)")
Pandas(Index=8499, user_id=181, item_id=100, rating=3, timestamp=878962816, title='Fargo (1996)')
Pandas(Index=78754, user_id=276, item_id=207, rating=4, timestamp=874795988, title='Cyrano de Bergerac (1990)')
Pandas(Index=1103, user_id=694, it

Pandas(Index=904, user_id=748, item_id=172, rating=4, timestamp=879454810, title='Empire Strikes Back, The (1980)')
Pandas(Index=60582, user_id=297, item_id=147, rating=3, timestamp=874955183, title='Long Kiss Goodnight, The (1996)')
Pandas(Index=38903, user_id=588, item_id=227, rating=3, timestamp=890028385, title='Star Trek VI: The Undiscovered Country (1991)')
Pandas(Index=72929, user_id=437, item_id=283, rating=1, timestamp=880141716, title='Emma (1996)')
Pandas(Index=25081, user_id=874, item_id=137, rating=4, timestamp=888632484, title='Big Night (1996)')
Pandas(Index=44915, user_id=722, item_id=845, rating=5, timestamp=891280842, title='That Thing You Do! (1996)')
Pandas(Index=98738, user_id=13, item_id=400, rating=4, timestamp=885744650, title='Little Rascals, The (1994)')
Pandas(Index=2701, user_id=392, item_id=257, rating=5, timestamp=891038184, title='Men in Black (1997)')
Pandas(Index=75255, user_id=251, item_id=429, rating=4, timestamp=886271955, title='Day the Earth Stood 

Pandas(Index=37627, user_id=211, item_id=303, rating=3, timestamp=879437184, title="Ulee's Gold (1997)")
Pandas(Index=81210, user_id=655, item_id=773, rating=3, timestamp=887430072, title='Mute Witness (1994)')
Pandas(Index=8558, user_id=349, item_id=100, rating=4, timestamp=879466479, title='Fargo (1996)')
Pandas(Index=3695, user_id=554, item_id=274, rating=3, timestamp=876232317, title='Sabrina (1995)')
Pandas(Index=4362, user_id=500, item_id=118, rating=3, timestamp=883865610, title='Twister (1996)')
Pandas(Index=34007, user_id=193, item_id=56, rating=1, timestamp=889125572, title='Pulp Fiction (1994)')
Pandas(Index=84795, user_id=425, item_id=684, rating=2, timestamp=878738385, title='In the Line of Fire (1993)')
Pandas(Index=91423, user_id=321, item_id=607, rating=4, timestamp=879440109, title='Rebecca (1940)')
Pandas(Index=91443, user_id=617, item_id=607, rating=4, timestamp=883789212, title='Rebecca (1940)')
Pandas(Index=48846, user_id=301, item_id=47, rating=4, timestamp=882076

Pandas(Index=53872, user_id=782, item_id=873, rating=4, timestamp=891498512, title='Picture Perfect (1997)')
Pandas(Index=59077, user_id=303, item_id=588, rating=5, timestamp=879468459, title='Beauty and the Beast (1991)')
Pandas(Index=70596, user_id=891, item_id=107, rating=5, timestamp=883490041, title='Moll Flanders (1996)')
Pandas(Index=65246, user_id=581, item_id=9, rating=5, timestamp=879641787, title='Dead Man Walking (1995)')
Pandas(Index=37877, user_id=716, item_id=215, rating=5, timestamp=879796046, title='Field of Dreams (1989)')
Pandas(Index=95435, user_id=416, item_id=934, rating=2, timestamp=876698178, title="Preacher's Wife, The (1996)")
Pandas(Index=47739, user_id=547, item_id=315, rating=4, timestamp=891282555, title='Apt Pupil (1998)')
Pandas(Index=87479, user_id=854, item_id=628, rating=2, timestamp=882812451, title='Sleepers (1996)')
Pandas(Index=99256, user_id=405, item_id=1290, rating=2, timestamp=885546379, title='Country Life (1994)')
Pandas(Index=76618, user_id

Pandas(Index=86048, user_id=497, item_id=765, rating=3, timestamp=879363155, title='Boomerang (1992)')
Pandas(Index=58092, user_id=610, item_id=70, rating=4, timestamp=888703609, title='Four Weddings and a Funeral (1994)')
Pandas(Index=53232, user_id=790, item_id=550, rating=4, timestamp=885156618, title='Die Hard: With a Vengeance (1995)')
Pandas(Index=69152, user_id=823, item_id=87, rating=5, timestamp=878438887, title='Searching for Bobby Fischer (1993)')
Pandas(Index=36509, user_id=385, item_id=189, rating=5, timestamp=881530739, title='Grand Day Out, A (1992)')
Pandas(Index=88241, user_id=72, item_id=844, rating=4, timestamp=880035708, title='Freeway (1996)')
Pandas(Index=68204, user_id=608, item_id=93, rating=4, timestamp=880406299, title='Welcome to the Dollhouse (1995)')
Pandas(Index=61926, user_id=682, item_id=686, rating=4, timestamp=888519725, title='Perfect World, A (1993)')
Pandas(Index=36044, user_id=878, item_id=82, rating=3, timestamp=880870609, title='Jurassic Park (19

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Pandas(Index=95238, user_id=13, item_id=646, rating=4, timestamp=882140037, title='Once Upon a Time in the West (1969)')
Pandas(Index=46486, user_id=151, item_id=716, rating=2, timestamp=879528778, title='Home for the Holidays (1995)')
Pandas(Index=48198, user_id=777, item_id=238, rating=4, timestamp=875980541, title='Raising Arizona (1987)')
Pandas(Index=66649, user_id=186, item_id=977, rating=3, timestamp=879023273, title='Substitute, The (1996)')
Pandas(Index=68631, user_id=322, item_id=513, rating=4, timestamp=887314185, title='Third Man, The (1949)')
Pandas(Index=90718, user_id=828, item_id=45, rating=4, timestamp=891380166, title='Eat Drink Man Woman (1994)')
Pandas(Index=63877, user_id=297, item_id=269, rating=4, timestamp=875774037, title='Full Monty, The (1997)')
Pandas(Index=16799, user_id=474, item_id=208, rating=3, timestamp=887925497, title='Young Frankenstein (1974)')
Pandas(Index=83892, user_id=189, item_id=863, rating=4, timestamp=893266161, title='Garden of Finzi-Conti

Pandas(Index=41356, user_id=396, item_id=678, rating=3, timestamp=884645838, title='Volcano (1997)')
Pandas(Index=13427, user_id=454, item_id=237, rating=4, timestamp=881960361, title='Jerry Maguire (1996)')
Pandas(Index=72027, user_id=264, item_id=792, rating=5, timestamp=886123415, title='Bullets Over Broadway (1994)')
Pandas(Index=4967, user_id=682, item_id=1, rating=4, timestamp=888523054, title='Toy Story (1995)')
Pandas(Index=27760, user_id=864, item_id=48, rating=5, timestamp=888886945, title='Hoop Dreams (1994)')
Pandas(Index=36068, user_id=911, item_id=82, rating=2, timestamp=892840888, title='Jurassic Park (1993)')
Pandas(Index=28975, user_id=916, item_id=280, rating=2, timestamp=880843864, title='Up Close and Personal (1996)')
Pandas(Index=75766, user_id=506, item_id=693, rating=4, timestamp=874876651, title='Casino (1995)')
Pandas(Index=6150, user_id=639, item_id=98, rating=4, timestamp=891240643, title='Silence of the Lambs, The (1991)')
Pandas(Index=84250, user_id=436, it

Pandas(Index=7349, user_id=200, item_id=304, rating=5, timestamp=876041644, title='Fly Away Home (1996)')
Pandas(Index=65513, user_id=308, item_id=233, rating=3, timestamp=887738346, title='Under Siege (1992)')
Pandas(Index=42844, user_id=7, item_id=175, rating=5, timestamp=892133057, title='Brazil (1985)')
Pandas(Index=93885, user_id=459, item_id=978, rating=2, timestamp=879563435, title="Heaven's Prisoners (1996)")
Pandas(Index=23761, user_id=735, item_id=117, rating=3, timestamp=876698897, title='Rock, The (1996)')
Pandas(Index=40820, user_id=276, item_id=271, rating=4, timestamp=880913800, title='Starship Troopers (1997)')
Pandas(Index=32381, user_id=407, item_id=568, rating=2, timestamp=876338730, title='Speed (1994)')
Pandas(Index=7557, user_id=795, item_id=564, rating=1, timestamp=883774317, title='Tales from the Hood (1995)')
Pandas(Index=14129, user_id=497, item_id=403, rating=3, timestamp=879310883, title='Batman (1989)')
Pandas(Index=75989, user_id=298, item_id=282, rating=4

Pandas(Index=64671, user_id=135, item_id=324, rating=3, timestamp=879857575, title='Lost Highway (1997)')
Pandas(Index=62243, user_id=24, item_id=729, rating=5, timestamp=875323475, title='Nell (1994)')
Pandas(Index=49930, user_id=442, item_id=210, rating=3, timestamp=883388609, title='Indiana Jones and the Last Crusade (1989)')
Pandas(Index=88840, user_id=522, item_id=521, rating=5, timestamp=876961190, title='Deer Hunter, The (1978)')
Pandas(Index=51058, user_id=551, item_id=475, rating=5, timestamp=892777910, title='Trainspotting (1996)')
Pandas(Index=7951, user_id=933, item_id=241, rating=2, timestamp=874855069, title='Last of the Mohicans, The (1992)')
Pandas(Index=68555, user_id=189, item_id=162, rating=3, timestamp=893266230, title='On Golden Pond (1981)')
Pandas(Index=83846, user_id=59, item_id=90, rating=2, timestamp=888206363, title='So I Married an Axe Murderer (1993)')
Pandas(Index=58665, user_id=456, item_id=655, rating=3, timestamp=881373838, title='Stand by Me (1986)')
P

Pandas(Index=6411, user_id=527, item_id=193, rating=3, timestamp=879455680, title='Right Stuff, The (1983)')
Pandas(Index=43914, user_id=864, item_id=402, rating=3, timestamp=888892128, title='Ghost (1990)')
Pandas(Index=19250, user_id=184, item_id=514, rating=5, timestamp=889908497, title='Annie Hall (1977)')
Pandas(Index=90700, user_id=1, item_id=45, rating=5, timestamp=875241687, title='Eat Drink Man Woman (1994)')
Pandas(Index=61323, user_id=468, item_id=39, rating=3, timestamp=875296309, title='Strange Days (1995)')
Pandas(Index=72381, user_id=299, item_id=747, rating=4, timestamp=889502640, title='Benny & Joon (1993)')
Pandas(Index=18377, user_id=658, item_id=96, rating=4, timestamp=875147873, title='Terminator 2: Judgment Day (1991)')
Pandas(Index=92496, user_id=761, item_id=261, rating=1, timestamp=876189871, title='Air Bud (1997)')
Pandas(Index=70880, user_id=829, item_id=319, rating=4, timestamp=892312728, title='Everyone Says I Love You (1996)')
Pandas(Index=32630, user_id=5

Pandas(Index=50269, user_id=823, item_id=770, rating=4, timestamp=878438754, title='Devil in a Blue Dress (1995)')
Pandas(Index=21050, user_id=263, item_id=135, rating=5, timestamp=891299877, title='2001: A Space Odyssey (1968)')
Pandas(Index=90323, user_id=551, item_id=727, rating=5, timestamp=892783559, title='Immortal Beloved (1994)')
Pandas(Index=54113, user_id=639, item_id=306, rating=4, timestamp=891238550, title='Mrs. Brown (Her Majesty, Mrs. Brown) (1997)')
Pandas(Index=50004, user_id=109, item_id=210, rating=5, timestamp=880573084, title='Indiana Jones and the Last Crusade (1989)')
Pandas(Index=72124, user_id=416, item_id=708, rating=4, timestamp=889907392, title='Sex, Lies, and Videotape (1989)')
Pandas(Index=58793, user_id=188, item_id=13, rating=4, timestamp=875073408, title='Mighty Aphrodite (1995)')
Pandas(Index=93386, user_id=790, item_id=358, rating=2, timestamp=885154848, title='Spawn (1997)')
Pandas(Index=14148, user_id=162, item_id=403, rating=3, timestamp=877636713,

Pandas(Index=34094, user_id=363, item_id=56, rating=5, timestamp=891495301, title='Pulp Fiction (1994)')
Pandas(Index=52692, user_id=561, item_id=732, rating=3, timestamp=885809958, title='Dave (1993)')
Pandas(Index=48857, user_id=506, item_id=47, rating=4, timestamp=874876486, title='Ed Wood (1994)')
Pandas(Index=78090, user_id=174, item_id=764, rating=4, timestamp=886434343, title='If Lucy Fell (1996)')
Pandas(Index=10450, user_id=437, item_id=423, rating=5, timestamp=880141196, title='E.T. the Extra-Terrestrial (1982)')
Pandas(Index=11651, user_id=833, item_id=526, rating=4, timestamp=875224515, title='Ben-Hur (1959)')
Pandas(Index=11237, user_id=740, item_id=288, rating=4, timestamp=879523187, title='Scream (1996)')
Pandas(Index=95773, user_id=565, item_id=713, rating=5, timestamp=891037693, title='Othello (1995)')
Pandas(Index=93135, user_id=399, item_id=722, rating=2, timestamp=882348153, title='Nine Months (1995)')
Pandas(Index=12949, user_id=907, item_id=248, rating=5, timestam

Pandas(Index=26568, user_id=941, item_id=258, rating=4, timestamp=875048495, title='Contact (1997)')
Pandas(Index=84579, user_id=291, item_id=395, rating=3, timestamp=875086534, title='Robin Hood: Men in Tights (1993)')
Pandas(Index=55490, user_id=632, item_id=55, rating=2, timestamp=879457857, title='Professional, The (1994)')
Pandas(Index=35873, user_id=85, item_id=520, rating=3, timestamp=882996257, title='Great Escape, The (1963)')
Pandas(Index=9344, user_id=262, item_id=181, rating=3, timestamp=879961819, title='Return of the Jedi (1983)')
Pandas(Index=72093, user_id=846, item_id=792, rating=4, timestamp=883948221, title='Bullets Over Broadway (1994)')
Pandas(Index=65533, user_id=379, item_id=233, rating=3, timestamp=880525638, title='Under Siege (1992)')
Pandas(Index=39513, user_id=766, item_id=183, rating=4, timestamp=891309484, title='Alien (1979)')
Pandas(Index=73262, user_id=883, item_id=703, rating=3, timestamp=891693139, title="Widows' Peak (1994)")
Pandas(Index=91160, user

Pandas(Index=96187, user_id=498, item_id=1495, rating=3, timestamp=881958237, title='Flirt (1995)')
Pandas(Index=56548, user_id=18, item_id=284, rating=3, timestamp=880131804, title='Tin Cup (1996)')
Pandas(Index=68981, user_id=90, item_id=131, rating=5, timestamp=891384066, title="Breakfast at Tiffany's (1961)")
Pandas(Index=97488, user_id=448, item_id=1294, rating=1, timestamp=891887161, title='Ayn Rand: A Sense of Life (1997)')
Pandas(Index=7490, user_id=931, item_id=304, rating=4, timestamp=891036105, title='Fly Away Home (1996)')
Pandas(Index=98456, user_id=881, item_id=1177, rating=1, timestamp=876539418, title='Dunston Checks In (1996)')
Pandas(Index=29829, user_id=786, item_id=228, rating=4, timestamp=882844295, title='Star Trek: The Wrath of Khan (1982)')
Pandas(Index=58988, user_id=13, item_id=588, rating=4, timestamp=882398763, title='Beauty and the Beast (1991)')
Pandas(Index=29784, user_id=664, item_id=228, rating=4, timestamp=876526462, title='Star Trek: The Wrath of Khan

Pandas(Index=92915, user_id=151, item_id=735, rating=5, timestamp=879528438, title='Philadelphia (1993)')
Pandas(Index=42088, user_id=207, item_id=597, rating=3, timestamp=876018471, title='Eraser (1996)')
Pandas(Index=66697, user_id=214, item_id=1065, rating=5, timestamp=892668173, title='Koyaanisqatsi (1983)')
Pandas(Index=47348, user_id=647, item_id=173, rating=5, timestamp=876534131, title='Princess Bride, The (1987)')
Pandas(Index=3152, user_id=737, item_id=222, rating=3, timestamp=884315127, title='Star Trek: First Contact (1996)')
Pandas(Index=4887, user_id=852, item_id=1, rating=4, timestamp=891036457, title='Toy Story (1995)')
Pandas(Index=57118, user_id=800, item_id=864, rating=4, timestamp=887646980, title='My Fellow Americans (1996)')
Pandas(Index=1674, user_id=445, item_id=346, rating=5, timestamp=891200869, title='Jackie Brown (1997)')
Pandas(Index=80137, user_id=595, item_id=290, rating=4, timestamp=886921748, title='Fierce Creatures (1997)')
Pandas(Index=90923, user_id=

Pandas(Index=79255, user_id=6, item_id=276, rating=2, timestamp=883599134, title='Leaving Las Vegas (1995)')
Pandas(Index=33288, user_id=546, item_id=7, rating=5, timestamp=885140689, title='Twelve Monkeys (1995)')
Pandas(Index=39334, user_id=137, item_id=183, rating=5, timestamp=881433689, title='Alien (1979)')
Pandas(Index=24844, user_id=130, item_id=2, rating=4, timestamp=876252327, title='GoldenEye (1995)')
Pandas(Index=85087, user_id=440, item_id=361, rating=5, timestamp=891548567, title='Incognito (1997)')
Pandas(Index=88210, user_id=277, item_id=844, rating=4, timestamp=879543528, title='Freeway (1996)')
Pandas(Index=87887, user_id=217, item_id=827, rating=2, timestamp=889070232, title='Daylight (1996)')
Pandas(Index=70376, user_id=474, item_id=523, rating=5, timestamp=887924083, title='Cool Hand Luke (1967)')
Pandas(Index=45937, user_id=117, item_id=176, rating=5, timestamp=881012028, title='Aliens (1986)')
Pandas(Index=55052, user_id=643, item_id=58, rating=4, timestamp=891448

Pandas(Index=16183, user_id=655, item_id=23, rating=3, timestamp=887426971, title='Taxi Driver (1976)')
Pandas(Index=65377, user_id=838, item_id=9, rating=4, timestamp=887063696, title='Dead Man Walking (1995)')
Pandas(Index=68249, user_id=200, item_id=982, rating=2, timestamp=891825589, title='Maximum Risk (1996)')
Pandas(Index=85075, user_id=872, item_id=975, rating=4, timestamp=888479654, title='Fear (1996)')
Pandas(Index=44019, user_id=102, item_id=522, rating=3, timestamp=888803487, title='Down by Law (1986)')
Pandas(Index=24520, user_id=269, item_id=177, rating=5, timestamp=891449214, title='Good, The Bad and The Ugly, The (1966)')
Pandas(Index=482, user_id=447, item_id=50, rating=5, timestamp=878854552, title='Star Wars (1977)')
Pandas(Index=19690, user_id=868, item_id=317, rating=5, timestamp=877107961, title='In the Name of the Father (1993)')
Pandas(Index=31795, user_id=634, item_id=109, rating=4, timestamp=877017810, title='Mystery Science Theater 3000: The Movie (1996)')
Pa

Pandas(Index=65847, user_id=239, item_id=203, rating=1, timestamp=889179291, title='Unforgiven (1992)')
Pandas(Index=30646, user_id=731, item_id=127, rating=4, timestamp=886179415, title='Godfather, The (1972)')
Pandas(Index=90909, user_id=805, item_id=1118, rating=5, timestamp=881704553, title='Up in Smoke (1978)')
Pandas(Index=11363, user_id=870, item_id=288, rating=4, timestamp=875050370, title='Scream (1996)')
Pandas(Index=3, user_id=2, item_id=50, rating=5, timestamp=888552084, title='Star Wars (1977)')
Pandas(Index=41673, user_id=492, item_id=199, rating=3, timestamp=879969255, title='Bridge on the River Kwai, The (1957)')
Pandas(Index=1186, user_id=461, item_id=242, rating=3, timestamp=885355735, title='Kolya (1996)')
Pandas(Index=25415, user_id=521, item_id=1240, rating=3, timestamp=884478667, title='Ghost in the Shell (Kokaku kidotai) (1995)')
Pandas(Index=79919, user_id=804, item_id=826, rating=3, timestamp=879443776, title='Phantom, The (1996)')
Pandas(Index=99153, user_id=6

Pandas(Index=28158, user_id=452, item_id=483, rating=5, timestamp=875263244, title='Casablanca (1942)')
Pandas(Index=59795, user_id=693, item_id=632, rating=5, timestamp=875482626, title="Sophie's Choice (1982)")
Pandas(Index=19588, user_id=632, item_id=485, rating=4, timestamp=879457157, title='My Fair Lady (1964)')
Pandas(Index=5557, user_id=733, item_id=277, rating=1, timestamp=879536523, title='Restoration (1995)')
Pandas(Index=60309, user_id=593, item_id=49, rating=3, timestamp=875671891, title='I.Q. (1994)')
Pandas(Index=96947, user_id=716, item_id=956, rating=4, timestamp=879796011, title="Nobody's Fool (1994)")
Pandas(Index=5212, user_id=619, item_id=546, rating=2, timestamp=885953826, title='Broken Arrow (1996)')
Pandas(Index=63570, user_id=234, item_id=287, rating=3, timestamp=891228196, title="Marvin's Room (1996)")
Pandas(Index=73645, user_id=160, item_id=488, rating=5, timestamp=876862078, title='Sunset Blvd. (1950)')
Pandas(Index=32797, user_id=846, item_id=401, rating=5,

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Pandas(Index=42177, user_id=387, item_id=333, rating=3, timestamp=886479484, title='Game, The (1997)')
Pandas(Index=37262, user_id=269, item_id=484, rating=3, timestamp=891448895, title='Maltese Falcon, The (1941)')
Pandas(Index=2846, user_id=862, item_id=257, rating=5, timestamp=879303207, title='Men in Black (1997)')
Pandas(Index=96620, user_id=13, item_id=349, rating=1, timestamp=892387807, title='Hard Rain (1998)')
Pandas(Index=90985, user_id=10, item_id=610, rating=4, timestamp=877888613, title='Gigi (1958)')
Pandas(Index=40112, user_id=807, item_id=751, rating=3, timestamp=892527467, title='Tomorrow Never Dies (1997)')
Pandas(Index=58054, user_id=475, item_id=70, rating=4, timestamp=891627606, title='Four Weddings and a Funeral (1994)')
Pandas(Index=16943, user_id=346, item_id=685, rating=3, timestamp=874950383, title='Executive Decision (1996)')
Pandas(Index=13855, user_id=303, item_id=54, rating=3, timestamp=879484695, title='Outbreak (1995)')
Pandas(Index=17804, user_id=653, i

Pandas(Index=89717, user_id=435, item_id=841, rating=2, timestamp=884134553, title='Glimmer Man, The (1996)')
Pandas(Index=26409, user_id=435, item_id=258, rating=4, timestamp=884130647, title='Contact (1997)')
Pandas(Index=50526, user_id=248, item_id=156, rating=5, timestamp=884534945, title='Reservoir Dogs (1992)')
Pandas(Index=61429, user_id=374, item_id=820, rating=4, timestamp=882158327, title='Space Jam (1996)')
Pandas(Index=31741, user_id=26, item_id=109, rating=3, timestamp=891376987, title='Mystery Science Theater 3000: The Movie (1996)')
Pandas(Index=56887, user_id=13, item_id=22, rating=4, timestamp=882140487, title='Braveheart (1995)')
Pandas(Index=30614, user_id=852, item_id=127, rating=4, timestamp=891035544, title='Godfather, The (1972)')
Pandas(Index=53511, user_id=774, item_id=150, rating=1, timestamp=888558787, title='Swingers (1996)')
Pandas(Index=11152, user_id=107, item_id=288, rating=3, timestamp=891264432, title='Scream (1996)')
Pandas(Index=61989, user_id=249, i

Pandas(Index=24014, user_id=643, item_id=273, rating=3, timestamp=891445287, title='Heat (1995)')
Pandas(Index=38923, user_id=56, item_id=227, rating=3, timestamp=892676430, title='Star Trek VI: The Undiscovered Country (1991)')
Pandas(Index=79484, user_id=928, item_id=276, rating=5, timestamp=880937144, title='Leaving Las Vegas (1995)')
Pandas(Index=15326, user_id=262, item_id=275, rating=4, timestamp=879961980, title='Sense and Sensibility (1995)')
Pandas(Index=20022, user_id=334, item_id=200, rating=4, timestamp=891547171, title='Shining, The (1980)')
Pandas(Index=50768, user_id=790, item_id=431, rating=3, timestamp=885157159, title='Highlander (1986)')
Pandas(Index=71892, user_id=766, item_id=616, rating=3, timestamp=891309589, title='Night of the Living Dead (1968)')
Pandas(Index=95700, user_id=880, item_id=783, rating=1, timestamp=880175187, title='Milk Money (1994)')
Pandas(Index=33789, user_id=260, item_id=334, rating=5, timestamp=890618729, title='U Turn (1997)')
Pandas(Index=

In [14]:
#Create two user-item matrices, one for training and another for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]  

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

In [15]:
print(train_data_matrix)

[[5. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


You can use the [pairwise_distances](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) function from sklearn to calculate the cosine similarity. Note, the output will range from 0 to 1 since the ratings are all positive.

In [29]:
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

Next step is to make predictions. You have already created similarity matrices: `user_similarity` and `item_similarity` and therefore you can make a prediction by applying following formula for user-based CF:

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\bar{x}_{k}&space;&plus;&space;\frac{\sum\limits_{u_a}&space;sim_u(u_k,&space;u_a)&space;(x_{a,m}&space;-&space;\bar{x_{u_a}})}{\sum\limits_{u_a}|sim_u(u_k,&space;u_a)|}"/>

You can look at the similarity between users *k* and *a* as weights that are multiplied by the ratings of a similar user *a* (corrected for the average rating of that user). You will need to normalize it so that the ratings stay between 1 and 5 and, as a final step, sum the average ratings for the user that you are trying to predict. 

The idea here is that some users may tend always to give high or low ratings to all movies. The relative difference in the ratings that these users give is more important than the absolute values. To give an example: suppose, user *k* gives 4 stars to his favourite movies and 3 stars to all other good movies. Suppose now that another user *t* rates movies that he/she likes with 5 stars, and the movies he/she fell asleep over with 3 stars. These two users could have a very similar taste but treat the rating system differently. 

When making a prediction for item-based CF you don't need to correct for users average rating since query user itself is used to do predictions.

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\frac{\sum\limits_{i_b}&space;sim_i(i_m,&space;i_b)&space;(x_{k,b})&space;}{\sum\limits_{i_b}|sim_i(i_m,&space;i_b)|}"/>

In [30]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred

In [31]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

### Evaluation
There are many evaluation metrics but one of the most popular metric used to evaluate accuracy of predicted ratings is *Root Mean Squared Error (RMSE)*. 
<img src="https://latex.codecogs.com/gif.latex?RMSE&space;=\sqrt{\frac{1}{N}&space;\sum&space;(x_i&space;-\hat{x_i})^2}" title="RMSE =\sqrt{\frac{1}{N} \sum (x_i -\hat{x_i})^2}" />

You can use the [mean_square_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (MSE) function from `sklearn`, where the RMSE is just the square root of MSE. To read more about different evaluation metrics you can take a look at [this article](http://research.microsoft.com/pubs/115396/EvaluationMetrics.TR.pdf). 

Since you only want to consider predicted ratings that are in the test dataset, you filter out all other elements in the prediction matrix with `prediction[ground_truth.nonzero()]`. 

In [32]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [33]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.135451660158989
Item-based CF RMSE: 3.4593766647252515


Memory-based algorithms are easy to implement and produce reasonable prediction quality. 
The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, that is when new user or new item enters the system. Model-based CF methods are scalable and can deal with higher sparsity level than memory-based models, but also suffer when new users or items that don't have any ratings enter the system. I would like to thank Ethan Rosenthal for his [post](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/) about Memory-Based Collaborative Filtering. 

# Model-based Collaborative Filtering

Model-based Collaborative Filtering is based on **matrix factorization (MF)** which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items. 
When you have a very sparse matrix, with a lot of dimensions, by doing matrix factorization you can restructure the  user-item matrix into low-rank structure, and you can represent the matrix by the multiplication of two low-rank matrices, where the rows contain the latent vector. You fit this matrix to approximate your original matrix, as closely as possible, by multiplying the low-rank matrices together, which fills in the entries missing in the original matrix.

Let's calculate the sparsity level of MovieLens dataset:

In [34]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


To give an example of the learned latent preferences of the users and items: let's say for the MovieLens dataset you have the following information: _(user id, age, location, gender, movie id, director, actor, language, year, rating)_. By applying matrix factorization the model learns that important user features are _age group (under 10, 10-18, 18-30, 30-90)_, _location_ and _gender_, and for movie features it learns that _decade_, _director_ and _actor_ are most important. Now if you look into the information you have stored, there is no such feature as the _decade_, but the model can learn on its own. The important aspect is that the CF model only uses data (user_id, movie_id, rating) to learn the latent features. If there is little data available model-based CF model will predict poorly, since it will be more difficult to learn the latent features. 

Models that use both ratings and content features are called **Hybrid Recommender Systems** where both Collaborative Filtering and Content-based Models are combined. Hybrid recommender systems usually show higher accuracy than Collaborative Filtering or Content-based Models on their own: they are capable to address the cold-start problem better since if you don't have any ratings for a user or an item you could use the metadata from the user or item to make a prediction. 

### SVD
A well-known matrix factorization method is **Singular value decomposition (SVD)**. Collaborative Filtering can be formulated by approximating a matrix `X` by using singular value decomposition. The winning team at the Netflix Prize competition used SVD matrix factorization models to produce product recommendations, for more information I recommend to read articles: [Netflix Recommendations: Beyond the 5 stars](http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html) and [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf).
The general equation can be expressed as follows:
<img src="https://latex.codecogs.com/gif.latex?X=USV^T" title="X=USV^T" />


Given `m x n` matrix `X`:
* *`U`* is an *`(m x r)`* orthogonal matrix
* *`S`* is an *`(r x r)`* diagonal matrix with non-negative real numbers on the diagonal
* *V^T* is an *`(r x n)`* orthogonal matrix

Elements on the diagnoal in `S` are known as *singular values of `X`*. 


Matrix *`X`* can be factorized to *`U`*, *`S`* and *`V`*. The *`U`* matrix represents the feature vectors corresponding to the users in the hidden feature space and the *`V`* matrix represents the feature vectors corresponding to the items in the hidden feature space.
<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/kwgsb5g1b/BLOG_CCA_5.png"/>

Now you can make a prediction by taking dot product of *`U`*, *`S`* and *`V^T`*.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/ch9lcm6pb/BLOG_CCA_4.png"/>

In [35]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

#get SVD components from train matrix. Choose k.
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('User-based CF MSE: ' + str(rmse(X_pred, test_data_matrix)))

User-based CF MSE: 2.727093975231784


Carelessly addressing only the relatively few known entries is highly prone to overfitting. SVD can be very slow and computationally expensive. More recent work minimizes the squared error by applying alternating least square or stochastic gradient descent and uses regularization terms to prevent overfitting. Alternating least square and stochastic gradient descent methods for CF will be covered in the next tutorials.


Review:

* We have covered how to implement simple **Collaborative Filtering** methods, both memory-based CF and model-based CF.
* **Memory-based models** are based on similarity between items or users, where we use cosine-similarity.
* **Model-based CF** is based on matrix factorization where we use SVD to factorize the matrix.
* Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. The standard collaborative filtering method performs poorly is such settings. 

## Looking for more?

If you want to tackle your own recommendation system analysis, check out these data sets. Note: The files are quite large in most cases, not all the links may stay up to host the data, but the majority of them still work. Or just Google for your own data set!

**Movies Recommendation:**

MovieLens - Movie Recommendation Data Sets http://www.grouplens.org/node/73

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Jester - Movie Ratings Data Sets (Collaborative Filtering Dataset) http://www.ieor.berkeley.edu/~goldberg/jester-data/

Cornell University - Movie-review data for use in sentiment-analysis experiments http://www.cs.cornell.edu/people/pabo/movie-review-data/

**Music Recommendation:**

Last.fm - Music Recommendation Data Sets http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/index.html

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Audioscrobbler - Music Recommendation Data Sets http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html

Amazon - Audio CD recommendations http://131.193.40.52/data/

**Books Recommendation:**

Institut für Informatik, Universität Freiburg - Book Ratings Data Sets http://www.informatik.uni-freiburg.de/~cziegler/BX/
Food Recommendation:

Chicago Entree - Food Ratings Data Sets http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data
Merchandise Recommendation:

**Healthcare Recommendation:**

Nursing Home - Provider Ratings Data Set http://data.medicare.gov/dataset/Nursing-Home-Compare-Provider-Ratings/mufm-vy8d

Hospital Ratings - Survey of Patients Hospital Experiences http://data.medicare.gov/dataset/Survey-of-Patients-Hospital-Experiences-HCAHPS-/rj76-22dk

**Dating Recommendation:**

www.libimseti.cz - Dating website recommendation (collaborative filtering) http://www.occamslab.com/petricek/data/
Scholarly Paper Recommendation:

National University of Singapore - Scholarly Paper Recommendation http://www.comp.nus.edu.sg/~sugiyama/SchPaperRecData.html

# Great Job!