# Recommender Systems

Recommender systems have become ubiquitous in the modern data science landscape: companies like Google, Netflix, Pandora, and Facebook rely heavily on them to surface targeted content and create a more enjoyable user experience.  In these exercises, we'll focus on ***collaborative filtering*** and build recommenders on 2 different datasets (beers and movies).

[Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) relies on a ***ratings matrix*** of users and items, and infers similarities between items (or between users) from how they were rated.  It is one of the 2 main approaches to recommendation, the other being [Content-Based Filtering](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering), which explicitly maps items and/or users into a shared feature space based on explicit user/item characteristics.  State-of-the-art recommenders often combine the two in hybrid approaches, so it's important to understand the strengths and weaknesses of each and what separates them.

### Datasets
- [Beer Ratings](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/beer_reviews.tar.gz): A dataset of beer reviews
- [Movielens Data](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/movielens): A dataset of movie ratings from the original [here](http://grouplens.org/datasets/movielens/)

### Learning Goals
- Perform collaborative filtering from ratings matrices using `pandas` and `sklearn` on the beers data
- Understand why this approach represents collaborative filtering
- Perform collaborative filtering using the [python-recsys](https://github.com/ocelma/python-recsys) library that provides some nice built-in recommender functionality
- Understand how SVDs or other matrix decompositions fit into a recommender algorithm

## Similarity-Based Recommendation System: Beers
The first dataset we'll work with is a list of many beer reviews by a variety of reviewers with accompanying beer metadata on every review.  We'll use this data to generate our reviewer/beer ratings matrix from which we can perform collaborative filtering and recommend beers based on user preferences.

### Beers: Get the Data
First perform the usual imports of `numpy` and `pandas` as `np` and `pd`.

In [2]:
import pandas as pd
import numpy as np

Now let's get the data.  If you don't already have it locally you can use curl to pull it down.

In [2]:
! curl -O https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 27.3M  100 27.3M    0     0  6907k      0  0:00:04  0:00:04 --:--:-- 6959k


The next steps are optional. Just move the data somewhere you can find it, then point your eventual call to `read_csv()` at that location.

In [3]:
! mv 'beer_reviews.tar.gz' ~/data/

In [4]:
!ls ~/data

GoogleNews-vectors-negative300.bin    beer_reviews.tar.gz
GoogleNews-vectors-negative300.bin.gz enable1.txt
beer_reviews                          text8.zip
beer_reviews.tar                      text8.zip.1


Import the data into a `pandas` dataframe called `df` by calling `read_csv()` with the appropriate path. pandas can read gzipped CSVs directly with `compression='gzip'`, but this file is a gzipped tar archive, so that option alone leaves the tar container in place; the cell below therefore reads the extracted CSV instead.

In [8]:
#df = pd.read_csv("~/data/beer_reviews.tar.gz", compression='gzip', error_bad_lines=False)
df = pd.read_csv("~/data/beer_reviews/beer_reviews.csv")
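If you would rather not extract the archive, one option is to open it with the standard-library `tarfile` module and hand the CSV member to pandas. The sketch below builds a tiny archive in memory so that it runs on its own; the member name `beer_reviews/beer_reviews.csv` and the two sample rows are illustrative assumptions, not the contents of the real file.

```python
import io
import tarfile

import pandas as pd

# Build a tiny .tar.gz in memory so the example is self-contained.
csv_bytes = b"beer_name,review_overall\nPale Ale,4.0\nIPA,3.5\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="beer_reviews/beer_reviews.csv")  # hypothetical member name
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))
buf.seek(0)

# Open the archive, pull out the CSV member, and pass the file object to pandas.
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.extractfile("beer_reviews/beer_reviews.csv")
    df_small = pd.read_csv(member)

print(df_small)
```

With the real download you would pass the path to `tarfile.open()` instead of `fileobj=buf`.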


### Explore the Data
Let's look at the data with `head()`

In [6]:
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


Create a separate data frame `df_test` to investigate a little further by selecting only the reviews where **beer_name** is "Pale Ale", using the `isin([])` method.  Then sort the resulting table by **review_profilename** and examine the first 100 rows.  You should notice that the same reviewer can review multiple Pale Ales.

In [9]:
df_test = df[df.beer_name.isin(['Pale Ale'])].sort_values(by='review_profilename')
df_test.head(100)


Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
912451,19402,Inland Empire Brewing Company,1240528130,4.0,3.0,3.0,0110x011,American Pale Ale (APA),3.0,3.5,Pale Ale,5.50,49291
1406262,9824,Silverado Brewing Company,1253298781,3.5,3.5,3.5,1759Girl,American Pale Ale (APA),2.5,4.0,Pale Ale,5.12,25427
563154,423,Boulevard Brewing Co.,1305678006,3.5,2.0,4.0,1Adam12,American Pale Ale (APA),3.0,3.0,Pale Ale,5.40,2094
525342,2101,Blue Star Brewing Company,1237655645,4.5,4.0,4.0,1fastz28,American Pale Ale (APA),4.0,4.0,Pale Ale,,5828
41264,13397,Mountaineer Brewing Co.,1291941443,4.0,3.0,3.5,321jeff,American Pale Ale (APA),4.0,3.0,Pale Ale,5.59,28951
1385721,3725,Réservoir,1120719056,3.5,3.0,3.0,3Vandoo,English Pale Ale,4.0,3.0,Pale Ale,5.00,24527
562967,423,Boulevard Brewing Co.,1203782439,5.0,4.5,4.5,7thstreetbrewery,American Pale Ale (APA),5.0,4.5,Pale Ale,5.40,2094
563116,423,Boulevard Brewing Co.,1058366048,4.0,3.5,4.0,ADR,American Pale Ale (APA),3.5,3.0,Pale Ale,5.40,2094
477535,16465,Croucher Brewing Co.,1248702467,4.5,3.0,4.5,ADZA,American Pale Ale (APA),3.5,4.5,Pale Ale,5.00,40487
1429227,25252,Goodieson Brewery,1304334057,3.0,3.0,3.5,ADZA,American Pale Ale (APA),3.0,3.0,Pale Ale,4.50,68580


Let's restrict this to the 250 most-reviewed beers. Use the `value_counts()` method to get the counts of **beer_name** sorted from most to least frequent, then take the first 250.  Overwrite `df` with this new data.

In [8]:
df.beer_name.value_counts()

90 Minute IPA                                 3290
India Pale Ale                                3130
Old Rasputin Russian Imperial Stout           3111
Sierra Nevada Celebration Ale                 3000
Two Hearted Ale                               2728
Arrogant Bastard Ale                          2704
Stone Ruination IPA                           2704
Sierra Nevada Pale Ale                        2587
Stone IPA (India Pale Ale)                    2575
Pliny The Elder                               2527
Founders Breakfast Stout                      2502
Pale Ale                                      2500
Sierra Nevada Bigfoot Barleywine Style Ale    2492
La Fin Du Monde                               2483
60 Minute IPA                                 2475
Storm King Stout                              2452
Duvel                                         2450
Brooklyn Black Chocolate Stout                2447
Bell's Hopslam Ale                            2443
Samuel Adams Boston Lager      

In [10]:
n = 250
top_n = df.beer_name.value_counts().index[:n]
df = df[df.beer_name.isin(top_n)]
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
798,1075,Caldera Brewing Company,1212201268,4.5,4.5,4.0,grumpy,American Double / Imperial Stout,4.0,4.5,Imperial Stout,,42964
1559,11715,Destiny Brewing Company,1137124057,4.0,3.5,4.0,blitheringidiot,American Pale Ale (APA),3.5,3.5,Pale Ale,4.5,26420
1560,11715,Destiny Brewing Company,1129504403,4.0,2.5,4.0,NeroFiddled,American Pale Ale (APA),4.0,3.5,Pale Ale,4.5,26420
1563,11715,Destiny Brewing Company,1137125989,3.5,3.0,4.0,blitheringidiot,American IPA,4.0,4.0,IPA,,26132
1564,11715,Destiny Brewing Company,1130936611,3.0,3.0,3.0,Gavage,American IPA,4.0,3.5,IPA,,26132
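
The same filtering pattern can be tried on a toy frame to see what each step returns. This is a minimal sketch with made-up beer names, keeping the top 2 instead of the top 250.

```python
import pandas as pd

# Made-up reviews: three for IPA, two for Stout, one for Porter.
toy = pd.DataFrame({"beer_name": ["IPA", "IPA", "IPA", "Stout", "Stout", "Porter"]})

# value_counts() sorts by frequency, so the first n index labels
# are the n most-reviewed names.
n = 2
top_n = toy.beer_name.value_counts().index[:n]

# Keep only the rows whose beer is in that top-n set.
toy_top = toy[toy.beer_name.isin(top_n)]
print(toy_top.beer_name.unique())
```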


How big is this dataset?  Use `df.info()`

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 355275 entries, 798 to 1586564
Data columns (total 13 columns):
brewery_id            355275 non-null int64
brewery_name          355275 non-null object
review_time           355275 non-null int64
review_overall        355275 non-null float64
review_aroma          355275 non-null float64
review_appearance     355275 non-null float64
review_profilename    355175 non-null object
beer_style            355275 non-null object
review_palate         355275 non-null float64
review_taste          355275 non-null float64
beer_name             355275 non-null object
beer_abv              353477 non-null float64
beer_beerid           355275 non-null int64
dtypes: float64(6), int64(3), object(4)
memory usage: 37.9+ MB


Aggregate the data into a pivot table called `df_wide` using the `pivot_table` method. Each cell should hold the mean **review_overall** that a reviewer gave a beer, so the `values` parameter should contain **review_overall**, the `index` parameter should contain **beer_name** and **review_profilename**, and the aggregator should be the mean (`numpy.mean`).  Make sure to call `unstack()` at the end so that reviewers become columns.

In [12]:
df_wide = pd.pivot_table(df, values=["review_overall"],
        index=["beer_name", "review_profilename"],
        aggfunc=np.mean).unstack()
df_wide.shape

(250, 22140)
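
To see what the pivot does, here is a minimal sketch on a made-up long-format table. It passes `columns=` directly instead of calling `unstack()`, which should give the same beers-by-reviewers layout with a single-level column index; the reviewer names and scores are invented.

```python
import pandas as pd

# Toy long-format reviews; names and scores are made up for illustration.
reviews = pd.DataFrame({
    "beer_name": ["IPA", "IPA", "Stout", "Stout", "Stout"],
    "review_profilename": ["ann", "bob", "ann", "ann", "cat"],
    "review_overall": [4.0, 3.0, 5.0, 4.0, 2.0],
})

# One row per beer, one column per reviewer; duplicate
# (beer, reviewer) pairs are averaged.
wide = reviews.pivot_table(values="review_overall",
                           index="beer_name",
                           columns="review_profilename",
                           aggfunc="mean")
print(wide)
```

Reviewer `ann` rated Stout twice, so that cell holds the average of her two scores, and any pair with no review shows up as NaN.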

Display the head of the pivot table, but only for 5 users (columns are users)

In [12]:
df_wide.iloc[0:5, 0:5]

Unnamed: 0_level_0,review_overall,review_overall,review_overall,review_overall,review_overall
review_profilename,0110x011,02maxima,03SVTCobra,05Harley,0Naught0
beer_name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
#9,,,,,
120 Minute IPA,,,,4.0,
1554 Enlightened Black Ale,,,,,
60 Minute IPA,,,,,
90 Minute IPA,5.0,,,4.0,


### Discussion: what do you notice in this table?

Set NaNs to zero with the `fillna()` function.

In [13]:
df_wide = df_wide.fillna(0)

Check that columns are users by examining the first few columns.

In [14]:
df_wide.columns[:10]

MultiIndex(levels=[['review_overall'], ['0110x011', '02maxima', '03SVTCobra', '05Harley', '0Naught0', '0beerguy0', '0runkp0s', '0tt0', '1000Bottles', '1001111.0', '100floods', '100proof', '1050Sudz', '108Dragons', '1099.0', '1100.0', '1121987.0', '11millsown113', '11osixBrew', '11thFloorBrewing', '1229design', '12NattiBottles', '12ouncecurls', '12percent', '12vUnion', '130guy', '13smurrf', '160Shillings', '1759Girl', '1759dallas', '17Guinness59', '18alpha', '196osh', '1993Heel', '1996StrokerKid', '1Adam12', '1MiltonWaddams', '1PA', '1after909', '1brbn1sctch1beer', '1fastz28', '1morebeer', '1noa', '1quiks10', '1santore', '1thinmint', '1whiskey', '20ozmonkey', '21mmer', '220emaple', '22ozStone', '2369.0', '2378GCGTG', '23fyerfyter', '2408bwk', '245Trioxin', '24Beer92', '25santurce', '28Rock', '2BDChicago', '2Cruzy', '2DaMtns', '2KHokie', '2LBrew', '2Shay', '2Stout4u', '2beerdogs', '2cansam', '2girls1hop', '2heartedMIKE', '2late2brew', '2melow', '2ndloveofmylife', '2ndstage', '2oose', '2r

Check that rows are beers by examining the first few rows.

In [16]:
pd.Series(df_wide.index[:10])

0                                    #9
1                        120 Minute IPA
2            1554 Enlightened Black Ale
3                         60 Minute IPA
4                         90 Minute IPA
5    Aecht Schlenkerla Rauchbier Märzen
6                          AleSmith IPA
7               AleSmith Speedway Stout
8                        Allagash White
9                   Alpha King Pale Ale
Name: beer_name, dtype: object

### Calculate distance between beers

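This is the key step.  We now have our ratings matrix, and we'll use `cosine_similarity` from scikit-learn to compare every pair of beers in this space.  Note that cosine similarity is a similarity rather than a distance: larger values mean two beers were rated more alike, and with the zero-filled matrix a value of 0 means no reviewer rated both.  We keep the variable name `dists` below for convenience.

Import the pairwise metrics.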

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

Apply `cosine_similarity()` to `df_wide` to calculate the pairwise similarities between beers (the rows) and store the result in a variable called `dists`.

In [18]:
dists = cosine_similarity(df_wide)
dists

array([[ 1.        ,  0.27540494,  0.27410345, ...,  0.32928048,
         0.34805798,  0.31249922],
       [ 0.27540494,  1.        ,  0.25151873, ...,  0.2854835 ,
         0.23301356,  0.2802485 ],
       [ 0.27410345,  0.25151873,  1.        , ...,  0.31629515,
         0.22521858,  0.2737628 ],
       ..., 
       [ 0.32928048,  0.2854835 ,  0.31629515, ...,  1.        ,
         0.28025764,  0.34504013],
       [ 0.34805798,  0.23301356,  0.22521858, ...,  0.28025764,
         1.        ,  0.25526913],
       [ 0.31249922,  0.2802485 ,  0.2737628 , ...,  0.34504013,
         0.25526913,  1.        ]])
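
For intuition about what this matrix contains, the following sketch computes cosine similarity by hand with NumPy on a made-up 3x3 ratings matrix. It is meant to illustrate the formula, and should agree with what `cosine_similarity` returns for the same input.

```python
import numpy as np

# Toy ratings matrix: rows are items, columns are users, 0 means "not rated".
R = np.array([
    [5.0, 0.0, 4.0],
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
])

# Cosine similarity: dot product of each pair of rows divided by
# the product of their Euclidean norms.
norms = np.linalg.norm(R, axis=1, keepdims=True)
sim = (R @ R.T) / (norms * norms.T)

# Cosine distance is the complement of the similarity.
dist = 1.0 - sim
print(np.round(sim, 3))
```

The first two items share both of their raters and score close to 1, while the third item shares no raters with them and scores 0.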

### Discussion: what type of object is dists?

Convert `dists` to a pandas DataFrame, using the beer names as both the row index and the column index (the matrix is square).  This gives us a beers-by-beers matrix of the similarity between every pair of beers in the ratings space.  Check out the first 10 or so rows and columns and make sure things look right (you should see 1s on the diagonal).

In [19]:
dists = pd.DataFrame(dists, columns=df_wide.index)

dists.index = dists.columns
dists.iloc[0:10, 0:10]

beer_name,#9,120 Minute IPA,1554 Enlightened Black Ale,60 Minute IPA,90 Minute IPA,Aecht Schlenkerla Rauchbier Märzen,AleSmith IPA,AleSmith Speedway Stout,Allagash White,Alpha King Pale Ale
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
#9,1.0,0.275405,0.274103,0.388364,0.365175,0.253841,0.228479,0.227612,0.340681,0.293315
120 Minute IPA,0.275405,1.0,0.251519,0.378258,0.410366,0.262425,0.315971,0.337541,0.282273,0.336796
1554 Enlightened Black Ale,0.274103,0.251519,1.0,0.319887,0.314028,0.252486,0.266866,0.261761,0.260275,0.307296
60 Minute IPA,0.388364,0.378258,0.319887,1.0,0.533042,0.316928,0.312343,0.307627,0.360975,0.385249
90 Minute IPA,0.365175,0.410366,0.314028,0.533042,1.0,0.312861,0.344218,0.358754,0.356804,0.418582
Aecht Schlenkerla Rauchbier Märzen,0.253841,0.262425,0.252486,0.316928,0.312861,1.0,0.24449,0.246063,0.297672,0.263248
AleSmith IPA,0.228479,0.315971,0.266866,0.312343,0.344218,0.24449,1.0,0.521889,0.277409,0.400741
AleSmith Speedway Stout,0.227612,0.337541,0.261761,0.307627,0.358754,0.246063,0.521889,1.0,0.27393,0.420247
Allagash White,0.340681,0.282273,0.260275,0.360975,0.356804,0.297672,0.277409,0.27393,1.0,0.295666
Alpha King Pale Ale,0.293315,0.336796,0.307296,0.385249,0.418582,0.263248,0.400741,0.420247,0.295666,1.0


Select some beers and store them in `beers_i_like`, then look at their similarities to other beers with `head()`.

In [20]:
beers_i_like = ['Sierra Nevada Pale Ale', '120 Minute IPA', 'Allagash White']
dists[beers_i_like].head()

beer_name,Sierra Nevada Pale Ale,120 Minute IPA,Allagash White
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
#9,0.373968,0.275405,0.340681
120 Minute IPA,0.301693,1.0,0.282273
1554 Enlightened Black Ale,0.330033,0.251519,0.260275
60 Minute IPA,0.459641,0.378258,0.360975
90 Minute IPA,0.441189,0.410366,0.356804


Sum the similarities to your favorite beers by row, so that each beer in the sample gets a single score.  For instance, if there are 3 beers in your `beers_i_like` then you will be summing 3 numbers for each row.  Store the results in `beers_summed`.  There are 2 ways you can do this:  
1. Calling `apply()` with a lambda function that contains `np.sum()` with `axis=1`
2. Calling `np.sum()` with `axis=1` on the entire dataframe (sliced by columns you like)

In [21]:
beers_summed = dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)
#beers_summed = np.sum(dists[beers_i_like], axis=1)

Optional: which one is faster? Use `%timeit` to check.

In [22]:
%timeit dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)

10 loops, best of 3: 22.6 ms per loop


In [23]:
%timeit np.sum(dists[beers_i_like], axis=1)

The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 804 µs per loop


Sort the summed beers from best to worst using `sort_values()`.

In [24]:
beers_summed = beers_summed.sort_values(ascending=False)
beers_summed

beer_name
Sierra Nevada Pale Ale                        1.654205
Allagash White                                1.634784
120 Minute IPA                                1.583966
HopDevil Ale                                  1.224217
Sierra Nevada Celebration Ale                 1.215156
90 Minute IPA                                 1.208359
60 Minute IPA                                 1.198874
Stone Ruination IPA                           1.194210
Stone IPA (India Pale Ale)                    1.193193
Storm King Stout                              1.192405
Arrogant Bastard Ale                          1.189981
Sierra Nevada Bigfoot Barleywine Style Ale    1.178245
Prima Pils                                    1.178093
Brooklyn Black Chocolate Stout                1.156365
Ayinger Celebrator Doppelbock                 1.148356
Hennepin (Farmhouse Saison)                   1.147501
Samuel Adams Boston Lager                     1.146304
Hop Rod Rye                                   1.140271


Filter out the beers used as input using `isin()` and store this in `ranked_beers`, then transform this to a list using `tolist()`.  Print out the first 5 elements.

In [25]:
ranked_beers = beers_summed.index[~beers_summed.index.isin(beers_i_like)]
ranked_beers = ranked_beers.tolist()
ranked_beers[:5]

['HopDevil Ale',
 'Sierra Nevada Celebration Ale',
 '90 Minute IPA',
 '60 Minute IPA',
 'Stone Ruination IPA']

Define a function that does what we just did for an arbitrary input list of beers. It should also accept the maximum number of beers requested, `n`, as an optional parameter.

In [26]:
def get_similar(beers, n=None):
    """
    Rank beers by summed similarity to the beers provided. Does not return
    the beers that were provided.

    Parameters
    ----------
    beers: list
        names of beers you like; names missing from `dists` are ignored
    n: int, optional
        maximum number of beers to return (default: return all)

    Returns
    -------
    ranked_beers: list
        beer names ordered from most to least similar
    """
    beers = [beer for beer in beers if beer in dists.columns]
    beers_summed = dists[beers].sum(axis=1)
    # sort so the most similar beers come first
    beers_summed = beers_summed.sort_values(ascending=False)
    ranked_beers = beers_summed.index[~beers_summed.index.isin(beers)]
    ranked_beers = ranked_beers.tolist()
    if n is None:
        return ranked_beers
    else:
        return ranked_beers[:n]
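
The function above reads the global `dists`. As a design alternative, here is a sketch of a variant that takes the similarity matrix as an argument, which makes it easy to try on a small made-up matrix. The name `rank_similar`, the item labels, and the similarity values are all hypothetical.

```python
import pandas as pd

def rank_similar(sim, liked, n=None):
    """Rank items by summed similarity to `liked`, excluding the liked items."""
    liked = [item for item in liked if item in sim.columns]
    scores = sim[liked].sum(axis=1).sort_values(ascending=False)
    ranked = scores.index[~scores.index.isin(liked)].tolist()
    return ranked if n is None else ranked[:n]

# Made-up symmetric similarity matrix for four items.
names = ["A", "B", "C", "D"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.9],
     [0.3, 0.4, 0.9, 1.0]],
    index=names, columns=names)

print(rank_similar(toy_sim, ["A", "B"]))
```

Item D has a higher summed similarity to A and B than item C does, so it is ranked first.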

Test your function. Find the 10 beers most similar to "120 Minute IPA"

In [28]:
for beer in get_similar(["120 Minute IPA"], 10):
    print(beer)

World Wide Stout
90 Minute IPA
Double Bastard Ale
Stone Ruination IPA
Stone Imperial Russian Stout
Storm King Stout
60 Minute IPA
Oaked Arrogant Bastard Ale
Sierra Nevada Bigfoot Barleywine Style Ale
Brooklyn Black Chocolate Stout




Cool, let's try again with the 10 beers most similar to ["Coors Light", "Bud Light", "Amstel Light"]

In [29]:
for i, beer in enumerate(get_similar(["Coors Light", "Bud Light", "Amstel Light"], 10)):
    print("%d) %s" % (i+1, beer))

1) Miller Lite
2) Budweiser
3) Corona Extra
4) Samuel Adams Boston Lager
5) Heineken Lager Beer
6) Blue Moon Belgian White
7) Guinness Draught
8) Miller High Life
9) Samuel Adams Summer Ale
10) Sierra Nevada Pale Ale




## Movie Recommendations with Recsys
[python-recsys](https://github.com/ocelma/python-recsys) is a Python library for implementing recommender systems.  We'll use it here to make movie recommendations from the [movielens dataset](http://grouplens.org/datasets/movielens/).  

### Install Recsys
First run something like the below code to install everything that you need for recsys.

To install python-recsys, first install its dependencies:

pip install csc-pysparse networkx divisi2

### Get the Data
Download the movielens dataset [here](http://files.grouplens.org/datasets/movielens/ml-20m.zip) 

Let's look at the files, you can do this however you like.

In [35]:
! ls /Users/ablevins/data

GoogleNews-vectors-negative300.bin    beer_reviews.tar.gz
GoogleNews-vectors-negative300.bin.gz enable1.txt
beer_reviews                          text8.zip
beer_reviews.tar                      text8.zip.1


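Read the movies file into a variable `movies`.  The ml-20m download ships `movies.csv`, a comma-separated file with a header row (movieId, title, genres), so `sep=','` is enough here.  The `::` separator, explicit `names` (ITEMID, Title, Genres), and `index_col` set to ITEMID are only needed for the older `movies.dat` format.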

In [40]:
movies = pd.read_table('~/data/ml-20m/movies.csv', sep=',')

### Explore the Data
Take a look at the movies data with `head()`.

In [41]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


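Load the ratings file into a `ratings` variable the same way.  In ml-20m this is `ratings.csv`, with the columns userId, movieId, rating, and timestamp (the older `ratings.dat` needs the names UserID, MovieID, Rating, Timestamp supplied).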

In [5]:
ratings = pd.read_table('~/data/ml-20m/ratings.csv', sep=',')

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [13]:
ratings.movieId.value_counts()

296       67310
356       66172
318       63366
593       63299
480       59715
260       54502
110       53769
589       52244
2571      51334
527       50054
1         49695
457       49581
150       47777
780       47048
50        47006
1210      46839
592       46054
1196      45313
2858      44987
32        44980
590       44208
1198      43295
608       43272
47        43249
380       43159
588       41842
377       41562
1270      41426
858       41355
2959      40106
          ...  
107238        1
123629        1
107236        1
74478         1
123621        1
116323        1
107248        1
123645        1
107252        1
107147        1
90777         1
123627        1
107241        1
107268        1
123596        1
90883         1
123669        1
90935         1
107316        1
123587        1
107202        1
107204        1
107243        1
99939         1
123600        1
123607        1
90823         1
123609        1
123613        1
131136        1
Name: movieId, dtype: in

In [15]:
n = 250
top_n = ratings.movieId.value_counts().index[:n]
ratings = ratings[ratings.movieId.isin(top_n)]
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
7,1,223,4.0,1112485573


In [16]:

UserMovieMatrix = pd.pivot_table(ratings,values='rating',
                                index=['userId','movieId'],
                                aggfunc=np.mean).unstack()

In [17]:
UserMovieMatrix.head()

movieId,1,2,6,10,11,16,17,19,21,25,...,6711,6874,7153,7361,7438,8636,8961,32587,33794,58559
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,3.5,,,,,,,,,...,,,5.0,,4.0,4.5,4.0,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,,,3.0,4.0,,,,3.0,,,...,,,,,,,,,,
5,,3.0,,,5.0,,3.0,,,,...,,,,,,,,,,


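The python-recsys workflow starts by initializing an `SVD` instance called `svd`.  The next cells take a different route: they fill the missing ratings with a neutral value (2.5) and factorize the matrix with scikit-learn's `randomized_svd`, keeping 15 components.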

In [20]:
UserMovieMatrixz = UserMovieMatrix.fillna(2.5)

In [21]:
from sklearn.utils.extmath import randomized_svd

U, Sigma, VT = randomized_svd(UserMovieMatrixz, 
                              n_components=15,
                              n_iter=5,
                              random_state=None)

In [24]:
VT

array([[ 0.10175526,  0.04401612,  0.05358105, ...,  0.04304211,
         0.04993961,  0.04361131],
       [ 0.02227473,  0.04958232,  0.0340221 , ..., -0.05179667,
        -0.05739599, -0.05808536],
       [-0.01404039, -0.0387483 ,  0.03251913, ..., -0.08670394,
        -0.11178202, -0.10526736],
       ..., 
       [ 0.11517488,  0.02861333, -0.02146591, ...,  0.01648529,
         0.00231085, -0.01335648],
       [ 0.01779486,  0.06436341,  0.07961331, ..., -0.03372538,
        -0.01111026, -0.0045956 ],
       [-0.23715209,  0.05962051, -0.07311029, ..., -0.0151384 ,
        -0.02143973,  0.08025806]])
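
Each column of `VT` describes one movie in the latent space, so movies can be compared there directly. The sketch below shows the idea on a made-up 4x4 matrix; it uses `np.linalg.svd` rather than `randomized_svd` to stay self-contained, and the choice to scale by the singular values before taking cosine similarity is one reasonable convention, not the only one.

```python
import numpy as np

# Toy user-by-movie matrix (rows: users, columns: movies); values are made up.
M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full SVD, then keep only the k largest singular values.
U, s, VT = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# Each movie is a column of VT; scale by the singular values and
# compare movies with cosine similarity in the k-dimensional space.
item_vecs = (np.diag(s[:k]) @ VT[:k, :]).T
unit = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
item_sim = unit @ unit.T
print(np.round(item_sim, 2))
```

Movies 0 and 1 are liked by the same users, so they end up closer to each other in the latent space than either is to movies 2 and 3.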

 
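The settings below belong to the python-recsys workflow, which computes the decomposition $M=U \Sigma V^T$ by calling `compute()` on an `SVD` instance (called `svd` in the remaining cells) that has been loaded with the ratings data:

- Use `k=100`
- Use `min_values=10`
- Use `pre_normalize=None`
- Use `mean_center=True`
- Use `post_normalize=True`

You can also save the output SVD model (in a zip file):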

In [None]:
# svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/movielens')

Reload a saved model:

In [None]:
# svd2 = SVD(filename='/tmp/movielens')

### Computing Similarities and Making Recommendations
Let's compute the similarity between two movies.  First we need to use the movies table to look up the item IDs that were used in the ratings data that generated our SVD.

Determine the movie ids of "Toy Story (1995)" and "Bug's Life, A (1998)".

In [None]:
movies[movies.title == "Toy Story (1995)"]

In [None]:
movies[movies.title == "Bug's Life, A (1998)"]

Print the similarity of these 2 movies by calling `svd.similarity()` with those 2 IDs.

In [None]:
ITEMID1 = 1    # Toy Story (1995)
ITEMID2 = 2355 # A bug's life (1998)
print(svd.similarity(ITEMID1, ITEMID2))
# print svd2.similarity(ITEMID1, ITEMID2) to check

Use `svd.similar()` to get movies similar to Toy Story.

In [None]:
svd.similar(ITEMID1)

Try using `svd.predict()` to predict ratings for a given user and movie, $\hat{r}_{ui}$

In [None]:
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)

Look it up in the matrix...

In [None]:
svd.get_matrix().value(ITEMID, USERID)
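
To make the idea behind a predicted rating $\hat{r}_{ui}$ concrete, here is a rough NumPy sketch: center the observed ratings on their mean, take a low-rank approximation, add the mean back, and clip to the rating scale. This illustrates the general approach under those stated assumptions and is not python-recsys's implementation; the matrix, the helper `predict`, and the choice of `k = 1` are all made up for the example.

```python
import numpy as np

MIN_RATING, MAX_RATING = 0.0, 5.0

# Toy user-by-movie ratings; np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, np.nan, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Center on the mean of the observed ratings; missing entries become 0.
mu = np.nanmean(R)
centered = np.where(np.isnan(R), 0.0, R - mu)

# Rank-k approximation of the centered matrix
# (one latent factor is enough for this toy example).
U, s, VT = np.linalg.svd(centered, full_matrices=False)
k = 1
approx = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

def predict(user, item):
    """Global mean plus the low-rank estimate, clipped to the rating scale."""
    return float(np.clip(mu + approx[user, item], MIN_RATING, MAX_RATING))

print(predict(0, 2))  # estimate for a rating user 0 never gave
```

User 0 rated movies 0 and 1 highly, and movie 2 is favored by users with the opposite taste, so the estimate for movie 2 comes out below the estimate for movie 0.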

Try using `svd.recommend()` to recommend unrated movies to a user (`is_row=False`).

In [None]:
svd.recommend(USERID, is_row=False)
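
Conceptually, recommending unrated movies means ranking a user's predicted scores while skipping anything they have already rated. A minimal sketch of that step follows, assuming you already have a matrix of predicted ratings; the function name `recommend_unrated` and all numbers are hypothetical.

```python
import numpy as np

def recommend_unrated(predicted, rated_mask, user, n=2):
    """Return the indices of the n highest-scoring items the user has not rated."""
    # Push already-rated items to the bottom, then sort by descending score.
    scores = np.where(rated_mask[user], -np.inf, predicted[user])
    order = np.argsort(-scores)
    return [int(i) for i in order if np.isfinite(scores[i])][:n]

# Hypothetical predicted ratings (rows: users, columns: movies)
# and a mask marking which ratings are already known.
predicted = np.array([[4.6, 4.1, 1.8, 3.9],
                      [2.0, 3.5, 4.8, 4.4]])
rated_mask = np.array([[True, False, False, False],
                       [False, False, True, False]])

print(recommend_unrated(predicted, rated_mask, user=0))
```

User 0 has already rated movie 0, so it is excluded even though it has the highest predicted score.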

Which users should see Toy Story? (i.e. which users who have not rated Toy Story would give it a high rating?)

In [None]:
svd.recommend(ITEMID)

Find out more here: [https://github.com/ocelma/python-recsys](https://github.com/ocelma/python-recsys)