In [1]:
cd ..

/media/manuel/DATA/Backup/Proyectos/pydata_2016_talk


ipython nbconvert notebooks/Slides.ipynb --to slides --post serve

# A Primer on Recommendation Systems 

# Pydata Madrid 2016


## About me (Manuel Garrido)

* Data Scientist at [Scrapinghub](scrapinghub.com
* Born in Portugal, Spanish/US Citizen
* Industrial Engineer (in theory)
* consultant-->analyst-->data scientist
* excel-->R-->python-->spark

## About Scrapinghub



## Contact

* [manugarri.com](http://manugarri.com)

* @manugarri

* manuel [a t] scrapinghub [d_o_t] com

* hola [a t] manugarri [d_o_t] com

### Agenda

1. What are recommendation systems
2. Content Filtering
3. Collaborative Filtering

In [5]:
%load_ext watermark
%watermark

2016-04-04T23:01:39

CPython 3.5.1
IPython 4.1.2

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.19.0-56-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


# About the talk

* Practical talk
* Focused on examples

**If you meet the requirements...** 

```
pandas>=0.17
numpy
scipy
```

**...then clone the talk's repo and run a notebook**

```
git clone X
cd X
jupyter notebook
```

### Import dis

In [9]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

# 1. What are recommendation systems?

![title](../assets/wiki-definition.png "ShowMyImage")

![totle](../assets/lol.jpg)

### Basis for recommending items to users:

- **Item Information**
  - Item characteristics
- **User Information**
  - User characteristics.
  - User preferences.
  - Users relationships
  - User's previous interaction with our platform
- **Platform Information**
  - Business goals
  - Availability

### Similarity measures


Recommendations are based on the concept of similarity. There are multiple ways of defining similarity, or say diferently,
there are many similarity measures and each way applies to specific kinds of entities. Here are a few examples:

* **Edit distance (Lehvenstein & variations)**:  used to measure similarity between two strings. It takes into account how many changes would need to be applied to a word to become another. 

![title](../assets/Levenshtein.png)

* **Jaccard Index**: Useful when computing similarity among sets (one to many relations)

$$J(A, B) = \frac{|A \bigcap B|}{ |A\bigcup B|}$$

### Similarity measures (cont.)

* **Manhattan/Euclidean distance**:

![distances](../assets/diagram_euclidean_manhattan_distance_metrics.png)

* **cosine similarity**. 

![cosinm](../assets/cosinesim.png)

* **and many more!**

The Manhattan Distance is the distance between two points measured along axes at right angles
The Euclidean distance or Euclidean metric is the "ordinary" (i.e. straight-line) distance between two points
Cosine similarity  measures the cosine of the angle between them.

### Recommendation Systems Approaches
- **Content Filtering**
- **Collaborative Filtering**
- **Hybrid Systems**
- ** Other (Demographic / Social Recommendations)**

#### Demographic recommendations

* Use demographic information from the user to recommend items relevant to the user's stratum

#### Social recommendations

* Users sharing with other users

# Content Filtering 

In a Content filtering recommendation system, items are mapped into a feature space, and
recommendations depend on item characteristics.

 - That means that all the feature information we use is derived only from the items.

- That does not mean that we don't rely on any user information, just that information is used only on the recommendation step, and not at computing time.

## Example 1
![title](../assets/amazon_recommendations.jpeg "ShowMyImage")

On earlier versions of Amazon, recommendations where provided based on both user previous purchases and what the user explicitly told the platform he/she owned.

One way these recommendations could have been provided would by by first, mapping those books/dvs into a feature space
(for books for example, narrative category,  male/female protagonist).

Then, given the items the user has purchased, we can calculate the similarity between the items the user has purchased and all of Amazon items and return the most similar ones.

## Example
![title](../assets/pandora0.png "ShowMyImage")

Another example, Pandora, pandora is an online radio that is quite popular in the US.
When you join you can select songs, artists or fetures to create your first station. Pandora uses those songs /artists you selected to recommend new ones that are similar.

![title](../assets/pandora1.png "ShowMyImage")

Pandora allows us to add as many songs as we want to our station, each one improving the recommendations Pandora provides us.

![title](../assets/pandora3.png)

https://www.pandora.com/about/mgp

![title](../assets/pandora4.png)

Pandora is based of the Music Genome Project a project that clains to be the most in depth taxonomy of music. Pandora has a team of musical analysts analyzing 10,000 songs every month, mapping them to a set of 450 distinct musical features. They use these features to power their recommendations.

## Content Filtering

### Pros

* Recomending to a new user is easy. We just get user's input and we can recommend right ahead.

### Cons

* Need to map each item into the feature space. That means that any time a new item gets added, someone has to manually categorize that item (unless using unsupervised clustering methods).

* Recommendations are limited in scope. This means items can't be categorized in new features. Who can say 450 features are all the features necessary to map music?

## Practical Example:   Content Filtering Movie recommendation Engine

![title](../assets/ml-logo.png)
<br>

In this example we will build a recommendation system that recommends movies to users based on:

1. Movie features - represented as movie genres
2. Users' preferences about those categories


<br>
We will use the [Movielens](http://grouplens.org/datasets/movielens/) dataset. A free dataset containing millions of movie ratings by users.

In [6]:
movies_df = pd.read_table('assets/movies.dat', header=None, sep='::', 
                          names=['movie_id', 'movie_title', 'movie_genre'], engine='python')
movies_df = pd.concat([movies_df, movies_df.movie_genre.str.get_dummies(sep='|')], axis=1)  
movie_categories = movies_df.columns[3:].tolist()
movies_df.head()  

Unnamed: 0,movie_id,movie_title,movie_genre,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
user_movies = [
    'Die Hard: With a Vengeance (1995)',
    'Die Hard (1988)',
    'Braveheart (1995)',
    'Star Wars: Episode IV - A New Hope (1977)',
    'Star Wars: Episode VI - Return of the Jedi (1983)',
    'Indiana Jones and the Last Crusade (1989)',
    'Toy Story (1995)',
    'Aladdin (1992)',
    'Lion King, The (1994)',
]

In [9]:
def get_user_preferences(user_movies):
    user_features = movies_df[movies_df.movie_title.isin(user_movies)].ix[:,3:].T
    user_features = user_features.mean(axis=1).reset_index()
    print(user_features)
    return user_features.ix[:,1].tolist()

user_preferences_list = get_user_preferences(user_movies)

          index         0
0        Action  0.666667
1     Adventure  0.333333
2     Animation  0.333333
3    Children's  0.333333
4        Comedy  0.222222
5         Crime  0.000000
6   Documentary  0.000000
7         Drama  0.111111
8       Fantasy  0.111111
9     Film-Noir  0.000000
10       Horror  0.000000
11      Musical  0.222222
12      Mystery  0.000000
13      Romance  0.111111
14       Sci-Fi  0.222222
15     Thriller  0.222222
16          War  0.222222
17      Western  0.000000


In [10]:
def get_predicted_movie_score(movie_name, user_preferences): 
    movie_slice = movies_df[movies_df.movie_title==movie_name].iloc[0]
    movie_features = movie_slice[movie_categories]
    return np.dot(movie_features, user_preferences_list)

In [39]:
#Action +Sci-Fi + Thriller
get_predicted_movie_score('Terminator 2: Judgment Day (1991)', user_preferences_list)

1.1111111111111112

In [26]:
#Action + Drama
get_predicted_movie_score('Rocky (1976)', user_preferences_list)

0.77777777777777768

In [33]:
#Animation + Musical
get_predicted_movie_score('Prince of Egypt, The (1998)', user_preferences_list)

0.55555555555555558

In [28]:
#Horror + Thriller
get_predicted_movie_score('Scream (1996)', user_preferences_list)

0.22222222222222221

In [197]:
def get_movie_recommendations(user_preferences, n_recommendations):  
    movies_df['score'] = movies_df.movie_title.apply(get_predicted_movie_score, args=([user_preferences]))
    movies_df.sort_values(by=['score'], ascending=False, inplace=True)
    del movies_df['score']
    return movies_df[['movie_title','movie_genre']].head(n_recommendations)

get_movie_recommendations(user_preferences_list, 10) 

Unnamed: 0,movie_title,movie_genre
1187,"Transformers: The Movie, The (1986)",Action|Animation|Children's|Sci-Fi|Thriller|War
554,"Pagemaster, The (1994)",Action|Adventure|Animation|Children's|Fantasy
2253,Soldier (1998),Action|Adventure|Sci-Fi|Thriller|War
1192,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
1178,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
606,Heavy Metal (1981),Action|Adventure|Animation|Horror|Sci-Fi
1972,Condorman (1981),Action|Adventure|Children's|Comedy
2651,Inspector Gadget (1999),Action|Adventure|Children's|Comedy
542,Super Mario Bros. (1993),Action|Adventure|Children's|Sci-Fi
1197,Army of Darkness (1993),Action|Adventure|Comedy|Horror|Sci-Fi


#### We can even ask the user to explicitly state their opinion

In [199]:
from collections import OrderedDict


user_preferences = OrderedDict(zip(movie_categories, []))

user_preferences['Action'] = 5  
user_preferences['Adventure'] = 5  
user_preferences['Animation'] = 1  
user_preferences["Children's"] = 1  
user_preferences["Comedy"] = 3  
user_preferences['Crime'] = 2  
user_preferences['Documentary'] = 1  
user_preferences['Drama'] = 1  
user_preferences['Fantasy'] = 5  
user_preferences['Film-Noir'] = 1  
user_preferences['Horror'] = 2  
user_preferences['Musical'] = 1  
user_preferences['Mystery'] = 3  
user_preferences['Romance'] = 1  
user_preferences['Sci-Fi'] = 5  
user_preferences['War'] = 3  
user_preferences['Thriller'] = 2  
user_preferences['Western'] =1  

user_preferences_list = list(user_preferences.values())

In [200]:
get_movie_recommendations(user_preferences_list, 10) 

Unnamed: 0,movie_title,movie_genre
257,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
1197,Army of Darkness (1993),Action|Adventure|Comedy|Horror|Sci-Fi
2253,Soldier (1998),Action|Adventure|Sci-Fi|Thriller|War
2559,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Fantasy|Sci-Fi
2036,Tron (1982),Action|Adventure|Fantasy|Sci-Fi
1985,"Honey, I Shrunk the Kids (1989)",Adventure|Children's|Comedy|Fantasy|Sci-Fi
1505,"Lost World: Jurassic Park, The (1997)",Action|Adventure|Sci-Fi|Thriller
1113,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
1192,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
606,Heavy Metal (1981),Action|Adventure|Animation|Horror|Sci-Fi


# Collaborative Filtering 

In collaborative filtering, item features are derived from individual user behaviors or attitudes.

The assumption is that users get value from recommendations based on other users with similar tastes.

Unleases a potentially much bigger dataset to base our recommendations on.

### 2 Basic approaches:

* Item based collaborative filtering (Item-Item). Measure similarity between items

* User based collaborative filtering (Item-User). Measure similarity between users

## Item-Item example

![amazon](../assets/amazon3.png)
 
![amazon](../assets/amazon4.png)


On earlier versions of Amazon, recommendations where provided based on  user purchases. How [Amazon](http://www.cin.ufpe.br/~idal/rs/Amazon-Recommendations.pdf) described their algorithm is above. They used cosine similarity to compute the similarity between items.

### User-Item Collaborative Filtering Example

![facebook](../assets/facebook.png)

One obvious example of User item collaborative filtering is Facebook 'Friends you may know' even though the implementation is graph based, is using users that are similar to you in terms of the friends you share to recommend new friends to you based on your friends' friends 

### Collaborative filtering

#### Pros

* Capable of recommending new items without having to manually map them on the feature space. 
* Capable to find recommendations based on hidden features that an expert wouldn't be able to find (for example, combination of genres or actors).

#### Cons

* Cold Start. We need user interaction data to be able to start building our features
* Every user interaction affects the features. 

## Item-Item Collaborative filtering Example: Subreddit recommendations.

I am going to show you how to build a recommendation engine for Reddit!

A live example of it is located at http://findasub.manugarri.com

The goal can be written down as  "given a user's set of subreddits, recommend new subredits that are similar". 

In this system,  the features of each item (subreddit) will be the similarities of this subreddit with the others

```
manuel@manuel-P35V2 pydata_2016_talk $ sqlite3 assets/reddit.db 
SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE redditors (redditor varchar(30), sub varchar(1000));
CREATE TABLE similarity (sub1 varchar(30), sub2 varchar(1000), similarity float);
sqlite> 

```

In [11]:
import sqlite3
conn = sqlite3.connect('./assets/reddit.db')

print(pd.io.sql.read_sql('select count(*) from redditors;',conn))

pd.io.sql.read_sql('select redditor, sub from redditors limit 10;',conn)

   count(*)
0   1756328


Unnamed: 0,redditor,sub
0,bananashirt178,memes
1,tweed08,memes
2,tweed08,Portal
3,KTM950Guy,beerporn
4,braidedbutthair,beerporn
5,ipathological,beerporn
6,justinthreee,memes
7,jake420,shittyadvice
8,jake420,memes
9,jake420,findareddit


We have 1.8 million rows on the redditors table, each one is a comment of a redditor to a subreddit.

In [40]:
#https://github.com/manugarri/Reddit-Recommendation-Engine/blob/master/seed/similarity.py


#In production, be always mindful of Bobby Tables
def compute_sim(sub1, sub2):
    users_union = conn.execute('''SELECT COUNT(DISTINCT(redditor)) from redditors
                                    WHERE sub ="{0}" OR sub ="{1}";'''.format(sub1, sub2)).fetchone()
    users_intersect = conn.execute('''SELECT COUNT(DISTINCT(redditor)) from redditors
                                   WHERE sub ="{0}" and redditor in (
                                   SELECT DISTINCT(redditor) FROM redditors WHERE sub="{1}")'''.format(sub1, sub2)).fetchone()
    users_intersect = int(users_intersect[0])
    users_union = int(users_union[0])
    if users_intersect:
        return users_intersect * 1.0 / users_union
    else:
        return 0.0



So here we see how we are going to measure the similarity between subs.
We will use the Jaccard similarity index that I explained before.

So given a sample of reddit comments, we can estimate how similar two subreddits are by counting how many of the redditors that wrote messages on BOTH subreddits (intersection) divided by the number of redditors that wrote on either one of them,

This step takes quite some time to run 

In [66]:
compute_sim('aviation','flying')

0.10175922731976543

In [47]:
import json
from itertools import product

with open('assets/subs.json' ) as f:
    subreddits = json.load(f)[:5]
    
for sim1, sim2 in product(subreddits, repeat=2):
    if sim1!=sim2:
        print(sim1,sim2,compute_sim(sim1, sim2))

30ROCK 3DS 0.00051440329218107
30ROCK 3Dprinting 0.0006369426751592356
30ROCK 3amjokes 0.0006451612903225806
30ROCK 49ers 0.000643915003219575
3DS 30ROCK 0.00051440329218107
3DS 3Dprinting 0.0013764624913971094
3DS 3amjokes 0.0010391409767925182
3DS 49ers 0.0017313019390581717
3Dprinting 30ROCK 0.0006369426751592356
3Dprinting 3DS 0.0013764624913971094
3Dprinting 3amjokes 0.001193792280143255
3Dprinting 49ers 0.0007945967421533572
3amjokes 30ROCK 0.0006451612903225806
3amjokes 3DS 0.0010391409767925182
3amjokes 3Dprinting 0.001193792280143255
3amjokes 49ers 0.00040032025620496394
49ers 30ROCK 0.000643915003219575
49ers 3DS 0.0017313019390581717
49ers 3Dprinting 0.0007945967421533572
49ers 3amjokes 0.00040032025620496394


In [48]:
pd.io.sql.read_sql('select sub1, sub2, similarity from similarity order by similarity desc limit 15;',conn)

Unnamed: 0,sub1,sub2,similarity
0,iOSthemes,jailbreak,0.142021
1,asktransgender,transgender,0.112782
2,aviation,flying,0.101759
3,ukpolitics,unitedkingdom,0.101731
4,Liberal,progressive,0.099843
5,keto,ketorecipes,0.097407
6,Pokemongiveaway,pokemontrades,0.089431
7,ImaginaryCharacters,ImaginaryMonsters,0.08848
8,frugalmalefashion,malefashionadvice,0.084672
9,GiftofGames,RandomActsOfGaming,0.081259


In [96]:
def get_redditor_subs(redditor):
    records = conn.execute('''select sub from redditors where redditor = "{}"'''.format(redditor)).fetchall()
    return [r[0] for r in records]

def get_sub_recommendations(redditor, n_recommendations=20):
    redditor_subs = get_redditor_subs(redditor)
    print(redditor_subs)

    query_1 = "select sub1 as sub, sum(similarity) as similarity from similarity where sub2 in ({})\
                group by sub;".format(', '.join(["'" + s + "'" for s in redditor_subs]))
    all_sub_scores_1 = pd.io.sql.read_sql(query_1, conn)

    query_2 = "select sub2 as sub, sum(similarity) as similarity from similarity where sub1 in ({})\
                group by sub;".format(', '.join(["'" + s + "'" for s in redditor_subs]))
    all_sub_scores_2 = pd.io.sql.read_sql(query_2, conn)

    sub_scores = pd.concat([all_sub_scores_1, all_sub_scores_2])
    sub_scores = sub_scores.groupby('sub').sum().reset_index()
    sub_scores.sort_values(by='similarity', ascending=False, inplace=True)
    
    sub_scores = sub_scores[-sub_scores['sub'].isin(redditor_subs)]
    return sub_scores.head(n_recommendations)

In [95]:
get_sub_recommendations('NoeticIntelligence')

['IAmA', 'atheism', 'askscience', 'conspiracy', 'scifi', 'photography', 'changemyview', 'worldpolitics', 'Entrepreneur', 'compsci', 'dogs', 'webdev', 'google', 'Military', 'Paranormal', 'ragecomics', 'AskSocialScience', 'NeutralPolitics', 'privacy', 'PoliticalDiscussion', 'software', 'conspiratard', 'islam', 'printSF', 'NorthKoreaNews', 'Ask_Politics', 'moderatepolitics', 'commandline', 'hackers']


Unnamed: 0,sub,similarity
699,environment,0.185935
567,business,0.170226
1062,politics,0.165885
341,Libertarian,0.155474
1113,restorethefourth,0.150343
1075,progressive,0.147675
206,Foodforthought,0.146236
340,Liberal,0.134288
1344,worldnews,0.133243
1184,space,0.132084


## Example: Item-Item Based Collaborative filtering in the MovieLens dataset

In [8]:
ratings_df = pd.read_table('assets/ratings.dat', header=None, sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'])

#we dont care about the time the rating was given
del ratings_df['timestamp']

#replace movie_id with movie_title for legibility
ratings_df = pd.merge(ratings_df, movies_df, on='movie_id')[['user_id', 'movie_title', 'movie_id','rating']]

ratings_df.head()  

  if __name__ == '__main__':


Unnamed: 0,user_id,movie_title,movie_id,rating
0,1,One Flew Over the Cuckoo's Nest (1975),1193,5
1,2,One Flew Over the Cuckoo's Nest (1975),1193,5
2,12,One Flew Over the Cuckoo's Nest (1975),1193,4
3,15,One Flew Over the Cuckoo's Nest (1975),1193,4
4,17,One Flew Over the Cuckoo's Nest (1975),1193,5


In [82]:
ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')  
ratings_mtx_df = ratings_mtx_df.apply(lambda x: x.fillna(x.mean()),axis=0)
ratings_mtx_df = ratings_mtx_df.apply(lambda x: x - x.mean(),axis=1)

movie_index = ratings_mtx_df.columns

ratings_mtx_df.head()  

movie_title,"$1,000,000 Duck (1971)",'Night Mother (1986),'Til There Was You (1997),"'burbs, The (1989)",...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1961),101 Dalmatians (1996),12 Angry Men (1957),...,"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Young and Innocent (1937),Your Friends and Neighbors (1998),Zachariah (1971),"Zed & Two Noughts, A (1985)",Zero Effect (1998),Zero Kelvin (Kj�rlighetens kj�tere) (1995),Zeus and Roxanne (1997),eXistenZ (1999)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.215674,0.128728,-0.550393,-0.33181,0.470867,-0.742701,0.180156,0.35376,-0.195997,1.052754,...,0.390211,0.147801,0.057299,0.133446,0.257299,0.171092,0.50813,0.257299,-0.720962,0.013397
2,-0.210064,0.134337,-0.544784,-0.3262,0.476476,-0.737091,0.185766,0.359369,-0.190388,1.058363,...,0.39582,0.15341,0.062909,0.139055,0.262909,0.176702,0.513739,0.262909,-0.715352,0.019006
3,-0.213476,0.130926,-0.548195,-0.329612,0.473065,-0.740503,0.182354,0.355957,-0.193799,1.054952,...,0.392409,0.149999,0.059497,0.135644,0.259497,0.17329,0.510328,0.259497,-0.718764,0.015595
4,-0.212675,0.131726,-0.547394,-0.328811,0.473866,-0.739702,0.183155,0.356758,-0.192999,1.055752,...,0.393209,0.150799,0.060298,0.136445,0.260298,0.174091,0.511128,0.260298,-0.717963,0.016395
5,-0.182296,0.162105,-0.517016,-0.298432,0.504244,-0.709323,0.213534,0.387137,-0.16262,1.086131,...,0.423588,0.181178,0.090677,0.166823,0.290677,0.20447,0.541507,0.290677,-0.687584,0.046774


Since the user rating matrix is very sparse, we normalize it by

1. Filling non rated user-movie interactions with the users' average rating,
2. Deduct the movie bias by removing the movie's global average rating from each rating

Now that we have a normalized user-ratings matrix, we can compute the relationships between movies by calculating the correlation between movies and ratings.

We will use **Pearson product-moment correlation coefficient (PPMC)** to compute the similarities between movies based off the relation of the ratings users give to them.

![pmcc](../assets/pmcc.png)

In [83]:
corr_matrix = np.corrcoef(ratings_mtx_df.T)  
corr_matrix.shape

(3706, 3706)

In [115]:
def get_movie_correlations(movie_title):  
    '''Returns correlation vector for a movie'''
    movie_idx = list(movie_index).index(movie_title)
    return corr_matrix[movie_idx]

def get_similar_movies(movie_title, threshold=0.2):
    movie_correlations_array =  get_movie_correlations(movie_title)
    return movie_index[movie_correlations_array>threshold]

In [116]:
get_similar_movies('Star Wars: Episode IV - A New Hope (1977)')

Index(['Raiders of the Lost Ark (1981)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)',
       'Star Wars: Episode VI - Return of the Jedi (1983)'],
      dtype='object', name='movie_title')

In [117]:
get_similar_movies('Die Hard (1988)')

Index(['Die Hard (1988)', 'Die Hard 2 (1990)',
       'Die Hard: With a Vengeance (1995)', 'Lethal Weapon (1987)',
       'Lethal Weapon 2 (1989)', 'Terminator 2: Judgment Day (1991)',
       'Terminator, The (1984)'],
      dtype='object', name='movie_title')

In [119]:
def get_movie_recommendations(user_movies):  
    '''given a set of movies, it returns all the movies sorted by their correlation with the user'''
    movie_similarities = np.zeros(corr_matrix.shape[0])
    for movie_id in user_movies:
        movie_similarities = movie_similarities + get_movie_correlations(movie_id)
    similarities_df = pd.DataFrame({
        'movie_title': movie_index,
        'sum_similarity': movie_similarities
        })
    similarities_df = similarities_df[~(similarities_df.movie_title.isin(user_movies))]
    similarities_df = similarities_df.sort_values(by=['sum_similarity'], ascending=False)
    return similarities_df

In [125]:
sample_user = 21
sample_user_movies = ratings_df[ratings_df.user_id==sample_user].movie_title.tolist() 
sample_user_movies

["Bug's Life, A (1998)",
 'Bambi (1942)',
 'Antz (1998)',
 'Aladdin (1992)',
 'Toy Story (1995)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Who Framed Roger Rabbit? (1988)',
 'South Park: Bigger, Longer and Uncut (1999)',
 'Akira (1988)',
 'Pinocchio (1940)',
 'Mad Max Beyond Thunderdome (1985)',
 'Titan A.E. (2000)',
 "Devil's Advocate, The (1997)",
 'Prince of Egypt, The (1998)',
 'Wild Wild West (1999)',
 'Iron Giant, The (1999)',
 'Brady Bunch Movie, The (1995)',
 'Princess Mononoke, The (Mononoke Hime) (1997)',
 'Little Nemo: Adventures in Slumberland (1992)',
 'Messenger: The Story of Joan of Arc, The (1999)',
 'Stop! Or My Mom Will Shoot (1992)',
 'House Party 2 (1991)']

In [126]:
recommendations = get_movie_recommendations(sample_user_movies)

#We get the top 20 recommended movies
recommendations.movie_title.head(20)  

3055    Snow White and the Seven Dwarfs (1937)
1865                 Lady and the Tramp (1955)
679                          Cinderella (1950)
1002                              Dumbo (1941)
1939                     Lion King, The (1994)
324                Beauty and the Beast (1991)
7                        101 Dalmatians (1961)
2770           Rescuers Down Under, The (1990)
1948                Little Mermaid, The (1989)
3412                        Toy Story 2 (1999)
2808                         Robin Hood (1973)
3026                    Sleeping Beauty (1959)
1739          James and the Giant Peach (1996)
3275                             Tarzan (1999)
97                  Alice in Wonderland (1951)
1781                   Jungle Book, The (1967)
1611       Hunchback of Notre Dame, The (1996)
2771                      Rescuers, The (1977)
2432                   Oliver & Company (1988)
1111                           Fantasia (1940)
Name: movie_title, dtype: object

## Collaborative filtering via Dimensionality Reduction by SVD

SVD (Singular Value Decomposition) is a matrix factorization method

it works by

$$A = U\Sigma V^T$$

With  being a diagonal matrix containing the eigenvalues of the original matrix A

In [18]:
movies_df = pd.read_table('assets/movies.dat', header=None, sep='::', names=['movie_id', 'movie_title', 'movie_genre'])
movies_df.head()  

  if __name__ == '__main__':


Unnamed: 0,movie_id,movie_title,movie_genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [19]:
ratings_df = pd.read_table('assets/ratings.dat', header=None, sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'])
del ratings_df['timestamp']
ratings_df.head()  

  if __name__ == '__main__':


Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [20]:
users = ratings_df.user_id.unique()
movies = ratings_df.movie_id.unique()
 
number_of_rows = len(users)
number_of_columns = len(movies)

movie_indices, user_indices = {}, {}
 
for i in range(len(movies)):
    movie_indices[movies[i]] = i
    
for i in range(len(users)):
    user_indices[users[i]] = i

In [21]:
ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_id')
movie_index = ratings_mtx_df.columns

In [22]:
ratings_mtx_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,2.0,,,,,...,,,,,,,,,,


http://web.eecs.umich.edu/~cscott/past_courses/eecs545f11/projects/AsendorfMcgaffinPressSchwartz.pdf

Mixture Mean Algorithm for data imputation

0.452µuser(j) + 0.548µmovie(i) 

In [23]:
user_avgs = ratings_mtx_df.mean(axis=1)
movie_avgs = ratings_mtx_df.mean(axis=0)

user_alpha = 0.452
movie_alpha = 0.548

user_fill_values = user_alpha * user_avgs
movie_fill_values = movie_alpha * movie_avgs


for movie_id in ratings_mtx_df.columns:
    ratings_mtx_df.loc[:,movie_id][ratings_mtx_df.loc[:,movie_id].isnull()] = movie_fill_values[movie_id] + user_fill_values

In [24]:
ratings_mtx_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.647508,3.546455,3.389001,3.540986,4.018823,3.762226,3.545342,3.349244,3.833499,...,3.565825,3.050172,2.708911,3.131763,3.796338,3.885664,4.148375,3.902616,4.030483,3.965231
2,3.950828,3.432582,3.331528,3.174074,3.326059,3.803897,3.5473,3.330415,3.134317,3.618573,...,3.350898,2.835245,2.493984,2.916837,3.581411,3.670737,3.933449,3.68769,3.815557,3.750305
3,4.036158,3.517912,3.416858,3.259404,3.411389,3.889227,3.63263,3.415745,3.219647,3.703902,...,3.436228,2.920575,2.579314,3.002166,3.666741,3.756067,4.018778,3.77302,3.900886,3.835635
4,4.166567,3.648321,3.547267,3.389813,3.541798,4.019636,3.763038,3.546154,3.350056,3.834311,...,3.566637,3.050984,2.709723,3.132575,3.79715,3.886476,4.149187,3.903429,4.031295,3.966044
5,3.694674,3.176427,3.075374,2.91792,3.069905,2.0,3.291145,3.074261,2.878163,3.362418,...,3.094744,2.579091,2.23783,2.660682,3.325257,3.414583,3.677294,3.431535,3.559402,3.49415


In [25]:
ratings_mtx = ratings_mtx_df.as_matrix()
ratings_mtx.shape

(6040, 3706)

In [11]:
import gc

del ratings_mtx_df
del ratings_df
gc.collect()

107263

In [12]:
from scipy.sparse.linalg import svds

In [26]:
u,s, vt = svds(ratings_mtx, k = 2000)
s_diag_matrix = np.zeros((s.shape[0], s.shape[0]))

for i in range(s.shape[0]):
    s_diag_matrix[i,i] = s[i]
    
X_lr = np.dot(np.dot(u, s_diag_matrix), vt)

In [None]:
import pickle
with open('assets/movielens_1M_svd_u.pickle', 'wb') as handle:
    pickle.dump(u, handle)
with open('assets/movielens_1M_svd_s.pickle', 'wb') as handle:
    pickle.dump(s, handle)
with open('assets/movielens_1M_svd_vt.pickle', 'wb') as handle:
    pickle.dump(vt, handle)
    
del vt
del s
del u
gc.collect()
ratings_df = pd.read_table('assets/ratings.dat', header=None, sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'])
del ratings_df['timestamp']

In [28]:
sample_user = 1

In [31]:
X_lr.shape

(6040, 3706)

In [35]:
user_ratings = ratings_df[ratings_df.user_id==sample_user].merge(
               movies_df, on='movie_id'
              ).sort_values(by='rating',ascending=False)
user_ratings_movies = user_ratings.movie_id.tolist()
user_ratings

Unnamed: 0,user_id,movie_id,rating,movie_title,movie_genre
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama
46,1,1029,5,Dumbo (1941),Animation|Children's|Musical
40,1,1,5,Toy Story (1995),Animation|Children's|Comedy
18,1,3105,5,Awakenings (1990),Drama
41,1,1961,5,Rain Man (1988),Drama
23,1,527,5,Schindler's List (1993),Drama|War
37,1,1022,5,Cinderella (1950),Animation|Children's|Musical
14,1,1035,5,"Sound of Music, The (1965)",Musical
25,1,48,5,Pocahontas (1995),Animation|Children's|Musical|Romance
45,1,1028,5,Mary Poppins (1964),Children's|Comedy|Musical


In [36]:
user_predicted_scores_df = pd.DataFrame({
        'movie_id': list(movie_indices.values()),
        'normalized_score': ratings_mtx[user_indices[sample_user],:],
        'pred_score': X_lr[user_indices[sample_user],:]})
user_predicted_scores_df = user_predicted_scores_df[-user_predicted_scores_df.movie_id.isin(user_ratings_movies)]
user_predicted_scores_df.sort(columns=['pred_score'], ascending=False).merge(movies_df, on='movie_id').head(20)



Unnamed: 0,movie_id,normalized_score,pred_score,movie_title,movie_genre
0,23,5.0,5.001451,Assassins (1995),Thriller
1,22,5.0,5.001254,Copycat (1995),Crime|Drama|Thriller
2,14,5.0,4.996516,Nixon (1995),Drama
3,4,5.0,4.996171,Waiting to Exhale (1995),Comedy|Drama
4,40,5.0,4.995617,"Cry, the Beloved Country (1995)",Drama
5,41,5.0,4.995482,Richard III (1995),Drama|War
6,7,5.0,4.993081,Sabrina (1995),Comedy|Romance
7,10,5.0,4.98968,GoldenEye (1995),Action|Adventure|Thriller
8,39,5.0,4.989618,Clueless (1995),Comedy|Romance
9,45,5.0,4.986121,To Die For (1995),Comedy|Drama


### We can load the 10M svd matrix and see how it performs

In [56]:
ratings_df = pd.read_table('assets/ml-10M100K/ratings.dat', header=None, 
                           sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'],
                           engine='python')
del ratings_df['timestamp']


users = ratings_df.user_id.unique()
movies = ratings_df.movie_id.unique()
 
number_of_rows = len(users)
number_of_columns = len(movies)

movie_indices, user_indices = {}, {}
 
for i in range(len(movies)):
    movie_indices[movies[i]] = i
    
for i in range(len(users)):
    user_indices[users[i]] = i

  if __name__ == '__main__':


Load the pickled svd components

In [45]:
u_10 = np.load('assets/ml-10M100K/movielens_10M_svd_u.pickle')
s_10 = np.load('assets/ml-10M100K/movielens_10M_svd_s.pickle')
vt_10 = np.load('assets/ml-10M100K/movielens_10M_svd_vt.pickle')

s_diag_matrix_10 = np.zeros((s_10.shape[0], s_10.shape[0]))

for i in range(s_10.shape[0]):
    s_diag_matrix_10[i,i] = s_10[i]
    
X_lr = np.dot(np.dot(u_10, s_diag_matrix_10), vt_10)

In [51]:
user_ratings =ratings_df[ratings_df.user_id==sample_user].merge(
               movies_df, on='movie_id'
              ).sort_values(by='rating',ascending=False)
user_ratings_movies = user_ratings.movie_id.tolist()
user_ratings

Unnamed: 0,user_id,movie_id,rating,movie_title,movie_genre
0,1,122,5.0,Boomerang (1992),Comedy|Romance
1,1,185,5.0,"Net, The (1995)",Sci-Fi|Thriller
20,1,594,5.0,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical
19,1,589,5.0,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
18,1,588,5.0,Aladdin (1992),Animation|Children's|Comedy|Musical
17,1,586,5.0,Home Alone (1990),Children's|Comedy
16,1,539,5.0,Sleepless in Seattle (1993),Comedy|Romance
15,1,520,5.0,Robin Hood: Men in Tights (1993),Comedy
14,1,480,5.0,Jurassic Park (1993),Action|Adventure|Sci-Fi
13,1,466,5.0,Hot Shots! Part Deux (1993),Action|Comedy|War


In [55]:
user_predicted_scores_df = pd.DataFrame({
        'movie_id': list(movie_indices.values()),
        'pred_score': X_lr[user_indices[sample_user],:]})
user_predicted_scores_df = user_predicted_scores_df[-user_predicted_scores_df.movie_id.isin(user_ratings_movies)]
user_predicted_scores_df.sort(columns=['pred_score'], ascending=False).merge(movies_df, on='movie_id').head(20)



Unnamed: 0,movie_id,pred_score,movie_title,movie_genre
0,9,5.000206,Sudden Death (1995),Action
1,14,4.999915,Nixon (1995),Drama
2,19,4.99945,Ace Ventura: When Nature Calls (1995),Comedy
3,11,4.999285,"American President, The (1995)",Comedy|Drama|Romance
4,7,4.999271,Sabrina (1995),Comedy|Romance
5,2,4.999075,Jumanji (1995),Adventure|Children's|Fantasy
6,4,4.99898,Waiting to Exhale (1995),Comedy|Drama
7,5,4.998887,Father of the Bride Part II (1995),Comedy
8,3,4.998816,Grumpier Old Men (1995),Comedy|Romance
9,18,4.998762,Four Rooms (1995),Thriller


http://blogs.gartner.com/martin-kihn/how-to-build-a-recommender-system-in-python/

### We try http://maheshakya.github.io/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html