## Recommender systems

Recommender systems personalize much of the web: they suggest what to buy, where to eat, and even whom to befriend. People's tastes vary, but they generally follow patterns. People tend to like things that are similar to other things they like, and they tend to share tastes with the people they are close to. Recommender systems try to capture these patterns to predict what else you might like.

### Types
- Content-Based (Similarity between items)
- Collaborative Filtering (Similarity between users' behaviors)
    - Model-Based Collaborative filtering (SVD)
    - Memory-Based Collaborative Filtering (cosine similarity)
        - user-item filtering
        - item-item filtering
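
The model-based branch is listed above but not implemented in this notebook. As a rough illustration only, the sketch below factorizes a tiny made-up rating matrix with a truncated SVD via `np.linalg.svd`. The matrix `toy_ratings`, the helper `svd_reconstruct`, and the rank `k` are assumptions for this example, not values taken from the MovieLens data.

```python
import numpy as np

# Hypothetical 4-user x 3-item rating matrix (0 = not rated); illustrative only.
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

def svd_reconstruct(ratings, k):
    """Rank-k approximation of `ratings` via truncated SVD."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    return (u[:, :k] * s[:k]).dot(vt[:k, :])

approx = svd_reconstruct(toy_ratings, k=2)
print(np.round(approx, 2))
```

The low-rank reconstruction fills the zero cells with nonzero scores, which is what a model-based recommender would rank.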
 
### Data
- [MovieLens 100K Dataset](https://grouplens.org/datasets/movielens/100k/)
- 100k movie ratings
- 943 users
- 1682 movies

In [2]:
import numpy as np
import pandas as pd
#import tools as t

In [3]:
#reading
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=header)

In [4]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

Number of users = 943 | Number of movies = 1682


In [20]:
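# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
# No random_state is set, so every run produces a different split
train_data, test_data = train_test_split(df, test_size=0.25)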

In [21]:
train_data.describe()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| count | 75000.0 | 75000.0 | 75000.0 | 75000.0 |
| mean | 462.676733 | 424.801587 | 3.52812 | 883519800.0 |
| std | 266.538794 | 330.221092 | 1.125617 | 5341593.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 255.0 | 175.0 | 3.0 | 879448800.0 |
| 50% | 447.0 | 321.0 | 4.0 | 882827000.0 |
| 75% | 682.0 | 630.0 | 4.0 | 888206500.0 |
| max | 943.0 | 1681.0 | 5.0 | 893286600.0 |


### Create a user-item rating matrix

<img src="user-item.png">

In [25]:
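def user_item_rating(data):
    # Size the matrix from the full dataset `df` (not `data`) so that the
    # train and test matrices share the same (n_users, n_items) shape.
    numrows = len(df['user_id'].unique())
    numcols = len(df['item_id'].unique())
    out = np.zeros((numrows, numcols))  # 0 means "not rated"
    # Each row holds user_id, item_id, rating, timestamp; ids are 1-based.
    for row in data.values:
        out[row[0]-1, row[1]-1] = row[2]
    return out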
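
The row-by-row loop above can also be expressed with NumPy integer-array indexing. The sketch below is one possible alternative, not the notebook's method. It assumes ids are 1-based and contiguous, as in MovieLens 100K, and the names `build_rating_matrix` and `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

def build_rating_matrix(data, n_users, n_items):
    """Dense user x item matrix; unrated cells stay 0."""
    out = np.zeros((n_users, n_items))
    rows = data['user_id'].values - 1   # shift 1-based ids to 0-based indices
    cols = data['item_id'].values - 1
    out[rows, cols] = data['rating'].values
    return out

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    'user_id': [1, 1, 3],
    'item_id': [1, 2, 3],
    'rating':  [5, 3, 4],
})
matrix = build_rating_matrix(toy, n_users=3, n_items=3)
print(matrix)
```

Passing the dimensions explicitly keeps the function independent of any global DataFrame while still giving train and test matrices the same shape.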

In [27]:
train_data_matrix = user_item_rating(train_data)
test_data_matrix = user_item_rating(test_data)

print(train_data_matrix.shape, test_data_matrix.shape)
print("Train Matrix ", train_data_matrix[:10])
print()
print("Test Matrix ", test_data_matrix[:10])

(943, 1682) (943, 1682)
Train Matrix  [[ 5.  3.  4. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]]

Test Matrix  [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


### Calculate Cosine Similarity
<img src="user_sim.gif">
<img src="item_sim.gif">

In [69]:
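def cosine_similarity(data):
    # Pairwise dot products between the rows of `data`
    dots = data.dot(data.T)
    # Row norms: square roots of the diagonal (each row dotted with itself)
    norms = np.sqrt(np.diag(dots))
    norms = norms[:, np.newaxis]
    norms = norms.dot(norms.T)
    # The small constant avoids division by zero for all-zero rows
    out = dots / (norms + 1e-16)
    # Note: 1 - cosine similarity is the cosine *distance* (0 = same direction),
    # despite the function name
    return 1 - out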
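
As a quick sanity check on a made-up matrix, the sketch below restates the function and compares a few rows by hand. Because of the final `1 - out`, the returned value behaves like a cosine distance: rows pointing in the same direction give 0 and orthogonal rows give 1. The array `toy` is an assumption for illustration.

```python
import numpy as np

def cosine_similarity(data):
    # Same computation as the notebook's function, restated so this block runs alone.
    dots = data.dot(data.T)
    norms = np.sqrt(np.diag(dots))[:, np.newaxis]
    return 1 - dots / (norms.dot(norms.T) + 1e-16)

toy = np.array([
    [1.0, 0.0],   # row 0
    [2.0, 0.0],   # same direction as row 0
    [0.0, 3.0],   # orthogonal to row 0
])
result = cosine_similarity(toy)
print(np.round(result, 6))
```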

In [70]:
user_similarity = cosine_similarity(train_data_matrix)
item_similarity = cosine_similarity(train_data_matrix.T)

In [71]:
print(user_similarity.shape, item_similarity.shape)
print(user_similarity[0][1])
print(item_similarity[0][1])

(943, 943) (1682, 1682)
0.866573010982
0.721793571361


### Predictions
- user-item filtering
- item-item filtering

<img src="user_predict.gif">
<img src="item_predict.gif">
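
Written out, the two prediction rules as implemented in the `predict` function below are roughly the following. The symbols are assumptions chosen for this note and may differ from the notation in the images: $r_{u,i}$ is the rating of item $i$ by user $u$ (0 if unrated), $\bar{r}_u$ is the mean of row $u$ of the rating matrix, and $s$ is the matrix passed in as `similarity`.

User-based:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \left|s_{u,v}\right|}$$

Item-based:

$$\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \left|s_{i,j}\right|}$$

In both cases the sums run over all users $v$ or all items $j$, not only the nearest neighbours.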

In [96]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Per-user mean over the whole row; unrated cells (zeros) are included
        means = np.mean(ratings, axis=1, keepdims=True)
        # Similarity-weighted sum of mean-centered ratings from all users
        num = similarity.dot(ratings - means)
        denom = np.sum(np.abs(similarity), axis=1, keepdims=True)
        out = means + num / denom
    else:
        # Similarity-weighted sum of the user's ratings over all items
        res = ratings.dot(similarity)
        # The 1-D row sums broadcast across columns: one normalizer per item
        out = res / np.sum(np.abs(similarity), axis=1)
    return out

In [97]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [98]:
print(item_prediction[0])
print(user_prediction[0])

[ 0.3716121   0.38723028  0.40124661 ...,  0.44992552  0.43976275
  0.44649227]
[ 1.58433995  0.55063093  0.49332769 ...,  0.29784113  0.29778281
  0.29532385]
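
To turn a row of predicted scores into recommendations, one common approach is to hide the items a user has already rated and keep the highest remaining scores. The sketch below assumes that approach. The helper `top_n_unseen` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def top_n_unseen(pred_row, rated_row, n=2):
    """Indices of the n highest-scoring items the user has not rated (0 = unrated)."""
    # Push already-rated items to the bottom of the ranking.
    scores = np.where(rated_row > 0, -np.inf, pred_row)
    order = np.argsort(-scores)                 # descending by score
    n_unseen = int((rated_row == 0).sum())
    return order[:min(n, n_unseen)]

pred_row = np.array([4.2, 3.9, 1.5, 4.8])       # made-up predicted scores
rated_row = np.array([5.0, 0.0, 0.0, 0.0])      # item 0 is already rated
recommended = top_n_unseen(pred_row, rated_row, n=2)
print(recommended)  # → [3 1]
```

With the notebook's variables this would be called as `top_n_unseen(user_prediction[0], train_data_matrix[0])`, and the returned indices are 0-based, so the MovieLens `item_id` is the index plus one.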


### Evaluate

In [99]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Score only the cells that hold a rating in the ground truth (nonzero)
    rated = ground_truth.nonzero()
    prediction = prediction[rated]
    ground_truth = ground_truth[rated]
    return sqrt(mean_squared_error(ground_truth, prediction))
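
Because the test matrix uses 0 for "no rating", only its nonzero cells enter the error. The small worked example below uses made-up numbers. The arrays are assumptions, and the metric is restated with NumPy only, under the hypothetical name `masked_rmse`, so the block runs on its own.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth is nonzero."""
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[4.0, 0.0],
                  [0.0, 2.0]])
pred = np.array([[3.0, 5.0],
                 [1.0, 4.0]])
# Rated cells: (0, 0) with error -1 and (1, 1) with error +2,
# so the result is sqrt((1 + 4) / 2).
score = masked_rmse(pred, truth)
print(round(score, 4))  # → 1.5811
```

The predictions at the two unrated cells (5.0 and 1.0) do not affect the score.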

In [100]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.13185110258
Item-based CF RMSE: 3.45769122256
