# Implementation of a Recommendation System
This notebook shows how to build a recommendation system on the MovieLens 100K dataset.

References:

https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html

https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/

http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt



In [2]:
#load & explore data
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('data/ml-100k/u.data', delimiter="\t", names=header)

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
user_id      100000 non-null int64
item_id      100000 non-null int64
rating       100000 non-null int64
timestamp    100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB


In [3]:
df.head()

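|   | user_id | item_id | rating | timestamp |
|---|---------|---------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |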


In [4]:
n_items = df['item_id'].unique().shape[0]
n_users = df['user_id'].unique().shape[0]
print(n_items, n_users)

1682 943


In [5]:
# split dataset into train and test dataset
train_data, test_data = train_test_split(df, test_size=0.25)

In [6]:
# build movie info mapper
idx_to_movie = {}
with open('data/ml-100k/u.items', 'r') as f:
    for line in f.readlines():
        info = line.split('|')
        idx_to_movie[int(info[0])-1] = info[1]

### Memory-Based Collaborative Filtering

Memory-based algorithms are easy to implement and produce reasonable prediction quality. **The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem**, which arises when a new user or a new item enters the system and has no ratings yet.

Memory-based collaborative filtering approaches fall into two main types: **user-item filtering** and **item-item filtering**.

* User-item filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. **“Users who are similar to you also liked …”**
* Item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations. **“Users who liked this item also liked …”**


In both cases, you build a user-item matrix from the dataset. Since the data has been split into training and testing sets, you need two 943 × 1682 matrices (users × items). The training matrix contains 75% of the ratings and the testing matrix contains the remaining 25%.

After you have built the user-item matrix you calculate the similarity and create a similarity matrix.

1. The first step is to create the user-item matrices, one for the training data and one for the testing data.

In [7]:
#Create two user-item matrices, one for training and another for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
    
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

In [40]:
test_data_matrix.shape

(943, 1682)
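
The loops above can also be written with NumPy fancy indexing. The following is a minimal sketch on a hypothetical toy table (`toy` and `toy_matrix` are names made up for illustration), assuming 1-based ids as in MovieLens and at most one rating per (user, item) pair:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: 3 users, 4 items, 1-based ids as in MovieLens.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "item_id": [1, 3, 2, 1, 4],
    "rating":  [5, 3, 4, 2, 1],
})

toy_matrix = np.zeros((3, 4))
# Shift ids to 0-based row/column indices and assign all ratings at once.
rows = toy["user_id"].to_numpy() - 1
cols = toy["item_id"].to_numpy() - 1
toy_matrix[rows, cols] = toy["rating"].to_numpy()

print(toy_matrix)
```

Entries that never receive a rating stay at 0, which is why the later evaluation code only looks at the non-zero entries of the test matrix.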

2. The second step is to use the [pairwise_distances](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) function from sklearn with `metric='cosine'`. Note that this returns the cosine *distance*, that is, 1 minus the cosine similarity; since the ratings are all non-negative, the output ranges from 0 to 1.

In [17]:
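from sklearn.metrics.pairwise import pairwise_distances
# NOTE: metric='cosine' yields cosine distance (1 - cosine similarity):
# 0 means identical direction, 1 means no overlap for non-negative ratings.
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')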

In [18]:
print(user_similarity.shape)
print(item_similarity.shape)

(943, 943)
(1682, 1682)
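
To make the difference between cosine distance and cosine similarity concrete, here is a small sketch on a hypothetical toy matrix (`toy_ratings`, `toy_distance` and `toy_similarity` are illustrative names, not part of the notebook). It converts the distance to a similarity by subtracting it from 1:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy user-item matrix (rows are users).
toy_ratings = np.array([
    [5.0, 0.0, 3.0],
    [5.0, 0.0, 3.0],   # identical to the first user
    [0.0, 4.0, 0.0],   # shares no rated item with the first user
])

toy_distance = pairwise_distances(toy_ratings, metric="cosine")
toy_similarity = 1.0 - toy_distance   # cosine similarity

print(np.round(toy_distance, 3))
print(np.round(toy_similarity, 3))
```

Identical users end up with distance 0 and similarity 1, while users with no rated items in common get distance 1 and similarity 0. Using the similarity form as the weight is the more conventional choice for neighbourhood-based prediction.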


3. The next step is to make predictions and evaluate them.

In [19]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Mean of each user's row (unrated items count as 0)
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns the mean vector into a column so it broadcasts against ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # Weighted average of the other users' deviations, added back to the user's own mean
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's ratings over all items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
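
Written out, the two branches of `predict` compute the following. The notation is introduced here for illustration and does not come from the original notebook: $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{r}_u$ is the mean of row $u$ over all items (unrated zeros included, as in the code), and $s$ is whichever matrix is passed as `similarity`.

User-based:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v} |s_{u,v}|}$$

Item-based:

$$\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} |s_{i,j}|}$$

In both cases the denominator normalizes by the total weight, so the prediction is a weighted average rather than a weighted sum.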

In [21]:
user_prediction = predict(train_data_matrix, user_similarity, type='user')
item_prediction = predict(train_data_matrix, item_similarity, type='item')

In [20]:
print(user_prediction.shape)
print(item_prediction.shape)

(943, 1682)
(943, 1682)


In [22]:
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    # mean_squared_error expects (y_true, y_pred)
    return sqrt(mean_squared_error(ground_truth, prediction))

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.125684445325716
Item-based CF RMSE: 3.4537942450988033
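
The `rmse` function only scores entries that actually have a rating in the test matrix. A minimal sketch of that masking step, using hypothetical toy arrays (`toy_truth` and `toy_pred` are made up for illustration):

```python
import numpy as np

# 0 means "not rated", so only two entries take part in the error.
toy_truth = np.array([[4.0, 0.0],
                      [0.0, 2.0]])
toy_pred = np.array([[3.0, 5.0],
                     [1.0, 4.0]])

mask = toy_truth.nonzero()                    # indices of the rated entries
errors = toy_pred[mask] - toy_truth[mask]     # [-1.0, 2.0]
toy_rmse = np.sqrt(np.mean(errors ** 2))      # sqrt((1 + 4) / 2)

print(toy_rmse)
```

The predictions for the unrated cells (5.0 and 1.0 here) have no effect on the score, which is the intended behaviour when zeros stand for missing ratings.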


4. The final step is a quick test: look up the movies that score highest against a given movie.

In [15]:
def top_k_movies(similarity, mapper, movie_idx, k=6):
    return [mapper[x] for x in np.argsort(similarity[movie_idx,:])[:-k-1:-1]]

In [23]:
idx = 0 # Toy Story
print("search movie: ",idx_to_movie[idx])
movies = top_k_movies(item_similarity, idx_to_movie, idx)
print("result movies", movies)

search movie:  Toy Story (1995)
result movies ['Scream of Stone (Schrei aus Stein) (1991)', 'Visitors, The (Visiteurs, Les) (1993)', 'Fear, The (1995)', 'Children of the Revolution (1996)', 'Bewegte Mann, Der (1994)', 'Farmer & Chase (1995)']


### Model-based Collaborative Filtering
Reference: http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/

Model-based Collaborative Filtering is based on matrix factorization (MF) which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. **Matrix factorization is widely used for recommender systems** where it can deal better with scalability and sparsity than Memory-based CF. 

**The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items.** When the user-item matrix is very sparse and has many dimensions, matrix factorization lets you approximate it by the product of two low-rank matrices: one whose rows are the latent vectors of the users and one whose columns are the latent vectors of the items. You fit these factors so that their product approximates the original matrix as closely as possible on the known ratings, and the product then fills in the entries that are missing in the original matrix.
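
As a small illustration of the low-rank idea, the sketch below factorizes a hypothetical toy matrix and rebuilds it from `k` latent factors. It uses `np.linalg.svd` (a full SVD that is then truncated) rather than the `svds` routine used later in this notebook, and all variable names are made up for the example:

```python
import numpy as np

# Hypothetical toy matrix of exact rank 1: every row is a multiple of [1, 2, 3].
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
])

# NumPy returns the singular values in descending order.
u, s, vt = np.linalg.svd(toy, full_matrices=False)

k = 1                                   # number of latent factors to keep
user_factors = u[:, :k] * s[:k]         # shape (n_users, k)
item_factors = vt[:k, :]                # shape (k, n_items)
toy_approx = user_factors @ item_factors

print(np.allclose(toy_approx, toy))
```

Because the toy matrix has rank 1, a single latent factor reproduces it exactly. A real rating matrix is only approximately low rank, so the reconstruction is an approximation whose quality depends on `k`.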

In [27]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print ('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


In [29]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Get the SVD components from the train matrix. Choose k (number of latent factors).
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.7158826965835567


In [38]:
test_data_matrix.shape

(943, 1682)

In [44]:
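user_id = 1 # 0-based row index into X_pred (MovieLens user_id 2)
print("search user: ",user_id)
# NOTE: np.argsort sorts ascending, so movies[:6] lists the lowest predicted scores;
# use np.argsort(X_pred[user_id])[::-1] to rank from the highest score down.
movies = [idx_to_movie[x] for x in np.argsort(X_pred[user_id])]
print("result movies", movies[:6])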

search user:  1
result movies ['Terminator 2: Judgment Day (1991)', 'Star Trek: Generations (1994)', 'This Is Spinal Tap (1984)', 'Gone with the Wind (1939)', 'Grifters, The (1990)', 'Star Trek VI: The Undiscovered Country (1991)']
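
For actual recommendations you would normally rank from the highest predicted score down and skip items the user has already rated. A minimal sketch of that step, assuming a row of predictions and the matching row of the training matrix (`top_n_unseen` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def top_n_unseen(pred_row, rated_row, n=3):
    """Hypothetical helper: indices of the n highest-scored items without a rating."""
    scores = pred_row.astype(float).copy()
    scores[rated_row > 0] = -np.inf        # push already-rated items to the bottom
    order = np.argsort(scores)[::-1]       # argsort is ascending, so reverse it
    return order[:n].tolist()

# Toy data: item 0 has the second-highest score but is already rated.
toy_pred_row = np.array([4.5, 1.0, 3.2, 4.9, 2.0])
toy_rated_row = np.array([5.0, 0.0, 0.0, 0.0, 0.0])

top_items = top_n_unseen(toy_pred_row, toy_rated_row)
print(top_items)  # [3, 2, 4]
```

With the notebook's variables this would be called as `top_n_unseen(X_pred[user_id], train_data_matrix[user_id], n=6)`, and the resulting indices mapped through `idx_to_movie`.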
