## SVD algorithm implementation

More learning resources on SVD:

* http://nicolas-hug.com/blog/matrix_facto_1
* http://nicolas-hug.com/blog/matrix_facto_2
* http://nicolas-hug.com/blog/matrix_facto_3
* http://nicolas-hug.com/blog/matrix_facto_4
* http://sifter.org/simon/journal/20061211.html

In [1]:
import pandas as pd
import numpy as np

import surprise
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
import warnings
warnings.simplefilter('ignore')

In [2]:
class MatrixFacto(surprise.AlgoBase):
    '''A basic rating prediction algorithm based on matrix factorization.'''
    
    def __init__(self, learning_rate, n_epochs, n_factors):
        
        surprise.AlgoBase.__init__(self)  # let the base class initialize itself first
        
        self.lr = learning_rate  # learning rate for SGD
        self.n_epochs = n_epochs  # number of iterations of SGD
        self.n_factors = n_factors  # number of factors
        
    def fit(self, trainset):
        '''Learn the vectors p_u and q_i with SGD'''
        
        print('Fitting data with SGD...')
        
        # Randomly initialize the user and item factors.
        p = np.random.normal(0, .1, (trainset.n_users, self.n_factors))
        q = np.random.normal(0, .1, (trainset.n_items, self.n_factors))
        
        # SGD procedure
        for _ in range(self.n_epochs):
            for u, i, r_ui in trainset.all_ratings():
                err = r_ui - np.dot(p[u], q[i])
                # Update vectors p_u and q_i
                p[u] += self.lr * err * q[i]
                q[i] += self.lr * err * p[u]
                # Note: in the update of q_i, we should actually use the previous (non-updated) value of p_u.
                # In practice it makes almost no difference.
        
        self.p, self.q = p, q
        self.trainset = trainset
        
        return self  # Surprise convention: fit() returns the fitted algorithm

    def estimate(self, u, i):
        '''Return the estimated rating of user u for item i.'''
        
        # return scalar product between p_u and q_i if user and item are known,
        # else return the average of all ratings
        if self.trainset.knows_user(u) and self.trainset.knows_item(i):
            return np.dot(self.p[u], self.q[i])
        else:
            return self.trainset.global_mean
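
The note inside `fit` says the update of q_i should really use the previous value of p_u. The following is a minimal NumPy-only sketch of that variant, not part of the original notebook. The helper names `sgd_epoch` and `squared_error` and the toy ratings are assumptions made for illustration; the only change from the loop above is that p_u is copied before it is updated.

```python
import numpy as np


def sgd_epoch(p, q, ratings, lr):
    """Run one SGD pass, updating p_u and q_i from their pre-update values."""
    for u, i, r_ui in ratings:
        err = r_ui - np.dot(p[u], q[i])
        p_u_old = p[u].copy()  # keep the previous p_u for the q_i update
        p[u] += lr * err * q[i]
        q[i] += lr * err * p_u_old


def squared_error(p, q, ratings):
    """Sum of squared errors over the observed ratings."""
    return sum((r - np.dot(p[u], q[i])) ** 2 for u, i, r in ratings)


# Toy data (hypothetical): (user index, item index, rating) triples.
ratings = [(0, 0, 3.0), (1, 0, 2.0), (2, 0, 4.0), (3, 1, 3.0), (4, 1, 1.0)]

rng = np.random.default_rng(0)
p = rng.normal(0, 0.1, (5, 10))  # user factors
q = rng.normal(0, 0.1, (2, 10))  # item factors

error_before = squared_error(p, q, ratings)
for _ in range(30):
    sgd_epoch(p, q, ratings, lr=0.01)
error_after = squared_error(p, q, ratings)

print(error_before, error_after)
```

With a small learning rate the two variants should behave almost identically, which is the point of the note; the copy only matters when the step size is large enough for p_u to change noticeably within one update.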

In [3]:
# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

In [4]:
df

| | itemID | rating | userID |
|---|---|---|---|
| 0 | 1 | 3 | 9 |
| 1 | 1 | 2 | 32 |
| 2 | 1 | 4 | 2 |
| 3 | 2 | 3 | 45 |
| 4 | 2 | 1 | user_foo |


In [5]:
reader = Reader(rating_scale=(1, 5))

In [6]:
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

In [7]:
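# Note: Dataset.split() and surprise.evaluate() belong to older Surprise releases;
# later releases use surprise.model_selection.cross_validate instead.
data.split(2)  # split data for 2-fold cross-validation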

In [8]:
algo = MatrixFacto(learning_rate=.01, n_epochs=30, n_factors=10)

In [9]:
surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm MatrixFacto.

------------
Fold 1
Fitting data with SGD...
RMSE: 1.7078
------------
Fold 2
Fitting data with SGD...
RMSE: 1.5811
------------
------------
Mean RMSE: 1.6445
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [1.707825127659933, 1.5811388300841898]})
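
To make the reported numbers concrete, here is a small sketch of how an RMSE is computed and how the per-fold values combine into the mean shown above. The `rmse` helper and the two-prediction example are assumptions for illustration; the fold values are copied from the output.

```python
import numpy as np


def rmse(predictions, targets):
    """Root-mean-square error between predicted and true ratings."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


# Per-fold RMSE values copied from the output above.
fold_rmse = [1.707825127659933, 1.5811388300841898]
mean_rmse = float(np.mean(fold_rmse))
print(f"Mean RMSE: {mean_rmse:.4f}")  # Mean RMSE: 1.6445

# Hypothetical example: two predictions that miss by 1 and 2 rating points.
example = rmse([2.0, 3.0], [3.0, 1.0])
print(f"Example RMSE: {example:.4f}")  # Example RMSE: 1.5811
```

With only five ratings split into two folds, each test fold holds two or three ratings, so these RMSE values say very little about the algorithm itself.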

In [12]:
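# Note: estimate() expects Surprise's inner ids, not raw ids. Neither 9 nor 2 is a
# known inner user/item id pair here, so the call falls through to the global-mean branch.
print("Rating for ItemID 2 by user 9, who hasn't rated that item: ")
print(algo.estimate(9, 2))

Rating for ItemID 2 by user 9, who hasn't rated that item: 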
2.0


In [13]:
# Note: 'user_foo' is a raw id, not an inner id, so estimate() again returns the global mean.
print("Rating for ItemID 1 by user_foo, who hasn't rated that item: ")
print(algo.estimate('user_foo', 1))

Rating for ItemID 1 by user_foo, who hasn't rated that item: 
2.0
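
Both calls return 2.0, which is consistent with `estimate` falling back to the global mean. Surprise distinguishes raw ids (the values in the dataframe) from inner ids (consecutive integers used inside a trainset); `estimate` works on inner ids, while `predict` accepts raw ids and `trainset.to_inner_uid` / `trainset.to_inner_iid` convert between the two. The plain-Python sketch below is an illustration of that distinction, not Surprise's own code; the `raw_to_inner` mapping and the `knows_user` stand-in are hypothetical and assume inner ids are assigned in order of first appearance.

```python
# Plain-Python illustration (not Surprise's own code) of raw vs. inner ids.
raw_user_ids = [9, 32, 2, 45, 'user_foo']

# Assumption: inner ids are consecutive integers in order of first appearance.
raw_to_inner = {}
for raw_id in raw_user_ids:
    raw_to_inner.setdefault(raw_id, len(raw_to_inner))


def knows_user(inner_id):
    """Stand-in for a membership check that works on inner ids only."""
    return inner_id in raw_to_inner.values()


print(raw_to_inner)                 # {9: 0, 32: 1, 2: 2, 45: 3, 'user_foo': 4}
print(knows_user(9))                # False: 9 is a raw id, not an inner id
print(knows_user(raw_to_inner[9]))  # True: inner id 0 is known
```

In the notebook itself, passing the raw ids to `algo.predict` instead of `algo.estimate` would be the usual way to query a fitted Surprise algorithm.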


### For ml-100k movie dataset

In [14]:
# Data loading. We'll use the MovieLens 100k dataset (https://grouplens.org/datasets/movielens/100k/).
# It will be downloaded automatically.
data = surprise.Dataset.load_builtin('ml-100k')
print(data)
data.split(2)  # split data for 2-fold cross-validation

<surprise.dataset.DatasetAutoFolds object at 0x7f3480fe8390>


In [15]:
algo = MatrixFacto(learning_rate=.01, n_epochs=10, n_factors=10)

### RMSE for our implementation

In [16]:
surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm MatrixFacto.

------------
Fold 1
Fitting data with SGD...
RMSE: 0.9801
------------
Fold 2
Fitting data with SGD...
RMSE: 0.9811
------------
------------
Mean RMSE: 0.9806
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.9801449146986095, 0.9811367561046873]})

### RMSE for Surprise's SVD implementation

In [17]:
# try a more sophisticated matrix factorization algorithm (on the same data)
algo = surprise.SVD()
surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.9551
------------
Fold 2
RMSE: 0.9565
------------
------------
Mean RMSE: 0.9558
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.9551023201138982, 0.9564721409716986]})
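
One reason `surprise.SVD` can score better than `MatrixFacto` is that, per the Surprise documentation, it predicts mu + b_u + b_i + q_i · p_u (global mean plus user and item biases) and trains with regularized SGD. The sketch below is a NumPy-only illustration of that idea, not Surprise's implementation; the function names, toy ratings, and hyperparameter values are assumptions.

```python
import numpy as np


def fit_biased_mf(ratings, n_users, n_items, n_factors=10,
                  n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Fit mu + b_u + b_i + p_u . q_i with regularized SGD (illustrative)."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean rating
    b_u = np.zeros(n_users)  # user biases
    b_i = np.zeros(n_items)  # item biases
    p = rng.normal(0, 0.1, (n_users, n_factors))
    q = rng.normal(0, 0.1, (n_items, n_factors))

    for _ in range(n_epochs):
        for u, i, r_ui in ratings:
            err = r_ui - (mu + b_u[u] + b_i[i] + np.dot(p[u], q[i]))
            # Each step follows the error and shrinks the parameter toward zero.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])
    return mu, b_u, b_i, p, q


def predict(model, u, i):
    """Predicted rating for user index u and item index i."""
    mu, b_u, b_i, p, q = model
    return mu + b_u[u] + b_i[i] + np.dot(p[u], q[i])


# Toy data (hypothetical): (user index, item index, rating) triples.
ratings = [(0, 0, 3.0), (1, 0, 2.0), (2, 0, 4.0), (3, 1, 3.0), (4, 1, 1.0)]
model = fit_biased_mf(ratings, n_users=5, n_items=2)

mu = model[0]
sse_baseline = sum((r - mu) ** 2 for _, _, r in ratings)
sse_model = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
print(sse_baseline, sse_model)
```

Starting from the global mean means the model is already a reasonable predictor before any factor has been learned, and the `reg` term keeps biases and factors small for users or items with few ratings.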