Welcome to the Surprise Package Practice Notebook.
Today we will learn some basic functions of this package.
More details about the package and its uses can be found at the following link: http://surpriselib.com/

In [1]:
# First, we need to install this package. 

In [2]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1617556 sha256=1d9c8e1597755d5606d0c4201d4e840d349bc304c5d2575f873b303792c6f0b7
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In today's practice, we will learn basic capabilities, such as loading data through the package and splitting it into a training set and a test set.

Moreover, we will use SVD, one of the major collaborative filtering algorithms, to generate recommendations.


**For now**, you are not expected to be familiar with the SVD algorithm, and that is fine. You will learn about SVD in detail in the upcoming weeks.

In [3]:
import pandas as pd
from surprise import SVD
from surprise import Dataset, Reader, AlgoBase
from surprise import accuracy
from surprise.model_selection import train_test_split


In [4]:
# Load the MovieLens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')
# Split the data into training and test sets (25% of the ratings go to the test set).
trainset, testset = train_test_split(data, test_size=.25, shuffle=True)

algo = SVD()
# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)





Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


# Surprise enables us to evaluate our model performance. Here we use two common performance measures, which we will cover later in this course.

In the meantime, you can read this blog post: [Evaluating Recommender Systems: Root Mean Squared Error or Mean Absolute Error](https://towardsdatascience.com/evaluating-recommender-systems-root-means-squared-error-or-mean-absolute-error-1744abc2beac#:~:text=Recommender%20System%20accuracy%20is%20popularly,scale%20as%20the%20original%20ratings.)

In [5]:
# Compute the Root Mean Squared Error (RMSE) 
accuracy.rmse(predictions)

RMSE: 0.9427


0.9426563425326393

In [6]:
# Compute the Mean Absolute Error (MAE)
accuracy.mae(predictions)


MAE:  0.7436


0.7436442754469157
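
To make the two measures concrete, here is a minimal sketch that computes them by hand. The rating pairs are made up for illustration and are not taken from MovieLens, and the helper functions are not part of Surprise; they only mirror the standard definitions of RMSE and MAE.

```python
import math

# Hypothetical (true rating, predicted rating) pairs, made up for illustration.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]


def rmse(pairs):
    """Square root of the mean squared difference between true and predicted ratings."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean of the absolute differences between true and predicted ratings."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


print(f"RMSE: {rmse(pairs):.4f}")  # RMSE: 0.6124
print(f"MAE:  {mae(pairs):.4f}")   # MAE:  0.5000
```

Because RMSE squares each error before averaging, a single large error (here the prediction that is off by 1.0) pulls RMSE above MAE. Both measures are on the same scale as the ratings, and lower is better.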

# We can also perform cross-validation with Surprise as follows:

In [15]:
from surprise.model_selection import KFold
kf = KFold(n_splits=5)

algo = SVD()  

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.9304
RMSE: 0.9352
RMSE: 0.9441
RMSE: 0.9339
RMSE: 0.9332
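
The idea behind the loop above can be sketched in plain Python. This is an illustration of K-fold splitting in general, not Surprise's actual implementation; the function name `k_fold_indices` and the seed are assumptions made for this example. It also averages the five RMSE values printed above, which is the usual way to summarize a cross-validation run.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index, starting at `fold`, forms the test fold.
        test_idx = indices[fold::n_splits]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, test_idx in folds:
    print("test:", sorted(test_idx), "train size:", len(train_idx))

# Per-fold RMSE values copied from the output above.
fold_rmse = [0.9304, 0.9352, 0.9441, 0.9339, 0.9332]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(f"Mean RMSE: {mean_rmse:.4f}")  # Mean RMSE: 0.9354
```

Surprise also provides `cross_validate` in `surprise.model_selection`, which runs this kind of loop and reports the chosen measures per fold.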


# Tune algorithm parameters with GridSearchCV

In [16]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [3, 5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)
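
To see how much work the grid search above implies, the following sketch expands the same `param_grid` into its individual combinations using only the standard library. It illustrates what an exhaustive grid search enumerates; it does not reproduce how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {'n_epochs': [3, 5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Cartesian product of the value lists: one dict per candidate configuration.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

cv = 3
print(len(combinations))       # 12 parameter combinations
print(len(combinations) * cv)  # 36 model fits with 3-fold cross-validation
print(combinations[0])         # {'n_epochs': 3, 'lr_all': 0.002, 'reg_all': 0.4}
```

After `gs.fit(data)` has finished, the best score and the parameters that produced it can be read per measure, for example `gs.best_score['rmse']` and `gs.best_params['rmse']`.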

# Dataset handling

Surprise works with data wrapped in a Dataset object, so we need to convert our DataFrame into a Dataset object.


How to convert a DataFrame to a Dataset object within the Surprise package:

In [9]:
# First, we will create a DataFrame
df = pd.DataFrame({'uid': ['u1', 'u2', 'u1', 'u3'],
                   'iid': ['i1', 'i2', 'i3', 'i2'],
                   'r_ui': [3, 4, 2, 3]})
df.head()

Unnamed: 0,uid,iid,r_ui
0,u1,i1,3
1,u2,i2,4
2,u1,i3,2
3,u3,i2,3


In [10]:
# Define a Reader and explicitly declare the rating scale, then use it to
# convert the DataFrame into a Dataset object.
# The columns must be in this order: user id, item id, rating.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['uid', 'iid', 'r_ui']], reader)
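
Since the rating scale is declared by hand, it can be worth checking that the data actually respects it before loading. The sketch below is one possible sanity check in plain Python; the helper `out_of_scale` is hypothetical and not part of Surprise, and the extra row with rating 7 is made up to show a failing case.

```python
# Rows in the (user id, item id, rating) order used by the DataFrame above.
rows = [('u1', 'i1', 3), ('u2', 'i2', 4), ('u1', 'i3', 2), ('u3', 'i2', 3)]
rating_scale = (1, 5)


def out_of_scale(rows, rating_scale):
    """Return the rows whose rating falls outside the declared scale."""
    low, high = rating_scale
    return [row for row in rows if not low <= row[2] <= high]


print(out_of_scale(rows, rating_scale))                       # []
print(out_of_scale(rows + [('u4', 'i1', 7)], rating_scale))   # [('u4', 'i1', 7)]
```

The scale matters beyond loading: by default, Surprise's `predict` clips estimated ratings to this range (`clip=True`), which is why predicted ratings never exceed the upper bound of the scale.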


In [11]:
data

<surprise.dataset.DatasetAutoFolds at 0x7f8f9ccff490>

# Top-N recommendation

We use the MovieLens-100k dataset.

We first train an SVD algorithm, and then predict the ratings for all (user, item) pairs that are not in the training set.

Then, for each user, we retrieve the N items with the highest predicted ratings (the function defaults to N = 10; the call in the cell that follows uses N = 3).
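
The two steps described above can be sketched on a tiny example without any library. The ratings are made up, the helper names `anti_pairs` and `top_n_items` are hypothetical, and the scores are placeholder numbers standing in for the output of a trained model.

```python
from collections import defaultdict

# Hypothetical known ratings as (user, item, rating); not MovieLens data.
ratings = [('u1', 'i1', 3), ('u2', 'i2', 4), ('u1', 'i3', 2), ('u3', 'i2', 3)]


def anti_pairs(ratings):
    """Return every (user, item) pair that has no known rating."""
    rated = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    return [(u, i) for u in users for i in items if (u, i) not in rated]


def top_n_items(scored, n):
    """Group (user, item, score) triples by user and keep the n best per user."""
    per_user = defaultdict(list)
    for u, i, score in scored:
        per_user[u].append((i, score))
    return {u: sorted(items, key=lambda p: p[1], reverse=True)[:n]
            for u, items in per_user.items()}


pairs = anti_pairs(ratings)
print(len(pairs))  # 3 users x 3 items - 4 known ratings = 5 unknown pairs

# Placeholder scores: a made-up number per pair in place of a trained model.
scored = [(u, i, 1.0 + idx * 0.5) for idx, (u, i) in enumerate(pairs)]
print(top_n_items(scored, n=1))
```

In Surprise, `trainset.build_anti_testset()` plays the role of `anti_pairs`; it also attaches a fill value to each pair (the global mean rating unless another value is given), because the true rating is unknown.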


In [12]:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=3)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, user_ratings)

196 [('318', 4.613439876116854), ('408', 4.5262117802168715), ('603', 4.514585888667098)]
186 [('480', 4.6779433638829335), ('64', 4.668510947686593), ('174', 4.648009518956394)]
22 [('12', 4.892784429537585), ('114', 4.877815233525465), ('8', 4.5957869058527)]
244 [('474', 5), ('408', 5), ('127', 5)]
166 [('64', 5), ('511', 5), ('174', 4.915109456125072)]
298 [('313', 4.939405313035719), ('12', 4.658499193286009), ('166', 4.613807941875697)]
115 [('480', 5), ('135', 5), ('483', 5)]
253 [('172', 4.926069925853049), ('191', 4.823262744560086), ('174', 4.761257036992227)]
305 [('603', 4.296783927413712), ('513', 4.28420996158569), ('657', 4.234071171271627)]
6 [('606', 4.398155608036407), ('654', 4.38408890513901), ('603', 4.335710921222474)]
62 [('169', 4.3840128488169965), ('513', 4.2545620950390335), ('408', 4.2507068285704905)]
286 [('515', 5), ('513', 5), ('520', 4.9904278773372095)]
200 [('178', 5), ('64', 5), ('199', 5)]
210 [('427', 4.9878926453885), ('408', 4.984730302055901), (

The Surprise documentation can be found at the following link: [documentation](https://surprise.readthedocs.io/en/stable/getting_started.html)

Additionally, you can implement your own custom algorithm. Check the following link for more details: [Build a custom algorithm](https://surprise.readthedocs.io/en/stable/building_custom_algo.html)