# Imports and Installs

In [1]:
# Surprise doesn't come by default with our Anaconda installations
# http://surpriselib.com/
# The package is published on PyPI as scikit-surprise
!pip install scikit-surprise



In [2]:
from datetime import datetime

def now():
    return str(datetime.now())

In [117]:
print(now())

import pandas as pd
import numpy as np

from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise import Dataset
from surprise import Reader
from surprise import accuracy

print(now())

2018-04-23 14:13:33.074000
2018-04-23 14:13:33.074000


# Loading In Data (V2)

As requested by Yerania, I'm using a specific subset to train/test on, for consistency across all computers. The specific request was to use user IDs <= 6000, and a 20% test set.

Here I route the data through pandas so I can adjust the "-1" ratings. In this dataset, -1 means "User has seen this item but declined to rate it". I chose to interpret this as "User mildly likes item": even if they outright hated it, they had enough interest in the item to 1) watch it and 2) mark it as watched (MyAnimeList does not automatically flag items you've seen; it's not Netflix).

Thus I changed all -1s to 5s.
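A minimal sketch of that replacement on a toy frame, to show what it does before it runs on the real data. The `toy` DataFrame and its values are made up for illustration; only the column names and the `replace` call match the notebook.

```python
import pandas as pd

# Hypothetical toy ratings using the same column layout as rating.csv
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [20, 24, 20, 79],
    "rating": [-1, 8, -1, 10],
})

# Count the "watched but not rated" rows before overwriting them
n_unrated = int((toy["rating"] == -1).sum())

# Same call as in the notebook: every -1 becomes a 5
toy["rating"] = toy["rating"].replace(to_replace=-1, value=5)

print(n_unrated)
print(toy["rating"].tolist())
```

After the replacement every rating falls inside the 1 to 10 range, which is the scale the `Reader` is configured with further down.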

In [4]:
print(now())
ratingDF = pd.read_csv('rating.csv')
print(now())

2018-04-20 22:21:16.924000
2018-04-20 22:21:19.094000


In [5]:
ratingDF["rating"] = ratingDF["rating"].replace(to_replace = -1, value = 5)

In [6]:
ratingDF.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,5
1,1,24,5
2,1,79,5
3,1,226,5
4,1,241,5


In [7]:
ratingSubset = ratingDF[ratingDF.user_id <= 6000]

In [8]:
del ratingDF
# Gotta save that RAM

In [9]:
# Reader class: http://surprise.readthedocs.io/en/stable/reader.html#surprise.reader.Reader
# Using custom datasets: http://surprise.readthedocs.io/en/stable/getting_started.html#load-custom

# Directly from file:
# reader = Reader(line_format='user item rating', sep=',', rating_scale=(1,10), skip_lines=1)
# ratingSubset = Dataset.load_from_file("ratingSubset.csv", reader=reader)

# From Pandas DataFrame:
reader = Reader(rating_scale=(1, 10))
ratingSubset = Dataset.load_from_df(ratingSubset[['user_id', 'anime_id', 'rating']], reader)

# Training and Sample Predictions

Largely following this guide: http://surprise.readthedocs.io/en/stable/getting_started.html

## Training

In [34]:
trainingRatingSubset, testRatingSubset = train_test_split(ratingSubset, test_size=.2)

In [144]:
# http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

algo = SVD(n_factors = 50, reg_all = 0.05)

In [145]:
print(now())
algo.fit(trainingRatingSubset)
print(now())

2018-04-23 15:41:35.777000
2018-04-23 15:41:53.607000


## Generating Predictions

This is done through the predict() method (explained on the getting started page).

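We can specify any user_id and anime_id we want, as long as both exist in the training set. For a nonexistent anime or user, SVD still returns a number, but it falls back to the global mean plus whatever bias terms it knows, so the estimate is not personalized.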

For end-user demo purposes, we would need to:
* Collect the user's ratings.
* Add them as new rows of the dataset (imagine adding new rows to rating.csv).
* Run the same process again (go through the pandas DataFrame, then the Surprise dataset type, then training), so the model can treat them as an "existing" user.
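The first two steps could look roughly like the sketch below. It only covers the pandas side; the toy `ratings` frame, the `demo_answers` dict, and the "max ID plus one" scheme for picking a new user ID are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the existing ratings table
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [20, 24, 20],
    "rating": [5, 8, 7],
})

# Give the demo user an ID that is not already taken
new_user_id = int(ratings["user_id"].max()) + 1

# Hypothetical answers collected from the demo user: anime_id -> rating
demo_answers = {24: 9, 79: 6}

new_rows = pd.DataFrame({
    "user_id": new_user_id,
    "anime_id": list(demo_answers.keys()),
    "rating": list(demo_answers.values()),
})

# Append the new rows; the result is then loaded and trained on as before
ratings = pd.concat([ratings, new_rows], ignore_index=True)
print(ratings.tail(2))
```

The combined frame would then go through `Dataset.load_from_df` and a fresh `fit`, since the model only knows users that were present at training time.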

In [146]:
# We'll use the known line 373,11771,8

pred = algo.predict(373, 11771, verbose=True)
# r_ui is an optional argument: the ground truth rating, echoed in the verbose output (None here because we did not pass it)

user: 373        item: 11771      r_ui = None   est = 8.62   {u'was_impossible': False}


See this GitHub issue: https://github.com/NicolasHug/Surprise/issues/82

For datasets loaded from a DataFrame, the user_id and item_id passed to predict() should be *ints*, matching the DataFrame's column types. When the data is loaded from a file, as in most of the documentation examples, the IDs are read as strings, so predict() takes strings there.

## Evaluating Predictions

`Surprise` has RMSE, MAE, and FCP (Fraction of Concordant Pairs) in its accuracy module. Precision and recall are not built in, but they can be computed from the predictions with the workaround detailed here: http://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k
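For reference, RMSE and MAE are simple enough to compute by hand. The sketch below uses plain Python on made-up (true rating, estimated rating) pairs; it illustrates the two formulas and is not Surprise's own implementation.

```python
import math

# Hypothetical (true rating, estimated rating) pairs, for illustration only
pairs = [(8.0, 7.0), (5.0, 7.0), (10.0, 10.0)]

errors = [true_r - est for true_r, est in pairs]

# RMSE: square root of the mean squared error (penalizes large misses more)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print("RMSE: {:.4f}".format(rmse))
print("MAE:  {:.4f}".format(mae))
```

Both are in rating units, so on a 1 to 10 scale an MAE of about 0.9 means the estimates are off by a bit under one point on average.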

In [147]:
# Get the test predictions
print(now())
predictions = algo.test(testRatingSubset)
print(now())

2018-04-23 15:42:02.357000
2018-04-23 15:42:03.727000


In [148]:
accuracy.rmse(predictions)

RMSE: 1.2022


1.2021691143590241

In [149]:
accuracy.mae(predictions)

MAE:  0.8886


0.88859607243432459