# Imports and Installs

In [1]:
# Surprise doesn't come with our default Anaconda installations
# http://surpriselib.com/
!pip install scikit-surprise



In [2]:
from datetime import datetime

def now():
    return str(datetime.now())

In [3]:
print(now())

import pandas as pd
import numpy as np

from surprise import SVD
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise import Dataset
from surprise import Reader
from surprise import accuracy

print(now())

2018-04-20 02:52:32.815000
2018-04-20 02:52:34.133000


# Loading In Data

In [4]:
# Only using the first 1,000,000 lines (header included) because my computer can't handle more than that
# Ran this one already

print(now())
# The with-block closes both files even if something fails partway through
with open('rating.csv', 'r') as infile, open('ratingSubset.csv', 'w') as outfile:
    for i in range(1000000):
        outfile.write(infile.readline())
print(now())

2018-04-20 02:52:34.150000
2018-04-20 02:52:34.884000
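
A possible variation on the cell above, as a sketch rather than what was actually run: `itertools.islice` copies at most `n` lines and stops on its own at the end of the file. The function name `copy_first_lines` is made up for this example, and the code assumes Python 3.

```python
from itertools import islice

def copy_first_lines(src_path, dst_path, n):
    """Copy at most the first n lines of src_path into dst_path.

    Returns the number of lines actually copied.
    """
    copied = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        # islice stops after n lines, or earlier if the file runs out
        for line in islice(src, n):
            dst.write(line)
            copied += 1
    return copied
```

Calling `copy_first_lines('rating.csv', 'ratingSubset.csv', 1000000)` should produce the same subset as the loop above. The first line copied is the CSV header, so the subset holds one fewer rating than the line count.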


Here I route the data through Pandas so I can adjust the "-1" ratings. In the dataset, -1 means "user has seen this item but declined to rate it". I choose to interpret this as "user mildly likes the item": even if they outright hated it, they had enough interest to 1) watch it and 2) mark it as watched (MyAnimeList is not Netflix; it does not automatically flag items you've seen).

So I change all -1s to 5s.

In [5]:
print(now())

ratingSubset = pd.read_csv('ratingSubset.csv')

print(now())

2018-04-20 02:52:34.901000
2018-04-20 02:52:35.339000


In [6]:
ratingSubset["rating"] = ratingSubset["rating"].replace(to_replace=-1, value=5)

In [7]:
ratingSubset.head()

| | user_id | anime_id | rating |
|---|---|---|---|
| 0 | 1 | 20 | 5 |
| 1 | 1 | 24 | 5 |
| 2 | 1 | 79 | 5 |
| 3 | 1 | 226 | 5 |
| 4 | 1 | 241 | 5 |


In [8]:
# Reader class: http://surprise.readthedocs.io/en/stable/reader.html#surprise.reader.Reader
# Using custom datasets: http://surprise.readthedocs.io/en/stable/getting_started.html#load-custom

# Directly from file:
# reader = Reader(line_format='user item rating', sep=',', rating_scale=(1,10), skip_lines=1)
# ratingSubset = Dataset.load_from_file("ratingSubset.csv", reader=reader)

# From Pandas DataFrame:
reader = Reader(rating_scale=(1, 10))
ratingSubset = Dataset.load_from_df(ratingSubset[['user_id', 'anime_id', 'rating']], reader)

# Training and Sample Predictions

Largely following this guide: http://surprise.readthedocs.io/en/stable/getting_started.html

## Training

In [9]:
# http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

algo = SVD()

In [10]:
trainingRatingSubset, testRatingSubset = train_test_split(ratingSubset, test_size=.25)

In [11]:
print(now())
algo.fit(trainingRatingSubset)
print(now())

2018-04-20 02:52:39.933000
2018-04-20 02:53:31.464000


## Generating Predictions

This is done through the `predict()` method (explained on the getting started page).

We can pass any user_id and anime_id we want, as long as the user and the anime both exist in the data the model was trained on.

For an end-user demo, we would need to:
* Collect the user's data.
* Add the user's data as new rows of the dataset (imagine adding new rows to rating.csv).
* Run the same process again (through a pandas DataFrame, then the Surprise dataset type) and retrain, at which point we can treat them as an "existing" user.
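
The second step above could be sketched like this, using only the standard library. This is an illustration and not something I ran as part of this notebook: the helper name `append_user_ratings` and the user id in the usage note are made up, the code assumes Python 3, and it assumes the file already has the `user_id,anime_id,rating` header and ends with a newline.

```python
import csv

def append_user_ratings(path, user_id, ratings):
    """Append (anime_id, rating) pairs for one user to a ratings CSV.

    Assumes the file uses the column order user_id,anime_id,rating
    and that its last line ends with a newline.
    """
    with open(path, 'a', newline='') as f:
        # Use plain '\n' line endings to match a typical Unix-style CSV
        writer = csv.writer(f, lineterminator='\n')
        for anime_id, rating in ratings:
            writer.writerow([user_id, anime_id, rating])
```

For example, `append_user_ratings('ratingSubset.csv', 99999, [(20, 8), (24, 6)])` would add two rows for a hypothetical new user 99999. After that, the file goes back through `pd.read_csv`, `Dataset.load_from_df`, and `algo.fit` exactly as in the cells above.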

In [20]:
# We'll use the known line 373,11771,8

pred = algo.predict(373, 11711, verbose=True)
# r_ui is an optional parameter: the ground-truth rating, shown in the output when provided

user: 373        item: 11711      r_ui = None   est = 7.22   {u'was_impossible': False}


With a known rating of 8 and an estimate of 7.22, this prediction isn't too far off, which is nice.

In [19]:
# Some other sample explorations
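print("Ground Truth\tPrediction")

# Each line of the file is user_id,anime_id,rating
with open('ratingUser373.txt', 'r') as user373File:
    for line in user373File:
        tokens = line.strip().split(',')
        groundTruth = tokens[2]
        # Raw ids must be ints here because the dataset was loaded from a DataFrame
        pred = algo.predict(int(tokens[0]), int(tokens[1]))
        print(str(groundTruth) + "\t\t" + str(pred.est))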

Ground Truth	Prediction
5		5.91835129974
7		7.37951393223
4		4.95341453656
10		9.02367430374
9		8.11400567526
8		7.23721589118
9		6.67822975881
5		4.94946704532
8		6.91983667948
5		6.06202068973
5		5.39414553948
6		5.93556696197
5		4.96869690054
5		5.9022119847
10		9.60543417586
10		9.73945431572
9		8.88943131026
5		8.19220991895
4		4.20676266488
5		5.20111359504
5		4.81230676005
5		5.30100514797
5		4.85213142507
5		5.43448808955
5		5.24296069064
5		5.22980539764
7		7.23134606123
5		4.9076171021
9		8.23521522871
5		5.4240146539
7		6.91925388598
9		8.78546270206
7		6.28071659818
5		5.36567129499
10		9.6447339042
10		9.820532562
6		6.78514761793
10		8.10980266043
5		5.56931371011
5		5.00430013889
5		5.06539129596
5		5.76327393225
5		6.55546334652
5		5.75732958942
10		9.93691866859
10		9.41105663594
5		5.54555436599
5		7.33733636036
5		5.89108841876
6		6.38119046324
5		4.73880522122
9		7.78104783632
5		5.5316738541
8		6.8501602602
10		10
10		8.53038661367
7		8.03376021006
8		8.02680472815

~~...but all of the predicted ratings are the same.~~

~~I believe this to be a problem with the SVD approach itself or with the data, not necessarily with how I coded it--something about how it just uses the default mean and the user factors aren't there or something. I remember this being a recurring issue in previous implementations though.~~

EDIT: See this GitHub issue: https://github.com/NicolasHug/Surprise/issues/82

Turns out that when the data is loaded from a DataFrame, predict() must be given the user_id and item_id with the same type as the DataFrame columns, which here is *int*. When the data is loaded from a file (the case most of the documentation covers), the raw ids are strings.
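
To make the symptom concrete, here is a toy model of the lookup. This is NOT Surprise's actual code: the global mean and the bias values are invented, and the latent factors are left out entirely. As far as I understand it, the default (biased) SVD starts from the global mean and only adds the user and item terms for ids it knows, so an id of the wrong type is simply treated as unknown.

```python
# Toy model of a raw-id lookup; all numbers are invented for illustration.
GLOBAL_MEAN = 7.0

# Raw ids stored as ints, as they would be when loaded from a DataFrame
# with integer columns.
user_bias = {373: 0.5}
item_bias = {11711: -0.25}

def toy_predict(uid, iid):
    """Global mean plus whichever biases are known for the given raw ids."""
    est = GLOBAL_MEAN
    est += user_bias.get(uid, 0.0)   # unknown user contributes nothing
    est += item_bias.get(iid, 0.0)   # unknown item contributes nothing
    return est

int_ids = toy_predict(373, 11711)      # ids found: biases are applied
str_ids = toy_predict("373", "11711")  # "373" != 373: falls back to the mean

print(int_ids)
print(str_ids)
```

With string ids, every call lands on the same fallback value, which matches the "all of the predicted ratings are the same" behaviour I was seeing before the fix.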

## Evaluating Predictions

`Surprise` has RMSE, MAE, and FCP (Fraction of Concordant Pairs) in its accuracy toolbox. Precision@k and recall@k are not built in, but they can be computed with the extra code detailed here: http://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k
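
For reference, RMSE is the square root of the mean squared error and MAE is the mean absolute error. The sketch below computes both by hand on the first four rows of the user 373 table above, only to show the arithmetic. It is not a real evaluation (those rows may well have been in the training split), and the function names are my own, not part of Surprise.

```python
import math

# (ground truth, prediction) pairs copied from the first rows of the
# user 373 table
pairs = [
    (5, 5.91835129974),
    (7, 7.37951393223),
    (4, 4.95341453656),
    (10, 9.02367430374),
]

def mae(pairs):
    """Mean absolute error: average of |truth - estimate|."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error: sqrt of the average of (truth - estimate)^2."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

print("MAE: ", mae(pairs))
print("RMSE:", rmse(pairs))
```

RMSE squares each error before averaging, so it penalises a few large misses more heavily than MAE does; it is never smaller than MAE on the same predictions.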

In [14]:
# Get the test predictions
print(now())
predictions = algo.test(testRatingSubset)
print(now())

2018-04-20 02:53:31.600000
2018-04-20 02:53:34.828000


In [15]:
accuracy.rmse(predictions)

RMSE: 1.2476


1.2476084609411571