# Using `surprise`

See the [Getting Started guide](https://surprise.readthedocs.io/en/stable/getting_started.html) in the `surprise` documentation.

In [2]:
import surprise
from surprise.prediction_algorithms import KNNBasic, SVD, NMF  # the three models used below
import pandas as pd
import numpy as np
import datetime as dt

## Agenda

Students will be able to:

- use the `surprise` package to build recommendation engines.

In [3]:
data = surprise.Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/jacobserfaty/.surprise_data/ml-100k


`surprise` saved the download under the hidden directory `~/.surprise_data`, so we can also read the raw ratings file directly with pandas:

In [4]:
# u.data is tab-separated with no header row, so we supply the column names ourselves.
df = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.data',
                 sep='\t', names=['user', 'item', 'rating', 'timestamp'])
df

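| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
| ... | ... | ... | ... | ... |
| 99995 | 880 | 476 | 3 | 880175444 |
| 99996 | 716 | 204 | 5 | 879795543 |
| 99997 | 276 | 1090 | 1 | 874795795 |
| 99998 | 13 | 225 | 2 | 882399156 |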


## Data Exploration

In [5]:
df['user'].nunique()

943

In [6]:
df['item'].nunique()

1682
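These two counts, together with the 100,000 rows in `df`, tell us how sparse the user-item matrix is. A quick sketch of that arithmetic, with the counts hard-coded from the outputs above (the variable names are only illustrative):

```python
# Counts copied from the notebook outputs above.
n_users = 943
n_items = 1682
n_ratings = 100_000

# Fraction of all possible (user, item) pairs that actually have a rating.
density = n_ratings / (n_users * n_items)
print(f"{density:.2%} of the user-item matrix is filled")
```

Only about 6% of the possible ratings are observed, which is why the models below have to estimate the rest.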

In [7]:
stats = df[['rating', 'timestamp']].describe()
stats

| | rating | timestamp |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 3.52986 | 883528900.0 |
| std | 1.125674 | 5343856.0 |
| min | 1.0 | 874724700.0 |
| 25% | 3.0 | 879448700.0 |
| 50% | 4.0 | 882826900.0 |
| 75% | 4.0 | 888260000.0 |
| max | 5.0 | 893286600.0 |


In [8]:
print(dt.datetime.fromtimestamp(stats.loc['min', 'timestamp']))
print(dt.datetime.fromtimestamp(stats.loc['max', 'timestamp']))

1997-09-19 23:05:10
1998-04-22 19:10:38
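The two dates printed above are in the local timezone of the machine that ran the cell, because `datetime.fromtimestamp` converts to local time when no `tz` argument is given. A minimal sketch of a timezone-independent conversion follows; the value of `ts` is an illustrative epoch timestamp in seconds, not a value read from `stats`:

```python
import datetime as dt

# Assumed example value: an epoch timestamp in seconds, like those in the `timestamp` column.
ts = 874724710

# Passing tz makes the result independent of the machine's local timezone.
utc_time = dt.datetime.fromtimestamp(ts, tz=dt.timezone.utc)
print(utc_time.isoformat())
```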


In [9]:
read = surprise.Reader('ml-100k')

In [10]:
read.rating_scale

(1, 5)

## Modeling

In [11]:
train, test = surprise.model_selection.train_test_split(data, random_state=42)

In [12]:
model = KNNBasic().fit(train)

Computing the msd similarity matrix...
Done computing similarity matrix.
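The log line above mentions the `msd` (mean squared difference) similarity, which is the default for `KNNBasic`. As a rough sketch of the idea, not the library's implementation: average the squared rating differences over the items two users have both rated, then map that to a similarity with `1 / (msd + 1)`. The function name and the toy ratings below are made up for illustration.

```python
def msd_similarity(ratings_u, ratings_v):
    """Similarity between two users from their {item: rating} dicts."""
    # Only items rated by both users contribute.
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0  # assumption: treat users with no overlap as not similar
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # Identical ratings give msd = 0 and similarity 1; larger gaps shrink it toward 0.
    return 1 / (msd + 1)

# Hypothetical ratings for two users; items "a", "b", "c" are shared.
u = {"a": 4, "b": 3, "c": 5, "d": 1}
v = {"a": 5, "b": 3, "c": 3, "e": 2}
sim_uv = msd_similarity(u, v)
print(sim_uv)
```

Here the squared differences on the shared items are 1, 0 and 4, so the MSD is 5/3 and the similarity is 0.375.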


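With user-based similarity (the default), `KNNBasic` predicts a rating as the similarity-weighted average of the ratings that the $k$ nearest neighbors of user $u$ gave to item $i$:

$\hat{r}_{ui} = \frac{
    \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}}
    {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$

With item-based similarity, the same average runs over the $k$ items most similar to $i$ that user $u$ has rated:

$\hat{r}_{ui} = \frac{
    \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot r_{uj}}
    {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}$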
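To make the weighted average concrete, here is a small pure-Python sketch of the user-based formula. It is an illustration under simplifying assumptions, not how `surprise` implements it: the similarities are given directly, and `knn_predict` and the toy numbers are hypothetical.

```python
def knn_predict(neighbors, k):
    """Similarity-weighted average of neighbor ratings.

    neighbors: list of (similarity, rating) pairs for users v who rated item i.
    k: maximum number of neighbors to use.
    """
    # Keep the k most similar neighbors, ignoring non-positive similarities.
    top_k = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    numerator = sum(sim * rating for sim, rating in top_k)
    denominator = sum(sim for sim, _ in top_k)
    if denominator == 0:
        return None  # no usable neighbors; a library would fall back to some default
    return numerator / denominator

# Hypothetical (similarity, rating) pairs for three neighbors.
neighbors = [(0.5, 4.0), (0.25, 2.0), (0.1, 5.0)]
prediction = knn_predict(neighbors, k=2)
print(prediction)
```

With `k=2` only the first two neighbors count, so the estimate is (0.5 * 4 + 0.25 * 2) / 0.75, about 3.33.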

In [13]:
model2 = SVD().fit(train)

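`SVD` predicts $\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$ and learns the biases and factors by minimizing the regularized squared error:

$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 +
    \lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)$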
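As a worked example of this objective, the sketch below evaluates a single term of the sum for one rating, reading the penalty as part of each term. It assumes the usual biased prediction (global mean plus user bias, item bias and the dot product of the factors); the function names and all numbers are made up for illustration.

```python
def svd_predict(mu, b_u, b_i, q_i, p_u):
    """Predicted rating: global mean + biases + dot product of the factors."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def regularized_error(r_ui, mu, b_u, b_i, q_i, p_u, lam):
    """One term of the objective: squared error plus the L2 penalty."""
    err = r_ui - svd_predict(mu, b_u, b_i, q_i, p_u)
    penalty = b_i ** 2 + b_u ** 2 + sum(q * q for q in q_i) + sum(p * p for p in p_u)
    return err ** 2 + lam * penalty

# Hypothetical parameters for one user and one item, with two latent factors.
mu, b_u, b_i = 3.5, 0.5, -0.25
q_i, p_u = [1.0, 0.5], [0.5, 1.0]

estimate = svd_predict(mu, b_u, b_i, q_i, p_u)
loss = regularized_error(5.0, mu, b_u, b_i, q_i, p_u, lam=0.02)
print(estimate, loss)
```

The estimate is 4.75, so the squared error against a true rating of 5 is 0.0625; the penalty adds 0.02 * 2.8125, giving a total of 0.11875.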

In [14]:
model3 = NMF().fit(train)

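`NMF` predicts with the dot product alone (no bias terms by default) and keeps the user and item factors non-negative:

$\hat{r}_{ui} = q_i^Tp_u$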

In [15]:
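# Nearest neighbor of inner id 51 according to KNNBasic.
# get_neighbors works with inner ids, and KNNBasic is user-based by default,
# so both 51 and the result are inner user ids, not raw movie ids.
model.get_neighbors(iid=51, k=1)

[65]

In [None]:
# Map the inner user ids back to raw ids (stored as strings) before filtering df.
raw_ids = [int(train.to_raw_uid(inner_id)) for inner_id in (51, 65)]

# Show both users' ratings side by side, item by item.
df.loc[df['user'].isin(raw_ids)].sort_values('item')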

## Evaluation

In [None]:
# get predictions
model.test(test)

In [None]:
#KNNBasic
surprise.accuracy.mae(model.test(test))

In [None]:
#KNNBasic
surprise.accuracy.rmse(model.test(test))

In [None]:
#SVD
surprise.accuracy.mae(model2.test(test))

In [None]:
#SVD
surprise.accuracy.rmse(model2.test(test))

In [None]:
#NMF
surprise.accuracy.mae(model3.test(test))

In [None]:
#NMF
surprise.accuracy.rmse(model3.test(test))
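For reference, the two metrics reported above can be computed by hand. The sketch below works on plain (true rating, estimate) pairs rather than on the prediction objects that `model.test` returns, and the numbers are made up for illustration.

```python
import math

def mae(pairs):
    """Mean absolute error over (true rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# Hypothetical (true rating, estimate) pairs.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
mae_value = mae(pairs)
rmse_value = rmse(pairs)
print(mae_value, rmse_value)
```

Here the MAE is 0.5 while the RMSE is about 0.61, because the single miss of one full star weighs more heavily once the errors are squared.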