# Recommendation Engines: Implementing Surprise
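- Surprise is a Python scikit (an add-on package for SciPy); its name stands for **Simple Python RecommendatIon System Engine**
- It has built-in similarity metrics, baseline methods, neighborhood-based (k-NN) collaborative filtering, and matrix factorization algorithms; it does not support content-based information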

In this notebook, we'll first set up a very basic recommendation system using the popular MovieLens 100K dataset. Then we'll look in more detail at how Surprise works.

## Fitting and Predicting with Surprise

### 1. Install surprise if you haven't, and import the usual libraries.

In [5]:
!pip install surprise

Collecting surprise
  Using cached https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise (from surprise)
  Using cached https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Stored in directory: /Users/Kristinabarounis/Library/Caches/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [6]:
# import libraries
import numpy as np
import pandas as pd

from surprise import Dataset, Reader
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split

### 2. Load in the dataset

Surprise ships with a loader for this dataset. The first time you run the cell you may be prompted to download it, so follow the instructions in the output. A Surprise `Dataset` is not a DataFrame, so it is less convenient to inspect (the raw rating tuples are available in `data.raw_ratings`). The dataset itself is documented here: https://grouplens.org/datasets/movielens/100k/


In [7]:
data = Dataset.load_builtin('ml-100k')

# train-test split
train, test = train_test_split(data, test_size=.2)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/Kristinabarounis/.surprise_data/ml-100k


In [8]:
# `train` is a Surprise Trainset object
# `test` is a list of tuples; each tuple is (user id, item id, rating)
train

<surprise.trainset.Trainset at 0x120aa6828>

### 3. Run the default Singular Value Decomposition Model!

In [9]:
svd = SVD()
svd.fit(train)
predictions = svd.test(test)

In [12]:
predictions[:5]

# r_ui - real ratings
# est - estimated ratings

[Prediction(uid='837', iid='276', r_ui=1.0, est=3.453779503002562, details={'was_impossible': False}),
 Prediction(uid='699', iid='473', r_ui=3.0, est=3.0781912928602733, details={'was_impossible': False}),
 Prediction(uid='276', iid='720', r_ui=2.0, est=3.1048110184629625, details={'was_impossible': False}),
 Prediction(uid='318', iid='133', r_ui=4.0, est=3.7655571312679657, details={'was_impossible': False}),
 Prediction(uid='43', iid='275', r_ui=4.0, est=4.289089302668665, details={'was_impossible': False})]

In [10]:
accuracy.rmse(predictions)

# RMSE is in rating units: on the held-out ratings, the predictions miss the true value by roughly 0.93 stars (root-mean-square)

RMSE: 0.9346


0.9345608466506067
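
To make the error metrics concrete, here is a minimal sketch that recomputes RMSE and MAE by hand for the five predictions printed earlier. The `(r_ui, est)` pairs are copied from that output with the estimates rounded to four decimals; the helper names `rmse` and `mae` are illustrative and are not Surprise's own functions.

```python
import math

# (true rating, estimated rating) pairs copied from the first five predictions,
# with the estimates rounded to four decimal places
pairs = [
    (1.0, 3.4538),
    (3.0, 3.0782),
    (2.0, 3.1048),
    (4.0, 3.7656),
    (4.0, 4.2891),
]

def rmse(pairs):
    """Root-mean-squared error: large misses are penalised more heavily."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: the average size of a miss, in stars."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

RMSE is never smaller than MAE. In this small sample, the large miss on the first prediction (a true rating of 1.0 against an estimate of about 3.45) pulls RMSE well above MAE.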

### 4. Make a prediction!

In [11]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)

# get a prediction for specific users and items.
pred = svd.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 3.91   {'was_impossible': False}


## Applying Surprise

### 1. How does Surprise take in your data?
https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset

The dataset we'll use is a subset of the Yelp Open Dataset that's already been joined and cleaned.
https://www.yelp.com/dataset

In [13]:
yelp = pd.read_csv('yelp_reviews.csv').drop(['Unnamed: 0'], axis = 1)

In [14]:
yelp.head()

                  user_id             business_id  stars
0  brd33PD_6nqK_VVnO3NWAg  --1UhMGODdWsrMastO9DZw    4.0
1  NqpKiaRsGfuU2voV5dPRCQ  --1UhMGODdWsrMastO9DZw    1.0
2  dhzlnpisqA7V1zfiO12AZA  --1UhMGODdWsrMastO9DZw    2.0
3  A4bpHuvzaQt9-XAg8e9Msw  --1UhMGODdWsrMastO9DZw    3.0
4  GL81ktDIteXA2VVH6gIakg  --1UhMGODdWsrMastO9DZw    5.0


### 2. Inspecting the dataset:

Here's where you'd do a **comprehensive** EDA!

In [15]:
print('Number of Users: ', len(yelp['user_id'].unique()))
print('Number of Businesses: ', len(yelp['business_id'].unique()))

Number of Users:  79773
Number of Businesses:  2518


1. What's the distribution of ratings? i.e. How many 1-star, 2-star, 3-star reviews?
2. How many reviews does a restaurant have?
3. How many reviews does a user make?

In [16]:
yelp['stars'].value_counts()

5.0    42685
4.0    23143
1.0    14315
3.0    11522
2.0     8335
Name: stars, dtype: int64

In [24]:
yelp['business_id'].value_counts().describe()

count    2518.000000
mean       39.714059
std       107.844814
min         3.000000
25%         5.000000
50%        10.000000
75%        29.000000
max      1694.000000
Name: business_id, dtype: float64

In [25]:
yelp['user_id'].value_counts().describe()

count    79773.000000
mean         1.253557
std          0.929012
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         50.000000
Name: user_id, dtype: float64

In [None]:
# We could reduce matrix sparsity by removing users or restaurants with fewer than a certain number of reviews
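
As a rough sketch of that idea, the snippet below measures how sparse a ratings table is and then drops users and businesses with fewer than a chosen number of reviews. The tiny `ratings` frame, the thresholds, and the helper names `density` and `filter_min_reviews` are made up for illustration; only the column names mirror the `yelp` frame.

```python
import pandas as pd

# Toy stand-in for the yelp frame (hypothetical data, same column names)
ratings = pd.DataFrame({
    "user_id":     ["u1", "u1", "u1", "u2", "u2", "u3", "u4"],
    "business_id": ["b1", "b2", "b3", "b1", "b2", "b1", "b4"],
    "stars":       [5.0, 4.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})

def density(df):
    """Fraction of the user x business utility matrix that is filled in."""
    n_users = df["user_id"].nunique()
    n_items = df["business_id"].nunique()
    return len(df) / (n_users * n_items)

def filter_min_reviews(df, min_user=2, min_business=2):
    """Keep rows whose user and business each have enough reviews.

    This is a single pass: dropping rows can push other users or businesses
    back under the threshold, so repeat the call if strict minimums matter.
    """
    user_counts = df["user_id"].map(df["user_id"].value_counts())
    business_counts = df["business_id"].map(df["business_id"].value_counts())
    return df[(user_counts >= min_user) & (business_counts >= min_business)]

dense = filter_min_reviews(ratings)
print(f"density before: {density(ratings):.2f}, after: {density(dense):.2f}")
```

For the Yelp subset, the counts above give 100,000 ratings spread over 79,773 users and 2,518 businesses, so only about 0.05% of the utility matrix is filled in.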

### 3. Reading in the dataset and prepping data

In [26]:
# Instantiate a 'Reader' to read in the data so Surprise can use it
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(yelp[['user_id', 'business_id', 'stars']], reader)

In [27]:
trainset, testset = train_test_split(data, test_size=.2)

In [28]:
trainset

<surprise.trainset.Trainset at 0x120aa6d68>

### 4. Fitting and evaluating models
Here, let's assume that we've already tuned the hyperparameters below with a grid search (for example, Surprise's `GridSearchCV`) and arrived at our final model.

In [29]:
final = SVD(n_epochs=20, n_factors=1, biased=True, 
              lr_all=0.005, reg_all=0.06)
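
For orientation, the hyperparameters above correspond to the model that the Surprise documentation describes for `SVD`. The notation is borrowed from that documentation rather than from this notebook: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$ and $q_i$ are user and item factor vectors of length `n_factors`. With `biased=True`, the predicted rating is

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

and the parameters are fit by stochastic gradient descent on the regularized squared error over the training ratings:

$$\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

Here $\lambda$ is `reg_all`, the step size of each gradient update is `lr_all`, and `n_epochs` is the number of passes over the training ratings. With `n_factors=1`, each user and each business gets a single latent number in addition to its bias. According to the same documentation, an unknown user or item is treated as having zero bias and zero factors, so its estimate falls back on the terms that are known.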

In [30]:
final.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x120b34710>

In [31]:
predictions = final.test(testset)

In [32]:
predictions[:3]

[Prediction(uid='7G_uGDjXUn1xjxbmNki1fw', iid='-eoeNEsPZ5-Fookdtr8vBw', r_ui=5.0, est=4.214254838636337, details={'was_impossible': False}),
 Prediction(uid='87MesQgdhmZfb86AocfkQQ', iid='-M9S1wlZTvv6T9EOo5X2Yw', r_ui=4.0, est=2.8425968754736055, details={'was_impossible': False}),
 Prediction(uid='KxGeqg5ccByhaZfQRI4Nnw', iid='-01XupAWZEXbdNbxNg5mEg', r_ui=4.0, est=3.2305540177524206, details={'was_impossible': False})]

In [33]:
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 1.3043
MAE:  1.0605


1.0605026921430143

### 5. Making Predictions (again)
Unfortunately, the user and business IDs in this dataset are opaque strings, so we look them up from the DataFrame instead of typing them by hand.

In [34]:
yelp['user_id'][55]

'HPtjvIrhzAUkKsiVkeT4MA'

In [35]:
yelp['business_id'][123]

'--7zmmkVg-IMGaXbuVd0SQ'

In [36]:
final.predict(yelp['user_id'][55], yelp['business_id'][13])

# r_ui is None because we didn't pass a true rating to predict(); Surprise doesn't check whether this user actually reviewed this restaurant

Prediction(uid='HPtjvIrhzAUkKsiVkeT4MA', iid='--1UhMGODdWsrMastO9DZw', r_ui=None, est=3.7791041255865845, details={'was_impossible': False})
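
To find out whether a user has actually rated a business, the ratings table can be queried directly, since `predict` only reports the `r_ui` it was given. A minimal sketch, assuming a frame with the same columns as `yelp`; the toy rows and the helper name `known_rating` are illustrative:

```python
import pandas as pd

# Toy stand-in for the yelp frame (hypothetical rows, same column names)
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b2", "b1"],
    "stars": [4.0, 1.0, 5.0],
})

def known_rating(df, user_id, business_id):
    """Return the user's rating for the business, or None if there is none."""
    match = df[(df["user_id"] == user_id) & (df["business_id"] == business_id)]
    if match.empty:
        return None
    # If a user reviewed the same business more than once, take the first row
    return float(match["stars"].iloc[0])

print(known_rating(ratings, "u1", "b2"))  # this pair is present in the toy frame
print(known_rating(ratings, "u2", "b2"))  # this pair is absent
```

The result can then be passed as `r_ui` when calling `final.predict(...)`, so that the returned prediction carries the true rating next to the estimate.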

### 6. What else?

The Surprise FAQ has sample code for getting the top **n** recommended items for each user: https://surprise.readthedocs.io/en/stable/FAQ.html
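
The idea behind that recipe can be sketched without Surprise at all. The version below groups predictions by user and keeps the `n` highest estimates for each. It uses plain `(uid, iid, r_ui, est, details)` tuples as stand-ins for Surprise's `Prediction` objects, and the sample values and the helper name `top_n_per_user` are made up for illustration.

```python
import heapq
from collections import defaultdict

# Stand-ins for Prediction objects: (uid, iid, r_ui, est, details)
sample_predictions = [
    ("u1", "b1", None, 4.2, {}),
    ("u1", "b2", None, 3.1, {}),
    ("u1", "b3", None, 4.8, {}),
    ("u2", "b1", None, 2.5, {}),
    ("u2", "b3", None, 3.9, {}),
]

def top_n_per_user(predictions, n=10):
    """Map each user id to its n highest-estimated (item id, estimate) pairs."""
    by_user = defaultdict(list)
    for uid, iid, _r_ui, est, _details in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in by_user.items()
    }

for uid, items in top_n_per_user(sample_predictions, n=2).items():
    print(uid, [iid for iid, _ in items])
```

For real recommendations, the predictions should cover user-item pairs that are absent from the training data; the Surprise FAQ builds those pairs with `trainset.build_anti_testset()`.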

# Resources

- The structure of our lesson on recommendation engines is based on Chapter 9 of **Mining of Massive Datasets**: http://infolab.stanford.edu/~ullman/mmds/book.pdf
- Libraries for coding recommendation engines: 
    - Surprise: https://surprise.readthedocs.io/en/stable/index.html
    - LightFM: https://lyst.github.io/lightfm/docs/index.html
    
    
- Some blogs I might've written:
    - Overview: https://towardsdatascience.com/a-primer-to-recommendation-engines-49bd12ed849f?source=friends_link&sk=279dfeec5187614b37431dab167fd4e3
    - Collaborative filtering: https://towardsdatascience.com/a-primer-to-recommendation-engines-49bd12ed849f?source=friends_link&sk=279dfeec5187614b37431dab167fd4e3