We will use the scikit-surprise package to build a collaborative-filtering recommender system based on explicit ratings. The model for this project is a matrix factorisation algorithm (SVD).

## Contents:
- [Installation of Packages](#Installation-of-Packages)
- [Loading of Libraries](#Loading-of-Libraries) 
- [Loading of Datasets & Preprocessing](#Loading-of-Datasets-&-Preprocessing)
- [Baseline Model - SVD Algorithm](#Baseline-Model---SVD-Algorithm)
- [Model Tuning using GridSearch](#Model-Tuning-using-GridSearch)
- [Using Best Params of GridSearch2](#Using-Best-Params-of-GridSearch2)
  - [Generating rating predictions](#Generating-rating-predictions)
  - [Additional metrics](#Additional-metrics)
  - [Generating shuffled location recommendations](#Generating-shuffled-location-recommendations)

## Installation of Packages

In [1]:
# !pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp38-cp38-macosx_10_9_x86_64.whl size=1131466 sha256=24f12c372c81fa0da05ce38d63effca63a5cf084af5a55da3311000f7c0ad144
  Stored in directory: /Users/jo/Library/Caches/pip/wheels/e0/44/15/6d6010d88d0e8e3694643a009f445df00a74c79c938e2c0dd4
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


## Loading of Libraries

In [1]:
import numpy as np
import pandas as pd

# imports from surprise
from surprise import accuracy, Dataset, Reader, SVD
from surprise.model_selection import cross_validate, GridSearchCV, KFold
from collections import defaultdict

from sklearn import preprocessing # import label encoder

import difflib # helpers for computing deltas
import random

import pickle # to save and load models

## Loading of Datasets & Preprocessing

In [309]:
# Loading of dataset
recsys_df = pd.read_csv('./instagram-dataset/recsys_df_name.csv')
print(recsys_df.shape)
recsys_df.head()

(177607, 7)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd
0,4519805.0,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB
1,259484700.0,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB
2,6364797000.0,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB
3,221389400.0,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB
4,624306600.0,857670431.0,2019-05-30 07:56:50,3,green park,"London, United Kingdom",GB


In [310]:
# LabelEncoder maps each distinct label to an integer code
label_encoder = preprocessing.LabelEncoder()

In [311]:
# Encode labels in column 'name'
recsys_df['new_location_id']= label_encoder.fit_transform(recsys_df['name'])
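
To make the encoding step concrete, here is a minimal pure-Python sketch of what label encoding does: the distinct names are sorted and each one receives its position in that order. The names below are a hypothetical toy sample, not the real dataset, and the sketch only mimics the behaviour of `LabelEncoder`.

```python
# Toy sample of location names (hypothetical, for illustration only)
toy_names = ["la famiglia", "la famiglia", "green park", "chinatown"]

# Sort the distinct labels, then map each label to its index
classes = sorted(set(toy_names))
code_of = {name: idx for idx, name in enumerate(classes)}

encoded = [code_of[name] for name in toy_names]
print(classes)
print(encoded)
```

One consequence worth keeping in mind: because the codes are derived from `name`, two different venues that share the same name would receive the same `new_location_id`.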

In [312]:
# Check new location labels
recsys_df.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,new_location_id
0,4519805.0,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,3433
1,259484700.0,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,3433
2,6364797000.0,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,3433
3,221389400.0,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,2646
4,624306600.0,857670431.0,2019-05-30 07:56:50,3,green park,"London, United Kingdom",GB,2646


In [313]:
# Save and export
recsys_df.to_csv('./instagram-dataset/recsys_df_update.csv', index=False)

## Baseline Model - SVD Algorithm

In [4]:
# Instantiate reader
reader = Reader(rating_scale=(1, 3))

In [9]:
# Instantiate dataset 
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(recsys_df[["profile_id", "new_location_id", "sentiment_pred"]], reader)

In [11]:
# Cross-validate an SVD model using three-fold cross-validation
svd = SVD(verbose=True, n_epochs=10)
cross_validate(svd, data, measures=['RMSE','MSE', 'MAE'], cv=3, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.6121  0.6111  0.6105  0.6112  0.0007  
MSE (testset)     0.3747  0.3734  0.3727  0.3736  0.0008  
MAE (testset)     0.5291  0.5284  0.5281  0.5286  0.0004  
Fit time          1.36    1.42    1.41    1.39    0.03    
Test time         0.64    0.62    0.48    0.58    0.07    


{'test_rmse': array([0.61212084, 0.61105601, 0.6104864 ]),
 'test_mse': array([0.37469193, 0.37338944, 0.37269364]),
 'test_mae': array([0.52910472, 0.52843588, 0.52810968]),
 'fit_time': (1.3552472591400146, 1.4156239032745361, 1.4137771129608154),
 'test_time': (0.6435410976409912, 0.6164758205413818, 0.4824199676513672)}
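
As a reminder of what the three reported measures mean, the sketch below computes them by hand on a few hypothetical (true rating, predicted rating) pairs. It is an illustration of the definitions only and does not use the Surprise `accuracy` module.

```python
import math

def mse(pairs):
    """Mean squared error over (true, predicted) rating pairs."""
    return sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) rating pairs."""
    return sum(abs(true_r - est) for true_r, est in pairs) / len(pairs)

# Hypothetical predictions on the 1-3 rating scale
toy_pairs = [(3, 2.5), (2, 2.5), (1, 2.0), (3, 3.0)]

toy_mse = mse(toy_pairs)
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(toy_mse, toy_rmse, toy_mae)
```

Since RMSE is the square root of MSE, the two always rank models identically, which is why the grid searches below return the same best parameters for both.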

In [28]:
# Finding training metrics rmse
trainset = data.build_full_trainset()
svd.fit(trainset)

testset = trainset.build_testset()
predictions = svd.test(testset)
accuracy.rmse(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
RMSE: 0.5226


0.522637121227586

In [29]:
# Finding training metrics mse
trainset = data.build_full_trainset()
svd.fit(trainset)

testset = trainset.build_testset()
predictions = svd.test(testset)
accuracy.mse(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
MSE: 0.2732


0.2731655672682671

In [30]:
# Finding training metrics mae
trainset = data.build_full_trainset()
svd.fit(trainset)

testset = trainset.build_testset()
predictions = svd.test(testset)
accuracy.mae(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
MAE:  0.4433


0.44333588025935755

- The performance of the baseline model is as follows:

|Metric|train rmse|test rmse|train mse|test mse|train mae|test mae|
|---|---|---|---|---|---|---|
|Baseline|0.5226|0.6112|0.2732|0.3736|0.4433|0.5286|

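- The train errors are noticeably lower than the test errors (for example, an rmse of 0.5226 on the training data versus 0.6112 in cross-validation), which suggests that the model is overfitting. Note that the train metrics come from a model fitted on the full dataset and scored on that same data.
- We will use GridSearch to look for better hyperparameters.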

## Model Tuning using GridSearch

#### GridSearch1

In [131]:
# Search over the following values of hyperparameters:
# Number of factors: 60, 80
# Number of epochs: 5, 10, 15, 20, 25
# Learning rate for all parameters: 0.002, 0.003, 0.004
# Regularization term for all parameters: 10 evenly spaced values between 0.04 and 2

param_grid = {"n_factors": [60, 80],
              "n_epochs": [5, 10, 15, 20, 25], 
              "lr_all": [0.002, 0.003, 0.004], 
              "reg_all": np.linspace(0.04, 2, 10)}

In [132]:
# Instantiate GridSearchCV using cv=5
gs = GridSearchCV(SVD, param_grid, measures=['RMSE','MSE', 'MAE'], cv=5)

In [133]:
%%time
# Fit GridSearch to training data
gs.fit(data)

CPU times: user 52min 34s, sys: 29.9 s, total: 53min 4s
Wall time: 54min
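
The long run time follows from the size of the grid. The sketch below counts the parameter combinations and model fits for GridSearch1, using a small pure-Python stand-in for `np.linspace` (the helper `linspace` is our own, written only for this illustration).

```python
from itertools import product

def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + k * step for k in range(num)]

grid1 = {
    "n_factors": [60, 80],
    "n_epochs": [5, 10, 15, 20, 25],
    "lr_all": [0.002, 0.003, 0.004],
    "reg_all": linspace(0.04, 2, 10),
}

# Every combination is cross-validated with cv=5, i.e. fitted five times
n_combinations = len(list(product(*grid1.values())))
n_fits = n_combinations * 5
print(n_combinations, n_fits)  # 300 combinations, 1500 fits
print([round(v, 4) for v in grid1["reg_all"]])
```

The `reg_all` candidates are therefore fixed, evenly spaced values, and a best value such as 0.4756 is simply the third point on that grid.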


In [134]:
# Print metric score and combination of parameters that gave the best metric score
for metric in ['rmse','mse', 'mae']:
    print(f'Test {metric}: {gs.best_score[metric]}')
    print(f'Test best params: {gs.best_params[metric]}')

Test rmse: 0.6071656479964327
Test best params: {'n_factors': 80, 'n_epochs': 25, 'lr_all': 0.004, 'reg_all': 0.4755555555555555}
Test mse: 0.3686529094191794
Test best params: {'n_factors': 80, 'n_epochs': 25, 'lr_all': 0.004, 'reg_all': 0.4755555555555555}
Test mae: 0.5170955894700094
Test best params: {'n_factors': 80, 'n_epochs': 25, 'lr_all': 0.004, 'reg_all': 0.04}


In [135]:
# Finding training metrics rmse
algo1 = gs.best_estimator['rmse']
trainset = data.build_full_trainset()
algo1.fit(trainset)

testset = trainset.build_testset()
predictions = algo1.test(testset)

accuracy.rmse(predictions, verbose=True)

RMSE: 0.5203


0.5202581078998157

In [136]:
# Finding training metrics mse
algo2 = gs.best_estimator['mse']
trainset = data.build_full_trainset()
algo2.fit(trainset)

testset = trainset.build_testset()
predictions = algo2.test(testset)

accuracy.mse(predictions, verbose=True)

MSE: 0.2706


0.27063659396251577

In [137]:
# Finding training metrics mae
algo3 = gs.best_estimator['mae']
trainset = data.build_full_trainset()
algo3.fit(trainset)

testset = trainset.build_testset()
predictions = algo3.test(testset)

accuracy.mae(predictions, verbose=True)

MAE:  0.4002


0.40018532564143827

- The results after GridSearch1 are as follows:

|Metric|train rmse|test rmse|train mse|test mse|train mae|test mae|
|---|---|---|---|---|---|---|
|Baseline|0.5226|0.6112|0.2732|0.3736|0.4433|0.5286|
|GridSearch1|0.5203|0.6072|0.2706|0.3687|0.4002|0.5171|

- The test metrics for rmse, mse and mae have all improved slightly compared to the baseline model.
- The gap between train and test narrows slightly for rmse and mse but widens for mae (0.1169 versus 0.0853 for the baseline), so the model is still overfitting.
- We will now try lower learning rates and a higher minimum regularization term to counter the overfitting problem.
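
To show where `n_factors`, `n_epochs`, `lr_all` and `reg_all` enter the model, here is a simplified pure-Python sketch of biased matrix factorisation trained by stochastic gradient descent. The update rule follows our understanding of the SVD description in the Surprise documentation, but this is an illustration on hypothetical toy ratings, not the library's implementation.

```python
import math
import random

def train_biased_mf(ratings, n_factors=2, n_epochs=100, lr_all=0.01, reg_all=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = dict.fromkeys(users, 0.0)
    bi = dict.fromkeys(items, 0.0)
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i])))
            # lr_all scales every step; reg_all pulls each parameter back towards zero
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return mu, bu, bi, p, q

def rmse(ratings, params):
    """Training RMSE of the fitted parameters on the given ratings."""
    mu, bu, bi, p, q = params
    sq = [(r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i])))) ** 2
          for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical toy ratings on the 1-3 scale
toy = [("a", "x", 3), ("a", "y", 3), ("b", "x", 1), ("b", "y", 1), ("c", "x", 3), ("c", "y", 1)]

rmse_untrained = rmse(toy, train_biased_mf(toy, n_epochs=0))
rmse_weak_reg = rmse(toy, train_biased_mf(toy, reg_all=0.02))
rmse_strong_reg = rmse(toy, train_biased_mf(toy, reg_all=2.0))
print(rmse_untrained, rmse_weak_reg, rmse_strong_reg)
```

A larger `reg_all` shrinks the biases and factors, which raises the training error. Whether it also lowers the test error depends on the data, which is what the grid search is meant to find out.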

#### GridSearch2

In [217]:
# Search over the following values of hyperparameters:
# Number of factors: not searched, so SVD uses its default of 100
# Number of epochs: 35, 40
# Learning rate for all parameters: 0.002, 0.003
# Regularization term for all parameters: 10 evenly spaced values from np.linspace(0.2, 1, 10)

param_grid2 = {"n_epochs": [35, 40], 
              "lr_all": [0.002, 0.003], 
              "reg_all": np.linspace(0.2, 1, 10)}

In [218]:
# Instantiate GridSearchCV using cv=5
gs2 = GridSearchCV(SVD, param_grid2, measures=['rmse','mse', 'mae'], cv=5)

In [219]:
%%time
# Fit GridSearch to training data
gs2.fit(data)

CPU times: user 15min 54s, sys: 8.03 s, total: 16min 2s
Wall time: 16min 23s


In [220]:
# Print metric score and combination of parameters that gave the best metric score
for metric in ['rmse','mse', 'mae']:
    print(f'Test {metric}: {gs2.best_score[metric]}')
    print(f'Test best params: {gs2.best_params[metric]}')

Test rmse: 0.6068985188787803
Test best params: {'n_epochs': 40, 'lr_all': 0.003, 'reg_all': 0.5555555555555556}
Test mse: 0.3683400036076808
Test best params: {'n_epochs': 40, 'lr_all': 0.003, 'reg_all': 0.5555555555555556}
Test mae: 0.5193380526382191
Test best params: {'n_epochs': 40, 'lr_all': 0.003, 'reg_all': 0.2}


In [221]:
# Finding training metrics rmse
algo1 = gs2.best_estimator['rmse']
trainset = data.build_full_trainset()
algo1.fit(trainset)

testset = trainset.build_testset()
predictions = algo1.test(testset)

accuracy.rmse(predictions, verbose=True)

RMSE: 0.5117


0.511692197835627

In [222]:
# Finding training metrics mse
algo2 = gs2.best_estimator['mse']
trainset = data.build_full_trainset()
algo2.fit(trainset)

testset = trainset.build_testset()
predictions = algo2.test(testset)

accuracy.mse(predictions, verbose=True)

MSE: 0.2618


0.2617794890754093

In [223]:
# Finding training metrics mae
algo3 = gs2.best_estimator['mae']
trainset = data.build_full_trainset()
algo3.fit(trainset)

testset = trainset.build_testset()
predictions = algo3.test(testset)

accuracy.mae(predictions, verbose=True)

MAE:  0.4061


0.40607407603765117

- The results after GridSearch2 are as follows:

|Metric|train rmse|test rmse|train mse|test mse|train mae|test mae|
|---|---|---|---|---|---|---|
|Baseline|0.5226|0.6112|0.2732|0.3736|0.4433|0.5286|
|GridSearch1|0.5203|0.6072|0.2706|0.3687|0.4002|0.5171|
|GridSearch2|0.5117|0.6069|0.2618|0.3683|0.4061|0.5193|

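- After GridSearch2, the test rmse and mse improve marginally over GridSearch1 (0.6069 vs 0.6072 and 0.3683 vs 0.3687), while the test mae is slightly worse (0.5193 vs 0.5171).
- The model is still overfitting. For the targeted mae, the gap between train and test is slightly narrower than in GridSearch1 (0.1132 vs 0.1169), although it remains wider than the baseline gap (0.0853).
- As such, we will employ the model from GridSearch2 with a train mae of 0.4061 and test mae of 0.5193.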
- The best params for mae are as follows:
  - 'n_epochs': 40
  - 'lr_all': 0.003
  - 'reg_all': 0.2
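
The overfitting comparison can be made explicit by computing the test-minus-train gap for each model and metric. The figures below are copied from the results table (the GridSearch1 test mae is rounded from the printed 0.51710); the dictionary layout is our own.

```python
# (train, test) pairs per metric, taken from the results table
results = {
    "Baseline":    {"rmse": (0.5226, 0.6112), "mse": (0.2732, 0.3736), "mae": (0.4433, 0.5286)},
    "GridSearch1": {"rmse": (0.5203, 0.6072), "mse": (0.2706, 0.3687), "mae": (0.4002, 0.5171)},
    "GridSearch2": {"rmse": (0.5117, 0.6069), "mse": (0.2618, 0.3683), "mae": (0.4061, 0.5193)},
}

# A larger gap means the model fits the training data better than unseen data
gaps = {
    model: {metric: round(test - train, 4) for metric, (train, test) in metrics.items()}
    for model, metrics in results.items()
}

for model, model_gaps in gaps.items():
    print(model, model_gaps)
```

Read this way, GridSearch2 has the lowest test rmse and mse, but its train-test gaps are not smaller than the baseline's, so the tuning improves accuracy more than it reduces overfitting.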

## Using Best Params of GridSearch2

In [229]:
# use the algorithm that yields the best mae
algo = gs2.best_estimator["mae"]
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb35730e4c0>

In [302]:
# save and export model
with open('./output/model.pkl', 'wb') as f:
    pickle.dump(algo, f)
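
For reference, a save-and-load round trip with `pickle` can be sketched as follows. A plain dictionary stands in for the fitted model and a temporary directory stands in for `./output`; both are assumptions made so that the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (hypothetical; any picklable object works the same way)
toy_model = {"n_epochs": 40, "lr_all": 0.003, "reg_all": 0.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Context managers close the file handles even if an error occurs
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```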

#### Generating rating predictions

In [230]:
# Predict rating for profile_id=4519805 for new_location_id=2646
algo.predict(uid=4519805, iid=2646)

Prediction(uid=4519805, iid=2646, r_ui=None, est=2.4620357763065255, details={'was_impossible': False})

- Based on the prediction output, profile 4519805 is estimated to give location 2646 a rating of about 2.46 out of 3.
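
The estimate combines a global mean, a user bias, an item bias and a factor interaction, and is then clipped to the rating scale. The sketch below illustrates that composition, including what happens when the user is unknown; it reflects our understanding of how Surprise's biased SVD behaves, and all numbers are hypothetical.

```python
def estimate(mu, user=None, item=None, rating_scale=(1, 3)):
    """Combine mean, biases and factors; unknown users or items contribute nothing.

    `user` and `item` are (bias, factors) tuples, or None when unseen in training.
    """
    est = mu
    if user is not None:
        est += user[0]
    if item is not None:
        est += item[0]
    if user is not None and item is not None:
        est += sum(a * b for a, b in zip(user[1], item[1]))
    low, high = rating_scale
    return min(high, max(low, est))  # clip to the Reader's rating scale

known_user = (0.1, [0.2, -0.1])   # hypothetical bias and latent factors
known_item = (0.05, [0.3, 0.4])

est_known = estimate(2.3, known_user, known_item)
est_new_user = estimate(2.3, None, known_item)
est_clipped = estimate(2.9, (0.3, [0.0, 0.0]), known_item)
print(est_known, est_new_user, est_clipped)
```

This is why a prediction for an unseen profile is still possible: it falls back to the global mean plus the item bias, which carries no personalisation.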

#### Additional metrics

To calculate precision@k and recall@k, we will set k=100 (top 100 recommendations) and threshold=2.5 (out of 3) for a positive rating.

In [340]:
# Compute precision@k and recall@k
def precision_recall_at_k(predictions, k=100, threshold=2.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We will set it to 1.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We will set it to 1.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls
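
A worked example for a single user may help in reading these metrics. The sketch below applies the same counting logic to hypothetical (estimated, true) rating pairs with k=3; the helper name `precision_recall_for_user` is our own.

```python
def precision_recall_for_user(user_ratings, k=3, threshold=2.5):
    """Precision@k and recall@k for one user's (estimated, true) rating pairs."""
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    n_rel_and_rec_k = sum(
        (true_r >= threshold) and (est >= threshold) for est, true_r in ranked[:k]
    )
    # Same convention as above: undefined values are set to 1
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
    recall = n_rel_and_rec_k / n_rel if n_rel != 0 else 1
    return precision, recall

# Hypothetical user: three truly relevant items, two of them in the top 3
toy_user = [(2.9, 3), (2.7, 2), (2.6, 3), (2.4, 3), (1.8, 1)]
toy_precision, toy_recall = precision_recall_for_user(toy_user)

# Hypothetical user with nothing recommended and nothing relevant
empty_precision, empty_recall = precision_recall_for_user([(2.0, 1), (1.5, 2)])
print(toy_precision, toy_recall, empty_precision, empty_recall)
```

The second user scores 1 on both metrics despite receiving no useful recommendation, so the averages reported below should be read with that convention in mind.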

In [317]:
kf = KFold(n_splits=5)

In [343]:
precision_list = []
recall_list = []

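for trainset, testset in kf.split(data):
    # Note: this refits `algo` on each training fold, so after the loop `algo` holds the last fold's fit
    algo.fit(trainset)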
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=100, threshold=2.5)
    
    precision_list.append(sum(prec for prec in precisions.values()) / len(precisions))
    recall_list.append(sum(rec for rec in recalls.values()) / len(recalls))

# Average the per-fold precision and recall over all folds
precision_ave = sum(precision_list)/len(precision_list)
recall_ave = sum(recall_list)/len(recall_list)

In [344]:
print(precision_ave, recall_ave)

0.8031610013744401 0.7748232536921323


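- With k=100 and a threshold of 2.5 out of 3 for a positive rating, the model achieves an average precision@k of 0.8032 and recall@k of 0.7748 across the five folds, which we consider acceptable. Note that users with no recommended (or no relevant) items are assigned a score of 1, which raises these averages.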
- We will move on to generate recommendations.

#### Generating shuffled location recommendations

In [353]:
# Helper functions to get location recommendations
def get_location_id(name, metadata):
    
    """
    Gets the location ID for a location name based on the closest match in the metadata dataframe
    """
    
    # Match against unique names only: the closest match is the same, but the search is faster
    existing_names = list(metadata['name'].unique())
    closest_names = difflib.get_close_matches(name, existing_names)
    if not closest_names:
        raise ValueError(f"No location name close to '{name}' was found")
    new_location_id = metadata[metadata['name'] == closest_names[0]]['new_location_id'].values[0]
    return new_location_id

def get_location_info(new_location_id, metadata):
    
    """
    Returns some basic information about a location given the location id and the metadata dataframe
    """
    
    location_info = metadata[metadata['new_location_id'] == new_location_id][['name', 'city', 'cd']]
    return location_info.to_dict(orient='records')

def predict_rating(profile_id, name, model, metadata):
    
    """
    Predicts the rating (on a scale of 1-3) that a user would assign to a specific location 
    """
    
    # Note: the `model` argument is not used; predictions come from the model saved to ./output/model.pkl
    with open('./output/model.pkl', 'rb') as f:
        pickled_model = pickle.load(f)
    new_location_id = get_location_id(name, metadata)
    rating_prediction = pickled_model.predict(uid=profile_id, iid=new_location_id)
    return rating_prediction.est

def generate_recommendation(profile_id, model, metadata, thresh=2.5):
    
    """
    Generates a location recommendation for a user based on a rating threshold. Only
    locations with a predicted rating at or above the threshold will be recommended
    """
    
    if profile_id in metadata['profile_id'].values:
        names = list(metadata['name'].values)
        random.shuffle(names)
        for name in names:
            rating = predict_rating(profile_id, name, model, metadata)
            if rating >= thresh:
                new_location_id = get_location_id(name, metadata)
                return get_location_info(new_location_id, metadata)[0]
        # No location reached the threshold
        return None
    else:
        # counter cold start problem by recommending one of the 20 most frequently rated locations
        print("Looks like you're not a member yet. Why not join now for better recommendation?")
        names = metadata.groupby(['name']).count().sort_values(by='profile_id', ascending=False).head(20).index.values
        name = random.choice(list(names))
        new_location_id = get_location_id(name, metadata)
        return get_location_info(new_location_id, metadata)[0]
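
The name lookup relies on `difflib.get_close_matches`, which returns the candidates most similar to a query string, or an empty list when nothing is similar enough. A small sketch on hypothetical names shows both cases.

```python
import difflib

# Hypothetical list of known location names
known_names = ["green park", "la famiglia", "chinatown"]

# A misspelt query still resolves to the intended name
close = difflib.get_close_matches("green prk", known_names)

# A query with nothing in common yields an empty list
no_match = difflib.get_close_matches("zzzz", known_names)

print(close, no_match)
```

Because the result can be empty, indexing it with `[0]` without a check would raise an `IndexError` for an unrecognisable name.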

In [354]:
# Generate location recommendation for profile4519805 who is in dataset
generate_recommendation(4519805, algo, recsys_df)

{'name': 'garden room at the lanesborough, london',
 'city': 'London, United Kingdom',
 'cd': 'GB'}

In [355]:
# Generate location recommendation for profile1 who is not in dataset
generate_recommendation(1, algo, recsys_df)

Looks like you're not a member yet. Why not join now for better recommendation?


{'name': 'chinatown', 'city': 'London, United Kingdom', 'cd': 'GB'}

- We will proceed to fit another matrix factorisation model available in scikit-surprise, NMF, and compare the metrics.