# Recommendation Engines - MovieLens Data

## Tuesday June 20 2017

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

### Tasks

1. Load the data into the recommendation format
2. Build and assess model accuracy
3. Make individual recommendations
4. Try multiple models and compare accuracy
5. Consider how a company could use this

In [1]:
# Install Surprise - a useful library for recommendation engines
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.0.3.tar.gz (2.4MB)
    100% |████████████████████████████████| 2.4MB 296kB/s
Building wheels for collected packages: scikit-surprise
  Running setup.py bdist_wheel for scikit-surprise ... done
  Stored in directory: /Users/mikaelasquirchuk/Library/Caches/pip/wheels/5c/84/0c/21a872115299d7e2170620fc9fad866ec7588e958d9ac77b35
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.0.3


In [6]:
# Load Surprise
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import Reader

In [9]:
# 1. Load the data into the recommendation format

# As we're loading a custom dataset, we need to define a reader. In the
# movielens dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path = '../../data/u.data', reader=reader)
data.split(n_folds=5)
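
To make the `'user item rating timestamp'` line format concrete, here is a minimal standard-library sketch of what reading such a file amounts to. The sample rows and the `parse_ratings` helper are assumptions made up for illustration; they are not part of Surprise and are not copied from `u.data`.

```python
import csv
import io

# Made-up rows in the tab-separated 'user item rating timestamp' layout.
SAMPLE = "1\t10\t4\t881250949\n1\t20\t2\t881250950\n2\t10\t5\t881250951\n"

def parse_ratings(handle):
    """Return (user_id, item_id, rating) tuples; ids are kept as strings."""
    rows = []
    for user, item, rating, _timestamp in csv.reader(handle, delimiter="\t"):
        rows.append((user, item, float(rating)))
    return rows

ratings = parse_ratings(io.StringIO(SAMPLE))
print(ratings[0])
```

Keeping the ids as strings mirrors the raw ids that the later `predict` calls expect.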

In [10]:
# 2. Build and assess model accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9284
MAE:  0.7350
------------
Fold 2
RMSE: 0.9366
MAE:  0.7388
------------
Fold 3
RMSE: 0.9477
MAE:  0.7468
------------
Fold 4
RMSE: 0.9351
MAE:  0.7354
------------
Fold 5
RMSE: 0.9320
MAE:  0.7339
------------
------------
Mean RMSE: 0.9360
Mean MAE : 0.7380
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9284  0.9366  0.9477  0.9351  0.9320  0.9360  
MAE     0.7350  0.7388  0.7468  0.7354  0.7339  0.7380  
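
For reference, the two measures reported above can be computed by hand. The sketch below defines RMSE and MAE over (actual, estimated) pairs and then averages the five fold RMSE values printed above. The `toy_pairs` values are invented for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)

# Invented example pairs, not taken from the data set.
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
print(round(rmse(toy_pairs), 4), round(mae(toy_pairs), 4))

# Fold RMSE values copied from the SVD output above.
fold_rmse = [0.9284, 0.9366, 0.9477, 0.9351, 0.9320]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(f"{mean_rmse:.4f}")  # matches the reported Mean RMSE of 0.9360
```

RMSE penalises large misses more heavily than MAE because errors are squared before averaging, which is why RMSE is the larger of the two in every fold.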


In [11]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.18   {'was_impossible': False}
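
`predict` scores one (user, item) pair at a time. To turn scores into a recommendation list, one common approach is to score every item the user has not rated and keep the highest ones. The sketch below shows that idea with a stand-in predictor; `top_n`, `predict_rating`, and `FAKE_SCORES` are hypothetical names with made-up values. In this notebook the predictor's role could be played by `algo.predict(uid, iid).est`.

```python
def top_n(user_id, all_items, already_rated, predict_rating, n=3):
    """Rank items the user has not rated by predicted rating, highest first."""
    candidates = [item for item in all_items if item not in already_rated]
    scored = [(item, predict_rating(user_id, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Stand-in predictor with invented scores (an assumption for this example).
FAKE_SCORES = {"10": 4.6, "20": 3.1, "30": 4.9, "40": 2.2}

def predict_rating(user_id, item_id):
    return FAKE_SCORES[item_id]

recs = top_n("196", ["10", "20", "30", "40"], {"20"}, predict_rating, n=2)
print(recs)
```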


In [12]:
# 4. Try multiple models and compare accuracy

# Try at least 3 of the models mentioned below:
#random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
#baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.
#knns.KNNBasic    A basic collaborative filtering algorithm.
#knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
#knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.
#matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
#matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.
#co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.


# Here's how to run Non-Negative Matrix Factorisation
from surprise import NMF

# Now we will try Non-Negative Matrix Factorisation (a form of collaborative filtering)
algo.NMF = NMF()

# Evaluate performances of our algorithm on the dataset.
perf.NMF = evaluate(algo.NMF, data, measures=['RMSE', 'MAE'])

print_perf(perf.NMF)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9654
MAE:  0.7615
------------
Fold 2
RMSE: 0.9576
MAE:  0.7535
------------
Fold 3
RMSE: 0.9792
MAE:  0.7706
------------
Fold 4
RMSE: 0.9612
MAE:  0.7523
------------
Fold 5
RMSE: 0.9623
MAE:  0.7548
------------
------------
Mean RMSE: 0.9652
Mean MAE : 0.7585
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9654  0.9576  0.9792  0.9612  0.9623  0.9652  
MAE     0.7615  0.7535  0.7706  0.7523  0.7548  0.7585  


In [16]:
#random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

from surprise import NormalPredictor

algo.NP = NormalPredictor()

perf.NP = evaluate(algo.NP, data, measures=['RMSE', 'MAE'])

print_perf(perf.NP)

Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5057
MAE:  1.2079
------------
Fold 2
RMSE: 1.5203
MAE:  1.2203
------------
Fold 3
RMSE: 1.5218
MAE:  1.2204
------------
Fold 4
RMSE: 1.5168
MAE:  1.2226
------------
Fold 5
RMSE: 1.5142
MAE:  1.2157
------------
------------
Mean RMSE: 1.5158
Mean MAE : 1.2174
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    1.5057  1.5203  1.5218  1.5168  1.5142  1.5158  
MAE     1.2079  1.2203  1.2204  1.2226  1.2157  1.2174  
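
The idea behind this random predictor can be sketched in a few lines: estimate the mean and standard deviation of the training ratings, then draw each prediction from a normal distribution with those parameters. This is an illustrative reimplementation, not Surprise's code. The function names, the toy `train` ratings, and the clipping to the 1-5 scale are assumptions of the sketch.

```python
import random
import statistics

def fit_normal(train_ratings):
    """Estimate the mean and (population) standard deviation of the ratings."""
    return statistics.mean(train_ratings), statistics.pstdev(train_ratings)

def predict_random(mu, sigma, rng, low=1.0, high=5.0):
    """Draw a rating from N(mu, sigma) and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

train = [5, 3, 4, 4, 2, 5, 1, 3, 4, 4]  # invented toy ratings
mu, sigma = fit_normal(train)
rng = random.Random(0)  # seeded so the sketch is repeatable
preds = [predict_random(mu, sigma, rng) for _ in range(5)]
print(mu, round(sigma, 3), [round(p, 2) for p in preds])
```

Because the draw ignores both the user and the item, this model is mainly useful as a floor that any real recommender should beat.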


In [17]:
#baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.

from surprise import BaselineOnly

algo.BLO = BaselineOnly()

perf.BLO = evaluate(algo.BLO, data, measures=['RMSE', 'MAE'])

print_perf(perf.BLO)

Evaluating RMSE, MAE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.9379
MAE:  0.7467
------------
Fold 2
Estimating biases using als...
RMSE: 0.9423
MAE:  0.7469
------------
Fold 3
Estimating biases using als...
RMSE: 0.9564
MAE:  0.7569
------------
Fold 4
Estimating biases using als...
RMSE: 0.9419
MAE:  0.7451
------------
Fold 5
Estimating biases using als...
RMSE: 0.9401
MAE:  0.7454
------------
------------
Mean RMSE: 0.9437
Mean MAE : 0.7482
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9379  0.9423  0.9564  0.9419  0.9401  0.9437  
MAE     0.7467  0.7469  0.7569  0.7451  0.7454  0.7482  
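
A baseline estimate is usually written as the global mean plus a user bias plus an item bias. The sketch below fits those biases with a simple alternating least squares loop, which is what "Estimating biases using als" refers to in the output above. It is an illustrative reimplementation rather than Surprise's code: the function names, the regularisation values, the epoch count, and the toy ratings are all assumptions.

```python
from collections import defaultdict

def fit_baselines(ratings, reg_u=15.0, reg_i=10.0, n_epochs=10):
    """Fit global mean, user biases and item biases by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append((item, r))
        by_item[item].append((user, r))
    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed, then the reverse.
        for item, rows in by_item.items():
            b_i[item] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for user, rows in by_user.items():
            b_u[user] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, user, item):
    """Unknown users or items contribute a bias of zero."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Toy data: user "a" rates everything high, user "b" rates everything low.
toy = [("a", "x", 5.0), ("a", "y", 5.0), ("b", "x", 1.0), ("b", "y", 1.0)]
mu, b_u, b_i = fit_baselines(toy)
print(mu, round(b_u["a"], 3), round(b_u["b"], 3))
```

The regularisation terms shrink each bias towards zero, so users or items with few ratings stay close to the global mean.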


In [18]:
#knns.KNNBasic    A basic collaborative filtering algorithm.

from surprise import KNNBasic

algo.KNNBasic = KNNBasic()

perf.KNNBasic = evaluate(algo.KNNBasic, data, measures=['RMSE', 'MAE'])

print_perf(perf.KNNBasic)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9738
MAE:  0.7705
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9750
MAE:  0.7694
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9890
MAE:  0.7810
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9767
MAE:  0.7703
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9786
MAE:  0.7727
------------
------------
Mean RMSE: 0.9786
Mean MAE : 0.7728
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9738  0.9750  0.9890  0.9767  0.9786  0.9786  
MAE     0.7705  0.7694  0.7810  0.7703  0.7727  0.7728  
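
The "msd similarity matrix" in the output above refers to mean squared difference similarity. As a rough sketch, two users are compared on the items they have both rated: the squared rating differences are averaged, and the similarity is `1 / (msd + 1)`. The helper name and the toy rating dictionaries below are assumptions for illustration.

```python
def msd_similarity(ratings_u, ratings_v):
    """Similarity in (0, 1] from mean squared difference on co-rated items.

    Each argument maps item id -> rating for one user. Users with no
    co-rated items get a similarity of 0.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Invented toy users.
alice = {"10": 4.0, "20": 2.0}
bob = {"10": 5.0, "20": 2.0, "30": 3.0}
carol = {"40": 1.0}

print(round(msd_similarity(alice, bob), 4))
print(msd_similarity(alice, alice), msd_similarity(alice, carol))
```

Identical ratings give a similarity of 1, and larger disagreements push the value towards 0.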


In [19]:
#knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

from surprise import KNNWithMeans

algo.KNNWithMeans = KNNWithMeans()

perf.KNNWithMeans = evaluate(algo.KNNWithMeans, data, measures=['RMSE', 'MAE'])

print_perf(perf.KNNWithMeans)

Evaluating RMSE, MAE of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9484
MAE:  0.7510
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9478
MAE:  0.7462
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9617
MAE:  0.7579
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9468
MAE:  0.7440
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9472
MAE:  0.7449
------------
------------
Mean RMSE: 0.9504
Mean MAE : 0.7488
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9484  0.9478  0.9617  0.9468  0.9472  0.9504  
MAE     0.7510  0.7462  0.7579  0.7440  0.7449  0.7488  


In [20]:
#knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.

from surprise import KNNBaseline

algo.KNNBaseline = KNNBaseline()

perf.KNNBaseline = evaluate(algo.KNNBaseline, data, measures=['RMSE', 'MAE'])

print_perf(perf.KNNBaseline)

Evaluating RMSE, MAE of algorithm KNNBaseline.

------------
Fold 1
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9254
MAE:  0.7330
------------
Fold 2
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9279
MAE:  0.7304
------------
Fold 3
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9422
MAE:  0.7412
------------
Fold 4
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9281
MAE:  0.7297
------------
Fold 5
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9260
MAE:  0.7290
------------
------------
Mean RMSE: 0.9299
Mean MAE : 0.7327
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9254  0.9279  0.9422  0.9281  0.9260  0.9299  
MAE     0.7330  0.7304  0.7412  0.7297  0.7290  0.7327  


In [23]:
#matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

from surprise import SVD

algo.SVD = SVD()

perf.SVD = evaluate(algo.SVD, data, measures=['RMSE', 'MAE'])

print_perf(perf.SVD)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9291
MAE:  0.7342
------------
Fold 2
RMSE: 0.9384
MAE:  0.7401
------------
Fold 3
RMSE: 0.9485
MAE:  0.7450
------------
Fold 4
RMSE: 0.9332
MAE:  0.7344
------------
Fold 5
RMSE: 0.9322
MAE:  0.7343
------------
------------
Mean RMSE: 0.9363
Mean MAE : 0.7376
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9291  0.9384  0.9485  0.9332  0.9322  0.9363  
MAE     0.7342  0.7401  0.7450  0.7344  0.7343  0.7376  


In [50]:
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

from surprise import SVDpp

algo.SVDpp = SVDpp()

perf.SVDpp = evaluate(algo.SVDpp, data, measures=['RMSE', 'MAE'])

print_perf(perf.SVDpp)

Evaluating RMSE, MAE of algorithm SVDpp.

------------
Fold 1


KeyboardInterrupt: 

In [28]:
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.

from surprise import SlopeOne

algo.SlopeOne = SlopeOne()

perf.SlopeOne = evaluate(algo.SlopeOne, data, measures=['RMSE', 'MAE'])

print_perf(perf.SlopeOne)

Evaluating RMSE, MAE of algorithm SlopeOne.

------------
Fold 1
RMSE: 0.9408
MAE:  0.7426
------------
Fold 2
RMSE: 0.9425
MAE:  0.7409
------------
Fold 3
RMSE: 0.9553
MAE:  0.7501
------------
Fold 4
RMSE: 0.9441
MAE:  0.7404
------------
Fold 5
RMSE: 0.9430
MAE:  0.7403
------------
------------
Mean RMSE: 0.9451
Mean MAE : 0.7429
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9408  0.9425  0.9553  0.9441  0.9430  0.9451  
MAE     0.7426  0.7409  0.7501  0.7404  0.7403  0.7429  


In [30]:
#co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.

from surprise import CoClustering

algo.CoClustering = CoClustering()

perf.CoClustering = evaluate(algo.CoClustering, data, measures=['RMSE', 'MAE'])

print_perf(perf.CoClustering)

Evaluating RMSE, MAE of algorithm CoClustering.

------------
Fold 1
RMSE: 0.9586
MAE:  0.7543
------------
Fold 2
RMSE: 0.9583
MAE:  0.7479
------------
Fold 3
RMSE: 0.9766
MAE:  0.7655
------------
Fold 4
RMSE: 0.9593
MAE:  0.7487
------------
Fold 5
RMSE: 0.9627
MAE:  0.7525
------------
------------
Mean RMSE: 0.9631
Mean MAE : 0.7538
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9586  0.9583  0.9766  0.9593  0.9627  0.9631  
MAE     0.7543  0.7479  0.7655  0.7487  0.7525  0.7538  


In [31]:
#matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.

from surprise import NMF

algo.NMF = NMF()

perf.NMF = evaluate(algo.NMF, data, measures=['RMSE', 'MAE'])

print_perf(perf.NMF)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9581
MAE:  0.7569
------------
Fold 2
RMSE: 0.9602
MAE:  0.7563
------------
Fold 3
RMSE: 0.9731
MAE:  0.7653
------------
Fold 4
RMSE: 0.9624
MAE:  0.7537
------------
Fold 5
RMSE: 0.9638
MAE:  0.7583
------------
------------
Mean RMSE: 0.9635
Mean MAE : 0.7581
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9581  0.9602  0.9731  0.9624  0.9638  0.9635  
MAE     0.7569  0.7563  0.7653  0.7537  0.7583  0.7581  
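
To compare the models side by side, the mean RMSE and MAE values printed in the cells above can be gathered into one table and sorted. The numbers below are copied from those outputs (the SVD and NMF figures come from their most recent runs). SVDpp is omitted because its run was interrupted. The `results` and `ranked` names are just for this sketch.

```python
# Mean (RMSE, MAE) per model, copied from the cross-validation outputs above.
results = {
    "NormalPredictor": (1.5158, 1.2174),
    "BaselineOnly": (0.9437, 0.7482),
    "KNNBasic": (0.9786, 0.7728),
    "KNNWithMeans": (0.9504, 0.7488),
    "KNNBaseline": (0.9299, 0.7327),
    "SVD": (0.9363, 0.7376),
    "SlopeOne": (0.9451, 0.7429),
    "CoClustering": (0.9631, 0.7538),
    "NMF": (0.9635, 0.7581),
}

# Sort by mean RMSE, lowest (best) first.
ranked = sorted(results.items(), key=lambda kv: kv[1][0])

print(f"{'Model':<16}{'RMSE':>8}{'MAE':>8}")
for name, (rmse_value, mae_value) in ranked:
    print(f"{name:<16}{rmse_value:>8.4f}{mae_value:>8.4f}")
```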


In [52]:
uid = str(4)
iid = str(10)

pred = algo.KNNBaseline.predict(uid, iid, r_ui=1, verbose=True)

user: 4          item: 10         r_ui = 1.00   est = 4.58   {'actual_k': 40, 'was_impossible': False}


### 5. Consider how a company could use this

How might a company use a recommendation engine like this in practice? Write a few paragraphs on how they could use the above, covering:
- How does the algorithm work?
- What data would be used?
- How would we know if it's working?
- What is the benefit of using an algorithm like this over just recommending the most popular films overall?

KNNBaseline appears to be the best prediction algorithm for this data set: it has the lowest mean RMSE (0.9299) and MAE (0.7327) of the models that finished running, meaning the error is minimised. That is, its predicted ratings are, on average, the closest to the ratings users actually gave.

KNN or "K Nearest Neighbours" algorithms work by finding the data points most similar to the one being predicted and combining their known values. In this rating-prediction setting, the estimate for a user and a movie is a similarity-weighted average of the ratings given to that movie by the most similar users. The value of K is how many neighbours are used to make the prediction.
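
For rating prediction, the neighbour idea is typically applied as a similarity-weighted average: take the K most similar users who rated the movie and average their ratings, weighting each by its similarity. The sketch below shows only that final step. The `knn_predict` name and the (similarity, rating) pairs are invented for illustration, and it is not Surprise's implementation.

```python
def knn_predict(neighbours, k=2):
    """Similarity-weighted average of the k most similar neighbours' ratings.

    `neighbours` is a list of (similarity, rating) pairs for users who have
    rated the target item. Returns None if no usable neighbour exists.
    """
    top = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in top)
    if total_similarity == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total_similarity

# Invented neighbours: (similarity to the target user, their rating of the movie).
neighbours = [(0.9, 5.0), (0.6, 3.0), (0.1, 1.0)]
estimate = knn_predict(neighbours, k=2)
print(round(estimate, 2))
```

With `k=2` the least similar neighbour is ignored, so the estimate is pulled towards the rating from the most similar user.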

KNNBaseline differs from KNNBasic in that it also takes a baseline rating into account, as noted in the model list above.

The data could be used the same way Netflix uses it: to recommend movies to users and so maximise engagement on a streaming service. It could also be used by a rental service, e.g. Google Play, to recommend movies to users who have rented other movies from the service before. Basically, it can be used to recommend movies to users who have previously consumed movies through the service, since the algorithm and data presuppose that users have rated movies before.

We would know that the algorithm is working if the MAE and RMSE are both low relative to a simple baseline such as NormalPredictor (mean RMSE 1.5158 here), and if users are using the service!

