Most recommender systems attempt to predict the rating a user would give to an item they have not rated yet, that is, to fill in the empty cells of the user-item matrix. With too many “NaN”s (a sparse user-item matrix), the recommender won’t have enough data to understand what the user likes.

**Explicit** ratings are great if you can convince your users to provide them. If you have the luxury of such rating data, the natural evaluation metrics are error-based: RMSE (root mean squared error) or MAE (mean absolute error). Let’s work through an example using the MovieLens dataset and the Surprise library.

The code is adapted heavily from: https://towardsdatascience.com/evaluating-a-real-life-recommender-system-error-based-and-ranking-based-84708e3285b

Data source: https://grouplens.org/datasets/movielens/

In [36]:
from __future__ import division
import pandas as pd
from surprise import SVD
from surprise import KNNBaseline
from surprise.model_selection import train_test_split
from surprise.model_selection import LeaveOneOut
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from collections import defaultdict

In [2]:
movies = pd.read_csv('../Recommender_Systems_Surprise/data/ML_Latest_small/movies.csv')
ratings = pd.read_csv('../Recommender_Systems_Surprise/data/ML_Latest_small/ratings.csv')

In [3]:
movies.head(5)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
movies.shape

(9742, 3)

In [5]:
ratings.head(5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [6]:
ratings.shape

(100836, 4)

In [7]:
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [8]:
# Let's merge the two datasets on the 'movieId' column

df = pd.merge(movies, ratings, on='movieId', how = 'inner')
df.head(5)

| | movieId | title | genres | userId | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1 | 4.0 | 964982703 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 5 | 4.0 | 847434962 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 7 | 4.5 | 1106635946 |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 15 | 2.5 | 1510577970 |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 17 | 4.5 | 1305696483 |


We will skip the EDA part and go directly to the Surprise library.

In [9]:
# We need to define a Reader and a Dataset. Recall that the DataFrame passed to
# load_from_df must have exactly three columns, in this order: user id, item id, rating.

reader = Reader(rating_scale=(0.5,5))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
train_data, test_data = train_test_split(data, test_size=0.25, random_state=17)
MF = SVD(random_state=17) # Matrix Factorization

In [10]:
MF.fit(train_data)
prediction = MF.test(test_data)

To better understand what was actually returned, each prediction carries the following fields:

- **uid** – The (raw) user id. 
- **iid** – The (raw) item id.
- **r_ui** (float) – The true rating $r_{ui}$.
- **est** (float) – The estimated rating $\hat{r}_{ui}$.
- **details** (dict) – Stores additional details about the prediction that might be useful for later analysis.

In [12]:
def MAE(prediction):
    return accuracy.mae(prediction, verbose=False)

def RMSE(prediction):
    return accuracy.rmse(prediction, verbose=False)

In [13]:
print("RMSE: ", RMSE(prediction))
print("MAE: ", MAE(prediction))

('RMSE: ', 0.8766640674471704)
('MAE: ', 0.6721879858284017)
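
To make the two error metrics concrete, here is a minimal sketch that computes them by hand on a few made-up (true rating, estimated rating) pairs. The numbers and the helper names `rmse` and `mae` are illustrative assumptions, not output of the model above; `accuracy.rmse` and `accuracy.mae` compute the same quantities from the `r_ui` and `est` fields of each prediction.

```python
import math

# Hypothetical (true rating, estimated rating) pairs, for illustration only.
pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]

def rmse(pairs):
    """Root mean squared error: squaring penalizes large errors more heavily."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: the average size of the error, in rating units."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

print("RMSE:", rmse(pairs))
print("MAE:", mae(pairs))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and the gap grows when a few predictions are far off.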


### Top-N items

The idea is to provide each user with a ranked list of the N items they are most likely to be interested in.

In [21]:
def GetTopN(predictions, n=10, minimumRating=4.0):
    topN = defaultdict(list)
    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if (estimatedRating >= minimumRating):
            topN[int(userID)].append((int(movieID), estimatedRating))

    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]
    
    # Returns a dictionary where each key is a user ID and each value is a list of
    # (movieID, estimated rating) pairs with estimated rating >= minimumRating, best first
    return topN
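
As a quick illustration of the filtering and ranking step, the sketch below applies the same idea to a handful of made-up predictions. The tuples follow the field order described earlier (`uid`, `iid`, `r_ui`, `est`, `details`); the data and the name `top_n_sketch` are hypothetical and only meant to show the shape of the result.

```python
from collections import defaultdict

# Hypothetical predictions as (uid, iid, r_ui, est, details) tuples. Values are made up.
toy_predictions = [
    ("1", "10", 3.0, 4.6, {}),
    ("1", "20", 3.0, 3.2, {}),  # below the 4.0 threshold, so it is dropped
    ("1", "30", 3.0, 4.9, {}),
    ("2", "10", 3.0, 4.1, {}),
]

def top_n_sketch(predictions, n=10, minimum_rating=4.0):
    """Keep each user's n highest-estimated items at or above minimum_rating."""
    per_user = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        if est >= minimum_rating:
            per_user[int(uid)].append((int(iid), est))
    # Sort each user's candidates by estimated rating, best first, and truncate to n.
    return {uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
            for uid, items in per_user.items()}

result = top_n_sketch(toy_predictions, n=2)
print(result)
```

Note that a user with no estimate at or above the threshold gets no entry at all, which matters later when counting how many users receive recommendations.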

In [23]:
#topN.items()
#topN.keys()
#topN.values()

### Leave-One-Out cross-validator

In [16]:
LOOCV = LeaveOneOut(n_splits=1, random_state=1)

for trainSet, testSet in LOOCV.split(data):
    # Train the model (SVD in our case) without the left-out ratings (testSet)
    MF.fit(trainSet)

    # Predict ratings only for the left-out ratings (those in testSet)
    leftOutPredictions = MF.test(testSet)

    # Build predictions for every user-movie pair that is not in the training set,
    # i.e. movies the user has not rated but might find useful
    bigTestSet = trainSet.build_anti_testset()  # unrated pairs, each filled with the global mean as a placeholder rating
    allPredictions = MF.test(bigTestSet)  # estimate a rating for each of these pairs with the fitted model
    
    # Compute top 10 recs for each user
    topNPredicted = GetTopN(allPredictions, n=10)
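
To clarify what the "anti-testset" contains, here is a minimal pure-Python sketch of the idea on made-up ratings: list every (user, item) pair that is absent from the training data and attach the global mean as a placeholder rating. It mirrors the concept only; the variable names and data are assumptions, and it is not how Surprise implements `build_anti_testset` internally.

```python
# Hypothetical training ratings as (user, item, rating). Values are made up.
train_ratings = [("u1", "A", 4.0), ("u1", "B", 5.0), ("u2", "A", 3.0), ("u2", "C", 2.0)]

users = sorted({u for u, _, _ in train_ratings})
items = sorted({i for _, i, _ in train_ratings})
rated = {(u, i) for u, i, _ in train_ratings}

# The placeholder "true" rating attached to every unrated pair.
global_mean = sum(r for _, _, r in train_ratings) / len(train_ratings)

# Every known user crossed with every known item, minus the pairs already rated.
anti_testset = [(u, i, global_mean) for u in users for i in items if (u, i) not in rated]
print(anti_testset)
```

The size of this list is roughly (number of users × number of items) minus the number of ratings, so on a sparse dataset predicting over it is by far the most expensive step of the loop.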

### Hit Rate

Let’s see how good our top-10 recommendations are. To evaluate them we use the hit rate: if one of the top-10 movies we recommended to a user is a movie that user actually rated, we count it as a “hit”.

The process of computing the hit rate for a single user:

- Find all items in this user’s history in the training data.
- Intentionally remove one of these items (Leave-One-Out cross-validation) and put it in the test set.
- Feed all other items to the recommender and make predictions for the sparse (empty) entries of the user-movie matrix.
- Ask for the top 10 recommendations for each user.
- If the removed item (the one in the test set) appears in the top 10 recommendations, it is a hit. If not, it is not a hit.
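
The steps above can be sketched on toy data as follows. The top-N lists, the left-out items, and the name `hit_rate_sketch` are all hypothetical; the point is only to show how hits are counted once each user has one held-out item.

```python
# Hypothetical top-N lists per user (item ids only) and one left-out item per user.
top_n = {1: [10, 30, 50], 2: [20, 40, 60], 3: [10, 20, 30]}
left_out = {1: 30, 2: 70, 3: 10}

def hit_rate_sketch(top_n, left_out):
    """Fraction of users whose left-out item appears in their top-N list."""
    hits = sum(1 for user, item in left_out.items() if item in top_n.get(user, []))
    return hits / len(left_out)

rate = hit_rate_sketch(top_n, left_out)
print("Hit rate:", rate)  # users 1 and 3 are hits, user 2 is not
```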

In [38]:
# See how often we recommended a movie the user actually rated
def HitRate(topNPredicted, leftOutPredictions):
    hits = 0
    total = 0

    # For each left-out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutMovieID = leftOut[1]
        # Is it in the predicted top 10 for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == int(movieID)):
                hit = True
                break
        if (hit) :
            hits += 1

        total += 1
    
    print(hits,total)
    # Compute the overall hit rate
    return hits/total

In [43]:
print("Hit Rate: ", HitRate(topNPredicted, leftOutPredictions))

(22, 610)
('Hit Rate: ', 0.036065573770491806)


The overall hit rate of the system is the number of hits divided by the number of test users; here that is 22 / 610 ≈ 3.6%. It measures how often we are able to recommend a left-out rating; higher is better.
A very low hit rate like this one suggests that we do not have enough data to work with.

### Hit Rate by Rating Value

We can also break down the hit rate by the actual rating value of the left-out item. Ideally, we want to recommend movies the user likes, so we care about hits on highly rated movies, not on low-rated ones.

In [44]:
def RatingHitRate(topNPredicted, leftOutPredictions):
    
    hits = defaultdict(float)
    total = defaultdict(float)
    
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == movieID):
                hit = True
                break
        if (hit) :
            hits[actualRating] += 1
        total[actualRating] += 1

    # Print the hit rate for each rating value that had at least one hit
    for rating in sorted(hits.keys()):
        print(rating, hits[rating] / total[rating])

In [45]:
print("Hit Rate by Rating value: ")
RatingHitRate(topNPredicted, leftOutPredictions)

Hit Rate by Rating value: 
(3.0, 0.008695652173913044)
(4.0, 0.044444444444444446)
(4.5, 0.07547169811320754)
(5.0, 0.07317073170731707)


Our hit rate breakdown is what we’d hoped for: the hit rates for ratings of 5 and 4.5 are considerably higher than those for 4 or 3. Higher is better.

### Cumulative Hit Rate

Because we care about higher ratings, we can ignore left-out items whose actual rating is lower than 4 and compute the hit rate only for ratings >= 4.
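
A minimal sketch of the cutoff idea on made-up data: left-out items rated below the cutoff are excluded from both the hit count and the total. The data and the name `cumulative_hit_rate_sketch` are hypothetical.

```python
# Hypothetical top-N lists and left-out (item id, actual rating) pairs per user.
top_n = {1: [10, 30, 50], 2: [20, 40, 60], 3: [10, 20, 30]}
left_out = {1: (30, 5.0), 2: (70, 4.0), 3: (10, 2.5)}

def cumulative_hit_rate_sketch(top_n, left_out, rating_cutoff=0.0):
    """Hit rate restricted to left-out items whose actual rating is >= rating_cutoff."""
    hits = 0
    total = 0
    for user, (item, actual_rating) in left_out.items():
        if actual_rating < rating_cutoff:
            continue  # a disliked item: not recommending it should not count against us
        total += 1
        if item in top_n.get(user, []):
            hits += 1
    return hits / total if total else 0.0

chr_at_4 = cumulative_hit_rate_sketch(top_n, left_out, rating_cutoff=4.0)
print("Cumulative hit rate (rating >= 4):", chr_at_4)
```

User 3's hit is dropped here because the left-out movie was rated 2.5, so only users 1 and 2 remain in the denominator.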

In [46]:
def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
    hits = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Only look at ability to recommend things the users actually liked...
        if (actualRating >= ratingCutoff):
            # Is it in the predicted top 10 for this user?
            hit = False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if (int(leftOutMovieID) == movieID):
                    hit = True
                    break
            if (hit) :
                hits += 1
            total += 1

    # Compute the hit rate over the left-out ratings that passed the cutoff
    return hits/total

In [47]:
print("Cumulative Hit Rate (rating >= 4): ", CumulativeHitRate(topNPredicted, leftOutPredictions, 4.0))    

('Cumulative Hit Rate (rating >= 4): ', 0.05898876404494382)


### Average Reciprocal Hit Ranking (ARHR)

ARHR is a commonly used metric for evaluating the ranking of Top-N recommender systems. It only takes into account the position at which the relevant (left-out) item occurs: we get more credit for recommending an item the user rated near the top of the list than near the bottom. Higher is better.
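
As a worked illustration on made-up data, each hit contributes the reciprocal of its 1-based position in the user's list, misses contribute 0, and the sum is divided by the number of users. The data and the name `arhr_sketch` are hypothetical.

```python
# Hypothetical top-N lists per user (best first) and one left-out item per user.
top_n = {1: [10, 30, 50], 2: [20, 40, 60], 3: [10, 20, 30]}
left_out = {1: 30, 2: 70, 3: 10}

def arhr_sketch(top_n, left_out):
    """Average of 1/rank over all users, where a miss contributes 0."""
    total = 0.0
    for user, item in left_out.items():
        ranked = top_n.get(user, [])
        if item in ranked:
            total += 1.0 / (ranked.index(item) + 1)  # rank is 1-based
    return total / len(left_out)

arhr = arhr_sketch(top_n, left_out)
print("ARHR:", arhr)  # (1/2 for user 1 + 0 for user 2 + 1/1 for user 3) / 3
```

With the same hits as a plain hit rate of 2/3, the score drops to 0.5 because user 1's hit sits in second place rather than first.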



In [48]:
# Compute ARHR
def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
    summation = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hitRank = 0
        rank = 0
        for movieID, predictedRating in topNPredicted[int(userID)]:
            rank = rank + 1
            if (int(leftOutMovieID) == movieID):
                hitRank = rank
                break
        if (hitRank > 0):
            summation += 1.0 / hitRank

        total += 1

    return summation / total



In [49]:
print("Average Reciprocal Hit Rank: ", AverageReciprocalHitRank(topNPredicted, leftOutPredictions))

('Average Reciprocal Hit Rank: ', 0.01152029664324746)


In [51]:
data.df.head(5)

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 5 | 1 | 4.0 |
| 2 | 7 | 1 | 4.5 |
| 3 | 15 | 1 | 2.5 |
| 4 | 17 | 1 | 4.5 |


In [59]:
fullTrainSet = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
simsAlgo = KNNBaseline(sim_options=sim_options)
simsAlgo.fit(fullTrainSet)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7faddab83b10>

In [63]:
MF.fit(fullTrainSet)
bigTestSet = fullTrainSet.build_anti_testset()
allPredictions = MF.test(bigTestSet)
topNPredicted = GetTopN(allPredictions, n=10)

In [66]:
# Print user coverage with a minimum predicted rating of 4.0:

def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
    hits = 0
    for userID in topNPredicted.keys():
        hit = False
        for movieID, predictedRating in topNPredicted[userID]:
            if (predictedRating >= ratingThreshold):
                hit = True
                break
        if (hit):
            hits += 1

    return hits / numUsers

In [67]:
print("\nUser coverage: ", UserCoverage(topNPredicted, fullTrainSet.n_users, ratingThreshold=4.0))

('\nUser coverage: ', 0.9311475409836065)
