## MovieLens 100K Dataset

The MovieLens 100K dataset has been a standard benchmark for recommender systems for more than 20 years, which makes it a good starting point for learning about them. For non-commercial, personalised movie recommendations, see the MovieLens website: https://movielens.org/

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
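
As a quick sanity check on these figures, the density of the user-movie rating matrix follows directly from the three counts above. The sketch below only redoes that arithmetic; the variable names are illustrative and not part of the dataset.

```python
# Counts quoted in the dataset summary above
n_ratings = 100_000
n_users = 943
n_movies = 1682

# Number of possible (user, movie) pairs
n_cells = n_users * n_movies

# Fraction of those pairs that actually carry a rating
density = n_ratings / n_cells
print(f"{n_cells} possible pairs, density = {density:.4f}")
```

Only about 6.3% of the possible user-movie pairs are rated, so a recommender has to estimate preferences for the large majority of pairs it has never observed.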

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

## Data Description


**Ratings** -- The full u data set: 100,000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a comma-separated list of `user id | item id | rating | timestamp`. The timestamps are Unix seconds since 1/1/1970 UTC.
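
To make the timestamp format concrete, here is a minimal sketch that converts one Unix timestamp into a readable date with pandas. The value used is the `unix_timestamp` of the first ratings row displayed later in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Timestamp of the first rating shown in the notebook (user 196, movie 242)
ts = 881250949

# Interpret the integer as seconds since the Unix epoch, in UTC
rated_at = pd.to_datetime(ts, unit="s", utc=True)
print(rated_at)
```

This rating falls on 4 December 1997, inside the collection window described above.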


**Movie Information** -- Information about the items (movies). This is a comma-separated list of `movie id | movie title | release date | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western`. The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not. Movies can be in several genres at once.
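
The 0/1 genre flags can be turned back into a list of genre names per movie. The sketch below assumes a small hypothetical frame with only three of the 19 genre columns; the titles and flag values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the layout described above (3 of the 19 genre columns)
movies = pd.DataFrame({
    "movie id": [1, 2],
    "movie title": ["Movie A", "Movie B"],
    "Action": [0, 1],
    "Animation": [1, 0],
    "Comedy": [1, 0],
})
genre_cols = ["Action", "Animation", "Comedy"]

# For each row, keep the names of the columns whose flag is 1
movies["genres"] = movies[genre_cols].apply(
    lambda row: [g for g in genre_cols if row[g] == 1], axis=1
)
print(movies[["movie title", "genres"]])
```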


**User Demographics** -- Demographic information about the users. This is a comma-separated list of `user id | age | gender | occupation | zip code`.

## Table of Contents

[1. Reading Dataset](#Reading-Dataset)

[2. Merging Movie information to ratings dataframe](#merge)

[3. Creating Binary Relevance Target from User Average](#br)

[4. Fitting SVD to make rating predictions](#svdfit)

[5. Root Mean Squared Error](#rmse)

[6. Creating Predicted List of relevant movies](#pbr)

[7. Precision@K](#patk)

[8. Recall@K](#ratk)

[9. Mean Reciprocal Rank](#mrr)

[10. Mean Average Precision@K](#mapk)

[11. Normalised Discounted Cumulative Gain@K](#ndcg)

[12. What's next?](#whatsnext)

## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

In [2]:
#Reading ratings file:
ratings = pd.read_csv('ratings.csv')

#Reading Movie Info File
movie_info = pd.read_csv('movie_info.csv')

## 2.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie titles are stored in a separate file. Let's merge them into the ratings dataframe, since the titles will be useful later on.

In [3]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [4]:
ratings.head()

| | user_id | movie_id | rating | unix_timestamp | movie id | movie title |
|---|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | 242 | Kolya (1996) |
| 1 | 186 | 302 | 3 | 891717742 | 302 | L.A. Confidential (1997) |
| 2 | 22 | 377 | 1 | 878887116 | 377 | Heavyweights (1994) |
| 3 | 244 | 51 | 2 | 880606923 | 51 | Legends of the Fall (1994) |
| 4 | 166 | 346 | 1 | 886397596 | 346 | Jackie Brown (1997) |
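
A left merge like the one above keeps exactly one output row per rating only if each movie id appears once in the movie file. The following sketch, on hypothetical toy frames that reuse the notebook's column names, shows one way to have pandas check that assumption through the `validate` argument of `merge`.

```python
import pandas as pd

# Toy stand-ins for the two files; column names mirror the ones used above
ratings_toy = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [4, 2, 5]}
)
movie_info_toy = pd.DataFrame(
    {"movie id": [10, 20], "movie title": ["Movie A", "Movie B"]}
)

# validate="many_to_one" raises MergeError if a movie id is duplicated in
# movie_info_toy, which would otherwise silently duplicate rating rows
merged = ratings_toy.merge(
    movie_info_toy,
    how="left",
    left_on="movie_id",
    right_on="movie id",
    validate="many_to_one",
)

# One output row per rating, each with its title attached
print(len(merged) == len(ratings_toy))
print(merged)
```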


Let's also combine the movie id and the movie title, separated by ': ', and store the result in a new column named movie.

In [5]:
ratings['movie'] = ratings['movie_id'].map(str) + ': ' + ratings['movie title'].map(str)

In [6]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

Keep the columns movie, user_id and rating in the ratings dataframe and drop all the others.

In [7]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [8]:
ratings = ratings[['user_id','movie','rating']]

## 3. Creating Binary Relevance Target from User Average <a class="anchor" id="br"></a>
Classification-style accuracy metrics cannot be computed directly on ratings, so we create a new binary target from the rating values: 1 if the user rated the movie at or above that user's average rating, and 0 if the user rated it below that average.

After that, we will create a list of relevant movies for every user with at least one relevant movie.

In [9]:
#Creating Train test split in order to extract average user rating from train set
X_train, X_test = train_test_split(ratings, test_size = 0.25, random_state=42)

In [10]:
# Average rating for each user using the ratings in train set
ratings_new = ratings.merge(X_train.groupby('user_id')['rating'].mean(), left_on = 'user_id', right_index=True, how = 'left')

In [11]:
#Renaming columns for better readability
ratings_new.columns = ['user_id', 'movie', 'actual_rating', 'user_average']

In [12]:
# Relevance for each user movie combination
ratings_new['relevant'] = (ratings_new['actual_rating'] >= ratings_new['user_average']).astype(int)
ratings_new.head()

| | user_id | movie | actual_rating | user_average | relevant |
|---|---|---|---|---|---|
| 0 | 196 | 242: Kolya (1996) | 3 | 3.571429 | 0 |
| 1 | 186 | 302: L.A. Confidential (1997) | 3 | 3.333333 | 0 |
| 2 | 22 | 377: Heavyweights (1994) | 1 | 3.28866 | 0 |
| 3 | 244 | 51: Legends of the Fall (1994) | 2 | 3.628415 | 0 |
| 4 | 166 | 346: Jackie Brown (1997) | 1 | 3.5 | 0 |


In [13]:
#Relevance Distribution
ratings_new['relevant'].value_counts(normalize = True)

1    0.54337
0    0.45663
Name: relevant, dtype: float64
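
The relevance rule is easy to verify by hand on a tiny example. The sketch below uses made-up ratings for two users and, unlike the notebook, computes each average from all of the toy rows rather than from a train split; `groupby(...).transform("mean")` is used here simply as a compact way to attach the per-user mean to every row.

```python
import pandas as pd

# Hypothetical ratings for two users
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "actual_rating": [5, 3, 1, 4, 4],
})

# Per-user mean broadcast back onto every row of that user
toy["user_average"] = toy.groupby("user_id")["actual_rating"].transform("mean")

# Same rule as above: relevant when the rating is at or above the user's average
toy["relevant"] = (toy["actual_rating"] >= toy["user_average"]).astype(int)
print(toy)
```

User 2 gives every movie the same rating, so every one of that user's movies counts as relevant under the `>=` rule. Ties like this are one plausible reason the relevant class can exceed 50%.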

In [14]:
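#Split into training and test datasets for the new ratings dataframe; reusing random_state=42 on the
#same rows reproduces the split from In [9], so user_average is based on training ratings only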
X_train, X_test = train_test_split(ratings_new, test_size = 0.25, random_state=42)

Create the ground-truth list of relevant movies, ordered by actual rating, from the test data for all users with at least 1 relevant movie.

In [15]:
# Sorting by the actual rating in the test set for each user in descending order
X_test = X_test.sort_values(by = ['user_id','actual_rating'], ascending = [True, False])

# Creating a list of relevant movies for each user
X_test_list = X_test[X_test['relevant'] == 1].groupby(['user_id'])['movie'].apply(lambda x: x.values.tolist())

In [16]:
#Actual relevant movies for all users with at least 1 relevant movie
X_test_list.head()

user_id
1    [100: Fargo (1996), 181: Return of the Jedi (1...
2    [242: Kolya (1996), 302: L.A. Confidential (19...
3    [328: Conspiracy Theory (1997), 181: Return of...
4                               [50: Star Wars (1977)]
5    [428: Harold and Maude (1971), 390: Fear of a ...
Name: movie, dtype: object

## 4. Fitting SVD to make rating predictions <a class="anchor" id="svdfit"></a>
Now we will fit the best SVD model, which uses 11 factors, and predict ratings for the test data.

In [17]:
#Importing functions to be used in this notebook from Surprise Package
from surprise import Dataset, Reader, SVD
from surprise.model_selection import GridSearchCV

#Reader object to import ratings from X_train
reader = Reader(rating_scale=(1, 5))

#Storing Data in surprise format from X_train
data = Dataset.load_from_df(X_train[['user_id','movie','actual_rating']], reader)

#Fitting the model on train data with the best parameters
model = SVD(n_factors = 11, n_epochs = 20, random_state = 42)

#build_full_trainset() lets us fit the SVD on the complete train set instead of on a part of it,
#as happens in cross validation for grid search
model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fd304a78a50>
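
For intuition about what the fitted model computes: the Surprise documentation describes the SVD prediction as the global mean plus a user bias, an item bias, and the dot product of the item and user factor vectors. The sketch below evaluates that formula with NumPy on made-up numbers (3 factors instead of 11). None of these values come from the fitted model, and the clipping simply mirrors the 1-5 `rating_scale` given to the `Reader`.

```python
import numpy as np

# Made-up parameters for one user and one movie
global_mean = 3.5                           # mu: mean of all training ratings
user_bias = 0.2                             # b_u
item_bias = -0.1                            # b_i
user_factors = np.array([0.5, -0.2, 0.1])   # p_u
item_factors = np.array([0.4, 0.3, -0.5])   # q_i

# r_hat = mu + b_u + b_i + q_i . p_u
raw = global_mean + user_bias + item_bias + item_factors @ user_factors

# Keep the estimate inside the rating scale
est = float(np.clip(raw, 1, 5))
print(round(est, 3))
```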

## 5. Root Mean Squared Error (RMSE) <a class="anchor" id="rmse"></a>
First, let us check the model's RMSE, which is a simple predictive accuracy metric.

In [18]:
#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [19]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise; .est is the estimated rating
y_pred = [model.predict(uid = user, iid = movie).est for (user, movie) in id_pairs]

#Actual rating values for test set
y_true = X_test['actual_rating']

# Checking performance on test set
rmse(y_true, y_pred)

0.9390125163978545
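
To see what the metric measures, here is a small hand-checkable RMSE computation on three hypothetical rating pairs, written with plain NumPy rather than scikit-learn.

```python
import numpy as np

# Three hypothetical actual and predicted ratings
y_true_toy = np.array([3.0, 4.0, 5.0])
y_pred_toy = np.array([2.5, 4.0, 4.0])

# Squared errors are 0.25, 0.0 and 1.0; RMSE is the root of their mean
rmse_toy = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
print(round(rmse_toy, 4))
```

Because errors are squared before averaging, the single miss of 1.0 contributes four times as much as the miss of 0.5. An RMSE of about 0.94 therefore says that predictions are typically off by a little under one rating point.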

## 6. Creating Predicted List of relevant movies <a class="anchor" id="pbr"></a>
Now it is time to create a ranked list of top movies based on the predicted ratings, so that we can demonstrate the ranking accuracy metrics.

In [20]:
#Creating a copy of test set
X_test_pred = X_test.copy()

#Predicted Rating for each user
X_test_pred['pred_rating'] = y_pred

In [21]:
#Predicted relevance based on average user rating
X_test_pred['pred_relevance'] = (X_test_pred['pred_rating'] >= X_test_pred['user_average']).astype(int)

In [22]:
#Creating predicted ranked lists of movies for all users with at least 1 predicted relevant movie
X_test_pred = X_test_pred.sort_values(by = ['user_id','pred_rating'], ascending = [True, False])
X_test_pred_list = X_test_pred[X_test_pred['pred_relevance'] == 1].groupby(['user_id'])['movie'].apply(lambda x: x.values.tolist())

In [23]:
#Only keeping the common users in both lists
X_test_pred_list = X_test_pred_list[X_test_pred_list.index.isin(X_test_list.index)].sort_index()
X_test_list = X_test_list[X_test_list.index.isin(X_test_pred_list.index)].sort_index()

## 7. Precision @ K <a class="anchor" id="patk"></a>
Let's calculate Precision@10 for the test set

In [24]:
#Precision@K: fraction of the top-k predicted movies that are actually relevant
def precision(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    result = len(act_set & pred_set) / float(k)
    return result

In [25]:
patk_list = []
for i,j in zip(X_test_list, X_test_pred_list):
    pr = precision(i,j,10)
    patk_list.append(pr)
    
np.mean(patk_list)

0.5395652173913044
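
A toy example helps to read this number. The sketch below restates the same formula under a hypothetical name, `precision_at_k`, and applies it to made-up lists.

```python
def precision_at_k(actual, predicted, k):
    # Share of the first k predictions that appear in the relevant set
    return len(set(actual) & set(predicted[:k])) / float(k)

actual_toy = ["a", "b", "c"]
predicted_toy = ["a", "x", "b", "y"]

p_at_2 = precision_at_k(actual_toy, predicted_toy, 2)   # 1 hit in the top 2
p_at_4 = precision_at_k(actual_toy, predicted_toy, 4)   # 2 hits in the top 4
p_short = precision_at_k(actual_toy, ["a"], 10)         # 1 hit, still divided by 10
print(p_at_2, p_at_4, p_short)
```

The last case shows a property of this definition: a user with fewer than k predicted movies can never reach a precision of 1, because the denominator stays at k.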

## 8. Recall @ K <a class="anchor" id="ratk"></a>
Let's calculate Recall@10 for the test set

In [26]:
#Recall@K: fraction of the actually relevant movies that appear in the top-k predictions
def recall(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    result = len(act_set & pred_set) / float(len(act_set))
    return result

In [27]:
recall_list = []
for i,j in zip(X_test_list, X_test_pred_list):
    re = recall(i,j,10)
    recall_list.append(re)
    
np.mean(recall_list)

0.5246095601689543
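
Recall shares its numerator with precision but divides by the number of relevant movies instead of k. The sketch below, using a hypothetical `recall_at_k` helper and made-up lists, shows how the value changes as k increases.

```python
def recall_at_k(actual, predicted, k):
    # Share of the relevant set that is recovered in the first k predictions
    act_set = set(actual)
    return len(act_set & set(predicted[:k])) / float(len(act_set))

actual_toy = ["a", "b", "c", "d"]
predicted_toy = ["a", "x", "b", "c"]

recalls = {k: recall_at_k(actual_toy, predicted_toy, k) for k in (1, 2, 4)}
print(recalls)
```

Recall can only stay flat or rise as k increases, since a longer prefix of the predicted list can only recover more of the relevant set.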

## 9. Mean Reciprocal Rank <a class="anchor" id="mrr"></a>
Implementing mean reciprocal rank using the lists of movies created with binary relevance.

In [28]:
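#Calculating Mean Reciprocal Rank
rr_list = []
#First (highest actual rating) relevant movie for each user in the ground-truth list
mrm = pd.Series(X_test_list.index).apply(lambda x: X_test_list.loc[x][0]).tolist()

for count,movie in enumerate(mrm):
    try:
        #X_test_pred_list[count+1] is a lookup by user id, so it returns the matching user's
        #predictions only while the user ids run consecutively from 1
        rr = 1/(1 + X_test_pred_list[count+1].index(movie))
    except (KeyError, ValueError):
        #User id not present, or movie missing from that predicted list
        rr = 0
    rr_list.append(rr)
mrr = np.mean(rr_list)
print(mrr)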

0.017333953433209728
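
Mean reciprocal rank is conventionally defined as the average, over users, of 1 divided by the rank of the first relevant item in that user's predicted list. The sketch below is one way to compute it while matching the two lists by user id instead of by loop position; the toy Series and the `reciprocal_rank` helper are hypothetical, and this is an illustration rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user lists, indexed by non-consecutive user ids
actual_toy = pd.Series({2: ["a", "b"], 5: ["c"], 9: ["d"]})
predicted_toy = pd.Series({2: ["x", "a", "b"], 5: ["c", "y"], 9: ["z"]})

def reciprocal_rank(relevant, ranked):
    # 1 / rank of the first ranked item that is relevant, or 0 if none is
    relevant = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Look up each user's predictions by user id, not by position
rr_toy = [
    reciprocal_rank(rel, predicted_toy.loc[uid])
    for uid, rel in actual_toy.items()
]
mrr_toy = np.mean(rr_toy)
print(rr_toy, mrr_toy)
```

Matching by user id matters whenever some users have been filtered out, because a positional counter then no longer corresponds to the user id stored in the index.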


## 10. Mean Average Precision @ K <a class="anchor" id="mapk"></a>
Here we will calculate Mean Average Precision@K, using K = 5, and check the result.

In [29]:
#Implementation of MAP@K
def apk(actual, predicted, k=10):
    #Clip the predicted list to its first k items
    if len(predicted)>k:
        predicted = predicted[:k]
    
    score = 0.0
    num_hits = 0.0
    
    #Add up the precision at every rank where a new relevant item appears
    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0
    #Average Precision @K
    return score / min(len(actual), k)

#Calculating average across all users
def mapk(actual, predicted, k=10):
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])


In [30]:
#Calculating MAP@5 for our use case
mapk(X_test_list, X_test_pred_list, k = 5)

0.6675380434782608
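
Average precision rewards placing relevant items early in the list. The sketch below mirrors the `apk` logic above in a compact, hypothetically named helper and compares two made-up rankings that contain the same hits at different positions.

```python
def average_precision_at_k(actual, predicted, k):
    # Same logic as apk above, for a single user
    hits, score = 0, 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k) if actual else 0.0

relevant_toy = ["a", "b", "c"]

# Hits at ranks 1 and 3: (1/1 + 2/3) / min(3, 3)
ap_early = average_precision_at_k(relevant_toy, ["a", "x", "b"], 3)

# Hits at ranks 2 and 3: (1/2 + 2/3) / min(3, 3)
ap_late = average_precision_at_k(relevant_toy, ["x", "a", "b"], 3)
print(round(ap_early, 4), round(ap_late, 4))
```

Both rankings have the same precision@3, yet the one with the earlier hit scores higher, which is the rank sensitivity that plain precision lacks.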

## 11. Normalised Discounted Cumulative Gain@K <a class="anchor" id="ndcg"></a>
This time we will not use binary relevance. Instead, we rank each user's test movies by predicted rating and use the actual ratings as gains to compute NDCG, as discussed in the video on Rank Aware Metrics.

In [31]:
#Ranking with predicted rating
dff = X_test_pred.sort_values(by = ['user_id','pred_rating'], ascending = [True, False])

In [32]:
#Finding actual ratings corresponding to the ranked recommended list
dff_temp = dff.groupby(['user_id'])['actual_rating'].apply(lambda x: x.values.tolist())

In [33]:
#Calculating DCG
def dcg_at_k(r, k):
    #Convert the gains to a float array and keep the first k
    r = np.asarray(r, dtype=float)[:k]
    return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))

#Dividing DCG by IDCG (DCG Max) to get NDCG@K
def ndcg_at_k(r, k):
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k) / dcg_max

In [34]:
#Calculating NDCG@10 for given use case
ndcg_list = []
for i in dff_temp.index:
    ndcg_user = ndcg_at_k(r = dff_temp[i], k=10)
    ndcg_list.append(ndcg_user)
np.mean(ndcg_list)

0.9050174244115038
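
A small worked example makes the NDCG value easier to interpret. The sketch below uses the same discount as `dcg_at_k` above (ranks 1 and 2 undiscounted, then division by log2 of the rank) under hypothetical helper names, on the made-up ratings of one user listed in predicted-rank order.

```python
import numpy as np

def dcg(gains, k):
    # Ranks 1 and 2 are undiscounted; rank i >= 3 is divided by log2(i)
    g = np.asarray(gains, dtype=float)[:k]
    return g[0] + np.sum(g[1:] / np.log2(np.arange(2, g.size + 1)))

def ndcg(gains, k):
    # Normalise by the DCG of the best possible ordering
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0

ndcg_ideal = ndcg([5, 4, 3], k=3)   # ideal order
ndcg_mixed = ndcg([3, 5, 4], k=3)   # imperfect order
ndcg_worst = ndcg([3, 4, 5], k=3)   # reversed order
print(ndcg_ideal, ndcg_mixed, ndcg_worst)
```

In this toy case even the reversed ordering stays above 0.9, because all gains are positive ratings of similar size and only rated movies are ranked. That suggests NDCG values computed this way should be compared against each other rather than read as absolute quality.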

## 12. What's Next? <a class="anchor" id="whatsnext"></a>
Now you can try different algorithms and optimize for these metrics, reusing the functions above for other problems as well. The Rank Aware metrics are especially useful for content-based recommender systems, which will be the focus of our next module.