# Recommender System - LightFM

# Table of Contents

- [1. Import Package](#package)
- [2. Import Data](#data)
- [3. Data Analysis](#analysis)
    - [3.1. Shape of Data](#analysis_shape)
    - [3.2. The Number of User and Movie](#analysis_numbof)
    - [3.3. How Much Rating Given by User](#analysis_ratinguser)
    - [3.4. How Much Rater of Each Movie](#analysis_rater)
- [4. Data Preprocessing](#prep)
    - [4.1. Create Data Train](#prep_train)
        - [4.1.1. Create Interaction Data](#prep_train_intr)
        - [4.1.2. Create User Feature](#prep_train_userfeature)
        - [4.1.3. Create Item Feature](#prep_train_itemfeature)
    - [4.2. Create Data Test](#prep_test)
    - [4.3. Create LightFM Dataset](#prep_lfmdataset)
        - [4.3.1. Create Interaction LightFM Dataset](#prep_lfmdataset_intr)
        - [4.3.2. Create User Feature LightFM Dataset](#prep_lfmdataset_uf)
        - [4.3.3. Create Item Feature LightFM Dataset](#prep_lfmdataset_if)
        - [4.3.4. Take A Look At LightFM Dataset](#prep_lfmdataset_look)
- [5. Modeling](#model)
    - [5.1. Hyperparameter Tuning](#model_tuning)
    - [5.2. Training](#model_train)
    - [5.3. Evaluation](#model_eval)
- [6. Generate Recommendation](#generate)

# Import Package <a class='anchor' id='package'></a>

We use <a href = 'https://making.lyst.com/lightfm/docs/home.html' target = '_blank'>LightFM</a> library to create a hybrid recommender system using LightFM.

Get to know with LightFM:
- <a href = 'https://making.lyst.com/lightfm/docs/home.html' target = '_blank'>Documentation</a>
- <a href = 'https://github.com/lyst/lightfm' target = '_blank'>Github</a>
- <a href = 'https://arxiv.org/pdf/1507.08439.pdf' target = '_blank'>Paper</a>

In [280]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from collections import defaultdict
import pickle
import itertools
import datetime

from sklearn.preprocessing import MultiLabelBinarizer

import bisect 

from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k,recall_at_k, reciprocal_rank, auc_score
from lightfm.cross_validation import random_train_test_split
from scipy.sparse import coo_matrix as sp
import os

# Import Data <a class='anchor' id='data'></a>

We use Movie Recommender System Dataset from <a href = 'https://www.kaggle.com/datasets/gargmanas/movierecommenderdataset' target = '_blank'>Kaggle</a>

In [2]:
data_mv = pd.read_csv('Data/movies.csv')
data_rating = pd.read_csv('Data/ratings.csv')

In [3]:
data_mv.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [4]:
data_rating.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


# Data Analysis <a class='anchor' id='analysis'></a>

## Shape of Data <a class='anchor' id='analysis_shape'></a>

In [5]:
print("shape of data movie : ", data_mv.shape)
print("shape of data rating : ", data_rating.shape)

shape of data movie :  (9742, 3)
shape of data rating :  (100836, 4)


## Number of User and Movie <a class='anchor' id='analysis_numbof'></a>

In [6]:
print("number of user : ", data_rating['userId'].nunique())
print("number of movie : ", data_mv['movieId'].nunique())

number of user :  610
number of movie :  9742


## How Much Rating Given by User <a class='anchor' id='analysis_ratinguser'></a>

In [7]:
ratingbyuser = data_rating.groupby('userId')[['rating']].count().sort_values('rating').reset_index()

In [8]:
ratingbyuser.head(10)

Unnamed: 0,userId,rating
0,442,20
1,406,20
2,147,20
3,194,20
4,569,20
5,576,20
6,431,20
7,207,20
8,278,20
9,320,20


In [9]:
ratingbyuser.tail(10)

Unnamed: 0,userId,rating
600,288,1055
601,606,1115
602,380,1218
603,68,1260
604,610,1302
605,274,1346
606,448,1864
607,474,2108
608,599,2478
609,414,2698


In [10]:
ratingbyuser[['rating']].describe()

Unnamed: 0,rating
count,610.0
mean,165.304918
std,269.480584
min,20.0
25%,35.0
50%,70.5
75%,168.0
max,2698.0


## How Much Rater of Each Movie <a class='anchor' id='analysis_rater'></a>

In [11]:
rater = data_rating.groupby('movieId')[['rating']].count().sort_values('rating').reset_index()

In [12]:
rater.head(10)

Unnamed: 0,movieId,rating
0,193609,1
1,4032,1
2,57526,1
3,57522,1
4,57502,1
5,57499,1
6,57421,1
7,57326,1
8,57147,1
9,4046,1


In [13]:
rater.tail(10)

Unnamed: 0,movieId,rating
9714,527,220
9715,589,224
9716,110,237
9717,480,238
9718,260,251
9719,2571,278
9720,593,279
9721,296,307
9722,318,317
9723,356,329


In [14]:
rater[['rating']].describe()

Unnamed: 0,rating
count,9724.0
mean,10.369807
std,22.401005
min,1.0
25%,1.0
50%,3.0
75%,9.0
max,329.0


# Data Preprocessing <a class='anchor' id='prep'></a>

LightFM is a Hybrid recommender system so we need the user-item rating, feature of item, and feature of user for training process. We will create two kinds of data, namely data train and data test.

- data train : all historical rated user-item data that we have and the feature of user and item
- data test  : all unrated user-item pairs (we will predict the recommendation score of those pairs)

LightFM use implicit feedback to create recommendation so we don't need the rating value, we only need to know what items have been rated by each user. The rated user-item pairs will be called as interaction data.

we have 610 users and 9742 movies so to create the data test we need to create all user-item pair data. If it is too large then we can divide the data into several batches in the predict process later, while now we can create a function to generate the data test.

In the end of preprocessing stage we need to create <a href='https://making.lyst.com/lightfm/docs/lightfm.data.html' target='_blank'>LighFM Dataset</a>.

## Create Data Train <a class='anchor' id='prep_train'></a>

**Interaction data** contains the interaction of user and item. In this case intercation between the user and the item is determined by the rating. In other cases we can use many things as interaction between user and item such as transaction activity, wishlist, cart, etc. 

**Feature** in LighFM are only categorical feature so continuous features must be discretized. In LightFM it is possible for a user or item to have no features at all or have multiple valie features. You will understand more later when we enter the stage of creating the LightFM dataset

In [15]:
# DISCRETIZATION FUNCTION

def discretize(value,split_points):
    value_position = np.digitize(value,split_points)
    
    if value_position == 0:
        return "<"+str(split_points[0])
    elif value_position == len(split_points):
        return ">"+str(split_points[-1])
    else:
        return str(split_points[value_position-1])+" - "+str(split_points[value_position])

### Create Interaction Data <a class='anchor' id='prep_train_intr'></a>

In [16]:
data_interaction = data_rating[['userId','movieId','rating']].copy()

### Create User Feature <a class='anchor' id='prep_train_userfeature'></a>

we will create some simple features for user such as:
- the number of rating given by user
- the most rated genre

In [93]:
user_feature = data_rating[['userId','movieId']].merge(data_mv[['movieId','genres']], how = 'left', on = 'movieId')

In [94]:
user_feature.head(5)

Unnamed: 0,userId,movieId,genres
0,1,1,Adventure|Animation|Children|Comedy|Fantasy
1,1,3,Comedy|Romance
2,1,6,Action|Crime|Thriller
3,1,47,Mystery|Thriller
4,1,50,Crime|Mystery|Thriller


In [99]:
# THE NUMBER OF RATING GIVEN BY USER

rating_given = user_feature.groupby('userId').agg(numb_rating = ('userId','count')).reset_index()
split_point = [
    rating_given['numb_rating'].quantile(q=0.25),
    rating_given['numb_rating'].quantile(q=0.5),
    rating_given['numb_rating'].quantile(q=0.75),
    rating_given['numb_rating'].quantile(q=0.75)+(1.5*(rating_given['numb_rating'].quantile(q=0.75)-rating_given['numb_rating'].quantile(q=0.25)))
]
rating_given['numb_rating'] = rating_given['numb_rating'].apply(lambda val : discretize(val, split_point))

In [100]:
# THE MOST RATED GENRE

def count_genres(genres, sep = "|"):
    ls_genres = genres.split(sep)
    dict_genres = {}
    for i in ls_genres:
        if i in dict_genres:
            dict_genres[i] = dict_genres[i]+1
        else:
            dict_genres[i] = 1
    return dict_genres

def most_rated_genres(genres, sep = "|"):
    counted_genres = count_genres(genres, sep)
    max_value = max(counted_genres.values())
    most_rated = [i for i,j in counted_genres.items() if j == max_value]
    most_rated = '|'.join(most_rated)
    return most_rated

most_rated = user_feature.copy()
most_rated['genres'] = most_rated['genres'] + "|"
most_rated = most_rated.groupby('userId')[['genres']].sum().reset_index()
most_rated['genres'] = most_rated['genres'].apply(lambda genre : most_rated_genres(genre))
most_rated.loc[most_rated['genres'] == '(no genres listed)', 'genres'] = None

In [101]:
most_rated.head(5)

Unnamed: 0,userId,genres
0,1,Action
1,2,Drama
2,3,Drama
3,4,Drama
4,5,Drama


In [102]:
# MERGE FEATURE

user_feature = rating_given.merge(most_rated, on = 'userId')

In [103]:
user_feature.head(10)

Unnamed: 0,userId,numb_rating,genres
0,1,168.0 - 367.5,Action
1,2,<35.0,Drama
2,3,35.0 - 70.5,Drama
3,4,168.0 - 367.5,Drama
4,5,35.0 - 70.5,Drama
5,6,168.0 - 367.5,Drama
6,7,70.5 - 168.0,Action
7,8,35.0 - 70.5,Comedy
8,9,35.0 - 70.5,Drama
9,10,70.5 - 168.0,Comedy


In user features, numb_rating was a continuous feature and we already discretized it. Genres now become the most rated genres and it is a multiple value feature. We will deal with them again later when creating the LightFM Dataset.

### Create Item Feature <a class='anchor' id='prep_train_itemfeature'></a>

we will create some simple features for item such as:
- genres (we leave it as it is)
- release year (we will extract it from the title)

In [120]:
mv_feature = data_mv.copy()

#EXTRACT YEAR FROM TITLE
mv_feature['title'] = mv_feature['title'].str.strip()
mv_feature['release_year'] = mv_feature['title'].str.split(" ").str[-1]
mv_feature['release_year'] = mv_feature['release_year'].str.replace("[^0-9]","")
mv_feature['release_year'] = mv_feature['release_year'].str[0:4]
mv_feature.loc[mv_feature['release_year'] != '', 
               'release_year'] = mv_feature.loc[mv_feature['release_year'] != '', 
                                                'release_year'].apply(lambda year : discretize(int(year),
                                                                                               [1990,2000,
                                                                                                2010,2015,
                                                                                                2020]))
mv_feature.loc[mv_feature['release_year'] == '', 'release_year'] = None

# litle preprocess for genres
mv_feature.loc[mv_feature['genres'] == "(no genres listed)", 'genres'] = None

mv_feature = mv_feature.drop('title',axis=1)

  mv_feature['release_year'] = mv_feature['release_year'].str.replace("[^0-9]","")


In [122]:
mv_feature.head(10)

Unnamed: 0,movieId,genres,release_year
0,1,Adventure|Animation|Children|Comedy|Fantasy,1990 - 2000
1,2,Adventure|Children|Fantasy,1990 - 2000
2,3,Comedy|Romance,1990 - 2000
3,4,Comedy|Drama|Romance,1990 - 2000
4,5,Comedy,1990 - 2000
5,6,Action|Crime|Thriller,1990 - 2000
6,7,Comedy|Romance,1990 - 2000
7,8,Adventure|Children,1990 - 2000
8,9,Action,1990 - 2000
9,10,Action|Adventure|Thriller,1990 - 2000


## Create Data Test <a class='anchor' id='prep_test'></a>

In [285]:
# CREATE DATA TEST
def generate_data_test(user_df, mv_df, rated_pair_df):
    temp_user = user_df.copy()
    temp_user['key'] = 1

    temp_mv = mv_df.copy()
    temp_mv['key'] = 1

    #cross join temp_user and temp_mv
    data_test = temp_user.merge(temp_mv,on='key').drop('key',axis=1)

    #join data test and a data that contains rated user-item pairs to get the rating
    data_test = data_test.merge(rated_pair_df, how='left', on=['userId','movieId'])

    #exclude rated data on data test
    data_test = data_test[data_test['rating'].isna()]
    
    return data_test

## Create LightFM Dataset <a class='anchor' id='prep_lfmdataset'></a>

Dataset in LighFM is like a dictionary to store every user and item and their features. The LightFM Dataset is not a tabular data so there are no columns there. We will put all the features in a list. Look at the <a href='https://making.lyst.com/lightfm/docs/lightfm.data.html' target='_blank'>documentation</a> to know more about LightFM Dataset

To distinct one feature with another (as mentioned before that LightFM Dataset have no columns) we can add the column name before the value of each feature. For example, for **action** genre we can change it to **genres:action**

In [150]:
dataset = Dataset()

### Create Interaction LightFM Dataset <a class='anchor' id='prep_lfmdataset_intr'></a>

In [199]:
# STORE USER ID AND MOVIE ID TO DATASET
dataset.fit(data_interaction['userId'].unique(),
           data_interaction['movieId'].unique())

print("interaction shape = ", dataset.interactions_shape())

#BUILD INTERACTION
(interactions, weights) = dataset.build_interactions((tuple(i) for i in data_interaction[['userId','movieId']].values))

interaction shape =  (610, 9724)


In [200]:
# INTERACTION
interactions.toarray()

array([[1, 1, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 1, 1, 1]], dtype=int32)

In [201]:
# WEIGHTS
weights.toarray()

array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 1., ..., 1., 1., 1.]], dtype=float32)

### Create User Feature LightFM Dataset <a class='anchor' id='prep_lfmdataset_uf'></a>

In [202]:
# you can store users who have no interactions if any
dataset.fit_partial(users = user_feature['userId'].unique())

In [203]:
#ADD COLUMN NAME BEFORE THE VALUE OF EACH FEATURE

for i in user_feature.columns[1:]:
    user_feature[i] = str(i) + ":" + user_feature[i]

In [204]:
user_feature.head()

Unnamed: 0,userId,numb_rating,genres
0,1,numb_rating:numb_rating:168.0 - 367.5,genres:genres:Action
1,2,numb_rating:numb_rating:<35.0,genres:genres:Drama
2,3,numb_rating:numb_rating:35.0 - 70.5,genres:genres:Drama
3,4,numb_rating:numb_rating:168.0 - 367.5,genres:genres:Drama
4,5,numb_rating:numb_rating:35.0 - 70.5,genres:genres:Drama


In [205]:
# STORE THE VALUES OF EACH FEATURE AS LIST
for i in user_feature.columns[1:]:
    dataset.fit_partial(user_features = [j for j in user_feature[i] if pd.isnull(j) == False])
    
# BUILD USER FEATURE => [(userId,{features}), ....]
user_features = [(i[0],{j for j in i[1:] if pd.isnull(j) == False}) for i in user_feature.values]
user_features = dataset.build_user_features(user_features, normalize = True)

In [206]:
user_features.toarray()

array([[0.33333334, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.33333334, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.33333334, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], dtype=float32)

### Create Item Feature LightFM Dataset <a class='anchor' id='prep_lfmdataset_if'></a>

In [207]:
# you can store items who have no interactions if any
dataset.fit_partial(items = mv_feature['movieId'].unique())

In [208]:
#ADD COLUMN NAME BEFORE THE VALUE OF EACH FEATURE

def add_column_name_for_multiplevalues(str_multival, col_name, sep = "|"):
    try:
        split_multival = str_multival.split(sep)
        named_multival = [col_name+":"+i for i in split_multival]
        return "|".join(named_multival)
    except:
        return str_multival

mv_feature['genres'] = mv_feature['genres'].apply(lambda genre : add_column_name_for_multiplevalues(genre,'genres'))
mv_feature['release_year'] = "release_year" + ":" + mv_feature['release_year']

In [209]:
# STORE THE VALUES OF EACH FEATURE AS LIST
dataset.fit_partial(item_features = "|".join([i for i in mv_feature['genres'] if pd.isnull(i) == False]).split("|"))
dataset.fit_partial(item_features = [i for i in mv_feature['release_year'] if pd.isnull(i) == False])
    
# BUILD ITEM FEATURE => [(MovieId,{features}), ....]
item_features = [(i[0],{j for j in "|".join
                        ([k for k in i[1:] if pd.isnull(k) == False]).split("|") if j != ''}) for i in mv_feature.values]
item_features = dataset.build_item_features(item_features, normalize = True)

In [210]:
item_features.toarray()

array([[0.14285715, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.25      , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.2       , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.25      ,
        0.        ]], dtype=float32)

### Take a Look At LightFM Dataset <a class='anchor' id='prep_lfmdataset_look'></a>

Complete LightFM dataset contains 4 dictionary:
- **user mapping**, contains map of userId (index = 0)
- **user feature mapping**, contains map of user feature (index = 1)
- **item mapping**, contains map of item Id (index = 2)
- **item feature mapping**, contains map of item feature (index = 3)

In [215]:
userId_map, user_feature_map, item_map, item_feature_map = dataset.mapping()

In [248]:
# userId Map
print("First 10 : ",list(userId_map.items())[:10])
print("Last 10 : ",list(userId_map.items())[-10:])

First 10 :  [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9)]
Last 10 :  [(601, 600), (602, 601), (603, 602), (604, 603), (605, 604), (606, 605), (607, 606), (608, 607), (609, 608), (610, 609)]


In [249]:
# user feature Map
print("First 10 : ",list(user_feature_map.items())[:10])
print("Last 10 : ",list(user_feature_map.items())[-10:])

First 10 :  [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9)]
Last 10 :  [('genres:genres:Thriller|Comedy|Adventure', 633), ('genres:genres:Thriller|Drama', 634), ('genres:genres:Crime|Drama', 635), ('genres:genres:Drama|Comedy', 636), ('genres:genres:Animation', 637), ('genres:genres:Comedy|Drama|Adventure|Action', 638), ('genres:genres:Horror', 639), ('genres:genres:Adventure|Fantasy', 640), ('genres:genres:Adventure|Drama', 641), ('genres:genres:Fantasy', 642)]


In [250]:
# item map
print("First 10 : ",list(item_map.items())[:10])
print("Last 10 : ",list(item_map.items())[-10:])

First 10 :  [(1, 0), (3, 1), (6, 2), (47, 3), (50, 4), (70, 5), (101, 6), (110, 7), (151, 8), (157, 9)]
Last 10 :  [(7020, 9732), (7792, 9733), (8765, 9734), (25855, 9735), (26085, 9736), (30892, 9737), (32160, 9738), (32371, 9739), (34482, 9740), (85565, 9741)]


In [252]:
# item feature map
print("First 10 : ",list(item_feature_map.items())[:10])
print("Last 10 : ",list(item_feature_map.items())[-10:])

First 10 :  [(1, 0), (3, 1), (6, 2), (47, 3), (50, 4), (70, 5), (101, 6), (110, 7), (151, 8), (157, 9)]
Last 10 :  [('genres:genres:Musical', 9756), ('genres:genres:Documentary', 9757), ('genres:genres:IMAX', 9758), ('genres:genres:Western', 9759), ('genres:genres:Film-Noir', 9760), ('release_year:release_year:1990 - 2000', 9761), ('release_year:release_year:<1990', 9762), ('release_year:release_year:2000 - 2010', 9763), ('release_year:release_year:2010 - 2015', 9764), ('release_year:release_year:2015 - 2020', 9765)]


# Modeling <a class='anchor' id='model'></a>

## Hyperparameter Tuning <a class='anchor' id='model_tuning'></a>

We will use random searh to do the hyperparameter tuning. We can choose any evaluation method for tuning as long as the method is supported by LightFM. See <a href = 'https://making.lyst.com/lightfm/docs/lightfm.evaluation.html' target = '_blank'>documentaion</a> to know available evaluation methods.

In [270]:
def sample_hyperparameters():
    """
    Yield possible hyperparameter choices.
    """

    while True:
        yield {
            "no_components": np.random.randint(16, 64),
            "learning_schedule": np.random.choice(["adagrad", "adadelta"]),
            "loss": np.random.choice(["bpr", "warp", "warp-kos"]),
            "learning_rate": np.random.exponential(0.05),
            "item_alpha": np.random.exponential(1e-8),
            "user_alpha": np.random.exponential(1e-8),
            "max_sampled": np.random.randint(5, 15),
            "num_epochs": np.random.randint(5, 50),
        }

def random_search(train, test, item_features = None, user_features = None, evaluation_method = 'recall_at_k',num_samples=10, num_threads = 1):
    """
    Sample random hyperparameters, fit a LightFM model, and evaluate it
    on the test set.

    Parameters
    ----------

    train: np.float32 coo_matrix of shape [n_users, n_items]
        Training data.
    test: np.float32 coo_matrix of shape [n_users, n_items]
        Test data.
    evaluation_method = str, optional choose one of ['recall_at_k','precision_at_k','auc_score','reciprocal_rank']
        method to evaluate the model 
    num_samples: int, optional
        Number of hyperparameter choices to evaluate.
    num_threads: int, optional
        Number of threads


    Returns
    -------

    generator of (evaluation_score, hyperparameter dict)

    """
    results = []
    for hyperparams in itertools.islice(sample_hyperparameters(), num_samples):
        num_epochs = hyperparams.pop("num_epochs")

        model = LightFM(**hyperparams)
        model.fit(train, 
                  item_features = item_features, 
                  user_features = user_features,
                  epochs=num_epochs, 
                  num_threads=num_threads)

        if evaluation_method == 'recall_at_k':
            score = recall_at_k(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=num_threads).mean()
        elif evaluation_method == 'precision_at_k':
            score = precision_at_k(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=num_threads).mean()
        elif evaluation_method == 'auc_score':
            score = auc_score(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=num_threads).mean()
        elif evaluation_method == 'reciprocal_rank':
            score = reciprocal_rank(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=num_threads).mean()
        else:
            print("available evaluation method = ['recall_at_k','precision_at_k','auc_score','reciprocal_rank']")
            return (0,{})

        hyperparams["num_epochs"] = num_epochs
        results.append(tuple([score,hyperparams]))

    return max(results, key = lambda x:x[0])

In [271]:
(train, test) = random_train_test_split(interactions=interactions, test_percentage=0.2)

In [272]:
score,hyperparameter = random_search(train,
                                     test,
                                     item_features = item_features,
                                     user_features = user_features,
                                     evaluation_method='recall_at_k',
                                     num_threads = os.cpu_count())

In [273]:
print("score = ",score)
print()
print("hyperparameter =  ")
print(hyperparameter)

score =  0.11259901687886416

hyperparameter =  
{'no_components': 57, 'learning_schedule': 'adadelta', 'loss': 'warp-kos', 'learning_rate': 0.029479367712465857, 'item_alpha': 3.9639333271664814e-10, 'user_alpha': 8.708878975209094e-09, 'max_sampled': 11, 'num_epochs': 46}


## Training <a class='anchor' id='model_train'></a>

We will use <a href = 'https://making.lyst.com/lightfm/docs/lightfm.html' target = '_blank'>LightFM Method</a> from LightFM library.

In [274]:
model = LightFM()

epoch = hyperparameter.pop('num_epochs')
model.set_params(**hyperparameter)

<lightfm.lightfm.LightFM at 0x7fd3cea15550>

In [279]:
model.fit(interactions, 
          item_features = item_features, 
          user_features = user_features,
          epochs=epoch, 
          num_threads=os.cpu_count())

<lightfm.lightfm.LightFM at 0x7fd3cea15550>

## Evaluation <a class='anchor' id='model_eval'></a>

There are two types of evaluation in recommender system, namely online evaluation and offline evaluation. We will use offline evaluation and calculate the recall_at_k, precision_at_k, auc_score, and reciprocal_rank. Check <a href = 'https://making.lyst.com/lightfm/docs/lightfm.evaluation.html' target = '_blank'>documentaion</a> to know more about model evaluation in LightFM

In [281]:
# PRECISION_AT_K
precision_at_k(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=os.cpu_count()).mean()

0.30032897

In [282]:
# RECALL_AT_K
recall_at_k(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=os.cpu_count()).mean()

0.15421254492619657

In [283]:
# AUC_SCORE
auc_score(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=os.cpu_count()).mean()

0.94467086

In [284]:
# RECIPROCAL_RANK
reciprocal_rank(model, test, train_interactions=train,item_features=item_features,user_features=user_features, num_threads=os.cpu_count()).mean()

0.52448004

The evaluation results of our model are quite good! We have small recall but don't worry it's because there are so many positive items that user have in test set. If you want to know the definition of each calculation go check the <a href = 'https://making.lyst.com/lightfm/docs/lightfm.evaluation.html' target = '_blank'>documentaion</a>

# Generate Recommendation <a class='anchor' id='generate'></a>

We will generate top n recommendation for every user by predict the rating of the unrated user-item pairs (data test). In LightFM we need to input the user index and item index (from LightFM Dataset we built before) to get the recommendation score. Check the predict method in <a href='https://making.lyst.com/lightfm/docs/lightfm.html?highlight=predict#lightfm.LightFM.predict' target='_blank'>documentation</a> to know more.

In [None]:
def generate_data_test(user_df, mv_df, rated_pair_df):
    temp_user = user_df.copy()
    temp_user['key'] = 1

    temp_mv = mv_df.copy()
    temp_mv['key'] = 1

    #cross join temp_user and temp_mv
    data_test = temp_user.merge(temp_mv,on='key').drop('key',axis=1)

    #join data test and a data that contains rated user-item pairs to get the rating
    data_test = data_test.merge(rated_pair_df, how='left', on=['userId','movieId'])

    #exclude rated data on data test
    data_test = data_test[data_test['rating'].isna()]
    
    return data_test

In [288]:
# create user data
user_df = pd.DataFrame({'userId':list(dataset.mapping()[0].keys()),
                        'user_index':list(dataset.mapping()[0].values())})

# create movie data
mv_df = pd.DataFrame({'movieId':list(dataset.mapping()[2].keys()),
                      'movie_index':list(dataset.mapping()[2].values())})

In [290]:
user_df.head()

Unnamed: 0,userId,user_index
0,1,0
1,2,1
2,3,2
3,4,3
4,5,4


In [291]:
mv_df.head()

Unnamed: 0,movieId,movie_index
0,1,0
1,3,1
2,6,2
3,47,3
4,50,4


In [292]:
# create batch (we will predict 100 by 100 users)

batch = list(range(0,len(user_df['userId']),100)) + [len(user_df['userId'])]
batch

[0, 100, 200, 300, 400, 500, 600, 610]

In [None]:
model.predict(df_user_batch['user_index'].values, df_user_batch['course_index'].values,item_features=item_features, user_features=user_features)

In [294]:
n_recommendation = 10 #get top 10 recommendation
res_recommendation = pd.DataFrame()

for i in tqdm(range(1,len(batch))):
    #generate data test
    data_test = generate_data_test(user_df.loc[batch[i-1]:batch[i]],mv_df,data_interaction)
    
    #predict
    data_test['prediction'] = model.predict(data_test['user_index'].values, 
                                            data_test['movie_index'].values,
                                            item_features=item_features, 
                                            user_features=user_features,
                                            num_threads=os.cpu_count())
    
    #sort to get top n
    data_test = data_test.sort_values('prediction',ascending=False).reset_index(drop=True)
    data_test = data_test.groupby('userId').head(10)
    
    res_recommendation = res_recommendation.append(data_test[['userId','movieId','prediction']])

100%|██████████| 7/7 [00:04<00:00,  1.68it/s]


In [299]:
res_recommendation = res_recommendation.sort_values(['userId','prediction'],ascending=[True,False]).reset_index(drop=True)

In [301]:
res_recommendation.head(10)

Unnamed: 0,userId,movieId,prediction
0,1,1200,-4161.587402
1,1,858,-4161.893066
2,1,588,-4161.947754
3,1,3114,-4162.174805
4,1,1610,-4162.375488
5,1,551,-4162.431152
6,1,1036,-4162.446289
7,1,589,-4162.474121
8,1,2,-4162.498535
9,1,2791,-4162.63623


In [302]:
res_recommendation[res_recommendation['userId'] == 600]

Unnamed: 0,userId,movieId,prediction
6040,600,1968,-10644.774414
6041,600,586,-10645.720703
6042,600,1580,-10645.880859
6043,600,333,-10645.952148
6044,600,785,-10646.072266
6045,600,1777,-10646.223633
6046,600,597,-10646.237305
6047,600,2396,-10646.295898
6048,600,3363,-10646.397461
6049,600,2100,-10646.549805


Here it is! in our results we have userId, movieId, and prediction as recommendation score