LightFM 
- hybrid recommender system that unites the advantages of content-based and collaborative recommenders
- matrix factorization model that represents users and items as **linear combinations** of their latent content feature representations (embeddings)
  - **feature embeddings are summed together** to produce a representation of the user/item
  - e.g. gender [0.1, 0.1, 0.6] + preference for action movie [0.6, 0.3, 0.2] -> user representation [0.7, 0.4, 0.8]
- user & item embeddings are learned during training to encode users' preferences over items and semantic information about the items
  - the dot product of the user & item embeddings (adjusted by bias terms) gives a score for each user/item pair
- handles cold-start problems better than a pure collaborative matrix factorization model, because it can fall back on content/feature-based recommendations when interaction data is lacking
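
The scoring rule described above can be sketched in plain Python. The user feature vectors reuse the toy numbers from the example; the item features, the bias value, and the `embed` / `score` helpers are made up for illustration and are not part of LightFM's API.

```python
# Toy latent vectors for individual features (3 dimensions).
# User features reuse the numbers from the example above; item features are invented.
user_feature_emb = {
    "gender": [0.1, 0.1, 0.6],
    "likes_action": [0.6, 0.3, 0.2],
}
item_feature_emb = {
    "genre:Action": [0.5, 0.2, 0.1],
    "decade:1990": [0.1, 0.4, 0.3],
}

def embed(feature_names, table):
    """Sum the embeddings of all active features into one representation."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for name in feature_names:
        total = [t + v for t, v in zip(total, table[name])]
    return total

def score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Dot product of the two representations, adjusted by bias terms."""
    return sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias

user_vec = embed(["gender", "likes_action"], user_feature_emb)
item_vec = embed(["genre:Action", "decade:1990"], item_feature_emb)
pair_score = score(user_vec, item_vec, item_bias=0.1)

print([round(x, 2) for x in user_vec])  # the summed user representation
print(round(pair_score, 3))
```

A user or item with no interaction history still gets a representation here, as long as at least one of its features has a learned embedding. That is the cold-start advantage mentioned above.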

In [1]:
import pandas as pd
import numpy as np
from lightfm import LightFM
import math
import re



In [2]:
# read data
rating = pd.read_csv("../data/movielens/ratings.csv")
movie = pd.read_csv("../data/movielens/movies.csv")
tag = pd.read_csv("../data/movielens/tags.csv")

In [3]:
movie.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


### Preprocessing - Item Features

In [4]:
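# extract year
def extract_year(s):
    """Return the release year from a title such as 'Toy Story (1995)'."""
    try:
        # raw string, so that \( is not treated as a string escape
        return int(re.findall(r'\(([0-9]+)\)', s)[-1])
    except IndexError:
        # no "(digits)" group in the title: print it so it can be imputed manually
        print(s)
        return 'UNKNOWN'
movie['year'] = movie.title.apply(extract_year)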

Big Bang Theory, The (2007-)
Fawlty Towers (1975-1979)
Hyena Road
The Lovers and the Despot
Stranger Things
Women of '69, Unboxed
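
The first two titles fail because the regular expression only matches parentheses that contain digits and nothing else. A more tolerant pattern is sketched below as an alternative; it is not what this notebook does. `YEAR_PATTERN` and `extract_start_year` are hypothetical names, and the pattern assumes a plain hyphen in year ranges. Titles without any year would still need manual imputation.

```python
import re

# Matches "(1995)", "(2007-)" and "(1975-1979)"; captures the first four-digit year.
YEAR_PATTERN = re.compile(r"\((\d{4})(?:-\d{0,4})?\)")

def extract_start_year(title):
    """Return the (start) year from the last parenthesised year group, or None."""
    matches = YEAR_PATTERN.findall(title)
    return int(matches[-1]) if matches else None

for title in ["Toy Story (1995)",
              "Big Bang Theory, The (2007-)",
              "Fawlty Towers (1975-1979)",
              "Hyena Road"]:
    print(title, "->", extract_start_year(title))
```

Note that taking the start year would put "Big Bang Theory, The (2007-)" in the 2000s decade, whereas the manual imputation in the next cell assigns it 2010.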


In [5]:
# manually impute year
years = {"Big Bang Theory, The (2007-)": 2010,
"Fawlty Towers (1975-1979)": 1970,
"Hyena Road": 2010,
"The Lovers and the Despot": 2016,
"Stranger Things": 2010,
"Women of '69, Unboxed": 2010}

for ix, row in movie.iterrows():
    if row['title'] in years:
        movie.at[ix, 'year'] = years[row['title']]

In [6]:
# convert year to decades
movie['decade'] = movie['year'].apply(lambda x: math.floor(x/10)*10)

In [7]:
# extract genre
movie['genres'] = movie['genres'].replace('(no genres listed)', "NULL")
movie['genres'] = movie['genres'].apply(lambda x: x.split("|"))

In [8]:
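# combine genres and decade into one feature list per movie
# (the indicator encoding itself is done later by lightfm's Dataset)
movie_features = movie.apply(lambda x: x['genres'] + [x['decade']], axis=1)
# set of all distinct feature names
all_movie_features = {f for features in movie_features for f in features}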

In [9]:
# convert movieId
movie_id_mapping = {i:ix for ix, i in enumerate(movie.movieId)}
movie['movieId'] = movie['movieId'].apply(lambda x: movie_id_mapping[x])

### Preprocessing - User-Item Interaction

In [10]:
# map movieid & userid to new ids
user_id_mapping = {i:ix for ix, i in enumerate(rating.userId.unique())}
rating['userId'] = rating['userId'].apply(lambda x: user_id_mapping[x])
rating['movieId'] = rating['movieId'].apply(lambda x: movie_id_mapping[x])

In [11]:
# convert explicit rating to implicit feedback:
# ratings >= threshold count as positive (1), everything else as 0
threshold = 3
rating['rating'] = rating['rating'].apply(lambda x: 0 if x < threshold else 1)

# Build dataset - Interaction Matrix, Feature Matrix

In [12]:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(
    users=set(rating['userId']),
    items=movie_id_mapping.values(),
    item_features=all_movie_features
)


In [13]:
# build interactions matrix from (user_id, item_id) or (user_id, item_id, weight)
# only positive feedback (rating == 1) is kept as an interaction
interaction_tuples = rating[rating['rating']==1][['userId', 'movieId']].apply(tuple, axis=1).to_list()
interactions, weights = dataset.build_interactions(interaction_tuples)

In [14]:
# build item features from (item id, [list of feature names]) or (item id, {feature name: feature weight})
feature_tuples = list(enumerate(movie_features))
item_features = dataset.build_item_features(feature_tuples)
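
To make the shape of the item feature matrix concrete, here is a rough plain-Python sketch of turning `(item id, [feature names])` tuples into indicator rows. `build_indicator_rows` is a hypothetical helper, the three toy movies are invented, and the row normalization is an assumption about what `Dataset` does by default. To my understanding `Dataset` also adds one identity feature per item by default, which is left out here.

```python
def build_indicator_rows(items_with_features, all_features, normalize=True):
    """Map each item's feature list to a dense indicator row (one column per feature)."""
    # Fixed column order; key=str because genre names and decades have different types.
    feature_index = {f: j for j, f in enumerate(sorted(all_features, key=str))}
    rows = []
    for _, features in items_with_features:
        row = [0.0] * len(feature_index)
        for f in features:
            row[feature_index[f]] = 1.0
        if normalize:
            total = sum(row)
            if total > 0:
                row = [v / total for v in row]  # each row sums to 1
        rows.append(row)
    return feature_index, rows

toy_features = [
    (0, ["Comedy", "Romance", 1990]),
    (1, ["Comedy", 1990]),
    (2, ["Drama", 2000]),
]
toy_all_features = {f for _, features in toy_features for f in features}
feature_index, rows = build_indicator_rows(toy_features, toy_all_features)

print(feature_index)
for row in rows:
    print([round(v, 2) for v in row])
```

The real matrix is sparse, since each movie has only a handful of genres and one decade out of all possible features.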

# Model

- LightFM is trained with stochastic gradient descent (adagrad or adadelta learning schedules); it does not implement ALS
- besides a logistic loss, it implements **BPR (Bayesian Personalized Ranking) and WARP (Weighted Approximate-Rank Pairwise) losses**, which are particularly well suited to learning-to-rank tasks on implicit feedback
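- WARP algorithm:
  - for a given user and positive item, sample a negative item
  - compute predictions for both
  - if y_pred of the negative item > positive item (a rank violation), update gradients; otherwise keep sampling negative items until a violation is found
  - optimization:
    - the size of the update depends on how many negatives had to be sampled: a violation found early gives a large gradient update, and the update shrinks the longer the search takes
- BPR algorithm: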
  - for a given user, sample a negative item & positive item
  - compute predictions for both
  - compute the difference between the predictions (scores)
  - pass the difference to a sigmoid function and use it as a weight for gradient update via SGD
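
The two update rules above can be sketched in plain Python. This is a simplified illustration, not LightFM's implementation: it uses plain SGD with no regularization, biases, or feature sums, and the margin of 1 and the `log(rank estimate)` weight in `warp_step` are assumptions taken from the usual description of WARP. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_update(user, pos, neg, step):
    """Pull the positive item towards the user and push the negative item away (in place)."""
    old_user = list(user)
    for k in range(len(user)):
        user[k] += step * (pos[k] - neg[k])
        pos[k] += step * old_user[k]
        neg[k] -= step * old_user[k]

def bpr_step(user, pos, neg, lr=0.05):
    """One BPR update: the weight is 1 - sigmoid(score_pos - score_neg)."""
    diff = dot(user, pos) - dot(user, neg)
    weight = 1.0 - sigmoid(diff)  # close to 1 when the pair is ranked wrongly
    apply_update(user, pos, neg, lr * weight)
    return weight

def warp_step(user, items, pos_id, neg_ids, lr=0.05, max_sampled=10, rng=random):
    """One WARP update: sample negatives until one violates the margin."""
    pos_score = dot(user, items[pos_id])
    for sampled in range(1, max_sampled + 1):
        neg_id = rng.choice(neg_ids)
        if dot(user, items[neg_id]) > pos_score - 1.0:  # violation (margin of 1 assumed)
            # Few samples needed -> the positive is probably ranked low -> large weight.
            rank_estimate = (len(items) - 1) // sampled
            weight = math.log(max(1, rank_estimate))
            apply_update(user, items[pos_id], items[neg_id], lr * weight)
            return neg_id, sampled, weight
    return None, max_sampled, 0.0  # no violation found: no update

# Toy data: one user and four items with 2-dimensional embeddings.
user = [0.1, 0.2]
items = [[0.3, 0.1], [0.2, 0.4], [0.5, 0.1], [0.1, 0.3]]

gap_before = dot(user, items[0]) - dot(user, items[1])
bpr_weight = bpr_step(user, items[0], items[1])
gap_after = dot(user, items[0]) - dot(user, items[1])
print(gap_before, gap_after, bpr_weight)

neg_id, sampled, warp_weight = warp_step(
    user, items, pos_id=0, neg_ids=[1, 2, 3], rng=random.Random(0))
print(neg_id, sampled, warp_weight)
```

The main difference is visible in the weights: BPR updates on every sampled pair with a smoothly varying weight, while WARP only updates on violations and scales the update by an estimate of how badly the positive item is ranked.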

In [15]:
model = LightFM(
    no_components=200, # embedding dims
    learning_rate=0.05,
    loss='warp'    
)

model.fit(interactions, 
          item_features=item_features,
          epochs=20,
          num_threads=4,
          verbose=True)

Epoch: 100%|██████████| 20/20 [00:08<00:00,  2.41it/s]


<lightfm.lightfm.LightFM at 0x7fc160d213a0>

# Evaluation

ROC AUC: the probability that a randomly chosen positive example has a higher score than a randomly chosen negative example.
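
This definition can be checked by hand for a single user by counting correctly ordered (positive, negative) pairs. The sketch below is plain Python with invented scores; `user_auc` is a hypothetical helper, not the library function, and counting ties as half a win is a common convention assumed here.

```python
def user_auc(scores, positive_ids):
    """Fraction of (positive, negative) item pairs in which the positive scores higher."""
    positives = [s for i, s in enumerate(scores) if i in positive_ids]
    negatives = [s for i, s in enumerate(scores) if i not in positive_ids]
    if not positives or not negatives:
        return 0.5  # undefined without both classes; treat as random guessing
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Predicted scores for five items; items 0 and 4 are the user's positives.
scores = [2.1, 0.3, 1.5, -0.4, 0.9]
positive_ids = {0, 4}
print(user_auc(scores, positive_ids))  # 5 of the 6 pairs are ordered correctly
```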

In [16]:
from lightfm.evaluation import auc_score

In [17]:
auc_scores = auc_score(model, 
                    test_interactions=interactions, 
                    item_features=item_features,
                    num_threads=4)

In [18]:
# auc scores for every user
# if no interaction for the user, then auc=0.5, i.e. random guessing
len(auc_scores)

671

In [19]:
# mean auc
np.mean(auc_scores)

0.98133314
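
Note that this score is computed on the same interactions the model was trained on, so it measures fit rather than generalization. A held-out estimate needs a train/test split of the interactions before fitting. The sketch below splits a list of `(user_id, item_id)` tuples in plain Python; `split_interactions` and the toy pairs are hypothetical. LightFM also ships `lightfm.cross_validation.random_train_test_split`, which does a similar split directly on the interaction matrix.

```python
import random

def split_interactions(pairs, test_fraction=0.2, seed=42):
    """Randomly split (user_id, item_id) pairs into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy interactions: 5 users x 4 items.
toy_pairs = [(u, i) for u in range(5) for i in range(4)]
train_pairs, test_pairs = split_interactions(toy_pairs)
print(len(train_pairs), len(test_pairs))
```

Each list could then be passed to `dataset.build_interactions` separately, fitting the model on the train matrix and calling `auc_score` with the test matrix as `test_interactions` and the train matrix as `train_interactions`, so that known positives are excluded from the ranking.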

# Prediction

In [20]:
def recommend(user_id, n_sample=5):
    # internal item ids are 0..n_items-1, so they double as positions in pred_scores
    item_ids = np.array(list(movie_id_mapping.values()))
    pred_scores = model.predict(
                        user_ids=user_id,
                        item_ids=item_ids,
                        item_features=item_features)
    # user's watch history (positively rated movies)
    watched_ids = rating[(rating['userId']==user_id)&(rating['rating']==1)]['movieId']
    # show only a random sample of the history
    watched_ix = watched_ids.sample(min(n_sample, len(watched_ids)))
    watched = movie[movie['movieId'].isin(watched_ix)][['movieId', 'title']]
    # rank all items by predicted score (descending), then drop everything in the history
    ranked_ids = item_ids[np.argsort(pred_scores)[::-1]]
    rec_movie_id = ranked_ids[~np.isin(ranked_ids, watched_ids)][:n_sample]
    recommended = pd.DataFrame({
                                'movieId':rec_movie_id,
                                'title': movie['title'][rec_movie_id],
                                'pred_score':pred_scores[rec_movie_id]
                                })
    return watched, recommended
    

In [21]:
# find out users that have very little interaction data
rating[rating['rating']==1].groupby('userId').count().sort_values('rating')['rating'].head(10)


userId
578     4
34      7
428     7
0       8
324     8
309     8
336    10
603    11
169    12
28     12
Name: rating, dtype: int64

In [22]:
watched, recommended = recommend(578, n_sample=5)
print("Watched")
print("------------------------------------------------------------------------")
print(watched)
print("Recommended")
print("------------------------------------------------------------------------")
print(recommended)

Watched
------------------------------------------------------------------------
      movieId                                              title
2062     2062                                 Matrix, The (1999)
3367     3367                                     Memento (2000)
6788     6788                              Game Plan, The (2007)
7936     7936  Twilight Saga: Breaking Dawn - Part 1, The (2011)
Recommended
------------------------------------------------------------------------
      movieId                                              title  pred_score
4395     4395      Lord of the Rings: The Two Towers, The (2002)    2.866094
3856     3856  Amelie (Fabuleux destin d'Amélie Poulain, Le) ...    2.717700
3871     3871  Lord of the Rings: The Fellowship of the Ring,...    2.708966
5127     5127       Eternal Sunshine of the Spotless Mind (2004)    2.663655
5026     5026  Lord of the Rings: The Return of the King, The...    2.605367


In [23]:
watched, recommended = recommend(34, n_sample=5)
print("Watched")
print("------------------------------------------------------------------------")
print(watched)
print("Recommended")
print("------------------------------------------------------------------------")
print(recommended)


Watched
------------------------------------------------------------------------
      movieId                                              title
219       219                          Heavenly Creatures (1994)
416       416  Englishman Who Went Up a Hill But Came Down a ...
1665     1665                                        Tron (1982)
1801     1801                                       Ronin (1998)
1811     1811                                 Player, The (1992)
Recommended
------------------------------------------------------------------------
      movieId                                              title  pred_score
535       535                                       Fargo (1996)    3.667782
406       406                               Fugitive, The (1993)    3.451589
1288     1288                           L.A. Confidential (1997)    3.394264
955       955  Raiders of the Lost Ark (Indiana Jones and the...    3.165970
266       266                                Pulp Fiction (