# LightFM: A Hybrid Recommendation Algorithm
#### https://github.com/lyst/lightfm 

LightFM is a Python implementation of several popular recommendation algorithms for both implicit and explicit feedback, including efficient implementations of the BPR and WARP ranking losses. It is easy to use, fast (via multithreaded model estimation), and produces high-quality results. It also makes it possible to incorporate both item and user metadata into traditional matrix factorization algorithms. It represents each user and item as the sum of the latent representations of their features, which allows recommendations to generalise to new items (via item features) and to new users (via user features).
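
To make the "sum of the latent representations of their features" idea concrete, here is a minimal NumPy sketch. It is not LightFM's internal code: the embeddings are random rather than learned, bias terms are omitted, and all names (`feature_embeddings`, `item_features`, and so on) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 4, 3

# Hypothetical latent vectors, one row per item feature
# (random here; a trained model would learn them).
feature_embeddings = rng.normal(size=(n_features, n_components))

# Rows are items, columns are features. Item 2 can be scored even if it
# never appeared in training, because it is described by features 0 and 3.
item_features = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# Each item's representation is the sum of the embeddings of its active features.
item_embeddings = item_features @ feature_embeddings

# A hypothetical user vector; the score is the dot product with each item vector.
user_embedding = rng.normal(size=n_components)
scores = item_embeddings @ user_embedding
print(scores)
```

The same construction applies on the user side: a user vector would be the sum of the embeddings of that user's features.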

In [8]:
!pip install lightfm



In [9]:
import numpy as np
import pandas as pd
# import the "lightfm" hybrid recommendation algorithm package
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

## Data Exploration of MovieLens dataset

In [10]:
# Fetch full dataset
# https://github.com/lyst/lightfm/tree/master/examples/movielens#getting-the-data
data_mld = fetch_movielens()
print(data_mld)

{'train': <943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 90570 stored elements in COOrdinate format>, 'test': <943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 9430 stored elements in COOrdinate format>, 'item_features': <1682x1682 sparse matrix of type '<class 'numpy.float32'>'
	with 1682 stored elements in Compressed Sparse Row format>, 'item_feature_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
       'Sliding Doors (1998)', 'You So Crazy (1994)',
       'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object), 'item_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
       'Sliding Doors (1998)', 'You So Crazy (1994)',
       'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object)}


In [11]:
for key, value in data_mld.items():
    print(key, type(value), value.shape)

train <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
test <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
item_features <class 'scipy.sparse.csr.csr_matrix'> (1682, 1682)
item_feature_labels <class 'numpy.ndarray'> (1682,)
item_labels <class 'numpy.ndarray'> (1682,)


#### Observations:
- 'train' and 'test' are sparse coordinate (COO) matrices of size (943, 1682), i.e. 943 users by 1682 movies; they hold 90570 and 9430 movie ratings respectively
- 'item_features' is a sparse diagonal CSR matrix of size (1682, 1682), with one feature per movie
- 'item_feature_labels' and 'item_labels' are arrays of length 1682 containing the movie titles
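
A diagonal feature matrix means each movie has exactly one feature of its own, so the model has no shared metadata to generalise from and behaves like plain matrix factorization. The check below is a small sketch on a stand-in 5x5 matrix rather than the real 1682x1682 one; `n_items` and `is_identity` are illustrative names.

```python
import numpy as np
import scipy.sparse as sp

n_items = 5  # small stand-in for the 1682 movies

# Build an identity feature matrix in CSR format, as in data_mld['item_features'].
item_features = sp.identity(n_items, format="csr", dtype=np.float32)

# Identity check: one stored element per row, and every diagonal entry equals 1.
is_identity = (
    item_features.shape[0] == item_features.shape[1]
    and item_features.nnz == n_items
    and bool(np.all(item_features.diagonal() == 1))
)
print(is_identity)
```

The same three conditions could be evaluated on `data_mld['item_features']` to confirm the observation about the real dataset.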


In [18]:
# print(data_mld['train'])
# print(data_mld['test'])

In [19]:
# Convert train and test datasets from sparse Coordinate matrix to sparse CSR matrix (to access indices)
train_csr_mld = data_mld['train'].tocsr() 
test_csr_mld = data_mld['test'].tocsr()

print(type(train_csr_mld), train_csr_mld.shape)  #<class 'scipy.sparse.csr.csr_matrix'> (943, 1682)
print(type(test_csr_mld), test_csr_mld.shape)
# print(train_csr_mld)

<class 'scipy.sparse.csr.csr_matrix'> (943, 1682)
<class 'scipy.sparse.csr.csr_matrix'> (943, 1682)
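
The reason for converting to CSR is that the COO format stores (row, column, value) triples and is meant for construction, whereas CSR is designed for row slicing and element lookup. A toy sketch of the conversion, using a made-up 3x3 ratings matrix:

```python
import numpy as np
import scipy.sparse as sp

# Toy ratings: (user, movie, rating) triples, as stored by the COO format.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 0])
vals = np.array([5, 3, 4, 2], dtype=np.int32)
coo = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# CSR supports element access and row slicing.
csr = coo.tocsr()
first_rating = csr[0, 0]              # rating of user 0 for movie 0
missing_rating = csr[0, 1]            # no stored rating, so this reads as 0
first_row = csr[0, 0:3].toarray()     # dense view of user 0's ratings
print(first_rating, missing_rating, first_row)
```

Note that an unrated movie reads back as 0, which is why zeros in the dense views below mean "no rating" rather than a rating of zero.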


In [20]:
# Convert train and test datasets from sparse Coordinate matrix to dense matrix 
train_dense_mld = data_mld['train'].todense() 
test_dense_mld = data_mld['test'].todense()

print(type(train_dense_mld), train_dense_mld.shape)  #<class 'numpy.matrix'> (943, 1682)
print(type(test_dense_mld), test_dense_mld.shape)
# print(train_dense_mld)

<class 'numpy.matrix'> (943, 1682)
<class 'numpy.matrix'> (943, 1682)
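
The dense conversion is convenient for inspection here, but it stores every cell, including the empty ones. A quick back-of-the-envelope calculation, using the shapes and counts printed above and assuming 4 bytes per int32 cell, shows how sparse the training matrix is:

```python
n_users, n_items = 943, 1682
n_train_ratings = 90570

# Total number of cells in the dense user-by-movie matrix.
n_cells = n_users * n_items

# Fraction of cells that actually hold a rating.
density = n_train_ratings / n_cells

# Memory needed for the dense int32 matrix, in bytes.
dense_bytes = n_cells * 4

print(n_cells, round(density, 4), dense_bytes)
```

Roughly 5.7% of the cells hold a rating, so the dense form is manageable for MovieLens 100k but would not scale to much larger datasets.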


In [21]:
# Analyze train_csr_mld & train_dense_mld matrices
print("train_csr_mld with 1st user's 1st movie's rating")
print(train_csr_mld[0,0])
print("train_dense_mld with 1st user's 1st movie's rating")
print(train_dense_mld[0,0])

print('\n')
print("train_csr_mld with 1st user's 5 movies' (coordinate) ratings")
print(train_csr_mld[0,0:5])
print("train_dense_mld with 1st user's 5 movies' (matrix) ratings")
print(train_dense_mld[0,0:5])

print('\n')
print("train_csr_mld with 1st movie's (coordinate) ratings (if exists) for user_id in [1-10]")
print(train_csr_mld[0:10,0])
print("train_dense_mld with 1st movie's (matrix) ratings for user_id in [1-10]")
print(train_dense_mld[0:10,0])

print('\n')
print("train_csr_mld with 1st 5 movies' (coordinate) ratings (if exists) for user_id in [1-10]")
print(train_csr_mld[0:10,0:5])
print("train_dense_mld with 1st 5 movies' (matrix) ratings for user_id in [1-10]")
print(train_dense_mld[0:10,0:5])

train_csr_mld with 1st user's 1st movie's rating
5
train_dense_mld with 1st user's 1st movie's rating
5


train_csr_mld with 1st user's 5 movies' (coordinate) ratings
  (0, 0)	5
  (0, 1)	3
  (0, 2)	4
  (0, 3)	3
  (0, 4)	3
train_dense_mld with 1st user's 5 movies' (matrix) ratings
[[5 3 4 3 3]]


train_csr_mld with 1st movie's (coordinate) ratings (if exists) for user_id in [1-10]
  (0, 0)	5
  (1, 0)	4
  (5, 0)	4
  (9, 0)	4
train_dense_mld with 1st movie's (matrix) ratings for user_id in [1-10]
[[5]
 [4]
 [0]
 [0]
 [0]
 [4]
 [0]
 [0]
 [0]
 [4]]


train_csr_mld with 1st 5 movies' (coordinate) ratings (if exists) for user_id in [1-10]
  (0, 0)	5
  (0, 1)	3
  (0, 2)	4
  (0, 3)	3
  (0, 4)	3
  (1, 0)	4
  (5, 0)	4
  (6, 3)	5
  (9, 0)	4
  (9, 3)	4
train_dense_mld with 1st 5 movies' (matrix) ratings for user_id in [1-10]
[[5 3 4 3 3]
 [4 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [4 0 0 0 0]
 [0 0 0 5 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [4 0 0 4 0]]


In [26]:
# Analyze item_features (CSR) & item_features_dense_mld (dense)
# Rows are movies and columns are item features; this matrix holds no user ratings
item_features_csr_mld = data_mld['item_features']
item_features_dense_mld = data_mld['item_features'].todense() 
print(type(item_features_csr_mld), type(item_features_dense_mld))

print('\n')
print("item_features_csr_mld entry for 1st movie & 1st feature")
print(item_features_csr_mld[0,0])
print("item_features_dense_mld entry for 1st movie & 1st feature")
print(item_features_dense_mld[0,0])

print('\n')
print("item_features_csr_mld (coordinate) entries for 1st 5 movies & 1st 5 features")
print(item_features_csr_mld[0:5,0:5])
print("item_features_dense_mld (matrix) entries for 1st 5 movies & 1st 5 features")
print(item_features_dense_mld[0:5,0:5])


<class 'scipy.sparse.csr.csr_matrix'> <class 'numpy.matrix'>


item_features_csr_mld entry for 1st movie & 1st feature
1.0
item_features_dense_mld entry for 1st movie & 1st feature
1.0


item_features_csr_mld (coordinate) entries for 1st 5 movies & 1st 5 features
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
  (4, 4)	1.0
item_features_dense_mld (matrix) entries for 1st 5 movies & 1st 5 features
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


In [36]:
# Analyze 'item_feature_labels' & 'item_labels'
print(type(data_mld['item_feature_labels']), type(data_mld['item_labels']))
print(len(data_mld['item_feature_labels']), len(data_mld['item_labels']))
print(data_mld['item_feature_labels'])
print(data_mld['item_labels'])
print('\n')
print(np.unique(data_mld['item_feature_labels'] == data_mld['item_labels'], return_counts=True))
print("'item_feature_labels' & 'item_labels' are the same arrays")

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
1682 1682
['Toy Story (1995)' 'GoldenEye (1995)' 'Four Rooms (1995)' ...
 'Sliding Doors (1998)' 'You So Crazy (1994)'
 'Scream of Stone (Schrei aus Stein) (1991)']
['Toy Story (1995)' 'GoldenEye (1995)' 'Four Rooms (1995)' ...
 'Sliding Doors (1998)' 'You So Crazy (1994)'
 'Scream of Stone (Schrei aus Stein) (1991)']


(array([ True]), array([1682]))
'item_feature_labels' & 'item_labels' are the same arrays


In [19]:
# Combine the dense train and test matrices side by side into one DataFrame
# (rows: users; columns: train ratings followed by test ratings)
df_mld = pd.DataFrame(data=np.column_stack((np.asarray(train_dense_mld), np.asarray(test_dense_mld))),
                      index=['Row_' + str(i + 1) for i in range(train_dense_mld.shape[0])])

# Modeling (Notebooks examples)

#### Example 1: https://github.com/lyst/lightfm/blob/master/examples/quickstart/short_quickstart.ipynb 
#### Fitting an implicit feedback model on the MovieLens 100k dataset using "WARP" (Weighted Approximate-Rank Pairwise) loss function

In [5]:
# Short quickstart
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

# Load the MovieLens 100k dataset. Only five
# star ratings are treated as positive.
data = fetch_movielens(min_rating=5.0)

# Instantiate and train the model
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

# Evaluate the trained model
test_precision = precision_at_k(model, data['test'], k=5).mean()
print(test_precision) # 0.049669746

0.049669746
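
For intuition about the metric reported above: precision at k is the fraction of a user's top-k recommended items that are known positives, averaged over users. The function below is a simplified NumPy sketch of that definition, not LightFM's implementation (which, among other things, can exclude training interactions and by default skips users with no test interactions). The name `precision_at_k_sketch` and the toy data are made up for illustration.

```python
import numpy as np

def precision_at_k_sketch(scores, positives, k):
    """Per-user precision@k from a dense score matrix and a 0/1 positives matrix."""
    # Indices of the k highest-scoring items for each user.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # Look up whether each recommended item is a known positive.
    hits = np.take_along_axis(positives, top_k, axis=1)
    return hits.sum(axis=1) / k

# Two users, five items: predicted scores and held-out positives.
scores = np.array([
    [0.9, 0.1, 0.8, 0.3, 0.2],
    [0.2, 0.7, 0.1, 0.6, 0.5],
])
positives = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
])

per_user = precision_at_k_sketch(scores, positives, k=2)
print(per_user, per_user.mean())  # [0.5 1. ] 0.75
```

Read this way, a test precision at 5 of about 0.05 means that, on average, roughly one in twenty of the top five recommendations is a held-out five-star rating.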


#### Example 2: https://github.com/lyst/lightfm/blob/master/examples/quickstart/quickstart.ipynb OR https://youtu.be/9gBC9R-msAk 
#### Fitting an implicit feedback model on the MovieLens 100k dataset using "WARP" (Weighted Approximate-Rank Pairwise) loss function