<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# LightFM - Factorization Machine on MovieLens (Python, CPU)

This notebook explains the concept of a Factorization Machine based model for recommendation and outlines the steps to construct both a pure matrix factorization model and a Factorization Machine model using the [LightFM](https://github.com/lyst/lightfm) package. It also demonstrates how to extract both user and item affinity from a fitted model.

*NOTE: LightFM is not available in the core package of Recommenders. To run this notebook, install the experimental package with `pip install recommenders[experimental]`.*

## 1. Factorization Machine model

### 1.1 Background

In general, most recommendation models can be divided into two categories:
- Content-based models,
- Collaborative filtering models.

A content-based model recommends based on the similarity of items and/or users, using their descriptions, metadata or profiles. A collaborative filtering model, on the other hand, computes latent factors for the users and items (the discussion in this notebook is limited to the matrix factorization approach). It works on the assumption that if a group of people expressed similar opinions on an item, these people will tend to have similar opinions on other items. For further background and a detailed comparison of these two approaches, the reader can refer to the machine learning literature [3, 4].

The choice between the two models depends largely on data availability. For example, a collaborative filtering model is usually adopted, and is effective, when sufficient ratings or feedback have been recorded for a group of users and items.

However, if ratings are scarce, a content-based model can be used, provided that metadata for the users and items is available. This is also the common approach to the cold-start problem, where there are insufficient historical interactions to model new users and/or items.


### 1.2 Factorization Machine algorithm

In view of the above problems, there have been a number of proposals to address the cold-start problem by combining the content-based and collaborative filtering approaches. The Factorization Machine model is one of the proposed solutions [1].

In general, these approaches differ in how they assess and/or combine the feature data with the collaborative information.

### 1.3 LightFM package 

LightFM is a Python implementation of a Factorization Machine recommendation algorithm for both implicit and explicit feedback [1].

It is a Factorization Machine model which represents users and items as linear combinations of their content features’ latent factors. The model learns **embeddings or latent representations of the users and items in such a way that it encodes user preferences over items**. These representations produce scores for every item for a given user; items scored highly are more likely to be interesting to the user.

An embedding is estimated for every user and item feature, and the embeddings of a user's (or item's) features are then summed to form its final representation.

For example, for user i, the model retrieves the i-th row of the user feature matrix to find the features with non-zero weights. The embeddings of these features are then added together to form the user representation. For instance, if user 10 has weight 1 in the 5th column of the user feature matrix and weight 3 in the 20th column, the representation of user 10 is the sum of the embeddings of the 5th and 20th features, each multiplied by its weight. The representation of each item is computed in the same way.
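
The weighted sum in the user 10 example can be sketched in plain Python. This is only an illustration: the sizes (25 features, 3 components) and the random embedding values are assumptions, and the helper `representation` is hypothetical rather than part of LightFM.

```python
import random

random.seed(0)
# Hypothetical toy sizes: 25 user features, 3 latent components.
n_features, n_components = 25, 3

# One embedding vector per user feature (random stand-ins for learned values).
feature_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(n_components)]
    for _ in range(n_features)
]

# Row of the user feature matrix for "user 10": weight 1 in the 5th column
# and weight 3 in the 20th column (zero-based indices 4 and 19).
user_feature_row = [0.0] * n_features
user_feature_row[4] = 1.0
user_feature_row[19] = 3.0


def representation(feature_row, embeddings):
    """Weighted sum of the embeddings of all features with non-zero weight."""
    dim = len(embeddings[0])
    result = [0.0] * dim
    for feature_idx, weight in enumerate(feature_row):
        if weight != 0.0:
            for k in range(dim):
                result[k] += weight * embeddings[feature_idx][k]
    return result


user_representation = representation(user_feature_row, feature_embeddings)
print(user_representation)
```

The same function applied to a row of the item feature matrix would give the item representation.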

#### 1.3.1 Modelling approach

Let $U$ be the set of users and $I$ be the set of items. Each user can be described by a set of user features $f_{u} \subset F^{U}$, whilst each item can be described by a set of item features $f_{i} \subset F^{I}$. $F^{U}$ and $F^{I}$ are the sets of all features that describe the users and items respectively.

The LightFM model operates on binary feedback, so explicit ratings are normalised into two groups. The user-item interaction pairs $(u,i) \in U\times I$ are the union of positive interactions $S^+$ (favourable reviews) and negative interactions $S^-$ (unfavourable reviews). For implicit feedback, these can be the observed and unobserved interactions respectively.

Each user feature $f$ has an embedding $e_{f}^{U}$ and each item feature $f$ has an embedding $e_{f}^{I}$. Each feature also has a scalar bias term ($b_{f}^{U}$ for user features and $b_{f}^{I}$ for item features). The embedding (latent representation) of user $u$ and of item $i$ is the sum of the latent vectors of its features:
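
As an illustration of how explicit ratings might be normalised into $S^+$ and $S^-$, the sketch below thresholds a few made-up ratings. The threshold of 4.0, the sample triples and the helper `split_feedback` are all assumptions for this example; the text above does not prescribe a particular cut-off.

```python
# Hypothetical (user, item, rating) triples; not taken from MovieLens.
ratings = [
    ("u1", "i1", 5.0),
    ("u1", "i2", 2.0),
    ("u2", "i1", 4.0),
    ("u2", "i3", 1.5),
]
# Assumed cut-off between favourable and unfavourable reviews.
POSITIVE_THRESHOLD = 4.0


def split_feedback(triples, threshold):
    """Return (S_plus, S_minus) as sets of (user, item) pairs."""
    s_plus = {(u, i) for u, i, r in triples if r >= threshold}
    s_minus = {(u, i) for u, i, r in triples if r < threshold}
    return s_plus, s_minus


s_plus, s_minus = split_feedback(ratings, POSITIVE_THRESHOLD)
print(sorted(s_plus), sorted(s_minus))
```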

$$ 
q_{u} = \sum_{j \in f_{u}} e_{j}^{U}
$$

$$
p_{i} = \sum_{j \in f_{i}} e_{j}^{I}
$$

Similarly, the biases for user $u$ and item $i$ are the sums of their respective feature biases. These terms capture the variation in behaviour across users and items:

$$
b_{u} = \sum_{j \in f_{u}} b_{j}^{U}
$$

$$
b_{i} = \sum_{j \in f_{i}} b_{j}^{I}
$$

In LightFM, the representation for each user/item is a linear weighted sum of its feature vectors.

The prediction for user $u$ and item $i$ can be modelled as sigmoid of the dot product of user and item vectors, adjusted by its feature biases as follows:

$$
\hat{r}_{ui} = \sigma (q_{u} \cdot p_{i} + b_{u} + b_{i})
$$

As LightFM is constructed to predict binary outcomes, i.e. $S^+$ and $S^-$, the function $\sigma(\cdot)$ is the [sigmoid function](https://mathworld.wolfram.com/SigmoidFunction.html).

LightFM estimates the latent vectors and biases of the features. The model is fitted by maximising the likelihood of the data conditional on the parameters described above, using stochastic gradient descent. The likelihood can be expressed as follows:
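
The prediction formula can be traced end to end with a small sketch. The feature names, embedding values and the helper `predict` below are hypothetical; the code only mirrors the equations above (sum the feature embeddings and biases, then apply the sigmoid) and is not LightFM's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sum_vectors(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def predict(user_feats, item_feats, user_emb, item_emb, user_bias, item_bias):
    q_u = sum_vectors([user_emb[f] for f in user_feats])  # user representation
    p_i = sum_vectors([item_emb[f] for f in item_feats])  # item representation
    b_u = sum(user_bias[f] for f in user_feats)
    b_i = sum(item_bias[f] for f in item_feats)
    dot = sum(q * p for q, p in zip(q_u, p_i))
    return sigmoid(dot + b_u + b_i)


# Hypothetical feature names and parameter values (2 latent components).
user_emb = {"age:25-34": [0.2, -0.1], "id:u1": [0.5, 0.3]}
item_emb = {"genre:animation": [0.4, 0.1], "id:i1": [-0.2, 0.6]}
user_bias = {"age:25-34": 0.05, "id:u1": -0.10}
item_bias = {"genre:animation": 0.20, "id:i1": 0.00}

r_hat = predict(
    ["age:25-34", "id:u1"], ["genre:animation", "id:i1"],
    user_emb, item_emb, user_bias, item_bias,
)
print(r_hat)
```

Because of the sigmoid, the result always lies strictly between 0 and 1 and can be read as the estimated probability of a positive interaction.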

$$
L = \prod_{(u,i) \in S^+}\hat{r}_{ui} \times \prod_{(u,i) \in S^-}(1 - \hat{r}_{ui})
$$
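
To make the likelihood concrete, the sketch below evaluates it for a handful of hypothetical predictions. It also computes the logarithm of the likelihood, a form commonly preferred in practice because a sum of logs is numerically more stable than a long product. The prediction values and function names are assumptions for illustration only.

```python
import math


def likelihood(pos_preds, neg_preds):
    """Product of r_hat over S+ times product of (1 - r_hat) over S-."""
    value = 1.0
    for r in pos_preds:
        value *= r
    for r in neg_preds:
        value *= 1.0 - r
    return value


def log_likelihood(pos_preds, neg_preds):
    """Same quantity on the log scale: a sum instead of a product."""
    return (sum(math.log(r) for r in pos_preds)
            + sum(math.log(1.0 - r) for r in neg_preds))


# Hypothetical predictions for three positive and two negative pairs.
pos_preds = [0.9, 0.8, 0.7]
neg_preds = [0.2, 0.1]

L = likelihood(pos_preds, neg_preds)
log_L = log_likelihood(pos_preds, neg_preds)
print(L, log_L)
```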

Note that if no feature data are available, so that each user and item is described only by its own indicator feature, the algorithm behaves like a [logistic matrix factorisation model](http://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf).

## 2. Movie recommender with LightFM using only explicit feedback

### 2.1 Import libraries

In [26]:
import os
import sys
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse

import lightfm
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm import cross_validation

print("System version: {}".format(sys.version))
print("LightFM version: {}".format(lightfm.__version__))


System version: 3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:18:52) [Clang 18.1.8 ]
LightFM version: 1.17


### 2.2 Defining variables

In [27]:
# Select MovieLens data size
MOVIELENS_DATA_SIZE = '20M'

# default number of recommendations
K = 10
# percentage of data used for testing
TEST_PERCENTAGE = 0.25
# model learning rate
LEARNING_RATE = 0.25
# number of latent factors
NO_COMPONENTS = 20
# number of epochs to fit model
NO_EPOCHS = 20
# number of threads to fit model
NO_THREADS = 32
# regularisation for both user and item features
ITEM_ALPHA = 1e-6
USER_ALPHA = 1e-6

# seed for pseudo-random number generation
SEED = 42

### 2.2 Retrieve data

In [None]:
data = pd.read_csv('data/ratings.csv', header=0, dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})
data.shape

(32000204, 4)

### 2.3 Prepare data

Before fitting the LightFM model, we need to create an instance of `Dataset` which holds the interaction matrix.

In [39]:
dataset = Dataset()

The `fit` method creates the user/item id mappings.

In [48]:
dataset.fit(users=data['userId'], 
            items=data['movieId'])

# quick check to determine the number of unique users and items in the data
num_users, num_items = dataset.interactions_shape()
print(f'Num users: {num_users}, num_items: {num_items}.')

Num users: 200948, num_items: 84432.


The next step is to build the interaction matrix. The `build_interactions` method returns two COO sparse matrices, namely the `interactions` and `weights` matrices.

In [49]:
(interactions, weights) = dataset.build_interactions(data.iloc[:, 0:3].values)
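
To make the roles of the id mappings and the two matrices concrete, here is a conceptual sketch in plain Python. It is not LightFM's implementation: the helpers `build_mappings` and `build_coo` are hypothetical, and it assumes that `interactions` stores 1 for every observed pair while `weights` stores the third column of each row (here, the rating).

```python
# Hypothetical (userId, movieId, rating) rows.
rows = [
    (10, 500, 4.0),
    (10, 700, 2.5),
    (42, 500, 5.0),
]


def build_mappings(triples):
    """Assign contiguous internal indices to external user and item ids."""
    user_map, item_map = {}, {}
    for user_id, item_id, _ in triples:
        user_map.setdefault(user_id, len(user_map))
        item_map.setdefault(item_id, len(item_map))
    return user_map, item_map


def build_coo(triples, user_map, item_map):
    """Return COO-style (row_idx, col_idx, interaction_values, weight_values)."""
    row_idx = [user_map[u] for u, _, _ in triples]
    col_idx = [item_map[i] for _, i, _ in triples]
    interaction_values = [1] * len(triples)       # an interaction was observed
    weight_values = [w for _, _, w in triples]    # here: the rating
    return row_idx, col_idx, interaction_values, weight_values


user_map, item_map = build_mappings(rows)
row_idx, col_idx, interaction_values, weight_values = build_coo(rows, user_map, item_map)
print(user_map, item_map)
print(row_idx, col_idx, interaction_values, weight_values)
```

The coordinate lists correspond to the `row`, `col` and `data` arrays of a SciPy COO matrix.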

LightFM works slightly differently from other packages in that it expects the train and test sets to have the same dimensions. Therefore a conventional train/test split will not work.

The package includes the `cross_validation.random_train_test_split` method, which splits the interaction data into two disjoint training and test sets.

However, note that **it does not validate the interactions in the test set to guarantee that all items and users have historical interactions in the training set**. This may therefore result in a partial cold-start problem in the test set.

In [50]:
train_interactions, test_interactions = cross_validation.random_train_test_split(
    interactions, test_percentage=TEST_PERCENTAGE,
    random_state=np.random.RandomState(SEED))

Double check the size of both the train and test sets.

In [51]:
print(f"Shape of train interactions: {train_interactions.shape}")
print(f"Shape of test interactions: {test_interactions.shape}")

Shape of train interactions: (200948, 84432)
Shape of test interactions: (200948, 84432)
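
Since the random split does not guarantee that every test user and item also appears in the training set, it can be useful to measure how large the partial cold-start effect is. The helper below is a hypothetical sketch, shown on toy index lists, that counts test interactions whose user or item has no training interaction.

```python
def cold_start_counts(train_rows, train_cols, test_rows, test_cols):
    """Count test interactions whose user / item never appears in training."""
    train_users = set(train_rows)
    train_items = set(train_cols)
    unseen_user = 0
    unseen_item = 0
    for u, i in zip(test_rows, test_cols):
        if u not in train_users:
            unseen_user += 1
        if i not in train_items:
            unseen_item += 1
    return unseen_user, unseen_item


# Toy internal indices: user 2 and item 2 only occur in the test set.
train_rows, train_cols = [0, 0, 1], [0, 1, 1]
test_rows, test_cols = [1, 2, 2], [0, 2, 1]

counts = cold_start_counts(train_rows, train_cols, test_rows, test_cols)
print(counts)
```

Assuming the split returns SciPy COO matrices, the same function could be called with `train_interactions.row`, `train_interactions.col`, `test_interactions.row` and `test_interactions.col`. For data of this size a vectorised check would be considerably faster.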


### 2.4 Fit the LightFM model

In this notebook, the LightFM model uses the Weighted Approximate-Rank Pairwise (WARP) loss. Further explanation on the topic can be found [here](https://making.lyst.com/lightfm/docs/examples/warp_loss.html#learning-to-rank-using-the-warp-loss).

In general, it maximises the rank of positive examples by repeatedly sampling negative examples until a rank violation has been located. This approach is recommended when only positive interactions are present.
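
The sampling step can be sketched as follows. This is a simplified illustration of the idea and not LightFM's actual implementation: the function name, the `margin` and `max_trials` parameters and the rank estimate are assumptions made for the example.

```python
import random


def warp_sample(scores, positive_item, negative_items, rng, margin=1.0, max_trials=None):
    """Sample negatives until one violates the margin against the positive item.

    Returns (violating_item, trials, rank_estimate), or (None, trials, 0)
    if no violation is found within max_trials.
    """
    if max_trials is None:
        max_trials = len(negative_items)
    for trial in range(1, max_trials + 1):
        candidate = rng.choice(negative_items)
        # Rank violation: the negative scores within `margin` of the positive.
        if scores[candidate] > scores[positive_item] - margin:
            # Few trials needed -> the positive is probably ranked poorly.
            rank_estimate = len(negative_items) // trial
            return candidate, trial, rank_estimate
    return None, max_trials, 0


rng = random.Random(0)
negatives = ["a", "b", "c"]

# Poorly ranked positive: every negative violates the margin.
poorly_ranked = warp_sample({"pos": 2.0, "a": 2.5, "b": 1.8, "c": 1.2}, "pos", negatives, rng)

# Well ranked positive: no negative comes within the margin.
well_ranked = warp_sample({"pos": 5.0, "a": 0.1, "b": 0.2, "c": 0.3}, "pos", negatives, rng)

print(poorly_ranked, well_ranked)
```

In a WARP-style update, a larger rank estimate would lead to a larger gradient step for that positive example, while a positive for which no violation is found is left unchanged.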

In [52]:
model1 = LightFM(loss='warp', no_components=NO_COMPONENTS, 
                 learning_rate=LEARNING_RATE,                 
                 random_state=np.random.RandomState(SEED))

The LightFM model can be fitted with the following code:

In [53]:
model1.fit(interactions=train_interactions,
          epochs=NO_EPOCHS);

In [55]:
def recommend_cold_start_user(model, item_id_mapping, liked_items_dict, top_n=10):
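    """
    Recommend items for a cold-start user from a dict of known liked items.

    The user is represented by the mean embedding of the liked items, and every
    other item is scored by its dot product with that vector. This is an
    item-similarity heuristic; it does not use the model's user embeddings
    or item biases.

    model: fitted LightFM model
    item_id_mapping: {movieId: internal_id, ...}
    liked_items_dict: {movieId: rating, ...}; entries with rating > 0 count as liked
    top_n: number of movieIds to return
    """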
    # Get internal IDs for liked items
    liked_internal_ids = [
        item_id_mapping[movie_id]
        for movie_id, rating in liked_items_dict.items()
        if rating > 0 and movie_id in item_id_mapping
    ]

    if not liked_internal_ids:
        raise ValueError("No liked items provided or movieId not in mapping!")

    # Get item embeddings
    item_embeddings = model.item_embeddings

    # Average embeddings of liked items
    user_embedding = np.mean(item_embeddings[liked_internal_ids], axis=0)

    # Score all items
    scores = item_embeddings.dot(user_embedding)

    # Exclude liked items from recommendations
    all_internal_ids = np.array(list(item_id_mapping.values()))
    mask = ~np.isin(all_internal_ids, liked_internal_ids)
    available_internal_ids = all_internal_ids[mask]
    available_scores = scores[available_internal_ids]

    # Find top N indices among available items
    top_indices_relative = np.argsort(-available_scores)[:top_n]
    top_internal_ids = available_internal_ids[top_indices_relative]

    # Map internal IDs to external movieIds
    reverse_item_id_mapping = {v: k for k, v in item_id_mapping.items()}
    recommended_items = [reverse_item_id_mapping[idx] for idx in top_internal_ids]

    return recommended_items

# Usage:
user_id_mapping, user_feature_mapping, item_id_mapping, item_feature_mapping = dataset.mapping()

liked_items = {
    1: 5.0,        # Toy Story
    166461: 5.0,   # Moana (2016)
    167036: 5.0,   # Sing (2016)
    172547: 5.0,   # Despicable Me 3 (2017)
    152081: 5.0,   # Zootopia (2016)
    596: 5.0,      # Pinocchio (1940)
    170957: 5.0,   # Cars 3 (2017)
    135887: 5.0,   # Minions (2015)
    134853: 5.0,   # Inside Out (2015)
    115617: 5.0,   # Big Hero 6 (2014)
}

recommended_items = recommend_cold_start_user(model1, item_id_mapping, liked_items, top_n=5)

movies = pd.read_csv('data/movies.csv')
movie_id_to_title = dict(zip(movies['movieId'], movies['title']))

for item_id in recommended_items:
    print(movie_id_to_title.get(item_id, f"Unknown movie (ID {item_id})"))

Afro Samurai (2007)
Mercy (2016)
EvenHand (2002)
Bobby Jones, Stroke of Genius (2004)
That's Black Entertainment (1990)
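
The heuristic above scores items by a raw dot product, which depends on the length of each item embedding as well as its direction. One possible variation, offered here only as an assumption and not something the notebook evaluates, is to rank by cosine similarity instead. The toy vectors and the helper `rank_by_profile` below are hypothetical and simply show how the two scorings can order the same items differently.

```python
import math


def rank_by_profile(item_vectors, liked_ids, top_n, use_cosine):
    """Rank unseen items against the mean vector of the liked items."""
    liked = [item_vectors[i] for i in liked_ids]
    profile = [sum(col) / len(liked) for col in zip(*liked)]
    profile_norm = math.sqrt(sum(x * x for x in profile))
    scores = {}
    for item_id, vec in item_vectors.items():
        if item_id in liked_ids:
            continue  # never recommend an item the user already liked
        dot = sum(a * b for a, b in zip(profile, vec))
        if use_cosine:
            denom = profile_norm * math.sqrt(sum(x * x for x in vec))
            scores[item_id] = dot / denom if denom > 0 else 0.0
        else:
            scores[item_id] = dot
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy embeddings: "C" has a large norm but points away from the liked item "A".
item_vectors = {
    "A": [1.0, 1.0],
    "B": [0.5, 0.5],
    "C": [4.0, 0.0],
    "D": [-1.0, 0.5],
}

by_dot = rank_by_profile(item_vectors, ["A"], top_n=3, use_cosine=False)
by_cosine = rank_by_profile(item_vectors, ["A"], top_n=3, use_cosine=True)
print(by_dot, by_cosine)
```

Whether this improves the recommendations for a fitted model would need to be checked empirically.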
