Factorization Machines (https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)
- a generic algorithm for classification, regression and ranking
- a generalization of linear regression and matrix factorization models
- similar to an SVM with a polynomial kernel
- models n-way (typically 2-way) interactions between features through their embeddings
- computes the weight of each 2-way feature interaction as the dot product of the two features' latent vectors
- the author proposes a reformulation that reduces the cost of the pairwise term from $O(kd^2)$ to $O(kd)$, i.e. linear in the number of features $d$ and the embedding dimension $k$

the model (2-way interaction): 

$\hat{y}(x) = \mathbf{w}_0 + \sum_{i=1}^d \mathbf{w}_i x_i + \sum_{i=1}^d\sum_{j=i+1}^d \langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j$

- $w_0$ - global bias
- $w_i$ - weight of i-th variable
- $V, v_i, v_j$ - feature embeddings
- $\langle\mathbf{v}_i, \mathbf{v}_j\rangle$ - dot product of embeddings -> weight of the interaction

As we can see, the first two terms of the equation are just a linear regression; the third term adds the pairwise feature interactions.

derivation of pairwise interaction

$$
\begin{split}\begin{aligned}
&\sum_{i=1}^d \sum_{j=i+1}^d \langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j \\
 &= \frac{1}{2} \sum_{i=1}^d \sum_{j=1}^d\langle\mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j - \frac{1}{2}\sum_{i=1}^d \langle\mathbf{v}_i, \mathbf{v}_i\rangle x_i x_i \\
 &= \frac{1}{2} \Big(\sum_{i=1}^d \sum_{j=1}^d \sum_{l=1}^k\mathbf{v}_{i, l} \mathbf{v}_{j, l} x_i x_j - \sum_{i=1}^d \sum_{l=1}^k \mathbf{v}_{i, l} \mathbf{v}_{i, l} x_i x_i \Big)\\
 &= \frac{1}{2} \sum_{l=1}^k \Big(\big(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\big) \big(\sum_{j=1}^d \mathbf{v}_{j, l}x_j\big) - \sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2 \Big) \\
 &= \frac{1}{2} \sum_{l=1}^k \Big(\big(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\big)^2 - \sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2\Big)
\end{aligned}\end{split}
$$

 - $k$ - embedding dimension
 - $d$ - number of features

An illustration of the shape & format of the data, where each feature will be embedded into a latent space

<img src="illustration.png" style="width:600px;height:300px;">


In [90]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix


Factorization Machines can be implemented in PyTorch with just a few lines of code!

In [74]:
import torch
import torch.nn as nn
class FM(nn.Module):
    def __init__(self, n_features, embed_dim):
        super().__init__()
        self.feature_embedding = nn.Embedding(n_features, embed_dim) # embed_dim-dimensional feature embeddings for 2-way interaction
        torch.nn.init.xavier_uniform_(self.feature_embedding.weight) # in-place initializer; the non-underscore name is deprecated
        self.fc = nn.Embedding(n_features, 1) # 1-d embedding equivalent to the weights for each feature in linear regression
        self.bias = nn.Parameter(torch.zeros((1,))) # global bias term
        
        
    def forward(self, x):
        first_order = self.fc(x).sum(dim=1) + self.bias
        second_order = self.factorization_machine(self.feature_embedding(x))
        res = first_order + second_order
        return torch.sigmoid(res)
    
    
    def factorization_machine(self, x_embed):
        """compute the 2-way interaction term
        """
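        # x_embed: (batch_size, n_attributes, embed_dim)
        sq_sum = x_embed.sum(dim=1)**2 # square of sum, (batch_size, embed_dim)
        sum_sq = (x_embed**2).sum(dim=1) # sum of squares, (batch_size, embed_dim)
        return 0.5 * (sq_sum - sum_sq).sum(dim=1, keepdim=True) # (batch_size, 1)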
    
    def predict(self, x):
        self.eval()
        with torch.no_grad():
            return self.forward(x)
    
        

The `factorization_machine()` function implements the equation:

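$\frac{1}{2} \sum_{l=1}^k \Big(\big(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\big)^2 - \sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2\Big)$

There are two parts in the equation, the square of the sum and the sum of the squares:

$\big(\sum_{i=1}^d \mathbf{v}_{i, l} x_i\big)^2$     and     $\sum_{i=1}^d \mathbf{v}_{i, l}^2 x_i^2$

In the code these are `sq_sum` and `sum_sq`. The inputs here are feature indices (see Data Preparation), so $x_i = 1$ for every feature present in a sample and $x_i = 0$ otherwise. The product $\mathbf{v}_{i, l} x_i$ therefore reduces to the embedding of each active feature, which `nn.Embedding` looks up directly and which is estimated during training; there is no need to build $V$ and multiply by $x$ separately.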
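To make the index-lookup view concrete, here is a small sketch that repeats the computation of `factorization_machine()` on a toy batch and compares it with an explicit sum of dot products over all pairs of active features. The sizes and index values are made up for illustration and are not taken from the MovieLens data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, embed_dim = 10, 4
embedding = nn.Embedding(n_features, embed_dim)

# Two samples, each described by three active feature indices (x_i = 1 for these).
x = torch.tensor([[0, 3, 7], [1, 4, 9]])
with torch.no_grad():
    x_embed = embedding(x)                  # (batch_size, n_attributes, embed_dim)

# Linear-time form, same arithmetic as factorization_machine().
sq_sum = x_embed.sum(dim=1) ** 2
sum_sq = (x_embed ** 2).sum(dim=1)
fast = 0.5 * (sq_sum - sum_sq).sum(dim=1, keepdim=True)   # (batch_size, 1)

# Reference: explicit sum of <v_a, v_b> over all pairs of active features.
naive = torch.zeros(x.shape[0], 1)
for a in range(x.shape[1]):
    for b in range(a + 1, x.shape[1]):
        naive[:, 0] += (x_embed[:, a] * x_embed[:, b]).sum(dim=1)

assert torch.allclose(fast, naive, atol=1e-5)
```

The two results agree up to floating-point error, which is what the derivation predicts when every active feature has $x_i = 1$.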

In [75]:
# train loop
def train(model, dataloader, epochs=20, lr=0.001):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()
    training_history = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in dataloader:
            # move the batch to the same device as the model
            x, y = x.to(device), y.to(device)
            y_pred = model(x)
            loss = criterion(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # accumulate a Python float so the autograd graph is not kept alive
            epoch_loss += loss.item()
        epoch_loss /= len(dataloader)
        training_history.append(epoch_loss)
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: {epoch_loss:.4f}")
    return model, training_history


# Data Preparation
- X is an array of feature indices (n_samples, n_attributes), where each feature index will be mapped to a latent factor
  - [user, item, user_features, item_features]
- y is an (n_samples, 1) array of ground-truth labels
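For a single embedding table to serve every attribute, each attribute needs its own range of indices. The sketch below shows one common way to build such an `X` by offsetting per-field label encodings. The toy fields (`user_idx`, `item_idx`, `genre_idx`) are hypothetical, and whether the `utils` helpers used in this notebook work the same way is an assumption, since their code is not shown here.

```python
import numpy as np

# Hypothetical toy data: each position is one (user, item) interaction,
# with per-field label encodings that each start at 0.
user_idx = np.array([0, 1, 2, 0])      # 3 distinct users
item_idx = np.array([1, 0, 1, 2])      # 3 distinct items
genre_idx = np.array([0, 1, 0, 1])     # 2 distinct genres

fields = [user_idx, item_idx, genre_idx]
field_sizes = [int(f.max()) + 1 for f in fields]
offsets = np.concatenate(([0], np.cumsum(field_sizes)[:-1]))

# Shift every field into its own slice of one shared index space so that
# no two fields collide in the embedding table.
X = np.stack(fields, axis=1) + offsets   # (n_samples, n_attributes)
y = np.array([1, 0, 1, 1], dtype=np.float32).reshape(-1, 1)

n_features = int(X.max()) + 1            # size of the embedding table
```

With this layout, user indices occupy 0 to 2, item indices 3 to 5, and genre indices 6 to 7, so a lookup for an item can never return a user's embedding.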

In [76]:
import sys
sys.path.append('..')
import utils

In [77]:
rating, item, user = utils.get_movielens()

In [78]:
item_label = utils.get_items_label_encoding(item, return_df=False)
user_label = utils.get_users_label_encoding(user, return_df=False)

In [79]:
# concat item & user feature matrix to get X
X = np.hstack((item_label[rating['item_id']-1,:], user_label[rating['user_id']-1,:])) # offset -1 since item&user id starts with 1

In [80]:
# convert rating to 1/0
threshold = 3
y = np.where(rating['rating'].to_numpy()>=threshold, 1, 0).reshape(-1, 1)

Train test split

Here, for simplicity, we use only a random split, with 80% of the rows as the train set and 20% as the test set. In practice, the split may be done per user, e.g. an 80/20 split of each user's rating/interaction history.
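A per-user split could look like the sketch below. The helper `split_by_user` and the toy frame are hypothetical and not part of this notebook; the function holds out roughly 20% of each user's rows at random and returns boolean masks over the rows.

```python
import numpy as np
import pandas as pd

def split_by_user(ratings, user_col="user_id", test_frac=0.2, seed=42):
    """Hold out roughly test_frac of each user's rows as the test set."""
    rng = np.random.default_rng(seed)
    is_test = np.zeros(len(ratings), dtype=bool)
    # groupby(...).indices maps each user to the positions of their rows
    for _, positions in ratings.groupby(user_col).indices.items():
        n_test = int(round(len(positions) * test_frac))
        if n_test > 0:
            is_test[rng.choice(positions, size=n_test, replace=False)] = True
    return ~is_test, is_test

# Hypothetical toy frame: user 1 has 10 ratings, user 2 has 5.
toy = pd.DataFrame({
    "user_id": [1] * 10 + [2] * 5,
    "item_id": list(range(10)) + list(range(5)),
    "rating": [4, 5, 3, 2, 1, 5, 4, 3, 2, 5, 3, 4, 5, 1, 2],
})
train_mask, test_mask = split_by_user(toy)
```

Assuming the rows of `X` and `y` are in the same order as the ratings frame, the masks can be applied as `X[train_mask]` and `X[test_mask]`. A time-based variant would hold out each user's most recent interactions instead of a random sample.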

In [81]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [82]:
dataset = data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train).float())
train_dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

# Train Model

In [84]:
n_features = int(X.max()) + 1 # largest feature index + 1 = number of rows in the embedding tables
embed_dim = 20
model = FM(n_features=n_features, embed_dim=embed_dim)



In [85]:
model, history = train(model, train_dataloader, epochs=200, lr=0.01)

Epoch 0: 0.4881
Epoch 10: 0.2507
Epoch 20: 0.2117
Epoch 30: 0.1971
Epoch 40: 0.1823
Epoch 50: 0.1801
Epoch 60: 0.1749
Epoch 70: 0.1717
Epoch 80: 0.1741
Epoch 90: 0.1692
Epoch 100: 0.1709
Epoch 110: 0.1706
Epoch 120: 0.1662
Epoch 130: 0.1680
Epoch 140: 0.1711
Epoch 150: 0.1712
Epoch 160: 0.1724
Epoch 170: 0.1710
Epoch 180: 0.1688
Epoch 190: 0.1791


In [92]:
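# run on whichever device the model lives on, then bring predictions back to the CPU
device = next(model.parameters()).device
y_pred_soft = model.predict(torch.from_numpy(X_test).to(device)).cpu()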

In [93]:
y_pred = np.where(y_pred_soft.numpy() > 0.5, 1, 0)
accuracy_score(y_test, y_pred)

0.76355

In [89]:
roc_auc_score(y_test, y_pred_soft)


0.6677428361646534

In [94]:
f1_score(y_test, y_pred)

0.8549608955681643

In [95]:
confusion_matrix(y_test, y_pred)

array([[ 1333,  2135],
       [ 2594, 13938]])
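The scores above can be cross-checked against the confusion matrix. The sketch below recomputes accuracy, precision, recall and F1 from the four printed counts, plus the accuracy of a trivial classifier that always predicts the positive class, as a reference point.

```python
import numpy as np

# Confusion matrix as printed above; scikit-learn orders it [[TN, FP], [FN, TP]].
cm = np.array([[1333, 2135],
               [2594, 13938]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of positives in the test set = accuracy of always predicting 1.
majority_baseline = (tp + fn) / cm.sum()

print(f"accuracy={accuracy:.5f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"always-positive baseline accuracy={majority_baseline:.4f}")
```

The recomputed accuracy and F1 match the 0.76355 and 0.8550 reported above. About 83% of the test labels are positive, so accuracy and F1 on their own are dominated by the majority class; the ROC AUC is less sensitive to that imbalance.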