# Factorization Machine

之前的　Linear Regression 的形式是形如　$\hat{y}= w^T x$，Factorization Machine 在　Linear Regression　的基础之上添加了所谓的交叉项，即　$c_{ij}x_ix_j$, 即　$\hat{y} = w_1x_1 + \ldots + w_fx_f + \sum_{p=1}^{f-1} \sum_{q=p+1}^{f} c_{pq}x_px_q$，由于有些交叉项在实际中并不存在，所以使用向量相乘的办法，用一个　$f \times k$　的矩阵，从中任选两个向量 $v_i, v_j$　相乘作为系数，从　$n$　个向量中任选两个相乘构成的系数个数一共有 $C_f^2 = \frac{f(f-1)}{2}$　个，刚好等于后面交叉项的数量。

最后的交叉项可以写成　$\sum_{p=1}^{f-1} \sum_{q=p+1}^{f} c_{pq}x_px_q = \frac{1}{2}(\sum_{p=1}^f \sum_{q=1}^f c_{pq}x_px_q - \sum_p^f v_p^2x_p^2)=\frac{1}{2} \sum_{u=1}^k[(\sum_{p=1}^fv_{p,u}x_p)(\sum_{q=1}^f v_{q,u}x_q) - (\sum_{p=1}^f v_{p, u}^2x_p^2)]= \frac{1}{2} \sum_{u=1}^k[(\sum_{p=1}^fv_{p,u}x_p)^2- (\sum_{p=1}^f v_{p, u}^2x_p^2)]$，降低计算复杂度。

其中 $c_{pq} = v_p \cdot v_q$

计算　Loss 采用的函数仍是　MSE，即　$loss = \frac{1}{2} \sum_i^n(\hat{y}_i - y_i)^2$

Gradient 的计算及更新：

$\begin{align*}
w_i & = w_i - \eta \cdot \frac{1}{n} \cdot [\sum_i^n x_i \cdot (\hat{y_i} - y_i) ] \\
v_{p,u} & = v_{p,u} - \eta \cdot \frac{1}{n} \cdot \sum_{i=1}^n [(\hat{y_i} - y_i) \cdot x_{i, p}^2 \cdot (\sum_{p=1}^f v_{p, u} - v_{p, u})]
\end{align*}$

Factorization Machine 是一个适用于回归场景的算法（分类场景也可以使用）。为了演示这个算法，采用 Boston Housing 数据集

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split


boston = load_boston()
boston_data = pd.DataFrame(boston.data,  columns=boston.feature_names)
boston_data['bias'] = np.ones(boston.data.shape[0])
boston_data['target'] = boston.target

ss = StandardScaler()
boston_data = ss.fit_transform(boston_data)

shape = boston_data.shape
X_train, X_test, y_train, y_test = train_test_split(boston_data[0:shape[0], 0:-1], boston_data[0:shape[0], -1], test_size=0.25,
                                                    random_state=33)

X_train = X_train.T
X_test = X_test.T

In [37]:
# numpy version

w = np.random.rand(X_train.shape[0])
k = 6
v = np.random.rand(X_train.shape[0], k)

BATCH_SIZE = 8
LEARNING_RATE = 0.001

EPOCH = 20
PRINT_STEP = EPOCH / 10

for epoch in range(EPOCH):
    index = np.random.randint(0, X_train.shape[1], size=BATCH_SIZE)
    X_batch = X_train[:, index]
    y_batch = y_train[index]

    # linear part
    linear_part = np.dot(w.T, X_batch)

    # cross part
    cross_part = np.zeros(BATCH_SIZE)
    for m in range(0, X_train.shape[0] - 1):
        for n in range(m + 1, X_train.shape[0]):
            v_m = v[m, :]
            v_n = v[n, :]
            cross_part += np.dot(v_m, v_n) * np.multiply(X_batch[m, :], X_batch[n, :])
 
    y_hat = linear_part + cross_part
    loss = y_hat - y_batch

    # linear pard update grade
    w = w - LEARNING_RATE * np.multiply(loss, X_batch).sum(axis=1) / BATCH_SIZE
    
    # matrix grad update
    for p in range(X_train.shape[0]):
        for u in range(k):
            v_grad = np.multiply(loss,  X_batch[p, :]**2 * (v[:, u].sum() -  v[p, u])).sum()
            v[p, u] =  v[p, u] - LEARNING_RATE * v_grad / BATCH_SIZE
        
    if epoch % PRINT_STEP == 0:
        print('EPOCH: %d, loss: %f' % (epoch, (loss**2).sum()))

EPOCH: 0, loss: 4079.220222
EPOCH: 1, loss: 5909.605041
EPOCH: 2, loss: 3003.988646
EPOCH: 3, loss: 55614.314832
EPOCH: 4, loss: 2662.843102
EPOCH: 5, loss: 12014.136855
EPOCH: 6, loss: 23536.633983
EPOCH: 7, loss: 398885.922581
EPOCH: 8, loss: 67933995.477794
EPOCH: 9, loss: 15787614749033836314624.000000


In [35]:
# PyTorch Version

import torch

device = torch.device('cpu')
dtype = torch.double

INPUT_DIMENSION, OUTPUT_DIMENSION = X_train.shape[0], 1
w = torch.randn(INPUT_DIMENSION, OUTPUT_DIMENSION, device=device, dtype=dtype, requires_grad=True)
k = 6
v = torch.randn(INPUT_DIMENSION, k, device=device, dtype=dtype, requires_grad=True)

LEARNING_RATE = 1e-3

BATCH_SIZE = 8
EPOCH = 20
PRINT_STEP = EPOCH / 10

for epoch in range(EPOCH):
    index = np.random.randint(0, X_train.shape[0], size=BATCH_SIZE)
    X_batch = torch.from_numpy(X_train[:, index]).reshape(INPUT_DIMENSION, BATCH_SIZE)
    y_batch = torch.from_numpy(y_train[index])

    # linear part
    linear_part = w.T.mm(X_batch)

    # cross part
    cross_part = torch.from_numpy(np.zeros(BATCH_SIZE)).reshape((1,-1))
    for m in range(0, X_train.shape[0] - 1):
        for n in range(m + 1, X_train.shape[0]):
            v_m = v[m, :].reshape((1, -1))
            v_n = v[n, :].reshape((1, -1))
            cross_part += v_m.mul(v_n).sum() * X_batch[m, :].mul(X_batch[n, :])
 
    y_hat = linear_part + cross_part
    loss = ((y_hat - y_batch)**2).sum() / 2
    loss.backward()

    with torch.no_grad():
        w -= LEARNING_RATE * w.grad
        v -= LEARNING_RATE * v.grad

        # Manually zero the gradients after updating weights
        w.grad.zero_()
        v.grad.zero_()

    if epoch % PRINT_STEP == 0:
        print('EPOCH: %d, loss: %f' % (epoch, loss))

EPOCH: 0, loss: 2504.345184
EPOCH: 2, loss: 935.662576
EPOCH: 4, loss: 215.542112
EPOCH: 6, loss: 559.341540
EPOCH: 8, loss: 589.773573
EPOCH: 10, loss: 62.043805
EPOCH: 12, loss: 115.991072
EPOCH: 14, loss: 413.338862
EPOCH: 16, loss: 21.422009
EPOCH: 18, loss: 21.678856
