# Factorization Machine

之前的　Linear Regression 的形式是形如　$\hat{y}= w^T x$，Factorization Machine 在　Linear Regression　的基础之上添加了所谓的交叉项，即　$c_{ij}x_ix_j$, 即　$\hat{y} = w_1x_1 + \ldots + w_fx_f + \sum_{p=1}^{f-1} \sum_{q=p+1}^{f} c_{pq}x_px_q$，由于有些交叉项在实际中并不存在，所以使用向量相乘的办法，用一个　$f \times k$　的矩阵，从中任选两个向量 $v_i, v_j$　相乘作为系数，从　$n$　个向量中任选两个相乘构成的系数个数一共有 $C_f^2 = \frac{f(f-1)}{2}$　个，刚好等于后面交叉项的数量。

最后的交叉项可以写成　$\sum_{p=1}^{f-1} \sum_{q=p+1}^{f} c_{pq}x_px_q = \frac{1}{2}(\sum_{p=1}^f \sum_{q=1}^f c_{pq}x_px_q - \sum_p^f v_p^2x_p^2)=\frac{1}{2} \sum_{u=1}^k[(\sum_{p=1}^fv_{p,u}x_p)(\sum_{q=1}^f v_{q,u}x_q) - (\sum_{p=1}^f v_{p, u}^2x_p^2)]= \frac{1}{2} \sum_{u=1}^k[(\sum_{p=1}^fv_{p,u}x_p)^2- (\sum_{p=1}^f v_{p, u}^2x_p^2)]$，降低计算复杂度。

其中 $c_{pq} = v_p \times v_q$

计算　Loss 采用的函数仍是　MSE，即　$loss = \frac{1}{2} \sum_i^n(\hat{y}_i - y_i)^2$

Gradient 的计算及更新：

$\begin{align*}
w_i & = w_i - \eta \cdot [\sum_i^n x_i \cdot (\hat{y_i} - y_i) ] \\
v_{p,u} & = v_{p,u} - \eta \cdot \sum_{i=1}^n [(\hat{y_i} - y_i) \cdot (x_{i, p} (\sum_{p=1}^f v_{p, u} x_{i, p}) - x_{i, p}^2v_{p, u})]
\end{align*}$

Factorization Machine 是一个适用于回归场景的算法。为了演示这个算法，采用典型的　Boston Housing 的数据集。

In [1]:
# load boston housing 

from sklearn.datasets import load_boston
import numpy as np
import pandas as pd


boston = load_boston()
X = boston.data
y = boston.target

boston_df = pd.DataFrame(X, columns=boston.feature_names)

# z-score
for feature in boston.feature_names:
    boston_df[feature] = (boston_df[feature] - boston_df[feature].mean()) / boston_df[feature].std()

# min-max
for feature in boston.feature_names:
    boston_df[feature] = (boston_df[feature] - boston_df[feature].min()) / (boston_df[feature].max() - boston_df[feature].min())

X = np.c_[boston_df.values, np.ones(X.shape[0])]

In [2]:
k = 9

LEARNING_RATE = 1e-6
EPOCH = 300

BATCH_SIZE = 30

PRINT_NUMS = 20
PRINT_INTERVAL = EPOCH / PRINT_NUMS

f = X.shape[1] - 1

w = np.random.uniform(0, 1, size=(1, f + 1))
v = np.random.uniform(0, 1, size=(f, k))

n = BATCH_SIZE

for epoch in range(EPOCH):
    index = np.random.randint(0, X.shape[0], size=BATCH_SIZE)
    sample_x = X[index]
    sample_y = y[index]

    # linear part
    linear_part = np.dot(w, sample_x.T)

    # cross part
    p = np.dot(v.T, sample_x[:,0:f].T)
    p = p * p
    p = np.sum(p, axis=0)

    q = np.dot(np.multiply(v.T, v.T), np.multiply(sample_x[:, 0:f].T, sample_x[:, 0:f].T))
    q = np.sum(q, axis=0)
    r = (p - q).reshape(1, q.shape[0])

    y_hat = linear_part  + r
    loss = y_hat - sample_y
    # update gradient
    w = w - LEARNING_RATE * np.dot(loss, sample_x)

 
    for p in range(f):
        for u in range(k):
            sumation = 0
            for i in range(n):
                sumation2 = 0
                for q in range(f):
                    sumation2 += v[q, u] * sample_x[i, q]
                sumation += sample_x[i, p] * (sumation2 - sample_x[i, p] * v[p, u]) * loss[0, i]
        v[p, u] -= LEARNING_RATE * sumation

    if epoch % PRINT_INTERVAL == 0:
        print('EPOCH: %d, loss: %f' % (epoch, loss.sum()))

EPOCH: 0, loss: 673.563069
EPOCH: 15, loss: 956.476021
EPOCH: 30, loss: 957.301041
EPOCH: 45, loss: 1131.980154
EPOCH: 60, loss: 754.626841
EPOCH: 75, loss: 875.519168
EPOCH: 90, loss: 1088.979308
EPOCH: 105, loss: 851.566699
EPOCH: 120, loss: 633.100518
EPOCH: 135, loss: 974.367024
EPOCH: 150, loss: 788.498512
EPOCH: 165, loss: 618.190015
EPOCH: 180, loss: 955.684439
EPOCH: 195, loss: 963.548392
EPOCH: 210, loss: 943.783070
EPOCH: 225, loss: 931.921670
EPOCH: 240, loss: 636.592666
EPOCH: 255, loss: 891.015543
EPOCH: 270, loss: 1016.814411
EPOCH: 285, loss: 879.597780
