Task04 Wide&Deep

### 1. Wide&Deep Model

#### 1.1. Overview

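The Wide&Deep model is a hybrid model made up of a single-layer Wide part and a multi-layer Deep part:
* Wide part: a linear model over a large number of historical behavior features gives the model its "memorization" ability. It usually depends on more feature engineering, however.

* Deep part: by learning embeddings of sparse features, the model generalizes reasonably well to feature combinations it has not seen before, which gives it its "generalization" ability. When the data is too sparse, however, the network can over-generalize and recommend items that are only weakly relevant.

Combining the two gives the model the strengths of both logistic regression and deep neural networks: it can quickly process and memorize large volumes of historical behavior features while retaining strong expressive power.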

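**Understanding memorization and generalization:**

* Memorization: the model's ability to directly learn and exploit the co-occurrence frequency of items or features in the historical data.

* Generalization: based on the transitivity of correlations, the model explores new feature combinations that never or rarely occurred in the past.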




#### 1.2. Wide Part

The Wide part is good at handling large numbers of sparse ID-type features and usually consists of a generalized linear model:
<div align=center><font size=3 width="50%" height="50%">$y=\mathbf{w}^T\mathbf{x}+b$</font></div>

The feature vector $\mathbf{x}=[x_1,x_2,\dots,x_n]$ consists of the raw input features plus combined features produced by the **cross-product transformation**.

The cross-product transformation is defined as:

<div align=center><font size=3 width="50%" height="50%">$\phi_k(\mathbf{x})=\prod\limits_{i=1}^d x_i^{c_{ki}}\quad c_{ki}\in\{0,1\}   $</font></div>

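$c_{ki}$ is a Boolean variable: it equals 1 when the $i$-th feature belongs to the $k$-th feature combination and 0 otherwise. The transformation captures interactions between binary features and adds nonlinearity to the generalized linear model. For example, a combination such as AND(gender=female, language=en) is 1 only when both constituent features are 1.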
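The formula above can be sketched in a few lines of NumPy. This is only an illustration: the function name `cross_product`, the four one-hot features and the two combinations are assumptions made for the example, not part of the original material.

```python
import numpy as np

def cross_product(x, c):
    """Compute phi_k(x) = prod_i x_i ** c_ki for every combination k.

    x: (d,) binary feature vector.
    c: (K, d) 0/1 matrix; c[k, i] = 1 if feature i belongs to combination k.
    """
    x = np.asarray(x)
    c = np.asarray(c)
    # x_i ** 0 == 1, so features outside a combination do not affect its product.
    return np.prod(np.power(x, c), axis=1)

# Hypothetical one-hot features:
# [gender=female, gender=male, language=en, language=zh]
x = np.array([1, 0, 1, 0])
c = np.array([
    [1, 0, 1, 0],  # AND(gender=female, language=en)
    [0, 1, 1, 0],  # AND(gender=male, language=en)
])
phi = cross_product(x, c)
print(phi)  # [1 0]
```

Only the first combination fires, because it is the only one whose member features are all 1.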

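The Wide part is trained with the FTRL (Follow-the-regularized-leader) optimizer with $L1$ regularization. FTRL with L1 places strong emphasis on model sparsity. In other words, W&D uses it to make the Wide part sparser, so that most Wide parameters are exactly 0, which greatly reduces the dimensionality of the model weights and the feature vector. The features that survive training of the Wide part are all highly important, so the model's "memorization" ability can be understood as the ability to discover "direct", "brute-force", "obvious" association rules.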
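To see why L1-regularized FTRL yields exact zeros, here is a minimal NumPy sketch of the per-coordinate FTRL-Proximal update for logistic regression, as it is commonly described. The function names `ftrl_weights` and `ftrl_proximal_fit`, the synthetic data and all hyperparameter values are assumptions for illustration. This is not the training code used in the paper.

```python
import numpy as np

def ftrl_weights(z, n, alpha, beta, l1, l2):
    """Closed-form FTRL-Proximal weights from the accumulators z and n."""
    w = -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2)
    # L1 thresholding: any coordinate with |z_i| <= l1 is set to exactly zero.
    w[np.abs(z) <= l1] = 0.0
    return w

def ftrl_proximal_fit(X, y, alpha=0.1, beta=1.0, l1=1.0, l2=0.0, epochs=3):
    """Fit logistic regression with per-coordinate FTRL-Proximal updates."""
    z = np.zeros(X.shape[1])  # adjusted gradient sums
    n = np.zeros(X.shape[1])  # squared gradient sums
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            w = ftrl_weights(z, n, alpha, beta, l1, l2)
            p = 1.0 / (1.0 + np.exp(-(w @ x_t)))
            g = (p - y_t) * x_t  # gradient of the log loss
            sigma = (np.sqrt(n + g ** 2) - np.sqrt(n)) / alpha
            z += g - sigma * w
            n += g ** 2
    return ftrl_weights(z, n, alpha, beta, l1, l2), z

# Synthetic data (an assumption): 20 binary features, only the first two matter.
rng = np.random.default_rng(0)
X = (rng.random((500, 20)) < 0.3).astype(float)
logits = 3.0 * X[:, 0] - 3.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

w, z = ftrl_proximal_fit(X, y, l1=1.0)
print("non-zero weights:", np.count_nonzero(w), "of", w.size)
```

The thresholding step in `ftrl_weights` is what produces sparsity: a weight stays at exactly 0 until its accumulated gradient signal exceeds the L1 strength.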

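#### 1.3. Deep Part

This part is essentially an Embedding + MLP neural network. Large-scale sparse features are converted into low-dimensional dense features through embeddings. The resulting features are then concatenated and fed into the MLP, which mines the data patterns hidden behind them. Each hidden layer computes: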

<div align=center><font size=3 width="50%" height="50%">$a^{(l+1)}=f(\mathbf{W}^{(l)}a^{(l)}+b^{(l)})   $</font></div>

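Here $l$ is the layer index, $f$ is the activation function (typically ReLU), and $a^{(l)}$, $b^{(l)}$ and $\mathbf{W}^{(l)}$ are the activations, bias and weights of layer $l$. The Deep part is optimized with AdaGrad.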
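The Embedding + MLP forward pass can be sketched in plain NumPy as follows. The vocabulary sizes, `embed_dim`, `hidden_units`, the random initial weights and the helper name `deep_forward` are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

# Hypothetical setup: 3 sparse fields, embedding dimension 4, 2 dense features.
vocab_sizes = [10, 5, 7]
embed_dim = 4
embeddings = [rng.normal(scale=0.1, size=(v, embed_dim)) for v in vocab_sizes]

# One (W, b) pair per hidden layer; the first layer takes the concatenated input.
hidden_units = [8, 4]
in_dim = len(vocab_sizes) * embed_dim + 2
weights, biases = [], []
for units in hidden_units:
    weights.append(rng.normal(scale=0.1, size=(units, in_dim)))
    biases.append(np.zeros(units))
    in_dim = units

def deep_forward(sparse_ids, dense_x):
    # Embedding lookup: each sparse id selects one row of its field's table.
    embedded = [table[idx] for table, idx in zip(embeddings, sparse_ids)]
    a = np.concatenate(embedded + [dense_x])
    # a^(l+1) = f(W^(l) a^(l) + b^(l)) with f = ReLU
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

a_final = deep_forward([3, 0, 6], np.array([0.2, 0.9]))
print(a_final.shape)  # (4,)
```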

#### 1.4. Joint Training of Wide and Deep Model
 
The outputs of the Wide part and the Deep part are combined by a weighted sum to produce the final output. The model's prediction is:

<div align=center><font size=3 width="50%" height="50%">$P(Y=1|x) =\sigma\left(\mathbf{w}^T_{wide}[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}^T_{deep} a^{(l_f)}+b \right) $</font></div>

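where $\sigma(\cdot)$ is the sigmoid function, $\mathbf{w}_{wide}$ and $\mathbf{w}_{deep}$ are the weights of the Wide part and the Deep part, $a^{(l_f)}$ is the final activation of the Deep part, and $b$ is the bias term.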
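As a small worked instance of the prediction formula, the sketch below evaluates it with NumPy. All numeric values and the function name `wide_deep_predict` are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def wide_deep_predict(x, phi_x, a_final, w_wide, w_deep, b):
    """P(Y=1|x) = sigmoid(w_wide . [x, phi(x)] + w_deep . a_final + b)."""
    wide_input = np.concatenate([x, phi_x])
    logit = w_wide @ wide_input + w_deep @ a_final + b
    return sigmoid(logit)

# Hypothetical values: 3 raw features, 1 cross feature, 2 final deep activations.
x = np.array([1.0, 0.0, 1.0])
phi_x = np.array([1.0])
a_final = np.array([0.5, 0.0])
w_wide = np.array([0.2, -0.1, 0.3, 0.4])
w_deep = np.array([1.0, -2.0])
b = -0.4

p = wide_deep_predict(x, phi_x, a_final, w_wide, w_deep, b)
print(round(float(p), 4))  # 0.7311
```

The wide logit is 0.2 + 0.3 + 0.4 = 0.9, the deep logit is 0.5, and adding the bias gives 1.0, so the prediction is sigmoid(1.0).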

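(The paper trains the Wide and Deep parts with different optimizers. In ordinary settings, however, such as experiments on public datasets, some reimplementations simply use a single optimizer; see [2].)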

### 2. Code Implementation

We implement the Wide & Deep model in TensorFlow 2.1 on the sample Criteo dataset (also used in Task03).

#### Data Preprocessing

```python
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
```

```python
# Return the dictionary describing a sparse feature
def sparseFeature(feat, feat_num, embed_dim=8):
    """
    create dictionary for sparse feature
    :param feat: feature name
    :param feat_num: the number of distinct values of the sparse feature
    :param embed_dim: embedding dimension
    :return: dict with keys 'feat', 'feat_num', 'embed_dim'
    """
    return {'feat': feat, 'feat_num': feat_num, 'embed_dim': embed_dim}
```

```python
# Return the dictionary describing a dense feature
def denseFeature(feat):
    """
    create dictionary for dense feature
    :param feat: dense feature name
    :return: dict with key 'feat'
    """
    return {'feat': feat}
```

```python
def create_criteo_dataset(embed_dim=8, test_size=0.2):
    """
    an example of creating the criteo dataset
    :param embed_dim: the embedding dimension of sparse features
    :param test_size: ratio of test dataset
    :return: feature columns, train, test
    """
    names = ['label', 'I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11',
             'I12', 'I13', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
             'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22',
             'C23', 'C24', 'C25', 'C26']

    # Raw string so that the backslashes in the Windows path are not treated as escapes
    data_df = pd.read_table(r'G:\Datasets\dac_sample.tar\dac_sample.txt', header=None, names=names)

    sparse_features = ['C' + str(i) for i in range(1, 27)]
    dense_features = ['I' + str(i) for i in range(1, 14)]

    # Fill missing values: '-1' for categorical features, 0 for numeric features
    data_df[sparse_features] = data_df[sparse_features].fillna('-1')
    data_df[dense_features] = data_df[dense_features].fillna(0)

    # Map every category to an integer id
    for feat in sparse_features:
        le = LabelEncoder()
        data_df[feat] = le.fit_transform(data_df[feat])

    # ==============Feature Engineering===================

    # ====================================================
    dense_features = [feat for feat in data_df.columns if feat not in sparse_features + ['label']]

    # Scale dense features to [0, 1]
    mms = MinMaxScaler(feature_range=(0, 1))
    data_df[dense_features] = mms.fit_transform(data_df[dense_features])

    feature_columns = [[denseFeature(feat) for feat in dense_features]] + \
                      [[sparseFeature(feat, len(data_df[feat].unique()), embed_dim=embed_dim)
                        for feat in sparse_features]]

    train, test = train_test_split(data_df, test_size=test_size)

    train_X = [train[dense_features].values, train[sparse_features].values.astype('int32')]
    train_y = train['label'].values.astype('int32')
    test_X = [test[dense_features].values, test[sparse_features].values.astype('int32')]
    test_y = test['label'].values.astype('int32')

    return feature_columns, (train_X, train_y), (test_X, test_y)
```

```python
feature_columns, train, test = create_criteo_dataset(embed_dim=8, test_size=0.2)
train_X, train_y = train
test_X, test_y = test
```
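To make the two preprocessing steps concrete, the sketch below mimics what `LabelEncoder` and `MinMaxScaler` do on a toy column, using plain NumPy so that it runs without the dataset. The toy values are assumptions. The real pipeline above uses the scikit-learn classes.

```python
import numpy as np

# Hypothetical sparse column, with a missing value already filled as '-1'.
c1 = np.array(['b', 'a', '-1', 'a'])
# Like LabelEncoder, assign integer ids following the sorted order of the classes.
classes, codes = np.unique(c1, return_inverse=True)
print(classes.tolist())  # ['-1', 'a', 'b']
print(codes.tolist())    # [2, 1, 0, 1]

# Hypothetical dense column, scaled to [0, 1] like MinMaxScaler(feature_range=(0, 1)).
i1 = np.array([0.0, 5.0, 10.0, 2.5])
scaled = (i1 - i1.min()) / (i1.max() - i1.min())
print(scaled.tolist())   # [0.0, 0.5, 1.0, 0.25]
```

The number of distinct codes per column (3 here) is the value passed as `feat_num`, which sets the size of that feature's embedding table.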

#### Building the Model

model: Wide & Deep Learning for Recommender Systems

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Embedding, Concatenate, Dropout, Input, Layer
from tensorflow.keras.regularizers import l2
```

```python
class Linear(Layer):
    """
    Linear Part
    """
    def __init__(self):
        super(Linear, self).__init__()
        self.dense = Dense(1, activation=None)

    def call(self, inputs, **kwargs):
        result = self.dense(inputs)
        return result
```

```python
class DNN(Layer):
    """
    Deep Neural Network
    """

    def __init__(self, hidden_units, activation='relu', dropout=0.):
        """
        :param hidden_units: A list. Neural network hidden units.
        :param activation: A string. Activation function of dnn.
        :param dropout: A scalar. Dropout number.
        """
        super(DNN, self).__init__()
        self.dnn_network = [Dense(units=unit, activation=activation) for unit in hidden_units]
        self.dropout = Dropout(dropout)

    def call(self, inputs, **kwargs):
        x = inputs
        for dnn in self.dnn_network:
            x = dnn(x)
        x = self.dropout(x)
        return x
```

```python
class WideDeep(tf.keras.Model):
    def __init__(self, feature_columns, hidden_units, activation='relu',
                 dnn_dropout=0., embed_reg=1e-4):
        """
        Wide&Deep
        :param feature_columns: A list. dense_feature_columns + sparse_feature_columns
        :param hidden_units: A list. Neural network hidden units.
        :param activation: A string. Activation function of dnn.
        :param dnn_dropout: A scalar. Dropout of dnn.
        :param embed_reg: A scalar. The regularizer of embedding.
        """
        super(WideDeep, self).__init__()
        self.dense_feature_columns, self.sparse_feature_columns = feature_columns
        # One embedding table per sparse feature
        self.embed_layers = {
            'embed_' + str(i): Embedding(input_dim=feat['feat_num'],
                                         input_length=1,
                                         output_dim=feat['embed_dim'],
                                         embeddings_initializer='random_uniform',
                                         embeddings_regularizer=l2(embed_reg))
            for i, feat in enumerate(self.sparse_feature_columns)
        }
        self.dnn_network = DNN(hidden_units, activation, dnn_dropout)
        self.linear = Linear()
        self.final_dense = Dense(1, activation=None)

    def call(self, inputs, **kwargs):
        dense_inputs, sparse_inputs = inputs
        # Look up and concatenate the embeddings of all sparse features
        sparse_embed = tf.concat([self.embed_layers['embed_{}'.format(i)](sparse_inputs[:, i])
                                  for i in range(sparse_inputs.shape[1])], axis=-1)
        x = tf.concat([sparse_embed, dense_inputs], axis=-1)

        # Wide: in this simplified implementation only the dense features feed the
        # linear part; no cross-product features are constructed
        wide_out = self.linear(dense_inputs)
        # Deep
        deep_out = self.dnn_network(x)
        deep_out = self.final_dense(deep_out)
        # out: average the two logits, then apply the sigmoid
        outputs = tf.nn.sigmoid(0.5 * (wide_out + deep_out))
        return outputs

    def summary(self, **kwargs):
        # Build a functional model from symbolic inputs so that Keras can print the layers
        dense_inputs = Input(shape=(len(self.dense_feature_columns),), dtype=tf.float32)
        sparse_inputs = Input(shape=(len(self.sparse_feature_columns),), dtype=tf.int32)
        keras.Model(inputs=[dense_inputs, sparse_inputs],
                    outputs=self.call([dense_inputs, sparse_inputs])).summary()
```

```python
model = WideDeep(feature_columns, hidden_units=[256, 128, 64], dnn_dropout=0.5)
model.summary()
```

```
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
input_2 (InputLayer)            [(None, 26)]         0
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None,)]            0           input_2[0][0]
__________________________________________________________________________________________________
tf_op_layer_strided_slice_1 (Te [(None,)]            0           input_2[0][0]
__________________________________________________________________________________________________
tf_op_layer_strided_slice_2 (Te [(None,)]            0           input_2[0][0]
______________________________________________________________________________________________
```

#### Training the Model

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC
import os
```

```python
model.compile(loss=binary_crossentropy,
              optimizer=Adam(learning_rate=0.001),
              metrics=[AUC()])
```

```python
model.fit(
    train_X,
    train_y,
    epochs=5,
    # Stop once val_loss has not improved for 1 epoch and restore the best weights
    callbacks=[EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)],
    batch_size=512,
    validation_split=0.1
)
```

```
Train on 72000 samples, validate on 8000 samples
Epoch 1/5
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Epoch 2/5
<tensorflow.python.keras.callbacks.History at 0x1e7769fafc8>
```

```python
print('test AUC: %f' % model.evaluate(test_X, test_y)[1])
```

```
test AUC: 0.747843
```
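For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below computes it exactly by comparing all positive/negative pairs. The function name `auc_pairwise` and the toy labels and scores are assumptions. Keras's `AUC` metric instead approximates the value over a fixed set of thresholds, so its numbers may differ slightly.

```python
import numpy as np

def auc_pairwise(y_true, y_score):
    """Exact AUC: fraction of (positive, negative) pairs ranked correctly, ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-minus-negative score differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.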


***************************************

References:

[1] Wang Zhe (王喆), 《深度学习推荐系统》 (Deep Learning Recommender Systems)

[2] [【论文导读】Wide&Deep模型的深入理解](https://mp.weixin.qq.com/s/LRghf8mj1hjUYri_m3AzBg)

[3] [见微知著，你真的搞懂Google的Wide&Deep模型了吗？](https://zhuanlan.zhihu.com/p/142958834)