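### I. Introduction to the Wide & Deep Model
1. Wide & Deep is another ranking algorithm built on model fusion. It serves the same purpose as LR+GBDT, FM, and FFM.  
2. The traditional approach to CTR prediction:  
 CTR prediction has traditionally relied on generalized linear models such as $LR$, combined with brute-force feature engineering that mines a very large number of features. "Wide" therefore refers to the large number of features.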

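### II. How the Wide & Deep Model Works

1. The general recommender-system workflow  
  Given user features and contextual features, the system retrieves candidate items from the item database, ranks them, and shows the ranked results to the user.   
  User actions are logged at the same time. The logs are then flattened, together with user features, contextual features, and item features, into training data for the next round of training.  
  
2. Wide & Deep architecture  
  The model consists of a "deep" part and a "wide" part.   
  1. The "wide" part is the linear component, a logistic regression.  
  2. The "deep" part is a feed-forward neural network. It converts high-dimensional sparse input vectors into low-dimensional dense vectors, typically with 10 to 100 dimensions. 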
    
<img src='img/widedeep.png' width='70%' height='70%'>

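3. The wide model  
 The wide model is linear. Its features include the raw features and cross-product (combination) features. The $k$-th cross-product feature is defined as $${ \phi }_{ k }\left( x \right) =\prod _{ i=1 }^{ d }{ { { x }_{ i } }^{ { c }_{ ki } } } \quad ,\quad { c }_{ ki }\in \left\{ 0,1 \right\} $$ where ${ c }_{ ki }$ is a boolean variable indicating whether feature ${ x }_{ i }$ takes part in the $k$-th cross-product feature.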
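
The cross-product definition above can be sketched in a few lines of Python. This is only an illustration for binary features; the feature layout and the names `cross_product`, `female_and_en`, and `female_and_zh` are assumptions made for the example, not part of the original notes.

```python
def cross_product(x, c_k):
    """phi_k(x) = prod_i x_i ** c_ki, with every c_ki in {0, 1}."""
    result = 1
    for x_i, c_ki in zip(x, c_k):
        # x_i ** 0 == 1, so features outside the cross do not affect the product.
        result *= x_i ** c_ki
    return result


# Hypothetical binary features: [gender=female, language=en, language=zh]
x = [1, 1, 0]
female_and_en = [1, 1, 0]  # selects gender=female AND language=en
female_and_zh = [1, 0, 1]  # selects gender=female AND language=zh

fires = cross_product(x, female_and_en)
silent = cross_product(x, female_and_zh)
```

For binary inputs the cross feature is 1 only when every selected feature is 1, which is why it behaves like a logical AND.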

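### III. Generating Training Data
1. Categorical features are mapped to integer IDs (no one-hot encoding).
2. Continuous real-valued features are first divided into $n_q$ quantiles. A feature value that falls into the $i$-th quantile is replaced by $${ x }_{ i }=\frac { i-1 }{ { n }_{ q }-1 } $$ so continuous features are mapped to values in $[0,1]$.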
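
A minimal sketch of the quantile normalization in step 2, using only the standard library. The notes do not say how the quantile boundaries are computed, so the simple order-statistic rule in `quantile_boundaries` is an assumption, as are the function names and the sample data.

```python
import bisect


def quantile_boundaries(values, n_q):
    """Return n_q - 1 cut points that split the sorted values into n_q groups.

    Assumption: boundaries are taken directly from the sorted sample.
    """
    ordered = sorted(values)
    return [ordered[len(ordered) * j // n_q] for j in range(1, n_q)]


def normalize(value, boundaries):
    """Map a raw value to (i - 1) / (n_q - 1), where i is its 1-based quantile."""
    n_q = len(boundaries) + 1
    i = bisect.bisect_right(boundaries, value) + 1
    return (i - 1) / (n_q - 1)


sample = [10, 20, 30, 40, 50, 60, 70, 80]
boundaries = quantile_boundaries(sample, n_q=4)
lowest = normalize(10, boundaries)   # first quantile
highest = normalize(80, boundaries)  # last quantile
```

Values in the first quantile map to 0 and values in the last quantile map to 1, so both endpoints of the range are reached.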

### IV. Training Process
App recommendation in the Google Play store serves as the example. The overall model is shown below.
<img src='img/wandd.png' height='70%' width='70%'>
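#### Deep model  
1. For categorical features:  
 Each feature learns its own embedding matrix. The mapped integer ID selects a row of that matrix (equivalent to multiplying the matrix by the ID's one-hot vector), which gives the embedding vector of the feature value. Here the embedding vectors are 32-dimensional. 
2. For continuous real-valued features:   
 The continuous features, already normalized by quantile, are concatenated with the embedding vectors of the categorical features into one dense vector of roughly 1200 dimensions, which is then passed through 3 ReLU layers.  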
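
The two steps above can be sketched in plain Python. The example checks that an embedding lookup equals a one-hot vector multiplied by the embedding matrix, then concatenates the result with a continuous feature and applies a single ReLU layer. The matrix values, the 2-dimensional embedding (the notes use 32), and all function names are made up for illustration.

```python
def embedding_lookup(matrix, feature_id):
    """Select the row of the embedding matrix that belongs to an integer ID."""
    return matrix[feature_id]


def one_hot_matmul(matrix, feature_id):
    """Same result, written as one_hot(feature_id) multiplied by the matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    one_hot = [1.0 if r == feature_id else 0.0 for r in range(n_rows)]
    return [sum(one_hot[r] * matrix[r][c] for r in range(n_rows))
            for c in range(n_cols)]


def relu_layer(inputs, weights, biases):
    """One fully connected ReLU layer; weights[j] holds unit j's input weights."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Toy embedding matrix: 3 possible IDs, 2 dimensions each.
embedding = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.7]]
vec = embedding_lookup(embedding, 1)

# Concatenate the embedding with one quantile-normalized continuous feature.
dense_input = vec + [0.25]
hidden = relu_layer(dense_input,
                    weights=[[1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
                    biases=[0.0, 0.0])
```

In practice the lookup form is preferred because it avoids materializing the one-hot vector.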
 
#### Wide model
The features are crossed into new features ${ \phi }_{ k }\left( x \right) $, which are fed into a logistic regression together with the output of the deep model.
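
A small sketch of this joint logistic regression. In the Wide & Deep paper the wide logit and the deep logit are summed and passed through a single sigmoid, $P(Y=1|x)=\sigma(w_{wide}^{T}[x,\phi(x)]+w_{deep}^{T}a+b)$, where $a$ is the last hidden activation of the deep model. The weights and inputs below are toy values chosen for illustration, and `wide_deep_predict` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def wide_deep_predict(wide_x, w_wide, deep_activation, w_deep, bias):
    """Sum the wide and deep logits, then apply one shared sigmoid."""
    logit = dot(w_wide, wide_x) + dot(w_deep, deep_activation) + bias
    return sigmoid(logit)


# Toy inputs: raw plus crossed features for the wide part,
# last hidden-layer activations for the deep part.
wide_x = [1.0, 0.0, 1.0]
w_wide = [0.5, -0.2, 0.1]
deep_activation = [0.3, 0.0]
w_deep = [1.0, -1.0]

p = wide_deep_predict(wide_x, w_wide, deep_activation, w_deep, bias=-0.9)
```

Because both parts feed one loss, their weights are trained jointly rather than as two separately trained models whose scores are averaged.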

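#### Retraining
Every time a new batch of user-interaction data arrives, the model has to be trained again. To limit the impact on user experience, recommendations can keep being served with the embeddings and model parameters produced by the previous training run.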

### V. Training a Wide & Deep Model with TensorFlow

#### 1. Sparse tensors in TensorFlow
A sparse tensor consists of three parts, (indices, values, dense_shape), for example:

```python
SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
```

The equivalent dense tensor is:

```python
[[1, 0, 0, 0],
 [0, 0, 2, 0],
 [0, 0, 0, 0]]
```
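
To make the correspondence between the two representations concrete, here is a pure-Python sketch that expands the (indices, values, dense_shape) triple into nested lists. `sparse_to_dense` is a hypothetical helper written for this illustration, not a TensorFlow API, and it handles only the 2-D case.

```python
def sparse_to_dense(indices, values, dense_shape):
    """Expand a 2-D sparse (indices, values, dense_shape) triple into rows."""
    n_rows, n_cols = dense_shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


dense = sparse_to_dense(indices=[[0, 0], [1, 2]],
                        values=[1, 2],
                        dense_shape=[3, 4])
```

Each entry of `indices` gives the coordinates of the value at the same position in `values`; every other cell is zero.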
#### 2. Mapping categorical features to integer IDs
```python
tf.contrib.layers.sparse_column_with_hash_bucket()
# Without a vocabulary file, a categorical feature can still be converted
# to an integer ID by hashing:
#     output_id = Hash(input_feature_string) % bucket_size
```
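
The hashing rule can be tried out without TensorFlow. The sketch below uses MD5 from the standard library as a stand-in hash; TensorFlow uses its own internal hash function, so the IDs produced here will not match TensorFlow's. Python's built-in `hash()` is avoided because it is randomized per process for strings. `hash_bucket` is a hypothetical helper name.

```python
import hashlib


def hash_bucket(feature_string, bucket_size):
    """output_id = Hash(input_feature_string) % bucket_size (MD5 stand-in)."""
    digest = hashlib.md5(feature_string.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size


ids = {name: hash_bucket(name, bucket_size=100)
       for name in ["State-gov", "Private", "Self-emp-not-inc"]}
```

Different strings can land in the same bucket. A larger `bucket_size` lowers the collision rate at the cost of a larger weight or embedding table.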

In [1]:
```python
import pandas as pd
import tensorflow as tf  # the tf.contrib APIs used below require TensorFlow 1.x
```



In [3]:
```python
df = pd.read_csv('../../data/census.csv')
df.head()
```

| Unnamed: 0 | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 3 | 53 | Private | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Cuba | <=50K |


In [10]:
```python
columns = ['age', 'workclass', 'education_level', 'education-num', 'marital-status', 'occupation', 'relationship',
           'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
label_column = 'label'
# Categorical features
categorical_column = ['workclass', 'education_level', 'marital-status',
                      'occupation', 'relationship', 'race', 'sex', 'native-country']
# Continuous real-valued features
continuous_column = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Binary label: True when income is above 50K
df[label_column] = df['income'].apply(lambda x: x == '>50K')
df.head()
```

| Unnamed: 0 | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K | False |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K | False |
| 2 | 38 | Private | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K | False |
| 3 | 53 | Private | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K | False |
| 4 | 28 | Private | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Cuba | <=50K | False |


In [67]:
```python
def get_tf_input(df):
    '''Build the input tensors (features and labels) from a DataFrame.'''
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {col_name: tf.constant(df[col_name].values)
                       for col_name in continuous_column}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {col_name: tf.SparseTensor(indices=[[i, 0] for i in range(df[col_name].size)],
                                                  values=df[col_name].values,
                                                  dense_shape=[df[col_name].size, 1])
                        for col_name in categorical_column}
    # Merges the two dictionaries into one.
    features_col = dict(list(continuous_cols.items()) + list(categorical_cols.items()))
    # Converts the label column into a constant Tensor.
    labels = tf.constant(df[label_column].values)

    return features_col, labels
```

In [68]:
```python
def define_wide_and_deep_features(df):
    '''Define how the crossed features of the wide model
       and the input features of the deep model are built.'''
    # Categorical base columns.
    workclass = tf.contrib.layers.sparse_column_with_hash_bucket('workclass',hash_bucket_size=100)
    education_level = tf.contrib.layers.sparse_column_with_hash_bucket('education_level',hash_bucket_size=100)
    marital_status = tf.contrib.layers.sparse_column_with_hash_bucket('marital-status',hash_bucket_size=1000)
    occupation = tf.contrib.layers.sparse_column_with_hash_bucket('occupation',hash_bucket_size=1000)
    relationship = tf.contrib.layers.sparse_column_with_hash_bucket('relationship',hash_bucket_size=100)
    race = tf.contrib.layers.sparse_column_with_keys(column_name='race',keys=df['race'].unique())
    sex = tf.contrib.layers.sparse_column_with_keys(column_name='sex',keys=df['sex'].unique())
    native_country = tf.contrib.layers.sparse_column_with_hash_bucket('native-country',hash_bucket_size=1000)
    # Continuous base columns.
    age = tf.contrib.layers.real_valued_column('age')
    age_buckets = tf.contrib.layers.bucketized_column(age,boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    education_num = tf.contrib.layers.real_valued_column('education-num')
    capital_gain = tf.contrib.layers.real_valued_column('capital-gain')
    capital_loss = tf.contrib.layers.real_valued_column('capital-loss')
    hours_per_week = tf.contrib.layers.real_valued_column('hours-per-week')
    # wide_columns
    wide_columns = [workclass,education_level,occupation,relationship,race,sex,native_country,age_buckets,
                    # Crossed features of the wide model
                    tf.contrib.layers.crossed_column([education_level,occupation],hash_bucket_size=int(1e4)),
                    tf.contrib.layers.crossed_column([native_country,occupation],hash_bucket_size=int(1e4)),
                    tf.contrib.layers.crossed_column([age_buckets,education_level,occupation],hash_bucket_size=int(1e6))
                   ]
    # deep_columns
    deep_columns = [
        tf.contrib.layers.embedding_column(workclass,dimension=8),
        tf.contrib.layers.embedding_column(education_level,dimension=8),
        tf.contrib.layers.embedding_column(sex,dimension=8),
        tf.contrib.layers.embedding_column(relationship,dimension=8),
        tf.contrib.layers.embedding_column(native_country,dimension=8),
        tf.contrib.layers.embedding_column(occupation,dimension=8),
        age,education_num,capital_gain,capital_loss,hours_per_week
    ]
    return wide_columns,deep_columns
```
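
The bucketized and crossed columns defined above can be imitated in plain Python to see what IDs they produce conceptually. This is a rough sketch under stated assumptions: buckets are taken to be left-closed intervals between consecutive boundaries, and the cross is hashed with MD5 over a joined string, which is a stand-in and not TensorFlow's actual hashing scheme. `bucketize`, `crossed_id`, and the `"_X_"` separator are hypothetical.

```python
import bisect
import hashlib

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Return the bucket index; n boundaries produce n + 1 buckets."""
    return bisect.bisect_right(boundaries, value)


def crossed_id(values, hash_bucket_size):
    """Hash a combination of feature values into one of hash_bucket_size IDs."""
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


age_bucket = bucketize(39, AGE_BOUNDARIES)
cross = crossed_id([age_bucket, "Bachelors", "Adm-clerical"],
                   hash_bucket_size=int(1e6))
```

Hashing the cross keeps the feature space bounded: the three-way cross in the notebook could take far more distinct values than are ever observed, so only `hash_bucket_size` weights are allocated.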

In [69]:
```python
if __name__ == "__main__":
    import tempfile
    # Checkpoints and summaries are written to a fresh temporary directory.
    model_dir = tempfile.mkdtemp()

    wide_columns, deep_columns = define_wide_and_deep_features(df)

    # The linear (wide) part and the DNN (deep) part are trained jointly.
    model = tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 50]
    )

    # The first 36177 rows are used for training, the rest for evaluation.
    def get_train():
        return get_tf_input(df[:36177])
    def get_test():
        return get_tf_input(df[36177:])

    model.fit(input_fn=get_train, steps=200)
    res = model.evaluate(input_fn=get_test, steps=1)
    for key in sorted(res):
        print("%s: %s" % (key, res[key]))
```

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3119831b00>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/tmpc3oa02tt'}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 2 into /tmp/tmpc3oa02tt/model.ckpt.
INFO:tensorflow:loss = 52.793495, step = 0
INFO:tensorf