## Binary Classification with LR, GBDT, and DNN in TensorFlow 2

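#### CTR prediction is used in the ranking stage of a recommender system:
* It is a binary classification problem: in the collected samples, 1 means the item was clicked (positive sample) and 0 means it was not clicked (negative sample)
* The input is the user's information plus the list of candidate items produced by the recall stage
* The output is the probability that the user clicks each item; sorting by probability and taking the top N gives the recommendation result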
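
The ranking step described in the list above can be sketched as follows. The item IDs and probabilities are made-up values for illustration only; in practice the probabilities would come from a trained CTR model, and `top_n` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def top_n(click_probs: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n candidate items with the highest predicted click probability."""
    ranked = sorted(click_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical candidates from the recall stage, with predicted click probabilities.
candidates = {"item_a": 0.12, "item_b": 0.57, "item_c": 0.03, "item_d": 0.31}

recommended = top_n(candidates, n=2)
print(recommended)  # [('item_b', 0.57), ('item_d', 0.31)]
```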

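#### Common ranking algorithms for CTR prediction
* LR: logistic regression, the simplest approach, often used in the early stage of a recommender system; fast and interpretable
* GBDT: a tree ensemble model that predicts with the combined output of N trees; it usually outperforms LR
* DNN: deep learning, which stacks multiple layers to learn non-linear transformations; it needs a large dataset and careful hyperparameter tuning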
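
To make the "simplest approach" concrete, here is a minimal sketch of how a logistic regression model turns features into a click probability: a weighted sum followed by a sigmoid. The weights, bias, and feature values are hypothetical, not learned from the dataset used in this notebook.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_predict(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic regression: sigmoid of the weighted feature sum plus a bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical weights for two features and a bias term.
weights = [0.8, -1.5]
bias = -2.0

p = lr_predict([1.0, 0.0], weights, bias)
print(round(p, 4))  # 0.2315
```

Because each weight maps to exactly one feature, its sign and size can be read off directly, which is the interpretability that the LR bullet refers to.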

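#### Demo workflow:

1. Load the CSV file with Pandas.
2. Read the data with tf.data.Dataset, then shuffle and batch it.
3. Use feature_column to map the CSV columns to the features used to train the model.
4. Train LR, GBDT, and DNN models and compare their AUC.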
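
Step 4 compares the models by AUC. As a reminder of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score, counting ties as half. The labels and scores are toy values, and `pairwise_auc` is a hypothetical helper; the value reported by an estimator may differ slightly because it is computed differently internally.

```python
from typing import Sequence


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data: two positives, three negatives.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

auc = pairwise_auc(labels, scores)
print(round(auc, 4))  # 0.8333
```

Unlike accuracy, this quantity depends only on how the examples are ordered, which suits a ranking task.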

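#### Bank marketing dataset

The data comes from the phone-based marketing campaigns of a Portuguese banking institution.

Each row records the phone contacts with one client, and the last column states whether that client ultimately subscribed to the product, so this is a binary classification problem.

The columns are: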
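01. age: numeric
02. job: categorical, type of job, e.g. admin., entrepreneur
03. marital: categorical, marital status: married, divorced, single
04. education: categorical: primary, secondary, tertiary, unknown
05. default: categorical (yes/no), whether the client has credit in default
06. balance: numeric, average yearly balance
07. housing: categorical (yes/no), whether the client has a housing loan
08. loan: categorical (yes/no), whether the client has a personal loan
09. contact: categorical, contact communication type: unknown, cellular, telephone
10. day: numeric, last contact day of the month
11. month: categorical, last contact month (jan to dec)
12. duration: numeric, duration of the last contact, in seconds
13. campaign: numeric, number of contacts made to this client during this campaign, including the last contact
14. pdays: numeric, days since the client was last contacted in a previous campaign (-1 means not previously contacted)
15. previous: numeric, number of contacts made to this client before this campaign
16. poutcome: categorical, outcome of the previous campaign: unknown, other, failure, success
17. y: binary (yes/no), whether the client subscribed to a term deposit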

### 1. Import libraries

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from sklearn.model_selection import train_test_split

### 2. Read the CSV with Pandas

In [2]:
df = pd.read_csv("./datas/bank/bank-full.csv", sep=";")
df.head()

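| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |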


In [3]:
df.shape

(45211, 17)

In [4]:
df["y"].unique()

array(['no', 'yes'], dtype=object)

In [5]:
# Encode the label as an integer: "yes" -> 1 (positive), "no" -> 0 (negative)
df["y"] = df["y"].map({"yes": 1, "no": 0}).astype(int)
df.head()

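| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | 0 |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | 0 |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | 0 |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | 0 |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | 0 |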


### 3. Split the dataframe into training and test sets

In [6]:
train, test = train_test_split(df, test_size=0.2)
y_train = train.pop("y")
y_test = test.pop("y")

train.shape, test.shape, y_train.shape, y_test.shape

((36168, 16), (9043, 16), (36168,), (9043,))
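
The shapes printed above follow from simple arithmetic on the row count. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to get the test count and gives the remainder to the training set, which is consistent with the output shown.

```python
import math

n_rows = 45211      # rows reported by df.shape
test_size = 0.2     # fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, the rest goes to training.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)  # 36168 9043
```

The split in this notebook is not seeded, so the rows that land in each set (and therefore the metrics reported later) change from run to run.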

### 4. Read the data with tf.data.Dataset

In [7]:
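def make_input_fn(data_df, label_df, num_epochs=50, shuffle=True, batch_size=32):
    """Build an input function that yields (features dict, label) batches for an Estimator."""
    def input_function():
        # One dataset element per row: ({column name: value}, label)
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            # Shuffle with a 1000-element buffer before batching
            ds = ds.shuffle(1000)
        # Batch the rows, then repeat the batched dataset num_epochs times
        ds = ds.batch(batch_size).repeat(num_epochs)
        return ds
    return input_function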

In [8]:
train_input_fn = make_input_fn(train, y_train)
test_input_fn = make_input_fn(test, y_test, num_epochs=1, shuffle=False)
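
It helps to know how many training steps this input function can supply, since each Estimator training step consumes one batch. The sketch below is plain arithmetic on the numbers already in this notebook (36168 training rows, batch size 32, 50 epochs) and relates them to the `max_steps` values used in the training cells of sections 6.1 to 6.3; `epochs_covered` is a hypothetical helper.

```python
import math

n_train = 36168     # training rows from the split
batch_size = 32     # default in make_input_fn
num_epochs = 50     # default in make_input_fn

# The dataset is batched first and then repeated, so every epoch ends
# with one partial batch.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps_available = steps_per_epoch * num_epochs


def epochs_covered(max_steps: int) -> float:
    """Number of passes over the training set made in max_steps steps."""
    return max_steps / steps_per_epoch


print(steps_per_epoch, total_steps_available)  # 1131 56550
print(round(epochs_covered(1000), 2))          # 0.88
print(round(epochs_covered(10000), 2))         # 8.84
```

So `max_steps=1000` stops before one full pass over the training data, while `max_steps=10000` covers just under nine passes; in both cases `max_steps`, not `num_epochs`, is what ends training.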

### 5. Feature processing with feature_column


In [9]:
# Names of the categorical columns
category_names = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]

# Names of the numeric columns used as features (age and day are left out)
numeric_names = ["balance", "duration", "campaign", "pdays", "previous"]
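
The numeric columns are fed to the models as raw values, and their scales differ widely (in the sample rows, `balance` ranges from 1 to 2143 while `previous` is 0). Standardizing them is a common preprocessing step, although this notebook does not do it. The sketch below is an assumption-laden illustration: `make_zscore` is a hypothetical helper, and the five `balance` values are just the rows shown by `df.head()`; real statistics should come from the training set.

```python
import statistics
from typing import Callable, Sequence


def make_zscore(values: Sequence[float]) -> Callable[[float], float]:
    """Return a function that standardizes a value with the given sample's mean and std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant columns
    return lambda x: (x - mean) / std


# Illustration only: the balance values from the five rows printed by df.head().
balance_sample = [2143, 29, 2, 1506, 1]
normalize_balance = make_zscore(balance_sample)

scaled = [normalize_balance(v) for v in balance_sample]
print([round(v, 2) for v in scaled])
```

`feature_column.numeric_column` accepts a `normalizer_fn` argument, so a function of this shape could in principle be passed there; whether it improves the results here is untested.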

In [10]:
# Collect the vocabulary (distinct values) of each categorical column
category_vocabulary_list = {}

for feature_name in category_names:
    category_vocabulary_list[feature_name] = list(df[feature_name].unique())

print(category_vocabulary_list)

{'job': ['management', 'technician', 'entrepreneur', 'blue-collar', 'unknown', 'retired', 'admin.', 'services', 'self-employed', 'unemployed', 'housemaid', 'student'], 'marital': ['married', 'single', 'divorced'], 'education': ['tertiary', 'secondary', 'unknown', 'primary'], 'default': ['no', 'yes'], 'housing': ['yes', 'no'], 'loan': ['no', 'yes'], 'contact': ['unknown', 'cellular', 'telephone'], 'month': ['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb', 'mar', 'apr', 'sep'], 'poutcome': ['unknown', 'failure', 'other', 'success']}


In [11]:
feature_columns = []

for feature_name in category_names:
    feature_columns.append(
        feature_column.indicator_column(
            feature_column.categorical_column_with_vocabulary_list(feature_name, category_vocabulary_list[feature_name])
        )
    )
    
for feature_name in numeric_names:
    feature_columns.append(feature_column.numeric_column(feature_name))

feature_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='job', vocabulary_list=('management', 'technician', 'entrepreneur', 'blue-collar', 'unknown', 'retired', 'admin.', 'services', 'self-employed', 'unemployed', 'housemaid', 'student'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='marital', vocabulary_list=('married', 'single', 'divorced'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='education', vocabulary_list=('tertiary', 'secondary', 'unknown', 'primary'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='default', vocabulary_list=('no', 'yes'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='housing', vocabulary_list=('yes', 'no'), dtype=tf.
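
Conceptually, an indicator column over a vocabulary list is a one-hot encoding. The pure-Python sketch below mimics that idea and also adds up the width of the resulting feature vector from the vocabulary sizes printed earlier. `one_hot` is a hypothetical helper, and the all-zeros encoding for unseen values is an assumption based on the `default_value=-1, num_oov_buckets=0` settings in the output above.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """One-hot encode value against vocabulary; unseen values become all zeros."""
    return [1 if value == v else 0 for v in vocabulary]


marital_vocab = ["married", "single", "divorced"]
print(one_hot("single", marital_vocab))   # [0, 1, 0]
print(one_hot("widowed", marital_vocab))  # [0, 0, 0]

# Vocabulary sizes taken from the category_vocabulary_list printed earlier.
vocab_sizes: Dict[str, int] = {
    "job": 12, "marital": 3, "education": 4, "default": 2, "housing": 2,
    "loan": 2, "contact": 3, "month": 12, "poutcome": 4,
}
n_numeric = 5  # balance, duration, campaign, pdays, previous

# Each categorical column contributes one slot per vocabulary entry,
# each numeric column contributes one slot.
input_width = sum(vocab_sizes.values()) + n_numeric
print(input_width)  # 49
```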

### 6.1 Train and evaluate the LR model

In [12]:
train_input_fn = make_input_fn(train, y_train)
test_input_fn = make_input_fn(test, y_test, num_epochs=1, shuffle=False)

In [13]:
linear_est = tf.estimator.LinearClassifier(
    feature_columns=feature_columns)

linear_est.train(train_input_fn, max_steps=1000)
result = linear_est.evaluate(test_input_fn)
result

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmppa3dx8bs', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Varia

{'accuracy': 0.89815325,
 'accuracy_baseline': 0.88809025,
 'auc': 0.8428245,
 'auc_precision_recall': 0.47382975,
 'average_loss': 0.42300984,
 'label/mean': 0.11190976,
 'loss': 0.42291445,
 'precision': 0.56565654,
 'prediction/mean': 0.08622836,
 'recall': 0.38735178,
 'global_step': 1000}
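
The accuracy above looks high, but the classes are imbalanced, so it is worth comparing it with the trivial baseline. The sketch below reproduces `accuracy_baseline` from `label/mean`, using only numbers from the LR output; the variable names are illustrative.

```python
# Values copied from the LR evaluation output above.
label_mean = 0.11190976   # fraction of positive examples in the test set
accuracy = 0.89815325

# Always predicting the majority class (0) is right for every negative example.
baseline_accuracy = 1.0 - label_mean
lift_over_baseline = accuracy - baseline_accuracy

print(round(baseline_accuracy, 4))   # 0.8881
print(round(lift_over_baseline, 4))  # 0.0101
```

The model beats the always-negative baseline by only about one percentage point of accuracy, which is why this notebook compares models by AUC rather than accuracy.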

### 6.2 Train and evaluate the GBDT model

In [14]:
train_input_fn = make_input_fn(train, y_train)
test_input_fn = make_input_fn(test, y_test, num_epochs=1, shuffle=False)

In [15]:
gbdt = tf.estimator.BoostedTreesClassifier(
    feature_columns=feature_columns, n_batches_per_layer=100)

gbdt.train(train_input_fn, max_steps=1000)
result = gbdt.evaluate(test_input_fn)
result

Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs in

{'accuracy': 0.90014374,
 'accuracy_baseline': 0.88809025,
 'auc': 0.8591649,
 'auc_precision_recall': 0.48814416,
 'average_loss': 0.3558857,
 'label/mean': 0.11190976,
 'loss': 0.35584077,
 'precision': 0.62471396,
 'prediction/mean': 0.253982,
 'recall': 0.26976284,
 'global_step': 1000}
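
For intuition about what "the combined output of N trees" means, the sketch below shows the generic additive form of a boosted tree ensemble for binary classification: each tree contributes a score, the scores are summed with a bias, and a sigmoid turns the sum into a probability. The stumps, thresholds, leaf values, and bias are hand-written and hypothetical; they are not learned from the data and are not meant to describe the internals of `BoostedTreesClassifier`.

```python
import math
from typing import Callable, Dict, List

Row = Dict[str, float]


def stump(feature: str, threshold: float, left: float, right: float) -> Callable[[Row], float]:
    """A depth-1 tree: return left if row[feature] <= threshold, else right."""
    def predict(row: Row) -> float:
        return left if row[feature] <= threshold else right
    return predict


# Hypothetical ensemble of three stumps (values chosen by hand for illustration).
trees: List[Callable[[Row], float]] = [
    stump("duration", 300.0, -0.4, 0.6),
    stump("pdays", 0.0, -0.1, 0.3),
    stump("balance", 1000.0, -0.05, 0.1),
]
bias = -2.0  # hypothetical starting score


def gbdt_predict(row: Row) -> float:
    """Sum the tree outputs with the bias and squash the sum with a sigmoid."""
    logit = bias + sum(tree(row) for tree in trees)
    return 1.0 / (1.0 + math.exp(-logit))


short_call = {"duration": 76.0, "pdays": -1.0, "balance": 2.0}
long_call = {"duration": 600.0, "pdays": 90.0, "balance": 1500.0}

print(round(gbdt_predict(short_call), 4))  # 0.0724
print(round(gbdt_predict(long_call), 4))   # 0.2689
```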

### 6.3 Train and evaluate the DNN model

In [16]:
train_input_fn = make_input_fn(train, y_train)
test_input_fn = make_input_fn(test, y_test, num_epochs=1, shuffle=False)

In [17]:
dnn = tf.estimator.DNNClassifier(
    hidden_units=[64, 32, 16], 
    feature_columns=feature_columns)

dnn.train(train_input_fn, max_steps=10000)
result = dnn.evaluate(test_input_fn)
result

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.


INFO:tensorflow:global_step/sec: 848.486
INFO:tensorflow:loss = 0.40685356, step = 6900 (0.118 sec)
INFO:tensorflow:global_step/sec: 855.584
INFO:tensorflow:loss = 0.38497496, step = 7000 (0.117 sec)
INFO:tensorflow:global_step/sec: 653.319
INFO:tensorflow:loss = 0.3524521, step = 7100 (0.153 sec)
INFO:tensorflow:global_step/sec: 698.997
INFO:tensorflow:loss = 0.3569974, step = 7200 (0.142 sec)
INFO:tensorflow:global_step/sec: 689.407
INFO:tensorflow:loss = 0.37861946, step = 7300 (0.145 sec)
INFO:tensorflow:global_step/sec: 724.54
INFO:tensorflow:loss = 0.371001, step = 7400 (0.138 sec)
INFO:tensorflow:global_step/sec: 643.304
INFO:tensorflow:loss = 0.42895234, step = 7500 (0.155 sec)
INFO:tensorflow:global_step/sec: 691.365
INFO:tensorflow:loss = 0.38287464, step = 7600 (0.145 sec)
INFO:tensorflow:global_step/sec: 724.676
INFO:tensorflow:loss = 0.3309633, step = 7700 (0.138 sec)
INFO:tensorflow:global_step/sec: 627.989
INFO:tensorflow:loss = 0.22624245, step = 7800 (0.162 sec)
INFO:t

{'accuracy': 0.8915183,
 'accuracy_baseline': 0.88809025,
 'auc': 0.74716,
 'auc_precision_recall': 0.33547643,
 'average_loss': 0.33921474,
 'label/mean': 0.11190976,
 'loss': 0.3391198,
 'precision': 0.5536332,
 'prediction/mean': 0.15207805,
 'recall': 0.15810277,
 'global_step': 10000}
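
To compare the three runs side by side, the sketch below collects the metrics reported in sections 6.1 to 6.3, ranks the models by AUC, and derives two extra numbers: F1 from precision and recall, and the ratio of mean prediction to mean label as a rough calibration check. All values are copied from the outputs above; the `results` dictionary and helper names are illustrative.

```python
from typing import Dict

# Metrics copied from the three evaluation outputs above.
results: Dict[str, Dict[str, float]] = {
    "LR":   {"auc": 0.8428245, "precision": 0.56565654, "recall": 0.38735178, "prediction_mean": 0.08622836},
    "GBDT": {"auc": 0.8591649, "precision": 0.62471396, "recall": 0.26976284, "prediction_mean": 0.253982},
    "DNN":  {"auc": 0.74716,   "precision": 0.5536332,  "recall": 0.15810277, "prediction_mean": 0.15207805},
}
label_mean = 0.11190976  # same test set for all three models


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Model names ordered from highest to lowest AUC.
ranking = sorted(results, key=lambda name: results[name]["auc"], reverse=True)

for name in ranking:
    m = results[name]
    f1 = f1_score(m["precision"], m["recall"])
    calibration = m["prediction_mean"] / label_mean  # 1.0 would match the base rate
    print(f"{name:5s} auc={m['auc']:.4f} f1={f1:.3f} pred/label={calibration:.2f}")
```

On this run GBDT has the highest AUC (0.859), followed by LR (0.843) and then the DNN (0.747). Because the train/test split is not seeded and the models were trained with different step budgets, these numbers should be read as one sample rather than a definitive ranking.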