## Building your own regression model with TensorFlow's deep regression estimator, sklearn-style

#### Preface

I have been exploring TensorFlow lately, and the framework is evolving so quickly that examples found online often need adjusting after a version update. So I am recording how I built my own model. The process draws mainly on a [Kaggle post](https://www.kaggle.com/usersumit/allstate-claims-severity/tensorflow-dnnregressor) and a [QSAR article](http://wikicoursenote.com/wiki/Deep_Neural_Nets_as_a_Method_for_Quantitative_Structure%E2%80%93Activity_Relationships).

Accordingly, the data comes from a Kaggle competition, and I adjusted and extended some of the neural network parameters.

#### Data

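The data comes from a [Kaggle competition](https://www.kaggle.com/c/allstate-claims-severity) about prediction from an insurance company's customer data. It contains categorical features and numerical features. I did not pay much attention to what the features mean; they only serve as sample data for building the neural network. I therefore took just the first 5,000 or so rows of the training set (4,999 rows once loaded) and used them both as training data and as test data.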

In [11]:
import pandas as pd
df_train_ori = pd.read_csv('train_head.csv')
df_train_ori.head()

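| Unnamed: 0 | id | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | cont14 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | B | A | B | A | A | A | A | B | ... | 0.718367 | 0.33506 | 0.3026 | 0.67135 | 0.8351 | 0.569745 | 0.594646 | 0.822493 | 0.714843 | 2213.18 |
| 1 | 2 | A | B | A | A | A | A | A | A | B | ... | 0.438917 | 0.436585 | 0.60087 | 0.35127 | 0.43919 | 0.338312 | 0.366307 | 0.611431 | 0.304496 | 1283.6 |
| 2 | 5 | A | B | A | A | B | A | A | A | B | ... | 0.289648 | 0.315545 | 0.2732 | 0.26076 | 0.32446 | 0.381398 | 0.373424 | 0.195709 | 0.774425 | 3005.09 |
| 3 | 10 | B | B | A | B | A | A | A | A | B | ... | 0.440945 | 0.391128 | 0.31796 | 0.32128 | 0.44467 | 0.327915 | 0.32157 | 0.605077 | 0.602642 | 939.85 |
| 4 | 11 | A | B | A | B | A | A | A | A | B | ... | 0.178193 | 0.247408 | 0.24564 | 0.22089 | 0.2123 | 0.204687 | 0.202213 | 0.246011 | 0.432606 | 2763.85 |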


In [12]:
print("Shape of data: %s" % str(df_train_ori.shape))

Shape of data: (4999, 132)


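Columns whose names start with "cat" are categorical features and are one-hot encoded; columns starting with "cont" are numerical features and are left as they are. The two groups are then merged back together.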

In [13]:
label = "loss"
# drop id and fetch y.
df_train_ori.drop('id', axis=1, inplace=True)
y = df_train_ori[label]
# one hot encode
features = df_train_ori.columns
continuous_features = [feature for feature in features if 'cont' in feature]
categorical_features = [feature for feature in features if 'cat' in feature]
one_hot = pd.get_dummies(df_train_ori[categorical_features])
training_set = df_train_ori.drop(categorical_features, axis=1)
training_set = training_set.join(one_hot)
# re-fetch feature names
features = [f for f in training_set.columns if f != label]
print("Shape of data: %s" % str(training_set.shape))

Shape of data: (4999, 849)
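The jump from 132 to 849 columns comes from one-hot encoding: every distinct level of a categorical column becomes its own 0/1 column. As a rough illustration of the idea (not the pandas implementation), here is a pure-Python sketch on made-up rows. The function name `one_hot_encode` and the toy data are assumptions for this example; the `<column>_<level>` naming follows what `pd.get_dummies` produces by default for a DataFrame.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with 0/1 indicator columns.

    `rows` is a list of dicts. Each new column is named "<column>_<level>".
    """
    # Collect the distinct levels once so every row gets the same columns.
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy everything except the column being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row["%s_%s" % (column, level)] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Toy rows for illustration only, not taken from the Allstate data.
rows = [
    {"cat1": "A", "cont1": 0.72},
    {"cat1": "B", "cont1": 0.44},
    {"cat1": "A", "cont1": 0.29},
]
encoded = one_hot_encode(rows, "cat1")
print(encoded[0])
```

A column with k distinct levels contributes k indicator columns, which is why a frame with many categorical features grows so much wider.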


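Feature scaling is another important preprocessing step, and sklearn offers convenient, efficient tools for it (here `MinMaxScaler`, which by default rescales each feature to the range [0, 1]). Finally, the data is converted to arrays, which gives the final model input.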

In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# scale
scaler = MinMaxScaler()
training_set[features] = scaler.fit_transform(training_set[features])

# matrix (.values works across pandas versions; .as_matrix() was removed in pandas 1.0)
train_x = training_set[features].values
train_y = training_set[label].values
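For reference, min-max scaling maps each value of a column to `(x - min) / (max - min)`. The sketch below is a plain-Python illustration of that formula on a single column, not sklearn's implementation; the function name `min_max_scale` is made up for this example, and a constant column is mapped to 0.0 here simply to avoid dividing by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; return zeros instead of dividing by 0.
        return [0.0 for _ in column]
    return [float(x - lo) / float(hi - lo) for x in column]


scaled = min_max_scale([2.0, 4.0, 6.0])
print(scaled)

# One-hot columns already sit in {0, 1}, so scaling leaves them unchanged.
indicator = min_max_scale([0, 1, 1, 0])
print(indicator)
```

In this post the practical effect is therefore on the `cont` columns and any other numeric column; the indicator columns pass through as they are.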

#### Network structure, parameters, and training

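Here I build a neural network with three hidden layers and define a monitor to implement early stopping. The estimator is DNNRegressor. Its input requirements also call for feature column objects; because the data has been converted to arrays and no longer carries feature names, the object is created as a single unnamed real-valued column whose dimension is the number of features. For a regression problem, the criterion for early stopping is the MSE (the monitor in the code watches the validation `loss` and reports `MSE` as an extra metric), and part of the data (33%, per `test_size=0.33` in the code) is held out to compute it and decide whether to stop early.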

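Note that if the model is to be reloaded later, or even put into production, `model_dir` must be specified and kept. The `ops` arguments need to be saved too, so that next time the DNNRegressor can be built with the same parameters.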
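One possible way to keep the constructor arguments around is to store the plain values (layer sizes, dropout, learning rate, feature count, `model_dir`) as JSON and rebuild the feature columns and optimizer from them later, since those objects are not JSON-serializable themselves. The sketch below is only an assumption about how this could be done: the file name `dnn_params.json`, the `n_features` key, and the temporary directory are hypothetical, and the values mirror the settings used in this post.

```python
import json
import os
import tempfile

# Plain-value hyperparameters mirroring the settings used in this post.
params = {
    "hidden_units": [400, 200, 80],
    "dropout": 0.1,
    "learning_rate": 0.05,
    "n_features": 848,
    "model_dir": "/tmp/test",
}

# Hypothetical location; in practice this could live next to model_dir.
params_path = os.path.join(tempfile.mkdtemp(), "dnn_params.json")
with open(params_path, "w") as fh:
    json.dump(params, fh, indent=2)

# Later, in a fresh process: reload the values before rebuilding the estimator.
with open(params_path) as fh:
    restored = json.load(fh)

print(restored["hidden_units"], restored["model_dir"])
```

With the values restored, the estimator can be constructed with the same arguments and pointed at the same `model_dir`.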

In [20]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
# Feature cols
feature_cols = [tf.contrib.layers.real_valued_column("", dimension=len(features))]

########################################################################
# ------ PARAMETERS ---------
training_validation_split = 0.8            # NOTE: unused below; the split is set by test_size=0.33
learning_rate             = 0.05           
dropout                   = 0.1           
network_structure         = [ 400, 200, 80 ]  
early_stopping_rounds     = 100
validate_every_n_steps    = 50
max_steps                 = 290
######################################################################

######################################################################
# ----- Split training and validation sets -----
# ----- Create validation set for early stopping -----
X_train, X_validate, y_train, y_validate = train_test_split(train_x, train_y, test_size=0.33, random_state=42)

validation_metrics = {'MSE': tf.contrib.metrics.streaming_mean_squared_error}
val_monitor = tf.contrib.learn.monitors.ValidationMonitor(X_validate, y_validate,
  every_n_steps=validate_every_n_steps,
  metrics=validation_metrics,
  early_stopping_metric="loss",
  early_stopping_metric_minimize=True,
  early_stopping_rounds=early_stopping_rounds)
######################################################################

# Build 3 layer fully connected DNN 
ops = dict(feature_columns=feature_cols, 
          hidden_units=network_structure, 
          optimizer=tf.train.AdagradOptimizer(
            learning_rate= learning_rate,
          ),
          model_dir="/tmp/test",
          dropout=dropout)

regressor = tf.contrib.learn.DNNRegressor(**ops)

# Training with Fit
regressor.fit(X_train, y_train, steps=max_steps, monitors=[val_monitor])

Explicitly set `enable_centered_bias` to 'True' if you want to keep existing behaviour.


DNNRegressor(hidden_units=[400, 200, 80], dropout=0.1, optimizer=<tensorflow.python.training.adagrad.AdagradOptimizer object at 0x7f968029bad0>, feature_columns=[_RealValuedColumn(column_name='', dimension=848, default_value=None, dtype=tf.float32, normalizer=None)])
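To make the early-stopping rule concrete, here is a generic patience-based sketch in plain Python. It is not the `ValidationMonitor` source; it only illustrates the idea of evaluating the validation loss every few steps and stopping once the best value has not improved for a given number of steps. The function name `early_stopping_step` and the loss values are made up for illustration, and the parameter names simply echo the settings used above.

```python
def early_stopping_step(val_losses, every_n_steps, early_stopping_rounds):
    """Return the step at which training would stop, or None if it never stops.

    val_losses[i] is the validation loss measured at step (i + 1) * every_n_steps.
    Training stops at the first evaluation where the best loss seen so far is
    at least `early_stopping_rounds` steps old.
    """
    best_loss = float("inf")
    best_step = 0
    for i, loss in enumerate(val_losses):
        step = (i + 1) * every_n_steps
        if loss < best_loss:
            # New best value: remember when it happened.
            best_loss, best_step = loss, step
        elif step - best_step >= early_stopping_rounds:
            return step
    return None


# Made-up validation losses, evaluated every 50 steps.
stop_at = early_stopping_step([5.0, 4.0, 4.2, 4.1, 4.3],
                              every_n_steps=50, early_stopping_rounds=100)
print(stop_at)
```

In this toy run the best loss occurs at step 100, so with a patience of 100 steps the rule triggers at the evaluation on step 200.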

#### Prediction

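For convenience, the training set is reused for the test, so the scores below measure fit on data the model has already seen rather than generalization.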

In [23]:
from sklearn import metrics
import numpy as np
# predict
y_test = y_train

y_pred = regressor.predict(X_train, as_iterable=False)
# RMSE is the square root of the MSE
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# ---------- Score -------------
r2Score     = metrics.r2_score(y_test, y_pred)
exvarScore  = metrics.explained_variance_score(y_test, y_pred)
medaeScore  = metrics.median_absolute_error(y_test, y_pred)
maeScore    = metrics.mean_absolute_error(y_test, y_pred)
maeDevScore = np.std(np.absolute(y_pred - y_test))
mseScore    = metrics.mean_squared_error(y_test, y_pred)
 
r2String       = 'R^2 Score:             {0}'.format(r2Score)
varianceString = 'Explained Variance:    {0}'.format(exvarScore)
medianString   = 'Median Absolute Error: {0}'.format(medaeScore)
meanString     = 'Mean Absolute Error:   {0}'.format(maeScore)
meanDevString  = 'MAE Deviation:         {0}'.format(maeDevScore)
meanSqString   = 'Mean Squared Error:    {0}'.format(mseScore)
 
print(r2String)
print(varianceString)
print(medianString)
print(meanString)
print(meanDevString)
print(meanSqString)

Instructions for updating:
The default behavior of predict() is changing. The default value for
as_iterable will change to True, and then the flag will be removed
altogether. The behavior of this flag is described below.


R^2 Score:             0.876760863552
Explained Variance:    0.891644418607
Median Absolute Error: 520.000625
Mean Absolute Error:   725.465175455
MAE Deviation:         745.977314795
Mean Squared Error:    1082781.87499
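As a reminder of what these scores mean, the sketch below recomputes several of them from their textbook definitions in plain Python on a tiny made-up example. It is an illustration only, not sklearn's code, and the function name `regression_metrics` is an assumption for this example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression scores from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    mean_error = sum(errors) / n
    var_error = mse - mean_error ** 2  # variance of the residuals
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "r2": 1.0 - mse / var_true,
        "explained_variance": 1.0 - var_error / var_true,
    }


# Toy values for illustration only.
scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(scores)
```

Two things follow from these definitions. The RMSE is the square root of the MSE, so the MSE of about 1082781.87 reported above corresponds to an RMSE of about 1040.6. And explained variance differs from R^2 only by the squared mean residual divided by the variance of the targets, so an explained variance above R^2, as in the output above, indicates that the predictions are offset on average.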


#### Summary

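TensorFlow has now moved the skflow functions into tf.contrib.learn, and estimators such as DNNRegressor give users accustomed to sklearn a good starting point for trying TensorFlow. During training you can also monitor progress with `tensorboard --logdir=/tmp/test`; see the [official documentation](https://www.tensorflow.org/versions/r0.12/how_tos/summaries_and_tensorboard/index.html) for details. Also keep in mind that the code above was written against TensorFlow 0.11.0. Since each TensorFlow release currently brings sizable changes, check it against the version you are running.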

In [24]:
import tensorflow as tf
print(tf.__version__)

0.11.0
