## Classifying the Bank Marketing Dataset with TensorFlow feature_column

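### Workflow

1. Load the CSV file with Pandas.
2. Read the data with tf.data.Dataset, then shuffle and batch it.
3. Use feature_column to map the CSV columns to the features used to train the model.
4. Build, train, and evaluate the model with Keras.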

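### The Bank Marketing dataset

The data comes from the telephone marketing campaigns of a Portuguese banking institution.

Each row records the phone contact with one client, and the last column states whether that client ultimately subscribed to the product, which makes this a binary classification problem.

The columns are:
01. age: numeric
02. job: categorical, type of job, e.g. admin., entrepreneur
03. marital: categorical, marital status: married, divorced, single
04. education: categorical: primary, secondary, tertiary, unknown
05. default: categorical, whether the client has credit in default: yes/no
06. balance: numeric, average yearly balance
07. housing: categorical, whether the client has a housing loan: yes/no
08. loan: categorical, whether the client has a personal loan: yes/no
09. contact: categorical, contact communication type: unknown, cellular, telephone
10. day: numeric, day of the month of the last contact
11. month: categorical, month of the last contact
12. duration: numeric, duration of the last contact in seconds
13. campaign: numeric, number of contacts made with this client during this campaign, including the last contact
14. pdays: numeric, number of days since the client was last contacted in a previous campaign (-1 means not previously contacted)
15. previous: numeric, number of contacts made with this client before this campaign
16. poutcome: categorical, outcome of the previous campaign: unknown, other, failure, success
17. y: binary, whether the client subscribed to a term deposit: yes/no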

### 1. Import libraries

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

### 2. Read the CSV with Pandas

In [2]:
df = pd.read_csv("./datas/bank/bank-full.csv", sep=";")
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [3]:
df.shape

(45211, 17)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [5]:
df["y"].unique()

array(['no', 'yes'], dtype=object)

In [6]:
df.loc[df["y"]=="yes", "y"] = 1
df.loc[df["y"]=="no", "y"] = 0
df["y"] = df["y"].astype(int)
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,0


### 3. Split the dataframe into training, validation, and test sets

In [7]:
train_all, test = train_test_split(df, test_size=0.2)
train, val = train_test_split(train_all, test_size=0.2)
print(len(train), 'training examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

28934 training examples
7234 validation examples
9043 test examples
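The three sizes printed above follow from applying `test_size=0.2` twice. As a rough check, the arithmetic can be sketched in plain Python, assuming the test share is rounded up to a whole row (an assumption that happens to reproduce the printed numbers). `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

# First split: full data -> (train + validation, test)
n_train_all, n_test = split_sizes(45211, 0.2)
# Second split: (train + validation) -> (train, validation)
n_train, n_val = split_sizes(n_train_all, 0.2)

print(n_train, n_val, n_test)  # 28934 7234 9043
```

The final proportions are therefore roughly 64% training, 16% validation, and 20% test.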


### 4. Read the data with tf.data.Dataset

In [8]:
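def df_to_dataset(df, shuffle=True, batch_size=32):
    """Convenience function that converts a pandas DataFrame into a tf.data.Dataset."""
    df = df.copy()  # copy so that pop() does not remove 'y' from the caller's DataFrame
    labels = df.pop('y')
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df))
    ds = ds.batch(batch_size)
    return ds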

In [9]:
batch_size = 32
train_ds = df_to_dataset(train, shuffle=True, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
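With `batch_size = 32`, the number of steps per epoch follows from the row counts. The sketch below mimics in plain Python what batching does to the row count when the final partial batch is kept, which is assumed here to match the default behaviour of `Dataset.batch`. `batch_sizes` is a hypothetical helper used only for illustration.

```python
def batch_sizes(n_rows, batch_size):
    """List the size of each batch when the last partial batch is kept."""
    full, remainder = divmod(n_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

# 28934 is the number of training rows reported in section 3
sizes = batch_sizes(28934, 32)
print(len(sizes), sizes[-1])  # 905 6
```

So one pass over the training set takes 905 steps, and the last batch holds only 6 rows.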

### 5. Feature processing with feature_column


In [10]:
# Names of the categorical columns
category_names = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]

# Names of the numeric columns
numeric_names = ["balance", "duration", "campaign", "pdays", "previous"]

In [11]:
# Data statistics: the list of distinct values for each categorical column
category_vocabulary_list = {}

for feature_name in category_names:
    category_vocabulary_list[feature_name] = list(df[feature_name].unique())

print(category_vocabulary_list)

{'job': ['management', 'technician', 'entrepreneur', 'blue-collar', 'unknown', 'retired', 'admin.', 'services', 'self-employed', 'unemployed', 'housemaid', 'student'], 'marital': ['married', 'single', 'divorced'], 'education': ['tertiary', 'secondary', 'unknown', 'primary'], 'default': ['no', 'yes'], 'housing': ['yes', 'no'], 'loan': ['no', 'yes'], 'contact': ['unknown', 'cellular', 'telephone'], 'month': ['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb', 'mar', 'apr', 'sep'], 'poutcome': ['unknown', 'failure', 'other', 'success']}
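A vocabulary list is what lets a categorical column turn strings into positions, which can then be one-hot encoded for a dense network. The plain-Python sketch below illustrates the idea for the `marital` vocabulary printed above. `one_hot` is a hypothetical helper, and encoding out-of-vocabulary values as all zeros is an assumption made for the illustration, not a statement about how TensorFlow behaves in every configuration.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list for value; unknown values become all zeros (assumed)."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

marital_vocab = ['married', 'single', 'divorced']

print(one_hot('single', marital_vocab))   # [0.0, 1.0, 0.0]
print(one_hot('widowed', marital_vocab))  # [0.0, 0.0, 0.0]
```

The width of the encoded feature equals the vocabulary size, so `job` and `month` each contribute 12 inputs to the network.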


In [12]:
# Data statistics: the mean and standard deviation of each numeric column
numeric_meanstd = {}

for feature_name in numeric_names:
    numeric_meanstd[feature_name] = df[feature_name].mean(), df[feature_name].std()

print(numeric_meanstd)

{'balance': (1362.2720576850766, 3044.7658291686002), 'duration': (258.1630797814691, 257.52781226517095), 'campaign': (2.763840658246887, 3.0980208832802205), 'pdays': (40.19782796222158, 100.1287459906047), 'previous': (0.5803233726305546, 2.3034410449314233)}
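The mean and standard deviation of each column are the two inputs of z-score standardization, `(x - mean) / std`. When such statistics are bound to a function with `functools.partial`, positional arguments fill the leftmost parameters first, so binding by keyword is the safer choice. The sketch below uses a hypothetical `standardize` function and the `balance` statistics printed above, rounded to two decimals.

```python
from functools import partial

def standardize(x, mean, std):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    return (x - mean) / std

mean, std = 1362.27, 3044.77  # balance statistics, rounded for illustration

# Keyword binding leaves x free, which is what a normalizer needs
by_keyword = partial(standardize, mean=mean, std=std)
# Positional binding fills x and mean, so the later argument lands in std
by_position = partial(standardize, mean, std)

print(round(by_keyword(2143), 4))   # 0.2564
print(round(by_position(2143), 4))  # -0.7851
```

A balance of 2143 sits about a quarter of a standard deviation above the mean, so a negative result is a quick sign that the arguments were bound in the wrong order.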


In [13]:
def norm_func(x, mean, std):
    """Standardize x to zero mean and unit variance."""
    # The CSV columns arrive as int64 tensors. Cast before mixing them with the float
    # statistics, otherwise TensorFlow raises
    # "TypeError: Expected int64, got ... of type 'float' instead."
    x = tf.cast(x, tf.float32)
    return (x - mean) / std

In [14]:
from functools import partial

# Bind mean and std by keyword; positional arguments would be bound to x and mean instead
new_func = partial(norm_func, mean=numeric_meanstd["balance"][0], std=numeric_meanstd["balance"][1])
new_func(2143)  # (2143 - 1362.27) / 3044.77, about 0.256

In [15]:
feature_columns = []

for feature_name in category_names:
    categorical = feature_column.categorical_column_with_vocabulary_list(
        feature_name, category_vocabulary_list[feature_name])
    # DenseFeatures needs dense inputs, so one-hot encode the column and add it to the list
    feature_columns.append(feature_column.indicator_column(categorical))

for feature_name in numeric_names:
    # Build a single-argument standardization function for this column
    new_func = partial(norm_func, mean=numeric_meanstd[feature_name][0], std=numeric_meanstd[feature_name][1])
    feature_columns.append(feature_column.numeric_column(feature_name, normalizer_fn=new_func))

### 6. Create, compile, and train the model

In [16]:
model = tf.keras.Sequential([
    layers.DenseFeatures(feature_columns),
    layers.Dense(128, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=30)


In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
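The model ends in a sigmoid unit, so each prediction is a probability that the client subscribes, and accuracy compares the thresholded prediction with the 0/1 label. A minimal plain-Python sketch of that calculation follows, assuming a 0.5 threshold. The function name `binary_accuracy` is chosen here for illustration, and the probabilities are made up, not model output.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(y == y_hat) for y, y_hat in zip(labels, predictions))
    return correct / len(labels)

labels = [0, 0, 1, 1]                 # illustrative values, not from the dataset
probabilities = [0.1, 0.7, 0.8, 0.4]  # illustrative values, not model output

print(binary_accuracy(labels, probabilities))  # 0.5
```

Because accuracy depends on the label balance, it is worth comparing the reported number with the share of the majority class in `y` before drawing conclusions.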