## Processing the Heart Disease Dataset with TensorFlow feature_column

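### Workflow:

1. Load the CSV file with Pandas.
2. Read the data with tf.data, then batch and shuffle it.
3. Use feature_column to map the CSV columns to the features used to train the model.
4. Build, train, and evaluate the model with Keras.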

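### The heart disease dataset

The dataset has a few hundred rows (303 in the copy used here). Each row describes one patient, and each column describes one attribute.

We use this information to predict whether a patient has heart disease, which is a binary classification task on this dataset.

The columns are described below. Note that some are numerical and others are categorical: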

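>Column | Description | Feature type | Data type
>------------|--------------------|----------------------|-----------------
>Age | Age in years | Numerical | integer
>Sex | 1 = male; 0 = female | Categorical | integer
>CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer
>Trestbps | Resting blood pressure (mm Hg, on admission) | Numerical | integer
>Chol | Serum cholesterol (mg/dl) | Numerical | integer
>FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical | integer
>RestECG | Resting electrocardiogram result (0, 1, 2) | Categorical | integer
>Thalach | Maximum heart rate achieved | Numerical | integer
>Exang | Exercise-induced angina (1 = yes; 0 = no) | Categorical | integer
>Oldpeak | ST depression induced by exercise relative to rest | Numerical | float
>Slope | Slope of the peak exercise ST segment | Numerical | integer
>CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer
>Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect (stored in this CSV as the strings `normal`, `fixed`, `reversible`) | Categorical | string
>Target | Heart disease diagnosis (1 = true; 0 = false) | Classification | integer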

### 1. Import libraries

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

### 2. Read the CSV with Pandas

In [3]:
df = pd.read_csv("./datas/heart/heart.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0
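The `Unnamed: 0` column in the output above suggests that the CSV stores the old DataFrame index as an unnamed first column. The sketch below is a minimal illustration under that assumption: it rebuilds the five rows shown above in memory and reads them with `index_col=0`, so the index is not carried along as a feature. The names `csv_text` and `sample` are only illustrative.

```python
import io

import pandas as pd

# The five rows printed above. The empty first header name is an assumption
# about how the file is laid out (pandas reports it as "Unnamed: 0").
csv_text = """,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0
"""

# index_col=0 turns the first column into the index instead of a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# dtypes show which columns pandas parsed as integers, floats, or strings.
print(sample.dtypes)
```

In these rows `oldpeak` is the only floating-point column and `thal` is the only string column, which matches the feature table above.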


### 3. Split the dataframe into training, validation, and test sets

In [4]:
train, test = train_test_split(df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'training examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

193 training examples
49 validation examples
61 test examples
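The split above is random, so the rows that land in each set change from run to run. The sketch below shows how a fixed seed makes the split repeatable and how the two 80/20 splits produce the sizes printed above. It uses a stand-in frame of 303 rows (193 + 49 + 61); `dummy` and `random_state=42` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with as many rows as the splits above add up to.
dummy = pd.DataFrame({"x": range(303)})

# random_state fixes the shuffle, so repeated runs give the same split.
train_part, test_part = train_test_split(dummy, test_size=0.2, random_state=42)
train_part, val_part = train_test_split(train_part, test_size=0.2, random_state=42)

print(len(train_part), len(val_part), len(test_part))
```

The test size is rounded up (20% of 303 gives 61 rows), and the second split then takes 49 of the remaining 242 rows for validation.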


### 4. Read the data with tf.data.Dataset

In [5]:
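def df_to_dataset(df, shuffle=True, batch_size=32):
    """Helper that converts a pandas DataFrame into a batched tf.data.Dataset."""
    df = df.copy()  # pop() below modifies the frame, so work on a copy
    labels = df.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        # A buffer as large as the dataset shuffles all rows
        ds = ds.shuffle(buffer_size=len(df))
    ds = ds.batch(batch_size)
    return ds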
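One detail of `df_to_dataset` is worth noting: `DataFrame.pop` removes the column in place. A helper that pops `target` therefore changes the caller's frame unless it works on a copy. The sketch below shows the difference on a two-row toy frame; `frame` and `working` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({"age": [63, 67], "target": [0, 1]})

# Popping from a copy leaves the original frame untouched.
working = frame.copy()
labels = working.pop("target")
print(list(frame.columns))    # the original still has both columns
print(list(working.columns))  # the copy has lost 'target'

# Popping from the frame itself removes the column in place.
labels_again = frame.pop("target")
print(list(frame.columns))
```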

In [6]:
batch_size = 32
train_ds = df_to_dataset(train, shuffle=True, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
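With `batch_size = 32`, the number of batches in each dataset follows from the split sizes printed in section 3. The sketch below works this out in plain Python. It assumes the default behavior of `Dataset.batch`, which keeps the final partial batch (`drop_remainder=False`); `batch_sizes` is a hypothetical helper, not a TensorFlow function.

```python
def batch_sizes(num_rows, batch_size):
    """Return the size of each batch when the last partial batch is kept."""
    full, remainder = divmod(num_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


# Split sizes taken from the output in section 3.
for name, rows in [("train", 193), ("val", 49), ("test", 61)]:
    sizes = batch_sizes(rows, 32)
    print(name, len(sizes), "batches:", sizes)
```

The training set therefore yields seven batches per epoch, and the last one holds a single row.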

In [7]:
# Inspect one batch of the dataset
for feature_batch, label_batch in train_ds.take(1):
    print(feature_batch)
    print()
    print(feature_batch["age"])
    print()
    print(label_batch)

{'age': <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([65, 45, 46, 59, 71, 67, 67, 48, 65, 47, 39, 55, 65, 57, 67, 29, 49,
       46, 62, 44, 56, 62, 54, 64, 59, 44, 45, 57, 50, 45, 56, 46])>, 'sex': <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0])>, 'cp': <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([3, 4, 2, 1, 2, 3, 4, 2, 4, 3, 3, 4, 4, 4, 4, 2, 4, 4, 4, 4, 2, 3,
       4, 4, 0, 2, 2, 2, 4, 4, 4, 4])>, 'trestbps': <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([160, 138, 105, 178, 160, 115, 125, 110, 110, 108,  94, 180, 135,
       110, 160, 130, 130, 140, 124, 120, 140, 130, 122, 120, 164, 130,
       130, 124, 150, 142, 200, 138])>, 'chol': <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([360, 236, 204, 270, 302, 564, 254, 229, 248, 243, 199, 327, 254,
       201, 286, 204, 269, 311, 209, 169, 294, 263, 286, 246, 176, 219,
       234, 261, 243, 309, 288, 243

### 5. Feature processing with feature_column


In [8]:
feature_columns = []

# Numeric columns: values are passed to the model unchanged
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
    feature_columns.append(feature_column.numeric_column(header))

# Bucketized column: age is split into ranges and one-hot encoded
age = feature_column.numeric_column("age")
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# Categorical column: the thal strings are mapped to one-hot vectors
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# Embedding column: each thal category is mapped to a learned 4-dimensional vector
thal_embedding = feature_column.embedding_column(thal, dimension=4)
feature_columns.append(thal_embedding)

# Crossed column: age bucket and thal are combined and hashed into 100 buckets
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=100)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)
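To make the transformations above concrete, the sketch below reproduces the bucketized, indicator, and crossed encodings in plain NumPy for the first four rows of the dataset. It is an analogue for illustration, not TensorFlow's implementation. In particular, `zlib.crc32` is a stand-in hash, so the crossed ids will differ from the ones TensorFlow computes.

```python
import zlib

import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
vocab = ["fixed", "normal", "reversible"]

# age and thal values from the first four rows shown in section 2
ages = np.array([63, 67, 67, 37])
thals = ["fixed", "normal", "reversible", "normal"]

# Bucketized column: 10 boundaries give 11 buckets,
# (-inf, 18), [18, 25), ..., [60, 65), [65, +inf).
age_bucket_ids = np.digitize(ages, boundaries)
age_one_hot = np.eye(len(boundaries) + 1)[age_bucket_ids]

# Indicator column: the position in the vocabulary becomes the hot index.
thal_ids = np.array([vocab.index(t) for t in thals])
thal_one_hot = np.eye(len(vocab))[thal_ids]

# Crossed column: each (age bucket, thal) pair is hashed into one of 100 slots.
# crc32 is only a stand-in for TensorFlow's own hash function.
hash_bucket_size = 100
cross_ids = np.array([
    zlib.crc32(f"{int(b)}_x_{t}".encode()) % hash_bucket_size
    for b, t in zip(age_bucket_ids, thals)
])

print(age_bucket_ids)
print(thal_ids)
print(cross_ids)
```

Because the cross is hashed, two different (age bucket, thal) pairs can collide in the same slot. With 11 x 3 = 33 possible pairs and 100 slots, collisions are possible but not frequent.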

### 6. Create, compile, and train the model

In [9]:
model = tf.keras.Sequential([
    layers.DenseFeatures(feature_columns),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=30)

Epoch 1/30


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f0e6047c110>
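It can help to know how wide the input to the first `Dense` layer is. The sketch below adds up the widths of the feature columns defined in section 5, assuming `DenseFeatures` concatenates the dense representation of every column, and derives the parameter counts of the three `Dense` layers from that. `widths` and `dense_params` are illustrative names.

```python
boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

# Width contributed by each group of feature columns from section 5.
widths = {
    "numeric columns (7, one value each)": 7,
    "age_buckets one-hot": len(boundaries) + 1,
    "thal one-hot": 3,
    "thal embedding": 4,
    "age_buckets x thal cross": 100,
}
input_width = sum(widths.values())


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_params = [
    dense_params(input_width, 64),
    dense_params(64, 32),
    dense_params(32, 1),
]
embedding_params = 3 * 4  # one 4-dimensional vector per thal category

print("input width:", input_width)
print("dense layer parameters:", layer_params)
print("embedding parameters:", embedding_params)
```

Most of the input width (100 of 125 values) comes from the crossed column, so `hash_bucket_size` has the largest effect on the size of the first layer.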

In [10]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.7377049326896667
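The accuracy above is measured on only 61 test rows, so it is a coarse estimate. The sketch below recovers the number of correct predictions and gives a rough standard error. It uses the binomial normal approximation, which is a simplification added here and not part of the original notebook.

```python
import math

accuracy = 0.7377049326896667  # value printed above
n_test = 61                    # test-set size printed in section 3

# Number of correctly classified test rows.
correct = round(accuracy * n_test)

# Rough standard error of the accuracy under a binomial model.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(correct, "of", n_test, "correct")
print(f"approximate standard error: {std_err:.3f}")
```

A standard error of several percentage points means that small differences in accuracy between runs on this test set should not be over-interpreted.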
