# 5-1 Feature columns (feature_column)
Feature columns are generally used when modeling structured data; image and text data usually do not need them.

## 1. Overview of feature column usage


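Feature columns can convert categorical features into one-hot encoded features, turn continuous features into bucketized features, and generate crossed features from several input features, among other things.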
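
As a rough illustration of the first of these transformations, the sketch below one-hot encodes a categorical value against a fixed vocabulary in plain Python. The helper name `one_hot` is hypothetical and is not part of tf.feature_column; mapping out-of-vocabulary values to an all-zero vector is an assumption of this sketch.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list for `value`; all zeros if it is not in `vocabulary`."""
    return [1 if value == item else 0 for item in vocabulary]

vocabulary = ['male', 'female']
print(one_hot('female', vocabulary))   # [0, 1]
print(one_hot('unknown', vocabulary))  # [0, 0]  (assumed handling of unseen values)
```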


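To create feature columns, call the functions of the tf.feature_column module. The nine most commonly used functions in this module are listed below. All nine return either a Categorical-Column or a Dense-Column object, except bucketized_column, which inherits from both classes.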

Note: every Categorical Column must eventually be converted to a Dense Column, via indicator_column or embedding_column, before it can be passed to the model!

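* numeric_column: numeric column, the most commonly used.

* bucketized_column: bucketized column, built from a numeric column; it turns one numeric column into several one-hot encoded features.

* categorical_column_with_identity: categorical identity column, one-hot encoded; equivalent to a bucketized column in which each bucket holds exactly one integer.

* categorical_column_with_vocabulary_list: categorical vocabulary column, one-hot encoded, with the vocabulary given as a list.

* categorical_column_with_vocabulary_file: categorical vocabulary column, with the vocabulary given in a file.

* categorical_column_with_hash_bucket: hashed column, used when the range of integers or the vocabulary is large.

* indicator_column: indicator column, built from a Categorical Column, one-hot encoded.

* embedding_column: embedding column, built from a Categorical Column; the embedding vector parameters have to be learned. A suggested embedding dimension is the fourth root of the number of categories.

* crossed_column: crossed column, which can be built from any categorical columns except categorical_column_with_hash_bucket.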
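
To make the idea behind a hashed column concrete, here is a minimal sketch that maps arbitrary strings to a fixed number of bucket ids. The helper `hash_bucket` is hypothetical, and `hashlib.md5` is only a stand-in: TensorFlow uses its own hashing internally, so the ids produced here will not match those of categorical_column_with_hash_bucket.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets) using a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

tickets = ['113787', 'PC 17572', '27849']
ids = [hash_bucket(t, 3) for t in tickets]
print(ids)  # each id is 0, 1, or 2; distinct tickets may share a bucket
```

Because the number of buckets is fixed in advance, no vocabulary has to be stored, at the cost of possible collisions between unrelated values.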

## 2. Feature column usage example
The following example uses feature columns to solve the Titanic survival prediction problem.

In [29]:
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow.keras import layers, models

In [30]:
# Build the data pipeline
dftrain_raw = pd.read_csv('../data/titanic/train.csv')
dftest_raw = pd.read_csv('../data/titanic/test.csv')
dfraw = pd.concat([dftrain_raw, dftest_raw])
dfraw.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 493 | 0 | 1 | Molson, Mr. Harry Markland | male | 55.0 | 0 | 0 | 113787 | 30.5 | C30 | S |
| 1 | 53 | 1 | 1 | Harper, Mrs. Henry Sleeper (Myna Haxtun) | female | 49.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
| 2 | 388 | 1 | 2 | Buss, Miss. Kate | female | 36.0 | 0 | 0 | 27849 | 13.0 | | S |
| 3 | 192 | 0 | 2 | Carbines, Mr. William | male | 19.0 | 0 | 0 | 28424 | 13.0 | | S |
| 4 | 687 | 0 | 3 | Panula, Mr. Jaako Arnold | male | 14.0 | 4 | 1 | 3101295 | 39.6875 | | S |


In [31]:
print(dfraw.dtypes)  # <class 'pandas.core.series.Series'>

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [56]:
def prepare_dfdata(dfraw):
    dfdata = dfraw.copy()
    dfdata.columns = [x.lower() for x in dfraw.columns]
    dfdata = dfdata.rename(columns = {'survived':'label'})  # use the survived column as the label
    dfdata = dfdata.drop(['passengerid', 'name'], axis = 1)  # these two columns are treated as unrelated to survival
    for col, dtype in dict(dfdata.dtypes).items():
        if dfdata[col].hasnans:
            # add an indicator column that records which values were missing
            dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
            # fill missing values: column mean for numeric columns, empty string otherwise
            # (np.object, np.str and np.unicode were removed from recent NumPy releases)
            if pd.api.types.is_numeric_dtype(dtype):
                dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
            else:
                dfdata[col] = dfdata[col].fillna('')
    return dfdata
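
The missing-value handling in prepare_dfdata (a 0/1 indicator plus a mean fill for numeric columns) can be sketched on a plain list as follows. The helper `fill_with_indicator` is hypothetical and only mirrors the numeric branch; it assumes at least one value is present.

```python
import math

def fill_with_indicator(values):
    """Return (filled_values, missing_flags) for a list of floats that may contain NaN."""
    flags = [1 if math.isnan(v) else 0 for v in values]
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)  # assumes at least one non-missing value
    filled = [mean if math.isnan(v) else v for v in values]
    return filled, flags

ages = [55.0, float('nan'), 36.0, 19.0]
filled, flags = fill_with_indicator(ages)
print(filled)  # the NaN is replaced by the mean of the other three ages
print(flags)   # [0, 1, 0, 0]
```

Keeping the indicator alongside the filled value lets the model distinguish a genuinely average age from an imputed one.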

In [64]:
dfdata = prepare_dfdata(dfraw)
dftrain = dfdata.iloc[0:len(dftrain_raw), :]
dftest = dfdata.iloc[len(dftrain_raw):, :]

In [65]:
# Build a tf.data.Dataset from a DataFrame
def df_to_dataset(df, shuffle = True, batch_size = 32):
    dfdata = df.copy()
    if 'label' not in dfdata.columns:
        ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict('list'))
    else:
        labels = dfdata.pop('label')
        ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict('list'), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size = len(dfdata))
    ds = ds.batch(batch_size)
    return ds

In [66]:
ds_train = df_to_dataset(dftrain)
ds_test = df_to_dataset(dftest)

In [67]:
for features, label in ds_train.unbatch().take(2):
    print("=================")
    print(features)
    print('------------')
    print(label)

{'pclass': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'sex': <tf.Tensor: shape=(), dtype=string, numpy=b'female'>, 'age': <tf.Tensor: shape=(), dtype=float32, numpy=31.0>, 'sibsp': <tf.Tensor: shape=(), dtype=int32, numpy=0>, 'parch': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'ticket': <tf.Tensor: shape=(), dtype=string, numpy=b'36928'>, 'fare': <tf.Tensor: shape=(), dtype=float32, numpy=164.8667>, 'cabin': <tf.Tensor: shape=(), dtype=string, numpy=b'C7'>, 'embarked': <tf.Tensor: shape=(), dtype=string, numpy=b'S'>, 'age_nan': <tf.Tensor: shape=(), dtype=int32, numpy=0>, 'cabin_nan': <tf.Tensor: shape=(), dtype=int32, numpy=0>, 'embarked_nan': <tf.Tensor: shape=(), dtype=int32, numpy=0>}
------------
tf.Tensor(1, shape=(), dtype=int64)
{'pclass': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'sex': <tf.Tensor: shape=(), dtype=string, numpy=b'male'>, 'age': <tf.Tensor: shape=(), dtype=float32, numpy=29.699118>, 'sibsp': <tf.Tensor: shape=(), dtype=int32, numpy=0>, 'parch': <tf.Tenso

In [68]:
[1, 2] + [3, 4]  # list concatenation, used below to combine column names

[1, 2, 3, 4]

In [72]:
# Define the feature columns
feature_columns = []
# Numeric columns
for col in ['age', 'fare', 'parch', 'sibsp'] + [c for c in dfdata.columns if c.endswith('_nan')]:
    feature_columns.append(tf.feature_column.numeric_column(col))

print(tf.feature_column.numeric_column(col))
# Bucketized column
age  = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age, boundaries = [18, 25, 30, 35, 40, 50, 55, 60, 65])
feature_columns.append(age_buckets)
print(age_buckets)

NumericColumn(key='embarked_nan', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
BucketizedColumn(source_column=NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(18, 25, 30, 35, 40, 50, 55, 60, 65))
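
To see how these nine boundaries split ages into buckets, the following sketch computes the bucket index with the standard library. The helper `bucket_index` is hypothetical; it follows the documented convention that buckets are left-inclusive, so a value equal to a boundary falls into the higher bucket.

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 50, 55, 60, 65]

def bucket_index(age):
    """Bucket id for `age`; a value equal to a boundary falls in the higher bucket."""
    return bisect.bisect_right(boundaries, age)

print(len(boundaries) + 1)  # 10 buckets in total
print(bucket_index(14.0))   # 0 -> (-inf, 18)
print(bucket_index(18.0))   # 1 -> [18, 25)
print(bucket_index(29.7))   # 2 -> [25, 30)
print(bucket_index(70.0))   # 9 -> [65, +inf)
```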


In [73]:
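# Categorical columns
# Note: every Categorical Column must eventually be converted to a Dense Column (via indicator_column or embedding_column) before it can be passed to the model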
sex = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
          key = 'sex', vocabulary_list = ['male', 'female']
        ))
feature_columns.append(sex)
print(sex)

IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0))


In [75]:
pclass = tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_vocabulary_list(
              key = 'pclass', vocabulary_list = [1, 2, 3]
            ))

ticket = tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_hash_bucket('ticket', 3)
            )

# The Titanic ports of embarkation are S (Southampton), C (Cherbourg) and Q (Queenstown)
embarked = tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_vocabulary_list(
              key = 'embarked', vocabulary_list = ['S', 'C', 'Q']
        ))
feature_columns.append(pclass)
feature_columns.append(ticket)
feature_columns.append(embarked)

In [76]:
# Embedding column
cabin = tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_hash_bucket('cabin', 32), 2)
feature_columns.append(cabin)
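
The embedding dimension of 2 used here is in line with the fourth-root rule of thumb from the overview. A quick check, assuming the 32 hash buckets play the role of the number of categories:

```python
num_buckets = 32                     # hash buckets used for 'cabin'
suggested_dim = num_buckets ** 0.25  # fourth root of the number of categories
print(round(suggested_dim, 2))       # 2.38
```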

In [78]:
# Crossed column
pclass_cate = tf.feature_column.categorical_column_with_vocabulary_list(
              key = 'pclass', vocabulary_list = [1, 2, 3]
            )
crossed_feature = tf.feature_column.indicator_column(
                    tf.feature_column.crossed_column([age_buckets, pclass_cate], hash_bucket_size = 15))

feature_columns.append(crossed_feature)
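
Conceptually, the crossed column combines the age bucket and the passenger class into one key and hashes that key into 15 ids. The sketch below illustrates the idea only: `crossed_bucket` and the `_X_` key format are hypothetical, and `hashlib.md5` is a stand-in for TensorFlow's internal hash, so the ids will differ from those of crossed_column.

```python
import bisect
import hashlib

boundaries = [18, 25, 30, 35, 40, 50, 55, 60, 65]

def crossed_bucket(age, pclass, hash_bucket_size=15):
    """Hash the (age bucket, pclass) pair into one of `hash_bucket_size` ids."""
    age_bucket = bisect.bisect_right(boundaries, age)
    key = f'{age_bucket}_X_{pclass}'
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# 10 age buckets x 3 classes = 30 possible pairs share 15 ids, so collisions are expected.
print(crossed_bucket(55.0, 1))
print(crossed_bucket(14.0, 3))
```

Ages that fall into the same bucket produce the same crossed id for a given class, which is what lets the model learn one weight per (age range, class) combination.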

In [79]:
len(feature_columns)

14

In [81]:
# Define the model
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
    layers.DenseFeatures(feature_columns),
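    layers.Dense(64, activation = 'relu'),  # without a nonlinearity, stacked Dense layers collapse into one linear map
    layers.Dense(64, activation = 'relu'),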
    layers.Dense(1, activation = 'sigmoid')
])


# Train the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

history = model.fit(ds_train, validation_data = ds_test, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [82]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_features (DenseFeature multiple                  64        
_________________________________________________________________
dense (Dense)                multiple                  2944      
_________________________________________________________________
dense_1 (Dense)              multiple                  4160      
_________________________________________________________________
dense_2 (Dense)              multiple                  65        
Total params: 7,233
Trainable params: 7,233
Non-trainable params: 0
_________________________________________________________________
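
The parameter counts in this summary can be reproduced by hand from the feature columns defined earlier. The following arithmetic sketch adds up the width each column contributes to the DenseFeatures output and then counts weights and biases per layer; the dictionary labels are descriptive only.

```python
# Width contributed by each feature column to the DenseFeatures output.
widths = {
    'numeric (age, fare, parch, sibsp, 3 *_nan flags)': 7,
    'age_buckets (9 boundaries -> 10 buckets)': 10,
    'sex': 2,
    'pclass': 3,
    'ticket (3 hash buckets)': 3,
    'embarked': 3,
    'cabin embedding': 2,
    'age x pclass cross (15 hash buckets)': 15,
}
input_width = sum(widths.values())

embedding_params = 32 * 2             # cabin embedding table: 32 buckets x 2 dimensions
dense_params = input_width * 64 + 64  # weights + biases
dense_1_params = 64 * 64 + 64
dense_2_params = 64 * 1 + 1

print(input_width)  # 45
print(embedding_params, dense_params, dense_1_params, dense_2_params)  # 64 2944 4160 65
print(embedding_params + dense_params + dense_1_params + dense_2_params)  # 7233
```

The only trainable parameters inside the DenseFeatures layer are the 64 values of the cabin embedding table; the indicator, bucketized, and numeric columns contribute width but no parameters.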
