## TensorFlow feature_column syntax demo

A feature_column is the bridge between the data and the model: it converts raw input data into the format the model expects.

Recap:
tf.feature_column provides the following functions:
* Numeric column: tf.feature_column.numeric_column
* Bucketized column: tf.feature_column.bucketized_column
* Categorical identity column: tf.feature_column.categorical_column_with_identity
* Categorical vocabulary column: tf.feature_column.categorical_column_with_vocabulary_list
* Hashed column: tf.feature_column.categorical_column_with_hash_bucket
* Crossed column: tf.feature_column.crossed_column
* Indicator column: tf.feature_column.indicator_column
* Embedding column: tf.feature_column.embedding_column
* Shared embeddings: tf.feature_column.shared_embeddings
* Weighted categorical column: tf.feature_column.weighted_categorical_column

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

tf.keras.backend.set_floatx('float32')

### Prepare the dict data

In [2]:
data = {
    'marks': [55,21,63,88,74,54,95,41,84,52],
    'grade': ['average','poor','average','good','good','average','good','average','good','average'],
    'point': ['c','f','c+','b+','b','c','a','d+','b+','c'],
    'weight': [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19]
}

In [3]:
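def demo(column):
    """Helper: wire data -> feature_column -> layer and print the layer output."""
    # The parameter is named `column` so it does not shadow the imported feature_column module.
    feature_layer = layers.DenseFeatures(column)
    print(feature_layer(data).numpy())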

### 1. Numeric columns

In [4]:
data["marks"]

[55, 21, 63, 88, 74, 54, 95, 41, 84, 52]

In [5]:
marks = feature_column.numeric_column("marks")
demo(marks)

[[55.]
 [21.]
 [63.]
 [88.]
 [74.]
 [54.]
 [95.]
 [41.]
 [84.]
 [52.]]


### 2. Bucketized columns

In [6]:
data["marks"]

[55, 21, 63, 88, 74, 54, 95, 41, 84, 52]

In [7]:
marks

NumericColumn(key='marks', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

In [8]:
marks_buckets = feature_column.bucketized_column(
    marks, boundaries=[30,40,50,60,70,80,90])
demo(marks_buckets)

[[0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]]
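
To make the bucket assignment above easier to follow, here is a minimal pure-Python sketch of the same idea. It is not how TensorFlow implements `bucketized_column`; the helper name `bucketize` is made up for illustration, and it assumes that a value equal to a boundary falls into the bucket on its right, which matches the output shown above. Seven boundaries give eight buckets.

```python
from bisect import bisect_right


def bucketize(values, boundaries):
    """Return one one-hot row per value, with len(boundaries) + 1 buckets."""
    n_buckets = len(boundaries) + 1
    rows = []
    for v in values:
        # Number of boundaries <= v, so a value equal to a boundary
        # lands in the bucket to the right of that boundary.
        idx = bisect_right(boundaries, v)
        row = [0.0] * n_buckets
        row[idx] = 1.0
        rows.append(row)
    return rows


marks_values = [55, 21, 63, 88, 74, 54, 95, 41, 84, 52]
boundaries = [30, 40, 50, 60, 70, 80, 90]

one_hot = bucketize(marks_values, boundaries)
bucket_ids = [row.index(1.0) for row in one_hot]
print(bucket_ids)  # [3, 0, 4, 6, 5, 3, 7, 2, 6, 3]
```

The printed indices are the positions of the 1s in the layer output above, for example 55 lies in [50, 60), which is bucket 3.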


### 3. Indicator columns

In [9]:
data["grade"]

['average',
 'poor',
 'average',
 'good',
 'good',
 'average',
 'good',
 'average',
 'good',
 'average']

In [10]:
grade = feature_column.categorical_column_with_vocabulary_list(
      'grade', ['poor', 'average', 'good'])
grade

VocabularyListCategoricalColumn(key='grade', vocabulary_list=('poor', 'average', 'good'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [11]:
grade_one_hot = feature_column.indicator_column(grade)
demo(grade_one_hot)

[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
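
The vocabulary lookup followed by one-hot encoding can be sketched in plain Python as below. The helper `vocab_one_hot` is hypothetical, and the value `'excellent'` is not part of the dataset; it is added only to illustrate an out-of-vocabulary token. The sketch assumes that, with `default_value=-1` and `num_oov_buckets=0`, an unknown token ends up as an all-zero row.

```python
def vocab_one_hot(values, vocabulary):
    """Map each token to its vocabulary index and one-hot encode it."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for v in values:
        row = [0.0] * len(vocabulary)
        i = index.get(v, -1)  # -1 mirrors default_value=-1 for unknown tokens
        if i >= 0:
            row[i] = 1.0
        rows.append(row)
    return rows


# 'excellent' is a made-up token that is not in the vocabulary.
grade_rows = vocab_one_hot(
    ['average', 'poor', 'good', 'excellent'],
    ['poor', 'average', 'good'])
for row in grade_rows:
    print(row)
```

The column order follows the order of the vocabulary list, which is why `'average'` sets the second position rather than the first.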


### 4. Embedding columns

In [12]:
data["point"]

['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']

In [13]:
point_unique = pd.Series(data["point"]).unique()
point_unique

array(['c', 'f', 'c+', 'b+', 'b', 'a', 'd+'], dtype=object)

In [14]:
point = feature_column.categorical_column_with_vocabulary_list(
    "point", list(point_unique))

In [15]:
# Turn the categorical_column into a one-hot encoding
point_one_hot = feature_column.indicator_column(point)
demo(point_one_hot)

[[1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]


In [16]:
# Turn the categorical_column into an embedding
point_embedding = feature_column.embedding_column(point, dimension=3)
demo(point_embedding)

[[-0.8958108   0.04639652  0.38751155]
 [ 0.75818956  0.61422676 -0.22615439]
 [-0.19805278  0.05863639 -0.13422246]
 [-0.6949732  -0.38393614 -0.56113476]
 [ 0.2548367  -0.33699897  0.75269896]
 [-0.8958108   0.04639652  0.38751155]
 [ 0.8323947   0.19694473  0.54354566]
 [-0.90260595  0.05213019  0.19949049]
 [-0.6949732  -0.38393614 -0.56113476]
 [-0.8958108   0.04639652  0.38751155]]
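
Conceptually, an embedding column is a table lookup: each category index selects one row of a trainable weight matrix. The sketch below uses a randomly filled table as a stand-in for those weights (the helper `embed` and the `seed` parameter are assumptions for illustration, not TensorFlow API), so the numbers will not match the output above. What it does reproduce is the structure: identical categories get identical vectors.

```python
import random


def embed(values, vocabulary, dimension, seed=0):
    """Look up one `dimension`-sized vector per token from a random table."""
    rng = random.Random(seed)
    # One row per vocabulary entry; in a real model these weights are trained.
    table = [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
             for _ in vocabulary]
    index = {token: i for i, token in enumerate(vocabulary)}
    return [list(table[index[v]]) for v in values]


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
vocabulary = ['c', 'f', 'c+', 'b+', 'b', 'a', 'd+']

vectors = embed(points, vocabulary, dimension=3)
# Rows 0, 5 and 9 all hold 'c', so they share the same vector.
print(vectors[0] == vectors[5] == vectors[9])  # True
```

The same pattern is visible in the layer output above, where the rows for `'c'` (and the two rows for `'b+'`) repeat exactly. Because the weights are initialized randomly, rerunning the notebook cell will generally print different values.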


### 5. Hashed feature columns

In [17]:
data["point"]

['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']

In [18]:
point_hashed = feature_column.categorical_column_with_hash_bucket(
      'point', hash_bucket_size=3)
demo(feature_column.indicator_column(point_hashed))

[[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
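
The idea behind a hashed column is to hash each string and take the result modulo `hash_bucket_size`. The sketch below uses MD5 from the standard library purely as a stand-in hash; TensorFlow uses its own fingerprint function, so the bucket ids produced here will generally differ from the output above. The helper `hash_bucket` is hypothetical.

```python
import hashlib


def hash_bucket(token, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    # MD5 is only a stand-in here; it is NOT the hash TensorFlow uses.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
bucket_ids = [hash_bucket(p, 3) for p in points]
print(bucket_ids)

# 7 distinct values cannot fit into 3 buckets without collisions.
print(len(set(points)), "distinct values ->", len(set(bucket_ids)), "buckets used")
```

With seven distinct grades and only three buckets, collisions are unavoidable, so unrelated categories share a column. That is the trade-off of hashing: no vocabulary is needed, at the cost of some categories becoming indistinguishable.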


### 6. Crossed feature columns

In [19]:
marks_buckets

BucketizedColumn(source_column=NumericColumn(key='marks', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(30, 40, 50, 60, 70, 80, 90))

In [20]:
grade

VocabularyListCategoricalColumn(key='grade', vocabulary_list=('poor', 'average', 'good'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [21]:
crossed_feature = feature_column.crossed_column(
    [marks_buckets, grade], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
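
A crossed column combines the values of several columns into one key and hashes that key into `hash_bucket_size` buckets. The following sketch illustrates this with the marks bucket index and the grade. The helper `cross_bucket`, the `"_X_"` separator, and the MD5 hash are all assumptions made for illustration, so the bucket ids will not match TensorFlow's output above.

```python
import hashlib
from bisect import bisect_right


def cross_bucket(parts, hash_bucket_size):
    """Hash a combination of feature values into a bucket id."""
    key = "_X_".join(str(p) for p in parts)  # separator chosen arbitrarily
    # MD5 is a stand-in; it is NOT the hash TensorFlow uses.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


marks_values = [55, 21, 63, 88, 74, 54, 95, 41, 84, 52]
grades = ['average', 'poor', 'average', 'good', 'good',
          'average', 'good', 'average', 'good', 'average']
boundaries = [30, 40, 50, 60, 70, 80, 90]

# Pair each marks bucket index with the grade, then hash the pair.
pairs = [(bisect_right(boundaries, m), g) for m, g in zip(marks_values, grades)]
crossed_ids = [cross_bucket(p, 10) for p in pairs]
distinct_pairs = sorted(set(pairs))

print(len(distinct_pairs))  # 7
print(crossed_ids)
```

The dataset contains seven distinct (marks bucket, grade) combinations. In the TensorFlow output above, the second, third and fourth rows all fall into the same bucket although their combinations differ, which shows that hash collisions can occur even when there are more buckets than combinations.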


### 7. Weighted Categorical Columns

In [22]:
point

VocabularyListCategoricalColumn(key='point', vocabulary_list=('c', 'f', 'c+', 'b+', 'b', 'a', 'd+'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [23]:
weight_categorical_column = feature_column.weighted_categorical_column(
    point, 'weight')

demo(feature_column.indicator_column(weight_categorical_column))

[[0.1  0.   0.   0.   0.   0.   0.  ]
 [0.   0.11 0.   0.   0.   0.   0.  ]
 [0.   0.   0.12 0.   0.   0.   0.  ]
 [0.   0.   0.   0.13 0.   0.   0.  ]
 [0.   0.   0.   0.   0.14 0.   0.  ]
 [0.15 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.16 0.  ]
 [0.   0.   0.   0.   0.   0.   0.17]
 [0.   0.   0.   0.18 0.   0.   0.  ]
 [0.19 0.   0.   0.   0.   0.   0.  ]]
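
The weighted output above is the one-hot encoding of `point` with the 1 replaced by the matching entry from `weight`. A minimal pure-Python sketch of that relationship follows; the helper `weighted_one_hot` is hypothetical and only covers the single-value-per-row case used in this notebook.

```python
def weighted_one_hot(values, weights, vocabulary):
    """One-hot encode each token, scaled by its per-row weight."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for v, w in zip(values, weights):
        row = [0.0] * len(vocabulary)
        row[index[v]] += w  # the weight takes the place of the 1
        rows.append(row)
    return rows


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
weights = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19]
vocabulary = ['c', 'f', 'c+', 'b+', 'b', 'a', 'd+']

weighted_rows = weighted_one_hot(points, weights, vocabulary)
print(weighted_rows[0])  # [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Rows 0, 5 and 9 all hold `'c'`, so they share the first column but carry their own weights (0.10, 0.15 and 0.19).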
