## TensorFlow feature_column syntax demo

Recap:
tf.feature_column converts raw data into the format a model expects. It provides the following functions:
* Numeric column: tf.feature_column.numeric_column
* Bucketized column: tf.feature_column.bucketized_column
* Categorical identity column: tf.feature_column.categorical_column_with_identity
* Categorical vocabulary column: tf.feature_column.categorical_column_with_vocabulary_list
* Hashed column: tf.feature_column.categorical_column_with_hash_bucket
* Crossed column: tf.feature_column.crossed_column
* Indicator column: tf.feature_column.indicator_column
* Embedding column: tf.feature_column.embedding_column
* Shared embeddings: tf.feature_column.shared_embeddings
* Weighted categorical column: tf.feature_column.weighted_categorical_column

In [10]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers

tf.keras.backend.set_floatx('float64')

### Prepare the dict data

In [11]:
df = pd.DataFrame({'marks': [55,21,63,88,74,54,95,41,84,52],
        'grade': ['average','poor','average','good','good','average','good','average','good','average'],
        'point': ['c','f','c+','b+','b','c','a','d+','b+','c']})
df

   marks    grade point
0     55  average     c
1     21     poor     f
2     63  average    c+
3     88     good    b+
4     74     good     b
5     54  average     c
6     95     good     a
7     41  average    d+
8     84     good    b+
9     52  average     c


In [13]:
df["marks"] = df["marks"].astype("float64")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   marks   10 non-null     float64
 1   grade   10 non-null     object 
 2   point   10 non-null     object 
dtypes: float64(1), object(2)
memory usage: 368.0+ bytes


In [14]:
data = dict(df)
data

{'marks': 0    55.0
 1    21.0
 2    63.0
 3    88.0
 4    74.0
 5    54.0
 6    95.0
 7    41.0
 8    84.0
 9    52.0
 Name: marks, dtype: float64,
 'grade': 0    average
 1       poor
 2    average
 3       good
 4       good
 5    average
 6       good
 7    average
 8       good
 9    average
 Name: grade, dtype: object,
 'point': 0     c
 1     f
 2    c+
 3    b+
 4     b
 5     c
 6     a
 7    d+
 8    b+
 9     c
 Name: point, dtype: object}

In [15]:
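def demo(column):
    # Wrap the column in a DenseFeatures layer and print the dense tensor it
    # produces for `data`. The parameter no longer shadows the imported module.
    feature_layer = layers.DenseFeatures(column)
    print(feature_layer(data).numpy())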

### 1. Numeric columns

In [16]:
data["marks"]

0    55.0
1    21.0
2    63.0
3    88.0
4    74.0
5    54.0
6    95.0
7    41.0
8    84.0
9    52.0
Name: marks, dtype: float64

In [17]:
marks = feature_column.numeric_column("marks")
demo(marks)

[[55.]
 [21.]
 [63.]
 [88.]
 [74.]
 [54.]
 [95.]
 [41.]
 [84.]
 [52.]]


### 2. Bucketized columns

In [18]:
data["marks"]

0    55.0
1    21.0
2    63.0
3    88.0
4    74.0
5    54.0
6    95.0
7    41.0
8    84.0
9    52.0
Name: marks, dtype: float64

In [19]:
marks

NumericColumn(key='marks', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

In [20]:
marks_buckets = feature_column.bucketized_column(
    marks, boundaries=[30,40,50,60,70,80,90])
demo(marks_buckets)

[[0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]]
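
The bucket assignment above can be reproduced without TensorFlow. The following is a minimal sketch, assuming each bucket is left-inclusive (`[30, 40)`, `[40, 50)`, and so on), with one extra bucket below the first boundary and one above the last. The helper name `bucketize_one_hot` is hypothetical and is not part of any TensorFlow API.

```python
from bisect import bisect_right


def bucketize_one_hot(values, boundaries):
    """Return one one-hot row per value, with len(boundaries) + 1 buckets."""
    num_buckets = len(boundaries) + 1
    rows = []
    for v in values:
        # bisect_right puts a value equal to a boundary into the bucket that
        # starts at that boundary, which makes every bucket left-inclusive.
        idx = bisect_right(boundaries, v)
        row = [0.0] * num_buckets
        row[idx] = 1.0
        rows.append(row)
    return rows


marks_values = [55, 21, 63, 88, 74, 54, 95, 41, 84, 52]
rows = bucketize_one_hot(marks_values, [30, 40, 50, 60, 70, 80, 90])
bucket_ids = [row.index(1.0) for row in rows]
print(bucket_ids)  # [3, 0, 4, 6, 5, 3, 7, 2, 6, 3]
```

Seven boundaries give eight buckets, which is why every row in the output above has eight entries.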


### 3. Indicator columns

In [21]:
data["grade"]

0    average
1       poor
2    average
3       good
4       good
5    average
6       good
7    average
8       good
9    average
Name: grade, dtype: object

In [22]:
grade = feature_column.categorical_column_with_vocabulary_list(
      'grade', ['poor', 'average', 'good'])
grade

VocabularyListCategoricalColumn(key='grade', vocabulary_list=('poor', 'average', 'good'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [23]:
grade_one_hot = feature_column.indicator_column(grade)
demo(grade_one_hot)

Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
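
The one-hot rows above follow the order of the vocabulary list: `'poor'` maps to index 0, `'average'` to 1, `'good'` to 2. A plain-Python sketch of that lookup is shown here. It assumes that a term missing from the vocabulary (index -1, mirroring `default_value=-1`) ends up as an all-zero row. The helper `vocab_one_hot` and the value `'excellent'` are hypothetical and only serve the illustration.

```python
def vocab_one_hot(values, vocabulary):
    """One-hot encode each value by its position in the vocabulary list."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for v in values:
        row = [0.0] * len(vocabulary)
        i = index.get(v, -1)  # -1 stands for "not in the vocabulary"
        if i >= 0:
            row[i] = 1.0
        rows.append(row)
    return rows


# 'excellent' is deliberately outside the vocabulary (assumption for the demo).
grades = ['average', 'poor', 'good', 'excellent']
encoded = vocab_one_hot(grades, ['poor', 'average', 'good'])
for g, row in zip(grades, encoded):
    print(g, row)
```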


### 4. Embedding columns

In [24]:
data["point"]

0     c
1     f
2    c+
3    b+
4     b
5     c
6     a
7    d+
8    b+
9     c
Name: point, dtype: object

In [25]:
point = feature_column.categorical_column_with_vocabulary_list(
    "point", df["point"].unique())

In [26]:
# Convert the categorical_column to one-hot
point_one_hot = feature_column.indicator_column(point)
demo(point_one_hot)

[[1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]


In [27]:
# Convert the categorical_column to an embedding
point_embedding = feature_column.embedding_column(point, dimension=3)
demo(point_embedding)

[[-1.0506157   0.07907454  0.7597467 ]
 [-0.29805383 -0.32958034  0.11289294]
 [ 0.4115803  -0.15990475  0.08945537]
 [ 0.17709787 -0.00756577 -0.72060245]
 [-0.11707171  0.45751548 -0.61383283]
 [-1.0506157   0.07907454  0.7597467 ]
 [-0.335763    0.5380431   0.45197704]
 [-0.84164476 -0.10877138 -0.8661403 ]
 [ 0.17709787 -0.00756577 -0.72060245]
 [-1.0506157   0.07907454  0.7597467 ]]
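
In the output above, rows 0, 5 and 9 are identical because all three hold the category `'c'`: an embedding column is a lookup table with one trainable vector per vocabulary entry. The sketch below illustrates only the lookup. It assumes a plain Gaussian initialization with standard deviation `1/sqrt(dimension)` as a simplification of the truncated normal that `embedding_column` documents as its default, so the numbers will not match the output above. The helper names are hypothetical.

```python
import random


def make_embedding_table(vocabulary, dimension, seed=0):
    """Create one randomly initialized vector per vocabulary term."""
    rng = random.Random(seed)
    stddev = 1.0 / dimension ** 0.5
    return {
        term: [rng.gauss(0.0, stddev) for _ in range(dimension)]
        for term in vocabulary
    }


def embed(values, table):
    """Look up the vector of each value; equal values share one vector."""
    return [table[v] for v in values]


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
vocabulary = list(dict.fromkeys(points))  # unique values, first-seen order
table = make_embedding_table(vocabulary, dimension=3)
vectors = embed(points, table)
for p, vec in zip(points, vectors):
    print(p, [round(x, 4) for x in vec])
```

In a real model the table entries are updated during training; the values printed by `demo` are only the initial random weights.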


### 5. Hashed feature columns

In [28]:
data["point"]

0     c
1     f
2    c+
3    b+
4     b
5     c
6     a
7    d+
8    b+
9     c
Name: point, dtype: object

In [29]:
point_hashed = feature_column.categorical_column_with_hash_bucket(
      'point', hash_bucket_size=3)
demo(feature_column.indicator_column(point_hashed))

Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
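
A hashed column computes `hash(value) % hash_bucket_size`, so no vocabulary is needed, but different values can collide. With 7 distinct values and 3 buckets, collisions are unavoidable, and in the output above only buckets 0 and 1 are ever used. The sketch below shows the idea with MD5 from the standard library as a stand-in hash. This is an assumption for illustration: TensorFlow uses its own fingerprint function, so the bucket ids here will generally differ from the output above. The helper `hash_bucket` is hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
ids = [hash_bucket(p, 3) for p in points]
print(ids)
```

A larger `hash_bucket_size` lowers the collision rate at the cost of a wider one-hot vector.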


### 6. Crossed feature columns

In [33]:
point

VocabularyListCategoricalColumn(key='point', vocabulary_list=('c', 'f', 'c+', 'b+', 'b', 'a', 'd+'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [34]:
grade

VocabularyListCategoricalColumn(key='grade', vocabulary_list=('poor', 'average', 'good'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [35]:
crossed_feature = feature_column.crossed_column(
    [point, grade], hash_bucket_size=100)
demo(feature_column.indicator_column(crossed_feature))

SystemError: <built-in function TFE_Py_FastPathExecute> returned a result with an error set
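
Conceptually, a crossed column hashes the combination of its input features and takes the result modulo `hash_bucket_size`. The following sketch shows that idea in plain Python under stated assumptions: MD5 stands in for TensorFlow's internal hash, and the separator used to join the values is arbitrary, so the bucket ids are illustrative only. The helper `cross_bucket` is hypothetical, and the sketch does not diagnose the `SystemError` above.

```python
import hashlib


def cross_bucket(values, hash_bucket_size):
    """Hash a combination of feature values into [0, hash_bucket_size)."""
    key = '\x1f'.join(values)  # arbitrary separator (assumption)
    digest = hashlib.md5(key.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size


points = ['c', 'f', 'c+', 'b+', 'b', 'c', 'a', 'd+', 'b+', 'c']
grades = ['average', 'poor', 'average', 'good', 'good',
          'average', 'good', 'average', 'good', 'average']
pairs = list(zip(points, grades))
cross_ids = [cross_bucket(pair, 100) for pair in pairs]
for pair, bucket in zip(pairs, cross_ids):
    print(pair, bucket)
```

Rows with the same `(point, grade)` pair, such as `('c', 'average')`, always land in the same bucket, which lets a model learn a weight for the combination rather than for each feature alone.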