# TensorFlow Categorical Encoding

# References

* [Classify structured data using Keras preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers)

> This tutorial demonstrates how to classify structured data, such as tabular data, using a simplified version of the PetFinder dataset from a Kaggle competition stored in a CSV file.

## [tf.keras.layers.CategoryEncoding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding)

Keras preprocessing layer that encodes integer category indices as one-hot, multi-hot, or count vectors.


## [tf.one_hot](https://www.tensorflow.org/api_docs/python/tf/one_hot)
* ```tf.one_hot``` does **NOT accept string** categories. Convert strings into integer indices first.
* ```tf.one_hot``` requires a **depth** argument that gives the number of unique categories, which is the length of each one-hot vector.
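
As a rough illustration of the two points above, the following plain-Python sketch (not TensorFlow code) first maps strings to integer indices and then builds one-hot rows of length `depth`. The helper `one_hot` and the sample categories are made up for this example; with TensorFlow, the second step would be a call to `tf.one_hot(indices, depth)`.

```python
# Plain-Python illustration only; all names below are hypothetical.
categories = ['cat', 'dog', 'fish', 'dog']

# Step 1: strings -> integer indices (tf.one_hot does not do this step).
vocabulary = sorted(set(categories))            # ['cat', 'dog', 'fish']
index_of = {token: i for i, token in enumerate(vocabulary)}
indices = [index_of[token] for token in categories]

# Step 2: integer indices -> one-hot rows; depth is the number of unique categories.
def one_hot(indices, depth):
    return [[1.0 if i == index else 0.0 for i in range(depth)] for index in indices]

depth = len(vocabulary)
encoded = one_hot(indices, depth)
for token, row in zip(categories, encoded):
    print(token, row)
```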


## tf.feature_column

The ```tf.feature_column``` module was designed for TF1 Estimators. Do NOT use it in TF2; use Keras preprocessing layers instead.

* [Classify structured data with feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns)

> tf.feature_columns module was designed for use with TF1 Estimators. In TF2, Keras preprocessing layers cover this functionality, for migration instructions see the Migrating feature columns guide.


<img src="./image/keras_category_encoding.png" align="left" width=750/>

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np

In [2]:
url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
file_path = "~/.keras/datasets/petfinder-mini/petfinder-mini.csv"

tf.keras.utils.get_file('petfinder_mini.zip', url, extract=True)
df = pd.read_csv(file_path)

In [3]:
# In the original dataset, `'AdoptionSpeed'` of `4` indicates a pet was not adopted.
df['label'] = np.where(df['AdoptionSpeed']==4, 0, 1)

# Drop unused features.
df = df.drop(columns=['AdoptionSpeed', 'Description'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11537 entries, 0 to 11536
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Type          11537 non-null  object
 1   Age           11537 non-null  int64 
 2   Breed1        11537 non-null  object
 3   Gender        11537 non-null  object
 4   Color1        11537 non-null  object
 5   Color2        11537 non-null  object
 6   MaturitySize  11537 non-null  object
 7   FurLength     11537 non-null  object
 8   Vaccinated    11537 non-null  object
 9   Sterilized    11537 non-null  object
 10  Health        11537 non-null  object
 11  Fee           11537 non-null  int64 
 12  PhotoAmt      11537 non-null  int64 
 13  label         11537 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 1.2+ MB


In [5]:
df.head(3)

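| | Type | Age | Breed1 | Gender | Color1 | Color2 | MaturitySize | FurLength | Vaccinated | Sterilized | Health | Fee | PhotoAmt | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cat | 3 | Tabby | Male | Black | White | Small | Short | No | No | Healthy | 100 | 1 | 1 |
| 1 | Cat | 1 | Domestic Medium Hair | Male | Black | Brown | Medium | Medium | Not Sure | Not Sure | Healthy | 0 | 2 | 1 |
| 2 | Dog | 1 | Mixed Breed | Male | Brown | White | Medium | Medium | Yes | No | Healthy | 0 | 7 | 1 |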


---
# Data

In [6]:
train, val, test = np.split(df.sample(frac=1), [int(0.8*len(df)), int(0.9*len(df))])
del df
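
The two cut points passed to `np.split` give an 80/10/10 split of the shuffled rows. A minimal stdlib sketch of the same index arithmetic follows; the row count is the one reported by `df.info()` above, while the fixed seed and the variable names are assumptions made only for this illustration.

```python
import random

n_rows = 11537                      # row count reported by df.info()
indices = list(range(n_rows))
random.Random(0).shuffle(indices)   # fixed seed, assumed here for repeatability

# Same cut points as np.split(..., [int(0.8*len(df)), int(0.9*len(df))])
cut_train, cut_val = int(0.8 * n_rows), int(0.9 * n_rows)
train_idx = indices[:cut_train]
val_idx = indices[cut_train:cut_val]
test_idx = indices[cut_val:]

print(len(train_idx), len(val_idx), len(test_idx))   # 9229 1154 1154
```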

## Convert pandas dataframe to TF dataset

In [7]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('label')
    ds = tf.data.Dataset.from_tensor_slices((
        dict(dataframe),   # <--- X: features
        labels             # <--- Y: labels
    ))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))

    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds
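
To make the shape of each batch concrete, here is a plain-Python sketch of the same shuffle-then-batch order: every batch is a pair of a feature dict (column name to values) and a list of labels. It is only an illustration with made-up data and a hypothetical helper `iterate_batches`; `tf.data` works on tensors and evaluates lazily, which this sketch does not attempt to reproduce.

```python
import random

def iterate_batches(columns, labels, batch_size, shuffle=True, seed=0):
    """Yield (features, labels) batches from a dict of equal-length columns."""
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        features = {name: [values[i] for i in rows] for name, values in columns.items()}
        yield features, [labels[i] for i in rows]

# Made-up rows for illustration.
columns = {'Type': ['Cat', 'Dog', 'Dog', 'Cat', 'Dog'], 'Age': [3, 1, 1, 2, 7]}
labels = [1, 1, 0, 1, 1]

batches = list(iterate_batches(columns, labels, batch_size=2, shuffle=False))
for features, label_batch in batches:
    print(features, label_batch)
```

The last batch is smaller than `batch_size` when the number of rows is not a multiple of it.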

## Examine the dataset

In [8]:
batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)


In [9]:
tf.data.experimental.get_structure(train_ds)

({'Type': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Age': TensorSpec(shape=(None,), dtype=tf.int64, name=None),
  'Breed1': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Gender': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Color1': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Color2': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'MaturitySize': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'FurLength': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Vaccinated': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Sterilized': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Health': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'Fee': TensorSpec(shape=(None,), dtype=tf.int64, name=None),
  'PhotoAmt': TensorSpec(shape=(None,), dtype=tf.int64, name=None)},
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

In [10]:
[(train_features, label_batch)] = train_ds.take(1)
print('Features:', list(train_features.keys()))
print('A batch of ages:', train_features['Age'])
print('A batch of targets:', label_batch )

Features: ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt']
A batch of ages: tf.Tensor([2 7 2 1 1], shape=(5,), dtype=int64)
A batch of targets: tf.Tensor([1 1 0 1 1], shape=(5,), dtype=int64)


---
# Keras Preprocessing
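Note that a preprocessing layer is a **Keras layer**: it is called on tensors or Keras inputs, and its ```adapt``` method learns its state (for example, a vocabulary) from a TF dataset.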

## Categorical value to integer value

The top N most frequent tokens are used to create the vocabulary. All others are treated as out-of-vocabulary (OOV). Note that N counts the OOV token as well.

* [StringLookup(max_tokens=N)](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup)
* [IntegerLookup(max_tokens=N)](https://www.tensorflow.org/api_docs/python/tf/keras/layers/IntegerLookup)
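
The lookup step can be pictured with a small stdlib sketch: count the tokens, keep the most frequent ones, and send everything else to the OOV index. The helpers `build_vocabulary` and `lookup` and the breed list are hypothetical; the sketch assumes that index 0 is the OOV index and that one of the `max_tokens` slots is taken by the OOV token, which matches the width of the encoded outputs shown later in this notebook.

```python
from collections import Counter

def build_vocabulary(values, max_tokens=None, oov_token='[UNK]'):
    """Return [oov_token, most frequent token, second most frequent, ...]."""
    counts = Counter(values)
    # Assumption: one slot is reserved for the OOV token, so only
    # max_tokens - 1 real tokens are kept.
    keep = None if max_tokens is None else max_tokens - 1
    return [oov_token] + [token for token, _ in counts.most_common(keep)]

def lookup(values, vocabulary):
    index_of = {token: i for i, token in enumerate(vocabulary)}
    return [index_of.get(value, 0) for value in values]   # 0 is the OOV index

# Made-up values for illustration.
breeds = ['Tabby', 'Mixed', 'Mixed', 'Persian', 'Mixed', 'Tabby', 'Siamese']
vocabulary = build_vocabulary(breeds, max_tokens=3)
print(vocabulary)                                                   # ['[UNK]', 'Mixed', 'Tabby']
print(lookup(['Mixed', 'Tabby', 'Persian', 'Poodle'], vocabulary))  # [1, 2, 0, 0]
```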

## Category Encoding (OHE/MHE) Layer in Keras

The ```CategoryEncoding``` layer takes a column of integer indices and produces OHE or MHE encoded columns. It can NOT accept strings, hence string columns or non-contiguous integer columns need to be converted into contiguous integer indices via StringLookup or IntegerLookup.

* [CategoryEncoding(num_tokens=None, output_mode=<>)](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding)

### One Hot Encoding vs Multi Hot Encoding

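OHE sets exactly one position per sample. For ```data=['cat', 'dog', 'fish', 'bird', 'ant']```, OHE requires an array of size ```N=5```, such as ```(1,0,0,0,0)``` for **cat**. In Keras, ```output_mode='multi_hot'``` also produces an array of size ```num_tokens```, with a 1 for every category present in the sample, so for a single-valued column it gives the same vector as OHE. The space-saving scheme described in the answer linked below is a binary code that needs only $\lceil \log_2 N \rceil = 3$ positions for ```N=5```, such as ```[0,0,0]``` for **cat**; that is not what ```CategoryEncoding``` outputs.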


* [What exactly is multi-hot encoding and how is it different from one-hot?](https://stats.stackexchange.com/a/467672)

> multi-hot-encoding introduces false additive relationships, e.g. ```[0,0,1] + [0,1,0] = [0,1,1]``` that is ```'dog' + 'fish' = 'bird'```. That is the price you pay for the reduced representation.
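
The following plain-Python sketch puts three encodings of the same indices side by side: one-hot, multi-hot in the sense of "a 1 at every index present in the sample", and the compact binary code that the quote above describes. The function names are hypothetical and nothing here calls TensorFlow; it is only meant to show how the vector lengths and the additive effect differ.

```python
import math

def one_hot(index, num_tokens):
    return [1 if i == index else 0 for i in range(num_tokens)]

def multi_hot(indices, num_tokens):
    """A 1 at every index present in the sample (several may be set)."""
    present = set(indices)
    return [1 if i in present else 0 for i in range(num_tokens)]

def binary_code(index, num_tokens):
    """Compact code using ceil(log2(num_tokens)) bits, as in the quote."""
    width = max(1, math.ceil(math.log2(num_tokens)))
    return [int(bit) for bit in format(index, '0{}b'.format(width))]

data = ['cat', 'dog', 'fish', 'bird', 'ant']
n = len(data)
print(one_hot(0, n))          # 'cat' as one-hot: [1, 0, 0, 0, 0]
print(multi_hot([0], n))      # a single value gives the same vector as one-hot
print(multi_hot([1, 3], n))   # a sample holding both 'dog' and 'bird': [0, 1, 0, 1, 0]
print(binary_code(0, n))      # 'cat' in 3 bits: [0, 0, 0]
```

With the binary code, `binary_code(1, n)` plus `binary_code(2, n)` equals `binary_code(3, n)` element by element, which is the false additive relationship the quote warns about.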

## Keras layer to convert categorical into MHE

Convert a TF dataset categorical column (single TF Tensor) into MHE columns (single Tensor having multiple columns).

In [11]:
def get_category_encoding_layer(dataset, name, dtype, max_tokens=None, oov_token=None):
    """Create a function that converts a column into Multi Hot Encoding.
    The returned function works as follows.
    1. Convert strings/integers in the target column (dataset[name]) into indices.
       e.g. ['cat', 'dog', 'fish', 'bird', 'ant'] into indices such as [1,2,3,4,5]
       (index 0 is the OOV index).
    2. Convert the indices in the column into Multi Hot Encoding.
    
    Args:
        dataset: TF Dataset that has the target column to adapt the lookup layer to.
        name: The name that identifies the target column in the dataset.
        dtype: 'string' for a string column; any other value is treated as an integer column.
        max_tokens: 
            The vocabulary is capped at max_tokens entries, keeping the most frequent tokens. 
            All others will be treated as out-of-vocabulary (OOV).
        oov_token: Value that represents OOV. Defaults to '[UNK]' for strings and -1 for integers.

    Returns: Function that applies the lookup layer and then the category encoder.
    """
    if dtype == 'string':
        # Create a layer that turns strings into integer indices.
        oov_token = oov_token if isinstance(oov_token, str) else '[UNK]'
        lookup = tf.keras.layers.StringLookup(max_tokens=max_tokens, oov_token=oov_token)
    else:
        # Otherwise, create a layer that turns integer values into integer indices.
        oov_token = oov_token if isinstance(oov_token, int) else -1
        lookup = tf.keras.layers.IntegerLookup(max_tokens=max_tokens, oov_token=oov_token)

    # Extract the target feature column by "name" from the "dataset"
    feature = dataset.map(lambda features, label: features[name])

    # Fit the lookup table (string -> int) to the values in the feature column.
    lookup.adapt(feature)

    # Encode the integer indices as multi-hot vectors of length lookup.vocabulary_size().
    encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(), output_mode='multi_hot')

    def f(column):
        """Apply multi-hot encoding"""
        return encoder(lookup(column))

    return f

In [12]:
# Test the string categorical 'Type' column conversion into MHE
tensor_column_categorical_type = tf.constant([
    [pet.numpy()] for pet in train_features['Type']
])

test_type_layer = get_category_encoding_layer(
    dataset=train_ds,
    name='Type',
    dtype='string'
)
tensor_column_mhe_type = test_type_layer(tensor_column_categorical_type)

for i in range(len(tensor_column_categorical_type)):
    print("{} : {}".format(
        tensor_column_categorical_type[i].numpy(),
        tensor_column_mhe_type[i].numpy()
    ))
    
del test_type_layer, tensor_column_categorical_type, tensor_column_mhe_type


[b'Dog'] : [0. 1. 0.]
[b'Cat'] : [0. 0. 1.]
[b'Dog'] : [0. 1. 0.]
[b'Dog'] : [0. 1. 0.]
[b'Cat'] : [0. 0. 1.]
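
The output above is consistent with a three-entry vocabulary in which index 0 is the OOV token, followed by 'Dog' and 'Cat'. The plain-Python sketch below reproduces the same vectors under that assumption; the vocabulary order is inferred from the printed rows rather than read from the layer, and `encode` is a hypothetical helper, not part of Keras.

```python
# Vocabulary inferred from the printed rows: OOV first, then 'Dog', then 'Cat'.
vocabulary = ['[UNK]', 'Dog', 'Cat']
index_of = {token: i for i, token in enumerate(vocabulary)}

def encode(value):
    """Look up the index, then set that position in a zero vector."""
    index = index_of.get(value, 0)          # unknown values fall back to the OOV index
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]

for pet in ['Dog', 'Cat', 'Rabbit']:
    print(pet, encode(pet))
```

A value that was never seen, such as 'Rabbit' here, lands in the first position.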


In [13]:
tensor_column_categorical_age = tf.constant([
    [pet.numpy()] for pet in train_features['Age']
])

test_age_layer = get_category_encoding_layer(
    dataset=train_ds,
    name='Age',
    dtype='int64',
    max_tokens=5
)
tensor_column_mhe_age = test_age_layer(tensor_column_categorical_age)

for i in range(len(tensor_column_categorical_age)):
    print("{} : {}".format(
        tensor_column_categorical_age[i].numpy(),
        tensor_column_mhe_age[i].numpy()
    ))
    
del test_age_layer, tensor_column_categorical_age, tensor_column_mhe_age

[2] : [0. 1. 0. 0. 0.]
[7] : [1. 0. 0. 0. 0.]
[2] : [0. 1. 0. 0. 0.]
[1] : [0. 0. 0. 1. 0.]
[1] : [0. 0. 0. 1. 0.]


## Keras layer to normalize numeric values

In [14]:
def get_normalization_layer(name, dataset):
    # Create a Normalization layer for the feature.
    normalizer = tf.keras.layers.Normalization(axis=None)

    # Prepare a Dataset that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)

    return normalizer
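
What the adapted layer computes can be sketched in plain Python: learn the mean and variance of the feature, then map each value to `(x - mean) / sqrt(variance)`. The helper `fit_normalizer` and the fee values are made up for illustration, and the sketch ignores details such as the small epsilon a real implementation may use to avoid division by zero.

```python
import math
import statistics

def fit_normalizer(values):
    """Return a function that maps x to (x - mean) / sqrt(variance)."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)   # population variance
    std = math.sqrt(variance)
    return lambda x: (x - mean) / std

fees = [100, 0, 0, 50, 100]        # made-up values for illustration
normalize = fit_normalizer(fees)
normalized = [normalize(fee) for fee in fees]
print([round(value, 3) for value in normalized])
```

After this transformation the values have zero mean and unit variance.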

---
# Training

## Create training, validation, and test datasets

In [15]:
batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

## Keras model  

In [16]:
all_inputs = []
encoded_features = []

### Horizontal Keras preprocessing layers for numerical normalization

In [17]:
# Numerical features.
for header in ['PhotoAmt', 'Fee']:
    numeric_col = tf.keras.Input(shape=(1,), name=header)
    normalization_layer = get_normalization_layer(header, train_ds)
    encoded_numeric_col = normalization_layer(numeric_col)
    all_inputs.append(numeric_col)
    encoded_features.append(encoded_numeric_col)

### Horizontal Keras preprocessing layers for numerical categorical into MHE

In [18]:
numeric_input_feature = tf.keras.Input(shape=(1,), name='Age', dtype='int64')
numeric_category_encoding_layer = get_category_encoding_layer(
    name='Age',
    dataset=train_ds,
    dtype='int64',
    max_tokens=5
)
categorically_encoded_feature = numeric_category_encoding_layer(numeric_input_feature)
all_inputs.append(numeric_input_feature)
encoded_features.append(categorically_encoded_feature)

### Horizontal Keras preprocessing layers for string categorical into MHE

In [19]:
string_categorical_columns = [
    'Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',  'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1'
]

for column_name in string_categorical_columns:
    string_input_feature = tf.keras.Input(shape=(1,), name=column_name, dtype='string')

    # String category encoding layer
    string_category_encoding_layer = get_category_encoding_layer(
        name=column_name,
        dataset=train_ds,
        dtype='string',
        max_tokens=5,
        oov_token='[UNK]'
    )
    # Categorical encoding
    categorically_encoded_feature = string_category_encoding_layer(string_input_feature)

    all_inputs.append(string_input_feature)
    encoded_features.append(categorically_encoded_feature)

In [20]:
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)

In [21]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
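
Because the final `Dense(1)` layer has no activation, the model outputs a raw logit, and `from_logits=True` tells the loss to apply the sigmoid itself. The stdlib sketch below shows the per-example quantity under that assumption, using a numerically stable form; `bce_from_logit` is a hypothetical helper, not the Keras implementation.

```python
import math

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed directly from the logit."""
    # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

print(round(bce_from_logit(0.0, 1), 4))   # 0.6931, i.e. log(2) for an undecided model
print(round(bce_from_logit(2.0, 1), 4))   # confident and correct: small loss
print(round(bce_from_logit(2.0, 0), 4))   # confident and wrong: large loss
```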

## Keras Model Training

In [22]:
# Use `rankdir='LR'` to make the graph horizontal.
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')


In [23]:
model.fit(train_ds, epochs=100, validation_data=val_ds)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x7f1e93ef2910>

In [24]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.754766047000885


In [25]:
!mkdir -p model
model.save('model/pet_classifier_model')
reloaded_model = tf.keras.models.load_model('model/pet_classifier_model')

del train_ds, val_ds, test_ds, model


INFO:tensorflow:Assets written to: model/pet_classifier_model/assets



# Prediction

In [26]:
sample = {
    'Type': 'Cat',
    'Age': 3,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = reloaded_model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)

This particular pet had a 73.5 percent probability of getting adopted.
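
The model returns a logit, so the sigmoid is what turns it into a probability. As a small stdlib check of that relationship, the sketch below works backwards from the printed 73.5 percent to the logit it corresponds to and then applies the sigmoid again. The helper names are hypothetical, and the logit is derived here rather than taken from the model.

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def logit_of(probability):
    return math.log(probability / (1.0 - probability))

# The logit that corresponds to the 73.5% printed above (derived, not model output).
raw_output = logit_of(0.735)
print(round(raw_output, 2))
print("%.1f percent" % (100 * sigmoid(raw_output)))   # 73.5 percent
```

A logit of 0 corresponds to a probability of exactly 50 percent; positive logits lean towards adoption and negative ones away from it.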
