# tf.keras.layers.CategoryEncoding behavior

* [tf.keras.layers.CategoryEncoding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding)

Notes on the Keras ```CategoryEncoding``` preprocessing layer.

## References

* [Explanation of tf.keras.layers.CategoryEncoding output_mode='multi_hot' behavior](https://stackoverflow.com/questions/69792031/explanation-of-tf-keras-layers-categoryencoding-output-mode-multi-hot-behavior)

* https://github.com/tensorflow/tensorflow/issues/52892

* [Classify structured data using Keras preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers)

> This tutorial demonstrates how to classify structured data, such as tabular data, using a simplified version of the PetFinder dataset from a Kaggle competition stored in a CSV file.

In [1]:
import tensorflow as tf
import numpy as np

## Category Encoding (OHE/MHE) Layer in Keras

The ```CategoryEncoding``` layer takes an integer column and produces one-hot (OHE) or multi-hot (MHE) encoded columns. It cannot accept strings, so string columns and discrete, non-contiguous integer columns must first be converted into contiguous integer indices via ```StringLookup``` or ```IntegerLookup```.

* [CategoryEncoding(num_tokens=None, output_mode=<>)](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding)
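The integer case can be sketched the same way as the string case. This is a minimal example, assuming a made-up column of sparse integer IDs (```raw_ids``` and its values are hypothetical): ```IntegerLookup``` maps them to contiguous indices, which ```CategoryEncoding``` can then encode.

```python
import tensorflow as tf

# Hypothetical sparse integer IDs (illustrative values only).
raw_ids = tf.constant([100, 2500, 100, 73, 2500], dtype=tf.int64)

# Passing the vocabulary explicitly keeps the index assignment deterministic.
# With the default settings, index 0 is reserved for out-of-vocabulary values.
int_lookup = tf.keras.layers.IntegerLookup(vocabulary=[73, 100, 2500])
indices = int_lookup(raw_ids)

# Encode the contiguous indices; num_tokens includes the OOV slot.
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=int_lookup.vocabulary_size(), output_mode="one_hot"
)
encoded = encoder(indices).numpy()

print(indices.numpy())
print(encoded)
```

Each row of ```encoded``` has a single 1 at the contiguous index assigned by the lookup layer, regardless of how large the raw ID was.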

### One Hot Encoding vs Multi Hot Encoding

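In the binary-encoding sense described in the answer linked below, MHE saves space. For ```data=['cat', 'dog', 'fish', 'bird', 'ant']```, OHE requires an array of size ```N=5```, such as ```(1,0,0,0,0)``` for **cat**. Binary-style MHE needs only $\lceil \log_2 N \rceil = 3$ bits, such as ```[0,0,0]``` for **cat**. Note that Keras ```output_mode="multi_hot"``` is different: it returns an array of size ```num_tokens``` with a 1 at the index of every token present in the sample.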


* [What exactly is multi-hot encoding and how is it different from one-hot?](https://stats.stackexchange.com/a/467672)

> multi-hot-encoding introduces false additive relationships, e.g. ```[0,0,1] + [0,1,0] = [0,1,1]``` that is ```'dog' + 'fish' = 'bird'```. That is the price you pay for the reduced representation.
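The array sizes and the additive relationship from the quote can be checked with plain Python. This is only a sketch: the helpers ```one_hot```, ```binary_code``` and ```multi_hot``` are hypothetical names, not Keras functions, and the category-to-index mapping is simply list order. The ```multi_hot``` helper imitates the set-membership style of output that Keras produces for ```output_mode="multi_hot"```, which is not the same thing as the binary code.

```python
import math

categories = ['cat', 'dog', 'fish', 'bird', 'ant']
n = len(categories)

def one_hot(index, size):
    """Vector of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def binary_code(index, width):
    """Binary representation of `index`, most significant bit first."""
    return [int(bit) for bit in format(index, f'0{width}b')]

def multi_hot(indices, size):
    """Vector of length `size` with a 1 for every index that is present."""
    present = set(indices)
    return [1 if i in present else 0 for i in range(size)]

# Number of bits needed to give each of the n categories a distinct code.
width = math.ceil(math.log2(n))

cat_ohe = one_hot(categories.index('cat'), n)          # length 5
cat_bin = binary_code(categories.index('cat'), width)  # length 3

# The false additive relationship: dog (001) + fish (010) equals bird (011).
dog = binary_code(categories.index('dog'), width)
fish = binary_code(categories.index('fish'), width)
bird = binary_code(categories.index('bird'), width)
dog_plus_fish = [a + b for a, b in zip(dog, fish)]

# Set-membership multi-hot for a sample containing both dog and bird.
dog_and_bird = multi_hot([categories.index('dog'), categories.index('bird')], n)

print(cat_ohe, cat_bin, dog_plus_fish == bird, dog_and_bird)
```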

## Convert categorical into MHE

Convert a TF dataset categorical column (single TF Tensor) into MHE columns (single Tensor having multiple columns).

In [103]:
dataset = tf.data.Dataset.from_tensor_slices(tf.constant(['cat', 'dog', 'fish', 'bird']))

lookup = tf.keras.layers.StringLookup(max_tokens=5, oov_token='[UNK]')
lookup.adapt(dataset)
lookup.get_vocabulary()

['[UNK]', 'fish', 'dog', 'cat', 'bird']

In [102]:
mhe = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(), output_mode="multi_hot")
print(f"cat: {mhe(lookup(tf.constant('cat'))).numpy()}")
print(f"dog: {mhe(lookup(tf.constant('dog'))).numpy()}")

## Convert categorical into OHE

In [124]:
ohe = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(), output_mode="one_hot")
print(f"cat: {ohe(lookup(tf.constant('cat'))).numpy()}")
print(f"dog: {ohe(lookup(tf.constant('dog'))).numpy()}")

cat: [0. 0. 0. 1. 0.]
dog: [0. 0. 1. 0. 0.]


In [130]:
print(ohe(lookup(tf.constant(['cat', 'dog']))).numpy())

[[0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]]


## Handling multiple values

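With ```output_mode='multi_hot'```, ```CategoryEncoding``` does not encode each element of a 1D list separately. It treats a 1D tensor as a single sample and merges all of its values into one vector. To encode each value on its own row, pass a 2D tensor of shape ```(batch, 1)```.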

In [135]:
# A 1D tensor is treated as one sample: all three tokens are merged into a single vector
print(mhe(lookup(tf.constant(['cat', 'dog', 'bird']))).numpy())

[0. 0. 1. 1. 1.]


In [136]:
# A 2D tensor of shape (batch, 1) encodes each row separately
print(mhe(lookup(tf.constant([['cat'], ['dog']]))).numpy())

[[0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]]
