In [1]:
from sklearn.datasets import make_classification

# Categorical Data

With this next dataset we start to move into deep learning territory. 

Now, not all categorical data is better suited to deep learning, but high-cardinality categorical data (aka columns with a lot of categories) often is.

Older ML algorithms can only treat each category as a completely separate entity, whereas deep learning, through the use of embeddings, can capture the similarities between categories. The classic example of this is word embeddings, but the same thing can be done with zipcodes.
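To make the "completely separate" point concrete, here is a small sketch comparing one-hot vectors with an embedding table. The four categories and the 2-d embedding values are made up for illustration; in a real model the embedding values are learned.

```python
import numpy as np

n_categories = 4  # hypothetical tiny example, e.g. four zipcodes

# One-hot encoding: each category is a unit vector, so every pair of
# distinct categories is exactly the same distance apart.
one_hot = np.eye(n_categories)
one_hot_dists = np.linalg.norm(one_hot[:, None] - one_hot[None, :], axis=-1)

# A hand-written 2-d "embedding" table (invented numbers, not learned).
# Rows 0 and 1 sit close together, as do rows 2 and 3.
embedding = np.array([
    [0.9, 0.1],
    [1.0, 0.2],
    [-0.8, 0.5],
    [-0.9, 0.4],
])
emb_dists = np.linalg.norm(embedding[:, None] - embedding[None, :], axis=-1)

print(one_hot_dists.round(2))  # all off-diagonal entries are equal
print(emb_dists.round(2))      # some pairs are much closer than others
```

With one-hot vectors there is no way to say that two categories are alike; with an embedding table, similar categories can end up near each other.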

So let's get cracking by making a dataset.

In [29]:
numeric_dataset = make_classification(
    n_samples=10_000, 
    n_features=25, 
    n_informative=10,
    n_classes=2)

x, y = numeric_dataset

In [30]:
import pandas as pd
import numpy as np

np.set_printoptions(precision=1)

In [31]:
num_categories = 100
for i in range(5):
    x[:, i] = pd.cut(x[:, i], num_categories, labels=False)

In [32]:
x[0]

array([59. , 31. , 42. , 55. , 64. ,  0.7,  0.2, -1.9,  0.3, -1. , -0.1,
       -0.2, -0.6,  0.1, -1.2,  0.6,  1.6, -2.2, -0.4,  1.3, -0.5,  0.8,
        2. , -0.2, -1.5])

In [33]:
x_numeric = x[:, 5:]
x_cat = x[:, :5]

We have 5 different variables with 100 categories each. 
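As a quick check on what `pd.cut` did above, here is a minimal sketch on a stand-in column (the random data and the choice of 10 bins are assumptions for illustration). With an integer bin count, `pd.cut` makes equal-width bins, and `labels=False` returns the integer bin code for each value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.normal(size=1_000)  # stand-in for one numeric column

# Equal-width binning into 10 bins; labels=False returns integer bin codes
codes = pd.cut(values, 10, labels=False)

print(codes[:5])
print(codes.min(), codes.max())
print(np.unique(codes).size)  # number of bins that actually contain data
```

Because the bins are equal-width rather than equal-frequency, bins out in the tails can be sparse or even empty, so the number of distinct codes that actually occur may be smaller than the number of bins requested.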

The next step is to standardize the inputs. The nice thing about categoricals is that we won't need to standardize them. We will still need to standardize the numeric ones.

In [35]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

x_standardized = ss.fit_transform(x_numeric)
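For reference, standardizing just means subtracting each column's mean and dividing by its standard deviation. The sketch below does this by hand in NumPy on made-up features; it is meant as an illustration of the idea, not a replacement for `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical numeric features with very different scales
features = rng.normal(
    loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(1_000, 3))

mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation (ddof=0)
standardized = (features - mean) / std

print(standardized.mean(axis=0).round(3))  # close to 0 for every column
print(standardized.std(axis=0).round(3))   # close to 1 for every column
```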

Now we can start to make our model. 

In [36]:
import tensorflow as tf

p = .1

numeric_inputs = tf.keras.layers.Input((20,), name='numeric_inputs')
cat_inputs = tf.keras.layers.Input((5,), name='cat_inputs')

Notice that now our model takes two inputs, categorical and numeric. The categorical inputs are fed into an embedding layer:

In [37]:
def emb_sz_rule(n_cat): 
    return min(600, round(1.6 * n_cat**0.56))

embedding_layer = tf.keras.layers.Embedding(
    num_categories, 
    emb_sz_rule(num_categories), 
    input_length=5)
cat_embedded = embedding_layer(cat_inputs)
cat_embedded = tf.keras.layers.Flatten()(cat_embedded)

Above we make an embedding layer. An embedding layer stores a vector of weights for each category and in that way learns how the categories relate. To decide how many weights to use per category, we use `emb_sz_rule`. It's a pretty good rule of thumb (it comes from fast.ai). We keep the result under a new name, `cat_embedded`, so that `cat_inputs` still refers to the input layer when we build the model.

Next we pass both the embeddings and the numeric inputs into the same network we used last time:

In [38]:
x = tf.keras.layers.Concatenate()([cat_embedded, numeric_inputs])

x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(20, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)
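To see what the embedding step contributes to the network's input, here is a NumPy sketch that treats the embedding layer as a plain lookup table. The random table and batch are stand-ins; only the shapes matter. Note that because the model uses one embedding layer for all five columns, code 7 in the first column and code 7 in the second column look up the same row.

```python
import numpy as np

def emb_sz_rule(n_cat):
    # Same rule of thumb as above
    return min(600, round(1.6 * n_cat**0.56))

num_categories = 100
emb_dim = emb_sz_rule(num_categories)

rng = np.random.default_rng(0)
# Stand-in for the embedding layer's weights: one row per category
emb_table = rng.normal(size=(num_categories, emb_dim))

batch_cat = rng.integers(0, num_categories, size=(32, 5))  # 5 categorical columns
batch_num = rng.normal(size=(32, 20))                      # 20 numeric columns

looked_up = emb_table[batch_cat]       # one vector per (row, column) pair
flattened = looked_up.reshape(32, -1)  # what Flatten does per example
combined = np.concatenate([flattened, batch_num], axis=1)

print(emb_dim, looked_up.shape, flattened.shape, combined.shape)
```

For 100 categories the rule gives 21 weights per category, so the five embedded columns contribute 5 × 21 = 105 values, and the first dense layer sees 125 inputs once the 20 numeric features are appended.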

In [39]:
model = tf.keras.models.Model(
    inputs=[numeric_inputs, cat_inputs], outputs=out)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [40]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
cat_inputs (InputLayer)         [(None, 5)]          0                                            
__________________________________________________________________________________________________
numeric_inputs (InputLayer)     [(None, 20)]         0                                            
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 25)           0           cat_inputs[0][0]                 
                                                                 numeric_inputs[0][0]             
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 25)           0           concatenate_4[0][0]        

In [41]:
import numpy as np

def bootstrap_sample_generator(batch_size):
    # Yield batches forever, sampling row indices with replacement
    while True:
        batch_idx = np.random.choice(
            x_standardized.shape[0], batch_size)
        yield ({'numeric_inputs': x_standardized[batch_idx],
                'cat_inputs': x_cat[batch_idx]}, 
               {'output': y[batch_idx]})
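One consequence of sampling with replacement is that an "epoch" of batches does not visit every row. The sketch below is a rough illustration using NumPy's `default_rng` (an assumption; the generator above uses `np.random.choice`) to count how many distinct rows show up in one epoch's worth of batches.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, batch_size = 10_000, 32

# Sampling with replacement (the default for choice): indices can repeat
# within a batch and across batches.
batch_idx = rng.choice(n_rows, batch_size)

# Draw one "epoch" worth of batches and count the distinct rows seen
steps = n_rows // batch_size
seen = np.unique(rng.choice(n_rows, size=(steps, batch_size)))

print(batch_idx.shape)
print(seen.size / n_rows)  # fraction of rows seen at least once
```

In expectation, about 1 - 1/e (roughly 63%) of the rows are seen at least once per epoch, with the rest covered by later epochs.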

In [42]:
batch_size = 32

# `fit` accepts generators directly in TF 2.1+, where `fit_generator` is deprecated
model.fit(
    bootstrap_sample_generator(batch_size),
    steps_per_epoch=10_000 // batch_size,  # whole number of batches per epoch
    epochs=5,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x135c3dcf8>

Definitely a lower accuracy, partly because we destroyed information by converting numbers into categories.

Using embeddings can help out a ton with these sorts of problems. So if you have a dataset that is mostly ordinary numeric features but also has high-cardinality categorical variables, then consider NNs.

One more thing I'll say here is that initializing the embeddings from another, similar dataset can help a lot. For example, initializing word vectors from pretrained embeddings is very common practice in NLP.
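Here is a minimal sketch of what that initialization might look like. The zipcode keys, the `pretrained` dictionary, and the embedding size are all hypothetical; the idea is just to copy over vectors for categories that exist in both datasets and leave the rest at small random values.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4

# Hypothetical pretrained vectors from a similar dataset, keyed by category
pretrained = {
    "90210": rng.normal(size=emb_dim),
    "10001": rng.normal(size=emb_dim),
}

# Categories in the new dataset; "60601" has no pretrained vector
vocab = ["10001", "60601", "90210"]

# Start from small random values, then copy pretrained rows where available
init_matrix = rng.normal(scale=0.05, size=(len(vocab), emb_dim))
for row, category in enumerate(vocab):
    if category in pretrained:
        init_matrix[row] = pretrained[category]

print(init_matrix.shape)
```

In Keras, a matrix like this could then be handed to the embedding layer with `embeddings_initializer=tf.keras.initializers.Constant(init_matrix)`, after which training fine-tunes the vectors as usual.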