## Neural Network for Tabular Data
### <u>Problem statement 2:</u> 
The dataset contains a set of categorical variables. In a neural network they can be encoded with an `Embedding` layer, which maps each categorical value to a fixed-size vector. These vectors are then concatenated with the remaining numerical variables. The size of the embedding vector is set with a common rule of thumb.

In [1]:
from sklearn.datasets import make_classification

In [32]:
# Binary classification dataset
# 25 features: the first 5 are binned into categorical variables below; the other 20 stay numerical
dataset = make_classification(n_samples=10_000, 
                              n_features=25, 
                              n_classes=2,
                              n_informative=10)
X, y = dataset
X, y

(array([[-0.63,  1.38, -1.08, ...,  1.41,  0.11, -1.3 ],
        [ 0.58, -2.3 ,  0.5 , ...,  0.36,  0.93, -0.55],
        [-1.31,  1.87,  2.65, ...,  2.02,  0.81,  0.81],
        ...,
        [ 0.09,  1.09, -4.18, ..., -0.73, -0.03, -0.15],
        [ 0.69,  4.16, -0.31, ..., -1.67, -0.3 , -0.35],
        [-0.09, -4.48, -0.11, ...,  0.84, -1.28, -0.37]]),
 array([0, 1, 1, ..., 0, 0, 1]))

In [33]:
import pandas as pd 
import numpy as np 
np.set_printoptions(precision=2, suppress=True)

In [34]:
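# Number of categories (bins) per categorical column
no_category = 100

# Bin the first 5 columns into equal-width intervals;
# labels=False returns integer bin codes in [0, no_category - 1]
for col_id in range(5):
    X[:, col_id] = pd.cut(X[:, col_id], no_category, labels=False)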
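
As a quick check of what the binning step produces, the sketch below applies `pd.cut` with `labels=False` to a small random column. The toy column, its size, and the seed are illustrative assumptions and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; values and seed are illustrative only
rng = np.random.default_rng(0)
col = rng.normal(size=1_000)

n_bins = 100
# labels=False returns, for each value, the integer index of its equal-width bin
codes = pd.cut(col, n_bins, labels=False)

# The smallest value lands in bin 0 and the largest in bin n_bins - 1,
# so the codes are valid row indices for an embedding table with n_bins rows
print(codes.min(), codes.max())
```

Because the codes start at 0 and end at `n_bins - 1`, they can be passed directly to an embedding layer whose `input_dim` equals the number of bins.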


In [35]:
X_cat = X[:, :5]
X_num = X[:, 5:]

In [36]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)
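
For reference, standardization subtracts each column's mean and divides by its standard deviation. The following is a minimal NumPy sketch of the same transformation on made-up data (the array `X_toy` and the seed are assumptions for illustration). It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default.

```python
import numpy as np

# Hypothetical toy matrix: 500 rows, 4 numerical columns
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(500, 4))

# Column-wise statistics
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # ddof=0, i.e. population standard deviation

# Standardize: each column ends up with mean 0 and standard deviation 1
X_toy_scaled = (X_toy - mean) / std

print(X_toy_scaled.mean(axis=0).round(6))
print(X_toy_scaled.std(axis=0).round(6))
```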

In [37]:
from tensorflow.keras.layers import (Input, Dropout, Dense, 
                                     BatchNormalization, Embedding, 
                                     Flatten, Concatenate
                                    )
from tensorflow.keras.models import Model

# Dropout probability 
p = .1
batch_size = 32

#### __*Separate Inputs as Numerical and Categorical:*__
- The *categorical* input is first fed into an **Embedding layer**, which learns a vector representation of each category.
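
Conceptually, an embedding layer is a trainable lookup table with one row per category. The NumPy sketch below mimics only the forward pass (lookup, flatten, concatenate) to make the tensor shapes explicit. The random table and inputs are illustrative assumptions; 21 is the embedding width that the model summary reports for 100 categories.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 100, 21          # sizes used in this notebook
batch, n_cat_cols, n_num_cols = 4, 5, 20

# Lookup table: one emb_dim-sized vector per category (random here, trainable in Keras)
table = rng.normal(size=(n_categories, emb_dim))

# Hypothetical batch: integer category codes and numerical features
cat_batch = rng.integers(0, n_categories, size=(batch, n_cat_cols))
num_batch = rng.normal(size=(batch, n_num_cols))

embedded = table[cat_batch]              # lookup  -> (4, 5, 21)
flat = embedded.reshape(batch, -1)       # flatten -> (4, 105)
features = np.concatenate([flat, num_batch], axis=1)  # concat -> (4, 125)

print(embedded.shape, flat.shape, features.shape)
```

The last shape, 5 x 21 + 20 = 125 columns, is the width of the vector that enters the first dense layer.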

In [None]:
# Rule of thumb: size of embedding vector w.r.t the number of categories 
def embedding_size(no_cat):
    return min(600, round(1.6 * no_cat**0.56))
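
To get a feel for the rule of thumb, the sketch below evaluates the same formula for a few category counts. The counts other than 100 are arbitrary values chosen for illustration.

```python
def embedding_size(no_cat):
    # Same rule of thumb as above: grows sub-linearly and is capped at 600
    return min(600, round(1.6 * no_cat**0.56))

for n in (10, 100, 1_000, 100_000):
    print(n, embedding_size(n))
# 10 6
# 100 21
# 1000 77
# 100000 600
```

The embedding width grows much more slowly than the number of categories, and the cap of 600 only takes effect for very high cardinalities.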

In [38]:
# Define Categorical & Numerical input layer 
num_inputs = Input((20,), name='num_inputs')
cat_inputs = Input((5,), name='cat_inputs')


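# Encode the categories with an embedding layer (its weights are trainable).
# A single embedding table is shared by all 5 categorical columns,
# since each column has the same number of categories.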
embedding_layer = Embedding(input_dim=no_category, 
                            output_dim=embedding_size(no_category),
                            input_length=5
                           )
cats = embedding_layer(cat_inputs)
cats = Flatten()(cats)


x = Concatenate()([cats, num_inputs])
#===== L0
x = Dropout(p)(x)
x = Dense(100, activation='relu')(x)
#===== L1
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(20, activation='relu')(x)
#===== L2
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(10, activation='relu')(x)
#===== L3
x = BatchNormalization()(x)
x = Dropout(p)(x)
#===== Output
out = Dense(1, activation='sigmoid', name='output')(x)
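
The parameter counts of this architecture can be worked out by hand, which is a useful sanity check against `model.summary()`. The sketch below is plain arithmetic under the layer sizes defined above. The dictionary keys are hypothetical labels, not the names Keras assigns to the layers.

```python
n_categories, emb_dim = 100, 21
n_cat_cols, n_num_cols = 5, 20

def dense_params(n_in, n_out):
    # One weight per input/output pair plus one bias per output unit
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # gamma, beta (trainable) + moving mean, moving variance (non-trainable)
    return 4 * n_features

# Width of the concatenated vector: flattened embeddings + numerical inputs
concat_width = n_cat_cols * emb_dim + n_num_cols  # 5 * 21 + 20 = 125

params = {
    "embedding": n_categories * emb_dim,           # shared lookup table
    "dense_100": dense_params(concat_width, 100),
    "bn_100": batchnorm_params(100),
    "dense_20": dense_params(100, 20),
    "bn_20": batchnorm_params(20),
    "dense_10": dense_params(20, 10),
    "bn_10": batchnorm_params(10),
    "output": dense_params(10, 1),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:10s} {count:6d}")
print("total", total)
```

Dropout, Flatten, Concatenate, and the input layers have no parameters, so they do not appear in the count.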

In [42]:
model = Model(
    inputs=[num_inputs, cat_inputs], 
    outputs=out
)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [43]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
cat_inputs (InputLayer)         [(None, 5)]          0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 5, 21)        2100        cat_inputs[0][0]                 
__________________________________________________________________________________________________
flatten_2 (Flatten)             (None, 105)          0           embedding_2[0][0]                
__________________________________________________________________________________________________
num_inputs (InputLayer)         [(None, 20)]         0                                            
____________________________________________________________________________________________

In [44]:
import numpy as np 

def sample_generator(batch_size):
    while True:
        batch_idx = np.random.choice(X_num_scaled.shape[0], batch_size)
        yield (
            {'num_inputs': X_num_scaled[batch_idx], 
             'cat_inputs': X_cat[batch_idx]},
            {'output': y[batch_idx]}
        )
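
The generator above draws indices with `np.random.choice`, which samples with replacement by default, so a single "epoch" may repeat some rows and skip others. A possible alternative, sketched here under the assumption that visiting every row once per epoch is desired, shuffles the indices and walks through them in order. The function name `epoch_generator` and the toy arrays are hypothetical.

```python
import numpy as np

def epoch_generator(X_num, X_cat, y, batch_size, seed=0):
    """Yield batches that cover every row exactly once per pass over the data."""
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    while True:
        order = rng.permutation(n)  # fresh shuffle for each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            yield (
                {'num_inputs': X_num[idx], 'cat_inputs': X_cat[idx]},
                {'output': y[idx]},
            )

# Toy data: 10 rows, so batch_size=4 gives batches of 4, 4, and 2 rows
X_num_toy = np.arange(20, dtype=float).reshape(10, 2)
X_cat_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10)

gen = epoch_generator(X_num_toy, X_cat_toy, y_toy, batch_size=4)
seen = np.concatenate([next(gen)[1]['output'] for _ in range(3)])
print(sorted(seen.tolist()))
```

With this variant, the number of steps per epoch would be the number of rows divided by the batch size, rounded up, so that the final partial batch is included.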

In [46]:
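# Note: `fit_generator` is deprecated since TensorFlow 2.1,
# where `model.fit` accepts generators directly.
model.fit_generator(
    generator=sample_generator(batch_size),
    steps_per_epoch=10_000 // batch_size,  # integer number of batches per epoch
    epochs=20,
    max_queue_size=10
)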

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f1f8454b610>