In [1]:
from sklearn.datasets import make_classification

# Variable Length Features

Now we start to get into the stuff that NNs shine at. 

We are still focusing on typical datasets, so no natural language, images, etc. But this time we add one twist: the features can have variable length.

One example of this is trying to classify whether somebody will default on their loan given all of the credit cards that they have. 

Before, you would have to look at aggregations of those features, such as the average balance across all the credit cards, the max balance, etc.

Now with NNs we can use all of those features directly.
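For contrast, here is a minimal sketch of the aggregation approach described above. The card data and the `aggregate_cards` helper are made up for illustration; the point is only that the output has the same length no matter how many cards go in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical customer with 4 credit cards, each described by 3 numbers.
# The values are random placeholders, not real card data.
cards = rng.normal(size=(4, 3))

def aggregate_cards(cards):
    """Collapse a (num_cards, num_features) array into one fixed-length vector."""
    # per-feature average and maximum across all of the customer's cards
    return np.concatenate([cards.mean(axis=0), cards.max(axis=0)])

fixed_length = aggregate_cards(cards)
print(fixed_length.shape)  # (6,) regardless of how many cards there are
```

The price of this approach is that anything not captured by the chosen aggregations is thrown away before the model ever sees it.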

---

To practice with this kind of data we will need to do some work to create it. We will start by using some more advanced features of the make_classification function:

In [2]:
make_classification?

In [3]:
base_dataset = make_classification(
    n_samples=10_000, 
    n_features=30, 
    n_informative=10,
    n_clusters_per_class=2,
    n_classes=4)

x, y = base_dataset

Notice that this time we have four classes. We will use those as base classes (think of them as types of credit card) to create two customer classes below. But before that we will normalize the data:

In [4]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

x_standardized = ss.fit_transform(x)

In [5]:
base_classes = []

for i in range(4):
    base_classes.append(x_standardized[y == i])

In [6]:
import numpy as np

num_points = 5_000
class1_dist = [.5, .5, 0, 0]
class2_dist = [0, .2, .6, .2]

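def make_var_len_feature_point(dist):
    """Build one customer of shape (1, 10, 30): 3 to 10 real cards, zero-padded to 10."""
    feature_sets = []
    # number of credit cards this customer has (3 to 10 inclusive)
    num_features = np.random.randint(3, 11)
    for _ in range(num_features):
        # choose which distribution the credit card comes from
        base_class = np.random.choice([0, 1, 2, 3], 1, p=dist)
        base_class_points = base_classes[base_class[0]]
        # pick one random row of that base class to act as the card's features
        feature_set_idx = np.random.choice(base_class_points.shape[0], 1)
        feature_sets.append(base_class_points[feature_set_idx])

    # pad with all-zero cards so every customer ends up with exactly 10 rows
    for _ in range(10 - num_features):
        feature_sets.append(np.zeros((1, 30)))

    # stack the cards and add a leading axis so customers can be concatenated
    return np.concatenate(feature_sets)[np.newaxis, :, :]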


class1_points = []
for _ in range(num_points):
    class1_points.append(
        make_var_len_feature_point(class1_dist))
class1_points = np.concatenate(class1_points)
    
class2_points = []
for _ in range(num_points):
    class2_points.append(
        make_var_len_feature_point(class2_dist))
class2_points = np.concatenate(class2_points)

In [7]:
class2_points.shape

(5000, 10, 30)

Notice that we have two classes above and that they have a variable number of feature sets (or in concrete terms, our customers have a variable number of credit cards). Each feature set represents information about a single credit card (so each one is a series of 30 numbers).

I make the two customer classes distinct by giving them different mixes of credit cards. The two class distributions above (class1_dist and class2_dist) give the probability that a card is drawn from each of the four base classes, so the two kinds of customer generally hold different types of credit cards.

The final thing to notice here is that we zero-pad anyone with fewer than 10 cards up to 10. Unfortunately this is necessary if you want batch sizes greater than 1. That said, in more sophisticated applications you will see people group customers with a similar number of cards together, so that every example in a batch has the same length.
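To make the two options concrete, here is a rough sketch of zero-padding next to grouping by card count. The helper names `pad_to_length` and `bucket_by_length` are mine, not from any library, and the customers are random placeholders.

```python
import numpy as np
from collections import defaultdict

def pad_to_length(cards, max_cards=10):
    """Zero-pad a (num_cards, num_features) array to (max_cards, num_features)."""
    padding = np.zeros((max_cards - cards.shape[0], cards.shape[1]))
    return np.concatenate([cards, padding])

def bucket_by_length(customers):
    """Group customers by card count so each group stacks without padding."""
    buckets = defaultdict(list)
    for cards in customers:
        buckets[cards.shape[0]].append(cards)
    return {n: np.stack(group) for n, group in buckets.items()}

rng = np.random.default_rng(0)
# four hypothetical customers holding 3, 5, 3 and 10 cards
customers = [rng.normal(size=(n, 30)) for n in (3, 5, 3, 10)]

padded = np.stack([pad_to_length(c) for c in customers])
buckets = bucket_by_length(customers)

print(padded.shape)      # (4, 10, 30)
print(buckets[3].shape)  # (2, 3, 30)
```

Padding gives one array that any batch can be drawn from, while bucketing avoids the padding rows but means each batch can only contain customers with the same number of cards.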

---

Ultimately we end up with data that consists of customers from different classes who hold different credit cards.

The next step is to make the model, starting with a generator that feeds it balanced batches:

In [8]:
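def bootstrap_sample_generator(batch_size):
    """Yield balanced batches forever: half from class 1 (label 0), half from class 2 (label 1)."""
    while True:
        # sample row indices with replacement; the same indices are reused
        # for both classes, which is fine since the two arrays are independent
        batch_idx = np.random.choice(
            class1_points.shape[0], batch_size // 2)
        batch_x = np.concatenate([
            class1_points[batch_idx],
            class2_points[batch_idx],
        ])
        batch_y = np.concatenate([
            np.zeros(batch_size // 2),
            np.ones(batch_size // 2),
        ])
        # keys match the names of the model's input and output layers
        yield ({'numeric_inputs': batch_x}, 
               {'output': batch_y})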

In [9]:
import tensorflow as tf

p = .1  # dropout rate used throughout the model

Notice that we are back to just having one input.

In [10]:
inputs = tf.keras.layers.Input((10, 30), name='numeric_inputs')

This is where the big difference lies. We want to operate on a variable number of inputs: sometimes there are 4 cards and sometimes 10. What's more, there is no order to these inputs.

It would be nice if we could process each card separately and then combine the information about all the cards together.

And we can do that with two layers:

1. Conv1D: we use a convolution layer to apply the same operation to each feature set, thus processing each card separately
2. GlobalMaxPool1D / GlobalAveragePooling1D: we use these layers to combine the information from all the cards into a single vector
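The two steps above can be sketched in plain NumPy to see why the combination does not care about card order. This is only an illustration under my own assumptions: the weights are random stand-ins rather than trained values, and `per_card_features` and `pool` are hypothetical helpers that mimic a kernel-size-1 convolution with ReLU followed by average and max pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a kernel-size-1 convolution: 30 features in, 10 filters out.
weights = rng.normal(size=(30, 10))
bias = rng.normal(size=10)
print(weights.size + bias.size)  # 310, the count model.summary() reports for conv1d

def per_card_features(cards):
    """Apply the same linear map and ReLU to every card (row) independently."""
    return np.maximum(cards @ weights + bias, 0.0)

def pool(features):
    """Concatenate the average and the max over the card axis."""
    return np.concatenate([features.mean(axis=0), features.max(axis=0)])

cards = rng.normal(size=(10, 30))      # one customer with 10 cards
shuffled = cards[rng.permutation(10)]  # same cards in a different order

summary = pool(per_card_features(cards))
summary_shuffled = pool(per_card_features(shuffled))

print(summary.shape)                           # (20,)
print(np.allclose(summary, summary_shuffled))  # True
```

Because every card goes through the same weights and both pooling operations ignore row order, shuffling the cards leaves the pooled summary unchanged.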

In [11]:
x = tf.keras.layers.Dropout(p)(inputs)
# notice I use a kernel size of 1
# this is because there is no information given by adjacency
x = tf.keras.layers.Conv1D(10, 1)(x)
x = tf.keras.layers.Activation('relu')(x)

global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)
global_max = tf.keras.layers.GlobalMaxPool1D()(x)
x = tf.keras.layers.Concatenate()([global_ave, global_max])

x = tf.keras.layers.BatchNormalization()(x)

Notice that we still use batch norm and dropout like before. This time, though, the main work is done in the convolution and pooling layers.

---

The next step is a bit of a bonus, but I think it is a cool addition. The one problem with the above is that we consider each card separately, so no card is processed with any knowledge of the others. One technique that has been highly effective is adding global information to the original inputs.

The way I think about this is: let's first consider all the credit cards separately and combine that information, then let's re-examine them all in light of that information.

We do this by adding that global information back onto the original inputs and then repeating the same operations we did above:
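In terms of shapes for a single customer, the idea looks roughly like this. It is a NumPy sketch with random stand-in values: `np.tile` plays the role of repeating the summary once per card slot, and the 20-dimensional summary corresponds to the 10 average-pooled plus 10 max-pooled values from the first block.

```python
import numpy as np

rng = np.random.default_rng(0)

cards = rng.normal(size=(10, 30))     # one customer: 10 card slots, 30 features
global_summary = rng.normal(size=20)  # stand-in for the pooled (average + max) vector

# copy the summary once per card slot, giving shape (10, 20)
repeated = np.tile(global_summary, (10, 1))

# attach the summary to every card's own features, giving shape (10, 50)
augmented = np.concatenate([cards, repeated], axis=1)

print(augmented.shape)  # (10, 50)
```

Every card row now carries the same global summary next to its own 30 features, so the second kernel-size-1 convolution can look at each card in the context of all the others.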

In [12]:
# bonus
x = tf.keras.layers.RepeatVector(10)(x)
x = tf.keras.layers.Concatenate()([inputs, x])

x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Conv1D(10, 1)(x)
x = tf.keras.layers.Activation('relu')(x)

global_ave = tf.keras.layers.GlobalAveragePooling1D()(x)
global_max = tf.keras.layers.GlobalMaxPool1D()(x)
x = tf.keras.layers.Concatenate()([global_ave, global_max])

x = tf.keras.layers.BatchNormalization()(x)

Now that we have gathered all this information about the credit cards, we will feed it through the same old network we had before:

In [13]:
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(20, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)

x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(p)(x)
out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)

In [14]:
model = tf.keras.models.Model(inputs=inputs, outputs=out)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [15]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
numeric_inputs (InputLayer)     [(None, 10, 30)]     0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 10, 30)       0           numeric_inputs[0][0]             
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 10, 10)       310         dropout[0][0]                    
__________________________________________________________________________________________________
activation (Activation)         (None, 10, 10)       0           conv1d[0][0]                     
______________________________________________________________________________________________

In [16]:
batch_size = 32

# fit_generator is deprecated in newer TensorFlow releases;
# Model.fit accepts a Python generator directly
model.fit(
    bootstrap_sample_generator(batch_size),
    steps_per_epoch=10_000 // batch_size,
    epochs=5,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x13726a4a8>

Our next lesson will be pretty similar to this one, but we will be working with ordered data instead.