# Dynamic Bootstrapping

This notebook outlines some of my initial thoughts on using bootstrap samples of decreasing balance to learn models on highly imbalanced datasets.  The idea is to start training a model with bootstrap samples of balanced classes and gradually decay the balance to the true distribution as learning takes place much like the learning rate.  The hope is that this will allow the classifier to identify patterns related to the minority class and then gradually come to recognize the true distribution.  Hopefully it becomes less biased towards the minority as time goes on and the bootstrap samples approach the true distribution.

### Data

All data was taken from the https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data challenge.  Just unzip the Train.zip file into the directory with this notebook in order to run the following code.

In [1]:
import keras
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM
from keras.optimizers import SGD, Adam
import warnings
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, precision_recall_curve,average_precision_score
warnings.filterwarnings('ignore')

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
  return f(*args, **kwds)


In [2]:
# Load in the data
data = pd.read_csv('Train.csv', nrows=250000)
data.head()

Unnamed: 0,Id,Title,Body,Tags
0,1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...
1,2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox
2,3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning
3,4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding
4,5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents


In [3]:
# Parameters
max_length = 350  # length of input sequences to the model
n_top_tags = 1  # n most prevelant tags to try to predict
vocab_size = 2000  # How many distinct tokens to take
char_model = False  # type of model to train (character or word)
batch_size = 128
num_epochs = 20

In [4]:
# Convert the tags and texts to lists for the keras tokenizer
tag_list = data['Tags'].tolist()
text_list = data['Body'].tolist()

print(tag_list[:25])
print("="*115)
print(text_list[:2])

['php image-processing file-upload upload mime-types', 'firefox', 'r matlab machine-learning', 'c# url encoding', 'php api file-get-contents', 'proxy active-directory jmeter', 'core-plot', 'c# asp.net windows-phone-7', '.net javascript code-generation', 'sql variables parameters procedure calls', '.net obfuscation reflector', 'algorithm language-agnostic random', 'postfix migration mdaemon', 'documentation latex3 expl3', 'windows-7', 'php url-routing conventions', 'r temporary-files', 'wpf binding', 'javascript code-generation playframework minify', 'php xml hash multidimensional-array simplexml-load-string', 'medical-science cancer healthcare', 'c# .net linq', 'actionscript-3 flex flex3', 'iis', 'c# linq string enumeration']
["<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter w

In [5]:
# Tokenize the labels
tag = 'c#'
tag_matrix = np.array([[(tag in x.split(' ')) + 0.0] for x in tag_list])

In [6]:
# tag_tokenizer = Tokenizer(num_words=n_top_tags + 1)
# tag_tokenizer.fit_on_texts(tag_list)
# tag_matrix = tag_tokenizer.texts_to_matrix(tag_list)[:, 1:]

In [7]:
print(tag_matrix.shape)
print(tag_matrix)
# print(list(tag_tokenizer.word_index.keys())[:11])

(250000, 1)
[[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]


# Sequence Model

In [8]:
text_tokenizer = Tokenizer(num_words=vocab_size, char_level=char_model)
text_tokenizer.fit_on_texts(text_list)
text_matrix = text_tokenizer.texts_to_sequences(text_list)

# We have a numeric representation of the words in the questions
print(vocab_size)
print(text_matrix[:2])
print(list(text_tokenizer.word_index.keys())[:25])
print(list(text_tokenizer.word_index.keys())[vocab_size-25:vocab_size])

2000
[[1, 381, 51, 4, 335, 23, 34, 1868, 53, 9, 34, 142, 53, 92, 274, 334, 548, 1658, 44, 261, 53, 2, 99, 9, 16, 52, 45, 4, 795, 2, 166, 61, 607, 2, 79, 10, 798, 6, 44, 153, 28, 2, 79, 100, 1417, 61, 53, 79, 81, 795, 1, 1, 9, 56, 6, 86, 4, 335, 23, 2, 1868, 53, 9, 34, 142, 32, 1278, 2, 53, 1024, 45, 102, 1], [1, 12, 21, 1171, 3, 67, 536, 4, 583, 6, 844, 307, 136, 15, 903, 1694, 893, 4, 84, 16, 954, 9, 2, 746, 338, 22, 178, 240, 3, 311, 512, 43, 1028, 9, 2, 746, 338, 10, 1384, 536, 61, 954, 13, 9, 27, 59, 3, 62, 9, 56, 6, 86, 4, 774, 536, 32, 954, 1, 1, 1]]
['p', 'the', 'i', 'to', 'code', 'a', 'gt', 'lt', 'is', 'and', 'pre', 'in', 'this', 'of', 'it', 'that', 'for', '0', '1', 'have', 'my', 'on', 'if', 'with', 'but']
['directories', 'fragment', 'argv', 'nginx', 'uiimage', 'updating', 'identifier', 'v1', 'uk', 'relationship', 'annotation', 'eg', 'embedded', 'htaccess', "name'", 'conditions', 'putting', '0000', 'mysite', 'inf', 'fairly', 'went', 'emails', "'s", 'escape']


In [9]:
# Padd all sequences to the same size
X = sequence.pad_sequences(text_matrix, maxlen=max_length, padding='pre', truncating='post')

y = tag_matrix

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

In [10]:
# Try a sequence model instead
seq_model = Sequential()
seq_model.add(Embedding(vocab_size, 100, input_shape=(max_length, )))
seq_model.add(Dropout(.2))
seq_model.add(LSTM(64))
seq_model.add(Dropout(.2))
seq_model.add(Dense(n_top_tags, activation='sigmoid'))
seq_model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [11]:
seq_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 350, 100)          200000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 350, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 242,305
Trainable params: 242,305
Non-trainable params: 0
_________________________________________________________________


In [None]:
normal_scores = []

In [None]:
for e in range(0, 10):
    seq_model.fit(x_train, y_train, epochs=1)
    predictions = seq_model.predict(x_val)
    f1_scores = f1_score(y_val, predictions  > 0.5)
    precision_scores = precision_score(y_val, predictions  > 0.5)
    recall_scores = recall_score(y_val, predictions  > 0.5)
    auc_scores = roc_auc_score(y_val, predictions)
    avg_precision_score = average_precision_score(y_val, predictions)
    normal_scores.append([e, f1_scores, precision_scores, recall_scores, auc_scores, avg_precision_score])
    print(f1_scores, precision_scores, recall_scores, auc_scores, avg_precision_score)

Epoch 1/1

In [None]:
predictions = seq_model.predict(x_val)
f1_scores = f1_score(y_val, predictions  > 0.5)
precision_scores = precision_score(y_val, predictions  > 0.5)
recall_scores = recall_score(y_val, predictions  > 0.5)
auc_scores = roc_auc_score(y_val, predictions)
average_precision_score(y_val, predictions)
print(f1_scores, precision_scores, recall_scores, auc_scores, )

In [None]:
average_precision_score(y_val, predictions)

In [None]:
normal_scores.append([e, f1_scores, precision_scores, recall_scores, auc_scores])

# Use Dynamic Bootstrapping 

Previously we just trained a model out of the box on the data.  Pefromance is fairly poor and most of the classes are just set to 0.  Here we use dynamic bootstrapping to initialize our model parameters and then train for a few epochs on the true dataset.

In [None]:
# Try a sequence model instead
seq_model_boot = Sequential()
seq_model_boot.add(Embedding(vocab_size, 100, input_shape=(max_length, )))
seq_model_boot.add(Dropout(.2))
seq_model_boot.add(LSTM(64))
seq_model_boot.add(Dropout(.2))
seq_model_boot.add(Dense(n_top_tags, activation='sigmoid'))
seq_model_boot.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Find the true ratio of the training set.  This is the lower bound on how imblanaced the classes
# are allowed to be
true_ratio = np.sum(y_train) / y_train.shape[0]

In [None]:
# So 313 bootstrap samples per epoch
# Each bootstrap sample has 512 examples in it
# We will use a batch size of 128

# true_ratio = imbalance_ratio * x ^ (steps_per_epoch  * num_epochs)
# true_ratio / imbalance_ratio = x ^ (steps_per_epoch  * num_epochs)
# log(true_ratio) - log(imbalance_ratio) = steps_per_epoch  *num_epochs log(x)
# log(true_ratio) - log(imbalance_ratio) / (steps_per_epoch * num_epochs) = log(x)
# 10 ^ (log(true_ratio) - log(imbalance_ratio) / (steps_per_epoch * num_epochs)) = x

def calculate_ratio_constant(num_data, batch_size, num_batches, num_epochs, true_ratio, imbalance_ratio):
    """Calculates the decay constant for the imbalance ratio"""
    steps_per_epoch = num_data / (batch_size * num_batches)
    return np.exp((np.log(true_ratio) - np.log(imbalance_ratio)) / (steps_per_epoch * num_epochs))

In [None]:
batch_size = 128
num_epochs = 7
num_batches_in_bootstrap_sample = 10
imbalance_ratio = .5
num_times_to_train_on_full_dataset = 3
decay_rate = calculate_ratio_constant(y.shape[0], batch_size, num_batches_in_bootstrap_sample, num_epochs, true_ratio, imbalance_ratio)
print(decay_rate)

In [None]:
# Create a pandas dataframe of the x_train data to bootstrap sample
x_train_df = pd.DataFrame(x_train)
x_train_df['label'] = pd.Series(y_train.flatten())

In [None]:
# Create the positive and negative examples
x_train_pos_df = x_train_df[x_train_df['label'] == 1.0]
x_train_neg_df = x_train_df[x_train_df['label'] == 0.0]

In [None]:
# Bootstrap sample from the training set

# start balanced
sample_ration = .5

validation_performance = []

# Perform 4 epochs of training using bootstrap samples where the imbalance rate gradually approaches the true
# distribution
for i in range(0, num_epochs):
    for j in range(0, int(y_train.shape[0] / (num_batches_in_bootstrap_sample * batch_size))):
        pos_samples = (x_train_pos_df
                       .sample(int(num_batches_in_bootstrap_sample * batch_size * imbalance_ratio), replace=True)
                       .drop(columns='label')
                       .as_matrix())
        neg_samples = (x_train_neg_df
                       .sample(int(num_batches_in_bootstrap_sample * batch_size * imbalance_ratio), replace=True)
                       .drop(columns='label')
                       .as_matrix())
        x_train_sampled = np.concatenate([pos_samples, neg_samples])
        y_train_pos = np.ones([pos_samples.shape[0], ])
        y_train_neg = np.zeros([neg_samples.shape[0], ])
        y_train_sampled = np.concatenate([y_train_pos, y_train_neg])
        seq_model_boot.fit(x_train_sampled, y_train_sampled, epochs=1, batch_size=batch_size, shuffle=True)
        
        # Update sample ratio
        imbalance_ratio = imbalance_ratio * decay_rate
    
    # Every epoch print and save the validation performance
    predictions = seq_model_boot.predict(x_val)
    f1_scores = f1_score(y_val, predictions  > 0.5)
    precision_scores = precision_score(y_val, predictions  > 0.5)
    recall_scores = recall_score(y_val, predictions  > 0.5)
    auc_scores = roc_auc_score(y_val, predictions)
    avg_precision_score = average_precision_score(y_val, predictions)
    validation_performance.append([f1_scores, precision_scores, recall_scores, auc_scores, avg_precision_score])
    print(f1_scores, precision_scores, recall_scores, auc_scores, avg_precision_score)

seq_model_boot.fit(x_train, y_train, epochs=num_times_to_train_on_full_dataset, shuffle=True)

In [None]:
seq_model_boot.fit(x_train, y_train, epochs=num_times_to_train_on_full_dataset, shuffle=True)

In [None]:
predictions = seq_model_boot.predict(x_val)
f1_scores = f1_score(y_val, predictions  > 0.5)
precision_scores = precision_score(y_val, predictions  > 0.5)
recall_scores = recall_score(y_val, predictions  > 0.5)
auc_scores = roc_auc_score(y_val, predictions)
print(f1_scores, precision_scores, recall_scores, auc_scores)

# Notes:

```
After 5 epochs trained on full data set we get:
0.5976387002403092
0.7844212835984641
0.48270042194092827

After 7
0.6183401154537712 0.7037914691943128 0.5513924050632911

After dynamic bootstrapping we get
0.6410598233627728
0.6330426197136745
0.6492827004219409
```

It appears that dynamic bootstrapping gives, better performance.  It also appears that it converges more quickly to the optimum value.

# Cost Sensitive Learning

Use class weighting and compare

# Over Sampleing
Here we just oversample the minority class to equal the majority

# Under Sampling

Here we understample the majority class to equal the minority class

# SMOTE

Use smote to create ideal dataset then run training for 10 epochs