## Active Learning with Regularization and Data Augmentation

Our unsupervised analysis shows that a few classes are hard to distinguish in the feature space. To help the model learn a clean decision boundary between these fuzzier observations, we apply two techniques:

1. Active learning, to query for the examples that the model has a hard time classifying.
2. Data augmentation, to create new samples from the points that the model has a hard time classifying.

With active learning, we hope to be more thoughtful about which labels we choose to extract from the database since retrieving labels can be an expensive task. Moreover, with data augmentation, we hope to make our model more robust in classifying fuzzier data points without having to query the database for additional labels. Though augmented samples increase our model's training time (which likely has its own costs), they don't require us to pay any costs associated with generating labels. We are willing to make this tradeoff.

To leverage these methods in our approach, we build the following modeling pipeline:

1. Draw an initial random sample of labeled data. We collect an equal number of instances per class so the model can identify which classes are separable and which aren't. The remaining examples go into a sampling pool that our CNN can query.
2. Create new samples with the data generator based on the initial random sample.
3. Train the CNN on the initial sample and its generated examples.
4. Query the pool for the observations whose labels the model is most uncertain about.
5. Generate new data samples from the queried data.
6. Train the CNN on the queried and newly generated samples. Remove the queried samples from the sampling pool.
7. Repeat steps 4-6.

We hope that our setup will allow us to build a classifier that hits our accuracy benchmark using as few labels as possible.
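To make step 4 concrete, here is a minimal NumPy sketch of margin-based querying, the strategy this notebook selects through modAL's `margin_sampling`. It is an illustrative re-implementation and not modAL's own code; the function name `margin_query` and the toy probabilities are assumptions made up for the example.

```python
import numpy as np

def margin_query(probs, n_instances):
    """Return indices of the rows with the smallest top-two probability margin.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    """
    # Sort each row so the two largest probabilities sit in the last two columns.
    sorted_probs = np.sort(probs, axis=1)
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]
    # A small margin means the model can barely choose between its top two classes.
    return np.argsort(margins, kind="stable")[:n_instances]

# Toy predictions for four pool samples over three classes (hypothetical values).
probs = np.array([
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # ambiguous
    [0.50, 0.49, 0.01],  # most ambiguous
    [0.70, 0.20, 0.10],
])
picked = margin_query(probs, n_instances=2)
print(picked)  # [2 1]
```

The samples with the narrowest gap between the top two classes are the ones we pay to label first.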

In [1]:
import keras
from keras.models import Sequential
from keras.layers import (
    Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization, Concatenate
)
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam
from keras.layers.experimental.preprocessing import (
    RandomRotation, RandomFlip, RandomTranslation, RandomZoom
)
from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling, uncertainty_sampling, margin_sampling
from data.get_data import load_mnist
import numpy as np

In [2]:
# here y_train serves as our mock database to collect labels
X_train, y_train = load_mnist('data/f_mnist_data', kind='train')
X_test, y_test = load_mnist('data/f_mnist_data', kind='t10k')

# minmax scaling
X_train, X_test = X_train/255.0, X_test/255.0

In [3]:
X_train = X_train.reshape(60000, 28, 28, 1)
X_test = X_test.reshape(10000, 28, 28, 1)

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# assemble initial data sample for active learner
initial_idx = np.array([],dtype=int)
n_instances_per_class = 200
for i in range(10):
    idx = np.random.choice(np.where(y_train[:,i]==1)[0], size=n_instances_per_class, replace=False)
    initial_idx = np.concatenate((initial_idx, idx))

The next cell implements a simple data generator to create new training examples. To synthesize new examples, we apply some fairly standard image transformations to our input.  

In [4]:
augmenter = Sequential([
    RandomFlip("horizontal_and_vertical"),
    RandomRotation(0.15),
    RandomTranslation(height_factor=(-.15, .15), width_factor=(-.15, .15)),
    RandomZoom(height_factor=(-.15, .15))
])

def generate_new_samples(X, y, augmenter=augmenter, num_samples=1):
    """Return `num_samples` augmented copies of X together with the matching labels."""
    # training=True makes sure the random transformations are applied outside of fit()
    X_aug = [augmenter(X, training=True).numpy() for _ in range(num_samples)]
    # augmentation preserves the class, so the labels are simply repeated
    y_aug = [y for _ in range(num_samples)]
    return np.concatenate(X_aug, axis=0), np.concatenate(y_aug, axis=0)



In [5]:
# forming the initial training data
X_initial = X_train[initial_idx]
y_initial = y_train[initial_idx]

# generate augmented samples
X_gen, y_gen = generate_new_samples(X_initial, y_initial, num_samples=1)

# generate the sampling pool
# remove the initial data from the training dataset
X_pool = np.delete(X_train, initial_idx, axis=0)
y_pool = np.delete(y_train, initial_idx, axis=0)

In [6]:
def create_keras_model():

    # model configs and hyperparams
    dim = (28,28,1)
    dropout_rate = .25
    
    model = Sequential([
        Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same', 
               input_shape=dim),
        BatchNormalization(),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(dropout_rate),

        Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same'),
        BatchNormalization(),     
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(dropout_rate),
        
        Conv2D(256, kernel_size=(3, 3), activation='relu', padding='same'),
        BatchNormalization(),  
        MaxPooling2D(pool_size=(2, 2)),   
        Dropout(dropout_rate),
        
        Flatten(),
        
        Dense(1024, activation='relu'),
        Dropout(dropout_rate),
        
        Dense(512, activation='relu'),
        Dropout(dropout_rate),
        
        Dense(10, activation='softmax')
    ])

    adam = Adam(learning_rate=0.001, decay=1e-6)
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

    return model

classifier = KerasClassifier(create_keras_model)

In [7]:
# training params
epochs = 40

# initialize ActiveLearner
learner = ActiveLearner(
    estimator=classifier,
    X_training=np.concatenate([X_initial, X_gen], axis=0), 
    y_training=np.concatenate([y_initial, y_gen], axis=0),
    query_strategy=margin_sampling,
    verbose=1,
    epochs=epochs,
)

# baseline score
print(f"baseline score: {learner.score(X_test, y_test)}")



Epoch 1/40
...
Epoch 40/40
baseline score: 0.849399983882904


In [8]:
n_queries = 5
epochs=40
accuracy_arr = list()

# conduct several rounds of data querying and use the newly queried samples to improve model performance
for idx in range(n_queries):
    print(f'Query no. {(idx + 1)}')
    query_idx, query_instance = learner.query(X_pool, n_instances=400, verbose=0)
    
    # generate new samples
    X_gen, y_gen = generate_new_samples(X_pool[query_idx], y_pool[query_idx], num_samples=1)
    
    # train active learner on queried data and newly generated samples
    learner.teach(
        X=np.concatenate([X_pool[query_idx], X_gen], axis=0),
        y=np.concatenate([y_pool[query_idx], y_gen], axis=0), 
        verbose=1,
        epochs=epochs,
    )
    
    # remove queried instances from the sampling pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    
    # evaluate performance
    model_accuracy = learner.score(X_test, y_test, verbose=0)
    accuracy_arr.append(model_accuracy)
    print(f"model accuracy: {model_accuracy}")

Query no. 1




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8604999780654907
Query no. 2




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8623999953269958
Query no. 3




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8683000206947327
Query no. 4




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8737999796867371
Query no. 5




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8809000253677368


In [9]:
len(learner.X_training)

8000

So far, we've shown our model 8,000 total samples (half of which are synthetic and have zero labeling cost), and its test accuracy is ~88.1%. We are already close to our baseline results, and we've used 60% fewer labels!
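The sample counts above follow directly from the settings used in the notebook: 200 instances for each of the 10 classes, 400 instances per query, one augmented copy per real sample, and 60,000 training images. A small sketch of that arithmetic, where the helper name `label_budget` is an assumption introduced only for illustration:

```python
def label_budget(n_queries, n_initial=200 * 10, n_per_query=400, n_train=60000):
    """Return (real samples, total samples, fraction of train labels used)."""
    real = n_initial + n_queries * n_per_query
    # One augmented copy is generated for every real sample.
    total = 2 * real
    return real, total, real / n_train

real, total, frac = label_budget(n_queries=5)
print(real, total, f"{frac:.2%}")  # 4000 8000 6.67%
```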

Can we improve our model's performance by querying more samples?

In [10]:
# run three more query rounds
for idx in range(n_queries, n_queries + 3):
    print(f'Query no. {(idx + 1)}')
    query_idx, query_instance = learner.query(X_pool, n_instances=400, verbose=0)
    X_gen, y_gen = generate_new_samples(X_pool[query_idx], y_pool[query_idx], num_samples=1)
    learner.teach(
        X=np.concatenate([X_pool[query_idx], X_gen], axis=0),
        y=np.concatenate([y_pool[query_idx], y_gen], axis=0), 
        verbose=1,
        epochs=epochs,
    )
    
    # remove queried instance from pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    
    # evaluate performance
    model_accuracy = learner.score(X_test, y_test, verbose=0)
    accuracy_arr.append(model_accuracy)
    print(f"model accuracy: {model_accuracy}")

Query no. 6




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8828999996185303
Query no. 7




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8960999846458435
Query no. 8




Epoch 1/40
...
Epoch 40/40
model accuracy: 0.8626000285148621


At this point, we've shown our model 5,200 real samples, or 8.667% of the train dataset, and we've surpassed the performance of our baseline!

The test accuracy has also increased on almost every round of querying (the notable exception being the last round, where it dropped from about 89.6% to about 86.3%). So, for the most part, our queries are helping the model classify unseen examples!
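As a quick check of that claim, the sketch below recomputes the round-over-round change from the accuracies printed above, rounded to four decimals. The variable names are assumptions for illustration; in the notebook itself the same values live in `accuracy_arr`.

```python
# Test accuracy after the initial fit and after each of the eight query rounds,
# copied from the cell outputs above and rounded to four decimals.
initial_score = 0.8494
accuracy_history = [0.8605, 0.8624, 0.8683, 0.8738, 0.8809, 0.8829, 0.8961, 0.8626]

previous = [initial_score] + accuracy_history[:-1]
deltas = [curr - prev for prev, curr in zip(previous, accuracy_history)]

# Rounds (1-indexed) where accuracy fell relative to the previous round.
drops = [i + 1 for i, delta in enumerate(deltas) if delta < 0]
print(drops)  # [8]
```

Only the eighth round loses accuracy; every earlier round improves on the one before it.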

In [11]:
# one more iteration *fingers crossed*
query_idx, query_instance = learner.query(X_pool, n_instances=400, verbose=0)
X_gen, y_gen = generate_new_samples(X_pool[query_idx], y_pool[query_idx], num_samples=1)
learner.teach(
    X=np.concatenate([X_pool[query_idx], X_gen], axis=0),
    y=np.concatenate([y_pool[query_idx], y_gen], axis=0),
    verbose=1,
    epochs=epochs,
)

# remove queried instance from pool
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)

# evaluate performance
model_accuracy = learner.score(X_test, y_test, verbose=0)
accuracy_arr.append(model_accuracy)
print(f"model accuracy: {model_accuracy}")



Epoch 1/40
...
Epoch 40/40
model accuracy: 0.9006999731063843


We hit 90% accuracy! Let's see how much data we used in total.

In [12]:
print(f"Number of total samples trained on Active Learner: {len(learner.X_training)}")

Number of total samples trained on Active Learner: 11200


In [13]:
print(f"Max Accuracy on Test Set: {max(accuracy_arr)}")

Max Accuracy on Test Set: 0.9006999731063843


In [15]:
max_idx = accuracy_arr.index(max(accuracy_arr))
num_samples_max_accuracy = n_instances_per_class * 10 + (max_idx + 1) * 400

print(f"Number of total samples trained at point of Maximum Accuracy: {num_samples_max_accuracy * 2}")
print(f"Number of augmented samples trained at point of Maximum Accuracy: {num_samples_max_accuracy}")
print(f"Number of real samples trained at point of Maximum Accuracy: {num_samples_max_accuracy}")
print(f"Percentage of train data used at point of Maximum Accuracy: {num_samples_max_accuracy / X_train.shape[0]:.2%}")

Number of total samples trained at point of Maximum Accuracy: 11200
Number of augmented samples trained at point of Maximum Accuracy: 5600
Number of real samples trained at point of Maximum Accuracy: 5600
Percentage of train data used at point of Maximum Accuracy: 9.33%


Using active learning and data augmentation, we were able to hit our target accuracy (barely, but still) using only 5,600 labeled observations from the labels database: 2,000 from the initial random sample and 3,600 selected by the active learner. In the end, we used only 9.33% of our train data, which is around half the data we used in our baseline experiment. With some finer-grained model tuning, I think we could hit our accuracy target with even fewer observations, but given the time constraints of this problem, that is outside the scope of this analysis.