In [1]:
import os
import pickle
import json
from itertools import cycle, product
import pandas as pd

import numpy as np
np.random.seed(123)

from tensorflow import set_random_seed
set_random_seed(123)

from keras.models import Sequential, save_model, load_model
from keras.layers import *
from keras import backend as K
from sklearn.metrics import f1_score

from keras.callbacks import ReduceLROnPlateau, EarlyStopping, Callback

import matplotlib.pyplot as plt
from seaborn import color_palette 
plt.style.use('seaborn')

Using TensorFlow backend.


# Summary
I fit Keras' embedding layer for a few epochs in about the simplest setup imaginable: global averaging with a softmax on top. This examination is meant to be fairly inexpensive but broad. It serves to produce reasonable starting embeddings for more complicated net architectures and to compare various loss functions, batch sizes, and optimizers. The choice of loss function is somewhat non-trivial because of the class imbalance in our dataset and the stated objective of maximizing the macro-F1 score.

* Results land in the directory defined in the *working_dir* variable below.
* Weights are stored as *blabla_weights.p* and may be unpickled directly into the *weights* argument of the Embedding layer (see the *Embedding* within *init* of the *BlackBox* class).
* There are some nice plots below.

We manage to establish an interesting benchmark for our dataset: the simple setup described above reproduces the performance of the shallow classifiers (*macroF1*=80% on test-data). Thus neural nets are at least no worse than SVMs and LogisticRegression. For other net architectures to carry their weight, they'll have to go beyond that.
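To make the setup concrete, here is a minimal framework-free sketch of the forward pass: embedding lookup, global averaging over the sequence, and a softmax layer. It is not the *BlackBox* implementation. The function `forward`, the toy sizes, and the random weights are all made up for illustration; dropout is left out because it is inactive at inference, and padding ids are averaged like any other token.

```python
import math
import random


def forward(token_ids, embedding, dense_w, dense_b):
    """Embedding lookup -> global average over positions -> softmax."""
    dim = len(embedding[0])
    # One vector per token id (the job of an embedding layer).
    vectors = [embedding[t] for t in token_ids]
    # Global average pooling: mean over the sequence axis.
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Dense layer: one logit per class.
    logits = [
        sum(pooled[d] * dense_w[d][c] for d in range(dim)) + dense_b[c]
        for c in range(len(dense_b))
    ]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(123)
vocab, dim, n_cls = 20, 4, 6  # toy sizes, far smaller than the real vocabulary
embedding = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab)]
dense_w = [[random.gauss(0, 0.1) for _ in range(n_cls)] for _ in range(dim)]
dense_b = [0.0] * n_cls

probs = forward([3, 7, 7, 0, 0], embedding, dense_w, dense_b)
print(probs)  # six class probabilities that sum to 1
```

Because the pooled vector is just a mean of word vectors, word order plays no role in this architecture, which is part of why it behaves much like a shallow bag-of-words classifier.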

### Choose the directory where to store the results

In [2]:
working_dir = 'keras_GlobalAvg_GridSearch_results'
os.makedirs(working_dir, exist_ok=True)  # no error if the directory already exists


### Load in the train- and validation-data
The test data is not used until the final evaluation at the end.

In [3]:
# n_sample=30_000

In [4]:
X_train = np.load(os.path.join("data", "Kdata", "X_train.npy")) #[-n_sample:]
y_train = np.load(os.path.join("data", "Kdata", "y_train.npy")) #[-n_sample:]
X_val = np.load(os.path.join("data", "Kdata", "X_val.npy")) #[-n_sample:]
y_val = np.load(os.path.join("data", "Kdata", "y_val.npy")) #[-n_sample:]

### Load in global parameters describing the data prepared in *keras_preprocessing.ipynb*
* dimensions needed for the word embedding
* number of classes
* class-weights

In [5]:
# use a context manager so that the file handle is closed after loading
with open("global_params.p", "rb") as f:
    global_params = pickle.load(f)
unique_words = global_params['unique_words']
num_words = global_params['num_words']
padded_length = global_params['padded_length']
n_classes = global_params['n_classes']
class_weights = global_params['class_weights']

global_params

{'unique_words': 277303,
 'num_words': 277304,
 'padded_length': 679,
 'n_classes': 6,
 'class_weights': array([ 1.26825655,  0.72736371,  0.27602776, 13.23801959, 30.29201502,
         9.49559404])}
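The class weights are computed in *keras_preprocessing.ipynb*, which is not shown here. Their reciprocals sum to almost exactly `n_classes` = 6, which is what the common "balanced" heuristic `n_samples / (n_classes * count_c)` produces, so the sketch below assumes that formula. The helper `balanced_class_weights` and the list `toy_labels` are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels, n_classes):
    """Weight of class c = n_samples / (n_classes * count_c).

    Assumes integer labels 0..n_classes-1 and that every class occurs
    at least once (otherwise the division fails).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (n_classes * counts[c]) for c in range(n_classes)]


# Made-up, imbalanced toy labels: 50, 30, 15 and 5 samples per class.
toy_labels = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
weights = balanced_class_weights(toy_labels, n_classes=4)
print(weights)

# Under this heuristic sum(1 / w_c) equals n_classes. The weights printed
# above pass that check with n_classes = 6.
printed = [1.26825655, 0.72736371, 0.27602776,
           13.23801959, 30.29201502, 9.49559404]
reciprocal_sum = sum(1 / w for w in printed)
print(reciprocal_sum)
```

Frequent classes get weights below 1 and rare ones get weights well above 1, so each class contributes roughly equally to a weighted loss.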

### Load in custom loss functions and metrics
* cat.-accuracy
* macro-precision, macro-f1, macro-recall
* cat.-crossentropy
* a custom loss function, my_loss
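The actual definitions live in *keras_custom_functions.ipynb* and are not reproduced here. As a rough illustration of how a macro-F1 can be turned into something differentiable, the sketch below computes a "soft" macro-F1 from predicted probabilities instead of hard labels and flips it into a loss. This is only one plausible reading of a name like `fuzzy_macroF1_flip`; the functions `soft_macro_f1` and `soft_macro_f1_loss` are hypothetical and are not claimed to match `my_loss`.

```python
def soft_macro_f1(y_true, y_pred, eps=1e-7):
    """Macro-F1 with soft counts.

    y_true: list of one-hot rows, y_pred: list of probability rows.
    """
    n_classes = len(y_true[0])
    f1_per_class = []
    for c in range(n_classes):
        # Soft true positives, false positives and false negatives.
        tp = sum(t[c] * p[c] for t, p in zip(y_true, y_pred))
        fp = sum((1 - t[c]) * p[c] for t, p in zip(y_true, y_pred))
        fn = sum(t[c] * (1 - p[c]) for t, p in zip(y_true, y_pred))
        f1_per_class.append(2 * tp / (2 * tp + fp + fn + eps))
    # Macro average: every class counts equally, however rare it is.
    return sum(f1_per_class) / n_classes


def soft_macro_f1_loss(y_true, y_pred):
    """Flip the score so that lower is better."""
    return 1.0 - soft_macro_f1(y_true, y_pred)


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]

loss_perfect = soft_macro_f1_loss(y_true, perfect)
loss_uniform = soft_macro_f1_loss(y_true, uniform)
print(loss_perfect, loss_uniform)
```

A loss of this kind targets the evaluation metric directly and treats all classes equally, which is the motivation for considering it on an imbalanced dataset.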

In [6]:
%run keras_custom_functions.ipynb

my_metrics = list(CUSTOM_OBJECTS.values())
my_metrics

[<function __main__.cat_acc(y_true, y_pred)>,
 <function __main__.macroPrec(y_true, y_pred)>,
 <function __main__.macroF1(y_true, y_pred)>,
 <function __main__.macroRecall(y_true, y_pred)>,
 <function __main__.cat_cross(y_true, y_pred)>,
 <function __main__.fuzzy_macroF1_flip(y_true, y_pred)>,
 <function __main__.my_cross(y_true, y_pred)>,
 <function __main__.my_loss(y_true, y_pred)>]

### Load in wrappers for keras' sequential-model functionality

In [7]:
%run keras_plot_history.ipynb
%run keras_blackbox_wrapper.ipynb

---

# Quasi-grid-search
* Define the sequential setup by specifying the layers and the parameter ranges to search through.
* Train for a fixed number of epochs.
* Examine how different parameter combinations influence quality of the classification (measured by the resulting model's macro-F1 on the validation set).

In [8]:
layers = [Dropout(0.5),
          GlobalAveragePooling1D()
         ]

In [9]:
# the options

losses = ['categorical_crossentropy', my_loss]
batch_sizes = [100, 200, 500]
optimizers = ['adam', 'nadam']

options = list(product(losses, batch_sizes, optimizers))
n_options = len(options)
print(f"{n_options} options in the cross-search, e.g. {options[0]}")

12 options in the cross-search, e.g. ('categorical_crossentropy', 100, 'adam')


In [10]:
# fit and evaluate on the validation data, loop through the options

epochs = 2
results = []

def run_test(k):

    loss, batch_size, optimizer = options[k-1]
    print(f"\n{k}/{n_options}")    

    model = BlackBox(tag=f"GS_{k}",\
                     layers=layers, loss=loss, batch_size=batch_size, optimizer=optimizer,\
                     epochs=epochs, metrics=None)
    model.fit(verbose=1, validate=False)
    model.evaluate(X_val, y_val)
    
    result = (model.eval_df, model.loss_name, model.batch_size, model.optimizer)
    results.append(result)
    
    #model.discard()
    del model
    #%reset_selective -f "^model$"
    
for k in range(1, n_options+1):
    run_test(k)


1/12
loss=categorical_crossentropy, batch_size=100, optimizer=adam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

2/12
loss=categorical_crossentropy, batch_size=100, optimizer=nadam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

3/12
loss=categorical_crossentropy, batch_size=200, optimizer=adam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

4/12
loss=categorical_crossentropy, batch_size=200, optimizer=nadam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

5/12
loss=categorical_crossentropy, batch_size=500, optimizer=adam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

6/12
loss=categorical_crossentropy, batch_size=500, optimizer=nadam, explicit-class-weights: True, embedd-trainable: True
Epoch 1/2
Epoch 2/2

7/12
loss=my_loss, batch_size=100, optimizer=adam, explicit-class-weights: False, embedd-trainable: True
Epoch 1/2
Epoch 2/2

8/12
loss=my_loss,

In [18]:
# take macroF1 from the *results* list
F1s = [[result[0].loc['macroF1'].iloc[0], *result[1:]] for result in results]
F1s_df = pd.DataFrame(F1s, columns = ['macF1 on val', 'loss', 'batch', 'optimizer'])
F1s_df.sort_values(by='macF1 on val', ascending=False, inplace=True)
ranking = F1s_df.index

F1s_df.index = F1s_df.index + 1
F1s_df.index.name = "GS_"
F1s_df

| GS_ | macF1 on val | loss | batch | optimizer |
|---|---|---|---|---|
| 10 | 0.800219 | my_loss | 200 | nadam |
| 8 | 0.799125 | my_loss | 100 | nadam |
| 12 | 0.794683 | my_loss | 500 | nadam |
| 7 | 0.794543 | my_loss | 100 | adam |
| 9 | 0.789097 | my_loss | 200 | adam |
| 2 | 0.78072 | categorical_crossentropy | 100 | nadam |
| 4 | 0.778044 | categorical_crossentropy | 200 | nadam |
| 11 | 0.771202 | my_loss | 500 | adam |
| 1 | 0.770701 | categorical_crossentropy | 100 | adam |
| 3 | 0.723731 | categorical_crossentropy | 200 | adam |


### Commentary
Remember that I let the model fit for only two epochs. The differences are subject to statistical fluctuations from Keras' shuffling of the data, plus a hard-to-gauge bias introduced by the initial embedding weights. Since our simple neural network is not necessarily representative of more complicated nets, the winner of this search is not guaranteed to be the best choice there. Nonetheless, the results tell a fairly consistent story for our data:

* **my_loss** is better than **cat-cross** at every combination of batch size and optimizer.
* **nadam** is better than **adam** for both loss functions and all batch sizes.
* For batch sizes of a few hundred, smaller batches tend to do better, but this is the weakest of the three effects.

A **batch_size** of 100 or 200 already seems small, since such a batch will often contain no samples of the least frequent classes (see *arXiv_cleanup.ipynb*).
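The remark about small batches missing the rarest classes can be checked with a little arithmetic. The sketch below rests on two assumptions: that the class weights printed earlier follow the "balanced" formula `n_samples / (n_classes * count_c)`, so that a class frequency is `1 / (n_classes * weight)`, and that the samples in a batch are drawn independently. The helper `p_batch_misses` is made up for illustration.

```python
# Class weights as printed in the global_params output.
class_weights = [1.26825655, 0.72736371, 0.27602776,
                 13.23801959, 30.29201502, 9.49559404]
n_classes = len(class_weights)

# Assumption: weight_c = n_samples / (n_classes * count_c),
# hence frequency_c = count_c / n_samples = 1 / (n_classes * weight_c).
freqs = [1 / (n_classes * w) for w in class_weights]
rarest = min(freqs)


def p_batch_misses(freq, batch_size):
    """Probability that none of `batch_size` independent draws hits the class."""
    return (1 - freq) ** batch_size


miss = {b: p_batch_misses(rarest, b) for b in (100, 200, 500)}
for b, p in miss.items():
    print(f"batch_size={b}: P(no sample of the rarest class) = {p:.2f}")
```

Under these assumptions the rarest class makes up about 0.55% of the data, so a batch of 100 misses it more often than not, a batch of 200 misses it in roughly a third of the cases, and a batch of 500 only rarely.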

In [12]:
# see the other metrics as well
eval_results = pd.concat([result[0] for result in results], axis=1)
ordered_columns = eval_results.columns.values[ranking]
eval_results[ordered_columns]

| metric | GS_10 | GS_8 | GS_12 | GS_7 | GS_9 | GS_2 | GS_4 | GS_11 | GS_1 | GS_3 | GS_6 | GS_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cat_cross | 0.302168 | 0.324697 | 0.284958 | 0.301253 | 0.290869 | 0.181275 | 0.183533 | 0.271755 | 0.186971 | 0.197483 | 0.196022 | 0.230411 |
| my_loss | 0.206441 | 0.20866 | 0.216839 | 0.21664 | 0.226614 | 0.255171 | 0.266789 | 0.25283 | 0.27757 | 0.32767 | 0.331436 | 0.444602 |
| cat_acc | 0.92867 | 0.929535 | 0.9242 | 0.926605 | 0.922745 | 0.935455 | 0.93451 | 0.918105 | 0.93355 | 0.93065 | 0.93125 | 0.92086 |
| macroPrec | 0.783735 | 0.765821 | 0.790043 | 0.770455 | 0.776557 | 0.849545 | 0.844752 | 0.793514 | 0.843908 | 0.850876 | 0.858197 | 0.81338 |
| macroF1 | 0.800219 | 0.799125 | 0.794683 | 0.794543 | 0.789097 | 0.78072 | 0.778044 | 0.771202 | 0.770701 | 0.723731 | 0.715217 | 0.50192 |
| macroRecall | 0.819289 | 0.843283 | 0.799855 | 0.823802 | 0.803427 | 0.731953 | 0.729782 | 0.752909 | 0.720559 | 0.65697 | 0.646718 | 0.479012 |


### Commentary
As we see, it pays to keep precision and recall close in order to boost the F1 score. The best-scoring model above has neither the highest precision nor the highest recall, and it does not have the lowest cross-entropy either.
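The reason is that F1 is the harmonic mean of precision and recall, which punishes imbalance between the two. A small worked example, with the caveat that macro-F1 averages the per-class F1 scores and is therefore not exactly the harmonic mean of macro-precision and macro-recall; the numbers for GS_10 and GS_3 below are the macro values from the table and only illustrate the trend.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Same arithmetic mean (0.80), different balance.
balanced = f1(0.80, 0.80)
lopsided = f1(0.90, 0.70)
print(f"balanced: {balanced:.4f}, lopsided: {lopsided:.4f}")

# Macro precision and recall from the table: GS_10 is well balanced,
# GS_3 has high precision but low recall.
gs10 = f1(0.783735, 0.819289)
gs3 = f1(0.850876, 0.65697)
print(f"GS_10: {gs10:.3f}, GS_3: {gs3:.3f}")
```

GS_3 has the higher precision of the two, yet its low recall drags the harmonic mean well below that of GS_10, which matches the ordering of the reported macro-F1 values.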


---
(Kernel restart)

---

## Longer fit of the two loss functions
* Settle on *optimizer*=nadam and *batch_size*=200 (in this range of values, the smaller the batch the longer it takes to fit the whole set, and the difference in performance between 200 and 100 already seems small enough).
* Compare the results obtained on a stretch of a few more epochs with different loss functions
* Save embedding layers to files for later use
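Saving and restoring the embedding weights comes down to a pickle round trip. The sketch below shows this with a toy matrix. It assumes that the pickled object is a list holding one `vocab x dim` matrix, as Keras' `get_weights()` would return for an Embedding layer; the actual format written by `save_embedd` is not shown in this notebook, and the file name `toy_weights.p` is made up.

```python
import os
import pickle
import tempfile

# Toy stand-in for the embedding weights: a list with one 3 x 2 matrix.
# In the real setup the matrix would be a numpy array of shape
# (num_words, embedding_dim).
weights = [[[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.p")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(weights, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == weights)

# Assumed usage with Keras 2.x (not executed here):
# Embedding(input_dim=num_words, output_dim=50,
#           weights=restored, input_length=padded_length)
```

Starting a more complicated net from such weights, possibly with the layer frozen at first, gives it word vectors that already separate the classes reasonably well.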

In [19]:
epochs=6

In [None]:
# 1

loss, batch_size, optimizer = 'categorical_crossentropy', 800, 'nadam'

model1 = BlackBox(tag="GlobalAvg",\
                  layers=layers, loss=loss, batch_size=batch_size, optimizer=optimizer,\
                  epochs=epochs, metrics=my_metrics)

In [None]:
model1.fit(verbose=1)
model1.save_embedd()
model1.Ksave()
model1.save_hist()

In [None]:
model1.Kload()
model1.load_hist()

In [None]:
model1.evaluate(X_val, y_val)
model1.plot()

In [None]:
# 2

loss, batch_size, optimizer = my_loss, 800, 'nadam'

model2 = BlackBox(tag="GlobalAvg",\
                  layers=layers, loss=loss, batch_size=batch_size, optimizer=optimizer,\
                  epochs=epochs, metrics=my_metrics)
model2.fit(verbose=1, validate=True)
model2.save_embedd()
model2.Ksave()
model2.save_hist()

In [None]:
model2.Kload()
model2.load_hist()

In [None]:
model2.evaluate(X_val, y_val)
model2.plot()

In [None]:
# 1+2

eval_together = pd.concat([model1.eval_df, model2.eval_df], axis=1)
eval_together.columns = ['loss: cat. cross.', 'loss: my_loss']
eval_together

### Commentary
* Looking at the validation scores: the *my_loss* function is better than *categorical crossentropy* at keeping precision and recall equal. It actually emphasizes recall more, so that after the 3rd epoch recall scores slightly higher than precision. It is also faster at increasing F1 in the first epochs, which seems like an advantageous property for a loss function to be used on more complicated architectures. *cat-cross*, on the other hand, quickly pushes precision beyond 80%, but recall cannot keep up and then pulls precision down in later epochs.

* Starting somewhere between the 2nd and 4th epoch we see overfitting: the scores obtained directly on the train-data are higher than those on the validation-data. Interestingly, the cross-entropy loss seems to overfit more easily.

---

## Evaluation on test-data
Recall that the final macro-F1 on test-data reached by the SVM was 80%. Here we have not yet looked at the test-data but, judging by the score on the validation-data, it seems that we have reproduced the result of a shallow classifier with a 50-dimensional embedding, global averaging, and a single softmax layer. We will still use the knowledge gained from the grid search, and the embeddings produced by the networks trained above, in further modelling. But in order to have a score characterizing the simple global-averaging architecture, we now evaluate the models fitted with both loss functions on the test-data.

In [None]:
X_test = np.load(os.path.join("data", "Kdata", "X_test.npy"))
y_test = np.load(os.path.join("data", "Kdata", "y_test.npy"))

In [None]:
# 1

loss, batch_size, optimizer = 'categorical_crossentropy', 800, 'nadam'

model1 = BlackBox(tag="GlobalAvg",\
                  layers=layers, loss=loss, batch_size=batch_size, optimizer=optimizer,\
                  epochs=epochs, metrics=my_metrics)
model1.Kload()
model1.evaluate(X_test, y_test)

In [None]:
# 2

loss, batch_size, optimizer = my_loss, 800, 'nadam'

model2 = BlackBox(tag="GlobalAvg",\
                  layers=layers, loss=loss, batch_size=batch_size, optimizer=optimizer,\
                  epochs=epochs, metrics=my_metrics)
model2.Kload()
model2.evaluate(X_test, y_test)

In [None]:
# 1+2

eval_together = pd.concat([model1.eval_df, model2.eval_df], axis=1)
eval_together.columns = ['loss: cat. cross.', 'loss: my_loss']
eval_together