In [1]:
%load_ext nb_black

In [2]:
import pandas as pd
from local_lib import utils, train
from pathlib import Path
import random
import shutil

# Background

This notebook looks at the results of a few simple model runs and how they relate to the private leaderboard. It is important to see how closely the local validation set actually matches the test set: if it matches well, I can test many more iterations locally than the competition's restrictive limit of 5 submissions a week allows.

# Method

1. Reproduce the baseline model.
1. Train the model on the cleaned (relabeled) dataset.
1. Train with nonsensical images removed.
1. Train with nonsensical images removed and upsampled to 10,000 images.

# Conclusion


# Future


---

# Setup

In [4]:
labels_pt = "../data/processed/labels_v0.1.0.csv"
source_image_dir = Path('../data/processed/labeling/images')
train_image_dir = Path('../data/processed/tmp_training')

labels = pd.read_csv(labels_pt)
labels.head()

# Baseline Training

The official baseline performance on the public leaderboard with the original data is `0.65521`.

The validation metric and the leaderboard metric are very close to each other, which suggests that the validation set is representative of the leaderboard, so we don't need to adjust the validation set.

In [None]:
if train_image_dir.exists():
    shutil.rmtree(train_image_dir)

_ = utils.copy_group_to_dir(
    df=labels, 
    source_dir=source_image_dir, 
    dest_dir=train_image_dir, 
    group= ["org_source", "org_symbol"]
   )
assert len(_) == 0

train.train_model(train_image_dir, "../data/raw/label_book", verbose=1)

```
Epoch 100/100
259/259 [==============================] - 35s 134ms/step - loss: 0.0242 - accuracy: 0.9942 - val_loss: 1.6930 - val_accuracy: 0.6458
102/102 [==============================] - 3s 25ms/step - loss: 1.3684 - accuracy: 0.6814
7/7 [==============================] - 0s 31ms/step - loss: 2.7221 - accuracy: 0.5577
final loss 1.3684338331222534, final acc 0.6814268231391907
test loss 2.722132444381714, test acc 0.557692289352417
```

# Basic data cleaning

Relabeled the whole dataset, which hopefully gives a much cleaner dataset. Only `18` out of `813` validation images were corrected, so the validation set was already relatively clean.

In [23]:
clean_val = labels.query("org_source == 'val' and org_symbol != symbol")
clean_val.shape

In [None]:
if train_image_dir.exists():
    shutil.rmtree(train_image_dir)

_ = utils.copy_group_to_dir(
    df=labels, 
    source_dir=source_image_dir, 
    dest_dir=train_image_dir, 
    group= ["org_source", "symbol"]
   )
assert len(_) == 0

train.train_model(train_image_dir, "../data/raw/label_book", verbose=1)

```
Epoch 100/100
259/259 [==============================] - 40s 152ms/step - loss: 0.0200 - accuracy: 0.9961 - val_loss: 1.5971 - val_accuracy: 0.6507
102/102 [==============================] - 2s 22ms/step - loss: 1.1378 - accuracy: 0.7171
7/7 [==============================] - 0s 19ms/step - loss: 2.2028 - accuracy: 0.6346
final loss 1.1377620697021484, final acc 0.7170971632003784
test loss 2.202808141708374, test acc 0.6346153616905212
```

# Baseline train with simple augmentations and upsampling up to 10,000 images

10,000 is the maximum number of images allowed in the competition. This is the sum of the train and validation sets.

Nonsensical images were removed before augmentation because they are noise: augmenting them would add little to the training or even hurt it.

The overall performance went from `0.65521` to `0.75000`! However, the validation set metrics have a wider gap from the leaderboard than with the baseline model. It could be because I am using the "cleaned" dataset, but the actual change from cleaning is relatively small and shouldn't impact the metrics that much. Further investigation is needed.
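
To make the "wider gap" concrete, here is a small sketch that compares the local metrics with the leaderboard score for the baseline and augmented runs. The numbers are copied from the training logs and leaderboard scores in this notebook. Which local metric is the right one to compare is an assumption on my part, so both the last-epoch `val_accuracy` and the `final acc` line are included.

```python
# Accuracy figures copied from the training logs and leaderboard scores in this notebook.
runs = {
    "baseline": {"val_accuracy": 0.6458, "final_acc": 0.6814, "leaderboard": 0.65521},
    "augmented": {"val_accuracy": 0.7306, "final_acc": 0.8118, "leaderboard": 0.75000},
}


def gaps(run):
    """Absolute difference between each local metric and the leaderboard score."""
    return {
        name: abs(value - run["leaderboard"])
        for name, value in run.items()
        if name != "leaderboard"
    }


for run_name, run in runs.items():
    formatted = ", ".join(f"{k}: {v:.4f}" for k, v in gaps(run).items())
    print(f"{run_name:<10} {formatted}")
```

With either local metric, the gap is larger for the augmented run than for the baseline.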

In [7]:
max_images = 10_000 - 8
remainder = max_images - len(labels)

if train_image_dir.exists():
    shutil.rmtree(train_image_dir)

_ = utils.copy_group_to_dir(
    df=labels, 
    source_dir=source_image_dir, 
    dest_dir=train_image_dir, 
    group= ["org_source", "symbol"]
   )
assert len(_) == 0

labels_filtered = labels.query("suggestion != 'remove' and org_source != 'val'")
images_2_aug = labels_filtered.query("confusing == 'none' and org_source != 'val'").copy()

images_2_aug["aug_amount"] = 0

images_per_group = remainder // 10

symbol_counts = images_2_aug.symbol.value_counts()

# Balance the sampling so we don't bias toward common symbols:
# every symbol gets the same budget, spread evenly over its images.
# Series.items() yields (symbol, count) pairs regardless of how the
# pandas version names the value_counts() result.
for symbol, n_images in symbol_counts.items():
    symbol_sample_min = images_per_group // n_images
    sample_amount = [symbol_sample_min] * n_images
    sample_remainder = images_per_group - sum(sample_amount)
    # hand out the leftover one extra augmentation at a time
    if sample_remainder > 0:
        sample_amount[:sample_remainder] = [symbol_sample_min + 1] * sample_remainder
    random.shuffle(sample_amount)
    images_2_aug.loc[images_2_aug.symbol == symbol, "aug_amount"] = sample_amount

assert images_2_aug.aug_amount.sum() <= remainder
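
The per-symbol balancing in the cell above can be isolated into a small helper. This is only a sketch of the same idea, not the code that was run: `spread_budget` is a hypothetical name, and the symbol counts in the usage loop are made up for illustration.

```python
import random


def spread_budget(budget, n_images, rng=random):
    """Split `budget` augmentations over `n_images` images as evenly as possible."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    base, extra = divmod(budget, n_images)
    # `extra` images get one more augmentation than the rest
    amounts = [base + 1] * extra + [base] * (n_images - extra)
    # shuffle so the extra augmentations do not always land on the first rows
    rng.shuffle(amounts)
    return amounts


# Hypothetical counts, not the real dataset: a rare and a common symbol
per_symbol_budget = 100
for symbol, n_images in {"rare": 7, "common": 40}.items():
    amounts = spread_budget(per_symbol_budget, n_images)
    print(symbol, sum(amounts), min(amounts), max(amounts))
```

Both symbols end up with exactly 100 augmented images, but each image of the rare symbol is augmented 14 or 15 times while each image of the common symbol is augmented only 2 or 3 times.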

In [None]:
import imgaug.augmenters as iaa
seq = iaa.Sequential([
    iaa.Crop(px=(1, 16), keep_size=True),
    iaa.Fliplr(0.5),
    iaa.GaussianBlur(sigma=(0, 3.0))
])

utils.apply_aug(images_2_aug, source_image_dir, train_image_dir, seq, ["org_source", "symbol"])

train.train_model(train_image_dir, "../data/raw/label_book", verbose=1)

```
Epoch 100/100
1148/1148 [==============================] - 147s 128ms/step - loss: 0.0220 - accuracy: 0.9926 - val_loss: 1.3617 - val_accuracy: 0.7306
102/102 [==============================] - 2s 21ms/step - loss: 0.9264 - accuracy: 0.8118
7/7 [==============================] - 0s 29ms/step - loss: 1.7896 - accuracy: 0.7115
final loss 0.9263656139373779, final acc 0.8118081092834473
test loss 1.7895989418029785, test acc 0.7115384340286255
```

Leaderboard result:

Aug 12 - submission-01 = 0.75000

# Train with nonsensical images removed

In [None]:
labels_filtered = labels.query("suggestion != 'remove' or org_source == 'val'")

if train_image_dir.exists():
    shutil.rmtree(train_image_dir)

_ = utils.copy_group_to_dir(
    df=labels_filtered, 
    source_dir=source_image_dir, 
    dest_dir=train_image_dir, 
    group= ["org_source", "symbol"]
   )
assert len(_) == labels.query("suggestion == 'remove' and org_source == 'train'").shape[0]

train.train_model(train_image_dir, "../data/raw/label_book", verbose=1)

```
Epoch 100/100
249/249 [==============================] - 36s 145ms/step - loss: 0.0172 - accuracy: 0.9955 - val_loss: 1.4800 - val_accuracy: 0.6851
102/102 [==============================] - 3s 27ms/step - loss: 1.1938 - accuracy: 0.7319
7/7 [==============================] - 0s 23ms/step - loss: 1.8671 - accuracy: 0.5577
final loss 1.1938356161117554, final acc 0.7318572998046875
test loss 1.8670954704284668, test acc 0.557692289352417
```
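
The filter in this section combines its two conditions with `or`, so it only drops flagged images from the training split and keeps every validation image, presumably so the validation set stays comparable with the earlier runs. A plain-Python sketch of the same condition on toy rows follows. The `file` column and the `keep` value are made up for illustration; only `org_source`, `suggestion`, `'remove'`, `'train'` and `'val'` come from the notebook.

```python
# Toy rows, made up for illustration; they are not taken from the labels file.
rows = [
    {"file": "a.png", "org_source": "train", "suggestion": "keep"},
    {"file": "b.png", "org_source": "train", "suggestion": "remove"},
    {"file": "c.png", "org_source": "val", "suggestion": "remove"},
    {"file": "d.png", "org_source": "val", "suggestion": "keep"},
]


def keep_row(row):
    # Same condition as the query: suggestion != 'remove' or org_source == 'val'
    return row["suggestion"] != "remove" or row["org_source"] == "val"


kept = [row["file"] for row in rows if keep_row(row)]
dropped = [row["file"] for row in rows if not keep_row(row)]
print("kept:", kept)
print("dropped:", dropped)
```

Only `b.png`, the flagged training image, is dropped; the flagged validation image `c.png` is kept.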