In [11]:
from fastai.vision import *
from fastai.metrics import accuracy
from fastai.basic_data import *
import pandas as pd

from utils import *

## A look at the data

In [12]:
df = pd.read_csv('data/train.csv')
df.head()

           Image         Id
0  0000e88ab.jpg  w_f48451c
1  0001f9222.jpg  w_c3d896a
2  00029d126.jpg  w_20df2c5
3  00050a15a.jpg  new_whale
4  0005c1ef8.jpg  new_whale


In [13]:
df.Id.value_counts().head()

new_whale    9664
w_23a388d      73
w_9b5109b      65
w_9c506f6      62
w_0369a5c      61
Name: Id, dtype: int64

In [14]:
(df.Id == 'new_whale').mean()

0.38105752927723668

In [15]:
(df.Id.value_counts() == 1).mean()

0.41418581418581418

41% of all whales have only a single image associated with them.

38% of all images contain a new whale - a whale that has not been identified as one of the known whales.

There is a superb writeup on what a solution to this problem might look like [here](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563/notebook). In general, the conversation in the Kaggle [forum](https://www.kaggle.com/c/humpback-whale-identification/discussion) also seems to have some very informative threads.

Either way, starting with a simple model that can be hacked together in a couple of lines of code is a good approach. It is good to have a baseline to build on - going for a complex model from the start is a way to die a thousand deaths by subtle bugs.

In [16]:
df.Id.nunique()

5005

In [17]:
df.shape

(25361, 2)

In [18]:
# Map each image filename to its whale id.
fn2label = dict(zip(df.Image, df.Id))

In [19]:
SZ = 224
BS = 64
NUM_WORKERS = 12
SEED = 0

In [20]:
data = (
    ImageItemList
        .from_folder('data/train')
        .random_split_by_pct(seed=SEED)
        .label_from_func(lambda path: fn2label[path.name])
        .add_test(ImageItemList.from_folder('data/test'))
        .transform(get_transforms(do_flip=False, max_zoom=1, max_warp=0, max_rotate=2), size=SZ, resize_method=ResizeMethod.SQUISH)
        .databunch(bs=BS, num_workers=NUM_WORKERS, path='data')
)

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
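The exception above is what a purely random split can run into here: with so many whales having a single image, some ids end up only in the validation set. A minimal sketch for checking this directly on the labels, without going through fastai; `labels_missing_from_train` is a hypothetical helper (not part of `utils` or fastai) and it only assumes a filename-to-label dict like `fn2label`:

```python
def labels_missing_from_train(fn2label, valid_fnames):
    """Return the labels that occur in the validation files but in no training file."""
    valid_fnames = set(valid_fnames)
    # Every file that is not held out for validation counts as training data.
    train_labels = {label for fname, label in fn2label.items()
                    if fname not in valid_fnames}
    valid_labels = {fn2label[fname] for fname in valid_fnames}
    return valid_labels - train_labels
```

An empty result means every validation label is also seen during training; a non-empty one lists the ids that would trigger the error.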

In [None]:
data.show_batch(rows=3)

## Train

In [None]:
name = f'res50-{SZ}'

In [None]:
learn = create_cnn(data, models.resnet50, metrics=[accuracy, map5])
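`map5` comes from the local `utils` module, which is not shown here. Assuming it implements the competition metric, mean average precision at 5 (MAP@5), the idea can be sketched in plain Python as below. `map_at_5` is a hypothetical name, and it works on ranked label lists rather than on the tensors a fastai metric receives:

```python
def map_at_5(ranked_labels, targets):
    """Mean average precision at 5 when each image has exactly one true label."""
    total = 0.0
    for ranked, target in zip(ranked_labels, targets):
        # Only the first five guesses count; a hit at rank r scores 1/r.
        for rank, label in enumerate(ranked[:5], start=1):
            if label == target:
                total += 1.0 / rank
                break
    return total / len(targets)
```

With a single true label per image, the score for an image is simply one over the rank of the correct guess, or zero if it is not among the first five.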

In [None]:
learn.fit_one_cycle(2)

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.save(f'{name}-stage-1')

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
max_lr = 1e-4
lrs = [max_lr/100, max_lr/10, max_lr]
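The three rates above are a factor of 10 apart, so that the earliest layers are updated most gently and the head gets the full `max_lr`. A minimal sketch of the same spacing for an arbitrary number of layer groups; `geometric_lrs` is a hypothetical helper written for illustration, not a fastai function:

```python
def geometric_lrs(lowest, highest, n_groups):
    """Return n_groups learning rates spaced by a constant factor from lowest to highest."""
    if n_groups == 1:
        return [highest]
    # Constant ratio between neighbouring groups.
    ratio = (highest / lowest) ** (1.0 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]
```

For example, `geometric_lrs(max_lr/100, max_lr, 3)` gives, up to floating point error, the same three values as the list above.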

In [None]:
learn.fit_one_cycle(5, lrs)

In [None]:
learn.save(f'{name}-stage-2')

In [None]:
learn.recorder.plot_losses()

This is not a loss plot you would normally expect to see. Why does it look like this? Let's consider what images appear in the validation set:
 * images of whales that do not appear in the train set (whales where all their images were randomly assigned to the validation set) - there is nothing our model can learn about these!
 * images of whales with multiple images in the dataset where some subset of those got assigned to the validation set
 * `new_whale` images
 
Intuitively, a model such as the above does not seem to frame the problem in a way that would be easy for a neural network to solve. Nonetheless, it is interesting to think about how the construction of the validation set could be improved. What tweaks could be made to the model to improve its performance?
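One possible answer to the validation set question, sketched under assumptions: hold out images per whale rather than per image, and never hold out all images of any whale. That removes the first category from the list above by construction. `split_keeping_labels_in_train` is a hypothetical helper, `new_whale` is treated like any other id here (which may or may not be what you want), and the rounding rule is an arbitrary choice:

```python
import random
from collections import defaultdict

def split_keeping_labels_in_train(fn2label, valid_pct=0.2, seed=0):
    """Split filenames so that every validation label also appears in train."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    # Sort first so the split is reproducible for a given seed.
    for fname in sorted(fn2label):
        by_label[fn2label[fname]].append(fname)

    train, valid = set(), set()
    for label in sorted(by_label):
        fnames = by_label[label]
        rng.shuffle(fnames)
        # Hold out roughly valid_pct of the images, but never all of them.
        n_valid = min(len(fnames) - 1, int(len(fnames) * valid_pct + 0.5))
        valid.update(fnames[:n_valid])
        train.update(fnames[n_valid:])
    return train, valid
```

Whales with a single image always stay in the training set. The resulting validation filenames could then be handed to the data block API in place of the random split.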

## Predict

In [None]:
preds, _ = learn.get_preds(DatasetType.Test)

In [None]:
!mkdir -p subs

In [None]:
create_submission(preds, learn.data, name)
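`create_submission` also lives in `utils` and is not shown. Assuming it writes the five most probable ids per test image in the competition's `Image,Id` format (ids separated by spaces), the core of it might look roughly like this. `submission_csv` is a hypothetical stand-in that works on plain sequences of scores and returns the CSV text instead of writing a file:

```python
import csv
import io

def submission_csv(probs, classes, fnames, k=5):
    """Return CSV text with the k most probable labels for each image."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Image", "Id"])
    for fname, row in zip(fnames, probs):
        # Indices of the k largest scores, most probable first.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        writer.writerow([fname, " ".join(classes[i] for i in top)])
    return buf.getvalue()
```

This sketch does nothing special for `new_whale`; it is ranked purely by its predicted score like every other class.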

In [None]:
pd.read_csv(f'subs/{name}.csv.gz').head()

In [None]:
!kaggle competitions submit -c humpback-whale-identification -f subs/{name}.csv.gz -m "{name}"