Adapted from radekosmulski's [notebook](https://github.com/radekosmulski/whale/blob/master/first_submission.ipynb) for learning purposes, using the playground competition instead of the original one.

In [None]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [None]:
root='/content/gdrive/MyDrive'
base=f'{root}/Colab Notebooks/whale'

## Setup Kaggle and Data

In [None]:
#Setup kaggle key
!mkdir -p ~/.kaggle
!cp '{base}/kaggle.json' ~/.kaggle/
!ls ~/.kaggle


kaggle.json


In [None]:
#Download kaggle cli
! pip install --upgrade --force-reinstall --no-deps kaggle -q

[?25l[K     |█████▌                          | 10kB 27.7MB/s eta 0:00:01[K     |███████████                     | 20kB 34.6MB/s eta 0:00:01[K     |████████████████▋               | 30kB 39.8MB/s eta 0:00:01[K     |██████████████████████▏         | 40kB 30.5MB/s eta 0:00:01[K     |███████████████████████████▊    | 51kB 32.8MB/s eta 0:00:01[K     |████████████████████████████████| 61kB 8.4MB/s 
[?25h  Building wheel for kaggle (setup.py) ... [?25l[?25hdone


In [None]:
from pathlib import Path

data_dir='data'
if not Path(data_dir).exists():
  ! mkdir -p {data_dir}
  
  #Download competition data
  ! kaggle competitions download -c whale-categorization-playground
  ! unzip -q whale-categorization-playground.zip

  ! mv train/train {data_dir}/
  ! mv test/test {data_dir}/
  ! mv train.csv {data_dir}/
  ! mv sample_submission.csv {data_dir}/

  # Remove the leftover extraction directories (rm -f alone cannot delete a directory)
  ! rm -rf train
  ! rm -rf test

## Imports

In [None]:
from nbdev.showdoc import *
from fastai.vision.all import *

import pandas as pd

## Setup

In [None]:
SEED=42
set_seed(s=SEED, reproducible=True)

## Data

In [None]:
df = pd.read_csv(f'{data_dir}/train.csv')
df.head()

          Image         Id
0  00022e1a.jpg  w_e15442c
1  000466c4.jpg  w_1287fbc
2  00087b01.jpg  w_da2efe0
3  001296d5.jpg  w_19e5482
4  0014cfdf.jpg  w_f22f3e3


In [None]:
df.Id.value_counts()

new_whale    810
w_1287fbc     34
w_98baff9     27
w_7554f44     26
w_1eafe46     23
            ... 
w_c774326      1
w_740dfd4      1
w_65ec378      1
w_0c8a724      1
w_d26cc27      1
Name: Id, Length: 4251, dtype: int64

In [None]:
(df.Id == 'new_whale').mean()

0.08223350253807106

In [None]:
(df.Id.value_counts() == 1).mean()

0.5222300635144672

52% of all whale IDs have only a single image associated with them.

8.2% of all images are labelled `new_whale`, i.e. they do not show a known whale.
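
The two percentages above use different denominators: the first is a share of distinct IDs, the second a share of images. The sketch below illustrates this on a small, hypothetical label list (not the competition data), using only the standard library rather than pandas.

```python
from collections import Counter

# Hypothetical toy labels, one entry per image (not the competition data)
labels = ["new_whale", "w_a", "w_a", "w_b", "w_c", "new_whale", "w_d", "w_a"]

counts = Counter(labels)  # number of images per Id

# Share of Ids seen exactly once: the denominator is the number of distinct Ids
singleton_share = sum(1 for n in counts.values() if n == 1) / len(counts)

# Share of images labelled new_whale: the denominator is the number of images
new_whale_share = counts["new_whale"] / len(labels)

print(singleton_share, new_whale_share)
```

As in the `value_counts()` cell above, `new_whale` is counted as one ID among the others when computing the single-image share.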

[TODO] There is a superb writeup on what a solution to this problem might look like [here](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563/notebook). In general, the conversation in the Kaggle [forum](https://www.kaggle.com/c/humpback-whale-identification/discussion) also seems to have some very informative threads.

In [None]:
df.Id.nunique()

4251

In [None]:
df.shape

(9850, 2)

## DataBlock

In [None]:
SZ = 224
BS = 64
NUM_WORKERS = 12 
SEED = 42

In [None]:
source=Path(data_dir)

In [None]:
source

Path('data')

In [None]:
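data = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        # Column 0 (Image) holds the file name, resolved relative to data/train
        get_x=ColReader(0, pref=source/'train'),
        # Column 1 (Id) holds the whale label
        get_y=ColReader(1),
        splitter=RandomSplitter(seed=SEED),
        item_tfms=Resize(SZ),
        batch_tfms=aug_transforms())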

In [None]:
#data.summary(source=df)

In [None]:
dls = data.dataloaders(source=df, bs=BS, num_workers=NUM_WORKERS)

In [None]:
#dls.show_batch()

In [None]:
def map5(preds, targs):
    # When preds is a list of prediction tensors, average MAP@5 over them
    if isinstance(preds, list):
        return torch.cat([map5kfast(p, targs, 5).view(1) for p in preds]).mean()
    return map5kfast(preds, targs, 5)
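
`map5` relies on a helper named `map5kfast` that is not defined anywhere in this notebook, so it has to be defined before training starts. The following is a minimal sketch of such a helper, assuming it should compute mean average precision at `k` for single-label targets (each sample has exactly one true class). It is not necessarily the implementation used in the original notebook.

```python
import torch

def map5kfast(preds, targs, k=5):
    """Sketch of MAP@k for single-label targets (assumed behaviour).

    preds: (n_samples, n_classes) scores, with n_classes >= k
    targs: (n_samples,) true class indices
    """
    # Indices of the k highest-scoring classes per sample, best first
    top_k = preds.topk(k, dim=1).indices
    # hits[i, j] is True when the j-th guess for sample i is the true class
    hits = top_k == targs.view(-1, 1)
    # A hit at 0-based rank j scores 1 / (j + 1); a miss scores 0
    weights = 1.0 / torch.arange(1, k + 1, dtype=torch.float, device=preds.device)
    # At most one guess per row can be correct, so the row sum is the row score
    return (hits.float() * weights).sum(dim=1).mean()

# Tiny usage example with made-up scores for 3 samples and 6 classes
scores = torch.tensor([[0.6, 0.5, 0.4, 0.3, 0.2, 0.1]] * 3)
targets = torch.tensor([0, 1, 5])
result = map5kfast(scores, targets)
```

In the usage example the true class is ranked first for the first sample, second for the second, and falls outside the top five for the third.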

## Modeling

In [None]:
learn = cnn_learner(dls, resnet50, metrics=[accuracy, map5])

Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth






In [None]:
#learn.lr_find()

In [None]:
#learn.fine_tune(5, 1e-2, freeze_epochs=1)