
# Dog breed classification 

CNN based on [lesson 2](http://course.fast.ai/lessons/lesson2.html) of the deep learning fast.ai course, with data from the Kaggle Competition [dog-breed-identification](https://www.kaggle.com/c/dog-breed-identification).

This exercise follows the lesson's steps to train a world-class classification model:

1. Enable data augmentation and set `precompute=True`
1. Use `lr_find()` to find the highest learning rate where the loss is still clearly improving
1. Train the last layer from precomputed activations for 1-2 epochs
1. Train the last layer with data augmentation (i.e. `precompute=False`) for 2-3 epochs with `cycle_len=1`
1. Unfreeze all layers
1. Set earlier layers to a 3x-10x lower learning rate than the next higher layer
1. Use `lr_find()` again
1. Train the full network with `cycle_mult=2` until over-fitting
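
Step 6 can be made concrete with a small helper. This is a minimal sketch in plain Python, not fastai code: the function name `layer_group_lrs` and the choice of three layer groups are assumptions made for illustration.

```python
def layer_group_lrs(top_lr, n_groups=3, factor=10.0):
    """Return one learning rate per layer group, earliest group first.

    Each group gets a rate `factor` times lower than the group after it,
    so the last (highest) group trains with `top_lr`.
    """
    if not 3.0 <= factor <= 10.0:
        raise ValueError("the recipe suggests a factor between 3 and 10")
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


lrs = layer_group_lrs(1e-2, n_groups=3, factor=10.0)
print(lrs)
```

In the lesson, an array of rates like this is passed to `learn.fit` in place of a single learning rate once the layers have been unfrozen.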

In [None]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
# This file contains all the main external libs we'll use
from fastai.imports import *

In [None]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

In [None]:
# See http://forums.fast.ai/t/torch-cuda-is-available-returns-false/16721/8
# And https://aws.amazon.com/blogs/machine-learning/get-started-with-deep-learning-using-the-aws-deep-learning-ami/
torch.cuda.is_available()

In [None]:
torch.backends.cudnn.enabled

`PATH` is the path to your data - if you use the recommended setup approaches from the lesson, you won't need to change this. `sz` is the size that the images will be resized to in order to ensure that the training runs quickly. We'll be talking about this parameter a lot during the course. Leave it at `224` for now.

In [None]:
# ! dir ..\..\..\Github\data\dog-breed-identification

In [None]:
PATH = "data/dog-breed-identification/"

## Initial Exploration

In [None]:
# os.listdir(PATH)
# ! ls {PATH}
! dir {PATH}

In [None]:
os.listdir(f'{PATH}train')[:5]

In [None]:
labels_df = pd.read_csv(PATH + 'labels.csv')
print('There are', len(labels_df), 'training observations, and',
      len(os.listdir(f'{PATH}test')), 'test observations.')
labels_df.sample(5)

In [None]:
plt.hist(labels_df.pivot_table(index='breed', aggfunc=len).values);

In [None]:
img = plt.imread(PATH + 'train/' + labels_df.sample(1)['id'].values[0] + '.jpg')
plt.imshow(img)

### Sample subset of images (for CPU only)

In [None]:
frac = 0.5
sample_labels_df = labels_df.sample(frac=frac).set_index('id')
sample_labels_df.to_csv(PATH + 'sample_labels.csv')

In [None]:
print('Sample:', len(sample_labels_df), ' out of', len(labels_df), 'observations.')
sample_labels_df.head()

# Model

> If the original images do not match this size, they are center-cropped. For GPU efficiency, the cropped images must be square.
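
The crop described in the quote above can be sketched as follows. It assumes the shorter side is first resized to `sz` and the central square is then cut out; the helper `center_crop_box` is hypothetical and only computes coordinates, it is not part of fastai.

```python
def center_crop_box(width, height, sz):
    """Resize so the shorter side equals `sz`, then crop the central sz x sz square.

    Returns the resized (width, height) and the crop box as
    (left, upper, right, lower) in resized-image pixel coordinates.
    """
    scale = sz / min(width, height)
    new_w = max(sz, round(width * scale))
    new_h = max(sz, round(height * scale))
    left = (new_w - sz) // 2
    upper = (new_h - sz) // 2
    return (new_w, new_h), (left, upper, left + sz, upper + sz)


# A 500 x 375 landscape photo cropped for a 224 x 224 model input.
resized, box = center_crop_box(500, 375, 224)
print(resized, box)
```

For a landscape image, the crop discards equal strips on the left and right, so a dog standing near the edge of the frame may be partly cut off.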

In [None]:
# Image size to feed into the model.  
sz = 224

arch = resnext101_64  # resnet34  # Model architecture.
bs = 64  # Batch size


n = len(sample_labels_df)  # len(list(open(PATH + 'labels.csv'))) - 1
val_idxs = get_cv_idxs(n, val_pct=0.2)

In [None]:
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(path=PATH, folder='train', csv_fname=PATH + 'sample_labels.csv',
                                    test_name='test', suffix='.jpg', val_idxs=val_idxs, tfms=tfms, bs=bs)

In [None]:
fn = PATH + data.trn_ds.fnames[100]; fn

In [None]:
img = PIL.Image.open(fn); img

In [None]:
img.size

In [None]:
sizes_d = {k: PIL.Image.open(PATH + k).size for k in data.trn_ds.fnames}
row_sz, col_sz = list(zip(*sizes_d.values()))
row_sz = np.array(row_sz); col_sz = np.array(col_sz)

In [None]:
plt.hist(row_sz, bins = 50);  # The trailing semicolon suppresses the printed return value (counts and bins).

In [None]:
plt.hist(col_sz, bins = 50);

## 1. Precompute

* `precompute = True` computes the activations of all the frozen layers in the model only once. Afterwards, they serve as input to the last (unfrozen) layers of the model for gradient descent, which speeds up training.
* `precompute = False` recomputes the frozen layers' activations on every pass and thus allows data augmentation. However, only the weights of the unfrozen layers are updated.
* `learn.unfreeze()` unfreezes all the layers of the model so that they can be fine-tuned.
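
The difference between the first two settings can be sketched with a toy cache. Everything here is a stand-in written for illustration (`FrozenBody` and `epoch_activations` are hypothetical names); it only shows why reusing activations saves work, not how fastai stores them.

```python
class FrozenBody:
    """Stand-in for the frozen convolutional layers; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return sum(image) / len(image)  # toy "activation": the mean pixel value


def epoch_activations(body, images, cache=None):
    """Return activations for one epoch, reusing `cache` when one is given."""
    if cache is None:                      # like precompute=False: recompute every time
        return [body(img) for img in images]
    for i, img in enumerate(images):       # like precompute=True: compute once, then reuse
        if i not in cache:
            cache[i] = body(img)
    return [cache[i] for i in range(len(images))]


images = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

cached_body, cache = FrozenBody(), {}
for _ in range(3):
    epoch_activations(cached_body, images, cache)

uncached_body = FrozenBody()
for _ in range(3):
    epoch_activations(uncached_body, images)

print(cached_body.calls, uncached_body.calls)
```

The cache is keyed by image index, so it is only valid while every image stays the same from epoch to epoch. Augmentation changes the input on each pass, which is why it requires the activations to be recomputed.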

The following function helps iterate faster with the model.  It receives the image's size (`sz`) and the batch size (`bs`).

1. Start with small sizes (e.g. 64) for fast computation at the beginning, then increase the size.
1. If you run out of memory, first **restart the kernel**, then decrease the batch size.


In [None]:
def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(path=PATH, folder='train', csv_fname=PATH + 'sample_labels.csv',
                                        test_name='test', suffix='.jpg', val_idxs=val_idxs, tfms=tfms, bs=bs)
    return data if sz > 300 else data.resize(340, 'tmp')

In [None]:
sz = 224
bs = 64
data = get_data(sz, bs)

In [None]:
# See http://forums.fast.ai/t/dog-breed-challenge-precompute-error/10988/8
# If No such file... error: ~/data/dog-breed-identification$ rm -r tmp
learn = ConvLearner.pretrained(arch, data, precompute=True)

In [None]:
# learn.save('pretrained_128')
learn.save('pretrained_224')

In [None]:
# learn.load('pretrained_128')
learn.load('pretrained_224')

In [None]:
lrf = learn.lr_find()
# learn.sched.plot_lr()
learn.sched.plot()
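
The idea behind the `lr_find()` call above can be sketched on a toy problem. This is not fastai's implementation: it runs plain gradient descent on the made-up loss `f(w) = w**2`, and the stopping rule (loss above 4x the best loss seen) is an assumption chosen for the example.

```python
def lr_range_test(start_lr=1e-5, end_lr=10.0, num_steps=100):
    """Toy learning-rate range test on the loss f(w) = w**2.

    The rate grows geometrically from `start_lr` towards `end_lr`, with one
    gradient step per rate, and the run stops once the loss exceeds 4x the
    best loss seen so far.
    """
    mult = (end_lr / start_lr) ** (1 / (num_steps - 1))
    w, lr, best = 1.0, start_lr, float("inf")
    lrs, losses = [], []
    for _ in range(num_steps):
        loss = w * w
        lrs.append(lr)
        losses.append(loss)
        if loss > 4 * best:        # the loss is blowing up: stop early
            break
        best = min(best, loss)
        w -= lr * 2 * w            # the gradient of w**2 is 2*w
        lr *= mult
    return lrs, losses


lrs, losses = lr_range_test()
best_step = losses.index(min(losses))
print(f"stopped after {len(lrs)} steps; loss was lowest at lr={lrs[best_step]:.3g}")
```

On a real network the curve is noisy, and the recipe at the top of this notebook picks a rate where the loss is still clearly improving rather than the rate at the minimum itself.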

In [None]:
learn.fit(1e-1, 5)

> The gap between `trn_loss` and `val_loss` indicates **overfitting**. Dropout (the `ps` parameter in `ConvLearner.pretrained`) or another form of regularization may reduce it.
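
As a rough illustration of what dropout does, here is a minimal sketch of "inverted" dropout in plain Python. It assumes the dropout parameter is the probability of zeroing an activation; the function is written for this example and is not the layer fastai uses.

```python
import random


def dropout(activations, p, rng):
    """Inverted dropout: zero each value with probability `p`, rescale the rest.

    Dividing the survivors by (1 - p) keeps the expected activation unchanged,
    so nothing needs rescaling at inference time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
acts = [1.0] * 10_000
dropped = dropout(acts, p=0.75, rng=rng)
zero_frac = dropped.count(0.0) / len(dropped)
mean_act = sum(dropped) / len(dropped)
print(f"zeroed: {zero_frac:.2f}, mean activation: {mean_act:.2f}")
```

With a high drop probability, most activations are removed on each pass, so the last layers cannot rely on any single feature. This is the regularizing effect the note above refers to.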

In [None]:
# ps: dropout parameter
learn = ConvLearner.pretrained(arch, data, precompute=True, ps = 0.75)

In [None]:
learn.fit(1e-2, 5)

## 2. Augment 

In [None]:
from sklearn import metrics

In [None]:
data = get_data(sz, bs)

In [None]:
learn.precompute = False

In [None]:
learn.fit(1e-2, 2, cycle_len=1)

In [None]:
learn.save('aug_224')

In [None]:
learn.load('aug_224')

## 3. Increase image size

* This trick needs a _fully convolutional_ architecture.
* It also acts as a form of regularization, because the model now sees the same images at a different resolution.
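
The first point can be illustrated with a toy feature map. This sketch assumes the network pools each channel globally before its classifier, so that the classifier input has the same length whatever the spatial size. The grid sizes 7 and 10 are roughly what a network with an overall stride of 32 would produce for 224 and 299 pixel inputs.

```python
def global_avg_pool(feature_map):
    """Average each channel over its spatial grid: C x H x W -> C values."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def fake_feature_map(channels, size):
    """Hypothetical C x size x size activations, filled with the channel index."""
    return [[[float(c)] * size for _ in range(size)] for c in range(channels)]


small = global_avg_pool(fake_feature_map(channels=4, size=7))    # e.g. 224 px input
large = global_avg_pool(fake_feature_map(channels=4, size=10))   # e.g. 299 px input
print(len(small), len(large))
```

Both inputs yield one value per channel, so the same trained weights can be reused after the image size changes.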

In [None]:
learn.set_data(get_data(299, bs))
learn.freeze()  # Just to make sure it's frozen (only the weights in the last layers are updated).

In [None]:
lrf = learn.lr_find()
# learn.sched.plot_lr()
learn.sched.plot()

In [None]:
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)
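
The effect of `cycle_len` and `cycle_mult` in the call above can be worked out by hand. The sketch below assumes cosine annealing with restarts: the rate decays within each cycle and jumps back to its initial value at the start of the next one. The helper names and the figure of 4 batches per epoch are made up for the example.

```python
import math


def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=1):
    """Length in epochs of each cycle; each is `cycle_mult` times the previous."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


def cosine_lr(init_lr, step, steps_in_cycle):
    """Cosine annealing from `init_lr` towards 0 within a single cycle."""
    return init_lr / 2 * (math.cos(math.pi * step / steps_in_cycle) + 1)


lengths = cycle_lengths(3, cycle_len=1, cycle_mult=2)
print(lengths, "->", sum(lengths), "epochs in total")

# Learning rate at the start of each batch, assuming 4 batches per epoch.
batches_per_epoch = 4
schedule = [
    cosine_lr(1e-2, step, n_epochs * batches_per_epoch)
    for n_epochs in lengths
    for step in range(n_epochs * batches_per_epoch)
]
```

Three cycles of 1, 2 and 4 epochs give 7 epochs in total, which is why this call takes longer than the cycle count of 3 suggests.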

In [None]:
learn.sched.plot_lr()

In [None]:
learn.sched.plot_loss()

In [None]:
learn.save('model_299')

## 4. Predictions

* TO DO: `learn.TTA`
* TO DO: submit to Kaggle

In [None]:
log_preds, y = learn.TTA(is_test=False)  # Default, for validation set.
probs = np.exp(log_preds)
metrics.log_loss(y, probs), accuracy(log_preds, y)
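
The two numbers reported above can be reproduced by hand on a tiny example. This is a plain-Python sketch, not the scikit-learn or fastai code: the function names are chosen for the example, and the averaging step is an assumed simplification of how test-time augmentation combines several views of one image.

```python
import math


def average_predictions(probs_per_view):
    """Average class probabilities over several augmented views of one image."""
    n_views = len(probs_per_view)
    return [sum(view[c] for view in probs_per_view) / n_views
            for c in range(len(probs_per_view[0]))]


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to each true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def accuracy_score(y_true, probs):
    """Fraction of rows whose most probable class equals the true label."""
    hits = sum(row.index(max(row)) == label for label, row in zip(y_true, probs))
    return hits / len(y_true)


# Two images, three classes; the first image is seen through two augmented views.
image_0 = average_predictions([[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]])
probs = [image_0, [0.2, 0.5, 0.3]]
y_true = [0, 2]
print(log_loss(y_true, probs), accuracy_score(y_true, probs))
```

Log loss penalizes confident wrong answers much more heavily than accuracy does, so a model can improve on one metric while getting worse on the other.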

## 5. Submit to Kaggle

File format from Kaggle:

* TO DO
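
Judging from the cells that follow, the submission is a CSV file with an `id` column followed by one probability column per breed. The sketch below writes that layout with the standard library. The layout is inferred from this notebook rather than from the competition page, and the ids and breed names are hypothetical.

```python
import csv
import io


def write_submission(out, ids, classes, probs):
    """Write one header row (id + class names), then one row per test image."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + list(classes))
    for image_id, row in zip(ids, probs):
        writer.writerow([image_id] + [f"{p:.6f}" for p in row])


# Hypothetical ids and breed names, for illustration only.
buf = io.StringIO()
write_submission(buf, ["img_a", "img_b"], ["beagle", "pug"], [[0.9, 0.1], [0.25, 0.75]])
print(buf.getvalue())
```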

In [None]:
data.classes

In [None]:
data.test_ds.fnames

In [None]:
log_preds, y = learn.TTA(is_test=True)  # True for test set.
probs = np.exp(log_preds)
# The test set has no labels, so `y` is only a placeholder and no metrics can be computed here.

In [None]:
probs.shape

In [None]:
df = pd.DataFrame(probs)
df.columns = data.classes
df.insert(0, 'id', [o[5:-4] for o in data.test_ds.fnames])  # Strip the 'test/' prefix and the '.jpg' suffix.
df.head()

In [None]:
SUBM = PATH + 'subm/'
os.makedirs(SUBM, exist_ok=True)
df.to_csv(SUBM + 'subm.gz', compression='gzip', index=False)  # index=False keeps the row index out of the file.
FileLink(SUBM + 'subm.gz')  # To download from server :)

## Predicting one observation

In [None]:
j = 100
fn = data.test_ds.fnames[j]
trn_tfms, val_tfms = tfms_from_model(arch, sz)  # Returns a tuple: (training, validation) transforms.
im = val_tfms(Image.open(PATH + fn))
preds = learn.predict_array(im[None])
# [None] adds a leading batch dimension, so the single image is passed as a minibatch of one.

np.argmax(preds)
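
`np.argmax(preds)` returns an index rather than a breed name. The sketch below shows the lookup in plain Python, assuming the predictions are log-probabilities ordered like `data.classes`. The class names and values are hypothetical.

```python
import math


def top_prediction(log_preds, classes):
    """Return the most likely class name and its probability."""
    best = max(range(len(log_preds)), key=lambda i: log_preds[i])
    return classes[best], math.exp(log_preds[best])


# Hypothetical class names and log-probabilities, for illustration only.
classes = ["beagle", "pug", "whippet"]
log_preds = [math.log(0.2), math.log(0.7), math.log(0.1)]
breed, prob = top_prediction(log_preds, classes)
print(breed, round(prob, 2))
```

In the notebook itself, the equivalent lookup would be `data.classes[np.argmax(preds)]`.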