<a href="https://colab.research.google.com/github/joshgregory42/practical_deep_learning/blob/main/ch_07_sizing_and_tta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a State-of-the-Art Model

These notes look at more advanced techniques for training an image classification model.

We'll train a model from scratch, using images of objects at different sizes, orientations, lighting conditions, etc.

These techniques help when training a model from scratch, or when using transfer learning on a dataset very different from the one the pretrained model was trained on.

## Imagenette

* A subset of the ImageNet dataset.
* Small enough to experiment on quickly, and what works on it tends to scale up to the full ImageNet dataset.

In [None]:
from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE)

In [None]:
# Put dataset into a DataLoaders object using presizing:

dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))

dls = dblock.dataloaders(path, bs=64)


In [None]:
# Baseline training run:

model = xresnet50(n_out=dls.c)

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.582379,1.480935,0.5295,03:04
1,1.2128,1.197451,0.6236,03:06
2,0.952516,1.000156,0.698656,03:04
3,0.740175,0.70599,0.771098,03:04
4,0.595625,0.578908,0.817401,03:05


## Normalization

When training, typically want input data to be normalized (mean of 0 and standard deviation of 1).

Look at a batch of our data, which won't have a mean of 0 and a std. of 1:

In [None]:
x, y = dls.one_batch()

# Average over all axes except the channel axis (axis 1)
x.mean(dim=[0, 2, 3]), x.std(dim=[0, 2, 3])

(TensorImage([0.4867, 0.4787, 0.4575], device='cuda:0'),
 TensorImage([0.2648, 0.2613, 0.2866], device='cuda:0'))

Can normalize the data easily using fastai's `Normalize` transform. Acts on a whole mini-batch, so just add it to the `batch_tfms` section of the data block.

Define a helper that builds the `DataLoaders` with normalization included:

In [None]:
def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)

dls = get_dls(64, 224)

When using a model that someone else has trained, make sure you normalize your data with the same statistics that were used to train it.
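What per-channel normalization does to a batch can be sketched in plain NumPy. This is only an illustration, not fastai's implementation: the batch is random data, the helper name `normalize_batch` is hypothetical, and the constants are the commonly published ImageNet channel statistics that `imagenet_stats` is assumed to hold.

```python
import numpy as np

# Commonly published ImageNet per-channel statistics (RGB); assumed to match
# what fastai exposes as `imagenet_stats`.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_batch(x, mean, std):
    """Normalize a batch shaped (batch, channel, height, width) per channel."""
    # Reshape the statistics to (1, C, 1, 1) so they broadcast over the batch.
    mean = np.asarray(mean).reshape(1, -1, 1, 1)
    std = np.asarray(std).reshape(1, -1, 1, 1)
    return (x - mean) / std

# Stand-in for a real batch: random pixels in [0, 1).
rng = np.random.default_rng(0)
batch = rng.random((8, 3, 32, 32))

# Normalizing with the batch's own statistics gives mean ~0 and std ~1.
batch_mean = batch.mean(axis=(0, 2, 3))
batch_std = batch.std(axis=(0, 2, 3))
normed = normalize_batch(batch, batch_mean, batch_std)
print(normed.mean(axis=(0, 2, 3)), normed.std(axis=(0, 2, 3)))

# Normalizing with fixed ImageNet statistics only gets close to that,
# because this batch's statistics differ from ImageNet's.
normed_imagenet = normalize_batch(batch, IMAGENET_MEAN, IMAGENET_STD)
```

Using fixed statistics rather than per-batch ones is what keeps training and inference consistent: every image is shifted and scaled the same way, whichever batch it lands in.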

**Progressive resizing**: Gradually using larger and larger images as training progresses.

In [8]:
dls = get_dls(128, 128)

learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(),
                metrics=accuracy)

learn.fit_one_cycle(4, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.603324,2.374149,0.440627,02:50
1,1.256943,1.532799,0.557879,02:41
2,0.943666,0.870385,0.732636,02:38
3,0.736266,0.667811,0.796863,02:39


In [9]:
# Replace the DataLoaders inside the Learner and fine-tune:

learn.dls = get_dls(64, 224)

learn.fine_tune(5, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.824582,1.060832,0.67177,03:09


epoch,train_loss,valid_loss,accuracy,time
0,0.657201,0.775802,0.764376,03:07
1,0.674283,0.691783,0.790142,03:06
2,0.598348,0.646753,0.796863,03:04
3,0.489432,0.478447,0.846901,03:03
4,0.431939,0.446233,0.855116,03:03


**Test time augmentation**: During inference or validation, creating multiple versions of each image, using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image.

Can pass any `DataLoader` to fastai's `tta` method. Will use the validation set by default:

In [10]:
preds, targs = learn.tta()

accuracy(preds, targs).item()

0.8666915893554688
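The idea behind test time augmentation can be sketched without fastai. The snippet below is an assumption-heavy toy, not what `learn.tta()` does internally: `toy_model` is a hypothetical stand-in for a trained network (random linear weights plus softmax), and the three augmentations are arbitrary picks for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def toy_model(x, weights):
    """Hypothetical stand-in for a trained model: flatten, linear layer, softmax."""
    return softmax(x.reshape(len(x), -1) @ weights)

def tta_predict(x, weights, augmentations, use_max=False):
    """Predict on every augmented copy of the batch, then combine the predictions."""
    preds = np.stack([toy_model(aug(x), weights) for aug in augmentations])
    return preds.max(axis=0) if use_max else preds.mean(axis=0)

rng = np.random.default_rng(0)
images = rng.random((4, 3, 8, 8))            # (batch, channel, height, width)
weights = rng.normal(size=(3 * 8 * 8, 10))   # 10 hypothetical classes

augmentations = [
    lambda x: x,                        # the original image
    lambda x: x[..., ::-1],             # horizontal flip
    lambda x: np.clip(x * 1.1, 0, 1),   # slightly brighter
]

avg_preds = tta_predict(images, weights, augmentations)
print(avg_preds.argmax(axis=1))
```

Averaging keeps each row a valid probability distribution, while taking the maximum (`use_max=True` in this sketch) does not, so the maximum is mainly useful when only the predicted class matters.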

## Mixup

Really powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don't have much data and don't have a pretrained model that was trained on data similar to your dataset.

How it works for each image:

1. Select another image from your dataset at random.
2. Pick a weight at random.
3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.
4. Take a weighted average (with the same weight) of this image's labels with your image's labels; this will be your dependent variable.
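The four steps above can be sketched on a whole batch with NumPy. This is a minimal illustration under stated assumptions, not fastai's `MixUp` callback: the function name `mixup_batch` is hypothetical, the weights are drawn from a Beta distribution (a common choice), and `alpha=0.4` is an arbitrary value picked for the example.

```python
import numpy as np

def mixup_batch(x, y_onehot, rng, alpha=0.4):
    """Mix each item in the batch with a randomly chosen partner."""
    n = len(x)
    # Step 1: pick a partner for every item by shuffling the batch indices
    # (an item may occasionally be paired with itself).
    partner = rng.permutation(n)
    # Step 2: draw one weight per item; Beta(alpha, alpha) is assumed here.
    lam = rng.beta(alpha, alpha, size=n)
    # Step 3: weighted average of the images (the independent variable).
    lam_x = lam.reshape(n, 1, 1, 1)
    x_mixed = lam_x * x + (1 - lam_x) * x[partner]
    # Step 4: the same weights applied to the one-hot labels (the dependent variable).
    lam_y = lam.reshape(n, 1)
    y_mixed = lam_y * y_onehot + (1 - lam_y) * y_onehot[partner]
    return x_mixed, y_mixed

rng = np.random.default_rng(0)
x = rng.random((4, 3, 8, 8))       # (batch, channel, height, width)
labels = np.array([0, 1, 2, 1])
y_onehot = np.eye(3)[labels]       # one-hot targets for 3 classes

x_mixed, y_mixed = mixup_batch(x, y_onehot, rng)
print(y_mixed.round(2))
```

Each row of `y_mixed` still sums to 1, so the mixed targets remain valid probability distributions over the classes.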

Example for when we take a linear combination of images, as done in Mixup:

In [11]:
#caption Mixing a church and a gas station
#alt An image of a church, a gas station and the two mixed up.
# Take the first image (in sorted order) from the church and gas station classes
church = PILImage.create(get_image_files(path/'train'/'n03028079').sorted()[0])
gas = PILImage.create(get_image_files(path/'train'/'n03425413').sorted()[0])
church = church.resize((256,256))
gas = gas.resize((256,256))
# Convert to float tensors with pixel values in [0, 1]
tchurch = tensor(church).float() / 255.
tgas = tensor(gas).float() / 255.

_,axs = plt.subplots(1, 3, figsize=(12,4))
show_image(tchurch, ax=axs[0]);
show_image(tgas, ax=axs[1]);
show_image((0.3*tchurch + 0.7*tgas), ax=axs[2]);

The third image is built by adding 0.3 times the first one and 0.7 times the second. In this example, should the model predict "church" or "gas station"? The right answer is 30% church and 70% gas station, since that's what we get if we take the same linear combination of the one-hot-encoded targets. In fastai, Mixup is applied by passing a *callback* to our `Learner`. Callbacks are what fastai uses to inject custom behavior into the training loop (like a learning rate schedule, or training in mixed precision).

For now, just do this to train a model with Mixup:

In [12]:
model = xresnet50(n_out=dls.c)

learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())

learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.89137,2.151761,0.419343,02:55
1,1.666307,1.430972,0.566468,03:07
2,1.458442,1.18268,0.640777,02:50
3,1.318334,0.845568,0.748693,02:43
4,1.201527,0.711317,0.785661,02:44


Potential issue with one-hot-encoded labels: the targets are only 0s and 1s, so the model is pushed toward being 100% confident, when in reality something like 95% would be more appropriate. This is especially harmful when the data isn't perfectly labeled, which in practice it never is. It encourages overfitting and can hurt your model's performance.

The solution is *label smoothing*: replace all of our 1s with a number a bit less than 1, and our 0s with a number a bit more than 0, and then train. This makes the model less overconfident, which helps it generalize.
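As a rough sketch of what smoothed targets look like, the snippet below uses one common formulation in which a small amount `eps` of probability is spread evenly over all classes. The helper name `smooth_labels` and the value `eps=0.1` are assumptions made for this example.

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Turn integer labels into smoothed one-hot targets."""
    onehot = np.eye(n_classes)[labels]
    # Every class gets eps / n_classes; the true class keeps 1 - eps on top of that.
    return onehot * (1 - eps) + eps / n_classes

targets = smooth_labels(np.array([0, 2]), n_classes=4, eps=0.1)
print(targets)
```

With 4 classes and `eps=0.1`, the true class gets 0.925 and every other class gets 0.025, so each row still sums to 1. In fastai this is usually handled inside the loss function (for example `LabelSmoothingCrossEntropy`) rather than by editing the labels directly.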