
# How to (Re)train Your ConvNet?

Once you are able to train a ConvNet on your particular dataset (if you haven't reached that point yet, you hopefully will soon), you need to ask yourself what you should do next.

Before getting into the *how*, let's break down the *should*.

A good experiment should:
* Contribute to your learning goals
* Contribute to understanding your model

## Modify the Structure of the Network

One of the most logical things to do once you've trained version 1 of your ConvNet is to try modifying the structure of the network in some way.  At first you may manually tweak some aspects of the network and see what the result is.  Eventually, we advise you to actually do an automated experiment where you search over some space of network architectures (e.g., try various numbers of convolutional filters at a particular layer).
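
As a rough illustration of such an automated search, the sketch below trains a tiny ConvNet for several numbers of first-layer filters and records the validation accuracy of each. The helpers `make_net` and `train_and_evaluate`, the 3x32x32 input size, and the random stand-in data are all assumptions made for this example; you would substitute your own network and DataLoaders.

```python
import torch
import torch.nn as nn

def make_net(n_filters, n_classes=10):
    """Hypothetical helper: a tiny ConvNet whose first-layer width we vary."""
    return nn.Sequential(
        nn.Conv2d(3, n_filters, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # collapse each feature map to one value
        nn.Flatten(),
        nn.Linear(n_filters, n_classes),
    )

def train_and_evaluate(net, x_train, y_train, x_val, y_val, n_steps=20):
    """Hypothetical helper: train briefly, then return validation accuracy."""
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    net.train()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss_fn(net(x_train), y_train).backward()
        optimizer.step()
    net.eval()
    with torch.no_grad():
        predictions = net(x_val).argmax(dim=1)
    return (predictions == y_val).float().mean().item()

# Random stand-in data (assumption); use your real train/validation sets.
torch.manual_seed(0)
x_train, y_train = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
x_val, y_val = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))

# The search space: number of convolutional filters in the first layer.
results = {}
for n_filters in [4, 8, 16]:
    net = make_net(n_filters)
    results[n_filters] = train_and_evaluate(net, x_train, y_train, x_val, y_val)

best = max(results, key=results.get)
print(results, "best:", best)
```

On random data the accuracies are meaningless; the point is the loop structure, which keeps every configuration's result so you can compare them afterwards.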

## Make it Faster

When running a bunch of experiments, it will be helpful to try to optimize the speed of model training.  It turns out that for the Colab VMs if you are using a relatively small network, a lot of processing time gets spent loading and transforming the data.  There are a couple of things you can do to speed this up.

### Cache Your Data in Memory

If your preprocessed data is small enough to fit into memory, you can simply memoize the `__getitem__` method to avoid loading and processing the same data from disk repeatedly (e.g., on each epoch).

Here is some code we wrote to do this for the `ImageFolder` class that we showed last time.  We have found that this results in about a 4-5x speedup when training on a relatively small network.

In [0]:
import time

from torchvision.datasets import ImageFolder

# Datasets must subclass Dataset, either directly or indirectly.
# Here, we subclass the ImageFolder class.
class CachedDataset(ImageFolder):
    def __init__(self, root, move_to_GPU=False, transform=None, target_transform=None):
        """ The init method passes most arguments up to the `ImageFolder` class.

            The exception is the `move_to_GPU` input, which if set to true will
            move the returned data to CUDA and if false, will keep it on the CPU
        """
        # make sure to call the super class init method
        super(CachedDataset, self).__init__(root,
                                            transform=transform,
                                            target_transform=target_transform)
        self.total_time_loading = 0
        self.move_to_GPU = move_to_GPU
        # we'll cache the loaded tensors here
        self.tensor_cache = {}

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        t_start = time.time()
        if int(index) in self.tensor_cache:
            self.total_time_loading += time.time() - t_start
            return self.tensor_cache[int(index)]

        inputs, target = super(CachedDataset, self).__getitem__(index)
        if self.move_to_GPU:
            self.tensor_cache[int(index)] = inputs.to('cuda'), target
        else:
            self.tensor_cache[int(index)] = inputs, target
        self.total_time_loading += time.time() - t_start
        return self.tensor_cache[int(index)]
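
To see the effect of memoizing `__getitem__` without needing an image folder, here is a minimal sketch that applies the same idea to an arbitrary map-style dataset. `SlowDataset` and `CachedWrapper` are hypothetical names invented for this example; `load_count` simply counts how often the "expensive" load actually runs.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    """Hypothetical stand-in for a dataset with expensive loading."""
    def __init__(self, n):
        self.n = n
        self.load_count = 0  # number of times an item was actually loaded

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        self.load_count += 1
        return torch.full((3, 4, 4), float(index)), index % 2

class CachedWrapper(Dataset):
    """Memoizes the items of any map-style dataset (illustrative sketch)."""
    def __init__(self, base):
        self.base = base
        self.cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        index = int(index)
        if index not in self.cache:
            self.cache[index] = self.base[index]
        return self.cache[index]

base = SlowDataset(8)
loader = DataLoader(CachedWrapper(base), batch_size=4, shuffle=True, num_workers=0)
for epoch in range(3):
    for images, targets in loader:
        pass  # a training step would go here

print(base.load_count)  # each of the 8 items was loaded once, not once per epoch
```

This sketch assumes `num_workers=0`. With worker processes, each worker holds its own copy of the dataset object, so a cache filled inside a worker is not shared with the main process.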

### Save Your Preprocessed Data as an HDF5 File

If your data doesn't fit into memory, another thing you can do is save your data as an [hdf5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file.  We have found that this can give you a similar performance boost to caching the data in memory.  In this workflow you would use the code below to convert your dataset into an hdf5 file, save it to your Google Drive, and then modify your code to download the file each time you want to train your network.

In the example below, we assume that we have already created a dataset called `cal_tech` (you could create this using our sample code).  We will create an hdf5 file and populate it with the data from `cal_tech`.  If you want to use this code, you'd have to modify it for the particular aspects of your preprocessed dataset (e.g., resolution of the images).

In [0]:
import h5py
import numpy as np

h5_file = h5py.File("caltech.hdf5", 'w')
# note: 128 by 128 is the preprocessed size of this dataset
data = h5_file.create_dataset('data', shape=(len(cal_tech), 3, 128, 128), dtype=np.float32)
# np.int64 matches PyTorch's long dtype (np.long is not available in all NumPy versions)
targets = h5_file.create_dataset('targets', shape=(len(cal_tech),), dtype=np.int64)

# loop through all the data and populate the hdf5 file
for i, (im, target) in enumerate(cal_tech):
    if i % 2000 == 0:
        print(i)
    # convert explicitly to a NumPy array before writing to the hdf5 dataset
    data[i] = im.to('cpu').numpy()
    targets[i] = target
h5_file.close()

# copy the dataset to Google Drive using rsync
from google.colab import drive
drive.mount('/content/gdrive')
!rsync -ah --progress caltech.hdf5 /content/gdrive/My\ Drive/Datasets\ for\ ML\ Class/caltech.hdf5

To use the generated dataset on subsequent runs, you can use the following Dataset class.

In [0]:
import h5py
from torchvision.datasets import VisionDataset

class H5Dataset(VisionDataset):
    def __init__(self, h5_path):
        """ Initialize the dataset with the path to the hdf5 file.  You could
            also add support for transforms, but we decided to leave that out.
        """
        super(H5Dataset, self).__init__('.', transform=None, target_transform=None)
        self.h5_file = h5py.File(h5_path, 'r', libver='latest', swmr=True)
        # caching the targets in memory speeds up training; a single slice
        # reads them all at once instead of one hdf5 access per element
        self.target_cache = self.h5_file['targets'][:]

    def __getitem__(self, index):
        return self.h5_file['data'][index], self.target_cache[index]

    def __len__(self):
        return len(self.h5_file['data'])

    def close_dataset(self):
        self.h5_file.close()

# for example
cal_tech = H5Dataset('caltech.hdf5')

## Weight Decay

* https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
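
In PyTorch, weight decay is usually set through the `weight_decay` argument of the optimizer. The sketch below shows three common ways to configure it; the model and all numeric values are illustrative assumptions, not recommendations.

```python
import torch
import torch.nn as nn

# Illustrative stand-in model (assumes 3x32x32 inputs).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),
)

# Option 1: SGD, where weight decay acts as an L2 penalty added to the gradient.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Option 2: AdamW, which decouples the decay from the adaptive gradient update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Option 3: per-parameter groups, e.g. no decay on bias terms.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

A simple experiment is to sweep `weight_decay` over a few orders of magnitude and compare the gap between training and validation accuracy.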

## Do Some Image Normalization

* https://discuss.pytorch.org/t/understanding-transform-normalize/21730/7

## Test with Out-of-Dataset Images

One interesting thing you can try is to test your network on images you collect yourself (or that come from a dataset you didn't use for training).  One thing to watch out for: make sure your input images have similar average R, G, B values to the images in your training data.  Image normalization can be particularly useful for this.  Further, comparing how well the network handles out-of-dataset images with and without image normalization could be particularly interesting.
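
One quick way to check the average R, G, B values is to compare per-channel means between the two sets of images. The sketch below uses random tensors as stand-ins; the helper name `channel_means` and the `0.1` threshold are arbitrary choices for illustration.

```python
import torch

def channel_means(images):
    """Mean R, G, B values over a batch of shape (N, 3, H, W)."""
    return images.mean(dim=(0, 2, 3))

torch.manual_seed(0)
train_images = torch.rand(32, 3, 64, 64)     # stand-in for training images
new_images = 0.5 * torch.rand(8, 3, 64, 64)  # stand-in for darker, self-collected images

gap = (channel_means(new_images) - channel_means(train_images)).abs()
print(gap)

# 0.1 is an arbitrary illustrative threshold, not an established rule.
if (gap > 0.1).any():
    print("Channel statistics differ noticeably; consider normalizing the images.")
```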

## Do Some Data Augmentation

If you want your system to be more robust to rotations, translations, and other sorts of variation, you can augment your data by randomly generating transformed versions of the original images.



## Modify the Optimizer

* Change learning rate
* Try a different optimizer (explore various parameter settings)
* [Other ideas](https://pytorch.org/docs/stable/optim.html)
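
The sketch below shows one way to compare a few optimizers and to change the learning rate during training with a scheduler. The candidate settings, the toy linear model, and the random data are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(128, 10), torch.randn(128, 1)  # random stand-in data

# Candidate optimizers with illustrative (not tuned) settings.
candidates = {
    "sgd": lambda params: torch.optim.SGD(params, lr=0.01, momentum=0.9),
    "adam": lambda params: torch.optim.Adam(params, lr=1e-3),
    "rmsprop": lambda params: torch.optim.RMSprop(params, lr=1e-3),
}

final_losses = {}
for name, make_optimizer in candidates.items():
    torch.manual_seed(0)      # same initial weights for every candidate
    model = nn.Linear(10, 1)  # toy stand-in for your ConvNet
    optimizer = make_optimizer(model.parameters())
    # Halve the learning rate every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    for epoch in range(30):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()
    final_losses[name] = loss.item()

print(final_losses)
```

In a real experiment you would compare validation accuracy rather than the final training loss, and give each optimizer a fair sweep over its own learning rates.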

## Visualize What Your Network is Doing

While we are still working on a more complete writeup of this, you should certainly consider the following sorts of visualizations.
* Show the activation gradients (we showed this in the example on [training on the COCO dataset](https://colab.research.google.com/github/mlfa19/assignments/blob/master/Module%201/m1_project/Examples/Working_with_Large_Datasets_COCO_Example.ipynb))
* Maximize the class output (e.g., start with an image from the test set and use gradient ascent to maximize a particular output of the model).
* Maximize the output of a particular unit in the network.
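
As a rough sketch of the class-maximization idea, the code below runs gradient ascent on the pixels of an input image to increase one output of a model. The untrained stand-in network, the target class, the step count, and the learning rate are all assumptions; with your own trained model you would start from a test-set image instead of random pixels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(  # untrained stand-in for your trained ConvNet
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # we optimize the image, not the weights

target_class = 3
image = torch.rand(1, 3, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

with torch.no_grad():
    initial_score = model(image)[0, target_class].item()

for step in range(50):
    optimizer.zero_grad()
    score = model(image)[0, target_class]
    (-score).backward()  # minimizing the negative score is gradient ascent
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)  # keep pixel values in a valid range

with torch.no_grad():
    final_score = model(image)[0, target_class].item()
print(initial_score, final_score)
```

To maximize a particular unit instead of a class output, one option is to replace `score` with the activation of that unit, for example `model[:2](image)[0, unit].mean()` for a feature map after the first ReLU of this stand-in model.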