In [1]:
# run this cell to ensure course package is installed
import sys
from pathlib import Path

course_tools_path = Path('../../Lessons/Course_Tools/').resolve() # change this to the local path of the course package
sys.path.append(str(course_tools_path))

from install_introdl import ensure_introdl_installed
ensure_introdl_installed(force_update=False, local_path_pkg= course_tools_path / 'introdl')

The `introdl` module is already installed.


In [2]:
# imports and configuration
#### Solution

# Homework 3 - Better Training

In this assignment you will build a deeper CNN model to improve the classification performance on the FashionMNIST dataset.  Deeper models can be more difficult to train, so you'll employ some of the techniques from Lesson 3 to improve the training.  You'll also use data augmentation to improve the performance of the model while reducing overfitting.  Along the way you'll see how to downsample a dataset to make experimentation more efficient.

## Build the model (5 pts)

Implement a PyTorch model that subclasses `nn.Module` and reproduces this structure:
```
====================================================================================================
Layer (type (var_name))                  Input Shape          Output Shape         Param #
====================================================================================================
FashionMNISTModel (FashionMNISTModel)    [64, 1, 28, 28]      [64, 10]             --
├─Sequential (block1)                    [64, 1, 28, 28]      [64, 32, 14, 14]     --
│    └─Sequential (0)                    [64, 1, 28, 28]      [64, 32, 28, 28]     --
│    │    └─Conv2d (0)                   [64, 1, 28, 28]      [64, 32, 28, 28]     320
│    │    └─ReLU (1)                     [64, 32, 28, 28]     [64, 32, 28, 28]     --
│    │    └─Conv2d (2)                   [64, 32, 28, 28]     [64, 32, 28, 28]     9,248
│    │    └─ReLU (3)                     [64, 32, 28, 28]     [64, 32, 28, 28]     --
│    │    └─Conv2d (4)                   [64, 32, 28, 28]     [64, 32, 28, 28]     9,248
│    │    └─ReLU (5)                     [64, 32, 28, 28]     [64, 32, 28, 28]     --
│    └─MaxPool2d (1)                     [64, 32, 28, 28]     [64, 32, 14, 14]     --
├─Sequential (block2)                    [64, 32, 14, 14]     [64, 64, 7, 7]       --
│    └─Sequential (0)                    [64, 32, 14, 14]     [64, 64, 14, 14]     --
│    │    └─Conv2d (0)                   [64, 32, 14, 14]     [64, 64, 14, 14]     18,496
│    │    └─ReLU (1)                     [64, 64, 14, 14]     [64, 64, 14, 14]     --
│    │    └─Conv2d (2)                   [64, 64, 14, 14]     [64, 64, 14, 14]     36,928
│    │    └─ReLU (3)                     [64, 64, 14, 14]     [64, 64, 14, 14]     --
│    │    └─Conv2d (4)                   [64, 64, 14, 14]     [64, 64, 14, 14]     36,928
│    │    └─ReLU (5)                     [64, 64, 14, 14]     [64, 64, 14, 14]     --
│    └─MaxPool2d (1)                     [64, 64, 14, 14]     [64, 64, 7, 7]       --
├─Sequential (block3)                    [64, 64, 7, 7]       [64, 128, 7, 7]      --
│    └─Conv2d (0)                        [64, 64, 7, 7]       [64, 128, 7, 7]      73,856
│    └─ReLU (1)                          [64, 128, 7, 7]      [64, 128, 7, 7]      --
│    └─Conv2d (2)                        [64, 128, 7, 7]      [64, 128, 7, 7]      147,584
│    └─ReLU (3)                          [64, 128, 7, 7]      [64, 128, 7, 7]      --
│    └─Conv2d (4)                        [64, 128, 7, 7]      [64, 128, 7, 7]      147,584
│    └─ReLU (5)                          [64, 128, 7, 7]      [64, 128, 7, 7]      --
├─Linear (fc)                            [64, 6272]           [64, 10]             62,730
====================================================================================================
Total params: 542,922
Trainable params: 542,922
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 3.26
====================================================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 67.44
Params size (MB): 2.17
Estimated Total Size (MB): 69.81
====================================================================================================
```
You can, of course, type out all of the individual layers, or you can build the repeating structure programmatically (we'll see more of that next week - your book does this in Chapter 6 on page 209).  Make a model summary to check your work.
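
One way to sanity-check the summary before writing any PyTorch is to reproduce the parameter counts by hand. The sketch below assumes every convolution uses a 3x3 kernel with a bias term, which is an inference from the counts in the table (for example, 320 = 1 * 3 * 3 * 32 + 32) and not something the assignment states outright. The helper names `conv2d_params` and `linear_params` are made up for this illustration.

```python
def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights (in_ch * kernel * kernel per output channel) plus one bias per output channel."""
    return in_ch * kernel * kernel * out_ch + out_ch


def linear_params(in_features, out_features):
    """One weight per input/output pair plus one bias per output."""
    return in_features * out_features + out_features


# (in_channels, out_channels) for the three convolutions in each block,
# read off the Input Shape / Output Shape columns of the summary
blocks = {
    "block1": [(1, 32), (32, 32), (32, 32)],
    "block2": [(32, 64), (64, 64), (64, 64)],
    "block3": [(64, 128), (128, 128), (128, 128)],
}

per_layer = {
    name: [conv2d_params(i, o) for i, o in convs] for name, convs in blocks.items()
}
fc = linear_params(128 * 7 * 7, 10)  # block3 output flattened: 128 * 7 * 7 = 6272
total = sum(sum(counts) for counts in per_layer.values()) + fc

print(per_layer)
print(fc, total)
```

If your own model summary reports a different total, comparing it layer by layer against `per_layer` usually points to the layer with the wrong channel count or kernel size.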


In [3]:
#### Solution


## Set up the data (5 pts)

Load the FashionMNIST dataset.  Normalize with mean 0.2860 and standard deviation 0.3530.  Downsample the train dataset to 10% of its original size to make experimentation quick.  You can use this code for downsampling:

```python
import numpy as np
from torch.utils.data import Subset

np.random.seed(42)  # use this seed for reproducibility
# draw 10% of the training indices without replacement
subset_indices = np.random.choice(len(train_dataset), size=int(0.1 * len(train_dataset)), replace=False)
train_dataset = Subset(train_dataset, subset_indices)
```

Use the FashionMNIST test dataset for your `valid_dataset`.

For the DataLoaders try batch size 64 to start.
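
As a quick check on the numbers involved, the arithmetic below works out what the normalization does to the pixel range and how many batches one epoch contains after downsampling. It is a back-of-the-envelope sketch, not part of the required solution. It assumes pixels have already been scaled to [0, 1], that the FashionMNIST training split has 60,000 images, and that the DataLoader keeps the last partial batch (the PyTorch default).

```python
import math

MEAN, STD = 0.2860, 0.3530   # normalization constants from the assignment
FULL_TRAIN_SIZE = 60_000     # size of the FashionMNIST training split
BATCH_SIZE = 64


def normalize(pixel):
    """Same per-pixel operation a Normalize transform applies: (x - mean) / std."""
    return (pixel - MEAN) / STD


subset_size = int(0.1 * FULL_TRAIN_SIZE)                 # 10% downsample
batches_per_epoch = math.ceil(subset_size / BATCH_SIZE)  # last batch is partial
last_batch_size = subset_size - (batches_per_epoch - 1) * BATCH_SIZE

print(round(normalize(0.0), 4), round(normalize(1.0), 4))
print(subset_size, batches_per_epoch, last_batch_size)
```

The batch count matters later: a per-batch learning rate scheduler needs to know how many steps make up one epoch.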



In [4]:
#### Solution


## Training with SGD (5 pts)

Train your model with Stochastic Gradient Descent.  Track the accuracy metric.  You'll likely need to increase both the learning rate and the number of epochs to see the validation accuracy plateau.  

Make sure to instantiate a fresh model to see complete training results. (Although you could resume from a checkpoint as part of your experimentation.)

In [5]:
#### Solution


Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [6]:
#### Solution


## Training with AdamW (5 pts)

Now repeat the previous training using AdamW.  You should be able to use the default learning rate of 0.001 and fewer epochs.

In [7]:
#### Solution


Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [8]:
#### Solution


#### Compare SGD and AdamW Training Performance

Make plots of validation loss and accuracy for both SGD and AdamW.

In [9]:
#### Solution


## Data Augmentation (5 pts)

Now use data augmentation.  Build a `transform_train` pipeline that includes:
* Random horizontal flips
* Random crops of size 28, padding = 4
* Random rotations up to 10 degrees

Use the same seed to downsample the train_dataset to 10% of its size.

In the next cell, set up the data and augmentation transforms (don't augment the validation data).  Build the DataLoaders.

In [10]:
#### Solution


Train a new instance of your model with the new DataLoaders and AdamW.  Training will take more epochs, so you may have to experiment a little.

In [11]:
#### Solution


Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [12]:
#### Solution


Compare validation loss and accuracy for the three different approaches so far: SGD, AdamW, and AdamW with augmentation.  Make appropriate graphs and comment on the three training strategies in terms of their performance on metrics and overfitting.

In [13]:
#### Solution


#### Solution
REPLACE_WITH_SOLUTION


## Early Stopping (5 pts)

Early stopping isn't really necessary unless the metrics on the validation or test set start to degrade.  Try it anyway just to reinforce how it works.  In this section, implement early stopping based on the validation loss.  Use AdamW and data augmentation.  Add a plot comparing training with and without early stopping.  Comment on the performance with and without early stopping.  Do you get comparable performance?  Add cells in this section as needed.
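
The logic behind early stopping on validation loss is small enough to sketch in plain Python. The class below is a generic illustration and not the course package's implementation. The name `EarlyStopping` and the `patience` and `min_delta` parameters are assumptions chosen for this example.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta        # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0           # a real loop would also save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop you would call `stopper.step(epoch, val_loss)` once per epoch after validation and break out of the loop when it returns `True`, then report results from the epoch stored in `best_epoch`.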

In [14]:
#### Solution


#### Solution
REPLACE_WITH_SOLUTION


## OneCycleLR (5 pts)

Create a new instance of the model.  Implement a OneCycleLR learning rate scheduler and add it to your AdamW approach with data augmentation.
You should be able to use a larger max learning rate of 0.003 or so.  Experiment a little to see if you can get similar results to the above with fewer epochs (you may not be able to).
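
To see what the schedule looks like before committing to a full training run, the sketch below steps a `OneCycleLR` scheduler through a dry run and records the learning rate at each batch. The tiny `nn.Linear` model is only a stand-in so the optimizer has parameters, and the values of `epochs` and `steps_per_epoch` are hypothetical (in your notebook, `steps_per_epoch` would be `len(train_loader)`). The scheduler's other settings are left at their PyTorch defaults.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in model, only to give the optimizer parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

epochs, steps_per_epoch = 10, 94  # hypothetical values for illustration
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.003, epochs=epochs, steps_per_epoch=steps_per_epoch
)

lrs = []
for _ in range(epochs * steps_per_epoch):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()   # a real loop runs the forward and backward pass before this
    scheduler.step()   # OneCycleLR is stepped once per batch, not once per epoch

print(min(lrs), max(lrs))
```

Plotting `lrs` against the step index shows the warm-up to `max_lr` followed by the long anneal. Note that the scheduler overrides the `lr` passed to the optimizer, so the starting rate is derived from `max_lr` instead.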

In [15]:
#### Solution


Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [16]:
#### Solution


Make a plot comparing the validation losses and accuracies for all of the training approaches above (there should be 4).
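
A small helper keeps comparison plots like this consistent across runs. The sketch below assumes you have already pulled the per-epoch validation metrics out of each checkpoint into plain lists. The function name `plot_validation_curves` and the dictionary keys `"val_loss"` and `"val_acc"` are hypothetical, so adapt them to whatever your checkpoint loader returns.

```python
import matplotlib.pyplot as plt


def plot_validation_curves(histories):
    """Plot validation loss and accuracy side by side.

    histories: {run_name: {"val_loss": [...], "val_acc": [...]}}, one value per epoch.
    Runs may have different numbers of epochs.
    """
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    for name, history in histories.items():
        epochs = range(1, len(history["val_loss"]) + 1)
        ax_loss.plot(epochs, history["val_loss"], label=name)
        ax_acc.plot(epochs, history["val_acc"], label=name)
    ax_loss.set(xlabel="Epoch", ylabel="Validation loss", title="Validation loss")
    ax_acc.set(xlabel="Epoch", ylabel="Validation accuracy", title="Validation accuracy")
    ax_loss.legend()
    ax_acc.legend()
    fig.tight_layout()
    return fig
```

Calling it with one entry per training approach puts all runs on shared axes, which makes differences in convergence speed and final accuracy easier to read than separate figures.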

In [17]:
#### Solution


Which approach works best?   Why?  

#### Solution
REPLACE_WITH_SOLUTION


## Use best approach on full dataset (5 pts)

Take your best approach and apply it to the full dataset.  (Don't downsample)

This will take a little more than a minute per epoch so run your experiments with the smaller dataset above, then run this once.  You can use `resume_from_checkpoint = True` if you want to extend the training.

How does this compare to the performance you achieved in HW 2?  Import your best run from HW 2 and make a plot comparing the performance of your best approach from this assignment to your approach from HW 2.  You might need to quickly retrain your HW 2 model using the `val_loader` instead of the `test_loader` in `train_network`.

Add code and markdown cells below as needed. 


In [18]:
#### Solution


In [19]:
#### Solution
