# Deep Learning with MXNet Gluon - Assignment 2

## Assignment Description


Welcome to the Deep Learning with MXNet/Gluon Week 2 assignment. This assignment focuses on computer vision and using gluon-cv. In the first question, you will implement convolution from scratch and answer some questions about convolution. In the remaining questions, you will finetune an image classification model on a new dataset and train an object detection model on a second dataset.

### Supplemental Reading
* [Convolutions Explained (Blog)](https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c)
* [Convolutional Networks (Dive into deep learning)](https://d2l.ai/chapter_convolutional-neural-networks/index.html)
* [ConvNet Architectures (Dive into deep learning)](https://d2l.ai/chapter_convolutional-modern/index.html)
* [Deep Learning Computation (Dive into deep learning)](https://d2l.ai/chapter_deep-learning-computation/index.html)

In [None]:
!pip install gluoncv

## Convolution from Scratch
In class, we talked about convolution as the way we impose the properties of spatial locality and translation invariance when learning parameters for models that deal with image inputs. We can write the convolution operation used in deep learning for a single channel input and single channel output as:

$$Y[i, j] = \sum_{a=0}^{m-1}\sum_{b=0}^{n-1}X[i+a, j+b]\cdot K[a, b]$$

where $X$ is the 2D input, $K$ is the 2D kernel of shape $(m, n)$, and $Y$ is the output.
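
As a quick check of the formula, here is a worked example with made-up values (not the ones used in Question 1), assuming zero-based indices and a kernel of shape $(2, 2)$:

$$X = \begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}, \quad K = \begin{bmatrix}2 & 1\\ 0 & -1\end{bmatrix}$$

$$Y[0, 0] = X[0,0]K[0,0] + X[0,1]K[0,1] + X[1,0]K[1,0] + X[1,1]K[1,1] = 1\cdot 2 + 2\cdot 1 + 4\cdot 0 + 5\cdot(-1) = -1$$

$$Y[1, 1] = 5\cdot 2 + 6\cdot 1 + 8\cdot 0 + 9\cdot(-1) = 7$$

Each output element is the sum of elementwise products between the kernel and the input window whose top-left corner sits at $(i, j)$.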

## Question 1
Write a function that takes in a 2D `ndarray` input and a kernel and performs 2D convolution on the input.

What is the shape of the output in terms of the shape of the kernel and the shape of the input?
What if you add padding and stride? How does that change the output shape?

In the code below, look at the input and imagine it is a black-and-white image. How would you interpret the output feature map? What if you didn't know K ahead of time? Describe how you would compute a K that gives similar results.

In [None]:
from mxnet import nd

## Convolution
def conv(X, K):
    # Your code here
    

X = nd.ones((6, 8))
X[:, 2:6] = 0
print(X)
K = nd.array([[1, -1]])
Y = conv(X, K)
print(Y)

## Hot Dog or Not Hot Dog

If you're a fan of the HBO show Silicon Valley, then you know the [gag](https://www.youtube.com/watch?v=ACmydtFDTGs) about a food app that one of the show's characters creates to recognize whether a picture contains a hot dog. We will implement this functionality in Gluon by finetuning ResNet on a hot dog dataset. The hot dog dataset we use was collected from online images and contains 1,400 positive images containing hot dogs and the same number of negative images containing other foods. 1,000 images of each class are used for training and the rest are used for testing.

In [None]:
import mxnet as mx
from mxnet import gluon, init, nd, autograd
from mxnet.gluon import data as gdata, loss as gloss
from mxnet.gluon import utils as gutils
import os
import zipfile
from matplotlib import pyplot as plt


data_dir = 'data'
if not os.path.exists('data/hotdog/train'):
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    base_url = 'https://apache-mxnet.s3-accelerate.amazonaws.com/'
    fname = gutils.download(
        base_url + 'gluon/dataset/hotdog.zip',
        path=data_dir, sha1_hash='fba480ffa8aa7e0febbb511d181409f899b9baa5')
    with zipfile.ZipFile(fname, 'r') as z:
        z.extractall(data_dir)

train_dataset = gdata.vision.ImageFolderDataset(os.path.join(data_dir, 'hotdog/train'))
test_dataset = gdata.vision.ImageFolderDataset(os.path.join(data_dir, 'hotdog/test'))

def show_images(imgs, num_rows, num_cols, scale=2):
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
    for i in range(num_rows):
        for j in range(num_cols):
            axes[i][j].imshow(imgs[i * num_cols + j].asnumpy())
            axes[i][j].axes.get_xaxis().set_visible(False)
            axes[i][j].axes.get_yaxis().set_visible(False)

hotdogs = [train_dataset[i][0] for i in range(8)]
not_hotdogs = [train_dataset[-i - 1][0] for i in range(8)]
show_images(hotdogs + not_hotdogs, 2, 8)

## Question 2
As in the last homework, the training and validation datasets are stored in a Gluon `dataset`, and we will have to apply some augmentation to them.

Write code that uses `transforms.Compose` to perform augmentations and preprocessing on the images in the dataset. 

On the training data perform the following augmentations and normalization.

* Random resizing and crop to 224 x 224 pixels
* Random horizontal (left, right) flips
* Convert to image tensor representation
* and normalize

To ensure our results are reproducible, we will not apply any random augmentations to our test dataset. Instead, apply the following transformations.

* Resize to 256 x 256 pixels
* Crop to 224 x 224 pixels
* Convert to image tensor representation
* and normalize
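
To make the final "normalize" step concrete, the sketch below applies per-channel normalization to a single RGB pixel in plain Python. It assumes pixel values have already been scaled to the range [0, 1] by the tensor conversion, and the helper name `normalize_pixel` is hypothetical. It reuses the channel means and standard deviations given in the code cell below.

```python
# Per-channel statistics, same values as in the assignment cell below.
normalize_means = [0.485, 0.456, 0.406]
normalize_stdevs = [0.229, 0.224, 0.225]

def normalize_pixel(pixel, means, stdevs):
    """Return (value - mean) / stdev for each RGB channel of one pixel.

    Hypothetical helper for illustration; assumes values are in [0, 1].
    """
    return [(v - m) / s for v, m, s in zip(pixel, means, stdevs)]

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(normalize_means, normalize_means, normalize_stdevs))
# A mid-gray pixel ends up slightly above zero in each channel.
print(normalize_pixel([0.5, 0.5, 0.5], normalize_means, normalize_stdevs))
```

This only illustrates the arithmetic; in your solution the normalization should be one of the transforms in the pipeline rather than a hand-written loop.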

Write code to create dataloaders for the training and test dataset and make sure the training dataloader shuffles the data at each epoch.

In [None]:
# We specify the mean and standard deviation of each of the three RGB channels to normalize the image.

batch_size = 128
normalize_means = [0.485, 0.456, 0.406]
normalize_stdevs = [0.229, 0.224, 0.225]

# Your code here 

## Finetuning ResNet
Now that we have applied the necessary transforms and augmentations to our image dataset, we are ready to train the model. Instead of training a model from scratch, we will take a pretrained model and finetune it on our dataset. Here is how that works.

* We pre-train a neural network model, i.e. the source model, on a source data set. In our case, we will be using the ResNet-18 network pretrained on ImageNet. This is available in gluon-cv.

* We modify only the output layer of the network so that it has the correct number of outputs. The ImageNet dataset has 1000 classes, but our hotdog/not hotdog dataset needs only 2. The assumption here is that the model parameters already contain knowledge learned from ImageNet and that this knowledge will carry over to our hotdog or not hotdog training set, so the existing parameters are a good initialization point for every layer except the output layer.

* We randomly initialize the model parameters of the output layer.

* Finally, we train the model on our hotdog dataset. The output layer is trained from scratch, while the parameters of all remaining layers are finetuned starting from the parameters of the source model.
  


## Question 3
Write code to finetune a `ResNet18_v2` model trained on ImageNet to our hotdog dataset. You will need a GPU to finetune this model; otherwise training will be very slow. Use the following hyperparameters.

* learning rate: 0.01
* weight decay: 0.001
* learning rate multiplier (`lr_mult`): 10. Use this only on the parameters of the output layer. You can set it by running something like `resnet.output.collect_params().setattr('lr_mult', 10)`.
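
The effect of the learning rate multiplier can be sketched in plain Python. This is a simplified model that assumes a plain SGD rule with weight decay and no momentum; the function `sgd_update` and the sample weight and gradient values are hypothetical, and this is not MXNet's actual optimizer code.

```python
def sgd_update(weight, grad, lr, wd, lr_mult=1.0):
    """One plain SGD step with weight decay; lr_mult scales the learning rate."""
    effective_lr = lr * lr_mult
    return weight - effective_lr * (grad + wd * weight)

lr, wd = 0.01, 0.001   # hyperparameters from the list above
w, g = 1.0, 0.5        # made-up weight and gradient

body_w = sgd_update(w, g, lr, wd)               # pretrained layers: lr_mult = 1
head_w = sgd_update(w, g, lr, wd, lr_mult=10)   # output layer: lr_mult = 10

# The output-layer weight moves ten times as far for the same gradient.
print(w - body_w, w - head_w)
```

The larger step lets the randomly initialized output layer catch up quickly, while the pretrained layers change only slightly.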

Use `SoftmaxCrossEntropyLoss` as your loss function and the `sgd` optimizer.

Don't forget to initialize your network parameters on the GPU if you are using a GPU to train. Similarly, in your training loop, don't forget to copy your data batch to the GPU before the forward pass.

Run your finetuning for 5 epochs. You should already see good results, with training accuracy above 0.85.
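
Note that the `acc` helper in the cell below returns the number of correct predictions in a batch, not a fraction. One way to turn those counts into an epoch-level accuracy is sketched here in plain Python; the function name `epoch_accuracy` and the batch counts are made up for illustration.

```python
def epoch_accuracy(batches):
    """batches: iterable of (num_correct, batch_size) pairs for one epoch."""
    total_correct = 0
    total_seen = 0
    for num_correct, batch_size in batches:
        total_correct += num_correct
        total_seen += batch_size
    return total_correct / total_seen

# Made-up counts for three batches, the last one smaller than the others.
print(epoch_accuracy([(110, 128), (115, 128), (70, 80)]))
```

Dividing the summed counts by the number of samples seen avoids over-weighting a smaller final batch, which averaging per-batch accuracies would do.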

Try to get images of hotdogs from the internet and preprocess them so that you can make predictions on them using the network. How well does the network do? Can you think of ways to fool the network? Report your findings on what kinds of images successfully fool the network.

In [None]:
import gluoncv as gcv
from time import time

def acc(output, label):
    return (output.argmax(axis=1) == label.astype('float32')).sum().asscalar()

model_name = 'ResNet18_v2'
ctx = mx.gpu(0)

# Your code here

## Pikachu Dataset
In class, we used the Where's Waldo game to motivate some of the properties of how humans process visual information. Now, instead of finding Waldo, we will be finding Pikachu using object detection.

In [None]:
from mxnet.gluon import utils as gutils
from gluoncv.utils import viz
import os

def _download_pikachu(data_dir):
    root_url = ('https://apache-mxnet.s3-accelerate.amazonaws.com/'
                'gluon/dataset/pikachu/')
    dataset = {'train.rec': 'e6bcb6ffba1ac04ff8a9b1115e650af56ee969c8',
               'train.idx': 'dcf7318b2602c06428b9988470c731621716c393',
               'val.rec': 'd6c33f799b4d058e82f2cb5bd9a976f69d72d520'}
    for k, v in dataset.items():
        gutils.download(root_url + k, os.path.join(data_dir, k), sha1_hash=v)

data_dir = 'data/pikachu'
_download_pikachu(data_dir)
train_dataset = gcv.data.RecordFileDetection('data/pikachu/train.rec')
classes = ['pikachu']  # only one foreground class here
# display some images
for i in range(3):
    image, label = train_dataset[i]
    ax = viz.plot_bbox(image, bboxes=label[:, :4], labels=label[:, 4:5], class_names=classes)
    plt.show()

## Object Detection with SSD
We will train an object detection model on the Pikachu dataset that we just loaded. We will be using the Single Shot Multibox Detector (SSD) model. As described in the lecture, SSD consists of a base network for feature extraction (in this instance we will be using MobileNet because of its low footprint, but it is very common to use ResNet or VGG) and multiscale feature blocks connected in series.

## Question 4

Write code to train the SSD model on the Pikachu dataset prepared above. The dataloader for SSD is quite complex and has been implemented for you. Here you simply have to write code that, for each training epoch, loops over the provided `train_data`. Use `gluoncv.loss.SSDMultiBoxLoss` as your loss function. Use the `sgd` optimizer with the following hyperparameters:
* learning rate: 0.001
* weight decay: 0.0005
* momentum: 0.9

As always, if you're training using GPU (and you should be), ensure that your model parameters and training data batch all live on the GPU during training.

Note that `SSDMultiBoxLoss` already normalizes the loss, so when you call your trainer step, you can effectively treat the batch size as 1.
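
The reason for this is sketched below in plain Python. It is a simplified model, assuming that the trainer step rescales gradients by `1 / batch_size` and ignoring momentum and weight decay; the function `trainer_step_update` and the sample values are hypothetical.

```python
def trainer_step_update(weight, grad, lr, batch_size):
    """Simplified model of a trainer step: gradient scaled by 1 / batch_size."""
    return weight - lr * (grad / batch_size)

lr = 0.001
w, g = 1.0, 0.8   # g: made-up gradient of a loss that is already averaged

right = trainer_step_update(w, g, lr, batch_size=1)    # no second rescaling
wrong = trainer_step_update(w, g, lr, batch_size=16)   # update shrunk 16x
print(w - right, w - wrong)
```

Passing the real batch size on top of an already normalized loss would divide the gradient twice and make training much slower than the learning rate suggests.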

This can take a long time to train, so after two epochs of training, save your network parameters to disk with the name `'ssd_512_mobilenet1.0_pikachu.params'`.

In [None]:
from gluoncv.data.batchify import Tuple, Stack, Pad
from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
    
def get_dataloader(net, train_dataset, data_shape, batch_size, num_workers, ctx):
    width, height = data_shape, data_shape
    #generate fixed anchors for target generation
    with autograd.train_mode():
        _, _, anchors = net(mx.nd.zeros((1, 3, height, width)))
    batchify_fn = Tuple(Stack(), Stack(), Stack())  # stack image, cls_targets, box_targets
    train_loader = gluon.data.DataLoader(
        train_dataset.transform(SSDDefaultTrainTransform(width, height, anchors)),
        batch_size, shuffle=True, batchify_fn=batchify_fn, last_batch='rollover', num_workers=num_workers)
    return train_loader

ssd_model = 'ssd_512_mobilenet1.0_voc'
ctx = mx.gpu(0)
net = gcv.model_zoo.get_model(ssd_model, pretrained=True)
net.reset_class(classes)
train_data = get_dataloader(net, train_dataset, 512, 16, 4, ctx)
net.collect_params().reset_ctx(ctx)
net.hybridize(static_alloc=True, static_shape=True)

# Your code here

In [None]:
net.save_parameters('ssd_512_mobilenet1.0_pikachu.params')

test_url = 'https://raw.githubusercontent.com/zackchase/mxnet-the-straight-dope/master/img/pikachu.jpg'
gutils.download(test_url, 'data/pikachu_test.jpg')
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_custom', classes=classes, pretrained_base=False)
net.load_parameters('ssd_512_mobilenet1.0_pikachu.params')
x, image = gcv.data.transforms.presets.ssd.load_test('data/pikachu_test.jpg', 512)
cid, score, bbox = net(x)
ax = viz.plot_bbox(image, bbox[0], score[0], cid[0], class_names=classes)
plt.show()