
# Homework 1: Autoencoding MNIST and Celebrity Faces

**Due Date: May 1st, 2020 @ 10:00am (before class)**

Please turn in this completed notebook and requested material to kangway@keiserlab.org.

**Collaboration Policy and More** 

You're welcome (and highly encouraged) to work on and discuss this homework assignment with others in the class, and feel free to use any resources (textbooks, online notebooks, etc.). The only requirement is that the final notebook you turn in must be your own written work (no copying and pasting, please).

**Overview**

In class, we covered how Hinton and Salakhutdinov's 2006 *Science* paper, ["Reducing the Dimensionality of Data with Neural Networks"](https://www.cs.toronto.edu/~hinton/science.pdf), was one of the first demonstrations of unsupervised pretraining for use in training deep neural networks. In this homework, we'll implement autoencoders in the context of MNIST. Additionally, as an optional assignment, a similar architecture can be used for a subset of CelebA celebrity faces provided in the same directory.

In [0]:
from google.colab import drive
drive.mount('/content/drive')



## Before you get started

**1) Background Reading**

Please read Hinton and Salakhutdinov's seminal 2006 work on deep autoencoders (https://www.cs.toronto.edu/~hinton/science.pdf), as this notebook aims to recreate this important work. A few questions to think about as you read that will help you in this assignment:
  - What architecture do they use for their deep autoencoders?
  - Why were deep neural networks so much harder to train in 2006?

**2) How to run this notebook**

This Jupyter Notebook can be used in two ways:
* *Option 1: Download the notebook*

  We've included all the imports necessary for this homework. Please make sure you're running Python 3 with PyTorch (and Torchvision) 1.0 installed and ready to go, along with NumPy and Matplotlib. Although you might find that these models train a bit faster on GPU, this homework assignment should be doable on most modern laptops. If you're having trouble please let us know ASAP.

* *Option 2: Run it online on Google Colaboratory*

  - Colab gives access to a GPU, so it could be useful in case you don't have CUDA installed on your computer (**Note: you can use this as an opportunity to get started on GPU training, but we recommend you develop your model and make sure everything works on CPU first**)
  - Make a copy of this notebook in your Google Drive folder: "File" -> "Save a copy in Drive..."
  - By default, Colab does not make GPUs available, but you can easily access them by selecting GPU in "Runtime" -> "Change runtime type..."
  - Remember that Colab runs in a temporary virtual machine, so all the data created while running the notebook will be lost at the end of the session, or when the runtime disconnects due to inactivity. We provide functions to download and re-upload data.
  - If you don't want to keep downloading/uploading files, you can very easily link Colab to your personal Google Drive by mounting it on your runtime, see instructions [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA).

**3) How to complete this assignment**

  - Fill out the relevant code blocks as indicated
  - Answer questions by writing them directly in the text block. Please keep your written answers concise; most are meant to be answered in a sentence or two.
  - Provide your best trained model along with intermediate checkpoints, as explained below.

**4) Optional exercise: CelebA Data** 

Whereas MNIST is a toy dataset built into PyTorch, we can also examine a more complex feature space using a subset of 90,000 celebrity portraits from CelebA (see [Liu et al. (2014), "Deep Learning Face Attributes in the Wild"](https://arxiv.org/abs/1411.7766)). This is an optional part of the homework, but is a nice way to see how autoencoders perform on other types of visual data. There will be a .zip file of the relevant celebrity faces dataset on the Google Classroom link.

***Let's start!***

---

## Train an autoencoder on MNIST

In [0]:
### Import all the necessary libraries
import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

from IPython.display import Image, display

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
from torchvision import datasets, transforms
from torchvision.utils import save_image

You shouldn't need CUDA for this assignment, but if you want a head start, or if you just want to see the difference between using a CPU versus a GPU, set `use_cuda = True` below. 
You can check if CUDA is available on your computer with: `torch.cuda.is_available()`

If you are working on Colab, make sure to activate the GPU ("Runtime" -> "Change runtime type...").
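
As a small sketch, the availability check can also drive the flag directly, so that the same cell runs on machines with or without a GPU. The tensor `x` below is a made-up example and not part of the assignment.

```python
import torch

# Use the GPU only if one is actually available; otherwise fall back to CPU.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Tensors (and models) have to be moved to the chosen device explicitly.
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```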

In [0]:
use_cuda = False
device = torch.device("cuda" if use_cuda else "cpu")

In [0]:
torch.manual_seed(7);

> **Question 0.1) Why is it important to set the seed for the random number generator?**

*Double-click to add your answer...*



### 1. MNIST Dataset

As noted in class, MNIST has been widely used to benchmark new deep learning architectures and is already built into PyTorch. We provide this data as a starting point, again noting that the mean and std of the training set are calculated to be 0.1307 and 0.3081, respectively.
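
To make the role of these two constants concrete, here is a minimal sketch of what `transforms.Normalize` does to each pixel: `(x - mean) / std`. It uses random stand-in images rather than MNIST, so the printed statistics belong to the stand-in data and are not 0.1307 and 0.3081. The name `fake_images` is hypothetical.

```python
import torch

torch.manual_seed(0)

# Stand-in for a stack of images already scaled to [0, 1] by ToTensor().
fake_images = torch.rand(100, 1, 28, 28)

# Statistics over all pixels of all images (single channel).
mean = fake_images.mean()
std = fake_images.std()

# The same arithmetic that Normalize((mean,), (std,)) applies per pixel.
normalized = (fake_images - mean) / std

print(f"before: mean={mean.item():.4f}, std={std.item():.4f}")
print(f"after:  mean={normalized.mean().item():.4f}, "
      f"std={normalized.std().item():.4f}")
```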

In [0]:
preprocessing = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(
    './bmi219_downloads', train=True, download=True,
    transform=preprocessing)
                               
test_dataset = datasets.MNIST(
    './bmi219_downloads',train=False, download=True, 
    transform=preprocessing)

> **Q1.1) How many examples do the training set and test set have?**

...

> **Q1.2) What's the format of each input example? Can we directly put these into a fully-connected layer?**

...

> **Q1.3) Why do we normalize the input data for neural networks?**

...

> **Q1.4) In this scenario, MNIST is already split into a training set and a test set. What is the purpose of dataset splitting (and specifically, the purpose of a test set)? For modern deep learning, a three-way split into training, validation, and test sets is usually preferred. Why?**

...
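
For reference, a validation set can be carved out of a training set with `torch.utils.data.random_split`. The sketch below uses a small stand-in dataset and an arbitrary 80/20 proportion, both of which are assumptions for illustration; with MNIST you would split `train_dataset` and leave `test_dataset` untouched.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 10 fake images with labels 0..9.
toy_dataset = TensorDataset(torch.rand(10, 1, 28, 28), torch.arange(10))

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(7)
toy_train, toy_val = random_split(toy_dataset, [8, 2], generator=generator)

print(len(toy_train), len(toy_val))
```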

### 2. Using DataLoaders for MNIST

Set up the DataLoader objects below. Although the arguments are prepopulated, you may need to change the batch sizes or other arguments during training.

In [0]:
BATCH_SIZE = 1  ### Please change this as necessary
NUM_WORKERS = 1  ### Use more workers for more CPU threads

In [0]:
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE, 
    shuffle=True,
    num_workers=NUM_WORKERS)

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE, 
    shuffle=False,
    num_workers=NUM_WORKERS)

As you begin training your model, you may need to adjust the batch size or experiment with several of these different parameters.
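
To see how the batch size shapes what a DataLoader yields, here is a minimal sketch on a stand-in dataset of 10 random "images" (the names `toy_dataset` and `toy_loader` are hypothetical). `num_workers=0` keeps loading in the main process, which is the simplest setting for a quick experiment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 10 fake single-channel 28x28 images with fake labels.
images = torch.rand(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))
toy_dataset = TensorDataset(images, labels)

toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, num_workers=0)

# 10 samples with a batch size of 4 gives batches of 4, 4, and 2.
batch_shapes = [tuple(x.shape) for x, y in toy_loader]
print(len(toy_loader), batch_shapes)
```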

> **Q2.1) It's recommended to shuffle the training data over each epoch, but this isn't typically the case for the test set. Why?**

...

> **Q2.2) What seems to be a good batch size for training? What happens if you train with a batch size of 1? What about a batch size equal to the total training set?**

...
 
> **Q2.3) The PyTorch DataLoader object is an iterator that generates batches as it's called. Try to pull a few images from the training set to see what these images look like. Does the DataLoader return only the images? What about the labels?**

...

### 3. Define your neural network architecture

With your data and dataloaders appropriately set, you're ready to define a network architecture. In this homework, we'll ask you to evaluate two different architectures.

For the first (we'll call it HNet in this homework), please implement Hinton's 2006 architecture with 7 hidden layers:

`[1000 x 500 x 250 x 2 x 250 x 500 x 1000]`

For the second, implement your own autoencoder architecture, again using a bottleneck dimension of 2. As a note, the larger your model, the longer it will take to train. Can you achieve similar performance to the model above using a more condensed model?

(As a reminder, what is the size of the inputs to these neural networks?)

Try to vary the type of activation functions (tanh, sigmoid, relu, etc). You may also find building modular Sequential Layers helpful (nn.Sequential).
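
As an illustration of the `nn.Sequential` pattern, here is a deliberately tiny autoencoder. It is not the HNet architecture: the layer sizes, the ReLU activation, and the name `ToyAutoencoder` are all placeholder assumptions. It mainly shows how an encoder and a decoder can be kept as separate modules, and how images are flattened before entering a fully-connected layer.

```python
import torch
import torch.nn as nn


class ToyAutoencoder(nn.Module):
    """One hidden layer on each side of the bottleneck (illustrative sizes)."""

    def __init__(self, n_inputs=28 * 28, n_hidden=64, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_inputs),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


toy_model = ToyAutoencoder()

# Flatten (N, 1, 28, 28) images into (N, 784) vectors before the first layer.
flat = torch.rand(5, 1, 28, 28).view(5, -1)
latent = toy_model.encoder(flat)
out = toy_model(flat)
print(flat.shape, latent.shape, out.shape)
```

Keeping `encoder` as its own attribute makes it easy to call only the first half of the network later, when you need latent embeddings.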

In [0]:
class HNet(nn.Module):
    def __init__(self):
        super(HNet, self).__init__()
        ### Implement a version of Hinton's 2006 Autoencoder
        ### Using a bottleneck latent dimension of 2
        pass

    def forward(self, x):
        ### *** Fill this in as necessary ***
        pass

In [0]:
class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        ### Fill in your network architecture here!
        pass
    
    def forward(self, x):
        ### *** Fill this in as necessary ***
        pass

> **Q3.1) What activation functions did you use, and why?**

...

### 4. Write your own training function

Write your own training function that takes your **model**, an **optimizer**, and a **training criterion**, and iterates over the **training set**. 
* *Hint*: Because an autoencoder is a form of unsupervised learning, we won't need to use the labels like in the MNIST classification example. Keep in mind the format of the images and whether they're compatible with feed-forward networks.
* For each epoch, print and record (in an array or list) the training loss.
* You may want to store the model and its weights on file at regular intervals ("checkpoints"). In order to visualize the autoencoder's learning process, we suggest saving at least three timepoints: early, intermediate, and final (for instance, if your model converges after 60 epochs, save your model at 5 epochs, 30 epochs, and 60 epochs). 
* **Save your best trained model and provide those checkpoints along with your final homework. There are more directions below on uploading/downloading, and also on saving directly to your Google Drive folder.**

As a side note, PyTorch offers tremendous flexibility in how to train your neural networks. Although we'll write our own training function here, there are many ways to do this: as a method written into your Net() class, as its own Class, etc.

A few other useful tips: 
- Feel free to look at the MNIST notebooks on dense networks, as well as with CNNs as a guide.
- Printing out the intermediate variables to understand what's happening in each step can be helpful. 
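
One possible way to write checkpoints is sketched below, using `torch.save` on the model's and optimizer's `state_dict()`. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the `model-<epoch>.pth` naming scheme are assumptions made for this example; any consistent scheme works.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim


def save_checkpoint(model, optimizer, epoch, directory="."):
    """Save model and optimizer state to '<directory>/model-<epoch>.pth'."""
    path = os.path.join(directory, f"model-{epoch}.pth")
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)
    return path


def load_checkpoint(path, model, optimizer=None):
    """Restore a checkpoint written by save_checkpoint; return its epoch."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"]


# Round trip with a tiny stand-in model, written to a temporary directory.
toy_model = nn.Linear(4, 2)
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = save_checkpoint(toy_model, toy_optimizer, epoch=5,
                                directory=tmp_dir)
    restored = nn.Linear(4, 2)
    restored_epoch = load_checkpoint(ckpt_path, restored)

print(restored_epoch, torch.equal(restored.weight, toy_model.weight))
```

Saving the `state_dict()` rather than the whole model object means the class definition must be available when loading, but the file stays independent of how the notebook is organized.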

In [0]:
def train(train_loader, model, optimizer, criterion, 
          n_epochs=10, **kwargs):
    """
    Define your training loop in this function
    """
    pass

### 5. Define your optimization and evaluation criterion

Define an optimizer and criterion (loss function) for your neural network training. To set up your optimizer, you'll have to instantiate your models above and choose a learning rate. Try a few different optimizers and learning rates to get a sense of what will train within a reasonable timeframe (if your deep network isn't too deep, reaching convergence shouldn't take more than 5-10 minutes with the right choice of learning rate and optimizer).

> **Q5.1) What loss function is suited to this problem?**

...

> **Q5.2) Try a few optimizers. What seemed to work best?**

...

> **Q5.3) What's the effect of choosing different batch sizes?**

...

In [0]:
### Instantiate your model
### Define your loss function (training criterion)
### Choose your optimizer

### 6. Run your training loop and start training

It's a great idea to monitor the early epochs of your training ("babysit your training") to keep an eye on learning. Does the learning rate seem too high? Too low?

(**Hint: it's recommended that you just test a single epoch at a time while you write your training function, to debug and make sure everything is working appropriately.**)
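
A common debugging step, sketched below under stated assumptions, is to fit a single batch repeatedly and confirm that the loss goes down before running full epochs. The helper `losses_on_single_batch`, the tiny linear model, the random batch, and the optimizer settings are all stand-ins. `nn.L1Loss` is used purely as a placeholder criterion and is not a recommendation for the assignment.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def losses_on_single_batch(model, criterion, optimizer, batch, n_steps=50):
    """Fit one batch repeatedly (input == target); return the loss per step."""
    losses = []
    for _ in range(n_steps):
        optimizer.zero_grad()                  # clear gradients from last step
        loss = criterion(model(batch), batch)  # reconstruction vs. input
        loss.backward()                        # backpropagate
        optimizer.step()                       # update weights
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
toy_batch = torch.rand(8, 16)    # 8 fake, already-flattened examples
toy_model = nn.Linear(16, 16)    # stand-in for an autoencoder
toy_losses = losses_on_single_batch(
    toy_model,
    nn.L1Loss(),                 # placeholder criterion
    optim.Adam(toy_model.parameters(), lr=1e-2),
    toy_batch)

print(f"first step: {toy_losses[0]:.4f}, last step: {toy_losses[-1]:.4f}")
```

If the loss does not decrease even on one batch, the problem is more likely in the training loop or the input shapes than in the choice of hyperparameters.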

In [0]:
### Set a number of training epochs and train your model.

In your training loop, we requested that you store your training loss for each epoch. Using Matplotlib, please plot your training loss as a function of epoch.

In [0]:
### plot loss curve using Matplotlib

> **Q6.1)  How do you know when your network is done training?**

...



Another way to check if your models (HNet and MyNet) are well trained is to plot a few image reconstructions to see how well your models do. 

In [0]:
# extract 6 figures from training DataLoader
mini_batch, _ = next(iter(train_loader))
n_examples = min(6, mini_batch.shape[0])
examples = mini_batch[:n_examples]

# compute reconstructions
# (`model` is whichever trained network you want to inspect, e.g. your HNet)
with torch.no_grad():
    reconstr_examples = model(
        examples.view(n_examples, -1).to(device))

# save image with original v. reconstructed images
comparison = torch.cat([
    examples,
    reconstr_examples.view(-1, 1, 28, 28).cpu()])
save_image(comparison.cpu(), 'training_reconstruction.png', nrow=n_examples)

In [0]:
Image('training_reconstruction.png', width=800)

> **Q6.2) What does `torch.no_grad()` do?**

...

### [Only if running in Colab] Download/upload your saved checkpoints

**You can skip this section if you are running this notebook locally on your computer.**


Google Colab stores data in folder `/content`:

In [0]:
%%bash
pwd

In [0]:
!ls


If, after some training runs, you feel like you trained a good model and would like to store it, follow the instructions in this subsection. Next time you reopen Colab, you can reupload your best trained model and skip the training step!

**Pro tip:** Another option is to link Colab to your personal Google Drive (if you have it). Your Drive folder will then be accessible directly from Colab and you can save your data in it. No need to keep downloading and reuploading files!

In [0]:
# Mount your personal Drive folder in Colab
# (follow instructions, you'll need an authorization code)
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
# Helper functions for downloading/uploading data to Colab

import os
import google.colab as colab
import tarfile

def download_files(files):
    '''
    Download a file, or a list of files, bundled into a
    .tar archive named 'backup.tar'
    '''
    if not isinstance(files, (list, tuple)):
        files = [files]
    try:
        # bundle files into a tar archive (mode 'w' does not compress)
        with tarfile.open('backup.tar', 'w') as tar:
            for f in files:
                tar.add(f)
        # start download
        colab.files.download('backup.tar')
    finally:
        # clean up the archive, but only if it was actually created
        if os.path.exists('backup.tar'):
            os.remove('backup.tar')

def upload_files():
    '''
    Open browser dialogue for uploading files to Colab folder.
    If a single .tar archive is uploaded, it will be uncompressed.
    '''
    d = colab.files.upload()
    if len(d) == 1:
        filename = list(d.keys())[0]
        if filename.endswith('.tar'):
            tar = tarfile.open(filename)
            tar.extractall()
            tar.close()
            print(f'{filename} has been uploaded and '
                  'uncompressed successfully.')
    return

In [0]:
### Download file(s):
download_files(['model-59.pth', 'train_loss.npy'])

In [0]:
# Re-upload a file (e.g. 'backup.tar') from your computer to Colab
upload_files()

### 7. Visualize the learning process

We'll next try to visualize how well the model is learning on the **test set**. To do this, we'll first visualize the "learning process" by viewing reconstructions.

* Using the models you saved in the last step, plot a batch of original images from the test set, and their corresponding reconstructions from each saved checkpoint. You should see the quality of the reconstructions improve over time.
* To visualize images, you can either use the same approach as for the training examples, or use the helper functions provided below. Note that the latter approach is more correct because it removes the preprocessing transformations introduced when creating the DataLoader.
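
The "removing the preprocessing" step amounts to inverting the normalization, `x = x_norm * std + mean`, and clipping to the displayable range. A minimal sketch, using the MNIST constants from this notebook and a hypothetical helper named `unnormalize`:

```python
import torch

MNIST_MEAN, MNIST_STD = 0.1307, 0.3081


def unnormalize(images, mean=MNIST_MEAN, std=MNIST_STD):
    """Invert Normalize((mean,), (std,)) and clamp values to [0, 1]."""
    return (images * std + mean).clamp(0, 1)


# Round trip on stand-in pixel values in [0, 1].
pixels = torch.rand(4, 1, 28, 28)
normalized = (pixels - MNIST_MEAN) / MNIST_STD
recovered = unnormalize(normalized)

print(torch.allclose(recovered, pixels, atol=1e-5))
```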


In [0]:
### Optional Helper Functions for Plotting Multiple Images

def imshow(inp, 
           figsize=(10,10),
           mean=0.1307, # for MNIST train
           std=0.3081, # for MNIST train
           title=None):
    """Imshow for Tensor."""
    inp = inp.cpu().detach()
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array(mean)
    std = np.array(std)
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    plt.figure(figsize=figsize)
    plt.imshow(inp)
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause a bit so that plots are updated
    
def reconstructions_from_batch(model, batch):
    batch = batch.view(-1, 28 * 28).to(device)
    reconstruction = model(batch)
    return reconstruction.reshape(batch.shape[0],1,28,28)

# Get a batch of test data
batch, classes = next(iter(test_loader))

# Make a grid from batch
out = torchvision.utils.make_grid(batch)
imshow(out)

In [0]:
### Iterate over checkpoints and plot reconstruction 
### figures from the test set.

As discussed in class, the first part of an autoencoder maps the original input into a lower-dimensional latent space. 
* Just as shown in Hinton and Salakhutdinov, run your test set of 10,000 MNIST digits through the **encoding layer** of one of the trained networks above. Each sample should map to a 2-dimensional point. To do this, it will be helpful to fill out a new function, **encode**, below, which takes your trained model and your `test_loader` and produces pairs of 2-d latent embeddings and their corresponding labels (e.g. 3, 4).
* Plot each point in these two dimensions, and color each point in this **latent space** by their known **labels**. 
* Does your autoencoder separate out different classes effectively? What classes seem to be closer and what classes are farther apart in this latent space?
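
For the plotting step, one option is a Matplotlib scatter plot colored by label. The sketch below uses synthetic stand-in data (`fake_embeddings` and `fake_labels` are made up and say nothing about what a trained autoencoder produces); you would pass in the output of your own `encode` function instead.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-ins: 500 points in 2-d with an integer label from 0 to 9.
rng = np.random.default_rng(0)
fake_labels = rng.integers(0, 10, size=500)
fake_embeddings = rng.normal(size=(500, 2)) + fake_labels[:, None]

fig, ax = plt.subplots(figsize=(6, 6))
points = ax.scatter(
    fake_embeddings[:, 0], fake_embeddings[:, 1],
    c=fake_labels,   # one color per class
    cmap="tab10",    # qualitative colormap with 10 distinct colors
    s=5)
fig.colorbar(points, ax=ax, ticks=range(10), label="digit label")
ax.set_xlabel("latent dimension 1")
ax.set_ylabel("latent dimension 2")
```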

In [0]:
### Write a helper function to grab examples from the test_loader to generate
### pairs of embeddings and their associated labels

def encode(model, device, test_loader):
    #### Fill this in! ####
    # latent_embeddings: the latent embeddings, ultimately an array of (x, y) coordinates
    latent_embeddings = None
    # labels: should match the first dim of latent_embeddings,
    # so each pair of coordinates has an associated label
    labels = None
    return latent_embeddings, labels

### Plot latent space representation color-coded 
### according to their "true" labels

## Optional: Train an autoencoder on CelebA Faces

Real-world images tend to be far more complex than digits from MNIST. As an optional exercise for your own interest, or for students looking for more experience, we'll investigate a subset of CelebA below.

We provide the images in a .zip file (`faces.zip`) in the class's Google Drive folder, which contains a "train" and "test" set of 80k and 10k images, respectively. Although these are color (RGB) images, we've set up the datasets below to convert them to grayscale, with a precomputed mean (0.4401) and std (0.2407), for convenience and cheaper compute.

In [0]:
### Download faces.zip and unzip it into bmi219_downloads/

In [0]:
preprocessing = transforms.Compose([
    transforms.Grayscale(),
    transforms.ToTensor(),
    transforms.Normalize((0.4401,), (0.2407,)),
])

train_dataset = datasets.ImageFolder(
    'bmi219_downloads/Faces/train',
    transform=preprocessing)

test_dataset = datasets.ImageFolder(
    'bmi219_downloads/Faces/test',
    transform=preprocessing)

As above, you'll want to:

1. set up your dataloaders and visualize some of the images
2. set up your autoencoder network architecture
3. define your training criterion and optimizer
4. train your network
    
In this case, you should be able to reuse much of your code from above. Consider a few questions:

1. How well do complex images like faces work with a latent dimension of 2?
2. Do reconstructions look better with a larger bottleneck?
3. What kind of features are poorly reconstructed? What happens to sunglasses, hats, and hands?
4. Try sampling the 2-d latent space close to existing examples (by adding some noise...) or randomly. What do the generated images look like?
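
For the last question, sampling could look roughly like the sketch below. The decoder here is an untrained stand-in with an arbitrary output size, `z_existing` is a made-up latent point, and both the noise scale and the use of a standard normal for random samples are assumptions; the latent space of a trained model may be scaled very differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the decoder half of a trained autoencoder (2-d latent input).
toy_decoder = nn.Sequential(nn.Linear(2, 16), nn.Sigmoid())

# Pretend this latent point came from encoding a real image.
z_existing = torch.tensor([[0.5, -1.0]])
noise_scale = 0.1

with torch.no_grad():
    # Five points close to the existing example.
    z_nearby = z_existing + noise_scale * torch.randn(5, 2)
    # Five points drawn at random, unrelated to any example.
    z_random = torch.randn(5, 2)

    nearby_images = toy_decoder(z_nearby)
    random_images = toy_decoder(z_random)

print(nearby_images.shape, random_images.shape)
```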