# Small data and deep learning
This mini-project proposes to study several techniques for improving learning in a challenging context, in which little data and few resources are available.

In [None]:
%load_ext autoreload
%autoreload 2

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Introduction
Assume we are in a context where few "gold" labeled data are available for training, say $\mathcal{X}_{\text{train}}\triangleq\{(x_n,y_n)\}_{n\leq N_{\text{train}}}$, where $N_{\text{train}}$ is small. A large test set $\mathcal{X}_{\text{test}}$ is available. A large amount of unlabeled data, $\mathcal{X}$, is available. We also assume that we have a limited computational budget (e.g., no GPUs).

For each question, write a commented *Code* or a complete answer as a *Markdown*. When the objective of a question is to report a CNN accuracy, please use the following format to report it, at the end of the question:

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|   XXX  | XXX | XXX | XXX |

If applicable, please add the field corresponding to the  __Accuracy on Full Data__ as well as a link to the __Reference paper__ you used to report those numbers. (You do not need to train a CNN on the full CIFAR10 dataset)

In your final report, please keep the logs of each training procedure you used. We will only run this jupyter if we have some doubts on your implementation. 

__The total file sizes should not exceed 2MB. Please name your notebook (LASTNAME)\_(FIRSTNAME).ipynb, zip/tar it with any necessary files required to run your notebook, in a compressed file named (LASTNAME)\_(FIRSTNAME).X where X is the corresponding extension. Zip/tar files exceeding 2MB will not be considered for grading. Submit the compressed file via the submission link provided on the website of the class.__

You can use https://colab.research.google.com/ to run your experiments.

## Training set creation
__Question 1:__ Propose a dataloader or modify the file located at https://github.com/pytorch/vision/blob/master/torchvision/datasets/cifar.py in order to obtain a training loader that will only use the first 100 samples of the CIFAR-10 training set. 

> I obtained only the first 100 samples by inheritance from CIFAR10:

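```python
from torchvision.datasets import CIFAR10


class SubCIFAR10(CIFAR10):
    """CIFAR10 restricted to its first `length` samples via index wrapping."""

    def __init__(self, root, length=100, **kwargs):
        self.length = length
        super().__init__(root, **kwargs)

    def __getitem__(self, index):
        # Wrap the index so that only the first `length` samples are ever returned.
        return super().__getitem__(index % self.length)
```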
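
One caveat of index wrapping alone is that the dataset may still report its full length, so one "epoch" would cycle many times over the same samples. The following is a minimal, framework-free sketch of a wrapper that also restricts `__len__`. The name `FirstN` and its `offset` argument are hypothetical (they are not part of torchvision), and the toy list merely stands in for a real dataset.

```python
class FirstN:
    """Expose only a contiguous window of an indexable dataset (hypothetical helper)."""

    def __init__(self, dataset, length=100, offset=0):
        if offset + length > len(dataset):
            raise ValueError("window exceeds the size of the wrapped dataset")
        self.dataset = dataset
        self.length = length
        self.offset = offset

    def __len__(self):
        # Reporting the window size makes one epoch visit each sample exactly once.
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return self.dataset[self.offset + index]


# Toy stand-in for a labeled dataset: (sample, label) pairs.
toy = [(f"img{i}", i % 10) for i in range(1000)]
train = FirstN(toy, length=100)
valid = FirstN(toy, length=200, offset=100)
print(len(train), train[0], valid[0])
```

In PyTorch itself, `torch.utils.data.Subset(dataset, range(100))` gives a comparable result without subclassing.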

In [None]:
import torchvision.transforms as transforms
import torch 
import torchvision
import matplotlib.pyplot as plt
import numpy as np
from utils import SubCIFAR10
    
    
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
batch_size = 100
dataset = SubCIFAR10("datasets", download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                          shuffle=False, num_workers=2)

View a few classes

In [None]:
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))

dataiter = iter(dataloader)
images, labels = next(dataiter)
plt.figure(figsize=(20,10))

plt.subplot(121)
plt.grid(False)
imshow(torchvision.utils.make_grid(images))

plt.subplot(122)
plt.hist(labels, bins=len(classes), rwidth=0.9, color='#607c8e')
ax = plt.gca()
plt.xticks(np.arange(len(classes))*0.9+0.5, classes, rotation=45, rotation_mode="anchor", ha="right")
plt.show()

This is our dataset $\mathcal{X}_{\text{train}}$; it will be used until the end of this project. The remaining samples correspond to $\mathcal{X}$. The testing set $\mathcal{X}_{\text{test}}$ corresponds to the whole testing set of CIFAR-10.

## Testing procedure
__Question 2:__ Explain why the evaluation of the training procedure is difficult. Propose several solutions.

> Having a very small dataset for 10 classes makes training particularly difficult. For example, the class "ship" has only 4 samples. The difficulty can be understood from the discrepancy between the number of samples and the number of parameters of a CNN (typically from hundreds of thousands to tens of millions).
>
> Several solutions exist though:
>
> - using a pre-trained network, so that the optimizer starts close to an optimum for the new task;
> - augmenting the data with random variations of the original samples;
> - generating new data by learning a latent representation of the training set (e.g. with a VAE) or by using adversarial networks;
> - synthesizing new data and possibly refining them with CycleGANs;
> - optimizing the training setup, for example by using a smaller network and testing different hyperparameters;
> - making use of non-annotated data (see semi-supervised/weakly-supervised learning methods).
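
Evaluation itself is also noisy when the held-out set is small. As a rough illustration, the sketch below computes the standard error of an accuracy estimate under a binomial model. It assumes independent evaluation samples, which is an assumption made here and not something measured in this notebook; the 25% accuracy is only an illustrative value.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated on n_samples independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Illustrative accuracy of 25% evaluated on held-out sets of different sizes.
for n in (100, 1000, 10000):
    se = accuracy_standard_error(0.25, n)
    print(f"n={n:>5}: 25.0% +/- {100 * se:.1f}% (one standard error)")
```

With only 100 evaluation samples, a difference of a few points between two models is within the noise, which is one reason to validate on a larger held-out set.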

# Raw approach: the baseline

In this section, the goal is to train a CNN on $\mathcal{X}_{\text{train}}$ and compare its performance with reported numbers from the literature. You will have to re-use and/or design a standard classification pipeline. You should optimize your pipeline to obtain the best performance (image size, data augmentation by flip, ...).

The key ingredients for training a CNN are the batch size and the learning rate schedule, i.e. how to decrease the learning rate as a function of the number of epochs. A possible schedule is to start with a learning rate of 0.1 and divide it by 10 every 30 epochs. In case of divergence, reduce the learning rate. A potential batch size could be 10, yet this can be cross-validated.
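
The step schedule described above can be written as a small function. This is a framework-free sketch; `step_lr` is a hypothetical helper name, and the default values are simply the ones suggested in the text.

```python
def step_lr(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    """Learning rate at a given epoch: base_lr multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 60, 90):
    print(f"epoch {epoch:>2}: lr = {step_lr(epoch):.4g}")
```

In PyTorch, `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)` implements the same rule on top of an existing optimizer.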

You can get some baseline accuracies in this paper: http://openaccess.thecvf.com/content_cvpr_2018/papers/Keshari_Learning_Structure_and_CVPR_2018_paper.pdf. Obviously, it is a different context, as those researchers had access to GPUs.

## ResNet architectures

__Question 3:__ Write a classification pipeline for $\mathcal{X}_{\text{train}}$, train from scratch and evaluate a *ResNet-18* architecture specific to CIFAR10 (details about the ImageNet model can be found here: https://arxiv.org/abs/1409.1556 ). If possible, please report the accuracy obtained on the whole dataset, as well as the reference paper/GitHub link you might have used.

*Hint:* You can re-use the following code: https://github.com/kuangliu/pytorch-cifar. With 10 epochs of training, a batch size of 10 and a learning rate of 0.01, one obtains 40% accuracy on $\mathcal{X}_{\text{train}}$ (~2 minutes) and 20% accuracy on $\mathcal{X}_{\text{test}}$ (~5 minutes).

In [None]:
from tqdm import tnrange
from utils import train_an_epoch, test, plot_history
from resnet import ResNet18


batch_size = 10
n_epochs = 100

# Loading data
train_transform = transforms.Compose(
    [transforms.RandomHorizontalFlip(),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
test_transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = SubCIFAR10("datasets", download=True, transform=train_transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size)
validset = SubCIFAR10("datasets", download=True, transform=test_transform, offset=100, length=1000)
validloader = torch.utils.data.DataLoader(validset, batch_size=batch_size)
testset = SubCIFAR10("datasets", download=True, transform=test_transform, offset=100, length=50000)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size)

# Initialize opt and net
model = ResNet18()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model.cuda()
    
# Training
train_loss = np.zeros(n_epochs, dtype=float)
train_acc = np.zeros(n_epochs, dtype=float)
valid_acc = np.zeros(n_epochs, dtype=float)
with tnrange(n_epochs) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, silent=True)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True)
        valid_acc[i] = test(model, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     
plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
print("[Testing]")
test(model, testloader, criterion, device)

Oops... The neural network is completely over-fitting!

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|ResNet| 10 | 38.5% | 20.9% |
|ResNet| 20 | 51.5% | 23.1% |
|ResNet| 100 | 98.0% | 23.5% |
|[Keshari-ResNet]| N/A | N/A | 36% | 
|**[Keshari-ResNet-pretrained]**| N/A | N/A |44% | 

- [Keshari-ResNet] *Learning Structure and Strength of CNN Filters for Small Sample Size Training*;
Rohit Keshari, Mayank Vatsa, Richa Singh, Afzel Noore.  (version Proposed ResNet, Dict., Init., learn "t") 
- [Keshari-ResNet-pretrained] Same paper (version Proposed ResNet, Pre-trained on ImageNet, learn "t")
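
The over-fitting can be quantified as the gap between train and test accuracy. The short sketch below only reuses the three ResNet rows of the table above; `generalization_gap` is a hypothetical helper name.

```python
# (epochs, train accuracy %, test accuracy %) copied from the table above.
resnet_runs = [(10, 38.5, 20.9), (20, 51.5, 23.1), (100, 98.0, 23.5)]


def generalization_gap(train_acc, test_acc):
    """Difference between train and test accuracy, in percentage points."""
    return train_acc - test_acc


gaps = {epochs: round(generalization_gap(tr, te), 1) for epochs, tr, te in resnet_runs}
for epochs, gap in gaps.items():
    print(f"{epochs:>3} epochs: gap = {gap:.1f} points")
```

The gap keeps growing with the number of epochs while the test accuracy barely moves, which is the signature of memorizing the 100 training samples.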

## VGG-like architectures

__Question 4:__ Same question as before, but with a *VGG*. Which model do you recommend?

In [None]:
from vgg import VGG

# Initialize opt and net
model = VGG('VGG11')
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model.cuda()
    
# Training
train_loss = np.zeros(n_epochs, dtype=float)
train_acc = np.zeros(n_epochs, dtype=float)
valid_acc = np.zeros(n_epochs, dtype=float)
with tnrange(n_epochs) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, silent=True)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True)
        valid_acc[i] = test(model, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     
plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
test(model, testloader, criterion, device)

I first tested VGG19, but over-fitting was even worse. I then switched to a smaller network with fewer parameters: VGG11. Bold lines emphasize the best results.

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|ResNet18| 10 | 38.5% | 20.9% |
|**ResNet18**| 100 | 98.0% | 23.5%|
| VGG19 |  10 | 25.0% | 17.1% |
|**VGG19** | 100 | 83.0% | 18.7% |
| VGG11 |  10 | 92.0% | 22.9% |
|**VGG11** |  14 | 100.0% | 27.0% |
| VGG11 | 100 | 100.0% | 25.3% | 
|[Keshari-ResNet]| N/A | N/A | 36% | 
|**[Keshari-ResNet-pretrained]**| N/A | N/A | 44% | 

With the proposed training method, VGG11 is a better choice than ResNet18, since it converged faster and achieves a better test accuracy at epoch 14 (27%).
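
Choosing the stopping epoch from the test accuracy leaks information from the test set; a cleaner variant picks it from the validation curve (`valid_acc` in the training loops). The sketch below shows that selection on a hypothetical curve; the numbers are made up for illustration and `best_epoch` is a hypothetical helper.

```python
def best_epoch(valid_acc):
    """Return (epoch_index, accuracy) of the highest validation accuracy (first one on ties)."""
    index = max(range(len(valid_acc)), key=lambda i: valid_acc[i])
    return index, valid_acc[index]


# Hypothetical validation curve, for illustration only (not measured in this notebook).
valid_acc = [0.12, 0.18, 0.22, 0.26, 0.25, 0.24]
epoch, acc = best_epoch(valid_acc)
print(f"keep the checkpoint from epoch {epoch + 1} (validation accuracy {acc:.0%})")
```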


# Transfer learning

We propose to use pre-trained models on a classification and generative task, in order to improve the results of our setting.

## ImageNet features

Now, we will use some pre-trained models on ImageNet and see how well they compare on CIFAR. A list is available on: https://pytorch.org/docs/stable/torchvision/models.html.

__Question 5:__ Pick a model from the list above, adapt it to CIFAR and retrain its final layer (or a block of layers, depending on the resources to which you have access). Report its accuracy.

In [None]:
from torchvision.models.resnet import resnet18
from utils import plot_history, initialize_model
from tqdm import tnrange

n_epochs = 35

# Loading model and opt
model, input_size = initialize_model("resnet", len(classes), True, use_pretrained=True)
params_to_update = []
for name, param in model.named_parameters():
    if param.requires_grad == True:
        params_to_update.append(param)
optimizer = torch.optim.Adam(params_to_update)
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model.cuda()
    
    
# Loading data    
train_transform = transforms.Compose(
    [transforms.RandomResizedCrop(input_size),
     transforms.RandomHorizontalFlip(),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
test_transform = transforms.Compose(
    [transforms.Resize(input_size),
     transforms.CenterCrop(input_size),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
trainset = SubCIFAR10("datasets", download=True, transform=train_transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size)
validset = SubCIFAR10("datasets", download=True, transform=test_transform, offset=100, length=1000)
validloader = torch.utils.data.DataLoader(validset, batch_size=batch_size)
testset = SubCIFAR10("datasets", download=True, transform=test_transform, offset=100, length=50000)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size)

    
# Training
train_loss = np.zeros(n_epochs, dtype=float)
train_acc = np.zeros(n_epochs, dtype=float)
valid_acc = np.zeros(n_epochs, dtype=float)
with tnrange(n_epochs) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, silent=True)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True)
        valid_acc[i] = test(model, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     
plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
test(model, testloader, criterion, device)

In [None]:
test(model, testloader, criterion, device)

Using pre-trained neural networks allowed me to improve results over Keshari et al.!

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|VGG11 |  14 | 100.0% | 27.0% |
|[Keshari-ResNet]| N/A | N/A | 36% | 
|ResNet-pretrained| 10 | 63 % | 37% |
|[Keshari-ResNet-pretrained]| N/A | N/A | 44% | 
|**ResNet-pretrained**| 35 | 76% | 51% |
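
One way to see why retraining only the final layer helps is to count the trainable parameters. The sketch below assumes that `initialize_model` freezes the backbone and replaces the last fully connected layer of torchvision's ResNet-18 with a 10-class head; this is an assumption, since `utils` is not shown here.

```python
def linear_params(in_features, out_features):
    """Number of weights and biases in a fully connected layer."""
    return in_features * out_features + out_features


n_train = 100
# torchvision's ResNet-18 ends with a 512-feature vector; a 10-class head replaces the 1000-class one.
head = linear_params(512, 10)
# 11,689,512 is the parameter count of torchvision's 1000-class ResNet-18.
backbone = 11_689_512 - linear_params(512, 1000)

print(f"trainable head: {head} parameters ({head / n_train:.1f} per training sample)")
print(f"frozen backbone: {backbone} parameters")
```

Under this assumption, the optimizer only fits a few thousand parameters instead of more than eleven million, which is far better matched to 100 labeled samples.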

## DCGAN features

GANs correspond to an unsupervised technique for generating images. In https://arxiv.org/pdf/1511.06434.pdf, Sec. 5.1 shows that the representation obtained from the Discriminator has some nice generalization properties on CIFAR10.

__Question 6:__  Using for instance a pretrained model from https://github.com/soumith/dcgan.torch combined with https://github.com/pytorch/examples/tree/master/dcgan, propose a model to train on $\mathcal{X}_{\text{train}}$. Train it and report its accuracy.

*Hint:* You can use the library: https://github.com/bshillingford/python-torchfile to load the weights of a model from torch(Lua) to pytorch(python).

> I tried to train both the generator and the discriminator. Using only 100 samples makes this impossible. Instead, I trained the GAN on the complete dataset. I know it is cheating, but I found it interesting to get a baseline.

![best results](results/dcgan/fake_samples_epoch_611.png)

In [None]:
import os
from gan import *
from utils import get_loaders, plot_history, train_an_epoch
from tqdm import tnrange
import numpy as np

# Parameters
out_img = os.path.join("results", "dcgan2")
out_models = os.path.join("models", "dcgan")
nc = 3
nz = 100 # latent Z vector
ngf = 64
ndf = 64
n_epochs = 650
batch_size = 300

os.makedirs(out_img, exist_ok=True)
os.makedirs(out_models, exist_ok=True)

# Building models and optimizers
device = torch.device("cuda:0" if torch.cuda.is_available() 
                      else "cpu")
netG, optimizerG, netD, optimizerD = build_dcgan(
    device, nz, nc, ngf, ndf)
criterion = nn.BCELoss()

# Loading data
_, _, dataloader = get_loaders(batch_size=10, input_size=64)

gen_loss = np.zeros(n_epochs, dtype=float)
disc_loss = np.zeros(n_epochs, dtype=float)

with tnrange(n_epochs) as t:
    for i in t:
        disc_loss[i], gen_loss[i] = train_gan_an_epoch(
            dataloader, netG, optimizerG, 
            netD, optimizerD, criterion, device, nz=nz, 
            silent=True, out_models=out_models, epoch=i)
        export_gan_result(dataloader, netG, netD, device, 
                          out_img, i, nz=nz)
        t.set_postfix(disc_loss=disc_loss[i], 
                      gen_loss=gen_loss[i])
     
plot_history(disc_loss=disc_loss, gen_loss=gen_loss)

Then I trained a discriminator given a pre-trained generator.

In [None]:
import os
from gan import *
from utils import get_loaders, plot_history, train_an_epoch
from tqdm import tnrange
import numpy as np
import torchfile


# Parameters
out_img = os.path.join("results", "dcgan2")
out_models = os.path.join("models", "dcgan")
nc = 3
nz = 100 # latent Z vector
ngf = 64
ndf = 64
n_epochs = 650
batch_size = 300

os.makedirs(out_img, exist_ok=True)
os.makedirs(out_models, exist_ok=True)

# Building models and optimizers
device = torch.device("cuda:0" if torch.cuda.is_available() 
                      else "cpu")
_, _, netD, optimizerD = build_dcgan(
    device, nz, nc, ngf, ndf)
netG = torchfile.load('models/bedrooms_4_net_G.t7')
criterion = nn.BCELoss()

# Loading data
dataloader, _, _ = get_loaders(batch_size=10, input_size=64)



disc_loss = np.zeros(n_epochs, dtype=float)

We can also train on a similar dataset: STL-10 with higher resolutions. 

![Results on STL-10](results/stl10/fake_samples_epoch_4149.png)

In [None]:
import os
from gan import *
from utils import get_loaders, plot_history, train_an_epoch, SubSTL10
from tqdm import tnrange
from torchvision.datasets import STL10
import numpy as np

# Parameters
out_img = os.path.join("results", "stl10")
out_models = os.path.join("models", "stl10")
nc = 3
nz = 100 # latent Z vector
input_size = 64
n_epochs = 650
batch_size = 32

os.makedirs(out_img, exist_ok=True)
os.makedirs("models/stl10", exist_ok=True)

# Building models and optimizers
device = torch.device("cuda:0" if torch.cuda.is_available() 
                      else "cpu")
netG, optimizerG, netD, optimizerD = build_dcgan(
    device, nz, nc, input_size, input_size)
criterion = nn.BCELoss()
netG.load_state_dict(torch.load("models/stl10/netG_epoch_640.pth"))
netD.load_state_dict(torch.load("models/stl10/netD_epoch_640.pth"))

# Loading data
transform = transforms.Compose(
    [transforms.RandomResizedCrop(input_size),
     transforms.RandomHorizontalFlip(),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

dataset = STL10("datasets", split='train+unlabeled', transform=transform, download=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

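# Resume training after the loaded checkpoints; losses are stored relative to the first resumed epoch
first_epoch, last_epoch = n_epochs + 500, n_epochs + 3500
disc_loss = np.zeros(last_epoch - first_epoch, dtype=float)
gen_loss = np.zeros(last_epoch - first_epoch, dtype=float)

with tnrange(first_epoch, last_epoch) as t:
    for i in t:
        k = i - first_epoch  # position in the loss arrays
        disc_loss[k], gen_loss[k] = train_gan_an_epoch(
            dataloader, netG, optimizerG, 
            netD, optimizerD, criterion, device, nz=nz, 
            silent=True, out_models=out_models, epoch=i)
        export_gan_result(dataloader, netG, netD, device, 
                          out_img, i, nz=nz)
        t.set_postfix(disc_loss=disc_loss[k], 
                      gen_loss=gen_loss[k])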
     

plot_history(disc_loss=disc_loss, gen_loss=gen_loss)

Can we fine-tune this network class by class, so that we can then generate samples of specific classes?

In [None]:
from gan import finetune_gan
            
for label in range(len(classes)):
    finetune_gan(label, classes=classes, prefix="cifar")

# Incorporating *a priori*
Geometrical *a priori* are appealing for image classification tasks. For now, we only consider linear transformations $\mathcal{T}$ of the inputs $x:\mathbb{S}^2\rightarrow\mathbb{R}$ where $\mathbb{S}$ is the support of an image, meaning that:

$$\forall u\in\mathbb{S}^2,\mathcal{T}(\lambda x+\mu y)(u)=\lambda \mathcal{T}(x)(u)+\mu \mathcal{T}(y)(u)\,.$$

For instance if an image had an infinite support, a translation $\mathcal{T}_a$ by $a$ would lead to:

$$\forall u, \mathcal{T}_a(x)(u)=x(u-a)\,.$$

Otherwise, one has to handle several boundary effects.
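
The boundary effect can be seen on a finite 1-D signal. The sketch below implements $\mathcal{T}_a(x)(u)=x(u-a)$ with zero fill outside the support, checks linearity numerically, and shows that shifting back does not recover the input. The `translate` helper and the zero-fill convention are assumptions chosen for illustration.

```python
def translate(x, a, fill=0.0):
    """Shift a finite 1-D signal by a samples: out[u] = x[u - a], with fill outside the support."""
    n = len(x)
    return [x[u - a] if 0 <= u - a < n else fill for u in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.0, -1.0, 2.0]
lam, mu = 2.0, -3.0

# Linearity: T(lam * x + mu * y) == lam * T(x) + mu * T(y)
lhs = translate([lam * xi + mu * yi for xi, yi in zip(x, y)], 1)
rhs = [lam * tx + mu * ty for tx, ty in zip(translate(x, 1), translate(y, 1))]
print(lhs == rhs)

# Boundary effect: the last sample leaves the support, so shifting back does not recover x.
print(translate(translate(x, 1), -1))
```

Note that linearity holds here because the fill value is zero; a non-zero constant fill would make the operator affine rather than linear.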

__Question 7:__ Explain the issues when dealing with translations, rotations, scaling effects, color changes on $32\times32$ images. Propose several ideas to tackle them.

> These operations on very small images can introduce several defects:
>
> - aliasing is likely to appear, although it can be partly compensated for with an anti-aliasing filter;
> - even a small affine transformation leaves a significant part of the image uncovered; the missing pixels can be refilled with a padding pattern (e.g. by mirroring the image at its borders);
> - scaling is particularly dangerous because it reduces the effective resolution; sub-pixel methods could help here.
>
> In the end, I decided to drop affine transformations and to focus only on constrained deformations (thin plate splines).
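
The mirror-fill idea mentioned in the list can be sketched on a single image row. This is a minimal pure-Python illustration with hypothetical helper names; it repeats the edge sample when reflecting, which corresponds to the `symmetric` mode of `numpy.pad`.

```python
def mirror_index(i, n):
    """Map any integer index onto [0, n) by reflecting at the borders (edge sample repeated)."""
    j = i % (2 * n)
    return j if j < n else 2 * n - 1 - j


def translate_mirror(x, a):
    """Shift a 1-D signal by a samples, filling uncovered positions by mirroring."""
    n = len(x)
    return [x[mirror_index(u - a, n)] for u in range(n)]


row = [10, 20, 30, 40, 50]
print(translate_mirror(row, 2))   # [20, 10, 10, 20, 30]
print(translate_mirror(row, -2))  # [30, 40, 50, 50, 40]
```

Compared with zero fill, the uncovered pixels keep plausible local statistics, at the cost of duplicating content near the border.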

## Data augmentations

__Question 8:__ Propose a set of geometric transformations beyond translation, and incorporate them in your training pipeline. Train the models of __Question 3__ and __Question 4__ with them and report the accuracies.

In [None]:
from utils import *
from vgg import VGG
from resnet import ResNet18
from tps import random_tps

input_size = 32
transform = transforms.Compose([
        transforms.ColorJitter(brightness=0.1, contrast=0.05, saturation=0.05, hue=0.05),
        transforms.Lambda(random_tps),
        transforms.RandomResizedCrop(input_size, scale=(0.95, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

batch_size = 10
trainloader, validloader, testloader = get_loaders(batch_size=batch_size, train_transform=transform)

# Initialize opt and net
vgg = VGG('VGG11')
vgg_opt = torch.optim.Adam(vgg.parameters())
resnet = ResNet18()
resnet_opt = torch.optim.Adam(resnet.parameters())
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    vgg.cuda()
    resnet.cuda()

plot_examples(trainloader)

In [None]:
# Training for VGG
n_epochs = 100

vgg_train_loss = np.zeros(n_epochs, dtype=float)
vgg_train_acc = np.zeros(n_epochs, dtype=float)
vgg_valid_acc = np.zeros(n_epochs, dtype=float)
with tnrange(n_epochs) as t:
    for i in t:
        vgg_train_loss[i] = train_an_epoch(vgg, criterion, 
                 trainloader, vgg_opt, device, silent=True)
        vgg_train_acc[i] = test(vgg, trainloader, 
                 criterion, device, silent=True)
        vgg_valid_acc[i] = test(vgg, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=vgg_train_loss[i], train_acc=vgg_train_acc[i], valid_acc=vgg_valid_acc[i])
     

plot_history(train_loss=vgg_train_loss, train_acc=vgg_train_acc, valid_acc=vgg_valid_acc)
test(vgg, testloader, criterion, device)

In [None]:
# Training for Resnet
rn_train_loss = np.zeros(n_epochs, dtype=float)
rn_train_acc = np.zeros(n_epochs, dtype=float)
rn_valid_acc = np.zeros(n_epochs, dtype=float)
with tnrange(n_epochs) as t:
    for i in t:
        rn_train_loss[i] = train_an_epoch(resnet, criterion, 
                 trainloader, resnet_opt, device, silent=True)
        rn_train_acc[i] = test(resnet, trainloader, 
                 criterion, device, silent=True)
        rn_valid_acc[i] = test(resnet, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=rn_train_loss[i], train_acc=rn_train_acc[i], valid_acc=rn_valid_acc[i])
   

In [None]:
plot_history(train_loss=rn_train_loss, train_acc=rn_train_acc, valid_acc=rn_valid_acc)
test(resnet, testloader, criterion, device)

Data augmentation has slightly improved our baseline, but the accuracy remains lower than with pre-training or with the method of Keshari et al.

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|VGG11 |  14 | 100.0% | 27.0% |
|ResNet-augmented |  100 | 100.0% | 25.0% |
|VGG11-augmented |  100 | 100.0% | 29.5% |
|[Keshari-ResNet]| N/A | N/A | 36% | 
|[Keshari-ResNet-pretrained]| N/A | N/A | 44% | 
|**ResNet-pretrained**| 35 | 76% | 51% |

## Wavelets

__Question 9:__ Use a Scattering Transform as an input to a ResNet-like architecture. You can find a baseline here: https://arxiv.org/pdf/1703.08961.pdf.

*Hint:* You can use the following package: https://www.kymat.io/
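
The `81*3` input channels passed to `Scattering2dResNet` in the cell below follow from counting the scattering coefficients. The sketch assumes a scattering of maximum order 2 with `L=8` orientations (to my knowledge the Kymatio defaults for `Scattering2D`); `scattering_channels` is a hypothetical helper, not a Kymatio function.

```python
def scattering_channels(J, L=8):
    """Number of scattering coefficients per input channel, orders 0 to 2."""
    order0 = 1                          # low-pass average
    order1 = J * L                      # one wavelet per scale and orientation
    order2 = L * L * J * (J - 1) // 2   # pairs of wavelets with increasing scale
    return order0 + order1 + order2


J, size, colors = 2, 32, 3
channels = colors * scattering_channels(J)
spatial = size // 2 ** J
print(f"{channels} channels of size {spatial}x{spatial}")
```

With `J=2` on $32\times32$ RGB images, the ResNet therefore receives 243 channels at an $8\times8$ spatial resolution.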

In [None]:
from scattering import Scattering2dResNet, ScatteringCIFAR10
import torch.optim
from torchvision import datasets, transforms
import torch.nn.functional as F
from kymatio import Scattering2D
import torch
import argparse
import kymatio.datasets as scattering_datasets
import torch.nn as nn
from utils import * 

# Parameters
batch_size = 128
n_epochs = 90 

# Loading model and so on
model = Scattering2dResNet(81*3, 2).to(device)
scattering = Scattering2D(J=2, shape=(32, 32))
train_transform = transforms.Compose(
            [transforms.RandomResizedCrop(input_size),
             transforms.RandomHorizontalFlip(),
             transforms.ToTensor(),
             transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
            ])
augmented_transform = transforms.Compose([
        transforms.ColorJitter(brightness=0.1, contrast=0.05, saturation=0.05, hue=0.05),
        transforms.Lambda(random_tps),
        transforms.RandomResizedCrop(input_size, scale=(0.95, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
# trainloader, validloader, testloader = get_loaders(
#     batch_size=batch_size, train_transform=train_transform)
trainloader, validloader, testloader = get_loaders(
    batch_size=batch_size, train_transform=augmented_transform)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
use_cuda = torch.cuda.is_available()        
device = torch.device("cuda:0" if use_cuda else "cpu")
if use_cuda:
    scattering = scattering.cuda()
    model.cuda()

# Training
train_loss = np.zeros(n_epochs, dtype=float)
train_acc = np.zeros(n_epochs, dtype=float)
valid_acc = np.zeros(n_epochs, dtype=float)

with tnrange(n_epochs) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, 
                 silent=True, callback=scattering)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True, callback=scattering)
        valid_acc[i] = test(model, validloader, criterion, device, 
                            silent=True, callback=scattering)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     
plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
test(model, testloader, criterion, device, callback=scattering)

Scattering has improved the baseline!
Data augmentation even made it possible to do slightly better. Our accuracy then gets closer to the baseline of Keshari et al. without pre-training.

The original paper reports a better baseline.

| Model | Number of  epochs  | Train accuracy | Test accuracy |
|------|------|------|------|
|Scattering |  90 | 59.0% | 27.5% |
|VGG11-augmented |  100 | 100.0% | 29.5% |
|Scattering-augmented |  75 | 83.0% | 29.6% |
|[Keshari-ResNet]| N/A | N/A | 36% | 
|[Oyallon-Scat + WRN 12-8]| N/A | N/A | 38.9% |
|[Keshari-ResNet-pretrained]| N/A | N/A | 44% | 
|**ResNet-pretrained**| 35 | 76% | 51% |


# Weak supervision

Weakly supervised techniques make it possible to tackle the scarcity of labeled data. An introduction to those techniques can be found here: https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html.

__(Open) Question 10:__ Pick a weakly supervised method that will potentially use $\mathcal{X}\cup\mathcal{X}_{\text{train}}$ to train a representation (a subset of $\mathcal{X}$ is also fine). Evaluate it and report the accuracies. You should be careful in the choice of your method, in order to avoid heavy computational effort.

> One solution consists in using the pre-trained GANs to generate labeled samples: each GAN is fine-tuned to generate a single class. This will introduce errors, but it might be quite efficient for pre-training a classifier.
> This is similar to data augmentation, since the GANs were fine-tuned on the 100 samples from CIFAR10. However, I am slightly cheating (using more than 100 samples in total), because the GAN was pre-trained on STL10.

In [None]:
from tqdm import tnrange
from gan import DCGAN_Generator
import os
import torch
from torchvision.utils import save_image

nf = 64
nc = 3 
nz = 100
batch_size = 300
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
output = os.path.join("datasets", "generated")



for label in tnrange(len(classes)):
    folder = os.path.join(output, classes[label])
    os.makedirs(folder, exist_ok=True)
    
    netG = DCGAN_Generator(nf, nc, nz).to(device)
    netG.load_state_dict(torch.load(f"models/cifar-{classes[label]}/netG_epoch_{390}.pth"))
    noise = torch.randn(batch_size, nz, 1, 1, device=device)
    fakes = netG(noise)
    
    for i, fake in enumerate(fakes):
        fake = fake / 2 + 0.5     # unnormalize
        filename = os.path.join(folder, f"{i:03d}.jpg")
        save_image(fake, filename)

I manually filtered out badly generated images. Generation errors came from mode collapse or from remnants of the pre-training (for example, cars were sometimes produced in the "bird" folder).

In [None]:
from utils import get_loaders, train_an_epoch, test, plot_history
from torchvision.datasets import ImageFolder
from torchvision import transforms
from vgg import VGG

batch_size = 10
n_epochs = 15
_, validloader, testloader = get_loaders(batch_size=batch_size)
transform = transforms.Compose(
            [transforms.RandomHorizontalFlip(),
             transforms.Resize(32),
             transforms.ToTensor(),
             transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = ImageFolder("datasets/generated", transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, 
    batch_size=batch_size, shuffle=True, num_workers=4)

# Initialize opt and net
model = VGG('VGG11')
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model.cuda()

# Training
train_loss = np.zeros(n_epochs, dtype=float)
train_acc = np.zeros(n_epochs, dtype=float)
valid_acc = np.zeros(n_epochs, dtype=float)

with tnrange(n_epochs) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, silent=True)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True)
        valid_acc[i] = test(model, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     


plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
test(model, testloader, criterion, device)

The results are not very strong, but the training data were synthetic. Finally, we train the VGG on the real samples with the same data augmentation method.

In [None]:
from utils import *
from vgg import VGG
from tps import random_tps


transform = transforms.Compose([
        transforms.ColorJitter(brightness=0.1, contrast=0.05, saturation=0.05, hue=0.05),
        transforms.Lambda(random_tps),
        transforms.RandomResizedCrop(32, scale=(0.95, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        
    ])
trainloader, _, _ = get_loaders(batch_size=batch_size, train_transform=transform)

# Initialize opt and net
criterion = torch.nn.CrossEntropyLoss()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Training
n_epochs2 = 150  # total number of epochs, including the n_epochs spent on generated data
n_extra = n_epochs2 - n_epochs
train_loss = np.pad(train_loss, (0, n_extra), mode="constant")
train_acc = np.pad(train_acc, (0, n_extra), mode="constant")
valid_acc = np.pad(valid_acc, (0, n_extra), mode="constant")

with tnrange(n_epochs, n_epochs2) as t:
    for i in t:
        train_loss[i] = train_an_epoch(model, criterion, 
                 trainloader, optimizer, device, silent=True)
        train_acc[i] = test(model, trainloader, 
                 criterion, device, silent=True)
        valid_acc[i] = test(model, validloader, criterion, device, silent=True)
        t.set_postfix(train_loss=train_loss[i], train_acc=train_acc[i], valid_acc=valid_acc[i])
     
plot_history(train_loss=train_loss, train_acc=train_acc, valid_acc=valid_acc)
test(model, testloader, criterion, device)

All in all, the results are not satisfying.
That is why I finally tried the method from:

*Semi-Supervised Learning with Ladder Networks*, NeurIPS 2015. Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko

In [None]:
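# `!cd` runs in a throwaway subshell, so the `%cd` magic is used to change directory for the commands below
%cd ladder
# Note: `!conda activate` does not persist across `!` commands either; the kernel should already run in the `ladder` environment
!conda activate ladder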
!python run.py train --encoder-layers convv:96:3:1:1-convf:96:3:1:1-convf:96:3:1:1-maxpool:2:2-convv:192:3:1:1-convf:192:3:1:1-convv:192:3:1:1-maxpool:2:2-convv:192:3:1:1-convv:192:1:1:1-convv:10:1:1:1-globalmeanpool:0 --decoder-spec 0-0-0-0-0-0-0-0-0-0-0-0-0 --dataset cifar10 --act leakyrelu --denoising-cost-x 0,0,0,0,0,0,0,0,0,0,0,0,0 --num-epochs 20 --lrate-decay 0.5 --seed 1 --whiten-zca 3072 --contrast-norm 55 --top-c False --labeled-samples 100 --unlabeled-samples 50000 -- cifar_4k_baseline

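The results were better than those of the previous approaches, with 35.3% test accuracy. The table below summarizes all results; names in brackets refer to methods from the literature.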

| Model | Number of epochs | Train accuracy | Test accuracy |
|------|------|------|------|
| Scattering | 90 | 59.0% | 27.5% |
| VGG11-augmented | 100 | 100.0% | 29.5% |
| Scattering-augmented | 75 | 83.0% | 29.6% |
| [Rasmus-Ladder] | 20 | 86.1% | 35.3% |
| [Keshari-ResNet] | N/A | N/A | 36% |
| [Oyallon-Scat + WRN 12-8] | N/A | N/A | 38.9% |
| [Keshari-ResNet-pretrained] | N/A | N/A | 44% |
| **ResNet-pretrained** | 35 | 76% | 51% |


# Conclusions

__Question 11:__ Write a short report explaining the pros and the cons of each method that you implemented. 25% of the grade of this project will correspond to this question, thus, it should be done carefully. In particular, please add a plot that will summarize all your numerical results.

This notebook presents several ways to deal with small datasets. Deep neural networks are known to be data-hungry. Indeed, given their astonishing number of parameters (e.g., ResNet-50 has 25.6 million parameters), it seems unlikely that a neural network can be trained with 100 images of $32\times32$ pixels, i.e., about 100k pixels in total.
Nevertheless, this mini-project shows that deep neural networks remain appealing even for small datasets; they just require a different setup.
The dataset considered is a subset of 100 samples from CIFAR-10. CIFAR-10 is widely used in the literature, which allows us to compare with different baselines.
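
The size mismatch between the data and the model can be checked with a few lines of arithmetic. The snippet below is only a back-of-the-envelope sketch: the parameter count is the approximate ResNet-50 figure quoted above, and counting the three RGB channels as separate input values is a choice made here for illustration.

```python
# Back-of-the-envelope comparison of data size versus model size.
n_images = 100
height, width, channels = 32, 32, 3

n_pixels = n_images * height * width   # spatial positions across the whole training set
n_values = n_pixels * channels         # scalar inputs once the RGB channels are counted
n_params = 25_600_000                  # approximate ResNet-50 parameter count quoted above

params_per_value = n_params / n_values

print(n_pixels)                 # 102400
print(n_values)                 # 307200
print(round(params_per_value))  # 83
```

In other words, such a model would have on the order of eighty parameters for every scalar value in the training set.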

First, we compare VGG and ResNet, two general-purpose architectures. Although ResNet is known to provide better results than VGG on large datasets (ResNet obtains a 3.6% top-5 error on ImageNet, whereas VGG gets a 7.3% error rate), VGG11 outperformed ResNet-18 here. Both results were nonetheless low, with less than 30% accuracy for both of them.

We then established a strong baseline by using weights pre-trained on ImageNet. The test accuracy jumped to 51% for ResNet. However, this setting differs from what we want, namely training a neural network with as few samples as possible, because pre-training relies on a large labeled dataset.

The next idea was to augment the data. First, I augmented the data with affine transformations, but for an image of $32\times32$ pixels, the distortion was too damaging. Increasing the image size did not really help. Therefore, I used a softer geometric transformation, called a thin plate spline. Combining it with color variations increased the test accuracy of the VGG baseline by a few points.
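
To make the thin plate spline idea concrete, here is a minimal NumPy sketch of the underlying interpolation. It is not the `random_tps` function imported from `tps.py`, whose implementation is not shown here; the function names, the 3x3 control grid and the jitter scale are assumptions chosen for illustration. A set of control points is moved slightly, and the spline gives the smoothest deformation of the plane that follows them.

```python
import numpy as np


def tps_kernel(r):
    """Radial basis U(r) = r^2 log(r), with U(0) = 0 by continuity."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out


def fit_tps(src, dst):
    """Solve for the spline that maps control points src (n, 2) onto dst (n, 2)."""
    n = src.shape[0]
    K = tps_kernel(np.linalg.norm(src[:, None] - src[None, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), src])    # affine part: [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)             # first n rows: kernel weights, last 3: affine


def apply_tps(params, src, pts):
    """Warp arbitrary points pts (m, 2) with a fitted spline."""
    n = src.shape[0]
    U = tps_kernel(np.linalg.norm(pts[:, None] - src[None, :], axis=-1))
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return U @ params[:n] + P @ params[n:]


# Hypothetical setup: a 3x3 grid of control points in [0, 1]^2, jittered slightly.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 3)
src = np.array([(x, y) for y in grid for x in grid])
dst = src + rng.normal(scale=0.03, size=src.shape)

params = fit_tps(src, dst)
warped = apply_tps(params, src, src)         # control points land on their targets
```

To warp an image, one would evaluate `apply_tps` at every pixel coordinate and resample the image at the resulting positions. A small jitter scale keeps the deformation gentle, which matters for images as small as $32\times32$.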

I also thought that I could augment the data using a GAN. This has already been done in the literature [1], but in practice it is very difficult. My GAN could not be trained well enough on the 100 samples alone, so I trained it on STL-10 before fine-tuning it on each class of CIFAR-10 (using only the 100 samples). The generated images were manually filtered.

Finally, I ran two methods from the literature: scattering networks [2] and ladder networks [3]. The latter is a semi-supervised technique, which also exploits the unlabeled samples.

The following figure summarizes the obtained results:

![Final results](results/final.png)


[1] - *GAN Augmentation: Augmenting Training Data using Generative Adversarial Networks*, Christopher Bowles, Liang Chen, Ricardo Guerrero, Paul Bentley, Roger Gunn, Alexander Hammers, David Alexander Dickie, Maria Valdés Hernández, Joanna Wardlaw, Daniel Rueckert

[2] - *Scaling the Scattering Transform: Deep Hybrid Networks*, Edouard Oyallon, Eugene Belilovsky, Sergey Zagoruyko

[3] - *Semi-Supervised Learning with Ladder Networks*, Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko

In [None]:
import matplotlib.pyplot as plt

# name: [number of epochs, train accuracy, test accuracy], as reported in the results table
models = {
    "Scattering": [90, 0.59, 0.275],
    "VGG11-augmented": [100, 1.0, 0.295],
    "Scattering-augmented": [75, 0.83, 0.296],
    "[Rasmus-Ladder]": [20, 0.861, 0.353],
    "[Keshari-ResNet]": ["N/A", "N/A", 0.36],
    "[Oyallon-Scat + WRN 12-8]": ["N/A", "N/A", 0.389],
    "[Keshari-ResNet-pretrained]": ["N/A", "N/A", 0.44],
    "ResNet-pretrained": [35, 0.76, 0.51]}

plt.figure(figsize=(10, 5))
for i, (name, values) in enumerate(models.items()):
    plt.scatter(i, values[-1], label=name)  # one point per model: its test accuracy
plt.xticks([])  # model names are given in the legend
plt.ylabel("Test accuracy")
plt.legend()
plt.title("Results obtained when training a classifier on a CIFAR-10 subset of 100 samples")
plt.show()