# Task Overview

In this task, your goal is to measure how the level of noise in the data affects neural network training.
Use an MLP architecture trained on the MNIST dataset (as in the previous lab exercises).


We will experiment with two setups:
1. Pick X. Take X% of training examples and reassign their labels to random ones. Note that we don't change anything in the test set.
2. Pick X. During each training step, for each sample, change the values of X% of its pixels, selected at random, to random values. Note that we don't change anything in the test set.
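
The two setups can be sketched in plain Python (no PyTorch required) as follows. This is only a minimal illustration under stated assumptions: the helper names `randomize_labels` and `randomize_pixels` are hypothetical, images are represented as flat lists, and replacement pixel values are drawn uniformly from [0, 1), which is a simplification rather than the distribution used later in this notebook.

```python
import math
import random

def randomize_labels(labels, ratio, rng, num_classes=10):
    """Setup 1: return a copy of `labels` with ceil(ratio * n) entries redrawn uniformly."""
    noisy = list(labels)
    to_pick = math.ceil(ratio * len(noisy))
    for i in rng.sample(range(len(noisy)), to_pick):
        noisy[i] = rng.randrange(num_classes)  # may coincide with the original label
    return noisy

def randomize_pixels(image, ratio, rng):
    """Setup 2: return a copy of a flat image, replacing each pixel with probability `ratio`."""
    # Assumption: replacement values are uniform in [0, 1).
    return [rng.random() if rng.random() < ratio else p for p in image]

rng = random.Random(0)

# Setup 1 is applied once, before training.
labels = [i % 10 for i in range(1000)]
noisy_labels = randomize_labels(labels, 0.3, rng)
changed = sum(a != b for a, b in zip(labels, noisy_labels))
print("labels that actually changed:", changed)

# Setup 2 is applied again every time a sample is drawn.
image = [0.0] * 784
noisy_image = randomize_pixels(image, 0.3, rng)
touched = sum(p != 0.0 for p in noisy_image)
print("pixels that actually changed:", touched)
```

Because a redrawn label can coincide with the original one, fewer than X% of the labels end up wrong, even though X% were selected.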

For both setups, check the impact of various levels of noise (various values of X%) on model performance. Show plots comparing cross-entropy (log-loss) and accuracy for varying X%, and also comparing the two setups with each other.
Prepare a short report briefly explaining the results and observed trends. Consider questions like "why does accuracy/loss increase/decrease so quickly/slowly", "why is Z higher in setup 1/2", and any potentially surprising things you see in the charts.

### Potential questions, clarifications
* Q: Can I still use sigmoid/MSE loss?
  * A: You should train your network with softmax and cross-entropy loss (log-loss), especially since you should report the cross-entropy loss.
* Q: When I pick X% of pixels/examples, does it have to be exactly X% or can it be X% in expectation?
  * A: It's fine either way.
* Q: When I randomize pixels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Each training step/epoch.
* Q: When I randomize labels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Only once before training.
* Q: What is the expected length of report/explanation?
  * A: There is no minimum/maximum, but between 5 (concise) and 20 sentences should be good. Don't forget about plots.
* Q: When I replace labels/pixels with random values, what random distribution should I use?
  * A: A distribution reasonably similar to the data. However, you don't need to match the dataset's distribution exactly; an approximation is totally fine, especially if it's faster or easier to get.
* Q: Can I use something different than Colab/Jupyter Notebook? E.g. just Python files.
  * A: Yes, although a notebook is encouraged; please include both the code and a PDF in your solution.
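
The difference between "exactly X%" and "X% in expectation" from the questions above can be sketched as follows. The function names `pick_exact` and `pick_in_expectation` are hypothetical and the example uses only the standard library; the notebook code below uses the first style for labels and the second for pixels.

```python
import math
import random

def pick_exact(n, ratio, rng):
    """Select exactly ceil(ratio * n) distinct indices."""
    return set(rng.sample(range(n), math.ceil(ratio * n)))

def pick_in_expectation(n, ratio, rng):
    """Select each index independently with probability `ratio`."""
    return {i for i in range(n) if rng.random() < ratio}

rng = random.Random(0)
n = 60000  # size of the MNIST training set
exact = pick_exact(n, 0.3, rng)
approx = pick_in_expectation(n, 0.3, rng)
print(len(exact), len(approx))
```

The second variant is usually easier to vectorize (one random mask per batch or image), at the cost of a count that fluctuates slightly around the target.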

# Model definition and training.

In [1]:
!pip install plotly --upgrade

import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import pandas as pd
import plotly.express as px
import random
import math
import numpy as np

class NetParameters(nn.Module):
    def __init__(self, hidden_layers=(128, 128)):
        super().__init__()
        input_size = 784  # 28 x 28 MNIST images, flattened
        output_size = 10  # digit classes 0-9
        layers = []
        # Chain Linear layers: input -> hidden_1 -> ... -> hidden_n -> output.
        for in_features, out_features in zip(
            [input_size, *hidden_layers], [*hidden_layers, output_size]
        ):
            layers.append(nn.Linear(in_features, out_features))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        x = torch.flatten(x, 1)
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i != len(self.layers) - 1:
                x = F.relu(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(log_interval, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print(
                "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                    epoch,
                    batch_idx * len(data),
                    len(train_loader.dataset),
                    100.0 * batch_idx / len(train_loader),
                    loss.item(),
                )
            )


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(
                output, target, reduction="sum"
            ).item()  # sum up batch loss
            pred = output.argmax(
                dim=1, keepdim=True
            )  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = correct / len(test_loader.dataset)
    print(
        "\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n".format(
            test_loss,
            correct,
            len(test_loader.dataset),
            100.0 * correct / len(test_loader.dataset),
        )
    )

    return test_loss, accuracy

def run_learning(train_loader, test_loader, device, seed=1, batch_size=64, test_batch_size=1000, lr=0.01, sgd=True, log_interval=10, epochs=14):
    torch.manual_seed(seed)

    model = NetParameters([128, 128]).to(device)
    optimizer = (
        optim.Adam(model.parameters(), lr=lr)
        if not sgd
        else optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    )

    test_losses = []
    test_accuracies = []
    for epoch in range(1, epochs + 1):
        train(log_interval, model, device, train_loader, optimizer, epoch)
        test_loss, test_accuracy = test(model, device, test_loader)
        test_losses.append(test_loss)
        test_accuracies.append(test_accuracy)
    return test_losses, test_accuracies


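def create_pixel_randomizer(ratio: float, device):
  # Relies on the module-level `possible_values` and `probabilities_tensor`
  # set by create_approximate_distribution(), which must be called first.
  def randomizer(tensor):
    # Select each pixel independently with probability `ratio` (X% in expectation).
    mask = torch.empty(1, 28, 28, device=device).uniform_() < ratio
    # Draw a candidate replacement for every pixel from the approximate pixel distribution.
    sampled = torch.multinomial(probabilities_tensor, 28 * 28, replacement=True)
    values_from_distribution = possible_values[sampled.reshape((1, 28, 28))]
    # Overwrite only the selected pixels.
    tensor[mask] = values_from_distribution[mask]
    return tensor
  return randomizer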

def randomize_labels_(labels: torch.Tensor, ratio: float) -> None:
  labels_count = len(labels)
  to_pick = math.ceil(ratio * labels_count)
  indexes_to_change = random.sample(population=range(labels_count), k=to_pick)
  possible_labels = list(labels.unique())
  for i in indexes_to_change:
    labels[i] = random.choice(possible_labels)

def create_approximate_distribution():
  dataset = datasets.MNIST("../data", train=True, download=True, transform=transforms.ToTensor())
  thresholds = np.linspace(0.05, 1.05, num=11)
  buckets = 11 * [0]
  for batch, _ in dataset:
    for pixel in torch.flatten(batch).tolist():
      for i, threshold in enumerate(thresholds):
        if pixel < threshold:
          buckets[i] += 1
          break
  global possible_values
  possible_values = torch.tensor(thresholds - 0.05, dtype=torch.float32)
  s = np.sum(buckets)
  probabilities = [p/s for p in buckets]
  global probabilities_tensor
  probabilities_tensor = torch.tensor(probabilities)
  px.bar(probabilities).show()

def get_dataset(
    train_batch=64,
    test_batch=1000,
    pixel_randomization_ratio=0,
    label_randomization_ratio=0,
    device=None
):
    if device is None:
        device = torch.device('cpu')
    test_transform = transforms.Compose(
        [transforms.ToTensor()]
    )
    if pixel_randomization_ratio > 0:
      train_transform = transforms.Compose(
          [transforms.ToTensor(), create_pixel_randomizer(pixel_randomization_ratio, device)]
      )
    else:
      train_transform = transforms.Compose(
          [transforms.ToTensor()]
      )
    dataset1 = datasets.MNIST("../data", train=True, download=True, transform=train_transform)
    dataset2 = datasets.MNIST("../data", train=False, transform=test_transform)
    if label_randomization_ratio > 0:
      randomize_labels_(dataset1.targets, label_randomization_ratio)
    train_kwargs = {
      "batch_size": train_batch
    }
    test_kwargs = {
      "batch_size": test_batch
    }
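    # Note: shuffling (like the worker and pinning options) is only enabled on CUDA;
    # on CPU both loaders iterate in dataset order.
    if device == torch.device('cuda'):
      cuda_kwargs = {"num_workers": 1, "pin_memory": True, "shuffle": True}
      train_kwargs.update(cuda_kwargs)
      test_kwargs.update(cuda_kwargs)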
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)
    return train_loader, test_loader

create_approximate_distribution()



# Training models in setup 1: with randomized labels.

In [2]:
device = torch.device('cpu')

def get_label_randomization_losses(ratios, epochs, device):
    losses_df = pd.DataFrame()
    accs_df = pd.DataFrame()
    for randomization_rate in ratios:
        train_loader, test_loader = get_dataset(label_randomization_ratio=randomization_rate, device=device)
        losses, accs = run_learning(train_loader=train_loader, test_loader=test_loader, epochs=epochs, device=device)
        losses_df[f"Ratio = {randomization_rate}"] = losses
        accs_df[f"Ratio = {randomization_rate}"] = accs
    return losses_df, accs_df

label_randomization_losses, label_randomization_accs  = get_label_randomization_losses([0, 0.3, 0.6, 0.9], epochs=10, device=device)



Test set: Average loss: 0.2518, Accuracy: 9224/10000 (92%)


Test set: Average loss: 0.1701, Accuracy: 9472/10000 (95%)


Test set: Average loss: 0.1241, Accuracy: 9609/10000 (96%)


Test set: Average loss: 0.1062, Accuracy: 9662/10000 (97%)


Test set: Average loss: 0.0968, Accuracy: 9686/10000 (97%)


Test set: Average loss: 0.0900, Accuracy: 9710/10000 (97%)


Test set: Average loss: 0.0886, Accuracy: 9719/10000 (97%)


Test set: Average loss: 0.0842, Accuracy: 9743/10000 (97%)


Test set: Average loss: 0.0811, Accuracy: 9750/10000 (98%)


Test set: Average loss: 0.0797, Accuracy: 9754/10000 (98%)


Test set: Average loss: 0.6367, Accuracy: 9150/10000 (92%)


Test set: Average loss: 0.5593, Accuracy: 9375/10000 (94%)


Test set: Average loss: 0.5129, Accuracy: 9527/10000 (95%)


Test set: Average loss: 0.4847, Accuracy: 9580/10000 (96%)


Test set: Average loss: 0.4600, Accuracy: 9619/10000 (96%)


Test set: Average loss: 0.4508, Accuracy: 9636/10000 (96%)


Test set: Average loss:

# Training models in setup 2: with randomized pixels.

In [3]:
device = torch.device('cpu')

def get_pixel_randomization_losses(ratios, epochs, device):
    losses_df = pd.DataFrame()
    accs_df = pd.DataFrame()
    for randomization_rate in ratios:
        train_loader, test_loader = get_dataset(pixel_randomization_ratio=randomization_rate, device=device)
        losses, accuracies = run_learning(train_loader=train_loader, test_loader=test_loader, epochs=epochs, device=device)
        losses_df[f"Ratio = {randomization_rate}"] = losses
        accs_df[f"Ratio = {randomization_rate}"] = accuracies
    return losses_df, accs_df

pixel_randomization_losses, pixel_randomization_accs = get_pixel_randomization_losses([0, 0.3, 0.6, 0.9], epochs=10, device=device)


Test set: Average loss: 0.2518, Accuracy: 9224/10000 (92%)


Test set: Average loss: 0.1701, Accuracy: 9472/10000 (95%)


Test set: Average loss: 0.1241, Accuracy: 9609/10000 (96%)


Test set: Average loss: 0.1062, Accuracy: 9662/10000 (97%)


Test set: Average loss: 0.0968, Accuracy: 9686/10000 (97%)


Test set: Average loss: 0.0900, Accuracy: 9710/10000 (97%)


Test set: Average loss: 0.0886, Accuracy: 9719/10000 (97%)


Test set: Average loss: 0.0842, Accuracy: 9743/10000 (97%)


Test set: Average loss: 0.0811, Accuracy: 9750/10000 (98%)


Test set: Average loss: 0.0797, Accuracy: 9754/10000 (98%)


Test set: Average loss: 0.3142, Accuracy: 9108/10000 (91%)


Test set: Average loss: 0.2102, Accuracy: 9351/10000 (94%)


Test set: Average loss: 0.1578, Accuracy: 9530/10000 (95%)


Test set: Average loss: 0.1313, Accuracy: 9596/10000 (96%)


Test set: Average loss: 0.1125, Accuracy: 9675/10000 (97%)


Test set: Average loss: 0.1045, Accuracy: 9704/10000 (97%)


Test set: Average loss:

# Plots and report.

In [6]:
def draw(df, title):
  px.line(df, title=title, labels={'index': 'Epoch'}).show()

mixed_losses = pd.merge(label_randomization_losses.drop(columns=["Ratio = 0"]), pixel_randomization_losses.drop(columns=["Ratio = 0"]), suffixes=(" (randomized labels)", " (randomized pixels)"), left_index=True, right_index=True)
mixed_accs = pd.merge(label_randomization_accs.drop(columns=["Ratio = 0"]), pixel_randomization_accs.drop(columns=["Ratio = 0"]), suffixes=(" (randomized labels)", " (randomized pixels)"), left_index=True, right_index=True)
experiments = [
    (label_randomization_losses, 'Label randomization, loss'),
    (label_randomization_accs, 'Label randomization, accuracy'),
    (pixel_randomization_losses, 'Pixel randomization, loss'),
    (pixel_randomization_accs, 'Pixel randomization, accuracy'),
    (mixed_losses, 'Both randomization methods, loss'),
    (mixed_accs, 'Both randomization methods, accuracy'),
]

for x, title in experiments:
  x.index = np.arange(1, len(x) + 1)
  draw(x, title=title)

 


# Report
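With label randomization, loss rises quickly with growing X because cross-entropy heavily penalizes confident predictions that disagree with the label: during training the model is punished on the randomized examples, learns less confident outputs, and therefore incurs a higher loss on the test set even when its predicted class is correct. The opposite holds for pixel randomization: the labels stay correct, so the model is not punished for being confident as long as the transformed image still resembles the original digit rather than some other one.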
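
To make the confidence penalty concrete, here is a small sketch in plain Python of the per-example cross-entropy, `-log(p)`, where `p` is the probability the model assigns to the correct class. The probabilities are illustrative assumptions, not values measured in these experiments.

```python
import math

def cross_entropy(prob_of_true_class):
    """Per-example cross-entropy (log-loss) given the probability assigned to the true class."""
    return -math.log(prob_of_true_class)

# Illustrative probabilities assigned to the true class (assumed, not measured).
cases = {
    "confident and right": 0.99,
    "hedged and right": 0.70,
    "hedged and wrong": 0.05,
    "confident and wrong": 0.001,
}
losses = {name: cross_entropy(p) for name, p in cases.items()}
for name, loss in losses.items():
    print(f"{name:>20}: {loss:.3f}")
```

A single confident mistake costs far more than a hedged correct answer, which is one way the loss can move a lot while accuracy barely changes.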

With low pixel randomization (30% of the pixels), loss and accuracy remain at a level similar to training on the original dataset, sometimes even outperforming it. This may be due to the nature of the dataset: the images are sparse and the vast majority of pixels are black, so most of the time randomization replaces a black pixel with a black pixel. More experiments would be needed to tell whether randomizing 30% of the pixels could be used to prevent overfitting and help the model generalize.
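
The sparsity argument can be turned into rough arithmetic. The sketch below assumes that a share `P_BLACK` of replacement draws land in the darkest bucket; the value 0.8 is an assumption chosen for illustration, not a number reported in this notebook, and the function names are hypothetical.

```python
def background_kept_black(ratio, p_black):
    """Probability that a black pixel is still black after randomization."""
    # Either the pixel is not selected, or it is selected and redrawn as black.
    return 1.0 - ratio * (1.0 - p_black)

def stroke_erased(ratio, p_black):
    """Probability that a non-black (stroke) pixel is selected and redrawn as black."""
    return ratio * p_black

P_BLACK = 0.8  # assumed share of near-black values; not measured here
for ratio in (0.3, 0.6, 0.9):
    kept = background_kept_black(ratio, P_BLACK)
    erased = stroke_erased(ratio, P_BLACK)
    print(f"ratio={ratio}: background kept black={kept:.2f}, stroke erased={erased:.2f}")
```

Under this assumption most of the background survives even at high ratios, while the share of erased stroke pixels grows in proportion to the ratio.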

When labels are randomized, accuracy does not fall quickly: even with 60% random labels, >90% accuracy is attained after 2 epochs. Gradients produced by incorrect labels point in various directions, so they largely cancel out and cannot skew the result, especially because the optimization algorithm used is SGD with momentum.
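
The cancellation argument can be checked numerically. For softmax with cross-entropy, the gradient with respect to the logits is `probs - onehot(label)`, so averaging it over noisy labels amounts to averaging the one-hot targets. The sketch below is a simplified Monte Carlo model in plain Python: it redraws each label independently with probability `ratio`, which only approximates the fixed, pick-exactly-X% scheme used in this notebook, and the function names are hypothetical.

```python
import random

def expected_target(true_class, ratio, num_classes=10):
    """Mean one-hot target when a label is redrawn uniformly with probability `ratio`."""
    target = [ratio / num_classes] * num_classes
    target[true_class] += 1.0 - ratio
    return target

def mean_logit_gradient(probs, true_class, ratio, samples, rng):
    """Monte Carlo mean of (probs - onehot(noisy_label)) over many noisy labels."""
    k = len(probs)
    grad = [0.0] * k
    for _ in range(samples):
        label = rng.randrange(k) if rng.random() < ratio else true_class
        for c in range(k):
            grad[c] += (probs[c] - (1.0 if c == label else 0.0)) / samples
    return grad

rng = random.Random(0)
uniform_probs = [0.1] * 10  # an untrained model that is maximally unsure
steepest = {}
for ratio in (0.3, 0.6, 0.9):
    grad = mean_logit_gradient(uniform_probs, true_class=3, ratio=ratio, samples=20000, rng=rng)
    # The most negative component is the logit that gradient descent raises the most.
    steepest[ratio] = min(range(10), key=lambda c: grad[c])
    print(ratio, steepest[ratio], [round(t, 2) for t in expected_target(3, ratio)])
```

In this model the true class keeps the largest share of the averaged target at every ratio below 100%, so the mean gradient still pushes its logit up the most.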

When the label randomization rate is 90%, accuracy starts falling after the 5th epoch. I suspect that at that point the model started to overfit, fitting its predictions for some images to their incorrect (randomized) labels.

It is interesting that accuracy for pixel randomization is higher than for label randomization when the randomization ratio is 30% or 60%, but the ordering is inverted at a ratio of 90%. I think this might be because too much of the information in each image is lost at a ratio that high. In contrast, even when 90% of the images have randomized labels, the remaining 10% are enough to attain reasonably good results.

# Technical details
The pixel distribution was computed as follows: the whole transformed dataset was scanned, and the number of pixels falling into each of the following intervals was counted:
$$ [0, 0.05), [0.05, 0.15), \ldots, [0.95, 1] $$
The relative frequency of each bucket became the probability of sampling that bucket's value. The values that could be picked were:
$$ \{0, 0.1, \ldots, 1\} $$
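
The bucketing and sampling described above can be sketched with the standard library alone. This is an illustrative reimplementation, not the notebook's code: `bucket_probabilities` and `sample_pixels` are hypothetical names, and `toy_pixels` is made-up data standing in for the real MNIST pixels.

```python
import bisect
import random

# Upper bucket edges 0.05, 0.15, ..., 1.05 and the value each bucket maps to (0.0, 0.1, ..., 1.0).
THRESHOLDS = [0.05 + 0.1 * i for i in range(11)]
VALUES = [round(0.1 * i, 1) for i in range(11)]

def bucket_probabilities(pixels):
    """Relative frequency of pixels per bucket; a pixel goes to the first bucket whose edge exceeds it."""
    counts = [0] * len(THRESHOLDS)
    for p in pixels:
        counts[bisect.bisect_right(THRESHOLDS, p)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def sample_pixels(probabilities, k, rng):
    """Draw k replacement values according to the bucket probabilities."""
    return rng.choices(VALUES, weights=probabilities, k=k)

# Made-up pixel values: mostly black, a few mid-gray, some near-white.
toy_pixels = [0.0] * 80 + [0.5] * 5 + [0.99] * 15
probs = bucket_probabilities(toy_pixels)
samples = sample_pixels(probs, 784, random.Random(0))
print(probs)
print(sorted(set(samples)))
```

Using a binary search per pixel (or a vectorized histogram) avoids scanning all thresholds for every pixel, which matters when the whole training set is processed.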

Test losses are reported in the outputs of the respective cells.