# Task Overview

In this task, your goal is to verify the impact of the data noise level on neural network training.
You should use an MLP architecture trained on the MNIST dataset (as in previous lab exercises).


We will experiment with two setups:
1. Pick X. Take X% of the training examples and reassign their labels to random ones. Note that we don't change anything in the test set.
2. Pick X. During each training step, for each sample, change the values of X% randomly selected pixels to random values. Note that we don't change anything in the test set.
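
The two setups can be sketched on toy data with only the standard library. This is a minimal illustration, not the notebook's implementation: the helpers `noisy_labels` and `noisy_pixels` are made up for this example, and both the uniform replacement labels and the uniform pixel values in [0, 1) are simplifying assumptions. The point is the timing: labels are corrupted once, while pixels are redrawn at every step.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def noisy_labels(labels, frac, num_classes=10):
    """Setup 1 (hypothetical helper): reassign a fraction of labels at random."""
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), round(len(labels) * frac)):
        noisy[i] = rng.randrange(num_classes)
    return noisy


def noisy_pixels(image, frac):
    """Setup 2 (hypothetical helper): replace a fraction of pixels at random."""
    noisy = list(image)
    for i in rng.sample(range(len(image)), round(len(image) * frac)):
        noisy[i] = rng.random()
    return noisy


labels = [i % 10 for i in range(1000)]  # toy labels, 10 classes
image = [0.0] * (28 * 28)               # toy all-black flattened image

train_labels = noisy_labels(labels, 0.2)  # drawn once, reused in every epoch
step_1 = noisy_pixels(image, 0.2)         # redrawn ...
step_2 = noisy_pixels(image, 0.2)         # ... at each training step
print(step_1 == step_2)                   # the two noisy copies differ
```

The test set is left untouched in both cases, so only the training inputs or targets ever pass through these helpers.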

For both setups, check the impact of various levels of noise (various values of X%) on model performance. Show plots comparing cross-entropy (log-loss) and accuracy for varying X%, and also comparing the two setups with each other.
Prepare a short report briefly explaining the results and observed trends. Consider questions like "why does accuracy/loss increase/decrease so quickly/slowly", "why is Z higher in setup 1/2", and any potentially surprising things you see on the charts.

### Potential questions, clarifications
* Q: Can I still use sigmoid/MSE loss?
  * A: You should train your network with softmax and cross-entropy loss (log-loss), especially since you should report the cross-entropy loss.
* Q: When I pick X% of pixels/examples, does it have to be exactly X% or can it be X% in expectation?
  * A: It's fine either way.
* Q: When I randomize pixels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Each training step/epoch.
* Q: When I randomize labels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Only once before training.
* Q: What is the expected length of report/explanation?
  * A: There is no minimum/maximum, but between 5 (concise) and 20 sentences should be good. Don't forget about plots.
* Q: When I replace labels/pixels with random values, what random distribution should I use?
  * A: A distribution reasonably similar to the data. However, you don't need to match the dataset's distribution exactly; an approximation is totally fine, especially if it's faster or easier to get.
* Q: Can I use something other than Colab/Jupyter Notebook? E.g. just Python files.
  * A: Yes, although a notebook is encouraged; please include both the code and a PDF in your solution.
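
The "exactly X%" versus "X% in expectation" distinction from the questions above can be illustrated as follows. This is a standard-library sketch; `pick_exact` and `pick_in_expectation` are hypothetical names introduced only for this example.

```python
import random


def pick_exact(n_items, frac, rng):
    """Return exactly round(n_items * frac) distinct indices."""
    return rng.sample(range(n_items), round(n_items * frac))


def pick_in_expectation(n_items, frac, rng):
    """Select each index independently with probability frac."""
    return [i for i in range(n_items) if rng.random() < frac]


rng = random.Random(0)
n_pixels = 28 * 28  # pixels in one MNIST image

exact = pick_exact(n_pixels, 0.3, rng)
approx = pick_in_expectation(n_pixels, 0.3, rng)

# The first count is always 235; the second varies around 235 from call to call.
print(len(exact), len(approx))
```

The independent (Bernoulli) variant is usually easier to vectorize as a mask, while the exact variant guarantees the same amount of noise in every draw.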

# Model definition and training.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import random

## Model definition
Use a modified version of the model from previous labs. Add dropout after each linear layer except the last one, as a form of regularization.

In [None]:
class Net(nn.Module):
    def __init__(self, layers, dropout):
        super().__init__()
        # layers lists the layer widths, input first;
        # after flattening an image of size 28x28 we have 784 inputs
        self.fcs = nn.ModuleList([nn.Linear(layers[i-1], layers[i]) for i in range(1, len(layers))])
        # A single dropout module is shared by all hidden layers
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        x = torch.flatten(x, 1)
        for fc in self.fcs[:-1]:
            x = fc(x)
            x = self.dropout(x)
            x = F.relu(x)
        x = self.fcs[-1](x)
        output = F.log_softmax(x, dim=1)
        return output


def change_pixels(data, pixels_frac, values, counts_p, batch_idx, log_interval, 
                  device):
    """
    Modifies randomly chosen pixels of the current batch in place.
    The new value of each pixel is drawn from the distribution of pixel
    values in the whole training part of the dataset.
    The fraction is applied to the batch as a whole, so a single image gets
    pixels_frac of its pixels replaced only in expectation.
    Does nothing when any of pixels_frac, values or counts_p is None.
    :param data: images from the current batch (modified in place);
    :param pixels_frac: fraction (0-1) of pixels to be changed;
    :param values: distinct pixel values found in the dataset;
    :param counts_p: relative frequency of each entry of values;
    :param batch_idx: index of the current batch, used only for logging;
    :param log_interval: show the first image every log_interval batches, or None;
    :param device: device on which data is stored.
    """
    if all(arg is not None for arg in [pixels_frac, values, counts_p]):
        show = log_interval is not None and batch_idx % log_interval == 0
        if show:
            # First image of the batch before adding noise
            plt.imshow(data[0].reshape(28, 28).cpu(), cmap="gray")
            plt.show()
        rand_num = round(data.numel() * pixels_frac)
        indexes = random.sample(range(data.numel()), rand_num)

        values_new = values[torch.multinomial(input=counts_p,
                                              num_samples=rand_num,
                                              replacement=True)]
        # view(-1) shares memory with data, so this edits the batch itself
        data.view(-1)[indexes] = values_new.to(device)
        if show:
            # The same image after adding noise
            plt.imshow(data[0].reshape(28, 28).cpu(), cmap="gray")
            plt.show()


def train(model, device, train_loader, optimizer, epoch, log_interval, 
          log_interval2, pixels_frac: float = None, values=None, counts_p=None):
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        change_pixels(data, pixels_frac, values, counts_p, batch_idx, 
                      log_interval, device)
        optimizer.zero_grad()
    
        output = model(data)
        loss = F.nll_loss(output, target)
        pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
        correct += pred.eq(target.view_as(pred)).sum().item()
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        if log_interval is not None and batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
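    # Note: train_loss sums per-batch mean losses but is divided by the number
    # of samples, so it is roughly batch_size times smaller than the per-sample
    # test loss and the two are not directly comparable.
    train_loss /= len(train_loader.dataset)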
    acc = 100. * correct / len(train_loader.dataset)
    if log_interval2 is not None and epoch % log_interval2 == 0:
        print('\nTrain epoch: {} Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            epoch, train_loss, correct, len(train_loader.dataset), acc))
    return train_loss, acc


def test(model, device, test_loader, log_interval2, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    acc = 100. * correct / len(test_loader.dataset)
    if log_interval2 is not None and epoch % log_interval2 == 0:
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset), acc))
    return test_loss, acc
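
The model ends with `F.log_softmax` and the loops use `F.nll_loss`; taken together, these compute the cross-entropy (log-loss) that the task asks for. The framework-free sketch below checks this on a toy example and also computes the loss of a model that predicts all 10 classes as equally likely, which is a handy reference level when reading the loss plots. The helper names are made up for this illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]
loss = nll(log_softmax(logits), target=0)

# The same value written directly as cross-entropy: -log(softmax(logits)[0])
direct = -math.log(math.exp(2.0) / sum(math.exp(z) for z in logits))

# A model with no preference among 10 classes has loss ln(10)
chance = nll(log_softmax([0.0] * 10), target=3)

print(loss, direct, chance)
```

A test loss near ln(10) ≈ 2.30 therefore indicates predictions that are no better than a uniform guess.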


In [None]:
batch_size = 256
test_batch_size = 1000
no_cuda = False  # set to True to force CPU even when CUDA is available
seed = 1

In [None]:
use_cuda = not no_cuda and torch.cuda.is_available()

torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")

train_kwargs = {'batch_size': batch_size}
test_kwargs = {'batch_size': test_batch_size}
if use_cuda:
    print('cuda used')
    cuda_kwargs = {'num_workers': 1,
                    'pin_memory': True,
                    'shuffle': True}
    train_kwargs.update(cuda_kwargs)
    test_kwargs.update(cuda_kwargs)

In [None]:
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
    ])
dataset1 = datasets.MNIST('../data', train=True, download=True,
                    transform=transform)
dataset2 = datasets.MNIST('../data', train=False,
                    transform=transform)

train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)
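
`transforms.ToTensor()` scales pixels to [0, 1] and `Normalize((0.1307,), (0.3081,))` then applies (x - mean) / std. All pixel values seen by the model, including the ones later collected for the pixel noise, are therefore on this shifted scale. A small worked example of the two extremes (the helper `normalize` is only for illustration):

```python
MEAN, STD = 0.1307, 0.3081  # constants used in the transform above


def normalize(x):
    """Apply the same affine map as transforms.Normalize for one value."""
    return (x - MEAN) / STD


black = normalize(0.0)  # background pixel
white = normalize(1.0)  # brightest pixel

print(round(black, 2), round(white, 2))
```

Background pixels end up slightly below zero (about -0.42) and the brightest pixels at about 2.82.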

In [None]:
def run_all(epochs: int, lr: float, log_interval: int, log_interval2: int, 
            momentum: float = None, scenario: int = 0, 
            pixels_frac: float = None, values=None, 
            counts_p=None, t_loader=None):
    """
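    Runs the whole routine: creates, trains and tests the model and
    collects the statistics needed for plotting.
    Scenario 0 is the baseline, 1 trains on noisy labels (passed in via
    t_loader) and 2 applies pixel noise at each training step.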
    """

    model = Net(layers=[784, 128, 128, 10], dropout=0.05).to(device)
    if momentum is None:
        optimizer = optim.Adam(model.parameters(), lr=lr)
    else:
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    if scenario == 0 or scenario == 1:
        train_args = {'pixels_frac': None, 'values': None, 'counts_p': None}
    elif scenario == 2:
        if pixels_frac == 0.:
            pixels_frac = None
        train_args = {'pixels_frac': pixels_frac, 'values': values, 
                      'counts_p': counts_p}

    train_losses = []
    train_accs = []
    test_losses = []
    test_accs = []
    if t_loader is None:
        t_loader = train_loader
    for epoch in range(1, epochs + 1):
        train_loss, train_acc = train(model, device, t_loader, optimizer, 
                                      epoch, log_interval, log_interval2,
                                      **train_args)
        test_loss, test_acc = test(model, device, test_loader, 
                                   log_interval2, epoch)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        train_accs.append(train_acc)
        test_accs.append(test_acc)
    return [train_losses, train_accs, test_losses, test_accs]
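
For a sense of scale, the number of trainable parameters of the `[784, 128, 128, 10]` MLP can be worked out by hand: each `nn.Linear(n_in, n_out)` holds n_in * n_out weights plus n_out biases. The helper `count_params` is a hypothetical name used only for this calculation.

```python
def count_params(layers):
    """Weights plus biases of a fully connected network with given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layers[:-1], layers[1:]))


total = count_params([784, 128, 128, 10])
print(total)  # 100480 + 16512 + 1290 = 118282
```

With roughly twice as many parameters as training examples, the network plausibly has enough capacity to fit some of the noisy labels, which is worth keeping in mind when interpreting the training curves.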

In [None]:
def avg_run(pixels_frac, run_num, epochs, log_interval, log_interval2, 
            values=None, counts_p=None, scenario=0, t_loader=None):
    """
    Runs 'run_all' 'run_num' times and returns the mean of all collected statistics.
    """
    avg = np.zeros((4, epochs))
    for i in range(run_num):
        stats = run_all(epochs=epochs, lr=1e-2, log_interval=log_interval, 
                        log_interval2=log_interval2, momentum=0.9, 
                        scenario=scenario, pixels_frac=pixels_frac, 
                        values=values, counts_p=counts_p, t_loader=t_loader)
        for j, stat in enumerate(stats):
            avg[j] += np.array(stat)

    avg /= run_num
    return avg

In [None]:
run_num = 3
epochs = 10

All necessary functions are now implemented, so we can collect statistics from a baseline run without any noise.
They are used later for comparison with the results on noisy data.

In [None]:
stats_basic = avg_run(pixels_frac=None, run_num=run_num, epochs=epochs, 
                      log_interval=None, log_interval2=int(epochs/2), 
                      scenario=0)


Train epoch: 5 Average loss: 0.0005, Accuracy: 57837/60000 (96%)


Test set: Average loss: 0.1045, Accuracy: 9672/10000 (97%)


Train epoch: 10 Average loss: 0.0003, Accuracy: 58829/60000 (98%)


Test set: Average loss: 0.0742, Accuracy: 9766/10000 (98%)


Train epoch: 5 Average loss: 0.0005, Accuracy: 57775/60000 (96%)


Test set: Average loss: 0.1106, Accuracy: 9662/10000 (97%)


Train epoch: 10 Average loss: 0.0003, Accuracy: 58781/60000 (98%)


Test set: Average loss: 0.0772, Accuracy: 9759/10000 (98%)


Train epoch: 5 Average loss: 0.0005, Accuracy: 57816/60000 (96%)


Test set: Average loss: 0.1058, Accuracy: 9674/10000 (97%)


Train epoch: 10 Average loss: 0.0003, Accuracy: 58793/60000 (98%)


Test set: Average loss: 0.0753, Accuracy: 9765/10000 (98%)
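
The train losses in this log look about two orders of magnitude smaller than the test losses. Much of that gap seems to come from how the two numbers are computed rather than from the model: `train` adds up per-batch mean losses and divides by the number of samples, while `test` sums per-sample losses before dividing. A back-of-the-envelope conversion, assuming every batch contributes a similar mean loss:

```python
import math

n_samples = 60000   # MNIST training set size
batch_size = 256    # as configured above

n_batches = math.ceil(n_samples / batch_size)  # batches per epoch
scale = n_samples / n_batches                  # close to batch_size

logged_train_loss = 0.0003  # epoch 10 as printed, rounded to 4 decimals
approx_per_sample = logged_train_loss * scale

print(n_batches, round(scale, 1), round(approx_per_sample, 3))
```

After rescaling, the train loss lands in the same range as the logged test loss (0.07 to 0.08), although the estimate is coarse because the printed value has only one significant digit. The remaining difference may also reflect that the train loss is averaged over an epoch with dropout active.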



# Training models in setup 1: with randomized labels.

In [None]:
def analyze_targets():
    """
    Reads the MNIST training data and loads it as a single batch.
    Returns the images, the targets and the distinct target values with their counts.
    """
    dataset1_calc = datasets.MNIST('../data', train=True, download=True, 
                                   transform=transform)
    train_kwargs2 = train_kwargs.copy()
    train_kwargs2.update({'batch_size': len(dataset1_calc), 'num_workers': 0})
    train_loader_calc = torch.utils.data.DataLoader(dataset1_calc, **train_kwargs2)
    data, target = next(iter(train_loader_calc))
    return data, target, torch.unique(target, return_counts=True, sorted=True)

Analyze the distribution of targets.

In [None]:
data_an, target_an, (values, counts) = analyze_targets()
counts_p = counts / torch.sum(counts)

In [None]:
def change_targets(frac: float, target, values, counts_p):
    """
    Replaces randomly chosen targets with labels drawn from the label
    distribution of the dataset. A drawn label may equal the original one.
    Returns the modified targets.
    """
    rand_num = round(target.shape[0] * frac)
    values_new = values[torch.multinomial(replacement=True, 
                                          num_samples=rand_num, 
                                          input=counts_p)]
    indexes = random.sample(range(len(target)), rand_num)
    target_new = target.clone().detach()
    target_new[indexes] = values_new
    return target_new

In [None]:
def prepare_new_loader(frac: float):
    """
    Creates a custom dataloader with modified targets.
    """
    target_new = change_targets(frac, target_an, values, counts_p)
    dataset_new = torch.utils.data.TensorDataset(data_an, target_new)
    return torch.utils.data.DataLoader(dataset_new, **train_kwargs)

In [None]:
target_fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
all_stats_mode1 = []
for target_frac in target_fracs:
    print('Training for: ', target_frac)
    train_loader_new = prepare_new_loader(target_frac)
    stats = avg_run(pixels_frac=None, run_num=run_num, epochs=epochs, 
                    log_interval=None, log_interval2=int(epochs/2), 
                    scenario=1, 
                    t_loader=train_loader_new)
    all_stats_mode1.append(stats)

Training for:  0.1

Train epoch: 5 Average loss: 0.0025, Accuracy: 52486/60000 (87%)


Test set: Average loss: 0.2317, Accuracy: 9642/10000 (96%)


Train epoch: 10 Average loss: 0.0023, Accuracy: 53219/60000 (89%)


Test set: Average loss: 0.2013, Accuracy: 9739/10000 (97%)


Train epoch: 5 Average loss: 0.0025, Accuracy: 52412/60000 (87%)


Test set: Average loss: 0.2297, Accuracy: 9625/10000 (96%)


Train epoch: 10 Average loss: 0.0023, Accuracy: 53225/60000 (89%)


Test set: Average loss: 0.2026, Accuracy: 9726/10000 (97%)


Train epoch: 5 Average loss: 0.0025, Accuracy: 52435/60000 (87%)


Test set: Average loss: 0.2223, Accuracy: 9641/10000 (96%)


Train epoch: 10 Average loss: 0.0023, Accuracy: 53278/60000 (89%)


Test set: Average loss: 0.1968, Accuracy: 9733/10000 (97%)

Training for:  0.2

Train epoch: 5 Average loss: 0.0039, Accuracy: 47003/60000 (78%)


Test set: Average loss: 0.3161, Accuracy: 9577/10000 (96%)


Train epoch: 10 Average loss: 0.0037, Accuracy: 47795/60000 (8
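
The gap between train and test accuracy in this log is expected: train accuracy is measured against the noisy labels. A model that always predicts the true class can only match a corrupted label by chance. The sketch below estimates that ceiling, assuming replacement labels are uniform over 10 classes (the notebook draws them from the empirical label distribution, which is close to uniform for MNIST). The function name is made up for this example.

```python
def train_accuracy_ceiling(frac, num_classes=10):
    """Accuracy on noisy training labels of a model that always predicts
    the true class, assuming uniformly random replacement labels."""
    return (1 - frac) + frac / num_classes


for frac in (0.1, 0.2, 0.5, 0.9):
    print(frac, round(train_accuracy_ceiling(frac), 2))
```

For X = 0.1 and X = 0.2 this gives about 91% and 82%, slightly above the logged train accuracies, while the test accuracy stays high because the test labels are clean.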

# Training models in setup 2: with randomized pixels.

In [None]:
def analyze_pixels():
    """
    Loads MNIST into a single-batch dataloader and returns the distinct
    pixel values with their counts in the dataset.
    """
    dataset1_calc = datasets.MNIST('../data', train=True, download=True,
                        transform=transform)
    train_kwargs2 = train_kwargs.copy()

    train_kwargs2.update({'batch_size': len(dataset1_calc), 'num_workers': 0})
    train_loader_calc = torch.utils.data.DataLoader(dataset1_calc, **train_kwargs2)
    data, target = next(iter(train_loader_calc))
    return torch.unique(data, return_counts=True, sorted=True)

Evaluate the distribution of pixel values in the training dataset.

In [None]:
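# Note: values and counts_p now describe pixel values and replace the
# label-based ones computed for setup 1.
values, counts = analyze_pixels()
counts_p = counts / torch.sum(counts)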

Let's create a list of X% values, each giving the fraction of pixels whose values are replaced according to the distribution in the dataset. For each value, train the model and collect the resulting statistics.

In [None]:
pixels_fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
all_stats_mode2 = []
for pixels_frac in pixels_fracs:
    print('Training for: ', pixels_frac)
    stats = avg_run(pixels_frac=pixels_frac, run_num=run_num, epochs=epochs,
                    log_interval=None, log_interval2=int(epochs/2), 
                    values=values, counts_p=counts_p, scenario=2)
    all_stats_mode2.append(stats)

Training for:  0.1

Train epoch: 5 Average loss: 0.0006, Accuracy: 57355/60000 (96%)


Test set: Average loss: 0.1053, Accuracy: 9658/10000 (97%)


Train epoch: 10 Average loss: 0.0004, Accuracy: 58333/60000 (97%)


Test set: Average loss: 0.0738, Accuracy: 9780/10000 (98%)


Train epoch: 5 Average loss: 0.0006, Accuracy: 57174/60000 (95%)


Test set: Average loss: 0.1109, Accuracy: 9651/10000 (97%)


Train epoch: 10 Average loss: 0.0004, Accuracy: 58269/60000 (97%)


Test set: Average loss: 0.0776, Accuracy: 9759/10000 (98%)


Train epoch: 5 Average loss: 0.0006, Accuracy: 57300/60000 (96%)


Test set: Average loss: 0.1136, Accuracy: 9657/10000 (97%)


Train epoch: 10 Average loss: 0.0004, Accuracy: 58293/60000 (97%)


Test set: Average loss: 0.0771, Accuracy: 9761/10000 (98%)

Training for:  0.2

Train epoch: 5 Average loss: 0.0007, Accuracy: 56481/60000 (94%)


Test set: Average loss: 0.1217, Accuracy: 9627/10000 (96%)


Train epoch: 10 Average loss: 0.0005, Accuracy: 57740/60000 (9

# Plots and report.

In [None]:
plots_titles = ['Train loss', 'Train accuracy', 'Test loss', 'Test accuracy']

In [None]:
def create_plots(all_necessary_stats, name, params, ytype=None):
    # Stacked shape: (number of noise levels, 4 statistics, epochs)
    all_stats_stacked = np.stack(all_necessary_stats)
    xs = np.arange(1, all_stats_stacked.shape[2] + 1)
    for i in range(4):
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=xs, y=stats_basic[i], mode='lines', 
                                 name='base_model'))
        for j in range(len(all_stats_stacked)):
            fig.add_trace(go.Scatter(x=xs, y=all_stats_stacked[j][i], mode='lines',
                                     name=name+': '+str(params[j])))
        fig.update_layout(title=plots_titles[i], xaxis_title='epochs', yaxis_title=plots_titles[i])
        if ytype is not None:
            # ytype is passed to plotly as the axis type, e.g. 'log'
            fig.update_yaxes(type=ytype)
        fig.show()

### Plot stats for basic model vs models trained on data with X% changed labels

In [None]:
# setup 1
create_plots(all_stats_mode1, 'target_frac', target_fracs)

### Plot stats for basic model vs models trained on data with X% changed pixels

In [None]:
# setup 2
create_plots(all_stats_mode2, 'pixels_frac', pixels_fracs)

### Plot setup1 vs setup2

In [None]:
# setup1 vs setup2
def create_plots2(all_necessary_stats, all_necessary_stats2, name, name2, params, params2, ytype=None):
    all_stats_stacked = np.stack(all_necessary_stats)
    all_stats_stacked2 = np.stack(all_necessary_stats2)
    xs = np.arange(1, all_stats_stacked.shape[2] + 1)
    for i in range(4):
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=xs, y=stats_basic[i], mode='lines', 
                                 name='base_model'))
        # Solid lines: first setup; dashed lines: second setup
        for j in range(len(all_stats_stacked)):
            fig.add_trace(go.Scatter(x=xs, y=all_stats_stacked[j][i], mode='lines',
                                     name=name+': '+str(params[j])))
            fig.add_trace(go.Scatter(x=xs, y=all_stats_stacked2[j][i], mode='lines',
                                     name=name2+': '+str(params2[j]), line=dict(dash='dash')))
        fig.update_layout(title=plots_titles[i], xaxis_title='epochs', yaxis_title=plots_titles[i])
        if ytype is not None:
            # ytype is passed to plotly as the axis type, e.g. 'log'
            fig.update_yaxes(type=ytype)
        fig.show()

In [None]:
create_plots2(all_stats_mode1, all_stats_mode2, 'target_frac', 'pixels_frac', target_fracs, pixels_fracs)

## Additional plots: trends after the last epoch for different values of X%

In [None]:
def create_plots_of_last_epochs(all_necessary_stats, name, params=None):
    all_stats_stacked = np.stack(all_necessary_stats)
    # x axis: noise fractions; falls back to target_fracs when not given
    xs = target_fracs if params is None else params
    # Keep only the last epoch: shape (number of noise levels, 4 statistics)
    last_epochs_stats = all_stats_stacked[:, :, -1]

    for i in range(4):
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=xs, y=last_epochs_stats[:, i], mode='lines'))
        fig.update_layout(title=plots_titles[i], xaxis_title=name, yaxis_title=plots_titles[i])
        fig.show()

In [None]:
create_plots_of_last_epochs(all_stats_mode1, 'target fracs')

In [None]:
create_plots_of_last_epochs(all_stats_mode2, 'pixels frac')