Assignment 5: Classification in PyTorch
=======================================


Microsoft Forms Document: https://forms.office.com/r/4PAnYRs2Bf


Task 1: Dataset Loading
-----------------------

We use two different datasets: the spambase dataset https://archive.ics.uci.edu/ml/datasets/spambase for binary classification and the wine dataset https://archive.ics.uci.edu/ml/datasets/wine for categorical classification.
Both datasets are available on the UCI Machine Learning Repository.
In the spambase dataset, the target values are stored in the last column, while the remaining columns are the input.
In the wine dataset, the target is stored in the first column, and the remaining columns are the input.

When working with PyTorch, samples should be stored as ``torch.Tensor`` objects and split into inputs $\mathbf X = [\vec x^{[1]}, \ldots, \vec x^{[N]}]^T \in \mathbb R^{N\times D}$ and targets.
There is **no need** to add a bias neuron to the input.
Note that the data matrix is transposed compared to what we have seen before: each row holds one sample.

For the targets, we have to be more careful, as the required format depends on the applied loss function.
For binary classification, we need $\mathbf T = [[t^{[1]}, \ldots, t^{[N]}]]^T$ to be in dimension $\mathbb R^{N\times1}$ and of type ``torch.float``.
For categorical classification, we only need the class indexes $\vec t = [t^{[1]}, \ldots, t^{[N]}]$ to be in dimension $\mathbb N^N$ and of type ``torch.long``.
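
The shapes and types described above can be checked on a toy example. The following is only a minimal sketch with made-up numbers: the ``rows`` list and all variable names are illustrative assumptions, and the target is assumed to sit in the last column.

```python
import torch

# toy data (made-up values): three samples, two input features, target in the last column
rows = [[0.5, 1.0, 0.0],
        [1.5, 2.0, 1.0],
        [2.5, 3.0, 1.0]]
data = torch.tensor(rows, dtype=torch.float)

# inputs: all columns except the target column, shape N x D
X = data[:, :-1]

# binary targets: slicing with -1: keeps the column dimension, shape N x 1, type float
T_binary = data[:, -1:]

# categorical targets: flat vector of class indexes, shape N, type long
T_categorical = data[:, -1].long()

print(X.shape, T_binary.shape, T_categorical.shape)
```

Slicing with ``-1:`` instead of ``-1`` is what keeps the second dimension for the binary targets.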

In [None]:
import os
import urllib.request
import torch

# download the two dataset files
dataset_files = {
  "spambase.data": "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/",
  "wine.data": "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/"
}
for name, url in dataset_files.items():
  # only download files that are not yet available locally
  if not os.path.exists(name):
    urllib.request.urlretrieve(url + name, name)
    print("Downloaded datafile", name)


def dataset(dataset_file="wine.data"):
  # read dataset
  data = []
  with open(dataset_file, 'r') as f:
    ...

  print(f"Loaded dataset with {len(data)} samples")
  
  # convert to torch.tensor
  ...

  if dataset_file == "wine.data":
    # target is in the first column and needs to be converted to long
    X = ...
    T = ...
  else:
    # target is in the last column and needs to be of type float
    X = ...
    T = ...
  return X, T

Test 1: Assert Valid Data
-------------------------

Load the wine dataset and make sure that the data has the correct dimensions, i.e., $\mathbf X\in \mathbb R^{N\times D}$ and $\mathbf T \in \mathbb N^N$.
Also ensure that all class labels are in the correct range $\{0, \ldots, O-1\}$.

Load the spambase data and ensure that all dimensions are correct and that the class labels are in $\{0, 1\}$.
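
One possible way to express such checks is sketched below on toy tensors. The helper ``check_dataset`` and its parameter ``O`` (the number of classes) are hypothetical names introduced only for illustration.

```python
import torch

def check_dataset(X, T, O):
    # hypothetical helper: basic sanity checks on inputs X and targets T
    N = X.shape[0]
    # inputs are a two-dimensional float matrix of shape N x D
    assert X.dim() == 2 and X.dtype == torch.float
    # there is exactly one target per sample
    assert T.shape[0] == N
    # class labels lie between 0 and O-1
    assert T.min() >= 0 and T.max() <= O - 1
    return True

# toy data (made-up values): four samples, two features, three classes
X = torch.zeros(4, 2)
T = torch.tensor([0, 2, 1, 2])
ok = check_dataset(X, T, O=3)
print(ok)
```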

In [None]:
# load email data
X, T = dataset("spambase.data")
# assert that everything is correct with the dataset
...

# load wine data
X, T = dataset("wine.data")
# assert that everything is correct with the dataset
...


Task 2: Split Training and Validation Data
------------------------------------------

Write a function that splits a given dataset into training and validation samples.
Use a random 80% of the data for training, and the remaining 20% for validation.

What do we need to ensure before splitting?
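
A split along the original file order can be heavily biased if the file is sorted by class, so one common approach is to shuffle the sample indexes first. The sketch below illustrates this on toy data; the helper name ``shuffled_split`` is hypothetical, and ``torch.randperm`` is just one possible way to obtain a random order.

```python
import torch

def shuffled_split(X, T, train_percentage=0.8):
    # hypothetical helper: draw a random permutation of the sample indexes
    N = X.shape[0]
    indexes = torch.randperm(N)
    # the first part of the permutation is used for training, the rest for validation
    n_train = int(N * train_percentage)
    train, val = indexes[:n_train], indexes[n_train:]
    # index inputs and targets with the same indexes so that they stay aligned
    return X[train], T[train], X[val], T[val]

# toy data (made-up values): ten samples with two features, sample i has target i
X = torch.arange(20, dtype=torch.float).reshape(10, 2)
T = torch.arange(10)
X_train, T_train, X_val, T_val = shuffled_split(X, T)
print(X_train.shape, X_val.shape)
```

Using the same index tensor for ``X`` and ``T`` keeps each input paired with its own target after shuffling.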


In [None]:
def split_training_data(X, T, train_percentage=0.8):
  # split into 80/20 training/validation
  X_train = ...
  T_train = ...
  X_val = ...
  T_val = ...

  return X_train, T_train, X_val, T_val


Task 3: Input Data Standardization
----------------------------------

Implement a function that standardizes all input data for the training and validation set.
Return the standardized data.
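
As an illustration, the sketch below standardizes toy data, under the assumption that mean and standard deviation are estimated on the training set only and then applied to both sets. All values and the ``_std`` variable names are made up for this example.

```python
import torch

# toy data (made-up values): three training samples and one validation sample
X_train = torch.tensor([[1., 10.], [2., 20.], [3., 30.]])
X_val = torch.tensor([[2., 40.]])

# statistics are computed per feature, i.e., over the sample dimension of the training set
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)

# the same training set statistics are applied to both sets
X_train_std = (X_train - mean) / std
X_val_std = (X_val - mean) / std

print(X_train_std)
print(X_val_std)
```

Note that a feature with zero standard deviation would lead to a division by zero here; this sketch does not handle that case.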

In [None]:
def standardize(X_train, X_val):
  # compute statistics
  mean = ...
  std = ...

  # standardize both X_train and X_val
  ...
  return X_train, X_val


Task 4: Network Implementation
------------------------------

Implement a function that returns a two-layer fully-connected network in PyTorch.
Use tanh as the activation function, and make the number of inputs $D$, the number of hidden neurons $K$ and the number of outputs $O$ configurable.
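
A two-layer network of this kind could look like the following sketch. The sizes are arbitrary example values. Since ``torch.nn.Linear`` learns a bias by default, no bias neuron has to be added to the input.

```python
import torch

# arbitrary example sizes: 4 inputs, 8 hidden neurons, 3 outputs
D, K, O = 4, 8, 3

network = torch.nn.Sequential(
    torch.nn.Linear(D, K),  # first fully-connected layer, bias included by default
    torch.nn.Tanh(),        # tanh activation on the hidden layer
    torch.nn.Linear(K, O),  # second fully-connected layer producing the logits
)

# forward a random batch of five samples to check the output shape
logits = network(torch.randn(5, D))
print(logits.shape)
```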

In [None]:
import torch

def Network(D, K, O):
  return torch.nn.Sequential(
    ...
  )


Task 5: Accuracy Computation
----------------------------

Implement a function that computes the accuracy from the provided network output (the logits) and the given target values.
Make sure that the implementation supports both binary and categorical targets.
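
One way to distinguish the two cases is to look at the number of logits per sample. The sketch below is an illustration under that assumption, not necessarily the intended solution; the function name ``accuracy_sketch`` and the toy values are made up.

```python
import torch

def accuracy_sketch(Z, T):
    if Z.shape[1] == 1:
        # binary case: a logit above 0 corresponds to a sigmoid output above 0.5
        predictions = (Z > 0).float()
    else:
        # categorical case: the predicted class is the index of the largest logit
        predictions = Z.argmax(dim=1)
    # fraction of samples for which prediction and target agree
    return (predictions == T).float().mean().item()

# binary toy example: three of the four predictions match the targets
Z_binary = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
T_binary = torch.tensor([[1.0], [0.0], [0.0], [0.0]])
acc_binary = accuracy_sketch(Z_binary, T_binary)

# categorical toy example: two of the three predictions match the targets
Z_categorical = torch.tensor([[1.0, 2.0, 0.0], [3.0, 0.0, 1.0], [0.0, 0.1, 0.2]])
T_categorical = torch.tensor([1, 0, 1])
acc_categorical = accuracy_sketch(Z_categorical, T_categorical)

print(acc_binary, acc_categorical)
```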

In [None]:
def accuracy(Z, T):
  # check if we have binary or categorical classification
  if ...:
    # binary classification
    return ...
  else:
    # categorical classification
    return ...

Test 2: Test Accuracy Function
------------------------------

Design target values and corresponding logit values with which you can test the correctness of your accuracy function.
Make sure that the accuracy function computes the expected values.
Test both binary and categorical accuracy.

In [None]:
# test binary classification
# ... design test logits and target values
Z = ...
T = ...
# ... test that the expected accuracy is computed
...

# test categorical classification
# ... design test logits and target values
Z = ...
T = ...
# ... test that the expected accuracy is computed
...

Task 6: Training Loop
---------------------

Implement a function that takes all parameters necessary to run a training on a given dataset.
This week, we will make use of the whole dataset in each training step, i.e., we will perform gradient descent (not SGD).
Hence, there is no need to define anything related to batches.

For each epoch, compute the training set and the validation set accuracy, as well as their losses, and return all of them.
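
The core of such a loop is sketched below on random toy data. The network sizes, the learning rate and the number of epochs are arbitrary assumptions. Note that ``torch.optim.SGD`` performs plain gradient descent when it is fed the complete dataset in every step.

```python
import torch

torch.manual_seed(0)

# toy binary problem (made-up data): the class depends on the sign of the first feature
X = torch.randn(20, 3)
T = (X[:, :1] > 0).float()

network = torch.nn.Sequential(
    torch.nn.Linear(3, 5), torch.nn.Tanh(), torch.nn.Linear(5, 1)
)
loss_function = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(network.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    # clear gradients from the previous step
    optimizer.zero_grad()
    # forward the whole dataset at once (no batches)
    Z = network(X)
    J = loss_function(Z, T)
    # compute gradients and perform one parameter update
    J.backward()
    optimizer.step()
    # store the loss as a plain Python number
    losses.append(J.item())

print(losses[0], losses[-1])
```

Storing ``J.item()`` rather than ``J`` avoids keeping the computation graph of every epoch alive.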

In [None]:
def train(...):
  optimizer = ...

  # collect loss and accuracy values
  train_loss, train_acc, val_loss, val_acc = [], [], [], []

  for epoch in ...:
    # train on training set
    # ... compute network output on training data
    ...
    # ... compute loss from network output and target data
    ...
    # ... perform parameter update
    ...
    # ... remember loss
    train_loss.append(...)
    # ... compute training set accuracy
    train_acc.append(...)

    # test on validation data
    with torch.no_grad():
      # ... compute network output on validation data
      ...
      # ... compute loss from network output and target data
      ...
      # ... remember loss
      val_loss.append(...)
      # ... compute validation set accuracy
      val_acc.append(...)

  # return the four lists of losses and accuracies
  return train_loss, train_acc, val_loss, val_acc

Task 7: Plotting Function
-------------------------

Implement a function that takes four lists containing the training loss, the training accuracy, the validation loss and the validation accuracy.
Plot the two losses into one plot, and the two accuracies into another plot.

In [None]:
from matplotlib import pyplot
def plot(train_loss, train_acc, val_loss, val_acc):
  pyplot.figure(figsize=(10,3))
  ax = pyplot.subplot(121)
  ax.plot(..., "g-", label="Training set loss")
  ax.plot(..., "b-", label="Validation set loss")
  ax.legend()

  ax = pyplot.subplot(122)
  ax.plot(..., "g-", label="Training set accuracy")
  ax.plot(..., "b-", label="Validation set accuracy")
  ax.legend()


Task 8: Binary Classification
-----------------------------

Load the data for binary classification, using the ``"spambase.data"`` file.
Split the data into training and validation sets.
Standardize both training and validation input data.

Instantiate a network with the correct number of input neurons, a given number $K$ of hidden neurons and one output neuron.
Instantiate the binary cross entropy loss function.
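
Since the network outputs logits, one option is ``torch.nn.BCEWithLogitsLoss``, which applies the logistic function internally. The sketch below uses made-up values to illustrate that this matches ``torch.nn.BCELoss`` applied to sigmoid outputs; whether this is the loss intended by the assignment is an assumption.

```python
import torch

# toy logits and binary targets (made-up values), both of shape N x 1
Z = torch.tensor([[0.0], [2.0], [-1.0]])
T = torch.tensor([[1.0], [1.0], [0.0]])

# loss computed directly on the logits
loss_on_logits = torch.nn.BCEWithLogitsLoss()(Z, T)

# loss computed on probabilities after an explicit logistic activation
loss_on_probabilities = torch.nn.BCELoss()(torch.sigmoid(Z), T)

print(loss_on_logits.item(), loss_on_probabilities.item())
```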

Train the network with our data for 10'000 epochs and plot the training and validation accuracies and losses.

In [None]:
# define loss function
loss = ...
# load dataset
X, T = ...
# split dataset
X_train, T_train, X_val, T_val = ...
# standardize input data
X_train, X_val = ...
# instantiate network
network = ...

# train network on our data
results = ...

# plot the results
plot(...)


Task 9: Categorical Classification
----------------------------------

Perform the same tasks with the ``"wine.data"`` dataset.
How many output neurons do we need?
Which loss function will we need this time?

How many hidden neurons will we need to get 100% training set accuracy?
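
For class-index targets of type ``torch.long``, ``torch.nn.CrossEntropyLoss`` is a common choice, as it takes the logits and the indexes directly. The following sketch uses made-up logits for three classes and compares the result with a manual computation; it is an illustration under these assumptions, not the prescribed solution.

```python
import torch

# toy logits for two samples and three classes (made-up values)
Z = torch.tensor([[2.0, 0.5, -1.0],
                  [0.0, 1.0, 0.0]])
# class indexes of type long, one per sample
T = torch.tensor([0, 1])

# the loss applies softmax internally and expects class indexes as targets
loss_value = torch.nn.CrossEntropyLoss()(Z, T)

# manual computation: mean negative log-probability of the correct classes
log_probabilities = torch.log_softmax(Z, dim=1)
manual_value = -log_probabilities[torch.arange(2), T].mean()

print(loss_value.item(), manual_value.item())
```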

In [None]:
# define loss function
loss = ...
# load dataset
X, T = ...
# split dataset
X_train, T_train, X_val, T_val = ...
# standardize input data
X_train, X_val = ...
# instantiate network
network = ...

# train network on our data
results = ...

# plot the results
plot(...)
