# Exercise 1: Classifying penguin species with PyTorch

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="750" />


Artwork by @allison_horst

In this exercise, we will use the Python package [``palmerpenguins``](https://github.com/mcnakhaee/palmerpenguins) to supply a toy dataset containing various features and measurements of penguins.

We have already created a PyTorch dataset which yields data for each of the penguins, but first we should examine the dataset and see what it contains.

### Task 1: Look at the data
In the following code block, we import the ``load_penguins`` function from the ``palmerpenguins`` package.

- Call this function, which returns a single object, and assign it to the variable ``data``.
  - Print ``data`` and recognise that ``load_penguins`` has returned a ``pandas.DataFrame``.
- Consider which features it might make sense to use in order to classify the species of the penguins.
  - You can print the column titles using ``pd.DataFrame.keys()``.
  - You can also obtain useful summary statistics using ``pd.DataFrame.describe()`` (or ``pd.Series.describe()`` for a single column).

In [5]:
from palmerpenguins import load_penguins

data = load_penguins()
# print(data)
print(data.keys())

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')
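
As a rough illustration of the inspection calls mentioned above, here is a sketch on a tiny hand-made ``DataFrame``. The three rows are made-up placeholder values, not real penguin measurements; only the column names are borrowed from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the penguin data: made-up values, real column names.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "bill_length_mm": [39.0, 46.0, 49.0],
        "body_mass_g": [3700.0, 5000.0, 3800.0],
    }
)

# Column titles.
columns = list(toy.keys())
print(columns)

# Summary statistics for the numeric columns (count, mean, std, min, ...).
summary = toy.describe()
print(summary)

# Summary of a single column (a pandas Series).
mass_summary = toy["body_mass_g"].describe()
print(mass_summary)
```

By default ``describe()`` only summarises the numeric columns, so ``species`` does not appear in ``summary``.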


Let's now discuss the features we will use to classify the penguins' species, and populate the following list together:
- flipper length, body mass, bill length and bill depth - these are all physical characteristics that vary by species.
- sex - biologically relevant, i.e. mass may depend on both species and sex.
We will not use:
- island - whilst it may be correlated, there is a danger we learn to predict the wrong thing (i.e. predict the island not the penguin type).
- year - as above

### Task 2: Creating a ``torch.utils.data.Dataset``

All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.

Spoiler alert: we've done this for you already in ``src/ml_workshop/_penguins.py``.

- Open the file ``src/ml_workshop/_penguins.py``.
- Let's examine, and discuss, each of the methods together.
  - ``__len__``
    - What does the ``__len__`` method do?
    - ...
  - ``__getitem__``
    - What does the ``__getitem__`` method do?
    - ...
- Review and discuss the class arguments.
  - ``input_keys``: columns to use as inputs
  - ``target_keys``: column to predict
  - ``train``: is this dataset going to be used to train or validate?
  - ``x_tfms``: transforms applied to the inputs so that they are usable by the network
  - ``y_tfms``: transforms applied to the targets so that they are usable by the network
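
To make the two magic methods concrete, here is a minimal sketch of a custom dataset. ``ToyDataset`` and its arguments are hypothetical names invented for illustration; this is not the workshop's ``PenguinDataset``, which also handles column selection, the train/validation split and transforms.

```python
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical minimal dataset; not the workshop's ``PenguinDataset``."""

    def __init__(self, inputs, targets):
        # Store the raw data; both sequences must have the same length.
        assert len(inputs) == len(targets)
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        # Number of items in the dataset; used by ``len(dataset)``.
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return the input-target pair at position ``idx``.
        return self.inputs[idx], self.targets[idx]


toy_set = ToyDataset([(1.0, 2.0), (3.0, 4.0)], ["a", "b"])
print(len(toy_set))   # number of items
print(toy_set[1])     # second input-target pair
```

These two methods are all a ``DataLoader`` needs in order to know how many items exist and how to fetch each one.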

### Task 3: Obtaining training and validation datasets

- Instantiate the penguin dataset.
  - Make sure you supply the correct column titles for the features and the targets.
- Iterate over the dataset
    - Hint:
        ```python
        for features, targets in dataset:
            # print the features and targets here
        ```

In [17]:
from ml_workshop import PenguinDataset
input_keys = ["bill_length_mm", "body_mass_g", "bill_depth_mm", "flipper_length_mm", "sex"]

data_set = PenguinDataset(
    input_keys=input_keys,
    target_keys=["species"],
    train=True,
)

output_keys = data.species.unique().tolist()

print(output_keys)

for features, target in data_set:
    print(features, target)

['Adelie', 'Gentoo', 'Chinstrap']
(42.9, 5000.0, 13.1, 215.0, 0.0) ('Gentoo',)
(46.1, 4500.0, 13.2, 211.0, 0.0) ('Gentoo',)
(44.9, 5100.0, 13.3, 213.0, 0.0) ('Gentoo',)
(43.3, 4400.0, 13.4, 209.0, 0.0) ('Gentoo',)
(42.0, 4150.0, 13.5, 210.0, 0.0) ('Gentoo',)
(46.5, 4550.0, 13.5, 210.0, 0.0) ('Gentoo',)
(44.0, 4350.0, 13.6, 208.0, 0.0) ('Gentoo',)
(40.9, 4650.0, 13.7, 214.0, 0.0) ('Gentoo',)
(42.6, 4950.0, 13.7, 213.0, 0.0) ('Gentoo',)
(42.7, 3950.0, 13.7, 208.0, 0.0) ('Gentoo',)
(45.3, 4300.0, 13.7, 210.0, 0.0) ('Gentoo',)
(47.2, 4925.0, 13.7, 214.0, 0.0) ('Gentoo',)
(45.2, 4750.0, 13.8, 215.0, 0.0) ('Gentoo',)
(43.6, 4900.0, 13.9, 217.0, 0.0) ('Gentoo',)
(43.8, 4300.0, 13.9, 208.0, 0.0) ('Gentoo',)
(45.5, 4200.0, 13.9, 210.0, 0.0) ('Gentoo',)
(45.7, 4400.0, 13.9, 214.0, 0.0) ('Gentoo',)
(43.3, 4575.0, 14.0, 208.0, 0.0) ('Gentoo',)
(47.5, 4875.0, 14.0, 212.0, 0.0) ('Gentoo',)
(46.2, 4375.0, 14.1, 217.0, 0.0) ('Gentoo',)
(48.5, 5300.0, 14.1, 220.0, 1.0) ('Gentoo',)
(48.7, 4450.0, 14.1, 

- Can we give these items to a neural network, or do they need to be transformed first?
  - Short answer: no, we can't just pass tuples of numbers or strings to a neural network.
    - We must represent these data as ``torch.Tensor``s.

### Task 4: Applying transforms to the data

A common way of transforming inputs to neural networks is to apply a series of transforms using ``torchvision.transforms.Compose``. The ``Compose`` object takes a list of callable objects and applies them to the incoming data.

These transforms can be very useful for mapping between file paths and tensors of images, etc.
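
The idea behind ``Compose`` can be sketched in a few lines of plain Python. ``SimpleCompose`` below is a hypothetical stand-in written for illustration, not torchvision's actual implementation, and the two toy transforms are assumptions chosen only to show the ordering.

```python
class SimpleCompose:
    """Hypothetical stand-in illustrating the idea behind ``Compose``."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        # Apply each transform in order, feeding each output to the next.
        for tfm in self.transforms:
            x = tfm(x)
        return x


# Two toy transforms: convert grams to kilograms, then round to one decimal.
pipeline = SimpleCompose([lambda g: g / 1000.0, lambda kg: round(kg, 1)])
result = pipeline(4375.0)
print(result)
```

The order of the list matters: each callable receives the output of the one before it.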

- Note: here we create a training and validation set.
    - We allow the model to learn directly from the training set — i.e. we fit the function to these data.
    - During training, we monitor the model's performance on the validation set in order to check how it's doing on unseen data. Normally, people use the validation performance to determine when to stop the training process.
- For the validation set, we choose ten males and ten females of each species. This means the validation set is less likely to be biased by sex and species, and is potentially a more reliable measure of performance. You should always be _very_ careful when choosing metrics and splitting data.

In [25]:
from torchvision.transforms import Compose
from torch import tensor, float32, eye
# Apply the transforms we need to the PenguinDataset to get our inputs
# and targets as Tensors.

def input_transforms():
    """
    Transform the tuple of data values into a torch tensor
    """
    tfm_list = [lambda x: tensor(x, dtype=float32)]
    
    return Compose(tfm_list)

def make_one_hot(x):
    """
    Transform penguin species string to a one-hot vector.
    
    :param x: Tuple containing species string
    :return: One-hot tensor of length ``len(output_keys)``.
    """
    ident = eye(len(output_keys))
    return ident[output_keys.index(x[0])]

def target_transforms():
    """
    Transform the species tuple into a one-hot torch tensor
    """
    tfm_list = [make_one_hot]
    
    return Compose(tfm_list)

train_set = PenguinDataset(
    input_keys=input_keys,
    target_keys=["species"],
    x_tfms=input_transforms(),
    y_tfms=target_transforms(),
    train=True,
)

valid_set = PenguinDataset(
    input_keys=input_keys,
    target_keys=["species"],
    x_tfms=input_transforms(),
    y_tfms=target_transforms(),
    train=False,
)

dummy = 'Chinstrap'
ident = eye(len(output_keys))
print(ident[output_keys.index(dummy)])

for features, target in valid_set:
    print(features, target)

tensor([0., 0., 1.])
tensor([  39.6000, 3900.0000,   20.7000,  191.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  34.5000, 2900.0000,   18.1000,  187.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  42.2000, 3550.0000,   18.5000,  180.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  40.3000, 3250.0000,   18.0000,  195.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  36.5000, 3150.0000,   18.0000,  182.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  38.1000, 3700.0000,   18.6000,  190.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  38.7000, 3450.0000,   19.0000,  195.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  36.4000, 2850.0000,   17.1000,  184.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  35.9000, 3050.0000,   16.6000,  190.0000,    0.0000]) tensor([1., 0., 0.])
tensor([  36.2000, 3550.0000,   16.1000,  187.0000,    0.0000]) tensor([1., 0., 0.])
tensor([3.7600e+01, 3.7500e+03, 1.9100e+01, 1.9400e+02, 1.0000e+00]) tensor([1., 0., 0.])
tensor([3.8600e+01, 3.8000e+03, 2.1200e

### Task 5: Creating ``DataLoaders``—and why

- Once we have created a ``Dataset`` object, we wrap it in a ``DataLoader``.
  - The ``DataLoader`` object allows us to put our inputs and targets in mini-batches, which makes for more efficient training.
    - Note: rather than supplying one input-target pair to the model at a time, we supply "mini-batches" of these data at once.
    - The number of items we supply at once is called the batch size.
  - The ``DataLoader`` can also randomly shuffle the data each epoch (when training).
  - It allows us to load different mini-batches in parallel, which can be very useful for larger datasets and images that can't all fit in memory at once.


Note: we are going to use batch normalisation layers in our network, which don't work during training if the batch size is one. This can happen on the last batch if we don't choose a batch size that evenly divides the number of items in the data set. To avoid this, we can set the ``drop_last`` argument to ``True``. The last batch, which would be of size ``len(data_set) % batch_size``, is then dropped, and the data are reshuffled for the next epoch. This is only relevant during training: in evaluation mode, batch normalisation uses the running statistics accumulated during training rather than per-batch statistics.
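
The batch arithmetic in the note above can be checked with a short sketch. The dataset size used here is a hypothetical number picked for illustration, not the actual size of the penguin training set.

```python
# Hypothetical dataset size, chosen only for illustration.
n_items = 275
batch_size = 16

full_batches = n_items // batch_size   # batches of exactly ``batch_size`` items
leftover = n_items % batch_size        # size of the final, smaller batch

# With drop_last=True the leftover batch is discarded; otherwise it is kept.
batches_with_drop = full_batches
batches_without_drop = full_batches + (1 if leftover else 0)

print(full_batches, leftover, batches_with_drop, batches_without_drop)
```

A leftover of exactly one item is the problematic case for batch normalisation; ``drop_last=True`` sidesteps it whatever the remainder happens to be.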

In [32]:
from torch.utils.data import DataLoader

batch_size = 16

# Create training and validation DataLoaders.
train_loader = DataLoader(train_set, batch_size, shuffle=True, drop_last=True)
valid_loader = DataLoader(valid_set, batch_size, shuffle=False)

for batch, targets in train_loader:
    print(batch.shape, targets.shape)

torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])
torch.Size([16, 5]) torch.Size([16, 3])


### Task 6: Creating a neural network in PyTorch

Here we will create our neural network in PyTorch, and have a general discussion on clean and messy ways of going about it.

- First, we will create quite an ugly network to highlight how to make a neural network in PyTorch on a very basic level.
- We will then discuss a trick for making the print-out nicer.
- Finally, we will discuss how the best approach would be to write a class where various parameters (e.g. number of layers, dropout probabilities, etc.) are passed as arguments.

In [33]:
from torch.nn import Module
from torch.nn import BatchNorm1d, Linear, ReLU, Dropout, Sequential


class FCNet(Module):
    """Fully-connected neural network."""
    
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self._fwd_seq_ = Sequential(BatchNorm1d(in_feats), Linear(in_feats, 16),
                                    BatchNorm1d(16), Dropout(0.1), Linear(16, 16), 
                                    BatchNorm1d(16), Dropout(0.1), Linear(16, out_feats))
    
    def forward(self, batch):
        return self._fwd_seq_(batch)
    

model = FCNet(len(input_keys), len(output_keys))
print(model)

FCNet(
  (_fwd_seq_): Sequential(
    (0): BatchNorm1d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Linear(in_features=5, out_features=16, bias=True)
    (2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.1, inplace=False)
    (4): Linear(in_features=16, out_features=16, bias=True)
    (5): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Dropout(p=0.1, inplace=False)
    (7): Linear(in_features=16, out_features=3, bias=True)
  )
)
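
Returning to the last point in the list at the start of this task, here is one possible sketch of a network whose sizes are passed as arguments. ``ConfigurableFCNet`` and its parameter names (``hidden_feats``, ``num_hidden``, ``dropout``) are hypothetical; the defaults are chosen so that it reproduces the same sequence of layers as ``FCNet`` above.

```python
from torch import randn
from torch.nn import BatchNorm1d, Dropout, Linear, Module, Sequential


class ConfigurableFCNet(Module):
    """Hypothetical variant of ``FCNet`` with its sizes passed as arguments."""

    def __init__(self, in_feats, out_feats, hidden_feats=16, num_hidden=2, dropout=0.1):
        super().__init__()
        layers = [BatchNorm1d(in_feats)]
        feats = in_feats
        for _ in range(num_hidden):
            # Each hidden block mirrors the pattern used in ``FCNet`` above.
            layers += [Linear(feats, hidden_feats), BatchNorm1d(hidden_feats), Dropout(dropout)]
            feats = hidden_feats
        layers.append(Linear(feats, out_feats))
        self.layers = Sequential(*layers)

    def forward(self, batch):
        return self.layers(batch)


net = ConfigurableFCNet(in_feats=5, out_feats=3)
out = net(randn(4, 5))   # a random mini-batch of four items with five features
print(out.shape)
```

Building the layer list in a loop means the depth and width can be changed without editing the class body.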


### Task 7: Selecting a loss function

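- Binary cross-entropy is one of the most common loss functions for classification.
  - Details on this loss function are available in the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
  - Note that ``BCELoss`` expects predictions in the range [0, 1], so an activation function must be applied to the model's raw outputs before the loss is computed.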
- Let's instantiate it together.

In [34]:
from torch.nn import BCELoss

loss = BCELoss()
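
To see what this loss computes, the following sketch evaluates ``BCELoss`` on a single made-up prediction and compares it with a hand calculation. The prediction values are assumptions for illustration only; with the default settings the loss is averaged over all elements.

```python
from math import log

from torch import tensor
from torch.nn import BCELoss

loss_func = BCELoss()

# Made-up pseudo-probabilities for one item, and its one-hot target.
preds = tensor([[0.7, 0.2, 0.1]])
target = tensor([[1.0, 0.0, 0.0]])

value = loss_func(preds, target).item()

# Manual check: mean over elements of -[t*log(p) + (1-t)*log(1-p)].
manual = -(log(0.7) + log(1 - 0.2) + log(1 - 0.1)) / 3
print(value, manual)
```

The two printed numbers should agree up to single-precision rounding.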

### Task 8: Selecting an optimiser

While we talked about stochastic gradient descent in the slides, most people use the so-called [Adam optimiser](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).

You can think of it as a more complex and improved implementation of SGD.

In [35]:
# Create an optimiser and give it the model's parameters.
from torch.optim import Adam

optim = Adam(model.parameters())
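
As a rough sketch of how an optimiser is used once it has been created, the snippet below performs a single update step. The one-layer model, the random data and the squared-error loss are stand-ins chosen only to keep the example self-contained; they are not part of the exercise.

```python
from torch import randn
from torch.nn import Linear
from torch.optim import Adam

# Stand-in model and data, used only to show the optimiser calls.
toy_model = Linear(2, 1)
toy_optim = Adam(toy_model.parameters())
inputs, targets = randn(8, 2), randn(8, 1)

before = toy_model.weight.detach().clone()

toy_optim.zero_grad()                                    # clear gradients from any previous step
loss_value = (toy_model(inputs) - targets).pow(2).mean() # stand-in loss
loss_value.backward()                                    # compute gradients of the loss
toy_optim.step()                                         # update the parameters

after = toy_model.weight.detach().clone()
print((after - before).abs().max())
```

The same three calls (``zero_grad``, ``backward``, ``step``) form the core of a training loop.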

### Task 9: Writing basic training and validation loops

- Before we jump in and write these loops, we must first choose an activation function to apply to the model's outputs.
  - Here we are going to use the softmax activation function: see [the PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html).
  - For those of you who've studied physics, you may be reminded of the partition function in thermodynamics.
  - This activation function is good for classification when the result is one of ``A or B or C``.
    - It's bad if you ever want to assign two classifications to one image, say a photo of a dog _and_ a cat.
  - It turns the raw outputs, or logits, into "pseudo-probabilities", and we take our prediction to be the most probable class.
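
For reference, softmax maps logits $z_i$ to $\exp(z_i) / \sum_j \exp(z_j)$. A plain-Python sketch of this, using made-up logits for three classes, might look like the following; in the exercise itself you would use PyTorch's own implementation.

```python
from math import exp


def softmax(logits):
    """Map a list of raw scores to positive values that sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shifted = [x - max(logits) for x in logits]
    exps = [exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]             # made-up raw outputs for three classes
probs = softmax(logits)
prediction = probs.index(max(probs))  # index of the most probable class
print(probs, prediction)
```

Because the outputs always sum to one, raising the score of one class necessarily lowers the others, which is why softmax suits mutually exclusive classes.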

- We will write the training loop together, then you can go ahead and write the (simpler) validation loop.

In [None]:
from typing import Dict


def train_one_epoch(
    model: Module,
    train_loader: DataLoader,
    optimiser: Adam,
    loss_func: BCELoss,
) -> Dict[str, float]:
    """Train ``model`` for one epoch.

    Parameters
    ----------
    model : Module
        The neural network.
    train_loader : DataLoader
        Training dataloader.
    optimiser : Adam
        The optimiser.
    loss_func : BCELoss
        Binary cross-entropy loss function.

    Returns
    -------
    Dict[str, float]
        A dictionary of metrics.

    """


def validate_one_epoch(
    model: Module,
    valid_loader: DataLoader,
    loss_func: BCELoss,
) -> Dict[str, float]:
    """Validate ``model`` for a single epoch.

    Parameters
    ----------
    model : Module
        The neural network.
    valid_loader : DataLoader
        Validation dataloader.
    loss_func : BCELoss
        Binary cross-entropy loss function.

    Returns
    -------
    Dict[str, float]
        Metrics of interest.

    """
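
If you get stuck, here is one possible shape for the training loop, run on random stand-in data so that it is self-contained. The function name ``sketch_train_one_epoch``, the metric names and the one-layer stand-in model are all assumptions made for illustration; this is a sketch, not the workshop's reference solution.

```python
from typing import Dict

from torch import eye, randint, randn
from torch.nn import BCELoss, Linear
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset


def sketch_train_one_epoch(model, loader, optimiser, loss_func) -> Dict[str, float]:
    """Hypothetical sketch of one training epoch; returns mean loss and accuracy."""
    model.train()  # enable training behaviour (dropout, batch-norm batch statistics)
    total_loss, correct, seen = 0.0, 0, 0
    for batch, targets in loader:
        optimiser.zero_grad()
        preds = model(batch).softmax(dim=1)   # logits -> pseudo-probabilities
        loss = loss_func(preds, targets)
        loss.backward()
        optimiser.step()

        # Accumulate per-item totals so the means are correct over the epoch.
        total_loss += loss.item() * len(batch)
        correct += (preds.argmax(dim=1) == targets.argmax(dim=1)).sum().item()
        seen += len(batch)
    return {"loss": total_loss / seen, "accuracy": correct / seen}


# Stand-in data: 32 random items with 5 features and random one-hot targets.
toy_inputs = randn(32, 5)
toy_targets = eye(3)[randint(0, 3, (32,))]
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_targets), batch_size=16)

toy_model = Linear(5, 3)
metrics = sketch_train_one_epoch(toy_model, toy_loader, Adam(toy_model.parameters()), BCELoss())
print(metrics)
```

A validation loop would typically follow the same pattern, but with ``model.eval()``, the forward pass wrapped in ``torch.no_grad()``, and no optimiser calls.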

### Task 10: Training, extracting and plotting metrics

- Now we can train our model for a specified number of epochs.
  - During each epoch the model "sees" each training item once.
- Append the training and validation metrics to a list.
- Turn them into a ``pandas.DataFrame``
  - Note: You can turn a ``List[Dict[str, float]]``, say ``my_list`` into a ``DataFrame`` with ``DataFrame(my_list)``.
- Use Matplotlib to plot the training and validation metrics as a function of the number of epochs.

We will begin the code block together before you complete it independently.  
After some time we will go through the solution together.
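
The list-to-``DataFrame`` conversion mentioned above can be sketched as follows. The metric names and values are made up purely to show the mechanics.

```python
from pandas import DataFrame

# Made-up metrics for three epochs, purely to show the conversion.
my_list = [
    {"loss": 0.60, "accuracy": 0.55},
    {"loss": 0.45, "accuracy": 0.70},
    {"loss": 0.38, "accuracy": 0.80},
]

metrics_frame = DataFrame(my_list)   # one row per epoch, one column per metric
print(metrics_frame)
print(metrics_frame["loss"].tolist())
```

Each column of the resulting frame can then be passed directly to Matplotlib, with the row index serving as the epoch number.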

In [None]:
epochs = 3

for _ in range(epochs):
    pass

### Task 11: Visualise some results

Let's do this part together—though feel free to make a start on your own if you have completed the previous exercises.

In [None]:
import matplotlib.pyplot as plt

### Bonus: Run the net on 'new' inputs

We have built and trained a net, and evaluated and visualised its performance. However, how do we now utilise it going forward?

Here we construct some 'new' input data and use our trained net to infer the species. Whilst this is relatively straightforward there is still some work required to transform the outputs from the net to a meaningful result.

In [None]:
from torch import no_grad

# Construct a tensor of inputs to run the model over

# Place model in eval mode and run over inputs with no_grad

# Print the raw output from the net

# Transform the raw output back to human-readable format
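
The steps outlined in the comments above might look something like the following sketch. To keep it self-contained it uses an untrained one-layer stand-in instead of the trained ``model``, so the predicted labels are meaningless; the input values are also made up. The species order is taken from the ``output_keys`` print-out earlier in this exercise.

```python
from torch import no_grad, tensor
from torch.nn import Linear

# Species order taken from the ``output_keys`` print-out earlier in this exercise.
species = ["Adelie", "Gentoo", "Chinstrap"]

# Untrained stand-in for the trained network, so this snippet runs on its own.
stand_in_model = Linear(5, 3)

# Made-up inputs in the order: bill length, body mass, bill depth, flipper length, sex.
new_inputs = tensor([[45.0, 4500.0, 14.0, 210.0, 0.0], [38.0, 3500.0, 18.0, 190.0, 1.0]])

stand_in_model.eval()            # switch off dropout / use batch-norm running statistics
with no_grad():                  # no gradients are needed for inference
    raw_output = stand_in_model(new_inputs)

# Turn logits into pseudo-probabilities, then map the argmax back to a name.
probs = raw_output.softmax(dim=1)
labels = [species[i] for i in probs.argmax(dim=1).tolist()]
print(raw_output)
print(labels)
```

In the exercise you would replace ``stand_in_model`` with your trained ``model`` and keep the feature order identical to ``input_keys``.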
