# Lab Instructions

In each lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. We provide code structured so that you can fill in the blanks using what you learned in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide you with partial code to ensure that the right variables are populated and that any code that follows runs as expected:

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## 4.12 Lab 1B: Non-Linear Regression

In this lab, we will keep using the same [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg), and we'll be building upon the previous lab (Lab 1A).

The columns, or attributes, of this dataset are as follows:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Remember that the last column, `car name`, is actually separated by tabs (instead of spaces), so we're considering the cars' names as comments while loading the dataset.
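
As a small illustration of this trick, the sketch below parses two illustrative lines laid out like the raw file (space-separated numbers, a tab, then the quoted car name) using the same `read_csv` arguments as the recap. The `sample` string and the `sample_df` name are made up for this example only.

```python
import io
import pandas as pd

# Two illustrative lines: numbers separated by spaces, then a tab and the car name.
# The second line uses '?' to mark a missing horsepower value.
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']

# comment='\t' makes the parser ignore everything from the tab onwards,
# so the car name never becomes a column
sample_df = pd.read_csv(io.StringIO(sample), names=names, na_values='?',
                        comment='\t', sep=' ', skipinitialspace=True)

print(sample_df.shape)
print(sample_df['hp'].isna().tolist())
```

The resulting dataframe should contain only the eight numeric columns, with the `?` entry read as a missing value.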

The following section offers a quick recap of the work done in the previous lab. You're welcome to use your own solution as a starting point, but please keep in mind that you may need to make some adjustments in this case. We suggest you work on this lab using the suggested recap first and, only once you're finished, try replacing the recap with your own code.

### 4.12.1 Recap

Let's recap what we did in the last lab to properly load and preprocess our dataset, so we can use it to train a non-linear regression model in PyTorch. You may run all the cells in this section as they are.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

First, we loaded the data into a Pandas dataframe and split it into training, validation, and test sets:

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']

df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)

shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
raw_data = {}
trainval, raw_data['test'] = train_test_split(shuffled, test_size=0.16, shuffle=False)
raw_data['train'], raw_data['val'] = train_test_split(trainval, test_size=0.2, shuffle=False)

Next, we dropped any rows with missing values in them:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

In [2]:
for k in raw_data.keys():
    # Reassign instead of modifying in place to avoid pandas' chained-assignment warnings
    raw_data[k] = raw_data[k].dropna()

In Chapter 1, we wrote helper functions to both standardize continuous attributes and encode categorical attributes as sequential indices, so they can be used to retrieve embeddings later:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder

def standardize(df, cont_attr, scaler=None):
    cont_X = df[cont_attr].values
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(cont_X)
    cont_X = scaler.transform(cont_X)
    cont_X = torch.as_tensor(cont_X, dtype=torch.float32)
    return cont_X, scaler

def encode(df, cat_attr, encoder=None):
    cat_X = df[cat_attr].values
    if encoder is None:
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    cat_X = encoder.transform(cat_X)
    cat_X = torch.as_tensor(cat_X, dtype=torch.int)
    return cat_X, encoder
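
To make "sequential indices" concrete, here is a minimal sketch of what `OrdinalEncoder` does on a made-up array with two categorical columns. The names `toy_cat`, `toy_encoder`, and `toy_codes` are hypothetical and are not used elsewhere in the lab.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data: first column mimics cylinders, second mimics origin
toy_cat = np.array([[4, 1],
                    [8, 3],
                    [6, 1],
                    [4, 2]])

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_cat)

# One array of sorted unique values per column
print(toy_encoder.categories_)
# Each value is replaced by its position in the corresponding array
print(toy_codes)  # [[0. 0.] [2. 2.] [1. 0.] [0. 1.]]
```

Each column gets its own zero-based index, which is what an embedding layer expects as input.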

In the previous lab, we built a custom dataset that returned a tuple `(features, target)` where the `features` element was a tuple `(cont_X, cat_X)` itself:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

In [5]:
# Adapted from https://github.com/yashu-seth/pytorch-tabular
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, raw_data, cont_attr, disc_attr, target_col, scaler=None, encoder=None):
        self.n = raw_data.shape[0]
        self.target = torch.as_tensor(raw_data[[target_col]].values, dtype=torch.float32)
        self.cont_data, self.scaler = standardize(raw_data, cont_attr, scaler)
        self.cat_data, self.encoder = encode(raw_data, disc_attr, encoder)
        
    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        features = (self.cont_data[idx], self.cat_data[idx])
        target = self.target[idx]
        return (features, target)
    
    
cont_attr = ['disp', 'hp', 'weight', 'acc']
disc_attr = ['cyl', 'origin']
target_col = 'mpg'

datasets = {'train': None, 'val': None, 'test': None}
datasets['train'] = TabularDataset(raw_data['train'], cont_attr, disc_attr, target_col)
datasets['val'] = TabularDataset(raw_data['val'], cont_attr, disc_attr, target_col, 
                                 datasets['train'].scaler, datasets['train'].encoder)
datasets['test'] = TabularDataset(raw_data['test'], cont_attr, disc_attr, target_col, 
                                  datasets['train'].scaler, datasets['train'].encoder)

Once the datasets were ready, we created data loaders so we could load mini-batches of data, one at a time:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [10]:
from torch.utils.data import DataLoader

dataloaders = {'train': None, 'val': None, 'test': None}
dataloaders['train'] = DataLoader(datasets['train'], batch_size=32, shuffle=True, drop_last=True)
dataloaders['val'] = DataLoader(datasets['val'], batch_size=16, drop_last=True)
dataloaders['test'] = DataLoader(datasets['test'], batch_size=16, drop_last=True)
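
As a quick reminder of what `drop_last=True` does, the sketch below builds a data loader over a made-up dataset of ten points. The names `toy_dataset` and `toy_loader` are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten made-up data points with a single feature each
toy_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))

# With batch_size=4, the last (incomplete) mini-batch of two points is dropped
toy_loader = DataLoader(toy_dataset, batch_size=4, drop_last=True)

n_batches = len(toy_loader)
batch_sizes = [batch[0].shape[0] for batch in toy_loader]
print(n_batches, batch_sizes)  # 2 [4, 4]
```

Every mini-batch has exactly `batch_size` points, at the cost of leaving a few points out of each pass.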

Finally, we fetch one mini-batch of data, which we'll use to try out the model we're about to build:

In [11]:
(cont_feat, cat_feat), targets = next(iter(dataloaders['train']))

### 4.12.2 Embeddings: From Categorical to Continuous

Write code to create a list of embedding layers, each layer configured to handle one particular attribute, that is, one layer to embed `cyl` and another one to embed `origin`. You're free to choose the number of elements/dimensions that the resulting arrays will have.

In [24]:
encoder = datasets['train'].encoder

embedding_layers = []

# write your code here
emb_dim = ...

for categories in encoder.categories_:
    # write your code here
    layer = ...
    embedding_layers.append(layer)

Just run the cell below as is to visualize the output:

In [None]:
embedding_layers
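
If you need a reminder of how a single embedding layer behaves, here is a minimal sketch using a made-up attribute with three categories and two embedding dimensions. The names `toy_layer`, `toy_indices`, and `toy_vectors` are hypothetical, and the sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A lookup table with one learnable row (of two values) per category
toy_layer = nn.Embedding(num_embeddings=3, embedding_dim=2)

# Category indices for four data points; indices 2 and 2 repeat on purpose
toy_indices = torch.tensor([0, 2, 2, 1])
toy_vectors = toy_layer(toy_indices)

print(toy_vectors.shape)  # torch.Size([4, 2])
# The same index always retrieves the same row
print(torch.equal(toy_vectors[1], toy_vectors[2]))  # True
```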

Now, try out your layers by embedding the first five rows of your categorical training data. You should get a list containing two tensors with five rows and as many columns/dimensions as you chose in the previous step.

In [None]:
embeddings = []

for i in range(encoder.n_features_in_):
    data = cat_feat[:5, i]
    
    # write your code here
    emb_values = ...
    
    embeddings.append(emb_values)

Just run the cell below as is to visualize the output:

In [None]:
embeddings

In practice, though, your model won't be using a list of embeddings, but their concatenation along the horizontal axis instead. You can use `torch.cat` to accomplish this. Just run the cell below as is to visualize the output:

In [None]:
torch.cat(embeddings, 1)
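
The second argument of `torch.cat` is the dimension along which the tensors are joined. The sketch below uses two made-up tensors (`a` and `b`, hypothetical names) to show the difference from the default behavior.

```python
import torch

a = torch.zeros(5, 3)
b = torch.ones(5, 2)

# dim=1 joins the tensors side by side: the rows must match, the columns add up
wide = torch.cat([a, b], 1)
print(wide.shape)  # torch.Size([5, 5])

# dim=0 (the default) stacks rows instead, so the columns must match
tall = torch.cat([a, a], 0)
print(tall.shape)  # torch.Size([10, 3])
```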

Now your categorical attributes are represented by many (learned) numerical features. Later on, when building your model, you will have to concatenate both the original continuous features and those learned via embeddings.

### 4.12.3 Custom Model

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

Your next task is to build a custom model that can handle continuous and categorical features (via embeddings), and that is non-linear in nature. Before moving on, let's briefly discuss two topics: `ModuleList` and the importance of non-linearities.
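
On the second topic, here is a minimal sketch of why an activation function is needed between linear layers: without one, two stacked linear layers compute exactly the same thing as a single linear layer. The layer sizes and the names `first`, `second`, `stacked`, and `collapsed` are made up for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

first = nn.Linear(3, 4)
second = nn.Linear(4, 1)
x = torch.randn(8, 3)

with torch.no_grad():
    # Two linear layers, no activation in between
    stacked = second(first(x))

    # The same computation written as ONE linear transformation
    W = second.weight @ first.weight
    b = second.weight @ first.bias + second.bias
    collapsed = x @ W.T + b

    # Adding a ReLU in between breaks this equivalence
    with_relu = second(torch.relu(first(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))  # True
print(torch.allclose(stacked, with_relu, atol=1e-5))
```

However many linear layers you stack, the model stays linear unless an activation function sits between them.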

#### 4.12.3.1 `ModuleList`

`ModuleList` is a special type of list, one that allows PyTorch to recursively look for the learnable parameters of the layers and models inside its contents. As it turns out, if an attribute of your custom model is a regular Python list, any layers or models inside it will be ignored by PyTorch during training. By explicitly making a `ModuleList` out of a regular Python list, we ensure that its parameters are also accounted for.

In our custom model, we have a list of embedding layers, one for each categorical attribute. Therefore, if we want our model to properly learn these embeddings, we need to make it a `ModuleList`.
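
The difference is easy to check by counting parameters. In the sketch below, the two classes (`WithPlainList` and `WithModuleList`, hypothetical names) hold the same two linear layers, but only one of them exposes the layers' weights and biases to PyTorch.

```python
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # Regular Python list: PyTorch does not look inside it
        self.layers = [nn.Linear(2, 2), nn.Linear(2, 1)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: every layer inside it is registered as a submodule
        self.layers = nn.ModuleList([nn.Linear(2, 2), nn.Linear(2, 1)])

n_plain = len(list(WithPlainList().parameters()))
n_module = len(list(WithModuleList().parameters()))

# Two layers, each with one weight and one bias tensor
print(n_plain, n_module)  # 0 4
```

An optimizer created from `WithPlainList().parameters()` would have nothing to update, so those layers would never be trained.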

#### 4.12.3.2 Methods

A custom model class must implement a couple of methods:
- `__init__(self)`
- `forward(self, x)`

In the constructor method, you will define the parts that make up your model, like linear layers and embeddings, as attributes of the model. Don't forget to include a call to `super().__init__()` at the top of the method so it executes the code from the parent class before your own. In our case, the model will receive the following arguments:

- `n_cont`: the number of continuous attributes
- `cat_list`: a list of lists of unique values of categorical attributes (as returned by the `categories_` property of the `OrdinalEncoder`)
- `emb_dim`: the number of dimensions of each embedding (we're keeping them the same for every categorical attribute for simplicity)

The `forward()` method is where the magic happens, as you know. It receives an input `x`, which can be anything (e.g. a tensor, a tuple, a dictionary), and forwards this input through your model's components, such as layers, activation functions, and embeddings. In the end, it should return a prediction. The diagram below illustrates the flow of the inputs through the model's components in the forward pass. Please refer to it when implementing the method.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch3/lab1_model.png)

In [None]:
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, n_cont, cat_list, emb_dim):
        super().__init__()
        
        # Embedding layers
        embedding_layers = []
        # Creates one embedding layer for each categorical feature
        # just like you did in the previous section
        # write your code here
        ...
        
        self.emb_layers = nn.ModuleList(embedding_layers)

        # Total number of embedding dimensions
        self.n_emb = len(cat_list) * emb_dim
        self.n_cont = n_cont

        # Linear Layer(s)
        lin_layers = []
        
        # The input layer takes as many inputs as the number of continuous features
        # plus the total number of concatenated embeddings
        
        # The number of outputs is your own choice
        # Optionally, add more hidden layers, don't forget to match the dimensions if you do

        # write your code here
        ...
        
        self.lin_layers = nn.ModuleList(lin_layers)
        
        # The output layer must have as many inputs as there were outputs in the last hidden layer
        # write your code here
        self.output_layer = ...

        # Layer initialization - initialization scheme
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data, nonlinearity='relu')
        nn.init.kaiming_normal_(self.output_layer.weight.data, nonlinearity='relu')

    def forward(self, inputs):
        # The inputs are the features as returned in the first element of a tuple
        # coming from the dataset/dataloader
        # Make sure you split it into continuous and categorical attributes according
        # to your dataset implementation of __getitem__
        cont_data, cat_data = inputs
        
        # Retrieve embeddings for each categorical attribute and concatenate them
        embeddings = []
        
        # write your code here
        ...
        
        embeddings = torch.cat(embeddings, 1)
        
        # Concatenate all features together, continuous and embeddings
        # write your code here
        x = ...
        
        # Run the inputs through each layer and apply an activation function to each output
        for layer in self.lin_layers:
            # write your code here
            ...
            
        # Run the output of the last linear layer through the output layer
        # write your code here
        ...
        
        # Return the prediction
        # write your code here
        return ...

### 4.12.4 Training

Now it is time to write your own training loop. First, you need to instantiate your model.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

Just run the cell below as is to populate a few variables and visualize the outputs:

In [None]:
scaler = datasets['train'].scaler
encoder = datasets['train'].encoder

n_cont = scaler.n_features_in_
cat_list = encoder.categories_

n_cont, cat_list

The `n_cont` variable contains the number of continuous attributes you're using, and that were scaled by the `StandardScaler`. The `cat_list` variable contains a list of lists, each inner list containing the unique values corresponding to one of the categorical attributes.

Both variables, together with the number of embedding dimensions you chose earlier (`emb_dim`), should be used as arguments to create an instance of your custom model class (`FFN`):

In [None]:
torch.manual_seed(42)

# write your code here
model = ...

Now, create the appropriate loss function for the task:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step2.png)

In [None]:
# write your code here
loss_fn = ...

Then, create an optimizer that will update your model's parameters:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step3.png)

In [None]:
# Suggested learning rate
lr = 1e-2

# write your code here
optimizer = ...

Next, you will write the training loop using the data loaders to iterate through your training and validation data (these loops are written for you already).

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

The features returned by our dataset are tuples (as opposed to simple tensors), so don't forget to send each one of its components to the appropriate device. 
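
A tuple has no `.to()` method of its own, so each tensor inside it must be sent to the device individually. The sketch below shows one possible way of doing it, using made-up tensors; `toy_features` is a hypothetical name.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# A made-up (continuous, categorical) pair, similar in structure to our features
toy_features = (torch.randn(4, 3), torch.randint(0, 3, (4, 2)))

# Tuples don't have a .to() method
print(hasattr(toy_features, 'to'))  # False

# Send each component to the device and rebuild the tuple
toy_features = tuple(t.to(device) for t in toy_features)
print([t.device.type for t in toy_features])
```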

Remember that models have two modes, training and evaluation; set them accordingly. Optionally, you can also implement early stopping.

Use the model, optimizer, and loss function you just created to perform the four steps inside the training loop: forward pass, computing losses, computing gradients, and updating parameters. Don't forget to zero the gradients too.
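
As a generic reminder of these steps, here is a minimal sketch on a made-up problem with a single linear layer and plain tensors as inputs. It is not the solution to the lab: the data, the model, the SGD optimizer, and every name prefixed with `toy_` are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Made-up data: the target is simply the sum of the three features
toy_x = torch.randn(16, 3)
toy_y = toy_x.sum(dim=1, keepdim=True)

toy_model = nn.Linear(3, 1)
toy_loss_fn = nn.MSELoss()
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)

toy_losses = []
toy_model.train()                       # training mode
for _ in range(50):
    preds = toy_model(toy_x)            # Step 1 - forward pass
    loss = toy_loss_fn(preds, toy_y)    # Step 2 - computing the loss
    loss.backward()                     # Step 3 - computing the gradients
    toy_optimizer.step()                # Step 4 - updating parameters...
    toy_optimizer.zero_grad()           # ...and zeroing gradients
    toy_losses.append(loss.item())

toy_model.eval()                        # evaluation mode
print(toy_losses[0], toy_losses[-1])
```

The loss at the end should be lower than the loss at the start. In the lab, the same steps apply, except that the features are tuples and the data comes in mini-batches.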

***
**ASIDE: TQDM**

[TQDM](https://github.com/tqdm/tqdm) is a nice and simple Python package that works as a progress bar for loops. You simply wrap whatever you're looping over with a call to `tqdm()` and you get a working progress bar.

In the code below, we set the progress bar like this:

```python
progress_bar = tqdm(range(n_epochs))

for epoch in progress_bar:
    # do your magic here
```

As the loop runs, it will print a progress bar below the running cell.
***

In [None]:
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_epochs = 100

losses = torch.empty(n_epochs)
val_losses = torch.empty(n_epochs)

best_loss = torch.inf
best_epoch = -1
patience = 3

model.to(device)

progress_bar = tqdm(range(n_epochs))

for epoch in progress_bar:
    batch_losses = torch.empty(len(dataloaders['train']))
    
    ## Training
    for i, (batch_features, batch_targets) in enumerate(dataloaders['train']):
        # Set the model to training mode
        # write your code here
        ...
        
        # Send batch features and targets to the device
        # write your code here
        ...
        
        # Step 1 - forward pass
        predictions = ...

        # Step 2 - computing the loss
        loss = ...

        # Step 3 - computing the gradients
        # Tip: it requires a single method call to backpropagate gradients
        # write your code here
        ...
        
        batch_losses[i] = loss.item()

        # Step 4 - updating parameters and zeroing gradients
        # Tip: it takes two calls to optimizer's methods
        # write your code here
        ...
        
    losses[epoch] = batch_losses.mean()

    ## Validation   
    with torch.inference_mode():
        batch_losses = torch.empty(len(dataloaders['val']))    

        for i, (val_features, val_targets) in enumerate(dataloaders['val']):
            # Set the model to evaluation mode
            # write your code here
            ...

            # Send batch features and targets to the device
            # write your code here
            ...

            # Step 1 - forward pass
            predictions = ...

            # Step 2 - computing the loss
            loss = ...
            
            batch_losses[i] = loss.item()

        val_losses[epoch] = batch_losses.mean()
        
        if val_losses[epoch] < best_loss:
            best_loss = val_losses[epoch]
            best_epoch = epoch
            torch.save({'model': model.state_dict(), 
                        'optimizer': optimizer.state_dict()}, 'best_model.pth')
        elif (epoch - best_epoch) > patience:
            print(f"Early stopping at epoch #{epoch}")
            break
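
To see how the early stopping check at the end of the loop behaves, the sketch below replays the same logic over a made-up sequence of validation losses (`toy_val_losses` and `stopped_at` are hypothetical names). No model is involved.

```python
# Made-up validation losses: the best value happens at epoch 2
toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9]

patience = 3
best_loss = float('inf')
best_epoch = -1
stopped_at = None

for epoch, val_loss in enumerate(toy_val_losses):
    if val_loss < best_loss:
        # New best: remember it (this is where the checkpoint would be saved)
        best_loss = val_loss
        best_epoch = epoch
    elif (epoch - best_epoch) > patience:
        # More than `patience` epochs since the last improvement
        stopped_at = epoch
        break

print(best_epoch, stopped_at)  # 2 6
```

With `patience = 3`, training tolerates three epochs without improvement and stops on the fourth one.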

Let's check the evolution of the losses. Run the cell below as is to plot your losses:

In [None]:
# Both tensors are filled up to and including the last epoch that ran
plt.plot(losses[:epoch + 1], label='Training')
plt.plot(val_losses[:epoch + 1], label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.legend()

Then, let's compare predicted and actual values in the validation set. Hopefully, the predictions will be much better than those of our former linear regression.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Run the cell below as is to visualize a scatterplot comparing predicted and actual values of fuel consumption. A perfect prediction corresponds to the dashed diagonal line.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
split = 'val'
batch = list(datasets[split][:][0])
batch[0] = batch[0].to(device)
batch[1] = batch[1].to(device)
ax.scatter(datasets[split][:][1].tolist(), model(batch).tolist(), alpha=.5)
ax.plot([0, 45], [0, 45], linestyle='--', c='k', linewidth=1)
ax.set_xlabel('Actual')
ax.set_xlim([0, 45])
ax.set_ylabel('Predicted')
ax.set_ylim([0, 45])
ax.set_title('MPG')