# Lab Instructions

In each lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. We provide code structured so that you can fill in the blanks using what you learned in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide you with partial code to ensure the right variables are populated and that any code that follows runs as expected:

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## 2.7 Lab 1A: Non-Linear Regression

In this lab, you will use the same [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg), but we'll bring more features to the mix, as you will also learn how to encode discrete/categorical features so they can be used to train the model.

The columns, or attributes, of this dataset are as follows:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Remember that the last column, `car name`, is actually separated by tabs (instead of spaces), so we're considering the cars' names as comments while loading the dataset.
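To see why treating the tab as a comment character works, here is a minimal sketch that parses two lines laid out like the raw file from an in-memory string. The sample text and the names `sample` and `df_sample` are illustrative assumptions, not part of the lab:

```python
import io
import pandas as pd

# Two lines mimicking the raw file: space-separated values, then a tab and the car name
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']

# comment='\t' makes the parser ignore everything from the tab onwards (the car name)
# na_values='?' turns the question mark into a missing value (NaN)
df_sample = pd.read_csv(io.StringIO(sample), names=column_names, na_values='?',
                        comment='\t', sep=' ', skipinitialspace=True)
print(df_sample)
```

The resulting dataframe has eight columns only, and the `?` in the second line shows up as a missing value in the `hp` column.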

We're loading the dataset into a Pandas dataframe just like before. Run the code below as is to load the data:

In [1]:
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['mpg', 'cyl', 'disp', 'hp', 'weight', 'acc', 'year', 'origin']

df = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True)

Just run the code below as is to visualize the output:

In [2]:
df

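| | mpg | cyl | disp | hp | weight | acc | year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |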


### 2.7.1 Train-Validation-Test Split

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

Shuffle the dataset, and then split it into training, validation, and test sets using Scikit-Learn's `train_test_split()` function:

In [3]:
from sklearn.model_selection import train_test_split

# write your code here
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

# write your code here
raw_data = {}
trainval, raw_data['test'] = train_test_split(shuffled, test_size=0.16, shuffle=False)
raw_data['train'], raw_data['val'] = train_test_split(trainval, test_size=0.2, shuffle=False)
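Since the second split takes 20% of what is left after the first one, the validation set ends up with roughly 17% of all rows, and the training set with roughly 67%. The sketch below runs the same two-step split on a hypothetical 100-row dataframe (`toy` and `splits` are illustrative names) and checks that every row lands in exactly one split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the shuffled dataframe: 100 rows with a single column
toy = pd.DataFrame({'value': range(100)})

splits = {}
trainval, splits['test'] = train_test_split(toy, test_size=0.16, shuffle=False)
splits['train'], splits['val'] = train_test_split(trainval, test_size=0.2, shuffle=False)

# Sizes of each split
print({k: len(v) for k, v in splits.items()})

# Every row should show up in exactly one split
all_idx = pd.concat(list(splits.values())).index
print(all_idx.is_unique, len(all_idx) == len(toy))
```

Because `shuffle=False` keeps the row order, the training rows come first, followed by the validation rows, and then the test rows.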

### 2.7.2 Cleaning Data

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

In this lab, we're throwing away rows with missing values, so make sure there are no NAs left in your datasets.

In [None]:
# write your code here
for k in raw_data.keys():
    # reassigning avoids the SettingWithCopyWarning that inplace=True may raise on a split
    raw_data[k] = raw_data[k].dropna()
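One way to confirm that nothing was left behind is to count the missing values in each split after dropping them. The sketch below does that on a hypothetical dictionary of small dataframes (`toy_data` stands in for `raw_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical splits with a couple of missing values, standing in for raw_data
toy_data = {
    'train': pd.DataFrame({'hp': [130.0, np.nan, 150.0], 'mpg': [18.0, 15.0, 16.0]}),
    'val': pd.DataFrame({'hp': [95.0, 88.0], 'mpg': [24.0, np.nan]}),
}

for k in toy_data:
    toy_data[k] = toy_data[k].dropna()

# Total number of missing values left in each split (should be zero everywhere)
remaining = {k: int(v.isna().sum().sum()) for k, v in toy_data.items()}
print(remaining)
```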

### 2.7.3 Continuous Attributes

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

We've done this already, but this time you should write a `standardize()` function that:
- takes a Pandas dataframe, a list of column names that are continuous attributes, and an optional scaler
- creates and trains a Scikit-Learn `StandardScaler` if one isn't provided as an argument
- returns a PyTorch tensor containing the standardized features and an instance of Scikit-Learn's `StandardScaler`

In [None]:
import torch
from sklearn.preprocessing import StandardScaler

def standardize(df, cont_attr, scaler=None):
    # write your code here
    cont_X = df[cont_attr].values
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(cont_X)
    cont_X = scaler.transform(cont_X)
    cont_X = torch.as_tensor(cont_X, dtype=torch.float32)
    
    # cont_X is a tensor containing the standardized features
    # scaler is an instance of Scikit-Learn's StandardScaler
    return cont_X, scaler

Use your `standardize` function to standardize all continuous attributes in our datasets. Don't forget you shouldn't train scalers on validation and test sets. They must use the scaler trained on the training set!

In [None]:
cont_attr = ['disp', 'hp', 'weight', 'acc']

cont_data = {'train': None, 'val': None, 'test': None}

# write your code here
cont_data['train'], scaler = standardize(raw_data['train'], cont_attr)
cont_data['val'], _ = standardize(raw_data['val'], cont_attr, scaler)
cont_data['test'], _ = standardize(raw_data['test'], cont_attr, scaler)
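To illustrate what reusing the training scaler means in practice, here is a minimal sketch on hypothetical random features (the arrays `train_X` and `val_X` are made up for illustration). The training set ends up with zero mean and unit standard deviation, while the validation set is transformed with the training statistics, so its mean and standard deviation are only close to those values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical continuous features: validation drawn from a slightly shifted distribution
train_X = rng.normal(loc=100.0, scale=20.0, size=(200, 2))
val_X = rng.normal(loc=105.0, scale=20.0, size=(50, 2))

scaler = StandardScaler().fit(train_X)   # statistics come from the training set only
train_std = scaler.transform(train_X)
val_std = scaler.transform(val_X)        # reuses the training mean and standard deviation

print(train_std.mean(axis=0), train_std.std(axis=0))  # approximately 0 and 1
print(val_std.mean(axis=0), val_std.std(axis=0))      # close to, but not exactly, 0 and 1
```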

### 2.7.4 Categorical Attributes

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Similarly to the standardization function you already wrote, now write an `encode()` function to encode categorical attributes that:
- takes a Pandas dataframe, a list of column names that are categorical attributes, and an optional encoder
- creates and trains a Scikit-Learn `OrdinalEncoder` if one isn't provided as an argument
- returns a PyTorch tensor containing the encoded categorical features and an instance of Scikit-Learn's `OrdinalEncoder`

In [None]:
from sklearn.preprocessing import OrdinalEncoder

def encode(df, cat_attr, encoder=None):
    # write your code here
    cat_X = df[cat_attr].values
    if encoder is None:
        encoder = OrdinalEncoder()
        encoder.fit(cat_X)
    cat_X = encoder.transform(cat_X)
    cat_X = torch.as_tensor(cat_X, dtype=torch.int)
    
    return cat_X, encoder

Use your `encode` function to encode all categorical attributes in our datasets. Don't forget you shouldn't train encoders on validation and test sets. They must use the encoder trained on the training set!

In [None]:
disc_attr = ['cyl', 'origin']

cat_data = {'train': None, 'val': None, 'test': None}
# write your code here
cat_data['train'], encoder = encode(raw_data['train'], disc_attr)
cat_data['val'], _ = encode(raw_data['val'], disc_attr, encoder)
cat_data['test'], _ = encode(raw_data['test'], disc_attr, encoder)

The `categories_` attribute of the trained encoder should be a list of arrays of unique values, one array for each encoded attribute (just run the code below as is to visualize the output):

In [None]:
encoder.categories_

If we check the encoded attributes, their unique values should be tensors of sequential numbers starting at zero (just run the code below as is to visualize the output):

In [None]:
cat_data['train'][:, 0].unique(), cat_data['train'][:, 1].unique()
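One caveat of fitting the encoder on the training set only is that a rare category may appear in the validation or test set without ever being seen during fitting. The sketch below uses hypothetical cylinder counts to show the mapping learned by the encoder and what happens with an unseen value. By default, `OrdinalEncoder` raises an error in this case; the `handle_unknown='use_encoded_value'` option is assumed to be available in your Scikit-Learn version (it was introduced in 0.24):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cylinder counts; the value 5 never shows up in the training set
train_cyl = np.array([[4], [8], [6], [4]])
val_cyl = np.array([[8], [5]])

encoder = OrdinalEncoder()
encoder.fit(train_cyl)
print(encoder.categories_)                   # sorted unique values seen during fitting
print(encoder.transform(train_cyl).ravel())  # 4 -> 0, 6 -> 1, 8 -> 2

# The default behavior is to raise an error on categories unseen during fitting
try:
    encoder.transform(val_cyl)
    raised = False
except ValueError:
    raised = True
print(raised)

# Alternatively, unseen categories can be mapped to a placeholder value
tolerant = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
tolerant.fit(train_cyl)
val_codes = tolerant.transform(val_cyl).ravel()
print(val_codes)
```

Keep in mind that a negative placeholder cannot be used directly as an index, so it would need extra handling further down the pipeline.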

### 2.7.5 Target and Task

Your features are already taken care of, so it's time to create column tensors for your target attribute. Make sure they are of the type `float32`.

In [None]:
target_data = {'train': None, 'val': None, 'test': None}
target_col = 'mpg'

# write your code here
for k in raw_data.keys():
    target_data[k] = torch.as_tensor(raw_data[k][[target_col]].values, dtype=torch.float32)
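The double brackets around the column name are what make the result a column tensor. A minimal sketch on a hypothetical three-row dataframe (`toy` is an illustrative name) shows the difference in shape:

```python
import pandas as pd
import torch

toy = pd.DataFrame({'mpg': [18.0, 15.0, 16.0]})

# Double brackets select a dataframe, so .values is a 2D array of shape (N, 1)
column = torch.as_tensor(toy[['mpg']].values, dtype=torch.float32)
# Single brackets select a series, so .values is a 1D array of shape (N,)
flat = torch.as_tensor(toy['mpg'].values, dtype=torch.float32)

print(column.shape, flat.shape)
```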

### 2.7.6 Custom Dataset

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

Previously, we used a simple `TensorDataset` for our single feature and target. Now let's build our own custom dataset class instead by inheriting from the `Dataset` class. 

It needs to implement some basic methods:
- `__init__(self)`
- `__getitem__(self, index)`
- `__len__(self)`

The constructor (`__init__()`) method may receive any arguments you may possibly need, so you can create and preprocess your tensors right away or, as is often the case when your dataset is too large, load them on demand. In our case, the constructor will receive the following arguments:

- `raw_data`: a Pandas dataframe containing our (small) dataset
- `cont_attr`: a list of the continuous attributes we'd like to use
- `disc_attr`: a list of the discrete/categorical attributes we'd like to use
- `target_col`: the name of the column containing the target attribute we'd like to predict
- `scaler`: an optional instance of a `StandardScaler` to standardize the continuous attributes
- `encoder`: an optional instance of an `OrdinalEncoder` to encode the discrete attributes sequentially

You can use these arguments to preprocess and store the resulting tensors as instance attributes, which you can retrieve at your convenience when other methods are called. Remember that you have already written functions to standardize continuous attributes and to encode categorical ones, so feel free to use them.

In the `__getitem__()` method, which makes a dataset indexable and "sliceable" just like a Python list, you should return a tuple `(features, target)` corresponding to the requested index. Notice that the first element of your tuple, `features`, does not necessarily need to be a single tensor. It may be anything: another tuple, or even a dictionary. Remember that we have two types of features, continuous and categorical, and they are going to be handled differently in our model.

In the `__len__()` method, you only need to return the total number of elements in your dataset.

In [None]:
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, raw_data, cont_attr, disc_attr, target_col, scaler=None, encoder=None):
        # write your code here
        self.n = raw_data.shape[0]
        self.target = torch.as_tensor(raw_data[[target_col]].values, dtype=torch.float32)
        self.cont_data, self.scaler = standardize(raw_data, cont_attr, scaler)
        self.cat_data, self.encoder = encode(raw_data, disc_attr, encoder)
        
    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # write your code here
        features = (self.cont_data[idx], self.cat_data[idx])
        target = self.target[idx]

        return (features, target)

Once your custom class has been defined, use it to create training, validation, and test datasets. Don't forget that the scaler and the encoder should be fitted on the training set only!

In [None]:
datasets = {'train': None, 'val': None, 'test': None}
# write your code here
datasets['train'] = TabularDataset(raw_data['train'], cont_attr, disc_attr, target_col)
datasets['val'] = TabularDataset(raw_data['val'], cont_attr, disc_attr, target_col, datasets['train'].scaler, datasets['train'].encoder)
datasets['test'] = TabularDataset(raw_data['test'], cont_attr, disc_attr, target_col, datasets['train'].scaler, datasets['train'].encoder)

Just run the code below as is to visualize the output:

In [None]:
datasets['train'][:5]

You should see the features and targets of the first five elements from your training set.
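If you'd like to double-check the structure of what your dataset returns, inspecting shapes is usually easier than reading raw values. The sketch below uses a hypothetical `ToyTabularDataset`, built from random tensors, that returns items in the same `((continuous, categorical), target)` layout assumed throughout this lab:

```python
import torch
from torch.utils.data import Dataset

class ToyTabularDataset(Dataset):
    """Hypothetical stand-in for the lab's dataset, built from random tensors."""
    def __init__(self, n=10):
        self.cont_data = torch.randn(n, 4)           # four continuous attributes
        self.cat_data = torch.randint(0, 3, (n, 2))  # two categorical attributes
        self.target = torch.randn(n, 1)

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        return (self.cont_data[idx], self.cat_data[idx]), self.target[idx]

toy_ds = ToyTabularDataset()

# A slice is passed straight to tensor indexing, so it returns five rows at once
(cont, cat), target = toy_ds[:5]
print(cont.shape, cat.shape, target.shape)

# An integer index returns a single data point
single_cont, single_cat = toy_ds[0][0]
print(single_cont.shape, single_cat.shape)
```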

### 2.7.7 Data Loaders

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

Next, you need to create data loaders, one for each set. It is recommended to shuffle the training set, but don't bother shuffling the others. Dropping the last mini-batch, in case your set isn't a perfect multiple of your mini-batch size, is also recommended.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {'train': None, 'val': None, 'test': None}
# write your code here
dataloaders['train'] = DataLoader(datasets['train'], batch_size=32, shuffle=True, drop_last=True)
dataloaders['val'] = DataLoader(datasets['val'], batch_size=16, drop_last=True)
dataloaders['test'] = DataLoader(datasets['test'], batch_size=16, drop_last=True)
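To get a feel for what a data loader yields when the features are a tuple, and for the effect of dropping the last mini-batch, here is a minimal sketch. It relies on a hypothetical `ToyTabularDataset` made of random tensors, and the names `loader` and `full_loader` are illustrative:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTabularDataset(Dataset):
    """Hypothetical stand-in for the lab's dataset, built from random tensors."""
    def __init__(self, n=100):
        self.cont_data = torch.randn(n, 4)           # four continuous attributes
        self.cat_data = torch.randint(0, 3, (n, 2))  # two categorical attributes
        self.target = torch.randn(n, 1)

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        return (self.cont_data[idx], self.cat_data[idx]), self.target[idx]

toy_ds = ToyTabularDataset(100)
loader = DataLoader(toy_ds, batch_size=32, shuffle=True, drop_last=True)
full_loader = DataLoader(toy_ds, batch_size=32)

# With drop_last=True, the incomplete mini-batch (the last 4 points out of 100) is discarded
print(len(loader), len(full_loader))

# The default collate function stacks each element of the nested tuple separately
(cont, cat), target = next(iter(loader))
print(cont.shape, cat.shape, target.shape)
```

Notice that dropping the last mini-batch means a few data points are left out of each pass over the set, which is worth keeping in mind when you compute metrics on the validation and test sets.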