# **Lab: Neural Networks**

## Exercise 3: Multi-Class Classification with PyTorch

In this exercise, we will build a neural network with PyTorch to predict car evaluations. We will be working on the Car Evaluation dataset:
https://raw.githubusercontent.com/aso-uts/applied_ds/master/unit3/dataset/Car%20Evaluation.csv


The steps are:
1.   Setup Repository
2.   Load and Explore Dataset
3.   Prepare Data
4.   Baseline Model
5.   Define Architecture
6.   Create Data Loader
7.   Train Model
8.   Assess Performance
9.   Push Changes

### 1. Setup Repository

In [3]:
# Task: Go inside the created folder adv_dsi_lab_5
! cd adv_dsi_lab_5

/bin/bash: line 0: cd: adv_dsi_lab_5: No such file or directory


In [6]:
cd /home/jovyan/work

/home/jovyan/work


In [7]:
# Task: Create a new git branch called pytorch_multi_class
! git checkout -b pytorch_multi_class

Switched to a new branch 'pytorch_multi_class'


### 2. Load and Explore Dataset

**[2.0]** Change Working Directory

In [2]:
cd /home/jovyan/work

/home/jovyan/work


In [8]:
! cat /etc/issue*

Ubuntu 20.04.4 LTS \n \l

Ubuntu 20.04.4 LTS


**[2.1]** Download the dataset into the `data/raw` folder: https://raw.githubusercontent.com/aso-uts/applied_ds/master/unit3/dataset/Car%20Evaluation.csv

In [9]:
! wget -P data/raw https://raw.githubusercontent.com/aso-uts/applied_ds/master/unit3/dataset/Car%20Evaluation.csv

--2022-03-08 09:08:03--  https://raw.githubusercontent.com/aso-uts/applied_ds/master/unit3/dataset/Car%20Evaluation.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53678 (52K) [text/plain]
Saving to: ‘data/raw/Car Evaluation.csv’


2022-03-08 09:08:03 (4.70 MB/s) - ‘data/raw/Car Evaluation.csv’ saved [53678/53678]



**[2.2]** Launch the magic commands for auto-reloading external modules

In [10]:
# Task: Launch the magic commands for auto-reloading external modules
%load_ext autoreload
%autoreload 2

**[2.3]** Import the pandas and numpy packages

In [11]:
import pandas as pd
import numpy as np

**[2.4]** Load the data in a dataframe called `df`

In [13]:
df = pd.read_csv('data/raw/Car Evaluation.csv')

In [29]:
df.head()

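| | buying_price | maintenance_cost | doors | persons_capacity | luggage_boot | safety | evaluation |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |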


In [17]:
df.shape

(1728, 7)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   buying_price      1728 non-null   object
 1   maintenance_cost  1728 non-null   object
 2   doors             1728 non-null   object
 3   persons_capacity  1728 non-null   object
 4   luggage_boot      1728 non-null   object
 5   safety            1728 non-null   object
 6   evaluation        1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [20]:
df.describe()

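| | buying_price | maintenance_cost | doors | persons_capacity | luggage_boot | safety | evaluation |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 2 | 2 | small | low | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |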


In [22]:
# Save the dataframe locally in the data/raw folder
df.to_csv('data/raw/car_evaluation.csv', index=False)

### 3. Prepare Data

In [24]:
# Task: Create a copy of df and save it into a variable called df_cleaned
df_cleaned = df.copy()

In [25]:
# Task: Create a dictionary called cats_dict that contains the categorical variables as keys and their respective values 
# sorted in ascending order
cats_dict = {
    'buying_price': [['low', 'med', 'high', 'vhigh']],
    'maintenance_cost': [['low', 'med', 'high', 'vhigh']],
    'doors': [['2', '3', '4', '5more']],
    'persons_capacity': [['2', '4', 'more']],
    'luggage_boot': [['small', 'med', 'big']],
    'safety': [['low', 'med', 'high']],
    'evaluation': [['unacc', 'acc', 'good', 'vgood']],
}



In [27]:
# Task: Import StandardScaler and OrdinalEncoder from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

In [28]:
# Task: Iterate through the elements of cats_dict, instantiate an OrdinalEncoder() for each column and
# transform the values of that column with this encoder

for col, cats in cats_dict.items():
    col_encoder = OrdinalEncoder(categories=cats)
    df_cleaned[col] = col_encoder.fit_transform(df_cleaned[[col]])
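
To see what the loop does to a single column, here is a minimal sketch on a toy frame. The `toy` DataFrame is made up for illustration; the category order is the one listed for `safety` in `cats_dict`. With an explicit `categories` list, `OrdinalEncoder` replaces each value with its position in that list rather than its alphabetical rank.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, invented for illustration only
toy = pd.DataFrame({'safety': ['low', 'high', 'med', 'low']})

# Same nested-list shape as in cats_dict: one list of categories per encoded column
toy_encoder = OrdinalEncoder(categories=[['low', 'med', 'high']])
encoded = toy_encoder.fit_transform(toy[['safety']])

# Each value is replaced by its index in the categories list
print(encoded.ravel().tolist())
```

This is why `vhigh` shows up as `3.0` and `low` as `0.0` in the encoded frame.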


In [30]:
df_cleaned.head()

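| | buying_price | maintenance_cost | doors | persons_capacity | luggage_boot | safety | evaluation |
|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 3 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |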


In [31]:
# Task: Create a list called num_cols that contains all numeric columns
num_cols = ['buying_price', 'maintenance_cost', 'doors', 'persons_capacity', 'luggage_boot', 'safety']

In [32]:
# Task: Instantiate a StandardScaler and call it sc
sc = StandardScaler()

In [33]:
# Task: Fit and transform the numeric features of df_cleaned and write the scaled values back into it
df_cleaned[num_cols] = sc.fit_transform(df_cleaned[num_cols])
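
As a quick check on the scaled values, this sketch standardizes a toy column holding the four ordinal codes 0 to 3 once each, which mirrors the balanced `buying_price` column (432 rows per level according to `df.describe()`). `StandardScaler` subtracts the column mean and divides by the population standard deviation; the `codes` array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four equally frequent ordinal codes, as in a balanced four-level feature
codes = np.array([[0.0], [1.0], [2.0], [3.0]])

scaled = StandardScaler().fit_transform(codes)

# Manual version: (x - mean) / population standard deviation (ddof=0)
manual = (codes - codes.mean()) / codes.std()

print(scaled.ravel().round(6))
print(np.allclose(scaled, manual))
```

The code 3 (`vhigh`) maps to roughly 1.341641, the value that appears in the scaled frame.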

In [35]:
# Cast the encoded target from float back to integer class labels
df_cleaned['evaluation'] = df_cleaned['evaluation'].astype(int)

In [36]:
df_cleaned.head()

| | buying_price | maintenance_cost | doors | persons_capacity | luggage_boot | safety | evaluation |
|---|---|---|---|---|---|---|---|
| 0 | 1.341641 | 1.341641 | -1.341641 | -1.224745 | -1.224745 | -1.224745 | 0 |
| 1 | 1.341641 | 1.341641 | -1.341641 | -1.224745 | -1.224745 | 0.0 | 0 |
| 2 | 1.341641 | 1.341641 | -1.341641 | -1.224745 | -1.224745 | 1.224745 | 0 |
| 3 | 1.341641 | 1.341641 | -1.341641 | -1.224745 | 0.0 | -1.224745 | 0 |
| 4 | 1.341641 | 1.341641 | -1.341641 | -1.224745 | 0.0 | 0.0 | 0 |


In [37]:
# Task: Import split_sets_random and save_sets from src.data.sets
from src.data.sets import split_sets_random, save_sets

In [38]:
# Task: Split the data randomly into training, validation and testing sets (test_ratio=0.2)
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(df_cleaned, target_col='evaluation', test_ratio=0.2)
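
`split_sets_random` lives in `src/data/sets.py` and is not shown in this notebook. One plausible way to produce such a three-way split, assuming the helper applies `train_test_split` twice, is sketched below. The function name `three_way_split`, its `random_state` default and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, target_col, test_ratio=0.2, random_state=8):
    """Hypothetical stand-in for split_sets_random: random train/val/test split."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # First carve out the test set ...
    X_rest, X_te, y_rest, y_te = train_test_split(
        X, y, test_size=test_ratio, random_state=random_state)
    # ... then take a validation set of about the same size from the remainder
    val_ratio = test_ratio / (1 - test_ratio)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=val_ratio, random_state=random_state)
    return X_tr, y_tr, X_va, y_va, X_te, y_te

# Toy data, invented for illustration
toy = pd.DataFrame({'f': np.arange(100), 'target': np.arange(100) % 4})
X_tr, y_tr, X_va, y_va, X_te, y_te = three_way_split(toy, 'target')
print(len(X_tr), len(X_va), len(X_te))
```

With `test_ratio=0.2` this gives roughly a 60/20/20 split; the real helper may use different proportions.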

In [39]:
! mkdir data/processed/car_evaluation

In [40]:
# Task: Save the sets in the data/processed/car_evaluation folder
save_sets(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, X_test=X_test, y_test=y_test, path='data/processed/car_evaluation/')

### 4. Baseline Model

In [47]:
# Task: Import NullModel from src.models.null
from src.models.null import NullModel

In [48]:
# Task: Instantiate a NullModel and call .fit_predict() on the training target to extract your predictions into a variable called y_base
base_model = NullModel(target_type="classification")
y_base = base_model.fit_predict(y_train)

In [49]:
from src.models.performance import print_class_perf

In [50]:
print_class_perf(y_base, y_train, set_name='Training', average='weighted')

Accuracy Training: 0.6988416988416989
F1 Training: 0.5749561249561249
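
`NullModel` is defined in `src/models/null.py` and is not shown here. Assuming it behaves like a majority-class predictor for classification (an assumption, not confirmed by this notebook), the scores above can be understood with the following sketch on an invented toy target. Note that `df.describe()` reported `unacc` for 1210 of 1728 rows, about 70%, which is in line with the baseline accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy target, invented for illustration: class 0 is the majority (7 of 10)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2, 3])

# A null classifier predicts the most frequent class for every observation
majority_class = np.bincount(y_toy).argmax()
y_null = np.full_like(y_toy, majority_class)

null_acc = accuracy_score(y_toy, y_null)
# Classes that are never predicted get an F1 of 0, which pulls the weighted F1 below the accuracy
null_f1 = f1_score(y_toy, y_null, average='weighted', zero_division=0)

print(f'Accuracy: {null_acc:.4f}')
print(f'F1: {null_f1:.4f}')
```

Any trained model should beat these two numbers to be worth keeping.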


### 5. Define Architecture

In [51]:
# Task: Import torch and torch.nn as nn
import torch
import torch.nn as nn
import torch.nn.functional as F

**[5.2]** Create in `src/models/pytorch.py` a class called `PytorchMultiClass` that inherits from `nn.Module` with:
- `num_features` as input parameter
- attributes:
    - `layer_1`: fully-connected layer with 32 neurons
    - `layer_out`: fully-connected layer with 4 neurons
    - `softmax`: softmax function
- methods:
    - `forward()` with `inputs` as input parameter, perform ReLU and DropOut on the fully-connected layer followed by the output layer with softmax

In [52]:
# Task: Create a class called PytorchMultiClass that inherits from nn.Module
class PytorchMultiClass(nn.Module):
    def __init__(self, num_features):
        super(PytorchMultiClass, self).__init__()
        
        self.layer_1 = nn.Linear(num_features, 32)
        self.layer_out = nn.Linear(32, 4)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = F.dropout(F.relu(self.layer_1(x)), training=self.training)
        x = self.layer_out(x)
        return self.softmax(x)
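
One point worth checking against the PyTorch documentation: `nn.CrossEntropyLoss` expects unnormalized logits and applies log-softmax internally. The class above follows the lab specification and returns softmax probabilities, so a cross-entropy criterion applied to its output effectively sees a second softmax. The sketch below, on made-up logits, illustrates the difference; it is not a change to the lab's model, and the name `ce_loss` is chosen only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 observations and 4 classes; both are confidently correct
logits = torch.tensor([[8.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 9.0, 0.0]])
targets = torch.tensor([0, 2])

ce_loss = nn.CrossEntropyLoss()

# CrossEntropyLoss on logits equals negative log-likelihood on log-softmax
loss_logits = ce_loss(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Feeding probabilities applies softmax twice. Probabilities lie in [0, 1], so with
# 4 classes the loss cannot fall below -log(e / (e + 3)), about 0.744
loss_probs = ce_loss(torch.softmax(logits, dim=1), targets)

print(loss_logits.item(), loss_manual.item(), loss_probs.item())
```

A common alternative is to return the raw output of `layer_out` during training and apply softmax only when probabilities are needed.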

In [53]:
# Task: Instantiate PytorchMultiClass with the correct number of input features and save it into a variable called model
model = PytorchMultiClass(X_train.shape[1])

In [54]:
def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu')  # fall back to CPU when no GPU is available
    return device

In [55]:
device = get_device()
model.to(device)

PytorchMultiClass(
  (layer_1): Linear(in_features=6, out_features=32, bias=True)
  (layer_out): Linear(in_features=32, out_features=4, bias=True)
  (softmax): Softmax(dim=1)
)

In [56]:
# Task: Print the architecture of model
print(model)

PytorchMultiClass(
  (layer_1): Linear(in_features=6, out_features=32, bias=True)
  (layer_out): Linear(in_features=32, out_features=4, bias=True)
  (softmax): Softmax(dim=1)
)


In [57]:
torch.cuda.is_available()

False

### 6. Create Data Loader

In [58]:
# Task: Import Dataset and DataLoader from torch.utils.data
from torch.utils.data import Dataset, DataLoader

In [59]:
# Task: Create a class called PytorchDataset
class PytorchDataset(Dataset):
    """
    Pytorch dataset
    ...

    Attributes
    ----------
    X_tensor : Pytorch tensor
        Features tensor
    y_tensor : Pytorch tensor
        Target tensor

    Methods
    -------
    __getitem__(index)
        Return features and target for a given index
    __len__
        Return the number of observations
    to_tensor(data)
        Convert Pandas Series to Pytorch tensor
    """
        
    def __init__(self, X, y):
        self.X_tensor = self.to_tensor(X)
        self.y_tensor = self.to_tensor(y)
    
    def __getitem__(self, index):
        return self.X_tensor[index], self.y_tensor[index]
        
    def __len__(self):
        return len(self.X_tensor)
    
    def to_tensor(self, data):
        return torch.Tensor(np.array(data))

In [60]:
# Task: Convert all numpy array sets to PytorchDataset
train_dataset = PytorchDataset(X=X_train, y=y_train)
val_dataset = PytorchDataset(X=X_val, y=y_val)
test_dataset = PytorchDataset(X=X_test, y=y_test)
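
To illustrate what a `DataLoader` yields from such a dataset, here is a small sketch on invented tensors. `TensorDataset` stands in for `PytorchDataset` so the block is self-contained; the sizes (10 observations, 6 features, batch size 4) are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data, invented for illustration: 10 observations, 6 features, 4 classes
X_toy = torch.rand(10, 6)
y_toy = torch.arange(10, dtype=torch.float32) % 4  # float32, like tensors built with torch.Tensor(...)

# TensorDataset plays the role of PytorchDataset here
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=4, shuffle=False)

# Each iteration returns one (features, targets) pair
batch_shapes = [(tuple(features.shape), tuple(targets.shape)) for features, targets in toy_loader]
for features_shape, targets_shape in batch_shapes:
    print(features_shape, targets_shape)

# Float targets must be cast with .long() before being passed to nn.CrossEntropyLoss
print(y_toy.dtype, y_toy.long().dtype)
```

The last batch is smaller because 10 is not a multiple of 4; `DataLoader` keeps it unless `drop_last=True` is set.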

### 7. Train Model

In [61]:
# Task: Instantiate a nn.CrossEntropyLoss() and save it into a variable called criterion
criterion = nn.CrossEntropyLoss()

In [62]:
# Task: Instantiate a torch.optim.Adam() optimizer with the model's parameters and 0.1 as learning rate and save it into a variable called optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

In [57]:
# Task: Instantiate a torch.optim.lr_scheduler.StepLR() scheduler that will decrease the learning rate by a coefficient of 0.9 for each epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
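
The effect of `StepLR` with `step_size=1` and `gamma=0.9` can be checked in isolation. This sketch uses a single dummy parameter and separate names (`demo_optimizer`, `demo_scheduler`, both chosen for illustration) so it does not overwrite the objects defined above.

```python
import torch

# A single dummy parameter is enough to build an optimizer
dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = torch.optim.Adam([dummy_param], lr=0.1)
demo_scheduler = torch.optim.lr_scheduler.StepLR(demo_optimizer, step_size=1, gamma=0.9)

learning_rates = []
for epoch in range(5):
    learning_rates.append(demo_optimizer.param_groups[0]['lr'])
    demo_optimizer.step()   # in real training this follows loss.backward()
    demo_scheduler.step()   # multiply the learning rate by gamma once per epoch

print([round(lr, 5) for lr in learning_rates])
```

After `n` scheduler steps the learning rate is `0.1 * 0.9 ** n`, provided the scheduler is actually stepped once per epoch.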

In [64]:
# Create a function called `train_classification()` that will perform forward and back propagation and
# calculate loss and Accuracy scores

def train_classification(train_data, model, criterion, optimizer, batch_size, device, scheduler=None, generate_batch=None):
    """Train a Pytorch multi-class classification model

    Parameters
    ----------
    train_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    optimizer: torch.optim
        Optimizer
    batch_size : int
        Number of observations per batch
    device : torch.device
        Device used for the model and the tensors
    scheduler : torch.optim.lr_scheduler
        Pytorch Scheduler used for updating learning rate
    generate_batch : function
        Collate function defining required pre-processing steps (passed to DataLoader as collate_fn)

    Returns
    -------
    Float
        Loss score
    Float
        Accuracy Score
    """
    
    # Set model to training mode
    model.train()
    train_loss = 0
    train_acc = 0
    
    # Create data loader
    data = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:

        # Reset gradients
        optimizer.zero_grad()
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Make predictions
        output = model(feature)
        
        # Calculate loss for given batch
        loss = criterion(output, target_class.long())

        # Calculate global loss
        train_loss += loss.item()
        
        # Calculate gradients
        loss.backward()

        # Update Weights
        optimizer.step()
        
        # Calculate global accuracy
        train_acc += (output.argmax(1) == target_class).sum().item()

    # Adjust the learning rate
    if scheduler:
        scheduler.step()

    return train_loss / len(train_data), train_acc / len(train_data)
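
A note on reading the returned loss: `nn.CrossEntropyLoss()` averages over each batch by default, and the function sums those batch averages and then divides by the number of observations. The result is therefore roughly the mean per-observation loss divided by the batch size, which explains why the loss values in the training log are so small. The numbers below are invented to show the arithmetic.

```python
import math

# Invented example: 100 observations, batch size 32, every batch has mean loss 0.9
n_observations = 100
demo_batch_size = 32
mean_loss_per_batch = 0.9

n_batches = math.ceil(n_observations / demo_batch_size)   # 4 batches
summed_batch_means = n_batches * mean_loss_per_batch      # what train_loss accumulates

reported = summed_batch_means / n_observations            # as in train_classification()
per_batch_average = summed_batch_means / n_batches        # dividing by the number of batches instead

print(f'reported: {reported:.4f}, per-batch average: {per_batch_average:.4f}')
```

The accuracy does not have this scaling issue, because it sums correct predictions per observation before dividing.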

In [65]:
# Task: Create a function called `test_classification()` that will perform forward and calculate loss and accuracy scores

def test_classification(test_data, model, criterion, batch_size, device, generate_batch=None):
    """Calculate performance of a Pytorch multi-class classification model

    Parameters
    ----------
    test_data : torch.utils.data.Dataset
        Pytorch dataset
    model: torch.nn.Module
        Pytorch Model
    criterion: function
        Loss function
    batch_size : int
        Number of observations per batch
    device : torch.device
        Device used for the model and the tensors
    generate_batch : function
        Collate function defining required pre-processing steps (passed to DataLoader as collate_fn)

    Returns
    -------
    Float
        Loss score
    Float
        Accuracy Score
    """    
    
    # Set model to evaluation mode
    model.eval()
    test_loss = 0
    test_acc = 0
    
    # Create data loader
    data = DataLoader(test_data, batch_size=batch_size, collate_fn=generate_batch)
    
    # Iterate through data by batch of observations
    for feature, target_class in data:
        
        # Load data to specified device
        feature, target_class = feature.to(device), target_class.to(device)
        
        # Set no update to gradients
        with torch.no_grad():
            
            # Make predictions
            output = model(feature)
            
            # Calculate loss for given batch
            loss = criterion(output, target_class.long())

            # Calculate global loss
            test_loss += loss.item()
            
            # Calculate global accuracy
            test_acc += (output.argmax(1) == target_class).sum().item()

    return test_loss / len(test_data), test_acc / len(test_data)


In [66]:
# Task: Create 2 variables called N_EPOCHS and BATCH_SIZE that will take respectively 50 and 32 as values
N_EPOCHS = 50
BATCH_SIZE = 32

In [67]:
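# Task: Create a for loop that will iterate through the specified number of epochs,
# train the model with the training set, assess the performance on the validation set and print both scores
# Note: the scheduler defined earlier is not passed to train_classification() here, so the learning rate stays at 0.1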

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_classification(train_dataset, model=model, criterion=criterion, optimizer=optimizer, batch_size=BATCH_SIZE, device=device)
    valid_loss, valid_acc = test_classification(val_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)

    print(f'Epoch: {epoch}')
    print(f'\t(train)\t|\tLoss: {train_loss:.4f}\t|\tAcc: {train_acc * 100:.1f}%')
    print(f'\t(valid)\t|\tLoss: {valid_loss:.4f}\t|\tAcc: {valid_acc * 100:.1f}%')

    

Epoch: 0
	(train)	|	Loss: 0.0322	|	Acc: 73.4%
	(valid)	|	Loss: 0.0296	|	Acc: 80.6%
Epoch: 1
	(train)	|	Loss: 0.0307	|	Acc: 77.6%
	(valid)	|	Loss: 0.0294	|	Acc: 81.8%
Epoch: 2
	(train)	|	Loss: 0.0300	|	Acc: 79.6%
	(valid)	|	Loss: 0.0286	|	Acc: 84.1%
Epoch: 3
	(train)	|	Loss: 0.0293	|	Acc: 82.3%
	(valid)	|	Loss: 0.0285	|	Acc: 84.4%
Epoch: 4
	(train)	|	Loss: 0.0293	|	Acc: 82.0%
	(valid)	|	Loss: 0.0286	|	Acc: 84.1%
Epoch: 5
	(train)	|	Loss: 0.0291	|	Acc: 82.9%
	(valid)	|	Loss: 0.0283	|	Acc: 85.0%
Epoch: 6
	(train)	|	Loss: 0.0291	|	Acc: 82.7%
	(valid)	|	Loss: 0.0276	|	Acc: 87.6%
Epoch: 7
	(train)	|	Loss: 0.0290	|	Acc: 83.7%
	(valid)	|	Loss: 0.0293	|	Acc: 82.4%
Epoch: 8
	(train)	|	Loss: 0.0289	|	Acc: 83.4%
	(valid)	|	Loss: 0.0278	|	Acc: 87.0%
Epoch: 9
	(train)	|	Loss: 0.0287	|	Acc: 84.4%
	(valid)	|	Loss: 0.0275	|	Acc: 87.6%
Epoch: 10
	(train)	|	Loss: 0.0288	|	Acc: 83.6%
	(valid)	|	Loss: 0.0281	|	Acc: 86.1%
Epoch: 11
	(train)	|	Loss: 0.0285	|	Acc: 84.6%
	(valid)	|	Loss: 0.0279	|	Acc: 86.7%
...

In [68]:
# Task: Save the model into the models folder
torch.save(model, "models/pytorch_multi_car_evaluation.pt")
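
`torch.save(model, ...)` pickles the whole object, so loading it later requires the `PytorchMultiClass` class to be importable under the same path. A common alternative, sketched here with a stand-in model and a temporary file (both chosen for illustration), is to save only the `state_dict`.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in model for illustration; the lab's PytorchMultiClass would be handled the same way
demo_model = nn.Linear(6, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_state_dict.pt')

    # Save only the learned parameters
    torch.save(demo_model.state_dict(), demo_path)

    # Rebuild the architecture, then load the parameters into it
    restored = nn.Linear(6, 4)
    restored.load_state_dict(torch.load(demo_path))

restored.eval()  # switch to evaluation mode before inference
print(torch.equal(demo_model.weight, restored.weight))
```

Either approach works for this lab; the `state_dict` route tends to be more robust when the project code is refactored.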

### 8. Assess Performance

In [69]:
# Task: Assess the model performance on the testing set and print its scores
test_loss, test_acc = test_classification(test_dataset, model=model, criterion=criterion, batch_size=BATCH_SIZE, device=device)
print(f'\tLoss: {test_loss:.4f}\t|\tAccuracy: {test_acc:.1f}')

	Loss: 0.0288	|	Accuracy: 0.8


### 9. Push Changes

In [64]:
# Task: Add Changes to GIT
! git add .
! git commit -m "pytorch binary classification"

[master 9e62c9d] pytorch binary classification
 4 files changed, 3010 insertions(+), 1409 deletions(-)
 create mode 100644 models/pytorch_bin_default_card.pt
 create mode 100644 notebooks/2_pytorch_binary_classification.ipynb


In [55]:
! git config --global user.email "nathan@fragar.id.au"
! git config --global user.name "Nathan Fragar"

In [59]:
# Task: Push the branch to the remote (the flag is --set-upstream, and the branch created in step 1 is pytorch_multi_class)
! git push --set-upstream origin pytorch_multi_class