# EE382V - Hardware Architecture for Machine Learning
## NVIDIA Tensor Cores for Accelerating Machine Learning Workload

## Notebook 3 - Model Training with FP16

In this notebook, we will modify our training process to use FP16 so that it can run on Tensor Cores. The model needs to be trained in FP16 in order to take advantage of Tensor Cores. We will still use the pre-trained ResNet-50 and the same dataset in this notebook.

### Import Library
We import the libraries that this notebook needs.

In [1]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
from tensorboardX import SummaryWriter
import datetime
import time
import copy
import math

### Global Variable
Here, we define global variables.

In [2]:
data_dir       = './'
raw_dir        = f'{data_dir}/raw'
raw_dogs_dir   = f'{raw_dir}/dogs'
raw_cats_dir   = f'{raw_dir}/cats'
train_dir      = f'{data_dir}/train'
train_dogs_dir = f'{train_dir}/dogs'
train_cats_dir = f'{train_dir}/cats'
val_dir        = f'{data_dir}/val'
val_dogs_dir   = f'{val_dir}/dogs'
val_cats_dir   = f'{val_dir}/cats'
log_dir        = f'{data_dir}/log'
chk_dir        = f'{data_dir}/checkpoint'
test_dir       = f'{data_dir}/test'

### GPU Initialization
We will use a GPU to train our model. The TACC Maverick2 V100 compute node is equipped with two NVIDIA Tesla V100 GPUs, but in this assignment we will use only one of them. If no GPU is available, we fall back to the CPU.

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Dataset Augmentation and Normalization
Before we train our model, we need to preprocess our dataset. We will apply image data augmentation, a method that artificially enlarges the dataset and improves its quality. Augmentation creates variations of the data that can improve the model's ability to generalize to real-world scenarios. You can learn more about image data augmentation in [9]. The variations can include rotation, flipping, noise, and changes in brightness and contrast. We also need to normalize our dataset, using the default values from the PyTorch example [10]. We define different transformations for the training and validation datasets: augmentation is applied only to the training dataset. In both cases, the data is converted into PyTorch tensors.

In [4]:
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomRotation(5),
        transforms.RandomHorizontalFlip(),
        transforms.RandomResizedCrop(224, scale=(0.96, 1.0), ratio=(0.95, 1.05)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize([224,224]),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
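
To make the normalization step concrete, the following is a minimal sketch in plain Python of the per-channel arithmetic, `(value - mean) / std`, applied to a single RGB pixel. The helper `normalize_pixel` is a hypothetical name used only for illustration; the mean and standard deviation values are copied from the transforms above, and the sketch assumes the pixel has already been scaled to the range [0, 1], as `ToTensor()` does for 8-bit images.

```python
# Channel statistics copied from the transforms.Normalize call above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1].

    Hypothetical helper for illustration only.
    """
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


# A mid-gray pixel: an 8-bit value of 128 becomes 128 / 255 after scaling
gray = [128 / 255] * 3
print(normalize_pixel(gray))

# A pixel equal to the channel means maps to zero in every channel
print(normalize_pixel(MEAN))
```

After normalization, each channel is centered near zero with roughly unit scale, which is the input distribution the pre-trained weights were trained on.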

### Dataset Handler
In this section, we define how the dataset is handled and fed into both the training and validation processes. We will use the PyTorch DataLoader to feed the dataset into the training and validation of the model.

The batch size is the number of samples (images) taken from the dataset to train and update our model in each iteration. For example, if we have 128 images in our dataset and a batch size of 16, then one epoch consists of 8 iterations. In each iteration, we take 16 images, train our model on them, and update the model parameters. You can adjust the batch size as part of hyperparameter tuning. In this assignment, we will only adjust the batch size according to the available GPU memory. The NVIDIA Tesla V100 has 16 GB of HBM2 memory, which should be enough to hold 1024 images in a single batch (if we use FP32). You can change the batch size according to your GPU memory size.
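
The batch arithmetic can be sketched in a few lines of plain Python. The helpers `iterations_per_epoch` and `input_batch_bytes` are hypothetical names introduced for illustration. The sketch assumes that the last, partially filled batch is still processed (the DataLoader default) and that images are 3 x 224 x 224 as in the transforms of this notebook. The memory figure covers only the input tensor of one batch; activations and gradients need considerably more, so treat it as a lower bound rather than a sizing rule.

```python
import math


def iterations_per_epoch(num_images, batch_size):
    """Number of batches in one epoch when the final partial batch is kept."""
    return math.ceil(num_images / batch_size)


def input_batch_bytes(batch_size, bytes_per_value, channels=3, height=224, width=224):
    """Bytes needed to hold the input tensor of a single batch."""
    return batch_size * channels * height * width * bytes_per_value


# The example from the text: 128 images with a batch size of 16
print(iterations_per_epoch(128, 16))        # 8

# 20000 training images (the size of this notebook's training set)
print(iterations_per_epoch(20000, 1024))    # 20

# Input tensor size for one batch of 1024 images, in MiB
print(input_batch_bytes(1024, 4) / 2**20)   # FP32: 588.0
print(input_batch_bytes(1024, 2) / 2**20)   # FP16: 294.0
```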

The workers are the background processes that the data loader uses to load data. Since the TACC Maverick2 V100 compute node has 96 hardware threads (48 cores with Hyper-Threading), we set the number of workers to 96. Feel free to change the number of workers according to the hardware threads available on your workstation.
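
If you are not sure how many hardware threads your machine exposes, a minimal sketch such as the following can suggest a value. The helper `suggest_num_workers` is a hypothetical name, and capping at 96 simply mirrors the value used in this notebook. This is one reasonable heuristic, not a tuned recommendation.

```python
import os


def suggest_num_workers(cap=96):
    """Return a worker count no larger than the CPUs visible to this process.

    Hypothetical helper for illustration only.
    """
    try:
        # Linux only: respects the CPU affinity set by a job scheduler
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total logical CPU count
        available = os.cpu_count() or 1
    return max(1, min(cap, available))


print(suggest_num_workers())
```

The result could then be assigned to `num_workers` in the next cell in place of the hard-coded value.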

In [5]:
###################### Change as needed ######################
batch_size  = 1024
num_workers = 96
##############################################################

# Define the data transformation
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}

# Define the data loader
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size,
                                              shuffle=True, num_workers=num_workers)
              for x in ['train', 'val']}

# Print the statistics
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes
print(class_names)
print(f'Train image size: {dataset_sizes["train"]}')
print(f'Validation image size: {dataset_sizes["val"]}')

['cats', 'dogs']
Train image size: 20000
Validation image size: 5000


### Download Pre-Trained Model
We download the pre-trained ResNet-50 model. Using a pre-trained model saves the time and computational resources needed to train our model. This pre-trained model will then be trained on our dataset. We also define a checkpoint location, which allows us to save our model after we have trained it.

In [6]:
# Download the pre-trained model of ResNet-50
model_conv  = torchvision.models.resnet50(pretrained=True)

# Define the checkpoint location to save the trained model
check_point = f'{chk_dir}/model-checkpoint-fp16.tar'

### Define Training Components
Before we train our model, we need to define several components: the criterion, the optimizer, and the learning rate scheduler.

In [7]:
# Freeze all pre-trained parameters so that no gradients are computed for them
for param in model_conv.parameters():
    param.requires_grad = False

# Replace the final fully connected layer.
# Parameters of newly constructed modules have requires_grad=True by default,
# so only this new layer will be trained.
# We keep the number of input features to this layer and change the number of
# output features to 2 (i.e., we only have two classes).
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, 2)

# Convert the model to FP16 and copy it into GPU memory
model_conv = model_conv.to(torch.float16).to(device)

# Choose the Criterion as Cross Entropy Loss
criterion = nn.CrossEntropyLoss()

# Optimize only the parameters of the final fully connected layer since we have changed them.
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)

# This is our learning rate scheduler. Decay learning rate by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)
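
Since the model above is converted to FP16, it can help to see how coarse half precision is. The following is an illustrative sketch that is not part of the original notebook: it uses only the Python standard library (the `'e'` format of `struct` is IEEE 754 half precision), and `to_fp16` is a hypothetical helper name. It shows that an increment much smaller than the spacing between neighboring FP16 values is rounded away.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Hypothetical helper for illustration only.
    """
    return struct.unpack('e', struct.pack('e', x))[0]


# Near 1.0, adjacent FP16 values are 2**-10 (about 0.00098) apart
print(to_fp16(1.0 + 0.0004))   # 1.0: the increment is rounded away
print(to_fp16(1.0 + 0.001))    # 1.0009765625: the next representable value

# 0.1 is not exactly representable, so it is stored as a nearby value
print(to_fp16(0.1))

# 65504 is the largest finite FP16 value
print(to_fp16(65504.0))
```

The same rounding applies to weights stored in FP16, which is worth keeping in mind when comparing the FP16 training curves with the FP32 ones in TensorBoard.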

### Define Training Function
We define our training function as follows.

In [15]:
def train_model(model, criterion, optimizer, scheduler, num_epochs, timestamp):
    since = time.time()
    writer = SummaryWriter('log/'+timestamp+'-fp16')
    
    # Initialization
    best_model_wts = copy.deepcopy(model.state_dict())
    best_loss = math.inf
    best_acc = 0.
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for i, (inputs, labels) in enumerate(dataloaders[phase]):
                inputs = inputs.to(torch.float16).to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics (loss.item() is the batch mean, so weight it by the batch size)
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

                if phase == 'train' :
                    writer.add_scalar('Train/Current_Running_Loss', loss.item(), epoch*len(dataloaders[phase])+i)
                    writer.add_scalar('Train/Current_Running_Corrects', torch.sum(preds == labels.data), epoch*len(dataloaders[phase])+i)
                    writer.add_scalar('Train/Accum_Running_Loss', running_loss, epoch*len(dataloaders[phase])+i)
                    writer.add_scalar('Train/Accum_Running_Corrects', running_corrects, epoch*len(dataloaders[phase])+i)
                else :
                    writer.add_scalar('Validation/Current_Running_Loss', loss.item(), epoch*len(dataloaders[phase])+i)
                    writer.add_scalar('Validation/Current_Running_Corrects', torch.sum(preds == labels.data), epoch*len(dataloaders[phase])+i)
                    # Log the accumulated validation statistics of the current epoch
                    # (epoch_loss and epoch_acc still hold the training values at this point)
                    writer.add_scalar('Validation/Running_Loss', running_loss, epoch*len(dataloaders[phase])+i)
                    writer.add_scalar('Validation/Running_Corrects', running_corrects, epoch*len(dataloaders[phase])+i)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            # Step the learning rate scheduler once per epoch, after the optimizer updates
            if phase == 'train':
                scheduler.step()

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))
            
            if phase == 'train' :
                writer.add_scalar('Train/Loss', epoch_loss, epoch)
                writer.add_scalar('Train/Accuracy', epoch_acc, epoch)
            else :
                writer.add_scalar('Validation/Loss', epoch_loss, epoch)
                writer.add_scalar('Validation/Accuracy', epoch_acc, epoch)
            
            # deep copy the model
            if phase == 'val' and epoch_loss < best_loss:
                print(f'New best model found!')
                print(f'New record loss: {epoch_loss}, previous record loss: {best_loss}')
                best_loss = epoch_loss
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        writer.flush()
        print()
        
    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f} Best val loss: {:.4f}'.format(best_acc, best_loss))

    # load best model weights
    model.load_state_dict(best_model_wts)
    writer.close()
    return model, best_loss, best_acc
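
The reason `train_model` multiplies each batch loss by `inputs.size(0)` is that the criterion returns the mean loss of a batch, and the last batch of an epoch is usually smaller than the others. The following minimal sketch in plain Python, with made-up loss values and a hypothetical helper `epoch_average`, shows how the size-weighted average differs from a naive mean over batches.

```python
def epoch_average(batch_losses, batch_sizes):
    """Size-weighted mean of per-batch mean losses, as accumulated in train_model."""
    running_loss = 0.0
    for loss, size in zip(batch_losses, batch_sizes):
        running_loss += loss * size
    return running_loss / sum(batch_sizes)


# 20000 images with a batch size of 1024: 19 full batches and one batch of 544
sizes = [1024] * 19 + [544]

# Made-up per-batch mean losses, for illustration only
losses = [0.5] * 19 + [1.0]

print(epoch_average(losses, sizes))   # weighted by the number of images
print(sum(losses) / len(losses))      # naive mean over batches
```

With these illustrative values, the naive mean gives the small final batch the same weight as a full batch, which overstates its influence on the epoch loss.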

### Training The Model
Finally, we can train our model. An epoch consists of one pass over the training set followed by one pass over the validation set. The learning rate starts at 0.001 and is reduced by a factor of 0.1 every 7 epochs, so later epochs make smaller updates as training converges. Feel free to adjust the number of epochs, since one epoch takes only around 1 minute to run. You can monitor the training progress in TensorBoard. TensorBoard data will be updated at the end of each epoch.
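
The step decay can be written in closed form as `base_lr * gamma ** (epoch // step_size)`. The sketch below is a plain-Python illustration of that formula using the values configured earlier (`lr=0.001`, `step_size=7`, `gamma=0.1`). The helper `step_lr` is a hypothetical name, and the sketch assumes the scheduler is stepped exactly once at the end of each epoch.

```python
def step_lr(epoch, base_lr=0.001, step_size=7, gamma=0.1):
    """Learning rate used during a given epoch under step decay.

    Hypothetical helper for illustration only.
    """
    return base_lr * gamma ** (epoch // step_size)


# Print the schedule for the 14 epochs used in this notebook
for epoch in range(14):
    print(f'epoch {epoch:2d}: lr = {step_lr(epoch):.6f}')
```

Under these assumptions, epochs 0 to 6 run at 0.001 and epochs 7 to 13 run at 0.0001.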

You will also need to monitor the GPU memory usage. Open your first terminal in MobaXterm and execute the command nvidia-smi. This command lists all GPUs installed in the compute node and their status.
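
If you prefer to check the memory usage from Python rather than from the terminal, a minimal sketch like the one below wraps the same tool. The helper `gpu_memory_report` is a hypothetical name; it relies on the `--query-gpu` and `--format=csv` options of `nvidia-smi` and returns `None` when the tool is missing or fails, for example on a machine without an NVIDIA driver.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return nvidia-smi's CSV memory report as text, or None if unavailable.

    Hypothetical helper for illustration only.
    """
    if shutil.which('nvidia-smi') is None:
        return None
    try:
        result = subprocess.run(
            ['nvidia-smi',
             '--query-gpu=index,name,memory.used,memory.total',
             '--format=csv'],
            capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


report = gpu_memory_report()
print(report if report is not None else 'nvidia-smi is not available')
```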

In [16]:
###################### Change as needed ######################
num_epochs = 14
##############################################################

today = datetime.datetime.today() 
timestamp = today.strftime('%Y%m%d-%H%M%S')

# Start the training
model_conv, best_val_loss, best_val_acc = train_model(model_conv,
                                                      criterion,
                                                      optimizer_conv,
                                                      exp_lr_scheduler,
                                                      num_epochs,
                                                      timestamp)

# Save the trained model for future use.
torch.save({'model_state_dict': model_conv.state_dict(),
            'optimizer_state_dict': optimizer_conv.state_dict(),
            'best_val_loss': best_val_loss,
            'best_val_accuracy': best_val_acc,
            'scheduler_state_dict' : exp_lr_scheduler.state_dict(),
            }, check_point)

Epoch 0/13
----------
train Loss: 0.0832 Acc: 0.9783
val Loss: 0.0840 Acc: 0.9776
New best model found!
New record loss: 0.0839548828125, previous record loss: inf

Epoch 1/13
----------
train Loss: 0.0834 Acc: 0.9779
val Loss: 0.0840 Acc: 0.9772

Epoch 2/13
----------
train Loss: 0.0838 Acc: 0.9789
val Loss: 0.0839 Acc: 0.9776
New best model found!
New record loss: 0.0839197265625, previous record loss: 0.0839548828125

Epoch 3/13
----------
train Loss: 0.0821 Acc: 0.9793
val Loss: 0.0840 Acc: 0.9772

Epoch 4/13
----------
train Loss: 0.0834 Acc: 0.9782
val Loss: 0.0837 Acc: 0.9772
New best model found!
New record loss: 0.083683203125, previous record loss: 0.0839197265625

Epoch 5/13
----------
train Loss: 0.0838 Acc: 0.9780
val Loss: 0.0842 Acc: 0.9774

Epoch 6/13
----------
train Loss: 0.0828 Acc: 0.9795
val Loss: 0.0841 Acc: 0.9774

Epoch 7/13
----------
train Loss: 0.0834 Acc: 0.9783
val Loss: 0.0837 Acc: 0.9776

Epoch 8/13
----------
train Loss: 0.0830 Acc: 0.9783
val Loss: 0.08

### End
This is the end of Notebook 3. Please move on to Notebook 4, where you will make predictions with our model trained in FP16.

!!IMPORTANT!! To close this notebook, use File -> Close and Halt. This also kills the Python process associated with this notebook.

Version 0.5  - January 9th, 2020 - ©2020 hanindhito@bagus.my.id