# Intel PyTorch GPU Training and Inference with AMP

The `PyTorch Training Optimizations with Advanced Matrix Extensions Bfloat16` sample demonstrates how to train a ResNet50 model on the CIFAR10 dataset with the Intel® Extension for PyTorch*.

The Intel® Extension for PyTorch* extends PyTorch* with optimizations for an extra performance boost on Intel® hardware. While most of the optimizations will be included in future PyTorch* releases, the extension delivers up-to-date features and optimizations for PyTorch* on Intel® hardware. For example, newer optimizations include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).

| Area                  | Description
|:---                   |:---
| What you will learn   | Training performance improvements using Intel® Extension for PyTorch* with Intel® AMX BF16
| Time to complete      | 20 minutes
| Category              | Code Optimization

## Purpose

The Intel® Extension for PyTorch* lets users speed up training on Intel® Xeon® Scalable processors with lower-precision data formats and specialized instructions. The bfloat16 (BF16) data format uses half the bit width of 32-bit floating point (FP32), which lowers both the memory needed and the execution time. You should see a performance improvement with the AMX instruction set when compared to AVX-512.
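
As a rough illustration of the memory claim above, the following sketch computes the raw size of one input batch in FP32 and in BF16. The helper name `tensor_bytes` is hypothetical, and the shape (128 RGB images of 224x224 pixels) is borrowed from the batch size and resize step used later in this sample. It counts element storage only, not framework overhead.

```python
from math import prod

def tensor_bytes(shape, bits_per_element):
    """Raw storage in bytes for a dense tensor of the given shape."""
    return prod(shape) * bits_per_element // 8

# One batch as used later in this sample: 128 RGB images resized to 224x224
batch_shape = (128, 3, 224, 224)

fp32_bytes = tensor_bytes(batch_shape, 32)
bf16_bytes = tensor_bytes(batch_shape, 16)

print(f"FP32 batch: {fp32_bytes / 2**20:.2f} MiB")
print(f"BF16 batch: {bf16_bytes / 2**20:.2f} MiB")
```

Halving the element width halves the storage for the same shape, which also reduces the amount of data moved per operation.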

## Prerequisites

| Optimized for           | Description
|:---                     |:---
| OS                      | Ubuntu* 18.04 or newer
| Hardware                | 4th Gen Intel® Xeon® Scalable Processors or newer
| Software                | Intel® Extension for PyTorch*

## Key Implementation Details

This code sample will train a ResNet50 model using the CIFAR10 dataset while using Intel® Extension for PyTorch*. The model is trained using FP32 and BF16 precision, including the use of Intel® Advanced Matrix Extensions (AMX) on BF16. AMX is supported on BF16 and INT8 data types starting with the 4th Generation of Xeon Scalable Processors. The training time will be compared, showcasing the speedup of BF16 and AMX.

>**Note**: Training is not performed using INT8 because such low precision represents weights and gradients too coarsely for training; the resulting model is likely to underfit and not generalize well.

## Installation of required packages

Ensure the kernel is set to Pytorch-GPU before running the following code.

In [None]:
!pip install matplotlib requests tqdm

In [None]:
import os
from time import time
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import intel_extension_for_pytorch as ipex
from tqdm import tqdm

In [None]:
# Hyperparameters and constants
LR = 0.01
MOMENTUM = 0.9
DATA = 'datasets/cifar10/'
epochs=1
batch_size=128

### Check for env setup

In [None]:
torch.xpu.is_available()

In [None]:
try:
    device = "xpu" if torch.xpu.is_available() else "cpu"
except Exception:
    # Fall back to CPU if the XPU backend cannot be queried
    device = "cpu"

if device == "xpu":  # Intel dGPU is recognized as device type xpu
    print("IPEX_XPU is present and Intel GPU is available to use for PyTorch")
    device = "gpu"  # the rest of this notebook uses the label "gpu" for the xpu device
else:
    print("using CPU device for PyTorch")


## Loading the dataset
The CIFAR10 dataset is used for this sample. It is downloaded through the built-in `torchvision.datasets` module. The batch size is set to 128.

In [None]:
#Dataloader operations
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root=DATA,
        train = True,
        transform=transform,
        download=True,
)
train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=batch_size
)

test_dataset = torchvision.datasets.CIFAR10(root=DATA, train = False,
                                       download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size )


## Training the Model
The function below trains the ResNet50 model on either the CPU or an Intel dGPU, using either the FP32 or the BF16 data type. To use an Intel dGPU, transfer the model and data to the xpu device using `to("xpu")`. To use BF16 in operations on CPU, wrap the training computation in the `torch.cpu.amp.autocast()` context manager.

For Intel dGPU, `torch.xpu.amp` provides convenience for auto data type conversion at runtime, allowing deep learning workloads to benefit from lower-precision floating point data types like `torch.float16` or `torch.bfloat16`, which offer lighter calculation workload and smaller memory usage. However, lower-precision data types sacrifice accuracy for performance. The Auto Mixed Precision (AMP) feature automates data type conversions for operators, allowing for a trade-off between accuracy and performance. `torch.xpu.amp.autocast` is a context manager that enables scopes of the script to run with mixed precision, where operations are performed in a data type chosen by the autocast class to improve performance while maintaining accuracy.
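
To make the accuracy trade-off concrete, the sketch below emulates BF16 in plain Python by keeping only the upper 16 bits of an FP32 bit pattern (1 sign bit, 8 exponent bits, 7 mantissa bits). The helper `to_bfloat16` is hypothetical, and it truncates instead of rounding to nearest as hardware typically does, so treat it as an illustration of the format and not of any specific device's behavior.

```python
import struct

def to_bfloat16(x):
    """Emulate BF16 by truncating an FP32 value to its upper 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= 0xFFFF0000                                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pi_bf16 = to_bfloat16(3.14159)           # only about 3 significant decimal digits survive
small_update = to_bfloat16(1.0 + 0.001)  # the 0.001 increment is below BF16 resolution near 1.0

print(pi_bf16)       # 3.140625
print(small_update)  # 1.0
```

An increment of this size would vanish if it were applied directly in BF16, which is one reason mixed-precision schemes typically keep selected values and operations in FP32.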

In [None]:
"""
Function to run a test case
"""
def trainModel(train_loader, modelName="myModel", device="cpu", dataType="fp32"):
    """
    Input parameters
        train_loader: a torch DataLoader object containing the training data with images and labels
        modelName: a string representing the name of the model, used in the checkpoint file name
        device: the device to use - cpu or gpu
        dataType: the data type for model parameters, supported values - fp32, bf16
    Return value
        None; the trained model and optimizer state are saved to checkpoint_<modelName>.pth
    """

    # Initialize the model and replace the final fully connected layer to fine-tune on the CIFAR10 dataset (10 classes). The original ResNet50 was trained on the ImageNet dataset (1000 classes).
    model = torchvision.models.resnet50(pretrained=True)
    model.fc = torch.nn.Linear(2048,10)
    lin_layer = model.fc
    new_layer = torch.nn.Sequential(
        lin_layer,
        torch.nn.Softmax(dim=1)
    )
    model.fc = new_layer

    #Define loss function and optimization methodology
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
    model.train()

    # Move the model and criterion to the XPU device (GPU-specific code)
    if device == "gpu":
        model = model.to("xpu:0")  # with two Intel dGPU devices, specify xpu:0 or xpu:1
        criterion = criterion.to("xpu:0")

    # Optimize with BF16 or FP32 (default); BF16-specific code
    if "bf16" == dataType:
        model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
    else:
        model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)

    #Train the model
    num_batches = len(train_loader)  # batches per epoch; matches the per-epoch batch index reported below

    for i in range(epochs):
        running_loss = 0.0

        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            # export data to XPU device. GPU specific code
            if device == "gpu":
                data = data.to("xpu:0")
                target = target.to("xpu:0")

            # Apply auto mixed precision (BF16) to the forward pass and loss computation
            if "bf16" == dataType:
                with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
                    output = model(data)
                    loss = criterion(output, target)
            else:
                output = model(data)
                loss = criterion(output, target)

            # Backward pass and optimizer step run outside the autocast scope
            loss.backward()
            optimizer.step()
            running_loss += loss.item()


            # Showing Average loss after 50 batches
            if 0 == (batch_idx+1) % 50:
                print("Batch %d/%d complete" %(batch_idx+1, num_batches))
                print(f' average loss: {running_loss / 50:.3f}')
                running_loss = 0.0

    # Save a checkpoint of the trained model
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        }, 'checkpoint_%s.pth' %modelName)
    print(f'\n Training finished and model is saved as checkpoint_{modelName}.pth')
    return None


### Model Training with default FP32 precision (Recommended for inference comparison)

In [None]:
#Model Training
print("Training model with FP32 on GPU, will be saved as checkpoint_gpu_rn50.pth")
trainModel(train_loader, modelName="gpu_rn50", device="gpu", dataType="fp32")

### Model Training with AMP BF16 (Optional)

In [None]:
#Model Training
print("Training model with BF16 on GPU, will be saved as checkpoint_gpu_rn50_bf16.pth")
trainModel(train_loader, modelName="gpu_rn50_bf16", device="gpu", dataType="bf16")

## FP32 and AMP BF16 Evaluation of the Model Trained with FP32 Precision

### Load model from saved model file

In [None]:
#Load model structure from torchvision and weights from saved checkpoint file
def load_model(cp_file = 'checkpoint_rn50.pth'):
    model = torchvision.models.resnet50()
    model.fc = torch.nn.Linear(2048,10)
    lin_layer = model.fc
    new_layer = torch.nn.Sequential(
        lin_layer,
        torch.nn.Softmax(dim=1)
    )
    model.fc = new_layer

    checkpoint = torch.load(cp_file)
    model.load_state_dict(checkpoint['model_state_dict']) 
    return model


### Applying Intel® Extension for PyTorch (IPEX) optimizations and Converting model to TorchScript (Optional)
Intel® Extension for PyTorch (IPEX) provides optimizations for both eager mode and graph mode. Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization techniques such as operation fusion, and IPEX amplifies them with more comprehensive graph optimizations. We therefore recommend using IPEX together with TorchScript.
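
As a minimal sketch of the optional TorchScript conversion, the example below traces and freezes a small stand-in module with the stock `torch.jit.trace` and `torch.jit.freeze` functions on the CPU. The tiny `torch.nn.Sequential` model and the input shape are assumptions chosen to keep the example quick; they are not part of this sample.

```python
import torch

# Small stand-in model (assumption: used instead of ResNet50 to keep the sketch fast)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)
model.eval()  # freezing requires the module to be in eval mode

example = torch.rand(1, 3, 32, 32)  # hypothetical example input used for tracing

with torch.no_grad():
    traced = torch.jit.trace(model, example)  # record the graph for this input
    frozen = torch.jit.freeze(traced)         # inline parameters as constants
    eager_out = model(example)
    script_out = frozen(example)

print(torch.allclose(eager_out, script_out, atol=1e-5))
```

In this sample, the same two calls could be applied to the model returned by the `ipex_optimize` function defined in the next cell, using an example batch that lives on the same device as the model.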


In [None]:
def ipex_optimize(model, dataType = "fp32" , device="cpu"):
    model.eval()
    if device=="gpu":
        model = model.to("xpu:0")
    if dataType=="bf16":
        model = ipex.optimize(model, dtype=torch.bfloat16)
    else:
        model = ipex.optimize(model, dtype = torch.float32)
            
    return model


### Inference

In [None]:
def inferModel(model, test_loader, device="cpu" , dataType='fp32'):
    correct = 0
    total = 0
    if device == "gpu":
        model = model.to("xpu:0")
    infer_time = 0

    with torch.no_grad():
        num_batches = len(test_loader)
        batches=0
                   
        for i, data in tqdm(enumerate(test_loader)):
            
            # Record time for Inference
            torch.xpu.synchronize()
            start_time = time()
            images, labels = data
            if device =="gpu":
                images = images.to("xpu:0")
                 
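            # Run BF16 inference under auto mixed precision; FP32 runs as-is
            if dataType == "bf16":
                with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
                    outputs = model(images)
            else:
                outputs = model(images)
            # Move model outputs back to the CPU (host), since all accuracy-related computation happens on the CPU
            outputs = outputs.to("cpu")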
            _, predicted = torch.max(outputs.data, 1)
            
            total += labels.size(0)
            correct += (predicted == labels).sum().item()        
            
            # Record time after finishing batch inference
            torch.xpu.synchronize()
            end_time = time()      

            if i>=3 and i<=num_batches-3: # Ignoring a few start and end batches for consistent and accurate latency measure 
                infer_time += (end_time-start_time)
                batches += 1
            #Skip last few batches     
            if i == num_batches - 3:
                break    

    accuracy = 100 * correct / total
    return accuracy, infer_time*1000/(batches*batch_size)
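
The latency returned by `inferModel` is a per-sample average over the batches that remain after a few are skipped at the start and end of the run. The sketch below reproduces that arithmetic on a list of per-batch timings. The function name `per_sample_latency_ms` and the timing values are made up for illustration.

```python
def per_sample_latency_ms(batch_times_s, batch_size, skip=3):
    """Average per-sample latency in ms, ignoring warm-up and tail batches."""
    num_batches = len(batch_times_s)
    # Keep batches with index in [skip, num_batches - skip], as in inferModel
    kept = [t for i, t in enumerate(batch_times_s) if skip <= i <= num_batches - skip]
    if not kept:
        raise ValueError("not enough batches to measure after skipping")
    return sum(kept) * 1000 / (len(kept) * batch_size)

# Hypothetical timings: slow warm-up batches, steady state, then tail batches
times = [1.0, 1.0, 1.0, 0.128, 0.128, 0.128, 0.128, 0.128, 1.0, 1.0]
latency = per_sample_latency_ms(times, batch_size=128)
print(f"{latency:.3f} ms per sample")
```

Dividing by `batch_size` assumes every measured batch is full, which holds here because the final, possibly partial, batch is among those skipped.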


In [None]:
#Evaluation of different models
def Eval_model(cp_file = 'checkpoint_model.pth', dataType = "fp32" , device="gpu" ):
    model = load_model(cp_file)
    model = ipex_optimize(model, dataType , device)
    accuracy, latency = inferModel(model, test_loader, device, dataType )
    print(f' Model accuracy: {accuracy} and Average Inference latency: {latency} \n'  )
    return accuracy, latency

### Accuracy and Inference latency check

For FP32 model on GPU

In [None]:
#For FP32 model on GPU
print("Model evaluation with FP32 on GPU")
acc_fp32, fp32_avg_latency = Eval_model(cp_file = 'checkpoint_gpu_rn50.pth', dataType = "fp32" , device="gpu")

For BF16 model on GPU

In [None]:
#For AMP BF16 model on GPU
print("Model evaluation with AMP BF16 on GPU")
acc_bf16, bf16_avg_latency = Eval_model(cp_file = 'checkpoint_gpu_rn50.pth', dataType = "bf16" , device="gpu")

## Summary of Results for GPU
The following cells summarize the inference latency and accuracy for the FP32 and AMP BF16 cases and display graphs to show the performance speedup.

In [None]:
#Summary 
print("Summary")
print(f'Inference average latency for FP32 on GPU is:  {fp32_avg_latency} ')
print(f'Inference average latency for AMP BF16 on GPU is:  {bf16_avg_latency} ')

speedup_from_amp_bf16 = fp32_avg_latency / bf16_avg_latency
print("Inference with BF16 is %.2fX faster than FP32 on GPU" %speedup_from_amp_bf16)

In [None]:
plt.figure()
plt.title("ResNet50 Inference Latency Comparison")
plt.xlabel("Test Case")
plt.ylabel("Inference Latency per sample(ms)")
plt.bar(["FP32 on GPU", "AMP BF16 on GPU"], [fp32_avg_latency, bf16_avg_latency])


In [None]:
plt.figure()
plt.title("Accuracy Comparison")
plt.xlabel("Test Case")
plt.ylabel("Accuracy(%)")
plt.bar(["FP32 on GPU", "AMP BF16 on GPU"], [acc_fp32, acc_bf16])
print(f'Accuracy drop with AMP BF16 is: {acc_fp32-acc_bf16}')

In [None]:
speedup_from_bf16_on_gpu = fp32_avg_latency/bf16_avg_latency
plt.figure()
plt.title("GPU AMP BF16 Speedup")
plt.xlabel("Test Case")
plt.ylabel("SpeedUp")
plt.bar(["FP32 on GPU", "Speed Up from AMP BF16 on GPU"], [1, speedup_from_bf16_on_gpu])

In [None]:
print('[CODE_SAMPLE_COMPLETED_SUCCESFULLY]')