# HPC as a solution for AI: PyTorch

<p style='text-align: justify;'>
This section shows how to optimize PyTorch models by using GPUs to accelerate training and execution.
</p>    

The principal goals are:
* **Use** the PyTorch library on GPU environments for the first time to accelerate the training of image classification models,
* **Familiarize** yourself with the CIFAR-10 and CIFAR-100 datasets by classifying their various classes,
* **Evaluate** and **Compare** the performance of your models on GPU and CPU environments to understand the benefits of GPU acceleration in AI tasks.

## The problem: Resource-intensive training and model scalability

<p style='text-align: justify;'>
As AI research progresses, deep neural networks have become critical for tasks like image generation and language translation. However, resource-intensive training challenges arise as networks become more complex and demanding in performance.
</p>

<p style='text-align: justify;'>
Research and development in artificial intelligence have made remarkable strides in recent decades, mainly driven by deep neural networks. These networks are computational structures loosely inspired by the functioning of the human brain. They are particularly well-suited for tasks that involve large volumes of data, such as pattern recognition in images, natural language processing, and more.
</p>

<p style='text-align: justify;'>
However, as the problems being addressed become more complex and performance demands increase, the need for computational resources grows steeply. The scalability of these models also becomes a concern as they grow in size and complexity: maintaining and optimizing constantly expanding AI models is a challenge for the research and development community.
</p>

## The solution: GPUs and PyTorch

<p style='text-align: justify;'>
PyTorch, a popular machine learning and AI framework, offers a flexible interface for designing, training, and evaluating neural networks, and it can run the underlying tensor computations on GPUs to take advantage of their computational power.
</p>
<p style='text-align: justify;'>
Furthermore, Intel® PyTorch is well equipped to fully utilize the optimizations and hardware support of Intel® processors and GPUs. This synergy results in an even more efficient and performance-oriented machine-learning experience. It enables practitioners to extract maximum computational throughput from their hardware infrastructure.
</p>

##  ☆ Challenge: Zoo breakout!☆

<p style='text-align: justify;'>
    Recently, an unexpected incident occurred at the local zoo, <b>Orange Grove Zoo</b>: all the animals escaped from their enclosures and are now roaming freely. To deal with this situation, we need your help locating and classifying the escaped animals, distinguishing each animal class, and identifying possible vehicles in the same environment.
</p>
<p style='text-align: justify;'>
You have been assigned as the person responsible for developing a computer vision system capable of identifying and classifying the escaped animals and identifying the presence of vehicles in the images. We will use the CIFAR-10 dataset and the PyTorch library to train a deep-learning model for this challenge.
</p>
The CIFAR-10 and CIFAR-100 datasets are collections of $32$x$32$ pixel color images grouped into $10$ and $100$ distinct classes, respectively.

- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html): CIFAR-10 consists of $60,000$ images, each belonging to one of the ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. This dataset offers a diverse set of images representing everyday objects.

- [CIFAR-100 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html): CIFAR-100 expands upon the CIFAR-10 concept. However, it introduces a more challenging task by categorizing images into $100$ classes. These classes include various subcategories such as fruits, animals, vehicles, and more.
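
The practical difference between the two datasets is easiest to see as arithmetic. The short sketch below assumes the published split sizes (50,000 training and 10,000 test images for each dataset); the variable names are illustrative only.

```python
# Published CIFAR split sizes (assumed here): 50,000 training + 10,000 test images.
TRAIN_IMAGES = 50_000
TEST_IMAGES = 10_000
TOTAL_IMAGES = TRAIN_IMAGES + TEST_IMAGES

images_per_class = {}
for name, num_classes in [("CIFAR-10", 10), ("CIFAR-100", 100)]:
    # Both datasets are class-balanced, so a plain division gives the per-class count.
    images_per_class[name] = {
        "total": TOTAL_IMAGES // num_classes,
        "train": TRAIN_IMAGES // num_classes,
    }
    print(f"{name}: {images_per_class[name]['total']} images per class "
          f"({images_per_class[name]['train']} for training)")
```

Both datasets contain the same number of images; CIFAR-100 simply has ten times fewer examples of each class to learn from.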

a) **Create** a deep neural network model with the PyTorch library to classify animals and vehicles in a GPU environment using the CIFAR-10 dataset.

b) **Conduct** a comparative analysis between models trained on a CPU and GPU to highlight disparities in results.

c) **Use** the CIFAR-100 dataset for the classification of animals and vehicles. Which environment, GPU or CPU, is the better choice for the training process?

### ☆ Solution for `CIFAR-10` using PyTorch on CPU☆

#### ⊗ Importing packages

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

#### ⊗ Define processing device

In [2]:
device = torch.device("cpu:0")

#### ⊗ Transformations to the data

<p style='text-align: justify;'>
    As part of the data preparation process, we create a <b>transforms</b> object to apply specific transformations to the data. These transformations are commonly applied to training datasets to increase data diversity (random flips and crops) and to prepare images for use in a deep learning model, such as a convolutional neural network (CNN).
    </p>

In [3]:
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
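
To make the last step of the pipeline concrete: `ToTensor()` scales pixel values to the range [0, 1], and `Normalize` with a mean and standard deviation of 0.5 per channel then applies `(value - mean) / std`. The plain-Python sketch below reproduces only that formula; the helper name `normalize` is hypothetical and is not part of torchvision.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel formula applied by transforms.Normalize: (value - mean) / std."""
    return (value - mean) / std

# Pixel intensities after ToTensor() lie in [0, 1]; see where they end up.
for pixel in (0.0, 0.25, 0.5, 1.0):
    print(f"{pixel:.2f} -> {normalize(pixel):+.2f}")
```

With these parameters the inputs are centred on zero and span [-1, 1].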

#### ⊗ Downloading the dataset

<p style='text-align: justify;'>
Following that, download the CIFAR-10 dataset and load it into the code. Define the neural network as we have done in previous notebooks, and remember to move this network instance to the previously defined device.
</p>

In [4]:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=4)

Files already downloaded and verified
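
The training loop later divides the accumulated loss by `len(trainloader)`, which is the number of batches per epoch. A quick sketch of that number, assuming the 50,000-image CIFAR-10 training split and the `DataLoader` default of keeping the last, smaller batch:

```python
import math

train_images = 50_000   # size of the CIFAR-10 training split (assumed)
batch_size = 128        # value passed to DataLoader above

# With drop_last left at its default (False), the final partial batch is kept.
batches_per_epoch = math.ceil(train_images / batch_size)
last_batch_size = train_images - (batches_per_epoch - 1) * batch_size

print(f"{batches_per_epoch} batches per epoch, last batch has {last_batch_size} images")
```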


#### ⊗ Creating the model

Now it is necessary to create the model for our neural network using PyTorch.

In [5]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()
net.to(device)

Net(
  (conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=8192, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=10, bias=True)
)
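
The `in_features=8192` of `fc1` follows from the image size after the two convolution and pooling stages. The sketch below retraces that arithmetic in plain Python; `conv_out` and `pool_out` are hypothetical helpers that apply the usual output-size formulas, not PyTorch functions.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Spatial output size of max pooling whose stride equals its kernel size."""
    return size // kernel

size = 32                                     # CIFAR images are 32x32
size = conv_out(size, kernel=3, padding=1)    # conv1 keeps 32x32
size = pool_out(size, kernel=2)               # first pooling halves it to 16x16
size = conv_out(size, kernel=3, padding=1)    # conv2 keeps 16x16
size = pool_out(size, kernel=2)               # second pooling halves it to 8x8

flattened = 128 * size * size                 # 128 channels from conv2
print(f"Feature map: {size}x{size}, flattened features: {flattened}")
```

This is why `forward` reshapes with `x.view(-1, 128 * 8 * 8)` before the first fully connected layer.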

#### ⊗ Training the network

Now we will train our neural network.

In [6]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

device = torch.device("cpu")
net.to(device)

cpu_start_time = time.time()

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}')

cpu_end_time = time.time()

total_cpu_time_cifar_10 = cpu_end_time - cpu_start_time

print(f"\nCPU Training time: {total_cpu_time_cifar_10:.2f} seconds")

torch.save(net.state_dict(), 'cifar10_cpu_model.pth')

Epoch 1, Loss: 1.7964948424902718
Epoch 2, Loss: 1.417594108740082
Epoch 3, Loss: 1.2418243392654087
Epoch 4, Loss: 1.1093364748198662
Epoch 5, Loss: 1.0149216093980442
Epoch 6, Loss: 0.9433749182449888
Epoch 7, Loss: 0.8874704087786662
Epoch 8, Loss: 0.8377322353365476
Epoch 9, Loss: 0.8021479407539758
Epoch 10, Loss: 0.7553412200849684

CPU Training time: 462.89 seconds


### ☆  Solution for `CIFAR-10` using PyTorch on GPU  ☆

#### ⊗ Transformations to the data

In [9]:
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

In [10]:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

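# Select the GPU; otherwise `device` still refers to the CPU from the previous section.
device = torch.device("cuda:0")

net = Net()
net.to(device)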

Files already downloaded and verified


Net(
  (conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=8192, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=10, bias=True)
)

#### ⊗ Training the network

In [11]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

gpu_start_time = time.time()

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}')

gpu_end_time = time.time()
total_gpu_cifar_10 = gpu_end_time - gpu_start_time

print(f'GPU Training time: {total_gpu_cifar_10:.2f} seconds')

torch.save(net.state_dict(), 'cifar10_gpu_model.pth')

Epoch 1, Loss: 1.7745253969641293
Epoch 2, Loss: 1.4122941140323648
Epoch 3, Loss: 1.239821930827997
Epoch 4, Loss: 1.10916560552919
Epoch 5, Loss: 1.0204177968337407
Epoch 6, Loss: 0.9522925040606037
Epoch 7, Loss: 0.8932674217711934
Epoch 8, Loss: 0.8396686213400663
Epoch 9, Loss: 0.7989676483451863
Epoch 10, Loss: 0.7566478979557066
GPU Training time: 54.03 seconds
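
A note on how this time is measured. CUDA operations are launched asynchronously, so a wall-clock timer can stop before the GPU has finished its work. In the loop above, `loss.item()` copies the loss back to the host at every step, which forces a wait, so the measurement is likely representative. For code without such a synchronization point, a hedged sketch of a timing helper follows; the name `timed` is hypothetical, and the matrix product only stands in for real work.

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return (result, elapsed_seconds), waiting for pending GPU work."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # finish anything queued before starting the clock
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # wait until the timed work has really completed
    return result, time.perf_counter() - start

# Falls back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
a = torch.randn(512, 512, device=device)
product, seconds = timed(lambda: a @ a, device)
print(f"Matrix product on {device}: {seconds:.6f} seconds")
```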


In [12]:
print(f"\nSpeedup:{total_cpu_time_cifar_10 / total_gpu_cifar_10 : .2f}X")


Speedup: 8.57X
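
Both CIFAR-10 runs report only the training loss, so the comparison so far says nothing about how well the models classify. A minimal sketch of an accuracy measurement is shown below. The helper `accuracy` is hypothetical, and the toy model and single-batch loader merely stand in for the trained `net` and a `DataLoader` over the CIFAR-10 test split (`train=False`).

```python
import torch
import torch.nn as nn

def accuracy(model, loader, device):
    """Fraction of samples in `loader` whose top-scoring class matches the label."""
    model.eval()                       # switch layers such as dropout to inference mode
    correct = 0
    total = 0
    with torch.no_grad():              # no gradients are needed for evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Toy stand-ins so the sketch runs without downloading any data.
toy_device = torch.device("cpu")
toy_model = nn.Linear(4, 3).to(toy_device)
toy_loader = [(torch.randn(8, 4), torch.randint(0, 3, (8,)))]
toy_accuracy = accuracy(toy_model, toy_loader, toy_device)
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

Running the same function on the CPU-trained and GPU-trained models would show whether the speedup comes at any cost in accuracy; the similar loss curves suggest it does not.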


### ☆ Solution for `CIFAR-100` using PyTorch on CPU☆

<p style='text-align: justify;'>
Classifying animals and vehicles with the CIFAR-100 dataset is a harder task than with CIFAR-10. CIFAR-100 contains the same number of images as CIFAR-10, but they are split across 100 classes instead of 10, so each class has ten times fewer examples. As a result, training a deep neural network for this purpose typically requires more epochs, or a larger model, to reach comparable accuracy, which raises the total computational cost.
</p>

<p style='text-align: justify;'>
Let's repeat the process and train a new model for the CIFAR-100 dataset using the CPU. Keep in mind that this can be time-consuming, especially on a CPU with modest computational capabilities, so some patience may be required.
</p>

#### ⊗ Define processing device

In [13]:
device = torch.device("cpu")

#### ⊗ Training the network

In [14]:
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=4)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 100) # Redefine for 100 outputs.

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()
net.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

cpu_start_time = time.time()

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f'Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}')

cpu_end_time = time.time()

total_cpu_time_cifar_100 = cpu_end_time - cpu_start_time

print(f"\nCPU Training time for CIFAR-100: {total_cpu_time_cifar_100} seconds")

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./data/cifar-100-python.tar.gz


100.0%


Extracting ./data/cifar-100-python.tar.gz to ./data
Epoch 1, Loss: 4.134092291297815
Epoch 2, Loss: 3.595558230529356
Epoch 3, Loss: 3.3053552775126893
Epoch 4, Loss: 3.0862986937813135
Epoch 5, Loss: 2.9105644085827995
Epoch 6, Loss: 2.7574649564445477
Epoch 7, Loss: 2.622874382511734
Epoch 8, Loss: 2.4997932191395087
Epoch 9, Loss: 2.3941552340222136
Epoch 10, Loss: 2.3102799943645898

CPU Training time for CIFAR-100: 476.2612404823303 seconds
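
The CPU time here (about 476 s) is close to the CIFAR-10 CPU time (about 463 s). One plausible reading is that the two networks differ only in the output layer, while the number of training images and their size are the same. The parameter count below is a rough sketch of that difference; the helper functions are hypothetical and count weights plus biases by hand.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel * kernel + out_channels

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def net_params(num_classes):
    """Total trainable parameters of the Net architecture used in this notebook."""
    return (conv_params(3, 64, 3)
            + conv_params(64, 128, 3)
            + linear_params(128 * 8 * 8, 512)
            + linear_params(512, num_classes))

params_10 = net_params(10)
params_100 = net_params(100)
growth = (params_100 - params_10) / params_10

print(f"CIFAR-10 model:  {params_10:,} parameters")
print(f"CIFAR-100 model: {params_100:,} parameters ({growth:.2%} more)")
```

Under this count the CIFAR-100 model is only about 1% larger, so the per-epoch cost of this particular architecture barely changes; the extra difficulty shows up in the loss rather than in the training time.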


### ☆ Solution for `CIFAR-100` using PyTorch on GPU☆

As the previous runs show, training even a simple neural network on a CPU is slow: ten epochs took several minutes, against less than a minute on the GPU. Therefore, using a GPU is a wise choice if you have access to one, as it significantly accelerates training and lets you experiment with more intricate models and larger datasets in the future. So, let's repeat the process on the GPU, again using the CIFAR-100 dataset.

#### ⊗ Define processing device

In [17]:
device = torch.device("cuda:0")
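
Hard-coding `"cuda:0"` raises an error as soon as work is placed on the device if no CUDA GPU is visible. A common, more defensive pattern, shown here as a sketch rather than as what this notebook runs, checks for a GPU first and falls back to the CPU:

```python
import torch

# Prefer the first CUDA GPU when one is visible; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# Tensors (and models, via .to(device)) created on that device report it back.
x = torch.zeros(2, 3, device=device)
print(f"Tensor lives on: {x.device}")
```

For a benchmark such as this one, it is worth printing the selected device so that a silent CPU fallback is not mistaken for a GPU run.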

#### ⊗ Training the network 

In [19]:
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        # Change output layers to 100 units
        self.fc2 = nn.Linear(512, 100)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()
net.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

gpu_start_time = time.time()

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}')

gpu_end_time = time.time()

total_gpu_cifar_100 = gpu_end_time - gpu_start_time

print(f'GPU Training time: {total_gpu_cifar_100:.2f}')

# Save the model
torch.save(net.state_dict(), 'cifar100_gpu_model.pth')

Files already downloaded and verified
Epoch 1, Loss: 4.17650096495743
Epoch 2, Loss: 3.5885800492123265
Epoch 3, Loss: 3.290251217839663
Epoch 4, Loss: 3.0836600952441127
Epoch 5, Loss: 2.921369514197035
Epoch 6, Loss: 2.773668039180434
Epoch 7, Loss: 2.6351819642059637
Epoch 8, Loss: 2.5177373129998326
Epoch 9, Loss: 2.4155612860799143
Epoch 10, Loss: 2.315835479580228
GPU Training time: 53.77


Now we will evaluate the speedup by comparing the GPU and CPU execution times.

In [20]:
print(f"\nSpeedup:{total_cpu_time_cifar_100 / total_gpu_cifar_100 : .2f}X")


Speedup: 8.86X
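
The CIFAR-100 losses printed above are much higher than the CIFAR-10 ones, but the two are not directly comparable. A classifier that assigns equal probability to each of $K$ classes has a cross-entropy loss of $\ln K$, so the starting point depends on the number of classes. The sketch below computes that baseline; `uniform_guess_loss` is a hypothetical helper.

```python
import math

def uniform_guess_loss(num_classes):
    """Cross-entropy of a classifier that gives every class probability 1/num_classes."""
    return math.log(num_classes)

baselines = {name: uniform_guess_loss(k) for name, k in [("CIFAR-10", 10), ("CIFAR-100", 100)]}
for name, loss in baselines.items():
    print(f"{name}: uniform-guess loss = {loss:.3f}")
```

Against these baselines, both models have clearly learned something after 10 epochs, although the CIFAR-100 model still has a long way to go.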


### Comments about the results

<p style='text-align: justify;'>
We explored training neural networks with PyTorch, comparing CPU and GPU performance on the CIFAR-10 and CIFAR-100 datasets over 10 epochs. The measured training times, in seconds, for the CPU and GPU environments were approximately:
</p>

|  PyTorch      |      CIFAR-10    |  CIFAR-100 |
|---------------|:-------------:   |-----------:|
| CPU time (s)  |  462.89          |   476.26   |
| GPU time (s)  |  54.03           |   53.77    |
| Speedup       |  8.57X           |    8.86X   |

<p style='text-align: justify;'>
This outcome shows that the GPU achieved a <b>speedup of nearly 9X</b> over the CPU for 10 epochs of training on CIFAR-100, the more demanding of the two tasks, and a similar speedup on CIFAR-10. With PyTorch, moving training to the GPU substantially reduced training time, which is particularly advantageous when experimenting with more intricate models and larger datasets.
</p>    
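
The speedup figures can be recomputed from the times printed by the training cells (462.89 s and 54.03 s for CIFAR-10, 476.26 s and 53.77 s for CIFAR-100). The dictionary below is only an illustrative way of organising those measurements:

```python
# Training times in seconds, copied from the cell outputs in this notebook.
timings = {
    "CIFAR-10": {"cpu": 462.89, "gpu": 54.03},
    "CIFAR-100": {"cpu": 476.26, "gpu": 53.77},
}

# Speedup = CPU time divided by GPU time for the same workload.
speedups = {name: t["cpu"] / t["gpu"] for name, t in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}X")
```

These numbers come from a single run on one machine, so they are best read as an indication of the order of magnitude rather than as a precise benchmark.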

## Summary
In this notebook we have shown:

- How to use PyTorch in GPU environments,
- Comparative performance tests between CPU and GPU for model training.

## Clear the memory

Before moving on, please execute the following cell to shut down the kernel and free the memory it holds. This is required before moving on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)