# High-performance computing applied in AI solutions

<p style='text-align: justify;'>
High-performance computing, known as HPC, is a field of modern computing whose goal is to solve highly complex computational problems involving large volumes of data. It does so by dividing a problem into smaller parts that are processed simultaneously by several processors, which shortens the time to solution. HPC enables scientists, engineers, and researchers to perform highly detailed simulations, massive data analysis, and precise modeling that would be impractical or unachievable on conventional systems.
</p>

<p style='text-align: justify;'>
HPC systems are designed to handle large volumes of data and perform intensive calculations in a fraction of the time it would take on conventional computers. An HPC system comprises one or more supercomputers built from interconnected high-performance processors, large amounts of memory, and fast storage to handle intensive workloads.
</p>

<div style="text-align:center">
<img src="./images/figure01_ponte_vecchio.jpg" style="width: 500px;">
</div>



## ⊗ **Why use parallel computing?**

<p style='text-align: justify;'>   
Parallel computing is widely used in many areas, such as scientific simulations, graphics rendering, big data analysis, machine learning, artificial intelligence, and image processing. There are several approaches to implementing parallel computing, including data parallelism, task parallelism, instruction-level parallelism, bit-level parallelism, and thread-level parallelism, among others.
</p>
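
To make two of these approaches concrete, here is a minimal sketch using only the Python standard library. The function names (`split_into_chunks`, `data_parallel_sum_of_squares`, `task_parallel_stats`) are hypothetical and chosen for illustration. Note that threads in CPython share a global interpreter lock, so this sketch shows the structure of data and task parallelism rather than a real speedup; CPU-bound workloads would typically use processes or accelerators instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous slices of nearly equal size."""
    size, remainder = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work applied to each slice of the data."""
    return sum(x * x for x in chunk)


def data_parallel_sum_of_squares(values, n_workers=4):
    """Data parallelism: the same function runs on different slices of the data."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_sums)


def task_parallel_stats(values):
    """Task parallelism: different functions run concurrently on the same data."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "min": pool.submit(min, values),
            "max": pool.submit(max, values),
            "total": pool.submit(sum, values),
        }
        return {name: future.result() for name, future in futures.items()}


numbers = list(range(1, 1001))
print(data_parallel_sum_of_squares(numbers))  # 333833500
print(task_parallel_stats(numbers))
```

In both cases the pattern is the same: split the work, run the pieces concurrently, and combine the results.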

<div style="text-align:center">
<img src="./images/figure02_parallel_computing.png" style="width: 1000px;">
</div>

<p style='text-align: justify;'>   
One of the main gains parallel computing provides is a remarkable performance acceleration. You can get more work done in less time by running multiple tasks simultaneously. This is particularly advantageous for solving complex problems that involve intensive calculations or the analysis of vast data sets.
Parallel computing is also essential for dealing with the growing volume of data generated in our digital world. In machine learning and artificial intelligence, training complex models in parallel is critical for creating effective AI systems in areas such as pattern recognition, natural language processing, and computer vision.
</p>

<p style='text-align: justify;'> 
HPC and parallel computing can be used in several scenarios. Let's look at some of them below.
</p>

## ⊗ **HPC applied in AI**

<p style='text-align: justify;'> 
HPC plays a key role in applications that use artificial intelligence, since substantial computational power is needed to train increasingly complex AI models and to analyze massive data sets.
</p>

<div style="text-align:center">
<img src="./images/figure03_aurora_supercomputing.jpg" style="width: 500px;">
</div>
    
<p style='text-align: justify;'> 
A notable example is Intel's Aurora supercomputer, which plays a key role in research areas as diverse as neuroscience, aerospace simulation, universe exploration, and artificial intelligence. These studies require extremely high processing capacity, making HPC infrastructure essential. Research in these areas requires computational algorithms that can handle large volumes of data and that support artificial intelligence solutions. For example, in neuroscience research, Aurora can help simulate complex neural networks and analyze brain data at scale, leading to advances in understanding neurological disorders and developing more effective treatments.
</p>

<p style='text-align: justify;'>
In aerospace simulations and exploration of the universe, Aurora allows the modeling of complex phenomena, such as the behavior of planetary systems and aircraft flight dynamics, contributing to space exploration and the development of more advanced technologies. In summary, the intersection of HPC with life science research, space research, and AI drives scientific discoveries and technological advances that have the potential to transform our lives and our understanding of the world around us.
</p>  

## ⊗ **HPC use cases**

<p style='text-align: justify;'> 
HPC is often used in fields where processing requirements are extraordinarily high and exceed the capabilities of conventional computer systems. Here are some examples of HPC use cases:
</p>
    
<div style="text-align:center">
<img src="./images/figure04_hpc_applications.png" style="width: 500px;">
</div>

* **Machine learning and artificial intelligence:** training complex machine learning models requires a lot of computing power. HPC allows you to train models faster and handle larger datasets, resulting in advances in AI, pattern recognition, and data analysis;

* **Biomedical research:** HPC accelerates the virtual screening of molecules, assessing how they interact with target proteins. This streamlines the drug discovery process, saving time and resources;

* **Aerodynamics and flight simulation:** the aerospace industry uses HPC to simulate the behavior of aircraft, improve wing design, optimize fuel efficiency, and study aerodynamics;

* **Exploration of natural resources and petroleum:** the simulation of oil and gas reservoirs, as well as the exploration of mineral resources, requires complex models and intensive calculations. HPC helps make informed decisions about locating and exploiting these resources;

* **Particle physics:** particle physics research requires HPC to analyze data generated by particle accelerators such as the LHC (Large Hadron Collider);

* **Scientific research and simulations:** HPC allows the modeling of natural phenomena and processes that would be almost impossible to observe experimentally. For example, simulating particle interactions in a particle accelerator or simulating long-term weather processes.

<p style='text-align: justify;'> 
These are just a few examples of the many use cases for HPC. In general, it plays a crucial role in areas that need advanced computational capacity to solve complex problems, often driving innovation and scientific and technological progress.
</p>

## **HPC as solutions for AI**

<p style='text-align: justify;'> 
When we want to deal with a large volume of information in artificial intelligence applications, aiming to substantially reduce the time required to solve the problem, it is essential that specific software tools are implemented for this purpose. Let's now explore two of the most prominent libraries: <a href='https://www.tensorflow.org/api_docs' target='_blank'><em>TensorFlow</em></a> and <a href='https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html' target='_blank'><em>PyTorch</em></a>. These tools play a central role in creating and training AI models in processing-intensive environments. Let's get to know each of them.
</p>

### ⊗ **TensorFlow**

<p style='text-align: justify;'>
<em>TensorFlow</em> is an open-source library focused on high-performance numerical computing, especially suitable for training and deploying machine learning and deep learning models. It is used in various applications, from computer vision to natural language processing.
</p>
<p style='text-align: justify;'>
In our algorithms, we will be using a package called <a href='https://www.tensorflow.org/guide/keras?hl=pt-br' target='_blank'><em>Keras</em></a>, which is nothing more than a high-level API for building and training neural networks. Its main feature is to simplify and streamline the development of deep learning models.
</p>  
<p style='text-align: justify;'>
Let's see how we can check which devices are available and train a simple neural network using TensorFlow.
</p>

#### **Checking device availability**

In [None]:
import tensorflow as tf

# Check for available GPUs
gpus = tf.config.experimental.list_physical_devices('GPU')

if gpus:
    # Configure GPU memory allocation dynamically
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

    # Display information about available GPU
    for i, gpu in enumerate(gpus):
        print(f"GPU {i + 1}: {gpu.name}")
else:
    print("No GPU available. Using CPU.")


####  **Creating a neural network with TensorFlow**

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Set the training data
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([0, 1, 1, 0])

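# Create the model
# Note: XOR is not linearly separable, so this single-layer model cannot
# classify all four points correctly; it only demonstrates the workflow.
model = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=2, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train on the first GPU if one is available; otherwise fall back to the CPU
device_name = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
with tf.device(device_name):
    model.fit(X_train, y_train, epochs=20)

# Evaluate the model
accuracy = model.evaluate(X_train, y_train)[1]
print(f'Model accuracy: {accuracy}')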


### ⊗ **PyTorch**

<p style='text-align: justify;'>
<em>PyTorch</em> is an open-source machine learning library known for its flexibility and ease of use, making it a popular choice among deep learning researchers and developers. Let's see the same code we wrote above, but now with PyTorch.
</p>

#### **Checking device availability**

In [None]:
import torch

# Check if a GPU is available
if torch.cuda.is_available():
    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()

    # Display information about available GPUs
    for i in range(num_gpus):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPU available. Using CPU.")


####  **Creating a neural network with PyTorch**

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define training data as tensors
X_train = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y_train = torch.tensor([0, 1, 1, 0], dtype=torch.float32).view(-1, 1)

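# Create the model
# Note: XOR is not linearly separable, so a single linear layer cannot
# classify all four points correctly; the goal is only to show the workflow.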
model = nn.Sequential(
    nn.Linear(2, 1),
    nn.Sigmoid()
)

# Moving model and data to GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
X_train, y_train = X_train.to(device), y_train.to(device)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# train the model
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

# Evaluate the model
with torch.no_grad():
    predicted = model(X_train)
    predicted = (predicted > 0.5).float()
    accuracy = (predicted == y_train).sum().item() / len(y_train)
    print(f'Model accuracy: {accuracy}')


##  ☆ Challenge: Zoo breakout!☆ 

<p style='text-align: justify;'> 
    Recently, an unexpected incident occurred at the local zoo, <b>Orange Grove Zoo</b>: all the animals escaped from their enclosures and are now roaming freely. To deal with this situation, we need your help locating and classifying the escaped animals, distinguishing each animal class, and identifying possible vehicles in the same environment.
</p>
<p style='text-align: justify;'> 
You have been assigned to develop a computer vision system capable of identifying and classifying the escaped animals and detecting the presence of vehicles in the images. We will use the CIFAR-10 dataset and the TensorFlow and PyTorch libraries to train deep learning models for this challenge.
</p>
The CIFAR-10 dataset is a collection of $32 \times 32$ pixel color images grouped into $10$ distinct classes.

- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html): CIFAR-10 consists of $60{,}000$ images, each belonging to one of the ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. This dataset offers a diverse set of images representing everyday objects.

a) **Create** a deep neural network model with each of the TensorFlow and PyTorch libraries to classify animals and vehicles on a CPU using the CIFAR-10 dataset;

b) **Measure** the execution time of training on CIFAR-10 in a CPU environment;

c) **Justify** whether it is more advantageous to use these tools (TensorFlow and PyTorch) in a GPU or a CPU environment.

### ☆ Solution for `CIFAR-10` using TensorFlow on a CPU☆

#### ⊗ Importing packages

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import time

#### ⊗ Verify the devices

In [None]:
# Checking if CPU is available
cpus = tf.config.experimental.list_physical_devices('CPU')
if not cpus:
    raise RuntimeError("No CPU available.")
else:
    print("CPU available")

#### ⊗ Downloading the dataset

Now we need to download the CIFAR-10 dataset so that we can train and evaluate our model. CIFAR-10 is a dataset of labeled images, meaning that each image already has a known label.

In [None]:
# Loading the CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

#### ⊗ Normalizing the dataset

After downloading the entire set of images, we need to normalize them so that we can use them in our example.

In [None]:
# Normalizing pixel values to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0
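
To see what this division does, here is a small sketch that uses a random array as a stand-in for a batch of images. The array `fake_images` is hypothetical and is not the real CIFAR-10 data; it only mimics its shape and data type.

```python
import numpy as np

# Hypothetical stand-in for a batch of CIFAR-10 images: 4 RGB images of 32x32 pixels,
# stored as unsigned 8-bit integers in the range [0, 255].
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Dividing by 255.0 converts the values to floating point and rescales them to [0, 1].
normalized = fake_images / 255.0

print(fake_images.dtype, fake_images.min(), fake_images.max())
print(normalized.dtype, normalized.min(), normalized.max())
```

Keeping the inputs in a small, consistent range generally helps gradient-based training behave more stably.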

#### ⊗ Creating the model

Now it is necessary to create the model for our neural network. Notice that this step becomes extremely simple using the power of TensorFlow.

In [None]:
# Creating the CNN model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10)
])

#### ⊗ Compiling

After creating the model, it needs to be compiled, just as we did in the first TensorFlow example.

In [None]:
# Compiling the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
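
The last layer of the model, `Dense(10)`, has no activation, so it outputs raw scores (logits) rather than probabilities. Passing `from_logits=True` tells the loss to apply the softmax internally. The NumPy sketch below illustrates that calculation under simplifying assumptions; the function name is hypothetical and this is not TensorFlow's actual implementation.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed directly from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64).reshape(-1)
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: log of the probability assigned to every class.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each sample.
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return float(-true_class_log_probs.mean())


# Two samples, ten classes (as in CIFAR-10); all-zero logits mean a uniform prediction,
# so the loss equals ln(10), roughly 2.3026.
uniform_logits = np.zeros((2, 10))
print(sparse_categorical_crossentropy_from_logits(uniform_logits, [3, 7]))
```

A loss near ln(10) at the start of training is therefore expected for an untrained ten-class classifier.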

#### ⊗ Training function

Now we define the training function, which trains the model on the CIFAR-10 dataset and measures how long training takes.

In [None]:
# Function to train the model and measure time with progress
def train_model(device, train_images, train_labels):
    with tf.device(device):
        start_time = time.time()
        history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels), verbose=1)
        end_time = time.time()
    
    return history, end_time - start_time
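
The same timing pattern can be packaged as a reusable helper. The sketch below is one possible approach, not part of the solution above: the `stopwatch` name is hypothetical, and it uses `time.perf_counter()`, which is designed for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the elapsed time even if the enclosed block raises an exception.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.4f} seconds")


# Hypothetical workload standing in for a call such as model.fit(...).
with stopwatch("dummy workload") as timing:
    total = sum(i * i for i in range(100_000))
```

After the block finishes, `timing["seconds"]` holds the elapsed time, so it can be stored and compared across devices.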

#### ⊗ Training the model

<p style='text-align: justify;'> The next step is to perform the model training. Note that in the step below, we will use the CPU to train the model. </p>

In [None]:
# Train the model on the CPU 
cpu_history, cpu_time = train_model('/CPU:0', train_images, train_labels)
print(f"\nCPU Training time: {cpu_time:.2f} seconds or ({cpu_time / 60:.2f} minutes)")

### Comments about the results with TensorFlow

<p style='text-align: justify;'> 
We observed that training our neural network took about <b>253 seconds (4.22 minutes)</b> for 10 epochs on CIFAR-10. Ten epochs is a small amount of training, and CIFAR-10 is small compared with many real-world data sets. More extensive training on the CPU therefore quickly becomes impractical, making it necessary to use more robust computing resources, such as a graphics processing unit (GPU).
</p>

#### Hardware and software setup used in the experiments

The operating system used in all experiments is Red Hat Enterprise Linux, and all experiments performed in this work were run on an environment offering a GPU node, one Intel(R) i7-1165G7 CPU @4.70 GHz, 64 GB RAM, and 1 Intel Tiger Lake Gen12. This architecture provides 16 streaming multiprocessors with 8 GB HBM2 memory. 

### ☆ Solution for `CIFAR-10` using PyTorch on a CPU☆

#### ⊗ Importing Packages

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

#### ⊗ Verify the devices

Before running anything on a device, it is important to select it explicitly and make sure PyTorch can use it. Here we select the CPU, which is always available to PyTorch.

In [None]:
device = torch.device("cpu:0")

#### ⊗ Transformations to the data

<p style='text-align: justify;'> 
    As part of the data preparation process, we create a <b>transforms</b> object to apply specific transformations to the data. These transformations are commonly applied to training datasets to increase data diversity and prepare images for use in a deep learning model, such as a convolutional neural network (CNN).
    </p>

In [None]:
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
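
The final `Normalize` step applies `(value - mean) / std` to each channel. With a mean and standard deviation of `0.5`, this maps the `[0, 1]` range produced by `ToTensor()` onto `[-1, 1]`. Here is a minimal NumPy sketch of that arithmetic, using three sample values rather than real image data:

```python
import numpy as np

# ToTensor() yields channel values in [0, 1]; these three sample values cover that range.
pixel_values = np.array([0.0, 0.5, 1.0])

# The per-channel mean and standard deviation passed to transforms.Normalize.
mean, std = 0.5, 0.5
normalized = (pixel_values - mean) / std

print(normalized)  # [-1.  0.  1.]
```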

#### ⊗ Downloading the dataset

<p style='text-align: justify;'> 
Next, download the CIFAR-10 dataset and load it with a <code>DataLoader</code>. The neural network is defined in the following step; remember to move the network instance to the previously defined device.
</p>

In [None]:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=4)

#### ⊗ Creating the model

Now it is necessary to create the model for our neural network using PyTorch.

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 128 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()
net.to(device)
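
The input size of `fc1`, `128 * 8 * 8`, follows from how each layer changes the spatial size of the image. The sketch below traces that arithmetic with two small helper functions; the helper names are hypothetical, and they assume the pooling stride equals the kernel size, which is the default for `torch.max_pool2d`.

```python
def conv2d_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def max_pool_output_size(size, kernel_size):
    """Spatial output size of max pooling whose stride equals its kernel size."""
    return (size - kernel_size) // kernel_size + 1


size = 32                                                   # CIFAR-10 images are 32x32
size = conv2d_output_size(size, kernel_size=3, padding=1)   # conv1 keeps 32
size = max_pool_output_size(size, kernel_size=2)            # pooling halves it to 16
size = conv2d_output_size(size, kernel_size=3, padding=1)   # conv2 keeps 16
size = max_pool_output_size(size, kernel_size=2)            # pooling halves it to 8

flattened_features = 128 * size * size                      # conv2 has 128 output channels
print(size, flattened_features)  # 8 8192
```

If the kernel size, padding, or number of pooling steps changes, the input size of `fc1` has to be recomputed in the same way.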

#### ⊗ Training the network

Now we will train our neural network.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

device = torch.device("cpu")
net.to(device)

cpu_start_time = time.time()

for epoch in range(10):  
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
    print(f'Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}')

cpu_end_time = time.time()

cpu_time = cpu_end_time - cpu_start_time

print(f"\nCPU Training time: {cpu_time:.2f} seconds or ({cpu_time / 60:.2f} minutes)")

torch.save(net.state_dict(), 'cifar10_cpu_model.pth')

### Comments about the results using PyTorch

<p style='text-align: justify;'> 
You may have noticed that we trained for only ten epochs, and it took around <b>811 seconds (13.52 minutes)</b>. That is a long time for such a small number of epochs. Imagine increasing the number of epochs or using a more challenging dataset like <b>CIFAR-100</b>! It would become impractical to perform this kind of task on conventional computing resources, such as a laptop with a CPU. Therefore, it is necessary to rely on the much greater computational power provided by environments with GPUs.
</p>

##  The performance metric used in the experiments

<p style='text-align: justify;'>
We used the concept of <b>Speedup</b> to evaluate the performance of the algorithms. In computer architecture, Speedup is a number that measures the relative performance of two systems processing the same problem. More technically, it is the improvement in the speed of execution of a task executed on two similar architectures with different resources.
</p>
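
A common way to write this is $S = T_{\text{baseline}} / T_{\text{improved}}$, where $T_{\text{baseline}}$ is the execution time on the reference system and $T_{\text{improved}}$ is the execution time on the system being evaluated. The sketch below computes it; the function name and the timings are hypothetical and chosen only to illustrate the arithmetic, not measured in these experiments.

```python
def speedup(baseline_seconds, improved_seconds):
    """Return how many times faster the improved run is than the baseline run."""
    if baseline_seconds <= 0 or improved_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / improved_seconds


# Hypothetical timings, chosen only to illustrate the arithmetic.
cpu_seconds = 240.0
accelerated_seconds = 30.0
print(f"Speedup: {speedup(cpu_seconds, accelerated_seconds):.1f}x")  # Speedup: 8.0x
```

A speedup greater than 1 means the evaluated system finished the same task faster than the baseline.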

## Summary

<p style='text-align: justify;'>
In this notebook we have shown: 

- The definition of HPC and how it is applied in AI applications,
- Some HPC use cases, and solutions for AI using TensorFlow and PyTorch,
- An example of training a neural network on CPU systems using the CIFAR-10 dataset.
</p>    

## Clear the memory

Before moving on, please execute the following cell to clear up the CPU memory. This is required to move on to the next notebook.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In this section you learned what HPC means for AI applications and why long training times on a CPU motivate the use of a GPU to improve the performance of our AI algorithms. In the next notebook we will study HPC solutions for AI using TensorFlow in [_02-hpc-simulations-tensorflow.ipynb_](02-hpc-simulations-tensorflow.ipynb).