# HPC as solutions for AI: TensorFlow

<p style='text-align: justify;'>
This section shows how to accelerate the training and execution of TensorFlow models using GPUs.
</p>    

The principal goals are:
* **Understand** what TensorFlow is,
* **Learn** the basic concepts of TensorFlow for GPUs,
* **Familiarize** yourself with the CIFAR-10 and CIFAR-100 datasets by classifying their various classes,
* **Create** a model using TensorFlow.

## Which applications use TensorFlow in AI?

<p style='text-align: justify;'>
TensorFlow is an open-source machine learning framework developed by Google that is widely used in artificial intelligence (AI) applications. It provides a comprehensive set of tools and libraries, backed by community support, for building and deploying machine learning models, including neural networks for deep learning and computer vision. When creating a model, however, training is often the bottleneck because it is time-consuming; as we will see throughout this module, TensorFlow allows us to speed it up.
</p>    

## The solution: GPUs and TensorFlow

<p style='text-align: justify;'> 
In addition to being a powerful library for machine learning, TensorFlow lets you train models on GPUs, which accelerates the training process. As we will see later, training a TensorFlow model on a GPU can be considerably faster because GPUs have thousands of processing cores that execute many operations simultaneously. This is especially beneficial for matrix calculations, which are fundamental to machine learning algorithms such as neural networks.
</p>
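
To make the point about matrix calculations concrete, the minimal sketch below times a dense matrix product on a chosen device. The helper name `time_matmul`, the matrix size, and the repeat count are illustrative assumptions and not part of the original notebook; the timings depend entirely on your hardware.

```python
import time

import tensorflow as tf


def time_matmul(device, n=1000, repeats=5):
    """Return the average time in seconds of an n x n matrix product on `device`."""
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        tf.matmul(a, b).numpy()  # warm-up run, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        c.numpy()  # copy the result to host memory so pending device work has finished
        return (time.perf_counter() - start) / repeats


print(f"CPU: {time_matmul('/CPU:0'):.4f} s per product")
if tf.config.list_physical_devices('GPU'):
    print(f"GPU: {time_matmul('/GPU:0'):.4f} s per product")
```

The warm-up call keeps one-time initialization costs out of the measurement, which would otherwise make the first device tested look slower than it is.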

## ☆ Challenge: Zoo breakout! ☆

<p style='text-align: justify;'> 
    Recently, an unexpected incident occurred at the local zoo, <b>Orange Grove Zoo</b>: all the animals escaped from their enclosures and are now roaming freely. To deal with this situation, we need your help locating and classifying the escaped animals, distinguishing each animal class, and identifying possible vehicles in the same environment.
</p>
<p style='text-align: justify;'> 
You have been assigned as the person responsible for developing a computer vision system capable of identifying and classifying the escaped animals and identifying the presence of vehicles in the images. We will use the CIFAR-10 dataset and the TensorFlow library to train a deep-learning model for this challenge.
</p>
Both CIFAR-10 and CIFAR-100 are collections of $32$x$32$ pixel color images; CIFAR-10 groups them into $10$ distinct classes and CIFAR-100 into $100$.

- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html): CIFAR-10 consists of $60,000$ images, each belonging to one of the ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. This dataset offers a diverse set of images representing everyday objects.

- [CIFAR-100 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html): CIFAR-100 expands upon the CIFAR-10 concept, containing 60,000 images as well. However, it introduces a more challenging task by categorizing images into 100 classes. These classes include various subcategories such as fruits, animals, vehicles, and more.
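
As a quick sanity check on the two descriptions above, the sketch below works out how many images each class contains. It assumes the standard split of 50,000 training and 10,000 test images with balanced classes, as described on the CIFAR page; the helper name `images_per_class` is hypothetical and nothing is downloaded.

```python
# Dataset sizes taken from the CIFAR description; no data is downloaded here.
TOTAL_IMAGES = 60_000
TRAIN_IMAGES = 50_000
TEST_IMAGES = TOTAL_IMAGES - TRAIN_IMAGES


def images_per_class(num_classes):
    """Return (total, train, test) images per class, assuming balanced classes."""
    return (
        TOTAL_IMAGES // num_classes,
        TRAIN_IMAGES // num_classes,
        TEST_IMAGES // num_classes,
    )


for name, num_classes in [("CIFAR-10", 10), ("CIFAR-100", 100)]:
    total, train, test = images_per_class(num_classes)
    print(f"{name}: {total} images per class ({train} train, {test} test)")
```

With ten times more classes and the same number of images, CIFAR-100 gives the model only a tenth of the examples per class, which is one reason it is the harder task.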

a) **Create** a deep neural network model using the TensorFlow library to classify animals and vehicles in a GPU environment with the CIFAR-10 dataset.

b) **Conduct** a comparative analysis between models trained on a CPU and GPU to highlight disparities in results.

c) Now, use the CIFAR-100 dataset to classify animals and vehicles on a GPU. Which environment, GPU or CPU, is the better choice for the training process?

### ☆ Solution for `CIFAR-10` using TensorFlow on CPU and GPU ☆

#### ⊗ Importing packages

In [1]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
import time

2023-11-13 20:35:19.729161: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


#### ⊗ Verify the devices

Before running anything on a device, it is important to verify that the device is available and that TensorFlow can use it.

##### Checking device availability

In [2]:
# Checking if GPU is available
print("CPU device: ", tf.config.list_physical_devices('CPU'))
print("GPU device: ", tf.config.list_physical_devices('GPU'))

CPU device:  [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
GPU device:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
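
The availability check can also drive the choice of device, so the same notebook runs on machines without a GPU. This is a minimal sketch; the helper name `pick_device` is an assumption introduced here for illustration.

```python
import tensorflow as tf


def pick_device():
    """Return '/GPU:0' when TensorFlow can see a GPU, otherwise '/CPU:0'."""
    gpus = tf.config.list_physical_devices('GPU')
    return '/GPU:0' if gpus else '/CPU:0'


device = pick_device()
print(f"Training would run on: {device}")
```

The returned string has the same form as the device names passed to `train_model` later in this notebook.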


2023-11-13 20:35:21.355158: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

#### ⊗ Downloading the dataset

Now we need to download the CIFAR-10 dataset, which we will use to train and validate the model. It is a set of labeled images, meaning that each image already has a known class label.

In [3]:
# Loading the CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
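
The labels returned by the loader are integers from 0 to 9. The sketch below maps them to the CIFAR-10 class names in their standard order and, for the zoo challenge, tags each class as an animal or a vehicle. The names `CIFAR10_CLASSES`, `VEHICLES`, and `describe_label` are hypothetical helpers, not part of the Keras API.

```python
# Standard CIFAR-10 class order: index i is the name for integer label i.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]
VEHICLES = {"airplane", "automobile", "ship", "truck"}


def describe_label(label):
    """Return (class name, 'animal' or 'vehicle') for an integer label."""
    name = CIFAR10_CLASSES[int(label)]
    kind = "vehicle" if name in VEHICLES else "animal"
    return name, kind


print(describe_label(3))  # ('cat', 'animal')
print(describe_label(9))  # ('truck', 'vehicle')
```

Because `train_labels` has shape `(50000, 1)`, a single label is read as `train_labels[i][0]` before being passed to `describe_label`.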

#### ⊗ Normalizing the dataset

After downloading the images, we normalize them by scaling the pixel values from the $[0, 255]$ range to $[0, 1]$, which generally helps neural network training.

In [4]:
# Normalizing pixel values to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0
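
Normalization also changes how much memory the images occupy: the images are loaded as `uint8`, and dividing a `uint8` NumPy array by `255.0` produces `float64`. The sketch below only does the arithmetic for the 50,000 training images; the helper name `array_bytes` is an assumption made for illustration.

```python
def array_bytes(num_images, bytes_per_value, height=32, width=32, channels=3):
    """Return the size in bytes of an image array with the given element size."""
    return num_images * height * width * channels * bytes_per_value


TRAIN_IMAGES = 50_000
for dtype_name, size in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    total = array_bytes(TRAIN_IMAGES, size)
    print(f"{dtype_name:>7}: {total:,} bytes")
```

The `float32` figure, 614,400,000 bytes, matches the allocation size in the `cpu_allocator_impl` warning shown in the training logs, which is consistent with the training set being converted to `float32` before training. One common option, not used in the original notebook, is to cast explicitly with `train_images.astype("float32") / 255.0` so the normalized array takes half the host memory of `float64`.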

#### ⊗ Training the model

<p style='text-align: justify;'>
 The training function below creates, compiles, and trains the model inside a single <code>tf.device</code> scope, so that every step runs on the device we choose. This matters because, when no device is specified, TensorFlow places operations on the GPU by default whenever one is available. If the model were created outside this scope, its variables would be placed on the GPU, and a run intended for the CPU would no longer be a clean CPU measurement.
</p>
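
The effect of a `tf.device` scope can be checked directly on a small tensor. In this minimal sketch the tensor names `x` and `y` are illustrative only.

```python
import tensorflow as tf

# Operations created inside the scope are placed on the requested device.
with tf.device('/CPU:0'):
    x = tf.random.uniform((2, 2))
    y = tf.matmul(x, x)

# The device attribute holds the fully qualified name of the device
# that stores the tensor.
print(y.device)
```

The printed name should end in `CPU:0` even when a GPU is present. For a whole program, `tf.debugging.set_log_device_placement(True)` logs the device chosen for each operation.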

In [5]:
def train_model(device, train_images, train_labels):
    with tf.device(device):
        
        # Creating the CNN model
        model = models.Sequential([
            layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.Flatten(),
            layers.Dense(64, activation='relu'),
            layers.Dense(10)
        ])

        # Compiling the model
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])

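        # Time only the training call; test_images and test_labels are read
        # from the notebook's global scope and used for validation after each epoch.
        start_time = time.time()
        history = model.fit(train_images, train_labels, epochs=10,
                            validation_data=(test_images, test_labels), verbose=1)
        end_time = time.time()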
    
    return history, end_time - start_time
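
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand from the layer definitions. It assumes the Keras defaults used above (no padding, stride 1, a bias in every layer); the helper names are hypothetical.

```python
def conv2d_params(in_channels, filters, kernel=3):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units


def count_cnn_params(num_classes):
    """Total trainable parameters of the CNN for a given number of classes."""
    # Spatial size: 32 -> 30 (conv) -> 15 (pool) -> 13 (conv) -> 6 (pool) -> 4 (conv)
    flattened = 4 * 4 * 64
    return (
        conv2d_params(3, 32)
        + conv2d_params(32, 64)
        + conv2d_params(64, 64)
        + dense_params(flattened, 64)
        + dense_params(64, num_classes)
    )


print(count_cnn_params(10))  # 122570
```

More than half of the parameters sit in the first `Dense` layer (65,600), while most of the arithmetic happens in the convolutions, which are applied at every spatial position.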

<p style='text-align: justify;'> The next step is to train the model. In the steps below, we first train on the CPU and then on the GPU, and compare the execution times. (Depending on your machine's CPU and GPU, this may take some time.) </p>

In [6]:
cpu_history, cpu_time = train_model('/CPU:0', train_images, train_labels)
print(f"\nCPU Training time: {cpu_time:.2f} seconds or ({cpu_time / 60:.2f} minutes)")


Epoch 1/10


2023-11-13 20:35:22.566696: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 614400000 exceeds 10% of free system memory.


  28/1563 [..............................] - ETA: 8s - loss: 2.2819 - accuracy: 0.1183

2023-11-13 20:35:23.446930: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f27b000a430 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-13 20:35:23.447049: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-13 20:35:23.453681: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-13 20:35:23.475873: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

CPU Training time: 100.12 seconds or (1.67 minutes)
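
The `1563` in the progress bar above is the number of batches per epoch. Since `model.fit` is called without `batch_size`, the Keras default of 32 applies. The sketch below reproduces that number and derives the average time per epoch from the measured CPU time; the helper name `steps_per_epoch` is an assumption.

```python
import math


def steps_per_epoch(num_samples, batch_size=32):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


steps = steps_per_epoch(50_000)
print(steps)  # 1563

# Average wall-clock time per epoch for the CPU run reported above.
cpu_seconds_per_epoch = 100.12 / 10
print(f"{cpu_seconds_per_epoch:.2f} s per epoch")
```

Note that the measured time also includes the validation pass over the 10,000 test images at the end of each epoch, so the per-epoch figure is an upper bound on the pure training time.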


<p style='text-align: justify;'> Now let's repeat the same process on the GPU. </p>

In [7]:
gpu_history, gpu_time = train_model('/GPU:0', train_images, train_labels)
print(f"\nGPU Training time: {gpu_time:.2f} seconds or ({gpu_time / 60:.2f} minutes)")

2023-11-13 20:37:02.871275: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 614400000 exceeds 10% of free system memory.
2023-11-13 20:37:03.461766: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 614400000 exceeds 10% of free system memory.


Epoch 1/10


2023-11-13 20:37:04.112846: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2023-11-13 20:37:04.177134: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-11-13 20:37:04.266971: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-11-13 20:37:04.289473: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f238ab55dd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-13 20:37:04.289491: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
2023-11-13 20:37:04.334874: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

GPU Training time: 33.99 seconds or (0.57 minutes)


Now we will evaluate the speedup by comparing the GPU and CPU execution times.

In [8]:
print(f"\nSpeedup: {cpu_time / gpu_time:.2f}X")


Speedup: 2.95X
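
Speedup is simply the ratio of the baseline time to the accelerated time. The sketch below wraps that ratio in a small function and also expresses the gain as a percentage of time saved, using the CIFAR-10 timings measured above; both helper names are hypothetical.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


def time_saved_percent(baseline_seconds, accelerated_seconds):
    """Return the share of the baseline time that the accelerated run saves."""
    return 100.0 * (1.0 - accelerated_seconds / baseline_seconds)


cpu_seconds, gpu_seconds = 100.12, 33.99  # measured CIFAR-10 training times
print(f"Speedup: {speedup(cpu_seconds, gpu_seconds):.2f}X")
print(f"Time saved: {time_saved_percent(cpu_seconds, gpu_seconds):.1f}%")
```

A speedup close to 3X means the GPU run takes roughly one third of the CPU time, so about two thirds of the wall-clock time is saved.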


### ☆ Solution for `CIFAR-100` using TensorFlow on CPU and GPU ☆

#### ⊗ Importing packages

In [9]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
import time

#### ⊗ Verify the devices

Before running anything on a device, it is important to verify that the device is available and that TensorFlow can use it.

##### Checking device availability

In [10]:
# Checking if GPU is available
print("CPU device: ", tf.config.list_physical_devices('CPU'))
print("GPU device: ", tf.config.list_physical_devices('GPU'))

CPU device:  [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
GPU device:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


#### ⊗ Downloading the dataset

Now we need to download the CIFAR-100 dataset, which we will use to train and validate the model. It is a set of labeled images, meaning that each image already has a known class label.

In [11]:
# Loading the CIFAR-100 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar100.load_data()

#### ⊗ Normalizing the dataset

After downloading the images, we normalize them by scaling the pixel values from the $[0, 255]$ range to $[0, 1]$, which generally helps neural network training.

In [12]:
# Normalizing pixel values to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0

#### ⊗ Training the model

<p style='text-align: justify;'>
 The training function below creates, compiles, and trains the model inside a single <code>tf.device</code> scope, so that every step runs on the device we choose. This matters because, when no device is specified, TensorFlow places operations on the GPU by default whenever one is available. If the model were created outside this scope, its variables would be placed on the GPU, and a run intended for the CPU would no longer be a clean CPU measurement.
</p>

In [13]:
def train_model(device, train_images, train_labels):
    with tf.device(device):
        # Creating the CNN model
        model = models.Sequential([
            layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation='relu'),
            layers.Flatten(),
            layers.Dense(64, activation='relu'),
            layers.Dense(100)
        ])

        # Compiling the model
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])

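        # Time only the training call; test_images and test_labels are read
        # from the notebook's global scope and used for validation after each epoch.
        start_time = time.time()
        history = model.fit(train_images, train_labels, epochs=10,
                            validation_data=(test_images, test_labels), verbose=1)
        end_time = time.time()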
    
    return history, end_time - start_time
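
The only architectural difference from the CIFAR-10 model is the size of the final `Dense` layer. One way to avoid repeating the whole definition is a small factory function, sketched below. The name `build_model` is hypothetical, and the sketch uses an explicit `tf.keras.Input` layer in place of the `input_shape` argument; the returned model is not compiled.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_model(num_classes):
    """Build the same CNN as above with `num_classes` output logits."""
    return models.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes),  # one logit per class
    ])


for num_classes in (10, 100):
    model = build_model(num_classes)
    print(num_classes, model.output_shape)
```

Calling `build_model` inside the `tf.device` scope of the training function keeps the device placement described earlier, and the model would still need the same `compile` call before training.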

<p style='text-align: justify;'> 
    The next step is to train the model. Note that in the step below we will use the CPU to train the model. 
</p>

In [14]:
cpu_history, cpu_time = train_model('/CPU:0', train_images, train_labels)
print(f"\nCPU Training time: {cpu_time:.2f} seconds or ({cpu_time / 60:.2f} minutes)")

Epoch 1/10


2023-11-13 20:37:37.636772: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 614400000 exceeds 10% of free system memory.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

CPU Training time: 103.07 seconds or (1.72 minutes)


Now we will repeat the same training, this time using the GPU.

In [15]:
gpu_history, gpu_time = train_model('/GPU:0', train_images, train_labels)
print(f"\nGPU Training time: {gpu_time:.2f} seconds or ({gpu_time / 60:.2f} minutes)")

2023-11-13 20:39:20.910500: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 614400000 exceeds 10% of free system memory.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

GPU Training time: 33.96 seconds or (0.57 minutes)


Now we will evaluate the speedup by comparing the GPU and CPU execution times.

In [16]:
print(f"\nSpeedup: {cpu_time / gpu_time:.2f}X")


Speedup: 3.04X


### Comments about the results

<p style='text-align: justify;'>
We explored training neural networks with TensorFlow, comparing CPU and GPU performance on the CIFAR-10 and CIFAR-100 datasets over 10 epochs. The measured training times in the CPU and GPU environments were approximately:
</p>

| TensorFlow   | CIFAR-10 | CIFAR-100 |
|:-------------|:--------:|----------:|
| CPU time (s) |  100.12  |    103.07 |
| GPU time (s) |  33.99   |     33.96 |
| Speedup      |  2.95X   |     3.04X |

<p style='text-align: justify;'>
This outcome shows that the GPU achieved a <b>speedup of about 3X</b> over the CPU when training for 10 epochs, on both CIFAR-10 (2.95X) and the more demanding CIFAR-100 task (3.04X). Thanks to its parallel computing capabilities, the GPU substantially shortened the training time, which is particularly advantageous for large datasets and complex deep learning models.
</p>   
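
Another way to read these measurements is as throughput, that is, images processed per second. The sketch below derives it from the four measured times, assuming 50,000 training images and 10 epochs per run; the helper name `images_per_second` is hypothetical.

```python
TRAIN_IMAGES = 50_000
EPOCHS = 10

# Measured training times in seconds, taken from the runs in this notebook.
measured_seconds = {
    ("CIFAR-10", "CPU"): 100.12,
    ("CIFAR-10", "GPU"): 33.99,
    ("CIFAR-100", "CPU"): 103.07,
    ("CIFAR-100", "GPU"): 33.96,
}


def images_per_second(total_seconds, images=TRAIN_IMAGES, epochs=EPOCHS):
    """Average number of training images processed per second of wall-clock time."""
    return images * epochs / total_seconds


for (dataset, device), seconds in measured_seconds.items():
    print(f"{dataset} on {device}: {images_per_second(seconds):,.0f} images/s")
```

Because each measured time also covers the validation pass after every epoch, these figures slightly understate the pure training throughput.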

## Summary
In this notebook we have shown:

- How to use TensorFlow in a GPU environment,
- Comparative performance tests between CPU and GPU for model training.

## Clear the memory
Before moving on, please execute the following cell to restart the kernel and free the memory it holds. This is required to move on to the next notebook.

In [17]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}


## Next

In this section, you learned how to use TensorFlow in a simple example using a GPU environment. In the next section, you will learn about other applications in which these devices can be very useful, in the notebook [_03-hpc-simulations-pytorch.ipynb_](03-hpc-simulations-pytorch.ipynb).