# CPU vs GPU Performance Comparison


#### Intel CPU vs. Intel HD Graphics

Author: P. Valchev





## Introduction

Training of neural networks can be accelerated by exploiting the parallelism of calculations on GPUs. The Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by Nvidia for writing custom code that runs on the GPU. It is available on some Nvidia products.

It should be pointed out that on a GPU a larger batch size makes fuller use of the GPU memory, which is not the case on the CPU. On the CPU an increase in batch size increases the time per batch. A large batch size can therefore be beneficial on a GPU.
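
To make the batch-size argument concrete, the sketch below counts how many batches one epoch needs for a given dataset size. It is only arithmetic, not a measurement: the dataset size (60,000, the Fashion-MNIST training set) and the two batch sizes (128 and 1028) are the values that appear in the appendices, and the function name `steps_per_epoch` is made up for illustration.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    num_samples = 60_000  # size of the Fashion-MNIST training set
    for batch_size in (128, 1028):
        steps = steps_per_epoch(num_samples, batch_size)
        print(f"batch_size={batch_size}: {steps} steps per epoch")
```

With fewer, larger batches the fixed cost per batch is paid less often. Whether that translates into a shorter epoch on a particular device still has to be measured.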

TensorFlow can, together with CUDA, make use of the whole GPU architecture to further reduce computation time during model training.

In [2] it was found that the performance of TensorFlow depends significantly on the processing unit and that more complex neural networks benefit from the GPU's parallelism, which makes using a GPU with TensorFlow well worth it in most cases. The authors also agree that the benefit of the GPU becomes insignificant when a simple neural network is trained on small instances of training data. It is hard to draw any conclusion about the memory management of the GPU and CPU, as the results indicate that the average memory allocation was affected mostly by the training data. All of the comparison work was done with a CUDA-enabled GPU.

The literature contains examples of TensorFlow performance on CPUs and CUDA-supported GPUs, but TensorFlow and Keras do not officially support other GPUs such as Radeon or Intel Graphics.

This study looks at the performance of TensorFlow/Keras with respect to time. The question to be answered is:

• Does TensorFlow/Keras generally benefit, with regard to time efficiency, from using a non-CUDA GPU instead of the CPU?



# 1. Install PlaidML


PlaidML is an open source tensor compiler. Combined with Intel’s nGraph graph compiler, it gives popular deep learning frameworks performance portability across a wide range of CPU, GPU and other accelerator processor architectures. [1]

![title](img/plaidML.png)

## 1.1 Set-up virtual environment



### Step 1. Preparation
    Stop Jupyter Notebook if it is running.

### Step 2. Create virtual environment
    Create a virtual environment in the terminal with the following command:

        $ conda create -n plaid python=3.7

![title](img/plaidML_create_env.png)
    

### Step 3. Activate virtual environment

    To activate this environment, use:

        $ conda activate plaid

![title](img/plaidML_avitvate_env.png)

    To deactivate an active environment, use:
        $ conda deactivate

### Step 4. Validate virtual environment
    Verify that the environment is up and running:
        $ echo $CONDA_DEFAULT_ENV
    
![title](img/plaidML_validate_env.png)
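
The same check can be made from inside Python, for example in the first cell of the notebook. This is a minimal sketch: it reads the same `CONDA_DEFAULT_ENV` variable as the shell command above, and the helper names `active_conda_env` and `is_expected_env` are made up for illustration.

```python
import os


def active_conda_env() -> str:
    """Return the name of the active conda environment, or '' if none is set."""
    return os.environ.get("CONDA_DEFAULT_ENV", "")


def is_expected_env(expected: str = "plaid") -> bool:
    """True if the active conda environment has the expected name."""
    return active_conda_env() == expected


if __name__ == "__main__":
    print("Active environment:", active_conda_env() or "(none)")
    print("Is 'plaid':", is_expected_env())
```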


## 1.2. Install PlaidML

### Step 5. Install PlaidML with Keras

    $ pip install -U plaidml-keras opencv-python



### Step 6. Run installation script 

        $ plaidml-setup
        
            After the setup starts, it asks whether to enable experimental devices; confirm with "yes". Otherwise it will choose the device with "metal" in its name, which is not going to work, because PlaidML Keras uses OpenCL.
            Select the default device by entering the corresponding number and confirm.
            The last step is to save the settings for this environment, so again confirm with "yes".
    

![title](img/plaidML_install.png)
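
Before starting the notebook, it can help to confirm that the packages from Step 5 are visible to the interpreter of the new environment. The sketch below uses only the standard library and does not import the packages themselves. The module names `plaidml` and `keras` are the ones imported in the appendices, and the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("plaidml", "keras"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

This only shows that the packages are installed. It does not show that `plaidml-setup` selected a working device.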





### Step 7. Run Notebook

    Start Jupyter Notebook in the newly created environment. Install numpy, pandas, etc. in the environment if needed. 
    
![title](img/Anaconda_Environment.png)

# 2. CPU - GPU Benchmark 

# 2.1. Datasets

## Fashion-MNIST Dataset

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. [3]

Although the dataset is relatively simple, it can be used as the basis for learning and practicing how to develop, evaluate, and use deep convolutional neural networks for image classification from scratch. This includes how to develop a robust test harness for estimating the performance of the model, how to explore improvements to the model, and how to save the model and later load it to make predictions on new data.

## CIFAR-10

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. [4]
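
The figures quoted for CIFAR-10 can be cross-checked with a little arithmetic. The same arithmetic gives a rough idea of how much memory the training images of both datasets occupy once converted to `float32`, as Appendix 2 does for Fashion-MNIST. This is a sketch only: the helper `dataset_bytes` is a made-up name, and the estimate ignores labels and any framework overhead.

```python
def dataset_bytes(num_images: int, height: int, width: int, channels: int,
                  bytes_per_value: int = 4) -> int:
    """Memory needed to hold the images as a dense array (float32 by default)."""
    return num_images * height * width * channels * bytes_per_value


# Consistency checks on the CIFAR-10 figures quoted above.
assert 10 * 6_000 == 60_000           # 10 classes, 6000 images each
assert 5 * 10_000 + 10_000 == 60_000  # five training batches plus one test batch
assert 10 * 5_000 == 50_000           # 5000 training images per class

fashion_train = dataset_bytes(60_000, 28, 28, 1)  # 28x28 grayscale
cifar_train = dataset_bytes(50_000, 32, 32, 3)    # 32x32 colour

if __name__ == "__main__":
    for name, size in (("Fashion-MNIST train", fashion_train),
                       ("CIFAR-10 train", cifar_train)):
        print(f"{name}: {size / 2**20:.1f} MiB as float32")
```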


# 2.2. Models



## 2.2.1. Simple training model with small dataset

Appendix 1 contains a complete example that can be run on a system with PlaidML installed. It trains a very simple neural network with one hidden layer to sum the elements of its input vectors.
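
To give a sense of how small this model is, the sketch below counts its trainable parameters by hand. The layer sizes (20 inputs, 32 hidden units, 1 output) are those of the Appendix 1 code, and the helper name `dense_params` is made up for illustration.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


vect_len = 20                        # input vector length in Appendix 1
hidden = dense_params(vect_len, 32)  # Dense(32, activation='relu')
output = dense_params(32, 1)         # Dense(1)
total = hidden + output

if __name__ == "__main__":
    print(f"hidden={hidden}, output={output}, total={total}")
```

With only a few hundred parameters there is very little arithmetic per batch to parallelise.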

The code will be executed on different computing devices (CPU and GPU). It may be surprising to find that training this model on the CPU is faster, because the dataset is very small and the model very simple. 
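
One way to switch between a plain-CPU run and a PlaidML run is the `KERAS_BACKEND` environment variable that the appendices already set. The sketch below wraps it in a helper. The notebook says the CPU runs were "not PlaidML" but does not say which backend they used, so the `"tensorflow"` value is an assumption, and the helper name `select_backend` is made up for illustration.

```python
import os

PLAIDML_BACKEND = "plaidml.keras.backend"  # value used in the appendices
DEFAULT_BACKEND = "tensorflow"             # assumption: CPU runs use Keras' TensorFlow backend


def select_backend(use_plaidml: bool) -> str:
    """Set KERAS_BACKEND; must be called before `import keras`."""
    backend = PLAIDML_BACKEND if use_plaidml else DEFAULT_BACKEND
    os.environ["KERAS_BACKEND"] = backend
    return backend


if __name__ == "__main__":
    print("Keras backend:", select_backend(use_plaidml=True))
    # import keras  # import only after the backend has been selected
```

The variable only chooses the backend. Which device PlaidML then uses (OpenCL or Metal) is the one saved by `plaidml-setup`.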

The comparison of execution times is given in the result tables in Section 3.




## 2.2.2. Training model with Fashion-MNIST Dataset

The training model is available in Appendix 2. The model is more complex than the previous one, and the dataset has 10 different classes. 
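
How much more complex the Appendix 2 model is can be estimated by tracing the tensor shapes through its layers and counting the trainable parameters. The sketch below does this by hand from the layer definitions in Appendix 2. The helper names are made up for illustration, and the total should agree with what `model.summary()` prints, although that output is not reproduced in this notebook.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights plus biases of a square-kernel 2D convolution."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


size, channels = 28, 1                   # Fashion-MNIST input: 28x28x1
total = conv2d_params(2, channels, 64)   # Conv2D(64), padding='same' keeps 28x28
size, channels = size // 2, 64           # MaxPooling2D(2) -> 14x14x64
total += conv2d_params(2, channels, 32)  # Conv2D(32), padding='same' keeps 14x14
size, channels = size // 2, 32           # MaxPooling2D(2) -> 7x7x32
flat = size * size * channels            # Flatten
total += dense_params(flat, 128)         # Dense(128)
total += dense_params(128, 10)           # Dense(10, softmax)

if __name__ == "__main__":
    print(f"flattened features: {flat}, trainable parameters: {total}")
```

The dropout and pooling layers have no trainable parameters, so they only appear as shape changes.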

The execution times on the different devices are compared in the result tables in Section 3.



## 2.2.3. Prediction with VGG-19

The code to be run is taken from [6]. It benchmarks VGG-19 in prediction.




# 3. Results

Tested configurations

![title](img/tested_hardware.png)

### 3.1. Configuration 1
Each model is executed three times: on the CPU (not PlaidML), on GPU-OpenCL and on GPU-Metal. During model execution the CPU and GPU load were monitored, as well as all messages coming from PlaidML.
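
The notebook does not describe how execution time was measured beyond the `%%time` magic in Appendix 2. A minimal sketch of an equivalent measurement that also works outside a notebook is given below. The helper name `time_runs` and the stand-in workload are made up for illustration; in practice the callable would wrap `model.fit(...)` or `model.predict(...)`.

```python
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn `repeats` times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. lambda: model.fit(x_train, y_train, epochs=1)
    durations = time_runs(lambda: sum(i * i for i in range(100_000)))
    print("seconds per run:", [round(d, 4) for d in durations])
```

Keeping the individual durations rather than a single total makes it visible when the first call is slower than the rest, for example because of one-off setup cost.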

### Tested hardware

CPU: 1.6 GHz Dual-Core Intel Core i5 (I5-5250U)

Graphics: Intel HD Graphics 6000


### CPU load

During model run on CPU:

![title](img/CPU_Load_GPU_null.png)

### GPU load

During model run on GPU (OpenCL or Metal):

![title](img/GPU_during_executing.png)


### PlaidML messages

Import PlaidML Keras

![title](img/Using_plaidml_keras.png)


After model compilation the selected GPU device is displayed:

OpenCL
![title](img/Using_OpenCL.png)

Metal-GPU
![title](img/Using_Metal.png)



### 3.1.2 Result summary
The results are summarised in the table below:

![title](img/Result_table.png)


It is confirmed that on small datasets and simple models the CPU outperforms the GPU. 
Since Intel HD Graphics 6000 is not officially supported (it is not among the devices tested by PlaidML), there are also problems during model execution and convergence. For example, when training the model on the Fashion-MNIST dataset the GPU is faster, but a closer look at the results shows much worse accuracy than on the CPU.

The VGG19 prediction benchmark simply shows that OpenCL is not well optimised for this GPU. The result could be different with another GPU. 
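
Because a faster run with much worse accuracy is not a real improvement, time and accuracy should be compared together. The sketch below shows one way to express that. The names `RunResult`, `speedup` and `is_valid_win` and the 0.02 tolerance are made up for illustration, and the numbers in the example are placeholders, not results from this study.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    seconds: float   # wall-clock training time
    accuracy: float  # test accuracy in [0, 1]


def speedup(cpu: RunResult, gpu: RunResult) -> float:
    """How many times faster the GPU run was (values below 1 mean the CPU won)."""
    return cpu.seconds / gpu.seconds


def is_valid_win(cpu: RunResult, gpu: RunResult, max_accuracy_drop: float = 0.02) -> bool:
    """The GPU counts as a win only if it is faster and not noticeably less accurate."""
    return speedup(cpu, gpu) > 1.0 and (cpu.accuracy - gpu.accuracy) <= max_accuracy_drop


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are NOT results from this study.
    cpu = RunResult(seconds=100.0, accuracy=0.80)
    gpu = RunResult(seconds=50.0, accuracy=0.40)
    print(f"speedup: {speedup(cpu, gpu):.1f}x, valid win: {is_valid_win(cpu, gpu)}")
```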

### 3.1.3 Conclusion

- On a simple dataset with a simple neural network model, the CPU performs better.
- Although the GPU is not supported, the models can still run on Intel HD Graphics 6000, but the achieved accuracy is not satisfactory.
- It can be confirmed that Intel HD Graphics 6000 is not optimised for machine learning, and even with PlaidML the results were not encouraging.
- Generally, low-end graphics adapters are not suitable for training machine learning models, which is expected.
- Intel acquired the company that created PlaidML in 2018, but little effort has been made since then. There are no signs that Intel will push PlaidML toward supporting its own graphics adapters in the near future. 
- Without stronger Intel support it is very likely that PlaidML (which is open source) will be discontinued.


### 3.2. Configuration 2
Each model is executed twice: on the CPU (not PlaidML) and on GPU-OpenCL. During model execution the CPU and GPU load were monitored, as well as all messages coming from PlaidML.

### Tested hardware

CPU: 1.9 GHz Intel i7 (I7-8665U)

Graphics: AMD Radeon 550X (2 GB)


### CPU load

During model run on CPU:

![title](img/2_CPU_load.png)

### GPU load

During model run on GPU (OpenCL):

![title](img/2_GPU_load.png)


### PlaidML messages

After model compilation the selected GPU device is displayed:

![title](img/2_opencl_info.png)



### 3.2.2 Result summary
The results are summarised in the table below:

![title](img/2_Results.png)


It is confirmed that on small datasets and simple models the CPU outperforms the GPU. 
On the Fashion-MNIST dataset the GPU is faster, and the accuracy of the models trained on GPU and CPU is similar.

VGG19 is faster on the GPU.

### 3.2.3 Conclusion

- On a simple dataset with a simple neural network model, the CPU performs better.
- The AMD Radeon 550X can be more than two times faster than the I7-8665U for larger datasets.

# 4. Conclusion

Although there are some problems with running PlaidML on Intel HD Graphics 6000, it is still an interesting solution for deep learning. This experiment shows that it is in principle possible to train models with OpenCL, but at the same time a lot remains to be done to improve the system further.

On the AMD Radeon 550X there is an advantage in using the GPU for model training, since it reduces the training time. 




Bibliography
1. https://www.intel.com/content/www/us/en/artificial-intelligence/plaidml.html
2. Eric Lind, Ävelin Pantigoso. A performance comparison between CPU and GPU in TensorFlow, 2019.
3. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
4. Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
5. https://github.com/kennylimyx/plaidml/blob/main/plaidml_mnist_gpu_test.ipynb
6. https://reposhub.com/cpp/machine-learning/plaidml-plaidml.html
7. https://towardsdatascience.com/machine-learning-on-macos-with-an-amd-gpu-and-plaidml-55a46fe94bc0



## Appendix 1.
## Simple training model with small dataset [7]



In [None]:
import numpy as np
from os import environ

# Select the PlaidML backend; this must happen before Keras is imported
environ["KERAS_BACKEND"] = "plaidml.keras.backend"
import keras

# Parameters
num_samples = 100000
vect_len = 20
max_int = 10
min_int = 1

# Generate dataset: random integer vectors (values in [min_int, max_int)) and their sums
X = np.random.randint(min_int, max_int, (num_samples, vect_len))
Y = np.sum(X, axis=1)

# Use 80% of the data for training and the rest for validation
split_idx = int(0.8 * len(Y))
train_X, test_X = X[:split_idx, :], X[split_idx:, :]
train_Y, test_Y = Y[:split_idx], Y[split_idx:]

# Build the model: one hidden layer and a single linear output
model = keras.models.Sequential()
model.add(keras.layers.Dense(32, activation='relu', input_shape=(vect_len,)))
model.add(keras.layers.Dense(1))
# Note: 'accuracy' is not meaningful for this regression task; the MSE loss is the relevant quantity
model.compile('adam', 'mse', metrics=['accuracy'])

history = model.fit(train_X, train_Y, validation_data=(test_X, test_Y),
                    epochs=10, batch_size=128)

## Appendix 2.
## Training model with Fashion-MNIST Dataset [5]

In [None]:

import os
import plaidml.keras

# Install PlaidML as the Keras backend; this must happen before Keras is imported
plaidml.keras.install_backend()
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

import keras

# Download the Fashion-MNIST dataset through Keras
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Reshape to (samples, 28, 28, 1) and scale pixel values to [0, 1]
x_train = x_train.astype('float32').reshape(60000, 28, 28, 1) / 255
x_test = x_test.astype('float32').reshape(10000, 28, 28, 1) / 255

# Build a CNN model.
# Run this each time before fitting the model.
# If PlaidML is in use, "INFO:plaidml:Opening device xxx" should appear after running this part.
model = keras.Sequential()
model.add(keras.layers.Conv2D(filters=64, kernel_size=2, padding='same', activation='relu', input_shape=(28, 28, 1)))
model.add(keras.layers.MaxPooling2D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=2))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss=keras.losses.sparse_categorical_crossentropy,
              metrics=['accuracy'])

model.summary()

# The following belongs in its own notebook cell: %%time must be the first line of a cell
%%time
# Fit the model on the training set
model.fit(x_train, y_train,
          batch_size=1028,
          epochs=1)

# Evaluate the model on the test set
score = model.evaluate(x_test, y_test, verbose=0)
# Print test accuracy
print('\n', 'Test accuracy:', score[1])