# Dask Bursting GPU vs. CPU Speed Testing

In this notebook, we compare the speed at which the CPU and the GPU complete a matrix multiplication of the same random arrays.

In [None]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

import dask
import io
import re
import logging
import s3fs

from astropy.io import fits
from dask.distributed import Client
from os import listdir
from os.path import isfile, join
from re import search

In [None]:
from dask_gateway import Gateway, GatewayCluster
gateway = Gateway()
options = gateway.cluster_options()

# We're setting some defaults here just for grins... 
# I like the pangeo/base-notebook image for the workers since it has almost every library you'd need on a worker
# In our environment, without setting these, the widget will default to the same image that the notebook itself is running, 
# as well as 2 cores and 4GB memory per worker

options.image = 'public.ecr.aws/q3h7b4o8/heliocloud/helio-daskhub-mltf:2025.01.29'
options.worker_cores = 4
options.worker_memory = 7
options.profile='gpu-xlarge'

# This calls the widget
options  

In [None]:
cluster = gateway.new_cluster(options)
client = cluster.get_client()
n_workers = 3
cluster.scale(n_workers)
#cluster.adapt(minimum=1, maximum=n_workers)
cluster

In [None]:
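def get_nvidia_driver_version():
    import pynvml
    import tensorflow as tf

    # NVML must be initialized before it can be queried; repeated calls are safe
    pynvml.nvmlInit()
    # Return the driver version and the number of GPUs TensorFlow can see on this worker
    return pynvml.nvmlSystemGetDriverVersion(), len(tf.config.list_physical_devices('GPU'))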

In [None]:
client.run(get_nvidia_driver_version)
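
`client.run` returns a dictionary that maps each worker's address to the value the function returned on that worker. The sketch below shows one way to summarize such a dictionary. The worker addresses, the driver version strings, and the helper name `workers_without_gpu` are made-up placeholders, not output from a real cluster.

```python
def workers_without_gpu(run_result):
    """Return addresses of workers whose reported GPU count is zero.

    run_result maps worker address -> (driver_version, gpu_count),
    the shape returned by get_nvidia_driver_version on each worker.
    """
    return sorted(
        addr for addr, (_, gpu_count) in run_result.items() if gpu_count == 0
    )


# Placeholder data for illustration only (not real cluster output).
example = {
    "tls://10.0.0.1:40001": ("000.00.00", 1),
    "tls://10.0.0.2:40002": ("000.00.00", 1),
    "tls://10.0.0.3:40003": ("000.00.00", 0),
}

missing = workers_without_gpu(example)
print(missing if missing else "every worker sees at least one GPU")
```

In the notebook, the same helper could be applied to the value returned by `client.run(get_nvidia_driver_version)`.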

In [None]:
import numpy as np
import tensorflow as tf

array_a = np.random.rand(4000,6000).astype(np.float32)
array_b = np.random.rand(6000,4000).astype(np.float32)

In [None]:
def do_tensor_math(array_a, array_b):
    import os
    # Set log level to 3 to suppress INFO, WARNING, and ERROR messages
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
    import tensorflow as tf
    
    num_gpus = len(tf.config.list_physical_devices('GPU'))

    tf.debugging.set_log_device_placement(True)
    
    a = tf.constant(array_a)
    b = tf.constant(array_b)
    #c = tf.matmul(a, b)

    # Run the matrix multiplication 1000 times; with no explicit tf.device,
    # TensorFlow places the op on the GPU when one is visible to the worker
    for i in range(1000):
        c = tf.matmul(a, b)
    
    return c

In [None]:
%%time
a_scatter = client.scatter(array_a)
b_scatter = client.scatter(array_b)
c = client.submit(do_tensor_math, a_scatter, b_scatter)
#print(c.result())

In [None]:
c.result()

In [None]:
cluster.shutdown()

In [None]:
cluster

### Cell #1

First, do the imports. We set the TensorFlow log level to 3 to suppress warnings, and we enable device placement logging so that TensorFlow still reports whether each operation runs on the CPU or the GPU.

In [None]:
import os
# Set log level to 3 to suppress INFO, WARNING, and ERROR messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow as tf
import numpy as np
import time

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

tf.debugging.set_log_device_placement(True)

### Cell #2: Create some tensors

Create the matrices that we'll be working with. NumPy generates float64 values by default, so we cast them to float32, which is TensorFlow's default floating-point type and the precision at which GPUs typically run matrix multiplication fastest.

In [None]:
array_a = np.random.rand(4000,6000).astype(np.float32)
array_b = np.random.rand(6000,4000).astype(np.float32)
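
To get a feel for the size of this workload, the arithmetic below estimates the memory footprint of each array and the number of floating-point operations in one product. It is a back-of-the-envelope sketch that assumes the usual count of 2·m·k·n operations for an (m × k) by (k × n) product (one multiply and one add per term). The helper names are illustrative and not part of the notebook.

```python
# Rough workload estimate for the (4000 x 6000) @ (6000 x 4000) product.
# Helper names are illustrative, not part of the notebook.

BYTES_PER_FLOAT32 = 4


def array_megabytes(rows, cols, bytes_per_value=BYTES_PER_FLOAT32):
    """Memory needed for a dense rows x cols array, in MB (10**6 bytes)."""
    return rows * cols * bytes_per_value / 1e6


def matmul_flops(m, k, n):
    """Approximate floating-point operations for (m x k) @ (k x n)."""
    return 2 * m * k * n


m, k, n = 4000, 6000, 4000
print(f"array_a and array_b: {array_megabytes(m, k):.0f} MB each")
print(f"result c: {array_megabytes(m, n):.0f} MB")
print(f"one matmul: {matmul_flops(m, k, n):.3e} FLOPs")
print(f"100 matmuls: {100 * matmul_flops(m, k, n):.3e} FLOPs")
```

Each input array occupies about 96 MB and the result about 64 MB, so all three tensors fit comfortably in the memory of most GPUs; the cost is dominated by the roughly 1.9 × 10^11 operations per product.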

### Cell #3: Matrix multiplication on the CPU

We multiply the matrices on the CPU once. Because Cell #1 enabled device placement logging, this cell should output a line such as "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0", showing that the operation ran on the CPU.

In [None]:
%%time
with tf.device('/CPU:0'):
  # Place tensors on the CPU
  a = tf.constant(array_a)
  b = tf.constant(array_b)
  c = tf.matmul(a, b)

print(c)
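
If you capture the placement messages as text, the device can be pulled out programmatically instead of read by eye. The sketch below assumes the message format quoted above ("Executing op ... in device .../device:CPU:0"); the pattern and the helper name `parse_placement` are illustrative, and other TensorFlow versions may word the message differently.

```python
import re

# Pattern for placement messages of the form quoted in the text above.
PLACEMENT_RE = re.compile(
    r"Executing op (?P<op>\S+) in device \S*/device:(?P<kind>[A-Z]+):(?P<index>\d+)"
)


def parse_placement(line):
    """Return (op, device_kind, device_index), or None if the line does not match."""
    match = PLACEMENT_RE.search(line)
    if match is None:
        return None
    return match["op"], match["kind"], int(match["index"])


line = (
    "Executing op _MklMatMul in device "
    "/job:localhost/replica:0/task:0/device:CPU:0"
)
print(parse_placement(line))
```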

### Cell #4: Increase the Processing Demand on the CPU

Now let's do the same matrix multiplication 100 times. We expect that this cell should take around 30 seconds to run. 

In [None]:
%%time
start_time = time.time()

with tf.device('/CPU:0'):
  # Place tensors on the CPU
  a = tf.constant(array_a)
  b = tf.constant(array_b)
    
  # Run the matrix multiplication 100 times on the CPU
  for i in range(100):
    c = tf.matmul(a, b)

end_time = time.time()
cpu_execution_time = end_time - start_time

### Cell #5: Display CPU Processing Time

Run the cell below to display how many seconds it took for Cell #4 to run.

In [None]:
print(f"Execution time on the CPU: {cpu_execution_time} seconds")
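
The start/stop timing pattern from Cell #4 can be wrapped in a small reusable helper. This is a minimal sketch: `time_repeated` and its `sync` parameter are hypothetical names, and the workload here is a plain-Python stand-in so the example runs without TensorFlow. The optional `sync` hook exists because some backends queue work asynchronously, in which case the clock should stop only after the result is actually ready.

```python
import time


def time_repeated(fn, repeats, sync=None):
    """Call fn() `repeats` times and return (last_result, elapsed_seconds).

    sync, if given, is called with the last result before the clock stops,
    for example to force an asynchronously computed value to finish.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn()
    if sync is not None:
        sync(result)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without TensorFlow.
result, seconds = time_repeated(lambda: sum(range(10_000)), repeats=100)
print(f"100 repeats took {seconds:.4f} seconds")
```

In this notebook, `fn` could be `lambda: tf.matmul(a, b)` and `sync` could be `lambda t: t.numpy()`, which copies the tensor back to host memory and therefore waits for the computation to complete.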

### Cell #6: Matrix multiplication on the GPU

Now we do the same calculation on the GPU. The device placement log should show that we are now operating on the GPU.

In [None]:
%%time
with tf.device('/GPU:0'):
  # Place tensors on the GPU
  a = tf.constant(array_a)
  b = tf.constant(array_b)
  c = tf.matmul(a, b)

print(c)

### Cell #7: Increase the Processing Demand on the GPU

Run the next cell to repeat the same matrix multiplication 100 times. This should take far less time than when we ran it on the CPU.

In [None]:
%%time
start_time = time.time()

with tf.device('/GPU:0'):
  # Place tensors on the GPU
  a = tf.constant(array_a)
  b = tf.constant(array_b)

  # Run the matrix multiplication 100 times on the GPU
  for i in range(100):
    c = tf.matmul(a, b)

  # GPU ops can run asynchronously; copy the result back to the host so the
  # timer stops only after the work has finished
  _ = c.numpy()

end_time = time.time()
gpu_execution_time = end_time - start_time

In [None]:
print(c)

### Cell #8: Display GPU Processing Time

Run the cell below to display how many seconds it took for Cell #7 to run.

In [None]:
print(f"Execution time on the GPU: {gpu_execution_time} seconds")

### Final time comparison

Following this tutorial, you should have found that the GPU completes the same matrix multiplications in *significantly* less time than the CPU. Perhaps you can see yourself speeding up your own data analysis by moving from CPU to GPU computing. Continue working your way through the other tutorial notebooks in this directory to learn the ins and outs of doing data analysis with GPU-accelerated computing!
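
To put a number on the comparison, divide the CPU time by the GPU time. The sketch below does this with placeholder timings: the 30-second CPU figure echoes the estimate given for Cell #4, while the GPU figure is invented for illustration. In the notebook you would use the measured `cpu_execution_time` and `gpu_execution_time` instead. The helper name `speedup` is hypothetical.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was than the CPU run."""
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    return cpu_seconds / gpu_seconds


# Placeholder timings for illustration only; substitute your measured values.
cpu_execution_time = 30.0
gpu_execution_time = 1.5

factor = speedup(cpu_execution_time, gpu_execution_time)
print(f"GPU speedup over CPU: {factor:.1f}x")
```

A fair comparison requires that both runs perform the same number of multiplications on the same arrays.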