<a href="https://colab.research.google.com/github/mmegankl/ece386-lab4/blob/main/book/b3-devboard/ice-gpu-acceleration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ICE 4: GPU Acceleration

Let's compare some tensor arithmetic on the CPU vs. the GPU!

We'll present the same code for both TensorFlow and PyTorch.

In [1]:
%pip install -q torch


In [2]:
%pip install -q tensorflow

First, import and check the version.

We will also make sure we can access the CUDA GPU.
This should always be your first step!

We'll also set a manual_seed for the random operations. While this isn't strictly necessary for this experiment, it's good practice as it aids with reproducibility.

In [3]:
import torch

print("PyTorch Version:", torch.__version__)

# Help with reproducibility of test
torch.manual_seed(2016)

if not torch.cuda.is_available():
    raise OSError("ERROR: No GPU found.")

PyTorch Version: 2.6.0+cu124


In [4]:
import tensorflow as tf

print("TensorFlow Version:", tf.__version__)

# Help with reproducibility of test
tf.random.set_seed(2016)

# Make sure we can access the GPU
print("Physical Devices Available:\n", tf.config.list_physical_devices())
if not tf.config.list_physical_devices("GPU"):
    raise OSError("ERROR: No GPU found.")

TensorFlow Version: 2.18.0
Physical Devices Available:
 [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Dot Product

Dot products are **extremely** common tensor operations. They are used in deep neural networks and linear algebra applications.

A dot product is essentially just a bunch of multiplications and additions.

- PyTorch provides the [`torch.tensordot()`](https://pytorch.org/docs/stable/generated/torch.tensordot.html) method.
- TensorFlow provides the [`tf.tensordot()`](https://www.tensorflow.org/api_docs/python/tf/tensordot) method.
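To make the "multiplications and additions" concrete, here is a minimal pure-Python sketch of what a full contraction of two equally shaped 2D tensors computes. This is the same quantity `torch.tensordot(a, b)` returns with its default `dims=2`, and `tf.tensordot(a, b, axes=2)` likewise. The helper name `naive_tensordot` is hypothetical and exists only for illustration; the real libraries use optimized kernels rather than Python loops.

```python
def naive_tensordot(a, b):
    """Sum of element-wise products of two equally shaped 2D lists."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += x * y  # one multiplication and one addition per element
    return total


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(naive_tensordot(a, b))  # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
```

An n x n input needs n² multiplications that do not depend on one another, which is why this kind of work parallelizes well on a GPU.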

First, let's define two methods to compute the dot product. One will take place on the CPU and the other on the GPU.

### CPU Timing

The CPU method is trivial!

In [5]:
# Compute the tensor dot product on CPU
def torch_cpu_dot_product(a, b):
    return torch.tensordot(a, b)

In [6]:
# Compute the tensor dot product on CPU
def tf_cpu_dot_product(a, b):
    with tf.device("/CPU:0"):
        product = tf.tensordot(a, b, axes=2)
    return product

### GPU Timing

For **PyTorch** the GPU method has a bit more to it. We must:

1. Send the tensors to the GPU for computation. We call `Tensor.to()` on the tensor to send it to a particular device.
2. Wait for the GPU to synchronize. According to the docs, GPU ops run asynchronously, so you need to call `torch.cuda.synchronize()` for precise timing.

For **TensorFlow** the `tf.device` makes it a bit simpler.

In [7]:
# Send the tensor to GPU then compute dot product
# synchronize() required for timing accuracy, see:
# https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
def torch_gpu_dot_product(a, b):
    a_gpu = a.to("cuda")
    b_gpu = b.to("cuda")
    product = torch.tensordot(a_gpu, b_gpu)
    torch.cuda.synchronize()
    return product

In [8]:
def tf_gpu_dot_product(a, b):
    with tf.device("/GPU:0"):
        product = tf.tensordot(a, b, axes=2)
    return product

### Running the benchmark

This section declares the start and stop tensor sizes for our test.
You can change `SIZE_LIMIT` and then run again; just know that at some point you will run out of memory!

Next, it runs tests at several sizes within this range, increasing the side length by 100 each time (the loop can be switched to doubling for larger limits).

We use [`timeit.timeit()`](https://docs.python.org/3/library/timeit.html#timeit.timeit) for the tests. It calls the function `number` times and returns the total elapsed time in seconds, so the times reported below are totals over all runs rather than per-call averages. `timeit` is also more accurate than manually calling Python's time function and doing subtraction.
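As a minimal sketch of how `timeit.timeit()` reports its measurement: the return value is the total time in seconds for all `number` calls, so a per-call figure comes from dividing by `number`. The function `busy_work` and the constant `RUNS` are hypothetical stand-ins, not part of this notebook's benchmark, and the callable is passed directly instead of as a string with `globals=globals()`.

```python
from timeit import timeit

RUNS = 50  # hypothetical repeat count, same value as number=50 in the PyTorch cell


def busy_work():
    # Stand-in workload: sum the first 1000 integers
    return sum(range(1000))


# timeit() returns the TOTAL time for all RUNS calls, in seconds
total_seconds = timeit(busy_work, number=RUNS)
per_call_seconds = total_seconds / RUNS

print(f"total for {RUNS} calls: {total_seconds:.6f} s")
print(f"average per call:    {per_call_seconds:.6f} s")
```

Because both the CPU and GPU measurements use the same `number`, their ratio is the same whether totals or per-call averages are compared.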

Finally, results are saved into a list that's then exported to a pandas DataFrame for easy viewing.

In [11]:
import pandas as pd
from timeit import timeit

SIZE_LIMIT: int = 5000  # where to stop at

In [12]:
# This cell is PyTorch
tensor_size = 10  # start at size 10
torch_results = []

print("Running PyTorch with 2D tensors from", tensor_size, "to", SIZE_LIMIT, "square")

# Run the test
while tensor_size < SIZE_LIMIT:
    # Random array
    a = torch.rand(tensor_size, tensor_size, device="cpu")
    b = torch.rand(tensor_size, tensor_size, device="cpu")

    # Time the CPU operation
    cpu_time = timeit("torch_cpu_dot_product(a, b)", globals=globals(), number=50)

    # Time the GPU operation
    # First, we send the data to the GPU, called the warm up
    # It really depends on the application if this time is important or negligible
    # We are doing it here because timeit() runs the function multiple times anyway
    torch_gpu_dot_product(a, b)
    # Now we time the actual operation
    gpu_time = timeit("torch_gpu_dot_product(a, b)", globals=globals(), number=50)

    # Record the results
    torch_results.append(
        {
            "tensor_size": tensor_size * tensor_size,
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )

    # Increase tensor_size by 100. For larger SIZE_LIMITS, change to double tensor_size
    # tensor_size = tensor_size * 2
    tensor_size = tensor_size + 100

# Done! Cast the results to a DataFrame and print
torch_results_df = pd.DataFrame(torch_results)
print("PyTorch Results:")
print(torch_results_df)

Running PyTorch with 2D tensors from 10 to 5000 square
PyTorch Results:
    tensor_size  cpu_time  gpu_time  gpu_speedup
0           100  0.001779  0.005502     0.323283
1         12100  0.002094  0.007183     0.291463
2         44100  0.005668  0.011524     0.491821
3         96100  0.011940  0.018499     0.645414
4        168100  0.020133  0.026561     0.757982
5        260100  0.030350  0.043117     0.703912
6        372100  0.045183  0.043348     1.042317
7        504100  0.057984  0.057040     1.016549
8        656100  0.079135  0.065701     1.204465
9        828100  0.097341  0.085292     1.141266
10      1020100  0.130673  0.097749     1.336820
11      1232100  0.135504  0.116508     1.163041
12      1464100  0.163814  0.129445     1.265516
13      1716100  0.193714  0.151668     1.277226
14      1988100  0.221213  0.175932     1.257379
15      2280100  0.273726  0.200691     1.363917
16      2592100  0.291131  0.228494     1.274129
17      2924100  0.344517  0.252172     1.3662

In [13]:
# This cell is TensorFlow
tensor_size = 10  # start at size 10
tf_results = []

print(
    "Running TensorFlow with 2D tensors from", tensor_size, "to", SIZE_LIMIT, "square"
)

# Run the test
while tensor_size <= SIZE_LIMIT:
    # Random tensor_size x tensor_size array
    with tf.device("/CPU:0"):
        a = tf.random.uniform((tensor_size, tensor_size))
        b = tf.random.uniform((tensor_size, tensor_size))

    # Time the CPU operation
    cpu_time = timeit("tf_cpu_dot_product(a, b)", globals=globals(), number=10)

    # Time the GPU operation
    # First, we send the data to the GPU, called the warm up
    # It really depends on the application if this time is important or negligible
    # We are doing it here because timeit() runs the function multiple times anyway
    tf_gpu_dot_product(a, b)
    # Now we time the actual operation
    gpu_time = timeit("tf_gpu_dot_product(a, b)", globals=globals(), number=10)

    # Record the results
    tf_results.append(
        {
            "tensor_size": tensor_size * tensor_size,
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )

    # Increase tensor_size by 100. For larger SIZE_LIMITS, change to double tensor_size
    # tensor_size = tensor_size * 2
    tensor_size = tensor_size + 100

# Done! Cast the results to a DataFrame and print
tf_results_df = pd.DataFrame(tf_results)
print("TensorFlow Results:")
print(tf_results_df)

Running TensorFlow with 2D tensors from 10 to 5000 square
TensorFlow Results:
    tensor_size  cpu_time  gpu_time  gpu_speedup
0           100  0.021958  0.010560     2.079401
1         12100  0.010939  0.012063     0.906840
2         44100  0.008504  0.009926     0.856779
3         96100  0.009171  0.009697     0.945702
4        168100  0.010602  0.009754     1.086999
5        260100  0.012395  0.012611     0.982846
6        372100  0.012976  0.009942     1.305107
7        504100  0.014673  0.010148     1.445844
8        656100  0.016733  0.009927     1.685570
9        828100  0.020311  0.009953     2.040627
10      1020100  0.022921  0.009927     2.308924
11      1232100  0.030088  0.010424     2.886416
12      1464100  0.027466  0.011424     2.404177
13      1716100  0.030935  0.010406     2.972841
14      1988100  0.034565  0.009819     3.520101
15      2280100  0.038964  0.013049     2.986040
16      2592100  0.042360  0.010540     4.019032
17      2924100  0.049108  0.011033     

### Dot Product Results

If you left the default sizes, you should see 50 rows of results per framework (side lengths 10 through 4910 in steps of 100).
You'll notice that with small tensors the CPU is *faster* than the GPU!
This is also indicated by the **gpu_speedup** being less than 1.

But as the tensor sizes grow, the GPU overtakes the CPU for speed! 🏎️
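One way to locate the crossover point is to scan the results for the first row whose `gpu_speedup` exceeds 1. The sketch below uses plain dictionaries shaped like the entries appended to `torch_results`; the helper name `first_gpu_win` is hypothetical, and the three sample rows are copied from the PyTorch output above.

```python
def first_gpu_win(results):
    """Return the first result row where the GPU beat the CPU, or None."""
    for row in results:
        if row["gpu_speedup"] > 1.0:
            return row
    return None


# Rows 4-6 of the PyTorch results printed earlier
sample_rows = [
    {"tensor_size": 168100, "cpu_time": 0.020133, "gpu_time": 0.026561, "gpu_speedup": 0.757982},
    {"tensor_size": 260100, "cpu_time": 0.030350, "gpu_time": 0.043117, "gpu_speedup": 0.703912},
    {"tensor_size": 372100, "cpu_time": 0.045183, "gpu_time": 0.043348, "gpu_speedup": 1.042317},
]

crossover = first_gpu_win(sample_rows)
print("GPU first wins at tensor_size =", crossover["tensor_size"])
```

Timings are noisy, so the speedup can dip back below 1 after the first win; treat the crossover as an approximate range, not an exact size.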

## Another Tensor Operation

Your task is to repeat this benchmark below, but finding the minimum element in a **1D tensor**.
You only need to do it with **one** framework.

Use either

- [`torch.min()`](https://pytorch.org/docs/stable/generated/torch.min.html) *or*
- [`tf.math.reduce_min()`](https://www.tensorflow.org/api_docs/python/tf/math/reduce_min)
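As a rough sketch of how a PyTorch pair of methods for this task might look, mirroring the dot-product functions earlier in this notebook: one function runs `torch.min()` on the CPU, and the other moves the tensor to the GPU and synchronizes before returning. The name `torch_gpu_min` and the tensor length are assumptions for illustration, and the GPU path only runs when CUDA is available.

```python
import torch


def torch_cpu_min(a: torch.Tensor) -> torch.Tensor:
    # Minimum element, computed on whatever device `a` lives on (CPU here)
    return torch.min(a)


def torch_gpu_min(a: torch.Tensor) -> torch.Tensor:
    # Send the tensor to the GPU, then find the minimum there
    a_gpu = a.to("cuda")
    result = torch.min(a_gpu)
    # GPU ops are asynchronous; wait for completion so timing is accurate
    torch.cuda.synchronize()
    return result


# A 1D tensor: a single dimension with 1000 random elements (assumed length)
values = torch.rand(1_000, device="cpu")
cpu_min = torch_cpu_min(values)
print("CPU min:", cpu_min.item())

if torch.cuda.is_available():
    gpu_min = torch_gpu_min(values)
    # Both devices should agree on the minimum value
    assert torch.equal(cpu_min, gpu_min.cpu())
    print("GPU min:", gpu_min.item())
```

In a benchmark loop, the tensor would be recreated at each size (for example `torch.rand(n)`), and the two functions would be timed separately so that the `cpu_time` and `gpu_time` columns measure different devices.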

In [22]:
# Define your methods here - using PyTorch
def torch_cpu_min(a):
    return torch.min(a)

In [30]:
# Conduct your benchmark here

tensor_size = 10  # start at size 10
torch_results = []
a = torch.randn(tensor_size, tensor_size, device="cpu")

print("Running PyTorch with 1D tensors from", tensor_size, "to", SIZE_LIMIT, "square")

# Run the test
while tensor_size < SIZE_LIMIT:
    # # Random array
    # a = torch.rand(tensor_size, tensor_size, device="cpu")
    # b = torch.rand(tensor_size, tensor_size, device="cpu")

    # Time the CPU operation
    cpu_time = timeit("torch_cpu_min(a)", globals=globals(), number=50)

    # Time the GPU operation
    # First, we send the data to the GPU, called the warm up
    # It really depends on the application if this time is important or negligible
    # We are doing it here because timeit() runs the function multiple times anyway
    result = torch_cpu_min(a)
    # Now we time the actual operation
    gpu_time = timeit("torch_cpu_min(a)", globals=globals(), number=50)

    # Record the results
    torch_results.append(
        {
            "tensor_size": tensor_size * tensor_size,
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )

    # Increase tensor_size by 100. For larger SIZE_LIMITS, change to double tensor_size
    # tensor_size = tensor_size * 2
    tensor_size = tensor_size + 100

# Done! Cast the results to a DataFrame and print
torch_results_df = pd.DataFrame(torch_results)
print("PyTorch Results:")
print(torch_results_df)

Running PyTorch with 1D tensors from 10 to 5000 square
PyTorch Results:
    tensor_size  cpu_time  gpu_time  gpu_speedup
0           100  0.000670  0.000194     3.462049
1         12100  0.000200  0.000186     1.072602
2         44100  0.000188  0.000567     0.331661
3         96100  0.000185  0.000175     1.058377
4        168100  0.000115  0.000281     0.410569
5        260100  0.000130  0.000108     1.198625
6        372100  0.000120  0.000137     0.877037
7        504100  0.000143  0.000109     1.313350
8        656100  0.000108  0.000116     0.932640
9        828100  0.000110  0.000136     0.810443
10      1020100  0.000108  0.000109     0.994683
11      1232100  0.000115  0.000108     1.058902
12      1464100  0.000133  0.000110     1.210195
13      1716100  0.000108  0.000107     1.006097
14      1988100  0.000109  0.000108     1.009855
15      2280100  0.000115  0.000108     1.063290
16      2592100  0.000108  0.000114     0.943473
17      2924100  0.000110  0.000120     0.9204

## Deliverable

> **After** answering these questions, download this completed notebook and **upload to Gradescope.**

### Reflection 📈

### *Why* does the CPU outperform the GPU dot product with smaller vectors?
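For small tensors, the fixed overhead of using the GPU (copying the data to the device and launching the kernel) outweighs its parallel throughput, while the CPU can operate right away on data that fits in its cache.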

### *How* did the CPU vs. GPU perform for `min()`?
Initially, with smaller tensor sizes, the GPU outperformed the CPU by executing more quickly. However, with larger tensor sizes, the CPU became comparable in performance (and in some cases performed better).

#### *Why* did it perform that way?
I think this is because, as mentioned earlier, the CPU makes good use of its cache, so it can fetch recently used data quickly. The GPU was probably executing each computation individually, whereas the CPU had a more efficient way to do the computation.