<a href="https://colab.research.google.com/github/josesan77/MLprojects/blob/master/CUDA/ConfigAndRunCuda_on_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Prerequisites

First of all check/define GPU environment for the Notebook.
1. In this Notebook's *Runtime* menu / *Change runtime type* submenu check GPU availability. For free accounts it should be T4 GPU (in Jan 2025), and be aware that GPU usage is limited, but enough for this calculation. Paying subscription can reach stronger hardware without time limitation, but the T4 GPU is enough now.
2. Select T4 GPU
3. Connect GPU at the top-right corner, below Settings icon) before starting session (processing codes below).

[More on GPUs](https://cloud.google.com/compute/docs/gpus):

V2-8 TPU is not needed!
[V2 Pod slice with 8 TensorCores](https://cloud.google.com/tpu/docs/v2)

## Remove - if required - existing CUDA installation and NVIDIA drivers

First of all, verify what NVCC version is installed on your Colab(oratory) environment.

In [None]:
!nvcc --version

This is what you should get if `nvcc` is installed:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
```

Check release and version at the bottom! If it indicates release 9.2, V9.2.88 or higher (more recent) version then skipp section Remove/reinstall!






##  Remove/reinstall
Perform the next steps only in case running the first cell code failed!
### 1. Remove CUDA and NVIDIA modules

In [None]:
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

### 2. Install specific CUDA

In [None]:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

### 3. Installation check

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


## Install a jupyter extension
Skipping or performing the previous section, first run a test whether module is already installed. In general it is not, in that case install plugin using the code in the next cell.
When running
```
%load_ext nvcc_plugin
```
the code should return:
```
created output directory at /content/src
Out bin /content/result.out
```



In [2]:
#instead of %load_ext nvcc_plugin #an earlier version
%load_ext nvcc4jupyter # worked in Jan 2025, see later successful load

ModuleNotFoundError: No module named 'nvcc_plugin'

If above code fails, let's instal the required module. In general Colaboratory can reach appropriate pip modules to instal, so try this first:

In [30]:
!pip install nvcc4jupyter

Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl.metadata (5.1 kB)
Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1


Successful installation returns this:
```
Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl.metadata (5.1 kB)
Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1
```
If he above defined installation method fails, try this:

In [16]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968

In case both above pip instals fail, then clone the appropriate git library, enter the folder and install from there

In [None]:
!git clone https://github.com/andreinechaev/nvcc4jupyter.git

Cloning into 'nvcc4jupyter'...
remote: Enumerating objects: 529, done.[K
remote: Counting objects: 100% (408/408), done.[K
remote: Compressing objects: 100% (207/207), done.[K
remote: Total 529 (delta 209), reused 315 (delta 179), pack-reused 121 (from 1)[K
Receiving objects: 100% (529/529), 120.01 KiB | 6.32 MiB/s, done.
Resolving deltas: 100% (247/247), done.


In [None]:
cd nvcc4jupyter/
#ls #list files in folder, if needed

/content/nvcc4jupyter


Install the module. Find the solution!

In [None]:
# verify state of the cloned git, it is fine if "nothing to commit, working tree clean"
!git status

On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean


Load the plugin

In [31]:
#instead of %load_ext nvcc_plugin
%load_ext nvcc4jupyter

Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmpkvda_pct".


Successful load results in (something similar):
```
Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmp__erspob".
```

# Run CUDA

### Run a print example
Print the text "This is from CUDA" with a new line at the end.

In [5]:
# nvcc_plugin worked with magic(lin): %%cu  use
%%cuda
#include <iostream>

int main() {
    std::cout << "This is from CUDA\n";
    return 0;
}

This is from CUDA



UsageError: Cell magic `%%cu` not found.

### Involved example
C++ code is designed to find the maximum value in an array using CUDA for parallel computation.

In [8]:
%%cuda
#include <cstdio>
#include <iostream>

using namespace std;

__global__ void maxi(int* a, int* b, int n)
{
	int block = 256 * blockIdx.x;
	int max = 0;

	for (int i = block; i < min(256 + block, n); i++) {

		if (max < a[i]) {
			max = a[i];
		}
	}
	b[blockIdx.x] = max;
}

int main()
{

	int n;
	n = 3 >> 2;
	int a[n];

	for (int i = 0; i < n; i++) {
		a[i] = rand() % n;
		cout << a[i] << "\t";
	}

	cudaEvent_t start, end;
	int *ad, *bd;
	int size = n * sizeof(int);
	cudaMalloc(&ad, size);
	cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
	int grids = ceil(n * 1.0f / 256.0f);
	cudaMalloc(&bd, grids * sizeof(int));

	dim3 grid(grids, 1);
	dim3 block(1, 1);

	cudaEventCreate(&start);
	cudaEventCreate(&end);
	cudaEventRecord(start);

	while (n > 1) {
		maxi<<<grids, block>>>(ad, bd, n);
		n = ceil(n * 1.0f / 256.0f);
		cudaMemcpy(ad, bd, n * sizeof(int), cudaMemcpyDeviceToDevice);
	}

	cudaEventRecord(end);
	cudaEventSynchronize(end);

	float time = 0;
	cudaEventElapsedTime(&time, start, end);

	int ans[2];
	cudaMemcpy(ans, ad, 4, cudaMemcpyDeviceToHost);

	cout << "The maximum element is : " << ans[0] << endl;

	cout << "The time required : ";
	cout << time << endl;
}

The maximum element is : 0
The time required : 0.002976



### Try with Python
Since Python doesn't have built-in support for CUDA like C++, the Python equivalent will use libraries like NumPy for array operations and CuPy for CUDA-related tasks.

In [9]:
import numpy as np
import time # checking runtime

try simple python without CUDA

In [None]:
def maxi_kernel_python(a, b, n):
    """
    Python implementation of the CUDA kernel for finding the maximum in each block.
    """
    threads_per_block = 256
    grid_size = len(b)
    for block_idx in range(grid_size):
        block_start = threads_per_block * block_idx
        block_end = min(block_start + threads_per_block, n)
        max_val = 0
        for i in range(block_start, block_end):
            max_val = max(max_val, a[i])
        b[block_idx] = max_val

def main_python():
    # Initialize data
    start = time.time()
    n = 3 >> 2
    a = np.random.randint(0, n, n, dtype=np.int32)
    print("Array:", a)

    # Allocate device memory
    ad = cp.array(a)
    grids = int(cp.ceil(n / 256))
    bd = cp.zeros(grids, dtype=cp.int32)

    while n > 1:
        # Create host copies for kernel emulation
        a_host = ad.get()
        b_host = np.zeros_like(bd.get())

        # Call the kernel function (Python version)
        maxi_kernel_python(a_host, b_host, n)

        # Transfer results back to GPU
        bd = cp.array(b_host)
        ad = bd
        n = len(bd)

    # Get results
    ans = ad[0].item() # ad.get()[0]
    end = time.time()
    elapsed_time = (end- start)

    print("The maximum element is:", ans)
    print("The time required:", elapsed_time, "ms")

if __name__ == "__main__":
    main()


Array: []
The maximum element is: []
The time required: 0.0025017261505126953 ms


In [11]:
import cupy as cp
# CUDA kernel to find the maximum in each block
maxi_kernel = cp.RawKernel(r'''
extern "C" __global__
void maxi(const int* a, int* b, int n) {
    int block = 256 * blockIdx.x;
    int max = 0;

    for (int i = block; i < min(256 + block, n); i++) {
        if (max < a[i]) {
            max = a[i];
        }
    }
    b[blockIdx.x] = max;
}
''', 'maxi')

def main():
    # Initialize data
    n = 3 >> 2
    a = np.random.randint(0, n, n, dtype=np.int32)
    print("Array:", a)

    # Allocate device memory
    ad = cp.array(a)
    grids = int(cp.ceil(n / 256))
    bd = cp.zeros(grids, dtype=cp.int32)

    # Measure execution time
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()

    while n > 1:
        maxi_kernel((grids,), (1,), (ad, bd, n))
        n = int(cp.ceil(n / 256))
        ad = cp.array(bd[:n])  # Copy results back to ad for the next iteration

    end.record()
    end.synchronize()

    # Get results
    ans = ad
    time = cp.cuda.get_elapsed_time(start, end)

    print("The maximum element is:", ans)
    print("The time required:", time, "ms")

if __name__ == "__main__":
    main()

Array: []
The maximum element is: []
The time required: 0.003776000114157796 ms


### Example 1

In [13]:
%%cuda

#include <stdio.h>

// This is a special function that runs on the GPU (device) instead of the CPU (host)
__global__ void kernel() {
  printf("Hello world!\n");
}

int main() {
  // Invoke the kernel function on the GPU with one block of one thread
  kernel<<<1,1>>>();

  // Check for error codes (remember to do this for _every_ CUDA function)
  if(cudaDeviceSynchronize() != cudaSuccess) {
    fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
  }
  return 0;
}

Hello world!



"Hello world!" text returned.

### Example 2

In [14]:
%%cuda

#include <stdio.h>

// This kernel runs on the GPU and prints the thread's identifiers
__global__ void kernel() {
  printf("Hello from block %d thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
  // Launch the kernel on the GPU with four blocks of six threads each
  kernel<<<4,6>>>();

  // Check for CUDA errors
  if(cudaDeviceSynchronize() != cudaSuccess) {
    fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
  }
  return 0;
}

Hello from block 1 thread 0
Hello from block 1 thread 1
Hello from block 1 thread 2
Hello from block 1 thread 3
Hello from block 1 thread 4
Hello from block 1 thread 5
Hello from block 0 thread 0
Hello from block 0 thread 1
Hello from block 0 thread 2
Hello from block 0 thread 3
Hello from block 0 thread 4
Hello from block 0 thread 5
Hello from block 2 thread 0
Hello from block 2 thread 1
Hello from block 2 thread 2
Hello from block 2 thread 3
Hello from block 2 thread 4
Hello from block 2 thread 5
Hello from block 3 thread 0
Hello from block 3 thread 1
Hello from block 3 thread 2
Hello from block 3 thread 3
Hello from block 3 thread 4
Hello from block 3 thread 5



A block of
```
Hello from block 1 thread 0
Hello from block 1 thread 1
Hello from block 1 thread 2
Hello from block 1 thread 3
Hello from block 1 thread 4
Hello from block 1 thread 5
Hello from block 0 thread 0
Hello from block 0 thread 1
Hello from block 0 thread 2
...
```
is returned.

### Example 3
The following (CUDA) code implements a simplified version of the SAXPY operation (y = a * x + y) on a GPU. Here is a detailed breakdown:

In [15]:
%%cuda

#include <stdint.h>
#include <stdio.h>

#define N 32
#define THREADS_PER_BLOCK 32

// This kernel runs on the GPU and performs the SAXPY operation (kernel)
__global__ void saxpy(float a, float* x, float* y) {
  // Which index of the array should this thread use?
  size_t index = 20; //hardcoded thread index for simplicity

  // Compute a times x plus y for a specific index
  y[index] = a * x[index] + y[index];
}

int main() {
  // Allocate arrays for X and Y on the CPU. This memory is only usable on the CPU
  float* cpu_x = (float*)malloc(sizeof(float) * N);
  float* cpu_y = (float*)malloc(sizeof(float) * N);

  // Initialize X and Y
  int i;
  for(i=0; i<N; i++) {
    cpu_x[i] = (float)i;
    cpu_y[i] = 0.0;
  }

  // The gpu_x and gpu_y pointers will only be usable on the GPU (which uses separate memory)
  float* gpu_x;
  float* gpu_y;

  // Allocate space for the x array on the GPU
  if(cudaMalloc(&gpu_x, sizeof(float) * N) != cudaSuccess) {
    fprintf(stderr, "Failed to allocate X array on GPU\n");
    exit(2);
  }

  // Allocate space for the y array on the GPU
  if(cudaMalloc(&gpu_y, sizeof(float) * N) != cudaSuccess) {
    fprintf(stderr, "Failed to allocate Y array on GPU\n");
    exit(2);
  }

  // Copy the cpu's x array to the gpu with cudaMemcpy, Data transfer from CPU to GPU
  if(cudaMemcpy(gpu_x, cpu_x, sizeof(float) * N, cudaMemcpyHostToDevice) != cudaSuccess) {
    fprintf(stderr, "Failed to copy X to the GPU\n");
  }

  // Copy the cpu's y array to the gpu with cudaMemcpy, Data transfer from CPU to GPU
  if(cudaMemcpy(gpu_y, cpu_y, sizeof(float) * N, cudaMemcpyHostToDevice) != cudaSuccess) {
    fprintf(stderr, "Failed to copy Y to the GPU\n");
  }

  // Calculate the number of blocks to run, rounding up to include all threads
  size_t blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

  // Run the saxpy kernel
  saxpy<<<blocks, THREADS_PER_BLOCK>>>(0.5, gpu_x, gpu_y);

  // Wait for the kernel to finish
  if(cudaDeviceSynchronize() != cudaSuccess) {
    fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
  }

  // Copy the y array back from the gpu to the cpu
  if(cudaMemcpy(cpu_y, gpu_y, sizeof(float) * N, cudaMemcpyDeviceToHost) != cudaSuccess) {
    fprintf(stderr, "Failed to copy Y from the GPU\n");
  }

  // Print the updated y array (output)
  for(i=0; i<N; i++) {
    printf("%d: %f\n", i, cpu_y[i]);
  }

  //Cleanup
  cudaFree(gpu_x);
  cudaFree(gpu_y);
  free(cpu_x);
  free(cpu_y);

  return 0;
}


0: 0.000000
1: 0.000000
2: 0.000000
3: 0.000000
4: 0.000000
5: 0.000000
6: 0.000000
7: 0.000000
8: 0.000000
9: 0.000000
10: 0.000000
11: 0.000000
12: 0.000000
13: 0.000000
14: 0.000000
15: 0.000000
16: 0.000000
17: 0.000000
18: 0.000000
19: 0.000000
20: 10.000000
21: 0.000000
22: 0.000000
23: 0.000000
24: 0.000000
25: 0.000000
26: 0.000000
27: 0.000000
28: 0.000000
29: 0.000000
30: 0.000000
31: 0.000000



#### Key Observations (Example 3)
1. Index Calculation Issue:

The kernel only updates y[20] because the index is hardcoded, ignoring the thread ID (threadIdx.x) and block ID (blockIdx.x).
This defeats the purpose of parallel computation.

Reach parallel computation: Replace the hardcoded ```size_t index = 20``` with a calculation using thread and block indices:

```
size_t index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < N) {
    y[index] = a * x[index] + y[index];
}
```

2. SAXPY Limitation:

SAXPY is typically applied to all elements of the arrays in parallel. Here, the computation is limited to a single index.
3. CUDA Memory Management:

Proper allocation and transfer of memory between CPU and GPU are demonstrated.

# Compare Python and CUDA
Let's look the runtime difference of a Python and a C++ code optimized for CUDA (GPU run).

## Example 4: n-th number in the Fibonacci series
Properly declared the below methodss would return only the n-th element of the Fibonacci series, just added print() for each step to demonstrate the functionality. (Yes, who knows, ... knows that printing takes "lot of milliseconds" :) )
Let's start with CUDA / GPU.

In [28]:
%%cuda
#include <stdio.h>

__global__ void fibonacci_kernel( int* fib, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx == 0) {
        fib[0] = 0;  // F(0)
        fib[1] = 1;  // F(1)
    }

    __syncthreads();  // Ensure all threads initialize the first few Fibonacci values

    if (idx == 0 && n > 2) {
        for (int i = 2; i < n+1; i++) {
            // Shift the registers and calculate the new Fibonacci value
            int temp = fib[0] + fib[1];  // Calculate new Fibonacci number
            fib[0] = fib[1];  // Shift fib[0]
            fib[1] = temp;
            printf("The %dth Fibonacci number is: %d\n", i, fib[1]);
        }
    }

    __syncthreads();  // Synchronize threads before final output

    printf("The %dth Fibonacci number is: %d\n", n, fib[1]);

}

int main() {
    const int n = 50;  // Fibonacci sequence position (adjustable up to 50!)

    int* host_fib = new int[3];  // We only need to store 3 values at any time
    int* device_fib;

    cudaMalloc((void**)&device_fib, 3 * sizeof(int));

    // Initialize the first three Fibonacci numbers
    cudaMemcpy(device_fib, host_fib, 3 * sizeof(int), cudaMemcpyHostToDevice);

    int threads_per_block = 1;
    int blocks = 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);

    fibonacci_kernel<<<blocks, threads_per_block>>>(device_fib, n);

    cudaEventRecord(stop);
    cudaDeviceSynchronize();  // Ensure all threads are finished before continuing

    float elapsed_time = 0.0;
    cudaEventElapsedTime(&elapsed_time, start, stop);

    // Print the elapsed time
    printf("\n Elapsed time: %.4f ms\n", elapsed_time);

    cudaFree(device_fib);
    delete[] host_fib;

    return 0;
}

UsageError: Cell magic `%%cuda` not found.


### Note: Writing to file & compiling
If you want to write the code to a file for compiling, then use `%%writefile [filename]` and the C++ code

In [32]:
%%writefile fibonacci.cu
#include <stdio.h>

__global__ void fibonacci_kernel(int* fib, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx == 0) fib[idx] = 0;  // First Fibonacci number
    if (idx == 1) fib[idx] = 1;  // Second Fibonacci number

    __syncthreads();

    if (idx >= 2 && idx < n) {
        fib[idx] = fib[idx - 1] + fib[idx - 2];
        printf("Thread %d: %d\n", idx, fib[idx]);
    }
}

int main() {
    const int n = 10;
    int* host_fib = new int[n];
    int* device_fib;

    host_fib[0] = 0;
    host_fib[1] = 1;

    cudaMalloc((void**)&device_fib, n * sizeof(int));
    cudaMemcpy(device_fib, host_fib, n * sizeof(int), cudaMemcpyHostToDevice);

    int threads_per_block = 32;
    int blocks = (n + threads_per_block - 1) / threads_per_block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fibonacci_kernel<<<blocks, threads_per_block>>>(device_fib, n);
    cudaEventRecord(stop);

    cudaDeviceSynchronize();

    float elapsed_time = 0.0;
    cudaEventElapsedTime(&elapsed_time, start, stop);

    cudaMemcpy(host_fib, device_fib, 2 * sizeof(int), cudaMemcpyDeviceToHost);

    printf("The 100th Fibonacci number is: %d\n", host_fib[n - 1]);
    printf("Elapsed time: %.4f ms\n", elapsed_time);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device_fib);
    delete[] host_fib;

    return 0;
}

Overwriting fibonacci.cu


In [21]:
#check if the code is written to file fibonacci.cu
%%sh
ls

fibonacci.cu
sample_data


Check [this](https://colab.research.google.com/drive/1GJOfTp56OeQRdE4u2_S7pUNRcJb4ik9X?usp=sharing#scrollTo=nW4NEuImTzYu) Colab tutorial for compiling.

In [22]:
%%cuda
nvcc -o fibonacci fibonacci.cu

UsageError: Cell magic `%%cuda` not found.


In [33]:
!nvcc -I fibonacci -lcublas -lcusolver -Wno-deprecated-gpu-targets fibonacci.cu

In [26]:
!ls

a.out  fibonacci.cu  sample_data


In [34]:
!./a.out

The 100th Fibonacci number is: 0
Elapsed time: 0.0000 ms


Now, calculate using a Python (recoursive, thus time optimized) calculation code.

In [15]:
#Fibonacci series nth element calculator (python code)
import time

def nthFibonacci( n : int) -> int:
    #recoursive calculation
    mod_base = 10**9 +7 #for easier printing at high numbers
    if n < 3:
        return 1
    arr1, arr2 = 0, 1
    for _ in range(2, n+1, 1):
        new_item = (arr1 + arr2) % mod_base
        arr1, arr2 = arr2, new_item
        print('Serie element #' + str(_) + ' : ' + str(new_item))
    return arr2

start_time = time.time()
n = 50
print('\nSerie element #' + str(n) + ' : ' + str(nthFibonacci(n)))
end_time = time.time()
print("Elapsed time: ", (end_time - start_time)*1000) # 0.056 ms / cycle

Serie element #2 : 1
Serie element #3 : 2
Serie element #4 : 3
Serie element #5 : 5
Serie element #6 : 8
Serie element #7 : 13
Serie element #8 : 21
Serie element #9 : 34
Serie element #10 : 55
Serie element #11 : 89
Serie element #12 : 144
Serie element #13 : 233
Serie element #14 : 377
Serie element #15 : 610
Serie element #16 : 987
Serie element #17 : 1597
Serie element #18 : 2584
Serie element #19 : 4181
Serie element #20 : 6765
Serie element #21 : 10946
Serie element #22 : 17711
Serie element #23 : 28657
Serie element #24 : 46368
Serie element #25 : 75025
Serie element #26 : 121393
Serie element #27 : 196418
Serie element #28 : 317811
Serie element #29 : 514229
Serie element #30 : 832040
Serie element #31 : 1346269
Serie element #32 : 2178309
Serie element #33 : 3524578
Serie element #34 : 5702887
Serie element #35 : 9227465
Serie element #36 : 14930352
Serie element #37 : 24157817
Serie element #38 : 39088169
Serie element #39 : 63245986
Serie element #40 : 102334155
Serie elemen

In [14]:
print("Elapsed time: ", str(round(0.0156*n*1000,4)), " ms") # 0.056 ms / cycle

Elapsed time:  780.0  ms


Without printing each step's result it may be faster, but not close to the CUDA run time.

## Exampe 5: Fractals with visualization
https://colab.research.google.com/github/noahgift/cloud-data-analysis-at-scale/blob/master/GPU_Programming.ipynb

# Cleaning
If installed and not needed remove the git clone:

In [None]:
rmdir /s /q "nvcc4jupyter"