<a href="https://colab.research.google.com/github/knoel99/learn_cuda/blob/master/01_easier_intro_to_cuda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An even easier introduction to CUDA

Source: https://developer.nvidia.com/blog/even-easier-introduction-cuda/

I added many noob notes about C++ along the way.

# Requirements
- Learn how to run C++ code in colab
- Select a colab runtime with a GPU

In [17]:
# Test C++ code
%%writefile hello.cpp
#include <iostream>
using namespace std;

int main() {
  cout << "Hello from Colab!" << endl;
  return 0;
}

Overwriting hello.cpp


In [18]:
# Compile with g++
!g++ hello.cpp -o hello
!./hello

Hello from Colab!


# Noob notes:
- `<iostream>` is the standard header needed to print results in the terminal
- writing `using namespace std;` lets you write `cout` directly instead of `std::cout` (`cout` is a stream object, not a function)
- `cout` means "character output" (often read as "console output")

# Addition of two arrays with standard C++ code, on CPU

In this tutorial, the function under study simply adds two arrays of about 1 million elements each.

In [19]:
%%writefile addition.cpp
#include <iostream>
#include <math.h>

// Add two arrays
void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void) {
  int N = 1<<20; // 1 M elements

  float *x = new float[N];
  float *y = new float[N];

  // Init the two arrays with a for loop.
  // tutorial says : init arrays on the host => host is the CPU
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all elements should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N ; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete[] x;
  delete[] y;

  return 0;

}

Overwriting addition.cpp


In [20]:
# Compile and run
!g++ addition.cpp -o addition
!./addition

Max error: 0


No error in the addition, as expected.

# Noob note

Meaning of `int N = 1<<20;`:

- `1<<20` means 2^20: the double chevron `<<` shifts the bits of `1` to the left by 20 positions. Each of the two arrays has 1 048 576 elements.
- Each element of the array is a float, stored on 4 bytes.
- Each array therefore takes 4 * 2^20 bytes = 4 194 304 bytes, about 4 MB in memory.


Some examples:
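- `1 << 10` = 1 024, about one thousand (1 kB if we count bytes)
- `1 << 20` = 1 048 576, about one million (1 MB if we count bytes)
- `1 << 30` = 1 073 741 824, about one billion (1 GB if we count bytes)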

Why does the function take pointers as arguments instead of the arrays themselves, as in Python?


In Python we have:
```python
def add(a, b):
    for i in range(len(a)):
        b[i] = a[i] + b[i]

x = [1.0] * 2**20
y = [2.0] * 2**20
add(x, y)
```

In C++:
```cpp
void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
int N = 1<<20;
float *x = new float[N];
float *y = new float[N];

for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}

add(N, x, y);
```

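Both versions do the same thing: `add` updates `y` in place, and the caller sees the result. In C++, `new float[N]` returns a pointer to the first element of the array, so a pointer is the only handle we have on it. That pointer carries no length, which is why `add` also takes the size `n`. In Python the function also receives a reference to the same list, not a copy, but the list knows its own length.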



# Running the addition on the GPU

Now I want to run the addition function on the GPU, using its cores.

We have to turn this C++ function into a kernel, i.e. a function that can run on the GPU.

To do this, we just need to add the specifier `__global__` to the function. It tells the CUDA C++ compiler that the function runs on the GPU and can be called from CPU code.

Definitions:
- CUDA kernel: a function that can be run on the GPU
- Device code: the code that runs on the GPU
- Host code: the code that runs on the CPU

Example from the tutorial:

```cpp
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *sum, float *x, float *y) {
  for (int i = 0; i < n; i++)
    sum[i] = x[i] + y[i];
}
```

# Memory allocation in CUDA

In standard C++, to allocate the memory for two arrays we do:

```cpp
// Init
int N = 10;
float *x = new float[N];
float *y = new float[N];

// Do some stuff

// Free memory
delete[] x;
delete[] y;
```

In CUDA, thanks to the Unified Memory concept, the equivalent can be written as:

```cpp
// Allocate Unified Memory --- accessible from CPU or GPU
float *x, *y;
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));

// Do some stuff

// Free memory
cudaFree(x);
cudaFree(y);
```

So now we have the kernel defined with the keyword `__global__` like this:

```cpp
__global__
void add(int n, float *x, float *y){
  ...
}
```

And to call the kernel from the host in the `main` function, we do this:

```cpp
int main() {
  // Stuff before

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Stuff after
}
```

Here `cudaDeviceSynchronize()` is needed to make the CPU wait until the computation on the GPU is finished, because a kernel launch does not block the calling CPU thread.

The complete code with the kernel is then:

In [27]:
%%writefile add_cuda.cu
#include <iostream>
#include <math.h>

// Kernel function to add two arrays:
__global__
void add(int n, float *x, float *y){
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void) {
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory - accessible from CPU or GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // Init the two arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  int index = 0;
  for (int i = 0; i < N; i++) {
    std::cout << "y[" << i << "] = " << i << std::endl;

    maxError = fmax(maxError, fabs(y[i] - 3.0f));
    if (maxError > 0) {
      index = i;
      std::cout << "Max error at index: " << index << std::endl;
      break;
    }
  }
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}


Overwriting add_cuda.cu


In [28]:
!nvcc add_cuda.cu -o add_cuda
!./add_cuda

y[0] = 0
Max error at index: 0
Max error: 1


# Noob Note

Let's check for ourselves that Unified Memory is accessible from both the CPU and the GPU.