# Objective

- To learn the basic API functions in CUDA host code
  - Device Memory Allocation
  - Host-Device Data Transfer

# Data Parallelism - Vector Addition Example

![alt tag](img/3.png)
<hr style="height:2px">

# Vector Addition – Traditional C Code

```cpp
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    ...
    vecAdd(h_A, h_B, h_C, N);
}
```
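
A compilable variant of the host-only version might look like the sketch below. The helper `vecAddDemo` and its sample input values are hypothetical stand-ins for the allocation and I/O steps that the slide leaves out; they are not part of the original code.

```cpp
#include <cassert>
#include <vector>

// Compute vector sum C = A + B on the host, one element per loop iteration.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    for (int i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

// Hypothetical helper (assumption, not in the original): allocates the three
// host vectors, fills A and B with sample values instead of doing I/O,
// and returns the result vector C.
std::vector<float> vecAddDemo(int n)
{
    assert(n >= 0);
    std::vector<float> h_A(n), h_B(n), h_C(n);
    for (int i = 0; i < n; i++) {
        h_A[i] = static_cast<float>(i);         // A = 0, 1, 2, ...
        h_B[i] = 2.0f * static_cast<float>(i);  // B = 0, 2, 4, ...
    }
    vecAdd(h_A.data(), h_B.data(), h_C.data(), n);
    return h_C;
}
```

Calling `vecAddDemo(4)` returns a vector whose element `i` equals `3 * i`, since each element is `i + 2 * i`.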
<hr style="height:2px">

# Heterogeneous Computing: vecAdd CUDA Host Code

![alt tag](img/5.png)

```cpp
#include <cuda_runtime.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    size_t size = n * sizeof(float);  // bytes per vector
    float *d_A, *d_B, *d_C;           // device copies of A, B, and C

    // Part 1
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory

    // Part 2
    // Kernel launch code: the device performs the actual vector addition

    // Part 3
    // Copy C from device memory
    // Free device vectors
}
```
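
One way the three parts of this skeleton might be filled in is sketched below. The runtime calls `cudaMalloc`, `cudaMemcpy`, and `cudaFree` are standard CUDA runtime functions. The kernel `vecAddKernel` and the block size of 256 threads are assumptions made for illustration, since the slide does not show them. Error checking is left out to keep the three parts visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Assumed kernel (not shown in the original): each thread adds one element.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];  // guard threads past the end of the vectors
}

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    assert(n > 0);
    size_t size = static_cast<size_t>(n) * sizeof(float);
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Part 2: launch enough blocks to cover all n elements
    // (256 threads per block is an assumed, commonly used value)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // Part 3: copy the result back, then free the device vectors
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```

The device-to-host `cudaMemcpy` in Part 3 runs after the kernel has finished, so no explicit synchronization call is needed in this sketch.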
<hr style="height:2px">

# Partial Overview of CUDA Memories

- Device code can:
  - Read/write per-thread registers
  - Read/write global memory, which is shared by all threads

- Host code can:
  - Transfer data to/from per-grid global memory

![alt tag](img/6.png)
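
The division of labor described above can be illustrated with a small sketch. The names `doubleElements` and `doubleOnDevice` are hypothetical. Scalar locals such as `x` are typically kept in per-thread registers, although the compiler decides the actual placement; the array behind `data` lives in global memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread value
    if (i < n) {
        float x = data[i];  // read from global memory into a per-thread value
        x = 2.0f * x;       // arithmetic on the per-thread value
        data[i] = x;        // write the result back to global memory
    }
}

// Hypothetical host helper: the host only moves data to and from
// global memory; the device code does the reads and writes.
void doubleOnDevice(float *h_data, int n)
{
    assert(n > 0);
    size_t size = static_cast<size_t>(n) * sizeof(float);
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> global
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // global -> host
    cudaFree(d_data);
}
```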

**We will cover more memory types and more sophisticated memory models later.**


<hr style="height:2px">

# CUDA Device Memory Management API Functions


![alt tag](img/7.png)
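
Assuming the figure covers `cudaMalloc()` and `cudaFree()`, a minimal sketch of calling them with basic error checking might look like this. The wrapper names `allocDeviceFloats` and `freeDeviceFloats` are hypothetical; `cudaError_t`, `cudaSuccess`, and `cudaGetErrorString` come from the CUDA runtime API.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: allocates n floats in device global memory.
// Returns nullptr if the allocation fails.
float *allocDeviceFloats(int n)
{
    float *d_ptr = nullptr;
    // cudaMalloc takes the address of the pointer and the size in bytes
    cudaError_t err = cudaMalloc((void **)&d_ptr,
                                 static_cast<size_t>(n) * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return d_ptr;
}

// Hypothetical wrapper: frees memory obtained from cudaMalloc.
// Returns true on success.
bool freeDeviceFloats(float *d_ptr)
{
    cudaError_t err = cudaFree(d_ptr);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Checking the returned `cudaError_t` on every call is a design choice that makes allocation failures visible at the call site instead of surfacing later as a failed copy or kernel launch.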
<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>