# 2 Heterogeneous data parallel computing

- *Data parallelism* - refers to the phenomenon in which the computations on different parts of the dataset can be performed independently of each other

## 2.1 Data parallelism

- We'll work through an example to elaborate on this topic. Let's consider image manipulation, where we handle millions to trillions of pixels
    - e.g. to convert a color image to grayscale we compute the luminosity of each pixel from its three color channels, $L=0.21r+0.72g+0.07b$, for all $N$ pixels: $O[0]=L(I[0](r,g,b)),\ldots, O[N-1]=L(I[N-1](r,g,b))$. Each output depends only on its own input pixel, so the $N$ computations are independent of each other
- <div class="alert alert-info"><b>Task Parallelism vs Data Parallelism</b> - in general, data parallelism is the main source of scalability, but it is not the only type of parallelism. Task parallelism, where independent tasks run concurrently, also leaves room for squeezing out extra performance, and a nice detail is that the larger the application, the more independent tasks it usually contains </div>
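
The color-to-grayscale formula above can be sketched as a plain sequential C loop. This is only a minimal illustration, not code from the book: the `Pixel` struct and the names `luminance` and `color_to_grayscale` are assumptions made for this sketch. The point to notice is that iteration `i` reads only `I[i]` and writes only `O[i]`, which is exactly what makes the loop data parallel.

```c
#include <stddef.h>

/* One RGB pixel. This struct layout is an assumption made for the sketch. */
typedef struct {
    float r, g, b;
} Pixel;

/* Luminosity of a single pixel: L = 0.21 r + 0.72 g + 0.07 b. */
static float luminance(Pixel p) {
    return 0.21f * p.r + 0.72f * p.g + 0.07f * p.b;
}

/* Convert the N input pixels I[0..N-1] into N grayscale values O[0..N-1]. */
void color_to_grayscale(const Pixel *I, float *O, size_t N) {
    for (size_t i = 0; i < N; ++i) {
        /* Iteration i touches only I[i] and O[i], so no iteration
           depends on the result of any other iteration. */
        O[i] = luminance(I[i]);
    }
}
```

Because no iteration depends on another, the loop body could in principle be handed to $N$ independent threads, one per output pixel.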

## 2.2 CUDA C program structure

- CUDA C is NVIDIA's programming language that gives access to heterogeneous computing systems composed of CPU cores and massively parallel GPUs
    - CUDA C extends ANSI C with minimal new syntax and libraries (plus some C++ features) to target heterogeneous computing
    - CUDA C's code structure reflects the structure of a *host* (CPU) and one or more *devices* (GPUs) in a computer.
- Fig. 2.3 shows a simplified scheme of the execution of a CUDA program, in which host (CPU) execution and the *grids* of threads running on the device (GPU) don't overlap in time
    - In the color-to-grayscale example each thread will be used to compute one output pixel, so we can expect $N$ threads to be generated and scheduled. CUDA threads take very few clock cycles to generate and schedule, in contrast to traditional CPU threads, which typically take thousands of clock cycles
      
<img src="images/ch022-cuda-program.png" width="60%">
      
- <div class="alert alert-info"><b>Threads</b> - a thread is a simplified view of how a processor executes a sequential program in a computer. It consists of the code of the program, the point in the code that is being executed, and the values of its variables and data structures. Each thread executes sequentially, even in CUDA programs, where a program initiates parallel execution by calling kernel functions, which launch grids of threads (through the underlying runtime mechanisms)</div>

## 2.3 A vector addition kernel

- Let's walk through vector addition, the data-parallel equivalent of the "Hello World" example from sequential programming; check Fig. 2.4 for the traditional sequential version. *Notation.-* names of host variables will always end in `_h`, whereas names of variables used by the device will end in `_d`
      
<img src="images/ch023-trad-vector-sum.png" width="50%">
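
For reference, a traditional host-only vector addition along the lines of Fig. 2.4 might look like the sketch below. It follows the `_h` naming convention from the notes; the `const` qualifiers and the exact signature are assumptions of this sketch and may differ from the figure.

```c
/* Sequential vector addition on the host (CPU):
   C_h[i] = A_h[i] + B_h[i] for every i in [0, n). */
void vecAdd(const float *A_h, const float *B_h, float *C_h, int n) {
    for (int i = 0; i < n; ++i) {
        /* Each iteration is independent of the others, so this loop
           is the part a data-parallel version would spread over threads. */
        C_h[i] = A_h[i] + B_h[i];
    }
}
```

A caller would allocate three arrays of `n` floats, fill `A_h` and `B_h`, and then call `vecAdd(A_h, B_h, C_h, n)`.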
      
- <div class="alert alert-info"><b>Pointers in C lang</b> - a regular variable is declared as <code>float V</code> and a pointer variable as <code>float *P</code>. We can make P point to V w/ <code>P=&V</code>, after which <code>*P</code> accesses the value of V. So the args for <code>vecAdd</code> are pointers to the first elements of <code>A_h, B_h, C_h</code>, and <code>A_h[i]</code> accesses the i-th element</div>
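
The pointer rules in the note above can be checked with a few lines of C. The function name `pointer_demo` and the variable `Q` are hypothetical, chosen only for this sketch; `V`, `P`, and `A_h` mirror the names used in the note.

```c
/* Small demonstration of the pointer rules from the note above. */
float pointer_demo(void) {
    float V = 1.0f;   /* regular variable */
    float *P = &V;    /* P now holds the address of V */
    *P = 3.0f;        /* writing through P changes V itself */

    float A_h[3] = {10.0f, 20.0f, 30.0f};
    float *Q = A_h;   /* an array name decays to a pointer to A_h[0] */

    /* Q[2] is shorthand for *(Q + 2), i.e. the element A_h[2]. */
    return V + Q[2];  /* 3.0 + 30.0 */
}
```

This is the same mechanism `vecAdd` relies on: it receives only the address of the first element of each array and reaches element `i` by indexing from that address.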