# <span style="color:green"> Objective </span>

- To learn how to apply parallel programming techniques to an application
    - A fast gather kernel
    - Thread coarsening for more work efficiency
    - Data structure padding for reduced divergence
    - Memory access locality and pre-computation techniques

<hr style="height:2px">

# <span style="color:green"> A Slower Sequential C Version </span>

```cpp

void cenergy(float *energygrid, dim3 grid, float gridspacing, float z,
             const float *atoms, int numatoms) {
    int atomarrdim = numatoms * 4;
    int k = z / gridspacing;
    // OUTPUT ORIENTED
    for (int j=0; j<grid.y; j++) {
        float y = gridspacing * (float) j;
        for (int i=0; i<grid.x; i++) {
            float x = gridspacing * (float) i;
            float energy = 0.0f;
            for (int n=0; n<atomarrdim; n+=4) {
                // calculate potential contribution of each atom
                float dx = x - atoms[n];
                float dy = y - atoms[n+1];
                float dz = z - atoms[n+2];
                energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            energygrid[grid.x*grid.y*k + grid.x*j + i] += energy;
        }
    }
}

```

<hr style="height:2px">

# <span style="color:green"> A Slower Sequential C Version </span>

```cpp

void cenergy(float *energygrid, dim3 grid, float gridspacing, float z,
             const float *atoms, int numatoms) {
    int atomarrdim = numatoms * 4;
    int k = z / gridspacing;
    for (int j=0; j<grid.y; j++) {
        float y = gridspacing * (float) j;
        for (int i=0; i<grid.x; i++) {
            float x = gridspacing * (float) i;
            float energy = 0.0f;
            for (int n=0; n<atomarrdim; n+=4) {
                // calculate potential contribution of each atom - REDUNDANT WORK
                float dx = x - atoms[n];
                float dy = y - atoms[n+1];
                float dz = z - atoms[n+2];
                energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
            }
            energygrid[grid.x*grid.y*k + grid.x*j + i] += energy;
        }
    }
}

```

<hr style="height:2px">

# <span style="color:green"> Pros and Cons of the Slower Sequential Code </span>

- Pros
    - Fewer accesses to the energygrid array
    - Simpler code structure
- Cons
    - Many more calculations on the coordinates
    - More accesses to the atoms array
    - Overall, much slower sequential execution due to the sheer number of calculations performed
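
To make the trade-off concrete, the sketch below places the output-oriented loop nest from the slides next to an input-oriented (scatter) variant that hoists `dz` and `dy` out of the inner loops. This is a minimal illustration under stated assumptions, not the course's reference code: `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and the function names `cenergy_output_oriented` and `cenergy_input_oriented` are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3, so the sketch builds with a plain C++ compiler.
struct Dim3 { unsigned x, y, z; };

// Output-oriented (gather) loop nest, as on the slides: one update per grid
// point, but dx, dy and dz are recomputed for every (grid point, atom) pair.
void cenergy_output_oriented(float *energygrid, Dim3 grid, float gridspacing,
                             float z, const float *atoms, int numatoms) {
    assert(gridspacing > 0.0f);
    std::size_t k = (std::size_t)(z / gridspacing);
    for (unsigned j = 0; j < grid.y; j++) {
        float y = gridspacing * (float)j;
        for (unsigned i = 0; i < grid.x; i++) {
            float x = gridspacing * (float)i;
            float energy = 0.0f;
            for (int n = 0; n < numatoms * 4; n += 4) {
                float dx = x - atoms[n];
                float dy = y - atoms[n + 1];
                float dz = z - atoms[n + 2];
                energy += atoms[n + 3] / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
            energygrid[grid.x * grid.y * k + grid.x * j + i] += energy;
        }
    }
}

// Input-oriented (scatter) loop nest: the atom loop is outermost, so dz is
// computed once per atom and dy once per (atom, row). The price is one
// energygrid update per (atom, grid point) pair.
void cenergy_input_oriented(float *energygrid, Dim3 grid, float gridspacing,
                            float z, const float *atoms, int numatoms) {
    assert(gridspacing > 0.0f);
    std::size_t k = (std::size_t)(z / gridspacing);
    for (int n = 0; n < numatoms * 4; n += 4) {
        float dz = z - atoms[n + 2];
        float dz2 = dz * dz;
        float charge = atoms[n + 3];
        for (unsigned j = 0; j < grid.y; j++) {
            float dy = gridspacing * (float)j - atoms[n + 1];
            float dy2dz2 = dy * dy + dz2;
            for (unsigned i = 0; i < grid.x; i++) {
                float dx = gridspacing * (float)i - atoms[n];
                energygrid[grid.x * grid.y * k + grid.x * j + i] +=
                    charge / std::sqrt(dx * dx + dy2dz2);
            }
        }
    }
}
```

Both functions fill the same z-slice with the same values up to floating-point rounding. The first updates each `energygrid` element once per call, while the second updates it `numatoms` times but performs fewer coordinate calculations.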

<hr style="height:2px">

# <span style="color:green"> Simple DCS CUDA Block/Grid Decomposition </span>

![alt tag](img/6.png)
<hr style="height:2px">

# <span style="color:green"> Gather Parallelization </span>

![alt tag](img/7.png)
<hr style="height:2px">

# <span style="color:green"> A Fast DCS CUDA Gather Kernel </span>

```cpp

__global__ void cenergy(float *energygrid, dim3 grid, float gridspacing, float z,
                        const float *atoms, int numatoms) {
    // --- One thread per grid point ---
    // No bounds check: this assumes grid.x and grid.y are multiples of the
    // block dimensions (for example, a padded grid).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int atomarrdim = numatoms * 4;
    int k = z / gridspacing;
    float y = gridspacing * (float) j;
    float x = gridspacing * (float) i;
    // --- One thread per grid point ---
    
    // --- All threads access all atoms. Consolidated writes to grid points ---
    float energy = 0.0f;
    for (int n=0; n<atomarrdim; n+=4) {
        // calculate potential contribution of each atom
        float dx = x - atoms[n];
        float dy = y - atoms[n+1];
        float dz = z - atoms[n+2];
        energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
    }
    energygrid[grid.x*grid.y*k + grid.x*j + i] += energy;
    // --- All threads access all atoms. Consolidated writes to grid points ---
}

```

<hr style="height:2px">

# <span style="color:green"> Additional Comments </span>

- A gather kernel is much faster than a scatter kernel
    - No serialization due to atomic operations
- A compute-efficient sequential algorithm does not necessarily translate into a fast parallel algorithm
    - Gather vs. scatter is a big factor
    - But we will come back to this point later!

<hr style="height:2px">

# <span style="color:green"> Even More Comments </span>

- In modern CPUs, cache effectiveness is often more important than compute efficiency
- The input-oriented (scatter) sequential code actually has poor cache performance
    - energygrid[] is a very large array, typically 20x or more the size of atoms[]
    - The input-oriented sequential code sweeps through this large data structure once for each atom, thrashing the cache

<hr style="height:2px">

# <span style="color:green"> Outline of A Fast Sequential Code </span>

```
for all z {
    for all atoms { pre-compute dz2 }
    for all y {
        for all atoms { pre-compute dy2 (+ dz2) }
        for all x {
            for all atoms {
                compute contribution to current x,y,z point
                using pre-computed dy2 + dz2
            }
        }
    }
}
```

<hr style="height:2px">

# <span style="color:green"> More Thoughts on Fast Sequential Code </span>

- Need temporary arrays for the pre-calculated dz2 and dy2 + dz2 values
- So, why does this code have better cache behavior on CPUs?
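
The outline and its temporary arrays can be sketched in C++ as follows. This is one possible reading of the outline, not the course's reference code. `Dim3` is a hypothetical stand-in for CUDA's `dim3`, the name `cenergy_precomputed` is made up for this example, and the grid points are assumed to sit at integer multiples of `gridspacing`, as in the earlier slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's dim3, so the sketch builds with a plain C++ compiler.
struct Dim3 { unsigned x, y, z; };

// Fills the whole energy grid. atoms holds numatoms records of (x, y, z, charge).
void cenergy_precomputed(float *energygrid, Dim3 grid, float gridspacing,
                         const float *atoms, int numatoms) {
    assert(numatoms >= 0);
    // Temporary per-atom arrays: small compared with energygrid.
    std::vector<float> dz2((std::size_t)numatoms);     // dz^2 for the current z
    std::vector<float> dy2dz2((std::size_t)numatoms);  // dy^2 + dz^2 for the current row

    for (unsigned k = 0; k < grid.z; k++) {
        float z = gridspacing * (float)k;
        for (int a = 0; a < numatoms; a++) {           // pre-compute dz2
            float dz = z - atoms[4 * a + 2];
            dz2[a] = dz * dz;
        }
        for (unsigned j = 0; j < grid.y; j++) {
            float y = gridspacing * (float)j;
            for (int a = 0; a < numatoms; a++) {       // pre-compute dy2 + dz2
                float dy = y - atoms[4 * a + 1];
                dy2dz2[a] = dy * dy + dz2[a];
            }
            for (unsigned i = 0; i < grid.x; i++) {
                float x = gridspacing * (float)i;
                float energy = 0.0f;
                for (int a = 0; a < numatoms; a++) {
                    float dx = x - atoms[4 * a];
                    energy += atoms[4 * a + 3] / std::sqrt(dx * dx + dy2dz2[a]);
                }
                // One update per grid point, visited in memory order.
                std::size_t idx = (std::size_t)grid.x * grid.y * k
                                + (std::size_t)grid.x * j + i;
                energygrid[idx] += energy;
            }
        }
    }
}
```

In this form the inner loops read only `atoms` and the two temporary arrays, which are far smaller than `energygrid`, and each `energygrid` element is touched exactly once, in order. That is a plausible starting point for answering the cache question.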

<hr style="height:2px">

# <span style="color:green"> Reuse Distance Calculation for More Computation Efficiency </span>

![alt tag](img/14.png)
<hr style="height:2px">

# <span style="color:green"> Thread Coarsening </span>

![alt tag](img/15.png)
<hr style="height:2px">

# <span style="color:green"> A Compute Efficient Gather Kernel </span>

![alt tag](img/16.png)
<hr style="height:2px">

# <span style="color:green"> Thread Coarsening for More Computation Efficiency </span>

![alt tag](img/17.png)
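
The figure's contents are not reproduced in text here. As a rough illustration of the general idea, the sketch below shows the work of one coarsened thread as a plain C++ function, so that it can be compiled without CUDA. It assumes coarsening along x with a factor of 4 and adjacent grid points per thread; both are assumptions for this example, not necessarily the choices made in the figure. The names `COARSEN` and `coarsened_thread_body` are hypothetical. In a real kernel, `i0` and `j` would be derived from `blockIdx`, `blockDim` and `threadIdx`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr int COARSEN = 4;  // grid points handled per thread (assumed factor)

// Host-side stand-in for the body of one coarsened thread: it computes COARSEN
// adjacent grid points in row j, starting at column i0, for the slice at height z.
void coarsened_thread_body(float *energygrid, unsigned gridx, unsigned gridy,
                           float gridspacing, float z, const float *atoms,
                           int numatoms, unsigned i0, unsigned j) {
    assert(i0 + COARSEN <= gridx && j < gridy);
    std::size_t k = (std::size_t)(z / gridspacing);
    float y = gridspacing * (float)j;
    float energy[COARSEN] = {0.0f};

    for (int n = 0; n < numatoms * 4; n += 4) {
        // Loaded and computed once per atom, then reused for all COARSEN points.
        float dy = y - atoms[n + 1];
        float dz = z - atoms[n + 2];
        float dy2dz2 = dy * dy + dz * dz;
        float charge = atoms[n + 3];
        for (int c = 0; c < COARSEN; c++) {
            float dx = gridspacing * (float)(i0 + c) - atoms[n];
            energy[c] += charge / std::sqrt(dx * dx + dy2dz2);
        }
    }
    for (int c = 0; c < COARSEN; c++) {
        energygrid[(std::size_t)gridx * gridy * k + (std::size_t)gridx * j + i0 + c]
            += energy[c];
    }
}
```

Compared with one grid point per thread, each atom record is read once per four grid points instead of once per point, and `dy2dz2` is computed once instead of four times.
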
<hr style="height:2px">

# <span style="color:green"> Performance Comparison </span>

![alt tag](img/18.png)
<hr style="height:2px">

# <span style="color:green"> More Work is Needed to Feed a GPU </span>

![alt tag](img/19.png)
<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>