# Final Exercise

Let's test your NVSHMEM skills! In this notebook we provide a fully implemented CUDA application that works on a single GPU. Your job is to convert it to run correctly with NVSHMEM on an arbitrary number of PEs.

## Objectives

By the time you complete this notebook you will:

- Demonstrate your ability to write NVSHMEM code for an arbitrary number of PEs.

## 1D Wave Equation

As with the Jacobi program, you don't need to fully understand this algorithm to work with the code, but we will take time here to introduce it.

The problem we'll solve is the 1D wave equation:

$$
\frac{\partial^2}{\partial t^2} u(x,t) = c^2 \frac{\partial^2}{\partial x^2} u(x,t)
$$

Here $u = u(x,t)$ is a scalar function of space ($x$) and time ($t$), and $c$ is a (constant) characteristic wave speed (for example, the speed of sound if the wave in question is propagating in air).

Applying a centered-difference discretization to both the spatial and time derivatives, one way to write this is:

$$
\frac{1}{\Delta t^2} \left(u_{i}^{n+1} - 2 u_{i}^{n} + u_{i}^{n-1}\right) = \frac{c^2}{\Delta x^2} \left(u_{i+1}^{n} - 2 u_{i}^{n} + u_{i-1}^{n}\right)
$$

Here subscripts denote spatial indices and superscripts denote timesteps. Rearranging this for the unknown $u_{i}^{n+1}$ in terms of known quantities at timesteps $n$ and $n-1$, we have:

$$
u_{i}^{n+1} = 2 u_{i}^{n} - u_{i}^{n-1} + \left(\frac{c\, \Delta t}{\Delta x}\right)^2 \left(u_{i+1}^{n} - 2 u_{i}^{n} + u_{i-1}^{n}\right)
$$
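To make the update rule concrete before looking at the GPU version, here is a minimal CPU sketch of a single timestep. The function name `wave_step_reference` and the use of `std::vector` are our own choices for illustration and do not appear in the lab code; `courant_sq` stands for $(c\,\Delta t / \Delta x)^2$.

```cpp
#include <cstddef>
#include <vector>

// One explicit timestep of the discretized 1D wave equation on the CPU.
// `courant_sq` is (c * dt / dx)^2. Only interior points are updated, so
// the two end points of `u_new` keep whatever values they already hold.
// Assumes all three vectors have the same length.
void wave_step_reference(std::vector<float>& u_new,
                         const std::vector<float>& u_old,
                         const std::vector<float>& u_older,
                         float courant_sq)
{
    const std::size_t n = u_old.size();
    for (std::size_t i = 1; i + 1 < n; ++i) {
        u_new[i] = 2.0f * u_old[i] - u_older[i] +
                   courant_sq * (u_old[i + 1] - 2.0f * u_old[i] + u_old[i - 1]);
    }
}
```

Each interior point depends only on its two neighbors at the previous timestep, which is why the GPU kernel can assign one thread per grid point.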

### Solving

To solve using this method, we simply need to retain the value of the solution at two previous timesteps, and then replace the old data after each update.
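Replacing the old data does not require copying any grid values. A minimal sketch, assuming the three time levels are held through raw pointers as in the code further down (the helper name `rotate_time_levels` is hypothetical):

```cpp
#include <utility>

// Rotate the three time levels after an update. On entry `u` holds step
// n+1, `u_old` holds step n, and `u_older` holds step n-1. On exit `u_old`
// holds step n+1, `u_older` holds step n, and `u` points at the stale
// step n-1 storage, which the next update will overwrite.
// Only pointers move; no grid data is copied.
void rotate_time_levels(float*& u, float*& u_old, float*& u_older)
{
    std::swap(u_old, u_older);  // u_older -> step n,   u_old -> step n-1
    std::swap(u, u_old);        // u_old   -> step n+1, u     -> step n-1 (scratch)
}
```

The order of the two swaps matters: swapping `u` and `u_old` first would leave `u_older` pointing at the newest data.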

We're going to specify a 1D domain from $x = 0.0$ to $x = 1.0$, and discretize it into $N$ points with a grid spacing of $\Delta x = 1.0\, /\, (N - 1)$. $\Delta t$ [must be chosen](https://en.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E2%80%93Lewy_condition) so that it is less than or equal to $\Delta x / c$. To simplify, we'll choose $c = 1.0$ so that the condition becomes $\Delta t \le \Delta x$ and we don't have to worry about that term floating around.
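As a small worked example of these choices, the sketch below computes the grid spacing and a timestep that is a fixed fraction of the CFL limit $\Delta t \le \Delta x / c$. The helper names are hypothetical; `stability_factor` mirrors the variable of the same name in the code further down.

```cpp
// Grid spacing for `num_points` points covering [0, 1], end points included.
float grid_spacing(int num_points)
{
    return 1.0f / static_cast<float>(num_points - 1);
}

// Timestep chosen as a fraction (`stability_factor` <= 1) of the CFL limit dx / c.
float stable_timestep(int num_points, float c, float stability_factor)
{
    return stability_factor * grid_spacing(num_points) / c;
}

// Courant number c * dt / dx; this explicit scheme is stable when it is <= 1.
float courant_number(float c, float dt, float dx)
{
    return c * dt / dx;
}
```

With $c = 1$ and a stability factor of 0.5, the Courant number is 0.5 regardless of $N$, which is the setting the lab code uses.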

We're going to specify that $u(0, t) = u(1, t) = 0$. We can think of this as solving for a wave propagating on a string whose two ends are held fixed. We also specify two initial conditions: the initial shape $u(x, 0)$, a simple sine wave, and zero velocity at $t = 0$, which is effectively implemented by starting with $u^{n} = u^{n-1}$.

The period of this wave is 1.0, so after simulating up to $t = 1$, the wave should return exactly to where it started. We will use that to verify our solution -- the check at the end of the code will print an "error" which is the $L^2$ norm of the current solution with respect to the initial solution.
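One way such a check could look on the host is sketched below. We assume here that the "error" is the unnormalized $L^2$ norm of the pointwise difference, $\sqrt{\sum_i (u_i - u_i^0)^2}$; the function name `l2_error` is our own, and the lab's actual check may normalize differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unnormalized L2 norm of the difference between the current solution and
// the initial one. Assumes both vectors have the same length. The sum is
// accumulated in double to limit round-off over about a million points.
double l2_error(const std::vector<float>& u, const std::vector<float>& u_initial)
{
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        const double diff = static_cast<double>(u[i]) - static_cast<double>(u_initial[i]);
        sum_sq += diff * diff;
    }
    return std::sqrt(sum_sq);
}
```

To use it with the GPU code you would first copy both arrays back to the host, for example with `cudaMemcpy`.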

## A Single GPU Implementation

This example is implemented in standard CUDA for a single GPU. Take some time to review the algorithm and its parallel implementation, and examine the output below.

In [None]:
%%writefile wave-1GPU.cu
#include <iostream>
#include <limits>
#include <cstdio>
#include <cmath>
#include <utility>  // std::swap

// Number of points in the overall spatial domain
#define NUM_POINTS 1048576

__global__ void wave_update (float* u, const float* u_old, const float* u_older, float dtdxsq, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx > 0 && idx < N - 1) {

        float u_old_right = u_old[idx+1];
        float u_old_left = u_old[idx-1];

        u[idx] = 2.0f * u_old[idx] - u_older[idx] +
                 dtdxsq * (u_old_right - 2.0f * u_old[idx] + u_old_left);
    }
}

__global__ void initialize (float* u, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < N) {
        u[idx] = std::sin(2.0f * M_PI * idx / static_cast<float>(NUM_POINTS - 1));
    }
}


int main(int argc, char **argv) 
{
    const int N = NUM_POINTS;

    // Allocate space for the grid data, and the temporary buffer
    // for the "old" and "older" data.
    float* u_older;
    float* u_old;
    float* u;

    cudaMalloc(&u_older, N * sizeof(float));
    cudaMalloc(&u_old, N * sizeof(float));
    cudaMalloc(&u, N * sizeof(float));

    // Initialize the data
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;

    initialize<<<blocks, threads_per_block>>>(u_older, N);
    initialize<<<blocks, threads_per_block>>>(u_old, N);
    initialize<<<blocks, threads_per_block>>>(u, N);
    cudaDeviceSynchronize();

    // Now iterate until we've completed a full period
    const float period = 1.0f;
    const float start_time = 0.0f;
    const float stop_time = period;

    // Maximum stable timestep is <= dx
    float stability_factor = 0.5f;
    float dx = 1.0f / (NUM_POINTS - 1);
    float dt = stability_factor * dx;

    float t = start_time;
    const float safety_factor = (1.0f - 1.0e-5f);

    int num_steps = 0;

    while (t < safety_factor * stop_time) {
        // Make sure the last step does not go over the target time
        if (t + dt >= stop_time) {
            dt = stop_time - t;
        }

        float dtdxsq = (dt / dx) * (dt / dx);

        // Launch kernel to do the calculation
        wave_update<<<blocks, threads_per_block>>>(u, u_old, u_older, dtdxsq, N);
        cudaDeviceSynchronize();

        // Swap u_old and u_older
        std::swap(u_old, u_older);

        // Swap u and u_old
        std::swap(u, u_old);

        // Print out diagnostics periodically
        if (num_steps % 100000 == 0) {
            std::cout << "Current integration time = " << t << "\n";
        }

        // Update t
        t += dt;
        ++num_steps;
    }

    // Clean up
    cudaFree(u_older);
    cudaFree(u_old);
    cudaFree(u);

    return 0;
}

### Compile and Run the Code for 1GPU

In [None]:
!nvcc -x cu -arch=sm_70 -o wave-1GPU wave-1GPU.cu

In [None]:
%time !./wave-1GPU

## Assessment: Implement with NVSHMEM

Now implement this with NVSHMEM, dividing the domain equally into $M$ subdomains (for $M$ PEs), with PE 0 owning $x = [0, 1/M]$, PE 1 owning $x = [1/M, 2/M]$, etc. Keep the points at $x = 0$ and $x = 1$ completely fixed.

Some things to keep in mind as you're implementing your solution:
- Currently the initialization routine assumes the full domain is available. You'll have to modify that so that each PE sets the appropriate initial conditions for its location in the spatial domain.
- Similarly, for the solution check, make sure you do this properly across all PEs by accumulating the sum of squared differences locally and then reducing over all PEs before taking the square root.
- You'll have to do some point-to-point communication inside the actual `wave_update()` routine to get the "halo" data from neighboring PEs (but be careful not to update the boundary points, one of which lives on PE 0, the other on PE $M - 1$).
- It's OK if your solution is not any faster than the single-GPU case. We are mostly focusing on writing correct NVSHMEM code in this lab. Given the default value of `NUM_POINTS`, the amount of work per kernel is small enough that we're not really efficiently utilizing the GPU. If you want to do a more realistic performance comparison, set `NUM_POINTS` to a much larger number (but then cap the integration to, say, 10000 steps so that it completes in a reasonable amount of time). You can use the Jupyter notebook `%time` magic function to easily compare application runtime (e.g. `%time !nvshmrun -np 4 ...`).
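The index bookkeeping behind the first three points can be sketched in plain host C++, without any NVSHMEM calls. All three helper names are hypothetical, and we assume `NUM_POINTS` divides evenly by the number of PEs. In your solution the per-PE partial sums would be combined with an NVSHMEM sum reduction rather than a host loop.

```cpp
#include <cmath>

// Global index of local point 0 on PE `pe`, assuming `num_points` is an
// exact multiple of `n_pes`.
int global_offset(int pe, int n_pes, int num_points)
{
    return pe * (num_points / n_pes);
}

// True only for the two fixed end points of the global domain: the first
// point on PE 0 and the last point on PE n_pes - 1.
bool is_fixed_boundary(int pe, int n_pes, int local_idx, int local_n)
{
    return (pe == 0 && local_idx == 0) ||
           (pe == n_pes - 1 && local_idx == local_n - 1);
}

// Combine per-PE sums of squared differences into one global L2 error.
// The square root is taken once, after the sum over all PEs.
double combine_l2(const double* partial_sum_sq, int n_pes)
{
    double total = 0.0;
    for (int pe = 0; pe < n_pes; ++pe) {
        total += partial_sum_sq[pe];
    }
    return std::sqrt(total);
}
```

Note that taking the square root on each PE and then summing would give a different (and wrong) result, since the norm is not additive across subdomains.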

### Fix the Code in `wave.cpp`

If you are really struggling, you can execute the following cell to copy a near-solution implementation into `wave.cpp` that will still require you to address some `FIXMEs` to complete the assessment.

In [None]:
%%writefile wave.cpp
#include <iostream>
#include <limits>
#include <cstdio>
#include <cmath>

#include <nvshmem.h>
#include <nvshmemx.h>

// Number of points in the overall spatial domain
#define NUM_POINTS 1048576

__global__ void wave_update (float* u, const float* u_old, const float* u_older, float dtdxsq, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    int my_pe = // FIXME
    int n_pes = // FIXME

    bool on_boundary = false;
    if (my_pe == 0 && idx == 0) {
        on_boundary = true;
    }
    else if (my_pe == n_pes - 1 && idx == N-1) {
        on_boundary = true;
    }

    if (idx < N && !on_boundary) {
        float u_old_left;

        if (idx == 0) {
            // Note we do not get here if we're PE == 0
            
            // FIXME: Get `u_old_left` from the last element of the previous PE.
            // u_old_left = nvshmem_float_g(TODO);
            u_old_left = // FIXME
        }
        else {
            u_old_left = u_old[idx-1];
        }

        float u_old_right;
        if (idx == N-1) {
            // Note we do not get here if we're PE == n_pes - 1
            
            // FIXME: Get `u_old_right` from the first element of the next PE.
            // u_old_right = nvshmem_float_g(TODO);
            u_old_right = // FIXME
        }
        else {
            u_old_right = u_old[idx+1];
        }

        u[idx] = 2.0f * u_old[idx] - u_older[idx] +
                 dtdxsq * (u_old_right - 2.0f * u_old[idx] + u_old_left);
    }
}

__global__ void initialize (float* u, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    int offset = nvshmem_my_pe() * (NUM_POINTS / nvshmem_n_pes());

    if (idx < N) {
        u[idx] = std::sin(2.0f * M_PI * (idx + offset) / static_cast<float>(NUM_POINTS - 1));
    }
}


int main() {
    // FIXME: Initialize NVSHMEM
    
    
    // FIXME: Obtain our NVSHMEM processing element ID and number of PEs
    int my_pe = // FIXME
    int n_pes = // FIXME 

    // Each PE (arbitrarily) chooses the GPU corresponding to its ID
    int device = my_pe;  
    cudaSetDevice(device);

    // Set `N` to total number of points over number of PEs
    const int N = NUM_POINTS / n_pes; // FIXME 

    // Allocate symmetric data for the grid data, and the temporary buffer
    // for the "old" and "older" data.
    float* u_older = (float*) nvshmem_malloc(N * sizeof(float));
    float* u_old   = (float*) nvshmem_malloc(N * sizeof(float));
    float* u       = (float*) nvshmem_malloc(N * sizeof(float));

    // Initialize the data
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;

    initialize<<<blocks, threads_per_block>>>(u_older, N);
    initialize<<<blocks, threads_per_block>>>(u_old, N);
    initialize<<<blocks, threads_per_block>>>(u, N);
    cudaDeviceSynchronize();

    // Now iterate until we've completed a full period
    const float period = 1.0f;
    const float start_time = 0.0f;
    const float stop_time = period;

    // Maximum stable timestep is <= dx
    float stability_factor = 0.5f;
    float dx = 1.0f / (NUM_POINTS - 1);
    float dt = stability_factor * dx;

    float t = start_time;
    const float safety_factor = (1.0f - 1.0e-5f);

    int num_steps = 0;

    while (t < safety_factor * stop_time) {
        // Make sure the last step does not go over the target time
        if (t + dt >= stop_time) {
            dt = stop_time - t;
        }

        float dtdxsq = (dt / dx) * (dt / dx);

        // Launch kernel to do the calculation
        wave_update<<<blocks, threads_per_block>>>(u, u_old, u_older, dtdxsq, N);
        cudaDeviceSynchronize();

        // FIXME: Synchronize all PEs before performing the swaps.
        nvshmem_barrier_all();
        // Swap u_old and u_older
        std::swap(u_old, u_older);

        // Swap u and u_old
        std::swap(u, u_old);

        // Print out diagnostics periodically
        // FIXME: Only do the periodic print if this is PE 0.
        if (num_steps % 100000 == 0 && my_pe == 0) {
            std::cout << "Current integration time = " << t << "\n";
        }

        // Update t
        t += dt;
        ++num_steps;
    }

    // Clean up
    // FIXME: Use NVSHMEM to free `u_older`, `u_old`, and `u`.
    
    
    // FIXME: Finalize NVSHMEM
    
    return 0;
}

### Run the Code for NVSHMEM

In [None]:
!nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o wave wave.cpp

In [None]:
%time !nvshmrun -np $NUM_DEVICES ./wave

## Summary

Congratulations! You've now mastered the fundamentals of NCCL, CUDA-aware MPI, and NVSHMEM, and are ready to begin applying them to your own problems.