# Notebook #2 - Programming Basics with SAXPY

First we set environment variables to point to the user's notebook code and to the Lucata tools. 

We'll also set SIMFLAGS to simulate 16 MiB of memory on 1 node for each simulation. 
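The `SIMFLAGS` string set in the cell below passes `-m 24`. That matches the 16 MiB figure if `-m` is read as the base-2 logarithm of the memory size in bytes; this reading is an assumption made here, not something this notebook states. A quick arithmetic check under that assumption:

```python
# Assumption: the simulator's -m flag is log2 of the simulated memory size in bytes.
log2_bytes = 24
mem_bytes = 2 ** log2_bytes
mem_mib = mem_bytes // (1024 * 1024)

print(f"-m {log2_bytes} -> {mem_bytes} bytes = {mem_mib} MiB")
# -m 24 -> 16777216 bytes = 16 MiB
```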

In [None]:
import os
from IPython.display import Code

#Set the path to the latest toolset 
LUCATA_BASE="/tools/emu/pathfinder-sw/22.09-beta" 

#Get the path to where all code samples are
os.environ["USER_NOTEBOOK_CODE"]=os.path.dirname(os.getcwd())
os.environ["PATH"]=os.pathsep.join([os.path.join(LUCATA_BASE,"bin"),os.environ["PATH"]])
os.environ["CCFLAGS"]="-I"+LUCATA_BASE+"/include/"+" -L"+LUCATA_BASE+"/lib -lmemoryweb"
os.environ["SIMFLAGS"]="-m 24 --total_nodes 1 --output_instruction_count --capture_timing_queues"

This notebook accompanies the [Lucata programming basics slides](https://github.com/gt-crnch-rg/pearc-tutorial-2021/blob/main/slides/lucata_tutorial/01_Lucata_Pathfinder_Tutorial_Basics.pdf); follow along with the slides as a supplemental resource.

## SAXPY with Cilk Spawn

Our first example is Single-precision AX Plus Y (SAXPY), a basic linear algebra kernel that combines scalar multiplication and vector addition. As shown in the SAXPY kernel, each output element `y[i]` is the constant `a` multiplied by the element `x[i]`, plus the previous value of `y[i]`.
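As a reference point, the kernel can be written in a few lines of plain Python. This is only a serial sketch of the arithmetic (`y[i] = a * x[i] + y[i]`), not Lucata code; the function name `saxpy` and the sample values are chosen here for illustration.

```python
def saxpy(a, x, y):
    """Return a new list whose element i is a * x[i] + y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy(5.0, x, y))  # [15.0, 30.0, 45.0, 60.0]
```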

This first example shows how to implement SAXPY using the Cilk keywords `cilk_spawn` and `cilk_sync`. Note that the work is divided according to a "grain size" (the number of elements handled by each thread), which is determined by the number of threads (the first argument to the program).
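The spawn-then-sync pattern can be sketched with ordinary Python threads as an analogy: starting a thread stands in for `cilk_spawn`, and joining all of them stands in for `cilk_sync`. The helper names and the grain-size formula (array length divided by thread count, rounded up) are assumptions for illustration; `saxpy.c` may compute its grain differently.

```python
import threading


def saxpy_chunk(a, x, y, start, stop):
    """Update y in place for indices start..stop-1."""
    for i in range(start, stop):
        y[i] = a * x[i] + y[i]


def saxpy_spawn_sync(a, x, y, num_threads):
    """Run SAXPY with one worker per chunk; return how many workers were started."""
    n = len(x)
    grain = (n + num_threads - 1) // num_threads  # elements per worker, rounded up
    workers = []
    for start in range(0, n, grain):
        stop = min(start + grain, n)
        t = threading.Thread(target=saxpy_chunk, args=(a, x, y, start, stop))
        t.start()  # plays the role of cilk_spawn
        workers.append(t)
    for t in workers:
        t.join()  # plays the role of cilk_sync
    return len(workers)


x = [1.0] * 32
y = [2.0] * 32
spawned = saxpy_spawn_sync(5.0, x, y, num_threads=4)
print(spawned, y[0], y[-1])  # 4 7.0 7.0
```

Each worker writes a disjoint slice of `y`, so no locking is needed in this sketch.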

*For examples of SAXPY in other parallel languages, see this [NVIDIA developer blog on SAXPY](https://developer.nvidia.com/blog/six-ways-saxpy/).*

In [None]:
Code('saxpy.c')

We'll compile this example and run it twice: with 4 threads on a 32-element array, and with 8 threads on a 128-element array, using a constant value `a` of 5.0 in both cases. After each run, we check the .cdc file for the simulated runtime of the profiled region.

This notebook will use 8 threads and an array of size 128 for all examples after this one.

The line that says "Emu system run time" is the simulated time to complete the SAXPY operation.
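The bash cells in this notebook pull that line out with `grep`. The same lookup can be done from a Python cell. This is a minimal sketch; the helper name `find_lines` is made up here, and it assumes only that the `.cdc` file is plain text sitting in the current directory.

```python
from pathlib import Path


def find_lines(path, needle="Emu system run time"):
    """Return every line of the text file at `path` that contains `needle`."""
    text = Path(path).read_text(errors="replace")
    return [line for line in text.splitlines() if needle in line]


cdc = Path("saxpy.cdc")
if cdc.exists():
    for line in find_lines(cdc):
        print(line)
else:
    print(f"{cdc} not found; run the simulation first")
```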

In [None]:
%%bash
emu-cc -o saxpy.mwx $CCFLAGS saxpy.c
emusim.x $SIMFLAGS -- saxpy.mwx 4 32 5.0 2>/dev/null
less saxpy.cdc | grep "Emu system run time"

In [None]:
%%bash
emusim.x $SIMFLAGS -- saxpy.mwx 8 128 5.0 2>/dev/null
less saxpy.cdc | grep "Emu system run time"

## SAXPY with Cilk For

`cilk_for` can be used to run the iterations of a loop in parallel threads, in a fashion similar to OpenMP's pragma-based `omp parallel for` loops. Note that here the programmer must explicitly specify a grainsize to partition the input array.
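To see what a grainsize does to the loop, the sketch below splits an iteration range into consecutive chunks and counts the resulting tasks. It models only the partitioning, not `cilk_for` itself, and the helper name `grain_ranges` is hypothetical.

```python
def grain_ranges(n, grainsize):
    """Split range(n) into consecutive (start, stop) chunks of at most `grainsize` iterations."""
    return [(start, min(start + grainsize, n)) for start in range(0, n, grainsize)]


n = 128
for grainsize in (1, 16, 128):
    chunks = grain_ranges(n, grainsize)
    print(grainsize, len(chunks), chunks[0], chunks[-1])
# 1 128 (0, 1) (127, 128)
# 16 8 (0, 16) (112, 128)
# 128 1 (0, 128) (0, 128)
```

A grainsize of 1 creates one task per element, while a grainsize of 16 yields 8 tasks for 128 elements, which matches the 8-thread runs used in this notebook.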

In [None]:
Code('saxpy-for.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-for.mwx  saxpy-for.c
emusim.x $SIMFLAGS -- saxpy-for.mwx 8 128 5.0 2>/dev/null
less saxpy-for.cdc | grep "Emu system run time"

## SAXPY with Distributed Allocation (1D)

In this example, `memoryweb.h` is included to provide the Lucata-specific distributed allocation functions, and `mw_malloc1dlong` is used to distribute the data across different nodes within the system.
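One way to picture a 1D distributed array is as element-wise striping, where consecutive elements land on consecutive locations in round-robin order. The sketch below assumes that layout and a hypothetical count of 8 locations; check the toolchain documentation for the exact behavior of `mw_malloc1dlong`.

```python
def owner_1d(i, num_locations):
    """Location that holds element i under round-robin striping (an assumed layout)."""
    return i % num_locations


num_locations = 8  # hypothetical count, chosen for illustration
layout = {}
for i in range(16):
    layout.setdefault(owner_1d(i, num_locations), []).append(i)

print(layout[0], layout[1], layout[7])  # [0, 8] [1, 9] [7, 15]
```

Under this layout, neighboring elements `y[j]` and `y[j+1]` live in different places, which is why the later examples care about where a thread starts.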

In [None]:
Code('saxpy-1d.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-1d.mwx  saxpy-1d.c
emusim.x $SIMFLAGS -- saxpy-1d.mwx 8 128 5.0 2>/dev/null
less saxpy-1d.cdc | grep "Emu system run time"

## SAXPY Distributed Spawn with migrate_hint

The `migrate_hint` allows the programmer to pass a pointer that is then used by the next `cilk_spawn` operation to efficiently jump to a specific part of a distributed array. Here the migration hint specifies a "directed spawn" to the location where `y[j]` resides. Note that `cilk_spawn_at` serves a similar purpose by combining a spawn and a migration hint into one call.
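A toy model can show why a directed spawn helps. It assumes a worker without a hint starts on the parent's node and must move before touching its first element, while a worker with a hint starts where that element lives. The node count and function name are hypothetical, and the simulator's actual accounting may differ.

```python
def initial_migrations(targets, parent_node, use_hint):
    """Count workers that must move before touching their first element (toy model)."""
    count = 0
    for target_node in targets:
        start_node = target_node if use_hint else parent_node
        if start_node != target_node:
            count += 1
    return count


num_nodes = 4  # hypothetical
# Node holding y[j] for the first index j of each of 8 workers, assuming striping.
targets = [j % num_nodes for j in range(8)]

print(initial_migrations(targets, parent_node=0, use_hint=False))  # 6
print(initial_migrations(targets, parent_node=0, use_hint=True))   # 0
```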

In [None]:
Code('saxpy-1d-hint.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-1d-hint.mwx  saxpy-1d-hint.c
emusim.x $SIMFLAGS -- saxpy-1d-hint.mwx 8 128 5.0 2>/dev/null
less saxpy-1d-hint.cdc | grep "Emu system run time"

## SAXPY with 2D Distributed Allocation

This example shows the usage of `cilk_spawn_at` and 2D block allocation of data across the Lucata nodes. In this case, the number of threads matches the number of blocks and the work done by each thread is the block size. 
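The block arithmetic can be sketched as follows, assuming equal-sized blocks and the 8-thread, 128-element configuration used in this notebook. The helper name `locate_2d` is made up for illustration; where each block is placed is decided by the allocator and is not modeled here.

```python
def locate_2d(i, block_size):
    """Map a flat index to (block, offset) for equal-sized blocks."""
    return divmod(i, block_size)


n, num_blocks = 128, 8
block_size = n // num_blocks  # 16 elements per block, one thread per block

print(locate_2d(0, block_size), locate_2d(17, block_size), locate_2d(127, block_size))
# (0, 0) (1, 1) (7, 15)

# Flat indices covered by the thread that owns block 3.
block = 3
print(block * block_size, (block + 1) * block_size - 1)  # 48 63
```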

In [None]:
Code('saxpy-2d-spawn-at.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-2d-spawn-at.mwx  saxpy-2d-spawn-at.c
emusim.x $SIMFLAGS -- saxpy-2d-spawn-at.mwx 8 128 5.0 2>/dev/null
less saxpy-2d-spawn-at.cdc | grep "Emu system run time"

## SAXPY with Local Allocation

This example shows a variation of the previous 2D code with a local allocation for the output. You will notice that the local allocation for the output (as opposed to 2D allocation) results in more migrations overall. 
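A rough way to reason about the difference is a toy model in which a thread moves whenever its next read targets a different location, and stores do not move it. The model also assumes block `b` of `x` lives at location `b` and that a local `y` lives entirely at location 0. These are illustrative assumptions, so the counts will not match the simulator's numbers.

```python
def count_migrations(num_blocks, block_size, y_is_local, home=0):
    """Toy model: count location changes caused by reading x[i] and then y[i]."""
    migrations = 0
    for b in range(num_blocks):
        here = b  # the worker for block b starts where x's block b lives
        for _ in range(block_size):
            y_loc = home if y_is_local else b
            for loc in (b, y_loc):  # read x[i], then read y[i]
                if loc != here:
                    migrations += 1
                    here = loc
    return migrations


print(count_migrations(8, 16, y_is_local=False))  # 0
print(count_migrations(8, 16, y_is_local=True))   # 217
```

In the local case, every worker except the one already at location 0 bounces between its own block of `x` and the home of `y` for each element.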

In [None]:
Code('saxpy-local-spawn-at.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-local-spawn-at.mwx  saxpy-local-spawn-at.c
emusim.x $SIMFLAGS -- saxpy-local-spawn-at.mwx 8 128 5.0 2>/dev/null
less saxpy-local-spawn-at.cdc | grep "Emu system run time"

## SAXPY with Replicated Data Structures

Finally, we look at using replication to create copies of the constant `a` on all the nodes. This avoids the migrations that would be needed to read this common variable if it were located on only a single node. Note that replication can be a powerful tool for optimized allocation, but it should be used primarily with small data structures and read-only variables (for coherency reasons).
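The coherency caveat can be illustrated with a small stand-in class that keeps one private copy per node. This is a conceptual sketch only; the class and method names are invented here and are not part of the Lucata API.

```python
class Replicated:
    """Toy stand-in for a replicated variable: one private copy per node."""

    def __init__(self, value, num_nodes):
        self.copies = [value] * num_nodes

    def read(self, node):
        # Always served from the copy on the reading node, so no remote access is needed.
        return self.copies[node]

    def write_local(self, node, value):
        # Updates only this node's copy; the other copies keep the old value.
        self.copies[node] = value


a = Replicated(5.0, num_nodes=4)
print([a.read(node) for node in range(4)])  # [5.0, 5.0, 5.0, 5.0]

a.write_local(2, 9.0)
print([a.read(node) for node in range(4)])  # [5.0, 5.0, 9.0, 5.0]
```

After the local write, the copies disagree, which is the reason replication suits values that are set once and then only read.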

In [None]:
Code('saxpy-1d-replicated.c')

In [None]:
%%bash
emu-cc $CCFLAGS -o saxpy-1d-replicated.mwx  saxpy-1d-replicated.c
emusim.x $SIMFLAGS -- saxpy-1d-replicated.mwx 8 128 5.0 2>/dev/null
less saxpy-1d-replicated.cdc | grep "Emu system run time"

#%%bash
#FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
#emu-cc -o saxpy-1d-replicated.mwx $FLAGS saxpy-1d-replicated.c
#emusim.x -m 24 --total_nodes 1 -- saxpy-1d-replicated.mwx 8 128 5 2>/dev/null

## Visualization

As we learned in Notebook 1.1, we can use `emusim_profile` to generate a number of charts to help us understand the performance of the profiled regions. Below is an example of how to do that for the last example, `saxpy-1d-replicated`. 

In [None]:
%%bash
emusim_profile saxpy-profile $SIMFLAGS -- saxpy-1d-replicated.mwx 8 128 5.0 2>/dev/null

The output of the above simulation is summarized in the file [saxpy-profile/saxpy-1d-replicated-report.html](saxpy-profile/saxpy-1d-replicated-report.html). Note that some plots are not generated when running with fewer than 2 nodes, which is the case here because `$SIMFLAGS` sets `--total_nodes 1`.

### Postscript
Here we have investigated several different strategies for spawning threads and allocating data with the Pathfinder's distributed layout.

Once we've finished testing, we can clean up the log files generated by these examples with `make clean`. Uncomment the following line to clean this directory.

In [None]:
#!make clean