# Notebook #2 - Programming Basics with SAXPY

First we set environment variables to point to the user's notebook code and to the Lucata tools.

In [5]:
import os
from IPython.display import Code
os.environ["USER_NOTEBOOK_CODE"]=os.path.dirname(os.getcwd())
os.environ["PATH"]=os.pathsep.join(["/tools/emu/pathfinder-sw/22.02/bin",os.environ["PATH"]])

This notebook accompanies the [Lucata programming basics slides](https://github.com/gt-crnch-rg/pearc-tutorial-2021/blob/main/slides/lucata_tutorial/01_Lucata_Pathfinder_Tutorial_Basics.pdf); please follow along with the slides as a supplemental resource.

## SAXPY with Cilk Spawn

Our first example is Single-precision AX Plus Y (SAXPY), a basic linear algebra kernel that combines scalar multiplication and vector addition. As shown in the saxpy kernel, each element of the output `y` is the constant `a` multiplied by the corresponding element of the vector `x`, added to the existing value of `y`, that is, `y[i] = a * x[i] + y[i]`.
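As a point of reference, the kernel can be written as a serial loop. The sketch below is plain Python rather than the Cilk C code used in this notebook, and the function name `saxpy_reference` is made up for illustration. It only shows the arithmetic: each result is `a * x[i] + y[i]`.

```python
def saxpy_reference(a, x, y):
    """Serial SAXPY: return a new list holding a * x[i] + y[i] for each i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy_reference(5.0, x, y))  # [15.0, 30.0, 45.0, 60.0]
```

A serial version like this is handy as a correctness check for the parallel variants that follow.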

This first example implements SAXPY with the Cilk keywords `cilk_spawn` and `cilk_sync`. Note that the example operates on a particular "grain size", which is determined by the number of threads (the first argument to the program).
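One way to picture the grain size is as the number of consecutive elements handed to each spawned thread. The sketch below is an assumption about how such a split could be computed, not a copy of `saxpy.c`: the helper `chunk_ranges` is hypothetical, and it assumes the grain is the array length divided by the thread count, rounded up.

```python
def chunk_ranges(n, num_threads):
    """Split indices 0..n-1 into contiguous (start, end) ranges, end exclusive.

    Assumes grain = ceil(n / num_threads); the last range may be shorter.
    """
    if n < 0 or num_threads <= 0:
        raise ValueError("n must be non-negative and num_threads positive")
    grain = -(-n // num_threads)  # ceiling division
    ranges = []
    start = 0
    while start < n:
        end = min(start + grain, n)
        ranges.append((start, end))
        start = end
    return ranges


print(chunk_ranges(32, 4))  # [(0, 8), (8, 16), (16, 24), (24, 32)]
```

With the notebook's inputs, 4 threads over 32 elements gives a grain of 8, and 8 threads over 128 elements gives a grain of 16 under this assumption.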

*For examples of SAXPY in other parallel languages, please check out this [NVIDIA developer blog on SAXPY](https://developer.nvidia.com/blog/six-ways-saxpy/).*

In [6]:
Code('saxpy.c')

We'll compile this example and run it twice with a constant value `a` of 5.0: first with 4 threads and a 32-element array, then with 8 threads and a 128-element array. After each run we will check the .cdc and .vsf files to get some high-level statistics on the execution.

This notebook will use 8 threads and an array of size 128 for all examples after this one.

In [7]:
%%bash
emu-cc -o saxpy.mwx saxpy.c
emusim.x -m 21 --total_nodes 1 -- saxpy.mwx 4 32 5.0 &>/dev/null

In [8]:
!more saxpy.cdc
!more saxpy.vsf

************************************************
Program Name/Arguments: 
saxpy.mwx 
4 
32 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 4, 5, 0

NodeID: num_reads, num_writes, num_rmws
0: 86, 44, 82
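The per-node tables printed above are easy to read by eye, but comparing several runs is simpler once the numbers are in a dictionary. The sketch below is a hypothetical helper, `parse_node_tables`, that assumes only the layout visible in the output above: a `Node ID:` or `NodeID:` header listing column names, followed by rows of the form `<node>: n, n, ...`. It makes no assumption about which of the two files holds which table.

```python
def parse_node_tables(text):
    """Return {node_id: {column_name: value}} merged over all tables in text."""
    stats = {}
    columns = None
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            columns = None  # blank or colon-free line ends the current table
            continue
        key = key.strip()
        fields = [f.strip() for f in rest.split(",")]
        if key.isdigit() and columns and len(fields) == len(columns):
            row = stats.setdefault(int(key), {})
            for name, value in zip(columns, fields):
                row[name] = int(value)
        elif key in ("Node ID", "NodeID"):
            columns = fields  # header line: remember the column names
        else:
            columns = None
    return stats


sample = """\
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 4, 5, 0

NodeID: num_reads, num_writes, num_rmws
0: 86, 44, 82
"""
print(parse_node_tables(sample)[0]["num_reads"])  # 86
```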



In [9]:
%%bash
emusim.x -m 21 --total_nodes 1 -- saxpy.mwx 8 128 5.0 &>/dev/null

In [10]:
!more saxpy.cdc
!more saxpy.vsf

************************************************
Program Name/Arguments: 
saxpy.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 123, 48, 274



## SAXPY with Cilk For

`cilk_for` can be used to launch one thread per loop iteration, much like a traditional OpenMP `#pragma omp parallel for` loop. Note that here the programmer must explicitly specify a grainsize to partition the input array.

In [11]:
Code('saxpy-for.c')

In [12]:
%%bash
emu-cc -o saxpy-for.mwx saxpy-for.c
emusim.x -m 21 --total_nodes 1 -- saxpy-for.mwx 8 128 5.0 &>/dev/null

In [13]:
!more saxpy-for.cdc
!more saxpy-for.vsf 

************************************************
Program Name/Arguments: 
saxpy-for.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 0, 1, 0

NodeID: num_reads, num_writes, num_rmws
0: 25, 9, 25



## SAXPY with Distributed Allocation (1D)

In this example, `memoryweb.h` is included for Lucata-specific distributed allocation routines, and `mw_malloc1dlong` is used to distribute data across the nodes of the system.
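To build intuition for what a 1D distributed allocation means for data placement, the sketch below models a round-robin layout in plain Python. It rests on an assumption that is not stated in this notebook: that consecutive elements are striped across nodes so element `i` lands on node `i % num_nodes`. The helper names `owner_of_element` and `elements_per_node` are hypothetical.

```python
def owner_of_element(index, num_nodes):
    """Assumed round-robin placement: element i lives on node i % num_nodes."""
    if index < 0 or num_nodes <= 0:
        raise ValueError("index must be non-negative and num_nodes positive")
    return index % num_nodes


def elements_per_node(n, num_nodes):
    """Count how many of n elements each node would hold under that layout."""
    counts = [0] * num_nodes
    for i in range(n):
        counts[owner_of_element(i, num_nodes)] += 1
    return counts


print(elements_per_node(128, 8))  # [16, 16, 16, 16, 16, 16, 16, 16]
print(elements_per_node(128, 1))  # [128]
```

With a node count of 1 in this model, every element maps to node 0, so no access would need to leave the node.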

In [14]:
Code('saxpy-1d.c')

In [15]:
%%bash
FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
emu-cc -o saxpy-1d.mwx $FLAGS saxpy-1d.c
emusim.x -m 21 --total_nodes 1 -- saxpy-1d.mwx 8 128 5.0 &>/dev/null

In [16]:
!more saxpy-1d.cdc
!more saxpy-1d.vsf 

************************************************
Program Name/Arguments: 
saxpy-1d.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 129, 63, 261



## SAXPY Distributed Spawn with migrate_hint

The `migrate_hint` call lets the programmer pass a pointer that the next `cilk_spawn` operation uses to jump efficiently to a specific part of a distributed array. Here the migration hint specifies a "directed spawn" to the location where `y[j]` resides. Note that `cilk_spawn_at` serves a similar purpose by combining the spawn and the migration hint into one call.

In [17]:
Code('saxpy-1d-hint.c')

In [18]:
%%bash
FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
emu-cc -o saxpy-1d-hint.mwx $FLAGS saxpy-1d-hint.c
emusim.x -m 21 --total_nodes 1 -- saxpy-1d-hint.mwx 8 128 5.0 &>/dev/null

In [19]:
!more saxpy-1d-hint.cdc
!more saxpy-1d-hint.vsf 

************************************************
Program Name/Arguments: 
saxpy-1d-hint.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 113, 63, 277



## SAXPY with 2D Distributed Allocation

This example shows the usage of `cilk_spawn_at` and 2D block allocation of data across the Lucata nodes. In this case, the number of threads matches the number of blocks, and each thread processes one block's worth of elements.
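The relationship between blocks, threads, and element ranges can be sketched in a few lines of Python. This is an illustrative model only: the helper `block_layout` is hypothetical, it assumes the array length divides evenly into blocks, and it assumes block `b` is placed on node `b % num_nodes`, which this notebook does not confirm.

```python
def block_layout(n, num_blocks, num_nodes):
    """Return one (block_id, node, start, end) tuple per block, end exclusive.

    One thread is imagined per block; placement on node b % num_nodes is assumed.
    """
    if num_blocks <= 0 or num_nodes <= 0:
        raise ValueError("num_blocks and num_nodes must be positive")
    if n % num_blocks != 0:
        raise ValueError("this sketch assumes n is a multiple of num_blocks")
    block_size = n // num_blocks
    return [
        (b, b % num_nodes, b * block_size, (b + 1) * block_size)
        for b in range(num_blocks)
    ]


for block in block_layout(128, 8, 1):
    print(block)  # first line printed: (0, 0, 0, 16)
```

For 8 threads and 128 elements this gives 8 blocks of 16 elements, so each thread's loop covers exactly one block.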

In [20]:
Code('saxpy-2d-spawn-at.c')

In [21]:
%%bash
FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
emu-cc -o saxpy-2d-spawn-at.mwx $FLAGS saxpy-2d-spawn-at.c
emusim.x -m 21 --total_nodes 1 -- saxpy-2d-spawn-at.mwx 8 128 5.0 &>/dev/null

In [22]:
!more saxpy-2d-spawn-at.cdc
!more saxpy-2d-spawn-at.vsf 

************************************************
Program Name/Arguments: 
saxpy-2d-spawn-at.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 400, 444, 46



## SAXPY with Local Allocation

This example shows a variation of the previous 2D code with a local allocation for the output. You will notice that the local allocation for the output (as opposed to 2D allocation) results in more migrations overall. 

In [23]:
Code('saxpy-local-spawn-at.c')

In [24]:
%%bash
FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
emu-cc -o saxpy-local-spawn-at.mwx $FLAGS saxpy-local-spawn-at.c
emusim.x -m 21 --total_nodes 1 -- saxpy-local-spawn-at.mwx 8 128 5.0 &>/dev/null

In [25]:
!more saxpy-local-spawn-at.cdc
!more saxpy-local-spawn-at.vsf 

************************************************
Program Name/Arguments: 
saxpy-local-spawn-at.mwx 
8 
128 
5.0 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 427, 495, 1055



## SAXPY with Replicated Data Structures

Finally, we look at using replication to create copies of the constant variable `a` across all the nodes. This avoids the migrations that would otherwise be needed to read this common variable if it were located on only a single node. Note that replication can be a powerful tool for optimized allocation, but it should be used primarily with small data structures and variables that are read-only (for coherency reasons).
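A toy model can make the benefit concrete. The sketch below is not Lucata code: `reads_needing_migration` is a hypothetical helper that counts how many threads would have to leave their node to read `a`, assuming one read per thread and that a thread migrates whenever the value lives on a different node.

```python
def reads_needing_migration(thread_nodes, home_node=None):
    """Count reads of `a` that would require leaving the reading thread's node.

    thread_nodes: the node each thread runs on.
    home_node: the single node holding `a`, or None to model replication,
               where every node holds its own copy.
    """
    if home_node is None:
        return 0  # replicated: every read is local
    return sum(1 for node in thread_nodes if node != home_node)


threads = [0, 1, 2, 3, 0, 1, 2, 3]  # 8 threads spread over 4 nodes
print(reads_needing_migration(threads, home_node=0))  # 6
print(reads_needing_migration(threads))               # 0
```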

In [26]:
Code('saxpy-1d-replicated.c')

In [27]:
%%bash
FLAGS="-I/tools/lucata/pathfinder-sw/22.02/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.02/lib -lmemoryweb"
emu-cc -o saxpy-1d-replicated.mwx $FLAGS saxpy-1d-replicated.c
emusim.x -m 21 --total_nodes 1 -- saxpy-1d-replicated.mwx 8 128 5 &>/dev/null

In [28]:
!more saxpy-1d-replicated.cdc
!more saxpy-1d-replicated.vsf 

************************************************
Program Name/Arguments: 
saxpy-1d-replicated.mwx 
8 
128 
5 
************************************************
Simulator Version: 22.2.22
************************************************
Configuration Details:
Ring Model = Stratix: 3 GC Clusters, 8 MSPs
Number of Nodes=1
Total Memory (in MiB)=2
Logical MSPs per Node=1
Log2 Memory Size per MSP=21
GC Clusters per Node=3
GCs per Cluster=8
************************************************
************************************************
Simulator wall clock time (seconds): 0
Node ID: Outbound Migrations, Threads Created, Threads Died, Spawn Fails
0: 0, 8, 9, 0

NodeID: num_reads, num_writes, num_rmws
0: 121, 62, 261



### Postscript
Here we have investigated several different strategies for spawning threads and allocating data with the Pathfinder's distributed layout. Note that the simulations we ran did not have fully accurate timing, but the .cdc and .vsf files do give some indication as to which strategies are more efficient for a particular input size and number of threads.
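To compare the runs side by side, the read, write, and read-modify-write counts reported above for the 8-thread, 128-element runs can be collected and ranked. The sketch below copies those numbers from the outputs in this notebook; ranking by the sum of the three counters is an assumption made here for illustration and is only a crude proxy, not a timing measurement.

```python
# (num_reads, num_writes, num_rmws) from the 8-thread, 128-element runs above
runs = {
    "saxpy": (123, 48, 274),
    "saxpy-for": (25, 9, 25),
    "saxpy-1d": (129, 63, 261),
    "saxpy-1d-hint": (113, 63, 277),
    "saxpy-2d-spawn-at": (400, 444, 46),
    "saxpy-local-spawn-at": (427, 495, 1055),
    "saxpy-1d-replicated": (121, 62, 261),
}


def rank_by_total_ops(table):
    """Return (name, total) pairs sorted by total memory operations, ascending."""
    totals = [(name, sum(counts)) for name, counts in table.items()]
    return sorted(totals, key=lambda pair: pair[1])


for name, total in rank_by_total_ops(runs):
    print(f"{name:24s} {total:5d}")
```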

Once we've finished our testing, we can clean up some of the logfiles that we used for this example with `make clean`. Uncomment the following line to clean this directory.

In [1]:
#!make clean