# Notebook #3 - Pathfinder Workflow with SAXPY

Initial steps: We must set several environment variables that point to the user's notebook code directory and to the Lucata tools.

In [1]:
%load_ext slurm_magic
import os
from IPython.display import Code

#Get the path to where all code samples are
os.environ["USER_NOTEBOOK_CODE"]=os.path.dirname(os.getcwd())
os.environ["X86FLAGS"] = "-I/tools/lucata/pathfinder-sw/22.09-beta/include/cilk/ -I/tools/emu/pathfinder-sw/22.09-beta/x86/include/emu_c_utils /tools/emu/pathfinder-sw/22.09-beta/x86/lib/libemu_c_utils.a"

os.environ["PATH"]=os.pathsep.join(["/tools/emu/pathfinder-sw/22.09-beta/bin",os.environ["PATH"]])
os.environ["FLAGS"]="-I/tools/lucata/pathfinder-sw/22.09-beta/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.09-beta/lib -lmemoryweb"
os.environ["LUCATA_BASE"]="/tools/emu/pathfinder-sw/22.09-beta/"

!echo $USER_NOTEBOOK_CODE
#Print out which Emu compiler, emu-cc, we are using
!which emu-cc
#Print out the compiler flags we need to use the Lucata memoryweb headers and library
!echo "Lucata compilation flags are $FLAGS"

/nethome/plavin3/lucata-pathfinder-tutorial/code
/tools/emu/pathfinder-sw/22.09-beta/bin/emu-cc
Lucata compilation flags are -I/tools/lucata/pathfinder-sw/22.09-beta/include/memoryweb/ -L/tools/lucata/pathfinder-sw/22.09-beta/lib -lmemoryweb
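
Before moving on, it can help to confirm that the setup cell actually took effect. The following is a minimal sketch, not part of the Lucata toolchain: the helper name `check_environment` is hypothetical, and it only checks the variable names and the `emu-cc` tool that the setup cell above refers to.

```python
import os
import shutil

# Variable names are the ones exported in the setup cell above.
REQUIRED_VARS = ["USER_NOTEBOOK_CODE", "X86FLAGS", "FLAGS", "LUCATA_BASE"]


def check_environment(required=REQUIRED_VARS, tool="emu-cc", env=None):
    """Return a list of human-readable problems; an empty list means all checks passed.

    This helper is hypothetical and only illustrates the idea.
    """
    env = os.environ if env is None else env
    # Report every required variable that is missing or empty.
    problems = [f"{name} is not set" for name in required if not env.get(name)]
    # shutil.which searches the given PATH string for an executable.
    if shutil.which(tool, path=env.get("PATH", "")) is None:
        problems.append(f"{tool} not found on PATH")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If the cell prints nothing, the variables are set and `emu-cc` was found on the `PATH`.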


This notebook accompanies the [Lucata workflow slides](https://github.com/gt-crnch-rg/pearc-tutorial-2021/blob/main/slides/lucata_tutorial/02_Lucata_Pathfinder_Tutorial_Workflow.pdf); refer to the slides as a supplemental resource.

## Lucata Pathfinder Workflow

This figure shows the workflow for using the Pathfinder ecosystem and hardware. Since the Pathfinder is programmed using a variant of the Cilk programming language, code written for this platform can be run on x86 systems using the Lucata toolchain, some GCC versions (5-7), or an appropriate Clang branch like [MIT's Tapir](https://www.csail.mit.edu/research/tapir).

![Lucata Workflow](../resources/figs/lucata_pathfinder_workflow.png)

This notebook takes one of our previous SAXPY examples and uses it as part of a workflow that shows how to run code on an x86 system, then on the simulator, and finally on the hardware. As a reminder, we are using the basic SAXPY "1D allocation" kernel from Notebook 2.
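
As a quick refresher on what the kernel computes, SAXPY evaluates `a * x[i] + y[i]` for every element. The snippet below is a plain-Python reference sketch for checking small results by hand. It is not the notebook's C/Cilk kernel, and the function name `saxpy` and its list-based interface are assumptions made for illustration.

```python
def saxpy(a, x, y):
    """Return a new list holding a * x[i] + y[i] for each index i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


print(saxpy(5.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # prints [15.0, 30.0, 45.0]
```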

### X86 Execution
As a first step, we need to update the code slightly so that it can be compiled for x86 platforms using "memoryweb_x86.h". This compatibility header tells the compiler to target an x86 variant of Cilk rather than the Lucata version. The main difference is that some Lucata-specific commands, such as `cilk_spawn_at`, don't exist in most standard x86 Cilk APIs.

Also, note the inclusion of the "emu_c_utils" header, which provides additional helper functions.

In [4]:
Code('saxpy-1d-workflow.c')

We can then compile this code for execution on an x86 system as follows:

In [18]:
%%bash
set -x
gcc ${X86FLAGS} -fcilkplus -DX86 saxpy-1d-workflow.c -o saxpy-1d-workflow-x86

+ gcc -I/tools/lucata/pathfinder-sw/22.09-beta/include/cilk/ -I/tools/emu/pathfinder-sw/22.09-beta/x86/include/emu_c_utils /tools/emu/pathfinder-sw/22.09-beta/x86/lib/libemu_c_utils.a -fcilkplus -DX86 saxpy-1d-workflow.c -o saxpy-1d-workflow-x86


In [19]:
%%bash
CILK_NWORKERS=4 ./saxpy-1d-workflow-x86 8 128 5

SAXPY complete!


### Simulator Execution

Once we have tested our program with x86 Cilk execution, we can proceed to test with the Lucata simulator, `emusim.x`. The simulator is single-threaded and operates on a detailed SystemC model of the Pathfinder system. As such, it is somewhat slow and should normally be used in "untimed" mode to verify functionality. Since the Pathfinder hardware does not currently include a debugger or runtime profiler, the simulator should also be used to debug issues and to do basic performance profiling.

![Lucata Workflow](../resources/figs/lucata_pathfinder_workflow_2_emusim.png)

The best way to limit the runtime of the simulator is to run in "untimed" mode, meaning that no relative clocks are used to estimate performance of the application. You can do this by either commenting out any `starttiming()` calls in your code or using the `--ignore_starttiming` flag when simulating your code.
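
When switching between simulator configurations, it can be convenient to assemble the command line in one place. The sketch below is a hypothetical helper (`emusim_command` is not part of the Lucata tools) that only builds an argument list. It uses flags that appear in this notebook (`--ignore_starttiming`, `--total_nodes`, `--initialize_memory`) and assumes they all go before the `--` separator, as in the `--total_nodes` cell.

```python
import shlex


def emusim_command(program, program_args=(), total_nodes=None,
                   ignore_starttiming=False, initialize_memory=False):
    """Build an emusim.x argument list (hypothetical helper; nothing is executed)."""
    argv = ["emusim.x"]
    if total_nodes is not None:
        argv += ["--total_nodes", str(total_nodes)]
    if ignore_starttiming:
        argv.append("--ignore_starttiming")  # run untimed even if starttiming() is called
    if initialize_memory:
        argv.append("--initialize_memory")
    # Everything after "--" is the program and its own arguments.
    argv += ["--", program]
    argv += [str(arg) for arg in program_args]
    return argv


cmd = emusim_command("saxpy-1d-workflow.mwx", [8, 128, 5.0], ignore_starttiming=True)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` on a machine where `emusim.x` is installed, or printed and pasted into a `%%bash` cell.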

In [22]:
%%bash
emu-cc -o saxpy-1d-workflow.mwx $FLAGS saxpy-1d-workflow.c -lemu_c_utils

In [23]:
%%bash
emusim.x -- saxpy-1d-workflow.mwx 8 128 5.0

Start untimed simulation with local date and time= Mon Sep 19 16:13:29 2022

SAXPY complete!
End untimed simulation with local date and time= Mon Sep 19 16:13:29 2022




        SystemC 2.3.3-Accellera --- Sep  7 2022 09:15:59
        Copyright (c) 1996-2018 by all Contributors,
        ALL RIGHTS RESERVED


In [25]:
%%bash
emusim.x --total_nodes 4 -- saxpy-1d-workflow.mwx 8 128 5.0

Start untimed simulation with local date and time= Mon Sep 19 16:14:07 2022

SAXPY complete!
End untimed simulation with local date and time= Mon Sep 19 16:14:07 2022




        SystemC 2.3.3-Accellera --- Sep  7 2022 09:15:59
        Copyright (c) 1996-2018 by all Contributors,
        ALL RIGHTS RESERVED


### Hardware Execution

Once our code works with the simulator and we feel we have optimized it enough, it is time to start running on the hardware. You should first run your code on a single node of the Pathfinder and then scale up to multi-node execution. Note that the `emu_handler_and_loader` command is meant for single-node execution, while `emu_multinode_exec` is meant for multi-node execution. All multi-node jobs must be run from node 0 in the Pathfinder system.

![Lucata Workflow](../resources/figs/lucata_pathfinder_workflow_3_emuhw.png)
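
For single-node runs submitted through Slurm, the nested quoting of the `--wrap` string is easy to get wrong. The following sketch shows one way to build that submission as an argument list. The helper name `sbatch_single_node` is hypothetical; the cluster name, QOS, and the leading `0 0` arguments to `emu_handler_and_loader` are copied verbatim from this notebook's single-node cell and may differ on other systems.

```python
import shlex


def sbatch_single_node(program, program_args, cluster="pathfinder", qos="single-node"):
    """Build an sbatch argument list for a single-node run (nothing is submitted)."""
    # Inner command as used in this notebook; the meaning of "0 0" is not documented here.
    inner = ["emu_handler_and_loader", "0", "0", "--", program]
    inner += [str(arg) for arg in program_args]
    # shlex.join quotes the inner command so it survives as one --wrap argument.
    return ["sbatch", "-M", cluster, "-q", qos, "--wrap", shlex.join(inner)]


print(shlex.join(sbatch_single_node("saxpy-1d-workflow.mwx", [8, 128, 5.0])))
```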

In [28]:
%%bash
#ssh pathfinder1.crnch.gatech.edu
#ssh n0
cd ${USER_NOTEBOOK_CODE}/03-saxpy-workflow

#emu_handler_and_loader saxpy-1d-workflow.mwx 8 128 5.0
sbatch -M pathfinder -q single-node --wrap "emu_handler_and_loader 0 0 -- saxpy-1d-workflow.mwx 8 128 5.0"


#emu_multinode_exec saxpy-1d-workflow.mwx 8 128 5.0

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-db.cc.gatech.edu:6819: Connection timed out
sbatch: error: Sending PersistInit msg: Connection timed out
sbatch: error: Sending PersistInit msg: Connection timed out
sbatch: error: DBD_GET_CLUSTERS failure: Connection timed out
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection timed out.  Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.


CalledProcessError: Command 'b'#ssh pathfinder1.crnch.gatech.edu\n#ssh n0\ncd ${USER_NOTEBOOK_CODE}/03-saxpy-workflow\n\n#emu_handler_and_loader saxpy-1d-workflow.mwx 8 128 5.0\nsbatch -M pathfinder -q single-node --wrap "emu_handler_and_loader 0 0 -- saxpy-1d-workflow.mwx 8 128 5.0"\n\n\n#emu_multinode_exec saxpy-1d-workflow.mwx 8 128 5.0\n'' returned non-zero exit status 1.

In [6]:
%sinfo

| | PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
|---|---|---|---|---|---|---|
| 0 | rg-arm-debug* | up | 4:00:00 | 4 | idle | octavius[1-4] |
| 1 | rg-arm-long | up | 12:00:00 | 16 | idle | octavius[1-16] |
| 2 | rg-gpu | up | 12:00:00 | 7 | idle | frozone[1-4],instinct,quorra[1-2] |
| 3 | rg-hpc | up | 12:00:00 | 3 | idle | flubber[8-10] |
| 4 | rg-intel-fpga-hw | up | 12:00:00 | 2 | idle | flubber[2-3] |
| 5 | rg-xilinx-fpga-hw | up | 12:00:00 | 2 | idle | flubber[4-5] |
| 6 | rg-smart-nic | up | 12:00:00 | 2 | idle | flubber[6-7] |
| 7 | notebook | up | 12:00:00 | 1 | idle | hawksbill |


## Debugging Strategy for the Pathfinder

If you've made it this far, you likely have some idea of the key steps for compiling and running programs with the Pathfinder. 
As an important additional note, Eric Hein shares the following debugging strategy, which builds on the x86 and simulator models to 
ensure correct program execution and scaling.

1) Compile on x86 (using memoryweb_x86.h)  
2) Run on x86 with a single thread (CILK_NWORKERS=1)  
3) Run on x86 multi-threaded  
4) Compile for Emu  
5) Run on emusim in untimed mode (--ignore_starttiming)  
6) Run on emusim with randomly initialized memory (--initialize_memory)  
7) Run on emusim in timed mode (starttiming() in code)  
8) Run on single-node HW  
9) Run on multi-node HW  
10) Increase input size gradually. Always use the smallest input set that will finish/recreate the problem in a reasonable amount of time!
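
The key property of the strategy above is that each step runs only after the previous one succeeds. A minimal sketch of that idea is shown below, assuming each step can be expressed as a command line. The function name `run_stages` is hypothetical, and the demo uses placeholder Python commands so that it runs anywhere; in practice each entry would be one of the compile or run commands from this notebook.

```python
import subprocess
import sys


def run_stages(stages):
    """Run (name, argv) pairs in order; return the first failing stage name, or None."""
    for name, argv in stages:
        print(f"[stage] {name}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Stop here so later, more expensive stages never run on broken code.
            return name
    return None


# Placeholder commands standing in for the real compile and run steps.
demo = [
    ("always passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("always fails", [sys.executable, "-c", "raise SystemExit(3)"]),
    ("never reached", [sys.executable, "-c", "raise SystemExit(0)"]),
]
print("first failure:", run_stages(demo))
```

Stopping at the first failure keeps the feedback loop short: a bug that shows up in the single-threaded x86 run never reaches the much slower simulator or the shared hardware.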

### Postscript

Once we've finished testing, we can clean up the log files from this example with `make clean`. Uncomment the following line to clean this directory.

In [2]:
#!make clean