# Step by step: creating a cluster to run NASA NPB MPI benchmarks

## Creating the role NPB

The NPB role is intended to be used to run an NPB MPI benchmark. We will install all required packages, download and compile NPB's IS benchmark (class C), adjust SSH keys between hosts, run the benchmark and get the results. To do this, we will create a role called `npb`. This role will have 3 actions:
* `setup`: which will install all required packages, download NPB (via wget) and unzip it, compile the application (with Make) and set up SSH keys
* `run`: which will generate the hostfile and run the MPI application
* `result`: which will get the output files

### NPB Role file

The NPB role file is placed at `~/.clap/roles/actions.d/npb.yaml`. Three actions are defined in this role: `setup`, `run` and `result`. Its contents are shown below.

In [1]:
!cat ~/.clap/roles/actions.d/npb.yaml

actions:
  setup:
    playbook: roles/npb/setup.yml
    description: Install all necessary packages, download and compile NPB MPI benchmarks
    vars:
    - name: pubkey
      description: Path to the SSH public key to distribute to all nodes
    - name: privkey
      description: Path to the SSH private key to distribute to all nodes

  run:
    playbook: roles/npb/run.yml
    description: Run MPI NAS benchmark at all nodes

  result:
    playbook: roles/npb/result.yml
    description: Fetch results from execution to a local directory
    vars:
    - name: output
      description: Path where results will be placed
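
To make the link between actions and their variables concrete, here is a minimal Python sketch that mirrors the role file as a plain dictionary and reports which declared variables have not been supplied. The `ROLE_ACTIONS` dictionary and the `missing_vars` helper are hypothetical names chosen for this illustration; they are not part of CLAP, and how CLAP itself validates variables is not shown here.

```python
# Hypothetical mirror of npb.yaml: action name -> variables it declares.
ROLE_ACTIONS = {
    "setup": ["pubkey", "privkey"],
    "run": [],
    "result": ["output"],
}


def missing_vars(action, extra):
    """Return the declared variables of `action` that are absent from `extra`."""
    if action not in ROLE_ACTIONS:
        raise KeyError(f"unknown action: {action}")
    return [name for name in ROLE_ACTIONS[action] if name not in extra]


# The setup action declares two variables; only one is given here.
print(missing_vars("setup", {"pubkey": "~/.ssh/id_rsa.pub"}))
# The result action declares a single variable, which is given.
print(missing_vars("result", {"output": "~/experiment-results/"}))
```

A check like this is only a convenience before invoking an action; the role file remains the source of truth for which variables exist.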



### Setup action

The setup action is placed at `~/.clap/roles/roles/npb/setup.yml`. This playbook is executed when the `setup` action is invoked or when a node is added to the role.

The setup action requires 2 variables to be defined:
* `pubkey`: the path to the file with the public key
* `privkey`: the path to the file with the private key

These keys will be copied to all nodes, allowing SSH login between nodes without a password.

The contents of the setup playbook are shown below. The `setup.yml` playbook will:
1. Change the hostname of the nodes (optional)
2. Install several packages to compile and run MPI applications
3. Get and unzip the NPB benchmark
4. Compile the IS benchmark (class C) from a Makefile, using `make`
5. Copy the SSH private and public keys to all hosts

In [2]:
! cat ~/.clap/roles/roles/npb/setup.yml

---
- hosts: all
  # Gather facts can be used to extract all information of remote system (network, disk, cpus, etc)
  # It will be stored in host_vars
  gather_facts: yes
  tasks:

  # Using Ansible's set_fact module to register variables
  # Variables set with set_fact module are visible to CLAP
  # https://docs.ansible.com/ansible/latest/collections/ansible/builtin/set_fact_module.html
  - name: Set some variables
    set_fact:
      home_dir: "{{ ansible_env.HOME }}"        # User's home directory

  # Let's set the name of the host to CLAP's node id to ease debugging
  - name: Changing hostname
    become: yes
    hostname:
      name: "{{ inventory_hostname }}"
     
  # Using Ansible's apt module to update repository cache and install packages
  # https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
  - name: Perform necessary package installation for NPB
    become: yes
    apt:
      update_cache: yes
      state: present
      pkg: 
   

### Run action

The `run` action executes the MPI application on all nodes and waits for it to terminate. The tasks performed are:
* Generate the hostfile with all nodes belonging to the role
* Execute the `mpirun` command at one node

An example hostfile generated for 2 nodes with 2 CPUs each:


`3.215.142.109 slots=2 # node: dd8c2d3f08714837988ee76b9c49556d`

`18.204.42.19 slots=2 # node: aeff1778eb9240679acbc9a12d5be184`
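
The line format can be sketched in a few lines of Python. This is only an illustration of the format, not the code CLAP or Ansible runs; the `hostfile_lines` helper and the dictionary keys `id`, `ip` and `cores` are names chosen for this example.

```python
def hostfile_lines(nodes):
    """Build one hostfile line per node: '<ip> slots=<cores> # node: <id>'."""
    return [
        f"{node['ip']} slots={node['cores']} # node: {node['id']}"
        for node in nodes
    ]


# Values taken from the two-node example above.
nodes = [
    {"id": "dd8c2d3f08714837988ee76b9c49556d", "ip": "3.215.142.109", "cores": 2},
    {"id": "aeff1778eb9240679acbc9a12d5be184", "ip": "18.204.42.19", "cores": 2},
]

for line in hostfile_lines(nodes):
    print(line)
```

Running it prints the two example lines above. The text after `#` is a comment that only records which CLAP node the address belongs to.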

The playbook for the run action was placed at `~/.clap/roles/roles/npb/run.yml` and is shown below.

In [3]:
!cat ~/.clap/roles/roles/npb/run.yml

---
- hosts: all
  # Gather facts can be used to extract all information of remote system (network, disk, cpus, etc)
  # It will be stored in host_vars
  gather_facts: yes
  tasks:
  - name: Remove hostfile
    file:
      path: "{{ ansible_env.HOME }}/hostfile"
      state: absent
  
  - name: Generating hostfile
    lineinfile:
      path: "{{ ansible_env.HOME }}/hostfile"
      line: "{{ hostvars[item].ansible_host }} slots={{ hostvars[item].ansible_processor_cores }} # node: {{ item }}"
      state: present
      create: yes
    with_items: "{{ ansible_play_hosts }}"
    when: ansible_hostname == ansible_play_hosts[0]     # only host 0


  - name: Running MPI
    shell:
      cmd: "mpirun --hostfile ~/hostfile --output-filename execution --tag-output --report-bindings --rank-by slot ~/is.C.x > {{ inventory_hostname }}.output 2>&1"
      chdir: "{{ ansible_env.HOME }}"
    when: ansible_hostname == ansible_play_hosts[0]     # only host 0



### Results action

The `result` action fetches the MPI results from the nodes. It simply copies the application's output files to a directory on the local machine, given by the `output` variable.

The playbook was placed at `~/.clap/roles/roles/npb/result.yml` and is shown below.

In [4]:
!cat ~/.clap/roles/roles/npb/result.yml

---
- hosts: all
  gather_facts: yes
  tasks:
  - name: Find files to copy
    find:
      paths: "{{ ansible_env.HOME }}"
      recurse: no
      patterns: 
      - "execution.*"
      - "hostfile"
      - "*.output"
    register: files_to_fetch

  - name: "Copy output files to {{ output }}"
    fetch:
      src: "{{ item.path }}"
      dest: "{{ output }}"
      flat: yes
    with_items: "{{ files_to_fetch.files }}"
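
To see which files the three patterns select, the matching can be imitated locally with Python's `fnmatch` module. This sketch assumes the patterns are interpreted as shell-style globs, which to my understanding is the default for Ansible's `find` module; the `files_to_fetch` helper and the sample listing are hypothetical.

```python
from fnmatch import fnmatch

# Same shell-style patterns as in the playbook's find task.
PATTERNS = ["execution.*", "hostfile", "*.output"]


def files_to_fetch(filenames):
    """Keep the file names that match at least one pattern."""
    return [
        name for name in filenames
        if any(fnmatch(name, pattern) for pattern in PATTERNS)
    ]


# Hypothetical listing of a node's home directory.
listing = ["execution.1.0", "hostfile", "node-a.output", "is.C.x", "notes.txt"]
print(files_to_fetch(listing))
```

With this listing, the binary `is.C.x` and `notes.txt` are left out and the other three names are kept.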


We can check whether the `npb` role is recognized by CLAP using `clapp role list`.

In [5]:
!clapp role list

* name: commands-common
  Has 7 actions and 0 hosts defined
    actions: copy, fetch, install-packages, reboot, run-command, run-script, update-packages
    hosts: 

* name: npb
  Has 3 actions and 0 hosts defined
    actions: result, run, setup
    hosts: 

Listed 2 roles


## Defining our instance configurations

The instances file used (`~/.clap/configs/instances.yaml`) defines two instance configurations: `type-a` and `type-b`. The former is an AWS `t2.micro` VM and the latter an AWS `t2.medium` VM. The file is shown below.

In [6]:
!cat ~/.clap/configs/instances.yaml

type-a:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.micro
    image_id: ami-07d0cf3af28718ef8
    security_group: otavio-sg

type-b:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.medium
    image_id: ami-07d0cf3af28718ef8
    boot_disk_size: 16
    security_group: otavio-sg


Valid instance configurations can be listed using the `clapp node list-templates` command, as below.

In [7]:
!clapp node list-templates

* name: type-a
    provider config id:` aws-config-us-east-1`
    login config id: `login-ubuntu`

* name: type-b
    provider config id:` aws-config-us-east-1`
    login config id: `login-ubuntu`

Listed 2 instance configs


## Defining our cluster configuration

Once our instance configurations and roles are defined, we create a cluster template that starts nodes, adds them to the `npb` role and executes the role's `run` action (which starts the MPI application on the nodes).

Our cluster template is placed at `~/.clap/configs/clusters/nas-cluster.yml` and is called `npb-cluster`. This cluster defines one node type configuration, called `npb-type-b`, which tells the cluster start command to start 2 `type-b` nodes.

Once the 2 `npb-type-b` nodes are successfully created and reachable, the setup phase begins. On both nodes, the setup called `npb-install` is executed. This setup only adds the nodes to the `npb` role (and performs the role's `setup` action).

After the node setups have been executed on all nodes, the `after_all` phase begins. Setups in this phase are executed on all nodes in the cluster, after all node-specific setups have finished. All nodes then perform the `launch-mpi-npb` setup, which executes the `run` action of the `npb` role. At this point, the application should be running on all nodes of the cluster.

The cluster configuration used is shown below.

In [8]:
!cat ~/.clap/configs/clusters/nas-cluster.yml

setups:
  npb-install:
    roles:
    - name: npb
      extra:
        pubkey: ~/.ssh/id_rsa.pub
        privkey: ~/.ssh/id_rsa

  launch-mpi-npb:
    actions:
    - role: npb
      action: run

clusters:
  npb-cluster:
    nodes:
      npb-type-b:
        type: type-b
        count: 2
        setups:
        - npb-install

    after_all:
    - launch-mpi-npb
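
The ordering of the two phases can be sketched in Python as follows. The `CLUSTER` dictionary mirrors the template above, while `setup_plan` and node names such as `npb-type-b-0` are invented for this illustration (real CLAP node ids are hexadecimal strings). The sketch only shows that node-specific setups come before `after_all` setups; it says nothing about how CLAP schedules or parallelizes them.

```python
# Hypothetical mirror of the npb-cluster template.
CLUSTER = {
    "nodes": {
        "npb-type-b": {"type": "type-b", "count": 2, "setups": ["npb-install"]},
    },
    "after_all": ["launch-mpi-npb"],
}


def setup_plan(cluster):
    """List (node, setup) steps: node-specific setups first, then after_all."""
    nodes = []  # (illustrative node name, node-specific setups)
    for node_type, spec in cluster["nodes"].items():
        for index in range(spec["count"]):
            nodes.append((f"{node_type}-{index}", spec["setups"]))

    # Phase 1: each node runs the setups listed for its node type.
    plan = [(name, setup) for name, setups in nodes for setup in setups]
    # Phase 2: every after_all setup runs on every node of the cluster.
    plan += [(name, setup) for setup in cluster["after_all"] for name, _ in nodes]
    return plan


for step in setup_plan(CLUSTER):
    print(step)
```

For this template the plan has four steps: `npb-install` on both nodes, followed by `launch-mpi-npb` on both nodes.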


We can also use `clapp cluster list-templates` to list all cluster templates recognized by CLAP.

In [9]:
! clapp cluster list-templates

cluster name: npb-cluster
    node types: npb-type-b

Listed 1 templates


We can start the cluster, and set it up to run the application, with the `clapp cluster start` command.

In [10]:
! clapp cluster start npb-cluster

Starting cluster: npb-cluster (perform setup: True)
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Starting 2 type-b instances (timeout 600 seconds)] ***********************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Tagging instances] *******************************************************
changed: [localhost] => (item={'id': 'i-0488e6e959cdeacd7', 'name': 'ElizabethCilley-d5b6c1bf'})
changed: [localhost] => (item={'id': 'i-0d1e49d3048a20620', 'name': 'ShirleyLindon-d3e7ad58'})

PLAY RECAP ******************

We can list the available clusters with the `clapp cluster list` command.

In [11]:
!clapp cluster list

* Cluster: cluster-04cc1ec4f0544d2eb63bbf3706a77228, nickname: MuteMirror, configuration: npb-cluster, creation time: 04-06-21 21:24:30
   Has 2 nodes:
    - 2 npb-type-b nodes: d3e7ad582faa4d33b2421eae96c3a2f6, d5b6c1bfaa6c4db2b8b5055d69ab5197

Listed 1 clusters


Or check which nodes belong to cluster `cluster-04cc1ec4f0544d2eb63bbf3706a77228` with the `clapp cluster nodes` command. The `-q` parameter shows only the node ids.

In [12]:
!clapp cluster nodes cluster-04cc1ec4f0544d2eb63bbf3706a77228 -q

d3e7ad582faa4d33b2421eae96c3a2f6
d5b6c1bfaa6c4db2b8b5055d69ab5197


Finally, we can use the `result` action from the `npb` role to fetch results from the nodes into a directory called `~/experiment-results` on the local machine. The action is executed on the nodes of cluster `cluster-04cc1ec4f0544d2eb63bbf3706a77228`.

In [13]:
!clapp role action npb $(clapp cluster nodes cluster-04cc1ec4f0544d2eb63bbf3706a77228 -q) -a result -e output=~/experiment-results/


PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [d3e7ad582faa4d33b2421eae96c3a2f6]
ok: [d5b6c1bfaa6c4db2b8b5055d69ab5197]

TASK [Find files to copy] ******************************************************
ok: [d5b6c1bfaa6c4db2b8b5055d69ab5197]
ok: [d3e7ad582faa4d33b2421eae96c3a2f6]

TASK [Copy output files to /home/lopani/experiment-results/] *******************
changed: [d3e7ad582faa4d33b2421eae96c3a2f6] => (item={'path': '/home/ubuntu/hostfile', 'mode': '0664', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 1000, 'gid': 1000, 'size': 124, 'inode': 263062, 'dev': 51713, 'nlink': 1, 'atime': 1622852880.298839, 'mtime': 1622852876.6548262, 'ctime': 1622852876.6548262, 'gr_name': 'ubuntu', 'pw_name': 'ubuntu', 'wusr': True, 'rusr': True, 'xusr': Fal

Let's list the result files in the `~/experiment-results/` directory.

In [15]:
!ls ~/experiment-results/ -lha

total 32K
drwxrwxr-x  2 lopani lopani 4,0K jun  4 21:31 .
drwxr-xr-x 43 lopani lopani 4,0K jun  4 21:04 ..
-rw-rw-r--  1 lopani lopani 1,1K jun  4 21:30 d3e7ad582faa4d33b2421eae96c3a2f6.output
-rw-r--r--  1 lopani lopani 1,9K jun  4 21:30 execution.1.0
-rw-r--r--  1 lopani lopani  118 jun  4 21:30 execution.1.1
-rw-r--r--  1 lopani lopani  118 jun  4 21:30 execution.1.2
-rw-r--r--  1 lopani lopani  118 jun  4 21:30 execution.1.3
-rw-rw-r--  1 lopani lopani  124 jun  4 21:30 hostfile


And checking the result files...

In [16]:
!cat ~/experiment-results/hostfile

3.235.68.246 slots=2 # node: d3e7ad582faa4d33b2421eae96c3a2f6
35.170.52.38 slots=2 # node: d5b6c1bfaa6c4db2b8b5055d69ab5197


In [17]:
!cat ~/experiment-results/execution.1.0

[1,0]<stderr>:[d3e7ad582faa4d33b2421eae96c3a2f6:08001] MCW rank 0 is not bound (or bound to all available processors)
[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>: NAS Parallel Benchmarks 3.4 -- IS Benchmark
[1,0]<stdout>:
[1,0]<stdout>: Size:  134217728  (class C)
[1,0]<stdout>: Iterations:   10
[1,0]<stdout>: Total number of processes:  4
[1,0]<stdout>:
[1,0]<stdout>:   iteration
[1,0]<stdout>:        1
[1,0]<stdout>:        2
[1,0]<stdout>:        3
[1,0]<stdout>:        4
[1,0]<stdout>:        5
[1,0]<stdout>:        6
[1,0]<stdout>:        7
[1,0]<stdout>:        8
[1,0]<stdout>:        9
[1,0]<stdout>:        10
[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>: IS Benchmark Completed
[1,0]<stdout>: Class           =                        C
[1,0]<stdout>: Size            =                134217728
[1,0]<stdout>: Iterations      =                       10
[1,0]<stdout>: Time in seconds =                    17.36
[1,0]<stdout>: Total processes =                        4
[1,0]<stdout>: Ac
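
When several runs need to be compared, the elapsed time can be pulled out of such a file programmatically. The following Python sketch strips the `[job,rank]<stream>:` prefix that `--tag-output` adds to every line and then reads the `Time in seconds` field. The `time_in_seconds` helper is a hypothetical name, and the sample text is a three-line excerpt of the output above.

```python
import re

# Prefix added by --tag-output, for example "[1,0]<stdout>:".
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:")


def time_in_seconds(text):
    """Return the 'Time in seconds' value from tagged NPB output, or None."""
    for line in text.splitlines():
        body = TAG.sub("", line).strip()
        if body.startswith("Time in seconds"):
            # The value follows the '=' sign; float() ignores the padding.
            return float(body.split("=", 1)[1])
    return None


SAMPLE = """\
[1,0]<stdout>: Iterations      =                       10
[1,0]<stdout>: Time in seconds =                    17.36
[1,0]<stdout>: Total processes =                        4
"""

print(time_in_seconds(SAMPLE))
```

For the excerpt above this yields `17.36`. Returning `None` when the field is absent makes truncated or failed runs easy to detect.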

## Resizing cluster and running the application again

CLAP supports growing a cluster with the `cluster grow` command, which starts and sets up new cluster nodes. Setups in the `after_all` phase (which run the application) are executed on all nodes, so the application runs again with the new nodes included.

Let's add two more `npb-type-b` nodes to the cluster and run the application again with 4 nodes.

In [21]:
!clapp cluster grow cluster-04cc1ec4f0544d2eb63bbf3706a77228 --node npb-type-b:2

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Starting 2 type-b instances (timeout 600 seconds)] ***********************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Tagging instances] *******************************************************
changed: [localhost] => (item={'id': 'i-04581239a8b77e52a', 'name': 'ElijahPfaff-1c0357d0'})
changed: [localhost] => (item={'id': 'i-0e6e5f02ddfcddde3', 'name': 'DouglasLewis-de24983c'})

PLAY RECAP *********************************************************************

And fetching results...

In [23]:
!clapp role action npb $(clapp cluster nodes cluster-04cc1ec4f0544d2eb63bbf3706a77228 -q) -a result -e output=~/experiment-results-2/


PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [de24983c31a144e78ad79206e0b9e72d]
ok: [d5b6c1bfaa6c4db2b8b5055d69ab5197]
ok: [1c0357d06a76453994e0621fb33727ab]
ok: [d3e7ad582faa4d33b2421eae96c3a2f6]

TASK [Find files to copy] ******************************************************
ok: [de24983c31a144e78ad79206e0b9e72d]
ok: [1c0357d06a76453994e0621fb33727ab]
ok: [d3e7ad582faa4d33b2421eae96c3a2f6]
ok: [d5b6c1bfaa6c4db2b8b5055d69ab5197]

TASK [Copy output files to /home/lopani/experiment-results-2/] *****************
changed: [de24983c31a144e78ad79206e0b9e72d] => (item={'path': '/home/ubuntu/execution.1.6', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 1000, 'gid': 1000, 'size': 118, 'inod

In [25]:
!ls ~/experiment-results-2/ -lha

total 48K
drwxrwxr-x  2 lopani lopani 4,0K jun  4 21:42 .
drwxr-xr-x 44 lopani lopani 4,0K jun  4 21:41 ..
-rw-rw-r--  1 lopani lopani 1,1K jun  4 21:42 d3e7ad582faa4d33b2421eae96c3a2f6.output
-rw-r--r--  1 lopani lopani 1,9K jun  4 21:42 execution.1.0
-rw-r--r--  1 lopani lopani  118 jun  4 21:42 execution.1.1
-rw-r--r--  1 lopani lopani  118 jun  4 21:42 execution.1.2
-rw-r--r--  1 lopani lopani  118 jun  4 21:41 execution.1.3
-rw-r--r--  1 lopani lopani  118 jun  4 21:41 execution.1.4
-rw-r--r--  1 lopani lopani  118 jun  4 21:42 execution.1.5
-rw-r--r--  1 lopani lopani  118 jun  4 21:41 execution.1.6
-rw-r--r--  1 lopani lopani  118 jun  4 21:42 execution.1.7
-rw-rw-r--  1 lopani lopani  250 jun  4 21:41 hostfile


In [26]:
!cat ~/experiment-results-2/hostfile

3.235.68.246 slots=2 # node: d3e7ad582faa4d33b2421eae96c3a2f6
35.170.52.38 slots=2 # node: d5b6c1bfaa6c4db2b8b5055d69ab5197
3.235.248.132 slots=2 # node: 1c0357d06a76453994e0621fb33727ab
35.170.67.203 slots=2 # node: de24983c31a144e78ad79206e0b9e72d
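
The `mpirun` command in the run playbook passes no explicit process count. Assuming the MPI implementation is Open MPI (an assumption suggested by options such as `--tag-output` and `--report-bindings`), one process is started per slot, so the number of processes can be estimated by summing the `slots` values in the hostfile. The `total_slots` helper below is a hypothetical sketch of that calculation.

```python
def total_slots(hostfile_text):
    """Sum the slots=N values over all lines of a hostfile."""
    total = 0
    for line in hostfile_text.splitlines():
        entry = line.split("#", 1)[0]  # drop the trailing comment
        for field in entry.split():
            if field.startswith("slots="):
                total += int(field[len("slots="):])
    return total


HOSTFILE = """\
3.235.68.246 slots=2 # node: d3e7ad582faa4d33b2421eae96c3a2f6
35.170.52.38 slots=2 # node: d5b6c1bfaa6c4db2b8b5055d69ab5197
3.235.248.132 slots=2 # node: 1c0357d06a76453994e0621fb33727ab
35.170.67.203 slots=2 # node: de24983c31a144e78ad79206e0b9e72d
"""

print(total_slots(HOSTFILE))
```

For this hostfile the sum is 8, which agrees with the eight `execution.1.*` files listed above.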


In [27]:
!cat ~/experiment-results-2/execution.1.0

[1,0]<stderr>:[d3e7ad582faa4d33b2421eae96c3a2f6:08615] MCW rank 0 is not bound (or bound to all available processors)
[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>: NAS Parallel Benchmarks 3.4 -- IS Benchmark
[1,0]<stdout>:
[1,0]<stdout>: Size:  134217728  (class C)
[1,0]<stdout>: Iterations:   10
[1,0]<stdout>: Total number of processes:  8
[1,0]<stdout>:
[1,0]<stdout>:   iteration
[1,0]<stdout>:        1
[1,0]<stdout>:        2
[1,0]<stdout>:        3
[1,0]<stdout>:        4
[1,0]<stdout>:        5
[1,0]<stdout>:        6
[1,0]<stdout>:        7
[1,0]<stdout>:        8
[1,0]<stdout>:        9
[1,0]<stdout>:        10
[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>: IS Benchmark Completed
[1,0]<stdout>: Class           =                        C
[1,0]<stdout>: Size            =                134217728
[1,0]<stdout>: Iterations      =                       10
[1,0]<stdout>: Time in seconds =                    12.83
[1,0]<stdout>: Total processes =                        8
[1,0]<stdout>: Ac
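
The two reported timings (17.36 s with 4 processes and 12.83 s with 8 processes) can be compared with a short calculation. This Python sketch computes the speedup and the efficiency relative to the 4-process run; `speedup` and `relative_efficiency` are names chosen for this example. Each figure comes from a single run on cloud VMs, so the result is indicative only.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def relative_efficiency(baseline_seconds, new_seconds, baseline_procs, new_procs):
    """Speedup divided by the factor by which the process count grew."""
    return speedup(baseline_seconds, new_seconds) / (new_procs / baseline_procs)


t4, t8 = 17.36, 12.83  # seconds, from the two benchmark outputs
print(f"speedup: {speedup(t4, t8):.2f}")
print(f"relative efficiency: {relative_efficiency(t4, t8, 4, 8):.2f}")
```

With these numbers the speedup is about 1.35 and the relative efficiency about 0.68, so doubling the number of processes did not halve the run time.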

## Terminating Cluster

In [28]:
!clapp cluster stop cluster-04cc1ec4f0544d2eb63bbf3706a77228

Stopping cluster `cluster-04cc1ec4f0544d2eb63bbf3706a77228`...
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Stopping nodes DouglasLewis, ElijahPfaff, ElizabethCilley, ShirleyLindon] ***
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Cluster `cluster-04cc1ec4f0544d2eb63bbf3706a77228` stopped!


In [29]:
!clapp cluster list

Listed 0 clusters
