# Description of the artifacts TODO

TODO

## Set up Chameleon

A bare metal node with an NVIDIA GPU is needed:

- One node of type "gpu_p100", "gpu_rtx_6000", "gpu_v100" or "gpu_a100" ([see all types](https://chameleoncloud.readthedocs.io/en/latest/technical/reservations.html#chameleon-node-types))
- One public IP

### Configuration

If you are not a member of `CHI-241321`, enter your own project ID in the code block below.

In [1]:
import chi

chi.use_site("CHI@UC")

# Change to your project (CHI-XXXXXX)
chi.set("project_name", "CHI-241321")

print(f'Using Project {chi.get("project_name")}')

Now using CHI@UC:
URL: https://chi.uc.chameleoncloud.org
Location: Argonne National Laboratory, Lemont, Illinois, USA
Support contact: help@chameleoncloud.org
Using Project CHI-241321


### Create reservation

Chameleon resources need to be reserved before they can be used. 
We will reserve one bare metal node and one public IP address.

If you get an error such as "no host available", all nodes of the requested type may already be reserved. Check the availability calendar to confirm:
https://chi.uc.chameleoncloud.org/project/leases/calendar/host/

It may take about a minute for your lease to become active.

In [3]:
import os
import keystoneauth1, blazarclient
from chi import lease


hours = 0.3
# lease_node_type = "gpu_p100" # gpu_rtx_6000
lease_node_type = "compute_skylake"

reservations = []
try:
    print("Creating lease...")
    lease.add_fip_reservation(reservations, count=1)
    lease.add_node_reservation(reservations, node_type=lease_node_type, count=1)

    start_date, end_date = lease.lease_duration(days=0,hours=hours)
    print(start_date, end_date)
    l = lease.create_lease(
        f"{os.getenv('USER')}-kmc", 
        reservations, 
        start_date=start_date, 
        end_date=end_date
    )
    lease_id = l["id"]

    print("Waiting for lease to start ...")
    lease.wait_for_active(lease_id)
    print("Lease is now active!")
except keystoneauth1.exceptions.http.Unauthorized as e:
    print("Unauthorized.\nDid you set your project name and run the code in the first cell?")
except blazarclient.exception.BlazarClientException as e:
    print(f"There is an issue making the reservation. Check the calendar to make sure a {lease_node_type} node is available.")
    print("https://chi.uc.chameleoncloud.org/project/leases/calendar/host/")
    print(e)
except Exception as e:
    print("An unexpected error happened.")
    print(e)

Creating lease...
2024-06-20 17:26 2024-06-20 17:43
Waiting for lease to start ...
Lease is now active!


### Provision bare metal node

Next, we will launch the reserved node with an image. 
It will take approximately 10 minutes for the bare metal node to be successfully provisioned. 

We use an image of Ubuntu 22.04 with CUDA installed, provided by Chameleon:
https://www.chameleoncloud.org/appliances/109/

In [4]:
from chi import server

image = "CC-Ubuntu22.04-CUDA"

s = server.create_server(
    f"{os.getenv('USER')}-kmc", 
    image_name=image,
    reservation_id=lease.get_node_reservation(lease_id)
)

print("Waiting for server to start ...")
server.wait_for_active(s.id)
print("Done")

Waiting for server to start ...
Done
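
Because provisioning takes approximately 10 minutes and the lease cell sets `hours = 0.3`, it can help to check how much of the lease remains once the node is up. The following is a minimal sketch of that arithmetic using only the standard library. The helper name `remaining_after_provisioning` is hypothetical and is not part of `python-chi`, and the 10-minute figure is the estimate quoted in this section, not a measured value.

```python
from datetime import timedelta


def remaining_after_provisioning(lease_hours, provisioning_minutes=10.0):
    """Return the lease time left after the node has been provisioned.

    Hypothetical helper for illustration; provisioning_minutes defaults to
    the approximate figure quoted in this notebook.
    """
    lease_length = timedelta(hours=lease_hours)
    provisioning = timedelta(minutes=provisioning_minutes)
    return lease_length - provisioning


# Same lease length as in the reservation cell (hours = 0.3, i.e. 18 minutes).
left = remaining_after_provisioning(0.3)
print(f"Time left for setup and experiment: {left.total_seconds() / 60:.0f} minutes")
```

With these numbers, roughly 8 minutes remain for the setup script and the experiment, so a larger `hours` value may be needed for a full run.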


By default, our node is only connected to a private network and is therefore not reachable over the internet or from this Jupyter environment. We need to associate a "Floating IP" with the node, which gives it the public address we reserved.

In [7]:
floating_ip = lease.get_reserved_floating_ips(lease_id)[0]
server.associate_floating_ip(s.id, floating_ip_address=floating_ip)

print(f"Waiting for SSH connectivity on {floating_ip} ...")
timeout = 60*2
import socket
import time
# Repeatedly try to connect via SSH.
start_time = time.perf_counter()
while True:
    try:
        with socket.create_connection((floating_ip, 22), timeout=timeout):
            print("Connection successful")
            break
    except OSError as ex:
        # Give up once the overall timeout has elapsed; otherwise wait and retry.
        if time.perf_counter() - start_time >= timeout:
            print(f"After {timeout} seconds, could not connect via SSH. Please try again.")
            break
        time.sleep(10)


Waiting for SSH connectivity on 192.5.87.221 ...
Connection successful
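
The retry loop above can also be written as a reusable function that returns a boolean instead of printing. This is a minimal sketch using only the standard library. The function name `wait_for_port` and its parameters (`total_timeout`, `attempt_timeout`, `retry_delay`) are assumptions for illustration and are not part of `python-chi`.

```python
import socket
import time


def wait_for_port(host, port, total_timeout=120.0, attempt_timeout=5.0, retry_delay=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within total_timeout seconds.
    """
    deadline = time.monotonic() + total_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Never let a single attempt run past the overall deadline.
            with socket.create_connection((host, port), timeout=min(attempt_timeout, remaining)):
                return True
        except OSError:
            # Wait before retrying, but do not sleep past the deadline.
            time.sleep(max(0.0, min(retry_delay, deadline - time.monotonic())))
```

In this notebook it could be called as `wait_for_port(floating_ip, 22)`, with the caller deciding what to do when it returns `False`.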


### Configure the instance

We provisioned the instance with a fresh base image, which comes with many useful tools but not everything we need for this experiment. See `setup.sh` for details; we will execute this setup script remotely on our server. Alternatively, we could create a snapshot of a configured instance. A snapshot copies everything on the instance and can be reused to launch new instances.

Note: a warning may appear when running this cell; it can be safely ignored.

In [11]:
from chi import ssh

with ssh.Remote(floating_ip) as conn:
    # Upload the script
    conn.put("setup.sh")
    # Run the script
    conn.run("bash setup.sh")

DeviceKMC
README
my_mounting_point
openrc
setup.sh
test



## Run and visualize the experiment

This section copies the experiment runner script to the server, runs the experiment, and saves the data.

Afterwards, the measurements are downloaded and plotted.

### 3) Matrix Split for CG

<img src="T_matrix.png" alt="drawing" width="600"/>

Figure 7: (a) Schematic representation of the connections between neighboring (blue) and non-neighboring (green) sites which lead to fixed and varying subsets of the sparsity pattern of $[G]$. (b) Scaling of the CG solver for **Eqn. 5** on Piz Daint, for $[G]$ stored as a single CSR matrix (full) or split into two sparse matrices (split). The same number of CG steps was timed (until the norm of the residual matched a convergence criterion of $1\times10^{-15}\times N_{atom}$). Error bars correspond to the 95% confidence interval over 35 measurements. The speedups are plotted with reference to the single-node runtime of the 'full' version. (c) Schematic of the extra communication overlap possible within the SpMV step when $[G]$ is stored in split format, where the SpMV with $G_1$ can occur in parallel with communicating the indices required for the SpMV of $G_2$.

This experiment replicates Figure 7(b) on a single node due to limited resources on Chameleon.
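
The caption describes two derived quantities: a 95% confidence interval over repeated measurements, and a speedup relative to the runtime of the 'full' version. The sketch below shows one way these could be computed from a list of timings. It assumes a normal approximation (z = 1.96) for the interval, which the source does not specify, and the timing values are placeholders, not measured data. The function names are hypothetical.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, half-width of the 95% CI) using a normal approximation."""
    mean = statistics.fmean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * standard_error


def speedup(reference_runtime, runtime):
    """Speedup of `runtime` relative to `reference_runtime` (larger is faster)."""
    return reference_runtime / runtime


# Placeholder timings in seconds, NOT measured results.
full_times = [2.0, 2.1, 1.9, 2.0]
split_times = [1.0, 1.1, 0.9, 1.0]

full_mean, full_ci = mean_with_ci95(full_times)
split_mean, split_ci = mean_with_ci95(split_times)

print(f"full:  {full_mean:.3f} s +/- {full_ci:.3f} s")
print(f"split: {split_mean:.3f} s +/- {split_ci:.3f} s")
print(f"speedup of split over full: {speedup(full_mean, split_mean):.2f}x")
```

For small sample counts, a Student's t quantile would give a slightly wider interval than the fixed 1.96 factor used here.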

In [None]:
with ssh.Remote(floating_ip) as conn:
    # Upload the script
    conn.put("run_experiment_3.sh")
    conn.run("bash run_experiment_3.sh")

After the experiment terminates, we can download the results and extract them to our local output directory for analysis in Jupyter.

In [None]:
import tarfile

with ssh.Remote(floating_ip) as conn:
    # Download the output
    conn.get("out/latest.tar.gz")
with tarfile.open("latest.tar.gz") as tar:
    # Extract the results to our notebook
    tar.extractall()
print("done")
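
The extraction step can be made a little more defensive by unpacking into a dedicated directory and listing the archive members. The following is a minimal sketch under those assumptions. The helper name `extract_results` and the default directory name `results` are hypothetical, and the `filter="data"` argument is only used when the running Python version provides it.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, dest_dir="results"):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe members,
            # such as absolute paths or links pointing outside dest.
            tar.extractall(path=dest, filter="data")
        else:
            tar.extractall(path=dest)
    return names
```

After the download cell has run, `extract_results("latest.tar.gz")` would return the list of extracted file names, which gives a quick check that the expected measurement files arrived.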