# 🧪 Reproducible Slurm Cluster Deployment with Infrastructure Manager

This notebook demonstrates how to use the [Infrastructure Manager (IM)](https://imdocs.readthedocs.io/en/devel/) to deploy a Slurm cluster on [Chameleon Cloud](https://www.chameleoncloud.org/) via the OpenStack-based site **KVM@TACC**.

**Goals:**
- Install the IM client
- Deploy a Slurm cluster using a TOSCA template
- Submit a test job to Slurm and retrieve the output
- Destroy the deployed infrastructure

## 🛠️ Step 1: Install the Infrastructure Manager CLI

In [None]:
%pip install im-client

## 🔐 Step 2: Authentication File Setup

The IM client uses an authentication file. The file, `imcham-auth.dat`, has two lines:

- **Infrastructure Manager credentials** - Obtain an access token from the [EGI Check-in Token Portal](https://aai.egi.eu/token) and save it in the `token` field on the first line of the authentication file. Access tokens expire after one hour.

![title](images/egi-portal.png)

- **Chameleon Cloud credentials** - To get the Chameleon credentials, go to the [Chameleon Cloud page](https://chameleoncloud.org/) and follow these steps:
> 1. Create a user account in Chameleon.
> 2. Since we are interested in OpenStack, select KVM@TACC (under Experiments), an OpenStack site provided by the Texas Advanced Computing Center (TACC) based on the KVM (Kernel-based Virtual Machine) hypervisor.
>
> ![title](images/chameleon-kvmtacc.png)
>
> 3. In the Identity section, one can create Application Credentials. Select **Create application credential**.
>
> ![title](images/chameleon-app-cred.png)
>> Application Credentials allow user applications to authenticate to keystone. With application credentials, applications authenticate with the application credential ID and a secret string which is not the user’s password. This way, the user’s password is not embedded in the application’s configuration.
>
> 4. In the name field, specify a name and leave the rest of the options as default. 
>
> ![title](images/chameleon-create-cred.png)
>
> Once you have obtained the application credentials, you can download the openrc file or the clouds.yaml file. The Project ID is also shown in the list of credentials after you close the pop-up.

Now replace the second line of the authentication file with the credentials obtained from Chameleon.
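
As a rough sketch of what the finished file might contain, the snippet below assembles the two lines from placeholder values. The field names (`id`, `type`, `token`, `host`, `username`, `password`, `tenant`, `auth_version`), the `3.x_appcred` value, the Keystone URL, and the helper name `build_auth_lines` are assumptions based on the general IM authentication-file format. Check them against the IM documentation and your downloaded openrc file before use.

```python
def build_auth_lines(im_token, app_cred_id, app_cred_secret, project,
                     keystone_url="https://kvm.tacc.chameleoncloud.org:5000"):
    """Return the two lines of an IM authentication file.

    All field names and the default Keystone URL are assumptions, not
    values confirmed by this notebook.
    """
    # Line 1: credentials for the Infrastructure Manager itself.
    im_line = f"id = im; type = InfrastructureManager; token = {im_token}"
    # Line 2: OpenStack application credentials for the Chameleon site.
    ost_line = (
        "id = ost; type = OpenStack; "
        f"host = {keystone_url}; "
        f"username = {app_cred_id}; "
        f"password = {app_cred_secret}; "
        f"tenant = {project}; "
        "auth_version = 3.x_appcred"
    )
    return [im_line, ost_line]


# Placeholder values only; substitute your own before writing the file.
lines = build_auth_lines("<egi-access-token>", "<app-cred-id>",
                         "<app-cred-secret>", "<project>")
print("\n".join(lines))
```

Writing the result to `imcham-auth.dat` is left out on purpose so that an existing file is not overwritten by placeholders.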


## 📄 Step 3: Define the Infrastructure Template

Below is a sample definition for a Slurm cluster infrastructure using a TOSCA template.

We save it to a file called `slurm-cluster.yaml`.

In [7]:
slurm_cluster_recipe = """
tosca_definitions_version: tosca_simple_yaml_1_0

description: Deploy a SLURM Virtual Cluster.
imports:
- grycap_custom_types: https://raw.githubusercontent.com/grycap/tosca/main/custom_types.yaml
metadata:
  childs: []
  display_name: SLURM virtual cluster
  filename: slurm_cluster.yml
  icon: images/slurm.png
  infra_name: slurm_test
  order: 4
  tabs:
    FE Node Features: fe_.*
    SLURM Features: slurm_.*
    WNs Features: wn_.*
  template_name: SLURM
  template_version: 1.1.2
topology_template:
  inputs:
    fe_cpus:
      default: 1
      description: Number of CPUs for the front-end node
      required: true
      type: integer
    fe_disk_size:
      constraints:
      - valid_values:
        - 0 GiB
        - 10 GiB
        - 20 GiB
        - 50 GiB
        - 100 GiB
        - 200 GiB
        - 500 GiB
        - 1 TiB
        - 2 TiB
      default: 0 GiB
      description: Size of the disk to be attached to the FE instance (Set 0 if disk
        is not needed)
      type: scalar-unit.size
    fe_mem:
      default: 1 GiB
      description: Amount of Memory for the front-end node
      required: true
      type: scalar-unit.size
    fe_mount_path:
      default: /home/data
      description: Path to mount the FE attached disk
      type: string
    fe_ports:
      default:
        port_22:
          protocol: tcp
          source: 22
      description: 'List of ports to be Opened in FE node (eg. 22,80,443,2000:2100).

        You can also include the remote CIDR (eg. 8.8.0.0/24).

        '
      entry_schema:
        type: PortSpec
      type: map
    fe_volume_id:
      default: ''
      description: 'Or URL of the disk to be attached to the FE instance (format:
        ost://api.cloud.ifca.es/'
      type: string
    slurm_version:
      constraints:
      - valid_values:
        - 23.11.8
        - 20.11.9
        - 21.08.5
        - 21.08.8
        - 22.05.10
      default: 23.11.8
      description: Version of SLURM to be installed
      type: string
    wn_cpus:
      default: 1
      description: Number of CPUs for the WNs
      required: true
      type: integer
    wn_disk_size:
      constraints:
      - valid_values:
        - 0 GiB
        - 10 GiB
        - 20 GiB
        - 50 GiB
        - 100 GiB
        - 200 GiB
        - 500 GiB
        - 1 TiB
        - 2 TiB
      default: 0 GiB
      description: Size of the disk to be attached to the WN instances (Set 0 if disk
        is not needed)
      type: scalar-unit.size
    wn_mem:
      default: 1024
      description: Amount of Memory for the WNs in MiB
      required: true
      type: integer
    wn_mount_path:
      default: /mnt/data
      description: Path to mount the WN attached disk
      type: string
    wn_num:
      default: 1
      description: Number of WNs in the cluster
      required: true
      type: integer
  node_templates:
    fe_block_storage:
      properties:
        size:
          get_input: fe_disk_size
        volume_id:
          get_input: fe_volume_id
      type: tosca.nodes.BlockStorage
    lrms_front_end:
      properties:
        version:
          get_input: slurm_version
        wn_cpus:
          get_input: wn_cpus
        wn_ips:
          get_attribute:
          - lrms_wn
          - private_address
        wn_mem:
          get_input: wn_mem
      requirements:
      - host: lrms_server
      type: tosca.nodes.indigo.LRMS.FrontEnd.Slurm
    lrms_server:
      capabilities:
        endpoint:
          properties:
            dns_name: slurmserver
            network_name: PUBLIC
            ports:
              get_input: fe_ports
        host:
          properties:
            mem_size:
              get_input: fe_mem
            num_cpus:
              get_input: fe_cpus
        os:
          properties:
            distribution: ubuntu
            image: ost://kvm.tacc.chameleoncloud.org/96d9c658-6540-4796-ae64-54d8ac6c45f8
            type: linux
      properties:
        instance_name: slurm_test_lrms_server
      requirements:
      - local_storage:
          node: fe_block_storage
          relationship:
            properties:
              location:
                get_input: fe_mount_path
            type: AttachesTo
      type: tosca.nodes.indigo.Compute
    lrms_wn:
      capabilities:
        host:
          properties:
            mem_size:
              concat:
              - get_input: wn_mem
              - ' MiB'
            num_cpus:
              get_input: wn_cpus
        os:
          properties:
            distribution: ubuntu
            image: ost://kvm.tacc.chameleoncloud.org/96d9c658-6540-4796-ae64-54d8ac6c45f8
            type: linux
        scalable:
          properties:
            count:
              get_input: wn_num
      properties:
        instance_name: slurm_test_lrms_wn
      requirements:
      - local_storage:
          node: wn_block_storage
          relationship:
            properties:
              location:
                get_input: wn_mount_path
            type: AttachesTo
      type: tosca.nodes.indigo.Compute
    wn_block_storage:
      properties:
        size:
          get_input: wn_disk_size
      type: tosca.nodes.BlockStorage
    wn_node:
      properties:
        front_end_ip:
          get_attribute:
          - lrms_server
          - private_address
          - 0
        public_front_end_ip:
          get_attribute:
          - lrms_server
          - public_address
          - 0
        version:
          get_input: slurm_version
      requirements:
      - host: lrms_wn
      type: tosca.nodes.indigo.LRMS.WorkerNode.Slurm
  outputs:
    cluster_creds:
      value:
        get_attribute:
        - lrms_server
        - endpoint
        - credential
        - 0
    cluster_ip:
      value:
        get_attribute:
        - lrms_server
        - public_address
        - 0
"""

# Save the template to a file
with open("slurm-cluster.yaml", "w") as f:
    f.write(slurm_cluster_recipe)
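
To change the cluster size or node resources, the `default` values under `inputs` can be edited before the file is saved. The helper below is a minimal sketch of one way to do that on the template text without a YAML library. The name `set_input_default` is hypothetical, and it assumes the indentation used in the template above (inputs at four spaces, their `default:` key at six) and handles scalar defaults only.

```python
def set_input_default(template_text, input_name, new_default):
    """Return template_text with the scalar default of one input replaced."""
    lines = template_text.splitlines()
    header = f"    {input_name}:"
    if header not in lines:
        raise KeyError(f"input {input_name!r} not found")
    start = lines.index(header)
    for i in range(start + 1, len(lines)):
        line = lines[i]
        indent = len(line) - len(line.lstrip())
        # A non-blank line indented four spaces or less ends this input's block.
        if line.strip() and indent <= 4:
            break
        if line.startswith("      default: "):
            lines[i] = f"      default: {new_default}"
            return "\n".join(lines) + "\n"
    raise KeyError(f"input {input_name!r} has no scalar default")


# Small stand-in for the full template, used only to demonstrate the helper.
demo = """topology_template:
  inputs:
    wn_num:
      default: 1
      description: Number of WNs in the cluster
      type: integer
"""
updated = set_input_default(demo, "wn_num", 2)
print(updated)
```

With the notebook's own variable, `set_input_default(slurm_cluster_recipe, "wn_num", 2)` would request two worker nodes, provided the result is what gets written to `slurm-cluster.yaml`.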

## 🚀 Step 4: Deploy the Infrastructure

Use the `im_client.py` CLI to deploy your infrastructure. The command will return an `infrastructure ID` used for future interactions.

In [None]:
!im_client.py -r https://im.egi.eu/im/ -a ./imcham-auth.dat create slurm-cluster.yaml

# The command uses the publicly available IM endpoint at https://im.egi.eu/im

Once deployed, paste the infrastructure ID below to continue working with it.

In [12]:
infra_id = "infra-id"
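
Instead of copying the ID by hand, it can be pulled out of the captured command output. The sketch below assumes the ID is a UUID printed somewhere in the `create` output. The sample message and the helper name `extract_infra_id` are illustrative, not taken from the IM client.

```python
import re

# Matches a standard 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def extract_infra_id(create_output):
    """Return the first UUID found in the text, or None if there is none."""
    match = UUID_RE.search(create_output)
    return match.group(0) if match else None


# Assumed example of what the create command might print.
sample_output = (
    "Infrastructure successfully created with ID: "
    "573c4b0a-67d9-11e8-b75f-0a580af401da"
)
print(extract_infra_id(sample_output))
```

In a notebook, the output of a `!` command can be captured with `out = !im_client.py ...` and passed in as `extract_infra_id("\n".join(out))`.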

## 🔍 Step 5: Check Deployment Status

Before connecting to the cluster over SSH, check that the infrastructure has reached the `configured` state:


In [None]:
!im_client.py -r https://im.egi.eu/im -a ./imcham-auth.dat getstate {infra_id}
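
Configuration can take several minutes, so it may be convenient to poll rather than rerun the cell by hand. The following is a minimal sketch: it assumes the `getstate` output contains a phrase such as `state: configured`, and the helper names (`parse_state`, `run_getstate`, `wait_until_configured`) and the timing defaults are hypothetical.

```python
import re
import subprocess
import time


def parse_state(getstate_output):
    """Return the first word following 'state' in the output, or None."""
    match = re.search(r"state:?\s+(\w+)", getstate_output)
    return match.group(1) if match else None


def run_getstate(infra_id, im_url="https://im.egi.eu/im",
                 auth_file="./imcham-auth.dat"):
    """Run the same getstate command as the cell above and return its stdout."""
    result = subprocess.run(
        ["im_client.py", "-r", im_url, "-a", auth_file, "getstate", infra_id],
        capture_output=True, text=True,
    )
    return result.stdout


def wait_until_configured(fetch_output, interval=30, max_checks=60,
                          sleep=time.sleep):
    """Poll fetch_output() until the parsed state is 'configured'."""
    for _ in range(max_checks):
        state = parse_state(fetch_output())
        if state == "configured":
            return state
        if state in ("failed", "unconfigured"):
            raise RuntimeError(f"deployment ended in state {state!r}")
        sleep(interval)
    raise TimeoutError("infrastructure did not reach 'configured' in time")
```

A call such as `wait_until_configured(lambda: run_getstate(infra_id))` would then block until the cluster is ready or the deployment fails. Passing the fetch function as an argument keeps the polling logic testable without a live deployment.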

## 🧪 Step 6: Submit a Slurm Job

The Slurm job is submitted from the front-end VM over SSH. Run the following command to print the SSH command for the front-end node, then paste its output into the `ssh_command` variable:

In [None]:
!im_client.py -r https://im.egi.eu/im -a ./imcham-auth.dat ssh {infra_id} 1 -q

In [549]:
ssh_command = "ssh-command"

In [None]:
# Create a SLURM job script
script = """#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=hostname.out

hostname
"""

In [None]:
import subprocess

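# Connect to the cluster and save the job script in /home/slurm/test.sh.
# The script is sent over stdin, which avoids nested shell quoting problems
# (a '#' inside an echo argument would otherwise start a remote comment).
subprocess.run(
    f"{ssh_command} 'sudo -u slurm tee /home/slurm/test.sh > /dev/null'",
    shell=True, input=script, text=True, check=True,
)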

In [None]:
# Submit the script using 'sbatch' 
!$ssh_command "sudo su - slurm -c 'sbatch /home/slurm/test.sh'"
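
On success, `sbatch` prints a line of the form `Submitted batch job <id>`. A small helper, sketched here under the hypothetical name `parse_job_id`, can extract that number so later cells can refer to the job.

```python
import re


def parse_job_id(sbatch_output):
    """Return the job ID reported by sbatch as an int, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


print(parse_job_id("Submitted batch job 42"))  # prints 42
```

In a notebook, the submission output can be captured with `out = !$ssh_command "..."` and passed in as `parse_job_id("\n".join(out))`.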

## 📤 Step 7: Check Output

You can check the job status with `squeue` and inspect outputs:

In [None]:
# Check the status of the job with the 'squeue' command
!$ssh_command "sudo -u slurm squeue"
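
To decide programmatically whether the job is still queued or running, the `squeue` table can be split into one dictionary per job. This sketch assumes the default column layout and that no field before the last contains spaces. The partition and node names in the sample are made up, and `parse_squeue` and `is_job_active` are hypothetical helper names.

```python
def parse_squeue(squeue_output):
    """Turn whitespace-aligned squeue output into a list of dicts."""
    lines = [ln for ln in squeue_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for ln in lines[1:]:
        # Split into at most len(header) fields so the last column stays whole.
        fields = ln.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def is_job_active(jobs, job_id):
    """Return True if job_id still appears in the parsed squeue listing."""
    return any(job.get("JOBID") == str(job_id) for job in jobs)


# Illustrative output only; real partition and node names will differ.
sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2     debug     test    slurm  R       0:01      1 vnode-1
"""
jobs = parse_squeue(sample)
print(jobs[0]["ST"], is_job_active(jobs, 2))
```

Once the job no longer appears in the listing, its output file should be complete.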

In [None]:
# Once it finishes, check the output of the SLURM job
!$ssh_command "sudo -u slurm cat /home/slurm/hostname.out"

## 🧹 Step 8: Destroy the Infrastructure

After completing the experiment, destroy the infrastructure to free up resources.

In [None]:
!im_client.py -r https://im.egi.eu/im -a ./imcham-auth.dat destroy {infra_id}