# 🧪 Experiment: Deploying and Testing a SLURM Cluster for Distributed Computing
*This tutorial will guide you through deploying a small SLURM cluster using APRICOT and the Infrastructure Manager (IM).*

📘 **Context & Objective**

High-Performance Computing (HPC) clusters are essential for computationally intensive tasks such as simulations, modeling, and large-scale data processing. The **SLURM (Simple Linux Utility for Resource Management)** workload manager is widely used in academic and research environments to manage and schedule computing jobs on clusters.

This experiment demonstrates how to use the **APRICOT** extension to:

- Deploy a SLURM cluster on a cloud provider using a predefined recipe

- Submit and monitor a job directly from the notebook

- Retrieve the job output

- Tear down the infrastructure once the work is complete

All steps are automated using `%apricot` magic commands for simplicity and reproducibility.



### 🛠️ **Step 1: Load the APRICOT Extension**

In [None]:
%reload_ext apricot_magics

### 🔑 **Step 2: Add Your EGI Refresh Token**

In [None]:
refresh_token = "<token>"

In [None]:
%apricot_token {refresh_token}
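A token typed directly into a cell is easy to leak when the notebook is shared or committed. The sketch below reads it from an environment variable instead. The variable name `EGI_REFRESH_TOKEN` and the helper `load_refresh_token` are hypothetical names chosen for illustration; they are not defined by APRICOT.

```python
import os


def load_refresh_token(env_var="EGI_REFRESH_TOKEN", environ=None):
    """Return the token stored in `env_var`, or None if it is unset or blank.

    `env_var` is an assumed name; use whatever variable you export yourself.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    return token or None


refresh_token = load_refresh_token()
if refresh_token is None:
    print("EGI_REFRESH_TOKEN is not set; assign refresh_token manually instead.")
```

If the variable is set, `refresh_token` can then be passed to `%apricot_token {refresh_token}` exactly as in the cell above.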

### 📜 **Step 3: Define the SLURM Cluster Recipe**

You can either:

- Use a **predefined SLURM recipe** via the APRICOT GUI menu, or

- Use the **custom recipe** below in **TOSCA** format:
> 🔍 Replace both _image_ values with a valid image for the cloud provider you want to use.

> ✍️ If you deploy the cluster with the *magic commands*, you must first fill in `resources/authfile` with your IM and cloud provider credentials.

In [None]:
slurm_cluster_recipe = """
tosca_definitions_version: tosca_simple_yaml_1_0

description: Minimal SLURM Virtual Cluster

imports:
  - grycap_custom_types: https://raw.githubusercontent.com/grycap/tosca/main/custom_types.yaml

topology_template:
  inputs:
    fe_cpus:
      type: integer
      default: 1
    fe_mem:
      type: scalar-unit.size
      default: 1 GiB
    wn_cpus:
      type: integer
      default: 1
    wn_mem:
      type: scalar-unit.size
      default: 1 GiB
    wn_num:
      type: integer
      default: 1
    slurm_version:
      type: string
      default: 23.11.8
    fe_ports:
      type: map
      default:
        port_22:
          protocol: tcp
          source: 22

  node_templates:
    lrms_server:
      type: tosca.nodes.indigo.Compute
      properties:
        instance_name: slurm_frontend
      capabilities:
        host:
          properties:
            num_cpus: { get_input: fe_cpus }
            mem_size: { get_input: fe_mem }
        os:
          properties:
            type: linux
            distribution: ubuntu
            image: one://osenserver/image-id
        endpoint:
          properties:
            network_name: PUBLIC
            ports: { get_input: fe_ports }
            dns_name: slurmserver

    lrms_front_end:
      type: tosca.nodes.indigo.LRMS.FrontEnd.Slurm
      properties:
        version: { get_input: slurm_version }
        wn_ips: { get_attribute: [lrms_wn, private_address] }
      requirements:
        - host: lrms_server

    lrms_wn:
      type: tosca.nodes.indigo.Compute
      properties:
        instance_name: slurm_worker
      capabilities:
        host:
          properties:
            num_cpus: { get_input: wn_cpus }
            mem_size: { get_input: wn_mem }
        os:
          properties:
            type: linux
            distribution: ubuntu
            image: one://osenserver/image-id
        scalable:
          properties:
            count: { get_input: wn_num }

    wn_node:
      type: tosca.nodes.indigo.LRMS.WorkerNode.Slurm
      properties:
        version: { get_input: slurm_version }
        front_end_ip: { get_attribute: [lrms_server, private_address, 0] }
        public_front_end_ip: { get_attribute: [lrms_server, public_address, 0] }
      requirements:
        - host: lrms_wn

  outputs:
    cluster_ip:
      value: { get_attribute: [lrms_server, public_address, 0] }
    cluster_creds:
      value: { get_attribute: [lrms_server, endpoint, credential, 0] }
"""
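Because the recipe is a plain Python string, both image placeholders can be replaced in one step rather than edited by hand. This is a minimal sketch: the helper `set_image` and the example image URL are hypothetical, and the short `sample` string only stands in for the full recipe.

```python
PLACEHOLDER_IMAGE = "one://osenserver/image-id"  # placeholder used in the recipe


def set_image(recipe, image_url, placeholder=PLACEHOLDER_IMAGE):
    """Return a copy of `recipe` with every placeholder image replaced."""
    if placeholder not in recipe:
        raise ValueError("placeholder image not found in recipe")
    return recipe.replace(placeholder, image_url)


# Tiny stand-in for the full recipe, just to show the behaviour.
sample = """
lrms_server:
  image: one://osenserver/image-id
lrms_wn:
  image: one://osenserver/image-id
"""

# Hypothetical image URL; use one that exists on your provider.
patched = set_image(sample, "one://example.org/123")
print(patched)
```

Applied to the real recipe, this would be `slurm_cluster_recipe = set_image(slurm_cluster_recipe, "<your-image-url>")` before running the deployment step.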

### 🚀 **Step 4: Deploy the SLURM Cluster**

In [None]:
%apricot_create {slurm_cluster_recipe}

📝 After running this command, copy the `infrastructure_id` from the output and assign it to a variable, replacing the placeholder below:

In [None]:
infrastructure_id = "infra-id"

### 📋 **Step 5: View Cluster State**

You can check the logs of the deployment:

In [None]:
%apricot_log {infrastructure_id}

### 🧪 **Step 6: Submit a SLURM Job**

Let’s submit a simple SLURM job to verify that everything is working correctly.

#### 📄 **6.1 Create a SLURM job script on the VM**
We'll create a basic SLURM job script that prints a message.

In [None]:
script = """#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output.out

echo "Hello from SLURM!"
"""

# Write the script to /home/slurm/job.sh as the slurm user
%apricot exec {infrastructure_id} echo {script!r} | sudo -u slurm tee /home/slurm/job.sh
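Passing a multi-line script through `echo` depends on how the remote shell treats quotes and `\n` escapes, so the file may not always arrive with its line breaks intact. One way to sidestep quoting is to send the script base64-encoded and decode it on the VM. The sketch below only builds the command string; `build_write_command` is a hypothetical helper, and it assumes that `base64` is available on the VM and that `%apricot exec` forwards the pipe to the remote shell as the cell above does.

```python
import base64
import shlex


def build_write_command(script, remote_path, user="slurm"):
    """Build a shell command that recreates `script` at `remote_path` as `user`."""
    # Base64 output contains only letters, digits, '+', '/' and '=',
    # so it needs no shell quoting.
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return (
        f"echo {encoded} | base64 -d | "
        f"sudo -u {shlex.quote(user)} tee {shlex.quote(remote_path)}"
    )


script = """#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=output.out

echo "Hello from SLURM!"
"""

write_cmd = build_write_command(script, "/home/slurm/job.sh")
print(write_cmd)
```

The resulting string could then be run with `%apricot exec {infrastructure_id} {write_cmd}`, under the assumptions stated above.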

#### 📤 **6.2 Submit the job**


In [None]:
# Submit the script using sbatch as the slurm user
%apricot exec {infrastructure_id} sudo su - slurm -c 'sbatch /home/slurm/job.sh'

#### 📋 **6.3 Check the job queue**

In [None]:
# View the current SLURM job queue
%apricot exec {infrastructure_id} squeue

> Wait until your job finishes. It should be quick for this simple example.
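Waiting can be automated by polling the queue until it is empty. The sketch below assumes the default `squeue` layout (one header line, then one line per job with the job ID in the first column). How you capture the text printed by `%apricot exec` is left open, since that depends on APRICOT; `pending_jobs` and `wait_until` are hypothetical helpers, and the sample output is illustrative.

```python
import time


def pending_jobs(squeue_output):
    """Return the job IDs listed in default `squeue` output (header skipped)."""
    rows = [line.split() for line in squeue_output.strip().splitlines()]
    return [fields[0] for fields in rows[1:] if fields]


def wait_until(check, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Illustrative squeue output with one running job.
sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2     debug     test    slurm  R       0:01      1 vnode-1
"""
print(pending_jobs(sample))
```

In practice, `check` would be a function that fetches fresh `squeue` output and returns `not pending_jobs(output)`.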

Then check that the output file has been created:

In [None]:
%apricot exec {infrastructure_id} sudo -u slurm ls /home/slurm/

### 📂 **Step 7: Retrieve the Output**

After the job completes, its output is written to `output.out` in the `slurm` user's home directory. Move the file to `/tmp` so that it can be downloaded:

In [None]:
%apricot exec {infrastructure_id} sudo -u slurm mv /home/slurm/output.out /tmp

### 📤 **Step 8: Download Output Logs**

In [None]:
%apricot_download {infrastructure_id} /tmp/output.out .
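Once the download finishes, the file can be displayed from the notebook with plain Python. This is a small sketch that assumes the file was saved as `output.out` in the current working directory, as requested by the `.` argument above; `read_job_output` is a hypothetical helper.

```python
from pathlib import Path


def read_job_output(path="output.out"):
    """Return the text of the downloaded file, or None if it does not exist."""
    output_file = Path(path)
    if not output_file.is_file():
        return None
    return output_file.read_text(encoding="utf-8")


text = read_job_output()
if text is None:
    print("output.out not found; check the download step")
else:
    print(text)
```

For the test job in this tutorial, the file should contain the message echoed by the script, `Hello from SLURM!`.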

### 🧹 **Step 9: Clean Up**

In [None]:
%apricot_destroy {infrastructure_id}

✅ **Summary**

In this notebook, you:

- Deployed a SLURM cluster from Jupyter

- Created and submitted a SLURM job

- Retrieved the job output and downloaded it to the notebook environment

- Destroyed the infrastructure once the work was complete

💡 **Notes**

- The SLURM controller and compute nodes are automatically configured via the recipe.

- In this setup, SLURM jobs are submitted as the `slurm` user.

- Output files written to the `slurm` user's home directory aren't accessible for download by default. Move them to `/tmp` first.


📌 **Conclusion**

This experiment shows how cloud-provisioned virtual clusters can make HPC workflows more accessible. Using **APRICOT**, researchers and students can deploy, use, and tear down a reproducible SLURM environment directly from a notebook.