Adapting HPC Setup for a different cluster: Indiana University Example


Adapting HPC Setup - Gary L. Pavlis

What is this document and how is it structured

I wrote this wiki page as an example of how to set up a local cluster environment using the templates and documentation here on github. The example is for the HPC systems at Indiana University (IU). IU has multiple research clusters. I'm going to use the one available at the time of this writing (early 2023), which they call carbonate. It is a general purpose HPC cluster for IU faculty and students and has an architecture that will work with our standard docker container. A special feature I will exploit is the system they call the "REsearch Desktop" (RED). RED provides a graphical interface to a set of head nodes, which will be extremely useful with jupyter as our front end. I will need to do some special work to allow me to run the notebook through the head node. I want to try to have the notebook server running by itself on the head node and have all the computational work done on compute nodes.

The rest of this document is more or less a log of what I needed to do to get mspass running in this environment. I plan to faithfully record my successes and failures. That may make it harder to read but it should give anyone looking at this a better perspective on what they might face.

Blog of Pavlis Efforts

Step 1: Build container for singularity

An existing entry on the MsPASS wiki (follow this link) describes how to set up singularity. IU uses singularity and the same software management system (module) as TACC. Here is a log of the commands I issued on the RED head node in a "terminal" window:

module load singularity   # minor difference from the tacc version, which adds a tacc tag
singularity build mspass.simg docker://wangyinz/mspass

The wiki page referenced above has a bunch of extra stuff that I think is just extraneous for my use. The singularity build line seems to work fine, so I'll see if I can just run the file it creates, called mspass.sing.

I then tried to see what happens when running this container with the following line:

singularity run mspass.sing   # Note the wiki had a typo and used mspass.simg, which is confusing

As I kind of expected, that failed with this message:

[WARN  tini (21508)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
mkdir: cannot create directory '/logs': Read-only file system
/usr/sbin/start-mspass.sh: line 279: /logs/dask-scheduler_log_9hBOqd3YAb8k: No such file or directory
/usr/sbin/start-mspass.sh: line 280: /logs/dask-worker_log_9hBOqd3YAb8k: No such file or directory
mkdir: cannot create directory '/db': Read-only file system
2023-01-06T09:23:44.863-0500 F  CONTROL  [main] Failed global initialization: FileNotOpen: Failed to open "/logs/mongo_log"
[I 09:23:46.530 NotebookApp] Writing notebook server cookie secret to /N/home/u070/pavlis/Carbonate/.local/share/jupyter/runtime/notebook_cookie_secret
[I 09:23:47.301 NotebookApp] Serving notebooks from local directory: /N/home/u070/pavlis/Carbonate
[I 09:23:47.301 NotebookApp] Jupyter Notebook 6.2.0 is running at:
 ** Jupyter actually launched here but would be useless given the errors above **

The messages show a clear problem with paths. In retrospect I expected that would be a problem, but I was taking the hacker approach of just going for it to see what happened.

What I need to do is create a startup job script using the templates on github. That is a very different thing so I'm going to start a new section on the topic.

Step 2: Modifying job scripts

The first thing I had to do was revisit the mspass documentation page that is really the primary reference for this problem. I wrote it, so I can read it easily. A warning to future readers: that page will likely be changed after I finish going through the process described in this blog. The reference here is the page in the MsPASS documentation with the (current - it could change) title MsPASS In-Depth Overview.

What that made me realize is that the wiki page on singularity led me down the wrong path. Having written that document I should have remembered that, but this is an example of a mistake worth preserving. Don't do what I just did other than building the container with singularity build.

That said, I downloaded a startup shell script from github to use as a template. I wanted to get mspass working with multiple nodes on the IU system, so I chose not the example listed in the documentation but the one called scripts/tacc_examples/distributed_node.sh.

I worked on that a bit and realized I should get the single node version running first. There are complexities in the distributed node job with ssh communication between nodes that are going to be more challenging. The single_node.sh script from tacc_examples is much simpler.
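If you want to follow along, the templates live in the scripts/tacc_examples directory of the MsPASS source tree. A minimal sketch of grabbing them is below; the repository URL is written from memory, so verify it against the MsPASS documentation:

git clone https://github.com/mspass-team/mspass.git
cp mspass/scripts/tacc_examples/single_node.sh .
cp mspass/scripts/tacc_examples/distributed_node.sh .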

I found there were two types of changes I had to make to the single_node.sh script to make it work:

  1. File path differences.
  2. System configuration differences.

The first is relatively easy. The tacc template had a feature that helped, which I could clone: two environment variables that are set system-wide at tacc had to be redefined for the IU system. They are SCRATCH and WORK2. Here, in fact, is a diff of the two files that shows what I did for IU:

[pavlis@i26 usarray]$ diff single_node_tacc.sh single_node.sh
10a11,14
> # For IU environment we need to set these two environment variables
> # that are set system wide at TACC - makes converting this script easier
> SCRATCH=/N/scratch/pavlis
>  WORK2=/N/slate/pavlis
14c18
< MSPASS_CONTAINER=$WORK2/mspass/mspass_latest.sif
---
> MSPASS_CONTAINER=$WORK2/mspass/mspass.sing

Note the container file name is arbitrary; it is just whatever name was chosen when the container was built.

For system configuration I just had to change the module load lines. Here is the relevant section of the diff output from the command above:

< # load modules on tacc machines
< module unload xalt
< module load tacc-singularity
---
> # IU module commands needed
> module load singularity
38,42c41,45

The last thing for my application was that I didn't need the ssh tunnel stuff in the tacc script. Because I was using the RED front end, I found I could just remove those lines and run a browser in the RED (ThinLinc client) window.
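For reference, these are the tunnel lines from the TACC template that I removed; they also appear, commented out, in Section 4 of the full script at the end of this page. Port 8888 is the jupyter server and 8787 is the dask dashboard:

# reverse tunnels from each TACC login node back to the compute node
# not needed at IU because the browser runs inside the RED session
for i in `seq 4`; do
    ssh -q -f -g -N -R $LOGIN_PORT:$NODE_HOSTNAME:8888 login$i
    ssh -q -f -g -N -R $STATUS_PORT:$NODE_HOSTNAME:8787 login$i
done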

I verified the whole thing worked by copying our getting_started jupyter notebook to the working directory defined by this line in the shell script:

WORK_DIR=$SCRATCH/mspass/single_workdir
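The copy itself is a one-liner; the source path below is hypothetical, so point it at wherever your copy of the notebook lives:

# the source path is hypothetical; the destination is WORK_DIR expanded for this IU setup
cp ~/notebooks/getting_started.ipynb /N/scratch/pavlis/mspass/single_workdir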

The way this can be launched on RED is a bit different from tacc. I took the easy route and used the RED gui to run a little app they have for launching an interactive job with a single click; that is a reasonable starting point for testing. Like the examples from tacc, I get the output displayed on the interactive node terminal with this content:

(base) [pavlis@c71 usarray]$ ./single_node.sh 
singularity version 3.6.4 loaded.
Currently Loaded Modulefiles:
  1) quota/1.8             6) gsl/gnu/2.6          11) intel/19.0.5
  2) git/2.13.0            7) cmake/gnu/3.18.4     12) totalview/2020.0.25
  3) xalt/2.10.30          8) boost/gnu/1.72.0     13) singularity/3.6.4
  4) core                  9) gcc/9.1.0
  5) hpss/8.3_u4          10) openblas/0.3.3
/N/slate/pavlis/usarray
Sat Jan  7 07:06:21 EST 2023
(standard_in) 1: syntax error
got login node port 
[WARN  tini (32591)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
2023-01-07T07:06:26.855-0500 I  CONTROL  [main] log file "/N/scratch/pavlis/mspass/single_workdir/logs/mongo_log" exists; moved to "/N/scratch/pavlis/mspass/single_workdir/logs/mongo_log.2023-01-07T12-06-26".
[I 07:06:28.167 NotebookApp] Serving notebooks from local directory: /N/scratch/pavlis/mspass/single_workdir
[I 07:06:28.167 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 07:06:28.167 NotebookApp] http://c71:8888/?token=8bda8850528e4932bd025ee457728d18d6329b85ae9d6ffd
[I 07:06:28.167 NotebookApp]  or http://127.0.0.1:8888/?token=8bda8850528e4932bd025ee457728d18d6329b85ae9d6ffd
[I 07:06:28.168 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 07:06:28.193 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///N/scratch/pavlis/mspass/single_workdir/.local/share/jupyter/runtime/nbserver-359-open.html
    Or copy and paste one of these URLs:
        http://c71:8888/?token=8bda8850528e4932bd025ee457728d18d6329b85ae9d6ffd
     or http://127.0.0.1:8888/?token=8bda8850528e4932bd025ee457728d18d6329b85ae9d6ffd

I could then connect a Firefox browser window to the url with the c71 hostname. I ran the tutorial to the point where it needed the waveform files and stopped, as I didn't have a local copy. The point is that the startup script (single_node.sh) now works.

Step 3: Work out changes for multiple nodes

The issues for multiple nodes are more complicated. The first changes were easy and, because I had actually started playing with the distributed_node.sh file first, were mostly already solved. That is, I had to make the same changes to path names and to the system software setup through module. There is one difference for the latter, though: I needed an additional module for mpirun. The tacc script had this line to launch workers:

mpiexec.hydra -n $((SLURM_NNODES-1)) -ppn 1 -hosts $WORKER_LIST $SING_COM &

The mpiexec.hydra program is apparently a variant used at tacc that is loaded for all jobs by default. For the IU system I had to explicitly load a version of mpiexec with this command:

module load openmpi

and change the run line to this:

mpiexec -n $((SLURM_NNODES-1)) -ppn 1 -hosts $WORKER_LIST $SING_COM &

That left one issue: internode communication. The tacc example should explain why the ssh tunnel setup is necessary; I have to assume it is a necessary evil to get dask/spark to run on the cluster. To test this I first tried to launch an interactive job with 4 nodes through RED, which would have let me hack on the script without waiting for slurm to schedule multiple batch runs. Unfortunately, it seems the cluster limit for an interactive job is 2 nodes. That might have been ok, but I needed the education in slurm batch submissions anyway, so I did this by repeated sbatch submissions, looking at the output from each run. The things I had to deal with were:

  1. sbatch rejected the submission until I edited the set of #SBATCH lines at the top of the script. Two problems were the line with -p skx-dev, which names a queue at tacc, and -A MsPASS, which was the allocation account name at tacc when that script was created. I just deleted both of those lines (the two directives are sketched just after this list).
  2. I got stuck with something that may just be temporary. When I submitted the jobs through sbatch it generated an error that the module program was not defined. This is a typical point where I had to ask for help from the sysadmins. The blog pauses here until I get that resolved.
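For completeness, these are the kinds of header directives I mean. Both come straight from the TACC template and are not valid on carbonate, so they were deleted (the replacement header is in Section 1 of the full script at the end of this page):

#SBATCH -p skx-dev          # TACC queue name - not valid at IU
#SBATCH -A MsPASS           # TACC allocation account - not valid at IU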

Jan 12, 2023

A common experience at HPC centers is a delay in getting the answer to a question like the one I had. This was no exception. The issue turned out to be an obscure one. The problem was created by the fact that I had a legacy setup with tcsh as my default shell. Something was incompatible when a bash script, which is what the distributed_node script is, was run in my environment with tcsh as the parent shell. I never really figured out why, but the solution was to drop the legacy tcsh default shell. Then there were no more issues with module.
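Checking and changing the login shell is, in principle, just the sketch below. Be aware that many HPC centers manage login shells centrally, so chsh may be disabled and you may need to ask the sysadmins (or use an account-management portal) to make the change:

echo $SHELL          # show the current login shell
chsh -s /bin/bash    # request bash as the default login shell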

I then tried submitting the edited distributed_node.sh script to SLURM on the machine carbonate.uits.iu.edu. I hit a set of path problems I traced to inconsistencies between distributed_node.sh and the shell script called start-mspass.sh embedded inside the container. Since someone less knowledgeable may be reading this wiki, that requires a minor digression. The way we currently launch mspass on HPC systems is a two-stage, shell-script-based scheme. The overall "job" in my example is driven by my new version of distributed_node.sh, which from now on I will call distributed_node_IU.sh. Processing is initiated by submitting distributed_node_IU.sh to the local cluster (carbonate in my case) with SLURM's sbatch command from a unix shell on a head node. In my case the head node is actually a GUI system where I plan to run the web browser that will attach to the jupyter notebook server. The distributed_node_IU.sh script has to launch a container via singularity on every node of the virtual cluster it defines, as described in the mspass user's manual in this section. Buried in that user's manual page is the fact that EVERY TIME the container is launched, each instance boots the container and then runs the master shell script start-mspass.sh before returning control to distributed_node_IU.sh.
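To make that concrete, each container launch in distributed_node_IU.sh boils down to a line of the form below. The SINGULARITYENV_ prefix tells singularity to pass the variable into the container, and MSPASS_ROLE is what start-mspass.sh uses to decide whether an instance acts as the scheduler, a worker, the db, or the frontend. This example is the scheduler launch adapted from the full script at the end of this page (with the SING_COM variable expanded):

SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
SINGULARITYENV_MSPASS_ROLE=scheduler \
singularity run $MSPASS_CONTAINER &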

With that education, the next issue that arose is this: some of the containers are throwing errors when they execute start-mspass.sh. For the record, it is not obvious from the output the "job" creates that the errors come from the containers. Here is the output I see at this stage in my hacking to get this to work:

[pavlis@i23 usarray]$ cat mspass.o3259215
singularity version 3.6.4 loaded.
openmpi version 4.0.1 loaded.
Currently Loaded Modulefiles:
  1) quota/1.8                      8) boost/gnu/1.72.0
  2) git/2.13.0                     9) gcc/9.1.0
  3) xalt/2.10.30                  10) openblas/0.3.3
  4) core                          11) intel/19.0.5
  5) hpss/8.3_u4                   12) totalview/2020.0.25
  6) gsl/gnu/2.6                   13) singularity/3.6.4
  7) cmake/gnu/3.18.4              14) openmpi/intel/4.0.1(default)
/N/slate/pavlis/usarray
Wed Jan 11 16:58:56 EST 2023
primary node c3
(standard_in) 1: syntax error
got login node port 
c29,c30
Using Single node MongoDB
mpiexec: Error: unknown option "-o"
Type 'mpiexec --help' for usage.
mkdir: cannot create directory '/N/scratch': Read-only file system
mkdir: cannot create directory '/N/scratch': Read-only file system
/usr/sbin/start-mspass.sh: line 282: /N/scratch/pavlis/mspass/workdir/logs/dask-scheduler_log_DHxCJYf2b4Xw: No such file or directory
mkdir: cannot create directory '/N/scratch': Read-only file system
mkdir: cannot create directory '/N/scratch': Read-only file system
[C 16:59:08.099 NotebookApp] Bad config encountered during initialization:
[C 16:59:08.099 NotebookApp] No such notebook dir: ''/N/scratch/pavlis/mspass/workdir''
The Jupyter HTML Notebook.

   ... Additional output from Jupyter just echoing usage created by launch failure (omitted) ...

There were two types of errors I had to deal with next:

  1. The error from mpiexec. It made no sense because the only line running mpiexec in distributed_node_IU.sh did not use -o, and the start-mspass.sh script doesn't run mpiexec. The only place it could be coming from is this line: mpiexec -n $((SLURM_NNODES-1)) -hosts $WORKER_LIST $SING_COM &.
  2. The string of errors about "Read-only file system". That appears to be coming from launching the containers, since nothing in distributed_node_IU.sh references that path.

I had to do different things to solve each of these. For item 1, I suspected the run line for mpiexec. I added the following line immediately after the mpiexec line:

echo "mpiexec -n $((SLURM_NNODES-1)) -hosts $WORKER_LIST $SING_COM"

which is an old shell debug trick. I then resubmitted the job and got this output from that line:

mpiexec -n 2 -hosts x3,x4 singularity run /N/slate/pavlis/mspass/mspass_latest.sif 

showing that no -o option appears in the expanded command.

Item 2 is equally mysterious, but for different reasons. I cannot find any place in either of these scripts where the path /N/scratch should ever be defined. It is ".." relative to $WORK_DIR in the draft script, but I find no reference to "..". The only thing I can currently think of is that something in the running container references "/" and that is somehow being mapped to "/N/scratch".

Jan 14, 2023

I worked on this for a while, as you can see from the date, without much progress until today. My fundamental error, the one creating the read-only file system errors, was misunderstanding the basic fact that singularity, like docker, runs from a mount point that defines the top of the file system for the container ("/"). Without some intervention the container can only see files below that mount point in the unix file system tree. Getting the container to work with other file systems mounted is a harder problem. As a first step, I'm struggling enough with trying to get it to work with all data saved in a single file system.

Something I found incredibly useful was this incantation to run a bash shell in the container. With that I could use standard unix command line tools like ls, df, etc. to figure out what the default container was seeing. Here is the command line I used:

singularity exec /N/slate/pavlis/mspass/mspass_latest.sif bash

That helped me realize that with exec, at least, no matter where I launched this from, the shell saw only my home directory, which on the IU system is defined this way, as shown by this output from the bash instance running above:

Singularity> pwd
/N/home/u070/pavlis/Carbonate

I can also see files in another file system on the IU system called /N/dcwan, which is a large lustre file system on which I happen to have a writable directory.

That led me to do what I should have done much earlier: RTFM for singularity. I then ran the following to test use of the --bind option to mount the file system I wanted for this processing:

singularity exec --bind /N/slate/pavlis,/home /N/slate/pavlis/mspass/mspass_latest.sif bash
Singularity> pwd
/N/slate/pavlis/usarray

which showed this did what I expected. That is, it mounted the file system /N/slate and kept me in the working directory I wanted. Clearly some variant of this is what I need to make this work.

The first issue that comes up is best dealt with now, since I see the answer in the singularity documentation. The issue is that I don't want to contaminate the run line for every instance of the mspass container with the --bind complication. I could, but the documentation shows a much cleaner solution for this case. This output shows the point better than words:

[pavlis@i16 usarray]$ export SINGULARITY_BIND="/N/slate/pavlis,/home"
[pavlis@i16 usarray]$ singularity exec /N/slate/pavlis/mspass/mspass_latest.sif bash
Singularity> pwd
/N/slate/pavlis/usarray

The point is the use of the environment variable SINGULARITY_BIND, which causes every container launched from then on to be run with that --bind option. The next revision of the distributed_node_IU.sh script used that feature, binding the directories above with that same incantation.

That change solved the "read only" file system errors. I am left, however, with the mysterious error about the -o option for mpiexec.

That, in the end, was RIDICULOUS. Thank you google, but it turns out our master script had a typo. In the line of the script using mpiexec to launch worker containers we had "-hosts", BUT as the manual page says, it should be "-host" or "-H". It is a bug in mpiexec, in my opinion, that it writes such a misleading error for an arg that does not match, but that solves the problem anyway. What I don't actually know is whether the workers were actually running before the fix; that error may have been a red herring, to use a cliche.
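In diff terms the fix was one character on the worker launch line, shown here for clarity:

# before: produced the misleading error about an unknown "-o" option
mpiexec -n $((SLURM_NNODES-1)) -hosts $WORKER_LIST $SING_COM &
# after: the manual page says the option is -host (or -H)
mpiexec -n $((SLURM_NNODES-1)) -host $WORKER_LIST $SING_COM &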

So, I now have a working base script for running mspass with multiple nodes. Here it is with a few extra print statements I was using for debugging the shell script:

#!/bin/bash

#SECTION 1:  slurm commands (see below for more details)
#SBATCH -J mspass           # Job name
#SBATCH -p general
#SBATCH -o mspass.o%j       # Name of stdout output file
#SBATCH -N 3                # Total # of nodes
#SBATCH -n 3                # Total # of mpi tasks (normally the same as -N)
#SBATCH -t 00:10:00         # Run time (hh:mm:ss)
#      #SBATCH --ntasks-per-node=24   # appropriate for carbonate at IU
# For production need to specify number of tasks per node here


# SECTION 2:  Define the software environment
module load singularity
module load openmpi
module list
pwd
date

# IU and TACC have differences in default behavior of singularity
# for IU we have to explicitly bind the file system containing the working
# directory defined below or it will not be mounted in the container
export SINGULARITY_BIND="/N/slate/pavlis,/home"

cd /N/slate/pavlis
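# IU substitution for the TACC SCRATCH environment variable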
SCRATCH=/N/slate/pavlis/scratch    
# Similar IU substitution for the WORK2 symbol
WORK2=/N/slate/pavlis

#SECTION 3:  Define some basic control variables for this shell
# this sets the working directory
# SCRATCH is an environment variable defined for all jobs on stampede2
WORK_DIR=${WORK2}/usarray
# This defines the path to the docker container file.
# like SCRATCH WORK2 is an environment variable defining a file system
# on stampede2
# reference singularity container created with singularity build
MSPASS_CONTAINER=$WORK2/mspass/mspass_latest.sif
# specify the location where user wants to store the data
# should be in either tmp or scratch
#DB_PATH='scratch'
DB_PATH=$SCRATCH/mspass/workdir
# the base for all hostname addresses
HOSTNAME_BASE='uits.iu.edu'
# Sets whether to use sharding or not (here sharding is turned off)
DB_SHARDING=false
# define database that enable sharding
SHARD_DATABASE="usarraytest"
# define (collection:shard_key) pairs
SHARD_COLLECTIONS=(
    "arrival:_id"
)
# This variable is used to simplify launching each container
# Arguments are added to this string to launch each instance of a
# container.  stampede2 uses a package called singularity to launch
# each container instance
SING_COM="singularity run $MSPASS_CONTAINER"


# Section 4:  Set up some necessary communication channels
# obtain the hostname of the node, and generate a random port number
NODE_HOSTNAME=`hostname -s`
echo "primary node $NODE_HOSTNAME"
#LOGIN_PORT=`echo $NODE_HOSTNAME | perl -ne 'print (($2+1).$3.$1) if /c\d(\d\d)-(\d)(\d\d)/;'`
#STATUS_PORT=`echo "$LOGIN_PORT + 1" | bc -l`
#echo "got login node port $LOGIN_PORT"

# create reverse tunnel port to login nodes.  Make one tunnel for each login so the user can just
# connect to stampede2.tacc.utexas.edu
# disabled for IU - RED allows us to get around this
#for i in `seq 4`; do
#    ssh -q -f -g -N -R $LOGIN_PORT:$NODE_HOSTNAME:8888 login$i
#    ssh -q -f -g -N -R $STATUS_PORT:$NODE_HOSTNAME:8787 login$i
#done
#echo "Created reverse ports on Stampede2 logins"


# Section 5:  Launch all the containers
# In this job we create a working directory on stampede2's scratch area
# Most workflows may omit the mkdir and just use cd to a working
# directory created and populated earlier
mkdir -p $WORK_DIR
cd $WORK_DIR
pwd

# start a distributed scheduler container in the primary node
echo Launching scheduler on primary node
SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
SINGULARITYENV_MSPASS_ROLE=scheduler $SING_COM &

# get the all the hostnames of worker nodes
WORKER_LIST=`scontrol show hostname ${SLURM_NODELIST} | \
             awk -vORS=, -v hostvar="$NODE_HOSTNAME" '{ if ($0!=hostvar) print $0 }' | \
             sed 's/,$/\n/'`
echo $WORKER_LIST

# start worker container in each worker node
echo Attempting to launch workers
SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
SINGULARITYENV_MSPASS_SCHEDULER_ADDRESS=$NODE_HOSTNAME \
SINGULARITYENV_MSPASS_ROLE=worker \
mpiexec -n $((SLURM_NNODES-1)) -host $WORKER_LIST $SING_COM &
echo "mpiexec -n $((SLURM_NNODES-1)) -host $WORKER_LIST $SING_COM "

echo "Trying to launch other containers"
echo "DB launch line"
echo  "SINGULARITYENV_MSPASS_DB_PATH=$DB_PATH SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR SINGULARITYENV_MSPASS_ROLE=db $SING_COM "
echo "Using Single node MongoDB"
# start a db container in the primary node
SINGULARITYENV_MSPASS_DB_PATH=$DB_PATH \
SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
SINGULARITYENV_MSPASS_ROLE=db $SING_COM &
# ensure enough time for db instance to finish
sleep 10

# Launch the jupyter notebook frontend in the primary node.
# Run in batch mode if the script was
# submitted with a "-b notebook.ipynb"
if [ $# -eq 0 ]; then
    SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
    SINGULARITYENV_MSPASS_SCHEDULER_ADDRESS=$NODE_HOSTNAME \
    SINGULARITYENV_MSPASS_DB_ADDRESS=$NODE_HOSTNAME \
    SINGULARITYENV_MSPASS_SLEEP_TIME=$SLEEP_TIME \
    SINGULARITYENV_MSPASS_ROLE=frontend $SING_COM
else
    while getopts "b:" flag
    do
        case "${flag}" in
            b) notebook_file=${OPTARG};
        esac
    done
    SINGULARITYENV_MSPASS_WORK_DIR=$WORK_DIR \
    SINGULARITYENV_MSPASS_SCHEDULER_ADDRESS=$NODE_HOSTNAME \
    SINGULARITYENV_MSPASS_DB_ADDRESS=$NODE_HOSTNAME \
    SINGULARITYENV_MSPASS_SLEEP_TIME=$SLEEP_TIME \
    SINGULARITYENV_MSPASS_ROLE=frontend $SING_COM --batch $notebook_file
fi