| layout | title | author |
|---|---|---|
| default | Scaling up Work | Bob Freeman |
Previous: Better Visibility with Progress of Jobs, Job Information, and Job Control
Objectives
- Understand the customs and appropriate behavior when scaling up work
- Simple one task, one-job approaches
- Using job arrays to submit a set of jobs/tasks as one job with multiple parts
So you've made it. You can now run batch jobs, your code flexibly uses any input file and output file, you capture log files with unique names, and you get notifications. Well, you're not done just yet. It's time to really stretch, and to run many copies of your code at once, leveraging the full power and ability to scale across the many cores our cluster has to offer. We'll show several workflow or design patterns, but there are some points of etiquette and consideration we must point out.
Since you are using a shared resource, just as when attending a concert, dinner out, or a special-occasion party, there are certain customs and behaviors that are appropriate and expected when working around others and on shared equipment. We'll touch on those briefly:
- Ensure you are requesting the appropriate resources.
On local desktops / laptops, you own and control all the resources. On shared equipment, the scheduler packs your job together with other jobs, from you or other people. Like packing a car on the Martha's Vineyard ferry, the scheduler needs to know the size and shape of your job: # of cores, amount of RAM, how long, etc. The more you take, the less for everyone else. So remember the mantra 'Take what you use; Use what you take.' See our Choosing Resources web page for more info.
- Bundle work, and don't overwhelm the scheduler.
It is easy to submit a job, and just as easy to submit 100 or 1,000. But there are roughly 25 to 50 other people doing the same thing. If you submit and run jobs that all fail, or that all finish in seconds, that is very expensive for the scheduler -- accepting the submission; calculating shape, size, etc.; determining the best place and time to run; sending over job info; setting up the job; running and monitoring the job (collecting runtime metrics); breaking down the job; collecting and logging the job data; sending info back to you; and writing everything to disk.
For this reason, it is always good to have jobs run for at least 5 to 10 minutes, and up to several hours or days. The longer a job runs, the higher the chance of failure, so ensure you have checkpointing or a resume-after-fail framework written into your code (ask RCS about this); a minimal sketch of bundling work with a resume-after-fail check appears after this list.
This also means that your jobs run more efficiently and that the cluster is more fully utilized.
- Do scaling tests, and "Test small, run big."
This safety net helps you understand the appropriate # of cores or amount of RAM needed by seeing what happens when you double the core count or file size. More is not better (more for you means less for others), so you should work out the optimal run configuration based on the efficiency of your code and the urgency of your need for the results.
Also, when scaling it is easy to make mistakes in your code or in how you submit jobs, and running hundreds of jobs can have unanticipated side effects on storage or expose bottlenecks in your code. So test with one job, then a few to tens of jobs, then larger sets. This gradual scaling will help identify problem points that you may need to resolve, especially where they involve shared resources that other jobs also use, like storage.
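To make the bundling and resume-after-fail advice concrete, here is a minimal sketch; the file names mirror the data/file*.txt and code/process_data.py examples used later in this lesson, and the 20-file count is an assumption. It runs many short tasks inside a single job and skips any task whose output already exists, so a failed or killed job can simply be resubmitted:

#!/usr/bin/bash
#
# sketch: bundle many short tasks into ONE job, resuming where we left off
module load python/3.8.5

for index in `seq 1 20`; do
    infile=data/file${index}.txt
    outfile=data/out${index}.txt
    # resume-after-fail: skip any task whose output is already present
    if [ -s "$outfile" ]; then
        echo "skipping $infile -- $outfile already exists"
        continue
    fi
    python code/process_data.py --input "$infile" --output "$outfile"
done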
The first approach, one task, one job, is wickedly simple: we submit one job to the cluster for each task we wish to accomplish. But this can also be exceedingly frustrating for other users, as they see hundreds or thousands of jobs ahead of them waiting to run. The etiquette rules above apply, so be thoughtful.
We'd like to process the files in our data directory. And luckily we know that all the names use a sequence of numbers, which lends itself well to a for...do loop. Let's set that up now.
- Set up the shell of the for...do loop:
for index in `seq 1 4`; do
    echo $index
done
When you run this in your terminal, you should see a series of numbers from 1 to 4. Good.
- Now let's add our command, but in debug mode -- putting echo in front of the actual command:
# use the module load if not continuing from previous sections
module load python/3.8.5

for index in `seq 1 4`; do
    echo $index
    echo bsub -q short \
        -M 1000 -hl \
        -We 15 \
        -B -N \
        -o logs/process_data_%J.out -e logs/process_data_%J.err \
        python code/process_data.py --input data/file1.bad --output data/out1.bad
done
When you run this, you should see the bsub command that you'd normally use echoed to the screen, but nothing has actually been submitted. We're still only doing the same submission over and over, though; we need to adjust the input and output files.
- Adjust the input and output parameters to utilize the index, pointing the file inputs and outputs at the well-formed files:
# use the module load if not continuing from previous sections
module load python/3.8.5

for index in `seq 1 4`; do
    echo $index
    echo bsub -q short \
        -M 1000 -hl \
        -We 15 \
        -B -N \
        -o logs/process_data_%J.out -e logs/process_data_%J.err \
        python code/process_data.py --input data/file$index.txt --output data/out$index.txt
done
Now when the loop runs, it should show the four bsub commands, each specifying the appropriate input and output file.
But before we unleash the Kraken (letting all of our jobs run; finally!), we should test and scale carefully.
- Run the loop only once, and this time actually submit the job:
# use the module load if not continuing from previous sections
module load python/3.8.5

for index in `seq 1 1`; do
    echo $index
    bsub -q short \
        -M 1000 -hl \
        -We 15 \
        -B -N \
        -o logs/process_data_%J.out -e logs/process_data_%J.err \
        python code/process_data.py --input data/file$index.txt --output data/out$index.txt
done
We've changed the sequence of numbers to 1..1, instead of 1..4, and also removed the echo command. When run, the job should be submitted, and you can confirm this using the bjobs command:
bjobs
xxx
And let's cancel that test job before we run all the jobs, substituting the job ID reported by bjobs for xxx...
bkill xxx
- With one slight modification, a pause between submissions, let's submit all our jobs:
for index in `seq 1 4`; do
    echo $index
    bsub -q short \
        -M 1000 -hl \
        -We 15 \
        -B -N \
        -o logs/process_data_%J.out -e logs/process_data_%J.err \
        python code/process_data.py --input data/file$index.txt --output data/out$index.txt
    sleep 0.25
done
Again, with the bjobs command, you should be able to see all your jobs running, and bjobs -l will give you run details on all of them:
bjobs
bjobs -l
Congratulations on your first success at scaling your work!
As we mentioned earlier, running multiple instances of your code simultaneously across the cluster, using whatever cores are available, is one method of scaling your work. A not-so-pleasant side effect of the One Task, One Job design pattern is that you have tens or hundreds of jobs to manage, as does the scheduler. And everyone else might be bothered by the sheer number of jobs ahead of them when they submit their own work.
Another, kinder and gentler job submission pattern is using job arrays. Thinking about this last submission pattern, all the jobs had the same size and shape of resource request: all one core, all 1 GB RAM, all approximately 10 minutes, and all in the short queue. But we had tens or hundreds of jobs for a given set of work. When submitting work via a job array, the work is instead submitted as one job, but the job has tens, hundreds, or thousands of tasks, like the little slots in old post office mailboxes, or indexes in an array.
There is a slight shift to how one utilizes these, which we'll explore now.
Instead of using a for...do loop for the sequence, with the variable $index holding our particular sequence item, LSF creates the variable LSB_JOBINDEX that our code can utilize. This variable indicates which slot (task) of the array a given job is running.
The quick and dirty method for simple commands or job runs looks like the following:
- Submit our work as a job array, using the LSB_JOBINDEX variable to indicate the job slot:
# ensure python has already been loaded
module load python/3.8.5
# now submit our array, for 1 - 4 slots
bsub -J "process_data[1-4]" \
-q short \
-M 1000 -hl \
-We 15\
-B -N \
-o logs/process_data_%J_%I.out -e logs/process_data_%J_%I.err \
python code/process_data.py \
--input data/file\$LSB_JOBINDEX.txt --output data/out\$LSB_JOBINDEX.txt
A few changes to note:
- We submit the job array with the -J option, giving it an array name and the upper and lower indexes.
- For our job logs, we also include the %I option, which embeds the slot number into the log filenames.
- Finally, we include LSB_JOBINDEX in the input and output file names to indicate which files each task should use. Notice that we prefix $LSB_JOBINDEX with a backslash ('\') so that the variable is not expanded by the shell when we submit, but is expanded when the python command is executed on the compute node.
So our work is now one job (== one jobID), but it has 4 tasks or slots. Since each task is identical, the scheduler knows they are similar in resource needs, and thus it is easier for the scheduler to prioritize this work alongside everyone else's.
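One convenience of this pattern is that the array can be inspected and controlled as a whole or per task. The commands below are a brief aside; the job ID 12345 is only a placeholder for the ID that bsub reports, and the output formatting varies with your LSF configuration:

# summarize the whole array: counts of pending, running, done, and exited tasks
bjobs -A 12345

# examine, or kill, just one task (slot 3) of the array
bjobs "12345[3]"
bkill "12345[3]"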
Another way to submit this work to the cluster is via a job submission script. Although it is a little extra work, the script is more expressive and less complicated, is easier to read, and can also serve as a record of your work -- automagically documenting what you have done.
- Let's create the submission script 'code/process_data_job.sh' in your favorite text editor. It should contain the following code:
#!/usr/bin/bash
#
# script process_data_job.sh
#
#
# bsub directives can go here
# and then any environment stuff
module load python/3.8.5
# assuming we are starting at the root of the project folder
# and now globals
slot=$LSB_JOBINDEX
infile=file${slot}.txt
outfile=out${slot}.txt
python code/process_data.py \
--input data/$infile --output data/$outfile
- Save our file, mark it executable, and launch our job array with the following command:
chmod a+x code/process_data_job.sh
# now submit our array, for 1 - 4 slots
bsub -J "process_data[1-4]" \
    -q short \
    -M 1000 -hl \
    -We 15 \
    -B -N \
    -o logs/process_data_%J_%I.out -e logs/process_data_%J_%I.err \
    code/process_data_job.sh
And within a few moments, our job should start running.
Although slightly simpler, the job submission line is still rather long. We can simplify it by embedding all the job options in our job submission script file. And if we wish to override any option in the file, we can include that override on the command line. You can execute bkill -q short 0 to stop all of these jobs.
- Modify our job submission script, placing all job options in the script file:
#!/usr/bin/bash
#
#BSUB -J "process_data[1-4]"
#BSUB -q short
#BSUB -M 1000 -hl
#BSUB -We 15
#BSUB -B -N
#BSUB -o logs/process_data_%J_%I.out
#BSUB -e logs/process_data_%J_%I.err
#
# script process_data_job.sh
#
# and then any environment stuff
module load python/3.8.5
# assuming we are starting at the root of the project folder
# and now globals
slot=$LSB_JOBINDEX
infile=file${slot}.txt
outfile=out${slot}.txt
python code/process_data.py \
--input data/$infile --output data/$outfile
- Save our updated job submission script, and submit the job, overriding the queue that it is going to run in:
# now submit our array, for 1 - 4 slots
bsub -q long code/process_data_job.sh
If all goes well, we should see our jobs starting in a few moments, but running in the long queue instead of short.
Congratulations! You are now working steadily towards fully utilizing a complex computational environment for your research, one that can speed your discovery and accelerate research outcomes beyond what you had imagined. The concepts here are applicable on most high-performance and high-throughput systems (HPC and HTC, respectively).
There are slight nuances to these approaches, one being working with non-sequentially-named files. There are solutions for this (use the job index to pick a file from a bash array of filenames, use the job index to look up a line in a file listing file names or job parameters, etc.) that require very minor code changes; a sketch of one such solution follows the questions below. There are also different paradigms one can choose (e.g. HDF files, using a SQL database for your data, etc.). What is important is to think through the questions:
- what type of data am I working with?
- what is the format of the files and data?
- will I need to iterate repeatedly over the data?
- what analysis pattern am I using, and is it efficient given the setup at hand?
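As one illustration of the look-up-by-index idea mentioned above, here is a minimal, hedged sketch of a job array script that picks its input file by line number from a plain-text file list. The list name (data/filelist.txt) and the output naming convention are assumptions for this example, not part of this lesson's materials:

#!/usr/bin/bash
#
# sketch: each array task looks up its own, arbitrarily named input file
# by reading line number $LSB_JOBINDEX from an assumed file list
module load python/3.8.5

# assumed file list: one input path per line, relative to the project root
infile=$(sed -n "${LSB_JOBINDEX}p" data/filelist.txt)
outfile="$infile.out"    # assumed naming convention for results

python code/process_data.py --input "$infile" --output "$outfile"

You would then submit this as before with bsub -J "process_data[1-N]", where N is the number of lines in the file list (for example, the value of wc -l < data/filelist.txt).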
We hope these lessons and information help you on your research journey, and we urge you to contact us if you should have any questions.
Best, Bob & the RCS Staff