---
layout: default
title: Take Baby Steps
author: Bob Freeman
---
Previous: README
Objectives
- Learn how to run your code from the terminal, both interactively & in batch
- Learn the scheduler job info command 'bjobs' & its options
In this section, it's important for you to learn some basics of working in the terminal, since batch jobs are run most effectively and flexibly from the command line.

When starting out on a compute cluster (often called a high-performance computing (HPC) or high-throughput computing (HTC) system), running applications interactively most closely emulates how we work on our local laptop and desktop systems: through GUI interfaces, we can open our applications, load our code and data, and run our analyses by pointing and clicking.

LSF, our scheduler, follows most resource manager conventions.
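For orientation, here is the small set of LSF client commands this lesson leans on. This is only a quick sketch (the script name and job ID below are placeholders), but all of these are standard LSF commands:

```
bsub < my_job_script.sh    # submit a job script to the scheduler (or: bsub [options] my_command)
bjobs                      # list your pending and running jobs
bqueues                    # list the available queues and their limits
bkill 1234567              # cancel a job by its job ID
```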
- Open your remote desktop program (NoMachine) and connect to the HSBGrid compute cluster.
- After the session starts, open the Gnome terminal program (Applications > System Tools > Terminal).
- By default, most applications are not 'loaded' into our session (in the terminal). Since Spyder is bundled with Python, we'll 'load' in Spyder through the appropriate `module load` command:

```
module load python/3.8.5
```
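If you'd like to confirm the load worked (assuming the cluster uses a standard environment-modules setup, which these commands rely on), you can check what is loaded and which python is now first in your `$PATH`:

```
module list        # currently loaded modules; python/3.8.5 should appear
which python       # should point at the module's python, not /usr/bin/python
python --version   # should report Python 3.8.5
```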
- With Python, Spyder, and all the data science environment tools in our `$PATH`, we can now run Spyder. To run interactive sessions from a terminal, we can follow the posted Interactive Job instructions (Working with Custom Submission Scripts > Interactive), using the MATLAB example as a guide, to run Spyder:
```
cd $HOME/git/rcs__transitioning_to_batch
bsub -q short_int -Is -W 60 -R "rusage[mem=4000]" -M 4000 -hl spyder
```
If successful, after a few moments Spyder should open. Much has happened here: the scheduler found a compute node with a free slot and the 4000 MB of memory we asked for, started a 60-minute interactive session there, and launched Spyder, whose window is displayed back on our remote desktop.
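In case the flags are unfamiliar, here is a line-by-line reading of that submission. The meanings come from LSF's bsub documentation; note that the memory figures are interpreted as MB here, which can vary with a cluster's LSF_UNIT_FOR_LIMITS setting:

```
# bsub -q short_int -Is -W 60 -R "rusage[mem=4000]" -M 4000 -hl spyder
#
#   -q short_int            submit to the interactive 'short_int' queue
#   -Is                     run interactively, attached to a pseudo-terminal
#   -W 60                   wall-clock runtime limit of 60 minutes
#   -R "rusage[mem=4000]"   ask the scheduler to reserve 4000 MB of memory
#   -M 4000                 enforce a memory limit of 4000 MB on the job
#   -hl                     enforce the memory/swap limits per-job, per-host
#   spyder                  the command to launch on the compute node
```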
This is all well and good, but we've only told Spyder to open, not to open our code file. To do that, include the path to your code file by dragging its icon into the terminal window after you've typed the Spyder job submission command but before pressing Enter. Your final result should look like this:
```
bsub -q short_int -Is \
    -W 60 \
    -R "rusage[mem=4000]" -M 4000 -hl \
    spyder code/process_data.py
```
- Although we've opened the code file interactively, our goal is to run it. So let's make that adjustment. Instead of using Spyder to view and run our code, let's simply run it with the Python interpreter by substituting in the `python` command:
```
bsub -q short_int -Is \
    -W 60 \
    -R "rusage[mem=4000]" -M 4000 -hl \
    python code/process_data.py
```
If all goes well, you should see the code running in your terminal window, albeit slowly. That's fine. Press Ctrl-C to abort the code.
This is actually much easier than it sounds, though a number of 'friction points' will appear along the way; those will become apparent as we go.
- In your terminal session, switch to running your code in batch by switching to a batch queue (`long` or `short`) and removing the 'interactive' flag (`-Is`). The `short` queue is most appropriate, as this code will likely take minutes to run (~8.5 min on my 2020 iMac).
```
bsub -q short \
    -We 15 \
    -R "rusage[mem=4000]" -M 4000 -hl \
    python code/process_data.py
```
Since we're now running in batch, our scheduler will respond with our job ID, and we cross our fingers that it is off and running. We have every reason to expect it to run OK, as it ran fine when we did it interactively.
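The acknowledgement from bsub should look something like the following (this is LSF's standard submission message; your job ID will differ):

```
Job <1607942> is submitted to queue <short>.
```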
Well, as you may have noticed, it's hard to tell what is actually going on. But we do have a few options:
- Look at the directories for any file changes
- Ask the scheduler if it's still running
- Look at the job run details
Since we know we're writing out the running sum every 10 lines, our output file should be present and growing, which we can see:
```
$ ls -alt data/    # sort this reverse chronologically
total 55776
-rw-rw----. 1 rfreeman dor   231930 Aug 11 11:36 outfile1.txt
drwxrwx---. 3 rfreeman dor     4096 Aug 11 11:28 .
drwxrwx---. 6 rfreeman dor     4096 Aug 11 11:20 ..
-r--r-----. 1 rfreeman dor 11321676 Aug 11 11:20 file4.txt
-r--r-----. 1 rfreeman dor 11321569 Aug 11 11:20 file3.txt
-r--r-----. 1 rfreeman dor 11320369 Aug 11 11:20 file2.txt
-r--r-----. 1 rfreeman dor 11322055 Aug 11 11:20 file1.txt
-r--r-----. 1 rfreeman dor 11320368 Aug 11 11:20 file1.bad
dr-xr-x---. 2 rfreeman dor     4096 Aug 11 11:20 bak
```
Yup, it's there. But we can't see what's going on. Let's park that for a moment.
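If you're impatient, one quick way to watch that file grow in real time is a plain shell command, nothing LSF-specific (press Ctrl-C to stop following):

```
tail -f data/outfile1.txt    # print new lines as they are appended to the file
```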
Our next option is asking the scheduler for our job status and info, which is easy to get:
```
$ bjobs
JOBID     USER     STAT  QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME    SUBMIT_TIME
1607942   rfreema  RUN   short     rhrcscli1   rhrcsnod23  *s_data.py  Aug 11 11:35
```
We can get a detailed report on our job by using the `-l` (long) option:
```
$ bjobs -l 1607942

Job <1607942>, User <rfreeman>, Project <default>, Status <RUN>, Queue <short>,
                     Command <python code/process_data.py>, Share group charged
                     </rfreeman>, Esub <hl mem>
Wed Aug 11 11:35:10: Submitted from host <rhrcscli1>, CWD <$HOME/git/rmf__trans
                     itioning_to_batch>, Requested Resources <rusage[mem=4000]>
                     , memory/swap limit enforced per-job/per-host;
Wed Aug 11 11:35:11: Started 1 Task(s) on Host(s) <rhrcsnod23>, Allocated 1 Slo
                     t(s) on Host(s) <rhrcsnod23>, Execution Home </export/home
                     /dor/rfreeman>, Execution CWD </export/home/dor/rfreeman/g
                     it/rmf__transitioning_to_batch>;
Wed Aug 11 11:35:46: Resource usage collected.
                     IDLE_FACTOR(cputime/runtime): 0.00
                     MEM: 5 Mbytes; SWAP: 0 Mbytes; NTHREAD: 5
                     PGID: 182729; PIDs: 182729 182730 182732 183120

 RUNLIMIT
 4320.0 min

 MEMLIMIT
 3.9 G

 ESTIMATED_RUNTIME:
 15.0 min based on user-provided runtime estimation.

 MEMORY USAGE:
 MAX MEM: 5 Mbytes; AVG MEM: 4 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=4000.00] span[hosts=
                     1] affinity[core(1)*1]
 Effective: select[type == local] order[r15s:pg] rusage[mem=4000.00] span[hosts
                     =1] affinity[core(1)*1]
```
Indeed, we can see our job is running. The IDLE_FACTOR (cputime/runtime) gives a rough sense of how busy the job is: values near 1 mean the job is keeping its CPU busy, while values near 0 mean it is mostly idle (this first snapshot, taken about 30 seconds in, still reads 0.00). As an aside, we can see our MAX MEM is only 5 MB, so on future submissions we could lower our memory request to free up RAM for others who might need it.
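Beyond `bjobs -l`, a few other stock LSF commands are handy for keeping an eye on a batch job, shown here with the job ID from above:

```
bjobs -w 1607942    # wide output: un-truncated job name and host names
bjobs -a            # include recently finished jobs, not just pending/running ones
bpeek 1607942       # peek at the stdout/stderr the job has produced so far
bhist -l 1607942    # detailed history of the job's state changes
bkill 1607942       # and, if something has gone wrong, cancel the job
```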
Next: Set Up Your Script's Inputs and Outputs