New egs-parallel set of scripts to replace run_user_code_batch #628

Merged
merged 21 commits into develop from feature-egs-parallel on Apr 12, 2021

Conversation

@ftessier (Member) commented Aug 27, 2020

This pull request implements egs-parallel: a new set of bash scripts to submit EGSnrc parallel jobs, with the following improvements over the legacy command exb (aliased to the run_user_code_batch script):

  • A detailed account of the job submission is saved in a log file (and echoed to the terminal with the --verbose option)
  • Helpful messages and usage information are provided when an error occurs during job submission (or with the -h option)
  • Script-level synchronization significantly reduces errors and race conditions for parallel jobs and .lock files
  • The EGSnrc command passed to the script is identical to the interactive one (copy and paste!)
  • Options can be provided on the command line in arbitrary order, in short (-) or long (--) form
  • Single-letter short options don't require a space before their argument (see the parsing sketch after the sample invocation below)
  • The top-level egs-parallel script dispatches a specific sub-script for each method (e.g., egs-parallel-pbs)
  • The egs-parallel script can launch jobs locally on a multicore computer, with the --batch cpu option
  • Scripts for other schedulers are easily added as new sub-scripts, without having to modify egs-parallel
  • A decent egs-parallel-clean script is provided to help tidy up intermediate simulation files and logs
  • The implementation is orthogonal to the legacy scripts, so it is backward compatible (you can continue to use exb)
    usage:

        egs-parallel [options] -c | --command 'command'

    options:

        -h | --help         show this help
        -b | --batch        batch system to use ("cpu" by default)
        -d | --delay        delay in seconds between individual jobs
        -q | --queue        scheduler queue ("long" by default)
        -n | --nthread      number of threads ("8" by default)
        -o | --option       option(s) to pass to job scheduler, in quotes
        -v | --verbose      echo detailed egs-parallel log messages to terminal
        -c | --command      command to run, given in quotes

Here is a sample invocation:

egs-parallel --batch pbsdsh -q short -n12 -v -c 'egs_chamber -i slab -p 521icru'
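
For illustration, a minimal sketch (not the actual egs-parallel parser) of how both separated (-n 12) and joined (-n12) short options, as well as long options, can be handled in bash; the variable names are hypothetical:

    while [ $# -gt 0 ]; do
        case "$1" in
            -n|--nthread) nthread="$2"; shift 2 ;;    # separated form: -n 12
            -n*)          nthread="${1#-n}"; shift ;; # joined form: -n12
            -c|--command) command="$2"; shift 2 ;;
            *)            echo "unknown option: $1" >&2; exit 1 ;;
        esac
    done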

The scripts egs-parallel and egs-parallel-clean are placed inside a new $HEN_HOUSE/scripts/bin/ directory, which is added to the path in the EGSnrc shell additions. The sub-scripts are only meant to be called from the top-level egs-parallel, so they are not placed inside this new bin/ directory, to prevent calling them directly (they remain in $HEN_HOUSE/scripts/, which is not added to the path).
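
For reference, the shell addition amounts to something like this (a bash-flavoured sketch; the exact line lives in the EGSnrc shell additions scripts):

    # prepend the new scripts/bin directory to the search path
    export PATH="$HEN_HOUSE/scripts/bin:$PATH"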

Take this out for a spin if you will, but remain cautious (especially with the cleaning script!).

@crcrewso (Contributor) commented

This looks so much cleaner!!!
What would it take to add support for Slurm?

@ftessier (Member, Author) commented Aug 28, 2020

@crcrewso For slurm, we need someone to copy the egs-parallel-pbs script to egs-parallel-slurm, and adjust it accordingly; then the option --batch slurm should work right away! I would also update the egs-parallel-clean script to clean and combine slurm output files the same way as pbs .eo files.
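
For illustration only, here is a minimal sketch of what a hypothetical egs-parallel-slurm sub-script might look like (not part of this PR; the argument order assumes the sub-script calling convention, and the sbatch flags would need adjusting to the local cluster):

    #!/bin/bash
    # hypothetical sketch: submit parallel jobs via sbatch,
    # mirroring the structure of egs-parallel-pbs
    queue=$1; nthread=$2; delay=$3; first=$4; basename=$5; command=$6
    for job in $(seq "$first" "$nthread"); do
        sbatch --partition="$queue" \
               --job-name="${basename}_w${job}" \
               --output="${basename}_w${job}.eo" \
               --wrap="$command -b -P $nthread -j $job -f $first"
        sleep "$delay"    # stagger submissions
    done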

@ftessier (Member, Author) commented Aug 28, 2020

I know the scripts are a little dense: I am trying here to resolve the lock file issue without changing the EGSnrc code, which is perhaps a bit contorted (see by comparison @mainegra's cleaner uniform run control object solution in #588). I found that separating the two roles of the lock file (indicating whether the job is running, and serving the number of histories) resolved a lot of the issues. There is now a .egsjob file, accessed by the first job only, which serves to indicate when the job has started and when it is finished. So perhaps now I can go back and integrate this .egsjob file inside the EGSnrc code... On the other hand, when this file is created at the script level, before the job is launched on the node, it curbs issues related to scheduler delays in launching the first job. At any rate, I have found this approach more stable. And combined with @crcrewso's improvement with the lock file delay jitter, I think we got rid of most of the lock-ups!
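
Conceptually, the script-level synchronization boils down to something like this (a sketch with illustrative variable names, not the actual script):

    # the first job creates the .egsjob marker before launch;
    # later jobs wait for it, so they never race the first job
    if [ "$job" -eq 1 ]; then
        touch "${basename}.egsjob"
    else
        while [ ! -f "${basename}.egsjob" ]; do
            sleep 1
        done
    fi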

There is also a lot of code repetition between the sub-scripts, which is not ideal for code maintenance: we have to remember to copy changes to all the sub-scripts. But this is intentional, as I wanted it to be easy to create new sub-scripts for other schedulers (slurm!) or to suit their particular needs, without imposing any logic beyond the arguments passed to the script. If code maintenance becomes an issue, I'll think of something else; I am not worried about this for now.

@crcrewso (Contributor) commented

I'm not entirely sure I'll have a test system for slurm until these quarantimes are over, but if I do, I'll definitely submit something.

@mainegra (Contributor) commented

@ftessier are these scripts tied to using the lock file mechanism? The reason I ask is that in HPC environments where the lock file does not work, one might want to resort to the URCO mechanism (uniform load on all jobs), and then these scripts will not work. But perhaps this is not an issue, since in those cases one might have to use different scripts, such as the one I created for the GPSC. I will try these scripts on the GPSC and see what happens!

@ftessier (Member, Author) commented

For the moment, yes, they are :-( (except the cpu sub-script, which does not try to synchronize with the .lock file). I have to add an option to skip this for a uniform run control object (urco)... In the end, the lock file checking etc. should be done in the code, because otherwise options are obscurely correlated between the input file (selecting the urco) and the script option to skip the lock file. I don't know yet what the best way forward is. Perhaps just leave the .egsjob file in the script, and let EGSnrc worry about the .lock file as before; maybe this is sufficient, I will try!

@ftessier (Member, Author) commented

In the end, the lock file should only be checked and handled by either the script or EGSnrc, not both, for exactly the reason you point out: we don't want to have to change the submit scripts if we change the lock file handling in EGSnrc.

@ftessier (Member, Author) commented

@mainegra Could the first job still create an (empty) .lock file when starting with the uniform run control object (urco), even if it is never accessed again? This would be informative for the user, to know which job is running, etc. Perhaps the lock file could contain information on the total number of histories requested, how many jobs, and how many histories per job? The .lock file is useful on the submit script side to confirm that the job has indeed started.

@mainegra (Contributor) commented

@ftessier in practice it is possible, but the name (extension) would be misleading.

@ftessier (Member, Author) commented

@mainegra Thanks for the suggested improvement! I removed the dependencies on the lock file in the egs-parallel scripts. The script will still prevent launching the simulation if there is a .lock file (or a .egsjob file), but this can now be overridden with the -f | --force option in the top-level egs-parallel script.
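
The guard amounts to something like this sketch (illustrative variable names, not the actual script):

    # refuse to launch over an existing .egsjob or .lock file, unless forced
    if [ "$force" != "yes" ]; then
        for ext in egsjob lock; do
            if [ -f "${basename}.${ext}" ]; then
                echo "found ${basename}.${ext}: use -f | --force to override" >&2
                exit 1
            fi
        done
    fi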

@mainegra (Contributor) commented Aug 31, 2020

> @mainegra Thanks for the suggested improvement! I removed the dependencies on the lock file in the egs-parallel scripts. The script will still prevent launching the simulation if there is a .lock file (or a .egsjob file), but this can now be overridden with the -f | --force option in the top-level egs-parallel script.

@ftessier that's great news! That way we could potentially use your scripts on the GPSC as well!

@ftessier added the "work in progress, don't merge yet" label Sep 9, 2020
@mainegra (Contributor) commented Dec 3, 2020

@ftessier I am giving these scripts a try! Quick question: Why is there a delay at the beginning? I turned on the verbose option and got a headache! 🤯

@mainegra (Contributor) commented Dec 3, 2020

> @mainegra Thanks for the suggested improvement! I removed the dependencies on the lock file in the egs-parallel scripts. The script will still prevent launching the simulation if there is a .lock file (or a .egsjob file), but this can now be overridden with the -f | --force option in the top-level egs-parallel script.

@ftessier I tried using the urco with this and it failed, complaining about the lock file not being there... are you sure the dependency is gone?

@ftessier (Member, Author) commented Dec 3, 2020

Indeed, the scripts still contain a while loop waiting for the lock file. Hmmm, strange. The only explanation I can come up with is that I did not properly commit all the changes back then... Let me try again!

@ftessier (Member, Author) commented Dec 3, 2020

Updated to truly remove the lock file dependencies in the egs-parallel scripts, as was supposed to have been done earlier (I probably forgot to add these deletions to the commit).

@ftessier removed the "work in progress, don't merge yet" label Mar 25, 2021
@ftessier marked this pull request as ready for review March 25, 2021
@mainegra (Contributor) left a review comment

Awesome!

ftessier and others added 20 commits March 26, 2021 08:49

Save the egs-parallel log inside an *.egsparallel file in the
application directory, and add a verbosity option (-v) to also echo the
log to screen. By default the scripts proceed silently, unless an error
condition arises, which is always echoed to the terminal.

Notably, save log messages to a log file, add a verbosity option (-v),
and allow joined single-letter options and arguments (without a space
between the option and its argument, as in "-n123").

Apart from format and other minor adjustments, update the standard pbs
script egs-parallel-pbs (whereby EGSnrc jobs are submitted individually)
so that only the second job waits for the .egsjob file and the .lock
file, since the jobs are submitted sequentially anyway.

This egs-parallel-cpu subscript provides the option "--batch cpu" to
egs-parallel, to launch a simulation on multiple cores on the local cpu,
without requiring a job scheduler. Intentionally, this script is simple:
it just launches the jobs sequentially, without waiting around for the
.egsjob or .lock files, as in the pbs scripts. However, the logging is
consistent with the other egs-parallel scripts.

The number of threads is always constrained to the number of threads
available on the machine, because it is inefficient to go beyond that,
and launching a large number of threads on a cpu by mistake may well
stall the computer.

Improve the script robustness, in particular by forcing the user to
specify either the -n (--dry-run) option, or the -f (--force) option to
actually remove files, to prevent accidental erasing (to some extent).
This script removes files without warnings (when using -f), so use with
caution: run with the -n option first to see what will be deleted.

Add the concatenation and sorting of egs-parallel log messages into the
.egsparallel file for reference. This is useful, because these log
messages may be scattered in different files, for example the .eo files
from pbs. After cleaning, the .egsparallel contains a time-ordered
sequence of messages from egs-parallel and its subscripts.

Strictly speaking, there can be multiple threads per hardware core;
this is typical in modern workstations. Change "ncore" to "nthread"
throughout the egs-parallel scripts, to avoid confusion.

Add a bin directory in HEN_HOUSE/scripts and add it to the PATH in the
shell additions scripts. This allows some EGSnrc scripts to be directly
executable by a user, without using aliases (which are not inherited by
subshells). The immediate motivation is for the top-level egs-parallel
script, and the egs-parallel-clean script, to become visible on the
path, while the egs-parallel sub-scripts remain in scripts and are not
in the path (these should not be invoked directly).

Do not source the shell additions scripts from within the egs-parallel
sub-scripts, as this is not necessary and not secure. Sourcing was only
needed in the dshtask script to get the path to the EGSnrc executables,
because tasks are launched on the pbs nodes without inheriting the
environment. In this case, simply export the PATH variable via the
pbsdsh qsub script.

Use a more portable date command format for the timestamp string, and
tweak the usage message of the egs-parallel scripts.

Add -x (--extra) option to clean up egs-parallel log files .egsparallel
and .egsparallel.eo. Although this script always echoes progress to the
terminal, add a -v (--verbose) option to echo the commands that are run
by the script, instead of the more concise messages usually reported.
Internally, add an "action" command to ensure that the log messages
remain up to date with the commands.

For convenience, add a -l (--list) option to the cleaning script to list
all the .egslog file base names in the current directory. This option is
checked first and overrides every other argument: the list is printed to
the terminal and the script terminates. Also, reformat the usage message
and use the extension .egsparallel-eo (with a hyphen) to avoid collision
with the pbs .eo extension. Use the executable basename in the quit
function.

Change the initial value of the --batch option to "cpu" so that the
script invokes the multicore parallel sub-script (egs-parallel-cpu) when
no --batch option is specified on the command line. This allows users to
try egs-parallel out of the box (most computers are multicore nowadays)
without worrying about schedulers.

Don't quit the egs-parallel submit scripts if no lock file is found, and
add a -f (--force) option to override existing .egsjob or .lock files.

The lock file for parallel jobs is managed inside EGSnrc, so the script
should not manage it as well: this creates an obscure correlation
between the code and the script. Moreover, the uniform run control
method does not create a lock file. Previously, the submit script would
quit if there was no lock file. The top-level egs-parallel script now
prevents the run if there is an .egsjob file OR a .lock file, for the
same reason. This can be overridden with the added --force option.

Detect pbs jobs that fail to launch in egs-parallel, by looking at the
echoed job pid: quit immediately if it is not an integer. If the first
job fails, subsequent jobs are not launched. Report the failure in the
log. Also adjust the format of a few log messages.

Fix a crash that occurred when the 14-character truncation of the
filename for an egs-parallel pbsdsh job ended up starting with a '.'.
The first character is now trimmed away if that is the case, so the job
name is only 13 characters.

Ensure that the PBS job name starts with an alphanumeric character
[0-9A-Za-z], following the PBS scheduler requirement. To avoid failed
jobs solely on account of a bad job name, strip all leading
non-alphanumeric characters from the job name. Note that the egsinp
basename is not affected; this is strictly for the job name passed to
qsub via the -N option.
@ftessier (Member, Author) commented

  • Rebased on develop.
  • Removed the | pipe symbol in the egs-parallel usage sample command (but not in the list of options, where it stands for "OR").
  • Adjusted the job name to strip leading non-alphanumeric characters.
  • Corrected typos in commit messages.

@blakewalters (Contributor) commented Mar 29, 2021

@ftessier, these scripts aren't expected to work seamlessly on OSX, correct? For example, on a (my) Mac, the line:
cpu_nthread=$(grep -c processor /proc/cpuinfo)
in egs-parallel-cpu doesn't actually return the number of CPUs (I think it's "sysctl -n hw.ncpu"), resulting in a "unary operator expected" warning in the comparison on the next line of the script. The job seems to be submitted okay, though!

@ftessier (Member, Author) commented

Good point @blakewalters. I always forget that there is no /proc on BSD systems. I have added a small commit to trap the Darwin OS name and issue the proper command that you provided. I don't know whether these are going to work on Windows (my guess is no). Anyone willing to try, say, under Windows Git Bash or WSL?
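
Presumably the fix looks something like this sketch (the actual commit may differ):

    # pick the thread-count command based on the operating system
    case "$(uname -s)" in
        Darwin) cpu_nthread=$(sysctl -n hw.ncpu) ;;
        *)      cpu_nthread=$(grep -c processor /proc/cpuinfo) ;;
    esac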

@crcrewso (Contributor) commented Mar 31, 2021

> Good point @blakewalters. I always forget that there is no /proc on BSD systems. I have added a small commit to trap the Darwin OS name and issue the proper command that you provided. I don't know whether these are going to work on Windows (my guess is no). Anyone willing to try, say, under Windows Git Bash or WSL?

It shouldn't work under Windows Git Bash, since there's no gcc or gfortran.

I can test WSL 1 and WSL 2 over the weekend.

Edit 1: WSL 1 (Pengwin) works without issues.

crcrewso@localhost:~/source/egshome/BEAM_varian_6x_above_jaws$ egs-parallel -v -c 'BEAM_varian_6x_above_jaws -i test -p SASKicru'
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.232695500: BEGIN egs-parallel
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.318693900: EGSnrc environment:
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.397607900:     HEN_HOUSE  = /home/crcrewso/source/EGSnrc/HEN_HOUSE/
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.475162800:     EGS_HOME   = /home/crcrewso/source/egshome/
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.550944100:     EGS_CONFIG = /home/crcrewso/source/EGSnrc/HEN_HOUSE/specs/linux.conf
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.724554000: parallel options:
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.801410000:     batch      = cpu
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.877489600:     queue      = long
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:24.949586200:     nthread    = 8
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.023381900:     delay      = 0
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.099621000:     command    = BEAM_varian_6x_above_jaws -i test -p SASKicru
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.174429800:     basename   = test
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.249390300:     first job  = 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.324223300:     options    =
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.439199600: log file: /home/crcrewso/source/egshome/BEAM_varian_6x_above_jaws/test.egsparallel
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.517992300: cd /home/crcrewso/source/egshome/BEAM_varian_6x_above_jaws
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.593960600: EXEC egs-parallel-cpu long 8 0 1 test 'BEAM_varian_6x_above_jaws -i test -p SASKicru' '' verbose
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.773332200: BEGIN /home/crcrewso/source/EGSnrc/HEN_HOUSE/scripts/egs-parallel-cpu
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:25.916381100: reduce requested threads (8) to match available cpu threads (6)
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.240655100: BEGIN host=localhost
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.317535200: job 0001: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 1 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.443890200: job 0001: host=localhost pid=13970
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.629680700: job 0002: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 2 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.763127600: job 0002: host=localhost pid=13985
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:26.956486700: job 0003: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 3 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:27.121060800: job 0003: host=localhost  pid=13995
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:27.305229200: job 0004: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 4 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:27.524768000: job 0004: host=localhost pid=14005
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:27.791004600: job 0005: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 5 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:28.056909800: job 0005: host=localhost pid=14015
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:28.480998200: job 0006: RUN BEAM_varian_6x_above_jaws -i test -p SASKicru -b -P 6 -j 6 -f 1
EGSnrc egs-parallel 2021-03-31 (UTC) 19:37:28.819149100: job 0006: host=localhost pid=14025

egs-parallel-clean:

crcrewso@localhost:~/source/egshome/BEAM_varian_6x_above_jaws$ egs-parallel-clean -f test
egs-parallel-clean
current directory: /home/crcrewso/source/egshome/BEAM_varian_6x_above_jaws/

CLEANING test ...
    create test.egsparallel-log (merged egs-parallel log messages)
    create test.egsparallel-out (merged parallel jobs output streams)
    remove test.egsjob
    remove test.egsparallel
    remove test_w*

DONE.

@ftessier ftessier merged commit c778e2a into develop Apr 12, 2021
@ftessier ftessier deleted the feature-egs-parallel branch April 12, 2021 12:38