
RF: improve support for queue args #328

Merged: 7 commits into nipy:master from fix/slurm, Apr 4, 2019
Conversation

mgxd (Member) commented Apr 1, 2019

Closes #302

Changes

  • add queue-args to the submission script (see the sketch below)
  • fix up the batch submission command
    • remove queue-args from the queue submission itself
    • isolate files / subjects into separate, unique commands
  • submit the batch before any grouping (massive speedup when dealing with many subjects/files)
  • change imports from relative to absolute so they behave better across different environments
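
As a rough illustration of the first two bullets, the submission side might end up looking like the sketch below. This is not the actual queue.py code; the helper name, its arguments, and the use of sbatch's --wrap are assumptions for illustration only.

import subprocess

def submit_job(convert_cmd, jobname, queue_args=None):
    # Hypothetical sketch of wrapping one per-subject heudiconv command in an
    # sbatch call. queue_args is the raw string given via --queue-args;
    # splitting it and appending the pieces is what lets options such as
    # --partition or --cpus-per-task reach SLURM.
    cmd = ["sbatch", "--job-name=" + jobname]
    if queue_args:
        cmd += queue_args.split()        # e.g. "--partition=om_bigmem --time=10"
    cmd += ["--wrap=" + convert_cmd]     # the isolated per-subject invocation
    subprocess.check_call(cmd)

Called once per subject or file set, something like this keeps each conversion in its own unique command and lets SLURM schedule them independently.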

codecov-io commented Apr 1, 2019

Codecov Report

Merging #328 into master will increase coverage by 0.23%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #328      +/-   ##
==========================================
+ Coverage   74.08%   74.31%   +0.23%     
==========================================
  Files          35       35              
  Lines        2643     2667      +24     
==========================================
+ Hits         1958     1982      +24     
  Misses        685      685
Impacted Files                    Coverage Δ
heudiconv/external/dcmstack.py    25% <0%> (ø) ⬆️
heudiconv/tests/test_queue.py     100% <100%> (ø) ⬆️
heudiconv/utils.py                92.12% <100%> (ø) ⬆️
heudiconv/dicoms.py               82.07% <100%> (ø) ⬆️
heudiconv/bids.py                 82.58% <100%> (ø) ⬆️
heudiconv/cli/run.py              77.56% <80%> (-0.08%) ⬇️
heudiconv/queue.py                90% <87.5%> (+4.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3467341...4621423. Read the comment docs.

mgxd changed the title from "RF: move to absolute import" to "RF: improve support for queue args" on Apr 1, 2019
mgxd (Member, Author) commented Apr 1, 2019

@fhopp re your question in #304 - could you see if this PR fixes your problem? An example call using --queue-args would look like:

heudiconv --files /path/to/files -s pilot -f convertall -q SLURM --queue-args "--partition=om_bigmem --cpus-per-task=4 --time=10"

yarikoptic (Member) commented:

hm... so what is the benefit of moving to absolute imports exactly, or what problem(s) would it solve?
I've been moving toward relative imports wherever possible, and @satra advocates them over absolute to enable Python packages to be "embeddable" into larger packages (although I have yet to see that in real use).

yarikoptic (Member) commented:

I left a comment on #302, but I most likely didn't fully comprehend the situation, since it probably relates to SLURM execution. In either case, it must not be a problem of relative-vs-absolute imports but rather an incorrect invocation. I really do not think we should replace relative imports with absolute ones to address whatever the problem is.

satra (Member) commented Apr 2, 2019

@mgxd - when you say "behave better across different environments", what do you mean? I do have a personal preference for relative imports in these relatively shallow projects, and in the past I have been bitten by absolute imports picking up an installed project instead of a dev project.

mgxd (Member, Author) commented Apr 2, 2019

@satra if anything, it's good to clearly show where files and functions lie within the project, especially to new contributors. It's also recommended by PEP 8, and would fit in well given how basic our structure is (I think our max depth is currently 3).
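
For anyone skimming the thread, the whole debate is about the two spellings below, shown as they might appear inside heudiconv/cli/run.py (the imported name is only an example, not a claim about the actual diff):

from ..utils import load_json           # relative: resolved within the package this file lives in
from heudiconv.utils import load_json   # absolute: resolved via sys.path, possibly an installed copy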

fhopp commented Apr 2, 2019

@mgxd the latest #328 does the trick, a.k.a. it does not crash but executes with the arguments being properly passed to SLURM. Now, I understand that each subject will be submitted as a separate "job" to the cluster. Yet, is there any information on how to optimize the execution? For example, does it make sense to specify multiple CPUs per job, or is heudiconv a single task that hence operates on one CPU per node? Likewise, would it help to increase the memory per CPU?

Apologies for all these questions; I have a 65-subject dataset for which these are all very relevant. I am also keeping track of all the steps and would be more than happy to contribute a working example at the end (I am not using Docker, and I have noticed an increased demand for non-Docker examples). Thanks!

satra (Member) commented Apr 2, 2019

@mgxd - here is a use case for relative imports (although a developer should be more careful):

python setup.py install
# change a file locally and test that it is still working
pytest heudiconv

In this scenario all the absolute imports would get picked up from the installed location. Personally, I like relative imports because of the frame of reference. If there are many cross-links within a project, I can easily see how relative imports could get complicated (go up two and down three), but I think both styles are clear in shallow trees with only a few branches at the top level.

Here are the two selling points for me:

  • For users who are not used to Python but do want to contribute, I feel that making the above workflow (setup -> change -> test) easier is better for them.
  • It is also easier to import libraries into projects as git submodules. Let's say you want to distribute your own copy of nipype in your package, as fmriprep does (or used to); with absolute imports this would be really painful (see the hypothetical layout below).
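
To make the submodule point concrete, here is a purely hypothetical layout (package names, paths, and the imported function are assumptions, not an existing project):

# mypackage/
#     externals/
#         nipype/                      <- vendored copy, e.g. a git submodule
#             utils/filemanip.py
#             interfaces/base.py
#
# Inside the vendored interfaces/base.py:
from ..utils.filemanip import copyfile          # relative: stays inside mypackage.externals.nipype
# from nipype.utils.filemanip import copyfile   # absolute: resolves to whatever "nipype"
#                                               # sys.path finds first, defeating the vendoring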

yarikoptic (Member) commented:

Please let's stick with relative imports and concentrate on addressing the problem at hand, which smells like just an "incorrect" invocation specification (of a submodule directly, instead of the heudiconv entry-point script itself) within the SLURM job.

yarikoptic (Member) commented:

To that extent, the culprit is https://github.com/nipy/heudiconv/blob/master/heudiconv/cli/run.py#L293

pyscript = op.abspath(inspect.getfile(inspect.currentframe()))

This is left over from the "good old" times of heudiconv being a single script, which is how cli/run.py ends up being run directly. The whole "python packaging" makes it a bit more complicated, but sys.argv[0] might just work?
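
Spelled out, the two options being weighed are roughly the following; the first assignment is the current code quoted above, the second is the suggested alternative and only a sketch, not a settled change:

import inspect
import os.path as op
import sys

# Current: the file of the currently executing frame, i.e. cli/run.py itself,
# so the generated SLURM job re-invokes a submodule directly.
pyscript = op.abspath(inspect.getfile(inspect.currentframe()))

# Suggested: whatever was actually invoked on the command line, e.g. the
# "heudiconv" console script installed by setuptools.
pyscript = op.abspath(sys.argv[0])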

satra (Member) commented Apr 2, 2019

Can't pyscript simply be heudiconv (now that it is a package)? Aren't we essentially calling heudiconv with one subject?

yarikoptic (Member) commented:

In principle it could be, I guess. I just dislike hardcoding anything ;)

mgxd (Member, Author) commented Apr 2, 2019

Ok, I've reverted imports due to popular demand 😆

@fhopp on our cluster, I run heudiconv across a hundred subjects concurrently; heudiconv is a single-core application (currently), and I don't think I have ever needed more than 2GB of memory.

fhopp commented Apr 3, 2019

@mgxd Just out of curiosity, how long does generating the heuristic file across all subjects take?
I have 64 subjects distributed across 5 nodes, each allocated 2GB, and the process seems to be going rather slowly (~12 hours now, not finished yet). I would just like to have some benchmark to compare against and rule out that this is a problem with how my SLURM is configured. The command I execute is as follows:

heudiconv \
-d /srv/lab/fmri/mft/dicom4bids/data/{subject}*/* \
-f /srv/lab/fmri/mft/dicom4bids/data/convertall.py \
-s sub01 sub02 sub03 sub04 sub05 sub06 sub07 sub08 sub09 sub10 sub11 sub12 sub13 sub14 sub15 sub16 sub17 sub18 sub19 sub20 sub21 sub22 sub23 sub24 sub25 sub26 sub27 sub28 sub29 sub30 sub31 sub32 sub33 sub34 sub35 sub36 sub37 sub38 sub39 sub40 sub41 sub42 sub43 sub44 sub45 sub46 sub47 sub48 sub49 sub50 sub51 sub52 sub53 sub54 sub55 sub56 sub57 sub58 sub59 sub60 sub61 sub62 sub63 sub64 \
-c none -q SLURM --queue-args "--partition=heudiconv --mem=2000" \
-o /srv/lab/fmri/mft/dicom4bids/heudiconv_temp/output/
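
(As an aside, and assuming the sub01…sub64 naming shown above, a list that long does not have to be typed by hand; a short Python snippet can print it ready to paste after -s:)

# Prints "sub01 sub02 ... sub64" on one line for pasting into the command
print(" ".join("sub{:02d}".format(i) for i in range(1, 65)))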

I was also wondering whether it would be better to split the subjects into chunks of five rather than submitting the full list as done above.

Finally, it appears that the above command submits just one job to each node. While I understand that heudiconv runs on a single CPU, is it possible to have each node work on multiple jobs concurrently? Say one node has 8 cores; is it possible to submit 8 jobs to that node?

mgxd (Member, Author) commented Apr 3, 2019

@fhopp actually I think it's on our end; I see a bottleneck at

study_sessions = get_study_sessions(args.dicom_dir_template, args.files,
                                    heuristic, outdir, args.session,
                                    args.subjs, grouping=args.grouping)

I'm working on a fix now.
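
For readers following along, the gist of the fix is to branch to the queue before this grouping call rather than after it. The sketch below is only an illustration under that assumption; dispatch_to_queue and the surrounding function wrapper are placeholders, while get_study_sessions is the call quoted above:

def process(args, heuristic, outdir):
    # Hypothetical shape of the reordering (not the actual cli/run.py code):
    if args.queue:
        # One SLURM job per subject/file set, submitted right away; the slow
        # get_study_sessions() call then runs inside each job instead of up front.
        dispatch_to_queue(args.subjs, args.queue, args.queue_args)  # placeholder helper
        return

    # Local (non-queued) path: group first, then convert, as quoted above.
    study_sessions = get_study_sessions(args.dicom_dir_template, args.files,
                                        heuristic, outdir, args.session,
                                        args.subjs, grouping=args.grouping)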

mgxd (Member, Author) commented Apr 3, 2019

@fhopp - hopefully you'll see a drastic speed-up after the last few commits.

fhopp commented Apr 3, 2019

@mgxd Awesome, trying this now. From what I see so far, that grouping step is what increases the processing time before the batch job(s) are submitted. I will check and send an update on the speed-up. I still think it would be best if these job submissions could run on single cores on each node in the cluster.
Example: 2 nodes with 4 cores each should allow concurrent heudiconv runs for 8 subjects.

fhopp commented Apr 3, 2019

@mgxd Confirmed! Job submission is almost instant.

mgxd (Member, Author) commented Apr 4, 2019

Example: 2 nodes with 4 cores each should allow concurrent heudiconv runs for 8 subjects.

If you specify --cpus-per-task=1, this should be the case. I'm going to merge this in; please let us know if you are still running into issues!

fhopp commented Apr 4, 2019

@mgxd short answer: yes, please merge. I tried with your latest fix and job submission is instant. As for execution, it was in fact an issue with my SLURM config: it was set to a non-shared, exclusive mode in which CPUs and memory in the cluster were not seen as "consumable resources": https://slurm.schedmd.com/cons_res.html
I changed the setting and now jobs are being properly distributed and executed concurrently.
Thank you for all these efforts! As a small contribution, may I offer to submit a slide set or notebook that showcases and summarizes how SLURM may be set up and used for optimal performance with heudiconv?

mgxd (Member, Author) commented Apr 4, 2019

Yes, that would be awesome. We have a small collection of user tutorials in our docs that would be a great place for it.

mgxd merged commit 6b2a460 into nipy:master on Apr 4, 2019
mgxd deleted the fix/slurm branch on April 4, 2019 at 16:17