nu_correct: disk i/o issues (singularity) #1308

Closed
JasperVanDenBosch opened this issue Oct 4, 2018 · 9 comments

Comments

@JasperVanDenBosch

We are trying to get fmriprep running on our HPC (Slurm-based) in a Singularity container.

We run the container with the -c and -e flags and mount a working directory as well as the data and output directories.

The process runs fine until FreeSurfer's nu_correct says:

ncendef: ncid 5: No space left on device
Error outputting volume: possibly disk full?
nu_evaluate: crashed while running evaluate_field (termination status=139)
nu_correct: crashed while running nu_evaluate (termination status=65280)
ERROR: nu_correct

This doesn't seem to make sense, as the mounted drive has terabytes available. Is it possible that FreeSurfer is trying to write to a directory other than the working/data/output directories?

Singularity call:

singularity run -B $DATADIR:/data -B $TMPDIR:/work -c -e fmriprep-1.1.4.simg \
    /data/BIDS/ \
    /data/BIDS/derivatives/ \
    participant \
    --participant-label $SUBLABEL \
    --work-dir /work \
    --fs-license-file /data/fs-license/license.txt \
    --nthreads $NTHREADS \
    --mem-mb $MAXMEMMB

attached:

@effigies
Member

effigies commented Oct 4, 2018

Is there any chance you're hitting a quota? Unlike most of fMRIPrep, FreeSurfer will store all data in the output directory.

I'm seeing lines like the following:

tmpdir is ./tmp.mri_nu_correct.mni.1121
...
[vandejjf@bear-pg0210u16b.bear.cluster:/data/BIDS/derivatives/freesurfer/sub-1/mri/] [2018-10-04 14:59:29] running:
  /opt/freesurfer/mni/bin/nu_estimate_np_and_em -parzen -log -sharpen 0.15 0.01 -iterations 1000 -stop 0.001 -shrink 4 -auto_mask -nonotify -b_spline 1.0e-7 -distance 50 -quiet -execute -clobber -nokeeptmp -tmpdir ./tmp.mri_nu_correct.mni.1121/0/ ./tmp.mri_nu_correct.mni.1121/nu0.mnc ./tmp.mri_nu_correct.mni.1121/nu1.imp

I'm interpreting this to mean the temporary directory is /data/BIDS/derivatives/freesurfer/sub-1/mri/tmp.mri_nu_correct.mni.1121, but I suppose it's possible that some other temporary directory is being used that isn't evident in the log.
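
If you want to double-check, something along these lines (paths taken from your bind mounts above; quota tooling varies by filesystem, so treat this as a sketch) would show free space as seen both on the host and inside the container:

    # on the host: free space and any user quota on the data mount
    df -h $DATADIR
    quota -s
    # inside the container: free space on the bound output path
    singularity exec -B $DATADIR:/data -c -e fmriprep-1.1.4.simg df -h /data/BIDS/derivatives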

@JasperVanDenBosch
Author

There are no quotas on /data, and that directory does contain files.

@effigies
Member

effigies commented Oct 4, 2018

Are you able to get support from your sysadmin? It seems like an odd error. You could also try the FreeSurfer mailing list to see whether they've seen this show up for other reasons.

Note that this looks the same as freesurfer/freesurfer#462. We may be hitting an HPC edge case.

Finally, did you try re-running? FreeSurfer should try to pick back up where it left off. If it's a timing-related bug, it may be resolved by a second pass.

@effigies
Member

effigies commented Oct 4, 2018

This may also be a Singularity issue, where for some reason you're using space inside the container: https://groups.google.com/a/lbl.gov/forum/#!topic/singularity/eq-tLo2SewM

I don't know off-hand how to test that hypothesis, though.
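
One crude check (a sketch, untested) might be to see what the in-container /tmp looks like when nothing is bound onto it:

    # with -c (contain) and no bind on /tmp, this reports what /tmp actually is
    # inside the container and how much space it offers
    singularity exec -c -e fmriprep-1.1.4.simg df -h /tmp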

@JasperVanDenBosch
Author

> Are you able to get support from your sysadmin? It seems like an odd error.

Yeah, we are requesting additional local /scratch space from them, but I doubt they'll give us much more (if network connectivity is even the issue).

> Note that this looks the same as freesurfer/freesurfer#462. We may be hitting an HPC edge case.

Indeed. But isn't this a rather common scenario?

> Finally, did you try re-running? FreeSurfer should try to pick back up where it left off. If it's a timing-related bug, it may be resolved by a second pass.

Is there perhaps a way to tell FreeSurfer to wait longer or retry? Rerunning using the temporary working directory seems like asking for hard-to-reproduce scenarios.

> This may also be a Singularity issue, where for some reason you're using space inside the container: https://groups.google.com/a/lbl.gov/forum/#!topic/singularity/eq-tLo2SewM

Yes, I saw that thread; unfortunately, there is no way for us to allow arbitrary writing inside the container. So we'd have to figure out where FreeSurfer is writing files (if it's not within the working directory).

@effigies
Member

effigies commented Oct 4, 2018

> But isn't this a rather common scenario?

Yes, HPC is a common environment, but the heterogeneity of clusters, as well as users' individual environments, makes it pretty difficult for us to reproduce issues. Also, the fact that everybody creates their own Singularity images from our Docker images is probably an unnecessary source of variation.

> Is there perhaps a way to tell FreeSurfer to wait longer or retry?

Not very easily under the current Nipype framework.

> Rerunning using the temporary working directory seems like asking for hard-to-reproduce scenarios.

If you use a persistent working directory location, you'll have an easier time picking up where you left off.
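
For example (a sketch; /scratch/$USER/fmriprep_work is a hypothetical persistent path on your cluster, substitute whatever survives between jobs):

    # bind a persistent scratch path instead of the per-job $TMPDIR, so a rerun
    # reuses the same Nipype working directory and cached results
    WORKDIR=/scratch/$USER/fmriprep_work
    mkdir -p $WORKDIR
    singularity run -B $DATADIR:/data -B $WORKDIR:/work -c -e fmriprep-1.1.4.simg \
        /data/BIDS/ \
        /data/BIDS/derivatives/ \
        participant \
        --participant-label $SUBLABEL \
        --work-dir /work \
        --fs-license-file /data/fs-license/license.txt \
        --nthreads $NTHREADS \
        --mem-mb $MAXMEMMB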

@JasperVanDenBosch
Author

That makes sense. I'll try some more things and report back :)

@JasperVanDenBosch
Author

Got some excellent help from Michael Krause figuring out where FreeSurfer is trying to write:

https://mail.nmr.mgh.harvard.edu/pipermail//freesurfer/2018-October/058813.html

It turns out it uses /tmp, so I now mount /tmp as well.
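
For reference, the bind looks roughly like this (the $TMPDIR/fs_tmp path is only illustrative; any writable host directory bound onto /tmp should do):

    # illustrative only: bind a host directory onto /tmp so FreeSurfer's
    # temporary files land on cluster scratch rather than inside the container
    mkdir -p $TMPDIR/fs_tmp
    singularity run -B $DATADIR:/data -B $TMPDIR:/work -B $TMPDIR/fs_tmp:/tmp \
        -c -e fmriprep-1.1.4.simg \
        /data/BIDS/ \
        /data/BIDS/derivatives/ \
        participant \
        --participant-label $SUBLABEL \
        --work-dir /work \
        --fs-license-file /data/fs-license/license.txt \
        --nthreads $NTHREADS \
        --mem-mb $MAXMEMMB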

Should this be added to the documentation?

PS: I still got some disk-full errors earlier in the pipeline, but this was because all my fmriprep processes were accessing the same mail report file at the same time; a random sleep at the start of the HPC job fixed this.
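
The stagger is just a line like the following at the top of the Slurm job script (the two-minute window is arbitrary):

    # illustrative: random delay of up to two minutes so parallel jobs
    # don't hit the same report file at the same moment
    sleep $(( RANDOM % 120 ))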

@oesteban
Member

Seems like you finally figured it out. Please reopen if I'm wrong.
