
Excessive memory use in 1.0.0-rc10 vs. 1.0.0-rc8? #836

Closed
fcmeyer opened this issue Nov 16, 2017 · 10 comments

@fcmeyer

fcmeyer commented Nov 16, 2017

Hello,

I have been testing different versions of fMRIPREP on the same two subjects on the Vanderbilt HPC cluster, using Singularity. I told SLURM to allocate 24 GB to each job.

Here's the SBATCH script we used for one subject:

#!/bin/tcsh
#SBATCH --nodes=1    # comments allowed
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=03:30:00
#SBATCH --mem=24G
#SBATCH --output=fpreprc10_226854.log

setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK  # setenv (not set) so the variable is exported to the singularity process
module load GCC Singularity
singularity run poldracklab_fmriprep_1.0.0-rc10-2017-11-10-49e327b1b660.img \
  /data/h_zald_lab/Fran/tts_test/BIDS_TTS_Test \
  /data/h_zald_lab/Fran/tts_test/out_fmriprep_RC10 \
  participant --participant_label 226854 \
  --n_cpus 8 --mem_mb 24000 --no-freesurfer \
  -w /data/h_zald_lab/Fran/tts_test/work_226854_RC10 \
  -t mid

The script for RC8 was identical, except that we pointed it at the RC8 image and changed RC10 to RC8 in the output names for later comparison. We did this for two subjects in total.

While RC8 ran with no hiccups, RC10 had two issues. First, the repetitive warning mentioned in my Neurostars post kept spamming the log:

171110-02:11:12,400 interface WARNING:
Affines of input and reference images do not match, CopyXForm will probably make the input image useless.

But more concerning, SLURM killed the pipeline for both of our subjects due to excessive memory usage. I am including only the tails, because the log files are very long given the huge number of warnings we got:

171116-17:29:39,944 niworkflows INFO:
	 Generating report for aCompCor. file "/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz", mask "/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acc_tfm/highres001_BrainExtractionBrain_prob_0_tpmsum_roi_trans_boldmsk.nii.gz"
171116-17:29:53,729 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acompcor/report.html)
171116-17:29:54,76 interface WARNING:
	 Affines of input and reference images do not match, CopyXForm will probably make the input image useless.
171116-17:59:02,201 niworkflows INFO:
	 Successful spatial normalization (retry #0).
171116-17:59:02,203 niworkflows INFO:
	 Report - setting fixed (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/fixed_masked.nii.gz) and moving (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/ants_t1_to_mni_Warped.nii.gz) images
171116-17:59:02,203 niworkflows INFO:
	 Generating visual report
171116-17:59:23,657 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/report.svg)
slurmstepd: error: Job 21317166 exceeded memory limit (26401148 > 25165824), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 21317166 ON vmp1312 CANCELLED AT 2017-11-16T12:05:08 ***

	 Generating report for aCompCor. file "/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz", mask "/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acc_tfm/highres001_BrainExtractionBrain_prob_0_tpmsum_roi_trans_boldmsk.nii.gz"
171116-17:54:16,39 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acompcor/report.html)
171116-17:54:17,378 interface WARNING:
	 Affines of input and reference images do not match, CopyXForm will probably make the input image useless.
171116-18:34:26,585 niworkflows INFO:
	 Successful spatial normalization (retry #0).
171116-18:34:26,590 niworkflows INFO:
	 Report - setting fixed (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/fixed_masked.nii.gz) and moving (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/ants_t1_to_mni_Warped.nii.gz) images
171116-18:34:26,590 niworkflows INFO:
	 Generating visual report
171116-18:35:00,936 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/report.svg)
slurmstepd: error: Job 21317317 exceeded memory limit (26395460 > 25165824), being killed
slurmstepd: error: *** JOB 21317317 ON vmp452 CANCELLED AT 2017-11-16T12:42:27 ***

I am not sure whether the minimum memory requirements have increased since rc8, or whether this is a bug causing excessive memory usage or ignoring the limits specified in the call. But I figured I'd bring it up in case others are experiencing this problem.
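
For reference, here is a quick sanity check of the numbers slurmstepd reported (assuming it reports memory in kB; the figures come straight from the errors above). It also suggests the job overshot the bound I passed via --mem_mb:

limit_kb = 25165824        # SLURM cap: 24 * 1024**2 kB = 24 GiB (--mem=24G)
peak_kb = 26401148         # usage reported when job 21317166 was killed
print(limit_kb / 1024**2)  # 24.0 GiB
print(peak_kb / 1024**2)   # ~25.2 GiB, roughly 5% over the cap
print(24000 / 1024)        # --mem_mb 24000 is at most ~23.4 GiB, below the cap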

Thank you so much for developing this, I really like this tool!

@effigies
Member

Hi @fcmeyer. We're also running into memory issues lately, and it's not entirely clear why at this point. One hypothesis we have is that the main program is acquiring too much memory, so when it spawns new processes, all that memory is being copied before it can be released.

Thanks for alerting us to the fact that this issue has arisen since 1.0.0-rc8. That should be useful in narrowing down the causes.
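
To illustrate the hypothesis with a minimal sketch (this is not fMRIPREP's actual code, just an illustration): on Linux, Python's multiprocessing forks the parent by default, so every worker starts with a copy-on-write view of the parent's entire address space, and per-process resource accounting can charge each child for memory the parent was holding at fork time:

import multiprocessing as mp
import os

def work(i):
    # Each forked worker inherits a copy-on-write view of the parent's
    # address space, including `big` below, even though it never touches it.
    return os.getpid()

if __name__ == "__main__":
    big = bytearray(2 * 1024**3)   # parent holds ~2 GiB before forking
    with mp.Pool(processes=4) as pool:
        print(pool.map(work, range(4)))
    # Releasing `big` before creating the pool (or using the "spawn"
    # start method) keeps workers from inheriting that allocation.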

@oesteban
Member

oesteban commented Nov 17, 2017

Thanks a lot @fcmeyer; as Chris commented, this is very likely a duplicate of #833. Let's keep the discussion there.

We are working in several directions to ease these problems (e.g. #839, nipy/nipype#2284, nipy/nipype#2289).

And thanks a lot for the compliments!

@oesteban
Member

oesteban commented Nov 21, 2017

One question @fcmeyer, could you please paste here the output of:

grep "Free memory" fpreprc10_226854.log

@fcmeyer
Author

fcmeyer commented Nov 28, 2017

Hi @oesteban, sorry about the delay; I was out of town for Thanksgiving. I tried running the command you listed, but it gives no output:

[calvacfa@vmps11 tts_test]$ grep "Free memory" fpreprc10_226854.log
[calvacfa@vmps11 tts_test]$

However, if I just grep for "memory", the only hit is the SLURM error that shows the memory usage at the time the process was killed:

[calvacfa@vmps11 tts_test]$ grep "memory" fpreprc10_226854.log
slurmstepd: error: Job 21317317 exceeded memory limit (26395460 > 25165824), being killed

I hope this helps!

@oesteban
Member

@fcmeyer we have just released 1.0.0-rc12. Even though we can still see some memory problems, I think we have tackled a good chunk of the issue. Could you test the new release out?

@fcmeyer
Author

fcmeyer commented Dec 11, 2017

@oesteban Just tried out the gold 1.0.0 release; the memory problem is definitely resolved for me. I am re-running a batch of ~380 subjects on the cluster right now; about 200 are done so far with no major issues or memory shutdowns from the cluster (using the same settings as I had used for 1.0.0-rc8). Congrats on 1.0.0!

@oesteban
Member

Glad we prioritized this. We'll keep trying to reduce the memory footprint. Thank you all for your valuable feedback :)

@oesteban
Member

Hey @fcmeyer, could I ask you to relate your story here: https://neurostars.org/t/fmriprep-success-stories/1111?u=oesteban ?

@fcmeyer
Author

fcmeyer commented Dec 11, 2017

@oesteban done!

@oesteban
Member

oesteban commented Dec 11, 2017 via email
