
Excessive memory use in 1.0.0-rc10 vs. 1.0.0-rc8? #836

Closed
fcmeyer opened this issue Nov 16, 2017 · 10 comments

@fcmeyer

fcmeyer commented Nov 16, 2017

Hello,

I have been testing different versions of fMRIPREP on the same two subjects on the Vanderbilt HPC cluster, using Singularity. I told SLURM to allocate 24 GB to each job.

Here's the SBATCH script we used for one subject:

#!/bin/tcsh
#SBATCH --nodes=1    # comments allowed
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=03:30:00
#SBATCH --mem=24G
#SBATCH --output=fpreprc10_226854.log

setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK  # setenv (not set) so the variable is exported to the singularity process
module load GCC Singularity
singularity run poldracklab_fmriprep_1.0.0-rc10-2017-11-10-49e327b1b660.img \
  /data/h_zald_lab/Fran/tts_test/BIDS_TTS_Test \
  /data/h_zald_lab/Fran/tts_test/out_fmriprep_RC10 \
  participant --participant_label 226854 \
  --n_cpus 8 --mem_mb 24000 --no-freesurfer \
  -w /data/h_zald_lab/Fran/tts_test/work_226854_RC10 \
  -t mid

The script for RC8 was identical, except that we pointed it at the RC8 image and changed RC10 to RC8 in the output names for later comparison. We did this for two subjects in total.

While RC8 ran with no hiccups, RC10 had two issues. First, the repetitive warning mentioned in my Neurostars post kept spamming the log:

171110-02:11:12,400 interface WARNING:
Affines of input and reference images do not match, CopyXForm will probably make the input image useless.

But more concerning, SLURM killed the pipeline for both of our subjects due to excessive memory usage. I am including only the tails, because the log files are very long given the huge number of warnings we got:

171116-17:29:39,944 niworkflows INFO:
	 Generating report for aCompCor. file "/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz", mask "/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acc_tfm/highres001_BrainExtractionBrain_prob_0_tpmsum_roi_trans_boldmsk.nii.gz"
171116-17:29:53,729 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acompcor/report.html)
171116-17:29:54,76 interface WARNING:
	 Affines of input and reference images do not match, CopyXForm will probably make the input image useless.
171116-17:59:02,201 niworkflows INFO:
	 Successful spatial normalization (retry #0).
171116-17:59:02,203 niworkflows INFO:
	 Report - setting fixed (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/fixed_masked.nii.gz) and moving (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/ants_t1_to_mni_Warped.nii.gz) images
171116-17:59:02,203 niworkflows INFO:
	 Generating visual report
171116-17:59:23,657 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_108992_RC10/fmriprep_wf/single_subject_108992_wf/anat_preproc_wf/t1_2_mni/report.svg)
slurmstepd: error: Job 21317166 exceeded memory limit (26401148 > 25165824), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 21317166 ON vmp1312 CANCELLED AT 2017-11-16T12:05:08 ***

	 Generating report for aCompCor. file "/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz", mask "/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acc_tfm/highres001_BrainExtractionBrain_prob_0_tpmsum_roi_trans_boldmsk.nii.gz"
171116-17:54:16,39 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/func_preproc_task_mid_run_01_wf/bold_confounds_wf/acompcor/report.html)
171116-17:54:17,378 interface WARNING:
	 Affines of input and reference images do not match, CopyXForm will probably make the input image useless.
171116-18:34:26,585 niworkflows INFO:
	 Successful spatial normalization (retry #0).
171116-18:34:26,590 niworkflows INFO:
	 Report - setting fixed (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/fixed_masked.nii.gz) and moving (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/ants_t1_to_mni_Warped.nii.gz) images
171116-18:34:26,590 niworkflows INFO:
	 Generating visual report
171116-18:35:00,936 niworkflows INFO:
	 Successfully created report (/data/h_zald_lab/Fran/tts_test/work_226854_RC10/fmriprep_wf/single_subject_226854_wf/anat_preproc_wf/t1_2_mni/report.svg)
slurmstepd: error: Job 21317317 exceeded memory limit (26395460 > 25165824), being killed
slurmstepd: error: *** JOB 21317317 ON vmp452 CANCELLED AT 2017-11-16T12:42:27 ***

I am not sure whether the minimum memory requirements have increased since rc8, or whether this is a bug causing excessive memory usage or ignoring the limits specified in the call. But I figured I'd bring it up in case others are experiencing this problem.
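
For reference, here is a quick sanity check of the numbers slurmstepd reported (assuming it reports memory in kB; the figures come straight from the errors above). It also suggests the job overshot the bound I passed via --mem_mb:

limit_kb = 25165824        # SLURM cap: 24 * 1024**2 kB = 24 GiB (--mem=24G)
peak_kb = 26401148         # usage reported when job 21317166 was killed
print(limit_kb / 1024**2)  # 24.0 GiB
print(peak_kb / 1024**2)   # ~25.2 GiB, roughly 5% over the cap
print(24000 / 1024)        # --mem_mb 24000 is at most ~23.4 GiB, below the cap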

Thank you so much for developing this, I really like this tool!

@effigies
Member

Hi @fcmeyer. We're also running into memory issues lately, and it's not entirely clear why at this point. One hypothesis we have is that the main program is acquiring too much memory, so when it spawns new processes, all that memory is being copied before it can be released.

Thanks for alerting us to the fact that this issue has arisen since 1.0.0-rc8. That should be useful in narrowing down the causes.
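
To illustrate the hypothesis with a minimal sketch (this is not fMRIPREP's actual code, just an illustration): on Linux, Python's multiprocessing forks the parent by default, so every worker starts with a copy-on-write view of the parent's entire address space, and per-process resource accounting can charge each child for memory the parent was holding at fork time:

import multiprocessing as mp
import os

def work(i):
    # Each forked worker inherits a copy-on-write view of the parent's
    # address space, including `big` below, even though it never touches it.
    return os.getpid()

if __name__ == "__main__":
    big = bytearray(2 * 1024**3)   # parent holds ~2 GiB before forking
    with mp.Pool(processes=4) as pool:
        print(pool.map(work, range(4)))
    # Releasing `big` before creating the pool (or using the "spawn"
    # start method) keeps workers from inheriting that allocation.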

@oesteban
Member

oesteban commented Nov 17, 2017

Thanks a lot @fcmeyer; as Chris commented, this is very likely a duplicate of #833. Let's keep the discussion there.

We are working in several directions to ease these problems (e.g. #839, nipy/nipype#2284, nipy/nipype#2289).

And thanks a lot for the compliments!

@oesteban
Member

oesteban commented Nov 21, 2017

One question @fcmeyer, could you please paste here the output of:

grep "Free memory" fpreprc10_226854.log

@fcmeyer
Author

fcmeyer commented Nov 28, 2017

Hi @oesteban, sorry about the delay; I was out of town for Thanksgiving. I tried running the command you listed, but it gives no output:

[calvacfa@vmps11 tts_test]$ grep "Free memory" fpreprc10_226854.log
[calvacfa@vmps11 tts_test]$

However, if I just grep for "memory", the only hit is the SLURM error that shows the memory usage at the time the process was killed:

[calvacfa@vmps11 tts_test]$ grep "memory" fpreprc10_226854.log
slurmstepd: error: Job 21317317 exceeded memory limit (26395460 > 25165824), being killed

I hope this helps!

@oesteban
Member

@fcmeyer we have just released 1.0.0-rc12. Even though we can still see some memory problems, I think we have tackled a good chunk of the issue. Could you test the new release out?

@fcmeyer
Author

fcmeyer commented Dec 11, 2017

@oesteban Just tried out the gold 1.0.0 release; the memory problem is definitely resolved for me. I am re-running a batch of ~380 subjects on the cluster right now; about 200 are done so far with no major issues or memory shutdowns from the cluster (using the same settings as I had used for 1.0.0-rc8). Congrats on 1.0.0!

@oesteban
Member

Glad we prioritized this. We'll keep trying to reduce the memory footprint. Thank you all for your valuable feedback :)

@oesteban
Member

Hey @fcmeyer, could I ask you to relate your story here: https://neurostars.org/t/fmriprep-success-stories/1111?u=oesteban ?

@fcmeyer
Author

fcmeyer commented Dec 11, 2017

@oesteban done!

@oesteban
Member

oesteban commented Dec 11, 2017 via email
