Idea to minimize memory fingerprint #2776
Sounds reasonable. By "Patch" you mean replace?
On Tue, Nov 13, 2018 at 10:42 AM, Oscar Esteban wrote:
Summary
Large workflows are prone to getting killed by the OOM killer on Linux and by similar mechanisms on other systems. The common denominator is that the RSS memory fingerprint stays reasonable (not too big), but the virtual memory (VM) skyrockets when the workflow has many nodes.
Actual behavior
Except for pure-Python interfaces, all nipype interfaces start new processes via subprocess.Popen. All of those processes are therefore created with fork, so each one initially claims as much virtual memory as the parent had allocated before forking, roughly doubling the accounted virtual memory. This, combined with Python's inefficiency at garbage-collecting after the process finishes, leads to overcommitting memory (and to processes being killed on systems that do not allow overcommitting).
Expected behavior
Less memory consumption
How to replicate the behavior
Run fmriprep on a large dataset, with overcommitting disallowed and a low memory limit (e.g. 8 GB).
Idea:
Patch subprocess.Popen with multiprocessing.context.Popen. In theory, all these processes would then be forked from the server process (which should have a constant, small memory fingerprint).
WDYT @satra @effigies @chrisfilo?
BTW, this solution might only be possible with Python >= 3.4. As per https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods:
*Changed in version 3.4*: spawn added on all Unix platforms, and forkserver added for some Unix platforms. Child processes no longer inherit all of the parent's inheritable handles on Windows.
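To make the proposal concrete, here is a minimal sketch of the forkserver idea, assuming a toy _run helper and a plain multiprocessing pool rather than the actual nipype plumbing:

```python
import multiprocessing as mp
import subprocess

def _run(cmd):
    # Runs in a forkserver child: its parent is a small, constant-size server
    # process, so forking does not duplicate the main process's large VM.
    return subprocess.run(cmd).returncode

if __name__ == '__main__':
    ctx = mp.get_context('forkserver')  # Python >= 3.4, Unix only
    with ctx.Pool(processes=2) as pool:
        print(pool.map(_run, [['echo', 'hello'], ['echo', 'world']]))
```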
Seems worth trying.
Let's try it out. Are we the only ones struggling with memory? How does dask handle this?
Yes, although I'm checking, and such a replacement is not as easy as I first thought: the interface of multiprocessing's Popen is not the same as subprocess.Popen's.
I think there is one big difference w.r.t. dask: they don't ubiquitously use subprocess.Popen to run external commands. My impression is that this issue only happens on Linux.
Related: https://bugs.python.org/issue20104#msg222570. EDIT: this post in the same thread may also apply: https://bugs.python.org/issue20104#msg289758
This would be an alternative (Python 3.8, not sure how easy to backport): https://docs.python.org/3.8/library/os.html#os.posix_spawn
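For reference, a minimal sketch of what os.posix_spawn looks like (Python 3.8+, POSIX only; the echo command is just an illustration):

```python
import os

# posix_spawn starts the child directly, without first fork()-ing the
# (potentially huge) parent address space.
pid = os.posix_spawn('/bin/echo', ['echo', 'hello from posix_spawn'], os.environ)
_, status = os.waitpid(pid, 0)
print('child exited with', os.WEXITSTATUS(status))
```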
There is also the
Yep, the problem is that I would not expect a great difference between the two.
This might be too big to try. Just in case, I wrote this bpo: https://bugs.python.org/issue35238#msg329868
I have a functional implementation of nipype running commands through multiprocessing's spawn-based Popen instead of subprocess.Popen. Patching it into a custom fmriprep Docker image**, I've gotten the following virtual memory profile:

Using the same Docker image without patching nipype (i.e., using fork_exec):

** The Docker image was built with the latest fmriprep master, with the addition of the profiling tooling.

Conclusion: I'm either measuring wrongly***, or changing to spawn does not have any effect. Another possibility is that ds000005/sub-01 is not a big enough dataset to trigger memory problems.

*** It could be that my patch to memory_profiler is faulty, or that memory_profiler is just not the right tool to measure this. Or that sampling at a 0.01 s rate is not fast enough to capture the memory duplication of fork, so the allocated memory gets freed before it can even be measured.

Extra thought: I'll dig deeper into a peak of 200 GB that happens soon after starting fmriprep, as seen on the same profiles without zooming in:
Apparently, switching to Python 3.7 (incl. #2778, thus with a forkserver-mode process pool) seems to reduce memory usage by 1-2 GiB in this test case:
So switching to spawn shows essentially the same memory profile as fork_exec? And the fork_exec allocations are actually being captured by the profiler?
Correct.

EDIT: correct iff we are not missing the memory allocations of fork_exec, either because they happen very fast (well below the 0.01 s sampling interval) or because they do not add to the VMS.
Ah, good point. Maybe we can test by allocating ~1/4 of the available memory in the base process and setting the overcommit policy to disallow?
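A rough sketch of that test, assuming Linux, permission to change the overcommit policy, and psutil being available:

```python
# First, disallow overcommitting (as root):
#   sysctl vm.overcommit_memory=2
import psutil

avail = psutil.virtual_memory().available
ballast = bytearray(avail // 4)  # hold ~1/4 of available memory in the base process

# ... build and run the workflow here; if fork really doubles the committed VM,
# launching subprocesses should now fail with ENOMEM instead of relying on overcommit.
```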
I'll try that tomorrow.
Okay, since I wasn't convinced by the memory_profiler measurements, I managed to run fmriprep under valgrind-massif with the following command line: valgrind --tool=massif --massif-out-file=massif.out.%p --pages-as-heap=yes --suppressions=/usr/local/miniconda/share/valgrind-python.supp /usr/local/miniconda/bin/fmriprep (more details here: nipreps/fmriprep@master...oesteban:ehn/include-profiler).

This is what happens with vanilla nipype (using fork):
[image: fmriprep-valgrind-fork] <https://user-images.githubusercontent.com/598470/48785854-6d706000-ec9a-11e8-8744-374b601cca77.png>

And this with spawn:
[image: fmriprep-valgrind-spawn] <https://user-images.githubusercontent.com/598470/48785873-76f9c800-ec9a-11e8-99ba-b341c38779b9.png>

By using the option --pages-as-heap=yes I hoped to also capture virtual memory allocations. However, it is shocking that the graphs show roughly ten times less memory usage than memory_profiler. As valgrind is pretty mature, I'm guessing there is something off with memory_profiler or its plotting function. Other than that, my impression is that the only difference is the sampling density (faster for memory_profiler).

My next step will be running fmriprep on TACC with Russ' parameters and valgrind, since we know that will crash fmriprep due to memory. WDYT?

I can also play with when valgrind takes snapshots. There is one mode where a snapshot is triggered by memory allocations and frees. That could be a better way of measuring.
Testing on TACC might lead to a long turnaround time - last time my job spent 25h in the queue.
Okay, I'll test on My Connectome locally then, using the same valgrind setup. The hypothesis is: it seems we are allocating a lot of memory right at the beginning and then keeping it throughout fMRIPrep. Let's see what the memory slope is for My Connectome, and at what point of the execution it enters the plateau.
If you know a consistent crash pattern, @mgxd could test on our SLURM cluster with a memory limit on the process as well.
@satra I'll try to think of something we can hand over to @mgxd to replicate, but I don't want to waste anyone else's time. One quick hack we can try is creating the MultiProc runner object at the very beginning of the execution (so the workers have a minimal memory fingerprint) and passing it on to the workflow.run method. That should work, shouldn't it?
That would work. We do create it pretty soon after expanding iterables.
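A sketch of that quick hack, assuming workflow.run accepts an already-instantiated plugin object as proposed above; the plugin arguments are illustrative:

```python
# Instantiate the runner before the (large) workflow graph exists, so the
# worker pool starts from a process with a minimal memory fingerprint.
from nipype.pipeline.plugins import legacymultiproc as lmpc

runner = lmpc.LegacyMultiProcPlugin(plugin_args={'nprocs': 8})

# ... later, once the workflow has been built ...
# workflow.run(plugin=runner)
```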
Okay, so I'm trying to simplify, and I'm starting to see interesting things.

First, I stopped running these tests within Docker. I wanted to see how the forkserver is behaving. To minimize the size of the workers, I initialize a MultiProcPlugin instance at the beginning of fmriprep, right after parsing the inputs (the only customization of nipype was #2786), on Python 3.7.1. I made two tests:

- MultiProc, 8 processors, 8 omp-nthreads:
- LegacyMultiProc, 8 processors, 8 omp-nthreads:

Conclusions:
One thing I've been chatting about with @effigies:

```python
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:15:35)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] :: Anaconda custom (64-bit) on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> p = psutil.Process()
>>> p.memory_info().vms / (1024**2)
71.33984375
>>> from multiprocessing import set_start_method
>>> set_start_method('forkserver')
>>> p.memory_info().vms / (1024**2)
75.765625
>>> from nipype.pipeline.plugins import legacymultiproc as lmpc
>>> plugin = lmpc.LegacyMultiProcPlugin(plugin_args={'nprocs': 8})
>>> p.memory_info().vms / (1024**2)
1066.44140625
```

So basically, importing nipype takes some 850 MB and starting a LegacyMultiProcPlugin adds another ~150 MB. From a different point of view:

```python
>>> import sys
>>> a = set(sys.modules)
>>> import nipype
>>> b = set(sys.modules)
>>> len(b - a)
1628
```

Nipype might be importing too many things.
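One quick way to see where those ~1600 modules come from is to group them by top-level package (continuing the session above, with a and b as defined there; output omitted since it depends on the environment):

```python
>>> from collections import Counter
>>> Counter(name.split('.')[0] for name in (b - a)).most_common(10)
```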
Posted this on the Slack, but might as well do it here, too. Here's a more thorough profiling script and the results: https://gist.github.com/effigies/183898aa6ef047eb8ef86f7e63f5528a
Is it worth focusing on constants when the memory issue seems to scale with the size of the workflow?
If we're running with a forkserver, the base memory footprint needed to run each node is going to make a difference. We're currently looking at 850 MB for a bare interpreter + nipype, which is what each worker needs to load in order to un-pickle the interface and run it, so 8-core machines will use 6.8 GB at rest, apart from the workflow.

Of course, this only makes sense with a forkserver; if we're forking from a process that contains the workflow, we're always going to get destroyed by a big node count. Though yes, we should try to reduce node size. Oscar's pushing from that end.
Both are related. The offset of 850 MB for the nipype import is pretty huge. Worse so for the main process (where you start the forkserver), where the offset of nipype + pool is 1 GB. ds000005 is not a particularly large dataset, and it adds some 700 MB of workflow and node objects on top. So the real footprint is Chris' baseline of 6.8 GB on the workers, plus 1.7 GB for a normal dataset. For My Connectome I'm sure the workflow overhead is much more than 700 MB.

On the imports front, just doing this: master...oesteban:enh/minimize-node-memory-2, I've gotten to the following:

```python
>>> import psutil
>>> p = psutil.Process()
>>> p.memory_info().vms / (1024**2)
71.34765625
>>> import nipype
>>> p.memory_info().vms / (1024**2)
455.28515625
>>> from nipype.interfaces import base
>>> p.memory_info().vms / (1024**2)
781.16015625
```

However, I'm running into a problem: the config object is copied entirely into all Nodes and Workflows. For Workflows that accounts for most of the object size (discounting Nodes), and you usually have a few hundred Workflows at most. But for Nodes it is a problem, because that copy is repeated for every single Node. Chris and I have discussed that Nodes could instead hold a proxy for the input traits, which are the heaviest part of Interface objects. I can see clear links to the flat file-system representation of hashes.
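As a generic sketch of the proxy idea (not nipype code; the names are made up), the heavy inputs object could be built lazily, on first attribute access:

```python
class LazyInputs:
    """Defer building the heavy input-traits object until it is actually used."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds the real inputs spec
        self._inputs = None

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself
        if self._inputs is None:
            self._inputs = self._factory()
        return getattr(self._inputs, name)
```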
One more picture. I've run fmriprep with the following snippet inserted at points of initialization:

```python
import psutil
import objgraph

thisproc = psutil.Process()
mem = thisproc.memory_info()
print('Memory usage <where> - '
      'RSS %.1f MB / VM %.f MB' % (mem.rss / (1024**2), mem.vms / (1024**2)))
print('Object usage:')
objgraph.show_most_common_types()
```

This is the output:
In summary: we should probably try to get nipype not to import pandas (of course) or numpy (which is imported only to check its version in utils/config.py). Ideally, I'd say the interfaces could be imported up front and the pipeline left for after the workers are started.

All in all, I don't think we can reduce the initial VM footprint below ~500 MB. But with, e.g., 8 workers, that means we start forking without duplicating 300 MB per worker. The legacy MultiProc plugin with maxtasksperchild is a MUST at this point. WDYT?
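For the numpy case, the version check itself does not require importing the package; a possible alternative, sketched with pkg_resources:

```python
# Read numpy's version string without importing numpy itself.
import pkg_resources

numpy_version = pkg_resources.get_distribution('numpy').version
print(numpy_version)  # e.g. '1.15.4'
```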
Oh, I forgot to note a couple of details:
Looking at that RSS/VMS disparity with the workflow building (+100 MB RSS / +550 MB VMS), I think we should check the imports that workflow construction pulls in.
```python
>>> import psutil
>>> p = psutil.Process()
>>> a = p.memory_info().vms / (1024**2)
>>> import nipype
>>> b = p.memory_info().vms / (1024**2)
>>> from fmriprep.workflows import anatomical
>>> c = p.memory_info().vms / (1024**2)
>>> from niworkflows.interfaces.segmentation import ReconAllRPT
>>> d = p.memory_info().vms / (1024**2)
>>> print('VM overhead of nipype is %.f MB' % (b - a))
VM overhead of nipype is 778 MB
>>> print('VM overhead of fmriprep.workflows.anatomical is %.f MB' % (c - b))
VM overhead of fmriprep.workflows.anatomical is 580 MB
>>> print('VM now includes niworkflows: %.f MB' % (d - c))
VM now includes niworkflows: 0 MB
```

In a previous check, I found that niworkflows was already imported as part of fmriprep.workflows.anatomical, which is why importing ReconAllRPT adds nothing here.