
ENH: Add instrumentation to monitor resources #984

Merged: 15 commits, Apr 27, 2022

Conversation

oesteban
Member

Adds an instrumentation module, which could eventually be packaged with nipype or released standalone, to keep track of MRIQC's resource utilization.

For now, I'm testing the pattern and will try to write up some code to generate nice plots with the data.
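
For reference, a minimal sketch of the sampling pattern I have in mind, assuming a psutil-based poller running on a background thread. The CSV columns, file name, and interval here are illustrative, not the module's actual interface:

```python
import csv
import threading
import time

import psutil


def sample_to_csv(path, interval=0.2, stop_event=None):
    """Append one row per (sample, process) for this process and its children."""
    parent = psutil.Process()
    with open(path, "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["timestamp", "pid", "name", "rss_mb", "vms_mb"])
        while stop_event is None or not stop_event.is_set():
            for proc in [parent] + parent.children(recursive=True):
                try:
                    name = proc.name()
                    mem = proc.memory_info()
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    continue  # the process finished between sampling and querying
                writer.writerow(
                    [time.time(), proc.pid, name, mem.rss / 1e6, mem.vms / 1e6]
                )
            csvfile.flush()
            time.sleep(interval)


# Run the sampler on a daemon thread alongside the workflow:
stop = threading.Event()
threading.Thread(
    target=sample_to_csv,
    args=("resources.csv",),
    kwargs={"stop_event": stop},
    daemon=True,
).start()
# ... run the workflow here ...
stop.set()
```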

cc/ @effigies @mgxd

@oesteban
Member Author

I'm still investigating why the recording pauses for about 100s just while the FSL fast process is kicked off. I suspect it has to do with threading, forking and locks.

example_output.csv

@oesteban
Member Author

Two assumptions I should've been more explicit about:

  • NiPype's resource monitor does not work at the workflow level. It can be useful to track isolated interfaces, but the approach is expensive, inaccurate, inefficient, and unreliable for workflows (each interface triggers its own monitoring thread, with great potential for problems around process forking).
  • This effort is aimed at investigating Too much vmem? #824, where FSL fast seems to behave as a memory hog.

If the new approach to monitoring works out, then we will see how to make it available to other users.

@oesteban
Member Author

These are the kinds of plots we can generate with the new files (see the code under the new viz submodule).

Using spawn, plotting RSS (--nprocs 8):
[figure: mriqc-rss-spawn]

Using spawn, plotting VM (--nprocs 8):
[figure: mriqc-vms-spawn]

I've just plotted it with forkserver and the picture is exactly the same, so I am afraid either I have not managed to configure spawn correctly, or the process pool just initiates all the workers either way.

The huge light blue area corresponds to Python processes, which all get reported under the name "python3.8". The main process' baseline footprint is also substantial, although it is nothing compared to the multiprocessing workers.
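
For completeness, this is roughly how such a plot could be produced from the sampled CSV; the column names ("timestamp", "name", "rss_mb") are assumptions about the file layout, not the actual viz submodule API:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("resources.csv")
# One column per process name, memory summed across PIDs sharing that name
# (e.g., all the "python3.8" workers collapse into a single series).
rss = (
    df.pivot_table(index="timestamp", columns="name", values="rss_mb", aggfunc="sum")
    .fillna(0.0)
)
rss.plot.area(figsize=(12, 5), linewidth=0)
plt.ylabel("RSS (MB)")
plt.xlabel("time (s)")
plt.tight_layout()
plt.savefig("mriqc-rss.png")
```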

That said, it feels like we should find a solution for nipype 1 execution plugins:

  • Instead of multiprocessing, use multithreading. I don't know right this minute if this has ever been attempted.
  • Creating a plugin based on asyncio (which seems the best option in theory, as this is eminently an I/O-bound problem); see the toy sketch right below.
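
To make the asyncio bullet concrete, a toy illustration of the scheduling idea only (not a nipype plugin), under the assumption that interfaces boil down to external command lines:

```python
import asyncio


async def run_interface(cmdline, semaphore):
    """Launch one command-line interface; the semaphore caps concurrency."""
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(*cmdline)
        return await proc.wait()


async def run_all(cmdlines, nprocs=8):
    semaphore = asyncio.Semaphore(nprocs)
    return await asyncio.gather(
        *(run_interface(cmd, semaphore) for cmd in cmdlines)
    )


# Three dummy POSIX `sleep` jobs, never more than two in flight at once.
exit_codes = asyncio.run(run_all([["sleep", "1"]] * 3, nprocs=2))
```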

wdyt @satra @effigies @mgxd ?

@oesteban force-pushed the enh/resource-monitor branch 2 times, most recently from 2ad11d9 to 0537657, on April 25, 2022 16:23
@oesteban marked this pull request as ready for review on April 25, 2022 16:26
@effigies
Member

That said, it feels like we should find a solution for nipype 1 execution plugins:

  • Instead of multiprocessing, use multithreading. I don't know right this minute if this has ever been attempted.

Assuming all of your pure Python interfaces are trivial, this might work. Otherwise it's going to be hard to avoid bottlenecks related to the GIL.

  • Creating a plugin based on asyncio (which seems the best option in theory, as this is eminently an I/O-bound problem).

Same issue here. Asyncio doesn't replace multiprocessing, it just allows us to skip writing our own callback queue.

@oesteban
Member Author

That said, it feels like we should find a solution for nipype 1 execution plugins:

  • Instead of multiprocessing, use multithreading. I don't know right this minute if this has ever been attempted.

Assuming all of your pure Python interfaces are trivial, this might work. Otherwise it's going to be hard to avoid bottlenecks related to the GIL.

  • Creating a plugin based on asyncio (which seems the best option in theory, as this is eminently an I/O-bound problem).

Same issue here. Asyncio doesn't replace multiprocessing, it just allows us to skip writing our own callback queue.

Python interfaces could also spawn their own process to overcome the GIL. I believe that having the main thread control a pool of threads instead of processes should, at the very least, ease the multiplicative memory-allocation effect of multiprocessing; a rough sketch is below.
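
A rough sketch of what I mean, assuming the scheduled work is mostly external command lines (names are illustrative, not nipype code): the scheduler keeps a pool of threads, and each thread blocks on a child process rather than on Python code, so the GIL is released while the work runs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_node(cmdline):
    """The worker thread releases the GIL while it waits on the child process."""
    return subprocess.run(cmdline, check=True).returncode


cmdlines = [["sleep", "1"]] * 3  # stand-ins for command-line interfaces
with ThreadPoolExecutor(max_workers=8) as pool:
    exit_codes = list(pool.map(run_node, cmdlines))
```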

@oesteban force-pushed the enh/resource-monitor branch 2 times, most recently from a6f7eea to a9176ee, on April 26, 2022 06:42
@oesteban force-pushed the enh/resource-monitor branch 2 times, most recently from 41d7f6b to 4259bab, on April 26, 2022 08:05
@oesteban
Member Author

Alright - it seems like a good practice with apparently little cost.

In e4bdd4e, I set OMP_NUM_THREADS=1 early in the config file and make sure that the workers of the process pool reset it to the proper value (passed with --omp-nthreads on the command line, or the total number of CPUs if unspecified). That keeps the mother process under 600 MB of VMS (i.e., less than one tenth of the original size).

Then, we SHOULD NOT use the forkserver. It seems the forkserver does not kill its workers (at least prior to Python 3.11), and the default fork context works well now that the mother process' VMS is so much smaller. Both settings are reflected in the sketch below.
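
Putting the two points above together, roughly (function and variable names are illustrative, not MRIQC's actual config code):

```python
import os

os.environ["OMP_NUM_THREADS"] = "1"  # as early as possible, before heavy imports load

import multiprocessing as mp  # noqa: E402
from concurrent.futures import ProcessPoolExecutor  # noqa: E402


def _init_worker(omp_nthreads):
    """Executed once inside every worker: restore the requested OMP setting."""
    os.environ["OMP_NUM_THREADS"] = str(omp_nthreads)


def make_pool(nprocs, omp_nthreads):
    """Process pool using the default fork context (POSIX only), not forkserver."""
    return ProcessPoolExecutor(
        max_workers=nprocs,
        mp_context=mp.get_context("fork"),
        initializer=_init_worker,
        initargs=(omp_nthreads,),
    )
```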

Finally, as a note for @satra, @mgxd, @effigies and other nipypers: we probably want to avoid the ProcessPool, which keeps --nprocs workers alive when nipype typically doesn't need that many. Instead, creating ad-hoc processes only when necessary (which, at peak times, may still reach the maximum number of parallel processes) seems like a sure way of further reducing the memory footprint; a sketch follows below.
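
A sketch of that ad-hoc-process alternative, again with illustrative names only: a short-lived Process per ready task, with a semaphore capping how many run at once.

```python
import multiprocessing as mp
import time


def _run_task(task, semaphore):
    """Child process: run one node/interface, then free a concurrency slot."""
    try:
        task()
    finally:
        semaphore.release()


def submit_all(tasks, nprocs=8):
    """Start one short-lived process per task, never more than nprocs at once."""
    ctx = mp.get_context("fork")  # POSIX only; args are inherited, not pickled
    semaphore = ctx.Semaphore(nprocs)
    procs = []
    for task in tasks:
        semaphore.acquire()  # blocks while nprocs tasks are already running
        proc = ctx.Process(target=_run_task, args=(task, semaphore))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    # Three dummy one-second jobs, at most two running at any time.
    submit_all([lambda: time.sleep(1)] * 3, nprocs=2)
```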

With the changes in this PR, and after the latest OMP=1 change, the RSS picture remains very similar (as expected):

[figure: mriqc-rss-omp-1]

But the VMS picture has changed dramatically:

[figure: mriqc-vms-omp-1]

(don't be deceived by the weird lines; I removed them from the plot in da20809)

With this commit, I believe the VMem problems have been addressed.

Resolves: #824.
Related: #536.
@oesteban
Member Author

This all comes about because of nipy/nipype#3456
