Parallel harmonic centrality #3581
Conversation
I see that the integration fails for two reasons: either the requirements I specified in the setup.py file are not installed, or the notation is wrong. I'm resolving these now.
I see that some tests fail. All the tests related to my method pass on Travis, even though the required packages still fail to install on AppVeyor. How can I proceed?
To handle the absence of some packages, follow the boilerplate code used elsewhere in the codebase. I should warn you before you put too much more effort into this that we have not settled on a platform/library for parallel algorithms yet. Search through previous issues for "parallel".
I'll look into that, but I don't use scipy and, to my knowledge, neither does any package I use, so it remains unclear to me why the error pops up.
Touch is a simple package that just wraps an open(path, "w") with a semantically significant name, nothing more; I chose it to make the purpose of that function clearer. The other required packages are listed in the first message: compress_json, auto_tqdm, and dict_hash.

The packages are not a firm requirement, but, for instance, on a graph I had to tackle for a paper I'm working on, the resulting JSON was about 3 GB, so saving compressed partial results helped to avoid bloating the disk. dict_hash is only needed when using the cache and, as said before, is not a firm requirement for the package to run, but if the execution is interrupted (by either a MemoryError or a KeyboardInterrupt from the user) the cache helps avoid recomputing the nodes. Finally, auto_tqdm, while not essential, helps (at least me) to get an idea of when a section of the script will finish running; if one needs to report the expected completion time, or pays for computing time, that can be helpful information.
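The compressed-caching idea above can be sketched with the standard library alone (gzip + json as a stand-in for compress_json); the file name and function names here are illustrative, not the PR's actual code:

```python
import gzip
import json
import os

def save_partial(results, path="harmonic_cache.json.gz"):
    """Persist partial results as gzip-compressed JSON.

    A stdlib stand-in for what compress_json provides: compressed
    output keeps multi-GB intermediate JSON from bloating the disk.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(results, f)

def load_partial(path="harmonic_cache.json.gz"):
    """Reload cached partial results, or start fresh if no cache exists.

    After a MemoryError or KeyboardInterrupt, already-computed nodes
    are read back so they need not be recomputed.
    """
    if not os.path.exists(path):
        return {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

A resumed run would then skip any node whose id is already a key in the loaded dictionary.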
To my knowledge, multiprocessing is the go-to library for parallel execution, as all of my projects (and my colleagues' at my laboratory) dealing with Python and parallel programming use it, but I should stress that I do not possess any statistics on parallel programming package usage. I should look into this more.

...after going down a rabbit hole of parallel programming for the last 40 minutes... I have learned that multiprocessing is the default parallelization library in Python (it ships with the standard library), and that the most common non-default library (by pip download count, a metric that does not exist for the bundled multiprocessing) is Dask, which "just" offers a high-level interface to multiprocessing. Therefore, I believe that multiprocessing can be considered the default library for parallel programming in Python.
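For readers unfamiliar with the library discussed here, a minimal multiprocessing.Pool example (the function and pool size are illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Any picklable, module-level function can be dispatched to workers.
    return x * x

if __name__ == "__main__":
    # map() splits the iterable across the worker processes and
    # gathers the results back in order.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(6)))  # [0, 1, 4, 9, 16, 25]
```

This is the same Pool/map pattern the PR applies per node, just with a trivial worker function.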
I guess you need to list the new required packages in the file.
Maybe you need to change the tests module using ...
… No longer supporting SLURM clusters.
I have dropped all the external dependencies from this version; one can parallelize further over computing nodes by specifying the ...
I have also implemented Karger and Christofides; pull requests coming soon. The first one may be more useful as an educational implementation than as an actually efficient tool, as it gets significantly faster than Stoer–Wagner only when using a copious amount of parallel computing. Its strength is the extreme ease with which it scales on clusters, since it just requires ... The second one is, on the other hand, pretty useful: it is a 1.5-approximation of TSP.
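For context on why Karger parallelizes so easily: each trial is an independent sequence of random edge contractions, so trials can be farmed out with no coordination and the minimum taken at the end. A minimal single-process sketch of the contraction core (not the PR's code; this is the standard random-permutation variant with a union-find):

```python
import random

def karger_min_cut(edges, n_nodes, trials=200, seed=0):
    """Estimate the min cut by repeated random edge contraction.

    Contracting edges in a uniformly random order (skipping those whose
    endpoints are already merged) is equivalent to Karger's uniform
    random contraction. Each trial is independent, which is what makes
    the algorithm embarrassingly parallel.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        parent = list(range(n_nodes))  # union-find forest

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        remaining = n_nodes
        order = list(edges)
        rng.shuffle(order)
        for u, v in order:
            if remaining == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:       # skip self-loops of contracted nodes
                parent[ru] = rv
                remaining -= 1
        # The cut size is the number of edges crossing the two supernodes.
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = min(best, cut)
    return best
```

Running O(n² log n) trials makes the failure probability polynomially small; distributing trials over cluster nodes only requires merging the per-node minima.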
Sorry to ask you to repackage again, but could you make this an example (by putting it in /examples/advanced)? There is already one parallel example: plot_parallel_betweenness.py
Sure; I should move it there without also leaving it in the algorithms directory, right? What do you think about the other three implementations I've pushed?
Parallel harmonic centrality
Implemented a parallel version of harmonic centrality, using [Pool from multiprocessing](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool).
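The underlying idea can be sketched as follows, assuming an unweighted graph stored as an adjacency dict (this is a minimal illustration of the Pool-per-node pattern, not the submitted implementation): each node's harmonic centrality is an independent single-source computation, so the pool dispatches one BFS per node.

```python
from collections import deque
from functools import partial
from multiprocessing import Pool

def _bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def _harmonic_one(adj, node):
    """Harmonic centrality of one node: sum of 1/d over reachable nodes."""
    dist = _bfs_distances(adj, node)
    return node, sum(1.0 / d for d in dist.values() if d > 0)

def parallel_harmonic_centrality(adj, processes=None):
    """Compute harmonic centrality for every node, one BFS per pool task."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(partial(_harmonic_one, adj), list(adj)))
```

For a path graph 0–1–2 this yields 1.5 for the endpoints (1 + 1/2) and 2.0 for the middle node, matching the harmonic centrality definition used by NetworkX.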
Extra features
Additionally, this version includes both compressed caching using compress_json and parallelization across the computing nodes of SLURM clusters (such as Galileo) with a virtual shared disk. The SLURM parallelization is achieved simply by touching a temporary file, so as to avoid having to deal with inter-node communication systems such as OpenMPI at this scale. The implementation also uses auto_tqdm to show an optional loading bar as the various tasks are completed.
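The touch-file trick can be illustrated as follows (an assumed sketch of the idea, not the PR's code; file and function names are hypothetical): each job tries to create a marker file for a work chunk on the shared disk, and `O_CREAT | O_EXCL` makes the creation fail if another node already claimed it.

```python
import os

def try_claim(chunk_id, workdir="markers"):
    """Atomically claim a work chunk by creating its marker file.

    On a shared file system, O_CREAT | O_EXCL fails when the file
    already exists, so at most one node successfully claims each
    chunk. Note that this atomicity can be imperfect on some network
    file systems, which matches the "collisions possible but unlikely"
    caveat below.
    """
    os.makedirs(workdir, exist_ok=True)
    path = os.path.join(workdir, f"chunk_{chunk_id}.touch")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True   # this node owns the chunk
    except FileExistsError:
        return False  # another node got there first
```

Each SLURM job then loops over chunk ids, computing only the chunks it manages to claim.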
Requirements
The implementation requires a number of packages to be added to the requirements:
Comparison with single-thread version
I have run a comparison on a 12-thread machine and, as can be expected, the parallel implementation vastly outperforms the single-threaded one:
Test on SLURM cluster
Tests have been run on Cineca's Galileo SLURM cluster to verify that the proposed method (touched temporary files) works properly. So far no collisions have been identified; they should be possible, though unlikely, since SLURM jobs rarely all start together. Better synchronization, if needed, can surely be achieved by wrapping the proposed implementation with OpenMPI.
I hope this work can be useful.
Have a nice day,
Luca