
Parallel harmonic centrality #3581

Closed
wants to merge 10 commits into from

Conversation

LucaCappelletti94
Contributor

Parallel harmonic centrality

Implemented a parallel version of harmonic centrality, using [Pool from multiprocessing](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool).
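For readers skimming the diff, here is a minimal sketch of the approach, relying only on the standard `nx.harmonic_centrality` signature; the chunking and function names are illustrative, not the PR's exact code:

```python
from multiprocessing import Pool

import networkx as nx


def _chunk_harmonic(args):
    G, nodes = args
    # harmonic_centrality accepts an nbunch, so each worker
    # computes the centrality of its own slice of nodes only
    return nx.harmonic_centrality(G, nbunch=nodes)


def parallel_harmonic_centrality(G, processes=4):
    nodes = list(G)
    # round-robin split keeps the chunks roughly balanced
    chunks = [nodes[i::processes] for i in range(processes)]
    with Pool(processes) as pool:
        partials = pool.map(_chunk_harmonic, [(G, c) for c in chunks])
    centrality = {}
    for partial in partials:
        centrality.update(partial)
    return centrality
```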

Extra features

Additionally, this version includes compressed caching using compress_json and parallelization across the computing nodes of SLURM clusters (such as Galileo) that share a virtual disk. The SLURM parallelization is achieved simply by touching a temporary file, so as to avoid having to deal with inter-node communication systems such as OpenMPI at this scale. The implementation also uses auto_tqdm to show an optional loading bar as the various tasks are completed.
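As a hypothetical sketch of how such a touch-based claiming scheme can work on a shared disk (the PR itself uses a plain touch, which is why the rare collisions discussed below remain possible; the names here are illustrative):

```python
import os


def try_claim(task_id, shared_dir):
    """Claim a task by touching its marker file on the shared disk.

    Returns True if this job won the claim, False if another node got
    there first. O_CREAT | O_EXCL makes creation fail when the file
    already exists, so the touch doubles as a cheap lock.
    """
    marker = os.path.join(shared_dir, "{}.touch".format(task_id))
    try:
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False
```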

Requirements

The implementation requires a number of packages to be added to the requirements:

  • touch, used for touching an empty file.
  • compress_json, for reading and writing temporary compressed JSON files.
  • auto_tqdm, for showing an optional loading bar automatically adapted to its context (either Jupyter or console).
  • dict_hash, for deterministically hashing dictionaries, used for creating the cache path names (see the caching sketch after this list).
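A hedged sketch of how compress_json and dict_hash can combine for the caching described above; `dump`/`load` and `sha256` are those packages' documented entry points, but the cache layout here is an assumption rather than the PR's exact one:

```python
import os

import compress_json
from dict_hash import sha256


def cached(params, compute, cache_dir="cache"):
    # deterministic hash of the parameter dictionary -> stable cache path
    path = os.path.join(cache_dir, sha256(params) + ".json.gz")
    if os.path.exists(path):
        return compress_json.load(path)
    result = compute(**params)
    os.makedirs(cache_dir, exist_ok=True)
    compress_json.dump(result, path)
    return result
```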

Comparison with single-thread version

I have run a comparison on a 12-thread machine and, as one would expect, the parallel implementation vastly outperforms the single-threaded one:

[Comparison plot]
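A minimal way to reproduce such a comparison, reusing the hypothetical `parallel_harmonic_centrality` sketch from above (graph size and timings are illustrative and will vary with hardware):

```python
import time

import networkx as nx

G = nx.erdos_renyi_graph(2000, 0.01, seed=42)

start = time.perf_counter()
nx.harmonic_centrality(G)
print("single-thread:", time.perf_counter() - start)

start = time.perf_counter()
parallel_harmonic_centrality(G, processes=12)  # sketch from above
print("parallel:", time.perf_counter() - start)
```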

Test on SLURM cluster

Tests have been run on Cineca's Galileo SLURM cluster to verify that the proposed method (touched temporary files) works properly. So far no collisions have been identified; they remain possible but unlikely, since SLURM jobs rarely all start at the same time. Better synchronization, if needed, could surely be achieved by wrapping the proposed implementation with OpenMPI.

I hope this work can be useful.

Have a nice day,
Luca

@LucaCappelletti94
Contributor Author

LucaCappelletti94 commented Sep 13, 2019

I see that the integration fails for two reasons: either the requirements I specified in the setup.py file are not installed, or the f"{variable}" notation is not supported in Python 3.5, the version on which the tests are executed.

I'm resolving these now.

@LucaCappelletti94
Contributor Author

I see that some tests fail because the scipy module is absent, and some other tests (bethe_hessian_matrix) that don't seem related to the newly proposed method fail too, for instance with AttributeError: module 'networkx' has no attribute 'algebraic_connectivity', which again has little to do with my method.

All the tests related to my method pass on Travis, even though the required packages still fail to install on AppVeyor:

test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_bal_tree ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_big_graph_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_clique_complete ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_cycle_C4 ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_cycle_C5 ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_empty ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_exampleGraph ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_p3_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_p4_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_singleton ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_weighted_harmonic ... ok

How can I proceed?
Luca

@dschult
Member

dschult commented Sep 13, 2019

To handle the absence of some packages, follow the boilerplate code found at the bottom of e.g. eigenvector.py (basically you move the import into the function and put a setup_module function at the bottom of your module so the tests can skip when scipy is not installed).
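Roughly the pattern being described, modeled on how networkx modules handled optional scipy at the time (a sketch, not eigenvector.py's verbatim contents):

```python
def parallel_harmonic_centrality(G, processes=4):
    # move the optional import inside the function so that importing
    # networkx itself never requires the optional package
    try:
        import scipy
    except ImportError:
        raise ImportError("parallel_harmonic_centrality requires scipy")
    ...


# fixture for nose tests, placed at the bottom of the module:
# the tests are skipped entirely when the optional package is absent
def setup_module(module):
    from nose import SkipTest
    try:
        import scipy
    except ImportError:
        raise SkipTest("scipy not available")
```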

It looks like the touch package is not getting loaded on AppVeyor. Can you tell me anything about it? Are there any other libraries you are importing that will increase the packages required to run networkx? We may have to handle them in a similar way to how we handle scipy.

I should warn you, before you put too much more effort into this, that we have not settled on a platform/library for parallel algorithms yet. Search through previous issues for "parallel" to see previous discussions about parallel libraries. It may be different now in 2019, but in the past it wasn't clear which libraries to use -- so much seems to depend on the hardware setup, etc. Do you have a perspective that Python parallel libraries have become uniform, or that one library is becoming the de facto parallel interface?

@LucaCappelletti94
Contributor Author

> To handle the absence of some packages, follow the boilerplate code found at the bottom of e.g. eigenvector.py (basically you move the import into the function and put a setup_module function at the bottom of your module so the tests can skip when scipy is not installed).

I'll look into that, but I don't use scipy and, to my knowledge, neither does any package I use, so it remains unclear to me why the error pops up.

> It looks like the touch package is not getting loaded on AppVeyor. Can you tell me anything about it? Are there any other libraries you are importing that will increase the packages required to run networkx? We may have to handle them in a similar way to how we handle scipy.

Touch is a simple package that just wraps an open(path, "w") under a semantically significant name, nothing much; I chose it just to make the purpose of that call clearer. The other required packages are listed in the first message: compress_json, auto_tqdm, and dict_hash. They are not a firm requirement, but, for instance, on a graph I had to tackle for a paper I'm working on, the resulting JSON was about 3 GB, so saving compressed partial results helped avoid bloating the disk.
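In other words, going by that description, the whole package essentially boils down to:

```python
def touch(path):
    # create (or truncate to) an empty file at the given path
    with open(path, "w"):
        pass
```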

dict_hash is only needed when using the cache and, as said before, is not a firm requirement for the package to run; but if the execution is interrupted (by either a MemoryError or a KeyboardInterrupt from the user), the cache helps avoid recomputing the nodes.

Finally, auto_tqdm, while not essential, helps (at least me) to get an idea of when a section of the script will be done running; if one needs to report the expected completion time, or pays for the computing time, that can be helpful information.

> I should warn you, before you put too much more effort into this, that we have not settled on a platform/library for parallel algorithms yet. Search through previous issues for "parallel" to see previous discussions about parallel libraries. It may be different now in 2019, but in the past it wasn't clear which libraries to use -- so much seems to depend on the hardware setup, etc. Do you have a perspective that Python parallel libraries have become uniform, or that one library is becoming the de facto parallel interface?

To my knowledge, multiprocessing is the go-to library for parallel execution, as all of my projects (and those of my colleagues at the laboratory) dealing with Python and parallel programming use it; but I should stress that I do not possess any statistics on the usage of parallel programming packages. I should look into this more.

...after going down a rabbit hole of parallel programming for the last 40 minutes...

I have learned that multiprocessing is the default parallelization library in Python, shipped with the standard library, and that the most common non-default library (by pip download count, a metric that does not exist for the bundled multiprocessing) is Dask, which "just" offers a high-level interface over multiprocessing.

Therefore, I believe that multiprocessing can be considered the default library for parallel programming in Python.

@dschult
Member

dschult commented Sep 13, 2019

I guess you need to list the new required packages in the `requirements/extras.txt` file along with version numbers. At first you can just put in the version numbers you used. No need to try finding the oldest version that works -- too many rabbit holes already. Other rabbit holes include looking at `.travis.yml` and `appveyor.yml` to see how the requirements files are used when testing. The goal for now is to get the imports to work in the tests. :)

Take a look at #3440 as well as #3270 and #3439

@dschult
Member

dschult commented Sep 13, 2019

Maybe you need to change the test module, using test_eigenvector_centrality.py as a model for optional dependencies... ???

@LucaCappelletti94
Contributor Author

I have dropped all the external dependencies from this version; one can still parallelize further across computing nodes by specifying the nbunch over which each node should run.
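For illustration, a hypothetical way to split the nbunch across SLURM array tasks (the environment variables are standard SLURM ones; the input file name and slicing scheme are assumptions, not part of the PR):

```python
import os

import networkx as nx

# each array task computes harmonic centrality for its own nbunch slice
G = nx.read_edgelist("graph.edgelist")  # assumed input, for illustration
task = int(os.environ["SLURM_ARRAY_TASK_ID"])
ntasks = int(os.environ["SLURM_ARRAY_TASK_COUNT"])
my_nbunch = sorted(G)[task::ntasks]
partial = nx.harmonic_centrality(G, nbunch=my_nbunch)
```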

@LucaCappelletti94
Contributor Author

I have also implemented Karger and Christofides; pull requests coming soon.

The first one may be more useful as an educational implementation than as an actually efficient tool, since it becomes significantly faster than Stoer-Wagner only when using a copious amount of parallel computing.

Its strength is the extreme ease with which it scales on clusters, since it just requires n independent random iterations, while the second one just doesn't scale that way, as it is linear. On very big graphs this might make a difference, but I believe that in those situations one might implement an ad hoc solution.
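A minimal sketch of why those iterations parallelize so trivially, using networkx's contracted_nodes; this illustrates Karger's contraction scheme in general, not the code of the upcoming pull request:

```python
import random
from multiprocessing import Pool

import networkx as nx


def one_karger_trial(args):
    seed, G = args
    rng = random.Random(seed)
    H = nx.MultiGraph(G)  # contractions create parallel edges
    while H.number_of_nodes() > 2:
        # contract a uniformly random edge until two super-nodes remain
        u, v, _ = rng.choice(list(H.edges(keys=True)))
        H = nx.contracted_nodes(H, u, v, self_loops=False)
    return H.number_of_edges()  # size of the cut found by this trial


def parallel_karger(G, trials=100, processes=4):
    # the trials share nothing, so they map cleanly over a Pool
    with Pool(processes) as pool:
        cuts = pool.map(one_karger_trial, [(s, G) for s in range(trials)])
    return min(cuts)
```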

Christofides, on the other hand, is pretty useful: it is a 1.5-approximation of TSP.

@dschult
Member

dschult commented Sep 14, 2019

Sorry to ask you to repackage again, but could you make this an example (by putting it in examples/advanced)? There is already one parallel example in there, plot_parallel_betweenness.py, and this could go nicely with it.

@LucaCappelletti94
Contributor Author

Sure, without also leaving it in the algorithms directory, right? What do you think about the other 3 implementations I've pushed?

@hagberg hagberg marked this pull request as draft June 25, 2020 21:00
Base automatically changed from master to main March 4, 2021 18:20
This pull request was closed.