
Parallel harmonic centrality #3581

Closed
wants to merge 10 commits into from

Conversation

LucaCappelletti94
Contributor

Parallel harmonic centrality

Implemented a parallel version of harmonic centrality, using [Pool from multiprocessing](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool).
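For readers skimming the diff, here is a minimal sketch of the approach, relying only on the standard `nx.harmonic_centrality` signature; the chunking and function names are illustrative, not the PR's exact code:

```python
from multiprocessing import Pool

import networkx as nx


def _chunk_harmonic(args):
    G, nodes = args
    # harmonic_centrality accepts an nbunch, so each worker
    # computes the centrality of its own slice of nodes only
    return nx.harmonic_centrality(G, nbunch=nodes)


def parallel_harmonic_centrality(G, processes=4):
    nodes = list(G)
    # round-robin split keeps the chunks roughly balanced
    chunks = [nodes[i::processes] for i in range(processes)]
    with Pool(processes) as pool:
        partials = pool.map(_chunk_harmonic, [(G, c) for c in chunks])
    centrality = {}
    for partial in partials:
        centrality.update(partial)
    return centrality
```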

Extra features

Additionally, this version includes compressed caching using compress_json and parallelization across the computing nodes of SLURM clusters (such as Galileo) that share a virtual disk. The SLURM parallelization is achieved simply by touching a temporary file, so as to avoid having to deal with inter-node communication systems such as OpenMPI at this scale. The implementation also uses auto_tqdm to show an optional loading bar as the various tasks are completed.
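As a hypothetical sketch of how such a touch-based claiming scheme can work on a shared disk (the PR itself uses a plain touch, which is why the rare collisions discussed below remain possible; the names here are illustrative):

```python
import os


def try_claim(task_id, shared_dir):
    """Claim a task by touching its marker file on the shared disk.

    Returns True if this job won the claim, False if another node got
    there first. O_CREAT | O_EXCL makes creation fail when the file
    already exists, so the touch doubles as a cheap lock.
    """
    marker = os.path.join(shared_dir, "{}.touch".format(task_id))
    try:
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False
```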

Requirements

The implementation requires a number of packages to be added to the requirements:

  • touch, used for touching an empty file.
  • compress_json, for reading and writing temporary compressed JSON files.
  • auto_tqdm, for showing an optional loading bar automatically adapted to its context (either Jupyter or console).
  • dict_hash, for deterministically hashing dictionaries, used for creating the cache path names (see the caching sketch after this list).
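A hedged sketch of how compress_json and dict_hash can combine for the caching described above; `dump`/`load` and `sha256` are those packages' documented entry points, but the cache layout here is an assumption rather than the PR's exact one:

```python
import os

import compress_json
from dict_hash import sha256


def cached(params, compute, cache_dir="cache"):
    # deterministic hash of the parameter dictionary -> stable cache path
    path = os.path.join(cache_dir, sha256(params) + ".json.gz")
    if os.path.exists(path):
        return compress_json.load(path)
    result = compute(**params)
    os.makedirs(cache_dir, exist_ok=True)
    compress_json.dump(result, path)
    return result
```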

Comparison with single-thread version

I have run a comparison on a 12-thread machine and, as one would expect, the parallel implementation vastly outperforms the single-threaded one:

[Comparison plot]
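A minimal way to reproduce such a comparison, reusing the hypothetical `parallel_harmonic_centrality` sketch from above (graph size and timings are illustrative and will vary with hardware):

```python
import time

import networkx as nx

G = nx.erdos_renyi_graph(2000, 0.01, seed=42)

start = time.perf_counter()
nx.harmonic_centrality(G)
print("single-thread:", time.perf_counter() - start)

start = time.perf_counter()
parallel_harmonic_centrality(G, processes=12)  # sketch from above
print("parallel:", time.perf_counter() - start)
```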

Test on SLURM cluster

Tests have been run on Cineca's Galileo SLURM cluster to verify that the proposed method (touched temporary files) works properly. So far no collisions have been identified; they remain possible but unlikely, since SLURM jobs rarely all start at the same time. Better synchronization, if needed, could surely be achieved by wrapping the proposed implementation with OpenMPI.

I hope this work can be useful.

Have a nice day,
Luca

@LucaCappelletti94
Contributor Author

LucaCappelletti94 commented Sep 13, 2019

I see that the integration fails for two reasons: either the requirements I specified in the setup.py file are not installed, or the f"{variable}" notation is not supported in Python 3.5, the version on which the tests are executed.

I'm resolving these now.

@LucaCappelletti94
Contributor Author

I see that some tests fail because the scipy module is absent, and some other tests (bethe_hessian_matrix) that don't seem related to the newly proposed method fail too, for instance with AttributeError: module 'networkx' has no attribute 'algebraic_connectivity', which again has little to do with my method.

All the tests related to my method pass on Travis, even though the required packages still fail to install on AppVeyor:

test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_bal_tree ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_big_graph_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_clique_complete ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_cycle_C4 ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_cycle_C5 ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_empty ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_exampleGraph ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_p3_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_p4_harmonic ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_singleton ... ok
test_parallel_harmonic_centrality.TestParallelHarmonicCentrality.test_weighted_harmonic ... ok

How can I proceed?
Luca

@dschult
Member

dschult commented Sep 13, 2019

To handle the absence of some packages, follow the boilerplate code found at the bottom of e.g. eigenvector.py (basically you move the import into the function and put a setup_module function at the bottom of your module so the tests can skip when scipy is not installed).
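Roughly the pattern being described, modeled on how networkx modules handled optional scipy at the time (a sketch, not eigenvector.py's verbatim contents):

```python
def parallel_harmonic_centrality(G, processes=4):
    # move the optional import inside the function so that importing
    # networkx itself never requires the optional package
    try:
        import scipy
    except ImportError:
        raise ImportError("parallel_harmonic_centrality requires scipy")
    ...


# fixture for nose tests, placed at the bottom of the module:
# the tests are skipped entirely when the optional package is absent
def setup_module(module):
    from nose import SkipTest
    try:
        import scipy
    except ImportError:
        raise SkipTest("scipy not available")
```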

It looks like the touch package is not getting loaded on AppVeyor. Can you tell me anything about it? Are there any other libraries you are importing that will increase the packages required to run networkx? We may have to handle them in a similar way to how we handle scipy.

I should warn you, before you put too much more effort into this, that we have not settled on a platform/library for parallel algorithms yet. Search through previous issues for "parallel" to see previous discussions about parallel libraries. It may be different now in 2019, but in the past it wasn't clear which libraries to use -- so much seems to depend on the hardware setup, etc. Do you have a perspective that Python parallel libraries have become uniform, or that one library is becoming the de facto parallel interface?

@LucaCappelletti94
Contributor Author

> To handle the absence of some packages, follow the boilerplate code found at the bottom of e.g. eigenvector.py (basically you move the import into the function and put a setup_module function at the bottom of your module so the tests can skip when scipy is not installed).

I'll look into that, but I don't use scipy and, to my knowledge, neither does any package I use, so it remains unclear to me why the error pops up.

> It looks like the touch package is not getting loaded on AppVeyor. Can you tell me anything about it? Are there any other libraries you are importing that will increase the packages required to run networkx? We may have to handle them in a similar way to how we handle scipy.

Touch is a simple package that just wraps an open(path, "w") under a semantically significant name, nothing much; I chose it just to make the purpose of that call clearer. The other required packages are listed in the first message: compress_json, auto_tqdm, and dict_hash. They are not a firm requirement, but, for instance, on a graph I had to tackle for a paper I'm working on, the resulting JSON was about 3 GB, so saving compressed partial results helped avoid bloating the disk.
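In other words, going by that description, the whole package essentially boils down to:

```python
def touch(path):
    # create (or truncate to) an empty file at the given path
    with open(path, "w"):
        pass
```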

dict_hash is only needed when using the cache and, as said before, is not a firm requirement for the package to run; but if the execution is interrupted (by either a MemoryError or a KeyboardInterrupt from the user), the cache helps avoid recomputing the nodes.

Finally, auto_tqdm, while not essential, helps (at least me) to get an idea of when a section of the script will be done running; if one needs to report the expected completion time, or pays for the computing time, that can be helpful information.

> I should warn you, before you put too much more effort into this, that we have not settled on a platform/library for parallel algorithms yet. Search through previous issues for "parallel" to see previous discussions about parallel libraries. It may be different now in 2019, but in the past it wasn't clear which libraries to use -- so much seems to depend on the hardware setup, etc. Do you have a perspective that Python parallel libraries have become uniform, or that one library is becoming the de facto parallel interface?

To my knowledge, multiprocessing is the go-to library for parallel execution, as all of my projects (and those of my colleagues at the laboratory) dealing with Python and parallel programming use it; but I should stress that I do not possess any statistics on the usage of parallel programming packages. I should look into this more.

...after going down a rabbit hole of parallel programming for the last 40 minutes...

I have learned that multiprocessing is the default parallelization library in Python, shipped with the standard library, and that the most common non-default library (by pip download count, a metric that does not exist for the bundled multiprocessing) is Dask, which "just" offers a high-level interface over multiprocessing.

Therefore, I believe that multiprocessing can be considered the default library for parallel programming in Python.

@dschult
Member

dschult commented Sep 13, 2019

I guess you need to list the new required packages in the `requirements/extras.txt` file along with version numbers. At first you can just put in the version numbers you used. No need to try finding the oldest version that works -- too many rabbit holes already. Other rabbit holes include looking at `.travis.yml` and `appveyor.yml` to see how the requirements files are used when testing. The goal for now is to get the imports to work in the tests. :)

Take a look at #3440 as well as #3270 and #3439

@dschult
Member

dschult commented Sep 13, 2019

Maybe you need to change the test module, using test_eigenvector_centrality.py as a model for optional dependencies... ???

@LucaCappelletti94
Contributor Author

I have dropped all the external dependencies from this version; one can still parallelize further across computing nodes by specifying the nbunch over which each node should run.
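For illustration, a hypothetical way to split the nbunch across SLURM array tasks (the environment variables are standard SLURM ones; the input file name and slicing scheme are assumptions, not part of the PR):

```python
import os

import networkx as nx

# each array task computes harmonic centrality for its own nbunch slice
G = nx.read_edgelist("graph.edgelist")  # assumed input, for illustration
task = int(os.environ["SLURM_ARRAY_TASK_ID"])
ntasks = int(os.environ["SLURM_ARRAY_TASK_COUNT"])
my_nbunch = sorted(G)[task::ntasks]
partial = nx.harmonic_centrality(G, nbunch=my_nbunch)
```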

@LucaCappelletti94
Contributor Author

I have also implemented Karger and Christofides; pull requests coming soon.

The first one may be more useful as an educational implementation than as an actually efficient tool, since it becomes significantly faster than Stoer-Wagner only when using a copious amount of parallel computing.

Its strength is the extreme ease with which it scales on clusters, since it just requires n independent random iterations, while the second one just doesn't scale that way, as it is linear. On very big graphs this might make a difference, but I believe that in those situations one might implement an ad hoc solution.
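A minimal sketch of why those iterations parallelize so trivially, using networkx's contracted_nodes; this illustrates Karger's contraction scheme in general, not the code of the upcoming pull request:

```python
import random
from multiprocessing import Pool

import networkx as nx


def one_karger_trial(args):
    seed, G = args
    rng = random.Random(seed)
    H = nx.MultiGraph(G)  # contractions create parallel edges
    while H.number_of_nodes() > 2:
        # contract a uniformly random edge until two super-nodes remain
        u, v, _ = rng.choice(list(H.edges(keys=True)))
        H = nx.contracted_nodes(H, u, v, self_loops=False)
    return H.number_of_edges()  # size of the cut found by this trial


def parallel_karger(G, trials=100, processes=4):
    # the trials share nothing, so they map cleanly over a Pool
    with Pool(processes) as pool:
        cuts = pool.map(one_karger_trial, [(s, G) for s in range(trials)])
    return min(cuts)
```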

Christofides, on the other hand, is pretty useful: it is a 1.5-approximation of TSP.

@dschult
Member

dschult commented Sep 14, 2019

Sorry to ask you to repackage again, but could you make this an example (by putting it in examples/advanced)? There is already one parallel example in there, plot_parallel_betweenness.py, and this could go nicely with it.

@LucaCappelletti94
Contributor Author

Sure, without also leaving it in the algorithms directory, right? What do you think about the other 3 implementations I've pushed?

@hagberg hagberg marked this pull request as draft June 25, 2020 21:00
Base automatically changed from master to main March 4, 2021 18:20
This pull request was closed.