Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Parallel ErdosRenyi Generator #221

Merged
merged 19 commits into from
Dec 17, 2018

Conversation

manpen
Copy link
Contributor

@manpen manpen commented Aug 10, 2018

Apologies for issuing this PR without a description.

Remark: NetworKit calls G(n,p) graphs Erdos-Renyi-Graphs. It is my understanding that G(n,p) are infact Gilbert graphs, while ER introduced the G(n,m) model. For consistency, I will keep on refering to G(n,p) as ER.

The G(n, p) model was first introduced by Edgar Gilbert in a 1959 paper studying the connectivity threshold mentioned above.[3] The G(n, M) model was introduced by Erdős and Rényi in their 1959 paper. As with Gilbert, their first investigations were as to the connectivity of G(n, M), with the more detailed analysis following in 1960. https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model

The primary reason for this PR is the fact that the GraphBuilder benchmarks use ER graphs. Unfortunately it relied on a quadratic time sampling that required (for the parameters used) expected 1000 random floats for each edge generated. Hence, the runtime of the benchmarks was primarily dominated by the generation of test data and did not actually measure the GraphBuilder itself.

After migrating away from the quadratic time sampling to the parallel ER model, the runtime dropped from ~1900ms to ~30ms. This includes sampling the data and maintaining the GraphBuilder. I hence increased the problem size by a factor of 100 to get meaningful data. Also observe that previously the parallel benchmark did not measure a parallel structure as it passed parallel=false to the GraphBuilder. I also moved the benchmarks from generators/test to graph/test as the GraphBuilder is part of the latter.

As the benchmark cannot accept a Graph data structure, the ER generator itself is implemented as a StreamGenerator; it has a forEach callback and expects (e.g.) a lambda to be called for each edge. It basically uses the linear time algorithm previously implemented in ErdosRenyiGenerator with a few tweaks giving a sequential speed-up between 3x and 5x; it scales almost perfectly, as the hot-section runs pleasingly parallel.

I now demoted the original ErdosRenyiGenerator to a wrapper to the Enumerator. Here the speed-ups rather small, since the runtime is dominated by the GraphBuilder.

@manpen manpen changed the title Feature parallel erdosrenyi Feature Parallel EdosRenyi Generator Aug 10, 2018

/**
* Generates a stream of edges of a G(n,p) graph. The edges are not
* written two memory, but emitted only via usual @a forEdges semantics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"two" -> "to"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@manpen manpen force-pushed the feature_parallel_erdosrenyi branch from d5f4656 to da9ef8a Compare August 10, 2018 14:11
@manpen
Copy link
Contributor Author

manpen commented Aug 10, 2018

I forgot to ask my important question:

The old ER generator generates only edge (u, v) with u > v. While I agree with this for undirected graphs, a directed G(n,p) graph may also contain edges (v < u) and esp. self-loops. The old generator was unable of emitting them. The new enumerator implements this, as it is the mathematically correct way to do it.

There is however code that relies on directed graphs with (u < v), e.g. CentralityGTest.testGroupDegreeDirected which cannot deal with self-loops.

IMHO the other code should be fixed, and the generator should produce graphs according to definition even if this breaks user-code. What's your opinion?

@manpen
Copy link
Contributor Author

manpen commented Aug 12, 2018

What do think about the following solution:

  • Deprecate ErdosRenyiGenerator (as its broken for the directed case and has a misleading name)
  • Rename ErdosRenyiEnumerator (and its wrapper) to GilbertGNPEnumerator/GilbertGNPGenerator
  • (later) Introduce an ErdosRenyiGNMGenerator (e.g. the simple parallel algorithm in https://arxiv.org/abs/1710.07565)

@manpen manpen changed the title Feature Parallel EdosRenyi Generator Feature Parallel ErdosRenyi Generator Aug 22, 2018
std::bind (and its indirection via std::function) is slow.
Replace it by hard-coded chaining of PRNG and distribution gives
a speed-up of nearly a factor 2.
It's an (optionally parallel) variant of the O(m) time algorithm
by Batagelj and Brandes already implemented sequentially in
ErdosRenyiGenerator.cpp. The difference is that it is an header-only
solution invoking a callback for every edge generated. Thus it can
be used as a faster Generator and also as a data source to benchmark
the GraphBuilder.
According to my experiments generating a float or double random variate
is 2.1 times slower than generating an integer variate. Additionally
std::log2 is roughly 1.5 times faster than std::log. Hence we get a
speed-up of more than 3x with this commit.
The old benchmark relied on a O(n*n) time algorithm to compute an
Erdos-Renyi-Graph causing requiring the generation of 1'000 random
numbers per edge inserted for the used parameters. Hence, the benchmark
did not really measure the GraphBuilder but rather the generation
process.

After replacing the generator with the efficient ErdosRenyi enumerator
it runtime dropped from roughly 1600ms to 30ms. Hence, in the original
benchmark roughly 1% of time was spent on the actual experiment.

I now increased the dataset size by a factor of 8 to have a somewhat
meaningful runtimes.
The toGraphParallel section did pass parallel=false to the builder.
Hence the time difference to the sequential variant seems to be only
noise. Now that it's enabled, the parallel GraphBuilder is consistently
slower on my machine.

Also split several benchmarks into own Test cases to be able to select
them using GTest filter.
The generator is now basically a pImpl wrapper to the enumerator.
This should keep compilation times low despite the template magic
of the enumerator.
I do not see any point on having them in GeneratorsBenchmark.
This is misleading at best.
For p>0.5 it more sensible to draw the edges NOT included (rather
than the ones to include). The enumerator now uses this strategy
to reduce the number of computations required.

It also now supports p = 0.0 and p = 1.0. efficently.
These extremal values are efficiently supported as special cases
and do not require drawing of random numbers. They are actively
used by some tests.
@manpen manpen force-pushed the feature_parallel_erdosrenyi branch 2 times, most recently from c4255d8 to 55f66ab Compare August 25, 2018 14:45
@manpen manpen force-pushed the feature_parallel_erdosrenyi branch from 55f66ab to b7b36dd Compare August 26, 2018 13:42
@manpen
Copy link
Contributor Author

manpen commented Aug 29, 2018

Just to let you know: I included all changes discussed last week (optional support to generated directed graphs without self loops and support for the extreme cases of p=0.0 and p=1.0); in case of p>0.5 there's also a further optimisation that the implementation does not draw a random number of each edge but for each edge NOT included (which are expected to be less). This does not yield an asymptomatic improvement, but is faster in practice.

I consider all tests passing, as the one failing test case in Travis is not related to this PR and hence feel it is completed from a technical point of view. I did however not change the interface of the Python code to expose the flag for self-loop-free instances, as the naming issue is still open.

@avdgrinten
Copy link
Contributor

avdgrinten commented Oct 1, 2018

I did not have the time to look at this PR yet. It also has a few merge conflicts. Could you resolve the conflicts while I review the PR?

After the conflicts are resolved, the CI can be restarted (with the new configuration) by closing/reopening the PR.

@manpen
Copy link
Contributor Author

manpen commented Oct 1, 2018

Yes, sure.

@avdgrinten
Copy link
Contributor

We will probably not rename the generator to GilbertGenerator (or similar). Even if ER is technically the wrong name, it has been in use for too long to change it now. If we want to add a true ER generator, we will have to use a different name for that.

Copy link
Contributor

@avdgrinten avdgrinten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the PR looks good to me. I will merge it after some minor questions have been addressed.

networkit/cpp/generators/test/GeneratorsBenchmark.cpp Outdated Show resolved Hide resolved
networkit/cpp/generators/test/GeneratorsBenchmark.cpp Outdated Show resolved Hide resolved
networkit/cpp/graph/test/GraphBuilderBenchmark.h Outdated Show resolved Hide resolved
@manpen
Copy link
Contributor Author

manpen commented Oct 1, 2018

A small question regarding the name: is it okay if I rename the Enumerator part into GilbertEnumerator. This is new code, and the (correct) name does not break existing code; the only odd part is then that the ErdosRenyiGenerator uses the GilbertEnumerator ... I feel this is the right way to do it, but I completely understand if you prefer ErdosRenyiEnumerator.

@avdgrinten
Copy link
Contributor

I think I'd still prefer ErdosRenyiEnumerator for consistency. If ER generator did not exist yet it would be a different matter...

@manpen
Copy link
Contributor Author

manpen commented Dec 12, 2018

This failed CI test is again a sporadic Travis bug during system setup. I dealt with all remarks and consider this PR ready. Please let me know if there's anything else.

@avdgrinten
Copy link
Contributor

LGTM

@angriman angriman merged commit 26587a4 into networkit:Dev Dec 17, 2018
@manpen manpen deleted the feature_parallel_erdosrenyi branch June 5, 2019 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants