Feature/girgs #320

manpen · 2019-05-20T20:38:17Z

This PR will introduces two new (related) generators described in https://arxiv.org/abs/1905.06706.
A very large portion of the code is from @chistopher.

"girgs" (a generator for up to five-dimensional tori -> https://arxiv.org/abs/1511.00576). The girgs generator adds functionality to NetworKit and hence should be interesting in its own right.
"hypergirgs" (hyperbolic Graphs with positive temperature support)
In our experiments "hypergirgs" is consistently faster (~ 4times) than the band-based RHG generator currently integrated in NetworKit for threshold graphs.

HyperGIRGs also support temperatures T in [0; 1). Here, NetworKit currently uses a QuadTree-based approach with a complexity of (n^{3/2} + m)log n. HyperGirgs, in contrast, has an expected linear complexity, which yields significant speed-ups (we measured between two and three
orders of magnitude for n -> 1e6).

The implementation additionally has nice features, such as deterministic output for identical seeds, even in the context of parallel computation (which the current positive temperature generator has not).

The only caveat is that the current QuadTree generator support temperatures T>1; while those instances are not that interesting (this regime is quite quirky), it seems strange to reduce functionality.

Hence my suggestion would be to remove the band-based generator and replace it by HyperGIRGs, which would also take-over the T < 1 domain from the QuadTree. Then, the slow QuadTree is only used for T > 1. Is this okay?

Currently this PR contains only GIRGs ince input on some design decisions would be appreciate before I migrate hypergirgs.

1.) Since you are reluctant to add new dependencies, I copied the code from https://github.com/chistopher/girgs and put it into networkit/cpp/generators/girgs -- the files are isolated and the generator code itself lives within the NetworKit::girgs namespace.

The generator itself doesn't follow NetworKit coding style in order to ease porting changes from the upstream library. However, the NetworKit bindings (in networkit/cpp/generator/GeometricInhomogenousGenerator.(cpp|h)) to the generator obey the style guide. Is this acceptable?

2.) Similarly to Hyperbolic Graphs, GIRGs are defined via a geometric embedding. Hence the generator starts by distributing points in space. In the past, I was unable to execute certain experiments since the current RHG implementation hides these points.

Hence I added the following interface:

When calling the GIRG constructor the points are already sampled (while we usally have no significant computation before calling generate(), sampling points is very fast and usually takes only a few milliseconds). The points are then accessible via the getPositions()/getWeights() methods. When calling generate() this data is freed to have more memory for the graph builder. If the user wishes to generate the graph and retain the points, generateKeepingInput() is available (I opted for a new method rather than an overload of generate to avoid confusion on the virtual inheritance). OK?

3.) Another use case one might have is to provide a user-prepared set of points to the generator. In the C++ interface this is easily implemented via polymorphic constructors. Unfortunately, we do not have them available in Python. Since I prefer an easy readable constructor signature over a *args/**kwargs mess, I opted to declare a second Python class GeometricInhomogenousFromPointSetGenerator which expects the point set in its constructor.
Is this okay? Do you prefer a static factory function? Alternative suggestions?

Sorry for the long post and thanks!

avdgrinten · 2019-05-21T10:38:23Z

I didn't have time to review the code yet. Hence, I will only try to answer your design questions.

For 2) and 3): Why don't you simply add a (non-virtual) method setPointSet() that can be called after the default constructor has done its job? As far as I can see, this would also solve the problem of 2) (i.e., just sample the points outside of the generator). There could be a static member function as a helper to properly sample the points.

While I don't think that precomputation in a constructor is problematic per se, I believe that most Algorithm classes should be reusable - it is often useful to keep an instance around if you need to call an algorithm multiple times as this avoids expensive memory reallocation.

manpen · 2019-05-21T12:46:32Z

Thank you for your answer! A code review is not necessary at the moment -- I included the code only for completeness. Currently, I only have the questions posted above; so esp. 1) whether foreign coding styles are acceptable in clearly isolated file that are also encapsulated into an own namespace and how we deal with the band-based RHG generator.

Regarding your setPointSet() proposal (I forget to include one crucial point): Most of the time GIRGs/RHGs will probably be used on internally generated data sets. In order to produce these sets, we need parameters such as alpha or avgDeg which are not required when feeding external data.

So as I see it we have four options:
a) Use one class and make these parameter optional. This I do not like, since they are actually needed in most cases and we would throw a runtime error when calling generate for a thing the language could enforce before calling construct.
b) Force the user to provide dummy values. This also seems not nice, esp. when reading foreign code.
c) Use a different class name
d) Use a static factory method that hides b). This breaks to common pattern of first instantiating generator and then calling generate() but will presumably only be necessary infrequently.

I feel c) and d) are viable, but favoured c) since it does not break our paradigm.

avdgrinten · 2019-05-21T13:10:02Z

Regarding the foreign code: I dislike the idea of copying code into NetworKit (even though I know that it has been done in the past) -- I fear that the code will diverge over time and important updates will not be pulled. I don't really see how this is better than adding a submodule (provided that the foreign code is located in a well-maintained public repository).

I would prefer it even more if the submodule could be made optional: i.e., it is built by default but if (for whatever reason) the user decides to not clone it, the build still succeeds (without the GIRG generator, obviously). I think this is a route that we should take in the future as will i) make handling dependencies less painful, and hence ii) allow us to use them more freely.

Regarding the constructor: b) is not viable. a) is not pretty but there is precedent in NetworKit to fail at runtime if parameters are chosen badly.

As far as I can see, this is only a problem in Python right? In C++ we can just add multiple constructors (and use tag dispatch if they would be ambiguous otherwise). My suggestion would be to not compromise the quality of the C++ code because Python is missing features. We could use proper constructors in the C++ case and revert to static factory methods for Python. What do you think about that?

manpen · 2019-05-21T13:24:25Z

Regarding the constructor issue: Yes, this is a purely pythonic issue. The C++ code is all nice and clean using polymorphic constructors. Following your suggestion, I will remove the GeometricInhomogenousFromPointSetGenerator class and leave the GeometricInhomogenousGenerator as with the exception of adding a static factory for the infrequently used case of providing the data set from an externally used source.

Regarding the copied code: it is change somewhat. I migrated to NetworKit's PRNG infrastructure, NetworKit's parallel sorting dispatcher and added SignalHandlers. This cannot be easily achieved in the external library, so we either needed a branch or fork. Both do not solve the issue of possible divergence. Hence I would vote against it and copy the code if that's fine with you -- if we do not rename all internal variable (etc.) then porting possible changes from the original library remains straight forward.

avdgrinten · 2019-05-21T13:41:11Z

Alright, if the code needs to be modified, that changes the situation somewhat. Yes, in that case it is preferable to copy it into NetworKit. Coding style is not a major issue, if it is at least internally consistent.

Regarding the static method: what about just using tag dispatch? Something like

struct UseExternalPointsTag { };

static constexpr UseExternalPointsTag useExternalPoints{};

// Declaration
GeometricInhomogenousGenerator::GeometricInhomogenousGenerator(UseExternalPointsTag, std::vector<...> points)

// Usage
GeometricInhomogenousGenerator myGenerator{useExternalPoints, myPointSet};

I consider this to be more idiomatic C++ than the static factory method.

avdgrinten · 2019-05-21T13:51:59Z

I just noticed that you're not using a static factory method in the C++ code anyway. In this case, my suggestion above does not make sense.

manpen · 2019-05-21T13:53:57Z

Copying code: cool!

Sorry for my inexact reply. I would not touch the C++ code; its currently using three polymorphic overloads:

    /**
     * @param[in] n Number of nodes
     * @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
     * @param[in] powerlawExp Target exponent of power-law distribution with powerlawExp > 2.0
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     * @param[in] dimension Dimension of the underlying geometry with 1 <= dimension <= 5
     */
    GeometricInhomogenousGenerator(count n, double avgDegree, double powerlawExp=3, double alpha=std::numeric_limits<double>::infinity(), unsigned dim=1);

    /**
     * Construct generator from weights that are then scale to match avgDegree and T.
     *
     * @param[in] points Coordinates of points
     * @param[in] weights Unscaled weights (assumed to be powerlaw distributed)
     * @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     *
     * @warning points and weights are moved into the container. The we're not using
     * rvalue refs because Cython does not handle them.
     */
    GeometricInhomogenousGenerator(std::vector<coordinate_t>& points, std::vector<double>& weights, double avgDegree, double alpha);

    /**
     * Construct generator from *already scaled* weights.
     *
     * @param[in] points Coordinates of points
     * @param[in] weights *Scaled* weights
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     *
     * @warning points and weights are moved into the container. The we're not using
     * rvalue refs because Cython does not handle them.
     */
    GeometricInhomogenousGenerator(std::vector<coordinate_t>& points, std::vector<double>& weights, double alpha);

The static factory (@staticmethod) was intended for the Python interface only, so you would get a graph rougly as follows:

gen = nk.generators.GeometricInhomogenousGenerators.fromDataset(coords, weights)
G = gen.generate()

avdgrinten · 2019-05-21T13:55:32Z

networkit/cpp/generators/GeometricInhomogenousGenerator.h

+     * @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
+     * @param[in] T Temperature with 0 <= T < 1
+     */
+    GeometricInhomogenousGenerator(std::vector<coordinate_t>&& points, std::vector<double>&& weights, double avgDegree, double T);


Why do the constructors take the vectors by rvalue reference? You're essentially forcing users who want to pass copies to do this explicitly. I would prefer passing the vectors by value. This will still be cheap (at least if the user passes a temporary or uses std::move) but will also gracefully fall back to a copy.

EDIT: Taking them by (non-const) reference is not a good solution either as this prevents passing temporaries. If the class needs an internal copy, take the vectors by value.

In order to keep the necessary memory footprint as small as possible, we do the following:

Move the user provided data set into the container.

Convert it into an internal data structure

OPTIONAL: Delete the the user provided data structure.

Build the graph <- this is when we need the most memory so deleting the input might make sense.

The reason it is moved into the class is to be able delete the data as soon as it's not needed anymore. If you consider this interface as too complicated, we can drop the capability of deleting the data.

Yes, I understand that you want to do this. My point is that passing by value enables this behavior without unexpected pitfalls. If you take the vector by value, you still get zero-copy behavior if the user passes a rvalue. However, it also works correctly (but less efficiently) if the user passes an lvalue and decides to reuse it later.

Thanks. Missed that and will fix it!

manpen · 2019-06-07T08:51:25Z

Just to let you know: I'm still hunting down a bug in that only occurs on NetworKit's CI on OSX (that's how I spotted #326) which might still take some time.

manpen · 2019-06-18T16:16:16Z

Contrary to the initial PR message, I will add HyperGIRGs in another PR, since this one has already become quite large. So a short recap of what's in this one:

It adds the GeometricInhomogenousRandomGenerator in C++ and Python. The generator itself resides in networkit/cpp/generator/girgs and does not adhere to NetworKit's guide lines.
The PR also bumps up TLX's version and hence adds a warning to CMake make sure that git submodule update was executed. The warning should be adopted whenever we include a newer version of TLX.

manpen · 2019-06-18T16:17:01Z

The failing CI build is only due to an Homebrew issue. Previous builds have been okay.

avdgrinten · 2019-07-01T07:26:06Z

What is the current status of this PR?

manpen · 2019-07-01T07:37:33Z

I'm currently not aware of any bugs and welcome a review.

avdgrinten · 2019-07-01T07:42:14Z

Do you think this PR could be rebased to clean up the history? I am not suggesting to squash it into a single commit but maybe into a smaller number of commits? There are quite a few commits that fix up older commits (and merges from Dev to the feature branch); this makes reviewing the whole thing quite a bit tedious.

manpen · 2019-07-01T07:53:50Z

I can give it a try this afternoon.

Co-authored-by: Christopher Weyand <christopher.weyand@student.hpi.de>

Remove OMP max-reduction as not supported by MSVC Avoid UB on post-decremented iterator GIRGs without MSVC warnings Increase slack in probabilistic test Adhere to NetworKit code style

manpen · 2019-07-02T09:41:34Z

While not perfectly linear, the history should now be much more clean. Some intermediate commits do not compile (as we a couple of file have been moved and commits are now sorted semantically rather than historically). Hope that's alright with you.

avdgrinten · 2019-07-02T09:48:53Z

Yes, that's much better than before. It's a bit unfortunate that intermediate commits don't compile, but if we merge with a merge commit, it is possible to deal with that.

avdgrinten · 2019-07-02T09:49:18Z

I will try to review / merge this PR today.

manpen · 2019-07-02T10:03:05Z

Thank you! It's not urgent.

manpen · 2019-10-11T09:57:20Z

After our last discussion of how to deal with the GIRG / HyperGirg generators, I believe that the necessary changes can be done in the GIRGs library itself in a generic way. If we can add the GIRGs library as another git submodule, I'd close this PR and would prepare another one including only the necessary glue code.

avdgrinten · 2019-10-11T10:32:23Z

That would be great. If it cannot be done, we should treat GIRGs the same way as ttmath and isolate its code in an own directory under extlibs/.

avdgrinten reviewed May 21, 2019

View reviewed changes

manpen force-pushed the feature/GIRGs branch from 59bba9b to 9d0c06e Compare May 25, 2019 21:25

manpen mentioned this pull request May 27, 2019

Simplify #include statements #322

Merged

manpen force-pushed the feature/GIRGs branch 2 times, most recently from 70efaf2 to e95d8e4 Compare June 18, 2019 08:28

avdgrinten closed this Jul 1, 2019

avdgrinten reopened this Jul 1, 2019

manpen and others added 9 commits July 1, 2019 19:48

Copied code from https://github.com/chistopher/girgs

e67cc87

Co-authored-by: Christopher Weyand <christopher.weyand@student.hpi.de>

Inline .inl files

8bf5a19

Migrate GIRGs code into NetworKit

9e6264a

Add GeometricInhomogenousGenerator to NetworKit core

c7bbd11

Add Python bindings and tests

baf70ef

Backport changes from GIRGs upstream

7c06b06

Co-authored-by: Christopher Weyand <christopher.weyand@student.hpi.de>

Litter the code with SignalHandlers

d1e44cd

Move GIRG headers into dedicated include directory

b8394d8

Fix incompatiblities with MSVC and increase stability

14dd3ca

Remove OMP max-reduction as not supported by MSVC Avoid UB on post-decremented iterator GIRGs without MSVC warnings Increase slack in probabilistic test Adhere to NetworKit code style

Add Check to CMake warning about out-of-date submodule

ad9b1d8

manpen force-pushed the feature/GIRGs branch 2 times, most recently from 70efaf2 to ad9b1d8 Compare July 2, 2019 06:22

manpen closed this Oct 11, 2019

chistopher mentioned this pull request Nov 2, 2023

Feature/girgs #1142

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/girgs #320

Feature/girgs #320

manpen commented May 20, 2019 •

edited

avdgrinten commented May 21, 2019

manpen commented May 21, 2019 •

edited

avdgrinten commented May 21, 2019 •

edited

manpen commented May 21, 2019

avdgrinten commented May 21, 2019 •

edited

avdgrinten commented May 21, 2019

manpen commented May 21, 2019

avdgrinten May 21, 2019 •

edited

manpen May 21, 2019 •

edited

avdgrinten May 21, 2019 •

edited

manpen May 21, 2019

manpen commented Jun 7, 2019

manpen commented Jun 18, 2019

manpen commented Jun 18, 2019

avdgrinten commented Jul 1, 2019

manpen commented Jul 1, 2019

avdgrinten commented Jul 1, 2019

manpen commented Jul 1, 2019

manpen commented Jul 2, 2019

avdgrinten commented Jul 2, 2019

avdgrinten commented Jul 2, 2019

manpen commented Jul 2, 2019

manpen commented Oct 11, 2019

avdgrinten commented Oct 11, 2019

Feature/girgs #320

Feature/girgs #320

Conversation

manpen commented May 20, 2019 • edited

avdgrinten commented May 21, 2019

manpen commented May 21, 2019 • edited

avdgrinten commented May 21, 2019 • edited

manpen commented May 21, 2019

avdgrinten commented May 21, 2019 • edited

avdgrinten commented May 21, 2019

manpen commented May 21, 2019

avdgrinten May 21, 2019 • edited

Choose a reason for hiding this comment

manpen May 21, 2019 • edited

Choose a reason for hiding this comment

avdgrinten May 21, 2019 • edited

Choose a reason for hiding this comment

manpen May 21, 2019

Choose a reason for hiding this comment

manpen commented Jun 7, 2019

manpen commented Jun 18, 2019

manpen commented Jun 18, 2019

avdgrinten commented Jul 1, 2019

manpen commented Jul 1, 2019

avdgrinten commented Jul 1, 2019

manpen commented Jul 1, 2019

manpen commented Jul 2, 2019

avdgrinten commented Jul 2, 2019

avdgrinten commented Jul 2, 2019

manpen commented Jul 2, 2019

manpen commented Oct 11, 2019

avdgrinten commented Oct 11, 2019

manpen commented May 20, 2019 •

edited

manpen commented May 21, 2019 •

edited

avdgrinten commented May 21, 2019 •

edited

avdgrinten commented May 21, 2019 •

edited

avdgrinten May 21, 2019 •

edited

manpen May 21, 2019 •

edited

avdgrinten May 21, 2019 •

edited