Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/girgs #320

Closed
wants to merge 10 commits into from
Closed

Feature/girgs #320

wants to merge 10 commits into from

Conversation

manpen
Copy link
Contributor

@manpen manpen commented May 20, 2019

This PR will introduces two new (related) generators described in https://arxiv.org/abs/1905.06706.
A very large portion of the code is from @chistopher.

  • "girgs" (a generator for up to five-dimensional tori -> https://arxiv.org/abs/1511.00576). The girgs generator adds functionality to NetworKit and hence should be interesting in its own right.

  • "hypergirgs" (hyperbolic Graphs with positive temperature support)
    In our experiments "hypergirgs" is consistently faster (~ 4times) than the band-based RHG generator currently integrated in NetworKit for threshold graphs.

HyperGIRGs also support temperatures T in [0; 1). Here, NetworKit currently uses a QuadTree-based approach with a complexity of (n^{3/2} + m)log n. HyperGirgs, in contrast, has an expected linear complexity, which yields significant speed-ups (we measured between two and three
orders of magnitude for n -> 1e6).

The implementation additionally has nice features, such as deterministic output for identical seeds, even in the context of parallel computation (which the current positive temperature generator has not).

The only caveat is that the current QuadTree generator support temperatures T>1; while those instances are not that interesting (this regime is quite quirky), it seems strange to reduce functionality.

Hence my suggestion would be to remove the band-based generator and replace it by HyperGIRGs, which would also take-over the T < 1 domain from the QuadTree. Then, the slow QuadTree is only used for T > 1. Is this okay?

Currently this PR contains only GIRGs ince input on some design decisions would be appreciate before I migrate hypergirgs.

1.) Since you are reluctant to add new dependencies, I copied the code from https://github.com/chistopher/girgs and put it into networkit/cpp/generators/girgs -- the files are isolated and the generator code itself lives within the NetworKit::girgs namespace.

The generator itself doesn't follow NetworKit coding style in order to ease porting changes from the upstream library. However, the NetworKit bindings (in networkit/cpp/generator/GeometricInhomogenousGenerator.(cpp|h)) to the generator obey the style guide. Is this acceptable?

2.) Similarly to Hyperbolic Graphs, GIRGs are defined via a geometric embedding. Hence the generator starts by distributing points in space. In the past, I was unable to execute certain experiments since the current RHG implementation hides these points.

Hence I added the following interface:

When calling the GIRG constructor the points are already sampled (while we usally have no significant computation before calling generate(), sampling points is very fast and usually takes only a few milliseconds). The points are then accessible via the getPositions()/getWeights() methods. When calling generate() this data is freed to have more memory for the graph builder. If the user wishes to generate the graph and retain the points, generateKeepingInput() is available (I opted for a new method rather than an overload of generate to avoid confusion on the virtual inheritance). OK?

3.) Another use case one might have is to provide a user-prepared set of points to the generator. In the C++ interface this is easily implemented via polymorphic constructors. Unfortunately, we do not have them available in Python. Since I prefer an easy readable constructor signature over a *args/**kwargs mess, I opted to declare a second Python class GeometricInhomogenousFromPointSetGenerator which expects the point set in its constructor.
Is this okay? Do you prefer a static factory function? Alternative suggestions?

Sorry for the long post and thanks!

@avdgrinten
Copy link
Contributor

I didn't have time to review the code yet. Hence, I will only try to answer your design questions.

For 2) and 3): Why don't you simply add a (non-virtual) method setPointSet() that can be called after the default constructor has done its job? As far as I can see, this would also solve the problem of 2) (i.e., just sample the points outside of the generator). There could be a static member function as a helper to properly sample the points.

While I don't think that precomputation in a constructor is problematic per se, I believe that most Algorithm classes should be reusable - it is often useful to keep an instance around if you need to call an algorithm multiple times as this avoids expensive memory reallocation.

@manpen
Copy link
Contributor Author

manpen commented May 21, 2019

Thank you for your answer! A code review is not necessary at the moment -- I included the code only for completeness. Currently, I only have the questions posted above; so esp. 1) whether foreign coding styles are acceptable in clearly isolated file that are also encapsulated into an own namespace and how we deal with the band-based RHG generator.

Regarding your setPointSet() proposal (I forget to include one crucial point): Most of the time GIRGs/RHGs will probably be used on internally generated data sets. In order to produce these sets, we need parameters such as alpha or avgDeg which are not required when feeding external data.

So as I see it we have four options:
a) Use one class and make these parameter optional. This I do not like, since they are actually needed in most cases and we would throw a runtime error when calling generate for a thing the language could enforce before calling construct.
b) Force the user to provide dummy values. This also seems not nice, esp. when reading foreign code.
c) Use a different class name
d) Use a static factory method that hides b). This breaks to common pattern of first instantiating generator and then calling generate() but will presumably only be necessary infrequently.

I feel c) and d) are viable, but favoured c) since it does not break our paradigm.

@avdgrinten
Copy link
Contributor

avdgrinten commented May 21, 2019

Regarding the foreign code: I dislike the idea of copying code into NetworKit (even though I know that it has been done in the past) -- I fear that the code will diverge over time and important updates will not be pulled. I don't really see how this is better than adding a submodule (provided that the foreign code is located in a well-maintained public repository).

I would prefer it even more if the submodule could be made optional: i.e., it is built by default but if (for whatever reason) the user decides to not clone it, the build still succeeds (without the GIRG generator, obviously). I think this is a route that we should take in the future as will i) make handling dependencies less painful, and hence ii) allow us to use them more freely.

Regarding the constructor: b) is not viable. a) is not pretty but there is precedent in NetworKit to fail at runtime if parameters are chosen badly.

As far as I can see, this is only a problem in Python right? In C++ we can just add multiple constructors (and use tag dispatch if they would be ambiguous otherwise). My suggestion would be to not compromise the quality of the C++ code because Python is missing features. We could use proper constructors in the C++ case and revert to static factory methods for Python. What do you think about that?

@manpen
Copy link
Contributor Author

manpen commented May 21, 2019

Regarding the constructor issue: Yes, this is a purely pythonic issue. The C++ code is all nice and clean using polymorphic constructors. Following your suggestion, I will remove the GeometricInhomogenousFromPointSetGenerator class and leave the GeometricInhomogenousGenerator as with the exception of adding a static factory for the infrequently used case of providing the data set from an externally used source.

Regarding the copied code: it is change somewhat. I migrated to NetworKit's PRNG infrastructure, NetworKit's parallel sorting dispatcher and added SignalHandlers. This cannot be easily achieved in the external library, so we either needed a branch or fork. Both do not solve the issue of possible divergence. Hence I would vote against it and copy the code if that's fine with you -- if we do not rename all internal variable (etc.) then porting possible changes from the original library remains straight forward.

@avdgrinten
Copy link
Contributor

avdgrinten commented May 21, 2019

Alright, if the code needs to be modified, that changes the situation somewhat. Yes, in that case it is preferable to copy it into NetworKit. Coding style is not a major issue, if it is at least internally consistent.

Regarding the static method: what about just using tag dispatch? Something like

struct UseExternalPointsTag { };

static constexpr UseExternalPointsTag useExternalPoints{};

// Declaration
GeometricInhomogenousGenerator::GeometricInhomogenousGenerator(UseExternalPointsTag, std::vector<...> points)

// Usage
GeometricInhomogenousGenerator myGenerator{useExternalPoints, myPointSet};

I consider this to be more idiomatic C++ than the static factory method.

@avdgrinten
Copy link
Contributor

I just noticed that you're not using a static factory method in the C++ code anyway. In this case, my suggestion above does not make sense.

@manpen
Copy link
Contributor Author

manpen commented May 21, 2019

Copying code: cool!

Sorry for my inexact reply. I would not touch the C++ code; its currently using three polymorphic overloads:

    /**
     * @param[in] n Number of nodes
     * @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
     * @param[in] powerlawExp Target exponent of power-law distribution with powerlawExp > 2.0
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     * @param[in] dimension Dimension of the underlying geometry with 1 <= dimension <= 5
     */
    GeometricInhomogenousGenerator(count n, double avgDegree, double powerlawExp=3, double alpha=std::numeric_limits<double>::infinity(), unsigned dim=1);

    /**
     * Construct generator from weights that are then scale to match avgDegree and T.
     *
     * @param[in] points Coordinates of points
     * @param[in] weights Unscaled weights (assumed to be powerlaw distributed)
     * @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     *
     * @warning points and weights are moved into the container. The we're not using
     * rvalue refs because Cython does not handle them.
     */
    GeometricInhomogenousGenerator(std::vector<coordinate_t>& points, std::vector<double>& weights, double avgDegree, double alpha);

    /**
     * Construct generator from *already scaled* weights.
     *
     * @param[in] points Coordinates of points
     * @param[in] weights *Scaled* weights
     * @param[in] alpha parameter (1/temperature) with alpha > 1
     *
     * @warning points and weights are moved into the container. The we're not using
     * rvalue refs because Cython does not handle them.
     */
    GeometricInhomogenousGenerator(std::vector<coordinate_t>& points, std::vector<double>& weights, double alpha);

The static factory (@staticmethod) was intended for the Python interface only, so you would get a graph rougly as follows:

gen = nk.generators.GeometricInhomogenousGenerators.fromDataset(coords, weights)
G = gen.generate()

* @param[in] avgDegree The desired average degree with 0 < avgDegree < n-1
* @param[in] T Temperature with 0 <= T < 1
*/
GeometricInhomogenousGenerator(std::vector<coordinate_t>&& points, std::vector<double>&& weights, double avgDegree, double T);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the constructors take the vectors by rvalue reference? You're essentially forcing users who want to pass copies to do this explicitly. I would prefer passing the vectors by value. This will still be cheap (at least if the user passes a temporary or uses std::move) but will also gracefully fall back to a copy.

EDIT: Taking them by (non-const) reference is not a good solution either as this prevents passing temporaries. If the class needs an internal copy, take the vectors by value.

Copy link
Contributor Author

@manpen manpen May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to keep the necessary memory footprint as small as possible, we do the following:

  • Move the user provided data set into the container.
  • Convert it into an internal data structure
  • OPTIONAL: Delete the the user provided data structure.
  • Build the graph <- this is when we need the most memory so deleting the input might make sense.

The reason it is moved into the class is to be able delete the data as soon as it's not needed anymore. If you consider this interface as too complicated, we can drop the capability of deleting the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that you want to do this. My point is that passing by value enables this behavior without unexpected pitfalls. If you take the vector by value, you still get zero-copy behavior if the user passes a rvalue. However, it also works correctly (but less efficiently) if the user passes an lvalue and decides to reuse it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Missed that and will fix it!

@manpen
Copy link
Contributor Author

manpen commented Jun 7, 2019

Just to let you know: I'm still hunting down a bug in that only occurs on NetworKit's CI on OSX (that's how I spotted #326) which might still take some time.

@manpen manpen force-pushed the feature/GIRGs branch 2 times, most recently from 70efaf2 to e95d8e4 Compare June 18, 2019 08:28
@manpen
Copy link
Contributor Author

manpen commented Jun 18, 2019

Contrary to the initial PR message, I will add HyperGIRGs in another PR, since this one has already become quite large. So a short recap of what's in this one:

  • It adds the GeometricInhomogenousRandomGenerator in C++ and Python. The generator itself resides in networkit/cpp/generator/girgs and does not adhere to NetworKit's guide lines.
  • The PR also bumps up TLX's version and hence adds a warning to CMake make sure that git submodule update was executed. The warning should be adopted whenever we include a newer version of TLX.

@manpen
Copy link
Contributor Author

manpen commented Jun 18, 2019

The failing CI build is only due to an Homebrew issue. Previous builds have been okay.

@avdgrinten
Copy link
Contributor

What is the current status of this PR?

@manpen
Copy link
Contributor Author

manpen commented Jul 1, 2019

I'm currently not aware of any bugs and welcome a review.

@avdgrinten avdgrinten closed this Jul 1, 2019
@avdgrinten avdgrinten reopened this Jul 1, 2019
@avdgrinten
Copy link
Contributor

Do you think this PR could be rebased to clean up the history? I am not suggesting to squash it into a single commit but maybe into a smaller number of commits? There are quite a few commits that fix up older commits (and merges from Dev to the feature branch); this makes reviewing the whole thing quite a bit tedious.

@manpen
Copy link
Contributor Author

manpen commented Jul 1, 2019

I can give it a try this afternoon.

manpen and others added 9 commits July 1, 2019 19:48
Co-authored-by: Christopher Weyand <christopher.weyand@student.hpi.de>
Co-authored-by: Christopher Weyand <christopher.weyand@student.hpi.de>
Remove OMP max-reduction as not supported by MSVC
Avoid UB on post-decremented iterator
GIRGs without MSVC warnings
Increase slack in probabilistic test
Adhere to NetworKit code style
@manpen manpen force-pushed the feature/GIRGs branch 2 times, most recently from 70efaf2 to ad9b1d8 Compare July 2, 2019 06:22
@manpen
Copy link
Contributor Author

manpen commented Jul 2, 2019

While not perfectly linear, the history should now be much more clean. Some intermediate commits do not compile (as we a couple of file have been moved and commits are now sorted semantically rather than historically). Hope that's alright with you.

@avdgrinten
Copy link
Contributor

Yes, that's much better than before. It's a bit unfortunate that intermediate commits don't compile, but if we merge with a merge commit, it is possible to deal with that.

@avdgrinten
Copy link
Contributor

I will try to review / merge this PR today.

@manpen
Copy link
Contributor Author

manpen commented Jul 2, 2019

Thank you! It's not urgent.

@manpen
Copy link
Contributor Author

manpen commented Oct 11, 2019

After our last discussion of how to deal with the GIRG / HyperGirg generators, I believe that the necessary changes can be done in the GIRGs library itself in a generic way. If we can add the GIRGs library as another git submodule, I'd close this PR and would prepare another one including only the necessary glue code.

@avdgrinten
Copy link
Contributor

That would be great. If it cannot be done, we should treat GIRGs the same way as ttmath and isolate its code in an own directory under extlibs/.

@manpen manpen closed this Oct 11, 2019
@chistopher chistopher mentioned this pull request Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants