
Make mlpack completely header-only #3250

Merged
rcurtin merged 21 commits into mlpack:master on Aug 15, 2022

Conversation

@rcurtin
Member

@rcurtin rcurtin commented Jul 31, 2022

This PR does the last step---it removes libmlpack.so entirely. This will probably require some adaptation downstream in the examples and models repositories, but that should be pretty easy (just don't link against libmlpack.so anymore).

Most of this PR is CMake reconfiguration and simplification: now that there is no libmlpack.so, there's a lot less that we have to do.

A shortlist of notable modifications here:

  • libmlpack.so is gone, and so linking for bindings and tests is now a little bit simpler.

  • The arma_config_check.hpp file, which made sure that the same compilation settings used to build libmlpack.so were also used when mlpack was included, is no longer necessary, so all related CMake infrastructure has been removed.

  • The pkgconfig generator is modified so it no longer includes -lmlpack in the linker command.

  • mlpack_export.hpp is no longer needed---so everything related to that is now gone.

  • Documentation is updated to reflect that mlpack is now header-only (maybe it could be updated in more places---I would be interested in people's comments on where to do that).

  • Finally, there is now only one source file that ever changes as a result of CMake: src/mlpack/util/gitversion.hpp. This is already generated directly into src/, and not into build/include/, so there is no compelling reason for the first step of every build to be copying every mlpack header into build/include/. Therefore, I removed the mlpack_headers target, and there is no longer a step to copy all of the headers. This should accelerate builds, I hope, or at least remove some of the tedium... the only "downside" is that users who are used to including the build/include/ directory (if they build mlpack without installing, for instance, like I often do) will need to include src/ instead---a minor change.

@conradsnicta
Contributor

conradsnicta commented Aug 1, 2022

@rcurtin I thought more about the "include everything in one header" issue (follow-up to #3233). Rather than making the "one header" approach mandatory and converting all of the codebase, I suggest making it an option. With this, both the old and new ways of including mlpack functionality would work.

More specifically:

  • for folks that simply want the convenience (at the possible cost of increased compilation time) they can do #include <mlpack.hpp> and be done
  • for folks that want to be more selective in what is included (to avoid increased compilation time), they can include specific subsets of mlpack headers, as is done now
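
As a rough sketch of the two styles (the selective path below is just one example from the current source tree, and the exact contents of a single mlpack.hpp are of course still up for discussion):

// Convenience: one include pulls in the whole library.
#include <mlpack.hpp>

// Selective: include only what the program actually uses, trading a little
// verbosity for shorter compile times.
#include <mlpack/methods/linear_regression/linear_regression.hpp>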

@rcurtin
Member Author

rcurtin commented Aug 12, 2022

> @rcurtin I thought more about the "include everything in one header" issue (follow-up to #3233). Rather than making the "one header" approach mandatory and converting all of the codebase, I suggest making it an option. With this, both the old and new ways of including mlpack functionality would work.
>
> More specifically:
>
>   • for folks that simply want the convenience (at the possible cost of increased compilation time) they can do #include <mlpack.hpp> and be done
>
>   • for folks that want to be more selective in what is included (to avoid increased compilation time), they can include specific subsets of mlpack headers, as is done now

@conradsnicta I did some serious digging into this issue. Fundamentally, the cost of including all of mlpack's headers is not too high, although it is noticeable. If I make a header that includes everything---except the code that enables serialization for ANN layers (more on that later)---and include it in the mnist_cnn example and the example code from this linear regression example, I see these changes:

  • mnist_cnn

    • g++, this branch with only necessary headers included (and no serialization.hpp): 11.4s, ~800MB RAM
    • g++, this branch with all headers included (and no serialization.hpp): 18.8s, ~1.6GB RAM
    • g++, this branch with only necessary headers included (and serialization): 65.0s, ~4.5GB RAM
    • clang, this branch with only necessary headers included (and no serialization.hpp): 11.2s, ~300MB RAM
    • clang, this branch with all headers included (and no serialization.hpp): 17.4s, ~900MB RAM
  • linear_regression

    • g++, this branch with only necessary headers included: 6.0s, ~750MB RAM
    • g++, this branch with all headers included (and no serialization.hpp): 13.6s, ~1.5GB RAM
    • g++, this branch with all headers included (and serialization): 63.6s, ~4.5GB RAM
    • clang, this branch with only necessary headers included: 5.1s, ~300MB RAM
    • clang, this branch with all headers included (and no serialization.hpp): 11.9s, ~900MB RAM
    • clang, this branch with all headers included (and serialization): 192.0s, ~3GB RAM

(I found that gcc's precompiled headers did not help much.)

So, fundamentally, it is not too painful to include all of the headers, and I agree that your approach of supplying an mlpack.hpp header that includes everything is reasonable. We can then make sure the documentation points out that users can reduce compile times and memory usage by only including what they need. At the same time, I will need to go through the library and make sure that each directory has a "top-level" include file that pulls in everything related to that module. So, e.g., #include <mlpack/methods/cf.hpp> should include all the bells and whistles of the CF module, instead of the bare minimum. Some directories don't have any top-level include at all right now.
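
For the per-directory headers, something like this is the intended usage (the paths are illustrative; which modules already ship a top-level header is exactly what needs checking):

// Intended: one top-level header per module that pulls in all of its pieces.
#include <mlpack/methods/cf.hpp>

// ...instead of cherry-picking individual headers underneath methods/cf/,
// which is what the modules without a top-level include currently require.
#include <mlpack/methods/cf/cf.hpp>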

Now, about the crazy serialization numbers: this is what surprised me, though in retrospect it makes perfect sense. Including this file causes compile times of at least one minute and memory usage of at least 3GB of RAM. It also makes the resulting programs much larger! The reason, as it turns out, is that the compiler must instantiate every single layer that we allow to be serializable. This is because cereal must have a constructor definition for any polymorphic class that it might be serializing, since deserialization might encounter an arbitrary layer type. So what this boils down to is that any program that includes that file must contain compiled versions of every layer type---which places an insane demand on the compiler, especially if the user never even uses a neural network (or serializes one).

There is no reasonable way to avoid that compilation cost, or to automatically compile only the layers that a user has explicitly serialized---since a program could feasibly just be loading a model, a user may never actually manually instantiate a layer.

So, this presents a dilemma that I plan to solve like this:

  • By default, ANN layers will not be serializable. Users' code will still compile, but if they try to serialize a model, it will fail at runtime (cereal will issue some specific error). An FAQ section will be added to the README and the website for this specific error, suggesting the following solutions:

    • Users can manually #include <mlpack/methods/ann/layer/serialization.hpp> to allow serialization of all mlpack's layer types (where MatType = arma::mat), although this will come with a heavy compilation cost. It is convenient, though, and necessary if the program is to be able to load arbitrary networks.

    • Users who want to minimize compilation time can manually write CEREAL_REGISTER_TYPE(Layer) for each layer that they use (see the sketch after this list). It is perhaps inconvenient and a little ugly to do that, but there is no realistic alternative.
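
For that second option, a minimal sketch of what the manual registration might look like (the single mlpack.hpp include, the extra cereal include, and the exact layer spelling are assumptions; the precise macros mlpack ends up requiring may differ):

#include <mlpack.hpp>
#include <cereal/types/polymorphic.hpp>

// Register only the layer types this program actually serializes, so that the
// compiler instantiates just those templates rather than every layer.
CEREAL_REGISTER_TYPE(mlpack::LinearType<arma::mat>);
// ...one registration per layer type that the serialized model contains.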

(Also, a side note: when the serialization issue is handled correctly, these runtimes are way better than before #2777: compiling mnist_cnn with boost::visitor takes 50s and uses 3GB RAM.)

Anyway, happy to hear any comments on this approach. However, I will implement it in a separate PR, since it'll involve a lot of moving includes around and other drudgery.

@zoq
Member

zoq left a comment

This is huge, awesome work.

@conradsnicta
Contributor

@rcurtin Thanks for digging into this. Even though there is a slowdown when including everything, I bet that many people would be willing to use this for the sake of convenience and simplicity.

For the ANN serialisation problem, there are a few approaches. First, I suggest making a simple define-based option to enable serialisation of ANNs, which is then detected by mlpack headers, rather than having users directly use CEREAL_REGISTER_TYPE(Layer). For example:

#define MLPACK_ENABLE_ANN_SERIALISATION
#include <mlpack.hpp>

and then mlpack would internally invoke the appropriate set of CEREAL_REGISTER_TYPE(Layer) registrations.
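
Concretely, the detection inside the headers could be as simple as something along these lines (the guarded header path is the serialization.hpp mentioned above; where exactly the check lives is up to mlpack):

// Inside mlpack's headers (location hypothetical), honour the user's define:
#if defined(MLPACK_ENABLE_ANN_SERIALISATION)
  #include <mlpack/methods/ann/layer/serialization.hpp>
#endif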

Second, serialisation is not execution-time critical, so we can direct the compiler not to spend too much time optimising the produced code. This can be accomplished by adding attributes to serialisation-related functions. As an example, say we want to have an attribute named mlpack_cold, which can be defined for GCC and clang as follows:

#define mlpack_cold

#if defined(__GNUG__) && (!defined(__clang__))
  #undef  mlpack_cold
  #define mlpack_cold __attribute__((__cold__))
#endif

#if defined(__clang__)
  #if !defined(__has_attribute)
    #define __has_attribute(x) 0
  #endif

  #if __has_attribute(__cold__)
    #undef  mlpack_cold
    #define mlpack_cold __attribute__((__cold__))
  #elif __has_attribute(__minsize__)
    #undef  mlpack_cold
    #define mlpack_cold __attribute__((__minsize__))
  #endif
#endif

Then a serialisation function would be decorated with mlpack_cold along these lines:

inline
mlpack_cold
bool
serialise(output_object& out, const input_object& in) { ... }

I've used a similar approach to decorate Mat::save() and Mat::load() functions within Armadillo, as well as a subset of functions within the diskio class. This has led to minor but measurable decreases in compilation time. It's possible that the effect with Cereal would be more pronounced, depending on how complex the underlying code is.

Yet another option would be to rewrite the ANN serialisation code, so that all the ANN layers are first converted into a contiguous block of memory (akin to a raw dump). Then Cereal would be used only to serialise that block of memory, instead of hooking into the guts of ANN code.
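
A rough, self-contained sketch of that last idea (the flattening step is a placeholder; real code would walk the network's layers and parameters):

#include <cereal/archives/binary.hpp>
#include <cereal/types/vector.hpp>
#include <fstream>
#include <vector>

// Placeholder: pretend this walks an ANN and copies its parameters into bytes.
std::vector<unsigned char> FlattenNetworkToBytes()
{
  return std::vector<unsigned char>(1024, 0);
}

int main()
{
  std::ofstream out("model.bin", std::ios::binary);
  cereal::BinaryOutputArchive ar(out);
  // cereal only ever sees one contiguous, non-polymorphic object, so no layer
  // types need to be registered or instantiated.
  ar(FlattenNetworkToBytes());
}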

@rcurtin
Member Author

rcurtin commented Aug 15, 2022

> For the ANN serialisation problem, there are a few approaches. First, I suggest making a simple define-based option to enable serialisation of ANNs, which is then detected by mlpack headers.

Yep, that's exactly what I'm thinking!

> Second, serialisation is not execution-time critical, so we can direct the compiler not to spend too much time optimising the produced code. This can be accomplished by adding attributes to serialisation-related functions. As an example, say we want to have an attribute named mlpack_cold, which can be defined for GCC and clang as follows:

This is a nice idea but I think that it would not apply here. The issue is not so much that the code generated by CEREAL_REGISTER_TYPE(Linear) (or whatever layer) is being over-optimized, but instead that when I call CEREAL_REGISTER_TYPE(Linear), Linear is actually the templated class LinearType<arma::mat>, which must be fully instantiated---and the methods involved aren't just serialization but instead all the methods that each layer implements. So it might still be useful to add something like mlpack_cold in various places, but I don't think it will change the reality that registering serialization for a layer causes a significant number of template instantiations. Let me know if I overlooked something there. 👍
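
To illustrate with a standalone example (this is not mlpack code; the point is only that requiring the complete specialization, which is roughly what registration ends up doing per the explanation above, compiles every member of the class template):

#include <armadillo>

// Stand-in for a real mlpack layer template with many member functions.
template<typename MatType>
class LinearType
{
 public:
  LinearType() : weights(10, 10, arma::fill::randu) { }
  MatType Forward(const MatType& x) { return weights * x; }
  // ...the real layer has many more members, all of which must be compiled...
 private:
  MatType weights;
};

// Forcing the complete specialization: every member of LinearType<arma::mat>
// now gets compiled, even though no Linear object is ever constructed.
template class LinearType<arma::mat>;

using Linear = LinearType<arma::mat>;  // the alias the registration macro sees

int main() { }  // nothing runs; the cost in question is entirely at compile time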

@rcurtin rcurtin merged commit a472496 into mlpack:master Aug 15, 2022
@rcurtin rcurtin deleted the everything-header-only branch August 15, 2022 12:54
@shrit
Member

shrit commented Aug 15, 2022

@rcurtin Great work on this. I wanted to review it this weekend, but I had no chance. I was facing some DNS issues.
This is really huge 🚀 💯

@rcurtin
Member Author

rcurtin commented Aug 15, 2022

No worries, there are a few follow-up PRs in progress if you want to review those 😄 and, if you find any issues with this PR, I'll handle any comments that you post. 👍
