
"The future of mlpack", round two! #2524

Closed
rcurtin opened this issue Jul 19, 2020 · 53 comments
Closed

"The future of mlpack", round two! #2524

rcurtin opened this issue Jul 19, 2020 · 53 comments

Comments

@rcurtin
Member

rcurtin commented Jul 19, 2020

Hello everyone!

Nearly ten years ago, I wrote a document called "The Future of MLPACK": http://www.ratml.org/misc/mlpack_future.pdf

That document laid out four goals for the development of mlpack:

  • Create scalable, fast machine learning algorithms.
  • Design an intuitive, simple API for users who are not C++ gurus.
  • Implement as large a collection as possible of machine learning methods.
  • Provide cutting-edge machine learning algorithms that no other library does.

In the decade since I wrote that, I think that we have made some incredible efforts towards those goals. But now it's 2020, and maybe it's time to revisit them.

In the past ten years the world has changed in ways that I certainly couldn't have predicted; when I wrote those four goals above, Python was not even the dominant language for data science! I think the term "data scientist" hadn't even really entered the popular lexicon.

Is anyone here interested in discussing the directions we should take in the next 3-5 years or so? If we could make a new design document with our goals and the things we want to see mlpack solve, this could be really helpful for new contributors---and for users---to know "what we're all about" and what we're aimed at.

I've certainly learned a lot in the past ten years about project planning and setting goals. So I'd love the chance to help moderate, guide, and contribute to a discussion like this. More importantly, I'd love to see what each of our development interests are, so that maybe we can all team up on the things we all believe in to make them a reality. :)

So... to kick it off:

  • What's a goal that you think should be in an mlpack development plan, and why?

Let's see where the discussion goes. :)

(Update: downthread, there is a new design document: #2524 (comment))

@abernauer

abernauer commented Jul 19, 2020

So I have two ideas that come to mind immediately:

  1. Standardize interfaces so they align more closely with typical machine learning workflow.
  2. Make mlpack more accessible to new users or those lacking C++ expertise.

Number one has already been addressed to a degree with #2421. I think that's a good goal because it makes mlpack less foreign to people whose primary experience comes from working with machine learning in other languages, and at the same time it aligns well with the original goals of the project. From my own experience using the command-line bindings, I found those pretty intuitive and easy to use. Goal one might be more relevant for the bindings to other languages, but I think it's a good idea to streamline those interfaces to be consistent with other software libraries, as I think that would appeal more to machine learning practitioners while still maintaining mlpack's advantages over other machine learning libraries.

Number two is sort of addressed with the simple API, but I feel like making mlpack approachable to new users is a great way to grow the community. When I first got interested in mlpack last spring (after hearing about the library through a mentor at my university), my initial impression of the project was that it was a blazing-fast machine learning library with algorithms that others didn't offer. On the other hand, after looking at the code base and all the dependencies, the library looked a bit overwhelming to a newcomer. Given my interest in software development and learning more skills, I shrugged that concern off and ended up finding the community very helpful and welcoming. However, I don't think that is the norm for most people who are just getting interested in machine learning and software development. Lastly, I think if we're able to lower the barrier to entry without sacrificing any of mlpack's advantages, growing the popularity of the library will be a lot easier.

@shrit
Member

shrit commented Jul 19, 2020

I do not know why, but I had an inner feeling this year that we were going to revisit this document. Maybe it is because I read the document last year, or maybe because it is 2020. So I had already prepared a list in my head, and I think it is time to share it with the community.

Some of these ideas were born during my work on my GSoC project; the rest come from using mlpack in day-to-day life, and from what I would like to see in mlpack in the next 10 years.

I am going to summarize each idea under a title and give the reason why these goals are important to me.

  • Example: Make the example repository as good as the tests we have in mlpack.

I started using mlpack at the end of 2018. I was looking for a specific example of neural networks; I did not come across the examples repository, but I found the testing code, which provided a good example. I believe that mlpack has a great test set. At the same time, there are only a few examples of how to use mlpack's methods, and I would like to see at least one example in that repository for each method we have in mlpack. Why? I do not have a lot of time to search through API functions; a ready-to-use example is a great help, especially when you are working on a project and would like to test code rapidly without losing time looking for details in the API. I would like to work on that at the end of the year, when I have more time. To give an idea of how good the test suite in mlpack is: adaboost_test.cpp has about 900 lines of code just to test that one algorithm.

  • Documentation: Improve documentation to be as good as Keras.

This one is related to the examples repository. I like the documentation in Keras and in scikit-learn; I find it extremely educational and helpful for testing code rapidly. I love to learn what a new machine learning algorithm does by reading and executing an example quickly; this is much more efficient than reading the math in a research paper and imagining how the authors got their results. I know mlpack has good documentation, but I am still fascinated by scikit-learn's. That is why I think this point is important for the coming years.

  • Bandicoot for GPU support.

Personally, I am lucky to have several good CPUs to train on, but this is not the case for a lot of people. I will be happy to see mlpack run on a GPU using Bandicoot, and I hope to see that very soon. Maybe a GSoC idea to accelerate development in this direction?

  • Advertisement.

I would like to see mlpack known by everyone in the machine learning domain, though I am not an expert on how to achieve this. I am aware that mlpack has gained a lot of visibility in the last 10 years and will gain more in the coming years, but an idea to boost its visibility would be a great help.

  • Dependencies: remove the burden of external dependencies.

I think part of this issue has been addressed in #2440, and the issue will stay open until all of the Boost dependencies are removed. The idea here is to make sure that the user is capable of installing mlpack without the need for any external dependencies; that would mean shipping mlpack with all dependencies included, or with an auto-installer for dependencies. The only required dependencies would be g++ and the standard C++ library. I can argue for this by saying that I would then be able to install mlpack on a buildroot system, or just download mlpack, type make, and go have some coffee (or watch a film, if you have an Intel i5 processor 😃). Even a beginner will feel very satisfied installing mlpack by just typing make.

  • Embedded systems: continue support for embedded platforms.

This idea was born with my GSoC project, and I will be happy to continue working on it and adding support for new platforms after the project ends. I like embedded systems; I do not know why. The point here is that mlpack has more potential on embedded systems than TensorFlow, PyTorch, or any other machine learning library, because mlpack is fast and written entirely in C++.

  • Build system: add support for several build systems.

I made this the last point because I do not think it is that important. mlpack uses CMake, which is a good build system and widely used in most C++ projects. At the same time, in the last ten years we have seen a lot of new build systems, such as Meson. I see more and more projects supporting it alongside CMake, so maybe the idea is worth discussing, but it is not that important.

Finally, these are my thoughts based on observing, testing, and using mlpack; they were born from day-to-day use of the library. Most of them are general thoughts; I know that development will continue and improve by adding new methods and refactoring the old code to stay as simple and as fast as possible.

These are my development interests; I hope to see them become reality in the coming years :).

@bkmgit
Contributor

bkmgit commented Jul 20, 2020

GPU support may be helpful. Bandicoot may be a good fit because it can be used with Armadillo and seems portable, though it looks like it is still under development. I am not sure how it will compare to other GPU libraries such as oneDNN, cuBLAS, RadeonML, Arm Compute Library, cuDNN, and Arm NN.

Libraries similar to mlpack are FANN, Shogun, and Dlib. These are relatively lightweight and primarily use C++; FANN also has many language bindings. Their specific design and dependency choices are somewhat different, though.

Android support would be helpful.

ONNX and/or TVM support may be good to have.

Finally, I would like to have command-line input for neural networks (related issue: #1254). Command-line bindings do not seem to be available for many ML libraries, and they are very helpful for production workflows.

@zoq
Member

zoq commented Jul 20, 2020

I completely agree with @shrit about the examples. Even if our tests are often a good place to start, I don't like having to point people to the tests when they are looking for how to use mlpack. For one, the tests are really strict; they often focus on a single feature, and they don't show how you can use the bindings or the CLI interface. That said, I would really like to see more real-world examples, ideally in a ready-to-try form. I think the majority nowadays just searches for a ready-to-use solution; they don't care if the interface is elegant, simple to use, or fast, at least not at first. All they want is a ready-to-use solution where they can put their dataset in and be done. If that fails, they move along to the next promising solution.

So basically I want to point a user to a package they can install or some service they can use to try mlpack and also point people to an example that shows how they can use mlpack to solve a specific problem.

I'm already partially working on a solution that uses Jupyter notebooks to provide a platform that doesn't require any setup, and I'm working on more examples.

@bkmgit
Contributor

bkmgit commented Jul 21, 2020

@zoq @shrit The tests are nice. Maybe some could be documented as examples? Would some Jupyter examples use C++ through Xeus?

@kartikdutt18
Member

That said, I would really like to see more real-world examples, ideally in a ready-to-try form. I think the majority nowadays just searches for a ready-to-use solution; they don't care if the interface is elegant, simple to use, or fast, at least not at first. All they want is a ready-to-use solution where they can put their dataset in and be done. If that fails, they move along to the next promising solution.

Extending the excellent points from @shrit and @zoq, it would be really nice to have the following functionality as well:

  1. Ready-to-use models:
    Most of the time, I feel, people don't create a neural network architecture for the problem they are dealing with; they generally find the state-of-the-art model for that domain and tweak it for their problem. Here, having pre-trained models, whether for training on other datasets or for inference, will be useful. Since mlpack is written completely in C++, it can provide machine learning solutions for embedded systems. It would be really nice to see some projects in mlpack that deploy object detection / segmentation on an RPi, or something along similar lines.

  2. Converter from other frameworks to mlpack:
    We already have the onnx converter repo to convert models from ONNX to mlpack. It would be nice to support all layers and have an API so that we can convert any model to mlpack.

  3. Translating weights from other frameworks:
    This is something that I found while working on my GSoC project. Since we currently don't have support for training models on a GPU, and there are many models with pretrained weights in other frameworks, it would be easier if we could simply define the model and architecture and have all trained parameters from, say, PyTorch transferred to the mlpack model.

  4. CLI for models and simplified usage:
    For production use, a CLI for all models would be really useful. There is often a lot of similarity in the training code users have to write for various models. I feel we should also provide support where mlpack acts as a black box: simply pass the dataset name / type into the dataloader and specify the model name, and the model will be trained with the specified parameters. A user could simply download mlpack, write a command like
    mlpack_object_classification dataset_name model_name training_params
    and, using only the CLI, obtain a trained model.

@zoq
Member

zoq commented Jul 21, 2020

@zoq @shrit The tests are nice. Maybe some could be documented as examples? Would some Jupyter examples use C++ through Xeus?

Yes, we already started to implement some notebooks for the supported languages (C++, Julia, Go, Python), here is an example: https://github.com/mlpack/examples/tree/master/forest_covertype_prediction_with_random_forests

Also, you are right: the C++ kernel is xeus-cling.

@bkmgit
Contributor

bkmgit commented Jul 22, 2020

@zoq Thanks. Would any of these be helpful to have in the benchmarking setup? Would FANN and Dlib be the most appropriate libraries to compare against?

@rcurtin
Member Author

rcurtin commented Jul 22, 2020

Wow, this is an awesome outpouring of ideas. It's really cool to see all the different perspectives here, and I think that can help us make a coherent set of goals that we are all aimed at.

If I really pare down what I am seeing in these ideas to a few simple, short bullet points that encapsulate everything, here's what I see:

  • mlpack should focus on ease-of-use for data scientists who may not know much C++, including interoperability. (@abernauer 1/2, @shrit 1/2/4?/5?, @bkmgit 3/4/5, @zoq, @kartikdutt18 1/2/3/4)

  • mlpack should focus on reducing resource usage: that means fast implementations and low overhead. (@shrit 3/5/6/7?, @bkmgit 1/2, @kartikdutt18 1)

I know I kind of used a seemingly random numbering scheme there, but I tried to match each point each person made to one of those two goals. I know also that I reduced a lot of nuance in what people were saying so that I could put all the ideas in a couple "buckets"---hope that I did an okay job of it. :)

The thoughts everyone posted align fairly closely with the thoughts I've been having over the previous months. The question I always want to be able to answer as well as possible is "why should I use mlpack?"

And... I think that the efforts we are focusing on here answer this question well:

  • If we can replicate the ease-of-use of the typical Python data science workflow via @zoq's awesome xeus-cling notebooks and the examples and models we already have ready thanks to @kartikdutt18 and others, then there isn't that much difference anymore between doing data science in Python and doing data science in C++. Of course, the Python ecosystem is way more mature for data science---but, I think we are very close to being able to demonstrate something that's definitely usable, even if we're not at feature-parity.

  • For those users who don't want to leave their preferred ecosystem, we can provide bindings to their language anyway thanks to the work of folks like @Yashwants19 and others.

  • A "killer feature" mlpack has over the Python ecosystem is easy deployment: instead of needing to, e.g., set up a container with all the right Python dependencies and never upgrade anything (I used to do that a lot at Symantec!), you can literally just compile your model into something deployable. Thanks to @shrit's work of stripping dependencies and reducing the size of compiled programs, these are (hopefully) readily applicable to resource-constrained devices. Compare this with, e.g., TensorFlow's XLA system, for which a user has to understand a lot to deploy something. By working natively in C++, there's no need for such tools---we can lean on the tooling that the C++ community (and the C community and the Linux system community) has built over the past 30-50 years.

I have a book on my desk called "High Performance Python". Amusingly, the main takeaway of this 345-page book is, "if you want code to be fast, you'll have to write C or C++ and wrap it through Cython or similar---or use Python packages where someone already did that". So, why not just write your data science code in C++ and get that performance without having to fight for it?

I wonder, what do people think of this statement?

"mlpack is a library dedicated to demonstrating the advantages and simplicity of a native-C++ data science workflow."

Anyway, please, if you have more thoughts feel free to add them! :)

@conradsnicta
Contributor

conradsnicta commented Jul 23, 2020

@rcurtin I would recommend considerable simplification of the user-facing API. Specifically, removal of template-based parameters from all user-facing APIs, even if that means potential minor sacrifices in speed.

Templates are "scary", especially for people coming from "simpler" languages like Python. Even for many people reasonably well versed with C++, use of templates beyond stuff like std::vector can be annoying (which is part of the reason for the abomination known as the auto keyword).

Internal use of templates is fine, and can be used to reduce internal code repetition. In general I don't think it's necessary to expose template parameters to the user. Instead, it's possible to provide a nice external interface which then internally uses templates for speed.

As an example, in the gmm_diag class in Armadillo, the function .learn() has a straightforward dist_mode argument, as part of its "normal" set of arguments. Internally, the function then calls templated versions of several internal functions, depending on the value of dist_mode. This is done for speed -- templates provide a way for specialisation (direct or indirect) and hence templated functions are amenable to extra optimisation by the compiler.

The approach followed in the gmm_diag class provides a "simple" external API at the expense of a more complex internal API, so the trade-off is towards user friendliness.

There are also possibly slightly longer compilation times, as two sets of templated functions are instantiated. However, the latter point can also be an advantage: a class which explicitly instantiates everything internally can be pre-compiled into a run-time library, meaning that the instantiations only need to be done once, during the compilation of the entire library. During normal use of the library (i.e. a user writing code which uses the library) there would be no template instantiations, thereby speeding up compilation. So in that sense user friendliness is increased, as users don't have to wait as long.
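To make the pattern concrete, here is a minimal sketch (the names Learner, DistMode, and LearnImpl are hypothetical, not actual mlpack or Armadillo APIs):

#include <armadillo>

enum class DistMode { eucl, maha };  // hypothetical runtime option

class Learner
{
 public:
  // "Simple" external API: the mode is a plain runtime argument...
  double Learn(const arma::mat& data, const DistMode mode)
  {
    // ...which internally dispatches to templated implementations, so the
    // compiler can still specialise and optimise each variant separately.
    if (mode == DistMode::eucl)
      return LearnImpl<DistMode::eucl>(data);
    else
      return LearnImpl<DistMode::maha>(data);
  }

 private:
  template<DistMode Mode>
  double LearnImpl(const arma::mat& data)
  {
    // Stand-in body; Mode is a compile-time constant inside this function.
    return (Mode == DistMode::eucl) ? arma::accu(data)
                                    : arma::accu(arma::square(data));
  }
};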

@bkmgit
Contributor

bkmgit commented Jul 24, 2020

One other area that may be of interest is reproducibility. Libraries that have minimal dependencies may not be as fast, but they are portable and allow for easier verification of results.

@rcurtin
Member Author

rcurtin commented Jul 24, 2020

@conradsnicta the powerful thing that we get from user-exposed templates is "policy classes". For instance, I can write a custom kernel class called, e.g., MySpecialKernel and then create an object of type KernelPCA<MySpecialKernel>, and the compiler can make optimizations it would otherwise not be able to. There are particular machine learning techniques for which that can be powerful.
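As a rough sketch of what that looks like (MySpecialKernel is hypothetical, and the exact header path and signatures here may differ from the current mlpack sources):

#include <cmath>
#include <armadillo>
#include <mlpack/methods/kernel_pca/kernel_pca.hpp>

// A custom kernel "policy class": anything that provides Evaluate() works.
class MySpecialKernel
{
 public:
  double Evaluate(const arma::vec& a, const arma::vec& b) const
  {
    // Made-up kernel; the compiler can inline this into KernelPCA's inner loops.
    return std::exp(-arma::norm(a - b, 1));
  }
};

int main()
{
  arma::mat data(10, 100, arma::fill::randu);
  mlpack::kpca::KernelPCA<MySpecialKernel> kpca;
  kpca.Apply(data, 5);  // reduce to 5 dimensions
}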

That said---and our userbase's habits demonstrate this---people mostly want the vanilla implementation. Very few people seem to want to write a custom kernel or custom distance metric or anything like that.

So definitely you are right that there are places where we could simplify things to provide clearer interfaces and documentation (in fact I think our documentation could, in general, use an overhaul!). One idea might even be to use our automatic binding generator... to provide bindings back to C++, which would then match the simple documentation for other languages like this: https://www.mlpack.org/doc/stable/python_documentation.html

Does anyone have additional thoughts to add to the discussion? If I went and wrote a brief outline of a "design document" or "direction document" that collected our goals in roughly the manner I described above, would it be useful? Is there any perspective that's been missed?

@bkmgit
Contributor

bkmgit commented Jul 25, 2020 via email

@robertohueso
Member

About documentation and examples improvements: there's a program very similar to GSoC called Google Season of Docs. Maybe we can apply for that in the coming years :)

@manish7294
Member

Hey guys, just a thought: I think if mlpack could offer some amount of model interpretability, that would be really cool and helpful too.

@conradsnicta
Contributor

One idea might even be to use our automatic binding generator... to provide bindings back to C++, which would then match the simple documentation for other languages like this: https://www.mlpack.org/doc/stable/python_documentation.html

Sounds good :) I think in many use cases a simplified API would be useful.

@zoq
Member

zoq commented Jul 27, 2020

One idea might even be to use our automatic binding generator... to provide bindings back to C++, which would then match the simple documentation for other languages like this: https://www.mlpack.org/doc/stable/python_documentation.html

Sounds good :) I think in many use cases a simplified API would be useful.

I’m not sure what that would mean; let’s take PCA as an example, which basically boils down to the following interface:

template<typename DecompositionPolicy = ExactSVDPolicy>
class PCAType
{
  PCAType(const bool scaleData = false,
          const DecompositionPolicy& decomposition = DecompositionPolicy());

  double Apply(arma::mat& data, const size_t newDimension);
};

I can see that the DecompositionPolicy template parameter can be scary for someone not familiar with templates. So let’s say I provide an alias:

using PCA = PCAType<ExactSVDPolicy>;

That way a user could directly use PCA and not PCA<>; I'm not sure that is what is meant by a simplified API. Besides, I think an important feature that some of the methods currently lack is support for different matrix types (arma::Mat<float>, arma::sp_mat, etc.). Some classes do support different matrix types via a class template parameter, and sure, we could go with the same solution as used in ensmallen and use Any, but I think that would introduce some complexity into the code as well.
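For illustration, a matrix-type template parameter hidden behind aliases might look like this (a sketch only, not the current mlpack code):

#include <armadillo>

// One template parameter selects the matrix type; the common case is
// hidden behind an alias, so most users never see a template at all.
template<typename MatType = arma::mat>
class PCAType
{
 public:
  double Apply(MatType& data, const size_t newDimension)
  {
    // Stand-in body for the sketch; the real method would decompose the data.
    data = data.rows(0, newDimension - 1);
    return 0.0;
  }
};

using PCA = PCAType<arma::mat>;        // dense, 64-bit elements (the default)
using FloatPCA = PCAType<arma::fmat>;  // dense, 32-bit elements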

@conradsnicta
Contributor

using PCA = PCAType<ExactSVDPolicy>;

That way a user could directly use PCA and not PCA<>; I'm not sure that is what is meant by a simplified API.

Yeah, this is definitely barking up the right tree. I was thinking that entirely new functions and classes could be created (offering a simplified API), but the approach of aliasing the most common usage patterns (via using) can also work. It would certainly reduce the need for much refactoring.

@SuryodayBasak

SuryodayBasak commented Jul 28, 2020 via email

@brightprogrammer

I don't know if I'm allowed to be a part of this conversation or not... Adding GPU support to mlpack might do wonders! (I don't think it has that now.) I thought about this, and maybe we just need Armadillo to use the GPU!

@rcurtin
Member Author

rcurtin commented Aug 4, 2020

@brightprogrammer yeah, definitely agreed; the Bandicoot project is in progress for this exact goal: https://gitlab.com/conradsnicta/bandicoot-code/

I think I will try and write up a draft summary of this discussion that we can use as a "guide" or roadmap for future development, and we can workshop it until we have something that we're reasonably happy with.

Many of the requests here revolve around the same two general ideas: (a) ease-of-use and (b) efficiency, and I think we can see everything through that lens. I didn't hear any thoughts against that viewpoint, so I'll go ahead and go with it. 😄

Next question: what is the most effective format for this? A long time ago I wrote "The Future of MLPACK" as a typeset LaTeX document, primarily out of a fascination with LaTeX, but I don't know if that is really the best way to do it. I imagine a directional roadmap or design document like this for mlpack is most helpful for new contributors (or existing contributors) who might be asking "what should I do next that has the most impact?"

I can think of...

  • another typeset LaTeX document that we link to from somewhere (I like writing these 😄), and maybe publish to arXiv or somewhere similar to try to reach an audience
  • some text for, e.g., a page like https://www.mlpack.org/about.html
  • a "sticky issue" in the repository?
  • something else?

Anyway, let me know what you think! This document will be aimed at a relatively high level, with high-level goals. We could also include some lower-level code recommendations, such as API simplification, dependency reduction, and so forth. This can, of course, be a living document that we revisit and update periodically as our goals and needs change (and the world around us changes too). 👍

@brightprogrammer

Old is gold!

@brightprogrammer

brightprogrammer commented Aug 5, 2020

@rcurtin take a look at these:
https://icl.utk.edu/magma/software/index.html (alternative to LAPACK)

http://icl.cs.utk.edu/plasma/software/ (alternative to BLAS)

https://icl.utk.edu/slate/ (alternative to ScaLAPACK)

The above three libraries are highly optimized for linear algebra operations on CPU+GPU! They are supported by AMD, Intel & NVIDIA.

One-sided matrix factorizations: [attached image of MAGMA's routine table]

@conradsnicta
Contributor

conradsnicta commented Aug 5, 2020

@brightprogrammer A major problem with GPUs is that most "consumer" grade GPUs (read: designed for games) are only suitable for computing with 32-bit floats (and perhaps 16-bit floats). mlpack currently defaults to 64-bit floats in general (@rcurtin @zoq - please correct me if I'm wrong). For 64-bit floats, most consumer GPUs are either worse than or on par with CPUs.

The major exception is of course so-called "data center" GPUs, which cost major $$$. In other words, not easily accessible by the vast majority of people, which in turn means that it's a very narrow niche area. Both Nvidia and AMD are not going to give up milking this particular cow by reducing prices or boosting their gamer GPUs. A possible wildcard is Intel getting into the GPU space.

The GPU computing space is essentially divided into two camps: (i) specific case of neural networks (deep learning et al), and (ii) general linear algebra. For case (i), "gamer" GPUs are a good fit. For case (ii), you need major $$$ to play in this area. I suspect that case (i) probably makes up 90%+ of current GPU computing.

It's of course entirely possible to use 32 bit floats for general linear algebra, but the associated massive reduction in the precision and range of the floating point values works against that. That's why we have 64 bit floats.

@bkmgit
Contributor

bkmgit commented Aug 5, 2020

There are deep learning applications that can use precision less than 64 bit floats, including 32 bit floats and modified 16 bit floats. These may also be important in the embedded and mobile computing space. Some care is needed in choosing these.

@brightprogrammer

@conradsnicta that information is new to me... Please take a look at this if it helps... MAGMA provides the following precisions: [attached image of MAGMA's precision table]

I think float is 32-bit and double is 64-bit precision here (please correct me if I am wrong 😅 as I haven't used LAPACK and BLAS before... I just want to help).

@bkmgit
Contributor

bkmgit commented Aug 5, 2020

@brightprogrammer
Useful input. Possibly of interest:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
The reduced-precision format seems vendor-specific, and it is unclear how much of this we would want to expose to the typical user. Some care is needed in the design to minimize dependencies while keeping reasonable speed and good developer productivity.

@conradsnicta
Contributor

@brightprogrammer For clarification:

Armadillo matrix type mat (and dmat) uses 64-bit floating point elements, while matrix type fmat uses 32-bit floating point elements. More info: http://arma.sourceforge.net/docs.html#Mat and https://en.wikipedia.org/wiki/IEEE_754

For 16-bit floats there are actually two versions, "FP16" and "BF16", which are not compatible. There is no corresponding 16-bit floating point matrix in Armadillo, as C++ does not have a corresponding native element type. Standard versions of BLAS and LAPACK also do not support 16-bit floating point.

FP16 is generally not advised for linear algebra. BF16 is an adapted version of FP32, targeted towards use in neural network applications.

@brightprogrammer

brightprogrammer commented Aug 5, 2020

@conradsnicta

(Sorry for disturbing if this isn't helpful😅🙏)

http://icl.cs.utk.edu/projectsfiles/magma/doxygen/routines.html

Please take a look at this page. It mentions some functions that provide 64-bit precision operations, e.g. magma_dgemm for double-precision general matrix multiplication... Is precision compatibility still a problem? 😅

@shawnbrar
Contributor

(I am a very new user of mlpack so I might be inaccurate about something)

Wow, this is an awesome outpouring of ideas. It's really cool to see all the different perspectives here, and I think that can help us make a coherent set of goals that we are all aimed at.
...
I wonder, what do people think of this statement?

"mlpack is a library dedicated to demonstrating the advantages and simplicity of a native-C++ data science workflow."

@rcurtin, one way you could increase the ease of use of mlpack is by adding a feature to easily subset a matrix. I come from an R programming background, and in R, getting a subset of a matrix is very simple. For example: you have a matrix data which has 10 rows and 10 columns, and you want rows 1 to 5 and columns 1 to 5. Currently, using Armadillo, you would have to write the following:
data(arma::span(0, 4), arma::span(0, 4))
or one of the ways listed here.

But in R I would just have to write:
data[1:5, 1:5]

Basically, 1:5 in R is similar to arma::span(0, 4), as both of them create a vector of 5 elements.

Also, in R you can put a condition in the row or column field and get back a subsetted matrix. For example, the following returns a matrix containing only those rows whose value in column one is greater than 5 (passing nothing in the column field means all columns are returned):

data[data[, 1] > 5, ]

I don't know if this can be done with Armadillo. Even if it can be, it would be very difficult for a person who is new to C++.
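(For comparison, here is a sketch of how this particular filter might be expressed in Armadillo, using find() and a non-contiguous submatrix view; even then, it's hardly obvious to a newcomer:)

#include <armadillo>

int main()
{
  arma::mat data(10, 10, arma::fill::randu);
  data *= 10;  // so that some values in column one exceed 5

  // R: data[data[, 1] > 5, ]
  // Keep every row whose first-column value is greater than 5.
  arma::mat subset = data.rows(arma::find(data.col(0) > 5));

  subset.print("rows with column-one values > 5:");
  return 0;
}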

The point that I am trying to make is that a lot of a data analyst's or data scientist's time goes into data manipulation, not actually into training models, since the ML algorithms are already provided by your library. All they have to do is put the data in, write some lines of C++ code, and voila, the model is trained. I am not saying that doing this is very easy, but with good documentation it is possible.

What takes a lot of time is getting the data from the raw format in which it is available into a format which is suitable for an ML algorithm.

Again, for example: I had a dataset in which some rows had null values. In R, I would just have to do the following:
na.omit(data)
This function removes all the rows which have NA, NaN, or NULL values. But to do the same with Armadillo, I would first have to create a vector of the rows which have 0 values (since null was changed to 0 on load), and then use the .shed_rows function and pass the vector as an argument.
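(Spelled out, that workaround looks something like the following sketch, assuming missing values were loaded as 0 and a reasonably recent Armadillo that supports shed_rows(uvec):)

#include <armadillo>

int main()
{
  arma::mat data(10, 4, arma::fill::randu);
  data(3, 2) = 0.0;  // pretend these missing values were loaded as 0
  data(7, 0) = 0.0;

  // Armadillo is column-major, so a linear index i maps to row i % n_rows.
  arma::uvec lin = arma::find(data == 0);
  arma::uvec badRows = arma::unique(lin - (lin / data.n_rows) * data.n_rows);

  data.shed_rows(badRows);  // rough equivalent of R's na.omit(data)
  data.print("rows without zeros:");
  return 0;
}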

So, in conclusion, if we can somehow make this easier, it will increase the ease of use of this library and also shorten the development cycle of projects which use it.

@abernauer

@shawnbrar There is an mlpack R package in development, very close to complete, that uses Rcpp. I have a lot of experience in R programming from coursework during college. A lot of the stuff you can do in R is going to take a lot more code to do with C++, given the language constructs. Also, the mlpack R package has a lot of helper functions for converting things to work with R objects. In general it will be a lot easier to interact with mlpack algorithms in R using R's own data structures. You can fork the bindings if you want to test them out and experiment with them.

@shawnbrar
Contributor

@abernauer, that is great. But I hope you get my point: if you want ease of use for the library, you will have to make matrix subsetting a little easier, or at least write some good tutorials which show how one can do this.

@abernauer

@shawnbrar Yeah, I get your point. There is a large gallery of tutorials available here, relevant to Armadillo and specifically directed toward R users, and also an article that covers the syntax for subsetting a matrix in Armadillo.

@mlpack-bot

mlpack-bot bot commented Sep 13, 2020

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Sep 13, 2020
@kartikdutt18
Member

Keep Open.

@zoq zoq added s: keep open and removed s: stale labels Sep 13, 2020
@zsogitbe

Wow, this is an awesome outpouring of ideas. It's really cool to see all the different perspectives here, and I think that can help us make a coherent set of goals that we are all aimed at.
...
I have a book on my desk called "High Performance Python". Amusingly, the main takeaway of this 345-page book is, "if you want code to be fast, you'll have to write C or C++ and wrap it through Cython or similar---or use Python packages where someone already did that". So, why not just write your data science code in C++ and get that performance without having to fight for it?

I wonder, what do people think of this statement?

"mlpack is a library dedicated to demonstrating the advantages and simplicity of a native-C++ data science workflow."

I just came across this from Ryan from a few months ago. I could not agree more! The advantage of mlpack is that it uses C++, and not the chaos of several programming languages combined with scripting languages! I am sorry, but when I see "data[data[, 1] > 5, ]" I must quickly scroll away. I do not mind if people keep themselves busy building bridges between scripting languages and C++ and then running their code through a few interpreters... but if you really need to do that, then you could just go for another package. Do not get distracted by this scripting trend and do not waste time on it! Try to make the C++ code as efficient as possible, build in parallel CPU processing as much as possible, and try to increase the quality of the code by testing it with real-world applications and correcting the issues. It is true that using Armadillo is not easy, and debugging it is even more difficult, so try to build some kind of abstraction around it with simple C++, or why not help make Armadillo itself easier to use?

This is not simple:
m.subcube(arma::span(), arma::span(i), arma::span()) = dataset.submat(arma::span(), arma::span(i + 2, i + rho + 1));

Debugging mlpack and Armadillo is very difficult, and I think improving that would be one of the most welcome changes to mlpack. Try to provide some kind of helper functions or systems to make debugging much easier. For example, NATVIS in Visual Studio could help a lot: developing a full NATVIS (custom views for native objects) template for VS, for both Armadillo and mlpack, would be one of the most important additions.

@rcurtin
Member Author

rcurtin commented Mar 18, 2021

Hey everyone, I know it has been quite a while. I took the suggestions from our discussion here and compiled them into a document:

https://www.ratml.org/misc/mlpack-vision.pdf

The idea here is that this is a document roughly in the mold of the last design document (https://www.ratml.org/misc/mlpack_future.pdf). So, new contributors can read this to quickly come up to speed on where we came from and (roughly) where we're headed. But, I avoided detailed goals, since those will quickly go out of date (and did in the last design document). My thinking is that we can link to this document in places that new contributors might find (e.g. https://www.mlpack.org/community.html), and then perhaps use the goals written there to be a bit more explicit in issues and PRs and milestones in the repository.

Let me know what you think! If we're all reasonably happy with this, I'll move forward with figuring out how to deploy it to the website and link it up with our issues and goals. I'm also happy to change things around---primarily I was just doing my best to construct a narrative around all the things we wrote here. :)

@zoq zoq pinned this issue Mar 18, 2021
@zsogitbe

Just some quick feedback (personal opinion):

  • I would define mlpack as a fast and easy-to-use C++ machine learning library, and this is what I would try to achieve and improve.
  • Prototyping should not be the focus, for example because you cannot compete with Python-based prototyping tools.
  • You want to help deploy prototypes? Simply add the possibility to import prototype models made in Python.
  • GPU support? Definitely!
  • Adaptable examples and documentation? Yes!
  • Improved compilation time and memory usage? Not very important, because high-end PCs are available. But simplification of the code is important. Simple design patterns? Yes!
  • Better support for cross-compilation and lightweight deployment? In my case I need speed, and thus hardware-accelerated BLAS/LAPACK with huge support DLLs, so for me this is not very important; but I can imagine that others would like to deploy lightweight ML models on smaller devices. So why not!
  • Utilities for non-numeric data? Yes!
  • Efficient implementations (OpenMP, SIMD)? Yes!
  • Extra: easy debug support? Yes!

@zoq
Member

zoq commented Mar 26, 2021

  • Prototyping should not be the focus, for example because you cannot compete with Python-based prototyping tools.
  • You want to help deploy prototypes? Simply add the possibility to import prototype models made in Python.

I guess, in a sense, this is already possible, at least partially, through the Python bindings, at least for the methods that are available as executables. But I see what you mean; it would be interesting to show what such a pipeline could look like, in the form of a notebook or tutorial, with what we currently have.

@heisenbuug
Contributor

Hello everyone.

First of all, thanks to @rcurtin for putting together the document, and to everyone who worked towards those goals. It is also clear that everyone is working towards making these improvements possible.

We have made progress in many directions since the start of this issue. Removing dependencies is still in progress.

My actual idea is to build a data-frame-like class for mlpack; here are the links to the issue and PR. A lot has changed and been worked on since then. I am currently working on removing boost::spirit by implementing ways to load categorical data, leading to a pandas-like data frame in the future.

Successful implementation of this idea might take some time. We are planning to improve the examples and set them up in a notebook environment, which will make them much easier to handle. Having a data frame for C++ in a notebook setting can ease the process of working with data.

Progress
You can look through this issue to get an update: issue

Let me know what everyone thinks.

@rcurtin
Member Author

rcurtin commented Apr 21, 2021

This has sat for about a month now. I agree with the responses I've gotten here, and I think they fit into the vision outlined in the document. In the next few days I'll move forward with figuring out how to make this document available via the mlpack website and to new contributors.

@conradsnicta
Contributor

@rcurtin With the aim of making mlpack more approachable, I suggest turning off building the tests by default.

The tests take quite a long time and a lot of memory to build; anecdotally, they seem to account for the majority of the build time. The tests are useful to developers of mlpack, but not to its users.

mlpack is now a huge beast, and installing it from source on a typical laptop (2 to 4 cores, ~8 GB RAM) is painfully slow and/or runs into memory issues. My own laptop froze several times (requiring a hard reset) when I made the mistake of building with -j 4.

(Incidentally, I've been told that the R version of mlpack takes several hours to install from source, because the R environment builds everything serially. This significantly hampers adoption.)

@rcurtin
Member Author

rcurtin commented Apr 24, 2021

@conradsnicta nice point; I opened #2926 to act on the suggestion. You're right that the situation with the R bindings is pretty bad; I'm hoping that removing Boost and simplifying our other dependencies will help reduce the compilation pain.

@conradsnicta
Contributor

conradsnicta commented Apr 27, 2021

@rcurtin With an eye towards reducing compilation time, I'd suggest a possible change to the user-facing API (perhaps for mlpack 4.0). Currently mlpack is heavily templated, which forces the compiler to do a lot of reasoning at compilation time: the more template parameters, the slower the compilation.

So the suggested change is to move some of the template parameters to be "plain" arguments to functions, where it makes sense and where it's possible. For example, in situations where the vast majority of runtime is taken by matrix multiplications or matrix decompositions (e.g. SVD), the use of template arguments doesn't give much (if any) runtime speedup, and comes at the expense of slower compilation.

It would be useful to see the actual speed differences between "plain" arguments and template arguments, so that changes would be informed by empirical data. The first step could be an audit to see where the majority of runtime is spent, and then work backwards from there. The audit could be done either via explicit speed evaluation, or via an approximate approach where we note the use of "heavy" matrix operations (e.g. decompositions).
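As a concrete illustration of the kind of change I mean (all names here are hypothetical, not the actual mlpack API):

#include <armadillo>

// "After" version: the decomposition is a plain runtime argument instead of
// a template parameter, so this function body is compiled exactly once,
// no matter how many decomposition options exist.
enum class Decomposition { exactSVD, randomizedSVD };

double ApplyPCA(arma::mat& data, const arma::uword newDimension,
                const Decomposition method = Decomposition::exactSVD)
{
  arma::mat U, V;
  arma::vec s;

  if (method == Decomposition::exactSVD)
    arma::svd_econ(U, s, V, data);
  else
    arma::svd_econ(U, s, V, data, "left");  // stand-in for a faster variant

  // Project onto the top newDimension directions; the cost is dominated by
  // the decomposition above, so losing the template costs little at runtime.
  data = U.head_cols(newDimension).t() * data;
  return arma::accu(s.head(newDimension)) / arma::accu(s);
}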

@rcurtin
Member Author

rcurtin commented Apr 28, 2021

@conradsnicta you might be right, but I think the much bigger compilation time elephant in the room is the use of Boost. So I think once we remove that, we'll be in a much better place. There are some places like the hyperparameter tuner where complicated template metaprogramming (with recursion) is used, but most of the time it's just policy classes. At least for the policy classes, I would imagine that Armadillo's template metaprogramming used in all the linear algebra expressions works the compiler harder, yet Armadillo compiles pretty fast in my experience.

Anyway, if compilation times are still long even after removing Boost, I totally agree with your approach here. 👍

@rcurtin
Member Author

rcurtin commented Apr 28, 2021

I opened mlpack/mlpack.org#48 and #2935 to update the website and this repository to point to those documents. 👍

@rcurtin
Member Author

rcurtin commented May 1, 2021

Thanks everyone for the input on this over the past year. The vision document is updated and linked to from the website now, so I'll go ahead and close this issue.

Of course, we can have more discussions in the future! By closing this issue I just want to say "this discussion seems finished", not "all discussions are finished". :)
