"The future of mlpack", round two! #2524
So I have two ideas that come to mind immediately:
Number one has already been addressed to a degree with #2421. I think that's a good goal because it makes mlpack less foreign to people whose primary experience comes from working with machine learning in other languages, and at the same time it aligns well with the original goals of the project. From my own experience using the command line bindings, I found those pretty intuitive and easy to use. Goal one might be more relevant in terms of bindings to other languages, but I think it's a good idea to streamline those interfaces to be consistent with other software libraries, as that would appeal more to practitioners of machine learning while still maintaining mlpack's advantages over other machine learning libraries.

Number two is sort of addressed with the simple API, but I feel like making mlpack approachable to new users is a great way to grow the community. When I first got interested in mlpack last spring---this was after hearing about the library through a mentor at my university---my initial impression of the project was that it was a blazing-fast machine learning library with algorithms that others didn't offer. On the other hand, after looking at the code base and all the dependencies, the library looked a bit overwhelming to a newcomer. Given my interest in software development and learning more skills, I shrugged that concern off and ended up finding the community very helpful and welcoming. However, I don't think that is the norm for most people who are just getting interested in machine learning and software development.

Lastly, I think if we're able to lower the barrier to entry without sacrificing any of the advantages mlpack has, growing the popularity of the library will be a lot easier.
I do not know why, but I had an inner feeling this year that we were going to revisit this. Some of these ideas were born during my work on my GSoC project; for the rest, I am going to summarize each idea under a title and give the reason why it matters.
I started using mlpack at the end of 2018, when I was looking for a specific
This one is related to the examples repository; I like documentation in
Personally, I am lucky to have several good CPUs to train on, but this is not
I would like to see mlpack known by everyone in the machine learning domain; I
I think part of this issue has been addressed in #2440, and the issue will still
This idea was born with my GSoC project, and I will be happy to continue working on
I made this the last point because I do not think it is as important. Finally, these are my thoughts based on observations, testing, and use of mlpack. These are my development interests; I hope to see them become a reality in the coming few years.
GPU support may be helpful. Bandicoot may be helpful because it can be used by Armadillo and seems portable, though it looks like it is still under development. I am not sure how it will compare to other GPU libraries such as oneDNN, cuBLAS, RadeonML, Arm Compute, cuDNN, and Arm NN. Libraries similar to mlpack are FANN, Shogun, and Dlib. These are relatively lightweight and primarily use C++; FANN also has many language bindings, though their specific design and dependency choices are somewhat different. Android support would be helpful. ONNX and/or TVM support may be good to have. Finally, I would like to have command line input for neural networks; the related issue is #1254. Command line bindings do not seem to be available for many ML libraries, and they are very helpful for production workflows.
I completely agree with @shrit about the examples. Even if our tests are often a good place to start, I don't like pointing people to the tests when they search for how to use mlpack: the tests are really strict, they often focus on a single feature, and they don't show how you can use the bindings or CLI interface. That said, I would really like to see more real-world examples, ideally in a ready-to-try form. I think the majority of people nowadays just search for a ready-to-use solution; they don't care if the interface is elegant, simple to use, or fast, at least not in the first place. All they want is a ready-to-use solution where they can put their dataset in and be done; if that fails, they move along to the next promising option. So basically I want to point a user to a package they can install, or some service they can use to try mlpack, and also point people to an example that shows how they can use mlpack to solve a specific problem. I'm already partially working on a solution that uses Jupyter notebooks to provide a platform that doesn't require any setup, and I'm working on more examples.
Extending on the excellent points from @shrit and @zoq, it would be really nice to have the following functionalities as well:
Yes, we already started to implement some notebooks for the supported languages (C++, Julia, Go, Python); here is an example: https://github.com/mlpack/examples/tree/master/forest_covertype_prediction_with_random_forests Also, you are right, the C++ kernel is xeus-cling.
@zoq Thanks. Would any of these be helpful to have in the benchmarking setup? Would FANN and Dlib be the most appropriate libraries to compare against?
Wow, this is an awesome outpouring of ideas. It's really cool to see all the different perspectives here, and I think that can help us make a coherent set of goals that we are all aimed at. If I really pare down what I am seeing in these ideas to a few simple, short bullet points that encapsulate everything, here's what I see:
I know I kind of used a seemingly random numbering scheme there, but I tried to match each point each person made to one of those two goals. I know also that I reduced a lot of nuance in what people were saying so that I could put all the ideas in a couple "buckets"---hope that I did an okay job of it. :) The thoughts everyone posted align fairly closely with the thoughts I've been having over the previous months. The question I always want to be able to answer as well as possible is "why should I use mlpack?" And... I think that the efforts we are focusing on here answer this question well:
I have a book on my desk called "High Performance Python". Amusingly, the main takeaway of this 345-page book is, "if you want code to be fast, you'll have to write C or C++ and wrap it through Cython or similar---or use Python packages where someone already did that". So, why not just write your data science code in C++ and get that performance without having to fight for it? I wonder, what do people think of this statement? "mlpack is a library dedicated to demonstrating the advantages and simplicity of a native-C++ data science workflow." Anyway, please, if you have more thoughts feel free to add them! :)
@rcurtin I would recommend considerable simplification of the user-facing API. Templates are "scary", especially for people coming from "simpler" languages like Python. Even for many people reasonably well versed in C++, use of templates beyond stuff like std::vector can be annoying (which is part of the reason for the abomination known as the auto keyword). Internal use of templates is fine, and can be used to reduce internal code repetition. In general I don't think it's necessary to expose template parameters to the user. Instead, it's possible to provide a nice external interface which then internally uses templates for speed. One possible downside is slightly longer compilation times for the library itself, as two sets of templated functions are instantiated. However, this can also be an advantage: a class which explicitly instantiates everything internally can be pre-compiled into a run-time library, meaning that the instantiations only need to be done once, during the compilation of the entire library. During normal use of the library (ie. a user writing code which uses the library) there would be no template instantiations, thereby speeding up compilation. So in that sense user friendliness is increased, as users don't have to wait as long.
One other area that may be of interest is reproducibility. Libraries that have minimal dependencies may not be as fast, but they are portable and allow for easier verification of results.
@conradsnicta the powerful thing that we get from user-exposed templates is "policy classes". For instance, I can write my own custom kernel class and use it directly with any kernel-based method. That said---and our userbase's habits demonstrate this---people mostly want the vanilla implementation. Very few people seem to want to write a custom kernel or custom distance metric or anything like that. So you are definitely right that there are places where we could simplify things to provide clearer interfaces and documentation (in fact I think our documentation could, in general, use an overhaul!). One idea might even be to use our automatic binding generator to provide bindings back to C++, which would then match the simple documentation for other languages, like this: https://www.mlpack.org/doc/stable/python_documentation.html Does anyone have additional thoughts to add to the discussion? If I went and wrote a brief outline of a "design document" or "direction document" that collected our goals in roughly the manner I described above, would it be useful? Is there any perspective that's been missed?
Custom distance metrics are useful. Perhaps examples are needed for this use case. Perhaps examples can be split into a basic set and an advanced set.
About documentation and examples improvements: there's a program, very similar to GSoC, called Google Season of Docs. Maybe we can apply for that in the coming years :)
Hey guys, just a thought: I think if mlpack could have some amount of model interpretability, it would be really cool and helpful too.
Sounds good :) I think in many use cases a simplified API would be useful.
I'm not sure what that would mean; let's take PCA as an example:

```cpp
template<typename DecompositionPolicy = ExactSVDPolicy>
class PCA
{
 public:
  PCA(const bool scaleData = false,
      const DecompositionPolicy& decomposition = DecompositionPolicy());

  double Apply(arma::mat& data, const size_t newDimension);
};
```

So I can see that the decomposition policy is exposed as a template parameter. One option: `using PCA = PCAType<ExactSVDPolicy>;` -- that way a user could directly use `PCA` and not `PCA<>`; not sure if that is what is meant by "simplified API".
Yeah, this is definitely barking up the right tree. I was thinking that entirely new functions and classes could be created (which offer a simplified API), but the approach of aliasing the most common usage patterns (via `using`) can also work. It would certainly reduce the need for much refactoring.
Hi Ryan! I'm sorry I'm late to this party. Congratulations on keeping this going for so long and for being able to grow the community.

I think that one of the things that resonates in my mind is having better examples. I'd like to extend that -- it might be a good idea to invest some time in having a set of tutorials that are better organized than what we currently have. It seems to me that the popularity of most libraries today hinges on how well the tutorials are made, because 'data scientists' are interested in copy-pasting blocks of code and reusing them. This is less of a technical suggestion and more towards something that could spread the word of mlpack :)
I don't know if I'm allowed to be a part of this conversation or not... Adding GPU support to mlpack might do wonders! (I don't think it has that now.) I thought about this, and maybe we just need Armadillo to use the GPU!
@brightprogrammer yeah, definitely agreed; the Bandicoot project is in progress for this exact goal: https://gitlab.com/conradsnicta/bandicoot-code/ I think I will try to write up a draft summary of this discussion that we can use as a "guide" or roadmap for future development, and we can workshop it until we have something that we're reasonably happy with. Many of the requests here revolve around the same two general ideas: (a) ease of use and (b) efficiency, and I think we can see everything through that lens. I didn't hear any thoughts against that viewpoint, so I'll go ahead and go with it. 😄 Next question: what is the most effective format for this? A long time ago I wrote "The Future of MLPACK" as a typeset LaTeX document, primarily out of a fascination with LaTeX, but I don't know if that is really the best way to do it. I imagine a directional roadmap or design document like this is most helpful for new contributors (or existing contributors) who might be asking "what should I do next that has the most impact?" I can think of...
Anyway, let me know what you think! This document will be aimed at a relatively high level, with high-level goals, though we could also include some lower-level code recommendations, such as API simplification, dependency reduction, and so forth. This can, of course, be a living document that we revisit and update periodically as our goals and needs change (and the world around us changes too). 👍
Old is gold!
@rcurtin take a look at these: http://icl.cs.utk.edu/plasma/software/ (an alternative to BLAS) and https://icl.utk.edu/slate/ (an alternative to ScaLAPACK). These libraries are highly optimized for linear algebra operations on CPU+GPU, and they are supported by AMD, Intel, and NVIDIA.
@brightprogrammer A major problem with GPUs is that most "consumer" grade GPUs (read: designed for games) are only suitable for computing with 32 bit floats (and perhaps 16 bit floats). mlpack in general currently defaults to 64 bit floats (@rcurtin @zoq - please correct me if I'm wrong). For 64 bit floats, most consumer GPUs are either worse than or on par with CPUs. The major exception is of course so-called "data center" GPUs, which cost major $$$. In other words, they are not easily accessible to the vast majority of people, which in turn means this is a very narrow niche area. Neither Nvidia nor AMD is going to give up milking this particular cow by reducing prices or boosting their gamer GPUs; a possible wildcard is Intel getting into the GPU space. The GPU computing space is essentially divided into two camps: (i) the specific case of neural networks (deep learning et al.), and (ii) general linear algebra. For case (i), "gamer" GPUs are a good fit. For case (ii), you need major $$$ to play in this area. I suspect that case (i) probably makes up 90%+ of current GPU computing. It's of course entirely possible to use 32 bit floats for general linear algebra, but the associated massive reduction in the precision and range of the floating point values works against that; that's why we have 64 bit floats.
There are deep learning applications that can use precision lower than 64 bit floats, including 32 bit floats and modified 16 bit floats. These may also be important in the embedded and mobile computing space. Some care is needed in choosing these.
@conradsnicta that information is new to me... Please take a look at this if it helps... MAGMA provides the following precisions: I think
@brightprogrammer
@brightprogrammer For clarification:
More info: https://en.wikipedia.org/wiki/IEEE_754 For 16 bit float, there are actually two versions, "FP16" and "BF16", which are not compatible:
There is no corresponding 16 bit floating point matrix in Armadillo, as C++ does not have a corresponding native element type. Standard versions of BLAS and LAPACK also do not support 16 bit floating point. FP16 is generally not advised for linear algebra. BF16 is an adapted version of FP32, targeted towards use in neural network applications.
(Sorry for disturbing if this isn't helpful 😅🙏) http://icl.cs.utk.edu/projectsfiles/magma/doxygen/routines.html Please take a look at this page. It mentions some functions that provide 64 bit precision operations, e.g.
(I am a very new user of mlpack so I might be inaccurate about something)
@rcurtin, one way you could increase the ease of use of mlpack is by adding a feature to easily subset a matrix. I come from an R programming background, and in R getting a subset of a matrix is very simple: for example, to keep the rows whose first column is greater than 5, I just have to write `data[data[, 1] > 5, ]`. Basically, in R you can type a condition in the row and column fields and get back a subsetted matrix.
I don't know if this can be done with Armadillo. Even if it can be, it would be very difficult for a person who is new to C++. The point that I am trying to make is that a lot of the time of data analysts and data scientists goes into data manipulation, not actually into training, since they already have the ML algorithms provided by your library. All they have to do is put in the data, write some lines of C++ code, and voila, the model is trained. I am not saying that doing this is very easy, but with good documentation it is possible. It does take a lot of time, though, to get the data from the raw format in which it is available into a format which is suitable for an ML algorithm. So, in conclusion, if somehow we can make this happen, it will increase the ease of use of this library and also shorten the development cycle of the projects which use it.
@shawnbrar There is an mlpack R package, very close to complete, in development that uses Rcpp. I have a lot of experience in R programming from coursework during college, and a lot of stuff you can do in R is going to take a lot more code to do in C++, given the language constructs. The mlpack R package also has a lot of helper functions for converting things to work with R objects. In general it will be a lot easier to interact with mlpack algorithms in R using the data structures within R. You can fork the bindings if you want to test them out and experiment with them.
@abernauer, that is great. But I hope that you get my point.
@shawnbrar Yeah, I get your point. There is a large gallery of tutorials available here relevant to Armadillo and specifically directed toward R users, as well as an article that features the syntax for sub-setting a matrix in Armadillo.
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
Keep open.
I just came across this comment from Ryan from a few months ago, and I could not agree more! The advantage of mlpack is that it uses C++, and not a chaotic combination of several programming and scripting languages. I am sorry, but when I see something like "data[data[, 1] > 5, ]" I must quickly scroll away. I do not mind if people keep themselves busy building bridges between scripting languages and C++ and then running their code through a few interpreters, but if you really need to do that, you could just go for another package. Do not get distracted by this scripting trend, and do not waste time on it! Try to make the C++ code as efficient as possible, build in parallel CPU processing as much as possible, and try to increase the quality of the code by testing it with real-world applications and correcting the issues. It is true that using Armadillo is not easy, and debugging it is even more difficult, so try to build some kind of abstraction around it with simple C++ -- or why not help make Armadillo easier to use? This is not simple: debugging mlpack and Armadillo is very difficult, and I think easier debugging would be one of the most welcome improvements to mlpack. Try to provide some kind of helper functions or systems to make debugging much easier. For example, NATVIS (custom views for native objects) in Visual Studio could help a lot; developing a full NATVIS template for both Armadillo and mlpack would be one of the most important additions to mlpack.
Hey everyone, I know it has been quite a while. I took the suggestions from our discussion here and compiled them into a document: https://www.ratml.org/misc/mlpack-vision.pdf The idea here is that this is a document roughly in the mold of the last design document (https://www.ratml.org/misc/mlpack_future.pdf). So, new contributors can read this to quickly come up to speed on where we came from and (roughly) where we're headed. But, I avoided detailed goals, since those will quickly go out of date (and did in the last design document). My thinking is that we can link to this document in places that new contributors might find (e.g. https://www.mlpack.org/community.html), and then perhaps use the goals written there to be a bit more explicit in issues and PRs and milestones in the repository. Let me know what you think! If we're all reasonably happy with this, I'll move forward with figuring out how to deploy it to the website and link it up with our issues and goals. I'm also happy to change things around---primarily I was just doing my best to construct a narrative around all the things we wrote here. :)
Just some quick feedback (personal opinion):
I guess, in a sense, this is already possible, at least partially, through the Python bindings, at least for the methods that are available as executables. But I see what you mean; it would be interesting to show what such a pipeline could look like, in the form of a notebook or tutorial, with what we currently have.
Hello everyone. First of all, thanks to @rcurtin for putting together the document, and to everyone who worked towards those goals. We have made progress in many directions since the start of this issue; removing dependencies is still in progress. My actual idea is to build a data-frame-like class for mlpack; here is the link to the issue and PR. A successful implementation of this idea might take some time. We are planning to improve the examples and set them in a notebook environment, which will make them much easier to handle, and having a data frame for C++ in a notebook setting can ease the process of working with the data. Let me know what everyone thinks.
This has sat for about a month now---I agree with the responses I've gotten here and I think they fit into the vision outlined in the document. In the next days I'll move forward figuring out how to make this document available via the mlpack website and for new contributors.
@rcurtin With the aim of making mlpack more approachable, I suggest turning off building the tests by default. The tests take quite a long time and a lot of memory to build; anecdotally, they actually seem to take the majority of the build time. The tests are useful to developers of mlpack, but they are not useful to its users. mlpack is now a huge beast, and installing it from source on typical laptops (2 to 4 cores, ~8 GB RAM) is painfully slow and/or runs into memory issues. My own laptop froze several times (requiring a hard reset) when I made the mistake of building with -j 4. (Incidentally, I've been told that the R version of mlpack takes several hours to install from source, because the R environment builds everything serially. This significantly hampers adoption.)
@conradsnicta nice point; I opened #2926 to act on the suggestion. You're right that the R bindings are pretty bad. I'm hoping that removing Boost and simplifying our other dependencies can help reduce the compilation pain.
@rcurtin With an eye towards reducing compilation time, I'd suggest a possible change to the user-facing API (perhaps for mlpack 4.0). Currently mlpack is heavily templated, which forces the compiler to do a lot of reasoning at compilation time -- the more template parameters, the slower the compilation. So the suggested change is to move some of the template parameters to be "plain" arguments to functions, where it makes sense and where it's possible. For example, if there are situations where the vast majority of time is taken by matrix multiplication or matrix decompositions (eg. SVD), use of template arguments doesn't give much (if any) runtime speedup, and comes at the expense of slower compilation. It would be useful to see the actual speed differences between "plain" arguments and template arguments, so that changes are informed by actual empirical data. The first step could be an audit to see where the majority of runtime is spent, and then work backwards from there. The audit could be done either via explicit speed evaluation, or via an approximate approach where we note the use of "heavy" matrix operations (eg. decompositions, etc.).
@conradsnicta you might be right, but I think the much bigger compilation time elephant in the room is the use of Boost. So I think once we remove that, we'll be in a much better place. There are some places like the hyperparameter tuner where complicated template metaprogramming (with recursion) is used, but most of the time it's just policy classes. At least for the policy classes, I would imagine that Armadillo's template metaprogramming used in all the linear algebra expressions works the compiler harder, yet Armadillo compiles pretty fast in my experience. Anyway, if compilation times are still long even after removing Boost, I totally agree with your approach here. 👍
I opened mlpack/mlpack.org#48 and #2935 to update the website and this repository to point to those documents. 👍
Thanks everyone for the input on this over the past year. The vision document is updated and linked to from the website now, so I'll go ahead and close this issue. Of course, we can have more discussions in the future! By closing this issue I just want to say "this discussion seems finished", not "all discussions are finished". :)
Hello everyone!
Nearly ten years ago, I wrote a document called "The Future of MLPACK": http://www.ratml.org/misc/mlpack_future.pdf
That document laid out four goals for the development of mlpack:
In the decade since I wrote that, I think that we have made some incredible efforts towards those goals. But now it's 2020, and maybe it's time to revisit them.
In the past ten years the world has changed in ways that I certainly couldn't have predicted; when I wrote those four goals above, Python was not even the dominant language for data science! I think the term "data scientist" hadn't really entered the popular lexicon yet.
Is anyone here interested in discussing the directions we should take in the next 3-5 years or so? If we could make a new design document with our goals and the things we want to see mlpack solve, this could be really helpful for new contributors---and for users---to know "what we're all about" and what we're aimed at.
I've certainly learned a lot in the past ten years about project planning and setting goals. So I'd love the chance to help moderate, guide, and contribute to a discussion like this. More importantly, I'd love to see what each of our development interests are, so that maybe we can all team up on the things we all believe in to make them a reality. :)
So... to kick it off:
Let's see where the discussion goes. :)
(Update: downthread, there is a new design document: #2524 (comment))