Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapting armadillo's parser for mlpack(Removing Boost Dependencies) #2942

Merged
merged 116 commits into from Nov 6, 2021

Conversation

heisenbuug
Copy link
Contributor

@heisenbuug heisenbuug commented May 16, 2021

For background knowledge, look at these

Sample code to use the feature

#include <iostream>
#include <mlpack/core.hpp>

int main()
{
  arma::Mat<double> data;
  std::fstream file;
  
  file.open("data.csv");
  mlpack::data::load_data<double>(data, arma::csv_ascii, file);
  data.raw_print();
  
  return 0;  
}

@heisenbuug heisenbuug changed the title Integrating armadillo's parser with mlpack and adapting it for categorical data Adapting armadillo's parser for mlpack May 16, 2021
@heisenbuug heisenbuug changed the title Adapting armadillo's parser for mlpack Adapting armadillo's parser for mlpack(Removing Boost Dependencies) May 16, 2021
@conradsnicta
Copy link
Contributor

conradsnicta commented May 17, 2021

@heisenbuug

Caveat: any function, class, constant or other code not explicitly described in the public API documentation is considered as part of the underlying internal implementation details, and may be removed or changed without notice. (In other words, don't use internal functionality).

@heisenbuug
Copy link
Contributor Author

heisenbuug commented May 17, 2021

Hey, @conradsnicta Thank you for having a look.

Sorry, I forgot to mention that file is taken from the armadillo, I will add that right away.

  • Don't use internal Armadillo functionality such as arma_cold, arma::cond_rel, arma::is_signed, arma::is_real, as that can be changed at any time. Armadillo's API policy states, in part:
    Caveat: any function, class, constant or other code not explicitly described in the public API documentation is considered as part of the underlying internal implementation details, and may be removed or changed without notice. (In other words, don't use internal functionality).

@rcurtin @shrit how should we proceed then?
Maybe we can copy these to mlpack, i.e. even if removed from armadillo we can use them internally, not sure what to do.

@conradsnicta
Copy link
Contributor

@heisenbuug I suggest writing the code from scratch, but using the Armadillo implementation as an inspiration and reference. You'll get a much better understanding of how things work.

The code in mlpack doesn't need arma_cold. Stuff like arma::cond_rel, arma::is_signed and arma::is_real is relatively simple to re-implement. You may be able to use existing C++11 functionality -- see the type_traits header: https://en.cppreference.com/w/cpp/header/type_traits

@shrit
Copy link
Member

shrit commented May 18, 2021

@heisenbuug I agree with @conradsnicta, it would be a better idea to rewrite the parser from scratch by using the same implementation of armadillo as a reference, also we need to mention that the implementation is inspired by Armadillo.

@heisenbuug
Copy link
Contributor Author

heisenbuug commented May 26, 2021

@shrit @conradsnicta okay, I will start rewriting the parser from the scratch.

I have one doubt now that I am writing the code from the scratch we also need to follow
good design principles, right?

So, should I first think about how will I structure the whole parser? Or should I first just try
to reimplement the basic data loading code and we can think about the design along the way.

@conradsnicta, my initial idea was actually to implement a pandas like data structure for
C++, so now that I am implementing the parser itself, maybe I can implement some features
on the parser level?

Let me know what you all think.

@shrit
Copy link
Member

shrit commented May 26, 2021

I would vote for keeping the same implementation as the one provided by Armadillo because it would be impossible to write a better one during the 3 months of the summer. However, this does not mean copying the code directly from armadillo into mlpack, but this would require a complete rewrite of the code to provide the same performance and the same functionalities.

Once this step is complete it would be easier to add additional functionalities similar to one provided by pandas #2722.
The parser is an important part of mlpack core. We need to be sure that everything is working perfectly here.

Considering the design, the next release is mlpack 4.0, which means we can have some changes in the public API. However, we need to try to keep the same interface as the old one, with only minor changes if it is required most.

@heisenbuug Let me know if this is helpful 👍

@heisenbuug
Copy link
Contributor Author

I would vote for keeping the same implementation as the one provided by Armadillo because it would be impossible to write a better one during the 3 months of the summer. However, this does not mean copying the code directly from armadillo into mlpack, but this would require a complete rewrite of the code to provide the same performance and the same functionalities.

Once this step is complete it would be easier to add additional functionalities similar to one provided by pandas #2722.
The parser is an important part of mlpack core. We need to be sure that everything is working perfectly here.

Considering the design, the next release is mlpack 4.0, which means we can have some changes in the public API. However, we need to try to keep the same interface as the old one, with only minor changes if it is required most.

@heisenbuug Let me know if this is helpful

So I will start by rewriting the basic data loading code, as suggested I will reimplement the functions that are not described in armadillo's public API documentation or I will try to find existing C++ 11 feature...

@shrit
Copy link
Member

shrit commented May 31, 2021

@heisenbuug, let me know if you need any help, do not hesitate in pushing the last modification, even if they are not complete yet, this will allow me to check and review it when I have time 👍

@heisenbuug
Copy link
Contributor Author

heisenbuug commented Jun 1, 2021

@shrit We talked about how we want to modify Load functions so that user can choose whether he wants to use arma or coot

So we need to rewrite the function declarations of Load function which should be in this file, right? Also after the project completion, we won't have load.cpp, right?

Now based on whether the user selects namespace arma or coot we need to use different namespaces.

How can we achieve this behavior?
I came across this.

@shrit
Copy link
Member

shrit commented Jun 1, 2021

@shrit We talked about how we want to modify Load functions so that user can choose whether he wants to use arma or coot

So we need to rewrite the function declarations of Load function which should be in this file, right? Also after the project completion, we won't have load.cpp, right?

Yes, exactly, load.cpp should disappear when boost is removed.

Now based on whether the user selects namespace arma or coot we need to use different namespaces.

How can we achieve this behavior?
I came across [this](https://stackoverflow.com/questions/55612759/alternative-to-using-namespace-as-template-parameter

Yeah, in the meanwhile you can use MatType of arma::Mat<>
You can push the modification as soon as possible, it will be easier to add comments directly on the code.

Copy link
Member

@shrit shrit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rcurtin @zoq Anything you have for this pull request or we can merge it?
I have reviewed it, nothing to add from my side.

@zoq
Copy link
Member

zoq commented Nov 2, 2021

@rcurtin @zoq Anything you have for this pull request or we can merge it? I have reviewed it, nothing to add from my side.

Will take another look later today.

@rcurtin
Copy link
Member

rcurtin commented Nov 2, 2021

It's all good from my side so @zoq if you're happy with it then feel free to go ahead and merge. :) Awesome work @heisenbuug!

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an armadillo attribution for the code, something along the lines of this was inspired by armadillo's parser, or did the code deviate from the armadillo parse quite a bit?

doc/tutorials/data_loading/datasetmapper.txt Outdated Show resolved Hide resolved
doc/tutorials/data_loading/datasetmapper.txt Outdated Show resolved Hide resolved
src/mlpack/core/data/load_categorical_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/load_categorical_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/load_categorical_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/load_categorical_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/load_categorical_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/load_numeric_csv.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/string_algorithms.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/string_algorithms.hpp Show resolved Hide resolved
@conradsnicta
Copy link
Contributor

conradsnicta commented Nov 5, 2021

Should we add an armadillo attribution for the code, something along the lines of this was inspired by armadillo's parser, or did the code deviate from the armadillo parse quite a bit?

No attribution necessary -- the parser is pretty basic.

I suggest merging this in its current state, as it has been in development since May (ie. ~6 months). It's good enough. Further optional improvements can always be done afterwards.

@shrit
Copy link
Member

shrit commented Nov 5, 2021

@heisenbuug If you do not mind, I am going to commit all these propositions.

shrit and others added 16 commits November 5, 2021 20:14
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
@shrit
Copy link
Member

shrit commented Nov 5, 2021

@conradsnicta Agreed, I am merging this one once these tests are passing.

@heisenbuug
Copy link
Contributor Author

@heisenbuug If you do not mind, I am going to commit all these propositions.

Thank you for doing this. I was a bit busy with some other things and I was planning to look at them later. I hope all cases pass. If this gets merged soon I will open another PR for the parallel loading using OpenMP before the end of Sunday.

Can't wait to finally get this merged.

@shrit shrit merged commit 314557e into mlpack:master Nov 6, 2021
This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants