Adapting armadillo's parser for mlpack(Removing Boost Dependencies) #2942

heisenbuug · 2021-05-16T12:08:16Z

For background knowledge, look at these

Sample code to use the feature

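#include <fstream>
#include <iostream>
#include <mlpack/core.hpp>

int main()
{
  arma::Mat<double> data;
  std::fstream file;

  // Open the CSV file and stop early if it cannot be read.
  file.open("data.csv");
  if (!file.is_open())
  {
    std::cerr << "Could not open data.csv" << std::endl;
    return 1;
  }

  // Parse the stream into `data` using the parser proposed in this PR.
  mlpack::data::load_data<double>(data, arma::csv_ascii, file);
  data.raw_print();

  return 0;
}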

conradsnicta · 2021-05-17T00:35:51Z

@heisenbuug

Please don't copy the code from Armadillo without respecting the License requirements (Apache 2.0). You also need to acknowledge the author(s). It appears the code was taken from https://gitlab.com/conradsnicta/armadillo-code/-/blob/10.5.x/include/armadillo_bits/diskio_meat.hpp
Don't use internal Armadillo functionality such as arma_cold, arma::cond_rel, arma::is_signed, arma::is_real, as that can be changed at any time. Armadillo's API policy states, in part:

Caveat: any function, class, constant or other code not explicitly described in the public API documentation is considered as part of the underlying internal implementation details, and may be removed or changed without notice. (In other words, don't use internal functionality).

heisenbuug · 2021-05-17T07:14:54Z

Hey, @conradsnicta Thank you for having a look.

> Please don't copy the code from Armadillo without respecting the License requirements (Apache 2.0). You also need to acknowledge the author(s). It appears the code was taken from https://gitlab.com/conradsnicta/armadillo-code/-/blob/10.5.x/include/armadillo_bits/diskio_meat.hpp

Sorry, I forgot to mention that the file is taken from Armadillo. I will add that right away.

> Don't use internal Armadillo functionality such as arma_cold, arma::cond_rel, arma::is_signed, arma::is_real, as that can be changed at any time. Armadillo's API policy states, in part:
> Caveat: any function, class, constant or other code not explicitly described in the public API documentation is considered as part of the underlying internal implementation details, and may be removed or changed without notice. (In other words, don't use internal functionality).

@rcurtin @shrit How should we proceed, then?
Maybe we can copy these helpers into mlpack, so that we can keep using them internally even if they are removed from Armadillo. I am not sure what to do.

conradsnicta · 2021-05-18T01:24:21Z

@heisenbuug I suggest writing the code from scratch, but using the Armadillo implementation as an inspiration and reference. You'll get a much better understanding of how things work.

The code in mlpack doesn't need arma_cold. Stuff like arma::cond_rel, arma::is_signed and arma::is_real is relatively simple to re-implement. You may be able to use existing C++11 functionality -- see the type_traits header: https://en.cppreference.com/w/cpp/header/type_traits
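As a rough illustration of the suggestion above, the traits could be rebuilt on top of the standard `<type_traits>` header. This is only a sketch: the names `IsReal`, `IsSigned`, and `IsNegative` are hypothetical and are neither Armadillo's nor mlpack's actual identifiers, and `IsNegative` only stands in for one plausible use of a signedness-conditional comparison.

```cpp
#include <type_traits>

// Hypothetical replacement for an "is real" trait: true for float, double,
// and long double, false for integers and everything else.
template<typename T>
struct IsReal : std::is_floating_point<T> { };

// Hypothetical replacement for an "is signed" trait. std::is_signed is true
// for signed integers and for floating-point types.
template<typename T>
struct IsSigned : std::is_signed<T> { };

// Signed types: perform the actual comparison against zero.
template<typename T>
typename std::enable_if<IsSigned<T>::value, bool>::type
IsNegative(const T value)
{
  return value < T(0);
}

// Unsigned types: the value can never be negative, so no comparison is
// emitted (this also avoids "comparison is always false" warnings).
template<typename T>
typename std::enable_if<!IsSigned<T>::value, bool>::type
IsNegative(const T /* value */)
{
  return false;
}
```

With these definitions, `IsNegative(-3)` selects the signed overload and `IsNegative(3u)` selects the unsigned one at compile time, using only C++11 features.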

shrit · 2021-05-18T10:53:20Z

@heisenbuug I agree with @conradsnicta. It would be better to rewrite the parser from scratch, using Armadillo's implementation as a reference. We also need to mention that the implementation is inspired by Armadillo.

heisenbuug · 2021-05-26T11:59:31Z

@shrit @conradsnicta Okay, I will start rewriting the parser from scratch.

I have one question: now that I am writing the code from scratch, we also need to follow good design principles, right?

So, should I first think about how to structure the whole parser? Or should I first reimplement the basic data loading code, and we can think about the design along the way?

@conradsnicta, my initial idea was actually to implement a pandas-like data structure for C++. Now that I am implementing the parser itself, maybe I can implement some of those features at the parser level?

Let me know what you all think.

shrit · 2021-05-26T20:32:46Z

I would vote for keeping the same implementation as the one provided by Armadillo, because it would be impossible to write a better one during the 3 months of the summer. However, this does not mean copying the code directly from Armadillo into mlpack; it requires a complete rewrite that provides the same performance and the same functionality.

Once this step is complete, it will be easier to add functionality similar to that provided by pandas #2722.
The parser is an important part of the mlpack core. We need to be sure that everything works perfectly here.

Regarding the design, the next release is mlpack 4.0, which means we can make some changes to the public API. However, we should try to keep the same interface as the old one, with only minor changes where they are strictly required.

@heisenbuug Let me know if this is helpful 👍

heisenbuug · 2021-05-26T20:42:40Z

> I would vote for keeping the same implementation as the one provided by Armadillo, because it would be impossible to write a better one during the 3 months of the summer. However, this does not mean copying the code directly from Armadillo into mlpack; it requires a complete rewrite that provides the same performance and the same functionality.
>
> Once this step is complete, it will be easier to add functionality similar to that provided by pandas #2722.
> The parser is an important part of the mlpack core. We need to be sure that everything works perfectly here.
>
> Regarding the design, the next release is mlpack 4.0, which means we can make some changes to the public API. However, we should try to keep the same interface as the old one, with only minor changes where they are strictly required.
>
> @heisenbuug Let me know if this is helpful

So I will start by rewriting the basic data loading code. As suggested, I will reimplement the functions that are not described in Armadillo's public API documentation, or I will try to find existing C++11 features...

shrit · 2021-05-31T22:15:49Z

@heisenbuug, let me know if you need any help. Do not hesitate to push your latest modifications, even if they are not complete yet; this will allow me to check and review them when I have time 👍

heisenbuug · 2021-06-01T19:12:20Z

@shrit We talked about how we want to modify the Load functions so that the user can choose whether to use arma or coot.

So we need to rewrite the function declarations of Load function which should be in this file, right? Also after the project completion, we won't have load.cpp, right?

Now based on whether the user selects namespace arma or coot we need to use different namespaces.

How can we achieve this behavior?
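I came across [this](https://stackoverflow.com/questions/55612759/alternative-to-using-namespace-as-template-parameter).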

shrit · 2021-06-01T22:22:51Z

> @shrit We talked about how we want to modify the Load functions so that the user can choose whether to use arma or coot.
>
> So we need to rewrite the function declarations of Load function which should be in this file, right? Also after the project completion, we won't have load.cpp, right?

Yes, exactly, load.cpp should disappear when boost is removed.

> Now based on whether the user selects namespace arma or coot we need to use different namespaces.
>
> How can we achieve this behavior?
> I came across [this](https://stackoverflow.com/questions/55612759/alternative-to-using-namespace-as-template-parameter).

Yeah, in the meantime you can use a MatType of arma::Mat<>.
Please push the modifications as soon as possible; it will be easier to add comments directly on the code.
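The MatType idea can be sketched as follows: instead of passing a namespace as a template parameter (which C++ does not allow), the loader is templated on the matrix type itself, so any type exposing the same small interface can be plugged in. Everything here is an assumption made for illustration: `TinyMat` is a hypothetical stand-in that mimics only the `elem_type`, `set_size()`, and `operator()` members that arma::Mat<> provides, and `LoadNumericCSV` is not the function that was merged in this PR.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a matrix class such as arma::Mat<eT>; it keeps
// the sketch compilable without Armadillo.
template<typename eT>
class TinyMat
{
 public:
  typedef eT elem_type;

  void set_size(const std::size_t rows, const std::size_t cols)
  {
    n_rows = rows;
    n_cols = cols;
    mem.assign(rows * cols, eT(0));
  }

  // Column-major element access.
  eT& operator()(const std::size_t r, const std::size_t c)
  {
    return mem[c * n_rows + r];
  }

  std::size_t n_rows = 0;
  std::size_t n_cols = 0;

 private:
  std::vector<eT> mem;
};

// Hypothetical loader: the matrix type is the template parameter, and the
// element type is deduced from it. Returns false on malformed input.
template<typename MatType>
bool LoadNumericCSV(std::istream& in, MatType& out)
{
  typedef typename MatType::elem_type eT;

  std::vector<std::vector<eT>> rows;
  std::string line;
  while (std::getline(in, line))
  {
    if (line.empty())
      continue;

    // Split the line on commas and convert each token to eT.
    std::vector<eT> row;
    std::stringstream lineStream(line);
    std::string token;
    while (std::getline(lineStream, token, ','))
    {
      std::stringstream converter(token);
      eT value;
      if (!(converter >> value))
        return false;
      row.push_back(value);
    }

    // Every row must have the same number of columns as the first one.
    if (!rows.empty() && row.size() != rows[0].size())
      return false;
    rows.push_back(row);
  }

  if (rows.empty())
    return false;

  out.set_size(rows.size(), rows[0].size());
  for (std::size_t r = 0; r < rows.size(); ++r)
    for (std::size_t c = 0; c < rows[r].size(); ++c)
      out(r, c) = rows[r][c];

  return true;
}
```

A caller would write `TinyMat<double> m; LoadNumericCSV(stream, m);`, and under the assumptions above the same call should compile with an arma::Mat<double> in place of `TinyMat<double>`, since only members shared by both are used.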

shrit

@rcurtin @zoq Do you have anything to add for this pull request, or can we merge it?
I have reviewed it and have nothing to add from my side.

zoq · 2021-11-02T18:47:42Z

> @rcurtin @zoq Do you have anything to add for this pull request, or can we merge it? I have reviewed it and have nothing to add from my side.

Will take another look later today.

rcurtin · 2021-11-02T19:17:24Z

It's all good from my side so @zoq if you're happy with it then feel free to go ahead and merge. :) Awesome work @heisenbuug!

zoq

Should we add an Armadillo attribution for the code, something along the lines of "this was inspired by Armadillo's parser", or did the code deviate from the Armadillo parser quite a bit?

doc/tutorials/data_loading/datasetmapper.txt

src/mlpack/core/data/load_categorical_csv.hpp

src/mlpack/core/data/load_numeric_csv.hpp

src/mlpack/core/data/string_algorithms.hpp

conradsnicta · 2021-11-05T03:01:08Z

> Should we add an Armadillo attribution for the code, something along the lines of "this was inspired by Armadillo's parser", or did the code deviate from the Armadillo parser quite a bit?

No attribution necessary -- the parser is pretty basic.

I suggest merging this in its current state, as it has been in development since May (ie. ~6 months). It's good enough. Further optional improvements can always be done afterwards.

shrit · 2021-11-05T19:13:55Z

@heisenbuug If you do not mind, I am going to commit all these suggestions.

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

shrit · 2021-11-05T19:17:23Z

@conradsnicta Agreed, I am merging this one once these tests are passing.

heisenbuug · 2021-11-05T19:42:14Z

> @heisenbuug If you do not mind, I am going to commit all these suggestions.

Thank you for doing this. I was a bit busy with some other things and I was planning to look at them later. I hope all cases pass. If this gets merged soon I will open another PR for the parallel loading using OpenMP before the end of Sunday.

Can't wait to finally get this merged.

heisenbuug added 3 commits May 16, 2021 04:25

Adding new parser to mlpack::data::

1c48343

changes

fe70a12

changes

0919664

mlpack-bot bot added s: needs review s: unanswered s: unlabeled labels May 16, 2021

heisenbuug changed the title ~~Integrating armadillo's parser with mlpack and adapting it for categorical data~~ Adapting armadillo's parser for mlpack May 16, 2021

style checks

aa649bd

heisenbuug changed the title ~~Adapting armadillo's parser for mlpack~~ Adapting armadillo's parser for mlpack(Removing Boost Dependencies) May 16, 2021

shrit added c: core update dependencies and removed s: unanswered s: unlabeled labels May 16, 2021

Adding source of the original file

08b0d16

Adding license

4b672a7

heisenbuug added 5 commits June 2, 2021 15:42

Changed the name to csv_parser, minor style changes

a8534e8

minor style changes

74ad69c

minor style changes

e2e25d3

Add MatType to Load fucntions

89e2942

included csv_parser.hpp in core.hpp

994934b

minor bug

5cade8d

shrit approved these changes Nov 2, 2021

Adding comment

49dce56

zoq reviewed Nov 2, 2021

shrit and others added 16 commits November 5, 2021 20:14

Update src/mlpack/core/data/string_algorithms.hpp

0307574

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

967a7c9

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_numeric_csv.hpp

2cba1d6

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

ce36efb

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

eec152f

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

14edebb

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

1c8e3bb

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

99def5a

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update doc/tutorials/data_loading/datasetmapper.txt

4f7df58

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update doc/tutorials/data_loading/datasetmapper.txt

6d63b15

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

2b2c7fa

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

5e8afca

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

bd6c244

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

72c989d

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

150099a

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Update src/mlpack/core/data/load_categorical_csv.hpp

4eb9468

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

shrit merged commit 314557e into mlpack:master Nov 6, 2021

This was referenced Oct 14, 2022

Release version 4.0.0 #3285

Closed

Release version 4.0.0 #3286

Closed

rcurtin mentioned this pull request Oct 23, 2022

Release version 4.0.0 #3293

Merged
