Adapting armadillo's parser for mlpack(Removing Boost Dependencies) #2942

Merged
merged 116 commits into from Nov 6, 2021
Changes from 99 commits
Commits
1c48343
Adding new parser to mlpack::data::
heisenbuug May 15, 2021
fe70a12
changes
heisenbuug May 16, 2021
0919664
changes
heisenbuug May 16, 2021
aa649bd
style checks
heisenbuug May 16, 2021
08b0d16
Adding source of the original file
heisenbuug May 17, 2021
4b672a7
Adding license
heisenbuug May 17, 2021
a8534e8
Changed the name to csv_parser, minor style changes
heisenbuug Jun 2, 2021
74ad69c
minor style changes
heisenbuug Jun 2, 2021
e2e25d3
minor style changes
heisenbuug Jun 2, 2021
89e2942
Add MatType to Load fucntions
heisenbuug Jun 2, 2021
994934b
included csv_parser.hpp in core.hpp
heisenbuug Jun 2, 2021
e246209
changes
heisenbuug Jun 2, 2021
22ebc35
Removed arma::file_type, changed template parameter to MatType only
heisenbuug Jun 7, 2021
a7456b1
Removing LoadData()
heisenbuug Jun 9, 2021
7ff9b3c
Temporary patch to handle other file types
heisenbuug Jun 9, 2021
6c6aa1b
fix
heisenbuug Jun 9, 2021
ad1a495
Added doxygen comments
heisenbuug Jun 16, 2021
9a52c63
Update csv_parser.hpp
heisenbuug Jun 16, 2021
4dc7fec
added mlpack file type
heisenbuug Jun 16, 2021
cbdab31
adding mlpack file type in detect_file_type
heisenbuug Jun 16, 2021
739e212
Replacing arma file type with mlpack file type
heisenbuug Jun 16, 2021
5f4cbba
Created new file named type.hpp for mlpack file types and utility fuc…
heisenbuug Jun 20, 2021
585dbd9
Minor Fix
heisenbuug Jun 20, 2021
c3907bb
Removed load.cpp file
heisenbuug Jun 20, 2021
6d6616f
Commenting declarations in load.hpp
heisenbuug Jun 20, 2021
68281ea
Minor changes
heisenbuug Jun 22, 2021
d08b4e4
Changes in type checking
heisenbuug Jun 22, 2021
159e177
SFINAE for load()
heisenbuug Jun 24, 2021
195c31a
Applied SFINAE in load()
heisenbuug Jun 24, 2021
c2290fa
trying SFINAE
heisenbuug Jun 25, 2021
409d7ff
Created parser class
heisenbuug Jun 29, 2021
3f7f769
Added GetMatSize
heisenbuug Jun 29, 2021
41f2ef9
test
heisenbuug Jun 30, 2021
ae25a26
Still not working
heisenbuug Jul 1, 2021
e213112
Add these files to get rid of them
shrit Jul 3, 2021
8f4ee42
Remove csv file
shrit Jul 3, 2021
f9f7c82
Remove mlpack/core.hpp header from load_csv
shrit Jul 3, 2021
7eac0dc
Convert csv_parser to load_csv_impl
shrit Jul 3, 2021
0c937e1
Clean headers
shrit Jul 3, 2021
d4f79d4
Remove csv parser from CMakeLists
shrit Jul 3, 2021
ea95bb9
Remove csv parser no reason for it
shrit Jul 3, 2021
9426545
Removed spirits from GetMatrixSize() and GetTransposeMatrixSize()
heisenbuug Jul 17, 2021
4402e5e
minor chages
heisenbuug Jul 19, 2021
fb251ab
Removing arma::file_type from load_save_test.cpp
heisenbuug Jul 22, 2021
43b53c0
Merge branch 'master' of github.com:mlpack/mlpack
heisenbuug Jul 22, 2021
7f71091
Merging master
heisenbuug Jul 22, 2021
722d1d2
resolving issue
heisenbuug Jul 22, 2021
64c4d2f
Running tests locally
heisenbuug Jul 22, 2021
da06849
chaning int to size_t in ConvertToken, changing set_size to zeros
heisenbuug Jul 23, 2021
704a120
missed these size_t changes
heisenbuug Jul 23, 2021
aa3b189
Adding cmakefile
heisenbuug Jul 23, 2021
47e1328
Merge branch 'master' into new-parser
heisenbuug Jul 23, 2021
36ab2d3
uncommmenting main.cpp from CMakeLists.txt
heisenbuug Jul 25, 2021
4b1d111
Removing all instances of boost::trim
heisenbuug Jul 25, 2021
632c013
chaning int to size_t in size comparison
heisenbuug Jul 25, 2021
a88a1ae
Removed boost spirit. Still using boost::trim. Failing cases must be …
heisenbuug Jul 29, 2021
1f915fa
Merge branch 'master' of github.com:mlpack/mlpack into new-parser
heisenbuug Jul 29, 2021
cf87cb9
Removing load.cpp which contains boost::spirit code
heisenbuug Jul 30, 2021
0ddd6ab
trim() implementation to replace boost::trim()
heisenbuug Jul 30, 2021
ed8e614
Removing load.cpp and adding string_algorithms.hpp
heisenbuug Jul 30, 2021
1fa7b64
replace boost::trim() with mlpack::data::trim()
heisenbuug Jul 30, 2021
9ac7cd1
Handling prasing
heisenbuug Aug 4, 2021
b32a913
Solving bug in trim fucntion
heisenbuug Aug 4, 2021
0201984
Handling string containg only space
heisenbuug Aug 4, 2021
3b70751
CSV files with header
heisenbuug Aug 5, 2021
81ebd0a
Uncommenting other tests
heisenbuug Aug 5, 2021
e117294
Implemted trim_if()
heisenbuug Aug 5, 2021
77b5f6b
Somehow deleted this line, adding it back
heisenbuug Aug 5, 2021
1bbae56
Removing comments
heisenbuug Aug 6, 2021
7aca744
Handling style checks
heisenbuug Aug 6, 2021
2a99165
Replacing old C style callback with std::fucntion
heisenbuug Aug 6, 2021
727a02b
Adding back empty constructor | Syntax error in trim_if
heisenbuug Aug 6, 2021
62383c0
Adding comment inside constructor.
heisenbuug Aug 6, 2021
3b5a35a
Solving indentation issues | changed file_type -> FileType, line_stre…
heisenbuug Aug 12, 2021
756d7af
Removing load() for sparse matrix | Solving some errors from last commit
heisenbuug Aug 12, 2021
c1cf824
Chaning template parameter for Load() with DatasetMapper
heisenbuug Aug 12, 2021
626bbc6
Combined GetMatSize() and GetNonNumericMatSize() | Created new fucnti…
heisenbuug Aug 13, 2021
4599bc5
Adding MapOnFirstPass()
heisenbuug Aug 13, 2021
6ee199d
Running all tests
heisenbuug Aug 13, 2021
a2a352d
Checking
heisenbuug Aug 13, 2021
ce0a904
Refactoring parser into two files
heisenbuug Aug 13, 2021
b44a7ee
Adding comments
heisenbuug Aug 14, 2021
22a9646
solving a small error
heisenbuug Aug 14, 2021
3be9474
Running all tests
heisenbuug Aug 14, 2021
7569626
Indentation | Converted GetMatrixSize() to template fuction
heisenbuug Aug 16, 2021
589689e
Changing template parameter MatType to eT in Load()
heisenbuug Aug 21, 2021
c78e626
More indentation issues
heisenbuug Aug 22, 2021
2707ff1
Adding a tutorial for DatasetMapper
heisenbuug Aug 22, 2021
933b68c
Handling empty line at the end
heisenbuug Aug 23, 2021
54d7824
Style changes and fixs in tutorial
heisenbuug Sep 1, 2021
7904f39
Some typos
heisenbuug Sep 2, 2021
128da6b
Adding condition in case ConvertToken() fails
heisenbuug Sep 3, 2021
c2151d7
More style issues
heisenbuug Sep 20, 2021
1537e86
Handling failing of ConvertToken()
heisenbuug Oct 18, 2021
67325c5
Adding comment
heisenbuug Oct 18, 2021
73d37db
Apply suggestions from code review
heisenbuug Oct 22, 2021
cc91855
More style changes
heisenbuug Oct 26, 2021
d986e3d
Replacing ternary operator with simple if/else block
heisenbuug Oct 26, 2021
5cade8d
minor bug
heisenbuug Oct 26, 2021
49dce56
Adding comment
heisenbuug Nov 2, 2021
0307574
Update src/mlpack/core/data/string_algorithms.hpp
shrit Nov 5, 2021
967a7c9
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
2cba1d6
Update src/mlpack/core/data/load_numeric_csv.hpp
shrit Nov 5, 2021
ce36efb
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
eec152f
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
14edebb
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
1c8e3bb
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
99def5a
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
4f7df58
Update doc/tutorials/data_loading/datasetmapper.txt
shrit Nov 5, 2021
6d63b15
Update doc/tutorials/data_loading/datasetmapper.txt
shrit Nov 5, 2021
2b2c7fa
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
5e8afca
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
bd6c244
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
72c989d
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
150099a
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
4eb9468
Update src/mlpack/core/data/load_categorical_csv.hpp
shrit Nov 5, 2021
192 changes: 192 additions & 0 deletions doc/tutorials/data_loading/datasetmapper.txt
@@ -0,0 +1,192 @@
/*!

@file datasetmapper.txt
@author Gopi Tatiraju
@brief Introduction and tutorial for how to use DatasetMapper in mlpack.

@page datasetmapper DatasetMapper Tutorial

@section intro_datasetmapper Introduction

DatasetMapper is a class which holds information about a dataset. It is useful
when a dataset contains categorical, non-numeric features which must be
mapped to numeric values. A simple example is:

```
7,5,True,3
6,3,False,4
4,8,False,2
9,3,True,3
```

The above dataset will be represented as

```
7,5,0,3
6,3,1,4
4,8,1,2
9,3,0,3
```

Here the mappings are:

- `True` mapped to `0`
- `False` mapped to `1`

```
**Note** DatasetMapper maps non-numeric values in the order in which it
encounters them in the dataset. `True` is mapped to `0` above only because it
appears before `False`; if `False` appeared first, it would be mapped to `0`
instead. These `0` and `1` values are not to be confused with C++ bool values.
They are mappings created by `mlpack::data::DatasetMapper`.
```

DatasetMapper provides an easy API to load such data and stores all the
necessary information of the dataset.
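
The first-seen ordering described in the note can be illustrated with a small
standalone sketch. It does not use mlpack and is not mlpack's implementation;
`MapInFirstSeenOrder` and `DemoFirstSeenOrder` are hypothetical names chosen
for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of mlpack: gives each distinct token the next
// unused integer, in the order in which tokens are first seen.
inline std::vector<std::size_t> MapInFirstSeenOrder(
    const std::vector<std::string>& tokens)
{
  std::unordered_map<std::string, std::size_t> mapping;
  std::vector<std::size_t> mapped;
  mapped.reserve(tokens.size());
  for (const std::string& token : tokens)
  {
    // emplace() inserts only when the token is new, so the candidate value
    // mapping.size() is consumed once per distinct token.
    const auto result = mapping.emplace(token, mapping.size());
    mapped.push_back(result.first->second);
  }
  return mapped;
}

// Usage: the third column of the example dataset above.
inline void DemoFirstSeenOrder()
{
  const std::vector<std::size_t> mapped =
      MapInFirstSeenOrder({ "True", "False", "False", "True" });
  assert((mapped == std::vector<std::size_t>{ 0, 1, 1, 0 }));
}
```

Reordering the input so that `False` comes first would swap the two integers,
which is why the mapped values should not be read as booleans.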

@section toc_datasetmapper Table of Contents

A list of all sections

- \ref intro_datasetmapper
- \ref toc_datasetmapper
- \ref load
- \ref dimensions
- \ref type
- \ref numofmappings
- \ref checkmappings
- \ref unmapstring
- \ref unmapvalue

@section load Loading data

To use \b DatasetMapper we have to call a specific overload of the
`data::Load()` function.

@code
using namespace mlpack;

arma::mat data;
data::DatasetMapper info;
data::Load("dataset.csv", data, info);
@endcode

Dataset (`dataset.csv`)
```
7, 5, True, 3
6, 3, False, 4
4, 8, False, 2
9, 3, True, 3
```

@section dimensions Dimensionality

There are two ways to initialize a DatasetMapper object.

* First is to initialize the object and set each property yourself.

* Second is to pass the object to Load(), in which case mlpack will populate
the object. If we use the latter option then the dimensionality will be the
same as in the data file.

@code
std::cout << info.Dimensionality();
@endcode

@code
4
@endcode

@section type Type of each Dimension

Each dimension has one of two types:
- data::Datatype::numeric
- data::Datatype::categorical

\c `Type(size_t dimension)` takes an argument dimension, which is the row
number for which you want to know the type.

This returns a `data::Datatype` enum value, which prints as a number when we
pass it to `std::cout`:
- 0 represents `data::Datatype::numeric`
- 1 represents `data::Datatype::categorical`

@code
std::cout << info.Type(0) << "\n";
std::cout << info.Type(1) << "\n";
std::cout << info.Type(2) << "\n";
std::cout << info.Type(3) << "\n";
@endcode

@code
0
0
1
0
@endcode
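
One plausible way to arrive at such types is to mark a dimension as
categorical as soon as one of its tokens fails to parse as a number. The
sketch below shows that rule with the standard library only. The rule itself
is an assumption made for illustration, not a description of mlpack's parser,
and `SketchType`, `IsNumeric` and `InferTypes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for data::Datatype, used only in this sketch.
enum class SketchType { numeric = 0, categorical = 1 };

// Returns true if the whole token parses as a floating-point number.
// std::strtod skips leading whitespace, so " 5" counts as numeric.
inline bool IsNumeric(const std::string& token)
{
  char* end = nullptr;
  std::strtod(token.c_str(), &end);
  return !token.empty() && end == token.c_str() + token.size();
}

// Each inner vector holds the tokens of one line of the file. A dimension
// stays numeric unless some line has a non-numeric token in that position.
inline std::vector<SketchType> InferTypes(
    const std::vector<std::vector<std::string>>& rows)
{
  if (rows.empty())
    return {};

  std::vector<SketchType> types(rows.front().size(), SketchType::numeric);
  for (const std::vector<std::string>& row : rows)
  {
    // Precondition of this sketch: every line has the same number of fields.
    assert(row.size() == types.size());
    for (std::size_t d = 0; d < types.size(); ++d)
    {
      if (!IsNumeric(row[d]))
        types[d] = SketchType::categorical;
    }
  }
  return types;
}
```

Applied to the four lines of the example dataset, this rule marks only the
third dimension as categorical, which matches the output above.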

@section numofmappings Number of Mappings

If the type of a dimension is `data::Datatype::categorical`, then during the
loading, each unique token in that dimension will be mapped to an integer
starting with 0.

\b NumMappings(size_t dimension) takes dimension as an argument and returns the number of
mappings in that dimension. If the dimension is numeric or there are no mappings, then it
will return 0.

@code
std::cout << info.NumMappings(0) << "\n";
std::cout << info.NumMappings(1) << "\n";
std::cout << info.NumMappings(2) << "\n";
std::cout << info.NumMappings(3) << "\n";
@endcode

@code
0
0
2
0
@endcode

@section checkmappings Check Mappings

There are two ways to check the mappings.
- Pass the mapped integer to get the string (UnmapString)
- Pass the string to get the mapped integer (UnmapValue)

@subsection unmapstring UnmapString

\b UnmapString(int value, size_t dimension, size_t unmappingIndex = 0UL)
- value is the integer for which you want to find the mapped string
- dimension is the dimension in which you want to check the mappings
- unmappingIndex is optional and defaults to 0

@code
std::cout << info.UnmapString(0, 2) << "\n";
std::cout << info.UnmapString(1, 2) << "\n";
@endcode

@code
True
False
@endcode

@subsection unmapvalue UnmapValue

\b UnmapValue(const std::string &input, size_t dimension)
- input is the string for which you want to find the mapped integer
- dimension is the dimension in which you want to look up the string

@code
std::cout << info.UnmapValue("True", 2) << "\n";
std::cout << info.UnmapValue("False", 2) << "\n";
@endcode

@code
0
1
@endcode
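
The two lookups are inverses of each other. A minimal sketch of a two-way
mapping for a single dimension is shown below, using only the standard
library. `SketchDimensionMap` is a hypothetical class written for this
example; it borrows the method names from this tutorial but is not mlpack's
DatasetMapper, which also tracks dimensions and types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class for illustration: a two-way mapping for one dimension.
class SketchDimensionMap
{
 public:
  // Returns the integer for a string, creating a new mapping if needed.
  std::size_t MapString(const std::string& input)
  {
    // The next free integer equals the number of strings stored so far.
    const auto result = forward.emplace(input, reverse.size());
    if (result.second)
      reverse.push_back(input);
    // Both containers must always describe the same set of mappings.
    assert(forward.size() == reverse.size());
    return result.first->second;
  }

  // Integer to string. Throws std::out_of_range for an unknown value.
  const std::string& UnmapString(const std::size_t value) const
  {
    return reverse.at(value);
  }

  // String to integer. Throws std::out_of_range for an unknown string.
  std::size_t UnmapValue(const std::string& input) const
  {
    return forward.at(input);
  }

  // Number of distinct strings seen so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};
```

Keeping a vector for the integer-to-string direction makes that lookup a plain
index operation, while the hash map handles the string-to-integer direction.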

These are the basic uses of DatasetMapper. Some advanced use cases will be added soon.

*/
1 change: 1 addition & 0 deletions doc/tutorials/tutorials.txt
@@ -59,6 +59,7 @@ mlpack.
- \ref bindings
- \ref cv
- \ref hpt_guide
- \ref datasetmapper

@section policy_tut Policy Class Documentation

7 changes: 5 additions & 2 deletions src/mlpack/core/data/CMakeLists.txt
@@ -10,14 +10,14 @@ set(SOURCES
has_serialize.hpp
is_naninf.hpp
load_csv.hpp
load_csv.cpp
load_numeric_csv.hpp
load_categorical_csv.hpp
load.hpp
load_image_impl.hpp
load_image.cpp
load_model_impl.hpp
load_vec_impl.hpp
load_impl.hpp
load.cpp
load_arff.hpp
load_arff_impl.hpp
normalize_labels.hpp
@@ -26,6 +26,7 @@ set(SOURCES
save_impl.hpp
save_image.cpp
split_data.hpp
string_algorithms.hpp
imputer.hpp
binarize.hpp
string_encoding.hpp
@@ -34,6 +35,8 @@ set(SOURCES
confusion_matrix.hpp
one_hot_encoding.hpp
one_hot_encoding_impl.hpp
types.hpp
types_impl.hpp
)

# add directory name to sources