/*! @page formatdoc File formats and loading data in mlpack
@section formatintro Introduction
mlpack supports a wide variety of data formats (including images) and model
formats, both in its command-line programs and in C++ programs that call the
mlpack::data::Load() function. This tutorial discusses the formats that are
supported and how to use them.
@section toc_tut Table of Contents
This tutorial is split into the following sections:
 - \ref formatintro
 - \ref toc_tut
 - Data
   - Data Formats
     - \ref formatsimple
     - \ref formattypes
     - \ref formatcpp
     - \ref sparseload
     - \ref formatcat
     - \ref formatcatcpp
   - Image Support
     - \ref intro_imagetut
     - \ref model_api_imagetut
     - \ref imageinfo_api_imagetut
     - \ref load_api_imagetut
     - \ref save_api_imagetut
 - Models
   - \ref formatmodels
   - \ref formatmodelscpp
 - \ref formatfinal
@section formatsimple Simple examples to load data in C++
The example code snippets below load data from different formats into an
Armadillo matrix object (\c arma::mat) or model when using C++.
@code
using namespace mlpack;
arma::mat matrix1;
data::Load("dataset.csv", matrix1);
@endcode
@code
using namespace mlpack;
arma::mat matrix2;
data::Load("dataset.bin", matrix2);
@endcode
@code
using namespace mlpack;
arma::mat matrix3;
data::Load("dataset.h5", matrix3);
@endcode
@code
using namespace mlpack;
// ARFF loading is a little different, since sometimes mapping has to be done
// for string types.
arma::mat matrix4;
data::DatasetInfo datasetInfo;
data::Load("dataset.arff", matrix4, datasetInfo);
// The datasetInfo object now holds information about each dimension.
@endcode
@code
using namespace mlpack;
regression::LogisticRegression lr;
data::Load("model.bin", "logistic_regression_model", lr);
@endcode
@section formattypes Supported dataset types
Datasets in mlpack are represented internally as sparse or dense numeric
matrices (specifically, as \c arma::mat or \c arma::sp_mat or similar). This
means that when datasets are loaded from file, they must be converted to a
suitable numeric representation. Therefore, in general, datasets on disk should
contain only numeric features in order to be loaded successfully by mlpack.
The types of datasets that mlpack can load are roughly the same as the types of
matrices that Armadillo can load. However, the load functionality that mlpack
provides <b>only supports loading dense datasets</b>. When datasets are loaded
by mlpack, <b>the file's type is detected using the file's extension</b>.
mlpack supports the following file types:
- csv (comma-separated values), denoted by .csv or .txt
- tsv (tab-separated values), denoted by .tsv, .csv, or .txt
- ASCII (raw ASCII, with space-separated values), denoted by .txt
- Armadillo ASCII (Armadillo's text format with a header), denoted by .txt
- PGM, denoted by .pgm
- PPM, denoted by .ppm
- Armadillo binary, denoted by .bin
- Raw binary, denoted by .bin <b>(note: this will be loaded as
one-dimensional data, which is likely not what is desired.)</b>
- HDF5, denoted by .hdf, .hdf5, .h5, or .he5 (<b>note: HDF5 must be enabled
in the Armadillo configuration</b>)
- ARFF, denoted by .arff (<b>note: this is not supported by all mlpack
command-line programs </b>; see \ref formatcat)
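
The extension-based detection listed above can be pictured with a small standalone sketch. It uses only the C++ standard library; the function name `GuessFormat()` and the labels it returns are hypothetical and are not part of mlpack. Because extensions such as .txt and .bin correspond to several formats in the list, the sketch only reports the candidate family for those.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not an mlpack function): return a coarse format label
// for a filename, based only on its extension.
std::string GuessFormat(const std::string& filename)
{
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos || dot + 1 == filename.size())
    return "unknown";

  // Lowercase the extension so that "DATA.CSV" and "data.csv" match.
  std::string ext = filename.substr(dot + 1);
  assert(!ext.empty()); // Guaranteed by the early return above.
  std::transform(ext.begin(), ext.end(), ext.begin(),
      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  if (ext == "csv") return "csv or tsv";
  if (ext == "tsv") return "tsv";
  if (ext == "txt") return "text (csv, tsv, raw ASCII, or Armadillo ASCII)";
  if (ext == "pgm") return "pgm";
  if (ext == "ppm") return "ppm";
  if (ext == "bin") return "binary (Armadillo or raw)";
  if (ext == "hdf" || ext == "hdf5" || ext == "h5" || ext == "he5")
    return "hdf5";
  if (ext == "arff") return "arff";
  return "unknown";
}
```

A caller would still need to look at the file contents to tell the text formats apart; this sketch deliberately stops at the extension.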
Datasets that are loaded by mlpack should be stored with <b>one row for
one point</b> and <b>one column for one dimension</b>. Therefore, a dataset
with three two-dimensional points \f$(0, 1)\f$, \f$(3, 1)\f$, and \f$(5, -5)\f$
would be stored in a csv file as:
\code
0, 1
3, 1
5, -5
\endcode
As noted earlier, for command-line programs, the format is automatically
detected at load time. Therefore, a dataset can be loaded in many ways:
\code
$ mlpack_logistic_regression -t dataset.csv -v
[INFO ] Loading 'dataset.csv' as CSV data. Size is 32 x 37749.
...
$ mlpack_logistic_regression -t dataset.txt -v
[INFO ] Loading 'dataset.txt' as raw ASCII formatted data. Size is 32 x 37749.
...
$ mlpack_logistic_regression -t dataset.h5 -v
[INFO ] Loading 'dataset.h5' as HDF5 data. Size is 32 x 37749.
...
\endcode
Similarly, the format to save to is detected by the extension of the given
filename.
@section formatcpp Loading simple matrices in C++
In C++, the mlpack::data::Load() and mlpack::data::Save() functions are used to
load and save datasets, respectively. These functions should be preferred over
the built-in Armadillo \c .load() and \c .save() functions.
Matrices in mlpack are column-major, meaning that each column should correspond
to a point in the dataset and each row should correspond to a dimension; for
more information, see \ref matrices. This is at odds with how the data is
stored in files; therefore, a transposition is required during load and save.
The mlpack::data::Load() and mlpack::data::Save() functions do this
automatically (unless otherwise specified), which is why they are preferred over
the Armadillo functions.
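
The effect of that transposition can be illustrated without mlpack or Armadillo. The sketch below is a simplified stand-in, not mlpack's loader: the `ColumnMajorData` struct and the `LoadTransposed()` function are hypothetical names, and the parser assumes well-formed, comma-separated numeric input. It reads text with one point per line and stores the values so that each point is contiguous in memory, which is exactly the layout of a column-major matrix with one point per column.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: a column-major matrix with one point per column.
struct ColumnMajorData
{
  std::size_t rows = 0; // Number of dimensions.
  std::size_t cols = 0; // Number of points.
  std::vector<double> values; // Element (d, p) lives at values[p * rows + d].

  double At(const std::size_t d, const std::size_t p) const
  {
    assert(d < rows && p < cols);
    return values[p * rows + d];
  }
};

// Parse text with one point per line and comma-separated dimensions.
ColumnMajorData LoadTransposed(const std::string& text)
{
  ColumnMajorData data;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty())
      continue;

    std::istringstream fields(line);
    std::string field;
    std::size_t count = 0;
    while (std::getline(fields, field, ','))
    {
      data.values.push_back(std::stod(field));
      ++count;
    }

    // The first point fixes the dimensionality; later points must match it.
    if (data.cols == 0)
      data.rows = count;
    assert(count == data.rows);
    ++data.cols;
  }
  return data;
}
```

For a file holding the three two-dimensional points (0, 1), (3, 1), and (5, -5), one per line, the result has 2 rows and 3 columns, and `At(1, 2)` returns the second dimension of the third point.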
To load a matrix from file, the call is straightforward. After creating a
matrix object, the data can be loaded:
\code
arma::mat dataset; // The data will be loaded into this matrix.
mlpack::data::Load("dataset.csv", dataset);
\endcode
Saving matrices is equally straightforward. The code below generates a random
matrix with 10 points in 3 dimensions and saves it to a file as HDF5.
\code
// 3 dimensions (rows), with 10 points (columns).
arma::mat dataset = arma::randu<arma::mat>(3, 10);
mlpack::data::Save("dataset.h5", dataset);
\endcode
As with the command-line programs, the type of data to be loaded is
automatically detected from the filename extension. For more details, see the
mlpack::data::Load() and mlpack::data::Save() documentation.
@section sparseload Dealing with sparse matrices
As mentioned earlier, support for loading sparse matrices in mlpack is not
available at this time. To use a sparse matrix with mlpack code, you will have
to write a C++ program instead of using any of the command-line tools, because
the command-line tools all use dense datasets internally. (There is one
exception: the \c mlpack_cf program, for collaborative filtering, loads sparse
coordinate lists.)
In addition, the \c mlpack::data::Load() function does not support loading any
sparse format; so the best idea is to use undocumented Armadillo functionality
to load coordinate lists. Suppose you have a coordinate list file like the one
below:
\code
$ cat cl.csv
0 0 0.332
1 3 3.126
4 4 1.333
\endcode
This represents a 5x5 matrix with three nonzero elements. We can load this
using Armadillo:
\code
arma::sp_mat matrix;
matrix.load("cl.csv", arma::coord_ascii);
matrix = matrix.t(); // We must transpose after load!
\endcode
The transposition after loading is necessary if the coordinate list is in
row-major format (that is, if each row in the matrix represents a point and each
column represents a feature). Be sure that the matrix you use with mlpack
methods has points as columns and features as rows! See \ref matrices for more
information.
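
To make the coordinate-list layout and the transposition concrete, here is a minimal sketch that uses only the standard library. It is not Armadillo's parser: the `Entry` and `CoordinateList` types and the `LoadCoordinatesTransposed()` function are hypothetical, indices are assumed to be zero-based, and the matrix size is assumed to be the largest index plus one in each direction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One nonzero entry of a sparse matrix.
struct Entry
{
  std::size_t row;
  std::size_t col;
  double value;
};

// Hypothetical sparse container: the nonzero entries plus the matrix size.
struct CoordinateList
{
  std::size_t rows = 0;
  std::size_t cols = 0;
  std::vector<Entry> entries;
};

// Parse whitespace-separated "row col value" triples and swap the two indices,
// so that points stored as rows in the file become columns.
CoordinateList LoadCoordinatesTransposed(const std::string& text)
{
  CoordinateList list;
  std::istringstream in(text);
  std::size_t r = 0, c = 0;
  double v = 0.0;
  while (in >> r >> c >> v)
  {
    // The row index in the file becomes the column index, and vice versa.
    list.entries.push_back(Entry{c, r, v});
    list.rows = std::max(list.rows, c + 1);
    list.cols = std::max(list.cols, r + 1);
  }

  // Stopping before the end of the input would mean a malformed triple.
  assert(in.eof());
  return list;
}
```

Applied to the three-line file shown above, this yields three entries in a 5x5 matrix, with the entry from the second line stored at row 3, column 1.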
@section formatcat Categorical features and command line programs
In some situations it is useful to represent data not just as a numeric matrix
but also as categorical data (i.e. with numeric but unordered categories). This
support is useful for, e.g., decision trees and other models that support
categorical features.
Categorical data might look like this (in CSV format):
\code
0, 1, "true", 3
5, -2, "false", 5
2, 2, "true", 4
3, -1, "true", 3
4, 4, "not sure", 0
0, 7, "false", 6
\endcode
In the example above, the third dimension (which takes values "true", "false",
and "not sure") is categorical. mlpack can load and work with this data, but
the strings must be mapped to numbers, because all datasets in mlpack are
represented by Armadillo matrix objects.
From the perspective of an mlpack command-line program, this support is
transparent; mlpack will attempt to load the data file, and if it detects
entries in the file that are not numeric, it will map them to numbers and then
print, for each dimension, the number of mappings. For instance, if we run the
\c mlpack_hoeffding_tree program (which supports categorical data) on the
dataset above (stored as dataset.csv), we receive this output during loading:
\code
$ mlpack_hoeffding_tree -t dataset.csv -l dataset.labels.csv -v
[INFO ] Loading 'dataset.csv' as CSV data. Size is 6 x 4.
[INFO ] 0 mappings in dimension 0.
[INFO ] 0 mappings in dimension 1.
[INFO ] 3 mappings in dimension 2.
[INFO ] 0 mappings in dimension 3.
...
\endcode
Currently, only the \c mlpack_hoeffding_tree program supports loading
categorical data, and this is also the only program that supports loading an
ARFF dataset.
@section formatcatcpp Categorical features and C++
When writing C++, loading categorical data is slightly trickier: the mappings
from strings to integers must be preserved. This is the purpose of the
mlpack::data::DatasetInfo class, which stores these mappings and can be used at
load and save time to apply and de-apply the mappings.
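
The idea of a preserved, two-way mapping can be sketched independently of mlpack. The `CategoryMapper` class below is a hypothetical, simplified stand-in for the mapping kept for a single categorical dimension; its method signatures are illustrative and do not reproduce the DatasetInfo interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the mapping stored for one categorical dimension.
class CategoryMapper
{
 public:
  // Return the integer for a string, creating a new mapping if needed.
  std::size_t MapString(const std::string& s)
  {
    const auto it = forward.find(s);
    if (it != forward.end())
      return it->second;

    const std::size_t id = reverse.size();
    forward.emplace(s, id);
    reverse.push_back(s);
    return id;
  }

  // Return the original string for an integer (the save-time direction).
  const std::string& UnmapString(const std::size_t id) const
  {
    assert(id < reverse.size());
    return reverse[id];
  }

  // Number of distinct categories seen so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};
```

Because the mapper remembers every string it has seen, mapping "true" a second time (for example, while reading a test set after a training set) returns the same integer as the first time, and `UnmapString()` recovers the original string when the data is written back out.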
When loading a dataset with categorical data, the overload of
mlpack::data::Load() that takes an mlpack::data::DatasetInfo object should be
used. An example is below:
\code
arma::mat dataset; // Load into this matrix.
mlpack::data::DatasetInfo info; // Store information about dataset in this.
// Load the ARFF dataset.
mlpack::data::Load("dataset.arff", dataset, info);
\endcode
After this load completes, the \c info object will hold the information about
the mappings necessary to load the dataset. It is possible to re-use the
\c DatasetInfo object to load another dataset with the same mappings. This is
useful when, for instance, both a training and test set are being loaded, and it
is necessary that the mappings from strings to integers for categorical features
are identical. An example is given below.
\code
arma::mat trainingData; // Load training data into this matrix.
arma::mat testData; // Load test data into this matrix.
mlpack::data::DatasetInfo info; // This will store the mappings.
// Load the training data, and create the mappings in the 'info' object.
mlpack::data::Load("training_data.arff", trainingData, info);
// Load the test data, but re-use the 'info' object with the already initialized
// mappings. This means that the same mappings will be applied to the test set.
mlpack::data::Load("test_data.arff", testData, info);
\endcode
When saving data, pass the same DatasetInfo object it was loaded with in order
to unmap the categorical features correctly. The example below demonstrates
this functionality: it loads the dataset, increments all non-categorical
features by 1, and then saves the dataset with the same DatasetInfo it was
loaded with.
\code
arma::mat dataset; // Load data into this matrix.
mlpack::data::DatasetInfo info; // This will store the mappings.
// Load the dataset.
mlpack::data::Load("dataset.tsv", dataset, info);
// Loop over all features, and add 1 to all non-categorical features.
for (size_t i = 0; i < info.Dimensionality(); ++i)
{
  // The Type() function returns whether the dimension is numeric or
  // categorical.
  if (info.Type(i) != mlpack::data::Datatype::categorical)
    dataset.row(i) += 1.0;
}
// Save the modified dataset using the same DatasetInfo.
mlpack::data::Save("dataset-new.tsv", dataset, info);
\endcode
There is more functionality to the DatasetInfo class; for more information, see
the mlpack::data::DatasetInfo documentation.
@section intro_imagetut Loading and Saving Images
Image datasets are becoming increasingly popular in deep learning.
mlpack's image loading and saving functionality is based on [stb](https://github.com/nothings/stb).
@section model_api_imagetut Image Utilities API
The image utilities support loading and saving of images.
They support the file types "jpg", "png", "tga", "bmp", "psd", "gif", "hdr", "pic", and "pnm" for loading, and "jpg", "png", "tga", "bmp", and "hdr" for saving.
The associated datatype is unsigned char, which supports RGB values in the range 0-255. To feed the data into a network, a typecast of the `arma::Mat` may be required. Images are stored in the matrix as (width * height * channels, NumberOfImages). Therefore, if images are loaded into imageMatrix, imageMatrix.col(0) is the first image.
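
The flattened layout and the typecast can be sketched with the standard library alone. Two assumptions are made here and neither is confirmed by this tutorial: that pixels within one image column are stored row by row with interleaved channels (the order stb itself produces), and that scaling bytes into [0, 1] is the desired conversion. `PixelIndex()` and `ToUnitRange()` are hypothetical helper names, not mlpack functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout (stb-style; not confirmed by this tutorial): pixels are
// stored row by row, with the channels of each pixel interleaved.
std::size_t PixelIndex(const std::size_t x, const std::size_t y,
                       const std::size_t c, const std::size_t width,
                       const std::size_t channels)
{
  assert(x < width && c < channels);
  return (y * width + x) * channels + c;
}

// Convert one image column of bytes to doubles scaled into [0, 1].
std::vector<double> ToUnitRange(const std::vector<unsigned char>& image)
{
  std::vector<double> scaled;
  scaled.reserve(image.size());
  for (const unsigned char byte : image)
    scaled.push_back(byte / 255.0);
  return scaled;
}
```

Under these assumptions, a 4-pixel-wide RGB image keeps the blue channel of pixel (1, 0) at offset 5 of its column, and the first pixel of the second row starts at offset 12.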
@section imageinfo_api_imagetut Accessing Metadata of Images: ImageInfo
The ImageInfo class contains the metadata of the images.
@code
ImageInfo(const size_t width,
          const size_t height,
          const size_t channels);
@endcode
Other public members include:
- flipVertical: flip the image vertically upon loading.
- quality: compression of the image if saved as jpg (0-100).
@section load_api_imagetut Loading Images in C++
A single image can be loaded with the following overload:
@code
template<typename eT>
bool Load(const std::string& filename,
          arma::Mat<eT>& matrix,
          ImageInfo& info,
          const bool fatal,
          const bool transpose);
@endcode
The example below loads a test image; the call also fills in the ImageInfo object.
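@code
arma::Mat<unsigned char> matrix; // The image will be loaded into this matrix.
data::ImageInfo info;
data::Load("test_image.png", matrix, info, false, true);
@endcode
An ImageInfo object can also be constructed explicitly from the width, height, and number of channels of the image: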
@code
size_t height = 64, width = 64, channels = 1;
data::ImageInfo info(width, height, channels);
@endcode
More than one image can be loaded into the same matrix.
Loading multiple images:
@code
template<typename eT>
bool Load(const std::vector<std::string>& files,
          arma::Mat<eT>& matrix,
          ImageInfo& info,
          const bool fatal,
          const bool transpose);
@endcode
@code
arma::Mat<unsigned char> matrix; // Each column will hold one image.
data::ImageInfo info;
std::vector<std::string> files{"test_image1.bmp", "test_image2.bmp"};
data::Load(files, matrix, info, false, true);
@endcode
@section save_api_imagetut Saving Images in C++
Saving images expects a matrix of type unsigned char in the form (width * height * channels, NumberOfImages).
Like Load(), Save() can be used to save one image or multiple images. Besides the image data, it also expects the shape of the image as input (width, height, channels).
Saving one image:
@code
template<typename eT>
bool Save(const std::string& filename,
          arma::Mat<eT>& matrix,
          ImageInfo& info,
          const bool fatal,
          const bool transpose);
@endcode
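@code
// 'matrix' is assumed to be an arma::Mat<unsigned char> that already holds the
// image data (25 * 25 * 3 rows, one column per image).
data::ImageInfo info;
info.width = info.height = 25;
info.channels = 3;
info.quality = 90;
data::Save("test_image.bmp", matrix, info, false, true);
@endcode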
If the matrix contains more than one image, only the first one is saved.
Saving multiple images:
@code
template<typename eT>
bool Save(const std::vector<std::string>& files,
          arma::Mat<eT>& matrix,
          ImageInfo& info,
          const bool fatal,
          const bool transpose);
@endcode
@code
data::ImageInfo info;
info.width = info.height = 25;
info.channels = 3;
info.quality = 90;
std::vector<std::string> files{"test_image1.bmp", "test_image2.bmp"};
data::Save(files, matrix, info, false, true);
@endcode
Multiple images are saved according to the vector of filenames specified.
@section formatmodels Loading and saving models
Using \c boost::serialization, mlpack is able to load and save machine learning
models with ease. These models can currently be saved in three formats:
- binary (.bin); this is not human-readable, but it is small
- text (.txt); this is somewhat human-readable and relatively small
- xml (.xml); this is human-readable but very verbose and large
The type of file to save is determined by the given file extension, as with the
other loading and saving functionality in mlpack. Below is an example where a
dataset stored as TSV and labels stored as ASCII text are used to train a
logistic regression model, which is then saved to model.xml.
\code
$ mlpack_logistic_regression -t training_dataset.tsv -l training_labels.txt \
> -M model.xml
\endcode
Many mlpack command-line programs have support for loading and saving models
through the \c --input_model_file (\c -m) and \c --output_model_file (\c -M)
options; for more information, see the documentation for each program
(accessible by passing \c --help as a parameter).
@section formatmodelscpp Loading and saving models in C++
mlpack uses the \c boost::serialization library internally to perform loading
and saving of models, and provides convenience overloads of mlpack::data::Load()
and mlpack::data::Save() to load and save these models.
To be serializable, a class must implement the method
\code
template<typename Archive>
void serialize(Archive& ar, const unsigned int version);
\endcode
\note
For more information on this method and how it works, see the
boost::serialization documentation at
http://www.boost.org/libs/serialization/doc/.
\note
Examples of serialize() methods can be found in most classes; one fairly
straightforward example is found \ref mlpack::math::Range::serialize()
"in the mlpack::math::Range class". A more complex example is found
\ref mlpack::tree::BinarySpaceTree::serialize() "in the mlpack::tree::BinarySpaceTree class".
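
The reason a single \c serialize() method is enough for both saving and loading can be shown with a toy archive pair. Everything in the sketch below is hypothetical and far simpler than boost::serialization: `TextWriter`, `TextReader`, `Interval`, and `RoundTrip()` are illustrative names, and the only thing borrowed from the real library is the convention that an archive exposes \c operator&.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical minimal archives (not boost::serialization). Both expose
// operator&, so one serialize() method can drive saving and loading.
struct TextWriter
{
  std::ostream& out;

  template<typename T>
  TextWriter& operator&(T& value)
  {
    out << value << ' ';
    return *this;
  }
};

struct TextReader
{
  std::istream& in;

  template<typename T>
  TextReader& operator&(T& value)
  {
    in >> value;
    return *this;
  }
};

// A small serializable type, loosely modeled on a numeric range.
struct Interval
{
  double lo = 0.0;
  double hi = 0.0;

  // The version parameter is unused in this sketch, so it is left unnamed.
  template<typename Archive>
  void serialize(Archive& ar, const unsigned int)
  {
    ar & lo;
    ar & hi;
  }
};

// Save an Interval to text and load it back into a fresh object.
Interval RoundTrip(Interval original)
{
  std::stringstream stream;
  TextWriter writer{stream};
  original.serialize(writer, 0);

  Interval loaded;
  TextReader reader{stream};
  loaded.serialize(reader, 0);
  assert(!stream.fail());
  return loaded;
}
```

The same two lines in `serialize()` write the members when given a writer and read them when given a reader, so the field order can never drift between the save path and the load path.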
Using the mlpack::data::Load() and mlpack::data::Save() functions is easy if the
type being saved has a \c serialize() method implemented: simply call either
function with a filename, a name for the object to save, and the object itself.
The example below, for instance, creates an mlpack::math::Range object and saves
it as range.txt. Then, that range is loaded from file into another
mlpack::math::Range object.
\code
// Create range and save it.
mlpack::math::Range r(0.0, 5.0);
mlpack::data::Save("range.txt", "range", r);
// Load into new range.
mlpack::math::Range newRange;
mlpack::data::Load("range.txt", "range", newRange);
\endcode
It is important to be sure that you load the appropriate type; if you save, for
instance, an mlpack::regression::LogisticRegression object and attempt to load
it as an mlpack::math::Range object, the load will fail and an exception will be
thrown. (When the object is saved as binary (.bin), it is possible that the
load will not fail, but instead load with mangled data, which is perhaps even
worse!)
@section formatfinal Final notes
If the examples here are unclear, it would be worth looking into the ways that
mlpack::data::Load() and mlpack::data::Save() are used in the code. Some
example files that may be useful to this end:
- src/mlpack/methods/logistic_regression/logistic_regression_main.cpp
- src/mlpack/methods/hoeffding_trees/hoeffding_tree_main.cpp
- src/mlpack/methods/neighbor_search/knn_main.cpp
If you are interested in adding support for more data types to mlpack, consider
adding the support upstream to Armadillo first. Once Armadillo can load the new
type, very little code modification for mlpack will be necessary.
*/