Optimize load csv 00 #681

Merged
Changes from 1 commit
42 commits
e049bb7
add overload, able to move string
stereomatchingkiss Jun 4, 2016
c26052b
fix bug--infinite recursive call
stereomatchingkiss Jun 4, 2016
cd7c895
first commit
stereomatchingkiss Jun 4, 2016
97713bd
1 : fix bug, did not consider case like "210DM, 1~200"
stereomatchingkiss Jun 4, 2016
f0afdbe
fix bug--category conversion should be based on columns, not rows
stereomatchingkiss Jun 4, 2016
cd5c6c7
remove useless codes
stereomatchingkiss Jun 5, 2016
1c13764
support tsv
stereomatchingkiss Jun 5, 2016
34ed51d
use LoadCSV to implement csv/tsv/txt loader
stereomatchingkiss Jun 5, 2016
fe1feb8
fix bug--LoadCSV should parse txt files too
stereomatchingkiss Jun 5, 2016
1865098
can specify fatal or not if file cannot open
stereomatchingkiss Jun 5, 2016
1afb484
fix bug--should not use empty string to open file
stereomatchingkiss Jun 5, 2016
e8b216c
treat \t and space as same category
stereomatchingkiss Jun 5, 2016
bc25da5
refine string construct
stereomatchingkiss Jun 5, 2016
f1c43d5
refine comments
stereomatchingkiss Jun 6, 2016
12fcc47
refine comments
stereomatchingkiss Jun 6, 2016
d0e725e
refine comments and parsers
stereomatchingkiss Jun 6, 2016
88d81dd
add new test cases for strings like "200-DM"
stereomatchingkiss Jun 6, 2016
0b20e9a
simplify parser by phrase_parse
stereomatchingkiss Jun 7, 2016
0a04ec4
simplify parser and refine format
stereomatchingkiss Jun 7, 2016
8866942
add forward declaration for DatasetInfo, without it the vc2015 compil…
stereomatchingkiss Jun 8, 2016
0ceba31
add load_csv.hpp and load_csv.cpp
stereomatchingkiss Jun 8, 2016
7cc597f
split part of the implementation details into cpp, this may reduce so…
stereomatchingkiss Jun 8, 2016
1bdaf1a
remove forward declaration
stereomatchingkiss Jun 8, 2016
4435829
include mlpack/core.hpp before arma_extend.hpp to prevent some weird …
stereomatchingkiss Jun 8, 2016
2264bb4
add header load_csv.hpp
stereomatchingkiss Jun 8, 2016
baa1a64
change order of sources
stereomatchingkiss Jun 8, 2016
1a70984
remove useless include file
stereomatchingkiss Jun 8, 2016
f19c11a
change order of header file
stereomatchingkiss Jun 8, 2016
3beb890
move implementation details from cpp back to hpp
stereomatchingkiss Jun 28, 2016
8253fec
merged
stereomatchingkiss Feb 4, 2017
3df712b
move part of the implementation details to cpp
stereomatchingkiss Feb 12, 2017
1bf513a
1 : use extern template to export part of the implementation of Load …
stereomatchingkiss Feb 12, 2017
c42372e
Fix Armadillo warning.
rcurtin Feb 14, 2017
cc5d541
Use extern templates to compile Load() overloads, so that spirit does…
rcurtin Feb 14, 2017
436bc2c
Add new file.
rcurtin Feb 14, 2017
f5d1c70
Merge pull request #1 from rcurtin/csvloadtest
stereomatchingkiss Feb 19, 2017
c6c25be
use std::string to replace raw buffer, cpp11 guarantee memory layout …
stereomatchingkiss Feb 19, 2017
09983f6
1 : fix bug, wrong constructor
stereomatchingkiss Feb 19, 2017
6cd94b6
fix format
stereomatchingkiss Feb 19, 2017
0d97eb9
remove useless file
stereomatchingkiss Feb 25, 2017
fd3b4b1
add license
stereomatchingkiss Feb 25, 2017
b92a4c1
use preprocessor to omit extra instant of Load function under windows
stereomatchingkiss Mar 12, 2017
1 change: 0 additions & 1 deletion src/mlpack/core/data/CMakeLists.txt
@@ -6,7 +6,6 @@ set(SOURCES
extension.hpp
format.hpp
load_csv.hpp
load_csv.cpp
load.hpp
load_impl.hpp
load_arff.hpp
95 changes: 90 additions & 5 deletions src/mlpack/core/data/load_csv.hpp
@@ -15,6 +15,7 @@
#include <set>
#include <string>

#include "extension.hpp"
#include "format.hpp"
#include "dataset_info.hpp"

@@ -29,7 +30,14 @@ namespace data /** Functions to load and save matrices and models. */ {
class LoadCSV
{
public:
explicit LoadCSV(std::string file, bool fatal = false);
explicit LoadCSV(std::string file, bool fatal = false) :
extension(Extension(file)),
fatalIfOpenFail(fatal),
fileName(std::move(file)),
inFile(fileName)
{
CanOpen();
}

template<typename T>
void Load(arma::Mat<T> &inout, DatasetInfo &infoSet, bool transpose = true)
@@ -51,9 +59,56 @@ class LoadCSV
}
}

size_t ColSize();
size_t ColSize()
{
// boost::tokenizer or strtok could do the same job; Spirit is used
// here because this is a nice example of it.
using namespace boost::spirit;
using bsi_type = boost::spirit::istream_iterator;
using iter_type = boost::iterator_range<bsi_type>;

inFile.clear();
inFile.seekg(0, std::ios::beg);
// spirit::qi requires iterators to be at least forward iterators,
// but std::istream_iterator is only an input iterator, so we use
// boost::spirit::istream_iterator to get around this.
bsi_type begin(inFile);
bsi_type end;
size_t col = 0;

// Spirit parsers can carry "actions" (functors): whenever the
// parser matches its target, the attached functor is executed.
auto findColSize = [&col](iter_type){ ++col; };

// qi::char_ consumes a single character.
// qi::char_(",\r\n") consumes only a ',', '\r' or '\n' character.
// * means the parser (e.g. qi::char_) may match zero or more times.
// ~ negates, so ~qi::char_(",\r\n") consumes any character except ',', '\r' and '\n'.
// parser % "," parses a separated list such as "1,2,3,apple" (note: no trailing comma).

// qi::raw suppresses Spirit's automatic attribute conversion. Without it, the
// parser would try to convert each match to a std::string, which may allocate
// memory (whenever the small-string optimization does not apply).
// Once the parser is wrapped in qi::raw, the attribute (the data passed to the
// functor) becomes a boost::iterator_range, which saves a lot of allocations.
qi::parse(begin, end, qi::raw[*~qi::char_(",\r\n")][findColSize] % ",");

return col;
}
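
For readers without Boost at hand, here is a minimal sketch of what the rule above appears to compute: the number of comma-separated fields on the first line of the file. It swaps Spirit for plain standard-library calls, so it is not the PR's implementation; `CountColumns` is a hypothetical name introduced only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of mlpack): counts the comma-separated
// fields on the first line of `in`, using only the standard library.
std::size_t CountColumns(std::istream& in)
{
  // Rewind first, as ColSize() does, in case the stream was read before.
  in.clear();
  in.seekg(0, std::ios::beg);

  // std::getline stops at '\n', so only the first row is inspected.
  std::string firstLine;
  std::getline(in, firstLine);

  // N commas separate N + 1 fields; a line without commas is one field.
  return static_cast<std::size_t>(
      std::count(firstLine.begin(), firstLine.end(), ',')) + 1;
}
```

Calling `CountColumns` on a stream holding `"1,2,3,apple\n4,5,6,pear\n"` should report four columns. Unlike the Spirit version, this sketch copies the first line into a `std::string`, which is exactly the allocation the `qi::raw` comment is trying to avoid.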

size_t RowSize()
{
inFile.clear();
inFile.seekg(0, std::ios::beg);
size_t row = 0;
std::string line;
while(std::getline(inFile, line))
{
++row;
}

size_t RowSize();
return row;
}
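
Both `ColSize()` and `RowSize()` start with `inFile.clear()` followed by `inFile.seekg(0, std::ios::beg)`. The sketch below illustrates why, using a `std::istringstream` in place of the file stream; `CountRows` and `CountRowsNoRewind` are hypothetical names, not functions from the PR.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper mirroring RowSize(): rewinds, then counts lines.
std::size_t CountRows(std::istream& in)
{
  in.clear();                  // reset eofbit/failbit left by an earlier read
  in.seekg(0, std::ios::beg);  // move back to the start of the stream
  std::size_t rows = 0;
  std::string line;
  while (std::getline(in, line))
    ++rows;
  return rows;
}

// Same loop without the rewind, kept only for comparison.
std::size_t CountRowsNoRewind(std::istream& in)
{
  std::size_t rows = 0;
  std::string line;
  while (std::getline(in, line))
    ++rows;
  return rows;
}
```

After one full pass the stream is left in a failed state, so a second pass without the rewind reads nothing, while `CountRows` can be called repeatedly on the same stream and keeps returning the same count.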

private:
using iter_type = boost::iterator_range<std::string::iterator>;
@@ -79,7 +134,25 @@ class LoadCSV
}
};

bool CanOpen();
bool CanOpen()
{
if(!inFile.is_open())
{
if(fatalIfOpenFail)
{
Log::Fatal << "Cannot open file '" << fileName << "'. " << std::endl;
}
else
{
Log::Warn << "Cannot open file '" << fileName << "'; load failed."
<< std::endl;
}
return false;
}
inFile.unsetf(std::ios::skipws);

return true;
}
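
The `inFile.unsetf(std::ios::skipws)` call matters because a character iterator over a stream silently drops whitespace while the flag is set. The sketch below shows the effect with `std::istream_iterator<char>` as a stand-in for `boost::spirit::istream_iterator`; `ReadAll` is a hypothetical helper, and the assumption is that the Spirit iterator is affected by the flag in the same way.

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: reads `text` back through a char stream iterator,
// optionally clearing skipws first, as CanOpen() does for inFile.
std::string ReadAll(const std::string& text, bool keepWhitespace)
{
  std::istringstream in(text);
  if (keepWhitespace)
    in.unsetf(std::ios::skipws);  // keep spaces, tabs and newlines

  std::istream_iterator<char> first(in);
  std::istream_iterator<char> last;  // default-constructed == end of stream
  return std::string(first, last);
}
```

With the flag left set, `ReadAll("a b\tc\n", false)` loses the space, the tab and the newline; with it cleared, the text comes back unchanged. For a loader that relies on `'\t'`, `'\r'` and `'\n'` as delimiters, losing those characters would make rows and columns indistinguishable.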

template<typename T>
void NonTranposeParse(arma::Mat<T> &inout, DatasetInfo &infoSet)
@@ -260,7 +333,19 @@ class LoadCSV
}

boost::spirit::qi::rule<std::string::iterator, iter_type(), boost::spirit::ascii::space_type>
CreateCharRule() const;
CreateCharRule() const
{
using namespace boost::spirit;

if(extension == "csv" || extension == "txt")
{
return qi::raw[*~qi::char_(",\r\n")];
}
else
{
return qi::raw[*~qi::char_("\t\r\n")];
}
}
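
`CreateCharRule()` picks the field delimiter from the file extension: a comma for `csv` and `txt`, a tab otherwise. A rough standard-library sketch of that choice follows. `DelimiterFor` and `SplitFields` are hypothetical names, and unlike the Spirit rule (which is declared with a space skipper) this version does no whitespace skipping and copies every field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: same extension test as CreateCharRule().
char DelimiterFor(const std::string& extension)
{
  return (extension == "csv" || extension == "txt") ? ',' : '\t';
}

// Hypothetical helper: splits one line on `delimiter`, stopping at the
// first '\r' or '\n', which the rules above also exclude from a field.
std::vector<std::string> SplitFields(const std::string& line, char delimiter)
{
  std::vector<std::string> fields;
  std::string current;
  for (char c : line)
  {
    if (c == '\r' || c == '\n')
      break;                    // a field never spans a line ending
    if (c == delimiter)
    {
      fields.push_back(current);
      current.clear();
    }
    else
    {
      current += c;
    }
  }
  fields.push_back(current);    // the last field has no trailing delimiter
  return fields;
}
```

For example, `SplitFields("1\t2\tapple", DelimiterFor("tsv"))` should yield three fields, while the same line split with the `csv` delimiter stays a single field.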

std::string extension;
bool fatalIfOpenFail;