Optimize load csv 00 #681

Merged
merged 42 commits into from Mar 17, 2017

Conversation

Projects
None yet
3 participants
@stereomatchingkiss
Contributor

stereomatchingkiss commented Jun 6, 2016

Please refer to pull request #678

Edit :
Performance after bug fixed

transpose : 2422 msec
non transpose : 4228 msec

Drop a little bit

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 7, 2016

Contributor

Error message : Error downloading file: Unable to connect to the remote server

Do anyone know why this happen?Thanks

Contributor

stereomatchingkiss commented Jun 7, 2016

Error message : Error downloading file: Unable to connect to the remote server

Do anyone know why this happen?Thanks

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 7, 2016

Contributor

This time appveyor stop the build suddenly, do anyone know what is the error?

Contributor

stereomatchingkiss commented Jun 7, 2016

This time appveyor stop the build suddenly, do anyone know what is the error?

@zoq

This comment has been minimized.

Show comment
Hide comment
@zoq

zoq Jun 7, 2016

Member

All plans have a maximum build job execution time of 60 minutes. It looks like the changes you made increased the build time. Can you test the build time on your system e.g. by using time make with the changes and without?

Member

zoq commented Jun 7, 2016

All plans have a maximum build job execution time of 60 minutes. It looks like the changes you made increased the build time. Can you test the build time on your system e.g. by using time make with the changes and without?

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 7, 2016

Contributor

It looks like the changes you made increased the build time.

Ok, I will check it, Spirit2 suffer from long compile time.

Contributor

stereomatchingkiss commented Jun 7, 2016

It looks like the changes you made increased the build time.

Ok, I will check it, Spirit2 suffer from long compile time.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 8, 2016

Contributor

OS : win10
compiler : visual c++ 2015 64bits, install update package2
laptop : Y410P

Build time before using boost::spirit--around 17 minutes
Build time after using boost::spirit--around 29 minutes

both of them do not run tests, some CLI tools have build errors, because the global variable in random.hpp cannot find the symbol(ex : mlpack_cf).

Do not know why the compile time increase so much, although the test cases and CLI use the load functions a lot, but I guess most of them do not need to do mappings, so the function should not need to compile LoadCSV over and over again.

Contributor

stereomatchingkiss commented Jun 8, 2016

OS : win10
compiler : visual c++ 2015 64bits, install update package2
laptop : Y410P

Build time before using boost::spirit--around 17 minutes
Build time after using boost::spirit--around 29 minutes

both of them do not run tests, some CLI tools have build errors, because the global variable in random.hpp cannot find the symbol(ex : mlpack_cf).

Do not know why the compile time increase so much, although the test cases and CLI use the load functions a lot, but I guess most of them do not need to do mappings, so the function should not need to compile LoadCSV over and over again.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 8, 2016

Contributor

After I split part of the implementation details into load_csv.cpp, the travis build always failed, complain about "load_impl.hpp:320:5: error: ‘LoadCSV’ was not declared in this scope".

It can compile on vc2015 but failed on travis build(g++?clang?).Do anyone know the reasons?Thanks

Contributor

stereomatchingkiss commented Jun 8, 2016

After I split part of the implementation details into load_csv.cpp, the travis build always failed, complain about "load_impl.hpp:320:5: error: ‘LoadCSV’ was not declared in this scope".

It can compile on vc2015 but failed on travis build(g++?clang?).Do anyone know the reasons?Thanks

@zoq

This comment has been minimized.

Show comment
Hide comment
@zoq

zoq Jun 8, 2016

Member

Add the following at the end of load_csv.hpp

// Include implementation.
#include "load_csv.cpp"
Member

zoq commented Jun 8, 2016

Add the following at the end of load_csv.hpp

// Include implementation.
#include "load_csv.cpp"
@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Jun 8, 2016

Member

But in that case it should be load_csv_impl.hpp not load_csv.cpp, if we are including it. It seems to me like we just need to add load_csv.cpp to the CMakeLists.txt in that directory.

Member

rcurtin commented Jun 8, 2016

But in that case it should be load_csv_impl.hpp not load_csv.cpp, if we are including it. It seems to me like we just need to add load_csv.cpp to the CMakeLists.txt in that directory.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 8, 2016

Contributor

Add the following at the end of load_csv.hpp

I think this would cause duplicate symbols, because I did not make the codes inline

we just need to add load_csv.cpp to the CMakeLists.txt in that directory.

I add load_csv.cpp to the CMakeLists.txt in this commit, still got the same error(the order matters?).

I include the header file, and the cpp into the CMakeLists.txt, make sure the include guard do not repeat with other headers(MLPACK_CORE_DATA_LOAD_CSV_HPP). Which steps I missed?

Contributor

stereomatchingkiss commented Jun 8, 2016

Add the following at the end of load_csv.hpp

I think this would cause duplicate symbols, because I did not make the codes inline

we just need to add load_csv.cpp to the CMakeLists.txt in that directory.

I add load_csv.cpp to the CMakeLists.txt in this commit, still got the same error(the order matters?).

I include the header file, and the cpp into the CMakeLists.txt, make sure the include guard do not repeat with other headers(MLPACK_CORE_DATA_LOAD_CSV_HPP). Which steps I missed?

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Jun 8, 2016

Member

You're right, I didn't see that, sorry. Just looking at the code, it looks like it should compile, so I am not sure why the LoadCSV symbol is not found. I have not had time to check out the branch and try to compile myself though.

Member

rcurtin commented Jun 8, 2016

You're right, I didn't see that, sorry. Just looking at the code, it looks like it should compile, so I am not sure why the LoadCSV symbol is not found. I have not had time to check out the branch and try to compile myself though.

stereomatchingkiss added some commits Jun 8, 2016

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jun 8, 2016

Contributor

You're right, I didn't see that, sorry

Don't need to apologize, this is a weird bug. If I put all of the implementation back to the load_csv.hpp, maybe it would work again. If it works, I could try to separate part of the implementation details of LoadCSV, figure out which part cause the troubles step by step.

Contributor

stereomatchingkiss commented Jun 8, 2016

You're right, I didn't see that, sorry

Don't need to apologize, this is a weird bug. If I put all of the implementation back to the load_csv.hpp, maybe it would work again. If it works, I could try to separate part of the implementation details of LoadCSV, figure out which part cause the troubles step by step.

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Jul 22, 2016

Member

In this case we'd just have to include it as a 3rd party library. I don't like doing that, because now we have to be sure to keep it up to date, but maybe git submodules or something can be used here. Before we get that far, though, does the fast CSV parser outperform boost::spirit?

Member

rcurtin commented Jul 22, 2016

In this case we'd just have to include it as a 3rd party library. I don't like doing that, because now we have to be sure to keep it up to date, but maybe git submodules or something can be used here. Before we get that far, though, does the fast CSV parser outperform boost::spirit?

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jul 23, 2016

Contributor

does the fast CSV parser outperform boost::spirit?

Will give it a test later on, it is not an easy task to give a fair
test.Because fast csv use their own buffer to store the data, instead of
reading the line one by one, fast csv read a lot of lines at once, this
reduce the overhead of IO. boost spirit can do the same thing too, need
some adjustment. Besides, boost spirit support semantic actions, fast csv
do not support.

now we have to be sure to keep it up to date, but maybe git submodules or
something can be used here

I think this cannot work, because the codes need some changes, unless the
author agree with the change(I can ask him). However, the codes are not
very complicated and well written, it is not hard to maintain it(of course,
not as easy as boost::spirit).

Contributor

stereomatchingkiss commented Jul 23, 2016

does the fast CSV parser outperform boost::spirit?

Will give it a test later on, it is not an easy task to give a fair
test.Because fast csv use their own buffer to store the data, instead of
reading the line one by one, fast csv read a lot of lines at once, this
reduce the overhead of IO. boost spirit can do the same thing too, need
some adjustment. Besides, boost spirit support semantic actions, fast csv
do not support.

now we have to be sure to keep it up to date, but maybe git submodules or
something can be used here

I think this cannot work, because the codes need some changes, unless the
author agree with the change(I can ask him). However, the codes are not
very complicated and well written, it is not hard to maintain it(of course,
not as easy as boost::spirit).

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jul 23, 2016

Contributor

does the fast CSV parser outperform boost::spirit?

BOOST_AUTO_TEST_CASE(FastCSVSpeed)
{
    io::CSVReader<> reader(7, "big_file.csv");
    size_t line_num = 0;
    auto const t1 =
            std::chrono::high_resolution_clock::now();
    std::vector<std::string> val(7);
    //I will create an api support a vector
    while(reader.ReadRow(val[0], val[1], val[2],
                         val[3], val[4], val[5],
                         val[6])){        
    }
    auto const t2 =
            std::chrono::high_resolution_clock::now();
    auto const duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout<<"line num "<<line_num<<std::endl;
    std::cout<<"duration "<<duration<<std::endl;
}

BOOST_AUTO_TEST_CASE(boostSpiritSpeed)
{
    using namespace boost::spirit;

    using iter_type = boost::iterator_range<char*>;

    io::LineReader reader("big_file.csv");
    auto const t1 =
            std::chrono::high_resolution_clock::now();
    int line = 0;
    auto parse_str = [&](iter_type const &)
    {
        std::cout<<line++<<",";
    };
    qi::rule<char*, iter_type(), ascii::space_type> charRule =
            qi::raw[*~qi::char_(",")];
    while(auto *line = reader.next_line()){
        qi::phrase_parse(line, line + std::strlen(line),
                         charRule % ",", ascii::space);       
    }
    auto const t2 =
            std::chrono::high_resolution_clock::now();
    auto const duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout<<"duration "<<duration<<std::endl;
}

fast csv : 192
boost spirit : 224

In this test, fast csv save the string into std::string already, but spirit haven't.
If I push the string into std::string, spirit took 312 ms to finish the task.

This is just a small test, to implement the Load functions, we need more sophisticated codes, but this small test show us fast csv parser is fast enough.

Edit : maybe some function like string to int, spirit is faster, we may combine them together after the compile issue has been solved

Contributor

stereomatchingkiss commented Jul 23, 2016

does the fast CSV parser outperform boost::spirit?

BOOST_AUTO_TEST_CASE(FastCSVSpeed)
{
    io::CSVReader<> reader(7, "big_file.csv");
    size_t line_num = 0;
    auto const t1 =
            std::chrono::high_resolution_clock::now();
    std::vector<std::string> val(7);
    //I will create an api support a vector
    while(reader.ReadRow(val[0], val[1], val[2],
                         val[3], val[4], val[5],
                         val[6])){        
    }
    auto const t2 =
            std::chrono::high_resolution_clock::now();
    auto const duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout<<"line num "<<line_num<<std::endl;
    std::cout<<"duration "<<duration<<std::endl;
}

BOOST_AUTO_TEST_CASE(boostSpiritSpeed)
{
    using namespace boost::spirit;

    using iter_type = boost::iterator_range<char*>;

    io::LineReader reader("big_file.csv");
    auto const t1 =
            std::chrono::high_resolution_clock::now();
    int line = 0;
    auto parse_str = [&](iter_type const &)
    {
        std::cout<<line++<<",";
    };
    qi::rule<char*, iter_type(), ascii::space_type> charRule =
            qi::raw[*~qi::char_(",")];
    while(auto *line = reader.next_line()){
        qi::phrase_parse(line, line + std::strlen(line),
                         charRule % ",", ascii::space);       
    }
    auto const t2 =
            std::chrono::high_resolution_clock::now();
    auto const duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout<<"duration "<<duration<<std::endl;
}

fast csv : 192
boost spirit : 224

In this test, fast csv save the string into std::string already, but spirit haven't.
If I push the string into std::string, spirit took 312 ms to finish the task.

This is just a small test, to implement the Load functions, we need more sophisticated codes, but this small test show us fast csv parser is fast enough.

Edit : maybe some function like string to int, spirit is faster, we may combine them together after the compile issue has been solved

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Jan 9, 2017

Member

Hmm, so I have spent a while thinking about this and playing with it and I should have updated this ticket months ago. Sorry for the slowness. I realized that maybe you have more time to try and figure things out so I should write down what I've been working on. :)

I really like the boost::spirit solution more than the fast CSV parser because it's more extensible---if we have code to read CSVs with boost::spirit, we can easily extend that to ARFF, bin, txt, and other formats. But with the fast CSV parser it's less clear how to do that.

However the main problem with boost::spirit is the compile time issue. I think that this can work as a solution:

  • Encapsulate all of the boost::spirit code in files other than load.hpp and load_impl.hpp. These files should not be included by mlpack/core.hpp or mlpack/prereqs.hpp.
  • Put extern template instantiations in load.hpp:
// Declare that Load() for all common matrix types is instantiated somewhere else.
extern template bool Load<arma::mat>();
extern template bool Load<arma::Mat<size_t>>();
extern template bool Load<arma::fmat>();
extern template bool Load<arma::umat>();
extern template bool Load<arma::imat>();
...
  • Create a load.cpp file that actually instantiates these:
template bool Load<arma::mat>();
template bool Load<arma::Mat<size_t>>();
template bool Load<arma::fmat>();
template bool Load<arma::umat>();
template bool Load<arma::imat>();
...

This way we avoid including boost::spirit in most of the mlpack compile. If the user wants to load a type where we haven't explicitly instantiated data::Load(), then they can additionally include the file with the boost::spirit definitions. (We'll have to be clear about that in the documentation.)

I did some benchmarking some time ago and the majority of mlpack compile time is spent parsing the input---so if we want to reduce the compile time significantly then we just need to figure out how to avoid including Boost in everything unless we actually need it. One of the PRs I am working on strips out program_options headers from being included, which should help. #841 is also a nice contribution in that direction.

What do you think? Is the idea reasonable? If you think it is then we can do this. @zoq, I'd be interested in your input on this type of strategy to reduce compile times too. :)

Member

rcurtin commented Jan 9, 2017

Hmm, so I have spent a while thinking about this and playing with it and I should have updated this ticket months ago. Sorry for the slowness. I realized that maybe you have more time to try and figure things out so I should write down what I've been working on. :)

I really like the boost::spirit solution more than the fast CSV parser because it's more extensible---if we have code to read CSVs with boost::spirit, we can easily extend that to ARFF, bin, txt, and other formats. But with the fast CSV parser it's less clear how to do that.

However the main problem with boost::spirit is the compile time issue. I think that this can work as a solution:

  • Encapsulate all of the boost::spirit code in files other than load.hpp and load_impl.hpp. These files should not be included by mlpack/core.hpp or mlpack/prereqs.hpp.
  • Put extern template instantiations in load.hpp:
// Declare that Load() for all common matrix types is instantiated somewhere else.
extern template bool Load<arma::mat>();
extern template bool Load<arma::Mat<size_t>>();
extern template bool Load<arma::fmat>();
extern template bool Load<arma::umat>();
extern template bool Load<arma::imat>();
...
  • Create a load.cpp file that actually instantiates these:
template bool Load<arma::mat>();
template bool Load<arma::Mat<size_t>>();
template bool Load<arma::fmat>();
template bool Load<arma::umat>();
template bool Load<arma::imat>();
...

This way we avoid including boost::spirit in most of the mlpack compile. If the user wants to load a type where we haven't explicitly instantiated data::Load(), then they can additionally include the file with the boost::spirit definitions. (We'll have to be clear about that in the documentation.)

I did some benchmarking some time ago and the majority of mlpack compile time is spent parsing the input---so if we want to reduce the compile time significantly then we just need to figure out how to avoid including Boost in everything unless we actually need it. One of the PRs I am working on strips out program_options headers from being included, which should help. #841 is also a nice contribution in that direction.

What do you think? Is the idea reasonable? If you think it is then we can do this. @zoq, I'd be interested in your input on this type of strategy to reduce compile times too. :)

@zoq

This comment has been minimized.

Show comment
Hide comment
@zoq

zoq Jan 11, 2017

Member

I also think the boost::spirit is the superior solution, mainly because we don't have to worry about keeping the code up to date. As @stereomatchingkiss already said, we have make some changes to use the Fast C++ CSV Parser project, means we can't use the code right of the box. Also, sorry for the slowness from my side, it looks like you put some serious efforts into #735 which already uses the Fast C++ CSV Parser project.

The solution @rcurtin suggested sounds promising; have to test it to make sure it's working out. We could also look into precompiled headers, definitely takes some effort to get it right but is another idea.

Member

zoq commented Jan 11, 2017

I also think the boost::spirit is the superior solution, mainly because we don't have to worry about keeping the code up to date. As @stereomatchingkiss already said, we have make some changes to use the Fast C++ CSV Parser project, means we can't use the code right of the box. Also, sorry for the slowness from my side, it looks like you put some serious efforts into #735 which already uses the Fast C++ CSV Parser project.

The solution @rcurtin suggested sounds promising; have to test it to make sure it's working out. We could also look into precompiled headers, definitely takes some effort to get it right but is another idea.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Jan 11, 2017

Contributor

Thanks for your suggestions, I am not familiar with extern template or precompile header, need some time to study and experiment with them. If compile issues could be solved, I agree boost spirit is a more flexible solution.

ps : Long time haven't deal with this issue, there are more things changed than I expected. Will try to come up a solution on tomorrow.

Contributor

stereomatchingkiss commented Jan 11, 2017

Thanks for your suggestions, I am not familiar with extern template or precompile header, need some time to study and experiment with them. If compile issues could be solved, I agree boost spirit is a more flexible solution.

ps : Long time haven't deal with this issue, there are more things changed than I expected. Will try to come up a solution on tomorrow.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 5, 2017

Contributor

I refresh my memory on this pull request and study how to make use of extern template today. I think this idea should work, however, there are two template parameters to specialize

template<typename eT, typename PolicyType>
bool Load(const std::string& filename,
          arma::Mat<eT>& matrix,
          DatasetMapper<PolicyType>& info,
          const bool fatal = false,
          const bool transpose = true);

I would assume the PolicyType is IncrementPolicy and declare the extern template as following in you think it is ok.

extern template Load<int, IncrementPolicy>();
extern template Load<size_t, IncrementPolicy>();
extern template Load<float, IncrementPolicy>();
extern template Load<double, IncrementPolicy>();
Contributor

stereomatchingkiss commented Feb 5, 2017

I refresh my memory on this pull request and study how to make use of extern template today. I think this idea should work, however, there are two template parameters to specialize

template<typename eT, typename PolicyType>
bool Load(const std::string& filename,
          arma::Mat<eT>& matrix,
          DatasetMapper<PolicyType>& info,
          const bool fatal = false,
          const bool transpose = true);

I would assume the PolicyType is IncrementPolicy and declare the extern template as following in you think it is ok.

extern template Load<int, IncrementPolicy>();
extern template Load<size_t, IncrementPolicy>();
extern template Load<float, IncrementPolicy>();
extern template Load<double, IncrementPolicy>();
@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 5, 2017

Contributor

Still the same error messages(pull and merge the codes from newest commit already), "load_impl.hpp:374:5: error: ‘LoadCSV’ was not declared in this scope
LoadCSV loader(filename, fatal);"

I saw some suspicious messages before this error pop out

"In file included from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/dataset_mapper.hpp:18:0,
from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/load.hpp:23,
from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/load.cpp:1:
/home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/../mlpack/prereqs.hpp:20:17: note: #pragma message: Armadillo was included before mlpack; this can sometimes cause problems. It should only be necessary to include <mlpack/core.hpp> and not .
#pragma message "Armadillo was included before mlpack; this can sometimes cause"

Is this the reason?

Contributor

stereomatchingkiss commented Feb 5, 2017

Still the same error messages(pull and merge the codes from newest commit already), "load_impl.hpp:374:5: error: ‘LoadCSV’ was not declared in this scope
LoadCSV loader(filename, fatal);"

I saw some suspicious messages before this error pop out

"In file included from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/dataset_mapper.hpp:18:0,
from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/load.hpp:23,
from /home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/core/data/load.cpp:1:
/home/ramsus/Qt/3rdLibs/mlpack/src/mlpack/../mlpack/prereqs.hpp:20:17: note: #pragma message: Armadillo was included before mlpack; this can sometimes cause problems. It should only be necessary to include <mlpack/core.hpp> and not .
#pragma message "Armadillo was included before mlpack; this can sometimes cause"

Is this the reason?

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Feb 6, 2017

Member

I think it's fine to specialize with IncrementPolicy.

I think, maybe in your code somewhere, you have an #include <armadillo>? That shouldn't be necessary, better to use #include <mlpack/core.hpp> or #include <mlpack/prereqs.hpp> since those include Armadillo with the necessary modifications for mlpack.

If you push your most recent code, I can try to build it and reproduce the error.

Member

rcurtin commented Feb 6, 2017

I think it's fine to specialize with IncrementPolicy.

I think, maybe in your code somewhere, you have an #include <armadillo>? That shouldn't be necessary, better to use #include <mlpack/core.hpp> or #include <mlpack/prereqs.hpp> since those include Armadillo with the necessary modifications for mlpack.

If you push your most recent code, I can try to build it and reproduce the error.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 10, 2017

Contributor

I think it's fine to specialize with IncrementPolicy.

Got it

I think, maybe in your code somewhere, you have an #include ?

Nope, I am sure my codes do not manually include armadillo header

If you push your most recent code, I can try to build it and reproduce the error.

I will clean up the codes and push it to this branch on this week

Contributor

stereomatchingkiss commented Feb 10, 2017

I think it's fine to specialize with IncrementPolicy.

Got it

I think, maybe in your code somewhere, you have an #include ?

Nope, I am sure my codes do not manually include armadillo header

If you push your most recent code, I can try to build it and reproduce the error.

I will clean up the codes and push it to this branch on this week

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Feb 10, 2017

Member

Sounds good, once you push, I'll try and reproduce the issue.

Member

rcurtin commented Feb 10, 2017

Sounds good, once you push, I'll try and reproduce the issue.

stereomatchingkiss added some commits Feb 12, 2017

1 : use extern template to export part of the implementation of Load …
…function

2 : use LoadCSV to relpace implementation of the part of csv loading
@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 12, 2017

Contributor

I push recent implementaion onto the server, to make sure I do not accidentally include header of armadillo, I use grep to find the key word

grep -rnw '../mlpack' -e '<armadillo>'

Results

../mlpack/core/arma_extend/arma_extend.hpp:52:#include <armadillo>
../mlpack/prereqs.hpp:22:<armadillo>."
Contributor

stereomatchingkiss commented Feb 12, 2017

I push recent implementaion onto the server, to make sure I do not accidentally include header of armadillo, I use grep to find the key word

grep -rnw '../mlpack' -e '<armadillo>'

Results

../mlpack/core/arma_extend/arma_extend.hpp:52:#include <armadillo>
../mlpack/prereqs.hpp:22:<armadillo>."
@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Feb 14, 2017

Member

This PR fixes it: stereomatchingkiss#1

My timing simulations show that before, using boost::spirit, time make -j4 gave:

real    11m9.445s
user    41m18.020s
sys     1m17.780s

and with the changes in the PR so that the extern templates are working,

real    8m28.534s
user    31m20.616s
sys     0m59.156s

Big improvement! I think this is a successful change then.

Member

rcurtin commented Feb 14, 2017

This PR fixes it: stereomatchingkiss#1

My timing simulations show that before, using boost::spirit, time make -j4 gave:

real    11m9.445s
user    41m18.020s
sys     1m17.780s

and with the changes in the PR so that the extern templates are working,

real    8m28.534s
user    31m20.616s
sys     0m59.156s

Big improvement! I think this is a successful change then.

rcurtin and others added some commits Feb 14, 2017

Merge pull request #1 from rcurtin/csvloadtest
Fix extern template build
1 : fix bug, wrong constructor
2 : use std::string to replace raw buffer
@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 19, 2017

Contributor

I think this pull request is ready to merged. After it merged I can start to optimized/refactor another part of load functions.

Contributor

stereomatchingkiss commented Feb 19, 2017

I think this pull request is ready to merged. After it merged I can start to optimized/refactor another part of load functions.

@rcurtin

Looks good to me other than the couple of comments I added. I think there are some style bits that could be fixed but I can do that after merge if you prefer. I guess, after this merge, you plan to make data::Load() for matrices use LoadCSV() when the type is CSV? I think that will have some great runtime gains.

If you like we can open some issues for other formats too, and maybe someone can implement readers for those with boost::spirit too.

src/mlpack/core/data/dataset_info.hpp
+ * Datatype::categorical) as well as mappings from strings to unsigned integers
+ * and vice versa.
+ */
+class DatasetInfo

This comment has been minimized.

@rcurtin

rcurtin Feb 21, 2017

Member

Is this file included by accident?

@rcurtin

rcurtin Feb 21, 2017

Member

Is this file included by accident?

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Sorry, did not notice this error, my fault

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Sorry, did not notice this error, my fault

+template bool Load<size_t>(const std::string&, arma::Mat<size_t>&, const bool, const bool);
+template bool Load<float>(const std::string&, arma::Mat<float>&, const bool, const bool);
+template bool Load<double>(const std::string&, arma::Mat<double>&, const bool, const bool);
+template bool Load<unsigned long long>(const std::string&, arma::Mat<unsigned long long>&, const bool, const bool);

This comment has been minimized.

@rcurtin

rcurtin Feb 21, 2017

Member

I wonder if this will have a problem if size_t == unsigned long long. Not sure what system we would need to check to see (I think that's true on Windows? Not certain).

@rcurtin

rcurtin Feb 21, 2017

Member

I wonder if this will have a problem if size_t == unsigned long long. Not sure what system we would need to check to see (I think that's true on Windows? Not certain).

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

size_t == unsigned long long under windows environment(64bits) as you suspect.
Easy way to solve this is using preprocessor to control compile time choice, any ideas?

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

size_t == unsigned long long under windows environment(64bits) as you suspect.
Easy way to solve this is using preprocessor to control compile time choice, any ideas?

@@ -0,0 +1,102 @@
+#include "load_csv.hpp"

This comment has been minimized.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add a file header here with the license too.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add a file header here with the license too.

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Done

@@ -0,0 +1,33 @@
+#include "load.hpp"

This comment has been minimized.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add a file header here with the license.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add a file header here with the license.

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Hi, load.hpp already got the license but load_csv.hpp did not. I fixed that part already

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Hi, load.hpp already got the license but load_csv.hpp did not. I fixed that part already

+ * @file load_csv.hpp
+ * @author ThamNgapWei
+ *
+ * This is a csv parsers which use to parse the csv file format

This comment has been minimized.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add the license to this file.

@rcurtin

rcurtin Feb 21, 2017

Member

We should add the license to this file.

This comment has been minimized.

@stereomatchingkiss

stereomatchingkiss Feb 25, 2017

Contributor

Done

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Feb 23, 2017

Contributor

I think there are some style bits that could be fixed but I can do that after merge if you prefer.

I will fix them on this week

I guess, after this merge, you plan to make data::Load() for matrices use LoadCSV() when the type is CSV? I think that will have some great runtime gains.

Leave it to me, this should be done without much trouble

If you like we can open some issues for other formats too, and maybe someone can implement readers for those with boost::spirit too.

Agree with this. I wonder could we make this as part of GSOC project?

Contributor

stereomatchingkiss commented Feb 23, 2017

I think there are some style bits that could be fixed but I can do that after merge if you prefer.

I will fix them on this week

I guess, after this merge, you plan to make data::Load() for matrices use LoadCSV() when the type is CSV? I think that will have some great runtime gains.

Leave it to me, this should be done without much trouble

If you like we can open some issues for other formats too, and maybe someone can implement readers for those with boost::spirit too.

Agree with this. I wonder could we make this as part of GSOC project?

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Feb 27, 2017

Member

Yes, I think maybe this could be a good GSoC project if you want to add some ideas or open some issues or something!

Do you want to go ahead and merge this? I'll release mlpack 2.2.0 once this is merged if so. :)

Member

rcurtin commented Feb 27, 2017

Yes, I think maybe this could be a good GSoC project if you want to add some ideas or open some issues or something!

Do you want to go ahead and merge this? I'll release mlpack 2.2.0 once this is merged if so. :)

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Mar 2, 2017

Contributor

Yes, I think maybe this could be a good GSoC project if you want to add some ideas or open some issues or something!

In that case I would come up something on this week and congratulation for acceptance of GSOC

Do you want to go ahead and merge this? I'll release mlpack 2.2.0 once this is merged if so. :)

Yes, could you help me fixed the problem of "size_t == unsigned long long"? Any idea better than preprocessor?Only this problem is unfixed. Thanks

Contributor

stereomatchingkiss commented Mar 2, 2017

Yes, I think maybe this could be a good GSoC project if you want to add some ideas or open some issues or something!

In that case I would come up something on this week and congratulation for acceptance of GSOC

Do you want to go ahead and merge this? I'll release mlpack 2.2.0 once this is merged if so. :)

Yes, could you help me fixed the problem of "size_t == unsigned long long"? Any idea better than preprocessor?Only this problem is unfixed. Thanks

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Mar 2, 2017

Member

Sure, I will try to work on that today and then merge this, then I can work on the release (although that may be tomorrow).

Member

rcurtin commented Mar 2, 2017

Sure, I will try to work on that today and then merge this, then I can work on the release (although that may be tomorrow).

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Mar 12, 2017

Contributor

I did the fix(by preprocessor, the only solution I can come up by now) to avoid instantiation of the same signature of Load function under windows.Could you tell me where should I upload the proposals of GSOC?Thanks

Contributor

stereomatchingkiss commented Mar 12, 2017

I did the fix(by preprocessor, the only solution I can come up by now) to avoid instantiation of the same signature of Load function under windows.Could you tell me where should I upload the proposals of GSOC?Thanks

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Mar 14, 2017

Member

Things always fall behind schedule for me. :( Thanks for taking care of it.

For GSoC, students are supposed to upload their own proposals to the website. If you mean where to put ideas on the ideas list, that's here: https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas . You can just add an idea to that page.

Member

rcurtin commented Mar 14, 2017

Things always fall behind schedule for me. :( Thanks for taking care of it.

For GSoC, students are supposed to upload their own proposals to the website. If you mean where to put ideas on the ideas list, that's here: https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas . You can just add an idea to that page.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Mar 17, 2017

Contributor
Contributor

stereomatchingkiss commented Mar 17, 2017

@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Mar 17, 2017

Member

Sure, if you want to merge feel free; I was going to handle it tomorrow morning and release mlpack 2.2.0 with this support in it.

Member

rcurtin commented Mar 17, 2017

Sure, if you want to merge feel free; I was going to handle it tomorrow morning and release mlpack 2.2.0 with this support in it.

@stereomatchingkiss stereomatchingkiss merged commit 96d201e into mlpack:master Mar 17, 2017

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@rcurtin

This comment has been minimized.

Show comment
Hide comment
@rcurtin

rcurtin Mar 21, 2017

Member

Hey, thanks again for this support. It is really nice and important to provide fast loading. I spent some time with the code and realized it introduced a few problems that I fixed:

If I broke anything, let me know. It should be easy to add future support for loading csv/txt/tsv without a DatasetMapper object.

Member

rcurtin commented Mar 21, 2017

Hey, thanks again for this support. It is really nice and important to provide fast loading. I spent some time with the code and realized it introduced a few problems that I fixed:

If I broke anything, let me know. It should be easy to add future support for loading csv/txt/tsv without a DatasetMapper object.

@stereomatchingkiss

This comment has been minimized.

Show comment
Hide comment
@stereomatchingkiss

stereomatchingkiss Mar 21, 2017

Contributor
Contributor

stereomatchingkiss commented Mar 21, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment