The format of the sparse instances in the input file #656

Closed
YaweiZhao opened this issue May 29, 2016 · 4 comments

Comments

@YaweiZhao

Hi,

Since the training data in many datasets is organized as a sparse matrix, what format should sparse training instances take in the input file? For example, if a training instance has 1000 dimensions and 100 of them are nonzero, is the instance written in the input file as 900 zeros and 100 nonzero values? The sparse datasets published on the LIBSVM website use the format "dimension_id : value".

Best wishes,

Yawei

@rcurtin
Member

rcurtin commented May 30, 2016

Hi Yawei,

mlpack unfortunately doesn't have any current support for loading sparse matrices from disk. In addition, because of this, the command-line programs only load dense data.

So if you want to use sparse data specifically, I think the best way is to write a C++ program using arma::sp_mat. Unfortunately, to make things harder, Armadillo does not document its support for loading sparse matrices well. You can load a coordinate list of the form

1 2 10.3
3 1 5.2
3 2 1.3

and this represents a matrix with three nonzero elements. You can load it using the function

arma::sp_mat m;
m.load("file.txt", arma::coord_ascii);

and then you can use that in mlpack methods. I wish that this was documented in the Armadillo docs but currently it is not.
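Since the data in question is in the LIBSVM "dimension_id : value" style, a small converter to the coordinate-list form may be useful. What follows is only a rough sketch under some assumptions, not something mlpack or Armadillo provides: the names `Triplet`, `LibsvmToTriplets`, and `TripletsToCoordText` are made up for illustration, tokens are assumed to look like `index:value` with no spaces around the colon, feature indices in the input are assumed to be 1-based, and the output is assumed to want 0-based `row col value` lines with one column per instance (mlpack treats each column as a point). Labels are simply skipped. Please check the index base against the Armadillo version you use before relying on it.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One nonzero entry of a sparse matrix: (row, column, value).
struct Triplet
{
  std::size_t row;
  std::size_t col;
  double value;
};

// Hypothetical helper (not part of mlpack or Armadillo): read LIBSVM-style
// lines of the form "label idx:val idx:val ..." and return one triplet per
// stored entry.
// Assumptions: input feature indices are 1-based; output indices are 0-based,
// with one row per feature and one column per instance. Labels are skipped.
std::vector<Triplet> LibsvmToTriplets(std::istream& in)
{
  std::vector<Triplet> out;
  std::string line;
  std::size_t instance = 0;
  while (std::getline(in, line))
  {
    std::istringstream tokens(line);
    std::string token;
    if (!(tokens >> token))
      continue; // Skip blank lines; they do not count as instances.

    // The first token was the label; the remaining ones are "index:value".
    while (tokens >> token)
    {
      const std::size_t colon = token.find(':');
      if (colon == std::string::npos)
        continue; // Ignore tokens without a colon.

      const std::size_t index = std::stoul(token.substr(0, colon));
      const double value = std::stod(token.substr(colon + 1));
      if (index == 0)
        continue; // Not a valid 1-based index under the assumption above.

      out.push_back({index - 1, instance, value});
    }
    ++instance;
  }
  return out;
}

// Format the triplets as "row col value" lines, one per entry. Values are
// written with the default stream precision (6 significant digits).
std::string TripletsToCoordText(const std::vector<Triplet>& triplets)
{
  std::ostringstream out;
  for (const Triplet& t : triplets)
    out << t.row << ' ' << t.col << ' ' << t.value << '\n';
  return out.str();
}
```

The idea would be to write the string returned by `TripletsToCoordText` to a file and then load that file with `m.load("file.txt", arma::coord_ascii)` as above. One thing to watch: the loaded matrix is presumably only as large as the largest indices in the file, so trailing all-zero features or instances may need to be accounted for separately.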

I hope this is helpful... let me know if I can clarify anything.

@YaweiZhao
Author

Hi Ryan,

Thanks for your answer! It really helps me understand how to use mlpack. Could I (or you) add your suggestion to the mlpack documentation?

@rcurtin
Member

rcurtin commented May 31, 2016

I updated the documentation in e36eec5; it's online at
http://mlpack.org/docs/mlpack-git/doxygen.php?doc=formatdoc.html

Let me know what you think, or if anything can be clarified. I'll mark this as resolved since I've updated the documentation, but let me know if there is anything else to be done.

Thanks for pointing this out!

Ryan

@YaweiZhao
Author

Hi Ryan,

I have read the updated documentation. I think it is clear and understandable. Thanks for your time. Nice work!

Yawei
