LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980
I'm trying to understand in what situation I would use a pre-populated |
Looks like I have a failing test on macOS to debug, but I can still describe the use case before I fix that. In the CSV you gave, there is no ambiguity: the string-valued columns are definitely categorical. But what if I represented that string column as integer values instead?
In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a Another instance is in, e.g., the |
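To make the ambiguity concrete, here is a minimal sketch of type inference in a CSV reader. It is a simplified model, not mlpack's actual parser: `ColumnType`, `IsNumeric()`, and `InferTypes()` are hypothetical names, and the rule assumed here is that a column is marked categorical only when some token in it fails to parse as a number.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical column types for this sketch (not mlpack's own enum).
enum class ColumnType { Numeric, Categorical };

// Returns true if the entire token parses as a floating-point number.
bool IsNumeric(const std::string& token)
{
  if (token.empty())
    return false;
  std::size_t pos = 0;
  try
  {
    std::stod(token, &pos);
  }
  catch (const std::exception&)
  {
    return false;
  }
  return pos == token.size();
}

// Infers one type per column from comma-separated rows.  A column becomes
// categorical only if at least one of its tokens is not a number.
std::vector<ColumnType> InferTypes(const std::vector<std::string>& rows)
{
  std::vector<ColumnType> types;
  for (const std::string& row : rows)
  {
    std::istringstream line(row);
    std::string token;
    for (std::size_t col = 0; std::getline(line, token, ','); ++col)
    {
      if (col >= types.size())
        types.push_back(ColumnType::Numeric);
      if (!IsNumeric(token))
        types[col] = ColumnType::Categorical;
    }
  }
  return types;
}
```

With made-up rows such as `"1.5,2.0,red"` the third column is inferred as categorical, but once the same column is written as `"1.5,2.0,0"` every column is inferred as numeric, so the categorical information has to be supplied by the caller.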
I get the idea, but in the example case I don't see any problem with converting the categorical data to numeric; it will still work. Maybe there is a dataset out there that requires us to avoid the auto-conversion. I guess there is a use case when the data format changes between training and test set.
Actually the pokerhand dataset is a good example of this.
The columns are these:
Now you could treat the values of each card as numeric, because there is ordering, but for the suits there is no ordering, and so if you try to learn a model assuming that the suits are numeric and not categorical, the model will perform very poorly. 👍
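One way to see why numeric suits mislead a model is to compare distances. The sketch below assumes four suits with hypothetical integer codes 0 to 3, and uses one-hot encoding as one common way to represent a categorical feature (it is not claimed to be what mlpack does internally). Treated as numbers, suit 0 looks three times farther from suit 3 than from suit 1; one-hot encoded, every pair of distinct suits is equally far apart.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Assumption for this sketch: four suits, coded 0..3.
constexpr std::size_t kNumSuits = 4;

// Distance between two suits when the codes are treated as plain numbers.
double NumericDistance(int suitA, int suitB)
{
  return std::abs(static_cast<double>(suitA - suitB));
}

// One-hot vector for a suit: 1.0 at the suit's index, 0.0 elsewhere.
// Throws std::out_of_range if suit >= kNumSuits.
std::array<double, kNumSuits> OneHot(std::size_t suit)
{
  std::array<double, kNumSuits> encoded{};
  encoded.at(suit) = 1.0;
  return encoded;
}

// Euclidean distance between the one-hot vectors of two suits.
double OneHotDistance(std::size_t suitA, std::size_t suitB)
{
  const std::array<double, kNumSuits> a = OneHot(suitA);
  const std::array<double, kNumSuits> b = OneHot(suitB);
  double sum = 0.0;
  for (std::size_t i = 0; i < kNumSuits; ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(sum);
}
```

Here `NumericDistance(0, 3)` is 3 while `NumericDistance(0, 1)` is 1, an ordering the suits do not have, whereas `OneHotDistance(0, 3)` and `OneHotDistance(0, 1)` are equal.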
That is a great example; now this makes even more sense to me personally 👍
Thanks for all the clarification comments 👍
Second approval provided automatically after 24 hours. 👍
I was trying to put together an example for @RishabhGarg108 of how to use a pre-populated `DatasetInfo` object to load categorical data. Here's the code I wrote:

But, much to my surprise, this did not work, and the data was continually loaded as numeric!

As I dug into the issue, I noticed that `LoadARFF()` only resets the `DatasetInfo` if the given `DatasetInfo` has dimensionality 0 (and throws an exception if the given `DatasetInfo` has some dimensionality other than the dimensionality of the data being loaded). But `LoadCSV()` behaves differently, always resetting the `DatasetInfo`!

So, I changed `LoadCSV()` to match `LoadARFF()`, and updated the documentation for `data::Load()` to match how it actually behaves.

Now, `data::Load()` will always behave as described in this tutorial: https://www.mlpack.org/doc/mlpack-git/doxygen/formatdoc.html#formatcatcpp

CC/FYI: @RishabhGarg108, @gmanlan, @shrit, @heisenbuug
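
For readers following along, the rule described in this PR (reset only when the dimensionality is 0, keep a matching pre-populated object, throw on a mismatch) can be sketched roughly as below. This is a simplified stand-in, not the mlpack implementation: `FeatureType`, `SimpleInfo`, and `PrepareInfo()` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical feature types for this sketch.
enum class FeatureType { Numeric, Categorical };

// Simplified stand-in for a dataset-info object: one type per dimension.
struct SimpleInfo
{
  std::vector<FeatureType> types;

  std::size_t Dimensionality() const { return types.size(); }
};

// Prepares `info` before loading data that has `dataDims` dimensions:
//  - dimensionality 0: (re)initialize with every dimension numeric;
//  - dimensionality equal to dataDims: keep the caller's settings;
//  - anything else: throw, since the object cannot describe this data.
void PrepareInfo(SimpleInfo& info, std::size_t dataDims)
{
  if (info.Dimensionality() == 0)
    info.types.assign(dataDims, FeatureType::Numeric);
  else if (info.Dimensionality() != dataDims)
    throw std::invalid_argument("info dimensionality does not match data");
}
```

Under this rule, an object that marks its third dimension as categorical survives `PrepareInfo(info, 3)` unchanged, which is the behavior the pre-populated use case depends on.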