Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

Merged
merged 5 commits into from Jun 20, 2021

Conversation

rcurtin
Copy link
Member

@rcurtin rcurtin commented Jun 14, 2021

I was trying to put together an example for @RishabhGarg108 of how to use a pre-populated DatasetInfo object to load categorical data. Here's the code I wrote:

#include <mlpack/core.hpp>

using namespace arma;
using namespace mlpack;
using namespace mlpack::data;

int main()
{
  // Load the iris dataset, but mark the 2nd and 3rd dimensions as categorical.
  DatasetInfo di(4);
  di.Type(1) = Datatype::categorical;
  di.Type(2) = Datatype::categorical;

  mat m;
  data::Load("iris.csv", m, di);

  std::cout << "Loaded 'iris.csv':\n";
  for (size_t i = 0; i < 4; ++i)
  {
    std::cout << " - Dimension " << i << ": " << di.Type(i) << ", " << di.NumMappings(i)
        << " mappings." << std::endl;
  }

  m.print("m:");
}

But, much to my surprise, this did not work, and the data was continually loaded as numeric!

As I dug into the issue, I noticed that LoadARFF() only resets the DatasetInfo if the given DatasetInfo has dimensionality 0 (and throws an exception if the given DatasetInfo has some dimensionality other than the dimensionality of the data being loaded). But, LoadCSV() behaves differently, always resetting the DatasetInfo!

So, I changed LoadCSV() to match LoadARFF(), and updated the documentation for data::Load() to match how it actually behaves.

Now, data::Load() will always behave as described in this tutorial: https://www.mlpack.org/doc/mlpack-git/doxygen/formatdoc.html#formatcatcpp

CC/FYI: @RishabhGarg108, @gmanlan, @shrit, @heisenbuug

@zoq
Copy link
Member

zoq commented Jun 15, 2021

I'm trying to understand in what situation I would use a pre-populated DatasetInfo object. Because I can load something like https://gist.github.com/zoq/c92adb9e960ffc44d2de8cf41ad86137 without pre-populating.

@rcurtin rcurtin mentioned this pull request Jun 16, 2021
6 tasks
@rcurtin
Copy link
Member Author

rcurtin commented Jun 16, 2021

Looks like I have a failing test on macOS to debug, but I can still describe the use case before I fix that---

In the CSV you gave, there is no ambiguity---the string-valued columns are definitely categorical. But what if I represented that string column as integer values instead?

0 is true, 1 is false, and 2 is not sure:

0, 1, 0, 3
5, -2, 1, 5
2, 2, 0, 4
3, -1, 0, 3
4, 4, 2, 0
0, 7, 1, 6

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

Another instance is in, e.g., the decision_tree binding, where we saved a DatasetInfo as part of the model, and if the user is passing a CSV to do prediction on, we want to use the same DatasetInfo object that was used for training.

@zoq
Copy link
Member

zoq commented Jun 16, 2021

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

I get the idea, but in the example case I don't see any problem with converting the categorical data to numeric, it will still work. Maybe there is a dateset out there that requires us to avoid the auto-conversion.

I guess there is a use case when the data format changes between training and test set.

@rcurtin
Copy link
Member Author

rcurtin commented Jun 16, 2021

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

I get the idea, but in the example case I don't see any problem with converting the categorical data to numeric, it will still work. Maybe there is a dateset out there that requires us to avoid the auto-conversion.

Actually the pokerhand dataset is a good example of this.

$ head pokerhand.csv 
1,10,1,11,1,13,1,12,1,1
2,11,2,13,2,10,2,12,2,1
3,12,3,11,3,13,3,10,3,1
4,10,4,11,4,1,4,13,4,12
4,1,4,13,4,12,4,11,4,10
1,2,1,4,1,5,1,3,1,6

The columns are these:

  • suit of first card (1/2/3/4), categorical
  • value of first card (2-9, 10 = jack, 11 = queen, 12 = king, 1 = ace)
  • suit of second card
  • value of second card
  • suit of third card
  • value of third card
  • suit of fourth card
  • value of fourth card
  • suit of fifth card
  • value of fifth card

Now you could treat the values of each card as numeric, because there is ordering, but for the suits there is no ordering, and so if you try to learn a model assuming that the suits are numeric and not categorical, the model will perform very poorly. 👍

@zoq
Copy link
Member

zoq commented Jun 19, 2021

That is a great example, now this makes even more sense to me personally 👍

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the clarification comments 👍

Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@zoq zoq merged commit a888b56 into mlpack:master Jun 20, 2021
@rcurtin rcurtin deleted the dataset-info-resize-fix branch June 23, 2021 23:37
This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants