LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

rcurtin · 2021-06-14T18:53:05Z

I was trying to put together an example for @RishabhGarg108 of how to use a pre-populated DatasetInfo object to load categorical data. Here's the code I wrote:

#include <mlpack/core.hpp>

using namespace arma;
using namespace mlpack;
using namespace mlpack::data;

int main()
{
  // Load the iris dataset, but mark the 2nd and 3rd dimensions as categorical.
  DatasetInfo di(4);
  di.Type(1) = Datatype::categorical;
  di.Type(2) = Datatype::categorical;

  mat m;
  data::Load("iris.csv", m, di);

  std::cout << "Loaded 'iris.csv':\n";
  for (size_t i = 0; i < 4; ++i)
  {
    std::cout << " - Dimension " << i << ": " << di.Type(i) << ", " << di.NumMappings(i)
        << " mappings." << std::endl;
  }

  m.print("m:");
}

But, much to my surprise, this did not work, and the data was continually loaded as numeric!

As I dug into the issue, I noticed that LoadARFF() only resets the DatasetInfo if the given DatasetInfo has dimensionality 0 (and throws an exception if the given DatasetInfo has some dimensionality other than the dimensionality of the data being loaded). But, LoadCSV() behaves differently, always resetting the DatasetInfo!

So, I changed LoadCSV() to match LoadARFF(), and updated the documentation for data::Load() to match how it actually behaves.

Now, data::Load() will always behave as described in this tutorial: https://www.mlpack.org/doc/mlpack-git/doxygen/formatdoc.html#formatcatcpp

CC/FYI: @RishabhGarg108, @gmanlan, @shrit, @heisenbuug

zoq · 2021-06-15T03:34:45Z

I'm trying to understand in what situation I would use a pre-populated DatasetInfo object. Because I can load something like https://gist.github.com/zoq/c92adb9e960ffc44d2de8cf41ad86137 without pre-populating.

rcurtin · 2021-06-16T01:14:29Z

Looks like I have a failing test on macOS to debug, but I can still describe the use case before I fix that---

In the CSV you gave, there is no ambiguity---the string-valued columns are definitely categorical. But what if I represented that string column as integer values instead?

0 is true, 1 is false, and 2 is not sure:

0, 1, 0, 3
5, -2, 1, 5
2, 2, 0, 4
3, -1, 0, 3
4, 4, 2, 0
0, 7, 1, 6

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

Another instance is in, e.g., the decision_tree binding, where we saved a DatasetInfo as part of the model, and if the user is passing a CSV to do prediction on, we want to use the same DatasetInfo object that was used for training.

zoq · 2021-06-16T16:52:40Z

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

I get the idea, but in the example case I don't see any problem with converting the categorical data to numeric, it will still work. Maybe there is a dateset out there that requires us to avoid the auto-conversion.

I guess there is a use case when the data format changes between training and test set.

rcurtin · 2021-06-16T18:15:48Z

In this case, the CSV reader has no way of knowing that the third column is categorical, even though we know it is. It is in a situation like this that you would want to pre-populate a DatasetInfo to indicate which dimensions are categorical and which aren't.

I get the idea, but in the example case I don't see any problem with converting the categorical data to numeric, it will still work. Maybe there is a dateset out there that requires us to avoid the auto-conversion.

Actually the pokerhand dataset is a good example of this.

$ head pokerhand.csv 
1,10,1,11,1,13,1,12,1,1
2,11,2,13,2,10,2,12,2,1
3,12,3,11,3,13,3,10,3,1
4,10,4,11,4,1,4,13,4,12
4,1,4,13,4,12,4,11,4,10
1,2,1,4,1,5,1,3,1,6

The columns are these:

suit of first card (1/2/3/4), categorical
value of first card (2-9, 10 = jack, 11 = queen, 12 = king, 1 = ace)
suit of second card
value of second card
suit of third card
value of third card
suit of fourth card
value of fourth card
suit of fifth card
value of fifth card

Now you could treat the values of each card as numeric, because there is ordering, but for the suits there is no ordering, and so if you try to learn a model assuming that the suits are numeric and not categorical, the model will perform very poorly. 👍

…-fix

zoq · 2021-06-19T02:14:41Z

That is a great example, now this makes even more sense to me personally 👍

zoq

Thanks for all the clarification comments 👍

mlpack-bot

Second approval provided automatically after 24 hours. 👍

rcurtin added 2 commits June 14, 2021 14:46

Only reset the DatasetMapper if the dimensionality is wrong.

0de456e

Update documentation so it is correct.

7d416ed

rcurtin added c: core t: bugfix labels Jun 14, 2021

Update history.

2fc7910

rcurtin mentioned this pull request Jun 16, 2021

Add regression tree #2905

Closed

6 tasks

rcurtin added 2 commits June 16, 2021 14:24

Use SetDimensionality(); the constructor will lose the policy.

e48e4b7

Merge remote-tracking branch 'origin/master' into dataset-info-resize…

ae17302

…-fix

zoq approved these changes Jun 19, 2021

View reviewed changes

mlpack-bot bot approved these changes Jun 20, 2021

View reviewed changes

zoq merged commit a888b56 into mlpack:master Jun 20, 2021

rcurtin deleted the dataset-info-resize-fix branch June 23, 2021 23:37

This was referenced Oct 14, 2022

Release version 4.0.0 #3285

Closed

Release version 4.0.0 #3286

Closed

rcurtin mentioned this pull request Oct 23, 2022

Release version 4.0.0 #3293

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

rcurtin commented Jun 14, 2021 •

edited

zoq commented Jun 15, 2021

rcurtin commented Jun 16, 2021

zoq commented Jun 16, 2021

rcurtin commented Jun 16, 2021

zoq commented Jun 19, 2021

zoq left a comment

mlpack-bot bot left a comment

LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

LoadCSV: only reset the DatasetMapper if the dimensionality is wrong #2980

Conversation

rcurtin commented Jun 14, 2021 • edited

zoq commented Jun 15, 2021

rcurtin commented Jun 16, 2021

zoq commented Jun 16, 2021

rcurtin commented Jun 16, 2021

zoq commented Jun 19, 2021

zoq left a comment

Choose a reason for hiding this comment

mlpack-bot bot left a comment

Choose a reason for hiding this comment

rcurtin commented Jun 14, 2021 •

edited