
Error if attributes have the same value in training set. #14

Open

goibon opened this issue Nov 18, 2013 · 7 comments


goibon commented Nov 18, 2013

I've encountered a problem. If you have a training set like:

[[5, 1, "9.990941"], [5, 1, "9.990926"], [5, 1, "9.991411"], [5, 1, "9.991286"], [5, 1, "9.9916579615681"], [5, 1, "9.9917500513096"], [5, 1, "9.991682"], [5, 1, "9.991675"], [5, 1, "9.990981"], [5, 1, "9.990918"], [5, 1, "9.990918"], [5, 1, "9.990926"], [5, 1, "9.990934"], [5, 1, "9.9907993677691"], [5, 1, "9.9907108548716"], [5, 1, "9.9907190386056"], [5, 1, "9.9907190386056"]]

where the attributes do not vary and only the target feature does, I get an error:

undefined method `to_a' for 9.990941:String

I encounter the same issue if the attributes vary but the target feature does not.
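
For reference, a setup along these lines triggers it for me. This is only a sketch: the attribute names are placeholders and the mode flag is just an example; the API shape follows the gem's README (DecisionTree::ID3Tree.new(attributes, training, default, type), then train):

require 'decisiontree'

attributes = ['a', 'b']   # placeholder attribute names
training = [
  [5, 1, "9.990941"], [5, 1, "9.990926"], [5, 1, "9.991411"],
  # ...the remaining rows from above, all with the same attributes...
  [5, 1, "9.9907190386056"]
]

tree = DecisionTree::ID3Tree.new(attributes, training, "9.990941", :continuous)
tree.train   # the error reported above appears during training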

igrigorik (Owner) commented

Well, in either case, that's not much of a decision tree...

a) same attributes resolving to different labels - the current code just picks the last value, so all of the examples get resolved to one case.
b) different attributes resolving to the same label - I guess that's a single-node tree.

For (b), I guess it would make sense to handle this, even if it's an edge case, but for (a), it seems like we should just throw an exception during training...
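
Roughly the kind of check I have in mind - just a sketch, and it assumes each training row is laid out as [attributes..., label]:

# Sketch of a pre-training check for case (a): raise if identical
# attribute vectors are associated with more than one label.
def assert_no_conflicting_examples!(training)
  training.group_by { |row| row[0..-2] }.each do |attrs, rows|
    labels = rows.map(&:last).uniq
    next if labels.size == 1
    raise ArgumentError,
          "attributes #{attrs.inspect} resolve to #{labels.size} different labels"
  end
end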


goibon commented Nov 28, 2013

As mentioned in my first post, I get an error for the values posted. I would have imagined that a decision tree would pick the target feature that appears most often, or, if no such case exists, pick the last one. But if you think an exception during training is a better solution, then that's fine, I guess.
For case (b) it would make sense to just resolve to the only target feature available.
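
The majority-vote behaviour I had in mind could be done as a preprocessing step, something like the sketch below (it assumes each row is [attributes..., label]; ties just take whichever label is counted first):

# Sketch: collapse rows that share identical attributes down to their
# most frequent label before training.
def collapse_conflicts(training)
  training.group_by { |row| row[0..-2] }.map do |attrs, rows|
    counts = Hash.new(0)
    rows.each { |row| counts[row.last] += 1 }
    majority_label = counts.max_by { |_label, count| count }.first
    attrs + [majority_label]
  end
end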

In my training set I have decided to add a fake entry to bypass both cases discussed.


sigfrid commented Feb 28, 2014

For case (a), how about just ignoring those examples?
Anyway, an exception is much better than an undefined method error.

For case (b), as goibon said, it would make sense to just resolve to the only target feature available.
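
By "ignoring those examples" I mean something like this sketch (again assuming rows are [attributes..., label]):

# Sketch: drop every group of rows whose identical attributes map to more
# than one label; keep everything else as-is.
def drop_conflicts(training)
  training.group_by { |row| row[0..-2] }
          .select { |_attrs, rows| rows.map(&:last).uniq.size == 1 }
          .values
          .flatten(1)
end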


DannyBen commented Feb 3, 2017

Although this is quite an old ticket, I must say that this issue also causes me some headaches.

As I see it:

  1. I would also imagine that a decision tree should make a best effort to answer the question. If the same input resolves to different outputs, then use the output that occurs most often for that input.
  2. I understand if the above is not in line with common implementations (I don't know if it is or isn't).
  3. If it is decided that this case should trigger an error, then I would expect it to trigger a specialized, custom error rather than undefined_method, so we can at least rescue it and be sure we rescued the correct one.
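
To illustrate point 3, something along these lines is all I'm asking for (the error class name here is made up; it is not part of the gem today):

# Hypothetical: a dedicated error class the gem could raise during training,
# so callers can rescue it specifically instead of an unrelated NoMethodError.
module DecisionTree
  class ConflictingExamplesError < StandardError; end
end

# Caller side, once the gem raises it:
begin
  tree.train   # tree built as usual via DecisionTree::ID3Tree.new(...)
rescue DecisionTree::ConflictingExamplesError => e
  warn "inconsistent training data: #{e.message}"
end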


jesselawson commented Sep 2, 2017

This is a very old thread, but I think the issue is still here. I am a proponent of leaving it as it is.

To address @DannyBen's list:

  1. I would also imagine that a decision tree should make a best effort to answer the question. If the same input resolves to different outputs, then use the output that occurs most often for that input.

I believe this makes the assumption that FIFO and LIFO data observation models are exactly the same. In other words, if I have a set of observations where the same set of features yields different outputs, can I be certain that there weren't exigent circumstances or variables causing the variation in output? (As a matter of fact, I can probably assume that there was an exigent variable, since we would expect a model like y = a+b to always yield the same observation when a and b are constant--and if they aren't, we should assume that there is another variable that we are not accounting for.)

Put another way, if we are training a model with data that has non-changing inputs, then what are we really doing? As @igrigorik said, a check could be made to throw an error when something like that happens, but it would be remiss of an objective modeling algorithm to choose a FIFO or LIFO method to select the "correct" observation, when we might not even understand what exigent variable is causing the discrepancies.

  2. I understand if the above is not in line with common implementations (I don't know if it is or isn't).

I think this means you already understand that the above limitations of assumptions are something we shouldn't control for. Is that correct?

  3. If it is decided that this case should trigger an error, then I would expect it to trigger a specialized, custom error rather than undefined_method, so we can at least rescue it and be sure we rescued the correct one.

I can see how this would be frustrating, but maybe a second look at the data is in order. If you have a multivariate system of continuous inputs that are constant, maybe one solution is to convert them to discrete variables instead? In other words, have you tried something like this:

[
  ["cat5", "option1", "9.990941"],
  ["cat5", "option1", "9.990926"],
  ["cat5", "option1", "9.991411"],
  ["cat5", "option1", "9.991286"],
  ...
  ["cat5", "option1", "9.9907190386056"]
]

I don't know your data, but if I saw a set of continuous inputs like this, I can safely assume that one of two things is true: either A) there are continuous variables that should be discretized, or B) the wrong data is being measured to yield an observation.
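
A rough sketch of that conversion, if it helps - the category names are made up, and the row layout [attr_a, attr_b, label] is assumed:

# Hypothetical: turn the constant numeric inputs into discrete string
# categories so the tree can be trained on discrete attributes.
def discretize(training)
  training.map { |a, b, label| ["cat#{a}", "option#{b}", label] }
end

discretize([[5, 1, "9.990941"], [5, 1, "9.990926"]])
# => [["cat5", "option1", "9.990941"], ["cat5", "option1", "9.990926"]]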

Think about the nature of a decision tree, and then think about how it would play out on a model that had exactly the same inputs but different outputs. For example, pretend we have a model trained on the data from the OP. If we asked the model to predict the outcome of [5, 1], what should the model give us? Well, since it knows there is more than one outcome--and there's no way to determine whether it should adhere to LIFO or FIFO prioritization (nor any reason it should)--all it can return is a range. That's sort of antithetical to a decision tree, isn't it?

One could argue that it should return the mean value, but again, that assumption would also depend heavily on the environment and the type of data we're collecting.


DannyBen commented Sep 2, 2017

I can see how this would be frustrating, but maybe a second look at the data is in order.

Psychologists call this "deflection" no? 😏

Let's ignore all the different ways to handle - or not handle - multiple values; I think there should be no disagreement that the library should at least raise a specialized error and not let Ruby fail with an unrelated error.

But, I understand if this is not on anybody's priority list. To be honest, I also moved on.

harryloewen commented

2018 update
IMHO @igrigorik is absolutely right. I mistakenly used the same default values until the user starts providing information.

All it needs is a simple random modifier for the default values. That might cause some false correlations in the beginning, but it fixes itself as the data grows.
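
Roughly what I mean, as a sketch only - the noise scale of 1e-6 is arbitrary, and the row layout [attributes..., label] is assumed:

# Hypothetical: add a tiny random jitter to the otherwise constant default
# attribute values so identical rows no longer collide during training.
def jitter_attributes(training, scale = 1e-6)
  training.map do |*attrs, label|
    attrs.map { |value| value + rand * scale } + [label]
  end
end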
