
Error if attributes have the same value in training set. #14

Open

goibon opened this issue Nov 18, 2013 · 7 comments


goibon commented Nov 18, 2013

I've encountered a problem. If you have a training set like:

[[5, 1, "9.990941"], [5, 1, "9.990926"], [5, 1, "9.991411"], [5, 1, "9.991286"], [5, 1, "9.9916579615681"], [5, 1, "9.9917500513096"], [5, 1, "9.991682"], [5, 1, "9.991675"], [5, 1, "9.990981"], [5, 1, "9.990918"], [5, 1, "9.990918"], [5, 1, "9.990926"], [5, 1, "9.990934"], [5, 1, "9.9907993677691"], [5, 1, "9.9907108548716"], [5, 1, "9.9907190386056"], [5, 1, "9.9907190386056"]]

where the attributes do not vary and only the target feature does, I get an error:

undefined method `to_a' for 9.990941:String

I encounter the same issue if the attributes vary but the target feature does not.
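
For reference, a setup along these lines triggers it for me. This is only a sketch: the attribute names are placeholders and the mode flag is just an example; the API shape follows the gem's README (DecisionTree::ID3Tree.new(attributes, training, default, type), then train):

require 'decisiontree'

attributes = ['a', 'b']   # placeholder attribute names
training = [
  [5, 1, "9.990941"], [5, 1, "9.990926"], [5, 1, "9.991411"],
  # ...the remaining rows from above, all with the same attributes...
  [5, 1, "9.9907190386056"]
]

tree = DecisionTree::ID3Tree.new(attributes, training, "9.990941", :continuous)
tree.train   # the error reported above appears during training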

igrigorik (Owner) commented

Well, in either case, that's not much of a decision tree...

a) same attributes resolving to different labels - the current code just picks the last value, so all of the examples get resolved to one case.
b) different attributes resolving to the same label - I guess that's a single-node tree.

For (b), I guess it would make sense to handle this, even if it's an edge case, but for (a), it seems like we should just throw an exception during training...
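
Roughly the kind of check I have in mind - just a sketch, and it assumes each training row is laid out as [attributes..., label]:

# Sketch of a pre-training check for case (a): raise if identical
# attribute vectors are associated with more than one label.
def assert_no_conflicting_examples!(training)
  training.group_by { |row| row[0..-2] }.each do |attrs, rows|
    labels = rows.map(&:last).uniq
    next if labels.size == 1
    raise ArgumentError,
          "attributes #{attrs.inspect} resolve to #{labels.size} different labels"
  end
end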


goibon commented Nov 28, 2013

As mentioned in my first post, I get an error for the values posted. I would have imagined that a decision tree would pick the target feature that appears most often, or, if no such case exists, pick the last one. But if you think an exception during training is a better solution, then that's fine, I guess.
For case (b) it would make sense to just resolve to the only target feature available.
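
The majority-vote behaviour I had in mind could be done as a preprocessing step, something like the sketch below (it assumes each row is [attributes..., label]; ties just take whichever label is counted first):

# Sketch: collapse rows that share identical attributes down to their
# most frequent label before training.
def collapse_conflicts(training)
  training.group_by { |row| row[0..-2] }.map do |attrs, rows|
    counts = Hash.new(0)
    rows.each { |row| counts[row.last] += 1 }
    majority_label = counts.max_by { |_label, count| count }.first
    attrs + [majority_label]
  end
end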

In my training set I have decided to add a fake entry to bypass both cases discussed.


sigfrid commented Feb 28, 2014

For case (a), how about just ignoring those examples?
Anyway, an exception is much better than an undefined method error.

For case (b), as goibon said, it would make sense to just resolve to the only target feature available.
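
By "ignoring those examples" I mean something like this sketch (again assuming rows are [attributes..., label]):

# Sketch: drop every group of rows whose identical attributes map to more
# than one label; keep everything else as-is.
def drop_conflicts(training)
  training.group_by { |row| row[0..-2] }
          .select { |_attrs, rows| rows.map(&:last).uniq.size == 1 }
          .values
          .flatten(1)
end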


DannyBen commented Feb 3, 2017

Although this is quite an old ticket, I must say that this issue also causes me some headaches.

As I see it:

  1. I would also imagine that a decision tree should make a best effort to answer the question. If the same input resolves to different outputs, then use the output that occurs most often for that input.
  2. I understand if the above is not in line with common implementations (I don't know if it is or isn't).
  3. If it is decided that this case should trigger an error, then I would expect it to trigger a specialized, custom error rather than undefined_method, so we can at least rescue it and be sure we rescued the correct one.
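
To illustrate point 3, something along these lines is all I'm asking for (the error class name here is made up; it is not part of the gem today):

# Hypothetical: a dedicated error class the gem could raise during training,
# so callers can rescue it specifically instead of an unrelated NoMethodError.
module DecisionTree
  class ConflictingExamplesError < StandardError; end
end

# Caller side, once the gem raises it:
begin
  tree.train   # tree built as usual via DecisionTree::ID3Tree.new(...)
rescue DecisionTree::ConflictingExamplesError => e
  warn "inconsistent training data: #{e.message}"
end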


jesselawson commented Sep 2, 2017

This is a very old thread, but I think the issue is still here. I am a proponent of leaving it as it is.

To address @DannyBen's list:

  1. I would also imagine that a decision tree should make a best effort to answer the question. If the same input resolves to different outputs, then use the output that occurs most often for that input.

I believe this makes the assumption that FIFO and LIFO data observation models are exactly the same. In other words, if I have a set of observations where the same set of features yields different outputs, can I be certain that there weren't exigent circumstances or variables causing the variation in output? (As a matter of fact, I can probably assume that there was an exigent variable, since we would expect a model like y = a+b to always yield the same observation when a and b are constant--and if they aren't, we should assume that there is another variable that we are not accounting for.)

Put another way, if we are training a model with data that has non-changing inputs, then what are we really doing? As @igrigorik said, a check could be made to throw an error when something like that happens, but it would be remiss of an objective modeling algorithm to choose a FIFO or LIFO method to select the "correct" observation, when we might not even understand what exigent variable is causing the discrepancies.

  2. I understand if the above is not in line with common implementations (I don't know if it is or isn't).

I think this means you already understand that the above limitations of assumptions are something we shouldn't control for. Is that correct?

  3. If it is decided that this case should trigger an error, then I would expect it to trigger a specialized, custom error rather than undefined_method, so we can at least rescue it and be sure we rescued the correct one.

I can see how this would be frustrating, but maybe a second look at the data is in order. If you have a multivariate system of continuous inputs that are constant, maybe one solution is to convert them to discrete variables instead? In other words, have you tried something like this:

[
  ["cat5", "option1", "9.990941"],
  ["cat5", "option1", "9.990926"],
  ["cat5", "option1", "9.991411"],
  ["cat5", "option1", "9.991286"],
  ...
  ["cat5", "option1", "9.9907190386056"]
]

I don't know your data, but if I saw a set of continuous inputs like this, I can safely assume that one of two things is true: either A) there are continuous variables that should be discretized, or B) the wrong data is being measured to yield an observation.
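
A rough sketch of that conversion, if it helps - the category names are made up, and the row layout [attr_a, attr_b, label] is assumed:

# Hypothetical: turn the constant numeric inputs into discrete string
# categories so the tree can be trained on discrete attributes.
def discretize(training)
  training.map { |a, b, label| ["cat#{a}", "option#{b}", label] }
end

discretize([[5, 1, "9.990941"], [5, 1, "9.990926"]])
# => [["cat5", "option1", "9.990941"], ["cat5", "option1", "9.990926"]]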

Think about the nature of a decision tree, and then think about how it would play out on a model that had exactly the same inputs but different outputs. For example, pretend we have a model trained on the data from the OP. If we asked the model to predict the outcome of [5, 1], what should the model give us? Well, since it knows there is more than one outcome--and there's no way to determine whether it should adhere to LIFO or FIFO prioritization (nor any reason it should)--all it can return is a range. That's sort of antithetical to a decision tree, isn't it?

One could argue that it should return the mean value, but again, that assumption would also depend heavily on the environment and the type of data we're collecting.


DannyBen commented Sep 2, 2017

I can see how this would be frustrating, but maybe a second look at the data is in order.

Psychologists call this "deflection" no? 😏

Let's ignore all the different ways to handle - or not handle - multiple values; I think there should be no disagreement that the library should at least raise a specialized error and not let Ruby fail with an unrelated error.

But, I understand if this is not on anybody's priority list. To be honest, I also moved on.

harryloewen commented

2018 update
IMHO @igrigorik is absolutely right. I mistakenly used the same default values until the user starts providing information.

All it needs is a simple random modifier for the default values. That might cause some false correlations in the beginning, but it fixes itself as the data grows.
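
Roughly what I mean, as a sketch only - the noise scale of 1e-6 is arbitrary, and the row layout [attributes..., label] is assumed:

# Hypothetical: add a tiny random jitter to the otherwise constant default
# attribute values so identical rows no longer collide during training.
def jitter_attributes(training, scale = 1e-6)
  training.map do |*attrs, label|
    attrs.map { |value| value + rand * scale } + [label]
  end
end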
