What does ranger do with new factor levels in prediction? #116

Closed
mayer79 opened this issue Sep 3, 2016 · 6 comments

Comments


mayer79 commented Sep 3, 2016

Hello rangers

I recently stumbled over the error message

Error in predict.ranger.forest(forest, data, predict.all, seed, num.threads, :
Missing values in data.

It was the "classic" problem of a new factor level in a categorical predictor at prediction time, which seems to trigger the error only if respect.unordered.factors = "order" (or TRUE). For "partition" and "ignore" (FALSE), there is no such message.

I think the behaviour for "order" and "ignore" (FALSE) is clear, although the error message for "order" could be more specific, e.g. "new or unknown factor levels in regressor". But what does ranger do in the remaining case, respect.unordered.factors = "partition" (no error)?

Below is a small example for testing:

library(ranger)

# All possible two-partitions
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "partition")
predict(fit, data.frame(Species = ""))$predictions

# Ordered by proportion of second class (respect.unordered.factors = TRUE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "order")
predict(fit, data.frame(Species = ""))$predictions

# Factors are considered ordered (respect.unordered.factors = FALSE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "ignore")
predict(fit, data.frame(Species = ""))$predictions
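
One possible explanation for the "Missing values in data." message, offered as an assumption rather than a description of ranger's internals: when a value is re-coded against the training levels, any level not seen in training becomes NA. The base-R sketch below shows that effect and adds a small pre-check that could be run before calling predict(). `find_new_levels` is a hypothetical helper, not part of ranger.

```r
# Levels known from the training data
train_levels <- levels(iris$Species)

# New data containing a level ("") that never occurred in training
newdata <- data.frame(Species = c("setosa", ""), stringsAsFactors = FALSE)

# Re-coding against the training levels turns the unseen level into NA
recoded <- factor(newdata$Species, levels = train_levels)
is.na(recoded)

# Hypothetical helper: for each factor column of `train`, list the values
# in `newdata` that are not among the training levels
find_new_levels <- function(train, newdata) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  shared_cols <- intersect(factor_cols, names(newdata))
  unseen <- lapply(shared_cols, function(col) {
    setdiff(unique(as.character(newdata[[col]])), levels(train[[col]]))
  })
  names(unseen) <- shared_cols
  # Keep only columns that actually contain unseen values
  unseen[lengths(unseen) > 0]
}

new_levels <- find_new_levels(iris, newdata)
```

An empty result from `find_new_levels` means every factor value in the new data was already present among the training levels.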

mnwright commented Sep 6, 2016

Both things have been on my TODO list for a while:

  1. Fix error for respect.unordered.factors = "order"
  2. How to handle new levels in general?

On 2.: In randomForest all new factor levels go to the left. Currently the same is done in ranger. However, we should change this. Any better ideas than just assigning them randomly?


mayer79 commented Sep 7, 2016

Oh, I see. So I guess the same strategy (always to the left) is also used when calculating the OOB predictions. Maybe you can leave it like that until we have strategies for handling missing values in predictors? Then you could start to offer an option like newFactorLevels = c("left", "right", "missing") or so.
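
The routing strategies discussed above can be sketched in base R. This is only an illustration under assumptions, not ranger's or randomForest's implementation: `route_level` and its `policy` argument are hypothetical names, a split is modeled as the set of levels sent to the left child, and the "missing" variant is left out because it depends on a missing-value strategy that is not defined here.

```r
# Hypothetical sketch: decide which child node an observation goes to at a
# split on an unordered factor. `left_levels` are the levels sent left,
# `known_levels` are all levels seen in training.
route_level <- function(value, left_levels, known_levels,
                        policy = c("left", "right", "random")) {
  policy <- match.arg(policy)
  if (value %in% known_levels) {
    # Known level: follow the partition learned in training
    return(if (value %in% left_levels) "left" else "right")
  }
  # Unseen level: apply the chosen fallback policy
  switch(policy,
    left   = "left",
    right  = "right",
    random = sample(c("left", "right"), size = 1)
  )
}

known <- levels(iris$Species)
route_level("setosa", left_levels = "setosa", known_levels = known)
route_level("", left_levels = "setosa", known_levels = known, policy = "right")
```

With a fixed "left" or "right" policy the prediction for an unseen level is deterministic; with "random" it can change between calls unless a seed is set.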

@mnwright

Error fixed in #120.


ghost commented Jun 10, 2017

> In randomForest all new factor levels go to the left.

I'm not sure about this: predict.randomForest returns the error "new factor levels not present in the training data" when it is faced with unseen levels.


mnwright commented Jun 12, 2017

Yes, you are right. I was checking for factor levels being present in the levels but not in the data. However, I've just checked that again and it was changed in a recent version of randomForest: They are assigned randomly since version 4.6-10.


ghost commented Jun 12, 2017 via email
