"Error: Missing data in columns" due to fix.factors.prediction=TRUE #2611

Closed
feinmann opened this issue Jun 26, 2019 · 3 comments

@feinmann

library(mlr)

# Training data: factor B has only the levels "A" and "B".
train_data <- data.frame(
  A = runif(100), B = factor(sample(c("A", "B"), 100, replace = TRUE)))

# Test data: factor B contains the additional, unseen level "C".
test_data <- data.frame(
  A = runif(100), B = factor(sample(c("A", "B", "C"), 100, replace = TRUE)))

lrn <- makeLearner("regr.ranger", fix.factors.prediction = TRUE)

train_task <- makeRegrTask(
  data = train_data,
  target = "A"
)

model <- train(lrn, train_task)

# This call fails with the error quoted below.
predictions <- predict(model, newdata = test_data)

This gives "Error: Missing data in columns: B.", although test_data contains no missing values. The same happens for classification.

Kind regards

@pat-s
Member

pat-s commented Jun 26, 2019

This issue sounds familiar to me - I think it hit me some time in the past as well.
I cannot tell you when I will have time to look into this.

Thanks a lot for the reprex in the first place!

@jakob-r
Sponsor Member

jakob-r commented Sep 11, 2020

The reason is that the new factor level C is converted to NA because of fix.factors.prediction = TRUE.
As you can see in the documentation, this feature was intended for cases where the test data has fewer factor levels than the training set. However, it has the side effect of reducing the levels to the ones seen in training, and R then just sets unseen factor levels to NA. Maybe some learners can deal better with an NA than with an unseen factor level? However, this is not really intended and definitely has to go into the documentation of the fix.factors.prediction argument.

Also we might want to deal with it better.
#2771 is kind of related.
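
The mechanism described above can be reproduced in base R without mlr. This is only a minimal sketch of the general behaviour, assuming the levels are restricted to those seen in training; it is not mlr's actual internal code, and the variable names are made up for illustration.

```r
# Levels seen during training, and a test column with the extra level "C".
train_levels <- c("A", "B")
test_B <- factor(c("A", "C", "B", "C"))

# Re-encoding with only the training levels silently maps "C" to NA.
recoded <- factor(as.character(test_B), levels = train_levels)

print(recoded)
print(sum(is.na(recoded)))  # number of values lost to the unseen level
```

A learner that rejects missing values would then complain about column B, even though the original test data had no NA in it.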

@pat-s
Member

pat-s commented Oct 28, 2020

The PR from Jakob provides a good approach to the problem. Most likely there is not much else we can do in such situations to account for all possible issues with missing data in prediction scenarios.
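
On the user side, one possible safeguard is to check for unseen factor levels before calling predict(). The helper below is a hypothetical sketch in base R (find_unseen_levels is not part of mlr), assuming the training and prediction data are plain data frames with matching column names.

```r
# Hypothetical helper, not part of mlr: for each factor column of the
# training data, list the levels in `newdata` that were not seen in training.
find_unseen_levels <- function(train, newdata) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  factor_cols <- intersect(factor_cols, names(newdata))
  unseen <- lapply(factor_cols, function(col) {
    # Genuine NA values are ignored; only new level names are reported.
    setdiff(unique(as.character(newdata[[col]])),
            c(levels(train[[col]]), NA))
  })
  names(unseen) <- factor_cols
  unseen[lengths(unseen) > 0]
}

train_data <- data.frame(A = c(0.1, 0.2, 0.3), B = factor(c("A", "B", "A")))
test_data  <- data.frame(A = c(0.4, 0.5, 0.6), B = factor(c("A", "C", "B")))

unseen <- find_unseen_levels(train_data, test_data)
print(unseen)
```

If the returned list is non-empty, the affected rows could be dropped, recoded, or imputed explicitly before prediction, rather than relying on the implicit conversion to NA.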

@pat-s pat-s closed this as completed Oct 28, 2020