
xgboost learner inverts labels #32

Closed
001ben opened this issue Sep 24, 2019 · 7 comments

Comments

001ben commented Sep 24, 2019

label = match(as.character(as.matrix(task$data(cols = task$target_names))), lvls) - 1

The `match` line that extracts labels from the task inverts them, which breaks measures on binary tasks. This causes issues when supplying a watchlist to an xgboost learner for early stopping.

# positive class comes first
lvls = c('1', '0')
labels = c('0', '1', '0')
new_labels = match(labels, lvls) - 1
new_labels == labels # all FALSE: the encoding is inverted

Suggested:

label = length(lvls) - match(as.character(as.matrix(task$data(cols = task$target_names))), lvls)
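A quick check in plain R (mirroring the snippet above, outside of mlr3) that the suggested formula maps the positive class to 1 and the negative class to 0:

```r
lvls <- c('1', '0')                        # positive class first, as in mlr3
labels <- c('0', '1', '0')
old <- match(labels, lvls) - 1             # inverted: '1' -> 0, '0' -> 1
new <- length(lvls) - match(labels, lvls)  # correct:  '1' -> 1, '0' -> 0
old  # 1 0 1
new  # 0 1 0
```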

mllg commented Sep 24, 2019

@berndbischl @pat-s this also affects mlr2.

@mllg mllg closed this as completed in 333f231 Sep 24, 2019

mllg commented Sep 24, 2019

Thanks for reporting.


pat-s commented Sep 24, 2019

Why were the labels inverted in the first place?


mllg commented Sep 24, 2019

xgboost needs the labels encoded as integers in 0:(nclass - 1). Usually it does not matter how you map factor -> integer as long as you translate back correctly (and we did this).

However, xgboost supports stuff like early stopping where it calculates performance measures internally to decide whether to terminate or keep going. And for some binary classification measures it matters which class is the positive class (PPV, precision, recall, ...).
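A minimal toy sketch (plain R, not mlr3 or xgboost code) of why the encoding matters: with the same predictions, precision changes when the 0/1 labeling of the positive class is flipped.

```r
# Toy data: 1 = positive class in the intended encoding
truth <- c(1, 1, 0, 0, 1)
pred  <- c(1, 0, 0, 1, 1)

# Precision = TP / (TP + FP), computed against the class coded as `positive`
precision <- function(truth, pred, positive = 1) {
  sum(pred == positive & truth == positive) / sum(pred == positive)
}

precision(truth, pred)            # intended encoding:  2/3
precision(1 - truth, 1 - pred)    # inverted encoding:  1/2
```

The same asymmetry applies to PPV, recall, and the other class-sensitive measures xgboost may compute internally during early stopping.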

@bmreiniger

This obviously causes problems if one wants to extract the underlying xgboost model (in my case, to convert into PMML), but I don't see an easy way around that on the mlr side. (I've brought it up for r2pmml at jpmml/r2pmml#46 (comment).)


mllg commented Feb 21, 2020

@bmreiniger Are you still encountering problems in mlr3?


bmreiniger commented Feb 21, 2020

@mllg Yes. There's an additional weirdness around column order. Here's the mlr3 adaptation of what I posted over at r2pmml:

library(r2pmml)
library(mlr3)
library(mlr3learners)
library(xgboost)

set.seed(314)

data("iris")
# make binary target
iris$Species <- as.integer(iris$Species)
iris$Species <- as.integer(abs(iris$Species - 2))
iris$Species <- as.factor(iris$Species)

task <- mlr3::TaskClassif$new("bin_iris", iris, "Species")
task
xgb_learner <- lrn("classif.xgboost")
xgb_learner$param_set$values = list(
  objective = 'binary:logistic',
  eval_metric = 'auc',
  nrounds = 10
  )
xgb_learner$predict_type = "prob"

xgb_learner$train(task)

mlr_preds <- xgb_learner$predict(task)

xgb_model <- xgb_learner$model
dmat <- xgb.DMatrix(data = as.matrix(iris[, c(3,4,1,2)]))  # drop Species and reorder columns to match xgb_model$feature_names
xgb_preds <- predict(xgb_model, dmat)

head(mlr_preds$prob)
head(xgb_preds)
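One way to avoid the hard-coded `c(3, 4, 1, 2)` above (a sketch, assuming `xgb_model$feature_names` holds the column order the booster was trained with) is to index the data frame by those names directly:

```r
# Column order the fitted booster expects; hard-coded here for illustration,
# in practice taken from xgb_model$feature_names
feature_names <- c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")

data("iris")
# Indexing by name drops Species and reorders in one step
X <- as.matrix(iris[, feature_names])
colnames(X)
```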
