
Feature: Handling of Count Data, e.g. Poisson Regression #515

Closed
javdg opened this issue Oct 6, 2015 · 7 comments

Comments

javdg commented Oct 6, 2015

I intended to use mlr for a benchmark of Poisson regression models but realized that support for count-data models hasn't actually been implemented yet.
While I successfully worked around these limitations in my specific use case, I talked to @berndbischl, and this issue/feature request is supposed to track and coordinate any efforts towards a proper implementation.

As to specific performance measures, these StackExchange threads might be helpful:

I also compiled a (likely incomplete) list of learners that might be considered (I'm personally not familiar with most of the more complex models, so some might not be appropriate or a priority; minimal sketches of the first two entries follow after the lists):

Models for Count Data

  • glm() {stats}
  • glm.nb() {MASS}
  • glmnet() {glmnet}
  • cv.glmnet() {glmnet}
  • hurdle() {pscl}
  • zeroinfl() {pscl}

General Implementations (including models for count data)

  • gamlss() {gamlss}
  • poissonff() {VGAM}

Further Extensions (to the classical glm, including count data models)

  • finite mixture models {flexmix}
  • generalized estimating equations {geepack}
  • mixed-effects models {lme4, nlme}

Apparently Outdated

  • zicounts() {zicounts - orphaned}
  • fit.zigp() {ZIGP - not on CRAN anymore}

see also https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
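For concreteness (not from the original post), minimal sketches of the first two entries in the list, assuming a data frame df with a count outcome y:

# Poisson GLM from {stats}; the log link is the default for family = poisson
fit.pois <- glm(y ~ ., family = poisson(link = "log"), data = df)

# negative binomial GLM from {MASS}, useful when the counts are overdispersed
fit.nb <- MASS::glm.nb(y ~ ., data = df)

predict(fit.pois, type = "response")  # expected counts on the original scale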

Cheers,
Johannes

ghost commented Nov 5, 2015

Hello,

Two more boosting algorithms that can be used with count data (minimal sketches below):

  • xgboost() {xgboost}
  • gbm() {gbm}
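As a rough illustration (not from the original comment), the count-data settings for these two, assuming a numeric feature matrix X and a count vector y:

library(gbm)
library(xgboost)

# gbm supports Poisson deviance directly via its distribution argument
fit.gbm <- gbm(y ~ ., distribution = "poisson",
               data = data.frame(y = y, X), n.trees = 100)

# xgboost offers a Poisson objective and a matching evaluation metric
fit.xgb <- xgboost(data = X, label = y, nrounds = 100,
                   objective = "count:poisson",
                   eval_metric = "poisson-nloglik")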

Best
Basil

zmjones (Contributor) commented Dec 9, 2015

I am a bit confused about why this is necessary. Most (almost all?) regression methods for counts produce positive real-valued predictions, so I don't see how this wouldn't work as a normal regression task. Any count-specific measures could be added without adding any other new structure. Is there some other reason we should have count tasks that I've missed? If there is, could it not be solved by adding additional structure to a regression task and maybe an additional property to regression learners (e.g., check/enforce non-negativity)?
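As a rough illustration of that suggestion (not part of the original comment), a minimal sketch of a count-specific measure in mlr via makeMeasure, here the mean Poisson deviance; the id and implementation are illustrative, not an existing mlr measure:

library(mlr)

poisson.deviance = makeMeasure(
  id = "poisdev", minimize = TRUE,
  properties = c("regr", "req.pred", "req.truth"),
  best = 0, worst = Inf,
  fun = function(task, model, pred, feats, extra.args) {
    y  = pred$data$truth
    mu = pmax(pred$data$response, .Machine$double.eps)  # guard against predictions <= 0
    term = ifelse(y == 0, 0, y * log(y / mu))           # y * log(y / mu) is 0 when y == 0
    2 * mean(term - (y - mu))                           # mean Poisson deviance
  }
)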

javdg (Author) commented Dec 9, 2015

Hi,
this thread arose from encountering what was later independently reported (and has since been fixed) as #559.
In my subsequent discussions with @berndbischl he suggested tackling this issue in a systematic manner and creating this issue to track any developments in this regard.
I tend to agree that this doesn't necessarily require a completely new and separate count task, even though I think that's what Bernd originally had in mind; he needed some more time to look into it.

javdg (Author) commented Apr 6, 2016

As per our conversation over the weekend, this is to ping @berndbischl ...

SimonCoulombe commented Dec 20, 2018

Hi, I was wondering if there is any way to do Poisson regression using xgboost in mlr? I also need to be able to offset for exposure using xgboost's setinfo "base_margin".

Taking a guess, I would need to edit the trainLearner.regr.xgboost function in RLearner_regr_xgboost.R to add the part marked "proposed addition" below, but I really don't know what I am doing...

  parlist$data = xgboost::xgb.DMatrix(data = data.matrix(task.data$data), label = task.data$target)

  # proposed addition: forward a user-supplied base_margin to the DMatrix
  if (!is.null(parlist$base_margin)) {
    print("pouet pouet")  # debug print
    xgboost::setinfo(parlist$data, "base_margin", parlist$base_margin)
  }

  if (!is.null(.weights))
    xgboost::setinfo(parlist$data, "weight", .weights)
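If the patch above were merged (and base_margin registered in the learner's param set), usage could look like this hypothetical sketch; base_margin as a makeLearner hyperparameter does not exist in mlr today, and the objective value assumes an mlr version that accepts count:poisson for regr.xgboost:

library(mlr)

# hypothetical: requires the patch above plus a registered base_margin parameter
lrn = makeLearner("regr.xgboost",
  objective = "count:poisson",
  eval_metric = "poisson-nloglik",
  nrounds = 50,
  base_margin = log(mydb$exposure))  # log(exposure) as Poisson offset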


Below is what some non-mlr code looks like (appendix 1), followed by an mlrMBO implementation that I believe works but doesn't use the makeLearner function (appendix 2).

Appendix 1: here is some sample non-mlr code. The goal is to minimise the cross-validated Poisson negative log-likelihood, cv$evaluation_log[, min(test_poisson_nloglik_mean)].

library(xgboost)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(tidyverse)


set.seed(1234)
data(dataCar)
mydb <- dataCar %>% select(numclaims, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)

label_var <- "numclaims"  
offset_var <- "exposure"
feature_vars <- mydb %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()


# prepare data for xgboost (one-hot encoding of categorical (factor) data)
myformula <- paste0("~", paste0(feature_vars, collapse = " + ")) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data = mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier, newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))),
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)

feature_vars_dummy <- mydb_dummy %>% select(-one_of(c(label_var, offset_var))) %>% colnames()

# create xgb.matrix for xgboost consumption
mydb_xgbmatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% as.matrix(),
  label = mydb_dummy %>% pull(label_var),
  missing = NA)  # missing values are NA, not the string "NAN"

# base_margin is the base prediction xgboost boosts from; with the log link
# of count:poisson, the exposure offset enters as log(exposure)
setinfo(mydb_xgbmatrix, "base_margin", mydb %>% pull(offset_var) %>% log())

# random monotonicity constraint, just to show how it can be used
myConstraint <- data_frame(Variable = feature_vars_dummy) %>%
  mutate(sens = ifelse(Variable == "veh_age", -1, 0))

# cv folds
cv_folds = rBayesianOptimization::KFold(mydb_dummy$numclaims,
                                        nfolds = 3,
                                        stratified = TRUE,
                                        seed = 0)

cv <- xgb.cv(params = list(
    booster = "gbtree",
    eta = 0.01,
    max_depth = 2,
    min_child_weight = 2,
    gamma = 0,
    subsample = 0.6,
    colsample_bytree = 0.6,
    objective = "count:poisson",
    eval_metric = "poisson-nloglik"),
  data = mydb_xgbmatrix,
  nrounds = 50,
  folds = cv_folds,
  monotone_constraints = myConstraint$sens,
  prediction = FALSE,
  showsd = TRUE,
  early_stopping_rounds = 20,
  verbose = 0)

cv$evaluation_log[, min(test_poisson_nloglik_mean)]
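Since early stopping is enabled, the same score can also be read off the iteration xgboost records as best (assuming an xgboost version that sets best_iteration in the xgb.cv result):

cv$evaluation_log[cv$best_iteration, test_poisson_nloglik_mean]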

Appendix 2: here is some code based on mlrMBO that I believe works but doesn't use the makeLearner function:


library(mlrMBO)  # also attaches smoof and ParamHelpers

obj.fun <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(eta) {
    set.seed(1234)
    cv <- xgb.cv(params = list(
        booster = "gbtree",
        eta = eta,
        max_depth = 2,
        min_child_weight = 2,
        gamma = 0,
        subsample = 0.6,
        colsample_bytree = 0.6,
        objective = "count:poisson",
        eval_metric = "poisson-nloglik"),
      data = mydb_xgbmatrix,
      nrounds = 200,
      folds = cv_folds,
      monotone_constraints = myConstraint$sens,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 50,
      verbose = 0)
    cv$evaluation_log[, min(test_poisson_nloglik_mean)]
  },
  par.set = makeParamSet(
    makeNumericParam("eta", lower = 0.001, upper = 0.05)
  ),
  minimize = TRUE  # poisson-nloglik is a loss, so lower is better
)

des = generateDesign(n = 3, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
des$y = apply(des, 1, obj.fun)
des

control = makeMBOControl()
control = setMBOControlTermination(control, iters = 2)  # two iterations, so both can be plotted below
run = exampleRun(obj.fun, control = control, show.info = FALSE)
plotExampleRun(run, iters = c(1L, 2L), pause = FALSE)
print(run)
run$evals
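exampleRun() is mainly for visualising how MBO proceeds on cheap 1-d problems; for an actual tuning run, the standard mlrMBO entry point is mbo(), reusing the design and control from above (a minimal sketch):

run2 = mbo(obj.fun, design = des, control = control, show.info = FALSE)
run2$x  # best eta found
run2$y  # corresponding cross-validated poisson-nloglik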

@j-hartshorn (Contributor)

This would be a very useful addition, and I think it could be done by adding an offset argument to tasks, not unlike the weights argument. From there, learners could use it for training and prediction.

Does anyone know why this might not be a good idea?
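For reference (not part of the original comment), this mirrors how base R already handles exposure offsets in Poisson models, which is what such a task-level offset argument would have to forward to each learner; a minimal sketch using the dataCar example from above:

library(insuranceData)
data(dataCar)

# log(exposure) enters additively on the linear-predictor (log) scale
fit = glm(numclaims ~ veh_value + veh_age + agecat,
          family = poisson(link = "log"),
          offset = log(exposure),
          data = dataCar)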

pat-s (Member) commented Jun 17, 2019

If someone comes up with a PR for this, we are happy to review it. For now, I am closing this issue and advise that future enhancements be proposed in mlr3.

pat-s closed this as completed Jun 17, 2019