
Feature: Handling of Count Data, e.g. Poisson Regression #515

Closed
javdg opened this issue Oct 6, 2015 · 7 comments

Comments

javdg commented Oct 6, 2015

I intended to use mlr for a benchmark of Poisson regression models but realized that support for count-data models hasn't actually been implemented yet.
While I successfully worked around these limitations in my specific use case, I talked to @berndbischl, and this issue/feature request is supposed to track and coordinate any efforts towards a proper implementation.

As to specific performance measures, these StackExchange threads might be helpful:

I also compiled a (likely incomplete) list of learners that might be considered (I'm personally not familiar with most of the more complex models, so some might not be appropriate or a priority; minimal sketches of the first two entries follow after the lists):

Models for Count Data

  • glm() {stats}
  • glm.nb() {MASS}
  • glmnet() {glmnet}
  • cv.glmnet() {glmnet}
  • hurdle() {pscl}
  • zeroinfl() {pscl}

General Implementations (including models for count data)

  • gamlss() {gamlss}
  • poissonff() {VGAM}

Further Extensions (to the classical glm, including count data models)

  • finite mixture models {flexmix}
  • generalized estimating equations {geepack}
  • mixed-effects models {lme4, nlme}

Apparently Outdated

  • zicounts() {zicounts - orphaned}
  • fit.zigp() {ZIGP - not on CRAN anymore}

see also https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
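For concreteness (not from the original post), minimal sketches of the first two entries in the list, assuming a data frame df with a count outcome y:

# Poisson GLM from {stats}; the log link is the default for family = poisson
fit.pois <- glm(y ~ ., family = poisson(link = "log"), data = df)

# negative binomial GLM from {MASS}, useful when the counts are overdispersed
fit.nb <- MASS::glm.nb(y ~ ., data = df)

predict(fit.pois, type = "response")  # expected counts on the original scale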

Cheers,
Johannes

ghost commented Nov 5, 2015

Hello,

Two more boosting algorithms that can be used with count data (minimal sketches below):

  • xgboost() {xgboost}
  • gbm() {gbm}
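As a rough illustration (not from the original comment), the count-data settings for these two, assuming a numeric feature matrix X and a count vector y:

library(gbm)
library(xgboost)

# gbm supports Poisson deviance directly via its distribution argument
fit.gbm <- gbm(y ~ ., distribution = "poisson",
               data = data.frame(y = y, X), n.trees = 100)

# xgboost offers a Poisson objective and a matching evaluation metric
fit.xgb <- xgboost(data = X, label = y, nrounds = 100,
                   objective = "count:poisson",
                   eval_metric = "poisson-nloglik")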

Best
Basil

zmjones (Contributor) commented Dec 9, 2015

I am a bit confused about why this is necessary. Most (almost all?) regression methods for counts produce positive real-valued predictions, so I don't see how this wouldn't work as a normal regression task. Any count-specific measures could be added without adding any other new structure. Is there some other reason we should have count tasks that I've missed? If there is, could it not be solved by adding additional structure to a regression task and maybe an additional property to regression learners (e.g., check/enforce non-negativity)?
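As a rough illustration of that suggestion (not part of the original comment), a minimal sketch of a count-specific measure in mlr via makeMeasure, here the mean Poisson deviance; the id and implementation are illustrative, not an existing mlr measure:

library(mlr)

poisson.deviance = makeMeasure(
  id = "poisdev", minimize = TRUE,
  properties = c("regr", "req.pred", "req.truth"),
  best = 0, worst = Inf,
  fun = function(task, model, pred, feats, extra.args) {
    y  = pred$data$truth
    mu = pmax(pred$data$response, .Machine$double.eps)  # guard against predictions <= 0
    term = ifelse(y == 0, 0, y * log(y / mu))           # y * log(y / mu) is 0 when y == 0
    2 * mean(term - (y - mu))                           # mean Poisson deviance
  }
)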

javdg (Author) commented Dec 9, 2015

Hi,
this thread arose from encountering what was later independently reported (and has since been fixed) as #559.
In my subsequent discussions with @berndbischl he suggested tackling this issue in a systematic manner and creating this issue to track any developments in this regard.
I tend to agree that this doesn't necessarily require a completely new and separate count task, even though I think that's what Bernd originally had in mind; he needed some more time to look into it.

javdg (Author) commented Apr 6, 2016

As per our conversation over the weekend, this is to ping @berndbischl ...

SimonCoulombe commented Dec 20, 2018

Hi, I was wondering if there is any way to do Poisson regression using xgboost in mlr? I also need to be able to offset for exposure using xgboost's setinfo "base_margin".

Taking a guess, I would need to edit the trainLearner.regr.xgboost function in RLearner_regr_xgboost.R to add the part marked "proposed addition" below, but I really don't know what I am doing...

  parlist$data = xgboost::xgb.DMatrix(data = data.matrix(task.data$data), label = task.data$target)

  # proposed addition: forward a user-supplied base_margin to the DMatrix
  if (!is.null(parlist$base_margin)) {
    print("pouet pouet")  # debug print
    xgboost::setinfo(parlist$data, "base_margin", parlist$base_margin)
  }

  if (!is.null(.weights))
    xgboost::setinfo(parlist$data, "weight", .weights)
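If the patch above were merged (and base_margin registered in the learner's param set), usage could look like this hypothetical sketch; base_margin as a makeLearner hyperparameter does not exist in mlr today, and the objective value assumes an mlr version that accepts count:poisson for regr.xgboost:

library(mlr)

# hypothetical: requires the patch above plus a registered base_margin parameter
lrn = makeLearner("regr.xgboost",
  objective = "count:poisson",
  eval_metric = "poisson-nloglik",
  nrounds = 50,
  base_margin = log(mydb$exposure))  # log(exposure) as Poisson offset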


Below is what some non-mlr code looks like (appendix 1), followed by an mlrMBO implementation that I believe works but doesn't use the makeLearner function (appendix 2).

Appendix 1: here is some sample non-mlr code. The goal is to minimise the cross-validated Poisson negative log-likelihood, cv$evaluation_log[, min(test_poisson_nloglik_mean)].

library(xgboost)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(tidyverse)


set.seed(1234)
data(dataCar)
mydb <- dataCar %>% select(numclaims, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)

label_var <- "numclaims"  
offset_var <- "exposure"
feature_vars <- mydb %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()


# prepare data for xgboost (one-hot encoding of categorical (factor) data)
myformula <- paste0("~", paste0(feature_vars, collapse = " + ")) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data = mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier, newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))),
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)

feature_vars_dummy <- mydb_dummy %>% select(-one_of(c(label_var, offset_var))) %>% colnames()

# create xgb.matrix for xgboost consumption
mydb_xgbmatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% as.matrix(),
  label = mydb_dummy %>% pull(label_var),
  missing = NA)  # missing values are NA, not the string "NAN"

# base_margin is the base prediction xgboost boosts from; with the log link
# of count:poisson, the exposure offset enters as log(exposure)
setinfo(mydb_xgbmatrix, "base_margin", mydb %>% pull(offset_var) %>% log())

# random monotonicity constraint, just to show how it can be used
myConstraint <- data_frame(Variable = feature_vars_dummy) %>%
  mutate(sens = ifelse(Variable == "veh_age", -1, 0))

# cv folds
cv_folds = rBayesianOptimization::KFold(mydb_dummy$numclaims,
                                        nfolds = 3,
                                        stratified = TRUE,
                                        seed = 0)

cv <- xgb.cv(params = list(
    booster = "gbtree",
    eta = 0.01,
    max_depth = 2,
    min_child_weight = 2,
    gamma = 0,
    subsample = 0.6,
    colsample_bytree = 0.6,
    objective = "count:poisson",
    eval_metric = "poisson-nloglik"),
  data = mydb_xgbmatrix,
  nrounds = 50,
  folds = cv_folds,
  monotone_constraints = myConstraint$sens,
  prediction = FALSE,
  showsd = TRUE,
  early_stopping_rounds = 20,
  verbose = 0)

cv$evaluation_log[, min(test_poisson_nloglik_mean)]
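Since early stopping is enabled, the same score can also be read off the iteration xgboost records as best (assuming an xgboost version that sets best_iteration in the xgb.cv result):

cv$evaluation_log[cv$best_iteration, test_poisson_nloglik_mean]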

Appendix 2: here is some code based on mlrMBO that I believe works but doesn't use the makeLearner function:


library(mlrMBO)  # also attaches smoof and ParamHelpers

obj.fun <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(eta) {
    set.seed(1234)
    cv <- xgb.cv(params = list(
        booster = "gbtree",
        eta = eta,
        max_depth = 2,
        min_child_weight = 2,
        gamma = 0,
        subsample = 0.6,
        colsample_bytree = 0.6,
        objective = "count:poisson",
        eval_metric = "poisson-nloglik"),
      data = mydb_xgbmatrix,
      nrounds = 200,
      folds = cv_folds,
      monotone_constraints = myConstraint$sens,
      prediction = FALSE,
      showsd = TRUE,
      early_stopping_rounds = 50,
      verbose = 0)
    cv$evaluation_log[, min(test_poisson_nloglik_mean)]
  },
  par.set = makeParamSet(
    makeNumericParam("eta", lower = 0.001, upper = 0.05)
  ),
  minimize = TRUE  # poisson-nloglik is a loss, so lower is better
)

des = generateDesign(n = 3, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
des$y = apply(des, 1, obj.fun)
des

control = makeMBOControl()
control = setMBOControlTermination(control, iters = 2)  # two iterations, so both can be plotted below
run = exampleRun(obj.fun, control = control, show.info = FALSE)
plotExampleRun(run, iters = c(1L, 2L), pause = FALSE)
print(run)
run$evals
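exampleRun() is mainly for visualising how MBO proceeds on cheap 1-d problems; for an actual tuning run, the standard mlrMBO entry point is mbo(), reusing the design and control from above (a minimal sketch):

run2 = mbo(obj.fun, design = des, control = control, show.info = FALSE)
run2$x  # best eta found
run2$y  # corresponding cross-validated poisson-nloglik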

@j-hartshorn (Contributor)

This would be a very useful addition, and I think it could be done by adding an offset argument to tasks, not unlike the weights argument. From there, learners could use it for training and prediction.

Does anyone know why this might not be a good idea?
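For reference (not part of the original comment), this mirrors how base R already handles exposure offsets in Poisson models, which is what such a task-level offset argument would have to forward to each learner; a minimal sketch using the dataCar example from above:

library(insuranceData)
data(dataCar)

# log(exposure) enters additively on the linear-predictor (log) scale
fit = glm(numclaims ~ veh_value + veh_age + agecat,
          family = poisson(link = "log"),
          offset = log(exposure),
          data = dataCar)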

pat-s (Member) commented Jun 17, 2019

If someone comes up with a PR for this, we are happy to review it. For now, I am closing this issue and advise that future enhancements be proposed in mlr3.

pat-s closed this as completed Jun 17, 2019