bug in resample using predict = "train" #1284

Closed
giuseppec opened this Issue Oct 13, 2016 · 2 comments

giuseppec commented Oct 13, 2016

lrn = makeLearner("classif.rpart", predict.type = "prob")
rdesc = makeResampleDesc("CV", iters = 3, predict = "train")
mmce.train = setAggregation(mmce, train.mean)
res = resample(lrn, binaryclass.task, rdesc, mmce.train)

Issue 1: printing res$pred yields

Resampled Prediction for:
Resample description: cross-validation with 3 iterations.
Predict: train
Stratification: FALSE

threshold: 
time (mean): NA
Error in as.data.frame.default(x) : 
  cannot coerce class "c("ResamplePrediction", "NULL")" to a data.frame
In addition: Warning message:
In mean.default(x$time) : argument is not numeric or logical: returning NA

Issue 2: res$pred$predict.type seems to be NULL but should still have the same value as lrn$predict.type!
For example, using predict = "both" seems to work:

lrn = makeLearner("classif.rpart", predict.type = "prob")
rdesc = makeResampleDesc("CV", iters = 3, predict = "both")
mmce.train = setAggregation(mmce, train.mean)
res = resample(lrn, binaryclass.task, rdesc, mmce.train)
res$pred$predict.type
# [1] "prob"
MariaErdmann commented Oct 13, 2016

The problem is in makeResamplePrediction. See my (unfortunately not so short) example below:

learner = makeLearner("classif.rpart")
task = binaryclass.task
resampling = makeResampleDesc("CV", iters = 2, predict = "train")
resampling = makeResampleInstance(resampling, task = task)
rin = resampling

mmce.train = setAggregation(mmce, train.mean)
extract = function(model) {}
more.args = list(learner = learner, task = task, rin = rin, weights = NULL,
  measures = list(mmce.train), model = FALSE, extract = extract, show.info = getMlrOption("show.info"))

library(parallelMap)
parallelLibrary("mlr", master = FALSE, level = "mlr.resample", show.info = FALSE)
exportMlrOptions(level = "mlr.resample")
iter.results = parallelMap(doResampleIteration, seq_len(rin$desc$iters), level = "mlr.resample", more.args = more.args)

ms.train = as.data.frame(extractSubList(iter.results, "measures.train", simplify = "rows"))
ms.train
ms.test = extractSubList(iter.results, "measures.test", simplify = FALSE)
ms.test = as.data.frame(do.call(rbind, ms.test))
ms.test

preds.test = extractSubList(iter.results, "pred.test", simplify = FALSE)
preds.test
preds.train = extractSubList(iter.results, "pred.train", simplify = FALSE)
preds.train

pred = makeResamplePrediction(instance = rin, preds.test = preds.test, preds.train = preds.train)
# calling pred I can reproduce the error
pred

# looking into makeResamplePrediction we see where it comes from
tenull = sapply(preds.test, is.null)
trnull = sapply(preds.train, is.null)
if (any(tenull)) pr.te = preds.test[!tenull] else pr.te = preds.test
if (any(trnull)) pr.tr = preds.train[!trnull] else pr.tr = preds.train

data = setDF(rbind(
  rbindlist(lapply(seq_along(pr.te), function(X) cbind(pr.te[[X]]$data, iter = X, set = "test"))),
  rbindlist(lapply(seq_along(pr.tr), function(X) cbind(pr.tr[[X]]$data, iter = X, set = "train")))
))

# the problem is p1, which is NULL because we only calculated measures for train
p1 = preds.test[[1L]]
p1

# printing the S3 object fails because some 'slots' are NULL (see below)

makeS3Obj(c("ResamplePrediction", class(p1)),
  instance = rin,
  predict.type = p1$predict.type,
  data = data,
  threshold = p1$threshold,
  task.desc = p1$task.desc,
  time = extractSubList(preds.test, "time")
)

p1$predict.type
p1$threshold
p1$task.desc
extractSubList(preds.test, "time")
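
The failure mode can be reproduced without mlr. The following is a minimal base R sketch, assuming only that p1 is NULL as in the example above; the object built here is a plain list standing in for the real ResamplePrediction, not something mlr produces.

```r
# Base R only; p1 stands in for the NULL test prediction from the example above.
p1 = NULL

# class(NULL) is the string "NULL", so it ends up in the class vector.
cls = c("ResamplePrediction", class(p1))
print(cls)

# `$` on NULL silently returns NULL, so these fields are lost without any error.
print(is.null(p1$predict.type))
print(is.null(p1$threshold))

# A plain list tagged with that class vector has no matching as.data.frame
# method, so dispatch falls through to as.data.frame.default, which stops.
obj = structure(list(data = data.frame()), class = cls)
res = tryCatch(as.data.frame(obj), error = function(e) conditionMessage(e))
print(res)
```

So both symptoms share one cause: the first test prediction is used as the template for the class vector and for every metadata field.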

So my suggestion for the task description and predict.type would be to pass the learner and the task to the makeResamplePrediction function, which is no big deal because makeResamplePrediction is only called in the function mergeResampleResult, where the learner and task are passed anyway.

For the threshold and the time I am not sure how to handle this.
Is it possible to have different thresholds for train and predict measures? If so, then we need to pass a vector, right? Regarding the time slot: which time should be displayed here?
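
One other option, besides passing the learner and task, would be to take the metadata from the first prediction that is not NULL, whether it comes from the test or the train set. The sketch below only illustrates that idea. firstNonNull is a hypothetical helper that is not part of mlr, the predictions are toy lists rather than real Prediction objects, and this is not necessarily what the eventual fix does. It also leaves the questions about threshold and time open.

```r
# Hypothetical helper (not part of mlr): return the first non-NULL element of a list.
firstNonNull = function(xs) {
  keep = !vapply(xs, is.null, logical(1L))
  if (!any(keep)) return(NULL)
  xs[[which(keep)[1L]]]
}

# Toy stand-ins for the per-iteration predictions with predict = "train":
# every test prediction is NULL, only the train predictions exist.
preds.test = list(NULL, NULL)
preds.train = list(
  list(predict.type = "prob", time = 0.01),
  list(predict.type = "prob", time = 0.02)
)

# Look through the test predictions first, then the train predictions.
p1 = firstNonNull(c(preds.test, preds.train))
print(p1$predict.type)
print(c("ResamplePrediction", class(p1)))
```

With real Prediction objects, class(p1) would then carry the actual prediction class instead of "NULL".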

MariaErdmann commented Oct 28, 2016

Fixed in #1315

larskotthoff added a commit that referenced this issue Oct 31, 2016

Fix bug in resampling when using predict = "train" (#1284) (#1315)
* enable printing pred of resample results when predict type is train

* Finish bug fix and adapt test

* better tests