
Tuning the probability threshold for classification #856

Closed
studerus opened this issue Apr 26, 2016 · 28 comments

@studerus
Contributor

studerus commented Apr 26, 2016

Is it possible with mlr to tune the probability threshold for classification using nested cross-validation? In the tutorial on cost-sensitive classification the function tuneThreshold is briefly explained, but if I understand it correctly, it can only be used for unnested resampling. I think it is important to be able to tune the threshold in nested cross-validation because searching for an optimal threshold can lead to strong overoptimism. Hence, if we want to properly estimate the predictive performance in the same sample, we have to strictly separate testing from learning.

Why don't we treat the probability threshold as a regular tuning parameter? This would not only allow nested cross-validation but also tuning this parameter with other tuning parameters at the same time.

I also don't understand why changing the threshold is not discussed as potential remedy for class imbalance in the tutorial on imbalanced classification problems.

@schiffner
Contributor

You actually can tune the threshold in nested cross-validation: you just need to set tune.threshold = TRUE in the makeTuneControl* function. Joint tuning of the threshold and other learner parameters also works this way.

I also don't understand why changing the threshold is not discussed as potential remedy for class imbalance in the tutorial on imbalanced classification problems.

That's something that's still missing unfortunately.
Generally, the cost-sensitive, imbalanced and ROC pages need to be synchronized better, as there are some things that are explained several times and some things that are missing. I also wanted to add a page about the decision threshold, explain how the generalization to the multi-class case works and what options there are for tuning. I have written quite a lot already, but have no time currently. :(

@studerus
Contributor Author

Thanks a lot! I will try it out.

@berndbischl
Sponsor Member

You actually can tune the threshold in nested cross-validation

Here is a script that shows this.

library(mlr)
library(ElemStatLearn)  # provides the spam data set
data(spam)

task = makeClassifTask(data = spam, target = "spam")
lrn1 = makeLearner("classif.gbm", predict.type = "prob")
ps = makeParamSet(
  makeIntegerParam("interaction.depth", lower = 1, upper = 5)
)
ctrl = makeTuneControlRandom(maxit = 2, tune.threshold = TRUE)
lrn2 = makeTuneWrapper(lrn1, par.set = ps, control = ctrl, resampling = cv2)
r = resample(lrn2, task, cv3, extract = getTuneResult)
print(r$extract)

@berndbischl
Sponsor Member

berndbischl commented Apr 26, 2016

Further comments from my side, and questions for Erich:

  1. In the above script the tuning is (of course) nested. What happens is:
    The tuner evaluates a certain learner config via the (inner) CV2. On these predictions the optimal threshold is then selected for this learner config (by calling tuneThreshold on the ResamplePrediction object that was generated during the config evaluation). So for the optimal learner config, at the end of tuning, we also know its selected threshold. The model is then trained on the complete outer training data set, the threshold is set, and we predict the outer test set. (The whole process is repeated for the outer resampling loop.)

Do we need to improve the docs here? I guess my text above should be in there for clarification.

  2. I do see a problem: I don't know if we support tuning the threshold WITHOUT tuning ANY OTHER hyperparameter. So for normal classif.logreg I don't know how to do this...?
    (I mean: I don't know how to code this as a "user". It would definitely be possible with a change in mlr to support this.)

  3. Erich asked:

Why don't we treat the probability threshold as a regular tuning parameter?

I had that ages ago in mlr. It was changed, and it is now handled "specially" because we want to be efficient. If you treated it as a normal tuning parameter, this would happen:
You evaluate alpha = 2, th = 0.7. Now if you want to know how alpha = 2, th = 0.8 performs, you would need to TRAIN your model again, as this is what the tuner does. What we do instead is train alpha = 2 once, then select the optimal th from all possible thresholds (see the sketch after this list).
Does that make sense?

  4. Regarding the tutorial:
    Many places are "imperfect" here, as we have so much in the package. (NO criticism of Julia's extremely cool work here.)
    Care to help out to make it better? At least in this instance?
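
To make point 3 concrete, here is a minimal sketch (reusing task from the script above) of the "train once, then sweep the threshold over the pooled predictions" idea; tuneThreshold automates essentially this, with a proper search over the interval instead of a fixed grid:

# train a candidate config only once, then evaluate many thresholds
# on the same cross-validated predictions -- no retraining needed
lrn = makeLearner("classif.gbm", predict.type = "prob", interaction.depth = 2)
r = resample(lrn, task, cv2)
ths = seq(0.05, 0.95, by = 0.05)
perf = sapply(ths, function(t) performance(setThreshold(r$pred, t), measures = mmce))
ths[which.min(perf)]  # selected threshold for this config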

@berndbischl
Sponsor Member

Regarding 2)
Possibly the cleanest solution would be to create a specific control object for this?

lrn1 = makeLearner("classif.logreg", predict.type = "prob")
ctrl = makeTuneControlThresholdOnly()
lrn2 = makeTuneWrapper(lrn1, par.set = ps, control = ctrl, resampling = cv2)
r = resample(lrn2, task, cv3, extract = getTuneResult)

?

@studerus
Contributor Author

Yes, I think this would be helpful. As a workaround for only tuning the threshold and not other parameters at the same time, I can specify an integer tuning parameter whose upper and lower bounds are equal to the default value.
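
A minimal sketch of that workaround, reusing the spam task from above (the dummy parameter is fixed at a single value, so effectively only the threshold gets tuned):

lrn = makeLearner("classif.gbm", predict.type = "prob")
# dummy parameter with lower == upper: only one config is ever evaluated
ps.fix = makeParamSet(makeIntegerParam("interaction.depth", lower = 1, upper = 1))
ctrl = makeTuneControlGrid(tune.threshold = TRUE)
lrn2 = makeTuneWrapper(lrn, par.set = ps.fix, control = ctrl, resampling = cv2)
r = resample(lrn2, task, cv3, extract = getTuneResult)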

@andrewjohnlowe

Hello,

I see how setting tune.threshold = TRUE in the makeTuneControl* function above can be used to tune the probability threshold used to convert probabilities to predictions for the class labels. However, I don't see what is being optimised in order to find this threshold, and I don't understand how I can change whatever is being optimised to what I want.

What I want to do is tune the probability threshold so that I get a specific TPR, and I want both the threshold and the FPR at that point. I actually want to do this for a small range of TPRs (60, 70, 80%). I'm training a glmnet model on moderately unbalanced data and I'm currently using AUC as the optimisation objective (but this is only a surrogate objective; my metrics of interest are really the FPRs at specific TPRs).

How can I control how the probability threshold is optimised in nested CV? Thanks!

@PhilippPro
Member

PhilippPro commented Sep 8, 2016

Hi Andrew,

you can control the measure that is optimised by setting it in makeTuneWrapper. See the measures argument in the help of makeTuneWrapper. Here you can set e.g. auc as the first measure, so the tuning optimises that measure.

Regarding your second paragraph:
If you have trained a learner with probabilities (set predict.type = "prob" in makeLearner), you can extract values for ROC-analysis and get all possible combinations of tpr and fpr. See the tutorial page for ROC-Analysis:
http://mlr-org.github.io/mlr-tutorial/devel/html/roc_analysis/index.html

There you can find this line, which extracts the different tpr and fpr combinations for you:

df = generateThreshVsPerfData(list(lda = pred1, ksvm = pred2), measures = list(fpr, tpr))

Tuning fpr or tpr directly does not make sense to me, as you would then just predict everything as FALSE or everything as TRUE.

If you want to set the threshold from the beginning you can set it already when creating the learner with makeLearner.
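
For example (a one-line sketch; the value 0.3 is arbitrary):

lrn = makeLearner("classif.gbm", predict.type = "prob", predict.threshold = 0.3)
# or, on an existing learner:
lrn = setPredictThreshold(lrn, 0.3)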

The modified example from above for your use case (first makeTuneWrapper with auc, then get tpr and fpr combinations):

library(mlr)
library(ElemStatLearn)
data(spam)

task = makeClassifTask(data = spam, target = "spam")
lrn1 = makeLearner("classif.gbm", predict.type = "prob")
ps = makeParamSet(
  makeIntegerParam("interaction.depth", lower = 1, upper = 5)
)
ctrl = makeTuneControlRandom(maxit = 2, tune.threshold = TRUE)
lrn2 = makeTuneWrapper(lrn1, par.set = ps, control = ctrl, resampling = cv2, measures = list(auc))
r = resample(lrn2, task, cv3, extract = getTuneResult)

generateThreshVsPerfData(r, measures = list(fpr, tpr))

Hope I could help you.

@andrewjohnlowe

Hi, it wasn't clear to me from the example above that the measure that is optimised to find the probability threshold is mmce, which is the default. I now understand that if I change the first measure in the list of measures in makeTuneWrapper, this new measure is the one being optimised. Thanks!

With regards to the rest: instead of using AUC as a surrogate optimisation objective and then finding the probability threshold for a desired TPR by evaluating on a holdout sample or during CV, I want to minimise the FPR for a desired TPR directly, by using some combination of TPR and FPR as a single measure that is my new optimisation objective, tuned with tune.threshold = TRUE in the makeTuneControl* function. This is what Lars Kotthoff proposed here:
http://stackoverflow.com/questions/39214123/tune-glmnet-hyperparameters-and-evaluate-performance-using-nested-cross-validati

I have found that trying to use generateThreshVsPerfData to extract the FPR for a specified TPR does not work; invoking it from inside a custom measure fails silently during tuning -- I don't know why. This is probably not the way I want to go. So far, the custom measures I've tried are:

sqrt((0.6 - tpr)^2 + fpr^2)

i.e., the Euclidean distance. This didn't guarantee a TPR of 0.6 as desired. I also tried:

tolerance <- 0.01; ifelse(abs(tpr - 0.6) > tolerance, 1, fpr)

which at least ensures that the TPR is within a specified range of values; the custom measure is minimised when TPR = 0.6 (or as close to this value as the tolerance allows). The reason I'm doing this is that for my application, the metric of interest for comparing models is the FPR at a specified TPR; models are usually tuned for a specific TPR. Can you suggest a better single measure (some combination of TPR and FPR, or perhaps something else) to achieve what I want?

@schiffner
Contributor

Hi Andrew,

maybe partial AUC (e.g. in package pROC) is an alternative for you. As far as I know you can restrict the tpr or fpr to lie in some interval, but I've never tried how well it works when the interval is very small, like in your case.
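
For instance, something along these lines with pROC (just a sketch, assuming a vector of predicted probabilities prob and true labels truth; the sensitivity range is up to you):

library(pROC)
r = roc(truth, prob)
# partial AUC restricted to sensitivity (= TPR) between 0.55 and 0.65
auc(r, partial.auc = c(0.55, 0.65), partial.auc.focus = "sensitivity",
    partial.auc.correct = TRUE)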

Cheers,
Julia

@larskotthoff
Sponsor Member

You could, in your custom measure, check whether tpr is the desired value and if not return a really bad score. Not sure how well that would work in practice though.

@andrewjohnlowe

Julia: Using pROC to get hold of the FPR at a specified TPR and minimising that FPR seems to work. It took a bit of debugging, but my code for my custom measure looks like this:

## Define a custom measure to calculate the FPR at TPR = 60%
## (requires library(pROC) for roc() and coords(); tpr.tune is defined outside this function):
my.fpr.60.fun = function(task, model, pred, feats, extra.args) {
  roc <- roc(as.numeric(pred$data$truth),
             as.numeric(pred$data$response),
             smooth = FALSE)
  spc <- coords(roc, tpr.tune, input = "sensitivity", ret = "specificity")
  fpr <- 1 - spc
  return(fpr)
}

where tpr.tune = 0.6 for a desired TPR of 60%. Using smoothing didn't work. I'm not sure why.
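
For reference, a sketch of how a function like this could then be turned into an actual mlr measure with makeMeasure(), so it can be handed to makeTuneWrapper() or tuneThreshold() (id, name and properties are illustrative):

fpr.at.tpr60 = makeMeasure(
  id = "fpr.at.tpr60", name = "FPR at TPR = 0.6",
  minimize = TRUE, best = 0, worst = 1,
  properties = c("classif", "req.pred", "req.truth"),
  fun = my.fpr.60.fun  # the function defined above
)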

Lars: That was my first idea; I mentioned this above. This also seems to work, although I don't like having to specify a tuning parameter to decide how close to 60% I must be to return the FPR without a big penalty for being outside the accepted range. I wonder how efficiently this measure can be minimised. I don't know anything about how the optimisation engine in mlr works, but I imagine that if it's using something like gradient descent, having a measure that has zero gradient except for one small region where there is a Dirac delta-type spike might be a bit tricky to minimise. This is just a wild guess. This method seems to work, but I have this nagging doubt that the results I get may be sub-optimal and I don't know it.

Provided it's fast enough, Julia's method seems to be the simplest and most direct way to get the measure I want, so I'll go with that for the moment.

As I understand it, nested CV gives me an estimate of the performance I can expect from the final model on new data if I use the model selection method from the inner loop of the CV; to build the final model, I should then repeat exactly the same procedure that I performed in the inner loop and apply it to the entire dataset. That is, using all the data, I find the optimal hyperparameters (the alpha and lambda of my glmnet model) using CV and then use these values to build the final model. Please scream if I have this wrong! :-)

Thanks!

@larskotthoff
Sponsor Member

mlr has different optimisation methods and they are all able to deal with these spikes. If you're using grid or random search they're completely unaffected by these.

@berndbischl
Sponsor Member

hi,

let me clear this up.

Hi, it wasn't clear to me from the example above that the measure that is optimised to find the probability threshold is mmce, which is the default. I now understand that if I change the first measure in the list of measures in makeTuneWrapper, this new measure is the one being optimised. Thanks!

this is correct. tuneThreshold optimizes the measure you selected for tuning. it can basically be any measure.

  1. what you are requesting: "optimize the FPR under a constraint on the TPR like TPR >= 0.6" is a very common and reasonable approach. i will try to help you with this. and mlr should support this (better) as this is so important

  2. @PhilippPro is correct that normally you can simply read the values directly off the ROC plot. I think what you are doing is sometimes called "selecting an operating point on the ROC curve". but if you are in a nested resampling setup, with tuning, this plot of course does not help you.

  3. @schiffner is correct that using a partial AUC is very similar to what you want. it is not exactly the same, but quite similar. i do like partial AUCs. but we should also support EXACTLY what you want.

I will post more later

@andrewjohnlowe

Hi Bernd,

Thanks for your reply. I look forward to hearing more later. In the meantime, it might help to know the context for the model I'm building. I'm a particle physicist working on one of the large experiments at the Large Hadron Collider at CERN, and I want to use mlr to build a classifier that can distinguish between two different classes of subatomic particle, based on their decay properties. I will be citing the mlr package in my paper, and you'll be able to add my paper to your list of works that use mlr. I'm trying to improve on this work: https://arxiv.org/abs/1405.6583

Quick comments in reply to your comments:

  1. OK, I have this figured out and it works.
  2. I'm actually trying to optimise the FPR for TPR = 0.6, not >= 0.6, so that I can compare with earlier work.
  3. I don't understand. Why doesn't the ROC plot help me? You are saying that the method Julia proposed is wrong for nested CV?
  4. Is there a difference between what I want and what has already been suggested? I don't follow. Sorry.

Thanks!

@berndbischl
Sponsor Member

I'm actually trying to optimise the FPR for TPR = 0.6, not >= 0.6, so that I can compare with earlier work.

the reason i said "subject to TPR >= 0.6" as the constraint for "FPR = min!", instead of "subject to TPR = 0.6", is that both versions are essentially equivalent here, right?
On an ROC curve a higher TPR generally comes with a higher FPR, so minimizing the FPR under TPR >= 0.6 usually ends up at TPR = 0.6 anyway. and if you could get a higher TPR (than 0.6) for the same optimal FPR (let's assume 0.1), that result is preferable? so i would always model it with the inequality constraint instead of the equality constraint in optimization....
that just seems better?

if you somehow have a "baggage" of earlier work that you need to compare to, it's hard for me to factor that in....

tolerance <- 0.01; ifelse(abs(tpr - 0.6) > tolerance, 1, fpr)

modelling it like in the measure formula above is how i would do it as well. simply do that. it would look very similar for a constraint a la TPR >= 0.6.

I don't understand. Why doesn't the ROC plot help me?

i simply meant you cannot "act on a plot" in nested resampling / your model selection. with the ROC plot i meant: for a "static" model like a simple logistic regression, you could look at the plot, select the point, read off the FPR at TPR = 0.6 and you are done. during nested resampling with tuning and model selection, which is what you seem to be doing, you need a numerical criterion (the measure we just discussed or something similar, but not a ggplot object....)

You are saying that the method Julia proposed is wrong for nested CV?

Is there a different between what I want and what has already suggested? I don't follow. Sorry.

like i said, creating a custom measure that does this here:

tolerance <- 0.01; ifelse(abs(tpr - 0.6) > tolerance, 1, fpr)

is doable in mlr and does exactly what you want. also, the mlr threshold optimizer is basically an interval search and has no real problem if your objective function is "unsmooth" or looks weird. so also no problem there.
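
For the record, a sketch of what such a custom measure could look like, assuming mlr's measureTPR()/measureFPR() helpers (the target TPR and tolerance are hard-coded just for illustration):

fpr.tpr60 = makeMeasure(
  id = "fpr.tpr60", minimize = TRUE, best = 0, worst = 1,
  properties = c("classif", "req.pred", "req.truth"),
  fun = function(task, model, pred, feats, extra.args) {
    pos = pred$task.desc$positive
    neg = pred$task.desc$negative
    tpr = measureTPR(pred$data$truth, pred$data$response, pos)
    fpr = measureFPR(pred$data$truth, pred$data$response, neg, pos)
    # worst value unless TPR is within the tolerance of the target 0.6
    if (abs(tpr - 0.6) > 0.01) 1 else fpr
  }
)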

in general: i could probably "teach" mlr so that a user can create general measures CONDITIONED on general constraints like TPR > 0.6. then you can tune for that or tuneThreshold for that. that would be very cool. it's not that hard i think but i need some time.

All ok?

@studerus
Contributor Author

Any news regarding this problem?

  1. I do see a problem: I don't know if we support tuning the threshold WITHOUT tuning ANY OTHER hyperparameter. So for normal classif.logreg I don't know how to do this...?

@jakob-r
Sponsor Member

jakob-r commented Nov 16, 2016

You can certainly just use a tuneWrapper with a custom grid with just one parameter setting. Not so beautiful, but it should do the trick.

@studerus
Contributor Author

So, for learners that don't have a tuning parameter, the trick would be to wrap something around them (e.g. makeWeightedClassesWrapper) so that the original learner gets a tuning parameter. We can then tune the threshold together with the new tuning parameter but set the new tuning parameter to a constant value that does not change the original model. Here's my example for tuning the threshold of classif.logreg in nested cross-validation. I hope this is correct.

lrn <- makeLearner('classif.logreg', predict.type = 'prob')
lrn <- makeWeightedClassesWrapper(lrn)
ps <- makeParamSet(makeDiscreteParam('wcw.weight', values = 1))
ctrl <- makeTuneControlGrid(resolution = 10, tune.threshold = TRUE)
meas <- list(gmean, auc, tpr, tnr)
lrn2 <- makeTuneWrapper(learner = lrn, resampling = cv10, measures = meas,
                        par.set = ps, control = ctrl)
resample(learner = lrn2, resampling = cv10, task = pid.task, measures = meas)

@jakob-r
Sponsor Member

jakob-r commented Nov 17, 2016

Certainly you don't need the resolution = 10, because you only have one possible value for tuning, so it will just be evaluated once.

To get the result you can use the following code

res = resample(learner = lrn2, resampling = cv10, task = pid.task, measures = meas, extract = getTuneResult)
extractSubList(res$extract, "threshold")

@studerus
Contributor Author

studerus commented Nov 18, 2016

Thanks, I thought that resolution would have an influence on the number of thresholds being evaluated. Does that mean that the number of thresholds cannot be set manually? How many values are evaluated? 100?

@jakob-r
Sponsor Member

jakob-r commented Nov 18, 2016

The threshold tuning is done in a more exhaustive and accurate way as it is not so expensive. See ?tuneThreshold for details.

@ajing

ajing commented Mar 29, 2017

I have a trivial question about the tune.threshold = TRUE cutoff. For this parameter, will the threshold be different for each repeat? If the cutoff is different, does the result return the mean threshold across repeats?

@PhilippPro
Member

Yes, they will be different.

Look at the example above from Bernd:

library(mlr)
library(ElemStatLearn)
data(spam)

task = makeClassifTask(data = spam, target = "spam")
lrn1 = makeLearner("classif.gbm", predict.type = "prob")
ps = makeParamSet(
  makeIntegerParam("interaction.depth", lower = 1, upper = 5)
)
ctrl = makeTuneControlRandom(maxit = 2, tune.threshold = TRUE)
lrn2 = makeTuneWrapper(lrn1, par.set = ps, control = ctrl, resampling = cv2)
r = resample(lrn2, task, cv3, extract = getTuneResult)
print(r$extract)

Here you can see the different thresholds for each cross-validation step.
Taking the mean:
mean(extractSubList(r$extract, "threshold"))

@mbbrigitte

mbbrigitte commented Aug 2, 2017

After you get a different optimal threshold (or tuning parameter) in each of the Cross Validation steps (see comment by PhilippPro), how do you decide which threshold to use if you want to make a prediction model on the entire dataset? Is there a function to do this in mlr?

@jakob-r
Sponsor Member

jakob-r commented Aug 7, 2017

@mbbrigitte The resampling of the tuned learner (the result of makeTuneWrapper) is meant to give you an idea of whether your tuning strategy works compared to other strategies.
The actual tuning (including the threshold) should be done on the complete training data. This can be done with tuneParams() and afterwards with tuneThreshold().
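
A rough sketch of that workflow, reusing task, lrn1, ps and ctrl from the examples above (the measure and resampling choices are just placeholders):

# 1. tune the hyperparameters on the complete data
tr = tuneParams(lrn1, task, resampling = cv5, par.set = ps,
                control = ctrl, measures = list(auc))
lrn.final = setHyperPars(lrn1, par.vals = tr$x)
# 2. pick the threshold from cross-validated predictions on the complete data
r = resample(lrn.final, task, cv5)
thr = tuneThreshold(r$pred, measure = auc)
# 3. train the final model on all data with the chosen threshold attached
lrn.final = setPredictThreshold(lrn.final, thr$th)
mod = train(lrn.final, task)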

@stale

stale bot commented Dec 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 18, 2019
@stale stale bot closed this as completed Dec 25, 2019
@anaclaramatos

@schiffner can you share with me what you have gathered regarding how the generalization to the multi-class case works and what options there are for tuning?
I'm currently facing a problem of optimizing the decision threshold for a multiclass problem and I'm having some doubts about how to accomplish this.

thank you in advance
