Evaluating models. #109
Comments
Yes, this is important. The review #27 brought up precision-recall versus ROC but only briefly. We can also provide some intuition about why ROC is especially bad with skewed classes and maybe even refer to alternatives beyond area under a curve. A related topic that has come up a few times is the choice of a gold standard for evaluating models, which can be hard in some biomedical domains.
There is a nice review on this in Nature Methods.
Yes, confidence-rated predictions are important, and so is the choice of gold standard.
@akundaje In the Nature Methods paper, their example of why ROC is bad for imbalanced datasets isn't entirely clear to me. In Fig. 4a, they show a case where there are only a few positive patients (e.g., 5 out of 100). In their graph, several of those patients are predicted with zero false positives, and the next several are detected at ~10-20% FPR. They state without qualification that this is bad, because a 10-20% FPR means there are more FPs than TPs. However, this performance would be considered really good in some settings, such as rapid screens. For example, rapid HIV tests in clinical use have FPRs of ~40% or higher (http://www.bmj.com/content/335/7612/188). Therefore, I don't think they've proved their premise that AUC is always bad for imbalanced datasets; I think it depends on the biological nature of the dataset.
@traversc It's FDR that is not captured by ROC, not FPR. If one classifies 15 out of 100 people as positive, of which 5 are actually positive, then there are 10 false positives out of 95 actual negatives, so FPR = 10/95 ≈ 10.5% but FDR = 10/15 ≈ 66.7%.
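To make that arithmetic concrete, here is a minimal Python sketch (not from the thread; the counts mirror the hypothetical 100-person example above, assuming all 5 true positives are among the 15 calls):

```python
# Illustrative counts from the 100-person example above.
total = 100
actual_pos = 5
actual_neg = total - actual_pos        # 95
predicted_pos = 15
true_pos = 5                           # assumed: every real positive is called
false_pos = predicted_pos - true_pos   # 10

fpr = false_pos / actual_neg           # 10 / 95 ≈ 10.5%  (what ROC uses)
fdr = false_pos / predicted_pos        # 10 / 15 ≈ 66.7%  (what ROC ignores)
precision = 1 - fdr                    # ≈ 33.3%           (what PR curves use)

print(f"FPR = {fpr:.1%}, FDR = {fdr:.1%}, precision = {precision:.1%}")
```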
@traversc Judging from the abstract alone, that HIV test example is essentially looking at a single point on the curve. The relative tolerance for false positives and false negatives is domain-specific, and in HIV screening it makes sense that minimizing false negatives (increasing sensitivity) would be the priority.

@gokceneraslan The Nature Methods paper shows false positive rate (FPR) on the x-axis of the ROC figures, but you're getting at the right point. The precision-recall curve will account for the poor FDR, or more specifically 1 - FDR = precision. So the classifier in Figure 4a shows poor performance (visually and when computing area under the curve) with precision-recall but not ROC.

The thought experiment I find useful is to ask what would happen if we added another 1000 negative instances to the right side of the ranked list in Figure 4a. In other words, how much do we reward the classifier for getting these additional true negatives correct? The area under the ROC curve will increase because of the true negative term in the FPR. Neither precision nor recall includes the true negative term, so the precision-recall curve and the area under it do not change.
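A small scikit-learn sketch of that thought experiment (my own illustration, not from the paper; the synthetic scores and the exact AUC values are arbitrary): appending easy true negatives to the bottom of the ranking inflates ROC AUC while leaving average precision, the usual summary of the precision-recall curve, unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Imbalanced toy ranking: 5 positives scoring high, 95 negatives that overlap
# with the positives enough to keep the ROC AUC below 1.
y_true = np.concatenate([np.ones(5, dtype=int), np.zeros(95, dtype=int)])
y_score = np.concatenate([rng.uniform(0.5, 1.0, 5),
                          rng.uniform(0.0, 0.8, 95)])

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("PR AUC :", average_precision_score(y_true, y_score))

# Append 1000 extra negatives ranked below everything else, i.e. additional
# true negatives the classifier gets right "for free".
y_true2 = np.concatenate([y_true, np.zeros(1000, dtype=int)])
y_score2 = np.concatenate([y_score, rng.uniform(-0.5, 0.0, 1000)])

# ROC AUC increases (true negatives enter through the FPR denominator);
# average precision is unchanged because neither precision nor recall
# involves true negatives.
print("ROC AUC:", roc_auc_score(y_true2, y_score2))
print("PR AUC :", average_precision_score(y_true2, y_score2))
```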
We can start drafting this part of the Discussion sub-section that I created in #118.
One topic that's come up quite a bit (e.g. #13) is how we best evaluate these models. In some cases (e.g. TF binding), AUC is unlikely to match the use case we really care about. While this is true for all bioinformatics problems, we should probably raise the issue since, as part of a headline review, we have the opportunity to provide perspective.