Evaluating models. #109

Closed
cgreene opened this issue Oct 14, 2016 · 7 comments

@cgreene
Member

cgreene commented Oct 14, 2016

One topic that's come up quite a bit (e.g. #13) is how we best evaluate these models. In some cases (e.g. TF binding), AUC is unlikely to match the use case that we really care about. While this is true for all bioinformatics problems, we should probably raise the issue since we get the opportunity, as part of a headline review, to provide perspective.

@agitter
Collaborator

agitter commented Oct 14, 2016

Yes, this is important. The review #27 brought up precision-recall versus ROC but only briefly. We can also provide some intuition about why ROC is especially bad with skewed classes and maybe even refer to alternatives beyond area under a curve.

A related topic that has come up a few times is the choice of a gold standard for evaluating models, which can be hard in some biomedical domains.

@akundaje
Contributor

akundaje commented Oct 14, 2016

There is a nice review in Nature Methods, http://www.nature.com/nmeth/journal/v13/n8/full/nmeth.3945.html, that clearly and intuitively showcases the problem with AUCs for unbalanced learning.

@akundaje
Contributor

akundaje commented Oct 14, 2016

Yes, confidence-rated predictions are important, and providing "ambiguous labels" or confidence on labels in gold standard datasets is also useful and good practice. It helps with evaluation as well (you can weight the loss by the confidence in the label).
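For illustration, here is a minimal sketch of what confidence-weighted evaluation could look like, assuming per-example confidence scores are available (the labels, probabilities, and weights below are made up, not from any dataset discussed here):

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels, predicted probabilities, and per-label confidences.
y_true = np.array([1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.8])
label_confidence = np.array([1.0, 1.0, 0.5, 0.3, 1.0])  # 1.0 = certain, lower = ambiguous

# Down-weight ambiguous examples when scoring predictions; the same weights
# could also be applied to a training loss.
print(log_loss(y_true, y_prob, sample_weight=label_confidence))
```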

@traversc
Contributor

@akundaje In the nmeth paper, their example of why ROC is misleading on imbalanced datasets isn't entirely clear to me. In Fig. 4a, they show a case with only a few positive patients (e.g., 5 out of 100). In their graph, several of those patients are recovered with zero false positives, and the next several are detected at ~10-20% FPR. They state, without qualification, that this is bad because a 10-20% FPR means there are more FPs than TPs.

However, this performance would be considered really good in some cases, such as rapid screens. For example, rapid HIV tests in clinical use have FPRs of ~40% or higher (http://www.bmj.com/content/335/7612/188).

Therefore, I think they haven't proved their premise that AUC is always bad in imbalanced datasets. I think it depends on the biological nature of the dataset.

@gokceneraslan

@traversc It's the FDR that ROC does not capture, not the FPR. If one classifies 15 out of 100 people as positive, of which 5 are actually positive, then FPR = 10/95 ≈ 10.5% but FDR = 10/15 ≈ 66.7%.
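The same arithmetic, written out as a quick check (the counts come directly from the example above):

```python
# 100 people, 5 truly positive; all 5 land among the 15 predicted positives.
tp, fp = 5, 10        # of the 15 predicted positives, 5 are correct and 10 are not
tn, fn = 85, 0        # the remaining 85 negatives are correctly left unflagged

fpr = fp / (fp + tn)  # 10 / 95  ~= 0.105
fdr = fp / (fp + tp)  # 10 / 15  ~= 0.667  (precision = 1 - FDR ~= 0.333)
print(fpr, fdr)
```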

@agitter
Copy link
Collaborator

agitter commented Oct 19, 2016

@traversc Judging from the abstract alone, that HIV test example is essentially looking at a single point on the curve. The relative tolerance for false positives and false negatives is domain-specific, and in HIV screening it makes sense that it would be important to minimize false negatives (increase sensitivity).

@gokceneraslan The Nature Methods paper shows false positive rate (FPR) on the x-axis of the ROC figures. You're getting at the right point though.

The precision-recall curve will account for the poor FDR, or more specifically 1 - FDR = precision. So the classifier in Figure 4a shows poor performance (visually and when computing area under the curve) with precision-recall but not ROC.

The thought experiment I find useful is what would happen if we add another 1000 negative instances to the right side of the ranked list in Figure 4a. In other words, how much do we reward the classifier for getting these additional true negatives correct? The area under the ROC curve will increase because of the true negative term in the FPR. Neither precision nor recall includes the true negative term, so the precision-recall curve and the area under it do not change.
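To make the thought experiment concrete, here is a small sketch with made-up scores that only loosely resemble Figure 4a (not the paper's actual data): appending extra true negatives to the bottom of the ranked list raises the ROC AUC but leaves average precision unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Made-up ranked list loosely resembling Figure 4a: 5 positives out of 100,
# two recovered before any false positive, the rest at roughly 10-20% FPR.
y_true = np.zeros(100, dtype=int)
y_true[[0, 1, 11, 16, 21]] = 1        # positions in the ranked list (0 = top)
scores = np.linspace(1.0, 0.5, 100)   # arbitrary, strictly decreasing scores

print(roc_auc_score(y_true, scores),            # ~0.91
      average_precision_score(y_true, scores))  # ~0.54

# Append 1000 extra true negatives ranked below everything else.
y_big = np.concatenate([y_true, np.zeros(1000, dtype=int)])
scores_big = np.concatenate([scores, np.linspace(0.4, 0.1, 1000)])

print(roc_auc_score(y_big, scores_big),             # increases: more TNs shrink the FPR
      average_precision_score(y_big, scores_big))   # unchanged: precision and recall ignore TNs
```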

@agitter
Collaborator

agitter commented Oct 24, 2016

We can start drafting this part of the Discussion sub-section that I created in #118.
