Evaluating models. #109
Comments
Yes, this is important. The review #27 brought up precision-recall versus ROC but only briefly. We can also provide some intuition about why ROC is especially bad with skewed classes and maybe even refer to alternatives beyond area under a curve. A related topic that has come up a few times is the choice of a gold standard for evaluating models, which can be hard in some biomedical domains.
There is a nice review on this in Nature Methods.
Yes, confidence-rated predictions are important, and so is the choice of gold standard.
@akundaje In the Nature Methods paper, their example of why ROC is bad for imbalanced datasets isn't entirely clear to me. In Fig. 4a, they show a case where there are only a few positive patients (e.g., 5 out of 100). In their graph, several of those patients are predicted with zero false positives, and the next several are detected at ~10-20% FPR. They state without qualification that this is bad, because a 10-20% FPR means there are more FPs than TPs. However, this performance would be considered really good in some settings, such as rapid screens. For example, rapid HIV tests in clinical use have FPRs of ~40% or higher (http://www.bmj.com/content/335/7612/188). Therefore, I don't think they've proved their premise that AUC is always bad for imbalanced datasets; I think it depends on the biological nature of the dataset.
@traversc It's FDR that is not captured by ROC, not FPR. If one classifies 15 out of 100 people as positive, of which 5 are actually positive, then there are 10 false positives out of 95 actual negatives, so FPR = 10/95 ≈ 10.5% but FDR = 10/15 ≈ 66.7%.
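To make that arithmetic concrete, here is a minimal Python sketch (not from the thread; the counts mirror the hypothetical 100-person example above, assuming all 5 true positives are among the 15 calls):

```python
# Illustrative counts from the 100-person example above.
total = 100
actual_pos = 5
actual_neg = total - actual_pos        # 95
predicted_pos = 15
true_pos = 5                           # assumed: every real positive is called
false_pos = predicted_pos - true_pos   # 10

fpr = false_pos / actual_neg           # 10 / 95 ≈ 10.5%  (what ROC uses)
fdr = false_pos / predicted_pos        # 10 / 15 ≈ 66.7%  (what ROC ignores)
precision = 1 - fdr                    # ≈ 33.3%           (what PR curves use)

print(f"FPR = {fpr:.1%}, FDR = {fdr:.1%}, precision = {precision:.1%}")
```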
@traversc Judging from the abstract alone, that HIV test example is essentially looking at a single point on the curve. The relative tolerance for false positives and false negatives is domain-specific, and in HIV screening it makes sense that minimizing false negatives (increasing sensitivity) would be the priority.

@gokceneraslan The Nature Methods paper shows false positive rate (FPR) on the x-axis of the ROC figures, but you're getting at the right point. The precision-recall curve will account for the poor FDR, or more specifically 1 - FDR = precision. So the classifier in Figure 4a shows poor performance (visually and when computing area under the curve) with precision-recall but not ROC.

The thought experiment I find useful is to ask what would happen if we added another 1000 negative instances to the right side of the ranked list in Figure 4a. In other words, how much do we reward the classifier for getting these additional true negatives correct? The area under the ROC curve will increase because of the true negative term in the FPR. Neither precision nor recall includes the true negative term, so the precision-recall curve and the area under it do not change.
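A small scikit-learn sketch of that thought experiment (my own illustration, not from the paper; the synthetic scores and the exact AUC values are arbitrary): appending easy true negatives to the bottom of the ranking inflates ROC AUC while leaving average precision, the usual summary of the precision-recall curve, unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Imbalanced toy ranking: 5 positives scoring high, 95 negatives that overlap
# with the positives enough to keep the ROC AUC below 1.
y_true = np.concatenate([np.ones(5, dtype=int), np.zeros(95, dtype=int)])
y_score = np.concatenate([rng.uniform(0.5, 1.0, 5),
                          rng.uniform(0.0, 0.8, 95)])

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("PR AUC :", average_precision_score(y_true, y_score))

# Append 1000 extra negatives ranked below everything else, i.e. additional
# true negatives the classifier gets right "for free".
y_true2 = np.concatenate([y_true, np.zeros(1000, dtype=int)])
y_score2 = np.concatenate([y_score, rng.uniform(-0.5, 0.0, 1000)])

# ROC AUC increases (true negatives enter through the FPR denominator);
# average precision is unchanged because neither precision nor recall
# involves true negatives.
print("ROC AUC:", roc_auc_score(y_true2, y_score2))
print("PR AUC :", average_precision_score(y_true2, y_score2))
```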
We can start drafting this part of the Discussion sub-section that I created in #118.
One topic that's come up quite a bit (e.g. #13) is how we best evaluate these models. In some cases (e.g. TF binding), AUC is unlikely to match the use case we really care about. While this is true for all bioinformatics problems, we should probably raise the issue since, as part of a headline review, we have the opportunity to provide perspective.