-
Notifications
You must be signed in to change notification settings - Fork 23
Conversation
Whoops should probably fix that first. |
func (a ByX) Less(i, j int) bool { return a.x[i] < a.x[j] } | ||
|
||
|
||
// ROC returns paired FPR and TPR values which describe the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may want to mention that ROC
will modify y
and pos
ah, the build fails because $> gofmt -w . (and/or configure your editor to do such a thing when saving. |
|
||
import "sort" | ||
|
||
type ByX struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Is there a concise description of what the ROC curve does? Wikipedia says that it's a plot of true positive rate and true negative rate as a function of tolerance. With that discription, I would have expected the signature to be something like
Where tolerance is the set of values for which we want to know the ROC, obs is the set of observations on which we make the decision, and ans is if it was actually a positive. I don't understand how that maps to your function. |
// ROC returns paired FPR and TPR values which describe the | ||
// ROC curve treating y as a binary classifier for pos. | ||
func ROC(y []float64, pos []bool) ([]float64, []float64) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete blank line,
@btracey the tolerances are implicit in the returned slices. |
@btracey I've added a concise(ish) description of what a ROC curve does. Also, you Thanks for all the constructive comments everyone! Super helpful, I am learning so |
What happens when the values of y are not distinct? |
} | ||
func (a byX) Less(i, j int) bool { return a.x[i] < a.x[j] } | ||
|
||
// ROC sorts both inputs for increasing values in y, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally we start doc comments along the lines "ROC returns the ..." Would you prefix this?
Done. PTAL @kortschak Good question haha. Now I think about it it is actually a So long as each set of duplicate values in y correspond to a single HOWEVER, if at least two equal values in y correspond to different values |
What does R do in this situation? |
// https://en.wikipedia.org/wiki/Receiver_operating_characteristic | ||
// Note that ROC will modify both inputs -- sorting them in order | ||
// of increasing y values. | ||
func ROC(y []float64, pos []bool) (tpr, fpr []float64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You use y for the label here, but byX sorts by the field x. Either change y or x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also changed a bunch of other variable names to be more concise/ prettier.
Is it ok to use gotFPR instead of the perhaps more usual gotFpr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotFPR (FPR is an initialism, not a word).
r-base does not have a ROC function. And the packages with that functionality are pretty dodge in my experience. This guy: This package: As does this package: So although I don't think any of those packages are actually any |
|
||
// ROC returns paired false positive rate (FPR) and true positive | ||
// rate (TPR) values which describe the relative (or receiver) | ||
// operator chracteristic (ROC) curve obtained when y is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo:
s/chracteristic/characteristic/
} else { | ||
fpr[n-1] = 1 | ||
if tpr[n-1] == 1 { | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly don't even know if this is better or worse. Its faster, I suppose.
Ok, added the special cases, tests for them, and comments. PTAL @btracey |
} | ||
} | ||
if n == 0 { | ||
tpr = tpr[0:(bin + 1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omit the 0 index in these slice operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okie Dokie
Done. I just set the final values of tpr and fpr to 1. Figure that makes as much sense as anything else in the special case classes is either all true or all false. Sound about right @btracey ? |
var bin int = 1 // the initial bin is known to have 0 fpr and 0 tpr | ||
var nPos, nNeg float64 | ||
for i, u := range classes { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this newline.
LGTM with the newline removed. Do you have more comments @kortschak ? Otherwise, the last thing is to fixup all of the commits to make it one commit instead of many. I find this link useful https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request. |
Yeah cool I figured out how to use |
// | ||
// If weights is nil, all weights are treated as 1. | ||
// | ||
// n of zero results in all possible cutoffs being calculated -- this will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use --
in comments. It has not markup meaning to godoc. I think this para could be rewritten to make it more English, since at the moment it reads more like code than prose.
Not to muddy things any further... but are we happy with the `n' solution -- calculating for n equally spaced cutoffs (this is the solution some other packages use)? Now I think about it allowing for cutoffs to be specified, as I beleive @btracey suggested at one point, which would include this (as equally spaced cutoffs could be specified) is more general and might be better. Of course seeing as the current solution seems acceptable we could also leave it and then I could write an alternative and put in a pull request for that at some later point. |
// n of zero results in all possible cutoffs being calculated, resulting | ||
// in fpr and tpr having length one greater than the number of unique | ||
// values in y. n greater than one will result in fpr and tpr having | ||
// length n. ROC will panic if n is equal to one or less than 0. | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried re-writing the comments, what do you think @kortschak ?
Ok I've made the improvements you both suggested @btracey @kortschak and cleaned up the commit history PTAL. Also over the weekend I thought about it and realised that we can make the code easier to understand, simpler, and also more general by replacing the input n int with cutoffs []float64, specifying cutoffs, so I implemented that version and you can take a look at it here: |
@kortschak @btracey Hey sorry I got distracted and really busy for a week or two there. ROC Function:Where we left this, I think I addressed all your points, and then also suggested an alternative Confusion Matrix:In the meantime, I was thinking about it and realised that maybe would be better to simply calculate the entire confusion matrix (four outputs, fp, tp, fn, tn) of which this would be a subset -- i.e. fpr = fp / (fp + tn) and tpr = tp / (tp + fn). On the other hand this probably belongs in its own function, as its purpose would apply to different situations -- mostly ROC curves are calculated from samples which are not representative of the prevalence (tp + fn) / (tp + fn + tn + fp), and the ROC curve is being calculated because fpr and tpr are prevalence-independent. In these cases you would not want to calculate the confusion matrix directly from your data, but rather to infer it from your fpr, tpr values and a known prevalence value from some cohort study or something. So thats probably the better way to go about that. Or maybe two different functions, one for calculating the confusion matrix from data, one to infer it from a ROC curve using a known prevalence value. What do you think? |
LGTM Let's get this in as is since it's already been a lot. We can discuss changes and additions elsewhere. |
There's an accidental file inclusion that needs to be removed, and the commits need to be squashed and rebased on to master. Then LGTM.
|
Please |
Produces a ROC curve either for all possible cutoffs, or for n equally spaced cutoffs.
Done. PTAL @kortschak |
LGTM @btracey Do you want to merge? otherwise I'll do it in the morning. |
Produces a ROC curve.
I'm pretty new to go, and to software development in general
so I'm in the middle of a pretty steep learning curve -- I thought
I should start with some small, simple contributions as I learn.
Any feedback would be very helpful.
@kortschak @btracey please take a look?