ROC function #96

Armadilloa16 · 2016-03-16T06:52:48Z

Produces a ROC curve.

I'm pretty new to Go, and to software development in general,
so I'm in the middle of a pretty steep learning curve -- I thought
I should start with some small, simple contributions as I learn.
Any feedback would be very helpful.

@kortschak @btracey please take a look?

Armadilloa16 · 2016-03-16T06:55:23Z

Whoops should probably fix that first.

sbinet · 2016-03-16T08:18:07Z

roc.go

+func (a ByX) Less(i, j int) bool { return a.x[i] < a.x[j] }
+
+
+// ROC returns paired FPR and TPR values which describe the 


you may want to mention that ROC will modify y and pos

sbinet · 2016-03-16T08:34:03Z

ah, the build fails because roc.go has not been gofmted.
you can do it like so:

$> gofmt -w .

(and/or configure your editor to do such a thing when saving. goimports is also nice.)

https://godoc.org/golang.org/x/tools/cmd/goimports

btracey · 2016-03-16T13:15:15Z

roc.go

+
+import "sort"
+
+type ByX struct {


This shouldn't be exported.

btracey · 2016-03-16T13:28:54Z

Is there a concise description of what the ROC curve does? Wikipedia says that it's a plot of true positive rate and true negative rate as a function of tolerance. With that description, I would have expected the signature to be something like

     ROC(tolerances, obs []float64, ans []bool) (fpr, tnr)

Where tolerances is the set of values for which we want to know the ROC, obs is the set of observations on which we make the decision, and ans records whether each observation was actually a positive.

I don't understand how that maps to your function.
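
For readers following along, the explicit-tolerance signature suggested above could be sketched roughly as follows. This is only an illustration: rocAt is a hypothetical name, it returns (fpr, tpr), and the convention that an observation counts as a predicted positive when obs[i] >= t is an assumption, not something settled in this thread.

```go
package main

import "fmt"

// rocAt is a hypothetical helper, not the function under review. For each
// tolerance t it treats obs[i] >= t as a positive prediction (an assumed
// convention) and returns the false and true positive rates at that tolerance.
func rocAt(tolerances, obs []float64, ans []bool) (fpr, tpr []float64) {
	if len(obs) != len(ans) {
		panic("rocAt: length mismatch")
	}
	// Count the actual positives and negatives once.
	var nPos, nNeg float64
	for _, a := range ans {
		if a {
			nPos++
		} else {
			nNeg++
		}
	}
	fpr = make([]float64, len(tolerances))
	tpr = make([]float64, len(tolerances))
	for k, t := range tolerances {
		var tp, fp float64
		for i, o := range obs {
			if o < t {
				continue // predicted negative at this tolerance
			}
			if ans[i] {
				tp++
			} else {
				fp++
			}
		}
		// If one class is absent the rate is NaN; this sketch does not
		// special-case that.
		fpr[k] = fp / nNeg
		tpr[k] = tp / nPos
	}
	return fpr, tpr
}

func main() {
	obs := []float64{0.1, 0.4, 0.35, 0.8}
	ans := []bool{false, false, true, true}
	fpr, tpr := rocAt([]float64{0, 0.375, 1}, obs, ans)
	fmt.Println(fpr, tpr) // prints [1 0.5 0] [1 0.5 0]
}
```

Each tolerance costs one pass over obs, so this form is simplest when only a handful of tolerances are of interest.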

kortschak · 2016-03-16T22:30:47Z

roc.go

+// ROC returns paired FPR and TPR values which describe the 
+// ROC curve treating y as a binary classifier for pos.
+func ROC(y []float64, pos []bool) ([]float64, []float64) {
+


Delete blank line.

kortschak · 2016-03-16T22:40:32Z

@btracey the tolerances are implicit in the returned slices.

Armadilloa16 · 2016-03-17T01:02:35Z

@btracey I've added a concise(ish) description of what a ROC curve does. Also, you
are right in the sense that you could make a function that takes tolerances (or cutoffs,
whichever term you prefer). What a ROC curve (and this function) does is essentially
calculate the fpr and tpr for all possible cutoffs. These only change at the
values of y, so assuming the values of y are distinct the output will satisfy
len(fpr) == len(y) + 1, with each element giving the fpr/tpr for any
tolerance in the corresponding interval. There are len(y) + 1 distinct intervals: the
first corresponds to any tolerance less than the minimum value in y, the last
to any tolerance greater than the maximum value in y.
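
A minimal sketch of the "all possible cutoffs" idea described above, assuming distinct values in y. The name rocAllCutoffs is hypothetical and this is not the code in roc.go; unlike the function under review, it sorts an index slice so the inputs are left untouched.

```go
package main

import (
	"fmt"
	"sort"
)

// rocAllCutoffs is a hypothetical illustration. Element k of the result
// describes any cutoff lying between the k-th and (k+1)-th smallest values
// of y, with everything above the cutoff predicted positive.
func rocAllCutoffs(y []float64, classes []bool) (fpr, tpr []float64) {
	if len(y) != len(classes) {
		panic("rocAllCutoffs: length mismatch")
	}
	// Sort indices by increasing y so the caller's slices are not modified.
	idx := make([]int, len(y))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return y[idx[a]] < y[idx[b]] })

	n := len(y)
	fpr = make([]float64, n+1) // fpr[n] and tpr[n] stay 0: cutoff above max(y)
	tpr = make([]float64, n+1)
	// Walk from the largest value down, counting what lies above each cutoff.
	var tp, fp float64
	for k := n - 1; k >= 0; k-- {
		if classes[idx[k]] {
			tp++
		} else {
			fp++
		}
		tpr[k], fpr[k] = tp, fp
	}
	// tp and fp now hold the class totals; convert counts to rates.
	// A class with no members yields NaN, which this sketch leaves as is.
	for k := range fpr {
		fpr[k] /= fp
		tpr[k] /= tp
	}
	return fpr, tpr
}

func main() {
	y := []float64{0.1, 0.4, 0.35, 0.8}
	classes := []bool{false, false, true, true}
	fpr, tpr := rocAllCutoffs(y, classes)
	fmt.Println(len(fpr) == len(y)+1) // prints true
	fmt.Println(fpr, tpr)
}
```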

Thanks for all the constructive comments everyone! Super helpful, I am learning so
much! :D

kortschak · 2016-03-17T01:49:34Z

What happens when the values of y are not distinct?

kortschak · 2016-03-17T01:52:58Z

roc.go

+}
+func (a byX) Less(i, j int) bool { return a.x[i] < a.x[j] }
+
+// ROC sorts both inputs for increasing values in y, and


Normally we start doc comments along the lines "ROC returns the ..." Would you prefix this?

Armadilloa16 · 2016-03-17T02:30:01Z

Done. PTAL

@kortschak Good question haha. Now that I think about it, it is actually a
problem.

So long as each set of duplicate values in y corresponds to a single
value in pos, it's all good. For plotting purposes and calculating area
under curves (AUC), etc., ROC will return some redundant values.
If you consider some of the len(y)+1 intervals I mentioned above to
be width zero then this is still correct, will still produce correct plots
and AUC values, and produce consistent output, so I think this
behaviour is acceptable. The only downside I see is that it makes the
cutoff interpretation a little funky.

HOWEVER, if at least two equal values in y correspond to different values
of pos, then what is 'correct' is actually ambiguous, and ROC will simply
return one of the possibilities. The only way I can think of to address this
would be to remove the values in fpr and tpr corresponding to intervals
between duplicate values in y. This would result in diagonal lines in the ROC
curve, and is the only solution I can think of that would be
unique. Seems to me like the best solution. However, it would also significantly
complicate the code. Worth doing? An alternative solution is to leave the function
as it is, and note that we assume y consists of distinct values (or at least that
duplicate values correspond to single values in pos). What do you think?
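
To make the tie-handling proposal concrete, here is a rough sketch that emits one point per unique value of y, so that a run of equal values with mixed classes becomes a single diagonal segment. rocTies is a hypothetical name, the points are ordered from (0, 0) to (1, 1), and this is not the implementation in the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// rocTies is a hypothetical illustration of dropping the points that would
// fall between duplicate values of y. The result has one point per unique
// value of y plus the starting point (0, 0).
func rocTies(y []float64, classes []bool) (fpr, tpr []float64) {
	if len(y) != len(classes) {
		panic("rocTies: length mismatch")
	}
	// Sort indices by decreasing y; the order within a tie does not matter
	// because a point is only emitted once the whole tie has been counted.
	idx := make([]int, len(y))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return y[idx[a]] > y[idx[b]] })

	fpr = []float64{0}
	tpr = []float64{0}
	var tp, fp float64
	for i, j := range idx {
		if classes[j] {
			tp++
		} else {
			fp++
		}
		// Skip the intermediate points inside a run of equal values.
		if i+1 < len(idx) && y[idx[i+1]] == y[j] {
			continue
		}
		fpr = append(fpr, fp)
		tpr = append(tpr, tp)
	}
	// Convert counts to rates; a missing class gives NaN in this sketch.
	for k := range fpr {
		fpr[k] /= fp
		tpr[k] /= tp
	}
	return fpr, tpr
}

func main() {
	y := []float64{0.2, 0.5, 0.5, 0.9}
	classes := []bool{false, true, false, true}
	fpr, tpr := rocTies(y, classes)
	fmt.Println(fpr, tpr) // prints [0 0 0.5 1] [0 0.5 1 1]
}
```

In the example, the two observations at 0.5 have different classes, so the curve moves from (0, 0.5) to (0.5, 1) in one diagonal step, and the output no longer depends on how the tie happened to be ordered.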

kortschak · 2016-03-17T02:44:40Z

What does R do in this situation?

kortschak · 2016-03-17T02:49:08Z

roc.go

+// https://en.wikipedia.org/wiki/Receiver_operating_characteristic
+// Note that ROC will modify both inputs -- sorting them in order
+// of increasing y values.
+func ROC(y []float64, pos []bool) (tpr, fpr []float64) {


You use y for the label here, but byX sorts by the field x. Either change y or x.

Done. Also changed a bunch of other variable names to be more concise/prettier.
Is it ok to use gotFPR instead of the perhaps more usual gotFpr?

gotFPR (FPR is an initialism, not a word).

Armadilloa16 · 2016-03-17T06:05:14Z

r-base does not have a ROC function. And the packages with that functionality are pretty dodge in my experience.

This guy:
http://www.r-bloggers.com/illustrated-guide-to-roc-and-auc/
avoids the issue by simply calculating for a fixed number of
equally spaced thresholds, similar to what @btracey suggested.

This package:
https://cran.r-project.org/web/packages/pROC/index.html
does the removing between-duplicates intervals thing I mentioned
when plotting.

As does this package:
https://rocr.bioinf.mpi-sb.mpg.de/

So although I don't think any of those packages are actually any
good (honestly), I do think that the idea of removing between-duplicate
interval values is a good idea and I'll have a go at implementing it
tomorrow, despite the fact that it will complicate the code a bit. I'll
also add a test for those edge cases.

sbinet · 2016-03-17T06:11:25Z

roc.go

+
+// ROC returns paired false positive rate (FPR) and true positive
+// rate (TPR) values which describe the relative (or receiver)
+// operator chracteristic (ROC) curve obtained when y is


typo:
s/chracteristic/characteristic/

Armadilloa16 · 2016-04-01T01:28:25Z

roc.go

 				} else {
 					fpr[n-1] = 1
+					if tpr[n-1] == 1 {
+						break
+					}
 				}


Honestly don't even know if this is better or worse. It's faster, I suppose.

Armadilloa16 · 2016-04-01T02:32:25Z

Ok, added the special cases, tests for them, and comments. PTAL @btracey
There is one special case that I solved in a really ugly way, and I hope there is an easier way to solve it.
Alternatively we could just decide to do something different in that case. It's a pretty specific case that it's addressing (when all the values in classes are equal, either all false or all true).

kortschak · 2016-04-01T03:00:33Z

roc.go

+		}
+	}
+	if n == 0 {
+		tpr = tpr[0:(bin + 1)]


Omit the 0 index in these slice operations.

Armadilloa16 · 2016-04-01T05:55:40Z

Done. I just set the final values of tpr and fpr to 1. I figure that makes as much sense as anything else in the special case where classes is either all true or all false. Sound about right @btracey ?

btracey · 2016-04-01T15:50:24Z

roc.go

+	var bin int = 1 // the initial bin is known to have 0 fpr and 0 tpr
+	var nPos, nNeg float64
+	for i, u := range classes {
+


Remove this newline.

btracey · 2016-04-01T15:53:17Z

LGTM with the newline removed. Do you have more comments @kortschak ?

Otherwise, the last thing is to fix up all of the commits to make it one commit instead of many. I find this link useful: https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request

Armadilloa16 · 2016-04-01T22:57:20Z

Yeah cool, I figured out how to use
git rebase -i
a while back; now I just gotta remember.
I'll try it now for practice. If @kortschak
has more comments I can do it again
afterwards. :)

kortschak · 2016-04-01T23:13:07Z

roc.go

+//
+// If weights is nil, all weights are treated as 1.
+//
+// n of zero results in all possible cutoffs being calculated -- this will


Don't use -- in comments. It has no markup meaning to godoc. I think this para could be rewritten to make it more English, since at the moment it reads more like code than prose.

Armadilloa16 · 2016-04-01T23:20:23Z

Not to muddy things any further... but are we happy with the 'n' solution, i.e. calculating for n equally spaced cutoffs (this is the solution some other packages use)? Now that I think about it, allowing the cutoffs to be specified, as I believe @btracey suggested at one point, would include this (as equally spaced cutoffs could be specified), is more general, and might be better. Of course, seeing as the current solution seems acceptable, we could also leave it, and I could write an alternative and put in a pull request for that at some later point.

Armadilloa16 · 2016-04-01T23:26:13Z

roc.go

+// n of zero results in all possible cutoffs being calculated, resulting
+// in fpr and tpr having length one greater than the number of unique 
+// values in y. n greater than one will result in fpr and tpr having 
+// length n. ROC will panic if n is equal to one or less than 0.
 //


I tried re-writing the comments, what do you think @kortschak ?

Armadilloa16 · 2016-04-04T01:08:43Z

Ok I've made the improvements you both suggested @btracey @kortschak and cleaned up the commit history PTAL.

Also over the weekend I thought about it and realised that we can make the code easier to understand, simpler, and also more general by replacing the input n int with cutoffs []float64, specifying cutoffs, so I implemented that version and you can take a look at it here:
Armadilloa16@1a94226
If you like it, I'll merge it into this pull request? cutoffs == nil produces the same behaviour as n==0, but this can also be done by selecting the correct cutoffs as well now.

Armadilloa16 · 2016-04-20T06:40:24Z

@kortschak @btracey Hey sorry I got distracted and really busy for a week or two there.

ROC Function:

Where we left this, I think I addressed all your points, and then also suggested an alternative
in which the thresholds can be explicitly given, as an alternative to the current version where just
the number of thresholds is given and they are equally spaced. Or if the number is zero then they
can be un-equally spaced and cover every possible case. Inputting the thresholds themselves results in slightly simpler code, and includes both these cases, but also allows for arbitrary unequally spaced thresholds to be specified. A link to a branch I made showing how this would look is above.

Confusion Matrix:

In the meantime, I was thinking about it and realised that maybe it would be better to simply calculate the entire confusion matrix (four outputs: fp, tp, fn, tn), of which this would be a subset, i.e. fpr = fp / (fp + tn) and tpr = tp / (tp + fn). On the other hand, this probably belongs in its own function, as its purpose would apply to different situations. Mostly, ROC curves are calculated from samples which are not representative of the prevalence (tp + fn) / (tp + fn + tn + fp), and the ROC curve is being calculated because fpr and tpr are prevalence-independent. In these cases you would not want to calculate the confusion matrix directly from your data, but rather infer it from your fpr and tpr values and a known prevalence value from some cohort study or something. So that's probably the better way to go about that. Or maybe two different functions: one for calculating the confusion matrix from data, one to infer it from a ROC curve using a known prevalence value.
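
The two directions described above can be sketched as follows. The type and function names (confusion, rates, inferConfusion) are hypothetical and nothing like them is part of this PR; the inferred matrix is expressed as proportions of the population rather than counts, which is an assumption of the sketch.

```go
package main

import "fmt"

// confusion is a hypothetical container for the four confusion-matrix
// entries, stored either as counts or as proportions.
type confusion struct {
	TP, FP, FN, TN float64
}

// rates returns fpr = fp / (fp + tn) and tpr = tp / (tp + fn).
func (c confusion) rates() (fpr, tpr float64) {
	return c.FP / (c.FP + c.TN), c.TP / (c.TP + c.FN)
}

// inferConfusion rebuilds the matrix, as proportions summing to 1, from a
// single (fpr, tpr) point and a known prevalence (tp + fn) / total.
func inferConfusion(fpr, tpr, prevalence float64) confusion {
	return confusion{
		TP: tpr * prevalence,
		FN: (1 - tpr) * prevalence,
		FP: fpr * (1 - prevalence),
		TN: (1 - fpr) * (1 - prevalence),
	}
}

func main() {
	// Direct calculation from counts in a sample.
	fpr, tpr := confusion{TP: 3, FP: 1, FN: 1, TN: 3}.rates()
	fmt.Println(fpr, tpr) // prints 0.25 0.75

	// Inference for a population where the prevalence is known to be 0.5.
	fmt.Printf("%+v\n", inferConfusion(fpr, tpr, 0.5))
}
```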

What do you think?

btracey · 2016-04-23T02:24:41Z

LGTM

Let's get this in as is since it's already been a lot. We can discuss changes and additions elsewhere.

kortschak · 2016-04-23T10:02:09Z

There's an accidental file inclusion that needs to be removed, and the commits need to be squashed and rebased on to master. Then LGTM.

kortschak · 2016-04-26T00:26:21Z

Please git rm .roc.go.swp and git commit --amend and then rebase against gonum/stat master with a forced push to your fork's branch.

Armadilloa16 · 2016-04-26T06:38:02Z

Done. PTAL @kortschak

kortschak · 2016-04-26T12:21:31Z

LGTM

@btracey Do you want to merge? Otherwise I'll do it in the morning.

Armadilloa16 closed this Mar 16, 2016

Armadilloa16 reopened this Mar 16, 2016

sbinet reviewed Mar 16, 2016

btracey reviewed Mar 16, 2016

kortschak reviewed Mar 16, 2016

kortschak reviewed Mar 17, 2016

sbinet reviewed Mar 17, 2016

Armadilloa16 reviewed Apr 1, 2016

kortschak reviewed Apr 1, 2016

btracey reviewed Apr 1, 2016

Armadilloa16 force-pushed the master branch from 6252755 to cc4838f Compare April 1, 2016 23:05

kortschak reviewed Apr 1, 2016

Armadilloa16 reviewed Apr 1, 2016

Armadilloa16 force-pushed the master branch from e783635 to 7e716d7 Compare April 4, 2016 00:09

Armadilloa16 force-pushed the master branch from 7e716d7 to b4dae31 Compare April 26, 2016 06:25

ROC function

926be36

Produces a ROC curve either for all possible cutoffs, or for n equally spaced cutoffs.

Armadilloa16 force-pushed the master branch from b4dae31 to 926be36 Compare April 26, 2016 06:36

btracey merged commit 510f301 into gonum:master Apr 26, 2016
