This repository has been archived by the owner on Dec 22, 2018. It is now read-only.

ROC function #96

Merged
merged 1 commit into from
Apr 26, 2016


Armadilloa16
Contributor

Produces a ROC curve.

I'm pretty new to Go, and to software development in general,
so I'm in the middle of a pretty steep learning curve. I thought
I should start with some small, simple contributions as I learn.
Any feedback would be very helpful.

@kortschak @btracey please take a look?

@Armadilloa16
Contributor Author

Whoops should probably fix that first.

@Armadilloa16 Armadilloa16 reopened this Mar 16, 2016
func (a ByX) Less(i, j int) bool { return a.x[i] < a.x[j] }


// ROC returns paired FPR and TPR values which describe the
Member

you may want to mention that ROC will modify y and pos
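The mutation is easy to miss because the sort goes through a sort.Interface wrapper that holds the caller's own slices. A minimal sketch of that pattern follows; the names byScore and sortPaired are hypothetical stand-ins for illustration, not the code in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// byScore is a hypothetical helper that sorts two parallel slices by
// the values in y. It is illustrative only, not the PR's exact type.
type byScore struct {
	y   []float64
	pos []bool
}

func (s byScore) Len() int           { return len(s.y) }
func (s byScore) Less(i, j int) bool { return s.y[i] < s.y[j] }
func (s byScore) Swap(i, j int) {
	s.y[i], s.y[j] = s.y[j], s.y[i]
	s.pos[i], s.pos[j] = s.pos[j], s.pos[i]
}

// sortPaired sorts y in increasing order and applies the same
// permutation to pos. Both of the caller's slices are reordered in place.
func sortPaired(y []float64, pos []bool) {
	sort.Sort(byScore{y: y, pos: pos})
}

func main() {
	y := []float64{0.9, 0.1, 0.5}
	pos := []bool{true, false, true}
	sortPaired(y, pos)
	// The caller's slices have been reordered.
	fmt.Println(y, pos) // prints "[0.1 0.5 0.9] [false true true]"
}
```

A caller that needs the original order afterwards would have to copy both slices before the call.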

@sbinet
Member

sbinet commented Mar 16, 2016

ah, the build fails because roc.go has not been gofmted.
you can do it like so:

$> gofmt -w .

(and/or configure your editor to do such a thing when saving. goimports is also nice.)

https://godoc.org/golang.org/x/tools/cmd/goimports


import "sort"

type ByX struct {
Member

This shouldn't be exported.

Contributor Author

Fixed.

@btracey
Member

btracey commented Mar 16, 2016

Is there a concise description of what the ROC curve does? Wikipedia says that it's a plot of true positive rate against false positive rate as a function of tolerance. With that description, I would have expected the signature to be something like

     ROC(tolerances, obs []float64, ans []bool) (fpr, tpr)

where tolerances is the set of values for which we want to know the ROC, obs is the set of observations on which we make the decision, and ans says whether each observation was actually a positive.

I don't understand how that maps to your function.

// ROC returns paired FPR and TPR values which describe the
// ROC curve treating y as a binary classifier for pos.
func ROC(y []float64, pos []bool) ([]float64, []float64) {

Member

Delete blank line.

@kortschak
Member

@btracey the tolerances are implicit in the returned slices.

@Armadilloa16
Contributor Author

> Is there a concise description of what the ROC curve does? Wikipedia says that it's a plot of true positive rate against false positive rate as a function of tolerance. With that description, I would have expected the signature to be something like
> ROC(tolerances, obs []float64, ans []bool) (fpr, tpr)
> where tolerances is the set of values for which we want to know the ROC, obs is the set of observations on which we make the decision, and ans says whether each observation was actually a positive.
>
> I don't understand how that maps to your function.

@btracey I've added a concise(ish) description of what a ROC curve does. You are
right that a function could take explicit tolerances (or cutoffs, whichever term
you prefer). What a ROC curve (and this function) does is essentially calculate
the fpr and tpr for all possible cutoffs. These only change at the values of y,
so assuming the values of y are distinct, the output will satisfy
len(fpr) == len(y) + 1, with each entry giving the fpr/tpr for any tolerance in
the corresponding interval. There are len(y) + 1 distinct intervals: the first
corresponds to any tolerance less than the minimum value in y, and the last to
any tolerance greater than the maximum value in y.
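The "all possible cutoffs" idea can be sketched as a single sweep over the sorted scores. This is an illustration under stated assumptions, not the PR's implementation: the name rocAllCutoffs is hypothetical, the scores are assumed distinct, both classes are assumed present, and a sample is assumed to be predicted positive when its score is at or above the cutoff.

```go
package main

import (
	"fmt"
	"sort"
)

// rocAllCutoffs is an illustrative sketch, not the PR's implementation.
// It assumes the scores in y are distinct, that pos contains at least
// one true and one false value, and that a sample is predicted positive
// when its score is at or above the cutoff.
func rocAllCutoffs(y []float64, pos []bool) (fpr, tpr []float64) {
	// Sort indices by decreasing score so the inputs are left untouched.
	idx := make([]int, len(y))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return y[idx[a]] > y[idx[b]] })

	var nPos, nNeg float64
	for _, p := range pos {
		if p {
			nPos++
		} else {
			nNeg++
		}
	}

	// The first point is a cutoff above every score: nothing is
	// predicted positive, so both rates are zero.
	fpr = make([]float64, 1, len(y)+1)
	tpr = make([]float64, 1, len(y)+1)
	var tp, fp float64
	for _, i := range idx {
		// Lowering the cutoff past y[i] adds sample i to the
		// predicted positives.
		if pos[i] {
			tp++
		} else {
			fp++
		}
		fpr = append(fpr, fp/nNeg)
		tpr = append(tpr, tp/nPos)
	}
	return fpr, tpr
}

func main() {
	y := []float64{0.1, 0.4, 0.35, 0.8}
	pos := []bool{false, true, false, true}
	fpr, tpr := rocAllCutoffs(y, pos)
	fmt.Println(len(fpr) == len(y)+1) // prints "true"
	fmt.Println(fpr, tpr)
}
```

Each of the len(y) + 1 points corresponds to one of the intervals described above, starting from a cutoff above the maximum score and ending with one at or below the minimum.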

Thanks for all the constructive comments everyone! Super helpful, I am learning so
much! :D

@kortschak
Member

What happens when the values of y are not distinct?

}
func (a byX) Less(i, j int) bool { return a.x[i] < a.x[j] }

// ROC sorts both inputs for increasing values in y, and
Member

Normally we start doc comments along the lines "ROC returns the ..." Would you prefix this?

@Armadilloa16
Contributor Author

Done. PTAL

@kortschak Good question haha. Now that I think about it, it actually is a
problem.

So long as each set of duplicate values in y corresponds to a single
value in pos, it's all good. For plotting purposes and for calculating the area
under the curve (AUC), etc., ROC will return some redundant values.
If you consider some of the len(y)+1 intervals I mentioned above to
have width zero then this is still correct: it will still produce correct plots
and AUC values, and produce consistent output, so I think this
behaviour is acceptable. The only downside I see is that it makes the
cutoff interpretation a little funky.

HOWEVER, if at least two equal values in y correspond to different values
of pos, then what is 'correct' is actually ambiguous, and ROC will simply
return one of the possibilities. The only way I can think of to address this
would be to remove the values in fpr and tpr corresponding to intervals
between duplicate values in y. This would result in diagonal lines in the ROC
curve, and it is the only solution I can think of that would be
unique. It seems to me like the best solution, but it would also significantly
complicate the code. Worth doing? An alternative is to leave the function
as it is, and note that we assume y consists of distinct values (or at least that
duplicate values correspond to single values in pos). What do you think?
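One way to picture the "remove the between-duplicate values" option is to emit a point only once per distinct score, so tied samples enter the predicted positives together. The sketch below is an assumption-laden illustration (the name rocTies is hypothetical, both classes are assumed present, and a sample is assumed to be predicted positive when its score is at or above the cutoff); it is not the code that was eventually merged.

```go
package main

import (
	"fmt"
	"sort"
)

// rocTies is an illustrative sketch, not the PR's implementation. It
// emits one point per distinct score, so samples that share a score
// are added to the predicted positives together. It assumes pos holds
// at least one true and one false value, and that a sample is
// predicted positive when its score is at or above the cutoff.
func rocTies(y []float64, pos []bool) (fpr, tpr []float64) {
	idx := make([]int, len(y))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return y[idx[a]] > y[idx[b]] })

	var nPos, nNeg float64
	for _, p := range pos {
		if p {
			nPos++
		} else {
			nNeg++
		}
	}

	fpr = []float64{0}
	tpr = []float64{0}
	var tp, fp float64
	for k, i := range idx {
		if pos[i] {
			tp++
		} else {
			fp++
		}
		// Emit a point only after the last sample in a run of ties.
		if k+1 < len(idx) && y[idx[k+1]] == y[i] {
			continue
		}
		fpr = append(fpr, fp/nNeg)
		tpr = append(tpr, tp/nPos)
	}
	return fpr, tpr
}

func main() {
	y := []float64{0.2, 0.5, 0.5, 0.9}
	// The two samples with score 0.5 have different labels.
	fpr, tpr := rocTies(y, []bool{false, true, false, true})
	fmt.Println(fpr, tpr)
}
```

Because the tied samples are counted before a point is emitted, swapping the labels of the two 0.5 samples gives the same output, and the segment across the tie is a diagonal rather than a step.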

@kortschak
Member

What does R do in this situation?

// https://en.wikipedia.org/wiki/Receiver_operating_characteristic
// Note that ROC will modify both inputs -- sorting them in order
// of increasing y values.
func ROC(y []float64, pos []bool) (tpr, fpr []float64) {
Member

You use y for the label here, but byX sorts by the field x. Either change y or x.

Contributor Author

Done. Also changed a bunch of other variable names to be more concise/ prettier.
Is it ok to use gotFPR instead of the perhaps more usual gotFpr?

Member

gotFPR (FPR is an initialism, not a word).

@Armadilloa16
Contributor Author

r-base does not have a ROC function, and the packages with that functionality are pretty dodgy in my experience.

This guy:
http://www.r-bloggers.com/illustrated-guide-to-roc-and-auc/
avoids the issue by simply calculating for a fixed number of
equally spaced thresholds, similar to what @btracey suggested.

This package:
https://cran.r-project.org/web/packages/pROC/index.html
does the removing between-duplicates intervals thing I mentioned
when plotting.

As does this package:
https://rocr.bioinf.mpi-sb.mpg.de/

So although I don't think any of those packages are actually any
good (honestly), I do think that removing the between-duplicate
interval values is a good idea, and I'll have a go at implementing it
tomorrow, despite the fact that it will complicate the code a bit. I'll
also add a test for those edge cases.


// ROC returns paired false positive rate (FPR) and true positive
// rate (TPR) values which describe the relative (or receiver)
// operator chracteristic (ROC) curve obtained when y is
Member

typo:
s/chracteristic/characteristic/

} else {
fpr[n-1] = 1
if tpr[n-1] == 1 {
break
}
}
Contributor Author

Honestly don't even know if this is better or worse. It's faster, I suppose.

@Armadilloa16
Contributor Author

Ok, added the special cases, tests for them, and comments. PTAL @btracey
There is one special case that I solved in a really ugly way; I hope there is an easier way to handle it.
Alternatively, we could just decide to do something different in that case. It's a pretty specific case (when all the values in classes are equal, either all false or all true).

}
}
if n == 0 {
tpr = tpr[0:(bin + 1)]
Member

Omit the 0 index in these slice operations.

Contributor Author

Okie Dokie

@Armadilloa16
Contributor Author

Done. I just set the final values of tpr and fpr to 1. I figure that makes as much sense as anything else in the special case where classes is either all true or all false. Sound about right @btracey ?
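The reason this case needs special handling is that one of the class totals is zero, so dividing the running counts by it would give 0/0. A rough sketch of the "pin the final value to 1" idea follows; the helper normalize is hypothetical and is only one reading of the fix described above, not the PR's code.

```go
package main

import "fmt"

// normalize converts cumulative counts into rates by dividing by the
// final count. It is a hypothetical helper for illustration. When the
// final count is zero (no samples of that class exist), every count is
// zero and 0/0 would be NaN, so the last rate is pinned to 1 and the
// rest are left at 0, which keeps the curve ending at (1, 1).
func normalize(counts []float64) []float64 {
	rates := make([]float64, len(counts))
	if len(counts) == 0 {
		return rates
	}
	total := counts[len(counts)-1]
	if total == 0 {
		rates[len(rates)-1] = 1
		return rates
	}
	for i, c := range counts {
		rates[i] = c / total
	}
	return rates
}

func main() {
	// Cumulative counts for three samples whose labels are all true:
	// there are no negatives, so the false positive counts stay at zero.
	fp := []float64{0, 0, 0, 0}
	tp := []float64{0, 1, 2, 3}
	fmt.Println(normalize(fp), normalize(tp))
}
```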

var bin int = 1 // the initial bin is known to have 0 fpr and 0 tpr
var nPos, nNeg float64
for i, u := range classes {

Member

Remove this newline.

@btracey
Member

btracey commented Apr 1, 2016

LGTM with the newline removed. Do you have more comments @kortschak ?

Otherwise, the last thing is to fixup all of the commits to make it one commit instead of many. I find this link useful https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request.

@Armadilloa16
Contributor Author

Yeah cool I figured out how to use
git rebase -i
awhile back now I just gotta remember.
I'll try it now for practice. If @kortschak
has more comments I can do it again
afterwards. :)

//
// If weights is nil, all weights are treated as 1.
//
// n of zero results in all possible cutoffs being calculated -- this will
Member

Don't use -- in comments. It has no markup meaning to godoc. I think this para could be rewritten to read more like English, since at the moment it reads more like code than prose.

@Armadilloa16
Contributor Author

Not to muddy things any further... but are we happy with the 'n' solution, calculating for n equally spaced cutoffs (this is the solution some other packages use)? Now that I think about it, allowing the cutoffs to be specified, as I believe @btracey suggested at one point, is more general and might be better, since equally spaced cutoffs could still be specified. Of course, seeing as the current solution seems acceptable, we could also leave it, and I could write an alternative and put in a pull request for that at some later point.

// n of zero results in all possible cutoffs being calculated, resulting
// in fpr and tpr having length one greater than the number of unique
// values in y. n greater than one will result in fpr and tpr having
// length n. ROC will panic if n is equal to one or less than 0.
//
Contributor Author

I tried re-writing the comments, what do you think @kortschak ?

@Armadilloa16
Contributor Author

Ok I've made the improvements you both suggested @btracey @kortschak and cleaned up the commit history PTAL.

Also over the weekend I thought about it and realised that we can make the code easier to understand, simpler, and also more general by replacing the input n int with cutoffs []float64, specifying cutoffs, so I implemented that version and you can take a look at it here:
Armadilloa16@1a94226
If you like it, I'll merge it into this pull request? cutoffs == nil produces the same behaviour as n==0, but this can also be done by selecting the correct cutoffs as well now.
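To make the cutoffs idea concrete, here is a rough sketch of what a cutoff-based function could look like. It is not the code in the linked commit: the names rocAt and linspace are hypothetical, both classes are assumed present, and a sample is assumed to be predicted positive when its score is at or above the cutoff.

```go
package main

import "fmt"

// rocAt is an illustrative sketch of a cutoff-based signature; it is
// not the code in the linked commit. A sample is predicted positive
// when its score is at or above the cutoff (an assumption), and pos is
// assumed to contain at least one true and one false value.
func rocAt(cutoffs, y []float64, pos []bool) (fpr, tpr []float64) {
	var nPos, nNeg float64
	for _, p := range pos {
		if p {
			nPos++
		} else {
			nNeg++
		}
	}
	fpr = make([]float64, len(cutoffs))
	tpr = make([]float64, len(cutoffs))
	for k, c := range cutoffs {
		var tp, fp float64
		for i, v := range y {
			if v < c {
				continue // predicted negative at this cutoff
			}
			if pos[i] {
				tp++
			} else {
				fp++
			}
		}
		fpr[k] = fp / nNeg
		tpr[k] = tp / nPos
	}
	return fpr, tpr
}

// linspace returns n equally spaced values from lo to hi inclusive.
// It assumes n >= 2.
func linspace(lo, hi float64, n int) []float64 {
	out := make([]float64, n)
	step := (hi - lo) / float64(n-1)
	for i := range out {
		out[i] = lo + float64(i)*step
	}
	out[n-1] = hi // avoid rounding drift at the end point
	return out
}

func main() {
	y := []float64{0.1, 0.4, 0.35, 0.8}
	pos := []bool{false, true, false, true}
	// Equally spaced cutoffs are just one choice of input.
	fpr, tpr := rocAt(linspace(0, 1, 5), y, pos)
	fmt.Println(fpr, tpr)
}
```

With this shape, the "n equally spaced cutoffs" behaviour becomes a matter of which slice the caller passes in, which is the generality argued for above. The nested loop is quadratic and is kept only for clarity.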

@Armadilloa16
Contributor Author

@kortschak @btracey Hey sorry I got distracted and really busy for a week or two there.

ROC Function:

Where we left this, I think I addressed all your points, and then also suggested an alternative
in which the thresholds can be explicitly given, as an alternative to the current version where just
the number of thresholds is given and they are equally spaced. Or if the number is zero then they
can be un-equally spaced and cover every possible case. Inputting the thresholds themselves results in slightly simpler code, and includes both these cases, but also allows for arbitrary unequally spaced thresholds to be specified. A link to a branch I made showing how this would look is above.

Confusion Matrix:

In the meantime, I was thinking about it and realised that maybe it would be better to simply calculate the entire confusion matrix (four outputs: fp, tp, fn, tn), from which these rates follow, i.e. fpr = fp / (fp + tn) and tpr = tp / (tp + fn). On the other hand, this probably belongs in its own function, as it would apply to different situations. ROC curves are mostly calculated from samples which are not representative of the prevalence (tp + fn) / (tp + fn + tn + fp), and the ROC curve is calculated because fpr and tpr are prevalence-independent. In these cases you would not want to calculate the confusion matrix directly from your data, but rather infer it from your fpr and tpr values and a known prevalence value from some cohort study or similar. So that's probably the better way to go about it. Or maybe two different functions: one for calculating the confusion matrix from data, and one to infer it from a ROC curve using a known prevalence value.

What do you think?
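The two-function idea for the confusion matrix can be sketched as follows. Everything here is hypothetical and for illustration only (the type confusion and the functions fromData and fromRates are not part of this PR), and a sample is assumed to be predicted positive when its score is at or above the cutoff. The inferred cells are expressed as proportions of the population, using tp = tpr * p, fn = (1 - tpr) * p, fp = fpr * (1 - p) and tn = (1 - fpr) * (1 - p) for prevalence p.

```go
package main

import "fmt"

// confusion holds the four cells of a binary confusion matrix, either
// as counts or as proportions. The type is hypothetical.
type confusion struct{ TP, FP, FN, TN float64 }

// fromData counts the cells directly from scored samples. A sample is
// predicted positive when its score is at or above the cutoff (an
// assumption made for this sketch).
func fromData(cutoff float64, y []float64, pos []bool) confusion {
	var c confusion
	for i, v := range y {
		switch {
		case v >= cutoff && pos[i]:
			c.TP++
		case v >= cutoff && !pos[i]:
			c.FP++
		case v < cutoff && pos[i]:
			c.FN++
		default:
			c.TN++
		}
	}
	return c
}

// fromRates infers the expected cell proportions from one point on a
// ROC curve and a prevalence known from elsewhere.
func fromRates(fpr, tpr, prevalence float64) confusion {
	return confusion{
		TP: tpr * prevalence,
		FN: (1 - tpr) * prevalence,
		FP: fpr * (1 - prevalence),
		TN: (1 - fpr) * (1 - prevalence),
	}
}

// rates recovers the prevalence-independent fpr and tpr from the cells.
func (c confusion) rates() (fpr, tpr float64) {
	return c.FP / (c.FP + c.TN), c.TP / (c.TP + c.FN)
}

func main() {
	y := []float64{0.1, 0.4, 0.35, 0.8}
	pos := []bool{false, true, false, true}
	c := fromData(0.3, y, pos)
	fpr, tpr := c.rates()
	fmt.Printf("%+v fpr=%v tpr=%v\n", c, fpr, tpr)
	// Same operating point, but with an assumed prevalence of 25%
	// instead of the 50% seen in the sample.
	fmt.Printf("%+v\n", fromRates(fpr, tpr, 0.25))
}
```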

@btracey
Member

btracey commented Apr 23, 2016

LGTM

Let's get this in as is since it's already been a lot. We can discuss changes and additions elsewhere.

@kortschak
Member

kortschak commented Apr 23, 2016 via email

@kortschak
Member

Please git rm .roc.go.swp and git commit --amend and then rebase against gonum/stat master with a forced push to your fork's branch.

Produces a ROC curve either for all possible
cutoffs, or for n equally spaced cutoffs.
@Armadilloa16
Contributor Author

Done. PTAL @kortschak

@kortschak
Member

LGTM

@btracey Do you want to merge? otherwise I'll do it in the morning.

@btracey btracey merged commit 510f301 into gonum:master Apr 26, 2016

4 participants