Built-in box for ROC curve #197

darabos · 2021-08-18T17:07:16Z

Resolves #181.

In action:

And inside:

The BoxOutputState may not exist because we may not have opened the inside of the custom box. But the BoxOutputState ID is actually a table GUID, so the table can be looked up directly. The user is ignored in this case, but the original code didn't offer real access control on this endpoint either.

darabos · 2021-08-19T08:40:34Z

built-ins/draw-ROC-curve

+      select
+      label, score,
+      sum(label) over (
+        order by score desc rows between
+        unbounded preceding and current row)
+        / (select sum(label) from input)
+        as tpr,
+
+      sum(1 - label) over (
+        order by score desc rows between
+        unbounded preceding and current row)
+        / (select sum(1 - label) from input)
+        as fpr
+
+      from input


@erbenpeter when you are not on vacation I'd appreciate it if you could take a look at this SQL query for calculating the curve.
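For readers who want to check the query by hand, here is a minimal Python sketch of what the two window sums appear to compute: running counts of positives and negatives in descending score order, each divided by its total. The function name `roc_points` and the sample rows are made up for illustration. One assumption to note: Python's sort is stable, so tied scores keep their input order here, while the SQL query leaves the order inside a tie unspecified.

```python
def roc_points(rows):
    """rows: list of (label, score) pairs with label in {0, 1}.

    Returns (label, score, tpr, fpr) per row, mirroring the query's columns.
    Assumes at least one positive and one negative label.
    """
    total_pos = sum(label for label, _ in rows)
    total_neg = sum(1 - label for label, _ in rows)
    # "order by score desc"; ties keep input order (stable sort).
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    points = []
    pos_seen = neg_seen = 0
    for label, score in ordered:
        # "rows between unbounded preceding and current row"
        pos_seen += label
        neg_seen += 1 - label
        points.append((label, score, pos_seen / total_pos, neg_seen / total_neg))
    return points


rows = [(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.4)]  # illustrative data
points = roc_points(rows)
for p in points:
    print(p)
```

Each row moves the curve either up (a positive) or right (a negative), which is why every segment is vertical or horizontal when rows are processed one at a time.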

darabos · 2021-08-19T08:41:54Z

built-ins/draw-ROC-curve

+      select sum(1 - fpr) / count(1) as AUC
+      from input where label == 1


Plus this one for calculating the AUC. I made up both queries myself instead of copying them from somewhere, so I'm not terribly confident in them. 😅 Thanks!
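As a sanity check on this formula, the sketch below compares the mean of `1 - fpr` over positive rows with the pairwise definition of AUC (the fraction of positive/negative pairs where the positive has the higher score). The function names and sample data are illustrative, not from the PR, and the comparison assumes all scores are distinct.

```python
from itertools import product


def auc_from_positive_rows(rows):
    """Mean of (1 - fpr) over rows with label == 1, as in the query."""
    total_neg = sum(1 - label for label, _ in rows)
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    neg_seen = 0
    terms = []
    for label, _ in ordered:
        neg_seen += 1 - label
        if label == 1:
            terms.append(1 - neg_seen / total_neg)
    return sum(terms) / len(terms)


def auc_by_pairs(rows):
    """Fraction of (positive, negative) pairs ranked correctly; no tie handling."""
    pos = [score for label, score in rows if label == 1]
    neg = [score for label, score in rows if label == 0]
    wins = sum(1 for p, n in product(pos, neg) if p > n)
    return wins / (len(pos) * len(neg))


rows = [(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.4)]  # illustrative, distinct scores
print(auc_from_positive_rows(rows), auc_by_pairs(rows))
```

For a positive row, `1 - fpr` is the share of negatives that rank below it, so averaging over positives gives the pairwise fraction. That argument relies on distinct scores.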

darabos · 2021-08-23T13:35:53Z

Everyone's on vacation but I want to prepare a release candidate. I'm merging this while hoping for a future review.

erbenpeter · 2021-09-01T08:07:33Z

I'm back from vacation, so now I'll take a look.

erbenpeter

Thanks, nice and clean!

I almost like it. My issue is with the possibility of items having the same score but different labels (which I think is possible). See my more detailed comments inline.

erbenpeter · 2021-09-01T08:21:24Z

app/com/lynxanalytics/biggraph/serving/JsonServer.scala

@@ -392,8 +392,7 @@ class ProductionJsonServer @javax.inject.Inject() (

  def downloadCSV = asyncAction(parse.anyContent) { (user: User, r: mvc.Request[mvc.AnyContent]) =>
    val request = parseJson[GetTableOutputRequest](user, r)
-    implicit val metaManager = workspaceController.metaManager
-    val table = workspaceController.getOutput(user, request.id).table
+    val table = workspaceController.metaManager.table(java.util.UUID.fromString(request.id))


This part I don't understand even after reading the long commit comment. How is it related to the newly added built-in box?

> This part I don't understand even after reading the long commit comment. How is it related to the newly added built-in box?

The plot refers to the table GUID and the frontend sends this request to get the data. The box output state for a table also has the table's GUID. So you can access the same table either by looking up the GUID as a box output state ID and then taking the table from it (the old code) or looking up the GUID as a table (the new code).

Box output states are not persisted. We assume you only want to look at a box output that we have returned in this run. So if you restart LynxKite after creating a plot, and then look at the plot without looking at the box that generated it, you get an error. This is an edge case I didn't consider originally. You don't typically look at box outputs when not looking at the box. Except this happens with custom boxes!
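The failure mode can be illustrated with a toy model. Everything in this sketch is hypothetical: `TableStore` and `OutputStateCache` are stand-ins invented for illustration, not LynxKite classes. It only assumes what the explanation above says, namely that tables can be looked up by GUID after a restart while box output states live in memory only.

```python
import uuid


class TableStore:
    """Hypothetical stand-in for persisted table metadata."""

    def __init__(self):
        self._tables = {}

    def add(self, rows):
        guid = str(uuid.uuid4())
        self._tables[guid] = rows
        return guid

    def table(self, guid):
        return self._tables[guid]


class OutputStateCache:
    """Hypothetical stand-in for box output states: in memory, lost on restart."""

    def __init__(self, store):
        self._store = store
        self._states = {}

    def register(self, guid):
        # The state for a table output is keyed by the table's own GUID.
        self._states[guid] = guid

    def get_output_table(self, state_id):
        # Raises KeyError if the box output was never returned in this run.
        return self._store.table(self._states[state_id])


store = TableStore()
guid = store.add([("a", 1)])
OutputStateCache(store).register(guid)  # the box output was viewed once

cache_after_restart = OutputStateCache(store)  # simulated restart: cache is empty
try:
    cache_after_restart.get_output_table(guid)  # old path
    old_path_ok = True
except KeyError:
    old_path_ok = False

new_path_rows = store.table(guid)  # new path: look up the GUID as a table
print(old_path_ok, new_path_rows)
```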

erbenpeter · 2021-09-01T08:23:03Z

built-ins/draw-ROC-curve

+      from vertices
+      where isnotnull(${`true label`})
+        and isnotnull(${`predicted score`})
+      limit ${`sample size`}


Can we regard this as a true random sample because the rows are in random order in the query?

> Can we regard this as a true random sample because the rows are in random order in the query?

Yes. The input is a graph where this usually holds.

erbenpeter · 2021-09-01T09:10:50Z

built-ins/draw-ROC-curve

+      select
+      label, score,
+      sum(label) over (
+        order by score desc rows between


I only found a small issue here: If it's possible that the classifier assigns the same score for multiple items, for that specific score we will get a lot of different (fpr, tpr) pairs. Only the last one (order by score desc) is really meaningful.

On the chart it can create weird blobs I think.

One possible solution is to add a second SQL query that picks the last value from each score group, but that only works if we can assume that the order of the records stays the same between the first and the second query. Alternatively, we can pre-aggregate by score (num_1, num_0), and then your SQL will work even for repeated score values.
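A minimal Python sketch of the pre-aggregation idea, under the assumption that one curve point per distinct score is what is wanted. The function name, the explicit (0, 0) starting point, and the sample data are additions for illustration and are not part of the PR.

```python
from collections import defaultdict


def roc_points_grouped(rows):
    """rows: (label, score) pairs. Returns (fpr, tpr) points, one per distinct score."""
    counts = defaultdict(lambda: [0, 0])  # score -> [num_1, num_0]
    for label, score in rows:
        counts[score][0 if label == 1 else 1] += 1
    total_pos = sum(c[0] for c in counts.values())
    total_neg = sum(c[1] for c in counts.values())

    points = [(0.0, 0.0)]  # added starting point, not produced by the query
    pos_seen = neg_seen = 0
    for score in sorted(counts, reverse=True):
        # The whole score group is consumed in one step, so the order
        # of rows inside the group no longer matters.
        pos_seen += counts[score][0]
        neg_seen += counts[score][1]
        points.append((neg_seen / total_neg, pos_seen / total_pos))
    return points


# Four items share the score 0.5 but have different labels.
rows = [(1, 0.9), (1, 0.5), (0, 0.5), (0, 0.5), (1, 0.5), (0, 0.1)]
points = roc_points_grouped(rows)
print(points)
```

With this grouping a tied group contributes a single diagonal segment instead of a staircase whose shape depends on row order.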

> I only found a small issue here: If it's possible that the classifier assigns the same score for multiple items, for that specific score we will get a lot of different (fpr, tpr) pairs. Only the last one (order by score desc) is really meaningful.
>
> On the chart it can create weird blobs I think.

Nope. Looks fine:

erbenpeter · 2021-09-01T09:55:12Z

built-ins/draw-ROC-curve

+  parameters:
+    persist: 'no'
+    sql: |
+      select sum(1 - fpr) / count(1) as AUC


It's a tricky implementation (compared to the first 10 I've found googling) but it also assumes that scores are unique (which I think is not realistic: if two customers have the same ingredients the (deterministic) classifier needs to assign the same score to them, but it's possible that their labels are different in reality.)

Mathematically speaking this formula only works when the segments of the ROC curve are all vertical or horizontal.

To phrase my problem in a different way: if it's possible to have repeated scores with different labels, your curve is not defined; its shape (and consequently the AUC value) depends on the order of the data points inside a group with the same score.

> To phrase my problem in a different way: if it's possible to have repeated scores with different labels, your curve is not defined; its shape (and consequently the AUC value) depends on the order of the data points inside a group with the same score.

You need to have them in a random order, which is the case here. So you get a nearly diagonal line and an AUC close to 0.5 in the case of the above screenshot. (Random 0/1 label and random 0/1 prediction.)

> Mathematically speaking this formula only works when the segments of the ROC curve are all vertical or horizontal.

Then I have great news! The segments of the ROC curve are all vertical or horizontal.

darabos · 2021-09-10T08:32:06Z

> I almost like it. My issue is with the possibility of items having the same score but different labels (which I think is possible). See my more detailed comments inline.

I've checked how scikit-learn does the ROC curve when scores are repeated and it does it like you say. 😓 Sorry I was confidently wrong. I've sent a fix in #202. Thanks!
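To make the disagreement concrete, here is a small pure-Python sketch (it does not call scikit-learn) of the extreme case where every item has the same score. The row-wise formula from this PR gives a different AUC for each ordering of the tied rows, while treating the tied group as one diagonal segment and integrating with the trapezoidal rule gives a single value. Function names and data are illustrative only.

```python
def rowwise_auc(labels_in_order):
    """Mean of (1 - fpr) over positive rows, visiting rows in the given order."""
    total_neg = labels_in_order.count(0)
    total_pos = labels_in_order.count(1)
    neg_seen = 0
    acc = 0.0
    for label in labels_in_order:
        if label == 1:
            acc += 1 - neg_seen / total_neg
        else:
            neg_seen += 1
    return acc / total_pos


def trapezoid_auc(points):
    """points: (fpr, tpr) pairs sorted by fpr, starting at (0, 0)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Four items with one shared score: two positives, two negatives.
best = rowwise_auc([1, 1, 0, 0])   # positives happen to come first
worst = rowwise_auc([0, 0, 1, 1])  # negatives happen to come first
tied = trapezoid_auc([(0.0, 0.0), (1.0, 1.0)])  # one diagonal segment
print(best, worst, tied)
```

A random order within the group lands near 0.5 on average, which matches the earlier observation about the screenshot, but any single run can fall anywhere between the two extremes.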

darabos added 7 commits August 18, 2021 16:07

Ignore new directory from SBT.

7275007

Built-in custom box for ROC curve.

a0b0d15

Manual fixes to YAML.

12a11a0

Uppercase "ROC".

7b0d38e

Automatic changes that I missed earlier.

a29b2bd

ROC in the changelist.

05ee4e9

darabos requested a review from erbenpeter August 19, 2021 08:29

darabos merged commit 69018ae into main Aug 23, 2021

darabos deleted the darabos-roc branch August 23, 2021 13:35

erbenpeter reviewed Sep 1, 2021

darabos mentioned this pull request Sep 10, 2021

Use sklearn for ROC curve #202

Merged
