Built-in box for ROC curve #197

Merged
merged 7 commits into from
Aug 23, 2021
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@ __pycache__
.cache/
.history/
.lib/
/.bsp
/dist/*
target/
/logs/
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ Please add changes to "master", preferably ordered by their significance. (Most
- Boxes used as steps in a wizard are highlighted in the workspace view by a faint glow.
[#183](https://github.com/lynxkite/lynxkite/pull/183)
- _"Compute in Python"_ boxes can be used on tables. [#160](https://github.com/lynxkite/lynxkite/pull/160)
- Added a _"Draw ROC curve"_ built-in custom box. [#197](https://github.com/lynxkite/lynxkite/pull/197)
- Performance and compatibility improvements.
[#188](https://github.com/lynxkite/lynxkite/pull/188)
[#194](https://github.com/lynxkite/lynxkite/pull/194)
3 changes: 1 addition & 2 deletions app/com/lynxanalytics/biggraph/serving/JsonServer.scala
@@ -392,8 +392,7 @@ class ProductionJsonServer @javax.inject.Inject() (

def downloadCSV = asyncAction(parse.anyContent) { (user: User, r: mvc.Request[mvc.AnyContent]) =>
val request = parseJson[GetTableOutputRequest](user, r)
- implicit val metaManager = workspaceController.metaManager
- val table = workspaceController.getOutput(user, request.id).table
+ val table = workspaceController.metaManager.table(java.util.UUID.fromString(request.id))
Contributor

This part I don't understand even after reading the long commit command. How is it related to the newly added built-in box?

Contributor Author

> This part I don't understand even after reading the long commit command. How is it related to the newly added built-in box?

The plot refers to the table GUID and the frontend sends this request to get the data. The box output state for a table also has the table's GUID. So you can access the same table either by looking up the GUID as a box output state ID and then taking the table from it (the old code) or looking up the GUID as a table (the new code).

Box output states are not persisted. We assume you only want to look at a box output that we have returned in this run. So if you restart LynxKite after creating a plot, and then look at the plot without looking at the box that generated it, you get an error. This is an edge case I didn't consider originally. You don't typically look at box outputs when not looking at the box. Except this happens with custom boxes!
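
The two lookup paths described above can be illustrated with a small Python model. This is only a sketch and not LynxKite code: `MetaManager`, `WorkspaceController`, `publish_output`, `get_output_table` and `restart` are hypothetical stand-ins. The only assumption is that tables are persisted while box output states live in memory.

```python
import uuid


class MetaManager:
    """Hypothetical stand-in for a persisted metadata store."""

    def __init__(self):
        self._tables = {}  # in this sketch, this dict survives "restarts"

    def add_table(self, table):
        guid = uuid.uuid4()
        self._tables[guid] = table
        return guid

    def table(self, guid):
        return self._tables[guid]


class WorkspaceController:
    """Hypothetical stand-in: box output states are kept only in memory."""

    def __init__(self, meta_manager):
        self.meta_manager = meta_manager
        self._output_states = {}

    def publish_output(self, guid):
        # Called when a box output is returned to the frontend in this run.
        self._output_states[str(guid)] = guid

    def get_output_table(self, state_id):
        # Old path: GUID -> box output state -> table.
        return self.meta_manager.table(self._output_states[state_id])

    def restart(self):
        # Simulates a restart: the in-memory output states are lost.
        self._output_states.clear()


meta = MetaManager()
ctrl = WorkspaceController(meta)
guid = meta.add_table([("fpr", "tpr"), (0.0, 0.0), (1.0, 1.0)])
ctrl.publish_output(guid)
ctrl.restart()

# New path: resolve the GUID as a table directly. Still works after a restart.
table_after_restart = meta.table(uuid.UUID(str(guid)))

# Old path: the output state is gone, so the lookup fails.
try:
    ctrl.get_output_table(str(guid))
    old_path_works = True
except KeyError:
    old_path_works = False
```

In this model the plot inside a custom box keeps working after a restart because it never depends on the box output having been viewed first.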

sqlController.downloadCSV(table, request.sampleRows)
}

155 changes: 155 additions & 0 deletions built-ins/draw-ROC-curve
@@ -0,0 +1,155 @@
boxes:
  - id: anchor
    inputs: {}
    operationId: Anchor
    parameters:
      description: |-
        Draws an ROC curve and computes the AUC
        for a binary classifier prediction.
        The two parameters are the true label (0 or 1)
        and the predicted score from the model (between 0 and 1).

        To avoid an overly detailed plot,
        the curve is based on a sample of vertices.
      parameters: >-
        [{"kind":"vertex attribute (number)","id":"true
        label","defaultValue":""},{"kind":"vertex
        attribute (number)","id":"predicted
        score","defaultValue":""},{"kind":"text","id":"sample
        size","defaultValue":"1000"}]
    parametricParameters: {}
    x: 0
    y: 0
  - id: Custom-plot_2
    inputs:
      table:
        boxId: SQL1_5
        id: table
    operationId: Custom plot
    parameters:
      plot_code: |-
        {
          "layer": [{
            "mark": "line",
            "encoding": {
              "x": {
                "field": "fpr",
                "title": "False positive rate",
                "type": "quantitative"
              },
              "y": {
                "field": "tpr",
                "title": "True positive rate",
                "type": "quantitative"
              }
            }
          }, {
            "mark": {
              "type": "rule",
              "color": "lightgray",
              "strokeDash": [8, 8]
            },
            "encoding": {
              "x": { "datum": 0 },
              "y": { "datum": 0 },
              "x2": { "datum": 1 },
              "y2": { "datum": 1 }
            }
          }]
        }
    parametricParameters: {}
    x: 700
    y: 150
  - id: SQL1_4
    inputs:
      input:
        boxId: input-input
        id: input
    operationId: SQL1
    parameters:
      persist: 'yes'
      summary: Rename and filter
    parametricParameters:
      sql: |-
        select
          ${`true label`} as label,
          ${`predicted score`} as score
        from vertices
        where isnotnull(${`true label`})
          and isnotnull(${`predicted score`})
        limit ${`sample size`}
Contributor

Can we regard this as a true random sample because the rows are in random order in the query?

Contributor Author

> Can we regard this as a true random sample because the rows are in random order in the query?

Yes. The input is a graph where this usually holds.
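
To make the assumption explicit: `limit` keeps whatever rows come first, so it is a fair sample only when the row order is unrelated to the label and score. A minimal Python sketch with made-up data (the 30% positive rate, the row counts and `take_first` are all hypothetical):

```python
import random


def take_first(rows, n):
    """Mimics `limit n`: keeps the first n rows in storage order."""
    return rows[:n]


# Hypothetical data: 10,000 labels, 30% positives, stored sorted by label.
rows = [1] * 3000 + [0] * 7000

# With sorted storage the "sample" contains only positives.
sorted_rate = sum(take_first(rows, 1000)) / 1000

# With a random row order the first 1000 rows behave like a random sample.
random.Random(0).shuffle(rows)
shuffled_rate = sum(take_first(rows, 1000)) / 1000
```

Here `sorted_rate` is exactly 1.0, while `shuffled_rate` lands near the true positive rate of 0.3.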

    x: 250
    y: 250
  - id: input-input
    inputs: {}
    operationId: Input
    parameters:
      name: input
    parametricParameters: {}
    x: 50
    y: 250
  - id: SQL1_5
    inputs:
      input:
        boxId: SQL1_4
        id: table
    operationId: SQL1
    parameters:
      persist: 'no'
      sql: |-
        select
          label, score,
          sum(label) over (
            order by score desc rows between
Contributor

I only found a small issue here: If it's possible that the classifier assigns the same score for multiple items, for that specific score we will get a lot of different (fpr, tpr) pairs. Only the last one (order by score desc) is really meaningful.

On the chart it can create weird blobs I think.

One possible solution is to add a second SQL query to pick the last value from each score group, but it only works if we can assume that the order of the records remains the same between the first and the second query. Or we can pre-aggregate by score (num_1, num_0), and then your SQL will work even for repeated score values.
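
The pre-aggregation idea can be sketched in Python as follows. This is an illustration of the suggestion, not code from the PR. The function names are hypothetical, and the sketch assumes that both classes are present in the data.

```python
from collections import defaultdict


def roc_points_grouped(labels, scores):
    """ROC points with one point per distinct score, so ties are collapsed."""
    pos = defaultdict(int)  # number of label-1 rows per score (num_1)
    neg = defaultdict(int)  # number of label-0 rows per score (num_0)
    for label, score in zip(labels, scores):
        if label == 1:
            pos[score] += 1
        else:
            neg[score] += 1
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score in sorted(set(pos) | set(neg), reverse=True):
        tp += pos[score]
        fp += neg[score]
        points.append((fp / total_neg, tp / total_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise linear curve given as (fpr, tpr) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Two score groups, each with one positive and one negative row.
points = roc_points_grouped([1, 0, 1, 0], [0.9, 0.9, 0.2, 0.2])
auc = trapezoid_auc(points)
```

With grouping, a tied score contributes a single diagonal segment, so the result no longer depends on the row order inside the group.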

Contributor Author

> I only found a small issue here: If it's possible that the classifier assigns the same score for multiple items, for that specific score we will get a lot of different (fpr, tpr) pairs. Only the last one (order by score desc) is really meaningful.
>
> On the chart it can create weird blobs I think.

Nope. Looks fine:

[Image: screenshot of the resulting ROC plot]

            unbounded preceding and current row)
            / (select sum(label) from input)
            as tpr,

          sum(1 - label) over (
            order by score desc rows between
            unbounded preceding and current row)
            / (select sum(1 - label) from input)
            as fpr

        from input
Comment on lines +100 to +114
Contributor Author

@erbenpeter when you are not on vacation I'd appreciate if you could take a look at this SQL query for calculating the curve.
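
For anyone cross-checking the query by hand, here is a minimal Python sketch of the same running sums. It is not part of the PR and `cumulative_rates` is a hypothetical name. Note that Python's `sorted` is stable, so tied scores keep their input order here, which SQL does not guarantee.

```python
def cumulative_rates(rows):
    """rows: list of (label, score) pairs with label 0 or 1.

    Returns (label, score, tpr, fpr) per row in descending score order,
    using running sums like the window functions in the query.
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    total_pos = sum(label for label, _ in ordered)
    total_neg = sum(1 - label for label, _ in ordered)
    out = []
    tp = fp = 0
    for label, score in ordered:
        tp += label
        fp += 1 - label
        out.append((label, score, tp / total_pos, fp / total_neg))
    return out


rates = cumulative_rates([(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.1)])
```

For this four-row input the last row reaches `tpr = 1.0` and `fpr = 1.0`, as the end of an ROC curve should.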

      summary: Compute TPR / FPR
    parametricParameters: {}
    x: 450
    y: 250
  - id: SQL1_6
    inputs:
      input:
        boxId: SQL1_5
        id: table
    operationId: SQL1
    parameters:
      persist: 'no'
      sql: |
        select sum(1 - fpr) / count(1) as AUC
Contributor

It's a tricky implementation (compared to the first 10 I've found googling) but it also assumes that scores are unique (which I think is not realistic: if two customers have the same ingredients the (deterministic) classifier needs to assign the same score to them, but it's possible that their labels are different in reality.)

Mathematically speaking this formula only works when the segments of the ROC curve are all vertical or horizontal.
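
As a sanity check of the formula when scores are unique, the following Python sketch compares it with the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly). Both function names are hypothetical and the data is made up.

```python
def auc_step(rows):
    """Mean of (1 - fpr) over positive rows, mirroring the query's formula."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    total_neg = sum(1 - label for label, _ in ordered)
    fp = 0
    terms = []
    for label, _ in ordered:
        fp += 1 - label
        if label == 1:
            terms.append(1 - fp / total_neg)
    return sum(terms) / len(terms)


def auc_pairwise(rows):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [score for label, score in rows if label == 1]
    neg = [score for label, score in rows if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rows = [(1, 0.9), (0, 0.8), (1, 0.7), (0, 0.1)]
step_value = auc_step(rows)
pairwise_value = auc_pairwise(rows)
```

Both functions give 0.75 for this input. With tied scores `auc_step` depends on the order within a tie, while `auc_pairwise` does not.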

Contributor

To phrase my problem in a different way: if it's possible to have repeated scores with different labels, your curve is not defined, its shape (and consequently the AUC value) depends on the order of the data points inside a group with the same score.

Contributor Author

> To phrase my problem in a different way: if it's possible to have repeated scores with different labels, your curve is not defined, its shape (and consequently the AUC value) depends on the order of the data points inside a group with the same score.

You need to have them in a random order, which is the case here. So you get a nearly diagonal line and an AUC close to 0.5 in the case of the above screenshot. (Random 0/1 label and random 0/1 prediction.)
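
The dependence on tie order, and why a random order is harmless on average, can be seen in a small Python sketch. It uses the extreme case where every row shares one score. The data and the function name are hypothetical.

```python
import random


def auc_in_given_order(labels):
    """Mean of (1 - fpr) over positives, walking the rows in the given order.

    All rows are assumed to share one score, so the order is the tie order.
    """
    total_neg = labels.count(0)
    fp = 0
    terms = []
    for label in labels:
        if label == 1:
            terms.append(1 - fp / total_neg)
        else:
            fp += 1
    return sum(terms) / len(terms)


labels = [1] * 500 + [0] * 500
positives_first = auc_in_given_order(labels)        # best possible tie order
negatives_first = auc_in_given_order(labels[::-1])  # worst possible tie order

shuffled = labels[:]
random.Random(42).shuffle(shuffled)
random_order = auc_in_given_order(shuffled)         # random tie order
```

The two sorted orders give the extremes 1.0 and 0.0, while the shuffled order gives a value close to 0.5, which is what an uninformative score should produce.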

Contributor Author

> Mathematically speaking this formula only works when the segments of the ROC curve are all vertical or horizontal.

Then I have great news! The segments of the ROC curve are all vertical or horizontal.

        from input where label == 1
Comment on lines +128 to +129
Contributor Author

Plus this one for calculating the AUC. I made up both queries myself instead of copying them from somewhere, so I'm not terribly confident in them. 😅 Thanks!

      summary: Compute AUC
    parametricParameters: {}
    x: 700
    y: 300
  - id: output-plot
    inputs:
      output:
        boxId: Custom-plot_2
        id: plot
    operationId: Output
    parameters:
      name: plot
    parametricParameters: {}
    x: 900
    y: 150
  - id: output-table
    inputs:
      output:
        boxId: SQL1_6
        id: table
    operationId: Output
    parameters:
      name: AUC
    parametricParameters: {}
    x: 900
    y: 300
13 changes: 0 additions & 13 deletions dependency-licenses/scala.md
@@ -24,10 +24,6 @@ Apache | [Apache License v2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) |
Apache | [Apache License, Version 2](http://www.apache.org/licenses/LICENSE-2.0) | org.neo4j.driver # neo4j-java-driver # 4.2.5 | <notextile></notextile>
Apache | [Apache License, Version 2.0](https://aws.amazon.com/apache2.0) | com.amazonaws # aws-java-sdk # 1.7.4 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.clearspring.analytics # stream # 2.9.6 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.cloud.bigdataoss # gcs-connector # 1.6.1-hadoop2 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.cloud.bigdataoss # gcsio # 1.6.1 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.cloud.bigdataoss # util # 1.6.1 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.cloud.bigdataoss # util-hadoop # 1.6.1-hadoop2 | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.guava # guava # 30.1-android | <notextile></notextile>
Apache | [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) | com.jamesmurty.utils # java-xmlbuilder # 1.1 | <notextile></notextile>
Apache | [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) | commons-codec # commons-codec # 1.15 | <notextile></notextile>
@@ -118,21 +114,12 @@ Apache | [The Apache Software License, Version 2.0](https://www.apache.org/licen
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.github.docker-java # docker-java-api # 3.2.7 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.github.docker-java # docker-java-transport # 3.2.7 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.github.docker-java # docker-java-transport-zerodep # 3.2.7 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.api-client # google-api-client # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.api-client # google-api-client-jackson2 # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.api-client # google-api-client-java6 # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.apis # google-api-services-storage # v1-rev35-1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.code.findbugs # jsr305 # 3.0.2 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.guava # failureaccess # 1.0.1 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.guava # guava-jdk5 # 13.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.guava # listenablefuture # 9999.0-empty-to-avoid-conflict-with-guava | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.http-client # google-http-client # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.http-client # google-http-client-jackson2 # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.inject # guice # 4.2.3 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.inject.extensions # guice-assistedinject # 4.2.3 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.j2objc # j2objc-annotations # 1.3 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.oauth-client # google-oauth-client # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | com.google.oauth-client # google-oauth-client-java6 # 1.20.0 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | commons-logging # commons-logging # 1.2 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | commons-pool # commons-pool # 1.5.4 | <notextile></notextile>
Apache | [The Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt) | javax.inject # javax.inject # 1 | <notextile></notextile>