[SW-2567] Fix CoxPH docs example for Scala and Python #2531

Merged
merged 3 commits into from Jun 3, 2021

neema-m
Contributor

@neema-m neema-m commented May 29, 2021

Fixes JIRA: https://h2oai.atlassian.net/browse/SW-2567

The current example fails with:

```
org.apache.spark.sql.AnalysisException: cannot resolve '`label`' given input columns: [start, transplant, stop, age, event, year, id, surgery];;
'Project [start#2305, stop#2306, event#2307, age#2308, year#2309, surgery#2310, transplant#2311, id#2312, 'label]
+- Relation[start#2305,stop#2306,event#2307,age#2308,year#2309,surgery#2310,transplant#2311,id#2312] csv
```

The example needs to set the parameters that CoxPH requires.
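
For reference, a minimal sketch of what setting such parameters could look like in PySparkling. The column roles (`start`, `stop`, `event` as the label, `id` ignored) are assumptions read off the input columns in the error above, `unresolved_columns` and `build_estimator` are hypothetical helpers, and the `H2OCoxPH` keyword arguments are assumed to match the estimator's `startCol`, `stopCol`, `labelCol`, and `ignoredCols` parameters. This is not the snippet from the PR itself.

```python
# Input columns as reported in the AnalysisException above.
INPUT_COLUMNS = ["start", "stop", "event", "age", "year", "surgery", "transplant", "id"]

# Assumed column roles for the heart dataset (not confirmed by the PR text).
COXPH_PARAMS = {
    "startCol": "start",
    "stopCol": "stop",
    "labelCol": "event",
    "ignoredCols": ["id"],
}


def unresolved_columns(params, input_columns):
    """Hypothetical helper: return configured column names missing from the input."""
    configured = []
    for value in params.values():
        configured.extend(value if isinstance(value, list) else [value])
    return [name for name in configured if name not in input_columns]


def build_estimator(params=COXPH_PARAMS):
    """Build the estimator; needs a Sparkling Water installation to run."""
    # Imported lazily so the column check above runs without Spark.
    from pysparkling.ml import H2OCoxPH
    return H2OCoxPH(**params)


# A label column left at "label" cannot be resolved against this input.
print(unresolved_columns({"labelCol": "label"}, INPUT_COLUMNS))  # ['label']
print(unresolved_columns(COXPH_PARAMS, INPUT_COLUMNS))           # []
```

With a Sparkling Water session available, `build_estimator().fit(trainingDF)` would be the assumed next step.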
@h2o-ops
Collaborator

h2o-ops commented May 29, 2021

Hi @neemah2o, your PR title "Create sw_coxph.rst" is missing JIRA issue number! Please specify it in the following form [SW-]

@h2o-ops h2o-ops changed the title Create sw_coxph.rst [SW-XXX] Create sw_coxph.rst May 29, 2021
@neema-m neema-m requested a review from michalkurka May 29, 2021 07:11
@neema-m neema-m changed the title [SW-XXX] Create sw_coxph.rst [SW-2567] Fix CoxPH docs example for Scala and Python May 29, 2021
@michalkurka michalkurka requested a review from mn-mikke June 1, 2021 16:33
@mn-mikke
Collaborator

mn-mikke commented Jun 2, 2021

It seems we need a different test dataset; all predictions are NaN:

+------------------+-----------------+-------+----------+-------------------+----------+
|age               |year             |surgery|transplant|detailed_prediction|prediction|
+------------------+-----------------+-------+----------+-------------------+----------+
|-12.9390828199863 |6.39561943874059 |1      |0         |{NaN}              |NaN       |
|-0.249144421629019|5.77686516084873 |0      |1         |{NaN}              |NaN       |
|3.04722792607803  |5.51129363449692 |0      |1         |{NaN}              |NaN       |
|-15.3401779603012 |4.19712525667351 |0      |1         |{NaN}              |NaN       |
|0.0164271047227942|3.97809719370294 |1      |0         |{NaN}              |NaN       |
|-0.208076659822041|3.75085557837098 |0      |0         |{NaN}              |NaN       |
|-4.53388090349076 |2.30800821355236 |0      |0         |{NaN}              |NaN       |
|0.610540725530456 |3.27720739219713 |1      |1         |{NaN}              |NaN       |
|-1.86817          |4.966785763      |1      |1         |{NaN}              |NaN       |
|13.5003422313484  |2.56536618754278 |0      |0         |{NaN}              |NaN       |
|-17.4647501711157 |1.58247775496235 |0      |0         |{NaN}              |NaN       |
|3.80013689253936  |1.56605065023956 |0      |0         |{NaN}              |NaN       |
|-11.8165639972622 |3.26351813826146 |0      |1         |{NaN}              |NaN       |
|1.44832306639288  |1.07049965776865 |0      |0         |{NaN}              |NaN       |
|6.59548254620123  |0.700889801505818|0      |0         |{NaN}              |NaN       |
|2.51882272416153  |2.63381245722108 |0      |1         |{NaN}              |NaN       |
|4.03285420944559  |5.51403148528405 |1      |1         |{NaN}              |NaN       |
|0.703627652292951 |5.09240246406571 |0      |1         |{NaN}              |NaN       |
|-4.15879534565367 |5.95482546201232 |1      |0         |{NaN}              |NaN       |
|0.03490759753595  |2.157401         |0      |0         |{NaN}              |NaN       |
+------------------+-----------------+-------+----------+-------------------+----------+

The testing dataset (https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/coxph_test/heart_test.csv) doesn't have start and stop columns. Is that expected?
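
A quick way to spot this kind of mismatch is to compare the two schemas before scoring. The sketch below is plain Python with hypothetical helper names; the training columns are assumed to be the input columns listed in the error message in the PR description, and the test columns are taken from the table above without the two prediction columns. It does not establish that the missing columns are what causes the NaN predictions.

```python
import math

# Assumed training columns (from the error message) and test columns
# (from the table above, minus detailed_prediction and prediction).
TRAIN_COLUMNS = ["start", "stop", "event", "age", "year", "surgery", "transplant", "id"]
TEST_COLUMNS = ["age", "year", "surgery", "transplant"]


def missing_in_test(train_columns, test_columns):
    """Return training columns absent from the test data, in training order."""
    present = set(test_columns)
    return [c for c in train_columns if c not in present]


def all_nan(predictions):
    """True when there is at least one prediction and every one is NaN."""
    return len(predictions) > 0 and all(math.isnan(p) for p in predictions)


print(missing_in_test(TRAIN_COLUMNS, TEST_COLUMNS))  # ['start', 'stop', 'event', 'id']
print(all_nan([float("nan")] * 20))                  # True
```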

@neema-m
Contributor Author

neema-m commented Jun 2, 2021

Good point, @mn-mikke. The H2O-3 version uses heart.csv and splits it:

heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")

# Split the dataset into a train and test set:
train, test = heart.split_frame(ratios = [.8], seed = 1234)

I will modify the PR to match it

Updated the frames to use a train/test split of heart.csv and corrected the Scala formatting so it works in the CLI and Jupyter notebooks.

Note that heart.csv in the SW smalldata folder is the same as the heart.csv used in the H2O-3 example (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/coxph.html#examples)
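
To illustrate the kind of split described here, a seeded 80/20 split can be sketched as follows. The PR text does not show the updated snippet, so this is not the code from sw_coxph.rst: `split_rows` is a hypothetical pure-Python analog, and `split_spark_df` assumes a Spark DataFrame and uses `DataFrame.randomSplit`. The ratio and seed are borrowed from the H2O-3 example quoted above.

```python
import random


def split_rows(rows, train_ratio=0.8, seed=1234):
    """Assign each row independently to train or test with a seeded RNG.

    The split is probabilistic, so the resulting sizes only approximate
    train_ratio : 1 - train_ratio.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


def split_spark_df(df, train_ratio=0.8, seed=1234):
    """Spark counterpart using DataFrame.randomSplit; needs a running SparkSession."""
    return df.randomSplit([train_ratio, 1.0 - train_ratio], seed=seed)


rows = list(range(1000))  # stand-in for the rows of heart.csv
train, test = split_rows(rows)
print(len(train) + len(test))  # 1000
```

Fixing the seed keeps the example reproducible: the same rows land in the same partition on every run of `split_rows`.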
@neema-m
Contributor Author

neema-m commented Jun 2, 2021

@mn-mikke, I made the changes to provide a proper test dataset:

scala> model.transform(testingDF).show(false)
+-----+----+-----+------------------+-----------------+-------+----------+---+----------------------+--------------------+
|start|stop|event|age               |year             |surgery|transplant|id |detailed_prediction   |prediction          |
+-----+----+-----+------------------+-----------------+-------+----------+---+----------------------+--------------------+
|0    |2   |0    |0.610540725530456 |3.27720739219713 |1      |0         |46 |[-0.45681461656609834]|-0.45681461656609834|
|0    |2   |1    |4.18069815195072  |4.99657768651609 |0      |0         |75 |[0.03651537014229078] |0.03651537014229078 |
|0    |2   |1    |4.56399726214921  |4.17522245037645 |0      |0         |61 |[0.16226172474195466] |0.16226172474195466 |
|0    |4   |0    |-21.2731006160164 |4.87063655030801 |0      |0         |72 |[-0.5015662484269671] |-0.5015662484269671 |
|0    |4   |0    |-6.52977412731006 |2.5927446954141  |0      |0         |38 |[0.20394003741166356] |0.20394003741166356 |
|0    |4   |0    |-4.15879534565367 |5.95482546201232 |1      |0         |94 |[-0.9108067656338691] |-0.9108067656338691 |
|0    |13  |0    |-21.3497604380561 |6.00958247775496 |0      |0         |96 |[-0.6054142887730242] |-0.6054142887730242 |
|0    |18  |0    |7.27994524298425  |1.3305954825462  |0      |0         |20 |[0.7266981608568234]  |0.7266981608568234  |
|0    |19  |0    |13.5003422313484  |2.56536618754278 |0      |0         |37 |[0.6295012210602893]  |0.6295012210602893  |
|0    |26  |0    |-1.55509924709104 |5.18275154004107 |1      |0         |80 |[-0.7557212094322586] |-0.7557212094322586 |
|0    |28  |0    |1.44832306639288  |1.07049965776865 |0      |0         |16 |[0.6448465982857511]  |0.6448465982857511  |
|0    |36  |0    |-7.73716632443532 |0.490075290896646|0      |0         |4  |[0.5033912420589028]  |0.5033912420589028  |
|0    |40  |1    |-5.42094455852156 |3.16495550992471 |1      |0         |44 |[-0.570992880113988]  |-0.570992880113988  |
|0    |41  |0    |-0.657084188911703|3.75085557837098 |0      |0         |53 |[0.10108321544611465] |0.10108321544611465 |
|0    |46  |0    |0.925393566050651 |2.507871321013   |0      |0         |36 |[0.36831841133591947] |0.36831841133591947 |
|0    |51  |0    |0.903490759753595 |2.1574264202601  |0      |0         |33 |[0.45740528733585895] |0.45740528733585895 |
|0    |51  |0    |2.86926762491444  |0.780287474332649|0      |0         |7  |[0.663074262583844]   |0.663074262583844   |
|0    |78  |0    |-6.61738535249829 |3.99452429842574 |1      |0         |59 |[-0.703921414261432]  |-0.703921414261432  |
|0    |102 |1    |-6.75154004106776 |3.564681724846   |0      |0         |52 |[0.0260867300031159]  |0.0260867300031159  |
|0    |428 |0    |-18.798083504449  |4.08487337440109 |0      |0         |82 |[-0.02612265146660009]|-0.02612265146660009|
+-----+----+-----+------------------+-----------------+-------+----------+---+----------------------+--------------------+
only showing top 20 rows

Collaborator

@mn-mikke mn-mikke left a comment

LGTM. Thank you @neemah2o!

@mn-mikke mn-mikke merged commit 7b6ea69 into master Jun 3, 2021
@mn-mikke mn-mikke deleted the neemah2o-SW2567 branch June 3, 2021 13:15
mn-mikke pushed a commit that referenced this pull request Jun 3, 2021
* Create sw_coxph.rst

* Corrected "Isolation Forest" to "CoxPH"

* Updated example with split of heart.csv & corrected scala formatting

(cherry picked from commit 7b6ea69)