Multi-phenotype logistic regression #5072
awesome! will look tomorrow.
tpoterba
left a comment
This is great! Just a few minor comments.
hail/python/hail/methods/statgen.py
Outdated
>>> result_ht = hl.logistic_regression_rows(
...     test='wald',
...     y=[dataset.pheno1, dataset.pheno2], #where pheno1 and pheno2 values are 0, 1, or missing
style: two spaces before the # and one space after
pheno2],  # where pheno...
y_field = list(f'__y_{i}' for i in range(len(y))) if y_is_list else "__y"
y_dict = dict(zip(y_field, y)) if y_is_list else {y_field: y}
func = Env.hail().methods.LogisticRegression
no need to abstract this here - you overloaded the method in scala, which is fine.
My goal here was to preserve much of the original code path for single-phenotype logistic regression. My rationale is that this function is probably widely used, so this approach seems less risky. The alternative looks like it will require more refactoring, which should probably be done after some burn-in time.
I just mean that func is only used once below (is that right?) so it could be inlined there. There aren't two different function names being called like in linear regression.
This is fine though.
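For readers following the thread, the two quoted lines that build y_field and y_dict can be tried in isolation. This is a minimal sketch, not the code from the PR: the wrapper name name_phenotypes is hypothetical, plain strings stand in for Hail expressions, and deriving y_is_list with isinstance is an assumption, since the diff excerpt does not show how that flag is set.

```python
def name_phenotypes(y):
    """Pair temporary field names with phenotypes, as in the quoted diff lines.

    `y` is either one phenotype or a list of phenotypes. Strings stand in
    for Hail expressions here (illustrative only).
    """
    # Assumption: the excerpt does not show how y_is_list is computed.
    y_is_list = isinstance(y, list)
    # One generated name per phenotype for a list, a single fixed name otherwise.
    y_field = [f'__y_{i}' for i in range(len(y))] if y_is_list else '__y'
    # Map each temporary field name to the phenotype it refers to.
    y_dict = dict(zip(y_field, y)) if y_is_list else {y_field: y}
    return y_field, y_dict


if __name__ == '__main__':
    print(name_phenotypes(['pheno1', 'pheno2']))
    print(name_phenotypes('pheno1'))
```

Both branches produce a dict keyed by field name, so downstream code can treat the single-phenotype and list cases uniformly.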
class LogisticRegressionTest extends SparkSuite {
These tests look like they're covered by the Python tests. Is that true?
Yes, end-to-end functionality is covered by the Python tests, but adding Scala tests helped with quickly isolating lower-level issues.
Ah, got it. I'm happy to add them for now, but we may remove those tests when we refactor the Scala side.
The lack of easy debuggability of the Python tests is the biggest problem 😦
looks like there's a weird merge commit in here - mind rebasing to clean up the diff before I take another look?
batch/Makefile
Outdated
push push-test \
run-docker run \
test test-local deploy
test test-local deploy clean
the merge commit seemed to cause problems
rvb.addFields(fullRowType, rv, copiedFieldIndices)
rvb.startArray(_yVecs.cols)
logregAnnotations.foreach(stats => {
//rvb.addFields(_resultSchema.physicalType,rv,) //TODO How to add strcut here?
oops, sorry, remove this.
That's the last comment and we're ready to merge!
Thanks for the contribution!
First crack at supporting multi-phenotype logistic regression. There are no matrix optimizations like those implemented in multi-phenotype linear regression, but I attempt to follow a similar approach as far as the API and a single call to mapPartitions are concerned.
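The "no matrix optimizations" approach described above can be pictured as one independent fit per phenotype for each row. The following is a minimal pure-Python sketch of that idea, not Hail's implementation. The function names, the two-parameter model (intercept plus genotype), the per-phenotype dropping of missing values, and the output keys are all assumptions made for illustration.

```python
import math


def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Fit logit P(y=1) = b0 + b1*x by Newton's method.

    Returns (b0, b1, se_b1). Assumes the data are not separable, so the
    maximum-likelihood estimate exists.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Score vector (g) and 2x2 Fisher information (h) at the current estimate.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * d = g in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Variance of b1 is the (1, 1) entry of the inverse information,
    # taken from the information computed in the final iteration.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


def logistic_regression_row(genotypes, phenotypes):
    """Run one Wald test per phenotype for a single row (variant)."""
    results = []
    for y in phenotypes:
        # Assumption: samples with a missing (None) phenotype are dropped
        # separately for each phenotype; Hail's handling may differ.
        pairs = [(g, v) for g, v in zip(genotypes, y) if v is not None]
        xs = [g for g, _ in pairs]
        ys = [v for _, v in pairs]
        _, beta, se = fit_logistic(xs, ys)
        z = beta / se
        results.append({
            'n': len(pairs),
            'beta': beta,
            'standard_error': se,
            'z_stat': z,
            # Two-sided p-value under a standard normal null.
            'p_value': math.erfc(abs(z) / math.sqrt(2.0)),
        })
    return results


if __name__ == '__main__':
    genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2, 1]
    pheno1 = [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]
    pheno2 = [1, None, 0, 1, 0, None, 0, 0, 1, 1]
    for stats in logistic_regression_row(genotypes, [pheno1, pheno2]):
        print(stats)
```

Each phenotype repeats the full iterative fit, so the cost grows linearly with the number of phenotypes. Returning one result per phenotype in a list mirrors the array-of-results shape suggested by the rvb.startArray call in the diff.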