Pandas based linear regression #302
Clean way of masking! I had some comments/questions. Will look at tests in the next round.
| return oe.contract(subscripts, *operands, memory_limit=5e7) | ||
| def _linear_regression_inner(genotype_df, Y, Q, QtY, YdotY, Y_mask, dof, phenotype_names): |
Can you please add docstrings like the ones Leland has written for other functions, describing the function, its arguments and their dimensions, and the output? And can we refer to https://arxiv.org/pdf/1901.09531.pdf here as well?
| Y = phenotype_df.to_numpy('float64', copy=True) | ||
| Y_mask = ~np.isnan(Y) | ||
| Y[~Y_mask] = 0 | ||
| Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1])) |
What is this 3D initialization for in light of the next line?
| Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1])) | ||
| Q = np.linalg.qr(C)[0] | ||
| QtY = Q.T @ Y | ||
| YdotY = np.sum(Y * Y, axis = 0) - np.sum(QtY * QtY, axis = 0) |
The name YdotY is misleading, as it is actually YdotY - QtYdotQtY. Can we either change the name or separate the two parts?
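One way to act on this comment, sketched with NumPy only: keep the raw sum of squares and the part explained by the covariates as separate terms, and give the difference a name that says what it is. The names `Y_sum_sq`, `QtY_sum_sq`, and `Y_resid_sum_sq` are illustrative and not from the PR; the data is random and has no missing values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_covariates, n_phenotypes = 20, 3, 2
C = rng.normal(size=(n_samples, n_covariates))  # covariates
Y = rng.normal(size=(n_samples, n_phenotypes))  # phenotypes

Q = np.linalg.qr(C)[0]  # orthonormal basis for the column space of C
QtY = Q.T @ Y

# The two parts, kept separate (hypothetical names)
Y_sum_sq = np.sum(Y * Y, axis=0)        # per-phenotype Y . Y
QtY_sum_sq = np.sum(QtY * QtY, axis=0)  # per-phenotype (Q^T Y) . (Q^T Y)
Y_resid_sum_sq = Y_sum_sq - QtY_sum_sq

# Cross-check: the difference equals the sum of squares of the
# phenotype residuals after projecting out the covariates.
Y_resid = Y - Q @ QtY
direct = np.sum(Y_resid * Y_resid, axis=0)
```

With complete data the difference is the residual sum of squares of each phenotype, which suggests a name along those lines rather than `YdotY`.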
| ''' | ||
| A wrapper around opt_einsum to ensure uniform memory limits. | ||
| ''' | ||
| return oe.contract(subscripts, *operands, memory_limit=5e7) |
Can you mention where 5e7 comes from?
| with oe.sharing.shared_intermediates(): | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) |
Why is this line repeated?
| X = np.column_stack(genotype_df['values'].array) | ||
| with oe.sharing.shared_intermediates(): | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) |
What is the theoretical approach to missingness? If a sample is missing for a phenotype, does that mean we ignore that row in X and C as well? In that case, I wonder whether the Q obtained from the QR factorization of C would be a totally different matrix. The code above seems to assume that Q does not change.
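The concern can be checked numerically. The sketch below is not the PR's computation: it compares an exact approach (drop the missing sample from both the phenotype and C, then refactorize) against a deliberately naive one (zero-fill the phenotype and reuse the Q from the full C). All names and the random data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_covariates = 8, 2
C = rng.normal(size=(n_samples, n_covariates))
y = rng.normal(size=n_samples)

keep = np.ones(n_samples, dtype=bool)
keep[0] = False  # pretend the phenotype is missing for sample 0

Q_full = np.linalg.qr(C)[0]       # factorization of all rows
Q_sub = np.linalg.qr(C[keep])[0]  # factorization after dropping the row

# Exact: residualize the observed values against the reduced covariates
y_kept = y[keep]
resid_exact = y_kept - Q_sub @ (Q_sub.T @ y_kept)

# Naive: zero-fill the missing value and reuse the full-data Q
y_zeroed = np.where(keep, y, 0.0)
resid_naive = (y_zeroed - Q_full @ (Q_full.T @ y_zeroed))[keep]

max_diff = np.max(np.abs(resid_exact - resid_naive))
```

In general `max_diff` is far from zero, so reusing Q unchanged is an approximation unless the remaining terms compensate for the dropped rows, which is what the question asks the author to spell out.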
| X = np.column_stack(genotype_df['values'].array) | ||
| with oe.sharing.shared_intermediates(): | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) |
It seems the name XdotY here does not match its definition in https://arxiv.org/pdf/1901.09531.pdf. I think it is better to match the names used in that paper.
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) | ||
| QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X) | ||
| XdotX_reciprocal = 1 / (_einsum('sp,sg,sg->pg', Y_mask, X, X) - | ||
| _einsum('pgc,pgc->pg', QtX, QtX)) |
Again I prefer we do not call this XdotX.
| with oe.sharing.shared_intermediates(): | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) | ||
| XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask) | ||
| QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X) |
Can we make this ->pcg instead of pgc to match QtX?
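To make the suggestion concrete, here is a small sketch using `np.einsum` in place of the PR's `_einsum` wrapper (an assumption, so that the example needs only NumPy). With the `pcg` layout, each per-phenotype slice has the same shape as `Q.T @ X`, and the later contraction over `c` gives the same result either way. Shapes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_covariates, n_genotypes, n_phenotypes = 6, 2, 4, 3
Q = rng.normal(size=(n_samples, n_covariates))
X = rng.normal(size=(n_samples, n_genotypes))
# 1.0 where the phenotype is observed, 0.0 where it is missing
Y_mask = (rng.random((n_samples, n_phenotypes)) > 0.2).astype(np.float64)

QtX_pgc = np.einsum('sc,sp,sg->pgc', Q, Y_mask, X)  # layout in the diff
QtX_pcg = np.einsum('sc,sp,sg->pcg', Q, Y_mask, X)  # suggested layout

# Slice p of the pcg layout is Q^T applied to X with masked rows zeroed
slice_0 = Q.T @ (Y_mask[:, [0]] * X)

# The downstream sum over covariates is unaffected by the axis order
ss_pgc = np.einsum('pgc,pgc->pg', QtX_pgc, QtX_pgc)
ss_pcg = np.einsum('pcg,pcg->pg', QtX_pcg, QtX_pcg)
```

The two layouts differ only by a transpose of the last two axes, so the change is about readability rather than the numerical result.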
Signed-off-by: Henry D <henrydavidge@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #302 +/- ##
=======================================
Coverage 93.62% 93.62%
=======================================
Files 95 95
Lines 4812 4812
Branches 456 456
=======================================
Hits 4505 4505
Misses 307 307
kianfar77
left a comment
Great! Just a few minor comments.
| A Spark DataFrame that contains: | ||
| - All columns from `genotype_df` except the `values_column` | ||
| - `effect`: The effect size estimate for the genotype | ||
| - `stderror`: The estimated standard error |
nit: of the effect
| C = covariate_df.to_numpy(np.float64, copy=True) | ||
| if fit_intercept: | ||
| intercept = np.ones((phenotype_df.shape[0], 1)) |
Why not use _add_intercept?
| @typechecked | ||
| def _linear_regression_inner(genotype_df: pd.DataFrame, Y: NDArray[(Any, Any), np.float64], |
Can you use a name other than genotype_df here, as that name is already used for the Spark DataFrame earlier?
| Q: NDArray[(Any, Any), np.float64], dof: int, | ||
| phenotype_names: pd.Series) -> pd.DataFrame: | ||
| ''' | ||
| Applies a linear regression model to a block of genotypes. |
Can you add a comment here, and perhaps in the linear_regression function, that after projecting out the covariates the problem reduces to performing multiple single-variable regressions in parallel? The linear algebra notation can easily be confused with multivariate regression, where X is a matrix of predictors fit jointly.
Good suggestion
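A short numerical sketch of that point, using complete data and NumPy only (names and data are illustrative, not from the PR): once the covariates are projected out of both X and y, each genotype column gets its own one-variable fit, and the result matches the genotype coefficient from fitting the covariates and that genotype jointly.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_genotypes = 50, 4
# Covariates: an intercept column plus two random columns
C = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
X = rng.normal(size=(n_samples, n_genotypes))  # one column per genotype
y = rng.normal(size=n_samples)                 # a single phenotype

Q = np.linalg.qr(C)[0]
X_resid = X - Q @ (Q.T @ X)  # covariates projected out of each genotype
y_resid = y - Q @ (Q.T @ y)  # covariates projected out of the phenotype

# One single-variable regression per genotype column, computed together
effects = (X_resid.T @ y_resid) / np.sum(X_resid * X_resid, axis=0)

# Reference: for each genotype, fit [C, x_g] jointly and keep the
# coefficient on x_g.
reference = np.array([
    np.linalg.lstsq(np.column_stack([C, X[:, g]]), y, rcond=None)[0][-1]
    for g in range(n_genotypes)
])
```

Here X is a matrix only because the genotypes are batched; no two genotype columns are ever fit in the same model.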
| - click=7.1.1 # Docs notebook source generation | ||
| - databricks-cli=0.9.1 # Docs notebook source generation | ||
| - jinja2 | ||
| - jupyter |
Is this related to this PR?
No, but it's very useful for testing.
kianfar77
left a comment
Thanks
Signed-off-by: Henry D henrydavidge@gmail.com
What changes are proposed in this pull request?
(Details)
How is this patch tested?
(Details)