-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
linreg stratified docs #4458
linreg stratified docs #4458
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome sauce
... .or_missing() | ||
>>> mt_linreg = hl.linear_regression(y = male_pheno, x = [1, mt_linreg.GT.n_alt_alleles()], root='linreg_male') | ||
|
||
Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd reverse the order to match that of the code (group_by then linreg)
|
||
Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators | ||
|
||
>>> mt_linreg = mt.annotate_rows(linreg = hl.agg.group_by(mt.pheno.is_female, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move linreg to new line so the subsequent lines aren't so wide
variable. The first approach utilizes the :func:`.linear_regression` method and must be called | ||
separately for each group even though it can compute statistics for multiple phenotypes | ||
simultaneously. This is because the :func:`.linear_regression` method drops samples that have | ||
more than one missing value across all phenotypes, such as when the groups are mutually |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"that have a missing value for any of the phenotypes; when the groups are mutually exclusive, such as 'Male' and 'Female', no samples remain!"
simultaneously. This is because the :func:`.linear_regression` method drops samples that have | ||
more than one missing value across all phenotypes, such as when the groups are mutually | ||
exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and | ||
`male_pheno` cannot be computed at the same time because they are inputs to two different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that you're trying to point out why mt_linreg
is used in male_pheno
rather than mt
, i.e. why we couldn't just define male_pheno = ~female_pheno
. How about:
"Note that we cannot define male_pheno = ~female_pheno
because we subsequently need male_pheno
to be an expression on the mt_linreg
rather thanmt
."
exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and | ||
`male_pheno` cannot be computed at the same time because they are inputs to two different | ||
matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise | ||
the output for the 'Male' grouping will overwrite the 'Female' output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 'Male' output will overwrite the 'Female' output.
matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise | ||
the output for the 'Male' grouping will overwrite the 'Female' output. | ||
|
||
The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse order again
|
||
The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` | ||
aggregators. The aggregation expression generates a dictionary where the keys are the grouping | ||
variables and the values are the linear regression statistics for that group. The result of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group.
aggregators. The aggregation expression generates a dictionary where the keys are the grouping | ||
variables and the values are the linear regression statistics for that group. The result of the | ||
aggregation expression is then used to annotate the matrix table. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd note some pros of each. linear_regression is more efficient, especially when analyzing many phenotypes.
linreg aggregator is more flexible (multiple covariates can be vary by entry) and returns a richer set of statistics.
|
||
:**code**: | ||
|
||
Approach #1: Use the :func:`.linear_regression` method for all phenotypes simulatenously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in simultaneously
statistics. If the phenotypes being analyzed have different patterns of missingness, you should | ||
**not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1). | ||
This is because the :func:`.linear_regression` method drops samples that have a missing value for | ||
any of the phenotypes. Approach #2 will do two passes over the data while Approach #3 will do one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aggregator, especially when analyzing many phenotypes. However, the :func:`.aggregators.linreg` | ||
aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of | ||
statistics. If the phenotypes being analyzed have different patterns of missingness, you should | ||
**not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is too strong a statement. In some cases, restricting to the common samples may be reasonable. Instead, just make users aware of the behavior that will result in Approach #1.
the matrix table. | ||
|
||
The :func:`.linear_regression` method is more efficient than the :func:`.aggregators.linreg` | ||
aggregator, but the :func:`.aggregators.linreg` aggregator is more flexible (multiple covariates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...aggregator and can be extended to multiple phenotypes, but...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to fix this error
185 >>> mt_linreg = hl.linear_regression(y=mt.pheno.height, x=[1, mt.GT.n_alt_alleles()])
UNEXPECTED EXCEPTION: TypeError("linear_regression() missing 1 required positional argument: 'covariates'",)
one more:
|
Fixes #4251
@jbloom22 Is there anything else I should add? Maybe something about the relative performance of each approach? I also thought about using two separate linreg aggregator annotations, but that didn't seem better than the group_by approach.