linreg stratified docs #4458

Merged

danking merged 7 commits into hail-is:master from jigold:fix4251

Sep 30, 2018

Contributor

jigold commented Sep 26, 2018

Fixes #4251

@jbloom22 Is there anything else I should add? Maybe something about the relative performance of each approach? I also thought about using two separate linreg aggregator annotations, but that didn't seem better than the group_by approach.


          linreg stratified docs

6d9e17d

jigold assigned jbloom22

jbloom22 suggested changes

Contributor

jbloom22 left a comment

awesome sauce

hail/python/hail/docs/guides/genetics.rst Outdated

+                  ...                .or_missing()
+                  >>> mt_linreg = hl.linear_regression(y = male_pheno, x = [1, mt_linreg.GT.n_alt_alleles()], root='linreg_male')
+                  Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators

Contributor

jbloom22 Sep 26, 2018

I'd reverse the order to match that of the code (group_by then linreg)

hail/python/hail/docs/guides/genetics.rst Outdated


		Approach #2: Use the :func:`.aggregators.linreg` and :func:`.aggregators.group_by` aggregators

		>>> mt_linreg = mt.annotate_rows(linreg = hl.agg.group_by(mt.pheno.is_female,

Contributor

jbloom22 Sep 26, 2018

move linreg to new line so the subsequent lines aren't so wide

hail/python/hail/docs/guides/genetics.rst Outdated

+                          variable. The first approach utilizes the :func:`.linear_regression` method and must be called
+                          separately for each group even though it can compute statistics for multiple phenotypes
+                          simultaneously. This is because the :func:`.linear_regression` method drops samples that have
+                          more than one missing value across all phenotypes, such as when the groups are mutually

Contributor

jbloom22 Sep 26, 2018

"that have a missing value for any of the phenotypes; when the groups are mutually exclusive, such as 'Male' and 'Female', no samples remain!"
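
To make the sample-dropping behavior described in this comment concrete, here is a minimal pure-Python sketch. It does not call Hail: `complete_samples` and the toy phenotype table are hypothetical and only mimic the rule "drop any sample that has a missing value for any of the phenotypes".

```python
# Toy phenotype table: None marks a missing value.
# Each group gets its own phenotype column, left missing outside the group
# (hypothetical data, not taken from the Hail docs).
samples = {
    "s1": {"female_pheno": 1.2, "male_pheno": None},
    "s2": {"female_pheno": 0.7, "male_pheno": None},
    "s3": {"female_pheno": None, "male_pheno": 2.1},
    "s4": {"female_pheno": None, "male_pheno": 1.5},
}


def complete_samples(table, phenotypes):
    """Return IDs of samples with no missing value for any listed phenotype."""
    return [
        sample_id
        for sample_id, row in table.items()
        if all(row[p] is not None for p in phenotypes)
    ]


# Regressing both phenotypes in one call: every sample is missing one of them.
both = complete_samples(samples, ["female_pheno", "male_pheno"])

# Regressing each phenotype in its own call keeps that group's samples.
female_only = complete_samples(samples, ["female_pheno"])
male_only = complete_samples(samples, ["male_pheno"])

print(both)
print(female_only)
print(male_only)
```

With mutually exclusive groups the joint call is left with an empty sample list, which is why the docs call the method once per group.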

hail/python/hail/docs/guides/genetics.rst Outdated

+                          simultaneously. This is because the :func:`.linear_regression` method drops samples that have
+                          more than one missing value across all phenotypes, such as when the groups are mutually
+                          exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and
+                          `male_pheno` cannot be computed at the same time because they are inputs to two different

Contributor

jbloom22 Sep 26, 2018

I get that you're trying to point out why mt_linreg is used in male_pheno rather than mt, i.e. why we couldn't just define male_pheno = ~female_pheno. How about:
"Note that we cannot define male_pheno = ~female_pheno because we subsequently need male_pheno to be an expression on mt_linreg rather than mt."

hail/python/hail/docs/guides/genetics.rst Outdated

+                          exclusive such as 'Male' and 'Female'. Note that the expressions for `female_pheno` and
+                          `male_pheno` cannot be computed at the same time because they are inputs to two different
+                          matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise
+                          the output for the 'Male' grouping will overwrite the 'Female' output.

Contributor

jbloom22 Sep 26, 2018

the 'Male' output will overwrite the 'Female' output.

hail/python/hail/docs/guides/genetics.rst Outdated

+                          matrix tables. Lastly, the argument to `root` must be specified for both cases -- otherwise
+                          the output for the 'Male' grouping will overwrite the 'Female' output.
+                          The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by`

Contributor

jbloom22 Sep 26, 2018

reverse order again

hail/python/hail/docs/guides/genetics.rst Outdated

+                          The second approach uses the :func:`.aggregators.linreg` and :func:`.aggregators.group_by`
+                          aggregators. The aggregation expression generates a dictionary where the keys are the grouping
+                          variables and the values are the linear regression statistics for that group. The result of the

Contributor

jbloom22 Sep 26, 2018

where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group.
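
The shape of that result can be sketched in plain Python. This is an illustration of the "group, then regress within each group" idea only, not Hail's implementation: `simple_linreg` and `grouped_linreg` are hypothetical helpers, the data are made up, and the statistics returned are a small subset chosen for the example.

```python
from collections import defaultdict


def simple_linreg(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Assumes x takes at least two distinct values within the group.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"n": n, "intercept": mean_y - slope * mean_x, "slope": slope}


def grouped_linreg(records):
    """Group (key, x, y) records by key, then fit one regression per group."""
    groups = defaultdict(lambda: ([], []))
    for key, x, y in records:
        xs, ys = groups[key]
        xs.append(x)
        ys.append(y)
    return {key: simple_linreg(xs, ys) for key, (xs, ys) in groups.items()}


# (is_female, n_alt_alleles, phenotype) for one hypothetical variant.
records = [
    (True, 0, 1.0), (True, 1, 2.0), (True, 2, 3.0),
    (False, 0, 5.0), (False, 1, 5.5), (False, 2, 6.0),
]

# Keys are the groups; values are the statistics for samples in that group.
stats = grouped_linreg(records)
print(stats[True])
print(stats[False])
```

A single pass over the records fills every group, which mirrors why the aggregator approach needs only one pass over the data.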

hail/python/hail/docs/guides/genetics.rst

+                          aggregators. The aggregation expression generates a dictionary where the keys are the grouping
+                          variables and the values are the linear regression statistics for that group. The result of the
+                          aggregation expression is then used to annotate the matrix table.

Contributor

jbloom22 Sep 27, 2018

Yes, I'd note some pros of each. linear_regression is more efficient, especially when analyzing many phenotypes.
The linreg aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics.

jigold added 2 commits

September 27, 2018 10:17


          address comments and add sections

3cffc2c


          more info

081688e

jbloom22 previously requested changes

hail/python/hail/docs/guides/genetics.rst Outdated


		:code:

		Approach #1: Use the :func:`.linear_regression` method for all phenotypes simulatenously

Contributor

jbloom22 Sep 27, 2018

typo in simultaneously

hail/python/hail/docs/guides/genetics.rst Outdated

+                          statistics. If the phenotypes being analyzed have different patterns of missingness, you should
+                          **not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1).
+                          This is because the :func:`.linear_regression` method drops samples that have a missing value for
+                          any of the phenotypes. Approach #2 will do two passes over the data while Approach #3 will do one

Contributor

jbloom22 Sep 27, 2018

while Approaches #1 and #3 will

hail/python/hail/docs/guides/genetics.rst Outdated

+                          aggregator, especially when analyzing many phenotypes. However, the :func:`.aggregators.linreg`
+                          aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of
+                          statistics. If the phenotypes being analyzed have different patterns of missingness, you should
+                          **not** use the :func:`.linear_regression` method for all phenotypes simulatenously (Approach #1).

Contributor

jbloom22 Sep 27, 2018

I think this is too strong a statement. In some cases, restricting to the common samples may be reasonable. Instead, just make users aware of the behavior that will result in Approach #1.

hail/python/hail/docs/guides/genetics.rst Outdated

+                          the matrix table.
+                          The :func:`.linear_regression` method is more efficient than the :func:`.aggregators.linreg`
+                          aggregator, but the :func:`.aggregators.linreg` aggregator is more flexible (multiple covariates

Contributor

jbloom22 Sep 27, 2018

...aggregator and can be extended to multiple phenotypes, but...


          address comments

jigold dismissed jbloom22’s stale review

September 27, 2018 17:22

done

jbloom22 previously requested changes

Contributor

jbloom22 left a comment

need to fix this error

185     >>> mt_linreg = hl.linear_regression(y=mt.pheno.height, x=[1, mt.GT.n_alt_alleles()])
UNEXPECTED EXCEPTION: TypeError("linear_regression() missing 1 required positional argument: 'covariates'",)


          fix code

bcb1fcd

Contributor

jbloom22 commented Sep 27, 2018

one more:

  File "/home/hail/.conda/envs/hail/lib/python3.6/doctest.py", line 1330, in __run
    compileflags, 1), test.globs)
  File "<doctest genetics.rst[21]>", line 1
    female_pheno = hl.case()
                           ^
SyntaxError: multiple statements found while compiling a single statement

jigold added 2 commits

September 28, 2018 07:52


          fix?

98ad055


          fix formatting

095b57d

jigold dismissed jbloom22’s stale review

September 28, 2018 15:39

done.

jbloom22 approved these changes

danking merged commit d22ebf7 into hail-is:master
