Pandas based linear regression #302

henrydavidge · 2020-10-27T18:53:47Z

Signed-off-by: Henry D <henrydavidge@gmail.com>

What changes are proposed in this pull request?

(Details)

How is this patch tested?

Unit tests
Integration tests
Manual tests

(Details)

kianfar77

Clean way of masking! I had some comments/questions. Will look at tests in the next round.

kianfar77 · 2020-11-03T13:25:38Z

python/glow/gwas/linear_regression.py

+    return oe.contract(subscripts, *operands, memory_limit=5e7)
+
+
+def _linear_regression_inner(genotype_df, Y, Q, QtY, YdotY, Y_mask, dof, phenotype_names):


Can you please add docstrings like the ones Leland wrote for other functions, describing the function, its arguments and their dimensions, and the output? Can we also refer to https://arxiv.org/pdf/1901.09531.pdf here?

kianfar77 · 2020-11-06T18:55:45Z

python/glow/gwas/linear_regression.py

+    Y = phenotype_df.to_numpy('float64', copy=True)
+    Y_mask = ~np.isnan(Y)
+    Y[~Y_mask] = 0
+    Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1]))


What is this 3D initialization for, given that the next line reassigns Q?

kianfar77 · 2020-11-06T19:24:40Z

python/glow/gwas/linear_regression.py

+    Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1]))
+    Q = np.linalg.qr(C)[0]
+    QtY = Q.T @ Y
+    YdotY = np.sum(Y * Y, axis = 0) - np.sum(QtY * QtY, axis = 0)


The name YdotY is misleading, as it is actually YdotY - QtYdotQtY. Can we either change the name or separate the two parts?
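
For context, a minimal NumPy sketch of why the quantity in the diff is a residual sum of squares rather than a plain Y dot Y. The shapes and random data are made up for illustration, and `YdotY_residual` is a hypothetical name, not one proposed in this PR.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_covariates, n_phenotypes = 50, 3, 4
C = rng.normal(size=(n_samples, n_covariates))  # illustrative covariates
Y = rng.normal(size=(n_samples, n_phenotypes))  # illustrative phenotypes

Q = np.linalg.qr(C)[0]  # orthonormal basis for the column space of C
QtY = Q.T @ Y

# The two parts the comment asks to separate.
YdotY_raw = np.sum(Y * Y, axis=0)
QtYdotQtY = np.sum(QtY * QtY, axis=0)
YdotY_residual = YdotY_raw - QtYdotQtY

# Direct computation: squared norm of Y after projecting out the covariates.
Y_resid = Y - Q @ QtY
direct = np.sum(Y_resid * Y_resid, axis=0)
matches = np.allclose(YdotY_residual, direct)
```

The difference equals the squared norm of the covariate-residualized phenotypes, which is why a name that mentions the residual may read more clearly.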

kianfar77 · 2020-11-06T19:27:39Z

python/glow/gwas/linear_regression.py

+    '''
+    A wrapper around opt_einsum to ensure uniform memory limits.
+    '''
+    return oe.contract(subscripts, *operands, memory_limit=5e7)


Can you mention where 5e7 comes from?

kianfar77 · 2020-11-06T19:28:11Z

python/glow/gwas/linear_regression.py

+
+    with oe.sharing.shared_intermediates():
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)


Why is this line repeated?

kianfar77 · 2020-11-06T19:43:45Z

python/glow/gwas/linear_regression.py

+    X = np.column_stack(genotype_df['values'].array)
+
+    with oe.sharing.shared_intermediates():
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)


What is the theoretical approach to missingness? If a sample is missing for a phenotype, do we ignore that row in X and C as well? In that case, I wonder whether the Q obtained from the QR factorization of C would be a totally different matrix. The code above seems to assume that Q does not change.
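
A small NumPy sketch of the concern raised here, on made-up data: dropping rows from an orthonormal Q leaves a matrix that is no longer orthonormal, so projecting with the masked Q is not the same as refitting the QR factorization on the observed samples. This only illustrates the question; it is not a statement about how the final version of the PR handles missing values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_covariates = 20, 3
C = rng.normal(size=(n_samples, n_covariates))  # illustrative covariates
y = rng.normal(size=n_samples)                  # one illustrative phenotype

mask = np.ones(n_samples, dtype=bool)
mask[:5] = False  # pretend the first five samples are missing for this phenotype

Q_full = np.linalg.qr(C)[0]
Q_masked = Q_full[mask]             # rows of the full Q for observed samples
Q_refit = np.linalg.qr(C[mask])[0]  # Q recomputed from observed samples only

# Columns of Q_masked are no longer orthonormal once rows are dropped.
gram = Q_masked.T @ Q_masked
still_orthonormal = np.allclose(gram, np.eye(n_covariates))

# Residualizing with the masked Q differs from residualizing with the refit Q.
y_obs = y[mask]
resid_masked = y_obs - Q_masked @ (Q_masked.T @ y_obs)
resid_refit = y_obs - Q_refit @ (Q_refit.T @ y_obs)
same_residuals = np.allclose(resid_masked, resid_refit)
```

Both flags come out false on this data, so the two approaches are not interchangeable in general.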

kianfar77 · 2020-11-06T19:47:15Z

python/glow/gwas/linear_regression.py

+    X = np.column_stack(genotype_df['values'].array)
+
+    with oe.sharing.shared_intermediates():
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)


It seems the name XdotY here does not match its definition in https://arxiv.org/pdf/1901.09531.pdf. I think it is better to match the names used in that paper.
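
To make the naming discussion concrete, here is a sketch of what the expression in the diff computes. It uses `np.einsum` as a stand-in for the PR's `_einsum` wrapper, and the data, shapes, and the `looped` reference implementation are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
S, Cn, G, P = 12, 2, 5, 3  # samples, covariates, genotypes, phenotypes
Q = np.linalg.qr(rng.normal(size=(S, Cn)))[0]
X = rng.normal(size=(S, G))
Y = rng.normal(size=(S, P))
Y_mask = rng.random(size=(S, P)) > 0.2  # True where the phenotype is observed
Y[~Y_mask] = 0                          # missing values zeroed, as in the diff
QtY = Q.T @ Y

# Same subscripts as the diff, with the boolean mask cast to float.
vectorized = Y.T @ X - np.einsum('cp,sc,sg,sp->pg', QtY, Q, X,
                                 Y_mask.astype(np.float64))

# Reference: per phenotype, dot the masked covariate-residualized phenotype with X.
looped = np.empty((P, G))
for p in range(P):
    fitted = Q @ QtY[:, p]                      # covariate fit, shape (S,)
    y_resid = (Y[:, p] - fitted) * Y_mask[:, p]  # zero where missing
    looped[p] = y_resid @ X

agrees = np.allclose(vectorized, looped)
```

In other words, the value is the dot product of X with the masked, residualized phenotype, not a plain X dot Y.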

kianfar77 · 2020-11-06T20:35:18Z

python/glow/gwas/linear_regression.py

+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
+        QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X)
+        XdotX_reciprocal = 1 / (_einsum('sp,sg,sg->pg', Y_mask, X, X) -
+                                _einsum('pgc,pgc->pg', QtX, QtX))


Again, I would prefer that we not call this XdotX.

kianfar77 · 2020-11-06T20:42:19Z

python/glow/gwas/linear_regression.py

+    with oe.sharing.shared_intermediates():
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
+        XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
+        QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X)


Can we make this ->pcg instead of pgc to match QtX?
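
A quick sketch of what the suggested index order changes, again with `np.einsum` standing in for `_einsum` and made-up data. With `->pcg`, each per-phenotype slice has shape (covariates, genotypes), the same layout as a plain `Q.T @ X`.

```python
import numpy as np

rng = np.random.default_rng(3)
S, Cn, G, P = 10, 2, 4, 3  # samples, covariates, genotypes, phenotypes
Q = np.linalg.qr(rng.normal(size=(S, Cn)))[0]
X = rng.normal(size=(S, G))
M = (rng.random(size=(S, P)) > 0.2).astype(np.float64)  # illustrative mask

QtX_pgc = np.einsum('sc,sp,sg->pgc', Q, M, X)  # order used in the diff
QtX_pcg = np.einsum('sc,sp,sg->pcg', Q, M, X)  # order suggested in the review

# The two results differ only by swapping the last two axes.
same_values = np.allclose(QtX_pcg, QtX_pgc.transpose(0, 2, 1))

# Each slice of the pcg layout is a masked Q.T @ X for that phenotype.
slice_matches = np.allclose(QtX_pcg[0], (Q * M[:, [0]]).T @ X)

# The downstream contraction needs its subscripts updated to match.
sq_pgc = np.einsum('pgc,pgc->pg', QtX_pgc, QtX_pgc)
sq_pcg = np.einsum('pcg,pcg->pg', QtX_pcg, QtX_pcg)
same_contraction = np.allclose(sq_pgc, sq_pcg)
```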

codecov · 2020-11-11T19:27:57Z

Codecov Report

Merging #302 (d489a86) into master (a0b1dd6) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #302   +/-   ##
=======================================
  Coverage   93.62%   93.62%           
=======================================
  Files          95       95           
  Lines        4812     4812           
  Branches      456      456           
=======================================
  Hits         4505     4505           
  Misses        307      307


kianfar77

Great! Just a few minor comments.

kianfar77 · 2020-11-16T17:16:32Z

python/glow/gwas/linear_regression.py

+        A Spark DataFrame that contains:
+        - All columns from `genotype_df` except the `values_column`
+        - `effect`: The effect size estimate for the genotype
+        - `stderror`: The estimated standard error


nit: of the effect

kianfar77 · 2020-11-16T17:27:44Z

python/glow/gwas/linear_regression.py

+
+    C = covariate_df.to_numpy(np.float64, copy=True)
+    if fit_intercept:
+        intercept = np.ones((phenotype_df.shape[0], 1))


Why not use _add_intercept?
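
For readers without the code base at hand, a hypothetical sketch of what an intercept helper of this kind might look like. The name `add_intercept_sketch` and its signature are assumptions; the actual `_add_intercept` referenced above may differ.

```python
import numpy as np

def add_intercept_sketch(C: np.ndarray, num_samples: int) -> np.ndarray:
    """Prepend a column of ones to the covariate matrix C (hypothetical helper)."""
    intercept = np.ones((num_samples, 1))
    if C.size == 0:
        # No covariates supplied: the intercept is the only column.
        return intercept
    return np.hstack((intercept, C))

covariates = np.arange(8, dtype=np.float64).reshape(4, 2)
with_intercept = add_intercept_sketch(covariates, 4)
intercept_only = add_intercept_sketch(np.empty((4, 0)), 4)
```

Routing both call sites through one helper keeps the handling of the no-covariate case in a single place.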

kianfar77 · 2020-11-16T18:06:54Z

python/glow/gwas/linear_regression.py

+
+
+@typechecked
+def _linear_regression_inner(genotype_df: pd.DataFrame, Y: NDArray[(Any, Any), np.float64],


Can you use a name other than genotype_df here, since that name is already used for the Spark DataFrame earlier?

kianfar77 · 2020-11-16T18:53:00Z

python/glow/gwas/linear_regression.py

+                             Q: NDArray[(Any, Any), np.float64], dof: int,
+                             phenotype_names: pd.Series) -> pd.DataFrame:
+    '''
+    Applies a linear regression model to a block of genotypes.


Can you add a comment here, and perhaps in the linear_regression function, that after projecting out the covariates the problem becomes multiple single-variable regressions performed in parallel? The linear algebra notation is easy to confuse with that of multivariate regression.

Good suggestion
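
A minimal sketch of the point in this comment, on made-up data: once the covariates are projected out, the effect of each genotype is a separate single-variable regression, and all of them can be computed at once. The check against a full least-squares fit per genotype relies on the Frisch-Waugh-Lovell theorem. Variable names are illustrative, not taken from the PR.

```python
import numpy as np

rng = np.random.default_rng(4)
S, Cn, G = 40, 3, 6  # samples, covariates (including intercept), genotypes
C = np.column_stack([np.ones(S), rng.normal(size=(S, Cn - 1))])
X = rng.normal(size=(S, G))  # one column per genotype
y = rng.normal(size=S)       # a single phenotype

# Project the covariates out of both the genotypes and the phenotype.
Q = np.linalg.qr(C)[0]
X_resid = X - Q @ (Q.T @ X)
y_resid = y - Q @ (Q.T @ y)

# G independent single-variable regressions, computed in one vectorized step.
effects = (X_resid.T @ y_resid) / np.sum(X_resid * X_resid, axis=0)

# Reference: fit y ~ [C, x_g] separately for each genotype and keep the
# coefficient on x_g.
full_fit = np.array([
    np.linalg.lstsq(np.column_stack([C, X[:, g]]), y, rcond=None)[0][-1]
    for g in range(G)
])
agrees = np.allclose(effects, full_fit)
```

Here X holds many candidate predictors that are each tested on their own, rather than the design matrix of one joint model.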

kianfar77 · 2020-11-16T18:54:37Z

python/environment.yml

  - click=7.1.1 # Docs notebook source generation
  - databricks-cli=0.9.1 # Docs notebook source generation
  - jinja2
+  - jupyter


Is this related to this PR?

No, it's very useful for testing though

kianfar77

Thanks

henrydavidge added 3 commits October 27, 2020 14:53

update

9884989

Signed-off-by: Henry D <henrydavidge@gmail.com>

add files

420442c

Signed-off-by: Henry D <henrydavidge@gmail.com>

more tests

53df6ae

Signed-off-by: Henry D <henrydavidge@gmail.com>

henrydavidge requested a review from kianfar77 November 2, 2020 20:54

kianfar77 requested changes Nov 6, 2020

henrydavidge added 2 commits November 11, 2020 12:19

Update with new method

c14e6fc

Signed-off-by: Henry D <henrydavidge@gmail.com>

clean up

3f3059c

Signed-off-by: Henry D <henrydavidge@gmail.com>

henrydavidge changed the title ~~[WIP] Pandas based linear regression~~ Pandas based linear regression Nov 11, 2020

henrydavidge requested a review from kianfar77 November 11, 2020 17:34

Merge branch 'master' of github.com:projectglow/glow into lin-reg-pandas

ce12b78

Signed-off-by: Henry D <henrydavidge@gmail.com>

kianfar77 reviewed Nov 16, 2020

henrydavidge added 2 commits November 17, 2020 11:16

Kiavash's comments; add test for values column

6f92e5f

Signed-off-by: Henry D <henrydavidge@gmail.com>

only run tests on spark 3.0+

511aabf

Signed-off-by: Henry D <henrydavidge@gmail.com>

kianfar77 approved these changes Nov 18, 2020

merge conflicts

d489a86

Signed-off-by: Henry D <henrydavidge@gmail.com>

henrydavidge merged commit 9ee9c69 into projectglow:master Nov 20, 2020

