Conversation

@henrydavidge
Contributor

Signed-off-by: Henry D <henrydavidge@gmail.com>

What changes are proposed in this pull request?

(Details)

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

Signed-off-by: Henry D <henrydavidge@gmail.com>
Collaborator

@kianfar77 kianfar77 left a comment


Clean way of masking! I had some comments/questions. Will look at tests in the next round.

return oe.contract(subscripts, *operands, memory_limit=5e7)


def _linear_regression_inner(genotype_df, Y, Q, QtY, YdotY, Y_mask, dof, phenotype_names):
Collaborator

Can you please add docstrings like the ones Leland has written for other functions, describing the function, its arguments and their dimensions, and its output? And can we refer to https://arxiv.org/pdf/1901.09531.pdf here as well?

Y = phenotype_df.to_numpy('float64', copy=True)
Y_mask = ~np.isnan(Y)
Y[~Y_mask] = 0
Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1]))
Collaborator

What is this 3D initialization for in light of the next line?

Q = np.zeros((covariate_df.shape[0], covariate_df.shape[1], phenotype_df.shape[1]))
Q = np.linalg.qr(C)[0]
QtY = Q.T @ Y
YdotY = np.sum(Y * Y, axis = 0) - np.sum(QtY * QtY, axis = 0)
Collaborator

The YdotY name is misleading, as it is actually YdotY - QtYdotQtY. Can we either change the name or separate the two parts?
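For readers following along, a minimal NumPy sketch (hypothetical toy data; the names Y, C, Q, and QtY follow the snippet above) showing that this quantity is the residual sum of squares of Y after projecting out Q, not a plain dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(10, 3))   # samples x covariates
Y = rng.normal(size=(10, 2))   # samples x phenotypes

Q = np.linalg.qr(C)[0]         # orthonormal basis for the covariate space
QtY = Q.T @ Y
YdotY = np.sum(Y * Y, axis=0) - np.sum(QtY * QtY, axis=0)

# Since Q has orthonormal columns, Y.Y - QtY.QtY equals ||Y - Q Q^T Y||^2
# per phenotype, i.e. the residual sum of squares after projection.
resid = Y - Q @ QtY
print(np.allclose(YdotY, np.sum(resid * resid, axis=0)))  # True
```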

'''
A wrapper around opt_einsum to ensure uniform memory limits.
'''
return oe.contract(subscripts, *operands, memory_limit=5e7)
Collaborator

Can you mention where 5e7 comes from?
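As context for the question: opt_einsum's memory_limit is documented as the maximum number of elements allowed in any intermediate array, so 5e7 presumably corresponds to a cap of roughly 400 MB per float64 intermediate. A back-of-envelope check:

```python
# Back-of-envelope: if memory_limit counts elements (as in opt_einsum's
# docs), 5e7 float64 elements bound each intermediate at ~0.4 GB.
elements = 5e7
bytes_per_float64 = 8
print(elements * bytes_per_float64 / 1e9)  # 0.4 (GB)
```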


with oe.sharing.shared_intermediates():
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
Collaborator

Why is this line repeated?

X = np.column_stack(genotype_df['values'].array)

with oe.sharing.shared_intermediates():
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
Collaborator

What is the theoretical approach regarding the missingness? If a sample is missing for a phenotype, does that mean we ignore that row in X and C as well? In that case, I wonder whether the Q obtained from the QR factorization of C would be a totally different matrix. Above, it seems we are assuming Q does not change.
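A minimal sketch (hypothetical toy data) of the concern raised here: dropping a sample's row changes the QR factorization of the covariate matrix, and the rows of the full-sample Q are no longer orthonormal under the mask.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(8, 2))    # samples x covariates
mask = np.ones(8, dtype=bool)
mask[3] = False                # sample 3 missing for some phenotype

Q_full = np.linalg.qr(C)[0]    # Q computed from all samples

# Restricting Q_full to the observed rows does not give orthonormal
# columns, so projections using it differ from a re-factorized Q.
G = Q_full[mask].T @ Q_full[mask]
print(np.allclose(G, np.eye(2)))  # False: orthonormality is lost under the mask
```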

X = np.column_stack(genotype_df['values'].array)

with oe.sharing.shared_intermediates():
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
Collaborator

It seems the name XdotY here does not match its definition in https://arxiv.org/pdf/1901.09531.pdf. I think it is better to match the names with that paper.

XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X)
XdotX_reciprocal = 1 / (_einsum('sp,sg,sg->pg', Y_mask, X, X) -
_einsum('pgc,pgc->pg', QtX, QtX))
Collaborator

Again I prefer we do not call this XdotX.
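A hypothetical check (toy shapes; s=samples, c=covariates, g=genotypes, p=phenotypes, with names following the snippets above) that the XdotY einsum expression matches an explicit per-phenotype computation over only the non-missing samples:

```python
import numpy as np

rng = np.random.default_rng(2)
s, c, g, p = 12, 3, 4, 2
X = rng.normal(size=(s, g))
Y = rng.normal(size=(s, p))
Q = np.linalg.qr(rng.normal(size=(s, c)))[0]
Y_mask = rng.random(size=(s, p)) > 0.2
Y = Y * Y_mask                      # missing phenotype values are zeroed
mask_f = Y_mask.astype(np.float64)  # float mask for einsum

QtY = Q.T @ Y
XdotY = Y.T @ X - np.einsum('cp,sc,sg,sp->pg', QtY, Q, X, mask_f)

# Loop version: restrict every factor to the phenotype's observed samples.
expected = np.empty((p, g))
for j in range(p):
    m = Y_mask[:, j]
    expected[j] = Y[m, j] @ X[m] - QtY[:, j] @ (Q[m].T @ X[m])
print(np.allclose(XdotY, expected))  # True
```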

with oe.sharing.shared_intermediates():
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
XdotY = Y.T @ X - _einsum('cp,sc,sg,sp->pg', QtY, Q, X, Y_mask)
QtX = _einsum('sc,sp,sg->pgc', Q, Y_mask, X)
Collaborator

Can we make this ->pcg instead of pgc to match QtX?

Signed-off-by: Henry D <henrydavidge@gmail.com>
@henrydavidge changed the title from "[WIP] Pandas based linear regression" to "Pandas based linear regression" on Nov 11, 2020
Signed-off-by: Henry D <henrydavidge@gmail.com>
@codecov

codecov bot commented Nov 11, 2020

Codecov Report

Merging #302 (d489a86) into master (a0b1dd6) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #302   +/-   ##
=======================================
  Coverage   93.62%   93.62%           
=======================================
  Files          95       95           
  Lines        4812     4812           
  Branches      456      456           
=======================================
  Hits         4505     4505           
  Misses        307      307           


Collaborator

@kianfar77 kianfar77 left a comment


Great! Just a few minor comments.

A Spark DataFrame that contains:
- All columns from `genotype_df` except the `values_column`
- `effect`: The effect size estimate for the genotype
- `stderror`: The estimated standard error
Collaborator

nit: of the effect


C = covariate_df.to_numpy(np.float64, copy=True)
if fit_intercept:
intercept = np.ones((phenotype_df.shape[0], 1))
Collaborator

Why not use _add_intercept?



@typechecked
def _linear_regression_inner(genotype_df: pd.DataFrame, Y: NDArray[(Any, Any), np.float64],
Collaborator

Can you use a name other than genotype_df here as it is used for the spark dataframe before?

Q: NDArray[(Any, Any), np.float64], dof: int,
phenotype_names: pd.Series) -> pd.DataFrame:
'''
Applies a linear regression model to a block of genotypes.
Collaborator

Can you add a comment here, and perhaps in the linear_regression function, that after projecting out the covariates, the problem is performing multiple single-variable regressions in parallel? The linear algebra notation is easily confused with that of multivariate regression, where X would be a single design matrix.

Contributor Author

Good suggestion
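To illustrate the point above with a minimal sketch (hypothetical toy data): once covariates are residualized away, each genotype column is fit independently, so the per-variant effect is just a ratio of dot products, matching a column-by-column least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(20, 5))   # residualized genotypes, one column per variant
y = rng.normal(size=20)        # residualized phenotype

# Many single-variable regressions in parallel: beta_g = (x_g . y) / (x_g . x_g)
betas = (x * y[:, None]).sum(axis=0) / (x * x).sum(axis=0)

# Same answer as fitting each variant separately with lstsq.
ref = np.array([np.linalg.lstsq(x[:, [j]], y, rcond=None)[0][0]
                for j in range(5)])
print(np.allclose(betas, ref))  # True
```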

- click=7.1.1 # Docs notebook source generation
- databricks-cli=0.9.1 # Docs notebook source generation
- jinja2
- jupyter
Collaborator

Is this related to this PR?

Contributor Author

No, it's very useful for testing though

Signed-off-by: Henry D <henrydavidge@gmail.com>
Collaborator

@kianfar77 kianfar77 left a comment


Thanks

Signed-off-by: Henry D <henrydavidge@gmail.com>
@henrydavidge merged commit 9ee9c69 into projectglow:master on Nov 20, 2020