ENH Generic design support using formulaic #328
…d compatibiliy, but throw deprecation warning
…ey are no longer used
…ds without its unpicklable attributes
grst
left a comment
I really prefer this over the old approach, many thanks for moving this forward @BorisMuzellec!
    @dataclass
    class FactorMetadata:
There are also quite a few test cases for the _formulaic.py file and the LinearModelBase in pertpy. It would be great if you could port them here as well!
    @property
    def variables(self):
        """Get the names of the variables used in the model definition."""
        try:
            return self.obsm["design_matrix"].model_spec.variables_by_source["data"]
        except AttributeError:
            raise ValueError(
                """Retrieving variables is only possible if the model was initialized
                using a formula."""
            ) from None
Maybe this stuff could really become a Mixin as you suggested, then we can more easily reuse it across pyDESeq2 and pertpy.
If it's simpler for pertpy, then yes, I can put this in a Mixin.
There's a difference though, because here the design is stored in .obsm["design_matrix"] as opposed to .design in pertpy.
I'm not entirely sure yet what the best solution is. Maybe we just leave it as is for now, and when I look into refactoring pertpy I can propose a PR with changes to pyDESeq2 if required.
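For reference, the Mixin idea discussed above could look roughly like the sketch below. Everything here is hypothetical: `VariablesMixin`, `_get_design`, `ObsmBacked` and `AttributeBacked` are illustrative names, not pyDESeq2 or pertpy code, and the fake design object only mimics the `model_spec.variables_by_source` attribute used in the snippet under review. The point is only that the host class can decide where the design lives (`.obsm["design_matrix"]` vs. `.design`) while the shared logic stays in one place.

```python
from types import SimpleNamespace


class VariablesMixin:
    """Hypothetical mixin; host classes only say where their design lives."""

    def _get_design(self):
        raise NotImplementedError

    @property
    def variables(self):
        try:
            return self._get_design().model_spec.variables_by_source["data"]
        except AttributeError:
            raise ValueError(
                "Retrieving variables is only possible if the model was "
                "initialized using a formula."
            ) from None


class ObsmBacked(VariablesMixin):
    """Mimics the pyDESeq2 layout: design stored in .obsm["design_matrix"]."""

    def __init__(self, design):
        self.obsm = {"design_matrix": design}

    def _get_design(self):
        return self.obsm["design_matrix"]


class AttributeBacked(VariablesMixin):
    """Mimics the pertpy layout: design stored in .design."""

    def __init__(self, design):
        self.design = design

    def _get_design(self):
        return self.design


# Fake design object exposing only the attributes the property touches.
fake_design = SimpleNamespace(
    model_spec=SimpleNamespace(variables_by_source={"data": ["condition", "group"]})
)

print(ObsmBacked(fake_design).variables)       # ['condition', 'group']
print(AttributeBacked(fake_design).variables)  # ['condition', 'group']
```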
    def cond(self, **kwargs):
        """
        Get a contrast vector representing a specific condition.

        Parameters
        ----------
        **kwargs
            Column/value pairs.

        Returns
        -------
        ndarray
            A contrast vector that aligns to the columns of the design matrix.
        """
        cond_dict = kwargs
        if not set(cond_dict.keys()).issubset(self.dds.variables):
            raise ValueError(
                """You specified a variable that is not part of the model. Available
                variables: """
                + ",".join(self.dds.variables)
            )
        for var in self.dds.variables:
            if var in cond_dict:
                self.dds._check_category(var, cond_dict[var])
            else:
                cond_dict[var] = self.dds._get_default_value(var)
        df = pd.DataFrame([kwargs])
        return self.dds.obsm["design_matrix"].model_spec.get_model_matrix(df).iloc[0]
It's up to you whether you want to adopt the .cond() syntax for building contrasts (which was originally devised by @const-ae in glmGamPoi).
A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.
In case you were just to support [column, baseline, treatment] and numpy array contrasts, you could probably come up with a way simpler solution.
I don't have a strong opinion on this, whatever offers the most flexibility is best.
At first I tried simplifying the code in _formulaic.py to keep only what I need (mainly retrieving levels for a given factor + whether it has numerical or categorical type), but I ended up keeping everything because I didn't find a straightforward simplification.
> A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.
How would you define a contrast to test interaction terms using .cond()? Right now I don't see how to do it without using a numerical factor.
Let's consider a design ~ disease * timepoint
| disease | timepoint |
|---|---|
| healthy | T0 |
| healthy | T1 |
| diseased | T0 |
| diseased | T1 |
which gives us the following coefficients:
Intercept, diseased, T1, T1:diseased
Then you could test:
diseased vs. healthy
    contrast = dds.cond(disease="diseased") - dds.cond(disease="healthy")
T1 vs. T0
    contrast = dds.cond(timepoint="T1") - dds.cond(timepoint="T0")
Interaction T1:diseased
    contrast = (
        (dds.cond(timepoint="T1", disease="diseased") - dds.cond(timepoint="T0", disease="diseased")) -
        (dds.cond(timepoint="T1", disease="healthy") - dds.cond(timepoint="T0", disease="healthy"))
    )
I admit there's no suitable documentation for this in pertpy. But in principle, using this "DSL", it should be possible to specify arbitrary contrasts.
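To see why these differences isolate single coefficients, here is a minimal sketch that hand-codes the treatment-coded design row for `~ disease * timepoint`. The `cond` function below is a hypothetical stand-in for `dds.cond()`, not the actual implementation; it assumes `healthy` and `T0` are the baseline levels and that the coefficients are ordered as listed above.

```python
# Hypothetical stand-in for dds.cond(): returns the design-matrix row for one
# condition under treatment coding, with healthy / T0 as assumed baselines.
# Column order: Intercept, diseased, T1, T1:diseased
def cond(disease="healthy", timepoint="T0"):
    d = 1 if disease == "diseased" else 0
    t = 1 if timepoint == "T1" else 0
    return [1, d, t, d * t]


def minus(a, b):
    """Element-wise difference of two equally long vectors."""
    return [x - y for x, y in zip(a, b)]


disease_effect = minus(cond(disease="diseased"), cond(disease="healthy"))
time_effect = minus(cond(timepoint="T1"), cond(timepoint="T0"))
interaction = minus(
    minus(
        cond(timepoint="T1", disease="diseased"),
        cond(timepoint="T0", disease="diseased"),
    ),
    minus(
        cond(timepoint="T1", disease="healthy"),
        cond(timepoint="T0", disease="healthy"),
    ),
)

print(disease_effect)  # [0, 1, 0, 0]
print(time_effect)     # [0, 0, 1, 0]
print(interaction)     # [0, 0, 0, 1]
```

Each contrast ends up as a one-hot vector on the coefficient it is meant to test, because unspecified variables fall back to their baseline level and the intercept cancels in every difference.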
…cal constrasts have incorrect shapes
    # Also check continuous factors
    if self.continuous_factors is not None:
        self.continuous_factors = replace_underscores(self.continuous_factors)
    assert isinstance(self.design, (str, pd.DataFrame)) or isinstance(
nitpick: if this is meant as a user-facing error message, it should probably be a ValueError instead of an assertion
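One reason to prefer an explicit exception here: `assert` statements are stripped when Python runs with `-O`, so a user-facing check written as an assertion can silently disappear. A minimal sketch of the alternative follows; `check_design` and its `allowed_types` parameter are hypothetical names, and the sketch only accepts `str` by default to stay dependency-free (the check under review also accepts a `pd.DataFrame`).

```python
def check_design(design, allowed_types=(str,)):
    """Hypothetical validator: raise instead of assert for user-facing checks."""
    if not isinstance(design, allowed_types):
        names = ", ".join(t.__name__ for t in allowed_types)
        raise ValueError(
            f"design must be one of ({names}), got {type(design).__name__}."
        )
    return design


check_design("~condition")  # passes and returns the formula unchanged

try:
    check_design(42)
except ValueError as err:
    print(err)  # design must be one of (str), got int.
```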
umarteauowkin
left a comment
Thanks for this great PR, I'm convinced :) One thing that is not clear to me is what should be passed to LFC shrinkage: is it really a column of the design? To me it should be LFC @ contrast, but maybe this is not relevant for this PR; I just spotted it since you made the modification.
Finally, could you add a comment on what the Materializer is supposed to do? (i.e., just one line of comment on what a materializer is :))
Thanks again!
    # method.

    stat_res.lfc_shrink(coeff="condition_B_vs_A")
    ds.lfc_shrink(coeff="condition[T.B]")
Maybe add a comment to explain what this means?
…n with invalid type
Yes: LFC shrinkage performs MAP estimation with a prior on a given LFC coefficient (i.e. column) that the user must specify. In principle, I guess it would be possible to do the same thing with contrast @ LFC, I'm just not sure what it would mean. (Also, would the prior be amenable to linear combination?)
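To make the distinction concrete, the sketch below contrasts picking a single coefficient column with applying a general contrast vector to a toy LFC table. The gene names, values and the `apply_contrast` helper are made up for illustration; the sketch says nothing about how a shrinkage prior would behave under such a combination, which is the open question above.

```python
# Toy LFC table: one row per gene, one column per model coefficient.
# Values and names are made up for illustration.
columns = ["Intercept", "condition[T.B]", "group[T.Y]"]
lfc = {
    "gene1": [5.0, 1.5, -0.5],
    "gene2": [3.0, -2.0, 0.25],
}


def apply_contrast(lfc_rows, contrast):
    """Per-gene dot product of the coefficient row with the contrast vector."""
    return {
        gene: sum(c * v for c, v in zip(contrast, row))
        for gene, row in lfc_rows.items()
    }


# A one-hot contrast simply selects one coefficient column ...
single = apply_contrast(lfc, [0, 1, 0])
# ... while a general contrast mixes several coefficients.
combined = apply_contrast(lfc, [0, 1, -1])

print(single)    # {'gene1': 1.5, 'gene2': -2.0}
print(combined)  # {'gene1': 2.0, 'gene2': -2.25}
```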
Thanks for your reviews @grst @umarteauowkin! I'm merging this :)
Reference Issue or PRs
Closes #181
Closes #213
Closes #309
Closes #272
Closes #202
Closes #184
Closes #125
Will unblock scverse/pertpy#610
What does your PR implement? Be specific.
This PR implements support for general design matrices thanks to formulaic, and using utils from pertpy.
DeseqDataSet
- Designs are passed to DeseqDataSet using the design argument, either in the form of a string representing a formulaic formula (e.g. "~condition + treatment", "~condition + condition:treatment", "~condition + exp(cofactor)"...), or an ndarray directly corresponding to a design matrix.
- design_factors is still supported but throws a DeprecationWarning
- continuous_factors is deprecated, as continuous type inference is handled by formulaic
- ref_level is deprecated
- DeseqDataSet is no longer picklable. A to_picklable_anndata() method was added to allow users to pickle results for later use.
DeseqStats
- Contrasts are given either as a list (["treatment", "test", "control"]), or directly in the form of a contrast vector (a numpy array).
- lfc_shrink no longer supports a default coef argument
BREAKING CHANGE: python 3.9 is no longer supported.
TODO:
- Failure seems to be due to the fact that it's not possible to pickle classes with decorated functions. Solved using to_picklable_anndata.
- _formulaic.py tests from pertpy
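The general pattern behind the pickling fix can be sketched with a toy class. This is an analogy under stated assumptions, not the pyDESeq2 code: `Dataset` and `to_picklable` are hypothetical names, and a lambda stands in for whatever attribute made the real object unpicklable.

```python
import pickle


class Dataset:
    """Toy stand-in for an object that carries an unpicklable attribute."""

    def __init__(self, values):
        self.values = values
        # Lambdas cannot be pickled; this mimics an unpicklable attribute.
        self.transform = lambda x: x + 1

    def to_picklable(self):
        """Return a plain dict copy without the unpicklable attribute."""
        return {k: v for k, v in vars(self).items() if k != "transform"}


ds = Dataset([1, 2, 3])

try:
    pickle.dumps(ds)
    pickled_directly = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickled_directly = False

restored = pickle.loads(pickle.dumps(ds.to_picklable()))
print(pickled_directly)  # False
print(restored)          # {'values': [1, 2, 3]}
```

Exporting a stripped-down copy keeps the main object free to hold whatever it needs at runtime, at the cost of the restored object being a plain container rather than a full instance.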