If a mutation in the bundle is unmeasured, throw away data for all variants with mutations at this site #84

Closed
Haddox opened this issue Jul 16, 2023 · 6 comments

Comments

@Haddox
Contributor

Haddox commented Jul 16, 2023

Here is my reasoning for doing this, which I've added to the SI (for context see "Reckoning mutations with respect to the reference experiment"):

If there is an unmeasured mutation in the bundle, our strategy for removing this mutation from the summation term of Eq.(1) is to completely ignore this site when analyzing the data.
This involves ignoring all variants with mutations at this site.
Although this can result in throwing away data, it is necessary for the math to work out.
For instance, in the above example where Y30A is missing from the non-reference experiment and A30Y is missing from the reference experiment, $\alpha_d$ will be used to capture the effect of A30Y when modeling the latent phenotype of the wildtype sequence from the non-reference experiment.
If a variant from the non-reference experiment has a Y30G mutation, this mutation will be included in the summation term of Eq.(1) as A30G.
However, since $\alpha_d$ is a constant offset used to compute the latent phenotype of \emph{all} variants from experiment $d$, and since $\alpha_d$ includes the effect of A30Y, Eq.(1) would effectively be adding the effects of both A30G and A30Y when computing the latent phenotype of this particular variant, which does not make sense.
Because of this problem, we choose to ignore all data at such sites.
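To make the double counting explicit, here is a rough sketch. It assumes Eq.(1) has a simple additive form, that the bundle contains only A30Y, and that any shift terms can be ignored; the symbols $\phi_d$ and $\beta_m$ are placeholders for this illustration and may not match the notation in the SI.

```latex
% Assumed additive form of Eq.(1) (illustration only):
\phi_d(v) = \alpha_d + \sum_{m \in v} \beta_m

% Wildtype of the non-reference experiment: no mutations in the sum,
% so \alpha_d has to absorb the unmeasured bundle mutation A30Y.
\phi_d(\mathrm{wt}_d) = \alpha_d \approx \beta_{\mathrm{A30Y}}

% Variant carrying Y30G, recoded relative to the reference as A30G:
\phi_d(v) = \alpha_d + \beta_{\mathrm{A30G}}
          \approx \beta_{\mathrm{A30Y}} + \beta_{\mathrm{A30G}}

% The variant has G at site 30, so relative to the reference it differs
% only by A30G. The intended value is therefore
\phi_d(v) = \beta_{\mathrm{A30G}}
```

Under these assumptions the model's value for this variant is off by $\beta_{\mathrm{A30Y}}$, which is the double counting described above.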

@jgallowa07: if you agree with this logic, I think the thing to do would be to include these sites in the list of disallowed sites that gets used to identify variants to discard. This would only apply to sites where the forward mutation is missing from the reference experiment and the reverse mutation is missing from the non-reference experiment.

@jgallowa07
Member

> since $\alpha_d$ includes the effect of A30Y, Eq.(1) would effectively be adding the effects of both A30G and A30Y when computing the latent phenotype of this particular variant, which does not make sense.

Just to clarify, you're saying that if either the forward mutation or the reversion exists, then the double counting is not a problem because:

  • If the forward mutation (A30Y) is there, then we have an effective beta for this mutation, and $\alpha_d$ need not model it (is this really the case 🤔 )
  • If the reversion is there, then we have an effective shift to model it, and again, $\alpha_d$ need not model it.

@jgallowa07
Member

jgallowa07 commented Jul 16, 2023

It seems a little hand-wavy and maybe too confident about what $\alpha_d$ is actually doing. Is there any evidence that the double counting thing is actually a problem? We should be able to investigate this with the current results somehow.

Not that I disagree with the logic, but it seems like currently, this is purely a theoretical problem and thus maybe not as high priority ...

@Haddox
Contributor Author

Haddox commented Jul 16, 2023

I agree if just the forward mutation is there it's a bit strange. And same thing for just the reverse mutation. So there may be an argument for excluding the site in those cases as well.

I think you're right that it isn't currently a problem with the spike data, since I think all forward and reverse mutations in the bundle are sampled in the libraries (excluding indels). If we made the change I suggested, I think little, if any, additional data would get thrown out and the overall results wouldn't change.

But, I think it is a theoretical problem for future users. It seems like an easy fix, and it makes it easier to describe a general strategy for dropping mutations in the bundle from the summation term when they are missing from the data.

Whether $\alpha_d$ is actually capturing what we want it to is one question. But it does have a well-defined purpose in the mathematical model -- in theory, it is the only parameter that should be able to capture the effects of missing mutations in the bundle. And we can show that the math doesn't work out in the case that I outlined above, so I think it's a clear problem that needs addressing.

For priority, I think it's probably something we'd want to update before posting the paper. But let's discuss on Monday. I know you have a lot on your plate!

@WSDeWitt
Contributor

WSDeWitt commented Jul 16, 2023

I think I understand the logic above, but to me it seems like another symptom of our approach to modeling unobserved mutations, which is to not model them. While this approach avoids assuming some additional prior information, it results in an overly brittle model. The other main symptom of this is that we've convinced ourselves—erroneously, I've argued—that we can't do out-of-sample prediction.

I'll record below how I would deal with unobserved mutations, since this is tangentially relevant to what you decide for this issue:

It is not unusual for a statistical model on categorical data to have to cope with features that end up being constant over a training set, but variable in a test set.
There are standard approaches for dealing with this kind of modeling challenge via flavors of ridge penalty (see issue #51), possibly with grouping (I think the polyclonal project from the Bloom lab did this with a site-level baseline model). These address the issue by sharing information in a few possible ways, all of which can be viewed as adding some weak prior information:

  • Global ridge penalty (issue #51, "A ridge penalty that doesn't bias toward WT"): shares info among all mutations, so that an unobserved mutation is encouraged to have the effect of a typical mutation.
  • Site-wise ridge penalty: shares info among all mutations at a site, so that an unobserved mutation at a given site is encouraged to have an effect of a typical mutation at that site. For example: site 1 is tolerant to mutations, site 2 is not.
  • AA-wise ridge penalty: shares info among all mutations from a given AA to a given AA, so that an unobserved mutation is encouraged to have an effect of a typical mutation with the same starting and ending AA states. For example: alanine-to-proline mutations are bad.
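One way to read the three penalties above is as the same centered ridge term with different groupings: each mutation effect is penalized for deviating from the mean effect of its group, so a mutation with no data is pulled toward that group mean rather than toward zero. The sketch below is only an illustration under that assumption. The function names, the `betas` dictionary layout, and the "deviation from group mean" form are all hypothetical and not taken from multidms or polyclonal.

```python
from collections import defaultdict

# Hypothetical illustration; not the multidms or polyclonal API.
# `betas` maps mutation strings such as "A30G" to effect estimates.

def _centered_penalty(betas, group_of):
    """Sum of squared deviations of each effect from its group's mean."""
    groups = defaultdict(list)
    for mut, beta in betas.items():
        groups[group_of(mut)].append(beta)
    means = {g: sum(vals) / len(vals) for g, vals in groups.items()}
    return sum((beta - means[group_of(mut)]) ** 2 for mut, beta in betas.items())

def global_ridge(betas):
    """One group: every mutation shrinks toward the overall mean effect."""
    return _centered_penalty(betas, lambda mut: None)

def sitewise_ridge(betas):
    """Group by site number: shrink toward the typical effect at that site."""
    return _centered_penalty(betas, lambda mut: int(mut[1:-1]))

def aawise_ridge(betas):
    """Group by (starting AA, ending AA): shrink toward that substitution type."""
    return _centered_penalty(betas, lambda mut: (mut[0], mut[-1]))

# Toy effects: site 1 is intolerant, site 2 is tolerant.
example_betas = {"A1G": -1.0, "A1P": -3.0, "K2E": 0.0, "K2P": 2.0}
print(global_ridge(example_betas), sitewise_ridge(example_betas))
```

In a fit, one of these terms would be scaled by a penalty weight and added to the loss; the choice of grouping is the "weak prior information" being assumed.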

@Haddox
Contributor Author

Haddox commented Jul 16, 2023

Thanks, Will. I agree that these are neat ideas.

For the purposes of getting something out the door this week, my suggestion would be to have our base model throw out the problematic variants described above, so that users can opt to avoid making assumptions if they wish. Then, in future versions of multidms, we could include your suggested approaches for estimating the effects of unobserved mutations as optional features added onto the base model.

@Haddox
Contributor Author

Haddox commented Jul 19, 2023

@jgallowa07: just pinging this thread since in my opinion it would be good to resolve this issue in version 1.0.

To sum up our above convo, I'd suggest using the draft code we already wrote to:

  • identify mutations in the bundle that are missing from the reference experiment (forward mutation), the non-reference experiment (reverse mutation), or both
  • flag the corresponding sites as invalid
  • toss variants with mutations at the invalid sites
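The three steps above can be sketched roughly as follows. This is not the draft code mentioned in the thread; the function names, the set-based inputs, and the space-delimited variant strings are assumptions made for illustration only.

```python
import re

# Hypothetical helpers and data layout; not the multidms API.
MUT_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")

def parse(mut):
    """Split a mutation string such as 'A30Y' into (wildtype, site, mutant)."""
    m = MUT_RE.match(mut)
    if m is None:
        raise ValueError(f"cannot parse mutation: {mut!r}")
    wt, site, mt = m.groups()
    return wt, int(site), mt

def invalid_sites(bundle, ref_observed, nonref_observed):
    """Sites where a bundle mutation is unmeasured in either experiment.

    bundle: forward mutations (reference -> non-reference), e.g. {'A30Y'}
    ref_observed: mutations seen in the reference experiment
    nonref_observed: mutations seen in the non-reference experiment,
        written relative to the non-reference wildtype (e.g. 'Y30A')
    """
    sites = set()
    for fwd in bundle:
        wt, site, mt = parse(fwd)
        rev = f"{mt}{site}{wt}"  # the reversion, e.g. 'Y30A'
        if fwd not in ref_observed or rev not in nonref_observed:
            sites.add(site)
    return sites

def drop_variants(variants, sites):
    """Keep variants (space-delimited mutation strings) that avoid `sites`."""
    kept = []
    for variant in variants:
        muts = variant.split()
        if all(parse(m)[1] not in sites for m in muts):
            kept.append(variant)
    return kept

# Toy usage: A30Y and Y30A are both unmeasured, K40E and E40K are measured.
bad = invalid_sites({"A30Y", "K40E"}, {"K40E", "A30G"}, {"E40K", "Y30G"})
kept = drop_variants(["Y30G", "T50S", "E40K T50S"], bad)
```

Applying the filter to both experiments with the same set of invalid sites keeps the site definition consistent across the dataset.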

This is likely to have zero or very minimal effect on the spike data, as well as other datasets from the Bloom lab. But, it helps to avoid problems in the future and provides a concrete strategy for dealing with the basic issue described above.

But let me know if you disagree.
