Question about `dmr` score (#93)
Hello @Ge0rges, sorry about the delay, and thanks for bringing an error in the documentation to my attention.

```rust
// control_methyls and exp_methyls are the counts of methylated bases
// in the first sample (called the control in the code) and the second sample
// (called the experiment in the code), respectively.
// Similarly, control_canonicals and exp_canonicals are the counts of
// canonical (unmodified) bases. `beta_llk` calculates the log marginal
// likelihood of the data under the posterior distribution.
let llk_control = beta_llk(control_methyls, control_canonicals);
let llk_exp = beta_llk(exp_methyls, exp_canonicals);
let llk_same = beta_llk(
    exp_methyls + control_methyls,
    exp_canonicals + control_canonicals,
);

// llk_control, llk_exp, and llk_same are the log marginal likelihoods of the data.
// The score is calculated in log-space; note that the actual code (linked below)
// just returns this value.
let score = llk_control + llk_exp - llk_same;
```

You can find this code here, and the rest of the DMR calculation code in that source file.
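For concreteness, here is a self-contained Python sketch of that score. The Beta prior used here (Jeffreys, alpha = beta = 0.5) is an assumption on my part; modkit's actual prior parameters may differ.

```python
from math import lgamma

def betaln(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_llk(methyls: int, canonicals: int,
             alpha: float = 0.5, beta: float = 0.5) -> float:
    """Log marginal likelihood of the counts under a Beta(alpha, beta) prior."""
    return betaln(methyls + alpha, canonicals + beta) - betaln(alpha, beta)

def dmr_score(control_methyls, control_canonicals,
              exp_methyls, exp_canonicals):
    llk_control = beta_llk(control_methyls, control_canonicals)
    llk_exp = beta_llk(exp_methyls, exp_canonicals)
    llk_same = beta_llk(exp_methyls + control_methyls,
                        exp_canonicals + control_canonicals)
    # A log Bayes factor: "two separate methylation levels"
    # versus "one shared level".
    return llk_control + llk_exp - llk_same
```

Under this model the score is a log Bayes factor, so with these assumed priors it goes negative whenever the pooled (shared-level) model explains the data better than two separate levels.
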
Thank you. I was also wondering if you've considered implementing an exact test, such as Fisher's, for cases where coverage is low? For reference, in case others stumble upon this, I am currently determining statistical significance by applying a chi-square test after a coverage filter, thanks to the new dmr flag. Code for which is below (Python):
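A minimal sketch of what such a filter-then-chi-square approach might look like; the coverage cutoff of 10 and the function interface are illustrative assumptions, not the poster's actual code.

```python
from scipy.stats import chi2_contingency

MIN_COVERAGE = 10  # assumed cutoff for illustration

def chi2_p_value(control_methyls, control_canonicals,
                 exp_methyls, exp_canonicals):
    """Chi-square p-value for a 2x2 methylated/canonical table,
    or None when either sample is below the coverage filter."""
    cov_control = control_methyls + control_canonicals
    cov_exp = exp_methyls + exp_canonicals
    if min(cov_control, cov_exp) < MIN_COVERAGE:
        return None  # too little data for the asymptotic test
    table = [[control_methyls, control_canonicals],
             [exp_methyls, exp_canonicals]]
    _, p, _, _ = chi2_contingency(table)
    return p
```
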
I appreciate that you're trying to find a reliable decision function, and that the scoring method is probably not being overly helpful in that sense. Yes, I considered Fisher's and chi-squared. I didn't add these into mainline because I wanted a metric that was stable to changes in region size (1000s of CpGs down to a single CpG) and coverage. In my experiments, this scoring scheme seemed to work well across a variety of use cases while also giving good resolution (e.g. not putting everything into 2 bins). Of course, you can still calculate another test statistic after parsing the output table, because all of the counts are in there. Now that the "per-base" option is available, maybe I'll reconsider adding these contingency tests with suitable warnings around coverage, etc.
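The exact-test alternative mentioned here operates on the same 2x2 methylated/canonical count table that can be assembled from the output counts; a sketch (the function name and interface are illustrative, not part of modkit):

```python
from scipy.stats import fisher_exact

def fisher_p_value(control_methyls, control_canonicals,
                   exp_methyls, exp_canonicals):
    """Two-sided Fisher's exact test on the 2x2 count table.
    Exact, so it stays valid at low coverage where the chi-square
    approximation breaks down."""
    table = [[control_methyls, control_canonicals],
             [exp_methyls, exp_canonicals]]
    _, p = fisher_exact(table)
    return p
```
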
I think the scoring is useful! I've just been trying to complement it with methods to get statistical significance. I'm curious how you've seen chi-squared perform as a function of region size and coverage. I've been thinking about that as well, and hoping the region size (in my case, considering genes) is on average the same; applying a coverage filter gives me at least some confidence there.
As always, glad to hear it.
If you're willing, I'd be interested in hearing what methods you end up using and the use case.
I'll scrounge up some examples showing the difference. It would be good to have in the docs or a blog.
I'm currently experimenting with chi-squared to determine the set of genes that are DMRed in a statistically significant way. I'm currently working on seeing how the type I error rate and test power fluctuate with a simulated binomial. At the moment it seems like low coverage is going to cause me to shift away from this and towards an exact test (my coverage is ~5, which is quite low). I'm trying to figure out how to best implement that with the modkit output, however, since it requires two groups and getting that data back out from
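A toy version of the simulation described here: draw two binomial samples from the same methylation rate, run the chi-square test, and estimate the type I error rate. The coverage, rate, and alpha values are made-up examples.

```python
import random
from scipy.stats import chi2_contingency

def type1_error_rate(coverage=5, rate=0.3, alpha=0.05,
                     n_trials=2000, seed=0):
    """Fraction of null simulations (same underlying rate) that the
    chi-square test rejects at the given alpha. Trials where the test
    is undefined (a zero row/column) are skipped but still counted in
    the denominator, so the estimate is slightly conservative."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        m1 = sum(rng.random() < rate for _ in range(coverage))
        m2 = sum(rng.random() < rate for _ in range(coverage))
        table = [[m1, coverage - m1], [m2, coverage - m2]]
        try:
            _, p, _, _ = chi2_contingency(table)
        except ValueError:  # zero expected frequencies
            continue
        if p < alpha:
            rejections += 1
    return rejections / n_trials
```

At coverage ~5 the test (with its continuity correction) is extremely conservative, which is exactly the behavior that motivates switching to an exact test.
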
I've got a couple ideas to try, let me do some experiments on my side and circle back.
What do you need exactly? You should be able to get the two groups out of the
You're right, I guess it should be sufficient to use the set of
Yes, as long as you're comparing only one methylation state at that base. For example, if your input data has only 5mC at cytosine bases and 6mA at adenine bases, then subtracting the two values will give you the difference in methylation between the samples. If your data has more than one methylation state, for example 5hmC and 5mC at cytosine bases, then, as you probably guessed, subtracting will give you the difference in methylation state for that modification. To get the difference in methylation levels in the latter case, sum the fractions before taking the difference.
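A toy illustration of that advice, with invented per-position fractions for two samples that each carry both 5mC and 5hmC calls at a cytosine:

```python
# Hypothetical modified fractions at one position (made-up numbers).
sample_a = {"5mC": 0.40, "5hmC": 0.10}
sample_b = {"5mC": 0.25, "5hmC": 0.05}

# Overall methylation level: sum the fractions first, then difference.
total_a = sum(sample_a.values())
total_b = sum(sample_b.values())
overall_diff = total_a - total_b

# Per-modification difference: subtracting a single state's fractions
# only compares that one modification, not total methylation.
per_mod_diff = sample_a["5mC"] - sample_b["5mC"]
```
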
Hey @ArtRand, I'm circling back to this and was wondering if you've found any better statistical approaches to picking out the relevant (confident) set of DMR scores? I'm tempted to stick with my previously outlined method currently. |
Hey @Ge0rges, Thanks for checking in. I haven't come up with anything I'm happy with yet, but I'm actively working on it. |
Hello @Ge0rges, The latest version of modkit (v0.2.5) has a MAP-based p-value; feel free to give it a try. It should run a lot faster too.
Hi Arthur,

I was talking to some statistics PhDs about different methods I could use on my end to filter out statistically significant data from the `dmr` scores. One of them wrote this comment that I wanted to pass by you: "In reference to the `dmr` score, why is it that you can have negative scores? Is this because you mix Bayesian MAP estimates and MLE?" Hope this is relevant (and perhaps even helpful). Thanks!