Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"NA"s in dmltest output #20

Open
milaeri opened this issue Jan 14, 2022 · 11 comments
Open

"NA"s in dmltest output #20

milaeri opened this issue Jan 14, 2022 · 11 comments

Comments

@milaeri
Copy link

milaeri commented Jan 14, 2022

Hi,
I have a question about result output:

I recently ran a model with three treatments using the DML.fitmultiFactor function, and then ran a DMLtest.multifactor to test for the effect of each treatment. Long story short, I'm noticing that the output tables from each test contains NAs under stat. p-val, and fdr for particular sets of chrs (not all of them). I was wondering if this just means that the model was not able to detect any difference for these particular loci/positions? I would just like some confirmation/explanation (in case I'm doing something wrong) so I can move on to the next stage in the analysis, your help much appreciated thanks!

chr pos stat pvals fdrs 1 NW_019289415.1 69 NA NA NA 2 NW_019289415.1 134 NA NA NA 3 NW_019289415.1 154 NA NA NA 4 NW_019289415.1 178 NA NA NA 5 NW_019289415.1 186 NA NA NA 6 NW_019289415.1 244 NA NA NA

@haowulab
Copy link
Owner

I never encountered this. Are there only a few rows with NA or all results are NA? It might be that some sites have missing data, or the variance is 0.

@milaeri
Copy link
Author

milaeri commented Jan 14, 2022 via email

@haowulab
Copy link
Owner

I’m not sure. Can you check if there are missing data? It's possible that some sites has missing data for some experimental factors, so that the regression can't run. DSS doesn't filter data. It'll keep everything and report a result whenever possible.

@milaeri
Copy link
Author

milaeri commented Jan 17, 2022

Hi Hao,

I don't see any missing cases in my data, but I do notice that not all sites are equally represented for each sample (see below). Do you think that could be the issue?
Just to give some background, I'm working with RRBS data from an non-model insect (genome not highly methylated as in mammals) thus there's a lot of zeros in my data (see below). I'm also working with scaffolds instead of chr since the genome is still under works for this species.

List of 24 samples
$ :'data.frame': 3504291 obs. of 4 variables:
..$ chr: chr [1:3504291] "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" ...
..$ pos: int [1:3504291] 69 134 154 178 186 244 265 380 729 734 ...
..$ X : int [1:3504291] 0 0 0 0 0 0 0 0 0 0 ...
..$ N : int [1:3504291] 1 1 1 1 1 1 1 1 1 1 ...
$ :'data.frame': 3174646 obs. of 4 variables:
..$ chr: chr [1:3174646] "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" ...
..$ pos: int [1:3174646] 812 828 883 884 892 893 925 926 934 935 ...
..$ X : int [1:3174646] 0 0 0 0 0 0 0 0 0 0 ...
..$ N : int [1:3174646] 4 4 2 12 2 12 2 12 2 12 ...
$ :'data.frame': 2854266 obs. of 4 variables:
..$ chr: chr [1:2854266] "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" "NW_019289415.1" ...
..$ pos: int [1:2854266] 735 811 812 827 828 883 884 892 893 925 ...
..$ X : int [1:2854266] 0 0 0 0 0 0 0 0 0 0 ...
..$ N : int [1:2854266] 1 1 11 1 11 2 25 2 25 1 ...`

@haowulab
Copy link
Owner

"not all sites are equally represented for each sample" means there are missing data. DSS combines all data for all sites. If some samples don't have coverage for some sites, they are deemed as missing. However, based on your data, there shouldn't be 60% sites with missing data. Can you send me a small portion of data, such as chr21, so that I can try it?

@milaeri
Copy link
Author

milaeri commented Jan 18, 2022

Hi Hao,
Thanks for clarifying. Sure, I can send you a subset of the data. What is your email address?

@haowulab
Copy link
Owner

It's better to put the data and your analysis script on a cloud drive such as dropbox, so that I can download. I don't think it'll fit in an email.

@milaeri
Copy link
Author

milaeri commented Jan 18, 2022 via email

@milaeri
Copy link
Author

milaeri commented Jan 18, 2022 via email

@haowulab
Copy link
Owner

I looked at your data. It is indeed caused by missing data. For the data you sent to me, there are 24 samples, each ha 3000-4000 CG sites. makeBSseqData combines all data and return an object with 10161 methylation loci (which is a union of all CG sites in the inputs). A lot of them will only have data from a few samples, thus a regression cannot be performed. The function returned non-NA for 4197 sites, which are the ones that a model can be fit. This is a reasonably result. You can just ignore and filter out the sites with NA values.

@milaeri
Copy link
Author

milaeri commented Jan 20, 2022

Hi Hao,

Thank you for the help! This makes a lot of sense now. I'm glad I can move on with the analysis.

Best,
E

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants