Group file sorted differently in SMMAT vs. SMMAT.meta #46

Open
g3png opened this issue Jul 19, 2022 · 7 comments

g3png commented Jul 19, 2022

Dear Han,

We are running a large meta-analysis and have collected intermediate files from several cohorts. However, I realised that SMMAT.meta fails at the following check, specifically at multiallelic sites, even though all cohorts used the same group file.

if(any(sort(tmp.scores$idx)!=tmp.scores$idx)) {
    cat("In some", meta.files.prefix[i], "score files, the order of group and variants is not the same as in the group-sorted group.file.\n")
    stop("Error: meta files possibly not generated using this group.file!")
}

An example of where this fails (for a single cohort) is:

   group chr      pos ref alt    N missrate      altfreq      SCORE       VAR      PVAL idx           file
1:  A1BG  19 58409184   C   T 1586        0 0.0022068096  0.1737194 6.9899839 0.9476113 823 prefix.score.1
2:  A1BG  19 58409184   C   G 1586        0 0.0003152585 -0.6138912 0.9923992 0.5377377 822 prefix.score.1

In this case index 823 comes before 822 which causes the error. I am guessing this is because SMMAT did not initially order variants according to ALT alleles at multiallelic sites.
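
The failing condition can be reproduced in isolation. This is a minimal sketch that uses only the two `idx` values from the rows above. The data frame is a stand-in for `tmp.scores`, not the real object that SMMAT.meta builds.

```r
# Stand-in for the relevant columns of tmp.scores
# (values copied from the two example rows above).
tmp.scores <- data.frame(
  alt = c("T", "G"),
  idx = c(823L, 822L),
  stringsAsFactors = FALSE
)

# The check compares idx with its sorted version; any mismatch means the
# rows are not in increasing idx order.
out_of_order <- any(sort(tmp.scores$idx) != tmp.scores$idx)
print(out_of_order)  # TRUE, so the stop() branch would be reached
```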

Is there any way around this?

Edit:
I have just read about the issue here regarding SMMAT being designed for biallelics. Would love to know what you think anyway, and if there are (near) future plans to include multiallelic variants.

Thanks for your help in advance,

Grace

@hanchenphd
Owner

Hi Grace,

Thank you for your interest in SMMAT! I have not seen this issue before, but my guess is that this tri-allelic marker was ordered differently in the GDS file and the group definition file. In SMMAT (which uses the GDS file to generate meta-analysis files), the variants are sorted by variant.id. In SMMAT.meta, since we assume no access to individual GDS files, we can only sort them by chr and pos. For tri-allelic markers with the same chr and pos, the order in the GDS files may differ (it is not necessarily alphabetical).

If that was the case, the easiest solution would be to use a group definition file with variants in the same order as your GDS files. For example, if your C/G is before C/T in your group definition file, but C/T is before C/G in the GDS files, you might be able to fix the problem by switching C/G and C/T in your group definition file, without having to ask each cohort to rerun. Please let me know if it does not work.
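
As a rough sketch of that reordering, assuming the group definition file has been read into a data frame: the column names below are labels chosen for illustration, the weights are made up, and the GDS allele order is assumed to be known, for example from the row order in a cohort's score file.

```r
# Hypothetical in-memory group definition table; column names are illustrative.
group_def <- data.frame(
  group  = c("A1BG", "A1BG"),
  chr    = c(19L, 19L),
  pos    = c(58409184L, 58409184L),
  ref    = c("C", "C"),
  alt    = c("G", "T"),
  weight = c(1, 1),
  stringsAsFactors = FALSE
)

# Order of the alternate alleles in the GDS files (assumed known).
gds_alt_order <- c("T", "G")

# Reorder only the rows at this site so that they follow the GDS order.
rows <- which(group_def$chr == 19L & group_def$pos == 58409184L)
new_order <- rows[match(gds_alt_order, group_def$alt[rows])]
group_def[rows, ] <- group_def[new_order, ]
rownames(group_def) <- NULL

print(group_def$alt)  # "T" "G"
```

The edited table would then be written back out in the same format as the original group definition file and passed to SMMAT.meta.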

Best,
Han

@g3png
Author

g3png commented Jul 19, 2022 via email

@anh151

anh151 commented Jul 25, 2024

Hello.
I had the same issue when trying to combine 2 cohorts. I tried everything to get things in the right order and I feel like there is a bug in SMMAT.meta when it attempts to sort the groups. If we're assuming the order is set by the GDS, why would SMMAT sort alphabetically?

I tried alphabetical order. I tried combining the variant positions, outputting a GDS, then using that order. I tried running a fake dataset and using the outputted scores file. None of them worked. I ended up just dropping multiallelic positions.
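
One way to drop multiallelic positions is sketched below. The comment does not say at which stage the positions were dropped, so this assumes the filtering is applied to a group definition table held in a data frame. The column names and the third variant are made up for illustration.

```r
# Hypothetical group definition table; the third row is a made-up biallelic variant.
group_def <- data.frame(
  group = c("A1BG", "A1BG", "A1BG"),
  chr   = c(19L, 19L, 19L),
  pos   = c(58409184L, 58409184L, 58409200L),
  ref   = c("C", "C", "A"),
  alt   = c("G", "T", "G"),
  stringsAsFactors = FALSE
)

# A site is treated as multiallelic if the same group/chr/pos occurs more than once.
site_key <- paste(group_def$group, group_def$chr, group_def$pos, sep = ":")
is_multi <- duplicated(site_key) | duplicated(site_key, fromLast = TRUE)

# Keep only sites that appear exactly once.
biallelic_only <- group_def[!is_multi, ]
print(biallelic_only)
```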

Thanks,
Andrew

@hanchenphd
Owner

Hi Andrew,

Have you tried fixing the order in your group definition file (instead of the order in GDS) as I suggested above? If you could send me a small reproducible example, I am happy to take a look.

Thanks,
Han

@youngchanpark

Hi Han,

I haven't tried your suggestion on changing the order of variants in the group file, but wouldn't there also be a possibility that the issue may be fixed for one cohort, but the same error occurs on another cohort?

I'm still trying to wrap my head around how the analysis is performed in the SMMAT.meta function, so this might be a stupid question, but is there a fundamental reason why the variants need to be strictly ordered to perform the analysis?

I understand we need to "know" the order of variants because we're dealing with score files and covariance matrices across multiple cohorts that may have different sets of variants. But when it comes to collecting the variants from the score files and the covariance matrices for each group to run the meta-analysis, what's the reason for needing the files across different cohorts to conform to the order of variants in the group file?

Best wishes,
YC

@hanchenphd
Owner

Hi YC,

That's a very good question. SMMAT.meta does not assume access to individual GDS files, so variants in the score and covariance files need to be sorted in some way. If variants are not sorted, the only way to go back in the score file (which is a plain text file) is to close and reopen it, and it could be even more complicated if tri-allelic variants happen to be chunked in different score files. Sometimes tri-allelic variants could be sorted differently in different studies (maybe during the VCF -> GDS conversion), and this is a tricky situation to harmonize across studies.
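
To see whether two studies order a multiallelic site differently, the score-file rows can be compared directly. This is a minimal sketch with made-up rows for two cohorts. It assumes each cohort's score file has been read into a data frame with `chr`, `pos`, `ref` and `alt` columns, and the helper `alt_order` is hypothetical, not part of SMMAT.

```r
# Made-up score-file rows for two cohorts (identifying columns only).
cohort1 <- data.frame(chr = c(19L, 19L), pos = c(58409184L, 58409184L),
                      ref = c("C", "C"), alt = c("T", "G"),
                      stringsAsFactors = FALSE)
cohort2 <- data.frame(chr = c(19L, 19L), pos = c(58409184L, 58409184L),
                      ref = c("C", "C"), alt = c("G", "T"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: collapse the alternate alleles at each chr:pos into
# one string, preserving the order in which they appear in the file.
alt_order <- function(scores) {
  site <- paste(scores$chr, scores$pos, sep = ":")
  vapply(split(scores$alt, site), paste, character(1), collapse = ",")
}

o1 <- alt_order(cohort1)
o2 <- alt_order(cohort2)

# Sites present in both cohorts whose allele order differs.
shared <- intersect(names(o1), names(o2))
discordant <- shared[o1[shared] != o2[shared]]
print(discordant)  # "19:58409184"
```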

Best,
Han

@youngchanpark

Hi @hanchenphd,

I have thoroughly gone through the SMMAT and SMMAT.meta code and I think I now can confidently say this issue is fixable without needing to know the order of variants in the individual cohort GDS files.

Both the single-cohort analysis and the meta-analysis are performed per group. When running the meta-analysis, before performing the calculations, what ultimately happens is that you read the score files and covariance matrix files across all cohorts and create a large score vector (U) and covariance matrix (V). The score and covariance values for each variant are combined into the score vector (U) and covariance matrix (V), respectively. These are then used in the meta-analysis calculations.

The current code relies heavily on indexing to match variants across the group file and the per-cohort summary statistics and covariance matrices. Because of this reliance on indexing to align variants, it appears you added the check that the variants in the score file follow the order of variants in the group file.

When running SMMAT for the single-cohort analysis, the per-group analysis results (variant scores and covariance matrix) were appended to the output .score and .var file. Since they were appended, and the order of variants in the covariance matrix follows the order of variants in the corresponding score file group, we know which cell in the covariance matrix corresponds to which covariance value for the combination of variants.

If we change the code to match on variant ID instead of relying on indices, I believe we can make the analysis work without needing to have access to the individual GDS files to confirm the order of variants. I’m not yet 100% certain, but even now with the current indexing-based code, I think it’s okay to remove that check as well.
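
A minimal sketch of what matching on variant ID could look like is shown here. It is not the SMMAT.meta implementation. It assumes that each cohort's scores and covariance matrix for one group are already in memory, that a variant ID is the string `chr:pos:ref:alt`, and that the combined U and V are sums across cohorts. All numbers are made up, loosely based on the example rows earlier in this thread.

```r
# Variant order taken from the group definition file.
group_ids <- c("19:58409184:C:G", "19:58409184:C:T")

# Made-up per-cohort summaries; cohort A lists the two alleles in the opposite order.
cohortA <- list(
  id    = c("19:58409184:C:T", "19:58409184:C:G"),
  score = c(0.17, -0.61),
  cov   = matrix(c(6.99, 0.10,
                   0.10, 0.99), nrow = 2, byrow = TRUE)
)
cohortB <- list(
  id    = c("19:58409184:C:G", "19:58409184:C:T"),
  score = c(0.30, 0.05),
  cov   = matrix(c(1.20, 0.20,
                   0.20, 5.00), nrow = 2, byrow = TRUE)
)

# Combined score vector and covariance matrix, indexed by the group-file order.
U <- setNames(numeric(length(group_ids)), group_ids)
V <- matrix(0, length(group_ids), length(group_ids),
            dimnames = list(group_ids, group_ids))

for (cohort in list(cohortA, cohortB)) {
  # Position of each cohort variant in the group-file order (NA if not in the group).
  pos  <- match(cohort$id, group_ids)
  keep <- !is.na(pos)
  p    <- pos[keep]
  U[p]    <- U[p] + cohort$score[keep]
  V[p, p] <- V[p, p] + cohort$cov[keep, keep, drop = FALSE]
}

print(U)
print(V)
```

Because every lookup goes through `match()` on the ID, the row order inside each cohort's files no longer matters, as long as each covariance matrix follows the order of its own score rows.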

I hope my explanation made sense 😅. I wanted to share my thoughts to confirm with you whether I’m correct.

I am going to work on adapting the code to match on variant ID instead of indexing so I can run my meta-analysis, because I cannot ask our collaborators to rerun all of their analyses 😅. I'll let you know how this goes.

Best wishes,
YC
