Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug for nonPAR chrX: heterozygous haploids in the phased output. #100

Open
dtaliun opened this issue Apr 23, 2024 · 2 comments
Open

Bug for nonPAR chrX: heterozygous haploids in the phased output. #100

dtaliun opened this issue Apr 23, 2024 · 2 comments

Comments

@dtaliun
Copy link

dtaliun commented Apr 23, 2024

Hi,

When phasing non-PAR X, if present in the input VCF/BCF, the heterozygous haploids are not correctly set to missing.
Here is the problematic code:

for (uint32_t v = 0 ; v < n_variants ; v++) {
	if (VAR_GET_AMB(MOD2(v), Variants[DIV2(v)])) {
			VAR_SET_MIS(MOD2(v), Variants[DIV2(v)]);
			nreset ++;
		}
	}

The first 2 bits are 10 for het, and it is picked up by VAR_GET_AMB. Next, VAR_SET_MIS sets the last two bits to 11. However, code 11 is not recognized as the code for missing genotype by the VAR_GET_MIS (in fact, it is recognized by VAR_GET_SCA only). The problem is that only VAR_GET_MIS is used later, for example, to check if the missing values should be imputed. As a result, the heterozygous haploids are being propagated to the final phased output.

One option to resolve this is to clear the first 2 bits before setting them to code 01 (missing) e.g.:

#define VAR_CLEAR_FLAG(e, v)    ((v)&=(~(3<<((e)<<2))))

for (uint32_t v = 0 ; v < n_variants ; v++) {
	if (VAR_GET_AMB(MOD2(v), Variants[DIV2(v)])) {
	                VAR_CLEAR_FLAG(MOD2(v), Variants[DIV2(v)]);
			VAR_SET_MIS(MOD2(v), Variants[DIV2(v)]);
			nreset ++;
		}
	}

Note for the users who read this issue: If you are sure that you don't have haploids (i.e. males) on non-PAR chrX with het values, then everything should be fine. Use bcftools +fixploidy and bcftools +check-ploidy plugins to check your data.
If your haploids in your input VCF/BCF are coded according to the spec with single GT value (i.e. 0 instead of 0/0 and 1 instead of 1/1), the see issue #99 .

@dtaliun
Copy link
Author

dtaliun commented Apr 24, 2024

Following up on this issue. Looking at your scripts/documentation, I can't find any information if a check for the presence of het genotypes in non-PAR regions in males was done before phasing UK Biobank data. I.e. do you think this issue may have had an effect in the UK Biobank phasing of chrX nonPAR? Thanks!

@LindoNkambule
Copy link

To add, I noticed the same behaviour when the input has missing genotypes. There were males in the non-PAR who had missing genotypes but were heterozygous in the phased output.

See attached image: males in the phased output are heterozygous, and have missing genotypes in the input I passed to SHAPEIT5.

Version: 5.1.1 / commit = 990ed0d / release = 2023-05-08 was used
SHAPEIT5_non_par_hets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants