discrepancy with running mhcflurry-predict on not supported alleles #210

Open
liviatran opened this issue Dec 7, 2022 · 6 comments

@liviatran

I ran mhcflurry-predict --list-supported-alleles > 'supported_alleles.txt' to get a list of alleles supported by mhcflurry.

I noted that HLA-A*02:172, HLA-C*02:16, and HLA-C*12:139 were not in that list of supported alleles. All three are Well-Documented alleles (https://www.ihiw18.org/component-immunogenetics/download-common-and-well-documented-alleles-3-0).
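
A check like the one described above can be sketched in a few lines of Python. This is a minimal sketch, not part of mhcflurry: the helper name find_unsupported is hypothetical, and it assumes the saved file holds one allele name per line. It compares names verbatim, so an allele reported here as missing may still run if the tool canonicalizes it to a different name.

```python
def find_unsupported(requested, supported_lines):
    """Return the requested alleles that do not appear verbatim in the list.

    Hypothetical helper. Assumes `supported_lines` is an iterable of text
    lines with one allele name per line (blank lines are ignored).
    """
    supported = {line.strip() for line in supported_lines if line.strip()}
    return [allele for allele in requested if allele not in supported]


def find_unsupported_in_file(requested, path="supported_alleles.txt"):
    """Same check, reading the saved list from disk."""
    with open(path) as handle:
        return find_unsupported(requested, handle)
```

Calling find_unsupported_in_file(["HLA-A*02:172", "HLA-C*02:16", "HLA-C*12:139"]) against the saved list would return whichever of those names are absent from it.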

I was able to successfully run mhcflurry-predict on HLA-A*02:172 and HLA-C*02:16, despite them not being on the list of supported alleles. HLA-C*12:139 caused a ValueError.

Are the results for the unsupported alleles (HLA-A*02:172, HLA-C*02:16) reliable?

@timodonnell
Contributor

Looks like the first two alleles you mentioned are getting canonicalized by mhcgnomes to other allele names, whereas the third is genuinely unsupported.

We should look at the IMGT sequences for the first two alleles and for the names they are getting mapped to (HLA-A*02:172 -> HLA-A*02:17, HLA-C*02:16 -> HLA-C*02:137) to understand whether this canonicalization is reasonable or a bug. I will add this to my todo list, but if you have a chance to do it first, please let us know what you see. @iskandr, who wrote mhcgnomes, may also have thoughts on this.

@liviatran
Author

I'm not sure why A*02:172 is getting mapped to A*02:17, as they are different alleles with different protein sequences.

I did note from IMGT's Deleted_alleles.txt that C*02:16:01 was renamed to C*02:137. However, C*02:16 shouldn't be mapped to C012:137.

As for C*12:139, this allele was changed from C*12:139 in IMGT v3.38.0 to C*12:139Q in v3.39.0.

@timodonnell
Contributor

@iskandr do you think the following is a mhcgnomes bug:

> mhcgnomes.parse("HLA-A*02:172", use_allele_aliases=True).restrict_allele_fields(2).to_string()
HLA-A*02:17

@liviatran I am seeing that C*02:16:01 is getting canonicalized to HLA-C*02:137, which from your comment I think is the correct canonicalization, right? (I.e., I am not seeing it getting canonicalized to C012:137.)

@iskandr
Contributor

iskandr commented Dec 16, 2022

This does seem like a bug.

In [1]: mhcgnomes.parse("HLA-A*02:172", use_allele_aliases=True)
Out[2]: Allele(gene=Gene(species=Species(name='Homo sapiens', mhc_prefix='HLA'), name='A', mutations=()), allele_fields=('02', '17', '02', '01'), annotations=(), mutations=())

Checking the IMGT/HLA allele history entry, I see:

HLA03784,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*02:172,A*9272,A*9272,A*9272,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

I'm trying to figure out how this "normalization" happens and my guess is that it's via this entry:

HLA00023,A*02:17:02:01,A*02:17:02:01,A*02:17:02:01,A*02:17:02:01,A*02:17:02:01,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*02:17:02,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*021702,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172,A*02172

...which links "A*02172" with "A*02:17:02:01".

I'm going to figure out where in the logic "A*02:172" gets checked against "A*02172" and make it more cautious.
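
To illustrate the kind of collision suspected here, below is a minimal sketch. It is not the actual mhcgnomes logic, which has not been traced yet. The helper names and the tiny alias table are hypothetical, and the table holds only the two legacy names visible in the history rows quoted above.

```python
# Legacy (colon-free) names taken from the two history rows quoted above,
# mapped to the current name of the same accession.
LEGACY_ALIASES = {
    "A*02172": "A*02:17:02:01",  # HLA00023
    "A*9272": "A*02:172",        # HLA03784
}


def resolve_naive(name):
    """Strip colons before the alias lookup (the suspected failure mode)."""
    key = name.replace(":", "")
    return LEGACY_ALIASES.get(key, name)


def resolve_cautious(name):
    """Only consult legacy aliases when the input itself has no colons."""
    if ":" in name:
        return name
    return LEGACY_ALIASES.get(name, name)


print(resolve_naive("A*02:172"))     # A*02:17:02:01 (collision)
print(resolve_cautious("A*02:172"))  # A*02:172
print(resolve_cautious("A*02172"))   # A*02:17:02:01
```

Under this assumption, a colon-delimited name is never compared against compact legacy names, while genuinely compact input still resolves through the alias table.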

@liviatran
Author

liviatran commented Dec 19, 2022

@timodonnell The answer to the C*02:16 and C*02:137 canonicalization question depends on what mhcflurry is evaluating. Is it evaluating the whole protein structure, which in some cases would affect peptide binding? In that case, the two proteins are different. Is it evaluating only the peptide binding domain, which is encoded by exons 2 and 3? In that case, the protein sequences encoded by exons 2 and 3 are the same for the two alleles.

@liviatran
Author

liviatran commented Dec 19, 2022

@iskandr Perhaps mhcgnomes.parse should only look at version 3 (colon-delimited) HLA allele names to avoid these name collisions.

For non-colon-delimited (version 1 and 2) allele names, the digits have to be read in pairs. A version 1 or 2 allele name would never have three digits in the first field.

*Caveat: version 1 allele names (which are not recommended for analysis) almost always have four or five digits rather than four or six digits.
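
Reading the digits in pairs could be sketched as follows. The helper name split_compact_fields is hypothetical. Keeping a trailing single digit as its own field is an assumption based on the history row quoted earlier in this thread, where A*02172 later appears as A*021702 and then A*02:17:02.

```python
def split_compact_fields(digits):
    """Split the digit part of a colon-free allele name into pairs.

    Hypothetical helper. A trailing odd digit (five-digit names) is kept
    as a one-character final field rather than guessed at.
    """
    if not digits.isdigit():
        raise ValueError(f"expected only digits, got {digits!r}")
    return [digits[i:i + 2] for i in range(0, len(digits), 2)]


print(split_compact_fields("021702"))  # ['02', '17', '02']
print(split_compact_fields("02172"))   # ['02', '17', '2']
print(split_compact_fields("9272"))    # ['92', '72']
```

Read this way, the compact name A*02172 has first field 02 and second field 17, so it cannot be the same allele as the colon-delimited A*02:172, whose second field is 172.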
