VCF REF column outputting in bytes instead of string #69

Closed
rarsenal opened this issue Oct 20, 2021 · 3 comments

@rarsenal

Hello, we are trying to run GTCtoVCF on Illumina iScan data from the Global Diversity Array. We converted the IDAT files to GTC via iaap-cli, but we noticed that the VCF output from GTCtoVCF has a few formatting issues that hopefully you can help resolve.

  1. In the REF column, instead of a base C, we have b'C', which suggests that the Python script is writing bytes instead of strings.
  2. In the ALT column, we have not only the alternative allele but also the reference allele. Could this be related to the REF column using bytes instead of strings?
  3. In the GT field, genotypes are encoded as 1 and 2, never 0. Not sure if this is also related to the bytes format?

Example line from our output:
1 762320 JHU_1.762319,exm2268640 b'C' T,C . PASS . GT:GQ 2/2:6
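
The first two symptoms above can be checked for mechanically. The following is only a minimal sketch, not part of GTCtoVCF: the helper name `check_vcf_line` is hypothetical, and it splits on any whitespace because the example line is shown space-separated here (a real VCF is tab-delimited).

```python
import re

# Matches the text Python 3 produces when a bytes object is formatted, e.g. b'C'
BYTES_REPR = re.compile(r"^b'[^']*'$")


def check_vcf_line(line):
    """Return a list of problems found in one VCF data line (hypothetical helper)."""
    fields = line.split()  # assumption: no field contains embedded whitespace
    chrom, pos, variant_id, ref, alt = fields[:5]
    problems = []

    is_bytes_repr = bool(BYTES_REPR.match(ref))
    if is_bytes_repr:
        problems.append("REF is a bytes repr: " + ref)

    # Strip the b'...' wrapper so the allele itself can be compared against ALT
    ref_base = ref[2:-1] if is_bytes_repr else ref
    if ref_base in alt.split(","):
        problems.append("ALT contains the reference allele: " + alt)

    return problems


example = "1 762320 JHU_1.762319,exm2268640 b'C' T,C . PASS . GT:GQ 2/2:6"
for problem in check_vcf_line(example):
    print(problem)
```

Run against the example line, this reports both the bytes-style REF and the reference allele appearing in ALT.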

Are there environment variables that we should set to prevent this behavior?

For reference, we used the manifest from https://support.illumina.com/downloads/infinium-global-diversity-array-v1-product-files.html and the references were built using the provided download_reference.sh

@jjzieve
Contributor

jjzieve commented Oct 20, 2021

@rarsenal Thanks for bringing this to our attention. I will try to reproduce this. What version of python are you running?
Also, did any errors occur when building the reference genome fasta? This issue seems like it could be related to #64 but that only occurred with a custom fasta file.

@rarsenal
Author

Hi jjzieve,

Thanks for the fast reply! Actually, I just found the source of the error. I had containerized the various tools for the pipeline we are building, so the container had both Python 2 and Python 3 environments built in. After I separated the GTCtoVCF component into a standalone container with only the miniconda2 base environment, everything works as expected. I suppose GTCtoVCF was inadvertently running on Python 3 and, while it produced no runtime errors, its bytes/string handling is not compatible with Python 3? Anyhow, thanks again for your attention; you can close the issue when you see fit.
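
For anyone hitting the same symptom, here is a small illustration of the Python 2 vs. Python 3 difference described above. It is only a sketch and not GTCtoVCF's actual code; `format_allele` is a hypothetical helper. Under Python 3, formatting a bytes value yields its repr, and a bytes value never compares equal to a text string. That may also be why the reference allele was not recognized and ended up in ALT, although that part is a guess.

```python
import sys


def format_allele(allele):
    """Return an allele as text, decoding bytes when needed (hypothetical helper)."""
    # On Python 2, bytes is str, so the value is returned unchanged.
    if isinstance(allele, bytes) and not isinstance(allele, str):
        return allele.decode("ascii")
    return allele


ref = b"C"                       # e.g. a base read from a file opened in binary mode
naive = "{}".format(ref)         # Python 3: "b'C'"; Python 2: "C"
fixed = format_allele(ref)       # "C" on both versions
matches_text = (ref == "C")      # Python 3: False; Python 2: True

print(naive, fixed, matches_text)

# A simple guard that would turn the silent mis-formatting into a visible warning
if sys.version_info[0] >= 3:
    sys.stderr.write("Warning: running under Python 3; bytes values will not format as text\n")
```

Because nothing here raises an exception on Python 3, the script can run to completion while writing malformed columns, which matches the lack of runtime errors described above.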

@jjzieve
Contributor

jjzieve commented Oct 20, 2021

Glad you found the issue! In hindsight, I should've known that a bytes vs. string problem would come down to Python 2 vs. Python 3.

@jjzieve jjzieve closed this as completed Oct 20, 2021