Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can maf2vcf handle hg38 maf as input ? #162

Closed
ShixiangWang opened this issue Mar 28, 2018 · 9 comments
Closed

Can maf2vcf handle hg38 maf as input ? #162

ShixiangWang opened this issue Mar 28, 2018 · 9 comments
Assignees

Comments

@ShixiangWang
Copy link

ShixiangWang commented Mar 28, 2018

Hi,

I want to transform hg38 maf to vcf,

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs
ERROR: Provided Reference FASTA is missing or empty! Path: ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

I know how to download the file, but the info remind me using GRCh37 while maf file I use based on hg38. Can I use this command, or I should use hg19 maf file?

Best wishes,
Shixiang

@ShixiangWang
Copy link
Author

ShixiangWang commented Mar 28, 2018

Same errors no matter what genome build I use:

# use hg38 maf while ensembl v75 fa.gz
$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

# use hg38 maf while ensembl v91 GRCh38 fa.gz
$../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz


# use hg19 maf while ensembl v75 fa.gz
$../biotools/vcf2maf/maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

hg19 maf download from https://portal.gdc.cancer.gov/legacy-archive/files/b2e25bdf-f2b5-4a37-b330-05251ea09f2c

hg38 maf download from https://portal.gdc.cancer.gov/files/0458c57f-316c-4a7c-9294-ccd11c97c2f9

Please help me.

@ShixiangWang
Copy link
Author

I try using UCSC hg19 and hg38, and the outcome is

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg19.fa
[W::fai_fetch] Reference 19:58420782-58420784 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 19:58420782-58420784
[W::fai_fetch] Reference 10:26581394-26581396 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 10:26581394-26581396
[W::fai_fetch] Reference 14:20528636-20528638 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 14:20528636-20528638
[W::fai_fetch] Reference 11:55657434-55657436 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 11:55657434-55657436
[W::fai_fetch] Reference 1:78269058-78269060 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 1:78269058-78269060
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/Annotation/reference/hg19.fa

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg38.fa

$ cd test_vcfs/

$ ls -lh
total 1.8G
-rw-rw-r-- 1 diviner-wsx diviner-wsx  33K 3月  28 18:14 TCGA.LUAD.mutect.hg38.somatic.pairs.tsv
-rw-rw-r-- 1 diviner-wsx diviner-wsx 1.8G 3月  28 18:17 TCGA.LUAD.mutect.hg38.somatic.vcf

Thank your tool, the hg38 works.

So the hg19 geome build TCGA used is neither equals to ensembl v75 or UCSC hg19?

@ShixiangWang
Copy link
Author

ShixiangWang commented Mar 28, 2018

But the conent of vcf is very weird. 😢 How to explain this???

image

Left is outcome and the right is normal vcf file.

I open it using notepad++,

image

It seems normal, I am not sure. What the right part means?

@ShixiangWang ShixiangWang reopened this Mar 28, 2018
@ckandoth
Copy link
Collaborator

@ShixiangWang there are many different files and commands you're using here. There is only one rule for maf2vcf - the --ref-fasta chromosome names and lengths must exactly match the chromosomes used in the --input-maf.

@ckandoth ckandoth self-assigned this Mar 28, 2018
@ShixiangWang
Copy link
Author

ShixiangWang commented Mar 28, 2018 via email

@ckandoth
Copy link
Collaborator

It is most likely that your hg19.fa uses chr-prefixes for the chromosome names, while the MAF file does not. For example, your hg19.fa uses chr7 for chromosome 7, while the MAF uses just 7. Read more about this issue here - https://www.biostars.org/p/119295/#119308

@ShixiangWang
Copy link
Author

@ckandoth Yes, the hg19.fa is different from GRCh37.fa

$ head Annotation/reference/hg19.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

$ zcat ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz | head
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

While Neither Ensembl v75 or hg19.fa from UCSC can work with this hg19 maf file download from TCGA https://portal.gdc.cancer.gov/legacy-archive/files/b2e25bdf-f2b5-4a37-b330-05251ea09f2c.

$.maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

$ maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg19.fa
[W::fai_fetch] Reference 19:58420782-58420784 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 19:58420782-58420784
[W::fai_fetch] Reference 10:26581394-26581396 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 10:26581394-26581396
[W::fai_fetch] Reference 14:20528636-20528638 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 14:20528636-20528638
[W::fai_fetch] Reference 11:55657434-55657436 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 11:55657434-55657436
[W::fai_fetch] Reference 1:78269058-78269060 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 1:78269058-78269060
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/Annotation/reference/hg19.fa

Should I use GRCh 37.p13
from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p13/ instead of Ensembl v75? I am haunted by the three different source and I don't know how to pick a correct assembly.fa to work with TCGA hg19 maf file. I don't want to bother you more but can you give me a suggestion?

Thank you very much,
Shixiang

@ckandoth
Copy link
Collaborator

The error [E::fai_build3] Cannot index files compressed with gzip, please use bgzip is easy to fix. It is usually due to having an older version of Perl. If it is not easy for you to upgrade your Perl, then unzip the GRCh37 fasta using gzip -d. VEP should work fine with unzipped FASTAs.

@ShixiangWang
Copy link
Author

@ckandoth Thanks for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants