Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when running variant calling: "Expected three tokens in header line, got 2" #67

Closed
kvg opened this issue May 28, 2021 · 15 comments
Closed

Comments

@kvg
Copy link

kvg commented May 28, 2021

Hello,
I'm running into a strange problem when running PEPPER-Margin-DeepVariant on ONT data for a human whole-genome sample covered to ~30x. About an hour into the run, I get an error message stating, "Expected three tokens in header line, got 2".

The log file (excerpted) is as follows:

run_pepper_margin_deepvariant call_variant -b /cromwell_root/fc-31acdd56-adb8-4708-95df-9b3932a0c162/e9b2e7eb-e792-4340-a40a-cb80c89eb4de/ONTWholeGenome/ffae1946-1f1a-4e8a-8b76-451afd50c64c/call-CallVariants/CallVariants/39efbaff-86c0-432e-937d-67f263b72875/call-SubsetBam/shard-0/cacheCopy/subset.bam -f /cromwell_root/broad-dsde-methods-long-reads/resources/references/grch38_noalt/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa -o ./ -p subset.deepvariant_pepper -r chr1 -t 12 -s HG00514 --gvcf --phased_output --ont
[05-27-2021 09:31:48] INFO: VARIANT CALLING MODULE SELECTED
[05-27-2021 09:31:48] INFO: [1/9] RUNNING THE FOLLOWING COMMAND

...

> Polishing 98% complete (2445/2490). Estimated time remaining: 20s
> Polishing 99% complete (2471/2490). Estimated time remaining: 10s
> Starting merge
Expected three tokens in header line, got 2

Expected three tokens in header line, got 2

real 18m40.959s
user 206m40.101s
sys 1m2.565s
mv: cannot stat './/*.bam': No such file or directory
[E::hts_open_format] Failed to open file ".//MARGIN_PHASED.PEPPER_SNP_MARGIN.haplotagged.bam" : No such file or directory
samtools index: failed to open ".//MARGIN_PHASED.PEPPER_SNP_MARGIN.haplotagged.bam": No such file or directory
[05-27-2021 10:43:34] ERROR: None]
Traceback (most recent call last):
File "/usr/local/bin/run_pepper_margin_deepvariant", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/run_pepper_margin_deepvariant/run_pepper_margin_deepvariant.py", line 732, in main
run_variant_calling(flags)
File "/usr/local/lib/python3.6/dist-packages/run_pepper_margin_deepvariant/run_pepper_margin_deepvariant.py", line 376, in run_variant_calling
subprocess.check_call(command, shell=True, executable='/bin/bash')
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time margin phase /cromwell_root/fc-31acdd56-adb8-4708-95df-9b3932a0c162/e9b2e7eb-e792-4340-a40a-cb80c89eb4de/ONTWholeGenome/ffae1946-1f1a-4e8a-8b76-451afd50c64c/call-CallVariants/CallVariants/39efbaff-86c0-432e-937d-67f263b72875/call-SubsetBam/shard-0/cacheCopy/subset.bam /cromwell_root/broad-dsde-methods-long-reads/resources/references/grch38_noalt/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa .//PEPPER_SNP_OUPUT.vcf.gz /opt/margin_dir/params/misc/allParams.ont_haplotag.json -t 12 -r chr1 -V -o .//MARGIN_PHASED.PEPPER_SNP_MARGIN 2>&1 | tee .//logs/2_margin_haplotag.log;
mv .//*.bam .//MARGIN_PHASED.PEPPER_SNP_MARGIN.haplotagged.bam;
samtools index -@12 .//MARGIN_PHASED.PEPPER_SNP_MARGIN.haplotagged.bam' returned non-zero exit status 1.

I'm not sure which file it's saying is effectively malformed - is this an error on my end or is this perhaps a bug in PEPPER-Margin-DeepVariant?

Thanks,
-Kiran

@tpesout
Copy link

tpesout commented May 28, 2021

Hi Kiran, can you attach more of the log file before the error? Specifically beginning from the invocation of margin. This is an issue in margin's merge function which I've seen when the file margin uses to track read ids per chunk is corrupted, but that shouldn't happen during execution in this workflow. Also if you can send the command you're using to run this and the specific version of the docker container, this would help us debug.

@kvg kvg closed this as completed May 30, 2021
@kvg kvg reopened this May 30, 2021
@kvg
Copy link
Author

kvg commented May 30, 2021

Sure thing. I've attached the full log below.

I'm using the Docker image "kishwars/pepper_deepvariant:r0.4.1". The specific command I'm running is:

$ run_pepper_margin_deepvariant call_variant \
    -b fixed.bam \
    -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
    -o ./ \
    -p subset.deepvariant_pepper \
    -r chr22 \
    -t 12 \
    -s HG00514 \
    --gvcf \
    --phased_output \
    --ont

Also if it helps, I've put the chr22 BAM file (~5 GB) in Dropbox .

PEPPER.chr22.log

@tpesout
Copy link

tpesout commented Jun 1, 2021

Thanks Kiran, I don't see anything out of place in the log file. I'm downloading the bam now, can you send PEPPER_SNP_OUPUT.vcf.gz so I can debug this?

@tpesout
Copy link

tpesout commented Jun 7, 2021

Hi @kvg , just checking in to see if you have resolved this issue and whether you can supply the VCF

@kvg
Copy link
Author

kvg commented Jun 7, 2021 via email

@kvg
Copy link
Author

kvg commented Jun 9, 2021

Hi @tpesout - I've now generated the PEPPER_SNP_OUPUT.vcf.gz for the same sample and chromosome and attached it here. Thanks!

PEPPER_SNP_OUPUT.vcf.gz

@tpesout
Copy link

tpesout commented Jun 9, 2021

Kiran,

It looks like the issue is arising because your BAM has duplicate reads:

$ samtools view -F 0x900 -q 10 PMDV_GITHUB_ISSUE_67.bam | cut -f1 | uniq -c | awk '$1 > 1' | head
      4 c8372a67-46d0-4350-8e3c-59ee60560008
      4 bb1d21d6-529b-46c4-9887-ded1610e20b9
      4 d4338656-30d6-48c6-a4fa-a9c19a2f25ff
      4 dc190758-e1dc-454f-988a-8849ed3b89ff
      4 4a802c7b-f00a-428e-9d07-c81f2b09a800
      4 5d719e9d-22c9-4fcd-a629-77c1a5692b99
      4 72d52980-9b99-4906-b0ed-5d65bd4c2308
      4 68b8c95f-fb44-4fd3-ba39-01420ab3d183
      4 c9e6f9a1-d73e-43c9-b152-404322b78159
      4 1ba8267d-4992-410c-9fde-fa6ff1444a82

I'll need some time to discuss if or how we should handle this in the pipeline, but for now removing duplicates from your BAM should fix this issue.

@tpesout
Copy link

tpesout commented Jun 10, 2021

@kvg As this is an issue with bad data, we're not going to make changes to the pipeline to support this case. I'm going to close this issue, but please feel free to reopen or reply if you have further issues.

@kvg
Copy link
Author

kvg commented Jun 10, 2021

Sounds great - thanks very much for your help in diagnosing the issue! Now to investigate why my data has duplicate reads in the first place.

@tpesout
Copy link

tpesout commented Jun 15, 2021

@kvg if it's helpful, I found that for the two that I checked there were three alignments which were exactly the same, and one which had something different (as far as "uniq" thinks).

@xsvato01
Copy link

Hi I am having the same issue. I am working with data coming from cancer chromothripsis (ie highly broken up chromosomes with lots of structural variants), hence supplementary alignments (and consequently multiple reads with the same QNAME in the bam file) are expected. If I filtered out all these reads I would loose a substantial part of the information.

@Modernism-01
Copy link

Modernism-01 commented Jan 2, 2024

Hi I am having the same issue. I am working with data coming from cancer chromothripsis (ie highly broken up chromosomes with lots of structural variants), hence supplementary alignments (and consequently multiple reads with the same QNAME in the bam file) are expected. If I filtered out all these reads I would loose a substantial part of the information.

Have you solved it? I am facing the same issues, but I do not know if I need to remove the duplicate.

@xsvato01
Copy link

xsvato01 commented Jan 8, 2024

Hi I am having the same issue. I am working with data coming from cancer chromothripsis (ie highly broken up chromosomes with lots of structural variants), hence supplementary alignments (and consequently multiple reads with the same QNAME in the bam file) are expected. If I filtered out all these reads I would loose a substantial part of the information.

Have you solved it? I am facing the same issues, but I do not know if I need to remove the duplicate.

Hi, I have not looked into it yet. Let me know if you worked it out somehow. All I can think of right now is appending some random strings to the read names, as a quick workaround..

@Modernism-01
Copy link

Yes, I tried to use the command sambamba markdup -r and it worked well.

@xsvato01
Copy link

xsvato01 commented Feb 1, 2024

Here is quick workaround to append random umi-like sequences to the readnames, then Pepper does not complain:

samtools view -h ${bam} | \
awk -v output_bam=${name}.uniqueNames.bam \
    'BEGIN {OFS="\\t"} 
     { 
        if (\$1 ~ /^@/) {
            print \$0;
        } else {
            cmd = "openssl rand -hex 4";
            cmd | getline random_string;
            close(cmd);
            \$1 = \$1 "_" random_string;
            print \$0;
        }
     }' | \
samtools view -b -o ${name}.uniqueNames.bam -
samtools index ${name}.uniqueNames.bam ${name}.uniqueNames.bam.bai

you may want to delete the escaping \

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants