Skip to content

Fix/sentieon gvcftyper intervals#11668

Merged
SPPearce merged 5 commits into
masterfrom
fix/sentieon-gvcftyper-intervals
May 18, 2026
Merged

Fix/sentieon gvcftyper intervals#11668
SPPearce merged 5 commits into
masterfrom
fix/sentieon-gvcftyper-intervals

Conversation

@nicorap
Copy link
Copy Markdown
Contributor

@nicorap nicorap commented May 18, 2026

Description

SENTIEON_GVCFTYPER declares path(intervals) as an input and stages the file into the work dir, but the rendered sentieon driver command never references it — so GVCFtyper has been
running unconstrained even when callers pass per-interval BEDs.

+ def interval_command = intervals ? "--interval ${intervals}" : ""
  ...
- sentieon driver -r ${fasta} --algo GVCFtyper ${gvcfs_input} ${dbsnp_cmd} ${prefix}.vcf.gz
+ sentieon driver -r ${fasta} ${interval_command} --algo GVCFtyper ${gvcfs_input} ${dbsnp_cmd} ${prefix}.vcf.gz

The fix mirrors the pattern already used in sentieon/haplotyper (modules/nf-core/sentieon/haplotyper/main.nf:39).

Reason for PR

In a multi-sample joint-genotyping pipeline (e.g. nf-core/sarek's sentieon path), GVCFtyper is invoked per shard with a per-interval BED so each shard genotypes only its assigned region;
the per-interval VCFs are then concatenated by GATK4 MergeVcfs (which assumes non-overlapping inputs). Without --interval, every shard processes the full gVCF, so neighbouring shards
emit overlapping records at interval boundaries — the concat then produces duplicate/overlapping variant calls. Single-sample runs are largely unaffected, which is why this slipped
through.

Evidence the input was being ignored

The sentieon driver invocation lacks any --interval flag despite path(intervals) being declared and staged. With the fix the command renders correctly as:

sentieon driver -r genome.fasta --interval genome.bed --algo GVCFtyper -v test.genome.vcf.gz test_genotyped.vcf.gz

Note: the existing snapshot file does not demonstrate the bug behaviorally — the BED in the test-datasets happens to cover all variants in the test gVCF, so the constrained and
unconstrained outputs produce the same md5. The bug is real (the flag was missing from the command), but exercising it from tests would require a gVCF with variants outside the BED's
coverage, which the current test-datasets don't provide.

Tests

No new test added. With the available test data any regression test using it would be tautological (BED covers all gVCF variants, so the BED is a no-op regardless of whether --interval
is honored). The fix mirrors the pattern in sentieon/haplotyper; existing tests continue to pass.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests! — see "Tests" section above; current test data can't surface the bug
  • If you've added a new tool - have you followed the module conventions in the contribution docs — N/A,
    existing module
  • If necessary, include test data in your PR. — no new data
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions — unchanged from existing module
  • Follow the naming conventions. — unchanged
  • Follow the parameters requirements. — unchanged
  • Follow the input/output options guidelines. — unchanged
  • Add a resource labellabel 'process_high' already present
  • Use BioConda and BioContainers if possible to fulfil software requirements. — already configured
  • For modules:
    • nf-core modules test sentieon/gvcftyper --profile docker
    • nf-core modules test sentieon/gvcftyper --profile singularity
    • nf-core modules test sentieon/gvcftyper --profile conda

Cannot run the three test-profile boxes locally: the Sentieon container needs SENTIEON_AUTH_MECH + SENTIEON_AUTH_DATA (plus SENTIEON_LICSRVR_IP) which I don't have outside CI. Now
that this PR is on an upstream branch, the CI matrix runs them with the license-server credentials.

nicorap added 2 commits May 18, 2026 08:10
  The test asserted variantsMD5 differs from the buggy md5, but the
  genome.bed used here covers all variants in test.genome.vcf.gz, so
  the BED is a no-op and the md5 is the same with or without --interval.
  Surfacing the bug would require a gVCF with variants outside the
  BED's coverage, which the test-datasets don't provide.
@SPPearce SPPearce added this pull request to the merge queue May 18, 2026
Merged via the queue into master with commit 19430b8 May 18, 2026
35 checks passed
@SPPearce SPPearce deleted the fix/sentieon-gvcftyper-intervals branch May 18, 2026 12:29
manascripts pushed a commit to manascripts/modules that referenced this pull request May 21, 2026
* fix: sentieon gvcftyper accepts intervals

* style: line breaks in sentieon gvcftyper driver call

  Per @SPPearce review on nf-core#11582 — match the multi-line format used
  in sentieon/haplotyper.

* lint: rename meta1..meta4 to meta2..meta5 per nf-core convention

* test: drop tautological regression test

  The test asserted variantsMD5 differs from the buggy md5, but the
  genome.bed used here covers all variants in test.genome.vcf.gz, so
  the BED is a no-op and the md5 is the same with or without --interval.
  Surfacing the bug would require a gVCF with variants outside the
  BED's coverage, which the test-datasets don't provide.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants