Skip to content

ZeroDivisionError in write_read_str_stats when an STR length bucket has no coverage #5

@nate-d-olson

Description

@nate-d-olson

Version: GQC 1.1.0 (also reproduced against main at GQC/stats.py:869)

Summary

readbench --strs crashes with ZeroDivisionError: division by zero in stats.write_read_str_stats when the included STR set contains a runlength for which no reads pass the qualification filters (alignment span / soft-clip / 5 bp flank checks).

Traceback

File ".../GQC/readbench.py", line 176, in main
  stats.write_read_str_stats('trinuc', trinucstats, outputfiles, args)
File ".../GQC/stats.py", line 869, in write_read_str_stats
  accrate = correctbylength[refstrlength]/(correctbylength[refstrlength] + lengtherrorbylength[refstrlength] + complexbylength[refstrlength] + flankbylength[refstrlength])
ZeroDivisionError: division by zero

Root cause

In errors.assess_str_read_coverage_require_flank (errors.py:376-377), every STR in the included BED is registered in strdict before read tallying:

runlength = end - start
if runlength not in strdict:
    strdict[runlength] = {}

If zero reads at any STR with that runlength survive the filters at lines 401-434 (reference_start/reference_end checks, soft-clip requirement, <5 bp flank check), strdict[runlength] remains {}.

Then in stats.write_read_str_stats the inner loop over stats[refstrlength].keys() never runs, so correctbylength, lengtherrorbylength, complexbylength, and flankbylength all stay at 0, and the per-row division on line 869 fails.

Note: even when reads are present, this can still fire if the only category seen for a given length is HET, since HET is excluded from the denominator.

Reproduction

Run readbench --strs against a BAM with sparse coverage at a particular STR locus. We hit it on a low-coverage diploid mapping where one trinuc STR (TNR_7289_ATT at chr21_MATERNAL:47134596-47134638, runlength=42) had zero qualifying reads. The same crash also occurred on chr7_MATERNAL (trinuc) and on a genome-wide --rerun (tetranuc) in the same dataset.

Suggested fix

Guard the division and skip the row when there's no data:

for refstrlength in sorted(correctbylength.keys()):
    denom = (correctbylength[refstrlength]
             + lengtherrorbylength[refstrlength]
             + complexbylength[refstrlength]
             + flankbylength[refstrlength])
    if denom == 0:
        continue
    accrate = correctbylength[refstrlength] / denom
    lfh.write(...)

Alternatively, write the row with accrate=NA so empty buckets are still represented in the output.

Environment

  • GQC 1.1.0, installed via uvx --from gqc readbench
  • Python 3.14
  • Linux x86_64

Bug patched and issue text generated with the help of claude opus 4.7

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions