Skip to content

fix: N characters in block consensus#176

Merged
mmolari merged 11 commits intomasterfrom
fix/block_consensus
Mar 16, 2026
Merged

fix: N characters in block consensus#176
mmolari merged 11 commits intomasterfrom
fix/block_consensus

Conversation

@mmolari
Copy link
Copy Markdown
Collaborator

@mmolari mmolari commented Mar 12, 2026

issue with NNNNN sequences

For sequences that contain large stretches of NNNNNNs, it can rarely happen that these end up in the consensus of a block and prevent the block from being subsequently aligned and merged.

This can happen for example when merging two blocks of depth one (e.g. two single-genome graphs, which happens at all tips of the guide tree). In this case one of the two sequences is randomly chosen as the consensus sequences, and it could be the one with many NNNNNN.

This could be later corrected in the reconsensus step if the block is aligned and increases in depth, and the NNNNN stretch is found in only a minority of genomes. However if the stretch is very long (or the block is split such that the NNNNN stretch is now a majority of the consensus sequence) now the block cannot be further aligned anymore, and the other non-NNNNN sequence in the block is "trapped" and masked, even if it could find homology somewhere else.

the fix

A possible first-order fix, implemented here, is to simply break ties when picking consensus sequences of blocks. Currently the consensus sequence is chosen based on the block depth (this minimizes re-alignments). On top of this, now for blocks with the same depth sequences with fewer amounts of NNNNNNs are favored.

a possible further improvement

This fix does not solve cases where NNNNNNs are found in multiple sequences. This could be in principle be taken care of in the reconsensus step, by accepting observed substitutions N -> X to the consensus even if they are present in a minority of sequences, as long as they remove Ns from the consensus. Currently I did not implement this, since it requires deeper changes and might break other things if not done properly (one needs to be careful that these Ns are not re-introduced later because they still are majority substitutions).

I'm not sure these edge-cases are frequent enough to merit more complicated changes to the code, so for now I only implemented this minimal change.


Thanks to @plaquette for running into this and reporting!

@ivan-aksamentov
Copy link
Copy Markdown
Member

ivan-aksamentov commented Mar 13, 2026

Will review in more details later. Need to brush up science first - I already forgot most of it.

For now, pasting AI review below, in case useful. If not, please ignore.


⚠️🤖 AI-generated content below. May contain mistakes

Problem Statement

  • During graph merging, when two blocks with equal depth are candidates for merging, the anchor block's consensus becomes the merged block's consensus. Previously, ties were broken arbitrarily (ref wins). This PR aims to prefer the block with fewer ambiguous N characters, improving merged consensus quality.

Proposed Changes

  • Add a tie-breaker to anchor block selection: when two blocks have equal depth, prefer the block with fewer N characters in its consensus. The goal is to minimize Ns in merged block consensuses.

Identified Issues

H1: The new heuristic measures Ns on the wrong sequence [click to expand]

The code flow reveals a mismatch between what's measured and what's used:

  1. Anchor decision (packages/pangraph/src/pangraph/reweave.rs#L144-L163):

    let ref_block = &graph.blocks[&m.reff.name];
    let qry_block = &graph.blocks[&m.qry.name];
    // ...
    if ref_block.count_n() <= qry_block.count_n() {

    The count_n() is called on whole original blocks from graph.blocks.

  2. Slicing happens later (packages/pangraph/src/pangraph/reweave.rs#L364):

    let (b_slice, n_dict) = block_slice(b, &interval, graph);
  3. Slice creates new consensus (packages/pangraph/src/pangraph/slice.rs#L143):

    let new_consensus: Seq = b.consensus()[i.interval.to_range()].into();

    Only this interval slice becomes the merged consensus.

  4. Merged block uses anchor's slice (packages/pangraph/src/pangraph/reweave.rs#L318 and reweave.rs#L93):

    let anchor_block = b_anch.block.clone(); // already sliced
    // ... in solve_promise:
    Ok(self.anchor_block.clone()) // returns sliced anchor

Concrete counterexample:

Block Total Consensus Aligned Interval Total Ns Interval Ns
A NNNNN-ACGT-NNNNN ACGT 10 0
B ACGT-ACNT-ACGT ACNT 1 1

With equal depth:

  • Current code picks B as anchor (1 < 10 total Ns)
  • Merged consensus: ACNT (1 N)
  • Optimal choice: A's interval ACGT (0 Ns)

Why this matters:

  • Partial-block merges are the norm in reweave - alignments rarely span entire block consensuses
  • The m.reff.interval and m.qry.interval fields contain the actual aligned regions
  • Reconsensus (packages/pangraph/src/reconsensus/reconsensus.rs#L32-L94) uses majority-rule on the merged block (post-slice), so starting with more Ns in the anchor slice propagates them

Possible Solutions:

S1. Count Ns on aligned intervals (recommended) [click to expand]

In assign_anchor_block(), extract and count Ns on intervals instead of whole blocks:

fn assign_anchor_block(mergers: &mut [Alignment], graph: &Pangraph) {
  for m in mergers.iter_mut() {
    let ref_block = &graph.blocks[&m.reff.name];
    let qry_block = &graph.blocks[&m.qry.name];

    let anchor = if ref_block.depth() != qry_block.depth() {
      // ... existing depth logic
    } else {
      // Count Ns in aligned intervals only
      let ref_interval_ns = ref_block.consensus()[m.reff.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();
      let qry_interval_ns = qry_block.consensus()[m.qry.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();

      if ref_interval_ns <= qry_interval_ns {
        AnchorBlock::Ref
      } else {
        AnchorBlock::Qry
      }
    };
    m.anchor_block = Some(anchor);
  }
}
S2. Add helper method to PangraphBlock [click to expand]

Add count_n_in_range(interval: &Interval) -> usize to packages/pangraph/src/pangraph/pangraph_block.rs for cleaner code.

S3. Regression test [click to expand]

Test where block A has more total Ns but fewer Ns in aligned interval. Fails with current implementation, passes once fixed.

#[test]
fn test_assign_anchor_block_prefers_fewer_interval_ns() {
  fn new_hit(block_id: usize, start: usize, end: usize) -> Hit {
    Hit { name: BlockId(block_id), length: 15, interval: Interval::new(start, end) }
  }

  fn new_aln(q: usize, q_interval: (usize, usize), r: usize, r_interval: (usize, usize)) -> Alignment {
    Alignment {
      qry: new_hit(q, q_interval.0, q_interval.1),
      reff: new_hit(r, r_interval.0, r_interval.1),
      matches: 0, length: 0, quality: 0, orientation: Forward,
      new_block_id: None, anchor_block: None, cigar: Cigar::default(),
      divergence: None, align: None,
    }
  }

  fn e(nids: &[usize]) -> BTreeMap<NodeId, Edit> {
    nids.iter().map(|nid| (NodeId(*nid), Edit::empty())).collect()
  }

  // Block 1: 10 total Ns, interval [5..9] "ACGT" has 0 Ns
  // Block 2: 1 total N, interval [4..8] "ACNT" has 1 N
  let b1 = PangraphBlock::new(BlockId(1), "NNNNNACGTNNNNN", e(&[1, 2]));
  let b2 = PangraphBlock::new(BlockId(2), "ACGTACNTACGT", e(&[3, 4]));

  let pangraph = Pangraph {
    blocks: btreemap! { BlockId(1) => b1, BlockId(2) => b2 },
    paths: btreemap! {}, nodes: btreemap! {},
  };

  // With interval-based counting: Ref wins (0 < 1)
  // With whole-block counting (bug): Qry wins (1 < 10)
  let mut mergers = vec![new_aln(2, (4, 8), 1, (5, 9))];
  assign_anchor_block(&mut mergers, &pangraph);
  assert_eq!(mergers[0].anchor_block, Some(AnchorBlock::Ref));
}

Summary

Issue Valid Fix
H1: The new heuristic measures Ns on the wrong sequence Yes Count Ns on aligned intervals, not whole blocks

@mmolari
Copy link
Copy Markdown
Collaborator Author

mmolari commented Mar 13, 2026

AI was right, this was indeed an issue, thanks for the review! Should now be addressed!

Extends #176 with a test case that would fail if N counting used whole-block
consensus instead of aligned intervals.

Scenario:
- Block 1: 10 total Ns, but aligned interval has 0 Ns
- Block 2: 1 total N, and aligned interval has 1 N

With whole-block counting: Block 2 wins (1 < 10)
With interval counting:    Block 1 wins (0 < 1)
Extends test/pr-176-interval-regression by refactoring the three separate
anchor block test functions into a single rstest parameterized test.

Benefits:
- Reduces test code from 148 to 45 lines
- Table format makes test cases easier to compare
- Adding new cases requires only one line

All 4 original test cases preserved with identical coverage.
Extends refactor/pr-176-parameterized-tests with additional edge cases.

New cases (7 added, 11 total):
- equal_depth_equal_ns_ref_wins: tie-breaker when both have same N count
- equal_depth_zero_ns_ref_wins: tie-breaker when neither has Ns
- equal_depth_many_ns_qry_wins: qry wins when ref has more Ns
- ref_deeper_wins: ref with greater depth wins despite more Ns
- depth_large_difference: large depth gap (10 vs 2)
- interval_at_end: N at block end position
- single_base_interval: single nucleotide alignment
@ivan-aksamentov
Copy link
Copy Markdown
Member

Our AI overlords are happy now!

My human take:

  • Technically - perfect
  • I brushed up on science a little and I think changes are sound scientifically, for my current understanding.

My current understanding how it affects production, from user perspective:

The problem was: when input genomes contain stretches of Ns, these could get "stuck" in block consensus sequences, preventing related sequences from being merged - effectively hiding good data.

With the fix: less ambiguous consensus sequences, better merging of related regions, fewer blocks trapped by ambiguous data, nicer pangraphs.

@ivan-aksamentov
Copy link
Copy Markdown
Member

ivan-aksamentov commented Mar 13, 2026

While going through code, I thought of some improvements to the tests.

I made optional stacked PRs extending on top of this PR, for consideration:

All optional - can be merged independently, along with or after this PR.

To incorporate all 3 optional changes into this PR, preserving nice linear history:

git checkout fix/block_consensus
git merge --ff-only test/pr-176-more-cases  # fast forward PR branch pointer to the last stacked branch pointer
git push

Otherwise, 176 is good to go as is. Feel free to merge. Let me know if I should release on Tuesday-Wednesday or if you want to do it.

mmolari added 3 commits March 16, 2026 10:37
test: add regression test for interval-based N counting
…ests

refactor: consolidate anchor block tests into parameterized test
test: add more anchor block selection test cases
@mmolari
Copy link
Copy Markdown
Collaborator Author

mmolari commented Mar 16, 2026

Our AI overlords are happy now!

😆

Thank you! I checked the other PRs and they all make a lot of sense! Merged everything and I can take care of the release tomorrow/wed, I think your instructions for this were very complete!

@mmolari mmolari merged commit e330586 into master Mar 16, 2026
19 checks passed
@mmolari mmolari deleted the fix/block_consensus branch March 16, 2026 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants