alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

Haddox · 2024-02-20T18:52:53Z

Hi, thank you for writing BTE, we in the Matsen lab have found it very useful.

I ran into a potential bug. I've been using the tree.translate() method to analyze codon-level mutations on an UShER tree. And for some nodes, the alternative codon appears to be incorrect. This always happens in cases where there are two nucleotide mutations to the same codon. For instance, the below table shows the set of translated mutations for the node Indonesia/NT-EIJK26/2020|EPI_ISL_766048|2020-04-14:

gene, aa_index, original_codon, alternative_codon, nuc
S, 347, TTT, TTG, T22603G
S, 348, GCA, CCA, G22604C
S, 348, GCA, CCA, C22605A

There are a total of three nucleotide mutations. The bottom two are in the same gene (spike) in the same codon (348). Given these mutations, it seems like the codon-level mutation should be GCA->CAA (reflecting both nucleotide mutations). Instead, it is GCA->CCA for both rows, which is correct for the single mutation in the middle row, but ignores the single mutation in the bottom row.

I appreciate that the output of tree.translate() (one entry per nucleotide mutation) makes it a little unclear how to report multi-nucleotide mutations. But just wanted to raise this in case you all had not considered it before.

The below is code what I've used to identify nodes that show this problem. Of the first 10,000 nodes in the tree I'm analyzing, 15 showed the problem.

Thanks!

import bte
from collections import Counter

# Read in tree
tree = bte.MATree(usher_tree_file)

# Get translated mutations
translations = tree.translate(ref_gtf, ref_fasta)

# Iterate over nodes and identify ones with potential problem
nodes = tree.depth_first_expansion()
problematic_nodes = []
for (i, node) in enumerate(nodes):
    if node.id not in translations:
        continue
    codon_muts = translations[node.id]
    counts = Counter((codon_mut.gene, codon_mut.aa_index, codon_mut.original_codon, codon_mut.alternative_codon) for codon_mut in codon_muts)
    if max(counts.values()) > 1:
        problematic_nodes.append(node)
    if i > 10000:
        break

print(len(problematic_nodes))

The text was updated successfully, but these errors were encountered:

jmcbroome · 2024-02-20T21:48:09Z

Hi, thank you for raising this issue. Glad BTE has been of use to you and your team. Definitely looks like a bug.

BTE is largely a wrapper around UShER's core library, but does do some of its own custom processing for mutation translation, so the bug might be in BTE or in UShER upstream. Do you find that the tabular output for UShER's translate command exhibits the same behavior?

Also, could you attach a small file and command/script that reproduces the issue?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

Haddox commented Feb 20, 2024

jmcbroome commented Feb 20, 2024

alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

Comments

Haddox commented Feb 20, 2024

jmcbroome commented Feb 20, 2024