Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternative codon incorrectly computed in context of multi-nucleotide mutation? #21

Open
Haddox opened this issue Feb 20, 2024 · 1 comment

Comments

@Haddox
Copy link

Haddox commented Feb 20, 2024

Hi, thank you for writing BTE, we in the Matsen lab have found it very useful.

I ran into a potential bug. I've been using the tree.translate() method to analyze codon-level mutations on an UShER tree. And for some nodes, the alternative codon appears to be incorrect. This always happens in cases where there are two nucleotide mutations to the same codon. For instance, the below table shows the set of translated mutations for the node Indonesia/NT-EIJK26/2020|EPI_ISL_766048|2020-04-14:

gene, aa_index, original_codon, alternative_codon, nuc
S, 347, TTT, TTG, T22603G
S, 348, GCA, CCA, G22604C
S, 348, GCA, CCA, C22605A

There are a total of three nucleotide mutations. The bottom two are in the same gene (spike) in the same codon (348). Given these mutations, it seems like the codon-level mutation should be GCA->CAA (reflecting both nucleotide mutations). Instead, it is GCA->CCA for both rows, which is correct for the single mutation in the middle row, but ignores the single mutation in the bottom row.

I appreciate that the output of tree.translate() (one entry per nucleotide mutation) makes it a little unclear how to report multi-nucleotide mutations. But just wanted to raise this in case you all had not considered it before.

The below is code what I've used to identify nodes that show this problem. Of the first 10,000 nodes in the tree I'm analyzing, 15 showed the problem.

Thanks!

import bte
from collections import Counter

# Read in tree
tree = bte.MATree(usher_tree_file)

# Get translated mutations
translations = tree.translate(ref_gtf, ref_fasta)

# Iterate over nodes and identify ones with potential problem
nodes = tree.depth_first_expansion()
problematic_nodes = []
for (i, node) in enumerate(nodes):
    if node.id not in translations:
        continue
    codon_muts = translations[node.id]
    counts = Counter((codon_mut.gene, codon_mut.aa_index, codon_mut.original_codon, codon_mut.alternative_codon) for codon_mut in codon_muts)
    if max(counts.values()) > 1:
        problematic_nodes.append(node)
    if i > 10000:
        break

print(len(problematic_nodes))
@jmcbroome
Copy link
Owner

Hi, thank you for raising this issue. Glad BTE has been of use to you and your team. Definitely looks like a bug.

BTE is largely a wrapper around UShER's core library, but does do some of its own custom processing for mutation translation, so the bug might be in BTE or in UShER upstream. Do you find that the tabular output for UShER's translate command exhibits the same behavior?

Also, could you attach a small file and command/script that reproduces the issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants