Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joint ancestral sequence reconstruction fails when input sequences are identical and GTR is inferred #253

Closed
huddlej opened this issue Aug 3, 2023 · 0 comments · Fixed by #255

Comments

@huddlej
Copy link
Contributor

huddlej commented Aug 3, 2023

When running an augur ancestral prototype that allows reconstruction of amino acid sequences, I ran into the following error for reconstruction of Zika amino acid sequences for the 2K gene:

augur ancestral is using TreeTime version 0.11.0
Read in 2 features from reference sequence file
Processing gene: 2K
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: divide by zero encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:554: RuntimeWarning: invalid value encountered in divide
  W_ij = (nij+nij.T+2*pc_mat)/mu/(np.outer(pi,Ti) + np.outer(Ti,pi)
/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py:10: RuntimeWarning: invalid value encountered in scalar subtract
  return (np.einsum('i,ij,j', pi, W, pi) - np.sum(pi*W[:,gap_index])*pi[gap_index])/(1-pi[gap_index])
Traceback (most recent call last):
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 274, in run
    aa_result = run_ancestral(T, fname, root_sequence=root_seq, is_vcf=is_vcf, fill_overhangs=not args.keep_overhangs,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 131, in run_ancestral
    tt = ancestral_sequence_inference(tree=T, aln=aln, ref=root_sequence if is_vcf else None, marginal=marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/augur/augur/ancestral.py", line 70, in ancestral_sequence_inference
    tt.infer_ancestral_sequences(infer_gtr=infer_gtr, marginal=bool_marginal,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 526, in infer_ancestral_sequences
    self.infer_gtr(marginal=marginal, **kwargs)
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/treeanc.py", line 1497, in infer_gtr
    self._gtr = GTR.infer(n_ij, T_ia.sum(axis=-1), root_state, fixed_pi=fixed_pi, pc=pc,
  File "/Users/jhuddlesfredhutch.org/projects/nextstrain/treetime/treetime/gtr.py", line 550, in infer
    while LA.norm(pi_old-pi) > dp and count < Nit:
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
  File "/Users/jhuddlesfredhutch.org/miniconda3/envs/augur/lib/python3.10/site-packages/numpy/lib/function_base.py", line 628, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

It turns out that this error occurs only when attempting to infer ancestral sequences with joint inference and GTR inference. The following Python code provides a minimal unit test to recreate the error.

def test_seq_joint_reconstruction_of_identical_sequences_with_inferred_gtr():
    """Reconstruct the sequences of the internal nodes with joint inference and
    inferred GTR using tip sequences that are identical and confirm that the
    reconstruction is correct.

    """
    from treetime import TreeAnc
    from Bio import Phylo, AlignIO
    tiny_tree = Phylo.read(StringIO("((A:0.60100000009,B:0.3010000009):0.1,C:0.2):0.001;"), 'newick')
    tiny_aln = AlignIO.read(StringIO(">A\nAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTT\n"
                                     ">B\nAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTT\n"
                                     ">C\nAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTT\n"), 'fasta')
    myTree = TreeAnc(gtr="JC69", tree=tiny_tree, aln=tiny_aln, verbose=4)
    myTree.infer_ancestral_sequences(infer_gtr=True, infer_tips=True, marginal=False)

    return myTree

The error can be avoiding by either not inferring GTR (infer_gtr=False) or by using marginal inference (marginal=True). The augur ancestral command does not provide a user argument to toggle the GTR inference from the default of True to False. It does provide a toggle to switch to marginal inference, though.

@rneher notes in the related issue:

the TreeTime error is indeed from the identical sequences. There is nothing to estimate the GTR model from. I thought this had been regularized by pseudo counts.

That statement made me think this issue might want to be resolved in TreeTime. If TreeTime could check for this issue internally and raise a user-readable error, Augur could pass that error on to the user. Alternately, Augur could be responsible for checking whether the input sequences are identical and inconsistent with inferring GTR state, so users would know. In that case, we'd want to add an augur ancestral argument for users to toggle off GTR inference when this issue occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant