**Back to TreeTime**

In [None]:
#import libraries
from IPython.display import IFrame
import pandas as pd

**Rerooting Phylogenetic Trees**

Never trust the rooting of a tree from IQ-TREE or similar tools: the output is always unrooted, no matter how rooted it may look when plotted.  This means that `--keep-root` in TreeTime won't give a meaningful result, since there is no real root to keep.  Instead, you can define the root in TreeTime using `--reroot 'oldest'`, if you know the oldest sample, or one of the other rerooting options.
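As a quick sanity check on which tip counts as the oldest, the sketch below picks the earliest-dated sample from a table of names and dates. The sample names and dates are invented for illustration, and the helper `oldest_sample` is hypothetical: it only mimics the idea of "oldest", not TreeTime's own rerooting code.

```python
from datetime import date

def oldest_sample(dates_by_name):
    """Return the (name, date) pair with the earliest sampling date.

    `dates_by_name` maps tip names to ISO-formatted date strings (YYYY-MM-DD).
    """
    parsed = {name: date.fromisoformat(d) for name, d in dates_by_name.items()}
    name = min(parsed, key=parsed.get)
    return name, parsed[name]

# Hypothetical tips, for illustration only.
example_dates = {
    "sample_A": "2021-12-14",
    "sample_B": "2021-11-30",
    "sample_C": "2022-01-03",
}
print(oldest_sample(example_dates))
```

If the tip this returns is not the one you expect to sit at the root, it is probably worth checking the dates file before running TreeTime.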

In [8]:
# ! treetime -h

**Molecular Clock**

Usually, the molecular clock rate is set to `0.008` for SARS-CoV-2.  This is meant to be the average mutation rate estimate, but a single fixed rate doesn't work in general for epidemiology.  (Check the order of magnitude before reusing this value: commonly cited estimates are around 8 × 10⁻⁴ substitutions per site per year, which would be `0.0008`.)  Using a hard clock will cause TreeTime to ignore some tips so it can get the clock right.  This is especially noticeable when you have clusters from different lineages, where the jump in mutation counts between lineages is bigger than the clock would predict (like the jump to Omicron).  In these cases, if you want a regression plot like TreeTime's, you can view the plot in TempEst and fit a regression line within each cluster separately.
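To illustrate why a single regression line can mislead when clusters are separated by a jump in divergence, here is a minimal sketch of a root-to-tip regression fitted per cluster and then pooled. All numbers are invented, and `fit_clock` is a hypothetical helper doing plain least squares; it is not what TreeTime or TempEst run internally.

```python
def fit_clock(dates, divergences):
    """Ordinary least-squares fit of divergence against sampling date.

    Returns (rate, intercept). With dates in decimal years, the slope is a
    clock-rate estimate in substitutions per site per year.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    rate = sxy / sxx
    return rate, mean_d - rate * mean_t

# Invented data: both clusters evolve at the same within-cluster rate,
# but cluster_2 starts from a much higher divergence.
clusters = {
    "cluster_1": ([2021.0, 2021.5, 2022.0], [0.0010, 0.0015, 0.0020]),
    "cluster_2": ([2021.5, 2022.0, 2022.5], [0.0040, 0.0045, 0.0050]),
}

for name, (t, d) in clusters.items():
    rate, _ = fit_clock(t, d)
    print(name, round(rate, 6))

# Pooling every tip into one regression mixes the jump into the slope.
all_t = [t for ts, _ in clusters.values() for t in ts]
all_d = [d for _, ds in clusters.values() for d in ds]
pooled_rate, _ = fit_clock(all_t, all_d)
print("pooled", round(pooled_rate, 6))
```

In this toy case the pooled slope comes out steeper than either within-cluster slope, which is the kind of distortion that fitting each cluster separately avoids.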

For a more realistic measure, you can use the **relaxed molecular clock**, which lets each cluster have its own mutation rate independent of the other clusters.  Leo recommends setting `--relax 1.0 0` (slack 1.0, coupling 0).  It's good to include a molecular clock rate as well, so that all the clusters together have to average out to `0.008`.  Once you've done both of these, TreeTime's regression plot is no longer very meaningful.

The number used for the rate is the one used in past publications, but with Omicron being so transmissible it might not hold anymore.  Proceed with caution if your samples are ~90% Omicron.


**Wuhan-Hu-1**

Wuhan-Hu-1 (MN908947) is one of the two earliest samples we have for SARS-CoV-2, and it's the one most often used as the reference genome for SARS-CoV-2 (in this lab, always).  It comes from a patient sample taken on December 26, 2019, and is from the B lineage.  The other sample, Wuhan/WH04/2020, is from the A lineage and was sampled in January 2020.  For more on SARS-CoV-2 nomenclature, see [this paper](https://www.nature.com/articles/s41564-020-0770-5).

Since we only have access to the A and B lineage references but not their common ancestor, it can happen that you'll build a tree with Wuhan-Hu-1 in it only to see that it's not a common ancestor for all the samples in the tree.

In general, it's difficult to infer a good tree from samples that are all clustered near each other in time plus one sample from years earlier.  The other day, for example, I tried to build a tree using 200 samples from winter 2021 plus Wuhan-Hu-1.  If you want to include the reference, a good way to do it is to build the tree using just the 200 samples, and then to add the reference sequence afterwards.  In IQ-TREE you can do this with `-g`, which supplies a constraint tree.  **Question: Is this done before TreeTime or after?  I think after.  See if it can be done.**



**Visualizing Nexus trees**

Nexus trees can be viewed using `figtree`: `figtree mytree.nexus`.  Here's one of mine from the other day:

In [3]:
#this will open the file in a new window
! figtree COG-UK-NORW/results/timetree-tiplabels/timetree.nexus

javax.swing.UIManager$LookAndFeelInfo[Metal javax.swing.plaf.metal.MetalLookAndFeel]
javax.swing.UIManager$LookAndFeelInfo[Nimbus javax.swing.plaf.nimbus.NimbusLookAndFeel]
javax.swing.UIManager$LookAndFeelInfo[CDE/Motif com.sun.java.swing.plaf.motif.MotifLookAndFeel]
javax.swing.UIManager$LookAndFeelInfo[GTK+ com.sun.java.swing.plaf.gtk.GTKLookAndFeel]


**Uniform vs. Coalescent Trees**

In a uniform tree, the branch lengths between nodes are constant.  This is good for humans, cows and bacteria: things with discrete generations.  For viral transmission dynamics, it makes more sense to use a coalescent model, in which the branch lengths get shorter as time goes on, representing viral spread.  **Q: clarify this part.**  TreeTime has a few coalescent models, including a skyline model; test these out and see what they do.
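One way to make the coalescent intuition concrete: under the standard Kingman coalescent with a constant time scale `tc`, any of the C(k, 2) pairs among k lineages can merge, so the expected waiting time while k lineages exist is `tc / C(k, 2)`. Intervals between mergers are therefore long near the root (few lineages) and short near the tips (many lineages). The sketch below just tabulates this; `expected_intervals` is a hypothetical helper and `tc = 1.0` is an arbitrary choice, so this is textbook coalescent arithmetic rather than TreeTime's implementation.

```python
from math import comb

def expected_intervals(n_tips, tc=1.0):
    """Expected waiting time with k lineages, for k = n_tips down to 2.

    Under the Kingman coalescent with constant time scale `tc`, the
    expected wait while k lineages exist is tc / comb(k, 2).
    """
    return {k: tc / comb(k, 2) for k in range(n_tips, 1, -1)}

intervals = expected_intervals(5)
for k, wait in intervals.items():
    print(f"{k} lineages: expected wait {wait:.3f}")
```

Going backwards in time from the tips, each successive interval is expected to be longer than the previous one, which matches the picture of branch lengths shortening towards the present.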

**Testing some treetime models**

These models are tested on the 200 NORW samples, without Wuhan-Hu-1.
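The four runs in this section differ only in the `--coalescent` setting, so the command lines can be generated rather than retyped. The sketch below assembles them from the options used in this notebook; `treetime_command` is a hypothetical helper and the file paths are placeholders, not the real project paths.

```python
import shlex

def treetime_command(tree, dates, aln, outdir, coalescent=None):
    """Assemble a treetime command line as a list of arguments.

    Only options that appear in this notebook are used; the relaxed-clock,
    clock-rate and reroot settings are the same for every run.
    """
    cmd = [
        "treetime",
        "--relax", "1.0", "0",
        "--clock-rate", "0.008",
        "--reroot", "oldest",
    ]
    if coalescent is not None:
        cmd += ["--coalescent", str(coalescent)]
    cmd += ["--tree", tree, "--dates", dates, "--aln", aln, "--outdir", outdir]
    return cmd

# Placeholder paths, for illustration only.
for i, model in enumerate([None, "skyline", "const", 20.0], start=1):
    cmd = treetime_command("tree.treefile", "dates.csv", "aln.fa", f"timetree-{i}", model)
    print(shlex.join(cmd))
```

Keeping the command as a list also makes it easy to hand to `subprocess.run` later, should the runs need to be scripted outside the notebook.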

In [15]:
! mkdir -p COG-UK-NORW/results/20220210
! mafft COG-UK-NORW/NORW_200.fasta > COG-UK-NORW/results/20220210/NORW_200_aln.fa #align NORW samples

nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..

There are 161919 ambiguous characters.
  101 / 200
done.

Constructing a UPGMA tree (efffree=0) ... 
  190 / 200
done.

Progressive alignment 1/2... 
STEP   195 / 199 
Reallocating..done. *alloclen = 60816
STEP   199 / 199 
done.

Making a distance matrix from msa.. 
  100 / 200
done.

Constructing a UPGMA tree (efffree=1) ... 
  190 / 200
done.

Progressive alignment 2/2... 
STEP   199 / 199 
done.

disttbfast (nuc) Version 7.490
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
0 thread(s)


Strategy:
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been

In [5]:
! iqtree -s COG-UK-NORW/results/20220210/NORW_200_aln.fa -pre 'COG-UK-NORW/results/20220210/NORW_200_aln.fa'
! grep '>' COG-UK-NORW/results/20220210/NORW_200_aln.fa | sed 's/>//g' > 'COG-UK-NORW/results/20220210/NORW_200_leaftips.csv'

IQ-TREE multicore version 1.6.12 for Linux 64-bit built Mar 23 2020
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor,
Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    tamarisk (AVX2, FMA3, 9 GB RAM)
Command: iqtree -s ../data/COG-UK-NORW/results/20220210/NORW_200_aln.fa -pre ../data/COG-UK-NORW/results/20220210/NORW_200_aln.fa
Seed:    628004 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Feb 14 15:42:47 2022
Kernel:  AVX+FMA - 1 threads (8 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 8 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file ../data/COG-UK-NORW/results/20220210/NORW_200_aln.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 200 sequences with 29914 columns, 1567 distinct patterns
174 parsimony-informative, 238 singleton sites, 29502 constant sites
England/NORW-22D223/2021 -> England

NOTE: England_NORW-22D0DE_2021 is identical to England_NORW-22D205_2021 but kept for subsequent analysis
NOTE: England_NORW-22D2BA_2021 is identical to England_NORW-22D27E_2021 but kept for subsequent analysis
NOTE: England_NORW-22BEC7_2021 is identical to England_NORW-22B216_2021 but kept for subsequent analysis
NOTE: England_NORW-22899C_2021 is identical to England_NORW-22C1CD_2021 but kept for subsequent analysis
NOTE: England_NORW-22B9F3_2021 is identical to England_NORW-22BDF7_2021 but kept for subsequent analysis
NOTE: England_NORW-22B42F_2021 is identical to England_NORW-22BE6D_2021 but kept for subsequent analysis
NOTE: England_NORW-22BCCD_2021 is identical to England_NORW-22B902_2021 but kept for subsequent analysis
NOTE: England_NORW-22BE9A_2021 is identical to England_NORW-22BD7F_2021 but kept for subsequent analysis
NOTE: England_NORW-22B2DA_2021 is identical to England_NORW-22C057_2021 but kept for subsequent analysis
NOTE: England_NORW-224611_2021 is identical to England_

185  TIM2e+G4      45357.188    393 91500.377    91510.867    94764.667
186  TIM2e+I+G4    45355.018    394 91498.035    91508.580    94770.631
187  TIM2e+R2      45371.436    394 91530.871    91541.416    94803.468
188  TIM2e+R3      45368.796    396 91529.592    91540.244    94818.800
196  TIM2+F        44419.918    395 89629.836    89640.435    92910.739
197  TIM2+F+I      44402.174    396 89596.348    89607.001    92885.557
198  TIM2+F+G4     44402.584    396 89597.167    89607.819    92886.375
199  TIM2+F+I+G4   44402.245    397 89598.491    89609.197    92896.005
200  TIM2+F+R2     44413.927    397 89621.855    89632.561    92919.369
201  TIM2+F+R3     44411.473    399 89620.947    89631.762    92935.073
209  TIM3e         45380.630    392 91545.261    91555.698    94801.245
210  TIM3e+I       45356.680    393 91499.361    91509.851    94763.651
211  TIM3e+G4      45359.357    393 91504.714    91515.204    94769.004
212  TIM3e+I+G4    45356.687    394 91501.375    91511.919    94

In [2]:
# metadata table: keep only the sequence name and sampling date
readfile = 'COG-UK-NORW/NORW_metadata.csv'
df = pd.read_csv(readfile, header=0, usecols=['sequence_name', 'sample_date'])
df = df.rename(columns={'sequence_name': 'name', 'sample_date': 'date'})  # headers required by treetime

# include only the names that match the names in the alignment
leaf_df = pd.read_csv('COG-UK-NORW/results/20220210/NORW_200_leaftips.csv', header=None)
leaf_df = leaf_df.rename(columns={0: 'name'})

# add dates to leaf_df (the left join keeps every tip), and format the names like in the tree
save_df = leaf_df.merge(df, how='left', on='name')
save_df['name'] = save_df['name'].str.replace('/', '_', regex=False)

save_df.to_csv('COG-UK-NORW/results/20220210/NORW_treetime_dates.csv', sep=',', index=False)
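Because the merge above is a left join, any tip whose name is missing from the metadata ends up with an empty date instead of raising an error. A minimal sketch of a check for that is below, written with plain Python containers so it runs anywhere; the names and dates are invented and `tips_without_dates` is a hypothetical helper. On the real table, the same check would amount to looking at `save_df['date'].isna()`.

```python
def tips_without_dates(tip_names, dates_by_name):
    """Return tip names that have no usable date, in their original order."""
    return [name for name in tip_names if not dates_by_name.get(name)]

# Invented names and dates, for illustration only.
tips = ["sample_1", "sample_2", "sample_3"]
dates = {"sample_1": "2021-12-01", "sample_3": ""}
print(tips_without_dates(tips, dates))
```

An empty list means every tip in the alignment has a date to pass to TreeTime.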

In [3]:
! sed -i -e 's/\//_/g' COG-UK-NORW/results/20220210/NORW_200_aln.fa
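The `sed` call above rewrites every `/` in the file. That is fine for a nucleotide alignment, where slashes only occur in the header lines, but a more explicit version would touch the headers only. A small sketch follows, assuming the FASTA has already been read into a list of lines; the record is invented and `sanitize_fasta_headers` is a hypothetical helper.

```python
def sanitize_fasta_headers(lines):
    """Replace '/' with '_' in FASTA header lines, leaving sequences alone."""
    return [
        line.replace("/", "_") if line.startswith(">") else line
        for line in lines
    ]

# Invented record, for illustration only.
record = [">England/sample_1/2021", "ACGTN"]
print(sanitize_fasta_headers(record))
```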

1) `--relax 1.0 0`, `--clock-rate 0.008`, `--reroot 'oldest'` 

In [10]:
! treetime --relax 1.0 0 --clock-rate 0.008 --reroot 'oldest' --tree COG-UK-NORW/results/iqtree/NORW_200_ref_aln.fa.treefile --dates COG-UK-NORW/NORW_treetime_dates.csv --aln COG-UK-NORW/results/NORW_200_ref_aln.fa --outdir COG-UK-NORW/results/20220210/timetree-1


Attempting to parse dates...
	Using column 'name' as name. This needs match the taxon names in the tree!!
	Using column 'date' as date.

0.00	-TreeAnc: set-up

    	tips at positions with AMBIGUOUS bases. This resulted in unexpected
    	behavior is some cases and is no longer done by default. If you want to
    	replace those ambiguous sites with their most likely state, rerun with
    	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.

5.02	TreeTime.reroot: with method or node: oldest

5.16	TreeTime.reroot: with method or node: oldest

5.96	###TreeTime.run: INITIAL ROUND

12.19	TreeTime.reroot: with method or node: oldest

12.31	###TreeTime.run: rerunning timetree after rerooting

18.75	###TreeTime.run: ITERATION 1 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

26.79	###TreeTime.run: ITERATION 2 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

34.21	###TreeTime.run: CONVERGED

Inferred sequence evolution model (saved as ../data/COG-UK

In [3]:
IFrame("COG-UK-NORW/results/20220210/timetree-1/timetree.pdf", width=600, height=300)

2) `--relax 1.0 0`, `--clock-rate 0.008`, `--reroot 'oldest'`, `--coalescent 'skyline'`

In [5]:
! treetime --relax 1.0 0 --clock-rate 0.008 --reroot 'oldest' --coalescent 'skyline' --tree COG-UK-NORW/results/iqtree/NORW_200_ref_aln.fa.treefile --dates COG-UK-NORW/NORW_treetime_dates.csv --aln COG-UK-NORW/results/NORW_200_ref_aln.fa --outdir COG-UK-NORW/results/20220210/timetree-2


Attempting to parse dates...
	Using column 'name' as name. This needs match the taxon names in the tree!!
	Using column 'date' as date.

0.00	-TreeAnc: set-up

    	tips at positions with AMBIGUOUS bases. This resulted in unexpected
    	behavior is some cases and is no longer done by default. If you want to
    	replace those ambiguous sites with their most likely state, rerun with
    	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.

5.02	TreeTime.reroot: with method or node: oldest

5.16	TreeTime.reroot: with method or node: oldest

5.99	###TreeTime.run: INITIAL ROUND

12.01	TreeTime.reroot: with method or node: oldest

12.13	###TreeTime.run: rerunning timetree after rerooting

18.36	###TreeTime.run: ITERATION 1 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

34.17	###TreeTime.run: ITERATION 2 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

Inferred sequence evolution model (saved as ../data/COG-UK-NORW/results/20220210/timetree-2/

In [2]:
IFrame("COG-UK-NORW/results/20220210/timetree-2/skyline.pdf", width=600, height=300)

The skyline plot shows the estimated "piece-wise linear merger rate trajectory" that TreeTime creates: roughly, an estimate of how the effective population size changes across the time span of the tree.

In [11]:
! ls COG-UK-NORW/results/20220210/timetree-2

ancestral_sequences.fasta  root_to_tip_regression.pdf	 substitution_rates.tsv
dates.tsv		   sequence_evolution_model.txt  timetree.nexus
divergence_tree.nexus	   skyline.pdf			 timetree.pdf
molecular_clock.txt	   skyline.tsv			 trace_run.log


3) `--relax 1.0 0`, `--clock-rate 0.008`, `--reroot 'oldest'`, `--coalescent 'const'`

In [12]:
! treetime --relax 1.0 0 --clock-rate 0.008 --reroot 'oldest' --coalescent 'const' --tree COG-UK-NORW/results/iqtree/NORW_200_ref_aln.fa.treefile --dates COG-UK-NORW/NORW_treetime_dates.csv --aln COG-UK-NORW/results/NORW_200_ref_aln.fa --outdir COG-UK-NORW/results/20220210/timetree-3


Attempting to parse dates...
	Using column 'name' as name. This needs match the taxon names in the tree!!
	Using column 'date' as date.

0.00	-TreeAnc: set-up

    	tips at positions with AMBIGUOUS bases. This resulted in unexpected
    	behavior is some cases and is no longer done by default. If you want to
    	replace those ambiguous sites with their most likely state, rerun with
    	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.

5.11	TreeTime.reroot: with method or node: oldest

5.23	TreeTime.reroot: with method or node: oldest

6.02	###TreeTime.run: INITIAL ROUND

12.22	TreeTime.reroot: with method or node: oldest

12.34	###TreeTime.run: rerunning timetree after rerooting

18.73	###TreeTime.run: ITERATION 1 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

34.90	###TreeTime.run: ITERATION 2 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

Inferred sequence evolution model (saved as ../data/COG-UK-NORW/results/20220210/timetree-3/

In [13]:
IFrame("COG-UK-NORW/results/20220210/timetree-3/timetree.pdf", width=600, height=300)

4) `--relax 1.0 0`, `--clock-rate 0.008`, `--reroot 'oldest'`, `--coalescent '20.0'`

In [19]:
! treetime --relax 1.0 0 --clock-rate 0.008 --reroot 'oldest' --coalescent '20.0' --tree COG-UK-NORW/results/iqtree/NORW_200_ref_aln.fa.treefile --dates COG-UK-NORW/NORW_treetime_dates.csv --aln COG-UK-NORW/results/NORW_200_ref_aln.fa --outdir COG-UK-NORW/results/20220210/timetree-4


Attempting to parse dates...
	Using column 'name' as name. This needs match the taxon names in the tree!!
	Using column 'date' as date.

0.00	-TreeAnc: set-up

    	tips at positions with AMBIGUOUS bases. This resulted in unexpected
    	behavior is some cases and is no longer done by default. If you want to
    	replace those ambiguous sites with their most likely state, rerun with
    	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.

5.12	TreeTime.reroot: with method or node: oldest

5.26	TreeTime.reroot: with method or node: oldest

6.06	###TreeTime.run: INITIAL ROUND

12.09	TreeTime.reroot: with method or node: oldest

12.21	###TreeTime.run: rerunning timetree after rerooting

18.43	###TreeTime.run: ITERATION 1 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

33.48	###TreeTime.run: ITERATION 2 out of 2 iterations
relaxed_clock {'slack': 1.0, 'coupling': 0.0}

Inferred sequence evolution model (saved as ../data/COG-UK-NORW/results/20220210/timetree-4/

In [18]:
IFrame("COG-UK-NORW/results/20220210/timetree-4/timetree.pdf", width=600, height=300)