# Phase 2ii: TreeTime & Down Sampling

Generates a temporal tree using TreeTime and, if down_sample_to is provided, performs downsampling.

```
Parameters
-------------
save_dir: str  
    Path to directory for saving outputs in.

fasta_file: str, optional
    Path to fasta file containing sequences to use in generating a BEAST 2 xml.
    If not given and root_strain_names is not given, f'{save_dir}/sequences.fasta' is used instead.
    If not given and root_strain_names is given, f'{save_dir}/sequences_with_root.fasta' is used instead.

metadata_path: str, optional
    Path to csv or tsv containing metadata pertaining to fasta_file.
    If not given and root_strain_names is not given, f'{save_dir}/metadata.csv' is used instead.
    If not given and root_strain_names is given, f'{save_dir}/metadata_with_root.csv' is used instead.

sample_id_field: str, default 'strain'
    Name of field in metadata_path containing ids corresponding to those used in fasta_file.

collection_date_field: str, default 'date'
    Name of field in metadata_path containing collection dates of sequences. Should be in the format YYYY-MM-DD.

root_strain_names: list of strings, optional
    IDs of the sequences used to root the 'Temporal' initial_tree. These sequences are removed from the fasta file
    and initial tree file used to generate the BEAST 2 xml.

down_sample_to: int, optional
    If provided, the fasta file and initial tree file used to generate the BEAST 2 xml are downsampled to this number of sequences.
    If downsampling occurs, the following are saved in save_dir and used in generating a BEAST 2 xml in phase 4:
        down_sampled_time.nwk: A downsampled temporal tree.
        down_sampled_sequences.fasta: Fasta file containing the downsampled sequences.
        down_sampled_metadata.csv: The downsampled metadata.
```
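
Because collection_date_field is expected to hold dates in the format YYYY-MM-DD, it can help to check the metadata before running TreeTime. The following is a minimal sketch, not part of beast_pype: the helper name `invalid_dates` is hypothetical, and it assumes the dates have already been read in as strings.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of beast_pype): returns the entries that are
# not valid calendar dates written as zero-padded YYYY-MM-DD strings.
def invalid_dates(dates):
    bad = []
    for value in dates:
        # Check the layout first, then check that it is a real calendar date.
        if not re.fullmatch(r'\d{4}-\d{2}-\d{2}', value):
            bad.append(value)
            continue
        try:
            datetime.strptime(value, '%Y-%m-%d')
        except ValueError:
            bad.append(value)
    return bad

example_dates = ['2021-03-04', '2021-03', '04/03/2021', '2021-02-30']
bad_dates = invalid_dates(example_dates)
```

In this example every entry except the first is flagged, either for its layout or because it is not a real calendar date.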


In [None]:
save_dir = 'runs_of_pipeline/2025-02-05'
fasta_file=None
metadata_path=None
sample_id_field = 'strain'
collection_date_field = 'date'
root_strain_names = None
down_sample_to= None

### Import packages and get data if not in save_dir

In [None]:
from beast_pype.tree_time_scale import timescale, temporal_pruning_sampler
from Bio import Phylo, SeqIO
import os
import pandas as pd
import ete3

In [None]:
if metadata_path is None:
    if root_strain_names is None: 
        metadata_path = f'{save_dir}/metadata.csv'
    else:
        metadata_path = f'{save_dir}/metadata_with_root.csv'
    if not os.path.isfile(metadata_path):
        raise FileNotFoundError(f'If metadata_path is not given the save_dir directory `{save_dir}` must contain either the file "metadata.csv" or the file "metadata_with_root.csv" (if root_strain_names is given).')
        
if fasta_file is None:
    if root_strain_names is None: 
        fasta_file = f'{save_dir}/sequences.fasta'
    else:
        fasta_file = f'{save_dir}/sequences_with_root.fasta'
    if not os.path.isfile(fasta_file):
        raise FileNotFoundError(f'If fasta_file is not given the save_dir directory `{save_dir}` must contain either the file "sequences.fasta" or the file "sequences_with_root.fasta" (if root_strain_names is given).')

## Generating time trees

IQ-TREE does not accept the character '/' in sequence names and writes '_' in its place. The code below restores the original names in the tree file.

In [None]:
correction_dict = {seq_record.id.replace('/', '_'): seq_record.id for seq_record in SeqIO.parse(fasta_file, "fasta")}

tree_file = f'{save_dir}/iqtree.treefile'
with open(tree_file) as fh:
    tree = fh.read()
# Restore the original ids wherever '/' was replaced by '_'.
for changed, original in correction_dict.items():
    tree = tree.replace(changed, original)

with open(tree_file, 'w') as oh:
    oh.write(tree)
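
Sequential `str.replace` calls can restore the wrong name when an original id already contains '_' and a sanitised id is a prefix of another. As a hedged alternative, the sketch below restores all names in a single regular-expression pass, trying longer names first. The function `restore_names` and the example ids are hypothetical and not part of beast_pype.

```python
import re

# Hypothetical helper (not part of beast_pype): map '_'-sanitised tip names in
# a Newick string back to their original ids in one pass over the string.
def restore_names(newick, original_ids):
    correction = {name.replace('/', '_'): name for name in original_ids}
    # Longer alternatives come first so the regex prefers the longest match
    # at each position; each stretch of text is replaced at most once.
    alternatives = sorted(correction, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(name) for name in alternatives))
    return pattern.sub(lambda match: correction[match.group(0)], newick)

example_ids = ['A/1', 'A_1/x']                # made-up ids for illustration
sanitised_tree = '(A_1:0.1,A_1_x:0.2);'       # names as written with '_' for '/'
restored_tree = restore_names(sanitised_tree, example_ids)
```

This sketch does not handle two original ids that collapse to the same sanitised name; in that case the mapping itself is ambiguous.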

In [None]:
if root_strain_names is None:
    time_tree, bad_tips = timescale(
        ftree=f'{save_dir}/iqtree.treefile',
        falignment=fasta_file, 
        fdates=metadata_path,
        remove_root=False,
        node_confidence_dir=save_dir,
        sample_id_field=sample_id_field,
        collection_date_field=collection_date_field        
    )
else:
    time_tree, bad_tips = timescale(
        ftree=f'{save_dir}/iqtree.treefile',
        falignment=fasta_file,
        fdates=metadata_path,
        reroot=root_strain_names,
        node_confidence_dir=save_dir,
        sample_id_field=sample_id_field,
        collection_date_field=collection_date_field
    )

Phylo.write(time_tree.tree,
            f'{save_dir}/full_time.nwk',
            format='newick',
            format_branch_length='%1.8f')

## Downsampling time trees

### Obtaining strain ids and sequences.

Below, if the number of tips exceeds down_sample_to, the normalised residuals from the root-to-tip regression above are used as weights in a probabilistic draw to remove leaves from a list of all the tips. The tips that remain are stored in a list. See beast_pype.tree_time_scale.temporal_pruning_sampler for details of the weighted tip-removal method.
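
To illustrate the idea of weighted removal, here is a minimal sketch using only the standard library. It is not the implementation of temporal_pruning_sampler. The function name `weighted_removal`, the use of absolute residuals as weights, and the one-tip-at-a-time draw are all assumptions made for this example.

```python
import random

# Illustrative only (not beast_pype's temporal_pruning_sampler): repeatedly
# draw one tip to remove, with probability proportional to the size of its
# residual, until only sample_size tips are left.
def weighted_removal(residuals, sample_size, seed=None):
    rng = random.Random(seed)
    remaining = dict(residuals)
    while len(remaining) > sample_size:
        tips = list(remaining)
        weights = [abs(remaining[tip]) for tip in tips]
        if sum(weights) == 0:
            # All residuals are zero, so fall back to a uniform draw.
            weights = [1.0] * len(tips)
        dropped = rng.choices(tips, weights=weights, k=1)[0]
        del remaining[dropped]
    return list(remaining)

# Made-up residuals for four tips; 'tip_d' fits the regression exactly.
example_residuals = {'tip_a': 0.1, 'tip_b': 2.0, 'tip_c': 0.5, 'tip_d': 0.0}
kept_tips = weighted_removal(example_residuals, sample_size=2, seed=1)
```

Under these assumptions, tips that deviate most from the root-to-tip regression are the most likely to be removed, while a tip with a residual of zero is only removed once every remaining tip also has a residual of zero.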

In [None]:
if down_sample_to is not None:
    strain_sequences = SeqIO.parse(fasta_file, 'fasta')
    metadata_df = pd.read_csv(metadata_path, parse_dates=[collection_date_field])
    tips = time_tree.tree.get_terminals()
    n_tips = len(tips)
    if down_sample_to < n_tips:
        stuff_to_add = True
        str_down_sample_to = str(down_sample_to)
        sampled_ids = temporal_pruning_sampler(time_tree=time_tree, sample_size=down_sample_to)
        tree = ete3.Tree(f'{save_dir}/full_time.nwk', format=1)
        tree.prune(sampled_ids,  preserve_branch_length=True)
        tree.write(outfile=f'{save_dir}/down_sampled_time.nwk',format=1)
        selected_metadata = metadata_df[metadata_df[sample_id_field].isin(sampled_ids)]
        selected_seqs = [seq_record for seq_record in strain_sequences if seq_record.id in sampled_ids]
        selected_metadata.to_csv(f'{save_dir}/down_sampled_metadata.csv', index=False)
        with open(f'{save_dir}/down_sampled_sequences.fasta', 'w') as handle:
            SeqIO.write(selected_seqs, handle, 'fasta')
