# Birth-Death Skyline (BDSKY) serial workflow


This Workflow Notebook is for running BDSKY serial models in BEAST 2. **The template BEAST 2 xml (template_xml) or ready_to_go_xml provided must be for BDSKY serial!**


```
Parameters
-------------
overall_save_dir: str
    Path to where you are saving all the runs of this pipeline.

specific_run_save_dir: str, default a timestamp of format 'YYYY-MM-DD_hour-min-sec'
    Subdirectory of overall_save_dir in which to save the outputs from this pipeline.
    If None, 'None' or an empty string, a timestamp of format 'YYYY-MM-DD_hour-min-sec' is used instead.

cache_name: str, default 'cache'
    Name to use for the cache directory. Saved within overall_save_dir/specific_run_save_dir but deleted at the end of this
    workflow notebook.

initial_tree_path: str, optional
    Path to initial tree to use in generating a BEAST 2 xml. Should be a .nwk file (Newick format).
    If provided, phases 2i and 2ii are skipped.
    If a distance tree is used, set initial_tree_type to 'Distance'.
    If a temporal tree is used, set initial_tree_type to 'Temporal'.

use_initial_tree: bool, default True
    If False, an initial tree will not be generated, skipping phases 2i and 2ii. As such, in phase 4 BEAST 2 generates its own
    initial tree.

initial_tree_type: str (either 'Distance' or 'Temporal') or None, default 'Temporal'
    Initial tree type to use.
    If 'Distance' and initial_tree_path is not provided, the IQ-TREE tree from Phase-2i-IQTree.ipynb is used for the
    initial tree and phase 2ii is skipped.
    If 'Temporal' and initial_tree_path is not provided, the TreeTime tree from Phase-2ii-TreeTime-and-Down-Sampling.ipynb
    is used for the initial tree.

ready_to_go_xml: str, optional
    Path to a BEAST 2 xml that you wish to run unaltered. If provided phases 2i, 2ii and 3 are skipped.

fasta_file: str
    Path to fasta file containing sequences to use in generating a BEAST 2 xml.

partition: str
    The name of partition to use when launching slurm jobs via `sbatch` in phases 2i and 4.

metadata_path: str
    Path to csv or tsv containing metadata pertaining to fasta_file.

sample_id_field: str, default 'strain'
    Name of field in the metadata file (metadata_path) containing ids corresponding to those used in fasta_file.

collection_date_field: str, default 'date'
    Name of field in the metadata file (metadata_path) containing collection dates of sequences. Should be format YYYY-MM-DD.

root_strain_names: list of strings, optional
    IDs of sequences used to root the 'Temporal' initial tree. These are removed from the fasta file and initial tree file
    used to generate the BEAST 2 xml.

down_sample_to: int, optional
    If provided, the fasta file and initial tree file used to generate the BEAST 2 xml are downsampled to this number of sequences.
    If downsampling occurs, the following are saved in '{overall_save_dir}/{specific_run_save_dir}/' and used in generating
    a BEAST 2 xml in phase 3:
        down_sampled_time.nwk: A downsampled temporal tree.
        down_sampled_sequences.fasta: Fasta file containing downsampled sequences.
        down_sampled_metadata.csv: The downsampled metadata.

template_xml_path: str
    Path to template BEAST 2 xml used to generate the BEAST 2 xml.

rt_dimensions: int, default 6
    Number of Rt dimensions (time periods over which Rt is estimated).

origin_start_addition: float
    This + initial temporal tree height is used as the starting value of the origin.
    We recommend using an estimate of the infection period for the pathogen being studied. **Value should be in years.**
    The origin prior will be uniform:
        Lower value: time in years from oldest to youngest sequence in fasta_file
        Start value: origin_start_addition + initial temporal tree height
        Upper value: initial temporal tree height + origin_upper_addition.

origin_upper_addition: float/int
    This + initial temporal tree height is used as the upper value of the origin prior. **Value should be in years.**
    The origin prior will be uniform:
        Lower value: time in years from oldest to youngest sequence in fasta_file
        Start value: origin_start_addition + initial temporal tree height
        Upper value: initial temporal tree height + origin_upper_addition.

origin_prior: dict {'lower': float, 'upper': float, 'start': float}, optional
    Details of the origin prior. Assumed to be uniformly distributed.

log_file_basename: str, optional
    If provided .tree, .log and .state files from running BEAST 2 will have this name prefixed by 'run-{number}-',
    number being that of the chain.

chain_length: int
    Number of MCMC steps (chain length) to use for each BEAST run.

trace_log_every: int
    How often to save a log file during BEAST runs.

tree_log_every: int
    How often to save a tree file during BEAST runs.

screen_log_every: int
    How often to output to screen during BEAST runs.

store_state_every: int 
    How often to store MCMC state during BEAST runs.

number_of_chains: int
    Number of chains to use (repeated runs to do) when running BEAST.

seeds: list of ints, optional
    Seeds to use when running BEAST. If not provided, seeds are drawn using numpy's randint function
    (integers from 1 to 999,999).

beast_ram: str
    RAM to use for each run of BEAST; see the sbatch documentation.

beast_threads: int
    Threads/CPUs to use for each run of BEAST; see https://www.beast2.org/2021/03/31/command-line-options.html.

burnin_percentage: int
    Percentage burn-in to use.
```
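
As a rough illustration of the uniform origin prior described above, the three values could be computed as follows. This is a minimal sketch: the function `uniform_origin_prior`, its argument names and the example tree height are hypothetical and not part of beast_pype, and the tree height is assumed to already be in years.

```python
from datetime import date

def uniform_origin_prior(collection_dates, tree_height,
                         origin_start_addition, origin_upper_addition):
    """Hypothetical helper mirroring the origin prior described above (all values in years)."""
    # Lower bound: time spanned by the sampled sequences (oldest to youngest).
    span_years = (max(collection_dates) - min(collection_dates)).days / 365.25
    return {
        'lower': span_years,
        'start': tree_height + origin_start_addition,
        'upper': tree_height + origin_upper_addition,
    }

example_prior = uniform_origin_prior(
    collection_dates=[date(2023, 8, 1), date(2023, 11, 1)],
    tree_height=0.4,                    # assumed temporal tree height, in years
    origin_start_addition=10 / 365.25,  # roughly a 10 day infection period
    origin_upper_addition=2,
)
print(example_prior)
```

The returned dictionary has the same keys as `origin_prior`, so a result like this could in principle be passed directly when a temporal initial tree is not being used.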

In [None]:
# Needed in this notebook
overall_save_dir = '../example_runs_of_BSDKY'
specific_run_save_dir=None
cache_name='cache'
initial_tree_path = None
use_initial_tree = True
initial_tree_type = 'Temporal'
ready_to_go_xml = None


# Used in Phase 2i
fasta_file= '../example_data/COVID-19_BA.2.86/sequences.fasta'
partition=None

# Used in Phase 2ii
metadata_path = '../example_data/COVID-19_BA.2.86/metadata.tsv'
sample_id_field='strain'
collection_date_field='date'
root_strain_names=None
down_sample_to=None

# Used in Phase 3:
#fasta_file = 'example_data/COVID-19_BSDKY/sequences.fasta_file'  Defined above
#metadata_path = None Defined above
template_xml_path = '../template_beast_xmls/BDSKY_serial_COVID-19_template.xml'
rt_dimensions=6

# collection_date_field='date' Defined above
origin_start_addition = 10 / 365.25
origin_upper_addition = 2
origin_prior = None
log_file_basename='BDSKY_serial'
chain_length = int(5e5)
trace_log_every = int(5e2)
tree_log_every = int(5e2)
screen_log_every = int(5e2)
store_state_every = int(5e2)

# Used in Phase 4:
number_of_chains = 4
#partition='NMLResearch' Defined above
seeds = None
beast_ram = "32G"
beast_threads=6

## Import libraries and define functions:  

In [None]:
from beast_pype.workflow import check_file_for_phrase
import json
import os
from datetime import datetime
import papermill as pm
from time import perf_counter
import pandas as pd
from numpy.random import randint 
import shutil
from papermill.iorw import load_notebook_node, write_ipynb
from papermill.parameterize import parameterize_notebook
import importlib.resources as importlib_resources

def cell_variables_to_dict(offset=0):
    """
    Convert the variables of a cell to a dictionary, using names as keys.

    Parameters
    ----------
    offset: int, default 0
        How many cells ago:
            0: for current cell.
            1: for the previous called cell.
            2: for the cell before that.
            ......

    Returns
    -------
    dictionary : dict {name_of_variable: value}

    References
    ----------------
    https://stackoverflow.com/questions/46824287/print-all-variables-defined-in-one-jupyter-cell
    
    Note
    -------
    Not sure why, but this needs to be defined within the notebook and not imported.

    """

    import io # for some reason this has to be imported within the function.
    from contextlib import redirect_stdout # for some reason this has to be imported within the function.
    ipy = get_ipython()
    out = io.StringIO()

    with redirect_stdout(out):
        ipy.run_line_magic("history", str(ipy.execution_count - offset))

    #process each line...
    x = out.getvalue().replace(" ", "").split("\n")
    x = [a.split("=")[0] for a in x if "=" in a] #all of the variables in the cell
    g = globals()
    dictionary = {k:g[k] for k in x if k in g}
    return dictionary
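
The name extraction inside `cell_variables_to_dict` can be tried outside of IPython. The sketch below applies the same string handling to a literal piece of cell source; `assigned_names`, `example_source` and `example_namespace` are hypothetical names used only for illustration.

```python
def assigned_names(cell_source):
    """Return the text left of the first '=' on every line that contains one."""
    lines = cell_source.replace(" ", "").split("\n")
    return [line.split("=")[0] for line in lines if "=" in line]

example_source = (
    "chain_length = int(5e5)\n"
    "seeds=None\n"
    "print(seeds)\n"
    "# partition='x' Defined above"
)
names = assigned_names(example_source)
print(names)  # includes '#partition', which is not a real variable name

# Filtering against a namespace (globals() in the notebook) drops such false hits.
example_namespace = {'chain_length': 500000, 'seeds': None}
picked = {k: example_namespace[k] for k in names if k in example_namespace}
print(picked)
```

The filter against the namespace is what keeps commented-out assignments and similar lines from ending up in the recorded parameters.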

### Check parameters are correct.

In [None]:
if ready_to_go_xml is None: # None of the parameters below are needed if a ready_to_go_xml is provided.
    if initial_tree_type not in ['Temporal', 'Distance', None]:
        raise ValueError('initial_tree_type must be either "Temporal" or "Distance" or None.')

    if initial_tree_path is not None and initial_tree_type is None:
        raise ValueError('initial_tree_type must be specified if initial_tree_path is given.')

    if origin_start_addition is not None:
        if initial_tree_type != 'Temporal':
            raise ValueError('origin_start_addition is reliant on the initial_tree_type being "Temporal".')
        if origin_upper_addition is None:
            raise ValueError('origin_start_addition is reliant on origin_upper_addition being given as well.')

    if origin_upper_addition is not None:
        if initial_tree_type != 'Temporal':
            raise ValueError('origin_upper_addition is reliant on the initial_tree_type being "Temporal".')
        if origin_start_addition is None:
            raise ValueError('origin_upper_addition is reliant on origin_start_addition being given as well.')

    if initial_tree_type in ['Distance', None]:
        if origin_prior is None:
            raise ValueError('If initial_tree_type is "Distance" or an initial tree is not being used, an origin_prior must be given.')
        if origin_start_addition is not None:
            raise ValueError('If initial_tree_type is "Distance" or an initial tree is not being used, an origin_start_addition should NOT be given.')
        if origin_upper_addition is not None:
            raise ValueError('If initial_tree_type is "Distance" or an initial tree is not being used, an origin_upper_addition should NOT be given.')

    if down_sample_to is not None and (initial_tree_path is not None or initial_tree_type == 'Distance'):
        raise ValueError("Currently beast_pype's down_sampling method is tied to its TreeTime tree building. " +
                         "Therefore, to use this down sampling method an initial_tree_path should not be given and initial_tree_type should be set to 'Temporal'.")

if partition is None:
    raise ValueError('The name of partition to use when launching slurm jobs via `sbatch` in phases 2i and 4, must be given.')
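
The intended rules for the origin-related parameters can be summarised in one function. This is a sketch of how those rules appear to fit together rather than beast_pype code; `origin_setting_errors` is a hypothetical name and it collects messages instead of raising.

```python
def origin_setting_errors(initial_tree_type, origin_start_addition,
                          origin_upper_addition, origin_prior):
    """Hypothetical summary of the origin checks; returns a list of problems found."""
    errors = []
    if initial_tree_type == 'Temporal':
        # The two additions are only meaningful as a pair.
        if (origin_start_addition is None) != (origin_upper_addition is None):
            errors.append('origin_start_addition and origin_upper_addition must be given together.')
    else:  # 'Distance' or None: there is no temporal tree height to add to.
        if origin_prior is None:
            errors.append('origin_prior must be given.')
        if origin_start_addition is not None or origin_upper_addition is not None:
            errors.append('origin_start_addition and origin_upper_addition should not be given.')
    return errors

print(origin_setting_errors('Temporal', 10 / 365.25, 2, None))
print(origin_setting_errors('Distance', 10 / 365.25, 2, None))
```

With a 'Temporal' initial tree the two additions are enough; with a 'Distance' tree or no initial tree an explicit `origin_prior` such as `{'lower': 0.1, 'start': 0.5, 'upper': 2.0}` is needed instead.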


Record parameters in dictionary

In [None]:
parameters = cell_variables_to_dict(offset=3)

### Creating Folders and Subfolders 

In [None]:
if not os.path.exists(overall_save_dir):
    os.makedirs(overall_save_dir)

If specific_run_save_dir was not provided, use a timestamp in its place. Then define save_dir and cache_dir.

In [None]:
if specific_run_save_dir is None or specific_run_save_dir in ['', 'None']:
    now = datetime.now()
    specific_run_save_dir = now.strftime('%Y-%m-%d_%H-%M-%S')

save_dir = overall_save_dir +'/'+ specific_run_save_dir
cache_dir = f'{save_dir}/{cache_name}'
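
To see what the resulting directory names look like, the same logic can be wrapped in a function and given a fixed time. `resolve_run_dirs` and the example paths are hypothetical and only used for illustration.

```python
from datetime import datetime

def resolve_run_dirs(overall_save_dir, specific_run_save_dir, cache_name, now=None):
    """Hypothetical helper: returns (save_dir, cache_dir) following the logic above."""
    if specific_run_save_dir in (None, '', 'None'):
        now = now or datetime.now()
        specific_run_save_dir = now.strftime('%Y-%m-%d_%H-%M-%S')
    save_dir = f'{overall_save_dir}/{specific_run_save_dir}'
    return save_dir, f'{save_dir}/{cache_name}'

save_dir_example, cache_dir_example = resolve_run_dirs(
    '../example_runs', None, 'cache', now=datetime(2024, 3, 5, 14, 7, 9))
print(save_dir_example)   # ../example_runs/2024-03-05_14-07-09
print(cache_dir_example)  # ../example_runs/2024-03-05_14-07-09/cache
```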

If save_dir and cache_dir do not exist, create them.

In [None]:
for folder in [save_dir, cache_dir]:
    if not os.path.exists(folder):
        os.makedirs(folder)

Start recording pipeline_run_info.

In [None]:
pipeline_run_info = {'parameters': parameters}

with open(save_dir + '/pipeline_run_info.json', 'w') as fp:
    json.dump(pipeline_run_info, fp, sort_keys=True, indent=4)

### Placing Common Parameters in a Dictionary 

In [None]:
common_params = {
    'save_dir' : save_dir,
    'cache_dir' : cache_dir
}

### Creating a record for runtimes

This list of dictionaries will be turned into a pandas DataFrame and saved as a csv at the end of this notebook.

In [None]:
runtime_records = []
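
Each entry appended to this list has the keys 'Phase', 'Sample', 'Chain' and 'Runtime'. One possible way to build such entries without keeping a separate start variable per phase is a small context manager. This is only a sketch; `record_runtime` is a hypothetical helper, not something beast_pype provides.

```python
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def record_runtime(records, phase, sample=None, chain=None):
    """Hypothetical helper: append one runtime record when the block finishes."""
    start = perf_counter()
    try:
        yield
    finally:
        records.append({'Phase': phase, 'Sample': sample, 'Chain': chain,
                        'Runtime': perf_counter() - start})

example_records = []
with record_runtime(example_records, 'Phase-3-Gen-BDSKY-xml.ipynb'):
    sum(range(1000))  # stand-in for the work being timed
print(example_records)
```

Because the start time lives inside the context manager, a record cannot accidentally be computed from another phase's start time.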

### Set path to workflow modules

In [None]:
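# files() returns a path-like object that can be used in f-strings; path() returns a context manager instead.
workflow_modules = importlib_resources.files('beast_pype') / 'workflow_modules'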

## Phase 2: Data Pre-Processing
### Phase 2i: Building an IQ-TREE tree.
#### Placing Phase 2i Parameters in a Dictionary

In [None]:
if ready_to_go_xml is None and \
    use_initial_tree and \
        (initial_tree_path is None and initial_tree_type is not None):
    phase_2i_start = perf_counter()
    phase_2i_params = {**common_params,
                       **{val_name: eval(val_name) for val_name in ['fasta_file' , 'partition']}
                     }
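
The `eval(val_name)` pattern used here looks each name up in the notebook's namespace. An equivalent lookup without `eval` could read the values from a dictionary such as `globals()`. The following is a sketch with hypothetical names (`pick_params`, `example_namespace`) and made-up values.

```python
def pick_params(namespace, names):
    """Hypothetical helper: look up each name in a namespace (e.g. globals())."""
    return {name: namespace[name] for name in names}

example_namespace = {'fasta_file': 'sequences.fasta', 'partition': 'my_partition', 'unused': 1}
example_params = {'save_dir': 'runs/demo',
                  **pick_params(example_namespace, ['fasta_file', 'partition'])}
print(example_params)
```

A misspelled name raises a `KeyError` naming the missing key, which is the main practical difference from `eval`.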

#### Running Phase 2i.

In [None]:
#papermill_description=Phase-2i-IQTree.ipynb
if ready_to_go_xml is None and \
    use_initial_tree and \
        (initial_tree_path is None and initial_tree_type is not None):
    phase_2i_log = pm.execute_notebook(input_path=f'{workflow_modules}/Phase-2i-IQTree.ipynb',
                                      output_path=save_dir + '/Phase-2i-IQTree.ipynb',
                                      parameters=phase_2i_params,
                                      progress_bar=True,
                                      nest_asyncio=True
                                     )


### Wait for IQtree to be Built

In [None]:
if ready_to_go_xml is None and \
    use_initial_tree and \
        (initial_tree_path is None and initial_tree_type is not None):
    out_file =  f'{save_dir}/iqtree.out'
    check_file_for_phrase(out_file)

    runtime_records.append({
        'Phase': 'Phase-2i-IQTree.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2i_start
    })
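
The implementation of `check_file_for_phrase` is not shown in this notebook. As a guess at the kind of waiting involved, the sketch below polls a file until a phrase appears or a timeout passes. The function `wait_for_phrase`, its arguments and the phrase 'job finished' are all assumptions, not beast_pype's API.

```python
import os
import tempfile
import time
from pathlib import Path

def wait_for_phrase(path, phrase, poll_seconds=1.0, timeout_seconds=60.0):
    """Hypothetical polling loop: True once `phrase` is in the file, False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        file = Path(path)
        if file.exists() and phrase in file.read_text():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)

with tempfile.TemporaryDirectory() as tmp:
    out_file_example = os.path.join(tmp, 'job.out')
    Path(out_file_example).write_text('step 1\njob finished\n')
    found = wait_for_phrase(out_file_example, 'job finished',
                            poll_seconds=0.01, timeout_seconds=1)
    missing = wait_for_phrase(out_file_example, 'never written',
                              poll_seconds=0.01, timeout_seconds=0.05)
print(found, missing)
```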

### Phase 2ii: Building a TreeTime tree and Downsampling.
#### Placing Phase 2ii Parameters in a Dictionary

In [None]:
if ready_to_go_xml is None and \
    use_initial_tree and \
        (initial_tree_path is None and initial_tree_type =='Temporal'):
    phase_2ii_start = perf_counter()
    phase_2ii_params = {**common_params,
                        **{val_name: eval(val_name) for val_name in ['fasta_file',
                                                                     'metadata_path',
                                                                     'sample_id_field',
                                                                     'collection_date_field',
                                                                     'down_sample_to',
                                                                     'root_strain_names']}
                     }

#### Running Phase 2ii.

In [None]:
#papermill_description=Phase-2ii-TreeTime-and-Down-Sampling.ipynb
if ready_to_go_xml is None and \
    use_initial_tree and \
        (initial_tree_path is None and initial_tree_type =='Temporal'):
    phase_2ii_log = pm.execute_notebook(input_path=f'{workflow_modules}/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                      output_path=save_dir + '/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                      parameters=phase_2ii_params,
                                      progress_bar=True,
                                      nest_asyncio=True
                                     )

    runtime_records.append({
        'Phase': 'Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2ii_start
    })

## Phase 3 Generating BEAST xmls

### Placing Phase 3 Parameters in a Dictionary

In [None]:
if ready_to_go_xml is None:
    phase_3_start = perf_counter()

    phase_3_params = {
        **common_params,
        **{val_name: eval(val_name) for val_name in [
            'template_xml_path',
            'use_initial_tree',
            'initial_tree_path',
            'rt_dimensions',
            'collection_date_field',
            'origin_start_addition',
            'origin_upper_addition',
            'log_file_basename',
            'origin_prior',
            'chain_length',
            'trace_log_every',
            'tree_log_every',
            'screen_log_every',
            'store_state_every']}}

    if down_sample_to is None:
        phase_3_params['fasta_file'] = fasta_file
        phase_3_params['metadata_path'] = metadata_path
    else:
        phase_3_params['fasta_file'] = f'{save_dir}/down_sampled_sequences.fasta'
        phase_3_params['metadata_path'] =  f'{save_dir}/down_sampled_metadata.csv'

### Running Phase 3.

In [None]:
#papermill_description=Phase-3-Gen-BDSKY-xml.ipynb
if ready_to_go_xml is None:
    phase_3_log = pm.execute_notebook(input_path=f'{workflow_modules}/Phase-3-Gen-BDSKY-xml.ipynb',
                                      output_path=save_dir + '/Phase-3-Gen-BDSKY-xml.ipynb',
                                      parameters=phase_3_params,
                                      progress_bar=True,
                                      nest_asyncio=True)
    runtime_records.append({
        'Phase': 'Phase-3-Gen-BDSKY-xml.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_3_start
    })

## Phase 4 Running BEAST

BEAST's random number seed selection can pick the same seed for multiple runs if they are launched close together in time (such as programmatically). Therefore, let's use numpy to generate our seeds.

In [None]:
if seeds is None:
    number_of_seeds=number_of_chains
    seeds = randint(low=1, high=int(1e6), size=number_of_seeds).tolist()
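
Note that `randint` draws with replacement, so two chains could in principle receive the same seed, although this is unlikely with so few chains. If distinct seeds are wanted, one alternative is to sample without replacement. The sketch below uses the standard library instead of numpy; `draw_distinct_seeds` is a hypothetical helper.

```python
import random

def draw_distinct_seeds(number_of_seeds, low=1, high=1_000_000, rng=None):
    """Hypothetical alternative: draw seeds without replacement from [low, high)."""
    rng = rng or random.Random()
    return rng.sample(range(low, high), k=number_of_seeds)

example_seeds = draw_distinct_seeds(4, rng=random.Random(2024))
print(example_seeds)
```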

Record seeds in pipeline_run_info json

In [None]:
with open(save_dir + '/pipeline_run_info.json', 'r') as file:
    pipeline_run_info = json.load(file)
pipeline_run_info['seeds'] = seeds
with open(save_dir + '/pipeline_run_info.json', 'w') as fp:
    json.dump(pipeline_run_info, fp, sort_keys=True, indent=4)

### If ready_to_go_xml was provided save a copy for use as f'{save_dir}/beast.xml'.

In [None]:
if ready_to_go_xml is not None:
    shutil.copy(ready_to_go_xml, f'{save_dir}/beast.xml')

### Placing Phase 4 Parameters in a Dictionary

In [None]:
phase_4_params = {**common_params,
                  'number_of_chains': number_of_chains,
                  'seeds':seeds,
                  'partition': partition,
                  'beast_threads':beast_threads,
                  'beast_ram':beast_ram}

### Running Phase 4.

In [None]:
#papermill_description=Phase-4-Running-BEAST.ipynb
phase_4_log = pm.execute_notebook(input_path=f'{workflow_modules}/Phase-4-Running-BEAST.ipynb',
                                  output_path=save_dir + '/Phase-4-Running-BEAST.ipynb',
                                  parameters=phase_4_params,
                                  progress_bar=True,
                                  nest_asyncio=True)

## Add Slurm Job IDs and Names to pipeline_run_info.json.

In [None]:
with open(save_dir + '/pipeline_run_info.json', 'r') as file:
    pipeline_run_info = json.load(file)
with open(cache_dir + '/slurm_job_ids.txt', 'r') as fp:
    entries = fp.read().splitlines()
pipeline_run_info['slurm job IDs'] = entries
with open(save_dir + '/pipeline_run_info.json', 'w') as fp:
    json.dump(pipeline_run_info, fp, sort_keys=True, indent=4)

## Phase 5: Diagnosing Outputs and Generate Report

Currently this has to be performed manually. That being said, the code cell below will parameterize a copy of the notebook so that it is ready to run. The cell prints the location of the copy.

In [None]:
phase_5_params = {'save_dir': None,
                  'report_template': str(importlib_resources.files('beast_pype') / 'report_templates' / 'BDSKY-Serial-Report.ipynb'),
                  'add_unreported_fields': True,
                  'metadata_path':os.path.abspath(metadata_path),
                  } # in this case the metadata_path needs to be absolute.

phase_5_notebook = load_notebook_node(f'{workflow_modules}/Phase-5-Diagnosing_Outputs_and_Generate_Report.ipynb')
phase_5_notebook = parameterize_notebook(phase_5_notebook, phase_5_params)
write_ipynb(phase_5_notebook, f'{save_dir}/Phase-5-Diagnosing_Outputs_and_Generate_Report.ipynb')
print(f'Phase 5 notebook is ready for manual use at: \n{save_dir}/Phase-5-Diagnosing_Outputs_and_Generate_Report.ipynb')

## Recording Runtimes

Converting to pandas DataFrame and saving as CSV.

In [None]:
runtime_df = pd.DataFrame.from_records(runtime_records)
runtime_df.sort_values(['Phase', 'Sample', 'Runtime'], inplace=True)
runtime_df.to_csv(save_dir + "/runtimes.csv", index=False)

### Delete Cache directory

In [None]:
shutil.rmtree(cache_dir)