<a href="https://colab.research.google.com/github/lauraluebbert/delphy_workflows/blob/main/delphy_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running [Delphy](https://delphy.fathom.info/) is as simple as 1, 2, 3
___
___

# 1. Apply filters to download sequences from [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/):

In [2]:
#@title NCBI Virus filtering options:

def arg_str_to_bool(arg):
  if arg == "True":
      return True
  elif arg == "False":
      return False
  elif arg == "None" or arg == "":
      return None
  else:
      return arg

#@markdown ## Virus

virus = 'dengue virus type 4'  #@param {type:"string"}
#@markdown  - Example: 'Mammarenavirus lassaense' or 'coronaviridae' or 'NC_045512.2' or '142786' (Norovirus taxid).
virus = arg_str_to_bool(virus)

accession = False   #@param {type:"boolean"}
#@markdown  - Check this box if `virus` argument above is an NCBI accession (starts with 'NC'), e.g. 'NC_045512.2'.

#@markdown ## Host

host = 'homo sapiens'  #@param {type:"string"}
#@markdown  - Example: 'homo sapiens' (alternative: use the `host_taxid` filter). Input 'None' to disable this filter.
host = arg_str_to_bool(host)

host_taxid = None  #@param {type:"raw"}
#@markdown  - NCBI Taxonomy ID of host (e.g., 9443 for primates).
host_taxid = arg_str_to_bool(host_taxid)

#@markdown ## Sequence completeness

annotated = "None"   #@param ["True", "False", "None"]
#@markdown  - Indicates whether the sequences should be marked as 'annotated'.
annotated = arg_str_to_bool(annotated)

nuc_completeness = "None"  #@param ["None", "complete", "partial"]
#@markdown  - Choose between 'partial' or 'complete' nucleotide completeness.
nuc_completeness = arg_str_to_bool(nuc_completeness)

min_seq_length = None  #@param {type:"raw"}
#@markdown  - Minimum sequence length, e.g. 6252.
min_seq_length = arg_str_to_bool(min_seq_length)

max_seq_length = None  #@param {type:"raw"}
#@markdown  - Maximum sequence length, e.g. 7815.
max_seq_length = arg_str_to_bool(max_seq_length)

has_proteins = None  #@param {type:"raw"}
#@markdown  - Require sequences to contain specific proteins (e.g. input 'GPC' or 'L' - include the quotation marks for this filter) or a list of proteins (e.g. input ['GPC', 'L']). Also accepts names of genes or segments.
has_proteins = arg_str_to_bool(has_proteins)

proteins_complete = "None"   #@param ["True", "False", "None"]
#@markdown  - Set to 'True' if the proteins/genes/segments in `has_proteins` should be marked as complete.
proteins_complete = arg_str_to_bool(proteins_complete)

max_ambiguous_chars = None  #@param {type:"raw"}
#@markdown  - Maximum number of 'N' characters allowed in each sequence, e.g. 10.
max_ambiguous_chars = arg_str_to_bool(max_ambiguous_chars)

#@markdown ## Gene/peptide/protein counts

min_gene_count = None  #@param {type:"raw"}
#@markdown  - Minimum gene count, e.g. 1.
min_gene_count = arg_str_to_bool(min_gene_count)

max_gene_count = None  #@param {type:"raw"}
#@markdown  - Maximum gene count, e.g. 40.
max_gene_count = arg_str_to_bool(max_gene_count)

min_mature_peptide_count = None  #@param {type:"raw"}
#@markdown  - Minimum peptide count, e.g. 2.
min_mature_peptide_count = arg_str_to_bool(min_mature_peptide_count)

max_mature_peptide_count = None  #@param {type:"raw"}
#@markdown  - Maximum peptide count, e.g. 15.
max_mature_peptide_count = arg_str_to_bool(max_mature_peptide_count)

min_protein_count = None  #@param {type:"raw"}
#@markdown  - Minimum protein count, e.g. 2.
min_protein_count = arg_str_to_bool(min_protein_count)

max_protein_count = None  #@param {type:"raw"}
#@markdown  - Maximum protein count, e.g. 10.
max_protein_count = arg_str_to_bool(max_protein_count)

#@markdown ## Geographic location

geographic_location = None  #@param {type:"string"}
#@markdown  - Geographic location of sample collection, e.g. 'South Africa' or 'Germany'.
geographic_location = arg_str_to_bool(geographic_location)

geographic_region = None  #@param {type:"string"}
#@markdown  - Geographic region of sample collection, e.g. 'Africa' or 'Europe'.
geographic_region = arg_str_to_bool(geographic_region)

#@markdown ## Dates

min_collection_date = None  #@param {type:"string"}
#@markdown  - Minimum collection date, e.g. '2000-01-01'.
min_collection_date = arg_str_to_bool(min_collection_date)

max_collection_date = None  #@param {type:"string"}
#@markdown  - Maximum collection date, e.g. '2014-12-04'.
max_collection_date = arg_str_to_bool(max_collection_date)

min_release_date = None  #@param {type:"string"}
#@markdown  - Minimum release date of the sequences, e.g. '2000-01-01'.
min_release_date = arg_str_to_bool(min_release_date)

max_release_date = None  #@param {type:"string"}
#@markdown  - Maximum release date of the sequences, e.g. '2014-12-04'.
max_release_date = arg_str_to_bool(max_release_date)

#@markdown ## Source

submitter_country = None  #@param {type:"string"}
#@markdown  - Country that submitted the sequence, e.g. 'South Africa' or 'Germany'.
submitter_country = arg_str_to_bool(submitter_country)

lab_passaged = "None"   #@param ["True", "False", "None"]
#@markdown  - Set to True to return sequences that have been passaged in a laboratory setting.
lab_passaged = arg_str_to_bool(lab_passaged)

source_database = None  #@param {type:"string"}
#@markdown  - Source database of the sequence, e.g. 'GenBank' or 'RefSeq'.
source_database = arg_str_to_bool(source_database)
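
The form fields above are all passed through `arg_str_to_bool`. As a minimal illustration (the function body is copied from the cell above; the sample inputs are made up for this example), the sketch below shows which values are converted and which pass through unchanged:

```python
def arg_str_to_bool(arg):
    # Same logic as in the notebook cell above
    if arg == "True":
        return True
    elif arg == "False":
        return False
    elif arg == "None" or arg == "":
        return None
    else:
        return arg

# Illustrative inputs: dropdown strings, an empty text field, free text, and raw values
examples = ["True", "False", "None", "", "homo sapiens", 9443, None]
results = [arg_str_to_bool(value) for value in examples]

for value, result in zip(examples, results):
    print(f"{value!r} -> {result!r}")
```

Values entered in `raw` fields arrive as Python objects rather than strings, so they are returned unchanged.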

# 2. Optional: Upload a fasta file with your own sequences to add to the analysis
  **1) Click on the folder icon on the left.  
  2) Upload your file(s) to the Google Colab server by dragging in your file(s) (or right-click -> Upload).  
  3) Specify the name of your file(s) here:**

In [3]:
#@title FASTA file containing additional sequences

fasta_file = None  #@param {type:"string"}
#@markdown  - Example: 'my_fasta_file.fa' or 'my_fasta_file.fasta'.


In [4]:
#@title Metadata

#@markdown **Option 1: The metadata is the same for all sequences in your FASTA file**
metadata = {'Collection Date': 'YYYY-MM-DD', 'Geo Location': 'South Korea'}  #@param {type:"raw"}
#@markdown - The 'Collection Date' field is required. Optional: you can add as many additional columns as you wish, e.g. 'Geo Location': 'South Korea'.
#@markdown - NOTE: Use NCBI column names where applicable (see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus for example column names)

#@markdown **Option 2: Input a CSV file with metadata for each sequence**
metadata_csv = None  #@param {type:"string"}
#@markdown  - Example: 'my_metadata.csv'. This file must include at least an 'Accession' and 'Collection Date' column.
#@markdown  - NOTE: Make sure the IDs in the "Accession" column match the IDs of the sequences in the provided FASTA file
#@markdown  - NOTE: Use NCBI column names where applicable (see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus for example column names)

# Convert empty strings to None
fasta_file = arg_str_to_bool(fasta_file)
metadata_csv = arg_str_to_bool(metadata_csv)
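
Before running the notebook with Option 2, it may help to confirm that the metadata CSV has the required columns and that its IDs match the FASTA headers. The following is a rough sketch using only the standard library; `check_metadata` and the sample data are hypothetical and not part of this notebook or of gget:

```python
import csv
import io

REQUIRED_COLUMNS = {"Accession", "Collection Date"}

def check_metadata(fasta_text, csv_text):
    """Return (missing required columns, FASTA IDs without a metadata row)."""
    # A FASTA ID is the first whitespace-delimited token after '>'
    fasta_ids = [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]
    reader = csv.DictReader(io.StringIO(csv_text))
    missing_columns = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    csv_ids = {row.get("Accession") for row in reader}
    unmatched = [seq_id for seq_id in fasta_ids if seq_id not in csv_ids]
    return sorted(missing_columns), unmatched

# Made-up example inputs
fasta_text = ">sample_1 my first sequence\nACGT\n>sample_2\nACGA\n"
csv_text = "Accession,Collection Date\nsample_1,2020-05-17\n"

missing, unmatched = check_metadata(fasta_text, csv_text)
print("Missing columns:", missing)
print("FASTA IDs without metadata:", unmatched)
```

In this example, `sample_2` has no row in the CSV, so it would be reported as unmatched.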

# 3. Click `Runtime` at the top of this notebook, then click `Run all` and lean back
A completion message will be displayed below once the notebook has been successfully executed.  
💡 Tip: Click on the folder icon on the left to view/download the files that are being generated.
  
<br>

____
____
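
The cell below rewrites each FASTA header to the `accession|YYYY-MM-DD` format required by Delphy, filling in `01` for a missing month or day. The following is a simplified sketch of that date handling, not the notebook's exact code: it uses a single regular expression with optional groups, and `normalize_date` is a hypothetical name chosen for this example.

```python
import re
from datetime import datetime

# Year, then an optional month, then an optional day, separated by '-', '/' or '.'
DATE_PATTERN = re.compile(
    r"(?P<year>\d{4})(?:[-/.](?P<month>\d{1,2})(?:[-/.](?P<day>\d{1,2}))?)?"
)

def normalize_date(value, default_month="01", default_day="01"):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed or is invalid."""
    match = DATE_PATTERN.search(str(value))  # str() so a numeric year also works
    if not match:
        return None
    year = match.group("year")
    month = (match.group("month") or default_month).zfill(2)
    day = (match.group("day") or default_day).zfill(2)
    candidate = f"{year}-{month}-{day}"
    try:
        # Reject impossible dates such as 2014-02-30
        datetime.strptime(candidate, "%Y-%m-%d")
    except ValueError:
        return None
    return candidate

for raw in ["2014-12-04", "2014/7", "2014", 2014, "unknown", "2014-02-30"]:
    print(f"{raw!r} -> {normalize_date(raw)!r}")
```

Sequences whose date cannot be normalized are skipped by the notebook rather than passed to Delphy.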

In [None]:
#@title # Generating tree...

print("1/5 Installing software...")
# Install gget
# After the release, this will just be: pip install gget (dependence on biopython will be removed)
# Quote the version specifier so the shell does not treat '>' as an output redirect
!pip install -q "mysql-connector-python>=8.0.32" biopython
!pip install -q git+https://github.com/pachterlab/gget.git@delphy_dev

import gget
from Bio import SeqIO
import pandas as pd
import re
from datetime import datetime

# Delphy threads
threads = 2

print("1/5 Software installation complete.")

# Downloading virus genomes from NCBI Virus
print("2/5 Download data from NCBI Virus... This might take a minute depending on the internet connection and how busy the NCBI server is.")
gget.ncbi_virus(
    virus = virus,
    accession = accession,
    host = host,
    min_seq_length = min_seq_length,
    max_seq_length = max_seq_length,
    min_gene_count = min_gene_count,
    max_gene_count = max_gene_count,
    nuc_completeness = nuc_completeness,
    has_proteins = has_proteins,
    proteins_complete = proteins_complete,
    host_taxid = host_taxid,
    lab_passaged = lab_passaged,
    geographic_region = geographic_region,
    geographic_location = geographic_location,
    submitter_country = submitter_country,
    min_collection_date = min_collection_date,
    max_collection_date = max_collection_date,
    annotated = annotated,
    source_database = source_database,
    min_release_date = min_release_date,
    max_release_date = max_release_date,
    min_mature_peptide_count = min_mature_peptide_count,
    max_mature_peptide_count = max_mature_peptide_count,
    min_protein_count = min_protein_count,
    max_protein_count = max_protein_count,
    max_ambiguous_chars = max_ambiguous_chars
)
print("2/5 Data download from NCBI virus complete.")

# Merging sequence and metadata files if additional file(s) were provided
ncbi_fasta_file = f"{'_'.join(str(virus).split(' '))}_sequences.fasta"
ncbi_metadata = f"{'_'.join(str(virus).split(' '))}_metadata.csv"

if fasta_file:
  print("Adding user-provided fasta file and metadata to the data from NCBI Virus...")

  # Combine sequence files
  combined_fasta_file = f"{'_'.join(str(virus).split(' '))}_sequences_combined.fasta"
  !cat $ncbi_fasta_file $fasta_file > $combined_fasta_file
  input_fasta_file = combined_fasta_file

  # Combine metadata
  combined_metadata_file = f"{'_'.join(str(virus).split(' '))}_metadata_combined.csv"
  ncbi_metadata_df = pd.read_csv(ncbi_metadata)
  if metadata_csv:
    # Combine provided metadata and NCBI metadata csv files
    user_metada_df = pd.read_csv(metadata_csv)
    comb_meta_df = pd.concat([ncbi_metadata_df, user_metada_df])
    comb_meta_df.to_csv(combined_metadata_file, index=False)
    metadata_file = combined_metadata_file

  else:
    # Extract sequence accessions from the provided FASTA file
    headers = [record.id.split(" ")[0] for record in SeqIO.parse(fasta_file, "fasta")]

    # Create a metadata dataframe with the accessions from the FASTA file and the provided metadata
    user_metada_df = pd.DataFrame(headers, columns=["Accession"])
    for key, value in metadata.items():
      user_metada_df[key] = value

    # Combine with NCBI metadata
    comb_meta_df = pd.concat([ncbi_metadata_df, user_metada_df])
    comb_meta_df.to_csv(combined_metadata_file, index=False)
    metadata_file = combined_metadata_file

  print("Merging user-provided and NCBI Virus data complete.")

else:
  input_fasta_file = ncbi_fasta_file
  metadata_file = ncbi_metadata

# Create MSA
print("3/5 Multiple Sequence Alignment (MSA): Aligning the sequences to each other so they are all in the same frame...")

aligned_fasta_file = f"{'_'.join(str(virus).split(' '))}_aligned.afa"

# # Option 1: Using the MUSCLE algorithm (this works well for a few hundred sequences, but is too slow when dealing with a few thousand sequences)
# gget.muscle(input_fasta_file, super5=True, out=aligned_fasta_file)

# Option 2: Using mafft
# TO-DO: Wrap the following code into gget module and replace with command `gget.mafft(input_fasta_file, out=aligned_fasta_file)`

# Install MAFFT
!apt-get install -qq -y mafft

# Aligning sequences to each other using mafft (output path: aligned_fasta_file, defined above)
!mafft \
  --quiet \
  --auto \
  --thread $threads \
  $input_fasta_file > $aligned_fasta_file

print("3/5 MSA complete.")


# TO-DO: Wrap the following code into gget module and replace with command `gget.delphy(aligned_fasta_file, metadata_file)`

# Adjust the headers in the aligned fasta file to match header format required by Delphy (accession|YYYY-MM-DD):
print("4/5 Reformatting sequence files to match Delphy format...")

# Reformat collection date
default_day = '01'
default_month = '01'
def extract_and_format_date(date_string):
    # Define regular expressions for various date formats
    year_only = re.compile(r'(?P<year>\d{4})')
    year_month = re.compile(r'(?P<year>\d{4})[-/.](?P<month>\d{1,2})')
    full_date = re.compile(r'(?P<year>\d{4})[-/.](?P<month>\d{1,2})[-/.](?P<day>\d{1,2})')

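    # Cast to str so that numeric values (e.g. a bare year read as a number) can be matched
    date_string = str(date_string)

    # Try to match the full date first
    match = full_date.search(date_string)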
    if match:
        year = match.group('year')
        month = match.group('month').zfill(2)
        day = match.group('day').zfill(2)
    else:
        # Try to match year and month
        match = year_month.search(date_string)
        if match:
            year = match.group('year')
            month = match.group('month').zfill(2)
            day = default_day
        else:
            # Try to match only the year
            match = year_only.search(date_string)
            if match:
                year = match.group('year')
                month = default_month
                day = default_day
            else:
                # If no match, return None
                return None

    # Format the extracted date into YYYY-MM-DD
    formatted_date = f"{year}-{month}-{day}"

    try:
        # Validate date by trying to convert it to a datetime object
        datetime.strptime(formatted_date, '%Y-%m-%d')
    except ValueError:
        return None  # Return None if the date is invalid

    return formatted_date

def update_fasta_headers(fasta_file, csv_file, output_fasta):
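    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(csv_file)

    # User-provided metadata uses an 'Accession' column while this lookup reads 'accession';
    # fold the two together so user-provided sequences are not skipped
    if 'Accession' in df.columns:
        if 'accession' in df.columns:
            df['accession'] = df['accession'].fillna(df['Accession'])
        else:
            df['accession'] = df['Accession']

    # Create a dictionary from the DataFrame to map accession to date
    accession_to_date = pd.Series(df['Collection Date'].values, index=df['accession']).to_dict()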

    # Open the input fasta file and output fasta file
    with open(fasta_file) as fasta_input, open(output_fasta, 'w') as fasta_output:
        # Iterate through each sequence record in the fasta file
        for record in SeqIO.parse(fasta_input, 'fasta'):
            accession = record.id

            # Check if the accession is in the pandas dictionary and has a non-NaN date
            if accession in accession_to_date and pd.notna(accession_to_date[accession]):
                date = accession_to_date[accession]

                # Format the date if necessary
                formatted_date = extract_and_format_date(date)

                if formatted_date is None:
                    # Skip the entry if the date format is not recognized or the date is invalid
                    print(f"Skipping accession {accession} due to unrecognized date format: '{date}'")
                    continue

                # Update the seq header
                record.id = f"{accession}|{formatted_date}"
                record.description = ''  # Remove the original description to avoid duplication
            else:
                # Skip the entry if date is NaN or accession not found
                print(f"Skipping accession {accession} due to missing or NaN date.")
                continue

            # Write the updated record to the output fasta file
            SeqIO.write(record, fasta_output, 'fasta')

aligned_fasta_file_clean = f"{'_'.join(str(virus).split(' '))}_aligned_headers_adjusted.afa"
update_fasta_headers(aligned_fasta_file, metadata_file, aligned_fasta_file_clean)

print("4/5 Reformatting complete.")

# Run Delphy
print("5/5 Running Delphy...")

# Download delphy binary
!wget https://github.com/broadinstitute/delphy/releases/download/0.9995/delphy-ubuntu-x86_64

# Give permissions
!chmod u+x ./delphy-ubuntu-x86_64

beast_log_out = f"{'_'.join(str(virus).split(' '))}_delphy_beast_log.txt"
delphy_beast_tree_out = f"{'_'.join(str(virus).split(' '))}_delphy_beast_tree.nwk"
dphy_out = f"{'_'.join(str(virus).split(' '))}_delphy_out.dphy"

!./delphy-ubuntu-x86_64 \
  --v0-threads $threads \
  --v0-in-fasta $aligned_fasta_file_clean \
  --v0-out-log-file $beast_log_out \
  --v0-out-trees-file $delphy_beast_tree_out \
  --v0-out-delphy-file $dphy_out


# Display a message when done
from IPython.display import HTML, display

def done_message():
    display(HTML("""
    <h1>All done! 🎉</h1>
    <h3>To download the files we generated in this notebook to your local computer, click on the folder icon on the left and download files by right clicking a file of interest and selecting 'Download'.</h3>
    <h3>To further visualize your Delphy output, upload the <code>.dphy</code> file to <a href='https://delphy.fathom.info/' target='_blank'>https://delphy.fathom.info/</a></h3>
    """))

done_message()

Installing software...
  Preparing metadata (setup.py) ... done
  Building wheel for gget (setup.py) ... done
Software installation complete.
Download data from NCBI Virus... This might take a minute depending on the internet connection and how busy the NCBI server is.


New version of client (16.32.0) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
INFO:gget.utils:3450 sequences passed the provided filters.


Data download from NCBI virus is complete.
Multiple Sequence Aligment (MSA): Aligning the sequences to each other so they are all in the same frame...