<a href="https://colab.research.google.com/github/lauraluebbert/delphy_workflows/blob/main/delphy_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Delphy workflow
___

## 1. Select your virus of interest and apply filters to the genomes downloaded from NCBI Virus

In [None]:
virus = 'Norovirus'                # Examples: 'Mammarenavirus lassaense' or 'coronaviridae' or 'NC_045512.2' or '142786' (Norovirus taxid)
accession = False                  # If 'virus' is an NCBI accession instead of a taxon (e.g. 'NC_045512.2'), set this to True

# Commonly used filtering options (set any filter to None to turn off the filter):
host = 'homo sapiens'             # Example: 'homo sapiens' (alternatively: use the host_taxid filter below)
min_seq_length = 6252             # Example: 6252
max_seq_length = 7815             # Example: 7815

has_proteins = None               # Example: 'GPC' or 'L' or ['GPC', 'L'] (also accepts genes or segments)
proteins_complete = False         # True or False (indicates whether the proteins/genes/segments in has_proteins should be marked 'complete')

geographic_location = None        # Example: 'South_Africa' or 'Germany'
min_collection_date = None        # Example: '2000-01-01'
max_collection_date = None        # Example: '2014-12-04'
max_ambiguous_chars = None        # Example: 10

# Additional filtering options:
min_gene_count = None             # Example: 1
max_gene_count = None             # Example: 40
nuc_completeness = None           # 'partial' or 'complete'
host_taxid = None                 # Example: 9443 (NCBI Taxonomy ID of all primates)
lab_passaged = None               # True or False (indicates whether the virus sequence has been passaged in a laboratory setting)
geographic_region = None          # Example: 'Africa' or 'Europe'
submitter_country = None          # Example: 'South_Africa' or 'Germany'
annotated = None                  # True or False (indicates whether the virus genome sequence should be annotated)
source_database = None            # Example: 'GenBank' or 'RefSeq'
min_release_date = None           # Example: '2000-01-01'
max_release_date = None           # Example: '2014-12-04'
min_mature_peptide_count = None   # Example: 2
max_mature_peptide_count = None   # Example: 15
min_protein_count = None          # Example: 2
max_protein_count = None          # Example: 15
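
Contradictory filters (for example a minimum above the maximum) return no sequences, so it can help to check the settings before downloading. The sketch below is not part of gget. `check_filters` is a hypothetical helper, and it assumes that dates follow the `YYYY-MM-DD` pattern shown in the examples above.

```python
from datetime import datetime


def check_filters(min_seq_length=None, max_seq_length=None,
                  min_collection_date=None, max_collection_date=None):
    """Return a list of problems found in the filter settings (empty if none)."""
    problems = []

    # A minimum above the maximum can never match any sequence
    if None not in (min_seq_length, max_seq_length) and min_seq_length > max_seq_length:
        problems.append("min_seq_length is larger than max_seq_length")

    parsed = {}
    for name, value in (("min_collection_date", min_collection_date),
                        ("max_collection_date", max_collection_date)):
        if value is None:
            continue  # this filter is turned off
        try:
            parsed[name] = datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            problems.append(f"{name}={value!r} could not be parsed as YYYY-MM-DD")

    if len(parsed) == 2 and parsed["min_collection_date"] > parsed["max_collection_date"]:
        problems.append("min_collection_date is later than max_collection_date")

    return problems


# Example with the sequence length values used in this notebook
print(check_filters(min_seq_length=6252, max_seq_length=7815))
```

An empty list means that none of the checked filters contradict each other.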

## 2. Optional: Upload a fasta file with your own sequences to add to the analysis
  1) Click on the folder icon on the left  
  2) Upload your file(s) to the Google Colab server by dragging in your file(s) (or use right-click -> Upload)  
  3) Specify the name of your file(s) here:

In [None]:
fasta_file = None        # Example: 'my_fasta_file.fa' or 'my_fasta_file.fasta'

# If the metadata is the same for all sequences in your fasta file, enter the metadata here
# A Collection Date is required. You can also add as many additional columns as you wish, e.g. "Geo Location": "South Korea".
# Use NCBI column names where applicable, e.g. see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus
metadata = {
    "Collection Date": "YYYY-MM-DD",
    "Extra column 1": "Value",
    # ...
}

# Alternative: Upload a csv file containing the metadata
# This file has to include at least an "Accession" and a "Collection Date" column
# Make sure the IDs in the "Accession" column match the IDs of the sequences in the fasta
# Use NCBI column names where applicable, e.g. see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus
metadata_csv = None       # Example: 'my_metadata.csv'
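
The two requirements on the metadata csv (the required columns and matching IDs) can be checked before running the rest of the notebook. The following is a minimal sketch using only the Python standard library. `fasta_ids` and `check_metadata` are hypothetical helpers, and the sketch assumes that the sequence ID is the header text up to the first whitespace.

```python
import csv
import io


def fasta_ids(fasta_handle):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = []
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def check_metadata(fasta_handle, csv_handle):
    """Return the FASTA IDs that have no row in the metadata csv."""
    reader = csv.DictReader(csv_handle)
    required = {"Accession", "Collection Date"}
    missing_columns = required - set(reader.fieldnames or [])
    if missing_columns:
        raise ValueError(f"Metadata is missing column(s): {sorted(missing_columns)}")
    accessions = {row["Accession"] for row in reader}
    return [seq_id for seq_id in fasta_ids(fasta_handle) if seq_id not in accessions]


# Small in-memory example; with real files, pass open(...) handles instead
example_fasta = io.StringIO(">seq1 sample A\nACGT\n>seq2\nACGT\n")
example_csv = io.StringIO("Accession,Collection Date\nseq1,2020-01-01\n")
print(check_metadata(example_fasta, example_csv))
```

Any ID in the returned list would end up in the analysis without a collection date.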

## 3. Click on 'Runtime' -> 'Run all' and lean back
___

### Installing gget:

In [None]:
# After the release, this will just be: pip install gget
!pip install -q mysql-connector-python==8.0.29 biopython
!pip install -q --log log git+https://github.com/pachterlab/gget.git@delphy_dev

import gget

Full descriptions for the filtering options:

In [None]:
help(gget.ncbi_virus)

### Downloading virus genomes from NCBI Virus:

This might take a minute depending on the internet connection and how busy the NCBI server is.

In [None]:
%%time
gget.ncbi_virus(
    virus = virus,
    accession = accession,
    host = host,
    min_seq_length = min_seq_length,
    max_seq_length = max_seq_length,
    min_gene_count = min_gene_count,
    max_gene_count = max_gene_count,
    nuc_completeness = nuc_completeness,
    has_proteins = has_proteins,
    proteins_complete = proteins_complete,
    host_taxid = host_taxid,
    lab_passaged = lab_passaged,
    geographic_region = geographic_region,
    geographic_location = geographic_location,
    submitter_country = submitter_country,
    min_collection_date = min_collection_date,
    max_collection_date = max_collection_date,
    annotated = annotated,
    source_database = source_database,
    min_release_date = min_release_date,
    max_release_date = max_release_date,
    min_mature_peptide_count = min_mature_peptide_count,
    max_mature_peptide_count = max_mature_peptide_count,
    min_protein_count = min_protein_count,
    max_protein_count = max_protein_count,
    max_ambiguous_chars = max_ambiguous_chars
)

### Merging sequence and metadata files if additional file(s) were provided

In [None]:
ncbi_fasta_file = f"{'_'.join(virus.split(' '))}_sequences.fasta"
ncbi_metadata = f"{'_'.join(virus.split(' '))}_metadata.csv"

If an additional FASTA file was provided, add its sequences and metadata to the data to analyze:

In [None]:
if fasta_file:
  !pip install -q biopython
  import pandas as pd
  from Bio import SeqIO

  # Combine sequence files
  combined_fasta_file = f"{'_'.join(virus.split(' '))}_sequences_combined.fasta"
  !cat $ncbi_fasta_file $fasta_file > $combined_fasta_file
  input_fasta_file = combined_fasta_file

  # Combine metadata
  combined_metadata_file = f"{'_'.join(virus.split(' '))}_metadata_combined.csv"
  ncbi_metadata_df = pd.read_csv(ncbi_metadata)
  if metadata_csv:
    # Use the metadata csv file provided by the user
    user_metadata_df = pd.read_csv(metadata_csv)

  else:
    # Extract sequence accessions from the provided FASTA file
    # (record.id is the header text up to the first whitespace)
    headers = [record.id for record in SeqIO.parse(fasta_file, "fasta")]

    # Create a metadata dataframe with the accessions from the FASTA file and the provided metadata
    user_metadata_df = pd.DataFrame(headers, columns=["Accession"])
    for key, value in metadata.items():
      user_metadata_df[key] = value

  # Combine the user metadata with the NCBI metadata and save the result
  comb_meta_df = pd.concat([ncbi_metadata_df, user_metadata_df])
  comb_meta_df.to_csv(combined_metadata_file, index=False)
  metadata_file = combined_metadata_file

else:
  input_fasta_file = ncbi_fasta_file
  metadata_file = ncbi_metadata
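
Concatenating files does not de-duplicate them: if an uploaded sequence shares its accession with a record downloaded from NCBI, both copies are kept. A minimal sketch of a check for this is shown below. `duplicated_accessions` is a hypothetical helper, not part of gget or pandas.

```python
from collections import Counter


def duplicated_accessions(accessions):
    """Return the sorted accessions that occur more than once."""
    counts = Counter(accessions)
    return sorted(acc for acc, n in counts.items() if n > 1)


# Example with made-up accessions
print(duplicated_accessions(["A1", "B2", "A1", "C3"]))
```

In this notebook, the values of the "Accession" column of the combined metadata (for example `comb_meta_df["Accession"]`) could be passed in directly.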

### Aligning the sequences to each other so they are all in the same frame:

Align all sequences in the FASTA file to each other so they are all in the same frame.

Option 1: Using the MUSCLE algorithm (this works well for a few hundred sequences, but is too slow when dealing with a few thousand sequences)

In [None]:
# %%time
# aligned_fasta_file = f"{'_'.join(virus.split(' '))}_aligned.afa"
# gget.muscle(input_fasta_file, super5=True, out=aligned_fasta_file)

Option 2: Using MAFFT

In [None]:
%%time
# Installing MAFFT
!apt-get install -qq -y mafft

# Aligning sequences to each other using mafft
aligned_fasta_file = f"{'_'.join(virus.split(' '))}_aligned.afa"
!mafft \
  --quiet \
  --auto \
  --thread 2 \
  $input_fasta_file > $aligned_fasta_file
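
In a finished alignment, every sequence has the same length once gaps are counted. The sketch below checks this with the standard library only. `alignment_lengths` and `is_aligned` are hypothetical helpers, and an empty file is reported as not aligned.

```python
import io


def alignment_lengths(fasta_handle):
    """Map each sequence ID to its length in the file (gap characters included)."""
    lengths = {}
    current = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def is_aligned(fasta_handle):
    """Return True if the file holds sequences that all share one length."""
    return len(set(alignment_lengths(fasta_handle).values())) == 1


# Small in-memory example; with a real file, pass open(aligned_fasta_file)
print(is_aligned(io.StringIO(">a\nAC-T\n>b\nACGT\n")))
```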

### Running Delphy:

To-do:  
The preparation step needs to rewrite the FASTA headers into the following format:
> accession|YYYY-MM-DD
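
One possible way to produce such headers is sketched below. It is not the implementation planned for gget. `relabel_headers` and `is_full_date` are hypothetical helpers, and the sketch assumes that the metadata has "Accession" and "Collection Date" columns and that the first token of each FASTA header is the accession. Records without a full `YYYY-MM-DD` collection date are skipped and reported, which is a simplification made for this sketch.

```python
import csv
import io
from datetime import datetime


def is_full_date(text):
    """Return True if text is a complete YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return len(text) == 10


def relabel_headers(fasta_handle, csv_handle, out_handle):
    """Write records with 'accession|YYYY-MM-DD' headers; return skipped accessions."""
    dates = {
        row.get("Accession"): (row.get("Collection Date") or "").strip()
        for row in csv.DictReader(csv_handle)
    }
    skipped = []
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            collection_date = dates.get(accession, "")
            keep = is_full_date(collection_date)
            if keep:
                out_handle.write(f">{accession}|{collection_date}\n")
            else:
                skipped.append(accession)
        elif keep:
            # Sequence lines are copied unchanged for records that are kept
            out_handle.write(line)
    return skipped


# Small in-memory example; with real files, pass open(...) handles instead
example_fasta = io.StringIO(">A1 some description\nACGT\n>B2\nACGT\n")
example_csv = io.StringIO("Accession,Collection Date\nA1,2020-03-15\nB2,2019\n")
example_out = io.StringIO()
print(relabel_headers(example_fasta, example_csv, example_out))
print(example_out.getvalue())
```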



In [None]:
# gget.delphy(aligned_fasta_file, metadata_file)

___
# All done! 🎉

### To download the files generated in this notebook to your local computer, click on the folder icon on the left, then right-click a file of interest and select 'Download'.

### To further visualize your Delphy output, upload the .dphy file to https://delphy.fathom.info/