<a href="https://colab.research.google.com/github/lauraluebbert/delphy_workflows/blob/main/delphy_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Delphy workflow
___

## 1. Select your virus of interest and apply filters to the genomes downloaded from NCBI Virus

In [1]:
virus = "Norovirus"     # Examples: 'Norovirus' or 'coronaviridae' or 'NC_045512.2'
accession = False       # If 'virus' is an accession instead of a taxon (e.g. 'NC_045512.2'), set this to True

# Commonly used filtering options (set any filter to None to turn off the filter):
host = 'homo sapiens'             # Example: 'homo sapiens' (alternatively: use the host_taxid filter below)
min_seq_length = 6252             # Example: 6252
max_seq_length = 7815             # Example: 7815

# Additional filtering options:
min_gene_count = None             # Example: 1
max_gene_count = None             # Example: 40
nuc_completeness = None           # Example: 'partial' or 'complete'
virus_taxid = None                # Example: 11974 (NCBI Taxonomy ID of Caliciviridae) - Tip: use this in combination with the 'virus' argument to avoid long download times
host_taxid = None                 # Example: 9443 (NCBI Taxonomy ID of all primates)
lab_passaged = None               # Example: True or False (indicates whether the virus sequence has been passaged in a laboratory setting)
geographic_region = None          # Example: 'Africa' or 'Europe'
geographic_location = None        # Example: 'South_Africa' or 'Germany'
submitter_country = None          # Example: 'South_Africa' or 'Germany'
min_collection_date = None        # Example: '2000-01-01'
max_collection_date = None        # Example: '2014-12-04'
annotated = None                  # Example: True or False (indicates whether the virus genome sequence should be annotated)
source_database = None            # Example: 'GenBank' or 'RefSeq'
min_release_date = None           # Example: '2000-01-01'
max_release_date = None           # Example: '2014-12-04'
min_mature_peptide_count = None   # Example: 2
max_mature_peptide_count = None   # Example: 15
min_protein_count = None          # Example: 2
max_protein_count = None          # Example: 15
max_ambiguous_chars = None        # Example: 10
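
Because any filter set to `None` is switched off, it can help to list which filters are actually active before starting the download. The following is a minimal sketch, assuming the filter values are collected in a plain dictionary; `active_filters` is a hypothetical helper and not part of gget.

```python
def active_filters(filters):
    """Return only the filters that are switched on (i.e. not None)."""
    return {name: value for name, value in filters.items() if value is not None}


# Example values mirroring some of the settings above
filters = {
    "host": "homo sapiens",
    "min_seq_length": 6252,
    "max_seq_length": 7815,
    "nuc_completeness": None,  # switched off
    "lab_passaged": False,     # False is a real filter value, so it is kept
}

for name, value in active_filters(filters).items():
    print(f"{name} = {value!r}")
```

Note that the check is `is not None` rather than a truthiness test, so a filter such as `lab_passaged = False` is not silently dropped.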

## 2. Optional: Upload a fasta file with your own sequences to add to the analysis
  1) Click on the folder icon on the left  
  2) Upload your file(s) to the Google Colab server by dragging in your file(s) (or use right-click -> Upload)  
  3) Specify the name of your file(s) here:

In [2]:
fasta_file = None           # Example: 'my_fasta_file.fa' or 'my_fasta_file.fasta'

# If the metadata is the same for all sequences in your fasta file, enter the metadata here
# You have to enter a Collection Date. In addition, you can add as many additional columns as you wish, e.g. "Geo Location": "South Korea".
# Use NCBI column names where applicable, e.g. see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus
metadata = {
    "Collection Date": "YYYY-MM-DD",
    "Extra column 1": "Value",
    # ...
}

# Alternative: Upload a csv file containing the metadata
# This file has to include at least an "Accession" and a "Collection Date" column
# Make sure the IDs in the "Accession" column match the IDs of the sequences in the fasta
# Use NCBI column names where applicable, e.g. see https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus
metadata_csv = None       # Example: 'my_metadata.csv'
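
Before running the workflow, it may be worth checking that a metadata csv meets the two requirements stated above: the required columns exist, and every sequence ID in the fasta has a matching "Accession" entry. The sketch below uses only the Python standard library and works on text rather than file paths; `fasta_ids` and `check_metadata` are hypothetical helper names, and the example records are made up.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Return the ID (first word after '>') of every record in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        fields = line[1:].split() if line.startswith(">") else []
        if fields:
            ids.append(fields[0])
    return ids


def check_metadata(fasta_text, csv_text):
    """Return a list of problems found; an empty list means the files agree."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    problems = [
        f"missing column: {column}"
        for column in ("Accession", "Collection Date")
        if column not in columns
    ]
    if problems:
        return problems

    rows = list(reader)
    known = {row["Accession"] for row in rows}
    for seq_id in fasta_ids(fasta_text):
        if seq_id not in known:
            problems.append(f"no metadata for sequence: {seq_id}")
    for row in rows:
        if not (row["Collection Date"] or "").strip():
            problems.append(f"no collection date for: {row['Accession']}")
    return problems


# Made-up example: the second sequence has no metadata row
example_fasta = ">seq1 sample A\nACGT\n>seq2\nACGT\n"
example_csv = "Accession,Collection Date\nseq1,2020-01-15\n"
print(check_metadata(example_fasta, example_csv))
```

To run this on uploaded files, the file contents could be read with `open(path).read()` and passed in as text.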

## 3. Click on 'Runtime' -> 'Run all' and lean back
___

### Installing gget:

In [3]:
# After the release, this will just be: pip install gget
!pip install -q mysql-connector-python==8.0.29 biopython
!pip install -q --log log git+https://github.com/pachterlab/gget.git@delphy_dev

import gget

  Preparing metadata (setup.py) ... done
  Building wheel for gget (setup.py) ... done


Full descriptions for the filtering options:

In [4]:
help(gget.ncbi_virus)

Help on function ncbi_virus in module gget.gget_ncbi_virus:

ncbi_virus(virus, accession=False, outfolder=None, host=None, min_seq_length=None, max_seq_length=None, min_gene_count=None, max_gene_count=None, nuc_completeness=None, host_taxid=None, lab_passaged=None, geographic_region=None, geographic_location=None, submitter_country=None, min_collection_date=None, max_collection_date=None, annotated=None, virus_taxid=None, source_database=None, min_release_date=None, max_release_date=None, min_mature_peptide_count=None, max_mature_peptide_count=None, min_protein_count=None, max_protein_count=None, max_ambiguous_chars=None)
    Download a virus genome dataset from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/).
    
    Args:
    - virus                Virus taxon or accession, e.g. 'Norovirus' or 'coronaviridae' or 'NC_045512.2'
                           If this input is a virus accession (e.g. 'NC_045512.2'), set accession = True.
    - accession            True/Fa

### Downloading virus genomes from NCBI Virus:

This might take a minute depending on the internet connection and how busy the NCBI server is.

In [5]:
%%time
gget.ncbi_virus(
    virus = virus,
    accession = accession,
    host = host,
    min_seq_length = min_seq_length,
    max_seq_length = max_seq_length,
    min_gene_count = min_gene_count,
    max_gene_count = max_gene_count,
    nuc_completeness = nuc_completeness,
    host_taxid = host_taxid,
    lab_passaged = lab_passaged,
    geographic_region = geographic_region,
    geographic_location = geographic_location,
    submitter_country = submitter_country,
    min_collection_date = min_collection_date,
    max_collection_date = max_collection_date,
    annotated = annotated,
    virus_taxid = virus_taxid,
    source_database = source_database,
    min_release_date = min_release_date,
    max_release_date = max_release_date,
    min_mature_peptide_count = min_mature_peptide_count,
    max_mature_peptide_count = max_mature_peptide_count,
    min_protein_count = min_protein_count,
    max_protein_count = max_protein_count,
    max_ambiguous_chars = max_ambiguous_chars
)

INFO:gget.utils:3291 sequences passed the provided filters.


CPU times: user 5.55 s, sys: 1.18 s, total: 6.72 s
Wall time: 23.7 s


### Merging sequencing and metadata files if additional file(s) were provided

In [6]:
# The fasta file downloaded from NCBI Virus is automatically named after today's date and the virus
from datetime import datetime
date = datetime.now().strftime("%Y-%m-%d") # Get today's date

ncbi_fasta_file = f"{virus}_{date}_sequences.fasta"
ncbi_metadata = f"{virus}_{date}_metadata.csv"

If an additional fasta file was provided, its sequences and metadata are added to the NCBI data before the analysis:

In [9]:
if fasta_file:
  !pip install -q biopython
  import pandas as pd
  from Bio import SeqIO

  # Combine sequence files ({...} passes the Python variables to the shell command)
  combined_fasta_file = f"{virus}_{date}_sequences_combined.fasta"
  !cat "{ncbi_fasta_file}" "{fasta_file}" > "{combined_fasta_file}"
  input_fasta_file = combined_fasta_file

  # Load the NCBI metadata (ncbi_metadata only holds the file name)
  ncbi_meta_df = pd.read_csv(ncbi_metadata)

  if metadata_csv:
    # Load the metadata csv file provided by the user
    user_metadata_df = pd.read_csv(metadata_csv)

  else:
    # Extract sequence accessions from the provided FASTA file
    headers = [record.id for record in SeqIO.parse(fasta_file, "fasta")]

    # Create a metadata dataframe with the accessions and the provided metadata
    user_metadata_df = pd.DataFrame(headers, columns=["Accession"])
    for key, value in metadata.items():
      user_metadata_df[key] = value

  # Combine NCBI and user metadata (DataFrame.append was removed in pandas 2.0)
  combined_metadata_file = f"{virus}_{date}_metadata_combined.csv"
  comb_meta_df = pd.concat([ncbi_meta_df, user_metadata_df], ignore_index=True)
  comb_meta_df.to_csv(combined_metadata_file, index=False)
  metadata_file = combined_metadata_file

else:
  input_fasta_file = ncbi_fasta_file
  metadata_file = ncbi_metadata

### Aligning the sequences to each other so they are all in the same frame:

All sequences in the fasta file are aligned to each other so that they share the same frame.

Option 1: Using the MUSCLE algorithm (this works well for a few hundred sequences, but is too slow when dealing with a few thousand sequences)

In [10]:
# %%time
# aligned_fasta_file = f"{virus}_{date}_aligned.afa"
# gget.muscle(input_fasta_file, super5=True, out=aligned_fasta_file)

Option 2: Using MAFFT

In [None]:
%%time
# Installing MAFFT
!apt-get install -qq -y mafft

# Aligning sequences to each other using MAFFT
aligned_fasta_file = f"{virus}_{date}_aligned.afa"
!mafft \
  --auto \
  --thread 2 \
  $input_fasta_file > $aligned_fasta_file

Extracting templates from packages: 100%
Selecting previously unselected package fonts-lato.
(Reading database ... 123597 files and directories currently installed.)
Preparing to unpack .../00-fonts-lato_2.0-2.1_all.deb ...
Unpacking fonts-lato (2.0-2.1) ...
Selecting previously unselected package netbase.
Preparing to unpack .../01-netbase_6.3_all.deb ...
Unpacking netbase (6.3) ...
Selecting previously unselected package libclone-perl.
Preparing to unpack .../02-libclone-perl_0.45-1build3_amd64.deb ...
Unpacking libclone-perl (0.45-1build3) ...
Selecting previously unselected package libdata-dump-perl.
Preparing to unpack .../03-libdata-dump-perl_1.25-1_all.deb ...
Unpacking libdata-dump-perl (1.25-1) ...
Selecting previously unselected package libencode-locale-perl.
Preparing to unpack .../04-libencode-locale-perl_1.05-1.1_all.deb ...
Unpacking libencode-locale-perl (1.05-1.1) ...
Selecting previously unselected package libhttp-date-perl.
Preparing to unpack .../05-libhttp-date-perl
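
After alignment, every sequence should have the same length, because gap characters (`-`) are padded in so that all sequences share the same frame. A minimal sketch of this sanity check, using only the Python standard library, is shown below; `read_fasta` and `alignment_lengths` are hypothetical helper names, and the demo file is made up.

```python
import os
import tempfile


def read_fasta(path):
    """Parse a FASTA file into a dict of {record ID: sequence}."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                records[current] = []
            elif current is not None:
                records[current].append(line)
    # Join the wrapped sequence lines of each record
    return {name: "".join(parts) for name, parts in records.items()}


def alignment_lengths(path):
    """Return the set of distinct sequence lengths in a FASTA file."""
    return {len(seq) for seq in read_fasta(path).values()}


# Demo on a tiny made-up alignment; record 'b' is wrapped over two lines
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.afa")
    with open(demo_path, "w") as handle:
        handle.write(">a\nACG-T\n>b\nAC\nGTT\n")
    demo_lengths = alignment_lengths(demo_path)

print(demo_lengths)
```

In the notebook, `alignment_lengths(aligned_fasta_file)` should return a set with exactly one value; more than one value suggests the alignment step did not finish or the wrong file was used.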

### Running Delphy:

To-do:  
The prep step needs to rewrite the fasta headers into the following format:
> accession|YYYY-MM-DD
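
The header rewriting described in this to-do could look roughly like the following. This is a sketch and not the final prep step: it assumes the metadata csv has "Accession" and "Collection Date" columns (as required in section 2), that only complete `YYYY-MM-DD` dates are usable, and that sequences without such a date are dropped. `load_dates` and `relabel_fasta` are hypothetical helper names, and the example records are made up.

```python
import csv
import io
import re

# Only complete dates are accepted; partial dates such as '2019' are rejected
FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def load_dates(csv_text):
    """Map each accession to its collection date, keeping only full YYYY-MM-DD dates."""
    dates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        accession = (row.get("Accession") or "").strip()
        date = (row.get("Collection Date") or "").strip()
        if accession and FULL_DATE.fullmatch(date):
            dates[accession] = date
    return dates


def relabel_fasta(fasta_text, dates):
    """Rewrite headers to '>accession|YYYY-MM-DD'; skip records without a full date."""
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            keep = accession in dates
            if keep:
                out.append(f">{accession}|{dates[accession]}")
        elif keep:
            out.append(line)
    return "\n".join(out) + "\n" if out else ""


# Made-up example: 'demo2' only has a year, so it is dropped
example_csv = "Accession,Collection Date\ndemo1,2016-07-15\ndemo2,2019\n"
example_fasta = ">demo1 some description\nACGT\nTTGA\n>demo2 other description\nGGCC\n"
print(relabel_fasta(example_fasta, load_dates(example_csv)))
```

Dropping records with partial dates is only one possible choice; whether partial dates should instead be kept in some other form is left open here.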



In [None]:
# gget.delphy(aligned_fasta_file, metadata_file)

___
# All done! 🎉

### To download the files generated in this notebook to your local computer, click on the folder icon on the left, right-click a file of interest, and select 'Download'.

### To further visualize your Delphy output, upload the .dphy file to https://delphy.fathom.info/