<a href="https://colab.research.google.com/github/jnoms/vpSAT/blob/main/bin/colab/ExploreStructures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Foldseek Viral Structure Visualization**
The aim of this notebook is to enable users to explore the viral structures established by Nomburg et al., "*Birth of new protein folds and functions in the virome*".

# **Quick Start**

1. Execute the section 1 block to set up the notebook environment.
2. If you have an NCBI protein accession of interest, you can skip to section 3. Type in the protein accession and execute the block to display the structure.
3. If you do not know your protein accession, you may explore available proteins in section 2. Here, enter either:
 * A taxonID of interest (this should be a taxonID for a specific viral species - these can be found by searching the [NCBI Taxonomy Website](https://www.ncbi.nlm.nih.gov/taxonomy))
 * The name of a viral family
4. Section 2, if used, displays a table of all available structures encoded by the virus with the given taxonID or by viruses within the specified family. The table lists the protein accessions, which you can use to view the structures in section 3.
5. To download the structure file, execute the section 4 block.


## **1. Setup**

This step installs the required packages and downloads the supplementary file needed to visualize viral protein structures.
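
Besides installing packages, the setup cell below derives a few helper columns from the supplementary table: a zero-padded Model Archive index, and a protein name and accession split out of `cluster_member`. The following is a minimal plain-Python sketch of that parsing, assuming the `<protein_name>__<protein_accession>` layout the cell relies on. The function names and the sample `cluster_member` strings are hypothetical and used for illustration only; the notebook itself does this with pandas.

```python
# Illustrative sketch only: function names and sample strings are made up.
# Assumes cluster_member has the form "<protein_name>__<protein_accession>".

def parse_cluster_member(cluster_member):
    """Split a cluster_member string into (protein_name, protein_accession)."""
    parts = cluster_member.split("__")
    return parts[0], parts[1]


def model_archive_index(row_number):
    """Zero-pad a 1-based row number to five digits, e.g. 1 -> '00001'."""
    return f"{row_number:05d}"


# Hypothetical rows, numbered from 1 as in the setup cell
members = ["hypothetical_protein__YP_010087542", "capsid_protein__YP_000000001"]
for row_number, member in enumerate(members, start=1):
    name, accession = parse_cluster_member(member)
    print(model_archive_index(row_number), accession, name)
```

The index is kept as a zero-padded string because it is later pasted directly into the Model Archive entry name.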

In [None]:
# @title

#import libraries
import os
import io
import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()
import sys
from google.colab import files
import numpy as np

#if py3Dmol is not already installed, install it
try:
    import py3Dmol
    print("py3Dmol is already installed")
except ImportError:
    print("py3Dmol is installing")
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    !pip install py3Dmol
    sys.stdout = old_stdout
    import py3Dmol

#if biopython is not already installed, install it
try:
  from Bio.PDB.MMCIFParser import MMCIFParser
  from Bio.PDB.PDBIO import PDBIO
  print("biopython is already installed")

except ImportError:
  print("biopython is installing")
  old_stdout = sys.stdout
  sys.stdout = io.StringIO()
  !pip install biopython
  sys.stdout = old_stdout
  from Bio.PDB.MMCIFParser import MMCIFParser
  from Bio.PDB.PDBIO import PDBIO


# Check if the structure file is in the directory. If not, download it
# (wget names the file after the last URL segment, query string included)
file_path = 'media-1.xlsx?download=true'

if os.path.isfile(file_path):
    print('Structures file is already downloaded')
else:
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    !wget https://www.biorxiv.org/content/biorxiv/early/2024/01/23/2024.01.22.576744/DC1/embed/media-1.xlsx?download=true
    sys.stdout = old_stdout
    print("Structures file downloaded successfully")

#read in the structures file
structure_df = pd.read_excel(file_path)

#add model archive index
structure_df['model_archive_index'] = range(1, len(structure_df) + 1)
structure_df['model_archive_index'] = structure_df['model_archive_index'].map(lambda x: f"{x:05d}")

#create a new column for the protein accession
structure_df['protein_accession'] = structure_df['cluster_member'].str.split('__').str[1]

#parse out the protein name
structure_df['protein_name'] = structure_df['cluster_member'].str.split('__').str[0]

#if family is NaN, replace it with "undefined_family"
structure_df['family'] = structure_df['family'].fillna("undefined_family")

#convert the entire pandas df to string type
structure_df = structure_df.astype(str)

#re-order the structure df
structure_df = structure_df[[
    'model_archive_index', 'protein_accession', 'protein_name', 'taxonID',
    'species', 'superkingdom', 'phylum', 'class', 'order', 'family', 'genus',
    'cluster_ID', 'cluster_count', 'cluster_rep', 'subcluster_rep', 'cluster_member',
]]

print("Setup completed")


## **2. Browse Available Structures**
Enter a taxonID or a viral family name to find the protein accession of interest. For viruses with no assigned family (such as Pandoraviruses), input "undefined_family".

Protein accessions in the resultant table can be used in section 3 to view a specified structure.
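
The lookup performed in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's own code: the rows, accessions, and the `filter_structures` helper are hypothetical, and the family comparison is made case-insensitive so that "undefined_family" matches regardless of how it is typed.

```python
# Toy rows standing in for the structures table; every value here is made up.
rows = [
    {"protein_accession": "ACC_0001", "taxonID": "111", "family": "Poxviridae"},
    {"protein_accession": "ACC_0002", "taxonID": "222", "family": "undefined_family"},
    {"protein_accession": "ACC_0003", "taxonID": "111", "family": "Poxviridae"},
]


def filter_structures(rows, search_by, value):
    """Return the rows matching a taxonID or a (case-insensitive) family name."""
    value = value.strip()
    if search_by == "taxonID":
        # taxonIDs are compared as strings, as in the table after astype(str)
        return [row for row in rows if row["taxonID"] == value]
    if search_by == "virus family name":
        return [row for row in rows if row["family"].lower() == value.lower()]
    raise ValueError(f"Unknown search mode: {search_by}")


matches = filter_structures(rows, "virus family name", "poxviridae")
print([row["protein_accession"] for row in matches])
```

Any accession returned this way can then be pasted into section 3.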



In [None]:
# Prompt user to choose between taxonID or virus family using @param
Search_by = 'virus family name'  # @param ["taxonID", "virus family name"]

value = ''  # @param {type:"string"}

if Search_by == 'taxonID':
    # User enters taxonID
    taxonID_filtered_df = structure_df[structure_df['taxonID'] == value]

    if 'model_archive_index' in taxonID_filtered_df:
      taxonID_filtered_df = taxonID_filtered_df.drop('model_archive_index', axis=1)

    print(f"Display proteins filtered by taxon id({value})")
    display(data_table.DataTable(taxonID_filtered_df, include_index=False, num_rows_per_page=15))

elif Search_by == 'virus family name':
    # User enters virus family name; compare case-insensitively so that
    # both "undefined_family" and names such as "poxviridae" match
    virusFamilyName = value.strip()
    print(f"Display proteins filtered by virusFamilyName({virusFamilyName})")
    family_filtered_df = structure_df[structure_df['family'].str.lower() == virusFamilyName.lower()]


    if 'model_archive_index' in family_filtered_df:
      family_filtered_df = family_filtered_df.drop('model_archive_index', axis=1)

    display(data_table.DataTable(family_filtered_df, include_index=False, num_rows_per_page=15))


## **3. View Protein Structure**

Enter the desired protein accession below and choose whether to color the structure by prediction confidence (pLDDT) or by residue position along the chain (rainbow).
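
As a rough sketch of what the cell below does before drawing the structure: the accession's version suffix is dropped, the zero-padded index from section 1 is placed into the Model Archive URL, and pLDDT values drive the color scale. The helper names here are hypothetical. The pLDDT cutoffs (50, 70, 90) are the conventional AlphaFold confidence bands and are an assumption; the notebook itself only fixes the ends of its color gradient at 50 and 90.

```python
def normalize_accession(accession):
    """Drop surrounding whitespace and any version suffix such as '.1'."""
    return accession.strip().split(".")[0]


def model_archive_url(index):
    """Build the Model Archive download URL from a zero-padded index string."""
    return (
        "https://www.modelarchive.org/api/projects/"
        f"ma-jd-viral-{index}?type=basic__model_file_name"
    )


def plddt_category(plddt):
    """Map a pLDDT score (0-100) to a legend label (assumed cutoffs)."""
    if plddt < 50:
        return "Very Low"
    if plddt < 70:
        return "Low"
    if plddt < 90:
        return "High"
    return "Very High"


print(normalize_accession("YP_010087542.1"))
print(model_archive_url("00001"))
print(plddt_category(83.5))
```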

In [None]:
proteinAccession = 'YP_010087542' # @param {type:"string"}
proteinAccession = proteinAccession.split('.')[0]

accession_filtered_df = structure_df[structure_df['protein_accession'] == proteinAccession]

#informative message if the protein accession is not in the table
if accession_filtered_df.empty:
    print("Protein accession not found. Please try another protein accession.")
else:
  #look up the Model Archive index only after confirming the accession exists
  accession_index = accession_filtered_df['model_archive_index'].values[0]

  #get cif file from Model Archive

  #make a directory to store results
  results_path = '/content/results'
  if not os.path.exists(results_path):
    os.makedirs(results_path)

  #use indexing to get the structure file from Model Archive
  #Model Archive entries are numbered from 1, whereas Python indexes from 0
  file_directory = f"https://www.modelarchive.org/api/projects/ma-jd-viral-{accession_index}?type=basic__model_file_name"
  file_name_cif = f"ma-jd-viral-{accession_index}?type=basic__model_file_name"

  #download the cif file into Google Colab
  old_stdout = sys.stdout
  sys.stdout = io.StringIO()
  !wget -P {results_path} {file_directory}
  sys.stdout = old_stdout

  #convert the cif file into a pdb file
  def convert_cif_to_pdb(cif_file, pdb_file):
    parser = MMCIFParser()
    structure = parser.get_structure('structure', cif_file)
    #named pdb_io to avoid shadowing the io module imported in section 1
    pdb_io = PDBIO()
    pdb_io.set_structure(structure)
    pdb_io.save(pdb_file)

  cluster_member = accession_filtered_df['cluster_member'].values[0]
  file_name_pdb = f"{cluster_member}.pdb"

  convert_cif_to_pdb(os.path.join(results_path, file_name_cif), os.path.join(results_path, file_name_pdb))

  color = "pLDDT" #@param ["pLDDT", "rainbow"]

  # Load the PDB file contents, closing the file handle afterwards
  with open(os.path.join(results_path, file_name_pdb)) as pdb_handle:
      pdb_file = pdb_handle.read()

  # Create py3Dmol view
  view = py3Dmol.view(width=800, height=600)
  view.addModel(pdb_file, "pdb")

  # Set color based on the B-factor column, which is read here as the pLDDT score
  if color == "pLDDT":
      view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
      from IPython.display import HTML
      color_legend_html = """
      <div style="position:relative; top:10px; background-color:white; padding:10px;">
          <h3>pLDDT Legend</h3>
          <div style="width: 400px;">
                <span style="margin-right: 50px;">Very Low</span>
                <span style="margin-right: 90px;">Low</span>
                <span style="margin-right: 75px;">High</span>
                <span>Very High</span>
          </div>
          <div style="display: flex; flex-direction: column; align-items: flex-start;">
              <div style="background: linear-gradient(to right, red, yellow, cyan, blue); height:20px; width: 400px;"></div>
              <div style="width: 400px;">
              </div>
          </div>
      </div>
      """
      display(HTML(color_legend_html))

      view.zoomTo()
      view.show()

  elif color == "rainbow":
      view.setStyle({'cartoon': {'color':'spectrum'}})
      #view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
      view.zoomTo()
      view.show()



## **4. Download Structure**

Run this cell if you want to download the structure file of the protein displayed above.

In [None]:
# @title
files.download(f"{results_path}/{file_name_pdb}")
print("Structure file downloaded")