<a href="https://colab.research.google.com/github/jnoms/vpSAT/blob/main/bin/colab/QueryStructures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Foldseek Viral Structure Alignment**
This notebook lets users run Foldseek structural searches against the database of predicted viral structures established by Nomburg et al., "*Birth of new protein folds and functions in the virome*".


# **Directions**

This notebook is split into individual steps to enable searching of one or more structure (.pdb) files against the viral structure database. If you do not have a structure, we recommend predicting one with the publicly available ColabFold notebook (https://github.com/sokrypton/ColabFold).

## Quick start
1. Specify search parameters in section 2. The defaults are sensible, so this is optional.
2. In the toolbar above, select Runtime --> Run all
3. In section 3, upload a PDB file or a zip file containing multiple PDB files.
4. Your results will appear in subsequent cells.


## More details:
* To run additional searches, simply select Runtime --> Run all again to prompt a new upload in section 3.
* In section 5, there are two output tables.
 * The first table summarizes the protein clusters to which your hits belong. It tells you, for example, how many hits fall in each protein cluster. You can then investigate the cluster members in section 7 using the cluster_ID.
 * The second table simply displays all hits, along with their cluster_ID and all associated output fields.
* Executing section 6 prompts a download of the foldseek results file.
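
The cluster summary in the first table can be pictured with a small sketch. It uses only the Python standard library and made-up data (the target names, the `cluster_1`/`cluster_2` IDs and the cluster sizes are all hypothetical), so it illustrates the idea rather than reproducing the notebook's pandas code.

```python
from collections import Counter

# Hypothetical hits: (target, cluster_ID). A target can appear more than once
# when several queries hit it, so targets are de-duplicated first.
hits = [
    ("protA", "cluster_1"),
    ("protB", "cluster_1"),
    ("protA", "cluster_1"),
    ("protC", "cluster_2"),
    ("protD", "unassigned"),
]
# Hypothetical total number of members in each cluster
cluster_sizes = {"cluster_1": 4, "cluster_2": 10}

unique_hits = dict(hits)  # target -> cluster_ID, one entry per target
counts = Counter(unique_hits.values())
total = sum(counts.values())

summary = []
for cluster_id, n in counts.most_common():
    size = cluster_sizes.get(cluster_id)
    summary.append({
        "cluster_ID": cluster_id,
        "Number of proteins found": n,
        "Fraction of total hits": round(n / total, 2),
        "Fraction of cluster with an alignment": round(n / size, 2) if size else "NA",
    })

for row in summary:
    print(row)
```

Hits whose target has no cluster entry are grouped under `unassigned`, and their cluster fraction is reported as `NA` because no cluster size is known for them.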

## **1. Setup**

This step downloads the files and packages needed for the search: Foldseek itself, the viral structure database, and the table that assigns each structure to a protein cluster.
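
The download-once pattern used in this step can be sketched as follows. `ensure_extracted` is a hypothetical helper, not part of the notebook. It assumes, as the setup cell does, that the archive unpacks into a single top-level folder and that the database files inside it share the prefix `db`.

```python
import os
import tempfile
import zipfile


def ensure_extracted(zip_path, extract_dir):
    """Extract zip_path into extract_dir unless that was already done,
    then return the assumed Foldseek database prefix inside it."""
    if not os.path.isdir(extract_dir):
        os.makedirs(extract_dir)
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(extract_dir)
    # Assumption: exactly one top-level folder holds the database files
    top_level = sorted(os.listdir(extract_dir))[0]
    return os.path.join(extract_dir, top_level, "db")


# Demonstration with a tiny stand-in archive (no network access needed)
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo_database.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo_database/db", "placeholder")

first = ensure_extracted(demo_zip, os.path.join(workdir, "extracted"))
second = ensure_extracted(demo_zip, os.path.join(workdir, "extracted"))  # skips extraction
print(first == second, os.path.exists(first))
```

Because the path is resolved after the existence check rather than inside it, the helper returns the same value whether or not extraction ran on this call.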

In [None]:
# @title

from google.colab import files
import subprocess
import sys  # imported up front so later cells can use it even when no download runs
import zipfile
import os
import io
import ipywidgets as widgets
import pandas as pd
from ipywidgets import Layout

# Check if Foldseek is already downloaded
if not os.path.exists('./foldseek-linux-avx2.tar.gz'):

    #Silencing main output
    import sys
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()

    #Download Foldseek
    !wget https://github.com/steineggerlab/foldseek/releases/download/8-ef4e960/foldseek-linux-avx2.tar.gz ; tar xvzf foldseek-linux-avx2.tar.gz

    sys.stdout = old_stdout
    print("\nFoldseek downloaded successfully")
else:
    print("\nFoldseek is already downloaded.")

# Check if the target directory is already downloaded
target_directory = '/content/target_extracted_folder/'
if not os.path.exists(target_directory):

    #Silencing main output
    import sys
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()

    !wget https://zenodo.org/records/10685505/files/structure_foldseek_database_2023-11-27.zip?download=1

    # Filename of the zip file
    target_zip_file = '/content/structure_foldseek_database_2023-11-27.zip?download=1'

    # Directory to extract the contents
    target_extract_dir = '/content/target_extracted_folder/'

    # Create the extraction directory if it doesn't exist
    os.makedirs(target_extract_dir, exist_ok=True)

    # Open the zip file
    with zipfile.ZipFile(target_zip_file, 'r') as zip_ref:
        # Extract all the contents to the extraction directory
        zip_ref.extractall(target_extract_dir)

    sys.stdout = old_stdout

    print("\nTarget database downloaded successfully")
else:
    print("\nTarget database is already downloaded.")

# Get the target_file_path. This runs on every execution so that the path is
# defined even when the database was downloaded in an earlier run.
target_extracted_files = os.listdir(target_directory)
target_file_path = target_directory + target_extracted_files[0] + '/db'



# Check if the structure file is in the directory. If not, download it.
# wget keeps the query string in the saved file name.
file_path = 'media-1.xlsx?download=true'

if os.path.isfile(file_path):
    print('\nStructures file is already downloaded')
else:
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    !wget https://www.biorxiv.org/content/biorxiv/early/2024/01/23/2024.01.22.576744/DC1/embed/media-1.xlsx?download=true
    sys.stdout = old_stdout
    print("\nStructures file downloaded successfully")

#read in the structures file
structure_df = pd.read_excel(file_path)

print("\nSetup completed")



## **2. User Parameters**

*Input the desired parameters below:*

**Coverage** - Only report matches whose fraction of aligned (covered) residues is at or above this threshold (range: 0 to 1). A higher value is more restrictive and returns fewer alignments (default: 0, meaning no coverage restriction)

**Coverage Mode** -
> 0 = Coverage of **both** *query and target*

> 1 = Coverage of **only** *target*

> 2 = Coverage of **only** *query*

**E-value** - Maximum E-value for a reported hit. A higher threshold admits more distant, less significant structural matches (range: 0 to infinity; default: 10)
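
To make the three coverage modes concrete, here is a minimal sketch of the filter they describe. It is an illustration in plain Python, not Foldseek's implementation: `coverage`, `passes_filters` and their arguments are hypothetical names, and coverage is assumed to be the aligned span divided by the sequence length.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    return (end - start + 1) / length


def passes_filters(qstart, qend, qlen, tstart, tend, tlen, evalue,
                   min_cov=0.0, cov_mode=0, max_evalue=10.0):
    """Apply the Coverage, Coverage Mode and E-value settings to one hit."""
    qcov = coverage(qstart, qend, qlen)
    tcov = coverage(tstart, tend, tlen)
    if cov_mode == 0:      # both query and target must be covered
        cov_ok = qcov >= min_cov and tcov >= min_cov
    elif cov_mode == 1:    # only the target must be covered
        cov_ok = tcov >= min_cov
    elif cov_mode == 2:    # only the query must be covered
        cov_ok = qcov >= min_cov
    else:
        raise ValueError("cov_mode must be 0, 1 or 2")
    return cov_ok and evalue <= max_evalue


# A 100-residue query aligned over residues 1-90 to residues 1-90 of a 300-residue target
hit = dict(qstart=1, qend=90, qlen=100, tstart=1, tend=90, tlen=300, evalue=1e-5)
print(passes_filters(**hit, min_cov=0.8, cov_mode=0))  # False: target coverage is 0.3
print(passes_filters(**hit, min_cov=0.8, cov_mode=2))  # True: query coverage is 0.9
```

Mode 2 is often the useful choice when a short query domain is searched against much longer target proteins, since the target can never be fully covered in that case.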

**Output Format** - Indicate the desired output fields. Consult the Foldseek readme for more information: https://github.com/steineggerlab/foldseek?tab=readme-ov-file#output-search

(Default: query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits, rmsd, prob, alntmscore)

**File Output** - This is the type of output file produced. The default is tabular, which lists alignments with the indicated output format. The alternative is HTML, which produces an interactive HTML file with all alignments. Note that if there are many targets, the HTML file will be large and unwieldy.
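
Tabular output has no header row, so the column names come from the Output Format string. A small sketch of reading such a file with the standard `csv` module follows; the two result rows are made-up values, and only a subset of the fields is used to keep it short.

```python
import csv
import io

format_output = "query,target,fident,evalue,bits"
columns = format_output.split(",")

# Stand-in for the contents of output.m8: tab-separated, no header row
example_m8 = "query1\ttargetA\t0.35\t1.2e-08\t250\nquery1\ttargetB\t0.21\t3.4e-03\t90\n"

reader = csv.DictReader(io.StringIO(example_m8), fieldnames=columns, delimiter="\t")
rows = list(reader)

# Values are read as strings, so convert numeric fields before comparing
best = min(rows, key=lambda row: float(row["evalue"]))
print(best["target"], best["evalue"])
```

If you change the Output Format field list, the same string must be used to name the columns, otherwise values end up under the wrong headers.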

In [None]:

# Create parameters for user input with default values
coverage = 0 #@param {type:"number"}
coverage_mode = 0 #@param {type:"integer"}
e = 10 #@param {type:"number"}
format_output = 'query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,rmsd,prob,alntmscore' #@param {type:"string"}
file_output = "tabular" #@param ["tabular", "HTML"]


## **3. Upload File Below**

Upload either a single .pdb file or a zip file of .pdb files below:

(Upload file widget will appear after pressing *Run All*)
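
The two accepted file types are handled differently. A minimal sketch of that dispatch is shown here with an in-memory zip so that it runs outside Colab; `list_pdb_inputs` is a hypothetical helper and the file names are invented.

```python
import io
import zipfile


def list_pdb_inputs(file_name, content):
    """Return the .pdb names contained in an upload (a .pdb file or a zip)."""
    if file_name.endswith(".pdb"):
        return [file_name]
    if file_name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
            return sorted(
                name for name in archive.namelist()
                if name.endswith(".pdb") and not name.startswith("__MACOSX/")
            )
    raise ValueError("Expected a .pdb file or a .zip of .pdb files: " + file_name)


# Build a small zip in memory to stand in for an uploaded archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("structures/a.pdb", "ATOM ...")
    archive.writestr("structures/b.pdb", "ATOM ...")
    archive.writestr("structures/notes.txt", "not a structure")

print(list_pdb_inputs("single.pdb", b""))               # ['single.pdb']
print(list_pdb_inputs("batch.zip", buffer.getvalue()))  # ['structures/a.pdb', 'structures/b.pdb']
```

Listing the archive members before searching is a quick way to confirm that a zip actually contains the structures you expect.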

In [None]:
# @title

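import shutil

#print("Please Upload a single pdb file or a zip file of pdb files")
uploaded_file = files.upload()

#check if the file is a single pdb file or a zip file
for file_name, content in uploaded_file.items():
  if file_name.endswith('.pdb'):
    print("\n This is a pdb file")
    query_file_path = os.path.join('/content', file_name)

  elif file_name.endswith('.zip'):
    print("\n This is a zip file")

    #make a fresh query_folder_path so files from earlier uploads are not searched again
    query_folder_path = '/content/query_extracted_folder/'
    shutil.rmtree(query_folder_path, ignore_errors=True)
    os.makedirs(query_folder_path, exist_ok=True)

    # Extract zip file
    with zipfile.ZipFile(io.BytesIO(content), 'r') as zObject:
        zObject.extractall(path=query_folder_path)

    # Discard the metadata folder that macOS adds to zip archives
    shutil.rmtree(os.path.join(query_folder_path, '__MACOSX'), ignore_errors=True)

    # If the archive wraps the pdb files in a single folder, search that folder;
    # otherwise the pdb files sit at the top level, so search the extraction folder
    extracted_items = os.listdir(query_folder_path)
    first_item = os.path.join(query_folder_path, extracted_items[0]) if extracted_items else ''
    if len(extracted_items) == 1 and os.path.isdir(first_item):
      query_file_path = first_item
    else:
      query_file_path = query_folder_path

    print("\n Contents in zip file successfully extracted")
  else:
    print("\n This is an unknown file. Please upload the correct file format")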


In [None]:
# @title **4. Run Foldseek**

if file_output == 'tabular':
  output_file_path = "output.m8"
  format_mode = 0
else:
  output_file_path = "output.html"
  format_mode = 3

old_stdout = sys.stdout
sys.stdout = io.StringIO()

!/content/foldseek/bin/foldseek easy-search "{query_file_path}" "{target_file_path}" {output_file_path} tmpFolder -c {coverage} --cov-mode {coverage_mode} -e {e} --format-output "{format_output}" --format-mode {format_mode}
sys.stdout = old_stdout

print("\n Foldseek Run Completed")

In [None]:
# @title **5. Display Alignment Results**

if file_output == 'HTML':
  print("Alignment results will be in HTML format. Please download HTML file below to view the alignment results.")

else:


  #add column names to output data
  output_df = pd.read_csv(output_file_path, delimiter='\t', names=format_output.split(','))

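  #drop the last four characters of each target name (presumably a file extension
  #such as .pdb) so that targets match the cluster_member names
  output_df['target'] = output_df['target'].str[:-4]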

  #merge output_df with structures_df to obtain the cluster_id and cluster_count
  structure_df = structure_df.copy()
  structure_df['cluster_ID'] = structure_df['cluster_ID'].astype(str)
  structure_df['cluster_count'] = structure_df['cluster_count'].astype(str)

  structure_df = structure_df[['cluster_member','cluster_ID', 'cluster_count']]
  merged_output_df = pd.merge(output_df, structure_df, left_on='target', right_on='cluster_member', how='left')
  #if cluster_member, cluster_iD, cluster_count is nan, fill in the values with unassigned
  merged_output_df['cluster_member'] = merged_output_df['cluster_member'].fillna('unassigned')
  merged_output_df['cluster_ID'] = merged_output_df['cluster_ID'].fillna('unassigned')
  merged_output_df['cluster_count'] = merged_output_df['cluster_count'].fillna('unassigned')

  #select relevant values for the final output table
  merged_output_df.drop(columns=['cluster_member'], inplace=True)
  merged_output_df.insert(2, 'cluster_ID', merged_output_df.pop('cluster_ID'))
  merged_output_df.insert(3, 'cluster_count', merged_output_df.pop('cluster_count'))


  #Import interactive data table
  from google.colab import data_table
  data_table.enable_dataframe_formatter()

  # Displaying alignment results
  cluster_counts_df = merged_output_df.drop_duplicates('target')
  cluster_counts_df = cluster_counts_df['cluster_ID'].value_counts().rename_axis('cluster_ID').reset_index(name='Number of proteins found')
  cluster_counts_df['Fraction of total hits'] = round((cluster_counts_df['Number of proteins found']/cluster_counts_df['Number of proteins found'].sum()),2)
  cluster_counts_df = cluster_counts_df.sort_values(by='Fraction of total hits', ascending= False)
  cluster_counts_df = pd.merge(cluster_counts_df, structure_df.drop_duplicates('cluster_ID'), on = 'cluster_ID', how='left')

  cluster_counts_df['Fraction of cluster with an alignment'] = round((cluster_counts_df['Number of proteins found']/cluster_counts_df['cluster_count'].astype(float)),2)
  cluster_counts_df = cluster_counts_df.copy()
  cluster_counts_df = cluster_counts_df[['cluster_ID', 'Number of proteins found', 'Fraction of total hits', 'Fraction of cluster with an alignment']]
  cluster_counts_df['Fraction of cluster with an alignment'] = cluster_counts_df['Fraction of cluster with an alignment'].fillna("NA")

  print("Number of clusters found in alignment results: ", len(cluster_counts_df) )
  print("\nDisplaying clusters found in alignment results in table below: \n")
  display(data_table.DataTable(cluster_counts_df, include_index=False, num_rows_per_page=15))

  print("\nDisplaying alignment results in table below: ")
  display(data_table.DataTable(merged_output_df, include_index=False, num_rows_per_page=15))

In [None]:
# @title **6. Download Alignment Results**

if file_output == 'tabular':
  merged_output_df.to_csv('final_output.csv', index=False)
  files.download('final_output.csv')
  print("Tabular file Downloaded")

else:
  files.download('output.html')
  print("HTML file Downloaded")


In [None]:
# @title **7. Explore Cluster Members**

if file_output == 'tabular':
  clusterID = '' #@param {type:"string"}
  clusterID_results = merged_output_df[merged_output_df['cluster_ID']==clusterID]
  display(data_table.DataTable(clusterID_results, include_index=True, num_rows_per_page=15))
else:
  print("Find cluster members of interest by exploring the tabular output.")