<a href="https://colab.research.google.com/github/luquelab/pyCapsid/blob/Colab/notebooks/pyCapsid_colab_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# pyCapsid Colab




## Description
This Colab notebook contains a pipeline to predict the dynamics and quasi-rigid mechanical units of large protein complexes, with an emphasis on protein shells, like viral capsids. The protein complex and parameters are specified in the notebook's section [Input Structure and Parameters](#scrollTo=6MIuCLJbANGg). After specifying the desired options, it is recommended to execute the notebook by choosing the option `Run All` from the Colab menu `Runtime`.

Expect small protein shells (up to \~40,000 residues) to run in less than 10 minutes and medium shells (\~80,000 residues) to take 2 or more hours using the free Colab cloud service. To investigate capsids that require more than 8 GB of RAM (120,000+ residues), consider upgrading your [Colab plan](https://colab.research.google.com/signup) or installing pyCapsid locally via [GitHub](https://github.com/luquelab/pyCapsid), [PIP](https://pypi.org/project/pyCapsid/), or [Conda](https://anaconda.org/luque_lab/pycapsid), as detailed in its [online installation guide](https://luquelab.github.io/pyCapsid/installation/).

This Colab notebook builds on the Python package [pyCapsid](https://luquelab.github.io/pyCapsid/), which combines elastic network models, normal mode analysis, and clustering methods to obtain the dynamics of protein complexes and analyze protein shells. For further technical details, see pyCapsid's [online documentation](https://luquelab.github.io/pyCapsid/).

## Issues, support, and citation
pyCapsid is distributed under the MIT License, a permissive free software license. It is recommended to run the Colab notebook using the [Google Chrome browser](https://www.google.com/chrome/).

+ If you encounter any problem using pyCapsid or require additional functionality, please [open an issue on GitHub](https://github.com/luquelab/pyCapsid/issues).
+ If you use pyCapsid and would like to help support its further development, please [add a star to its GitHub repository](https://github.com/luquelab/pyCapsid).
+ If you publish any work that includes the use of pyCapsid, please follow its [online citation guide](https://luquelab.github.io/pyCapsid/acknowledgements/).

# Input structure and parameters

## Protein complex
pyCapsid requires the protein complex to be encoded in the [Protein Data Bank (PDB) format](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction).

### Source and structure
Fill in the variables in the code block below to specify the source and identifier of the structure. The variables are:
+ `pdb_source` determines whether the structure will be fetched from the Protein Data Bank (`'PDB'`) or uploaded (`'upload'`).
+ `pdb_id` stores the PDB ID used to fetch the structure online.
+ `pdbx` is a boolean: set it to `True` if the structure file is in the PDBx/mmCIF format and to `False` if it is in the PDB format.

If `pdb_source` is `'upload'`, executing the code cell below opens a prompt to choose the file from the local directory. The maximum file size allowed in the standard Colab cloud server is 2 GB.

In [None]:
# Specify the PDB source
pdb_source = 'PDB' #Values expected: 'PDB' or 'upload'

# Specify the PDBid (if the structure has to be fetched online)
pdb_id = '4oq8'

# Specify whether the structure file is in the PDBx/mmCIF format
pdbx = False

# Print option
if pdb_source == 'PDB':
  print('The structure with PDBid ' + pdb_id + ' will be fetched from the Protein Data Bank.')

elif pdb_source == 'upload':

  from google.colab import files
  uploaded = files.upload()
  pdb_file_name = list(uploaded.keys())[0] # Extract PDB file name
  print('The name of the PDB file is ' + pdb_file_name)

else:
  print('The value `' + pdb_source + '` specified in `pdb_source` is not valid. Choose from the expected options above.')

## pyCapsid parameters

Edit the variables in the code block below to specify the main options in the pyCapsid calculations. If you do not edit anything, the default values will be used. The list below describes the variables and their options:

+ `ENM_model` specifies the elastic network model used to coarse-grain the protein complex. Four models are available:
  + `ANM`: Anisotropic network model with a default cutoff of 15Å and no distance weighting.
  + `GNM`: Gaussian network model (no three-dimensional directionality) with a default cutoff of 7.5Å and no distance weighting.
  + `U-ENM`: Unified elastic network model with a default cutoff of 7.5Å and a default anisotropy parameter (`f_anm`) of 0.1. It is the **default** and **recommended** option.
  + `bbENM`: Backbone-enhanced elastic network model with a default cutoff of 7.5Å and no distance weighting.

+ `n_modes` specifies the number of modes used in the calculation of the dynamics. The values accepted are typically from 'X' to 'Y'. The default value is 400 (?). However, using 200 modes can yield good results. Increasing the number of modes often improves the results but requires longer computation times.

+ `cluster_min` specifies the minimum number of clusters used in the clustering analysis to identify the optimal quasi-rigid mechanical units. A value of 1 implies that the whole protein complex is analyzed as a single domain. The default value is 2.

+ `cluster_max` specifies the maximum number of clusters used in the clustering analysis to identify the optimal quasi-rigid mechanical units. The number of residues in the structure is an upper bound. The default value is 100. The value should be at least the number of proteins in the structure. Ideally, it should be the total number of proteins times the number of expected protein domains defining the protein fold.

+ `cluster_delta` specifies the step size used when exploring the range of cluster numbers to determine the optimal quasi-rigid mechanical units. The default value is 2. It is recommended to refine the search in a subregion once a potential optimal result has been identified.
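
As a rough illustration of how the three cluster parameters interact, the sketch below lists the cluster counts that a search would visit, first coarsely and then refined around a candidate optimum. `cluster_counts` is a hypothetical helper and not part of pyCapsid, and the sketch assumes that `cluster_max` is included in the range, which may differ from pyCapsid's internal convention.

```python
def cluster_counts(cluster_min, cluster_max, cluster_delta):
    """Hypothetical helper: list the numbers of clusters a search would try."""
    if cluster_min < 1 or cluster_max < cluster_min or cluster_delta < 1:
        raise ValueError('Expected 1 <= cluster_min <= cluster_max and cluster_delta >= 1.')
    # Assumption: the upper bound is inclusive.
    return list(range(cluster_min, cluster_max + 1, cluster_delta))

# Coarse search with the values used in this notebook.
coarse = cluster_counts(4, 100, 2)
print(len(coarse), coarse[:3], coarse[-1])

# Refined search around a candidate optimum (60 is an arbitrary example).
fine = cluster_counts(56, 64, 1)
print(fine)
```

A coarse pass followed by a narrow pass with a step of 1 keeps the number of clustering runs small while still resolving the optimum.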

In [None]:
# Specify values

## Elastic network model
ENM_model = 'U-ENM' # Values expected: 'ANM', 'GNM', 'U-ENM', and 'bbENM'.

## Number of modes used in the dynamics
n_modes = 50

## Cluster range and step in the optimal analysis of quasi-rigid units.
cluster_min = 4
cluster_max = 100
cluster_delta = 2

# Double-check options

## Elastic model
valid_ENM = ['ANM', 'GNM', 'U-ENM', 'bbENM']
if ENM_model in valid_ENM:
  print('The ENM model used for coarse-graining is ' + ENM_model + '.')

else:
  print('The value `' + ENM_model + '` specified in `ENM_model` is not valid. Choose from the expected options above.')

## Modes
### Cast to non-negative integers
n_modes = abs(int(n_modes))
if n_modes > 0:
  print('The number of modes in the dynamics will be ' + str(n_modes) + ' .')
else:
  print('WARNING: The value of `n_modes` should be larger than zero.')

## Clusters
### Cast to non-negative integers
cluster_min = abs(int(cluster_min))
cluster_max = abs(int(cluster_max))
cluster_delta = abs(int(cluster_delta))
if ((cluster_min > 0) and (cluster_max > 0)):
  if cluster_min <= cluster_max:
    print('The lowest number of quasi-rigid units explored will be ' + str(cluster_min) + ' .')
    print('The largest number of quasi-rigid units explored will be ' + str(cluster_max) + ' .')
    print('The resolution of search for the optimal number of quasi-rigid units will be ' + str(cluster_delta) + ' .')

  elif cluster_min > cluster_max:
    print('WARNING: The value of `cluster_min` should be smaller or equal to `cluster_max`.')

else:
  print('WARNING: The values of `cluster_min` and `cluster_max` should be both larger than zero.')

# Installation

The following command installs pyCapsid and the necessary components for visualizing results in this notebook.

In [None]:
!pip install --upgrade pyCapsid ipywidgets==7.7.2 nglview
from google.colab import output
output.enable_custom_widget_manager()

# Execute the pyCapsid pipeline


## Extract features from input structure
This code loads the information from the input structure (PDB format) necessary for the calculations and validation in pyCapsid.



In [None]:
from pyCapsid.PDB import getCapsid

if pdb_source == 'PDB':
  # Extract the features fetching the PDB structure from the Protein Data Bank.
  pdb = pdb_id
  capsid, calphas, coords, bfactors, chain_starts, title = getCapsid(pdb)
  print('The structure ' + pdb + ' was fetched in the pyCapsid pipeline.')

elif pdb_source == 'upload':
  pdb = pdb_file_name
  capsid, calphas, coords, bfactors, chain_starts, title = getCapsid(pdb, local = True)
  print('The structure in the file ' + pdb + ' was input into the pyCapsid pipeline.')
else:
  print('WARNING: The PDB structure is not available.')

## Build the elastic network model (ENM)
This step uses the function `buildENMPreset` in pyCapsid to specify the elastic network model and build the associated Hessian matrix.
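
To give a sense of what such a Hessian looks like, here is a minimal dense sketch of the textbook anisotropic network model with the 15 Å cutoff and uniform springs mentioned above. It is not pyCapsid's implementation: `anm_hessian`, `gamma`, and the random coordinates are illustrative assumptions, and a dense matrix like this is only practical for very small systems.

```python
import numpy as np

def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Dense ANM Hessian (3N x 3N) for an (N, 3) array of coordinates in angstroms."""
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue  # no spring between residues beyond the cutoff
            block = -gamma * np.outer(d, d) / dist2
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            # Each diagonal block is minus the sum of its off-diagonal blocks.
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian

rng = np.random.default_rng(0)
toy_coords = rng.uniform(0.0, 20.0, size=(30, 3))  # synthetic points, not a real capsid
toy_hessian = anm_hessian(toy_coords)
print(toy_hessian.shape)
```

The resulting matrix is symmetric, and a uniform translation of all points costs no energy, which is why the lowest modes of such a model are rigid-body motions.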

In [None]:
from pyCapsid.CG import buildENMPreset
kirch, hessian = buildENMPreset(coords, preset = ENM_model)

## Perform the Normal Mode Analysis (NMA)

This section obtains the dynamics of the protein complex based on the dominant normal modes activated by thermal energy.

### Calculate the low frequency modes
This code obtains the number of low-frequency modes specified by the variable `n_modes` in the [pyCapsid parameters section](#scrollTo=b_8gyJk1wLlV). The calculation relies on the eigenvalues and eigenvectors of the Hessian matrix obtained in the [Build the elastic network model section](#scrollTo=cf28904f).
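
The idea of selecting low-frequency modes can be sketched on a toy system with NumPy alone. The example assumes a small dense matrix and a hypothetical helper `lowest_modes`; it discards the zero (rigid-body) modes and keeps the smallest nonzero eigenvalues. It does not reproduce how `modeCalc` works internally, and capsid-sized matrices call for sparse solvers rather than `np.linalg.eigh`.

```python
import numpy as np

def lowest_modes(hessian, n_modes, zero_tol=1e-8):
    """Return the n_modes lowest nonzero eigenvalues and their eigenvectors."""
    evals, evecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    keep = evals > zero_tol                 # drop rigid-body (zero) modes
    return evals[keep][:n_modes], evecs[:, keep][:, :n_modes]

# Toy system: a free chain of 10 beads joined by unit springs (one rigid-body mode).
n = 10
chain = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
chain[0, 0] = chain[-1, -1] = 1.0

toy_evals, toy_evecs = lowest_modes(chain, n_modes=3)
print(toy_evals)
```

Each column of `toy_evecs` is one mode, and small eigenvalues correspond to soft, large-amplitude collective motions.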

In [None]:
from pyCapsid.NMA import modeCalc
evals, evecs = modeCalc(hessian, n_modes = n_modes)

### Predict, scale, and validate the b-factors
This code uses the resulting normal modes and frequencies to predict the B-factors of each alpha carbon, fits these results to the experimental values from the PDB entry, and plots the results for comparison.
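
For orientation, the standard relation between normal modes and B-factors can be sketched as follows: each residue's mean squared fluctuation is the sum over modes of its squared mode components divided by the eigenvalue, and the B-factor is that value times 8π²/3 up to a scale factor fitted to experiment. The helpers `predict_bfactors` and `fit_scale`, the x1, y1, z1, x2, ... ordering of the degrees of freedom, and the synthetic data are assumptions for illustration, not necessarily what `fitCompareBfactors` does.

```python
import numpy as np

def predict_bfactors(evals, evecs):
    """Unscaled B-factors from modes stored as columns of a (3N, n_modes) array."""
    n_residues = evecs.shape[0] // 3
    # Weight each squared mode component by the inverse eigenvalue, then sum over modes.
    sq_fluct = (evecs ** 2 / evals).sum(axis=1)
    # Add the x, y, and z contributions of each residue.
    per_residue = sq_fluct.reshape(n_residues, 3).sum(axis=1)
    return (8.0 * np.pi ** 2 / 3.0) * per_residue

def fit_scale(predicted, experimental):
    """Least-squares scale a that minimizes |a * predicted - experimental|^2."""
    return float(predicted @ experimental / (predicted @ predicted))

rng = np.random.default_rng(1)
m = rng.normal(size=(12, 12))
toy_evals, toy_evecs = np.linalg.eigh(m @ m.T + np.eye(12))  # synthetic positive modes

predicted = predict_bfactors(toy_evals, toy_evecs)
experimental = 2.5 * predicted  # synthetic stand-in for experimental B-factors
scale = fit_scale(predicted, experimental)
corr = np.corrcoef(scale * predicted, experimental)[0, 1]
print(round(scale, 3), round(corr, 3))
```

With real data the correlation coefficient between scaled predictions and experimental B-factors is the usual measure of how well the model captures the observed flexibility.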

In [None]:
from pyCapsid.NMA import fitCompareBfactors
evals_scaled, evecs_scaled = fitCompareBfactors(evals, evecs, bfactors, pdb, fit_modes=False)

## Perform the analysis of quasi-rigid clusters (QRC)

In [None]:
from pyCapsid.NMA import calcDistFlucts
from pyCapsid.QRC import findQuasiRigidClusters

dist_flucts = calcDistFlucts(evals_scaled, evecs_scaled, coords)

cluster_start = cluster_min
cluster_stop = cluster_max
cluster_step = cluster_delta
labels, score, residue_scores  = findQuasiRigidClusters(pdb, dist_flucts, cluster_start=cluster_start, cluster_stop=cluster_stop, cluster_step=cluster_step)

## Visualize in jupyter notebook with nglview
You can visualize the results in the notebook with nglview. The following function returns an nglview object with the structure colored by cluster. See the [nglview documentation](http://nglviewer.org/nglview/release/v2.7.7/index.html) for further information.

In [None]:
# This cell will create a standard view of the capsid, which the next cell will
# modify to create the final result.
from pyCapsid.VIS import createCapsidView
view_clusters = createCapsidView(pdb, capsid)
view_clusters

In [None]:
# If the above view doesn't change coloration, run this cell again.
# In general do not run this cell until the above cell has finished rendering
from pyCapsid.VIS import createClusterRepresentation
createClusterRepresentation(pdb, labels, view_clusters)

# Add rep_type='spacefill' to represent the atoms of the capsid as spheres. This provides less information regarding the proteins but makes it easier to identify the geometry of the clusters
#createClusterRepresentation(pdb, labels, view_clusters, rep_type='spacefill')

In [None]:
# Once you've done this use this code to download the results
view_clusters.center()
view_clusters.download_image(factor=2)

Running the same code but replacing `labels` with `residue_scores` and adding `rwb_scale=True` visualizes the quality score of each residue. This score measures how rigid each residue is with respect to its cluster. Blue residues make up the cores of rigid clusters, and red residues mark the borders between clusters.
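
A red-white-blue scale of this kind can be sketched in a few lines. The helper `rwb_hex` is hypothetical and assumes a linear mapping from the lowest score (red) through the midpoint (white) to the highest score (blue); the colormap that pyCapsid actually applies may be defined differently.

```python
import numpy as np

def rwb_hex(scores):
    """Hypothetical helper: map scores linearly to red (low), white (mid), blue (high)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    # Normalize to [0, 1]; identical scores all map to the midpoint.
    t = (scores - scores.min()) / span if span > 0 else np.full(scores.shape, 0.5)
    # Red stays full up to the midpoint and then fades; blue mirrors it.
    red = np.clip(2.0 - 2.0 * t, 0.0, 1.0)
    blue = np.clip(2.0 * t, 0.0, 1.0)
    green = np.minimum(red, blue)
    rgb = np.rint(255 * np.stack([red, green, blue], axis=1)).astype(int)
    return ['#{:02x}{:02x}{:02x}'.format(*row) for row in rgb.tolist()]

toy_colors = rwb_hex([0.0, 0.5, 1.0])
print(toy_colors)
```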

In [None]:
# This code adds a colorbar based on the residue scores
print('Each atom in this structure is colored according to the clustering quality score of its residue.')
import matplotlib.colorbar as colorbar
import matplotlib.pyplot as plt
from pyCapsid.VIS import clusters_colormap_hexcolor
import numpy as np
hexcolor, cmap = clusters_colormap_hexcolor(residue_scores, rwb_scale=True)
fig, ax = plt.subplots(figsize=(10, 0.5))
cb = colorbar.ColorbarBase(ax, orientation='horizontal',
                            cmap=cmap, norm=plt.Normalize(np.min(residue_scores), np.max(residue_scores)))
plt.show()

# This cell will create an empty view, which the next cell will
# modify to create the final result.
from pyCapsid.VIS import createCapsidView
view_scores = createCapsidView(pdb, capsid)
view_scores

In [None]:
from pyCapsid.VIS import createClusterRepresentation
createClusterRepresentation(pdb, residue_scores, view_scores, rwb_scale=True)

In [None]:
# Once you've done this use this code to download the results
view_scores.center()
view_scores.download_image(factor=2)

## Downloading Results
Some results are saved automatically by pyCapsid and can be downloaded from Colab as follows.

In [None]:
# Check what files are saved
!dir

In [None]:
# Use the colab api to download the files
# You may have to provide permission to download files via your browser
from google.colab import files
files.download('results_plot.svg')

If a result you want isn't saved, you can save any of the arrays used in the notebook and download the file.

In [None]:
import numpy as np
filename = pdb + '_coords.txt'
np.savetxt(filename, coords)
from google.colab import files
files.download(filename)

# Visualizing saved results
The numerical results are saved as compressed .npz files by default and can be opened and used to visualize the results afterwards. This includes the ability to visualize clusterings other than the highest-scoring one. In this example, we visualize the results of clustering the capsid into 20 clusters.
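
Before relying on a saved archive, it can help to check which arrays it contains. The sketch below writes a small stand-in `.npz` file and reads it back with `np.load`; the file name and the array names `labels` and `score` are hypothetical, since the names stored by pyCapsid are not listed here.

```python
import os
import tempfile
import numpy as np

# Write a small stand-in archive; the array names here are hypothetical.
tmp_dir = tempfile.mkdtemp()
toy_file = os.path.join(tmp_dir, 'toy_results.npz')
np.savez_compressed(toy_file, labels=np.array([0, 0, 1, 1]), score=np.array([0.8]))

# np.load returns a lazy, dict-like archive; list the stored names before indexing.
with np.load(toy_file) as archive:
    stored_names = sorted(archive.files)
    toy_labels = archive['labels']
print(stored_names, toy_labels)
```

The same pattern applied to the results file shows which arrays are available for further analysis outside the notebook.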

In [None]:
from pyCapsid.VIS import visualizeSavedResults
results_file = f'{pdb}_final_results_full.npz' # Path of the saved results
labels_20, view_clusters = visualizeSavedResults(pdb, results_file, n_cluster=20, method='nglview')
view_clusters

In [None]:
# If the above view doesn't change coloration, run this cell again.
# In general do not run this cell until the above cell has finished rendering
from pyCapsid.VIS import createClusterRepresentation
createClusterRepresentation(pdb, labels_20, view_clusters)

# Add rep_type='spacefill' to represent the atoms of the capsid as spheres. This provides less information regarding the proteins but makes it easier to identify the geometry of the clusters
#createClusterRepresentation(pdb, labels_20, view_clusters, rep_type='spacefill')