<a href="https://colab.research.google.com/github/phenix-project/Colabs/blob/main/alphafold2/AlphaFold2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### <center> <b> <font color='black'>  AlphaFold (Phenix version) </font></b> </center>

<font color='green'>This notebook runs AlphaFold with varying random seeds and chooses the highest-scoring model.  You upload a sequence and it carries out one cycle of AlphaFold modeling. If you want to upload templates, they can be in either mmcif or pdb format.

This notebook is derived from [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) and the DeepMind [AlphaFold2 Colab](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb).
</font>

-----------------
<b> <font color='black'> <center>Instructions for a simple run:</center>
</font></b> 

1. Select the "Basic Inputs" cell, type in a sequence and jobname. You can also edit the Options in the next cell if you want.

2. Start your run by going up to the <b><i>Runtime</i></b> pulldown menu and selecting <b><i>Run all</i></b>

3. Scroll down the page and follow what is going on.  If necessary, upload your files when the Upload button appears below the "Setting up input files" form. If you use Google Drive for your output files you will be asked for permission.

4. If the notebook crashes, try selecting <b><i>Run all</i></b> again.

5. See the helpful hints at the bottom of the page for more details and advanced notes.

-----------------
<b> <font color='black'> <center>Please cite the ColabFold and AlphaFold2 papers if you use this notebook:</center>
</font></b> 

- <font color='green'>[Mirdita, M.,  Ovchinnikov, S., Steinegger, M.(2021). ColabFold - Making protein folding accessible to all *bioRxiv*, 2021.08.15.456425](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v2)</font> 

- <font color='green'> [Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)](https://www.nature.com/articles/s41586-021-03819-2)
</font>
-----------------


In [None]:
#@title 1. Basic inputs (Required)
#@markdown Select this cell, then enter sequence of chain to predict (at least 20 residues) and name of this job


sequence = 'ITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIA PGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFP LQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGP' #@param {type:"string"}
resolution =  None
jobname = '7mlz' #@param {type:"string"}
phenix_download_password='None'
query_sequence = sequence
password = phenix_download_password

content_dir = '/content'
# Save all parameters in a dictionary
params = {}
for p in ['content_dir','resolution','jobname', 'password', 'query_sequence']:
  params[p] = locals().get(p,None)
! touch STEP_1


In [None]:
#@title 2. Options (Run without changes for a simple job)

import os
if not os.path.isfile("STEP_1"):
  raise AssertionError("Please run cell 1 first")
#@markdown Check if you want your outputs saved to the directory <b>ColabOutputs</b> on Google Drive
save_outputs_in_google_drive = False #@param {type:"boolean" }

#@markdown If any models you want to use are already uploaded, fill in the name of the directory containing just these files here
#@markdown (usually put them in <b>ColabInputs</b>). Skip parts of the path like /content/ or MyDrive/. Leave blank to upload directly.
input_directory = "" #@param {type:"string"}

#@markdown Choose what templates to include (those from the PDB are based on sequence 
#@markdown similarity):
include_templates_from_pdb = False #@param {type:"boolean" }
maximum_templates_from_pdb = 20 #@param {type:"integer"}
upload_manual_templates = False #@param {type:"boolean" }

uploaded_templates_have_exact_sequence = True 
uploaded_templates_are_map_to_model = (not uploaded_templates_have_exact_sequence)
maximum_cycles =  1
phenix_version ='dev-4536'
version = phenix_version  # rename variable

#@markdown Specify if you want to run a series of jobs by uploading a file with one jobname, resolution and sequence per line
upload_file_with_jobname_sequence_lines = False #@param {type:"boolean"}
upload_file_with_jobname_resolution_sequence_lines = upload_file_with_jobname_sequence_lines

msa_use = 'Use MSA throughout' 


#@markdown Specify how many randomizations to carry out:
random_seed_iterations = 50 #@param {type:"integer"}
random_seed = 581867 #@param {type:"integer"}

#@markdown You can skip all multiple sequence alignment information
skip_all_msa = False #@param {type:"boolean"}

#@markdown Turn on debugging
debug = False #@param {type:"boolean"}

# We are going to get these from uploaded file...
if upload_file_with_jobname_resolution_sequence_lines:
  params['jobname'] = None
  params['resolution'] = None
  params['sequence'] = None


skip_all_msa_after_first_cycle = False

upload_maps = False  # Always for this version
upload_msa_file = False  # Always for this version
starting_alphafold_model = False  # Always for this version
use_msa = (not skip_all_msa)

minimum_random_seed_iterations = int(max(1,random_seed_iterations//20))
data_dir = "/content"
content_dir = "/content"

if type(params) != type({}):  # convert back to dict
  params = params()

# Save parameters
for p in ['content_dir','data_dir','save_outputs_in_google_drive','input_directory',
    'include_templates_from_pdb','maximum_templates_from_pdb',
    'upload_manual_templates','uploaded_templates_are_map_to_model',
    'maximum_cycles','version',
    'upload_file_with_jobname_resolution_sequence_lines',
    'use_msa','skip_all_msa_after_first_cycle',
    'upload_maps','debug','carry_on','random_seed',
    'upload_msa_file','starting_alphafold_model',
    'random_seed_iterations','minimum_random_seed_iterations']:
  params[p] = locals().get(p,None)
! touch STEP_2


In [None]:
#@title 3. Setting up input files...
#@markdown You will be asked for permission to use your Google drive if needed.

#@markdown The upload button will appear below this cell if needed

import os
if not os.path.isfile("STEP_2"):
  raise AssertionError("Please run steps 1 and 2 first")
# Install dependencies, then get the helper python files

print("\nINSTALLING BIOPYTHON")
!pip install biopython dm-haiku==0.0.5 ml-collections py3Dmol
print("\nINSTALLING JAX 0.3.15")
!pip install jax==0.3.15
!pip install jaxlib==0.3.15
print("\nDONE INSTALLING JAX 0.3.15")
print("\nPLEASE IGNORE ALL jax ERROR MESSAGES")

os.chdir('/content')
file_name = 'phenix_colab_utils.py'
if os.path.isfile(file_name):
  os.remove(file_name)
os.environ['file_name'] = file_name
result = os.system("wget -qnc https://raw.githubusercontent.com/phenix-project/Colabs/main/alphafold2/$file_name")

import phenix_colab_utils as cu

cu.install_miniconda()

cu.get_helper_files()  # get all the other helper files
cu.clear_python_caches()


# Set up the inputs using the helper python files
from phenix_alphafold_utils import set_up_input_files
params = set_up_input_files(params, convert_to_params = False)
!rm -f STEP_1 STEP_2
!touch STEP_3


In [None]:
#@title 4. Installing Alphafold and utilities...
#@markdown This step takes 2 minutes

import os
if not os.path.isfile("STEP_3"):
  raise AssertionError("Please run step 3 first")

import phenix_colab_utils as cu

# Get tensorflow import before installation
if not locals().get('tf'):
  tf = cu.import_tensorflow()

# Install selected software
cu.install_software(
  bioconda = True,
  phenix = False,
  phenix_version = params.get('version'),
  phenix_password = params.get('password'),
  alphafold = True,
  pdb_to_cif = True,
)
! touch STEP_4

In [None]:
#@title 5. Creating AlphaFold models

import os
from phenix_colab_utils import exit

if not os.path.isfile("STEP_3"):
  exit("Please run steps 1-3 again before rerunning this step...")

if not os.path.isfile("STEP_4"):
  exit("Please run step 4 first...")

! rm -f STEP_1 STEP_2 STEP_3

# Convert params from dict to alphafold_with_density_map params
from phenix_alphafold_utils import get_alphafold_with_density_map_params
params = get_alphafold_with_density_map_params(params)

from run_alphafold_with_density_map import run_jobs

# Working directory
os.chdir(params.content_dir)
results = run_jobs(params)

  


In [None]:
#@title Utilities (skipped unless checked)

# Put whatever utilities you want here. They will be run if checked
clear_caches = False #@param {type:"boolean" }
if clear_caches:
  from phenix_colab_utils import clear_python_caches
  # First clear the named helper modules, then run the default clean-up
  clear_python_caches(modules = [
      'run_alphafold_with_density_map3','run_job','rebuild_model',
      'install_phenix','run_fix_paths','runsh','mk_mock_template',
      'mk_template','hh_process_seq','get_template_hit_list',
      'run_alphafold_with_density_map','get_cif_file_list',
      'alphafold_utils','get_msa','get_templates_from_drive',
      'phenix_alphafold_utils','phenix_colab_utils','clear_python_caches'])
  clear_python_caches()


crash_deliberately_and_restart = False #@param {type:"boolean" }
if crash_deliberately_and_restart:
  print("Crashing by using all memory.  Results in restart, losing everything")
  [1]*10**10

upload_helper_files = False #@param {type:"boolean" }
def get_helper_files():
  import os
  for file_name in ['phenix_colab_utils.py',
      'alphafold_utils.py','run_alphafold_with_density_map.py','phenix_alphafold_utils.py']:
    if os.path.isfile(file_name):
      os.remove(file_name)
    os.environ['file_name'] = file_name
    result = os.system("wget -qnc https://raw.githubusercontent.com/phenix-project/Colabs/main/alphafold2/$file_name")
if upload_helper_files:
  get_helper_files()

remove_everything_and_restart = False #@param {type:"boolean" }
if remove_everything_and_restart:
  !kill -9 -1

auto_reload = False #@param {type:"boolean" }
if auto_reload:
  %load_ext autoreload
  %autoreload 2

**Helpful hints**

**Colab crashes**

* The notebook can crash if it runs out of memory, if some inputs are unexpected, or if there is something wrong with the notebook itself. Sometimes there is a useful error message below the cell that crashes (and sometimes not).

* The easiest thing to do on a crash is just run again.  Normally your installation is still there and so this just restarts right away. You don't have to put your inputs in again.

* If rerunning fails, you can reload the notebook (with the reload button on your browser) and run again (this requires putting in your inputs again).

* You can try checking the "debug" box on rerunning to (possibly) get a better error message.


**Saving your results**

* You might want to download your results as they appear.  Go to the Folder icon on the left, click on the 3 dots to the right of your file and select "Download".

* If you specify a Google drive input_directory (maybe "ColabInputs"), then your output files will be saved as they are created in a directory called ColabOutputs in your Google drive.

**Sequence format**

* Your sequence should contain only the 1-letter code of one protein chain. It can contain spaces if you want.
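
The sequence rule above can be checked before a run with a few lines of Python. This is only a sketch: `clean_sequence`, `VALID_RESIDUES` and `MIN_LENGTH` are hypothetical names that are not part of the notebook's helper files, and the restriction to the 20 standard amino acids is an assumption (the notebook itself may accept other letters). The 20-residue minimum comes from the note in the "Basic inputs" cell.

```python
# Hypothetical helper, not part of the notebook's own code.
# Assumption: only the 20 standard one-letter amino acid codes are allowed.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 20  # "at least 20 residues" (Basic inputs cell)


def clean_sequence(raw):
    """Remove spaces and line breaks, upper-case, and check the letters."""
    seq = "".join(raw.split()).upper()
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError("Unexpected characters in sequence: " + "".join(bad))
    if len(seq) < MIN_LENGTH:
        raise ValueError(
            "Sequence has %d residues; at least %d are needed" % (len(seq), MIN_LENGTH))
    return seq
```

For example, `clean_sequence("itnlc pfgev fnatr fasvy")` returns the 20-residue string with the spaces removed.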

**File names and jobname must match**
* Your AlphaFold predictions will be named yyyy_unrelaxed_model_1_x.pdb
and your rebuilt models yyyy_unrelaxed_model_1_x_rebuilt.pdb, where yyyy is your jobname and x is the cycle number.

* All model file names must start with 4 characters, optionally followed by "_" and more characters, and must end in ".pdb" or ".cif". Valid file names are abcd.pdb, abcd.cif, abcd_other.pdb. Invalid names are abc.pdb, abcde.cif.

* Your jobname must match the beginnings of your map file names and model file names.  If your jobname is joba then your map file name must look like: joba_xxx.mrc or joba_yyy.ccp4.  Your model file name must look like: joba_mymodel.pdb or joba.cif.  This correspondence is used to match map and model files with jobnames.
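
The naming rules above can be expressed as a regular expression. The sketch below is one possible reading of those rules, not the check the notebook actually performs: `MODEL_NAME`, `model_prefix` and `matches_jobname` are hypothetical names, and it assumes the 4 leading characters are letters or digits and that the jobname is exactly that 4-character prefix.

```python
import re

# Hypothetical pattern: 4 letters/digits, optional "_" plus more characters,
# then ".pdb" or ".cif". The alphanumeric restriction is an assumption.
MODEL_NAME = re.compile(r"^(?P<prefix>[A-Za-z0-9]{4})(?:_[^/]+)?\.(?:pdb|cif)$")


def model_prefix(file_name):
    """Return the 4-character prefix of a valid model file name, else None."""
    match = MODEL_NAME.match(file_name)
    return match.group("prefix") if match else None


def matches_jobname(file_name, jobname):
    """True if the file name is valid and its prefix equals the jobname."""
    return model_prefix(file_name) == jobname
```

With these definitions, `model_prefix("abcd_other.pdb")` gives `"abcd"`, while `model_prefix("abc.pdb")` and `model_prefix("abcde.cif")` give `None`.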

**Uploading a file with all your file information**

* To upload a file with a jobname and sequence on each line, check ***upload_file_with_jobname_sequence_lines*** and hit the ***Run*** button to the left of the first cell.

* If you upload a file with multiple sequences, each line of the file should have exactly one job name, a space, and a sequence, like this:

7n8i_24237 VIWMTQSPSSLSASVGDRVTITCQASQDIRFYLNWYQQKPGKAPKLLISDASNMETGVPSRFSGS

7lvr_23541 MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETG
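
If you want to check such a file before uploading it, the format above (one jobname, a space, then a sequence) can be parsed as sketched here. `parse_job_lines` is a hypothetical helper, not the parser used by the notebook, and skipping blank lines is an assumption.

```python
def parse_job_lines(text):
    """Split each non-blank line into (jobname, sequence).

    Hypothetical helper: the first whitespace-separated word is taken as
    the jobname and everything after it as the sequence.
    """
    jobs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines between entries are ignored
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("Line %d: expected 'jobname sequence'" % line_number)
        jobname, sequence = parts
        # Remove any spaces left inside the sequence itself
        jobs.append((jobname, "".join(sequence.split())))
    return jobs
```

Reading the file with `open(path).read()` and passing the text to `parse_job_lines` returns a list of `(jobname, sequence)` tuples in file order.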

**Randomized tries**

* You can specify how many AlphaFold models to try and build at the start (50 may be a good number unless you have a big structure). Models are scored by pLDDT and the highest-scoring one is kept. If all the models have similar pLDDT as they are being created, the randomization step is discontinued and the best one found is used.
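
The selection logic described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the notebook's implementation: `pick_best_model` and `tolerance` are hypothetical, and the stopping rule (stop once at least two models and the minimum number of models have been made and all scores so far lie within `tolerance` of each other) is a guess at what "similar pLDDT" means. Only the formula for the minimum number of iterations is taken from the Options cell.

```python
def pick_best_model(scores, random_seed_iterations=50, tolerance=1.0):
    """Return (index_of_best_score, number_of_models_examined).

    scores: mean pLDDT values, one per random seed, in the order created.
    tolerance: hypothetical spread (in pLDDT units) below which the scores
    are treated as similar and the randomization stops early.
    """
    # Same formula as minimum_random_seed_iterations in the Options cell
    minimum_iterations = max(1, random_seed_iterations // 20)
    best_index = None
    seen = []
    for i, score in enumerate(scores[:random_seed_iterations]):
        seen.append(score)
        if best_index is None or score > scores[best_index]:
            best_index = i
        # Assumed stopping rule: enough models made and all scores similar
        enough = len(seen) >= minimum_iterations and len(seen) > 1
        if enough and max(seen) - min(seen) <= tolerance:
            break
    return best_index, len(seen)
```

With widely varying scores every seed is tried and the highest-scoring model wins; with nearly identical scores the loop stops after the first few models.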

**Try turning off MSAs if you supply templates**

* You can encourage AlphaFold to use your rebuilt templates by specifying skip_all_msa=True. This will just use your template information and intrinsic structural information.

* Reproducibility: The tensorflow and AlphaFold2 code will give different results depending on the GPU that is used. You can see what GPU you have by opening a cell with the '+Code' button, typing `! nvidia-smi`, and then running that cell. The GPU type will be listed (like Tesla V100-SXM2). You get a much higher-quality GPU with Colab Pro or Pro+ than with the free version.
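
If you would rather record the GPU type from Python (for example to note it alongside your results), something like the following may work. `gpu_names` is a hypothetical helper that is not part of the notebook; it calls the same `nvidia-smi` program and returns an empty list when no GPU or driver is available.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True)
    if result.returncode != 0:
        return []  # nvidia-smi is present but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print("GPUs found:", gpu_names())
```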

**Running cells in this Colab notebook**
* You can step through this notebook one part at a time
by hitting the ***Run*** buttons to the left one at a time. 

* The cell that is active is indicated by a ***Run*** button that has turned into a black circle with a moving black arc

* When execution is done, the ***Run*** button will go back 
to its original white triangle inside a black circle

* You can stop execution of the active cell by hitting its ***Run*** button. It will turn red to indicate it has stopped.

* You can rerun any cell any time that nothing is running.  That means you can go all the way through, then go back to the first cell and enter another sequence and redo the procedure.

* If something goes wrong, the Colab Notebook will print out
an error message.  Usually this will be something telling you
how to change your inputs.  You enter your new inputs and
hit the ***Run*** button again to carry on.

**Possible problems**

* The automatic download may not always work. Normally the file download starts when the .zip files are created, but the actual download happens when all the AlphaFold models are completed. You can click on the folder icon to the left of the window and download your jobname.zip file manually.  Open and close the file browser to show recently-added files.

* Your Colab connection may time out if you go away and
leave it, or if you run for a long time (more than an hour).
If your connection times out you lose everything that
is not yet downloaded. So you might want to download as you go or specify a Google drive input directory.

* The zip file will not be automatically downloaded until the very end of the job. 

* Google Colab assigns different types of GPUs with varying amount of memory. Some might not have enough memory to predict the structure for a long sequence.  


**Result zip file contents**

1. PDB formatted structure
2. Plot of the model quality (lDDT).
3. Plots of the MSA coverage.
4. A3M formatted input MSA.
5. BibTeX file with citations for all used tools and databases.
6. JSON file with predicted error matrix (PAE matrix)

At the end of the job the `jobname.result.zip` file or files will be downloaded automatically.


**AlphaFold limitations**
* The use of AlphaFold in Colab is allowed for non-commercial and for commercial use (no restrictions)


**Colab limitations**
* While Colab is free, it is designed for interactive work and does not offer unlimited memory or GPU usage. It will time out after a few hours and may check at random times that you are not a robot.  On a time-out you may lose your work. You can increase your allowed time with Colab Pro or Pro+.

* AlphaFold can crash if it requires too much memory. On a crash you may lose all your work that is not yet downloaded. More memory is available with Colab Pro or Pro+. If you are familiar with Colab scripts you can try this [hack](https://towardsdatascience.com/double-your-google-colab-ram-in-10-seconds-using-these-10-characters-efa636e646ff) to increase your memory allowance.


**Description of the plots**

*   **Number of sequences per position** - For best performance, look for at least 30 sequences per position, and ideally 100.
*   **Predicted lDDT per position** - model confidence (out of 100) at each position. The higher the better.
*   **Predicted Alignment Error** - For homooligomers, this could be a useful metric to assess how confident the model is about the interface. The lower the better.
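
To look at the per-position confidence numerically rather than in the plot, you can read it from a predicted model file. The sketch below assumes the output PDB files follow the usual AlphaFold convention of storing the predicted lDDT in the B-factor column (columns 61-66 of each ATOM record); `per_residue_plddt` and `mean_plddt` are hypothetical helpers, not functions from the notebook.

```python
def per_residue_plddt(pdb_text):
    """Return one value per residue, taken from the CA atom B-factor field.

    Assumption: the B-factor column holds the predicted lDDT (0-100).
    """
    values = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, columns 61-66 the B-factor
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


def mean_plddt(pdb_text):
    """Average of the per-residue values, or None if no CA atoms are found."""
    values = per_residue_plddt(pdb_text)
    return sum(values) / len(values) if values else None
```

For a downloaded model you could call `mean_plddt(open(file_name).read())`, where `file_name` is the name of one of your predicted PDB files.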

**Updates**

- <b> <font color='green'>2021-11-12 Clarified use of PDB templates and allowed skipping MSA information</font></b>
- <b> <font color='green'>2022-02-17 Added randomization</font></b>

**Acknowledgments**

- <b> <font color='green'>This notebook is based on the very nice notebook from ColabFold ([Mirdita et al., *bioRxiv*, 2021](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1), https://github.com/sokrypton/ColabFold)</font></b> 

- <b><font color='green'>ColabFold is based on AlphaFold2 [(Jumper et al. 2021)](https://www.nature.com/articles/s41586-021-03819-2)
</font></b>