<a href="https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>- for testing in different (commercial) environment

<img src="https://raw.githubusercontent.com/sokrypton/ColabFold/main/.github/ColabFold_Marv_Logo_Small.png" height="200" align="right" style="height:240px">


# General instructions and introduction
Welcome to the AlphaFold tutorial. We will use a modified version of ColabFold, which runs AlphaFold2 with MMseqs2, to predict the three-dimensional structure of a protein from its amino acid sequence alone.
- ColabFold is a protein structure prediction tool based on Google DeepMind’s AlphaFold2.
- Throughout the workshop you will answer questions that give you a deeper understanding of what you are calculating.
- You can run each cell (box) by pressing `Shift+Enter` while inside the cell or by clicking the triangular play button above.

## ColabFold: AlphaFold2 using MMseqs2
Easy to use protein structure and complex prediction using [AlphaFold2](https://www.nature.com/articles/s41586-021-03819-2) and [AlphaFold2-multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1). Sequence alignments/templates are generated through [MMseqs2](https://mmseqs.com) and [HHsearch](https://github.com/soedinglab/hh-suite). For more details about the original Jupyter notebook and ColabFold, see the <a href="#Instructions">bottom</a> of the notebook, check the [ColabFold GitHub](https://github.com/sokrypton/ColabFold) and read the manuscript. Old versions: [v1.0](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.0-alpha/AlphaFold2.ipynb), [v1.1](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/AlphaFold2.ipynb), [v1.2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.2.0/AlphaFold2.ipynb), [v1.3](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.3.0/AlphaFold2.ipynb)

[Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all.
*Nature Methods*, 2022](https://www.nature.com/articles/s41592-022-01488-1) 


## Learning objectives

<img src="https://media.nature.com/w1248/magazine-assets/d41586-022-00997-5/d41586-022-00997-5_20300292.gif?as=webp" height="200" align="right" style="height:240px">

After completing this workshop you will be able to:
 - Describe how to use ColabFold at UCloud to predict protein structures.
 - Explain and interpret the results generated using ColabFold.
 - Use ColabFold to predict the protein structure of any specific protein of interest.


### Questions

(Double-click to edit the cell) <br>
❔ __Question I: <u> What is AlphaFold and why does it play a crucial role in structural biology?</u>__ <br>
_Answer:_  
<br>
❔ __Question II: <u> What is ColabFold and how does it differ from AlphaFold2?</u>__ <br>
_Answer:_  
<br>
❔ __Question III: <u> What is MMseqs2 and how does it work?</u>__ <br>
_Answer:_  
<br>


💡 Useful link(s): <br>
[Highly accurate protein structure prediction with AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) <br>
[How AlphaFold can realize AI’s full potential in structural biology](https://www.nature.com/articles/d41586-022-02088-x) <br>
[ColabFold: making protein folding accessible to all](https://www.nature.com/articles/s41592-022-01488-1) <br>
[MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets](https://www.nature.com/articles/nbt.3988)<br>
[ColabFold at GitHub](https://github.com/sokrypton/ColabFold)<br>
[YouTube: ColabFold - Making protein folding accessible to all via Google Colab!](https://youtu.be/Rfw7thgGTwI)<br>
<br>

## Initialization
Run the following cell to initialize ColabFold. It defines the paths of the working environment used to run ColabFold on UCloud. __This is needed only once__. 🚀

In [None]:
import sys,os

CONDA_DIR="/work/SW/CondaEnvironments/colabfold/"
CONDA_BIN=os.path.join(CONDA_DIR, "bin")
CONDA_LIB=os.path.join(CONDA_DIR, "lib64")

if CONDA_BIN not in os.environ["PATH"]:
    os.environ["PATH"] = ':'.join((CONDA_BIN, os.environ["PATH"]))

# Reset LD_LIBRARY_PATH so that only the libraries of the ColabFold environment are searched
os.environ["LD_LIBRARY_PATH"] = CONDA_LIB
sys.path.insert(1, '/work/SW/ColabFold/ColabFold_software')

print(sys.version)
print("PATH=" + os.environ["PATH"])
print("LD_LIBRARY_PATH=" + os.environ["LD_LIBRARY_PATH"])
print("sys.path=" + str(sys.path))
print("PWD=" + os.getcwd())

<img src="https://upload.wikimedia.org/wikipedia/en/e/e5/UniProt_%28logo%29.png" height="200" align="right" style="height:100px">

## Selecting an example protein
Now you can select an amino acid sequence of interest to be used as the query sequence. <br>
In this workshop, we will use a sequence from the protein [PPARγ (peroxisome proliferator activated receptor γ)](https://www.uniprot.org/uniprotkb/P37231/entry) obtained from the protein database [UniProt](https://www.uniprot.org/). <br>
UniProt is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information. <br>
Here, you can find functional information on any annotated protein of interest:
1. Go to [uniprot.org](https://www.uniprot.org/).
2. Search for your favorite protein in the search bar.
3. Click on the entry ID in the column to the left.
4. You can find different kinds of functional information about the protein in the menu to the left.
5. Click on __Sequence & Isoforms__ in the menu.
6. Click on __Copy sequence__ in the grey box containing the amino acid sequence to copy the entire sequence without any gaps.
7. You can now insert this sequence as the query sequence to predict the protein structure using ColabFold.

💡 Useful link(s): <br>
[UniProt: the universal protein knowledgebase in 2021](https://academic.oup.com/nar/article/49/D1/D480/6006196?login=true) <br>
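
Before a sequence copied from UniProt is pasted into the notebook, it can help to check that it contains only the 20 standard amino acid letters and no stray whitespace. The following is a minimal sketch of such a check; `clean_sequence` is a hypothetical helper written for this workshop text and is not part of ColabFold. It handles a single chain only and deliberately rejects anything outside the 20 standard letters.

```python
# Minimal sketch: clean and sanity-check a pasted amino acid sequence.
# `clean_sequence` is a hypothetical helper, not a ColabFold function.

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove whitespace, upper-case the sequence and reject unexpected letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("The sequence is empty.")
    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {unexpected}")
    return sequence


# The first 30 residues of the workshop's example sequence, pasted with
# spaces and a line break as they might appear after copying.
example = clean_sequence("MGETLGDSPI DPESDSFTDT\nLSANISQEMT")
print(len(example), example)
```

If the function raises a `ValueError`, the pasted text most likely still contains a header line, numbering, or non-standard residue codes.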


### Questions

(Double-click to edit the cell) <br>
❔ __Question IV: <u> What is your favorite protein and what is the amino acid sequence?</u>__ <br>
_Answer:_  
_Sequence:_  
<br>
❔ __Question V: <u> What is the sequence for PPARγ?</u>__ <br>
_Sequence:_  
<br>
❔ __Question VI: <u> What is the function of PPARγ and which biological processes is the protein involved in?</u>__ <br>
_Answer:_  
<br>
❔ __Question VII: <u> Name 3 different mutations in the amino acid sequence of PPARγ and the disease(s) they can lead to.</u>__ <br>
_Answer:_  


💡 Useful link(s): <br>
[UniProt](https://www.uniprot.org/) <br>
[PPARgamma in Metabolism, Immunity, and Cancer: Unified and Diverse Mechanisms of Action](https://pubmed.ncbi.nlm.nih.gov/33716977/) <br>
[Review of the Structural and Dynamic Mechanisms of PPARγ Partial Agonism](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4578752/) <br>
[Molecular Mechanisms and Genome-Wide Aspects of PPAR Subtype Specific Transactivation](https://pubmed.ncbi.nlm.nih.gov/20862367/) <br>
<br>
<br>

### Parameter settings
In the following cell, you find the parameters used to predict the structure. They can all be set and adjusted. <br>
For this workshop we will use the default parameters meaning that you can run the next cell without any further adjustments.

Alternatively, change the parameters and rerun the cell. We recommend trying out another protein sequence.

_Note: Longer sequences lead to longer run times!_
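
The parameter cell accepts `:` inside `query_sequence` to separate the chains of a complex. As a rough illustration of what that means for the input, the sketch below splits a query on `:` and reports the kind of assembly and the total length, which is what drives the run time. `describe_query` is a hypothetical helper and the sequences are short toy strings, not real proteins.

```python
# Minimal sketch: inspect a query that uses ":" as chain separator.
# `describe_query` is a hypothetical helper, not a ColabFold function.

def describe_query(query_sequence: str) -> dict:
    """Split a query on ':' and summarise the chains it contains."""
    cleaned = "".join(query_sequence.split())
    chains = [chain for chain in cleaned.split(":") if chain]
    if not chains:
        raise ValueError("The query does not contain any sequence.")
    if len(chains) == 1:
        kind = "monomer"
    elif len(set(chains)) == 1:
        kind = "homo-oligomer"
    else:
        kind = "hetero-oligomer"
    return {
        "kind": kind,
        "chain_lengths": [len(chain) for chain in chains],
        "total_length": sum(len(chain) for chain in chains),
    }


# Toy sequences for illustration only.
print(describe_query("MKTAYIAK"))
print(describe_query("MKTAYIAK:MKTAYIAK"))
print(describe_query("MKTAYIAK:GSHMLEDP"))
```

A homodimer is written as the same sequence twice, so its total length, and therefore the expected run time, is that of both chains together.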

In [None]:
############ Collected parameter values

### Biological parameters
## Use `:` to specify inter-protein chainbreaks for **modeling complexes** (supports homo- and hetero-oligomers). For example **PI...SK:PI...SK** for a homodimer
query_sequence = 'MGETLGDSPIDPESDSFTDTLSANISQEMTMVDTEMPFWPTNFGISSVDLSVMEDHSHSFDIKPFTTVDFSSISTPHYEDIPFTRTDPVVADYKYDLKLQEYQSAIKVEPASPPYYSEKTQLYNKPHEEPSNSLMAIECRVCGDKASGFHYGVHACEGCKGFFRRTIRLKLIYDRCDLNCRIHKKSRNKCQYCRFQKCLAVGMSHNAIRFGRMPQAEKEKLLAEISSDIDQLNPESADLRALAKHLYDSYIKSFPLTKAKARAILTGKTTDKSPFVIYDMNSLMMGEDKIKFKHITPLQEQSKEVAIRIFQGCQFRSVEAVQEITEYAKSIPGFVNLDLNDQVTLLKYGVHEIIYTMLASLMNKDGVLISEGQGFMTREFLKSLRKPFGDFMEPKFEFAVKFNALELDDSDLAIFIAVIILSGDRPGLLNVKPIEDIQDNLLQALELQLKLNHPESSQLFAKLLQKMTDLRQIVTEHVQLLQVIKKTETDMSLHPLLQEIYKDLY' #@param {type:"string"}

### Technical parameters

## MSA options (custom MSA upload, single sequence, pairing mode)
# Different values: "MMseqs2 (UniRef+Environmental)", "MMseqs2 (UniRef only)","single_sequence","custom"
msa_mode = "MMseqs2 (UniRef+Environmental)" 
# Different values: "unpaired+paired","paired","unpaired"
# "unpaired+paired" = pair sequences from same species + unpaired MSA, "unpaired" = separate MSA for each chain, "paired" = only use paired sequences.
pair_mode = "unpaired+paired"

## Amber relaxation and template options
# Either True or False
use_amber = False 
# Different values: "none", "pdb70","custom"
# "none" = no template information is used, "pdb70" = detect templates in pdb70, "custom" - upload and search own templates (PDB or mmCIF format, see [notes below](#custom_templates))
template_mode = "none"

# Different values: "auto", "AlphaFold2-ptm", "AlphaFold2-multimer-v1", "AlphaFold2-multimer-v2"
# "auto" = protein structure prediction using "AlphaFold2-ptm" and complex prediction using "AlphaFold2-multimer-v2". For complexes, "AlphaFold2-multimer-v[1,2]" and "AlphaFold2-ptm" can be used.

model_type = "auto"
num_recycles = 3


### Visualization
# Usually a number between 1 and 5
rank_num = 1 
# Different values: "chain", "lDDT", "rainbow"
color = "lDDT" 
# Either True or False
show_sidechains = False
show_mainchains = False


## set dpi for image resolution
dpi = 200 

########## End parameter settings

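import os.path
import re
import hashlib
import random

# `files.upload()` is only needed for the "custom" template and MSA modes. It is
# provided by Google Colab and is not available in other environments.
try:
  from google.colab import files
except ImportError:
  pass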

def add_hash(x,y):
  return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]


# remove whitespaces
query_sequence = "".join(query_sequence.split())

jobname = 'test' #@param {type:"string"}
# remove whitespaces
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)
while os.path.isfile(f"{jobname}.csv"):
  jobname = add_hash(basejobname, ''.join(random.sample(query_sequence,len(query_sequence))))

with open(f"{jobname}.csv", "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")

queries_path=f"{jobname}.csv"

if template_mode == "pdb70":
  use_templates = True
  custom_template_path = None
elif template_mode == "custom":
  custom_template_path = f"{jobname}_template"
  os.mkdir(custom_template_path)
  uploaded = files.upload()
  use_templates = True
  for fn in uploaded.keys():
    os.rename(fn, f"{jobname}_template/{fn}")
else:
  custom_template_path = None
  use_templates = False

    
# decide which a3m to use
if msa_mode.startswith("MMseqs2"):
  a3m_file = f"{jobname}.a3m"
elif msa_mode == "custom":
  a3m_file = f"{jobname}.custom.a3m"
  if not os.path.isfile(a3m_file):
    custom_msa_dict = files.upload()
    custom_msa = list(custom_msa_dict.keys())[0]
    header = 0
    import fileinput
    for line in fileinput.FileInput(custom_msa,inplace=1):
      if line.startswith(">"):
         header = header + 1
      if not line.rstrip():
        continue
      if line.startswith(">") == False and header == 1:
         query_sequence = line.rstrip()
      print(line, end='')

    os.rename(custom_msa, a3m_file)
    queries_path=a3m_file
    print(f"moving {custom_msa} to {a3m_file}")
else:
  a3m_file = f"{jobname}.single_sequence.a3m"
  with open(a3m_file, "w") as text_file:
    text_file.write(">1\n%s" % query_sequence)

## Run protein structure prediction
Now you are ready to run the prediction of the protein in the cell below. This can take a while. ⌚

In [None]:
#@title Run Prediction
import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
K80_chk = !nvidia-smi | grep "Tesla K80" | wc -l
if "1" in K80_chk:
  print("WARNING: found GPU Tesla K80: limited to total length < 1000")
  if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
    del os.environ["TF_FORCE_UNIFIED_MEMORY"]
  if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
    del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]

from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt


# For some reason we need that to get pdbfixer to import
if use_amber and '/usr/local/lib/python3.7/site-packages/' not in sys.path:
    sys.path.insert(0, '/usr/local/lib/python3.7/site-packages/')

def prediction_callback(unrelaxed_protein, length, prediction_result, input_features, type):
  fig = plot_protein(unrelaxed_protein, Ls=length, dpi=150)
  plt.show()
  plt.close()

result_dir="."
if 'logging_setup' not in globals():
    setup_logging(Path(".").joinpath("log.txt"))
    logging_setup = True

queries, is_complex = get_queries(queries_path)
model_type = set_model_type(is_complex, model_type)
download_alphafold_params(model_type, Path("."))
run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    custom_template_path=custom_template_path,
    use_amber=use_amber,
    msa_mode=msa_mode,    
    model_type=model_type,
    num_models=5,
    num_recycles=num_recycles,
    model_order=[1, 2, 3, 4, 5],
    is_complex=is_complex,
    data_dir=Path("."),
    keep_existing_results=False,
    recompile_padding=1.0,
    rank_by="auto",
    pair_mode=pair_mode,
    stop_at_score=float(100),
    prediction_callback=prediction_callback,
    dpi=dpi
)

### Questions

(Double-click to edit the cell) <br>
❔ __Question VIII: <u> What is pLDDT and how is the coloring and prediction related?</u>__  
_Answer:_  

❔ __Question IX: <u> Why does AlphaFold2 need to download the weights?</u>__  
_Answer:_  

❔ __Question X: <u> Why does it run 5 models?</u>__  
_Answer:_  

❔ __Question XI: <u> What is different and what is similar between the different predicted protein models?</u>__  
_Answer:_  

## Visualization of results
The next cell creates an interactive 3D figure of the protein.

In [None]:
#Display 3D structure 
import py3Dmol
import glob
import matplotlib.pyplot as plt
from colabfold.colabfold import plot_plddt_legend

jobname_prefix = ".custom" if msa_mode == "custom" else ""
if use_amber:
  pdb_filename = f"{jobname}{jobname_prefix}_relaxed_rank_{rank_num:03}_alphafold2_ptm_model_*.pdb"
else:
  pdb_filename = f"{jobname}{jobname_prefix}_unrelaxed_rank_{rank_num:03}_alphafold2_ptm_model_*.pdb"

pdb_file = glob.glob(pdb_filename)

def show_pdb(rank_num=1, show_sidechains=False, show_mainchains=False, color="lDDT"):
  model_name = f"rank_{rank_num}"
  view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
  view.addModel(open(pdb_file[0],'r').read(),'pdb')

  if color == "lDDT":
    view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
  elif color == "rainbow":
    view.setStyle({'cartoon': {'color':'spectrum'}})
  elif color == "chain":
    chains = len(queries[0][1]) + 1 if is_complex else 1
    for n,chain,color in zip(range(chains),list("ABCDEFGH"),
                     ["lime","cyan","magenta","yellow","salmon","white","blue","orange"]):
      view.setStyle({'chain':chain},{'cartoon': {'color':color}})
  if show_sidechains:
    BB = ['C','O','N']
    view.addStyle({'and':[{'resn':["GLY","PRO"],'invert':True},{'atom':BB,'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"GLY"},{'atom':'CA'}]},
                        {'sphere':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"PRO"},{'atom':['C','O'],'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})  
  if show_mainchains:
    BB = ['C','O','N','CA']
    view.addStyle({'atom':BB},{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})

  view.zoomTo()
  return view


show_pdb(rank_num,show_sidechains, show_mainchains, color).show()
if color == "lDDT":
  plot_plddt_legend().show() 

### Questions

(Double-click to edit the cell)  
❔ __Question XII: <u> What is the biological function of the intrinsically disordered region (IDR) of the protein, and is there a connection between the function and structure of this part of the protein? </u>__  
_Answer:_  

❔ __Question XIII: <u> What do you consider the most interesting feature of the predicted model? </u>__  
_Answer:_  

## The Prediction Quality
To evaluate the prediction quality, several measures can be visualized, e.g. the Predicted Aligned Error (PAE), the sequence coverage, and the predicted Local Distance Difference Test (pLDDT).  
Run the cell below to visualize the quality of the predicted model.

In [None]:
#@title Plots {run: "auto"}
from IPython.display import display, HTML
import base64
from html import escape

# see: https://stackoverflow.com/a/53688522
def image_to_data_url(filename):
  ext = filename.split('.')[-1]
  prefix = f'data:image/{ext};base64,'
  with open(filename, 'rb') as f:
    img = f.read()
  return prefix + base64.b64encode(img).decode('utf-8')

pae = image_to_data_url(f"{jobname}{jobname_prefix}_pae.png")
cov = image_to_data_url(f"{jobname}{jobname_prefix}_coverage.png")
plddt = image_to_data_url(f"{jobname}{jobname_prefix}_plddt.png")
display(HTML(f"""
<style>
  img {{
    float:left;
  }}
  .full {{
    max-width:100%;
  }}
  .half {{
    max-width:50%;
  }}
  @media (max-width:640px) {{
    .half {{
      max-width:100%;
    }}
  }}
</style>
<div style="max-width:90%; padding:2em;">
  <h1>Plots for {escape(jobname)}</h1>
  <h2> Predicted Aligned Error (PAE) Plots </h2>
  <img src="{pae}" class="full" />
  <img src="{cov}" class="half" />
  <img src="{plddt}" class="half" />
</div>
"""))


## Overview of output
AlphaFold2 employs various metrics in the process of predicting protein structure models. Some of these metrics are depicted in the plots above and are briefly described below.

### Predicted Aligned Error (PAE)  
One of the outputs from AlphaFold2 is the Predicted Aligned Error (PAE), measured in Ångströms (capped at 31.75 Å). The PAE is defined for each pair of residues: the value for the pair (x, y) is the expected positional error at residue x when the predicted and experimental structures are aligned on residue y. The PAE therefore reflects the model's confidence in the relative positions and orientations of different parts of the protein (e.g. two domains), and it can be used to identify rigid cores and protein domains.  
If the PAE is _low_ for a pair of residues, AlphaFold predicts their relative positions and orientations to be well defined. If the PAE is _high_, the relative position and orientation of the two residues, and of the domains they belong to, is uncertain, and the placement of these domains relative to each other should not be interpreted biologically or structurally.
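
To make the idea of within-domain and between-domain confidence concrete, the sketch below averages PAE values over blocks of a small matrix. The 6x6 matrix and the two segment boundaries are toy values invented for illustration; `mean_pae` is a hypothetical helper and not part of ColabFold.

```python
# Minimal sketch: compare PAE within and between two segments of a protein.
# The matrix and the segment boundaries are toy values chosen for illustration.

def mean_pae(pae, rows, cols):
    """Average PAE over all residue pairs (i, j) with i in `rows` and j in `cols`."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


# Toy PAE matrix in Å: residues 0-2 and residues 3-5 each form a confident
# segment, but the placement of the two segments relative to each other is not.
toy_pae = [
    [0.5, 1.0, 1.5, 20.0, 22.0, 24.0],
    [1.0, 0.5, 1.0, 21.0, 20.0, 23.0],
    [1.5, 1.0, 0.5, 22.0, 21.0, 20.0],
    [20.0, 21.0, 22.0, 0.5, 1.0, 1.5],
    [22.0, 20.0, 21.0, 1.0, 0.5, 1.0],
    [24.0, 23.0, 20.0, 1.5, 1.0, 0.5],
]

segment_a = range(0, 3)
segment_b = range(3, 6)

within_a = mean_pae(toy_pae, segment_a, segment_a)
within_b = mean_pae(toy_pae, segment_b, segment_b)
between = mean_pae(toy_pae, segment_a, segment_b)

print(f"mean PAE within segment A:  {within_a:.2f} Å")
print(f"mean PAE within segment B:  {within_b:.2f} Å")
print(f"mean PAE between A and B:   {between:.2f} Å")
```

In a PAE plot the same pattern appears as two low-error squares along the diagonal with high-error off-diagonal blocks.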

### Sequence coverage
Sequence coverage describes how well each position of the query sequence is covered by the homologous sequences in the multiple sequence alignment (MSA). It is often depicted as a plot showing the number of identified homologues (proteins that are similar in sequence to the protein of interest) along the length of the query sequence. The plot may also be colored to indicate the level of sequence identity between the protein of interest and each of the homologues. High coverage means that many homologues were found for that part of the protein, which gives AlphaFold2 more evolutionary information to base the prediction on; as a rule of thumb, at least 30 sequences per position are desirable and around 100 are ideal. Low coverage in a region means that few homologues align there, and the prediction for that region tends to be less confident. Coverage does not measure how well a protein has been studied experimentally, but the alignment does carry information about the protein's evolutionary history and about conserved domains or motifs that may be important for its function.
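
The quantities shown in a coverage plot can be sketched from an A3M-style alignment: in A3M, lowercase letters are insertions relative to the query and `-` marks a gap, so removing the lowercase letters gives every sequence the length of the query. The alignment below is a toy example and the function names are hypothetical; this is not how ColabFold draws its plot, only an illustration of the counts behind it.

```python
# Minimal sketch: per-position coverage and identity to the query from an
# A3M-style alignment. The alignment is a toy example.

def aligned_columns(a3m_sequence: str) -> str:
    """Drop lowercase insertions so the sequence has the query's length."""
    return "".join(c for c in a3m_sequence if not c.islower())


def coverage_per_position(sequences):
    """Count, for each query position, the sequences with a residue (not a gap)."""
    aligned = [aligned_columns(s) for s in sequences]
    query_length = len(aligned[0])
    if any(len(s) != query_length for s in aligned):
        raise ValueError("All sequences must match the query length after removing insertions.")
    return [sum(1 for s in aligned if s[i] != "-") for i in range(query_length)]


def identity_to_query(sequences):
    """Fraction of query positions at which each sequence matches the query."""
    aligned = [aligned_columns(s) for s in sequences]
    query = aligned[0]
    return [sum(a == q for a, q in zip(s, query)) / len(query) for s in aligned]


toy_msa = [
    "MKTAYIAKQR",   # query: first entry, no gaps
    "MKTAYIAKQR",
    "MRTAYaIGKQ-",  # lowercase "a" is an insertion relative to the query
    "--TAFIAK--",
    "---AYIA---",
]

coverage = coverage_per_position(toy_msa)
identity = identity_to_query(toy_msa)
print("coverage per position:", coverage)
print("identity to query:   ", identity)
```

In this toy alignment the termini are covered by fewer sequences than the middle, which is a pattern worth looking for in the real plot as well.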

### Predicted Local Distance Difference Test (pLDDT)

The predicted structure models produced by AlphaFold2 include atomic coordinates and a confidence estimate for each residue in the protein. The confidence measure is called the predicted Local Distance Difference Test (pLDDT). It is the model's own estimate of how well the predicted structure would agree with an experimental structure according to the lDDT-Cα score. The pLDDT score is on a scale from 0 to 100, with higher scores indicating higher confidence in the predicted structure. While pLDDT is a widely used metric in protein structure prediction, it is not clear how well it can be linked to physical properties such as protein dynamics. Because pLDDT assesses the local accuracy of a prediction, it can give high scores to well-predicted regions even if their placement within the overall structure is wrong. This makes pLDDT a useful tool for identifying regions of the predicted structure that are likely to be reliable locally, but not for judging the relative arrangement of domains.
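
AlphaFold-style PDB files store the per-residue pLDDT in the B-factor column, which is also the property the 3D viewer in this notebook colors by. The sketch below reads that column for the Cα atoms and sorts the values into the confidence bands used by the AlphaFold Protein Structure Database (above 90, 70 to 90, 50 to 70, below 50). The PDB lines are generated toy data with all coordinates set to zero, and the helper names are hypothetical.

```python
# Minimal sketch: read per-residue pLDDT from the B-factor column of a PDB
# file and assign confidence bands. The PDB text is toy data.

def toy_atom_line(serial, atom, resname, resnum, plddt):
    """Build one fixed-column PDB ATOM line with dummy coordinates (toy data)."""
    return (
        f"ATOM  {serial:>5} {atom:<4} {resname:>3} A{resnum:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def ca_plddt(pdb_text):
    """Return (residue number, pLDDT) for every CA atom in the PDB text."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, residue number in 23-26,
        # B-factor (here: pLDDT) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to the band names used by the AlphaFold database."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


toy_pdb = "\n".join([
    toy_atom_line(1, " N", "MET", 1, 95.2),
    toy_atom_line(2, " CA", "MET", 1, 95.2),
    toy_atom_line(3, " CA", "GLY", 2, 82.0),
    toy_atom_line(4, " CA", "GLU", 3, 61.5),
    toy_atom_line(5, " CA", "THR", 4, 35.1),
])

for residue_number, plddt in ca_plddt(toy_pdb):
    print(residue_number, plddt, confidence_band(plddt))
```

Applied to a predicted model, long stretches in the "very low" band are often a hint of flexible or disordered regions rather than of a failed prediction.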


💡Useful link(s):  
[AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728224/)  
[Effective Molecular Dynamics from Neural-Network Based Structure Prediction Models](https://www.biorxiv.org/content/10.1101/2022.10.17.512476v1.full)

### Questions

(Double-click to edit the cell)  
❔ __Question XIV: <u> In your own words, can you provide an explanation of what the Predicted Aligned Error (PAE) is and how it is visualized in a PAE plot?</u>__  
_Answer:_  

❔ __Question XV: <u> What are some interesting potential applications of AlphaFold2 that you can imagine?</u>__  
_Answer:_  

❔ __Question XVI: <u> Can you explain what information is depicted in the sequence coverage plot and how it relates to the N- and C-termini of the protein?</u>__  
_Answer:_  

❔ __Question XVII: <u> Can you describe what the predicted Local Distance Difference Test (pLDDT) plot represents and how it relates to the predicted protein model?</u>__  
_Answer:_  

❔ __Question XVIII: <u> Elucidate the biological mechanisms underlying the relationship between the domains of the predicted protein structure and their functions, providing relevant examples.</u>__  
_Answer:_  


💡 Useful link(s):  
[AlphaFold Error Estimates](https://www.rbvi.ucsf.edu/chimerax/data/pae-apr2022/pae.html)  
[lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799472/)  
[Highly accurate protein structure prediction for the human proteome](https://www.nature.com/articles/s41586-021-03828-1)  
[AlphaFold - FAQ](https://alphafold.ebi.ac.uk/faq)

## Download the results
You can package all your results into a single zip archive by running the last cell below. This step is optional.
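
The packaging cell relies on the `zip` command-line tool through notebook shell syntax. Where that tool is not available, roughly the same archive could be built with Python's standard library. The sketch below is an assumption-laden approximation: `package_results` is a hypothetical helper, the file patterns are modelled on the shell command in the cell, and the demonstration runs on empty placeholder files in a temporary directory.

```python
# Minimal sketch: package result files with the standard library instead of
# the `zip` command. `package_results` is a hypothetical helper.
import glob
import os
import tempfile
import zipfile


def package_results(jobname, directory="."):
    """Zip the result files of `jobname` found in `directory`."""
    archive_path = os.path.join(directory, f"{jobname}.result.zip")
    patterns = [
        "config.json",
        f"{jobname}*.json",
        f"{jobname}*.a3m",
        f"{jobname}*relaxed_rank_*.pdb",  # matches relaxed and unrelaxed models
        "cite.bibtex",
        f"{jobname}*.png",
    ]
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.join(directory, pattern))):
                archive.write(path, arcname=os.path.basename(path))
    return archive_path


# Demonstration on placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("demo.a3m", "demo_pae.png", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder")
    archive = package_results("demo", tmp)
    with zipfile.ZipFile(archive) as packed:
        print(sorted(packed.namelist()))
```

Files that do not match any pattern, such as `notes.txt` in the demonstration, are left out of the archive.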

In [None]:
#@title Package and download results
#@markdown If you are having issues downloading the result archive, try disabling your adblocker and run this cell again. If that fails click on the little folder icon to the left, navigate to file: `jobname.result.zip`, right-click and select \"Download\" (see [screenshot](https://pbs.twimg.com/media/E6wRW2lWUAEOuoe?format=jpg&name=small)).


if msa_mode == "custom":
  print("Don't forget to cite your custom MSA generation method.")

!zip -FSr $jobname".result.zip" config.json $jobname*".json" $jobname*".a3m" $jobname*"relaxed_rank_"*".pdb" "cite.bibtex" $jobname*".png"
#files.download(f"{jobname}.result.zip")







# Original Instructions from ColabFold (<font color='red'>not valid</font> for this version of ColabFold on UCloud)<a name="Instructions"></a>
**Quick start**
1. Paste your protein sequence(s) in the input field.
2. Press "Runtime" -> "Run all".
3. The pipeline consists of 5 steps. The currently running step is indicated by a circle with a stop sign next to it.

**Result zip file contents**

1. PDB-formatted structures sorted by average pLDDT; complexes are sorted by pTM score (unrelaxed, and relaxed if `use_amber` is enabled).
2. Plots of the model quality.
3. Plots of the MSA coverage.
4. Parameter log file.
5. A3M formatted input MSA.
6. A `predicted_aligned_error_v1.json` using [AlphaFold-DB's format](https://alphafold.ebi.ac.uk/faq#faq-7) and a `scores.json` for each model which contains an array (list of lists) for PAE, a list with the average pLDDT and the pTMscore.
7. BibTeX file with citations for all used tools and databases.

At the end of the job a download modal box will pop up with a `jobname.result.zip` file. Additionally, if the `save_to_google_drive` option was selected, the `jobname.result.zip` will be uploaded to your Google Drive.

**MSA generation for complexes**

For complex prediction we use unpaired and paired MSAs. The unpaired MSA is generated the same way as for single-protein structure prediction, by searching UniRef100 and the environmental sequences with three iterations each.

The paired MSA is generated by searching the UniRef100 database and pairing the best hits sharing the same NCBI taxonomic identifier (=species or sub-species). We only pair sequences if all of the query sequences are present for the respective taxonomic identifier.
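
The pairing rule described above can be illustrated with a small sketch: for every chain, keep the best hit per taxonomic identifier, then pair only those identifiers that occur for all chains. This is not the ColabFold or MMseqs2 implementation. The function name, the hit names, and the scores are hypothetical, and "higher score is better" is an assumption made for the example.

```python
# Minimal sketch of pairing hits by taxonomic identifier (toy data).
# Not the actual ColabFold/MMseqs2 code; all names and scores are invented.

def pair_by_taxonomy(hits_per_chain):
    """Pair the best hit of every chain for each taxon present in all chains.

    `hits_per_chain` holds one list per query chain; each list contains
    (taxon_id, hit_name, score) tuples. A higher score is assumed to be better.
    """
    best_per_chain = []
    for hits in hits_per_chain:
        best = {}
        for taxon_id, hit_name, score in hits:
            if taxon_id not in best or score > best[taxon_id][1]:
                best[taxon_id] = (hit_name, score)
        best_per_chain.append(best)

    # Keep only taxa for which every chain has at least one hit.
    shared = set(best_per_chain[0])
    for best in best_per_chain[1:]:
        shared &= set(best)

    return {
        taxon_id: tuple(best[taxon_id][0] for best in best_per_chain)
        for taxon_id in sorted(shared)
    }


chain_a_hits = [
    (9606, "A_human_1", 250.0),
    (9606, "A_human_2", 180.0),
    (10090, "A_mouse", 240.0),
    (7955, "A_zebrafish", 150.0),
]
chain_b_hits = [
    (9606, "B_human", 300.0),
    (10090, "B_mouse", 220.0),
    (9031, "B_chicken", 200.0),
]

pairs = pair_by_taxonomy([chain_a_hits, chain_b_hits])
for taxon_id, names in pairs.items():
    print(taxon_id, names)
```

Taxa with a hit for only one of the chains, here the zebrafish and chicken entries, do not contribute a paired row.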

**Using a custom MSA as input**

To predict the structure with a custom MSA (A3M formatted): (1) Change the `msa_mode` to "custom", (2) Wait for an upload box to appear at the end of the "MSA options ..." box. Upload your A3M. The first FASTA entry of the A3M must be the query sequence without gaps. 

It is also possible to provide custom MSAs for complex predictions. Read more about the format [here](https://github.com/sokrypton/ColabFold/issues/76).

As an alternative for MSA generation the [HHblits Toolkit server](https://toolkit.tuebingen.mpg.de/tools/hhblits) can be used. After submitting your query, click "Query Template MSA" -> "Download Full A3M". Download the A3M file and upload it in this notebook.
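
Since the first entry of a custom A3M has to be the ungapped query, it can be worth checking a file before uploading it. The sketch below is one way to do that with plain string handling; the function names are hypothetical, and skipping lines that start with `#` is an assumption made so that files with a leading comment line are still accepted.

```python
# Minimal sketch: check that the first entry of an A3M file is an ungapped
# query sequence. Function names are hypothetical.

def first_a3m_entry(a3m_text):
    """Return (header, sequence) of the first FASTA-style entry."""
    header = None
    parts = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: "#" lines are comments and can be skipped
        if line.startswith(">"):
            if header is not None:
                break  # reached the second entry
            header = line[1:]
        elif header is not None:
            parts.append(line)
    if header is None:
        raise ValueError("No FASTA header (line starting with '>') found.")
    return header, "".join(parts)


def check_query_entry(a3m_text):
    """Return the query sequence, or raise if it is missing or contains gaps."""
    _, sequence = first_a3m_entry(a3m_text)
    if not sequence:
        raise ValueError("The first entry has no sequence.")
    if "-" in sequence or "." in sequence:
        raise ValueError("The first entry must be the query sequence without gaps.")
    return sequence


toy_a3m = ">query\nMKTAYIAKQR\n>hit_1\nMKTAY-AKQR\n"
print(check_query_entry(toy_a3m))
```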

**Using custom templates** <a name="custom_templates"></a>

To predict the structure with a custom template (PDB or mmCIF formatted): (1) change the `template_mode` to "custom" in the execute cell and (2) wait for an upload box to appear at the end of the "Input Protein" box. Select and upload your templates (multiple choices are possible).

* Templates must follow the four letter PDB naming with lower case letters.

* Templates in mmCIF format must contain `_entity_poly_seq`. An error is thrown if this field is not present. The field `_pdbx_audit_revision_history.revision_date` is automatically generated if it is not present.

* Templates in PDB format are automatically converted to the mmCIF format. `_entity_poly_seq` and `_pdbx_audit_revision_history.revision_date` are automatically generated.
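
The naming rule in the list above can be checked before uploading. The sketch below accepts file names made of four lowercase letters or digits followed by an extension. The accepted extensions `.pdb` and `.cif` are an assumption for this illustration, and `invalid_template_names` is a hypothetical helper, not part of ColabFold.

```python
# Minimal sketch: flag template file names that do not follow the
# four-character lowercase PDB naming. The extensions are an assumption.
import os
import re

TEMPLATE_NAME = re.compile(r"[0-9a-z]{4}\.(pdb|cif)")


def invalid_template_names(filenames):
    """Return the file names that do not match the expected naming."""
    return [
        name for name in filenames
        if not TEMPLATE_NAME.fullmatch(os.path.basename(name))
    ]


candidates = ["1abc.pdb", "6TNB.cif", "mytemplate.pdb", "2xyz.cif"]
print(invalid_template_names(candidates))
```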

If you encounter problems, please report them to this [issue](https://github.com/sokrypton/ColabFold/issues/177).

**Comparison to the full AlphaFold2 and AlphaFold2 Colab**

This notebook replaces the homology detection and MSA pairing of AlphaFold2 with MMseqs2. For a comparison against the [AlphaFold2 Colab](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb) and the full [AlphaFold2](https://github.com/deepmind/alphafold) system read our [preprint](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1). 

**Troubleshooting**
* Check that the runtime type is set to GPU at "Runtime" -> "Change runtime type".
* Try to restart the session "Runtime" -> "Factory reset runtime".
* Check your input sequence.

**Known issues**
* Google Colab assigns different types of GPUs with varying amount of memory. Some might not have enough memory to predict the structure for a long sequence.
* Your browser can block the pop-up for downloading the result file. You can choose the `save_to_google_drive` option to upload to Google Drive instead or manually download the result file: Click on the little folder icon to the left, navigate to file: `jobname.result.zip`, right-click and select \"Download\" (see [screenshot](https://pbs.twimg.com/media/E6wRW2lWUAEOuoe?format=jpg&name=small)).

**Limitations**
* Computing resources: Our MMseqs2 API can handle ~20-50k requests per day.
* MSAs: MMseqs2 is very precise and sensitive but might find fewer hits than HHblits/HMMer searches against BFD or MGnify.
* We recommend additionally using the full [AlphaFold2 pipeline](https://github.com/deepmind/alphafold).

**Description of the plots**
*   **Number of sequences per position** - We want to see at least 30 sequences per position and, for best performance, ideally 100.
*   **Predicted lDDT per position** - model confidence (out of 100) at each position. The higher the better.
*   **Predicted Alignment Error** - For homooligomers, this could be a useful metric to assess how confident the model is about the interface. The lower the better.

**Bugs**
- If you encounter any bugs, please report the issue to https://github.com/sokrypton/ColabFold/issues

**License**

The source code of ColabFold is licensed under [MIT](https://raw.githubusercontent.com/sokrypton/ColabFold/main/LICENSE). Additionally, this notebook uses the AlphaFold2 source code and its parameters, licensed under [Apache 2.0](https://raw.githubusercontent.com/deepmind/alphafold/main/LICENSE) and [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), respectively. Read more about the AlphaFold license [here](https://github.com/deepmind/alphafold).

**Acknowledgments**
- We thank the AlphaFold team for developing an excellent model and open sourcing the software. 

- [KOBIC](https://kobic.re.kr) and [Söding Lab](https://www.mpinat.mpg.de/soeding) for providing the computational resources for the MMseqs2 MSA server.

- Richard Evans for helping to benchmark ColabFold's AlphaFold-multimer support.

- [David Koes](https://github.com/dkoes) for his awesome [py3Dmol](https://3dmol.csb.pitt.edu/) plugin, without whom these notebooks would be quite boring!

- Do-Yoon Kim for creating the ColabFold logo.

- A colab by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)).
