Original source: https://twitter.com/thesteinegger/status/1416826734322749445?s=28

https://t.co/WwSN5KE1ZE?amp=1

Notebook is not adapted to Kaggle.

# Protein structure prediction with AlphaFold2 and MMseqs2


Easy-to-use version of AlphaFold 2 (Jumper et al. 2021, Nature) that creates the multiple sequence alignment through an API hosted at the Södinglab and based on the MMseqs2 server (Mirdita et al. 2019, Bioinformatics).

**Quickstart**
1. Change the runtime type to GPU at "Runtime" -> "Change runtime type" (improves speed)
2. Paste your protein sequence in the input field below
3. Press "Runtime" -> "Run all"
4. The pipeline has 8 steps. The currently running step is indicated by a circle with a stop sign next to it.

**Result**

We produce two result files: (1) a PDB-formatted structure and (2) a plot of the model quality. At the end of the computation, a download dialog will pop up with a `result.tar.gz` file.

**Troubleshooting**
* Try restarting the session: "Runtime" -> "Factory reset runtime"
* Check your input sequence 
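
To make "check your input sequence" concrete, here is a minimal sketch of a pre-flight check. The helper `clean_sequence` and the rule that only the 20 standard one-letter amino acid codes are accepted are assumptions made for illustration; the notebook itself only strips whitespace and does not perform this validation.

```python
# Hypothetical helper, not part of the notebook: validates a pasted sequence
# before it is written to q.fasta.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove whitespace, upper-case the sequence, and reject other letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("sequence is empty")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unexpected characters: " + "".join(invalid))
    return sequence


print(clean_sequence("mlil isp\nAKT"))
```

A sequence containing digits, FASTA header text, or ambiguous codes such as `X` raises a `ValueError` here, which may be stricter than what the server actually accepts.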


**Limitations**
* MSAs: MMseqs2 might not find as many hits as HHblits/HMMer searches against BFD and MGnify.
* Templates: Currently we do not use template information, but this is work in progress.
* Computing resources: MMseqs2 is fast and we can probably handle >20k requests per day, but capacity is not limitless.
* Models: Only one AF2 model is used, followed by Amber relaxation.

For best results, we recommend using the full pipeline: https://github.com/deepmind/alphafold

Most of the Python code was written by Sergey Ovchinnikov (@sokrypton). The API is hosted at the Södinglab (@SoedingL) and maintained by Milot Mirdita (@milot_mirdita). Martin Steinegger (@thesteinegger) integrated everything.



In [1]:
#@title Input protein sequence here before you "Run all"

protein = 'MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPDFTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARGKDLYQFWGDIITNKLNEALAAQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKKARGLMSRFIIENRLTKPEQLTGFNSEGYFFDEDSSSNGELVFKRYEQR' #@param {type:"string"}
# remove whitespace (join on the empty string, not on the sequence itself)
protein="".join(protein.split())
with open("q.fasta", "w") as text_file:
    text_file.write(">1\n%s" % protein)
jobname = 'default' #@param {type:"string"}
# remove whitespaces
jobname="".join(jobname.split())

In [2]:
#@title Install dependencies
%%bash
if [ -e AF2_READY ]; then
  exit 0
fi
# install dependencies
apt-get -qq -y update 2>&1 1>/dev/null
apt-get -qq -y install jq curl zlib1g gawk 2>&1 1>/dev/null

pip -q install biopython 2>&1 1>/dev/null
pip -q install dm-haiku 2>&1 1>/dev/null
pip -q install ml-collections 2>&1 1>/dev/null
pip -q install mock 2>&1 1>/dev/null
pip -q install py3Dmol 2>&1 1>/dev/null

# download AlphaFold source code
git clone https://github.com/deepmind/alphafold.git --quiet
mv alphafold alphafold_
mv alphafold_/alphafold .

# download model params (~1 min)
wget -qnc https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar
mkdir params
tar -xf alphafold_params_2021-07-14.tar -C params/
rm alphafold_params_2021-07-14.tar

# install openmm for refinement
wget -qnc https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
mv stereo_chemical_props.txt alphafold/common/
wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
conda install -y -q -c conda-forge openmm=7.5.1 python=3.7 pdbfixer 2>&1 1>/dev/null
(cd /usr/local/lib/python3.7/site-packages; patch -s -p0 < /content/alphafold_/docker/openmm.patch)

touch AF2_READY

SyntaxError: invalid syntax (<ipython-input-2-28fd336a516d>, line 3)

In [None]:
#@title Build MSA

%%bash
# build msa using the MMseqs2 search server
ID=$(curl -s -F q=@q.fasta -F mode=all https://a3m.mmseqs.com/ticket/msa | jq -r '.id')
STATUS=$(curl -s https://a3m.mmseqs.com/ticket/${ID} | jq -r '.status')
while [ "${STATUS}" == "RUNNING" ]; do
    STATUS=$(curl -s https://a3m.mmseqs.com/ticket/${ID} | jq -r '.status')
    sleep 1
done
if [ "${STATUS}" == "COMPLETE" ]; then
    curl -s https://a3m.mmseqs.com/result/download/${ID}  > result.tar.gz
    tar xzf result.tar.gz
    tr -d '\000' < uniref.a3m > query.a3m
else
    echo "MMseqs2 server did not return a valid result."
    exit 1
fi
echo "Found $(grep -c ">" query.a3m) sequences (after redundancy filtering)"
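
The submit-then-poll pattern used in the cell above can be sketched in Python as follows. This is an illustration only: `wait_for_ticket` and its parameters are hypothetical names, the status source is injected as a plain function so that no network access is needed, and the `max_polls` cap is an addition that the bash loop does not have.

```python
import time


def wait_for_ticket(fetch_status, poll_seconds=1.0, max_polls=600):
    """Call fetch_status() until it returns something other than 'RUNNING'.

    fetch_status is any zero-argument callable returning a status string;
    in the notebook this role is played by the curl request to the ticket URL.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status != "RUNNING":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("ticket still RUNNING after %d polls" % max_polls)


# Stand-in for the HTTP status request: two RUNNING replies, then COMPLETE.
fake_statuses = iter(["RUNNING", "RUNNING", "COMPLETE"])
final_status = wait_for_ticket(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

As in the bash version, any final status other than `COMPLETE` should be treated as a failed search.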

In [None]:
#@title Setup model

# the following code is written by Sergey Ovchinnikov
# setup the model
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import numpy as np
import pickle
import py3Dmol
import matplotlib.pyplot as plt
import sys
sys.path.insert(0, '/usr/local/lib/python3.7/site-packages/')
from alphafold.common import protein
from alphafold.data import pipeline
from alphafold.data import templates
from alphafold.model import data
from alphafold.model import config
from alphafold.model import model
from alphafold.relax import relax
tf.get_logger().setLevel('ERROR')
model_runners = {}
models = ["model_1"] #,"model_2","model_3","model_4","model_5"]
for model_name in models:
  model_config = config.model_config(model_name)
  model_config.data.eval.num_ensemble = 1
  model_params = data.get_model_haiku_params(model_name=model_name, data_dir=".")
  model_runner = model.RunModel(model_config, model_params)
  model_runners[model_name] = model_runner

def mk_mock_template(query_sequence):
  # since alphafold's model requires a template input
  # we create a blank example w/ zero input, confidence -1
  ln = len(query_sequence)
  output_templates_sequence = "-"*ln
  output_confidence_scores = np.full(ln,-1)
  templates_all_atom_positions = np.zeros((ln, templates.residue_constants.atom_type_num, 3))
  templates_all_atom_masks = np.zeros((ln, templates.residue_constants.atom_type_num))
  templates_aatype = templates.residue_constants.sequence_to_onehot(output_templates_sequence,
                                                                    templates.residue_constants.HHBLITS_AA_TO_ID)
  template_features = {'template_all_atom_positions': templates_all_atom_positions[None],
                       'template_all_atom_masks': templates_all_atom_masks[None],
                       'template_sequence': [f'none'.encode()],
                       'template_aatype': np.array(templates_aatype)[None],
                       'template_confidence_scores': output_confidence_scores[None],
                       'template_domain_names': [f'none'.encode()],
                       'template_release_date': [f'none'.encode()]}
  return template_features

def predict_structure(prefix, feature_dict, model_runners, do_relax=True, random_seed=0):  
  """Predicts structure using AlphaFold for the given sequence."""

  # Run the models.
  plddts = {}
  for model_name, model_runner in model_runners.items():
    processed_feature_dict = model_runner.process_features(feature_dict, random_seed=random_seed)
    prediction_result = model_runner.predict(processed_feature_dict)
    unrelaxed_protein = protein.from_prediction(processed_feature_dict,prediction_result)
    unrelaxed_pdb_path = f'{prefix}_unrelaxed_{model_name}.pdb'
    plddts[model_name] = prediction_result['plddt']

    print(f"{model_name} {plddts[model_name].mean()}")

    with open(unrelaxed_pdb_path, 'w') as f:
      f.write(protein.to_pdb(unrelaxed_protein))

    if do_relax:
      # Relax the prediction.
      amber_relaxer = relax.AmberRelaxation(max_iterations=0,tolerance=2.39,
                                            stiffness=10.0,exclude_residues=[],
                                            max_outer_iterations=20)      
      relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
      relaxed_pdb_path = f'{prefix}_relaxed_{model_name}.pdb'
      with open(relaxed_pdb_path, 'w') as f: f.write(relaxed_pdb_str)

  return plddts

In [None]:
#@title Predict structure
with open("query.a3m", "r") as a3m_file:
  a3m_lines = a3m_file.read()
msa, deletion_matrix = pipeline.parsers.parse_a3m(a3m_lines)
query_sequence = msa[0]

feature_dict = {
    **pipeline.make_sequence_features(sequence=query_sequence,
                                      description="none",
                                      num_res=len(query_sequence)),
    **pipeline.make_msa_features(msas=[msa],deletion_matrices=[deletion_matrix]),
    **mk_mock_template(query_sequence)
}
plddts = predict_structure(jobname,feature_dict,model_runners)
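
For readers unfamiliar with the A3M format, the sketch below shows what the `msa` and `deletion_matrix` values represent. It is a simplified stand-in written for illustration, not the code of `parse_a3m`; the function name `parse_a3m_sketch` is hypothetical. In A3M, lowercase letters are insertions relative to the query: they are dropped from the aligned row, and their count is recorded at the next aligned position.

```python
def parse_a3m_sketch(a3m_text):
    """Return (aligned_sequences, deletion_matrix) for an A3M string."""
    # Collect one raw sequence per ">" header; sequences may span several lines.
    raw_sequences = []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            raw_sequences.append("")
        elif raw_sequences:
            raw_sequences[-1] += line

    aligned_sequences = []
    deletion_matrix = []
    for raw in raw_sequences:
        aligned = []
        deletions = []
        pending = 0  # lowercase (inserted) residues seen since the last column
        for char in raw:
            if char.islower():
                pending += 1
            else:
                aligned.append(char)
                deletions.append(pending)
                pending = 0
        aligned_sequences.append("".join(aligned))
        deletion_matrix.append(deletions)
    return aligned_sequences, deletion_matrix


example = ">query\nMKT\n>hit\nMaK-\n"
rows, deletion_counts = parse_a3m_sketch(example)
print(rows)
print(deletion_counts)
```

Every aligned row has the same length as the query, which is why the first row can serve as `query_sequence`.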

In [None]:
#@title Plot LDDT per residue
# confidence per position
plt.figure(dpi=100)
# use `name` as the loop variable so the imported `model` module is not shadowed
for name,value in plddts.items():
  plt.plot(value,label=name)
plt.legend()
plt.ylim(0,100)
plt.ylabel("predicted LDDT")
plt.xlabel("positions")
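# save before showing: after plt.show() the figure is closed and savefig would write an empty image
plt.savefig(jobname+"_relaxed_model_1.png")
plt.show()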
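
Besides the plot, the per-residue values can be summarized numerically. The following sketch groups scores into confidence bands; the helper `summarize_plddt`, the example values, and the thresholds (90, 70, 50, as commonly used when interpreting pLDDT) are assumptions for illustration and are not part of the notebook.

```python
def summarize_plddt(values):
    """Count residues per confidence band and compute the mean pLDDT.

    Assumed thresholds: >90 very high, >70 confident, >50 low, else very low.
    """
    bands = {"very high": 0, "confident": 0, "low": 0, "very low": 0}
    for value in values:
        if value > 90:
            bands["very high"] += 1
        elif value > 70:
            bands["confident"] += 1
        elif value > 50:
            bands["low"] += 1
        else:
            bands["very low"] += 1
    mean = sum(values) / len(values) if values else float("nan")
    return bands, mean


# Made-up example scores, one per residue.
example_scores = [95.2, 88.0, 71.5, 64.3, 42.0]
band_counts, mean_plddt = summarize_plddt(example_scores)
print(band_counts)
print(round(mean_plddt, 1))
```

In the notebook, each entry of `plddts` could be passed to such a function after converting it to a list of floats.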

In [None]:
#@title Show 3D structure
p = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js')
p.addModel(open(jobname+"_relaxed_model_1.pdb",'r').read(),'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})
p.zoomTo()
p.show()

In [3]:
#@title Download result
!tar cfz result.tar.gz $jobname"_relaxed_model_1.pdb" $jobname"_relaxed_model_1.png"
from google.colab import files
files.download('result.tar.gz')

tar: default_relaxed_model_1.pdb: Cannot stat: No such file or directory
tar: default_relaxed_model_1.png: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors


ModuleNotFoundError: No module named 'google.colab'