<a href="https://colab.research.google.com/github/patrickbryant1/Cfold/blob/master/Cfold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cfold
**Structure prediction of alternative protein conformations**


This Colab contains a pipeline for predicting alternative conformations of protein structures. \
\
The network, Cfold, is a structure prediction network similar to AlphaFold2 that is trained on a conformational split of the PDB.
As a result, the network learns to relate each conformation to a single MSA representation. \
\
AlphaFold2 is available under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) and so is Cfold, which is a derivative thereof. The Cfold parameters are made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).
\
\
**You may not use these files except in compliance with the licenses.**

Cfold is available for local installation here: https://github.com/patrickbryant1/Cfold

In [None]:
#@title Install dependencies

#@markdown Make sure your runtime uses a GPU.
#@markdown In the menu above, select: Runtime --> Change runtime type --> Hardware accelerator (set to GPU)

#@markdown **Press play.**

#@markdown You will have to restart the runtime after this finishes so that the new packages are loaded.
#@markdown In the menu above, select: Runtime --> Restart runtime

#@markdown **After restarting**, reconnect to Google Drive.

#@markdown This installation is only required the first time this notebook is run.
!pip install -U jaxlib==0.3.24+cuda11.cudnn82 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
!pip install jax==0.3.24
!pip install ml-collections==0.1.1
!pip install dm-haiku==0.0.9
!pip install pandas==1.3.5
!pip install biopython==1.81
!pip install chex==0.1.5
!pip install dm-tree==0.1.8
!pip install immutabledict==2.0.0
!pip install numpy==1.21.6
!pip install scipy==1.7.3
!pip install tensorflow-cpu==2.12.0
!pip install tensorflow==2.11.0
!pip install optax==0.1.4
!pip install py3Dmol
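
After restarting the runtime, it can help to confirm that the pinned versions were actually picked up. The sketch below is not part of Cfold: `PINS` and `check_pins` are hypothetical names, and the dictionary lists only a few of the pins from the install cell. It relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version

# A few pins copied from the install cell; extend this dict as needed.
# (Hypothetical helper, not part of the Cfold repository.)
PINS = {
    "jax": "0.3.24",
    "dm-haiku": "0.0.9",
    "ml-collections": "0.1.1",
    "numpy": "1.21.6",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If nothing is printed, every listed package matches its pin. A mismatch right after installation usually means the runtime has not been restarted yet.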

In [None]:
#@title Clone the Cfold github repo
import shutil
#Remove any previous clone so that git clone starts from a clean directory
shutil.rmtree('/content/Cfold', ignore_errors=True)

!git clone https://github.com/patrickbryant1/Cfold.git

In [20]:
#@title #Follow all steps outlined below to run Cfold
#@markdown To try the **test case** 4AVA, tick the "test_case" box, then press the play button to the left. \
#@markdown If you don't want to run the test case, **leave the box blank**.

#@markdown #Settings
#@markdown - **ID** - a name for your protein; it sets the output directory and the expected MSA file name. \
#@markdown - **SEQUENCE** - the protein sequence; it must match the query sequence in the MSA. \
#@markdown - **MSA** - no MSA search is currently available directly in the browser, so you have to provide your own MSA in a3m format and upload it here. \

#@markdown Go to https://toolkit.tuebingen.mpg.de/tools/hhblits \
#@markdown Paste your protein sequence in fasta format into the search field --> Submit. \
#@markdown When the search is finished, go to the tab "Query Template MSA" and click "Download Full A3M". \
#@markdown Upload the MSA here: \
#@markdown Click the folder icon (Files) to the left and select the upload file icon. Upload your file. \
#@markdown Make sure to name your MSA **"ID".a3m**

#@markdown - **NUM_RECYCLES** - how many recycles to use in the network.
#@markdown - **NUM_SAMPLES** - how many samples to take per cluster size.

import sys, os
from google.colab import files
import pandas as pd
import numpy as np
import glob
sys.path.insert(0,'/content/Cfold/src')
test_case = True #@param {type:"boolean"}
ID = "4AVA" #@param {type:"string"}
NUM_RECYCLES = 3 # @param {type:"integer"}
NUM_SAMPLES = 5 # @param {type:"integer"}
SEQUENCE = "DGIAELTGARVEDLAGMDVFQGCPAEGLVSLAASVQPLRAAAGQVLLRQGEPAVSFLLISSGSAEVSHVGDDGVAIIARALPGMIVGEIALLRDSPRSATVTTIEPLTGWTGGRGAFATMVHIPGVGERLLRTARQRLAAFVSPIPVRLADGTQLMLRPVLPGDRERTVHGHIQFSGETLYRRFMSPALMHYLSEVDYVDHFVWVVTDGSDPVADARFVRDETDPTVAEIAFTVADAYQGRGIGSFLIGALSVAARVDGVERFAARMLSDNVPMRTIMDRYGAVWQREDVGVITTMIDVPGPGELSLGREMVDQINRVARQVIEAVG" #@param {type:"string"}
OUTDIR="/content/"+ID+'/'
os.makedirs(OUTDIR, exist_ok=True)

#Select the input files: user-provided or the bundled test case
if not test_case:
  #Write the query sequence to a fasta file
  FASTA_FILE='/content/'+ID+'.fasta'
  with open(FASTA_FILE, 'w') as file:
    file.write('>'+ID+'\n')
    file.write(SEQUENCE)
  #Check that the MSA exists
  MSA='/content/'+ID+'.a3m'
  if not os.path.exists(MSA):
    print('No MSA.')
    raise FileNotFoundError('Please provide an MSA named '+ID+'.a3m')
  #Check that the first (query) sequence in the MSA matches SEQUENCE
  with open(MSA, 'r') as file:
    for line in file:
      if line.startswith('>'):
        continue
      if line.rstrip()!=SEQUENCE:
        print('The sequence provided and the sequence in the MSA do not match!')
      #Only the first sequence is the query; the remaining rows are hits
      break
else:
  MSA='/content/Cfold/data/test/'+ID+'.a3m'
  FASTA_FILE='/content/Cfold/data/test/'+ID+'.fasta'


print('Using MSA:',MSA)
print('Using fasta file:',FASTA_FILE)

#Get the Cfold parameters (if not downloaded)
if not os.path.exists('/content/params10000.npy'):
  print('Getting Cfold network parameters.')
  !wget https://gitlab.com/patrickbryant1/cfold_data/-/raw/main/params10000.npy

Using MSA: /content/Cfold/data/test/4AVA.a3m
Using fasta file: /content/Cfold/data/test/4AVA.fasta
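
Before running the prediction, it can be useful to inspect an uploaded a3m file by hand. The following is a minimal sketch, not the parser Cfold uses: `read_a3m`, `strip_insertions` and `query_matches` are hypothetical helpers. It assumes the first record is the query and that lowercase letters mark insertions relative to the query, as in the a3m format; skipping lines that start with `#` is an assumption about optional comment lines.

```python
def read_a3m(lines):
    """Parse a3m text lines into a list of (header, sequence) records."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no sequence data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(seq):
    """Drop lowercase insertion states so the row aligns to the query columns."""
    return "".join(c for c in seq if not c.islower())


def query_matches(records, sequence):
    """True if the first record (assumed to be the query) equals the sequence."""
    return bool(records) and records[0][1] == sequence


# Toy MSA: one query and two hits (a deletion in hit1, an insertion in hit2).
toy = """>query
MKTAY
>hit1
MK-AY
>hit2
MKTaAY
""".splitlines()

records = read_a3m(toy)
print(len(records), "sequences; query matches:", query_matches(records, "MKTAY"))
```

For a real file, the same functions can be applied to `open(MSA)` in place of the toy lines. After `strip_insertions`, every row should have the same length as the query.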


In [None]:
#@title Generate the input features for the network and predict.
from make_msa_seq_feats_colab import process
from predict_with_clusters_colab import predict
import pickle

#Features
features_output_path = os.path.join(OUTDIR, 'msa_features.pkl')

if not os.path.exists(features_output_path):
  #Generate the features from the fasta file and the MSA
  feature_dict = process(FASTA_FILE, [MSA])

  #Write out features as a pickled dictionary.
  with open(features_output_path, 'wb') as f:
    pickle.dump(feature_dict, f, protocol=4)
  print('Saved features to',features_output_path)
else:
  print('Features exist at',features_output_path)


#Predict
