SCAN_IDR is a pipeline developed to evaluate the success rates of AlphaFold2 in predicting the structures of protein assemblies interacting through an intrinsically disordered region. The pipeline evaluates how the delimitations of the inputs and the content of the multiple sequence alignments affect the reliability of the generated models. It was evaluated on 42 complexes that have no sequence or structural similarity to the complexes used in the AlphaFold2 training.
The pipeline goes through 10 steps:
- the retrieval of the pdb files for the cases to be tested (Steps 1 & 2);
- the generation of AlphaFold2 runs for all these cases using 10 different protocols, which sample different alignment and delimitation conditions (Steps 3 & 4);
- the evaluation of the generated models (Steps 5 to 9);
- the visualisation of the results using the Jupyter notebooks provided in the `scripts/post_process/` directory (Step 10).
At each step, different folders and command files are generated, which can be run by the user as described below. Users can also adapt these command files to the way commands should be executed in their own environment (parallel execution, GPU usage). The use of servers equipped with GPU cards is strongly recommended to obtain the best performance from AlphaFold2.
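For instance, a generated command file that contains one independent command per line can be adapted as in the minimal sketch below (`cmd_example.sh` stands for any of the generated command files; GNU parallel is not a pipeline requirement, just one possible option):

```
# Pin the execution of a generated command file to a given GPU
export CUDA_VISIBLE_DEVICES=0
sh cmd_example.sh

# Or, for CPU-bound steps, run its commands 4 at a time with GNU parallel
parallel -j 4 < cmd_example.sh
```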
- `scripts/`: contains the scripts used to generate the dataset, run the predictions and analyse the results
- `data/`: contains inputs processed by the python scripts to run the full dataset
- `data_demo/`: contains inputs and outputs obtained for two examples from the dataset
- `scan_idr.yml`: defines the environment to run the scan_idr pipeline
- `colabfold_v1.3.0.def`: a singularity definition file to install the required versions of ColabFold and AlphaFold2
- `install_uniref30_2202_db.sh`: a script to install the ColabFold sequence database
Packages required to run the `scan_idr` pipeline are listed in the conda environment file `scan_idr.yml` (install time ~5 minutes). The conda environment can be created with the instruction:

```
conda env create -n scan_idr --file scan_idr.yml
```
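Once created, the environment is activated with the standard conda instruction:

```
conda activate scan_idr
```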
`<WORKING_DIR>` can be defined in `scripts/config.ini`. Default: `<WORKING_DIR> = ../`
The list of pdb codes used as input is `<WORKING_DIR>/data/list_cif.txt`, a comma-separated list of pdb codes used to retrieve the reference files from the RCSB.
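For instance, a `list_cif.txt` restricted to the two demo entries described below would contain the single line:

```
5V1U,6G04
```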
- The pipeline can be run from Steps 1 to 10 on all the 42 entries using the input files provided in the `data/` folder.
- The full dataset of sequences, alignments, reference structures and structural models generated for the 42 cases using the scripts of the `scan_idr` pipeline can be retrieved from the `scanidr_data_repository.tar` archive available at https://zenodo.org/record/7838024.
- A `data_demo/` folder containing inputs for 2 test cases with their expected outputs is presented below.
Most of the dependencies are installed upon the creation of the `scan_idr` conda environment explained above. Independent installations of the MMseqs2, ColabFold, and ProFit software are required to run the full pipeline (install time ~20 minutes).
- MMseqs2: This software is used in Step 3 to generate the initial multiple sequence alignments. A suitably recent version of MMseqs2 must be obtained from the GitHub repository.
  - Retrieve the program executable with the commands below (instructions taken from the MMseqs2 compile-from-source-under-linux documentation).
  - Modify the line `<your_path>/MMseqs2/bin/mmseqs` with the correct path in the `SCAN_IDR/scripts/config.ini` file.

  ```
  git clone https://github.com/soedinglab/MMseqs2.git
  cd MMseqs2/
  git checkout tags/14-7e284
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
  make
  make install
  ```
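  - If the compilation succeeded, the executable can be checked from the `build/` directory (the `-DCMAKE_INSTALL_PREFIX=.` setting used above places it in `build/bin/`):

  ```
  ./bin/mmseqs version
  ```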
- ColabFold: The `scan_idr` pipeline uses a ColabFold singularity image to run AlphaFold2 in Step 4.
  - The singularity software v3.8.6 is installed at the creation of the `scan_idr` conda environment.
  - A singularity definition file is provided in `SCAN_IDR/colabfold_v1_3_0.def` to build the `colabfold_container_v13.sif` image file (adapted from Tubiana). The generated singularity file `colabfold_container_v13.sif` is an image of ColabFold version 1.3 compatible with the `scan_idr` pipeline.
  - To create the ColabFold singularity image, root privilege is needed, so the build command below may be more easily run on a local linux machine, even without GPUs.
  - Build the `.sif` image with:

  ```
  sudo singularity build colabfold_container_v13.sif colabfold_v1_3_0.def
  ```

  - Once built, the file `colabfold_container_v13.sif` can be copied to the machine running the `scan_idr` pipeline, such as a server or a cluster, preferentially equipped with GPUs.
  - Modify the line `<your_path>/colabfold_container_v13.sif` with the correct path in the `SCAN_IDR/scripts/config.ini` file.
NB: Other implementations of colabfold may be used, provided the command line used in the Step 4 script is adapted.
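A quick way to check the built image is to query the ColabFold executable inside the container (assuming the image exposes the standard `colabfold_batch` entry point):

```
singularity exec colabfold_container_v13.sif colabfold_batch --help
```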
- ProFit: This software is used in Step 6 to calculate different RMSD metrics between the models and the reference structures. The `scan_idr` pipeline was initially developed to run with ProFit version 3.1 and was tested with the most recent version 3.3. ProFit is freely available at http://www.bioinf.org.uk/software/profit/.
  - Download the ProFit V3.3 Linux binary distribution and unpack the tar file.
  - The ProFit executable will be accessible in `<your_path>/ProFitV3.3/bin/profit`.
  - Modify the line `<your_path>/ProFitV3.3/bin/profit` with the correct path in the `SCAN_IDR/scripts/config.ini` file.
  - Create the environment variables HELPDIR and DATADIR, which should both point to the top ProFit directory where the files ProFit.help and mdm78.mat are stored:

  ```
  export HELPDIR=<your_path>/ProFitV3.3
  export DATADIR=<your_path>/ProFitV3.3
  ```
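  - Optionally, these variables can be made persistent by appending the two export lines to your shell profile (a convenience, not a pipeline requirement):

  ```
  echo 'export HELPDIR=<your_path>/ProFitV3.3' >> ~/.bashrc
  echo 'export DATADIR=<your_path>/ProFitV3.3' >> ~/.bashrc
  ```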
- uniref30_2202_db database: Installation of this ColabFoldDB is required to build the alignments in Step 3.
  - The script `install_uniref30_2202_db.sh` is provided in `SCAN_IDR/`. It is directly adapted from the ColabFold script `setup_databases.sh`.
  - Modify the line `<your_path>/uniref30_2202/uniref30_2202_db` with the correct path in the `SCAN_IDR/scripts/config.ini` file.
  - Searches against this ColabFoldDB require a machine with ~128 GB of RAM.
  - Run the following command to download and install the database:

  ```
  sh install_uniref30_2202_db.sh
  ```
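In summary, after installing the dependencies above, the following four lines of `SCAN_IDR/scripts/config.ini` must be edited to point to your local installations:

```
<your_path>/MMseqs2/bin/mmseqs
<your_path>/colabfold_container_v13.sif
<your_path>/ProFitV3.3/bin/profit
<your_path>/uniref30_2202/uniref30_2202_db
```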
Step 1: Retrieve the mmCIF files of the entries listed in `list_cif.txt`
Command to run:

```
cd SCAN_IDR/scripts
./1_batch_download.sh -f <WORKING_DIR>/data/list_cif.txt -o <WORKING_DIR>/data/cif -c
```
- As a default option, `<WORKING_DIR>` can be left unchanged as `../` in the `SCAN_IDR/scripts/config.ini` file.
- Output: PDB files will be stored in the directory `<WORKING_DIR>/data/cif/` in mmCIF format.
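A quick sanity check that the retrieval worked (assuming the default `<WORKING_DIR> = ../`, i.e. running from `SCAN_IDR/scripts`):

```
ls ../data/cif/   # should list one mmCIF file per pdb code of list_cif.txt
```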
Step 2: Extract the uniprot IDs and the chain delimitations from the mmCIF files
Command to run:

```
python 2_GetUnicodeDelim_mmcif.py
```

- Output: a file listing the uniprot ID and the delimitations of every chain: `File_Listing_Uniprot_Inputs.txt`

The uniprot IDs and delimitations are automatically extracted from the `.cif` files. In case the information is not found, the value is replaced by `???`. The file can be edited and corrected manually, following the format:

```
# PDBCODE: 5NCL
CHAIN:A UNIPROT:P53894 START:251 STOP:756
CHAIN:B UNIPROT:P43563 START:46 STOP:287
CHAIN:D UNIPROT:P24276 START:205 STOP:214
```

Edit and correct the file `File_Listing_Uniprot_Inputs.txt` if required. The file provided in `SCAN_IDR/data/File_Listing_Uniprot_Inputs.txt` was automatically generated and had to be corrected manually for a few entries of the 42 dataset cases.
Step 3: Retrieve the fasta sequences and generate the multiple sequence alignments
Command to run:

```
python 3_CmdRetrieveFastaAliFile.py
```

This script uses the information in `File_Listing_Uniprot_Inputs.txt` to:
- retrieve the fasta file of every entry from the uniprot db and store it in `<WORKING_DIR>/data/fasta_msa`;
- create a command file `<WORKING_DIR>/data/fasta_msa/cmd_create_alignments.sh` that will:
  - run mmseqs on every fasta input,
  - retrieve the full-length sequence of every homolog from the resulting MSA,
  - realign the full-length sequences using the `mafft` software installed in the conda environment.

Move to `<WORKING_DIR>/data/fasta_msa/`, where the `cmd_create_alignments.sh` script was created. This script can be edited if required, to modify the generic values of parameters such as QID and COV for specific entries. Then, run:

```
sh cmd_create_alignments.sh
```

For every uniprot entry, a directory `msa_<uniprot ID>` will be created in `<WORKING_DIR>/data/fasta_msa/`. These directories contain the multiple sequence alignments that will be used to generate the concatenated alignments in subsequent steps. A copy of the MSA files generated in these directories for the 42 tested cases is provided in `.a3m` format in the final repository distributed as Available Data.
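At the end of this step, you can verify that one MSA directory was created per uniprot entry (run from `<WORKING_DIR>/data/`):

```
ls -d fasta_msa/msa_*
```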
Step 4: Prepare the concatenated alignments and run the AlphaFold2 protocols
Command to run in `<WORKING_DIR>/scripts`:

```
python 4_PrepareDirectoryforAF2.py -i all
```

With the option `-i all`, the input files for the ten protocols will be generated. The list of the 10 protocols to be run is provided below:
- `mixed_ali-delim-delim`: paired+unpaired alignment delimited as in the pdb for both receptor and ligand.
- `mixed_ali-fl-fl`: paired+unpaired alignment using the full-length sequences of both receptor and ligand.
- `mixed_ali-delim-fl`: paired+unpaired alignment delimited as in the pdb for the receptor but using the full-length sequence of the ligand.
- `mixed_ali-delim-100`: paired+unpaired alignment delimited as in the pdb for the receptor and extending the size of the PDB ligand sequence by 100 residues.
- `mixed_ali-delim-200`: paired+unpaired alignment delimited as in the pdb for the receptor and extending the size of the PDB ligand sequence by 200 residues.
- `unpaired_ali-delim-delim`: unpaired alignment delimited as in the pdb for both receptor and ligand.
- `unpaired_ali-delim-fl`: unpaired alignment delimited as in the pdb for the receptor but using the full-length sequence of the ligand.
- `single_pep-delim-delim`: alignment for the receptor and only a single sequence for the ligand, using the delimitations as in the pdb for both receptor and ligand.
- `single_pep-delim-100`: alignment for the receptor and only a single sequence for the ligand, using the delimitations as in the pdb for the receptor and extending the size of the PDB ligand sequence by 100 residues.
- `single_pep-delim-200`: alignment for the receptor and only a single sequence for the ligand, using the delimitations as in the pdb for the receptor and extending the size of the PDB ligand sequence by 200 residues.
- Output:
  - Creation of a directory `<WORKING_DIR>/data/af2_runs/`.
  - For every entry, creation of a directory `<WORKING_DIR>/data/af2_runs/<CASE_INDEX>_<PDBCODE>/`.
  - For every protocol, generation of an af2 directory containing the concatenated MSA and the `run_colab.sh` script to run colabfold on this MSA.
  - Generation of the output file `<WORKING_DIR>/data/af2_runs/cmd_global_af2runs.sh`, which lists all the commands to run the protocols for every entry.

Move to `<WORKING_DIR>/data/af2_runs/` to run:

```
sh cmd_global_af2runs.sh
```

If needed, each of these protocols can be generated individually using:

```
python 4_PrepareDirectoryforAF2.py -i <protocol_name>
```
Explanation of the format of the protocol names: `<MSA_mode>-<Receptor_Delimitations>-<Ligand_Delimitations>`
- `<MSA_mode>`: can be either `mixed` (paired+unpaired), `unpaired`, or `single_pep` (no MSA, only a single sequence).
- `<Delimitations>`: can be `delim` (same delimitations as in the reference PDB), `fl` (full-length sequence), `100` or `200` (delimitations of the reference PDB extended by this number of residues).

For instance, `mixed_ali-delim-200` uses a paired+unpaired alignment, with the receptor delimited as in the reference PDB and the ligand extended by 200 residues.
Step 5: After AF2 has finished, cut the models to the same delimitations as in the reference PDB to enable CAPRI-like evaluation
Command to run in `<WORKING_DIR>/scripts`:

```
python 5_CutModelsForCAPRI.py
```

- Output:
  - Creation of a directory `<WORKING_DIR>/data/cutmodels_for_caprieval/`.
  - For every entry, creation of a directory `<WORKING_DIR>/data/cutmodels_for_caprieval/<CASE_INDEX>_<PDBCODE>/`.
  - For every protocol, generation of an af2 directory containing the 25 af2 models in which receptors and ligands are cut following the delimitations of the reference PDB.
Step 6: Run the CAPRI-like evaluation of the models against the reference structures
Prior to this step, the user should make sure that the REFERENCE_DIR defined in the `config.ini` file contains the reference files for the respective test cases. For the demo (5V1U and 6G04), these files can be found in the `data_demo/ref_capri_curated` folder. The script expects file names with the syntax `pdb_chain1-chain2.pdb`, where `pdb` is the same code as in `list_cif.txt` and `chain1`, `chain2` refer to the chains that should be used for the evaluation of the models.
Command to run in `<WORKING_DIR>/scripts`:

```
python 6_RunCapri.py
```

- Output:
  - Creation of a directory `<WORKING_DIR>/data/caprieval_on_cutmodels/`.
  - For every entry, creation of a directory `<WORKING_DIR>/data/caprieval_on_cutmodels/<CASE_INDEX>_<PDBCODE>/`.
  - Creation of the script `cmd_run_capri_all_models.sh` in `<WORKING_DIR>/data`.

Move to `<WORKING_DIR>/data/` to run:

```
sh cmd_run_capri_all_models.sh
```
Step 7: Retrieve the AF2 scores of all the models
Command to run in `<WORKING_DIR>/scripts`:

```
python 7_GetAF2Scores.py
```

- Output:
  - Creation of a directory `<WORKING_DIR>/data/results_files/`.
  - Generation of a file `AF2_AllProtocols_Scores.out` in `<WORKING_DIR>/data/results_files/`, listing all the AF2 scores of all the models.
Step 8: Generate the global output table
Command to run in `<WORKING_DIR>/scripts`:

```
python 8_GenerateGlobalOutputTable.py
```

- Output:
  - Generation of a file `Global_results_AF2_CAPRI.out` in `<WORKING_DIR>/data/results_files/`, listing all the capri evaluations and the AF2 scores for all the models.
Step 9: Run the final score analyses
Command to run in `<WORKING_DIR>/scripts`:

```
python 9_RunFinalScoresAnalyses.py
```

- Output:
  - For every protocol, generation of several files comparing the numbers of successful predictions over all the entries of `Global_results_AF2_CAPRI.out`, stored in `<WORKING_DIR>/data/results_files/`. These files will be used as inputs for the graphical analyses performed in the Jupyter notebooks located in the `<WORKING_DIR>/scripts/post_process` directory.
Step 10: Visualise the results with the Jupyter notebooks
Several Jupyter notebooks are provided to plot the results of the analysis; they are located in `<WORKING_DIR>/scripts/post_process`.
Move to `<WORKING_DIR>/scripts/post_process` to run each of these notebooks:
- `Report_SuccessRates_InvidualMethods.ipynb`: plots, as stacked bars, the success rates of the different protocols.
- `Report_DockQVsPDB.ipynb`: plots, as histograms, the best DockQ score of the models for every pdb.
- `Report_ScoresVsCAPRI.ipynb`: plots, as box plots, the distributions of the AF2 scores vs the DockQ scores.
- `Report_DatasetProperties_BoxPlots_lengths.ipynb`: plots, as box plots, the length distributions of the systems studied.
If you changed the definition of your `WORKING_DIR` in the `config.ini` file, the path defining the WORKING_DIR variable in the Jupyter notebooks may need to be adjusted.
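The notebooks can be opened with Jupyter, assumed here to be available in the `scan_idr` conda environment (otherwise install it, e.g. with conda or pip):

```
cd <WORKING_DIR>/scripts/post_process
jupyter notebook Report_SuccessRates_InvidualMethods.ipynb
```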
Running a demo example
To test the pipeline, it is possible to work on one or two pdb codes rather than running all 42 entries of the dataset. To do so, edit `SCAN_IDR/data/list_cif.txt` and change the comma-separated list of PDB codes.
We provide a `SCAN_IDR/data_demo/` folder as an example of a simple input with two PDB entries, together with all the outputs that are expected to be generated from Steps 1 to 10:
- First, uncompress the large directories in `SCAN_IDR/data_demo/`: `fasta_msa.tar.gz`, `af2_runs.tar.gz` and `cutmodels_for_caprieval.tar.gz`.
- The file `SCAN_IDR/data_demo/list_cif.txt` can be copied to your `<WORKING_DIR>/data/` and used as input of Steps 1 to 10 to run only these 2 cases.
- If you want to run the pipeline starting at a specific step, the entire content of `SCAN_IDR/data_demo/` can be copied to `<WORKING_DIR>/data/`. This way, all the files and folders required to run every step independently will be accessible; for instance, it is then possible to start the pipeline at Step 5 and run all subsequent steps. A command sketch is given after this list.

NB: To run the last step (Step 10), you need to have at least run Step 9 on your system.
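The demo preparation described above boils down to a few commands (a sketch assuming the default `<WORKING_DIR> = ../`, so that `<WORKING_DIR>/data/` is `SCAN_IDR/data/`):

```
# Uncompress the large demo directories
cd SCAN_IDR/data_demo
tar xzf fasta_msa.tar.gz
tar xzf af2_runs.tar.gz
tar xzf cutmodels_for_caprieval.tar.gz

# Run the demo from Step 1: only the 2-entry list is needed
cp list_cif.txt ../data/

# Or copy the entire demo content to start at any intermediate step (e.g. Step 5)
cp -r . ../data/
```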
The files and folders from `SCAN_IDR/data_demo/` can be compared to the outputs generated by the pipeline.
- Indicative times for the execution of the demo:
  - Step 1: < 1 min
  - Step 2: < 1 min
  - Step 3: 5 min (including the python script and the MSA generation on 80 CPUs)
  - Step 4: 1-2 min for the python script + 5 hours to run AlphaFold on a single A100 GPU (can be bypassed by using the `data_demo/af2_runs` files)
  - Step 5: 10 min
  - Step 6: < 1 min for the python script + 5 min for running the CAPRI evaluation
  - Step 7: < 1 min
  - Step 8: < 1 min
  - Step 9: < 1 min
  - Step 10: a few minutes to run each Jupyter notebook
- We would like to thank the ColabFold and AlphaFold teams for providing open access to their software.
- We are grateful to A.C.R. Martin for the development of the ProFit software, based on the McLachlan algorithm (McLachlan, A.D., 1982, Acta Cryst A38, 871-873).
- Thanks to Thibault Tubiana and Chloé Quignot for providing their singularity definition file for ColabFold.