
Running the scripts in SCAN_IDR

SCAN_IDR is a pipeline developed to evaluate AlphaFold2 success rates in predicting the structures of protein assemblies interacting through an intrinsically disordered region. The pipeline evaluates how the delimitations of the inputs and the content of the multiple sequence alignments affect the reliability of the generated models. The study was performed on 42 complexes that share no sequence or structural similarity with the complexes used in AlphaFold2 training.

The pipeline comprises 10 steps:

  1. the retrieval of the PDB files for the cases to be tested (Steps 1 & 2).
  2. the generation of AlphaFold2 runs for all these cases using 10 different protocols that sample different alignment and delimitation conditions (Steps 3 & 4).
  3. the evaluation of the generated models (Steps 5 to 9).
  4. the visualisation of the results using the Jupyter Notebooks provided in the scripts/post_process/ directory (Step 10).

At each step, different folders and command files are generated, which can be run by the user as described below. If necessary, users can adapt these command files to the way commands are executed in their own environment (parallel execution, GPU usage). The use of servers equipped with GPU cards is strongly recommended to obtain the best performance from AlphaFold2.

Content of the archive

  • scripts/: contains the scripts used to generate the dataset, run the predictions and analyse the results
  • data/: contains inputs processed by the python scripts to run the full dataset
  • data_demo/: contains inputs and outputs obtained for two examples from the dataset
  • scan_idr.yml: defines the environment to run the scan_idr pipeline
  • colabfold_v1.3.0.def: a singularity definition file to install the required version of ColabFold and AlphaFold2
  • install_uniref30_2202_db.sh: a script to install the ColabFold sequence database

Packages required to run the scan_idr pipeline are listed in the conda environment file scan_idr.yml (install time ~ 5 minutes). The conda environment can be created with the instruction:

conda env create -n scan_idr --file scan_idr.yml
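
Once the environment has been created, activate it with the standard conda command:

conda activate scan_idr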

<WORKING_DIR> can be defined in scripts/config.ini. Default <WORKING_DIR> = ../

The list of PDB codes used as input is <WORKING_DIR>/data/list_cif.txt, a comma-separated list of PDB codes used to retrieve the reference files from the RCSB.

  • The pipeline can be run from Steps 1 to 10 on all 42 entries using the input files provided in the data/ folder.

  • The full dataset of sequences, alignments, reference structures and structural models generated for the 42 cases using the scripts of the scan_idr pipeline can be retrieved from the scanidr_data_repository.tar archive available at https://zenodo.org/record/7838024.

  • A data_demo/ folder containing inputs for 2 test cases with their expected outputs is described below.

Prerequisites

Most of the dependencies are installed upon creation of the scan_idr conda environment as explained above.

Independent installation of the MMseqs2, ColabFold, and ProFit software is required to run the full pipeline (install time ~ 20 minutes).

  • MMseqs2: This software is used in Step 3 to generate the initial multiple sequence alignments. A sufficiently recent version of MMseqs2 must be obtained from the GitHub repository.
    • Retrieve the program executable with the commands below (Instructions taken from compile-from-source-under-linux).
    • Modify the line <your_path>/MMseqs2/bin/mmseqs with the correct path in the SCAN_IDR/scripts/config.ini file.
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2/
git checkout tags/14-7e284
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make
make install
  • ColabFold: The scan_idr pipeline uses a ColabFold singularity image to run AlphaFold2 in Step 4.
    • The singularity software v3.8.6 is installed at the creation of the scan_idr conda environment.
    • A singularity definition file is provided in SCAN_IDR/colabfold_v1_3_0.def to build the colabfold_container_v13.sif image file (adapted from Tubiana). The generated singularity image colabfold_container_v13.sif corresponds to ColabFold version 1.3, which is compatible with the scan_idr pipeline.
    • Creating the colabfold singularity image requires root privileges; the build command below may be easier to run on a local Linux machine, even one without GPUs.
    • Once built, the colabfold_container_v13.sif file can be copied to the machine running the scan_idr pipeline, such as a server or a cluster, preferably one equipped with GPUs.
    • Modify the line <your_path>/colabfold_container_v13.sif with the correct path in the SCAN_IDR/scripts/config.ini file.
    • Build the .sif image with:
sudo singularity build colabfold_container_v13.sif colabfold_v1_3_0.def

NB: Other implementations of ColabFold may be used, provided the command line used in the Step 4 script is adapted.

  • ProFit: This software is used in Step 6 to calculate different RMSD metrics between the models and the reference structures. The scan_idr pipeline was initially developed with ProFit version 3.1 and was tested with the most recent version, 3.3.
    The ProFit software is freely available at http://www.bioinf.org.uk/software/profit/:

    • Download the ProFit V3.3 Linux Binary Distribution and unpack the tar file.
    • The ProFit executable will be accessible at <your_path>/ProFitV3.3/bin/profit.
    • Modify the line <your_path>/ProFitV3.3/bin/profit with the correct path in the SCAN_IDR/scripts/config.ini file.
    • Create the environment variables HELPDIR and DATADIR which should both point to the top ProFit directory where the files ProFit.help and mdm78.mat are stored:
export HELPDIR=<your_path>/ProFitV3.3
export DATADIR=<your_path>/ProFitV3.3
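
If desired, the same export lines can be appended to your shell startup file so that the variables persist across sessions (a generic shell convenience, not a SCAN_IDR requirement):

echo 'export HELPDIR=<your_path>/ProFitV3.3' >> ~/.bashrc
echo 'export DATADIR=<your_path>/ProFitV3.3' >> ~/.bashrc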
  • uniref30_2202_db database: Installation of this ColabFoldDB is required to build the alignments in Step 3.
    • The script install_uniref30_2202_db.sh is provided in SCAN_IDR/.
    • This script is directly adapted from the ColabFold script setup_databases.sh
    • Modify the line <your_path>/uniref30_2202/uniref30_2202_db with the correct path in the SCAN_IDR/scripts/config.ini file.
    • Searches against this ColabFoldDB require a machine with ~ 128 GB RAM.
    • Run the following command to download and install the database:
      sh install_uniref30_2202_db.sh
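
All four "Modify the line" instructions above edit the same file, SCAN_IDR/scripts/config.ini. As a rough illustration, assuming an installation under /opt, the edited entries might look like the sketch below; the key names are hypothetical and only the path placeholders come from this README, so use the lines already present in config.ini as the authoritative template and substitute the paths only:

mmseqs = /opt/MMseqs2/bin/mmseqs
colabfold_sif = /opt/colabfold_container_v13.sif
profit = /opt/ProFitV3.3/bin/profit
uniref_db = /opt/uniref30_2202/uniref30_2202_db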

How to run the pipeline:

Step 1: Retrieve mmCIF files from the RCSB

cd SCAN_IDR/scripts
./1_batch_download.sh -f <WORKING_DIR>/data/list_cif.txt -o <WORKING_DIR>/data/cif -c
  • As a default option, <WORKING_DIR> can be left unchanged as ../ in the SCAN_IDR/scripts/config.ini file.
  • Output: PDB files will be stored in mmCIF format in the directory <WORKING_DIR>/data/cif/
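
With the default <WORKING_DIR> of ../, the command expands to:

./1_batch_download.sh -f ../data/list_cif.txt -o ../data/cif -c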

Step 2: Parse mmCIF files and get UniProt IDs and delimitations

python 2_GetUnicodeDelim_mmcif.py
  • Output: File_Listing_Uniprot_Inputs.txt, a file listing the UniProt ID and delimitations of each chain

The UniProt IDs and chain delimitations are automatically extracted from the .cif files.
In case the information is not found, the value is replaced by ???.
The file can be edited and corrected manually following the format:

# PDBCODE: 5NCL
CHAIN:A	UNIPROT:P53894	START:251	STOP:756
CHAIN:B	UNIPROT:P43563	START:46	STOP:287
CHAIN:D	UNIPROT:P24276	START:205	STOP:214

Edit and correct the file File_Listing_Uniprot_Inputs.txt if required. The file provided in SCAN_IDR/data/File_Listing_Uniprot_Inputs.txt was generated automatically and had to be corrected manually for a few entries of the 42-case dataset.

Step 3: Generate the MSAs for every UniProt entry

Command to run:

python 3_CmdRetrieveFastaAliFile.py

Using the information in File_Listing_Uniprot_Inputs.txt, this script will:

  • Retrieve the fasta file for every entry from the UniProt database and store it in <WORKING_DIR>/data/fasta_msa
  • Create a command file <WORKING_DIR>/data/fasta_msa/cmd_create_alignments.sh that will:
    • Run mmseqs on every fasta input.
    • Retrieve the full-length sequence of every homolog from the resulting MSA.
    • Realign the full-length sequences using the mafft software installed in the conda environment.

Move to <WORKING_DIR>/data/fasta_msa/, where the cmd_create_alignments.sh script was created.
If required, this script can be edited to modify the generic values of parameters such as QID and COV for specific entries.
Then, run:

sh cmd_create_alignments.sh

For every UniProt entry, a directory msa_<uniprot ID> will be created in <WORKING_DIR>/data/fasta_msa/.
These directories contain the multiple sequence alignments that will be used to generate the concatenated alignments in subsequent steps.
A copy of the MSA files generated in these directories for the 42 tested cases is provided as .a3m files in the final repository distributed as Available Data (see the Zenodo archive above).

Step 4: Create the concatenated MSA for every protocol and generate the models

Command to run in <WORKING_DIR>/scripts:

python 4_PrepareDirectoryforAF2.py -i all 

With the option -i all, the input files for the ten protocols will be generated.

The list of 10 protocols to be run is provided below:

  • mixed_ali-delim-delim: paired+unpaired alignment delimited as in the pdb for both receptor and ligand.

  • mixed_ali-fl-fl: paired+unpaired alignment using the full-length sequences of both receptor and ligand.

  • mixed_ali-delim-fl: paired+unpaired alignment delimited as in the pdb for the receptor but using the full-length sequence of the ligand.

  • mixed_ali-delim-100: paired+unpaired alignment delimited as in the pdb for the receptor and extending the size of the PDB ligand sequence by 100 residues.

  • mixed_ali-delim-200: paired+unpaired alignment delimited as in the pdb for the receptor and extending the size of the PDB ligand sequence by 200 residues.

  • unpaired_ali-delim-delim: unpaired alignment delimited as in the pdb for both receptor and ligand.

  • unpaired_ali-delim-fl: unpaired alignment delimited as in the pdb for the receptor but using the full-length sequence of the ligand.

  • single_pep-delim-delim: alignment for the receptor and only single sequence for the ligand using the delimitations as in the pdb for both receptor and ligand.

  • single_pep-delim-100: alignment for the receptor and only single sequence for the ligand using the delimitations as in the pdb for the receptor and extending the size of the PDB ligand sequence by 100 residues.

  • single_pep-delim-200: alignment for the receptor and only single sequence for the ligand using the delimitations as in the pdb for the receptor and extending the size of the PDB ligand sequence by 200 residues.

  • Output:

    • Creation of a directory <WORKING_DIR>/data/af2_runs/
    • Creation of a directory <WORKING_DIR>/data/af2_runs/<CASE_INDEX>_<PDBCODE>/
    • For every protocol, generation of an af2 directory containing the concatenated MSA and the run_colab.sh script to run ColabFold on this MSA
    • Generation of the output file <WORKING_DIR>/data/af2_runs/cmd_global_af2runs.sh, which lists all the commands to run the protocols for every entry

Move to <WORKING_DIR>/data/af2_runs/ to run:

sh cmd_global_af2runs.sh

If needed, each of these protocols can be generated individually using:
python 4_PrepareDirectoryforAF2.py -i <protocol_name>

  • Explanation of the format of the protocol names:
    <MSA_mode>-<Receptor_Delimitations>-<Ligand_Delimitations>
    • <MSA_mode>: can be either mixed (paired+unpaired), unpaired or single_pep (no MSA, only a single sequence)
    • <Delimitations>: can be delim (same delimitations as in the reference PDB), fl (full-length sequence), 100 or 200 (delimitations of the reference PDB extended by this number of residues)
    • For example, mixed_ali-delim-fl uses a paired+unpaired MSA, a receptor delimited as in the reference PDB and the full-length sequence of the ligand.

Step 5: After AF2 has finished, cut the models to the same delimitations as in the reference PDB to enable CAPRI-like evaluation

Command to run in <WORKING_DIR>/scripts:

python 5_CutModelsForCAPRI.py 
  • Output:
    • Creation of a directory <WORKING_DIR>/data/cutmodels_for_caprieval/
    • Creation of a directory <WORKING_DIR>/data/cutmodels_for_caprieval/<CASE_INDEX>_<PDBCODE>/
    • For every protocol, generation of an af2 directory containing the 25 AF2 models in which receptors and ligands are cut following the delimitations of the reference PDB

Step 6: Run the evaluation of the models using CAPRI criteria

Prior to this step, the user should make sure that the REFERENCE_DIR defined in the config.ini file contains the reference files for the respective test cases. For the demo (5V1U and 6G04), these files can be found in the data_demo/ref_capri_curated folder. The script expects file names with the syntax pdb_chain1-chain2.pdb, where pdb is the same code as in list_cif.txt and chain1 and chain2 refer to the chains to be used for the evaluation of the models (for instance, a reference file covering chains A and B of entry 5V1U would be named 5V1U_A-B.pdb; the chain letters here are only illustrative).

Command to run in <WORKING_DIR>/scripts:

python 6_RunCapri.py 
  • Output:
    • Creation of a directory <WORKING_DIR>/data/caprieval_on_cutmodels/
    • Creation of a directory <WORKING_DIR>/data/caprieval_on_cutmodels/<CASE_INDEX>_<PDBCODE>/
    • Creation of the script cmd_run_capri_all_models.sh in <WORKING_DIR>/data

Move to <WORKING_DIR>/data/ to run:

sh cmd_run_capri_all_models.sh

Step 7: Post-processing - Retrieval of the AF2 scores

Command to run in <WORKING_DIR>/scripts:

python 7_GetAF2Scores.py 
  • Output:
    • Creation of a directory <WORKING_DIR>/data/results_files/
    • Generation of a file AF2_AllProtocols_Scores.out in <WORKING_DIR>/data/results_files/, listing the AF2 scores of all the models

Step 8: Post-processing - Retrieval of the CAPRI evaluation scores

Command to run in <WORKING_DIR>/scripts:

python 8_GenerateGlobalOutputTable.py 
  • Output:
    • Generation of a file Global_results_AF2_CAPRI.out in <WORKING_DIR>/data/results_files/, listing all the CAPRI evaluations and the AF2 scores for all the models

Step 9: Post-processing - Final analyses of the success rates

Command to run in <WORKING_DIR>/scripts:

python 9_RunFinalScoresAnalyses.py
  • Output:
    • For every protocol, generation in <WORKING_DIR>/data/results_files/ of several files comparing the numbers of successful predictions, computed from Global_results_AF2_CAPRI.out
      These files will be used as inputs for the graphical analyses performed in the Jupyter Notebooks located in the <WORKING_DIR>/scripts/post_process directory.

Step 10: Plotting the results

Several Jupyter notebooks are provided in <WORKING_DIR>/scripts/post_process to plot the results of the analysis.

Move to <WORKING_DIR>/scripts/post_process to run each of these notebooks:

  1. File Report_SuccessRates_InvidualMethods.ipynb
    • Plots the success rates of the different protocols as stacked bars
  2. File Report_DockQVsPDB.ipynb
    • Plots, for every PDB entry, the best DockQ score of the models as a histogram
  3. File Report_ScoresVsCAPRI.ipynb
    • Plots the distributions of the AF2 scores vs DockQ scores as box plots
  4. File Report_DatasetProperties_BoxPlots_lengths.ipynb
    • Plots the length distributions of the systems studied as box plots

If you changed the definition of WORKING_DIR in the config.ini file, the path defining the WORKING_DIR variable in the Jupyter notebooks may need to be adjusted accordingly.

Running a demo example

To test the pipeline, it is possible to work on one or two pdb codes rather than running all 42 entries of the dataset.

To do so, edit SCAN_IDR/data/list_cif.txt and change the comma-separated list of PDB codes.
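
For instance, assuming the two demo entries mentioned in Step 6, the edited file would reduce to a single comma-separated line:

5V1U,6G04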

We provide a SCAN_IDR/data_demo/ folder as an example of a simple input with two PDB entries, together with all the outputs expected to be generated from Steps 1 to 10:

  • First, uncompress the large directories in SCAN_IDR/data_demo/:
    fasta_msa.tar.gz, af2_runs.tar.gz and cutmodels_for_caprieval.tar.gz
  • The file SCAN_IDR/data_demo/list_cif.txt can be copied to your <WORKING_DIR>/data/ and used as input of Steps 1 to 10 to run only 2 cases.
  • If you want to start the pipeline at a specific step, the entire content of SCAN_IDR/data_demo/ can be copied to <WORKING_DIR>/data/. All the files and folders required to run every step independently will then be accessible; for instance, it is possible to start the pipeline at Step 5 and run all subsequent steps from there.
    NB: To run the last step (Step 10), you need to have at least run Step 9 on your system.

The files and folders from SCAN_IDR/data_demo/ can be compared to the outputs generated by the pipeline.

  • Indicative times for the execution of the demo:
    • Step 1: < 1 min
    • Step 2: < 1 min
    • Step 3: 5 min (including the python script and MSA generation on 80 CPUs)
    • Step 4: 1-2 min for the python script + 5 hours to run AlphaFold on a single A100 GPU (can be bypassed by using the data_demo/af2_runs files)
    • Step 5: 10 min
    • Step 6: <1 min for the python script + 5 min for running CAPRI evaluation
    • Step 7: <1 min
    • Step 8: <1 min
    • Step 9: <1 min
    • Step 10: a few minutes to run each Jupyter notebook

Acknowledgements

  • We would like to thank the ColabFold and AlphaFold teams for providing open access to their software.
  • We are grateful to Martin, A.C.R. for the development of the ProFit software based on the McLachlan algorithm (McLachlan, A.D., 1982, Acta Cryst A38, 871-873).
  • Thanks to Thibault Tubiana and Chloé Quignot for providing their singularity definition file for ColabFold.