This pipeline expands a binary list of protein-protein interactions into larger complexes and prepares AlphaFold input structures, leveraging local ColabFold and structural biology toolkits.
Install or load the following dependencies:
- Phenix
- CCP4
- DrSASA
- localcolabfold with access to alphafold2_multimer_v3
- PyMOL and ImageMagick for thumbnail figure generation (optional)
Prepare a CSV file containing binary protein interactions. Example:
Q9FM80,AT2G38750
Q9FM80,AT3G26060
...
- Use UniProt IDs wherever possible.
- Ensure naming consistency: avoid mixing synonyms such as Q9FM80 and AT5G55580.
- The common bait protein must be in the first column.
- Do not use Excel/Google Sheets "pull down" autofill—this may corrupt UniProt IDs.
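The rules above are easy to break by hand-editing in a spreadsheet, so it can be worth sanity-checking the file before launching the pipeline. Below is a minimal, illustrative sketch (the function name `check_interactions` is not part of the pipeline):

```python
import csv

def check_interactions(path):
    """Sanity-check a binary-interactions CSV before running the pipeline:
    two columns per row, a single common bait in column 1, no duplicate pairs."""
    baits, pairs, problems = set(), set(), []
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) != 2:
                problems.append(f"expected 2 columns: {row}")
                continue
            bait, prey = (field.strip() for field in row)
            baits.add(bait)
            if (bait, prey) in pairs:
                problems.append(f"duplicate pair: {bait},{prey}")
            pairs.add((bait, prey))
    if len(baits) > 1:
        problems.append(f"multiple baits in column 1: {sorted(baits)}")
    return pairs, problems
```

Running this on your interactions file and reviewing the returned problem list catches mixed-synonym baits and autofill-corrupted rows before any GPU time is spent.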
A script is provided to produce a CSV file of interactors downloaded from BioGRID. Example usage:
get_biogrid_interactions.py -g Q92542 -s -f accession -o 'Homo sapiens' -e "AFFINITY CAPTURE-WESTERN:2" --accesskey abcdefghij
Gets interactors of UniProt ID Q92542 in Homo sapiens, requiring at least 2 AFFINITY CAPTURE-WESTERN interactions, with strict checking enforced for database ID mapping and with the "accession" fields specified for matching. Other query fields are listed at https://www.uniprot.org/help/query-field. Access keys can be obtained at https://webservice.thebiogrid.org. Note that this uses the script uniprot_id_mapping.py, which must be in the PATH.
python3 assemble_complexes_from_binary_interactions.py \
-c interactions.txt \
-o thaliana \
-b3 \
-f \
--strict \
-a xref_araport,xref_gramene \
-g

In the same directory as your AF_complex_queue.db SQLite database, run:

alphafold_trim_sequences.pl

Note that PDBs for every sequence in your database will be downloaded to the current directory, and the sequences in your AF_complex_queue.db will be edited (to include underscores in the very common case of discontinuous high-confidence regions). This script makes use of the Phenix programs phenix.process_predicted_model and phenix.print_sequence and the CCP4 program PDBSET.
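Since trimming rewrites sequences in place, it can be useful to see which entries were split into discontinuous segments. A minimal sketch, assuming the proteins table schema shown later in this README (the function name is illustrative):

```python
import sqlite3

def trimmed_sequences(db="AF_complex_queue.db"):
    """List proteins whose sequence contains '_' (the marker that
    alphafold_trim_sequences.pl inserts between discontinuous
    high-confidence segments), with the number of segments each."""
    con = sqlite3.connect(db)
    rows = con.execute(
        "SELECT Uniprot_ID, Sequence FROM proteins WHERE instr(Sequence, '_') > 0"
    ).fetchall()
    con.close()
    return [(uid, seq.count("_") + 1) for uid, seq in rows]
```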
python3 AF_complexes_release/alphafuser_worker.py

Or on SLURM, to e.g. request a GPU:
srun -pgpu --gres=gpu:1 --mem-per-gpu 48G -Ca40 \
python3 AF_complexes_release/alphafuser_worker.py

This script calls bsa.pl, which must be in the PATH; bsa.pl is a Perl wrapper for DrSASA.
Multiple alphafuser_worker.py instances can be run in parallel, but large numbers can cause locking/concurrency issues on the SQLite database.
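One generic SQLite mitigation (not a feature of the pipeline itself) is to open the database in WAL mode with a busy timeout, so readers do not block the writer and workers wait rather than fail on a momentary lock. A sketch:

```python
import sqlite3

def open_queue(path="AF_complex_queue.db", timeout_s=60):
    """Open the queue database with settings that reduce 'database is locked'
    errors when several worker processes poll it at once."""
    con = sqlite3.connect(path, timeout=timeout_s)
    # WAL lets readers proceed while one writer holds the lock
    con.execute("PRAGMA journal_mode=WAL;")
    # wait up to timeout_s before giving up on a locked database
    con.execute(f"PRAGMA busy_timeout={timeout_s * 1000};")
    return con
```

Even with WAL, SQLite allows only one writer at a time, so keeping the number of concurrent workers modest remains the safest approach.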
how_many_complexes.sh

Provides statistics on the progress of the computation.
A SQLite database (AF_complex_queue.db) will store:
- Trimmed sequences
- Complex assembly information, most importantly IPTM and buried surface area
- Folders containing AlphaFold multimer models
The database schema is as follows:
CREATE TABLE proteins(Uniprot_ID,Protein_name,PDB_path,Sequence,UNIQUE(Protein_name,Sequence));
CREATE TABLE complex_components(complex_name,protein_name);
CREATE TABLE queue (
complex_name TEXT,
percent_connections REAL,
priority INTEGER,
status TEXT,
GPU_type TEXT,
dockq_score REAL,
pLDDT REAL,
ptmscore REAL,
thread INTEGER,
iptm REAL,
average_buried_area REAL,
average_delta_G REAL,
best_buried_area REAL,
best_delta_G REAL,
start_time TIMESTAMP,
end_time TIMESTAMP,
failures INTEGER,
poisoned_based_on INTEGER
);
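With this schema, ranking the results is a single query. A minimal sketch that orders complexes by iptm and breaks ties on best buried surface area (the function name and the cutoff behavior are illustrative, not part of the pipeline):

```python
import sqlite3

def top_complexes(db="AF_complex_queue.db", n=10):
    """Return the n highest-confidence complexes from the queue table,
    ranked by iptm, then by best buried surface area."""
    con = sqlite3.connect(db)
    rows = con.execute(
        """SELECT complex_name, iptm, best_buried_area
           FROM queue
           WHERE iptm IS NOT NULL
           ORDER BY iptm DESC, best_buried_area DESC
           LIMIT ?""",
        (n,),
    ).fetchall()
    con.close()
    return rows
```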
alphafuser_generate_thumbnails.pl can be used to generate PyMOL figures colored by protein and by pLDDT. The simplest way to run it is to make sure the "newcsv.csv" file from complex assembly above is in the same directory as the directory containing the local-colabfold subdirectories, and run it with no arguments. Some options are provided for use with AWS, but they are largely untested.
alphafuser_network_diagram.py can be used to produce CSV files that the included R_network_diagram.R script turns into a network diagram of the Alphafuser run.
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.