cg2all

Convert coarse-grained protein structure to all-atom model

Web server / Google Colab notebook

Hugging Face Spaces
A demo web page for converting a CG model to an all-atom structure is available on Hugging Face Spaces.

Google Colab
A Google Colab notebook is available for the following tasks:

  • Task 1: Conversion of an all-atom structure to a CG model using convert_all2cg
  • Task 2: Conversion of a CG model to an all-atom structure using convert_cg2all
  • Task 3: Conversion of a CG simulation trajectory to an atomistic simulation trajectory using convert_cg2all

Google Colab
A Google Colab notebook is available for local optimization of a protein model structure against a cryo-EM density map using cryo_em_minimizer.py.

Installation

These steps will install Python libraries including cg2all (this repository), a modified MDTraj, a modified SE3Transformer, and other dependent libraries. The installation steps also place the executables convert_cg2all and convert_all2cg in your Python binary directory.

This package is tested on Linux (CentOS) and macOS (Apple Silicon, M1).

for CPU only

pip install git+http://github.com/huhlim/cg2all

for CUDA (GPU) usage

  1. Install Miniconda
  2. Create an environment with the DGL library with CUDA support
# This is an example with cudatoolkit=11.3.
# Set a cudatoolkit version that is compatible with your CUDA driver and DGL library.
# dgl>=1.1 occasionally raises errors, so please use dgl<=1.0.
conda create --name cg2all pip cudatoolkit=11.3 dgl=1.0 -c dglteam/label/cu113
  3. Activate the environment
conda activate cg2all
  4. Install this package
pip install git+http://github.com/huhlim/cg2all
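
After either installation route, a quick check that both executables ended up on your PATH can be sketched as follows. It uses only the Python standard library; the helper name find_cg2all_executables is hypothetical and not part of cg2all.

```python
import shutil


def find_cg2all_executables(names=("convert_cg2all", "convert_all2cg")):
    """Map each executable name to its resolved path, or None if it is not on PATH.

    This helper is illustrative only; it is not provided by cg2all.
    """
    return {name: shutil.which(name) for name in names}


# Report where each executable was found (if at all).
for name, path in find_cg2all_executables().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If an executable is reported as not found, the environment in which you ran pip is probably not the active one (for example, the conda environment has not been activated).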

for cryo_em_minimizer usage

You need an additional Python package, mrcfile, to read cryo-EM density maps.

pip install mrcfile

Usages

convert_cg2all

Convert a coarse-grained protein structure to an all-atom model

usage: convert_cg2all [-h] -p IN_PDB_FN [-d IN_DCD_FN] -o OUT_FN [-opdb OUTPDB_FN]
                      [--cg {supported_cg_models}] [--chain-break-cutoff CHAIN_BREAK_CUTOFF] [-a]
                      [--fix] [--ckpt CKPT_FN] [--time TIME_JSON] [--device DEVICE] [--batch BATCH_SIZE] [--proc N_PROC]

options:
  -h, --help            show this help message and exit
  -p IN_PDB_FN, --pdb IN_PDB_FN
  -d IN_DCD_FN, --dcd IN_DCD_FN
  -o OUT_FN, --out OUT_FN, --output OUT_FN
  -opdb OUTPDB_FN
  --cg {supported_cg_models}
  --chain-break-cutoff CHAIN_BREAK_CUTOFF
  -a, --all, --is_all
  --fix, --fix_atom
  --standard-name
  --ckpt CKPT_FN
  --time TIME_JSON
  --device DEVICE
  --batch BATCH_SIZE
  --proc N_PROC

arguments

  • -p/--pdb: Input PDB file (mandatory).
  • -d/--dcd: Input DCD file (optional). If a DCD file is given, the input PDB file will be used to define its topology.
  • -o/--out/--output: Output PDB or DCD file (mandatory). If an input DCD file is given, the output will be a DCD file. Otherwise, a PDB file will be created.
  • -opdb: If an input DCD file is given, the last snapshot is written to this PDB file. (optional)
  • --cg: Coarse-grained representation to use (optional, default=CalphaBasedModel).
    • CalphaBasedModel: CA-trace (atom names should be "CA")
    • ResidueBasedModel: Residue center-of-mass (atom names should be "CA")
    • SidechainModel: Sidechain center-of-mass (atom names should be "SC")
    • CalphaCMModel: CA-trace + Residue center-of-mass (atom names should be "CA" and "CM")
    • CalphaSCModel: CA-trace + Sidechain center-of-mass (atom names should be "CA" and "SC")
    • BackboneModel: Model only with backbone atoms (N, CA, C)
    • MainchainModel: Model only with mainchain atoms (N, CA, C, O)
    • Martini: Martini model
    • Martini3: Martini3 model
    • PRIMO: PRIMO model
  • --chain-break-cutoff: The CA-CA distance cutoff that determines chain breaks. (default=10 Angstroms)
  • --fix/--fix_atom: preserve coordinates in the input CG model. For example, CA coordinates in a CA-trace model will be kept in its cg2all output model.
  • --standard-name: output atom names follow the IUPAC nomenclature. (default=False; output atom names will use CHARMM atom names)
  • --ckpt: Input PyTorch ckpt file (optional). If a ckpt file is given, it will override "--cg" option.
  • --time: Output JSON file for recording timing. (optional)
  • --device: Specify a device to run the model. (optional) You can choose "cpu" or "cuda"; otherwise the script will detect one automatically.
    "cpu" is usually faster than "cuda" unless the input/output system is very large or you provide a DCD file with many frames, because loading a model ckpt file onto a GPU takes considerable time.
  • --batch: The number of frames to process at a time. (optional, default=1)
  • --proc: The number of threads used for loading input data. It is only used when processing a DCD file. (optional, default=OMP_NUM_THREADS or 1)
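
To illustrate what --chain-break-cutoff controls, here is a minimal sketch of a distance-based chain-break check. It assumes the cutoff is compared against the distance between consecutive CA atoms; the function name and the strict ">" comparison are assumptions for illustration, not cg2all's actual implementation.

```python
import math


def find_chain_breaks(ca_coords, cutoff=10.0):
    """Return each index i where the CA(i) to CA(i+1) distance exceeds cutoff.

    ca_coords: sequence of (x, y, z) tuples in Angstroms, in residue order.
    Illustrative helper only; not part of cg2all.
    """
    breaks = []
    for i in range(len(ca_coords) - 1):
        if math.dist(ca_coords[i], ca_coords[i + 1]) > cutoff:
            breaks.append(i)
    return breaks


# Three residues spaced 3.8 A apart, then a 15 A gap before the fourth.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (22.6, 0.0, 0.0)]
print(find_chain_breaks(coords))  # prints [2]
```

Adjacent CA atoms in a continuous chain are roughly 3.8 Angstroms apart, so the 10 Angstrom default leaves room for a few missing residues before a break is declared.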

examples

Conversion of a PDB file

convert_cg2all -p tests/1ab1_A.calpha.pdb -o tests/1ab1_A.calpha.all.pdb --cg CalphaBasedModel
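
When many structures need converting, the same invocation can be assembled from Python. The sketch below only builds the argument list from flags documented in this README; build_cg2all_command is a hypothetical helper, not part of cg2all, and nothing is executed unless you pass the result to subprocess yourself.

```python
import shlex


def build_cg2all_command(pdb, out, dcd=None, cg="CalphaBasedModel",
                         ckpt=None, device=None, batch=None):
    """Assemble a convert_cg2all argument list (hypothetical helper)."""
    cmd = ["convert_cg2all", "-p", str(pdb)]
    if dcd is not None:
        # With an input DCD, the PDB file defines the topology.
        cmd += ["-d", str(dcd)]
    cmd += ["-o", str(out)]
    if ckpt is not None:
        # A ckpt file overrides --cg, so --cg is omitted here.
        cmd += ["--ckpt", str(ckpt)]
    else:
        cmd += ["--cg", cg]
    if device is not None:
        cmd += ["--device", device]
    if batch is not None:
        cmd += ["--batch", str(batch)]
    return cmd


cmd = build_cg2all_command("tests/1ab1_A.calpha.pdb",
                           "tests/1ab1_A.calpha.all.pdb")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string to subprocess avoids shell quoting problems with unusual file names.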

Conversion of a DCD trajectory file

convert_cg2all -p tests/1jni.calpha.pdb -d tests/1jni.calpha.dcd -o tests/1jni.calpha.all.dcd --cg CalphaBasedModel

Conversion of a PDB file using a ckpt file

convert_cg2all -p tests/1ab1_A.calpha.pdb -o tests/1ab1_A.calpha.all.pdb --ckpt CalphaBasedModel-104.ckpt

convert_all2cg

Convert an all-atom protein structure to a coarse-grained model

usage: convert_all2cg [-h] -p IN_PDB_FN [-d IN_DCD_FN] -o OUT_FN [--cg {supported_cg_models}]

options:
  -h, --help            show this help message and exit
  -p IN_PDB_FN, --pdb IN_PDB_FN
  -d IN_DCD_FN, --dcd IN_DCD_FN
  -o OUT_FN, --out OUT_FN, --output OUT_FN
  --cg

arguments

  • -p/--pdb: Input PDB file (mandatory).
  • -d/--dcd: Input DCD file (optional). If a DCD file is given, the input PDB file will be used to define its topology.
  • -o/--out/--output: Output PDB or DCD file (mandatory). If an input DCD file is given, the output will be a DCD file. Otherwise, a PDB file will be created.
  • --cg: Coarse-grained representation to use (optional, default=CalphaBasedModel).
    • CalphaBasedModel: CA-trace (atom names should be "CA")
    • ResidueBasedModel: Residue center-of-mass (atom names should be "CA")
    • SidechainModel: Sidechain center-of-mass (atom names should be "SC")
    • CalphaCMModel: CA-trace + Residue center-of-mass (atom names should be "CA" and "CM")
    • CalphaSCModel: CA-trace + Sidechain center-of-mass (atom names should be "CA" and "SC")
    • BackboneModel: Model only with backbone atoms (N, CA, C)
    • MainchainModel: Model only with mainchain atoms (N, CA, C, O)
    • Martini: Martini model
    • Martini3: Martini3 model
    • PRIMO: PRIMO model
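
To make the difference between a CA-trace and a residue center-of-mass representation concrete, here is a small standalone sketch. It is not how convert_all2cg is implemented: the atom tuple layout, the mass table, and both function names are assumptions chosen for illustration, and whether hydrogens contribute to the center of mass in cg2all is not specified here.

```python
from collections import defaultdict

# Approximate atomic masses (g/mol) for elements common in proteins.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def calpha_trace(atoms):
    """Keep only the CA position of each residue (CA-trace style)."""
    return {res: xyz for res, name, _elem, xyz in atoms if name == "CA"}


def residue_center_of_mass(atoms):
    """Mass-weighted mean position of all atoms in each residue."""
    weighted = defaultdict(lambda: [0.0, 0.0, 0.0])
    total = defaultdict(float)
    for res, _name, elem, xyz in atoms:
        mass = ATOMIC_MASS[elem]
        total[res] += mass
        for k in range(3):
            weighted[res][k] += mass * xyz[k]
    return {res: tuple(c / total[res] for c in weighted[res]) for res in total}


# Each atom: (residue id, atom name, element, (x, y, z) in Angstroms).
example_atoms = [
    (1, "N", "N", (0.0, 0.0, 0.0)),
    (1, "CA", "C", (1.5, 0.0, 0.0)),
    (1, "C", "C", (2.0, 1.4, 0.0)),
    (1, "O", "O", (1.2, 2.3, 0.0)),
]
print(calpha_trace(example_atoms))
print(residue_center_of_mass(example_atoms))
```

Both representations reduce each residue to a single particle; they differ only in where that particle is placed.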

an example

convert_all2cg -p tests/1ab1_A.pdb -o tests/1ab1_A.calpha.pdb --cg CalphaBasedModel
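
Because convert_all2cg writes exactly the kind of file that convert_cg2all reads, the two tools can be chained into a round trip (all-atom to CG and back). The sketch below only assembles the two command lines from flags documented in this README; round_trip_commands is a hypothetical helper, and the same --cg value is assumed for both steps.

```python
import shlex


def round_trip_commands(all_atom_pdb, cg_pdb, rebuilt_pdb, cg="CalphaBasedModel"):
    """Return [all-atom -> CG, CG -> all-atom] argument lists (hypothetical helper)."""
    to_cg = ["convert_all2cg", "-p", all_atom_pdb, "-o", cg_pdb, "--cg", cg]
    to_all = ["convert_cg2all", "-p", cg_pdb, "-o", rebuilt_pdb, "--cg", cg]
    return [to_cg, to_all]


steps = round_trip_commands("tests/1ab1_A.pdb",
                            "tests/1ab1_A.calpha.pdb",
                            "tests/1ab1_A.calpha.all.pdb")
for step in steps:
    print(shlex.join(step))
# To execute in order: for step in steps: subprocess.run(step, check=True)
```

Comparing the rebuilt structure with the original all-atom input is a simple way to gauge reconstruction quality for a given CG representation.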

script/cryo_em_minimizer.py

Local optimization of a protein model structure against a given electron density map. This script is a proof of concept that uses the cg2all network to optimize at CA-level resolution with objective functions at both atomistic and CA-level resolutions. Using a CUDA environment is highly recommended.

usage: cryo_em_minimizer [-h] -p IN_PDB_FN -m IN_MAP_FN -o OUT_DIR [-a]
                         [-n N_STEP] [--freq OUTPUT_FREQ]
                         [--chain-break-cutoff CHAIN_BREAK_CUTOFF]
                         [--restraint RESTRAINT]
                         [--cg {CalphaBasedModel,CA,ca,ResidueBasedModel,RES,res}]
                         [--standard-name] [--uniform_restraint]
                         [--nonuniform_restraint] [--segment SEGMENT_S]

options:
  -h, --help            show this help message and exit
  -p IN_PDB_FN, --pdb IN_PDB_FN
  -m IN_MAP_FN, --map IN_MAP_FN
  -o OUT_DIR, --out OUT_DIR, --output OUT_DIR
  -a, --all, --is_all
  -n N_STEP, --step N_STEP
  --freq OUTPUT_FREQ, --output_freq OUTPUT_FREQ
  --chain-break-cutoff CHAIN_BREAK_CUTOFF
  --restraint RESTRAINT
  --cg {CalphaBasedModel,CA,ca,ResidueBasedModel,RES,res}
  --standard-name
  --uniform_restraint
  --nonuniform_restraint
  --segment SEGMENT_S

arguments

  • -p/--pdb: Input PDB file (mandatory).
  • -m/--map: Input electron density map file in the MRC or CCP4 format (mandatory).
  • -o/--out/--output: Output directory to save optimized structures (mandatory).
  • -a/--all/--is_all: Whether the input PDB file is an atomistic structure. (optional, default=False)
  • -n/--step: The number of minimization steps. (optional, default=1000)
  • --freq/--output_freq: The interval between saving intermediate outputs. (optional, default=100)
  • --chain-break-cutoff: The CA-CA distance cutoff that determines chain breaks. (default=10 Angstroms)
  • --restraint: The weight of distance restraints. (optional, default=100.0)
  • --cg: Coarse-grained representation to use (default=ResidueBasedModel)
  • --standard-name: output atom names follow the IUPAC nomenclature. (default=False; output atom names will use CHARMM atom names)
  • --uniform_restraint/--nonuniform_restraint: Whether to use uniform restraints. (default=True) If uniform restraints are disabled, the restraint weights depend on the pLDDT values recorded in the B-factor column of the PDB file.
  • --segment: The segmentation method for applying rigid-body operations. (default=None)
    • None: Input structure is not segmented, so the same rigid-body operations are applied to the whole structure.
    • chain: Input structure is segmented based on chain IDs. Rigid-body operations are independently applied to each chain.
    • segment: Similar to the "chain" option, but the structure is segmented based on peptide bond connectivity.
    • 0-99,100-199: Explicit segmentation based on 0-based residue indices.
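
The explicit segmentation string can be read as a comma-separated list of residue index ranges. The sketch below shows one way to interpret it, assuming each "start-end" pair is inclusive at both ends; that inclusiveness, and the function name parse_segments, are assumptions for illustration and do not reflect the script's actual parser.

```python
def parse_segments(spec):
    """Parse a string like "0-99,100-199" into a list of range objects.

    Indices are 0-based; both ends are treated as inclusive (an assumption).
    Illustrative helper only; not part of cg2all.
    """
    segments = []
    for part in spec.split(","):
        start_text, end_text = part.split("-")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"segment start exceeds end: {part!r}")
        segments.append(range(start, end + 1))
    return segments


segments = parse_segments("0-99,100-199")
print([len(seg) for seg in segments])  # prints [100, 100]
```

Under this reading, the example splits a 200-residue structure into two 100-residue segments, each of which would receive its own rigid-body operations.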

an example

./cg2all/script/cryo_em_minimizer.py -p tests/3isr.af2.pdb -m tests/3isr_5.mrc -o 3isr_5+3isr.af2 --all

Datasets

The training/validation/test sets are available at Zenodo.

Reference

Lim Heo & Michael Feig, "One particle per residue is sufficient to describe all-atom protein structures", bioRxiv (2023). Link
