Predict enzyme pH optima directly from 3D protein structures using an Equivariant Graph Neural Network (EGNN).
Includes dataset generation, training, prediction with attention export, and visualization & analysis tools.
- EGNN model with attention and edge features (`egnn.py`)
- Dataset builder from CIF/PQR → PyTorch Geometric tensors
- Training pipeline (LDS, weighted losses, k-fold)
- Prediction scripts for new proteins (auto-align with checkpoint)
- Attention visualization (mapped to PDB B-factors)
- RSA and distance analysis
# 1️⃣ Create the environment
bash install.sh
# 2️⃣ Activate it
conda activate pHoptNN
# 3️⃣ Verify installation
python -m torch.utils.collect_env
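If you want a quicker check from Python, a minimal dependency probe can confirm the key packages are importable (the package names `torch`, `torch_geometric`, and `rdkit` are assumptions based on the project's stack):

```python
# Quick dependency probe for the pHoptNN environment (a sketch; the package
# names below are assumptions about the project's dependencies).
import importlib.util

def check(pkg: str) -> str:
    """Return 'OK' if the package can be located, else 'MISSING'."""
    return "OK" if importlib.util.find_spec(pkg) is not None else "MISSING"

for pkg in ("torch", "torch_geometric", "rdkit"):
    print(f"{pkg}: {check(pkg)}")
```

Unlike `collect_env`, this only tests importability; it does not report CUDA or driver versions.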
Before using the Docker image, make sure you have:

- NVIDIA drivers: your GPU driver must be installed.
- Docker: Docker Desktop (Windows) or Docker Engine (Linux).
- NVIDIA Container Toolkit: included automatically in Docker Desktop on Windows; on Linux, run `sudo apt-get install nvidia-container-toolkit`.

Open your terminal (or PowerShell), navigate to the folder containing the code, and run:
# This downloads the image and opens a terminal inside it
docker run --gpus all --privileged --rm -it -v $(pwd):/app rajarshisinharoy/phoptnn:v1

Once inside, you will see a prompt like `(pHoptNN) root@...` and can run the scripts immediately. Happy predicting! If you get errors during the Docker installation, restarting Docker usually resolves them.
# 1. Configure the production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Update and Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 3. Configure Docker (Crucial Step!)
sudo nvidia-ctk runtime configure --runtime=docker
# 4. Restart Docker to apply changes
sudo systemctl restart docker

python phoptnn_interface.py /path/to/pdb_or_pdb_folder --save_dir /path/to/output_folder

# 1️⃣ Build dataset
python EGNN/create_pyg_dataset.py --cif_dir /data/cif --pqr_dir /data/pqr --root_dir /data/phoptnn_dataset
# 2️⃣ Train EGNN
python EGNN/train.py --root_dir /data/phoptnn_dataset/train --df_individuals hp_egnn_grid.csv --idx_individual 0 --save_models_path runs/checkpoints
# 3️⃣ Predict new structures
python phoptnn_interface.py /path/to/pdb_or_pdb_folder --save_dir /path/to/output_folder

For a more advanced run:
python EGNN/predict.py --input_path /data/new_pdbs --pqr_dir /data/pqr --model_weights runs/checkpoints/model_0.pt --train_csv_path /data/phoptnn_dataset/train/raw/train.csv --save_dir pred_out

This table provides a high-level overview of the key directories and scripts in the project.
| Directory / File | Description |
|---|---|
| Root | |
| `README.md` | This file, describing the project and layout. |
| `LICENSE` | The open-source license for the repository. |
| `Dockerfile` | Defines the Docker container for environment replication. |
| `environment.yml` | Conda environment file for setting up dependencies. |
| `phoptnn_interface.py` | Main script or interface for running the pHoptNN model. |
| `repository.txt` | A full, detailed tree of all files (for reference). |
| Core Model (EGNN) | |
| `EGNN/` | Main directory for the E(n) Equivariant GNN model. |
| `EGNN/models/egnn.py` | Core EGNN model definition. |
| `EGNN/train.py` | Main script for training the EGNN model (supports LDS, k-fold). |
| `EGNN/predict.py` | Script to run predictions and export attention weights. |
| `EGNN/create_pyg_dataset.py` | Script to build PyTorch Geometric datasets from CIF/PQR files. |
| `EGNN/attention2pdb.py` | Utility to map saved attention weights to PDB B-factors for visualization. |
| `EGNN/rsa_anlysis.py` | Script to analyze the relationship between attention, RSA, and active sites. |
| `EGNN/constants.py` | Project constants (e.g., atom and residue dictionaries). |
| `EGNN/weight/` | Stores trained model weights (`.pt` files). |
| `EGNN/qm9/` | Code related to the QM9 benchmark dataset. |
| Data | |
| `pyg_datasets_connected/` | Default directory for processed PyTorch Geometric datasets. |
| `pyg_datasets_connected/train/` | Processed training data. |
| `pyg_datasets_connected/test/` | Processed test data. |
| Structure Generation (AF3) | |
| `AF3_jobs/` | Directory for configuring, running, and analyzing AlphaFold 3 jobs. |
| `AF3_jobs/input/` | Contains `.slurm` batch scripts and `.json` inputs for running AF3. |
| `AF3_jobs/input_template/` | Data (MSAs, CIFs) and scripts to generate AF3 input templates. |
| `AF3_jobs/output/` | Example output directory from an AF3 run. |
| Analysis & Alternative Models | |
| `Analysis/` | Scripts and notebooks for analyzing model predictions and attention. |
| `Analysis/Glycosidasen_class/` | In-depth analysis for the "Glycosidasen" enzyme class, containing data, PDBs, PQR files, and results. |
| `GCN/` | Implementation of an alternative GCN model. |
| `A0A0A1C3U6.pdb` | An example PDB file, likely for quick testing. |
You’ll need:
- CIF files: `/data/cif/{uniprot_id}.cif`
- PQR files: `/data/pqr/{uniprot_id}.pqr`
- Training/Test CSVs:
  - `train.csv` with columns: `uniprot_id`, `ph_optimum`
  - `test.csv` with at least `uniprot_id`, `ph_optimum`
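As a sketch of the expected CSV layout, you can produce a minimal `train.csv` with the standard library (the UniProt IDs and pH values below are placeholders, not real training data):

```python
# Build a minimal train.csv in the layout create_pyg_dataset.py expects:
# columns uniprot_id and ph_optimum. Rows here are placeholder examples.
import csv
import io

rows = [("A0A0A1C3U6", 5.5), ("P00698", 6.2)]

buf = io.StringIO()  # swap for open("train.csv", "w", newline="") to write a file
writer = csv.writer(buf)
writer.writerow(["uniprot_id", "ph_optimum"])
writer.writerows(rows)
print(buf.getvalue())
```

Each `uniprot_id` must have a matching `{uniprot_id}.cif` and `{uniprot_id}.pqr` in the directories above.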
Run:
python EGNN/create_pyg_dataset.py --cif_dir /data/cif --pqr_dir /data/pqr --root_dir /data/phoptnn_dataset --batch_size 1 --l_max -1

This will generate:
/data/phoptnn_dataset/train/egnn_train_dataset_bs_1.pt
/data/phoptnn_dataset/test/egnn_test_dataset_bs_1.pt
Train your EGNN model with:
python EGNN/train.py --root_dir /data/phoptnn_dataset/train --df_individuals hp_egnn_grid.csv --idx_individual 0 --early_stopping_patience 50 --losses_csv_path runs/losses --save_models_path runs/checkpoints

- Supports Label Distribution Smoothing (LDS) and weighted loss
- Automatically saves best model checkpoint
- Logs training metrics to CSV
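Since metrics are logged to CSV, you can inspect a run with the standard library. A sketch, assuming columns named `epoch`, `train_loss`, and `val_loss` (check the actual header written under `runs/losses`):

```python
# Find the best epoch in a training log. The column names (epoch, train_loss,
# val_loss) are assumptions about the CSV written by --losses_csv_path.
import csv
import io

log = """epoch,train_loss,val_loss
1,1.20,1.35
2,0.95,1.10
3,0.88,1.15
"""  # inline stand-in; use open("runs/losses/...csv") on a real log

rows = list(csv.DictReader(io.StringIO(log)))
best = min(rows, key=lambda r: float(r["val_loss"]))
print(f"best epoch: {best['epoch']} (val_loss={best['val_loss']})")
# → best epoch: 2 (val_loss=1.10)
```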
python EGNN/predict.py --input_path /path/to/pdb_or_folder --model_weights weight/W_6_attn.pt --params_csv EGNN/hyperparameters/Best_hp.csv --idx_individual 6 --train_csv_path /data/phoptnn_dataset/train/raw/train.csv --y_mean 7.1956 --y_std 1.2302 --pqr_dir ./pqr_files --save_dir ./pred_out --att-export node --node-agg mean --charge-power 2

Outputs:
- pred_out/predictions.csv
- pred_out/{pdb}_attention.csv
- pred_out/{pdb}_edge_attention.csv
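A residue-level aggregation like the `residue_max` mode used downstream can be sketched roughly as follows; the column names (`residue_id`, `atom_index`, `attention`) are assumptions about the attention export format, so adapt them to the actual CSV header:

```python
# Aggregate per-atom attention to per-residue scores (max over atoms).
# Column names and values below are assumed/illustrative, not the real schema.
import csv
import io
from collections import defaultdict

data = """residue_id,atom_index,attention
ASP52,1,0.10
ASP52,2,0.80
GLU35,3,0.55
"""  # stand-in for pred_out/{pdb}_attention.csv

per_residue: dict[str, float] = defaultdict(float)
for row in csv.DictReader(io.StringIO(data)):
    rid = row["residue_id"]
    per_residue[rid] = max(per_residue[rid], float(row["attention"]))

for res, score in sorted(per_residue.items(), key=lambda kv: -kv[1]):
    print(res, score)
```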
python attention2pdb.py --pdb protein.pdb --attention_csv pred_out/protein_attention.csv --out_pdb protein_attention.pdb --mode auto --aggregate residue_max

Then open protein_attention.pdb in PyMOL or Chimera and color by B-factor to visualize important residues.
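If you want to inspect the mapped scores without a molecular viewer, the B-factor column of the output PDB can be read directly, since ATOM records are fixed-width (temperature factor in columns 61–66). A minimal sketch with a made-up two-atom example:

```python
# Read B-factors (here holding attention scores) from ATOM records of a PDB.
# The two ATOM lines below are a made-up example, not real coordinates.
pdb_text = """\
ATOM      1  N   ASP A  52      11.104  13.207   2.100  1.00  0.82           N
ATOM      2  CA  ASP A  52      12.560  13.300   2.300  1.00  0.82           C
"""

def bfactors(text):
    """Yield (residue, b_factor) for each ATOM record (fixed-width columns)."""
    for line in text.splitlines():
        if line.startswith("ATOM"):
            resname = line[17:20].strip()   # columns 18-20: residue name
            resseq = line[22:26].strip()    # columns 23-26: residue number
            yield f"{resname}{resseq}", float(line[60:66])  # columns 61-66

print(sorted(set(bfactors(pdb_text))))
# → [('ASP52', 0.82)]
```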
python rsa_anlysis.py --pdb_dir /path/to/pdbs --att_dir ./pred_out --out_dir ./rsa_out --active_csv active_site_map.csv --dist_bins "0:2.5:10" --rsa_bins "0:0.05:1.0"

Generates CSVs and plots showing how attention relates to surface accessibility and active-site proximity.
- Atoms & residues are defined in `constants.py`
- Hydrogens are excluded automatically
- Convert PDB to PQR files using `pdb2pqr` with the AMBER force field for better performance
- Edge features:
  - 5-dim RDKit bond vector + ring flag
  - Fallback: distance-bin encoding
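The distance-bin fallback amounts to a one-hot encoding over fixed distance ranges; a sketch where the bin edges are illustrative placeholders, not the project's actual values:

```python
# One-hot distance-bin encoding for an edge, a fallback when no RDKit bond
# features exist. Bin edges (in Å) are illustrative, not the real values.
BIN_EDGES = [2.0, 3.0, 4.0, 5.0, 6.0]  # 6 bins: <2, 2-3, 3-4, 4-5, 5-6, >=6

def distance_bin_onehot(d: float) -> list[int]:
    idx = sum(d >= edge for edge in BIN_EDGES)  # index of the bin holding d
    return [1 if i == idx else 0 for i in range(len(BIN_EDGES) + 1)]

print(distance_bin_onehot(3.4))  # 3-4 Å bin → [0, 0, 1, 0, 0, 0]
```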
If you use pHoptNN in your research, please cite:
This project is licensed under the MIT License.
See the LICENSE file for details.
Developed as part of the pHoptNN project — integrating EGNN-based geometric learning for enzyme property prediction.