Predict enzyme pH optima directly from 3D protein structures using an Equivariant Graph Neural Network (EGNN).
Includes dataset generation, training, prediction with attention export, and visualization & analysis tools.
- EGNN model with attention and edge features (`egnn.py`)
- Dataset builder from CIF/PQR → PyTorch Geometric tensors
- Training pipeline (LDS, weighted losses, k-fold)
- Prediction scripts for new proteins (auto-align with checkpoint)
- Attention visualization (mapped to PDB B-factors)
- RSA and distance analysis
# 1️⃣ Create the environment
bash install.sh
# 2️⃣ Activate it
conda activate pHoptNN
# 3️⃣ Verify installation
python -m torch.utils.collect_env
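If you want a quicker check from Python, a minimal dependency probe can confirm the key packages are importable (the package names `torch`, `torch_geometric`, and `rdkit` are assumptions based on the project's stack):

```python
# Quick dependency probe for the pHoptNN environment (a sketch; the package
# names below are assumptions about the project's dependencies).
import importlib.util

def check(pkg: str) -> str:
    """Return 'OK' if the package can be located, else 'MISSING'."""
    return "OK" if importlib.util.find_spec(pkg) is not None else "MISSING"

for pkg in ("torch", "torch_geometric", "rdkit"):
    print(f"{pkg}: {check(pkg)}")
```

Unlike `collect_env`, this only tests importability; it does not report CUDA or driver versions.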
Before using the Docker image, make sure you have:

- NVIDIA drivers: your GPU driver must be installed.
- Docker: Docker Desktop (Windows) or Docker Engine (Linux).
- NVIDIA Container Toolkit: included automatically in Docker Desktop on Windows; on Linux, run `sudo apt-get install nvidia-container-toolkit`.

Open your terminal (or PowerShell), navigate to the folder containing the code, and run:
# This downloads the image and opens a terminal inside it
docker run --gpus all --privileged --rm -it -v $(pwd):/app rajarshisinharoy/phoptnn:v1

Once inside, you will see a prompt like `(pHoptNN) root@...` and can run the scripts immediately. Happy predicting! If you get errors during the Docker installation, restarting Docker usually resolves them.
# 1. Configure the production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Update and Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 3. Configure Docker (Crucial Step!)
sudo nvidia-ctk runtime configure --runtime=docker
# 4. Restart Docker to apply changes
sudo systemctl restart docker

python phoptnn_interface.py /path/to/pdb_or_pdb_folder --save_dir /path/to/output_folder

# 1️⃣ Build dataset
python EGNN/create_pyg_dataset.py --cif_dir /data/cif --pqr_dir /data/pqr --root_dir /data/phoptnn_dataset
# 2️⃣ Train EGNN
python EGNN/train.py --root_dir /data/phoptnn_dataset/train --df_individuals hp_egnn_grid.csv --idx_individual 0 --save_models_path runs/checkpoints
# 3️⃣ Predict new structures
python phoptnn_interface.py /path/to/pdb_or_pdb_folder --save_dir /path/to/output_folder

For a more advanced run:
python EGNN/predict.py --input_path /data/new_pdbs --pqr_dir /data/pqr --model_weights runs/checkpoints/model_0.pt --train_csv_path /data/phoptnn_dataset/train/raw/train.csv --save_dir pred_out

This table provides a high-level overview of the key directories and scripts in the project.
| Directory / File | Description |
|---|---|
| Root | |
| `README.md` | This file, describing the project and layout. |
| `LICENSE` | The open-source license for the repository. |
| `Dockerfile` | Defines the Docker container for environment replication. |
| `environment.yml` | Conda environment file for setting up dependencies. |
| `phoptnn_interface.py` | Main script or interface for running the pHoptNN model. |
| `repository.txt` | A full, detailed tree of all files (for reference). |
| Core Model (EGNN) | |
| `EGNN/` | Main directory for the E(n) Equivariant GNN model. |
| `EGNN/models/egnn.py` | Core EGNN model definition. |
| `EGNN/train.py` | Main script for training the EGNN model (supports LDS, k-fold). |
| `EGNN/predict.py` | Script to run predictions and export attention weights. |
| `EGNN/create_pyg_dataset.py` | Script to build PyTorch Geometric datasets from CIF/PQR files. |
| `EGNN/attention2pdb.py` | Utility to map saved attention weights to PDB B-factors for visualization. |
| `EGNN/rsa_anlysis.py` | Script to analyze the relationship between attention, RSA, and active sites. |
| `EGNN/constants.py` | Project constants (e.g., atom and residue dictionaries). |
| `EGNN/weight/` | Stores trained model weights (`.pt` files). |
| `EGNN/qm9/` | Code related to the QM9 benchmark dataset. |
| Data | |
| `pyg_datasets_connected/` | Default directory for processed PyTorch Geometric datasets. |
| `pyg_datasets_connected/train/` | Processed training data. |
| `pyg_datasets_connected/test/` | Processed test data. |
| Structure Generation (AF3) | |
| `AF3_jobs/` | Directory for configuring, running, and analyzing AlphaFold 3 jobs. |
| `AF3_jobs/input/` | Contains `.slurm` batch scripts and `.json` inputs for running AF3. |
| `AF3_jobs/input_template/` | Data (MSAs, CIFs) and scripts to generate AF3 input templates. |
| `AF3_jobs/output/` | Example output directory from an AF3 run. |
| Analysis & Alternative Models | |
| `Analysis/` | Scripts and notebooks for analyzing model predictions and attention. |
| `Analysis/Glycosidasen_class/` | In-depth analysis for the "Glycosidasen" enzyme class, containing data, PDBs, PQR files, and results. |
| `GCN/` | Implementation of an alternative GCN model. |
| `A0A0A1C3U6.pdb` | An example PDB file, likely for quick testing. |
You’ll need:
- CIF files: `/data/cif/{uniprot_id}.cif`
- PQR files: `/data/pqr/{uniprot_id}.pqr`
- Training/Test CSVs:
  - `train.csv` with columns: `uniprot_id`, `ph_optimum`
  - `test.csv` with at least `uniprot_id`, `ph_optimum`
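As a sketch of the expected CSV layout, you can produce a minimal `train.csv` with the standard library (the UniProt IDs and pH values below are placeholders, not real training data):

```python
# Build a minimal train.csv in the layout create_pyg_dataset.py expects:
# columns uniprot_id and ph_optimum. Rows here are placeholder examples.
import csv
import io

rows = [("A0A0A1C3U6", 5.5), ("P00698", 6.2)]

buf = io.StringIO()  # swap for open("train.csv", "w", newline="") to write a file
writer = csv.writer(buf)
writer.writerow(["uniprot_id", "ph_optimum"])
writer.writerows(rows)
print(buf.getvalue())
```

Each `uniprot_id` must have a matching `{uniprot_id}.cif` and `{uniprot_id}.pqr` in the directories above.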
Run:
python EGNN/create_pyg_dataset.py --cif_dir /data/cif --pqr_dir /data/pqr --root_dir /data/phoptnn_dataset --batch_size 1 --l_max -1

This will generate:
/data/phoptnn_dataset/train/egnn_train_dataset_bs_1.pt
/data/phoptnn_dataset/test/egnn_test_dataset_bs_1.pt
Train your EGNN model with:
python EGNN/train.py --root_dir /data/phoptnn_dataset/train --df_individuals hp_egnn_grid.csv --idx_individual 0 --early_stopping_patience 50 --losses_csv_path runs/losses --save_models_path runs/checkpoints

- Supports Label Distribution Smoothing (LDS) and weighted loss
- Automatically saves best model checkpoint
- Logs training metrics to CSV
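Since metrics are logged to CSV, you can inspect a run with the standard library. A sketch, assuming columns named `epoch`, `train_loss`, and `val_loss` (check the actual header written under `runs/losses`):

```python
# Find the best epoch in a training log. The column names (epoch, train_loss,
# val_loss) are assumptions about the CSV written by --losses_csv_path.
import csv
import io

log = """epoch,train_loss,val_loss
1,1.20,1.35
2,0.95,1.10
3,0.88,1.15
"""  # inline stand-in; use open("runs/losses/...csv") on a real log

rows = list(csv.DictReader(io.StringIO(log)))
best = min(rows, key=lambda r: float(r["val_loss"]))
print(f"best epoch: {best['epoch']} (val_loss={best['val_loss']})")
# → best epoch: 2 (val_loss=1.10)
```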
python EGNN/predict.py --input_path /path/to/pdb_or_folder --model_weights weight/W_6_attn.pt --params_csv EGNN/hyperparameters/Best_hp.csv --idx_individual 6 --train_csv_path /data/phoptnn_dataset/train/raw/train.csv --y_mean 7.1956 --y_std 1.2302 --pqr_dir ./pqr_files --save_dir ./pred_out --att-export node --node-agg mean --charge-power 2

Outputs:
- pred_out/predictions.csv
- pred_out/{pdb}_attention.csv
- pred_out/{pdb}_edge_attention.csv
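A residue-level aggregation like the `residue_max` mode used downstream can be sketched roughly as follows; the column names (`residue_id`, `atom_index`, `attention`) are assumptions about the attention export format, so adapt them to the actual CSV header:

```python
# Aggregate per-atom attention to per-residue scores (max over atoms).
# Column names and values below are assumed/illustrative, not the real schema.
import csv
import io
from collections import defaultdict

data = """residue_id,atom_index,attention
ASP52,1,0.10
ASP52,2,0.80
GLU35,3,0.55
"""  # stand-in for pred_out/{pdb}_attention.csv

per_residue: dict[str, float] = defaultdict(float)
for row in csv.DictReader(io.StringIO(data)):
    rid = row["residue_id"]
    per_residue[rid] = max(per_residue[rid], float(row["attention"]))

for res, score in sorted(per_residue.items(), key=lambda kv: -kv[1]):
    print(res, score)
```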
python attention2pdb.py --pdb protein.pdb --attention_csv pred_out/protein_attention.csv --out_pdb protein_attention.pdb --mode auto --aggregate residue_max

Then open protein_attention.pdb in PyMOL or Chimera and color by B-factor to visualize important residues.
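If you want to inspect the mapped scores without a molecular viewer, the B-factor column of the output PDB can be read directly, since ATOM records are fixed-width (temperature factor in columns 61–66). A minimal sketch with a made-up two-atom example:

```python
# Read B-factors (here holding attention scores) from ATOM records of a PDB.
# The two ATOM lines below are a made-up example, not real coordinates.
pdb_text = """\
ATOM      1  N   ASP A  52      11.104  13.207   2.100  1.00  0.82           N
ATOM      2  CA  ASP A  52      12.560  13.300   2.300  1.00  0.82           C
"""

def bfactors(text):
    """Yield (residue, b_factor) for each ATOM record (fixed-width columns)."""
    for line in text.splitlines():
        if line.startswith("ATOM"):
            resname = line[17:20].strip()   # columns 18-20: residue name
            resseq = line[22:26].strip()    # columns 23-26: residue number
            yield f"{resname}{resseq}", float(line[60:66])  # columns 61-66

print(sorted(set(bfactors(pdb_text))))
# → [('ASP52', 0.82)]
```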
python rsa_anlysis.py --pdb_dir /path/to/pdbs --att_dir ./pred_out --out_dir ./rsa_out --active_csv active_site_map.csv --dist_bins "0:2.5:10" --rsa_bins "0:0.05:1.0"

Generates CSVs and plots showing how attention relates to surface accessibility and active-site proximity.
- Atoms & residues are defined in `constants.py`
- Hydrogens are excluded automatically
- Convert PDB to PQR files using `pdb2pqr` with the AMBER force field for better performance
- Edge features:
  - 5-dim RDKit bond vector + ring flag
  - Fallback: distance-bin encoding
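The distance-bin fallback amounts to a one-hot encoding over fixed distance ranges; a sketch where the bin edges are illustrative placeholders, not the project's actual values:

```python
# One-hot distance-bin encoding for an edge, a fallback when no RDKit bond
# features exist. Bin edges (in Å) are illustrative, not the real values.
BIN_EDGES = [2.0, 3.0, 4.0, 5.0, 6.0]  # 6 bins: <2, 2-3, 3-4, 4-5, 5-6, >=6

def distance_bin_onehot(d: float) -> list[int]:
    idx = sum(d >= edge for edge in BIN_EDGES)  # index of the bin holding d
    return [1 if i == idx else 0 for i in range(len(BIN_EDGES) + 1)]

print(distance_bin_onehot(3.4))  # 3-4 Å bin → [0, 0, 1, 0, 0, 0]
```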
If you use pHoptNN in your research, please cite:
This project is licensed under the MIT License.
See the LICENSE file for details.
Developed as part of the pHoptNN project — integrating EGNN-based geometric learning for enzyme property prediction.