EzHit is a lightweight enzyme-reaction retrieval framework for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.
Given an enzyme amino-acid sequence and a reaction SMILES, EzHit estimates whether the enzyme is likely to catalyze the reaction. EzHit supports online prediction, custom fine-tuning, local training, local inference, and uncertainty-aware inference using ensemble prediction and Mahalanobis-distance-based distribution assessment.
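To make the two inputs concrete, here is a minimal sketch of a format check for an enzyme-reaction pair. It assumes the common `reactants>agents>products` reaction SMILES layout and the 20 standard amino-acid letters. The helper names are hypothetical and are not part of EzHit, and the check is purely syntactic (it does not verify chemistry).

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_standard_protein_sequence(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    sequence = sequence.strip().upper()
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


def split_reaction_smiles(rxn: str):
    """Naively split 'reactants>agents>products' into three lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got {rxn!r}")
    # Molecules on each side are separated by '.'; an empty agents field is allowed.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    if not reactants or not products:
        raise ValueError("a reaction needs at least one reactant and one product")
    return reactants, agents, products


if __name__ == "__main__":
    print(is_standard_protein_sequence("MKKLLPTAAAGLLLLAAQPAMA"))
    print(split_reaction_smiles("CCO>>CC=O"))
```

Running a check like this before submitting a large batch can catch truncated sequences or malformed reaction strings early.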
EzHit can be used directly through the HuggingFace Space:
Try EzHit on HuggingFace Space
The web interface allows users to submit an enzyme sequence and a reaction SMILES and obtain prediction results through an interactive interface.
Users can fine-tune EzHit on their own enzyme-reaction datasets using the provided Google Colab notebook.
The Colab notebook provides a ready-to-run workflow for:
- uploading a custom enzyme-reaction dataset
- uploading a pretrained EzHit checkpoint
- generating required feature caches
- fine-tuning EzHit on user-provided data
- evaluating the fine-tuned model
- exporting the fine-tuned checkpoint
- exporting train_distribution_stat.pt for Mahalanobis-distance inference
After fine-tuning, the exported checkpoint and train_distribution_stat.pt can be used for customized prediction in the HuggingFace Space or local inference scripts.
Create and activate a conda environment with Python 3.8:
conda create -n ezhit python=3.8 -y
conda activate ezhit

Clone this repository:
git clone https://github.com/ld139/EzHit.git
cd EzHit

Install the required dependencies inside the ezhit environment:
pip install -r requirements.txt

| Resource | Link |
|---|---|
| HuggingFace Space | Enzyme-Catalysis-Predictor |
| Colab notebook | EZHit_FineTune_Colab.ipynb |
| Pretrained EZHit checkpoints | deanluo/EzHit |
| Large-scale screening results | deanluo/EzHit-large-scale-screening |
| Full training data | Zenodo |
To download released checkpoints after installing requirements.txt:
hf download deanluo/EzHit \
--include "checkpoints/*.pt" \
--local-dir .

The checkpoint files will be placed under:
checkpoints/
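After downloading, it can help to confirm that the ensemble checkpoints are actually present. The sketch below assumes the file-name pattern `binarycls_best_val_seed<N>.pt` and the seeds 40 to 44 that appear in the inference commands of this README. Whether the release contains exactly these files is an assumption, and both helper names are hypothetical.

```python
from pathlib import Path


def ensemble_checkpoint_paths(seeds, ckpt_dir="checkpoints"):
    """Build the expected checkpoint path for each ensemble seed (assumed naming)."""
    return [Path(ckpt_dir) / f"binarycls_best_val_seed{seed}.pt" for seed in seeds]


def missing_checkpoints(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [path for path in paths if not path.is_file()]


if __name__ == "__main__":
    expected = ensemble_checkpoint_paths(range(40, 45))
    missing = missing_checkpoints(expected)
    if missing:
        print("Missing checkpoints:", ", ".join(str(path) for path in missing))
    else:
        print("All expected checkpoints found.")
```

The resulting paths can be passed to predict.py through --ckpt_paths.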
EzHit can be trained locally using train_bn_kan.py.
Example command:
python train_bn_kan.py \
--train_csv data/train.csv \
--val_csv data/valid.csv \
--test_csv data/Enzyme-405.csv \
--train_esm_cache caches/train_prott5_bf16.pt \
--val_esm_cache caches/valid_prott5_bf16.pt \
--test_esm_cache caches/Enzyme-405_bf16.pt \
--train_drfp_cache caches/train_drfp.pt \
--val_drfp_cache caches/valid_drfp.pt \
--test_drfp_cache caches/Enzyme-405_drfp.pt \
--train_reactant_cache caches/train_reactant_2048.pt \
--val_reactant_cache caches/valid_reactant_2048.pt \
--test_reactant_cache caches/Enzyme-405_reactant_2048.pt \
--group_key CANO_RXN_SMILES \
--rxn_col CANO_RXN_SMILES \
--label_col Label \
--standardize_smiles \
--uncharge \
--filter_charged_single_atom \
--fp_dim 2048 \
--reactant_fp_dim 2048 \
--hidden 512 \
--dropout 0.4 \
--epochs 30 \
--batch_size 2048 \
--lr 1e-4 \
--patience 5 \
--select_metric top10 \
--pos_weight auto \
--seed 44

EzHit can also be fine-tuned locally using finetune_bn.py.
Example command:
python finetune_bn.py \
--pretrained_ckpt checkpoints/binarycls_best_val_seed40.pt \
--train_csv data/train.csv \
--val_csv data/val.csv \
--test_csv data/test.csv \
--train_esm_cache caches/train_protein.pt \
--val_esm_cache caches/val_protein.pt \
--test_esm_cache caches/test_protein.pt \
--train_drfp_cache caches/train_drfp.pt \
--val_drfp_cache caches/val_drfp.pt \
--test_drfp_cache caches/test_drfp.pt \
--train_reactant_cache caches/train_reactant_2048.pt \
--val_reactant_cache caches/val_reactant_2048.pt \
--test_reactant_cache caches/test_reactant_2048.pt \
--rxn_col CANO_RXN_SMILES \
--group_key CANO_RXN_SMILES \
--label_col Label \
--epochs 15 \
--batch_size 512 \
--lr 1e-4 \
--pos_weight auto \
--save_best_path checkpoints/ezhit_finetuned.pt

Local inference is provided through predict.py.
The script can run either:
- single-pair prediction from command-line inputs, or
- batch prediction from a CSV file.
It automatically computes:
- ProtT5 protein embeddings
- DRFP reaction fingerprints
- reactant Morgan fingerprints
- ensemble prediction probability
- ensemble uncertainty
- optional Mahalanobis distance if train_distribution_stat.pt is provided
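For batch mode, the input CSV can also be generated programmatically. The sketch below uses only the Python standard library and the column names protein_sequence and CANO_RXN_SMILES that this README passes to --seq_col and --rxn_col. The function name is hypothetical and not part of EzHit.

```python
import csv
import io


def pairs_to_csv_text(pairs, seq_col="protein_sequence", rxn_col="CANO_RXN_SMILES"):
    """Serialize (sequence, reaction SMILES) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([seq_col, rxn_col])
    for sequence, rxn_smiles in pairs:
        # Strip stray whitespace; the csv module quotes fields when required.
        writer.writerow([sequence.strip(), rxn_smiles.strip()])
    return buffer.getvalue()


if __name__ == "__main__":
    demo_pairs = [
        ("MKKLLPTAAAGLLLLAAQPAMA", "CCO>>CC=O"),
    ]
    print(pairs_to_csv_text(demo_pairs), end="")
```

Writing the returned text to a file yields an input suitable for --input_csv.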
python predict.py \
--protein_sequence "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG" \
--rxn_smiles "CCO>>CC=O" \
--model_group general \
--seeds 40 41 42 43 44 \
--standardize_smiles \
--uncharge \
--filter_charged_single_atom \
--output_csv results/single_prediction.csv

Prepare a CSV file:
protein_sequence,CANO_RXN_SMILES
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O

Run prediction:
python predict.py \
--input_csv examples/predict_demo.csv \
--seq_col protein_sequence \
--rxn_col CANO_RXN_SMILES \
--model_group general \
--seeds 40 41 42 43 44 \
--standardize_smiles \
--uncharge \
--filter_charged_single_atom \
--output_csv results/predict_demo_output.csv

By default, predict.py downloads checkpoints from deanluo/EzHit. To use local checkpoints:
python predict.py \
--input_csv examples/predict_demo.csv \
--ckpt_paths checkpoints/binarycls_best_val_seed40.pt checkpoints/binarycls_best_val_seed41.pt checkpoints/binarycls_best_val_seed42.pt checkpoints/binarycls_best_val_seed43.pt checkpoints/binarycls_best_val_seed44.pt \
--standardize_smiles \
--uncharge \
--filter_charged_single_atom \
--output_csv results/predict_demo_output.csv

If a matching train_distribution_stat.pt file is available, pass it through --maha_stat_path:
python predict.py \
--input_csv examples/predict_demo.csv \
--ckpt_paths checkpoints/binarycls_best_val_seed40.pt checkpoints/binarycls_best_val_seed41.pt checkpoints/binarycls_best_val_seed42.pt checkpoints/binarycls_best_val_seed43.pt checkpoints/binarycls_best_val_seed44.pt \
--maha_stat_path train_distribution_stat.pt \
--standardize_smiles \
--uncharge \
--filter_charged_single_atom \
--output_csv results/predict_demo_with_maha.csv

The output CSV contains:
| Column | Description |
|---|---|
| EZHit_probability | Mean ensemble match probability |
| EZHit_probability_std | Standard deviation across ensemble checkpoints |
| EZHit_ensemble_MI | Mutual-information-based ensemble uncertainty |
| Mahalanobis_Dist | Optional latent-space Mahalanobis distance |
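To illustrate how the three ensemble columns relate to each other, the sketch below computes a mean, a standard deviation, and a mutual-information score from per-checkpoint probabilities. It uses the standard decomposition for a binary classifier (entropy of the mean prediction minus the mean per-member entropy). Whether predict.py uses exactly this formula, natural logarithms, or the population standard deviation is an assumption, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def ensemble_summary(probs):
    """Return (mean, population std, mutual information) for member probabilities."""
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n)
    # Mutual information: total uncertainty minus average per-member uncertainty.
    mutual_info = binary_entropy(mean) - sum(binary_entropy(p) for p in probs) / n
    return mean, std, max(mutual_info, 0.0)  # clamp tiny negative rounding errors


if __name__ == "__main__":
    agreeing = [0.91, 0.90, 0.92, 0.89, 0.90]
    disagreeing = [0.95, 0.10, 0.85, 0.20, 0.60]
    print("agreeing:   ", ensemble_summary(agreeing))
    print("disagreeing:", ensemble_summary(disagreeing))
```

Under this definition, members that agree give a mutual information near zero even when the shared prediction is itself uncertain, while disagreement between members raises it.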
The Colab fine-tuning workflow exports the following key files:
| File | Description |
|---|---|
| ezhit_finetuned_seed42.pt | Fine-tuned EZHit model checkpoint |
| train_distribution_stat.pt | Training-distribution statistics for Mahalanobis-distance inference |
| val_predictions.csv | Prediction results on the validation set |
| test_predictions.csv | Prediction results on the test set |
The two most important files for customized inference are:
- ezhit_finetuned_seed42.pt
- train_distribution_stat.pt
The checkpoint stores the fine-tuned model weights. The train_distribution_stat.pt file stores latent-space statistics used for Mahalanobis-distance calculation.
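For reference, the Mahalanobis distance of a latent vector x from a distribution with mean μ and covariance Σ is sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)). The sketch below implements this definition in plain Python for small vectors. It assumes the mean and the inverse covariance are already available as lists. The actual layout of train_distribution_stat.pt and the way predict.py loads it are not described here, so the inputs below are illustrative only.

```python
import math


def mahalanobis_distance(x, mean, cov_inv):
    """Mahalanobis distance of x from a distribution given its mean and inverse covariance."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Matrix-vector product: cov_inv @ diff
    weighted = [sum(row[j] * diff[j] for j in range(len(diff))) for row in cov_inv]
    squared = sum(d * w for d, w in zip(diff, weighted))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding errors


if __name__ == "__main__":
    # Illustrative 2-D statistics (not taken from any EzHit file).
    mean = [0.0, 0.0]
    cov_inv = [[0.25, 0.0], [0.0, 1.0]]  # inverse of diag(4, 1)
    print(mahalanobis_distance([2.0, 0.0], mean, cov_inv))
    print(mahalanobis_distance([0.0, 2.0], mean, cov_inv))
```

With an identity covariance the distance reduces to the Euclidean distance. Directions with larger training variance contribute less, which is why the same offset can be in-distribution along one axis and out-of-distribution along another.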
EzHit reports prediction results that may include:
| Output | Description |
|---|---|
| Match probability | Predicted enzyme-reaction compatibility score |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the training distribution |
A typical interpretation is:
| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |
The thresholds should be adjusted based on the dataset, model version, and validation results.
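The interpretation table can be turned into a simple triage rule, sketched below. Both default thresholds are hypothetical placeholders chosen only for illustration; they are not recommended values and should be replaced with cutoffs calibrated on your own validation results.

```python
def triage(probability, maha_dist, prob_threshold=0.5, dist_threshold=10.0):
    """Map a (probability, Mahalanobis distance) pair to the interpretation table.

    prob_threshold and dist_threshold are hypothetical placeholders, not EzHit defaults.
    """
    high_probability = probability >= prob_threshold
    high_distance = maha_dist > dist_threshold
    if high_probability and not high_distance:
        return "High-priority candidate"
    if high_probability and high_distance:
        return "Potentially useful but less reliable or out-of-distribution"
    if not high_probability and not high_distance:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


if __name__ == "__main__":
    examples = [(0.92, 3.1), (0.88, 25.0), (0.05, 2.4), (0.10, 40.0)]
    for probability, distance in examples:
        print(f"p={probability:.2f} d={distance:5.1f} -> {triage(probability, distance)}")
```

A rule like this can be applied to each row of the output CSV using the EZHit_probability and Mahalanobis_Dist columns.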
If you use EzHit in your research, please cite:
@article{ezhit,
title = {Accurate and large-scale enzyme-reaction retrieval with sequence information},
author = {Ding Luo and Binju Wang},
journal = {Under Review},
year = {2026},
}

This project is released under the MIT License.
For questions or issues, please contact the developers through GitHub Issues.
