Predicting Protein-Carbohydrate Binding Sites: A Deep Learning Approach Integrating Protein Language Model Embeddings and Structural Features
Here, we present DeepCPBSite, a novel ensemble model that combines three separate models, each trained with a different approach (random undersampling, weighted oversampling, and class-weighted loss) on a ResNet+FNN deep learning architecture. The framework for this architecture is given below:


The training set, independent set, TS53 set, and TS37 set are provided in the Dataset folder.
OS: Pop!_OS 22.04 LTS
Python version: Python 3.9.19
Used libraries:
numpy==1.26.4
pandas==2.2.1
pytorch==2.4.1
xgboost==2.0.3
pickle5==0.0.11
scikit-learn==1.2.2
matplotlib==3.8.2
PyQt5==5.15.10
imblearn==0.0
skops==0.9.0
shap==0.45.1
IPython==8.18.1
tqdm==4.66.5
biopython==1.84
transformers==4.44.2
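To compare a local environment against the versions listed above, a minimal sketch along the following lines may help. The helper names `parse_pins` and `check_pins` are hypothetical and not part of this repository. The names are copied verbatim from the list, and a package can be registered under a different distribution name (PyTorch's pip distribution, for example, is `torch`), so a "not found" result is not proof that the library is missing.

```python
from importlib import metadata

# Pinned versions copied from the list above (shortened here for illustration).
PINS = """
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.2.2
biopython==1.84
transformers==4.44.2
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip blank or malformed lines
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins, lookup=metadata.version):
    """Return {name: (wanted, found)}; found is None if the name is not installed."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(parse_pins(PINS)).items():
        status = "ok" if found == wanted else "differs"
        print(f"{name}: wanted {wanted}, found {found} ({status})")
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment.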
- First, download all features. Read the readme.txt in the all_features folder.
- Reproducible code is provided, along with training and prediction scripts.
- To reproduce the results of a table, navigate to the generation folder for the corresponding table number. Before running, update the feature_path variables inside the Python files.
- Transformers and PyTorch are needed for extracting the embeddings.
- For further questions, you can visit the following GitHub repositories:
- You need to install DSSP to generate the structural features from PDB files:
sudo apt-get install dssp
- First, fill in dataset.txt, following the pattern shown below:
>Protein_id
Fasta
- To predict protein-carbohydrate binding sites from a protein sequence, run extractFeatures.py to generate the features, then run predict_with_struct.py for prediction with structural features or predict_without_struct.py for prediction without them.
- To run predict_with_struct.py, you need to provide the PDB file for the query protein sequence. To generate an ESMFold or AlphaFold PDB file, you can use ColabFold.
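The prediction steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's own tooling: the helpers `find_dssp`, `write_dataset`, and `build_commands` are hypothetical, the example protein ID and sequence are made up, each record is assumed to take exactly two lines in dataset.txt, and the scripts are assumed to run without command-line arguments (check the scripts themselves for any paths they expect). The DSSP executable is looked up under both `mkdssp` and `dssp`, since the installed name can vary.

```python
import shutil
import sys
from pathlib import Path


def find_dssp(candidates=("mkdssp", "dssp")):
    """Return the path of the first DSSP executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def write_dataset(records, path="dataset.txt"):
    """Write (protein_id, sequence) pairs as '>id' / sequence line pairs."""
    lines = []
    for protein_id, sequence in records:
        sequence = "".join(sequence.split()).upper()  # drop whitespace and line breaks
        if not protein_id or any(ch.isspace() for ch in protein_id):
            raise ValueError(f"invalid protein id: {protein_id!r}")
        if not sequence.isalpha():
            raise ValueError(f"invalid sequence for {protein_id}")
        lines += [f">{protein_id}", sequence]
    Path(path).write_text("\n".join(lines) + "\n")
    return Path(path)


def build_commands(pdb_available):
    """List the scripts to run, in order (assumption: no arguments are needed)."""
    predictor = "predict_with_struct.py" if pdb_available else "predict_without_struct.py"
    return [[sys.executable, "extractFeatures.py"], [sys.executable, predictor]]


if __name__ == "__main__":
    # Made-up example record; replace with the real protein ID and sequence.
    write_dataset([("example_protein", "MKTAYIAKQRQISFVKSHFSRQ")])
    have_pdb = False  # set to True once a PDB file for the query is available
    if have_pdb and find_dssp() is None:
        print("DSSP not found on PATH; install it before using structural features.")
    for command in build_commands(have_pdb):
        # Pass each list to subprocess.run(command, check=True) to execute it.
        print(" ".join(command))
```

Printing the commands rather than running them keeps the sketch safe to try before the features and scripts are in place.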
In Prev_Papers and Prev_Papers_ESMFold, scripts are provided for reproducing the results of previous papers. We have also included the probabilities produced by their scripts for the TS53 set.