Predicting Protein-Carbohydrate Binding Sites: A Deep Learning Approach Integrating Protein Language Model Embeddings and Structural Features

DeepCPBSite is a novel ensemble model that combines three separate models, each trained with a different approach to class imbalance (random undersampling, weighted oversampling, and class-weighted loss) and each based on a ResNet+FNN deep learning architecture. The framework for this architecture is given below:

Data availability

The training set, independent set, TS53 set, and TS37 set are provided in the Dataset folder.

Environments

OS: Pop!_OS 22.04 LTS

Python version: Python 3.9.19

Used libraries:

numpy==1.26.4
pandas==2.2.1
pytorch==2.4.1
xgboost==2.0.3
pickle5==0.0.11
scikit-learn==1.2.2
matplotlib==3.8.2
PyQt5==5.15.10
imblearn==0.0
skops==0.9.0
shap==0.45.1
IPython==8.18.1
tqdm==4.66.5
biopython==1.84
transformers==4.44.2
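
To check an existing environment against the pins above, a minimal sketch using only the standard library might look like the following. The helper name `check_pins` is hypothetical, and it assumes each pinned name matches the installed distribution name, which may not hold for every entry (for example, PyTorch is typically distributed on PyPI as `torch`).

```python
from importlib import metadata


def check_pins(pins):
    """Compare 'name==version' pins against installed distributions.

    Returns a dict mapping each mismatching name to (expected, found),
    where found is None if the distribution is not installed.
    """
    problems = {}
    for pin in pins:
        name, _, expected = pin.strip().partition("==")
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    # A few pins copied from the list above; extend as needed.
    pins = ["numpy==1.26.4", "pandas==2.2.1", "biopython==1.84"]
    for name, (expected, found) in check_pins(pins).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed pin matched the installed version.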

Reproduction of results

First, download all features; see readme.txt in the all_features folder for instructions.
Code for reproducing the results is provided, along with training and prediction scripts.
To reproduce the results of a table, navigate to the generation folder for the corresponding table number (for example, table_14_generation). Before running, update the feature_path variables inside the Python files.

Prediction

Prerequisites

Transformers and PyTorch are required for extracting the embeddings.
For more details, see the following GitHub repositories:

ProtT5-XL-U50

ESM2

DSSP must be installed to generate the structural features from PDB files:

sudo apt-get install dssp
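
Once DSSP is installed, one way to read per-residue secondary structure and relative solvent accessibility from a PDB file is through Biopython's `Bio.PDB.DSSP` wrapper (Biopython is in the library list above). This is a sketch, not the repository's feature-extraction code: the function names are hypothetical, and the executable name (`mkdssp` or `dssp`) depends on how the package was installed.

```python
import shutil


def find_dssp_executable(candidates=("mkdssp", "dssp")):
    """Return the path of the first DSSP binary found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def dssp_features(pdb_path):
    """Return {(chain_id, residue_id): (sec_struct, rel_asa)} for the first model."""
    # Imported lazily so the helper above works without Biopython installed.
    from Bio.PDB import PDBParser
    from Bio.PDB.DSSP import DSSP

    exe = find_dssp_executable()
    if exe is None:
        raise RuntimeError("DSSP executable not found on PATH")
    structure = PDBParser(QUIET=True).get_structure("query", pdb_path)
    model = structure[0]  # first model in the file
    dssp = DSSP(model, pdb_path, dssp=exe)
    # Each DSSP record is a tuple: index 2 is the secondary-structure code,
    # index 3 is the relative accessible surface area.
    return {key: (dssp[key][2], dssp[key][3]) for key in dssp.keys()}


if __name__ == "__main__":
    print(find_dssp_executable())
```

Running the script directly only reports which DSSP binary was found, which is a quick way to confirm the installation before extracting features.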

Steps

First, fill in dataset.txt using the pattern shown below:

>Protein_id
Fasta
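
To sanity-check a dataset.txt file before feature extraction, a small parser for this header-plus-sequence pattern could look like the sketch below. The function name `read_fasta` is hypothetical and this is not the repository's own loader; whether the repository scripts accept sequences wrapped over several lines is not stated, so the sketch simply joins any continuation lines.

```python
def read_fasta(lines):
    """Parse FASTA-style lines into {protein_id: sequence}."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            if current in records:
                raise ValueError(f"duplicate protein id: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence found before any '>' header line")
        else:
            # Join continuation lines into one upper-case sequence.
            records[current] += line.upper()
    return records


if __name__ == "__main__":
    example = [">P1", "MKTAYIAK", "QRQISFVK", ">P2", "GSHMLE"]
    print(read_fasta(example))
```

For a real file, pass an open file handle, for example `read_fasta(open("dataset.txt"))`.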

To predict protein-carbohydrate binding sites from a protein sequence, first run extractFeatures.py to generate the features. Then run predict_with_struct.py for prediction with structural features, or predict_without_struct.py for prediction without them.
predict_with_struct.py requires a PDB file for the query protein sequence. To generate an ESMFold or AlphaFold PDB file, you can use ColabFold.
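
The two-step order described above can be sketched as a small driver script. This assumes the scripts are run from the folder that contains them and take no command-line arguments, neither of which is confirmed here; the function names are hypothetical.

```python
import subprocess
import sys


def build_commands(with_struct):
    """Return the feature-extraction and prediction commands, in order."""
    predictor = "predict_with_struct.py" if with_struct else "predict_without_struct.py"
    return [
        [sys.executable, "extractFeatures.py"],
        [sys.executable, predictor],
    ]


def run_pipeline(with_struct, cwd="."):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(with_struct):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in build_commands(with_struct=False):
        print(" ".join(cmd))
```

Stopping at the first failure avoids running prediction on missing or stale features.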

Reproduce previous paper metrics

Scripts for reproducing the results of previous papers are provided in Prev_Papers and Prev_Papers_ESMFold. The probabilities produced by those papers' scripts for the TS53 set are also included.
