nafcoder/DeepCPBSite

Predicting Protein-Carbohydrate Binding Sites: A Deep Learning Approach Integrating Protein Language Model Embeddings and Structural Features

Here, we present DeepCPBSite, a novel ensemble model that combines three separate models, each trained with a different approach (random undersampling, weighted oversampling, and class-weighted loss) and each based on a ResNet+FNN deep learning architecture. The framework for this architecture is given below:

[Figures: DeepCPBSite-1, DeepCPBSite_2-1]
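Two of the approaches named above, random undersampling and class-weighted loss, can be illustrated with a small generic sketch. This is not the repository's training code: the function names `random_undersample` and `inverse_frequency_weights` are hypothetical, and inverse-frequency weighting is only one common way to choose class weights, assumed here for illustration.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices that keep every minority-class sample and an
    equally sized random subset of each other class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    keep = [i for i, y in enumerate(labels) if y == minority]
    for cls in counts:
        if cls == minority:
            continue
        candidates = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(candidates, counts[minority]))
    return sorted(keep)


def inverse_frequency_weights(labels):
    """Per-class weights n / (k * count), so rarer classes weigh more.
    Assumption: the actual weights used by DeepCPBSite may differ."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Toy labels: 8 non-binding residues (0) and 2 binding residues (1).
labels = [0] * 8 + [1] * 2
balanced_indices = random_undersample(labels)
class_weights = inverse_frequency_weights(labels)
```

With per-residue binding labels, the binding class is usually the rare one, so undersampling shrinks the non-binding class while class weighting leaves the data untouched and scales the loss instead.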

Data availability

The training set, independent set, TS53 set, and TS37 set are provided in the Dataset folder.

Environments

OS: Pop!_OS 22.04 LTS

Python version: Python 3.9.19

Used libraries:

numpy==1.26.4
pandas==2.2.1
pytorch==2.4.1
xgboost==2.0.3
pickle5==0.0.11
scikit-learn==1.2.2
matplotlib==3.8.2
PyQt5==5.15.10
imblearn==0.0
skops==0.9.0
shap==0.45.1
IPython==8.18.1
tqdm==4.66.5
biopython==1.84
transformers==4.44.2
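As a convenience, the pinned versions above can be compared against the active environment. The sketch below is not part of the repository; `parse_pins`, `check_environment`, and the `ALIASES` table are hypothetical names. It assumes the listed names are looked up as installed distribution names, with one known exception: PyTorch is distributed on PyPI as `torch`.

```python
from importlib import metadata

PINS = """
numpy==1.26.4
pandas==2.2.1
pytorch==2.4.1
xgboost==2.0.3
pickle5==0.0.11
scikit-learn==1.2.2
matplotlib==3.8.2
PyQt5==5.15.10
imblearn==0.0
skops==0.9.0
shap==0.45.1
IPython==8.18.1
tqdm==4.66.5
biopython==1.84
transformers==4.44.2
"""

# Assumption: map listed names to the distribution names used on PyPI.
ALIASES = {"pytorch": "torch"}


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_environment(pins):
    """Return {name: (wanted, found)}; found is None if not installed."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(ALIASES.get(name, name))
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name, (wanted, found) in check_environment(parse_pins(PINS)).items():
        status = "ok" if wanted == found else "MISMATCH"
        print(f"{name:15s} wanted={wanted:10s} found={found} [{status}]")
```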

Reproduction of results

  1. First, download all features by following the instructions in the readme.txt of the all_features folder.

  2. Code for reproducing the results is provided, along with training and prediction scripts.

  3. To reproduce the results of a table, navigate to the generation folder for the corresponding table number. Before running, update the feature_path variables inside the Python files.

Prediction

Prerequisites

  1. Transformers and PyTorch are required for extracting the embeddings.

  2. For more details, see the following GitHub repositories:

    ProtT5-XL-U50

    ESM2

  3. You need to install DSSP to generate the structural features from PDB files:

sudo apt-get install dssp
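Once DSSP is installed, per-residue structural descriptors can be read from a PDB file through Biopython's `Bio.PDB.DSSP` wrapper. The sketch below only illustrates that route and is not the repository's feature code. The function names, the `dssp_exe` default, and the choice of descriptors (secondary structure one-hot, relative accessibility, phi, psi) are assumptions.

```python
DSSP_STATES = "HBEGITS-"  # classic 8-state DSSP alphabet ('-' means none)


def one_hot_secondary_structure(ss):
    """One-hot encode a DSSP state; states outside the alphabet give all zeros."""
    return [1.0 if ss == state else 0.0 for state in DSSP_STATES]


def dssp_features(pdb_path, dssp_exe="mkdssp"):
    """Run DSSP on the first model of a PDB file and collect per-residue rows.

    Assumption: `dssp_exe` is the name of the installed DSSP binary.
    """
    # Imported lazily so the helper above works without Biopython installed.
    from Bio.PDB import PDBParser
    from Bio.PDB.DSSP import DSSP

    structure = PDBParser(QUIET=True).get_structure("query", pdb_path)
    model = structure[0]
    dssp = DSSP(model, pdb_path, dssp=dssp_exe)

    rows = []
    for key in dssp.keys():  # key = (chain_id, residue_id)
        record = dssp[key]
        rows.append({
            "chain": key[0],
            "residue": record[1],       # one-letter amino acid
            "ss": one_hot_secondary_structure(record[2]),
            "rel_asa": record[3],       # may be 'NA' for some residues
            "phi": record[4],
            "psi": record[5],
        })
    return rows
```

Calling `dssp_features("query.pdb")` (a hypothetical file name) would return one dictionary per residue that DSSP could process.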

Steps

  1. First, fill in dataset.txt, following the pattern shown below:
>Protein_id
Fasta
  2. To predict protein-carbohydrate binding sites from a protein sequence, run extractFeatures.py to generate the features, then run predict_with_struct.py for prediction with structural features or predict_without_struct.py for prediction without them.

  3. To run predict_with_struct.py, you must provide a PDB file for the query protein sequence. To generate an ESMFold or AlphaFold PDB file, you can use ColabFold.
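The first two steps above (reading dataset.txt and extracting language-model embeddings) can be sketched as follows. This is a hedged illustration rather than a copy of extractFeatures.py. The function names are hypothetical, and the use of the `Rostlab/prot_t5_xl_uniref50` checkpoint through Hugging Face Transformers is an assumption based on the ProtT5-XL-U50 prerequisite. The tokenizer also needs the `sentencepiece` package.

```python
import re


def read_dataset(text):
    """Parse '>Protein_id' header lines followed by one or more sequence lines."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def to_prot_t5_input(sequence):
    """ProtT5 expects space-separated residues with rare ones (U, Z, O, B) mapped to X."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))


def embed_with_prot_t5(sequence, model_name="Rostlab/prot_t5_xl_uniref50"):
    """Return a (len(sequence), hidden_size) tensor of per-residue embeddings.

    Assumption: `model_name` is the checkpoint to use; it is downloaded on first call.
    """
    # Imported lazily so the parsing helpers work without PyTorch installed.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()
    batch = tokenizer(to_prot_t5_input(sequence), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    # Drop the end-of-sequence token that the tokenizer appends.
    return hidden[0, : len(sequence)]


example = read_dataset(">P1\nMKT\nAUL\n>P2\nGG\n")
```

In practice the file would be read with `read_dataset(open("dataset.txt").read())`, and each sequence passed to `embed_with_prot_t5`.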

Reproduce previous paper metrics

The Prev_Papers and Prev_Papers_ESMFold folders contain scripts for reproducing the results of previous papers. We also provide the probabilities produced by their scripts for the TS53 set.
