DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor
Here, we report DLpTCR, a computational framework that integrates three deep-learning models for predicting the likelihood of interaction between a TCR and a peptide presented by MHC molecules. DLpTCR achieved excellent performance on an independent testing dataset, thereby allowing robust identification of immunogenic T cell epitopes.
Download DLpTCR with:
git clone https://github.com/JiangBioLab/DLpTCR
This package can be installed in any of the following ways (the easy way):
# If needed:
pip install -r requirements.txt
# Or
conda install --yes --file requirements.txt
# Or you can create a new environment
conda create --name dlptcr --file requirements.txt
Note that the code depends on numpy, tensorflow and other packages, so install those first. The build will likely fail if it cannot find them. For more information, see:
- NumPy: Library for efficient matrix math in Python
- tensorflow: An end-to-end open source machine learning platform in Python
We collected experimentally verified TCR-pMHC pairs from the VDJdb, IEDB and
TetTCR-seq dataset for constructing a high-quality benchmark dataset. These peptide-TCR pairs were
split into training, testing and independent testing datasets with regard to their TCR α- and β-chains
so that each peptide-TCR pair exists in only one split, as detailed below:
1) TCRA_train.csv and TRB_Train.csv are the training datasets for constructing and training the models.
2) TCRA_test.csv and TCRB_test.csv are the testing datasets for testing the constructed models.
3) TCRA_COVID-19.csv and TCRB_COVID-19.csv are independent testing data for evaluating the performance of the ensemble classifiers.
4) TRA-VDJdb_TCR cross-reactivity.rar and TRB_VDJdb_TCR cross-reactivity.rar are used to assess the ability of the ensemble classifiers to predict TCR cross-reactivity.
5) TCRAB_IEDB.csv is used to evaluate the integrated model for predicting the peptide-TCRαβ interaction.
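The split constraint described above (each peptide-TCR pair appears in only one split) can be checked with a few lines of Python. This is a minimal sketch and not part of DLpTCR: the function name `find_leaks` and the toy CDR3/epitope pairs are made up for illustration, and real pairs would be read from the CSV files listed above.

```python
from itertools import combinations


def find_leaks(splits):
    """Return {(split_a, split_b): shared_pairs} for every overlapping pair of splits.

    `splits` maps a split name to an iterable of (cdr3, epitope) tuples.
    """
    as_sets = {name: set(pairs) for name, pairs in splits.items()}
    leaks = {}
    for a, b in combinations(as_sets, 2):
        shared = as_sets[a] & as_sets[b]
        if shared:
            leaks[(a, b)] = shared
    return leaks


# Toy pairs, for illustration only.
splits = {
    "train": [("CAVRDSNYQLIW", "GILGFVFTL"), ("CAGAGSQGNLIF", "NLVPMVATV")],
    "test": [("CAVSDLEPNSSASKIIF", "GLCTLVAML")],
    # Same CDR3 as a training pair but a different peptide: not a duplicate pair.
    "independent": [("CAVRDSNYQLIW", "YLQPRTFLL")],
}
leaks = find_leaks(splits)
print(leaks or "no pair is shared between splits")
```

An empty result means no (CDR3, peptide) pair occurs in more than one split.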
The final base classifiers of DLpTCR are deposited in this folder.
- FULL_A_ALL_onehot.h5, CNN_A_ALL_onehot.h5 and RESNET_A_ALL_pca15.h5 are the base classifiers of ensemble model for predicting the peptide-TCRα interaction.
- FULL_B_ALL_pca18.h5, CNN_B_ALL_pca20.h5 and RESNET_B_ALL_pca10.h5 are the base classifiers of ensemble model for predicting the peptide-TCRβ interaction.
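As a rough illustration of how three base classifiers could be combined, the sketch below averages their predicted probabilities and thresholds the mean. This text does not state the combination rule DLpTCR actually uses, so both the averaging and the 0.5 threshold are assumptions, and the scores are made-up numbers rather than model outputs.

```python
import numpy as np


def ensemble_average(prob_matrix, threshold=0.5):
    """Average per-model probabilities and threshold the mean.

    prob_matrix has shape (n_models, n_pairs); each entry is a predicted
    binding probability in [0, 1]. Averaging is an assumed rule, not
    necessarily the one DLpTCR implements.
    """
    probs = np.asarray(prob_matrix, dtype=float)
    mean_prob = probs.mean(axis=0)
    return mean_prob, mean_prob >= threshold


# Made-up scores from three base classifiers for four peptide-TCR pairs.
scores = [
    [0.9, 0.2, 0.6, 0.4],  # e.g. a FULL_* model
    [0.8, 0.1, 0.4, 0.7],  # e.g. a CNN_* model
    [0.7, 0.3, 0.8, 0.1],  # e.g. a RESNET_* model
]
mean_prob, label = ensemble_average(scores)
print(mean_prob, label)
```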
The folder contains the features generated from the full training datasets using the PCA encoding method. We padded each sequence of a pair to the maximum length of 20 and encoded it using Principal Component Analysis (PCA) encoding. For each amino acid, we selected the top 20 PCs, which explain over 95% of the total data variation, and generated different vectors using 8-20 PCs to represent its biochemical signatures.
The source code for feature extraction, five-fold cross-validation, model construction and training, and prediction is deposited in the folder 'code'.
- The source code in folder 'fold' is used to select the appropriate features by five-fold cross-validation.
- The source code in folder 'train' is used to construct and train the base classifiers.
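The encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the physicochemical descriptor table is replaced by random placeholder numbers, padding uses zero vectors (the padding value is not stated here), and the names `pca_table` and `encode` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_LEN = 20  # sequences are padded to this length


def pca_table(properties, n_components):
    """Project a (20, n_properties) descriptor matrix onto its top principal components."""
    centered = properties - properties.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def encode(sequence, table, max_len=MAX_LEN):
    """Encode a sequence as a (max_len, n_components) array, zero-padded at the end."""
    if len(sequence) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    encoded = np.zeros((max_len, table.shape[1]))
    for i, aa in enumerate(sequence):
        encoded[i] = table[AMINO_ACIDS.index(aa)]
    return encoded


# Placeholder descriptors: random numbers stand in for real property values.
rng = np.random.default_rng(0)
properties = rng.normal(size=(20, 50))
table = pca_table(properties, n_components=15)
features = encode("CASSLGQAYEQYF", table)
print(features.shape)  # (20, 15)
```

Changing `n_components` within 8-20 gives the different vector sizes mentioned above.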
- The source code (XXX_Feature_Extraction.py) is used to implement feature extraction.
- The source code (DLpTCR.py) is used to predict the peptide-TCR interaction.
TensorFlow is installed along with the other DLpTCR dependencies. Refer to the Keras documentation to configure TensorFlow to run on GPU or CPU. To use a GPU, you also need to install CUDA and cuDNN; refer to their websites for instructions. TensorFlow can also be installed with "conda install tensorflow-gpu". A CPU is suitable only for prediction, not for training.
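To see whether TensorFlow can use a GPU on your machine, a quick check along these lines may help. It is a sketch that assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available; the helper name `available_gpus` is hypothetical.

```python
def available_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow is not installed in this environment
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = available_gpus()
print("GPUs visible to TensorFlow:", gpus if gpus else "none (running on CPU)")
```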
cd to the DLpTCR/code folder, which contains DLpTCR_server.py and Model_Predict_Feature_Extraction.py, then start Python:
python
>>> from Feature_Extraction import *
>>> from DLpTCR_server import *
>>> input_file_path = '../data/Example_file.xlsx'
Please refer to the document 'Example_file.xlsx' for the format of the input file. Column names must not be changed.
>>> model_select = "AB"
Set model_select to "A" for the pTCRα model, "B" for the pTCRβ model, or "AB" for the pTCRαβ model.
>>> job_dir_name = 'test'
>>> user_dir = './user/' + str(job_dir_name) + '/'
The predicted files will be stored in the path "user_dir".
>>> import os
>>> os.makedirs(user_dir, exist_ok=True)
>>> error_info,TCRA_cdr3,TCRB_cdr3,Epitope = deal_file(input_file_path, user_dir, model_select)
>>> output_file_path = save_outputfile(user_dir, model_select, input_file_path, TCRA_cdr3, TCRB_cdr3, Epitope)
Alternatively, you can use API.py to predict the peptide-TCR interaction:
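The setup steps of the session above (choosing a model and creating the output directory) can be wrapped in a small helper. This is only a sketch: `prepare_job` and `VALID_MODELS` are hypothetical names, not part of DLpTCR, and the example writes to a temporary directory instead of './user/'.

```python
import os
import tempfile

VALID_MODELS = ("A", "B", "AB")  # pTCRα, pTCRβ, pTCRαβ


def prepare_job(job_dir_name, model_select, base_dir="./user/"):
    """Validate the model choice and create the job's output directory."""
    if model_select not in VALID_MODELS:
        raise ValueError(
            f"model_select must be one of {VALID_MODELS}, got {model_select!r}"
        )
    # The trailing "" keeps the path separator at the end, as in './user/test/'.
    user_dir = os.path.join(base_dir, str(job_dir_name), "")
    os.makedirs(user_dir, exist_ok=True)
    return user_dir


with tempfile.TemporaryDirectory() as tmp:
    user_dir = prepare_job("test", "AB", base_dir=tmp)
    created = os.path.isdir(user_dir)
    print(user_dir, created)
```

The returned path can then be passed as user_dir to deal_file and save_outputfile.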
python API.py
A CPU is suitable only for prediction, not training. To train models on your own training data, run the feature extraction scripts first:
python Train_Test_Onehot_Chem_Feature_Extraction.py
python Train_Test_PCA_Feature_Extraction.py
The code in folder DLpTCR/code/fold is then used for five-fold cross-validation to select the best features:
#example
python CNN_A_fold_onehot.py
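For readers unfamiliar with five-fold cross-validation, the partitioning idea can be sketched with NumPy alone. How the scripts in DLpTCR/code/fold actually split the data is not described here, so the shuffling, the seed and the name `k_fold_indices` are assumptions made for illustration.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


cv_splits = list(k_fold_indices(n_samples=23, k=5))
for fold, (train_idx, val_idx) in enumerate(cv_splits, start=1):
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation samples")
```

Each sample is used for validation exactly once; a feature set would be scored by averaging a metric over the five folds.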
The code in folder DLpTCR/code/train is then used to train the models with the selected features:
#example
python CNN_A_ALL_onehot.py
Please cite the following paper for using DLpTCR:
DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor