This distribution provides an implementation, along with the data and trained models used in our paper:
Xia, Zhihao, et al. "DeeReCT-PolyA: a robust and generic deep learning method for PAS identification". Bioinformatics, 2018.
If you find the code useful for your research, please cite our paper.
@article{deepolyA,
author = {Xia, Zhihao and Li, Yu and Zhang, Bin and Li, Zhongxiao and Hu, Yuhui and Chen, Wei and Gao, Xin},
title = "{DeeReCT-PolyA: a robust and generic deep learning method for PAS identification}",
year = {2018},
month = {11},
doi = {10.1093/bioinformatics/bty991},
url = {https://dx.doi.org/10.1093/bioinformatics/bty991},
}
The code has been tested with Python 3 and TensorFlow 1.7. The GPU edition of TensorFlow is recommended, but running the code on a CPU is still reasonably fast if the dataset is not too large.
The repository contains pre-trained models for PAS identification in the models directory, covering the Dragon-human, Omni-human, C57BL/6J (BL) mouse and SPRET/EiJ (SP) mouse datasets. You may use the pre-trained models for testing, or fine-tune them with your own data. Note that each pre-trained model is trained on 4 out of 5 folds of the data, whereas the results in the paper are evaluated using 5-fold cross-validation.
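To make the difference between "trained on 4 out of 5 folds" and "5-fold cross-validation" concrete, here is a minimal sketch in plain Python. It is not the splitting code from this repository: the function names, the shuffling, and the seed are assumptions for illustration, and it ignores the separate validation set.

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def train_test_for_fold(folds, held_out):
    """Use one fold for testing and all remaining folds for training."""
    test = folds[held_out]
    train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
    return train, test

folds = kfold_indices(n_samples=20, n_folds=5)

# A single model (like each pre-trained model): 4 folds for training, 1 held out.
train_idx, test_idx = train_test_for_fold(folds, held_out=0)

# Cross-validation: repeat the above with each fold held out exactly once.
cv_splits = [train_test_for_fold(folds, k) for k in range(len(folds))]
```

A number obtained from one such model on its held-out fold is therefore a single-fold estimate, and may differ somewhat from the cross-validated figures reported in the paper.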
See each script for the list of parameters that you can specify, or run
python script.py -h
If you have any questions, please contact zhihao.xia@wustl.edu.
To prepare your data for training or fine-tuning, put the sequences in .txt files in which each line is an ATGC sequence of length 206 with the true or pseudo poly(A) motif as the central 6-mer. Positive data and negative data should be put in two different sub-directories. Then run
python data_prep.py pos_root neg_root outfile [--nfolds n]
to encode the raw sequences with one-hot encoding and split the data into training, validation and test sets. The processed dataset is saved as a .npz file. Note that if you only want to run inference with our pre-trained models on your own data, or you do not have ground-truth labels, the testing code can take raw sequences directly as input and make predictions without this preparation step.
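Before running data_prep.py it can help to check that every line of a .txt file matches the expected format. The following is a minimal sketch in plain Python, not the code in data_prep.py. It assumes the motif sits at positions 100 to 105 (100 nt of flank on each side), and the channel order A, T, G, C used for the one-hot encoding is also an assumption; the real script may order the channels differently.

```python
SEQ_LEN = 206                               # sequence length stated in this README
MOTIF_LEN = 6                               # length of the poly(A) motif
MOTIF_START = (SEQ_LEN - MOTIF_LEN) // 2    # 100, assuming equal flanks on both sides
ALPHABET = "ATGC"                           # assumed channel order (hypothetical)

def check_sequence(line):
    """Normalize one line and verify its length and alphabet."""
    seq = line.strip().upper()
    if len(seq) != SEQ_LEN:
        raise ValueError("expected length %d, got %d" % (SEQ_LEN, len(seq)))
    unexpected = set(seq) - set(ALPHABET)
    if unexpected:
        raise ValueError("unexpected characters: %s" % sorted(unexpected))
    return seq

def central_motif(seq):
    """Return the 6-mer at the center of the sequence."""
    return seq[MOTIF_START:MOTIF_START + MOTIF_LEN]

def one_hot(seq):
    """Encode a sequence as a list of 4-element 0/1 rows, one row per base."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]

example = "A" * 100 + "AATAAA" + "C" * 100
seq = check_sequence(example)
motif = central_motif(seq)
encoded = one_hot(seq)
```

Here `encoded` has 206 rows of 4 values each, and `motif` is the hexamer that the model treats as the candidate poly(A) signal.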
After the data preparation, you can train a DeeReCT-PolyA model from scratch by running
python train.py data [--out outfile] [--hparam hyperparam_file]
The input data should be the .npz file generated in the previous step. You can specify hyper-parameters for the model, e.g. the learning rate; by default they are set randomly. We suggest using random search to find the best set of hyper-parameters based on performance on the validation set. For reference, we provide some sets of hyper-parameters in the models directory. The trained model can be saved to the output file given with --out.
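Random search as suggested above can be sketched as follows. This is a generic illustration rather than code from this repository: apart from the learning rate, the hyper-parameter names and ranges are hypothetical, and `toy_evaluate` is a stand-in for "train a model with these settings and return its validation score". The format expected by --hparam is not described here, so refer to the example files in the models directory for that.

```python
import math
import random

def sample_hyperparams(rng):
    """Draw one random configuration (names and ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),   # log-uniform in [1e-4, 1e-2]
        "dropout": rng.uniform(0.1, 0.6),
        "batch_size": rng.choice([32, 64, 128]),
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Keep the configuration with the highest validation score."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_trials):
        params = sample_hyperparams(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Stand-in for validation accuracy; peaks at a learning rate of 1e-3."""
    return -abs(math.log10(params["learning_rate"]) + 3)

best_params, best_score = random_search(toy_evaluate)
```

In practice `evaluate` would launch train.py with the sampled settings and read back the validation performance; the test split should be left untouched until the final configuration is chosen.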
As discussed in our paper, when you need a DeeReCT-PolyA model for your own data, it is usually beneficial to fine-tune a pre-trained model instead of training from scratch, especially when the new training data is insufficient. To fine-tune a pre-trained model with your own data, run
python train.py data [--out outfile] [--hparam hyperparam_file] --pretrained model_file
To test the model with your data, run
python test.py data model [--out outfile]
The data can be a .txt file in which each line is an ATGC sequence of length 206 with the true or pseudo poly(A) motif as the central 6-mer. It can also be a .npz file of one-hot encoded sequences generated by data_prep.py, in which case the test split in the .npz file is used. The binary predictions for the input sequences can be saved by specifying the output file.
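If you start from longer genomic sequences rather than ready-made 206 nt windows, one way to build the .txt input is to scan for candidate hexamers and cut out the flanking sequence around each hit. The sketch below is an assumption-laden illustration, not part of this repository: it assumes 100 nt of flank on each side of the motif, and it searches only for AATAAA and ATTAAA, so extend `MOTIFS` with whatever variants your dataset uses.

```python
FLANK = 100                       # assumed flank on each side: 100 + 6 + 100 = 206
MOTIF_LEN = 6
MOTIFS = ("AATAAA", "ATTAAA")     # illustrative subset of PAS hexamers

def candidate_windows(genomic_seq, motifs=MOTIFS, flank=FLANK):
    """Return (motif_position, window) pairs for hits with full flanks available."""
    genomic_seq = genomic_seq.upper()
    windows = []
    last_start = len(genomic_seq) - flank - MOTIF_LEN
    for start in range(flank, last_start + 1):
        if genomic_seq[start:start + MOTIF_LEN] in motifs:
            window = genomic_seq[start - flank:start + MOTIF_LEN + flank]
            windows.append((start, window))
    return windows

# Toy region: the first motif has full flanks, the second is too close to the end.
region = "G" * 120 + "AATAAA" + "G" * 150 + "ATTAAA" + "G" * 50
windows = candidate_windows(region)

# One window per line, which is the layout test.py expects for .txt input.
txt_content = "\n".join(window for _, window in windows) + "\n"
```

Hits that lie within 100 nt of either end of the input are skipped because a full 206 nt window cannot be formed around them.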
Dragon-human Poly(A) dataset: Kalkatawi, Manal, et al. "Dragon PolyA Spotter: predictor of poly (A) motifs within human genomic DNA sequences." Bioinformatics 28.1 (2011): 127-129.
Omni-human Poly(A) dataset: Magana-Mora, Arturo, et al. "Omni-PolyA: a method and tool for accurate recognition of Poly (A) signals in human genomic DNA." BMC genomics 18.1 (2017): 620.