N4-methylcytosine (4mC) is a DNA modification that can regulate multiple biological processes, such as transcription regulation, DNA replication, and gene expression. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in Escherichia coli (E. coli). In the proposed model, DNA sequences are encoded with the word embedding technique ‘word2vec’. The resulting features are fed into a 1D convolutional neural network (CNN) to distinguish 4mC sites from non-4mC sites. On the independent dataset, the model achieved an overall accuracy of 0.861, approximately 4.3% higher than the existing model, 4mCCNN.
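To make the encoding step concrete, here is a minimal sketch of how a DNA sequence might be turned into the matrix a 1D CNN consumes. It is not the code in this repository. It assumes sequences are split into overlapping k-mers that act as "words" (the value k=3, the embedding size, and all function names are hypothetical), and it uses random placeholder vectors where the real pipeline would use vectors learned by word2vec.

```python
import random
from itertools import product


def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers ("words"), stride 1."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_placeholder_table(k=3, dim=4, seed=0):
    """Stand-in for trained word2vec vectors (assumption for illustration):
    one random vector per possible k-mer over the alphabet A, C, G, T."""
    rng = random.Random(seed)
    return {
        "".join(kmer): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for kmer in product("ACGT", repeat=k)
    }


def embed(sequence, table, k=3):
    """Map a sequence to a (number of k-mers) x (embedding size) matrix.
    Assumes the sequence contains only A, C, G, and T."""
    return [table[token] for token in kmer_tokens(sequence, k)]


tokens = kmer_tokens("ACGTCA")
matrix = embed("ACGTCA", build_placeholder_table())
print(tokens)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

Each row of the matrix is the vector for one k-mer, so a convolution that slides along the rows sees local sequence context. In a real run, the placeholder table would be replaced by the embeddings trained on the dataset.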
Python3 (tested 3.5.4)
jupyter (tested 1.0.0)
scikit-learn (tested 0.22.1)
pandas (tested 1.0.1)
numpy (tested 1.18.1)
gensim (tested 3.8.1)
sklearn (tested 0.19.1)
keras (tested 2.3.1)
tensorflow (tested 2.1.0)
W2V.py
Train_CNN_Model.py
Test.py
If your input files contain different sequences, adjust the parameters in the code accordingly.
Zulfiqar, Hasan, Zi-Jie Sun, Qin-Lai Huang, Shi-Shi Yuan, Hao Lv, Fu-Ying Dao, Hao Lin, and Yan-Wen Li. "Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli." Methods (2021), doi: 10.1016/j.ymeth.2021.07.011.