Incorporating Pre-training of Deep Bidirectional Transformers and Convolutional Neural Networks to Interpret DNA Sequences

khanhlee/bert-dna

bert-dna

Incorporating Pre-training of Deep Bidirectional Transformers and Convolutional Neural Networks to Interpret DNA Sequences

Recently, language representation models have drawn a lot of attention in the natural language processing (NLP) field due to their remarkable results. Among them, Bidirectional Encoder Representations from Transformers (BERT) has proven to be a simple yet powerful language model that achieved new state-of-the-art performance. BERT uses contextualized word embeddings to capture the semantics of words and the context in which they appear. In this study, we present a novel technique, BERT-DNA, which applies the BERT-base multilingual model to bioinformatics to interpret the information in DNA sequences. We treated DNA sequences as sentences and transformed them into fixed-length meaningful vectors in which a 768-dimensional vector represents each nucleotide. We observed that our BERT-base features improved sensitivity, specificity, accuracy, and MCC by more than 5-10% compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by convolutional neural networks) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and deep convolutional neural networks could open a new avenue in bioinformatic modeling using sequence information.
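To make the "DNA sequences as sentences" idea concrete, here is a minimal sketch of how a raw sequence could be turned into a space-separated sentence with one token per nucleotide. The function name `dna_to_sentence` and the restriction to the alphabet A, C, G, T are assumptions made for illustration; the repository's own preprocessing may differ.

```python
# Hypothetical helper, not taken from the repository.
# Assumption: each nucleotide becomes one whitespace-separated token.
VALID_NUCLEOTIDES = set("ACGT")


def dna_to_sentence(sequence: str) -> str:
    """Turn a DNA string into a space-separated 'sentence' of nucleotides."""
    cleaned = sequence.strip().upper()
    unknown = set(cleaned) - VALID_NUCLEOTIDES
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    # Joining a string with " " places a space between every character.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(dna_to_sentence("acgtta"))
```

A sentence in this form can then be fed to a BERT tokenizer so that each nucleotide receives its own contextual vector.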

Dependencies

Prediction step-by-step:

Step 1

Use the "extract_seq.py" file to generate JSON files:

  • python extract_seq.py

Step 2

Run the command line in "bert2json.txt" to train the BERT model and extract features

Step 3

Use "jsonl2csv.py" to transform JSON to CSV files:

  • python jsonl2csv.py json_file csv_file
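As an illustration of what this conversion step might involve, the sketch below flattens one JSON-lines record per sequence into one CSV row. It assumes the input follows the layout written by the feature extraction script in the original BERT release (a `features` list whose items carry a `token` and a list of `layers` with `index` and `values`). Dropping `[CLS]` and `[SEP]` and using only the last layer are assumptions; the actual "jsonl2csv.py" may behave differently.

```python
import csv
import json
from typing import Iterable, Iterator, List

# Assumption: special tokens are not part of the DNA sequence and are skipped.
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def jsonl_to_rows(lines: Iterable[str], layer_index: int = -1) -> Iterator[List[float]]:
    """Yield one flat feature row per JSON line (one line per sequence)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        row: List[float] = []
        for feature in record["features"]:
            if feature["token"] in SPECIAL_TOKENS:
                continue
            values = None
            for layer in feature["layers"]:
                if layer["index"] == layer_index:
                    values = layer["values"]
                    break
            if values is None:
                raise KeyError(f"layer {layer_index} not found for token {feature['token']!r}")
            # Concatenate the per-token vectors into a single row.
            row.extend(values)
        yield row


def convert(json_path: str, csv_path: str) -> None:
    """Read a JSON-lines file and write the flattened rows to a CSV file."""
    with open(json_path) as src, open(csv_path, "w", newline="") as dst:
        csv.writer(dst).writerows(jsonl_to_rows(src))
```

With 768-dimensional vectors and sequences of equal length, every row produced this way has the same number of columns, which is what a downstream CNN would need.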

Step 4

Use "6mAtraining.py" to train the CNN model on the CSV files
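Once a model is trained, its predictions can be scored with the four metrics named in the description above. The following is a minimal sketch using the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC) computed from confusion-matrix counts. The function name `binary_metrics` is hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something taken from the repository.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, accuracy, and MCC from counts."""
    total = tp + fp + tn + fn
    # Each ratio falls back to 0.0 when its denominator is zero (assumption).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```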
