This is an implementation of ULMFiT for genomics classification using PyTorch and fastai. The model architecture is based on the AWD-LSTM model, consisting of an embedding layer, three LSTM layers, and a final set of linear layers.
The ULMFiT approach uses three training phases to produce a classification model:
- Train a language model on a large, unlabeled corpus
- Fine-tune the language model on the classification corpus
- Use the fine-tuned language model to initialize a classification model
This method is particularly advantageous for genomic data, where unlabeled data is abundant and labeled data is scarce. The ULMFiT approach allows us to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model serves as a feature extractor for parsing genomic data.
Typical deep learning approaches to genomics classification are limited to whatever labeled data is available. Models are usually trained from scratch on small datasets, leading to problems with overfitting. When unsupervised pre-training is used, it is typically done only on the classification dataset or on synthetically generated data. The Genomic-ULMFiT approach uses genome-scale corpora for pre-training to produce better feature extractors than we would get by training only on the classification corpus.
For a deep dive into the ULMFiT approach, model architectures, regularization and training strategies, see the Methods Long Form document in the Methods section.
Performance of Genomic-ULMFiT relative to other methods
E. coli promoters
The Genomic-ULMFiT method performs well at distinguishing promoter sequences from random sections of the genome. The process of unsupervised pre-training and fine-tuning has a clear impact on the performance of the classification model.
|E. coli Genome Pre-Training||0.919||0.941||0.893||0.839|
|Genomic Ensemble Pre-Training||0.973||0.980||0.966||0.947|
Data generation described in notebook
Classification performance on human promoters is competitive with published results
Human Promoters (short)
For the short promoter sequences, using data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks:
|Model||DNA Size||kmer/stride||Accuracy||Precision||Recall||Correlation Coefficient||Specificity|
|Kh et al.||-200/50||-||-||-||0.9||0.89||0.98|
|With Pre-Training and Fine Tuning||-200/50||5/2||.977||.959||.989||.955||.969|
|With Pre-Training and Fine Tuning||-200/50||5/1||.990||.983||.995||.981||.987|
|With Pre-Training and Fine Tuning||-200/50||3/1||.995||.992||.996||.991||.994|
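The kmer/stride column above refers to how a nucleotide sequence is split into tokens before it reaches the model. The following is a minimal sketch of that idea, not the repository's tokenizer: the function name `kmer_tokenize` and its defaults are hypothetical, and it assumes trailing bases that do not fill a complete k-mer are dropped.

```python
def kmer_tokenize(sequence, k=5, stride=2):
    """Split a nucleotide string into k-mers, advancing `stride` bases per token.

    Hypothetical helper for illustration only. Bases at the end of the
    sequence that do not fill a complete k-mer are dropped (an assumption).
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# 5-mers with stride 2: consecutive tokens overlap by 3 bases.
tokens_5_2 = kmer_tokenize("ATGCGTACGT", k=5, stride=2)

# 3-mers with stride 1: consecutive tokens overlap by 2 bases.
tokens_3_1 = kmer_tokenize("ATGCGT", k=3, stride=1)
```

A smaller stride yields more tokens per sequence and a smaller k yields a smaller vocabulary (at most 4^k tokens over A, C, G, T), which is the trade-off the 5/2, 5/1 and 3/1 rows explore.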
Human Promoters (long)
For the long promoter sequences, using data from PromID: Human Promoter Prediction by Deep Learning:
|Model||DNA Size||Models||Accuracy||Precision||Recall||Correlation Coefficient|
|Umarov et al.||-1000/500||2 Model Ensemble||-||0.636||0.802||0.714|
|Umarov et al.||-200/400||2 Model Ensemble||-||0.769||0.755||0.762|
|Naive Model||-500/500||Single Model||0.858||0.877||0.772||0.708|
|With Pre-Training||-500/500||Single Model||0.888||0.90||0.824||0.770|
|With Pre-Training and Fine Tuning||-500/500||Single Model||0.892||0.877||0.865||0.778|
Data generation described in notebook
Other Bacterial Promoters
This table shows results on data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks. These results show how CNN-based methods can sometimes perform better when training on small datasets.
|Method||Organism||Training Examples||Accuracy||Precision||Recall||Correlation Coefficient||Specificity|
|Kh et al.||E. coli||2936||-||-||0.90||0.84||0.96|
|Kh et al.||B. subtilis||1050||-||-||0.91||0.86||0.95|
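For reference, the metric columns used in these tables can all be derived from the four confusion-matrix counts. The sketch below assumes that "Correlation Coefficient" means the Matthews correlation coefficient, which this document does not state explicitly. The function name is hypothetical, and the counts in the example are made up for illustration, not taken from any result above.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts.

    Hypothetical helper. 'correlation_coefficient' is computed as the
    Matthews correlation coefficient (an assumption about the tables).
    Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "correlation_coefficient": ratio(tp * tn - fp * fn, mcc_den),
    }


# Illustrative counts only (not from any experiment in this document).
example = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```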
Genomic-ULMFiT shows improved performance on the metagenomics taxonomic dataset from Deep learning models for bacteria taxonomic classification of metagenomic data.
|Fiannaca et al.||Amplicon||.9137||.9162||.9137||.9126|
|Fiannaca et al.||Shotgun||.8550||.8570||.8520||.8511|
When trained on a dataset of mammalian enhancer sequences from Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences, Genomic-ULMFiT improves on results from Cohn et al.
|Cohn et al.||0.80||0.78||0.77||0.72|
|Genomic-ULMFiT 5-mer Stride 2||0.812||0.871||0.773||0.787|
|Genomic-ULMFiT 4-mer Stride 2||0.804||0.876||0.771||0.786|
|Genomic-ULMFiT 3-mer Stride 1||0.819||0.875||0.788||0.798|
This table shows results for training a classification model on a dataset of coding mRNA sequences and long noncoding RNA (lncRNA) sequences. The dataset comes from A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential by Hill et al. The dataset contains two test sets: a standard test set and a challenge test set.
|GRU Ensemble (Hill et al.)*||Standard Test Set||0.96||0.97||0.95||0.97||0.92|
|Genomic ULMFiT (3mer stride 1)||Standard Test Set||0.963||0.952||0.974||0.953||0.926|
|GRU Ensemble (Hill et al.)*||Challenge Test Set||0.875||0.95||0.80||0.95||0.75|
|Genomic ULMFiT (3mer stride 1)||Challenge Test Set||0.90||0.944||0.871||0.939||0.817|
(*) Hill et al. presented their results as a plot rather than as a data table. Values in the above table are estimated by reading off the plot.
One way to gain insight into how the classification model makes decisions is to perturb regions of a given input sequence and see how changing each region impacts the classification result. This allows us to create plots like the one below, highlighting important sequence regions for classification. In the plot below, the red line corresponds to a true transcription start site. The plot shows how prediction results are sensitive to changes around that location. More detail on interpretations can be found in the Model Interpretations directory.
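The perturbation idea can be sketched in a few lines. This is an illustrative version under stated assumptions, not the code in the Model Interpretations directory: `predict` is a hypothetical callable that maps a sequence string to a score, each window is replaced with random bases, and the window size and sample count are arbitrary choices. The toy scorer below stands in for a trained classifier.

```python
import random


def perturbation_sensitivity(sequence, predict, window=5, n_samples=10, seed=0):
    """Mean absolute change in `predict` when each window is randomized.

    `predict` is a hypothetical stand-in for a trained classifier. Returns
    one score per window start position; larger values mean the prediction
    depends more strongly on that region.
    """
    rng = random.Random(seed)
    baseline = predict(sequence)
    scores = []
    for start in range(len(sequence) - window + 1):
        total_change = 0.0
        for _ in range(n_samples):
            # Replace the window with random bases, leaving the rest intact.
            replacement = "".join(rng.choice("ACGT") for _ in range(window))
            perturbed = sequence[:start] + replacement + sequence[start + window:]
            total_change += abs(predict(perturbed) - baseline)
        scores.append(total_change / n_samples)
    return scores


def toy_predict(seq):
    """Toy scorer for demonstration: 1.0 if the motif TATAAT is present."""
    return 1.0 if "TATAAT" in seq else 0.0


toy_sequence = "G" * 8 + "TATAAT" + "G" * 8
sensitivity = perturbation_sensitivity(toy_sequence, toy_predict)
most_sensitive_start = max(range(len(sensitivity)), key=sensitivity.__getitem__)
```

With the toy scorer, only windows that overlap the motif can change the prediction, so the sensitivity profile peaks around it in the same way the plot peaks around the transcription start site.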
Long Sequence Inference
Inference on long, unlabeled sequences can be done by breaking the input sequence into chunks and plotting prediction results as a function of position along the sequence. The image below shows a sample prediction of promoter locations on a 40,000 bp region of the E. coli genome. True promoter locations are shown in red. More detail can be found in this notebook.
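A minimal sketch of the chunking step follows. It assumes fixed-size overlapping chunks; the chunk size and step below are illustrative values, not the settings used in the notebook, and `predict` is again a hypothetical callable that scores one chunk. A toy GC-content scorer stands in for the trained model.

```python
def chunked_predictions(sequence, predict, chunk_size=100, step=50):
    """Score overlapping chunks of a long sequence.

    Returns a list of (start_position, score) pairs suitable for plotting
    score against position. `predict` is a hypothetical stand-in for a
    trained classifier; chunk_size and step are illustrative defaults.
    A sequence shorter than chunk_size is scored as a single chunk.
    """
    last_start = max(len(sequence) - chunk_size, 0)
    results = []
    for start in range(0, last_start + 1, step):
        chunk = sequence[start:start + chunk_size]
        results.append((start, predict(chunk)))
    return results


def gc_fraction(seq):
    """Toy scorer for demonstration: fraction of G and C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


# 200 bp toy sequence: an AT-rich half followed by a GC-rich half.
toy_genome = "AT" * 50 + "GC" * 50
profile = chunked_predictions(toy_genome, gc_fraction, chunk_size=100, step=50)
```

A smaller step gives finer positional resolution at the cost of more model calls, since the number of chunks grows roughly as the sequence length divided by the step.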
There are a number of other genomic classification domains I intend to explore when time permits.
- Classification from raw NGS data
I'm planning on doing a more structured literature review of deep learning methods for genomic classification and how they compare to Genomic-ULMFiT. For now, here are links to relevant papers.