# Genomic-ULMFiT Methods 0: Data Representation, Model Architecture, Regularization and Training

Karl Heyer

## Introduction

Genomic-ULMFiT (G-ULMFiT) is a method for training deep genomic sequence classification models that shows competitive or improved performance over previously published results. This technique allows us to solve problems like:
 * Does this genomic sequence contain a promoter?
 * Is this RNA sequence coding RNA or non-coding RNA?
 * What genus does this sequencing read belong to?

This method is based on ULMFiT [1] - a transfer learning method for NLP tasks. Transfer learning is the process of taking a model trained to solve one task as the basis for solving another task. This means that rather than training a model from scratch for the second task, we initialize the model with weights learned from the initial task, then *fine tune* the weights towards the second task. Transfer learning has been extensively applied in the field of computer vision. For example, one might train a classification model on ImageNet data, then fine tune that model for classification of satellite imagery. 

Transfer learning in NLP domains has historically been restricted to embeddings. NLP models would have embedding layers initialized with pretrained weights from word2vec, GloVe, or some other source. ULMFiT extends this to transferring learned weights for multiple layers, to great effect. Importantly, the initial model is trained in an unsupervised fashion before being fine tuned for supervised classification on a labeled dataset. This means that our model performance is not restricted by the availability of labeled data. From a genomics perspective, this allows us to train the initial model on large amounts of unlabeled sequence data, then transfer the learned weights to a classification model that is fine tuned on a smaller dataset. This allows G-ULMFiT to leverage the huge amount of unlabeled sequence data available to produce accurate results on small labeled datasets using __only genomic sequences as input__. The initial model is also general and reusable. It can be fine tuned towards any number of classification tasks and datasets.

This document covers the theory behind G-ULMFiT and practical considerations for structuring data and training models. This document is written with the following goals in mind:
   * Cover the theory of the ULMFiT process and considerations taken applying it to genomic data
   * Explain the model architectures used, the theory behind them, and important hyperparameters
   * Describe practical methods for training models quickly and achieving high performance, including regularization and learning rate scheduling
    
This document is not intended to be a code walkthrough. Relevant code is shown in the notebooks in the [E. coli](https://github.com/kheyer/Genomic-ULMFiT/tree/master/Bacteria/E.%20Coli) directory.

This document is structured in the following sections:
   * Genomic Sequence Data Representation details preprocessing steps of preparing genomic data before sending it as input to a model
   * ULMFiT Overview describes the overall method and the template for what we want to achieve
   * Model Architecture covers Genomic Language Models, Genomic Classification Models, the layers that comprise them, and important hyperparameter choices in model design
   * Regularization details the many types of regularization at play in the model and how to tune them effectively
   * Training covers the ULMFiT process in detail, as well as learning rate schedules and training phases


## Table of Contents
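1. Genomic Sequence Data Representation
    * 1.1 Genomic Tokenization
    * 1.2 Genomic Numericalization
    * 1.3 Practical Tokenization
2. ULMFiT Overview
    * 2.1 General-Domain Language Model Training
    * 2.2 Task Specific Language Model
    * 2.3 Task Specific Classification Model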

## 1. Genomic Sequence Data Representation

If we want to train a sequence model on genomic data, the first thing we need to figure out is how to process the data into a form that can be used by a neural network. We need to turn sequence data into a numerical form that can be manipulated mathematically. We do this in two steps - __tokenization__ and __numericalization__.

Tokenization is the process of breaking the sequence down into sub-units or tokens. Numericalization is the process of mapping tokens to integer values.

### 1.1 Genomic Tokenization

How do we break genomic data into tokens? A common way is to tokenize by nucleotide [2,3,4,5,6,7]. Single nucleotide tokenization would process the sequence `ATCGCGTACG` into `A T C G C G T A C G`. This works, but it gives a very restricted representation of genomic sub-units. This looks at every nucleotide in a vacuum. It essentially says every `A` should be treated the same, regardless of where it appears or in what context it appears.

A representation that allows for more nuance is to tokenize by k-mers instead. This approach has been used by [8,9,10]. We could tokenize the sequence `ATCGCGTACGATCCG` into:
 * 3-mers: `ATC GCG TAC GAT CCG`
 * 4-mers: `ATCG CGTA CGAT`
 * 5-mers: `ATCGC GTACG ATCCG`
 
Or some other k-mer size. Notice in the above that the sequence is truncated to the last whole k-mer. 

Another parameter in tokenization is the stride between k-mers. Stride is defined as the frame shift between k-mers relative to the sequence being tokenized. In the above example, there was no overlap between k-mers, but this does not have to be the case. Consider tokenizing the sequence `ATCGCGTACGATCCG` with a k-mer size of 4 and the following stride values:
 * Stride 1: `ATCG TCGC CGCG GCGT CGTA GTAC TACG ACGA CGAT GATC ATCC TCCG`
 * Stride 2: `ATCG CGCG CGTA TACG CGAT ATCC`
 * Stride 3: `ATCG GCGT TACG GATC`
 * Stride 4: `ATCG CGTA CGAT`
 
Notice how the stride parameter affects the number of tokens created per length of input sequence. The impact of the choice of k-mer and stride values is discussed more in Section 5.7: Language Model Training. For now, understand that k-mer and stride values are hyperparameters that must be decided before training begins, and that the choice of k-mer and stride has an effect on compute time and performance.
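The k-mer and stride behavior described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the project notebooks; the function names `tokenize` and `token_count` are hypothetical, and the sketch assumes the input is a plain string of nucleotides.

```python
def tokenize(sequence: str, k: int, stride: int) -> list[str]:
    """Split a sequence into k-mers, moving `stride` bases between tokens.

    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive integers")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def token_count(length: int, k: int, stride: int) -> int:
    """Number of k-mers produced from a sequence of the given length."""
    if length < k:
        return 0
    return (length - k) // stride + 1


seq = "ATCGCGTACGATCCG"
print(tokenize(seq, k=4, stride=4))  # non-overlapping 4-mers
print(tokenize(seq, k=4, stride=2))  # overlapping 4-mers
print(token_count(len(seq), k=4, stride=1))
```

For a sequence of length $L$, this produces $\lfloor (L - k) / s \rfloor + 1$ tokens for k-mer size $k$ and stride $s$, which is why a smaller stride means more tokens per sequence.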

### 1.2 Genomic Numericalization

Once we have decided on the k-mer and stride values to use in our tokenization, numericalizing is easy. We simply create a dictionary mapping each unique k-mer to an integer value. This creates the __vocabulary__ of the model - the total set of possible tokens. For a given k-mer length $k$ over the four nucleotides, the vocabulary will contain $4^k + 1$ tokens. The $+ 1$ comes from adding a padding token, which will be important for batching sequences of different length. So for tokenization and numericalization of a sequence with k-mer length 3 and stride 2, we might see the following:

Sequence: `ATCGCGTACGATCCG`

Tokenization: `ATC CGC CGT TAC CGA ATC CCG`

Numericalization: `[5, 12, 8, 32, 27, 5, 14]`

Where the final numericalized list is the input to the model. Once a numericalized input is sent to the model, it is turned into a vector representation via an embedding. The integer values of the numericalized input correspond to rows in the embedding matrix. This is discussed further in Section 3.2: Embeddings.
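As a rough sketch of the vocabulary and numericalization steps, the snippet below builds the $4^k + 1$ token vocabulary and maps tokens to integers. The names `PAD`, `build_vocab`, `numericalize` and `pad_batch` are hypothetical, and the index assignment (padding at 0, k-mers in alphabetical order) is an arbitrary assumption, so the integers it prints will not match the illustrative list above. Padding on the right is also an assumption; a library may pad on the left instead.

```python
from itertools import product

PAD = "<pad>"  # hypothetical name for the padding token


def build_vocab(k: int) -> dict[str, int]:
    """Map the padding token and every possible k-mer to an integer index."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: idx for idx, token in enumerate([PAD] + kmers)}


def numericalize(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Replace each token with its vocabulary index.

    Raises KeyError for tokens outside the vocabulary (e.g. k-mers containing N).
    """
    return [vocab[token] for token in tokens]


def pad_batch(batch: list[list[int]], pad_idx: int) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_idx] * (longest - len(ids)) for ids in batch]


vocab = build_vocab(k=3)
print(len(vocab))  # 4**3 + 1 entries

ids = numericalize(["ATC", "CGC", "CGT", "TAC", "CGA", "ATC", "CCG"], vocab)
print(ids)  # the repeated token ATC maps to the same integer both times
print(pad_batch([ids, ids[:3]], pad_idx=vocab[PAD]))
```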

### 1.3 Practical Tokenization

In practice I find k-mer lengths from 3-5 and stride values of 1-2 work best.

## 2. ULMFiT Overview

Now that we can process genomic data into a form we can feed to a model, we need to determine our strategy for training the model. Let's start by defining our end goal: we want to train a sequence model to classify genomic sequences using sequence input alone. This poses a potential problem. Sequence models tend to require a large amount of data to train effectively, and labeled genomic classification datasets can be small. The ULMFiT approach provides a solution to this. ULMFiT breaks training into three stages:

1. First we train a general domain language model in an unsupervised fashion on a large unlabeled corpus
2. We fine tune the general language model on the classification corpus to create a task specific language model
3. We fine tune the task specific language model for classification

Before going further, let's define the two types of models we will deal with. A __Language Model__ is a model that takes in a sequence of k-mer tokens and predicts the next token in the sequence. A __Classification Model__ is a model that takes in a sequence of tokens and predicts what category or class that sequence belongs to.

A language model is trained in an unsupervised fashion, meaning that no labeled data is required. Since the goal of the language model is to predict the next k-mer in a sequence, each k-mer becomes a correct output prediction for the sequence that precedes it. This means we can generate huge amounts of paired data (input sequence + next k-mer) from any unlabeled genomic sequence.
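To make the "input sequence + next k-mer" pairing concrete, here is a small sketch that derives training pairs from a numericalized sequence. The function names are hypothetical. The second function shows the compact form commonly used for language model training, where the targets are simply the inputs shifted by one position; how batches are actually arranged in the project code is not shown here and is an assumption.

```python
def next_token_pairs(token_ids: list[int]) -> list[tuple[list[int], int]]:
    """Pair every non-empty prefix of the sequence with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def shifted_targets(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Compact equivalent: inputs and targets offset by one position."""
    return token_ids[:-1], token_ids[1:]


token_ids = [5, 12, 8, 32, 27, 5, 14]  # the numericalized example from Section 1.2

for prefix, target in next_token_pairs(token_ids):
    print(prefix, "->", target)

inputs, targets = shifted_targets(token_ids)
print(inputs)
print(targets)
```

A sequence of $n$ tokens yields $n - 1$ pairs with no labeling effort, which is where the large volume of training data comes from.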

A classification model is trained in a supervised fashion, requiring paired labeled data. For example if the task is promoter classification, all sequences in the classification dataset must be labeled as 0 or 1 for not-promoter or promoter.

The architectures for the Classification Model and the Language Model follow similar structures - they consist of an __Embedding__, an __Encoder__, and a __Linear Head__. On a high level, these layers function in the following ways:

 * Embedding: Converts the numericalized tokens of the input sequence into vector representations
 * Encoder: Processes the vectorized sequence into a hidden state
 * Linear Head: Uses the hidden state to make a prediction (the next k-mer for a Language Model, the class for a Classification Model)
 
When we move between stages, we transfer the learned weights from one model to the next. When we move from the general domain language model to the task specific language model, we transfer all three sections of the model. When we transfer to the classification model, we only transfer the Embedding and the Encoder, as the classification model requires a different linear head. Visually:

![](media/ulmfit1.png)

(Black arrows show transfer learning)
 
Model architecture is discussed in detail in Section 3.

### 2.1 General-Domain Language Model Training

In the first stage of training, we want to train a general genomic language model on a large unlabeled corpus. The key detail here is that the training corpus used in this step can be any corpus in a similar domain to the classification corpus. So if for example we wanted to classify human genome sequences as either promoter or not-promoter, we could use the entire human genome to train the general domain language model. We could even go further and use an ensemble of genomes from animals phylogenetically similar to humans. This allows us to generate large amounts of training data very easily.

The general domain language model will form the basis for all subsequent models. For this reason, we want the general domain model to be well trained. But what exactly does this mean? We want a model that understands the structure of genomic data and is able to pull meaning from nucleotide sequences. We use the language modeling task (predicting the next k-mer) as a proxy for this. In practice, I have seen that improving the general domain language model has a direct impact on improving performance of the classification model downstream. For this reason, it is worth investing in training a high performing general domain language model. Consequently, this step is actually the most time consuming step of the process. I typically invest 12+ hours into training the general domain language model, compared to 1-4 hours for fine tuning the task specific language model and 0.25-1 hours for training the classification model.

While training the general domain language model is computationally intensive, it only needs to be done once. If you train a human genome language model, that language model can be fine tuned for any number of downstream tasks. This means the general domain language model has a high return on investment of compute time.

### 2.2 Task Specific Language Model

Once we have the general model trained, we want to fine tune it to the classification corpus to create a task specific language model. This is because no matter how general the general domain language model is, the classification dataset likely comes from a different distribution [1]. If we have a classification dataset specifically curated for a set of recognizable genomic classes, there are likely motifs and other structures in the sequence data that are more significant in the context of the classification dataset than in the general domain corpus. 

The task specific language model is initialized with the weights of the general domain language model. The full model (Embedding + Encoder + Linear Head) is transferred.

### 2.3 Task Specific Classification Model

Once we have trained the task specific language model, we can attempt our original goal: training a classification model. This model is initialized using the Embedding and the Encoder of the task specific language model. The Linear Head is not transferred, as the final prediction task has changed. The Language Model produces a prediction vector with length equal to the size of the k-mer vocabulary (predicting the next k-mer), while the Classification Model produces a prediction vector with length equal to the number of classes in the classification dataset.
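The selective transfer described here can be illustrated with plain dictionaries standing in for model weights. This is only a schematic sketch: the parameter names (`embedding.weight`, `encoder.weight`, `head.weight`) and the `transfer_weights` helper are hypothetical, and a real implementation would copy weight tensors through the deep learning framework in use rather than strings.

```python
def transfer_weights(source: dict, target: dict, prefixes: tuple[str, ...]) -> dict:
    """Copy parameters whose names start with one of `prefixes` from source.

    Parameters of `target` that are not matched keep their fresh initialization.
    """
    merged = dict(target)
    for name, value in source.items():
        if name.startswith(prefixes):
            merged[name] = value
    return merged


vocab_size, n_classes = 4 ** 3 + 1, 2  # 3-mer vocabulary, binary classification

# Stand-ins for real weight tensors: each entry records where the weights came from.
task_lm = {
    "embedding.weight": "from task LM",
    "encoder.weight": "from task LM",
    "head.weight": f"from task LM, {vocab_size} outputs",
}
classifier = {
    "embedding.weight": "random init",
    "encoder.weight": "random init",
    "head.weight": f"random init, {n_classes} outputs",
}

# Only the Embedding and Encoder are transferred; the head keeps its own shape.
classifier = transfer_weights(task_lm, classifier, prefixes=("embedding.", "encoder."))
print(classifier)
```

Moving from the general domain language model to the task specific language model would correspond to passing all three prefixes, since both heads predict over the same k-mer vocabulary.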

Initializing the classification model with the embedding and encoder of the task specific language model allows the classification model to train much faster while also being more robust to overfitting compared to training a model from scratch. Empirically I have found that models trained using transfer learning require much less regularization than models trained from scratch. This performance boost from pre-training and transfer learning is what allows ULMFiT to work so effectively on small datasets.

Transfer learning is an extremely important step for getting high quality results from training on small datasets. Transfer learning for deep learning genomics models has been done [3,7,11,12], but it is far from common. Many published methods train from scratch. I would expect pre-training to provide a general improvement to many methods.