In [1]:
import time
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

In this notebook we go through the process of training a recurrent neural network (RNN) to learn the same distribution that our toy HMM follows.

First, some helper functions to encode and decode sequences. Remember, `assert` statements are little tests to make sure that things are working the way we expect. They are *very* helpful for catching silly bugs.

In [2]:
states = ['H', 'L']
nucleotides = ['A', 'C', 'G', 'T']

def encode_seq(symbols, seqtype='dna'):
    encdr = nucleotides
    if seqtype != 'dna':
        encdr = states
    outseq = np.array([encdr.index(s) for s in symbols])
    return outseq

test_hl = 'HHHLLL'
test_nuc = 'GGGAAA'
assert encode_seq(test_hl, seqtype='states')[0] == states.index(test_hl[0]) and \
       encode_seq(test_hl, seqtype='states')[-1] == states.index(test_hl[-1])
assert encode_seq(test_nuc, seqtype='dna')[0] == nucleotides.index(test_nuc[0]) and \
       encode_seq(test_nuc, seqtype='dna')[-1] == nucleotides.index(test_nuc[-1])

In [3]:
def decode_seq(num_array, seqtype='dna'):
    encdr = nucleotides
    if seqtype != 'dna':
        encdr = states
    outseq = [encdr[s] for s in num_array]
    return ''.join(outseq)

assert decode_seq(encode_seq(test_nuc)) == test_nuc
assert decode_seq(encode_seq(test_hl, seqtype='hid'), seqtype='hid') == test_hl

Now we read in the training data emitted by the HMM in notebook 0. Let's split it in half, so that we have some sequences for testing.

In [4]:
training_data_file = 'rnn_toy_training.tsv'
training_df = pd.read_csv(training_data_file, sep='\t')
training_df = training_df.head(50)
testing_df = training_df.tail(50)
training_df.head(2)

Unnamed: 0,dna,hidden_state
0,TGGTCGTATTTTGTCGGGGGCAGACCAAAAAACAACGAAACGAATG...,LLLLLLLHHLLLHHLHHHLLHLLHHLLHLLHLHLLLLLHHHHLLLH...
1,GCACGGTGGATGTATCGCTGTGCAAGCAAGCCGGGATACTGCTTGT...,HHHHHHHHHHLHLHHHLLLHLHHLHHHHLLHLHLLLHHLLLLHHLH...


Our data set will consist of PyTorch tensors. These are simply numerical matrices (like you would find in Numpy or MATLAB or R), but they include the ability to track gradients.

We have 50 training example sequences, each of length 500. We first encode the strings into integers.

In [5]:
SEQ_LEN = training_df.dna.str.len().max()
NUM_SEQS = training_df.shape[0]
BATCH_SIZE = 1

X_train = torch.zeros(NUM_SEQS, SEQ_LEN, dtype=torch.long)
Y_train = torch.zeros(NUM_SEQS, SEQ_LEN, dtype=torch.long)
for i, row in training_df.iterrows():
    dna = row['dna']
    hid = row['hidden_state']
    dna_encode = torch.LongTensor(encode_seq(dna, seqtype='dna'))
    hid_encode = torch.LongTensor(encode_seq(hid, seqtype='hid'))

    X_train[i, :] = dna_encode
    Y_train[i, :] = hid_encode

In [6]:
X_train.shape, Y_train.shape

(torch.Size([50, 500]), torch.Size([50, 500]))

Next, we load the data into a `Dataset` and `Dataloader` module. This is not strictly necessary for training, but it makes it easier to shuffle, sample from and batch the data. When projects get more complicated, these modules are very helpful.

In [7]:
train_data = TensorDataset(X_train, Y_train)
train_loader = DataLoader(train_data, shuffle=True, batch_size=BATCH_SIZE)

Here we make sure the data comes back out of the dataloader in the way we expect. The `batch_size` refers to the number of examples sampled simultaneously. In this case, we only retrieve one example sequence at a time. 

In [8]:
# Make sure the data comes back out in the way we expect
train_features, train_labels = next(iter(train_loader))
train_features.shape, train_labels.shape

(torch.Size([1, 500]), torch.Size([1, 500]))

Here we set up a [python class](https://www.geeksforgeeks.org/python-classes-and-objects/) to manage the various pieces of our model. Again, this is not strictly necessary, but it makes life simpler. For example, the class will track useful information internally (e.g. `input_size`, `n_layers`). We can also create convenience functions such as `init_hidden()` to create a fresh hidden layer without having to remember what the precise dimensions ought to be. 

There are two pieces to our class. The `__init__()` function that is run when we first instantiate an instance of `MyGruClass`. This is where we initialize variables with the correct values, and instantiate the machine learning layers. A Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN). Notice the Tensorflow GRU module is called here (which is the heart of our model). We also include an ["embedding" layer](https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526), which learns to represent our integer inputs (remember we converted our sequence of characters into a sequence of integers using the `encode_seq()` function) as a vector of foating point numbers. That vector of floating point numbers serves as the input to our GRU. Finally, we instantiate a `Linear` and `LogSoftMax` layer, both of which convert the GRU output into a sequence of log probabilities. 

The `forward()` function is where we actually *use* the layers we created with `__init()`. In `forward()` we take an input and cascade it through the layers to produce an output. The key to any kind of deep learning project is to carefully track the input and output dimensions of your layers. Notice my comments to help myself mentally track what each layer is spitting out, and what the next layer expects. It helps to be aware of the [`transpose()` function](https://pytorch.org/docs/stable/generated/torch.transpose.html), which allows you to rotate two dimensions. This helps to match a tensor with the input expections of a layer. 

In [9]:
class MyGruClass(nn.Module):
    def __init__(self, input_size, hidden_size, predict_size, n_layers=1, bdir=False):
        super(MyGruClass, self).__init__()
        self.input_size = input_size
        self.embed_size = input_size
        self.hidden_size = hidden_size
        self.predict_size = predict_size
        self.n_layers = n_layers
        self.n_directions = 2 if bdir else 1
        
        self.embedding = nn.Embedding(input_size, self.embed_size)
        self.gru = nn.GRU(self.embed_size, 
                          hidden_size, 
                          num_layers=n_layers, 
                          bidirectional=bdir)

        self.lin_out = nn.Linear(hidden_size*self.n_directions, predict_size)
        self.sigmoid = nn.LogSoftmax(dim=2)
        
    def forward(self, input, hidden):
        embedded = self.embedding(input)
        # embedding shape: (batch_size, seq_len, hidden_size)
        # transpose so that batch dim is in the 2nd index position
        output = torch.transpose(embedded, 0, 1)
        
        output, hidden = self.gru(output, hidden)
        # output shape: (seq_len, batch_size, n_directions*hidden_size)
        # hidden shape: (n_directions*n_layers, batch_size, hidden_size)
        
        output = self.sigmoid(self.lin_out(output))
        return output, hidden

    def init_hidden(self, batch_size=1):
        return torch.zeros(self.n_layers*self.n_directions, 
                           batch_size, 
                           self.hidden_size)

    def input_dims(self):
        print(f'Input dimensions are: (batch_size, seq_len, {self.input_size})')
    
    def output_dims(self):
        print(f'Output dimensions are: (seq_len, batch_size, {self.predict_size})')
    
    def hidden_dims(self):
        dnl = self.n_layers*self.n_directions
        print(f'Hidden dimensions are: ({dnl}, batch_size, {self.hidden_size})')

In [10]:
# Using convenience functions in MyGruClass to print expected dimensions
test_model = MyGruClass(len(nucleotides), 10, len(states))
test_model.input_dims(), test_model.hidden_dims(), test_model.output_dims();

Input dimensions are: (batch_size, seq_len, 4)
Hidden dimensions are: (1, batch_size, 10)
Output dimensions are: (seq_len, batch_size, 2)


The `train()` function is where the action happens. Here we instantiate our model, and start feeding data to it. We select the Negative Log-Likelihood Loss `NLLLoss()` [function](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) as our optimization function because it determines whether a log probality (the output of our GRU) correctly classifies the input. In other words, the more probability the model assigns to the correct category, the more the the model is "rewarded". 

In our case, the model will output a vector of two probabilities (log transformed), one for `H` and one for `L`. For every input nucleotide, the model will assign the probability that it corresponds to a "high GC" region or a "low GC" region. The training data also has the answer generated by the HMM to which the model can compare its prediction. The more probability assigned to the correct category, the more the current model weights are reinforced.

There are many loss functions, all intended for different scenarios or that have different emphases. When starting a new project, it's worth reviewing the [available loss functions](https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7) to pick the one or two that seem most appropriate.

PyTorch also offers [multiple optimization algorithms](https://pytorch.org/docs/stable/optim.html). For this project we went with a common default: Adam. 

Having initialized training data, a model, a loss function, and an optimizer, we are ready to learn. The training process proceeds to loop through the data set in a random order (because we set the `shuffle` parameter on our Dataloader to `True`). A full loop through the data is called an "epoch". Within each epoch, we iterate through every training batch (in this case, batches are just one sequence long). Before feeding a sequence to our RNN, we reset the hidden state and the gradient. We then feed the training sequence to our model, and collect the prediction output. The output is then reorganized (using `transpose()`) to match the input expections of our NLLLoss function. Once we get a loss value, we call `backward()` to calculate derivatives, and which are fed to the optimization function, which updates the model weights. It's remarkable how much PyTorch keeps track of for us.

Finally, there are some print statements to keep track of where we are in the loop, and whether the model is continuing to improve or not.

In [11]:
def train(train_loader, 
          learn_rate=0.02, 
          input_dim=len(nucleotides), 
          hidden_dim=10,
          output_dim=len(states),
          batch_size=1,
          EPOCHS=5):
    
    # Instantiating the model
    model = MyGruClass(input_dim, hidden_dim, output_dim)
    
    # Defining loss function and optimizer
    criterion = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)
    
    model.train()
    print("Starting training")
    epoch_times = []
    # Start training loop
    for epoch in range(1, EPOCHS+1):
        start_time = time.time()
        avg_loss = 0
        for sample_x, sample_y in train_loader:
            h = model.init_hidden(batch_size)
            model.zero_grad()
            
            # the heart of the training!
            out, h = model(sample_x, h)
            
            # NLLLoss expects batch first, then class probabilities, then seq_len
            out_T = torch.transpose(out, 0, 1)
            out_T = out_T.transpose(1, 2)

            loss = criterion(out_T, sample_y)
            loss.backward()
            optimizer.step()
            avg_loss += loss.item()
            
        current_time = time.time()
        print(f"Epoch {epoch}/{EPOCHS} Done, Total Loss: {avg_loss/len(train_loader):.3f}")
        print(f"Total Time Elapsed: {current_time-start_time:.1f} seconds")
        epoch_times.append(current_time-start_time)
    print(f"Total Training Time: {sum(epoch_times):.1f} seconds")
    return model

And now, for the big moment. We train the model! 

This will take a about 30 seconds to run.

In [12]:
gru_model = train(train_loader, learn_rate = 0.02, EPOCHS=5)

Starting training
Epoch 1/5 Done, Total Loss: 0.671
Total Time Elapsed: 5.1 seconds
Epoch 2/5 Done, Total Loss: 0.667
Total Time Elapsed: 5.1 seconds
Epoch 3/5 Done, Total Loss: 0.668
Total Time Elapsed: 5.1 seconds
Epoch 4/5 Done, Total Loss: 0.667
Total Time Elapsed: 5.1 seconds
Epoch 5/5 Done, Total Loss: 0.667
Total Time Elapsed: 5.1 seconds
Total Training Time: 25.6 seconds


The model does improve the loss function in the first three epochs, then plateaus.

Let's see how the GRU's predictions compare to a real example!

First, we need a function to handle all the steps of encoding and decoding the output and resetting the model. This function outputs both the predicted state (a string of H's and L's), and the associated probabilities.

In [13]:
def predict(model, dna):
    assert all([x in nucleotides for x in dna])
    assert isinstance(model, MyGruClass)
    dna_encode = torch.LongTensor(encode_seq(dna, seqtype='dna'))
    dna_encode = dna_encode[None, :]
    h = model.init_hidden(1)
    model.zero_grad()
    out, _ = model(dna_encode, h)
    out_state_indices = [int(torch.argmax(x)) for x in out[:,0]]
    out_probs = np.array([torch.exp(x).detach().numpy() for x in out[:,0]])
    state = decode_seq(out_state_indices, 'hid')
    return state, out_probs

test_seq = 'GGGTTT'
test_state, test_probs = predict(gru_model, test_seq)
assert len(test_state) == len(test_seq)
assert all([x in states for x in test_state])

Here we pick a sequence from our testing data that the model has never seen. We can align the prediction and the HMM generated sequence to see how closely they agree.

In [14]:
test_nucleotides = testing_df.iloc[-1, 0]
test_hlseq = testing_df.iloc[-1, 1]
pred_hl, pred_prob_hl = predict(gru_model, test_nucleotides)

def align(seq1, seq2, WIDTH=60):
    '''Align two input sequences of equal length,
    with *  between indicating mismatches.'''
    lines = int(np.ceil(len(seq1) / WIDTH))
    match = ''
    for i, c1 in enumerate(seq1):
        indicator = ' '
        if c1 != seq2[i]:
            indicator = '*'
        match += indicator
    
    for i in range(lines):
        print('Seq1', seq1[i*WIDTH:i*WIDTH+WIDTH])
        print('    ', match[i*WIDTH:i*WIDTH+WIDTH])
        print('Seq2', seq2[i*WIDTH:i*WIDTH+WIDTH])
        print()

align(test_hlseq, pred_hl)

Seq1 HLLLLLLLLHLLHLHHLLLLLLHLLLHHLHHLHLLLLHLLLLHHHLLLLHLHLLLLLLLH
       *  *     *  ***  ****   *   *   ** *  *      *     **** * 
Seq2 HLHLLHLLLHLHHLLLHLLHHHLLLLLHLHLLHLHHLLLLHLHHHLLHLHLHLHHHHLHH

Seq1 HHLLLHHLHHLHLHLLHHHLHHHLHHHHHLLLLLHLLHHHLLHLLHHLLLHHLHHHLLLH
     ***   ** ** *  ** * * *    *   *   *   *  **     ** *       
Seq2 LLHLLHLHHLHHHHLHLHLLLHLLHHHLHLLHLLHHLHHLLLLHLHHLLHLHHHHHLLLH

Seq1 LLLLLLHHHHLLLHLLLHHHHLLHLLLLLLLHHHLLLLLLHHLLHHLLLLLHHHHLHHLH
      ** ***** * *    **  **   *  *    ***** * **  * *  ** * *   
Seq2 LHHLHHLLLHHLHHLLLLLHHHHHLLHLLHLHHHHHHHHLLHHHHHHLHLLLLHLLLHLH

Seq1 LHHHLHHHHHLLLLLLHHHHLHHLHLLLHLLLLLLLHHHLHHHLLHLLHLHHHLLLHLLL
       ** ** *    *      ** ** *   * *   * *  *      **  * * **  
Seq2 LHLLLLLHLHLLLHLLHHHHHLHHLLHLHLHLHLLLLHLLHLHLLHLLLHHHLLHLLHLL

Seq1 HHHLHLHLLLLHHHHHHHHHHHHLHHLLLLLLHLLHHHLHLHHHLLHLHHHHHLLHHLHH
      ** *       *   * *   *    *    *  *   ****  **   **  ** *  
Seq2 HLLLLLHLLLLHLHHHLHLHHHLLHHLHLLLLLLLLHHLLHLLHLHLLHHLLHLHLHHHH

Seq1 

In [15]:
def fraction_matches(seq1, seq2):
    '''Compare sequences in terms of matching positions.'''
    matches = 0
    for i, c1 in enumerate(seq1):
        if c1 == seq2[i]:
            matches += 1
    return {'length':len(seq1), 
            'n_matches':matches, 
            'n_mismatches':len(seq1) - matches, 
            'fraction_matches':matches/len(seq1)}

def sim_prob_match(fraction, symbols, length, n_draws=1000):
    '''Simulate n_draws pairs of random sequences composed of symbols.
    Estimate likelihood of matching at or above the input fraction 
    of positions.'''
    fraction_distribution = []
    for i in range(n_draws):
        seqs = np.random.choice(symbols, size=[2, length], replace=True)
        match = fraction_matches(''.join(seqs[0,:]), ''.join(seqs[1,:]))
        fraction_distribution.append(match['fraction_matches'])
    tail = np.sum(np.array(fraction_distribution) > fraction)
    max_fraction = np.max(np.array(fraction_distribution))
    prob_of_random_match_fraction = tail / n_draws
    display_p = f'{prob_of_random_match_fraction:.2f}'
    if prob_of_random_match_fraction == 0:
        display_p = f'<{1.0/n_draws}'
    return {'pval':prob_of_random_match_fraction,
            'max_fraction':max_fraction,
            'display_p':display_p}

fm = fraction_matches(test_hlseq, pred_hl)
spm = sim_prob_match(0.61, states, length=500, n_draws=1000)

fm['prob_of_rand_match_fraction'] = spm['display_p']
fm

{'length': 500,
 'n_matches': 305,
 'n_mismatches': 195,
 'fraction_matches': 0.61,
 'prob_of_rand_match_fraction': '<0.001'}

305 out of 500 positions match (61%). But 195 do not match. Is that good? Did the model learn anything? Is it approximating our HMM better than a random H and L generator would?

Keep in mind that the HMM training examples were draw probabalistically, so there is noise in the training data to begin with. That is, we don't expect perfect alignment. However, it is reasonable to simulate how often random draws of H and L sequences of length 500 would be expected to match at 61% or better. It turns out that even simulating 1000 random H and L sequence pairs, the best matches never exceed ~58%. 

So, 61% match is exceedingly rare by random chance. We can safely conclude that our deep network learned to approximate our HMM.

It's also interesting to see whether the probabilities are more or less "confident" at matches vs mismatches.

In [16]:
def print_probs(true_seq, pred_seq, prob, symbols, n=15):
    print('Pos True Pred Prob:(' + ' '.join(symbols) + ')')
    for i in range(n):
        ind = ' '
        if true_seq[i] != pred_seq[i]:
            ind = '*'
        prob_str = ''
        for j in range(len(symbols)):
            prob_str += f'{pred_prob_hl[i, j]:.2f} '
        print(f'{i}   {true_seq[i]}  {ind} {pred_seq[i]}    ' + prob_str)

print_probs(test_hlseq, pred_hl, pred_prob_hl, states)

Pos True Pred Prob:(H L)
0   H    H    0.57 0.43 
1   L    L    0.34 0.66 
2   L  * H    0.56 0.44 
3   L    L    0.36 0.64 
4   L    L    0.33 0.67 
5   L  * H    0.53 0.47 
6   L    L    0.33 0.67 
7   L    L    0.31 0.69 
8   L    L    0.31 0.69 
9   H    H    0.51 0.49 
10   L    L    0.32 0.68 
11   L  * H    0.55 0.45 
12   H    H    0.59 0.41 
13   L    L    0.36 0.64 
14   H  * L    0.33 0.67 


Based on the first 15 positions, the mismatches don't seem any less "confident" than the matches. For example, at position 0 the model correctly predicted "H" with 57% confidence. At position 2, the model incorrectly predicted "H", but still had 58% confidence.