# Week Five Exercise: RNNs

## Data things

Today, we'll be doing language classification, using a subset of the data from the [Discriminating between Similar Languages (DSL) 2015 task](http://ttg.uni-saarland.de/lt4vardial2015/dsl.html). We'll only classify between four language varieties: `es-ES`, `es-AR`, `pt-PT`, and `pt-BR`.

In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

import unicodedata
import re
import random

lang_traindev = "../data/DSL-Task"

easy_label_map = {"es-ES":0, "es-AR":1, "pt-PT":2, "pt-BR":3}

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

def load_data(path):
    data = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Use .get() so that labels outside our four classes are skipped
            # instead of raising a KeyError.
            label = easy_label_map.get(fields[1])
            if label is None:
                continue
            data.append({"text": strip_accents(fields[0]), "label": label})

    # Shuffle with a fixed seed so the example order is reproducible.
    random.seed(1)
    random.shuffle(data)
    return data

training_set = load_data(lang_traindev + '/train_espt.txt')
dev_set = load_data(lang_traindev + '/devel_espt.txt')
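
To make the effect of `strip_accents` concrete, here is a small sketch that applies the same function to a few sample words. The sample strings are made up for illustration and are not taken from the DSL data.

```python
import unicodedata

def strip_accents(s):
    # NFD splits each accented character into a base letter plus combining
    # marks (Unicode category 'Mn'); dropping the marks leaves the base letter.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# Illustrative inputs only (not from the dataset).
samples = ["corazón", "São Paulo", "niño", "ação"]
stripped = [strip_accents(s) for s in samples]
print(stripped)  # ['corazon', 'Sao Paulo', 'nino', 'acao']
```

Note that this also folds `ñ` to `n` and `ç` to `c`, so the models below only ever see unaccented text.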

### Pad and Index Sequences
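We convert each sentence into a sequence of word indices of fixed length (`max_seq_length = 20`): longer sentences are truncated and shorter ones are right-padded with `<PAD>`. For speed, we only keep words that appear more than 10 times in the training set; all other words map to `<UNK>`. This results in a vocab size |V| = 22210.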

In [3]:
import collections
import numpy as np

PADDING = "<PAD>"
UNKNOWN = "<UNK>"
max_seq_length = 20

def tokenize(string):
    return string.split()

def build_dictionary(training_datasets):
    """
    Extract vocabulary and build dictionary.
    """  
    word_counter = collections.Counter()
    for i, dataset in enumerate(training_datasets):
        for example in dataset:
            word_counter.update(tokenize(example['text']))
    
    vocabulary = set([word for word in word_counter if word_counter[word] > 10])
    vocabulary = list(vocabulary)
    vocabulary = [PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))

    return word_indices, len(vocabulary)

def sentences_to_padded_index_sequences(word_indices, datasets):
    """
    Annotate each example with a fixed-length sequence of word indices,
    truncating long sentences and right-padding short ones.
    """
    for dataset in datasets:
        for example in dataset:
            example['text_index_sequence'] = torch.zeros(max_seq_length)

            token_sequence = tokenize(example['text'])

            for i in range(max_seq_length):
                if i >= len(token_sequence):
                    # Past the end of the sentence: pad on the right.
                    index = word_indices[PADDING]
                else:
                    # Out-of-vocabulary tokens map to <UNK>.
                    index = word_indices.get(token_sequence[i], word_indices[UNKNOWN])
                example['text_index_sequence'][i] = index

            example['text_index_sequence'] = example['text_index_sequence'].long().view(1,-1)
            example['label'] = torch.LongTensor([example['label']])


word_to_ix, vocab_size = build_dictionary([training_set])
sentences_to_padded_index_sequences(word_to_ix, [training_set, dev_set])

In [4]:
print(vocab_size)

22210
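
As a quick illustration of the padding and indexing scheme, the sketch below runs the same logic on a toy vocabulary. The helper `pad_and_index` and the dictionary `toy_vocab` are hypothetical names introduced here; they are not part of the exercise code.

```python
PADDING = "<PAD>"
UNKNOWN = "<UNK>"

def pad_and_index(text, word_indices, max_len):
    # Truncate to max_len tokens, map unknown words to <UNK>,
    # then right-pad with <PAD> up to max_len.
    tokens = text.split()[:max_len]
    ids = [word_indices.get(tok, word_indices[UNKNOWN]) for tok in tokens]
    ids += [word_indices[PADDING]] * (max_len - len(ids))
    return ids

# Toy vocabulary for illustration only.
toy_vocab = {PADDING: 0, UNKNOWN: 1, "el": 2, "gato": 3, "come": 4}

short = pad_and_index("el gato come pescado", toy_vocab, 6)
long_ = pad_and_index("el gato come el gato come el", toy_vocab, 6)
print(short)  # [2, 3, 4, 1, 0, 0]
print(long_)  # [2, 3, 4, 2, 3, 4]
```

The unknown word `pescado` becomes index 1, the short sentence is padded with index 0, and the long sentence is cut off after six tokens.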


### Batchify data
We want to feed data to our model in mini-batches, so we need a data iterator that will "batchify" the data.

In [5]:
# This is the iterator we'll use during training. 
# It's a generator that gives you one batch at a time.
def data_iter(source, batch_size):
    dataset_size = len(source)
    start = -1 * batch_size
    order = list(range(dataset_size))
    random.shuffle(order)

    while True:
        start += batch_size
        if start > dataset_size - batch_size:
            # Start another epoch.
            start = 0
            random.shuffle(order)   
        batch_indices = order[start:start + batch_size]
        yield [source[index] for index in batch_indices]

# This is the iterator we use when we're evaluating our model. 
# It gives a list of batches that you can then iterate through.
# Only full batches are kept, so up to batch_size - 1 examples are left out.
def eval_iter(source, batch_size):
    batches = []
    dataset_size = len(source)
    start = -1 * batch_size
    order = list(range(dataset_size))
    random.shuffle(order)

    while start < dataset_size - batch_size:
        start += batch_size
        batch_indices = order[start:start + batch_size]
        batch = [source[index] for index in batch_indices]
        if len(batch) == batch_size:
            batches.append(batch)
        
    return batches

# The following function gives batches of vectors and labels, 
# these are the inputs to your model and loss function
def get_batch(batch):
    vectors = []
    labels = []
    for example in batch:
        vectors.append(example["text_index_sequence"])
        labels.append(example["label"])
    return vectors, labels
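
To see how the training iterator behaves at epoch boundaries, here is a small stand-alone sketch with the same control flow as `data_iter`, run on a toy list of integers. The names `toy_data_iter` and `toy_source` are hypothetical, and the explicit `random.Random` instance is an addition made here so the sketch does not touch the global random state.

```python
import random
from itertools import islice

def toy_data_iter(source, batch_size, rng):
    # Same control flow as data_iter above, with an explicit random generator.
    order = list(range(len(source)))
    rng.shuffle(order)
    start = -batch_size
    while True:
        start += batch_size
        if start > len(source) - batch_size:
            # Not enough examples left for a full batch: start a new epoch.
            start = 0
            rng.shuffle(order)
        yield [source[i] for i in order[start:start + batch_size]]

toy_source = list(range(10))  # stand-in for a list of examples
batches = list(islice(toy_data_iter(toy_source, 4, random.Random(0)), 5))
for b in batches:
    print(b)
```

With 10 examples and a batch size of 4, each epoch yields 10 // 4 = 2 full batches, and the 2 leftover examples are skipped until the next shuffle.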




## Model time!

We'll build a simple Elman-style RNN in **Part 1**, and an RNN with LSTM units in **Part 2**.

### Part 1: Elman Network

Simple RNNs are finicky and sensitive to hyperparameter settings. Within 200 epochs, your model should surpass 80% accuracy on the training set, and accuracy on the **full** dev set should exceed 72%.

In a vanilla, Elman-style RNN you will
* Embed your words into an 8-dimensional vector space using an embedding matrix that has been randomly initialized.
* Then pass each word, in sequential order, into an RNN unit. In this unit,
    * The word embedding is concatenated with the previous hidden vector.
    * The concatenated vector is passed through an affine layer and a `tanh` non-linearity.
    * The output is a new hidden vector the size of your hidden dimension.
* Take the resulting hidden vector, `h_{t-1}`, and use it in the RNN unit for your next word, `x_t`.
* The final hidden vector, `h_n`, is passed through an affine layer to get an output with the desired dimensions.
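
In equations, the steps above can be sketched as follows. The symbols $E$ (embedding matrix), $W, b$ (the recurrent affine layer), and $U, d$ (the output affine layer) are notation introduced here for illustration, and $h_0$ is assumed to be a vector of zeros.

$$
\begin{aligned}
e_t &= E[x_t] \\
h_t &= \tanh\left(W\,[h_{t-1};\, e_t] + b\right), \qquad t = 1, \dots, n \\
y &= U\,h_n + d
\end{aligned}
$$

With an embedding size of 8, a hidden size of 24, and 4 labels, $W$ has shape $24 \times 32$ and $U$ has shape $4 \times 24$.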

In [6]:
class ElmanRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size, batch_size):
        super(ElmanRNN, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding_size = embedding_dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.batch_size = batch_size
        
        self.i2h = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.init_weights()
    
    def forward(self, x, hidden):
        x_emb = self.embed(x)                
        embs = torch.chunk(x_emb, x_emb.size()[1], 1)
        
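        def step(emb, hid):
            # Concatenate [h_{t-1}; x_t] and apply the affine layer + tanh.
            combined = torch.cat((hid, emb), 1)
            hid = torch.tanh(self.i2h(combined))
            return hid

        for i in range(len(embs)):
            # Each chunk is (batch, 1, emb); drop only the time dimension.
            hidden = step(embs[i].squeeze(1), hidden)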
        
        output = self.decoder(hidden)
        return output, hidden

    def init_hidden(self):
        h0 = Variable(torch.zeros(self.batch_size, self.hidden_size))
        return h0
    
    def init_weights(self):
        initrange = 0.1
        lin_layers = [self.i2h, self.decoder]
        em_layer = [self.embed]
     
        for layer in lin_layers+em_layer:
            layer.weight.data.uniform_(-initrange, initrange)
            if layer in lin_layers:
                layer.bias.data.fill_(0)

Let's define an **evaluation function**. We use the boolean argument `lstm` to indicate whether we're evaluating our `ElmanRNN` or the `LSTM` model that we'll build shortly.

In [7]:
# This function outputs the accuracy on the dataset, we will use it during training.
def evaluate(model, data_iter, lstm):
    model.eval()
    correct = 0
    total = 0
    for i in range(len(data_iter)):
        vectors, labels = get_batch(data_iter[i])
        vectors = Variable(torch.stack(vectors).squeeze())
        labels = torch.stack(labels).squeeze()
        
        if lstm:
            hidden, c_t = model.init_hidden()
            output, hidden = model(vectors, hidden, c_t)
        else:
            hidden = model.init_hidden()
            output, hidden = model(vectors, hidden)
        
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
      
    return correct / float(total)

# This function gives us the confusion matrix for all labels and the overall accuracy.
def evaluate_confusion(model, data_iter, lstm):
    model.eval()
    correct_all = 0
    correct = {}
    for lab in easy_label_map:
        correct[lab] = [0,0,0,0,0] #eses, esar, ptpt, ptbr, total
    total = 0
    for i in range(len(data_iter)):
        vectors, labels = get_batch(data_iter[i])
        vectors = Variable(torch.stack(vectors).squeeze())
        labels = torch.stack(labels).squeeze()
        
        if lstm:
            hidden, c_t = model.init_hidden()
            output, hidden = model(vectors, hidden, c_t)
        else:
            hidden = model.init_hidden()
            output, hidden = model(vectors, hidden)
        
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct_all += (predicted == labels).sum().item()

        for lab, lab_idx in easy_label_map.items():
            # Predictions for the examples in this batch whose gold label is `lab`.
            preds_for_lab = predicted[labels == lab_idx]
            for pred_idx in range(len(easy_label_map)):
                correct[lab][pred_idx] += (preds_for_lab == pred_idx).sum().item()
            correct[lab][-1] += preds_for_lab.size(0)

    # Outer keys are gold labels, inner keys are predicted labels.
    confusion = {}
    for gold in correct:
        confusion[gold] = {lab: correct[gold][idx] for lab, idx in easy_label_map.items()}

    return confusion, correct_all / float(total)

We now define our **training loop**. We're using the same boolean variable `lstm` here as well.

In [8]:
def training_loop(batch_size, num_epochs, model, loss_, optim, training_iter, dev_iter, train_eval_iter, lstm=False):
    step = 0
    epoch = 0
    total_batches = int(len(training_set) / batch_size)
    while epoch <= num_epochs:
        model.train()
        vectors, labels = get_batch(next(training_iter)) 
        vectors = Variable(torch.stack(vectors).squeeze()) # batch_size, seq_len
        labels = Variable(torch.stack(labels).squeeze())
    
        model.zero_grad()
        
        if lstm:
            hidden, c_t = model.init_hidden()
            output, hidden = model(vectors, hidden, c_t)
        else:
            hidden = model.init_hidden()
            output, hidden = model(vectors, hidden)

        lossy = loss_(output, labels)
        lossy.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optim.step()
        
        if step % total_batches == 0:
            if epoch % 5 == 0:
                print("Epoch %i; Step %i; Loss %f; Train acc: %f; Dev acc %f"
                      % (epoch, step, lossy.item(),
                         evaluate(model, train_eval_iter, lstm),
                         evaluate(model, dev_iter, lstm)))
            epoch += 1
        step += 1
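
The training loop rescales the gradients whenever their overall norm exceeds 5.0. As a rough sketch of what that operation does, here it is on plain Python lists. The helper `clip_by_global_norm` is a hypothetical name written for illustration; it is not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients that are already small are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm)  # 13.0
```

Because every component is scaled by the same factor, the direction of the update is preserved; only its length changes.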

Finally, we can build and train our model!

In [9]:
# Hyper Parameters 
input_size = vocab_size
num_labels = 4 
hidden_dim = 24
embedding_dim = 8
batch_size = 256
learning_rate = 0.2
num_epochs = 200


# Build, initialize, and train model
rnn = ElmanRNN(vocab_size, embedding_dim, hidden_dim, num_labels, batch_size)
rnn.init_weights()

# Loss and Optimizer
loss = nn.CrossEntropyLoss()  
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)

# Train the model
training_iter = data_iter(training_set, batch_size)
train_eval_iter = eval_iter(training_set[0:500], batch_size)
dev_iter = eval_iter(dev_set[0:500], batch_size)

training_loop(batch_size, num_epochs, rnn, loss, optimizer, training_iter, dev_iter, train_eval_iter, lstm=False)

Epoch 0; Step 0; Loss 1.386484; Train acc: 0.253906; Dev acc 0.257812
Epoch 5; Step 1405; Loss 0.700493; Train acc: 0.500000; Dev acc 0.539062
Epoch 10; Step 2810; Loss 0.696540; Train acc: 0.437500; Dev acc 0.484375
Epoch 15; Step 4215; Loss 0.699141; Train acc: 0.511719; Dev acc 0.539062
Epoch 20; Step 5620; Loss 0.695492; Train acc: 0.488281; Dev acc 0.464844
Epoch 25; Step 7025; Loss 0.692526; Train acc: 0.523438; Dev acc 0.546875
Epoch 30; Step 8430; Loss 0.697006; Train acc: 0.574219; Dev acc 0.511719
Epoch 35; Step 9835; Loss 0.692805; Train acc: 0.511719; Dev acc 0.484375
Epoch 40; Step 11240; Loss 0.691888; Train acc: 0.585938; Dev acc 0.527344
Epoch 45; Step 12645; Loss 0.692137; Train acc: 0.460938; Dev acc 0.515625
Epoch 50; Step 14050; Loss 0.690499; Train acc: 0.476562; Dev acc 0.527344
Epoch 55; Step 15455; Loss 0.697140; Train acc: 0.496094; Dev acc 0.550781
Epoch 60; Step 16860; Loss 0.752339; Train acc: 0.593750; Dev acc 0.574219
Epoch 65; Step 18265; Loss 0.680105; T

Accuracy and confusion matrix on the full dev set:

In [10]:
dev_full_iter = eval_iter(dev_set, batch_size)
evaluate_confusion(rnn, dev_full_iter, False)

({'es-AR': {'es-AR': 1406, 'es-ES': 571, 'pt-BR': 2, 'pt-PT': 3},
  'es-ES': {'es-AR': 672, 'es-ES': 1312, 'pt-BR': 1, 'pt-PT': 0},
  'pt-BR': {'es-AR': 8, 'es-ES': 1, 'pt-BR': 1478, 'pt-PT': 497},
  'pt-PT': {'es-AR': 2, 'es-ES': 5, 'pt-BR': 520, 'pt-PT': 1458}},
 0.7124495967741935)
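
One way to read this output is to recompute a few summary numbers from it. The sketch below copies the counts printed above (outer key is the gold label, inner key is the predicted label, following `evaluate_confusion`) and derives the overall accuracy, the per-class recall, and the number of errors that cross the Spanish/Portuguese boundary. The variable names are introduced here for illustration.

```python
# Confusion counts copied from the output above:
# outer key = gold label, inner key = predicted label.
confusion = {
    'es-AR': {'es-AR': 1406, 'es-ES': 571, 'pt-BR': 2, 'pt-PT': 3},
    'es-ES': {'es-AR': 672, 'es-ES': 1312, 'pt-BR': 1, 'pt-PT': 0},
    'pt-BR': {'es-AR': 8, 'es-ES': 1, 'pt-BR': 1478, 'pt-PT': 497},
    'pt-PT': {'es-AR': 2, 'es-ES': 5, 'pt-BR': 520, 'pt-PT': 1458},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[lab][lab] for lab in confusion)
accuracy = correct / total

# Per-class recall: fraction of each gold class that was predicted correctly.
recall = {lab: row[lab] / sum(row.values()) for lab, row in confusion.items()}

# Errors that cross the Spanish/Portuguese boundary (language = code prefix).
cross_language = sum(
    count
    for gold, row in confusion.items()
    for pred, count in row.items()
    if gold[:2] != pred[:2]
)

print(total, correct, round(accuracy, 4))  # 7936 5654 0.7124
print(cross_language)                      # 22
```

Read this way, nearly all of the errors are confusions between two varieties of the same language: only 22 of the 7936 evaluated examples cross the Spanish/Portuguese boundary.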

### Part 2: LSTM

An LSTM RNN will quickly outperform the vanilla RNN on this task. Your training accuracy will reach 100% within 100 epochs, and your **full** dev accuracy should be > 78%.

Your task is to modify the `ElmanRNN` to make it into an LSTM RNN. [Olah's blog post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) is a very useful reference.

We'll be using the same training-loop and evaluation functions as the Elman network.
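
For reference, one common formulation of the LSTM update, the one followed by the `LSTM` class in this section, can be sketched as below. The symbols are notation introduced here: $e_t$ is the embedding of word $x_t$, $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and each gate has its own affine layer $(W_\ast, b_\ast)$ applied to the concatenation $[h_{t-1}; e_t]$. Both $h_0$ and $c_0$ are assumed to be zero vectors.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1};\, e_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1};\, e_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1};\, e_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1};\, e_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

As in the Elman network, the final hidden vector $h_n$ is passed through an affine layer to produce the output.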

In [19]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size, batch_size):
        super(LSTM, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding_size = embedding_dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.batch_size = batch_size
        
        self.linear_f = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.linear_i = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.linear_ctilde = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.linear_o = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.init_weights()
    
    def forward(self, x, hidden, c):
        x_emb = self.embed(x)
        embs = torch.chunk(x_emb, x_emb.size()[1], 1)       
        
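        def step(emb, hid, c_t):
            combined = torch.cat((hid, emb), 1)
            f = torch.sigmoid(self.linear_f(combined))            # forget gate
            i = torch.sigmoid(self.linear_i(combined))            # input gate
            c_tilde = torch.tanh(self.linear_ctilde(combined))    # candidate cell state
            c_t = f * c_t + i * c_tilde                           # new cell state
            o = torch.sigmoid(self.linear_o(combined))            # output gate
            hid = o * torch.tanh(c_t)                             # new hidden state
            return hid, c_t

        for t in range(len(embs)):
            # Each chunk is (batch, 1, emb); drop only the time dimension.
            hidden, c = step(embs[t].squeeze(1), hidden, c)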
            
        output = self.decoder(hidden)
        return output, hidden

    def init_hidden(self):
        h0 = Variable(torch.zeros(self.batch_size, self.hidden_size))
        c0 = Variable(torch.zeros(self.batch_size, self.hidden_size))
        return h0, c0
    
    def init_weights(self):
        initrange = 0.1
        lin_layers = [self.linear_f, self.linear_i, self.linear_ctilde, self.linear_o, self.decoder]
        em_layer = [self.embed]
     
        for layer in lin_layers+em_layer:
            layer.weight.data.uniform_(-initrange, initrange)
            if layer in lin_layers:
                layer.bias.data.fill_(0)

Let's test out our LSTM RNN.

In [32]:
# Hyper Parameters 
input_size = vocab_size
num_labels = 4
hidden_dim = 12
embedding_dim = 8
batch_size = 256
learning_rate = 0.5
num_epochs = 50

# Build, initialize, and train model
rnn = LSTM(vocab_size, embedding_dim, hidden_dim, num_labels, batch_size)
rnn.init_weights()

# Loss and Optimizer
loss = nn.CrossEntropyLoss()  
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)

# Train the model
training_iter = data_iter(training_set, batch_size)
train_eval_iter = eval_iter(training_set[0:500], batch_size)
dev_iter = eval_iter(dev_set[0:500], batch_size)

training_loop(batch_size, num_epochs, rnn, loss, optimizer, training_iter, dev_iter, train_eval_iter, lstm=True)

Epoch 0; Step 0; Loss 1.386247; Train acc: 0.312500; Dev acc 0.257182
Epoch 5; Step 1405; Loss 0.697559; Train acc: 0.468750; Dev acc 0.497858
Epoch 10; Step 2810; Loss 0.692911; Train acc: 0.507812; Dev acc 0.502646
Epoch 15; Step 4215; Loss 0.684276; Train acc: 0.503906; Dev acc 0.521295
Epoch 20; Step 5620; Loss 0.601682; Train acc: 0.695312; Dev acc 0.623236
Epoch 25; Step 7025; Loss 0.500320; Train acc: 0.707031; Dev acc 0.664441
Epoch 30; Step 8430; Loss 0.363241; Train acc: 0.832031; Dev acc 0.771799
Epoch 35; Step 9835; Loss 0.271194; Train acc: 0.890625; Dev acc 0.793599
Epoch 40; Step 11240; Loss 0.160747; Train acc: 0.925781; Dev acc 0.794103
Epoch 45; Step 12645; Loss 0.153955; Train acc: 0.964844; Dev acc 0.787676
Epoch 50; Step 14050; Loss 0.088313; Train acc: 0.968750; Dev acc 0.788306


Accuracy and confusion matrix on the full dev set:

In [33]:
dev_full_iter = eval_iter(dev_set, batch_size)
evaluate_confusion(rnn, dev_full_iter, True)

({'es-AR': {'es-AR': 1482, 'es-ES': 498, 'pt-BR': 3, 'pt-PT': 1},
  'es-ES': {'es-AR': 436, 'es-ES': 1540, 'pt-BR': 0, 'pt-PT': 4},
  'pt-BR': {'es-AR': 1, 'es-ES': 0, 'pt-BR': 1598, 'pt-PT': 388},
  'pt-PT': {'es-AR': 0, 'es-ES': 1, 'pt-BR': 341, 'pt-PT': 1643}},
 0.7891885080645161)

## Optional bits

* Instead of taking the last hidden state as the input to the linear layer that gives us our output, take a mean-pool or max-pool over time. Compare the results of mean-pooling vs. max-pooling.
* Make it deep! Instead of a single-layer RNN, add another layer.
* With the LSTM RNN, we're clearly overfitting to the training set. Add regularization: L2 or dropout.
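
For the pooling option, one possible starting point is sketched below. It assumes you change `forward` to collect the hidden state after every step in a list and then pass the pooled vector to the decoder instead of the last hidden state. The helper `pool_over_time` is a hypothetical name, and this simple version also pools over padded positions.

```python
import torch

def pool_over_time(hidden_states, mode="mean"):
    """Pool a list of per-step hidden states, each of shape (batch, hidden)."""
    stacked = torch.stack(hidden_states, dim=0)  # (seq_len, batch, hidden)
    if mode == "mean":
        return stacked.mean(dim=0)
    if mode == "max":
        # torch.max along a dim returns (values, indices); keep the values.
        return stacked.max(dim=0)[0]
    raise ValueError("mode must be 'mean' or 'max'")

# Toy check: 3 time steps, batch of 2, hidden size 4.
states = [torch.full((2, 4), float(t)) for t in range(3)]
mean_pooled = pool_over_time(states, "mean")  # every entry is 1.0
max_pooled = pool_over_time(states, "max")    # every entry is 2.0
print(mean_pooled.shape, max_pooled.shape)
```

Both pooled tensors have shape `(batch, hidden)`, so the existing decoder layer can be applied to them unchanged.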