---
# Exercise: Augmenting the LSTM PoS tagger with Character-level features

My proposal for [this](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#exercise-augmenting-the-lstm-part-of-speech-tagger-with-character-level-features)

In the [previous example](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging), each word had an embedding, which served as the inputs to our sequence model. Let’s augment the word embeddings with a representation derived from the characters of the word. We expect that this should help significantly, since character-level information like affixes have a large bearing on part-of-speech. For example, words with the affix -ly are almost always tagged as adverbs in English.


## Sequence Models and LSTM networks

Few notes about LSTMs... you can move on if you know about them
There is dependence through time between your inputs. 
It maintains some kind of state that could be used as part of the next input, so that the information can propagate along as the network passes over the sequence.

Colah's [blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) is a nice piece to read about them. They chain-like nature makes them appropriate for sequences and lists.
Standard RNNs are not effective when the context becomes wide.
LSTMs were introduced by Hochreiter & Schimdhuber, explicitly to avoid long-term dependency problem.
Core components in LSTMs: cell state, gates
* $C_{t-1}$ previous cell state
* $h_{t-1}$ hidden state
* $x_t$ current input
1. First, we need to know what information to keep from the previous cell state.
 * $f_t = \sigma(W_f \dot [ h_{t-1}, x_t] + b_f)$
2. Second, what information to store in the cell state.
 * $i_t = \sigma(W_i \dot [h_{t-1}, x_t] + b_i)$
 * $\tilde{C}_t = \text{tanh}(W_C \dot [h_{t-1}, x_t] + b_C)$
 * $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
3. Finally, what to output 
 * $o_t = \sigma(W_o \dot [h_{t-1}, x_t] + b_o)$
 * $h_t = o_t * \text{tanh}(C_t)$

## Preparing the dataset

We first prepare our toy dataset and a few function to preprocess them.
As before, we will first build a dictionary that will help as change each sentence into a sequence of indexes.
As a reminder, such dictionary hold as key the words and value the corresponding index.
We will do the same for words (so they can be representes as character indexes thanks to a dictionary).
Finally we'll generate a dictionary of tensor to represent each word in the dictionary as a sequence of characters indexes.

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim 

torch.manual_seed(1)

<torch._C.Generator at 0x115d96850>

In [4]:
def prepare_sequence(seq, to_ix):
    """Convert a sequence of things to a matrix
    
    Args:
        seq(list): Sequence of things
        to_ix(dict): key value pairs with things as keys and their index as value
    
    Returns:
        matrix with each line correspond to a word in the sequence with its index in 
        the dictionary to_ix
    """
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

def prepare_words_tensor(word_to_ix, char_to_ix):
    """Convert words(keys) in the dictionary word_to_ix into
    tensors that contains character indexes
    
    Args:
        word_to_ix(dict): key value pairs with words as keys and their indexes as values
        char_to_ix(dict): key value pairs with characters as keys and their indexes as values
    
    Returns:
        dict: Contains keys as index of words and values the tensors of words
    """
    list_words_tensor = {}
    for word, idx in word_to_ix.items():
        list_words_tensor[idx] = prepare_sequence(word, char_to_ix)
    return list_words_tensor

# Toy training data
training_data = [
    ("The dog ate the apple".casefold().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".casefold().split(), ["NN", "V", "DET", "NN"])
]

# Preparing our vocabulary of words
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V":2}

# Preparing our vocabulary of characters
char_to_ix = {}
for word in word_to_ix.keys():
    for c in word:
        if c not in char_to_ix:
            char_to_ix[c] = len(char_to_ix)

print(char_to_ix)

# Preparing representation of words at the character level
list_words_tensor = prepare_words_tensor(word_to_ix, char_to_ix)
print(list_words_tensor[2])

{'the': 0, 'dog': 1, 'ate': 2, 'apple': 3, 'everybody': 4, 'read': 5, 'that': 6, 'book': 7}
{'t': 0, 'h': 1, 'e': 2, 'd': 3, 'o': 4, 'g': 5, 'a': 6, 'p': 7, 'l': 8, 'v': 9, 'r': 10, 'y': 11, 'b': 12, 'k': 13}
tensor([6, 0, 2])


In [58]:
class LSTMTaggerAug(nn.Module):
    def __init__(self, embedding_dim_words, embedding_dim_chars, hidden_dim_words, hidden_dim_chars, vocab_size, tagset_size, charset_size):
        """LSTM Part-of-Speech Tagger Augmented with Character level features
        
        Atttributes:
            embedding_dim_words: Embedding dimension of word features to input to LSTM word level
            embedding_dim_chars: Embedding dimension of word features to input to character level
            hidden_dim_words: Output size of the LSTM word level
            hidden_dim_chars: Output size of the LSTM character level
            vocab_size: Size of the vocabulary of characters
            tagset_size: Size of the set of labels
            charset_size: Size of the vocabulary of characters
        """
        super(LSTMTaggerAug, self).__init__()
        self.hidden_dim_words = hidden_dim_words
        self.hidden_dim_chars = hidden_dim_chars
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim_words)
        self.char_embeddings = nn.Embedding(charset_size, embedding_dim_chars)
        self.lstm_char = nn.LSTM(embedding_dim_chars, hidden_dim_chars)
        self.lstm_words = nn.LSTM(embedding_dim_words + hidden_dim_chars, hidden_dim_words)
        self.hidden2tag = nn.Linear(hidden_dim_words, tagset_size)
        self.hidden_char = self.init_hidden(c=False)
        self.hidden_words = self.init_hidden(c=True)
    
    def init_hidden(self, c=True):
        """Initialize hidden state of LSTMs
        
        Args:
            c(boolean): return initialized hidden state for LSTM word level if true
        
        """
        if c:
            return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim_words)),
                    autograd.Variable(torch.zeros(1, 1, self.hidden_dim_words)))
        else:
            return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim_chars)),
                    autograd.Variable(torch.zeros(1, 1, self.hidden_dim_chars)))
    
    
    def forward(self, sentence_seq, words_tensor_dict):
        """Forward propagation
        
        Args:
            sentence_seq(list): Sequence of indexis related to the corresponding sentence words
            words_tensor_dict(dict): Dictionary of tensors of words at the character level
        
        Returns:
            tensor: Labels predicted (POS) for the sequence
        """
        # embeds = self.word_embeddings(sentence)
        for ix, word_idx in enumerate(sentence_seq):
            
            # Char level
            word_chars_tensors = words_tensor_dict[int(word_idx)]
            char_embeds = self.char_embeddings(word_chars_tensors)
            
            # Remember that the input of LSTM is a 3D Tensor:
            # The first axis is the sequence itself, 
            # the second indexes instances in the mini-batch, and 
            # the third indexes elements of the input.
            lstm_char_out, self.hidden_char = self.lstm_char(
                char_embeds.view(len(char_embeds), 1, -1), self.hidden_char)
            
            # Word level
            embeds = self.word_embeddings(word_idx)
            # Now here we will only keep the final hidden state of the character level LSTM
            # i.e lstm_char_out[-1]
#             print(embeds.shape)
#             print(lstm_char_out[-1].view(6).shape)
            embeds_cat = torch.cat((embeds, lstm_char_out[-1].view(6)))
            print(embeds_cat)
            
            lstm_out, self.hidden_words = self.lstm_words(embeds_cat, self.hidden_words)
            tag_space = self.hidden2tag(lstm_out.view(1, -1))
            
            tag_score = F.log_softmax(tag_space, dim=1)
            if ix==0:
                tag_scores = tag_score
            else:
                tag_scores = torch.cat((tag_scores, tag_score), 0)
        
        return tag_scores


In [59]:
# Used dimensions for our LSTMTaggerAug
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [60]:
# Let's test it
model = LSTMTaggerAug(EMBEDDING_DIM, EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

inputs = prepare_sequence(training_data[0][0], word_to_ix)
words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)
# print(words_in)
# print(inputs)
tag_scores = model(inputs, words_tensors)



tensor([-0.4169, -1.1838,  0.1670, -0.1375,  0.8632, -0.0244, -0.2928, -0.1944,
        -0.3524, -0.2916, -0.2313,  0.0261], grad_fn=<CatBackward>)


IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

In [48]:
# Let's train it !
model = LSTMTaggerAug(EMBEDDING_DIM, EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

inputs = prepare_sequence(training_data[0][0], word_to_ix)
words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)
tag_scores = model(inputs, words_tensors)
# print(tag_scores)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()
        model.hidden = model.init_hidden()
        sentence_in = prepare_sequence(sentence, word_to_ix)
        words_tensors = prepare_words_tensor(word_to_ix, char_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in, words_tensors)
        loss = loss_function(tag_scores, targets)
        loss.backward(retain_graph=True)
        optimizer.step()
    
inputs = prepare_sequence(training_data[0][0], word_to_ix)
words_in = prepare_words_tensor(word_to_ix, char_to_ix)
tag_scores = model(inputs, words_in)
print(tag_scores)
print(training_data[0][0])

torch.Size([6])
torch.Size([6])


RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 1 and 2 at /Users/distiller/project/conda/conda-bld/pytorch_1573049287641/work/aten/src/TH/generic/THTensor.cpp:680