<a href="https://colab.research.google.com/github/hookskl/nlp_w_pytorch/blob/main/nlp_w_pytorch_ch7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intermediate Sequence Modeling for NLP

Sequence prediction tasks seek to label each item of a sequence. Examples include:
 
 * *language modeling*: predict the next word given a sequence of words at each step
 * *parts-of-speech tagging*: predict the grammatical part of speech for each word
 * *named entity recognition*: predict whether each word is part of a named entity (`Person`, `Location`, `Product`, `Organization`, etc.)

Sequence prediction is also sometimes referred to as *sequence labeling*.

In practice, Elman RNNs fail to capture the long-range dependencies needed to perform well on tasks such as sequence prediction. Gated RNNs use a different architecture that helps circumvent this shortcoming.

## The Problem with Vanilla RNNs (Elman RNNs)

The Elman RNN has two issues that make it unsuitable for many tasks:

1. it cannot retain information across long sequences
2. its gradients are numerically unstable

The first issue stems from the fact that the Elman RNN simply updates its hidden state vector at every time step, whether or not the update makes sense. The RNN has no control over which values are retained and which are discarded. A more desirable behaviour would be to let the RNN decide when to update, by how much, and which parts to update.

Second, for really long sequences, vanilla RNNs tend to suffer from numerically unstable gradients: the gradients can shrink towards `0` (*vanishing gradients*) or grow towards infinity (*exploding gradients*). Either case severely harms the model's ability to train. Techniques exist for dealing with these gradient issues (ReLU activations, gradient clipping, etc.), but none of them improves the model as reliably as *gating*.
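
To make the gradient issue concrete, the following is a deliberately simplified sketch, not an actual RNN: it assumes a scalar linear recurrence $h_t = w \cdot h_{t-1}$ with a single hypothetical weight `w`. Backpropagating through `num_steps` time steps then multiplies the gradient by `w` once per step, so a factor slightly below or above `1` is enough to make it vanish or explode.

```python
def gradient_through_time(w, num_steps):
    """Gradient of h_T with respect to h_0 for the scalar recurrence
    h_t = w * h_{t-1}. By the chain rule this is w ** num_steps."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= w  # one multiplication per time step
    return grad


# The weight values below are illustrative assumptions.
for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, num_steps=100))
```

With `w=0.9` the gradient after 100 steps is far below `1`, with `w=1.1` it is far above `1`, and only `w=1.0` leaves it unchanged.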

### Gating as a Solution to a Vanilla RNN's Challenges

To gain some intuition about gating, consider adding two values, $a$ and $b$, with the optional constraint to limit how much of $b$ is added. This can be written as: $$a + \lambda b \text{, where } 0 \leq \lambda \leq 1.$$

Here $\lambda$ acts as a "switch" or "gate", controlling how much $b$ contributes to the total sum. This is the basic idea behind gating in RNNs. To see how this is incorporated, recall how the Elman RNN updates the hidden state vector: $$h_t=h_{t-1}+F(h_{t-1}, x_t) \text{, where}$$ $$h_t \text{ is the hidden state vector at some time step }t,$$ $$x_t \text{ is the input at some time step }t, \text{ and}$$ $$F \text{ is the recurrent computation of the RNN.}$$

Modifying this equation with the ideas above yields: $$h_t=h_{t-1}+\lambda(h_{t-1}, x_t)F(h_{t-1}, x_t)$$

but now, instead of being a constant, $\lambda$ is a function of the previous hidden state vector and the current input, and it still maps to values in $[0,1]$. The function $\lambda$ controls how much of the current input gets to update the hidden state $h_{t-1}$, and it does so in a context-dependent way. This is the basic idea behind all gated RNNs.
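
The gated update can be sketched in a few lines of plain Python. This is a scalar toy under stated assumptions, not how PyTorch implements gated cells: the gate is assumed to be a sigmoid of a weighted sum (which keeps it strictly between `0` and `1`), $F$ is assumed to be a `tanh` of a weighted sum, and all weights (`w_h`, `w_x`, `b`, `u_h`, `u_x`) are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(h_prev, x_t, w_h=1.0, w_x=1.0, b=0.0):
    """Hypothetical gate lambda(h_{t-1}, x_t); the sigmoid bounds it in (0, 1)."""
    return sigmoid(w_h * h_prev + w_x * x_t + b)


def candidate(h_prev, x_t, u_h=0.5, u_x=1.0):
    """Hypothetical recurrent computation F(h_{t-1}, x_t)."""
    return math.tanh(u_h * h_prev + u_x * x_t)


def gated_update(h_prev, x_t, **gate_params):
    """h_t = h_{t-1} + lambda(h_{t-1}, x_t) * F(h_{t-1}, x_t)"""
    lam = gate(h_prev, x_t, **gate_params)
    return h_prev + lam * candidate(h_prev, x_t)


h = 0.0
for x in [1.0, -0.5, 2.0]:  # an arbitrary toy input sequence
    h = gated_update(h, x)
print(h)
```

Pushing the gate's bias `b` strongly negative closes the gate, so the hidden state passes through nearly unchanged; pushing it strongly positive opens the gate and recovers the ungated update.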

*Long short-term memory* networks (LSTMs) are a flavor of gated RNN that extends this idea further: not only are the updates controlled, but values from the previous hidden state can also be intentionally forgotten.

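Another variant is the *gated recurrent unit* (GRU). Both are easy to use in PyTorch: replace `nn.RNN` or `nn.RNNCell` with `nn.LSTM` or `nn.LSTMCell` (or with `nn.GRU` or `nn.GRUCell`), respectively. For GRUs no other code changes are required; an LSTM additionally carries a cell state, so its hidden state is a tuple `(h, c)` wherever the vanilla RNN used a single tensor.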

## Example: A Character RNN for Generating Surnames

This example uses the surnames dataset to introduce a new sequence prediction task: using an RNN to generate a new surname. What this means is at each time step the RNN is computing a probability distribution over the set of possible characters given a prior sequence of characters. These distributions can be used either for optimizing the network and improving predictions, or generating a new surname. 

Two models will be used for this task: an unconditional model and a conditional model. The only difference in these two models is the conditional model will start with a bias coming from an embedding of a given nationality.

### The SurnameDataset Class

The `SurnameDataset` class remains largely unchanged from the implementation used for classifying a surname's nationality. However, to accommodate the difference in tasks, the `.__getitem__()` method is modified to output the sequences of integers for the prediction targets. The method references the `Vectorizer` to compute the sequence of integers that serves as the input (the `from_vector`) and the sequence of integers that serves as the output (the `to_vector`).

*Example 7-1. The `SurnameDataset.__getitem__()` method for a sequence prediction task*

```
import pandas as pd
from torch.utils.data import Dataset


class SurnameDataset(Dataset):
    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a vectorizer from scratch

        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """

        surname_df = pd.read_csv(surname_csv)
        return cls(surname_df, SurnameVectorizer.from_dataframe(surname_df))

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point: (x_data, y_target, class_index)
        """
        row = self._target_df.iloc[index]

        from_vector, to_vector = \
            self._vectorizer.vectorize(row.surname, self._max_seq_length)

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_data': from_vector,
                'y_target': to_vector,
                'class_index': nationality_index}
```

### The Vectorization Data Structures

Similar to previous implementations, there are three main data structures used to transform each sequence of characters into its vectorized form:

* `SequenceVocabulary` to map tokens to integers
* `SurnameVectorizer` to coordinate the integer mappings
* `DataLoader` to group the vectorizer's results into minibatches

The `DataLoader` is unchanged from previous examples.

#### SurnameVectorizer and END-OF-SEQUENCE

For sequence prediction, the training routine is written to expect two sequences of integers, which represent the token observations and the token targets at each time step. Usually, the sequence trained on is also the sequence to predict. Consequently, a single sequence of tokens (surname characters) is used to construct both observations and targets by staggering the training sequence.

To start, each token is mapped to its respective integer using the `SequenceVocabulary`. Next, the indices of the begin-of-sequence and end-of-sequence tokens are prepended and appended to the sequence, respectively. At this point, each data point is a sequence of indices, and every data point shares the same first index and the same last index. From here, the input and output sequences are created using two different slices of the sequence: the first slice is all tokens except the last, and the second slice is all tokens except the first.

Some additional implementation details: once the sequence is converted to indices and wrapped with the beginning and ending indices, the `vector_length` is tested to ensure consistent lengths prior to stacking into minibatches. After testing vector length the two slices of sequences are created, `from_vector` and `to_vector`, and are then padded with the `mask_index` to a consistent length.
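
The staggering described above can be traced on a single surname. The sketch below is framework-free and uses a toy vocabulary whose index values (`MASK`, `BEGIN`, `END`, and the character indices) are assumptions made for illustration; the real values come from the `SequenceVocabulary`.

```python
# Toy character vocabulary; every index value here is an illustrative assumption.
MASK, BEGIN, END = 0, 1, 2
char_to_index = {"L": 3, "e": 4}


def stagger(surname, vector_length):
    """Build the observation and target index lists for one surname."""
    indices = [BEGIN] + [char_to_index[c] for c in surname] + [END]
    from_indices = indices[:-1]  # everything except the last index
    to_indices = indices[1:]     # everything except the first index
    # Both slices have the same length, so they share the same padding.
    pad = [MASK] * (vector_length - len(from_indices))
    return from_indices + pad, to_indices + pad


from_vector, to_vector = stagger("Lee", vector_length=6)
print(from_vector)  # [1, 3, 4, 4, 0, 0]
print(to_vector)    # [3, 4, 4, 2, 0, 0]
```

At every position, the target is the observation one step ahead, which is exactly what the model is asked to predict.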

*Example 7-2. The code for `SurnameVectorizer.vectorize()` in a sequence prediction task*

```
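import numpy as np


class SurnameVectorizer(object):
    """The Vectorizer which coordinates the Vocabularies and puts them to use"""

    def __init__(self, char_vocab, nationality_vocab):
        # Stored so that vectorize() and the Dataset can look up indices
        self.char_vocab = char_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname, vector_length=-1):
        """Vectorize a surname into a vector of observations and targets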

        Args:
            surname (str): the surname to be vectorized
            vector_length (int): an argument for forcing the length of index vector
        Returns:
            a tuple: (from_vector, to_vector)
                from_vector (numpy.ndarray): the observation vector
                to_vector (numpy.ndarray): the target prediction vector
        """
        indices = [self.char_vocab.begin_seq_index]
        indices.extend(self.char_vocab.lookup_token(token) for token in surname)
        indices.append(self.char_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices) - 1

        from_vector = np.zeros(vector_length, dtype=np.int64)
        from_indices = indices[:-1]
        from_vector[:len(from_indices)] = from_indices
        from_vector[len(from_indices):] = self.char_vocab.mask_index

        to_vector = np.empty(vector_length, dtype=np.int64)
        to_indices = indices[1:]
        to_vector[:len(to_indices)] = to_indices
        to_vector[len(to_indices):] = self.char_vocab.mask_index  

        return from_vector, to_vector

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surname dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        char_vocab = SequenceVocabulary()
        nationality_vocab = Vocabulary()

        for index, row in surname_df.iterrows():
            for char in row.surname:
                char_vocab.add_token(char)
            nationality_vocab.add_token(row.nationality)

        return cls(char_vocab, nationality_vocab)     
```

### From the ElmanRNN to the GRU

Switching from vanilla RNNs to a GRU model in PyTorch is extremely simple. The GRU is instantiated using `torch.nn.GRU` and the parameters are the same as those used in the vanilla RNN.

### Model 1: The Unconditioned SurnameGenerationModel

The first model is unconditioned, meaning it does not observe the nationality before generating a surname, so the GRU does not bias its computations towards any nationality. Any such bias would come from how the initial hidden vector is constructed. Here, because the model is unconditioned, the initial hidden vector consists of all `0`s.

In general, the model embeds character indices, computes their sequential state using a GRU, and computes the probability of token predictions using a `Linear` layer. 

*Example 7-3. The unconditioned surname generation model*

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                 batch_first=True, padding_idx=0, dropout_p=0.5):
        """
        Args:
            char_embedding_size (int): the size of the character embeddings
            char_vocab_size (int): the number of characters to embed
            rnn_hidden_size (int): the size of the RNN's hidden state
            batch_first (bool): informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): the index for the tensor padding;
                see torch.nn.Embedding
            dropout_p (float): the probability of zeroing activations using the dropout method
        """
        super(SurnameGenerationModel, self).__init__()

        self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                     embedding_dim=char_embedding_size,
                                     padding_idx=padding_idx)
        self.rnn = nn.GRU(input_size=char_embedding_size,
                          hidden_size=rnn_hidden_size,
                          batch_first=batch_first)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=char_vocab_size)
        self._dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the model

        Args:
            x_in (torch.Tensor): an input data tensor of character indices
                x_in.shape should be (batch, seq_size)
            apply_softmax (bool): a flag for the softmax activation
                should be False during training
        Returns:
            the resulting tensor
                tensor.shape should be (batch, seq_size, char_vocab_size)
        """
        x_embedded = self.char_emb(x_in)

        y_out, _ = self.rnn(x_embedded)

        batch_size, seq_size, feat_size = y_out.shape
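        y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)

        # F.dropout defaults to training=True, so pass the module's mode
        # explicitly to disable dropout when the model is in eval mode
        y_out = self.fc(F.dropout(y_out, p=self._dropout_p,
                                  training=self.training))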

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        new_feat_size = y_out.shape[-1]
        y_out = y_out.view(batch_size, seq_size, new_feat_size)

        return y_out
```

The primary difference from the earlier sequence classification task is how the state vectors computed by the RNN are handled. Previously, a single vector per batch index was retrieved and predictions were made from those single vectors. Now the 3D tensor is reshaped into a 2D tensor (a matrix) whose rows represent every sample (each combination of batch and sequence index). The matrix is fed into the `Linear` layer, which computes a prediction vector for every sample, and the output matrix is then reshaped back into a 3D tensor. The reshaping is done purely so that the `Linear` layer receives a matrix as input.
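
The bookkeeping behind this reshape can be followed with nested lists standing in for tensors. This is only a sketch of the index arithmetic: `toy_linear` and its `weights` are hypothetical stand-ins for the `Linear` layer, and the sizes are arbitrary.

```python
batch_size, seq_size, feat_size = 2, 3, 4

# Stand-in for the RNN output: y_out[b][t] is a feature vector of length feat_size.
y_out = [[[float(100 * b + 10 * t + f) for f in range(feat_size)]
          for t in range(seq_size)]
         for b in range(batch_size)]

# Counterpart of .view(batch_size * seq_size, feat_size): one row per (b, t) pair.
flat = [y_out[b][t] for b in range(batch_size) for t in range(seq_size)]

# Hypothetical linear map from feat_size to 2 outputs (no bias term).
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]


def toy_linear(row):
    return [sum(w * x for w, x in zip(w_row, row)) for w_row in weights]


flat_out = [toy_linear(row) for row in flat]

# Counterpart of .view(batch_size, seq_size, new_feat_size).
restored = [flat_out[b * seq_size:(b + 1) * seq_size]
            for b in range(batch_size)]
```

Row `b * seq_size + t` of the flattened matrix corresponds to batch item `b` at time step `t`, which is why the second reshape recovers the original layout.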

### Model 2: The Conditioned SurnameGenerationModel

*Example 7-4. The conditioned surname generation model*

```
class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, num_nationalities,
                 rnn_hidden_size, batch_first=True, padding_idx=0, dropout_p=0.5):
        # ...
        self.nation_emb = nn.Embedding(embedding_dim=rnn_hidden_size,
                                       num_embeddings=num_nationalities)

    def forward(self, x_in, nationality_index, apply_softmax=False):
        # ...
        x_embedded = self.char_emb(x_in)
        # hidden_size: (num_layers * num_directions, batch_size, rnn_hidden_size)
        nationality_embedded = self.nation_emb(nationality_index).unsqueeze(0)
        y_out, _ = self.rnn(x_embedded, nationality_embedded)
        # ...                                             
```

### The Training Routine and Results

*Example 7-5. Handling three-dimensional tensors and sequence-wide loss computations*

```
def normalize_sizes(y_pred, y_true):
    """Normalize tensor sizes

    Args:
        y_pred (torch.Tensor): the output of the model
            if a 3-d tensor, reshapes to a matrix
        y_true (torch.Tensor): the target predictions
            if a matrix, reshapes to be a vector
    """
    if len(y_pred.size()) == 3:
        y_pred = y_pred.contiguous().view(-1, y_pred.size(2))
    if len(y_true.size()) == 2:
        y_true = y_true.contiguous().view(-1)
    return y_pred, y_true

def sequence_loss(y_pred, y_true, mask_index):
    y_pred, y_true = normalize_sizes(y_pred, y_true)
    return F.cross_entropy(y_pred, y_true, ignore_index=mask_index)
```
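
To show what `ignore_index` contributes to the sequence loss, here is a plain-Python sketch of a masked cross-entropy. It is an illustration of the idea under simplifying assumptions (lists instead of tensors, mean reduction), not PyTorch's implementation; the name `masked_cross_entropy` is hypothetical.

```python
import math


def masked_cross_entropy(logits, targets, mask_index):
    """Mean negative log-likelihood over the targets that are not mask_index.

    logits: list of rows of unnormalized scores, one row per position
    targets: list of target indices, one per position
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == mask_index:
            continue  # padded positions contribute nothing to the loss
        max_score = max(row)
        # log-sum-exp, shifted by the max for numerical stability
        log_norm = max_score + math.log(
            sum(math.exp(s - max_score) for s in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else float("nan")


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
targets = [1, 1, 0]  # the last position is padding (mask_index = 0)
print(masked_cross_entropy(logits, targets, mask_index=0))
```

Because the padded position is skipped, the result equals the loss computed on the first two positions alone, so padding length does not distort the average.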

*Example 7-6. Hyperparameters for surname generation*

```
from argparse import Namespace

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch7/model1_unconditioned_surname_generation",
    # or: save_dir="model_storage/ch7/model2_conditioned_surname_generation",
    # Model hyperparameters
    char_embedding_size=32,
    rnn_hidden_size=32,
    # Training hyperparameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    # Runtime options omitted
)    
```

*Example 7-7. Sampling from the unconditioned generation model*

```
def sample_from_model(model, vectorizer, num_samples=1, sample_size=20, 
                      temperature=1.0):
    """Sample a sequence of indices from the model

    Args:
        model (SurnameGenerationModel): the trained model
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        num_samples (int): the number of samples
        sample_size (int): the max length of the samples
        temperature (float): accentuates or flattens the distribution
            0.0 < temperature < 1.0 will make it peakier
            temperature > 1.0 will make it more uniform
    Returns:
        indices (torch.Tensor): the matrix of indices
        shape = (num_samples, sample_size)
    """
    begin_seq_index = [vectorizer.char_vocab.begin_seq_index 
                       for _ in range(num_samples)]
    begin_seq_index = torch.tensor(begin_seq_index,
                                   dtype=torch.int64).unsqueeze(dim=1)
    indices = [begin_seq_index]
    h_t = None

    for time_step in range(sample_size):
        x_t = indices[time_step]
        x_emb_t = model.char_emb(x_t)
        rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
        prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
        probability_vector = F.softmax(prediction_vector / temperature, dim=1)
        indices.append(torch.multinomial(probability_vector, num_samples=1))
    indices = torch.stack(indices).squeeze().permute(1, 0)
    return indices                                                                       
```
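
The effect of the `temperature` argument can be seen in isolation with a small, framework-free sketch. The scores below are made-up values, and `softmax_with_temperature` is a hypothetical helper that mirrors dividing the prediction vector by the temperature before the softmax.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax of scores / temperature, shifted by the max for stability."""
    scaled = [s / temperature for s in scores]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # illustrative unnormalized scores for 3 characters
peaky = softmax_with_temperature(scores, temperature=0.5)
neutral = softmax_with_temperature(scores, temperature=1.0)
flat = softmax_with_temperature(scores, temperature=5.0)

# Draw one index from the peaky distribution (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(range(len(scores)), weights=peaky, k=1)[0]
print(peaky, neutral, flat, sampled)
```

A temperature below `1.0` concentrates probability on the highest-scoring character, while a temperature above `1.0` moves the distribution towards uniform, so sampled surnames become more varied but less typical.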

*Example 7-8. Mapping sampled indices to surname strings*

```
def decode_samples(sampled_indices, vectorizer):
    """Transform indices into the string form of a surname

    Args:
        sampled_indices (torch.Tensor): the indices from `sample_from_model`
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    """
    decoded_surnames = []
    vocab = vectorizer.char_vocab

    for sample_index in range(sampled_indices.shape[0]):
        surname = ""
        for time_step in range(sampled_indices.shape[1]):
            sample_item = sampled_indices[sample_index, time_step].item()
            if sample_item == vocab.begin_seq_index:
                continue
            elif sample_item == vocab.end_seq_index:
                break
            else:
                surname += vocab.lookup_index(sample_item)
        decoded_surnames.append(surname)
    return decoded_surnames
```

*Example 7-9. Sampling from the unconditioned model*

```
samples = sample_from_model(unconditioned_model, vectorizer, 
                              num_samples=10)

decode_samples(samples, vectorizer)                              
```

*Example 7-10. Sampling from a sequence model*

```
def sample_from_model(model, vectorizer, nationalities, sample_size=20,
                      temperature=1.0):
    """Sample a sequence of indices from the model

    Args:
        model (SurnameGenerationModel): the trained model
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        nationalities (list): a list of integers representing nationalities
        sample_size (int): the max length of the samples
        temperature (float): accentuates or flattens the distribution
            0.0 < temperature < 1.0 will make it peakier
            temperature > 1.0 will make it more uniform
    Returns:
        indices (torch.Tensor): the matrix of indices
        shape = (num_samples, sample_size)
    """
    num_samples = len(nationalities)
    begin_seq_index = [vectorizer.char_vocab.begin_seq_index
                       for _ in range(num_samples)]
    begin_seq_index = torch.tensor(begin_seq_index,
                                   dtype=torch.int64).unsqueeze(dim=1)
    indices = [begin_seq_index]
    nationality_indices = torch.tensor(nationalities,
                                       dtype=torch.int64).unsqueeze(dim=0)
    h_t = model.nation_emb(nationality_indices)

    for time_step in range(sample_size):
        x_t = indices[time_step]
        x_emb_t = model.char_emb(x_t)
        rnn_out_t, h_t = model.rnn(x_emb_t, h_t)
        prediction_vector = model.fc(rnn_out_t.squeeze(dim=1))
        probability_vector = F.softmax(prediction_vector / temperature, dim=1)
        indices.append(torch.multinomial(probability_vector, num_samples=1))
    indices = torch.stack(indices).squeeze().permute(1, 0)
    return indices                                                                      
```

*Example 7-11. Sampling from the conditioned SurnameGenerationModel*

```
for index in range(len(vectorizer.nationality_vocab)):
    nationality = vectorizer.nationality_vocab.lookup_index(index)
    
    print("Sampled for {}: ".format(nationality))

    sampled_indices = sample_from_model(model=conditioned_model,
                                        vectorizer=vectorizer,
                                        nationalities=[index] * 3,
                                        temperature=0.7)
    for sampled_surname in decode_samples(sampled_indices,
                                          vectorizer):
        print("- " + sampled_surname)                                                                                  
```

## Tips and Tricks for Training Sequence Models

*Example 7-12. Applying gradient clipping in PyTorch*

```
# define your sequence model
model = ...
# define loss function
loss_function = ...

# training loop
for _ in ...:
    ...
    model.zero_grad()
    output, hidden = model(data, hidden)
    loss = loss_function(output, targets)
    loss.backward()
    # clip_grad_norm_ (trailing underscore) modifies the gradients in place;
    # the older clip_grad_norm name is deprecated
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
    ...
```
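
For intuition about what norm-based clipping does to the gradients, the following plain-Python sketch rescales a list of gradient values whenever their overall L2 norm exceeds a threshold. It approximates the idea only; PyTorch's own function operates on parameter tensors and differs in details, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # same direction, smaller length
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
print(norm_before, clipped)
```

Clipping changes only the length of the gradient vector, not its direction, which is why it limits exploding gradients without redirecting the update.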
