# The LSTM classifier

In this notebook you will learn how to implement a classifier based on the Long Short-Term Memory (LSTM) network. The architecture for this classifier is that of the *encoder* introduced in Lecture&nbsp;1.3.3.

In [None]:
import torch

## Loading the data

We use the same data as in *Document classification with simple neural networks*.

The following helper class loads the sentiment- or product-labelled sentences from a tab-separated file. Each item of the resulting dataset is a pair whose first component is a tokenised review (a list of string tokens) and whose second component is the corresponding label (an integer). By default, we truncate each document to its first 100 tokens.

In [None]:
from torch.utils.data import Dataset

class DocumentDataset(Dataset):

    def __init__(self, filename, max_len=100):
        self.items = []
        with open(filename, 'rt', encoding='utf-8') as fp:
            for line in fp:
                doc, label = line.rstrip().split('\t')
                self.items.append((doc.split()[:max_len], int(label)))

    def __getitem__(self, idx):
        return self.items[idx]

    def __len__(self):
        return len(self.items)
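
To make the parsing step concrete, here is a minimal sketch that applies the same split-and-truncate logic to a single line. The example line and the small cap `max_len = 4` are made up for illustration; they are not taken from the actual data files.

```python
# Hypothetical line in the "<text>\t<label>" format read by DocumentDataset
line = "this camera takes really great pictures\t2\n"

max_len = 4  # small cap for illustration; the class defaults to 100

# Same steps as in DocumentDataset.__init__
doc, label = line.rstrip().split('\t')
item = (doc.split()[:max_len], int(label))

print(item)  # (['this', 'camera', 'takes', 'really'], 2)
```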

We use this class to load the training data and the development data:

In [None]:
train_data = DocumentDataset('reviews-product-train.txt')
dev_data = DocumentDataset('reviews-product-dev.txt')

## Vectorizing the data

We represent each document by a vector containing the word ids of the tokens in the document.

We first construct our vocabulary. Note that we reserve the word id&nbsp;0 for padding and the word id&nbsp;1 for unknown words.

In [None]:
def make_vocab(data):
    vocab = {'<pad>': 0, '<unk>': 1}
    for doc, label in data:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

We create the vocabulary from the training data:

In [None]:
vocab = make_vocab(train_data)
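
The following sketch illustrates how ids are assigned and how an unseen token is handled at lookup time. The two-document `toy_data` is a made-up stand-in for the real training data; the loop mirrors `make_vocab`.

```python
# Toy stand-in for the training data: (tokens, label) pairs
toy_data = [
    (['good', 'camera'], 0),
    (['good', 'book'], 1),
]

# Same logic as make_vocab: ids are assigned in order of first occurrence
toy_vocab = {'<pad>': 0, '<unk>': 1}
for doc, label in toy_data:
    for token in doc:
        if token not in toy_vocab:
            toy_vocab[token] = len(toy_vocab)

print(toy_vocab)
# {'<pad>': 0, '<unk>': 1, 'good': 2, 'camera': 3, 'book': 4}

# Tokens that were never seen in training map to the id of '<unk>'
ids = [toy_vocab.get(t, toy_vocab['<unk>']) for t in ['good', 'phone']]
print(ids)  # [2, 1]
```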

Next we create our document batcher. In each batch we right-pad shorter documents with zeroes.

In [None]:
class DocumentBatcher(object):

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, batch):
        docs, labels = zip(*batch)

        # Replace each token by its integer id
        xs = [[self.vocab.get(t, self.vocab['<unk>']) for t in d] for d in docs]

        # Right-pad each document with zeroes
        max_len = max(len(x) for x in xs)
        xs = [x + [self.vocab['<pad>']] * (max_len - len(x)) for x in xs]

        return torch.as_tensor(xs), torch.as_tensor(labels)

We save the batcher for later use.

In [None]:
batcher = DocumentBatcher(vocab)
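
As a quick illustration of the padding step, the sketch below right-pads three made-up id sequences to the length of the longest one. It uses plain Python lists; in the batcher, the result is then converted with `torch.as_tensor` into a tensor of shape `[batch_size, max_len]`.

```python
# Hypothetical word-id sequences of different lengths
xs = [[2, 3, 4], [5], [6, 7]]
pad_id = 0  # the id reserved for '<pad>'

# Right-pad each sequence to the length of the longest one in the batch
max_len = max(len(x) for x in xs)
padded = [x + [pad_id] * (max_len - len(x)) for x in xs]

print(padded)  # [[2, 3, 4], [5, 0, 0], [6, 7, 0]]
```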

## Evaluation

We evaluate our classifier using accuracy:

In [None]:
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
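
For example, with four made-up predictions of which three match the gold labels, the same computation yields an accuracy of 0.75:

```python
import torch

# Hypothetical predicted and gold-standard class ids
y_pred = torch.tensor([0, 1, 2, 2])
y = torch.tensor([0, 1, 1, 2])

# Same computation as in accuracy(): fraction of matching positions
acc = torch.mean(torch.eq(y_pred, y).float()).item()
print(acc)  # 0.75
```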

## Training the model

We are now ready to set up the LSTM model and train it using cross-entropy loss.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
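
As a reminder of what the loss computes, the following sketch checks that `F.cross_entropy` applied to raw logits equals the mean negative log-probability of the gold classes under a softmax. The logits and targets are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 2 documents and 3 classes, plus gold labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# Built-in loss: expects unnormalised logits, averages over the batch
loss = F.cross_entropy(logits, targets)

# Manual computation: negative log-softmax at the gold class, averaged
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(torch.allclose(loss, manual))  # True
```

This is why the models below return raw scores and do not apply a softmax themselves.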

### Model 1: Unidirectional LSTM

We first consider an encoder based on a standard unidirectional LSTM. This encoder consists of three main parts: an embedding layer, an LSTM, and a linear layer that projects the final hidden state of the LSTM down to the number of classes. We also apply dropout to the final hidden state before the linear layer.

In [None]:
class LSTMModel(nn.Module):

    def __init__(self, num_embeddings, embedding_dim, hidden_dim, num_classes):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)

        # Unidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Dropout layer
        self.dropout = nn.Dropout(0.5)

        # Final linear layer that projects down to the class labels
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Shape of x: [batch_size, sequence_length]

        # Replace each word id with its embedding vector
        x = self.embedding(x)
        # Shape of x: [batch_size, sequence_length, embedding_dim]

        # Unroll the LSTM
        output, (h_n, c_n) = self.lstm(x)
        # Shape of h_n: [1, batch_size, hidden_dim]

        # Extract the last hidden state
        x = h_n[-1]
        # Shape of x: [batch_size, hidden_dim]

        # Apply dropout
        x = self.dropout(x)
        # Shape of x: [batch_size, hidden_dim]

        # Send x through the final linear layer
        x = self.linear(x)
        # Shape of x: [batch_size, num_classes]

        return x
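
To check the shape comments above, here is a small sketch that runs a unidirectional `nn.LSTM` on random input. The sizes (batch of 3, length 7, embedding size 8, hidden size 5) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=8, hidden_size=5, batch_first=True)
x = torch.randn(3, 7, 8)  # [batch_size, sequence_length, embedding_dim]

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([3, 7, 5])
print(h_n.shape)     # torch.Size([1, 3, 5])

# The final hidden state equals the output at the last time step
same = torch.allclose(h_n[-1], output[:, -1])
print(same)  # True
```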

### Model 2: Bidirectional LSTM

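We now consider an encoder based on a bidirectional LSTM. The basic architecture is the same as before, except that the document representation is now the concatenation of the final hidden states of the forward and the backward LSTM, so the linear layer takes an input of size `2 * hidden_dim`. Note that this cell reuses the class name `LSTMModel`; once you run it, the training code below uses the bidirectional model.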

In [None]:
class LSTMModel(nn.Module):

    def __init__(self, num_embeddings, embedding_dim, hidden_dim, num_classes):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)

        # Bidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)

        # Dropout layer
        self.dropout = nn.Dropout(0.5)

        # Final linear layer that projects down to the class labels
        self.linear = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # Shape of x: [batch_size, sequence_length]

        # Replace each word id with its embedding vector
        x = self.embedding(x)
        # Shape of x: [batch_size, sequence_length, embedding_dim]

        # Unroll the Bi-LSTM
        output, (h_n, c_n) = self.lstm(x)
        # Shape of h_n: [2, batch_size, hidden_dim]

        # Extract the last hidden states: h_n[-2] is the final state of the
        # forward direction, h_n[-1] that of the backward direction
        x = torch.cat((h_n[-2], h_n[-1]), dim=-1)
        # Shape of x: [batch_size, 2 * hidden_dim]

        # Apply dropout
        x = self.dropout(x)
        # Shape of x: [batch_size, 2 * hidden_dim]

        # Send x through the final linear layer
        x = self.linear(x)
        # Shape of x: [batch_size, num_classes]

        return x
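
The layout of `h_n` in the bidirectional case can be verified with a small sketch on random input (sizes again arbitrary). With a single layer, `h_n[0]` holds the final state of the forward direction, which is the output at the last time step, and `h_n[1]` holds the final state of the backward direction, which is the output at the first time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 5
bilstm = nn.LSTM(8, hidden_dim, batch_first=True, bidirectional=True)
x = torch.randn(3, 7, 8)  # [batch_size, sequence_length, embedding_dim]

output, (h_n, c_n) = bilstm(x)
print(output.shape)  # torch.Size([3, 7, 10])
print(h_n.shape)     # torch.Size([2, 3, 5])

# Forward direction: final state is the first half of the last output
fwd_ok = torch.allclose(h_n[0], output[:, -1, :hidden_dim])
# Backward direction: final state is the second half of the first output
bwd_ok = torch.allclose(h_n[1], output[:, 0, hidden_dim:])
print(fwd_ok, bwd_ok)  # True True
```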

### Training loop

We go straight to the extended version of the training loop, which also plots the learning curves and can optionally load pre-trained embeddings (see the explorations below).

In [None]:
import matplotlib.pyplot as plt
import tqdm

from torch.utils.data import DataLoader

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def train(n_epochs=5, batch_size=32, lr=1e-2, pretrained=False):
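    # Initialize the model (embedding_dim=100, hidden_dim=50, num_classes=6)
    model = LSTMModel(len(vocab), 100, 50, 6)
    nn.init.xavier_uniform_(model.embedding.weight)
    if pretrained:
        # Replace the embedding layer with pre-trained vectors. This assumes that
        # 'glove.pt' holds a tensor of shape [len(vocab), 100] whose rows follow
        # the word ids in vocab.
        embeddings = torch.load('glove.pt')
        model.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)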

    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)

    # We will keep track of the losses on the two datasets
    train_losses = []
    dev_losses = []
    dev_accuracies = []

    # We use DataLoaders for automatic batching
    train_loader = DataLoader(train_data, batch_size=batch_size, collate_fn=batcher, shuffle=True)
    dev_loader = DataLoader(dev_data, batch_size=batch_size, collate_fn=batcher)

    for t in range(n_epochs):
        with tqdm.tqdm(total=len(train_data)) as pbar:
            pbar.set_description(f'Epoch {t+1}')

            # Start training
            model.train()
            running_loss = 0
            for bx, by in train_loader:
                optimizer.zero_grad()
                output = model(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item() * len(bx)
                pbar.update(len(bx))
            train_losses.append(running_loss / len(train_data))

            # Start evaluation
            model.eval()
            dev_loss = 0
            dev_y = []
            dev_y_pred = []
            with torch.no_grad():
                for bx, by in dev_loader:
                    output = model(bx)
                    loss = F.cross_entropy(output, by)
                    dev_loss += loss.item() * len(bx)
                    dev_y.append(by)
                    dev_y_pred.append(torch.argmax(output, axis=1))
            dev_losses.append(dev_loss / len(dev_data))
            dev_y = torch.hstack(dev_y)
            dev_y_pred = torch.hstack(dev_y_pred)
            dev_acc = accuracy(dev_y_pred, dev_y)
            dev_accuracies.append(dev_acc)

        print(f'dev loss={dev_loss / len(dev_data):.4f} dev acc={dev_acc:.4f}')

    # Plotting
    plt.figure(figsize=(15, 6))
    plt.subplot(121)
    plt.plot(train_losses)
    plt.plot(dev_losses)
    plt.xlabel('Epoch')
    plt.ylabel('Average loss')
    plt.subplot(122)
    plt.plot(dev_accuracies)
    plt.xlabel('Epoch')
    plt.ylabel('Development set accuracy')

    return model

In [None]:
train()

What accuracy do you get?

## Explorations

**🤔 Exploration 1: Architecture tuning**

> Try to tune the architecture of your classifier to see which one gives the highest accuracy on the development data. For example, you could try to stack several LSTM layers on top of each other.
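
As a possible starting point for the stacking idea, the sketch below uses the `num_layers` argument of `nn.LSTM` on random input (all sizes are arbitrary). It is only meant to show how the shape of `h_n` changes, not to suggest a particular configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two unidirectional LSTM layers stacked on top of each other
stacked = nn.LSTM(8, 5, num_layers=2, batch_first=True)
x = torch.randn(3, 7, 8)  # [batch_size, sequence_length, embedding_dim]

output, (h_n, c_n) = stacked(x)
print(h_n.shape)  # torch.Size([2, 3, 5]), one final state per layer

# h_n[-1] is the final state of the top layer, matching the last output
top_ok = torch.allclose(h_n[-1], output[:, -1])
print(top_ok)  # True
```

Because the unidirectional model above already reads `h_n[-1]`, it would pick up the top layer without further changes.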

**🤔 Exploration 2: Initializing embeddings**

> One of the recent advances of deep learning research is the discovery of new techniques for initializing the parameters of neural networks. For example, [Glorot and Bengio (2010)](https://proceedings.mlr.press/v9/glorot10a.html) propose a method for choosing the random distribution from which the parameters are drawn based on the architecture of the network. This method is implemented in the PyTorch functions [`nn.init.xavier_uniform_`](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_uniform_) and [`nn.init.xavier_normal_`](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_). Test this method by calling it on `model.embedding.weight`. What effect does this have on accuracy?
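
The following sketch shows the effect of `nn.init.xavier_uniform_` on a freshly created embedding layer. The layer size (1000 words, 100 dimensions) is an arbitrary choice for illustration. For a two-dimensional weight, the values are drawn uniformly from an interval whose bound is the square root of 6 divided by the sum of the two dimensions.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(1000, 100)
std_before = embedding.weight.std().item()  # default init is N(0, 1)

# Re-initialize the weights in place with Glorot/Xavier uniform
nn.init.xavier_uniform_(embedding.weight)
std_after = embedding.weight.std().item()

# Bound of the uniform distribution for a [1000, 100] weight matrix
bound = math.sqrt(6.0 / (1000 + 100))
max_abs = embedding.weight.abs().max().item()

print(f'std before={std_before:.3f} after={std_after:.3f}')
print(max_abs <= bound + 1e-6)  # True
```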

**🤔 Exploration 3: Pre-trained embeddings**

> Does it help your classifier to use pre-trained word embeddings instead of task-specific embeddings? For example, try to load pre-trained embeddings obtained from the [GloVe project](https://nlp.stanford.edu/projects/glove/). Is it better to ‘freeze’ the embeddings or train them along with the rest of the network?
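
The difference between frozen and trainable embeddings can be seen in the following sketch. The random `vectors` tensor is only a stand-in for real GloVe vectors, which would have to be loaded and aligned with the vocabulary separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for pre-trained vectors: 10 words, 4 dimensions
vectors = torch.randn(10, 4)

# freeze=True: the vectors are excluded from gradient updates
frozen = nn.Embedding.from_pretrained(vectors, freeze=True)
# freeze=False: the vectors are fine-tuned along with the network
tuned = nn.Embedding.from_pretrained(vectors, freeze=False)

print(frozen.weight.requires_grad)  # False
print(tuned.weight.requires_grad)   # True
```

In the `train` function above, the `freeze=False` setting corresponds to training the embeddings along with the rest of the network.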

That’s all folks!