## Anything Goes Implementation
### Bidirectional LSTM for Sequence Tagging

For the "Anything Goes" implementation I used a model similar to the one from the last challenge, adapted for sequence tagging. It reached just under 98% validation accuracy.

Import statements.

In [3]:
import torch
from torch import nn
import torch.optim

from torchtext import data
from torchtext import datasets

import numpy as np

import time
import random

Environment variables. **Set `train_file` and `test_file` to the relative filepaths of the data.** If `test_file` is an empty string no test data will be used.
The validation split determines the percentage of training samples set aside for validation.

In [4]:
train_file = "data/train.tsv"
test_file = ""
val_split = 0.3

Set the random seed for reproducibility.

In [5]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Declare the `TEXT` and `TAGS` fields. Set all `TEXT` tokens to lowercase for normalization.

In [6]:
TEXT = data.Field(lower = True)
TAGS = data.Field(unk_token = None)



In [7]:
fields = (("text", TEXT), ("tags", TAGS))

This is an adapted version of the `SequenceTaggingDataset` from torchtext. Their implementation expected a specific data format that did not match the provided file.

In [8]:
class SequenceTaggingDataset(data.Dataset):
    @staticmethod
    def sort_key(example):
        for attr in dir(example):
            if not callable(getattr(example, attr)) and \
                    not attr.startswith("__"):
                return len(getattr(example, attr))
        return 0

    def __init__(self, path, fields, encoding="utf-8", separator="\t", **kwargs):
        print("Loading data...")
        examples = []
        columns = []

        with open(path, encoding=encoding) as input_file:
            for line in input_file:
                line = line.strip()
                if line.split(separator)[0] == "<S>":
                    if columns:
                        examples.append(data.Example.fromlist(columns, fields))
                    columns = []
                else:
                    for i, column in enumerate(line.split(separator)):
                        if len(columns) < i + 1:
                            columns.append([])
                        columns[i].append(column)
            if columns:
                examples.append(data.Example.fromlist(columns, fields))
        print("Data loaded from {}".format(path))
        super(SequenceTaggingDataset, self).__init__(examples, fields,
                                                     **kwargs)
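
To make the expected file layout concrete, here is a minimal sketch of the same grouping logic without torchtext. It assumes two tab-separated columns (token, tag) and a row starting with `<S>` between sentences, which is what the loader above checks for. The function name `read_tagged_sentences` and the sample rows are hypothetical.

```python
def read_tagged_sentences(lines, separator="\t", boundary="<S>"):
    """Group token/tag rows into sentences, splitting on boundary rows."""
    sentences = []
    tokens, tags = [], []
    for raw in lines:
        parts = raw.strip().split(separator)
        if parts[0] == boundary:
            # A boundary row closes whatever sentence has been collected so far.
            if tokens:
                sentences.append((tokens, tags))
            tokens, tags = [], []
        elif len(parts) >= 2:
            # Assumed layout: first column is the token, second is the tag.
            tokens.append(parts[0])
            tags.append(parts[1])
    # The last sentence has no boundary row after it.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Hypothetical sample rows, not taken from the real data file.
sample = ["<S>\n", "is\tN\n", "dócha\tN\n", "<S>\n", "gur\tN\n", "cuala\tS\n"]
parsed = read_tagged_sentences(sample)
print(parsed)
```

Each parsed sentence is a pair of parallel lists, which is the same shape that `data.Example.fromlist(columns, fields)` receives in the class above.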

Load the data into a PyTorch dataset and split based on the provided `val_split`. Load the test dataset if one is provided.

In [9]:
train_data, val_data = SequenceTaggingDataset(train_file, fields).split(split_ratio=1-val_split)
if len(test_file) > 0:
    test_data = SequenceTaggingDataset(test_file, fields)

Loading data...
Data loaded from data/train.tsv


In [10]:
print("Training samples: {}".format(len(train_data)))
print("Validation samples: {}".format(len(val_data)))
if "test_data" in globals():
    print("Testing samples: {}".format(len(test_data)))

Training samples: 277145
Validation samples: 118777
Testing samples: 277145


Quick sanity check.

In [11]:
print(vars(train_data.examples[0]))

{'text': ['is', 'dócha', 'gur', 'cuala', 'gach', 'duine', 'againn', 'moltaí', 'ag', 'teacht', 'ó', 'comhairlí', 'contae', 'gur', 'féidir', 'gearradh', 'siar', 'de', '<num>', '%', 'ar', 'an', 'méid', 'dramhaíola', 'taobh', 'istigh', 'de', 'roinnt', 'blianta', '.'], 'tags': ['N', 'N', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'N']}


Build the vocab. Only words that appear at least twice in the training data get their own entry. Unseen words and words with a single occurrence map to the unknown token, so they are tagged solely from the surrounding context.

In [12]:
MIN_FREQ = 2

TEXT.build_vocab(train_data,
                 min_freq = MIN_FREQ)
TAGS.build_vocab(train_data)

In [13]:
print("Unique tokens in TEXT: {}".format(len(TEXT.vocab)))
print("Unique tokens in TAG: {}".format(len(TAGS.vocab)))

Unique tokens in TEXT: 61602
Unique tokens in TAG: 6
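
The effect of `min_freq` can be sketched in plain Python. This is an illustration of the idea, not torchtext's implementation: the helper names `build_vocab` and `encode` and the toy sentences are hypothetical, and the special tokens are assumed to sit at indices 0 (`<unk>`) and 1 (`<pad>`).

```python
from collections import Counter


def build_vocab(sentences, min_freq=2, specials=("<unk>", "<pad>")):
    """Return (itos, stoi), keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def encode(tokens, stoi):
    """Map tokens to indices, sending anything outside the vocab to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


# Toy corpus: "an" occurs 3 times, "méid" twice, "duine" once.
toy = [["an", "méid", "an"], ["an", "duine", "méid"]]
itos, stoi = build_vocab(toy)
ids = encode(["an", "duine", "nua"], stoi)
print(itos, ids)
```

Both the single-occurrence word and the never-seen word end up sharing the `<unk>` index, so the model cannot tell them apart except by their neighbours.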


Set the batch size and the GPU if one is available. **I was only able to run this in a reasonable amount of time using a GPU**.
Then create the iterators to produce batches.

In [14]:
BATCH_SIZE = 128

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
print(device)

train_iterator, val_iterator = data.BucketIterator.splits(
    (train_data, val_data),
    batch_size = BATCH_SIZE,
    device = device
)
if "test_data" in globals():
    test_iterator = data.BucketIterator(test_data, batch_size = BATCH_SIZE, device = device)

cuda:2

Declare the model class. Similar setup to last time: an embedding layer followed by an LSTM with bidirectional support and the option to add multiple layers, then a linear layer with dropout to produce the outputs.

In [15]:
class POSTagger(nn.Module):
    def __init__(self,
                 input_dim,
                 embedding_dim,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout,
                 pad_idx):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional,
                            dropout = dropout if n_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))

        return predictions


Similar to last time, I used 100-dimensional embeddings along with 2 bidirectional LSTM layers. The output dimension has to be the number of tags since the model chooses among multiple tags for each token, unlike last time where the output was a single probability.

In [16]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(TAGS.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.3
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = POSTagger(INPUT_DIM,
                        EMBEDDING_DIM,
                        HIDDEN_DIM,
                        OUTPUT_DIM,
                        N_LAYERS,
                        BIDIRECTIONAL,
                        DROPOUT,
                        PAD_IDX)

Since I'm not using pretrained weights this time, initialize all of the model's parameters from a Gaussian distribution (mean 0, standard deviation 0.1).

In [17]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)

model.apply(init_weights)

POSTagger(
  (embedding): Embedding(61602, 100, padding_idx=1)
  (lstm): LSTM(100, 128, num_layers=2, dropout=0.3, bidirectional=True)
  (fc): Linear(in_features=256, out_features=6, bias=True)
  (dropout): Dropout(p=0.3, inplace=False)
)

Print trainable parameters to judge size of the model. It's fairly large, which explains the GPU requirement.

In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("{} trainable parameters".format(count_parameters(model)))

6792526 trainable parameters
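
As a cross-check on that number, the count can be reproduced by hand from the layer sizes printed above. The sketch below assumes the standard LSTM layout (four gates per direction, each with input weights, recurrent weights, and two bias vectors); `lstm_params` is a hypothetical helper, not a PyTorch function.

```python
def lstm_params(input_dim, hidden_dim, n_layers, bidirectional):
    """Count LSTM parameters assuming 4 gates and two bias vectors per gate."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first read the concatenated outputs of both directions.
        layer_in = input_dim if layer == 0 else hidden_dim * directions
        per_direction = 4 * hidden_dim * (layer_in + hidden_dim + 2)
        total += per_direction * directions
    return total


vocab_size, emb_dim, hidden, n_tags = 61602, 100, 128, 6

embedding = vocab_size * emb_dim                     # one row per vocab entry
lstm = lstm_params(emb_dim, hidden, n_layers=2, bidirectional=True)
linear = (hidden * 2) * n_tags + n_tags              # weights plus bias
total = embedding + lstm + linear

print(embedding, lstm, linear, total)
```

The embedding table accounts for roughly 90% of the parameters, so the vocabulary size, not the LSTM, is what makes the model large.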


Set the embedding weights for the padding token to zero so that padding has no effect.

In [19]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)


tensor([[-0.0214, -0.1501, -0.0780,  ..., -0.1113,  0.1239, -0.0879],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0490, -0.1034, -0.1875,  ..., -0.0086, -0.1052,  0.1024],
        ...,
        [ 0.1334,  0.1853,  0.0196,  ..., -0.0059, -0.0565, -0.0670],
        [ 0.0093, -0.0242,  0.1470,  ..., -0.0858, -0.0875,  0.0155],
        [ 0.0475,  0.1175, -0.0679,  ..., -0.1348,  0.1007, -0.1227]])


Standard Adam optimizer with the default learning rate.

In [20]:
optimizer = torch.optim.Adam(model.parameters())

Just like last time, the loss is `CrossEntropyLoss`, but this time it has to ignore outputs at padding positions, since every token produces an output rather than the sentence as a whole.

In [21]:
TAG_PAD_IDX = TAGS.vocab.stoi[TAGS.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
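
What ignoring the padding index means for the loss can be shown in plain Python. This is a sketch of mean cross-entropy over non-ignored targets, written by hand for illustration. The function name `cross_entropy_ignore` and the toy scores are hypothetical, and the padding tag is assumed to have index 0.

```python
import math


def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        # Numerically stable log-sum-exp of the row of scores.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    return total / counted if counted else float("nan")


PAD = 0  # assumed padding tag index for this toy example
targets = [1, 2, PAD]
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, -3.0, 1.0]]
other_pad_scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-5.0, 7.0, 2.0]]

loss = cross_entropy_ignore(logits, targets, PAD)
loss_other = cross_entropy_ignore(other_pad_scores, targets, PAD)
print(loss, loss_other)
```

The two calls differ only in the scores at the padded position and produce the same loss, which is the behavior the padding mask is meant to guarantee.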

Send the model and loss to the GPU if one is available.

In [22]:
model = model.to(device)
criterion.to(device)

CrossEntropyLoss()

Determine accuracy. This was pretty much a copy and paste from [this repo](https://github.com/bentrevett/pytorch-pos-tagging).

In [23]:
def categorical_accuracy(preds, y, tag_pad_idx):
    max_preds = preds.argmax(dim = 1, keepdim = True)
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]])
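
The same idea in plain Python, as a sketch: take the highest-scoring tag for each token, skip padded positions, and divide the number of correct predictions by the number of real tokens. The name `masked_accuracy` and the toy scores are hypothetical, and the padding tag is assumed to have index 0.

```python
def masked_accuracy(scores, gold, pad_idx):
    """Fraction of non-padding positions where the top-scoring tag is correct."""
    correct, counted = 0, 0
    for row, tag in zip(scores, gold):
        if tag == pad_idx:
            continue  # padded positions are excluded from the denominator
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == tag)
        counted += 1
    return correct / counted if counted else 0.0


scores = [
    [0.1, 2.0, 0.3],  # predicts tag 1, gold 1: correct
    [0.2, 0.1, 1.5],  # predicts tag 2, gold 1: wrong
    [0.9, 0.8, 0.1],  # padding, skipped
    [3.0, 0.0, 0.0],  # padding, skipped
]
gold = [1, 1, 0, 0]

acc = masked_accuracy(scores, gold, pad_idx=0)
print(acc)
```

Without the mask, the two padded rows would count as correct predictions of the padding tag and inflate the score.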

Standard train and eval functions.

In [24]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        text = batch.text
        tags = batch.tags

        optimizer.zero_grad()

        predictions = model(text.to(device))

        # flatten to [seq_len * batch_size, n_tags] so the loss sees one row of scores per token
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        loss = criterion(predictions, tags.to(device))

        acc = categorical_accuracy(predictions.cpu(), tags.cpu(), tag_pad_idx)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [25]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            tags = batch.tags

            predictions = model(text.to(device))

            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)

            loss = criterion(predictions, tags.to(device))
            acc = categorical_accuracy(predictions.cpu(), tags.cpu(), tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Train for 10 epochs.

In [26]:
N_EPOCHS = 10

best_val_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    val_loss, val_acc = evaluate(model, val_iterator, criterion, TAG_PAD_IDX)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'model.pt')

    print("Epoch: {}".format(epoch+1))
    print(f"Train Loss: {train_loss:.3f} | Train Acc: {train_acc:.3f}")
    print(f"Val Loss: {val_loss:.3f} | Val Acc: {val_acc:.3f}")


Epoch: 1
Train Loss: 0.142 | Train Acc: 0.954
Val Loss: 0.086 | Val Acc: 0.971
Epoch: 2
Train Loss: 0.078 | Train Acc: 0.974
Val Loss: 0.075 | Val Acc: 0.975
Epoch: 3
Train Loss: 0.066 | Train Acc: 0.978
Val Loss: 0.072 | Val Acc: 0.977
Epoch: 4
Train Loss: 0.059 | Train Acc: 0.980
Val Loss: 0.068 | Val Acc: 0.978
Epoch: 5
Train Loss: 0.054 | Train Acc: 0.982
Val Loss: 0.068 | Val Acc: 0.978
Epoch: 6
Train Loss: 0.051 | Train Acc: 0.983
Val Loss: 0.069 | Val Acc: 0.978
Epoch: 7
Train Loss: 0.048 | Train Acc: 0.984
Val Loss: 0.070 | Val Acc: 0.978
Epoch: 8
Train Loss: 0.045 | Train Acc: 0.985
Val Loss: 0.071 | Val Acc: 0.978
Epoch: 9
Train Loss: 0.043 | Train Acc: 0.986
Val Loss: 0.073 | Val Acc: 0.979
Epoch: 10
Train Loss: 0.041 | Train Acc: 0.986
Val Loss: 0.075 | Val Acc: 0.978
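
The checkpointing rule in the training loop (save only when validation loss strictly improves) can be replayed on the logged numbers. The lists below are copied from the output above, rounded to three decimals, so the epoch chosen here is only indicative: the unrounded losses for epochs 4 and 5 may order differently.

```python
# Validation losses as printed in the log, one per epoch (rounded values).
val_losses = [0.086, 0.075, 0.072, 0.068, 0.068, 0.069, 0.070, 0.071, 0.073, 0.075]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    # Same rule as the training loop: keep a checkpoint only on strict improvement.
    if loss < best_loss:
        best_loss = loss
        best_epoch = epoch

# Epochs after the best one are spent without improving validation loss.
epochs_without_gain = len(val_losses) - best_epoch
print(best_epoch, best_loss, epochs_without_gain)
```

On these rounded values the saved checkpoint comes from around epoch 4, after which validation loss drifts upward while training loss keeps falling, a typical sign of mild overfitting.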


I was surprised at how well this did with largely out-of-the-box torchtext. The library certainly has a steep learning curve, but I'm starting to see its capabilities. Validation accuracy reached just under 98% after a few epochs.

In [None]:
if "test_data" in globals():
    model.load_state_dict(torch.load('model.pt'))

    test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

    print(f"Test Loss: {test_loss:.3f} | Test Acc: {test_acc:.3f}")