## Supervised Implementation
### Mandarin Word Segmentation Using BiLSTMs
My supervised implementation is extremely similar to my "Anything Goes" implementation in `celtic_mutations`. Much of the code carried over directly once I had the data preprocessing working and the hyperparameters adjusted.
I was able to achieve up to 99% validation accuracy, with 30% of the provided dataset held out for validation.

Import statements.

In [1]:
import torch
from torch import nn
import torch.optim

from torchtext import data
from torchtext import datasets

import numpy as np

import time
import random

Environment variables. **Set `train_file` and `test_file` to the relative filepaths of the data.** If `test_file` is an empty string no test data will be used.
The validation split determines the percentage of training samples set aside for validation.

In [2]:
train_file = "data/train.tsv"
test_file = "data/test.tsv"
val_split = 0.3

Set the random seed for reproducibility.

In [3]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Declare the `TEXT` and `TAGS` fields. In this implementation, the `TAGS` field represents whether or not a character is the end of a word.

In [4]:
TEXT = data.Field(lower = True)
TAGS = data.Field(unk_token = None)
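
To make the tagging scheme concrete, here is a minimal sketch of how a pre-segmented sentence maps to per-character tags and back. It assumes `'1'` marks the last character of a word and `'0'` any other character, which matches the positive label used for the F1-score later. The helper names `words_to_tags` and `tags_to_words` are hypothetical and not part of the notebook, and the example segmentation is illustrative rather than taken from the dataset.

```python
def words_to_tags(words):
    """Flatten a list of words into characters and end-of-word tags."""
    chars, tags = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            # Assumption: '1' = last character of a word, '0' = anything else.
            tags.append("1" if i == len(word) - 1 else "0")
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words by cutting after every character tagged '1'."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag == "1":
            words.append(current)
            current = ""
    if current:  # trailing characters with no closing '1'
        words.append(current)
    return words


chars, tags = words_to_tags(["婆婆", "在", "長期"])
print(chars)  # ['婆', '婆', '在', '長', '期']
print(tags)   # ['0', '1', '1', '0', '1']
print(tags_to_words(chars, tags))  # ['婆婆', '在', '長期']
```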



In [5]:
fields = (("text", TEXT), ("tags", TAGS))

I again had to modify the `SequenceTaggingDataset` from torchtext. This time rather than specifying a character for a new example, I divided the examples into 500-character chunks.

In [6]:
class SequenceTaggingDataset(data.Dataset):
    @staticmethod
    def sort_key(example):
        for attr in dir(example):
            if not callable(getattr(example, attr)) and \
                    not attr.startswith("__"):
                return len(getattr(example, attr))
        return 0

    def __init__(self, path, fields, val_split=0, encoding="utf-8", separator="\t", **kwargs):
        print("Loading data...")
        examples = []
        columns = []

        with open(path, encoding=encoding) as input_file:
            for idx, line in enumerate(input_file):
                line = line.strip()
                if columns and idx % 500 == 0:
                    examples.append(data.Example.fromlist(columns, fields))
                    columns = []
                for i, column in enumerate(line.split(separator)):
                    if len(columns) < i + 1:
                        columns.append([])
                    columns[i].append(column)
            if columns:
                examples.append(data.Example.fromlist(columns, fields))
        print("Data loaded from {}".format(path))
        super(SequenceTaggingDataset, self).__init__(examples, fields,
                                                     **kwargs)
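
The chunking logic is easier to see in isolation. The sketch below groups tab-separated `character<TAB>tag` lines into fixed-size examples. `chunk_rows` and the tiny chunk size are hypothetical, chosen only for illustration (the class above starts a new example every 500 lines), and skipping blank lines is an assumption of this sketch.

```python
def chunk_rows(lines, chunk_size, separator="\t"):
    """Group 'char<sep>tag' lines into (chars, tags) examples of at most chunk_size rows."""
    examples = []
    chars, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no data and are skipped
        char, tag = line.split(separator)
        chars.append(char)
        tags.append(tag)
        if len(chars) == chunk_size:
            examples.append((chars, tags))
            chars, tags = [], []
    if chars:  # keep the final, possibly shorter, chunk
        examples.append((chars, tags))
    return examples


rows = ["我\t0", "們\t1", "的\t1", "眼\t0", "睛\t1"]
for chars, tags in chunk_rows(rows, chunk_size=2):
    print(chars, tags)
```

Note that a fixed-size cut can fall in the middle of a word (here `眼` and `睛` end up in different chunks), which is a trade-off of chunking by length rather than by sentence.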

Load the data into a PyTorch dataset and split based on the provided `val_split`. Load the test dataset if one is provided.

In [7]:
train_data, val_data = SequenceTaggingDataset(train_file, fields).split(split_ratio=1-val_split)
if len(test_file) > 0:
    test_data = SequenceTaggingDataset(test_file, fields)



Loading data...
Data loaded from data/train.tsv
Loading data...
Data loaded from data/test.tsv


In [8]:
print("Training samples: {}".format(len(train_data)))
print("Validation samples: {}".format(len(val_data)))
if "test_data" in globals():
    print("Testing samples: {}".format(len(test_data)))

Training samples: 11716
Validation samples: 5021
Testing samples: 396
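
As a rough illustration of what a ratio-based split does, the sketch below shuffles the examples with a seeded generator and slices them. This is a stand-in written under my own assumptions, not torchtext's actual implementation, and `split_examples` is a hypothetical helper. For 16,737 examples it produces the same sizes as the counts printed above (11,716 and 5,021).

```python
import random


def split_examples(examples, train_ratio, seed=1234):
    """Shuffle a copy of the examples and cut it into train/validation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]


val_split = 0.3
train_part, val_part = split_examples(range(16737), 1 - val_split)
print(len(train_part), len(val_part))  # 11716 5021
```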


Quick sanity check.

In [9]:
print(vars(train_data.examples[0]))

{'text': ['婆', '婆', '在', '長', '期', '的', '耳', '濡', '目', '染', '之', '下', '，', '也', '都', '是', '玩', '模', '型', '的', '高', '手', '。', '每', '當', '假', '日', '無', '處', '去', '時', '，', '全', '家', '陶', '醉', '在', '模', '型', '世', '界', '中', '，', '其', '樂', '融', '融', '。', '一', '種', '視', '覺', '上', '錯', '誤', '的', '反', '應', '現', '象', '，', '錯', '視', '早', '就', '被', '發', '現', '了', '，', '我', '們', '的', '眼', '睛', '受', '到', '環', '境', '的', '影', '響', '做', '出', '錯', '誤', '的', '判', '斷', '時', '，', '直', '線', '可', '能', '看', '成', '曲', '線', '，', '平', '行', '線', '可', '能', '看', '成', '歪', '斜', '線', '，', '失', '之', '毫', '釐', '，', '差', '以', '千', '里', '，', '有', '時', '錯', '的', '瘋', '狂', '，', '錯', '的', '離', '譜', '。', '大', '家', '常', '說', '眼', '見', '為', '憑', '，', '由', '於', '我', '們', '對', '眼', '睛', '的', '信', '賴', '程', '度', '，', '遠', '超', '過', '其', '他', '的', '知', '覺', '感', '觀', '，', '一', '旦', '看', '見', '與', '事', '實', '不', '相', '符', '的', '圖', '形', '時', '，', '第', '一', '個', '反', '應', '是', '不', '相', '信', '，', '非', '得', '以', '規', '矩', '實', '量', 

Build the vocab. I'm only including characters that appear at least twice in the training data. Any unseen or single-occurrence characters map to `<unk>` and will be judged solely on their surrounding context.

In [10]:
MIN_FREQ = 2

TEXT.build_vocab(train_data,
                 min_freq = MIN_FREQ)
TAGS.build_vocab(train_data)
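
The effect of `min_freq` can be sketched with a plain `Counter`. The helper `build_vocab` below is hypothetical and only mimics the idea (special tokens first, then characters by descending frequency); it is not torchtext's code.

```python
from collections import Counter


def build_vocab(sequences, min_freq=2, specials=("<unk>", "<pad>")):
    """Map each sufficiently frequent token to an index; the rest share <unk>."""
    counts = Counter(tok for seq in sequences for tok in seq)
    itos = list(specials)
    # Most frequent first; ties broken by code point for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


itos, stoi = build_vocab([["的", "眼", "睛"], ["的", "眼"], ["的"]])
unk = stoi["<unk>"]
print(itos)                                # ['<unk>', '<pad>', '的', '眼']
print([stoi.get(t, unk) for t in "的睛"])   # [2, 0]
```

Here `睛` occurs only once, so it falls below `min_freq` and is looked up as `<unk>`.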

In [11]:
print("Number unique tokens in TEXT: {}".format(len(TEXT.vocab)))
print("Unique tokens in TAG: {}".format(TAGS.vocab.itos))

Number unique tokens in TEXT: 5454
Unique tokens in TAG: ['<pad>', '1', '0']


Set the batch size and the GPU if one is available. **I was only able to run this in a reasonable amount of time using a GPU**.
Then create the iterators to produce batches.

In [12]:
BATCH_SIZE = 128

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
print(device)

train_iterator, val_iterator = data.BucketIterator.splits(
    (train_data, val_data),
    batch_size = BATCH_SIZE,
    device = device
)
if "test_data" in globals():
    test_iterator = data.BucketIterator(test_data, batch_size = BATCH_SIZE, device = device)

cuda:2




Declare the model class. I used the same model as the Celtic Mutations project. The only changes required were hyperparameter modifications.

In [13]:
class WordSegmenter(nn.Module):
    def __init__(self,
                 input_dim,
                 embedding_dim,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout,
                 pad_idx):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional,
                            dropout = dropout if n_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))

        return predictions


100-dimensional embeddings, a 4-layer bidirectional LSTM with 128 hidden units per direction, and 0.2 dropout.

In [33]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(TAGS.vocab)
N_LAYERS = 4
BIDIRECTIONAL = True
DROPOUT = 0.2
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = WordSegmenter(INPUT_DIM,
                        EMBEDDING_DIM,
                        HIDDEN_DIM,
                        OUTPUT_DIM,
                        N_LAYERS,
                        BIDIRECTIONAL,
                        DROPOUT,
                        PAD_IDX)

Since I'm not using pretrained weights this time, initialize all model parameters (not just the embeddings) from a Gaussian distribution with mean 0 and standard deviation 0.1.

In [34]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)

model.apply(init_weights)

WordSegmenter(
  (embedding): Embedding(5454, 100, padding_idx=1)
  (lstm): LSTM(100, 128, num_layers=4, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=256, out_features=3, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)

Print trainable parameters to judge size of the model. It's fairly large, which explains the GPU requirement.

In [35]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("{} trainable parameters".format(count_parameters(model)))

1967483 trainable parameters
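
The total can be reproduced by hand, which also shows where the parameters sit. The breakdown below is a sketch that assumes the standard `torch.nn.LSTM` layout (four gates per direction, with separate input and recurrent bias vectors); `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, hidden_dim, n_layers, bidirectional):
    """Count parameters of a stacked LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first read the concatenated outputs of all directions.
        layer_in = input_dim if layer == 0 else hidden_dim * directions
        # Four gates, each with input weights, recurrent weights and two bias vectors.
        per_direction = 4 * hidden_dim * (layer_in + hidden_dim) + 2 * 4 * hidden_dim
        total += directions * per_direction
    return total


embedding = 5454 * 100                 # vocabulary size x embedding dimension
lstm = lstm_params(100, 128, 4, True)
fc = (128 * 2) * 3 + 3                 # weights + biases of the output layer
print(embedding, lstm, fc, embedding + lstm + fc)
# 545400 1421312 771 1967483
```

About 72% of the parameters are in the LSTM stack and about 28% in the embedding table.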


Set the embedding weights for the padding token to zero so that padding has no effect.

In [36]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)


tensor([[-0.0130,  0.0206, -0.0029,  ..., -0.0323,  0.0617, -0.1056],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0989,  0.0110, -0.1453,  ...,  0.1078, -0.1063,  0.0542],
        ...,
        [-0.0893, -0.0824,  0.0827,  ...,  0.0522, -0.1137,  0.0026],
        [-0.0141,  0.0566, -0.0076,  ..., -0.0563,  0.0193, -0.1662],
        [-0.1331, -0.0218, -0.2010,  ..., -0.0623, -0.0537, -0.1236]])


Standard Adam optimizer with the default learning rate (0.001).

In [37]:
optimizer = torch.optim.Adam(model.parameters())

`CrossEntropyLoss`, ignoring any outputs at padding positions, since every character has its own output rather than one output per sequence.

In [38]:
TAG_PAD_IDX = TAGS.vocab.stoi[TAGS.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
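
What ignoring an index means for the loss can be sketched in plain Python: positions whose target is the padding index are dropped before averaging. `masked_cross_entropy` is a hypothetical helper written for illustration, not PyTorch's implementation; returning `0.0` when every position is padding is an assumption of this sketch.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]  # index 0 plays the role of the '<pad>' tag here
print(masked_cross_entropy(logits, targets, ignore_index=0))
```

The third row is padding, so the result is the average loss of the first two rows only.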

Get index of positive labels. This will be used for calculating F1-score.

In [39]:
TAG_POS_IDX = TAGS.vocab.stoi['1']

Send the model and loss to the GPU if one is available.

In [40]:
model = model.to(device)
criterion.to(device)

CrossEntropyLoss()

Determine accuracy. This was pretty much a copy and paste from [this repo](https://github.com/bentrevett/pytorch-pos-tagging).

In [41]:
def categorical_accuracy(preds, y, tag_pad_idx):
    max_preds = preds.argmax(dim = 1, keepdim = True)
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]])

In [42]:
def get_precision(preds, y, tag_pos_idx):
    max_preds = preds.argmax(dim = 1, keepdim = True).squeeze(1)
    pos_preds = (max_preds == tag_pos_idx).nonzero()
    correct = max_preds[pos_preds].eq(y[pos_preds])
    return correct.sum() / torch.FloatTensor([y[pos_preds].shape[0]])

In [43]:
def get_recall(preds, y, tag_pos_idx):
    max_preds = preds.argmax(dim = 1, keepdim = True)
    positives = (y == tag_pos_idx).nonzero()
    correct = max_preds[positives].squeeze(1).eq(y[positives])
    return correct.sum() / torch.FloatTensor([y[positives].shape[0]])

In [44]:
def f1score(precision, recall):
    return 2*((precision*recall)/(precision+recall))
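
For reference, here is a count-based sketch of precision, recall and F1 that skips padded positions and guards against division by zero. It is not the notebook's implementation: `get_precision` above does not filter out padding, so its numbers may differ slightly. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(pred_tags, gold_tags, pos_idx, pad_idx):
    """Compute precision, recall and F1 for pos_idx, skipping padded positions."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_tags, gold_tags):
        if gold == pad_idx:
            continue  # padding is excluded from every count
        if pred == pos_idx and gold == pos_idx:
            tp += 1
        elif pred == pos_idx:
            fp += 1
        elif gold == pos_idx:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Tag indices follow the vocabulary printed earlier: 0 = '<pad>', 1 = '1', 2 = '0'.
pred = [1, 2, 1, 1, 1]
gold = [1, 2, 2, 1, 0]
print(precision_recall_f1(pred, gold, pos_idx=1, pad_idx=0))
```

In this toy case there are two true positives and one false positive, so precision is 2/3, recall is 1, and F1 is 0.8; the last position is padding and is ignored.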

Standard train and eval functions.

In [45]:
def train(model, iterator, optimizer, criterion, tag_pad_idx, tag_pos_idx):
    epoch_loss = 0
    epoch_acc = 0
    epoch_precision = 0
    epoch_recall = 0

    model.train()

    for batch in iterator:
        text = batch.text
        tags = batch.tags

        optimizer.zero_grad()

        predictions = model(text.to(device))

        # flatten to [seq_len * batch_size, n_tags]: CrossEntropyLoss expects the class dimension second
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        loss = criterion(predictions, tags.to(device))
        
        acc = categorical_accuracy(predictions.cpu(), tags.cpu(), tag_pad_idx)
        precision = get_precision(predictions.cpu(), tags.cpu(), tag_pos_idx)
        recall = get_recall(predictions.cpu(), tags.cpu(), tag_pos_idx)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
        epoch_precision += precision.item()
        epoch_recall += recall.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator), f1score(epoch_precision / len(iterator), epoch_recall / len(iterator))

In [46]:
def evaluate(model, iterator, criterion, tag_pad_idx, tag_pos_idx):
    epoch_loss = 0
    epoch_acc = 0
    epoch_precision = 0
    epoch_recall = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            tags = batch.tags

            predictions = model(text.to(device))

            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)

            loss = criterion(predictions, tags.to(device))
            acc = categorical_accuracy(predictions.cpu(), tags.cpu(), tag_pad_idx)
            precision = get_precision(predictions.cpu(), tags.cpu(), tag_pos_idx)
            recall = get_recall(predictions.cpu(), tags.cpu(), tag_pos_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            epoch_precision += precision.item()
            epoch_recall += recall.item()
    
    return epoch_loss / len(iterator), epoch_acc / len(iterator), f1score(epoch_precision / len(iterator), epoch_recall / len(iterator))

Train for 40 epochs. This task took much longer than the Celtic Mutations task, but it was eventually able to reach similar accuracy.

In [47]:
N_EPOCHS = 40

best_val_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc, train_f1 = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX, TAG_POS_IDX)
    val_loss, val_acc, val_f1 = evaluate(model, val_iterator, criterion, TAG_PAD_IDX, TAG_POS_IDX)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'model.pt')

    print("Epoch: {}".format(epoch+1))
    print(f"Train Loss: {train_loss:.3f} | Train Acc: {train_acc:.3f} | Train F1-Score: {train_f1: .3f}")
    print(f"Val Loss: {val_loss:.3f} | Val Acc: {val_acc:.3f} | Val F1-Score: {val_f1: .3f}")

Epoch: 1
Train Loss: 0.550 | Train Acc: 0.713 | Train F1-Score:  0.817
Val Loss: 0.270 | Val Acc: 0.888 | Val F1-Score:  0.914
Epoch: 2
Train Loss: 0.238 | Train Acc: 0.903 | Train F1-Score:  0.926
Val Loss: 0.205 | Val Acc: 0.919 | Val F1-Score:  0.937
Epoch: 3
Train Loss: 0.197 | Train Acc: 0.922 | Train F1-Score:  0.940
Val Loss: 0.171 | Val Acc: 0.933 | Val F1-Score:  0.949
Epoch: 4
Train Loss: 0.171 | Train Acc: 0.933 | Train F1-Score:  0.949
Val Loss: 0.151 | Val Acc: 0.942 | Val F1-Score:  0.955
Epoch: 5
Train Loss: 0.155 | Train Acc: 0.940 | Train F1-Score:  0.954
Val Loss: 0.139 | Val Acc: 0.948 | Val F1-Score:  0.960
Epoch: 6
Train Loss: 0.143 | Train Acc: 0.946 | Train F1-Score:  0.958
Val Loss: 0.128 | Val Acc: 0.952 | Val F1-Score:  0.963
Epoch: 7
Train Loss: 0.132 | Train Acc: 0.950 | Train F1-Score:  0.962
Val Loss: 0.120 | Val Acc: 0.956 | Val F1-Score:  0.966
Epoch: 8
Train Loss: 0.124 | Train Acc: 0.954 | Train F1-Score:  0.964
Val Loss: 0.113 | Val Acc: 0.959 | Val F

The validation F1-score exceeded 0.98, which seems pretty good, although as mentioned above it takes a number of epochs to reach its maximum.

In [52]:
if "test_data" in globals():
    model.load_state_dict(torch.load('model.pt'))

    test_loss, test_acc, test_f1 = evaluate(model, test_iterator, criterion, TAG_PAD_IDX, TAG_POS_IDX)

    print(f"Test Loss: {test_loss:.3f} | Test Acc: {test_acc:.3f} | Test F1-Score: {test_f1: .3f}")

Test Loss: 0.109 | Test Acc: 0.962 | Test F1-Score:  0.969
