# Irish Language Translation: Anything Goes
My Anything Goes implementation uses a GRU encoder-decoder model with attention. In addition, I override the model's prediction for a word when that word is unlikely to have changed from the source (it stays the same at least 90% of the time in the training data). This captures the obviously unchanging words that are characteristic of the dataset while letting the model learn how to translate the words that do change.

The model itself is based largely on the tutorials in [bentrevett/pytorch-seq2seq](https://github.com/bentrevett/pytorch-seq2seq).

Filenames and train-validation split. During development I used a standard 70/30 split, but to maximize test performance I've increased the share of training data to 90%.

In [1]:
PARAMS = {
    'train-source': "train-source.txt",
    'train-target': "train-target.txt",
    'test-source': "test-source.txt",
    'test-target': "test-target.txt",
    'split': 0.9,
}

Imports.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from nltk.translate.bleu_score import corpus_bleu
from torchtext.data import Field, BucketIterator, TabularDataset

import re
import numpy as np
import pandas as pd

import random
import math
import time
import warnings

warnings.filterwarnings("ignore")

Open the provided filenames.

In [3]:
def read_file(path: str) -> str:
    with open(path, 'r') as f:  # close the handle once the text is read
        return f.read()

source = read_file(PARAMS['train-source'])
target = read_file(PARAMS['train-target'])
test_source = read_file(PARAMS['test-source'])
test_target = read_file(PARAMS['test-target'])

Clean the data. I removed all punctuation (keeping `<`, `>`, and `/` so the sentence markers survive) on the assumption that punctuation doesn't change during translation, since most of the changes are simply spelling changes.

In [4]:
source = re.sub('\n', ' ', source)
source = re.sub(r'[^\w\s<>/]', '', source)
target = re.sub('\n', ' ', target)
target = re.sub(r'[^\w\s<>/]', '', target)
test_source = re.sub('\n', ' ', test_source)
test_source = re.sub(r'[^\w\s<>/]', '', test_source)
test_target = re.sub('\n', ' ', test_target)
test_target = re.sub(r'[^\w\s<>/]', '', test_target)
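
To make the effect of the cleaning step concrete, here is a small sketch that applies the same two substitutions to a made-up line (the sample text is an assumption, not taken from the dataset). Python's `\w` matches accented letters by default, so vowels with a fada survive, while commas, exclamation marks, and question marks are dropped.

```python
import re

def clean(text: str) -> str:
    """Apply the same two substitutions as the cell above."""
    text = re.sub('\n', ' ', text)            # join lines with a space
    return re.sub(r'[^\w\s<>/]', '', text)    # drop punctuation, keep <, >, /

# Hypothetical sample, invented for illustration only.
sample = "<s> Tá sé go maith, a dhuine! </s>\n<s> Cad é sin? </s>"
cleaned = clean(sample)
print(cleaned)
```

One side effect to keep in mind: apostrophes are punctuation too, so a contracted form loses its apostrophe and the two halves are fused into one token.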

The `split_sentences` function splits the data on the `<s>` and `</s>` tokens.

In [5]:
def split_sentences(raw_data: str):
    sentences = []
    curr = []

    for word in raw_data.split(' '):
        if word != '<s>' and word != '</s>':
            curr.append(word)
        if word == '</s>':
            sentences.append(' '.join(curr))
            curr = []
    return sentences
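
As a quick check of the splitting logic, the sketch below repeats the function so it runs on its own and feeds it a made-up two-sentence string (the sample text is an assumption).

```python
def split_sentences(raw_data: str):
    # Same logic as above, repeated so this sketch is self-contained.
    sentences = []
    curr = []
    for word in raw_data.split(' '):
        if word != '<s>' and word != '</s>':
            curr.append(word)
        if word == '</s>':
            sentences.append(' '.join(curr))
            curr = []
    return sentences

# Hypothetical input in the same <s> ... </s> format as the data files.
toy = "<s> an chéad abairt </s> <s> an dara habairt </s>"
sentences = split_sentences(toy)
print(sentences)
```

Because the text is split on a single space, a run of several spaces produces empty-string tokens that are carried into the sentence; `raw_data.split()` with no argument would collapse them if that ever matters.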

In [6]:
source_data, target_data = split_sentences(source), split_sentences(target)
test_source, test_target = split_sentences(test_source), split_sentences(test_target)

Save the data to CSV files to be read in with torchtext's `TabularDataset`.

In [7]:
train_df = pd.DataFrame({'source': source_data, 'target': target_data})
test_df = pd.DataFrame({'source': test_source, 'target': test_target})

In [8]:
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

Seed for reproducibility.

In [9]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Declare torchtext fields.

In [10]:
SRC = Field(init_token = '<s>', 
            eos_token = '</s>', 
            lower = True)

TRG = Field(init_token = '<s>', 
            eos_token = '</s>', 
            lower = True)

fields = [('src', SRC), ('trg', TRG)]

Construct the datasets from the saved CSV files.

In [11]:
train_data, valid_data = TabularDataset('train.csv', 'csv', fields = fields).split(PARAMS['split'])
test_data = TabularDataset('test.csv', 'csv', fields = fields)

Generate a list of words that change less than 10% of the time. When constructing translations, I override the model's output for these words since they most likely stay the same. This guarantees at least baseline accuracy on them and leaves the model's ability to use context for the words that do change.

In [12]:
source_changes = {}
for ex in train_data:
    for source_word in ex.src:
        if source_word not in source_changes:
            source_changes[source_word] = {'exact': 0, 'change': 0}
        # a word counts as unchanged if it appears anywhere in the target sentence
        if source_word in ex.trg:
            source_changes[source_word]['exact'] += 1
        else:
            source_changes[source_word]['change'] += 1

In [13]:
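def calc_change(change_dict: dict) -> float:
    # fraction of occurrences where the source word is missing from the target
    return change_dict['change'] / max(sum(change_dict.values()), 1)

# a set gives constant-time membership checks when overriding predictions
unchanging = {word for word in source_changes if calc_change(source_changes[word]) < 0.1}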

In [14]:
print("{} unchanging words out of {}".format(len(unchanging), len(source_changes)))

12817 unchanging words out of 27913
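
The counting rule can be traced by hand on a tiny example. The sketch below uses three invented sentence pairs (the token lists stand in for `ex.src` and `ex.trg` and are assumptions, not dataset rows) and applies the same 10% threshold.

```python
from collections import defaultdict

# Hypothetical (source, target) token lists, invented for illustration.
pairs = [
    (["an", "fear", "mór"], ["an", "fear", "mór"]),
    (["an", "bhean", "bheag"], ["an", "bhean", "bheag"]),
    (["an", "oidhche", "fada"], ["an", "oíche", "fada"]),
]

counts = defaultdict(lambda: {"exact": 0, "change": 0})
for src, trg in pairs:
    for word in src:
        # Same test as above: is the source word present in the target sentence?
        key = "exact" if word in trg else "change"
        counts[word][key] += 1

def change_rate(c: dict) -> float:
    return c["change"] / max(c["exact"] + c["change"], 1)

unchanging_toy = {w for w, c in counts.items() if change_rate(c) < 0.1}
print(sorted(unchanging_toy))
```

Here only `oidhche` is respelled, so it is the one word left for the model to translate; every other word falls under the threshold and would be copied straight from the source.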


Build the vocabularies, keeping only tokens that appear at least twice in the training data; rarer tokens map to `<unk>`.

In [15]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [17]:
BATCH_SIZE = 64

train_iterator = BucketIterator(train_data, batch_size = BATCH_SIZE, device = device)
valid_iterator = BucketIterator(valid_data, batch_size = BATCH_SIZE, device = device)

The encoder uses a single bidirectional GRU with dropout.

In [18]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        embedded = self.dropout(self.embedding(src))
        
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        return outputs, hidden

Attention module for determining which words to emphasize during decoding.

In [19]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)
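
The tensor shapes in the attention module are easy to lose track of, so here is a standalone shape walkthrough. It is a sketch with toy dimensions chosen only for illustration (the sizes and random inputs are assumptions); the layer names mirror the module above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
enc_hid_dim, dec_hid_dim = 4, 3   # toy sizes, not the ones used for training
src_len, batch_size = 5, 2

attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
v = nn.Linear(dec_hid_dim, 1, bias=False)

hidden = torch.randn(batch_size, dec_hid_dim)                        # decoder state
encoder_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)  # both GRU directions

# Repeat the decoder state once per source position: [batch, src_len, dec_hid]
hidden_rep = hidden.unsqueeze(1).repeat(1, src_len, 1)
# Put the batch dimension first so the tensors line up: [batch, src_len, enc_hid * 2]
enc = encoder_outputs.permute(1, 0, 2)

energy = torch.tanh(attn(torch.cat((hidden_rep, enc), dim=2)))  # [batch, src_len, dec_hid]
scores = v(energy).squeeze(2)                                   # [batch, src_len]
weights = F.softmax(scores, dim=1)                              # each row sums to 1

# Context vector the decoder consumes: weighted sum of the encoder outputs
context = torch.bmm(weights.unsqueeze(1), enc)                  # [batch, 1, enc_hid * 2]
print(weights.shape, context.shape)
```

Each row of `weights` is a distribution over source positions, which is what lets the decoder focus on the source word it is currently respelling.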

The decoder feeds the embedded input token and the attention-weighted context into a single GRU layer.

In [20]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        
        input = input.unsqueeze(0)
        
        embedded = self.dropout(self.embedding(input))
    
        a = self.attention(hidden, encoder_outputs)
        
        a = a.unsqueeze(1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        return prediction, hidden.squeeze(0)

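Build the full model. During training, the decoder is fed the ground-truth previous token about half of the time (teacher forcing) and its own prediction otherwise.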

In [21]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        encoder_outputs, hidden = self.encoder(src)
                
        input = trg[0,:]
        
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            outputs[t] = output
            
            teacher_force = random.random() < teacher_forcing_ratio
            
            top1 = output.argmax(1) 
            
            input = trg[t] if teacher_force else top1

        return outputs
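
The effect of `teacher_forcing_ratio` is easiest to see with the network stripped away. In this sketch `predict` is a hypothetical stand-in for one decoder step followed by `argmax`, and the target sentence is invented; only the input-selection logic matches the loop above.

```python
import random

def decoder_inputs(target, predict, teacher_forcing_ratio, rng):
    """Return the token the decoder sees at each step t = 1 .. len(target) - 1."""
    fed = []
    current = target[0]                    # the first input is always <s>
    for t in range(1, len(target)):
        fed.append(current)
        guess = predict(current)           # stand-in for decoder + argmax
        teacher_force = rng.random() < teacher_forcing_ratio
        current = target[t] if teacher_force else guess
    return fed

target = ["<s>", "an", "oíche", "fada", "</s>"]   # hypothetical target sentence
always_wrong = lambda token: "<unk>"               # a deliberately bad "model"

forced = decoder_inputs(target, always_wrong, 1.0, random.Random(0))
free = decoder_inputs(target, always_wrong, 0.0, random.Random(0))
print(forced)
print(free)
```

With a ratio of 1.0 the decoder always sees the correct previous word regardless of what it predicted; with 0.0 it has to live with its own mistakes, which is the setting used for evaluation.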

I tried a few variations of these hyperparameters. A higher dropout with lower dimensions seems to work the best.

In [22]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 64
DEC_EMB_DIM = 64
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = EncoderDecoder(enc, dec, device).to(device)

Initialize weights from a normal distribution with mean 0 and standard deviation 0.01, and set all other parameters (the biases) to 0.

In [23]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

EncoderDecoder(
  (encoder): Encoder(
    (embedding): Embedding(15042, 64)
    (rnn): GRU(64, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(13799, 64)
    (rnn): GRU(320, 128)
    (fc_out): Linear(in_features=448, out_features=13799, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [24]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'{count_parameters(model):,} trainable parameters')

8,445,671 trainable parameters
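
The parameter count can be reproduced by hand from the layer sizes printed above. The sketch below assumes the usual PyTorch layout for a GRU (three gates, each with input weights, hidden weights, and two bias vectors per direction); the helper names are made up for this check.

```python
def gru_params(input_size: int, hidden_size: int, directions: int = 1) -> int:
    # 3 gates x (input weights + hidden weights) plus two bias vectors per gate
    per_direction = 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size
    return directions * per_direction

def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    return in_features * out_features + (out_features if bias else 0)

fc_out = linear_params(448, 13799)      # decoder output projection
total = (
    15042 * 64                          # encoder embedding
    + gru_params(64, 128, 2)            # bidirectional encoder GRU
    + linear_params(256, 128)           # encoder fc
    + linear_params(384, 128)           # attention attn
    + linear_params(128, 1, False)      # attention v
    + 13799 * 64                        # decoder embedding
    + gru_params(320, 128)              # decoder GRU
    + fc_out
)
print(f"{total:,} total, {fc_out:,} in fc_out")
```

The output projection alone holds 6,195,751 of the parameters, so the target vocabulary size dominates the model size far more than the hidden dimensions do.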


I tried a few different learning rates. 0.01 won out by far.

In [25]:
optimizer = optim.Adam(model.parameters(), lr=0.01)

Declare the cross-entropy loss, ignoring padding positions in the target.

In [26]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
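
A small sketch of what `ignore_index` does: positions whose target is the padding index contribute nothing to the loss, and the mean is taken over the remaining positions. The padding index of 1 and the logits are assumptions made for this example.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed padding index for this sketch
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.tensor([[2.0, 0.0, 0.5],
                       [0.1, 0.2, 3.0],
                       [1.0, 1.0, 1.0]])
targets = torch.tensor([0, 2, PAD_IDX])   # the last position is padding

loss_with_pad = criterion(logits, targets)
loss_without_pad = criterion(logits[:2], targets[:2])
print(loss_with_pad.item(), loss_without_pad.item())
```

The two losses agree, which is why short sentences padded out to the batch length do not distort the training signal.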

Function for replacing the model's outputs where a word is unlikely to have changed.

In [27]:
def check_unchanging(src: list, pred: list) -> list:
    output = []
    for i, word in enumerate(src):
        if word in unchanging:
            output.append(word)
        else:
            output.append(pred[i])
    return output
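
A usage sketch for the override, with an invented source sentence, an invented model output, and a small hypothetical collection of stable words in place of the real one.

```python
unchanging = {"an", "fada"}   # hypothetical stand-in for the real collection

def check_unchanging(src: list, pred: list) -> list:
    # Same logic as above, repeated so this sketch runs on its own.
    output = []
    for i, word in enumerate(src):
        if word in unchanging:
            output.append(word)
        else:
            output.append(pred[i])
    return output

src = ["an", "oidhche", "fada"]
pred = ["na", "oíche", "fhada"]   # made-up model output with two mistakes
result = check_unchanging(src, pred)
print(result)
```

The override is positional: it assumes word `i` of the source corresponds to word `i` of the prediction, which seems reasonable here because the translation is mostly a word-for-word respelling.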

In [28]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    targets = []
    outputs = []
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        output_dim = output.shape[-1]
        
        flattened_output = output[1:].view(-1, output_dim)
        flattened_trg = trg[1:].view(-1)
        
        loss = criterion(flattened_output, flattened_trg)
        loss.backward()
        
        # clip the gradient norm to keep updates stable (picked up from a tutorial; it seems to work well)
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()

        # construct translations for BLEU
        pred = output.argmax(dim=-1)  # softmax is monotonic, so argmax over the raw logits gives the same tokens
        pred = torch.transpose(pred[1:], 1, 0)
        trg = torch.transpose(trg[1:], 1, 0)
        src = torch.transpose(src[1:], 1, 0)
        for i in range(trg.shape[0]):
            vocab_trg = [TRG.vocab.itos[p] for p in trg[i]]
            vocab_src = [SRC.vocab.itos[p] for p in src[i]]
            end_idx = vocab_trg.index('</s>')
            targets.append([vocab_trg[:end_idx]])
            outputs.append(check_unchanging(vocab_src[:end_idx], [TRG.vocab.itos[p] for p in pred[i]][:end_idx]))

        torch.cuda.empty_cache()
        
    return epoch_loss / len(iterator), corpus_bleu(targets, outputs)
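
Since gradient clipping was new to me, here is the idea behind clipping by norm as I understand it, written in plain Python rather than with PyTorch. It is an illustrative sketch, not PyTorch's implementation: all gradients are treated as one long vector, and if its L2 norm exceeds `max_norm`, every gradient is scaled down by the same factor.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]   # toy gradients with a global norm of 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

small, _ = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)   # already under the limit
```

The direction of the update is preserved and only its length is capped, so a single bad batch cannot throw the weights far off course.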

In [29]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    targets = []
    outputs = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0)  # turn off teacher forcing
            output_dim = output.shape[-1]
            
            flattened_output = output[1:].view(-1, output_dim)
            flattened_trg = trg[1:].view(-1)

            loss = criterion(flattened_output, flattened_trg)

            epoch_loss += loss.item()

            # construct translations for BLEU
            pred = output.argmax(dim=-1)  # softmax is monotonic, so argmax over the raw logits gives the same tokens
            pred = torch.transpose(pred[1:], 1, 0)
            trg = torch.transpose(trg[1:], 1, 0)
            src = torch.transpose(src[1:], 1, 0)
            for i in range(trg.shape[0]):
                vocab_trg = [TRG.vocab.itos[p] for p in trg[i]]
                vocab_src = [SRC.vocab.itos[p] for p in src[i]]
                end_idx = vocab_trg.index('</s>')
                targets.append([vocab_trg[:end_idx]])
                outputs.append(check_unchanging(vocab_src[:end_idx], [TRG.vocab.itos[p] for p in pred[i]][:end_idx]))
                
            torch.cuda.empty_cache()
        
    return epoch_loss / len(iterator), corpus_bleu(targets, outputs)

In [30]:
N_EPOCHS = 8
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_bleu = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss, valid_bleu = evaluate(model, valid_iterator, criterion)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    
    print(f'Epoch {epoch+1}')
    print(f'\tTrain Loss: {train_loss:.4f} | Train BLEU: {train_bleu:.4f}')
    print(f'\t Val. Loss: {valid_loss:.4f} |  Val. BLEU: {valid_bleu:.4f}')

Epoch 1
	Train Loss: 4.8210 | Train BLEU: 0.4605
	 Val. Loss: 3.2811 |  Val. BLEU: 0.5616
Epoch 2
	Train Loss: 3.1644 | Train BLEU: 0.5540
	 Val. Loss: 2.8760 |  Val. BLEU: 0.6261
Epoch 3
	Train Loss: 3.2934 | Train BLEU: 0.5677
	 Val. Loss: 2.9919 |  Val. BLEU: 0.6607
Epoch 4
	Train Loss: 3.4369 | Train BLEU: 0.5648
	 Val. Loss: 3.1967 |  Val. BLEU: 0.6692
Epoch 5
	Train Loss: 3.4357 | Train BLEU: 0.5723
	 Val. Loss: 3.1633 |  Val. BLEU: 0.6604
Epoch 6
	Train Loss: 3.3797 | Train BLEU: 0.5702
	 Val. Loss: 3.1573 |  Val. BLEU: 0.6620
Epoch 7
	Train Loss: 3.4320 | Train BLEU: 0.5549
	 Val. Loss: 3.2786 |  Val. BLEU: 0.6424
Epoch 8
	Train Loss: 4.0764 | Train BLEU: 0.5069
	 Val. Loss: 3.7495 |  Val. BLEU: 0.5725
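
The log shows validation loss and validation BLEU peaking at different epochs. The sketch below copies the validation numbers from the output above and compares the two selection criteria.

```python
# (validation loss, validation BLEU) per epoch, copied from the log above
history = [
    (3.2811, 0.5616), (2.8760, 0.6261), (2.9919, 0.6607), (3.1967, 0.6692),
    (3.1633, 0.6604), (3.1573, 0.6620), (3.2786, 0.6424), (3.7495, 0.5725),
]

# Epochs are 1-indexed in the log, hence the + 1
best_by_loss = min(range(len(history)), key=lambda e: history[e][0]) + 1
best_by_bleu = max(range(len(history)), key=lambda e: history[e][1]) + 1
print(f"lowest loss: epoch {best_by_loss}, highest BLEU: epoch {best_by_bleu}")
```

The training loop saves on validation loss, so `model.pt` holds the epoch 2 weights; checkpointing on BLEU instead would have kept epoch 4. Which criterion is preferable depends on whether loss or BLEU is the metric that matters for the task.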


In [31]:
model.load_state_dict(torch.load('model.pt'))

# note: this reuses valid_iterator, so the score printed below is computed on the validation split
_, test_bleu = evaluate(model, valid_iterator, criterion)

print("Test BLEU: {:.4f}".format(test_bleu))

Test BLEU: 0.6262
