# Irish Language Translation: From Scratch
Rather than go for a standard EM approach for my From Scratch implementation, I stumbled upon something that's incredibly simple yet works pretty well. In this implementation I take advantage of that facts that:
    - the word orderings rarely change from source to target
    - a significant percentage of source words remain unchanged in the target
    - many of the words that *do* change do so the same way each time
These factors would true 100% of the time but they were true often enough I could "hack" what I think is a reasonable solution.

Filenames and train-validation split. During development I used a standard 70/30 spliot but for the sake of maximizing test performance I've increased the amount of training data.

In [1]:
import warnings

PARAMS = {
    'train-source': "train-source.txt",
    'train-target': "train-target.txt",
    'test-source': "test-source.txt",
    'test-target': "test-target.txt",
    'split': 0.9,
}

warnings.filterwarnings("ignore")

Open the provided filenames.

In [2]:
source = open(PARAMS['train-source'], 'r').read()
target = open(PARAMS['train-target'], 'r').read()
test_source = open(PARAMS['test-source'], 'r').read()
test_target = open(PARAMS['test-target'], 'r').read()

Clean the data. I removed all punctuation under the assumption that it doesn't change during translation, considering most of the changes are simply spelling changes.

In [3]:
import re

source = re.sub('\n', ' ', source)
source = re.sub(r'[^\w\s<>/]', '', source)
target = re.sub('\n', ' ', target)
target = re.sub(r'[^\w\s<>/]', '', target)
test_source = re.sub('\n', ' ', test_source)
test_source = re.sub(r'[^\w\s<>/]', '', test_source)
test_target = re.sub('\n', ' ', test_target)
test_target = re.sub(r'[^\w\s<>/]', '', test_target)

The `split_sentences` function splits the data on the `<s>` and `</s>` tokens.

In [4]:
def split_sentences(raw_data: str):
    sentences = []
    curr = []

    for word in raw_data.split(' '):
        if word != '<s>' and word != '</s>' and word != '':
            curr.append(word.lower())
        if word == '</s>':
            sentences.append(curr)
            curr = []
    return sentences

In [5]:
source_data = split_sentences(source)
target_data = split_sentences(target)
test_source, test_target = split_sentences(test_source), split_sentences(test_target)

Split the train data into train and validation based on the provided ratio.

In [6]:
num_train = int(len(source_data)*PARAMS['split'])
train_source, valid_source = source_data[:num_train], source_data[num_train:]
train_target, valid_target = target_data[:num_train], target_data[num_train:]

Construct the vocab for the training data. I could have done this above before splitting the sentences with a simple `set` call, but figured I didn't want to include the start and end tags so this'll do.

In [7]:
def get_vocab(list_of_sents: list) -> set:
  vocab = set()
  for sentence in list_of_sents:
    for word in sentence:
      vocab.add(word)
  return vocab

source_vocab = get_vocab(train_source)
target_vocab = get_vocab(train_target)

Construct dictionaries to transition between integer and string representations. This allows me to use a list of lists rather than a dictionary of dictionaries (which eats a ton of RAM) in the below `co_occurrence` matrix.

In [8]:
source_stoi = {src_wrd: i for i, src_wrd in enumerate(source_vocab)}
source_itos = {i: src_wrd for i, src_wrd in enumerate(source_vocab)}
target_stoi = {trg_wrd: i for i, trg_wrd in enumerate(target_vocab)}
target_itos = {i: trg_wrd for i, trg_wrd in enumerate(target_vocab)}

Co-occurrence compares each word in the source sentences to the word in the same position in the target sentences. Although occasionally positions won't perfectly match between source and target, they do the majority of the time. So, over the entire training set, two words that are in the same position the most often are likely direct translations of each other. Because many words do not change, these words are often themselves, but they could also be spelling variations.

In [9]:
co_occurrence = [[0 for trg_word in target_vocab] for src_word in source_vocab]

for i, source_sent in enumerate(train_source):
  target_sent = train_target[i]
  for i, source_word in enumerate(source_sent):
    if i < len(target_sent):
      target_word = target_sent[i]
      co_occurrence[source_stoi[source_word]][target_stoi[target_word]] += 1


The `translate` function uses the co-occurrence matrix to predict the most commonly seen target word for each source word. If the word has never been seen, i.e. it's not in the training set, the word is predicted to be unchanging.

In [10]:
def translate(sentence: list) -> list:
  translation = []
  for word in sentence:
    if word not in source_vocab:
      translation.append(word)
    else:
      word_co = co_occurrence[source_stoi[word]]
      translation.append(target_itos[word_co.index(max(word_co))])
  return translation

Test on some training sentences, seems to be working.

In [11]:
for i in range(5):
  print(train_source[i][:12])
  print(train_target[i][:12])
  print(translate(train_source[i][:12]))
  print('-'*100)

['cinnte', 'go', 'leór', 'thiocfadh', 'dóbhtha', 'bás', 'a', 'fhagháil', 'ar', 'imeall', 'an', 'phuill']
['cinnte', 'go', 'leor', 'thiocfadh', 'dóibh', 'bás', 'a', 'fháil', 'ar', 'imeall', 'an', 'phoill']
['cinnte', 'go', 'leor', 'thiocfadh', 'dóibh', 'bás', 'a', 'fháil', 'ar', 'imeall', 'an', 'phoill']
----------------------------------------------------------------------------------------------------
['bhí', 'sé', 'follasach', 'go', 'rabh', 'an', 'poll', 'sin', 'ag', 'foscladh', 'ar', 'an']
['bhí', 'sé', 'follasach', 'go', 'raibh', 'an', 'poll', 'sin', 'ag', 'foscladh', 'ar', 'an']
['bhí', 'sé', 'follasach', 'go', 'raibh', 'an', 'poll', 'sin', 'ag', 'foscladh', 'ar', 'an']
----------------------------------------------------------------------------------------------------
['dfhéadfadh', 'siad', 'bás', 'fhagháil', 'ar', 'a', 'bhruach', 'agus', 'na', 'cuirp', 'imtheacht', 'ar']
['dfhéadfadh', 'siad', 'bás', 'a', 'fháil', 'ar', 'a', 'bhruach', 'agus', 'na', 'coirp', 'a']
['dfhéadfadh', 

This is my attempt at manually calculating BLEU. While it was fairly close to the nltk calculations, it was off by a little so below I'm using the nltk library for the sake of making sure my results are as accurate as possible.

In [12]:
import math

def ngrams(sent: list, n: int):
  ngrams = {}
  for i in range(len(sent)-n+1):
    curr_ngram = " ".join(sent[i:i+n])
    if curr_ngram not in ngrams:
      ngrams[curr_ngram] = 1
    else:
      ngrams[curr_ngram] += 1
  return ngrams

def calc_bleu(target: list, pred: list) -> float:
  n = min(4, len(pred))
  all_ngrams = 1
  for i in range(n):
    count = ngrams(pred, i+1)
    clip = ngrams(target, i+1)
    clip = {k: clip[k] for k in clip if k in count}
    all_ngrams *= sum(clip.values()) / sum(count.values())
  brevity = math.exp(1-len(pred)/len(target)) if len(pred) < len(target) else 1
  return (all_ngrams**(1/n)) * brevity

The `test` function simply calculates BLEU over the dataset.

In [13]:
from nltk.translate.bleu_score import sentence_bleu

def test(source: list, targets: list) -> float:
  running_bleu = 0
  nltk_bleu = 0

  pred_targets = []
  true_targets = []
  for i, sentence in enumerate(source):
    running_bleu += sentence_bleu([targets[i]], translate(sentence))

  return running_bleu / i
  

Get results.

In [14]:
print("Train BLEU: {:.4f}".format(test(train_source, train_target)))
print("Val. BLEU: {:.4f}".format(test(valid_source, valid_target)))

Train BLEU: 0.7782
Val. BLEU: 0.6920


Visualize some sentences.

In [15]:
for i in range(5):
  print(valid_source[i][:12])
  print(valid_target[i][:12])
  print(translate(valid_source[i][:12]))
  print('-'*100)

['bhéarfadh', 'sin', 'faoiseamh', 'intinne', 'dó']
['bhéarfadh', 'sin', 'faoiseamh', 'intinne', 'dó']
['bhéarfadh', 'sin', 'faoiseamh', 'intinne', 'dó']
----------------------------------------------------------------------------------------------------
['nó', 'bhí', 'cineál', 'de', 'scrupall', 'coinnsíosa', 'i', 'rith', 'an', 'ama', 'air', 'nach']
['nó', 'bhí', 'cineál', 'de', 'scrupall', 'coinsiasa', 'i', 'rith', 'an', 'ama', 'air', 'nach']
['nó', 'bhí', 'cineál', 'de', 'scrupall', 'coinnsíosa', 'i', 'rith', 'an', 'ama', 'air', 'nach']
----------------------------------------------------------------------------------------------------
['cuireadh', 'an', 'bád', 'ar', 'an', 'tuinn', 'eadar', 'sin', 'is', 'tráthas', 'agus', 'dimthigh']
['cuireadh', 'an', 'bád', 'ar', 'an', 'toinn', 'eadar', 'sin', 'is', 'tráthas', 'agus', 'dimigh']
['cuireadh', 'an', 'bád', 'ar', 'an', 'toinn', 'idir', 'sin', 'is', 'tráthas', 'agus', 'dimigh']
------------------------------------------------------------

Final test results.

In [16]:
print("Testing BLEU: {:.4f}".format(test(test_source, test_target)))

Testing BLEU: 0.7734
