## Basic Model Training
Let's get our data into a format that we can use to train transformer-style models.

We will start with the Korean dataset, as it is well-annotated.

In [1]:
!pip install xigt



In [1]:
from xigt.codecs import xigtxml
corpus = xigtxml.load(open('../data/kor.xml'))

In [19]:
class MissingValueError(Exception):
    pass

# From a single line of IGT, extracts the features which are allowed in this shared task:
# 1. Transcribed words (not segmented)
# 2. Translation (not aligned)
# 3. Glosses
# Gloss alignments are also returned, so that their quality can be inspected.
def extract_igt(igt):
    if not igt.get('w'):
        raise MissingValueError("words")
    if not igt.get('tw'):
        raise MissingValueError("translation")
    if not igt.get('gw'):
        raise MissingValueError("glosses")
        
    words = [word.value() for word in igt['w'].items]
    glosses = [gloss.value() for gloss in igt['gw'].items]
    alignments = [gloss.alignment for gloss in igt['gw'].items]
    
    translation = [item.value() for item in igt['tw']]
    return {'words': words, 'translation': translation, 'glosses': glosses, 'alignments': alignments}
    
extract_igt(corpus[15])

{'words': ['a',
  '.',
  'John-i',
  'koki-lul',
  'kuw-e',
  'mek-ki-nun',
  'kuw-e',
  'mek-ess-ciman'],
 'translation': ['John',
  'broiled',
  'and',
  'ate',
  'the',
  'meat',
  ',',
  'but',
  '...'],
 'glosses': ['Nom',
  'meat-Acc',
  'broil-Inf',
  'eat-Noml-Top',
  'broil-Inf',
  'eat-Past-but'],
 'alignments': [None, None, None, None, None, None]}

In [3]:
corpus_data = []

missing_words_count = 0
missing_translation_count = 0
missing_gloss_count = 0
all_good_count = 0

for i, igt in enumerate(corpus):
    try:
        igt_data = extract_igt(igt)
        corpus_data.append(igt_data)
        all_good_count += 1
    except MissingValueError as v:
        match str(v):
            case 'words': missing_words_count += 1
            case 'translation': missing_translation_count += 1
            case 'glosses': missing_gloss_count += 1

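# Adjacent f-strings are concatenated, which avoids the stray spaces a backslash continuation adds
print(
    f"Parsed corpus, with \n\t{all_good_count} good rows"
    f"\n\t{missing_words_count} rows missing words"
    f"\n\t{missing_translation_count} missing translations"
    f"\n\t{missing_gloss_count} missing glosses"
)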

Parsed corpus, with 
	4838 good rows
	73 rows missing words        
	471 missing translations
	0 missing glosses


In [4]:
corpus_data[4]

{'words': ['Chelsu-nun', 'pam-ul', 'kuw-e', 'mek-ess-ta', '.'],
 'translation': ['Chelsu', 'broiled', 'and', 'ate', 'the', 'chestnut'],
 'glosses': ['Top', 'chestnut-Acc', 'broil-Inf', 'eat-Past-Dec']}

In [5]:
# Let's remove the hyphens from the input, to simulate the case where we don't have segmentation
for item in corpus_data:
    item['words'] = [word.replace('-', '') for word in item['words']]
        
corpus_data[4]

{'words': ['Chelsunun', 'pamul', 'kuwe', 'mekessta', '.'],
 'translation': ['Chelsu', 'broiled', 'and', 'ate', 'the', 'chestnut'],
 'glosses': ['Top', 'chestnut-Acc', 'broil-Inf', 'eat-Past-Dec']}

In [6]:
# Let's also split the output on hyphens, so that each stem and each gram becomes its own token.
# Grams keep a leading "-" to mark that they attach to the preceding stem.
for item in corpus_data:
    glosses = []
    for word in item['glosses']:
        word_glosses = word.split("-")
        glosses.append(word_glosses[0])
        glosses += ["-" + gloss for gloss in word_glosses[1:]]
    item['glosses'] = glosses

corpus_data[4]

{'words': ['Chelsunun', 'pamul', 'kuwe', 'mekessta', '.'],
 'translation': ['Chelsu', 'broiled', 'and', 'ate', 'the', 'chestnut'],
 'glosses': ['Top',
  'chestnut',
  '-Acc',
  'broil',
  '-Inf',
  'eat',
  '-Past',
  '-Dec']}

Notes:
- We originally tried to align words and glosses, but a large number of rows are either missing alignments or have completely wrong ones. Rather than train the model on incorrect data, we will simply provide unaligned glosses.
- The data is noisy. Row 15 above, for example, starts with the stray tokens `a` and `.` in its words. We will have to count on the transformer to cope with this.
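
To make the first note concrete, the share of glosses that carry an alignment can be counted directly from the extracted rows. The sketch below is a minimal illustration on hand-written rows shaped like the output of `extract_igt`; the helper name `alignment_coverage` and the sample data are assumptions, not part of the original notebook.

```python
from typing import Dict, List, Tuple


def alignment_coverage(rows: List[Dict[str, list]]) -> Tuple[int, int]:
    """Return (aligned, total) gloss counts over rows shaped like extract_igt output."""
    aligned = 0
    total = 0
    for row in rows:
        alignments = row.get('alignments', [])
        total += len(alignments)
        # A gloss counts as aligned when its alignment is anything other than None
        aligned += sum(1 for a in alignments if a is not None)
    return aligned, total


# Hypothetical sample rows; the 'w1'/'w2' ids are made up for illustration.
sample_rows = [
    {'glosses': ['Nom', 'meat-Acc'], 'alignments': [None, None]},
    {'glosses': ['Top', 'chestnut-Acc'], 'alignments': ['w1', 'w2']},
]

aligned, total = alignment_coverage(sample_rows)
print(f"{aligned}/{total} glosses carry an alignment")
```

Running the same count over `corpus_data` (while it still holds the `alignments` key) would show how much of the corpus is affected.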

## Encoding
Input: transcription + translation

Output: glosses (stems and grams)

We need to encode all of our items, input and output, as integers.

In [9]:
# Write every sentence (words, translation, glosses) to one file for tokenizer training.
# The context manager closes the file, so everything is flushed before the tokenizer reads it.
with open('all_text.txt', 'w') as all_text:
    for item in corpus_data:
        all_text.write(" ".join(item['words']) + "\n")
        all_text.write(" ".join(item['translation']) + "\n")
        all_text.write(" ".join(item['glosses']) + "\n")

In [28]:
from tokenizers import ByteLevelBPETokenizer

special_chars = ["[BOS]", "[EOS]", "[UNK]", "[SEP]", "[PAD]", "[MASK]"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['./all_text.txt'], min_frequency=2, special_tokens=special_chars)
tokenizer.save_model(".", "kor")

['./kor-vocab.json', './kor-merges.txt']

In [46]:
from transformers import BartTokenizer

# Load the vocabulary and merges trained above, registering our special tokens
tokenizer = BartTokenizer(
    './kor-vocab.json', './kor-merges.txt',
    bos_token="[BOS]", eos_token="[EOS]", sep_token="[SEP]", cls_token="[BOS]",
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
    model_max_length=512,
)
tokenizer.vocab_size

11399

In [61]:
enc = tokenizer.encode(['tarooga', 'daigakuni', 'dekaketa'], is_split_into_words=True, add_special_tokens=False)
print(enc)
for tok in enc:
    print(tokenizer.decode([tok]))
dec = tokenizer.batch_decode([enc], clean_up_tokenization_spaces=False)
print(dec)

[263, 296, 2565, 2386, 1594, 563, 345, 1237, 983, 279, 7261]
 t
ar
oo
ga
 da
ig
ak
uni
 de
ka
keta
[' tarooga daigakuni dekaketa']


In [26]:
" ".join(corpus_data[4]['words'])

'Chelsunun pamul kuwe mekessta .'

In [8]:
from typing import List

special_chars = ["[UNK]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]

def create_vocab(sentences: List[List[str]], threshold=2):
    all_words = dict()
    for sentence in sentences:
        for word in sentence:
            all_words[word.lower()] = all_words.get(word.lower(), 0) + 1

    all_words_list = []
    for word, count in all_words.items():
        if count >= threshold:
            all_words_list.append(word)

    return sorted(all_words_list)

source_vocab = create_vocab([item['words'] for item in corpus_data])
len(source_vocab)

3169

In [9]:
# Also create a list for the target and gloss words
target_and_gloss_vocab = create_vocab([item['translation'] for item in corpus_data] + [item['glosses'] for item in corpus_data])
print(len(target_and_gloss_vocab))

3357


In [10]:
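def encode_word(word, vocab='source'):
    # Special tokens are stored in upper case, so check them before lower-casing
    if word in special_chars:
        return special_chars.index(word)
    word = word.lower()

    # IDs are laid out as: special tokens, then source vocab, then target/gloss vocab.
    # Unknown words map to [UNK].
    if vocab == 'source':
        if word in source_vocab:
            return source_vocab.index(word) + len(special_chars)
        return special_chars.index("[UNK]")
    if word in target_and_gloss_vocab:
        return target_and_gloss_vocab.index(word) + len(special_chars) + len(source_vocab)
    return special_chars.index("[UNK]")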

encode_word('', vocab='transl')

3175
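
`encode_word` calls `list.index`, which scans the vocabulary on every lookup. A dictionary gives the same ids in constant time. The following is a minimal sketch using the same id layout (special tokens, then source vocabulary, then target and gloss vocabulary); `build_index`, `encode_word_fast`, and the toy vocabularies are hypothetical names and data used only for illustration.

```python
from typing import Dict, List

# Toy stand-ins for the notebook's special_chars, source_vocab and target_and_gloss_vocab
toy_special = ["[UNK]", "[SEP]", "[PAD]", "[MASK]", "[BOS]", "[EOS]"]
toy_source = ["kuwe", "pamul"]
toy_target = ["-acc", "chestnut"]


def build_index(special: List[str], source: List[str], target: List[str]) -> Dict[str, Dict[str, int]]:
    """Precompute word -> id maps using the same offsets as encode_word."""
    return {
        'special': {tok: i for i, tok in enumerate(special)},
        'source': {w: i + len(special) for i, w in enumerate(source)},
        'target': {w: i + len(special) + len(source) for i, w in enumerate(target)},
    }


def encode_word_fast(word: str, index: Dict[str, Dict[str, int]], vocab: str = 'source') -> int:
    """Look up a word id; unknown words map to the [UNK] id."""
    if word in index['special']:
        return index['special'][word]
    table = index['source'] if vocab == 'source' else index['target']
    return table.get(word.lower(), index['special']["[UNK]"])


index = build_index(toy_special, toy_source, toy_target)
print(encode_word_fast("Pamul", index))                     # 7
print(encode_word_fast("chestnut", index, vocab='transl'))  # 9
print(encode_word_fast("unseen", index))                    # 0
```

With vocabularies of a few thousand entries the difference is small per call, but it adds up when every word of every row is encoded.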

In [11]:
import torch

MODEL_INPUT_LENGTH = 512

PAD_ID = special_chars.index("[PAD]")
SEP_ID = special_chars.index("[SEP]")

# Encodes a sentence as integers (padding is applied later, in preprocess)
def encode(sentence: List[str], vocab='source') -> List[int]:
    return [encode_word(word, vocab=vocab) for word in sentence]
            
encode(corpus_data[4]['words']) 

[481, 2143, 1532, 1721, 57]

Now let's divide our data and turn it into the Dataset format.

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(corpus_data, test_size=0.3)
test, dev = train_test_split(test, test_size=0.5)

print(f"Train: {len(train)}")
print(f"Dev: {len(dev)}")
print(f"Test: {len(test)}")

Train: 3387
Dev: 726
Test: 726
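
The split above is drawn at random each time the cell runs, so indexed rows such as `train[2]` will differ between runs. `train_test_split` accepts a `random_state` argument for this purpose. The sketch below shows the same 70/15/15 idea with only the standard library, so the effect of a fixed seed is easy to see; the function name `split_70_15_15` and the seed value are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_70_15_15(rows: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, then cut into train/dev/test (about 70/15/15)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_dev = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


rows = list(range(100))
train_a, dev_a, test_a = split_70_15_15(rows, seed=13)
train_b, dev_b, test_b = split_70_15_15(rows, seed=13)
print(len(train_a), len(dev_a), len(test_a))  # 70 15 15
print(train_a == train_b)                     # True: same seed, same split
```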


In [13]:
print(train[2])

{'words': ['halapenimkkeyse', 'hakkyoey', 'kasinta'], 'translation': ['Grandfather', '(', 'HON', ')', 'goes', '(', 'HON', ')', 'to', 'school', '.'], 'glosses': ['grandfather(HON)', '-NOM(HON)', 'school', '-to', 'go', '-SH', '-PRES', '-DEC']}


In [14]:
from datasets import Dataset, DatasetDict

raw_dataset = DatasetDict()
raw_dataset['train'] = Dataset.from_list(train)
raw_dataset['validation'] = Dataset.from_list(dev)
raw_dataset['test'] = Dataset.from_list(test)

raw_dataset

DatasetDict({
    train: Dataset({
        features: ['words', 'translation', 'glosses'],
        num_rows: 3387
    })
    validation: Dataset({
        features: ['words', 'translation', 'glosses'],
        num_rows: 726
    })
    test: Dataset({
        features: ['words', 'translation', 'glosses'],
        num_rows: 726
    })
})

In [15]:
BOS_ID = special_chars.index("[BOS]")
EOS_ID = special_chars.index("[EOS]")

def preprocess(row):
    """Preprocesses each row in the dataset
    1. Combines the source and translation into a single list, and encodes
    2. Pads the combined input and output sequences
    3. Creates attention mask
    """
    source_enc = encode(row['words'])
    transl_enc = encode(row['translation'], vocab='transl')
    combined_enc = source_enc + [SEP_ID] + transl_enc
    
    # Pad
    initial_length = len(combined_enc)
    combined_enc += [PAD_ID] * (MODEL_INPUT_LENGTH - initial_length)
    
    # Create attention mask
    attention_mask = [1] * initial_length + [0] * (MODEL_INPUT_LENGTH - initial_length)
    
    # Encode the output
    output_enc = encode(row['glosses'], vocab='transl')
    output_enc = output_enc + [EOS_ID]
    
    # Shift one position right
    decoder_input_ids = [BOS_ID] + output_enc
    
    # Pad both
    output_enc += [PAD_ID] * (MODEL_INPUT_LENGTH - len(output_enc))
    decoder_input_ids += [PAD_ID] * (MODEL_INPUT_LENGTH - len(decoder_input_ids))
    
    return {'input_ids': torch.tensor(combined_enc), 'attention_mask': torch.tensor(attention_mask), 'labels': torch.tensor(output_enc), 'decoder_input_ids': torch.tensor(decoder_input_ids)}
    
preprocess(raw_dataset['train'][1])

{'input_ids': tensor([ 296,   57, 2774,  137,  983, 1442, 2317, 1887,    1, 4688, 4161, 5520,
         5013, 6189, 4641, 5482, 5830, 3722,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2, 

In [16]:
# Map to all datasets
dataset = DatasetDict()
dataset['train'] = raw_dataset['train'].map(preprocess)
dataset['validation'] = raw_dataset['validation'].map(preprocess)
dataset['test'] = raw_dataset['test'].map(preprocess)

dataset

DatasetDict({
    train: Dataset({
        features: ['words', 'translation', 'glosses', 'input_ids', 'attention_mask', 'labels', 'decoder_input_ids'],
        num_rows: 3387
    })
    validation: Dataset({
        features: ['words', 'translation', 'glosses', 'input_ids', 'attention_mask', 'labels', 'decoder_input_ids'],
        num_rows: 726
    })
    test: Dataset({
        features: ['words', 'translation', 'glosses', 'input_ids', 'attention_mask', 'labels', 'decoder_input_ids'],
        num_rows: 726
    })
})

In [17]:
print(dataset['train'][1])

{'words': ["b'", '.', 'thoyoiley/', '??', 'i', 'kongcangi', 'pwuli', 'naessta'], 'translation': ['Fire', 'broke', 'out', 'in', 'the', 'factory', 'on', 'Saturday', '.'], 'glosses': ['Saturday', '-on/?NOM', 'factory', '-NOM', 'fire', '-NOM', 'break', 'out', '-PST', '-DEC'], 'input_ids': [296, 57, 2774, 137, 983, 1442, 2317, 1887, 1, 4688, 4161, 5520, 5013, 6189, 4641, 5482, 5830, 3722, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
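
One detail worth noting in `preprocess`: the labels are padded with `PAD_ID`. The loss in Hugging Face sequence-to-sequence models skips label positions set to `-100`, which is why `compute_metrics` maps `-100` back to the pad id before decoding. A minimal sketch of padding labels that way follows; `pad_labels` and the toy ids are hypothetical, and the original notebook does not do this.

```python
from typing import List

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def pad_labels(label_ids: List[int], length: int, ignore_index: int = IGNORE_INDEX) -> List[int]:
    """Right-pad label ids to a fixed length with the ignore index instead of the pad id."""
    if len(label_ids) > length:
        raise ValueError("label sequence is longer than the target length")
    return label_ids + [ignore_index] * (length - len(label_ids))


# Toy gloss ids followed by an end-of-sequence id of 5 (made-up values)
labels = pad_labels([4688, 4161, 5], length=6)
print(labels)  # [4688, 4161, 5, -100, -100, -100]
```

With a fixed length of 512 and short gloss sequences, most label positions are padding, so whether they contribute to the loss can matter a great deal.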

## Model Creation

In [50]:
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=len(special_chars) + len(source_vocab) + len(target_and_gloss_vocab),
    max_position_embeddings=512,
    pad_token_id=PAD_ID,
    bos_token_id=BOS_ID,
    eos_token_id=EOS_ID,
    decoder_start_token_id=BOS_ID,
    forced_eos_token_id=EOS_ID,
    num_beams = 5
)

model = BartForConditionalGeneration(config)
model.config

BartConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "attention_dropout": 0.0,
  "bos_token_id": 4,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 4,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 5,
  "forced_eos_token_id": 5,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "bart",
  "num_beams": 5,
  "num_hidden_layers": 12,
  "pad_token_id": 2,
  "scale_embedding": false,
  "transformers_version": "4.21.3",
  "use_cache": true,
  "vocab_size": 6532
}

In [19]:
preds = model.generate(torch.LongTensor([dataset["train"][0]['input_ids']]), num_beams=5, min_length=0, max_length=20)
preds

tensor([[   4, 1515, 4434, 4434, 4434, 4434, 6376, 6376, 6376, 6376, 6376, 4434,
         4434, 4434, 4434, 4434, 3485, 3485, 3485,    5]])

In [40]:
all_vocab = special_chars + source_vocab + target_and_gloss_vocab

def batch_decode(batch):
    """Decodes a batch of indices to the actual words, dropping special tokens"""
    def decode(seq):
        if isinstance(seq, torch.Tensor):
            seq = seq.detach().cpu()
        # Tensors and numpy arrays provide tolist(); plain lists are used as-is
        indices = seq.tolist() if hasattr(seq, 'tolist') else list(seq)
        return [all_vocab[index] for index in indices if index >= len(special_chars)]

    return [decode(seq) for seq in batch]
        
batch_decode(preds)

[['kumyoiley',
  'delicious',
  'delicious',
  'delicious',
  'delicious',
  'was',
  'was',
  'was',
  'was',
  'was',
  'delicious',
  'delicious',
  'delicious',
  'delicious',
  'delicious',
  '-kes',
  '-kes',
  '-kes']]

## Evaluation

In [27]:
from torchtext.data.metrics import bleu_score

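# Vocabulary entries are lower-cased, so lower-case the reference glosses before comparing
reference = [gloss.lower() for gloss in dataset["train"][0]['glosses']]
bleu_score(batch_decode(preds), [[reference]])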

0.0

In [44]:
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
        
    # Decode predicted output
    decoded_preds = batch_decode(preds)
    
    # Decode (gold) labels
    labels = np.where(labels != -100, labels, PAD_ID)
    decoded_labels = batch_decode(labels)
    
    bleu = bleu_score(decoded_preds, [[seq] for seq in decoded_labels])
    
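    # Also get accuracy, based on (gold morphemes found in output) / (number of gold morphemes).
    # This ignores position: a gold gloss counts if it appears anywhere in the prediction.
    correct_glosses = 0
    total_glosses = 0

    # Use a separate name for the per-row gold glosses, so `labels` is not shadowed
    for pred, gold in zip(decoded_preds, decoded_labels):
        correct_glosses += len([gloss for gloss in gold if gloss in pred])
        total_glosses += len(gold)

    # Guard against an empty evaluation set
    acc = round(correct_glosses / total_glosses, 4) if total_glosses else 0.0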
    
    return {'bleu': bleu, 'accuracy': acc}
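
The accuracy defined in `compute_metrics` is order-insensitive: a gold gloss counts as correct if it occurs anywhere in the predicted sequence. The standalone sketch below reproduces that counting on toy sequences so the behaviour is easy to check by hand; `gloss_accuracy` is a hypothetical helper name and the sequences are made up.

```python
from typing import List


def gloss_accuracy(preds: List[List[str]], golds: List[List[str]]) -> float:
    """(gold glosses found anywhere in the matching prediction) / (total gold glosses)."""
    correct = 0
    total = 0
    for pred, gold in zip(preds, golds):
        correct += sum(1 for gloss in gold if gloss in pred)
        total += len(gold)
    return round(correct / total, 4) if total else 0.0


golds = [['chestnut', '-acc', 'broil', '-inf'], ['eat', '-past', '-dec']]
preds = [['broil', 'chestnut', '-top'], ['eat', '-past', '-dec']]

# First pair: 'chestnut' and 'broil' are found (2 of 4), despite the wrong order.
# Second pair: all 3 of 3 are found. Overall: 5 / 7.
print(gloss_accuracy(preds, golds))  # 0.7143
```

Because order and repetition are ignored, this number is best read alongside BLEU rather than on its own.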

## Training

In [47]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 16

args = Seq2SeqTrainingArguments(
    "igt-word-level",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    # fp16=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [48]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics
)

In [28]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: words, glosses, translation. If words, glosses, translation are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3387
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 212


Epoch,Training Loss,Validation Loss
1,No log,4.61867


The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: words, glosses, translation. If words, glosses, translation are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 726
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=212, training_loss=6.1274080096550705, metrics={'train_runtime': 8559.476, 'train_samples_per_second': 0.396, 'train_steps_per_second': 0.025, 'total_flos': 3669991643676672.0, 'train_loss': 6.1274080096550705, 'epoch': 1.0})

In [49]:
trainer.evaluate(eval_dataset=dataset["validation"].select([1,2]))

The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: glosses, translation, words. If glosses, translation, words are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2
  Batch size = 16


{'eval_loss': 8.999423027038574,
 'eval_bleu': 0.0,
 'eval_accuracy': 0.0,
 'eval_runtime': 2.9284,
 'eval_samples_per_second': 0.683,
 'eval_steps_per_second': 0.341}