# Train a lemmatizer with Lemmy
In this notebook, you will see how to train a lemmatizer using Lemmy. It assumes you already have a CSV file with the columns *pos*, *full_form*, *lemma*. The previous notebook, *01 prepare*, explains how to create such a file using data from Dansk Sprognævn (DSN) and the Universal Dependencies (UD) data.

We first create a train/test split, train on the training data only, and evaluate on both the training set and the test set. We then train again on the entire dataset and save the learned rules.
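To see why the evaluation is run on both sets, here is a minimal sketch that uses a hypothetical `LookupLemmatizer` in place of Lemmy. It only memorizes the pairs it was fitted on, so it is not Lemmy's algorithm; the class, the `score` helper, and the toy rows are all assumptions made for illustration. A model like this is either correct or ambiguous on every row it has seen, so only held-out rows say anything about generalization.

```python
from collections import defaultdict


class LookupLemmatizer:
    """Hypothetical stand-in that only memorizes its training pairs (not Lemmy's algorithm)."""

    def __init__(self):
        self.table = defaultdict(list)

    def fit(self, X, y):
        for key, lemma in zip(X, y):
            if lemma not in self.table[key]:
                self.table[key].append(lemma)

    def lemmatize(self, word_class, full_form):
        # Unseen pairs fall back to the full form itself.
        return self.table.get((word_class, full_form), [full_form])


def score(lemmatizer, X, y):
    """Return (correct, ambiguous, total), counting ambiguous predictions separately."""
    correct = ambiguous = 0
    for (word_class, full_form), target in zip(X, y):
        predicted = lemmatizer.lemmatize(word_class, full_form)
        if len(predicted) > 1:
            ambiguous += 1
        elif predicted[0] == target:
            correct += 1
    return correct, ambiguous, len(y)


# Toy rows, invented for illustration only.
toy_X_train = [("NOUN", "kattene"), ("VERB", "drak"), ("NOUN", "alen"), ("NOUN", "alen")]
toy_y_train = ["kat", "drikke", "ale", "al"]
toy_X_test = [("NOUN", "ukrudtet")]
toy_y_test = ["ukrudt"]

toy = LookupLemmatizer()
toy.fit(toy_X_train, toy_y_train)
train_score = score(toy, toy_X_train, toy_y_train)
test_score = score(toy, toy_X_test, toy_y_test)
print("train:", train_score)  # every seen row is correct or ambiguous
print("test: ", test_score)   # the unseen row is missed
```
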

In [1]:
import logging
import random
from pprint import pformat
import pandas as pd
from lemmy import Lemmatizer
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [2]:
PREPARED_FILE = "./data/prepared.csv"
TRAINED_RULES_FILE = "./data/rules.py"

In [3]:
def print_examples(lemmatizer):
    examples = [["VERB", "drak"], ["NOUN", "kattene"], ["NOUN", "ukrudtet"], ["NOUN", "slaraffenlandet"],
                ["NOUN", "alen"], ["NOUN", "skaber"], ["NOUN", "venskaber"], ["NOUN", "tilbageførelser"],
                ["NOUN", "aftenbønnerne"], ["NOUN", "altankassepassere"]]
    for word_class, full_form in examples:
        lemma = lemmatizer.lemmatize(word_class, full_form)
        print("(%s, %s) -> %s" % (word_class, full_form, lemma))

def calculate_accuracy(lemmatizer, X, y):
    total = 0
    correct = 0
    ambiguous = 0

    for (word_class, full_form), target in zip(X, y):
        predicted = lemmatizer.lemmatize(word_class, full_form)
        total += 1
        if len(predicted) > 1:
            # More than one candidate lemma: counted as ambiguous, never as correct
            ambiguous += 1
        elif predicted[0] == target:
            correct += 1

    print("correct:", correct)
    print("ambiguous:", ambiguous)
    print("total:", total)
    print("accuracy:", correct/total)
    print("ambiguous%:", ambiguous/total)
    print("ambiguous + accuracy:", (ambiguous+correct)/total)

## Load Data

In [4]:
def load_data(filename):
    df = pd.read_csv(filename, usecols=[0, 1, 2], keep_default_na=False)
    df = df.sample(frac=1, random_state=42) # shuffle rows
    X = list(zip(df.iloc[:, 0], df.iloc[:, 1]))  # (word_class, full_form) pairs
    y = df.iloc[:, 2].tolist()  # target lemmas
    return X, y

X, y = load_data(PREPARED_FILE)

## Split Data

In [5]:
def split_data(X, y):
    mask = [False] * len(y)
    test_indices = random.sample(range(len(y)), len(y) // 500)  # hold out 1 in 500 rows (0.2%)
    for index in test_indices:
        mask[index] = True

    X_train = []
    y_train = []
    X_test = []
    y_test = []
    for index, is_test in enumerate(mask):
        if is_test:
            X_test.append(X[index])
            y_test.append(y[index])
        else:
            X_train.append(X[index])
            y_train.append(y[index])

    return X_train, y_train, X_test, y_test

random.seed(42)
X_train, y_train, X_test, y_test = split_data(X, y)

In [6]:
print(f"Complete set: {len(X):10}")
print(f"Train set:    {len(X_train):10}")
print(f"Test set:     {len(X_test):10}")

Complete set:     401123
Train set:        400321
Test set:            802
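
The set sizes follow directly from the `len(y) // 500` expression in `split_data`. As a quick sanity check, the arithmetic can be reproduced with a small helper (`holdout_sizes` is a hypothetical name used only here):

```python
def holdout_sizes(n_rows, divisor=500):
    """Return (train_size, test_size) when one row in `divisor` is held out."""
    n_test = n_rows // divisor  # integer division, as in split_data
    return n_rows - n_test, n_test


sizes = holdout_sizes(401123)
print(sizes)  # (400321, 802)
```
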


## Train lemmatizer - training set only

In [7]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X_train, y_train)

DEBUG : epoch #1: 45859 rules (45859 new) in 1.83s
DEBUG : epoch #2: 60341 rules (14482 new) in 1.63s
DEBUG : epoch #3: 62571 rules (2230 new) in 1.60s
DEBUG : epoch #4: 63046 rules (475 new) in 1.70s
DEBUG : epoch #5: 63168 rules (122 new) in 1.58s
DEBUG : epoch #6: 63201 rules (33 new) in 1.52s
DEBUG : epoch #7: 63209 rules (8 new) in 1.57s
DEBUG : epoch #8: 63209 rules (0 new) in 1.54s
DEBUG : training complete: 63209 rules in 13.05s
DEBUG : rules before pruning: 63209
DEBUG : used rules: 58398
DEBUG : rules after pruning: 58398 (4811 removed)


In [8]:
calculate_accuracy(lemmatizer, X_train, y_train)

correct: 395843
ambiguous: 4478
total: 400321
accuracy: 0.9888139767836311
ambiguous%: 0.011186023216368864
ambiguous + accuracy: 1.0


In [9]:
calculate_accuracy(lemmatizer, X_test, y_test)

correct: 735
ambiguous: 12
total: 802
accuracy: 0.9164588528678305
ambiguous%: 0.014962593516209476
ambiguous + accuracy: 0.9314214463840399


In [10]:
print_examples(lemmatizer)

(VERB, drak) -> ['drikke']
(NOUN, kattene) -> ['kat']
(NOUN, ukrudtet) -> ['ukrudt']
(NOUN, slaraffenlandet) -> ['slaraffenland']
(NOUN, alen) -> ['ale', 'alen', 'al']
(NOUN, skaber) -> ['skaber']
(NOUN, venskaber) -> ['venskab']
(NOUN, tilbageførelser) -> ['tilbageførelse']
(NOUN, aftenbønnerne) -> ['aftenbøn']
(NOUN, altankassepassere) -> ['altankassepasser']


## Train lemmatizer - full dataset

In [11]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X, y)

DEBUG : epoch #1: 45946 rules (45946 new) in 1.70s
DEBUG : epoch #2: 60461 rules (14515 new) in 1.66s
DEBUG : epoch #3: 62695 rules (2234 new) in 1.54s
DEBUG : epoch #4: 63172 rules (477 new) in 1.70s
DEBUG : epoch #5: 63294 rules (122 new) in 1.71s
DEBUG : epoch #6: 63327 rules (33 new) in 1.67s
DEBUG : epoch #7: 63335 rules (8 new) in 1.68s
DEBUG : epoch #8: 63335 rules (0 new) in 1.61s
DEBUG : training complete: 63335 rules in 13.37s
DEBUG : rules before pruning: 63335
DEBUG : used rules: 58513
DEBUG : rules after pruning: 58513 (4822 removed)


In [12]:
calculate_accuracy(lemmatizer, X, y)

correct: 396627
ambiguous: 4496
total: 401123
accuracy: 0.98879146795372
ambiguous%: 0.011208532046280069
ambiguous + accuracy: 1.0


## Save Learned Rules
We now save the learned rules to a Python file that can be copied into the lemmatizer source code.

In [13]:
def _to_dict(lemmatizer):
    """Convert the internal defaultdict to a standard dict."""
    temp = {}
    for pos, rules_ in lemmatizer.rules.items():
        if pos not in temp:
            temp[pos] = {}

        for full_form_suffix, lemma_suffixes_ in rules_.items():
            temp[pos][full_form_suffix] = lemma_suffixes_
    return temp

In [14]:
with open(TRAINED_RULES_FILE, 'w', encoding='utf-8') as file:  # match the "coding: utf8" header
    file.write("# coding: utf8\n")
    file.write("from __future__ import unicode_literals\n")
    file.write("\n\n")
    file.write("rules = " + pformat(_to_dict(lemmatizer), width=120))
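
Because `pformat` emits a valid Python literal, the saved file can be executed to get the rules back. The sketch below is one possible way to check that round trip, assuming a small made-up `toy_rules` dict in place of the trained rules and a temporary file in place of `TRAINED_RULES_FILE`. It uses `runpy.run_path` from the standard library, which runs the file and returns its global names.

```python
import os
import runpy
import tempfile
from pprint import pformat

# Made-up rules for illustration; the real dict comes from _to_dict(lemmatizer).
toy_rules = {"NOUN": {"ene": ["", "e"]}, "VERB": {"ak": ["ikke"]}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rules.py")
    with open(path, "w", encoding="utf-8") as file:
        file.write("# coding: utf8\n")
        file.write("from __future__ import unicode_literals\n")
        file.write("\n\n")
        file.write("rules = " + pformat(toy_rules, width=120) + "\n")

    # Execute the generated module and pull out its `rules` variable.
    loaded = runpy.run_path(path)["rules"]

print(loaded == toy_rules)
```

If the comparison is false, the written file does not reproduce the in-memory rules and should not be copied into the source tree.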