# Prepare Data for Training
In this notebook, we prepare a dataset which can be used for training a lemmatizer with Lemmy.

**NOTE**: You do *not* need to run this notebook to use Lemmy. The lemmatizer comes trained and ready to use! This notebook is only relevant if you want to train the lemmatizer yourself, for example because you want it trained on a specific dataset.

We use two datasets which are both publicly available. The first dataset is the word list from Dansk Sprognævn (DSN). This dataset is freely available but you have to sign a contract with DSN to obtain the file. Please see [www.dsn.dk](https://www.dsn.dk) for more info. The other dataset is the Danish part of the Universal Dependencies (UD). This dataset is open source and available from the [UD repo](https://github.com/UniversalDependencies/UD_Danish) on GitHub.

The notebook assumes you have the datasets stored in a subfolder called *data*.

In [1]:
from IPython.display import display, HTML
import logging
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import unicodecsv as csv
from tqdm import tqdm
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [2]:
UD_TRAIN_FILE = "./data/UD_Danish/da-ud-train.conllu"
DSN_XML_FILE = "./data/DSN/RO.iLexdump.m.fuldformer.til.aftagere.xml"
PREPARED_FILE = "./data/prepared.csv"
NORMS_FILE = "./data/norms.csv"

## Parse DSN XML data
Our first step is reading the DSN data. We train the lemmatizer to use POS tags to help predict the lemma. We use the UD set of POS tags. Because the word classes used in the DSN data differ from the UD POS tags, we need to do some manual mapping. The `CLASS_LOOKUP` dictionary specifies the mapping.

In [3]:
CLASS_LOOKUP = {"sb": ["NOUN"],
                "adj": ["ADJ"],
                "adv": ["ADV"],
                "vb": ["VERB"],
                "proprium": ["PROPN"],
                "præp": ["ADP"],
                "udråbsord": ["INTJ"],
                "pron": ["PRON"],
                "talord": ["NUM"],
                "konj": ["CONJ"],
                "romertal": ["NUM"],
                "kolon": ["NOUN"],
                "lydord": ["NOUN"],
                "art": ["PRON_DONT_USE"]}

def _build_dsn_tuples(soup):
    unknown_classes = defaultdict(int)
    forms = set()
    homograph_groups = soup.find_all('hom', recursive=True)
    for hom_group in tqdm(homograph_groups):
        for article in hom_group.find_all(recursive=False):
            word_class_temp = article.name.split('-')[0]
            word_classes = CLASS_LOOKUP.get(word_class_temp, None)

            if not word_classes:
                unknown_classes[word_class_temp] += 1
                continue

            head_node = article.find('hoved')
            lemma = head_node.find('opslagsord').get_text()
            full_forms = article.find('fuldformer')
            if full_forms is None:
                continue

            if head_node.find('form.af'):
                # The lookup word ('artikel') itself is not the base form, so it is skipped.
                continue

            for full_form_tag in full_forms.find_all('ff', recursive=False):
                full_form = full_form_tag.get_text()
                for word_class in word_classes:
                    forms.add((word_class, full_form, lemma))
    return sorted(forms, key = lambda x: x[1:]), unknown_classes

In [4]:
# Load the XML and parse it using Beautiful Soup
soup = BeautifulSoup(open(DSN_XML_FILE), 'xml')

In [5]:
# Build tuples of POS, full form and lemma
dsn_tuples, unknown = _build_dsn_tuples(soup)

100%|██████████| 63187/63187 [00:18<00:00, 3407.23it/s]


## Parse UD data
The next step is to read the UD data. We want to learn from both the DSN and UD data. While DSN is the authoritative source, UD does contain words and forms not found in DSN. In case of inconsistencies between DSN and UD, we choose DSN over UD.

Some of the UD POS tags, such as *DET* and *AUX*, cannot be mapped 1-to-1 to the DSN word classes. Consequently, we learn the words with those POS tags from UD.

For adjectives (*ADJ*), the DSN word lists are incomplete. They do not contain various *degrees* for the adjectives, for example the forms *hurtigere* (faster) and *hurtigst* (fastest).

UD contains a large number of proper nouns (*PROPN*) not found in DSN, specifically personal names. We might as well learn from these too, so we read the entire UD training file.

Since the UD data is not just a word list but actual sentences annotated with lemmas and POS tags (and more), we have the benefit of having not only the POS tag of the word we want to lemmatize, but also the POS tag of the previous word. We can use this to improve the accuracy of our lemmatizer, so when building the list of tuples from UD, we include the POS tag of the previous word of the sentence. This is set to the empty string when the current word is the first word of the sentence.
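To make the role of the previous POS tag concrete, here is a minimal sketch that walks over a tiny, hand-written CoNLL-U-style fragment. The sample sentences and the helper name `tuples_with_history` are assumptions made for illustration; real UD lines have ten tab-separated columns, of which only the first four are reproduced here.

```python
# Hypothetical toy fragment: columns are ID, FORM, LEMMA, UPOS (remaining columns omitted).
SAMPLE = (
    "# text = Hun har set\n"
    "1\tHun\thun\tPRON\t_\n"
    "2\thar\thave\tAUX\t_\n"
    "3\tset\tse\tVERB\t_\n"
    "\n"
    "1\tSe\tse\tVERB\t_\n"
)

def tuples_with_history(text):
    """Yield (pos_prev, pos, form, lemma) for every token line in the text."""
    pos_prev = ""
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment line carrying sentence metadata
        if not line.strip():
            pos_prev = ""  # blank line marks a sentence boundary
            continue
        form, lemma, pos = line.split("\t")[1:4]
        yield (pos_prev, pos, form.lower(), lemma.lower())
        pos_prev = pos

sample_tuples = list(tuples_with_history(SAMPLE))
for row in sample_tuples:
    print(row)
```

The first token of each sentence gets an empty previous tag, while every later token carries the tag of the word before it.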

In [6]:
def _parse_ud_line(line):
    return line.split("\t")[1:4]

def _build_ud_tuples(ud_file, min_freq=1):
    counts = {}
    pos_prev = ""
    for line in open(ud_file).readlines():
        if line.startswith("#"):
            continue
        if line.strip() == "":
            pos_prev = ""
            continue

        orth, lemma, pos = _parse_ud_line(line)
        orth = orth.lower()
        lemma = lemma.lower()
        key = (pos_prev, pos, orth, lemma)
        counts[key] = counts.get(key, 0) + 1
        pos_prev = pos
    
    return [key for key in counts if counts[key] >= min_freq]

ud_tuples = _build_ud_tuples(UD_TRAIN_FILE)

In [7]:
ud_tuples[:5]

[('', 'ADP', 'på', 'på'),
 ('ADP', 'NOUN', 'fredag', 'fredag'),
 ('NOUN', 'AUX', 'har', 'have'),
 ('AUX', 'PROPN', 'sid', 'sid'),
 ('PROPN', 'VERB', 'inviteret', 'invitere')]

## Filter UD data
We will now filter the word forms read from UD. We do this to avoid introducing ambiguity due to spelling errors and typos in UD.

We want to include the following only:
1. Any POS + full form combination *not* found in DSN.
2. Any POS_PREV + POS + full form combination for which the POS + full form is *ambiguous* in DSN + Step 1.

By *ambiguous* we mean full forms (or combinations of POS tags and full forms) that have more than one lemma associated with them, so the lemmatizer cannot know which of the lemmas to choose.
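The two inclusion rules can be sketched on toy data as follows. All tuples and variable names below are made up for illustration (the previous-word tags in particular are hypothetical); the layout matches the DSN tuples `(pos, full_form, lemma)` and the UD tuples `(pos_prev, pos, full_form, lemma)` used in this notebook.

```python
from collections import defaultdict

# Hypothetical toy data in the notebook's tuple layouts.
toy_dsn = [
    ("NOUN", "akvarier", "akvarie"),
    ("NOUN", "akvarier", "akvarium"),
    ("NOUN", "fredag", "fredag"),
]
toy_ud = [
    ("", "ADP", "på", "på"),
    ("ADP", "NOUN", "fredag", "fredag"),
    ("ADJ", "NOUN", "akvarier", "akvarium"),
]

# Rule 1: keep UD entries whose (POS, full form) does not occur in DSN.
toy_known = {(pos, form) for pos, form, _lemma in toy_dsn}
toy_new = [(pos, form, lemma) for _prev, pos, form, lemma in toy_ud
           if (pos, form) not in toy_known]

# Rule 2: find (POS, full form) keys with more than one lemma in DSN + rule 1,
# then keep the UD entries (including the previous POS tag) for those keys.
toy_lemmas = defaultdict(set)
for pos, form, lemma in toy_dsn + toy_new:
    toy_lemmas[(pos, form)].add(lemma)
toy_ambiguous = {key for key, lemmas in toy_lemmas.items() if len(lemmas) > 1}
toy_history = [t for t in toy_ud if (t[1], t[2]) in toy_ambiguous]

print(toy_new)
print(toy_history)
```

In this toy setup, "på" is added because DSN does not contain it, "fredag" is dropped because DSN already covers it, and "akvarier" is kept with its history because DSN lists two lemmas for it.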

In [8]:
# Create a set for looking up POS + full form combinations found in DSN.
dsn_full_forms = set((pos, full_form) for pos, full_form, _lemma in dsn_tuples)

# Create a list of POS + full form + lemma tuples from UD for which the POS + full form combination
# is *not* found in DSN.
ud_tuples_unique = [(pos, full_form, lemma) for (_pos_prev, pos, full_form, lemma) in ud_tuples if (pos, full_form) not in dsn_full_forms]

# Create a new list of tuples consisting of the ones from DSN and the new ones just found in UD.
dsn_ud_no_history = dsn_tuples + list(set(ud_tuples_unique))

## Ambiguity
Several words can be spelled in more than one way. For example, the Danish word for aquarium, "akvarium", can also be spelled "akvarie". This causes ambiguous rules. If we naively added both "akvarie" and "akvarium" to our lemmatizer and then tried lemmatizing "akvarier" (the plural form of "akvarium"/"akvarie"), the lemmatizer would not know whether to return "akvarium" or "akvarie". It would then return both words and let the user pick the desired one. In some scenarios, this might be the desired behavior. But here, we will try to avoid the ambiguity.

We avoid this kind of ambiguity by choosing one spelling over the other. It doesn't matter too much which spelling we choose as long as we are consistent. In general, when we can easily identify the more 'modern' spelling, we favor that one. In the case of the aquarium, "akvarie" has only recently been accepted and so we will choose that.

The "-ium"/"-ie" endings are quite common, so we scan the DSN word list to make sure we handle all of them. We do this by grouping the word list by full form. We then look for groups that contain exactly two lemmas, with one ending in "ium" and the other in "ie". From these we keep only the lemma ending in "ie". This ensures that we learn to lemmatize, for example, "akvarier" unambiguously to "akvarie".

We are not done yet, though. What should happen if we lemmatize "akvarium"? One option is to just return "akvarium" since the word is already in its base form. Another option is to return "akvarie" to reflect that we have chosen that form over "akvarium". One might argue in favor of the first option by saying that turning "akvarium" into "akvarie" is the job of a normalizer and consequently should not be done by the lemmatizer. On the other hand, leaving different lemmas for different spellings of the same word is what we wanted to avoid. We have chosen to go with option two, lemmatizing "akvarium" to "akvarie".

Note that the disambiguation code is run only during preprocessing of the training (and test) data. If you prefer not to disambiguate in this way, you can skip that part of preprocessing and train your own rules.
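As a small worked example of the two decisions above, the following sketch applies them to a hand-made group of full forms. The dictionary contents and the names `toy_groups` and `replacements` are assumptions for illustration only; the logic mirrors the description rather than any particular implementation.

```python
# Hypothetical toy groups: full form -> list of candidate lemmas.
toy_groups = {
    "akvarier": ["akvarie", "akvarium"],
    "akvarium": ["akvarium"],
    "akvarie": ["akvarie"],
}

# Pass 1: where exactly two lemmas compete and they differ by "-ium"/"-ie",
# keep the "-ie" lemma and remember which lemma it replaced.
replacements = {}
for form, lemmas in toy_groups.items():
    if len(lemmas) != 2:
        continue
    old = next((lemma for lemma in lemmas if lemma.endswith("ium")), None)
    new = next((lemma for lemma in lemmas if lemma.endswith("ie")), None)
    if old and new:
        toy_groups[form] = [new]
        replacements[old] = new

# Pass 2: forms that still point only at a replaced lemma (such as the base
# form "akvarium" itself) are redirected to the chosen spelling (option two).
for form, lemmas in toy_groups.items():
    if len(lemmas) == 1 and lemmas[0] in replacements:
        toy_groups[form] = [replacements[lemmas[0]]]

print(toy_groups)
print(replacements)
```

After both passes, every form in the toy group maps to "akvarie", and `replacements` records the pair `akvarium -> akvarie`, which is the kind of information a normalization list would hold.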

In [9]:
def build_lemma_groups(forms):
    groups = {}
    for pos, full_form, lemma in forms:
        if pos not in groups:
            groups[pos] = {}
        if full_form not in groups[pos]:
            groups[pos][full_form] = []
        groups[pos][full_form].append(lemma)
    return groups

lemma_groups = build_lemma_groups(dsn_ud_no_history)

In [10]:
# As an example, show ambiguity for various forms of "akvarie".
[x for x in lemma_groups["NOUN"].items() if x[0].startswith('akvarie') or x[0].startswith('akvariu')]

[('akvarie', ['akvarie']),
 ('akvarier', ['akvarie', 'akvarium']),
 ('akvarierne', ['akvarie', 'akvarium']),
 ('akvariernes', ['akvarie', 'akvarium']),
 ('akvariers', ['akvarie', 'akvarium']),
 ('akvaries', ['akvarie']),
 ('akvariet', ['akvarie', 'akvarium']),
 ('akvariets', ['akvarie', 'akvarium']),
 ('akvarium', ['akvarium']),
 ('akvariums', ['akvarium'])]

In [11]:
def remove_ambiguous(forms_lookup, func_unwanted, func_wanted):
    replace_lookup = {}
    for full_form in forms_lookup:
        lemmas = forms_lookup[full_form]
        if len(lemmas) != 2:
            continue

        unwanted = next((lemma for lemma in lemmas if func_unwanted(lemma)), None)
        wanted = next((lemma for lemma in lemmas if func_wanted(lemma)), None)

        if not unwanted or not wanted:
            continue

        forms_lookup[full_form] = [wanted]
        replace_lookup[unwanted] = wanted

    for full_form in forms_lookup:
        lemmas = forms_lookup[full_form]
        if len(lemmas) != 1 or lemmas[0] not in replace_lookup:
            continue
        
        unwanted = lemmas[0]
        wanted = replace_lookup[unwanted]        
        forms_lookup[full_form] = [wanted]
    return list(replace_lookup.items())

In [12]:
# Run through various endings and specific words which we know have two accepted spellings and remove one of them.
norm_pairs = []
t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
print(f"Found {len(t)} nouns with exactly two spellings.")
for unwanted, wanted in [('ium', 'ie'), ('fader', 'far'), ('moder', 'mor'),
                         ('broder', 'bror'), ('skifer', 'skiffer'), ('brille', 'briller'),
                         ('kon', 'kum'), ('kjoleskød', 'kjoleskøde'), ('has', 'hase'),
                         ('kreol', 'kreoler'), ('dolmer', 'dolme'), ('skifte', 'skift'),
                         ('lomvi', 'lomvie'), ('blænder', 'blænde'), ('morlil', 'morlille'),
                         ('mukkebikke', 'mukkebik'), ('norman', 'normanner'),
                         ('padderok', 'padderokke'), ('padle', 'paddel'),
                         ('plusfours', 'plusfour'), ('samaritan', 'samaritaner'),
                         ('rio', 'rie'), ('sjægte', 'sjægt'), ('spektrum', 'spekter'),
                         ('pse', 'psis'), ('tandstikker', 'tandstik'),
                         ('tidsspild', 'tidsspilde'), ('tjekker', 'tjekke'),
                         ('trauma', 'traume'), ('tusinde', 'tusind'),
                         ('hundrede', 'hundred'), ('tyrk', 'tyrker'),
                         ('tøndestave', 'tøndestav'), ('vable', 'vabel'),
                         ('unikum', 'unika'), ('valeriane', 'valerian'),
                         ('varyler', 'varyl'), ('baryler', 'baryl')]:
    norm_pairs += remove_ambiguous(lemma_groups["NOUN"], func_unwanted=lambda x: x.endswith(unwanted), func_wanted=lambda x: x.endswith(wanted))
    t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
    print(f"Found {len(t)} nouns with exactly two spellings (after disambiguating {unwanted}/{wanted}).")

# We want to distinguish between the words "center" and "centrum" but we want to replace "centrum" 
# with "center" whenever it's used as the suffix in a compound word.
for unwanted, wanted in [('centrum', 'center'), ('bo', 'boer')]:
    t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
    norm_pairs += remove_ambiguous(lemma_groups["NOUN"], func_unwanted=lambda x: x.endswith(unwanted) and len(x) > len(unwanted), func_wanted=lambda x: x.endswith(wanted) and len(x) > len(wanted))
    t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
    print(f"Found {len(t)} nouns with exactly two spellings (after disambiguating {unwanted}/{wanted}).")

for unwanted, wanted in [('barne', 'børne'), ('broder', 'brødre')]:
    t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
    norm_pairs += remove_ambiguous(lemma_groups["NOUN"], func_unwanted=lambda x: x.startswith(unwanted), func_wanted=lambda x: x.startswith(wanted))
    t = [(key, value) for (key, value) in lemma_groups["NOUN"].items() if len(value) == 2]
    print(f"Found {len(t)} nouns with exactly two spellings (after disambiguating {unwanted}/{wanted}).")
    
# Lemma of Danish word "det" when POS=='DET' should be "den". But UD contains the fixed phrase "i det hele taget" in which
# the lemma for "det" is specified as "det". To work around this, we manually disambiguate.
lemma_groups["DET"]["det"] = ["den"]

# Finally, now that we have removed ambiguous full forms from the groups, create a new
# list of tuples based on the cleaned up groups.
clean_dsn_ud_no_history = []
for pos in lemma_groups:
    for full_form in lemma_groups[pos]:
        for lemma in lemma_groups[pos][full_form]:
            clean_dsn_ud_no_history.append((pos, full_form, lemma))
len(clean_dsn_ud_no_history)

Found 2675 nouns with exactly two spellings.
Found 2063 nouns with exactly two spellings (after disambiguating ium/ie).
Found 2039 nouns with exactly two spellings (after disambiguating fader/far).
Found 2007 nouns with exactly two spellings (after disambiguating moder/mor).
Found 2003 nouns with exactly two spellings (after disambiguating broder/bror).
Found 1995 nouns with exactly two spellings (after disambiguating skifer/skiffer).
Found 1967 nouns with exactly two spellings (after disambiguating brille/briller).
Found 1959 nouns with exactly two spellings (after disambiguating kon/kum).
Found 1953 nouns with exactly two spellings (after disambiguating kjoleskød/kjoleskøde).
Found 1941 nouns with exactly two spellings (after disambiguating has/hase).
Found 1937 nouns with exactly two spellings (after disambiguating kreol/kreoler).
Found 1929 nouns with exactly two spellings (after disambiguating dolmer/dolme).
Found 1915 nouns with exactly two spellings (after disambiguating skifte/

400699

## Other Ambiguity
We have now removed words with two accepted spellings. Unfortunately, we have at least one more kind of ambiguity left in the data, namely distinct words which share one or more forms. For example, the Danish word "se" means *see*. The past tense of "se" is "så" (somewhat similar to *saw*). But the word "så" also has another meaning in Danish, namely *sow*. Consequently, if we are to lemmatize the word "så" and do not have any other information, we cannot tell whether the lemma is "se" or "så". In these situations, it helps if we know the POS tag of the previous word of the sentence. Therefore, we now identify ambiguous words which are still present after the above cleaning. For these ambiguous words, we then build a list of tuples which include the POS tag of the previous word.
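The sketch below illustrates the idea on the "så" example. The tuples are invented for illustration, and in particular the previous-word tags (`PRON`, `AUX`) are hypothetical contexts, not taken from UD. The combined `PREV_POS` key is one possible way to encode the history in a single word-class field.

```python
from collections import Counter

# Hypothetical toy data: (pos, full_form, lemma) without history ...
toy_forms = [("VERB", "så", "se"), ("VERB", "så", "så"), ("VERB", "ser", "se")]
# ... and (pos_prev, pos, full_form, lemma) with history.
toy_ud = [
    ("PRON", "VERB", "så", "se"),
    ("AUX", "VERB", "så", "så"),
    ("PRON", "VERB", "ser", "se"),
]

# A (POS, full form) key is ambiguous if it occurs with more than one lemma.
counts = Counter((pos, form) for pos, form, _lemma in toy_forms)
ambiguous_keys = {key for key, n in counts.items() if n > 1}

# For ambiguous keys only, fold the previous POS tag into the word-class field.
with_history = [
    (f"{prev}_{pos}", form, lemma)
    for prev, pos, form, lemma in toy_ud
    if (pos, form) in ambiguous_keys
]
print(with_history)
```

Only the two "så" entries survive the filter, now distinguished by their combined keys, while the unambiguous "ser" needs no history.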

In [13]:
def find_ambiguous_lemmas(forms):
    counter = Counter(t[:2] for t in forms)
    ambiguous = list(set([key for key in counter if counter[key] > 1]))
    return ambiguous

ambiguous = find_ambiguous_lemmas(clean_dsn_ud_no_history)
dsn_ud_with_history = [(f'{f[0]}_{f[1]}',) + f[2:] for f in ud_tuples if f[1:3] in ambiguous]
len(dsn_ud_with_history)

424

## Write Tuples To Disk

In [14]:
def _write_form(word_class, full_form, lemma):
    writer.writerow([word_class, full_form, lemma])

with open(PREPARED_FILE, 'wb') as csvfile:
    writer = csv.writer(csvfile,
                        delimiter=",",
                        quotechar='"',
                        quoting=csv.QUOTE_MINIMAL,
                        encoding='utf-8',
                        lineterminator='\n')
    
    writer.writerow(['word_class', 'full_form', 'lemma'])
    
    for pos, full_form, lemma in sorted(clean_dsn_ud_no_history, key = lambda x: (x[1:], x[0])):
        _write_form(pos, full_form, lemma)
    for pos, full_form, lemma in sorted(dsn_ud_with_history, key = lambda x: (x[1:], x[0])):
        _write_form(pos, full_form, lemma)

## Write Normalizations To Disk
At this point, the `norm_pairs` variable contains words we have learned have multiple spellings and for which we have decided to use only one of those spellings. We write this list to a CSV file. It will be used when measuring accuracy in another notebook.

In [15]:
len(norm_pairs)

191

In [16]:
with open(NORMS_FILE, 'wb') as csvfile:
    writer = csv.writer(csvfile,
                        delimiter=",",
                        quotechar='"',
                        quoting=csv.QUOTE_MINIMAL,
                        encoding='utf-8',
                        lineterminator='\n')
    
    writer.writerow(['full_form', 'norm'])
    for full_form, norm in norm_pairs:
        writer.writerow([full_form, norm])

## Unused classes from DSN
Finally, we will list word classes found in DSN for which we do not have a mapping to UD POS tags. Further investigation for these is an area for future work.

In [17]:
for key, value in unknown.items():
    print(key, value)

fork 361
flerord.forb. 196
præfiks 58
formelt 2
