# Charming the word snake: Terminology work and language checking with Python

_Esther Strauch & Maximilian Rosin ([parson AG](https://www.parson-europe.com))<br/>
tcworld conference 2020_

## Why Python?

- Readable and explicit, thus very accessible.
- Interpreted language, thus it is easy to run scripts.
- A gazillion libraries for almost any recurring task.

## Terminology extraction

### The text

> The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, China. The outbreak was declared a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. As of 14 October 2020, more than 38.1 million cases have been confirmed, with more than 1.08 million deaths attributed to COVID-19.

Source: [https://en.wikipedia.org/wiki/COVID-19_pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic)

In [None]:
raw_text = "The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, China. The outbreak was declared a Public Health Emergency of International Concern in January 2020, and a pandemic in March 2020. As of 14 October 2020, more than 38.1 million cases have been confirmed, with more than 1.08 million deaths attributed to COVID-19."
raw_text = raw_text.lower()

### Importing the spacy library

In [None]:
import spacy

### Loading a model of the English language

In [None]:
nlp = spacy.load("en_core_web_sm")

### Feeding the text into the model

In [None]:
doc = nlp(raw_text)


### A little helper function for pretty-printing results

In [None]:
from tabulate import tabulate

def print_table(data,header):
    print(tabulate(data, headers=header, tablefmt="simple"))


### Theory: token, lemma, part-of-speech

> The dog has a wet nose.

In [None]:
sample_doc = nlp("The dog has a wet nose.")

sample_sentence = []

for token in sample_doc:
    sample_sentence.append([token.text,token.lemma_,token.pos_])

print_table(sample_sentence,["Token","Lemma","POS"])


### Filtering for relevant terms

In [None]:
def is_relevant(token):
    # Keep only nouns, proper nouns and verbs as term candidates
    pos_tag_concepts = ["NOUN","PROPN","VERB"]
    return token.pos_ in pos_tag_concepts

### Building the initial list

In [None]:
one_word_terms = []

for token in doc:
    if is_relevant(token):
        one_word_terms.append([token.lemma_ , token.pos_])

### Removing duplicates

In [None]:
one_word_terms = sorted(one_word_terms)
one_word_terms_no_dups = [one_word_terms[i] for i in range(len(one_word_terms)) 
                           if i == 0 or one_word_terms[i] != one_word_terms[i-1]]
one_word_terms = one_word_terms_no_dups

### Checking results

In [None]:
print_table(one_word_terms,["Term","POS"])

### spaCy's sentence segmentation

In [None]:
for sent in doc.sents:
    print(sent.text)

### Creating a list of sentences and their lemmas

In [None]:
sentences = []

for sent in doc.sents:
    terms = []
    for token in sent.subtree:
        if is_relevant(token):
            terms.append([token.lemma_ , token.pos_])
    terms = sorted(terms)
    terms = [terms[i] for i in range(len(terms)) if i == 0 or terms[i] != terms[i-1]]
    sentences.append({"sentence": sent.text,
                    "terms": terms})

In [None]:
for sentence in sentences:
    print("Sentence: " + sentence["sentence"] + "\n")
    print_table(sentence["terms"],["Term","POS"])
    print("\n")

### Adding the sample sentences to the list

In [None]:
for one_word_term in one_word_terms:
    for sentence in sentences:
        # Compare only lemma and POS, and keep the first matching sentence as the sample
        if one_word_term[:2] in sentence["terms"]:
            one_word_term.append(sentence["sentence"])
            break

### Checking the results again

In [None]:
for term in one_word_terms:
    print("Term: " + term[0])
    print("POS: " + term[1])
    print("Sentence: " + term[2] + "\n")

### Adding definitions (where possible)

Data source: [Free Wordset Dictionary](https://github.com/wordset/wordset-dictionary/tree/master/data)

In [None]:
import json

postag_map = {"NOUN" : "noun",
              "PROPN": "noun",
              "VERB" : "verb"}

for term in one_word_terms:
    dictionary_name = "wordset-dictionary/" + term[0][0] + ".json"
    try:
        with open(dictionary_name,"r", encoding="utf-8") as dictionary_file:
            dictionary_data = json.load(dictionary_file)
            try:
                definitions = ""
                for meaning in dictionary_data[term[0]]["meanings"]:
                    if meaning["speech_part"] == postag_map[term[1]]:
                        definitions += (meaning["def"] + ",")
                term.append(definitions)
            except KeyError:
                term.append("~definition missing~")
                print("Term " + term[0] + " not found in dictionary.")
    except FileNotFoundError:
        term.append("~definition missing~")
        print("No matching dictionary found.")

In [None]:
for term in one_word_terms:
    print("Term: " + term[0])
    print("POS: " + term[1])
    print("Sentence: " + term[2])
    print("Definitions: " + term[3] + "\n")

### Writing the list to a CSV file

In [None]:
import csv

with open("one-word-terms.csv", "w", newline="", encoding="utf-8") as csvfile:
    csvwriter = csv.writer(csvfile)
    for term in one_word_terms:
        csvwriter.writerow(term)

### Further ideas

- Count the frequency of terms.
- Add **two-word terms** by searching for bigrams. This is a little bit tougher than it sounds.
- Evaluate relations between terms by looking for **collocations**.
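
The first two ideas can be sketched with the standard library alone. The following is a minimal illustration and not part of the original notebook. It assumes a hypothetical list `tagged_tokens` of (lemma, POS) pairs in document order, shaped like the `token.lemma_` and `token.pos_` values used above. The helper names `count_terms` and `noun_bigrams` are made up for this sketch, and treating only directly adjacent nouns as two-word candidates is a deliberately naive simplification.

```python
from collections import Counter

# Hypothetical sample data: (lemma, POS) pairs for the phrase
# "the coronavirus pandemic is an ongoing pandemic of coronavirus disease".
tagged_tokens = [
    ("the", "DET"), ("coronavirus", "NOUN"), ("pandemic", "NOUN"),
    ("be", "AUX"), ("an", "DET"), ("ongoing", "ADJ"), ("pandemic", "NOUN"),
    ("of", "ADP"), ("coronavirus", "NOUN"), ("disease", "NOUN"),
]

RELEVANT_TAGS = {"NOUN", "PROPN", "VERB"}
NOUN_TAGS = {"NOUN", "PROPN"}

def count_terms(pairs):
    """Count how often each relevant (lemma, POS) pair occurs."""
    return Counter(pair for pair in pairs if pair[1] in RELEVANT_TAGS)

def noun_bigrams(pairs):
    """Count directly adjacent nouns as candidate two-word terms."""
    bigrams = Counter()
    for (first, first_pos), (second, second_pos) in zip(pairs, pairs[1:]):
        if first_pos in NOUN_TAGS and second_pos in NOUN_TAGS:
            bigrams[(first, second)] += 1
    return bigrams

term_counts = count_terms(tagged_tokens)
bigram_counts = noun_bigrams(tagged_tokens)

for (lemma, pos), count in term_counts.most_common():
    print(lemma, pos, count)
for (first, second), count in bigram_counts.most_common():
    print(first, second, count)
```

Real two-word terms also include adjective-noun pairs and are interrupted by stop words, which is one reason why the bigram search is tougher than it sounds.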

## Checking writing rules

### The gruesome sample text

> Der folgende Text soll helfen, Schreibregeltests zu veranschaulichen. Dieses phantasmagorische Ungetüm von einem Satz ist zum Beispiel fast schon lächerlich lang und trotzdem erlaubt es mir die deutsche Sprache, dass ich ihn mit voller Absicht auf diese überaus erstaunliche Überlänge bringen kann. Hier ein kurzer Satz. Und noch einer. Als nächstes ein Satz mit einem sehr langen Wort. Denn wer kennt es nicht, das berühmte mecklenburg-vorpommerische Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz. Die Erfindung solcher Worte erfolgt meist durch Beamte, damit die Kommunikation mit den Bürgern einer Eindeutigkeit genügt. Oder besser, Beamte erfinden oft lange Worte. Aber auch vor eingeschobenen Nebensätzen, die in der Mitte eines Satzes eingeschoben werden, sei gewarnt. Aufzählungen von Sachverhalten, Worten, Begriffen, Fakten oder Nebensächlichkeiten verlagert man besser in separate Listen.

### Loading text into memory

In [None]:
gruesome_text = "Der folgende Text soll helfen, Schreibregeltests zu veranschaulichen. Dieses phantasmagorische Ungetüm von einem Satz ist zum Beispiel fast schon lächerlich lang und trotzdem erlaubt es mir die deutsche Sprache, dass ich ihn mit voller Absicht auf diese überaus erstaunliche Überlänge bringen kann. Hier ein kurzer Satz. Und noch einer. Als nächstes ein Satz mit einem sehr langen Wort. Denn wer kennt es nicht, das berühmte mecklenburg-vorpommerische Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz. Die Erfindung solcher Worte erfolgt meist durch Beamte, damit die Kommunikation mit den Bürgern einer Eindeutigkeit genügt. Oder besser, Beamte erfinden oft lange Worte. Aber auch vor eingeschobenen Nebensätzen, die in der Mitte eines Satzes eingeschoben werden, sei gewarnt. Aufzählungen von Sachverhalten, Worten, Begriffen, Fakten oder Nebensächlichkeiten verlagert man besser in separate Listen."

### Loading the language model (this time a German one)

In [None]:
nlp_de = spacy.load("de_core_news_sm")

### Feeding the text into the model

In [None]:
doc = nlp_de(gruesome_text)

### Finding forbidden words

#### List of forbidden words

In [None]:
forbidden_words = ["erfolgen","fast","Sachverhalt"]

#### Checking every token

In [None]:
for sentence in doc.sents:
    for token in sentence.subtree:
        if token.lemma_ in forbidden_words:
            print("Forbidden word: "
                  + "'" + token.lemma_ + "'"
                  + " in '" + sentence.text + "'\n")

### Finding long words

#### Counting syllables

In [None]:
vowels = ["a","e","i","o","u","ä","ü","ö","y"]
diphthongs = ["aa","ai","au","ay","ee","ei","eu","ey","ie","io","oa","oi","oo","oy","ui","ya","ye","yi","yo","yu"]

def count_occurences(text,substrings):
    occurences = 0
    for substring in substrings:
        occurences += text.count(substring)
    return occurences

def count_syllables(text):
    # Lowercase first so that capitalized vowels (e.g. "Ü" in "Überlänge") are counted
    text = text.lower()
    num_vowels = count_occurences(text,vowels)
    num_diphthongs = count_occurences(text,diphthongs)
    # Rough heuristic: every vowel is a syllable, minus one per diphthong
    return (num_vowels - num_diphthongs)
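
The count above is a heuristic. As a point of comparison, a different heuristic (not the one used in this notebook) counts each run of consecutive vowels as one syllable. The sketch below assumes the same vowel set as above. The name `count_vowel_groups` is made up for this example, and the result is still only an approximation of real German syllabification.

```python
import re

# One or more consecutive vowels (same vowel set as in the cell above)
VOWEL_GROUP = re.compile(r"[aeiouäöüy]+")

def count_vowel_groups(word):
    """Approximate the syllable count as the number of vowel runs."""
    return len(VOWEL_GROUP.findall(word.lower()))

for word in ["Satz", "kurzer", "Überlänge"]:
    print(word, count_vowel_groups(word))
```

Because the word is lowercased first, capitalized umlauts at the start of a noun are counted as well.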

#### Checking every token

In [None]:
for sentence in doc.sents:
    for token in sentence.subtree:
        syllables = count_syllables(token.text)
        if (syllables > 3) or (len(token.text) > 10):
            print("Long word "
                  + "(" + str(syllables)  + " syllables, "
                  + str((len(token.text))) + " characters" + "): "
                  + "'" + token.text + "'"
                  + " in '" + sentence.text + "'\n")

### Finding long sentences

In [None]:
for sentence in doc.sents:
    words = [ token.text for token in sentence.subtree if token.pos_ not in ["PUNCT","SYM","X"] ]
    if len(words) > 15:
        print("Long sentence " + "(" + str(len(words)) + " words): "
              + sentence.text +"\n")

### Dependent clauses, enumerations...: What commas can tell you

In [None]:
for sentence in doc.sents:
    num_of_commas = sentence.text.count(",")
    if num_of_commas > 1:
        print("Multiple commas " + "(" + str(num_of_commas) + " commas): "
                  + "'" + sentence.text + "'\n")

### Finding nominalizations ... or at least the worst ones

#### Identifying nominalizations

In [None]:
nominalization_hints = ["ung","keit","heit","tion"]

def is_nominalization(word,pos):
    for nominalization_hint in nominalization_hints:
        if word.endswith(nominalization_hint) and pos in ["NOUN","PROPN"]:
            return True
    return False

#### Checking every sentence

In [None]:
for sentence in doc.sents:
    for token in sentence.subtree:
        if is_nominalization(token.text,token.pos_):
            print("Possible nominalization: "
                  + "'" + token.lemma_ + "'"
                  + " in '" + sentence.text + "'\n")

### Further ideas

- Finding passive sentences.
- Checking for "dass" vs. "das".
- Calculating readability indices.
- Estimating reading time.
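
The last two ideas can be sketched without spaCy. The following is a rough illustration under several assumptions, not part of the original notebook: sentences are split with a simple regular expression instead of `doc.sents`, syllables are approximated by counting vowel runs, the reading speed of 200 words per minute is an arbitrary placeholder, and the readability index is the Amstad adaptation of the Flesch Reading Ease for German (180 - ASL - 58.5 * ASW, where ASL is the average sentence length in words and ASW the average number of syllables per word). All function names are made up for this example.

```python
import re

WORDS_PER_MINUTE = 200  # assumption: placeholder reading speed, adjust as needed

def split_sentences(text):
    """Naive sentence split on ., ! and ? (assumption: no abbreviations)."""
    return [s for s in re.split(r"[.!?]+\s*", text) if s]

def split_words(text):
    """Return all runs of word characters, including umlauts."""
    return re.findall(r"\w+", text)

def count_vowel_groups(word):
    """Approximate the syllable count as the number of vowel runs."""
    return len(re.findall(r"[aeiouäöüy]+", word.lower()))

def reading_time_minutes(text, words_per_minute=WORDS_PER_MINUTE):
    """Estimate the reading time from the word count alone."""
    return len(split_words(text)) / words_per_minute

def amstad_index(text):
    """Flesch Reading Ease for German (Amstad): 180 - ASL - 58.5 * ASW."""
    words = split_words(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = sum(count_vowel_groups(w) for w in words) / len(words)
    return 180 - avg_sentence_length - 58.5 * avg_syllables_per_word

sample = "Hier ein kurzer Satz. Und noch einer."
print(round(amstad_index(sample), 1))
print(reading_time_minutes(sample))
```

Higher index values indicate text that is easier to read. With the spaCy pipeline from above, `doc.sents` and the token filter from the long-sentence check would likely give more reliable counts than the regular expressions used here.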