# Lemmatization

Lemmatization is very similar to stemming in that it reduces a set of inflected words to a common form. The difference is that lemmatization reduces inflections to a real dictionary word, which is called the lemma. If we take the words *'amaze'*, *'amazed'*, *'amazing'*, the lemma of all of these is *'amaze'*, whereas stemming would usually return *'amaz'*. Lemmatization is generally seen as more advanced than stemming.
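To make the contrast concrete, here is a deliberately tiny sketch. Both `toy_stem` and `toy_lemmatize`, the suffix list, and the lookup table are hypothetical and written only for this illustration; they are not how NLTK implements either technique. The point is that a stemmer cuts characters off by rule, while a lemmatizer looks up a real word.

```python
# Toy illustration only: real stemmers and lemmatizers are far more involved.
SUFFIXES = ("ing", "ed", "e")  # hypothetical suffix list, checked in order

# Hypothetical mini-dictionary mapping inflected forms to their lemma
LEMMA_TABLE = {"amazed": "amaze", "amazing": "amaze"}


def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


sample = ["amaze", "amazed", "amazing"]
stems = [toy_stem(w) for w in sample]
lemmas = [toy_lemmatize(w) for w in sample]
print(stems)   # every entry is the non-word 'amaz'
print(lemmas)  # every entry is the dictionary word 'amaze'
```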

In [1]:
words = ['amaze', 'amazed', 'amazing']

We will use NLTK again for our lemmatization. We also need to make sure the *WordNet* database is downloaded; the lemmatizer uses it as its lookup so that what it returns is a real lemma.

In [2]:
import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[lemmatizer.lemmatize(word) for word in words]

[nltk_data] Downloading package wordnet to /home/nataliia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['amaze', 'amazed', 'amazing']

Clearly nothing has happened. That is because the lemmatizer also needs the *part-of-speech* (POS) tag, which is the syntactic category of a word, for example noun, adjective, or verb. Without one, NLTK's `lemmatize` treats every word as a noun. In our case we can tag each word as a verb, like so:

In [3]:
from nltk.corpus import wordnet

[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']
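Tagging every word as a verb only works for a hand-picked list like ours. In running text the tags would normally come from a POS tagger, and those tags then have to be translated into the four categories WordNet knows. The sketch below assumes Penn Treebank style tags (the kind `nltk.pos_tag` returns); `penn_to_wordnet` is a hypothetical helper, and the tags in `tagged` are written by hand for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, same value as wordnet.VERB
    if tag.startswith("R"):
        return "r"  # adverb, same value as wordnet.ADV
    return "n"      # noun, same value as wordnet.NOUN


# Hand-written (word, tag) pairs standing in for real tagger output
tagged = [("amazing", "VBG"), ("amazed", "VBD"), ("results", "NNS")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Each letter can then be passed as the second argument to `lemmatizer.lemmatize(word, pos)`, in the same position where `wordnet.VERB` is used above.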

### German

In [14]:
import spacy
import spacy_transformers
# ! python -m spacy download de_dep_news_trf

In [15]:
nlp = spacy.load("de_dep_news_trf")
# nlp = spacy.load("de_core_news_sm")

In [16]:
mails = ['Hallo. Ich spielte am frühen Morgen und ging dann zu einem Freund. Auf Wiedersehen',
         'Guten Tag Ich mochte Bälle und will etwas kaufen. Tschüss']

In [33]:
mails_lemma = []

for text in mails:
    doc = nlp(text)
    # Collect the lemma of every token in the parsed document
    result = [token.lemma_ for token in doc]
    mails_lemma.append(result)
    print(*result, sep=" ")

Hallo -- ich spielen an früh Morgen und gehen dann zu ein Freund -- auf wiedersehen
Guten Tag ich mochte Ball und wollen etwas kaufen -- Tschüss


In [34]:
# !pip install HanTa

In [35]:
from HanTa import HanoverTagger as ht

In [36]:
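tagger = ht.HanoverTagger('morphmodel_ger.pgz')
for text in mails:
    # tag_sent takes a list of tokens; splitting on whitespace leaves
    # punctuation attached to the preceding word (e.g. 'Freund.')
    lemma_text = [lemma for (word, lemma, pos) in tagger.tag_sent(text.split())]
    print(*lemma_text, sep=" ")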

hallo. ich spielen an früh Morgen und gehen dann zu ein Freund. auf Wiedersehen
gut Tag ich mögen Ball und wollen etwas Kaufen. Tschüss
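The two taggers disagree on a handful of tokens, which is easier to see side by side. The sketch below copies the second output line of each tagger from above as plain strings and lists the positions where the lemmas differ. `lemma_differences` is a hypothetical helper, and the normalisation (dropping the `--` tokens and trailing full stops) is an assumption that happens to line these particular outputs up; it is not a general alignment method.

```python
def lemma_differences(spacy_line, hanta_line):
    """Return (spaCy lemma, HanTa lemma) pairs for positions that differ."""
    # spaCy printed sentence punctuation as separate '--' tokens
    left = [tok for tok in spacy_line.split() if tok != "--"]
    # HanTa received whitespace-split input, so full stops stayed attached
    right = [tok.rstrip(".") for tok in hanta_line.split()]
    if len(left) != len(right):
        raise ValueError("token counts differ; the lines cannot be compared 1:1")
    return [(a, b) for a, b in zip(left, right) if a != b]


# Second output line of each tagger, copied from the cells above
spacy_out = "Guten Tag ich mochte Ball und wollen etwas kaufen -- Tschüss"
hanta_out = "gut Tag ich mögen Ball und wollen etwas Kaufen. Tschüss"

diffs = lemma_differences(spacy_out, hanta_out)
for spacy_lemma, hanta_lemma in diffs:
    print(f"{spacy_lemma:10} {hanta_lemma}")
```

For this sentence the disagreements are *Guten* vs. *gut*, *mochte* vs. *mögen*, and *kaufen* vs. *Kaufen*.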
