# Stemming

Stemming is a *text normalization* method used in NLP to simplify text before it is processed by a model. When stemming, we strip the final few characters of a word in order to find a common form of the word. Take the following sentence:

In [1]:
txt = "I am amazed by how amazingly amazing you are"

We use different forms of the word **amaze** a total of *three* times. Each of these forms is called an *'inflection'*: a modification of a word that slightly adjusts its meaning or context. When we tokenize this text we produce three different tokens, one for each inflection of amaze. That is okay, but in many applications this level of granularity in the semantic meaning of the word is not required and can damage model performance.
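
To make the point concrete, here is a minimal sketch that tokenizes the sentence with a plain whitespace split (an assumption for illustration; a real tokenizer would behave similarly on this sentence) and collects the tokens that start with `amaz`:

```python
txt = "I am amazed by how amazingly amazing you are"

# Naive tokenization: lowercase, then split on whitespace.
tokens = txt.lower().split()

# Collect every token that is some form of "amaze".
amaze_forms = [t for t in tokens if t.startswith("amaz")]

print(amaze_forms)
print(f"{len(set(amaze_forms))} distinct tokens for one underlying word")
```

Without normalization, a model has to learn each of these tokens separately.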

*Later, when we get to using more complex, sophisticated models (e.g. BERT), we will use different methods that maintain the inflection of each word - but it is important to understand stemming, as it was a very important part of text preprocessing for a very long time and is still relevant to many applications.*

To apply stemming we will use the NLTK package, which provides several different stemmers. We will test the `PorterStemmer` and `LancasterStemmer`.

In [3]:
words_to_stem = ['happy', 'happiest', 'happier', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly', 'cement', 'owed', 'maximum']

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

stemmed = [(porter.stem(word), lancaster.stem(word)) for word in words_to_stem]
stemmed


[('happi', 'happy'),
 ('happiest', 'happiest'),
 ('happier', 'happy'),
 ('cactu', 'cact'),
 ('cactii', 'cacti'),
 ('eleph', 'eleph'),
 ('eleph', 'eleph'),
 ('amaz', 'amaz'),
 ('amaz', 'amaz'),
 ('amazingli', 'amaz'),
 ('cement', 'cem'),
 ('owe', 'ow'),
 ('maximum', 'maxim')]

In [4]:
print("Porter | Lancaster")
for stem in stemmed:
    print(f"{stem[0]} | {stem[1]}")

Porter | Lancaster
happi | happy
happiest | happiest
happier | happy
cactu | cact
cactii | cacti
eleph | eleph
eleph | eleph
amaz | amaz
amaz | amaz
amazingli | amaz
cement | cem
owe | ow
maximum | maxim


The [Porter stemmer](https://tartarus.org/martin/PorterStemmer/) is a set of rules that strip common suffixes from the ends of words. These rules are applied one after the other to produce our Porter-stemmed words. It is a simple stemmer, and very fast.
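
The idea of applying suffix rules one after the other can be sketched with a toy stemmer. The rule list and the `min_stem_len` guard below are made up for illustration and are far simpler than the real Porter rule set:

```python
# Toy suffix rules, each tried exactly once, in this fixed order.
# These are illustrative only, NOT the actual Porter rules.
TOY_RULES = [("ly", ""), ("ing", ""), ("ed", ""), ("s", "")]


def toy_sequential_stem(word, min_stem_len=3):
    """Apply each rule once, in order, feeding the result into the next rule."""
    for suffix, replacement in TOY_RULES:
        # Only strip the suffix if enough of the word would be left over.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[: -len(suffix)] + replacement
    return word


for w in ["amazed", "amazing", "amazingly", "elephants"]:
    print(w, "->", toy_sequential_stem(w))
```

Because the output of one rule is the input of the next, `amazingly` first loses `ly` and then `ing` in a single pass.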

The [Lancaster stemmer](https://www.nltk.org/_modules/nltk/stem/lancaster.html) contains a larger set of rules. Rather than applying each rule once in sequence, it keeps iterating through the list of rules until it finds one that matches the current ending, then deletes or replaces that ending. The iterations stop once no more rules can be applied to the word, OR if the word starts with a vowel and only two characters remain, OR if the word starts with a consonant and only three characters remain. The Lancaster stemmer is much more aggressive in its stemming; sometimes this is a good thing, sometimes not.
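
The iterate-until-nothing-matches loop and its stopping conditions can be sketched as follows. The rule list is a small made-up sample, and `is_acceptable` encodes only the length limits described above, so this is a rough illustration rather than NLTK's actual Lancaster implementation:

```python
# Illustrative rules only; the real Lancaster rule table is much larger.
TOY_RULES = [("ingly", ""), ("ing", ""), ("ed", ""), ("ly", ""),
             ("s", ""), ("ent", ""), ("um", "")]
VOWELS = set("aeiou")


def is_acceptable(stem):
    """Length limits from the text: 2 chars if vowel-initial, else 3."""
    if not stem:
        return False
    return len(stem) >= (2 if stem[0] in VOWELS else 3)


def toy_iterative_stem(word):
    """Restart from the first rule after every successful rewrite."""
    changed = True
    while changed:
        changed = False
        for suffix, replacement in TOY_RULES:
            if word.endswith(suffix):
                candidate = word[: -len(suffix)] + replacement
                if is_acceptable(candidate):
                    word = candidate
                    changed = True
                    break  # go back to the top of the rule list
    return word


for w in ["amazingly", "cement", "owed", "maximum", "sing"]:
    print(w, "->", toy_iterative_stem(w))
```

Note how `sing` is left alone: stripping `ing` would leave a single consonant, which the length limit rejects.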

We can see from the results of the two stemmers above that neither is perfect, and this is the case with all stemming algorithms. Both are also designed for English, so applying them to German words gives inconsistent results:

In [7]:
words_to_stem_de = ["experte", "Experte", "Experten", "Expertin", "Expertinnen", "gebäude", "Gebäude", "Gebäudes", "schön", "schöner", "schönsten"]
stemmed_de = [(porter.stem(w), lancaster.stem(w)) for w in words_to_stem_de]
stemmed_de

[('expert', 'expert'),
 ('expert', 'expert'),
 ('experten', 'expert'),
 ('expertin', 'expertin'),
 ('expertinnen', 'expertin'),
 ('gebäud', 'gebäud'),
 ('gebäud', 'gebäud'),
 ('gebäud', 'gebäud'),
 ('schön', 'schön'),
 ('schöner', 'schöner'),
 ('schönsten', 'schönsten')]

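#### Test with spaCy

spaCy does not provide a stemmer; it lemmatizes instead, mapping each token to its dictionary form (`lemma_`).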

In [10]:
# ! pip install spacy
# ! python -m spacy download de_core_news_sm

In [12]:
import spacy
nlp = spacy.load("de_core_news_sm")

In [13]:
doc_de = nlp(" ".join(words_to_stem_de))

In [29]:
spacy_lemma_de = [word.lemma_ for word in doc_de]
spacy_lemma_de

['expert',
 'Experte',
 'Experte',
 'Expertin',
 'Expertin',
 'gebäude',
 'Gebäude',
 'Gebäude',
 'schön',
 'schön',
 'schönsen']

#### Test with PyStemmer

In [30]:
! pip install PyStemmer



In [31]:
import Stemmer
stemmer = Stemmer.Stemmer('russian')
word = "Вася"
word = stemmer.stemWord(word.lower())
word

'ва'

In [32]:
stemmer_de = Stemmer.Stemmer('german')
stemmer_de_res = [stemmer_de.stemWord(w.lower()) for w in words_to_stem_de]
stemmer_de_res

['expert',
 'expert',
 'expert',
 'expertin',
 'expertinn',
 'gebaud',
 'gebaud',
 'gebaud',
 'schon',
 'schon',
 'schon']

Side by side, as `word: spaCy lemma --- PyStemmer stem`:

experte: expert --- expert
Experte: Experte --- expert
Experten: Experte --- expert
Expertin: Expertin --- expertin
Expertinnen: Expertin --- expertinn
gebäude: gebäude --- gebaud
Gebäude: Gebäude --- gebaud
Gebäudes: Gebäude --- gebaud
schön: schön --- schon
schöner: schön --- schon
schönsten: schönsen --- schon


#### Test with Cistem (NLTK)

In [37]:
from nltk.stem.cistem import Cistem

In [39]:
cistem = Cistem()
cistem_res = [cistem.stem(word) for word in words_to_stem_de]
cistem_res

['exper',
 'expert',
 'expert',
 'experti',
 'expertinn',
 'baud',
 'baud',
 'baud',
 'schon',
 'schoner',
 'schon']

In [41]:
# Columns: original word: spaCy lemma --- PyStemmer stem --- Cistem stem
for word, lemma, pystem, cis in zip(words_to_stem_de, spacy_lemma_de, stemmer_de_res, cistem_res):
    print(f"{word}: {lemma} --- {pystem} --- {cis}")

experte: expert --- expert --- exper
Experte: Experte --- expert --- expert
Experten: Experte --- expert --- expert
Expertin: Expertin --- expertin --- experti
Expertinnen: Expertin --- expertinn --- expertinn
gebäude: gebäude --- gebaud --- baud
Gebäude: Gebäude --- gebaud --- baud
Gebäudes: Gebäude --- gebaud --- baud
schön: schön --- schon --- schon
schöner: schön --- schon --- schoner
schönsten: schönsen --- schon --- schon
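
One rough way to read this comparison is to count how many distinct forms each tool produces per word family (ideally one). The sketch below hard-codes the outputs shown above so it runs without spaCy, PyStemmer or NLTK installed. The grouping into four families and the case-insensitive comparison are assumptions made for illustration:

```python
# Outputs copied from the cells above.
results = {
    "spaCy lemma": ["expert", "Experte", "Experte", "Expertin", "Expertin",
                    "gebäude", "Gebäude", "Gebäude", "schön", "schön", "schönsen"],
    "PyStemmer": ["expert", "expert", "expert", "expertin", "expertinn",
                  "gebaud", "gebaud", "gebaud", "schon", "schon", "schon"],
    "Cistem": ["exper", "expert", "expert", "experti", "expertinn",
               "baud", "baud", "baud", "schon", "schoner", "schon"],
}

# Assumed grouping: positions in words_to_stem_de that belong together.
FAMILIES = {
    "Experte": range(0, 3),
    "Expertin": range(3, 5),
    "Gebäude": range(5, 8),
    "schön": range(8, 11),
}


def distinct_forms(outputs, indices):
    """Number of different (case-insensitive) outputs within one family."""
    return len({outputs[i].lower() for i in indices})


summary = {
    tool: {family: distinct_forms(outputs, idx) for family, idx in FAMILIES.items()}
    for tool, outputs in results.items()
}

for tool, counts in summary.items():
    print(f"{tool}: {counts}")
```

A count of 1 means the tool mapped every form in that family to the same output. This measures only consistency, not whether the output is a valid word.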
