Stemming and lemmatization

Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting succeeds on some occasions but not always, which is the main limitation of the approach. Below we illustrate the method with examples in both English and Spanish.

stemming_v2.png
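
To make the idea of "cutting off" affixes concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the algorithm of any real library: the suffix list, the `naive_stem` function and the `min_stem_len` parameter are assumptions made up for illustration, and prefixes are ignored for brevity.

```python
# Illustrative suffix list (an assumption, not taken from a real stemmer).
# Longer suffixes come first so that "ing" is tried before "s".
SUFFIXES = ["ing", "ed", "es", "s"]


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["walking", "walked", "walks", "studies", "running", "news"]:
    print(w, "->", naive_stem(w))
```

The regular forms of "walk" all collapse to "walk", but "studies", "running" and "news" come out as "studi", "runn" and "new", which shows the kind of limitation described above.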

Lemmatization, on the other hand, relies on a morphological analysis of each word. This requires detailed dictionaries that the algorithm can search to link an inflected form back to its lemma. Again, you can see how it works with the same example words.

lemma_v2.png
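
The dictionary lookup described above can be sketched in a few lines. The tiny `LEMMA_DICT` table and the `lookup_lemma` function are hypothetical and exist only for this example; a real lemmatizer relies on much larger dictionaries and a proper morphological analysis.

```python
# Toy dictionary mapping (inflected form, part of speech) to a lemma.
# The entries are illustrative assumptions, not data from a real resource.
LEMMA_DICT = {
    ("written", "VERB"): "write",
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def lookup_lemma(form, pos):
    """Return the lemma for (form, pos); fall back to the lowercased form if unknown."""
    key = (form.lower(), pos)
    return LEMMA_DICT.get(key, form.lower())


print(lookup_lemma("written", "VERB"))
print(lookup_lemma("meeting", "NOUN"))
print(lookup_lemma("meeting", "VERB"))
print(lookup_lemma("Tables", "NOUN"))  # not in the toy dictionary
```

Keying the table on the part of speech is a design choice of this sketch: it lets "meeting" map to "meeting" as a noun and to "meet" as a verb, something plain suffix stripping cannot distinguish.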

Another important difference is that a lemma is the base form of all its inflectional forms and is itself a valid word, whereas a stem need not be (for example, the stem "comput"). This is why regular dictionaries are lists of lemmas, not stems.

How do they work?

Stemming: several algorithms can be used for stemming, but the most common one for English is the Porter stemmer. Its rules are organized into five phases, numbered 1 to 5, which are applied in sequence. The purpose of these rules is to reduce each word to its stem.
Lemmatization: the key to this method is linguistics. To extract the proper lemma, the algorithm needs a morphological analysis of each word, which requires dictionaries for every language covered.
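
As a rough illustration of what a rule phase looks like, the sketch below implements only step 1a of the original Porter algorithm (the plural-handling rules SSES -> SS, IES -> I, SS -> SS, S -> nothing). The names `STEP_1A_RULES` and `porter_step_1a` are made up for this example, and the remaining steps, which depend on the consonant-vowel structure of the stem, are left out.

```python
# Step 1a of the Porter algorithm: rules are tried in order and the first
# (longest) matching suffix wins. Only this sub-step is implemented here.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word):
    """Apply the first matching step 1a rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

In the full algorithm, the output of each phase is passed on to the next one, so a word such as "ponies" first becomes "poni" here and may be processed further by later steps.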

Which one is best: lemmatization or stemming?

In conclusion, developing a stemmer is far simpler than building a lemmatizer. The latter requires deep linguistic knowledge to create the dictionaries that allow the algorithm to look up the proper form of each word. Once those dictionaries exist, lemmatization introduces less noise, and the results of an information retrieval process will be more accurate.
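
To show why normalization matters for retrieval, here is a small sketch of an inverted index built with and without a normalizer. Everything in it is an assumption made for illustration: the three sample documents, the `build_index` and `search` helpers, and `strip_suffix`, which is only a crude stand-in for a real stemmer or lemmatizer.

```python
from collections import defaultdict


def strip_suffix(word):
    """Crude stand-in for a stemmer: strip one of a few suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "er", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return the sorted ids of documents matching the normalized query term."""
    return sorted(index.get(normalize(query.lower()), set()))


docs = {
    1: "the cluster computed the result",
    2: "computing power keeps growing",
    3: "a new computer arrived",
}

raw_index = build_index(docs, normalize=lambda t: t)
norm_index = build_index(docs, normalize=strip_suffix)

print(search(raw_index, "compute", lambda t: t))   # exact matching only
print(search(norm_index, "compute", strip_suffix))  # matching on stripped forms
```

Without normalization the query "compute" matches nothing, while the stripped index returns all three documents. Whether the match on "computer" is wanted or is noise depends on the application, which is where a more precise lemmatizer can help.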

In [5]:
# Stemming
# NLTK provides several stemmers; two widely used ones are the Porter stemmer
# and the Snowball stemmers. They implement different algorithms.

# spaCy does not include a stemmer, so we use NLTK here.

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
#tokens = ['final','finally','fine','finalle']
tokens = ['compute', 'computer', 'computed', 'computing']


for token in tokens:
    print(stemmer.stem(token))

comput
comput
comput
comput


In [6]:
# The Snowball stemmer is a slightly improved version of the Porter stemmer
# and is usually preferred over it.
# Let's see the Snowball stemmer in action:

from nltk.stem.snowball import SnowballStemmer
obj = SnowballStemmer(language='english')

for token in tokens:
    print(obj.stem(token))

comput
comput
comput
comput


In [20]:
# Lemmatization
# The model must be installed first: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')

my_data = u"""
written computed If finally getting up early doesn’t come naturally, there are some strategies you can try. 
Early exercise and exposing yourself to light as soon as possible can help stimulate metabolism and body temperature, which gets you going more quickly. Yet the early alarm clock may not work for everyone – it turns out there are plenty of caveats around trying to become a morning person if it’s not an easy fit. Is getting up early for everyone? No. Whether or not waking up early actually makes you more productive could be in your genes. There’s been lots of research about how some people are biologically more likely to feel more alert in the morning, while others are at their best at night. You might be more alert and have better cognitive ability in the afternoon, for instance. In fact, a recent study published in the journal Nature Communications provided further evidence that this is the case. Looking at data from over 700,000 people, researchers found over 350 genetic factors that could influence whether people feel more naturally energised either in the morning or in the evening. The large sample size makes the study the biggest of its kind so far, though further research is needed to confirm the results. 
So, if you don’t naturally feel alert in the morning but decide to wake up early anyway, you might be sabotaging your actual peak performance times.  
"""

doc = nlp(u'A letter has been written, asking him to be released')

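# Print the lemma of each token.
# Note: the -PRON- lemma in the output is specific to spaCy 2.x;
# spaCy 3 and later return a regular lemma for pronouns.
for token in doc:
    print(token.lemma_)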


a
letter
have
be
write
,
ask
-PRON-
to
be
release


In [11]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer(language='english')

my_data = """
If getting up early doesn’t come naturally, there are some strategies you can try. 
Early exercise and exposing yourself to light as soon as possible can help stimulate metabolism and body temperature, which gets you going more quickly. Yet the early alarm clock may not work for everyone – it turns out there are plenty of caveats around trying to become a morning person if it’s not an easy fit. Is getting up early for everyone? No. Whether or not waking up early actually makes you more productive could be in your genes. There’s been lots of research about how some people are biologically more likely to feel more alert in the morning, while others are at their best at night. You might be more alert and have better cognitive ability in the afternoon, for instance. In fact, a recent study published in the journal Nature Communications provided further evidence that this is the case. Looking at data from over 700,000 people, researchers found over 350 genetic factors that could influence whether people feel more naturally energised either in the morning or in the evening. The large sample size makes the study the biggest of its kind so far, though further research is needed to confirm the results. 
So, if you don’t naturally feel alert in the morning but decide to wake up early anyway, you might be sabotaging your actual peak performance times.  
"""

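# Build the stop-word set once instead of once per token.
# Note: the list is lowercase, so capitalized tokens such as 'If' are not filtered out.
stop_words = set(stopwords.words('english'))

my_sentences = sent_tokenize(my_data)

for sent in my_sentences:
    print(sent)
    words = word_tokenize(sent)
    # Stem every token that is not a stop word.
    stems = [stemmer.stem(word) for word in words if word not in stop_words]
    print(stems)
    print("\n","---------------------------------------------------------","\n")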


If getting up early doesn’t come naturally, there are some strategies you can try.
['if', 'get', 'earli', '’', 'come', 'natur', ',', 'strategi', 'tri', '.']

 --------------------------------------------------------- 

Early exercise and exposing yourself to light as soon as possible can help stimulate metabolism and body temperature, which gets you going more quickly.
['earli', 'exercis', 'expos', 'light', 'soon', 'possibl', 'help', 'stimul', 'metabol', 'bodi', 'temperatur', ',', 'get', 'go', 'quick', '.']

 --------------------------------------------------------- 

Yet the early alarm clock may not work for everyone – it turns out there are plenty of caveats around trying to become a morning person if it’s not an easy fit.
['yet', 'earli', 'alarm', 'clock', 'may', 'work', 'everyon', '–', 'turn', 'plenti', 'caveat', 'around', 'tri', 'becom', 'morn', 'person', '’', 'easi', 'fit', '.']

 --------------------------------------------------------- 

Is getting up early for everyone?
['is