# Stemming
Stemming strips a word down to its root form.  For example, the words "eat", "eating", and "eaten" all share the root "eat".
We use stemming wherever we want to group words that have the same root.  In machine processing of text, stemming maps the variants of a word to a single token, so the same information is carried by a smaller vocabulary.  This can improve the performance of algorithms that rely on word counts or word features, such as machine learning models.
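
As a rough illustration of the grouping effect, the sketch below counts distinct strings before and after stemming. It assumes NLTK is installed and uses vocabulary size as a simple stand-in for the "grouping" described above; the `tokens` list is made up for the example.

```python
from nltk.stem import PorterStemmer

# Illustrative word forms (an assumption for this sketch, not a real corpus)
tokens = ["dance", "danced", "dancing", "dances", "fishing", "fished", "fishes"]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Each distinct string would be a separate feature for a downstream algorithm
print("distinct raw tokens:", len(set(tokens)))
print("distinct stems:     ", len(set(stems)))
```

Fewer distinct strings means that counts for related word forms are pooled rather than spread across several features.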

## Stemming Algorithms
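These algorithms work through suffix stripping.  It is a fairly mechanical task that applies a set of simple rules.  The Porter stemmer is the simplest.  In simplified form, its rules look like this (the real algorithm attaches conditions to each rule, so not every matching suffix is removed):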
- if the word ends in 'ed', remove the 'ed'
- if the word ends in 'ing', remove the 'ing'
- if the word ends in 'ly', remove the 'ly'
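
The three simplified rules above can be sketched in a few lines of plain Python. This is a toy, not the Porter algorithm: `naive_stem` and `SUFFIXES` are hypothetical names invented for this illustration, and the function applies the rules unconditionally.

```python
# Toy suffix stripper: only the three simplified rules listed above,
# applied with no conditions. This is NOT the Porter algorithm.
SUFFIXES = ("ed", "ing", "ly")

def naive_stem(word):
    """Remove the first matching suffix, if any, and return the rest."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for w in ["fished", "fishing", "quickly", "sport", "bed"]:
    print(w, "->", naive_stem(w))
```

The toy turns "bed" into "b", which shows why real stemmers attach conditions to each rule, such as requiring a minimum amount of word to remain after the suffix is removed.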

There are three well-known stemming algorithms:
- Porter Stemmer is a very gentle algorithm, in that it doesn't aggressively remove suffixes.
- Snowball Stemmer addresses some issues in the Porter Stemmer and is slightly more aggressive.
- Lancaster Stemmer is quite aggressive.

In [1]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import LancasterStemmer

In [2]:
dance = ["dance", "danced", "dancing", "dancer", "dances"]
fish = ["fishing","fisher","fished","fishes"]
sport = ["sport", "sporty", "sporting", "sportingly"]

In [3]:
# Each helper below prints "word -> stem" for every word in the list x.

def porterStemList(x):
    ps = PorterStemmer()
    for w in x:
        rootWord = ps.stem(w)
        print(w, "->", rootWord)

def snowballStemList(x):
    # Snowball supports several languages, so the language must be named
    ss = SnowballStemmer(language="english")
    for w in x:
        rootWord = ss.stem(w)
        print(w, "->", rootWord)

def lancasterStemList(x):
    ls = LancasterStemmer()
    for w in x:
        rootWord = ls.stem(w)
        print(w, "->", rootWord)

In [4]:
porterStemList(dance) 
porterStemList(fish) 
porterStemList(sport) 

dance -> danc
danced -> danc
dancing -> danc
dancer -> dancer
dances -> danc
fishing -> fish
fisher -> fisher
fished -> fish
fishes -> fish
sport -> sport
sporty -> sporti
sporting -> sport
sportingly -> sportingli


In [5]:
snowballStemList(dance) 
snowballStemList(fish) 
snowballStemList(sport) 

dance -> danc
danced -> danc
dancing -> danc
dancer -> dancer
dances -> danc
fishing -> fish
fisher -> fisher
fished -> fish
fishes -> fish
sport -> sport
sporty -> sporti
sporting -> sport
sportingly -> sport


In [6]:
lancasterStemList(dance) 
lancasterStemList(fish) 
lancasterStemList(sport) 

dance -> dant
danced -> dant
dancing -> dant
dancer -> dant
dances -> dant
fishing -> fish
fisher -> fish
fished -> fish
fishes -> fish
sport -> sport
sporty -> sporty
sporting -> sport
sportingly -> sport
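
To make the differences easier to see, the three stemmers can be run side by side on the same words. This is a minimal sketch that assumes NLTK is installed; `compare_stemmers` is a helper name made up for this example.

```python
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer(language="english"),
    "lancaster": LancasterStemmer(),
}

def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for every word."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

# Words on which the three stemmers disagree in the outputs above
words = ["dancer", "fisher", "sporty", "sportingly"]
print(f"{'word':<12}", *(f"{name:<12}" for name in stemmers))
for word, row in compare_stemmers(words).items():
    print(f"{word:<12}", *(f"{stem:<12}" for stem in row.values()))
```

Reading across a row shows how far each algorithm is willing to cut: the more aggressive the stemmer, the more word forms end up sharing a stem.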


## Issues with Stemming
In the example below, the words have several different meanings (a university is not a universe), but the stemmer reduces them all to the same stem.  Stemmers have no concept of how a word is being used.

In [7]:
uni = ["university", "universal", "universities", "universe", "universality"]
porterStemList(uni)

university -> univers
universal -> univers
universities -> univers
universe -> univers
universality -> univers
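
One way to spot this kind of collision in your own vocabulary is to group words by their stem and inspect the groups. The sketch below assumes NLTK is installed; `group_by_stem` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

def group_by_stem(words):
    """Map each Porter stem to the list of words that produced it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for w in words:
        groups[stemmer.stem(w)].append(w)
    return dict(groups)

mixed = ["university", "universe", "universal", "fishing", "fished"]
for stem, members in group_by_stem(mixed).items():
    print(stem, "<-", members)
```

Groups whose members clearly differ in meaning are candidates for a gentler stemmer or for lemmatization.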


## Lemmatization
Lemmatization is a more complex process than stemming: it uses knowledge of the language to reduce a word to its base dictionary form (its lemma).
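
The idea of "knowledge of the language" can be pictured as a dictionary lookup keyed on both the word and its part of speech. The sketch below is a toy with a hand-written table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer relies on a far larger lexicon plus morphological rules.

```python
# Hand-written lookup table (an assumption for illustration only).
# Keys are (word, part of speech); "n" = noun, "v" = verb, "a" = adjective.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("leaves", pos="n"))
print(toy_lemmatize("leaves", pos="v"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("fishing"))  # not in the table, so returned as-is
```

Two points carry over to real lemmatizers: the same surface form can have different lemmas depending on its part of speech, and the result is always a real dictionary word rather than a truncated string.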


In [1]:
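import nltk
# Fetch the WordNet data used by WordNetLemmatizer (needs network access;
# download() returns False if the data could not be fetched)
nltk.download('wordnet')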

[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [2]:
from nltk.stem import WordNetLemmatizer

In [10]:
# Note that in this function we pass in the part of speech (pos) defaulting to noun
def lemmatizeList(x, pos="n"):
    lemmatizer = WordNetLemmatizer()
    for w in x:
        rootWord = lemmatizer.lemmatize(w, pos=pos)
        print(w, "->", rootWord)

In [11]:
lemmatizeList(dance) 
lemmatizeList(fish) 
lemmatizeList(uni)

dance -> dance
danced -> danced
dancing -> dancing
dancer -> dancer
dances -> dance
fishing -> fishing
fisher -> fisher
fished -> fished
fishes -> fish
university -> university
universal -> universal
universities -> university
universe -> universe
universality -> universality


In [12]:
strip = ["strip", "stripe", "stripes"]

lemmatizeList(strip, pos="n") 

strip -> strip
stripe -> stripe
stripes -> stripe


In [13]:

lemmatizeList(strip, pos="v") 

strip -> strip
stripe -> stripe
stripes -> strip


## Stemming or Lemmatization?
Which is better?
It depends on your task.  If the task requires preserving the meaning of words, lemmatization is probably better.  If the task only needs a rough grouping of related words, then stemming may be better.

