# Stemming

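When searching text for a keyword, it often helps if the search also returns variations of that word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].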

- Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. 
- This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required.
- In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. 

Instead, we'll use another popular NLP tool called nltk, which stands for Natural Language Toolkit. For more information on nltk, visit https://www.nltk.org/
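
To make the "chops off letters from the end" idea concrete, here is a minimal sketch of a naive suffix stripper. It is not the Porter algorithm and not part of nltk; the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of
# "chop letters off the end". The suffix list is a hypothetical choice.
SUFFIXES = ("ing", "er", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["boat", "boater", "boating", "boats", "running", "ran"]:
    print(word + " --> " + naive_stem(word))
```

This toy version maps "boater", "boating" and "boats" to "boat", but it turns "running" into "runn" and leaves "ran" untouched. Doubled consonants are one of the cases that real stemmers handle with additional rules.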

# Porter Stemmer

In [1]:
import nltk

from nltk.stem.porter import PorterStemmer

In [2]:
p_stemmer = PorterStemmer()

In [13]:
words = ['run','runner','running','ran','runs','easily','fairly','fairness']

In [14]:
for word in words:
  print(word + ' --> ' + p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli
fairness --> fair


# Snowball Stemmer

The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since nltk uses the name `SnowballStemmer`, we'll use it here.

In [6]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [7]:
words = ['run','runner','running','ran','runs','easily','fairly','fairness']

In [15]:
for word in words:
  print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
fairness --> fair


# Try it yourself!

In [9]:
words = ['consolingly']

In [10]:
print('Porter Stemmer:')
for word in words:
  print(word+' --> '+p_stemmer.stem(word))

Porter Stemmer:
consolingly --> consolingli


In [11]:
print('Porter2 Stemmer:')
for word in words:
  print(word+' --> '+s_stemmer.stem(word))

Porter2 Stemmer:
consolingly --> consol


Stemming has its drawbacks. Given the token `saw`, a stemmer will typically return `saw` every time, whereas a lemmatizer would likely return either `see` or `saw`, depending on whether the token is used as a verb or a noun.

In [12]:
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
  print(word+' --> '+p_stemmer.stem(word))

I --> i
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


Here the word "meeting" appears twice, once as a verb and once as a noun, and yet the stemmer treats both identically because it looks only at the string, not at its role in the sentence.
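
As a contrast, the sketch below shows what part-of-speech-aware behavior could look like. It is a toy only: the lookup table `TOY_LEMMAS`, the function `toy_lemma`, and the `VERB`/`NOUN` tag names are hypothetical, and a real lemmatizer relies on a full vocabulary and a part-of-speech tagger rather than a hand-written dictionary.

```python
# Hypothetical lookup table keyed by (token, part of speech).
# A real lemmatizer uses a full vocabulary and a POS tagger; this is a toy.
TOY_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemma(token, pos):
    """Return the lemma for (token, pos), falling back to the token itself."""
    return TOY_LEMMAS.get((token.lower(), pos), token)

examples = [("saw", "VERB"), ("saw", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]
for token, pos in examples:
    print(token + ' (' + pos + ') --> ' + toy_lemma(token, pos))
```

Because the key includes the part of speech, the same string can map to different results, which a stemmer working on the string alone cannot do.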