# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the **stem** for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops letters off the end of a word until the stem is reached. This works fairly well in most cases, but English has many exceptions that require a more sophisticated process. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, the background on this decision is in [spaCy issue #327](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.

Because spaCy has no stemmer, in this section we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk, visit https://www.nltk.org/.
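
To make the "chopping" idea concrete, here is a minimal sketch of a naive suffix stripper. The suffix list and the `naive_stem` helper are made up for illustration and are not part of nltk; real stemmers use far more careful rules.

```python
# Hypothetical, deliberately crude stemmer: not part of nltk.
SUFFIXES = ["ing", "er", "s"]  # illustrative suffix list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['boat', 'boater', 'boating', 'boats', 'running', 'news']:
    print(word + '---->' + naive_stem(word))
```

The four "boat" words collapse to a single stem, but `running` becomes `runn` and `news` becomes `new`. Exceptions like these are why practical stemmers rely on layered rules rather than a flat suffix list.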

## Porter Stemmer

One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules.

- In the first phase, simple suffix mapping rules are defined.
- More sophisticated phases consider the length/complexity of the word before applying a rule.
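
The two bullet points above can be sketched in plain Python. The first part applies the Step 1a suffix rules from Porter's definition; the second computes Porter's *measure* `m` (the number of vowel-consonant sequences in a stem) and uses it to gate the Step 1b rule `(m > 0) EED -> EE`. The function names are illustrative assumptions, and this is not nltk's implementation: nltk's default mode adds a few extensions, so its results can differ for some words.

```python
# Step 1a rules from Porter's definition: (suffix, replacement), longest first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    """Apply the first (longest) matching Step 1a rule."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

def is_consonant(word, i):
    """'y' counts as a consonant only at the start of a word or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_is_vowel:
            m += 1
        prev_is_vowel = not consonant
    return m

def eed_rule(word):
    """(m > 0) EED -> EE: shorten only if the stem before 'eed' has m > 0."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word

for word in ['caresses', 'ponies', 'caress', 'cats']:
    print(word + '---->' + step_1a(word))

for word in ['feed', 'agreed']:
    print(word + '---->' + eed_rule(word))
```

With the measure check in place, `agreed` is reduced to `agree` (the stem `agr` has `m = 1`), while `feed` is left alone because the stem `f` has `m = 0`.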

In [1]:
import nltk

In [3]:
# Import Porter Stemmer

from nltk.stem.porter import PorterStemmer

In [4]:
# Create the stemmer object

p_stemmer = PorterStemmer()

In [20]:
words = ['run','runner','ran','runs','easily','fairly','fairness','functional','university']

In [21]:
for word in words:
    print(word + '---->' + p_stemmer.stem(word))

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fairli
fairness---->fair
functional---->function
university---->univers


## Snowball Stemmer

Snowball is the name of a stemming language, also developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter Stemmer in both logic and speed.

In [22]:
# Import Snowball Stemmer

from nltk.stem.snowball import SnowballStemmer
snow_stemmer = SnowballStemmer(language='english')

In [23]:
for word in words:
    print(word + '---->' + snow_stemmer.stem(word))

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fair
fairness---->fair
functional---->function
university---->univers
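
The only difference from the Porter output is `fairly`: Porter leaves it as `fairli`, while Snowball returns `fair`. A likely explanation, based on the published Porter2 definition, is its rule that deletes a trailing `li` when the preceding letter is a "valid li-ending". Below is a minimal sketch of that single rule. The `strip_li` helper is hypothetical, it skips the R1 region check that the full algorithm performs, and it is not nltk's code.

```python
# Letters that Porter2 treats as a "valid li-ending".
VALID_LI_ENDINGS = set("cdeghkmnrt")

def strip_li(word):
    """Drop a trailing 'li' if the letter before it is a valid li-ending.

    Simplified: the real Porter2 rule also requires 'li' to lie in region R1.
    """
    if word.endswith("li") and len(word) > 2 and word[-3] in VALID_LI_ENDINGS:
        return word[:-2]
    return word

for word in ['fairli', 'easili']:
    print(word + '---->' + strip_li(word))
```

`fairli` has `r` before `li`, so the suffix is dropped, while `easili` has `i` before `li` and is left unchanged, which matches the Snowball output above.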


In [24]:
words = ['generous','generation','generously','generate']

In [26]:
for word in words:
    print(word + '---->' + snow_stemmer.stem(word))

generous---->generous
generation---->generat
generously---->generous
generate---->generat


In [28]:
phrase = 'We are meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word + ' --> ' + snow_stemmer.stem(word))

We --> we
are --> are
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet
