Two common ways of combining similar words are _stemming_ and _lemmatization_.

## Stemming

Stemming is the simpler of the two: it works by chopping off the end of each word according to a fixed set of rules. A popular algorithm for this, called Porter's algorithm, would convert `ponies` -> `poni` and `cats` -> `cat`. Stemming is fast, but it can have a high false positive rate, because unrelated words can end up with the same stem.
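To make the suffix-chopping idea concrete, here is a minimal sketch of just the first rule group (step 1a) of Porter's original algorithm. The function name `porter_step_1a` and the lowercasing step are assumptions made for illustration; this is not NLTK's implementation, which applies several further steps.

```python
# Step 1a of Porter's original algorithm as (suffix, replacement) pairs.
# Rules are tried in order and only the first matching suffix is applied.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word):
    """Apply only step 1a (plural stripping) to a single word."""
    word = word.lower()  # assumption: normalize case before matching
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["ponies", "cats", "caresses", "this"]:
    print(w, "->", porter_step_1a(w))
```

Because the rules look only at the characters at the end of the word, `this` loses its final `s` just like a plural would, which is how a stemmer ends up producing non-words.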

In [29]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

text = "This shirt is dry-clean only which means it's dirty"
tokens = text.split(" ")  # naive whitespace tokenization
for token in tokens:
    # the stemmer lowercases each token before stripping suffixes
    print(stemmer.stem(token))

thi
shirt
is
dry-clean
onli
which
mean
it'
dirti


Stemming doesn't take parts of speech into account and can produce invalid words, such as `onli` and `dirti` above. This is why spaCy doesn't include a stemmer and provides lemmatization instead.

## Lemmatization

Lemmatization groups words by their _lemma_, the standard (dictionary) form of the word. For example, `better` and `best` have the lemma `good`, whereas `walk` is the lemma for `walking`, `walked`, and `walks`. Lemmatization is more complicated than stemming, but has fewer false positives.
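As a rough illustration of why lemmatization needs more information than stemming, the sketch below looks up lemmas in a small table keyed by word and part of speech. The table `LEMMA_TABLE`, the function `lemmatize`, and the `meeting` entries are hypothetical and written by hand for this example; real lemmatizers such as spaCy's rely on much larger tables and rules.

```python
# Hypothetical hand-written table: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("best", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("better", "ADJ"))    # good
print(lemmatize("meeting", "VERB"))  # meet
print(lemmatize("meeting", "NOUN"))  # meeting
```

The same surface form, `meeting`, maps to different lemmas depending on its part of speech, which a suffix-stripping stemmer cannot distinguish.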

In [16]:
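import spacy

text = "I haven't slept for ten days, because that would be too long."

# The 'en' shortcut was removed in spaCy 3; load the model by its full name
# (assumes the en_core_web_sm model is installed).
nlp = spacy.load('en_core_web_sm')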
doc = nlp(text)
for token in doc:
    if not token.is_punct and token.text != token.lemma_:
        print(token.text, "->", token.lemma_)

I -> -PRON-
n't -> not
slept -> sleep
days -> day


You can see that plurals and different tenses are collapsed into a single base form. spaCy also special-cases pronouns, because there is no clear lemma for them: should "I" become "me", "it", or "they"? To avoid this ambiguity, spaCy 2.x assigns all pronouns the special lemma `-PRON-`. spaCy 3 dropped this placeholder, so the output above reflects the older behavior.