# 1. Stemming

Stemming works fairly well in most cases, but English has many exceptions that call for a more sophisticated process.

spaCy doesn't include a stemmer; it uses lemmatization instead.

Stemming removes the suffixes from a word to reduce it to its root form (the stem), which is not always a real word.
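To make the idea of suffix removal concrete, here is a minimal sketch in plain Python. It is not the Porter algorithm: the `SUFFIXES` list and the `naive_stem` helper are hypothetical names invented for this illustration, and the rules are deliberately tiny.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
SUFFIXES = ("ing", "ly", "s")  # hypothetical, deliberately tiny rule list


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["raining", "runs", "fairly", "running", "ran"]:
    print(word, "-->", naive_stem(word))
```

This sketch turns 'running' into 'runn' because it has no rule for doubled consonants, and it leaves the irregular form 'ran' untouched. Real stemmers add many more rules, but they still work on the surface form of the word.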

We will use the Natural Language Toolkit (NLTK) to learn how stemming works.


## Porter Stemmer

In [1]:
import nltk
from nltk.stem.porter import PorterStemmer

In [2]:
p_stemmer = PorterStemmer() # object of class PorterStemmer

In [12]:
words = ['run','runner','running','ran','runs','easily','fairly', 'raining', 'beautiful', 'beauty'] # list of words 

In [13]:
for word in words:
    print(word + '------>' + p_stemmer.stem(word))

run------>run
runner------>runner
running------>run
ran------>ran
runs------>run
easily------>easili
fairly------>fairli
raining------>rain
beautiful------>beauti
beauty------>beauti


## Snowball Stemmer

The Snowball stemmer is also called the "English stemmer" or "Porter2 stemmer".
It offers a slight improvement over the original Porter stemmer.

In [5]:
from nltk.stem.snowball import SnowballStemmer
# Pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [14]:
for word in words:
    print(word +' --> '+ s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
raining --> rain
beautiful --> beauti
beauty --> beauti
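To see how slight the difference is on this word list, the sketch below compares the two sets of stems. The dictionaries are copied by hand from the Porter and Snowball outputs printed above, so the snippet runs without NLTK; the variable names are chosen only for this illustration.

```python
# Stems copied from the Porter and Snowball outputs shown above.
porter = {
    'run': 'run', 'runner': 'runner', 'running': 'run', 'ran': 'ran',
    'runs': 'run', 'easily': 'easili', 'fairly': 'fairli',
    'raining': 'rain', 'beautiful': 'beauti', 'beauty': 'beauti',
}
snowball = {
    'run': 'run', 'runner': 'runner', 'running': 'run', 'ran': 'ran',
    'runs': 'run', 'easily': 'easili', 'fairly': 'fair',
    'raining': 'rain', 'beautiful': 'beauti', 'beauty': 'beauti',
}

# Keep only the words on which the two stemmers disagree.
differences = {
    word: (porter[word], snowball[word])
    for word in porter
    if porter[word] != snowball[word]
}
print(differences)
```

On these ten words the two stemmers disagree only on 'fairly'. Neither one relates 'ran' to 'run'.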


# Lemmatization

Lemmatization is the process of converting words into their dictionary form (the lemma).

Lemmatization draws on the full vocabulary of a language to apply morphological analysis to each word.

e.g. Word: Feet, Lemma: Foot

Stemming, by contrast, is the process of reducing words to their non-changing portion.
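The contrast can be sketched in plain Python, assuming a tiny hand-written lookup table in place of a real vocabulary. `LEMMA_TABLE`, `toy_lemma` and `strip_s` are hypothetical names for this illustration; this is not how spaCy implements lemmatization.

```python
# Hypothetical lookup table; a real lemmatizer relies on a full vocabulary
# and part-of-speech information rather than a handful of entries.
LEMMA_TABLE = {"feet": "foot", "are": "be", "best": "good", "bats": "bat"}


def toy_lemma(word):
    """Return the dictionary form if known, else the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


def strip_s(word):
    """Crude stemming stand-in: drop a trailing 's' from longer words only."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


for word in ["feet", "are", "best", "bats"]:
    print(f"{word:6} lemma={toy_lemma(word):5} stem={strip_s(word)}")
```

The lookup maps 'feet' to 'foot' and 'best' to 'good' because it knows the words. Suffix stripping only handles the regular plural 'bats'.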


In [18]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [19]:
doc1 = nlp("The striped bats are hanging on their feet for best")

In [17]:
for token in doc1:
    print(token.text, '\t', token.pos_, '\t',token.lemma, '\t',token.lemma_)

# Columns: token text, POS tag, lemma hash (token.lemma), lemma string (token.lemma_)

The 	 DET 	 7425985699627899538 	 the
striped 	 VERB 	 929563449582324419 	 stripe
bats 	 NOUN 	 8577633547555682751 	 bat
are 	 AUX 	 10382539506755952630 	 be
hanging 	 VERB 	 4780549502391586051 	 hang
on 	 ADP 	 5640369432778651323 	 on
their 	 DET 	 561228191312463089 	 -PRON-
feet 	 NOUN 	 779410287755165804 	 foot
for 	 ADP 	 16037325823156266367 	 for
best 	 ADJ 	 5711639017775284443 	 good


In [23]:
def show_lemmas(doc):
    for token in doc:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [21]:
doc2 = nlp("The striped bats are hanging on their feet for best")

In [24]:
show_lemmas(doc2) # 'are' becomes 'be', 'best' -> 'good'. Stemming doesn't do this.

The          DET    7425985699627899538    the
striped      VERB   929563449582324419     stripe
bats         NOUN   8577633547555682751    bat
are          AUX    10382539506755952630   be
hanging      VERB   4780549502391586051    hang
on           ADP    5640369432778651323    on
their        DET    561228191312463089     -PRON-
feet         NOUN   779410287755165804     foot
for          ADP    16037325823156266367   for
best         ADJ    5711639017775284443    good
