# Stemming with NLTK

```
# ! pip install nltk
```
- Stemming is a somewhat crude method for cataloging related words; it essentially trims letters off the end of a word until the stem is reached.
- This works fairly well in most cases, but English has many exceptions where a more sophisticated process is required.
- In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization.
- Because spaCy does not include stemming, we will switch to NLTK and learn about its stemmers.
- Let's discuss both the Porter Stemmer and the Snowball Stemmer.
  
# Stemming with Porter
- One of the most common and effective stemming tools is Porter's algorithm, developed by Martin Porter in 1980.
- The algorithm employs five phases of word reduction, each with its own set of mapping rules.
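
To make the idea of a phase with its own mapping rules more concrete, here is a minimal sketch of only the first sub-step (usually labelled step 1a) as described in Porter's original algorithm, which handles plural endings. The names `STEP_1A_RULES` and `porter_step_1a` are hypothetical and chosen for illustration; this is not NLTK's internal code, and the full algorithm applies further steps with extra conditions on the stem.

```python
# Suffix rules for Porter step 1a, listed so the longest suffix is tried first.
# Illustrative only: the complete algorithm continues with later phases.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (left unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply the first matching step-1a rule (hypothetical helper)."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched, so the word passes through untouched


for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "--->", porter_step_1a(w))
```

Trying the longest suffix first matters: without it, `caresses` would match the bare `s` rule and lose only one letter.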

In [7]:
import nltk
from nltk.stem.porter import PorterStemmer 

In [9]:
p_stemmer = PorterStemmer()

In [11]:
words = ['run','ran','runner','running','runs','easily','fairly','mainly']

In [30]:
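# Note: this cell was rerun (In [30]) after `words` was reassigned in In [24],
# so the output below is for ['generous','generation','generously','generate'].
for word in words:
    print(word, "--->", p_stemmer.stem(word))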

generous ---> gener
generation ---> gener
generously ---> gener
generate ---> gener


# Stemming with Snowball
- Snowball is the name of a stemming language also developed by Martin Porter.
- The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer".
- It offers a slight improvement over the original Porter stemmer, both in logic and speed.

In [20]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')

In [22]:
for word in words:
    print(word, "--->", s_stemmer.stem(word))

run ---> run
ran ---> ran
runner ---> runner
running ---> run
runs ---> run
easily ---> easili
fairly ---> fair
mainly ---> main


In [24]:
words = ['generous','generation','generously','generate']

In [28]:
for word in words:
    print(word, "--->", s_stemmer.stem(word))

generous ---> generous
generation ---> generat
generously ---> generous
generate ---> generat
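
To compare the two stemmers directly, the following sketch runs both on the same small word list and prints the stems side by side. It assumes NLTK is installed; the variable names `porter`, `snowball`, `sample`, and `rows` are introduced here for illustration and are not part of the notebook above.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# A few words taken from the two lists used in the cells above.
sample = ["running", "fairly", "generous", "generation"]

# Collect (word, Porter stem, Snowball stem) rows for a side-by-side view.
rows = [(w, porter.stem(w), snowball.stem(w)) for w in sample]

for word, p_stem, s_stem in rows:
    print(f"{word:<12} {p_stem:<10} {s_stem}")
```

Words such as `generous` are a useful check: the outputs above show Porter reducing it to `gener`, while Snowball leaves it as `generous`.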


# Lemmatization
1. In contrast to stemming, **lemmatization** looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.
2. For example:<br>
   - was --> be<br>
   - mice --> mouse<br>
   - meeting --> 'meet' or 'meeting' depending on its use in a sentence.
3. _**Lemmatization**_ is typically seen as much more informative than simple stemming, which is why spaCy offers only *lemmatization* and not *stemming*.
4. Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
5. Next, we will look at word vectors and similarity.
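
The dependence on part of speech can be sketched with a toy lookup table. Everything here is hypothetical: `LEMMA_TABLE` and `toy_lemmatize` are illustrative names, the part-of-speech tags are supplied by hand, and a real lemmatizer such as spaCy's relies on a much larger vocabulary together with rules.

```python
# Hypothetical lookup table keyed on (lowercased word, part-of-speech tag).
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",     # "We are meeting tomorrow."
    ("meeting", "NOUN"): "meeting",  # "The meeting ran long."
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("mice", "NOUN"))
```

The same surface form maps to different lemmas once the tag changes, which is something a suffix-stripping stemmer cannot express.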

## Lemmatization with spaCy

In [51]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [53]:
doc1 = nlp("I am a runner running in a race because I love to run since I ran today")

In [57]:
for token in doc1:
    print(token.text,'\t\t',token.pos_,'\t',token.lemma,'\t',token.lemma_)

I 		 PRON 	 4690420944186131903 	 I
am 		 AUX 	 10382539506755952630 	 be
a 		 DET 	 11901859001352538922 	 a
runner 		 NOUN 	 12640964157389618806 	 runner
running 		 VERB 	 12767647472892411841 	 run
in 		 ADP 	 3002984154512732771 	 in
a 		 DET 	 11901859001352538922 	 a
race 		 NOUN 	 8048469955494714898 	 race
because 		 SCONJ 	 16950148841647037698 	 because
I 		 PRON 	 4690420944186131903 	 I
love 		 VERB 	 3702023516439754181 	 love
to 		 PART 	 3791531372978436496 	 to
run 		 VERB 	 12767647472892411841 	 run
since 		 SCONJ 	 10066841407251338481 	 since
I 		 PRON 	 4690420944186131903 	 I
ran 		 VERB 	 12767647472892411841 	 run
today 		 NOUN 	 11042482332948150395 	 today


In [85]:
def show_lemmas(doc):
    """Print each token's text, part of speech, lemma hash, and lemma string."""
    for token in doc:
        print(f'{token.text:15} {token.pos_:10} {token.lemma:<25} {token.lemma_}')

In [89]:
# token.lemma is the integer hash of the lemma in spaCy's vocabulary;
# token.lemma_ is the corresponding lemma string.
show_lemmas(doc1)

I               PRON       4690420944186131903       I
am              AUX        10382539506755952630      be
a               DET        11901859001352538922      a
runner          NOUN       12640964157389618806      runner
running         VERB       12767647472892411841      run
in              ADP        3002984154512732771       in
a               DET        11901859001352538922      a
race            NOUN       8048469955494714898       race
because         SCONJ      16950148841647037698      because
I               PRON       4690420944186131903       I
love            VERB       3702023516439754181       love
to              PART       3791531372978436496       to
run             VERB       12767647472892411841      run
since           SCONJ      10066841407251338481      since
I               PRON       4690420944186131903       I
ran             VERB       12767647472892411841      run
today           NOUN       11042482332948150395      today
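
The large integers in the third column are hash IDs from spaCy's string store. As a small sketch of how the hash and the string relate, the example below uses a standalone `StringStore`, assuming spaCy is installed; the variable names `store` and `run_hash` are illustrative. In the notebook, the same kind of store is reachable through `nlp.vocab.strings`.

```python
from spacy.strings import StringStore

store = StringStore()

# Adding a string returns its integer hash.
run_hash = store.add("run")

print(run_hash)         # the integer ID
print(store[run_hash])  # look up by hash to get the string back
print(store["run"])     # look up by string to get the hash
```

Storing each string once and passing integers around is what lets `token.lemma` be compared cheaply, while `token.lemma_` gives the readable form.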


In [91]:
doc2 = nlp("I saw ten mice today")
show_lemmas(doc2)