# Stemming and Lemmatization:

---
## Stemming:

- A rule-based, heuristic approach that reduces inflected word forms to a root form (the stem) by stripping suffixes. The resulting stem need not be a real word.
- Types of popular stemmers:
    - Porter Stemmer
    - Snowball Stemmer (Porter 2)
    - Lancaster Stemmer
    
- ***Because stemming is rule-based, the resulting stems can be wrong: overstemming strips too much and merges unrelated words, while understemming strips too little and leaves related words with different stems.***
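
To make overstemming and understemming concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the `toy_stem` helper are hypothetical and far cruder than Porter, Snowball, or Lancaster; they only illustrate how fixed rules can go wrong in both directions.

```python
# Toy suffix rules, tried in order. This is NOT the Porter algorithm,
# just an illustration of rule-based stemming.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


# Rules behaving as intended: related forms share one stem.
print([toy_stem(w) for w in ["pythoning", "pythoned", "pythonly"]])

# Overstemming: unrelated words ("caring", "cars") collapse to the same stem.
print(toy_stem("caring"), toy_stem("cars"))

# Understemming: related words ("running", "ran") keep different stems.
print(toy_stem("running"), toy_stem("ran"))
```

Real stemmers use many more rules plus conditions on the remaining stem, which reduces these errors but does not remove them.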

In [9]:
# Importing Dependencies
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import pandas as pd

In [2]:
new_text = "It is important to by very pythonly & % / while you are pythoning with python. All pythoners have pythoned poorly at least once."

### Types of stemming algorithms:

1. Porter Stemmer: 
The oldest and gentlest of the three algorithms. It has been ***largely superseded*** by the Snowball stemmer.

2. Snowball Stemmer (a.k.a. Porter 2):
A revised, slightly more aggressive version of the Porter algorithm. It is also generally ***faster to compute***.

3. Lancaster Stemmer:
The most aggressive of the three, and typically also the ***fastest***.
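
The speed claims above depend on the NLTK version and on the input, so it can be worth measuring them yourself. The sketch below times each stemmer with `timeit`; the sample word list and the repeat count are arbitrary assumptions, and absolute numbers will vary by machine.

```python
import timeit

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Hypothetical sample; swap in tokens from your own corpus.
sample = "All pythoners have pythoned poorly at least once".split()

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

timings = {}
for name, stemmer in stemmers.items():
    # Bind the stemmer as a default argument so each lambda uses its own.
    timings[name] = timeit.timeit(
        lambda s=stemmer: [s.stem(w) for w in sample], number=1000
    )

# Fastest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds:.3f} s")
```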

In [4]:
# Instantiating the three NLTK stemmers to compare
ps = PorterStemmer()
ss = SnowballStemmer("english")
ls = LancasterStemmer()

In [5]:
# Word tokenizer (requires NLTK's Punkt tokenizer data, e.g. nltk.download('punkt'))
words = word_tokenize(new_text)

words

['It',
 'is',
 'important',
 'to',
 'by',
 'very',
 'pythonly',
 '&',
 '%',
 '/',
 'while',
 'you',
 'are',
 'pythoning',
 'with',
 'python',
 '.',
 'All',
 'pythoners',
 'have',
 'pythoned',
 'poorly',
 'at',
 'least',
 'once',
 '.']

In [6]:
# Sentence tokenizer
sentence = sent_tokenize(new_text)

sentence

['It is important to by very pythonly & % / while you are pythoning with python.',
 'All pythoners have pythoned poorly at least once.']

In [14]:
# Porter stemmer: stem every token
stemmed_words_ps = [ps.stem(w) for w in words]

# stemmed_words_ps

In [24]:
print(f"The original input: {new_text} \n")
print(f"The porter stemmer output: {' '.join(stemmed_words_ps)}")

The original input: It is important to by very pythonly & % / while you are pythoning with python. All pythoners have pythoned poorly at least once. 

The porter stemmer output: It is import to by veri pythonli & % / while you are python with python . all python have python poorli at least onc .


In [26]:
# Snowball stemmer: stem every token
stemmed_words_ss = [ss.stem(w) for w in words]

# stemmed_words_ss
print(f"The original input: {new_text} \n")
print(f"The snowball stemmer output: {' '.join(stemmed_words_ss)}")

The original input: It is important to by very pythonly & % / while you are pythoning with python. All pythoners have pythoned poorly at least once. 

The snowball stemmer output: it is import to by veri python & % / while you are python with python . all python have python poor at least onc .


In [28]:
# Lancaster stemmer: stem every token
stemmed_words_ls = [ls.stem(w) for w in words]

# stemmed_words_ls
print(f"The original input: {new_text} \n")
print(f"The lancaster stemmer output: {' '.join(stemmed_words_ls)}")

The original input: It is important to by very pythonly & % / while you are pythoning with python. All pythoners have pythoned poorly at least once. 

The lancaster stemmer output: it is import to by very python & % / whil you ar python with python . al python hav python poor at least ont .
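
The three stemming cells above share the same shape, so they can be folded into one helper. This is a sketch rather than part of the original notebook: `stem_with_all` is a hypothetical name, and the token list is hard-coded from the sentence above so the block runs without the tokenizer data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def stem_with_all(tokens):
    """Return {stemmer name: list of stems} for the given tokens."""
    return {name: [s.stem(t) for t in tokens] for name, s in STEMMERS.items()}


tokens = ["pythoners", "have", "pythoned", "poorly", "while", "pythoning"]
results = stem_with_all(tokens)

# One row per token, so disagreements between stemmers stand out.
for i, token in enumerate(tokens):
    row = "  ".join(f"{name}={stems[i]}" for name, stems in results.items())
    print(f"{token:10s} {row}")
```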


In [21]:
# Comparing words and stemmed_words to understand stemming

df = pd.DataFrame({'original_word': words, 'porter_stemmer': stemmed_words_ps, 'lancaster_stemmer': stemmed_words_ls, 'snowball_stemmer': stemmed_words_ss})
df

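|   | original_word | porter_stemmer | lancaster_stemmer | snowball_stemmer |
|---|---|---|---|---|
| 0 | It | It | it | it |
| 1 | is | is | is | is |
| 2 | important | import | import | import |
| 3 | to | to | to | to |
| 4 | by | by | by | by |
| 5 | very | veri | very | veri |
| 6 | pythonly | pythonli | python | python |
| 7 | & | & | & | & |
| 8 | % | % | % | % |
| 9 | / | / | / | / |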
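
Most rows in the comparison are identical across stemmers, so it helps to filter for the rows where they disagree. The sketch below rebuilds a few rows from the table by hand (an assumption made so the block is self-contained) and keeps only those whose three stem columns are not all equal.

```python
import pandas as pd

# Rows copied from the comparison table above.
df = pd.DataFrame(
    {
        "original_word": ["It", "is", "important", "very", "pythonly"],
        "porter_stemmer": ["It", "is", "import", "veri", "pythonli"],
        "lancaster_stemmer": ["it", "is", "import", "very", "python"],
        "snowball_stemmer": ["it", "is", "import", "veri", "python"],
    }
)

stem_cols = ["porter_stemmer", "lancaster_stemmer", "snowball_stemmer"]

# A row "disagrees" when its stem columns hold more than one distinct value.
disagree = df[df[stem_cols].nunique(axis=1) > 1]
print(disagree)
```

The same filter works on the full `df` built in the notebook, since it only relies on the three column names.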


---
## Lemmatization

Lemmatization is closely related to stemming. The major difference, as the outputs above show, is that stemming often produces non-existent words, whereas a lemma is always an actual word.

So a stem is not necessarily something you can look up in a dictionary, but a lemma is.

Sometimes the lemma is very similar to the original word, and sometimes it is a completely different word. Let's see some examples.
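
The contrast can be sketched without any NLTK data. In the toy code below, `TOY_LEXICON` and both helper functions are hypothetical stand-ins: the "stemmer" blindly chops a final letter, while the "lemmatizer" looks the word up in a small dictionary keyed by word and part of speech, which is roughly the role WordNet plays for `WordNetLemmatizer`.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> dictionary form.
TOY_LEXICON = {
    ("geese", "n"): "goose",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}


def chop_suffix(word: str) -> str:
    """Crude rule: drop a final 'e' or 's'. The result may not be a word."""
    return word[:-1] if word.endswith(("e", "s")) else word


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


# Rule-based chopping yields a non-word; the lookup yields a real one.
print(chop_suffix("geese"), "vs", toy_lemmatize("geese"))

# The part of speech decides which entry is found.
print(toy_lemmatize("better"), "vs", toy_lemmatize("better", pos="a"))
```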

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harish3110/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [18]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With no `pos` argument, the word is treated as a noun
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" marks the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))


rocks : rock
corpora : corpus
better : good


In [19]:
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
# Default pos is noun; pass pos="v" to lemmatize as a verb
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", pos="v"))

cat
cactus
goose
rock
python
good
best
run
run


---