#### Stemming words

Stemming is a normalization method. Many variants of a word carry the same core meaning and differ only in inflection, such as tense or number.

We stem to reduce those variants to a common stem, which shortens the lookup and normalizes sentences.

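Consider:

- I was taking a ride in the car.
- I was riding in the car.

Both sentences describe the same activity. A stemmer aims to map 'ride' and 'riding' to the same stem so that they are treated as one term.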
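
To make the "shorten the lookup" point concrete, here is a minimal sketch that groups surface forms under a shared stem. The helper names `group_by_stem` and `strip_plural` are hypothetical, and `strip_plural` is only a stand-in for a real stemmer.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the sorted, distinct surface forms that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[stem(lowered)].add(lowered)
    return {key: sorted(forms) for key, forms in groups.items()}


def strip_plural(word):
    """Stand-in stemmer for illustration only: drop a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


vocabulary = group_by_stem(["cat", "cats", "Cats", "dog", "dogs"], strip_plural)
print(vocabulary)  # five tokens collapse to two lookup keys
```

Any callable that maps a word to its stem can be passed as `stem`, including the `stem` method of an NLTK stemmer object.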

One of the most popular stemming algorithms is the __Porter stemmer__, which was developed in 1979 and published in 1980. The cells below compare it with NLTK's Lancaster and Snowball stemmers.

In [4]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer

In [5]:
porter    = PorterStemmer()
lancaster = LancasterStemmer()
sno       = SnowballStemmer('english')

In [6]:
word_list = ["cave", "caver", "caved"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
cave                 cave                 cav                  cave                
caver                caver                cav                  caver               
caved                cave                 cav                  cave                


In [7]:
word_list = ["run", "ran", "runner", "running"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
run                  run                  run                  run                 
ran                  ran                  ran                  ran                 
runner               runner               run                  runner              
running              run                  run                  run                 


In [8]:
word_list = ["cats", "trouble", "troubling", "troubled", "troublesome"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word))) 

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
cats                 cat                  cat                  cat                 
trouble              troubl               troubl               troubl              
troubling            troubl               troubl               troubl              
troubled             troubl               troubl               troubl              
troublesome          troublesom           troublesom           troublesom          


Notice how the PorterStemmer handles these words:
- For "cats", it gives the root (stem) by simply removing the 's' after cat. This is a suffix added to cat to make it plural.
- 'trouble', 'troubling' and 'troubled' are all stemmed to 'troubl', which is not a dictionary word, because **the PorterStemmer algorithm does not follow linguistics; it applies a set of five rule steps for different cases, in phases (step by step), to generate stems**
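
The idea of rules applied in phases can be sketched with a toy suffix stripper. This is not the Porter algorithm: the rule lists and the `min_stem_len` guard below are simplified, hypothetical stand-ins, chosen only to show how ordered rewriting rules can produce a non-word such as 'troubl'.

```python
# Toy phased suffix stripper (NOT the Porter algorithm; rules are illustrative).
# Each phase is an ordered list of (suffix, replacement) pairs.
PHASES = [
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],  # phase 1: plurals
    [("ing", ""), ("ed", ""), ("e", "")],                     # phase 2: verb endings, final 'e'
]


def toy_phase_stem(word, min_stem_len=3):
    """Apply at most one matching rule per phase, in order."""
    for rules in PHASES:
        for suffix, replacement in rules:
            if not word.endswith(suffix):
                continue
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= min_stem_len:  # refuse to over-shorten the word
                word = candidate
                break  # this phase is done; move on to the next one
    return word


for w in ["cats", "trouble", "troubling", "troubled", "troublesome"]:
    print("{0:20} {1:20}".format(w, toy_phase_stem(w)))
```

The rules never consult a dictionary, so nothing prevents the result from being a string like 'troubl' or 'troublesom'. The real Porter stemmer uses more steps and stricter conditions on what remains of the word.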


In [9]:
word_list = ["argue", "argued", "argues", "arguing", "argus"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","Porter Stemmer", "lancaster Stemmer", "Snowball Stemmer"))

for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, porter.stem(word), lancaster.stem(word), sno.stem(word)))

Word                 Porter Stemmer       lancaster Stemmer    Snowball Stemmer    
argue                argu                 argu                 argu                
argued               argu                 argu                 argu                
argues               argu                 argu                 argu                
arguing              argu                 argu                 argu                
argus                argu                 arg                  argus               


### Lemmatization

Lemmatization is the process of converting a word to its base form (its lemma).

The difference between stemming and lemmatization is:

> lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, lemmatization correctly reduces ‘caring’ to ‘care’, whereas a crude suffix-stripping stemmer cuts off the ‘ing’ part and converts it to ‘car’.

    ‘Caring’ -> Lemmatization -> ‘Care’
    ‘Caring’ -> Stemming -> ‘Car’
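
The contrast can be sketched in a few lines. Both helpers are hypothetical: `crude_stem` only strips a trailing 'ing', and `LEMMA_TABLE` is a three-entry stand-in for a real lexical resource such as WordNet.

```python
# Hypothetical mini-dictionary standing in for a real lexical resource.
LEMMA_TABLE = {"caring": "care", "feet": "foot", "bats": "bat"}


def crude_stem(word):
    """Strip a trailing 'ing' with no further checks."""
    return word[:-3] if word.endswith("ing") else word


def lookup_lemma(word):
    """Return the dictionary base form, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ["caring", "feet"]:
    print(w, "-->", crude_stem(w), "(crude stem) /", lookup_lemma(w), "(lemma)")
```

Suffix stripping cannot recover an irregular form such as 'feet', while a dictionary lookup can. The price is that the lemmatizer only knows the words its dictionary contains.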
    
Ways to lemmatize:

- WordNet Lemmatizer
- spaCy Lemmatizer
- TextBlob
- CLiPS Pattern
- Stanford CoreNLP
- Gensim Lemmatizer
- TreeTagger

In [10]:
from nltk.stem import WordNetLemmatizer

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

In [11]:
word_list = ["friend", "friendship", "friends", "friendships","stabilize","destabilize","misunderstanding","railroad","moonlight","football"]

print("{0:20} {1:20}".format("Word","WordNetLemmatizer"))

for word in word_list:
    print("{0:20} {1:20} ".format(word, lemmatizer.lemmatize(word)))

Word                 WordNetLemmatizer   
friend               friend               
friendship           friendship           
friends              friend               
friendships          friendship           
stabilize            stabilize            
destabilize          destabilize          
misunderstanding     misunderstanding     
railroad             railroad             
moonlight            moonlight            
football             football             


In [12]:
# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"))

print(lemmatizer.lemmatize("are"))

print(lemmatizer.lemmatize("feet"))

bat
are
foot


In [13]:
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']


In [14]:
for w in word_list:
    print(w, '-->', lemmatizer.lemmatize(w) )

The --> The
striped --> striped
bats --> bat
are --> are
hanging --> hanging
on --> on
their --> their
feet --> foot
for --> for
best --> best


Notice that it didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’, as one would expect. The reason is that lemmatize() treats every word as a noun unless told otherwise.

This can be corrected by providing the correct ‘part-of-speech’ tag (POS tag) as the second argument to lemmatize().

In [15]:
print(lemmatizer.lemmatize("stripes", 'v')) 
print(lemmatizer.lemmatize("stripes", 'n'))  

strip
stripe
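
Supplying the tag by hand does not scale to whole sentences. One possible approach is sketched below, assuming the tokens have already been tagged with Penn Treebank tags. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, and the tags in the example are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for every (word, tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-written tags for illustration only.
tagged = [("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG"), ("best", "JJS")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK, the pairs could come from `nltk.pos_tag(word_list)` and `lemmatizer.lemmatize` could be passed as the `lemmatize` argument. This assumes the tagger model has been downloaded.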
