# Lemmatization

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.


For example, runs, running, and ran are all forms of the word run, so run is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where valid words are required.
    


In [10]:
import nltk
from nltk.stem import WordNetLemmatizer
# Requires the NLTK tokenizer and WordNet data; fetch them once with nltk.download()
  

In [11]:
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

sentence_words = nltk.word_tokenize(sentence)
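punctuations = "?:!.,;"

# Remove punctuation tokens by building a new list
# (removing items from a list while iterating over it can skip elements)
sentence_words = [word for word in sentence_words if word not in punctuations]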
        
sentence_words

['He',
 'was',
 'running',
 'and',
 'eating',
 'at',
 'same',
 'time',
 'He',
 'has',
 'bad',
 'habit',
 'of',
 'swimming',
 'after',
 'playing',
 'long',
 'hours',
 'in',
 'the',
 'Sun']

In [13]:
print("{0:20}{1:20}".format("Original Word", "Lemmatized Word"))

# {0} is the original word, {1} is its lemma
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))


Original Word       Lemmatized Word
He                  He
was                 wa
running             running
and                 and
eating              eating
at                  at
same                same
time                time
He                  He
has                 ha
bad                 bad
habit               habit
of                  of
swimming            swimming
after               after
playing             playing
long                long
hours               hour
in                  in
the                 the
Sun                 Sun


In the above output, you may wonder why verb forms such as running and eating have not been reduced to their roots. This is because the words are given without context, and the lemmatizer treats every word as a noun by default. You need to provide the context in which you want to lemmatize, that is, the part of speech (POS).

In [15]:
print("{0:20}{1:20}".format("Original Word", "Lemmatized Word"))

# With the pos parameter: treat every word as a verb
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Original Word       Lemmatized Word
He                  He
was                 be
running             run
and                 and
eating              eat
at                  at
same                same
time                time
He                  He
has                 have
bad                 bad
habit               habit
of                  of
swimming            swim
after               after
playing             play
long                long
hours               hours
in                  in
the                 the
Sun                 Sun
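
Passing `pos='v'` for every word treats nouns and adjectives as verbs too. One common alternative is to tag each word first and translate the tag into the letter the lemmatizer expects. The sketch below assumes Penn Treebank tags as input; the helper names `to_wordnet_pos` and `lemmatize_tagged` and the hand-written tags are illustrative and not part of NLTK.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default


def lemmatize_tagged(tagged_words, lemmatize):
    """Apply a lemmatize(word, pos=...) callable to (word, tag) pairs."""
    return [lemmatize(word, pos=to_wordnet_pos(tag)) for word, tag in tagged_words]


# Hand-written tags for illustration; a POS tagger would normally supply them.
tagged = [("He", "PRP"), ("was", "VBD"), ("running", "VBG"), ("hours", "NNS")]
print([to_wordnet_pos(tag) for _, tag in tagged])
```

With NLTK and its tagger data installed, something like `lemmatize_tagged(nltk.pos_tag(sentence_words), wordnet_lemmatizer.lemmatize)` should lemmatize each word according to its own part of speech.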


# Stemming or lemmatization?

So when should you use which? If speed is the priority, stemming is the better choice, because a lemmatizer looks words up in a corpus, which costs time and processing. The choice between a stemmer and a lemmatizer depends on the application you are working on. If you are building a language application in which valid words matter, use lemmatization, since it matches words against a corpus to find their root forms.
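
The trade-off can be seen in a toy comparison. This is only a sketch: `toy_stem`, `toy_lemmatize`, and `LEMMA_TABLE` are hypothetical stand-ins, far simpler than a real stemmer or the WordNet lemmatizer, but they show why rule-based stripping is cheap and why a lookup returns valid words.

```python
# Hypothetical lookup table; a real lemmatizer consults a full lexicon such as WordNet.
LEMMA_TABLE = {"ran": "run", "running": "run", "runs": "run",
               "studies": "study", "was": "be"}


def toy_stem(word):
    """Strip the first matching suffix; fast, but the result may not be a real word."""
    for suffix in ("ing", "ies", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


print("{0:12}{1:12}{2:12}".format("Word", "Stem", "Lemma"))
for word in ["running", "studies", "ran", "was"]:
    print("{0:12}{1:12}{2:12}".format(word, toy_stem(word), toy_lemmatize(word)))
```

The stemmer needs no data beyond its suffix list, while the lemmatizer is only as good as the table it searches.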

# Information Retrieval (IR) Environments

Stemming and lemmatization help map documents to common topics and index them for search, which matters as the number of documents grows very large. Query expansion is a term used in search environments: when a user enters a query, the query is expanded or enhanced so that it matches additional documents.
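
A minimal sketch of the idea follows. It does not expand the query into extra terms; instead it normalizes both the documents and the query to the same root forms, which has a similar effect of matching additional documents. The `LEMMAS` table, the sample documents, and the function names are hypothetical, and the table stands in for a real stemmer or lemmatizer.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real stemmer or lemmatizer.
LEMMAS = {"runs": "run", "running": "run", "ran": "run", "shoes": "shoe"}


def normalize(text):
    """Lowercase, split on whitespace, and map each token to its root form."""
    return [LEMMAS.get(token, token) for token in text.lower().split()]


def build_index(documents):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every normalized query term."""
    postings = [index.get(term, set()) for term in normalize(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "She runs every morning", 2: "Running shoes on sale", 3: "He ran home"}
index = build_index(docs)
print(sorted(search(index, "running")))        # all three documents mention a form of "run"
print(sorted(search(index, "running shoes")))  # only document 2 also mentions shoes
```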

# Sentiment Analysis

Sentiment analysis is the analysis of people's reviews and comments about something. It is widely used to analyze product reviews on online retail sites. Stemming and lemmatization are used as part of the text-preparation step before the text is analyzed.

# Document Clustering

Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. One example is the clustering of web documents for search engines. Before clustering methods are applied, each document is prepared through tokenization, removal of stop words, and then stemming or lemmatization, which reduces the number of tokens that carry the same information and hence speeds up the whole process. After this pre-processing, features are computed from the frequency of each token, and then the clustering methods are applied.
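
The pre-processing steps described above can be sketched as follows. The stop-word list, the lemma table, and the function name `term_frequencies` are hypothetical and kept tiny for illustration; in practice NLTK's tokenizer, stop-word corpus, and lemmatizer would take their place, and the resulting counts would be passed to a clustering method.

```python
import re
from collections import Counter

# Hypothetical stop-word list and lemma table, kept tiny for illustration.
STOP_WORDS = {"the", "a", "is", "are", "on", "and"}
LEMMAS = {"cats": "cat", "sat": "sit", "sitting": "sit", "mats": "mat"}


def term_frequencies(text):
    """Tokenize, drop stop words, lemmatize, and count the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]
    return Counter(kept)


docs = ["The cats sat on the mat.", "A cat is sitting on the mats."]
for doc in docs:
    print(term_frequencies(doc))
```

Both sentences reduce to the same three tokens, so they produce identical feature counts even though they share few surface forms.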