## Stemming & Lemmatization
In NLP tasks, **stemming** refers to the process of reducing a word to its base or root form, known as the stem. This is done by removing suffixes or prefixes attached to the word. The purpose of stemming is to reduce the number of unique words in a text corpus, which can help to simplify analysis and improve efficiency. Stemming is often used in text mining, search engines, and information retrieval systems. However, stemming can lead to errors or loss of meaning: the stem is not always a real word (for example, "ability" becomes "abil"), and words that are unrelated in meaning can end up sharing the same stem.

In NLP tasks, **lemmatization** refers to the process of reducing a word to its base or dictionary form, known as the lemma. This is done by considering the context and morphological analysis of the word. Unlike stemming, which simply removes suffixes and prefixes, lemmatization takes into account the part of speech of the word and its inflectional forms. The purpose of lemmatization is to reduce the number of unique words in a text corpus while preserving the meaning of the words. Lemmatization is often used in natural language processing tasks such as text classification, sentiment analysis, and machine translation. However, lemmatization can be computationally expensive compared to stemming.

Both are important steps in the preprocessing stage when building an NLP application.

In some cases, to get the base form of a word we just apply fixed rules, such as removing the 'ing' and 'able' suffixes. This process is called **Stemming.**
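To make the idea of fixed rules concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. The `SUFFIXES` tuple, the `naive_stem` function, and the `min_stem_len` parameter are hypothetical names chosen for illustration; real stemmers such as Porter's apply many more rules and conditions.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ("able", "ing", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for word in ["eating", "eats", "adjustable", "ate", "sing"]:
    print(word, "|", naive_stem(word))
# eating | eat
# eats | eat
# adjustable | adjust
# ate | ate
# sing | sing
```

Because a function like this knows nothing about the language, it has no way to turn an irregular form such as "ate" into "eat".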

But in some cases the fixed rules don't work well and we need further processing: we have to use knowledge of the language to find the base form of the word. This base form is called the 'lemma', and the process is called **Lemmatization.**
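As a rough illustration of what "knowledge of the language" can mean, the sketch below uses a tiny hand-written lookup table keyed by word and part of speech. The table `IRREGULAR_LEMMAS`, the function `toy_lemmatize`, and the tag names are assumptions made for this example only; real lemmatizers rely on much larger dictionaries and trained models.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
IRREGULAR_LEMMAS = {
    ("ate", "VERB"): "eat",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return IRREGULAR_LEMMAS.get(key, word.lower())

print(toy_lemmatize("ate", "VERB"))       # eat
print(toy_lemmatize("meeting", "NOUN"))   # meeting
print(toy_lemmatize("meeting", "VERB"))   # meet
```

The same surface form can map to different lemmas depending on its part of speech, which is something a purely rule-based suffix stripper cannot express.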

<img src = "img.jpg" width = "600px" heigh = "600px"></img>

**Stemming** doesn't always give the correct base form of a word, which is why we need **Lemmatization.**

<img src = "img1.jpg" width = "600px" heigh = "600px"></img>

* For **Stemming** we'll use **NLTK**, because **spaCy** doesn't support stemming; it only supports **Lemmatization.** **NLTK** supports both **Stemming** and **Lemmatization.**

In [1]:
# First, let's import the NLTK and spaCy libraries:
import nltk
import spacy

In [2]:
# Now we import the 'PorterStemmer' class from NLTK and create an object of this class.
# NLTK also provides other stemmers, such as 'SnowballStemmer', which you can use instead.
from nltk.stem import PorterStemmer
# from nltk.stem import SnowballStemmer
stemmer = PorterStemmer()

In [3]:
# Here we have some words, and we want to print the stem of each word using the stemmer.
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))   # It applies a fixed set of rules. It will make some mistakes because it
                                           # doesn't have any knowledge of the language.

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
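The `SnowballStemmer` mentioned in the comments above can be tried in the same way. The sketch below is one possible way to compare the two stemmers side by side; the word list is only illustrative, and the printed results are not shown here because the two algorithms can disagree on some words.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # Snowball needs the language name

# Illustrative word list (an assumption, not taken from a benchmark).
words = ["eating", "eats", "ability", "meeting", "generously"]

for word in words:
    # Print the original word, then the Porter stem, then the Snowball stem.
    print(word, "|", porter.stem(word), "|", snowball.stem(word))
```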


In [5]:
# Now let's use lemmatization in spaCy.
# First we load the pre-trained English pipeline. Then we define a text and apply lemmatization to it.
nlp = spacy.load("en_core_web_sm")

# doc = nlp("Mando talked for 3 hours although talking isn't his thing")   # another text you can try
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_) # Lemmatization is based on the pre-trained English model, so it can also
                                      # make some errors.

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


In [6]:
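# 'token.lemma_' is the lemma as a string; 'token.lemma' is the integer ID that spaCy stores for that string.
# Tokens with the same lemma share the same ID:
nlp = spacy.load("en_core_web_sm")

# doc = nlp("Mando talked for 3 hours although talking isn't his thing")   # another text you can try
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")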
for token in doc:
    print(token, " | ", token.lemma_, " | ", token.lemma)

eating  |  eat  |  9837207709914848172
eats  |  eat  |  9837207709914848172
eat  |  eat  |  9837207709914848172
ate  |  eat  |  9837207709914848172
adjustable  |  adjustable  |  6033511944150694480
rafting  |  raft  |  7154368781129989833
ability  |  ability  |  11565809527369121409
meeting  |  meeting  |  14798207169164081740
better  |  well  |  4525988469032889948
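The integer IDs above are hashes of the lemma strings. As a small sketch, assuming spaCy's `StringStore` class (the kind of table exposed as `nlp.vocab.strings`), the mapping between a string and its ID can be explored without loading a model. The variable names `store` and `eat_id` are chosen for this example.

```python
from spacy.strings import StringStore

store = StringStore()            # an empty string-to-hash table
eat_id = store.add("eat")        # add() returns the hash ID of the string

print(eat_id)                    # a large integer ID
print(store[eat_id])             # look up the string from its ID
print(store["eat"] == eat_id)    # look up the ID from the string
```

Storing lemmas as integers lets spaCy compare and store them cheaply, while `token.lemma_` converts the ID back to readable text.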


In [7]:
# To see another example:
doc = nlp("Stemming is the process of reducing a word to its stem that affixes to suffixes and prefixes.")
for token in doc:
    print(token, " | ", token.lemma_)

Stemming  |  stemming
is  |  be
the  |  the
process  |  process
of  |  of
reducing  |  reduce
a  |  a
word  |  word
to  |  to
its  |  its
stem  |  stem
that  |  that
affixes  |  affix
to  |  to
suffixes  |  suffix
and  |  and
prefixes  |  prefix
.  |  .


### Customizing lemmatizer

In [8]:
# Sometimes we might want to customize the behavior of the pipeline. The pipeline we loaded has the following
# components:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
# Of these components, 'attribute_ruler' assigns attributes to particular tokens, and you can customize it.
# In the following sentence we have slang words such as 'Bro' and 'Brah', which mean 'Brother'. By default the language
# model doesn't understand this slang, so these words are not mapped to the base form 'Brother'.

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [10]:
# Now we want to customize the model. We know that 'Bro' and 'Brah' mean 'Brother', so we can customize the lemmas
# by using 'attribute_ruler'.

ar = nlp.get_pipe('attribute_ruler')   # Get the 'attribute_ruler' component from the pipeline.

# Add a custom rule: tokens whose exact text is 'Bro' or 'Brah' get the lemma 'Brother'.
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [11]:
# As we can see, the words 'Bro' and 'Brah' now get the lemma 'Brother'.