## Stemming
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. The stem is not always a valid dictionary word, which distinguishes it from a lemma. Stemming is an important step in natural language understanding (NLU) and natural language processing (NLP).

In [1]:
## Classification problem:
## decide whether a product comment is a positive review or a negative review.
## The dataset consists of reviews.
## Reviews can contain words like eating, eat, eaten => "eat" is the root word (stem)
## [going, gone, goes] ---> go => "go" is the stem

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In the comments above, `eat` and `go` are the stem words.

So __go__ is the stem for all of `[going, gone, goes]`. There is no need to keep every such variant, because each distinct word is represented by its own vector and therefore increases the number of input features (this is covered later under text pre-processing). It is better to keep the single word __go__.
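The effect on the feature count can be sketched in plain Python. The `STEMS` table and the `vocabulary` helper below are hypothetical, written only for this illustration; the table encodes the idealized mapping described above rather than the output of any real stemmer.

```python
# Hypothetical hand-written stem table (an assumption for illustration);
# a real stemmer derives its mappings from rules instead.
STEMS = {
    "going": "go", "gone": "go", "goes": "go",
    "eating": "eat", "eats": "eat", "eaten": "eat",
}

def vocabulary(tokens, stems=None):
    """Return the sorted distinct tokens, optionally mapped to stems first."""
    if stems is not None:
        tokens = [stems.get(tok, tok) for tok in tokens]
    return sorted(set(tokens))

tokens = ["going", "gone", "goes", "go", "eating", "eats", "eaten", "eat"]

raw_vocab = vocabulary(tokens)             # every surface form is a feature
stemmed_vocab = vocabulary(tokens, STEMS)  # variants collapse to one stem

print(len(raw_vocab))      # 8
print(stemmed_vocab)       # ['eat', 'go']
```

Eight distinct surface forms collapse to two features once the variants share a stem, which is the reduction the paragraph above describes.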

Finding this stem word is what __stemming__ does.

Stemming techniques covered in this section: `PorterStemmer`, `RegexpStemmer` and `SnowballStemmer`.

### PorterStemmer

In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemming=PorterStemmer()

The `stemming` object is then applied to each and every word in the list.

In [4]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [5]:
stemming.stem('congratulations')

'congratul'

Observe that for some words the stem is not a valid English word, so the meaning of the word is lost, as in the cases below:
```
history---->histori
congratulations---->congratul
``` 

In [6]:
## Stemming works for some words like the one below, but for other words it does not work correctly.
stemming.stem("sitting")

'sit'

### RegexpStemmer class
NLTK provides the `RegexpStemmer` class, which makes it easy to implement regular-expression stemmers. It takes a single regular expression and removes every part of the word that matches it, typically a prefix or a suffix. Let us see an example.

In [7]:
from nltk.stem import RegexpStemmer

__RegexpStemmer__: A stemmer that uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.
```
>>> from nltk.stem import RegexpStemmer
>>> st = RegexpStemmer('ing$|s$|e$|able$', min=4)
>>> st.stem('cars')
'car'
>>> st.stem('mass')
'mas'
>>> st.stem('was')
'was'
>>> st.stem('bee')
'bee'
>>> st.stem('compute')
'comput'
>>> st.stem('advisable')
'advis'
```

In [8]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

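This pattern removes a trailing _ing_, _s_, _e_ or _able_ from a word. The `$` anchors each alternative to the end of the word, and `min=4` leaves words shorter than four characters unchanged.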
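The behaviour can be approximated with the standard `re` module. This is a minimal sketch that assumes the stemmer simply deletes every regex match from words that meet the minimum length, which is consistent with the documented examples above. The function name `regex_stem` and its `min_length` parameter are hypothetical, not part of NLTK.

```python
import re

def regex_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`; skip words that are too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

PATTERN = r"ing$|s$|e$|able$"

for w in ["cars", "was", "eating", "advisable"]:
    print(w + "---->" + regex_stem(w, PATTERN, min_length=4))

# cars---->car
# was---->was          (shorter than 4 characters, left untouched)
# eating---->eat
# advisable---->advis
```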

In [9]:
reg_stemmer.stem('eating')

'eat'

In [10]:
reg_stemmer.stem('ingeating')

'ingeat'

Now let us see what happens when the `$` is removed from `ing`, so that the affix is given without an anchor.

In [11]:
reg_stemmer=RegexpStemmer('ing|s$|e$|able$', min=4)

Observe below that __ing__ is now removed wherever it occurs, both at the beginning and at the end of the word.

In [12]:
reg_stemmer.stem('ingeating')

'eat'

In [13]:
reg_stemmer.stem('inggo')

'go'

What happens if we put `$` at the beginning?

In [14]:
reg_stemmer=RegexpStemmer('$ing|s$|e$|able$', min=4)

In [15]:
reg_stemmer.stem('ingeating')

'ingeating'

Nothing happens here because __$__ matches only at the end of the word, so `$ing` can never match. To remove characters from the beginning we need __^__ instead.

In [16]:
reg_stemmer=RegexpStemmer('^ing|s$|e$|able$', min=4)

In [17]:
reg_stemmer.stem('ingeating')

'eating'
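The four anchor variants tried above can be compared side by side with plain `re.sub`. This is only a sketch using the standard library; it isolates the `ing` alternative so the effect of each anchor is easier to see.

```python
import re

WORD = "ingeating"

# Each pattern differs only in where (or whether) `ing` is anchored.
PATTERNS = [r"ing$", r"ing", r"$ing", r"^ing"]

results = {p: re.sub(p, "", WORD) for p in PATTERNS}

for pattern, stem in results.items():
    print(f"{pattern!r:<8} ----> {stem}")

# 'ing$'  removes only the trailing ing   -> ingeat
# 'ing'   removes every ing               -> eat
# '$ing'  can never match                 -> ingeating
# '^ing'  removes only the leading ing    -> eating
```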

### Snowball Stemmer
The Snowball stemmer is also known as the Porter2 stemming algorithm. It is an improved version of the Porter stemmer that fixes some of its issues.

In [18]:
from nltk.stem import SnowballStemmer

In [19]:
snowballsstemmer=SnowballStemmer('english')

In [20]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


Observe above that `history` still does not give a correct result; the output is the same as with `PorterStemmer`. The code below applies `PorterStemmer` to two words and then applies `snowballsstemmer` to the same two words.

In [21]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [22]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

So `snowballsstemmer` is used for text pre-processing: it cleans the data so that the text can be converted into vectors efficiently.

However hard we try, stemming cannot always recover the root word (e.g. `go`) when the form of the word changes, as shown below. This is one of the major disadvantages of stemming. For use cases such as chatbots, stemming is therefore not suitable and we have to use lemmatization instead.

Lemmatization relies on a dictionary of root words, so whichever word you pass in, it returns a grammatically valid word for it.

In [23]:
snowballsstemmer.stem('goes')

'goe'

In [24]:
stemming.stem('goes')

'goe'
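The difference between rule-based stripping and dictionary lookup can be illustrated with a toy sketch. The functions `suffix_stem` and `lookup_lemma` and the tiny `LEMMAS` table are hypothetical stand-ins written for this illustration; they are not NLTK's stemmer or lemmatizer, and a real lemmatizer uses a far larger lexicon.

```python
import re

# Hypothetical miniature lemma dictionary (an assumption for illustration).
LEMMAS = {"goes": "go", "going": "go", "gone": "go", "went": "go"}

def suffix_stem(word):
    """Crude suffix stripping, standing in for a rule-based stemmer."""
    return re.sub(r"(ing|s)$", "", word)

def lookup_lemma(word):
    """Dictionary lookup that falls back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["goes", "going", "gone", "went"]:
    print(w, "| stem:", suffix_stem(w), "| lemma:", lookup_lemma(w))

# goes  | stem: goe  | lemma: go
# going | stem: go   | lemma: go
# gone  | stem: gone | lemma: go
# went  | stem: went | lemma: go
```

Suffix rules alone cannot connect irregular forms such as `gone` or `went` to `go`, whereas a lookup table can, at the cost of having to maintain the dictionary.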

#### Additional Reading:
- [Analytics Vidhya Stemming](https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-stemming-in-natural-language-processing/)
- [Stemming vs Lemmatization](https://www.datacamp.com/tutorial/stemming-lemmatization-python)