## Stemming
Stemming is the process of reducing a word to its stem, the base form to which affixes such as prefixes and suffixes attach. Unlike a lemma (the dictionary root of a word), a stem does not have to be a valid word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

### Porter Stemmer

- Created by: Martin Porter (1980)
- Approach: Rule-based with a specific set of suffix-stripping rules
- Language: English only
- Speed: Fast
- Accuracy: Good but can be aggressive (over-stemming)
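
To make "suffix-stripping rules" concrete, here is a minimal sketch of a single step of the algorithm, step 1a (plural handling) as described in Porter's 1980 paper. The function name `porter_step_1a` is made up for this illustration, and the full algorithm has several more steps, so this is not a replacement for NLTK's `PorterStemmer` (whose default mode may also apply some extensions to the original rules).

```python
# Step 1a rules from the 1980 paper, tried in order; the first matching suffix wins.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (illustrative helper)."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w + "---->" + porter_step_1a(w))
```

The rule order matters: `ss` is listed before `s` so that a word like "caress" is left alone instead of losing its final letter.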


### Regexp Stemmer

- Approach: Uses regular expressions to remove suffixes
- Customization: Highly customizable - you define your own regex patterns
- Language: Any language (depends on your regex rules)
- Speed: Very fast
- Accuracy: Depends entirely on how well you craft the regex patterns

### Snowball Stemmer

- Created by: Martin Porter (improved version of Porter)
- Approach: Revised rule set (the English stemmer is also called Porter2), defined in the Snowball stemmer-definition language
- Language: Multi-language support (English, French, German, Spanish, etc.)
- Speed: Slightly slower than Porter but still fast
- Accuracy: Generally more accurate, less over-stemming


In [18]:
## Classification problem:
## is a product comment a positive review or a negative review?
## Reviews ----> [eating, eats, eaten] ---> eat, [going, gone, goes] ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer

In [19]:
from nltk.stem import PorterStemmer

In [20]:
stemming=PorterStemmer()

In [21]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [22]:
stemming.stem('congratulations')

'congratul'

In [23]:
stemming.stem("sitting")

'sit'

### RegexpStemmer class
NLTK provides the RegexpStemmer class, which makes it easy to build stemmers based on regular expressions. It takes a single regular expression and removes every substring that matches it, so the anchors in the pattern decide whether a prefix or a suffix is stripped. The optional min argument leaves words shorter than that length unchanged. Let us see an example.

In [30]:
from nltk.stem import RegexpStemmer

In [31]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [32]:
reg_stemmer.stem('eating')

'eat'

In [33]:
reg_stemmer.stem('ingeating')

'ingeat'
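
The behaviour shown above can be approximated with the standard `re` module alone. The sketch below is an assumption about how such a stemmer works, not NLTK's source: `regexp_stem` and `min_length` are names invented for this example.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`, unless the word is shorter than `min_length`."""
    if len(word) < min_length:
        return word  # too short: leave untouched
    return re.sub(pattern, "", word)

pattern = "ing$|s$|e$|able$"
for w in ["eating", "ingeating", "was", "cars", "table"]:
    print(w + "---->" + regexp_stem(w, pattern, min_length=4))
```

Because every alternative ends in `$`, only a trailing match is removed, which is why the leading "ing" in "ingeating" survives. The same pattern reduces "table" to "t", a reminder that the quality of the result depends entirely on the pattern.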

### Snowball Stemmer
The Snowball stemmer for English is also known as the Porter2 stemming algorithm. It is a revised version of the Porter stemmer that fixes some of its issues.

In [34]:
from nltk.stem import SnowballStemmer

In [35]:
snowballsstemmer=SnowballStemmer('english')

In [36]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [46]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [47]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [48]:
stemming.stem("sittingly")

'sittingli'

In [None]:
snowballsstemmer.stem('sittingly')

'sit'

## WordNet Lemmatizer
Lemmatization is similar to stemming, but its output is a ‘lemma’: the dictionary root of a word rather than a stem. Where a stem may be a truncated string, a lemma is a valid word with the same core meaning.

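NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. The class uses the morphy() function of the WordNet CorpusReader class to find a lemma. The part of speech defaults to noun, and a word that cannot be found in WordNet is returned unchanged.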

### Core Difference
- Stemming: Crude chopping of word endings
- Lemmatization: Intelligent reduction to dictionary root form

#### Stemming

- Method: Rule-based suffix removal
- Knowledge: Uses algorithmic rules only
- Speed: Fast
- Output: Stem (may not be a real word)

#### Lemmatization

- Method: Morphological analysis + dictionary lookup
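- Knowledge: Uses vocabulary and grammar (the part of speech, which you pass in as the pos argument)
- Speed: Slower
- Output: Lemma (a valid dictionary word when the lookup succeeds; otherwise the input is returned unchanged)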
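
The contrast can be sketched with a toy example that needs no corpus. Everything here is hypothetical and for illustration only: the two small tables stand in for a real lexicon such as WordNet, and `naive_stem` and `toy_lemmatize` are not NLTK functions.

```python
# Hypothetical toy data; a real lemmatizer relies on a full lexicon.
IRREGULAR_FORMS = {"geese": "goose", "mice": "mouse", "went": "go"}
VOCABULARY = {"goose", "mouse", "go", "program", "history"}

def naive_stem(word):
    """Chop a fixed suffix without checking whether the result is a real word."""
    for suffix in ("ing", "s", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Try irregular forms first, then suffix removal validated against the vocabulary."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for suffix in ("es", "s"):
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in VOCABULARY:
            return candidate
    return word  # unknown words come back unchanged

for w in ["geese", "went", "goes", "programs", "history"]:
    print(w + "---->" + naive_stem(w) + " | " + toy_lemmatize(w))
```

The stemmer only looks at the characters of the word, so it can produce strings that are not words. The lookup-based function accepts a candidate only if it appears in the vocabulary, which is why it is slower but returns real words.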

In [49]:
## Use cases: Q&A, chatbots, text summarization
from nltk.stem import WordNetLemmatizer

In [50]:
lemmatizer=WordNetLemmatizer()

In [65]:
'''
POS tags:
noun      - n
verb      - v
adjective - a
adverb    - r
'''
print(stemming.stem("geese"))
print(snowballsstemmer.stem('geese'))
print(lemmatizer.lemmatize("geese",pos='n'))

gees
gees
goose


In [69]:
print(stemming.stem("mice"))
print(snowballsstemmer.stem('mice'))
print(lemmatizer.lemmatize("mice",pos='n'))

mice
mice
mouse


In [66]:
print(stemming.stem("goes"))
print(snowballsstemmer.stem('goes'))
print(lemmatizer.lemmatize("goes",pos='v'))

goe
goe
go


In [68]:
print(stemming.stem("went"))
print(snowballsstemmer.stem('went'))
print(lemmatizer.lemmatize("went",pos='v'))

went
went
go


In [56]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [57]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [60]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [61]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [62]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')