## Stemming
Stemming reduces a word to its stem by stripping affixes (mostly suffixes, and in some stemmers prefixes). The stem does not have to be a dictionary word, which is what distinguishes it from a lemma. Stemming is a common preprocessing step in natural language processing (NLP) and natural language understanding (NLU).

In [None]:
# Classification problem:
# decide whether a product review is positive or negative.
# Inflected forms in reviews (eating, eats, eaten; going, gone, goes)
# should ideally map to one root (eat; go).

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

# 🔪 Stemming - Reducing Words to Root Form

**Goal**: Reduce words to their base/root form to decrease vocabulary size

## Why Stemming?
- Reduces feature space (eating, eats → eat)
- Improves model efficiency (fewer distinct features to learn and store)
- Useful for classification tasks
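
To make the feature-space point concrete, here is a small sketch that counts distinct tokens before and after stemming. The stems in `porter_stems` are hard-coded from the Porter output shown later in this notebook rather than computed, so the snippet runs without NLTK; the variable names are illustrative only.

```python
# Stems copied from the Porter output shown later in this notebook
# (hard-coded so the sketch has no dependencies).
porter_stems = {
    "eating": "eat", "eats": "eat", "eaten": "eaten",
    "writing": "write", "writes": "write",
    "programming": "program", "programs": "program",
    "history": "histori", "finally": "final", "finalized": "final",
}

vocab_before = set(porter_stems)           # distinct surface forms
vocab_after = set(porter_stems.values())   # distinct stems

print(len(vocab_before), "->", len(vocab_after))
print(sorted(vocab_after))
```

Ten word forms collapse to six features. Note that `eaten` keeps its own entry, because a rule-based stemmer has no rule for this irregular form.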

## Trade-off:
✅ **Fast** - Rule-based, no dictionary lookup  
❌ **Inaccurate** - May produce non-words (history → histori)

**When to use**: Spam detection, sentiment analysis, search engines

In [None]:
from nltk.stem import PorterStemmer

## 1️⃣ Porter Stemmer

**Most common stemmer** - One of the oldest and most widely used, published by Martin Porter in 1980

**How it works**: Applies a series of rules to remove suffixes

**Problem**: Can produce non-English words
- congratulations → congratul ❌
- history → histori ❌
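
As a rough illustration of what "a series of rules" means, the sketch below implements only two steps in the spirit of the original Porter paper (Step 1a for plurals, Step 1c for a final `y`). It is not the full algorithm, NLTK's implementation may differ in details, and the function names `step_1a` and `step_1c` are made up for this example.

```python
def step_1a(word):
    """Plural handling, following Step 1a of the original Porter paper."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word


def step_1c(word):
    """Step 1c: a final 'y' becomes 'i' if the rest of the word has a vowel.

    Simplification: only a, e, i, o, u are treated as vowels here.
    """
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1] + "i"
    return word


for w in ["caresses", "ponies", "cats", "history", "sky"]:
    print(w, "->", step_1c(step_1a(w)))
```

The `y` to `i` rule is why `history` comes out as `histori`: the rules operate on spelling alone and never check whether the result is a real word.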

In [None]:
stemming=PorterStemmer()

In [5]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [6]:
stemming.stem('congratulations')

'congratul'

In [7]:
stemming.stem("sitting")

'sit'

## 2️⃣ Regex Stemmer

**Custom rule-based stemming** using regular expressions

**Use case**: When you need domain-specific stemming control

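**How it works**: You define the patterns to remove (`ing$`, `s$`, `e$`, `able$`). The `$` anchors each pattern to the end of the word, and `min=4` leaves words shorter than four characters untouched.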

**Advantage**: Full control over what gets removed

In [8]:
from nltk.stem import RegexpStemmer

In [9]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [10]:
reg_stemmer.stem('eating')

'eat'

In [11]:
reg_stemmer.stem('ingeating')

'ingeat'
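
The two results above can be reproduced with the standard `re` module. This is a minimal sketch of the behaviour the outputs suggest (a length check followed by a regex substitution), not NLTK's actual source; `regex_stem` and `min_length` are hypothetical names.

```python
import re


def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Remove matches of `pattern`; leave words shorter than `min_length` alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


for w in ["eating", "ingeating", "readable", "bus", "sing"]:
    print(w, "->", repr(regex_stem(w)))
```

Because of the `$` anchor, the leading `ing` in `ingeating` survives. The flip side of full control is that nothing protects short words: `sing` is four characters long, passes the length check, and is cut down to `s`.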

## 3️⃣ Snowball Stemmer (Recommended! ✅)

**Improved version of Porter Stemmer** (its English algorithm is also known as Porter2)

**Why better**:
- Revised rules that fix several known Porter errors
- Supports multiple languages
- Better word forms (fairly → fair instead of fairli)

**Recommendation**: Use this instead of Porter for better results!
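
The multi-language point can be checked directly. The snippet below assumes NLTK is installed; `SnowballStemmer.languages` lists what the installed version supports, and the German word is just an illustrative input.

```python
from nltk.stem import SnowballStemmer

# Languages supported by the installed NLTK version.
print(SnowballStemmer.languages)

# Pick a non-English language by name and stem a word with it.
german = SnowballStemmer("german")
print(german.stem("Autobahnen"))
```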

In [12]:
from nltk.stem import SnowballStemmer

In [13]:
snowballsstemmer=SnowballStemmer('english')

In [14]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [15]:
# stemming is the Porter stemmer created earlier, used here for comparison
stemming.stem("fairly"), stemming.stem("sportingly")

('fairli', 'sportingli')

In [16]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [17]:
snowballsstemmer.stem('goes')

'goe'

In [18]:
# Porter gives the same result: neither stemmer maps the irregular form 'goes' to 'go'
stemming.stem('goes')

'goe'

---

## ✅ Checkpoint: What You Learned

- ✅ What stemming is and why it's needed
- ✅ Porter Stemmer (most common, but inaccurate)
- ✅ Regex Stemmer (custom rules)
- ✅ Snowball Stemmer (best choice!) ⭐
- ✅ Trade-off: Speed vs Accuracy

---

## 🎯 Next Step

Stemming is fast but can produce non-words. Want better accuracy?

**Next Notebook**: `3-Lemmatization-+Text+Preprocessing.ipynb`

**What you'll learn**: Dictionary-based lemmatization, which returns valid English words!

---

💡 **Key Takeaway**: Use Snowball for stemming, and prefer lemmatization when accuracy matters more than speed!