# Stemming


Stemming reduces a word to its stem by stripping affixes, usually suffixes. For example, the words **"playing," "played," and "plays"** can all be reduced to **"play."** This lets a program treat related word forms as the same token. The stem is not always a dictionary word, as some of the outputs below show.

In [1]:
# Example use case: a classification problem
# Is a product review positive or negative?
# Reviews contain several forms of one word: eating, eat, eaten

In [2]:
words = ["running", "played", "playing", "easily", "happier", "flying", "studies", "studying", "cats", "dogs"]


## Porter Stemmer

In [3]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [6]:
for i in words:
    print(f'{i}-->{stemming.stem(i)}')

running-->run
played-->play
playing-->play
easily-->easili
happier-->happier
flying-->fli
studies-->studi
studying-->studi
cats-->cat
dogs-->dog


In [7]:
# check another example
stemming.stem('Congratulations')

'congratul'

In [8]:
# check another example
stemming.stem('Seating')

'seat'

## RegexpStemmer

A **RegexpStemmer** stems words by deleting whatever part of the word matches a regular expression you supply, typically a set of common suffixes. Words shorter than `min` characters are returned unchanged. It is less sophisticated than stemmers like PorterStemmer, but it works for basic cases.

In [10]:
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [13]:
reg_stemmer.stem('eating')

'eat'

In [15]:
# 'ing' is removed only at the end of the word (the pattern is anchored with $)
reg_stemmer.stem('ingeating')

'ingeat'

In [17]:
# 'sing' has 4 characters, which is not fewer than min=4, so 'ing' is still stripped
reg_stemmer.stem('sing')

's'
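
The results above follow from how the pattern is applied. Here is a minimal sketch using only Python's `re` module. It assumes the stemmer returns words shorter than the minimum length unchanged and otherwise deletes every match of the pattern. The names `regex_stem` and `min_length` are illustrative and not part of NLTK.

```python
import re

def regex_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Illustrative stand-in for a regex-based stemmer (not NLTK's code)."""
    # Words shorter than min_length are left alone
    if len(word) < min_length:
        return word
    # Every alternative ends in $, so only a suffix can match
    return re.sub(pattern, "", word)

for w in ["eating", "ingeating", "sing", "is"]:
    print(f"{w}-->{regex_stem(w)}")
```

Because the length check uses "shorter than", a four-letter word such as `sing` still loses its suffix. Raising the minimum to 5 would leave it intact.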

## Snowball Stemmer  

The Snowball Stemmer reduces words to their stems just like the PorterStemmer, but it is a later revision of the algorithm (its English version is also known as Porter2) and it supports several languages besides English. Its rules handle some word endings that the original Porter algorithm leaves in place.  


In [18]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer('english')

In [20]:
for i in words:
    print(f'{i}-->{snowball.stem(i)}')

running-->run
played-->play
playing-->play
easily-->easili
happier-->happier
flying-->fli
studies-->studi
studying-->studi
cats-->cat
dogs-->dog


### Difference Between Snowball and PorterStemmer

In [24]:
# porter stemmer

stemming.stem('generalizations')

'gener'

In [26]:
# snowball stemmer

snowball.stem('generalizations')

'general'
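
To see where the two stemmers diverge on more than one word, they can be run side by side. This is a small sketch, assuming NLTK is installed. Apart from `generalizations` and `running`, the word list is an illustrative choice and not part of the original notebook.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Extra words beyond the notebook's own are illustrative picks
sample = ["generalizations", "fairly", "sportingly", "running"]

# Map each word to its (porter, snowball) stems
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in sample}

for word, (p, s) in comparison.items():
    note = "" if p == s else "  <-- differs"
    print(f"{word}: porter={p}, snowball={s}{note}")
```

Words flagged as differing are the cases where the choice of stemmer changes the tokens a downstream model sees.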

We often prefer **lemmatization** over **stemming** because lemmatization provides more accurate results. Here's why:

**Meaningful Output**  
- **Stemming**: Strips suffixes by rule, without consulting a dictionary or the context, so it often produces strings that are not real words (e.g., "studying" → "studi").  
- **Lemmatization**: Uses a dictionary to find the root word, producing real words (e.g., "studying" → "study" when treated as a verb).

---

**Accuracy**  
- **Stemming**: Can over-stem or under-stem words, leading to errors.  
  Example:  
  - "flying" → "fli" (not a valid word).  
- **Lemmatization**: Produces more accurate and complete root forms.  
  Example:  
  - "flying" → "fly" (when treated as a verb).
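
The contrast between rule-based stripping and dictionary lookup can be sketched in plain Python. This is a toy illustration only: `naive_stem` is a deliberately crude suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a hypothetical hand-written table standing in for a real dictionary such as WordNet.

```python
# Hypothetical lookup table standing in for a real dictionary
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "running": "run",
    "flying": "fly",
}

def naive_stem(word):
    """Strip the first matching suffix; no dictionary, no context."""
    for suffix in ("ies", "ing", "s"):
        # Keep at least three characters of the word
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "studying", "running", "flying"]:
    print(f"{w}: stem={naive_stem(w)}, lemma={toy_lemmatize(w)}")
```

The stripper is fast and needs no data, but its output depends on which rule happens to fire. The lookup returns real words, but only for entries the dictionary contains.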