## Stemming


Stemming is a natural language processing technique that reduces words to a base or root form, known as the stem. NLTK (Natural Language Toolkit) ships several stemmers. The stem is not always a valid word, but it is meant to capture the word's core meaning. Stemming is useful for tasks such as text normalization and information retrieval, where variants of a word should be treated as the same word.

In [70]:
## Classification problem:
## is a product comment a positive review or a negative review?

## Reviews ---> eating, eats, eaten --> eat; going, gone, goes --> go
words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

## Porter Stemmer

The Porter stemming algorithm, published by Martin Porter in 1980, is one of the oldest and most widely used stemming algorithms. It applies a fixed sequence of rules that strip common suffixes from English words. It is fast and simple, but its stems are not always real words: "history" becomes "histori", for example.



In [71]:
from nltk.stem import PorterStemmer

In [72]:
stemming = PorterStemmer()

In [73]:
for word in words:
  print(word+"--->"+stemming.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program
history--->histori
running--->run
cats--->cat
jumped--->jump
faster--->faster
quickly--->quickli


In [74]:
stemming.stem('congratulation')

'congratul'

In [75]:
stemming.stem("sitting")

'sit'
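
To tie the Porter output above back to the normalization use case, here is a minimal sketch of grouping word variants under a shared stem, roughly as a search index or feature extractor might. The `group_by_stem` helper and the sample list are illustrative assumptions, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the list of original words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Hypothetical sample input, taken from the word list used in this section.
sample = ["eating", "eats", "eaten", "writing", "writes"]
groups = group_by_stem(sample, PorterStemmer())

for stem, variants in groups.items():
    print(stem, "<---", variants)
```

Given the stems printed earlier, "eating" and "eats" end up under the same key, while "eaten" stays in a group of its own.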

## RegexpStemmer Class

The RegexpStemmer class lets you define custom stemming rules with a regular expression.

A regular expression is a pattern that matches parts of a string. The stemmer removes whatever the pattern matches, so a rule that matches a trailing "ing" turns "eating" into "eat".

- Define stemming rules using a regular expression, with one alternative per suffix and "$" anchoring each one to the end of the word:
  
  pattern = r"(ing$|s$|ed$|er$|est$|ly$)"

For example:

- If you give it "running" and the rule removes "ing", the result is "runn": the rule strips the suffix but does not undo the doubled consonant.

- If you give it "cats" and the rule removes "s", the result is "cat".
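
The suffix-removal rule described above can be sketched with Python's standard `re` module alone. This is only an approximation of the idea, not NLTK's implementation, and `strip_suffix` is a hypothetical helper name.

```python
import re

# Same suffix pattern as in this section: each alternative is anchored
# to the end of the word with "$".
SUFFIXES = re.compile(r"(ing$|s$|ed$|er$|est$|ly$)")


def strip_suffix(word):
    """Delete whatever suffix the pattern matches at the end of the word."""
    return SUFFIXES.sub("", word)


for word in ["running", "cats", "sing", "history"]:
    print(word, "--->", strip_suffix(word))
```

Because the rule is purely textual, "running" keeps its doubled consonant and a short word such as "sing" is cut down to a single letter.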

In [76]:
from nltk.stem import RegexpStemmer

In [77]:
words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

In [78]:
# Define stemming rules using regular expressions
pattern = r"(ing$|s$|ed$|er$|est$|ly$)"

In [79]:
reg_stemmer = RegexpStemmer(pattern)

In [80]:
stemmed_words = [reg_stemmer.stem(word) for word in words]

In [81]:
for original,stemmed in zip(words,stemmed_words):
  print(original,"--->",stemmed)

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> writ
writes ---> write
programming ---> programm
programs ---> program
history ---> history
running ---> runn
cats ---> cat
jumped ---> jump
faster ---> fast
quickly ---> quick
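
The results above come from plain suffix stripping. One option `RegexpStemmer` does provide is the `min` argument, which returns words shorter than the given length unchanged. A brief sketch follows; the threshold of 5 and the sample words are chosen only for illustration.

```python
from nltk.stem import RegexpStemmer

pattern = r"(ing$|s$|ed$|er$|est$|ly$)"

# Without a minimum length, very short words are stripped too.
plain = RegexpStemmer(pattern)
# With min=5, words of fewer than 5 characters are returned unchanged.
guarded = RegexpStemmer(pattern, min=5)

for word in ["sing", "bus", "eating"]:
    print(word, "--->", plain.stem(word), "|", guarded.stem(word))
```

A length threshold protects short words but does nothing for cases like "running", so it is a partial safeguard at best.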


## Snowball Stemmer

The Snowball Stemmer's English algorithm, also known as Porter2, is a revised version of the original Porter Stemmer by the same author. Snowball provides stemmers for multiple languages, and its English stemmer often produces cleaner stems than the original: for example, it reduces "quickly" to "quick" where Porter gives "quickli".
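
As a small illustration of the multi-language support, the sketch below lists the language names NLTK's `SnowballStemmer` accepts and stems one German word. The choice of German and of the word "Autobahnen" is arbitrary.

```python
from nltk.stem import SnowballStemmer

# Class attribute listing the language names the stemmer accepts.
print(SnowballStemmer.languages)

# Pick a non-English language; "german" is just an example choice.
german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("Autobahnen")
print(german_stem)
```

Passing a name that is not in `SnowballStemmer.languages` raises an error, so checking the tuple first is a reasonable habit.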

In [82]:
from nltk.stem import SnowballStemmer

In [83]:
# Create an instance of the Snowball Stemmer for English
Snowball_stemmer = SnowballStemmer("english")

In [84]:
# list of words to stem
words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

In [85]:
# Apply the Snowball Stemmer to each word in the list
stemmed_words = [Snowball_stemmer.stem(word) for word in words]

In [86]:
# Print original and stemmed words

for original,stemmed in zip(words,stemmed_words):
  print(original,"--->",stemmed)

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
programs ---> program
history ---> histori
running ---> run
cats ---> cat
jumped ---> jump
faster ---> faster
quickly ---> quick


#### Difference Between Porter Stemmer and Snowball Stemmer

In [87]:
# Porter Stemmer
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [88]:
# Snowball Stemmer
Snowball_stemmer.stem("fairly"),Snowball_stemmer.stem("sportingly")

('fair', 'sport')
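
The comparison above can also be run over a list, keeping only the words on which the two stemmers disagree. This is a minimal sketch; the variable names and the candidate list are illustrative, with the words drawn from this section.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words from this section; only some of them stem differently.
candidates = ["fairly", "sportingly", "quickly", "running"]

# Keep (porter_stem, snowball_stem) pairs only where the results differ.
differences = {
    word: (porter.stem(word), snowball.stem(word))
    for word in candidates
    if porter.stem(word) != snowball.stem(word)
}

for word, (p, s) in differences.items():
    print(f"{word}: porter={p} snowball={s}")
```

In this sample the disagreements are all "-ly" adverbs, which matches the outputs shown earlier in the section.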

In [90]:
# Disadvantage of stemming: some forms are not mapped to their root ("goes" becomes "goe", not "go")

Snowball_stemmer.stem("goes")


'goe'

In [91]:
stemming.stem("goes")

'goe'
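
One way to see why this matters is to stem several forms of the same verb and count the distinct stems. The sketch below does that with the Porter Stemmer; the list of forms is an illustrative assumption.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflections of one verb; ideally they would all share a single stem.
forms = ["go", "goes", "going"]
stems = {form: stemmer.stem(form) for form in forms}
print(stems)

# More than one distinct stem means the variants are not fully merged.
print(len(set(stems.values())) > 1)
```

When the forms do not collapse to a single stem, a classifier or search index built on these stems still treats them as different terms.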