# Stemming
Stemming is a text normalization technique used in natural language processing (NLP) and information retrieval to reduce words to a root or base form, called the stem. The stem need not be a dictionary word (for example, "flies" becomes "fli"). The purpose of stemming is to group related word forms together despite variations in their affixes, most commonly suffixes.

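Original words: "running", "runs", "run"

Stemmed form: "run"

Irregular forms such as "ran" and derived nouns such as "runner" are left unchanged by the Porter stemmer, because suffix stripping alone cannot map them to "run".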

### Different types of stemming algorithms commonly used in NLP:

Porter Stemmer: One of the oldest and most widely used stemming algorithms, developed by Martin Porter. It applies a fixed sequence of rules to strip suffixes from words.

Snowball Stemmer: Also developed by Martin Porter. Snowball is a framework for writing stemmers; its English stemmer (sometimes called Porter2) refines the original Porter rules, and stemmers for many other languages are available.

Lancaster Stemmer: An aggressive, iterative stemming algorithm (also known as the Paice/Husk stemmer) that usually produces shorter stems than the Porter Stemmer, at a higher risk of overstemming.

Lovins Stemmer: Developed by Julie Beth Lovins and published in 1968, this algorithm predates the Porter Stemmer. It removes the longest matching suffix in a single pass and then applies recoding rules to the remaining stem.
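
To make the differences between these algorithms concrete, the following sketch runs three of them side by side using NLTK. It assumes NLTK is installed; the helper `stem_with_all` and the sample words are illustrative choices, not part of the original notebook. NLTK does not ship a Lovins stemmer, so it is left out here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer available in NLTK
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Illustrative sample words (an assumption, chosen to show differences)
sample_words = ["maximum", "presumably", "provision"]


def stem_with_all(word):
    """Return a dict mapping stemmer name to the stem it produces."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in sample_words:
    print(word, stem_with_all(word))
```

On a word like "maximum", the Lancaster stemmer strips the "-um" ending while the Porter stemmer leaves the word intact, which is one way to see what "more aggressive" means in practice.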

### Drawbacks of stemming:

Overstemming: Stemming algorithms may strip too much, so that unrelated words are grouped together. For example, "universe" and "university" both stem to "univers", so a search for one can return documents about the other.

Understemming: Stemming algorithms may strip too little, so that words with the same meaning are treated as separate entities. For example, the Porter stemmer leaves "data" and "datum" as two distinct stems, even though they are forms of the same word.

Loss of meaning: Stemming can sometimes result in loss of meaning, as it removes affixes from words without considering the semantics. For example, "connection" and "connectivity" both stem to "connect", losing the distinction in meaning.
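
Overstemming and understemming can be checked directly by grouping words under the stem they receive. The sketch below assumes NLTK is installed; `group_by_stem` is a hypothetical helper, and the "data"/"datum" pair is an illustrative choice of words that the Porter stemmer keeps apart.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


porter = PorterStemmer()

# Overstemming: words with different meanings end up under one stem
overstemmed = group_by_stem(["universe", "university", "universal"], porter)

# Understemming: forms of the same word end up under different stems
understemmed = group_by_stem(["data", "datum"], porter)

print(overstemmed)
print(understemmed)
```

A single key in the first dictionary means all three words were merged; two keys in the second means the related forms were not.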

### Usage:
Stemming is typically used in applications focused on information retrieval, such as search engines, document clustering, and text mining. It reduces the vocabulary size and simplifies text for analysis.

However, stemming may not be suitable where preserving the exact meaning of words is crucial, such as sentiment analysis or machine translation, because it discards semantic information. In such cases lemmatization may be preferred: it uses morphological analysis and the word's context (for example, its part of speech) to return a dictionary form rather than a truncated stem.
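
The effect on vocabulary size can be sketched without any library. The example below uses a deliberately crude suffix stripper, `toy_stem`, as a stand-in for a real stemmer; both it and the sample sentence are assumptions made for illustration, and a real pipeline would more likely use one of the NLTK stemmers shown later.

```python
import re

# Toy stand-in for a real stemmer: strips a few common English suffixes.
SUFFIXES = re.compile(r"(ing|ed|es|s)$")


def toy_stem(word):
    """Strip one suffix, but keep the word if fewer than 3 letters would remain."""
    stripped = SUFFIXES.sub("", word)
    return stripped if len(stripped) >= 3 else word


def vocabulary(tokens, stem=None):
    """Return the set of distinct tokens, optionally after stemming."""
    return {stem(token) if stem else token for token in tokens}


text = "the search engine indexes documents and indexed document searches"
tokens = text.lower().split()

raw_vocab = vocabulary(tokens)
stemmed_vocab = vocabulary(tokens, toy_stem)

print(len(raw_vocab), "distinct tokens before stemming")
print(len(stemmed_vocab), "distinct tokens after stemming")
```

Here "indexes" and "indexed" collapse to one entry, as do "documents"/"document" and "searches"/"search", so a query for any one form can match the others.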

## Porter Stemmer

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
# Sample words covering plurals, past tenses, and derived forms
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer', 'plotted']

In [None]:
# Print each word next to its Porter stem
for plural in plurals:
    print(f"{plural}--->{stemmer.stem(plural)}")

caresses--->caress
flies--->fli
dies--->die
mules--->mule
denied--->deni
died--->die
agreed--->agre
owned--->own
humbled--->humbl
sized--->size
meeting--->meet
stating--->state
siezing--->siez
itemization--->item
sensational--->sensat
traditional--->tradit
reference--->refer
colonizer--->colon
plotted--->plot
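
To show what "a series of rules" looks like, here is a minimal sketch of only the first step (step 1a) of Porter's published algorithm, which handles plural endings. The function name `porter_step_1a` is hypothetical, and this is not NLTK's implementation: the full algorithm has several further steps, and NLTK's default mode appears to treat short words differently, which would explain why the output above shows "dies" becoming "die" rather than "di".

```python
def porter_step_1a(word):
    """Apply step 1a of Porter's original algorithm (plural suffixes only)."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # IES  -> I    (flies -> fli)
    if word.endswith("ss"):
        return word       # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]  # S    -> ""   (mules -> mule)
    return word


for word in ["caresses", "flies", "caress", "mules"]:
    print(f"{word}--->{porter_step_1a(word)}")
```

The rules are tried in order and only the first match applies, which is why "caress" is protected by the SS rule before the plain S rule can strip it.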


## Snowball Stemmer

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
stemmer = SnowballStemmer("english")

In [None]:
# Print each word next to its Snowball (English) stem
for plural in plurals:
    print(f"{plural}--->{stemmer.stem(plural)}")

caresses--->caress
flies--->fli
dies--->die
mules--->mule
denied--->deni
died--->die
agreed--->agre
owned--->own
humbled--->humbl
sized--->size
meeting--->meet
stating--->state
siezing--->siez
itemization--->item
sensational--->sensat
traditional--->tradit
reference--->refer
colonizer--->colon
plotted--->plot
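
On this word list the Snowball output matches the Porter output, so the sketch below tries two adverbs where the English Snowball rules are known to differ, and also lists the languages NLTK's `SnowballStemmer` supports. It assumes NLTK is installed; the sample words and variable names are illustrative.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Adverbs ending in "-ly" are handled differently by the two rule sets
for word in ["generously", "fairly"]:
    print(f"{word}: porter={porter.stem(word)}, snowball={snowball.stem(word)}")

# Language names accepted by the SnowballStemmer constructor
supported_languages = SnowballStemmer.languages
print(supported_languages)

# The same class works for other languages, for example German
german = SnowballStemmer("german")
print(german.stem("Autobahnen"))
```

For "generously", the Porter stemmer cuts down to "gener" while the Snowball English stemmer stops at "generous", which is usually the more readable stem.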


## RegexpStemmer

In [None]:
from nltk.stem import RegexpStemmer

In [None]:
# Strip any one of these suffixes when it appears at the end of a word
regexp_stemmer = RegexpStemmer(r'(s|es|ed|ing|er|ion)$')

In [None]:
# Print each word next to its regular-expression stem
for plural in plurals:
    print(f"{plural}--->{regexp_stemmer.stem(plural)}")

caresses--->caress
flies--->fli
dies--->di
mules--->mul
denied--->deni
died--->di
agreed--->agre
owned--->own
humbled--->humbl
sized--->siz
meeting--->meet
stating--->stat
siezing--->siez
itemization--->itemizat
sensational--->sensational
traditional--->traditional
reference--->reference
colonizer--->coloniz
plotted--->plott
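
The output above shows the limits of a single pattern: "sensational" is untouched because none of the listed suffixes match, and short words such as "dies" are cut down to "di". The sketch below reproduces roughly what a regular-expression stemmer does using only the standard `re` module. The function `regexp_stem` and its `min_length` parameter are hypothetical names for illustration; NLTK's `RegexpStemmer` offers a comparable `min` argument for skipping short words.

```python
import re

PATTERN = r"(s|es|ed|ing|er|ion)$"


def regexp_stem(word, pattern=PATTERN, min_length=0):
    """Delete the part of `word` matching `pattern`, skipping short words."""
    if len(word) < min_length:
        return word  # too short to stem safely
    return re.sub(pattern, "", word)


for word in ["dies", "sized", "sensational", "plotted"]:
    print(f"{word}--->{regexp_stem(word)}")

# With a minimum length, short words are left alone
print(regexp_stem("dies", min_length=5))
```

Unlike the Porter and Snowball stemmers, nothing here repairs the stem afterwards, so "sized" loses its final "e" and "plotted" keeps its doubled "t".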
