
Stemming is a natural language processing technique that reduces words to a root or base form, called the **stem**, mostly by stripping suffixes. The goal is for inflected or derived variants of a word to map to the same token.

For example, consider the words:
- running
- runs
- runner

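An ideal stemmer would reduce all of these to the stem "run". In practice the result depends on the algorithm: in the outputs below, the Porter and Snowball stemmers leave "runner" unchanged, while the Lancaster stemmer reduces it to "run".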

Here are some common types of stemming algorithms:

*   **Porter Stemmer:** One of the most widely used stemming algorithms. It applies a fixed sequence of suffix-stripping rules in several steps.
*   **Snowball Stemmer (Porter2 Stemmer):** A revised version of the Porter Stemmer that is available for many languages besides English. Its English rules differ from Porter's in a number of details, but on the word list used here both produce identical stems.
*   **Lancaster Stemmer:** A more aggressive algorithm than Porter and Snowball. It often produces shorter stems, at a higher risk of merging unrelated words.
*   **RegexpStemmer:** A simple stemmer driven by a regular expression that you supply. Any part of a word that matches the pattern is removed, so suffix patterns are usually anchored with `$` to match only at the end of the word.

Stemming is rule-based, so it can produce stems that are not actual words. For instance, the Porter stemmer turns "beautiful" into "beauti". Stemming differs from lemmatization, which maps each word to its dictionary form (the lemma) and therefore always returns a valid word.
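
To see why a rule-based approach can yield non-words, here is a minimal sketch of a suffix-rule stemmer. `TOY_RULES`, `MIN_STEM_LENGTH`, and `toy_stem` are hypothetical names made up for this illustration. The three rules are only loosely modelled on the kind of rules Porter-style stemmers apply; this is not the actual Porter algorithm.

```python
# Toy suffix rules: (suffix to match, replacement). Hypothetical, for illustration only.
TOY_RULES = [
    ("ful", ""),   # beautiful -> beauti
    ("ing", ""),   # connecting -> connect
    ("y", "i"),    # easily -> easili
]

MIN_STEM_LENGTH = 3  # assumed guard so very short words are left alone


def toy_stem(word: str) -> str:
    """Apply the first matching rule; no dictionary is ever consulted."""
    for suffix, replacement in TOY_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


for word in ["beautiful", "easily", "connecting", "sky"]:
    print(f"{word} ----> {toy_stem(word)}")
```

Because the rules only look at the characters at the end of the word, nothing checks whether the result ("beauti", "easili") exists in the language. A lemmatizer avoids this by looking words up in a vocabulary instead.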

In [13]:
words = ["running", "runs", "runner", "easily", "easier", "beautify", "beautiful", "connecting", "connection", "connected", "runnings"]

In [14]:
from nltk.stem import PorterStemmer

In [15]:
porter_stemmer = PorterStemmer()

In [16]:
# Print each word next to its Porter stem
for word in words:
    print(f"{word} ----> {porter_stemmer.stem(word)}")

running ----> run
runs ----> run
runner ----> runner
easily ----> easili
easier ----> easier
beautify ----> beautifi
beautiful ----> beauti
connecting ----> connect
connection ----> connect
connected ----> connect
runnings ----> run


In [17]:
from nltk.stem import RegexpStemmer

In [18]:
# Strip a trailing "ing", "s", "e", or "able"; `$` anchors each alternative to the end of the word
regex_stemmer = RegexpStemmer("ing$|s$|e$|able$")

In [19]:
for word in words:
    print(f"{word} ----> {regex_stemmer.stem(word)}")

running ----> runn
runs ----> run
runner ----> runner
easily ----> easily
easier ----> easier
beautify ----> beautify
beautiful ----> beautiful
connecting ----> connect
connection ----> connection
connected ----> connected
runnings ----> running
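
The output above follows from how a regex-based stemmer works: whatever matches the pattern is deleted, and nothing else is checked. The sketch below reproduces that behaviour with the standard `re` module. `regex_stem` and its `min_length` parameter are hypothetical names for this illustration, and the code approximates the idea rather than reproducing NLTK's implementation. NLTK's `RegexpStemmer` accepts a comparable `min` argument.

```python
import re

SUFFIX_PATTERN = re.compile(r"ing$|s$|e$|able$")  # same pattern as in the cell above


def regex_stem(word: str, min_length: int = 0) -> str:
    """Delete every match of SUFFIX_PATTERN unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)


for word in ["running", "runnings", "connection", "is"]:
    plain = regex_stem(word)
    guarded = regex_stem(word, min_length=4)
    print(f"{word} ----> {plain} (min_length=4: {guarded})")
```

"running" becomes "runn" because the pattern knows nothing about the doubled consonant, and "connection" is untouched because no alternative matches "ion". The length guard keeps short words such as "is" from being cut down to a single letter.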


In [20]:
from nltk.stem import SnowballStemmer

In [21]:
snowball_stemmer = SnowballStemmer(language='english')

In [22]:
# Print each word next to its Snowball (Porter2) stem
for word in words:
    print(f"{word} ----> {snowball_stemmer.stem(word)}")

running ----> run
runs ----> run
runner ----> runner
easily ----> easili
easier ----> easier
beautify ----> beautifi
beautiful ----> beauti
connecting ----> connect
connection ----> connect
connected ----> connect
runnings ----> run


In [23]:
from nltk.stem import LancasterStemmer

In [24]:
lanc_stemmer = LancasterStemmer()

In [25]:
# Print each word next to its Lancaster stem
for word in words:
    print(f"{word} ----> {lanc_stemmer.stem(word)}")

running ----> run
runs ----> run
runner ----> run
easily ----> easy
easier ----> easy
beautify ----> beaut
beautiful ----> beauty
connecting ----> connect
connection ----> connect
connected ----> connect
runnings ----> run
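
One rough way to compare how aggressive the stemmers are is to count how many distinct stems each one produces for the same word list: fewer stems means more words were merged. The sketch below does this with the standard library only. The stem lists are copied from the outputs printed above rather than recomputed, and `group_by_stem` is a hypothetical helper name.

```python
from collections import defaultdict

words = ["running", "runs", "runner", "easily", "easier", "beautify",
         "beautiful", "connecting", "connection", "connected", "runnings"]

# Stems copied from the outputs above (Snowball matched Porter on this list).
stems_by_stemmer = {
    "porter": ["run", "run", "runner", "easili", "easier", "beautifi",
               "beauti", "connect", "connect", "connect", "run"],
    "regexp": ["runn", "run", "runner", "easily", "easier", "beautify",
               "beautiful", "connect", "connection", "connected", "running"],
    "lancaster": ["run", "run", "run", "easy", "easy", "beaut",
                  "beauty", "connect", "connect", "connect", "run"],
}


def group_by_stem(words, stems):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word, stem in zip(words, stems):
        groups[stem].append(word)
    return dict(groups)


for name, stems in stems_by_stemmer.items():
    groups = group_by_stem(words, stems)
    print(f"{name}: {len(groups)} distinct stems")
```

On this list the counts are 7 for Porter, 11 for the regex stemmer, and 5 for Lancaster, which matches the description of Lancaster as the most aggressive of the group. To run the same comparison on live stemmers, the stem lists could be built with a call such as `[lanc_stemmer.stem(w) for w in words]`.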
