---
<strong>
    <h1 align='center'><strong>Stemming</strong></h1>
</strong>

---

**Stemming in NLP**

In natural language processing (NLP), stemming is a text normalization technique that reduces words to a base or root form, called the stem. It works by stripping affixes, in practice mostly suffixes, so that different inflected or derived forms of a word are treated as the same term. Because many surface forms collapse into one stem, the vocabulary shrinks, which can help text analysis, information retrieval, and text-based machine learning tasks.

*Example*: Consider the words "jumping," "jumps," and "jumped." When these words are stemmed, they are reduced to their common root form, which is "jump." This allows NLP algorithms to treat all these variations of the word "jump" as the same word, simplifying text analysis.
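
The example above can be reproduced with a few lines of code. This is a minimal sketch, assuming NLTK is installed; `PorterStemmer` needs no extra data downloads, and the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three inflected forms of the same verb
forms = ["jumping", "jumps", "jumped"]
stems = [stemmer.stem(form) for form in forms]

print(stems)            # ['jump', 'jump', 'jump']
print(len(set(forms)))  # 3 distinct surface forms
print(len(set(stems)))  # 1 distinct stem
```

Three vocabulary entries collapse into one, which is the dimensionality reduction described earlier.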

Stemming algorithms work by applying an ordered set of rules or heuristics that strip affixes (chiefly suffixes) from words, without consulting a dictionary. Common stemming algorithms include:

1. **Porter Stemmer**: The Porter stemming algorithm is one of the most widely used stemming algorithms. It applies a series of rules to remove suffixes from words, but it may not always produce a valid word root.

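2. **Snowball Stemmer**: This is an improved and more versatile successor to the Porter stemmer; its English variant is also known as Porter2. It offers stemmers for multiple languages and handles some cases the original Porter rules miss. For example, it reduces "quickly" to "quick" rather than "quickli".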

3. **Lancaster Stemmer**: The Lancaster stemming algorithm is more aggressive than the Porter stemmer, often producing shorter stems, but it may also result in less recognizable word forms.
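
To make the idea of rule-based suffix stripping concrete, here is a toy sketch in plain Python. The rules, the `MIN_STEM_LENGTH` threshold, and the `toy_stem` function are hypothetical and far simpler than any of the three algorithms listed above; the sketch only illustrates the general mechanism.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These rules are illustrative only, not taken from Porter, Snowball, or Lancaster.
RULES = [
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # do not strip if the remainder would be shorter than this


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["jumping", "jumps", "jumped", "quickly", "running", "is"]:
    print(w, "->", toy_stem(w))
```

Note that `toy_stem("running")` yields `runn`, not `run`: real stemmers carry additional rules (such as undoubling a final consonant) that this sketch deliberately omits.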


Stemming is a useful preprocessing step in NLP tasks such as information retrieval, document classification, and text mining. However, stems are not always valid words (for example, the Porter stemmer turns "happily" into "happili"), so they can be hard to interpret. For applications such as sentiment analysis or language understanding, lemmatization, which maps each word to a valid dictionary form using vocabulary and part-of-speech information, may be a more suitable alternative to stemming.
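
The contrast with lemmatization can be sketched as follows. This assumes NLTK's `WordNetLemmatizer`, which needs the WordNet data (`nltk.download('wordnet')`); the helper `stem_and_lemma` is a hypothetical name, and it returns `None` for the lemma when that data is not installed.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def stem_and_lemma(word, pos="n"):
    """Return (stem, lemma); lemma is None if the WordNet data is unavailable."""
    stem = stemmer.stem(word)
    try:
        lemma = lemmatizer.lemmatize(word, pos=pos)
    except LookupError:
        # WordNet corpus not installed; run nltk.download('wordnet') first
        lemma = None
    return stem, lemma


for word in ["flies", "happily"]:
    print(word, stem_and_lemma(word))
```

The `pos` argument matters: the lemmatizer treats the word as a noun by default, so verbs and adverbs should be passed with the matching part-of-speech tag.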


In [None]:
from pprint import pprint

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

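# Download the 'punkt' tokenizer models used by sent_tokenize and word_tokenize.
# (Recent NLTK releases may ask for 'punkt_tab' instead.)
nltk.download('punkt', quiet=True)

# Download the NLTK stopwords dataset, which contains common stopwords for various languages.
# It is required by the stop word removal example in this section.
nltk.download('stopwords', quiet=True)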

# Download the NLTK averaged perceptron tagger, which is used for part-of-speech tagging.
# nltk.download('averaged_perceptron_tagger', quiet=True)

# Download the WordNet lexical database, which is used for various NLP tasks like synonym and antonym lookup.
# nltk.download('wordnet', quiet=True)

# Download the NLTK names dataset, which contains lists of common male and female first names.
# nltk.download('names', quiet=True)

# Download the NLTK movie_reviews dataset, which contains movie reviews categorized as positive and negative.
# nltk.download('movie_reviews', quiet=True)

# Download the NLTK reuters dataset, which is a collection of news documents categorized into topics.
# nltk.download('reuters', quiet=True)

# Download the NLTK brown corpus, which is a collection of text from various genres of written American English.
# nltk.download('brown', quiet=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Sample paragraph
paragraph = """Natural Language Processing (NLP) is a fascinating field with numerous applications.
               It involves the interaction between computers and human language. NLP tasks include
               text classification, sentiment analysis, and machine translation. In addition to common
               stop words like 'the', 'and', and 'is', there are domain-specific stop words such as
               'algorithm', 'linguistics', and 'corpus' that are often excluded from NLP analysis."""

# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

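# Build the stop word set once instead of once per token
stop_words = set(stopwords.words('english'))

# Remove stopwords, then stem the remaining tokens.
# The comparison is case-sensitive, so capitalized tokens such as "It" and "In" are kept.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)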

# Print the preprocessed sentences
for sentence in sentences:
    print(sentence)

natur languag process ( nlp ) fascin field numer applic .
it involv interact comput human languag .
nlp task includ text classif , sentiment analysi , machin translat .
in addit common stop word like 'the ' , 'and ' , 'i ' , domain-specif stop word 'algorithm ' , 'linguist ' , 'corpu ' often exclud nlp analysi .
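
In the output above, "it" and "in" survive because the sentence-initial tokens "It" and "In" are capitalized, while the stop word list is lowercase. One possible fix is to lowercase tokens before filtering. The sketch below assumes a small hand-written stop list and a regex tokenizer (both hypothetical stand-ins for NLTK's stop word corpus and `word_tokenize`) so that it runs without any data downloads.

```python
import re

from nltk.stem import PorterStemmer

# Small illustrative stop list (an assumption; not NLTK's full English list)
STOP_WORDS = {"a", "and", "between", "in", "is", "it", "the"}

stemmer = PorterStemmer()


def preprocess(sentence):
    """Lowercase, tokenize on runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]


result = preprocess("It involves the interaction between computers and human language.")
print(" ".join(result))  # involv interact comput human languag
```

The regex tokenizer also discards punctuation, which the original pipeline keeps as separate tokens.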


$$
\begin{array}{|l|l|l|}
\hline
\text { Stemmer } & \text { Use When } & \text { Example } \\
\hline
\text { Porter Stemmer } &
\begin{array}{l}
- \text { Good balance between} \\
\text { aggressiveness and accuracy.} \\
- \text { Suitable for general-purpose} \\
\text { applications where over-stemming} \\
\text { is a concern.} \\
- \text { Less aggressive than the} \\
\text { Lancaster stemmer.}
\end{array}
&
\begin{array}{l}
\text { "happily" } \rightarrow \text { "happili"} \\
\text { "quickly" } \rightarrow \text { "quickli"}
\end{array}
\\
\hline
\text { Snowball Stemmer (Porter2) } &
\begin{array}{l}
- \text { Improved version of the Porter stemmer.} \\
- \text { Offers better stemming for} \\
\text { modern English words.} \\
- \text { Good choice for search engines,} \\
\text { information retrieval, and text} \\
\text { mining tasks.}
\end{array}
&
\begin{array}{l}
\text { "happily" } \rightarrow \text { "happili"} \\
\text { "quickly" } \rightarrow \text { "quick"}
\end{array}
\\
\hline
\text { Lancaster Stemmer } &
\begin{array}{l}
- \text { Very aggressive stemming.} \\
- \text { Useful when you want to reduce} \\
\text { words to their most basic form,} \\
\text { even if the result is not a valid word.} \\
- \text { May produce very short stems,} \\
\text { which can be hard to interpret.}
\end{array}
&
\begin{array}{l}
\text { "happily" } \rightarrow \text { "happy"} \\
\text { "running" } \rightarrow \text { "run"}
\end{array}
\\
\hline
\end{array}
$$
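
The Snowball row in the table covers more than English. As a brief sketch, assuming NLTK's `SnowballStemmer`, the supported languages can be listed through its `languages` attribute, and a stemmer is created by passing a language name.

```python
from nltk.stem import SnowballStemmer

# Languages supported by NLTK's Snowball implementation
print(SnowballStemmer.languages)

# Create an English (Porter2) stemmer and apply it to a single word
english = SnowballStemmer("english")
print(english.stem("quickly"))  # quick
```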


In [None]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# List of words
words = ['algorithm', 'beautifully', 'flies', 'friendship', 'happening', 'happily', 'interaction', 'jumped', 'jumping', 'jumps', 'quickly', 'runner', 'running']

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

# Iterate through the list of words
for word in words:
    # Apply each stemmer and print the results
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    lancaster_stem = lancaster.stem(word)

    print(f'Original: {word}')
    print(f'Porter: {porter_stem}')
    print(f'Snowball: {snowball_stem}')
    print(f'Lancaster: {lancaster_stem}')
    print()

Original: algorithm
Porter: algorithm
Snowball: algorithm
Lancaster: algorithm

Original: beautifully
Porter: beauti
Snowball: beauti
Lancaster: beauty

Original: flies
Porter: fli
Snowball: fli
Lancaster: fli

Original: friendship
Porter: friendship
Snowball: friendship
Lancaster: friend

Original: happening
Porter: happen
Snowball: happen
Lancaster: hap

Original: happily
Porter: happili
Snowball: happili
Lancaster: happy

Original: interaction
Porter: interact
Snowball: interact
Lancaster: interact

Original: jumped
Porter: jump
Snowball: jump
Lancaster: jump

Original: jumping
Porter: jump
Snowball: jump
Lancaster: jump

Original: jumps
Porter: jump
Snowball: jump
Lancaster: jump

Original: quickly
Porter: quickli
Snowball: quick
Lancaster: quick

Original: runner
Porter: runner
Snowball: runner
Lancaster: run

Original: running
Porter: run
Snowball: run
Lancaster: run
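
With longer word lists, it can be easier to look only at the words on which the stemmers disagree. The following is one way to do that, using a subset of the words above; the dictionary layout and variable names are illustrative choices.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

words = ["happening", "jumped", "quickly", "runner"]

# Map each word to the stem produced by each stemmer
results = {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

# Keep only the words on which the stemmers disagree
disagreements = {w: r for w, r in results.items() if len(set(r.values())) > 1}

for word, stems in disagreements.items():
    print(word, stems)
```

Words such as "jumped", for which all three stemmers return the same stem, are filtered out.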

