## Stemming in NLTK


Stemming is a text normalization technique used in natural language processing (NLP) and information retrieval to reduce words to their base or root form, called the "stem."


**Key Points about Stemming:**

1. **Base or Root Form**: Stemming algorithms strip affixes from words to derive a base form. In practice, most English stemmers, including the Porter Stemmer, remove only suffixes and inflectional endings and leave prefixes in place.

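2. **Equivalence**: Stemming allows different variations of a word to be treated as the same word. For example, "running," "runs," and "run" all reduce to the common stem "run." Irregular forms such as "ran" are usually left unchanged, because rule-based stemmers do not consult a dictionary.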

3. **Heuristic Approach**: Stemming algorithms often use simple heuristic rules to remove affixes. While these rules are effective in many cases, they may not always produce correct linguistic stems.

4. **Speed**: Stemming algorithms are generally fast and suitable for processing large volumes of text. However, they may produce stems that are not valid words in the language.

5. **Porter Stemmer**: The Porter Stemmer is one of the most well-known stemming algorithms, developed by Martin Porter in 1980. It applies a series of heuristic rules to reduce words to their stems.
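
The heuristic idea behind points 3 and 4 can be sketched with a toy suffix stripper. This is only an illustration under simplifying assumptions: `SUFFIX_RULES`, `toy_stem`, and `min_stem_len` are hypothetical names, and the rule list is far smaller than what the Porter Stemmer actually applies.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Rules are tried in order; the first rule that matches and leaves a
# long-enough stem wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # class    -> class (protects the double s)
    ("ing", ""),     # walking  -> walk
    ("ed", ""),      # walked   -> walk
    ("s", ""),       # cats     -> cat
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Lowercase `word` and apply the first usable suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= min_stem_len:
                return candidate
    return word


for w in ["cats", "walked", "ponies", "running", "ran", "is"]:
    print(f"{w} -> {toy_stem(w)}")
```

Even this tiny rule set shows the trade-off: it is fast and needs no dictionary, but it yields non-words such as "poni" and "runn" and leaves the irregular form "ran" untouched.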

**Example of Stemming:**

- **Original**: "running"
- **Stemmed**: "run"

- **Original**: "cats"
- **Stemmed**: "cat"

- **Original**: "writing"
- **Stemmed**: "write"
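
The pairs above can be checked directly with NLTK's `PorterStemmer`. The short sketch below assumes NLTK is installed; the stemmer itself needs no extra data downloads. The word "ran" is added here only to show how an irregular form behaves.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem each example word and print it next to the result
for word in ["running", "cats", "writing", "ran"]:
    print(f"{word} -> {stemmer.stem(word)}")
```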

**Limitations of Stemming:**

1. **Overstemming**: A stemmer may strip too much, so that the stem is not a valid word or words with different meanings collapse into one stem. For example, the Porter Stemmer reduces "universal," "university," and "universe" all to "univers."

2. **Understemming**: Conversely, a stemmer may strip too little, so that related forms end up with different stems. For example, the Porter Stemmer leaves "ran" as "ran," which does not match the stem "run" produced for "running" and "runs."

3. **Language Dependence**: Stemming algorithms are often language-dependent and may not perform well for languages with complex morphology or irregular word forms.
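
One way to spot both failure modes is to group words by the stem they receive. The sketch below is a minimal illustration: `group_by_stem` and `crude_stem` are hypothetical helpers, and `crude_stem` is a deliberately aggressive stand-in, not a real stemming algorithm.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_stem(
    words: Iterable[str], stem_fn: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Map each stem to the list of input words that produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


def crude_stem(word: str) -> str:
    """Hypothetical, deliberately aggressive rule set used only for this demo."""
    for suffix in ("ity", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["universal", "university", "universe", "run", "runs", "ran"]
for stem, members in group_by_stem(words, crude_stem).items():
    print(f"{stem}: {members}")
```

A group that mixes unrelated meanings (here the three "univers" words) points to overstemming, while related words that land in separate groups ("run" and "ran") point to understemming. Any callable that maps a word to a stem can be passed as `stem_fn`, for example the `stem` method of an NLTK stemmer.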

**Applications of Stemming:**

1. **Information Retrieval**: Stemming is used in search engines to improve retrieval by treating different forms of words as equivalent.

2. **Text Mining**: Stemming is employed in text analysis tasks such as clustering, classification, and topic modeling to reduce dimensionality and improve performance.

3. **Text Normalization**: Stemming is a step in text preprocessing pipelines to prepare text data for further analysis or processing.

4. **Language Processing**: Stemming is used in various language processing tasks, including machine translation, sentiment analysis, and named entity recognition.
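
To make the information retrieval case concrete, here is a small sketch of an inverted index keyed by stems, so that a query matches documents containing other forms of the same word. All names (`simple_stem`, `build_index`, `search`) are hypothetical, and `simple_stem` is a stand-in that a real system would replace with a proper stemmer.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def simple_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: lowercase, then strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(
    docs: List[str], stem_fn: Callable[[str], str]
) -> Dict[str, Set[int]]:
    """Build an inverted index from stem to the ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[stem_fn(token)].add(doc_id)
    return index


def search(
    index: Dict[str, Set[int]], query: str, stem_fn: Callable[[str], str]
) -> Set[int]:
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(stem_fn(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = ["the cat jumped", "cats are jumping", "a dog barks"]
index = build_index(docs, simple_stem)
print(search(index, "cat jumps", simple_stem))
```

The query "cat jumps" matches the first two documents even though neither contains the word "jumps", because the index and the query are normalized with the same stemming function.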

In [10]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [11]:
sample_text = 'Stemming is a text normalization technique used in natural language processing (NLP) and information retrieval to reduce words to their base or root form, called the "stem."'

# word_tokenize relies on NLTK's Punkt tokenizer data; fetch it once with nltk.download() if it is missing
tokens = word_tokenize(text=sample_text)


In [12]:
stemmer = PorterStemmer()

In [13]:
stemmed_words_Po = [stemmer.stem(word) for word in tokens]

# Print each original token next to its Porter stem
for token, stem in zip(tokens, stemmed_words_Po):
    print(f"Original: {token}\t\tStemmed: {stem}")

Original: Stemming		Stemmed: stem
Original: is		Stemmed: is
Original: a		Stemmed: a
Original: text		Stemmed: text
Original: normalization		Stemmed: normal
Original: technique		Stemmed: techniqu
Original: used		Stemmed: use
Original: in		Stemmed: in
Original: natural		Stemmed: natur
Original: language		Stemmed: languag
Original: processing		Stemmed: process
Original: (		Stemmed: (
Original: NLP		Stemmed: nlp
Original: )		Stemmed: )
Original: and		Stemmed: and
Original: information		Stemmed: inform
Original: retrieval		Stemmed: retriev
Original: to		Stemmed: to
Original: reduce		Stemmed: reduc
Original: words		Stemmed: word
Original: to		Stemmed: to
Original: their		Stemmed: their
Original: base		Stemmed: base
Original: or		Stemmed: or
Original: root		Stemmed: root
Original: form		Stemmed: form
Original: ,		Stemmed: ,
Original: called		Stemmed: call
Original: the		Stemmed: the
Original: ``		Stemmed: ``
Original: stem		Stemmed: stem
Original: .		Stemmed: .
Original: ''		Stemmed: ''
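
The output above shows that punctuation tokens such as `(`, `,`, and the quote marks pass through the stemmer unchanged. If they are not needed downstream, one common option is to filter them out before stemming. The sketch below uses a hand-copied selection of the tokens above (`sample_tokens` is a hypothetical name) so that it runs on its own.

```python
# A selection of tokens taken from the output above; in the notebook the
# full `tokens` list could be filtered the same way.
sample_tokens = ["Stemming", "is", "a", "text", "(", "NLP", ")", ",", "``", "stem", ".", "''"]

# str.isalpha() is True only for non-empty strings made entirely of letters,
# so brackets, commas, periods, and quote tokens are dropped.
word_tokens = [tok for tok in sample_tokens if tok.isalpha()]
print(word_tokens)
```

Note that `str.isalpha()` also rejects tokens containing digits or hyphens, so a looser test may be preferable for text where such tokens matter.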


In [14]:
from nltk.stem import LancasterStemmer

# Rebind `stemmer` to the Lancaster stemmer; the Porter stems stay in stemmed_words_Po
stemmer = LancasterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]

# Print each token with its Lancaster stem and its Porter stem side by side
for token, l_stem, p_stem in zip(tokens, stemmed_words, stemmed_words_Po):
    print(f"Original: {token}\t\tLStemmed: {l_stem}\t\tPoStemme:{p_stem}")

Original: Stemming		LStemmed: stem		PoStemme:stem
Original: is		LStemmed: is		PoStemme:is
Original: a		LStemmed: a		PoStemme:a
Original: text		LStemmed: text		PoStemme:text
Original: normalization		LStemmed: norm		PoStemme:normal
Original: technique		LStemmed: techn		PoStemme:techniqu
Original: used		LStemmed: us		PoStemme:use
Original: in		LStemmed: in		PoStemme:in
Original: natural		LStemmed: nat		PoStemme:natur
Original: language		LStemmed: langu		PoStemme:languag
Original: processing		LStemmed: process		PoStemme:process
Original: (		LStemmed: (		PoStemme:(
Original: NLP		LStemmed: nlp		PoStemme:nlp
Original: )		LStemmed: )		PoStemme:)
Original: and		LStemmed: and		PoStemme:and
Original: information		LStemmed: inform		PoStemme:inform
Original: retrieval		LStemmed: retriev		PoStemme:retriev
Original: to		LStemmed: to		PoStemme:to
Original: reduce		LStemmed: reduc		PoStemme:reduc
Original: words		LStemmed: word		PoStemme:word
Original: to		LStemmed: to		PoStemme:to
Original: their		LS