<a href="https://colab.research.google.com/github/goel4ever/machine-learning-notebooks/blob/main/nlp_stemming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Stemming

`Stemming` is a text processing task in which you reduce words to their root, the core part of a word. For example, the words "helping" and "helper" share the root "help". Stemming allows you to zero in on the basic meaning of a word.
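To make the idea concrete, here is a deliberately naive sketch that strips a few common suffixes. The function `naive_stem` and its suffix list are hypothetical and written only for illustration; real stemmers such as Porter's apply many more rules and conditions.

```python
# A toy suffix stripper: an illustration of the idea only, NOT the Porter algorithm.
# The suffix list is an assumption chosen to fit the example words.
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "her" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("helping", "helper", "helped", "helps", "help"):
    print(word, "->", naive_stem(word))
```

Every word in the loop ends up as `help`. A rule set this crude breaks quickly on real text, which is why the rest of the notebook uses a proper stemmer.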

This notebook applies stemming to a short piece of text. NLTK ships with more than one stemmer; we'll use the `Porter stemmer` here.

We'll use the `NLTK` package for the implementation. A group of texts is called a `corpus`. NLTK provides several corpora covering everything from novels hosted by Project Gutenberg to inaugural speeches by presidents of the United States.

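Before NLTK can process text, it needs data resources that are downloaded separately from the package itself. A one-off run of `nltk.download()` with no arguments opens NLTK's downloader, from which you can fetch all the resources in one go. Note: that takes some time. This notebook only needs the `punkt` tokenizer models, so we download just those.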

In [7]:
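# Download the Punkt tokenizer models.
# Punkt is a machine-learning-based tokenizer with models trained on a variety of European languages.
# It works well for many Western languages. word_tokenize uses it to split text into sentences
# before breaking each sentence into words (including splitting contractions).
# Newer NLTK releases may look for 'punkt_tab' instead; if word_tokenize raises a LookupError,
# download the resource named in the error message.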
import nltk
nltk.download('punkt')

# Required imports
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [8]:
# Instantiate a stemmer
stemmer = PorterStemmer()

In [9]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do.
"""

In [11]:
# Tokenize the string
words = word_tokenize(string_for_stemming)
words

['The',
 'crew',
 'of',
 'the',
 'USS',
 'Discovery',
 'discovered',
 'many',
 'discoveries',
 '.',
 'Discovering',
 'is',
 'what',
 'explorers',
 'do',
 '.']

In [12]:
# Stem the words
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

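Those results look a little inconsistent. Why would `Discovery` give you `discoveri` when `Discovering` gives you `discov`? The Porter stemmer applies its suffix rules in a fixed order: `Discovering` loses `-ing` and then `-er`, whereas `Discovery` only has its final `y` rewritten to `i`, and no later rule strips what remains.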

Understemming and overstemming are two ways stemming can go wrong:
1. `Understemming` happens when two related words should be reduced to the same stem but aren't. This is a `false negative`; `discoveri` versus `discov` above is an example.
2. `Overstemming` happens when two unrelated words are reduced to the same stem even though they shouldn't be. This is a `false positive`.
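One way to spot both problems is to group words by the stem they produce. The sketch below is a minimal illustration, not part of the original notebook: `group_by_stem` is a hypothetical helper, and the second word list is an assumed example of words with distinct meanings that Porter's rules collapse to a single stem.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of input words that produced it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Related words that land under different stems point to understemming.
related = ["discovery", "discoveries", "discovered", "discovering"]
print(group_by_stem(related))

# Words with distinct meanings that share one stem point to overstemming.
distinct = ["universe", "university", "universal"]
print(group_by_stem(distinct))
```

If the first call returns more than one group, the stemmer has split a family of related words. If the second call returns a single group, it has merged words you might want to keep apart.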

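The Porter stemming algorithm is a little on the older side: it was published in 1980. The `Snowball stemmer`, also called `Porter2`, is an improvement on the original and is also available in NLTK. Keep in mind that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word, so stems such as `discoveri` are expected.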
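To see how the two stemmers compare, you can run them side by side. This is a minimal sketch using NLTK's `SnowballStemmer` with the `"english"` language setting; the sample words are an assumption chosen for illustration, and the two stemmers may or may not agree on any given word.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # NLTK's English Snowball (Porter2) stemmer

# Sample words are an assumption chosen for illustration.
sample_words = ["generously", "fairly", "discovery", "discovering"]

for word in sample_words:
    print(f"{word:12} porter={porter.stem(word):12} snowball={snowball.stem(word)}")
```

Neither stemmer needs an extra `nltk.download()` call, since both are rule-based and ship with the package.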

Fortunately, there are some other ways to reduce words to their core meaning, such as `lemmatizing`.