How can we normalize text for analysis by reducing the related forms of a word to a single word?

Stemming and lemmatisation are two techniques for doing this. Both are used in SEO, tagging, and indexing systems.

# Stemming
***
**Stemming is about stripping suffixes.**

Many words are variations of the same root, so stemming removes characters from the end of a word until what remains is a common form that can represent the whole group. That remaining form is known as the _stem_. Unlike a lemma, a stem does not have to be a dictionary word.

For example, the words "blogging", "blogged", and "blogs" can all be reduced to the single root word "blog".
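
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. The name `toy_stem`, the suffix list, and the minimum stem length are assumptions made purely for illustration; this is not how any of the established algorithms work, which apply many more rules.

```python
# Toy suffix stripper, for illustration only.
# SUFFIXES and MIN_STEM_LENGTH are hypothetical choices, not taken from a real algorithm.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Leave short words such as "sing" or "bed" untouched.
        if len(stem) < MIN_STEM_LENGTH:
            continue
        # Undo the consonant doubling left behind by "ing"/"ed": "blogg" -> "blog".
        if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
            stem = stem[:-1]
        return stem
    return word


for word in ["blogging", "blogged", "blogs"]:
    print(word, "->", toy_stem(word))
```

This toy version reduces the three example words to the same form, but it will mishandle plenty of ordinary English words, which is why purpose-built algorithms are used in practice.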

Some of the most common stemming algorithms are Porter, Lancaster and Snowball. 

Here is an example of using the Snowball stemmer:

```python
from nltk import SnowballStemmer

stemmer = SnowballStemmer('english')

for word in ['blogging','blogged','blogs']:
    print(stemmer.stem(word))
```

```
blog
blog
blog
```


As we can see, all three words share the stem "blog". Stemming removes the suffixes, which add no new information for this kind of analysis, and so reveals the similarity between the words.
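
The Porter, Lancaster, and Snowball stemmers mentioned earlier are all available in `nltk.stem`, so one way to compare them is to run the same words through each. The sketch below assumes `nltk` is installed; the dictionary keys and the `stems` variable are names chosen here for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer; the keys are just labels for printing.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["blogging", "blogged", "blogs"]

# Map each algorithm name to the stems it produces for the same input words.
stems = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, result in stems.items():
    print(f"{name:>9}: {result}")
```

The algorithms differ in how aggressively they strip, so it is worth running a comparison like this on a sample of your own vocabulary before choosing one.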

# Lemmatisation
***
Lemmatisation is the process of mapping the different forms of a word to a single base form, known as the _lemma_. Rather than simply stripping characters, it transforms each word to its dictionary base form, so the result is always a real word.
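
As a rough illustration of what "dictionary base form" means, the sketch below uses a tiny hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and exist only for this example; a real lemmatizer relies on a full lexical database and usually on part-of-speech information as well.

```python
# Hypothetical lookup table: inflected form -> dictionary base form (lemma).
TOY_LEMMAS = {
    "blogs": "blog",
    "blogged": "blog",
    "blogging": "blog",
    "mice": "mouse",
    "better": "good",
    "was": "be",
}


def toy_lemmatize(word: str) -> str:
    """Return the lemma if the word is in the table, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word.lower(), word)


for word in ["blogs", "mice", "better", "Was", "keyboard"]:
    print(word, "->", toy_lemmatize(word))
```

Entries such as "mice" and "better" show why a lookup is needed: no amount of suffix stripping turns them into "mouse" or "good".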

Here is an example of using the lemmatizer built into `nltk`:

```python
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("blogs"))
```

```
blog
```


# Summary
***
Both of these techniques reduce related word forms to a single common word. Stemming does so by stripping suffixes, while lemmatisation maps each word to its dictionary base form.