# Stemming

It's often (but not always) useful to reduce words to their roots. One reason for doing this may be that word tense or conjugation is not important for your model. It would be useful to combine variations of a word together. Then for models like Naive Bayes where each word is a feature, we can strongly reduce our feature space.

Let's see what this looks like. First, let's tokenize a bit of text from the wikipedia page on data science.

In [2]:
from nltk.tokenize import wordpunct_tokenize  # for tokenizing our text

In [3]:
# sample text from wikipedia
import codecs
text = codecs.open('../data/nlp_data/sample.txt', "r", "utf-8").read()
# text

In [5]:
print text

Data science
From Wikipedia, the free encyclopedia

Data Science Venn Diagram
Data science is the study of the generalizable extraction of knowledge from data,[1] yet the key word is science.[2] It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. The subject is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science.

A practitioner of data science is called a data scientist. Data scientists solve complex data problems through employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work w

In [6]:
word_bag = wordpunct_tokenize(text)
print 'a few tokens:', word_bag[:10]
print 'number of tokens:', len(word_bag)
print 'number of unique tokens:', len(set(word_bag))

a few tokens: [u'Data', u'science', u'From', u'Wikipedia', u',', u'the', u'free', u'encyclopedia', u'Data', u'Science']
number of tokens: 1684
number of unique tokens: 665


Look for common word endings to clip off. Start with the suffix, '-s', '-er', '-ing'. But be careful to only strip these tokens when they appear at the end of the word. Write rules into the function below.

In [7]:
# define a function to stem tokens based on rules.

def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''
    
    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-2])
        else:
            new_bag.append(token)

    return new_bag

In [8]:
# Check how well you're doing by running this cell:

print 'initial number of unique tokens:', len(set(word_bag))
print 'stemmed number of unique tokens:', len(set(stem(word_bag)))

initial number of unique tokens: 665
stemmed number of unique tokens: 644


In [9]:
# Do we have to refine our rules? Are we stripping away too many letters? Run this cell to see

# for token in stem(word_bag):
#     print token

Feel free to add more rules and see how much you can pare down the feature set, i.e. the number of unique tokens. Try not to strip too much off the words!

## Porter Stemmer

The classic stemmer is the Porter stemmer which is [available in NLTK](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter). Others are available, too

In [10]:
from nltk.stem.porter import PorterStemmer

In [11]:
# Run this cell to see how the Porter Stemmer performs.

ps = PorterStemmer()

print 'initial number of unique tokens:', len(set(word_bag))
print 'stemmed number of unique tokens:', len({ps.stem(token) for token in word_bag})  # this uses a set comprehension

initial number of unique tokens: 665
stemmed number of unique tokens: 601


In [13]:
# examine how weird the tokens get

# for token in word_bag:
#     print ps.stem(token)