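This notebook demonstrates stemming with NLTK. To get started, we first download the `punkt` tokenizer models that `nltk.word_tokenize` depends on. On recent NLTK releases, `word_tokenize` may instead look for the `punkt_tab` resource, in which case `nltk.download('punkt_tab')` is needed as well.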

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The `punkt` tokenizer divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. `nltk.word_tokenize` relies on it to split sentences before splitting words.

Next, we import the three stemmers from `nltk` and define three sample documents.

In [None]:
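# Import only the classes we use instead of wildcard imports
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer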

d1 = "The quick brown fox jumped over the lazy dog. résumé"
d2 = "The sum of the square of the legs of a right triangle are equal to the square of the hypotenuse. resume"
d3 = "We are what we pretend to be, so we must be careful about what we pretend to be."
docs = [d1, d2, d3]

docs


['The quick brown fox jumped over the lazy dog. résumé',
 'The sum of the square of the legs of a right triangle are equal to the square of the hypotenuse. resume',
 'We are what we pretend to be, so we must be careful about what we pretend to be.']

Each stemmer must be instantiated before it can be used. The `stemmer` variable selects which one the rest of the notebook applies.

In [None]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english") # aka Porter2
lancaster_stemmer = LancasterStemmer()

stemmer = porter_stemmer  # Change to any of the above
stemmer

<PorterStemmer>
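To see how the three stemmers differ, it can help to run them side by side on a few words. The following is a minimal sketch; the word list and the `stemmers` and `comparison` names are chosen here for illustration and are not part of the notebook.

```python
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Illustrative mapping of labels to stemmer instances (names are assumptions)
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# A few words taken from the sample documents
words = ["jumped", "lazy", "triangle", "careful"]

# For each word, collect the stem produced by every stemmer
comparison = {
    word: {name: s.stem(word) for name, s in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(word, stems)
```

The stemmers do not need the `punkt` download; only tokenization does.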

Next we tokenize each document:

In [None]:
# Tokenize each document into a list of word and punctuation tokens
term_list = [nltk.word_tokenize(document) for document in docs]

term_list

[['The',
  'quick',
  'brown',
  'fox',
  'jumped',
  'over',
  'the',
  'lazy',
  'dog',
  '.',
  'résumé'],
 ['The',
  'sum',
  'of',
  'the',
  'square',
  'of',
  'the',
  'legs',
  'of',
  'a',
  'right',
  'triangle',
  'are',
  'equal',
  'to',
  'the',
  'square',
  'of',
  'the',
  'hypotenuse',
  '.',
  'resume'],
 ['We',
  'are',
  'what',
  'we',
  'pretend',
  'to',
  'be',
  ',',
  'so',
  'we',
  'must',
  'be',
  'careful',
  'about',
  'what',
  'we',
  'pretend',
  'to',
  'be',
  '.']]
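The output above shows that punctuation becomes its own token and that accented characters are kept. As a rough illustration of that behavior only, a regular expression can produce the same tokens for these simple sentences. This is a sketch, not how `nltk.word_tokenize` works internally; `simple_tokenize` is a hypothetical helper, and it will diverge from NLTK on contractions, hyphens, and similar cases.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical stand-in for illustration; not equivalent to nltk.word_tokenize.
    """
    return re.findall(r"\w+|[^\w\s]", text)

d1 = "The quick brown fox jumped over the lazy dog. résumé"
tokens = simple_tokenize(d1)
print(tokens)
```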

Then we iterate through each document's token list and stem every token.

In [None]:
# Stem every token of every document with the selected stemmer
stemmed_term_list = [
    [stemmer.stem(word) for word in term_list_for_doc]
    for term_list_for_doc in term_list
]

# Print lists of stemmed terms for each document
for stemmed_doc in stemmed_term_list:
    print(stemmed_doc)


['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.', 'résumé']
['the', 'sum', 'of', 'the', 'squar', 'of', 'the', 'leg', 'of', 'a', 'right', 'triangl', 'are', 'equal', 'to', 'the', 'squar', 'of', 'the', 'hypotenus', '.', 'resum']
['we', 'are', 'what', 'we', 'pretend', 'to', 'be', ',', 'so', 'we', 'must', 'be', 'care', 'about', 'what', 'we', 'pretend', 'to', 'be', '.']
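Note that `résumé` is left unchanged while `resume` becomes `resum`, so the two spellings end up as different terms. If they should be treated as the same term, one option is to strip accents before stemming. The sketch below uses only the standard library; `strip_accents` is a hypothetical helper, and whether removing accents is appropriate depends on the language and the application.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'résumé' -> 'resume' (illustrative helper)."""
    # Decompose characters so that 'é' becomes 'e' plus a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))
```

Applying this to each token before calling `stemmer.stem` would make both spellings go through the stemmer as `resume`.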


Finally, we re-assemble terms in each document to demonstrate stemming:

In [None]:
# Join each document's stemmed tokens back into a single space-separated string
for i, stemmed_terms in enumerate(stemmed_term_list):
    stemmed_docs_separated_by_space = ' '.join(stemmed_terms)
    print(f'D{i} stemmed terms: {stemmed_docs_separated_by_space}')

D0 stemmed terms: the quick brown fox jump over the lazi dog . résumé
D1 stemmed terms: the sum of the squar of the leg of a right triangl are equal to the squar of the hypotenus . resum
D2 stemmed terms: we are what we pretend to be , so we must be care about what we pretend to be .
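The tokenize, stem, and join steps above can be wrapped in one function for reuse. This is a minimal sketch: `stem_text` is a hypothetical helper, and its default tokenizer is plain whitespace splitting (an assumption made so the example runs without the `punkt` download). Passing `tokenize=nltk.word_tokenize` would follow the notebook's approach instead.

```python
from nltk.stem.porter import PorterStemmer

def stem_text(text, stemmer, tokenize=str.split):
    """Tokenize text, stem each token, and re-join with spaces (illustrative helper).

    tokenize defaults to whitespace splitting, which is an assumption of this
    sketch; unlike nltk.word_tokenize, it leaves punctuation attached to words.
    """
    return " ".join(stemmer.stem(token) for token in tokenize(text))

result = stem_text("The quick brown fox jumped over the lazy dog", PorterStemmer())
print(result)
```

Any object with a `stem` method can be passed as `stemmer`, so the Snowball and Lancaster stemmers work the same way.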
