# Stemming Words with NLTK

In [28]:
from nltk import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

import pandas as pd

### PorterStemmer
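* `PorterStemmer` produces stems by suffix stripping: it applies a fixed sequence of rules that remove or rewrite common English suffixes.
* The resulting stem is not guaranteed to be a dictionary word.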
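
The general idea of suffix stripping can be sketched in a few lines of plain Python. This is only a toy illustration, not the Porter algorithm: the suffix list, the `MIN_STEM_LEN` threshold, and the `toy_stem` helper are all assumptions made up for this example.

```python
# Toy suffix stripper: NOT the Porter algorithm, only the general idea.
# The suffix list and minimum stem length are illustrative assumptions.
SUFFIXES = ("ingly", "ing", "ed", "ly", "s")  # longer suffixes are tried first
MIN_STEM_LEN = 3


def toy_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["overwhelming", "overwhelmingly", "hushed", "hush", "connections"]:
    print(w, "->", toy_stem(w))
```

Real stemmers add many more rules and conditions on top of this, which is why their output differs from such a naive version.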

In [29]:
input_tokens = ['overwhelming', 'overwhelmingly', 
                'hushed', 'hush',
                'functional', 'functionally',
                'lying', 'lied',
                'fairly', 
                'destabilize', 'stability',
                'friendship', 'friendships', 'friendly', 'friendless', 
                'connect', 'connections', 'connected',  
                'the', 'these', 'those',
                'motivational', 'motivate', 'motivating']

In [30]:
ps = PorterStemmer()

# Stem every token with the Porter algorithm
ps_stemmed_tokens = [ps.stem(token) for token in input_tokens]

In [31]:
stems_df = pd.DataFrame({
    'words': input_tokens,
    'Porter Stemmer': ps_stemmed_tokens
})

stems_df

|   | words | Porter Stemmer |
|---|-------|----------------|
| 0 | overwhelming | overwhelm |
| 1 | overwhelmingly | overwhelmingli |
| 2 | hushed | hush |
| 3 | hush | hush |
| 4 | functional | function |
| 5 | functionally | function |
| 6 | lying | lie |
| 7 | lied | lie |
| 8 | fairly | fairli |
| 9 | destabilize | destabil |


### LancasterStemmer
* `LancasterStemmer` implements the Paice/Husk algorithm, which applies its rules iteratively and keeps them in a separate rule table rather than hard-coding them.
* The algorithm is simple, but the repeated passes make it aggressive, so over-stemming is common.
* An over-stemmed result may not be a real word, or may no longer carry the meaning of the original word.

In [32]:
ls = LancasterStemmer()

# Stem every token with the Lancaster (Paice/Husk) algorithm
ls_stemmed_tokens = [ls.stem(token) for token in input_tokens]

In [33]:
stems_df = pd.DataFrame({
    'words': input_tokens,
    'Lancaster Stemmer': ls_stemmed_tokens
})

stems_df

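|   | words | Lancaster Stemmer |
|---|-------|-------------------|
| 0 | overwhelming | overwhelm |
| 1 | overwhelmingly | overwhelm |
| 2 | hushed | hush |
| 3 | hush | hush |
| 4 | functional | funct |
| 5 | functionally | funct |
| 6 | lying | lying |
| 7 | lied | lied |
| 8 | fairly | fair |
| 9 | destabilize | dest |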


In [34]:
stems_df = pd.DataFrame({
    'words': input_tokens,
    'Porter Stemmer': ps_stemmed_tokens,
    'Lancaster Stemmer': ls_stemmed_tokens
})

stems_df

|   | words | Porter Stemmer | Lancaster Stemmer |
|---|-------|----------------|-------------------|
| 0 | overwhelming | overwhelm | overwhelm |
| 1 | overwhelmingly | overwhelmingli | overwhelm |
| 2 | hushed | hush | hush |
| 3 | hush | hush | hush |
| 4 | functional | function | funct |
| 5 | functionally | function | funct |
| 6 | lying | lie | lying |
| 7 | lied | lie | lied |
| 8 | fairly | fairli | fair |
| 9 | destabilize | destabil | dest |
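
One rough way to see how much heavier Lancaster stemming is than Porter stemming is to count how many characters each stemmer removed. The sketch below uses only the ten rows shown in the comparison table above; `chars_removed` is a hypothetical helper, and length difference is a crude measure, since some rules rewrite an ending rather than simply dropping it (for example `lying` to `lie`).

```python
# Rows copied from the Porter/Lancaster comparison table: (word, porter, lancaster)
rows = [
    ("overwhelming", "overwhelm", "overwhelm"),
    ("overwhelmingly", "overwhelmingli", "overwhelm"),
    ("hushed", "hush", "hush"),
    ("hush", "hush", "hush"),
    ("functional", "function", "funct"),
    ("functionally", "function", "funct"),
    ("lying", "lie", "lying"),
    ("lied", "lie", "lied"),
    ("fairly", "fairli", "fair"),
    ("destabilize", "destabil", "dest"),
]


def chars_removed(word, stem):
    """Length difference between a word and its stem (hypothetical helper)."""
    return len(word) - len(stem)


porter_total = sum(chars_removed(word, porter) for word, porter, _ in rows)
lancaster_total = sum(chars_removed(word, lanc) for word, _, lanc in rows)

print("Porter removed:", porter_total, "characters")
print("Lancaster removed:", lancaster_total, "characters")
```

On these rows Lancaster removes noticeably more characters overall, even though it leaves `lying` and `lied` untouched.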


### SnowballStemmer
* Snowball is a small language for writing stemming rules, which makes it possible to define a stemmer for almost any language.
* NLTK's `SnowballStemmer` exposes stemmers built this way, so it can be used for non-English text as well as English.

In [35]:
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
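
Because a stemmer can only be created for a language in this tuple, it can help to normalize and check a requested language first. The sketch below is a minimal illustration: `SUPPORTED` is copied from the output above, while `resolve_language` and its `fallback` parameter are hypothetical names introduced for this example.

```python
# Copied from the SnowballStemmer.languages output above.
SUPPORTED = (
    'arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german',
    'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian',
    'russian', 'spanish', 'swedish',
)


def resolve_language(requested, supported=SUPPORTED, fallback="english"):
    """Return a normalized language name, or the fallback if unsupported."""
    name = requested.strip().lower()
    return name if name in supported else fallback


print(resolve_language(" German "))
print(resolve_language("klingon"))
```

The returned name can then be passed to `SnowballStemmer(...)` in place of a hard-coded string.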


In [36]:
ss = SnowballStemmer('english')

# Stem every token with the English Snowball stemmer
ss_stemmed_tokens = [ss.stem(token) for token in input_tokens]

In [37]:
stems_df = pd.DataFrame({
    'words': input_tokens,
    'Snowball Stemmer': ss_stemmed_tokens
})

stems_df

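|   | words | Snowball Stemmer |
|---|-------|------------------|
| 0 | overwhelming | overwhelm |
| 1 | overwhelmingly | overwhelm |
| 2 | hushed | hush |
| 3 | hush | hush |
| 4 | functional | function |
| 5 | functionally | function |
| 6 | lying | lie |
| 7 | lied | lie |
| 8 | fairly | fair |
| 9 | destabilize | destabil |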


In [38]:
stems_df = pd.DataFrame({
    'words': input_tokens,
    'Porter Stemmer': ps_stemmed_tokens,
    'Lancaster Stemmer': ls_stemmed_tokens,
    'Snowball Stemmer': ss_stemmed_tokens
})

stems_df

|   | words | Porter Stemmer | Lancaster Stemmer | Snowball Stemmer |
|---|-------|----------------|-------------------|------------------|
| 0 | overwhelming | overwhelm | overwhelm | overwhelm |
| 1 | overwhelmingly | overwhelmingli | overwhelm | overwhelm |
| 2 | hushed | hush | hush | hush |
| 3 | hush | hush | hush | hush |
| 4 | functional | function | funct | function |
| 5 | functionally | function | funct | function |
| 6 | lying | lie | lying | lie |
| 7 | lied | lie | lied | lie |
| 8 | fairly | fairli | fair | fair |
| 9 | destabilize | destabil | dest | destabil |
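
To pick out where the three stemmers diverge, the rows can be filtered for disagreement. This is a small sketch over the ten rows displayed above only; the variable names are chosen for this example.

```python
# Rows copied from the three-way table: (word, porter, lancaster, snowball)
rows = [
    ("overwhelming", "overwhelm", "overwhelm", "overwhelm"),
    ("overwhelmingly", "overwhelmingli", "overwhelm", "overwhelm"),
    ("hushed", "hush", "hush", "hush"),
    ("hush", "hush", "hush", "hush"),
    ("functional", "function", "funct", "function"),
    ("functionally", "function", "funct", "function"),
    ("lying", "lie", "lying", "lie"),
    ("lied", "lie", "lied", "lie"),
    ("fairly", "fairli", "fair", "fair"),
    ("destabilize", "destabil", "dest", "destabil"),
]

# A row is a disagreement when the three stems are not all identical.
disagreements = [row for row in rows if len(set(row[1:])) > 1]

# Count how often Snowball matches each of the other two stemmers.
snowball_matches_porter = sum(1 for _, p, _, s in rows if s == p)
snowball_matches_lancaster = sum(1 for _, _, l, s in rows if s == l)

for word, porter, lancaster, snowball in disagreements:
    print(f"{word}: porter={porter} lancaster={lancaster} snowball={snowball}")
```

On these rows Snowball agrees with Porter more often than with Lancaster, differing from Porter only on `overwhelmingly` and `fairly`.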


In [39]:
with open('./datasets/stemming.txt', 'r') as f:
    file_contents = f.read()

print(file_contents)

Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules.

It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priorit

In [40]:
word_tokens = word_tokenize(file_contents)

In [41]:
# ignore_stopwords=True returns stopwords unstemmed
# (this requires the NLTK 'stopwords' corpus to be downloaded)
ss = SnowballStemmer('english', ignore_stopwords=True)

ss_stemmed_words = [ss.stem(word) for word in word_tokens]
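
The effect of skipping stopwords can be illustrated without NLTK. In the sketch below, the stopword set is a tiny made-up sample (NLTK uses its own stopwords corpus), and `strip_plural_or_ing` is a toy stand-in for a real stemmer; both are assumptions for illustration only.

```python
# Hypothetical, tiny stopword set; NLTK loads a much larger list from its corpus.
STOPWORDS = {"the", "is", "does", "this", "having"}


def strip_plural_or_ing(word):
    """Toy stand-in for a stemmer: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_skipping_stopwords(word, stopwords=STOPWORDS):
    """Lowercase the word, return stopwords as-is, stem everything else."""
    lowered = word.lower()
    if lowered in stopwords:
        return lowered
    return strip_plural_or_ing(lowered)


for w in ["does", "this", "Rules", "stripping"]:
    print(w, "->", stem_skipping_stopwords(w), "| without skipping:",
          strip_plural_or_ing(w.lower()))
```

Without the stopword check, short function words such as `does` and `this` would be mangled along with everything else.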

In [42]:
" ".join(ss_stemmed_words)

"suffix strip algorithm may differ in result for a varieti of reason . one such reason is whether the algorithm constrain whether the output word must be a real word in the given languag . some approach do not requir the word to actual exist in the languag lexicon ( the set of all word in the languag ) . altern , some suffix strip approach maintain a databas ( a larg list ) of all known morpholog word root that exist as real word . these approach check the list for the exist of the term prior to make a decis . typic , if the term does not exist , altern action is taken . this altern action may involv sever other criteria . the non-exist of an output term may serv to caus the algorithm to tri altern suffix strip rule . it can be the case that two or more suffix strip rule appli to the same input term , which creat an ambigu as to which rule to appli . the algorithm may assign ( by human hand or stochast ) a prioriti to one rule or anoth . or the algorithm may reject one rule applic beca