# Stemming & Lemmatization

Both stemming and lemmatization are methods for normalizing documents on a syntactic level. The same word often appears in different forms for grammatical reasons. Consider the following two sentences:

- Dogs make the best friends.
- A dog makes a good friend.

Semantically, both sentences convey essentially the same message, but syntactically they are very different because their vocabularies differ: "dogs" vs. "dog", "make" vs. "makes", "friends" vs. "friend". This is a big problem when comparing documents or searching for documents in a database. For example, when one uses "dog" as a search term, both sentences should be returned, not just the second one.
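The search problem can be made concrete with a minimal sketch. The `naive_search()` helper below is hypothetical and only does exact token matching on the two example sentences; it is meant as an illustration, not as a real search implementation.

```python
sentences = ["Dogs make the best friends.", "A dog makes a good friend."]

def naive_search(term, docs):
    """Return all documents that contain the exact search term as a token."""
    hits = []
    for doc in docs:
        # Lowercase, drop the trailing period, and split on whitespace
        tokens = doc.lower().rstrip(".").split()
        if term in tokens:
            hits.append(doc)
    return hits

# Only the second sentence matches, because "dogs" != "dog"
print(naive_search("dog", sentences))
```

Without normalization, the first sentence is missed even though it is clearly about dogs.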

While the goals of stemming and lemmatization are similar, there are basic differences:

 - **Stemming:** Usually just applies crude heuristics that chop off the ends of words. This may result in terms that are no longer proper words.
 - **Lemmatization:** Uses vocabularies and morphological analysis of words to derive the base form (lemma) of a term.
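To make the difference tangible, here is a toy sketch that does not use NLTK. The suffix list in `crude_stem()` and the mini-vocabulary in `LEMMA_TABLE` are assumptions made up for this illustration; real stemmers use far more elaborate rules, and real lemmatizers rely on a full vocabulary.

```python
# Assumed, deliberately crude rule set (not the Porter rules)
SUFFIXES = ["ing", "ed", "es", "s"]

def crude_stem(word):
    """Chop off the first matching suffix, whether or not the result is a proper word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-vocabulary mapping inflected forms to their lemmas
LEMMA_TABLE = {"mice": "mouse", "went": "go", "makes": "make", "dogs": "dog"}

def lookup_lemma(word):
    """Return the lemma if the word is in the vocabulary, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["dogs", "makes", "mice", "went"]:
    print(word, "-> stem:", crude_stem(word), "| lemma:", lookup_lemma(word))
```

The heuristic turns "makes" into "mak", which is not a proper word, and cannot handle irregular forms such as "mice" or "went", while the vocabulary lookup can.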

## Import all important packages

In [1]:
import string

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

from nltk.stem import WordNetLemmatizer

from utils.nlputil import remove_punctuation

In [2]:
print (remove_punctuation("Test, 123."))

Test 123


## Stemming

We first define a few stemmers provided by NLTK.

For more stemmers, see http://www.nltk.org/api/nltk.stem.html

In [3]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Put all stemmers into a list to make their use easier
stemmer_list = [porter_stemmer, snowball_stemmer, lancaster_stemmer]

In [4]:
word_list = ['dogs', 'cats', 'running', 'phones', 'viewed', 'presumably', 'crying', 'went', 'packed', 'worse', 'best', 'mice', 'friends', 'makes']

In [5]:
for word in word_list:
    print (word + ':')
    for stemmer in stemmer_list:
        stemmed_word = stemmer.stem(word)
        print ('\t', stemmed_word)

dogs:
	 dog
	 dog
	 dog
cats:
	 cat
	 cat
	 cat
running:
	 run
	 run
	 run
phones:
	 phone
	 phone
	 phon
viewed:
	 view
	 view
	 view
presumably:
	 presum
	 presum
	 presum
crying:
	 cri
	 cri
	 cry
went:
	 went
	 went
	 went
packed:
	 pack
	 pack
	 pack
worse:
	 wors
	 wors
	 wors
best:
	 best
	 best
	 best
mice:
	 mice
	 mice
	 mic
friends:
	 friend
	 friend
	 friend
makes:
	 make
	 make
	 mak


## Lemmatization

The output of a lemmatizer generally depends on the type of word (noun, verb, or adjective). For example, when "running" is used as an adjective (e.g., "a running tap"), the word is already in its base form. However, when "running" is used as a verb (e.g., "he was running away"), the base form is "run".

In [6]:
wordnet_lemmatizer = WordNetLemmatizer()

In [7]:
word_type_list = ['n', 'v', 'a']

for word in word_list:
    print (word + ':')
    for word_type in word_type_list:
        lemmatized_word = wordnet_lemmatizer.lemmatize(word, pos=word_type) # default is 'n'
        print ('\t', word, '=[{}]=>'.format(word_type), lemmatized_word)

dogs:
	 dogs =[n]=> dog
	 dogs =[v]=> dog
	 dogs =[a]=> dogs
cats:
	 cats =[n]=> cat
	 cats =[v]=> cat
	 cats =[a]=> cats
running:
	 running =[n]=> running
	 running =[v]=> run
	 running =[a]=> running
phones:
	 phones =[n]=> phone
	 phones =[v]=> phone
	 phones =[a]=> phones
viewed:
	 viewed =[n]=> viewed
	 viewed =[v]=> view
	 viewed =[a]=> viewed
presumably:
	 presumably =[n]=> presumably
	 presumably =[v]=> presumably
	 presumably =[a]=> presumably
crying:
	 crying =[n]=> cry
	 crying =[v]=> cry
	 crying =[a]=> crying
went:
	 went =[n]=> went
	 went =[v]=> go
	 went =[a]=> went
packed:
	 packed =[n]=> packed
	 packed =[v]=> pack
	 packed =[a]=> packed
worse:
	 worse =[n]=> worse
	 worse =[v]=> worse
	 worse =[a]=> bad
best:
	 best =[n]=> best
	 best =[v]=> best
	 best =[a]=> best
mice:
	 mice =[n]=> mouse
	 mice =[v]=> mice
	 mice =[a]=> mice
friends:
	 friends =[n]=> friend
	 friends =[v]=> friends
	 friends =[a]=> friends
makes:
	 makes =[n]=> make
	 makes =[v]=> make
	 makes =[a]=> makes

To show a complete example, we look ahead and use a Part-of-Speech (POS) tagger that tells us the type of each word in a sentence (see the follow-up tutorial for more details).

In [8]:
from nltk import word_tokenize
from nltk import pos_tag

In [9]:
sentence = "The newest study has shown that cats have a better sense of smell than dogs."
#sentence = "Dogs make the best friends."

In [10]:
# First, tokenize sentence
token_list = word_tokenize(sentence)

# Second, calculate POS tags for each token
pos_tag_list = pos_tag(token_list)

print (pos_tag_list)

[('The', 'DT'), ('newest', 'JJS'), ('study', 'NN'), ('has', 'VBZ'), ('shown', 'VBN'), ('that', 'IN'), ('cats', 'NNS'), ('have', 'VBP'), ('a', 'DT'), ('better', 'JJR'), ('sense', 'NN'), ('of', 'IN'), ('smell', 'NN'), ('than', 'IN'), ('dogs', 'NNS'), ('.', '.')]


The POS tagger distinguishes several dozen word types. However, we are only interested in whether a word is a noun, verb, or adjective. We therefore need to map the output of the POS tagger to the 3 valid options "n", "v", and "a".

In [11]:
print ('\nOutput of NLTK lemmatizer:\n')
for token, tag in pos_tag_list:
    word_type = 'n'
    tag_simple = tag[0].lower() # Converts, e.g., "VBD" to "v"
    if tag_simple in ['n', 'v']:
        # If the POS tag starts with "n" or "v", we know it's a noun or verb
        word_type = tag_simple 
    elif tag_simple in ['j']:
        # If the POS tag starts with a "j", we know it's an adjective
        word_type = 'a' 
    lemmatized_token = wordnet_lemmatizer.lemmatize(token.lower(), pos=word_type)
    print(token, '=[{}]==[{}]=>'.format(tag, word_type), lemmatized_token)


Output of NLTK lemmatizer:

The =[DT]==[n]=> the
newest =[JJS]==[a]=> new
study =[NN]==[n]=> study
has =[VBZ]==[v]=> have
shown =[VBN]==[v]=> show
that =[IN]==[n]=> that
cats =[NNS]==[n]=> cat
have =[VBP]==[v]=> have
a =[DT]==[n]=> a
better =[JJR]==[a]=> good
sense =[NN]==[n]=> sense
of =[IN]==[n]=> of
smell =[NN]==[n]=> smell
than =[IN]==[n]=> than
dogs =[NNS]==[n]=> dog
. =[.]==[n]=> .
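
The tag-mapping logic used in the loop above can also be factored into a small reusable helper. The function name `penn_to_wordnet_pos()` and the `default` parameter are assumptions for this sketch; it only covers the three word types discussed here and falls back to noun for everything else, just like the loop.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank POS tag (e.g., 'VBD') to one of 'n', 'v', 'a'."""
    # Only the first letter of the tag matters for this coarse mapping
    first_letter = tag[0].lower() if tag else ""
    mapping = {"n": "n", "v": "v", "j": "a"}
    # Any other tag (determiners, prepositions, punctuation, ...) gets the default
    return mapping.get(first_letter, default)

for tag in ["NNS", "VBD", "JJR", "DT", "."]:
    print(tag, "=>", penn_to_wordnet_pos(tag))
```

With such a helper, the lemmatizer call could read `wordnet_lemmatizer.lemmatize(token.lower(), pos=penn_to_wordnet_pos(tag))`.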


## Lemmatization with spaCy

In [12]:
import spacy

In [13]:
nlp = spacy.load('en_core_web_sm')

spaCy performs lemmatization by default when processing a document; no additional commands are needed.

In [14]:
print ('\nOutput of spaCy lemmatizer:')
doc = nlp(sentence) # doc is an object, not just a simple list
for token in doc:
    print (token.text, '={}=>'.format(token.pos_), token.lemma_) # token is also an object, not a string



Output of spaCy lemmatizer:
The =DET=> the
newest =ADJ=> new
study =NOUN=> study
has =VERB=> have
shown =VERB=> show
that =ADP=> that
cats =NOUN=> cat
have =VERB=> have
a =DET=> a
better =ADJ=> good
sense =NOUN=> sense
of =ADP=> of
smell =NOUN=> smell
than =ADP=> than
dogs =NOUN=> dog
. =PUNCT=> .


Notice that in the output above the spaCy lemmatizer yields the same lemmas as the NLTK lemmatizer: "better" is converted to "good" and "newest" to "new". Results may differ depending on the spaCy version and model used. The spaCy lemmatizer also converts all tokens to lowercase here, which typically does not matter.

## Application use case: document similarity

The helper method `preprocess_text()` takes a document as input and returns its normalized words, either as a string or as a set of words (i.e., no duplicates). Called with a `stemmer`, it stems each word; called with a `lemmatizer`, it lemmatizes each word. The method simply puts together all the individual steps shown previously.

In [15]:
from utils.nlputil import preprocess_text

Print some example output for both variants.

In [16]:
# Show example output of preprocess_text() with stemming
print (preprocess_text(sentence, stemmer=porter_stemmer))

# Show example output of preprocess_text() with lemmatization
print (preprocess_text(sentence, lemmatizer=wordnet_lemmatizer))

newest studi ha shown cat better sens smell dog
new study show cat good sense smell dog
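
The source of `preprocess_text()` in `utils.nlputil` is not shown here. As a rough idea of what such a helper might do, the following is a hypothetical, standard-library-only sketch; the function name `preprocess_text_sketch()`, the `normalize` parameter, and the tiny stopword list are all assumptions, and the real helper may well work differently.

```python
import string

# Assumed, tiny stopword list for illustration only
STOPWORDS = {"the", "a", "an", "of", "that", "than"}

def preprocess_text_sketch(text, normalize=None, return_type="str"):
    """Hypothetical stand-in for a helper like preprocess_text().

    normalize: optional callable applied to each token,
    e.g. a stemming or lemmatizing function.
    """
    # Lowercase the text and strip all punctuation characters
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace and drop stopwords
    tokens = [t for t in cleaned.split() if t not in STOPWORDS]
    # Apply the normalization step, if one was given
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    if return_type == "set":
        return set(tokens)
    return " ".join(tokens)

example = "The newest study has shown that cats have a better sense of smell than dogs."
print(preprocess_text_sketch(example))
# Toy normalizer that strips trailing "s" characters (not a real stemmer)
print(preprocess_text_sketch(example, normalize=lambda t: t.rstrip("s"), return_type="set"))
```

With NLTK loaded as above, one could pass `porter_stemmer.stem` or `wordnet_lemmatizer.lemmatize` as the `normalize` argument instead of the toy function.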


To calculate the similarity between two documents, let's define a second sentence that is semantically similar to the first one, but not syntactically.

In [17]:
# sentence = "The newest study has shown that cats have a better sense of smell than dogs."
sentence_2 = "Some studies show that a cat can smell better than a dog."

For both sentences, we can calculate all 3 different word sets:
- naive (only simple tokenizing)
- stemmed
- lemmatized

In [18]:
naive_word_set_1 = set(word_tokenize(sentence.lower()))
naive_word_set_2 = set(word_tokenize(sentence_2.lower()))

stemmed_word_set_1 = preprocess_text(sentence, stemmer=porter_stemmer, return_type='set')
stemmed_word_set_2 = preprocess_text(sentence_2, stemmer=porter_stemmer, return_type='set')

lemmatized_word_set_1 = preprocess_text(sentence, lemmatizer=wordnet_lemmatizer, return_type='set')
lemmatized_word_set_2 = preprocess_text(sentence_2, lemmatizer=wordnet_lemmatizer, return_type='set')

print (naive_word_set_1)
print (stemmed_word_set_1)
print (lemmatized_word_set_1)

{'shown', 'cats', 'study', '.', 'sense', 'the', 'than', 'smell', 'better', 'have', 'that', 'newest', 'has', 'dogs', 'a', 'of'}
{'dog', 'shown', 'sens', 'cat', 'smell', 'better', 'ha', 'newest', 'studi'}
{'dog', 'good', 'study', 'cat', 'sense', 'smell', 'new', 'show'}


In [19]:
def jaccard_similarity(word_set_1, word_set_2):
    union_set = word_set_1.union(word_set_2)
    intersection_set = word_set_1.intersection(word_set_2)
    similarity = len(intersection_set) / len(union_set)
    return similarity
    

To quantify the similarity between two word sets A and B, we can use the *Jaccard similarity* J(A,B), defined as:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

Intuitively, if A and B are completely different, the size of the intersection $|A\cap B|$ is 0, making the similarity 0. If A and B are identical, the intersection and the union have the same size, making the similarity 1.0.
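
A small worked example with made-up word sets illustrates the formula and the two boundary cases. The `jaccard()` function below is a sketch; unlike the formula, it returns 0.0 when both sets are empty, which is an assumption made here to avoid a division by zero.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; returns 0.0 if both sets are empty (an assumption)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

A = {"dog", "cat", "smell"}
B = {"dog", "cat", "study", "show"}

print(jaccard(A, B))         # 2 shared words / 5 distinct words = 0.4
print(jaccard(A, A))         # identical sets: 1.0
print(jaccard(A, {"fish"}))  # no shared words: 0.0
```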

In [21]:
print (jaccard_similarity(naive_word_set_1, naive_word_set_2))
print (jaccard_similarity(stemmed_word_set_2, stemmed_word_set_1))
print (jaccard_similarity(lemmatized_word_set_1, lemmatized_word_set_2))

0.2727272727272727
0.5
0.75
