
# Stemming & Lemmatization

Both stemming and lemmatization are methods for normalizing documents on a syntactic level. The same word is often used in different forms depending on its grammatical role in a sentence. Consider the following two sentences:

- Dogs make the best friends.
- A dog makes a good friend.

Semantically, both sentences convey essentially the same message, but syntactically they are very different because the vocabulary differs: "dogs" vs. "dog", "make" vs. "makes", "friends" vs. "friend". This is a big problem when comparing documents or when searching for documents in a database. For example, when one uses "dog" as a search term, both sentences should be returned and not just the second one.



While the goals of stemming and lemmatization are similar, there are basic differences:

 - **Stemming:** *Stemming chops off parts of a word using fixed rules, without worrying much about grammar or correctness. It is fast but a bit rough.*

 Example:

"playing" → "play"

"happiness" → "happi"

"only" → "onli" (not a valid word)

 A stemmer may not return a valid word; it just strips suffixes using rules.
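
To make the "cut suffixes using rules" idea concrete, here is a minimal sketch of a rule-based suffix stripper. It is a toy written for illustration and is not the Porter algorithm; the rule table `SUFFIX_RULES` and the function `toy_stem` are hypothetical names chosen here.

```python
# Toy suffix-stripping stemmer for illustration only (NOT the Porter algorithm).
# Each rule is (suffix to remove, replacement); the first matching rule wins.
SUFFIX_RULES = [
    ("iness", "i"),  # happiness -> happi
    ("ing", ""),     # playing   -> play
    ("ed", ""),      # played    -> play
    ("ly", "li"),    # only      -> onli
    ("s", ""),       # dogs      -> dog
]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain in front of the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["playing", "played", "happiness", "only", "dogs", "mice"]:
    print(w, "->", toy_stem(w))
```

The rules know nothing about vocabulary, so "mice" passes through unchanged and "only" becomes the non-word "onli".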



---



 - **Lemmatization:** *Lemmatization looks up the correct dictionary form of a word (its lemma) using grammar and vocabulary.*

📌 Example:

"playing" → "play"

"happiness" → "happiness" ✅

"better" → "good" (based on meaning)

📝 It ensures the result is an actual word and takes the part of speech (POS) into account.
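
The dictionary-lookup idea can be sketched with a tiny hand-written table keyed by word form and part of speech. This is only an illustration of the principle; real lemmatizers such as WordNet use a full lexicon plus morphological rules. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names.

```python
# Hypothetical mini-dictionary: (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("playing", "v"): "play",
    ("better", "a"): "good",
    ("worse", "a"): "bad",
    ("went", "v"): "go",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))  # adjective entry exists
print(toy_lemmatize("better", pos="n"))  # no noun entry, so unchanged
print(toy_lemmatize("happiness"))        # already a base form
```

Because the lookup depends on the part of speech, the same word form can map to different lemmas, or to itself when no entry exists for that word type.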



---
🤖 What’s the real use of stemming and lemmatization?

Let’s say you have a database of sentences like:

1. Dogs make the best friends.
2. A dog makes a good friend.
3. She is playing with her dog.
4. They played outside.

Now, a user types in a search: "play"


---



❌ Without stemming/lemmatization:
The system looks for exact word matches. Sentences 3 and 4 contain "playing" and "played" but not the exact word "play", so neither is found.

Only sentences containing the exact form "play" would match, so you miss important results.

🔄 Step-by-step flow (search for "dog"):

🧠 Preprocess the sentences in the database. Every sentence is tokenized, stripped of stop words such as "the" and "a", and then stemmed or lemmatized:

"dogs make the best friends" → ["dog", "make", "best", "friend"]

"a dog makes a good friend" → ["dog", "make", "good", "friend"]

🔍 The user types a search term, like "dog".

🧹 The system also stems or lemmatizes the search word:

"dogs" or "dog" → "dog"

🧮 The system checks which of the preprocessed sentences contain the word "dog".

✅ Sentences that match are returned to the user. Sentences 1 and 2 are both found because they contain "dog" after processing (sentence 3, "She is playing with her dog.", matches as well).
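
The flow above can be sketched as a small inverted index. This is a minimal illustration under stated assumptions: `normalize` is a toy suffix stripper standing in for a real stemmer or lemmatizer, and `STOP_WORDS` is a tiny hand-picked list. Neither is part of NLTK.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "is", "with", "her"}  # tiny illustrative list


def normalize(token: str) -> str:
    """Toy normalizer (assumption): strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    """Lowercase, strip punctuation, drop stop words, normalize each token."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [normalize(t) for t in tokens if t and t not in STOP_WORDS]


sentences = [
    "Dogs make the best friends.",
    "A dog makes a good friend.",
    "She is playing with her dog.",
    "They played outside.",
]

# Step 1: preprocess every sentence once and index it by normalized term
index = defaultdict(set)
for number, sentence in enumerate(sentences, start=1):
    for term in preprocess(sentence):
        index[term].add(number)


def search(query: str) -> list:
    """Steps 2-4: normalize the query the same way, then look it up."""
    terms = preprocess(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)


print(search("dogs"))  # sentence numbers containing "dog" after normalization
print(search("play"))  # sentence numbers containing "play" after normalization
```

The key design point is that the query goes through exactly the same preprocessing as the indexed sentences; otherwise "dogs" in the query would never meet "dog" in the index.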



## Import all important packages

In [1]:
import string  # Helps with string operations (e.g., removing punctuation)
import nltk

# Porter: the oldest and most commonly used stemmer.
# It is simple and widely used in search engines.
from nltk.stem.porter import PorterStemmer

# Snowball: a more advanced successor to Porter.
# Slightly more accurate and supports multiple languages.
from nltk.stem.snowball import SnowballStemmer

# Lancaster: very aggressive. It cuts words more harshly
# and can over-stem (cut off too much).
from nltk.stem.lancaster import LancasterStemmer

# WordNetLemmatizer uses a dictionary (WordNet) to return the proper base form (lemma).
from nltk.stem import WordNetLemmatizer

In [2]:
nltk.download('averaged_perceptron_tagger')
# Downloads a part-of-speech (POS) tagger, which is useful for lemmatization:
# it tells the system whether a word is a noun, verb, adjective, etc.
# Note: newer NLTK releases may look for 'averaged_perceptron_tagger_eng' instead.

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Stemming

We first define a few stemmers provided by NLTK.

For more stemmers, see http://www.nltk.org/api/nltk.stem.html

In [3]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Put all stemmers into a list to make their use easier
stemmer_list = [porter_stemmer, snowball_stemmer, lancaster_stemmer]

In [4]:
word_list = ['only', 'accepted', 'studying','study','studied','dogs', 'cats', 'running', 'phones', 'viewed', 'presumably', 'crying', 'went', 'packed', 'worse', 'best', 'mice', 'friends', 'makes']

In [5]:
for word in word_list:
    print (word + ':')
    for stemmer in stemmer_list:
        stemmed_word = stemmer.stem(word)
        print ('\t', stemmed_word)

only:
	 onli
	 onli
	 on
accepted:
	 accept
	 accept
	 acceiv
studying:
	 studi
	 studi
	 study
study:
	 studi
	 studi
	 study
studied:
	 studi
	 studi
	 study
dogs:
	 dog
	 dog
	 dog
cats:
	 cat
	 cat
	 cat
running:
	 run
	 run
	 run
phones:
	 phone
	 phone
	 phon
viewed:
	 view
	 view
	 view
presumably:
	 presum
	 presum
	 presum
crying:
	 cri
	 cri
	 cry
went:
	 went
	 went
	 went
packed:
	 pack
	 pack
	 pack
worse:
	 wors
	 wors
	 wors
best:
	 best
	 best
	 best
mice:
	 mice
	 mice
	 mic
friends:
	 friend
	 friend
	 friend
makes:
	 make
	 make
	 mak


Stemming doesn't understand irregular plurals: only a lemmatizer knows mice → mouse.

❗ Notice that stemming does not understand word meaning. "worse" should ideally relate to "bad", which only lemmatization can handle.

## Lemmatization

The output of a lemmatizer generally depends on the type of word (noun, verb, or adjective). For example, when "running" is used as an adjective (e.g., "a running tap"), the word is already in its base form. When "running" is used as a verb (e.g., "he was running away"), the base form is "run".

In [6]:
wordnet_lemmatizer = WordNetLemmatizer()

The complete lemmatization process:

- Tokenizes a sentence (splits it into words)
- Tags each word with a part-of-speech (POS) tag (like noun, verb, adjective)
- Maps the detailed tag to a simpler one: "n" for noun, "v" for verb, "a" for adjective
- Lemmatizes each word using the correct word type (improves accuracy)
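
The tag-mapping step can be written as a small reusable helper. This is a minimal sketch, assuming the tagger emits Penn Treebank tags (as NLTK's `pos_tag` does); the function name `penn_to_wordnet` is made up for this example.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD', 'NNS', 'JJR') to 'n', 'v', or 'a'."""
    first = tag[:1].lower()
    if first in ("n", "v"):
        # Noun tags start with "N", verb tags with "V"
        return first
    if first == "j":
        # Adjective tags start with "J" (JJ, JJR, JJS)
        return "a"
    # Everything else falls back to noun, the lemmatizer's default word type
    return "n"


for tag in ["NNS", "VBD", "JJR", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Falling back to "n" for all other tags is a simplification: determiners, prepositions, and similar words rarely change under lemmatization, so treating them as nouns usually leaves them untouched.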



In [8]:
import nltk
nltk.download('wordnet')

# The lemmatizer needs the word type to return the right result,
# so each word is tried as a noun, a verb, and an adjective.
word_type_list = ['n', 'v', 'a']
for word in word_list:
    print(word + ':')
    for word_type in word_type_list:
        lemmatized_word = wordnet_lemmatizer.lemmatize(word, pos=word_type)  # pos defaults to 'n'
        print('\t', word, '=[{}]=>'.format(word_type), lemmatized_word)

[nltk_data] Downloading package wordnet to /root/nltk_data...


only:
	 only =[n]=> only
	 only =[v]=> only
	 only =[a]=> only
accepted:
	 accepted =[n]=> accepted
	 accepted =[v]=> accept
	 accepted =[a]=> accepted
studying:
	 studying =[n]=> studying
	 studying =[v]=> study
	 studying =[a]=> studying
study:
	 study =[n]=> study
	 study =[v]=> study
	 study =[a]=> study
studied:
	 studied =[n]=> studied
	 studied =[v]=> study
	 studied =[a]=> studied
dogs:
	 dogs =[n]=> dog
	 dogs =[v]=> dog
	 dogs =[a]=> dogs
cats:
	 cats =[n]=> cat
	 cats =[v]=> cat
	 cats =[a]=> cats
running:
	 running =[n]=> running
	 running =[v]=> run
	 running =[a]=> running
phones:
	 phones =[n]=> phone
	 phones =[v]=> phone
	 phones =[a]=> phones
viewed:
	 viewed =[n]=> viewed
	 viewed =[v]=> view
	 viewed =[a]=> viewed
presumably:
	 presumably =[n]=> presumably
	 presumably =[v]=> presumably
	 presumably =[a]=> presumably
crying:
	 crying =[n]=> cry
	 crying =[v]=> cry
	 crying =[a]=> crying
went:
	 went =[n]=> went
	 went =[v]=> go
	 went =[a]=> went
packed:
	 packed =[ ... (output truncated)

accepted:

	 accepted =[n]=> accepted
	 accepted =[v]=> accept
	 accepted =[a]=> accepted

As a verb, it is correctly reduced: accepted → accept.

As a noun or adjective, the lemmatizer keeps it as-is because it either:

- doesn't know how to reduce it ('n'), or
- thinks it's already the base form ('a').

✅ When the lemmatizer doesn't know how to reduce a word, or the word is already in its base form for the given word type, it returns the word unchanged.



---

worse:

	 worse =[n]=> worse
	 worse =[v]=> worse
	 worse =[a]=> bad


*   ✅ "worse" is an adjective (the comparative form of "bad").
*   ❌ It is not commonly used as a verb or noun, so the lemmatizer returns it unchanged in those cases.

---

To show a complete example, we look ahead and use a part-of-speech (POS) tagger that tells us the type of each word in a sentence (see the follow-up tutorial for more details).


In [None]:
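from nltk import word_tokenize, pos_tag

# word_tokenize needs the Punkt tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')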

In [None]:
sentence = "The newest study has shown that cats have a better sense of smell than dogs."
#sentence = "Dogs make the best friends."

In [None]:
# First, tokenize sentence
token_list = word_tokenize(sentence)

# Second, calculate POS tags for each token
pos_tag_list = pos_tag(token_list)

print (pos_tag_list)

The POS tagger distinguishes several dozen word types. However, we are only interested in whether a word is a noun, verb, or adjective. We therefore map the output of the POS tagger to the three options used here: "n", "v", and "a".


In [None]:
print ('\nOutput of NLTK lemmatizer:\n')
for token, tag in pos_tag_list:
    word_type = 'n'
    tag_simple = tag[0].lower() # Converts, e.g., "VBD" to "v"
    if tag_simple in ['n', 'v']:
        # If the POS tag starts with "n" or "v", we know it's a noun or verb
        word_type = tag_simple
    elif tag_simple in ['j']:
        # If the POS tag starts with a "j", we know it's an adjective
        word_type = 'a'
    lemmatized_token = wordnet_lemmatizer.lemmatize(token.lower(), pos=word_type)
    print(token, '=[{}]==[{}]=>'.format(tag, word_type), lemmatized_token)

## Lemmatization with spaCy

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

spaCy already performs lemmatization by default when processing a document without any additional commands.

In [None]:
print('\nOutput of spaCy lemmatizer:')
doc = nlp(sentence)  # doc is a spaCy Doc object, not just a simple list
for token in doc:
    # token is also an object, not a string
    print(token.text, '={}=>'.format(token.pos_), token.lemma_)


## Application use case: document similarity

The helper `preprocess_text()` takes a document as input and returns its words; with `return_type='set'` it returns a set of words (i.e., no duplicates). Passing `stemmer=...` stems each word; passing `lemmatizer=...` lemmatizes each word. The helper simply puts together all the individual steps shown previously.


In [None]:
import utils
from utils.nlputil import preprocess_text
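
The `utils.nlputil` module is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in. It is not the actual `preprocess_text()`: the name `simple_preprocess`, its `normalize` parameter, and the regex tokenization are all assumptions made for this sketch.

```python
import re


def simple_preprocess(text, normalize=None, return_type="list"):
    """Hypothetical stand-in for a preprocessing helper (not utils.nlputil).

    Lowercases the text, extracts alphanumeric tokens, optionally applies a
    per-token normalizer (e.g. a stemmer's stem method), and returns either
    a list or a set of tokens.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return set(tokens) if return_type == "set" else tokens


# Usage with a toy normalizer that strips trailing "s" characters
print(simple_preprocess("Dogs make the best friends."))
print(simple_preprocess("Dogs, dogs!", normalize=lambda t: t.rstrip("s"), return_type="set"))
```

With NLTK available, one could pass `porter_stemmer.stem` as the `normalize` argument instead of the toy lambda.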

Print some example output for both methods.

In [None]:
# Show example output with stemming
print(preprocess_text(sentence, stemmer=porter_stemmer))

# Show example output with lemmatization
print(preprocess_text(sentence, lemmatizer=wordnet_lemmatizer))

To calculate the similarity between two documents, let's define a second sentence that is semantically similar to the first one, but syntactically different.

In [None]:
# sentence = "The newest study has shown that cats have a better sense of smell than dogs."
sentence_2 = "Some studies show that a cat can smell better than a dog."

For both sentences, we can calculate all 3 different word sets:
- naive (only simple tokenizing)
- stemmed
- lemmatized


In [None]:
naive_word_set_1 = set(word_tokenize(sentence.lower()))
naive_word_set_2 = set(word_tokenize(sentence_2.lower()))

stemmed_word_set_1 = preprocess_text(sentence, stemmer=porter_stemmer, return_type='set')
stemmed_word_set_2 = preprocess_text(sentence_2, stemmer=porter_stemmer, return_type='set')

lemmatized_word_set_1 = preprocess_text(sentence, lemmatizer=wordnet_lemmatizer, return_type='set')
lemmatized_word_set_2 = preprocess_text(sentence_2, lemmatizer=wordnet_lemmatizer, return_type='set')

print (naive_word_set_1)
print (stemmed_word_set_1)
print (lemmatized_word_set_1)

In [None]:
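def jaccard_similarity(word_set_1, word_set_2):
    """Return |A ∩ B| / |A ∪ B| for two sets of words."""
    union_set = word_set_1.union(word_set_2)
    intersection_set = word_set_1.intersection(word_set_2)
    if not union_set:
        # Both sets are empty; avoid dividing by zero
        return 0.0
    similarity = len(intersection_set) / len(union_set)
    return similarity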


To quantify the similarity between two word sets A and B, we can use the *Jaccard similarity* J(A,B), defined as:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

Intuitively, if A and B are completely different, the size of the intersection $|A\cap B|$ is 0, making the similarity 0. If A and B are identical, the intersection and the union have the same size, making the similarity 1.0.
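
As a small worked example, the formula can be applied by hand to the two dog sentences from the introduction. The word sets below are typed in manually: the raw tokens for the naive case, and the normalized word lists shown earlier for the processed case. Returning 0.0 for two empty sets is a convention chosen for this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Raw tokens: no word form is shared between the two sentences
naive_1 = {"dogs", "make", "the", "best", "friends"}
naive_2 = {"a", "dog", "makes", "good", "friend"}

# Normalized tokens: "dog", "make", and "friend" are now shared
normalized_1 = {"dog", "make", "best", "friend"}
normalized_2 = {"dog", "make", "good", "friend"}

print(jaccard(naive_1, naive_2))            # 0 shared / 10 total = 0.0
print(jaccard(normalized_1, normalized_2))  # 3 shared / 5 total = 0.6
```

Normalization raises the similarity from 0.0 to 0.6 without changing the sentences themselves, which is exactly the effect stemming and lemmatization are meant to have on document comparison.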


In [None]:
print (jaccard_similarity(naive_word_set_1, naive_word_set_2))
print (jaccard_similarity(stemmed_word_set_1, stemmed_word_set_2))
print (jaccard_similarity(lemmatized_word_set_1, lemmatized_word_set_2))