# What is Lemmatization?

Lemmatization reduces a word to its base or dictionary form, known as the lemma. It takes the word's part of speech and meaning into account, so the result is always a valid dictionary word. For example, `"running"`, `"ran"`, and `"runs"` are all lemmatized to `"run"`.

In [None]:
# practical implementation of lemmatization with nltk in python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# download the required NLTK data files (only needs to run once)
nltk.download('punkt')  # tokenizer models for word_tokenize; recent NLTK releases may need 'punkt_tab' instead
nltk.download('averaged_perceptron_tagger_eng')  # English part-of-speech tagger
nltk.download('wordnet')  # WordNet corpus used by the lemmatizer
nltk.download('omw-1.4')  # Open Multilingual Wordnet data

# helper: map a word's Penn Treebank POS tag to the matching WordNet POS constant
# (the word is tagged on its own, so no sentence context is used)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.NOUN)  # default to noun if tag not found

# initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# example usage: lemmatize a single word
word = "running"
lemma = lemmatizer.lemmatize(word, get_wordnet_pos(word))
print(f"Lemmatized word: {lemma}")

# optional: example with multiple words
words = ["running", "flies", "better", "studies"]
lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words]
print(f"Lemmatized list: {lemmatized_words}")


Lemmatized word: run
Lemmatized list: ['run', 'fly', 'well', 'study']
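
The part-of-speech argument matters because the same surface form can map to different lemmas. The sketch below illustrates the idea with a tiny hand-written lookup table; `TOY_LEMMAS`, `toy_lemmatize`, and the tag names `"adj"`, `"adv"`, `"noun"`, and `"verb"` are hypothetical and stand in for a real lexical database such as WordNet. It is not how `WordNetLemmatizer` is implemented.

```python
# Toy lookup table standing in for a real lexical database.
# Keys are (surface form, part of speech); the entries are hand-written
# for illustration only and cover just a few English words.
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("flies", "noun"): "fly",
    ("flies", "verb"): "fly",
    ("saw", "noun"): "saw",
    ("saw", "verb"): "see",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if it is not in the table."""
    key = (word.lower(), pos)
    return TOY_LEMMAS.get(key, word.lower())


# The same word can resolve to different lemmas depending on its part of speech.
for word, pos in [("better", "adj"), ("better", "adv"), ("saw", "noun"), ("saw", "verb")]:
    print(f"{word!r} as {pos}: {toy_lemmatize(word, pos)}")
```

This is why the example above passes a POS tag to `lemmatizer.lemmatize` rather than relying on its default of treating every word as a noun.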




# What is Stemming?

Stemming is a simpler process that strips affixes (mostly suffixes) with fixed rules to reduce a word to a root form. This root form, known as the stem, may not be a valid word in the language. For example, the Porter stemmer reduces both `"running"` and `"runs"` to `"run"`, whereas a cruder suffix stripper might produce `"runn"` for `"running"`. Other forms, such as `"runner"`, may be left unchanged, depending on the stemming algorithm used.

In [11]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stem = stemmer.stem(word)
print(f"Stemmed word: {stem}")

Stemmed word: run


In [14]:
# side-by-side comparison of stemming and lemmatization on the same sentence
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# sample text
text = "The striped bats are hanging on their feet for best"

# tokenize the text
words = nltk.word_tokenize(text)

# initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# function to get the part of speech tag for lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# print results
print("Original Text: ", text)
print("Tokenized Words: ", words)
print("Stemmed Words: ", stemmed_words)
print("Lemmatized Words: ", lemmatized_words)

Original Text:  The striped bats are hanging on their feet for best
Tokenized Words:  ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
Stemmed Words:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatized Words:  ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
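
To see exactly where the two techniques disagree, the lists can be compared position by position. The sketch below copies the three lists from the output above, so it runs without NLTK; `disagreements` is a hypothetical helper name introduced for this illustration.

```python
# Lists copied verbatim from the output above.
tokens = ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
stems = ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
lemmas = ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


def disagreements(tokens, stems, lemmas):
    """Return (token, stem, lemma) triples for positions where stem and lemma differ."""
    return [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]


diffs = disagreements(tokens, stems, lemmas)
for token, stem, lemma in diffs:
    print(f"{token:<8} stem={stem:<8} lemma={lemma}")
```

Four tokens differ. The stemmer lowercases `The`, while the lemmatizer keeps its case. Only the lemmatizer resolves the irregular forms `are` to `be` and `feet` to `foot`. The lemma `strip` for `striped` suggests the word was treated as a verb, which is plausible because `get_wordnet_pos` tags each word in isolation rather than in the context of the sentence.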


## When to Use Lemmatization vs. Stemming

The choice between **lemmatization** and **stemming** depends on the specific requirements of the NLP task at hand:

### Use Lemmatization When:
- **Accuracy and context are crucial**
- The task involves complex language understanding, such as:
  - Chatbots
  - Sentiment analysis
  - Machine translation
- **Computational resources are sufficient** to handle the additional complexity

### Use Stemming When:
- **Speed and efficiency are more important than accuracy**
- The task involves simple text normalization, such as:
  - Search engines
  - Information retrieval systems
- **Computational resources are limited**
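
The search use case can be sketched with a small inverted index that stores stems instead of raw words, so a query matches documents containing other inflections of the same word. This is a minimal illustration under stated assumptions: `toy_stem` is a crude, hypothetical suffix stripper written for this example (it is not the Porter algorithm), and `build_index`, `search`, and the sample documents are likewise invented for illustration.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper for illustration only; not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        # Strip at most one suffix, and only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "runn" -> "run".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents whose stems include the query's stem."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "The dog runs in the park",
    2: "She was running late",
    3: "Cats sleep all day",
}
index = build_index(docs)

print(search(index, "running"))  # matches both "runs" and "running"
print(search(index, "cat"))
print(toy_stem("studies"))  # the stem need not be a real word
```

Because the query and the documents pass through the same function, it does not matter that a stem such as `studi` is not a dictionary word; matching only requires that related forms collapse to the same string.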


## Conclusion

Lemmatization and stemming both reduce words to a base form, but they trade accuracy against cost differently. Lemmatization relies on a dictionary and part-of-speech information, so it returns valid words and suits tasks that need precise language understanding. Stemming applies fast, rule-based suffix stripping, so it suits tasks where speed and simplicity matter more than accuracy. Knowing which trade-off a task can tolerate leads to better preprocessing of textual data across NLP applications.