<a href="https://colab.research.google.com/github/goel4ever/machine-learning-notebooks/blob/main/nlp_lemmatizing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Lemmatizing

A `lemma` is the word that represents a whole group of related word forms, and that group is called a `lexeme`. For example, "scarf" is the lemma of the lexeme that contains "scarf" and "scarves".

Like stemming, `lemmatizing` reduces words to their core meaning, but it gives you a complete English word that makes sense on its own instead of a fragment such as "discoveri". This notebook focuses on reducing words to their lemmas using Natural Language Processing.
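
To make the difference concrete, here is a minimal pure-Python sketch. It is a toy, not how NLTK works internally: the `LEXEMES` table is hand-written for illustration, and `toy_lemmatize` and `toy_stem` are hypothetical helpers. The lemmatizer looks a word form up in its lexeme and returns the lemma, while the stemmer only strips suffixes and can leave a fragment behind.

```python
# Toy data: each lemma maps to the word forms in its lexeme.
# This table is hand-written for illustration; it is not WordNet.
LEXEMES = {
    "discover": {"discover", "discovers", "discovered", "discovering"},
    "scarf": {"scarf", "scarves"},
    "bad": {"bad", "worse", "worst"},
}

# Invert the table so that each word form points back to its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEXEMES.items() for form in forms
}


def toy_lemmatize(word):
    """Return the lemma for a known word form, or the word unchanged."""
    return FORM_TO_LEMMA.get(word.lower(), word)


def toy_stem(word):
    """Strip a few common suffixes; the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


print(toy_stem("scarves"), toy_lemmatize("scarves"))
print(toy_stem("worst"), toy_lemmatize("worst"))
```

The lookup can return "bad" for "worst" because it relies on a dictionary rather than on spelling, which is the same reason a real lemmatizer needs a resource such as WordNet.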

We'll use the NLTK package for the implementation. A group of texts is called a corpus. NLTK provides several corpora covering everything from novels hosted by Project Gutenberg to inaugural speeches by presidents of the United States.

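Before NLTK can use a resource, you first need to download it. Calling `nltk.download()` with a resource name fetches just that resource; this notebook needs `wordnet` for lemmatization and `punkt` for tokenization. The download is a one-off step and may take some time on the first run. Note: newer NLTK releases may ask for additional resources such as `punkt_tab`; the `LookupError` message names the one to download.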

In [12]:
import nltk
# Download resource wordnet for lemmatization
nltk.download('wordnet')
# Download resource punkt for tokenization
nltk.download('punkt')

# Required imports
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
# Create an instance of lemmatizer
lemmatizer = WordNetLemmatizer()

In [14]:
# Test lemmatizing a plural noun
lemmatizer.lemmatize("scarves")

'scarf'

In [15]:
# Tokenize a string
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

In [16]:
# Lemmatize the words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
lemmatized_words
# Note the plurals 'friends' and 'scarves' became the singulars 'friend' and 'scarf'.

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

Next, consider a word that looks very different from its lemma.

In [18]:
# The result is 'worst' because lemmatizer.lemmatize() treats the word as a noun by default.
lemmatizer.lemmatize("worst")

'worst'

In [19]:
# Pass pos="a" to tell the lemmatizer to treat "worst" as an adjective
lemmatizer.lemmatize("worst", pos="a")
# The default value of pos is 'n' (noun); with pos="a" the adjective lemma is returned

'bad'
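
Typing `pos` by hand does not scale to whole sentences. One common approach, sketched here under assumptions, is to take part-of-speech tags from a tagger such as `nltk.pos_tag`, which emits Penn Treebank tags (for example `'JJS'` for a superlative adjective), and translate them into the single letters that `lemmatize()` accepts. The helper name `penn_to_wordnet_pos` is hypothetical, and falling back to noun for unknown tags simply mirrors the lemmatizer's default.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet pos letter.

    Hypothetical helper for illustration; unknown tags fall back to
    'n' (noun), which is also the default of lemmatize().
    """
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # nouns and everything else


# Example tags only; a tagger would normally supply these.
for tag in ("JJS", "VBD", "RB", "NNS", "DT"):
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With tagged tokens in hand, each word could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.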