# Lemmatization
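Lemmatization goes a step further than stemming: instead of stripping suffixes, it maps each word to its dictionary form (its lemma), so inflected and irregular forms are grouped together. For instance, it maps 'better' to 'good' when the word is tagged as an adjective. The full process involves part-of-speech (POS) tagging followed by NLTK's WordNet lemmatizer. Because NLTK's tagger emits Penn Treebank tags, they have to be converted to WordNet's tagging convention first. Stop words such as prepositions, conjunctions, and articles are also excluded, since most of them have no entries in WordNet. The example below omits the POS step, so the lemmatizer falls back to its default and treats every token as a noun, which is why 'chasing' and 'running' come out unchanged.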

#### Downloads required

In [4]:
import nltk
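nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('stopwords')  # stop word lists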

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sahithimv/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /Users/sahithimv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sahithimv/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sahithimv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

sentence = "Cats are chasing mice in the garden, and the mice are running away."
tokens = word_tokenize(sentence.lower())  # Lowercase so tokens match the lowercase stop word list
stop_words = set(stopwords.words('english'))  # Swap in another supported language name if needed
# Keep alphanumeric tokens only (drops punctuation) and remove stop words
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

lemmatizer = WordNetLemmatizer()
# No POS is passed, so lemmatize() uses its default (noun) for every token
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("Original Tokens:", tokens)
print("Filtered Tokens (after removing stop words):", filtered_tokens)
print("Lemmatized Words:", lemmatized_words)


Original Tokens: ['cats', 'are', 'chasing', 'mice', 'in', 'the', 'garden', ',', 'and', 'the', 'mice', 'are', 'running', 'away', '.']
Filtered Tokens (after removing stop words): ['cats', 'chasing', 'mice', 'garden', 'mice', 'running', 'away']
Lemmatized Words: ['cat', 'chasing', 'mouse', 'garden', 'mouse', 'running', 'away']
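
In the output above, 'chasing' and 'running' are left as they are because `lemmatize()` was called without a POS and therefore treated them as nouns. The sketch below shows one way to add the POS step. The helper names `treebank_to_wordnet` and `lemmatize_with_pos` are hypothetical (they are not part of NLTK), and the first-letter mapping is a common simplification rather than an official conversion table. It assumes the downloads listed earlier are already in place.

```python
# WordNet's single-letter POS codes; these match wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV in nltk.corpus.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code.

    Hypothetical helper: it only looks at the first letter of the tag.
    """
    if tag.startswith("J"):
        return WN_ADJ
    if tag.startswith("V"):
        return WN_VERB
    if tag.startswith("R"):
        return WN_ADV
    return WN_NOUN  # everything else falls back to noun, the lemmatizer's default


def lemmatize_with_pos(sentence):
    """Tokenize, POS-tag, drop stop words and punctuation, then lemmatize."""
    # Imported here so the tag mapping above can be used without NLTK data.
    from nltk import pos_tag
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Tag the full sentence before filtering: the tagger uses neighbouring
    # words as context, so removing stop words first can hurt its accuracy.
    tagged = pos_tag(word_tokenize(sentence.lower()))
    return [
        lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag))
        for word, tag in tagged
        if word.isalnum() and word not in stop_words
    ]
```

Calling `lemmatize_with_pos(sentence)` on the example sentence should reduce any token the tagger labels as a verb to its base form, since `lemmatizer.lemmatize('chasing', pos='v')` gives 'chase' and `lemmatizer.lemmatize('running', pos='v')` gives 'run'. The exact result depends on the tags the tagger assigns, so treat it as illustrative rather than guaranteed.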
