
## Lemmatization and WordNet Lemmatization with NLTK

**Lemmatization** reduces a word to its base or dictionary form, known as the **lemma**. Unlike stemming, which typically just strips suffixes by rule, lemmatization uses a vocabulary and morphological analysis to map each word to an entry in the dictionary. For example, the verb forms "running", "ran", and "runs" all share the lemma "run".

**WordNet Lemmatization** is a specific type of lemmatization that uses the **WordNet** lexical database. WordNet is a large database of English words linked together by their semantic relationships (like synonyms, antonyms, hyponyms, etc.). The NLTK library provides access to WordNet and includes a lemmatizer that leverages this database.

Here's how WordNet Lemmatization typically works in NLTK:

1.  **Import necessary modules:** You need to import `WordNetLemmatizer` from `nltk.stem` and potentially `wordnet` from `nltk.corpus`.
2.  **Initialize the lemmatizer:** Create an instance of the `WordNetLemmatizer`.
3.  **Lemmatize the word:** Use the `lemmatize()` method of the lemmatizer object. It takes the word and an optional part-of-speech (POS) tag. The POS tag matters because the lemma can depend on the word's grammatical role in a sentence (e.g., "meeting" stays "meeting" as a noun but becomes "meet" as a verb). The lemmatizer expects WordNet's own POS tags ('n' for noun, 'v' for verb, 'a' for adjective, 'r' for adverb). If no POS tag is provided, it defaults to noun.

**Key advantages of WordNet Lemmatization:**

*   **More accurate:** It often provides a more accurate lemma than simple stemming because candidate forms are checked against the WordNet database instead of being produced by suffix stripping alone. The lemmatizer does not read the surrounding sentence; any context has to be supplied through the POS tag.
*   **Handles irregular forms:** It can handle irregular plurals (e.g., "mice" -> "mouse") and irregular verb conjugations (e.g., "went" -> "go" when called with `pos='v'`).

**Considerations:**

*   **Requires POS tagging:** For optimal results, you usually need to perform part-of-speech tagging on your text before lemmatization to provide the correct POS tag to the lemmatizer.
*   **WordNet coverage:** While extensive, WordNet doesn't contain every word or every possible form of a word.
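The advantages and considerations above follow from the way a dictionary-backed lemmatizer looks words up. The toy sketch below is loosely modeled on the usual description of WordNet's lookup (an exception list for irregular forms, then suffix rules whose results are kept only if the dictionary lists them under the requested POS). Everything in it is an assumption made for illustration: the three small tables are invented stand-ins, and `toy_lemmatize` is not NLTK's implementation.

```python
# Toy stand-ins for WordNet's data, invented for this illustration.
EXCEPTIONS = {
    "n": {"mice": "mouse"},
    "v": {"went": "go", "ran": "run"},
}
SUFFIX_RULES = {
    "n": [("s", ""), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}
VOCABULARY = {
    "n": {"mouse", "history", "going"},
    "v": {"go", "run", "believe"},
}


def toy_lemmatize(word, pos="n"):
    """Return a lemma for `word` under `pos`, or `word` itself if nothing matches."""
    # 1. Irregular forms are resolved by direct lookup.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. Otherwise consider the word itself plus every suffix rewrite for this POS.
    candidates = [word] + [
        word[: -len(old)] + new
        for old, new in SUFFIX_RULES[pos]
        if word.endswith(old)
    ]
    # 3. Keep only candidates the vocabulary lists under this POS.
    known = [c for c in candidates if c in VOCABULARY[pos]]
    # 4. Prefer the shortest known candidate; unknown words come back unchanged.
    return min(known, key=len) if known else word


for word, pos in [("mice", "n"), ("going", "n"), ("going", "v"), ("went", "n"), ("blorf", "n")]:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

In this toy version "going" is kept as a noun but reduced to "go" as a verb, "went" is only resolved under the verb tag, and a word missing from the vocabulary is returned as-is, which mirrors the POS and coverage caveats listed above.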

In summary, WordNet Lemmatization with NLTK reduces words to their base forms by looking them up in the WordNet lexical database. Given the correct POS tag, it returns valid dictionary forms where a stemmer would return truncated strings.

In [43]:
from nltk.stem import WordNetLemmatizer

In [44]:
from nltk.corpus import wordnet

In [45]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [46]:
lemmatizer = WordNetLemmatizer()

In [47]:
lemmatizer.lemmatize("mice")

'mouse'

In [48]:
lemmatizer.lemmatize("going")

'going'

In [49]:
# WordNet POS tags accepted by lemmatize():
#   noun      - 'n'
#   verb      - 'v'
#   adjective - 'a'
#   adverb    - 'r'
lemmatizer.lemmatize("going", pos='v')

'go'

In [50]:
lemmatizer.lemmatize("better", pos='a')

'good'

In [51]:
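# 'r' is the adverb tag, so the verb form "ran" comes back unchanged;
# pos='v' is needed to get 'run'.
lemmatizer.lemmatize("ran", pos='r')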

'ran'

In [52]:
words = ['history', 'historical', 'historically', 'historian', 'histories', 'believe', 'believer', 'believing', 'believed', 'beautiful', 'beauty', 'beautify', 'beautifying', 'run', 'running','eats','eating','eaten','writing','written','programming','programs','finally','finalized']

In [53]:
# Lemmatize every word as a verb; words with no verb entry are returned unchanged.
for word in words:
    print(f"{word} ---> {lemmatizer.lemmatize(word, pos='v')}")

history ---> history
historical ---> historical
historically ---> historically
historian ---> historian
histories ---> histories
believe ---> believe
believer ---> believer
believing ---> believe
believed ---> believe
beautiful ---> beautiful
beauty ---> beauty
beautify ---> beautify
beautifying ---> beautify
run ---> run
running ---> run
eats ---> eat
eating ---> eat
eaten ---> eat
writing ---> write
written ---> write
programming ---> program
programs ---> program
finally ---> finally
finalized ---> finalize


In [54]:
# No POS is given, so both adverbs are looked up as nouns and returned unchanged.
lemmatizer.lemmatize('fairly'), lemmatizer.lemmatize('sportingly')

('fairly', 'sportingly')

In [56]:
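def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Any other tag (determiners, prepositions, pronouns, ...) falls back to
        # noun, which is also lemmatize()'s default.
        return wordnet.NOUN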

In [57]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [65]:
sentence = "Donald Trump has a devoted following".split()

In [66]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [71]:
for word,tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos = get_wordnet_pos(tag))
  print(word+" ---> "+lemma)


Donald ---> Donald
Trump ---> Trump
has ---> have
a ---> a
devoted ---> devote
following ---> following


In [72]:
sentence = "The cat was following the bird as it flew by".split()

In [73]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [74]:
for word,tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos = get_wordnet_pos(tag))
  print(word+" ---> "+lemma)

The ---> The
cat ---> cat
was ---> be
following ---> follow
the ---> the
bird ---> bird
as ---> a
it ---> it
flew ---> fly
by ---> by
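
Note the line `as ---> a` in the output above: "as" is tagged `IN`, `get_wordnet_pos` falls back to noun for every tag it does not recognize, and the noun lookup then shortens the word. One way to avoid this is to leave tokens untouched when their tag has no WordNet counterpart. The sketch below assumes the WordNet POS constants are the one-letter strings 'a', 'v', 'n' and 'r'; the names `wordnet_pos_or_none` and `lemmatize_tagged` are hypothetical, and the lemmatizing function is passed in as an argument so the helper itself does not depend on NLTK.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# First letter of a Penn Treebank tag -> WordNet POS letter.
_TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def wordnet_pos_or_none(treebank_tag: str) -> Optional[str]:
    """Return the WordNet POS letter for a Treebank tag, or None if there is none."""
    return _TREEBANK_TO_WORDNET.get(treebank_tag[:1])


def lemmatize_tagged(
    tagged: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize (word, tag) pairs, leaving words with unmapped tags unchanged."""
    lemmas = []
    for word, tag in tagged:
        pos = wordnet_pos_or_none(tag)
        # Function words (DT, IN, PRP, ...) have no WordNet POS, so skip them.
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With the objects defined in this notebook, the call would be `lemmatize_tagged(words_and_tags, lemmatizer.lemmatize)`; tokens such as "as", "the" and "by" are then passed through without a dictionary lookup.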
