
**Lemmatisation**
---------------------


https://en.wikipedia.org/wiki/Lemmatisation

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced "accuracy" may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall (the true positive rate) compared to lemmatisation. Nonetheless, stemming reduces precision (the fraction of retrieved results that are actually relevant) for such systems.[5]
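To make the two retrieval metrics concrete, here is a minimal sketch of how recall and precision are computed. The counts are hypothetical, invented purely for illustration; they are not measurements from any real system.

```python
# Hypothetical counts for one query; "relevant" means a document the user wanted.
# tp = relevant documents retrieved, fp = irrelevant documents retrieved,
# fn = relevant documents that were missed.

def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn)

# Made-up numbers: the stemmed query matches more documents overall,
# so it finds more relevant ones but also pulls in more irrelevant ones.
runs = {
    "lemmatised query": {"tp": 8, "fp": 2, "fn": 4},
    "stemmed query": {"tp": 10, "fp": 6, "fn": 2},
}

for name, c in runs.items():
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

With these invented counts the stemmed query has the higher recall and the lower precision, which is the trade-off described above.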

For instance, consider how the two approaches treat the following words:

1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
2. The word "walk" is the base form of the word "walking", and hence this is matched in both stemming and lemmatisation.
3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow".

Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.
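The contrast can be sketched with a deliberately tiny, self-contained example. The suffix list `SUFFIXES` and the dictionary `TOY_LEMMAS` are hypothetical stand-ins made up for this illustration; they are not the rules or data used by NLTK's stemmers or by WordNet.

```python
# Toy illustration only: SUFFIXES and TOY_LEMMAS are invented for this sketch.

SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of context."""
    for suffix in SUFFIXES:
        # Require a few leading characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatiser looks words up in a dictionary, keyed here by (word, part of speech).
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("walking", "v"): "walk",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos):
    """Return the dictionary lemma for (word, pos), or the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

examples = [("better", "a"), ("walking", "v"), ("meeting", "n"), ("meeting", "v")]
for word, pos in examples:
    print(word, pos, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The stemmer returns the same result for "meeting" regardless of how the word is used, while the dictionary look-up can return a different lemma for the noun and the verb.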

Lemmatisation is the process of transforming a word form into its dictionary form (its lemma).


In [1]:
# Lemmatization: identify the root word (lemma) depending on its meaning,
# using a dictionary as the base
# Dictionary: given a word form with an affixed ending -> root word in the dictionary
# Dictionary (lexical database) = WordNet (words and synonyms)
# WordNet contains 155,287 words grouped into about 117,000 synsets (synonym sets)
# https://wordnet.princeton.edu/

#Import WordNetLemmatizer

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()



# 1. Find the part of speech
# 2. Find the lemmatized word

# Objective: provide some keywords and extract the lemmatized words

text = "caring cares cared caringly carefully"

# Example: lemmatize a single word, giving its part of speech
nltk.download('wordnet')
wordnet_lemmatizer.lemmatize("better",pos="a")


# POS codes accepted by lemmatize():
# Adjective = "a"
# Adverb    = "r"
# Noun      = "n"
# Verb      = "v"

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


'good'

In [2]:
#https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

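# Lemmatize each word and join the results.
# lemmatize() defaults to pos="n" (noun), so verbs such as "are" and "hanging" stay unchanged.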
lemmatized_output = ' '.join([wordnet_lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best
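The output above leaves "are", "hanging" and "best" untouched because every word was treated as a noun. One common way to address this is to tag each token first and translate the tag into a WordNet POS code. The helper below is a minimal sketch of that translation step: `penn_to_wordnet` is a hypothetical name, and the `TAGGED` list is hand-written to stand in for tagger output rather than produced by running a tagger.

```python
# Sketch: map Penn Treebank tags (the tag set used by nltk.pos_tag) to the
# single-letter POS codes that WordNetLemmatizer.lemmatize() accepts.
# penn_to_wordnet and TAGGED are illustrative assumptions, not part of NLTK.

def penn_to_wordnet(tag, default="n"):
    """Return 'a', 'v', 'n' or 'r' based on the first letter of a Penn tag."""
    first_letter_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return first_letter_map.get(tag[:1].upper(), default)

# Hand-written (word, Penn tag) pairs standing in for real tagger output.
TAGGED = [("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG"), ("best", "JJS")]

for word, tag in TAGGED:
    print(word, tag, "->", penn_to_wordnet(tag))
```

In the notebook, the tags could come from `nltk.pos_tag(word_list)`, and each pair could then be passed on as `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`. Tags that are not adjectives, verbs, nouns or adverbs fall back to the noun default, which matches the default of `lemmatize()`.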


In [0]:
# Lemmatization using (WordNet)


In [0]:
text = "He matured fast"
#data Processing

#Step 1: Word tokenize
import nltk
nltk.download("punkt")

from nltk import word_tokenize
tokens = word_tokenize(text)
print(tokens)

#Step 2: Lemmatization (dictionary-based, WordNet) instead of stemming (e.g. Snowball)
#Every token is lemmatized as a verb (pos="v")

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

lemma_words = []

for token in tokens:
  lemma_word = wnl.lemmatize(token, "v")
  lemma_words.append(lemma_word)
print(" ".join(lemma_words))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['He', 'matured', 'fast']
He mature fast


In [0]:
term = "left"
pos = "v"
wnl.lemmatize(term,pos)

'leave'