# Lemmatization

As we saw in the previous chapter, we can explain to the machine which words are similar and how different they are. Sometimes, though, you don't want to capture the difference between words. Let's take an example.

Suppose you are building a model to classify books, and for that you want a list of the most recurrent words in each category. You have books about cooking and books about cars.
You don't really want to make a distinction between `wheel` and `wheels`, or between `foot` and `feet`. To fix that, we will apply **lemmatization**, which reduces each word to its base form (its lemma).
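
To make the idea concrete, here is a minimal sketch of why merging word forms matters when you count recurrent words. The `TOY_LEMMAS` table and the `to_lemma` helper are hypothetical, hand-written stand-ins for a real lemmatizer, used only so the example runs without any download.

```python
from collections import Counter

# Hypothetical hand-written lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"wheels": "wheel", "feet": "foot"}

def to_lemma(word):
    """Return the base form if the toy table knows the word, else the word itself."""
    return TOY_LEMMAS.get(word, word)

words = ["wheel", "wheels", "foot", "feet", "wheels"]

raw_counts = Counter(words)                         # each surface form counted separately
lemma_counts = Counter(to_lemma(w) for w in words)  # forms of the same word counted together

print(raw_counts)
print(lemma_counts)
```

Without normalization, `wheel` and `wheels` compete as two separate entries. Once they are mapped to the same lemma, their counts add up.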

## Still confused?
Let's see how it works in a practical case.

First, read [this article](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

Then, try to apply what you have learned.

**Pro tip:** Most lemmatizers work on a single word, not on a whole sentence. Tokenize your sentence first.
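
As a rough illustration of this tip, the sketch below uses a hypothetical toy lookup (`TOY_LEMMAS`, `lemmatize_word`) and a simple regular-expression tokenizer (`simple_tokenize`) rather than a real NLP library. It only shows why a word-level lemmatizer needs tokens as input.

```python
import re

# Hypothetical toy lookup table; a real lemmatizer relies on a full dictionary.
TOY_LEMMAS = {"games": "game", "plays": "play", "feet": "foot"}

def lemmatize_word(word):
    """Look up a single word; anything not in the table is returned unchanged."""
    return TOY_LEMMAS.get(word.lower(), word)

def simple_tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "He plays those games."

# Passing the whole sentence: no table entry matches, so nothing changes.
whole = lemmatize_word(sentence)

# Tokenizing first lets each word be looked up on its own.
per_token = [lemmatize_word(tok) for tok in simple_tokenize(sentence)]

print(whole)
print(per_token)
```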

**Pro tip:** If you run into SSL issues when downloading `nltk` data, [check this](https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed).

In [13]:
import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
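
The helper above keeps only the first letter of the Penn Treebank tag returned by `nltk.pos_tag`. As a rough illustration of that mapping without calling the tagger, the sketch below applies the same rule to a few hand-picked tags. The single-letter codes are the values behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`; `penn_to_wordnet` is a hypothetical name used only here.

```python
# Single-character part-of-speech codes used by WordNet
# (the values behind wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV).
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(tag[0].upper(), "n")

for tag in ["NNS", "VBG", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four families, such as `DT` for determiners, fall back to the noun code, which mirrors the default in `get_wordnet_pos`.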

In [15]:
# Can you lemmatize this sentence with nltk?

my_sentence = "Those childs are playing. this game, those games, I play he plays"

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize

tokens = word_tokenize(my_sentence)

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens tagged as nouns into a new list; every other token is dropped.
# Each token is tagged on its own, so the tagger cannot use the sentence context.
# Alternative: treat every token as a verb instead
# words_lemmatized = [wordnet_lemmatizer.lemmatize(t, pos="v") for t in tokens]
words_lemmatized = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens if nltk.pos_tag([word])[0][1][0].upper()=='N']
print(words_lemmatized)

['child', 'game', 'game', 'play', 'play']


Compare the output with the original sentence. What are the differences?

## Conclusion
There are multiple libraries that allow you to do lemmatization. Each of them has its own particularities.
There are also other techniques to "simplify" words, like [Stemming](https://medium.com/swlh/introduction-to-stemming-vs-lemmatization-nlp-8c69eb43ecfe). Feel free to explore those that seem relevant to your use case.

![stemming vs lemmatization](https://miro.medium.com/max/2050/1*ES5bt7IoInIq2YioQp2zcQ.png)
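
To give a feel for the contrast, here is a deliberately naive sketch. `crude_stem` is a hypothetical suffix stripper, not the Porter algorithm or any library's stemmer, and `TOY_LEMMAS` is a hand-written stand-in for a real lemmatizer's dictionary.

```python
def crude_stem(word):
    """Strip a few common suffixes; a deliberately naive stand-in for a real stemmer."""
    for suffix in ("ies", "ing", "es", "s"):
        # Only strip when enough of the word is left over.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical toy dictionary standing in for a real lemmatizer.
TOY_LEMMAS = {"studies": "study", "feet": "foot", "wheels": "wheel", "playing": "play"}

def toy_lemmatize(word):
    """Return the dictionary form if the toy table knows the word, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "feet", "wheels", "playing"]:
    print(word, "| stem:", crude_stem(word), "| lemma:", toy_lemmatize(word))
```

Rule-based stripping is cheap but can produce non-words (`stud`) and misses irregular forms (`feet`). A dictionary-based approach returns real words but needs a vocabulary to look them up in.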
