# Lemmatization

Here we will use [spaCy](https://spacy.io/) to see the effect of lemmatization on words.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy

# loading the small English model
nlp = spacy.load("en_core_web_sm")

In [None]:
text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
text

Let's lemmatize every token we can find.

In [None]:
lemmas = [token.lemma_ for token in nlp(text.lower())]
" ".join(lemmas)

You can see that "were" was correctly lemmatized to "be".

Note that the result is strongly affected by the quality of the tokenizer. For example, `reconstruction.[5` was badly tokenized. If you add an 's' at the end of "reconstruction", you will see that the word is not lemmatized correctly.
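One way to sidestep this kind of tokenization problem is to clean the text before it reaches the tokenizer. The snippet below is a minimal sketch, assuming the citation markers always look like `[5]` (digits in square brackets); the helper name `strip_citation_markers` is made up for this illustration and is not part of spaCy.

```python
import re

# Assumption: citation markers are digits in square brackets, e.g. "[5]".
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Remove bracketed numeric markers such as "[5]" from the text."""
    return CITATION_MARKER.sub("", text)


sample = "a tool for linguistic reconstruction.[5] Scholars were concerned"
cleaned = strip_citation_markers(sample)
print(cleaned)  # a tool for linguistic reconstruction. Scholars were concerned
```

The cleaned string can then be passed to `nlp` in place of the raw text, which should leave the tokenizer with an ordinary sentence-final period to split off.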

Here is another example, with the irregular verb "went".

In [None]:
import re

# keep only tokens made entirely of word characters (this drops punctuation)
re_word = re.compile(r"^\w+$")
text = "I went to the cinema"
lemmas = [token.lemma_ for token in nlp(text.lower()) if re_word.match(token.text)]
" ".join(lemmas)
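To make the role of the `re_word` filter explicit, here is a small sketch that applies the same pattern to a handful of hand-written strings. The `candidates` list is invented for the illustration and is not actual spaCy output.

```python
import re

re_word = re.compile(r"^\w+$")

# Hand-written examples, not produced by the tokenizer.
candidates = ["went", "cinema", "5", ".", "proto-languages", "reconstruction.[5"]

# A string is kept only if every character is a letter, a digit or an underscore.
kept = [tok for tok in candidates if re_word.match(tok)]
print(kept)  # ['went', 'cinema', '5']
```

Note that digits pass the filter while hyphenated strings do not. If purely alphabetic tokens are wanted, spaCy tokens also expose an `is_alpha` attribute that could be used instead.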

## Speed

Let's measure how long the lemmatizer takes on a larger corpus, and compare the number of unique tokens with the number of unique lemmas it produces.

In [None]:
from torchtext.datasets import PennTreebank
train, valid, test = PennTreebank()

In [None]:
from datetime import datetime

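# sets of the distinct surface forms and distinct lemmas seen in the corpus
nb_unique_token = set()
nb_unique_lemma = set()
# start the timer just before processing
t0 = datetime.now()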
for text in train:
    for token in nlp(text):
        if re_word.match(token.text):
            nb_unique_token.add(token.text)
            nb_unique_lemma.add(token.lemma_)
processing_time = datetime.now() - t0

In [None]:
print(f"nb unique tokens: {len(nb_unique_token)} vs nb unique lemmas: {len(nb_unique_lemma)}. Processed in {processing_time}")
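The loop above calls `nlp` once per line of the corpus. spaCy also offers `nlp.pipe`, which processes an iterable of texts in batches and is usually faster for large corpora. The function below is a sketch of the same counting logic written against `pipe`; the name `count_unique` and the default `batch_size` are choices made for this illustration, and no timing figures are claimed.

```python
import re
import time

re_word = re.compile(r"^\w+$")


def count_unique(pipeline, texts, batch_size=256):
    """Collect unique token texts and lemmas, and time the whole pass.

    `pipeline` is expected to behave like a loaded spaCy model, i.e. to
    provide a `pipe(texts, batch_size=...)` method that yields documents.
    """
    tokens, lemmas = set(), set()
    start = time.perf_counter()
    for doc in pipeline.pipe(texts, batch_size=batch_size):
        for token in doc:
            # same filter as before: keep word-only tokens
            if re_word.match(token.text):
                tokens.add(token.text)
                lemmas.add(token.lemma_)
    elapsed = time.perf_counter() - start
    return tokens, lemmas, elapsed
```

With the model and dataset loaded earlier, it could be called as `tokens, lemmas, elapsed = count_unique(nlp, train)`, where `elapsed` is a duration in seconds.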

Ask yourself the following questions:
* Why is it much slower than stemming?
* How come we have more unique lemmas than stems?

## Going further

spaCy provides [models](https://spacy.io/usage/models#languages) of different sizes for 18 languages (and two multilingual models). Some of these models support operations such as part-of-speech tagging and named entity recognition. You can learn more about the library by following its [interactive tutorial](https://course.spacy.io/en/) (though the tutorial uses spaCy 2, and not 3 yet).