
# Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
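
To make the contrast concrete, here is a minimal sketch in plain Python. Both `naive_stem` and the hand-written `TOY_LEMMAS` table are hypothetical stand-ins invented for illustration; they are not how spaCy or any real stemmer works internally. The point is only that suffix stripping never consults a vocabulary, while a lemma lookup does.

```python
# Toy comparison of stemming and lemmatization (illustrative only).

def naive_stem(word):
    """Strip a few common suffixes without consulting any vocabulary."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table; a real lemmatizer uses far larger resources.
TOY_LEMMAS = {"was": "be", "mice": "mouse", "running": "run", "ran": "run"}

def toy_lemma(word):
    """Look the word up in the table, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["was", "mice", "running", "ran"]:
    print(f"{word:8} stem={naive_stem(word):8} lemma={toy_lemma(word)}")
```

The stemmer turns 'was' into 'wa' and leaves 'mice' untouched, while the lookup recovers 'be' and 'mouse'.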

In [0]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
doc1 = nlp(u"I am a runner running in a race because I love to run since i ran today")
for token in doc1:
  print(token.text,'-->',token.lemma_)

I --> -PRON-
am --> be
a --> a
runner --> runner
running --> run
in --> in
a --> a
race --> race
because --> because
I --> -PRON-
love --> love
to --> to
run --> run
since --> since
i --> i
ran --> run
today --> today


<font color=green>In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (...11841) to avoid duplication.</font>
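
One way to see this sharing is to group surface forms by their lemma. The sketch below assumes only that each token exposes `.text` and `.lemma_` attributes, as spaCy tokens do; the `SimpleNamespace` objects are stand-ins built from the output above so the example runs without a language model, and `group_by_lemma` is a hypothetical helper name.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_lemma(tokens):
    """Map each lemma string to the list of surface forms that reduce to it."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token.lemma_].append(token.text)
    return dict(groups)

# Stand-in tokens copied from the printed output above (not produced by spaCy here).
pairs = [("runner", "runner"), ("running", "run"), ("run", "run"), ("ran", "run")]
tokens = [SimpleNamespace(text=text, lemma_=lemma) for text, lemma in pairs]

groups = group_by_lemma(tokens)
print(groups["run"])     # ['running', 'run', 'ran']
print(groups["runner"])  # ['runner']
```

With spaCy loaded, passing `doc1` in place of `tokens` should work the same way, since the function only reads those two attributes.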

### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [0]:
def show_lemmas(text):
  # Columns: token text, coarse POS tag, lemma hash (left-aligned), lemma string
  for token in text:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

Here we're using an **f-string** to format the printed text: each column gets a minimum field width, and the lemma hash value is left-aligned.
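
As a quick illustration of the format-spec pieces involved, the following standalone sketch uses made-up variable names (`word`, `tag`, `number`) and no spaCy at all. It shows a minimum width, explicit left and right alignment, and what happens when a width is written outside the replacement field.

```python
word, tag, number = "mice", "NOUN", 1384165645700560590

# {value:{width}} pads to a minimum width; strings are left-aligned by default.
padded_word = f"{word:{12}}"

# Integers are right-aligned by default, so '<' is needed to left-align them.
left_number = f"{number:<{22}}"
right_number = f"{number:>{22}}"

# A width written outside the field is not a width at all: {6} is its own
# replacement field and simply prints the number 6 after the tag.
stray = f"{tag}{6}"
fixed = f"{tag:{6}}"

for label, value in [("padded_word", padded_word), ("left_number", left_number),
                     ("right_number", right_number), ("stray", stray), ("fixed", fixed)]:
    print(f"{label:13} {value!r}")
```

Printing with `!r` makes the padding visible as quoted spaces.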

In [40]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


<font color=green>Notice that the lemma of `saw` is `see` and the lemma of `mice` is `mouse`, and yet `eighteen` is its own number, *not* an expanded form of `eight`.</font>

In [22]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(doc3)

I            PRON   561228191312463089     -PRON-
am           VERB   10382539506755952630   be
meeting      VERB   6880656908171229526    meet
him          PRON   561228191312463089     -PRON-
tomorrow     NOUN   3573583789758258062    tomorrow
at           ADP    11667289587015813222   at
the          DET    7425985699627899538    the
meeting      NOUN   14798207169164081740   meeting
.            PUNCT  12646065887601541794   .


<font color=green>Here the lemma of `meeting` is determined by its Part of Speech tag.</font>
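
The idea can be sketched as a lookup keyed on both the word and its tag. Everything here is hypothetical: `TOY_POS_LEMMAS` is a tiny hand-written table and `lemma_for` is an invented helper, so this shows only the principle, not spaCy's lemmatizer.

```python
# Hypothetical rule table keyed by (lowercased word, coarse POS tag).
TOY_POS_LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemma_for(word, pos):
    """Return the lemma for a (word, POS) pair, falling back to the lowercased word."""
    return TOY_POS_LEMMAS.get((word.lower(), pos), word.lower())

tagged = [("meeting", "VERB"), ("meeting", "NOUN"), ("saw", "VERB"), ("saw", "NOUN")]
for word, pos in tagged:
    print(f"{word:{10}} {pos:{6}} {lemma_for(word, pos)}")
```

The same surface form maps to different lemmas once the tag is part of the key, which is why an accurate part-of-speech tag matters for lemmatization.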

In [23]:
doc4 = nlp(u"That's an enormous automobile")
show_lemmas(doc4)

That         DET    4380130941430378203    that
's           VERB   10382539506755952630   be
an           DET    15099054000809333061   an
enormous     ADJ    17917224542039855524   enormous
automobile   NOUN   7211811266693931283    automobile


<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym; that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>

We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.
