<h1 style="text-align:center;color:mediumvioletred">Stemming & Lemmatization</h1>

### 1. Stemming
    Apply fixed rules, such as stripping the suffixes -able, -ing, and -ly, to derive a base word

    talking    --> talk
    eating     --> eat
    adjustable --> adjust
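
The rule-based idea above can be sketched in a few lines of plain Python. This is a toy illustration only, not the algorithm NLTK uses: the function name `naive_stem`, the suffix list, and the `min_stem_len` guard are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer: a hypothetical illustration, not NLTK's algorithm.
SUFFIXES = ("able", "ing", "ly")  # the example rules named in the text above


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem = lowered[: -len(suffix)]
        # Only strip when enough of the word is left to be a plausible stem.
        if lowered.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return lowered


for word in ["talking", "eating", "adjustable", "sing", "running"]:
    print(word, "-->", naive_stem(word))
```

The length guard keeps short words such as `sing` intact, but the sketch still turns `running` into `runn`, which is the kind of case a real stemmer handles with extra rules.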

### 2. Lemmatization
    Use knowledge of the language (linguistic knowledge) to derive a base word
    
    ate     --> eat
    better  --> good
    wrote   --> write
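
By contrast, irregular forms like these cannot be reached by stripping suffixes, so some form of lookup is needed. The sketch below is a deliberately tiny stand-in for that idea: the table `IRREGULAR_LEMMAS` and the function `naive_lemma` are hypothetical names written for this example, and real lemmatizers such as spaCy's combine much larger resources with part-of-speech information.

```python
# Toy dictionary-based lemmatizer: a hypothetical illustration, not spaCy's implementation.
# The table is hand-written for this example and covers only the three words above.
IRREGULAR_LEMMAS = {
    "ate": "eat",
    "better": "good",
    "wrote": "write",
}


def naive_lemma(word):
    """Return the lemma from the lookup table, or the lowercased word if unknown."""
    lowered = word.lower()
    return IRREGULAR_LEMMAS.get(lowered, lowered)


for word in ["ate", "better", "wrote", "talk"]:
    print(word, "-->", naive_lemma(word))
```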

### Stemming

In [1]:
import nltk
import spacy

In [2]:
from nltk.stem import PorterStemmer

stemmer =  PorterStemmer()

In [8]:
text = "The cats are running faster than the dogs"
words = text.split(" ")

for word in words:
    print(word, "|", stemmer.stem(word))

The | the
cats | cat
are | are
running | run
faster | faster
than | than
the | the
dogs | dog


In [15]:
text = "The mice were eating the cheeses while the men sang happily and became tired"
words = text.split(" ")

for word in words:
    print(word, "|", stemmer.stem(word))

The | the
mice | mice
were | were
eating | eat
the | the
cheeses | chees
while | while
the | the
men | men
sang | sang
happily | happili
and | and
became | becam
tired | tire
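
Stems such as `happili` are not dictionary words because the Porter algorithm applies suffix rules in fixed steps without checking the result against a vocabulary. The sketch below reproduces only two of those steps (1a and 1c) as described in Porter's original algorithm. It is a partial illustration under stated assumptions: the vowel test is simplified to `a, e, i, o, u`, and NLTK's default mode adds its own extensions, so this is not a substitute for `PorterStemmer.stem`.

```python
# Partial sketch of two Porter steps; assumption: vowels are simplified to a, e, i, o, u
# (the original algorithm also treats a 'y' that follows a consonant as a vowel).
VOWELS = set("aeiou")


def step_1a(word):
    """Step 1a: normalise plural endings."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def step_1c(word):
    """Step 1c: a final 'y' becomes 'i' when the rest of the word contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(ch in VOWELS for ch in stem):
        return stem + "i"
    return word


print(step_1a("cats"), step_1a("cheeses"), step_1c("happily"))
```

Step 1a alone turns `cheeses` into `cheese`; the `chees` in the output above shows that later steps of the full algorithm trim the word further.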


### Lemmatizing

In [12]:
nlp = spacy.load("en_core_web_md")

doc = nlp("The children were running quickly towards the better houses")

# lemma_ is the lemma as a string; lemma is its integer hash in spaCy's string store
for word in doc:
    print(word, "|", word.lemma_, "|", word.lemma)

The | the | 7425985699627899538
children | child | 737253710922290542
were | be | 10382539506755952630
running | run | 12767647472892411841
quickly | quickly | 7007696535375059571
towards | towards | 9315050841437086371
the | the | 7425985699627899538
better | well | 4525988469032889948
houses | house | 9471806766518506264


In [14]:
doc = nlp("The mice were eating the cheeses while the men sang happily and became tired")

for word in doc:
    print(word, "|", word.lemma_)

The | the
mice | mouse
were | be
eating | eat
the | the
cheeses | cheese
while | while
the | the
men | man
sang | sing
happily | happily
and | and
became | become
tired | tired


In [16]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [18]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for word in doc:
    print(word.text, "|", word.lemma_)

Bro | Bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [19]:
doc[0]

Bro

In [20]:
doc[0].lemma_

'Bro'

In [23]:
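# Fetch the pipeline's attribute ruler so token attributes can be overridden
ar = nlp.get_pipe('attribute_ruler')

# Assign the lemma "Brother" to any token whose text is exactly "Bro" or "Brah"
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})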

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for word in doc:
    print(word.text, "|", word.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [24]:
doc[0]

Bro

In [25]:
doc[0].lemma_

'Brother'

In [26]:
doc[6]

Brah

In [27]:
doc[6].lemma_

'Brother'

# Exercise

### Question 1:
- Convert this list of words into base form using stemming and lemmatization, and observe the transformations
- Write a short note on the words whose base forms differ between stemming and lemmatization

In [28]:
text1 = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

text2 = "running painting walking dressing likely children whom good ate fishing"

#### Stemming using nltk

In [29]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [67]:
print("Raw word   -->   Stem")
print("-"*30)
given_words = []
stems = []
for word in text1:
    stem = stemmer.stem(word)
    given_words.append(word)
    stems.append(stem)
    print(word,"   -->   ",stem)

Raw word   -->   Stem
------------------------------
running    -->    run
painting    -->    paint
walking    -->    walk
dressing    -->    dress
likely    -->    like
children    -->    children
whom    -->    whom
good    -->    good
ate    -->    ate
fishing    -->    fish


#### Lemmatization using spacy

In [68]:
import spacy

nlp = spacy.load("en_core_web_md")

In [70]:
doc = nlp("running painting walking dressing likely children whom good ate fishing")

print("Raw word   -->   Lemma")
print("-"*30)
lemmas = []
for word in doc:
    lemma = word.lemma_
    lemmas.append(lemma)
    print(word.text, "   -->   ",lemma)

Raw word   -->   Lemma
------------------------------
running    -->    run
painting    -->    paint
walking    -->    walk
dressing    -->    dress
likely    -->    likely
children    -->    child
whom    -->    whom
good    -->    good
ate    -->    eat
fishing    -->    fishing


#### Differences

In [75]:
print("Given words   -->   Stems  -->  Lemmas")
print("-"*40)

# Show only the words whose stem and lemma differ
for word, stem, lemma in zip(given_words, stems, lemmas):
    if stem != lemma:
        print(word, "   -->   ", stem, "   -->   ", lemma)

Given words   -->   Stems  -->  Lemmas
----------------------------------------
likely    -->    like    -->    likely
children    -->    children    -->    child
ate    -->    ate    -->    eat
fishing    -->    fish    -->    fishing


### Question 2:
- Convert the given text into its base form using both stemming and lemmatization

In [82]:
text = """Latha is very multi talented girl. She is good at many skills like dancing, running, singing, playing. She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too. Besides all this, she is a wonderful at cooking too."""

#### Stemming using nltk

In [83]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

In [84]:
# Step 1: split the text into word and punctuation tokens
words = word_tokenize(text)

# Step 2: stem every token
base_words = []

for word in words:
    base_words.append(stemmer.stem(word))

# Step 3: join the stems back into a single string
joined = ' '.join(base_words)
joined

'latha is veri multi talent girl . she is good at mani skill like danc , run , sing , play . she also like eat pav bhagi . she ha a habit of fish and swim too . besid all thi , she is a wonder at cook too .'

#### Lemmatization using spacy

In [85]:
import spacy

nlp = spacy.load("en_core_web_md")

In [86]:
doc = nlp(text)

base_words = []

for word in doc:
    lemma = word.lemma_
    base_words.append(lemma)

all_base_words = ' '.join(base_words)
all_base_words

'Latha be very multi talented girl . she be good at many skill like dance , running , singing , playing . she also like eat Pav Bhagi . she have a \n habit of fishing and swimming too . besides all this , she be a wonderful at cook too .'