Stemming: uses fixed rules, such as stripping suffixes like "able" and "ing", to derive a base word. The result is not always a real word.

Lemmatization: uses knowledge of the language (linguistic knowledge) to derive a base word, called the lemma.

spaCy: provides only lemmatization

NLTK: provides both stemming and lemmatization
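
To make the difference between the two definitions concrete, here is a minimal pure-Python sketch. It is a toy, not how NLTK or spaCy work internally: the suffix list `SUFFIXES`, the lookup table `LEMMA_TABLE`, and the functions `toy_stem` and `toy_lemmatize` are all hypothetical names invented for this illustration.

```python
# Toy illustration only: real stemmers (e.g. Porter) and real lemmatizers
# use far richer rules and data than this.

# Hypothetical suffix list standing in for the stemmer's "fixed rules".
SUFFIXES = ("able", "ment", "ing", "s")

# Hypothetical mini-dictionary standing in for "linguistic knowledge".
LEMMA_TABLE = {"ate": "eat", "eating": "eat", "eats": "eat", "children": "child"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["eating", "eats", "ate", "adjustable", "adjustment", "children"]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based function can only chop endings, so it has no way to map an irregular form such as "ate" to "eat". The lookup-based function handles irregular forms, but only for words it has an entry for.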

In [None]:
import spacy
import nltk

#Stemming in NLTK

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [None]:
words = ["flying", "eat", "ate", "meeting", "eats", "eat", "adjustable", "rafting", "adjust", "ability", "adjustment"]

for word in words:
    print(word, "|", stemmer.stem(word))

flying | fli
eat | eat
ate | ate
meeting | meet
eats | eat
eat | eat
adjustable | adjust
rafting | raft
adjust | adjust
ability | abil
adjustment | adjust


#Lemmatization in SpaCy

In [None]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate ajustable adjustment rafting ability meeting better")

for token in doc:
    print(token, "|", token.lemma_)

eating | eat
eats | eat
eat | eat
ate | eat
ajustable | ajustable
adjustment | adjustment
rafting | raft
ability | ability
meeting | meeting
better | well


In [None]:
doc = nlp("Mando talked for 3 hours although talking isn't his thing")

for token in doc:
    print(token, "|", token.lemma_)

Mando | Mando
talked | talk
for | for
3 | 3
hours | hour
although | although
talking | talk
is | be
n't | not
his | his
thing | thing


In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

If the model does not handle slang or other unusual words correctly, we can customize it with the attribute_ruler component, which assigns attributes such as the lemma to tokens that match a pattern.

In [None]:
ar = nlp.get_pipe('attribute_ruler')

ar.add([[{"TEXT":"Bro"}], [{"TEXT":"Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [None]:
doc[0]

Bro

In [None]:
doc[0].lemma_

'Brother'

#Exercise

In [None]:
import nltk
import spacy

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#for nltk (downloads every NLTK data package, which is large and can take a while)
nltk.download('all')
#for spacy
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

Exercise1:

* Convert this list of words into their base forms using stemming and lemmatization, and
observe the transformations
* Write a short note on the words whose base forms differ between stemming and lemmatization




In [None]:
#using stemming in nltk
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

for word in lst_words:
    print(word, " | ", stemmer.stem(word))

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish


In [None]:
#using lemmatization in spacy
doc = nlp("running painting walking dressing likely children whom good ate fishing")

for token in doc:
    print(token, " | ", token.lemma_)

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
whom  |  whom
good  |  good
ate  |  eat
fishing  |  fishing
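
One possible way to find the words for the short note is to put the two results side by side and keep only the rows that differ. The sketch below copies the stem and lemma values by hand from the two outputs above, so it needs neither NLTK nor spaCy. The names `results` and `differing` are made up for this example.

```python
# (stem, lemma) pairs copied by hand from the two exercise outputs above.
results = {
    "running": ("run", "run"),
    "painting": ("paint", "paint"),
    "walking": ("walk", "walk"),
    "dressing": ("dress", "dress"),
    "likely": ("like", "likely"),
    "children": ("children", "child"),
    "whom": ("whom", "whom"),
    "good": ("good", "good"),
    "ate": ("ate", "eat"),
    "fishing": ("fish", "fishing"),
}

# Keep only the words where the stemmer and the lemmatizer disagree.
differing = {word: pair for word, pair in results.items() if pair[0] != pair[1]}

for word, (stem, lemma) in differing.items():
    print(f"{word}: stem={stem!r}, lemma={lemma!r}")
```

Four words differ: "likely", "children", "ate" and "fishing". The stemmer leaves the irregular forms "children" and "ate" unchanged because no suffix rule applies, while the lemmatizer maps them to "child" and "eat". In the other direction, the stemmer trims "likely" and "fishing", which the lemmatizer keeps as they are.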


Exercise2:

* Convert the given text into its base form using both stemming and lemmatization

In [None]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [None]:
#using stemming in nltk
#step1: Word tokenizing
all_word_tokens = nltk.word_tokenize(text)

#step2: getting the base form for each token using stemmer
all_base_words = []

for token in all_word_tokens:
  base_form = stemmer.stem(token)
  all_base_words.append(base_form)

#step3: joining all words in a list into string using 'join()'
final_base_text = " ".join(all_base_words)
print(final_base_text)

latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .


In [None]:
#using lemmatization in spacy
#step1: Creating the object for the given text
doc = nlp(text)
all_base_words = []

#step2: getting the base form for each token using spacy 'lemma_'
for token in doc:
  all_base_words.append(token.lemma_)

#step3: joining all words in a list into string using 'join()'
final_base_text = " ".join(all_base_words)
print(final_base_text)

Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . she also like eat Pav Bhagi . she have a 
 habit of fishing and swim too . besides all this , she be a wonderful at cook too . 
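
In the stemming output above, tokens such as "girl.sh" and "playing.sh" appear because the input text has no space after some periods, so those words were not split apart before stemming. A minimal sketch of one possible fix is to normalize the spacing with a regular expression before tokenizing. This assumes the text contains no abbreviations, decimal numbers or URLs that such a rule would break. The variable name `normalized` is made up for this example.

```python
import re

text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

# Insert a space after any period that is directly followed by a letter.
# Assumption: no abbreviations, decimals or URLs occur in the text.
normalized = re.sub(r"\.(?=[A-Za-z])", ". ", text)

# Collapse the line break and any repeated whitespace into single spaces.
normalized = " ".join(normalized.split())

print(normalized)
```

The normalized string can then be passed to the tokenizer in place of `text`, so that sentence-final words and the words that follow them become separate tokens.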

