## Stemming 
      -> Stemming derives a base word by applying fixed rules (typically stripping suffixes), without any knowledge of the language.
      -> spaCy has no support for stemming; it ships with lemmatization only. NLTK supports both stemming and lemmatization, so we'll use NLTK for stemming.
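
To make "fixed rules" concrete, here is a minimal sketch of a rule-based suffix stripper. It is not the Porter algorithm that NLTK implements; the rule list and the `toy_stem` helper are hypothetical, chosen only to show how such rules behave.

```python
# Toy rule-based stemmer: NOT the Porter algorithm, only an illustration
# of "fixed rules". The rule list below is a hypothetical example.
SUFFIX_RULES = [
    ("ing", ""),   # eating -> eat
    ("able", ""),  # adjustable -> adjust
    ("s", ""),     # eats -> eat
]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; leave the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least 3 letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["eating", "eats", "eat", "ate", "adjustable"]:
    print(w, " | ", toy_stem(w))
```

Because the rules only look at spelling, an irregular form such as "ate" passes through unchanged: no suffix rule can turn it into "eat".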
      

In [1]:
import nltk
import spacy

In [2]:
from nltk.stem import PorterStemmer # PorterStemmer is a class

stemmer = PorterStemmer()   # creating an object of that class

In [3]:
words = ["eating","eats","eat","ate","adjustable","rafting","ability","meeting"]

for word in words:
    print(word , " | ", stemmer.stem(word))

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  ate
adjustable  |  adjust
rafting  |  raft
ability  |  abil
meeting  |  meet


## Lemmatization
    -> Lemmatization uses knowledge of the language (linguistic knowledge) to derive the base word (lemma).
    -> We'll be using spaCy.
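
As a rough illustration of what "linguistic knowledge" means here, the sketch below uses a tiny hand-written lookup table keyed by word and part of speech. The table entries and the `toy_lemmatize` helper are hypothetical, and this is not how spaCy's lemmatizer is implemented.

```python
# Toy lemmatizer: a tiny hand-written lookup table stands in for the
# linguistic knowledge a real lemmatizer has. All entries are hypothetical
# examples, and this is not how spaCy is implemented.
LEMMA_TABLE = {
    ("ate", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up (word, part of speech); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("ate", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("meeting", "VERB"))
```

The same spelling can map to different lemmas depending on its part of speech, which is one reason lemmatizers generally need part-of-speech information and stemmers do not.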

In [4]:
nlp = spacy.load("en_core_web_sm")

doc = nlp(" eating eats eat ate adjustable rafting ability meeting better")

for token in doc:
    print(token , " | ", token.lemma_)

   |   
eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


In [5]:
doc = nlp("Mando talked for 3 hours although talking isn't his thing and he became talkative")

for token in doc:
    print(token , " | ", token.lemma_)

Mando  |  Mando
talked  |  talk
for  |  for
3  |  3
hours  |  hour
although  |  although
talking  |  talk
is  |  be
n't  |  not
his  |  his
thing  |  thing
and  |  and
he  |  he
became  |  become
talkative  |  talkative


In [6]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
# attribute_ruler assigns attributes to particular tokens using rules
# tagger predicts the part-of-speech (POS) tag of each token
# lemmatizer generates the base words (lemmas)

# add_pipe() -> adds a component to the pipeline
# get_pipe() -> returns a component from the pipeline, looked up by name
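
The `add_pipe()` and `get_pipe()` calls described above can be tried out on an empty pipeline. This is a minimal sketch that assumes a blank English pipeline (`spacy.blank("en")`) and the built-in `sentencizer` component, neither of which is used elsewhere in this notebook; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline is used so that no trained model
# download is needed; it starts with only a tokenizer and no components.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)

# add_pipe() takes the registered name of a component and returns it
sentencizer = nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# get_pipe() looks a component up by its name in the pipeline
same_component = nlp_blank.get_pipe("sentencizer")
print(same_component is sentencizer)
```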

### Customizing the Attribute Ruler component

In [8]:
doc = nlp("Bro ,  you wanna go ? Brah, don't say no! I am Exhausted")

for token in doc:
    print(token , " | ", token.lemma_)

Bro  |  bro
,  |  ,
   |   
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brah
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
Exhausted  |  exhaust


In [9]:
# customizing the attribute ruler
atr = nlp.get_pipe('attribute_ruler')

# set the lemma (base word) of 'Bro' and 'Brah' to 'Brother'
atr.add([[{"TEXT" : "Bro"}] , [{"TEXT" : "Brah"}]], {"LEMMA" : "Brother"})


doc = nlp("Bro ,  you wanna go ? Brah, don't say no! I am Exhausted")

for token in doc:
    print(token , " | ", token.lemma_)

Bro  |  Brother
,  |  ,
   |   
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brother
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
Exhausted  |  exhaust
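
The `"TEXT"` key in the patterns above matches the exact token text, so a lowercase "bro" would not be covered by that rule. A minimal sketch of a case-insensitive variant uses the `"LOWER"` pattern key instead. It assumes a blank English pipeline with only an attribute ruler, to stay independent of the trained model; `nlp_demo` and `doc_demo` are illustrative names.

```python
import spacy

# Assumption: a blank English pipeline keeps this sketch independent of
# the trained model; the only component is the attribute ruler added here.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("attribute_ruler")

# "LOWER" compares against the lowercased token text, so one pattern
# covers "Bro", "bro" and "BRO".
patterns = [[{"LOWER": "bro"}], [{"LOWER": "brah"}]]
ruler.add(patterns=patterns, attrs={"LEMMA": "Brother"})

doc_demo = nlp_demo("bro BRAH Bro")
for token in doc_demo:
    print(token, " | ", token.lemma_)
```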


## Exercises

Q1. Convert this list of words into base form using stemming and lemmatization, and observe the transformations.


Q2. Convert the given text into its base form using both stemming and lemmatization.
   
    

In [10]:
# nltk.download('all')   # run once; downloads all NLTK data, including the tokenizer models that nltk.word_tokenize needs

In [11]:
# performing stemming - Solution 1
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

for word in lst_words:
    print(word , " | ", stemmer.stem(word))

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish


In [12]:
# performing lemmatization - Solution 1

nlp = spacy.load("en_core_web_sm")

doc = nlp("running painting walking dressing likely children who good ate fishing")

for token in doc:
    print(token, " | ",token.lemma_)

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
who  |  who
good  |  good
ate  |  eat
fishing  |  fishing


In [17]:
# using stemming for solution - 2

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""


# step1 : word tokenizing    
word_tokens = nltk.word_tokenize(text)

# step2 : getting the base form for each token using stemmer
base_words = []

for token in word_tokens:
    base_form = stemmer.stem(token)
    base_words.append(base_form) 

# step3 : joining all the words in a list into string using 'join' 
final_text = ' '.join(base_words)
print(final_text)


latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .
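
The stemmed output above contains tokens such as `girl.sh` and `playing.sh`. The most likely cause is that the input text has no space after some periods, so the tokenizer keeps `girl.She` as a single token. A minimal sketch of a preprocessing step that could be applied to `text` before tokenizing is shown below; `add_space_after_periods` is a hypothetical helper, and the regular expression is a simplification that would also split abbreviations written like `e.g`.

```python
import re


def add_space_after_periods(text: str) -> str:
    """Insert a space after any period that is directly followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)


sample = "Latha is very multi talented girl.She is good at playing.She also likes eating."
print(add_space_after_periods(sample))
```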


In [22]:
# using lemmatization in SpaCy for solution - 2

text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""
nlp = spacy.load('en_core_web_sm')

# step1 : creating the Doc object for the given text
doc = nlp(text)
base_words = []

# step2 : getting the base form of each token using spaCy's 'lemma_'
for token in doc:
    base_words.append(token.lemma_)

# step3 : joining all the words in the list into a string using 'join()'
final_text = " ".join(base_words)
print(final_text)
    
    

Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . she also like eat Pav Bhagi . she have a 
 habit of fishing and swim too . besides all this , she be a wonderful at cook too . 

