
**Stemming**

Often when searching for a *certain* keyword, it helps if the search returns variations of the word. For instance, searching for "look" might return "looks" or "looking". Here, "look" is the stem of "looks" and "looking".

Stemming is a somewhat crude method for cataloging related words. It essentially chops letters off the end of a word until the stem is reached.
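To make the "chopping" idea concrete, here is a minimal sketch of a toy suffix stripper. It is not the Porter algorithm or any NLTK code; `SUFFIXES`, `toy_stem` and `min_stem_len` are hypothetical names chosen for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, checked in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["look", "looks", "looking", "looked"]:
    print(word + "----->" + toy_stem(word))
```

All four words reduce to "look". The crudeness shows quickly, though: this sketch turns "running" into "runn" and leaves "ran" untouched, because it only looks at letters and knows nothing about the vocabulary.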

It might be surprising, but spaCy doesn't include a stemmer; it relies on lemmatization only. Therefore, in this section, we will use NLTK for stemming.

NLTK stands for Natural Language Toolkit.

In [None]:
import nltk

In [None]:
from nltk.stem.porter import PorterStemmer  # The Porter stemmer is a stemming algorithm published by Martin Porter in 1980. It applies a fixed sequence of suffix-stripping rules.

In [None]:
p_stemmer = PorterStemmer()

In [None]:
words = ["run", "runner", "ran", "runs", "easily", "fairly"]

In [None]:
for word in words:
  print(word + "----->" + p_stemmer.stem(word))

run----->run
runner----->runner
ran----->ran
runs----->run
easily----->easili
fairly----->fairli


In [1]:
from nltk.stem.snowball import SnowballStemmer  # Snowball is a stemming framework also developed by Martin Porter. Its English stemmer (sometimes called Porter2) is generally considered an improvement over the original Porter stemmer.

In [2]:
s_stemmer = SnowballStemmer(language="english")

In [3]:
words = ["run", "runner", "ran", "runs", "easily", "fairly"]

In [4]:
for word in words:
  print(word + "----->" + s_stemmer.stem(word))

run----->run
runner----->runner
ran----->ran
runs----->run
easily----->easili
fairly----->fair


In [5]:
#One more example

words = ["generous", "generation", "generously", "generate"]

In [6]:
for word in words:
  print(word + "----->" + s_stemmer.stem(word))

generous----->generous
generation----->generat
generously----->generous
generate----->generat


**Lemmatization**

Lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.

The lemma of "was" is "be" and the lemma of "mice" is "mouse". Further, the lemma of "meeting" might be "meet" or "meeting", depending on its use in a sentence.

Lemmatization is generally seen as much more informative than simple stemming, which is why spaCy offers only lemmatization and no stemmer.

Lemmatization looks at the surrounding text to determine a given word's part of speech. It does not categorize phrases.
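The dependence on part of speech can be sketched with a tiny lookup table. This is only an illustration under simplifying assumptions, not how spaCy works internally: `LEMMA_TABLE` and `toy_lemma` are hypothetical, and the table stands in for the full vocabulary and morphological rules a real lemmatizer uses.

```python
# Toy lookup-based lemmatizer; the table is a hypothetical stand-in for
# the vocabulary and morphological analysis of a real lemmatizer.
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemma("meeting", "VERB"))  # meet
print(toy_lemma("meeting", "NOUN"))  # meeting
print(toy_lemma("mice", "NOUN"))     # mouse
```

The same surface form "meeting" gets two different lemmas depending on the tag, which is something a purely letter-based stemmer cannot do.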

In [7]:
import spacy

In [8]:
nlp = spacy.load("en_core_web_sm")

In [9]:
doc1 = nlp(u"I am runner running in a race because I love to run since I ran today.")

In [22]:
for token in doc1:
  print(token.text,'\t', token.pos_, '\t', token.lemma, '\t', token.lemma_ )

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 AUX 	 10382539506755952630 	 be
runner 	 PROPN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today
. 	 PUNCT 	 12646065887601541794 	 .


The tokens "running", "run" and "ran" are all reduced to the same lemma, "run", so they share the same hash value, 12767647472892411841.
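The integer column is a hash of the lemma string, so identical lemma strings always map to the identical integer. A minimal sketch of that idea follows. It uses `hashlib.blake2b` purely as a stand-in (spaCy uses its own hash function, so these integers will not match the ones printed above), and `toy_hash` is a hypothetical helper.

```python
import hashlib


def toy_hash(text):
    """Map a string to a 64-bit integer (stand-in for a string store's hash)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


# (token, lemma) pairs taken from the spaCy output above.
pairs = [("running", "run"), ("run", "run"), ("ran", "run"), ("race", "race")]
for text, lemma in pairs:
    print(f"{text:<10}{lemma:<6}{toy_hash(lemma)}")
```

The first three rows print the same integer because they hash the same string, "run"; the last row differs.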

In [35]:
def show_lemma(text):
  for token in text:
    print(f'{token.text:{12}},{token.pos_:{6}},{token.lemma:<{22}},{token.lemma_:{10}} ')

In [37]:

show_lemma(doc1)  # Now the output is printed in aligned columns.

I           ,PRON  ,561228191312463089    ,-PRON-     
am          ,AUX   ,10382539506755952630  ,be         
runner      ,PROPN ,12640964157389618806  ,runner     
running     ,VERB  ,12767647472892411841  ,run        
in          ,ADP   ,3002984154512732771   ,in         
a           ,DET   ,11901859001352538922  ,a          
race        ,NOUN  ,8048469955494714898   ,race       
because     ,SCONJ ,16950148841647037698  ,because    
I           ,PRON  ,561228191312463089    ,-PRON-     
love        ,VERB  ,3702023516439754181   ,love       
to          ,PART  ,3791531372978436496   ,to         
run         ,VERB  ,12767647472892411841  ,run        
since       ,SCONJ ,10066841407251338481  ,since      
I           ,PRON  ,561228191312463089    ,-PRON-     
ran         ,VERB  ,12767647472892411841  ,run        
today       ,NOUN  ,11042482332948150395  ,today      
.           ,PUNCT ,12646065887601541794  ,.          


**Stop Words**

Words like "a" and "the" appear so frequently that they do not require tagging as thoroughly as nouns, verbs and modifiers. We call these words stop words, and they can be filtered out of the text before processing.
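Filtering itself is just a membership test against a set. The sketch below uses plain Python with a small hypothetical stop-word set (`STOP_WORDS`) and a hypothetical helper (`remove_stop_words`); spaCy's default English list is much larger.

```python
# Small hypothetical stop-word set; spaCy's default English list is much larger.
STOP_WORDS = {"a", "the", "is", "in", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The runner is running in a race".split()
print(remove_stop_words(tokens))  # ['runner', 'running', 'race']
```

Passing a different set as `stop_words` is the plain-Python analogue of adding or removing entries from a library's default list.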

In [38]:
len(nlp.Defaults.stop_words)

326

In [39]:
# Adding a stop word to the default set

nlp.Defaults.stop_words.add('btw')

In [40]:
print(nlp.Defaults.stop_words)

{'whence', 'wherever', 'latterly', 'n’t', 'please', 'myself', 'herself', 'none', 'eight', '’ve', 'thereafter', 'than', 'yours', 'around', 'perhaps', 'noone', 'them', 'whereby', 'there', 'where', 'nor', 'five', 'such', 'btw', 'using', 'whether', 'own', 'hence', 'could', 'off', '’d', 'make', 'however', 'that', 'former', 'empty', 'on', 'others', '’re', 'him', 'anything', 'herein', 'yourself', 'already', 'as', 'ca', 'per', 'ever', 'call', 'due', 'yet', 'twelve', 'here', 'throughout', 'fifty', 'is', 'also', 'will', 'into', 'under', 'after', 'bottom', 'then', 'while', 'he', 'against', 'anywhere', 'used', 'been', 'thereby', 'this', 'even', 'himself', 'though', 'put', 'really', 'whole', 'so', 'does', 'his', 'out', 'nobody', 'became', 'doing', 'what', 'should', 'would', 'from', 'whither', 'mine', 'hereafter', '’m', 'being', 'often', 'still', 'down', 'enough', 'themselves', 'few', 'whereafter', 'may', 'seemed', 'mostly', 'were', 'other', 'why', 'twenty', 'everyone', 'those', 'becomes', 'any', 'r

In [42]:
len(nlp.Defaults.stop_words)  # 'btw' has been added, so the count goes up by one

327

In [44]:
nlp.vocab['btw'].is_stop

True

In [45]:
#Removing a stop word

nlp.Defaults.stop_words.remove("beyond")

In [46]:
nlp.vocab['beyond'].is_stop

False