## Stemming and Lemmatization

### Stemming

> **Stemming** is a technique used to reduce an inflected word to its word stem. For example, the words “programming,” “programmer,” and “programs” can all be reduced to the common word stem “program.” In other words, “program” can stand in for any of the three inflected forms.

### Advantages of Stemming
1. **Improved model performance:** Stemming reduces the number of unique words an algorithm has to process, which can improve its performance and make it run faster.
2. **Grouping similar words:** Words with a similar meaning can be grouped together, even if they have distinct forms. This is useful in tasks such as document classification, where it’s important to identify key topics or themes within a document.
3. **Easier to analyze and understand:** Since stemming typically reduces the size of the vocabulary, texts become easier to analyze, compare, and understand. This is helpful in tasks such as sentiment analysis, where the goal is to determine the sentiment of a document.
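
The vocabulary reduction described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `toy_stem` helper that strips only three suffixes; it is not the Porter algorithm, and the sample sentence is made up for the example.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (toy rule set, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Made-up sample text, split on whitespace.
tokens = "the cat jumps while the cats jumped and the dog keeps jumping".split()

vocabulary = set(tokens)
stemmed_vocabulary = {toy_stem(token) for token in tokens}

print(len(vocabulary), "unique tokens before stemming")
print(len(stemmed_vocabulary), "unique tokens after stemming")
```

Forms such as “jumps,” “jumped,” and “jumping” collapse into one entry, so the stemmed vocabulary is smaller than the raw one.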

### Disadvantages of Stemming
1. **Overstemming / False positives:** This is when a stemming algorithm reduces separate words to the same word stem even though they are not related. For example, the Porter Stemmer algorithm stems “universal,” “university,” and “universe” to the same word stem. Though they are etymologically related, their modern meanings belong to widely different domains. Treating them as synonyms can reduce relevance in search results.
2. **Understemming / False negatives:** This is when a stemming algorithm reduces inflected words to different word stems even though they should share one. For example, the Porter Stemmer algorithm does not reduce the words “alumnus,” “alumnae,” and “alumni” to the same word stem, although they should be treated as synonyms.
3. **Language challenges:** As the target language's morphology, spelling, and character encoding get more complicated, stemmers become more difficult to design. For example, an Italian stemmer is more complicated than an English stemmer because Italian has a higher number of verb inflections. A Russian stemmer is even more complex due to more noun declensions.
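
The first two problems can be checked directly with NLTK's `PorterStemmer`. The sketch below assumes `nltk` is installed; the `stems` helper is a hypothetical convenience function, not part of NLTK. It counts how many distinct stems each word group produces.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stems(words):
    """Map each word to its Porter stem (helper defined for this example)."""
    return {word: stemmer.stem(word) for word in words}


# Unrelated meanings that Porter is expected to merge (overstemming).
overstemmed = stems(["universal", "university", "universe"])

# Related forms that Porter is expected to keep apart (understemming).
understemmed = stems(["alumnus", "alumnae", "alumni"])

print(overstemmed)
print(understemmed)
print("distinct stems, first group:", len(set(overstemmed.values())))
print("distinct stems, second group:", len(set(understemmed.values())))
```

A single distinct stem for the first group indicates overstemming, while more than one distinct stem for the second group indicates understemming.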

### Lemmatization

> **Lemmatization** is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning. 

### Advantages of Lemmatization
- **Accuracy:** Lemmatization does not merely cut words off as stemming algorithms do. It analyzes each word based on its part of speech (POS), so context is taken into account when producing lemmas. It also produces real dictionary words.
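
To make the role of the part of speech concrete, here is a toy sketch of a dictionary-style lookup keyed by word and POS tag. The `LEMMA_TABLE` entries and the `toy_lemmatize` function are hand-written assumptions for illustration; real lemmatizers rely on much larger lexicons and morphological rules.

```python
# Toy lemma table keyed by (lowercased word, coarse POS tag).
# The entries are hand-written for illustration only.
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("ate", "VERB"): "eat",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


# The same surface form maps to different lemmas depending on its POS.
for word, pos in [("meeting", "VERB"), ("meeting", "NOUN"),
                  ("better", "ADJ"), ("better", "ADV"), ("ate", "VERB")]:
    print(word, pos, "|", toy_lemmatize(word, pos))
```

A stemmer sees only the string “meeting” and always returns the same stem, whereas a POS-aware lookup can tell the verb apart from the noun.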

### Disadvantages of Lemmatization
- **Time-consuming:** Compared to stemming, lemmatization is slower. This is because it performs morphological analysis and looks words up in a dictionary to derive their lemmas.

source: [datacamp.com](https://www.datacamp.com/tutorial/stemming-lemmatization-python)


In [2]:
import nltk
import spacy


In [3]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [4]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


In [14]:
# Load spaCy's small English pipeline, which includes a lemmatizer
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


In [15]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [16]:
doc = nlp("Sis, you wanna go? Sistah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Sis | Sis
, | ,
you | you
wanna | wanna
go | go
? | ?
Sistah | Sistah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [17]:
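# The attribute ruler can override token attributes with custom rules
ar = nlp.get_pipe("attribute_ruler")

# Map the informal forms "Sis" and "Sistah" to the lemma "Sister"
ar.add([[{"TEXT": "Sis"}], [{"TEXT": "Sistah"}]], {"LEMMA": "Sister"})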

doc = nlp("Sis, you wanna go? Sistah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Sis | Sister
, | ,
you | you
wanna | wanna
go | go
? | ?
Sistah | Sister
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [9]:
doc[6]

Sistah

In [12]:
doc[6].lemma_

'Sister'