# Background

- Languages contain many words that are derived from one another.

>"In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change"
[Wikipedia](https://en.wikipedia.org/wiki/Inflection)

Stemming and lemmatization help us obtain the root forms of inflected (derived) words. In a search context, these root forms are sometimes called synonyms.
*Stemming differs from lemmatization both in the approach it uses to produce root forms and in the form it produces.*

## Stemming

>"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."

**A stem (root) is the part of a word to which inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis- are added.**

### Stemming algorithms

Both English and non-English stemmers are available in the `nltk` package.
For the English language, you can choose between [PorterStemmer](https://tartarus.org/martin/PorterStemmer/def.txt) and LancasterStemmer.

In [1]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

words = ['dogs', 'trouble', 'troubling', 'troubled']

In [2]:
# PorterStemmer
porter = PorterStemmer()

for word in words:
    print(porter.stem(word))

dog
troubl
troubl
troubl


PorterStemmer uses **suffix stripping** to produce stems. Notice that 'trouble', 'troubling' and 'troubled' are all stemmed to 'troubl', which is not an English word. This happens because the **Porter algorithm does not follow linguistic rules; instead it applies sets of suffix rules in five phases (step by step) to generate stems**. This is why PorterStemmer often generates stems that are not actual English words.

It is known for its **simplicity** and **speed**.
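
The suffix-stripping idea can be illustrated with a few lines of plain Python. This is only a toy sketch, not the Porter algorithm: the suffix list `SUFFIXES`, the function `strip_suffix` and the `min_stem_len` guard are hypothetical names chosen for illustration, and the real algorithm applies many more rules and conditions.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of the idea.
SUFFIXES = ("ing", "ed", "s", "e")  # hypothetical rule list, tried in this order


def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


for word in ["dogs", "trouble", "troubling", "troubled"]:
    print(word, "->", strip_suffix(word))
```

Even this crude rule list maps 'trouble', 'troubling' and 'troubled' to the same non-word stem, which shows why rule-based stripping is fast but does not guarantee dictionary words.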

In [3]:
# LancasterStemmer
lancaster = LancasterStemmer()

for word in words:
    print(lancaster.stem(word))

dog
troubl
troubl
troubl


The **LancasterStemmer (Paice-Husk stemmer)** is an iterative algorithm whose rules are stored externally, in a single table of about 120 rules indexed by the last letter of a suffix. On each iteration, it tries to find an applicable rule by the last character of the word.

>LancasterStemmer is simple, but its iterations make it a heavy stemmer, so over-stemming may occur. Over-stemming produces stems that are not linguistic units or that carry no meaning.
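
A minimal sketch of an iterative, table-driven stemmer may make the description above more concrete. The rule table `RULES`, the function `iterative_stem` and the `min_len` guard are hypothetical and far simpler than the real Lancaster rule set; the sketch only shows lookup by last letter and repeated application.

```python
# Toy rule table keyed by the LAST letter of the word (hypothetical rules,
# much smaller than the real Lancaster table).
# Each rule is (ending, replacement, keep_going).
RULES = {
    "s": [("s", "", True)],
    "p": [("ship", "", False)],
    "g": [("ing", "", True)],
    "d": [("ed", "", True)],
    "e": [("e", "", False)],
}


def iterative_stem(word, min_len=3):
    """Apply rules repeatedly until no rule fires or a rule says stop."""
    keep_going = True
    while keep_going and word:
        keep_going = False
        for ending, replacement, cont in RULES.get(word[-1], []):
            candidate = word[: len(word) - len(ending)] + replacement
            if word.endswith(ending) and len(candidate) >= min_len:
                word = candidate
                keep_going = cont  # some rules allow another pass
                break
    return word


for word in ["friendships", "troubled", "dogs"]:
    print(word, "->", iterative_stem(word))
```

In this toy, 'friendships' loses 's' on the first pass and 'ship' on the second, which is the kind of repeated stripping that makes an iterative stemmer aggressive.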

For example, in the code below `destabilize` is stemmed to `dest` by LancasterStemmer, whereas PorterStemmer produces `destabil`. LancasterStemmer produces an even shorter stem than Porter because its iterations lead to over-stemming.

In [4]:
# Comparison of Porter and Lancaster stemmers
word_list = ["friend", "friendship", "friends","friendships","stabilize","destabilize","misunderstanding","railroad","moonlight","football"]
print(f'{"Word":20}{"Porter":20}{"Lancaster":20}')
print()

for word in word_list:
    print(f"{word:20}{porter.stem(word):20}{lancaster.stem(word):20}")


Word                Porter              Lancaster           

friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabilize           stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             
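
One way to read the comparison is to group words by the stem each stemmer assigns. The sketch below does this in plain Python using stems copied from the output above; the dictionary `stems` and the helper `group_by_stem` are illustrative names, not part of `nltk`.

```python
from collections import defaultdict

# Stems copied from the comparison output: word -> (porter, lancaster)
stems = {
    "friend": ("friend", "friend"),
    "friendship": ("friendship", "friend"),
    "friends": ("friend", "friend"),
    "friendships": ("friendship", "friend"),
    "stabilize": ("stabil", "stabl"),
    "destabilize": ("destabil", "dest"),
}


def group_by_stem(column):
    """Group words by the stem in the given column (0 = Porter, 1 = Lancaster)."""
    groups = defaultdict(list)
    for word, pair in stems.items():
        groups[pair[column]].append(word)
    return dict(groups)


porter_groups = group_by_stem(0)
lancaster_groups = group_by_stem(1)
print(len(porter_groups), "Porter groups:", porter_groups)
print(len(lancaster_groups), "Lancaster groups:", lancaster_groups)
```

Fewer, larger groups indicate more aggressive conflation: here Lancaster merges 'friend' and 'friendship' into one group, while Porter keeps them apart.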


#### Stemming sentences

To split a sentence into words, you can use a **tokenizer**.

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

sentence = "All work and no play makes jack a dull boy, all work and no play."    

In [6]:
def stem_sentence(sentence):
    tokens = word_tokenize(sentence)
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in tokens])

In [7]:
stem_sentence(sentence)

'all work and no play make jack a dull boy , all work and no play .'

### Non-English Stemmers

`nltk` provides **SnowballStemmer, ISRIStemmer, RSLPStemmer**.

**ISRIStemmer** is an Arabic stemmer and **RSLPStemmer** is a stemmer for the Portuguese language.

In [8]:
# SnowballStemmer languages
from nltk.stem.snowball import SnowballStemmer

print('\n'.join(SnowballStemmer.languages))

danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
porter
portuguese
romanian
russian
spanish
swedish


In [9]:
stemmer = SnowballStemmer('english')
stemmer.stem('sleeping')

'sleep'

You can also tell the stemmer to ignore stop words.

>Stop words are words that carry little significance on their own. They are usually filtered out because they return a vast amount of unnecessary information.

In [10]:
stemmer = SnowballStemmer('english', ignore_stopwords=True)
stemmer.stem('sleeping')

'sleep'
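
The result for 'sleeping' is the same with and without the flag, presumably because 'sleeping' is not a stop word. The plain-Python sketch below illustrates what ignoring stop words is assumed to mean: words on a stop-word list are returned unchanged and everything else is stemmed. `STOP_WORDS`, `toy_stem` and `stem_ignoring_stopwords` are hypothetical stand-ins, not `nltk` code, and the stop-word list is hand-picked.

```python
# Hypothetical, hand-picked stop-word list; NLTK uses its own per-language lists.
STOP_WORDS = {"having", "does", "was", "the", "and"}


def toy_stem(word):
    """Stand-in stemmer for illustration only: strips a trailing 'ing'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def stem_ignoring_stopwords(word):
    """Return stop words unchanged; stem everything else."""
    if word in STOP_WORDS:
        return word
    return toy_stem(word)


for word in ["sleeping", "having"]:
    print(word, "->", toy_stem(word), "|", stem_ignoring_stopwords(word))
```

Only the stop word 'having' is treated differently by the two functions; 'sleeping' is stemmed either way.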

In [11]:
stemmer = SnowballStemmer("spanish", ignore_stopwords=True)
stemmer.stem("dormir")

'dorm'

## Lemmatization

> Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

In [12]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

In [13]:
sentence = 'He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun.'
punctuations = '?:!.,;'
tokens = word_tokenize(sentence)

In [14]:
sentence_words = []

for word in tokens:
    if word not in punctuations:
        sentence_words.append(word)

In [15]:
print(f'{"Word":20}{"Lemma":20}')
for word in sentence_words:
    print(f'{word:20}{wordnet_lemmatizer.lemmatize(word):20}')

Word                Lemma               
He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 


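You need to provide the context in which you want to lemmatize, that is, the part of speech (POS). Without it, every word is treated as a noun, which is why 'was' and 'has' became 'wa' and 'ha' above while the verbs were left unchanged. The POS is supplied through the `pos` parameter of `wordnet_lemmatizer.lemmatize`.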

In [16]:
print(f'{"Word":20}{"Lemma":20}')
for word in sentence_words:
    print(f'{word:20}{wordnet_lemmatizer.lemmatize(word, pos="v"):20}')

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
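
Passing `pos="v"` for every token fixes the verbs but leaves the noun 'hours' unchanged, so in practice each token needs its own POS. A common approach is to take Penn Treebank tags from a tagger (for example `nltk.pos_tag`) and map them to the single-letter values that `lemmatize` accepts. The sketch below shows only the mapping step in plain Python; `penn_to_wordnet` is a hypothetical helper and the tagged pairs are hand-written stand-ins for tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged = [("He", "PRP"), ("was", "VBD"), ("running", "VBG"), ("hours", "NNS")]

for token, tag in tagged:
    print(f"{token:20}{tag:6}{penn_to_wordnet(tag)}")
```

Each letter could then be passed as the `pos` argument, for example `wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))`, so that verbs and nouns are each lemmatized with the right context.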
