In [1]:
import nltk

# Stemming 

Stemming reduces a word to a base form (its stem) by stripping affixes, much like cutting the branches of a tree back to its stem. For example, an ideal stemmer would map eating, eats, and eaten to eat, although rule-based stemmers do not always handle irregular forms such as eaten.

Search engines use stemming when indexing words: rather than storing every form of a word, the index stores only the stems. This reduces the size of the index and lets a query match documents that use a different form of the same word, which typically improves recall.

![image.png](attachment:image.png)
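
To make the indexing idea concrete, here is a minimal sketch of a stem-keyed inverted index. The three toy documents, the whitespace tokenization, and the `search` helper are assumptions made for illustration; a real search engine does far more than this.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical toy corpus: document id -> text
docs = {
    0: "the cat eats fish",
    1: "cats are eating",
    2: "dogs sleep",
}

# Inverted index keyed on stems instead of surface forms
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():  # naive whitespace tokenization
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return the sorted ids of documents containing any form of the query word."""
    return sorted(index.get(stemmer.stem(query), set()))


print(search("eat"))   # [0, 1]
print(search("cats"))  # [0, 1]
```

Because eats and eating share the key eat, the index holds one entry for both, and a query for either form finds both documents.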

### 1. PorterStemmer

The Porter stemmer is one of the most widely used stemming algorithms. It is designed to remove and replace well-known suffixes of English words.

The PorterStemmer class applies a sequence of suffix rules to turn the input word into a stem. The resulting stem is often a shorter word with the same root meaning, but it is not guaranteed to be a dictionary word. Let us see an example:

In [2]:
from nltk.stem import PorterStemmer

In [3]:
ps_stemmer = PorterStemmer()

In [4]:
ps_stemmer.stem('eating')

'eat'

In [5]:
ps_stemmer.stem('meeting')

'meet'

In [9]:
ps_stemmer.stem('sleeps')

'sleep'

In [14]:
ps_stemmer.stem('eaten')

'eaten'
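
As a further sketch, the same stemmer can be applied to a whole list of words. The word list is chosen here for illustration and is not part of the original notebook; it shows related forms collapsing to one stem and a stem that is not itself an English word.

```python
from nltk.stem import PorterStemmer

ps_stemmer = PorterStemmer()

# Illustrative word list (an assumption, not from the original notebook)
words = ["connection", "connected", "connecting", "studies"]

stems = {word: ps_stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
# connection -> connect
# connected -> connect
# connecting -> connect
# studies -> studi
```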

### 2. LancasterStemmer

In [10]:
from nltk.stem import LancasterStemmer

In [11]:
ls_stemmer = LancasterStemmer()

In [12]:
ls_stemmer.stem('eats')

'eat'

In [13]:
ls_stemmer.stem('eaten')

'eat'
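
The Lancaster stemmer is generally regarded as more aggressive than the Porter stemmer. A small side-by-side sketch over the words used in this notebook makes the difference visible; the `compare` helper is a hypothetical name introduced only for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

ps_stemmer = PorterStemmer()
ls_stemmer = LancasterStemmer()


def compare(word):
    """Return (porter_stem, lancaster_stem) for a single word."""
    return ps_stemmer.stem(word), ls_stemmer.stem(word)


for word in ["eating", "eats", "eaten", "meeting", "sleeps"]:
    porter, lancaster = compare(word)
    print(f"{word:10} porter={porter:10} lancaster={lancaster}")
```

For eaten, the outputs above already show the contrast: Porter leaves the word unchanged, while Lancaster reduces it to eat.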

### 3. RegexpStemmer

NLTK provides the RegexpStemmer class for building simple regular-expression stemmers. It takes a single regular expression and removes every part of the word that matches it, wherever the match occurs, so an unanchored pattern such as 'ing' strips a match at the start of a word as well as at the end. Let us see an example:

In [15]:
from nltk.stem import RegexpStemmer

In [17]:
re_stemmer = RegexpStemmer('ing')

In [18]:
re_stemmer.stem('eating')

'eat'

In [19]:
re_stemmer.stem('ingeat')

'eat'

In [20]:
re_stemmer.stem('eats')

'eats'
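
Because the pattern is removed wherever it matches, it is usually safer to anchor it to the end of the word with `$`. The sketch below contrasts the two behaviours and also uses the `min` argument, which leaves words shorter than the given length untouched. The pattern and test words are chosen for illustration.

```python
from nltk.stem import RegexpStemmer

# Unanchored: every occurrence of 'ing' is removed
loose = RegexpStemmer('ing')
print(loose.stem('singing'))     # 's'

# Anchored: only a trailing suffix is removed, and only for words of 4+ characters
anchored = RegexpStemmer('ing$|s$|able$', min=4)
print(anchored.stem('singing'))  # 'sing'
print(anchored.stem('ingeat'))   # 'ingeat'
print(anchored.stem('cars'))     # 'car'
print(anchored.stem('was'))      # 'was' (shorter than min, left unchanged)
```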

### 4. SnowballStemmer

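SnowballStemmer supports more than a dozen languages besides English. In the NLTK version used here it lists 14 of them, and it also exposes the original Porter algorithm under the name 'porter'.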

In [21]:
from nltk.stem import SnowballStemmer

In [22]:
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [23]:
ss = SnowballStemmer('french')

In [24]:
ss.stem('bonjoura')

'bonjour'
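
The language name passed to the constructor selects the rule set. As a short sketch, the following compares the 'english' and 'porter' variants on one word and stems a German word; the example words are assumptions chosen for illustration.

```python
from nltk.stem import SnowballStemmer

english = SnowballStemmer('english')
porter = SnowballStemmer('porter')
german = SnowballStemmer('german')

# The English Snowball rules differ slightly from the original Porter rules
print(english.stem('generously'))  # 'generous'
print(porter.stem('generously'))   # 'gener'

# Input is lowercased before stemming
print(german.stem('Autobahnen'))   # 'autobahn'
```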

# Lemmatization

Lemmatization is similar to stemming, but its output is a lemma, the dictionary form of a word, rather than a stem. The lemma is a valid word that carries the same meaning as the original.

In [25]:
from nltk.stem import WordNetLemmatizer

In [26]:
wnl = WordNetLemmatizer()

In [28]:
wnl.lemmatize('eating')

'eating'

In [29]:
wnl.lemmatize('sleeps')

'sleep'

## Difference between stemming and lemmatization

In [30]:
ps_stemmer.stem('believes')

'believ'

In [31]:
wnl.lemmatize('believes')

'belief'

The two outputs show the main difference between stemming and lemmatization. PorterStemmer simply chops the ‘es’ off the word, leaving believ, which is not an English word. WordNetLemmatizer instead looks the word up in WordNet and returns a valid word, belief (it treats the input as a noun unless told otherwise). In simple terms, stemming looks only at the form of a word, whereas lemmatization also takes its meaning into account. As a result, lemmatization normally yields a valid word; a word that WordNet does not know is returned unchanged.