## Importing libraries

In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

## Stemming (Porter algorithm)

First, create an instance of PorterStemmer and try it on individual words and on a sentence.

(Note: PorterStemmer stems only one word at a time, so a sentence has to be split into tokens first.)

In [2]:
porter = PorterStemmer()

In [3]:
porter.stem('Walking')

'walk'

In [4]:
porter.stem('walked')

'walk'

In [6]:
porter.stem('walks')

'walk'

In [7]:
porter.stem('ran')

'ran'

In [8]:
porter.stem('running')

'run'

In [9]:
porter.stem('bosses')

'boss'

In [10]:
porter.stem('replacement')

'replac'

In [11]:
sentence = 'Lemmatization is more sophisticated than stemming'.split()

In [13]:
for token in sentence:
    print(porter.stem(token), end=" ")

lemmat is more sophist than stem 

In [14]:
porter.stem('unnecessary')

'unnecessari'

In [15]:
porter.stem('berry')

'berri'
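
The truncated outputs above ('replac', 'unnecessari', 'berri') arise because the stemmer rewrites suffixes by rule instead of looking words up in a dictionary. As a rough illustration, here is a toy sketch of two rules from the original Porter algorithm: step 1a for plural endings and step 1c for a final y. The function names are hypothetical, the vowel check is simplified, and this is not NLTK's implementation, which applies many more steps.

```python
def toy_step_1a(word):
    """Toy version of Porter step 1a: strip plural endings."""
    if word.endswith("sses"):   # bosses -> boss
        return word[:-2]
    if word.endswith("ies"):    # ponies -> poni
        return word[:-2]
    if word.endswith("ss"):     # boss stays boss
        return word
    if word.endswith("s"):      # walks -> walk
        return word[:-1]
    return word


def toy_step_1c(word):
    """Toy version of Porter step 1c: final y -> i if the stem has a vowel.

    Simplification (assumption): only a, e, i, o, u count as vowels here.
    """
    stem = word[:-1]
    if word.endswith("y") and any(ch in "aeiou" for ch in stem):
        return stem + "i"       # berry -> berri
    return word


for w in ["bosses", "walks", "berry", "sky"]:
    print(w, "->", toy_step_1c(toy_step_1a(w)))
```

Because the rules only look at letters, the result does not have to be a real word, which is the main contrast with lemmatization in the next section.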

## Lemmatization

To use lemmatization, we need to download the WordNet data through nltk and import wordnet from nltk.corpus.

The rest works the same way as with the Porter stemmer.

In [17]:
nltk.download('wordnet')  # only needs to be done once

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Nirajan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
from nltk.corpus import wordnet

In [19]:
lemmatizer = WordNetLemmatizer()

In [22]:
lemmatizer.lemmatize('Walking')

'Walking'

In [23]:
lemmatizer.lemmatize('walking', pos=wordnet.VERB)

'walk'

Lemmatization depends on the part of speech (POS), as the example shows. The default POS is noun, so for any other part of speech we need to pass it alongside the word.
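
To see how much the POS argument matters for a given word, one option is to lemmatize it under every WordNet part of speech and compare the results. The sketch below is a minimal helper under stated assumptions: `lemmas_by_pos` and `WORDNET_POS` are hypothetical names, the single-letter codes are the values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV, and the lemmatizing function is passed in as an argument instead of being imported.

```python
# WordNet's single-letter POS codes (the values behind wordnet.NOUN,
# wordnet.VERB, wordnet.ADJ and wordnet.ADV in nltk.corpus).
WORDNET_POS = {"noun": "n", "verb": "v", "adjective": "a", "adverb": "r"}


def lemmas_by_pos(word, lemmatize):
    """Return {pos_name: lemma} by calling lemmatize(word, pos=code) per POS.

    `lemmatize` is any callable with the signature lemmatize(word, pos=...),
    for example the bound method of a WordNetLemmatizer instance.
    """
    return {name: lemmatize(word, pos=code) for name, code in WORDNET_POS.items()}
```

With the objects defined in this notebook, it could be called as `lemmas_by_pos('going', lemmatizer.lemmatize)`.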

In [24]:
lemmatizer.lemmatize('going')

'going'

In [25]:
lemmatizer.lemmatize('going', pos=wordnet.VERB)

'go'

## Comparing Stemming and Lemmatization

In [26]:
porter.stem('mice')

'mice'

In [27]:
lemmatizer.lemmatize('mice')

'mouse'

In [28]:
porter.stem('was')

'wa'

In [29]:
lemmatizer.lemmatize('was', pos=wordnet.VERB)

'be'

In [30]:
porter.stem('is')

'is'

In [31]:
lemmatizer.lemmatize('is', pos=wordnet.VERB)

'be'

In [32]:
porter.stem('better')

'better'

In [33]:
lemmatizer.lemmatize('better', pos=wordnet.ADJ)

'good'

## Problem with lemmatization (with solution)

We cannot specify the POS tag for each word individually: it takes a lot of time, and sometimes it is impossible. **nltk** has a method that returns a POS tag for each word in a document, but it produces Penn Treebank tags, which are not compatible with WordNetLemmatizer, so we need to convert them first.

In [41]:
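def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS constant."""
    if treebank_tag.startswith('J'):    # adjectives: JJ, JJR, JJS
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):  # verbs: VB, VBD, VBG, ...
        return wordnet.VERB
    elif treebank_tag.startswith('N'):  # nouns: NN, NNS, NNP, ...
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):  # adverbs: RB, RBR, RBS
        return wordnet.ADV
    else:
        # Any other tag falls back to the lemmatizer's default, noun.
        return wordnet.NOUN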

In [42]:
sentence = "Donald Trump has devoted following.".split()

In [43]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('devoted', 'VBN'),
 ('following.', 'NNS')]

In [44]:
for token, tag in words_and_tags:
    print(lemmatizer.lemmatize(token, pos=get_wordnet_pos(tag)), end=" ")

Donald Trump have devote following. 
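
Note that `following.` in the output above still carries its period, because `str.split()` splits only on whitespace, so the tagger and the lemmatizer both see the punctuation as part of the word. NLTK's `word_tokenize` is the usual tool for this, but it needs additional tokenizer data to be downloaded. As a lighter alternative, here is a sketch that uses only the standard library `re` module; `simple_tokenize` is a hypothetical helper name, not part of nltk.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Donald Trump has devoted following.")
print(tokens)
# ['Donald', 'Trump', 'has', 'devoted', 'following', '.']
```

The resulting token list can be passed to `nltk.pos_tag` in place of the `split()` output.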

In [46]:
sentence_2 = "The cat was following the bird as it flew by".split()

In [47]:
words_and_tags_2 = nltk.pos_tag(sentence_2)
words_and_tags_2

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [48]:
for token, tag in words_and_tags_2:
    print(lemmatizer.lemmatize(token, pos=get_wordnet_pos(tag)), end=" ")

The cat be follow the bird a it fly by 
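
In the output above, 'as' became 'a': its tag IN falls through to the noun default in get_wordnet_pos, and the lemmatizer then appears to strip the final s as if it were a plural ending. One possible refinement, sketched here under assumptions, is to leave a token unchanged whenever its tag has no WordNet counterpart. The helper names are hypothetical, the single-letter codes are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV, and the lemmatizing function is passed in as an argument.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def wordnet_pos_or_none(treebank_tag):
    """Return the WordNet POS code for a tag, or None if there is no match."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1])


def lemmatize_tagged(words_and_tags, lemmatize):
    """Lemmatize (token, tag) pairs, skipping tokens with unmapped tags.

    `lemmatize` is any callable with the signature lemmatize(word, pos=...).
    """
    result = []
    for token, tag in words_and_tags:
        pos = wordnet_pos_or_none(tag)
        # Determiners, prepositions, pronouns etc. are passed through as-is.
        result.append(token if pos is None else lemmatize(token, pos=pos))
    return result
```

With the objects from this notebook, the call would be `lemmatize_tagged(words_and_tags_2, lemmatizer.lemmatize)`, which returns a list of tokens that can be joined with spaces.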