## summary
- PorterStemmer: for search engines, indexing
  - speed (rule-based suffix stripping, no dictionary lookup)
  - e.g. running => run
- WordNetLemmatizer: for text analysis, NLP, summarization
  - accuracy (uses the WordNet dictionary)
  - supports POS tags
  - e.g. better => good (when tagged as an adjective)
- WordNet: for synonym search, relationship analysis
  - relationship-driven language understanding
  - synonym set, semantic relation

## PorterStemmer

In [None]:
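import nltk
# One-time resource downloads; uncomment on first run.
# nltk.download('wordnet')                     # WordNetLemmatizer, wordnet
# nltk.download('punkt')                       # tokenizer models
# nltk.download('averaged_perceptron_tagger')  # nltk.pos_tag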

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\khala\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\khala\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\khala\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
from nltk.stem import PorterStemmer

In [2]:
porter = PorterStemmer()

In [3]:
porter.stem("walking")

'walk'

In [4]:
porter.stem("walked")

'walk'

In [5]:
porter.stem("walks")

'walk'

In [6]:
porter.stem("ran")

'ran'

In [7]:
porter.stem("running")

'run'

In [8]:
porter.stem("bosses")

'boss'

In [9]:
porter.stem("replacement")

'replac'

In [10]:
sentence = "Lemmatization is more sophisticated than stemming".split()

In [11]:
sentence

['Lemmatization', 'is', 'more', 'sophisticated', 'than', 'stemming']

In [12]:
for token in sentence:
  print(porter.stem(token), end=" ")

lemmat is more sophist than stem 

In [13]:
porter.stem("unnecessary")

'unnecessari'

In [14]:
porter.stem("berry")

'berri'
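The outputs above show why stems suit indexing even when they are not real words (`'replac'`, `'berri'`): all that matters is that variants of a word collapse to the same key. Below is a minimal sketch of that idea, not a production index. The names `build_stem_index` and `search`, and the sample `docs`, are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

porter = PorterStemmer()

def build_stem_index(docs):
    """Map each Porter stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[porter.stem(token)].add(doc_id)
    return index

def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(porter.stem(query.lower()), set()))

# Hypothetical sample documents.
docs = {
    1: "she walks to work",
    2: "he walked home",
    3: "they are running late",
}
index = build_stem_index(docs)

print(search(index, "walking"))  # matches documents 1 and 2 via the stem 'walk'
print(search(index, "ran"))      # no match: Porter leaves the irregular form 'ran' as is
```

The second query illustrates the limitation seen earlier: an irregular form such as "ran" does not share a stem with "running", which is where a lemmatizer helps.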

## WordNetLemmatizer & wordnet

In [11]:
from nltk.stem import WordNetLemmatizer

In [10]:
from nltk.corpus import wordnet

In [16]:
lemmatizer = WordNetLemmatizer()

In [17]:
lemmatizer.lemmatize("walking")

'walking'

In [18]:
lemmatizer.lemmatize("walking", pos=wordnet.VERB)

'walk'

In [19]:
lemmatizer.lemmatize("going")

'going'

In [20]:
lemmatizer.lemmatize("going", pos=wordnet.VERB)

'go'

In [21]:
lemmatizer.lemmatize("ran", pos=wordnet.VERB)

'run'

In [22]:
porter.stem("mice")

'mice'

In [23]:
lemmatizer.lemmatize("mice")

'mouse'

In [24]:
porter.stem("was")

'wa'

In [25]:
lemmatizer.lemmatize("was", pos=wordnet.VERB)

'be'

In [30]:
porter.stem("is")


'is'

In [31]:
lemmatizer.lemmatize("is", pos=wordnet.VERB)

'be'

In [32]:
porter.stem("better")

'better'

In [33]:
lemmatizer.lemmatize("better", pos=wordnet.ADJ)

'good'

In [8]:
def get_wordnet_pos(treebank_tag):
  if treebank_tag.startswith('J'):
    return wordnet.ADJ
  elif treebank_tag.startswith('V'):
    return wordnet.VERB
  elif treebank_tag.startswith('N'):
    return wordnet.NOUN
  elif treebank_tag.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN

In [2]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ['running', 'ran', 'flies', 'better', 'rocks']

# Default is noun lemmatization
lemmas_default = [lemmatizer.lemmatize(w) for w in words]
print("Default (noun):", lemmas_default)

# Verb lemmatization
lemmas_verbs = [lemmatizer.lemmatize(w, pos='v') for w in words]
print("Verb lemmatization:", lemmas_verbs)

Default (noun): ['running', 'ran', 'fly', 'better', 'rock']
Verb lemmatization: ['run', 'run', 'fly', 'better', 'rock']


## averaged_perceptron_tagger

In [None]:
# nltk.download('averaged_perceptron_tagger') # for pos_tag

In [5]:
sentence = "Donald Trump has a devoted following".split()

In [6]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [12]:
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

Donald Trump have a devote following 

In [13]:
sentence = "The cat was following the bird as it flew by".split()

In [14]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [15]:
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

The cat be follow the bird a it fly by 
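Note that "as" came out as "a" in the output above. The most likely cause is the `else` branch of `get_wordnet_pos`: the tag `IN` falls through to `wordnet.NOUN`, so the lemmatizer treats "as" as a plural noun and strips the final "s". One possible way around this, sketched below under that assumption, is to return `None` for tags outside the four WordNet categories and leave those tokens untouched. The names `get_wordnet_pos_or_none` and `lemmatize_tagged` are hypothetical; the letters `'a'`, `'v'`, `'n'`, `'r'` are the string values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
def get_wordnet_pos_or_none(treebank_tag):
    """Return a WordNet POS letter, or None for tags WordNet does not cover."""
    # Same first-letter rule as get_wordnet_pos, but without the noun fallback.
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return prefix_to_pos.get(treebank_tag[:1])

def lemmatize_tagged(words_and_tags, lemmatizer):
    """Lemmatize content words only; pass every other token through unchanged."""
    lemmas = []
    for word, tag in words_and_tags:
        pos = get_wordnet_pos_or_none(tag)
        if pos is None:
            lemmas.append(word)  # e.g. 'as' (IN), 'the' (DT), 'it' (PRP)
        else:
            lemmas.append(lemmatizer.lemmatize(word, pos=pos))
    return lemmas

# With the objects already defined in this notebook:
# print(" ".join(lemmatize_tagged(words_and_tags, lemmatizer)))
```

Whether to skip such tokens or lemmatize them as nouns is a design choice; skipping is the more conservative option for function words.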

## pos_tag
- CC: coordinating conjunction -- and, but, or
- CD: cardinal number -- one, two, 23
- DT: determiner -- the, a, an
- EX: existential there -- there
- FW: foreign word -- café, über
- IN: preposition or subordinating conjunction -- as, by, in
- JJ: adjective -- big, red, beautiful
- JJR: adjective, comparative -- bigger, taller
- JJS: adjective, superlative -- biggest, tallest
- MD: modal -- will, shall, can
- NN: noun, singular or mass -- cat, bird
- NNS: noun, plural -- cats, birds
- NNP: proper noun, singular -- John, London
- NNPS: proper noun, plural -- Americans
- PDT: predeterminer -- all, both
- POS: possessive ending -- 's
- PRP: personal pronoun -- it, she, they
- RB: adverb -- quickly, silently
- RBR: adverb, comparative -- faster, louder
- RBS: adverb, superlative -- fastest, loudest
- TO: to -- 'to' run
- UH: interjection -- uh, oh
- VB: verb, base form -- run, eat
- VBD: verb, past tense -- ran, ate
- VBG: verb, gerund or present participle -- running, eating
- VBN: verb, past participle -- run, eaten
- VBP: verb, non-3rd person singular present -- run, eat
- VBZ: verb, 3rd person singular present -- runs, eats
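To read `nltk.pos_tag` output without consulting this list each time, the descriptions can be kept in a small lookup table. The sketch below is only an illustration: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names, and the table holds just a handful of Penn Treebank tags. NLTK also provides `nltk.help.upenn_tagset`, which prints full descriptions if the tagset resource has been downloaded.

```python
# Hypothetical lookup table; descriptions follow the Penn Treebank tag set.
TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBN": "verb, past participle",
    "VBZ": "verb, 3rd person singular present",
}

def describe_tags(words_and_tags, descriptions=TAG_DESCRIPTIONS):
    """Attach a readable description to each (word, tag) pair."""
    return [
        (word, tag, descriptions.get(tag, "unknown tag"))
        for word, tag in words_and_tags
    ]

# Tagger output copied from the first example sentence in this notebook.
words_and_tags = [
    ("Donald", "NNP"), ("Trump", "NNP"), ("has", "VBZ"),
    ("a", "DT"), ("devoted", "VBN"), ("following", "NN"),
]
for word, tag, description in describe_tags(words_and_tags):
    print(f"{word:<10} {tag:<4} {description}")
```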