# Stemming
* Stemming is the process of stripping affixes from words to reduce them to a common stem.
* A more basic normalization step is converting all words to lowercase, so that __The__ and __the__ are treated as the same word.
* With stemming, the words __playing__, __played__ and __play__ are all treated as a single word, i.e. __play__.
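
As a rough illustration of the idea, the sketch below lowercases each word and strips one of a few hard-coded suffixes. The `naive_stem` function and its suffix list are made up for this example; real stemmers such as Porter apply far more careful rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer (hypothetical): lowercase, then strip the first matching suffix."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["The", "the", "playing", "played", "play"]
stems = [naive_stem(w) for w in words]
print(stems)       # ['the', 'the', 'play', 'play', 'play']
print(set(stems))  # only two distinct entries remain
```

Five surface forms collapse to two entries, which is the effect normalization aims for.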

# Stemmers in nltk
* __nltk__ comes with a few stemmers.
* The two most widely used are the __Porter__ and __Lancaster__ stemmers.
* Each stemmer has its own rules for stripping affixes.
* The following example demonstrates stemming the word __builders__ using __PorterStemmer__.

In [4]:
import nltk
from nltk import PorterStemmer
porter = nltk.PorterStemmer()
porter.stem('builders')

'builder'

# Stemmers in nltk
* Now let's see how to use __LancasterStemmer__ and stem the word __builders__.

In [5]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
lancaster.stem('builders')

'build'

* For the same input, the Lancaster stemmer returns __build__ whereas the Porter stemmer returns __builder__; the Lancaster rules strip affixes more aggressively.
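
To see the two stemmers side by side, a small helper like the one below can be handy. `compare_stems` is a name invented for this sketch; the word list reuses words that appear elsewhere in this notebook, and only the result for __builders__ is stated here because it is the one shown above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()


def compare_stems(words):
    """Return (word, porter_stem, lancaster_stem) triples (hypothetical helper)."""
    return [(w, porter.stem(w), lancaster.stem(w)) for w in words]


rows = compare_stems(["builders", "lying", "basics", "ceremony", "power"])
for word, p_stem, l_stem in rows:
    print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
# The first row should read: builders, builder, build
```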

# Normalizing with Stemming
* Let's consider the text collection __text1__.
* First, determine the number of unique words in the original __text1__.
* Then normalize the text by converting all words to lowercase, and count the unique words again.

In [6]:
from nltk.book import *
len(set(text1))

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


19317

In [7]:
lc_words = [ word.lower() for word in text1] 
len(set(lc_words))


17231

# Normalizing with Stemming
* Now let's further normalize text1 with Porter Stemmer.

In [8]:
from nltk import PorterStemmer
porter = PorterStemmer()
p_stem_words = [porter.stem(word) for word in set(lc_words) ]
len(set(p_stem_words))

10927

* The output above shows that, after normalizing with the Porter stemmer, the text1 collection has 10927 unique stems.

# Normalizing with Stemming
* Now let's normalize with the Lancaster stemmer and count the unique stems of __text1__.

In [9]:
from nltk import LancasterStemmer
lancaster = LancasterStemmer()
l_stem_words = [lancaster.stem(word) for word in set(lc_words) ]
len(set(l_stem_words))

9036

* Applying the Lancaster stemmer to the text1 collection results in 9036 unique stems.
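
The same counting steps can be wrapped in one function and tried on any token list. This is only a sketch: `vocabulary_sizes` and the sample tokens are invented here, and the tiny list stands in for __text1__ so that no corpus download is needed.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def vocabulary_sizes(tokens):
    """Count unique entries after each normalization step (hypothetical helper)."""
    porter, lancaster = PorterStemmer(), LancasterStemmer()
    lowercase = {token.lower() for token in tokens}
    return {
        "raw": len(set(tokens)),
        "lowercase": len(lowercase),
        "porter": len({porter.stem(word) for word in lowercase}),
        "lancaster": len({lancaster.stem(word) for word in lowercase}),
    }


# Made-up sample tokens, standing in for a real text collection.
sample = ["The", "the", "Whale", "whale", "whales", "sailing", "sailed", "sail"]
sizes = vocabulary_sizes(sample)
print(sizes)
```

Because each step maps the previous vocabulary onto a set that is no larger, the stemmed counts can never exceed the lowercase count.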

# Understanding Lemma
* A __lemma__ is a lexical entry in a lexical resource such as a dictionary.
* Multiple lemmas can share the same spelling. These are known as __homonyms__.
* For example, the two lemmas listed below are __homonyms__.<br>
1. saw [verb] - Past tense of see<br>
2. saw [noun] - Cutting instrument
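
One way to picture this is a lexicon stored as a list of entries, where homonyms are simply the spellings that occur more than once. The `LEXICON` data and `find_homonyms` below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: (spelling, part of speech, gloss).
LEXICON = [
    ("saw", "verb", "past tense of see"),
    ("saw", "noun", "cutting instrument"),
    ("play", "verb", "engage in a game"),
]


def find_homonyms(lexicon):
    """Group entries by spelling and keep spellings with several entries."""
    by_spelling = defaultdict(list)
    for spelling, pos, gloss in lexicon:
        by_spelling[spelling].append((pos, gloss))
    return {s: entries for s, entries in by_spelling.items() if len(entries) > 1}


homonyms = find_homonyms(LEXICON)
print(homonyms)
# {'saw': [('verb', 'past tense of see'), ('noun', 'cutting instrument')]}
```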

# Lemmatization
* __nltk__ comes with __WordNetLemmatizer__. This lemmatizer removes affixes only if the resulting word is found in the lexical resource __WordNet__.

In [10]:
wnl = nltk.WordNetLemmatizer()
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_words) ]
len(set(wnl_stem_words))

15168

* __WordNetLemmatizer__ is mainly used to build a vocabulary of words that are valid lemmas.
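
The "strip only if the result is a known word" rule can be sketched without WordNet. The toy lexicon, suffix list, and `toy_lemmatize` below are assumptions made for illustration and do not reflect how __WordNetLemmatizer__ is actually implemented.

```python
# Hypothetical stand-in for a lexical resource such as WordNet.
TOY_LEXICON = {"play", "build", "builder", "woman", "ceremony"}
SUFFIXES = ("ing", "ed", "s")


def toy_lemmatize(word, lexicon=TOY_LEXICON):
    """Strip a suffix only when the remaining form is in the lexicon."""
    if word in lexicon:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in lexicon:
                return candidate
    # No valid lemma found: leave the word unchanged.
    return word


print(toy_lemmatize("builders"))  # builder
print(toy_lemmatize("playing"))   # play
print(toy_lemmatize("basics"))    # basics ("basic" is not in the toy lexicon)
```

This also suggests why the lemmatized vocabulary above (15168) stays much larger than the stemmed ones: words are left untouched unless a valid lemma is found.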

# Hands-on: NLP - Python - Stemming and Lemmatization
* Define a function called `performStemAndLemma`, which takes one parameter, `textcontent`, a string.
* The function definition code stub is given in the editor.
* Perform the following tasks:

1. Tokenize all the words given in `textcontent`. A word should contain only letters, digits, or underscores. Store the tokenized list of words in `tokenizedwords`. (Hint: Use regexp_tokenize)
2. Convert all the words into lowercase from the unique set of `tokenizedwords`. Store the result into the variable `tokenizedwords`.
3. Remove all the stop words from `tokenizedwords`. Store the result into the variable `filteredwords`. (Hint: Use stopwords corpora)
4. Stem each word present in `filteredwords` with PorterStemmer, and store the result in the list `porterstemmedwords`.
5. Stem each word present in `filteredwords` with LancasterStemmer, and store the result in the list `lancasterstemmedwords`.
6. Lemmatize each word present in `filteredwords` with WordNetLemmatizer, and store the result in the list `lemmatizedwords`.

* Return the `porterstemmedwords`, `lancasterstemmedwords`, and `lemmatizedwords` variables from the function.

__Input Format for Custom Testing__
* Input from stdin will be processed as follows and passed to the function.
* The first line contains a string `textcontent`, the text content used to perform stemming and lemmatization.

__Sample Input__
* STDIN: "Explain to me again why I shouldn't cheat?" he asked....
* Function parameter: textcontent = 'Explain to me again why I shouldn't cheat?" he asked....'

__Sample Output__<br>
['ask', 'cheat', 'cheater', 'ever', 'explain', 'get', 'go', 'happi', 'know', 'lose', 'nobodi', 'other', 'punish', 'tell']<br>
['ask', 'che', 'che', 'ev', 'explain', 'get', 'go', 'happy', 'know', 'los', 'nobody', 'oth', 'pun', 'tel']<br>
['asked', 'cheat', 'cheater', 'ever', 'explain', 'get', 'go', 'happy', 'know', 'losing', 'nobody', 'others', 'punished', 'telling']

__Explanation__
* The first line displays all the Porter stemmed words for the given `textcontent`.
* The second line displays all the Lancaster stemmed words for the given `textcontent`.
* The third line displays all the WordNet lemmatized words for the given `textcontent`.

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd()+"/nltk_data"
import nltk

#
# Complete the 'performStemAndLemma' function below.
#
# The function accepts STRING textcontent as parameter.
#

def performStemAndLemma(textcontent):
    # Write your code here
    from nltk.corpus import stopwords

    # Step 1: tokenize into runs of letters, digits, or underscores.
    # r'\w+' (rather than r'\w*') never produces empty tokens.
    tokenizedwords = nltk.regexp_tokenize(textcontent, pattern=r'\w+', gaps=False)

    # Step 2: lowercase the unique set of tokens.
    tokenizedwords = [word.lower() for word in set(tokenizedwords)]

    # Step 3: drop English stop words.
    stop_words = set(stopwords.words('english'))
    filteredwords = [word for word in tokenizedwords if word not in stop_words]

    # Steps 4-6: stem with Porter and Lancaster, lemmatize with WordNet.
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords = [ps.stem(word) for word in filteredwords]
    lancasterstemmedwords = [ls.stem(word) for word in filteredwords]
    lemmatizedwords = [wnl.lemmatize(word) for word in filteredwords]
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords

if __name__ == '__main__':
    textcontent = input()

    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    porterstemmedwords, lancasterstemmedwords, lemmatizedwords = performStemAndLemma(textcontent)

    print(sorted(porterstemmedwords))
    print(sorted(lancasterstemmedwords))
    print(sorted(lemmatizedwords))
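
A note on the tokenizing pattern in step 1: a pattern that can match zero characters, such as `\w*`, also produces empty matches, which then have to be filtered out. The sketch below uses the standard `re` module to show the difference; it assumes Python 3.7 or later, and that `regexp_tokenize` with `gaps=False` collects matches in the same way.

```python
import re

text = "I shouldn't cheat"

star_tokens = re.findall(r"\w*", text)  # may include empty strings
plus_tokens = re.findall(r"\w+", text)  # only non-empty runs of word characters

print(star_tokens)
print(plus_tokens)  # ['I', 'shouldn', 't', 'cheat']

# Dropping the empty matches gives the same tokens as using \w+ directly.
non_empty = [token for token in star_tokens if token]
print(non_empty == plus_tokens)  # True
```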


In [12]:
import nltk
porter = nltk.PorterStemmer()
print(porter.stem('lying'))

lie


In [13]:
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('basics'))

bas


In [14]:
import nltk
wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('women'))

woman


In [15]:
import nltk
porter = nltk.PorterStemmer()
print(porter.stem('ceremony'))

ceremoni


# POS Tagging
* The process of categorizing words into their parts of speech and labeling them accordingly is called __POS Tagging__.

# POS Tagger
* A __POS Tagger__ processes a sequence of words and attaches a part-of-speech tag to each word.
* __pos_tag__ is the simplest tagger available in __nltk__.
* The example below shows the usage of __pos_tag__.

In [16]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]

# POS Tagger
* The words Python, is and awesome are tagged as proper noun (NNP), verb in third-person singular present (VBZ), and adjective (JJ) respectively.
* You can read more about the POS tags with the help command below.

In [None]:
nltk.help.upenn_tagset()

* To learn about a specific tag such as JJ, use the expression shown below.

In [18]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


# Tagging Text
* You can construct a list of tagged words from a string.
* A tagged word or token is represented as a tuple containing the word and its tag.
* In the input text, each word and its tag are separated by __/__.

In [19]:
text = 'Python/NN is/VB awesome/JJ ./.'
[ nltk.tag.str2tuple(word) for word in text.split() ]

[('Python', 'NN'), ('is', 'VB'), ('awesome', 'JJ'), ('.', '.')]
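
The conversion also works in the other direction with `nltk.tag.tuple2str`. The short sketch below round-trips the same string and shows what `str2tuple` returns when a token carries no separator; neither call needs any corpus data.

```python
from nltk.tag import str2tuple, tuple2str

tagged_text = "Python/NN is/VB awesome/JJ ./."

# String -> list of (word, tag) tuples.
tagged_tokens = [str2tuple(token) for token in tagged_text.split()]

# List of tuples -> string, using the same '/' separator.
round_trip = " ".join(tuple2str(token) for token in tagged_tokens)
print(round_trip == tagged_text)  # True

# A token without a separator gets None as its tag.
untagged = str2tuple("Python")
print(untagged)  # ('Python', None)
```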

# Tagged Corpora
* Many of the text corpora available in __nltk__ are already tagged with their respective parts of speech.
* The __tagged_words__ method can be used to obtain the tagged words of a corpus.
* The following example fetches the tagged words of the __brown__ corpus and displays the first few.

In [20]:
from nltk.corpus import brown
brown_tagged = brown.tagged_words()
brown_tagged[:3]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]

# DefaultTagger
* DefaultTagger assigns a specified tag to every word or token of the given text.
* An example of assigning the NN tag to all words of a sentence is shown below.

In [21]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(words)

[('Python', 'NN'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]

# Lookup Tagger
* You can define a custom tagger and use it to tag words present in any text.
* The example shown below defines a dictionary __defined_tags__, with three words and their respective tags.

In [22]:
import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}

# Lookup Tagger
* The example further defines a __UnigramTagger__ with the defined dictionary and uses it to predict tags of words in __text__.

In [23]:
baseline_tagger = nltk.UnigramTagger(model=defined_tags)
baseline_tagger.tag(words)

[('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]

* Since the words Python and awesome (and the final period) are not found in the defined_tags dictionary, they are tagged as None.
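
One common way to avoid `None` tags is to give the lookup tagger a fallback through the `backoff` argument of `UnigramTagger`. The sketch below combines it with `DefaultTagger('NN')`; the word list is written out by hand (an assumption made here so that the tokenizer data is not needed).

```python
import nltk

# Hand-tokenized sentence, so no tokenizer data is required.
words = ["Python", "is", "awesome", "."]
defined_tags = {"is": "BEZ", "over": "IN", "who": "WPS"}

# Words missing from the dictionary fall back to the default tag 'NN'.
fallback = nltk.DefaultTagger("NN")
lookup_tagger = nltk.UnigramTagger(model=defined_tags, backoff=fallback)

tagged = lookup_tagger.tag(words)
print(tagged)
# [('Python', 'NN'), ('is', 'BEZ'), ('awesome', 'NN'), ('.', 'NN')]
```

Tagging everything unknown as a noun is crude, but it guarantees that every token receives some tag.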

# Unigram Tagger
* __UnigramTagger__ gives you the flexibility to create your own taggers.
* Unigram taggers are built from statistical information, i.e., they tag each word or token with the most likely tag for that particular word.
* You build a unigram tagger through a process known as __training__.
* You then use the tagger to tag words in a test set and evaluate its performance.
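
The idea behind training can be sketched in plain Python: count how often each tag occurs with each word, then keep the most frequent one. The functions and the tiny training data below are invented for illustration and are not how __nltk__ implements __UnigramTagger__ internally.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(model, words):
    """Tag each word with its most likely tag, or None if it was never seen."""
    return [(word, model.get(word)) for word in words]


def accuracy(model, tagged_sents):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    pairs = [pair for sent in tagged_sents for pair in sent]
    correct = sum(model.get(word) == gold for word, gold in pairs)
    return correct / len(pairs) if pairs else 0.0


# Made-up training and test sentences.
train = [
    [("the", "AT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PPSS"), ("run", "VB")],
    [("a", "AT"), ("run", "NN")],
]
test = [[("the", "AT"), ("run", "VB"), ("dog", "NN")]]

model = train_unigram(train)
print(tag_words(model, ["the", "run", "dog"]))
# [('the', 'AT'), ('run', 'NN'), ('dog', None)]
print(accuracy(model, test))  # one of three test tokens is tagged correctly
```

The unseen word `dog` gets `None`, and `run` is always tagged `NN` regardless of context, which are the two typical weaknesses of a unigram model.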

# Unigram Tagger
* Let's consider the tagged sentences of the brown corpus associated with the government genre.
* Let's also compute the training set size, i.e., 80% of the sentences.

In [24]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='government')
brown_sents = brown.sents(categories='government')
print(len(brown_sents))
train_size = int(len(brown_sents)*0.8)
print(train_size)

3032
2425


In [25]:
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.7799495586380832

* __unigram_tagger__ is built by passing the tagged training sentences as an argument to __UnigramTagger__.
* The built __unigram_tagger__ is then evaluated on the test sentences.
* The following code snippet shows tagging the words of a sentence taken from the test set.

In [26]:
unigram_tagger.tag(brown_sents[3000])

[('The', 'AT'),
 ('first', 'OD'),
 ('step', 'NN'),
 ('is', 'BEZ'),
 ('a', 'AT'),
 ('comprehensive', 'JJ'),
 ('self', None),
 ('study', 'NN'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('faculty', None),
 (',', ','),
 ('by', 'IN'),
 ('outside', 'IN'),
 ('consultants', 'NNS'),
 (',', ','),
 ('or', 'CC'),
 ('by', 'IN'),
 ('a', 'AT'),
 ('combination', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('two', 'CD'),
 ('.', '.')]

# Hands-on: NLP - Python - POS Tagging
* Define a function called `tagPOS` which takes three parameters. The first parameter, `textcontent`, is a string, the second parameter, `taggedtextcontent`, is also a string, and the third parameter, `defined_tags`, is a dictionary.
* The function definition code stub is given in the editor.
* Perform the following tasks:

1. Tag the part of speech for the given `textcontent` words and store the result into the variable `nltk_pos_tags`. (Hint: Use pos_tag)
2. Tag the part of speech for the given `taggedtextcontent` words using the Tagging Text method. Store the result into the variable `tagged_pos_tag`.
3. Tag the part of speech for the given `textcontent` words, using `defined_tags` as a model in the Lookup Tagger method. Store the result into the variable `unigram_pos_tag`.

* Return the `nltk_pos_tags`, `tagged_pos_tag`, and `unigram_pos_tag` variables from the function.

__Input Format for Custom Testing__
* Input from stdin will be processed as follows and passed to the function.
* The first line contains a string `textcontent`, the text content used to tag parts of speech.
* The second line contains a string `taggedtextcontent`, the tagged text content, which already contains the tags.

__Sample Input__
* STDIN line 1: Python is awesome. → textcontent = 'Python is awesome.'
* STDIN line 2: Python/NNP is/VBZ awesome/DT ./. → taggedtextcontent = 'Python/NNP is/VBZ awesome/DT ./.'

__Sample Output__<br>
[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]<br>
[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'DT'), ('.', '.')]<br>
[('Python', None), ('is', 'VERB'), ('awesome', 'ADJ'), ('.', '.')]

__Explanation__
* The first line displays POS tagged words using pos_tag for the given `textcontent`.
* The second line displays POS tagged words using the Tagging Text method for the given `taggedtextcontent`.
* The third line displays POS tagged words using the Lookup Tagger method for the given `textcontent`.

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
from nltk.corpus import brown
import nltk



#
# Complete the 'tagPOS' function below.
#
# The function accepts following parameters:
#  1. STRING textcontent
#  2. STRING taggedtextcontent
#  3. DICTIONARY defined_tags
#

def tagPOS(textcontent, taggedtextcontent, defined_tags):
    # Write your code here
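    # Task 1: tag the tokenized text with nltk's pos_tag.
    words = nltk.word_tokenize(textcontent)
    nltk_pos_tags = nltk.pos_tag(words)
    # Task 2: parse the 'word/TAG' tokens into (word, tag) tuples.
    tagged_pos_tag = [nltk.tag.str2tuple(word) for word in taggedtextcontent.split()]
    # Task 3: lookup tagger that uses defined_tags as its model.
    baseline_tagger = nltk.UnigramTagger(model=defined_tags)
    unigram_pos_tag = baseline_tagger.tag(words)
    return nltk_pos_tags, tagged_pos_tag, unigram_pos_tag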
    
if __name__ == '__main__':
    textcontent = input()

    taggedtextcontent = input()
    
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    defined_tags = dict(brown.tagged_words(tagset='universal'))

    nltk_pos_tags, tagged_pos_tag, unigram_pos_tag = tagPOS(textcontent, taggedtextcontent, defined_tags)

    print(nltk_pos_tags)
    print(tagged_pos_tag)
    print(unigram_pos_tag)


In [27]:
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('power'))

pow
