# Parts of Speech (POS)
Part-of-speech (POS) tagging identifies the grammatical role (noun, verb, adjective, and so on) that each word plays in a particular sentence. With NLTK we can get the POS tags for a sentence and then pull out one type of word, say only the nouns, to learn something about a document that contains many sentences. Here we tag the book "Moby Dick", keep only the nouns, build a frequency distribution over them, and look at the most common words in that distribution.

In [1]:
# Import NLTK
import nltk

In [2]:
# the original text
text = "I walked to the cafe to buy coffee after work."

In [3]:
# Tokenize the sentence into words
tokens = nltk.word_tokenize(text)

In [4]:
# Calling the .pos_tag() to get the POS tags of the tokens
nltk.pos_tag(tokens)

[('I', 'PRP'),
 ('walked', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('cafe', 'NN'),
 ('to', 'TO'),
 ('buy', 'VB'),
 ('coffee', 'NN'),
 ('after', 'IN'),
 ('work', 'NN'),
 ('.', '.')]
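
The tagged output is a plain list of `(word, tag)` tuples, so picking out one part of speech is an ordinary list filter. The sketch below copies the output above in as a literal so that it runs without NLTK; the helper name `words_with_tag_prefix` is hypothetical. Penn Treebank noun tags all begin with `NN` (`NN`, `NNS`, `NNP`, `NNPS`) and verb tags with `VB`, so a prefix match is one simple way to catch every form.

```python
# Tagged tokens copied from the nltk.pos_tag(tokens) output above.
tagged = [('I', 'PRP'), ('walked', 'VBD'), ('to', 'TO'), ('the', 'DT'),
          ('cafe', 'NN'), ('to', 'TO'), ('buy', 'VB'), ('coffee', 'NN'),
          ('after', 'IN'), ('work', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)  # ['cafe', 'coffee', 'work']
print(verbs)  # ['walked', 'buy']
```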

In [5]:
# To see what the POS tag abbreviations above stand for, run the command below
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [6]:
# Now let us see how the same word is tagged differently depending on its context in two sentences
# Watch the word 'desert'
# Sentence 1
nltk.pos_tag(nltk.word_tokenize('I will have a desert'))

[('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('a', 'DT'), ('desert', 'NN')]

In [7]:
# Sentence 2
nltk.pos_tag(nltk.word_tokenize('They will desert us.'))

[('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')]

In [8]:
# From the two sentences above it is clear that the tag depends on context:
# in the first sentence 'desert' names a thing (it follows the determiner 'a') - so it is tagged NN - noun
# in the second sentence 'desert' means to abandon someone (it follows the modal 'will') - so it is tagged VB - verb
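
To make the contrast easier to see, the two outputs above can be compared programmatically. This is a minimal sketch on the outputs copied in as literals; the helper `tag_with_context` is hypothetical, and looking at the preceding tag only illustrates the pattern in these two sentences, not how NLTK's tagger works internally.

```python
# Outputs of the two nltk.pos_tag(...) calls above, copied in as literals.
sentence_1 = [('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('a', 'DT'),
              ('desert', 'NN')]
sentence_2 = [('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'),
              ('.', '.')]


def tag_with_context(tagged_tokens, target):
    """Return (previous_tag, tag) for every occurrence of `target` (hypothetical helper)."""
    results = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if word == target:
            previous_tag = tagged_tokens[i - 1][1] if i > 0 else None
            results.append((previous_tag, tag))
    return results


context_1 = tag_with_context(sentence_1, 'desert')
context_2 = tag_with_context(sentence_2, 'desert')
print(context_1)  # [('DT', 'NN')]  determiner before it, tagged as a noun
print(context_2)  # [('MD', 'VB')]  modal before it, tagged as a verb
```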

In [9]:
# Now let's work on the book 'Moby Dick' and use POS tags to learn something about it
# Get the words of Moby Dick from the Gutenberg corpus
md = nltk.corpus.gutenberg.words('melville-moby_dick.txt')

In [10]:
# Normalize it - keep only purely alphabetic tokens and convert them to lower case
md_norm = [word.lower() for word in md if word.isalpha()]
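
To see what this normalization does, here is the same comprehension applied to a small hand-made token list. The list `sample` is a hypothetical illustration written to resemble corpus tokens, not actual output from the notebook. Note that punctuation and numbers are dropped, but a possessive `s` that was split off from its apostrophe survives on its own, which is one plausible reason `s` shows up among the most common "nouns" later.

```python
# Hypothetical sample tokens, made up to resemble what a word list might contain.
sample = ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']',
          'whale', "'", 's', 'jaw']

# Same normalization as above: keep alphabetic tokens only, lower-cased.
normalized = [word.lower() for word in sample if word.isalpha()]
print(normalized)
# ['moby', 'dick', 'by', 'herman', 'melville', 'whale', 's', 'jaw']
```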

In [11]:
# Rather than the short Penn Treebank tags, let's ask for the universal tags such as NOUN, VERB etc
md_tags = nltk.pos_tag(md_norm, tagset='universal')

In [12]:
# Printing the first five tags, see that tags are now in universal form
md_tags[:5]

[('moby', 'NOUN'),
 ('dick', 'NOUN'),
 ('by', 'ADP'),
 ('herman', 'NOUN'),
 ('melville', 'NOUN')]

In [13]:
# Now keep only the words tagged as nouns: from each (word, tag) tuple take the word if tag == 'NOUN'
md_nouns = [word for word, tag in md_tags if tag == 'NOUN']

In [14]:
# Printing the first 10 nouns we identified
md_nouns[:10]

['moby',
 'dick',
 'herman',
 'melville',
 'etymology',
 'consumptive',
 'usher',
 'grammar',
 'school',
 'pale']

In [15]:
# Build a frequency distribution over the nouns to find the most used ones
md_nouns_fd = nltk.FreqDist(md_nouns)
# Get the 10 most common nouns
md_nouns_fd.most_common(10)
# Words like 'whale', 'man', 'ship' and 'sea' dominate, which suggests the book is largely about men sailing the sea
# ('i' and 's' are most likely artifacts of lower-casing and of dropping punctuation, not real nouns)

[('i', 1182),
 ('whale', 909),
 ('s', 774),
 ('man', 527),
 ('ship', 498),
 ('sea', 435),
 ('head', 337),
 ('time', 334),
 ('boat', 332),
 ('ahab', 278)]
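
The single-letter entries `i` and `s` in this list are unlikely to be real nouns. One simple clean-up, sketched here on the output above copied in as a literal, is to drop very short tokens. The threshold `MIN_LENGTH` is an assumption chosen for illustration; the same length condition could instead be added to the comprehension that builds `md_nouns`.

```python
# Output of md_nouns_fd.most_common(10) above, copied in as a literal.
top_nouns = [('i', 1182), ('whale', 909), ('s', 774), ('man', 527),
             ('ship', 498), ('sea', 435), ('head', 337), ('time', 334),
             ('boat', 332), ('ahab', 278)]

# Assumed threshold: treat tokens shorter than two characters as noise.
MIN_LENGTH = 2

cleaned = [(word, count) for word, count in top_nouns if len(word) >= MIN_LENGTH]
print(cleaned[:3])  # [('whale', 909), ('man', 527), ('ship', 498)]
```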