<a href="https://colab.research.google.com/github/nhwhite212/DealingwithDataSpring2021/blob/master/7-TextMining_NLP/B-Part_of_Speech_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These are very useful categories for many language processing tasks. Our goal in this chapter is to answer the following questions:

1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. 


### Using a POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word.
This matters when we try to extract meaning from text: we often need to find out the WHAT, WHERE, WHO and HOW in a document, or determine the sentiment of a document. The NLTK (Natural Language Toolkit) library is one of a number of systems that we can use to understand text. Here are some examples:


In [1]:
!pip install --user -U nltk

Requirement already up-to-date: nltk in /root/.local/lib/python3.7/site-packages (3.6.2)


In [17]:
# load the toolkit
import nltk
nltk.download('popular')
from nltk.tokenize import word_tokenize
import inspect


[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...

Use the toolkit to tokenize (parse) some text into words, and then label the words with their parts of speech.

In [19]:
# tokenize the text
text = nltk.word_tokenize("And now for something completely different")
# Show the parts of speech for each word
print(text)
nltk.pos_tag(text)

['And', 'now', 'for', 'something', 'completely', 'different']


[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

OK, but what do 'CC', 'RB', and 'IN' mean? Here we see that AND is a CC, a coordinating conjunction; NOW and COMPLETELY are RB, or adverbs; FOR is an IN, a preposition; SOMETHING is an NN, a noun; and DIFFERENT is a JJ, an adjective.

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`, or a regular expression, e.g. `nltk.help.upenn_tagset('NN.*')`. First we have to download the tagsets.

In [24]:
nltk.download('tagsets')



[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

If the next command doesn't work, run `nltk.download()`
and download the 'book' collection by typing 'd' and then 'book'.


In [25]:
# Ask NLTK what the JJ tag means, with some example words

nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


### Tag meanings for the Penn Treebank (default) tagset
Display all of the possible POS tags, each with examples.

In [26]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a **tuple** consisting of the token and the tag. We can create one of these tuples from the standard string representation of a tagged token (the word and its tag joined by a slash) using the function `str2tuple()` from `nltk.tag`.
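As a brief illustration of the string form, the sketch below converts `word/TAG` strings into tuples and back. It assumes NLTK is installed; `fly/NN` is an illustrative token, and the short tagged string reuses tags from the example sentence in this section. No corpus download is needed for these two helpers.

```python
from nltk.tag import str2tuple, tuple2str

# Convert the string form "word/TAG" into a (word, tag) tuple.
tagged_token = str2tuple("fly/NN")
print(tagged_token)     # ('fly', 'NN')
print(tagged_token[0])  # fly
print(tagged_token[1])  # NN

# Convert a whole tagged string by splitting on whitespace first.
sentence = "They/PRP refuse/VBP to/TO permit/VB us/PRP"
tagged_sentence = [str2tuple(item) for item in sentence.split()]
print(tagged_sentence)

# tuple2str goes the other way, from a tuple back to "word/TAG".
print(tuple2str(tagged_token))  # fly/NN
```

This form is handy when tagged text is stored in a plain text file, since each token and its tag travel together as a single whitespace-free string.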

Note how NLTK disambiguates the two occurrences of each of the tokens "refuse" and "permit" in the sentence below: each appears once as a verb and once as a noun.

In [27]:
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(tokens)
tagged

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

`tagged` is a list of (token, tag) tuples. We can index into the list to retrieve the first tuple, and then index into that tuple to get the token and its tag.

In [28]:
tagged_token = tagged[0]
print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])

('They', 'PRP')
They
PRP


Now let's iterate through the tagged tuples and break out the tokens and the POS tags.


In [29]:
# print the original text, tokenized
print("Text = ", tokens)
# Now the same from the tagged tuples (note the list comprehension)
tokens = [a for (a, b) in tagged]
print("Tokens = ",tokens)
# and then print the POS TAGs
tags = [b for (a, b) in tagged]
print("POS Tags = ", tags)

Text =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']
Tokens =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']
POS Tags =  ['PRP', 'VBP', 'TO', 'VB', 'PRP', 'TO', 'VB', 'DT', 'NN', 'NN']
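Once tokens and tags are paired up, a dictionary is a convenient way to store words together with their categories. The following is a minimal sketch rather than a prescribed method: the `tagged` list is copied from the output above so that the block runs without calling the tagger, and the names `words_by_tag` and `tags_by_word` are illustrative.

```python
from collections import defaultdict

# Tagged output copied from the cell above (the result of nltk.pos_tag).
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Map each tag to the list of words that received it.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(words_by_tag['NN'])  # ['refuse', 'permit']
print(words_by_tag['VB'])  # ['permit', 'obtain']

# Map each (lowercased) word to the set of tags it received.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Words with more than one tag are ambiguous in this sentence.
print(sorted(tags_by_word['refuse']))  # ['NN', 'VBP']
print(sorted(tags_by_word['permit']))  # ['NN', 'VB']
```

Using `defaultdict` avoids checking whether a key already exists before appending to it.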


#### Exercise 

Load a text of your choice, tokenize it, and perform part-of-speech tagging on it. Then extract the nouns from the text and perform a frequency analysis to identify the most common nouns in the text. (Warning: POS tagging takes a good amount of time when processing long texts, so try to select a text with fewer than 10K tokens, or simply perform POS tagging on the first 10K-20K tokens.)

Repeat the exercise for adjectives.
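One possible starting point for the counting step is sketched below. It is only an outline under stated assumptions: the helper name `most_common_by_tag` is hypothetical, the sample `tagged` data is copied from the two tagged sentences shown earlier so that the block runs without the tagger, and matching on the prefix `NN` also counts plural and proper nouns (`NNS`, `NNP`, `NNPS`).

```python
from collections import Counter

def most_common_by_tag(tagged_tokens, tag_prefix, n=10):
    """Return the n most frequent lowercased words whose tag starts with tag_prefix."""
    words = [word.lower() for word, tag in tagged_tokens
             if tag.startswith(tag_prefix)]
    return Counter(words).most_common(n)

# For your own text you would build `tagged` with something like:
#   tokens = nltk.word_tokenize(raw_text)[:10000]
#   tagged = nltk.pos_tag(tokens)
# Here we reuse the tagged sentences shown earlier in this section.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
          ('completely', 'RB'), ('different', 'JJ'),
          ('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

print(most_common_by_tag(tagged, 'NN'))  # nouns
print(most_common_by_tag(tagged, 'JJ'))  # adjectives
```

On a real text the counts will differ from one another, and `most_common` will order the words from most to least frequent.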

PS: If you want to parse text from HTML without resorting to XPath expressions, you can use the "BeautifulSoup" library: