## Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These word classes are useful categories for many language processing tasks. Our goal in this chapter is to answer the following questions:

1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. 


### Using a POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word:

In [2]:
import nltk

In [3]:
text = nltk.word_tokenize("And now for something completely different")

nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [9]:
text = nltk.word_tokenize("I traveled the world. I visited Barcelona, Moscow, New Delhi, and other big cities in Europe and Asia. I still want to visit the sub-saharan part of Africa.")

[token for (token, pos) in nltk.pos_tag(text) if pos == 'NNP']

['Barcelona', 'Moscow', 'New', 'Delhi', 'Europe', 'Asia', 'Africa']

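In the first example, `and` is tagged `CC`, a coordinating conjunction; `now` and `completely` are `RB`, adverbs; `for` is `IN`, a preposition; `something` is `NN`, a noun; and `different` is `JJ`, an adjective. The second example keeps only the tokens tagged `NNP`, that is, singular proper nouns.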
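The proper-noun filter above returns `New` and `Delhi` as separate items, because the tagger labels individual tokens rather than multiword names. One possible way to recover such names is to join runs of adjacent `NNP` tokens. The sketch below uses only the standard library; the tagged list is written by hand for illustration (tags other than `NNP` are plausible guesses, not tagger output), and `merge_proper_nouns` is a hypothetical helper, not an NLTK function.

```python
from itertools import groupby

# Hand-written tagged tokens for illustration; in practice this list
# would come from nltk.pos_tag(). Only the NNP tags mirror the output above.
tagged = [
    ("I", "PRP"), ("visited", "VBD"),
    ("Barcelona", "NNP"), (",", ","),
    ("Moscow", "NNP"), (",", ","),
    ("New", "NNP"), ("Delhi", "NNP"), (",", ","),
    ("and", "CC"), ("other", "JJ"), ("big", "JJ"), ("cities", "NNS"),
]


def merge_proper_nouns(tagged_tokens):
    """Join runs of adjacent NNP tokens into single multiword names."""
    names = []
    # groupby() yields consecutive runs of tokens that share the same key,
    # here True for NNP tokens and False for everything else.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            names.append(" ".join(token for token, _ in run))
    return names


proper_nouns = merge_proper_nouns(tagged)
print(proper_nouns)
```

This heuristic would also merge two distinct names that happen to be adjacent without punctuation between them, so it should be treated as a rough first pass.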

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`, or a regular expression, e.g. `nltk.help.upenn_tagset('NN.*')`.

In [4]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


In [7]:
nltk.help.upenn_tagset('NNP*')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
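The output for `'NNP*'` also lists `NN` and `NNS`. That is consistent with the argument being treated as a regular expression matched from the start of each tag name: `P*` means "zero or more `P`", so any tag beginning with `NN` qualifies. The sketch below reproduces this behaviour with Python's `re` module on a hand-picked subset of tags; `matching_tags` is a hypothetical helper written for this illustration, not part of NLTK.

```python
import re

# A small hand-picked subset of Penn Treebank tag names, for illustration.
tags = ["JJ", "NN", "NNP", "NNPS", "NNS", "RB", "VB"]


def matching_tags(pattern, tag_names):
    """Return the tags whose beginning matches the regular expression."""
    compiled = re.compile(pattern)
    return [tag for tag in sorted(tag_names) if compiled.match(tag)]


# P* may match zero P's, so NN and NNS qualify as well.
loose = matching_tags("NNP*", tags)
# Requiring the literal prefix NNP keeps only the proper-noun tags.
strict = matching_tags("NNP.*", tags)

print(loose)
print(strict)
```

To list only the proper-noun tags, a pattern such as `'NNP.*'` is the safer choice.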


In [10]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a **tuple** consisting of the token and the tag. We can create one of these tuples from the standard string representation of a tagged token (for example `fly/NN`) using the function `nltk.tag.str2tuple()`. The tagger itself returns a list of such tuples:

In [6]:
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(tokens)
tagged

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

In [7]:
tagged_token = tagged[0]
print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])

('They', 'PRP')
They
PRP


In [9]:
print("Text = ", tokens)
# Split the (token, tag) pairs back into two parallel lists
tokens = [token for (token, tag) in tagged]
print("Tokens = ", tokens)
tags = [tag for (token, tag) in tagged]
print("POS Tags = ", tags)

Text =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']
Tokens =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']
POS Tags =  ['PRP', 'VBP', 'TO', 'VB', 'PRP', 'TO', 'VB', 'DT', 'NN', 'NN']
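The `str2tuple()` function mentioned at the start of this section is not exercised in the cells above. As a minimal sketch, assuming the tagged text is stored as whitespace-separated `word/TAG` strings, it can be applied to each item in turn. The sample string is written by hand from the tagger output shown earlier.

```python
import nltk

# Slash-separated tagged text, written by hand from the tagger output above.
sentence = "They/PRP refuse/VBP to/TO permit/VB us/PRP"

# str2tuple() splits each 'word/TAG' string into a (word, tag) tuple.
tagged_tokens = [nltk.tag.str2tuple(item) for item in sentence.split()]
print(tagged_tokens)

# tuple2str() performs the reverse conversion.
first_as_string = nltk.tag.tuple2str(tagged_tokens[0])
print(first_as_string)
```

Neither function needs the tagger models, so this runs without downloading any NLTK data.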


#### Exercise 

Load a text of your choice, tokenize it, and perform part-of-speech tagging on it. Then extract the nouns from the text and perform a frequency analysis to identify the most common nouns in the text. (Warning: POS tagging takes a good amount of time when processing long texts, so try to select a text with fewer than 10K tokens, or simply perform POS tagging on the first 10K-20K tokens.)

Repeat the exercise for adjectives.

PS: If you want to parse text from HTML without resorting to XPath expressions, you can use the "BeautifulSoup" library as follows:

In [3]:
from bs4 import BeautifulSoup
import requests

url = "https://www.nytimes.com/2017/05/22/world/europe/ariana-grande-manchester-police.html"
resp = requests.get(url)
html = resp.text 
raw = BeautifulSoup(html, "lxml").get_text()

# Keep only the article body: slice from the dateline to the reporter credit,
# dropping the page junk that was extracted around the article
start = raw.index("MANCHESTER, England —")
end = raw.index("Rory Smith reported from Manchester, and Sewell Chan from London")
raw = raw[start:end]

# Tokenize the article text and tag each token with its part of speech
tokens = nltk.word_tokenize(raw)
tagged = nltk.pos_tag(tokens)

In [4]:
tagged

[('MANCHESTER', 'NNP'),
 (',', ','),
 ('England', 'NNP'),
 ('—', 'NNP'),
 ('An', 'DT'),
 ('explosion', 'NN'),
 ('that', 'WDT'),
 ('may', 'MD'),
 ('have', 'VB'),
 ('been', 'VBN'),
 ('a', 'DT'),
 ('suicide', 'JJ'),
 ('bombing', 'NN'),
 ('killed', 'VBN'),
 ('at', 'IN'),
 ('least', 'JJS'),
 ('19', 'CD'),
 ('people', 'NNS'),
 ('on', 'IN'),
 ('Monday', 'NNP'),
 ('night', 'NN'),
 ('and', 'CC'),
 ('wounded', 'VBD'),
 ('about', 'RB'),
 ('50', 'CD'),
 ('others', 'NNS'),
 ('at', 'IN'),
 ('an', 'DT'),
 ('Ariana', 'NNP'),
 ('Grande', 'NNP'),
 ('concert', 'NN'),
 ('filled', 'VBN'),
 ('with', 'IN'),
 ('adoring', 'VBG'),
 ('adolescent', 'NN'),
 ('fans', 'NNS'),
 (',', ','),
 ('in', 'IN'),
 ('what', 'WP'),
 ('the', 'DT'),
 ('police', 'NN'),
 ('were', 'VBD'),
 ('treating', 'VBG'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('terrorist', 'JJ'),
 ('attack.Panic', 'NN'),
 ('and', 'CC'),
 ('mayhem', 'VB'),
 ('seized', 'VBN'),
 ('the', 'DT'),
 ('crowd', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Manchester', 'NNP'),
 ('Are

In [6]:
# Get the nouns from the text
nouns = [token for (token, tag) in tagged if tag.startswith('NN') and token.isalpha()]
fd_nyt = nltk.FreqDist(nouns)
fd_nyt.most_common(20)

[('Manchester', 12),
 ('people', 9),
 ('name', 8),
 ('explosion', 8),
 ('concert', 7),
 ('Arena', 6),
 ('police', 6),
 ('arena', 6),
 ('lng', 6),
 ('https', 5),
 ('Grande', 5),
 ('children', 5),
 ('text', 4),
 ('data', 4),
 ('zoom', 4),
 ('Police', 4),
 ('center', 4),
 ('Monday', 4),
 ('show', 4),
 ('concertgoers', 4)]
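The counts above list `Arena` and `arena`, and `police` and `Police`, as separate entries, because the frequency distribution is case-sensitive. One option is to lowercase each token before counting. The sketch below uses `collections.Counter` from the standard library as a stand-in (a swapped-in choice so the example needs no NLTK data; passing the same generator to `nltk.FreqDist` should work too). The `sample` list is hand-written and only stands in for `tagged`.

```python
from collections import Counter

# Hand-written sample of (token, tag) pairs standing in for `tagged`.
sample = [
    ("Arena", "NNP"), ("arena", "NN"), ("Police", "NNP"),
    ("police", "NN"), ("arena", "NN"), ("explosion", "NN"),
    ("loud", "JJ"), ("19", "CD"),
]

# Lowercase each noun before counting so that case variants are merged.
noun_counts = Counter(
    token.lower()
    for token, tag in sample
    if tag.startswith("NN") and token.isalpha()
)
print(noun_counts.most_common(2))
```

Lowercasing also merges a proper noun with the common noun of the same spelling, which may or may not be what you want for a given analysis.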

In [8]:
# Get the adjectives from the text
adjectives = [token for (token, tag) in tagged if tag.startswith('JJ') and token.isalpha()]
fd_nyt = nltk.FreqDist(adjectives)
fd_nyt.most_common(20)

[('terrorist', 7),
 ('lat', 6),
 ('true', 6),
 ('young', 3),
 ('loud', 3),
 ('bubble', 2),
 ('many', 2),
 ('url', 2),
 ('suicide', 2),
 ('British', 2),
 ('like', 2),
 ('international', 2),
 ('least', 2),
 ('other', 2),
 ('deadliest', 1),
 ('pink', 1),
 ('huge', 1),
 ('numerous', 1),
 ('last', 1),
 ('domestic', 1)]
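Two artifacts are visible in the results above. Tokens such as `lng`, `lat`, `url`, `zoom`, and `https` are page residue rather than article text, and `attack.Panic` is two words fused together, which is what happens when the text of adjacent elements is concatenated without a separator. A possible remedy, sketched below, is to collect only the paragraph elements and join them with spaces. The HTML snippet is invented for illustration, and the assumption that the article body lives in `<p>` tags has to be checked against the real page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page; the markup is invented for illustration.
html = """
<html><body>
<script>var map = {"lat": 53.48, "lng": -2.24, "zoom": 4};</script>
<p>The police were treating it as a terrorist attack.</p>
<p>Panic and mayhem seized the crowd.</p>
</body></html>
"""

# "html.parser" is Python's built-in parser, so lxml is not required here.
soup = BeautifulSoup(html, "html.parser")

# Keep only paragraph text and join paragraphs with a space, so that the
# last word of one paragraph is not glued to the first word of the next.
paragraphs = [p.get_text() for p in soup.find_all("p")]
article_text = " ".join(paragraphs)
print(article_text)
```

The resulting string can then be passed to `nltk.word_tokenize()` and `nltk.pos_tag()` exactly as in the cells above.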