## Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These are very useful categories for many language processing tasks. Our goal in this chapter is to answer the following questions:

1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. 


### Using a POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word:

In [1]:
import nltk

text = nltk.word_tokenize("And now for something completely different")

nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Here we see that _and_ is `CC`, a coordinating conjunction; _now_ and _completely_ are `RB`, or adverbs; _for_ is `IN`, a preposition; _something_ is `NN`, a noun; and _different_ is `JJ`, an adjective.

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`, or a regular expression, e.g. `nltk.help.upenn_tagset('NN.*')`.

In [5]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


In [6]:
nltk.help.upenn_tagset('VB.*')

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
 

Let's look at another example, this time including some **homonyms**:

In [7]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)


[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

Notice that _refuse_ appears both as a present-tense verb (`VBP`) and as a noun (`NN`), and _permit_ both as a base-form verb (`VB`) and as a noun (`NN`). For example, refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.)
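
To make the text-to-speech point concrete, here is a minimal sketch of how tagger output could drive pronunciation. The `STRESS` table and the `coarse` helper are hypothetical names made up for this illustration and are not part of NLTK; the tagged tokens are copied from the tagger output above.

```python
# Tagged tokens copied from the tagger output above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Hypothetical pronunciation table keyed by (word, coarse word class).
# Capital letters mark the stressed syllable.
STRESS = {
    ('refuse', 'verb'): 'refUSE',
    ('refuse', 'noun'): 'REFuse',
    ('permit', 'verb'): 'perMIT',
    ('permit', 'noun'): 'PERmit',
}

def coarse(tag):
    """Collapse Penn Treebank tags into the two classes the table needs."""
    if tag.startswith('VB'):
        return 'verb'
    if tag.startswith('NN'):
        return 'noun'
    return None

# Words without an entry in the table are passed through unchanged.
spoken = [STRESS.get((word.lower(), coarse(tag)), word) for word, tag in tagged]
print(' '.join(spoken))
```

Without the tags, the lookup would have no way to choose between the two entries for each word.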

This information is also useful when trying to figure out the sense of a word in WordNet:

In [8]:
from nltk.corpus import wordnet as wn
wn.synsets('refuse')

[Synset('garbage.n.01'),
 Synset('refuse.v.01'),
 Synset('refuse.v.02'),
 Synset('defy.v.02'),
 Synset('deny.v.04'),
 Synset('resist.v.05'),
 Synset('reject.v.06')]

In [10]:
senses = [(s.lemma_names(), s.definition(), s.examples()) 
          for s in wn.synsets('refuse')]
for s in senses:
    print("Lemma name:", s[0])
    print("Definition:", s[1])
    print("Examples  :", s[2])
    print("=======================")

Lemma name: ['garbage', 'refuse', 'food_waste', 'scraps']
Definition: food that is discarded (as from a kitchen)
Examples  : []
Lemma name: ['refuse', 'decline']
Definition: show unwillingness towards
Examples  : ['he declined to join the group on a hike']
Lemma name: ['refuse', 'reject', 'pass_up', 'turn_down', 'decline']
Definition: refuse to accept
Examples  : ['He refused my offer of hospitality']
Lemma name: ['defy', 'resist', 'refuse']
Definition: elude, especially in a baffling way
Examples  : ['This behavior defies explanation']
Lemma name: ['deny', 'refuse']
Definition: refuse to let have
Examples  : ['She denies me every pleasure', 'he denies her her weekly allowance']
Lemma name: ['resist', 'reject', 'refuse']
Definition: resist immunologically the introduction of some foreign tissue or organ
Examples  : ['His body rejected the liver of the donor']
Lemma name: ['reject', 'turn_down', 'turn_away', 'refuse']
Definition: refuse entrance or membership
Examples  : ['They turned a

There is just one interpretation of _refuse_ as a noun (`garbage.n.01`), and the most common interpretation of _refuse_ as a verb, "show unwillingness towards," is the correct one in our context.
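
The tagger's output can narrow the candidate senses before any further disambiguation. The sketch below works on the synset names printed above rather than on live WordNet objects, and relies on the middle field of each name (`n`, `v`, ...) encoding the part of speech. `PENN_TO_WN` and `candidate_senses` are names invented for this illustration. With WordNet loaded, passing `pos=wn.NOUN` or `pos=wn.VERB` to `wn.synsets()` should give the same filtering.

```python
# Synset names copied from the wn.synsets('refuse') output above.
synset_names = ['garbage.n.01', 'refuse.v.01', 'refuse.v.02', 'defy.v.02',
                'deny.v.04', 'resist.v.05', 'reject.v.06']

# Hypothetical mapping from the first two characters of a Penn Treebank tag
# to a WordNet part-of-speech letter. Simplification: satellite adjectives,
# which WordNet marks with 's', are not covered.
PENN_TO_WN = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

def candidate_senses(names, penn_tag):
    """Keep only the synset names whose POS field matches the Penn tag."""
    wn_pos = PENN_TO_WN.get(penn_tag[:2])
    return [name for name in names if name.split('.')[1] == wn_pos]

noun_senses = candidate_senses(synset_names, 'NN')   # 'refuse' tagged as NN
verb_senses = candidate_senses(synset_names, 'VBP')  # 'refuse' tagged as VBP
print(noun_senses)
print(verb_senses)
```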

#### Exercise

Many words, like _ski_ and _race_, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Now make up a sentence with both uses of one of these words, and run the POS-tagger on this sentence.

In [20]:
# your code here

text = nltk.word_tokenize("The children had to run to get to run on time")
tagged = nltk.pos_tag(text)
tagged

[('The', 'DT'),
 ('children', 'NNS'),
 ('had', 'VBD'),
 ('to', 'TO'),
 ('run', 'VB'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('to', 'TO'),
 ('run', 'VB'),
 ('on', 'IN'),
 ('time', 'NN')]

### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a **tuple** consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function `str2tuple()`. The tagger already returns its output in this form:

In [23]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(text)
tagged_token = tagged[0]
tagged

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

In [24]:
print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])

('They', 'PRP')
They
PRP
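
Tagged tokens do not have to come from the tagger. As mentioned at the start of this section, `str2tuple()` (found in `nltk.tag`) builds one from the conventional `word/TAG` string form, and `tuple2str()` goes the other way. A short sketch, with a hand-written tagged string as the assumed input:

```python
from nltk.tag import str2tuple, tuple2str

# A hand-written string in the conventional word/TAG notation (assumed input).
tagged_string = "They/PRP refuse/VBP to/TO permit/VB us/PRP"

# Split on whitespace, then turn each word/TAG item into a (word, tag) tuple.
tagged_tokens = [str2tuple(item) for item in tagged_string.split()]
print(tagged_tokens)

# Convert a single tuple back to its string form.
round_trip = tuple2str(('fly', 'NN'))
print(round_trip)
```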


In [26]:
print("Text = ", text)
tokens = [a for (a, b) in tagged]
print("Tokens = ",tokens)

Text =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']
Tokens =  ['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']


In [27]:
tags = [b for (a, b) in tagged]
print("POS Tags = ", tags)

POS Tags =  ['PRP', 'VBP', 'TO', 'VB', 'PRP', 'TO', 'VB', 'DT', 'NN', 'NN']


In [28]:
fd = nltk.FreqDist(tags)
print(fd)
fd.tabulate()

<FreqDist with 6 samples and 10 outcomes>
 TO  VB PRP  NN  DT VBP 
  2   2   2   2   1   1 
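
Fine-grained tags can be collapsed into coarser classes before counting, which is often easier to read for large texts. The sketch below uses `collections.Counter` from the standard library (`nltk.FreqDist` is a `Counter` subclass, so the same calls should work on it). Truncating each tag to its first two characters is a simplifying assumption: it groups the Penn Treebank noun, verb, and adjective tags, but it is not an official coarse tagset.

```python
from collections import Counter

# Tags copied from the "POS Tags" output above.
tags = ['PRP', 'VBP', 'TO', 'VB', 'PRP', 'TO', 'VB', 'DT', 'NN', 'NN']

# Assumption: the first two characters identify the coarse class
# (e.g. VB, VBP, VBD all become VB).
coarse_counts = Counter(tag[:2] for tag in tags)
total = sum(coarse_counts.values())

# Print counts and relative frequencies, most frequent class first.
for tag, count in sorted(coarse_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{tag:<3} {count:>2} {count / total:.0%}")
```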


In [29]:
text = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
tagged = nltk.pos_tag(text)

In [30]:
tokens = [a for (a, b) in tagged]
tags = [b for (a, b) in tagged]
fd = nltk.FreqDist(tags)
fd.tabulate()

 NNP   NN    ,   IN  PRP    .   DT   JJ   VB   RB   CC PRP$    :  VBP  NNS  VBZ   TO  VBD   MD   ''  VBN   WP  WRB  POS  VBG  WDT   CD  JJS   EX  JJR NNPS  RBR   RP   UH    (    )  PDT  RBS  WP$   FW  SYM 
5217 4214 2892 2821 2463 2362 2263 1696 1538 1492 1295 1261  980  943  788  774  685  591  570  517  261  253  231  209  174  111   90   87   79   74   62   58   51   50   45   43   42   36   19   15    8 


In [31]:
from nltk.book import *
tagged_wsj = nltk.pos_tag(text7)

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [32]:
tokens_wsj = [a for (a, b) in tagged_wsj]
tags_wsj = [b for (a, b) in tagged_wsj]
fd_wsj = nltk.FreqDist(tags_wsj)
fd_wsj.tabulate()

   NN   NNP    IN    JJ    DT   NNS     ,    CD     .   VBD    RB    VB    CC    TO   VBZ   VBN   PRP   VBP   VBG    MD   POS  PRP$     $    ``    ''     :   JJR   WDT    WP    RP  NNPS   JJS   WRB   RBR    EX   RBS   PDT     #    FW   WP$    UH   SYM 
14666 10457 10055  8747  8117  6333  4885  4725  3874  3185  2875  2570  2312  2181  2036  1856  1712  1496  1396   930   852   771   726   712   693   563   359   250   244   216   182   179   177   142    91    35    27    16    15    14     3     1 


#### Exercise 

Load a text of your choice, tokenize it, and perform part-of-speech tagging on it. Then extract the nouns from the text and perform a frequency analysis to identify the most common nouns in the text. (Warning: POS tagging takes a good amount of time when processing long texts, so try to select a text with fewer than 10K tokens, or simply perform POS tagging on the first 10K-20K tokens.)

Repeat the exercise for adjectives.

PS: If you want to parse text from HTML without resorting to XPath expressions, you can use the "BeautifulSoup" library as follows:

In [39]:
from bs4 import BeautifulSoup
import requests

url = "http://www.nytimes.com/2014/11/11/world/asia/obama-apec-china-hong-kong.html"
resp = requests.get(url)
html = resp.text 
raw = BeautifulSoup(html, "lxml").get_text()

# The code below is to remove the junk that was extracted in addition to the article
start = raw.index(u"BEIJING —")
end = raw.index(u"than Shanghai Tang.")
raw = raw[start:end]

# Let's do the NLTK stuff
tokens = nltk.word_tokenize(raw)
tagged = nltk.pos_tag(tokens)

In [41]:
tagged

[('BEIJING', 'NNP'),
 ('—', 'NNP'),
 ('President', 'NNP'),
 ('Obama', 'NNP'),
 ('is', 'VBZ'),
 ('in', 'IN'),
 ('China', 'NNP'),
 ('for', 'IN'),
 ('less', 'JJR'),
 ('than', 'IN'),
 ('three', 'CD'),
 ('days', 'NNS'),
 ('this', 'DT'),
 ('week', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('he', 'PRP'),
 ('is', 'VBZ'),
 ('seeing', 'VBG'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('deal', 'NN'),
 ('of', 'IN'),
 ('President', 'NNP'),
 ('Xi', 'NNP'),
 ('Jinping.On', 'NNP'),
 ('Tuesday', 'NNP'),
 (',', ','),
 ('they', 'PRP'),
 ('will', 'MD'),
 ('go', 'VB'),
 ('for', 'IN'),
 ('a', 'DT'),
 ('quiet', 'JJ'),
 ('walk', 'NN'),
 ('in', 'IN'),
 ('Mr.', 'NNP'),
 ('Xi’s', 'NNP'),
 ('walled', 'VBD'),
 ('compound', 'NN'),
 ('and', 'CC'),
 ('have', 'VBP'),
 ('dinner', 'VBN'),
 ('.', '.'),
 ('The', 'DT'),
 ('next', 'JJ'),
 ('day', 'NN'),
 (',', ','),
 ('they', 'PRP'),
 ('will', 'MD'),
 ('take', 'VB'),
 ('part', 'NN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('formal', 'JJ'),
 ('welcoming', 'NN'),
 ('ceremony', 'NN'),
 ('at', 'IN'),

In [46]:
nouns = [token for (token,tag) in tagged if  tag.startswith('NN')]
fd_nyt = nltk.FreqDist(nouns)
fd_nyt.most_common(20)

[('Mr.', 23),
 ('Obama', 16),
 ('story', 9),
 ('China', 9),
 ('White', 9),
 ('House', 8),
 ('Continue', 8),
 ('Xi', 8),
 ('Advertisement', 8),
 ('President', 6),
 ('time', 6),
 ('president', 6),
 ('news', 6),
 ('officials', 5),
 ('leaders', 5),
 ('deal', 4),
 ('meeting', 4),
 ('part', 4),
 ('questions', 4),
 ('email', 3)]

In [48]:
adjectives = [token for (token,tag) in tagged if  tag.startswith('JJ')]
fd_nyt = nltk.FreqDist(adjectives)
fd_nyt.most_common(20)

[('Chinese', 10),
 ('main', 9),
 ('summary', 3),
 ('American', 3),
 ('product-title', 3),
 ('other', 3),
 ('public', 3),
 ('former', 2),
 ('less', 2),
 ('major', 2),
 ('prosperous', 2),
 ('willing', 2),
 ('illuminated', 1),
 ('last', 1),
 ('easier', 1),
 ('broader', 1),
 ('aware', 1),
 ('re-enter', 1),
 ('quiet', 1),
 ('vivid', 1)]

### Primitive sentiment analysis

Adjectives are known to be the primary carriers of sentiment. So now let's pick a piece of text and identify the adjectives that appear in it, together with their sentiment scores. For that, we will use SentiWordNet, a lexical resource for opinion mining.

In [53]:
# See http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html for the documentation

from nltk.corpus import sentiwordnet as swn
print(swn.senti_synset('breakdown.n.03'))

<breakdown.n.03: PosScore=0.0 NegScore=0.25>


Now let's analyze a review text:

In [50]:
# Amazon review for Samsung Galaxy S5, White 16GB
# http://www.amazon.com/review/R3UULR1IWEUS4I/ref=cm_cr_dp_title?ie=UTF8&ASIN=B00IZ1X21K&nodeID=2335752011&store=wireless


content = u'''
First off, I am not a professional reviewer, nor am I employed or compensated by Samsung or any other company. Instead of boring you with facts - which you can find anywhere on the Net - I will just give you some real-world impressions on how it looks, feels, and runs. With that out of the way, let's get to the point and the nitty gritty, shall we?

* THE SCREEN - that is the very first thing you will notice when you look at the S5. Samsung has found its niche with AMOLED screens, which are BRIGHT & SATURATED. Everything almost literally jumps out at you, and sometimes even too much so. I had to switch to the "natural" setting, as the "vivid" and even "standard" profiles are too saturated(and FAKE) for me. It's better as a demo unit to draw you in, but for everyday use, I recommend switching to the natural profile.
FACTS: The Galaxy S5 has a 5.1-inch Super AMOLED capacitive touchscreen with Full HD resolution - 1080 x 1920 pixels or ~432 ppi pixel density, plus Gorilla Glass 3 to protect the screen from scratches.

* The Look - the S5 has a more squared-off edges look than the S4, which is more squared off than the S3, but all three are not as angular as the S2. In terms of roundness-to square-ness, it goes from the S3 - S4 - S5 - S2 (the original S just looks like an iPhone 3GS). Check out my images for an easier comparison. The S5 is the tallest and widest, but not the thickest of the Galaxy S's. The best thing I can say about this is it's an evolution. Beauty is subjective, so judge for yourself. The front side is almost the same as any other Galaxy phone: You have the physical Home button, flanked by the "back" and "menu" capacitive buttons. Probably the most improved aspect of the design is in its functionality - it is now dust-proof, and water-proof up to 3 feet!
FACTS: The dimensions are 5.59" x 2.85" x 0.32"(142cm x 72.5cm x 8.1cm), and weighs 5.11oz(145g).

* The Feel - Samsung has taken a lot of flack for making the Galaxy S line so cheap looking and feeling with its plastic bodies, for being the top Android phone maker. HTC has been known to have the best craftsmanship with their all-metal One phones. Perhaps Samsung feel they are so dominant that they don't have to spend more to mass-produce metal phones, but since they don't want to come off as too arrogant, so their compromise is a dimpled, faux-rubber backside like the Nexus 7(2012) and its very own Galaxy Note 3. It definitely gives a better feel - it doesn't slip and slide in your hands or pockets anymore - but it cannot compare to the feel and craftsmanship of the HTC One(both the m7 and m8). It is on the right track though, so let's hope that rumored luxury "F" line or next year's S6 will continue to get better.

* How it Runs - This phone is fast, fast, FAST! With a 2.5gHz Snapdragon 801, it has the fastest processor out there right now. It terms of real speed, I cannot say if it is faster than the HTC One m8 or the Sony Xperia Z2, but it is definitely up there. When you touch an app icon to launch it, it launches nearly instantly. To really see how this phone flies, just open the gallery app and scroll through all your photos and you'll see what I mean. Usually the gallery is where most phones stutter as it tries to load all your photos and albums - but NOT the S5!

* The Camera - FINALLY! Samsung has decided to make a decent camera, and not just as an afterthought. This 16mp camera is really awesome, so much better than the S4. I would always get washed out images with my S3/S4/Note 2, but with the S5, it actually looks like it's from a decent point-and-shoot dedicated camera with crisp, bright, and saturated images. Low-light shooting is also vastly improved, although not as good as the new HTC One m8. 16mp means 5312 x 2988 -resolution images, so you can actually blow them up or crop them down without fearing the dreaded pixelation monster. There are a myriad of other cool and useful camera features that I will save for you to find out(like macro and "Google Street View" modes :]). And lastly, the focus is quick, quick, QUICK! Nearly instantaneous focus allows you to capture those hard-to-capture moments easier. A definitely thumbs up to Samsung for paying attention to the camera and its functions.

* Software - I'm still trying to figure out everything, as there is A LOT of stuff under the hood. Samsung's TouchWiz user interface this time around is A LOT less intrusive though, as much as can be without being totally stock Android, I guess. The layout and iconography are flatter and simpler, and for the better in my view. There is also a new sensor on the back, just beneath the camera lens. It is a heart-rate monitor/pedometer, and it comes with its own health app called S Health. There is a new battery-saving mode which can save you precious minutes when you're caught in a bind. All in all, I think this version is a lot nicer-looking, more responsive, and better than the precious S phones.

The ultimate question is whether this phone is a worthy upgrade over the S4. As my review title suggests, it is an evolution, an incremental upgrade over the S4. So with that said I cannot whole-heartedly recommend it if you already have a good phone, or even over the S4. But I do feel this upgrade is more vast and much better than from the S3 to the S4, so in that sense Samsung has done a much better job this year. If you are switching from an older phone that was made at least 2 years ago, then I would tell you jump right in and try the S5 - it will not disappoint you. But for those with already a good phone, and/or say you just finished year one of your 2-year contract, then I would say think hard before you make the leap. For my money, I think the Note 4 and S6 will be the bigger upgrades more worth waiting for.
'''

In [51]:
tokens = nltk.word_tokenize(content)
text = nltk.Text(tokens)
tagged = nltk.pos_tag(tokens)

In [52]:
# Let's keep the adjectives only
adjectives = [word for (word , pos_tag) in tagged if pos_tag.startswith('JJ')]
print(adjectives)

['professional', 'other', 'Net', 'real-world', 'nitty', 'first', 'much', 'natural', 'standard', 'everyday', 'natural', '5.1-inch', 'ppi', 'squared-off', 'angular', 'roundness-to', 'original', 'easier', 'widest', 'best', 'subjective', 'front', 'same', 'other', 'physical', 'capacitive', 'improved', 'dust-proof', 'water-proof', 'x', 'cheap', 'top', 'best', 'all-metal', 'dominant', 'more', 'arrogant', 'dimpled', 'faux-rubber', 'own', 'better', 'right', 'next', 'fastest', 'real', 'app', 'decent', 'awesome', 'better', 'decent', 'Low-light', 'good', 'new', 'dreaded', 'other', 'useful', 'quick', 'quick', 'instantaneous', 'hard-to-capture', 'intrusive', 'much', 'better', 'new', 'heart-rate', 'own', 'new', 'precious', 'nicer-looking', 'responsive', 'better', 'precious', 'ultimate', 'worthy', 'incremental', 'good', 'vast', 'better', 'better', 'older', 'least', 'good', '2-year', 'hard', 'bigger', 'worth']


In [54]:
# Now we use SentiWordNet and drop the words that do not appear in its lexicon.
# Since we do not have much information for further disambiguation, we keep only the
# most popular interpretation (list element 0) if there are multiple ones.
# Look up each adjective once and reuse the result for both the filter and the pick.
senti_candidates = [(w, list(swn.senti_synsets(w, 'a'))) for w in adjectives]
resolved_adjectives = [(w, synsets[0])
                       for (w, synsets) in senti_candidates
                       if len(synsets) > 0]
print(resolved_adjectives)

[('professional', SentiSynset('professional.a.01')), ('other', SentiSynset('other.a.01')), ('Net', SentiSynset('net.a.01')), ('first', SentiSynset('first.a.01')), ('much', SentiSynset('much.a.01')), ('natural', SentiSynset('natural.a.01')), ('standard', SentiSynset('standard.a.01')), ('everyday', SentiSynset('everyday.s.01')), ('natural', SentiSynset('natural.a.01')), ('angular', SentiSynset('angular.a.01')), ('original', SentiSynset('original.s.01')), ('easier', SentiSynset('easy.a.01')), ('widest', SentiSynset('wide.a.01')), ('best', SentiSynset('best.a.01')), ('subjective', SentiSynset('subjective.a.01')), ('front', SentiSynset('front.a.01')), ('same', SentiSynset('same.a.01')), ('other', SentiSynset('other.a.01')), ('physical', SentiSynset('physical.a.01')), ('capacitive', SentiSynset('capacitive.a.01')), ('improved', SentiSynset('improved.a.01')), ('x', SentiSynset('ten.s.01')), ('cheap', SentiSynset('cheap.a.01')), ('top', SentiSynset('top.a.01')), ('best', SentiSynset('best.a.01

In [55]:
# SentiWordNet assigns to each synset of WordNet three
# sentiment scores: positivity, negativity, and objectivity.

for (w,a) in resolved_adjectives:
    print("Word:", w)
    print("Synset:", a)
    print("Pos score:",  a.pos_score())
    print("Neg score:",  a.neg_score())
    print("Objectivity score:",  a.obj_score())
    print("======================================")

Word: professional
Synset: <professional.a.01: PosScore=0.0 NegScore=0.0>
Pos score: 0.0
Neg score: 0.0
Objectivity score: 1.0
Word: other
Synset: <other.a.01: PosScore=0.0 NegScore=0.625>
Pos score: 0.0
Neg score: 0.625
Objectivity score: 0.375
Word: Net
Synset: <net.a.01: PosScore=0.0 NegScore=0.0>
Pos score: 0.0
Neg score: 0.0
Objectivity score: 1.0
Word: first
Synset: <first.a.01: PosScore=0.0 NegScore=0.0>
Pos score: 0.0
Neg score: 0.0
Objectivity score: 1.0
Word: much
Synset: <much.a.01: PosScore=0.0 NegScore=0.0>
Pos score: 0.0
Neg score: 0.0
Objectivity score: 1.0
Word: natural
Synset: <natural.a.01: PosScore=0.25 NegScore=0.0>
Pos score: 0.25
Neg score: 0.0
Objectivity score: 0.75
Word: standard
Synset: <standard.a.01: PosScore=0.375 NegScore=0.375>
Pos score: 0.375
Neg score: 0.375
Objectivity score: 0.25
Word: everyday
Synset: <everyday.s.01: PosScore=0.125 NegScore=0.0>
Pos score: 0.125
Neg score: 0.0
Objectivity score: 0.875
Word: natural
Synset: <natural.a.01: PosScore=0.
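
The per-word scores can be combined into a rough overall polarity. The sketch below hard-codes the scores printed above instead of querying SentiWordNet, and the plain `pos - neg` sum is an assumed aggregation chosen for illustration, not a method prescribed by SentiWordNet. It also shows that, for these values, the objectivity score equals `1 - (pos + neg)`.

```python
# (word, positive score, negative score), copied from the output above.
scores = [
    ('professional', 0.0, 0.0),
    ('other', 0.0, 0.625),
    ('Net', 0.0, 0.0),
    ('first', 0.0, 0.0),
    ('much', 0.0, 0.0),
    ('natural', 0.25, 0.0),
    ('standard', 0.375, 0.375),
    ('everyday', 0.125, 0.0),
]

# Net polarity per word: positive minus negative (assumed aggregation).
net = {word: pos - neg for word, pos, neg in scores}

# Objectivity recomputed from the other two scores.
objectivity = {word: 1.0 - (pos + neg) for word, pos, neg in scores}

# Rough polarity of this handful of adjectives.
overall = sum(net.values())
print(net)
print(overall)
```

Here the single word `other` dominates the total, which hints at how crude it is to take the first sense of every adjective at face value.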

In [56]:
# But let's take a look at what we rejected
rejected_adjectives = [w for w in adjectives if len(list(swn.senti_synsets(w, 'a')))==0]
print(rejected_adjectives)


['real-world', 'nitty', '5.1-inch', 'ppi', 'squared-off', 'roundness-to', 'dust-proof', 'water-proof', 'dimpled', 'faux-rubber', 'app', 'Low-light', 'hard-to-capture', 'heart-rate', 'nicer-looking', '2-year']


Perhaps we would also like to figure out what the adjectives in the text refer to. 

In [57]:
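# Pair each token with its predecessor. Starting at index 1 avoids
# tagged[i-1] wrapping around to the last token when i == 0.
for i in range(1, len(tagged)):
    current_word, current_pos = tagged[i]
    previous_word, previous_pos = tagged[i-1]
    # Print every adjective (JJ) immediately followed by a singular noun (NN)
    if current_pos == 'NN' and previous_pos == 'JJ':
        print(previous_word + " " + current_word)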

professional reviewer
other company
nitty gritty
first thing
everyday use
natural profile
ppi pixel
roundness-to square-ness
front side
improved aspect
water-proof up
faux-rubber backside
right track
next year
real speed
app icon
decent camera
decent point-and-shoot
Low-light shooting
dreaded pixelation
other cool
useful camera
instantaneous focus
intrusive though
new sensor
heart-rate monitor/pedometer
own health
new battery-saving
ultimate question
worthy upgrade
incremental upgrade
good phone
good phone
2-year contract


#### Exercise 1

Try a new text with this type of sentiment analysis. Let's figure out what works and what does not.

#### Exercise 2

Instead of adjective-noun pairs, we can use verbs and adverbs (e.g., "works nicely"). Let's modify the code above to extract patterns involving verbs and adverbs.

#### Exercise 3

How can you modify the code to find more patterns, instead of just JJ-NN (adjective followed by noun)?

### Named Entity Recognition

Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Here is a list of some commonly used Named Entities:
 
* ORGANIZATION (e.g., Georgia-Pacific Corp., WHO)
* PERSON (e.g., Eddy Bonte, President Obama)
* LOCATION (e.g., Murray River, Mount Everest)
* DATE (e.g., June, 2008-06-29)
* TIME (e.g., two fifty a m, 1:30 p.m.)
* MONEY (e.g., 175 million Canadian Dollars, GBP 10.40)
* PERCENT (e.g., twenty pct, 18.75 %)
* FACILITY (e.g., Washington Monument, Stonehenge)
* GPE (e.g., South East Asia, Midlothian)

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type.

NLTK provides a module that has already been trained to recognize named entities, accessed with the function `nltk.ne_chunk()`. If we set the parameter `binary=True`, then named entities are just tagged as `NE`; otherwise, the classifier adds category labels such as `PERSON`, `ORGANIZATION`, and `GPE`.
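
The value returned by `nltk.ne_chunk()` is an `nltk.Tree` in which each recognized entity is a labeled subtree. The sketch below shows one way to pull `(label, text)` pairs out of such a tree. To stay independent of the trained classifier, it builds a small tree by hand with `Tree.fromstring()`; the bracketed string is an assumption written to resemble chunker output, not something the chunker produced in this run.

```python
from nltk import Tree
from nltk.tag import str2tuple

# Hand-built tree in the same shape as ne_chunk() output (assumed input).
# read_leaf turns each word/TAG leaf into a (word, tag) tuple.
tree = Tree.fromstring(
    "(S (PERSON Morgan/NNP) (PERSON Stanley/NNP) received/VBD a/DT "
    "long-awaited/JJ ratings/NNS upgrade/NN from/IN "
    "(ORGANIZATION Moody's/NNP Investors/NNPS Service/NNP))",
    read_leaf=str2tuple,
)

# Every subtree other than the sentence root (S) is a named entity.
entities = []
for subtree in tree.subtrees():
    if subtree.label() != "S":
        words = [word for (word, tag) in subtree.leaves()]
        entities.append((subtree.label(), " ".join(words)))

print(entities)
```

The same loop should work on the trees printed by the cells that follow, since they have the same structure.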

In [44]:
raw = u'''
Morgan Stanley received a long-awaited ratings upgrade from Moody’s Investors Service, an endorsement of the firm’s strategy shift and a move that could help the bank pick up new trading clients.

In raising Morgan Stanley’s rating two notches, Moody’s determined that strategic changes at the investment bank in recent years have resulted in a safer business model and improved profitability.

Goldman Sachs Group Inc., Bank of America Corp. and Citigroup Inc. received one-notch upgrades as part of a broader Moody’s review of the largest global banks.

The actions were in part a reversal of the downgrades Moody’s issued to several banks in 2012. Moody’s at that time found that the European sovereign-debt crisis and other macroeconomic and regulatory factors were crimping banks’ profitability.

Since the downgrade, Morgan Stanley Chairman and Chief Executive James Gorman has continued to push the bank away from volatile activities such as trading while expanding in businesses such as wealth management that generate earnings more steadily. Moody’s said that those switches to its business mix, as well as changes to its funding that would lead to fewer losses for creditors if the bank were to fail, led it to upgrade Morgan Stanley’s debt by two notches from Baa2 to A3.'''

In [45]:
sentences = nltk.sent_tokenize(raw)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

for sent in sentences:
    named_entities = nltk.ne_chunk(sent, binary=False)
    print(named_entities)

(S
  (PERSON Morgan/NNP)
  (PERSON Stanley/NNP)
  received/VBD
  a/DT
  long-awaited/JJ
  ratings/NNS
  upgrade/NN
  from/IN
  (ORGANIZATION Moody’s/NNP Investors/NNPS Service/NNP)
  ,/,
  an/DT
  endorsement/NN
  of/IN
  the/DT
  firm’s/NN
  strategy/NN
  shift/NN
  and/CC
  a/DT
  move/NN
  that/WDT
  could/MD
  help/VB
  the/DT
  bank/NN
  pick/VB
  up/RP
  new/JJ
  trading/NN
  clients/NNS
  ./.)
(S
  In/IN
  raising/VBG
  (PERSON Morgan/NNP)
  Stanley’s/NNP
  rating/NN
  two/CD
  notches/NNS
  ,/,
  (PERSON Moody’s/NNP)
  determined/VBD
  that/IN
  strategic/JJ
  changes/NNS
  at/IN
  the/DT
  investment/NN
  bank/NN
  in/IN
  recent/JJ
  years/NNS
  have/VBP
  resulted/VBN
  in/IN
  a/DT
  safer/NN
  business/NN
  model/NN
  and/CC
  improved/VBN
  profitability/NN
  ./.)
(S
  (PERSON Goldman/NNP)
  (PERSON Sachs/NNP Group/NNP Inc./NNP)
  ,/,
  (ORGANIZATION Bank/NNP)
  of/IN
  (ORGANIZATION America/NNP Corp./NNP)
  and/CC
  (ORGANIZATION Citigroup/NNP Inc./NNP)
  received/VBD
  