# Information Extraction
Build a system that extracts structured data, such as tables, from unstructured text.

## Information Extraction Architecture
![information_extraction](http://www.nltk.org/images/ie-architecture.png)

In [1]:
import nltk
import re
import pprint

In [2]:
def ie_preprocess(document):
    # split the raw text into sentences, then tokenize and POS-tag each one
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

# Chunking
The basic technique we will use for entity detection is **chunking**, which segments and labels multi-token sequences as illustrated in the following figure. The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.
![chunking](http://www.nltk.org/images/chunk-segmentation.png)

## Noun Phrase Chunking
*Example*:
    	[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.
    
As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, the market for system-management software for Digital's hardware is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk the market. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.

In [3]:
sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('barked','VBD'),
           ('at','IN'),('the','DT'),('cat','NN')]

grammar = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


## Tag Patterns
The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. `<DT>?<JJ>*<NN>`.
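
To make the idea of a tag pattern concrete, here is a minimal sketch that translates a pattern into an ordinary Python regular expression and runs it over a string of tags. It is a simplified illustration, not NLTK's actual implementation; the helper names `tag_pattern_to_regex` and `find_chunks` are hypothetical.

```python
import re

def tag_pattern_to_regex(pattern):
    """Translate a tag pattern such as '<DT>?<JJ>*<NN>' into a Python regex."""
    def wrap(match):
        # Inside a tag, '.' must not be able to cross a tag boundary.
        inner = match.group(1).replace(".", "[^<>]")
        # Group the whole tag so that ?, * and + apply to the tag as a unit.
        return "(?:<(?:%s)>)" % inner
    return re.sub(r"<([^<>]+)>", wrap, pattern.replace(" ", ""))

def find_chunks(tagged_sentence, pattern):
    """Return (start, end) token spans whose tag sequence matches the pattern."""
    tag_string = ""
    offset_to_index = {}
    for index, (_, tag) in enumerate(tagged_sentence):
        offset_to_index[len(tag_string)] = index
        tag_string += "<%s>" % tag
    offset_to_index[len(tag_string)] = len(tagged_sentence)

    regex = re.compile(tag_pattern_to_regex(pattern))
    spans = []
    for match in regex.finditer(tag_string):
        if match.end() > match.start():  # ignore empty matches
            spans.append((offset_to_index[match.start()],
                          offset_to_index[match.end()]))
    return spans

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
spans = find_chunks(sentence, "<DT>?<JJ>*<NN>")
chunks = [[word for word, _ in sentence[start:end]] for start, end in spans]
print(chunks)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

The sketch assumes that tags never contain `<` or `>`, which holds for the tagsets used in these notes.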

## Chunking with Regular Expressions
To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

In [6]:
grammar = r"""
        NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and noun
            {<NNP>+}              # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:

In [9]:
nouns = [('money','NN'),('market','NN'),('fund','NN')]
grammar = "NP: {<NN><NN>}" # chunk two consecutive nouns
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))

(S (NP money/NN market/NN) fund/NN)
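
The same leftmost, non-overlapping behaviour can be observed with plain regular expressions. The sketch below assumes, purely for illustration, that the three tags are encoded as a string with one `<TAG>` per token; it does not use NLTK.

```python
import re

# Three consecutive nouns, encoded as one <TAG> per token (illustrative encoding).
tag_string = "<NN><NN><NN>"

# A pattern for exactly two consecutive nouns.
two_nouns = re.compile(r"(?:<NN>){2}")

# finditer scans left to right and never reuses characters from an earlier match,
# so the third noun is left over once the first two have been consumed.
spans = [match.span() for match in two_nouns.finditer(tag_string)]
print(spans)  # [(0, 8)]
```

Only one match is reported, covering the first two tags, which mirrors the `(NP money/NN market/NN) fund/NN` result.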


## Exploring Text Corpora

In [27]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': print(subtree)

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK expected/VBN to/TO approve/VB)
(CHUNK expected/VBN to/TO make/VB)
(CHUNK intends/VBZ to/TO make/VB)
(CHUNK seek/VB to/TO set/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK designed/VBN to/TO provide/VB)
(CHUNK get/VB to/TO hear/VB)
(CHUNK expects/VBZ to/TO tell/VB)
(CHUNK expected/VBN to/TO give/VB)
(CHUNK prefer/VB to/TO pay/VB)
(CHUNK required/VBN to/TO obtain/VB)
(CHUNK permitted/VBN to/TO teach/VB)
(CHUNK designed/VBN to/TO reduce/VB)
(CHUNK Asked/VBN to/TO elaborate/VB)
(CHUNK got/VBN to/TO go/VB)
(CHUNK raised/VBN to/TO pay/VB)
(CHUNK scheduled/VBN to/TO go/VB)
(CHUNK cut/VBN to/TO meet/VB)
(CHUNK needed/VBN to/TO meet/VB)
(CHUNK hastened/VBD to/TO add/VB)
(CHUNK found/VBN to/TO prevent/VB)
(CHUNK continue/VB to/TO insist/VB)
(CHUNK compelled/VBN to/TO make/VB)
...

(CHUNK helping/VBG to/TO strengthen/VB)
(CHUNK designed/VBN to/TO promote/VB)
(CHUNK threatening/VBG to/TO expand/VB)
(CHUNK seeks/VBZ to/TO get/VB)
(CHUNK begin/VB to/TO see/VB)
(CHUNK continue/VB to/TO expand/VB)
(CHUNK failing/VBG to/TO render/VB)
(CHUNK decided/VBD to/TO tackle/VB)
(CHUNK expects/VBZ to/TO sign/VB)
(CHUNK tends/VBZ to/TO become/VB)
(CHUNK came/VBD to/TO understand/VB)
(CHUNK deserve/VB to/TO breathe/VB)
(CHUNK advised/VBN to/TO seek/VB)
(CHUNK attempting/VBG to/TO make/VB)
(CHUNK try/VB to/TO gun/VB)
(CHUNK began/VBD to/TO fill/VB)
(CHUNK proposes/VBZ to/TO preserve/VB)
(CHUNK asked/VBN to/TO approve/VB)
(CHUNK seeking/VBG to/TO break/VB)
(CHUNK tends/VBZ to/TO spread/VB)
(CHUNK want/VB to/TO amend/VB)
(CHUNK rejected/VBN to/TO seek/VB)
(CHUNK continued/VBN to/TO speak/VB)
(CHUNK trying/VBG to/TO make/VB)
(CHUNK expected/VBN to/TO head/VB)
(CHUNK tempted/VBN to/TO let/VB)
(CHUNK appear/VB to/TO cost/VB)
(CHUNK attempt/VB to/TO shore/VB)
...

(CHUNK managed/VBN to/TO hold/VB)
(CHUNK intended/VBN to/TO illustrate/VB)
(CHUNK tried/VBN to/TO get/VB)
(CHUNK learn/VB to/TO live/VB)
(CHUNK helping/VBG to/TO move/VB)
(CHUNK striving/VBG to/TO hold/VB)
(CHUNK choose/VB to/TO work/VB)
(CHUNK tried/VBD to/TO see/VB)
(CHUNK trying/VBG to/TO create/VB)
(CHUNK made/VBN to/TO appear/VB)
(CHUNK failed/VBD to/TO make/VB)
(CHUNK seemed/VBD to/TO deserve/VB)
(CHUNK managed/VBN to/TO mix/VB)
(CHUNK want/VB to/TO hurt/VB)
(CHUNK liked/VBD to/TO nip/VB)
(CHUNK manages/VBZ to/TO acquire/VB)
(CHUNK widened/VBN to/TO enchant/VB)
(CHUNK serve/VB to/TO contradict/VB)
(CHUNK dare/VB to/TO experiment/VB)
(CHUNK tried/VBD to/TO humanize/VB)
(CHUNK tries/VBZ to/TO preserve/VB)
(CHUNK helps/VBZ to/TO rebut/VB)
(CHUNK seems/VBZ to/TO make/VB)
(CHUNK began/VBD to/TO play/VB)
(CHUNK cares/VBZ to/TO remember/VB)
(CHUNK serve/VB to/TO show/VB)
(CHUNK want/VB to/TO collect/VB)
(CHUNK designed/VBN to/TO invite/VB)
(CHUNK attempt/VB to/TO make/VB)
...

(CHUNK sized/VBN to/TO fit/VB)
(CHUNK continue/VB to/TO release/VB)
(CHUNK wish/VB to/TO create/VB)
(CHUNK trim/VB to/TO fit/VB)
(CHUNK cut/VBN to/TO fit/VB)
(CHUNK help/VB to/TO prevent/VB)
(CHUNK designed/VBN to/TO take/VB)
(CHUNK used/VBN to/TO transport/VB)
(CHUNK want/VB to/TO buy/VB)
(CHUNK used/VBN to/TO fasten/VB)
(CHUNK help/VB to/TO keep/VB)
(CHUNK needed/VBN to/TO build/VB)
(CHUNK designed/VBN to/TO accommodate/VB)
(CHUNK adjusted/VBN to/TO suit/VB)
(CHUNK used/VBN to/TO cut/VB)
(CHUNK want/VB to/TO avoid/VB)
(CHUNK agreed/VBN to/TO take/VB)
(CHUNK planned/VBD to/TO destroy/VB)
(CHUNK allowed/VBN to/TO issue/VB)
(CHUNK managed/VBD to/TO coerce/VB)
(CHUNK want/VB to/TO know/VB)
(CHUNK planning/VBG to/TO bring/VB)
(CHUNK urged/VBN to/TO keep/VB)
(CHUNK come/VB to/TO swim/VB)
(CHUNK enjoined/VBN to/TO look/VB)
(CHUNK prepared/VBN to/TO cope/VB)
(CHUNK want/VB to/TO make/VB)
(CHUNK allowed/VBN to/TO dry/VB)
(CHUNK pays/VBZ to/TO buy/VB)
(CHUNK want/VB to/TO play/VB)
...

(CHUNK refusing/VBG to/TO keep/VB)
(CHUNK wishes/VBZ to/TO discuss/VB)
(CHUNK want/VB to/TO ask/VB)
(CHUNK want/VB to/TO tap/VB)
(CHUNK said/VBN to/TO use/VB)
(CHUNK employed/VBN to/TO see/VB)
(CHUNK shoot/VB to/TO kill/VB)
(CHUNK refused/VBD to/TO touch/VB)
(CHUNK threatened/VBD to/TO shoot/VB)
(CHUNK said/VBD to/TO let/VB)
(CHUNK begin/VB to/TO roll/VB)
(CHUNK held/VBN to/TO assure/VB)
(CHUNK going/VBG to/TO make/VB)
(CHUNK managed/VBD to/TO get/VB)
(CHUNK wanted/VBD to/TO play/VB)
(CHUNK prepared/VBD to/TO counterattack/VB)
(CHUNK failed/VBN to/TO rally/VB)
(CHUNK tried/VBD to/TO rape/VB)
(CHUNK refused/VBD to/TO speak/VB)
(CHUNK called/VBN to/TO look/VB)
(CHUNK refused/VBD to/TO say/VB)
(CHUNK mean/VB to/TO suggest/VB)
(CHUNK prepared/VBN to/TO carry/VB)
(CHUNK designed/VBN to/TO overthrow/VB)
(CHUNK trying/VBG to/TO put/VB)
(CHUNK needed/VBN to/TO work/VB)
(CHUNK disposed/VBN to/TO exploit/VB)
(CHUNK fail/VB to/TO see/VB)
(CHUNK bound/VBN to/TO fall/VB)
...

(CHUNK hoped/VBN to/TO become/VB)
(CHUNK forced/VBN to/TO restrict/VB)
(CHUNK began/VBD to/TO give/VB)
(CHUNK asked/VBN to/TO become/VB)
(CHUNK trying/VBG to/TO sell/VB)
(CHUNK serves/VBZ to/TO stimulate/VB)
(CHUNK seemed/VBD to/TO lack/VB)
(CHUNK offered/VBD to/TO make/VB)
(CHUNK assembled/VBN to/TO warrant/VB)
(CHUNK returned/VBD to/TO preside/VB)
(CHUNK sought/VBD to/TO prevent/VB)
(CHUNK expect/VB to/TO stand/VB)
(CHUNK compelled/VBN to/TO face/VB)
(CHUNK continue/VB to/TO live/VB)
(CHUNK refused/VBN to/TO move/VB)
(CHUNK refused/VBN to/TO obey/VB)
(CHUNK doomed/VBN to/TO become/VB)
(CHUNK tended/VBD to/TO romanticize/VB)
(CHUNK supposed/VBN to/TO keep/VB)
(CHUNK left/VBN to/TO rest/VB)
(CHUNK wants/VBZ to/TO see/VB)
(CHUNK tended/VBN to/TO dress/VB)
(CHUNK designed/VBN to/TO become/VB)
(CHUNK begins/VBZ to/TO feel/VB)
(CHUNK tends/VBZ to/TO depict/VB)
(CHUNK transferred/VBN to/TO become/VB)
(CHUNK impelled/VBN to/TO make/VB)
(CHUNK seeks/VBZ to/TO make/VB)
...

(CHUNK like/VB to/TO believe/VB)
(CHUNK bother/VB to/TO look/VB)
(CHUNK used/VBD to/TO go/VB)
(CHUNK seemed/VBD to/TO thaw/VB)
(CHUNK came/VBD to/TO give/VB)
(CHUNK wanted/VBD to/TO see/VB)
(CHUNK used/VBD to/TO look/VB)
(CHUNK meant/VBN to/TO help/VB)
(CHUNK like/VB to/TO straighten/VB)
(CHUNK hope/VB to/TO give/VB)
(CHUNK bark/VB to/TO let/VB)
(CHUNK dash/VB to/TO get/VB)
(CHUNK tried/VBD to/TO talk/VB)
(CHUNK decided/VBD to/TO leave/VB)
(CHUNK used/VBD to/TO tell/VB)
(CHUNK continue/VB to/TO reflect/VB)
(CHUNK appear/VB to/TO preach/VB)
(CHUNK intend/VB to/TO let/VB)
(CHUNK need/VB to/TO test/VB)
(CHUNK learned/VBD to/TO meet/VB)
(CHUNK said/VBN to/TO give/VB)
(CHUNK serves/VBZ to/TO reduce/VB)
(CHUNK thought/VBN to/TO provide/VB)
(CHUNK tends/VBZ to/TO give/VB)
(CHUNK wish/VB to/TO deny/VB)
(CHUNK expect/VB to/TO find/VB)
(CHUNK seek/VB to/TO capture/VB)
(CHUNK allowed/VBN to/TO claim/VB)
(CHUNK seeks/VBZ to/TO recapture/VB)
(CHUNK determined/VBN to/TO bulldoze/VB)
...

(CHUNK afford/VB to/TO lose/VB)
(CHUNK continues/VBZ to/TO add/VB)
(CHUNK helping/VBG to/TO pilot/VB)
(CHUNK prefer/VB to/TO speak/VB)
(CHUNK go/VB to/TO discuss/VB)
(CHUNK made/VBN to/TO replace/VB)
(CHUNK continuing/VBG to/TO seek/VB)
(CHUNK seem/VB to/TO add/VB)
(CHUNK seem/VB to/TO fix/VB)
(CHUNK known/VBN to/TO tax/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK continued/VBD to/TO run/VB)
(CHUNK voted/VBD to/TO continue/VB)
(CHUNK entitled/VBN to/TO benefit/VB)
(CHUNK needed/VBN to/TO establish/VB)
(CHUNK designed/VBN to/TO give/VB)
(CHUNK remain/VB to/TO preserve/VB)
(CHUNK gathered/VBD to/TO thank/VB)
(CHUNK continue/VB to/TO protect/VB)
(CHUNK amended/VBN to/TO read/VB)
(CHUNK construed/VBN to/TO alter/VB)
(CHUNK required/VBN to/TO correlate/VB)
(CHUNK amended/VBN to/TO read/VB)
(CHUNK directed/VBN to/TO make/VB)
(CHUNK directed/VBN to/TO establish/VB)
(CHUNK continued/VBD to/TO display/VB)
(CHUNK required/VBN to/TO move/VB)
(CHUNK planned/VBD to/TO furnish/VB)
...

(CHUNK seems/VBZ to/TO follow/VB)
(CHUNK known/VBN to/TO contribute/VB)
(CHUNK fail/VB to/TO elicit/VB)
(CHUNK failed/VBD to/TO evoke/VB)
(CHUNK fail/VB to/TO eat/VB)
(CHUNK trying/VBG to/TO study/VB)
(CHUNK try/VB to/TO study/VB)
(CHUNK wish/VB to/TO show/VB)
(CHUNK need/VB to/TO find/VB)
(CHUNK like/VB to/TO give/VB)
(CHUNK chosen/VBN to/TO give/VB)
(CHUNK need/VB to/TO know/VB)
(CHUNK plans/VBZ to/TO go/VB)
(CHUNK tossed/VBN to/TO decide/VB)
(CHUNK want/VB to/TO know/VB)
(CHUNK want/VB to/TO study/VB)
(CHUNK expect/VB to/TO face/VB)
(CHUNK choose/VB to/TO derive/VB)
(CHUNK seeking/VBG to/TO become/VB)
(CHUNK mobilized/VBN to/TO achieve/VB)
(CHUNK required/VBN to/TO make/VB)
(CHUNK work/VB to/TO realize/VB)
(CHUNK seek/VB to/TO encourage/VB)
(CHUNK prefer/VB to/TO live/VB)
(CHUNK afford/VB to/TO wait/VB)
(CHUNK begin/VB to/TO see/VB)
(CHUNK committed/VBN to/TO move/VB)
(CHUNK continue/VB to/TO satisfy/VB)
(CHUNK begun/VBN to/TO develop/VB)
(CHUNK programming/VBG to/TO go/VB)
...

(CHUNK required/VBN to/TO furnish/VB)
(CHUNK want/VB to/TO provide/VB)
(CHUNK attempt/VB to/TO represent/VB)
(CHUNK hopes/VBZ to/TO encourage/VB)
(CHUNK designed/VBN to/TO help/VB)
(CHUNK appointed/VBN to/TO act/VB)
(CHUNK expected/VBN to/TO vote/VB)
(CHUNK appointed/VBN to/TO study/VB)
(CHUNK tended/VBD to/TO take/VB)
(CHUNK attempted/VBD to/TO act/VB)
(CHUNK attempt/VB to/TO act/VB)
(CHUNK attempt/VB to/TO act/VB)
(CHUNK intend/VB to/TO act/VB)
(CHUNK fail/VB to/TO take/VB)
(CHUNK try/VB to/TO serve/VB)
(CHUNK tended/VBD to/TO use/VB)
(CHUNK found/VBN to/TO behave/VB)
(CHUNK impelled/VBN to/TO make/VB)
(CHUNK attempt/VB to/TO analyze/VB)
(CHUNK designed/VBN to/TO reflect/VB)
(CHUNK deemed/VBN to/TO vary/VB)
(CHUNK held/VBN to/TO constitute/VB)
(CHUNK seem/VB to/TO support/VB)
(CHUNK designed/VBN to/TO cover/VB)
(CHUNK found/VBN to/TO vary/VB)
(CHUNK taken/VBN to/TO rest/VB)
(CHUNK needs/VBZ to/TO know/VB)
(CHUNK attempts/VBZ to/TO stand/VB)
(CHUNK wishing/VBG to/TO know/VB)
...

(CHUNK wanted/VBN to/TO hurt/VB)
(CHUNK bother/VB to/TO think/VB)
(CHUNK delighted/VBN to/TO see/VB)
(CHUNK began/VBD to/TO weep/VB)
(CHUNK began/VBD to/TO move/VB)
(CHUNK tried/VBD to/TO push/VB)
(CHUNK tried/VBD to/TO rescue/VB)
(CHUNK seemed/VBD to/TO hold/VB)
(CHUNK began/VBD to/TO think/VB)
(CHUNK strove/VBD to/TO think/VB)
(CHUNK run/VB to/TO tell/VB)
(CHUNK fail/VB to/TO hear/VB)
(CHUNK dared/VBD to/TO wait/VB)
(CHUNK dared/VBD to/TO pat/VB)
(CHUNK trying/VBG to/TO push/VB)
(CHUNK began/VBD to/TO whirl/VB)
(CHUNK started/VBD to/TO worry/VB)
(CHUNK tried/VBD to/TO push/VB)
(CHUNK wanted/VBD to/TO get/VB)
(CHUNK tryin/VBG to/TO fuck/VB)
(CHUNK tried/VBD to/TO stifle/VB)
(CHUNK seeking/VBG to/TO kill/VB)
(CHUNK failed/VBD to/TO check/VB)
(CHUNK tried/VBD to/TO shut/VB)
(CHUNK refuses/VBZ to/TO believe/VB)
(CHUNK begun/VBN to/TO study/VB)
(CHUNK amazed/VBN to/TO discover/VB)
(CHUNK appear/VB to/TO reject/VB)
(CHUNK trying/VBG to/TO write/VB)
(CHUNK want/VB to/TO weep/VB)
...

(CHUNK returning/VBG to/TO seek/VB)
(CHUNK bother/VB to/TO look/VB)
(CHUNK try/VB to/TO run/VB)
(CHUNK going/VBG to/TO make/VB)
(CHUNK began/VBD to/TO wave/VB)
(CHUNK gone/VBN to/TO get/VB)
(CHUNK Want/VB to/TO try/VB)
(CHUNK going/VBG to/TO call/VB)
(CHUNK want/VB to/TO find/VB)
(CHUNK want/VB to/TO see/VB)
(CHUNK designed/VBN to/TO put/VB)
(CHUNK turned/VBD to/TO see/VB)
(CHUNK forced/VBN to/TO use/VB)
(CHUNK trying/VBG to/TO drag/VB)
(CHUNK start/VB to/TO angle/VB)
(CHUNK tried/VBD to/TO flatten/VB)
(CHUNK managed/VBD to/TO hunch/VB)
(CHUNK brought/VBN to/TO make/VB)
(CHUNK surprised/VBN to/TO find/VB)
(CHUNK started/VBN to/TO back/VB)
(CHUNK want/VB to/TO try/VB)
(CHUNK trying/VBG to/TO catch/VB)
(CHUNK used/VBD to/TO keep/VB)
(CHUNK forget/VB to/TO turn/VB)
(CHUNK promised/VBD to/TO observe/VB)
(CHUNK started/VBD to/TO plod/VB)
(CHUNK tried/VBD to/TO turn/VB)
(CHUNK beginning/VBG to/TO feel/VB)
(CHUNK decided/VBD to/TO indulge/VB)
(CHUNK forgotten/VBN to/TO turn/VB)
...

(CHUNK adjusted/VBN to/TO operate/VB)
(CHUNK like/VB to/TO see/VB)
(CHUNK bent/VBD to/TO observe/VB)
(CHUNK forced/VBN to/TO accompany/VB)
(CHUNK fear/VB to/TO tread/VB)
(CHUNK programed/VBN to/TO compute/VB)
(CHUNK remember/VB to/TO program/VB)
(CHUNK directed/VBN to/TO develop/VB)
(CHUNK schooled/VBN to/TO examine/VB)
(CHUNK appeared/VBD to/TO require/VB)
(CHUNK encouraged/VBN to/TO develop/VB)
(CHUNK remembered/VBN to/TO introduce/VB)
(CHUNK guided/VBN to/TO make/VB)
(CHUNK tried/VBD to/TO run/VB)
(CHUNK tried/VBD to/TO tell/VB)
(CHUNK tried/VBD to/TO ask/VB)
(CHUNK want/VB to/TO ask/VB)
(CHUNK going/VBG to/TO come/VB)
(CHUNK going/VBG to/TO happen/VB)
(CHUNK going/VBG to/TO happen/VB)
(CHUNK going/VBG to/TO take/VB)
(CHUNK inclined/VBN to/TO think/VB)
(CHUNK manage/VB to/TO follow/VB)
(CHUNK wanting/VBG to/TO tell/VB)
(CHUNK tried/VBN to/TO write/VB)
(CHUNK exhausted/VBN to/TO stay/VB)
(CHUNK afford/VB to/TO lose/VB)
(CHUNK afford/VB to/TO pay/VB)
(CHUNK used/VBD to/TO work/VB)
...

(CHUNK beginning/VBG to/TO fold/VB)
(CHUNK wanted/VBD to/TO smoke/VB)
(CHUNK seem/VB to/TO get/VB)
(CHUNK trying/VBG to/TO find/VB)
(CHUNK like/VB to/TO keep/VB)
(CHUNK seem/VB to/TO snap/VB)
(CHUNK like/VB to/TO think/VB)
(CHUNK beginning/VBG to/TO find/VB)
(CHUNK beginning/VBG to/TO look/VB)
(CHUNK going/VBG to/TO last/VB)
(CHUNK going/VBG to/TO prove/VB)
(CHUNK hoped/VBD to/TO die/VB)
(CHUNK gone/VBN to/TO live/VB)
(CHUNK stayed/VBD to/TO get/VB)
(CHUNK turned/VBD to/TO go/VB)
(CHUNK going/VBG to/TO see/VB)
(CHUNK going/VBG to/TO laugh/VB)
(CHUNK tried/VBD to/TO bite/VB)
(CHUNK seem/VB to/TO rise/VB)
(CHUNK come/VBN to/TO see/VB)
(CHUNK got/VBN to/TO know/VB)
(CHUNK seem/VB to/TO take/VB)
(CHUNK beginning/VBG to/TO creep/VB)
(CHUNK seemed/VBN to/TO rain/VB)
(CHUNK like/VB to/TO hear/VB)
(CHUNK come/VBN to/TO make/VB)
(CHUNK started/VBD to/TO move/VB)
(CHUNK bent/VBD to/TO pick/VB)
(CHUNK permitted/VBN to/TO operate/VB)
(CHUNK beginning/VBG to/TO get/VB)
...

(CHUNK tried/VBD to/TO conceal/VB)
(CHUNK came/VBD to/TO know/VB)
(CHUNK refuses/VBZ to/TO continue/VB)
(CHUNK given/VBN to/TO understand/VB)
(CHUNK propose/VB to/TO vent/VB)
(CHUNK proceeded/VBD to/TO mask/VB)
(CHUNK withhold/VB to/TO keep/VB)
(CHUNK begin/VB to/TO wither/VB)
(CHUNK help/VB to/TO intensify/VB)
(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)


## Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

 [ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Table: Three chinking rules applied to the same chunk

|` `|Entire chunk|Middle of a chunk|End of a chunk|
|---|------------|-----------------|--------------|
|Input|[a/DT little/JJ dog/NN]|[a/DT little/JJ dog/NN]|[a/DT little/JJ dog/NN]|
|Operation|Chink "DT JJ NN"|Chink "JJ"|Chink "NN"|
|Pattern|}DT JJ NN{|}JJ{|}NN{|
|Output|a/DT little/JJ dog/NN|[a/DT] little/JJ [dog/NN]|[a/DT little/JJ] dog/NN|
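
The three cases in the table can be reproduced with a few lines of plain Python. This is a simplified sketch: the hypothetical helper `chink` removes tokens by tag membership rather than by a full tag pattern, which is enough to illustrate how a chunk is removed, split, or shortened.

```python
def chink(chunk, chink_tags):
    """Remove tokens whose tag is in chink_tags and return the remaining chunks.

    `chunk` is a list of (word, tag) pairs. The result is a list of smaller
    chunks, one for each maximal run of tokens that was not chinked.
    """
    pieces, current = [], []
    for word, tag in chunk:
        if tag in chink_tags:
            # A chinked token closes the chunk that was being built, if any.
            if current:
                pieces.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        pieces.append(current)
    return pieces

chunk = [('a', 'DT'), ('little', 'JJ'), ('dog', 'NN')]

print(chink(chunk, {'DT', 'JJ', 'NN'}))  # entire chunk removed: []
print(chink(chunk, {'JJ'}))              # split: [[('a', 'DT')], [('dog', 'NN')]]
print(chink(chunk, {'NN'}))              # shortened: [[('a', 'DT'), ('little', 'JJ')]]
```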

In [30]:
grammar = r"""
    NP:
        {<.*>+}       # chunk everything
        }<VBD|IN>+{   # chink sequences of VBD and IN
"""
sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),
           ('dog','NN'),('barked','VBD'),('at','IN'),('the','DT'),('cat','NN')]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


## Representing Chunks: Tags vs. Trees
The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown below:
![tag_repr](http://www.nltk.org/images/chunk-tagrep.png)

Chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly.
![tree_repr](http://www.nltk.org/images/chunk-treerep.png)
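
The IOB scheme is easy to work with directly. The following sketch converts between chunk spans and IOB tags in plain Python; the function names `spans_to_iob` and `iob_to_spans` and the example sentence are illustrative assumptions, not part of NLTK.

```python
def spans_to_iob(tokens, spans, label="NP"):
    """Turn (start, end) chunk spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags

def iob_to_spans(tags):
    """Recover (start, end) chunk spans from a sequence of IOB tags.

    An I- tag that does not follow a B- tag is ignored in this sketch.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-"):
            if start is not None:           # close the chunk that was open
                spans.append((start, i))
                start = None
        if tag.startswith("B-"):
            start = i                       # open a new chunk
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["We", "saw", "the", "yellow", "dog"]
tags = spans_to_iob(tokens, [(0, 1), (2, 5)])
print(list(zip(tokens, tags)))
print(iob_to_spans(tags))  # [(0, 1), (2, 5)]
```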

# Developing and Evaluating Chunkers
## Reading IOB Format
Using the corpus module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP and PP. As we have seen, each sentence is represented using multiple lines, as shown below:
```
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
```

A conversion function chunk.conllstr2tree() builds a tree representation from one of these multi-line strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:

In [31]:
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
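
The multi-line format is simple enough to read by hand, which helps when inspecting data. The sketch below parses such a string into `(word, pos, chunktag)` triples and keeps only selected chunk types, relabelling everything else as `O`. The function `parse_conll` is a hypothetical helper written for illustration; it is not how `conllstr2tree()` is implemented, and it does not build a tree.

```python
def parse_conll(text, chunk_types=None):
    """Parse 'word POS chunktag' lines into a list of triples.

    If chunk_types is given, chunk tags of any other type are replaced by 'O'.
    """
    triples = []
    for line in text.strip().splitlines():
        word, pos, chunktag = line.split()
        if chunktag != "O" and chunk_types is not None:
            chunk_type = chunktag.split("-", 1)[1]   # 'B-NP' -> 'NP'
            if chunk_type not in chunk_types:
                chunktag = "O"
        triples.append((word, pos, chunktag))
    return triples

sample = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
, , O
'''
for triple in parse_conll(sample, chunk_types=['NP']):
    print(triple)
```

With `chunk_types=['NP']`, the `B-VP` tag on `accepted` becomes `O`, while the NP tags are left unchanged.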

In [32]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


In [33]:
print(conll2000.chunked_sents('train.txt',chunk_types=['NP'])[99])

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)


## Simple Evaluation and Baselines

In [4]:
# baseline
from nltk.corpus import conll2000
cp = nltk.RegexpParser('')
test_sents = conll2000.chunked_sents('test.txt',chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%


In [5]:
# naive regular expression chunker
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


In [6]:
# unigram chunker
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                     for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
        
    def parse(self,sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos,chunktag) in tagged_pos_tags]
        conlltags = [(word,pos,chunktag) for ((word,pos),chunktag)
                    in zip(sentence,chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)


In [7]:
test_sents = conll2000.chunked_sents('test.txt',chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt',chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


In [58]:
postags = sorted(set(pos for sent in train_sents 
                     for (word,pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]
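
The mapping printed above is what a unigram model learns: for each part-of-speech tag, the chunk tag it was most often paired with in training. A minimal sketch of that idea in plain Python follows. The toy training data and the helper names are made up for illustration, and returning `None` for unseen tags is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Learn the most frequent chunk tag for each POS tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: counter.most_common(1)[0][0] for pos, counter in counts.items()}

def tag_unigram(model, pos_tags):
    """Assign each POS tag its learned chunk tag (None if the tag was never seen)."""
    return [(pos, model.get(pos)) for pos in pos_tags]

# Toy training data: sentences as lists of (POS tag, chunk tag) pairs.
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("PRP", "B-NP")],
]
model = train_unigram(toy_train)
print(tag_unigram(model, ["DT", "JJ", "NN", "VBD", "PRP", "UH"]))
```

Because `NN` occurs twice as `I-NP` and once as `B-NP` in the toy data, it is always tagged `I-NP`, which is the kind of systematic error that motivates using more context.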


In [8]:
# bigram chunker
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                     for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
        
    def parse(self,sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos,chunktag) in tagged_pos_tags]
        conlltags = [(word,pos,chunktag) for ((word,pos),chunktag)
                    in zip(sentence,chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)


In [9]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%
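
To read these scores, it helps to see how precision, recall, and F-measure relate when chunks are compared as spans. The sketch below is an illustration in plain Python, not NLTK's scoring code: a guessed chunk counts as correct only if the same span appears in the gold standard, and an empty guess set is scored as zero precision, a convention that matches the baseline output above.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def chunk_scores(gold_spans, guessed_spans):
    """Return (precision, recall, f_measure) for two collections of spans."""
    gold, guessed = set(gold_spans), set(guessed_spans)
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)

# One of two guessed chunks is correct; one of three gold chunks is found.
print(chunk_scores([(0, 2), (3, 5), (6, 8)], [(0, 2), (3, 4)]))

# A chunker that proposes nothing scores zero throughout.
print(chunk_scores([(0, 2)], []))

# The unigram chunker's reported precision and recall give its F-measure.
print(round(f_measure(0.799, 0.868), 3))  # 0.832
```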


## Training Classifier-Based Chunkers
Both the regular-expression based chunkers and the n-gram chunkers decide what chunks to create entirely based on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. For example, consider the following two statements:

- Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

- Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, while the corresponding material in the second sentence, the computer monitor, is a single chunk. Clearly, we need to make use of information about the content of the words, in addition to just their part-of-speech tags, if we wish to maximize chunking performance.

In [11]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self,train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word,tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent,i,history)
                train_set.append( (featureset,tag) )
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set)
        
    def tag(self,sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence,i,history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))
    
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                       nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
        
    def parse(self,sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
    
def npchunk_features(sentence, i, history):
    word,pos = sentence[i]
    return {'pos':pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.441
             2          -0.29280        0.836
             3          -0.28717        0.836
             4          -0.28345        0.836
             5          -0.28080        0.836
             6          -0.27882        0.836
             7          -0.27728        0.836
             8          -0.27606        0.836
             9          -0.27505        0.836
            10          -0.27422        0.836
            11          -0.27352        0.836
            12          -0.27291        0.836
            13          -0.27239        0.836
            14          -0.27193        0.836
            15          -0.27153        0.836
            16          -0.27117        0.836
            17          -0.27085        0.836
            18          -0.27056        0.836
            19          -0.27030        0.836
 

The feature extractor above provides only the part-of-speech tag of the current token, so the resulting chunker is very similar to the unigram chunker and performs about the same. Next, add a feature for the previous part-of-speech tag, which makes the chunker closely related to the bigram chunker.

In [14]:
def npchunk_features(sentence,i,history):
    word,pos = sentence[i]
    if i==0:
        prevword,prevpos = "<START>","<START>"
    else:
        prevword,prevpos = sentence[i-1]
    return {'pos':pos,'prevpos':prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.441
             2          -0.24505        0.933
             3          -0.16970        0.932
             4          -0.14526        0.931
             5          -0.13407        0.933
             6          -0.12786        0.935
             7          -0.12394        0.935
             8          -0.12123        0.935
             9          -0.11922        0.937
            10          -0.11765        0.937
            11          -0.11637        0.937
            12          -0.11530        0.937
            13          -0.11439        0.937
            14          -0.11360        0.937
            15          -0.11290        0.937
            16          -0.11228        0.937
            17          -0.11173        0.937
            18          -0.11123        0.937
            19          -0.11078        0.937
 

Now add a feature for the current word.

In [15]:
def npchunk_features(sentence,i,history):
    word, pos = sentence[i]
    if i == 0:
        prevword,prevpos = '<START>','<START>'
    else:
        prevword,prevpos = sentence[i-1]
    return {'pos':pos,'word':word,'prevpos':prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.441
             2          -0.22432        0.942
             3          -0.15620        0.946
             4          -0.13175        0.950
             5          -0.11906        0.952
             6          -0.11108        0.954
             7          -0.10546        0.955
             8          -0.10119        0.956
             9          -0.09781        0.957
            10          -0.09503        0.958
            11          -0.09270        0.958
            12          -0.09071        0.959
            13          -0.08899        0.959
            14          -0.08748        0.959
            15          -0.08614        0.959
            16          -0.08495        0.959
            17          -0.08388        0.959
            18          -0.08292        0.960
            19          -0.08204        0.960
 

Finally, extend the feature extractor to include a variety of additional features, such as lookahead features, paired features, and contextual features.

In [16]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.441
             2          -0.22103        0.946
             3          -0.13900        0.955
             4          -0.11098        0.959
             5          -0.09684        0.962
             6          -0.08811        0.965
             7          -0.08200        0.967
             8          -0.07737        0.969
             9          -0.07366        0.970
            10          -0.07058        0.972
            11          -0.06795        0.973
            12          -0.06567        0.974
            13          -0.06366        0.975
            14          -0.06186        0.976
            15          -0.06024        0.977
            16          -0.05876        0.977
            17          -0.05742        0.978
            18          -0.05618        0.978
            19          -0.05504        0.979
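
To make the tags-since-dt feature concrete, the sketch below applies `tags_since_dt` (copied from the cell above) to a short hand-tagged sentence and lists the feature value at each position. The sentence is an assumption chosen for illustration; no trained model or corpus is involved.

```python
def tags_since_dt(sentence, i):
    """Return the sorted, '+'-joined set of POS tags seen since the last determiner."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()      # a determiner resets the accumulated context
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

# Hand-tagged example sentence (an assumption for illustration).
sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

features = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), feature in zip(sentence, features):
    print('%-8s %-4s %r' % (word, pos, feature))
```

The value is empty immediately after a determiner and grows as tags accumulate, so it gives the classifier a rough cue about how far the current token is from the start of a potential noun phrase.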
 

# Recursion in Linguistic Structure
## Building Nested Structure with Cascaded Chunkers
So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules.

In [17]:
grammer = r"""
    NP: {<DT|JJ|NN.*>+}         # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>}              # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>}          # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammer)
sentence = [('Mary','NN'),('saw','VBD'),('the','DT'),('cat','NN'),
           ('sit','VB'), ('on','IN'),('the','DT'),('mat','NN')]
print(cp.parse(sentence))

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


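The chunker above misses the VP headed by saw. It has the same shortcoming on a sentence with deeper nesting, where saw and its subject are again left outside any VP or CLAUSE chunk: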

In [18]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process.

In [19]:
cp = nltk.RegexpParser(grammer,loop=2)
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


## Trees
A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that they are standardly drawn upside-down):

![tree](http://www.nltk.org/book/tree_images/ch07-tree-3.png)

In [20]:
tree1 = nltk.Tree('NP',['Alice'])
print(tree1)

tree2 = nltk.Tree('NP',['the','rabbit'])
print(tree2)

tree3 = nltk.Tree('VP',['chased',tree2])
tree4 = nltk.Tree('S',[tree1,tree3])
print(tree4)

(NP Alice)
(NP the rabbit)
(S (NP Alice) (VP chased (NP the rabbit)))


### Attributes of tree objects

In [23]:
print(tree4[1])

print(tree4[1].label())

print(tree4.leaves())

print(tree4[1][1][1])

(VP chased (NP the rabbit))
VP
['Alice', 'chased', 'the', 'rabbit']
rabbit


The bracketed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful. It opens a new window containing a graphical representation of the tree. The tree display window allows you to zoom in and out, to collapse and expand subtrees, and to print the graphical representation to a PostScript file (for inclusion in a document).

In [24]:
tree3.draw()

In [27]:
## Tree Traversal
def traverse(t):
    try:
        t.label()
    except AttributeError:
        print(t, end=' ')
    else:
        # t.label() succeeded, so t is a tree node rather than a leaf
        print('(',t.label(),end=' ')
        for child in t:
            traverse(child)
        print(')',end=' ')
        
t = nltk.Tree('S', [nltk.Tree('NP',['Alice']), 
                    nltk.Tree('VP', ['chased', 
                                     nltk.Tree('NP', ['the', 'rabbit'])])])
traverse(t)

( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) 
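
The same recursive pattern supports other computations over a tree. The sketch below models a tree as a `(label, children)` tuple with plain strings as leaves; this stand-in is an assumption made so that the example runs without NLTK, and the helper names `leaves`, `height`, and `bracketed` are hypothetical (they only loosely mirror the `leaves()` method shown earlier).

```python
def leaves(t):
    """Collect leaf strings from left to right."""
    if isinstance(t, str):
        return [t]
    _label, children = t
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

def height(t):
    """A leaf has height 1; a node is one more than its tallest child."""
    if isinstance(t, str):
        return 1
    _label, children = t
    return 1 + max((height(child) for child in children), default=0)

def bracketed(t):
    """Render the tree in bracketed notation."""
    if isinstance(t, str):
        return t
    label, children = t
    return '(%s %s)' % (label, ' '.join(bracketed(child) for child in children))

# Stand-in for the S tree built above: (label, [children]).
t = ('S', [('NP', ['Alice']),
           ('VP', ['chased', ('NP', ['the', 'rabbit'])])])
print(leaves(t))
print(height(t))
print(bracketed(t))
```

Each helper follows the shape of `traverse`: handle the leaf case first, then recurse over the children and combine their results.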

# Named Entity Recognition
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. The table below lists some of the more commonly used types of NEs. These should be self-explanatory, except for "Facility": human-made artifacts in the domains of architecture and civil engineering; and "GPE": geo-political entities such as city, state/province, and country.

*Table: Commonly Used Types of Named Entity*

|NE Type|Examples|
|-------|--------|
|ORGANIZATION|Georgia-Pacific Corp., WHO|
|PERSON|Eddy Bonte, President Obama|
|LOCATION|Murray River, Mount Everest|
|DATE|June, 2008-06-29|
|TIME|two fifty a m, 1:30 p.m.|
|MONEY|175 million Canadian Dollars, GBP 10.40|
|PERCENT|twenty pct, 18.75 %|
|FACILITY|Washington Monument, Stonehenge|
|GPE|South East Asia, Midlothian|

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type. While named entity recognition is frequently a prelude to identifying relations in Information Extraction, it can also contribute to other tasks. For example, in Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user's question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer.

### Difficulties of NER
- Locations may not be covered by a gazetteer (the lookup method); coverage is even harder to achieve for people and organizations.
- Many named entity terms are ambiguous. May and North are likely to be parts of named entities for DATE and LOCATION, respectively, but could both be part of a PERSON; conversely, Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION. A term like Yankee will be an ordinary modifier in some contexts, but will be marked as an entity of type ORGANIZATION in the phrase Yankee infielders.
- Further challenges are posed by multi-word names like Stanford University.
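
The coverage and ambiguity problems can be seen with a naive gazetteer lookup. The sketch below uses a tiny hypothetical gazetteer and an invented sentence (neither comes from NLTK or from a real resource) and labels each token by dictionary lookup alone.

```python
# Hypothetical gazetteer: token -> entity type. Real gazetteers are far larger.
GAZETTEER = {'May': 'DATE', 'North': 'LOCATION', 'Boston': 'LOCATION'}

def gazetteer_tag(tokens):
    """Label each token by lookup; 'O' marks tokens outside the gazetteer."""
    return [(token, GAZETTEER.get(token, 'O')) for token in tokens]

# Invented sentence: 'Julia North' and 'Dana May' are meant as person names.
tokens = 'Julia North met Dana May in Boston in May'.split()
tagged = gazetteer_tag(tokens)
for token, label in tagged:
    print('%-8s %s' % (token, label))
```

Lookup labels North as LOCATION and the first May as DATE even though both are surnames here, and it misses Julia and Dana entirely, which suggests that context is needed in addition to lookup.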

In [30]:
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent,binary=True))
print('--'*10)
print(nltk.ne_chunk(sent))

(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN
  (NE Medicine/NNP)
  ./.)
--------------------
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WD

# Relation Extraction
Once named entities have been identified in a text, we then want to extract the relations that exist between them. As indicated earlier, we will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y. We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for. 
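
As a rough sketch of this two-step idea, the code below works on a hand-built sequence in which entities are `(type, name)` tuples and everything else is plain text. The data layout, the helper name `candidate_triples`, and the simplified pattern `IN_SIMPLE` are assumptions for illustration; NLTK's own `extract_rels` works on parsed corpus documents instead.

```python
import re

def candidate_triples(items):
    """Yield (X, alpha, Y) for each pair of consecutive entities,
    where alpha is the text that intervenes between them."""
    prev_entity = None
    filler = []
    for item in items:
        if isinstance(item, tuple):          # an entity: (type, name)
            if prev_entity is not None:
                yield prev_entity, ' '.join(filler), item
            prev_entity, filler = item, []
        else:                                # plain text between entities
            filler.append(item)

# Hand-built input (an assumption, loosely modelled on this section's examples).
items = [('ORG', 'WHYY'), 'in', ('LOC', 'Philadelphia'), 'and',
         ('ORG', 'Open Text'), ', based in', ('LOC', 'Waterloo')]

# Simplified pattern: the filler contains the word "in".
IN_SIMPLE = re.compile(r'.*\bin\b')

relations = [(x[1], alpha, y[1])
             for x, alpha, y in candidate_triples(items)
             if x[0] == 'ORG' and y[0] == 'LOC' and IN_SIMPLE.match(alpha)]
print(relations)
```

The first step produces every adjacent entity pair, including the (LOC, ORG) pair joined by "and"; the type check and the regular expression then discard the pairs that do not express the target relation.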

In [31]:
# (?!\b.+ing) is a negative lookahead assertion that lets us disregard
# strings such as "success in supervising the transition of",
# where "in" is followed by a gerund.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG','LOC',doc,
                                    corpus='ieer',pattern=IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


Searching for the keyword "in" works reasonably well, though it will also retrieve false positives such as [ORG: House Transportation Committee] , secured the most money in the [LOC: New York]; there is unlikely to be a simple string-based method of excluding filler strings such as this.
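
The behaviour of the pattern, including the false positive, can be checked directly with the `re` module. The filler strings below are taken from the examples discussed in this section; applying the pattern with `match`, anchored at the start of the filler, is an assumption about how the filter is used.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    'in',
    ', based in',
    'success in supervising the transition of',   # "in" followed by a gerund
    ', secured the most money in the',            # false positive
]

# Record whether each filler string would be accepted by the pattern.
results = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, matched in results.items():
    print('%-45r %s' % (filler, matched))
```

The lookahead rejects the gerund case, but nothing in the string itself distinguishes the committee example from a genuine location relation, so it is still accepted.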

In [32]:
from nltk.corpus import conll2002
vnv = """
 (
 is/V|    # 3rd sing present of the verb zijn ('be')
 was/V|   # past form of zijn
 werd/V|  # past form of the verb worden ('become')
 wordt/V  # 3rd sing present of worden
 )
 .*       # followed by anything
 van/Prep # followed by van ('of')
 """
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc,
                                   corpus='conll2002', pattern=VAN):
        # clause() prints the relation in clausal form; the binary
        # relation symbol is given by the relsym parameter.
        print(nltk.sem.clause(r, relsym="VAN"))
        # print(nltk.sem.rtuple(r, lcon=True, rcon=True))

VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')


# Summary
- Information extraction systems search large bodies of unrestricted text for specific types of entities and relations, and use them to populate well-organized databases. These databases can then be used to find answers for specific questions.
- The typical architecture for an information extraction system begins by segmenting, tokenizing, and part-of-speech tagging the text. The resulting data is then searched for specific types of entity. Finally, the information extraction system looks at entities that are mentioned near one another in the text, and tries to determine whether specific relationships hold between those entities.
- Entity recognition is often performed using chunkers, which segment multi-token sequences, and label them with the appropriate entity type. Common entity types include ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity).
- Chunkers can be constructed using rule-based systems, such as the RegexpParser class provided by NLTK; or using machine learning techniques, such as the ConsecutiveNPChunker presented in this chapter. In either case, part-of-speech tags are often a very important feature when searching for chunks.
- Although chunkers are specialized to create relatively flat data structures, where no two chunks are allowed to overlap, they can be cascaded together to build nested structures.
- Relation extraction can be performed using either rule-based systems which typically look for specific patterns in the text that connect entities and the intervening words; or using machine-learning systems which typically attempt to learn such patterns automatically from a training corpus.