## Information Extraction using NLTK

[NLTK](http://www.nltk.org/) is a Python platform for natural language processing and information extraction. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.

The associated book, [Natural Language Processing with Python](http://www.nltk.org/book/), is freely available online and is a useful resource both for NLP concepts and for practical examples of how to use the NLTK package.

To get started, install NLTK:

    pip install --user -U nltk

NLTK's models and corpora are downloaded separately. Calling `nltk.download()` with no arguments opens a window for selecting which data to download. To save space, choose only 'book' (everything used in the NLTK book), which is itself a download of a few hundred MB. The first cell below instead fetches two individual packages by name.

In [1]:
###################################################################################
# Purpose: NER (Named Entity Recognition) Information Extraction using NLTK       #
# input:   Sample sentence and NLTK corpus                                        #
# output:  POS and NER Tree                                                       #
###################################################################################

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\amgupta\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\amgupta\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [2]:
# Test it is working

sentence = """At eight o'clock on Thursday morning, Arthur didn't feel very good."""
#sentence = "one hundred and twenty five"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tagged


[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 (',', ','),
 ('Arthur', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]

In [3]:
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 ',',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

The `word_tokenize` and `pos_tag` calls above tokenize the string and tag each token with its part of speech. The tags come from the Penn Treebank tagset (other corpora use different tagsets); a listing is available here:
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
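
Because each tagged token is a plain `(word, tag)` tuple, the result can be filtered with ordinary Python. The following is a minimal sketch rather than part of the original notebook: it copies the tagged output above into a literal so that it runs without NLTK, and the helper name `words_with_tag_prefix` is hypothetical.

```python
# Tagged tokens copied from the pos_tag output above, so this runs standalone.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN'), (',', ','), ('Arthur', 'NNP'),
          ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'),
          ('good', 'JJ'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


proper_nouns = words_with_tag_prefix(tagged, 'NNP')  # proper nouns only
nouns = words_with_tag_prefix(tagged, 'NN')          # every tag beginning with NN
verbs = words_with_tag_prefix(tagged, 'VB')          # every tag beginning with VB

print(proper_nouns)
print(nouns)
print(verbs)
```

Matching on a tag prefix is a convenience that relies on Penn Treebank naming, where related tags share their first letters; it would need adjusting for a different tagset.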

The following cell extracts named entities from the tagged tokens, grouping each entity into a labelled subtree such as `PERSON`.

In [4]:
# Chunk the tagged tokens into a tree whose subtrees are named entities
entities = nltk.chunk.ne_chunk(tagged)
repr(entities)

'Tree(\'S\', [(\'At\', \'IN\'), (\'eight\', \'CD\'), ("o\'clock", \'NN\'), (\'on\', \'IN\'), (\'Thursday\', \'NNP\'), (\'morning\', \'NN\'), (\',\', \',\'), Tree(\'PERSON\', [(\'Arthur\', \'NNP\')]), (\'did\', \'VBD\'), ("n\'t", \'RB\'), (\'feel\', \'VB\'), (\'very\', \'RB\'), (\'good\', \'JJ\'), (\'.\', \'.\')])'
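
The chunker returns an `nltk.Tree` in which named entities are labelled subtrees and all other tokens stay as `(word, tag)` tuples. As a rough sketch of how the entities might be pulled out, the code below rebuilds an abbreviated copy of the tree above by hand (so no model download is needed) and walks its top-level children. The helper name `named_entities` is hypothetical, not an NLTK function.

```python
from nltk.tree import Tree

# Hand-built, abbreviated copy of the tree printed above, so this sketch
# does not depend on the downloaded chunker model.
entities = Tree('S', [
    ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), (',', ','),
    Tree('PERSON', [('Arthur', 'NNP')]),
    ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
])


def named_entities(tree):
    """Return (label, text) pairs for each entity subtree (hypothetical helper)."""
    found = []
    for node in tree:
        # Entity chunks are Tree instances; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = ' '.join(word for word, tag in node.leaves())
            found.append((node.label(), text))
    return found


print(named_entities(entities))
```

The same function should work on the real `entities` object from the cell above, since it only relies on `Tree.label()` and `Tree.leaves()`.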

In [6]:
# Draw the chunk tree (opens a separate window)
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
entities.draw()

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\amgupta\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


### Named Entity Recognition

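As a somewhat more elaborate example, the following cell reads data from a file, splits it into sentences, tags each sentence, and runs NER on the first one. It doesn't do a very good job on this article: the file is raw HTML, so the markup is tokenized and tagged as if it were words, and only 'France' in 'Tour de France' is labelled (as a GPE). In general, though, it seems to work quite well.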

In [10]:
with open("news1.html", "r") as myfile:
    data = myfile.read()   
    
sentences = nltk.sent_tokenize(data)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
entities0 = nltk.ne_chunk(sentences[0])
print(entities0)
entities0.draw()

(S
  </JJ
  html/NN
  >/NNP
  </NNP
  body/NN
  >/NNP
  </NNP
  p/NN
  >/VBD
  The/DT
  1989/CD
  Tour/NNP
  de/FW
  (GPE France/NNP)
  was/VBD
  the/DT
  76th/CD
  edition/NN
  of/IN
  one/CD
  of/IN
  cycling/NN
  's/POS
  (FACILITY Grand/NNP Tours/NNP)
  ./.)
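
The `<`, `html`, and `>` tokens in the output above come from the HTML markup in the input file. One possible remedy, sketched here with the standard library only, is to strip the tags before tokenizing. The `TextExtractor` class, the `strip_html` function, and the `sample` string are illustrative assumptions; `sample` merely imitates the opening of `news1.html` as reconstructed from the output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return ' '.join(''.join(parser.parts).split())


# Stand-in for the opening of news1.html, reconstructed from the output above.
sample = ("<html><body><p>The 1989 Tour de France was the 76th edition "
          "of one of cycling's Grand Tours.</p></body></html>")
clean = strip_html(sample)
print(clean)
```

Passing the cleaned text rather than the raw file contents to `nltk.sent_tokenize` should keep markup out of the token stream. Whether the entity labels improve as a result has not been checked here, and this simple extractor would also keep the contents of any `<script>` or `<style>` elements.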


In [11]:
# Draw a full parse tree from the treebank sample for a richer visual
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\amgupta\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!

### Relation Extraction

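The second key task is to extract relations between entities. The following code snippet finds relations between organizations and locations in IEER, one of the datasets that ships with NLTK. The pattern requires the text between the two entities to contain the word 'in', and uses a negative lookahead to reject cases where 'in' is followed by a gerund. See [chapter 7 of the NLTK book](http://www.nltk.org/book/ch07.html) for more details on the regular expression.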

In [21]:
import re

# Match text containing the word "in", unless a gerund follows it
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
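
To see what the `IN` pattern accepts, it can be tried on its own against a few candidate strings. This is a small sketch using only the `re` module; it assumes the pattern is matched against the text between the two entities, as the quoted middle part of each output line suggests. The first three strings are taken from the output above, and the last two are illustrative counterexamples.

```python
import re

# Same pattern as in the cell above.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    'in',
    ', the research group in',
    ', a self-described business incubator based in',
    'success in supervising the transition of',  # "in" followed by a gerund
    'and',                                       # no "in" at all
]

# Map each candidate string to whether the pattern matches from its start.
results = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, matched in results.items():
    print(repr(filler), matched)
```

The first three strings match and the last two do not. In the gerund case the negative lookahead `(?!\b.+ing)` fails because 'supervising' follows 'in'.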
