# Extracting Information from Text

*Chapter 07, NLTK: https://www.nltk.org/book/ch07.html*

In [1]:
%matplotlib inline

In [137]:
from pprint import pprint

import nltk
import os
import pandas as pd
import re
import textwrap


### NLTK Data

#### Installing MegaM

In addition to the downloaded data, [MegaM](http://legacydirs.umiacs.umd.edu/~hal/megam/index.html) is required for some classifier-based chunking.

1. Download the source from http://legacydirs.umiacs.umd.edu/~hal/megam/index.html.
2. Make the following changes to the Makefile (as needed):
    * Update `WITHCLIBS` to point to your local caml lib dir. Invoking `ocamlc -where` may help.
    * Change `WITHSTR` to use `-lcamlstr` instead of `-lstr`.
3. Build the optimized binary by invoking `make opt` (or `make` for the slow version).
4. Do one of:
    * Ensure that the directory containing the `megam.opt` binary is on your `PATH`.
    * Set the environment variable `MEGAM` to the location of `megam.opt`.
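
To confirm that step 4 worked, a small check like the one below can help. It is only a sketch: `locate_megam` is a hypothetical helper (not part of NLTK), and the lookup order (environment variable first, then `PATH`) is an assumption rather than a description of what NLTK itself does.

```python
import os
import shutil


def locate_megam(environ=None, which=shutil.which):
    """Return a candidate path to the MegaM binary, or None if none is found.

    Hypothetical helper for illustration; not part of NLTK.
    """
    environ = os.environ if environ is None else environ

    # Second option from the list: an explicit MEGAM environment variable.
    explicit = environ.get('MEGAM')
    if explicit:
        return explicit

    # First option from the list: the binary is reachable through PATH.
    for name in ('megam.opt', 'megam'):
        found = which(name)
        if found:
            return found

    return None


print(locate_megam() or 'MegaM not found; check PATH or set MEGAM')
```

If a path is found, it can be handed to NLTK explicitly, for example through `config_megam` in `nltk.classify.megam`.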

#### Downloading NLTK Data

Use the NLTK downloader to fetch any necessary datasets and corpora:

In [154]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('conll2000')

[nltk_data] Downloading package punkt to /Users/mcwehner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mcwehner/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /Users/mcwehner/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/mcwehner/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!


True

## Information Extraction

### Preprocessing

In [3]:
def ie_sentence_segmentation(document):
    return nltk.sent_tokenize(document)

In [4]:
def ie_tokenization(sentences):
    return [nltk.word_tokenize(sent) for sent in sentences]

In [5]:
def ie_pos_tagging(sentences):
    return [nltk.pos_tag(sent) for sent in sentences]

In [6]:
def ie_preprocess(document):
    sentences = ie_sentence_segmentation(document)
    sentences = ie_tokenization(sentences)
    sentences = ie_pos_tagging(sentences)
    
    return sentences

#### Sandbox

In [152]:
for sentence in ie_preprocess('He has an ex who robbed a bank'):
    pprint(sentence)

[('He', 'PRP'),
 ('has', 'VBZ'),
 ('an', 'DT'),
 ('ex', 'NN'),
 ('who', 'WP'),
 ('robbed', 'VBD'),
 ('a', 'DT'),
 ('bank', 'NN')]


## Chunking

### Noun Phrase Chunking

`NP: {<DT>?<JJ>*<NN>}`: an NP chunk should be formed whenever the chunker finds an optional determiner `DT` followed by any number of adjectives `JJ` and then a noun `NN`.
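
To make the rule concrete, here is a rough standard-library sketch of the same idea: the POS tags are joined into one string such as `<DT><JJ><NN>` and an ordinary regular expression is run over it. The helper `np_chunks` is hypothetical and is not how `nltk.RegexpParser` is implemented; it only illustrates what the tag pattern matches.

```python
import re

# The tag pattern <DT>?<JJ>*<NN> written as a plain regular expression
# over a string of angle-bracketed tags.
NP_PATTERN = re.compile(r'(?:<DT>)?(?:<JJ>)*<NN>')


def np_chunks(tagged_sentence):
    """Return the words of every NP chunk found in a list of (word, tag) pairs."""
    tags = ['<%s>' % tag for _, tag in tagged_sentence]

    # Map each character offset at a tag boundary back to a token index.
    index_of = {}
    position = 0
    for i, tag in enumerate(tags):
        index_of[position] = i
        position += len(tag)
    index_of[position] = len(tags)

    chunks = []
    for match in NP_PATTERN.finditer(''.join(tags)):
        start, end = index_of[match.start()], index_of[match.end()]
        chunks.append([word for word, _ in tagged_sentence[start:end]])
    return chunks


sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
print(np_chunks(sentence))
# [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

Because the pattern ends in a literal `<NN>`, plural nouns tagged `NNS` are not chunked, which is why the grammar used further down widens the noun slot to `<NN.*>`.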

In [110]:
def chunk(grammar, documents):
    # Build the parser once; it does not change between sentences.
    chunk_parser = nltk.RegexpParser(grammar)

    for document in documents:
        print(document, '\n')

        for sentence in ie_preprocess(document):
            result = chunk_parser.parse(sentence)

            # Indent the tree so it stands out from the document text.
            print(textwrap.indent(str(result), '\t'), '\n')

In [68]:
grammar = r'''
    NP: {<DT|PRP\$>?<JJ.*|RBR|POS>*<CD|NN.*>+}
'''

chunk(grammar, [
    'the little yellow dog barked at the cat',
    'another sharp dive',
    'trade figures',
    'any new policy measures',
    'earlier stages',
    'Panamanian dictator Manuel Noriega',
    'his Mansion House speech',
    'the price cutting',
    '3% to 4%',
    'more than 10%',
    'the fastest developing trends',
    "man's skill",
    
    'the patient arrived earlier than was needed',
    
    "The market for system-management software for Digital's hardware is fragmented enough that a giant such as Computer Associates should do well there.",
])

the little yellow dog barked at the cat 

	(S
	  (NP the/DT little/JJ yellow/JJ dog/NN)
	  barked/VBD
	  at/IN
	  (NP the/DT cat/NN)) 

another sharp dive 

	(S (NP another/DT sharp/JJ dive/NN)) 

trade figures 

	(S (NP trade/NN figures/NNS)) 

any new policy measures 

	(S (NP any/DT new/JJ policy/NN measures/NNS)) 

earlier stages 

	(S (NP earlier/RBR stages/NNS)) 

Panamanian dictator Manuel Noriega 

	(S (NP Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP)) 

his Mansion House speech 

	(S (NP his/PRP$ Mansion/NNP House/NNP speech/NN)) 

the price cutting 

	(S (NP the/DT price/NN cutting/NN)) 

3% to 4% 

	(S (NP 3/CD %/NN) to/TO (NP 4/CD %/NN)) 

more than 10% 

	(S more/JJR than/IN (NP 10/CD %/NN)) 

the fastest developing trends 

	(S (NP the/DT fastest/JJS developing/NN trends/NNS)) 

man's skill 

	(S (NP man/NN) (NP 's/POS skill/NN)) 

the patient arrived earlier than was needed 

	(S
	  (NP the/DT patient/NN)
	  arrived/VBD
	  earlier/JJR
	  than/IN
	  was/VBD
	  needed/

In [111]:
grammar = r'''
    NP: {<DT|PRP\$>?<JJ.*>*<NN>} # determiner/possessive, adjectives, and noun
        {<NNP>+}                 # sequences of proper nouns
'''

chunk(grammar, [
    'Rapunzel let down her long golden hair',
])

Rapunzel let down her long golden hair 

	(S
	  (NP Rapunzel/NNP)
	  let/VBD
	  down/RP
	  (NP her/PRP$ long/JJ golden/JJ hair/NN)) 



### Exploring Text Corpora

#### `find_chunks(<grammar>, corpus=nltk.corpus.brown, limit=5)`

```python
>>> find_chunks('CHUNK: {<V.*> <TO> <V.*>}')
```

```
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
...
(CHUNK wanted/VBD to/TO wait/VB)
```


In [108]:
def find_chunks(grammar, corpus=nltk.corpus.brown, limit=5):
    """Print CHUNK subtrees found in the corpus, stopping after `limit` matches (no cap if `limit` is None)."""
    cp = nltk.RegexpParser(grammar)

    for sent in corpus.tagged_sents():
        tree = cp.parse(sent)

        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                print(subtree)

                if limit is not None:
                    limit -= 1
                    if limit <= 0:
                        return

In [109]:
find_chunks('CHUNK: {<V.*> <TO> <V.*>}')

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)


In [106]:
find_chunks('CHUNK: {<N(?!IL).*>{4,}}')

(CHUNK Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)
(CHUNK Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)
(CHUNK Georgia's/NP$ automobile/NN title/NN law/NN)
(CHUNK State/NN-TL Welfare/NN-TL Department's/NN$-TL handling/NN)
(CHUNK Fulton/NP-TL Tax/NN-TL Commissioner's/NN$-TL Office/NN-TL)


### Chinking

In [112]:
grammar = r'''
    NP: {<.*>+}     # chunk everything
        }<VBD|IN>+{ # chink sequences of VBD and IN
'''

chunk(grammar, [
    'the little yellow dog barked at the cat',
])

the little yellow dog barked at the cat 

	(S
	  (NP the/DT little/JJ yellow/JJ dog/NN)
	  barked/VBD
	  at/IN
	  (NP the/DT cat/NN)) 
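
The chinking grammar above first chunks the whole sentence and then cuts out runs of `VBD` and `IN`. The same effect can be sketched with the standard library alone; `chink` below is a hypothetical helper for illustration, not NLTK's implementation.

```python
from itertools import groupby


def chink(tagged_sentence, chink_tags=frozenset({'VBD', 'IN'})):
    """Split a tagged sentence into chunks by removing runs of chink tags."""
    chunks = []
    # groupby yields maximal runs of tokens that are (or are not) chinks.
    for is_chink, group in groupby(tagged_sentence,
                                   key=lambda pair: pair[1] in chink_tags):
        if not is_chink:
            chunks.append(list(group))
    return chunks


sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
            ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
for np in chink(sentence):
    print(np)
```

Everything that is not explicitly excluded ends up inside a chunk, so chinking works best when the unwanted tags are easier to list than the wanted ones.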



## Developing and Evaluating Chunkers

In [126]:
test_sents  = nltk.corpus.conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=['NP'])

def evaluate_chunker(cp):
    print(cp.evaluate(test_sents))

### Baseline

The nonzero IOB tag accuracy shows that more than a third of the words are tagged `O`, i.e. they fall outside any NP chunk. The empty grammar finds no chunks, however, so precision, recall, and F-measure are all zero.

In [130]:
evaluate_chunker(
    nltk.RegexpParser(''),
)

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
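
The gap between IOB accuracy and the chunk-level scores can be reproduced on a toy example. The sketch below uses hypothetical helpers (`chunk_spans`, `score`) and made-up tags; it counts a chunk as correct only when its exact span matches, which is an assumption about the scoring and may differ in detail from NLTK's scorer.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end) NP spans encoded by a list of IOB tags."""
    spans, start = set(), None
    for i, tag in enumerate(iob_tags):
        # Anything other than I-NP closes the chunk that is currently open.
        if tag != 'I-NP' and start is not None:
            spans.add((start, i))
            start = None
        # B-NP opens a chunk; a stray I-NP is treated leniently as an opener.
        if tag == 'B-NP' or (tag == 'I-NP' and start is None):
            start = i
    if start is not None:
        spans.add((start, len(iob_tags)))
    return spans


def score(gold, guess):
    """Return (tag accuracy, chunk precision, chunk recall)."""
    accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)
    gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
    correct = len(gold_spans & guess_spans)
    precision = correct / len(guess_spans) if guess_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return accuracy, precision, recall


gold = ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP']
baseline = ['O'] * len(gold)  # what a chunker with no rules produces
print(score(gold, baseline))  # (0.25, 0.0, 0.0)
```

Tagging everything `O` still gets the two genuinely outside tokens right, so accuracy is nonzero even though no chunk is ever proposed.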


### Naive Regexp

In [129]:
evaluate_chunker(
    nltk.RegexpParser(r'NP: {<[CDJNP].*>+}'),
)

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


### Unigram and Bigram

In [125]:
class TaggedChunker(nltk.ChunkParserI):
    def __init__(self, train_sents, tagger):
        train_data  = [[(t,c) for _,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = tagger(train_data)
    
    def parse(self, sentence):
        pos_tags            = [pos for (_, pos) in sentence]
        iob_tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags           = [chunktag for (_, chunktag) in iob_tagged_pos_tags]
        conlltags           = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        
        return nltk.chunk.conlltags2tree(conlltags)
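
`TaggedChunker` trains an n-gram tagger on (POS tag, chunk tag) pairs, so with a unigram tagger each POS tag is simply mapped to the chunk tag it most often carried in training. A minimal standard-library sketch of that idea follows; `train_unigram` and `tag_pos` are hypothetical names, the training data is made up, and falling back to `O` for unseen POS tags is an assumption.

```python
from collections import Counter, defaultdict


def train_unigram(train_data):
    """Map each POS tag to its most frequent chunk tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in train_data:
        for pos, chunktag in sentence:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def tag_pos(model, pos_tags, default='O'):
    """Assign a chunk tag to each POS tag; unseen tags get `default` (an assumption)."""
    return [(pos, model.get(pos, default)) for pos in pos_tags]


# Toy training data in the same shape TaggedChunker builds: (POS tag, chunk tag).
train_data = [
    [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O')],
    [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP')],
    [('NN', 'B-NP'), ('VBD', 'O')],
]
model = train_unigram(train_data)
print(tag_pos(model, ['DT', 'JJ', 'NN', 'VBD', 'IN']))
```

A bigram tagger additionally conditions on the previous chunk tag, which is one plausible reason for the modest improvement in the scores reported for `nltk.BigramTagger`.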

In [131]:
evaluate_chunker(
    TaggedChunker(train_sents, nltk.UnigramTagger),
)

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


In [132]:
evaluate_chunker(
    TaggedChunker(train_sents, nltk.BigramTagger),
)

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


### Classifier-Based

#### Tagger

In [133]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sentences):
        train_set = []
        
        for tagged_sentence in train_sentences:
            history           = []
            untagged_sentence = nltk.tag.untag(tagged_sentence)
            
            for i, (_, tag) in enumerate(tagged_sentence):
                featureset = npchunk_features(untagged_sentence, i, history)
                
                train_set.append( (featureset, tag) )
                history.append(tag)

        self.classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        
        for i, _ in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag        = self.classifier.classify(featureset)
            
            history.append(tag)
        
        # Return a list rather than a one-shot zip iterator so callers can reuse it.
        return list(zip(sentence, history))

#### Chunker

During training, `ConsecutiveNPChunker` maps the chunk trees in the training corpus into tag sequences; in the `parse()` method, it converts the tag sequence provided by the tagger back into a chunk tree:

In [149]:
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sentences):
        tagged_sentences = [
            [((w,t),c) for (w,t,c) in nltk.chunk.tree2conlltags(sentence)]
            for sentence in train_sentences
        ]
        
        self.tagger = ConsecutiveNPChunkTagger(tagged_sentences)
    
    def parse(self, sentence):
        tagged_sentence = self.tagger.tag(sentence)
        conlltags       = [(w,t,c) for ((w,t),c) in tagged_sentence]
        
        return nltk.chunk.conlltags2tree(conlltags)
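
The round trip between chunk structures and tag sequences that `ConsecutiveNPChunker` relies on can be illustrated without NLTK. In the sketch below a chunked sentence is a plain list whose items are either `(word, pos)` tokens or lists of such tokens (one list per NP chunk). This representation and the helper names are assumptions made for illustration; the class itself uses `tree2conlltags` and `conlltags2tree` on `Tree` objects.

```python
def chunks_to_iob(chunked):
    """Flatten a chunked sentence into (word, pos, IOB tag) triples."""
    triples = []
    for item in chunked:
        if isinstance(item, list):  # an NP chunk
            for j, (word, pos) in enumerate(item):
                triples.append((word, pos, 'B-NP' if j == 0 else 'I-NP'))
        else:                       # a token outside any chunk
            word, pos = item
            triples.append((word, pos, 'O'))
    return triples


def iob_to_chunks(triples):
    """Rebuild the chunked sentence from (word, pos, IOB tag) triples."""
    chunked = []
    for word, pos, tag in triples:
        if tag == 'B-NP':
            chunked.append([(word, pos)])
        elif tag == 'I-NP' and chunked and isinstance(chunked[-1], list):
            chunked[-1].append((word, pos))
        else:
            # 'O', or an I-NP with no open chunk, stays outside.
            chunked.append((word, pos))
    return chunked


chunked = [[('the', 'DT'), ('dog', 'NN')], ('barked', 'VBD'), ('at', 'IN'),
           [('the', 'DT'), ('cat', 'NN')]]
triples = chunks_to_iob(chunked)
print(triples)
print(iob_to_chunks(triples) == chunked)  # True
```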

#### Feature Extractor

In [150]:
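def npchunk_features(sentence, i, history):
    # Minimal feature set: only the POS tag of the current token is used;
    # `history` (the chunk tags assigned so far) is accepted but ignored.
    _, pos = sentence[i]
    
    return { 'pos': pos }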

#### Evaluation

In [151]:
evaluate_chunker(
    ConsecutiveNPChunker(train_sents),
)

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.7%%
    F-Measure:     83.2%%
