# <center>Book: Steven Bird, Ewan Klein, Edward Loper, 2009. **Natural Language Processing (NLP) with Python**, O'Reilly.</center> 

### Chapter 7 - Extracting Information from Text

##### 2. 
☼ Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.

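*This grammar is tailored to the three examples in this exercise and will likely fail on general text: it requires exactly one `CD` or `JJ` modifier, and `<NN.>` matches any three-letter noun tag (`NNS`, but also the singular `NNP`), not only plurals.*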

In [2]:
import nltk
grammar = r"""
    NP: {<DT>?<CD|JJ><NN.>}
"""
cp = nltk.RegexpParser(grammar)

In [3]:
# The three tagged phrases from the exercise statement.
nps = [[("many", "JJ"), ("researchers", "NNS")],
       [("two", "CD"), ("weeks", "NNS")],
       [("both", "DT"), ("new", "JJ"), ("positions", "NNS")]]

for n in nps:
    print(cp.parse(n))

(S (NP many/JJ researchers/NNS))
(S (NP two/CD weeks/NNS))
(S (NP both/DT new/JJ positions/NNS))
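
The exercise asks for a generalization of the singular noun-phrase pattern, so one possible looser variant is sketched below. It is an assumption, not the book's reference answer: the determiner is optional, any number of numerals or adjectives (`JJ`, `JJR`, `JJS`) may follow, and the head must be tagged `NNS`. The names `plural_grammar`, `plural_cp`, and the test phrases are made up for illustration.

```python
import nltk

# Assumed generalization: optional determiner, any number of numerals or
# adjectives (JJ, JJR, JJS), then a plural common noun as the head.
plural_grammar = r"""
    NP: {<DT>?<CD|JJ.*>*<NNS>}
"""
plural_cp = nltk.RegexpParser(plural_grammar)

# Hypothetical test phrases, not taken from the exercise.
examples = [
    [("several", "JJ"), ("larger", "JJR"), ("companies", "NNS")],
    [("researchers", "NNS")],
    [("the", "DT"), ("company", "NN")],  # singular head: should stay unchunked
]
for tagged in examples:
    print(plural_cp.parse(tagged))
```

The first two phrases should come back as a single `NP` chunk, while the singular phrase is left unchunked. Proper plural nouns (`NNPS`) are deliberately not covered by this sketch.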


##### 3.

☼ Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker `nltk.RegexpParser`. Discuss any tag sequences that are difficult to chunk reliably.

*Inspect the first 20 sentences in the corpus:*

In [4]:
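from nltk.corpus import conll2000
# Load the corpus view once, then print the first 20 sentences with only VP chunks marked.
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['VP'])
for i, sent in enumerate(train_sents[:20]):
    print(i, sent)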

0 (S
  Confidence/NN
  in/IN
  the/DT
  pound/NN
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  another/DT
  sharp/JJ
  dive/NN
  if/IN
  trade/NN
  figures/NNS
  for/IN
  September/NNP
  ,/,
  due/JJ
  for/IN
  release/NN
  tomorrow/NN
  ,/,
  (VP fail/VB to/TO show/VB)
  a/DT
  substantial/JJ
  improvement/NN
  from/IN
  July/NNP
  and/CC
  August/NNP
  's/POS
  near-record/JJ
  deficits/NNS
  ./.)
1 (S
  Chancellor/NNP
  of/IN
  the/DT
  Exchequer/NNP
  Nigel/NNP
  Lawson/NNP
  's/POS
  restated/VBN
  commitment/NN
  to/TO
  a/DT
  firm/NN
  monetary/JJ
  policy/NN
  (VP has/VBZ helped/VBN to/TO prevent/VB)
  a/DT
  freefall/NN
  in/IN
  sterling/NN
  over/IN
  the/DT
  past/JJ
  week/NN
  ./.)
2 (S
  But/CC
  analysts/NNS
  (VP reckon/VBP)
  underlying/VBG
  support/NN
  for/IN
  sterling/NN
  (VP has/VBZ been/VBN eroded/VBN)
  by/IN
  the/DT
  chancellor/NN
  's/POS
  failure/NN
  (VP to/TO announce/VB)
  any/DT
  new/JJ
  policy/NN
  measures/NNS
  in/IN
  his/PRP$
  Mansio
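
Reading trees by eye only goes so far, so a frequency count of the tag sequences inside each chunk can make the patterns easier to spot. The sketch below is one way to do that; the helper `chunk_tag_sequences` and the hand-built `sample` sentences are hypothetical stand-ins so the example runs without the corpus.

```python
from collections import Counter
from nltk.tree import Tree

def chunk_tag_sequences(chunked_sents, label="VP"):
    """Count the POS tag sequence of every chunk carrying the given label."""
    counts = Counter()
    for sent in chunked_sents:
        for chunk in sent.subtrees(lambda t: t.label() == label):
            counts[" ".join(tag for _word, tag in chunk.leaves())] += 1
    return counts

# Hand-built sentences standing in for corpus data (illustrative only).
sample = [
    Tree("S", [("analysts", "NNS"), Tree("VP", [("reckon", "VBP")])]),
    Tree("S", [("failure", "NN"), Tree("VP", [("to", "TO"), ("announce", "VB")])]),
    Tree("S", [("plans", "NNS"), Tree("VP", [("to", "TO"), ("show", "VB")])]),
]
print(chunk_tag_sequences(sample).most_common())
```

On the real data, the same function could be called with `conll2000.chunked_sents('train.txt', chunk_types=['VP'])` in place of `sample`.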

In [5]:
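# Baseline: chunk any run of tags that start with V, R, M or T (verbs, adverbs, modals, "to").
grammar = r"VP: {<[VRMT].*>+}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['VP'])
# Note: recent NLTK releases deprecate evaluate() in favour of accuracy().
print(cp.evaluate(test_sents))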

ChunkParse score:
    IOB Accuracy:  94.3%%
    Precision:     64.2%%
    Recall:        80.4%%
    F-Measure:     71.4%%
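
To see which tags the baseline pattern `<[VRMT].*>` lets through, the sketch below approximates how a single `<...>` element is matched against one tag, using only the standard library. It assumes that `.` inside a tag pattern stands for any character that may appear in a tag; the helper `tag_matches` and the tag list are illustrative and are not part of NLTK.

```python
import re

def tag_matches(tag_pattern, tag):
    """Approximate the match of one <...> tag-pattern element against one tag."""
    # Assumption: '.' means "any character that is legal inside a tag".
    regex = tag_pattern.replace(".", r"[^{}<>]")
    return re.fullmatch(regex, tag) is not None

# A selection of Penn Treebank tags (not the full tagset).
penn_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD", "TO",
             "RB", "RBR", "RBS", "RP", "NN", "IN", "JJ", "WRB"]
admitted = [t for t in penn_tags if tag_matches("[VRMT].*", t)]
print(admitted)
```

Besides verbs, modals and `TO`, the pattern also admits every adverb tag and the particle tag `RP`, which may help explain why this grammar trades precision for recall.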


In [6]:
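# Spell out the VP: optional verb, adverbs, optional modal, optional verb, optional "to", final verb.
grammar = r"VP: {<VB.>?<RB>*<MD>?<VB.>?<TO>?<VB.>}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['VP'])
print(cp.evaluate(test_sents))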

ChunkParse score:
    IOB Accuracy:  92.3%%
    Precision:     74.4%%
    Recall:        66.8%%
    F-Measure:     70.4%%


In [7]:
# As In [6], but <RB.> (RBR, RBS) replaces <RB>, so plain RB adverbs no longer match.
grammar = r"VP: {<VB.>?<RB.>*<MD>?<VB.>?<TO>?<VB.>}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['VP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.4%%
    Precision:     72.7%%
    Recall:        66.9%%
    F-Measure:     69.7%%


In [8]:
# As In [6], plus an optional modal and adverbs between <TO> and the final verb.
grammar = r"VP: {<VB.>?<RB>*<MD>?<VB.>?<TO>?<MD>?<RB>*<VB.>}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['VP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.3%%
    Precision:     74.6%%
    Recall:        66.9%%
    F-Measure:     70.5%%
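
One tag sequence that the grammars in In [6] to In [8] handle unreliably is an infinitive such as `to/TO take/VB`: `<VB.>` requires exactly one character after `VB`, so it matches `VBD`, `VBG`, `VBN`, `VBP` and `VBZ` but not the base-form tag `VB`. The sketch below shows the effect on the first VP of training sentence 0. The `relaxed` grammar is a hypothetical variant using `<VB.*>`; its scores on the test set are not measured here.

```python
import nltk

# First gold VP chunk from sentence 0 of the training data.
tagged = [("is", "VBZ"), ("widely", "RB"), ("expected", "VBN"),
          ("to", "TO"), ("take", "VB")]

# Grammar from In [6]: <VB.> needs exactly one character after "VB".
strict = nltk.RegexpParser(r"VP: {<VB.>?<RB>*<MD>?<VB.>?<TO>?<VB.>}")
# Hypothetical variant: <VB.*> also accepts the bare base-form tag VB.
relaxed = nltk.RegexpParser(r"VP: {<VB.*>?<RB>*<MD>?<VB.*>?<TO>?<VB.*>}")

print(strict.parse(tagged))   # chunk stops before "to take"
print(relaxed.parse(tagged))  # chunk covers all five tokens
```

Because the gold annotation treats all five tokens as one VP, the strict grammar produces a chunk with the wrong boundaries here, which counts against both precision and recall.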
