#**A05 Information Extraction**

In [None]:
import nltk
nltk.download('all')

##**Problem 1**

###Problem statement

Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.

In [None]:
from nltk import pos_tag, word_tokenize

grammar1 = r"""
  NP: {<DT>?<CD>?<JJ.*>*<NN.*>+}
"""
cp1 = nltk.RegexpParser(grammar1)

test1 = pos_tag(word_tokenize("There are many researchers on Artificial Intelligence in the lab."))
test2 = pos_tag(word_tokenize("The package is going to be shipped in two weeks."))
test3 = pos_tag(word_tokenize("Both new positions proposed to me caught my attention seriously."))
print(cp1.parse(test1))
print(cp1.parse(test2))
print(cp1.parse(test3))

```
(S
  There/EX
  are/VBP
  (NP many/JJ researchers/NNS)
  on/IN
  (NP Artificial/JJ Intelligence/NN)
  in/IN
  (NP the/DT lab/NN)
  ./.)
```

```
(S
  (NP The/DT package/NN)
  is/VBZ
  going/VBG
  to/TO
  be/VB
  shipped/VBN
  in/IN
  (NP two/CD weeks/NNS)
  ./.)
```

```  
(S
  (NP Both/DT new/JJ positions/NNS)
  proposed/VBN
  to/TO
  me/PRP
  caught/VB
  my/PRP$
  (NP attention/NN)
  seriously/RB
  ./.)
```
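
The grammar above accepts any noun as the head (`<NN.*>+`), so it also chunks singular phrases such as "the/DT lab/NN". As a minimal sketch of a pattern restricted to plural heads, the following standard-library code encodes the tags as a string and scans it with an ordinary regular expression. The helper `plural_np_chunks` and the pattern `PLURAL_NP` are illustrative names, not part of NLTK, and the tagged sentence is written by hand so that no tagger is needed.

```python
import re

# Assumed pattern: optional DT, optional CD, any adjectives, any nouns,
# and finally a plural noun (NNS or NNPS) as the head.
PLURAL_NP = re.compile(
    r"(?:<DT>)?(?:<CD>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)*<NNP?S>"
)


def plural_np_chunks(tagged):
    """Return the (word, tag) runs whose tag sequence matches PLURAL_NP."""
    # Encode the tags as one string, e.g. "<JJ><NNS>", so a regex can scan them.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in PLURAL_NP.finditer(tag_string):
        # Each token contributes exactly one "<", so counting "<" yields token indices.
        first = tag_string[:match.start()].count("<")
        size = match.group().count("<")
        chunks.append(tagged[first:first + size])
    return chunks


# Hand-tagged example sentence (Penn Treebank tags).
sentence = [
    ("Both", "DT"), ("new", "JJ"), ("positions", "NNS"),
    ("opened", "VBD"), ("in", "IN"), ("the", "DT"), ("lab", "NN"),
    ("two", "CD"), ("weeks", "NNS"), ("ago", "RB"), (".", "."),
]

for chunk in plural_np_chunks(sentence):
    print(" ".join(f"{word}/{tag}" for word, tag in chunk))
```

Here "the/DT lab/NN" is left unchunked because its head is singular. In NLTK's own tag-pattern syntax the same restriction could presumably be written as `{<DT>?<CD>?<JJ.*>*<NN.*>*<NNS|NNPS>}`, though that variant is offered only as an assumption.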

##**Problem 2**

###Problem statement

Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

In [None]:
from nltk import pos_tag, word_tokenize

grammar2 = r"""
  NP:
      {<DT>?<CD>?<JJ.*>*<NN.*>+<VBG><NN.*>+} # VBG between nouns: noun, gerund, noun
      {<DT>?<CD>?<JJ.*>*<VBG><NN.*>+}        # VBG before the nouns
      {<DT>?<CD>?<JJ.*>*<NN.*>+}             # no VBG
"""

cp2 = nltk.RegexpParser(grammar2)
test1 = pos_tag(word_tokenize("the receiving end"))
test2 = pos_tag(word_tokenize("I heard that the assistant managing editor has fired."))
test3 = pos_tag(word_tokenize("Nowadays virtual currency is one of the booming Internet businesses."))
print(cp2.parse(test1))
print(cp2.parse(test2))
print(cp2.parse(test3))

```
(S (NP the/DT receiving/VBG end/NN))
```

```
(S
  I/PRP
  heard/VBD
  that/IN
  (NP the/DT assistant/NN managing/VBG editor/NN)
  has/VBZ
  fired/VBN
  ./.)
```

```
(S
  Nowadays/RB
  (NP virtual/JJ currency/NN)
  is/VBZ
  one/CD
  of/IN
  (NP the/DT booming/JJ Internet/NNP businesses/NNS)
  ./.)
```

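The word "booming" in test 3 looks like a gerund (VBG), but `pos_tag` labels it as an adjective (JJ). That decision is made by the tagger, not by grammar2. grammar2 still chunks the phrase correctly because `<JJ.*>*` accepts the JJ tag, and its second rule would accept a VBG in the same position.

The other test inputs are also chunked correctly, both with and without a VBG.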
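
To see how the three rules react to the tag that "booming" receives, the sketch below re-implements them as ordinary regular expressions over a tag string and feeds in hand-tagged input. It uses only the standard library; `np_chunks` and `RULES` are illustrative names, and joining the rules with `|` only approximates the way `RegexpParser` applies its rules one after another.

```python
import re

PREFIX = r"(?:<DT>)?(?:<CD>)?(?:<JJ[^>]*>)*"
NOUNS = r"(?:<NN[^>]*>)+"

# Same order as grammar2: the most specific rule comes first.
RULES = [
    PREFIX + NOUNS + "<VBG>" + NOUNS,  # noun(s), gerund, noun(s)
    PREFIX + "<VBG>" + NOUNS,          # gerund, noun(s)
    PREFIX + NOUNS,                    # no gerund
]
NP = re.compile("|".join(f"(?:{rule})" for rule in RULES))


def np_chunks(tagged):
    """Return each matched noun phrase as a space-joined string of words."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP.finditer(tag_string):
        # One "<" per token, so counting "<" converts offsets to token indices.
        first = tag_string[:match.start()].count("<")
        size = match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[first:first + size]))
    return chunks


# The same phrase with "booming" tagged as an adjective and as a gerund.
as_adjective = [("the", "DT"), ("booming", "JJ"), ("Internet", "NNP"), ("businesses", "NNS")]
as_gerund = [("the", "DT"), ("booming", "VBG"), ("Internet", "NNP"), ("businesses", "NNS")]

print(np_chunks(as_adjective))
print(np_chunks(as_gerund))
```

Both inputs come back as a single chunk: the JJ reading is covered by the third rule and the VBG reading by the second, so the chunk boundaries do not depend on which tag the tagger picks.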

##**Problem 3**

###Problem statement

Carry out the following evaluation tasks for any of the chunkers you have developed earlier. (Note that most chunking corpora contain some internal inconsistencies, such that any reasonable rule-based approach will produce errors.)

###a. Evaluate your chunker on 100 sentences from a chunked corpus, and report the precision, recall and F-measure.

In [None]:
# Evaluate the custom chunker.

import nltk
from nltk.corpus import conll2000

# Custom chunker: an NP may start with a determiner, a possessive pronoun, or a possessive marker
grammar3 = r"""
  NP:
      {<DT|PRP\$|POS>?<CD>?<JJ.*>*<NN.*>+<VBG><NN.*>+}
      {<DT|PRP\$|POS>?<CD>?<JJ.*>*<VBG><NN.*>+}
      {<DT|PRP\$|POS>?<CD>?<JJ.*>*<NN.*|PRP>+}
"""

cp3 = nltk.RegexpParser(grammar3)

# Score against the first 100 NP-chunked test sentences of CoNLL-2000.
# Note: recent NLTK releases deprecate evaluate() in favor of accuracy().
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp3.evaluate(test_sents[:100]))

ChunkParse score:
    IOB Accuracy:  84.9%%
    Precision:     82.6%%
    Recall:        76.0%%
    F-Measure:     79.2%%
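
The scores above follow the usual definitions: precision is the share of guessed chunks that also appear in the gold standard, recall is the share of gold chunks that were found, and the F-measure is their harmonic mean. A minimal sketch, assuming chunks are represented as `(start, end)` token spans; `chunk_scores` and `f_measure` are illustrative helpers, not the NLTK implementation.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def chunk_scores(gold, guessed):
    """Return (precision, recall, F-measure) for two collections of chunk spans."""
    gold, guessed = set(gold), set(guessed)
    correct = gold & guessed  # chunks with exactly matching boundaries
    precision = len(correct) / len(guessed) if guessed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy example: the second guessed chunk ends one token too early.
gold_spans = {(0, 3), (3, 6), (7, 8)}
guessed_spans = {(0, 3), (3, 5), (7, 8)}
print(chunk_scores(gold_spans, guessed_spans))

# Consistency check against the reported precision and recall.
print(round(f_measure(0.826, 0.760), 3))
```

Plugging the reported precision (82.6%) and recall (76.0%) into `f_measure` gives about 0.792, which agrees with the F-Measure line in the output.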


###b. Use the chunkscore.missed() and chunkscore.incorrect() methods to identify the errors made by your chunker. Discuss.

In [None]:
from nltk.chunk.util import ChunkScore

chunkscore = ChunkScore()
for sent in test_sents[:4]:
    # Strip the gold chunks, then re-chunk the flat tagged sentence with cp3
    unchunked_sent = sent.flatten()
    cp3_sent = cp3.parse(unchunked_sent)
    print(sent)
    print(cp3_sent)

    # Scores accumulate, so each printout covers every sentence seen so far
    chunkscore.score(sent, cp3_sent)
    print(chunkscore)
    print("Missed " + str(chunkscore.missed()))
    print("Incorrect " + str(chunkscore.incorrect()))

(S
  (NP Rockwell/NNP International/NNP Corp./NNP)
  (NP 's/POS Tulsa/NNP unit/NN)
  said/VBD
  (NP it/PRP)
  signed/VBD
  (NP a/DT tentative/JJ agreement/NN)
  extending/VBG
  (NP its/PRP$ contract/NN)
  with/IN
  (NP Boeing/NNP Co./NNP)
  to/TO
  provide/VB
  (NP structural/JJ parts/NNS)
  for/IN
  (NP Boeing/NNP)
  (NP 's/POS 747/CD jetliners/NNS)
  ./.)
(S
  (NP Rockwell/NNP International/NNP Corp./NNP)
  (NP 's/POS Tulsa/NNP unit/NN)
  said/VBD
  (NP it/PRP)
  signed/VBD
  (NP a/DT tentative/JJ agreement/NN)
  extending/VBG
  (NP its/PRP$ contract/NN)
  with/IN
  (NP Boeing/NNP Co./NNP)
  to/TO
  provide/VB
  (NP structural/JJ parts/NNS)
  for/IN
  (NP Boeing/NNP)
  (NP 's/POS 747/CD jetliners/NNS)
  ./.)
ChunkParse score:
    IOB Accuracy: 100.0%%
    Precision:    100.0%%
    Recall:       100.0%%
    F-Measure:    100.0%%
Missed []
Incorrect []
(S
  (NP Rockwell/NNP)
  said/VBD
  (NP the/DT agreement/NN)
  calls/VBZ
  for/IN
  (NP it/PRP)
  to/TO
  supply/VB
  (NP 200/CD additi

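The custom chunker scores between roughly 76% and 85% on every metric over the first 100 test sentences, so it works well in most cases.

However, the output of `chunkscore.missed()` and `chunkscore.incorrect()` shows that it does not chunk complex noun phrases correctly.

Specific examples are noun phrases in which a VBN appears together with a VBG, and noun phrases that join several nouns with commas and a conjunction.

To improve chunking further, the grammar will need a way to handle complex noun phrases like these.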
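
One way to make this kind of error analysis systematic is to count which tag sequences occur most often among the missed chunks. A minimal sketch, assuming each chunk is available as a list of `(word, tag)` pairs; the sample chunks below are invented for illustration and are not taken from the CoNLL-2000 output, and `tag_sequence_counts` is a hypothetical helper.

```python
from collections import Counter


def tag_sequence_counts(chunks):
    """Count how often each tag sequence (e.g. "DT VBN VBG NN") occurs."""
    return Counter(" ".join(tag for _, tag in chunk) for chunk in chunks)


# Invented examples of the two error types discussed above.
sample_missed = [
    [("the", "DT"), ("proposed", "VBN"), ("operating", "VBG"), ("budget", "NN")],
    [("stocks", "NNS"), (",", ","), ("bonds", "NNS"), ("and", "CC"), ("futures", "NNS")],
    [("the", "DT"), ("revised", "VBN"), ("operating", "VBG"), ("plan", "NN")],
]

for sequence, count in tag_sequence_counts(sample_missed).most_common():
    print(count, sequence)
```

Tag sequences that appear repeatedly are natural candidates for new rules in the grammar. With NLTK, the leaves of each chunk returned by `chunkscore.missed()` could presumably be passed in, but that wiring is an assumption here.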

### c. Compare the performance of your chunker to the baseline chunker discussed in the evaluation section of this chapter.

In [None]:
# Evaluate the baseline chunker.

grammar_baseline = r"NP: {<[CDJNP].*>+}"
cp_bc = nltk.RegexpParser(grammar_baseline)
print(cp_bc.evaluate(test_sents[:100]))

ChunkParse score:
    IOB Accuracy:  83.9%%
    Precision:     62.1%%
    Recall:        58.2%%
    F-Measure:     60.1%%
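
The baseline pattern `<[CDJNP].*>+` chunks any run of tags that begin with C, D, J, N, or P. The sketch below checks a handful of Penn Treebank tags against that character class with the standard `re` module. It assumes each tag must match the pattern in full, and `BASELINE_TAG` is an illustrative name.

```python
import re

# The part of the baseline pattern that a single tag has to satisfy.
BASELINE_TAG = re.compile(r"[CDJNP].*")

tags = ["CC", "CD", "DT", "JJ", "NN", "NNS", "NNP", "PRP", "PRP$", "POS",
        "IN", "VBG", "VBD", "RB", "TO"]

matched = [tag for tag in tags if BASELINE_TAG.fullmatch(tag)]
rejected = [tag for tag in tags if not BASELINE_TAG.fullmatch(tag)]

print("inside a baseline chunk: ", matched)
print("outside a baseline chunk:", rejected)
```

Note that CC (coordinating conjunction) also matches, so the baseline can pull words such as "and" into a chunk, which is one plausible source of its lower precision.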


**Baseline chunker performance**
```
    IOB Accuracy:  83.9%%
    Precision:     62.1%%
    Recall:        58.2%%
    F-Measure:     60.1%%
```

**Custom chunker performance**
```
    IOB Accuracy:  84.9%%
    Precision:     82.6%%
    Recall:        76.0%%
    F-Measure:     79.2%%
```

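The two summaries above show that the custom chunker outperforms the baseline chunker on all four metrics.

The gain in IOB Accuracy is small (1.0 percentage point), but Precision, Recall, and F-Measure each improve by roughly 18 to 21 percentage points over the baseline chunker.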
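
The gaps can be read off directly from the two score summaries. A small worked calculation, using only the figures reported above (the dictionary names are illustrative):

```python
# Scores in percent, copied from the two evaluation outputs.
baseline = {"IOB Accuracy": 83.9, "Precision": 62.1, "Recall": 58.2, "F-Measure": 60.1}
custom = {"IOB Accuracy": 84.9, "Precision": 82.6, "Recall": 76.0, "F-Measure": 79.2}

# Gain of the custom chunker over the baseline, in percentage points.
gains = {metric: round(custom[metric] - baseline[metric], 1) for metric in baseline}

for metric, gain in gains.items():
    print(f"{metric:<13} +{gain:.1f} points")
```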