# Developing and Evaluating Chunkers

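## Reading IOB Format and the CoNLL2000 Chunking Corpus

The conversion function [chunk.conllstr2tree()](https://www.nltk.org/_modules/nltk/chunk/util.html#conllstr2tree) builds a tree representation from a string in IOB format. It also lets us select any subset of the three chunk types provided by the corpus: NP, VP, and PP.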

In [1]:
import nltk

text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
"""

nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

![ch07-tree-2.png](resources/ch07-tree-2.png)
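
As a rough illustration of what the IOB tags encode, the sketch below groups IOB-tagged lines into chunks using plain Python rather than NLTK. The helper name `iob_to_chunks` is hypothetical and not part of NLTK; it only assumes the three-column `word POS chunk-tag` layout used above.

```python
def iob_to_chunks(iob_text):
    """Group IOB-tagged lines into (chunk_type, [words]) pairs.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    chunks = []
    current = None  # the (chunk_type, words) pair being built, if any
    for line in iob_text.strip().splitlines():
        word, pos, tag = line.split()
        prefix, _, ctype = tag.partition("-")  # "B-NP" -> ("B", "-", "NP")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != ctype)
        )
        if starts_new:
            current = (ctype, [word])
            chunks.append(current)
        elif prefix == "I":
            current[1].append(word)  # continue the open chunk
        else:
            current = None  # an O tag closes any open chunk
    return chunks


sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
"""

for ctype, words in iob_to_chunks(sample):
    print(ctype, " ".join(words))
```

A `B-` tag always opens a chunk, an `I-` tag extends the chunk that is currently open, and `O` marks a token outside every chunk.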

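NLTK's corpus module contains a large amount of chunked text. The CoNLL2000 corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions and annotated with part-of-speech tags and chunk tags in IOB format: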

In [2]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


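As you can see, the CoNLL2000 corpus contains three chunk types: NP chunks such as a cup, VP chunks such as told, and PP chunks such as of. Since we are only interested in NP chunks for now, we can select them with the chunk_types argument: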

In [3]:
test_sent = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
print(type(test_sent))
print(test_sent)

<class 'nltk.tree.Tree'>
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
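
One way to picture the effect of `chunk_types` at the level of IOB tags is shown below: tags of any chunk type that was not selected are treated as `O`. This is a plain-Python sketch under that assumption, with a hypothetical helper name; it is not how the NLTK corpus reader is implemented.

```python
def keep_chunk_types(iob_tags, chunk_types):
    """Map IOB tags of unselected chunk types to 'O' (illustrative helper)."""
    kept = []
    for tag in iob_tags:
        _, _, ctype = tag.partition("-")  # "B-PP" -> "PP", "O" -> ""
        kept.append(tag if ctype in chunk_types else "O")
    return kept


# IOB tags for "Over a cup of coffee", following the tree printed above.
all_types = ["B-PP", "B-NP", "I-NP", "B-PP", "B-NP"]
np_only = keep_chunk_types(all_types, ["NP"])
print(np_only)
```

The PP chunks around Over and of disappear, which matches the second tree, where those words sit directly under S.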


## Simple Evaluation and Baselines

We start by establishing a baseline for a chunker that creates no chunks:

In [4]:
cp = nltk.RegexpParser('')
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.parse(test_sent))
print(cp.evaluate(test_sents))

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%


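The IOB tag accuracy of 43.4% indicates that more than a third of the words are tagged O, i.e. they do not occur in an NP chunk. However, since our chunker did not find any chunks, its precision, recall, and F-measure are all zero. Now let's try a naive regular-expression chunker that looks for tags beginning with letters characteristic of noun-phrase tags (such as CD, DT, and JJ):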

In [5]:
grammar = r'NP: {<[CDJNP].*>+}'
cp = nltk.RegexpParser(grammar)
print(cp.parse(test_sent))
print(cp.evaluate(test_sents))

(S
  Over/IN
  (NP (NP a/DT cup/NN))
  of/IN
  (NP (NP coffee/NN))
  ,/,
  (NP (NP Mr./NNP Stone/NNP))
  told/VBD
  (NP (NP his/PRP$ story/NN))
  ./.)
ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%
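
To make the difference between IOB accuracy and the chunk-level scores concrete, here is a small sketch that computes all four numbers for a single sentence. It assumes NP chunks only, counts a guessed chunk as correct only when its span matches a gold chunk exactly, and uses hypothetical helper names; NLTK's own scoring may differ in detail.

```python
def chunk_spans(tags):
    """Return the set of (start, end) NP spans encoded by a list of IOB tags."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final chunk
        if tag != "I-NP" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B-NP" or (tag == "I-NP" and start is None):
            start = i
    return spans


def chunk_scores(gold, guess):
    """Return (IOB accuracy, precision, recall, F-measure) for one sentence."""
    gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
    correct = len(gold_spans & guess_spans)
    precision = correct / len(guess_spans) if guess_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    accuracy = sum(g == h for g, h in zip(gold, guess)) / len(gold)
    return accuracy, precision, recall, f_measure


# "Over a cup of coffee , Mr. Stone told his story ."
gold = ["O", "B-NP", "I-NP", "O", "B-NP", "O",
        "B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]
baseline = ["O"] * len(gold)  # a chunker that creates no chunks
guess = list(gold)
guess[8] = "I-NP"             # wrongly pulls "told" into the preceding chunk

print(chunk_scores(gold, baseline))
print(chunk_scores(gold, guess))
```

The all-`O` baseline still gets 5 of 12 tags right but scores zero on precision, recall, and F-measure, while a single wrong tag in `guess` costs one whole chunk.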


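This approach achieves decent results, but we can improve on it with a more data-driven approach. Here we define a UnigramChunker class that uses a unigram tagger to label sentences with chunk tags. The class implements the [nltk.ChunkParserI](https://www.nltk.org/_modules/nltk/chunk/api.html#ChunkParserI) interface and defines two methods: a constructor, which is called when we build a new UnigramChunker, and a parse method, which is used to chunk new sentences.

The constructor expects a list of training sentences, each in the form of a chunk tree. It first converts the chunk trees to IOB tags using the tree2conlltags function, and then trains a unigram chunk tagger on the part-of-speech tags.

The parse method takes a tagged sentence as input. It first extracts the part-of-speech tags, then uses the tagger trained in the constructor to label them with IOB chunk tags. Next it combines the chunk tags with the original sentence to produce conlltags. Finally it uses conlltags2tree to convert the result into a chunk tree.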

In [6]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                         for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    
    def parse(self, sentence):
        pos_tags = [pos for (_, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) 
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


This chunker does reasonably well, achieving an overall F-measure of 83.2%. Let's look at which chunk tag the unigram tagger assigns to each part-of-speech tag:

In [7]:
postags = sorted(set(pos for sent in train_sents
                    for (_, pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))

[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]


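Most punctuation marks fall outside NP chunks, with the exception of the currency symbols # and \$. Determiners (DT) and possessives (PRP\$ and WP\$) occur at the beginning of NP chunks, while the noun types (NN, NNP, NNPS, NNS) mostly occur inside NP chunks.

With a small change to the unigram chunker we can build a bigram chunker, which performs slightly better: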
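
The table above is essentially a lookup from each part-of-speech tag to its most frequent chunk tag. The sketch below builds such a lookup from a tiny invented training set, purely to illustrate the idea; `train_unigram` and `toy_train` are hypothetical and the data is not taken from CoNLL2000.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each POS tag to the chunk tag it most often carries."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


# Invented (POS tag, chunk tag) sequences, for illustration only.
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("PRP$", "B-NP"), ("NN", "I-NP")],
]

model = train_unigram(toy_train)
print(sorted(model.items()))
```

Because NN is seen three times as `I-NP` and only once as `B-NP`, the lookup always answers `I-NP` for NN, even at the start of a sentence. That limitation is what looking at the previous tag tries to address.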

In [8]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                         for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
    
    def parse(self, sentence):
        pos_tags = [pos for (_, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) 
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%
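
To give a feel for what the bigram model adds, here is a plain-Python sketch in which the prediction depends on the previous chunk tag as well as the current part-of-speech tag. The helper names and the toy data are hypothetical, and the sketch has no backoff: an unseen context yields `None`, and so does everything after it in the sentence.

```python
from collections import Counter, defaultdict


def train_bigram(tagged_sents):
    """Map (previous chunk tag, POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<START>"
        for pos, chunktag in sent:
            counts[(prev, pos)][chunktag] += 1
            prev = chunktag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_bigram(model, pos_tags):
    """Tag left to right, feeding each prediction into the next context."""
    prev, out = "<START>", []
    for pos in pos_tags:
        prev = model.get((prev, pos))  # None if this context was never seen
        out.append(prev)
    return out


# Invented (POS tag, chunk tag) sequences, for illustration only.
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("PRP$", "B-NP"), ("NN", "I-NP")],
]

bigram_model = train_bigram(toy_train)
print(tag_bigram(bigram_model, ["NN", "VBD", "DT", "NN"]))
print(tag_bigram(bigram_model, ["JJ", "NN"]))
```

In this toy setting the sentence-initial NN is tagged `B-NP`, whereas NN after a determiner is tagged `I-NP`, a distinction a per-tag lookup cannot make.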


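## Training Classifier-Based Chunkers

Both the regular-expression chunker and the n-gram chunkers decide which chunks to create entirely on the basis of part-of-speech tags. Sometimes, however, part-of-speech tags are not enough to determine how a sentence should be chunked. For example: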

a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

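These two sentences have the same part-of-speech tags, yet they are chunked differently. In the first sentence, the farmer and rice are separate chunks, whereas the corresponding material in the second sentence, my computer monitor, is a single chunk. So, to maximize chunking performance, we need to use information about the content of the words in addition to their part-of-speech tags.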
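
A quick check makes the point: once the words are dropped, the two sentences are indistinguishable, so any chunker that looks only at part-of-speech tags has to chunk them identically. The helper `parse_tagged` below is a hypothetical convenience for splitting `word/TAG` tokens.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


sent_a = parse_tagged("Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.")
sent_b = parse_tagged("Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.")

pos_a = [pos for (_, pos) in sent_a]
pos_b = [pos for (_, pos) in sent_b]

print(pos_a == pos_b)                                  # identical tag sequences
print([w for (w, _) in sent_a] == [w for (w, _) in sent_b])  # different words
```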

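One way to incorporate information about word content is to use a **classifier-based** tagger to chunk the sentence. The example below consists of two classes. The first is similar to the ConsecutivePosTagger from Section 6.1; the differences are that it calls a different feature extractor (npchunk_features) and uses MaxentClassifier in place of NaiveBayesClassifier. The second is a wrapper around the tagger class that turns it into a chunker.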

In [9]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set, trace=0)
    
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))
    
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                        nltk.chunk.tree2conlltags(sent)]
                       for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
    
    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
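
The sketch below traces the data shapes that flow through these two classes, using plain Python and a hand-written stand-in for the trained classifier. Everything named `toy_*` is hypothetical, and unlike the feature extractors defined later, `toy_features` reads the `history` argument, simply to show why it is passed along.

```python
def toy_features(sentence, i, history):
    """Illustrative feature extractor: current POS plus the previous chunk tag."""
    word, pos = sentence[i]
    prev_chunk = history[-1] if history else "<START>"
    return {"pos": pos, "prev-chunk": prev_chunk}


def toy_classify(featureset):
    """Hand-written stand-in for a trained classifier (not a real model)."""
    pos, prev_chunk = featureset["pos"], featureset["prev-chunk"]
    if pos in ("DT", "PRP$"):
        return "B-NP"
    if pos.startswith("NN") or pos == "JJ":
        return "I-NP" if prev_chunk in ("B-NP", "I-NP") else "B-NP"
    return "O"


def toy_tag(sentence):
    """Tag left to right; each decision is appended to history for later steps."""
    history = []
    for i in range(len(sentence)):
        history.append(toy_classify(toy_features(sentence, i, history)))
    return list(zip(sentence, history))


sentence = [("the", "DT"), ("farmer", "NN"), ("sold", "VBD"), ("rice", "NN")]
tagged = toy_tag(sentence)                        # [((word, pos), chunk), ...]
conlltags = [(w, t, c) for ((w, t), c) in tagged]  # [(word, pos, chunk), ...]
print(conlltags)
```

The tagger works on `((word, pos), chunk)` pairs, while the chunker wrapper converts to and from `(word, pos, chunk)` triples on either side of it.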

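We also need to define the feature extractor npchunk_features that these classes use. We begin with a simple feature extractor that provides only the part-of-speech tag of the current token. A chunker using this feature extractor performs very much like the unigram chunker: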

In [10]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {'pos': pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%


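Next we add one more feature: the part-of-speech tag of the previous word. This feature lets the classifier model interactions between adjacent tags, and the resulting chunker performs very close to the bigram chunker: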

In [11]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i - 1]
    return {'pos': pos, 'prevpos': prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.6%%
    Precision:     82.0%%
    Recall:        87.2%%
    F-Measure:     84.6%%


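Next, we try adding the current word as a feature. This does improve the chunker's performance, raising the F-measure by about 2.5 percentage points (from 84.6% to 87.1%):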

In [12]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i - 1]
    return {'pos': pos, 'word': word, 'prevpos': prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  94.6%%
    Precision:     84.6%%
    Recall:        89.8%%
    F-Measure:     87.1%%


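Finally, we extend the feature extractor with a variety of additional features, such as lookahead features, paired features, and complex contextual features: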

In [13]:
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sentence[i + 1]
    return {'pos': pos,
            'word': word, 
            'prevpos': prevpos,
            'nextpos': nextpos,
            'prevpos+pos': '%s+%s' % (prevpos, pos),
            'pos+nextpos': '%s+%s' % (pos, nextpos),
            'tags-since-dt': tags_since_dt(sentence, i)}

def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  96.0%%
    Precision:     88.3%%
    Recall:        91.1%%
    F-Measure:     89.7%%
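
To see what the `tags-since-dt` feature looks like in practice, the snippet below repeats the `tags_since_dt` function from the cell above and applies it at a few positions of the example sentence used earlier. It needs no trained model.

```python
def tags_since_dt(sentence, i):
    """Join the distinct POS tags seen since the most recent determiner."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()  # a determiner resets the context
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))


sentence = [('Over', 'IN'), ('a', 'DT'), ('cup', 'NN'), ('of', 'IN'),
            ('coffee', 'NN'), (',', ','), ('Mr.', 'NNP'), ('Stone', 'NNP'),
            ('told', 'VBD'), ('his', 'PRP$'), ('story', 'NN'), ('.', '.')]

for i in (2, 4, 10):
    word = sentence[i][0]
    print(word, repr(tags_since_dt(sentence, i)))
```

For cup the feature is empty because the determiner a immediately precedes it; for coffee it is `IN+NN`; for story it has grown to `,+IN+NN+NNP+PRP$+VBD`, since no determiner has appeared since a.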
