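# 7. Extracting Information from Text
1. How can we build a system that extracts structured data from unstructured text?
2. What are some robust methods for identifying the entities and relationships described in a text?
3. Which corpora are appropriate for this work, and how do we use them to train and evaluate our models?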

## 7.1 Information Extraction

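### Information Extraction Architecture
Raw text -> sentence segmentation -> tokenization -> part-of-speech tagging -> entity detection -> relation detection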

In [1]:
import nltk
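def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document) # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences] # tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences] # part-of-speech tagging
    return sentences # list of sentences, each a list of (word, POS tag) pairs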

## 7.2 Chunking

### Noun Phrase Chunking

In [2]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}" # an NP is an optional determiner, then any number of adjectives, then a noun
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


### Chunking with Regular Expressions

In [3]:
# When a grammar has several rules, a sequence is chunked if it matches any one of them
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   
      {<NNP>+}                
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
result = cp.parse(sentence)
print(result)
result.draw()

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))


In [4]:
# When several overlapping spans satisfy a rule, the leftmost match takes precedence
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(nouns)
print(result)
result.draw()

(S (NP money/NN market/NN) fund/NN)


### Exploring Text Corpora

In [5]:
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
for sent in nltk.corpus.brown.tagged_sents()[:100]:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': print(subtree)

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
(CHUNK expected/VBN to/TO approve/VB)
(CHUNK expected/VBN to/TO make/VB)
(CHUNK intends/VBZ to/TO make/VB)
(CHUNK seek/VB to/TO set/VB)
(CHUNK like/VB to/TO see/VB)


In [6]:
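# Wrap the procedure above in a function
def find_chunks(reg, n_sent=None):
    cp = nltk.RegexpParser(reg)
    label = reg.split(":")[0].strip() # chunk label: the text before the first colon
    for sent in nltk.corpus.brown.tagged_sents()[:n_sent]: # n_sent=None scans every sentence
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == label: print(subtree)
# Find runs of four or more consecutive nouns
find_chunks(reg='NOUNS:{<N.*>{4,}}', n_sent=10)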

(NOUNS Court/NN-TL Judge/NN-TL Durwood/NP Pye/NP)
(NOUNS Mayor-nominate/NN-TL Ivan/NP Allen/NP Jr./NP)


### Chinking

In [7]:
grammar = r"""
  NP:
    {<.*>+}          # chunk everything
    }<VBD|IN>+{      # chink (remove) sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
cp.parse(sentence).draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
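The effect of chinking can be imitated without NLTK. The sketch below is only an illustration of the idea, not how `RegexpParser` is implemented: it treats the whole sentence as one chunk and then cuts out every run of tokens whose tag is in a chink set. The helper name `chink` is hypothetical.

```python
from itertools import groupby

def chink(tagged_sentence, chink_tags):
    """Treat the sentence as one chunk, then cut out runs of chink_tags.

    Hypothetical helper for illustration; returns the surviving chunks
    as lists of words.
    """
    chunks = []
    # groupby splits the sentence into maximal runs of "chink" / "non-chink" tokens
    for is_chink, group in groupby(tagged_sentence, key=lambda wt: wt[1] in chink_tags):
        if not is_chink:
            chunks.append([word for word, tag in group])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(chink(sentence, {"VBD", "IN"}))
# [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

The two surviving word lists correspond to the two NP chunks in the parse above.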


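### Representing Chunks: Tags vs Trees
IOB tags: B marks the first token of a chunk, I marks a token inside a chunk, and O marks a token outside any chunk.

Trees: the structure drawn by result.draw() in the examples above.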
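The two representations carry the same information for flat chunks. As a minimal sketch of that equivalence (plain Python, with a hypothetical helper `iob_to_chunks` that is not part of NLTK), the function below groups IOB-tagged tokens back into labelled chunks:

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    current = None  # the chunk being built, or None when outside any chunk
    for word, pos, iob in triples:
        if iob == "O":
            current = None  # an O tag closes any open chunk
        elif iob.startswith("B-") or current is None or current[0] != iob[2:]:
            # B- always opens a chunk; a stray I- tag is treated as an opening too
            current = (iob[2:], [word])
            chunks.append(current)
        else:
            current[1].append(word)  # I- tag continuing the open chunk
    return chunks

triples = [("the", "DT", "B-NP"), ("little", "JJ", "I-NP"), ("yellow", "JJ", "I-NP"),
           ("dog", "NN", "I-NP"), ("barked", "VBD", "O"), ("at", "IN", "O"),
           ("the", "DT", "B-NP"), ("cat", "NN", "I-NP")]
print(iob_to_chunks(triples))
# [('NP', ['the', 'little', 'yellow', 'dog']), ('NP', ['the', 'cat'])]
```

Treating a stray I- tag as the start of a chunk is a design choice of this sketch; other conventions are possible.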

## 7.3 Developing and Evaluating Chunkers

### IOB-Format Text and Chunked Corpora

In [8]:
# Each line of text is a (word, POS tag, IOB tag) triple
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
# Build the tree directly from the three-column text and draw it
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

In [9]:
# Some corpora contain sentences that have already been chunked
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])
conll2000.chunked_sents('train.txt')[99].draw()

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)


In [10]:
# chunk_types restricts the result to the listed chunk types
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99].draw()

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)


### Simple Evaluation and Baselines

In [11]:
# Baseline: an empty grammar matches nothing, so every token is tagged O
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents)) # IOB accuracy of 43.4% is the share of tokens whose gold tag is O; no chunks are proposed, so precision, recall and F-measure are all 0

ChunkParse score:
    IOB Accuracy:  43.4%%
    Precision:      0.0%%
    Recall:         0.0%%
    F-Measure:      0.0%%
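The gap between a non-zero IOB accuracy and zero precision and recall follows from what each score counts. The sketch below is a simplified, standalone illustration (the helpers `iob_accuracy` and `chunk_scores` are hypothetical, and NLTK's own scoring code may differ in detail): accuracy is computed per token, while precision, recall and F-measure are computed over whole chunks, represented here as `(start, end)` token spans.

```python
def iob_accuracy(gold_tags, guessed_tags):
    """Fraction of tokens whose IOB tag is guessed correctly."""
    hits = sum(g == p for g, p in zip(gold_tags, guessed_tags))
    return hits / len(gold_tags)

def chunk_scores(gold_chunks, guessed_chunks):
    """Precision, recall and F-measure over sets of (start, end) chunk spans."""
    gold, guessed = set(gold_chunks), set(guessed_chunks)
    correct = len(gold & guessed)  # a chunk counts only if both boundaries match
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# "the little yellow dog barked at the cat": two gold NP chunks
gold_tags = ["B-NP", "I-NP", "I-NP", "I-NP", "O", "O", "B-NP", "I-NP"]
gold_chunks = {(0, 4), (6, 8)}

# Baseline that tags everything O and therefore proposes no chunks
print(iob_accuracy(gold_tags, ["O"] * 8))   # 0.25
print(chunk_scores(gold_chunks, set()))     # (0.0, 0.0, 0.0)

# A guess with one correct chunk and one wrong boundary
print(chunk_scores(gold_chunks, {(0, 4), (5, 8)}))  # (0.5, 0.5, 0.5)
```

The all-O baseline still gets the two O tokens right, which is why its token-level accuracy is not zero.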


In [12]:
# Regular-expression chunker
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%%
    Precision:     70.6%%
    Recall:        67.8%%
    F-Measure:     69.2%%


In [13]:
# Unigram chunker: predicts each IOB tag from the POS tag alone (just as a unigram POS tagger predicts the POS tag from the word)
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.9%%
    Precision:     79.9%%
    Recall:        86.8%%
    F-Measure:     83.2%%
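What the unigram tagger learns here is, in essence, a lookup table from POS tag to its most frequent IOB tag. The sketch below reproduces that idea in plain Python on a tiny hand-made training set; the function `train_unigram` and the data are assumptions made for illustration, not NLTK code or CoNLL data.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each POS tag to the IOB tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# Hypothetical training data: sentences as (POS tag, IOB tag) pairs
train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("IN", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
]
model = train_unigram(train)
print([model.get(pos) for pos in ["DT", "JJ", "NN", "VBD", "UH"]])
# ['B-NP', 'I-NP', 'I-NP', 'O', None]
```

NN is mapped to I-NP because it appears inside a chunk three times and at the start only once; a POS tag never seen in training gets no prediction.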


In [14]:
# Chunk a sentence with the unigram chunker
sent=[("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
print(unigram_chunker.parse(sent))
unigram_chunker.parse(sent).draw()

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [15]:
# Build a bigram chunker in the same way
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


### Training Classifier-Based Chunkers

In [16]:
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    '''Tagger that assigns an IOB tag to each (word, POS tag) token.
    train_sents: training sentences, each a list of ((word, pos), iob) pairs
    train_set: training instances, each a (featureset, IOB tag) pair
    history: IOB tags assigned so far in the current sentence
    '''
    def __init__(self, train_sents, verbose=1):
        train_set = []
        for tagged_sent in train_sents: # iterate over the training sentences
            untagged_sent = nltk.tag.untag(tagged_sent) # strip the IOB tags, keeping (word, pos) pairs
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history) # extract features
                train_set.append((featureset, tag)) # add the instance to the training set
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set, trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history)) # a list rather than a one-shot zip iterator

class ConsecutiveNPChunker(nltk.ChunkParserI):
    '''Chunker built on ConsecutiveNPChunkTagger.
    train_sents: training trees (words, POS tags and chunk structure)
    tagged_sents: (word, POS tag) pairs as the input X, IOB tags as the target Y
    conlltags: predictions converted back to (word, POS tag, IOB tag) triples
    '''
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents] # reshape (word, POS tag, IOB tag) triples into ((word, POS tag), IOB tag) pairs
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents) # train the tagger defined above to predict IOB tags

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence) # predict IOB tags
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents] # flatten ((word, POS tag), IOB tag) pairs back into triples
        return nltk.chunk.conlltags2tree(conlltags)

In [17]:
# A single feature: the POS tag of the current token
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}
chunker = ConsecutiveNPChunker(train_sents[:100])
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.7%%
    Precision:     79.2%%
    Recall:        86.3%%
    F-Measure:     82.6%%


In [18]:
# Add two features: the current word and the POS tag of the previous token
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}
chunker = ConsecutiveNPChunker(train_sents[:100])
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  92.7%%
    Precision:     80.2%%
    Recall:        85.6%%
    F-Measure:     82.8%%


In [19]:
# Add several more features
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,                                    # current POS tag
            "word": word,                                  # current word
            "prevpos": prevpos,                            # POS tag of the previous token
            "nextpos": nextpos,                            # POS tag of the next token
            "prevpos+pos": "%s+%s" % (prevpos, pos),       # previous and current POS tags combined
            "pos+nextpos": "%s+%s" % (pos, nextpos),       # current and next POS tags combined
            "tags-since-dt": tags_since_dt(sentence, i)}   # all POS tags seen since the most recent determiner (see tags_since_dt)
chunker = ConsecutiveNPChunker(train_sents[:100])
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.7%%
    Precision:     83.1%%
    Recall:        85.6%%
    F-Measure:     84.4%%
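The `tags-since-dt` feature is the least obvious of the set, so a trace may help. The standalone sketch below copies the `tags_since_dt` function from the cell above and prints its value at every position of a sample sentence; the sentence is chosen only for illustration.

```python
def tags_since_dt(sentence, i):
    """POS tags seen before position i since the most recent determiner."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()      # a determiner resets the collected tags
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
trace = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), value in zip(sentence, trace):
    print("%-8s %-4s %r" % (word, pos, value))
# the      DT   ''
# little   JJ   ''
# yellow   JJ   'JJ'
# dog      NN   'JJ'
# barked   VBD  'JJ+NN'
# at       IN   'JJ+NN+VBD'
# the      DT   'IN+JJ+NN+VBD'
# cat      NN   ''
```

The value only covers tokens strictly before position i, which is why the second "the" still sees the tags collected since the first determiner and "cat" sees an empty string.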


## 7.4 Recursion in Linguistic Structure

In [20]:
# Chunk structures no longer have to be flat: cascaded rules can nest chunks to any depth
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))
cp.parse(sentence).draw() # Note: the chunker misses the VP headed by saw

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


In [21]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
cp.parse(sentence).draw() # Note: the chunker still fails to identify the VP headed by saw

(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


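Neither of the two parses above builds a VP around the verb that takes a clause as its argument (saw). The rules are applied once each, in order: when the VP rule runs, the material after saw has not yet been grouped into a CLAUSE, so the rule cannot match; once the CLAUSE has been formed, the VP rule is not tried again, so no VP is built.

The fix is to loop over the rules again after the first pass. Once a CLAUSE has been found, the next pass lets the VP rule match a verb followed by a CLAUSE and build the missing VP.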
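The looping behaviour can be imitated with the standard `re` module. The sketch below is a deliberate simplification and not how `RegexpParser` works internally: it keeps only the sequence of top-level labels, written as `<TAG>` strings, and replaces each match of a rule with the rule's label. The `RULES` table mirrors the grammar above, and `cascade` is a hypothetical helper.

```python
import re

# Each rule rewrites a run of <TAG> items into a single labelled item
RULES = [
    ("NP",     r"(?:<(?:DT|JJ|NN[^>]*)>)+"),
    ("PP",     r"<IN><NP>"),
    ("VP",     r"<VB[^>]*>(?:<(?:NP|PP|CLAUSE)>)+$"),
    ("CLAUSE", r"<NP><VP>"),
]

def cascade(tags, loop=1):
    """Apply every rule once, in order, and repeat the whole pass `loop` times."""
    s = "".join("<%s>" % t for t in tags)
    for _ in range(loop):
        for label, pattern in RULES:
            s = re.sub(pattern, "<%s>" % label, s)
    return s

# John thinks Mary saw the cat sit on the mat
tags = ["NNP", "VBZ", "NN", "VBD", "DT", "NN", "VB", "IN", "DT", "NN"]
for n in (1, 2, 3):
    print(n, cascade(tags, loop=n))
# 1 <NP><VBZ><NP><VBD><CLAUSE>
# 2 <NP><VBZ><CLAUSE>
# 3 <CLAUSE>
```

After one pass the verbs thinks and saw are still bare; each extra pass lets the VP rule absorb one more CLAUSE, so this sentence needs three passes to reduce to a single CLAUSE.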

In [22]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]
cp = nltk.RegexpParser(grammar, loop=2)
print(cp.parse(sentence))
cp.parse(sentence).draw() # Note: this time the VP formed by saw plus its clause is found, but the VP headed by thinks is still missing

(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))


In [23]:
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]
cp = nltk.RegexpParser(grammar, loop=3)
print(cp.parse(sentence))
cp.parse(sentence).draw()

(S
  (CLAUSE
    (NP John/NNP)
    (VP
      thinks/VBZ
      (CLAUSE
        (NP Mary/NN)
        (VP
          saw/VBD
          (CLAUSE
            (NP the/DT cat/NN)
            (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))))


### Trees

In [24]:
tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)
tree3 = nltk.Tree('VP', ['chased', tree2])
print(tree3)
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)
tree4.draw() 

(NP Alice)
(NP the rabbit)
(VP chased (NP the rabbit))
(S (NP Alice) (VP chased (NP the rabbit)))


### Tree Traversal

In [25]:
def traverse(t):
    try:
        t.label()
    except AttributeError:
        print(t, end=" ")
    else:
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")
traverse(tree4)

( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) 

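## 7.5 Named Entity Recognition
Named entity recognition is well suited to a classifier-based approach: build an IOB tagger, as for chunking, and use it to label the entities.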

In [26]:
sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True)) # with binary=True, entities are only marked as NE, without a category

(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (NE Brooke/NNP)
  T./NNP
  Mossman/NNP
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE Vermont/NNP College/NNP)
  of/IN
  (NE Medicine/NNP)
  ./.)


In [27]:
print(nltk.ne_chunk(sent, binary=False)) # with binary=False, each entity gets a category label such as PERSON, ORGANIZATION or GPE

(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  few/JJ
  industrialized/VBN
  nations/NNS
  that/WDT
  *T*-7/-NONE-
  does/VBZ
  n't/RB
  have/VB
  a/DT
  higher/JJR
  standard/NN
  of/IN
  regulation/NN
  for/IN
  the/DT
  smooth/JJ
  ,/,
  needle-like/JJ
  fibers/NNS
  such/JJ
  as/IN
  crocidolite/NN
  that/WDT
  *T*-1/-NONE-
  are/VBP
  classified/VBN
  *-5/-NONE-
  as/IN
  amphobiles/NNS
  ,/,
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
  ./.)


## 7.6 Relation Extraction

In [28]:
# Extract ORG-in-LOC relations by matching a regular expression against the text between the two entities
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
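To see what the `IN` pattern accepts, it can be tried directly on a few strings. The sketch below assumes the pattern is matched from the start of the text that sits between the two entities; the filler strings are illustrative, some copied from the output above and some made up.

```python
import re

# Same pattern as above: the word "in", not followed later by a word ending in -ing
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",
    ", based in",
    "firm in",
    "success in supervising the transition of",  # "in" followed by a gerund
    "located near",                               # no "in" at all
]
results = {f: bool(IN.match(f)) for f in fillers}
for f in fillers:
    print("%-45r %s" % (f, results[f]))
# 'in'                                          True
# ', based in'                                  True
# 'firm in'                                     True
# 'success in supervising the transition of'    False
# 'located near'                                False
```

The negative lookahead `(?!\b.+ing)` is what rejects the fourth string, where "in" introduces a gerund rather than a location.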


In [29]:
# NLTK also ships corpora that are already annotated with named entities, such as conll2002
from nltk.corpus import conll2002
vnv = """
(
is/V|    # 3rd sing present and
was/V|   # past forms of the verb zijn ('be')
werd/V|  # and also present
wordt/V  # past of worden ('become')
)
.*       # followed by anything
van/Prep # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc,
                                   corpus='conll2002', pattern=VAN):
        print(nltk.sem.clause(r, relsym="VAN"))

VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')
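The `VAN` pattern relies on `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. The sketch below compiles an equivalent pattern and tries it on a few strings. These sample strings are hypothetical, written in the `word/Tag` style the pattern implies; they are not taken from the conll2002 corpus.

```python
import re

vnv = r"""
(
is/V|     # forms of zijn ('be')
was/V|
werd/V|   # forms of worden ('become')
wordt/V
)
.*        # followed by anything
van/Prep  # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)

# Hypothetical filler strings for illustration only
samples = [
    "is/V voorzitter/N van/Prep",
    "werd/V in/Prep 1996/Num directeur/N van/Prep",
    "voorzitter/N van/Prep",   # no leading verb form, so no match
]
matches = [bool(VAN.match(s)) for s in samples]
print(matches)
# [True, True, False]
```

With `re.VERBOSE` the pattern is equivalent to `(is/V|was/V|werd/V|wordt/V).*van/Prep`, so the layout and comments are purely for readability.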
