# Tagged Corpora

## Representing Tagged Tokens

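In NLTK, a tagged token is represented as a tuple consisting of the token and its tag. The function str2tuple() creates such a tuple from the standard string representation of a tagged token: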

In [1]:
import nltk

tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')

We can also build a list of tagged tokens directly from a string:

In [2]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
print([nltk.tag.str2tuple(t) for t in sent.split()][:5])

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN')]
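To make the parsing step concrete, here is a minimal plain-Python sketch of the splitting rule. It assumes that str2tuple() splits on the last `/` and upper-cases the tag, so that `Fulton/NP-tl` would become `('Fulton', 'NP-TL')`. The name `split_tagged` is a hypothetical helper for illustration and is not part of NLTK.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' into (word, TAG). Hypothetical stand-in for str2tuple()."""
    # rpartition splits on the LAST separator, so a word that itself
    # contains '/' (for example '1/2') keeps its slash.
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: treat the whole string as an untagged word.
        return (token, None)
    # Assumption: tags are normalised to upper case.
    return (word, tag.upper())


print(split_tagged("fly/NN"))        # ('fly', 'NN')
print(split_tagged("Fulton/NP-tl"))  # ('Fulton', 'NP-TL')
print(split_tagged("1/2/CD"))        # ('1/2', 'CD')
```

Splitting from the right is the design choice that matters here: the tag never contains the separator, but the word might.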


## Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for part of speech, and they share a uniform reader interface:

In [3]:
print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.nps_chat.tagged_words())
print(nltk.corpus.conll2000.tagged_words())

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]


Not all corpora use the same set of tags. To avoid dealing with the differences between these tagsets, pass [tagset='universal'](https://www.nltk.org/_modules/nltk/tag/mapping.html) to map them onto a simplified tagset:

In [4]:
print(nltk.corpus.brown.tagged_words(tagset='universal'))
print(nltk.corpus.nps_chat.tagged_words(tagset='universal'))
print(nltk.corpus.conll2000.tagged_words(tagset='universal'))

[('The', 'DET'), ('Fulton', 'NOUN'), ...]
[('now', 'ADV'), ('im', 'PRON'), ('left', 'VERB'), ...]
[('Confidence', 'NOUN'), ('in', 'ADP'), ('the', 'DET'), ...]
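Conceptually, the mapping is a lookup from each corpus-specific tag to a coarser one. The sketch below shows the idea in plain Python. `BROWN_TO_UNIVERSAL` is a small hand-written excerpt (an assumption for illustration, not the table NLTK ships), and both `simplify` and the fallback label `"X"` are hypothetical choices made for this example.

```python
# Hand-written excerpt of a Brown-to-universal lookup (illustrative only).
BROWN_TO_UNIVERSAL = {
    "AT": "DET",      # article
    "JJ": "ADJ",      # adjective
    "NN": "NOUN",     # singular noun
    "NP-TL": "NOUN",  # proper noun inside a title
    "VBD": "VERB",    # past-tense verb
    "IN": "ADP",      # preposition
}


def simplify(tagged_words, mapping, default="X"):
    """Replace each fine-grained tag with its coarse tag (hypothetical helper)."""
    # Tags missing from the excerpt fall back to `default`; that fallback
    # is a choice made for this sketch, not documented NLTK behaviour.
    return [(word, mapping.get(tag, default)) for word, tag in tagged_words]


brown_style = [("The", "AT"), ("grand", "JJ"), ("jury", "NN"),
               ("commented", "VBD"), ("on", "IN")]
simplified = simplify(brown_style, BROWN_TO_UNIVERSAL)
print(simplified)
```

Because the lookup is many-to-one, the simplified tags cannot be converted back to the original ones.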


## A Simplified Part-of-Speech Tagset

|   Tag    |    Meaning    | Examples |
|----------|:-------------:|------:|
| ADJ | adjective | new, good, high, special, big, local |
| ADP | adposition | on, of, at, with, by, into, under |
| ADV | adverb | really, already, still, early, now |
| CONJ | conjunction | and, or, but, if, while, although |
| DET | determiner | the, a, some, most, every, no, which |
| NOUN | noun | year, home, costs, time, Africa |
| NUM | numeral | twenty-four, fourth, 1991, 14:24 |
| PRT | particle | at, on, out, over, per, that, up, with |
| PRON | pronoun | he, their, her, its, my, I, us |
| VERB | verb | is, say, told, given, playing, would |
| . | punctuation | . , ; ! |
| X | other | ersatz, esprit, dunno, gr8, univeristy |

Let's see which of these tags are the most common in the news category of the Brown corpus:

In [5]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())

[('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]
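The counting itself is ordinary frequency counting over the tag half of each pair. As a rough sketch that runs without any corpus data, the same idea can be shown with `collections.Counter` from the standard library. `toy_tagged` is a hand-made list, not corpus output.

```python
from collections import Counter

# Hand-made (word, tag) pairs using universal tags; illustrative data only.
toy_tagged = [
    ("The", "DET"), ("jury", "NOUN"), ("praised", "VERB"),
    ("the", "DET"), ("city", "NOUN"), ("officials", "NOUN"), (".", "."),
]

# Count only the tag of each pair, ignoring the word.
tag_counts = Counter(tag for _, tag in toy_tagged)
print(tag_counts.most_common())
# [('NOUN', 3), ('DET', 2), ('VERB', 1), ('.', 1)]

# Relative frequency of one tag: its count divided by the number of tokens.
noun_share = tag_counts["NOUN"] / sum(tag_counts.values())
print(round(noun_share, 3))  # 0.429
```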


## Nouns

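Nouns generally refer to people, places, things, or concepts. They can appear after determiners and adjectives, and can be the subject or object of a verb.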

The following example inspects which part-of-speech tags occur most often immediately before a noun:

In [6]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])

['NOUN', 'DET', 'ADJ', 'ADP', '.', 'VERB', 'CONJ', 'NUM', 'ADV', 'PRT', 'PRON', 'X']
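To see what the bigram step does on a scale small enough to check by hand, here is a plain-Python sketch. It pairs each token with its successor using `zip`, which yields the same adjacent pairs as nltk.bigrams, and counts the tag of the left-hand token whenever the right-hand token is a noun. `toy_tagged` is hypothetical data.

```python
from collections import Counter

# Hand-made (word, tag) pairs; illustrative data only.
toy_tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "NOUN"), ("praised", "VERB"),
    ("the", "DET"), ("city", "NOUN"), ("officials", "NOUN"), (".", "."),
]

# zip(seq, seq[1:]) yields every pair of adjacent tokens.
adjacent = zip(toy_tagged, toy_tagged[1:])

# Keep the tag of the first token whenever the second token is a noun.
noun_preceders = Counter(
    prev_tag
    for (_, prev_tag), (_, next_tag) in adjacent
    if next_tag == "NOUN"
)
print(dict(noun_preceders))  # {'ADJ': 1, 'DET': 1, 'NOUN': 1}
```

Note that nltk.bigrams returns a generator, so word_tag_pairs in the cell above can be iterated only once.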


## Verbs

Verbs are words that describe events and actions.

What are the most common verbs in news text? Let's sort all the verbs by frequency:

In [7]:
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
print([wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB'][:30])

['is', 'said', 'was', 'are', 'be', 'has', 'have', 'will', 'says', 'would', 'were', 'had', 'been', 'could', "'s", 'can', 'do', 'say', 'make', 'may', 'did', 'rose', 'made', 'does', 'expected', 'buy', 'take', 'get', 'might', 'sell']


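We can treat the word in each (word, tag) pair as the condition and the tag as the event, initialize a conditional frequency distribution with these pairs, and look up a frequency-ordered list of tags for a given word: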

In [8]:
cfd1 = nltk.ConditionalFreqDist(wsj)
print(cfd1['yield'].most_common())
print(cfd1['cut'].most_common())

[('VERB', 28), ('NOUN', 20)]
[('VERB', 25), ('NOUN', 3)]


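We can also reverse the order of the pairs, so that the tag is the condition and the word is the event, and look up the words that occur with a given tag. Here the corpus is loaded without tagset='universal', so the original Penn Treebank tags such as VBN are available: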

In [9]:
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
print(list(cfd2['VBN'])[:20])

['named', 'used', 'caused', 'exposed', 'reported', 'replaced', 'sold', 'died', 'expected', 'diagnosed', 'studied', 'industrialized', 'owned', 'found', 'classified', 'rejected', 'outlawed', 'imported', 'tracked', 'thought']
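The two conditional frequency distributions above differ only in which half of the pair serves as the condition. A minimal sketch of that idea, using a dictionary of counters instead of NLTK's class, might look like the following. `conditional_counts` and `toy_tagged` are hypothetical names and data introduced for illustration.

```python
from collections import Counter, defaultdict

# Hand-made (word, tag) pairs; illustrative data only.
toy_tagged = [
    ("yield", "VERB"), ("yield", "NOUN"), ("cut", "VERB"),
    ("yield", "VERB"), ("cut", "VERB"), ("cut", "VERB"),
]


def conditional_counts(pairs):
    """Group (condition, event) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table


# Word as condition, tag as event: which tags does a word take?
word_to_tags = conditional_counts(toy_tagged)
# Swap each pair: tag as condition, word as event.
tag_to_words = conditional_counts((tag, word) for word, tag in toy_tagged)

print(word_to_tags["yield"].most_common())  # [('VERB', 2), ('NOUN', 1)]
print(sorted(tag_to_words["VERB"]))         # ['cut', 'yield']
```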


## Unsimplified Tags

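The unsimplified tags come in many variants. Noun tags, for example, all start with NN; a $ marks possessive nouns and an S marks plural nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines, and -TL for titles.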
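As a small illustration of this naming scheme, the sketch below takes a tag apart into its base tag, a possessive flag, and its suffix modifiers. `split_brown_tag` is a hypothetical helper, and it assumes that hyphens only ever introduce suffix modifiers, which holds for noun tags such as NN$-TL or NN-TL-HL but may not hold for every tag in the corpus.

```python
def split_brown_tag(tag):
    """Split a tag like 'NN$-TL' into (base, is_possessive, modifiers).

    Hypothetical helper; assumes every '-' introduces a suffix modifier.
    """
    # The first piece is the base tag, the rest are modifiers such as TL or HL.
    base, *modifiers = tag.split("-")
    # A trailing '$' on the base tag marks a possessive form.
    is_possessive = base.endswith("$")
    return base, is_possessive, modifiers


print(split_brown_tag("NNS"))       # ('NNS', False, [])
print(split_brown_tag("NN$-TL"))    # ('NN$', True, ['TL'])
print(split_brown_tag("NN-TL-HL"))  # ('NN', False, ['TL', 'HL'])
```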

In [10]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                    if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("city's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Administration's", 3), ("Army's", 3), ("League's", 3), ("University's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('cut', 2), ('party', 2)]
NN-NC [('ova', 1), ('eva', 1), ('aya', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Mayor', 1), ('Commissioner', 1), ('City', 1), ('Oak', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("men's", 3), ("janitors'", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Princes'", 1), ("Bombers'", 1)]
NNS-HL [('Wards', 1), ('deputies', 1), ('bonds', 1), ('aspects', 1), ('Decisions', 1)]
NNS-TL [