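# 6. Learning to Classify Text
1. How can we identify the features of language data that are salient for classifying it?
2. How can we construct models of language that perform language processing tasks automatically?
3. What can we learn about language from these models?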

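## 6.1 Supervised Classification
- Deciding whether an email is spam
- Deciding what the topic of an article is
- Deciding which sense of a word is intended in a given context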

### Gender Identification

In [1]:
import random

import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}  # feature: the last letter of the name

# Build the labeled name data
labeled_names = ([(name, 'male') for name in names.words('male.txt')]
                 + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)  # shuffle the data
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]  # extract the last-letter feature for every name
train_set, test_set = featuresets[500:], featuresets[:500]  # split into training and test sets
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the classifier
print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))  # training-set accuracy
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))  # test-set accuracy
classifier.show_most_informative_features(5)  # prints the table itself and returns None, so no print() wrapper

male
female
the accuracy of train set 0.7619559376679205
the accuracy of test set 0.776
Most Informative Features
             last_letter = 'a'            female : male   =     37.0 : 1.0
             last_letter = 'k'              male : female =     31.9 : 1.0
             last_letter = 'f'              male : female =     16.5 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     10.4 : 1.0


### Choosing the Right Features

In [2]:
# Add more features
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()  # first letter
    features["last_letter"] = name[-1].lower()  # last letter
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)  # how many times the letter occurs in the name
        features["has({})".format(letter)] = (letter in name.lower())  # whether the name contains the letter
    return features
featuresets = [(gender_features2(n), g) for (n, g) in labeled_names]  # extract features
train_set, test_set = featuresets[500:], featuresets[:500]  # split into training and test sets
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the classifier
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))  # training-set accuracy
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))  # test-set accuracy
classifier.show_most_informative_features(5)  # prints the table itself and returns None, so no print() wrapper

the accuracy of train set 0.7760612573885008
the accuracy of test set 0.796
Most Informative Features
             last_letter = 'a'            female : male   =     37.0 : 1.0
             last_letter = 'k'              male : female =     31.9 : 1.0
             last_letter = 'f'              male : female =     16.5 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     10.4 : 1.0


In [3]:
# Training, dev-test, and test sets
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.749


In [4]:
# Error analysis: inspect the individual misclassified names
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
for (tag, guess, name) in sorted(errors)[:10]:
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Adrian                        
correct=female   guess=male     name=Aidan                         
correct=female   guess=male     name=Aigneis                       
correct=female   guess=male     name=Allis                         
correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Allyson                       
correct=female   guess=male     name=Anett                         
correct=female   guess=male     name=Ann                           
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Ashlen                        
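
One way to turn an error listing like this into a feature idea is to count which name endings are misclassified most often. The sketch below assumes the errors are `(tag, guess, name)` tuples, as built in the cell above. The helper name `suffix_error_counts` is hypothetical, and the sample list is just a handful of names copied from the listing, not the full error set.

```python
from collections import Counter

def suffix_error_counts(errors, n=2):
    """Count misclassified names by their last n letters (lowercased)."""
    return Counter(name[-n:].lower() for (_tag, _guess, name) in errors)

# A few (tag, guess, name) tuples copied from the error listing above.
sample_errors = [
    ('female', 'male', 'Adrian'),
    ('female', 'male', 'Aidan'),
    ('female', 'male', 'Allsun'),
    ('female', 'male', 'Allyson'),
    ('female', 'male', 'Ann'),
    ('female', 'male', 'Aryn'),
]

counts = suffix_error_counts(sample_errors)
for suffix, count in counts.most_common(3):
    print(suffix, count)
```

Endings that dominate the tally are candidates for new suffix features, which should then be checked against the dev-test set rather than the test set.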


In [5]:
# Improved features: use the one- and two-letter suffixes of the name
def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.777


### Document Classification

In [6]:
# Load the documents and their category labels
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

In [7]:
# Count word frequencies
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Take the first 2000 distinct words. Note that list(all_words) is not sorted by
# frequency in NLTK 3; all_words.most_common(2000) would select the 2000 most
# frequent words instead.
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)  # feature: whether the document contains the word
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
# Note: documents was never shuffled and is ordered by category, so the first 100
# items all come from one category; calling random.shuffle(documents) before this
# step would give a fairer split.
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))  # training-set accuracy
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))  # test-set accuracy
classifier.show_most_informative_features(5)  # prints the table itself and returns None, so no print() wrapper

the accuracy of train set 0.8752631578947369
the accuracy of test set 0.78
Most Informative Features
    contains(recognizes) = True              pos : neg    =      8.1 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
    contains(schumacher) = True              neg : pos    =      7.8 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
     contains(atrocious) = True              neg : pos    =      6.4 : 1.0
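
The `word_features` line in cell In [7] slices `list(all_words)`, which does not necessarily give the most frequent words. A minimal sketch of the difference, using `collections.Counter` (the class that NLTK 3's `FreqDist` subclasses) and a toy token list that stands in for `movie_reviews.words()`; the tokens are made up for illustration, and the first-seen ordering shown assumes Python 3.7 or later.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words() (an assumption for illustration).
tokens = ["plot", "the", "movie", "the", "was", "the", "movie", "bad"]
freq = Counter(tokens)

first_seen = list(freq)[:2]                          # first distinct tokens in order of appearance
most_frequent = [w for w, _ in freq.most_common(2)]  # tokens with the highest counts

print(first_seen)
print(most_frequent)
```

On a real corpus the two lists can overlap heavily, since frequent words also tend to appear early, but only `most_common` guarantees a frequency-based vocabulary.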


### Part-of-Speech Tagging

In [8]:
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [9]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features
tagged_words = brown.tagged_words(categories='news')[:1000]
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)  # use a decision tree classifier
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))

the accuracy of train set 0.7133333333333334
the accuracy of test set 0.64


In [10]:
# Inspect the decision tree's classification logic
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(s) == False: 
    if endswith(.) == False: 
      if endswith(,) == False: return 'CD'
      if endswith(,) == True: return ','
    if endswith(.) == True: return '.'
  if endswith(s) == True: 
    if endswith(as) == False: 
      if endswith('s) == False: return 'NPS'
      if endswith('s) == True: return 'NN$'
    if endswith(as) == True: 
      if endswith(was) == False: return 'CS'
      if endswith(was) == True: return 'BEDZ'
if endswith(the) == True: return 'AT'



### Exploiting Context

In [11]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]  # feature: the word preceding the current word
    return features

print(pos_features(brown.sents()[0], 8))

tagged_sents = brown.tagged_sents(categories='news')[:1000]
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))

{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}
the accuracy of train set 0.820450885668277
the accuracy of test set 0.7947439963751699


### Sequence Classification

In [12]:
import nltk
from nltk.corpus import brown

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],  # feature: last letter of the word
                "suffix(2)": sentence[i][-2:],  # feature: last two letters of the word
                "suffix(3)": sentence[i][-3:]}  # feature: last three letters of the word
    if i == 0:
        features["prev-word"] = "<START>"  # previous word; <START> marks a sentence-initial word
        features["prev-tag"] = "<START>"   # previous word's tag; <START> marks a sentence-initial word
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []  # training set
        for tagged_sent in train_sents:  # for each tagged training sentence
            untagged_sent = nltk.tag.untag(tagged_sent)  # strip the tags
            history = []  # tags assigned so far in this sentence
            for i, (word, tag) in enumerate(tagged_sent):  # each item is (index, (word, tag))
                featureset = pos_features(untagged_sent, i, history)  # extract features for the word
                train_set.append((featureset, tag))  # add (features, label) to the training set
                history.append(tag)  # record the tag (gold tags are used during training)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the model

    def tag(self, sentence):  # tagging at test time
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)  # predict the tag of each word
            history.append(tag)  # record the tag (predicted tags are used at test time)
        return list(zip(sentence, history))  # return a list rather than a one-shot zip iterator

tagged_sents = brown.tagged_sents(categories='news')[:1000]
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
# Note: newer NLTK releases deprecate evaluate() in favor of accuracy().
print('the accuracy of train set {}'.format(tagger.evaluate(train_sents)))
print('the accuracy of test set {}'.format(tagger.evaluate(test_sents)))

the accuracy of train set 0.8231285649386705
the accuracy of test set 0.798941798941799


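Sequence classification has a weakness: once a tag has been assigned it is never revisited, so a wrong tag early in the sentence feeds into the features of the words that follow and can throw their tags off as well. One possible solution is to assign a score to every possible tag sequence and choose the sequence with the highest overall score.

Hidden Markov Models take this approach. They condition each tag only on the most recent tag, which keeps the search over sequences tractable. There are also more advanced methods, such as Maximum Entropy Markov Models and Conditional Random Fields.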
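
To make the "score every sequence, keep the best" idea concrete, here is a minimal Viterbi sketch for a first-order HMM. The tag set, all probabilities, and the function name `viterbi` are toy assumptions chosen for illustration; none of them come from the Brown corpus or from NLTK.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the highest-scoring tag sequence under a first-order HMM.

    Scores are log-probabilities; zero or missing probabilities are floored
    to a tiny value so that log() stays defined.
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, 1e-12))

    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p[t]) + lp(emit_p[t].get(words[0], 0.0)), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = lp(emit_p[t].get(words[i], 0.0))
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p][t]) + emit) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score, prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy model (all numbers are made up for illustration).
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"barks": 0.6, "dog": 0.05},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

Because each tag depends only on the previous one, the table needs one column per word and one cell per tag, instead of one entry per complete tag sequence.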

## 6.2 Further Examples of Supervised Classification

### Sentence Segmentation

In [13]:
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)  # append this sentence's tokens to the flat token list
    offset += len(sent)
    boundaries.add(offset-1)  # record the index of the sentence-final token

In [14]:
def punct_features(tokens, i):
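    return {'next-word-capitalized': tokens[i+1][0].isupper(),  # whether the next token starts with a capital letter
            'prev-word': tokens[i-1].lower(),  # the previous token, lowercased
            'punct': tokens[i],  # the punctuation token itself
            'prev-word-is-one-char': len(tokens[i-1]) == 1}  # whether the previous token is a single character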
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))

the accuracy of train set 0.9699290250280165
the accuracy of test set 0.936026936026936
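
A boundary classifier like this one can then be used to split a flat token list into sentences. The sketch below takes the decision as a callable so that it runs on its own. `segment_sentences`, `is_boundary`, and the rule-based stand-in `naive_boundary` are illustrative names and not part of the notebook. With the trained model, one would pass a wrapper around `classifier.classify(punct_features(words, i))` that also guards the last index, since `punct_features` reads `tokens[i+1]`.

```python
def segment_sentences(words, is_boundary):
    """Split a flat token list into sentences.

    is_boundary(words, i) decides whether the punctuation token at index i
    ends a sentence.
    """
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in ('.', '?', '!') and is_boundary(words, i):
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])  # trailing tokens without closing punctuation
    return sents

def naive_boundary(words, i):
    """Stand-in rule (an assumption): boundary if the next token is capitalized or absent."""
    return i + 1 >= len(words) or words[i + 1][0].isupper()

tokens = ['It', 'cost', '3', '.', '5', 'dollars', '.', 'He', 'paid', '.']
print(segment_sentences(tokens, naive_boundary))
```

The stand-in rule keeps the period in `3 . 5` inside the first sentence, but it would wrongly split after an abbreviation followed by a capitalized name, which is the kind of case the learned features are meant to handle.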


### Identifying Dialogue Act Types

In [15]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('the accuracy of train set {}'.format(nltk.classify.accuracy(classifier, train_set)))
print('the accuracy of test set {}'.format(nltk.classify.accuracy(classifier, test_set)))

the accuracy of train set 0.6943333333333334
the accuracy of test set 0.668


## 6.3 Evaluation
- Accuracy
- Precision and recall
- Confusion matrices
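
The three measures listed above can be computed from paired gold and predicted labels. The following is a minimal pure-Python sketch; the function name `score_predictions` and the spam/ham labels are made up for illustration. NLTK has its own helpers for this, such as `nltk.ConfusionMatrix`, which are not used here so that the example stays self-contained.

```python
from collections import Counter

def score_predictions(gold, predicted, positive):
    """Return accuracy, plus precision and recall for the `positive` label,
    along with a confusion table keyed by (gold, predicted) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    true_pos = confusion[(positive, positive)]
    predicted_pos = sum(n for (g, p), n in confusion.items() if p == positive)
    actual_pos = sum(n for (g, p), n in confusion.items() if g == positive)
    return {
        'accuracy': correct / len(gold) if gold else 0.0,
        'precision': true_pos / predicted_pos if predicted_pos else 0.0,
        'recall': true_pos / actual_pos if actual_pos else 0.0,
        'confusion': confusion,
    }

# Hypothetical labels for six messages.
gold = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']
predicted = ['spam', 'ham', 'ham', 'spam', 'ham', 'spam']

result = score_predictions(gold, predicted, positive='spam')
print(result['accuracy'], result['precision'], result['recall'])
for (g, p), n in sorted(result['confusion'].items()):
    print('gold={:<5} predicted={:<5} count={}'.format(g, p, n))
```

Accuracy alone can mislead when one label dominates the data, which is why precision and recall are reported per label and the confusion table shows which labels get mixed up.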

## 6.4 Decision Trees

## 6.5 Naive Bayes Classifiers

## 6.6 Maximum Entropy Classifiers

## 6.7 Modeling Linguistic Patterns