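### Contents
- Downloading data with a social network's API
- Transformers for processing text
- The Naive Bayes classifier
- Saving and loading datasets with JSON
- Extracting features from text with the NLTK library
- Evaluating classification results with the F1 score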

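### 1. Disambiguation
- 1. Text is often described as an unstructured format.
- 2. One difficulty of text mining comes from ambiguity; resolving that ambiguity is usually called disambiguation.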

In [36]:
# !pip install tweepy twitter nltk -i https://pypi.tuna.tsinghua.edu.cn/simple

In [49]:
import twitter
import tweepy
from tweepy import OAuthHandler
import json

import warnings
warnings.filterwarnings('ignore')

In [10]:
consumer_key = "hEDGEdp4p2Tp8U6h5vpOGOpm9"
consumer_secret = "4C8P1YeOtsjebf6dYyTnRCWae3GWWoF9n6VLnQOKTEorlK3KbT"
access_token = "846992744690208768-QQKUada7IjhguoQyjrARQ99nCkHVMop"
access_token_secret = "5zfmqKV2eLPO2SU0XoW3DKtOUqwmvpPB1r7KoQ48OQnq8"
# Client from the `twitter` package (created here but not used below)
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

# Client from the `tweepy` package
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
# Print the text of the first two tweets from the home timeline
for status in tweepy.Cursor(api.home_timeline).items(2):
    print(status.text)

In [42]:
tweets = []
with open('./data/python_tweets.json', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip() == '': continue
        tweets.append(json.loads(line)['text'])
print("Loaded {} tweets".format(len(tweets)))
labels = []
with open('./data/python_classes.json', 'r', encoding='utf-8') as f:
    labels = json.load(f)

Loaded 400 tweets


In [43]:
def get_next_tweet():
    # `tweets` already holds plain text strings, so return the next unlabeled one directly
    return tweets[len(labels)]
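
The loading cell above expects one JSON object per line for the tweets and a single JSON list for the labels. A minimal sketch of that round trip follows, assuming hypothetical sample data, an assumed 0/1 label encoding, and a temporary directory instead of the notebook's `./data/` files:

```python
import json
import os
import tempfile

# Hypothetical sample data; the notebook itself reads ./data/python_tweets.json.
sample = [{"text": "Learning Python with pandas"}, {"text": "A python bit my cousin"}]
sample_labels = [1, 0]  # assumed encoding: 1 = relevant, 0 = not relevant

tmp_dir = tempfile.mkdtemp()
tweets_path = os.path.join(tmp_dir, "tweets.json")
labels_path = os.path.join(tmp_dir, "labels.json")

# Save: one JSON object per line for tweets, one JSON list for labels.
with open(tweets_path, "w", encoding="utf-8") as f:
    for tweet in sample:
        f.write(json.dumps(tweet))
        f.write("\n\n")  # the blank line is skipped by the loader below

with open(labels_path, "w", encoding="utf-8") as f:
    json.dump(sample_labels, f)

# Load: skip blank lines and keep only the text field.
loaded_tweets = []
with open(tweets_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip() == "":
            continue
        loaded_tweets.append(json.loads(line)["text"])

with open(labels_path, "r", encoding="utf-8") as f:
    loaded_labels = json.load(f)

print(loaded_tweets, loaded_labels)
```

Writing one object per line lets the file be appended to and read back incrementally, which is why the loader parses line by line rather than calling `json.load` on the whole file.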

### Classify Twitter Data

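### Naive Bayes
- Bayes' theorem<br>
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

- Example: the probability that an email containing the word "drugs" is spam
    * P(A) is the probability that an email is spam; this is the prior probability
    * P(B) is the probability that an email contains "drugs"; when computing P(B) we do not care whether the email is spam
    * P(B|A) is the probability that a spam email contains "drugs", estimated as the number of spam emails containing "drugs" divided by the total number of spam emails
    * P(A|B) = P(B|A)P(A) / P(B)

- Let C denote a class and D a document in the dataset
    * P(C) is the probability of a given class
    * P(D) is the probability of a given document; it involves many features and is hard to compute, but it is the same for every class
    * P(D|C) is the probability of observing document D given class C. Naive Bayes assumes the features are independent of one another given the class, so we compute the probability of each feature D1, D2, D3, ... separately and multiply the results<br>
    P(D|C) = P(D1|C) * P(D2|C) *...*P(Dn|C)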
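
The spam example can be worked through with counts. The sketch below uses made-up numbers (100 emails, 30 of them spam, and so on) purely for illustration; none of these figures come from a real dataset.

```python
# Hypothetical counts, chosen only to illustrate Bayes' theorem.
total_emails = 100
spam_emails = 30         # emails labelled as spam
spam_with_drugs = 24     # spam emails that contain "drugs"
emails_with_drugs = 27   # all emails that contain "drugs", spam or not

p_a = spam_emails / total_emails             # P(A): prior probability of spam
p_b = emails_with_drugs / total_emails       # P(B): probability of containing "drugs"
p_b_given_a = spam_with_drugs / spam_emails  # P(B|A): likelihood

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```

As a sanity check, the result equals `spam_with_drugs / emails_with_drugs`, the fraction of "drugs" emails that are spam when counted directly.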
    

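##### Example
- Suppose the dataset contains the following sample, represented with binary features: [1, 0, 0, 1]. In the training set, 75% of the samples belong to class 0 and 25% to class 1, and the likelihood of each feature for each class is:
    * Class 0: [0.3, 0.4, 0.4, 0.7]
    * Class 1: [0.7, 0.3, 0.4, 0.9]
    * Read these as: in class 0, 30% of the samples have a value of 1 for the first feature. For a feature whose value is 0, the factor is one minus the likelihood
- P(C=0) = 0.75
    * P(D) is not needed, so we do not compute it
    * P(D|C=0) = P(D1|C=0) * P(D2|C=0) * P(D3|C=0) * P(D4|C=0)
    * = 0.3 * 0.6 * 0.6 * 0.7
    * = 0.0756
- P(C=0|D) ∝ P(C=0) * P(D|C=0) = 0.75 * 0.0756 = 0.0567
- P(C=1) = 0.25
    * P(D|C=1) = P(D1|C=1) * P(D2|C=1) * P(D3|C=1) * P(D4|C=1)
    * = 0.7 * 0.7 * 0.6 * 0.9
    * = 0.2646
- P(C=1|D) ∝ P(C=1) * P(D|C=1) = 0.25 * 0.2646 = 0.06615
- Since 0.06615 > 0.0567, the sample is assigned to class 1
- Note: P(C=0|D) + P(C=1|D) should normally equal 1, but because the division by P(D) was left out, the values computed here do not sum to 1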
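
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch; the helper name `unnormalized_posterior` is made up for illustration and is not part of any library.

```python
def unnormalized_posterior(sample, prior, likelihoods):
    """Return P(C) * P(D1|C) * ... * P(Dn|C) for binary features."""
    score = prior
    for value, p in zip(sample, likelihoods):
        # p is P(feature = 1 | C); use 1 - p when the feature is 0.
        score *= p if value == 1 else 1 - p
    return score

sample = [1, 0, 0, 1]
score_0 = unnormalized_posterior(sample, 0.75, [0.3, 0.4, 0.4, 0.7])
score_1 = unnormalized_posterior(sample, 0.25, [0.7, 0.3, 0.4, 0.9])

# Dividing by the sum plays the role of P(D) and yields values that sum to 1.
total = score_0 + score_1
posterior_0 = score_0 / total
posterior_1 = score_1 / total

predicted = 0 if score_0 > score_1 else 1
print(score_0, score_1, posterior_0, posterior_1, predicted)
```

Up to floating-point rounding, the two scores match the values 0.0567 and 0.06615 computed by hand. Normalizing them gives roughly 0.46 and 0.54, and the predicted class is the same either way, which is why P(D) can be skipped for classification.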

In [66]:
import numpy as np
from sklearn.base import TransformerMixin
from nltk import word_tokenize
import nltk

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [45]:
n_samples = min(len(tweets), len(labels))
sample_tweets = [t.lower() for t in tweets[:n_samples]]
labels = labels[:n_samples]
y_true = np.array(labels)

In [71]:
class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
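        # Each dictionary holds all the words of one message: the word is the key and the value True marks that the word occurs in that message. Words missing from the dictionary do not occur in the message.
        # False could be stored for absent words instead, but that would waste storage.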
        return [{word: True for word in word_tokenize(document)} for document in X]
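
To make the transformer's output concrete, the sketch below builds the same dictionary representation in plain Python and then turns it into binary rows. It is an illustration only: `str.split` stands in for `word_tokenize`, the two messages are hypothetical, and the manual column mapping only approximates what `DictVectorizer` does (one column per key, in sorted order).

```python
documents = ["python is great", "a python bit me"]  # hypothetical messages

# Same shape as the output of NLTKBOW.transform, but with whitespace tokenization.
bags = [{word: True for word in doc.split()} for doc in documents]

# One column per distinct word, in sorted order.
feature_names = sorted({word for bag in bags for word in bag})
column = {name: i for i, name in enumerate(feature_names)}

# Binary matrix: 1 if the word occurs in the document, else 0.
matrix = []
for bag in bags:
    row = [0] * len(feature_names)
    for word in bag:
        row[column[word]] = 1
    matrix.append(row)

print(feature_names)
for row in matrix:
    print(row)
```

Storing only the words that occur keeps each dictionary small; the zeros appear only once the dictionaries are expanded into a matrix.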

In [72]:
# nltk.download('punkt')
pipeline = Pipeline([('bag-of-words', NLTKBOW()), ('vectorizer', DictVectorizer()), ('naive-bayes', BernoulliNB())])
scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))

Score: 0.000


In [73]:
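# Run the first two pipeline steps by hand to inspect the intermediate output
bow = NLTKBOW()
vectorizer = DictVectorizer()
clf = BernoulliNB()
X_bow = bow.fit_transform(sample_tweets)  # list of {word: True} dictionaries
X_vec = vectorizer.fit_transform(X_bow)   # sparse matrix, one column per word
# The score below still evaluates the full pipeline, not clf on X_vec
scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))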

Score: 0.000


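##### Evaluating with the F1 score
- Accuracy is widely used, easy to understand, and simple to compute, but it is a poor measure when the classes are imbalanced
- The F1 score is defined per class and combines precision and recall
    * Precision is the proportion of samples predicted as a given class that actually belong to it
    * Recall is the proportion of samples actually in a given class that were correctly predicted as that class
- The F1 score is the harmonic mean of precision and recall<br>
$$ F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$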
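
The definitions above translate directly into code. The following is a minimal sketch with hypothetical labels; the helper `precision_recall_f1` is written for illustration, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: report 0.0 whenever a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical imbalanced labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# A classifier that always predicts the majority class.
always_negative = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, always_negative))
```

The second case shows why accuracy can mislead on imbalanced data: always predicting the majority class is right for five of the eight samples, yet its F1 score for the positive class is 0.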

##### Getting more useful features from the model

In [84]:
model = pipeline.fit(sample_tweets, labels)
nb = model.named_steps['naive-bayes']
top_features = np.argsort(-nb.feature_log_prob_[1])[:50]

# Look up the names of the top features through DictVectorizer's feature_names_ attribute
dv = model.named_steps['vectorizer']
for i, feature_index in enumerate(top_features):
    print(i, dv.feature_names_[feature_index], nb.feature_log_prob_[1][feature_index])

0 : -0.2231435513142097
1 @ -0.5108256237659905
2 https -0.5108256237659905
3 rt -0.5108256237659905
4 ! -0.916290731874155
5 📰 -0.916290731874155
6 initiated -0.916290731874155
7 https… -0.916290731874155
8 getting -0.916290731874155
9 from -0.916290731874155
10 docprogrammer -0.916290731874155
11 called -0.916290731874155
12 murderous -0.916290731874155
13 by -0.916290731874155
14 away -0.916290731874155
15 army -0.916290731874155
16 and -0.916290731874155
17 alert -0.916290731874155
18 ] -0.916290731874155
19 [ -0.916290731874155
20 docker -0.916290731874155
21 news -0.916290731874155
22 nginx -0.916290731874155
23 ngi… -0.916290731874155
24 🏃 -0.916290731874155
25 🇳🇬 -0.916290731874155
26 新着 -0.916290731874155
27 ☞ -0.916290731874155
28 — -0.916290731874155
29 with -0.916290731874155
30 vaccination -0.916290731874155
31 tutorial -0.916290731874155
32 stay -0.916290731874155
33 started -0.916290731874155
34 sam_ezeh -0.916290731874155
35 redis -0.916290731874155
36 pythonによるスクレイピング初

In [94]:
import joblib
import os
output_filename = os.path.join('./data/', "python_context.pkl")
joblib.dump(model, output_filename)

['./data/python_context.pkl']