## Naive Bayes

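**Problem statement:**

This is a text classification problem: given an article from the internet, decide which category it belongs to (technology, entertainment, sports, ...).

After converting the text to a vector, the task is to compute $P(y|x)$: the probability that a given sample $x$ belongs to class $y$.

So we iterate over all classes and pick the one with the highest probability; that class is the predicted label for sample $x$.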

We define $x = \langle x_1, x_2, \dots, x_n \rangle$ and $y \in \{ c_1, c_2, \dots, c_k \}$.
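
To make the vector $x$ concrete, here is a minimal bag-of-words sketch in plain Python. The two-document corpus and the helper `to_vector` are hypothetical and only for illustration; they are a simplified stand-in for a real vectorizer, not the one used later in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = ["the pens beat the devils", "new graphics card released"]

# Sorted vocabulary: each distinct word gets a fixed index i, giving feature x_i.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Map a document to word counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a vector of length `len(vocab)`, where entry $x_i$ is the number of times the $i$-th vocabulary word occurs.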

By Bayes' theorem:
$$
P(y|x) = \frac{P(x|y)P(y)}{P(x)}
$$

That is, we want:
$\mathop{\arg\,\max}\limits_y P(y|x)$

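Because $P(x)$ is the same for every class once the sample is fixed, it does not affect the maximization and can be dropped, leaving: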
$\mathop{\arg\,\max}\limits_y P(x|y)P(y)$
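
A small numeric sketch of why dropping $P(x)$ is safe: dividing every score by the same constant does not change which class wins. The priors and likelihoods below are made-up numbers, not values from the dataset.

```python
# Hypothetical numbers for one fixed sample x and two classes.
priors = {"tech": 0.6, "sports": 0.4}          # P(y)
likelihoods = {"tech": 0.02, "sports": 0.05}   # P(x | y)

# Unnormalized scores P(x | y) * P(y).
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Evidence P(x) = sum over classes of P(x | y) * P(y).
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Both ways of ranking select the same class, so the classifier only needs $P(x|y)P(y)$.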

Next, by the naive assumption, the features in the individual dimensions of $x$ are conditionally independent given the class, so:
$P(x_1, x_2, ..., x_n|y) = P(x_1|y)P(x_2|y)...P(x_n|y)$
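
The factorization can be sketched numerically as follows. The per-feature probabilities are hypothetical. Working in log space is an addition that the text above does not mention; it is a common way to avoid floating-point underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical per-feature conditional probabilities P(x_i | y) for one class.
feature_probs = [0.1, 0.05, 0.2, 0.01]

# Direct product, as in the factorization above.
joint = 1.0
for p in feature_probs:
    joint *= p

# Equivalent computation in log space: log of a product is a sum of logs.
log_joint = sum(math.log(p) for p in feature_probs)

# With many features the direct product underflows to 0.0; the log sum stays finite.
many = [1e-5] * 1000
product = 1.0
for p in many:
    product *= p
log_sum = sum(math.log(p) for p in many)

print(joint, math.exp(log_joint))
print(product, log_sum)
```

Since the logarithm is monotonic, taking the arg max over log scores picks the same class as taking it over the raw products.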

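Finally, each conditional probability such as $P(x_1|y_1)$ can be estimated from counts in the training samples, where $\#(\cdot)$ denotes the number of samples satisfying the condition:
$P(x=x_1|y=y_1)=\frac{P(x=x_1, y=y_1)}{P(y=y_1)}=\frac{\#(x=x_1, y=y_1)}{\#(y=y_1)}$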
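
The counting estimate can be sketched on a toy dataset. The samples and the helper `cond_prob` are hypothetical; each sample is reduced to a single feature value and a class label to keep the example short.

```python
from collections import Counter

# Hypothetical toy data: (feature value, class label) pairs.
samples = [
    ("goal", "sports"), ("goal", "sports"), ("chip", "tech"),
    ("goal", "tech"), ("chip", "tech"), ("match", "sports"),
]

joint_counts = Counter(samples)                         # #(x = v, y = c)
class_counts = Counter(label for _, label in samples)   # #(y = c)

def cond_prob(value, label):
    """Estimate P(x = value | y = label) as a ratio of counts."""
    return joint_counts[(value, label)] / class_counts[label]

print(cond_prob("goal", "sports"))   # 2 of the 3 sports samples
print(cond_prob("match", "tech"))    # pair never observed
```

A pair that never occurs in the training data gets an estimate of exactly zero, which would zero out the whole product. Practical implementations usually add smoothing to the counts for this reason; for example, `MultinomialNB` applies additive smoothing through its `alpha` parameter.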

In [5]:
# Load the 20 Newsgroups dataset (news posts in 20 categories)
# 18846 articles in total
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
print(len(news.data))
print(news.data[0])
print(news.target)
print(news.target_names)

18846
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


[10  3 17 ...,  3  1  7]
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.

In [6]:
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Hold out 25% of the articles as a test set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

# Convert the raw text into word-count vectors
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

# Learn the vocabulary on the training set only, then reuse it for the test set
X_train = vec.fit_transform(X_train)
X_test  = vec.transform(X_test)

# Train a multinomial naive Bayes model
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
mnb_y_predict = mnb.predict(X_test)

# Evaluate performance
from sklearn.metrics import classification_report

print("The accuracy of MultinomialNB:", mnb.score(X_test, y_test))
print(classification_report(y_test, mnb_y_predict, target_names=news.target_names))



The accuracy of MultinomialNB: 0.839770797963
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med     