#### Classifying Spam Email with a Bayes Classifier

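**Dataset**
+ Spam email classification dataset
+ One sample per row
+ Features: 3000 common words; each feature is the number of times the word occurs in the email
+ Number of samples: 5172
+ Binary classification problem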

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("emails.csv")
data.head(10)

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0
5,Email 6,4,5,1,4,2,3,45,1,0,...,0,0,0,0,0,0,0,0,0,1
6,Email 7,5,3,1,3,2,1,37,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Email 8,0,2,2,3,1,2,21,6,0,...,0,0,0,0,0,0,0,1,0,1
8,Email 9,2,2,3,0,0,1,18,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Email 10,4,4,35,0,1,0,49,1,16,...,0,0,0,0,0,0,0,0,0,0


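**Data preparation**
+ Take columns 1 through 3000 of the table (the word counts; column 0 is the `Email No.` identifier) as the sample data
+ Take the last column as the class label
+ Feature transformation: convert each word's count in an email into a relative frequency, i.e. the count divided by the email's total count over all 3000 words
+ Randomly split the dataset, stratified by class: 75% as the training set and 25% as the test set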

In [2]:
from sklearn.model_selection import train_test_split

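# Columns 1..3000 hold the word counts (column 0 is the "Email No." identifier);
# the last column, "Prediction", is the class label.
X = data.iloc[:, 1:3001].to_numpy()
y = data.iloc[:, -1].to_numpy()

n, d = X.shape
# Convert counts to relative frequencies: divide each row by its total count.
X = X / X.sum(axis=1, keepdims=True)
# Stratified 75% / 25% split (test_size=0.25 is also the default).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)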

print("Shape of X:",X.shape, ", Shape of y",y.shape)
print("Shape of train:", X_train.shape, y_train.shape)
print("Shape of test:", X_test.shape, y_test.shape)

Shape of X: (5172, 3000) , Shape of y (5172,)
Shape of train: (3879, 3000) (3879,)
Shape of test: (1293, 3000) (1293,)
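The count-to-frequency step above can be tried on a small array. The following is a minimal sketch on made-up counts, not the email data. The helper name `counts_to_frequencies` and the guard for all-zero rows are assumptions added for illustration; the notebook divides directly and does not say whether the dataset contains such rows.

```python
import numpy as np

def counts_to_frequencies(counts):
    """Divide each row of word counts by that row's total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Hypothetical safeguard: a row with no counted words stays all-zero
    # instead of turning into NaN through 0/0.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals

# Toy counts: 3 "emails", 3 "words".
toy_counts = np.array([[2, 1, 1],
                       [0, 0, 0],
                       [5, 0, 5]])
toy_freq = counts_to_frequencies(toy_counts)
print(toy_freq)
```

Every non-empty row of the result sums to 1, so emails of different lengths become comparable.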


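**Training the Gaussian Bayes classifier**
+ Split the training set into two subsets according to the class labels
+ Estimate the prior probability of each class
+ Fit the parameters of a normal distribution to each class's training subset
+ Display the shapes of each class's mean vector and covariance matrix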

In [3]:
from sklearn.mixture import GaussianMixture

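# Split the training set by class label and estimate the class priors.
id0 = np.where(y_train == 0)[0]; n0 = id0.shape[0]
id1 = np.where(y_train == 1)[0]; n1 = id1.shape[0]
p0 = n0 / (n0 + n1); p1 = n1 / (n0 + n1)

# A one-component Gaussian mixture is a single normal distribution with a
# full covariance matrix; one is fitted separately to each class.
Gauss0 = GaussianMixture(n_components=1).fit(X_train[id0, :])
Gauss1 = GaussianMixture(n_components=1).fit(X_train[id1, :])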

print("Gauss 0:\n\tMean:", Gauss0.means_.shape)
print("\tCovariance:", Gauss0.covariances_.shape)
print("Gauss 1:\n\tMean:", Gauss1.means_.shape)
print("\tCovariance:", Gauss1.covariances_.shape)

Gauss 0:
	Mean: (1, 3000)
	Covariance: (1, 3000, 3000)
Gauss 1:
	Mean: (1, 3000)
	Covariance: (1, 3000, 3000)
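To make the estimated quantities concrete, here is a minimal sketch that computes the class priors, mean vectors, and maximum-likelihood covariance matrices in closed form with NumPy. It uses synthetic data (`X_toy`, `y_toy` are invented stand-ins), and it is an assumption that a one-component `GaussianMixture` fit lands close to these values; scikit-learn also adds a small regularization term (`reg_covar`) to the covariance diagonal, which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the training data: 200 samples, 3 features, 2 classes.
X_toy = rng.normal(size=(200, 3))
y_toy = (rng.random(200) < 0.3).astype(int)
X_toy[y_toy == 1] += 2.0  # shift class 1 so the two classes differ

params = {}
for label in (0, 1):
    X_c = X_toy[y_toy == label]          # samples belonging to this class
    prior = X_c.shape[0] / X_toy.shape[0]
    mean = X_c.mean(axis=0)
    # bias=True divides by n rather than n-1 (maximum-likelihood estimate).
    cov = np.cov(X_c, rowvar=False, bias=True)
    params[label] = (prior, mean, cov)
    print(label, round(prior, 3), mean.shape, cov.shape)
```

With 3000 features, each class's covariance matrix has 3000 × 3000 entries, which matches the shapes printed above.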


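**Testing the Gaussian Bayes classifier**
+ Evaluate the classifier on both the training set and the test set
+ Discriminant function: for each class, the log probability density of the sample under that class's normal distribution plus the log of the class prior
+ Compare the two discriminant values to obtain the predicted class of each sample
+ Compare the predicted classes with the true classes to compute the accuracy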

In [4]:
# Discriminant value per class: log density under the class Gaussian plus log prior.
pv0 = Gauss0.score_samples(X_train) + np.log(p0)
pv1 = Gauss1.score_samples(X_train) + np.log(p1)
# Predict class 1 where its discriminant is larger, class 0 otherwise.
predict_y = (pv0<pv1).astype(int)
train_score = (y_train==predict_y).astype(int).sum() / y_train.shape[0]

# Same procedure on the held-out test set.
pv0 = Gauss0.score_samples(X_test) + np.log(p0)
pv1 = Gauss1.score_samples(X_test) + np.log(p1)
predict_y = (pv0<pv1).astype(int)
test_score = (y_test==predict_y).astype(int).sum() / y_test.shape[0]

print("Train score:", train_score, "\nTest score:", test_score)

Train score: 0.9816963134828564 
Test score: 0.9613302397525135
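The decision rule used above is $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$, with the sample assigned to the class whose $g_i(x)$ is larger. As a rough illustration, the sketch below evaluates this rule with plain NumPy on a two-dimensional toy problem. The means, covariances, priors, and the helper `gaussian_log_density` are all hypothetical and only stand in for what `score_samples` returns in the notebook.

```python
import numpy as np

def gaussian_log_density(X, mean, cov):
    """Log of the multivariate normal density, evaluated for each row of X."""
    d = mean.shape[0]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis term diff^T cov^{-1} diff, computed row by row.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Hypothetical two-class toy problem (not the email data).
mean0, mean1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov0 = cov1 = np.eye(2)
prior0, prior1 = 0.7, 0.3

X_new = np.array([[0.2, -0.1], [2.9, 3.2]])
g0 = gaussian_log_density(X_new, mean0, cov0) + np.log(prior0)
g1 = gaussian_log_density(X_new, mean1, cov1) + np.log(prior1)
pred = (g0 < g1).astype(int)  # class 1 wherever its discriminant is larger
print(pred)
```

Working in log space avoids the numerical underflow that raw densities would suffer in a 3000-dimensional feature space.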


#### Naive Bayes Classifier

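**Training the naive Bayes classifier**
+ Fit a naive Bayes classifier on the training set
+ Display the class priors and the shapes of the per-class mean and variance arrays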

In [5]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB().fit(X_train,y_train)

print("Class priors:", nb.class_prior_)
print("Class means:", nb.theta_.shape)
print("Class variances:", nb.var_.shape)

Class priors: [0.7099768 0.2900232]
Class means: (2, 3000)
Class variances: (2, 3000)
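The `(2, 3000)` shapes reflect the naive Bayes assumption that features are independent within a class, so each class needs only one mean and one variance per feature instead of a full covariance matrix. Below is a minimal sketch of that idea on synthetic data. The names `theta`, `var`, `priors`, and `joint_log_likelihood` are chosen here for illustration, and the variance smoothing that scikit-learn's `GaussianNB` applies is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in data: 100 samples, 4 features, alternating binary labels.
X_toy = rng.normal(size=(100, 4))
y_toy = np.arange(100) % 2

# Per-class, per-feature mean and variance: one row per class.
theta = np.stack([X_toy[y_toy == c].mean(axis=0) for c in (0, 1)])
var = np.stack([X_toy[y_toy == c].var(axis=0) for c in (0, 1)])
priors = np.array([(y_toy == c).mean() for c in (0, 1)])

def joint_log_likelihood(X):
    """log P(class) + sum over features of log N(x_j; mean_j, var_j)."""
    cols = []
    for c in (0, 1):
        log_pdf = -0.5 * (np.log(2 * np.pi * var[c]) + (X - theta[c]) ** 2 / var[c])
        cols.append(np.log(priors[c]) + log_pdf.sum(axis=1))
    return np.column_stack(cols)

# Predicted class = column with the larger joint log-likelihood.
pred = joint_log_likelihood(X_toy).argmax(axis=1)
print(theta.shape, var.shape, pred.shape)
```

This is the same decision rule as the full Gaussian classifier, restricted to a diagonal covariance matrix.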


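**Testing the naive Bayes classifier**
+ Measure the classification accuracy of the naive Bayes classifier on the training set and on the test set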

In [6]:
print("Train score:", nb.score(X_train,y_train))
print("Test score:", nb.score(X_test,y_test))

Train score: 0.9956174271719516
Test score: 0.9621036349574633
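For a classifier, `score` reports mean accuracy: the fraction of samples whose predicted label equals the true label. A minimal sketch of that computation, on made-up label arrays:

```python
import numpy as np

# Hypothetical true and predicted labels for five samples.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Mean of the element-wise matches = fraction of correct predictions.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 4 of 5 labels match, so 0.8
```

This is the same quantity that the Gaussian Bayes cell computes by hand with `(y_test==predict_y).astype(int).sum() / y_test.shape[0]`.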
