我们构建朴素贝叶斯模型进行垃圾邮件分类的步骤如下：
1. 创建词典，使用已经处理过的数据集Ling-spam
2. 提取特征
3. 训练分类器
4. 测试分类器效果

## 创建词典
我们打开数据集中的一封样本邮件可以得到如下信件内容：

In [4]:
sample_email = open("./lingspam_public/lemm_stop/part1/3-1msg1.txt","r")
print(sample_email.read())

Subject: re : 2 . 882 s - > np np

> deat : sun , 15 dec 91 2 : 25 : 2 est > : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 query > > wlodek zadrozny ask " anything interest " > construction " s > np np " . . . second , > much relate : consider construction form > discuss list late reduplication ? > logical sense " john mcnamara name " tautologous thus , > level , indistinguishable " , , here ? " . ' john mcnamara name ' tautologous support those logic-base semantics irrelevant natural language . sense tautologous ? supplies value attribute follow attribute value . fact value name-attribute relevant entity ' chaim shmendrik ' , ' john mcnamara name ' false . tautology , . ( reduplication , either . )



In [7]:
# python读文件的另一种方式
with open('./lingspam_public/lemm_stop/part1/3-1msg2.txt') as f:
    print(f.read())

# 再读取一封没有经过文本清洗处理的信件
with open('./lingspam_public/bare/part1/3-1msg2.txt') as f:
    print(f.read())

Subject: s - > np + np

discussion s - > np + np remind ago read , source forget , critique newsmagazine ' unique tendency write style , most writer overly " cute " . one item tersely put down follow : " 's favorite : colon . " - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - lee hartman ga5123 @ siucvmb . bitnet department foreign language southern illinoi university carbondale , il 62901 u . s . .

Subject: s - > np + np

the discussion of s - > np + np reminds me that some years ago i read , in a source now forgotten , a critique of some newsmagazines ' unique tendencies in writing style , most of which the writer found overly " cute " . one item was tersely put down as follows : " time 's favorite : the colon . " - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - lee hartman ga5123 @ siucvmb . bitnet department of foreign languages southern illinois university carbondale , il 62901 u . s . a .



从上面两封邮件可以看出，信件的内容是从第三行开始的。我们的第一步是建立一个词典，词典中单词和它的出现的频率一一映射。我们先读取出训练数据集中信件中的所有单词，然后使用Counter类计算出，每个词对应的次数，存入dictionary中。

In [29]:
import os
import collections
def make_Dictionary(train_dir):
    #emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    emails = []
    emails.append(train_dir)
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:  #Body of email is only 3rd line of text file
                    words = line.split()
                    all_words += words
     
    dictionary = collections.Counter(all_words)
    
    # clear word that is not alpha, like numbers
    # and clear word length = 1, like '>'
    # make the frequency of that words to be zero
    list_to_remove = dictionary.keys()
    for item in list_to_remove:
        if item.isalpha() == False: 
            dictionary[item]=0
        elif len(item) == 1:
            dictionary[item]=0
    # end of clear
    # choose the words which frequecy > 2 as feature
    new_dict = []
    for word,freq in dictionary.items():
        if freq > 2:
            new_dict.append({word,freq})
    print(new_dict)
    #dictionary = dictionary.most_common(3)
    return dictionary

In [30]:
# 测试make_Dictionary函数
# 针对一封信进行测试
print(make_Dictionary('./lingspam_public/lemm_stop/part1/3-1msg1.txt'))

[{3, 'john'}, {3, 'mcnamara'}, {3, 'name'}, {'tautologous', 3}, {'value', 3}]
Counter({'john': 3, 'mcnamara': 3, 'name': 3, 'tautologous': 3, 'value': 3, 'construction': 2, 'np': 2, 'reduplication': 2, 'sense': 2, 'attribute': 2, 'deat': 1, 'sun': 1, 'dec': 1, 'est': 1, 'michael': 1, 'mmorse': 1, 'yorku': 1, 'ca': 1, 'subject': 1, 're': 1, 'query': 1, 'wlodek': 1, 'zadrozny': 1, 'ask': 1, 'anything': 1, 'interest': 1, 'second': 1, 'much': 1, 'relate': 1, 'consider': 1, 'form': 1, 'discuss': 1, 'list': 1, 'late': 1, 'logical': 1, 'thus': 1, 'level': 1, 'indistinguishable': 1, 'here': 1, 'support': 1, 'those': 1, 'semantics': 1, 'irrelevant': 1, 'natural': 1, 'language': 1, 'supplies': 1, 'follow': 1, 'fact': 1, 'relevant': 1, 'entity': 1, 'chaim': 1, 'shmendrik': 1, 'false': 1, 'tautology': 1, 'either': 1, '>': 0, ':': 0, ',': 0, '15': 0, '91': 0, '2': 0, '25': 0, '<': 0, '@': 0, 'vm1': 0, '.': 0, '864': 0, '"': 0, 's': 0, '?': 0, "'": 0, 'logic-base': 0, 'name-attribute': 0, '(': 0, ')

推广一封信至整个数据集

In [42]:
import os
import collections
def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:  #Body of email is only 3rd line of text file
                    words = line.split()
                    all_words += words
     
    dictionary = collections.Counter(all_words)
    
    # clear word that is not alpha, like numbers
    # and clear word length = 1, like '>'
    # make the frequency of that words to be zero
    list_to_remove = dictionary.keys()
    for item in list_to_remove:
        if item.isalpha() == False: 
            dictionary[item]=0
        elif len(item) == 1:
            dictionary[item]=0
    # end of clear
    # choose the words which frequecy > 20 as feature
    # copy new_dict
    new_dict = dictionary.copy()
    # delete the freq < 20 in new dict
    for word,freq in dictionary.items():
        if freq < 20:
            del new_dict[word]
    return new_dict

In [44]:
# 对目录'./lingspam_public/lemm_stop/part1/'中的所有信件建立字典
# 打印字典的大小
# 打印出前十个高频词
dictionary = make_Dictionary('./lingspam_public/lemm_stop/part1/')
print(len(dictionary))
print(dictionary.most_common(10))

546
[('language', 520), ('university', 296), ('one', 290), ('de', 253), ('linguistic', 234), ('work', 232), ('email', 216), ('information', 204), ('order', 203), ('address', 200)]


由上可知，我们的高频词有546个。我们用这些词作为特征计算出训练集中每一封信件的单词数向量，这个向量有546维。通常一封邮件的词频向量大部分维度都可能为零。
我们要生成一个包含所有训练集中邮件的特征矩阵，矩阵的行数表示邮件数，列数表示特征向量长度。矩阵的值$M_i,j$表示，第i封信件中是否出现了第j个敏感词，值为一表示出现，值为零表示不出现。

In [59]:
import numpy as np
def extract_features(mail_dir,dictionary): 
    files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files),len(dictionary)))
    print(len(files))
    docID = 0;
    for fil in files:
      with open(fil) as fi:
        for i,line in enumerate(fi):
          if i == 2:
            words = line.split()
            for word in words:
              wordID = 0
              for i,d in enumerate(dictionary):
                if d[0] == word:
                  wordID = i
                  features_matrix[docID,wordID] = 1
                  #features_matrix[docID,wordID] = words.count(word)
        docID = docID + 1    
    return features_matrix

matrix = extract_features('./lingspam_public/lemm_stop/part1/',dictionary)

289


In [71]:
# 打印出倒数第二份信的词向量
#print(matrix[-2])
# 数一数有多少个高频次在这封信中存在
print(matrix[-2].sum(axis=0))
# 竟然有182个高频词在这封信中出现
print(matrix.shape)
print(matrix[:3].shape)

182.0
(289, 546)
(3, 546)


# 训练模型参数
从训练集的数据可以知道，训练集每封邮件的标签，在'./lingspam_public/lemm_stop/part1/'中有289封邮件，其中241封为正常邮件，48封为垃圾邮件，所以我们可以定义训练集的标签数组，前241为0，后48为1，其中0表示正常邮件，1表示垃圾邮件。

In [None]:
# 还是用现成的模型吧，自己写的模型有问题
train_labels = np.zeros(289)
train_labels[-48:]=1
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
model = MultinomialNB()
model.fit(matrix, train_labels)


In [78]:
train_labels = np.zeros(289)
train_labels[-48:]=1
print(train_labels.shape)
n_1=train_labels.sum(axis=0)
print(n_1)
n=len(train_labels)
print(n)
P_Y = np.zeros(2)
P_Y[1] = n_1/n
P_Y[0] = 1 - P_Y[1]
print(P_Y)
# 获得了P(Y)的先验概率，Y就是本例中的邮件类别C

(289,)
48.0
289
[0.83391003 0.16608997]


In [None]:
# 没有自己没有做出来，感觉不对
# 如果计算最大后验概率，直接计算与取对数计算，感觉数值有点奇怪
# 没有继续往下写了

求条件概率$P(X\mid Y=是垃圾邮件)$ 

In [104]:
print(matrix[:241].sum(axis=0).shape)
# 再求条件概率P(X|Y) 
k = len(dictionary)
P_X_Y = np.zeros((2,k))
# 当邮件为垃圾邮件时，高频词在信中存在时的条件概率估计
P_X_Y[1] = (matrix[-48:].sum(axis=0)+1)/(n_1+n)
# 当邮件为垃圾邮件时，高频词不在信中存在时的条件概率
P_X_Y[0] = 1-P_X_Y[1]

P_X_Y_0 = np.zeros((2,k))
# 当邮件为正常时，高频词在信中存在的条件概率估计
P_X_Y_0[1] = (matrix[:241].sum(axis=0)+1)/(n-n_1+n)
# 当邮件为正常邮件时，高频词不在信中存在时的条件概率
P_X_Y_0[0] = 1-P_X_Y_0[1]
# 打印当邮件为垃圾邮件时，前10个高频词在信中存在时的条件概率
print(P_X_Y[1,1:10])
# 当邮件为垃圾邮件时，高频词不在信中的概率
print(P_X_Y[0,1:10])
# 打印当邮件为正常邮件时，前10个高频词在信中存在时的条件概率
print(P_X_Y_0[1,1:10])
# 当邮件为正常时，高频词不在信中的概率
print(P_X_Y_0[0,1:10])
# 预测当邮件正常时，高频词在信中存在的概率比邮件为垃圾邮件时的概率要低

(546,)
[0.02967359 0.06824926 0.0148368  0.00890208 0.00296736 0.00296736
 0.00296736 0.02967359 0.06824926]
[0.97032641 0.93175074 0.9851632  0.99109792 0.99703264 0.99703264
 0.99703264 0.97032641 0.93175074]
[0.07358491 0.09622642 0.03962264 0.00377358 0.00188679 0.00188679
 0.00188679 0.07358491 0.09622642]
[0.92641509 0.90377358 0.96037736 0.99622642 0.99811321 0.99811321
 0.99811321 0.92641509 0.90377358]


In [103]:
# test 条件概率和
print(P_X_Y_0.shape)

(2, 546)
29.166037735849056


np.sum(axis=0)表示按列的方向求和

# 使用模型进行分类
模型的两个参数已经估计完成，使用测试集对模型进行测试。
我们选用'./lingspam_public/lemm_stop/part10/'的所有信件作为测试数据集。我们已经知道测试数据集中有291封信，其中垃圾邮件49封，正常邮件242封。

In [85]:
# 提取测试数据集中的特征词频向量组成的矩阵
# 词典还是使用测试集中找出的高频词
dictionary = make_Dictionary('./lingspam_public/lemm_stop/part1/')
test_matrix = extract_features('./lingspam_public/lemm_stop/part10/',dictionary)
# 打印出倒数第二份信的词向量
#print(matrix[-2])
# 数一数有多少个高频次在这封信中存在
print(test_matrix[-2].sum(axis=0))
# 竟然有182个高频词在这封信中出现
print(test_matrix.shape)
print(test_matrix[:3].shape)

291
0.0
(291, 546)
(3, 546)


In [139]:
# 数一数有多少个高频次在这封信中存在
print(test_matrix[-34].sum(axis=0))

21.0


In [147]:
# 使用贝叶斯公式求解，给定一封信件，输出它的概率
# 对测试集特征矩阵求反，可以方便点乘计算
# 求反，将其中为1的变为0，为0的变为1
_test_matrix = abs(test_matrix-1)
print(_test_matrix[-1,1:10])

[0. 0. 0. 1. 1. 1. 1. 0. 0.]


In [118]:
# 求邮件是垃圾邮件的概率
A = np.dot(_test_matrix, np.log(P_X_Y[0].T))\
+np.dot(test_matrix, np.log(P_X_Y[1].T))+np.log(P_Y[1])
B = np.dot(_test_matrix, np.log(P_X_Y_0[0].T))\
+np.dot(test_matrix, np.log(P_X_Y_0[1].T))+np.log(P_Y[0])

In [143]:
print(A.shape)
print(np.exp(A[-34]))
print(np.exp(B[-34]))

(291,)
1.3023263119408707e-43
5.218506424368712e-41


In [157]:
C = np.exp(B)/np.exp(A)
C[C>3]=1
C[C<=3]=0
print(len(C[C>3]))

0


  """Entry point for launching an IPython kernel.
  """Entry point for launching an IPython kernel.
  
  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.


In [158]:
# 我们已经知道测试数据集中有291封信，其中垃圾邮件49封，正常邮件242封。
test_labels=np.zeros(291)
test_labels[-49:]=1
D=test_labels-C
len(D[D==0])


213

In [159]:
len(D[D>0])


  """Entry point for launching an IPython kernel.


46

In [160]:
len(D[D<0])

  """Entry point for launching an IPython kernel.


0