# 垃圾邮件分类 - 使用朴素贝叶斯方式
- 朴素贝叶斯
- 特征工程基本流程
- sklearn
- jieba分词
- TfidfVectorizer 文本数值化;TF-IDF 统计权重，用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。
- CountVectorizer 文本数值化：统计数

## 朴素贝叶斯实现垃圾邮件分类的步骤
- （1）读取数据：读取文本文件数据
- （2）数据预处理（ETL）：数据ETL、格式整理、数据规约
- （3）特征工程：将文本文件进行解析词条向量等处理
- （4）模型训练（机器学习）：将训练数据集通过算法进行训练模型
- （5）模型验证/校验（模型评估）：使用测试数据集校验模型
- （6）模型使用（提供服务）：利用模型进行预测

In [5]:
import os
import pandas as pd
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB


In [6]:
def word_segment(line, stopwords_list):
    """
     param line: 一行文字
     param stopwords_list: 停止词列表
     return: 空格分隔的分词列表
    """
    word_list = []
    for word in jieba.cut(line):
        if word.isalpha() and word not in stopwords_list:
            word_list.append(word)
    return " ".join(word_list)

In [7]:
def feature_transform(texts):
    transformer = CountVectorizer()
    word_cnt_df = pd.DataFrame(transformer.fit_transform(texts).toarray())
    word_cnt_freq = pd.DataFrame(word_cnt_df.apply(sum, axis=0))
    word_keep = [word_cnt_freq.index[i] for i in range(word_cnt_freq.shape[0]) if word_cnt_freq.iloc[i, 0] > 5]
    features = word_cnt_df[word_keep]
    return word_cnt_df

In [8]:
def load_data(base_path):
    email_file_name = os.path.join(base_path, "chinesespam.xlsx")
    stopword_file_name = os.path.join(base_path, "stopwords.txt")
    stopwords_list = [i.strip() for i in open(stopword_file_name, 'r', encoding='utf8').readlines()]
    email_df = pd.read_excel(email_file_name, sheet_name=0)
    email_df['text'] = email_df.text.apply(lambda x: word_segment(x, stopwords_list))
    features = feature_transform(email_df['text'])
    labels = email_df['type']
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=5)
    return x_train, x_test, y_train, y_test


In [9]:
# 伯努利分布模型训练
def model_BernoulliNB(x_train, y_train):
    model = BernoulliNB()
    model.fit(x_train, y_train)
    return model

In [10]:
# 高斯分布模型训练
def model_GaussianNB(x_train, y_train):
    model = GaussianNB()
    model.fit(x_train, y_train)
    return model

In [11]:
# 多项式分布模型训练
def model_MultinomialNB(x_train, y_train):
    model = MultinomialNB()
    model.fit(x_train, y_train)
    return model

In [12]:
base_path = "D:/GitTest/badou/上课课件集合/公开课/课件/"
x_train, x_test, y_train, y_test = load_data(base_path)
model = model_MultinomialNB(x_train, y_train)
print(model.score(x_test, y_test))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\明柯\AppData\Local\Temp\jieba.cache
Loading model cost 1.082 seconds.
Prefix dict has been built successfully.


0.9555555555555556
