## Implementing the Naive Bayes Algorithm in Python

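### 1. Handling Data
First we need to load the data. The CSV file has no header row, so every line is a data record. We can read it with the `csv` module and convert each field to a float.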

In [1]:
import csv

def load_csv(filename):
    with open(filename) as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [float(x) for x in dataset[i]]
    return dataset

We can test the `load_csv` function on the Pima Indians diabetes dataset.

In [2]:
filename = 'E:/Kaggle/pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print (('Loaded data file {0} with {1} rows').format(filename, len(dataset)))

Loaded data file E:/Kaggle/pima-indians-diabetes.data.csv with 768 rows
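
The loading step can also be tried without the data file. The sketch below is only an illustration: `parse_csv_text` is a hypothetical helper (not part of the original code) and the two records are made up. It parses header-less CSV text from memory and, as an added assumption, skips blank lines so that a trailing newline does not produce an empty row.

```python
import csv
import io

def parse_csv_text(text):
    """Parse header-less CSV text into a list of rows of floats (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    # csv.reader yields an empty list for a blank line; skip those rows.
    return [[float(value) for value in row] for row in reader if row]

# Two made-up records in an all-numeric, header-less layout.
sample = "6,148,72,1\n1,85,66,0\n\n"
rows = parse_csv_text(sample)
print(rows)  # [[6.0, 148.0, 72.0, 1.0], [1.0, 85.0, 66.0, 0.0]]
```

Note that the class label in the last column also becomes a float (`1.0` or `0.0`), which is what the later steps use as the dictionary key for each class.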


Next we split the data into a training set and a test set, with roughly 66% of the rows used for training and the remaining 34% for testing.

In [3]:
import random

def split_dataset(dataset, ratio):
    trainsize = int(len(dataset) * ratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

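We can try this on a dataset of 5 instances and look at the resulting training and test sets. With a ratio of 0.66, `int(5 * 0.66)` is 3, so 3 rows go to the training set and 2 to the test set. Because the rows are picked at random, the exact split changes from run to run.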

In [4]:
dataset = [[1], [2], [3], [4], [5]]
ratio = 0.66
train, test = split_dataset(dataset, ratio)
print (('Split {0} rows into train with {1} and test with {2}').format(len(dataset), train, test))

Split 5 rows into train with [[5], [1], [4]] and test with [[2], [3]]
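
Since the split is random, rerunning the cell gives a different partition each time. If a repeatable split is wanted, one option is to pass a seed. The following is a minimal sketch under that assumption; `split_dataset_seeded` and its `seed` parameter are hypothetical names, and it shuffles a copy instead of popping random indices as `split_dataset` does.

```python
import random

def split_dataset_seeded(dataset, ratio, seed):
    """Shuffle a copy with a private RNG, then cut it at int(len * ratio).

    Hypothetical variant of split_dataset, not the original function.
    """
    rng = random.Random(seed)   # private generator; the global random state is untouched
    shuffled = list(dataset)    # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * ratio)
    return shuffled[:train_size], shuffled[train_size:]

data = [[1], [2], [3], [4], [5]]
train, test = split_dataset_seeded(data, 0.66, seed=42)
print(len(train), len(test))  # 3 2
```

Calling the function again with the same seed returns the same two lists, which makes later accuracy figures comparable between runs.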


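### 2. Summarizing Data

#### Separate Data by Class
First, we split the training set by class value so that statistics can be computed for each class. We do this by building a dictionary that maps each class value to the list of instances belonging to it.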

In [5]:
def separate_by_class(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

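The `separate_by_class` function assumes that the last attribute of each instance is the class value. It returns a dictionary mapping each class value to its list of instances.

We can test `separate_by_class` with some sample data.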

In [6]:
dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1]]
separated = separate_by_class(dataset)
print (('Separated instances: {0}').format(separated))

Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}
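
The same grouping can be written a little more compactly with `collections.defaultdict`. This is only an alternative sketch; `separate_by_class_dd` is a hypothetical name and it keeps the same assumption that the class value is the last element of each row.

```python
from collections import defaultdict

def separate_by_class_dd(dataset):
    """Group rows by their last element, assumed to be the class value."""
    separated = defaultdict(list)   # missing keys start as an empty list
    for row in dataset:
        separated[row[-1]].append(row)
    return dict(separated)          # return a plain dict, like the original

groups = separate_by_class_dd([[1, 20, 1], [2, 21, 0], [3, 22, 1]])
print(groups)  # {1: [[1, 20, 1], [3, 22, 1]], 0: [[2, 21, 0]]}
```

On Python 3.7 and later, dictionaries keep insertion order, so the keys appear in the order the classes are first seen. The order in which they print may therefore differ from the output shown above, while the contents are the same.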


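#### Calculate the Mean and Standard Deviation
For each class we need the mean of every attribute. It serves as the mean of the Gaussian distribution when we compute probabilities.

We also need the standard deviation of every attribute within each class, which serves as the standard deviation of the Gaussian distribution. The standard deviation is the square root of the variance. The code below uses the sample variance, dividing by N-1.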

In [7]:
import math

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg ,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

We can test both functions by computing the mean and standard deviation of the numbers 1 to 5.

In [8]:
numbers = [1, 2, 3, 4, 5]
print (('Summary of {0}: mean={1}, stdev={2}').format(numbers, mean(numbers), stdev(numbers)))

Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.5811388300841898
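
As a cross-check, the standard library's `statistics` module provides both flavours of standard deviation. The sketch below suggests that the hand-written `stdev` corresponds to `statistics.stdev` (sample, divides by N-1) and not to `statistics.pstdev` (population, divides by N).

```python
import math
import statistics

numbers = [1, 2, 3, 4, 5]
sample_sd = statistics.stdev(numbers)        # sample standard deviation, divides by N-1
population_sd = statistics.pstdev(numbers)   # population standard deviation, divides by N

# Squared deviations from the mean sum to 10, so the variances are 10/4 and 10/5.
print(math.isclose(sample_sd, math.sqrt(2.5)))       # True
print(math.isclose(population_sd, math.sqrt(2.0)))   # True
```

With N-1 in the denominator, a class that has only one training instance would cause a division by zero in the hand-written `stdev`, which is worth keeping in mind for very small datasets.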


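#### Summarize the Dataset
We now have the tools to summarize a dataset. For a given list of instances (those belonging to one class value), we compute the mean and standard deviation of each attribute. The `zip` function regroups the rows into per-attribute columns, and the summary of the last column, the class value, is dropped.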

In [9]:
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

In [10]:
dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print (('Attribute summaries: {0}').format(summary))

Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]
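
The `zip(*dataset)` idiom inside `summarize` may be unfamiliar. The short sketch below shows what it produces for the same sample data: the rows are transposed into one tuple per column, and the last tuple holds the class values.

```python
dataset = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]

# Unpacking the rows into zip() pairs up the first elements, the second elements, and so on.
columns = list(zip(*dataset))
print(columns)  # [(1, 2, 3), (20, 21, 22), (0, 1, 0)]

# Dropping the last column leaves only the feature attributes.
feature_columns = columns[:-1]
print(feature_columns)  # [(1, 2, 3), (20, 21, 22)]
```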


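#### Summarize Attributes by Class
Putting these together, we group the training set by class and then compute the mean and standard deviation of each attribute within each class.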

In [11]:
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

We can test the `summarize_by_class` function with a small dataset.

In [12]:
dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]
summaries = summarize_by_class(dataset)
print (('Summary by class value: {0}').format(summaries))

Summary by class value: {0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)], 1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)]}


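### 3. Making Predictions

#### Calculate the Gaussian Probability Density Function
Given an attribute value together with the mean and standard deviation estimated from the training set, we can use the Gaussian function to estimate how likely that value is. Strictly speaking, the result is a probability density, not a probability.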

In [13]:
import math

def calculate_probability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

We can test the `calculate_probability` function with some sample data.

In [14]:
x = 71.5
# Named mu and sigma so that the mean() and stdev() functions defined earlier are not overwritten
mu = 73
sigma = 6.2
probability = calculate_probability(x, mu, sigma)
print (('Probability of belonging to this class: {0}').format(probability))

Probability of belonging to this class: 0.06248965759370005
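
The value printed above is a density, so it is not bounded by 1. The sketch below restates the same formula under the hypothetical name `gaussian_pdf`, compares it against `statistics.NormalDist` (available from Python 3.8), and shows a case where the density exceeds 1.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

# Same inputs as the test above, checked against the standard library.
print(math.isclose(gaussian_pdf(71.5, 73, 6.2), NormalDist(73, 6.2).pdf(71.5)))  # True

# With a small standard deviation the density at the mean is larger than 1.
print(gaussian_pdf(0.0, 0.0, 0.1) > 1.0)  # True
```

This is harmless for classification, because only the relative size of the values across classes matters. A standard deviation of exactly 0 would still cause a division by zero in this formula.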


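#### Calculate Class Probabilities
Now we can combine the per-attribute values into a score for each class. For every class, we multiply together the Gaussian densities of all attribute values of the instance, treating the attributes as independent. The code below does not multiply in a class prior, so the results are relative scores and do not sum to 1.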

In [15]:
def calculate_class_probabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculate_probability(x, mean, stdev)
    return probabilities

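We can test the `calculate_class_probabilities` function with hand-written summaries. The `'?'` in the input vector stands in for the unknown class value and is never read.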

In [16]:
summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}
inputVector = [1.1, '?']
probabilities = calculate_class_probabilities(summaries, inputVector)
print (('Probabilities for each class: {0}').format(probabilities))

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}
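
With many attributes, a product of small densities can underflow to 0.0 in floating point. A common workaround is to add log densities instead of multiplying densities. The sketch below is one way to do that; `gaussian_log_pdf` and `class_log_scores` are hypothetical names, not functions from the original code.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Natural log of the Gaussian density, computed without calling exp()."""
    return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def class_log_scores(summaries, input_vector):
    """Sum the per-attribute log densities for each class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        scores[class_value] = sum(
            gaussian_log_pdf(input_vector[i], mu, sigma)
            for i, (mu, sigma) in enumerate(class_summaries)
        )
    return scores

summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
scores = class_log_scores(summaries, [1.1, '?'])
print(max(scores, key=scores.get))  # 0
```

Because the logarithm is monotonic, the class with the largest log score is the same class that has the largest product, so the prediction does not change.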


#### Make a Prediction
Now that we can compute a score for each class value for a given instance, we return the class with the largest one.

In [17]:
def predict(summaries, inputVector):
    probabilities = calculate_class_probabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

We can test the `predict` function.

In [18]:
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
inputVector = [1.1, '?']
result = predict(summaries, inputVector)
print (('Prediction: {0}').format(result))

Prediction: A
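
The selection loop in `predict` can also be expressed with the built-in `max` and a `key` function. The following is a sketch of that alternative; `pick_best_label` is a hypothetical helper that takes the dictionary of per-class scores directly.

```python
def pick_best_label(probabilities):
    """Return the key with the largest value, or None for an empty dict."""
    if not probabilities:
        return None
    # max() iterates over the keys and compares them by their dictionary values.
    return max(probabilities, key=probabilities.get)

print(pick_best_label({'A': 0.7820853879509118, 'B': 6.298736258150442e-05}))  # A
```

On ties, `max` returns the first maximal key it encounters, which matches the strict `>` comparison used in the loop above.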


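### 4. Making Predictions for the Test Set
Finally, we predict the class of every instance in the test set. The accuracy of the model is then estimated from these predictions.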

In [19]:
def get_predictions(summaries, testset):
    predictions = []
    for i in range(len(testset)):
        result = predict(summaries, testset[i])
        predictions.append(result)
    return predictions

We can test the `get_predictions` function.

In [20]:
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
testset = [[1.1, '?'], [19.1, '?']]
predictions = get_predictions(summaries, testset)
print (('Predictions: {0}').format(predictions))

Predictions: ['A', 'B']


### 5. Calculating the Model's Accuracy

In [21]:
def get_accuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testset))) * 100.0

We can test the `get_accuracy` function with sample data.

In [22]:
testset = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = get_accuracy(testset, predictions)
print (('Accuracy: {0}').format(accuracy))

Accuracy: 66.66666666666666
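
A single accuracy figure hides which classes are confused with each other. As a small illustration on the same sample data, the sketch below counts (actual, predicted) pairs with `collections.Counter`; this tally is an addition for illustration and not part of the original code.

```python
from collections import Counter

testset = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
predictions = ['a', 'a', 'a']

# Count (actual, predicted) pairs; the actual class is the last element of each row.
confusion = Counter((row[-1], predicted) for row, predicted in zip(testset, predictions))
print(confusion)  # Counter({('a', 'a'): 2, ('b', 'a'): 1})

# Accuracy is the share of pairs where both labels agree.
correct = sum(count for (actual, predicted), count in confusion.items() if actual == predicted)
print(correct / len(testset) * 100.0)  # 66.66666666666666
```

Here the tally shows that the only error is a `'b'` instance predicted as `'a'`.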


The complete code for the Naive Bayes algorithm is shown below.

In [23]:
import csv
import random
import math

def load_csv(filename):
    with open(filename) as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def split_dataset(dataset, ratio):
    trainsize = int(len(dataset) * ratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

def separate_by_class(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg ,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculate_probability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculate_class_probabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculate_probability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculate_class_probabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def get_predictions(summaries, testset):
    predictions = []
    for i in range(len(testset)):
        result = predict(summaries, testset[i])
        predictions.append(result)
    return predictions

def get_accuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testset))) * 100.0


def main():
    filename = 'E:/Kaggle/pima-indians-diabetes.data.csv'
    ratio = 0.66
    dataset = load_csv(filename)
    trainset, testset = split_dataset(dataset, ratio)
    print (('Split {0} rows into train={1} and test={2} rows').format(len(dataset), len(trainset), len(testset)))
    # prepare model
    summaries = summarize_by_class(trainset)
    # test model
    predictions = get_predictions(summaries, testset)
    accuracy = get_accuracy(testset, predictions)
    print (('Accuracy: {0}%').format(accuracy))

main()

Split 768 rows into train=506 and test=262 rows
Accuracy: 74.80916030534351%
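
The accuracy above comes from one random split, so it changes from run to run. To get a feel for that variation without the data file, the sketch below runs a compact version of the same pipeline on synthetic data over several seeds. Everything in it is an assumption made for illustration: the helper names (`fit`, `predict_one`, `run_once`), the two well-separated Gaussian classes, and the use of log densities. Its numbers say nothing about the Pima Indians result.

```python
import math
import random
from collections import defaultdict

def fit(train):
    """Per class, store (mean, sample stdev) for every feature column."""
    groups = defaultdict(list)
    for row in train:
        groups[row[-1]].append(row[:-1])   # last element is the class value
    model = {}
    for label, rows in groups.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / (len(column) - 1)
            stats.append((mu, math.sqrt(var)))
        model[label] = stats
    return model

def predict_one(model, row):
    """Pick the class with the largest sum of Gaussian log densities."""
    def score(stats):
        # The constant -log(sqrt(2*pi)) is the same for every class and is dropped.
        # zip() stops after the feature columns, so the label in row is never read.
        return sum(-math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
                   for x, (mu, sigma) in zip(row, stats))
    return max(model, key=lambda label: score(model[label]))

def run_once(seed, n_per_class=100, ratio=0.66):
    """Generate synthetic data, split it, fit, and return accuracy in percent."""
    rng = random.Random(seed)
    # Synthetic data (assumption): two features per row, class 0.0 centred at 0
    # and class 1.0 centred at 5, both with unit standard deviation.
    data = [[rng.gauss(c * 5.0, 1.0), rng.gauss(c * 5.0, 1.0), float(c)]
            for c in (0, 1) for _ in range(n_per_class)]
    rng.shuffle(data)
    cut = int(len(data) * ratio)
    train, test = data[:cut], data[cut:]
    model = fit(train)
    hits = sum(predict_one(model, row) == row[-1] for row in test)
    return hits / len(test) * 100.0

accuracies = [run_once(seed) for seed in range(5)]
print(min(accuracies), max(accuracies))
```

Reporting the range or the average over several splits gives a steadier estimate than a single run.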
