Naive-Bayes Algorithm
=====================
***

What Is It?
-----------

The Naive-Bayes algorithm is an intuitive approach to making predictions based on prior beliefs or probabilities. Quoting Jason Brownlee, "it is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically".

Let's dive into the mathematics. We start off with a belief, or *prior probability*, of event $A$, denoted $P(A)$. Everything seems to be going well until we're hit with some new evidence $X$ that affects the probability of our belief. As much as we'd like to, we can't simply ignore $X$ and go home. Instead, given evidence $X$, we must calculate an updated probability for event $A$, called the *posterior probability* and denoted $P(A | X)$. Finally, for the sake of completeness, $P(X | A)$ is the probability of observing evidence $X$ when event $A$ holds (the *likelihood*), and $P(X)$ is the overall probability of observing evidence $X$, regardless of $A$. These quantities are related by Bayes' theorem:

\begin{align}
P(A | X) &= \frac{P(X | A) \, P(A)}{P(X)}
\end{align}
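
To make the formula concrete, here is a small worked example in Python. The numbers are made up for illustration and do not come from the diabetes data set: a condition with a 1% prior, a test that flags 90% of true cases, and a 5% false-positive rate. The function name `posterior` is just a helper for this sketch.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A | X) = P(X | A) * P(A) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration
p_a = 0.01              # P(A): prior probability of the condition
p_x_given_a = 0.90      # P(X | A): positive test given the condition
p_x_given_not_a = 0.05  # P(X | not A): positive test without the condition

# P(X) via the law of total probability
p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)

p_a_given_x = posterior(p_a, p_x_given_a, p_x)
print('P(A | X) = {0:.3f}'.format(p_a_given_x))
```

Even after a positive test, the posterior here is only about 15%, because the prior is so small. This is the kind of update the algorithm performs for every class.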

You're probably wondering what makes this algorithm *naive*. The name comes from the underlying assumption that the individual pieces of evidence $X_1, X_2, \dots, X_n$ are independent of one another given event $A$. This rarely holds exactly in practice, but it greatly simplifies the computation, which helps explain the algorithm's popularity in many fields.
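
Written out, and assuming the evidence consists of $n$ attributes $X_1, \dots, X_n$ (an indexing introduced here for illustration), the independence assumption lets the likelihood factor into a product of per-attribute terms:

\begin{align}
P(X_1, \dots, X_n | A) &= \prod_{i=1}^{n} P(X_i | A) \\
P(A | X_1, \dots, X_n) &\propto P(A) \prod_{i=1}^{n} P(X_i | A)
\end{align}

The denominator $P(X)$ is the same for every class, so it can be dropped when we only need to compare classes against each other.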

This notebook uses Python to classify whether a patient is diagnosed with diabetes given a set of attributes. The data set is the "Pima Indians Diabetes Data Set", provided by the National Institute of Diabetes and Digestive and Kidney Diseases. We consider the algorithm credible if its accuracy falls between 70% and 76%.

Data Loading and Formatting
-------------------------------

The data set is given as a `csv` file, which requires parsing and partitioning to form a training set and a test set.

In [4]:
import csv

def load_csv(file):
    # Text mode with newline='' is what the csv module expects in Python 3
    with open(file, newline='') as f:
        dataset = list(csv.reader(f))
    # Convert every field from string to float
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

file = "pima-indians-diabetes.data.csv"
dataset = load_csv(file)
print('Loaded data from {0} with {1} rows'.format(file, len(dataset)))

Loaded data from pima-indians-diabetes.data.csv with 768 rows


In [6]:
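from random import randrange

def partition_data(dataset, ratio):
    # Number of rows that go into the training set
    train_size = int(len(dataset) * ratio)
    # Copy so the original dataset is left untouched
    test_set = list(dataset)
    train_set = []

    # Move randomly chosen rows from the test pool into the training set
    while len(train_set) < train_size:
        index = randrange(len(test_set))
        train_set.append(test_set.pop(index))

    return [train_set, test_set]

# A ratio of 0.67 reproduces the 514/254 split shown in the output below
train_set, test_set = partition_data(dataset, 0.67)
print('Split total data ({0} rows) into training set ({1} rows) and testing set ({2} rows)'.format(len(dataset), len(train_set), len(test_set)))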

Split total data (768 rows) into training set (514 rows) and testing set (254 rows)
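
The split above is random, so the exact rows in each set change from run to run. If reproducibility matters, one option is to shuffle with a seeded generator and slice. This is a sketch of an alternative, not the approach used in this notebook; the name `partition_data_seeded` and the `seed` parameter are introduced here for illustration.

```python
import random

def partition_data_seeded(dataset, ratio, seed=0):
    # A private generator keeps the global random state untouched
    rng = random.Random(seed)
    shuffled = list(dataset)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Toy data standing in for the real rows
toy = [[float(i), float(i % 2)] for i in range(10)]
train, test = partition_data_seeded(toy, 0.67, seed=42)
print(len(train), len(test))
```

Calling the function again with the same `seed` returns the same two lists, which makes later accuracy numbers easier to compare across runs.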


Data Organization and Pre-calculations
--------------------------------------

Now that our dataset has been partitioned, let's visualize what it actually looks like and discuss how we should transform it going forward. Currently, our training set, $T$, can be described as an $m \times n$ matrix,

\begin{align}
T = 
\begin{bmatrix}
    x_{11}       & x_{12} & x_{13} & \dots & x_{1n} \\
    x_{21}       & x_{22} & x_{23} & \dots & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{m1}       & x_{m2} & x_{m3} & \dots & x_{mn}
\end{bmatrix}
\end{align}

where $m$ is the number of data points and $n$ is the number of attributes plus the classification value for each data point.

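Next, we want to group the data points by class into $T(0)$ and $T(1)$. If $m_0$ rows belong to class 0 and $m_1$ rows belong to class 1, with $m_0 + m_1 = m$, then

\begin{align}
T(0) = 
\begin{bmatrix}
    x^{(0)}_{1,1} & x^{(0)}_{1,2} & x^{(0)}_{1,3} & \dots & x^{(0)}_{1,n-1} & 0 \\
    x^{(0)}_{2,1} & x^{(0)}_{2,2} & x^{(0)}_{2,3} & \dots & x^{(0)}_{2,n-1} & 0 \\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    x^{(0)}_{m_0,1} & x^{(0)}_{m_0,2} & x^{(0)}_{m_0,3} & \dots & x^{(0)}_{m_0,n-1} & 0
\end{bmatrix}
\\
\\
T(1) = 
\begin{bmatrix}
    x^{(1)}_{1,1} & x^{(1)}_{1,2} & x^{(1)}_{1,3} & \dots & x^{(1)}_{1,n-1} & 1 \\
    x^{(1)}_{2,1} & x^{(1)}_{2,2} & x^{(1)}_{2,3} & \dots & x^{(1)}_{2,n-1} & 1 \\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    x^{(1)}_{m_1,1} & x^{(1)}_{m_1,2} & x^{(1)}_{m_1,3} & \dots & x^{(1)}_{m_1,n-1} & 1
\end{bmatrix}
\end{align}

where $x^{(k)}_{i,j}$ is the $j$-th attribute of the $i$-th data point in class $k$, and the last column holds the class label.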

In the implementation, we'll organize the data points into a map whose keys are the classification values and whose values are the lists of data points belonging to each class. The classification value is dropped from each data point as it is stored, since the key already records it.

In [15]:
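def group_by_class(dataset):
    # Maps each class label to the list of attribute rows with that label
    klass_map = {}
    for el in dataset:
        # The class label is the last column
        klass = int(el[-1])
        if klass not in klass_map:
            klass_map[klass] = []
        # Store the attributes only; the label lives in the key
        klass_map[klass].append(el[:-1])
    return klass_map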
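
As a quick check of the grouping step, the sketch below runs `group_by_class` on a tiny hand-made dataset. The rows are invented for illustration, and the function is repeated so the snippet runs on its own.

```python
def group_by_class(dataset):
    klass_map = {}
    for el in dataset:
        klass = int(el[-1])
        if klass not in klass_map:
            klass_map[klass] = []
        klass_map[klass].append(el[:-1])
    return klass_map

# Three made-up rows: two attributes followed by the class label
toy = [
    [1.0, 20.0, 0.0],
    [2.0, 21.0, 1.0],
    [3.0, 22.0, 0.0],
]

grouped = group_by_class(toy)
print(grouped[0])  # rows of class 0, label removed
print(grouped[1])  # rows of class 1, label removed
```

Each key holds only the attribute columns, so any per-attribute statistics computed later will not accidentally include the label.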

For each attribute, we want to calculate the mean and standard deviation