Naive-Bayes Algorithm
=====================
***

What Is It?
-----------

The Naive-Bayes algorithm is an intuitive approach to making predictions based on prior beliefs or probabilities. Quoting Jason Brownlee, "it is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically".

Let's dive into the mathematics. We start off with a belief, or *prior probability*, of event $A$, denoted $P(A)$. Everything seems to be going well until we're hit with some new evidence $X$, which implies something about our belief. As much as we'd like to, we can't simply ignore $X$ and go home. Instead, given evidence $X$, we must calculate an updated probability for event $A$ called the *posterior probability*, denoted $P(A | X)$. Finally, for the sake of completeness, $P(X | A)$ is the probability of observing evidence $X$ given that event $A$ holds (the *likelihood*), and $P(X)$ is the overall probability of observing evidence $X$, regardless of $A$. Bayes' theorem ties these four quantities together:

\begin{align}
 P(A | X) &= \frac{P(X | A) \, P(A)}{P(X)}
\end{align}
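To make the formula concrete, here is a minimal sketch of a single Bayesian update. All three input probabilities are hypothetical numbers picked for illustration, and the helper name `posterior` is not part of this notebook. $P(X)$ is not given directly, so the sketch derives it from the law of total probability.

```python
def posterior(prior, likelihood, evidence):
    """Return P(A | X) from P(A), P(X | A) and P(X) via Bayes' theorem."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_a = 0.30              # P(A): prior probability of event A
p_x_given_a = 0.80      # P(X | A): probability of the evidence when A holds
p_x_given_not_a = 0.20  # P(X | not A): probability of the evidence otherwise

# P(X) follows from the law of total probability over A and not-A.
p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)

p_a_given_x = posterior(p_a, p_x_given_a, p_x)
print('P(X) = {0:.2f}, P(A | X) = {1:.3f}'.format(p_x, p_a_given_x))
```

With these made-up inputs, the evidence raises the belief in $A$ from 0.30 to roughly 0.63.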

You're probably wondering what makes this algorithm *naive*. Well, it's due to the underlying assumption that the individual pieces of evidence $X_1, \dots, X_n$ are independent of one another given event $A$. Real attributes are rarely independent, but the assumption simplifies the calculation considerably, which helps explain the algorithm's popularity in many fields.
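Written out, the naive assumption is usually taken to mean that the evidence is conditionally independent given $A$. The notation $X_1, \dots, X_n$ for the individual attributes is an assumption of this sketch. Under it, the joint likelihood factors into one term per attribute, and the posterior is proportional to the prior times that product:

\begin{align}
 P(X_1, \dots, X_n | A) &= \prod_{i=1}^{n} P(X_i | A) \\
 P(A | X_1, \dots, X_n) &\propto P(A) \prod_{i=1}^{n} P(X_i | A)
\end{align}

The denominator $P(X_1, \dots, X_n)$ is the same for every candidate event $A$, so it can be dropped when the goal is only to pick the most probable one.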

This notebook uses Python to predict whether a patient is diagnosed with diabetes, given a set of attributes. The data set is the "Pima Indians Diabetes Data Set", provided by the National Institute of Diabetes and Digestive and Kidney Diseases. An accuracy between 70% and 76% is the target used here to judge the algorithm's credibility.

Data Acquisition and Formatting
-------------------------------

The data set is given as a `csv` file, which requires parsing and partitioning to form a training set and a test set.

In [4]:
import csv

def load_csv(file):
    # Read every row of the CSV file and convert each field to a float.
    # newline='' is the mode the csv module expects in Python 3.
    with open(file, newline='') as f:
        dataset = [[float(x) for x in row] for row in csv.reader(f)]
    return dataset

file = "pima-indians-diabetes.data.csv"
dataset = load_csv(file)
print('Loaded data from {0} with {1} rows'.format(file, len(dataset)))

Loaded data from pima-indians-diabetes.data.csv with 768 rows


In [5]:
from random import randrange

def partition_data(dataset, ratio):
    # Move randomly chosen rows into the training set until it holds
    # `ratio` of the data; the rows left over form the test set.
    train_size = int(len(dataset) * ratio)
    test_set = list(dataset)  # copy, so the original data set is not modified
    train_set = []

    while len(train_set) < train_size:
        index = randrange(len(test_set))
        train_set.append(test_set.pop(index))

    return [train_set, test_set]

train_set, test_set = partition_data(dataset, 0.67)
print('Split total data ({0} rows) into training set ({1} rows) and testing set ({2} rows)'.format(len(dataset), len(train_set), len(test_set)))
    

Split total data (768 rows) into training set (514 rows) and testing set (254 rows)
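Because the split is random, every run produces a different training set, so accuracy figures will vary from run to run. One possible way to make a run repeatable is to seed the random number generator. The sketch below uses a hypothetical helper, `partition_data_seeded`, which is not part of this notebook, and a small toy list in place of the real data. It also swaps the pop-based loop for a shuffle-and-slice, which gives the same kind of random split.

```python
import random

def partition_data_seeded(dataset, ratio, seed=None):
    """Hypothetical variant of partition_data that accepts a random seed."""
    rng = random.Random(seed)   # private generator; global state is untouched
    shuffled = list(dataset)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Toy stand-in for the real data set: ten single-attribute rows.
rows = [[float(i)] for i in range(10)]

train_a, test_a = partition_data_seeded(rows, 0.67, seed=1)
train_b, test_b = partition_data_seeded(rows, 0.67, seed=1)

print(len(train_a), len(test_a))  # 6 4
print(train_a == train_b)         # True: the same seed gives the same split
```

Passing `seed=None` falls back to a different split on each run, matching the behaviour of the unseeded function.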