## Predict whether or not a patient has diabetes, based on certain diagnostic measurements
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

- The dataset consists of several medical predictor variables and one target variable, Outcome.
- Predictor variables include:
    - the number of pregnancies the patient has had, 
    - their BMI, 
    - insulin level, 
    - age, and so on.

### 1. Import the necessary packages

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np

### 2. Load and preprocess the data

In [6]:
data = np.loadtxt('../data/prima-indians-diabetes.csv', delimiter=',')

In [7]:
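# The first 8 columns are the predictor variables
X = data[:,:8]
# The last column is the target variable (Outcome), kept as an (n, 1) array
Y = data[:,8:]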

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.2, random_state=5)

### 3. Train a classifier/Build a diabetes prediction model

In [8]:
# Train a Gaussian Naive Bayes model
clf = GaussianNB()
# flatten() turns the (n, 1) target column into the 1-D array that fit() expects
clf.fit(x_train, y_train.flatten())

In [9]:
?GaussianNB

Init signature: GaussianNB(*, priors=None, var_smoothing=1e-09)
Docstring:
Gaussian Naive Bayes (GaussianNB).

Can perform online updates to model parameters via :meth:`partial_fit`.
For details on algorithm used to update feature means and variance online,
see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

    http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

Parameters
----------
priors : array-like of shape (n_classes,)
    Prior probabilities of the classes. If specified, the priors are not
    adjusted according to the data.

var_smoothing : float, default=1e-9
    Portion of the largest variance of all features that is added to
    variances for calculation stability.

    .. versionadded:: 0.20

Attributes

#### var_smoothing
A Gaussian curve can act as a "low-pass" filter, letting only the samples close to its mean "pass." In Naive Bayes, assuming a Gaussian distribution effectively gives more weight to samples closer to the distribution's mean. Whether that is appropriate depends on whether the features are approximately normally distributed within each class.
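
As a rough illustration of that weighting, the sketch below evaluates the Gaussian density at a point near the mean and at a point far from it. The helper `gaussian_pdf` and the example values are assumptions made for illustration; they are not part of the notebook or of scikit-learn.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical feature with mean 0 and variance 1 within one class.
near = gaussian_pdf(0.5, mean=0.0, var=1.0)  # sample close to the mean
far = gaussian_pdf(3.0, mean=0.0, var=1.0)   # sample far from the mean

print(f"likelihood near the mean: {near:.4f}")
print(f"likelihood far from the mean: {far:.4f}")
```

The sample far from the mean receives a much smaller likelihood, so it contributes far less evidence for that class.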

The `var_smoothing` parameter artificially adds a value to each feature's variance. Per the docstring above, that value is a portion (`var_smoothing`, default 1e-9) of the largest variance among all features in the training data. This widens (or "smooths") the curve, so samples further from the distribution mean receive more weight; the docstring gives calculation stability as the purpose.
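
One way to check this on a fitted model is sketched below. It uses synthetic data (an assumption, standing in for the diabetes CSV) and compares the fitted `epsilon_` attribute, which scikit-learn documents as the absolute value added to the variances, with `var_smoothing` times the largest feature variance. The names `X_demo`, `y_demo`, and `model` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 200 samples, 3 features with very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

for smoothing in (1e-9, 1e-2):
    model = GaussianNB(var_smoothing=smoothing).fit(X_demo, y_demo)
    expected = smoothing * X_demo.var(axis=0).max()
    # epsilon_ is the absolute amount added to every feature variance.
    print(smoothing, model.epsilon_, np.isclose(model.epsilon_, expected))
```

With the default of 1e-9 the added amount is negligible. A larger value such as 1e-2 flattens the per-class curves noticeably, most of all for features whose own variance is small compared with the largest one.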

### 4. Make predictions using test samples

In [10]:
# Make predictions on test data
predictions = clf.predict(x_test)

### 5. Evaluate the model

In [11]:
# Evaluate the accuracy; accuracy_score expects (y_true, y_pred)
print('Accuracy Score: ', accuracy_score(y_test, predictions))

Accuracy Score:  0.7662337662337663
