In [7]:
# Load CSV using Pandas as pandas.DataFrame object

import matplotlib.pyplot as plt
import pandas
import numpy

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(filename, names=names)

shape = dataframe.shape
print(shape)

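# Convert to a NumPy array and split into inputs and target
array = dataframe.values
X = array[:,0:8]  # first 8 columns: input features
Y = array[:,8]    # last column: class label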

(768, 9)


In [4]:
import sklearn.datasets
data = sklearn.datasets.load_iris()

X = data.data
Y = data.target
print(data.DESCR)
print(X.shape)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

## Algorithms Overview

Linear machine learning algorithms:
  + Logistic Regression.
  + Linear Discriminant Analysis.

Nonlinear machine learning algorithms:
  + K-Nearest Neighbors.
  + Naive Bayes.
  + Classification and Regression Trees.
  + Support Vector Machines.

In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
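
The two imports above are used by every model cell that follows: `KFold` defines the train/test splits and `cross_val_score` fits and scores the model once per split. As a rough illustration of the splitting idea only, here is a pure-Python sketch that partitions sample indices into contiguous folds without shuffling. The helper `kfold_indices` is hypothetical and is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each sample lands in exactly one test fold.
folds = list(kfold_indices(n_samples=10, n_splits=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

The cells in this notebook additionally pass `shuffle=True` and a fixed `random_state`, so the actual folds are randomized but reproducible.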

## Logistic Regression

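Logistic regression models the probability of a class as a logistic (sigmoid) function of a
linear combination of the input variables. It does not require Gaussian-distributed inputs.
It is a natural fit for binary classification, and scikit-learn's LogisticRegression also
handles multiclass targets such as the three iris classes used here.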

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
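
As a rough illustration of how a fitted binary logistic regression turns inputs into a class probability, the sketch below applies the sigmoid function to a weighted sum of the features. The weights and bias are hypothetical values chosen for the example, not coefficients learned from the data in this notebook.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients, for illustration only.
weights = [0.8, -0.4]
bias = -0.2

p = predict_proba([1.0, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(p, label)
```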

In [12]:
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=5, shuffle=True, random_state=5)
model = LogisticRegression()

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 95.33333333333334
3.399346342395189


## Linear Discriminant Analysis

LDA is a statistical technique for binary and multiclass classification. 

It assumes that the numeric input variables follow a Gaussian distribution within each class, with a covariance matrix shared across the classes.

http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

In [13]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

model = LinearDiscriminantAnalysis()

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 98.00000000000001
2.666666666666666




## K-Nearest Neighbors

K-Nearest Neighbors (or KNN) uses a distance metric to find the K most similar instances in the
training data for a new instance. For classification, it predicts the most common class among
those neighbors; for regression, it takes the mean outcome of the neighbors.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
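
For classification, the neighbors' labels are combined by majority vote. A minimal pure-Python sketch of that idea, assuming Euclidean distance and a tiny made-up training set (the function `knn_predict` and the data are hypothetical, not part of scikit-learn):

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-dimensional data with two classes.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["a", "a", "a", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

`KNeighborsClassifier()` with default arguments uses 5 neighbors and Euclidean distance; the sketch uses `k=3` only to keep the example small.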

In [14]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 96.66666666666666
2.108185106778919


## Naive Bayes

Naive Bayes calculates the probability of each class and the conditional probability of each
input value given each class. These probabilities are estimated from the training data; for new
data they are multiplied together, assuming that the inputs are all independent given the class
(a simple or naive assumption). When working with real-valued data, a Gaussian distribution is
assumed so that the probabilities for input variables can be estimated with the Gaussian
Probability Density Function.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
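
To make the calculation concrete, the sketch below scores a sample for each class by multiplying the class prior with one Gaussian density per feature, then picks the class with the highest score. The class names, priors, means, and standard deviations are hypothetical numbers for illustration; in practice they would be estimated from training data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Gaussian probability density of x for the given mean and std."""
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2.0 * math.pi))


def class_score(features, prior, means, stds):
    """Prior times the product of per-feature densities (naive independence)."""
    score = prior
    for x, mean, std in zip(features, means, stds):
        score *= gaussian_pdf(x, mean, std)
    return score


# Hypothetical per-class parameters for two features.
params = {
    "short": {"prior": 0.5, "means": [1.5, 0.3], "stds": [0.2, 0.1]},
    "long": {"prior": 0.5, "means": [4.5, 1.4], "stds": [0.5, 0.2]},
}

sample = [1.4, 0.2]
scores = {name: class_score(sample, **p) for name, p in params.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```

Real implementations usually sum log-probabilities instead of multiplying densities, which avoids numerical underflow when there are many features.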

In [15]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 95.33333333333334
3.399346342395189


## Classification and Regression Trees

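Classification and Regression Trees (CART, or just decision trees) construct a binary tree
from the training data. Split points are chosen greedily by evaluating each attribute and
each value of each attribute in the training data in order to minimize a cost function
(like the Gini index). The cell below passes criterion='entropy', so it splits on
information gain rather than scikit-learn's default Gini criterion.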

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
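
To make the cost function concrete, the sketch below computes the Gini index of a set of labels and the weighted Gini cost of splitting one attribute at a threshold, then greedily picks the cheapest threshold. The helper names and the toy data are hypothetical, and candidate thresholds are simply the observed values, which is a simplification rather than scikit-learn's exact procedure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure set, larger for mixed sets."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def split_cost(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Greedy search over the observed values of one attribute."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda t: split_cost(values, labels, t))


# Toy attribute values with two classes.
values = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))
```

A full tree builder would repeat this search over every attribute and then recurse into the two resulting groups.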

In [16]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='entropy')

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 95.33333333333334
3.9999999999999996


## Support Vector Machines

SVMs seek a line (in higher dimensions, a hyperplane) that best separates two classes.

Those data instances that are closest to the line that best separates the
classes are called support vectors and influence where the line is placed.

scikit-learn's SVC also supports multiple classes. The Radial Basis Function is used as the
kernel by default, so the resulting decision boundary can be nonlinear in the original inputs.
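
The geometric idea can be sketched without training anything: given a separating line, each point is classified by which side it falls on, and the points nearest to the line are the ones that would act as support vectors. The line and the points below are hypothetical, and this sketch does not fit an SVM or use a kernel.

```python
import math


def signed_distance(point, weights, bias):
    """Signed distance from a point to the line weights . x + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


# Hypothetical separating line: x1 + x2 - 6 = 0.
weights = [1.0, 1.0]
bias = -6.0

points = [(1.0, 1.0), (2.0, 3.0), (4.0, 3.0), (6.0, 6.0)]
distances = [signed_distance(p, weights, bias) for p in points]

# The sign of the distance gives the predicted side of the line.
predicted = [1 if d >= 0 else -1 for d in distances]

# The points at the smallest distance are the support-vector candidates.
margin = min(abs(d) for d in distances)
closest = [p for p, d in zip(points, distances) if math.isclose(abs(d), margin)]
print(predicted, closest)
```

After fitting a real `SVC`, the support vectors it found are available through its `support_vectors_` attribute.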

In [17]:
from sklearn.svm import SVC

model = SVC()

results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", (results.mean()*100.0)) 
print(results.std()*100)

Accuracy: 97.33333333333334
1.3333333333333333
