# Iris classification
https://github.com/jwheeldon/test_ml.git

Machine learning test using the following techniques on the iris dataset:
* Descriptive statistics
* Linear support vector machine (SVM)
* Evaluation of model

In [98]:
# Import packages
from sklearn import datasets, svm, metrics
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib as mpl

## Import and sample dataset

In [99]:
iris = datasets.load_iris()
n_samples = len(iris.data)
data_and_classes = list(zip(iris.data, iris.target))
data_pd = pd.DataFrame(data_and_classes)

print(data_pd.head(5))

                      0  1
0  [5.1, 3.5, 1.4, 0.2]  0
1  [4.9, 3.0, 1.4, 0.2]  0
2  [4.7, 3.2, 1.3, 0.2]  0
3  [4.6, 3.1, 1.5, 0.2]  0
4  [5.0, 3.6, 1.4, 0.2]  0
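The frame above stores each flower's four measurements as a single list in column 0. As a minimal sketch of an alternative layout, one column per feature can be built directly from `iris.data` and `iris.feature_names`. The names `iris_df` and `species` are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# One column per measurement, labelled with scikit-learn's feature names
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Hypothetical column name for the integer class label (0, 1, 2)
iris_df["species"] = iris.target

print(iris_df.shape)
print(iris_df.head(5))
```

With this layout, `describe()` and column selection work without first unpacking a list column.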


## Explore data using descriptive statistics

In [106]:
# Executed after the shuffle cell below, once data_pd has columns 1-4 (features) and 5 (class)
data_pd.describe()

              1         2        3         4         5
count     150.0     150.0    150.0     150.0     150.0
mean   5.843333     3.054 3.758667  1.198667       1.0
std    0.828066  0.433594  1.76442  0.763161  0.819232
min         4.3       2.0      1.0       0.1       0.0
25%         5.1       2.8      1.6       0.3       0.0
50%         5.8       3.0     4.35       1.3       1.0
75%         6.4       3.3      5.1       1.8       2.0
max         7.9       4.4      6.9       2.5       2.0
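Summary statistics over the whole dataset mix the three species together. As a hedged sketch, the same kind of summary can be broken down per class with `groupby`. The frame `df` is rebuilt here so the block runs on its own, using the same column numbering as the notebook (1-4 for features, 5 for the class); `class_means` is an illustrative name.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Columns 1-4 hold the measurements, column 5 the class label
df = pd.DataFrame(iris.data, columns=[1, 2, 3, 4])
df[5] = iris.target

# Mean of each feature within each class
class_means = df.groupby(5).mean()
print(class_means)
```

Large gaps between the per-class means of a feature suggest that the feature helps separate the species.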


## Shuffle dataset
The iris data is ordered by species, so the rows are shuffled first; otherwise a split by row position would leave some species missing from the training or test set.

In [101]:
np.random.seed(10)
data_pd = data_pd.sample(frac=1)
data_pd = pd.concat([data_pd[0].apply(pd.Series), data_pd[1]], axis=1)
data_pd.columns = [1,2,3,4,5]

print(data_pd.head(5))

       1    2    3    4  5
87   6.3  2.3  4.4  1.3  1
111  6.4  2.7  5.3  1.9  2
10   5.4  3.7  1.5  0.2  0
91   6.1  3.0  4.6  1.4  1
49   5.0  3.3  1.4  0.2  0
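The cell above relies on NumPy's global seed to make the shuffle repeatable. A minimal alternative sketch, assuming only pandas, passes `random_state` to `sample` directly. The small frame `df` is a stand-in for illustration, not the iris data.

```python
import pandas as pd

# Small stand-in frame; any DataFrame behaves the same way
df = pd.DataFrame({"x": range(10)})

# Passing random_state makes the shuffle repeatable without setting a global seed
first = df.sample(frac=1, random_state=10)
second = df.sample(frac=1, random_state=10)

print(first.index.tolist() == second.index.tolist())    # same order both times
print(sorted(first.index.tolist()) == list(range(10)))  # every row kept exactly once
```

Keeping the seed local to the call avoids surprises when other cells also draw random numbers.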


## Define training data and targets
Split the data into two halves: 50% for model training and 50% for testing. A more typical split would be about 70% training, 10% validation, and 20% test data.

In [102]:
# The first half of the shuffled rows is the training set
training = data_pd.iloc[:n_samples//2]
training_data = training[[1,2,3,4]].values
training_target = training[5].values

print(training_data[0:10])
print(training_target[0:10])

[[ 6.3  2.3  4.4  1.3]
 [ 6.4  2.7  5.3  1.9]
 [ 5.4  3.7  1.5  0.2]
 [ 6.1  3.   4.6  1.4]
 [ 5.   3.3  1.4  0.2]
 [ 5.   2.   3.5  1. ]
 [ 6.3  2.5  4.9  1.5]
 [ 5.8  2.7  4.1  1. ]
 [ 5.1  3.4  1.5  0.2]
 [ 5.7  2.8  4.5  1.3]]
[1 2 0 1 0 1 1 1 0 1]
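The notebook uses a simple 50/50 split by row position. As a sketch of the 70/10/20 split mentioned above, scikit-learn's `train_test_split` can be applied twice. The variable names, the `random_state` value, and the use of `stratify` are assumptions for illustration, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# First carve off 20% of the rows as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

# Then take 1/8 of the remaining 80% as validation (0.125 * 0.8 = 0.1 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=10
)

print(len(X_train), len(X_val), len(X_test))
```

`train_test_split` shuffles by default, and `stratify` keeps the three species in equal proportion in every subset.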


## Support vector machine classification

Use a linear SVM, a supervised classifier, to predict the species of the held-out test rows.

In [103]:
# Linear support vector machine classifier
clf = svm.LinearSVC()

# Train model to fit training_data to training_target
clf.fit(training_data, training_target)

# The second half of the shuffled rows is the test set: compare true (expected) and predicted classes
expected = data_pd[5][n_samples//2:].values
predicted = clf.predict(data_pd[[1,2,3,4]][n_samples//2:])
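To make the prediction step less opaque, the sketch below reproduces what a fitted `LinearSVC` does for a multi-class problem: it holds one weight vector and one intercept per class (one-vs-rest) and picks the class with the highest linear score. It fits on the full dataset purely for illustration; `max_iter` and `random_state` are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target

# max_iter and random_state are assumptions made for this sketch
clf = svm.LinearSVC(max_iter=10000, random_state=0)
clf.fit(X, y)

# One row of weights and one intercept per class
scores = X @ clf.coef_.T + clf.intercept_

# The predicted class is the one with the highest score
manual_pred = clf.classes_[np.argmax(scores, axis=1)]

print(clf.coef_.shape)  # (3, 4): three classes, four features
print(np.array_equal(manual_pred, clf.predict(X)))
```

The manual scores are the same values that `clf.decision_function(X)` returns.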

In [104]:
# Confusion matrix: rows are expected classes, columns are predicted classes
print(metrics.confusion_matrix(expected,predicted))

[[27  0  0]
 [ 0 23  0]
 [ 0  3 22]]


Accuracy score: The fraction of predictions that exactly match the expected labels.

Classification report: Precision (P) is the fraction of samples predicted as a class that truly belong to it, while Recall (R) is the fraction of samples truly in a class that were predicted as that class.
The F1-score is the harmonic mean of P and R.



$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \, P \, R}{P + R}$$
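Precision, recall, F1 and accuracy can be checked by hand against the confusion matrix printed earlier. The sketch below copies that matrix into a NumPy array and derives each metric from its row and column sums; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = expected, columns = predicted
cm = np.array([[27, 0, 0],
               [0, 23, 0],
               [0, 3, 22]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # row sums: everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(accuracy)
```

For class 1, for example, 23 of the 26 samples predicted as class 1 are correct, which gives the precision of 0.88 shown in the classification report.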

In [105]:
print("Accuracy score:", metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))

Accuracy score: 0.96
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        27
          1       0.88      1.00      0.94        23
          2       1.00      0.88      0.94        25

avg / total       0.96      0.96      0.96        75

