Classification is a form of pattern recognition: given a set of measured attributes, a model assigns each sample to one of several known classes. This notebook uses four attributes (sepal length, sepal width, petal length and petal width) to distinguish three kinds of Iris flowers. The data is split into training and testing sets.
The testing set gives us an idea of how well our model performs, after training on the training set, on samples it has not seen before.

In [1]:
# Dependencies
import pandas as pd
from scipy.io import arff
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import graphviz

### Loading data from iris.arff file

In [2]:
data = arff.loadarff('iris.arff')
df = pd.DataFrame(data[0])
df['class'] = df['class'].apply(lambda x: x.decode('utf-8'))

Looking at the first few rows of our data:

In [3]:
df.head(3)

   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa


Separating the data into attributes and integer-coded class labels, then splitting both into training and testing sets for our classifiers

In [4]:
x = df.loc[:,'sepallength':'petalwidth']
y = df['class'].astype('category').cat.codes

# Split the data into training and testing sets (33% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.33, random_state=50)
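
The integer codes produced by `.astype('category').cat.codes` follow the sorted order of the distinct labels, which matters later when reading the rows and columns of the confusion matrix. Below is a minimal pure-Python sketch of the same mapping. The `labels` list is a small hand-written example rather than the real column, and the versicolor and virginica names are assumed from the standard Iris data, since only setosa appears in the preview above.

```python
# Illustrative labels only; the real column has many more entries.
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

# The distinct string labels are sorted to form the categories,
# then each label is replaced by the index of its category.
categories = sorted(set(labels))
codes = [categories.index(label) for label in labels]

print(categories)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(codes)       # [2, 0, 1, 0]
```

Under this assumption, code 0 is setosa, 1 is versicolor and 2 is virginica.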

Training a Decision Tree Classifier through the sklearn module

In [5]:
dtc = tree.DecisionTreeClassifier()
dtc.fit(x_train,y_train)
dtc_acc = dtc.score(x_test, y_test)
y_pred_dtc = dtc.predict(x_test)

Training a K-Neighbors Classifier also through the sklearn module

In [6]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
knn_acc = knn.score(x_test,y_test)
y_pred_knn = knn.predict(x_test)
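
For intuition about what `KNeighborsClassifier(n_neighbors=3)` does at prediction time, here is a rough pure-Python sketch of majority voting among the three nearest training points. The toy points and the helper `knn_predict` are hypothetical. scikit-learn's own implementation uses faster neighbor searches and its own tie-breaking, so treat this as an illustration of the idea only.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most common label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy (petallength, petalwidth) pairs with made-up class codes
train_points = [(1.4, 0.2), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5)]
train_labels = [0, 0, 1, 1, 2]

print(knn_predict(train_points, train_labels, (4.6, 1.5)))  # 1
```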

A nice way to view the performance of these models is through a confusion matrix.
The confusion matrix compares the model's predictions to the true labels associated with each set of attributes.

In [7]:
cm_dtc = confusion_matrix(y_test, y_pred_dtc)
cm_knn = confusion_matrix(y_test, y_pred_knn)

The rows correspond to the true labels, while the columns correspond to the labels predicted by the classifier.

In [8]:
print('Decision Tree Classifier confusion matrix:')
print(cm_dtc)
print('K-Neighbors Classifier confusion matrix:')
print(cm_knn)

Decision Tree Classifier confusion matrix:
[[17  0  0]
 [ 0 16  1]
 [ 0  1 15]]
K-Neighbors Classifier confusion matrix:
[[17  0  0]
 [ 0 16  1]
 [ 0  0 16]]
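
To make the row and column layout concrete, the following sketch builds a confusion matrix by hand from two short label lists. The lists and the helper `build_confusion_matrix` are made up for illustration; they are not the notebook's `y_test` or predictions.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Count samples with true label i (row) and predicted label j (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up labels: one class-1 sample is predicted as 2, and one class-2 sample as 1
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 1]

for row in build_confusion_matrix(y_true, y_pred, n_classes=3):
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 1, 1]
```

Every off-diagonal entry is a misclassification, so a perfect classifier would produce a purely diagonal matrix.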


Accuracy is the proportion of correct predictions over all samples, or equivalently the sum of the diagonal entries of the confusion matrix divided by the sum of all its entries.

In [9]:
print('Decision Tree Classifier accuracy:')
print(dtc_acc)
print('K-Neighbors Classifier accuracy:')
print(knn_acc)

Decision Tree Classifier accuracy:
0.96
K-Neighbors Classifier accuracy:
0.98
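
As a quick check, the same accuracies can be recovered directly from the confusion matrices printed earlier. This is a small sketch in plain Python; the helper `accuracy_from_confusion` and the variable names `dtc_counts` and `knn_counts` are introduced here for illustration, with the values copied from the printed output.

```python
def accuracy_from_confusion(matrix):
    """Sum of the diagonal entries divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Values copied from the printed confusion matrices
dtc_counts = [[17, 0, 0], [0, 16, 1], [0, 1, 15]]
knn_counts = [[17, 0, 0], [0, 16, 1], [0, 0, 16]]

print(accuracy_from_confusion(dtc_counts))  # 0.96
print(accuracy_from_confusion(knn_counts))  # 0.98
```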


 .
 .
 .

If we were classifying something as positive or negative, for example whether a doctor's patient is sick, it might be important to know more than just the accuracy of the classifier. For such a two-class problem, the individual components of the confusion matrix are labeled True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN).

TP: Number of instances where the classifier said the patient was sick when the patient was sick

FP: Number of instances where the classifier said the patient was sick when the patient was not sick

TN: Number of instances where the classifier said the patient was not sick when the patient was not sick

FN: Number of instances where the classifier said the patient was not sick when the patient was sick

In [10]:
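print('Components to confusion matrix representation:')
# Note: this layout puts predicted labels in rows and true labels in columns,
# the transpose of the orientation sklearn's confusion_matrix uses above.
m = [['TP','FP'],
     ['FN','TN']]
print('[TP,FP]\n[FN,TN]')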

Components to confusion matrix representation:
[TP,FP]
[FN,TN]
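
For a multi-class problem such as Iris, the same four counts can be read off a confusion matrix by treating one class as "positive" and all others as "negative". The sketch below does this for the K-Neighbors matrix printed earlier, with rows as true labels and columns as predictions. The helper `one_vs_rest_counts` is hypothetical, and recall (TP / (TP + FN)) and precision (TP / (TP + FP)) are shown only as two common summaries built from these counts.

```python
def one_vs_rest_counts(matrix, positive):
    """Return (TP, FP, FN, TN) for one class; rows are true labels, columns are predictions."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[positive][positive]
    # True class is `positive`, but another class was predicted
    fn = sum(matrix[positive][j] for j in range(n) if j != positive)
    # `positive` was predicted, but the true class is something else
    fp = sum(matrix[i][positive] for i in range(n) if i != positive)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

# Values copied from the printed K-Neighbors confusion matrix
knn_counts = [[17, 0, 0], [0, 16, 1], [0, 0, 16]]

tp, fp, fn, tn = one_vs_rest_counts(knn_counts, positive=1)
print(tp, fp, fn, tn)  # 16 0 1 33
print(tp / (tp + fn))  # recall for class 1
print(tp / (tp + fp))  # precision for class 1
```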
