## Performance Metrics for Classification

In Chapter 1, we evaluated the performance of a k-NN classifier based on its accuracy. However, accuracy is not always an informative metric, particularly when one class is much more common than the other. In this exercise, we will dive more deeply into evaluating the performance of binary classifiers by computing a __confusion matrix__ and generating a classification report.
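
To see why accuracy alone can mislead, here is a minimal sketch using made-up labels (not taken from the dataset used in this exercise): 95 negatives, 5 positives, and a hypothetical "classifier" that always predicts the majority class. The names `y_true` and `y_pred_trivial` are illustrative only.

```python
# Hypothetical, made-up labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A trivial "classifier" that always predicts the majority class.
y_pred_trivial = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true labels.
accuracy = sum(t == p for t, p in zip(y_true, y_pred_trivial)) / len(y_true)

# Recall for the positive class: fraction of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred_trivial))
recall_positive = true_positives / sum(t == 1 for t in y_true)

print("Accuracy: {:.2f}".format(accuracy))
print("Recall (class 1): {:.2f}".format(recall_positive))
```

The accuracy looks high even though not a single positive case is detected, which is the kind of failure a confusion matrix makes visible.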

Here, we'll work with the Pima Indians Diabetes dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will develop diabetes based on features such as BMI, age, and number of pregnancies. It is therefore a binary classification problem: a target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. The dataset has been preprocessed to deal with missing values.

### k-NN Classifier

In [29]:
# Import necessary modules
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Import KNeighborsClassifier and train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [30]:
column_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes']

In [31]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data',
                 names = column_names)

In [32]:
# Optional check: count missing values per column
# df.isnull().sum()

In [33]:
# Build the feature array X and the target array y
X, y = df.drop('diabetes', axis=1).values, df['diabetes'].values

Note the use of .drop() to drop the target variable 'diabetes' from the feature array X, as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without .values, X and y are a DataFrame and a Series respectively; the scikit-learn API also accepts them in this form, as long as they have the right shape.
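
The difference described in the note above can be checked on a small example. This is a minimal sketch on a tiny, made-up DataFrame (the name `toy` and its values are assumptions for illustration; only the column names are borrowed from `column_names`).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame reusing three of the column names from above.
toy = pd.DataFrame({
    'glucose': [120, 90, 150],
    'bmi': [30.0, 25.0, 35.0],
    'diabetes': [1, 0, 1],
})

# Without .values: a DataFrame of features and a Series of targets.
X_frame, y_series = toy.drop('diabetes', axis=1), toy['diabetes']

# With .values: plain NumPy arrays, as in the cell above.
X_arr, y_arr = X_frame.values, y_series.values

print(type(X_frame).__name__, type(y_series).__name__)
print(type(X_arr).__name__, X_arr.shape, y_arr.shape)
```

Note that .drop() returns a new DataFrame here, so `toy` itself still contains the 'diabetes' column.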

### Evaluating the k-NN Classifier with a Confusion Matrix and Classification Report

In [42]:
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print("Confusion Matrix: \n{}".format(confusion_matrix(y_test, y_pred)))
print('\nClassification_report: \n{}'.format(classification_report(y_test, y_pred)))

Confusion Matrix: 
[[176  30]
 [ 56  46]]

Classification_report: 
             precision    recall  f1-score   support

          0       0.76      0.85      0.80       206
          1       0.61      0.45      0.52       102

avg / total       0.71      0.72      0.71       308
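
To connect the two outputs, the class 1 row of the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names (`tn`, `fp`, `fn`, `tp`) are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
# Rows are true classes, columns are predicted classes.
tn, fp = 176, 30   # true class 0: predicted 0, predicted 1
fn, tp = 56, 46    # true class 1: predicted 0, predicted 1

# Overall fraction of correct predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Metrics for the positive class (1 = has diabetes).
precision = tp / (tp + fp)   # of those predicted positive, how many are positive
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.2f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
```

The precision, recall, and F1 values round to 0.61, 0.45, and 0.52, matching the class 1 row of the report. Overall accuracy is about 0.72, yet the classifier finds fewer than half of the patients who actually have diabetes.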

