# Classification with Scikit-learn

Term 1 2019 - Instructor: Teerapong Leelanupab

Teaching Assistant: Suttida Satjasunsern
***

Classification involves using labeled (known) training examples to make accurate predictions for new, unseen input examples. In this lab we will use the classification functionality provided by the scikit-learn Python package.

The *k-Nearest Neighbour (KNN) classifier* is a simple but effective "lazy" classifier. Given a new input example, it finds the most similar previous examples for which a decision has already been made (i.e. its nearest neighbours in the training set). A prediction for the input is then made by majority vote among those $k$ neighbours.
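
To make the voting idea concrete, here is a minimal from-scratch sketch of the procedure described above. The function name `knn_predict`, the toy data, and the choice of Euclidean distance are assumptions made for illustration only; this is not how scikit-learn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training examples."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(train_X - query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two small clusters
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(train_X, train_y, np.array([1.1, 1.0]), k=3))
print(knn_predict(train_X, train_y, np.array([4.9, 5.0]), k=3))
```

Note that no model is built ahead of time: all the work happens at prediction time, which is why KNN is called a "lazy" classifier.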

#### Example 1: KNN Classifier

The scikit-learn package includes a number of datasets, which are useful for getting a handle on a given machine learning algorithm before using it in your own work. We will load the version of the Iris dataset which is provided by scikit-learn:

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
type(iris)

sklearn.utils.Bunch

This dataset has four different descriptive features:

In [4]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


Each example in the dataset has a class label or a "target" from three possible classes:

In [5]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


Build a nearest neighbour classifier using $k=1$ nearest neighbour. In this case we will use the full dataset and all of the target labels for those examples:

In [7]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

We can test it out by making a prediction for a new input example described by 4 feature values:

In [9]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [8]:
import numpy as np
xinput = np.array([[3.0, 5.0, 4.1, 2.0]])
# make the prediction, the output is the number of the class
pred_class_number = knn.predict(xinput)
# to get the name of the class
print( iris.target_names[pred_class_number] )

['virginica']


We can also predict for multiple input examples at once:

In [10]:
xinput = np.array([[3, 5, 4, 2], [3, 5, 2, 2]])
pred_class_numbers = knn.predict(xinput)
print( iris.target_names[pred_class_numbers] )

['virginica' 'setosa']
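
To see why each input received its label, it can help to look at the neighbours themselves. The sketch below refits the same $k=1$ classifier and uses the `kneighbors` method of `KNeighborsClassifier` to retrieve the nearest training example for each input; the printing format is just one illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

xinput = np.array([[3, 5, 4, 2], [3, 5, 2, 2]])
# distances to, and row indices of, the nearest training example for each input
distances, indices = knn.kneighbors(xinput)

for dist, idx in zip(distances[:, 0], indices[:, 0]):
    label = iris.target_names[iris.target[idx]]
    print("nearest training row %d (%s) at distance %.2f" % (idx, label, dist))
```

With $k=1$, the predicted class is simply the class of that single nearest training example.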


#### Example 2: KNN Classifier

Next, we will load a CSV copy of the Pima Indians Diabetes dataset from the UCI Machine Learning Repository, where the task is to predict whether a patient tested positive or negative for diabetes. In this copy the label is stored as the strings `positive` and `negative` in the `class` column.

In [11]:
import pandas as pd
# load the CSV file into a pandas DataFrame
df = pd.read_csv("data/diabetes.csv")
print(df.shape)
df.head(10)

(739, 9)


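|   | preg | plas | pres | skin | insu | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|----------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | positive |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | negative |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | positive |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | negative |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | positive |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | negative |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | positive |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | negative |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | positive |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | positive |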


In [19]:
df.values[:,0:7]

array([[6, 148, 72, ..., 0, 33.6, 0.627],
       [1, 85, 66, ..., 0, 26.6, 0.35100000000000003],
       [8, 183, 64, ..., 0, 23.3, 0.672],
       ...,
       [0, 126, 86, ..., 120, 27.4, 0.515],
       [8, 65, 72, ..., 0, 32.0, 0.6],
       [2, 99, 60, ..., 160, 36.6, 0.45299999999999996]], dtype=object)

The CSV format of the dataset contains data for 739 rows (patients), each with 9 columns. These are 8 descriptive numeric features, and the binary target value. We will separate out the descriptive columns from the target column (i.e. the class labels). 

In [20]:
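# convert from DataFrame to NumPy array
raw_dataset = df.values
# separate the data into descriptive features and a target class
# feature columns used for training; note that 0:7 selects only the first 7
# columns (preg to pedi), so the 8th feature (age) is left out
dataset = raw_dataset[:,0:7]
# target (label) column
target = raw_dataset[:,8]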

In [21]:
print(dataset)

[[6 148 72 ... 0 33.6 0.627]
 [1 85 66 ... 0 26.6 0.35100000000000003]
 [8 183 64 ... 0 23.3 0.672]
 ...
 [0 126 86 ... 120 27.4 0.515]
 [8 65 72 ... 0 32.0 0.6]
 [2 99 60 ... 160 36.6 0.45299999999999996]]


In [24]:
print(target[:5])

['positive' 'negative' 'positive' 'negative' 'positive']
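
Selecting columns by position works, but it is easy to be off by one. As an alternative sketch, the same separation can be done by column name. The small frame below reuses the first three rows of the table above with only a subset of its columns, and `df_small`, `features` and `labels` are illustrative names rather than part of the lab's code.

```python
import pandas as pd

# small frame built from the first three rows of the table above (subset of columns)
df_small = pd.DataFrame({
    "preg": [6, 1, 8],
    "plas": [148, 85, 183],
    "age": [50, 31, 32],
    "class": ["positive", "negative", "positive"],
})

# every column except the label becomes a feature, stored as floats
features = df_small.drop(columns=["class"]).to_numpy(dtype=float)
# the label column on its own
labels = df_small["class"].to_numpy()

print(features.shape)
print(labels)
```

Converting only the numeric columns also gives a float array rather than the `dtype=object` array produced by calling `df.values` on a frame that mixes numbers and strings.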


Now, we will randomly split the complete dataset into a training set (used to build the model) and an unseen test set (used to try out and evaluate the model). Scikit-learn provides functionality to do this. We will specify that 10% (0.1) of the data will be used for the test set.

In [40]:
from sklearn.model_selection import train_test_split
dataset_train, dataset_test, target_train, target_test = train_test_split(dataset, target, test_size=0.1)

In [41]:
print("Training set size is %d" % dataset_train.shape[0] )
print("Test set size is %d" % dataset_test.shape[0] )

Training set size is 665
Test set size is 74
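
Because the split is random, the rows that land in the test set (and therefore the results that follow) will differ from run to run. The sketch below shows one way to make the split repeatable, using the `random_state` and `stratify` parameters of `train_test_split`. The placeholder arrays and the seed value 42 are arbitrary choices for illustration, not part of the lab's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same number of rows as the diabetes CSV
X = np.zeros((739, 7))
y = np.array(["positive", "negative"] * 369 + ["positive"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.1,     # hold out 10% of the rows
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y,        # keep the class proportions similar in both parts
)
print(X_train.shape[0], X_test.shape[0])
```

For 739 rows this yields 665 training and 74 test examples, the same sizes as reported above.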


Next, we will fit a k-nearest neighbour model to the training data using $k=3$ nearest neighbours:

In [42]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(dataset_train, target_train)
print(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')


Make predictions for the test set, based on the model that we just built:

In [43]:
predicted = model.predict(dataset_test)
predicted

array(['negative', 'negative', 'negative', 'negative', 'negative',
       'positive', 'negative', 'positive', 'negative', 'positive',
       'negative', 'negative', 'negative', 'negative', 'negative',
       'positive', 'negative', 'positive', 'positive', 'positive',
       'positive', 'positive', 'negative', 'positive', 'negative',
       'negative', 'negative', 'positive', 'negative', 'negative',
       'positive', 'negative', 'negative', 'negative', 'positive',
       'positive', 'negative', 'negative', 'negative', 'negative',
       'negative', 'negative', 'positive', 'positive', 'negative',
       'positive', 'negative', 'negative', 'negative', 'negative',
       'positive', 'positive', 'negative', 'negative', 'negative',
       'positive', 'negative', 'negative', 'negative', 'positive',
       'positive', 'negative', 'negative', 'negative', 'negative',
       'negative', 'negative', 'negative', 'negative', 'negative',
       'negative', 'negative', 'negative', 'negative'], dtype=object)

In [44]:
# the labels are the strings 'positive' and 'negative', not 1 and 0
num_pos = (predicted == 'positive').sum()
num_neg = (predicted == 'negative').sum()
print( "Number of patients predicted positive for diabetes: %d" % num_pos )
print( "Number of patients predicted negative for diabetes: %d" % num_neg )

Number of patients predicted positive for diabetes: 22
Number of patients predicted negative for diabetes: 52


We can compare our predictions to the "correct answer" based on the labels for the test data:

In [45]:
print("Predictions\n", predicted)
print("Correct labels\n", target_test)

Predictions
 ['negative' 'negative' 'negative' 'negative' 'negative' 'positive'
 'negative' 'positive' 'negative' 'positive' 'negative' 'negative'
 'negative' 'negative' 'negative' 'positive' 'negative' 'positive'
 'positive' 'positive' 'positive' 'positive' 'negative' 'positive'
 'negative' 'negative' 'negative' 'positive' 'negative' 'negative'
 'positive' 'negative' 'negative' 'negative' 'positive' 'positive'
 'negative' 'negative' 'negative' 'negative' 'negative' 'negative'
 'positive' 'positive' 'negative' 'positive' 'negative' 'negative'
 'negative' 'negative' 'positive' 'positive' 'negative' 'negative'
 'negative' 'positive' 'negative' 'negative' 'negative' 'positive'
 'positive' 'negative' 'negative' 'negative' 'negative' 'negative'
 'negative' 'negative' 'negative' 'negative' 'negative' 'negative'
 'negative' 'negative']
Correct labels
 ['negative' 'positive' 'positive' 'negative' 'positive' 'positive'
 'negative' 'positive' 'negative' 'positive' 'negative' 'positive'
 'negativ

We can quantitatively check how accurate these predictions are, by measuring *accuracy*, which will return a value between 0.0 (predictions are completely wrong) and 1.0 (predictions are 100% accurate):

In [46]:
from sklearn.metrics import accuracy_score
accuracy_score(target_test, predicted)

0.7027027027027027
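
Accuracy is simply the fraction of predictions that match the true labels. As a small sketch, the calculation can be reproduced by hand and compared with `accuracy_score`. The two short label arrays below are made up for illustration and are not the lab's actual results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# short made-up label arrays, not the lab's actual results
true_labels = np.array(["negative", "positive", "positive", "negative", "positive"])
predictions = np.array(["negative", "negative", "positive", "negative", "negative"])

# fraction of positions where prediction and true label agree
manual_accuracy = np.mean(predictions == true_labels)

print(manual_accuracy)
print(accuracy_score(true_labels, predictions))
```

Here three of the five predictions are correct, so both lines print 0.6. By the same reasoning, the score of 0.7027 above corresponds to 52 of the 74 test predictions being correct.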

In a later lab we will look at evaluation measures for classification in more detail.