## Exploring kNN classifiers using Wisconsin Breast Cancer data from [UCI](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

#### Analysis follows Chapter 3 of *Machine Learning with R* by Brett Lantz (though of course here we use Python, not R)


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import itertools

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

warnings.filterwarnings("ignore")
%matplotlib
sns.set(style="white", color_codes=True)

Using matplotlib backend: MacOSX


### Objective:  Use a kNN classifier to predict whether a tumor is benign or malignant.

## Import data into a pandas dataframe

In [2]:
data = pd.read_csv('wdbc.data.txt', header=None)

## Basic data exploration

In [3]:
data.shape

(569, 32)

In [4]:
data.head(2)

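|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |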


In [5]:
data[1].value_counts()

B    357
M    212
Name: 1, dtype: int64

## Data cleaning

#### There are 569 data points with 32 columns. Column 1 holds the diagnosis, with a reasonable mix of both classes (357 benign, 212 malignant) for the classifier to learn from.

#### Column 0 contains an ID that can be dropped:

In [6]:
data.drop(0, axis=1, inplace=True)

#### Column 1 contains the diagnosis, either M ("Malignant") or B ("Benign").  Explicitly rename the column and the values for clarity:

In [7]:
data.rename(columns={1:"Diagnosis"}, inplace=True)

In [8]:
def diagnosis_text(diag):
    if diag == 'M':
        return 'Malignant'
    elif diag == 'B':
        return 'Benign'
    else:
        return diag
    
Y = data['Diagnosis'].apply(diagnosis_text)
X = data.drop('Diagnosis', axis=1)

#### The dataframe X contains the 30 remaining columns that will be used in the classifier.  The exact column names are irrelevant for the classifier, so we'll just keep the numerical labels.
#### The Series Y contains only the diagnosis, which the classifier will try to predict.
#### We'll scale the data before applying the classifier so that no single column dominates the distance calculation simply because of its units.  Two types of scaling will be compared:
1. "Z-scaled":  the values are converted to Z-values so that each column has mean = 0 and standard deviation = 1.
2. "Min/Max scaled": the values are converted to a range of 0-1.  0 is assigned to the minimum value of each column, and 1 to the maximum.

In [9]:
sc = StandardScaler()
mm = MinMaxScaler()
X_z = pd.DataFrame(sc.fit_transform(X))
X_mm = pd.DataFrame(mm.fit_transform(X))
X = pd.concat([X_z, X_mm], axis=1)
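
As a minimal sketch of what the two scalers compute, the snippet below applies both formulas by hand to a hypothetical toy column (not taken from the dataset) and checks the result against `StandardScaler` and `MinMaxScaler`. The array `col` and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column, used only for illustration
col = np.array([[2.0], [4.0], [6.0], [12.0]])

# Z-scaling: subtract the column mean, divide by the population std (ddof=0)
z_manual = (col - col.mean(axis=0)) / col.std(axis=0)

# Min/max scaling: map the column minimum to 0 and the maximum to 1
mm_manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# Both hand-computed versions should agree with the scikit-learn scalers
assert np.allclose(z_manual, StandardScaler().fit_transform(col))
assert np.allclose(mm_manual, MinMaxScaler().fit_transform(col))

print(mm_manual.ravel().tolist())  # [0.0, 0.2, 0.4, 1.0]
```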

## More data exploration
#### The first 30 columns of X now contain the Z-scaled data, while the next 30 columns contain the Min/Max scaled data.

In [10]:
X.iloc[:,:30].describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 569.0 | -1.256562e-16 | 1.00088 | -2.029648 | -0.689385 | -0.215082 | 0.469393 | 3.971288 |
| 1 | 569.0 | 1.049736e-16 | 1.00088 | -2.229249 | -0.725963 | -0.104636 | 0.584176 | 4.651889 |
| 2 | 569.0 | -1.272171e-16 | 1.00088 | -1.984504 | -0.691956 | -0.23598 | 0.499677 | 3.97613 |
| 3 | 569.0 | -1.900452e-16 | 1.00088 | -1.454443 | -0.667195 | -0.295187 | 0.363507 | 5.250529 |
| 4 | 569.0 | 1.490704e-16 | 1.00088 | -3.112085 | -0.710963 | -0.034891 | 0.636199 | 4.770911 |
| 5 | 569.0 | 2.544342e-16 | 1.00088 | -1.610136 | -0.747086 | -0.22194 | 0.493857 | 4.568425 |
| 6 | 569.0 | -1.338511e-16 | 1.00088 | -1.114873 | -0.743748 | -0.34224 | 0.526062 | 4.243589 |
| 7 | 569.0 | -8.429110e-17 | 1.00088 | -1.26182 | -0.737944 | -0.397721 | 0.646935 | 3.92793 |
| 8 | 569.0 | 2.081912e-16 | 1.00088 | -2.744117 | -0.70324 | -0.071627 | 0.530779 | 4.484751 |
| 9 | 569.0 | 5.408679e-16 | 1.00088 | -1.819865 | -0.722639 | -0.178279 | 0.470983 | 4.910919 |


In [11]:
X.iloc[:,30:].describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 569.0 | 0.338222 | 0.166787 | 0.0 | 0.223342 | 0.302381 | 0.416442 | 1.0 |
| 1 | 569.0 | 0.323965 | 0.145453 | 0.0 | 0.218465 | 0.308759 | 0.40886 | 1.0 |
| 2 | 569.0 | 0.332935 | 0.167915 | 0.0 | 0.216847 | 0.293345 | 0.416765 | 1.0 |
| 3 | 569.0 | 0.21692 | 0.149274 | 0.0 | 0.117413 | 0.172895 | 0.271135 | 1.0 |
| 4 | 569.0 | 0.394785 | 0.126967 | 0.0 | 0.304595 | 0.390358 | 0.47549 | 1.0 |
| 5 | 569.0 | 0.260601 | 0.161992 | 0.0 | 0.139685 | 0.224679 | 0.340531 | 1.0 |
| 6 | 569.0 | 0.208058 | 0.186785 | 0.0 | 0.06926 | 0.144189 | 0.306232 | 1.0 |
| 7 | 569.0 | 0.243137 | 0.192857 | 0.0 | 0.100944 | 0.166501 | 0.367793 | 1.0 |
| 8 | 569.0 | 0.379605 | 0.138456 | 0.0 | 0.282323 | 0.369697 | 0.45303 | 1.0 |
| 9 | 569.0 | 0.270379 | 0.148702 | 0.0 | 0.163016 | 0.243892 | 0.340354 | 1.0 |


#### The data are generally more heavily concentrated in the lower values, with some high outliers in several columns.
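
A side note on the Z-scaled summary: the `std` column reads 1.00088 rather than exactly 1. This is expected, because `StandardScaler` divides by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`). The short sketch below reproduces the ratio between the two for 569 rows; the variable names are illustrative.

```python
import math

n_rows = 569  # number of rows in the dataset

# For a column whose population std is exactly 1, the sample std
# is larger by a factor of sqrt(n / (n - 1)).
sample_std = math.sqrt(n_rows / (n_rows - 1))

print(round(sample_std, 5))  # 1.00088
```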

## Data analysis

#### First, split the data into test (30%) and training (70%) sets:

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=324)

In [13]:
#first 30 columns are Z scaled, second 30 columns are min/max scaled
X_train.shape

(398, 60)

In [14]:
Y_train.shape

(398,)

#### Run the kNN classifier using both types of scaled data.  Choose k = sqrt(n) as a first guess.

In [15]:
#choose k = 19, odd number close to sqrt(398)

model_z = KNeighborsClassifier(n_neighbors=19)
model_z.fit(X_train.iloc[:,:30], Y_train)
predictions_z = model_z.predict(X_test.iloc[:,:30])
model_mm = KNeighborsClassifier(n_neighbors=19)
model_mm.fit(X_train.iloc[:,30:], Y_train)
predictions_mm = model_mm.predict(X_test.iloc[:,30:])
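
The choice of k = 19 can be reproduced with a few lines of arithmetic. The helper `nearest_odd` below is a hypothetical name introduced for this sketch; an odd k is preferred because it avoids tied votes between the two classes.

```python
import math

def nearest_odd(x):
    """Return the odd integer closest to x (the lower one on a tie)."""
    lower = math.floor(x)
    if lower % 2 == 0:
        lower -= 1          # step down to the odd number just below x
    upper = lower + 2       # next odd number above
    return lower if (x - lower) <= (upper - x) else upper

n_train = 398  # rows in X_train, from the shape output above
k_guess = nearest_odd(math.sqrt(n_train))

print(round(math.sqrt(n_train), 2), k_guess)  # 19.95 19
```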

#### Code for plotting confusion matrices from the scikit-learn [website](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

In [16]:
#from scikit-learn website:
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    print(title)
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    

# Compute confusion matrix
cnf_matrix_z = confusion_matrix(Y_test, predictions_z)
cnf_matrix_mm = confusion_matrix(Y_test, predictions_mm)

# Plot Z-scaled confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_z, classes=['Benign','Malignant'],
                      title='Confusion matrix, Z-scaled data')

# Plot Min/Max-scaled confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_mm, classes=['Benign','Malignant'],
                      title='Confusion matrix, Min/Max-scaled data')

plt.show()

Confusion matrix, Z-scaled data
[[104   0]
 [ 10  57]]
Confusion matrix, Min/Max-scaled data
[[104   0]
 [  9  58]]


In [17]:
print('For z-normalized data, k = 19: precision = {0:.3f}, recall = {1:.3f}'.format(precision_score(Y_test, 
     predictions_z, average=None)[0],recall_score(Y_test, predictions_z, average=None)[0]))
print('F1 score for z-normalized data, k = 19: {0:.3f}'.format(f1_score(Y_test, predictions_z, average=None)[0]))
print('For Min/Max-normalized data, k = 19: precision = {0:.3f}, recall = {1:.3f}'.format(precision_score(Y_test, 
     predictions_mm, average=None)[0],recall_score(Y_test, predictions_mm, average=None)[0]))
print('F1 score for Min/Max-normalized data, k = 19: {0:.3f}'.format(f1_score(Y_test, predictions_mm, average=None)[0]))

For z-normalized data, k = 19: precision = 0.912, recall = 1.000
F1 score for z-normalized data, k = 19: 0.954
For Min/Max-normalized data, k = 19: precision = 0.920, recall = 1.000
F1 score for Min/Max-normalized data, k = 19: 0.959
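
The printed scores can be re-derived directly from the confusion matrix, which also makes clear which class they describe. The sketch below uses the Z-scaled matrix from the output above; `class_scores` is a hypothetical helper written for this illustration. Index 0 (Benign) reproduces the printed values, while index 1 shows the same results from the Malignant point of view.

```python
# Confusion matrix for the Z-scaled data at k = 19, copied from the output above.
# Rows are true labels, columns are predicted labels, in the order Benign, Malignant.
cm = [[104, 0],
      [10, 57]]

def class_scores(cm, idx):
    """Precision, recall and F1 for the class at position idx."""
    true_pos = cm[idx][idx]
    predicted = sum(row[idx] for row in cm)  # column sum: everything predicted as this class
    actual = sum(cm[idx])                    # row sum: everything truly in this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

benign = class_scores(cm, 0)
malignant = class_scores(cm, 1)

print([round(v, 3) for v in benign])     # [0.912, 1.0, 0.954]
print([round(v, 3) for v in malignant])  # [1.0, 0.851, 0.919]
```

Read this way, the 10 malignant tumors predicted as benign lower Benign precision and, equivalently, Malignant recall.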


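#### The classifier has perfect recall for both types of scaled data, and the Min/Max-scaled data has a slightly better F-score on the test data at this k value (0.959 vs. 0.954).  Note that these scores are computed for the Benign class (index 0 of the per-class score arrays), so perfect recall means that no benign tumor was flagged as malignant; the 10 (Z-scaled) and 9 (Min/Max-scaled) malignant tumors classified as benign show up as reduced precision instead.
#### Now let's check whether there is a better choice for k.  We'll carve a validation set out of the training set, loop over all odd k values, and find the k with the highest F-score for each type of scaling.  We'll also look for the k with the highest precision: because precision is reported for the Benign class, maximizing it minimizes the number of malignant tumors predicted as benign, i.e., missed cancers (false negatives for malignancy, which are especially bad in this use case).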
 

In [18]:
#return precision, recall, f-score for a model
def model_score(model, y_tr, x_tr, y_tst, x_tst):
    
    score_dict = {}
    model.fit(x_tr, y_tr)
    predictions = model.predict(x_tst)
    score_dict['Precision'] = precision_score(y_tst, predictions, average=None)[0]
    score_dict['Recall'] = recall_score(y_tst, predictions, average=None)[0]
    score_dict['Fscore'] = f1_score(y_tst, predictions, average=None)[0]
    return score_dict

#create validation set
X_tr, X_cv, Y_tr, Y_cv = train_test_split(X_train, Y_train, test_size=0.3, random_state=464)

#store results in a list of dictionaries
scores_z = []
scores_mm = []

#all possible odd values
for k in range(1,X_tr.shape[0]+1,2):
    
    #calculate scores for z-scaled data
    model_z = KNeighborsClassifier(n_neighbors=k)
    dict_z = model_score(model_z, Y_tr, X_tr.iloc[:,:30], Y_cv, X_cv.iloc[:,:30])
    dict_z['k'] = k
    scores_z.append(dict_z)
    
    #calculate scores for min/max-scaled data
    model_mm = KNeighborsClassifier(n_neighbors=k)
    dict_mm = model_score(model_mm, Y_tr, X_tr.iloc[:,30:], Y_cv, X_cv.iloc[:,30:])
    dict_mm['k'] = k
    scores_mm.append(dict_mm)

df_scores_z = pd.DataFrame(scores_z)
df_scores_mm = pd.DataFrame(scores_mm)
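
A single validation split can give a noisy estimate of the best k. One possible variation, which is not what this notebook does, is k-fold cross-validation with the scaler placed inside a pipeline so that it is fitted on the training folds only. The sketch below assumes scikit-learn's bundled copy of the same dataset (`load_breast_cancer`, coded 0 = malignant, 1 = benign) as a stand-in for `wdbc.data.txt`; the k range and all variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled WDBC copy stands in for wdbc.data.txt
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_all, y_all, test_size=0.3, random_state=324, stratify=y_all)

cv_f1_by_k = {}
for k in range(1, 32, 2):  # odd k values only; the upper bound is arbitrary
    # The scaler is refitted inside every fold, so no validation rows leak into it
    pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    # scoring="f1" scores label 1, which is the benign class in this copy
    scores = cross_val_score(pipe, X_tr_demo, y_tr_demo, cv=5, scoring="f1")
    cv_f1_by_k[k] = scores.mean()

best_k_demo = max(cv_f1_by_k, key=cv_f1_by_k.get)
print(best_k_demo, round(cv_f1_by_k[best_k_demo], 3))
```

The held-out rows (`X_te_demo`, `y_te_demo`) are left untouched here so they can still serve as a final test set.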
    

In [19]:
print('For z-scaled data, the highest F-score occurs at k = {0}' \
      .format(df_scores_z.loc[df_scores_z['Fscore'] == max(df_scores_z['Fscore'])]['k'].to_string(index=False)))
print('For min/max-scaled data, the highest F-score occurs at k = {0}' \
      .format(df_scores_mm.loc[df_scores_mm['Fscore'] == max(df_scores_mm['Fscore'])]['k'].to_string(index=False)))
print('For z-scaled data, the highest precision occurs at k = {0}' \
      .format(df_scores_z.loc[df_scores_z['Precision'] == max(df_scores_z['Precision'])]['k'].to_string(index=False)))
print('For min/max-scaled data, the highest precision occurs at k = {0}' \
      .format(df_scores_mm.loc[df_scores_mm['Precision'] == max(df_scores_mm['Precision'])]['k'].to_string(index=False)))

For z-scaled data, the highest F-score occurs at k = 1
For min/max-scaled data, the highest F-score occurs at k = 3
For z-scaled data, the highest precision occurs at k = 1
For min/max-scaled data, the highest precision occurs at k = 3
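
The boolean-mask lookup used above prints every k that ties for the maximum. A more compact alternative is `idxmax`, which returns only the first maximum. The sketch below contrasts the two on a small table of hypothetical scores (invented for illustration, not results from this dataset).

```python
import pandas as pd

# Hypothetical scores, for illustration only
df_demo = pd.DataFrame({'k': [1, 3, 5, 7],
                        'Fscore': [0.95, 0.97, 0.97, 0.94]})

# idxmax returns the row label of the first maximum
first_best_k = int(df_demo.loc[df_demo['Fscore'].idxmax(), 'k'])

# A boolean mask keeps every k that ties for the maximum
tied_best_k = df_demo.loc[df_demo['Fscore'] == df_demo['Fscore'].max(), 'k'].tolist()

print(first_best_k, tied_best_k)  # 3 [3, 5]
```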


#### Our original choice of k = 19 was sub-optimal on the validation set, so let's refit the classifiers on the full training set using the better k values, and then predict the diagnoses in the test data:

In [20]:
#Run full model on best k
    
model_z = KNeighborsClassifier(n_neighbors=1)
dict_z = model_score(model_z, Y_train, X_train.iloc[:,:30], Y_test, X_test.iloc[:,:30])

model_mm = KNeighborsClassifier(n_neighbors=3)
dict_mm = model_score(model_mm, Y_train, X_train.iloc[:,30:], Y_test, X_test.iloc[:,30:])

print('For z-normalized data, k = 1, test data classifier scores: precision = {0:.3f}, recall = {1:.3f}, F-score = {2:.3f}' \
      .format(dict_z['Precision'], dict_z['Recall'], dict_z['Fscore']))
print('For min/max-normalized data, k = 3, test data classifier scores: precision = {0:.3f}, recall = {1:.3f}, F-score = {2:.3f}' \
      .format(dict_mm['Precision'], dict_mm['Recall'], dict_mm['Fscore']))



For z-normalized data, k = 1, test data classifier scores: precision = 0.944, recall = 0.971, F-score = 0.957
For min/max-normalized data, k = 3, test data classifier scores: precision = 0.954, recall = 0.990, F-score = 0.972


In [21]:
model_z.fit(X_train.iloc[:,:30], Y_train)
predictions_z = model_z.predict(X_test.iloc[:,:30])
model_mm.fit(X_train.iloc[:,30:], Y_train)
predictions_mm = model_mm.predict(X_test.iloc[:,30:])

cnf_matrix_z = confusion_matrix(Y_test, predictions_z)
cnf_matrix_mm = confusion_matrix(Y_test, predictions_mm)

# Plot Z-scaled confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_z, classes=['Benign','Malignant'],
                      title='Confusion matrix, Z-scaled data')

# Plot Min/Max-scaled confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix_mm, classes=['Benign','Malignant'],
                      title='Confusion matrix, Min/Max-scaled data')

plt.show()

Confusion matrix, Z-scaled data
[[101   3]
 [  6  61]]
Confusion matrix, Min/Max-scaled data
[[103   1]
 [  5  62]]


## Results and Summary

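#### The classifiers no longer have perfect recall for the Benign class, but the overall F-scores are better (0.957 vs. 0.954 for Z-scaled data, 0.972 vs. 0.959 for Min/Max-scaled data).  Again, the Min/Max-scaled data performs better than the Z-scaled data.

#### However, the overall error rate on the test set is still about 3.5% for the Min/Max-scaled data (6 of 171 cases) and about 5.3% for the Z-scaled data (9 of 171 cases).  This doesn't seem particularly good for this use case, especially since most of the errors are false negatives: 5 and 6 malignant tumors, respectively, were classified as benign.  It might be worth trying other distance metrics and/or only a subset of columns in further analysis in an attempt to improve the classification rate.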
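
As a starting point for that follow-up, the sketch below compares a few distance metrics by recall on the Malignant class, the quantity that counts missed cancers directly. It is an illustrative outline rather than a tuned analysis: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`, coded 0 = malignant, 1 = benign), an arbitrary `random_state`, and k = 3. For real model selection the comparison should be made on a validation set rather than the final test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled WDBC copy stands in for wdbc.data.txt
X_all, y_all = load_breast_cancer(return_X_y=True)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

malignant_recall = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    pipe = make_pipeline(MinMaxScaler(),
                         KNeighborsClassifier(n_neighbors=3, metric=metric))
    pipe.fit(X_fit, y_fit)
    # pos_label=0 scores the malignant class in this copy of the data
    malignant_recall[metric] = recall_score(y_eval, pipe.predict(X_eval), pos_label=0)

for metric, value in malignant_recall.items():
    print(f"{metric}: malignant recall = {value:.3f}")
```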