# COMP3314 - Assignment 1 
created by krohak 2018-02-15

**This notebook contains a from-scratch implementation of a k-NN classifier and some experiments with it.**

Let's first import some essential Python libraries we'll use throughout the notebook:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

## Part 1 - Preprocessing
### Loading the Data

Let's load the data in a Pandas dataframe and have a quick look at it:

In [None]:
wisc_data = pd.read_csv("wisc_bc_data.csv")
wisc_data.head()

In [None]:
wisc_data.shape

Let's get rid of the 'id' column, since it carries no information that is useful for classification.

In [None]:
wisc_data = wisc_data.iloc[:,1:] # get rid of the id
wisc_data.head()

Let's also store the 'diagnosis' column separately, since it holds the labels:

In [None]:
labels = wisc_data.iloc[:,0]
wisc_data = wisc_data.iloc[:,1:] # get rid of the labels
labels.head()

We can also plot the raw dataframe for a quick look at the value ranges of its columns:

In [None]:
wisc_data.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

### Normalization:
We will use two ways to normalize the data:
1. Feature scaling: rescaling every column to the range 0-1 using its min() and max() values
2. Standard score: using mean() and std() to express each value as the number of standard deviations it lies above the column mean

Let's create functions for both types of normalization.

In [None]:
def feature_scalling(dataframe):
    return (dataframe - dataframe.min(0))/ (dataframe.max(0) - dataframe.min(0))

In [None]:
def standard_score(dataframe):
    return (dataframe - dataframe.mean()) / dataframe.std()
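
As a quick sanity check of the two normalizations above, here is a minimal sketch on a made-up NumPy array rather than the real dataframe. The array `toy` and its values are assumptions for illustration only. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, so the sketch passes `ddof=1` to NumPy to match.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features (values are made up for illustration).
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Feature scaling: map each column onto the range [0, 1].
scaled = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# Standard score: centre each column and divide by its sample standard deviation.
# ddof=1 matches the pandas default for DataFrame.std().
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=1)

print(scaled.min(axis=0), scaled.max(axis=0))
print(standardized.mean(axis=0), standardized.std(axis=0, ddof=1))
```

After feature scaling every column should span exactly 0 to 1, and after the standard score every column should have mean 0 and standard deviation 1, whatever its original units were.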

Visualizing the feature-scaled data, which should lie in the range 0-1:

In [None]:
wisc_normalized = feature_scalling(wisc_data)
wisc_normalized.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Visualizing the standard score normalized data:

In [None]:
wisc_normalized = standard_score(wisc_data)
wisc_normalized.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

We will normalize the data using feature scaling in our initial experiments, thus:

In [None]:
wisc_normalized = feature_scalling(wisc_data)

### Splitting training and testing datasets:

Initially, we use 469 samples for training and the remaining 100 for testing.
We can use `train_test_split()` from scikit-learn, which shuffles the rows and splits them according to the fraction reserved for testing, like so:

In [None]:
# `test_size` is the fraction of the data used for testing. 
# Note that instead of passing the entire dataframe, we pass just the values for convenience
X_train, X_test, y_train, y_test = train_test_split(wisc_normalized.values, labels.values, test_size = 0.175, random_state = 0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
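
The choice of `test_size = 0.175` can be checked with a little arithmetic. The sketch below assumes the dataset has 569 rows (inferred from the 469/100 split described above, not read from the file) and assumes that scikit-learn rounds a fractional test share up to the next whole sample, which is my understanding of its behaviour for a float `test_size`.

```python
import math

n_samples = 569      # assumed row count, inferred from the 469/100 split in the text
test_size = 0.175    # fraction passed to train_test_split() above

# Assumption: the test share is rounded up to a whole number of samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

If the shapes printed by the cell above differ from these numbers, the row count assumed here is the first thing to check.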

## Part 2 - KNN Classifier from scratch
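In this part, we create our `MyKNeighborsClassifier` class with `fit()` and `predict()` functions (mirroring the interface of scikit-learn's `KNeighborsClassifier`).

A brief overview of what the functions in `MyKNeighborsClassifier` do:

`fit()` - stores the training samples and their labels; k-NN does no other work at training time.

`uniform_distances()` - for every test sample, computes the Euclidean distance to every training sample and returns the `[distance, label]` pairs sorted from nearest to farthest.

`uniform()` - takes the labels of the `k` nearest training samples for every test sample and predicts the most common one.

`predict()` - returns the predicted labels for the test samples by calling `uniform()`.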

In [None]:
class MyKNeighborsClassifier(object):

    def __init__(self,k=5,weights='uniform'):
        self.k=k
        self.weights=weights

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        
    def uniform(self, X_test):
        distances = self.uniform_distances(self.X_train, X_test)
        y_pred = []
        
        # for all indices of X_test
        for i in range(X_test.shape[0]):
            top_k = []

            # for each of the k nearest neighbours, take its label
            for j in range(self.k):
                top_k.append(distances[i][j][1])
                
            # count the number of labels of each type, take the most common label
            pred = Counter(top_k).most_common(1)[0][0]
            y_pred.append(pred)

        return y_pred

    
    def uniform_distances(self, X_train, X_test):
        distances = []

        # for each node in x_test
        for i in range(X_test.shape[0]):
            euclidian_dist = np.zeros(X_train.shape[0])
            distance_i = []

            # for each node in x_train
            for j in range(X_train.shape[0]):

                # compute the euclidian distance from i to j
                euclidian_dist[j] = np.sqrt(np.sum(np.square(np.array(X_test[i]) - np.array(X_train[j]))))

                # append in distance_i list along with the label of j
                distance_i.append([euclidian_dist[j], self.y_train[j]])

            # sort in increasing order of distance (nearest first)
            distance_i = sorted(distance_i)
            distances.append(distance_i)

        return distances

    
    def predict(self,X_test):
        return self.uniform(X_test)
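
The double loop in `uniform_distances()` is easy to follow but slow on larger datasets. As an alternative, here is a minimal sketch of the same Euclidean distance computation using NumPy broadcasting. The helper name `pairwise_euclidean` and the toy arrays are hypothetical and not part of the class above.

```python
import numpy as np

def pairwise_euclidean(X_test, X_train):
    """Hypothetical helper: distance from every test row to every training row."""
    # Broadcasting: (n_test, 1, n_features) - (1, n_train, n_features)
    diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Made-up data: three training points and one test point in 2D.
X_train_toy = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
y_train_toy = np.array(['B', 'M', 'B'])
X_test_toy = np.array([[0.0, 0.0]])

dist = pairwise_euclidean(X_test_toy, X_train_toy)   # shape (n_test, n_train)
nearest = np.argsort(dist, axis=1)[:, :2]            # indices of the 2 nearest neighbours

print(dist)
print(y_train_toy[nearest])
```

The intermediate `diff` array holds n_test x n_train x n_features values, so this trades memory for speed. For the dataset sizes used in this notebook that should be acceptable.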

Testing its accuracy:

In [None]:
neigh = MyKNeighborsClassifier(k=21)

In [None]:
neigh.fit(X_train,y_train)

In [None]:
y_pred = neigh.predict(X_test)

In [None]:
print(accuracy_score(y_pred , y_test)*100)
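
For reference, accuracy is simply the fraction of test samples whose predicted label matches the true one. The sketch below spells that out in plain Python. The helper name `accuracy` and the toy label lists are hypothetical, and this is not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Hypothetical helper: fraction of positions where the two label lists agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: three of the four predictions are correct.
toy_true = ['M', 'B', 'B', 'M']
toy_pred = ['M', 'B', 'M', 'M']

print(accuracy(toy_true, toy_pred) * 100)
```

Because the measure only counts matches, swapping the two arguments gives the same value, which is why passing `y_pred` before `y_test` in the cell above does not change the result.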

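Now we create the weighted version of `MyKNeighborsClassifier`. We include two new functions:

`weighted_distances()` - for every test sample, computes the squared Euclidean distance to every training sample and pairs it with the training label (encoded as 0 or 1) divided by that squared distance, sorted from nearest to farthest.

`weighted()` - for every test sample, sums the weighted labels of the `k` nearest neighbours, divides by the sum of their inverse squared distances, and rounds the result to 0 ('M') or 1 ('B').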

In [None]:
keys={'M':0,'B':1,0:'M',1:'B'} # for converting the labels into binary classifications and vice-versa

class MyKNeighborsClassifier(object):

    def __init__(self,k=5,weights='uniform'):
        self.k=k
        self.weights=weights

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def weighted(self, X_test):
        distances = self.weighted_distances(self.X_train, X_test)
        y_pred = []

        # for all indices of X_test
        for i in range(X_test.shape[0]):
            top_k = []
            k_dist = []

            # for each k
            for j in range(self.k):
                top_k.append(distances[i][j][1])
                
                # we store the inverse squared distances for the denominator
                k_dist.append(1/distances[i][j][0])
                
            # sum of the k labels weighted by their inverse squared distances / sum of the k inverse squared distances
            sum_k = sum(top_k)/sum(k_dist)
            
            # round off to classify between 0 to 1 and convert back to M or B
            y_pred.append(keys[sum_k.round()])

        return y_pred
    

    def weighted_distances(self, X_train, X_test):
        distances = []

        # for each node in x_test
        for i in range(X_test.shape[0]):
            euclidian_dist_sq = np.zeros(X_train.shape[0])
            distance_i = []

            # for each node in x_train
            for j in range(X_train.shape[0]):

                # compute the square of the euclidian distance from i to j
                euclidian_dist_sq[j] = np.sum(np.square(np.array(X_test[i]) - np.array(X_train[j])))

                # calculate the label weight by converting j's label into binary classification
                # and multiplying with the inverse of the squared distance from i to j
                label_weight = keys[self.y_train[j]]/euclidian_dist_sq[j]
                
                # append in distance_i list 
                distance_i.append([euclidian_dist_sq[j], label_weight])
                
            # sort in increasing order of squared distance (nearest first) and append
            distances.append(sorted(distance_i))

        return distances
  

    def uniform(self, X_test):
        distances = self.uniform_distances(self.X_train, X_test)
        y_pred = []       
        for i in range(X_test.shape[0]):
            top_k = []
            for j in range(self.k):
                top_k.append(distances[i][j][1])
            pred = Counter(top_k).most_common(1)[0][0]
            y_pred.append(pred)
        return y_pred
    
    def uniform_distances(self, X_train, X_test):
        distances = []
        for i in range(X_test.shape[0]):
            euclidian_dist = np.zeros(X_train.shape[0])
            distance_i = []
            for j in range(X_train.shape[0]):
                euclidian_dist[j] = np.sqrt(np.sum(np.square(np.array(X_test[i]) - np.array(X_train[j]))))
                distance_i.append([euclidian_dist[j], self.y_train[j]])
            distance_i = sorted(distance_i)
            distances.append(distance_i)
        return distances
    
    def predict(self,X_test):
        if self.weights == 'distance':
            return self.weighted(X_test)
        return self.uniform(X_test)
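
To see how the weighted vote differs from the uniform one, here is a minimal sketch of the inverse-squared-distance average on made-up numbers. The helper name `weighted_vote` and the `eps` guard are assumptions added for illustration; the class above has no such guard and would divide by zero if a test sample coincided exactly with a training sample.

```python
import numpy as np

def weighted_vote(sq_distances, binary_labels, eps=1e-12):
    """Hypothetical helper: inverse-squared-distance weighted average of 0/1 labels."""
    sq_distances = np.asarray(sq_distances, dtype=float)
    binary_labels = np.asarray(binary_labels, dtype=float)
    # eps is an assumed guard against division by zero for identical points.
    weights = 1.0 / (sq_distances + eps)
    return np.sum(weights * binary_labels) / np.sum(weights)

# Three nearest neighbours: one very close 'M' (0) and two distant 'B' (1).
score = weighted_vote([0.01, 1.0, 1.0], [0, 1, 1])

# A score below 0.5 rounds to 0 ('M'), above 0.5 rounds to 1 ('B').
print(score, int(round(score)))
```

In this toy case a uniform majority vote would pick 'B' (two votes against one), while the weighted vote lets the single close 'M' neighbour dominate.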

Testing its accuracy:

In [None]:
neigh = MyKNeighborsClassifier(k=21,weights='distance')
neigh.fit(X_train,y_train)
y_pred = neigh.predict(X_test)
print(accuracy_score(y_pred , y_test)*100)

Improvement!

## Part 3 - Visualization

In [None]:
plt.plot([i for i in range(X_test.shape[0])], [keys[x] for x in y_test], "co")
plt.plot([i for i in range(X_test.shape[0])], [keys[x] for x in y_pred] ,"r+")
plt.axis([0,100,-1,+2])
plt.show()

In [None]:
fig1, axes = plt.subplots(1,2)

colors1 = ['#c2c2f0','#ffcc99']
colors2 = ['#66b3ff','#ffff55']

true_labels = 'True_M', 'True_B'  # separate name so the diagnosis `labels` from Part 1 is not overwritten
counts = Counter(y_test)
sizes = [counts['M'], counts['B']]
axes[0].pie(sizes, labels=true_labels, autopct='%1.1f%%', colors=colors1, startangle=90)
axes[0].axis('equal')

pred_labels = 'Pred_M', 'Pred_B'
counts = Counter(y_pred)
sizes = [counts['M'], counts['B']]
axes[1].pie(sizes, labels=pred_labels, autopct='%1.1f%%', colors=colors2, startangle=90)
axes[1].axis('equal')  

plt.tight_layout()
plt.show()

In [None]:
labels1 = 'True_M', 'True_B'
counts1 = Counter(y_test)
sizes1 = [counts1['M'], counts1['B']]


labels2 = 'Pred_M', 'Pred_B'
counts2 = Counter(y_pred)
sizes2 = [counts2['M'], counts2['B']]

colors1 = ['#c2c2f0','#ffcc99']
colors2 = ['#66b3ff','#ffff55']

explode = (0.2,0.2) 

plt.pie(sizes1, autopct='%1.1f%%', colors=colors1, pctdistance=0.85, startangle=90, frame=True)
plt.pie(sizes2, autopct='%1.1f%%', colors=colors2, pctdistance=0.85, radius=0.75, startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis('equal')
plt.tight_layout()
plt.legend(labels = labels1 + labels2)
plt.show()

In [None]:
B_fake = 0
M_fake = 0

for i,y in enumerate(y_test):
    if y=='M' and y_pred[i]=='B':
        B_fake+=1
    elif y=='B' and y_pred[i]=='M':
        M_fake+=1
        
B_real = counts2['B'] - B_fake
M_real = counts2['M'] - M_fake
print(B_real,B_fake,M_real,M_fake)
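
The same four counts can also be read off in a single pass by counting (true label, predicted label) pairs. This is a minimal sketch on made-up label lists; the names ending in `_toy` are hypothetical and chosen so that they do not overwrite the variables used in the cells around it.

```python
from collections import Counter

# Made-up labels for illustration only.
true_toy = ['M', 'M', 'B', 'B', 'B']
pred_toy = ['M', 'B', 'B', 'B', 'M']

# Count each (true label, predicted label) pair once.
pairs_toy = Counter(zip(true_toy, pred_toy))

m_real_toy = pairs_toy[('M', 'M')]   # predicted M, truly M
m_fake_toy = pairs_toy[('B', 'M')]   # predicted M, truly B
b_real_toy = pairs_toy[('B', 'B')]   # predicted B, truly B
b_fake_toy = pairs_toy[('M', 'B')]   # predicted B, truly M

print(b_real_toy, b_fake_toy, m_real_toy, m_fake_toy)
```

A `Counter` returns 0 for pairs that never occur, so the lookups are safe even when the classifier makes no mistakes of a given kind.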

In [None]:
# outer ring: number of samples predicted as each class
labels1 = 'Pred_M', 'Pred_B'
counts1 = Counter(y_pred)
sizes1 = [counts1['M'], counts1['B']]

explode1 = (0,0)

# inner ring: correct and incorrect predictions within each predicted class
labels2 = 'True Positive', 'False Positive'
sizes2 = [M_real,M_fake,B_real,B_fake]

explode2 = (0,0,0,0)

colors1 = ['#66b3ff','#ffff55']
colors2 = ['#aaff77','#ff6666']

plt.pie(sizes1, autopct='%1.1f%%', colors=colors1, explode=explode1, pctdistance=1.1, startangle=90, frame=True)
plt.pie(sizes2, autopct='%1.1f%%', colors=colors2, explode=explode2, pctdistance=0.65, radius=0.75, startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis('equal')
plt.tight_layout()
plt.legend(labels = labels1 + labels2)
plt.show()

## Part 4 - Experiments and Comparison
### Sklearn KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5,weights='distance')

In [None]:
neigh.fit(X_train,y_train)

In [None]:
y_pred = neigh.predict(X_test)

In [None]:
print(accuracy_score(y_pred , y_test)*100)