In [1]:
from preamble import *
%matplotlib inline

# KNN Activity

## The Classification Problem

HR has come up with a plan to distribute sweets to all Scripbox employees for our 7th anniversary celebrations. They would like to reward people who have been with Scripbox for a long time by giving them more sweets. However, they are also aware that the older a person is, the fewer sweets they should have. To make this easier, they want to split the employees into groups by tenure and age, and they would like to specify the grouping empirically: a person of kind X gets $S_{x}$ sweets, a person of kind Y gets $S_{y}$ sweets, and so on. To keep things simple, they will restrict themselves to three groups: Red, Black, and Green. Once they identify a small number of people as belonging to each class, the KNN classifier should assign labels to the other employees.

## Setup

### Parameter Space

![ParameterSpace.jpg](attachment:ParameterSpace.jpg)

### Positioning Yourself

![PositionOfDataInParameterSpace.jpg](attachment:PositionOfDataInParameterSpace.jpg)

## Data concepts

- Features/Attributes
  - Age
  - Tenure in Scripbox
  - Class label
- Instances
  - Each of you
- Train and Test datasets
  - Manual tagging of class labels

## Classifier concepts

- KNN  
  - A new instance is assigned the class of its nearest neighbour ($k = 1$) or the majority class among its $k$ nearest neighbours
  - Non-Parametric 
    - Stores all the training instance data 
    - No parameterised model of the training data
  - Instance based classifier
    - Class of learners that compare new instances to previously seen training instances 

- Distance metrics
  - Manhattan
    - $\sum_{i=1}^n |x_i-y_i|$ (Minkowski distance with $p = 1$)
  - Euclidean
    - $\sqrt{\sum_{i=1}^n (x_i-y_i)^2}$
  - Minkowski distance
    - $\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p}$
- Choice of nearest neighbour search algorithm
  - Brute force: $O(ND)$ per query
  - KD Tree: $O(D \log N)$ per query for small $D$; for larger $D$ it is close to $O(DN)$
  - Ball Tree: $O(D \log N)$ per query
- Parameter tuning
  - Influence of k
  - Effect of leaf size
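The distance formulas and the brute-force search listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it; the function names (`minkowski`, `manhattan`, `euclidean`, `knn_predict`) and the age/tenure points with their labels are made up for the example.

```python
import numpy as np
from collections import Counter


def minkowski(x, y, p):
    """Minkowski distance of order p between two 1-D points."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def manhattan(x, y):
    # Minkowski with p = 1: sum of absolute differences.
    return minkowski(x, y, p=1)


def euclidean(x, y):
    # Minkowski with p = 2: straight-line distance.
    return minkowski(x, y, p=2)


def knn_predict(train_x, train_y, query, k=1, distance=euclidean):
    """Brute-force KNN: compare the query with every stored training instance."""
    distances = [distance(row, query) for row in train_x]
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)  # majority vote among them
    return votes.most_common(1)[0][0]


# Hypothetical (age, tenure in years) points with made-up class labels.
train_x = np.array([[25, 1], [27, 2], [40, 6], [42, 7], [55, 3], [58, 2]], dtype=float)
train_y = ["Green", "Green", "Red", "Red", "Black", "Black"]

label = knn_predict(train_x, train_y, query=[41, 5], k=3)
print(label)
```

Because every query is compared with all $N$ stored instances across $D$ features, this sketch has the $O(ND)$ per-query cost of the brute-force option; the tree-based searches exist to avoid that full scan.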

# KNN using Scikit-learn

## Data

**Breast Cancer Wisconsin (Diagnostic) Data Set**

Link: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link] 

Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. 

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34]. 

This database is also available through the UW CS ftp server: <br/>
`ftp ftp.cs.wisc.edu` <br/>
`cd math-prog/cpo-dataset/machine-learn/WDBC/`

In [2]:
import pandas as pd

> Load the train and test datasets from 'data/breast_cancer_wisconsin/train.csv' and 'data/breast_cancer_wisconsin/test.csv' respectively and assign them to `train_df` and `test_df`

> Print column names

> Extract the train features into `train_x` and class variable into `train_y`

> Extract the test features into `test_x` and test class variable into `actual`
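As a hint for the loading and extraction steps, here is a minimal sketch of the pandas pattern involved. It uses a tiny in-memory table instead of the real files, and the column names (`radius_mean`, `texture_mean`, `diagnosis`) and values are assumptions made for illustration; check the printed column names of the actual CSV files before adapting it.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a CSV file; columns and values are made up.
csv_text = """radius_mean,texture_mean,diagnosis
14.2,20.1,1
11.8,17.5,0
20.3,25.0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Column names tell you which column holds the class label.
print(df.columns.tolist())

label_col = "diagnosis"           # hypothetical name of the class column
x = df.drop(columns=[label_col])  # every other column is a feature
y = df[label_col]                 # the class variable
```

The same two lines (`drop` for the features, column selection for the label) apply to both the train and the test frame.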

> Import the KNN classifier from the scikit-learn library using the following command:<br/>
`from sklearn.neighbors import KNeighborsClassifier`

> Create an instance of the KNN classifier using the following command: <br/>
`knn = KNeighborsClassifier(n_neighbors=1)`

> Train the classifier using `train_x` and `train_y` as follows: <br/>
`knn.fit(train_x, train_y)`

> Now classify the test instances using the trained classifier <br/>
`prediction = knn.predict(test_x)`
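The create, fit, and predict steps can be tried end to end with the copy of the Wisconsin diagnostic data that ships with scikit-learn. This sketch assumes that bundled copy and a random 75/25 split stand in for the `train.csv` and `test.csv` files of this activity, so the numbers it produces will not match the activity's files. The `demo_` names are chosen here to avoid overwriting the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled dataset replaces the activity's CSV files.
data = load_breast_cancer()
demo_train_x, demo_test_x, demo_train_y, demo_actual = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=0
)

demo_knn = KNeighborsClassifier(n_neighbors=1)   # 1-nearest-neighbour classifier
demo_knn.fit(demo_train_x, demo_train_y)         # "training" stores the instances
demo_prediction = demo_knn.predict(demo_test_x)  # one label per test instance
```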

> Check the accuracy of the classifier by running the following command:

In [None]:
import numpy as np

# Fraction of test instances whose predicted label matches the actual label
print("Test set score: {:.2f}".format(np.mean(prediction == actual)))
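To explore the influence of k and the leaf size mentioned under parameter tuning, one option is to repeat the fit and score steps for several settings. The sketch below again assumes scikit-learn's bundled copy of the dataset and a random split rather than the activity's CSV files; the chosen values of k are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled dataset and random split stand in for train.csv/test.csv.
data = load_breast_cancer()
tr_x, te_x, tr_y, te_y = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=0
)

scores = {}
for k in (1, 3, 5, 7, 9):
    # algorithm selects the neighbour search; leaf_size applies to the tree searches.
    model = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree", leaf_size=30)
    model.fit(tr_x, tr_y)
    scores[k] = model.score(te_x, te_y)  # mean accuracy on the held-out split

for k, score in scores.items():
    print("k={}: test accuracy {:.2f}".format(k, score))
```

Changing `leaf_size` mainly affects how fast the tree is built and queried and how much memory it needs, whereas changing `n_neighbors` changes the decision itself.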

## Evaluation

![TPNP.png](attachment:TPNP.png)

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

> Import metrics from sklearn using the following imports <br/>
`from sklearn.metrics import precision_score` <br/>
`from sklearn.metrics import recall_score` <br/>
`from sklearn.metrics import f1_score` <br/>

> Calculate precision and recall using the above functions
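A small worked example shows how the scikit-learn functions relate to the counts in the confusion matrix. The label vectors below are made up for illustration (1 is the positive class), and the `toy_` names are chosen here so the notebook's own `actual` and `prediction` are left untouched.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class.
toy_actual = [1, 1, 1, 0, 0, 1, 0, 0]
toy_predicted = [1, 0, 1, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_actual, toy_predicted).ravel()

manual_precision = tp / (tp + fp)  # of the predicted positives, how many are right
manual_recall = tp / (tp + fn)     # of the actual positives, how many were found

precision = precision_score(toy_actual, toy_predicted)
recall = recall_score(toy_actual, toy_predicted)
f1 = f1_score(toy_actual, toy_predicted)

print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
```

Here TP = 3, FP = 1, and FN = 1, so precision and recall are both 0.75. If the class column in your data holds strings rather than 0/1, pass the positive label explicitly through the `pos_label` argument.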