# Intro To Classification
- Regression predicts continuous values, while classification predicts discrete categories. In other words, classification algorithms learn how labeled data is divided into classes and assign new observations to one of those classes.

## K-Nearest Neighbors (KNN)

- KNN takes a different approach to modeling than linear models do, one closer to how humans might approach classification.
- To estimate a value (regression) or class membership (classification), the algorithm:
> - Finds the observations in its training data that are "nearest" to the observation it has to predict
> - Averages the target values of those training observations (regression) or takes a vote among them (classification) to produce the estimate for the new data point.
- Distance is usually calculated using the Euclidean distance
- The "k" refers to the number of nearest neighbors contributing to the prediction
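The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data, not how scikit-learn implements KNN; the function name `knn_predict` and the arrays are assumptions introduced here.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_high = knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_low, pred_high)
```

For regression, the vote would be replaced by the mean of the neighbors' target values.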

In [51]:
# Data Mining
import numpy as np
import pandas as pd

# Standardization
from sklearn.preprocessing import StandardScaler

# Model Building 
from sklearn.neighbors import KNeighborsClassifier

# Model Validation
from sklearn.model_selection import cross_val_score

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [52]:
data = pd.read_csv('./data.csv', header=None, index_col=None)
data.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
1,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
2,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902


# Exploratory Data Analysis
### _Reassigning Column Names_

In [53]:
column_names = ['id','diagnosis', 'radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
                'compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean',
                'radius_se','texture_se','perimeter_se','area_se','smoothness_se','compactness_se','concavity_se',
                'concave points_se','symmetry_se','fractal_dimension_se','radius_worst','texture_worst',
                'perimeter_worst','area_worst','smoothness_worst','compactness_worst','concavity_worst','concave points_worst',
                'symmetry_worst','fractal_dimension_worst']
data.columns = column_names
data.head(3)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
1,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
2,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [54]:
data = data.drop([0])  # Drop the first row, which duplicates the column names
data.head(3)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
1,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
2,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902
3,84300903,M,19.69,21.25,130.0,1203,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
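Since the first row of the file appears to hold the column names, the rename-and-drop steps above could likely be avoided by letting pandas parse the header. The sketch below uses a small inline excerpt rather than the real file; `csv_text` and `sample` are illustrative names, not part of this notebook.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt standing in for data.csv
csv_text = "id,diagnosis,radius_mean\n842302,M,17.99\n842517,M,20.57\n"

# header=0 (the pandas default) uses the first row as column names
sample = pd.read_csv(io.StringIO(csv_text), header=0)

print(sample.columns.tolist())
print(sample.dtypes)
```

A side effect of parsing the header this way is that numeric columns are read as numbers rather than as strings.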


### _Removing columns pertaining to standard error and "worst" measurements, leaving only the mean measurement columns_

In [55]:
data = data[[c for c in data.columns if '_worst' not in c and '_se' not in c]]
data.head(3)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
1,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
2,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
3,84300903,M,19.69,21.25,130.0,1203,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999


# Preparing for Model Building
### _Encoding Target to Binary Variable_
   - Malignant = 1 
   - Benign = 0

In [56]:
data['diagnosis'] = data['diagnosis'].map(lambda x: 0 if x=='B' else 1)
data['diagnosis']

1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     0
21     0
22     0
23     1
24     1
25     1
26     1
27     1
28     1
29     1
30     1
      ..
540    0
541    0
542    0
543    0
544    0
545    0
546    0
547    0
548    0
549    0
550    0
551    0
552    0
553    0
554    0
555    0
556    0
557    0
558    0
559    0
560    0
561    0
562    0
563    1
564    1
565    1
566    1
567    1
568    1
569    0
Name: diagnosis, Length: 569, dtype: int64

### _Creating Target Vector and Predictor Matrix_
- Target = Diagnosis Column
- Predictors are up to you 

In [57]:
y = data['diagnosis'].values
X = data.drop('diagnosis', axis=1)  # Using all remaining columns; note this keeps `id`, an identifier rather than a measurement

### _Standardizing Predictor Matrix_
- Crucial in a KNN model because the nearest neighbors are found according to a distance metric
- If predictors are left unstandardized, some of them can dominate the distance measure simply because they are on a larger scale

In [58]:
ss = StandardScaler()
Xs = ss.fit_transform(X)
Xs

array([[-2.36405166e-01,  1.09706398e+00, -2.07333501e+00, ...,
         2.53247522e+00,  2.21751501e+00,  2.25574689e+00],
       [-2.36403445e-01,  1.82982061e+00, -3.53632408e-01, ...,
         5.48144156e-01,  1.39236330e-03, -8.68652457e-01],
       [ 4.31741086e-01,  1.57988811e+00,  4.56186952e-01, ...,
         2.03723076e+00,  9.39684817e-01, -3.98007910e-01],
       ...,
       [-2.35727466e-01,  7.02284249e-01,  2.04557380e+00, ...,
         1.05777359e-01, -8.09117071e-01, -8.95586935e-01],
       [-2.35725168e-01,  1.83834103e+00,  2.33645719e+00, ...,
         2.65886573e+00,  2.13719425e+00,  1.04369542e+00],
       [-2.42405862e-01, -1.80840125e+00,  1.22179204e+00, ...,
        -1.26181958e+00, -8.20069901e-01, -5.61032377e-01]])
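To see why scale matters for a distance-based model, the sketch below compares nearest neighbors before and after standardization. The data are made up for illustration, and `nearest_to_first` is a hypothetical helper; the manual z-score uses the same mean and population standard deviation that `StandardScaler` uses by default.

```python
import numpy as np

# Made-up observations: column 0 is in the thousands, column 1 is in tenths
X_toy = np.array([[1000.0, 0.10],
                  [1005.0, 0.90],
                  [1050.0, 0.10],
                  [2000.0, 0.50],
                  [3000.0, 0.50]])

def nearest_to_first(X):
    """Return the index of the row closest to row 0 in Euclidean distance."""
    d = np.sqrt(((X[1:] - X[0]) ** 2).sum(axis=1))
    return int(np.argmin(d)) + 1

# Standardize each column to mean 0 and standard deviation 1
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

raw_neighbor = nearest_to_first(X_toy)
scaled_neighbor = nearest_to_first(X_scaled)
print(raw_neighbor, scaled_neighbor)
```

On the raw data, row 1 is nearest because only the large-scale column matters; after scaling, row 2 is nearest because it matches row 0 on the second column and differs only slightly on the first.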

### _Calculating Baseline Accuracy_
- Before we can evaluate whether our classifier's accuracy is good or bad, we need to know the baseline accuracy.
- The baseline accuracy is the proportion of the majority class. 
- Therefore, the baseline accuracy for our dataset is the proportion of observations labeled malignant or benign, whichever is larger.

In [59]:
# More observations are labeled benign (0) than malignant (1)
data['diagnosis'].value_counts()

0    357
1    212
Name: diagnosis, dtype: int64

In [66]:
# Benign (0) is the majority class, so the baseline is the proportion of benign labels
baseline = 1 - np.mean(y)
print(baseline, 'is the baseline accuracy the KNN models seek to improve on.')

0.6274165202108963 is the baseline accuracy the KNN models seek to improve on.
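The baseline can also be read as the accuracy of a "model" that always predicts the majority class. The sketch below rebuilds a label vector from the class counts printed above (357 benign, 212 malignant); `y_demo` and the other names are illustrative, not part of the notebook.

```python
import numpy as np

# Rebuild a label vector with the class counts shown above
y_demo = np.array([0] * 357 + [1] * 212)

# Identify the majority class and predict it for every observation
classes, counts = np.unique(y_demo, return_counts=True)
majority_class = classes[np.argmax(counts)]
always_majority = np.full_like(y_demo, majority_class)

# Accuracy of that constant prediction equals the majority-class proportion
baseline_demo = np.mean(always_majority == y_demo)
print(majority_class, baseline_demo)
```

Computing the baseline this way does not depend on which class happens to be coded as 1.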


# Building KNN Models
- Each model predicts the diagnosis class and is evaluated by its mean 5-fold cross-validated accuracy and the standard deviation of that accuracy across folds.
- Arguments:
    - n_neighbors: Specifies how many neighbors will vote on the class.
    - weights: 'uniform' gives every neighbor an equal vote.
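For contrast with uniform weights, `KNeighborsClassifier` also accepts `weights='distance'`, which weights each neighbor's vote by the inverse of its distance. The toy example below (made-up one-dimensional data, illustrative names) shows a case where the two settings disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One very close neighbor of class 1, two farther neighbors of class 0
X_toy = np.array([[0.1], [1.0], [1.1]])
y_toy = np.array([1, 0, 0])
query = np.array([[0.0]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: two votes for class 0 beat one vote for class 1
uniform_pred = uniform_knn.predict(query)[0]
# Distance: the single close neighbor outweighs the two far ones
distance_pred = distance_knn.predict(query)[0]
print(uniform_pred, distance_pred)
```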

In [72]:
# with 1 Neighbor 
knn1 = KNeighborsClassifier(n_neighbors=1,
                            weights='uniform')
knn1_scores = cross_val_score(knn1,Xs,y,cv=5)


# with 3 Neighbors 
knn3 = KNeighborsClassifier(n_neighbors=3,
                            weights='uniform')
knn3_scores = cross_val_score(knn3,Xs,y,cv=5)


# with 5 Neighbors
knn5 = KNeighborsClassifier(n_neighbors=5,
                            weights='uniform')
knn5_scores = cross_val_score(knn5,Xs,y,cv=5)


# Printing cross-validated scores and standard deviations
print('KNN-1 Neighbor Score & Std:', np.mean(knn1_scores),',', np.std(knn1_scores))
print('KNN-3 Neighbor Score & Std:', np.mean(knn3_scores),',', np.std(knn3_scores))
print('KNN-5 Neighbor Score & Std:', np.mean(knn5_scores),',', np.std(knn5_scores))

KNN-1 Neighbor Score & Std: 0.9087033474413235 , 0.017539444048369737
KNN-3 Neighbor Score & Std: 0.9332974220854174 , 0.016922759124889712
KNN-5 Neighbor Score & Std: 0.9314351673720662 , 0.015307277690503916
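One caveat with the approach above is that the scaler was fit on the full dataset before cross-validation, so each held-out fold contributed to the scaling statistics. A common alternative is to put the scaler and the classifier in a pipeline so that scaling is refit inside each fold. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled breast cancer data as a stand-in for `data.csv`, takes its first 10 columns (the mean measurements), and does not include the `id` column, so its scores are not expected to match the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data.csv.
# Its target coding (0 = malignant, 1 = benign) is the reverse of this notebook's.
bc = load_breast_cancer()
X_demo = bc.data[:, :10]   # the ten "mean" measurement columns
y_demo = bc.target

pipeline_scores = {}
for k in (1, 3, 5):
    # The scaler is refit on the training folds only, inside each CV split
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipeline_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5)

for k, scores in pipeline_scores.items():
    print(k, scores.mean(), scores.std())
```

On a dataset of this size the difference from scaling up front is usually small, but the pipeline keeps the held-out fold fully separate from preprocessing.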


# Interpreting Results
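- Going from 1 to 3 neighbors raises the mean cross-validated accuracy (about 0.909 to 0.933); going from 3 to 5 leaves it essentially unchanged (about 0.931).
- The standard deviation of accuracy across folds shrinks slightly at each step (about 0.018, 0.017, 0.015), which suggests somewhat more stable performance with more neighbors.
- All three models score well above the baseline accuracy of about 0.627.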