<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 6.3
# *KNN classification Lab*

**In this lab, we will:**
- Practice KNN classification on a breast cancer dataset.
- Predict the `diagnosis` of a patient from predictor variables of your choice.

### 1. Load Data

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

- `ftp ftp.cs.wisc.edu`
- `cd math-prog/cpo-dataset/machine-learn/WDBC/`

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

- Field 1: ID number
- Field 2: Diagnosis (M = malignant, B = benign)
- Fields 3-32: ten real-valued features are computed for each cell nucleus:
    - a) radius (mean of distances from center to points on the perimeter)
    - b) texture (standard deviation of gray-scale values)
    - c) perimeter
    - d) area
    - e) smoothness (local variation in radius lengths)
    - f) compactness (perimeter^2 / area - 1.0)
    - g) concavity (severity of concave portions of the contour)
    - h) concave points (number of concave portions of the contour)
    - i) symmetry
    - j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

In [1]:
# IMPORT LIBRARIES
from itertools import combinations
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
breast_cancer = pd.read_csv('../DATA/breast-cancer-wisconsin-data.csv')
breast_cancer.head()

### 2. EDA

Explore the dataset, clean the data, and look for correlations between features.

In [None]:
breast_cancer.shape

In [None]:
breast_cancer.info()

In [None]:
breast_cancer.describe()

In [49]:
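# Drop the empty trailing column that pandas names "Unnamed: 32".
# errors="ignore" keeps the cell safe to re-run after the column is gone.
breast_cancer.drop(columns="Unnamed: 32", inplace=True, errors="ignore")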

### 3. Set up the `diagnosis` variable as your target. How many classes are there?

In [None]:
# ANSWER
breast_cancer.diagnosis.value_counts()

In [None]:
# Show as proportion
breast_cancer.diagnosis.value_counts(normalize=True)


In [None]:
# Show as percent
breast_cancer.diagnosis.value_counts(normalize=True) * 100


In [None]:
# Plot the percentages as a bar chart
(breast_cancer.diagnosis.value_counts(normalize=True) * 100).plot(kind='bar')


### 4. What is the baseline accuracy?

In [None]:
# ANSWER
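
The baseline accuracy is the accuracy of a model that always predicts the most frequent class. A minimal sketch of that calculation follows. It assumes the class counts stated in the data description (357 benign, 212 malignant) rather than reading them from the loaded DataFrame, and the variable names are illustrative.

```python
from collections import Counter

# Class counts taken from the dataset description (assumed to match the CSV).
class_counts = Counter({"B": 357, "M": 212})

# Baseline: always predict the majority class.
majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(class_counts.values())

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

With the data loaded, `breast_cancer.diagnosis.value_counts(normalize=True).max()` should give the same proportion.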





### 5. Choose features to be your predictor variables and set up your X.

In [None]:
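# ANSWER
# Plot only the "_mean" features, coloured by diagnosis;
# a pairplot of every column is slow to draw and hard to read.
mean_columns = [c for c in breast_cancer.columns if c.endswith('_mean')]
sns.pairplot(breast_cancer, vars=mean_columns, hue='diagnosis')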

### 6. Fit a `KNeighborsClassifier` with 1 neighbor using the target and predictors.

In [53]:
# ANSWER
# First select candidate predictors: keep only the "_mean" features.
# (The KNeighborsClassifier itself is created and fit a few cells below.)

mean_only_columns = [c for c in breast_cancer.columns if "_mean" in c]
mean_only_columns

selected_breast_cancer = breast_cancer[mean_only_columns].copy()
# selected_breast_cancer['diagnosis'] = breast_cancer.diagnosis.map({'M': 1, 'B': 0})

# Note: the features are left unstandardized here; standardization comes in step 12.

In [None]:
# Create a custom function that plots correlation in heatmap

def plot_corr_heatmap(df):

    # Adapted from the seaborn examples
    # https://seaborn.pydata.org/examples/many_pairwise_correlations.html
    sns.set(style="white")

    # Compute the correlation matrix once and reuse it
    corr = df.corr()

    # Generate a boolean mask for the upper triangle (including the diagonal)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(18, 18))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, ax=ax)


plot_corr_heatmap(selected_breast_cancer)

In [None]:
y = breast_cancer['diagnosis']

# Exclude some of the strongly correlated predictors (see the heatmap above).
# NOTE: 'compact_points_mean' does not appear to match any "_mean" column
# (compare 'compactness_mean' and 'concave points_mean'), so as written it excludes nothing.
excluded_columns = ['diagnosis', 'area_mean', 'radius_mean', 'concavity_mean', 'compact_points_mean']
feature_columns = [c for c in selected_breast_cancer.columns if c not in excluded_columns]

# Filter for the feature columns
X = selected_breast_cancer[feature_columns]
X.head()

In [56]:
from sklearn.neighbors import KNeighborsClassifier

In [57]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X, y)

### 7. Evaluate the accuracy of your model.
- Is it better than baseline?
- Is it legitimate?

In [None]:
# ANSWER
# predict the response values for the observations in X ("test the model")
# store the predicted response values
# use this to compute the accuracy

In [None]:
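# Predict on the same rows the model was fit on, then compute the (training) accuracy.
# `metrics` was already imported in the first cell.
y_pred_class = knn.predict(X)
metrics.accuracy_score(y, y_pred_class)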
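
On the "Is it legitimate?" question: when a K=1 model is scored on the rows it was fit on, each point's nearest neighbor is itself, so the accuracy is 1.0 whenever the rows are distinct. The sketch below illustrates this on synthetic data whose labels are unrelated to the features. The data and the names `X_demo`, `y_demo`, and `knn_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))      # 100 distinct random points
y_demo = rng.integers(0, 2, size=100)   # labels carry no signal at all

# Fit and score on the same rows with a single neighbor.
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_acc = knn_demo.score(X_demo, y_demo)

print(f"Training accuracy with K=1 on random labels: {train_acc}")
```

A perfect score on pure noise suggests that training accuracy says little about how the model generalizes, which is the motivation for the train-test split in the next step.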

### 8. Create an 80-20 train-test-split of your target and predictors. Refit the KNN and assess the accuracy.

In [None]:
# ANSWER
# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)



# STEP 2: train the model on the training set (using K=1)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# STEP 3: test the model on the testing set, and check the accuracy
y_pred_class = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)
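
Because the classes are imbalanced (roughly 63% benign), a plain random split can leave the train and test sets with slightly different class proportions. One option, sketched here as an assumption rather than as part of the lab answer, is the `stratify` argument of `train_test_split`. The labels below are synthetic and only mimic the stated class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts from the data description.
y_demo = np.array(["B"] * 357 + ["M"] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = np.mean(y_tr == "M")
test_share = np.mean(y_te == "M")
print(f"Malignant share, train: {train_share:.3f}  test: {test_share:.3f}")
```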

### 9. Evaluate the test accuracy of a KNN where K == number of rows in the training data.

In [None]:
# ANSWER
# Create an instance of KNeighborsClassifier where n_neighbors = number of rows in the training data

k = X_train.shape[0]
knn = KNeighborsClassifier(n_neighbors=k)

# Fit Train Data
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)

# Print accuracy_score
metrics.accuracy_score(y_test, y_pred_class)
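
With K equal to the number of training rows (and the default uniform weights), every prediction is a vote over the whole training set, so the model returns the training majority class for any input. Its test accuracy is then just the share of that class in the test set. A small sketch on made-up data (all names here are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_tr = rng.normal(size=(60, 2))
y_tr = np.array(["B"] * 40 + ["M"] * 20)   # "B" is the training majority
X_te = rng.normal(size=(15, 2))

# K equals the number of training rows, so every row votes on every prediction.
knn_all = KNeighborsClassifier(n_neighbors=len(X_tr)).fit(X_tr, y_tr)
preds = knn_all.predict(X_te)

print(np.unique(preds))
```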

### 10. Fit the KNN at values of K from 1 to the number of rows in the training data.
- Store the test accuracy in a list.
- Plot the test accuracy vs. the number of neighbors.

In [None]:
# ANSWER
test_acc = []
for k in range(1, X_train.shape[0]+1):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    test_acc.append(knn.score(X_test, y_test))



# plot test accuracy by number of neighbors:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(list(range(1, X_train.shape[0]+1)), test_acc, lw=3.)
ax.set_xlabel('Number of neighbors (K)')
ax.set_ylabel('Test accuracy')
plt.show()


### 11. Fit KNN across different values of K and plot the mean cross-validated accuracy with 5 folds.


In [63]:
from sklearn.model_selection import cross_val_score

In [None]:
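folds = 5
max_neighbors = 200

# Mean cross-validated accuracy for K = 1 .. max_neighbors - 1
test_acc = []
for i in range(1, max_neighbors):
    knn = KNeighborsClassifier(n_neighbors=i)
    test_acc.append(np.mean(cross_val_score(knn, X, y, cv=folds)))
print(max(test_acc))
# K that achieves the best score (list index 0 corresponds to K = 1)
print(np.argmax(test_acc) + 1)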


In [None]:
# plot mean cross-validated accuracy by number of neighbors:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(list(range(1, int(max_neighbors))), test_acc, lw=3.)
ax.set_xlabel('Number of neighbors (K)')
ax.set_ylabel('Mean 5-fold CV accuracy')
plt.show()

In [None]:
# ANSWER


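# What the loop above does, for each k = 1, 2, ..., n:
#   - cross_val_score fits five models, one per fold (5-fold CV in sklearn);
#   - each model is scored on its held-out fold;
#   - the five fold accuracies are averaged into a single score for that k.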
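
To make the procedure concrete, the sketch below reproduces by hand what `cross_val_score` does for a single value of K: split the data into five stratified folds, fit on four, score on the fifth, and average. It runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `knn_demo` are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
knn_demo = KNeighborsClassifier(n_neighbors=5)

# Manual 5-fold loop: one fit and one held-out score per fold.
fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    knn_demo.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(knn_demo.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = np.mean(fold_scores)

# For a classifier, cv=5 uses unshuffled stratified folds, so the splits match.
sklearn_mean = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

print(f"manual: {manual_mean:.4f}  cross_val_score: {sklearn_mean:.4f}")
```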

### 12. Standardize the predictor matrix and cross-validate across the different K.
- Plot the standardized mean cross-validated accuracy against the unstandardized. Which is better?
- Why?

In [67]:
# ANSWER
# Standardize X

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [None]:
test_acc_std = []
for i in range(1, int(max_neighbors)):
    knn = KNeighborsClassifier(n_neighbors=i)
    test_acc_std.append(np.mean(cross_val_score(knn, Xs, y, cv=5)))
print(max(test_acc_std))

In [None]:
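# ANSWER
# plot mean cross-validated accuracy by number of neighbors:

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(list(range(1, int(max_neighbors))), test_acc, lw=3., label="unstandardized")
ax.plot(list(range(1, int(max_neighbors))), test_acc_std, lw=3., color="darkred", label="standardized")
ax.set_xlabel('Number of neighbors (K)')
ax.set_ylabel('Mean 5-fold CV accuracy')
ax.legend()
plt.show()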
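
KNN classifies by distance, so features measured on large scales can dominate the neighbor search unless the predictors are standardized. Note also that `ss.fit_transform(X)` above fits the scaler on all rows before cross-validating, so each held-out fold contributes to the scaling statistics. One way to avoid that, sketched here as an assumption rather than as the lab's answer, is to put the scaler and the classifier in a pipeline so the scaler is refit on each training fold. The sketch uses scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the CSV, with all 30 features and an arbitrary K of 15.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled breast cancer dataset (not the lab's CSV).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

raw_knn = KNeighborsClassifier(n_neighbors=15)
# The pipeline refits StandardScaler on the training part of every fold.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))

raw_score = cross_val_score(raw_knn, X_demo, y_demo, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean()

print(f"unscaled: {raw_score:.3f}  scaled inside pipeline: {scaled_score:.3f}")
```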

**References**

[Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/breast-cancer-wisconsin-data.zip/2)



---



© 2024 Institute of Data

