KNN (k-nearest neighbors) is a supervised, instance-based learning algorithm that classifies a data point by the majority class
among its k nearest neighbors, using a distance measure such as the Euclidean distance.
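
To make the voting idea concrete, here is a minimal pure-Python sketch of KNN classification. The helper `knn_predict` and the toy points are hypothetical and chosen only for illustration; this is not how scikit-learn implements the classifier used later in this notebook.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical values, not taken from the dataset)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # B
print(knn_predict(train_points, train_labels, (3.9, 4.1), k=3))  # M
```

Note that there is no training step beyond storing the data: all the work happens at prediction time, which is what "instance-based" refers to.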

The following dataset contains details of patients whose breast tumors were diagnosed as one of two types.
Our objective is to predict a patient's diagnosis (malignant or benign) from the given features.
We will use KNN to classify new instances.

Step 1 - Load and initialise data

In [13]:
import pandas as pd

#load the dataset
df = pd.read_csv('KNNAlgorithmDataset.csv')

#Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head().to_markdown(index=False,numalign='left',stralign='left'))

#Display a concise summary of the dataset
print("\nConcise summary of the dataset:")
df.info()  #info() prints directly and returns None, so it is not wrapped in print()

#Display the descriptive statistics (the index is kept so the statistic names are shown)
print("\nStatistical report of the dataset:")
print(df.describe().to_markdown(numalign='left',stralign='left'))

#Display the number of unique values in each column (the index holds the column names)
print("\nNumber of unique values per column:")
print(df.nunique().to_markdown(numalign='left',stralign='left'))

#Display the number of missing values in each column
print("\nNumber of missing values per column:")
print(df.isnull().sum().to_markdown(numalign='left',stralign='left'))



First 5 rows of the dataset:
| id       | diagnosis   | radius_mean   | texture_mean   | perimeter_mean   | area_mean   | smoothness_mean   | compactness_mean   | concavity_mean   | concave points_mean   | symmetry_mean   | fractal_dimension_mean   | radius_se   | texture_se   | perimeter_se   | area_se   | smoothness_se   | compactness_se   | concavity_se   | concave points_se   | symmetry_se   | fractal_dimension_se   | radius_worst   | texture_worst   | perimeter_worst   | area_worst   | smoothness_worst   | compactness_worst   | concavity_worst   | concave points_worst   | symmetry_worst   | fractal_dimension_worst   | Unnamed: 32   |
|:---------|:------------|:--------------|:---------------|:-----------------|:------------|:------------------|:-------------------|:-----------------|:----------------------|:----------------|:-------------------------|:------------|:-------------|:---------------|:----------|:----------------|:-----------------|:---------------|:-------------------

Step 2 - Data preprocessing
#Here we drop the 'Unnamed: 32' column because every value in it is null
#We convert the diagnosis column to numerical data because it is categorical
#We drop the id column because it carries no information for the prediction model

In [14]:
from sklearn.preprocessing import LabelEncoder

#Drop the id and unnamed column '32'
df = df.drop(['id', 'Unnamed: 32'], axis=1)

#Encode the diagnosis column
label_encoder = LabelEncoder()

df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

#Let's display the first 5 rows of the preprocessed dataset
print("First 5 rows of the preprocessed dataset:")
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

#Let's display the data types after encoding
print("\nData types after encoding:")
df.info()  #info() prints directly and returns None

# let's display the value counts for the 'diagnosis' column after encoding
print("\nValue counts for 'diagnosis' column after encoding:")
print(df['diagnosis'].value_counts().to_markdown(numalign="left", stralign="left"))

#The output will display 0 for Benign and 1 for Malignant (LabelEncoder sorts the labels 'B' and 'M' alphabetically)

First 5 rows of the preprocessed dataset:
| diagnosis   | radius_mean   | texture_mean   | perimeter_mean   | area_mean   | smoothness_mean   | compactness_mean   | concavity_mean   | concave points_mean   | symmetry_mean   | fractal_dimension_mean   | radius_se   | texture_se   | perimeter_se   | area_se   | smoothness_se   | compactness_se   | concavity_se   | concave points_se   | symmetry_se   | fractal_dimension_se   | radius_worst   | texture_worst   | perimeter_worst   | area_worst   | smoothness_worst   | compactness_worst   | concavity_worst   | concave points_worst   | symmetry_worst   | fractal_dimension_worst   |
|:------------|:--------------|:---------------|:-----------------|:------------|:------------------|:-------------------|:-----------------|:----------------------|:----------------|:-------------------------|:------------|:-------------|:---------------|:----------|:----------------|:-----------------|:---------------|:--------------------|:--------------|:------

Step 3 - Data preparation for KNN

--In this step we assign the target variable y and the feature variables X
--Apply standard scaling so that each feature has mean 0 and standard deviation 1, since features with large numerical values would otherwise dominate the distance calculation
--Do a train/test split
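
The effect of scaling on distances can be seen in a small sketch. The feature values below are illustrative numbers chosen for this example, and `standardize` is a hypothetical helper that mirrors the z-score formula (subtract the mean, divide by the population standard deviation); it is not the scikit-learn implementation.

```python
import math
import statistics

def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Illustrative values for two features on very different scales
area = [1001.0, 1326.0, 386.1, 578.3]
smoothness = [0.118, 0.085, 0.143, 0.100]

raw = list(zip(area, smoothness))
scaled = list(zip(standardize(area), standardize(smoothness)))

# Distance between the first two samples before and after scaling
print(math.dist(raw[0], raw[1]))        # almost entirely the area difference (about 325)
print(math.dist(scaled[0], scaled[1]))  # both features now contribute

# After scaling, each feature has mean 0 and standard deviation 1 (up to rounding)
scaled_area = standardize(area)
print(statistics.fmean(scaled_area), statistics.pstdev(scaled_area))
```

Before scaling, the smoothness difference of 0.033 is invisible next to an area difference of 325, so KNN would effectively ignore smoothness when ranking neighbours.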

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

#Assign x and y

X = df.drop('diagnosis',axis=1)
y = df['diagnosis']


#Display the shape of X and y

print('Shape of X:',X.shape)
print('\nShape of Y:',y.shape)

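#Feature scaling (note: fitting the scaler on the full dataset lets test-set statistics influence training; fitting it on the training split only avoids this)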
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#Display the first 5 rows of the scaled features
print("\nFirst 5 rows of scaled features (X_scaled):")
print(pd.DataFrame(X_scaled,columns=X.columns).head().to_markdown(index=False,numalign='left',stralign='left'))

#Let's split the data now for training and testing

X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,test_size = 0.3, random_state = 42)

# Display the shapes of the training and testing sets
print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)



Shape of X: (569, 30)

Shape of Y: (569,)

First 5 rows of scaled features (X_scaled):
| radius_mean   | texture_mean   | perimeter_mean   | area_mean   | smoothness_mean   | compactness_mean   | concavity_mean   | concave points_mean   | symmetry_mean   | fractal_dimension_mean   | radius_se   | texture_se   | perimeter_se   | area_se   | smoothness_se   | compactness_se   | concavity_se   | concave points_se   | symmetry_se   | fractal_dimension_se   | radius_worst   | texture_worst   | perimeter_worst   | area_worst   | smoothness_worst   | compactness_worst   | concavity_worst   | concave points_worst   | symmetry_worst   | fractal_dimension_worst   |
|:--------------|:---------------|:-----------------|:------------|:------------------|:-------------------|:-----------------|:----------------------|:----------------|:-------------------------|:------------|:-------------|:---------------|:----------|:----------------|:-----------------|:---------------|:--------------------|:-----

Step 4 - Train and evaluate the KNN model

In [23]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#Initialise the KNN classifier with n_neighbors = 5
knn = KNeighborsClassifier(n_neighbors=5)

#We'll now train the model
knn.fit(X_train,y_train)

#We'll now make predictions on the test set
y_pred = knn.predict(X_test)

#Evaluate the model: accuracy, confusion matrix and per-class precision/recall
accuracy = accuracy_score(y_test,y_pred)
conf_matrix = confusion_matrix(y_test,y_pred)
class_report = classification_report(y_test,y_pred)

print(f"accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

accuracy: 0.9591

Confusion Matrix:
[[105   3]
 [  4  59]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97       108
           1       0.95      0.94      0.94        63

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171
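
As a check on how the report relates to the confusion matrix, the sketch below recomputes the accuracy and the class 1 (malignant) metrics by hand from the four counts printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
conf_matrix = [[105, 3],
               [4, 59]]

tn, fp = conf_matrix[0]  # true benign (0): predicted benign, predicted malignant
fn, tp = conf_matrix[1]  # true malignant (1): predicted benign, predicted malignant

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted malignant cases, how many were malignant
recall = tp / (tp + fn)      # of the actual malignant cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")   # 0.9591
print(f"precision: {precision:.2f}")  # 0.95
print(f"recall:    {recall:.2f}")     # 0.94
print(f"f1-score:  {f1:.2f}")         # 0.94
```

The 4 false negatives are malignant cases predicted as benign, which is why recall for class 1 is the figure to watch in a diagnostic setting.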




## Applying PCA for Dimensionality Reduction

Here, we apply Principal Component Analysis (PCA) before using KNN to reduce dimensionality while retaining 95% of the variance.


In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Assuming X_train and X_test are already defined in your earlier cells
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"Original shape: {X_train.shape}")
print(f"Reduced shape after PCA: {X_train_pca.shape}")

# Plot explained variance
plt.figure(figsize=(8,5))
plt.plot(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA - Cumulative Explained Variance')
plt.grid(True)
plt.show()
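
Passing a float such as `n_components=0.95` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates the selection rule on a hypothetical list of variance ratios; the numbers are made up for the example and are not the ratios of this dataset.

```python
from itertools import accumulate

# Hypothetical per-component explained variance ratios, in descending order
explained_variance_ratio = [0.45, 0.19, 0.09, 0.07, 0.06, 0.04, 0.04, 0.03, 0.02, 0.01]

target = 0.95
cumulative = list(accumulate(explained_variance_ratio))

# Smallest number of components whose cumulative ratio exceeds the target
n_components = next(i + 1 for i, total in enumerate(cumulative) if total > target)

print(n_components)  # 8
```

With these made-up ratios, seven components explain about 94% of the variance and the eighth brings the total to about 97%, so eight components are kept.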



## Decision Tree Classifier

Now, we apply a Decision Tree Classifier to the same dataset to compare its performance with KNN.


In [None]:

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)

print("Classification Report for Decision Tree:")
print(classification_report(y_test, y_pred_dt))

# Visualize the tree
plt.figure(figsize=(12,8))
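# X (from Step 3) supplies the feature names; class names follow the encoding 0 = Benign, 1 = Malignant
plot_tree(dt_clf, filled=True, feature_names=list(X.columns), class_names=['Benign', 'Malignant'])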
plt.show()



## XGBoost Classifier

Next, we implement XGBoost Classifier for additional comparison.


In [None]:

from xgboost import XGBClassifier

# 'logloss' is the binary log loss ('mlogloss' is the multiclass variant); use_label_encoder is deprecated and omitted
xgb_clf = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)

print("Classification Report for XGBoost:")
print(classification_report(y_test, y_pred_xgb))



## Cross-Validation and Hyperparameter Tuning

We will perform cross-validation and hyperparameter tuning using GridSearchCV on an SVM to demonstrate the tuning workflow.


In [None]:

from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

svc = SVC()
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

best_svc = grid_search.best_estimator_
y_pred_svc = best_svc.predict(X_test)
print("Classification Report for Tuned SVM:")
print(classification_report(y_test, y_pred_svc))
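
To get a feel for the cost of the search above, the following sketch enumerates the candidate settings that the grid describes. It only counts combinations with `itertools.product`; it does not train any model, and the variable names are chosen for this example.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
}
cv_folds = 5

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))             # 18 candidate settings
print(len(combinations) * cv_folds)  # 90 cross-validation fits
print(combinations[0])               # {'C': 0.1, 'kernel': 'linear', 'gamma': 'scale'}
```

With the default `refit=True`, GridSearchCV additionally refits the best setting once on the whole training set. Since `gamma` does not affect the linear kernel, the linear candidates that differ only in `gamma` are redundant.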



## Model Comparison Summary

| Model | Accuracy | Notes |
|-------|----------|-------|
| KNN | 0.9591 | Original model (k = 5) |
| Decision Tree | *To be filled* | Simple Tree |
| XGBoost | *To be filled* | Gradient Boosting |
| Tuned SVM | *To be filled* | With GridSearchCV |

Fill in the table with the accuracy you observe after running the cells above to summarize what you have learned.
