# **Comparing Preprocessing Methods for KNN Classification**

A quick demonstration of common preprocessing techniques applied to the `Biomechanical Features of Orthopedic Patients` dataset for binary class prediction with `KNeighborsClassifier`. The original feature values were used as-is, scaled (`StandardScaler`), reduced in dimensionality (`PCA`), or scaled and then reduced. These transformations change how the distance between samples is measured, which is crucial for distance-based algorithms such as KNN. `GridSearchCV` was used to select the number of neighbors and the distance metric for each preprocessing pipeline. The resulting test accuracies were used for evaluation and comparison.
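
To make the effect on distance concrete, here is a minimal sketch using three made-up two-feature points (the values are illustrative and are not taken from the dataset). One feature is on a scale of hundreds and the other on a scale of tenths, so the unscaled Euclidean distance is driven almost entirely by the first feature. The helper `nearest_index` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative points (not from the dataset): column 0 spans hundreds, column 1 spans tenths.
X = np.array([
    [100.0, 0.10],
    [130.0, 0.90],
    [200.0, 0.50],
])
query = np.array([[102.0, 0.88]])


def nearest_index(points, q):
    """Return the row index of the point closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# On the raw values, the large-magnitude column dominates the distance.
raw_nearest = nearest_index(X, query)

# After standardizing, both columns contribute on a comparable scale.
scaler = StandardScaler().fit(X)
scaled_nearest = nearest_index(scaler.transform(X), scaler.transform(query))

print("nearest neighbor (raw):   ", raw_nearest)
print("nearest neighbor (scaled):", scaled_nearest)
```

In this toy case the nearest neighbor changes once the features are standardized, which is the mechanism by which scaling can change KNN predictions. Whether that change helps or hurts depends on the data.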

## Results

The distance-based `KNeighborsClassifier` identified the binary target class with test accuracies well above the baseline Proportional Chance Criterion (PCC) for every preprocessing variant. For this specific dataset, however, scaling and dimensionality reduction did not yield a significantly higher test accuracy than the original features. Dimensionality reduction usually yields larger improvements on high-dimensional datasets, while scaling tends to add value when features differ greatly in magnitude and units. Ultimately, the best way to determine which method is more suitable is to test and compare results empirically on a case-by-case basis.

**PCC: 0.563**

**PCCx1.25: 0.704**


| Name        | Dist Metric | n_neighbors | Test Accuracy | Mean CV Accuracy | Test Precision | Test Recall |
|-------------|-------------|-------------|---------------|------------------|----------------|-------------|
| Orig        | euclidean   | 17          | 0.855         | 0.863            | 0.739          | 0.85        |
| PCA         | euclidean   | 5           | 0.839         | 0.859            | 0.708          | 0.85        |
| Scaled      | manhattan   | 6           | 0.790         | 0.839            | 0.706          | 0.60        |
| Scaled+PCA  | manhattan   | 28          | **0.887**     | 0.835            | 0.810          | 0.85        |

# Execution

## Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

## Data Loading and Preprocessing

**Source**: [Kaggle](https://www.kaggle.com/datasets/uciml/biomechanical-features-of-orthopedic-patients)

Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (each one is a column):

* pelvic incidence
* pelvic tilt
* lumbar lordosis angle
* sacral slope
* pelvic radius
* grade of spondylolisthesis

The target column is a binary class of either "Normal" or "Abnormal".

In [3]:
# Read and Prepare
data = pd.read_csv('column_2C_weka.csv')
data.drop_duplicates(inplace=True)
data.dropna(inplace=True)
X, y = data.iloc[:, :-1], data.iloc[:, -1]
y = LabelEncoder().fit_transform(y)

data.sample(n=3, random_state=1)

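|     | pelvic_incidence | pelvic_tilt numeric | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis | class    |
|-----|------------------|---------------------|-----------------------|--------------|---------------|--------------------------|----------|
| 78  | 67.412538        | 17.442797           | 60.14464              | 49.969741    | 111.12397     | 33.157646                | Abnormal |
| 244 | 63.0263          | 27.33624            | 51.605017             | 35.69006     | 114.506608    | 7.43987                  | Normal   |
| 185 | 91.468741        | 24.508177           | 84.620272             | 66.960564    | 117.307897    | 52.623047                | Abnormal |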
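
The integer codes produced by `LabelEncoder` determine which class the precision and recall scores refer to, so it can help to check the mapping explicitly. The sketch below uses a short hand-written list of labels rather than the dataset itself. `LabelEncoder` assigns codes in sorted label order, and `precision_score` and `recall_score` use `pos_label=1` by default.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration; the real target column holds the same two strings.
labels = ["Abnormal", "Normal", "Abnormal", "Normal"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each label's position in it is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

With these two label strings, "Normal" is encoded as 1, so the default binary precision and recall describe the "Normal" class.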


## Data Preprocessing

In [4]:
# Split, Original Data
X_train_orig, X_test_orig, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

# Data Transformations
## Scaled Feature Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_orig)
X_test_scaled = scaler.transform(X_test_orig)

## Dimension Reduced Data
pca = PCA(n_components=4)
X_train_reduced = pca.fit_transform(X_train_orig)
X_test_reduced = pca.transform(X_test_orig)

## Scaled and Dimension Reduced Data (a separate PCA is fit on the scaled features)
pca_scaled = PCA(n_components=4)
X_train_scaled_reduced = pca_scaled.fit_transform(X_train_scaled)
X_test_scaled_reduced = pca_scaled.transform(X_test_scaled)

# Store
original = ["Orig", X_train_orig, X_test_orig]
reduced = ["PCA", X_train_reduced, X_test_reduced]
scaled = ["Scaled", X_train_scaled, X_test_scaled]
reduced_scaled = ["Scaled+PCA", X_train_scaled_reduced, X_test_scaled_reduced]
datasets = [original, reduced, scaled, reduced_scaled]
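
The "PCA" and "Scaled+PCA" variants differ because PCA finds directions of maximum variance, so a feature with a large numeric range can dominate the components when the data are not standardized first. The sketch below illustrates this on synthetic data (six independent Gaussian columns with deliberately different spreads, not the orthopedic dataset) by comparing `explained_variance_ratio_` with and without scaling. The column spreads and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: six independent columns with very different standard deviations.
spreads = np.array([50.0, 10.0, 5.0, 1.0, 0.5, 0.1])
X_demo = rng.normal(size=(200, 6)) * spreads

# PCA on raw values: the widest column dominates the first component.
raw_ratio = PCA(n_components=4).fit(X_demo).explained_variance_ratio_

# PCA on standardized values: variance is spread more evenly across components.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=4).fit(X_demo_scaled).explained_variance_ratio_

print("raw   :", np.round(raw_ratio, 3), "total:", round(float(raw_ratio.sum()), 3))
print("scaled:", np.round(scaled_ratio, 3), "total:", round(float(scaled_ratio.sum()), 3))
```

Inspecting the same attribute on the fitted `pca` objects above would show how much variance the four retained components capture for the actual data.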


## Hyperparameter Tuning

In [8]:
results = []
param_grid = {
    'n_neighbors': range(5, 31),
    # Note: 'minkowski' with the default p=2 is equivalent to 'euclidean'
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

for name, X_train, X_test in datasets:
    model = KNeighborsClassifier()
    grid_search = GridSearchCV(model, param_grid, cv=3, verbose=0, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    model = grid_search.best_estimator_

    y_pred = model.predict(X_test)
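    result = {
        "name": name,
        "params": grid_search.best_params_,
        "test_accuracy": accuracy_score(y_test, y_pred),
        # best_score_ is the mean cross-validated accuracy on the training split,
        # not the accuracy on the full training set
        "train_accuracy": grid_search.best_score_,
        # Precision and recall use the default pos_label=1 ("Normal" after label encoding)
        "test_precision": precision_score(y_test, y_pred),
        "test_recall": recall_score(y_test, y_pred),
    }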
    results.append(result)
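
As an alternative sketch (not the approach used in this notebook), the scaler and PCA can be wrapped in a scikit-learn `Pipeline` so that `GridSearchCV` refits them inside every cross-validation fold, which keeps each validation fold out of the preprocessing fit. The example runs on synthetic data from `make_classification` as a stand-in for the dataset. The step names `scale`, `pca`, and `knn` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with six features (assumption: not the orthopedic data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=2, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Preprocessing and classifier in one estimator; every step is refit per CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_param_grid = {
    "knn__n_neighbors": range(5, 31),
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, pipe_param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_            # mean cross-validated accuracy
test_accuracy = search.score(X_te, y_te)    # accuracy of the refit best pipeline

print(search.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

The same pattern could cover the other variants by dropping the `scale` or `pca` step from the pipeline.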

## Evaluation

**PCC Baseline**

In [9]:
pcc = np.sum([np.mean(y == i)**2 for i in set(y)])
print(f"PCC: {pcc:.3f}")
print(f"PCC*1.25: {pcc*1.25:.3f}")

PCC: 0.563
PCC*1.25: 0.704
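
The PCC computed above is the sum of squared class proportions, i.e. the accuracy expected from guessing labels at random in proportion to the class frequencies. The following is a small standalone sketch of the same calculation on a toy label list. The function name `proportional_chance_criterion` and the toy labels are introduced here for illustration only.

```python
from collections import Counter


def proportional_chance_criterion(labels):
    """Sum of squared class proportions for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


# Toy example: 3 of 4 samples in one class gives 0.75**2 + 0.25**2.
toy_labels = ["Abnormal"] * 3 + ["Normal"] * 1
pcc_toy = proportional_chance_criterion(toy_labels)

# A perfectly balanced binary target gives 0.5**2 + 0.5**2.
pcc_balanced = proportional_chance_criterion([0, 1, 0, 1])

print(f"toy PCC: {pcc_toy:.3f}, threshold (1.25 x PCC): {1.25 * pcc_toy:.3f}")
print(f"balanced PCC: {pcc_balanced:.3f}")
```

The more imbalanced the classes, the higher the PCC, and therefore the higher the accuracy a model needs to clear the 1.25 × PCC threshold used in this notebook.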


**Model Evaluation**

In [10]:
pd.DataFrame(results)

|   | name       | params                                      | test_accuracy | train_accuracy | test_precision | test_recall |
|---|------------|---------------------------------------------|---------------|----------------|----------------|-------------|
| 0 | Orig       | {'metric': 'euclidean', 'n_neighbors': 17}  | 0.854839      | 0.862817       | 0.73913        | 0.85        |
| 1 | PCA        | {'metric': 'euclidean', 'n_neighbors': 5}   | 0.83871       | 0.85885        | 0.708333       | 0.85        |
| 2 | Scaled     | {'metric': 'manhattan', 'n_neighbors': 6}   | 0.790323      | 0.838819       | 0.705882       | 0.6         |
| 3 | Scaled+PCA | {'metric': 'manhattan', 'n_neighbors': 28}  | 0.887097      | 0.834705       | 0.809524       | 0.85        |


## **Acknowledgements**
*The original dataset was downloaded from UCI ML repository:* <br>
*Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science*
<br>
*Files were converted to CSV.*

Data retrieved from: <br>
[UCI Machine Learning Repository: Biomechanical Features of Orthopedic Patients Dataset](https://www.kaggle.com/datasets/uciml/biomechanical-features-of-orthopedic-patients) (Accessed May 10, 2024).
