#### Summary ####

As a junior data scientist at MediTech Solutions, you've joined a team developing a diagnostic support tool for primary care physicians. Your specific task is to build a k-NN model that helps doctors identify similar patient cases from historical records. The choice of distance metric is crucial, as it will determine how "similarity" between patients is measured.

##### The Dataset and the Challenge #####

The dataset contains patient records with the following features:

- Vital signs (continuous): blood pressure, heart rate, temperature, etc.
- Lab results (continuous): cholesterol levels, blood glucose, etc.
- Symptoms (binary): presence/absence of 20 different symptoms
- Patient demographics (mixed): age (continuous), gender (binary), ethnicity (categorical)
- Medical history (ordinal): severity levels of various conditions

The challenge is particularly complex because:

- Feature importance varies significantly (some vitals are more critical than others).
- Features have different scales and units.
- Features are highly correlated (especially among lab results).
- The dataset contains mixed data types.
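
One way to handle the mixed types and differing scales before computing any distances is to route each group of columns through its own transformer. The sketch below is illustrative only: the column names and the tiny data frame are hypothetical (the real schema is not shown here), and passing ordinal severity levels through as integers is an assumption, not the team's confirmed approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the actual dataset schema is not given in this summary
continuous_cols = ["systolic_bp", "heart_rate", "cholesterol", "age"]
categorical_cols = ["ethnicity"]
passthrough_cols = ["symptom_fever", "gender", "history_diabetes_severity"]

# Tiny made-up frame used only to demonstrate the transformation
toy = pd.DataFrame({
    "systolic_bp": [120.0, 145.0, 110.0, 160.0],
    "heart_rate": [72.0, 88.0, 65.0, 95.0],
    "cholesterol": [180.0, 240.0, 170.0, 260.0],
    "age": [34.0, 61.0, 28.0, 70.0],
    "ethnicity": ["A", "B", "A", "C"],
    "symptom_fever": [0, 1, 0, 1],
    "gender": [0, 1, 1, 0],
    "history_diabetes_severity": [0, 2, 0, 3],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Put continuous features on a common scale
        ("scale", StandardScaler(), continuous_cols),
        # Expand the nominal category into indicator columns
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Keep binary and ordinal columns as they are (an assumption for this sketch)
        ("keep", "passthrough", passthrough_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)
```

The result is a purely numeric matrix, which is what the distance metrics discussed next require.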

##### Exploring Different Distance Metrics #####

Begin by testing how different distance metrics affect the model's ability to correctly identify similar cases with known diagnoses.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Load the patient dataset
patient_data = pd.read_csv('patient_records.csv')

# Separate features and target
X = patient_data.drop('diagnosis', axis=1)
y = patient_data['diagnosis']

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (critical for distance-based algorithms);
# this assumes every column in X is already numerically encoded
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'mahalanobis']
results = {}

for metric in metrics:
    # Mahalanobis distance is a special case: it needs the feature covariance
    if metric == 'mahalanobis':
        # Estimate the covariance from the training data and pass its inverse
        # as 'VI' (pinv tolerates a singular covariance matrix)
        cov = np.cov(X_train_scaled, rowvar=False)
        metric_params = {'VI': np.linalg.pinv(cov)}
        knn = KNeighborsClassifier(n_neighbors=7, metric=metric, metric_params=metric_params)
    else:
        knn = KNeighborsClassifier(n_neighbors=7, metric=metric)

    # Use cross-validation to evaluate performance
    cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    results[metric] = cv_scores.mean()

    print(f"Average accuracy with {metric} distance: {cv_scores.mean():.4f}")

The results show Mahalanobis distance outperforming the other metrics, reaching 78.3% cross-validated accuracy compared with 71.8% for Euclidean distance.

This makes sense because:

- Vital signs and lab results are highly correlated (e.g., related metabolic measurements).
- Mahalanobis distance accounts for these correlations, preventing redundant features from having too much influence.
- Medical conditions often manifest as patterns of related abnormalities rather than isolated extreme values.
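
The effect of accounting for correlation can be seen in a small numerical sketch. It uses synthetic data standing in for two strongly correlated lab values (not real patient data) and compares two points that are equally far from the mean in Euclidean terms: one follows the usual joint pattern, the other breaks it.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Synthetic stand-in for two strongly correlated lab values (not real patient data)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
labs = np.column_stack([base, base + 0.1 * rng.normal(size=500)])

# Inverse covariance matrix, which the Mahalanobis distance is defined with
VI = np.linalg.inv(np.cov(labs, rowvar=False))

center = labs.mean(axis=0)
along_pattern = center + np.array([1.0, 1.0])     # both values rise together
against_pattern = center + np.array([1.0, -1.0])  # values move in opposite directions

for name, point in [("along", along_pattern), ("against", against_pattern)]:
    d_euc = euclidean(center, point)
    d_mah = mahalanobis(center, point, VI)
    print(f"{name:>7}: euclidean={d_euc:.2f}, mahalanobis={d_mah:.2f}")
```

Euclidean distance treats both points as equally unusual, while Mahalanobis distance assigns a much larger value to the point that violates the correlation between the two measurements.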

##### Fine-Tune the Model #####

Based on this insight, optimize the model further by tuning the number of neighbors and weighting scheme:

In [None]:
from sklearn.model_selection import GridSearchCV

# Set up the inverse covariance matrix for Mahalanobis
# (pinv tolerates a singular covariance matrix)
cov = np.cov(X_train_scaled, rowvar=False)
metric_params = {'VI': np.linalg.pinv(cov)}

# Create the classifier
knn = KNeighborsClassifier(metric='mahalanobis', metric_params=metric_params)

# Define parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 13],
    'weights': ['uniform', 'distance']
}

# Perform grid search
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Get best parameters
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Create optimized model
optimal_knn = KNeighborsClassifier(
    n_neighbors=best_params['n_neighbors'],
    weights=best_params['weights'],
    metric='mahalanobis',
    metric_params=metric_params
)

# Train and evaluate final model
optimal_knn.fit(X_train_scaled, y_train)
final_accuracy = optimal_knn.score(X_test_scaled, y_test)
print(f"Final model accuracy: {final_accuracy:.4f}")

The grid search reveals that the optimal configuration uses k=7 with distance-weighted voting, achieving 81.6% accuracy on the test set.
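
To make the difference between the two weighting schemes concrete, here is a minimal sketch on a made-up one-dimensional dataset (three training points and one query, chosen purely for illustration). With `weights='distance'`, scikit-learn weights each neighbor's vote by the inverse of its distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: one class-0 point very close to the query, two class-1 points farther away
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.1]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

print("uniform: ", uniform_knn.predict(query))
print("distance:", weighted_knn.predict(query))

# Reproduce the weighted vote by hand: each neighbor contributes 1 / distance
distances, indices = weighted_knn.kneighbors(query)
weights = 1.0 / distances[0]
neighbor_labels = y_toy[indices[0]]
votes = {int(label): float(weights[neighbor_labels == label].sum())
         for label in np.unique(y_toy)}
print(votes)
```

With uniform weights the two distant class-1 points outvote the single nearby class-0 point, whereas with distance weighting the nearby point dominates and the prediction becomes class 0.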

##### Clinical Implementation and Impact #####

When integrated into the diagnostic support tool:

- Explainability for physicians: The system not only suggests potential diagnoses but also presents the most similar past cases, showing why certain conditions might be suspected. This transparency helps doctors trust and learn from the system.
- Sensitivity to correlated symptoms: Unlike previous rule-based systems that treated symptoms independently, this k-NN model with Mahalanobis distance correctly identifies conditions that manifest as patterns of related abnormalities.
- Reduced false alarms: The distance-weighted voting means that extremely similar cases have more influence than borderline ones, reducing false positives by 31% compared to the previous system.
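
The similar-case lookup described in the first bullet can be sketched with the classifier's `kneighbors` method, which returns the nearest training records and their distances. Everything below is a stand-in: the data are randomly generated, and `case_ids` is a hypothetical record identifier, not part of the original dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for scaled patient features and diagnoses (not real data)
rng = np.random.default_rng(42)
X_hist = rng.normal(size=(200, 5))
y_hist = rng.integers(0, 3, size=200)
case_ids = np.arange(1000, 1200)  # hypothetical record identifiers

# Inverse covariance for the Mahalanobis metric
VI = np.linalg.pinv(np.cov(X_hist, rowvar=False))
model = KNeighborsClassifier(
    n_neighbors=7,
    weights="distance",
    metric="mahalanobis",
    metric_params={"VI": VI},
)
model.fit(X_hist, y_hist)

# Look up the historical cases closest to a new patient
new_patient = rng.normal(size=(1, 5))
distances, indices = model.kneighbors(new_patient)

print("Suggested diagnosis:", model.predict(new_patient)[0])
for dist, idx in zip(distances[0], indices[0]):
    print(f"case {case_ids[idx]}: diagnosis={y_hist[idx]}, distance={dist:.3f}")
```

Listing the retrieved cases next to the suggested diagnosis is one plausible way to give physicians the supporting evidence behind each suggestion.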