
# Parkinson’s Disease Detection Using Machine Learning

This notebook implements an **end-to-end machine learning pipeline** to detect Parkinson’s disease using biomedical voice measurements.

## Project Objective (Problem Statement)

Parkinson’s disease is a progressive neurological disorder that affects movement and speech. Early detection supports better treatment and quality of life. In this project, we use the **UCI Parkinson’s dataset**, which contains voice-related biomedical features extracted from sustained phonation. The **target variable** is `status` (1 = Parkinson’s, 0 = healthy). The goal is to build and evaluate classification models that distinguish Parkinson’s from healthy subjects based on their voice features.  
We follow a complete ML workflow: data loading, exploratory data analysis (EDA), preprocessing, model building, hyperparameter tuning, evaluation, interpretability, error analysis, and deployment preparation (a saved model and scaler for use in a Streamlit app).


In [None]:

# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model building
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report, RocCurveDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Interpretability
from sklearn.inspection import permutation_importance

# Persisting model
import joblib

# Display settings
pd.set_option("display.max_columns", None)
sns.set_theme(style="whitegrid")  # set_theme is the preferred name; sns.set is an alias



## 1. Load the Dataset

We load the **Parkinson’s dataset** (`parkinsons.csv`) from the project folder and inspect its basic structure.


In [None]:

# Load dataset (ensure parkinsons.csv is in the same folder as this notebook)
data_path = "parkinsons.csv"
df = pd.read_csv(data_path)

print("Shape of dataset:", df.shape)
df.head()



## 2. Dataset Overview & Data Quality Checks

We check:
- Column types  
- Missing values  
- Basic statistics  


In [None]:

# Basic info
df.info()


In [None]:

# Check missing values
df.isna().sum()


In [None]:

# Statistical summary of numeric columns
df.describe().T



## 3. Exploratory Data Analysis (EDA)

### 3.1 Target Variable Distribution (`status`)


In [None]:

# Target distribution
target_counts = df['status'].value_counts().sort_index()
print(target_counts)

plt.figure()
sns.countplot(x="status", data=df)
plt.title("Distribution of Target Variable (status)")
plt.xlabel("status (0 = healthy, 1 = Parkinson's)")
plt.ylabel("Count")
plt.show()



### 3.2 Feature Distributions

We visualize distributions of a few representative numerical features.


In [None]:

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
numeric_cols = [col for col in numeric_cols if col != "status"]

# Plot distributions for first 6 numeric features (for brevity)
sample_features = numeric_cols[:6]

for col in sample_features:
    plt.figure()
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()



### 3.3 Feature–Target Relationship

We use boxplots to see how some features differ between Parkinson’s and healthy subjects.


In [None]:

for col in sample_features:
    plt.figure()
    sns.boxplot(x="status", y=col, data=df)
    plt.title(f"{col} by status")
    plt.xlabel("status (0 = healthy, 1 = Parkinson's)")
    plt.show()



### 3.4 Correlation Analysis

We check correlations between features and the target variable.
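
Voice features of the same family (for example, several jitter or shimmer measures) may be strongly correlated with one another, which a heatmap shows only qualitatively. As a complement, the sketch below lists feature pairs whose absolute correlation exceeds a threshold. The helper name `top_correlated_pairs`, the 0.9 threshold, and the toy columns are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr_abs = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    pairs = corr_abs.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data with hypothetical column names: "b" is an exact multiple of "a"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
high_pairs = top_correlated_pairs(toy, threshold=0.9)
print(high_pairs)
```

On the real data, the same call could be applied to `df.drop(columns=["status"])` to see which features carry overlapping information.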


In [None]:

plt.figure(figsize=(12, 8))
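# numeric_only=True skips non-numeric columns such as 'name',
# which would otherwise raise an error in pandas 2.x
corr = df.corr(numeric_only=True)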
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()

# Correlation of features with target
corr_target = corr["status"].sort_values(ascending=False)
corr_target



## 4. Data Preprocessing & Train–Test Split

We drop non-feature identifier columns (like `name` if present), separate features and target, split into training and test sets, and apply **standardization**.
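
One caveat worth checking: the UCI dataset contains several recordings per subject, so a row-level split can place recordings from the same person in both the training and test sets and make test scores look better than they would on unseen people. The sketch below shows a subject-level split with `GroupShuffleSplit`. It assumes that `name` values follow a pattern like `phon_R01_S01_1`, where the trailing number is the recording index; verify this against the actual CSV before relying on it. The toy frame and the `subject` column are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: 6 subjects x 3 recordings each (name pattern is an assumption)
subject_status = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1}
rows = [
    {"name": f"phon_R01_S{subj:02d}_{rec}", "feature": subj * 10 + rec, "status": status}
    for subj, status in subject_status.items()
    for rec in range(1, 4)
]
toy = pd.DataFrame(rows)

# Subject ID = everything before the final "_<recording>" suffix
toy["subject"] = toy["name"].str.rsplit("_", n=1).str[0]

# Split by subject so that no person appears on both sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["status"], groups=toy["subject"]))

train_subjects = set(toy.iloc[train_idx]["subject"])
test_subjects = set(toy.iloc[test_idx]["subject"])
print("Subjects shared between train and test:", train_subjects & test_subjects)
```

`GroupShuffleSplit` does not stratify by class; for cross-validation that respects both groups and class balance, scikit-learn's `StratifiedGroupKFold` is one option.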


In [None]:

# Drop non-numeric identifier column (e.g., 'name') if present
if "name" in df.columns:
    df_model = df.drop(columns=["name"])
else:
    df_model = df.copy()

X = df_model.drop(columns=["status"])
y = df_model["status"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


In [None]:

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



## 5. Baseline Model Building & Evaluation

We start with three baseline models:

1. **Logistic Regression**  
2. **K-Nearest Neighbors (KNN)**  
3. **Support Vector Machine (SVM)**  

We compare them using Accuracy, Precision, Recall, F1-score, and ROC-AUC.
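
To make the threshold-based metrics concrete, here is a small worked example that derives accuracy, precision, recall, and F1 from confusion-matrix counts and checks them against scikit-learn. The labels are made up for illustration and are not results from this dataset.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels: 6 positive (Parkinson's) and 4 negative (healthy) cases
demo_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
demo_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all predictions that are right
precision = tp / (tp + fp)                           # of predicted positives, how many are real
recall = tp / (tp + fn)                              # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the classes in a medical dataset are often imbalanced, accuracy alone can be misleading, which is one reason to report recall and F1 alongside it.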


In [None]:

def evaluate_classifier(name, model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and return (metrics, model, y_pred, y_prob)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Positive-class scores for ROC-AUC: prefer probabilities, otherwise fall back
    # to raw decision scores (ROC-AUC only needs a ranking, so no rescaling is needed)
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
    elif hasattr(model, "decision_function"):
        y_prob = model.decision_function(X_test)
    else:
        y_prob = None

    metrics = {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob) if y_prob is not None else np.nan,
    }
    return metrics, model, y_pred, y_prob

results = []

# Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_res, log_model, log_pred, log_prob = evaluate_classifier(
    "Logistic Regression", log_reg, X_train_scaled, y_train, X_test_scaled, y_test
)
results.append(log_res)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn_res, knn_model, knn_pred, knn_prob = evaluate_classifier(
    "KNN", knn, X_train_scaled, y_train, X_test_scaled, y_test
)
results.append(knn_res)

# SVM (RBF kernel)
svm = SVC(kernel="rbf", probability=True, random_state=42)
svm_res, svm_model, svm_pred, svm_prob = evaluate_classifier(
    "SVM (RBF)", svm, X_train_scaled, y_train, X_test_scaled, y_test
)
results.append(svm_res)

baseline_results = pd.DataFrame(results)
baseline_results


In [None]:

# Plot baseline model comparison
metrics_to_plot = ["accuracy", "precision", "recall", "f1", "roc_auc"]
baseline_results_melted = baseline_results.melt(id_vars="model", value_vars=metrics_to_plot,
                                                var_name="metric", value_name="score")

plt.figure(figsize=(10, 5))
sns.barplot(data=baseline_results_melted, x="metric", y="score", hue="model")
plt.title("Baseline Model Comparison")
plt.ylim(0, 1.05)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()



### 5.1 Confusion Matrix & ROC Curve for Best Baseline Model

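We rank the baseline models by F1-score, then visualize the SVM, which is the expected best performer and the model we tune in Section 6. If another model ranks first on your run, swap it in below.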


In [None]:

best_baseline_name = baseline_results.sort_values("f1", ascending=False).iloc[0]["model"]
best_baseline_name


In [None]:

# Use SVM as the main candidate (you can change this if another model performs better)
best_model = svm_model
best_pred = svm_pred
best_prob = svm_prob

# Confusion matrix
cm = confusion_matrix(y_test, best_pred)
plt.figure()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - Baseline SVM")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()

print("Classification Report (Baseline SVM):")
print(classification_report(y_test, best_pred))

# ROC curve
if best_prob is not None:
    RocCurveDisplay.from_predictions(y_test, best_prob)
    plt.title("ROC Curve - Baseline SVM")
    plt.show()



## 6. Hyperparameter Tuning (SVM)

We tune `C` and `gamma` for the SVM model using **GridSearchCV** and compare default vs tuned performance.
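
A design note: when the scaler is fit on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. One way to avoid this is to wrap the scaler and the SVM in a `Pipeline`, so that `GridSearchCV` refits the scaler inside every fold. The sketch below demonstrates the pattern on synthetic data from `make_classification`; the `demo_` names, the step names `scaler` and `svc`, and the reduced grid are illustrative assumptions rather than the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice features (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling is part of the estimator, so it is refit on each CV training fold
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter>
demo_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="f1")
demo_search.fit(X_tr, y_tr)  # raw (unscaled) features go in

demo_test_f1 = demo_search.score(X_te, y_te)  # uses the same F1 scorer
print("Best parameters:", demo_search.best_params_)
print("Test F1:", demo_test_f1)
```

A side benefit of this pattern is that the fitted pipeline can be saved as a single object, so the scaler and model cannot get out of sync at deployment time.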


In [None]:

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],
    "kernel": ["rbf"]
}

svm_base = SVC(probability=True, random_state=42)
grid_search = GridSearchCV(
    svm_base,
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV F1 score:", grid_search.best_score_)


In [None]:

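# Evaluate tuned SVM on test set
# (best_estimator_ is already refit on the training set; evaluate_classifier
# fits it again with the same parameters, which is redundant but harmless)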
svm_tuned = grid_search.best_estimator_
tuned_res, tuned_model, tuned_pred, tuned_prob = evaluate_classifier(
    "SVM (Tuned)", svm_tuned, X_train_scaled, y_train, X_test_scaled, y_test
)

tuned_results = pd.DataFrame([svm_res, tuned_res])
tuned_results


In [None]:

# Visual comparison: baseline vs tuned SVM
tuned_results_melted = tuned_results.melt(id_vars="model", value_vars=metrics_to_plot,
                                          var_name="metric", value_name="score")

plt.figure(figsize=(10, 5))
sns.barplot(data=tuned_results_melted, x="metric", y="score", hue="model")
plt.title("Baseline vs Tuned SVM Performance")
plt.ylim(0, 1.05)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()


In [None]:

# Confusion matrix and ROC for tuned SVM
cm_tuned = confusion_matrix(y_test, tuned_pred)
plt.figure()
sns.heatmap(cm_tuned, annot=True, fmt="d", cmap="Greens")
plt.title("Confusion Matrix - Tuned SVM")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()

print("Classification Report (Tuned SVM):")
print(classification_report(y_test, tuned_pred))

if tuned_prob is not None:
    RocCurveDisplay.from_predictions(y_test, tuned_prob)
    plt.title("ROC Curve - Tuned SVM")
    plt.show()



## 7. Model Interpretability (Permutation Importance)

We use **permutation importance** to understand which features are most important for the tuned SVM model.
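
The idea behind permutation importance is simple: shuffle one feature column at a time, which breaks its link to the target, and measure how much the model's score drops. The sketch below implements that loop by hand on synthetic data with one informative and one pure-noise feature. The function `manual_permutation_importance` and the data are illustrative assumptions; the notebook itself relies on scikit-learn's `permutation_importance`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: column 0 determines the label, column 1 is pure noise
data_rng = np.random.default_rng(0)
signal = data_rng.normal(size=400)
noise = data_rng.normal(size=400)
X_sig = np.column_stack([signal, noise])
y_sig = (signal > 0).astype(int)

# Fit on the first 300 rows, measure importance on the held-out 100
clf = LogisticRegression().fit(X_sig[:300], y_sig[:300])
X_hold, y_hold = X_sig[300:], y_sig[300:]

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column, averaged over repeats."""
    rng = np.random.default_rng(seed)
    base_score = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for k in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle only column j
            drops[j, k] = base_score - model.score(X_perm, y)
    return drops.mean(axis=1)

importances = manual_permutation_importance(clf, X_hold, y_hold)
print("Importance of signal feature:", importances[0])
print("Importance of noise feature:", importances[1])
```

Note that when features are strongly correlated, shuffling one of them may change the score very little because the model can lean on the others, so low permutation importance does not by itself mean a feature is irrelevant.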


In [None]:

r = permutation_importance(
    tuned_model, X_test_scaled, y_test, n_repeats=20, random_state=42, n_jobs=-1
)

importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance_mean": r.importances_mean,
    "importance_std": r.importances_std,
}).sort_values("importance_mean", ascending=False)

importance_df.head(15)


In [None]:

# Plot top 15 important features
top_n = 15
plt.figure(figsize=(8, 6))
sns.barplot(
    data=importance_df.head(top_n),
    x="importance_mean",
    y="feature"
)
plt.title("Top Feature Importances (Permutation Importance)")
plt.xlabel("Mean decrease in accuracy when permuted")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()



## 8. Error Analysis

We inspect some of the misclassified examples to understand where the model fails.
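
It can help to separate the two kinds of error, since a missed Parkinson’s case (false negative) and a false alarm on a healthy subject (false positive) carry different consequences. The sketch below shows the filtering on a toy frame that reuses the `true_status` and `pred_status` column names; the values are made up.

```python
import pandas as pd

# Toy predictions (made-up values) using the same column names as error_df
toy_errors = pd.DataFrame({
    "true_status": [1, 1, 0, 0, 1, 0],
    "pred_status": [1, 0, 1, 0, 1, 0],
})

# False negatives: Parkinson's cases predicted as healthy
false_negatives = toy_errors[
    (toy_errors["true_status"] == 1) & (toy_errors["pred_status"] == 0)
]
# False positives: healthy subjects predicted as Parkinson's
false_positives = toy_errors[
    (toy_errors["true_status"] == 0) & (toy_errors["pred_status"] == 1)
]

print("False negatives:", len(false_negatives))
print("False positives:", len(false_positives))
```

Applied to `error_df`, the same filters make it possible to compare feature values of each error type against the correctly classified samples.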


In [None]:

# Add predictions to a copy of X_test
error_df = X_test.copy()
error_df["true_status"] = y_test
error_df["pred_status"] = tuned_pred
error_df["correct"] = error_df["true_status"] == error_df["pred_status"]

# Show first few misclassified samples
misclassified = error_df[~error_df["correct"]]
misclassified.head()



## 9. Save Final Model & Scaler for Deployment

We save the tuned SVM model and scaler using `joblib`. These files will be used in the **Streamlit app**.
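
On the app side, the saved files have to be loaded and applied in the same order as during training: scale first, then predict. The sketch below shows that round trip on synthetic data in a temporary folder. The `demo_` names and file names are illustrative assumptions, and the Streamlit code itself is not shown.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = SVC(kernel="rbf").fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    scaler_path = os.path.join(tmp_dir, "demo_scaler.pkl")

    # Save both artifacts, as the notebook does for the tuned SVM and scaler
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)

    # Load them back, as the app would at startup
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New input must have the same columns, in the same order, as the training data
new_sample = X_demo[:1]
prediction = loaded_model.predict(loaded_scaler.transform(new_sample))
print("Predicted status:", int(prediction[0]))
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.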


In [None]:

# Save model and scaler
joblib.dump(tuned_model, "parkinsons_model.pkl")
joblib.dump(scaler, "scaler.pkl")

print("Saved: parkinsons_model.pkl and scaler.pkl")



## 10. Conclusion

- We built an **end-to-end ML pipeline** for Parkinson’s disease detection using voice features.  
- After EDA and preprocessing (scaling), we trained **Logistic Regression**, **KNN**, and **SVM** models.  
- The **SVM model** was carried forward as the main candidate after the baseline comparison.  
- Using **GridSearchCV**, we tuned the SVM hyperparameters (`C` and `gamma`) and compared default and tuned performance on the test set using metrics such as **F1-score** and **ROC-AUC**.  
- **Permutation importance** helped identify the most influential features in the prediction.  
- Finally, we saved the tuned model and scaler for deployment via a **Streamlit web app**.

This notebook covers the full ML project lifecycle: problem definition, EDA, model building, tuning, evaluation, interpretability, and deployment readiness.
