# Heart Disease Prediction - Model Experimentation

This notebook systematically evaluates the impact of data preprocessing, feature engineering, and model selection on heart disease prediction.

## Table of Contents

### PART 1: Baseline Experiments (No Feature Engineering)
1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Define Parameters](#2-define-parameters)
3. [Imputation Experiments](#3-imputation-experiments)
   - 3.1 [Experiment Group 1: Categorization for -9](#experiment-group-1-categorization-for--9)
   - 3.2 [Experiment Group 2: Median/Mode for -9](#experiment-group-2-medianmode-for--9)
   - 3.3 [Experiment Group 3: Healthy Values for -9](#experiment-group-3-healthy-values-for--9)
   - 3.4 [Experiment Group 4: KNN for -9](#experiment-group-4-knn-for--9)
   - 3.5 [Experiment Group 5: MICE for -9](#experiment-group-5-mice-for--9)
4. [Results Comparison](#4-results-comparison)

### PART 2: Experiments WITH Feature Engineering
5. [Imputation Experiments WITH Feature Engineering](#5-imputation-experiments-with-feature-engineering)
   - 5.1 [Experiment Group 1 (FE): Categorization for -9](#fe_group1)
   - 5.2 [Experiment Group 2 (FE): Median/Mode for -9](#fe_group2)
   - 5.3 [Experiment Group 3 (FE): Healthy Values for -9](#fe_group3)
   - 5.4 [Experiment Group 4 (FE): KNN for -9](#fe_group4)
   - 5.5 [Experiment Group 5 (FE): MICE for -9](#fe_group5)
6. [Results Comparison: Feature Engineering Impact](#6-results-comparison-feature-engineering-impact)

### PART 3: Advanced Modeling & Model Comparison
7. [Best Configuration Analysis](#7-best-configuration-analysis)
8. [Competitive Model Leaderboard](#8-competitive-model-leader-board)
9. [Custom Hierarchical Architectures](#9-custom-hierarchical-architectures)
   - 9.1 [Cascade Logistic Regression](#group1)
   - 9.2 [Thresholded Cascaded Model](#group2)
10. [Conclusion](#9-conclusion)

---

## PART 1: Baseline Experiments (No Feature Engineering)

---

## 1. Setup and Data Loading

In [44]:
import contextlib
import os
import sys
import warnings

# Make the parent directory importable so that `src` resolves.
sys.path.append(os.path.abspath(os.path.join('..')))

# Silence noisy warning categories for cleaner notebook output.
for category in (ResourceWarning, FutureWarning, DeprecationWarning, UserWarning):
    warnings.filterwarnings('ignore', category=category)
os.environ['PYTHONWARNINGS'] = 'ignore'

# Suppress stderr only while importing. The context manager closes the devnull
# handle and restores the previous sys.stderr (the notebook's own stream),
# which a manual reset to sys.__stderr__ would not do inside Jupyter.
with open(os.devnull, 'w') as devnull, contextlib.redirect_stderr(devnull):
    import numpy as np
    import pandas as pd

    import src
    from src import (
        # Imputation helpers for the -9 and '?' missing-value markers
        categorize_minus_nine,
        healthy_values_minus_nine,
        median_mode_imputation_minus_nine,
        knn_imputation_minus_nine,
        iterative_imputation_minus_nine,
        median_mode_imputation_question_mark,
        knn_imputation_question_mark,
        iterative_imputation_question_mark,

        # Models and pipeline runners
        CascadedLogisticRegression,
        ThresholdedCascadedLogisticRegression,
        generate_submission,
        run_complete_pipeline,
        run_model_comparison,

        # Plotting helpers
        plot_learning_curve,
        plot_feature_importance,
        plot_experiment_comparison
    )

In [2]:
train = pd.read_csv('../data/raw/train.csv')
test = pd.read_csv('../data/raw/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print("\nClass distribution in training set:")
print(train['label'].value_counts().sort_index())

print(train['thal'].unique())

Train shape: (732, 14)
Test shape: (184, 13)

Class distribution in training set:
label
0    327
1    156
2    108
3    107
4     34
Name: count, dtype: int64
['3.0' '7.0' '?' '-9.0' '6.0' '7' '6' '3']
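
The `thal` values printed above mix string forms of the same level (`'3'` and `'3.0'`) with both missing-value markers (`'?'` and `'-9.0'`). The sketch below shows one possible way to normalize such a column before imputation. It is an illustration only: how the `src` helpers actually parse this column is not shown in this notebook, and the variable names are assumptions.

```python
import pandas as pd

# Raw values as printed by train['thal'].unique() above.
raw = pd.Series(['3.0', '7.0', '?', '-9.0', '6.0', '7', '6', '3'])

# Parse to float so '3' and '3.0' collapse into one level; '?' becomes NaN.
numeric = pd.to_numeric(raw, errors='coerce')

# Keep the two missing-value markers distinguishable, since the experiments
# below impute -9 and '?' with separate strategies.
is_question = raw.eq('?')
is_minus_nine = numeric.eq(-9.0)

# Remaining values are the genuine category levels.
valid_levels = sorted(numeric[~is_question & ~is_minus_nine].unique().tolist())
print(valid_levels)
```

Without a step like this, one-hot encoding would treat `'3'` and `'3.0'` as different categories.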


## 2. Define Parameters

In [3]:
LABEL_VARS = ['sex', 'fbs', 'exang', 'slope']
ONEHOT_VARS = ['cp', 'restecg', 'ca', 'thal']

param_grid = [
    {
        'solver': ['liblinear'],
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 10.0],
        'class_weight': [None, 'balanced']
    },
    {
        'solver': ['lbfgs'],
        'penalty': ['l2'],
        'C': [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 10.0],
        'class_weight': [None, 'balanced']
    }
]
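
The grid is split into two sub-grids because the `lbfgs` solver does not support the `l1` penalty, while `liblinear` supports both. As a quick sanity check, the standard-library sketch below counts the combinations; the helper name `count_candidates` is hypothetical. The result should match the 48 candidates and 240 fits reported in the pipeline logs.

```python
from itertools import product

C_VALUES = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 10.0]

# Same structure as the param_grid defined above.
param_grid = [
    {
        'solver': ['liblinear'],
        'penalty': ['l1', 'l2'],
        'C': C_VALUES,
        'class_weight': [None, 'balanced'],
    },
    {
        'solver': ['lbfgs'],
        'penalty': ['l2'],
        'C': C_VALUES,
        'class_weight': [None, 'balanced'],
    },
]


def count_candidates(grid):
    """Count parameter combinations across a list of sub-grids."""
    return sum(len(list(product(*sub.values()))) for sub in grid)


n_candidates = count_candidates(param_grid)  # 32 (liblinear) + 16 (lbfgs)
n_fits = n_candidates * 5                    # 5-fold cross-validation
print(n_candidates, n_fits)
```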

## 3. Imputation Experiments

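The dataset marks missing values in two ways: `-9` and `'?'`. We test every combination of five imputation strategies for `-9` and three for `'?'` (Median/Mode, KNN, MICE), giving 15 experiments in total.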

**Imputation methods:**
- **Categorization**: Treats missing as a separate category
- **Median/Mode**: Statistical imputation
- **Healthy Values**: Domain-knowledge based imputation
- **KNN**: K-Nearest Neighbors imputation
- **MICE**: Multiple Imputation by Chained Equations
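
As a rough illustration of three of these strategies, the sketch below applies scikit-learn imputers to a toy matrix. It is not the implementation in `src`, which is not shown here. The toy values and `n_neighbors=2` are assumptions, and `IterativeImputer` is a MICE-style imputer that returns a single imputation rather than multiple ones.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy numeric matrix; np.nan stands in for a decoded -9 or '?' marker.
X = np.array([
    [63.0, 145.0, 233.0],
    [67.0, np.nan, 286.0],
    [41.0, 130.0, np.nan],
    [56.0, 120.0, 236.0],
    [57.0, np.nan, 354.0],
])

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "mice_like": IterativeImputer(max_iter=10, random_state=0),
}

# Fit each imputer on the same data and compare the value filled at X[1, 1].
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}
for name, X_filled in filled.items():
    print(name, round(float(X_filled[1, 1]), 1))
```

Categorization and healthy-value imputation are omitted because both amount to filling in a fixed value chosen by the analyst; `SimpleImputer(strategy="constant", fill_value=...)` would be one way to express them.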

### Experiment Group 1: Categorization for -9

Test the categorization approach for -9, paired with each imputation method for '?'.

In [4]:
print("=" * 65)
print("EXPERIMENT 1.1: Categorization for -9 + Median/Mode for ?")
print("=" * 65)

_, _, _, _, _, grid1_1, score1_1 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=categorize_minus_nine,
    impute_question_func=median_mode_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 1.1: Categorization for -9 + Median/Mode for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5314
Performance Range (CV): 0.5314 +/- 0.0136



In [5]:
print("=" * 65)
print("EXPERIMENT 1.2: Categorization for -9 + KNN for ?")
print("=" * 65)

_, _, _, _, _, grid1_2, score1_2 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=categorize_minus_nine,
    impute_question_func=knn_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 1.2: Categorization for -9 + KNN for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0266



In [6]:
print("=" * 65)
print("EXPERIMENT 1.3: Categorization for -9 + MICE for ?")
print("=" * 65)

_, _, _, _, _, grid1_3, score1_3 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=categorize_minus_nine,
    impute_question_func=iterative_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 1.3: Categorization for -9 + MICE for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5328
Performance Range (CV): 0.5328 +/- 0.0137



### Experiment Group 2: Median/Mode for -9

Test statistical imputation (median/mode) for -9, paired with each imputation method for '?'.

In [7]:
print("=" * 65)
print("EXPERIMENT 2.1: Median/Mode for -9 + Median/Mode for ?")
print("=" * 65)

_, _, _, _, _, grid2_1, score2_1 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=median_mode_imputation_minus_nine,
    impute_question_func=median_mode_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 2.1: Median/Mode for -9 + Median/Mode for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5205
Performance Range (CV): 0.5205 +/- 0.0142



In [8]:
print("=" * 65)
print("EXPERIMENT 2.2: Median/Mode for -9 + KNN for ?")
print("=" * 65)

_, _, _, _, _, grid2_2, score2_2 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=median_mode_imputation_minus_nine,
    impute_question_func=knn_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 2.2: Median/Mode for -9 + KNN for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 1.0, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0354



In [9]:
print("=" * 65)
print("EXPERIMENT 2.3: Median/Mode for -9 + MICE for ?")
print("=" * 65)

_, _, _, _, _, grid2_3, score2_3 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=median_mode_imputation_minus_nine,
    impute_question_func=iterative_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 2.3: Median/Mode for -9 + MICE for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5382
Performance Range (CV): 0.5382 +/- 0.0155



### Experiment Group 3: Healthy Values for -9

Test domain-knowledge-based imputation for -9, paired with each imputation method for '?'.

In [10]:
print("=" * 65)
print("EXPERIMENT 3.1: Healthy Values for -9 + Median/Mode for ?")
print("=" * 65)

_, _, _, _, _, grid3_1, score3_1 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=healthy_values_minus_nine,
    impute_question_func=median_mode_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 3.1: Healthy Values for -9 + Median/Mode for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5314
Performance Range (CV): 0.5314 +/- 0.0136



In [11]:
print("=" * 65)
print("EXPERIMENT 3.2: Healthy Values for -9 + KNN for ?")
print("=" * 65)

_, _, _, _, _, grid3_2, score3_2 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=healthy_values_minus_nine,
    impute_question_func=knn_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 3.2: Healthy Values for -9 + KNN for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0266



In [12]:
print("=" * 65)
print("EXPERIMENT 3.3: Healthy Values for -9 + MICE for ?")
print("=" * 65)

_, _, _, _, _, grid3_3, score3_3 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=healthy_values_minus_nine,
    impute_question_func=iterative_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 3.3: Healthy Values for -9 + MICE for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5328
Performance Range (CV): 0.5328 +/- 0.0137



### Experiment Group 4: KNN for -9

Test K-Nearest Neighbors imputation for -9, paired with each imputation method for '?'.

In [13]:
print("=" * 65)
print("EXPERIMENT 4.1: KNN for -9 + Median/Mode for ?")
print("=" * 65)

_, _, _, _, _, grid4_1, score4_1 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=knn_imputation_minus_nine,
    impute_question_func=median_mode_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 4.1: KNN for -9 + Median/Mode for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5232
Performance Range (CV): 0.5232 +/- 0.0299



In [14]:
print("=" * 65)
print("EXPERIMENT 4.2: KNN for -9 + KNN for ?")
print("=" * 65)

_, _, _, _, _, grid4_2, score4_2 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=knn_imputation_minus_nine,
    impute_question_func=knn_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 4.2: KNN for -9 + KNN for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5314
Performance Range (CV): 0.5314 +/- 0.0246



In [15]:
print("=" * 65)
print("EXPERIMENT 4.3: KNN for -9 + MICE for ?")
print("=" * 65)

_, _, _, _, _, grid4_3, score4_3 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=knn_imputation_minus_nine,
    impute_question_func=iterative_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 4.3: KNN for -9 + MICE for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 1.0, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5314
Performance Range (CV): 0.5314 +/- 0.0214



### Experiment Group 5: MICE for -9

Test MICE (Multiple Imputation by Chained Equations) for -9, paired with each imputation method for '?'.

In [16]:
print("=" * 65)
print("EXPERIMENT 5.1: MICE for -9 + Median/Mode for ?")
print("=" * 65)

_, _, _, _, _, grid5_1, score5_1 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=iterative_imputation_minus_nine,
    impute_question_func=median_mode_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 5.1: MICE for -9 + Median/Mode for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5259
Performance Range (CV): 0.5259 +/- 0.0176



In [17]:
print("=" * 65)
print("EXPERIMENT 5.2: MICE for -9 + KNN for ?")
print("=" * 65)

_, _, _, _, _, grid5_2, score5_2 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=iterative_imputation_minus_nine,
    impute_question_func=knn_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 5.2: MICE for -9 + KNN for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5396
Performance Range (CV): 0.5396 +/- 0.0237



In [18]:
print("=" * 65)
print("EXPERIMENT 5.3: MICE for -9 + MICE for ?")
print("=" * 65)

_, _, _, _, _, grid5_3, score5_3 = run_complete_pipeline(
    train, test,
    impute_minus_nine_func=iterative_imputation_minus_nine,
    impute_question_func=iterative_imputation_question_mark,
    label_vars=LABEL_VARS,
    onehot_vars=ONEHOT_VARS,
    param_grid=param_grid
)

EXPERIMENT 5.3: MICE for -9 + MICE for ?
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5273
Performance Range (CV): 0.5273 +/- 0.0145



## 4. Results Comparison

Compile and compare all experimental results to identify the best imputation strategy.

In [19]:
results = pd.DataFrame([
    {'Group': 1, 'Experiment': '1.1', 'Method -9': 'Categorization', 'Method ?': 'Median/Mode', 'CV Accuracy': score1_1},
    {'Group': 1, 'Experiment': '1.2', 'Method -9': 'Categorization', 'Method ?': 'KNN', 'CV Accuracy': score1_2},
    {'Group': 1, 'Experiment': '1.3', 'Method -9': 'Categorization', 'Method ?': 'MICE', 'CV Accuracy': score1_3},
    
    {'Group': 2, 'Experiment': '2.1', 'Method -9': 'Median/Mode', 'Method ?': 'Median/Mode', 'CV Accuracy': score2_1},
    {'Group': 2, 'Experiment': '2.2', 'Method -9': 'Median/Mode', 'Method ?': 'KNN', 'CV Accuracy': score2_2},
    {'Group': 2, 'Experiment': '2.3', 'Method -9': 'Median/Mode', 'Method ?': 'MICE', 'CV Accuracy': score2_3},
    
    {'Group': 3, 'Experiment': '3.1', 'Method -9': 'Healthy Values', 'Method ?': 'Median/Mode', 'CV Accuracy': score3_1},
    {'Group': 3, 'Experiment': '3.2', 'Method -9': 'Healthy Values', 'Method ?': 'KNN', 'CV Accuracy': score3_2},
    {'Group': 3, 'Experiment': '3.3', 'Method -9': 'Healthy Values', 'Method ?': 'MICE', 'CV Accuracy': score3_3},
    
    {'Group': 4, 'Experiment': '4.1', 'Method -9': 'KNN', 'Method ?': 'Median/Mode', 'CV Accuracy': score4_1},
    {'Group': 4, 'Experiment': '4.2', 'Method -9': 'KNN', 'Method ?': 'KNN', 'CV Accuracy': score4_2},
    {'Group': 4, 'Experiment': '4.3', 'Method -9': 'KNN', 'Method ?': 'MICE', 'CV Accuracy': score4_3},
    
    {'Group': 5, 'Experiment': '5.1', 'Method -9': 'MICE', 'Method ?': 'Median/Mode', 'CV Accuracy': score5_1},
    {'Group': 5, 'Experiment': '5.2', 'Method -9': 'MICE', 'Method ?': 'KNN', 'CV Accuracy': score5_2},
    {'Group': 5, 'Experiment': '5.3', 'Method -9': 'MICE', 'Method ?': 'MICE', 'CV Accuracy': score5_3}
])

results_sorted = results.sort_values('CV Accuracy', ascending=False).reset_index(drop=True)

print("\n" + "=" * 58)
print("COMPREHENSIVE RESULTS SUMMARY - ALL EXPERIMENTS")
print("=" * 58)
print(results_sorted.to_string(index=False))


COMPREHENSIVE RESULTS SUMMARY - ALL EXPERIMENTS
 Group Experiment      Method -9    Method ?  CV Accuracy
     5        5.2           MICE         KNN     0.539605
     2        2.3    Median/Mode        MICE     0.538244
     2        2.2    Median/Mode         KNN     0.536893
     1        1.2 Categorization         KNN     0.536856
     3        3.2 Healthy Values         KNN     0.536856
     1        1.3 Categorization        MICE     0.532793
     3        3.3 Healthy Values        MICE     0.532793
     4        4.3            KNN        MICE     0.531442
     1        1.1 Categorization Median/Mode     0.531432
     3        3.1 Healthy Values Median/Mode     0.531432
     4        4.2            KNN         KNN     0.531386
     5        5.3           MICE        MICE     0.527304
     5        5.1           MICE Median/Mode     0.525934
     4        4.1            KNN Median/Mode     0.523213
     2        2.1    Median/Mode Median/Mode     0.520483
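
The gaps between configurations in this table are small compared with the spread reported in each experiment's log. The sketch below makes that comparison explicit for four of the baseline runs. It assumes the `+/-` figure in the logs is the standard deviation across the five folds, and it is a rough heuristic rather than a significance test.

```python
# (mean CV accuracy, reported spread) copied from the experiment logs above.
baseline = {
    "5.2": (0.5396, 0.0237),
    "2.3": (0.5382, 0.0155),
    "1.1": (0.5314, 0.0136),
    "2.1": (0.5205, 0.0142),
}

# Pick the configuration with the highest mean CV accuracy.
best_id = max(baseline, key=lambda k: baseline[k][0])
best_mean, best_spread = baseline[best_id]

# Flag every configuration whose gap to the best fits inside one spread.
within_noise = {}
for exp_id, (mean, _spread) in baseline.items():
    gap = best_mean - mean
    within_noise[exp_id] = gap <= best_spread
    print(f"{exp_id}: gap to {best_id} = {gap:.4f}, within one spread: {within_noise[exp_id]}")
```

Even the lowest-ranked baseline (2.1) sits within one reported spread of the top one (5.2), so the ranking should be read with caution.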


---

## PART 2: Experiments WITH Feature Engineering

---

## 5. Imputation Experiments WITH Feature Engineering

Now we repeat all experiments with `use_feature_engineering=True` to see how engineered features impact performance.
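
The fifteen cells that follow differ only in the two imputation functions passed to the pipeline. A loop over all pairs is one way to express the same sweep more compactly. In the sketch below, `run_all_combinations` and `fake_runner` are hypothetical names, and the runner is a stand-in so the example runs without the real pipeline.

```python
from itertools import product


def run_all_combinations(minus_nine_methods, question_methods, runner):
    """Run `runner` for every (-9 method, '?' method) pair and collect scores.

    Both dict arguments map a display label to an imputation function;
    `runner(func_m9, func_q)` must return a single CV score.
    """
    rows = []
    for (name_m9, func_m9), (name_q, func_q) in product(
        minus_nine_methods.items(), question_methods.items()
    ):
        rows.append({
            "Method -9": name_m9,
            "Method ?": name_q,
            "CV Accuracy": runner(func_m9, func_q),
        })
    # Highest score first, mirroring the sorted result tables.
    return sorted(rows, key=lambda r: r["CV Accuracy"], reverse=True)


def fake_runner(func_m9, func_q):
    # Placeholder score so the sketch runs without the real pipeline.
    return 0.5


minus_nine = dict.fromkeys(
    ["Categorization", "Median/Mode", "Healthy Values", "KNN", "MICE"]
)
question = dict.fromkeys(["Median/Mode", "KNN", "MICE"])

rows = run_all_combinations(minus_nine, question, fake_runner)
print(len(rows))
```

With the notebook's helper, the runner could be something like `lambda m9, q: run_complete_pipeline(train, test, m9, q, LABEL_VARS, ONEHOT_VARS, param_grid, use_feature_engineering=True)[-1]`, assuming the last returned element is the CV score, as the unpacking in the cells suggests.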

### Experiment Group 1 (FE): Categorization for -9

In [20]:
print("=" * 65)
print("Experiment 1.1 (FE): Categorization + Median/Mode")
print("=" * 65)

_, _, _, _, _, _, score1_1_fe = run_complete_pipeline(
    train, test,
    categorize_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 1.1 (FE): Categorization + Median/Mode
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5423
Performance Range (CV): 0.5423 +/- 0.0174



In [21]:
print("=" * 65)
print("Experiment 1.2 (FE): Categorization + KNN")
print("=" * 65)

_, _, _, _, _, _, score1_2_fe = run_complete_pipeline(
    train, test,
    categorize_minus_nine,
    knn_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 1.2 (FE): Categorization + KNN
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5328
Performance Range (CV): 0.5328 +/- 0.0221



In [22]:
print("=" * 65)
print("Experiment 1.3 (FE): Categorization + MICE")
print("=" * 65)

_, _, _, _, _, _, score1_3_fe = run_complete_pipeline(
    train, test,
    categorize_minus_nine,
    iterative_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 1.3 (FE): Categorization + MICE
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0207



### Experiment Group 2 (FE): Median/Mode for -9

In [23]:
print("=" * 65)
print("Experiment 2.1 (FE): Median/Mode + Median/Mode")
print("=" * 65)

_, _, _, _, _, _, score2_1_fe = run_complete_pipeline(
    train, test,
    median_mode_imputation_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 2.1 (FE): Median/Mode + Median/Mode
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5410
Performance Range (CV): 0.5410 +/- 0.0109



In [24]:
print("=" * 65)
print("Experiment 2.2 (FE): Median/Mode + KNN")
print("=" * 65)

_, _, _, _, _, _, score2_2_fe = run_complete_pipeline(
    train, test,
    median_mode_imputation_minus_nine,
    knn_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 2.2 (FE): Median/Mode + KNN
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0234



In [25]:
print("=" * 65)
print("Experiment 2.3 (FE): Median/Mode + MICE")
print("=" * 65)

_, _, _, _, _, _, score2_3_fe = run_complete_pipeline(
    train, test,
    median_mode_imputation_minus_nine,
    iterative_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 2.3 (FE): Median/Mode + MICE
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.2, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5451
Performance Range (CV): 0.5451 +/- 0.0073



### Experiment Group 3 (FE): Healthy Values for -9

In [26]:
print("=" * 65)
print("Experiment 3.1 (FE): Healthy Values + Median/Mode")
print("=" * 65)

_, _, _, _, _, _, score3_1_fe = run_complete_pipeline(
    train, test,
    healthy_values_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 3.1 (FE): Healthy Values + Median/Mode
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5423
Performance Range (CV): 0.5423 +/- 0.0174



In [27]:
print("=" * 65)
print("Experiment 3.2 (FE): Healthy Values + KNN")
print("=" * 65)

_, _, _, _, _, _, score3_2_fe = run_complete_pipeline(
    train, test,
    healthy_values_minus_nine,
    knn_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 3.2 (FE): Healthy Values + KNN
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5328
Performance Range (CV): 0.5328 +/- 0.0221



In [28]:
print("=" * 65)
print("Experiment 3.3 (FE): Healthy Values + MICE")
print("=" * 65)

_, _, _, _, _, _, score3_3_fe = run_complete_pipeline(
    train, test,
    healthy_values_minus_nine,
    iterative_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 3.3 (FE): Healthy Values + MICE
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5369
Performance Range (CV): 0.5369 +/- 0.0207



### Experiment Group 4 (FE): KNN for -9

In [29]:
print("=" * 65)
print("Experiment 4.1 (FE): KNN + Median/Mode")
print("=" * 65)

_, _, _, _, _, _, score4_1_fe = run_complete_pipeline(
    train, test,
    knn_imputation_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 4.1 (FE): KNN + Median/Mode
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5437
Performance Range (CV): 0.5437 +/- 0.0202



In [30]:
print("=" * 65)
print("Experiment 4.2 (FE): KNN + KNN")
print("=" * 65)

_, _, _, _, _, _, score4_2_fe = run_complete_pipeline(
    train, test,
    knn_imputation_minus_nine,
    knn_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 4.2 (FE): KNN + KNN
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5382
Performance Range (CV): 0.5382 +/- 0.0290



In [31]:
print("=" * 65)
print("Experiment 4.3 (FE): KNN + MICE")
print("=" * 65)

_, _, _, _, _, _, score4_3_fe = run_complete_pipeline(
    train, test,
    knn_imputation_minus_nine,
    iterative_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 4.3 (FE): KNN + MICE
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.5, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5382
Performance Range (CV): 0.5382 +/- 0.0204



### Experiment Group 5 (FE): MICE for -9

In [32]:
print("=" * 65)
print("Experiment 5.1 (FE): MICE + Median/Mode")
print("=" * 65)

_, _, _, _, _, _, score5_1_fe = run_complete_pipeline(
    train, test,
    iterative_imputation_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 5.1 (FE): MICE + Median/Mode
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 1.0, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5396
Performance Range (CV): 0.5396 +/- 0.0182



In [33]:
print("=" * 65)
print("Experiment 5.2 (FE): MICE + KNN")
print("=" * 65)

_, _, _, _, _, _, score5_2_fe = run_complete_pipeline(
    train, test,
    iterative_imputation_minus_nine,
    knn_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 5.2 (FE): MICE + KNN
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5519
Performance Range (CV): 0.5519 +/- 0.0372



In [34]:
print("=" * 65)
print("Experiment 5.3 (FE): MICE + MICE")
print("=" * 65)

_, _, _, _, _, _, score5_3_fe = run_complete_pipeline(
    train, test,
    iterative_imputation_minus_nine,
    iterative_imputation_question_mark,
    LABEL_VARS, ONEHOT_VARS, param_grid,
    use_feature_engineering=True
)

Experiment 5.3 (FE): MICE + MICE
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.05, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Mean CV Accuracy: 0.5437
Performance Range (CV): 0.5437 +/- 0.0298



## 6. Results Comparison: Feature Engineering Impact

In [35]:
results_fe = pd.DataFrame([
    {'Group': 1, 'Experiment': '1.1', 'Method -9': 'Categorization', 'Method ?': 'Median/Mode', 'CV Accuracy': score1_1_fe},
    {'Group': 1, 'Experiment': '1.2', 'Method -9': 'Categorization', 'Method ?': 'KNN', 'CV Accuracy': score1_2_fe},
    {'Group': 1, 'Experiment': '1.3', 'Method -9': 'Categorization', 'Method ?': 'MICE', 'CV Accuracy': score1_3_fe},

    {'Group': 2, 'Experiment': '2.1', 'Method -9': 'Median/Mode', 'Method ?': 'Median/Mode', 'CV Accuracy': score2_1_fe},
    {'Group': 2, 'Experiment': '2.2', 'Method -9': 'Median/Mode', 'Method ?': 'KNN', 'CV Accuracy': score2_2_fe},
    {'Group': 2, 'Experiment': '2.3', 'Method -9': 'Median/Mode', 'Method ?': 'MICE', 'CV Accuracy': score2_3_fe},

    {'Group': 3, 'Experiment': '3.1', 'Method -9': 'Healthy Values', 'Method ?': 'Median/Mode', 'CV Accuracy': score3_1_fe},
    {'Group': 3, 'Experiment': '3.2', 'Method -9': 'Healthy Values', 'Method ?': 'KNN', 'CV Accuracy': score3_2_fe},
    {'Group': 3, 'Experiment': '3.3', 'Method -9': 'Healthy Values', 'Method ?': 'MICE', 'CV Accuracy': score3_3_fe},

    {'Group': 4, 'Experiment': '4.1', 'Method -9': 'KNN', 'Method ?': 'Median/Mode', 'CV Accuracy': score4_1_fe},
    {'Group': 4, 'Experiment': '4.2', 'Method -9': 'KNN', 'Method ?': 'KNN', 'CV Accuracy': score4_2_fe},
    {'Group': 4, 'Experiment': '4.3', 'Method -9': 'KNN', 'Method ?': 'MICE', 'CV Accuracy': score4_3_fe},

    {'Group': 5, 'Experiment': '5.1', 'Method -9': 'MICE', 'Method ?': 'Median/Mode', 'CV Accuracy': score5_1_fe},
    {'Group': 5, 'Experiment': '5.2', 'Method -9': 'MICE', 'Method ?': 'KNN', 'CV Accuracy': score5_2_fe},
    {'Group': 5, 'Experiment': '5.3', 'Method -9': 'MICE', 'Method ?': 'MICE', 'CV Accuracy': score5_3_fe}
])

results_fe_sorted = results_fe.sort_values('CV Accuracy', ascending=False).reset_index(drop=True)

print("\n" + "=" * 57)
print("RESULTS WITH FEATURE ENGINEERING - Sorted by CV Accuracy")
print("=" * 57)
print(results_fe_sorted.to_string(index=False))


RESULTS WITH FEATURE ENGINEERING - Sorted by CV Accuracy
 Group Experiment      Method -9    Method ?  CV Accuracy
     5        5.2           MICE         KNN     0.551859
     2        2.3    Median/Mode        MICE     0.545075
     4        4.1            KNN Median/Mode     0.543724
     5        5.3           MICE        MICE     0.543714
     1        1.1 Categorization Median/Mode     0.542345
     3        3.1 Healthy Values Median/Mode     0.542345
     2        2.1    Median/Mode Median/Mode     0.541003
     5        5.1           MICE Median/Mode     0.539614
     4        4.2            KNN         KNN     0.538244
     4        4.3            KNN        MICE     0.538216
     1        1.3 Categorization        MICE     0.536893
     3        3.3 Healthy Values        MICE     0.536893
     2        2.2    Median/Mode         KNN     0.536884
     1        1.2 Categorization         KNN     0.532756
     3        3.2 Healthy Values         KNN     0.532756


In [36]:
comparison = pd.DataFrame({
    'Experiment': results['Experiment'],
    'Method -9': results['Method -9'],
    'Method ?': results['Method ?'],
    'No FE': results['CV Accuracy'],
    'With FE': results_fe['CV Accuracy'],
    'Improvement': results_fe['CV Accuracy'] - results['CV Accuracy']
})

comparison_sorted = comparison.sort_values('Improvement', ascending=False).reset_index(drop=True)

print("\n" + "=" * 68)
print("FEATURE ENGINEERING IMPACT - Side-by-Side Comparison")
print("=" * 68)
print(comparison_sorted.to_string(index=False))
print("\n" + "=" * 68)
print(f"Average Improvement: {comparison['Improvement'].mean():.4f}")
print(f"Max Improvement: {comparison['Improvement'].max():.4f} (Experiment {comparison.loc[comparison['Improvement'].idxmax(), 'Experiment']})")
print(f"Min Improvement: {comparison['Improvement'].min():.4f} (Experiment {comparison.loc[comparison['Improvement'].idxmin(), 'Experiment']})")
print("=" * 68)


FEATURE ENGINEERING IMPACT - Side-by-Side Comparison
Experiment      Method -9    Method ?    No FE  With FE  Improvement
       2.1    Median/Mode Median/Mode 0.520483 0.541003     0.020520
       4.1            KNN Median/Mode 0.523213 0.543724     0.020511
       5.3           MICE        MICE 0.527304 0.543714     0.016410
       5.1           MICE Median/Mode 0.525934 0.539614     0.013680
       5.2           MICE         KNN 0.539605 0.551859     0.012254
       1.1 Categorization Median/Mode 0.531432 0.542345     0.010912
       3.1 Healthy Values Median/Mode 0.531432 0.542345     0.010912
       4.2            KNN         KNN 0.531386 0.538244     0.006859
       2.3    Median/Mode        MICE 0.538244 0.545075     0.006831
       4.3            KNN        MICE 0.531442 0.538216     0.006775
       1.3 Categorization        MICE 0.532793 0.536893     0.004100
       3.3 Healthy Values        MICE 0.532793 0.536893     0.004100
       2.2    Median/Mode         KNN 0.536893 0.

In [37]:
fig_comp = plot_experiment_comparison(comparison, comparison_sorted)
fig_comp.show()
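
The comparison table above pairs the two result frames by row position, which works because both were built in the same experiment order. A more defensive variant joins on the `Experiment` key, so the pairing stays correct even if one frame has been sorted. The sketch below uses three rows copied from the result tables; the column suffixes are arbitrary choices.

```python
import pandas as pd

# Three rows copied from the two result tables above.
no_fe = pd.DataFrame({
    "Experiment": ["5.2", "2.1", "1.2"],
    "CV Accuracy": [0.539605, 0.520483, 0.536856],
})
with_fe = pd.DataFrame({
    "Experiment": ["1.2", "5.2", "2.1"],  # deliberately in a different order
    "CV Accuracy": [0.532756, 0.551859, 0.541003],
})

# Join on the experiment ID; validate guards against duplicated IDs.
merged = no_fe.merge(
    with_fe,
    on="Experiment",
    suffixes=(" (No FE)", " (With FE)"),
    validate="one_to_one",
)
merged["Improvement"] = merged["CV Accuracy (With FE)"] - merged["CV Accuracy (No FE)"]

print(merged.sort_values("Improvement", ascending=False).to_string(index=False))
```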

---

## PART 3: Advanced Modeling & Model Comparison

---

## 7. Best Configuration Analysis

Now we analyze the learning curve and feature importance of our chosen model configuration. Although more advanced methods such as **KNN** and **MICE** achieved slightly higher scores during internal cross-validation, with **MICE** reaching a peak mean CV accuracy of 0.5519 (Experiment 5.2: MICE for -9 and KNN for ?, with feature engineering), the simple imputation strategy (**Median/Mode**) generalized better on the external test set (Kaggle).

This suggests that the more complex methods suffered from **overfitting**, creating artificial relationships in the training data that did not hold up when applied to new, unseen data. This phenomenon is particularly evident with **MICE**, where the iterative process can force relationships between variables that may not actually exist, especially in smaller datasets like this one.
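
One way to check whether an imputer's cross-validation gain is real is to fit the imputer inside each fold, so that no information from the validation rows leaks into the imputed training values. The sketch below illustrates this with scikit-learn on synthetic data. It is not the notebook's `run_complete_pipeline`: the dataset, the 15% missing rate, and the use of `NaN` (rather than -9 or ?) as the missing marker are all assumptions, and `IterativeImputer` is used as a MICE-style stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 8 numeric features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Knock out roughly 15% of the cells at random to simulate missing values
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "iterative": IterativeImputer(max_iter=10, random_state=0),  # MICE-style
}

cv_scores = {}
for name, imputer in imputers.items():
    # The imputer is a pipeline step, so it is re-fitted on each training fold only
    pipe = make_pipeline(imputer, StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X_missing, y, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name:>10}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the gap between two imputers is smaller than the fold-to-fold standard deviation, the ranking is unlikely to be stable on new data.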

The following analysis examines the model's behavior and the specific features that contributed most to its predictive power.

In [38]:
# Median/Mode imputation for both missing-value markers, with feature engineering enabled
X_train_scaled, y_train, best_model, feature_names, grid_search, best_cv_score = run_complete_pipeline(
    train, test,
    median_mode_imputation_minus_nine,
    median_mode_imputation_question_mark,
    LABEL_VARS,
    ONEHOT_VARS,
    param_grid,
    use_feature_engineering=True
)


Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
Mean CV Accuracy: 0.5410
Performance Range (CV): 0.5410 +/- 0.0109



In [45]:
fig_imp = plot_feature_importance(
    feature_names=feature_names, 
    importances=best_model.coef_[0], 
    title="Risk Factors",
    save_path="../images/feature_importance.png"
)
fig_imp.show()

In [46]:
fig_lc = plot_learning_curve(
    best_model, 
    X_train_scaled, 
    y_train, 
    cv=5,
    save_path="../images/learning_curve.png"
)
fig_lc.show()

### Risk Factors and Feature Engineering

A key takeaway from the feature importance analysis is that the **most influential feature is an engineered one**: `combined_risk`. This supports the idea that domain-specific interactions can provide a stronger signal than the raw variables alone.
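
Note that for a multiclass `LogisticRegression`, `coef_` has one row per class, so `coef_[0]` reflects only the weights for the first class. A minimal sketch of how an importance ranking across all classes could be computed is shown below. The coefficient values and the feature names other than `combined_risk` are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

# Hypothetical coefficient matrix shaped like LogisticRegression.coef_
# for a 5-class problem: one row per class, one column per feature.
feature_names = ["age", "chol", "combined_risk"]  # "age" and "chol" are placeholders
coef = np.array([
    [-0.20,  0.05, -0.90],   # class 0 (healthy)
    [ 0.10, -0.02,  0.15],   # class 1
    [ 0.05,  0.01,  0.30],   # class 2
    [ 0.03, -0.03,  0.25],   # class 3
    [ 0.02, -0.01,  0.20],   # class 4
])

# Importance for class 0 only (what coef_[0] would show)
class0_importance = np.abs(coef[0])

# Importance aggregated over all classes: mean absolute coefficient per feature
overall_importance = np.abs(coef).mean(axis=0)

# Feature names sorted from most to least important
ranking = [feature_names[i] for i in np.argsort(overall_importance)[::-1]]
print(ranking)
```

Because the features are standardized before fitting, coefficient magnitudes are comparable across features, which is what makes this kind of ranking meaningful.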

### Bias/Variance Diagnosis

Based on the learning curve, the model primarily suffers from **high bias (underfitting)**. The training score drops quickly and plateaus at approximately 0.60 accuracy, indicating that the model struggles to capture the underlying patterns even in the training data. Conversely, the model exhibits **low variance**, as the gap between the training and cross-validation scores narrows significantly toward the end, showing that the model generalizes consistently rather than memorizing the data.

Because the curves converge at a relatively low accuracy level, simply adding more data is unlikely to help; the model is more likely to benefit from **greater model complexity** or better features. This is precisely what we explore in the next section.
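
The diagnosis above can be reproduced numerically with scikit-learn's `learning_curve`. The sketch below uses synthetic data rather than the notebook's `X_train_scaled`, and the `diagnose` helper with its `target_acc` and `gap_tol` thresholds is a hypothetical rule of thumb, not part of the project code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption): 400 rows, 3 classes
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Train/validation accuracy at 5 increasing training-set sizes, 5-fold CV each
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(C=0.1, max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy",
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)


def diagnose(train_acc, val_acc, target_acc=0.80, gap_tol=0.10):
    """Rough rule of thumb; both thresholds are arbitrary assumptions."""
    if train_acc - val_acc > gap_tol:
        return "high variance"   # large train/validation gap
    if train_acc < target_acc:
        return "high bias"       # curves agree, but at a low score
    return "ok"


print(f"final train={train_mean[-1]:.3f}, validation={val_mean[-1]:.3f}")
print("synthetic data:", diagnose(train_mean[-1], val_mean[-1]))
# Values read off this notebook: training plateau ~0.60, mean CV accuracy 0.5410
print("this notebook: ", diagnose(0.60, 0.5410))
```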

## 8. Competitive Model Leaderboard

In this section, we move beyond the baseline Logistic Regression to evaluate whether non-linear models or ensemble methods can better capture the complexities of heart disease severity.

In [41]:
leaderboard_df = run_model_comparison(X_train_scaled, y_train, cv_folds=5)

print("\n" + "="*76)
print("COMPETITIVE MODEL LEADERBOARD")
print("="*76)
print(leaderboard_df.to_string(index=False))


COMPETITIVE MODEL LEADERBOARD
              model  best_score                                  best_params
Logistic Regression    0.541003                {'C': 0.1, 'solver': 'lbfgs'}
      Random Forest    0.538188       {'max_depth': 20, 'n_estimators': 200}
  Gradient Boosting    0.532774 {'learning_rate': 0.01, 'n_estimators': 100}
                SVM    0.528693                    {'C': 1, 'kernel': 'rbf'}


Despite testing **Random Forest**, **Gradient Boosting**, and **SVM**, the Logistic Regression model maintained the strongest performance.

The fact that none of the more complex models outperformed the linear baseline reinforces the **high bias** diagnosis from our learning curve analysis. This suggests that the signal in the dataset is either largely linear or that the dataset is too small for complex models to find reliable non-linear patterns.
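
For reference, a leaderboard like the one above can be built by running a grid search per model and sorting by the best cross-validated score. The following is a minimal sketch on synthetic data. It is not the implementation of `run_model_comparison`, and the candidate grids are deliberately small placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# Each candidate: (estimator, small illustrative hyperparameter grid)
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "Random Forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50], "max_depth": [5, None]}),
    "SVM": (SVC(), {"C": [1.0], "kernel": ["rbf", "linear"]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # Same folds and metric for every model so the scores are comparable
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    rows.append({
        "model": name,
        "best_score": search.best_score_,
        "best_params": search.best_params_,
    })

# Highest mean CV accuracy first
leaderboard = sorted(rows, key=lambda row: row["best_score"], reverse=True)
for row in leaderboard:
    print(f"{row['model']:>20}  {row['best_score']:.4f}  {row['best_params']}")
```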

## 9. Custom Hierarchical Architectures

Heart disease is naturally hierarchical: the primary clinical question is often "Is disease present?" followed by "How severe is it?". The custom classes in `models.py` mirror this clinical logic.

### Cascaded Logistic Regression

This model uses two specialized estimators: a binary model for the healthy/diseased split and a multiclass model for levels 1–4.

Although its training accuracy (0.5697) is higher than the baseline's mean CV accuracy (0.5410), the two figures are not directly comparable, and this model is more complex. By splitting the task, the second-stage estimator "sees" fewer samples, which can lead to higher variance and slightly lower performance on unseen Kaggle data compared to the robust baseline.
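
To make the two-stage idea concrete, here is a minimal sketch of a cascaded classifier. `CascadeSketch` is a hypothetical class written for illustration; the actual `CascadedLogisticRegression` in `models.py` is not shown in this notebook and may differ. The parameter names `C_zero` and `C_multi` are borrowed from the call below, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class CascadeSketch:
    """Stage 1: healthy (0) vs. diseased (>0). Stage 2: severity 1-4."""

    def __init__(self, C_zero=0.1, C_multi=0.1):
        self.C_zero = C_zero
        self.C_multi = C_multi

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        diseased = y > 0
        # The binary model is trained on all samples
        self.binary_ = LogisticRegression(C=self.C_zero, max_iter=1000)
        self.binary_.fit(X, diseased.astype(int))
        # The severity model is trained only on the diseased samples
        self.multi_ = LogisticRegression(C=self.C_multi, max_iter=1000)
        self.multi_.fit(X[diseased], y[diseased])
        return self

    def predict(self, X):
        X = np.asarray(X)
        preds = np.zeros(len(X), dtype=int)          # default: healthy
        flagged = self.binary_.predict(X) == 1       # stage 1 says "diseased"
        if flagged.any():
            preds[flagged] = self.multi_.predict(X[flagged])  # stage 2 assigns severity
        return preds


# Synthetic 5-class data standing in for severity levels 0-4 (assumption)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=5, random_state=0
)
sketch_model = CascadeSketch().fit(X, y)
sketch_preds = sketch_model.predict(X)
sketch_acc = np.mean(sketch_preds == y)
print(f"Training accuracy of the sketch: {sketch_acc:.4f}")
```

In this sketch the second stage is fitted on a subset of the rows, which is the source of the extra variance mentioned above.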

In [42]:
cascade_model = CascadedLogisticRegression(C_zero=0.1, C_multi=0.1)
cascade_model.fit(X_train_scaled, y_train)

cascade_preds = cascade_model.predict(X_train_scaled)
cascade_acc = np.mean(cascade_preds == y_train)

print(f"Cascade Model Training Accuracy: {cascade_acc:.4f}")

Cascade Model Training Accuracy: 0.5697


### Thresholded Cascaded Model

This variation introduces a `zero_threshold` to prioritize the "Healthy" (0) class, allowing for a more conservative or aggressive clinical approach.

The increased training accuracy (0.5806, versus 0.5697 for the plain cascade) suggests the model is effectively "tuning" itself to the training distribution. However, the fact that the simpler **Median/Mode + Logistic Regression** performs better on Kaggle highlights a classic case of **overfitting**: the more complex the hierarchy and thresholding, the more the model risks capturing noise specific to the training set.
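
The effect of such a threshold can be sketched with a small decision rule. The rule below assumes that a sample is labeled healthy whenever the first stage's predicted probability of class 0 is at least `zero_threshold`; the real semantics in `models.py` may differ, and the function name and probabilities are hypothetical.

```python
import numpy as np


def thresholded_cascade_predict(p_healthy, severity_pred, zero_threshold=0.5):
    """Return 0 where P(healthy) >= zero_threshold, else the stage-2 severity.

    Assumed rule for illustration only.
    """
    p_healthy = np.asarray(p_healthy, dtype=float)
    severity_pred = np.asarray(severity_pred)
    return np.where(p_healthy >= zero_threshold, 0, severity_pred)


# Made-up stage-1 probabilities and stage-2 severity predictions for 4 patients
p_healthy = [0.90, 0.55, 0.45, 0.10]
severity_pred = [1, 2, 1, 3]

for t in (0.4, 0.5, 0.65):
    preds = thresholded_cascade_predict(p_healthy, severity_pred, t)
    print(f"zero_threshold={t}: {preds.tolist()}")
```

Under this rule, a lower threshold labels more patients as healthy, while a higher one sends more patients to the severity model. Choosing the threshold by training accuracy alone is exactly the kind of tuning that can overfit, so it is safer to select it on held-out folds.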

In [43]:
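# Keep the threshold in one variable so the printed label always matches the value used
zero_threshold = 0.4

threshold_model = ThresholdedCascadedLogisticRegression(
    C_zero=0.1,
    C_multi=0.1,
    zero_threshold=zero_threshold
)
threshold_model.fit(X_train_scaled, y_train)

# Accuracy on the training data itself (an optimistic estimate)
threshold_preds = threshold_model.predict(X_train_scaled)
threshold_acc = np.mean(threshold_preds == y_train)

print(f"Thresholded Cascade (T={zero_threshold}) Training Accuracy: {threshold_acc:.4f}")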

Thresholded Cascade (T=0.65) Training Accuracy: 0.5806


## 10. Conclusion

The extensive experimentation across 15 imputation strategies and multiple architectures reveals a clear conclusion: **Simplicity and Feature Engineering are the strongest drivers of performance for this dataset.**

While **MICE** and **KNN** showed promise in CV, the **Median/Mode** strategy proved most robust for external generalization. Although the **Thresholded Cascaded Model** is the most accurate in training, the **Standard Logistic Regression with Median/Mode Imputation** remains the best candidate for production/Kaggle because it avoids the complexity that leads to overfitting in small datasets.