# Logistic Regression Model Analysis Experiments

This notebook demonstrates various techniques for analyzing and understanding our logistic regression model:

1. Feature Importance Analysis
2. Ablation Studies (Group and Individual)
3. Sampling Strategy Comparison
4. Sensitivity Analysis
5. Model Recommendations and Retraining

These techniques help us gain insights into how our model works, identify critical features, understand the impact of different hyperparameters, and ultimately improve model performance.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

from logistic_regression_model import LogisticRegressionModel

# Set up the notebook for better visuals
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_palette('viridis')

In [2]:
numeric_features = [
    'temperature', 'heartrate', 'resprate', 'o2sat', 'sbp', 'dbp',
    'pain', 'shock_index', 'sirs', 'anchor_age', 'acuity',
    'cc_entropy', 'cc_lexical_complexity', 'cc_pos_complexity',
    'cc_med_entity_count', 'cc_length', 'cc_word_count'
]
categorical_features = [
    'hr_category', 'resp_category', 'pulse_ox_category', 'sbp_category',
    'temp_category', 'dbp_category', 'pain_category', 'day_shift',
    'age_category', 'gender', 'arrival_transport'
]

In [None]:
model_path = './models/lr_model_20250301_183846.pkl'  # Example path, use actual path from save_info
preprocessor_path = './models/lr_model_preprocessor_20250301_183846.pkl'  # Example path

# Load the model
model = LogisticRegressionModel.load(model_path, preprocessor_path)

In [4]:
features = model.get_feature_importance()

In [None]:
# Plot feature importance using seaborn
plt.figure(figsize=(12, 15))
sns.barplot(data=features, x='importance', y='feature')
plt.title('Feature Importance')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()


The most important features influencing the model's predictions include patient age (anchor_age), fever status (temp_category_febrile), acuity level (acuity), low systolic blood pressure (sbp_category_low), and arrival by ambulance (arrival_transport_AMBULANCE). Older age and febrile status strongly decrease the predicted outcome (negative coefficients), while higher acuity and low systolic pressure notably increase it.

Moderate importance is observed in the absence of pain (pain_category_no pain), other arrival transport methods, complexity metrics of chief complaint (cc_length, cc_word_count, cc_med_entity_count), and senior age category.

Vital signs such as heart rate (heartrate), oxygen saturation (o2sat), and respiratory rate (resprate) show mild to moderate influence, suggesting these features are supportive rather than primary indicators.

Least influential features include mild variations in blood pressure (dbp), pain categories, gender, shift timing (day_shift), and several categorical subdivisions of vitals. Many features such as hr_category_normal, various arrival methods (HELICOPTER, UNKNOWN), and certain age groups (young_adult) have no measurable impact, indicating minimal predictive relevance.

Overall, age, fever status, acuity, and initial vital sign deviations play the primary role in shaping the model's decision-making, while textual complexity metrics and gender have lesser but still meaningful effects.
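The notebook does not show how `get_feature_importance()` computes its scores. A common approach for logistic regression, sketched below on synthetic data, is to rank features by the absolute value of coefficients fitted on standardized inputs. The function name `coefficient_importance` and the demo column names are hypothetical and are not part of `LogisticRegressionModel`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def coefficient_importance(X, y, feature_names, random_state=42):
    """Rank features by |coefficient| of a logistic regression on standardized inputs.

    Hypothetical helper for illustration; the real model class may differ.
    """
    # Standardize so coefficient magnitudes are comparable across features
    X_scaled = StandardScaler().fit_transform(X)
    clf = LogisticRegression(max_iter=1000, random_state=random_state)
    clf.fit(X_scaled, y)
    coef = clf.coef_.ravel()
    table = pd.DataFrame({
        "feature": feature_names,
        "coefficient": coef,          # sign gives the direction of the effect
        "importance": np.abs(coef),   # magnitude is used for ranking
    })
    return table.sort_values("importance", ascending=False).reset_index(drop=True)


# Synthetic data in which only the first column drives the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)

importance_demo = coefficient_importance(
    X_demo, y_demo, ["signal", "noise_a", "noise_b"]
)
print(importance_demo)
```

Keeping the signed coefficient next to its magnitude makes it possible to read both the ranking and the direction of each effect from one table.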

In [6]:
numeric_features = [
    'temperature', 'heartrate', 'resprate', 'o2sat', 'sbp', 'dbp',
    'pain', 'shock_index', 'sirs', 'anchor_age', 'acuity',
    'cc_entropy', 'cc_lexical_complexity', 'cc_pos_complexity',
    'cc_med_entity_count', 'cc_length', 'cc_word_count'
]
categorical_features = [
    'hr_category', 'resp_category', 'pulse_ox_category', 'sbp_category',
    'temp_category', 'dbp_category', 'pain_category', 'day_shift',
    'age_category', 'gender', 'arrival_transport'
]

model = LogisticRegressionModel(
    numeric_features=numeric_features,
    categorical_features=categorical_features,
    target_column='disposition',
    random_state=42
)

# Train model with hyperparameter tuning
model.train(
    data_file='data/train.csv',
    # num_rows=10000,  # Optional: limit number of rows for faster training
    resample_flag=False,  # Resampling for class imbalance is disabled
    tune_model=True,
    test_size=0.1
)

Best parameters found: {'C': 0.1, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best cross-validation accuracy: 0.7398
Evaluating model performance...


# Logistic Regression Evaluation Results

## Metrics

| Metric                    | Score    |
|---------------------------|----------|
| Validation Accuracy       | 0.7402 |
| Validation F1 Score (weighted) | 0.7353 |
| Validation Precision (weighted) | 0.7365 |
| Validation Recall (class 1) | 0.8376 |
| Binary Precision (class 1) | 0.7595 |
| Binary F1 Score (class 1) | 0.7967 |
| ROC-AUC Score             | 0.8025 |

## Detailed Classification Report

```
              precision    recall  f1-score   support

           0       0.70      0.59      0.64     16100
           1       0.76      0.84      0.80     24933

    accuracy                           0.74     41033
   macro avg       0.73      0.71      0.72     41033
weighted avg       0.74      0.74      0.74     41033

```

## Sample of Predictions in Original Labels

`['HOME' 'HOME' 'HOME' 'ADMITTED' 'ADMITTED'] ...`


<logistic_regression_model.LogisticRegressionModel at 0x2a9e67065d0>

In [7]:
model.perform_ablation_study()

Testing ablation of feature group: numeric
Performance change after removing numeric: accuracy_change=-0.0277, f1_change=-0.0168
Testing ablation of feature group: categorical
Performance change after removing categorical: accuracy_change=-0.0062, f1_change=-0.0032


|   | features_removed | accuracy | precision | recall | f1 | roc_auc | accuracy_change | precision_change | recall_change | f1_change | roc_auc_change |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None (Baseline) | 0.740185 | 0.75951 | 0.837645 | 0.796666 | 0.802508 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | numeric | 0.712451 | 0.729119 | 0.838166 | 0.779849 | 0.755015 | -0.027734 | -0.030391 | 0.000521 | -0.016817 | -0.047493 |
| 2 | categorical | 0.73397 | 0.75112 | 0.840773 | 0.793422 | 0.793856 | -0.006215 | -0.00839 | 0.003128 | -0.003244 | -0.008653 |


In [8]:
model.perform_individual_feature_ablation()

Testing ablation of feature: arrival_transport_HELICOPTER
Testing ablation of feature: anchor_age
Testing ablation of feature: acuity
Testing ablation of feature: temp_category_febrile
Testing ablation of feature: arrival_transport_WALK IN
Testing ablation of feature: cc_word_count
Testing ablation of feature: pulse_ox_category_low
Testing ablation of feature: cc_length
Testing ablation of feature: cc_med_entity_count
Testing ablation of feature: age_category_senior
Testing ablation of feature: cc_entropy
Testing ablation of feature: pulse_ox_category_normal
Testing ablation of feature: resp_category_high
Testing ablation of feature: sbp_category_low
Testing ablation of feature: gender_F
Testing ablation of feature: arrival_transport_AMBULANCE
Testing ablation of feature: pain_category_no pain
Testing ablation of feature: dbp_category_normal
Testing ablation of feature: heartrate
Testing ablation of feature: sbp
Testing ablation of feature: hr_category_tachycardic
Testing ablation of f

|   | feature_removed | feature_group | accuracy | precision | recall | f1 | roc_auc | accuracy_change | precision_change | recall_change | f1_change | roc_auc_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None (Baseline) | baseline | 0.740185 | 0.75951 | 0.837645 | 0.796666 | 0.802508 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | acuity | numeric | 0.720274 | 0.740948 | 0.829744 | 0.782836 | 0.774939 | -0.019911 | -0.018562 | -0.007901 | -0.01383 | -0.027569 |
| 2 | anchor_age | numeric | 0.736992 | 0.755207 | 0.839169 | 0.794977 | 0.799785 | -0.003193 | -0.004303 | 0.001524 | -0.001689 | -0.002723 |
| 9 | cc_med_entity_count | numeric | 0.739137 | 0.757613 | 0.839169 | 0.796308 | 0.800189 | -0.001048 | -0.001897 | 0.001524 | -0.000358 | -0.002319 |
| 8 | cc_length | numeric | 0.739307 | 0.758179 | 0.838367 | 0.796259 | 0.801751 | -0.000877 | -0.001331 | 0.000722 | -0.000407 | -0.000758 |
| 23 | o2sat | numeric | 0.739064 | 0.757998 | 0.838166 | 0.796069 | 0.802088 | -0.001121 | -0.001512 | 0.000521 | -0.000597 | -0.00042 |
| 6 | cc_word_count | numeric | 0.739186 | 0.758023 | 0.838407 | 0.796191 | 0.802286 | -0.000999 | -0.001487 | 0.000762 | -0.000475 | -0.000222 |
| 24 | resprate | numeric | 0.739112 | 0.758015 | 0.838247 | 0.796115 | 0.802442 | -0.001072 | -0.001494 | 0.000602 | -0.000551 | -6.6e-05 |
| 20 | sbp | numeric | 0.739186 | 0.758192 | 0.838046 | 0.796121 | 0.802642 | -0.000999 | -0.001318 | 0.000401 | -0.000545 | 0.000134 |
| 19 | heartrate | numeric | 0.739917 | 0.758361 | 0.83945 | 0.796848 | 0.802664 | -0.000268 | -0.001149 | 0.001805 | 0.000182 | 0.000156 |


The ablation study shows how removing features changes performance relative to the baseline model (accuracy 74.02%, ROC-AUC 0.8025). For the rows shown in the output above:

Removing acuity causes by far the largest degradation: accuracy falls by 1.99 percentage points, F1 by 0.0138, and ROC-AUC by 0.0276. It is also the only removal shown that lowers recall (-0.0079).

Removing anchor_age has the second-largest effect (accuracy -0.32 points, ROC-AUC -0.0027), followed by cc_med_entity_count (accuracy -0.10 points, ROC-AUC -0.0023).

The remaining features shown (cc_length, o2sat, cc_word_count, resprate, sbp, heartrate) each change accuracy by roughly 0.1 percentage points or less when removed. Their removal slightly lowers precision (by 0.0011 to 0.0015) and slightly raises recall (by 0.0004 to 0.0018), so the precision/recall balance shifts only marginally.

ROC-AUC is essentially unchanged for these features. Removing sbp or heartrate raises it by less than 0.0002, which is too small to treat as a real improvement. At the group level, removing all numeric features costs 2.77 points of accuracy and 0.0475 ROC-AUC, compared with 0.62 points and 0.0087 for the categorical group.

Overall, acuity is the single most valuable feature, anchor_age and cc_med_entity_count contribute modestly, and the other features shown are individually close to redundant. None of the removals shown yields a meaningful improvement, so these results alone do not justify dropping features to improve performance.
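The internals of `perform_ablation_study()` are not shown in this notebook. As a rough sketch of the general technique, the code below refits a plain scikit-learn logistic regression with each feature group removed and reports the change against a baseline fit on the same split. The helper names (`fit_and_score`, `ablate_groups`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_train, X_test, y_train, y_test):
    """Fit a logistic regression and return accuracy and ROC-AUC on the test split."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, clf.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, proba),
    }


def ablate_groups(X, y, groups, random_state=42):
    """groups maps a group name to the list of column indices to remove."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    baseline = fit_and_score(X_train, X_test, y_train, y_test)
    results = {"baseline": {**baseline, "accuracy_change": 0.0, "roc_auc_change": 0.0}}
    for name, cols in groups.items():
        dropped = set(cols)
        keep = [i for i in range(X.shape[1]) if i not in dropped]
        scores = fit_and_score(X_train[:, keep], X_test[:, keep], y_train, y_test)
        results[name] = {
            **scores,
            # Negative change means the removed group was helping the model
            "accuracy_change": scores["accuracy"] - baseline["accuracy"],
            "roc_auc_change": scores["roc_auc"] - baseline["roc_auc"],
        }
    return results


# Synthetic data: column 0 carries most of the signal, column 2 a little
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 4))
y_demo = (2.0 * X_demo[:, 0] + 0.2 * X_demo[:, 2] + rng.normal(size=3000) > 0).astype(int)

ablation_demo = ablate_groups(X_demo, y_demo, {"strong": [0, 1], "weak": [2, 3]})
for name, row in ablation_demo.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

Reusing one train/test split for every variant keeps the comparison paired, so the reported changes reflect the removed features rather than split-to-split variation.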

In [9]:
model.compare_sampling_strategies()


Evaluating none sampling...

None Sampling Results:
accuracy: 0.740 (+/- 0.001)
precision: 0.758 (+/- 0.001)
recall: 0.840 (+/- 0.001)
f1: 0.797 (+/- 0.001)
roc_auc: 0.802 (+/- 0.001)

Evaluating smote sampling...

Smote Sampling Results:
accuracy: 0.726 (+/- 0.001)
precision: 0.798 (+/- 0.001)
recall: 0.736 (+/- 0.002)
f1: 0.766 (+/- 0.001)
roc_auc: 0.802 (+/- 0.001)

Evaluating over sampling...

Over Sampling Results:
accuracy: 0.726 (+/- 0.001)
precision: 0.798 (+/- 0.000)
recall: 0.736 (+/- 0.002)
f1: 0.766 (+/- 0.001)
roc_auc: 0.802 (+/- 0.001)

Evaluating under sampling...

Under Sampling Results:
accuracy: 0.726 (+/- 0.001)
precision: 0.798 (+/- 0.000)
recall: 0.736 (+/- 0.002)
f1: 0.766 (+/- 0.001)
roc_auc: 0.802 (+/- 0.001)

Best strategy: none
Best strategy metrics:
accuracy: 0.740 (+/- 0.001)
precision: 0.758 (+/- 0.001)
recall: 0.840 (+/- 0.001)
f1: 0.797 (+/- 0.001)
roc_auc: 0.802 (+/- 0.001)


(defaultdict(dict,
             {'none': {'accuracy': {'test_mean': np.float64(0.7396647064491794),
                'test_std': np.float64(0.0008756539934570752),
                'train_mean': np.float64(0.7399246667380698),
                'train_std': np.float64(0.00010889719192517066)},
               'precision': {'test_mean': np.float64(0.7577449477915361),
                'test_std': np.float64(0.0009204977292850792),
                'train_mean': np.float64(0.7579734486973658),
                'train_std': np.float64(0.00014030752771682857)},
               'recall': {'test_mean': np.float64(0.840093765203164),
                'test_std': np.float64(0.001158464546867503),
                'train_mean': np.float64(0.8402285968635768),
                'train_std': np.float64(0.00022057065029653058)},
               'f1': {'test_mean': np.float64(0.796796544923817),
                'test_std': np.float64(0.0006656355879447383),
                'train_mean': np.float64(0.796984266798

The experiments evaluated four sampling strategies (none, SMOTE, over-sampling, under-sampling) for handling class imbalance. ROC-AUC was 0.802 for all four, so resampling did not change how well the model ranks cases. What changed was the operating point: the three resampling strategies produced identical results to three decimals (accuracy 0.726, precision 0.798, recall 0.736, F1 0.766), trading recall for precision relative to no sampling (accuracy 0.740, precision 0.758, recall 0.840, F1 0.797). No sampling was selected as the best strategy. The class imbalance here is mild (about 61% of validation samples belong to class 1), which is consistent with resampling offering no gain. Because three different samplers rarely agree this closely, and because the figures also match the class_weight='balanced' row of the sensitivity analysis, it may be worth verifying that each sampler is applied as intended before drawing further conclusions.
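How `compare_sampling_strategies()` is implemented is not shown here. The sketch below illustrates the general idea under stated assumptions: it compares no sampling against simple random over-sampling of the minority class, resampling only the training portion of each cross-validation fold so that test folds keep the original class balance. It uses `sklearn.utils.resample` rather than imbalanced-learn to stay dependency-light, and the helper names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Duplicate random minority-class rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    if n_extra == 0:
        return X, y
    X_min, y_min = X[y == minority], y[y == minority]
    X_extra, y_extra = resample(
        X_min, y_min, replace=True, n_samples=n_extra, random_state=random_state
    )
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])


def compare_sampling(X, y, n_splits=5, random_state=0):
    """Return mean (precision, recall) for class 1 with and without over-sampling."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = {"none": [], "over": []}
    for train_idx, test_idx in cv.split(X, y):
        for name in scores:
            X_tr, y_tr = X[train_idx], y[train_idx]
            if name == "over":
                # Resample the training fold only; the test fold is untouched
                X_tr, y_tr = oversample_minority(X_tr, y_tr, random_state)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            pred = clf.predict(X[test_idx])
            scores[name].append(
                (precision_score(y[test_idx], pred), recall_score(y[test_idx], pred))
            )
    return {name: np.mean(vals, axis=0) for name, vals in scores.items()}


# Synthetic data with class 1 as the majority (roughly 70%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4000, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=4000) > -0.8).astype(int)

sampling_demo = compare_sampling(X_demo, y_demo)
for name, (precision, recall) in sampling_demo.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

When class 1 is the majority, balancing the training data pushes the decision boundary toward class 0, which typically raises class-1 precision and lowers class-1 recall. That is the same direction of change seen in the results above.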

In [10]:
model.perform_sensitivity_analysis()

Testing parameter: C with values: [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
Testing parameter: penalty with values: ['l1', 'l2']
Testing parameter: solver with values: ['liblinear', 'lbfgs', 'saga']
Error with solver=lbfgs: Solver lbfgs supports only 'l2' or None penalties, got l1 penalty.
Testing parameter: class_weight with values: [None, 'balanced']


|   | parameter | value | accuracy | precision | recall | f1 | roc_auc | accuracy_change | precision_change | recall_change | f1_change | roc_auc_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | baseline | current | 0.739892 | 0.7579 | 0.840308 | 0.796979 | 0.801903 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | class_weight | balanced | 0.726699 | 0.798004 | 0.73663 | 0.76609 | 0.802127 | -0.013193 | 0.040104 | -0.103677 | -0.030889 | 0.0002236093 |
| 3 | C | 1.0 | 0.739908 | 0.757873 | 0.840406 | 0.797008 | 0.801904 | 1.6e-05 | -2.7e-05 | 9.8e-05 | 2.9e-05 | 2.96373e-07 |
| 5 | C | 100.0 | 0.739884 | 0.757856 | 0.840383 | 0.796989 | 0.801903 | -8e-06 | -4.4e-05 | 7.6e-05 | 1e-05 | -3.832214e-07 |
| 4 | C | 10.0 | 0.739884 | 0.757858 | 0.840379 | 0.796988 | 0.801903 | -8e-06 | -4.2e-05 | 7.1e-05 | 9e-06 | -3.958611e-07 |
| 6 | penalty | l2 | 0.739911 | 0.757892 | 0.84037 | 0.797003 | 0.801902 | 1.9e-05 | -8e-06 | 6.2e-05 | 2.4e-05 | -1.400122e-06 |
| 7 | solver | saga | 0.7398 | 0.757977 | 0.83992 | 0.796847 | 0.801838 | -9.2e-05 | 7.7e-05 | -0.000388 | -0.000132 | -6.484391e-05 |
| 2 | C | 0.01 | 0.739643 | 0.757748 | 0.840031 | 0.796771 | 0.801764 | -0.000249 | -0.000152 | -0.000276 | -0.000208 | -0.0001388678 |
| 1 | C | 0.001 | 0.738695 | 0.755905 | 0.841721 | 0.796508 | 0.800698 | -0.001197 | -0.001995 | 0.001413 | -0.000471 | -0.001205543 |
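The output above follows a one-at-a-time pattern: each hyperparameter is varied while the others stay at their baseline values, and invalid combinations are logged rather than aborting the run. A minimal sketch of that pattern is shown below, assuming plain scikit-learn and synthetic data. The function name `sensitivity` and the grid are illustrative, not the implementation behind `perform_sensitivity_analysis()`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def sensitivity(X, y, base_params, grid, cv=3):
    """Vary one hyperparameter at a time and report accuracy change vs. baseline."""
    baseline = cross_val_score(
        LogisticRegression(**base_params), X, y, cv=cv, scoring="accuracy"
    ).mean()
    rows = [{"parameter": "baseline", "value": "current",
             "accuracy": baseline, "accuracy_change": 0.0}]
    for param, values in grid.items():
        for value in values:
            params = {**base_params, param: value}
            try:
                score = cross_val_score(
                    LogisticRegression(**params), X, y, cv=cv,
                    scoring="accuracy", error_score="raise",
                ).mean()
            except ValueError as exc:
                # Incompatible settings (for example a solver/penalty mismatch)
                # are recorded and skipped instead of stopping the analysis
                rows.append({"parameter": param, "value": value, "error": str(exc)})
                continue
            rows.append({"parameter": param, "value": value,
                         "accuracy": score, "accuracy_change": score - baseline})
    return rows


X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
sensitivity_demo = sensitivity(
    X_demo, y_demo,
    base_params={"C": 0.1, "max_iter": 1000},
    grid={"C": [0.001, 1.0, 100.0], "class_weight": [None, "balanced"]},
)
for row in sensitivity_demo:
    print(row)
```

One-at-a-time analysis is cheap and easy to read, but it cannot reveal interactions between parameters; a grid search over combinations is needed for that.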


In [11]:
recommendations = {
    'model_params': {
        'C': 100.0,                 # Within 1e-5 of baseline accuracy; C has little effect here
        'penalty': 'l2',            # Equivalent performance to l1 in the sensitivity analysis
        'solver': 'liblinear',      # Best-performing solver in sensitivity analysis
        'class_weight': None        # Balanced weights lowered accuracy and F1
    },
    'sampling_strategy': 'none',    # Resampling lowered accuracy and F1; no sampling performed best
    'features_to_keep': [           # Select top-performing features based on importance and ablation:
        'anchor_age',
        'temp_category_febrile',
        'acuity',
        'sbp_category_low',
        'arrival_transport_AMBULANCE',
        'pain_category_no pain',
        'arrival_transport_OTHER',
        'cc_length',
        'cc_word_count',
        'age_category_senior',
        'dbp_category_low',
        'pain',
        'pain_category_severe',
        'sbp',
        'heartrate',
        'gender_F'
    ]
}


In [12]:
recommendations = {
    'model_params': {
        'C': 100.0,                 # Within 1e-5 of baseline accuracy; C has little effect here
        'penalty': 'l2',            # Stable and effective penalty
        'solver': 'liblinear',      # Best-performing solver tested
        'class_weight': None        # No benefit from balanced weights
    },
    'sampling_strategy': 'none',    # Resampling lowered accuracy and F1; no sampling performed best
    # Features left out of the list below:
    # 'cc_med_entity_count', 'anchor_age', 'o2sat', 'temperature', 'cc_length', 'acuity'
    # Caution: in the ablation table above, removing acuity or anchor_age lowered accuracy and ROC-AUC.
    'features_to_keep': [
        'temp_category_febrile',
        'sbp_category_low',
        'arrival_transport_AMBULANCE',
        'pain_category_no pain',
        'arrival_transport_OTHER',
        'cc_word_count',
        'age_category_senior',
        'dbp_category_low',
        'pain',
        'pain_category_severe',
        'sbp',
        'heartrate',
        'gender_F',
        'hr_category_tachycardic',
        'resp_category_normal',
        'resprate',
        'hr_category_bradycardic',
        'pulse_ox_category_normal',
        'sirs',
        'age_category_adult',
        'pain_category_moderate',
        'age_category_middle_aged',
        'sbp_category_high',
        'shock_index',
        'temp_category_normal',
        'arrival_transport_WALK IN',
        'dbp_category_normal',
        'gender_M',
        'day_shift_False',
        'pulse_ox_category_low',
        'cc_lexical_complexity',
        'cc_entropy',
        'cc_pos_complexity',
        'dbp'
    ]
    # Features with zero importance are excluded
}


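### Explanation of Recommendations:

Model Parameters:
In scikit-learn, C is the inverse of regularization strength, so setting C to 100 weakens regularization. In the sensitivity analysis, every C value from 0.01 to 100 stayed within about 0.0003 of baseline accuracy, so this choice has no practical effect on performance. The l2 penalty with the liblinear solver matched baseline performance and ran without errors. Setting class_weight to 'balanced' lowered accuracy by 0.013 and recall by 0.104 while raising precision by 0.040, so it is left at None.

Sampling Strategy:
None of the tested sampling strategies (SMOTE, over-sampling, under-sampling) improved on the unsampled baseline; each lowered accuracy from 0.740 to 0.726 and F1 from 0.797 to 0.766. The model is therefore trained without resampling.

Feature Selection:
The feature list was chosen from the feature importance and ablation results with the aim of reducing model complexity. Note that the final list omits acuity and anchor_age, the two features whose individual removal cost the most accuracy in the ablation table, so some loss of performance after retraining is to be expected.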
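Because C appears in the recommendations, it helps to see its direction of effect. In scikit-learn's `LogisticRegression`, C is the inverse of regularization strength, so larger C allows larger coefficients. The sketch below, on synthetic data that is an assumption for illustration only, fits an l2-penalized model at three C values and compares the coefficient norms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, non-separable data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

coef_norms = {}
for C in [0.0001, 0.01, 100.0]:
    clf = LogisticRegression(C=C, max_iter=2000)  # default l2 penalty
    clf.fit(X_demo, y_demo)
    # Smaller C means stronger shrinkage, so the coefficient norm is smaller
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:g}: ||coef|| = {norm:.4f}")
```

Once C is large enough that the penalty is negligible compared with the data term, further increases change the fit very little, which matches the flat accuracy seen for C between 1 and 100 in the sensitivity analysis.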

In [13]:
model.retrain_with_recommendations(recommendations)

Retraining model with recommendations...
Selected 34 features.
Using model parameters: {'C': 100.0, 'penalty': 'l2', 'solver': 'liblinear', 'class_weight': None}
Evaluating retrained model performance...


# Retrained Logistic Regression Evaluation Results

## Metrics

| Metric                    | Score    |
|---------------------------|----------|
| Validation Accuracy       | 0.7130 |
| Validation F1 Score (weighted) | 0.7056 |
| Validation Precision (weighted) | 0.7073 |
| Validation Recall (class 1) | 0.8291 |
| Binary Precision (class 1) | 0.7358 |
| Binary F1 Score (class 1) | 0.7797 |
| ROC-AUC Score             | 0.7570 |

## Detailed Classification Report

```
              precision    recall  f1-score   support

           0       0.66      0.53      0.59     15900
           1       0.74      0.83      0.78     25133

    accuracy                           0.71     41033
   macro avg       0.70      0.68      0.68     41033
weighted avg       0.71      0.71      0.71     41033

```

## Sample of Predictions in Original Labels

`['HOME' 'ADMITTED' 'HOME' 'HOME' 'HOME'] ...`


<logistic_regression_model.LogisticRegressionModel at 0x2a9e67065d0>
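The retrained metrics can be compared directly with the baseline evaluation reported earlier in this notebook. The short sketch below copies the class-1 and overall figures from the two metrics tables and computes the per-metric differences; the dictionary names are chosen here for illustration.

```python
# Values copied from the two evaluation tables in this notebook
baseline = {"accuracy": 0.7402, "precision": 0.7595, "recall": 0.8376,
            "f1": 0.7967, "roc_auc": 0.8025}
retrained = {"accuracy": 0.7130, "precision": 0.7358, "recall": 0.8291,
             "f1": 0.7797, "roc_auc": 0.7570}

# Negative delta means the retrained model scores lower than the baseline
deltas = {metric: round(retrained[metric] - baseline[metric], 4) for metric in baseline}

for metric, delta in deltas.items():
    print(f"{metric:>9}: {baseline[metric]:.4f} -> {retrained[metric]:.4f} ({delta:+.4f})")
```

Every metric is lower after retraining, with ROC-AUC dropping the most. The accuracy and ROC-AUC drops are close to those recorded when the numeric feature group was removed in the group ablation.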

In [14]:
# Save the trained model
save_info = model.save(save_dir='models', model_name='lr_model')
print(f"Model saved with timestamp: {save_info['timestamp']}")

Model saved to models\lr_model_20250302_013413.pkl
Preprocessor and features saved to models\lr_model_preprocessor_20250302_013413.pkl
Metadata saved to models\lr_model_metadata_20250302_013413.json
Model saved with timestamp: 20250302_013413
