# Predicting High Healthcare Costs: Risk Modeling
## Notebook 2 • Predictive Modeling

**Goal:**  
Build a classification model that predicts which members are likely to incur high annual medical charges (> $20,000) using only demographic and behavioral characteristics.

### Why This Matters
A small portion of members often drive a large share of total healthcare costs.  
Health plans can use predictive analytics to:
- Identify rising-risk members early  
- Support care management and preventive interventions  
- Improve budgeting and actuarial forecasts  
- Allocate resources more efficiently

### Modeling Approach  
This notebook evaluates two supervised classification models:

1. **Logistic Regression**  
   - Interpretable, fast, and effective.
   - Recommended when explainability matters.

2. **Random Forest**  
   - Nonlinear, handles interactions well.  
   - Recommended when prediction performance is prioritized.

Both models are trained and compared based on:
- ROC-AUC  
- Recall, Precision, F1  
- Real-world implications


**Author:** Ivy Maina

In [10]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('/content/medical_insurance_costs.csv')


In [11]:
#High-Cost Label Distribution
#Rule: Charges higher than $20,000 are high cost.

df['high_cost'] = np.where(df['charges'] > 20000, 1, 0)
df['high_cost'].value_counts()


| high_cost | count |
|-----------|-------|
| 0 | 1065 |
| 1 | 273 |


**High-Cost Label Distribution**

To prepare for modeling, members were labeled as high cost if their annual medical charges exceeded $20,000.

**High-cost population breakdown**
- **273 members (about 20.4%)** fall into the high-cost category.  
- **1,065 members (about 79.6%)** are in the low/normal cost group.

This imbalance is common in healthcare claims where a small share of members often drives a large share of total cost. It also means we’ll need to pay extra attention to metrics like recall and ROC-AUC, not just accuracy.
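To make the point about accuracy concrete, the sketch below computes class shares and the accuracy of a trivial "always predict low cost" baseline. The `toy` DataFrame and its values are hypothetical stand-ins, not the notebook's data; only the labeling rule is taken from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame; only `charges` is needed.
toy = pd.DataFrame({
    "charges": [1200.0, 4500.0, 8700.0, 15200.0, 19999.0,
                20000.0, 23500.0, 41000.0, 9800.0, 3100.0]
})

# Same rule as the notebook: strictly greater than $20,000 is high cost.
toy["high_cost"] = (toy["charges"] > 20000).astype(int)

counts = toy["high_cost"].value_counts()
shares = toy["high_cost"].value_counts(normalize=True)
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)

# A model that always predicts the majority class scores this accuracy,
# which is why accuracy alone is a weak yardstick on imbalanced labels.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Applied to the counts reported above, the same calculation gives 1,065 / 1,338, or roughly 79.6%, so any useful model needs to clear that bar by a clear margin.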


In [12]:
#Choose features + Target
X = df[['age','sex','bmi','children','smoker','region']]
y = df['high_cost']


**Feature & Target Selection**

To train a predictive model, we split the dataset into:
- **Input features (X):**  
  `age`, `sex`, `bmi`, `children`, `smoker`, `region`  
  These represent demographic and lifestyle factors that may influence healthcare spending.
  
- **Target variable (y):**  
  `high_cost` — a binary label indicating whether a member incurred **> $20,000** in charges.

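`charges` is deliberately excluded from the features: the label is derived from it, so including it would leak the answer into the model. Restricting the inputs to member-level factors means the model learns to predict high spenders from information available before medical expenses occur.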

In [13]:
#Train-test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


**Train–Test Split**

To evaluate how well our model generalizes to unseen members, the dataset was split into:

- **Training set:** 80% of observations.  
- **Test set:** 20% of observations.  
- **Stratified sampling:** ensures the high-cost class (20%) is represented proportionally in both sets.

Stratifying matters in healthcare risk modeling because the high-cost class is small. Without it, a random split could leave that class under-represented in either the training or the test set, which makes both the fitted model and its evaluation metrics less reliable.
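A quick way to see what stratification does is to split synthetic labels with the same class counts as this dataset (273 positives, 1,065 negatives) and compare the positive rate in each part. The arrays `X_demo` and `y_demo` are placeholders introduced for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook's target.
y_demo = np.array([1] * 273 + [0] * 1065)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=..., both parts keep roughly the same share of positives.
print("train positive rate:", round(float(y_tr.mean()), 4))
print("test positive rate: ", round(float(y_te.mean()), 4))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```

The test-set size and positive count produced here should line up with the support values (268 rows, 55 high-cost) shown in the classification reports.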

In [14]:
#Preprocess categorical + numeric

numeric = ['age','bmi','children']
categorical = ['sex','smoker','region']

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric),
        ('cat', OneHotEncoder(drop='first'), categorical)
    ]
)


**Data Preprocessing**

Before training any models, we prepare the features using a ColumnTransformer:

**Numeric features** (`age`, `bmi`, `children`)  
- Standardized with `StandardScaler` so each feature has zero mean and unit variance.  
- Helps algorithms (especially Logistic Regression) converge more reliably and puts coefficients on a comparable scale.

**Categorical features** (`sex`, `smoker`, `region`)  
- Converted into numerical form using OneHotEncoder.
- First category dropped to avoid multicollinearity (“dummy trap”).

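This preprocessing step ensures that numeric and categorical inputs are handled consistently across training and evaluation. Because the transformer is fitted inside each model's `Pipeline`, the scaling statistics and category encodings are learned from the training data only and then applied unchanged to the test data, which avoids leaking test-set information into the model.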


In [15]:
#Logistic Regression Model

logreg = Pipeline(steps=[
    ('prep', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])

logreg.fit(X_train, y_train)
pred_lr = logreg.predict(X_test)
proba_lr = logreg.predict_proba(X_test)[:,1]

print("ROC-AUC:", roc_auc_score(y_test, proba_lr))
print(classification_report(y_test, pred_lr))



ROC-AUC: 0.9508322663252241
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       213
           1       0.85      0.85      0.85        55

    accuracy                           0.94       268
   macro avg       0.91      0.91      0.91       268
weighted avg       0.94      0.94      0.94       268



**Logistic Regression**

The Logistic Regression model performs strongly at predicting high-cost members.

**Evaluation Metrics**
- **ROC-AUC:** 0.95

This indicates excellent ability to distinguish between high-cost and low-cost members.
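One way to read ROC-AUC is as a ranking score: it is the fraction of (high-cost, low-cost) pairs in which the high-cost member receives the higher predicted score. The small check below uses four made-up scores, not the notebook's predictions, to show that this pairwise count agrees with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count pairs where the positive outranks the negative (ties count as half).
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))
sklearn_auc = roc_auc_score(y_true, scores)

print(pairwise_auc, sklearn_auc)
```

Read this way, an ROC-AUC of 0.95 says the model ranks a randomly chosen high-cost member above a randomly chosen low-cost member about 95% of the time.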

**Classification Performance**
| Metric | High Cost (1) | Low Cost (0) |
|--------|----------------|---------------|
| Precision | 0.85 | 0.96 |
| Recall    | 0.85 | 0.96 |
| F1-score  | 0.85 | 0.96 |

**Overall Accuracy:** 94%

**Key takeaway:**

The model correctly identifies 85% of high-cost members (47 of the 55 in the test set). That is a strong recall score for healthcare risk prediction, where missing high-cost members can lead to major spend surprises.  

This suggests that Logistic Regression is already capturing meaningful signals from member demographics and lifestyle factors.
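If recall matters more than precision, one common lever is the decision threshold applied to the predicted probabilities. The sketch below uses synthetic data from `make_classification` (an assumption, not the insurance dataset) and an arbitrary alternative cut-off of 0.3 to show the direction of the trade-off.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 20% positives) standing in for real members.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Compare the default 0.5 cut-off with a lower, illustrative one.
results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision. In this notebook the same idea could be applied to `proba_lr`, with the cut-off chosen from the relative cost of a missed high-cost member versus an unnecessary outreach.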


In [16]:
# Random Forest Model

rf = Pipeline(steps=[
    ('prep', preprocess),
    ('model', RandomForestClassifier(n_estimators=300, random_state=42))
])

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)
proba_rf = rf.predict_proba(X_test)[:,1]

print("ROC-AUC:", roc_auc_score(y_test, proba_rf))
print(classification_report(y_test, pred_rf))


ROC-AUC: 0.9200597524541186
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       213
           1       1.00      0.82      0.90        55

    accuracy                           0.96       268
   macro avg       0.98      0.91      0.94       268
weighted avg       0.96      0.96      0.96       268



**Random Forest**

The Random Forest model trades a little recall for perfect precision on this test set: it raises no false alarms, but misses slightly more high-cost members than Logistic Regression.

**Evaluation Metrics**
- **ROC-AUC:** 0.92  
Slightly lower than Logistic Regression (0.95), but still strong separation between the two groups.

**Classification Performance**
| Metric | High Cost (1) | Low Cost (0) |
|--------|----------------|---------------|
| Precision | **1.00** | 0.96 |
| Recall    | 0.82 | **1.00** |
| F1-score  | 0.90 | 0.98 |

**Overall Accuracy:** 96%

**Key takeaway:**  
Every member the Random Forest flagged as high-cost in the test set was truly high-cost (precision = 1.00), so it produced no false positives.  
However, it missed about 18% of true high-cost members (recall = 0.82), which risks underestimating cost exposure.

This makes Random Forest appealing for use cases where false positives are costly, while Logistic Regression, with its slightly higher recall (0.85), may be preferable where capturing as many high-cost members as possible is the priority.
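The precision and recall figures are easier to reason about as raw counts. The notebook does not print a confusion matrix, so the counts below are back-calculated from the rounded Random Forest report (213 low-cost and 55 high-cost test members, precision 1.00, recall 0.82) and should be read as an inferred illustration rather than the model's logged output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts inferred from the rounded report (an assumption, not printed output):
# no false positives, and 45 of 55 high-cost members caught.
tn, fp, fn, tp = 213, 0, 10, 45

# Rebuild label vectors that produce exactly those counts.
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
accuracy = (tp + tn) / len(y_true_demo)

print(cm)  # rows are true classes, columns are predicted classes
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

In the notebook itself, `confusion_matrix(y_test, pred_rf)` and `confusion_matrix(y_test, pred_lr)` would give the exact counts for each model.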


In [17]:
#Feature Importance (RF)

# Get encoded feature names
ohe = rf.named_steps['prep'].named_transformers_['cat']
feature_names = numeric + list(ohe.get_feature_names_out(categorical))

importances = rf.named_steps['model'].feature_importances_

fi = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

fi


| | feature | importance |
|---|---------|------------|
| 4 | smoker_yes | 0.473064 |
| 1 | bmi | 0.242266 |
| 0 | age | 0.174160 |
| 2 | children | 0.054231 |
| 3 | sex_male | 0.016979 |
| 6 | region_southeast | 0.014482 |
| 5 | region_northwest | 0.013353 |
| 7 | region_southwest | 0.011464 |


**Feature Importance - Random Forest**

To understand what drives high medical spending, we examined feature importance scores from the Random Forest model.

**Top Predictors of High Cost**
1. **Smoker status** - 0.47 importance
   - Strongest signal by far.
   - Accounts for nearly half of the Random Forest's total impurity-based importance.
   - Aligns with real-world risk, where tobacco use is associated with substantially higher claims.

2. **BMI** - 0.24 importance
   - Higher BMI likely increases risk for chronic conditions, hospitalizations, and procedures.

3. **Age** - 0.17 importance
   - Costs tend to rise with age due to higher utilization, comorbidities, and preventive screenings.

4. **Children** - 0.05 importance
   - Minor impact, but may reflect additional covered dependents.

5. **Sex & Region** - less than 2% per encoded column
   - These variables matter less for predicting extreme cost in this dataset.

**Key Insight:**  
Lifestyle and physiological factors (**smoking + BMI**) are far more predictive of high-cost claims than geography or sex.

This suggests health plans can:
- Target smoking cessation programs
- Flag obesity risk early
- Predict high spenders using minimal demographic inputs
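Because `region` is split across several one-hot columns, its importance is spread over multiple rows of the table. The sketch below sums the reported importances back to their source features; the values are copied from the output above, and `source_feature` is a small helper introduced here for illustration.

```python
import pandas as pd

# Importances copied from the Random Forest output above.
fi_reported = pd.DataFrame({
    "feature": ["smoker_yes", "bmi", "age", "children", "sex_male",
                "region_southeast", "region_northwest", "region_southwest"],
    "importance": [0.473064, 0.242266, 0.17416, 0.054231, 0.016979,
                   0.014482, 0.013353, 0.011464],
})

def source_feature(name, categorical=("sex", "smoker", "region")):
    """Map an encoded column such as 'region_southeast' back to 'region'."""
    for col in categorical:
        if name.startswith(col + "_"):
            return col
    return name

grouped = (
    fi_reported
    .assign(source=fi_reported["feature"].map(source_feature))
    .groupby("source")["importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped.round(3))

# Combined share of the three strongest source features.
top3_share = grouped.head(3).sum()
print(f"Top three features: {top3_share:.1%} of total importance")
```

Grouped this way, `region` as a whole contributes roughly 4%, which is still small but larger than any single region column suggests. Impurity-based importances describe how the forest used each feature, not a causal effect on cost.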


### Model Evaluation Summary

Both models performed exceptionally well in predicting whether a member will incur over $20,000 in annual medical charges, but each excels in a different way.

**Logistic Regression**
- ROC-AUC: 0.95  
- Recall (High Cost): 0.85  
- Precision (High Cost): 0.85  
- Accuracy: 94%

- Best when the goal is to capture as many high-cost members as possible, for example to support early intervention.

**Random Forest**
- ROC-AUC: 0.92  
- Recall (High Cost): 0.82  
- Precision (High Cost): 1.00  
- Accuracy: 96%

- Best when false positives are costly: it flagged no low-cost members as high-cost in the test set, but missed about 18% of true high-cost members.

---

**Key Predictors of High Medical Spending**

Feature importance analysis confirms that:
1. **Smoker status** (largest driver, about 47% of the Random Forest's feature importance)
2. **BMI**
3. **Age**

Together, these three features account for about 89% of the Random Forest's total feature importance.

---

**Business Takeaways**

- Smoking cessation and weight management programs could meaningfully reduce high claims.
- Health plans can risk-stratify members using simple demographic fields, even before full claims history is available.
- Predictive modeling supports proactive outreach, budgeting, and care management planning.
- Different model types support different operational use cases:
  - Logistic Regression: maximize high-risk capture  
  - Random Forest: minimize false alarms

---

**Final Conclusion**

This analysis shows that a small set of member-level features can predict high-cost individuals well on held-out data (ROC-AUC of 0.92 to 0.95).  
Such models can help payers, providers, and care management teams act earlier, enabling better health outcomes and more sustainable spending.
