This project explores the Medical Cost Personal Dataset through the lenses of both Regression and Classification. The goal is to build a robust system that can not only predict the specific dollar amount of medical insurance charges but also categorize individuals into "cost brackets" with high confidence.

### Table of contents

| Section | Phase | Technical Highlights |
| :--- | :--- | :--- |
| **I** | [**Data Preparation**](#prep) | Train/Test Split (80/20), Standardization, & One-Hot Encoding. |
| **II** | [**Probabilistic Regression**](#regression) | Ridge/Lasso & **95% Confidence Intervals** implementation. |
| **III** | [**Ensemble Methods**](#ensemble) | **Bagging, Boosting (XGBoost)**, and **Stacking** architectures. |
| **IV** | [**Discretization Strategy**](#discrete) | Converting `charges` via **Equal-Width Binning** into classes. |
| **V** | [**Probability & Overfitting**](#probs) | Analyzing **Softmax distributions** and Underfitting vs. Overfitting. |
| **VI** | [**GridSearchCV Optimization**](#opt) | Hyperparameter tuning for the best classification variant. |
| **VII** | [**Metric Bridging**](#metrics) | Relating **MAE to Accuracy** as a function of bin width. |

---

<a id="prep"></a>
#### Phase I: Data Preparation & Preprocessing
> **Goal:** Create a clean, standardized pipeline. Scaling parameters (mean, standard deviation) are fit on the training set only and then applied unchanged to the test set, which avoids data leakage.

* **Numeric:** Age, BMI, Children (Standardization).
* **Categorical:** Sex, Smoker, Region (One-Hot Encoding).
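
A minimal sketch of this preprocessing step, assuming scikit-learn and the dataset's usual column names (`age`, `bmi`, `children`, `sex`, `smoker`, `region`, `charges`). The rows below are synthetic stand-ins generated for illustration, not the real data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset (assumed column names, made-up values).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30.0, 6.0, size=n),
    "children": rng.integers(0, 5, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], size=n),
    "charges": rng.uniform(1_000.0, 50_000.0, size=n),
})

numeric = ["age", "bmi", "children"]
categorical = ["sex", "smoker", "region"]

# 80/20 split BEFORE fitting any transformer, so test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["charges"], test_size=0.2, random_state=0
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_t = preprocess.fit_transform(X_train)  # learns mean/std and categories from train only
X_test_t = preprocess.transform(X_test)        # reuses the train-set parameters

print(X_train_t.shape, X_test_t.shape)
```

Calling `fit_transform` on the training split and only `transform` on the test split is what keeps the test set out of the fitted statistics.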

---

<a id="regression"></a>
#### Phase II: Probabilistic Regression
> **Goal:** Predict continuous costs with statistical uncertainty.

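We fit baseline models and calculate **95% Confidence Intervals** for their predictions. For an ordinary least-squares fit with $n$ training rows and $p$ predictors, the interval around a prediction $\hat{y}$ is:
$$\hat{y} \pm t_{\alpha/2,\, n-p-1} \cdot \mathrm{SE}(\hat{y})$$
where $\alpha = 0.05$ and $\mathrm{SE}(\hat{y})$ is the standard error of the prediction. With a single predictor the degrees of freedom reduce to $n-2$.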
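
The interval can be sketched with NumPy and SciPy as follows. This assumes an ordinary least-squares baseline on synthetic data; the classical t-interval does not carry over unchanged to Ridge or Lasso, where a bootstrap is a common alternative. The `1 + leverage` term makes this an interval for a single new observation; dropping the `1` would give the narrower interval for the mean response. All variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])           # design matrix with intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients

resid = y - A @ beta
dof = n - p - 1                                # residual degrees of freedom
s = np.sqrt(resid @ resid / dof)               # residual standard error
t_crit = stats.t.ppf(0.975, dof)               # two-sided 95% critical value

x_new = np.array([1.0, 0.2, -0.1, 0.4])        # leading 1.0 is the intercept term
y_hat = x_new @ beta
leverage = x_new @ np.linalg.inv(A.T @ A) @ x_new
half_width = t_crit * s * np.sqrt(1.0 + leverage)

lower, upper = y_hat - half_width, y_hat + half_width
print(f"{y_hat:.2f} in [{lower:.2f}, {upper:.2f}]")
```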



---

<a id="ensemble"></a>
#### Phase III: Ensemble Methods
> **Goal:** Combine multiple models to reduce variance (Bagging), reduce bias (Boosting), and blend complementary learners (Stacking).

* **Bagging:** Random Forest Regressor.
* **Boosting:** XGBoost / Gradient Boosting.
* **Stacking:** A meta-model trained on the predictions of base learners.
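
The three architectures above can be sketched with scikit-learn alone. This is an illustration on synthetic data, with `GradientBoostingRegressor` standing in for XGBoost so the block has no extra dependency, and a `Ridge` meta-model chosen arbitrarily. The estimator names and settings are assumptions, not the project's tuned configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

base_learners = [
    ("bagging_rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("boosting_gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
]

# The meta-model is trained on out-of-fold predictions of the base learners (cv=5).
stack = StackingRegressor(estimators=base_learners, final_estimator=Ridge(), cv=5)

models = dict(base_learners, stacking=stack)

mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, value in mae.items():
    print(f"{name}: MAE = {value:.2f}")
```

Using out-of-fold predictions for the meta-model keeps it from simply learning which base learner memorized the training rows best.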



---

<a id="discrete"></a>
#### Phase IV: Discretization (Regression to Classification)
> **Goal:** Transform the problem into a classification task using **Equal-Width Binning**.

We divide the `charges` range into $N$ intervals of equal width, $(\max - \min)/N$, and assign each individual to the cost bracket their charges fall into (e.g., Low, Medium, High).
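
A small sketch of equal-width binning with NumPy and pandas, assuming $N = 3$ and a handful of made-up charge values. In a full pipeline the edges would presumably be computed on the training split and reused for the test split, in line with Phase I.

```python
import numpy as np
import pandas as pd

# Made-up charges for illustration (not taken from the dataset).
charges = pd.Series([1200.0, 4500.0, 9800.0, 15000.0, 23000.0, 31000.0, 47000.0])
labels = ["Low", "Medium", "High"]

# N + 1 evenly spaced edges between the minimum and maximum give N equal-width bins.
edges = np.linspace(charges.min(), charges.max(), num=len(labels) + 1)

# include_lowest=True keeps the minimum value inside the first bin.
brackets = pd.cut(charges, bins=edges, labels=labels, include_lowest=True)

print(edges)
print(brackets.value_counts().reindex(labels))
```

Even in this toy sample most values land in the lowest bracket, which is typical of equal-width bins on a right-skewed target and worth keeping in mind when reading accuracy figures.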

---

<a id="probs"></a>
#### Phase V: Probability Distributions & Confidence
> **Goal:** Evaluate model "certainty" and diagnose the fit.

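We analyze the predicted class probabilities (the **Softmax** output, or `predict_proba` in scikit-learn) to visualize how "sure" the model is about each prediction.
* **Overfitting:** Near-perfect training accuracy with a clear drop on the test set, often accompanied by high confidence on incorrect test predictions.
* **Underfitting:** Low accuracy on both sets, often with near-uniform probabilities across all classes.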
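
One way to put numbers on these two diagnostics, sketched with scikit-learn on a synthetic three-class problem. The classifier choice and variable names are assumptions for illustration. "Confidence" here means the largest predicted class probability for each row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the discretized cost brackets.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)            # one row per sample, one column per class
pred = clf.classes_[proba.argmax(axis=1)]    # most probable class
confidence = proba.max(axis=1)               # probability assigned to that class
correct = pred == y_test

# A large train/test gap points to overfitting; low accuracy on both points to underfitting.
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f}")

# High mean confidence on wrong predictions signals overconfidence.
print(f"mean confidence when correct: {confidence[correct].mean():.3f}")
if (~correct).any():
    print(f"mean confidence when wrong:   {confidence[~correct].mean():.3f}")
```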



---

<a id="opt"></a>
#### Phase VI: Optimization via GridSearchCV
> **Goal:** Fine-tune the best-performing classifier.

Using a parameter grid with cross-validation, we search over `max_depth`, `learning_rate`, and `n_estimators` for the combination that maximizes the cross-validated F1-score.
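
A hedged sketch of such a search using scikit-learn's `GridSearchCV`. The grid values, the `GradientBoostingClassifier` estimator, the 3-fold split, and macro-averaged F1 (`scoring="f1_macro"`) are all illustrative assumptions; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data; in the project this would be the training split only.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Deliberately tiny grid so the example runs quickly.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 treats every class equally
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated macro F1: {search.best_score_:.3f}")
```

Macro averaging is one reasonable choice when the brackets are imbalanced; `f1_weighted` would instead weight each class by its frequency.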

---

<a id="metrics"></a>
#### Phase VII: Metric Bridging & Final Conversion
> **Goal:** Mathematically link Error (Regression) to Accuracy (Classification).

We investigate whether a model with a lower **MAE** (Mean Absolute Error) also achieves higher **Accuracy** once its predictions are discretized into the same bins.
$$\text{Accuracy} \approx f(\text{MAE}, \text{Bin Width})$$
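
One candidate form for $f$ can be checked by simulation. The sketch below rests on strong simplifying assumptions that are not taken from the project: true values are uniform over a range split into $N$ equal bins of width $w$, errors are independent of the true value, and errors are small relative to $w$. Under those assumptions each of the $N-1$ interior edges is crossed with probability $\text{MAE}/(N w)$, which suggests $\text{Accuracy} \approx 1 - (N-1)\,\text{MAE}/(N w)$. Real `charges` are not uniform, so this should be read as a heuristic only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bins, bin_width = 5, 10_000.0
value_range = n_bins * bin_width
interior_edges = bin_width * np.arange(1, n_bins)   # the N - 1 edges between bins

# Assumption: uniform true values and independent Gaussian prediction errors.
y_true = rng.uniform(0.0, value_range, size=200_000)
y_pred = y_true + rng.normal(scale=2_000.0, size=y_true.size)

mae = np.mean(np.abs(y_pred - y_true))

# Predictions outside the range fall into the first or last bin.
true_bin = np.digitize(y_true, interior_edges)
pred_bin = np.digitize(y_pred, interior_edges)
accuracy = np.mean(true_bin == pred_bin)

approx = 1.0 - (n_bins - 1) * mae / value_range

print(f"MAE = {mae:.1f}")
print(f"simulated accuracy = {accuracy:.4f}, approximation = {approx:.4f}")
```

The approximation degrades as the MAE approaches the bin width, because a single error can then skip more than one edge.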

---