# ▸ Extended Model Evaluation and Case Insights


Beyond raw performance metrics, I explored how different preprocessing choices, model types, and real-world samples influenced turbulence prediction outcomes. These analyses helped validate the model's consistency and suggested how it could be used operationally.

---
## 1. Preprocessing Strategy: Why It Matters

To understand the impact of different balancing and transformation techniques, I evaluated six preprocessing combinations using XGBoost as the base classifier.

![ablation_study_table](images/ablation_study_table.png)

As shown above, the raw dataset (without any balancing) achieved high overall accuracy (96.6%) but failed to detect many of the SEV–EXTRM turbulence cases. On the other hand, approaches that included SMOTE and downsampling improved recall and F1-score substantially, especially for the rare, high-risk class.
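The balancing idea can be sketched in plain Python. This is a minimal illustration, not the project's code: `smote_like_oversample` and `random_downsample` are hypothetical helpers, the toy points are made up, and plain random downsampling stands in for the anomaly-guided variant used in the Full Pipeline.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest neighbours are searched within the minority class only.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def random_downsample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (assumed stand-in for
    anomaly-guided downsampling)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)

# Toy, made-up 2-D feature vectors (not project data).
severe = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
toy_rng = random.Random(42)
calm = [(toy_rng.uniform(3, 6), toy_rng.uniform(3, 6)) for _ in range(40)]

target = 12  # hypothetical per-class size after balancing
severe_balanced = severe + smote_like_oversample(severe, target - len(severe))
calm_balanced = random_downsample(calm, target)
print(len(severe_balanced), len(calm_balanced))  # prints: 12 12
```

Every synthetic point lies on a segment between two real minority samples, so the rare class is filled in rather than simply duplicated. In a real experiment, resampling of this kind would normally be applied to the training split only.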

The Full Pipeline, which combined SMOTE, anomaly-guided downsampling, PCA-based dimensionality reduction, and K-Means clustering for risk labeling, delivered the best overall balance:

- Recall: 91% for SEV–EXTRM
- F1-score: 0.88
- False negatives: **reduced more effectively than any other pipeline**, which is critical for a risk-sensitive domain
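For reference, these metrics can be derived from true and predicted labels as in the sketch below. The helper `severe_class_metrics` and the label vectors are hypothetical and purely illustrative; they are not the project's evaluation code or results.

```python
def severe_class_metrics(y_true, y_pred, positive=1):
    """Recall, precision, F1, and false-negative count for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "false_negatives": fn}

# Made-up labels: 1 = SEV–EXTRM, 0 = everything else.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = severe_class_metrics(y_true, y_pred)
print(metrics)
```

In this toy example one severe case is missed and one false alarm is raised, so recall, precision, and F1 all come out to 0.75 with a single false negative.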


While SMOTE + Downsampling came very close in terms of recall and F1, the Full Pipeline had several subtle advantages:

- PCA helped **reduce noise and feature redundancy**, improving generalization.
- K-Means labeling enabled **better structure-aware learning**, especially when fed into models like XGBoost that benefit from informative feature clusters.
- From a modeling perspective, the combined pipeline allowed for **more interpretable and spatially consistent results**, making downstream evaluation and validation more robust.

So while the numerical gains may seem incremental, the Full Pipeline was more reliable across validation folds and test scenarios, which ultimately made it the more trustworthy setup for operational use.

---

## 2. Model Behavior Comparison - SVM vs XGBoost

Since Mizuno et al. originally used an SVM model, I replicated their approach and contrasted it with my pipeline’s results.

**(SVM vs SVM)**
![conf_final](images/conf_final.png)

My SVM model showed reduced performance on this dataset, especially in correctly identifying severe turbulence.

On the other hand:

**(SVM vs XGBoost)**
![conf_final_xgboost](images/conf_final_xgboost.png)

✮ XGBoost offered stronger generalization. It correctly identified 90.7% of SEV–EXTRM reports while keeping false positives (false alarms) under control, which is important in operational aviation scenarios.

---

## 3. Case Study: February 16, 2025

To test model robustness on unseen data, I selected a real high-risk day from the 2025 test set. The map below shows how well the XGBoost model captured true positive zones:

![feb_16th_map_view](images/feb_16th_map_view.png)

I focused on two locations where the model accurately flagged SEV–EXTRM cases: Ohio and North Carolina. Here's how their conditions differed from the 2024 baseline:

![case_study_analysis](images/case_study_analysis.png)

- Wind speeds, vertical velocity, and temperature all deviated significantly from the baseline. This is consistent with the model relying on meaningful physical indicators.

ⓘ These explanations aren’t meant as expert meteorological validation but illustrate how the model learned useful, generalizable patterns.
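
One simple way to quantify a deviation from the baseline is a z-score per feature. The sketch below is only an assumption about how such a comparison could be done: the helper `baseline_z_scores`, the feature names, and all values are hypothetical and are not taken from the case study.

```python
from statistics import mean, stdev

def baseline_z_scores(baseline, case):
    """Express each case value as standard deviations from the baseline mean."""
    scores = {}
    for name, values in baseline.items():
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation
        scores[name] = (case[name] - mu) / sigma if sigma else 0.0
    return scores

# Made-up baseline samples and one made-up case (units are illustrative).
baseline_2024 = {
    "wind_speed": [18.0, 22.0, 20.0, 19.0, 21.0],
    "vertical_velocity": [-0.1, 0.0, 0.1, 0.05, -0.05],
    "temperature": [250.0, 252.0, 251.0, 249.0, 253.0],
}
case_day = {"wind_speed": 35.0, "vertical_velocity": 0.6, "temperature": 244.0}

z_scores = baseline_z_scores(baseline_2024, case_day)
for name, z in z_scores.items():
    print(f"{name}: {z:+.2f} standard deviations from baseline")
```

A large absolute z-score flags a feature whose value on the case day sits far outside its usual range, and the sign shows the direction of the departure.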

---
## ▸ Key Insights and Model Reflections

This pipeline was built for more than just scoring high on cross-validation. It was designed to:

- Prioritize recall and F1-score for severe events
- Reduce false negatives, which are critical in safety contexts
- Offer a way to visually interpret predictions using K-Means clustering
- Provide traceability and feature deviation analysis on high-risk days

While XGBoost performed best overall, the deeper evaluations gave confidence that the predictions weren’t just fitting noise: **they matched real-world patterns** in **wind speed, vertical motion, and cloud features**.