# Class Balancing with Isolation Forest and SMOTE

This notebook addresses the significant class imbalance between **NEG** (no turbulence) and **SEV–EXTRM** (severe/extreme turbulence) reports in the binary turbulence-risk classification task. A combined strategy is applied to the training set only:

- Downsampling **NEG** using anomaly scores from **Isolation Forest**
- Oversampling **SEV–EXTRM** using **SMOTE**

---

## ▸ Step 1: Downsampling NEG Class with Isolation Forest

### Objective

Retain only the most informative and diverse NEG (negative) reports by filtering out the redundant, overly common cases that dominate the dataset.

### Method

An `IsolationForest` model was trained on the weather-based features of the NEG samples. Its anomaly scores were used to keep only the most anomalous samples, on the assumption that unusual NEG reports are more likely to lie near the decision boundary.

```python
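from sklearn.ensemble import IsolationForest

# scikit-learn requires contamination in (0, 0.5], so the original 0.8 is
# applied as a quantile cutoff on the raw scores instead of as contamination.
iso_model = IsolationForest(n_estimators=100, random_state=42)
iso_model.fit(neg_df[feature_cols])
# score_samples returns lower values for more anomalous rows
neg_df["anomaly_score"] = iso_model.score_samples(neg_df[feature_cols])
cutoff = neg_df["anomaly_score"].quantile(0.8)
neg_df_anomalous = neg_df[neg_df["anomaly_score"] <= cutoff]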
```

This is intended to keep the classifier from learning a representation of the negative class that is dominated by routine cases, and to focus it on unusual but real NEG reports.
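The selection step can be tried end to end on synthetic data. The sketch below is a minimal illustration, not the notebook's actual pipeline: the column names (`wind_speed`, `temp_gradient`), the helper `downsample_by_anomaly`, and the `keep_fraction` value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def downsample_by_anomaly(df, feature_cols, keep_fraction, random_state=42):
    """Keep the `keep_fraction` of rows that Isolation Forest scores as most anomalous."""
    model = IsolationForest(n_estimators=100, random_state=random_state)
    model.fit(df[feature_cols])
    # score_samples returns lower values for more anomalous rows.
    scores = pd.Series(model.score_samples(df[feature_cols]), index=df.index)
    n_keep = int(round(keep_fraction * len(df)))
    keep_idx = scores.nsmallest(n_keep).index
    return df.loc[keep_idx].assign(anomaly_score=scores.loc[keep_idx])


# Synthetic stand-in for NEG reports; the column names are hypothetical.
rng = np.random.default_rng(0)
neg_demo = pd.DataFrame(
    rng.normal(size=(1000, 2)), columns=["wind_speed", "temp_gradient"]
)
neg_kept = downsample_by_anomaly(
    neg_demo, ["wind_speed", "temp_gradient"], keep_fraction=0.3
)
print(len(neg_demo), "->", len(neg_kept))
```

Selecting by score rank rather than by the `-1`/`1` labels makes the retained fraction explicit and keeps the scores available for plotting.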

### Visualization

Density graph comparing anomaly score distributions of original vs. downsampled NEG samples.

`neg_vs_downsampled_density.png`

![NEG vs. downsampled anomaly score density](images/neg_vs_downsampled_density.png)
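A comparison plot of this kind could be produced roughly as follows. This is a sketch under assumptions: the function name `plot_score_density`, the synthetic scores, and the output file name are hypothetical, and normalized histograms stand in for whatever density estimate the original figure used.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_score_density(all_scores, kept_scores, out_path):
    """Overlay normalized histograms of anomaly scores and save the figure."""
    # Shared bin edges so the two histograms are directly comparable.
    bins = np.histogram_bin_edges(np.concatenate([all_scores, kept_scores]), bins=40)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(all_scores, bins=bins, density=True, alpha=0.5, label="Original NEG")
    ax.hist(kept_scores, bins=bins, density=True, alpha=0.5, label="Downsampled NEG")
    ax.set_xlabel("Anomaly score")
    ax.set_ylabel("Density")
    ax.legend()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


# Synthetic scores for illustration only.
rng = np.random.default_rng(0)
demo_scores = rng.normal(loc=-0.45, scale=0.05, size=1000)
demo_kept = np.sort(demo_scores)[:300]  # the 300 lowest (most anomalous) scores
demo_path = os.path.join(tempfile.gettempdir(), "neg_vs_downsampled_density_demo.png")
plot_score_density(demo_scores, demo_kept, demo_path)
```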

---

## ▸ Step 2: Oversampling SEV–EXTRM Class Using SMOTE

### Objective

Improve model sensitivity to rare SEV–EXTRM turbulence cases by synthetically increasing their representation, without affecting evaluation integrity.

### Method

`SMOTE` (Synthetic Minority Over-sampling Technique) was applied only to the training set, after the train/test split. The test set therefore contains only real reports, no synthetic sample is derived from test data, and evaluation stays realistic.

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
```

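With its default settings, SMOTE oversamples the minority class until it matches the majority class count. This keeps the classifier from being overwhelmed by NEG samples during training and is intended to improve its sensitivity to SEV–EXTRM cases.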

## ▸ Final Class Distribution

The final binary-labeled dataset, after downsampling NEG and oversampling SEV–EXTRM, has a substantially more even class balance, while the variety of conditions within each class is preserved.
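Class balance can be quantified as the ratio of minority to majority counts before and after resampling. The helper `minority_ratio` and the label counts below are hypothetical, chosen only to illustrate the calculation; they are not the project's figures.

```python
from collections import Counter


def minority_ratio(labels):
    """Return (class counts, minority count / majority count)."""
    counts = Counter(labels)
    return counts, min(counts.values()) / max(counts.values())


# Hypothetical label lists, not the project's real counts.
before = ["NEG"] * 900 + ["SEV–EXTRM"] * 100
after = ["NEG"] * 300 + ["SEV–EXTRM"] * 300

counts_before, ratio_before = minority_ratio(before)
counts_after, ratio_after = minority_ratio(after)
print(f"before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

A ratio of 1.0 means the two classes are equally represented; values near 0 indicate severe imbalance.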


### Visualization

U.S. map of sampled PIREPs colored by binary target class.

`PIREP_Binary_Target_Distribution.png`

![PIREPs Map](images/PIREP_Binary_Target_Distribution.png)

