# PCA Analysis for Dimensionality Reduction


This notebook applies Principal Component Analysis (PCA) to reduce the dimensionality of the weather-based feature set. The goal is to retain 95% of the variance while simplifying the dataset for clustering and visualization.

---
## 1. Standardization and PCA Application


```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

# Apply PCA to retain 95% of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Shape after PCA: {X_pca.shape}")

```
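To check how many components the 95% threshold actually keeps, the cumulative explained variance can be inspected. The following is a minimal sketch on synthetic data: `X_demo` and `pca_demo` are hypothetical stand-ins for `df[feature_cols]` and the fitted `pca` object, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[feature_cols] (assumption): 500 samples,
# 15 features driven by 4 latent factors plus noise, so features are correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 15))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(500, 15))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components asks scikit-learn for the smallest number of
# components whose cumulative explained variance exceeds that fraction.
pca_demo = PCA(n_components=0.95).fit(X_demo_scaled)

# Running total of the variance share explained by the retained components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print(f"Components kept: {pca_demo.n_components_}")
print(f"Cumulative variance retained: {cumulative[-1]:.2%}")
```

On the real data, the same two attributes (`n_components_` and `explained_variance_ratio_`) on the fitted `pca` object would report the retained dimensionality.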


---
## 2. 2D PCA Projection of Turbulence Samples & Variance

A 2D scatter plot of PC1 against PC2 was generated externally to show how NEG and SEV–EXTRM samples are distributed. The plotting code and the resulting figure below help assess visually how much the turbulence categories overlap or separate in PCA-reduced space.

```python
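import matplotlib.pyplot as plt
import seaborn as sns

# pca_5_1: DataFrame with 'PC1', 'PC2', and 'binary_target' columns
plt.figure(figsize=(12, 8), dpi=300)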
sns.scatterplot(
    data=pca_5_1, 
    x='PC1', 
    y='PC2', 
    hue='binary_target', 
    alpha=0.3, 
    palette={0: 'blue', 1: 'red'},
    legend=False  # Added custom legend below
)

# Labels and title
plt.title("Scatter Plot for PC1 and PC2", fontsize=16)
plt.xlabel("PC1", fontsize=14)
plt.ylabel("PC2", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)

# Custom legend
legend_handles = [
    plt.Line2D([0], [0], marker='o', color='w', label='NEG', markerfacecolor='blue', markersize=8),
    plt.Line2D([0], [0], marker='o', color='w', label='SEV+EXTRM', markerfacecolor='red', markersize=8)
]
plt.legend(handles=legend_handles, title="Turbulence", fontsize=12, title_fontsize=13)

plt.tight_layout()
plt.savefig("pca_scatter_PC1_vs_PC2.png", dpi=300, bbox_inches='tight')
plt.show()
```

<!-- `pca_scatter_PC1_vs_PC2.png` -->
This 2D scatter plot shows the projection of turbulence samples on the first two principal components (PC1 and PC2).  

🔵 NEG: No turbulence

🔴 SEV–EXTRM: Severe or extreme turbulence

![PCA scatter plot of PC1 vs PC2](images/pca_scatter_PC1_vs_PC2.png)

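The two classes overlap in this projection. PC1 and PC2 together explain only about 44% of the variance (29.15% and 15.15%, respectively), so the plot cannot show whether the classes separate along the remaining components. Even so, the overlap illustrates how hard it is to separate turbulence classes linearly, which motivates further clustering.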

```python
explained_variance = pca_obj_5_1.explained_variance_ratio_
print(f"PC1: {explained_variance[0]:.2%}, PC2: {explained_variance[1]:.2%}")

# Output: PC1: 29.15%, PC2: 15.15%
```

---

## 3. Top Feature Contributions to PC1

To interpret what PC1 represents, the top 10 features contributing to this principal component were visualized.

![Top 10 feature contributions to PC1](images/pc1_top_features.png)

Interpretation:
- PC1 is heavily influenced by cloud cover, humidity, ice water content, and vertical wind motion, all of which are meteorological indicators of potential turbulence.
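Feature contributions to PC1 can be read from the first row of `components_` on a fitted PCA object. The sketch below ranks features by absolute loading on synthetic data. The feature names and `pca_demo` are hypothetical, and this may not be how the figure above was produced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with placeholder feature names (assumption).
rng = np.random.default_rng(1)
demo_cols = [f"feature_{i}" for i in range(12)]
X_demo = rng.normal(size=(300, 12))
X_demo[:, 1] += 2.0 * X_demo[:, 0]  # make feature_0 and feature_1 correlated

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# Row 0 of components_ holds the PC1 loadings, one weight per input feature.
pc1 = pca_demo.components_[0]

# Rank by absolute loading: the sign only gives the direction along PC1.
order = np.argsort(np.abs(pc1))[::-1][:10]
top10 = [(demo_cols[i], float(pc1[i])) for i in order]

for name, loading in top10:
    print(f"{name:>12s}  {loading:+.3f}")
```

Because each component is a unit vector, the squared loadings sum to 1, so a squared loading can be read as that feature's share of the component.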

## 4. Cross-Validation with SHAP Importance
To validate PCA findings, SHAP (SHapley Additive exPlanations) was applied to a trained XGBoost model using the same features. The SHAP results confirm that cloud-related features and vertical velocity play a critical role.

![SHAP_importance](images/SHAP_importance.png)

Interpretation:
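- **SHAP and PCA both highlight similar influential features**. PC1 loadings reflect variance in the inputs, while SHAP values reflect each feature's contribution to the model's predictions, so the agreement suggests that the high-variance directions are also the predictive ones. This consistency strengthens trust in both the model and the dimensionality reduction step.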
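One simple way to quantify the agreement between the two rankings is to count the features that appear in the top k of both. The sketch below assumes each method's scores are available as a name-to-score dictionary. The helper `top_k_overlap`, the feature names, and all numbers are hypothetical placeholders, not results from this project.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Return the features ranked in the top k (by absolute score) of both inputs."""
    def top(scores):
        ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
        return set(ranked[:k])
    return top(scores_a) & top(scores_b)

# Placeholder values for illustration only (assumption).
pc1_loadings = {
    "cloud_cover": 0.45,
    "humidity": 0.40,
    "vertical_velocity": -0.38,
    "temperature": 0.10,
    "pressure": 0.05,
}
mean_abs_shap = {
    "cloud_cover": 0.30,
    "humidity": 0.08,
    "vertical_velocity": 0.25,
    "temperature": 0.12,
    "pressure": 0.02,
}

shared = top_k_overlap(pc1_loadings, mean_abs_shap, k=3)
print(sorted(shared))
```

A rank correlation over the full feature list would be a stricter alternative, since top-k overlap ignores ordering within the top k.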

## 5. Why PCA was useful: A business perspective
Principal Component Analysis helped in several ways:
- Dimensionality Reduction: Reduced the feature set from 15+ features to a handful of uncorrelated components while retaining 95% of the variance.
- Noise Filtering: Combined redundant weather features into shared components and discarded the low-variance directions. Because PCA is unsupervised, this filtering is based on variance, not on how well a feature distinguishes turbulence.
- Interpretability: Enabled visual analysis of turbulent vs. non-turbulent samples in two dimensions, showing where the classes overlap and where they separate in the transformed space.
- Downstream Clustering: The transformed PCA data was used for K-Means clustering to assign unsupervised high-risk labels, which are important for detecting turbulence regimes even when labeled data is scarce.
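The downstream clustering step can be sketched as follows. This is an illustration on synthetic data, not the project's pipeline: the two-cluster setup, the variable names, and especially the rule that the smaller cluster is the "high-risk" one are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two synthetic regimes standing in for calm and turbulent conditions (assumption).
rng = np.random.default_rng(2)
calm = rng.normal(loc=0.0, size=(200, 8))
rough = rng.normal(loc=3.0, size=(50, 8))
X_demo = np.vstack([calm, rough])

# Same preprocessing as in section 1: standardize, then keep 95% of the variance.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
X_demo_pca = PCA(n_components=0.95).fit_transform(X_demo_scaled)

# Cluster in PCA space; n_init and random_state are set for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_demo_pca)

# Hypothetical labeling rule: treat the smaller cluster as high-risk.
counts = np.bincount(cluster_labels, minlength=2)
high_risk_cluster = int(np.argmin(counts))
high_risk = cluster_labels == high_risk_cluster

print(f"Cluster sizes: {counts.tolist()}")
print(f"Samples flagged high-risk: {int(high_risk.sum())}")
```

In practice, the mapping from cluster to risk level would need to be checked against whatever labeled turbulence reports are available, since K-Means cluster indices carry no meaning on their own.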