## 3. Feature Engineering
 
### 3.1 Matching PIREPs to ERA5
- Retrieved ERA5 reanalysis data (28 pressure levels, 16+ variables) using `cdsapi`.
- For each report, matched ERA5 data based on:
  - Time (nearest hour)
  - Location (nearest grid point)
  - Altitude (nearest pressure level)
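
The retrieval step can be sketched as follows. This is a minimal illustration, not the request actually used: the helper names `build_era5_request` and `download_month` are hypothetical, the variable and level lists must be supplied by the caller, and the request keys follow the CDS form for the `reanalysis-era5-pressure-levels` dataset as commonly documented, so they may need adjusting for your CDS API version.

```python
import calendar

def build_era5_request(year, month, variables, pressure_levels):
    """Build a request dict for one month of hourly ERA5 pressure-level data.

    Key names are an assumption based on the public CDS request form.
    """
    n_days = calendar.monthrange(year, month)[1]
    return {
        "product_type": ["reanalysis"],
        "variable": list(variables),
        "pressure_level": [str(p) for p in pressure_levels],
        "year": [str(year)],
        "month": [f"{month:02d}"],
        "day": [f"{d:02d}" for d in range(1, n_days + 1)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "data_format": "grib",
    }

def download_month(year, month, variables, pressure_levels, target):
    """Download one month to `target` (needs a configured CDS API key)."""
    import cdsapi  # imported lazily so the request builder works without it

    client = cdsapi.Client()
    request = build_era5_request(year, month, variables, pressure_levels)
    client.retrieve("reanalysis-era5-pressure-levels", request, target)
```

Downloading one file per month keeps each request small and matches the month-wise files (such as `january_data.grib`) used in the matching step.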
 
```python
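import pandas as pd
import xarray as xr

# Assumes `pirep_df` is already loaded. Coordinate and variable names depend on how
# cfgrib decodes the file (it may expose `isobaricInhPa` and short names such as `t`, `u`, `v`, `r`).
grib_data = xr.open_dataset('january_data.grib', engine='cfgrib')
weather_columns = ['temperature', 'u_component_of_wind', 'v_component_of_wind', 'relative_humidity']

def extract_weather_data(row):
    """Return the ERA5 values nearest to one report in time, level, and location."""
    lat, lon, time, pressure = row['LAT'], row['LON'], row['VALID'], row['Altitude (hpa)']
    time = pd.to_datetime(time)
    point = grib_data.sel(
        time=time, level=pressure,
        latitude=lat, longitude=lon,
        method='nearest',
    )
    return pd.Series({var: float(point[var].values) for var in weather_columns})

# Row-wise lookup; the function is passed directly rather than wrapped in a lambda
pirep_df[weather_columns] = pirep_df.apply(extract_weather_data, axis=1)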
```
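
To make the three "nearest" rules concrete, the sketch below computes the ERA5 keys each report would snap to, using only pandas and NumPy. It is an illustration under assumptions: `snap_to_era5` is a hypothetical helper, the 0.25° grid step is the default ERA5 resolution on the CDS and may differ from the grid actually downloaded, and report longitudes are assumed to use the same convention as the grid.

```python
import numpy as np
import pandas as pd

def snap_to_era5(df, levels_hpa, grid_step=0.25):
    """Return the nearest ERA5 time, grid point, and pressure level per report."""
    levels = np.asarray(levels_hpa, dtype=float)
    out = pd.DataFrame(index=df.index)
    # Time: round to the nearest full hour
    out["era5_time"] = pd.to_datetime(df["VALID"]).dt.round("60min")
    # Location: round to the nearest multiple of the grid step
    out["era5_lat"] = (df["LAT"] / grid_step).round() * grid_step
    out["era5_lon"] = (df["LON"] / grid_step).round() * grid_step
    # Altitude: pick the level with the smallest absolute pressure difference
    pressure = df["Altitude (hpa)"].to_numpy(dtype=float)
    nearest = np.abs(pressure[:, None] - levels[None, :]).argmin(axis=1)
    out["era5_level"] = levels[nearest]
    return out
```

Snapped keys like these can be used to check the output of `extract_weather_data`, or as join keys if the ERA5 data is first flattened into a table.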
 
### 3.2 Feature Generation
- Extracted 16+ ERA5 variables, including:
  - Wind (u/v components, vertical velocity)
  - Temperature, humidity, cloud water, vorticity
- Engineered new features:
  - `wind_speed` and `wind_direction` from the u/v components
  - `wind_shear` from vertical and horizontal wind gradients, with spatial-temporal filtering
  - Seasonal encoding based on the `VALID` timestamp
  - `aircraft_encoded`: integer label encoding of `AIRCRAFT`, with missing values mapped to `UNKNOWN`
 
```python
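import numpy as np
# Horizontal wind speed, in the units of the components (m/s in ERA5)
pirep_df['wind_speed'] = np.sqrt(pirep_df['u_component_of_wind']**2 + pirep_df['v_component_of_wind']**2)
# Angle of the wind vector, counterclockwise from east (the direction the wind blows toward).
# The meteorological "from" direction, clockwise from north, would be (270 - angle) % 360.
pirep_df['wind_direction'] = np.degrees(np.arctan2(pirep_df['v_component_of_wind'], pirep_df['u_component_of_wind']))
pirep_df['wind_direction'] = (pirep_df['wind_direction'] + 360) % 360  # wrap to [0, 360)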
```
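
The `wind_shear` feature is not spelled out above, so the following is only a sketch of one common definition: the magnitude of the vector wind difference between two pressure levels divided by their height separation. The function name, its arguments, and the use of a layer thickness `dz_m` are assumptions, and the spatial-temporal filtering step is not reproduced here.

```python
import numpy as np

def vertical_wind_shear(u_lower, v_lower, u_upper, v_upper, dz_m):
    """Vertical wind shear magnitude in s^-1 between two levels.

    u/v are wind components in m/s at the lower and upper level;
    dz_m is the height difference between the levels in metres.
    """
    du = np.asarray(u_upper, dtype=float) - np.asarray(u_lower, dtype=float)
    dv = np.asarray(v_upper, dtype=float) - np.asarray(v_lower, dtype=float)
    return np.hypot(du, dv) / np.asarray(dz_m, dtype=float)
```

For example, a wind change of 3 m/s in u and 4 m/s in v across a 1000 m layer gives a shear of 0.005 s^-1. The inputs can be scalars or arrays of equal shape.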
 
```python
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
pirep_df['aircraft_encoded'] = encoder.fit_transform(pirep_df['AIRCRAFT'].fillna('UNKNOWN'))
```
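
The seasonal encoding can be sketched in the same way. The exact scheme is not given above, so this example assumes meteorological seasons for the Northern Hemisphere (December to February as winter, and so on); `SEASON_BY_MONTH`, `add_season`, and the `season` column name are hypothetical.

```python
import pandas as pd

# Assumed mapping: meteorological seasons, Northern Hemisphere
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def add_season(df, time_col="VALID"):
    """Return a copy of df with a categorical `season` column."""
    out = df.copy()
    month = pd.to_datetime(out[time_col]).dt.month
    out["season"] = month.map(SEASON_BY_MONTH)
    return out
```

The resulting string column can then be label-encoded or one-hot encoded like any other categorical feature.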
 
### 3.3 Final Dataset Output
- Cleaned and merged month-wise ERA5 data.
- Created `full_year_df.csv` → Complete 2024 dataset with labels + ERA5 features.
- Additional filtering:
  - Limited to U.S. and surrounding airspace (based on bounding box)
  - Dropped unused fields and outliers
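
A minimal sketch of the merge and bounding-box steps follows. The bounds in `US_BBOX` are placeholder values chosen for illustration, not the ones used for this dataset, and `merge_and_filter` is a hypothetical helper that expects one DataFrame per month with `LAT` and `LON` columns.

```python
import pandas as pd

# Placeholder bounds for the U.S. and surrounding airspace (assumed values)
US_BBOX = {"lat_min": 20.0, "lat_max": 55.0, "lon_min": -130.0, "lon_max": -60.0}

def merge_and_filter(monthly_frames, bbox=US_BBOX):
    """Concatenate monthly frames and keep rows inside the bounding box."""
    full = pd.concat(monthly_frames, ignore_index=True)
    inside = (
        full["LAT"].between(bbox["lat_min"], bbox["lat_max"])
        & full["LON"].between(bbox["lon_min"], bbox["lon_max"])
    )
    return full[inside].reset_index(drop=True)
```

`Series.between` is inclusive on both ends by default, so reports exactly on the boundary are kept.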
 
```python
pirep_df.to_csv('full_year_df.csv', index=False)
```
 
---
 
## 4. Label Simplification and Deduplication
 
- Converted turbulence labels to binary:
  - `NEG` → 0
  - `SEV`/`EXTRM` → 1
 
```python
binary_map = {'NEG': 0, 'SEV': 1, 'EXTRM': 1}
# Any severity level absent from the map becomes NaN under .map()
pirep_df['binary_target'] = pirep_df['severity_level'].map(binary_map)
```
 
- Filtered NEG duplicates using diversity in weather features.
- Removed NEG samples from aircraft types not seen in SEV/EXTRM.
- Saved final binary dataset as `air_final_df.csv`:
  - 147K NEG
  - 6.7K SEV/EXTRM
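
The two NEG filtering steps can be sketched as below. The criterion for "diversity in weather features" is not specified above, so this example assumes a simple stand-in: NEG rows whose weather features are identical after rounding are treated as duplicates. `refine_negatives`, `feature_cols`, and `decimals` are hypothetical names.

```python
import pandas as pd

def refine_negatives(df, feature_cols, decimals=1):
    """Thin out NEG rows while keeping every SEV/EXTRM row."""
    pos = df[df["binary_target"] == 1]
    neg = df[df["binary_target"] == 0]
    # Keep NEG rows only for aircraft types that also appear in SEV/EXTRM reports
    neg = neg[neg["AIRCRAFT"].isin(pos["AIRCRAFT"].unique())]
    # Drop NEG rows that match an earlier NEG row after rounding (assumed criterion)
    rounded = neg[feature_cols].round(decimals)
    neg = neg[~rounded.duplicated()]
    return pd.concat([neg, pos]).sort_index()
```

A coarser `decimals` value removes more NEG rows, which is one way to trade dataset size against feature diversity.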
 
```python
pirep_df.to_csv('air_final_df.csv', index=False)
```
 
- With roughly 22 NEG samples for every SEV/EXTRM sample, the refined dataset remains heavily imbalanced; it is structured for class balancing, modeling, and downstream risk analysis.