
This notebook outlines the full cleaning and feature generation pipeline for the turbulence dataset, beginning from raw pilot reports (PIREPs) and resulting in a refined dataset enriched with meteorological features. The resulting dataset serves as input to downstream modeling.

---

## 2. PIREP Cleaning and Label Extraction

### 2.1 Initial Filtering
- Loaded over **1.1 million** raw PIREPs.
- Removed null and irrelevant reports, retaining only entries with turbulence observations and valid altitude data.

```python
import pandas as pd
pirep_df = pd.read_csv("../sample_data/pirep_raw.csv")
pirep_df = pirep_df.dropna(subset=['TURB', 'LAT', 'LON'])
pirep_df = pirep_df.drop_duplicates()
```

### 2.2 Altitude Processing
- Extracted flight levels from the `REPORT` column using regex.
- Converted to altitude in feet (MSL) and then to pressure level (hPa) using `metpy`.
- Mapped each flight level to the nearest ERA5 pressure level.

```python
from metpy.calc import height_to_pressure_std
from metpy.units import units

# Convert altitude (ft) to pressure (hPa) using the U.S. Standard Atmosphere
altitude = pirep_df['altitude_ft'].values * units.feet
pirep_df['pressure_hpa'] = height_to_pressure_std(altitude).to('hPa').magnitude
```
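
The regex extraction and nearest-level mapping described above could look roughly like the sketch below. The `/FL350`-style pattern and the helper names `extract_altitude_ft` and `nearest_era5_level` are assumptions for illustration, not the notebook's actual code. The level list is the standard set of 37 pressure levels on which ERA5 data is published.

```python
import re

# Assumed pattern: the PIREP altitude field "/FL350" (flight level in hundreds of feet).
FL_PATTERN = re.compile(r'/FL\s?(\d{3})\b')

# The 37 ERA5 pressure levels, in hPa.
ERA5_LEVELS_HPA = [
    1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250,
    300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850,
    875, 900, 925, 950, 975, 1000,
]

def extract_altitude_ft(report):
    """Return the altitude in feet from a raw PIREP string, or None if absent."""
    match = FL_PATTERN.search(report)
    if match is None:
        return None  # e.g. "/FLUNKN" or a missing altitude field
    return int(match.group(1)) * 100

def nearest_era5_level(pressure_hpa):
    """Return the ERA5 pressure level (hPa) closest to the given pressure."""
    return min(ERA5_LEVELS_HPA, key=lambda level: abs(level - pressure_hpa))
```

On a DataFrame, these helpers would be applied column-wise, for example `pirep_df['REPORT'].map(extract_altitude_ft)` to build `altitude_ft`, and `pirep_df['pressure_hpa'].map(nearest_era5_level)` after the pressure conversion.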


### 2.3 Label Preprocessing

Labeling was **one of the most painful parts** of the pipeline. The `TURB` field was filled with typos, slang, abbreviations, and subjective descriptions like "beautiful ride" or "stable ride". A basic map wouldn't cut it; the only way through was *brute force*: curating patterns through trial, error, and months of exposure.

```python
import re

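def preprocess_and_categorize(label):
    """Map a raw TURB string to (category, original_label, processed_label)."""
    original_label = label
    label = label.upper()
    # Strip numeric ranges and any remaining digits
    label = re.sub(r'\d+-\d+|\d+', '', label)
    # Strip frequency qualifiers and units that carry no intensity information
    label = re.sub(r'\b(CONS|CONSISTENT|OCCASIONAL|CONTINUOUS|OCNL|INTMT|OCCL|KTS)\b', '', label)
    label = label.strip()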

    category_map = [
        # NEG bin
        (r'\b(BEAUTIFUL RIDE|CALM|CLR|CLEAR|GOOD RIDE|'
         r'JUST FINE|N|NEG|NEGATIVE|NEGNEG|NEGH|NEV|NET|'
         r'NG|NIL|NO|NO TURB|NONE|NOT BAD|REASONABLY CALM|'
         r'SMMOOTH|SMOOTH|SMOOTHE|SMOOTHED|SMOOTHEN|SMOOTHM|SMOOTJH|SMOOVE|SMOTH|SMMOTH|'
         r'SNOOTH|SOOTH|STABLE|STABLE RIDE|VMC)\b', 'NEG'),

        # LGT bin
        (r'\b(LGHT|LGT|LGT BUMPY|LGTBLO|LGTCHOP|LIGHT|LIGHT CHOP|LIGHT TURB|LIGHTEST|'
         r'LIGHTEST OF CHOP|LIGHTLY|LIGHTY|LIHGT|LIT|LITE|'
         r'LITTLE CHOP|LHT|LT|LTG|MINOR|MINOR CHOP|SLGT)\b', 'LGT'),

        # MOD bin
        (r'\b(CONSMOD|LIGHT TO MODERATE|LGT-MOD|LG-TMOD|LGTMOD|LMOD'
         r'|MD|MDT|MDO|MOD|MODCAT|MODCHOP|MODCONS|MODD|MODE|'
         r'MODERATE|MODTURB|MOID|MOP|MOT MODT|MOS|MTW)\b', 'MOD'),

        # SEV bin
        (r'\b(HEAVY|MOD-SEV|MODERATE SEVERE|MODERATE TO SEVERE|MODTOSEV|SEV|SEVERE'
         r'|SEVRE|SEVERS|SEVER|SEVR|SERV|SERVER|SERVERE|SVR)\b', 'SEV'),

        # EXTRM bin
        (r'\b(EXT|EXTRM|EXTMR|EXTREM|EXTREME|EXTREME TURB|'
         r'EXTREMELY SEVERE|SEV-EXTRM|SEVERE EXTREME)\b', 'EXTRM')
    ]

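    # Bins are tested in order and the first match wins, so a mixed label
    # such as 'MOD-SEV' resolves to the earlier bin (MOD).
    for pattern, category in category_map:
        if re.search(pattern, label):
            return category, original_label, label
    return 'UNKNOWN', original_label, label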



copied_turb_df = turbulence_only_df.copy()
copied_turb_df[['turbulence_category', 'original_label', 'processed_label']] = copied_turb_df['TURB'].apply(lambda x: pd.Series(preprocess_and_categorize(x)))

final_df = copied_turb_df[copied_turb_df['turbulence_category'] != 'UNKNOWN']
unknown_labels = copied_turb_df[copied_turb_df['turbulence_category'] == 'UNKNOWN']
unknown_labels.to_csv('unknown_turbulence_labels.csv', index=False)

print("Total reports:", len(copied_turb_df))
print("Known labels:", len(final_df))
print("Unknown labels:", len(unknown_labels))
print("\nCategory distribution:")
print(final_df['turbulence_category'].value_counts())
print("\nrandom 10 final_df rows:")
print(final_df[['turbulence_category', 'original_label', 'processed_label']].sample(10))
print("\nrandom 10 unknown labels:")
print(unknown_labels[['original_label', 'processed_label']].sample(10))
```
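
The trial-and-error curation described above depends on seeing which unmatched labels occur most often, so that the next round of patterns targets the largest gaps first. A minimal sketch of that review step follows; the helper name `most_common_unknowns` and its `top_n` parameter are hypothetical and not part of the original notebook.

```python
from collections import Counter

def most_common_unknowns(processed_labels, top_n=20):
    """Return the top_n most frequent non-empty labels as (label, count) pairs."""
    counts = Counter(
        label.strip() for label in processed_labels if label.strip()
    )
    return counts.most_common(top_n)
```

Passing it the `processed_label` column of `unknown_labels` would give a ranked list of candidates to add to `category_map`.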

- Saved outputs:
  - `pirep_sample.csv` → 10 sample raw PIREPs.
  - `final_turb_df.csv` → Cleaned and labeled turbulence dataset.


...
