# Data Sources and Acquisition

This notebook demonstrates how raw aviation and weather data were collected and processed to support a turbulence risk prediction system. All work was carried out independently as part of a Master's project.

---

## 1. Pilot Reports (PIREPs)

**Source**: Iowa Environmental Mesonet (IEM)  
[https://mesonet.agron.iastate.edu/request/gis/pireps.php](https://mesonet.agron.iastate.edu/request/gis/pireps.php)

PIREPs are real-time turbulence observations submitted by pilots during flight. They were used to derive ground-truth turbulence labels and contextual metadata.

### -- Spatial Coverage Visualization
```python
from IPython.display import Image
Image(filename='../assets/1. PIREP_Reports_2024_Map.png')
```
![PIREPs Map](/assets/1.%20PIREP_Reports_2024_Map.png)

This map shows the full extent of the ~1.1 million PIREPs across the U.S., Alaska, Hawaii, and nearby regions.

### -- Sample Raw Data (10 rows)
```python
import pandas as pd
raw_df = pd.read_csv("../sample_data/pirep_sample.csv")
raw_df.head(10)
```

### -- Derived Sample (with new columns)
```python
derived_df = pd.read_csv("../sample_data/pirep_derived_sample.csv")
derived_df.head(10)
```

The additional columns include altitude converted to pressure (hPa), parsed timestamps, and turbulence severity labels, all preprocessed so that each report can be matched against ERA5 data.
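
As an illustration of how such columns could be derived, here is a minimal sketch. It assumes the altitude is available in feet and that `VALID` is a `YYYYMMDDHHMM` string; the column names `altitude_ft`, `valid_hour`, `pressure_hpa`, and `nearest_level_hpa` are hypothetical, and the altitude-to-pressure conversion uses the standard-atmosphere (ISA) formula, which may differ from the conversion used in the project.

```python
import numpy as np
import pandas as pd

# Pressure levels (hPa) taken from the retrieval script in this notebook.
ERA5_LEVELS_HPA = np.array([400, 450, 500, 550, 700])

def feet_to_hpa(altitude_ft):
    """ISA pressure (hPa) for an altitude in feet; valid below roughly 11 km."""
    altitude_m = np.asarray(altitude_ft, dtype=float) * 0.3048
    return 1013.25 * (1.0 - 2.25577e-5 * altitude_m) ** 5.25588

def nearest_level(pressure_hpa, levels=ERA5_LEVELS_HPA):
    """Return the entry of `levels` closest to each pressure value."""
    pressure_hpa = np.asarray(pressure_hpa, dtype=float)
    idx = np.abs(pressure_hpa[..., None] - levels).argmin(axis=-1)
    return levels[idx]

# Two made-up reports, for illustration only.
sample = pd.DataFrame({
    "VALID": ["202401011154", "202401012231"],  # assumed timestamp format
    "altitude_ft": [18000, 24000],              # hypothetical column name
})

sample["VALID"] = pd.to_datetime(sample["VALID"], format="%Y%m%d%H%M")
sample["valid_hour"] = sample["VALID"].dt.round("60min")   # nearest ERA5 hour
sample["pressure_hpa"] = feet_to_hpa(sample["altitude_ft"])
sample["nearest_level_hpa"] = nearest_level(sample["pressure_hpa"])
```

Rounding the timestamp and snapping the pressure to an available level up front keeps the later lookup against the reanalysis grid simple.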

---

## 2. ERA5 Reanalysis Data

**Source**: Copernicus Climate Data Store (CDS)  
[ERA5 hourly data on pressure levels](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-pressure-levels?tab=overview)

- Weather variables used include:
  - Temperature, Relative Humidity, U/V Wind Components, Vertical Velocity
  - Cloud Liquid/Ice Water Content
  - Vorticity, Geopotential, Potential Vorticity, Divergence, etc.

- 28 pressure levels spanning tropospheric layers
- Downloaded using the `cdsapi` Python client


ⓘ Over **1.5 TB** of ERA5 data were processed across 12 months (2024).

---

## 3. ERA5 Retrieval Script (cdsapi)
```python
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-pressure-levels",
    {
        "product_type": "reanalysis",
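        "variable": ["temperature", "u_component_of_wind", "v_component_of_wind", ...],  # remaining variables elided
        "pressure_level": [400, 450, 500, 550, 700],  # hPa
        "year": ["2024"],
        "month": ["01"],
        "day": [f"{d:02d}" for d in range(1, 32)],    # "01" through "31"
        "time": [f"{h:02d}:00" for h in range(24)],   # "00:00" through "23:00"
        "format": "grib",
        "area": [72, -180, 15, -60],  # North, West, South, East (degrees)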
    },
    "january_data.grib"
)
```
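
Since a full year was retrieved, the same request can be generated once per month. The sketch below only builds the request dictionaries and does not contact the CDS; `build_request` is a hypothetical helper, the variable list is limited to the three names spelled out above, and the output file naming follows the `january_data.grib` pattern as an assumption.

```python
import calendar

def build_request(year, month, variables, pressure_levels):
    """Build one month's request dict, using the correct number of days."""
    n_days = calendar.monthrange(year, month)[1]
    return {
        "product_type": "reanalysis",
        "variable": list(variables),
        "pressure_level": list(pressure_levels),
        "year": [str(year)],
        "month": [f"{month:02d}"],
        "day": [f"{d:02d}" for d in range(1, n_days + 1)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "format": "grib",
        "area": [72, -180, 15, -60],  # North, West, South, East
    }

VARIABLES = ["temperature", "u_component_of_wind", "v_component_of_wind"]
LEVELS = [400, 450, 500, 550, 700]

monthly_requests = {
    month: build_request(2024, month, VARIABLES, LEVELS)
    for month in range(1, 13)
}

for month, request in monthly_requests.items():
    target = f"{calendar.month_name[month].lower()}_data.grib"
    print(target, len(request["day"]), "days")
    # With CDS credentials configured, the download itself would be:
    # cdsapi.Client().retrieve("reanalysis-era5-pressure-levels", request, target)
```

Building the day list from the calendar avoids requesting dates such as 31 February.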

---

## 4. Weather Data Extraction (GRIB to CSV)
Once the GRIB files were downloaded, the weather variables for each PIREP row were extracted using `xarray` and `cfgrib`. Each PIREP was matched by:
- UTC timestamp (nearest hour)
- Latitude and Longitude (nearest grid point)
- Altitude mapped to nearest pressure level

```python
import pandas as pd
import xarray as xr

grib_file = 'january_data.grib'
grib_data = xr.open_dataset(grib_file, engine='cfgrib')

# Define weather variables (list truncated here)
weather_columns = ["temperature", "u_component_of_wind", "v_component_of_wind", "relative_humidity", ...]

# Automation function for data extraction; called once per PIREP row
def extract_weather_data(row, grib_data, valid_pressure_levels):
    lat, lon, time, pressure = row['LAT'], row['LON'], row['VALID'], row['Altitude (hpa)']
    ... # selection using xarray.sel; assumed to yield `values`, a mapping of column name -> selected value
    return pd.Series({var: values[var] for var in weather_columns})
```
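
The elided selection step can be illustrated on a small synthetic dataset. This is a sketch only, not the project's code: the values are made up, `nearest_point` is a hypothetical helper, and the coordinate names (`time`, `isobaricInhPa`, `latitude`, `longitude`) and the short variable name `t` are assumed to match what `cfgrib` typically produces for ERA5 pressure-level files.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for an ERA5 pressure-level dataset (values are arbitrary).
times = pd.date_range("2024-01-01 00:00", periods=4, freq="60min")
levels = np.array([400, 450, 500, 550, 700])
lats = np.linspace(50.0, 48.0, 9)      # descending, 0.25 degree spacing
lons = np.linspace(-100.0, -98.0, 9)   # ascending, 0.25 degree spacing

shape = (len(times), len(levels), len(lats), len(lons))
ds = xr.Dataset(
    {
        "t": (
            ("time", "isobaricInhPa", "latitude", "longitude"),
            np.arange(np.prod(shape), dtype=float).reshape(shape),
        )
    },
    coords={
        "time": times,
        "isobaricInhPa": levels,
        "latitude": lats,
        "longitude": lons,
    },
)

def nearest_point(dataset, when, level_hpa, lat, lon):
    """Select the grid cell closest to a report in time, level, and position."""
    return dataset.sel(
        time=when,
        isobaricInhPa=level_hpa,
        latitude=lat,
        longitude=lon,
        method="nearest",
    )

point = nearest_point(ds, pd.Timestamp("2024-01-01 01:20"), 480, 49.1, -99.4)
temperature = float(point["t"])
```

With `method="nearest"`, a report at 01:20 UTC, 480 hPa, 49.1°N, 99.4°W resolves to the 01:00 field at 500 hPa on the (49.0, -99.5) grid point.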

---

## 5. Automation: Month-wise Extraction
The GRIB files were processed day by day across multiple months:
- Automated matching for all pressure levels
- Saved enriched files like `january_with_weather_data.csv`, `february_with_weather_data.csv`, etc.

```python
# Example: Processing one GRIB file
def process_grib_file(grib_file, valid_pressure_levels, month_df, output_csv):
    grib_data = xr.open_dataset(grib_file, engine='cfgrib')

    month_df[weather_columns] = month_df.apply(
        lambda row: extract_weather_data(row, grib_data, valid_pressure_levels), axis=1
    )

    month_df.to_csv(output_csv, index=False)
    print(f"Updated data saved to {output_csv}")

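# Example usage (january_df is the January PIREP DataFrame, loaded beforehand):
grib_file = 'january_data.grib'  # file written by the retrieval script
valid_pressure_levels = [400, 450, 500, 550, 700]
process_grib_file(grib_file, valid_pressure_levels, january_df, 'january_with_weather_data.csv')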
```
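
To repeat this for every month, the input and output paths can be generated rather than typed by hand. The sketch below is an assumption about how that loop might look: `monthly_jobs` is a hypothetical helper, and the GRIB naming pattern is inferred from `january_data.grib`.

```python
import calendar

def monthly_jobs(months=range(1, 13)):
    """Return (grib_path, output_csv) pairs, one per month number."""
    jobs = []
    for month in months:
        name = calendar.month_name[month].lower()
        jobs.append((f"{name}_data.grib", f"{name}_with_weather_data.csv"))
    return jobs

for grib_path, csv_path in monthly_jobs([1, 2]):
    print(f"{grib_path} -> {csv_path}")
    # With the month's PIREP DataFrame loaded as month_df, the call would be:
    # process_grib_file(grib_path, valid_pressure_levels, month_df, csv_path)
```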

---

## Notes
- Only sample rows are shown in this notebook
- Full code and data are not included

📎 For questions or collaboration, please contact me
