# Data Ingest of HRRR weather model data

### Intro

In this project, weather predictors for the ML models of fuel moisture content (FMC) are retrieved from the HRRR weather model. The 3D pressure model product from HRRR is used because it has a larger set of variables than other products and is already used in other areas of the `wrfxpy` project. Since rainfall is required for modeling, we use the 3-hour forecast from HRRR and compute rainfall as the difference in accumulated precipitation between the 2-hour and 3-hour forecasts. This notebook demonstrates reading and calculating a set of predictors derived from the HRRR model for a spatial bounding box.
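As a rough illustration of the rainfall predictor, the difference of two accumulated totals can be sketched as below. This is not the project's implementation: `hourly_rain` is a hypothetical helper, the values are made up, and clipping negative differences to zero is an assumption.

```python
def hourly_rain(accum_f02, accum_f03):
    """Rain in the final forecast hour from two accumulated totals (same units, e.g. mm)."""
    # Clipping at zero is an assumption, guarding against small negative differences
    return [max(f03 - f02, 0.0) for f02, f03 in zip(accum_f02, accum_f03)]


# Made-up accumulated precipitation at three grid cells
accum_f02 = [0.0, 1.2, 4.0]  # accumulated through forecast hour 2
accum_f03 = [0.0, 1.7, 6.5]  # accumulated through forecast hour 3
rain = hourly_rain(accum_f02, accum_f03)
print(rain)
```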

There are 2 main uses for the HRRR weather data:

1. Constructing training data sets
2. Forecasting with a trained model over a spatial domain

There are 2 ways to access HRRR weather data within this project:

1. API retrieval using the `Herbie` package. This is the way that real-time data will be ingested.
2. Reading from a formatted stash of HRRR weather data. (Ask J for the location on Alderaan)

The stash will be used to construct training data, and the API will be used for real-time ingest. The API could also be used to construct training data, but during testing the process was killed automatically when too much data was requested at once.

### Metadata File

The metadata file `../etc/variable_metadata/hrrr_metadata.yaml` has information about how to construct various predictors of FMC from HRRR grib file data. There are 4 types of features used in this project: HRRR modeled variables (e.g. wind speed), HRRR dimension variables (e.g. time), features engineered from HRRR modeled data (e.g. equilibrium moisture), and features engineered from HRRR dimension variables (e.g. hour of day). These 4 types of features must be extracted and constructed differently. Top level keys in the metadata file are fmda names used within this project:

- HRRR data variables will specify HRRR naming convention, regex search string, and layer/level. Common layers are grouped together in data retrieval
- HRRR dimension variables will specify a HRRR naming convention, but they can be read from any other set of HRRR data
- Engineered features from HRRR data variables will specify the names of variables needed to calculate them. The names will exist as other top-level keys in this file
- Engineered features from HRRR dimension variables will specify the names of the dimensions needed to calculate them
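The four feature types can be pictured with a small stand-in for the metadata file. The sketch below is an assumption about how such entries could be organized, not the actual contents of `hrrr_metadata.yaml`: the keys `kind` and `requires`, the entry names, and the helper `required_inputs` are all hypothetical.

```python
# Hypothetical metadata entries; the real keys live in hrrr_metadata.yaml
meta = {
    "temp": {"kind": "hrrr_data", "search": "TMP", "layer": "2m"},
    "rh": {"kind": "hrrr_data", "search": "RH", "layer": "2m"},
    "time": {"kind": "hrrr_dim"},
    "Ed": {"kind": "engineered_data", "requires": ["temp", "rh"]},
    "hod": {"kind": "engineered_dim", "requires": ["time"]},
}


def required_inputs(features, meta):
    """Expand requested features into the HRRR data variables and dimensions to read."""
    data_vars, dims = set(), set()
    for name in features:
        entry = meta[name]
        # Engineered features list other top-level keys; raw features depend on themselves
        for dep in entry.get("requires", [name]):
            kind = meta[dep]["kind"]
            if kind == "hrrr_data":
                data_vars.add(dep)
            elif kind == "hrrr_dim":
                dims.add(dep)
    return sorted(data_vars), sorted(dims)


data_vars, dims = required_inputs(["Ed", "hod", "rh"], meta)
print(data_vars, dims)
```

Only the entries in `data_vars` would need to be downloaded; dimension variables can be read from whatever HRRR data is already in memory.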

### Code

A configuration file is used to control data ingest. For automated processes, the code looks for a JSON configuration file that depends on the use case:

* For building training data, `../etc/training_data_config.json`
* For deploying the model on a grid, `../etc/forecast_config.json`
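A minimal sketch of reading such a configuration is shown below. The keys `bbox`, `start_time`, and `end_time` are the ones referenced later in this notebook; the values, the helper `parse_utc`, and the sanity checks are assumptions for illustration, and the real files may contain more fields.

```python
import json
from datetime import datetime, timezone

# Illustrative config text; the real files live under ../etc/
config_text = """
{
  "bbox": [37, -111, 46, -95],
  "start_time": "2023-06-20T15:00:00Z",
  "end_time": "2023-06-20T21:00:00Z"
}
"""


def parse_utc(stamp):
    """Parse a timestamp such as 2023-06-20T15:00:00Z into a timezone-aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


config = json.loads(config_text)
start = parse_utc(config["start_time"])
end = parse_utc(config["end_time"])
# bbox order assumed to be [min_lat, min_lon, max_lat, max_lon]
min_lat, min_lon, max_lat, max_lon = config["bbox"]

# Basic sanity checks before any data is requested
assert start < end, "start_time must precede end_time"
assert min_lat < max_lat and min_lon < max_lon, "bbox must be [min_lat, min_lon, max_lat, max_lon]"
```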

Retrieval of atmospheric weather predictors is done with the Python package `Herbie`. The module `ingest/HRRR.py` has functions and metadata for directing data ingest. A list of predictors is provided to control which data are downloaded. Some of these predictors are derived features, such as equilibrium moisture content, which is calculated from relative humidity and air temperature.
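To make the derived-feature idea concrete, here is a sketch of equilibrium moisture computed from relative humidity and air temperature. It assumes a Van Wagner style formulation that is common in fuel moisture modeling; the coefficients are an assumption and have not been checked against `ingest/HRRR.py`, and the function name `equilibrium_moisture` is hypothetical.

```python
import math


def equilibrium_moisture(rh, temp_k):
    """Drying (Ed) and wetting (Ew) equilibrium moisture content, in percent.

    rh: relative humidity in percent; temp_k: air temperature in Kelvin.
    Coefficients are assumed, not taken from this project's code.
    """
    # Temperature and humidity term shared by both equilibria
    shared = 0.18 * (21.1 + 273.15 - temp_k) * (1.0 - math.exp(-0.115 * rh))
    ed = 0.924 * rh**0.679 + 0.000499 * math.exp(0.1 * rh) + shared
    ew = 0.618 * rh**0.753 + 0.000454 * math.exp(0.1 * rh) + shared
    return ed, ew


ed_dry, ew_dry = equilibrium_moisture(rh=20.0, temp_k=300.0)
ed_humid, ew_humid = equilibrium_moisture(rh=50.0, temp_k=300.0)
print(ed_dry, ew_dry, ed_humid, ew_humid)
```

Under this formulation the drying equilibrium sits above the wetting equilibrium, and both rise with relative humidity.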

## References

For more info on HRRR data bands and definitions, see [HRRR inventory](https://www.nco.ncep.noaa.gov/pmb/products/hrrr/hrrr.t00z.wrfprsf02.grib2.shtml) for pressure model f02-f38 forecast hours.

For more info on the Python package, see Brian Blaylock's `Herbie` [python package](https://github.com/blaylockbk/Herbie).

## Setup

User definitions. In other areas of this project, these come from config files.

In [None]:
import matplotlib.pyplot as plt
from herbie import FastHerbie, Herbie
# from herbie import paint
# from herbie.toolbox import EasyMap, ccrs, pc
import xarray as xr
from datetime import datetime
from dateutil.relativedelta import relativedelta
import json  # required by json.load when reading the config file below
import sys
import os.path as osp
import pandas as pd
import numpy as np
sys.path.append("../src")
from utils import Dict, read_yml, str2time, print_dict_summary, read_pkl
import ingest.HRRR as ih
from viz import map_var  # used by the map_var calls in the Visualizations section

In [None]:
with open("../etc/training_data_config.json", "r") as json_file:
    config = json.load(json_file)   
    config = Dict(config)

bbox = config.bbox
# start = str2time(config.start_time)
# end = str2time(config.end_time)
start = str2time('2023-06-20T15:00:00Z')
end = str2time('2023-06-20T21:00:00Z')
features_list = [*ih.hrrr_meta.keys()]

print(f"Start Date of retrieval: {start}")
print(f"End Date of retrieval: {end}")
print(f"Spatial Domain: {bbox}")
print(f"Required Features: {features_list}")

In [None]:
bbox

## Retrieve Data - API

`FastHerbie` from the `herbie` package sets up a connection to the requested files, but only the data requested later is downloaded. Available data can be viewed with the `inventory()` method. *Note:* the inventory displays a separate row for each time step requested.

The data retrieval steps include:
- Based on input time range, use `FastHerbie` to open a connection to the files
- Based on hrrr metadata, construct a set of regex search strings that are used internally in `Herbie`. The data read is grouped by level (e.g. surface, 2m) as HRRR groups these variables by "hypercube"
- Retrieve HRRR data based on search strings, combine by level
- Calculate engineered features, like equilibrium moisture, day of year, hour of day

Optional steps after this include:
- Rename data based on metadata naming conventions
- Subset HRRR data to a set of points defined by RAWS locations, using `pick_points` in `Herbie`
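The search-string step above can be sketched as follows. This is an illustration under assumptions rather than the code in `ingest/HRRR.py`: the `variables` entries, the helper `build_search_strings`, and the sample inventory lines are hypothetical, although the `(?:A|B):level` regex style matches the search strings used elsewhere in this notebook.

```python
import re
from collections import defaultdict

# Hypothetical subset of metadata: GRIB abbreviation and level text per variable
variables = {
    "temp": {"search": "TMP", "level": "2 m above ground", "layer": "2m"},
    "rh": {"search": "RH", "level": "2 m above ground", "layer": "2m"},
    "rain": {"search": "APCP", "level": "surface", "layer": "surface"},
}


def build_search_strings(variables):
    """Return {layer: regex}, with one combined regex per layer."""
    grouped = defaultdict(list)
    levels = {}
    for info in variables.values():
        grouped[info["layer"]].append(info["search"])
        levels[info["layer"]] = info["level"]
    return {
        layer: ":(?:" + "|".join(sorted(names)) + "):" + levels[layer]
        for layer, names in grouped.items()
    }


search_strings = build_search_strings(variables)

# Illustrative inventory lines in a ":VAR:level:forecast" style
inventory = [
    ":TMP:2 m above ground:3 hour fcst",
    ":RH:2 m above ground:3 hour fcst",
    ":TMP:surface:3 hour fcst",
    ":APCP:surface:0-3 hour acc fcst",
]
matches_2m = [line for line in inventory if re.search(search_strings["2m"], line)]
matches_sfc = [line for line in inventory if re.search(search_strings["surface"], line)]
print(search_strings)
```

Grouping by layer keeps each request to a single level, so every returned dataset has a consistent set of coordinates before the layers are merged.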

In [None]:
# Create a range of dates
dates = pd.date_range(
    start = start.replace(tzinfo=None),
    end = end.replace(tzinfo=None),
    freq="1h"
)

In [None]:
FH = FastHerbie(
    dates, 
    model="hrrr", 
    product="prs",
    fxx=range(3, 4)
)

In [None]:
inv = FH.inventory()
inv

In [None]:
ds = ih.retrieve_hrrr_api(start, end, bbox)

In [None]:
ds

In [None]:
raws_test = {
    "STID1":{
        "loc": {"stid": "STID1", "lat": 42, "lon": -102}
    },
    "STID2":{
        "loc": {"stid": "STID2", "lat": 44, "lon": -104}
    }
}

ds_raws = ih.subset_hrrr2raws(ds, raws_test)

In [None]:
ih.get_units_xr(ds)

In [None]:
# ds = ih.rename_ds(ds)

## Visualizations

Maps are made with a wrapper function around the `EasyMap` functionality in the `Herbie` package. The function accesses metadata, which should make it robust to renaming. The metadata stores color schemes from the NWS for certain variables.

In [None]:
# If you rename data this should still work
# ds = ih.rename_ds(ds.copy())

In [None]:
map_var(ds, "wind", save_path = "../outputs/wind_map.png")

In [None]:
map_var(ds, "temp", save_path = "../outputs/temp_map.png")

In [None]:
map_var(ds, "rh", save_path = "../outputs/rh_map.png")

## Reading from Stash

In [None]:
from utils import retrieve_url

In [None]:
start-relativedelta(hours=3)

In [None]:
end-relativedelta(hours=3)

## Spatial Subset

NOTE: as of Dec 31 2024, there are package issues with this solution, and the Herbie environment does not work either. TODO

Brian Blaylock recommends downloading the data and spatially subsetting it using Herbie's wrapper for `wgrib2`, then recreating the objects and reading them into memory.

In [None]:
# bbox

In [None]:
# def get_fh_layer(FH, search_string, remove_grib=True, bbox=None, subset_naming="myRegion"):
#     """
#     Get HRRR data from fastherbie object given regex search string. 
#     Search string groups variables by layer/level. 
#     Optional bounding box spatially subsets data

#     Arguments:
#         - FH: FastHerbie object, defined with start and stop times
#         - remove_grib: bool, whether to delete downloaded grib files after reading into memory
#         - search_string: str, based on regex. see utility function features_to_searchstr
#         - bbox: list, optional bounding box to subset region

#     Notes: As of Dec 18, 2024, Brian Blaylock recommends downloading data and using 
#         wgrib2 to spatially subset the data
        
#     Returns:
#         xarray, optionally subsetted to a bounding box
#     """

#     if bbox is None:
#         print("Returning data for entire conus, deleting all downloaded gribs")
#         ds = FH.xarray(search_string, remove_grib=remove_grib)
#     else:
#         print(f"Subsetting data to region within bbox: {bbox}")
#         print(f"Downloading Data to run wgrib2")

#         files = FH.download(search_string)
#         files = sorted(files, key=lambda x: int(x.name.split('__hrrr.t')[1][:2])) # sort by hour
        
#         # Reorder bbox to match format (min_lon, max_lon, min_lat, max_lat)
#         extent = (bbox[1], bbox[3], bbox[0], bbox[2]) 
#         subset_files=[]
#         for file in files:
#             subset_files.append(wgrib2.region(file, extent, name=subset_naming))

#         # Convert PosixPath list to strings
#         file_list = [str(path) for path in subset_files]
        
#         # Open files as a combined dataset
#         ds = xr.open_mfdataset(
#             file_list,
#             engine="cfgrib",
#             concat_dim="time",  # Replace 'time' with the appropriate dimension
#             combine="nested" 
#         )        
#         ds = ds.sortby('time')  

#         # Delete Files
#         if remove_grib:
#             for file in files:
#                 if file.exists():  # Check if the file exists before attempting to delete it
#                     file.unlink()        
#             for file in subset_files:
#                 if file.exists():  # Check if the file exists before attempting to delete it
#                     file.unlink()    
                
#     return ds

In [None]:
# ss = search_strings['2m']

# ds1 = get_fh_layer(FH, ss)

In [None]:
# ds2 = get_fh_layer(FH, ss, remove_grib=False, bbox = bbox)

In [None]:
# # Get CRS from geographic herbie 
# ## Assuming this info doesn't change over time
# H = Herbie("2023-08-01", product="sfc")
# ds_hgt = H.xarray("(?:HGT|LAND):surface")
# crs = ds_hgt.herbie.crs

In [None]:
# from herbie.toolbox import EasyMap

In [None]:
# ax = EasyMap(crs=crs).STATES(color="k").ax
# ax.pcolormesh(ds_hgt.longitude, ds_hgt.latitude, ds_hgt.orog, cmap=paint.LandGreen.cmap, alpha=0.5, transform=pc)
# ax.pcolormesh(ds2.longitude, ds2.latitude, ds2.t2m.isel(time=0), transform=pc)

# ax.gridlines(xlocs=extent[:2], ylocs=extent[2:], color="k", ls="--", draw_labels=True)

Data fields are accessed through the `.xarray()` method. This temporarily downloads the file and then delivers it in memory as an xarray object. Different variables are accessed through search strings that specify the variable name (e.g. air temperature), the level (e.g. surface level), and the forecast hour relative to the f00 start time (e.g. hour 3, which is used here). The `ingest/HRRR.py` module in this project stores a dataframe with names and info on the variables considered for modeling FMC.
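The relationship between forecast hour and time stamps can be sketched as below. It assumes the requested window refers to valid times, in which case the model runs must be initialized `fxx` hours earlier; the helper `init_times_for` is hypothetical and not part of `Herbie` or this project.

```python
from datetime import datetime, timedelta, timezone

FXX = 3  # forecast lead in hours, matching fxx=range(3, 4) used in this notebook


def init_times_for(valid_start, valid_end, fxx=FXX):
    """Hourly initialization times whose fxx-hour forecast is valid in [valid_start, valid_end]."""
    first_init = valid_start - timedelta(hours=fxx)
    n_hours = int((valid_end - valid_start).total_seconds() // 3600)
    return [first_init + timedelta(hours=h) for h in range(n_hours + 1)]


valid_start = datetime(2023, 6, 20, 15, tzinfo=timezone.utc)
valid_end = datetime(2023, 6, 20, 21, tzinfo=timezone.utc)
inits = init_times_for(valid_start, valid_end)
print(inits[0], inits[-1], len(inits))
```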

In [None]:
# # Show HRRR naming dataframe
# ih.hrrr_name_df

In [None]:
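# NOTE: `search_strings` is not defined earlier in this notebook. It is expected to be a
# dict mapping layer name (e.g. "surface", "2m", "10m") to a regex search string.
ds_dict = {}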

for layer in search_strings:
    print(f"Reading HRRR data for layer: {layer}")
    print(f"    search strings: {search_strings[layer]}")
    ds_dict[layer] = FH.xarray(search_strings[layer], remove_grib=False) # Keep grib for easier re-use, delete later

In [None]:
ds_dict.keys()

In [None]:
ds_dict["surface"]

In [None]:
ds_dict["2m"]

In [None]:
ds_dict["10m"]

In [None]:
ds = ih.merge_datasets(ds_dict)

In [None]:
ds = ds.assign_coords({
    'grid_x' : ds.x,
    'grid_y' : ds.y
})

In [None]:
ds

## Formatting Forecast Data

Forecasting with a trained model is done pointwise (for now) on the HRRR grid.
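Because the HRRR grid uses a Lambert conformal projection, a latitude/longitude box is not a rectangle in grid index space, so cropping should consider all four corners of the box. A minimal sketch of listing those corners is below; `bbox_corners` is a hypothetical helper, and the bbox order `[min_lat, min_lon, max_lat, max_lon]` is assumed from how `bbox` is used elsewhere in this notebook.

```python
def bbox_corners(bbox):
    """Four (lat, lon) corners of bbox = [min_lat, min_lon, max_lat, max_lon]."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        (min_lat, min_lon),  # southwest
        (min_lat, max_lon),  # southeast
        (max_lat, min_lon),  # northwest
        (max_lat, max_lon),  # northeast
    ]


corners = bbox_corners([37, -111, 46, -95])
latitudes = [lat for lat, _ in corners]
longitudes = [lon for _, lon in corners]
print(corners)
```

The nearest grid cell to each corner then gives a grid x and y value, and the min and max of those values define the crop window.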

In [None]:
bbox = [37, -111, 46, -95]

In [None]:
# Four distinct corners of the bbox, given as [min_lat, min_lon, max_lat, max_lon]
pts = pd.DataFrame({
    "latitude": [bbox[0], bbox[0], bbox[2], bbox[2]],
    "longitude": [bbox[1], bbox[3], bbox[1], bbox[3]]
})

pts

In [None]:
ds_bbox = ds.herbie.pick_points(pts)
ds_bbox

In [None]:
xmin, xmax = int(ds_bbox.grid_x.min()), int(ds_bbox.grid_x.max())
ymin, ymax = int(ds_bbox.grid_y.min()), int(ds_bbox.grid_y.max())

In [None]:
xmin, xmax

In [None]:
ymin, ymax

In [None]:
ds_cropped = ds.sel(x=slice(xmin, xmax), y=slice(ymin, ymax))

In [None]:
map_var(ds_cropped, "Ed")

In [None]:
from herbie import paint
from herbie.toolbox import EasyMap, pc, ccrs
import matplotlib.pyplot as plt

In [None]:
ax = EasyMap("110m", figsize=[15, 9], crs=ds.herbie.crs).STATES().ax
p = ax.pcolormesh(
    ds.longitude,
    ds.latitude,
    ds.Ed.isel(time=0),
    transform=pc,
    cmap=paint.NWSRelativeHumidity.cmap,
)

In [None]:
# ax = EasyMap().STATES().OCEAN().LAND().DOMAIN(ds_cropped).ax
# All-NaN layer (Ed is never below -100), used to draw an empty full-domain background
ds["test"] = ds.Ed.where(ds.Ed < -100)
ax = EasyMap("110m", figsize=[15, 9], crs=ds.herbie.crs).STATES().ax
p = ax.pcolormesh(
    ds.longitude,
    ds.latitude,
    ds.test.isel(time=0),
    transform=pc,
    cmap=paint.NWSRelativeHumidity.cmap,
)
p = ax.pcolormesh(
    ds_cropped.longitude,
    ds_cropped.latitude,
    ds_cropped.Ed.isel(time=0),
    transform=pc,
    cmap=paint.NWSRelativeHumidity.cmap,
)

In [None]:
ds_cropped2 = ds.copy()
ds_cropped2["test2"] = ds.Ed.where(((ds.latitude > ds_bbox.latitude.min()) & (ds.latitude < ds_bbox.latitude.max()) & (ds.longitude > ds_bbox.longitude.min()) & (ds.longitude < ds_bbox.longitude.max())))

In [None]:
# Same empty background layer as above, overlaid with the lat/lon masked data
ds["test"] = ds.Ed.where(ds.Ed < -100)
ax = EasyMap("110m", figsize=[15, 9], crs=ds.herbie.crs).STATES().ax
p = ax.pcolormesh(
    ds.longitude,
    ds.latitude,
    ds.test.isel(time=0),
    transform=pc,
    cmap=paint.NWSRelativeHumidity.cmap,
)
p = ax.pcolormesh(
    ds_cropped2.longitude,
    ds_cropped2.latitude,
    ds_cropped2.test2.isel(time=0),
    transform=pc,
    cmap=paint.NWSRelativeHumidity.cmap,
)