# Data Ingest of RAWS 10-h Fuel Moisture Content

This notebook demonstrates retrieval and filtering of 10-h dead fuel moisture content (FMC) data from RAWS.
- Realtime 10-h FMC observations are retrieved with `SynopticPy`
- Older 10-h FMC observations are retrieved from a MesoDB stash maintained by Angel Farguell

This notebook demonstrates use of `SynopticPy` with a free token, which limits the number of sensor hours that can be requested and makes only records from the past year available. In automated processes, the time frame and spatial domain for data ingest are controlled by the configuration file `training_data_config.json` or `forecast_config.json`. Here, we retrieve data manually over short time frames for illustration purposes.

User inputs for data retrieval are:
- Start time
- End time
- Spatial bounding box (see `rtma_cycler` in `wrfxpy` for GACC bounding boxes)

The main steps in the retrieval are:
* Use `synoptic.Metadata` to determine the RAWS with FMC data in the given spatial domain and time frame.
* Get data from the stash OR use `synoptic.TimeSeries` to retrieve all available data that may be relevant to FMC modeling. *NOTE:* stations are selected only if they have FMC data; any other available variables are then collected as a bonus. These data are used for exploratory purposes and quality control checks, but the predictors for final modeling come from HRRR.
* Format data and convert units.
* Identify missing data and fill it by linear interpolation with numpy so that the resulting data are at regular 1-hour intervals.

## References

For more info on the Python library API, see Brian Blaylock's `SynopticPy` [python package](https://github.com/blaylockbk/SynopticPy).

For more info on available Synoptic RAWS variables, see the [Synoptic Data](https://demos.synopticdata.com/variables/index.html) documentation.

## Setup

In [None]:
# import matplotlib.pyplot as plt
# from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
# import json
import sys
import numpy as np
# import polars as pl
import pandas as pd
sys.path.append('../src')
from utils import Dict, read_yml, print_dict_summary, str2time, time_range, rename_dict
import ingest.RAWS as rr

In [None]:
bbox = [40, -105, 45, -100] # subset of rocky mountain gacc
start = str2time('2024-06-01T00:00:00Z')
end = str2time('2024-06-01T05:00:00Z')

raws_meta = Dict(read_yml("../etc/variable_metadata/raws_metadata.yaml"))
print_dict_summary(raws_meta)

## Stations MetaData

We use `SynopticPy` to get a list of all RAWS stations within the bounding box that have fuel moisture data availability in the given time period. The function `get_stations` wraps the `synoptic.Metadata` function to order the bounding box properly and retrieve stations with FMC sensors.

*Note*: the bounding box format used in `wrfxpy` is `[min_lat, min_lon, max_lat, max_lon]`, whereas the format used by Synoptic is `[min_lon, min_lat, max_lon, max_lat]`. The code assumes the `wrfxpy` format and converts internally.
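
The reordering described in the note can be sketched as follows. This is a minimal illustration, not the code inside `get_stations`; the helper name `wrfxpy_to_synoptic_bbox` is hypothetical.

```python
def wrfxpy_to_synoptic_bbox(bbox):
    """Reorder [min_lat, min_lon, max_lat, max_lon] (wrfxpy convention)
    into [min_lon, min_lat, max_lon, max_lat] (Synoptic convention).

    Hypothetical helper for illustration only.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    return [min_lon, min_lat, max_lon, max_lat]


# Same bounding box as used in this notebook
synoptic_bbox = wrfxpy_to_synoptic_bbox([40, -105, 45, -100])
print(synoptic_bbox)
```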

In [None]:
sts = rr.get_stations(bbox)

print(sts["stid"])

## API Weather Data Time Series

Time series of observations are drawn for a single RAWS using the `SynopticPy` package. Then, the data are formatted by custom functions in the `ingest.RAWS` module.

We subtract one hour from the start time and add one hour to the end time. Most stations report some number of minutes after the hour, so a request starting at 1:00 returns only observations after that time, and the temporal interpolation procedure described below would otherwise have to extrapolate at the endpoints. Padding the window by one hour on each side avoids this. Note that if the padded start time is more than 1 year in the past, the API truncates the request to 1 year. The module has a metadata file listing all RAWS weather variables relevant to FMC modeling.
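
As a minimal sketch of the padding step, the snippet below widens a request window by one hour on each side. The helper `pad_window` is hypothetical, and it uses the standard library `timedelta` in place of `relativedelta`, which behaves the same for a fixed offset in hours.

```python
from datetime import datetime, timedelta, timezone


def pad_window(start, end, hours=1):
    """Return (start - hours, end + hours).

    Hypothetical helper: gives the interpolation step observations
    on both sides of the target time range.
    """
    pad = timedelta(hours=hours)
    return start - pad, end + pad


start = datetime(2024, 6, 1, 0, tzinfo=timezone.utc)
end = datetime(2024, 6, 1, 5, tzinfo=timezone.utc)
padded_start, padded_end = pad_window(start, end)
print(padded_start.isoformat(), padded_end.isoformat())
```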

The `raws_metadata` file has a list of "static" variables that are unchanging in time. In the data returned by SynopticPy, these variables are arranged differently than the time-dynamic weather sensor variables, which are also listed in the metadata file. Module functions combine these two types of variables into one tabular dataframe.

The data is returned in "long" format, where each weather variable has its own row. We restructure the data into "wide" format with the module function `format_raws` so that a single row corresponds to one time, and the columns correspond to different data variables. Additionally, this function converts units and returns a dictionary of all units for the variables. 
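
The long-to-wide restructuring can be illustrated with a toy pandas example. This is only a sketch of the idea, not the implementation of `format_raws`: the column names (`date_time`, `variable`, `value`, `units`) and all values below are assumptions made up for illustration, and the real function also converts units.

```python
import pandas as pd

# Toy "long" data: one row per (time, variable) pair. Values are invented.
long_df = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2024-06-01 00:00", "2024-06-01 00:00",
         "2024-06-01 01:00", "2024-06-01 01:00"], utc=True),
    "variable": ["fuel_moisture", "air_temp", "fuel_moisture", "air_temp"],
    "value": [8.1, 21.5, 8.4, 20.9],
    "units": ["gm", "Celsius", "gm", "Celsius"],
})

# "Wide" data: one row per time, one column per variable
wide_df = long_df.pivot(index="date_time", columns="variable", values="value")
wide_df = wide_df.reset_index()
wide_df.columns.name = None

# Keep the units separately as a dictionary keyed by variable name
units = dict(zip(long_df["variable"], long_df["units"]))

print(wide_df)
print(units)
```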

In [None]:
weather_vars = rr.raws_meta["raws_weather_vars"]
df_temp = synoptic.TimeSeries(
        stid="HSYN1",
        start=start-relativedelta(hours=1),
        end=end+relativedelta(hours=1),
        vars=weather_vars,
        units = "metric"
    ).df()

df_temp

In [None]:
dat, units = rr.format_raws(df_temp)

In [None]:
units

In [None]:
dat

We then loop over the station IDs found in the previous step and retrieve all available data and then rename and pivot from long to wide. The loop generates a dictionary for each RAWS station with keys for weather data and other metadata.

*NOTE*: this process is not parallelized, as the same IP address is used for each request and parallelization may result in issues.

In [None]:
print(f"Attempting retrieval of RAWS from {start} to {end} within {bbox}")
print("~"*75)

raws_dict = {}

for st in sts["stid"]:
    print("~"*50)
    print(f"Attempting retrieval of station {st}")
    try:
        df = synoptic.TimeSeries(
            stid=st,
            start=start-relativedelta(hours=1),
            end=end+relativedelta(hours=1),
            vars=weather_vars,
            units = "metric"
        ).df()
    
        dat, units = rr.format_raws(df)
        loc = rr.get_static(sts, st)
        raws_dict[st] = {
            'RAWS': dat,
            'units': units,
            'loc': loc,
            'misc': "Data retrieved using `synoptic.TimeSeries` and formatted with custom functions within `ml_fmda` project."
        }
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
raws_dict.keys()

### Fix Time, Interpolate, and Calculate Rain

Synoptic may return RAWS data that has missing hours or is returned not exactly on the hour. The missing hours are simply absent in the return data from Synoptic, not marked by NaN. We fix that by filling in NaN for missing hours and interpolating to the exact hour. The resulting data should have regular hourly observations for every RAWS station. If Synoptic returns only a small number of observations, the interpolation process may create long stretches of perfectly linear data from the interpolation. These stretches of suspect data are flagged and filtered in a later stage of the data processing in this project, since the hyperparameters controlling that filtering may be changed but the underlying retrieval and interpolation would be unchanged.
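
The idea behind the temporal interpolation can be sketched with `np.interp`. This is a simplified illustration for a single variable, not the implementation of `time_intp_df`; the helper `interp_to_hours` and the observation values are hypothetical. Note that `np.interp` holds the first and last observed values constant outside the observed range, which is why the request window is padded.

```python
from datetime import datetime, timedelta, timezone

import numpy as np


def interp_to_hours(obs_times, obs_values, target_times):
    """Linearly interpolate observations onto target times.

    Hypothetical helper: times are converted to POSIX seconds so that
    np.interp can work with plain floats.
    """
    x_obs = np.array([t.timestamp() for t in obs_times])
    x_new = np.array([t.timestamp() for t in target_times])
    return np.interp(x_new, x_obs, obs_values)


t0 = datetime(2024, 6, 1, 0, tzinfo=timezone.utc)

# Observations arrive 10 minutes past the hour; the 01:10 report is missing
obs_times = [t0 + timedelta(minutes=m) for m in (-50, 10, 130)]
obs_fm = [10.0, 11.0, 13.0]

# Regular hourly grid: 00:00, 01:00, 02:00
target_times = [t0 + timedelta(hours=h) for h in range(3)]

fm_hourly = interp_to_hours(obs_times, obs_fm, target_times)
print(fm_hourly)  # approximately [10.83, 11.83, 12.83]
```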

This is also a good place in the code to rename variables. Different data sources use different variable names, so we standardize them with the naming conventions from the metadata files.
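
A renaming step of this kind can be sketched with a plain dictionary comprehension. The function `rename_keys` below is a hypothetical stand-in, not the project's `rename_dict`, and the mapping and unit values are invented for illustration rather than taken from `raws_metadata.yaml`.

```python
def rename_keys(d, mapping):
    """Return a copy of d with keys renamed according to mapping.

    Keys that do not appear in mapping are kept unchanged.
    Hypothetical stand-in for a helper like `rename_dict`.
    """
    return {mapping.get(key, key): value for key, value in d.items()}


# Hypothetical mapping from source names to standardized names
mapping = {"fuel_moisture": "fm", "air_temp": "temp"}

units = {"fuel_moisture": "%", "air_temp": "K", "elevation": "m"}
renamed_units = rename_keys(units, mapping)
print(renamed_units)
```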

In [None]:
times = time_range(start, end, freq="1h")
times

In [None]:
print(raws_dict["BRLW4"]["RAWS"].shape)

In [None]:
df2 = rr.time_intp_df(raws_dict["BRLW4"]["RAWS"], times)
df2

In [None]:
print(df2.shape)

We now loop over all stations and run temporal interpolation. We also convert to pandas for easier pickle write.

In [None]:
print(f"Interpolating dataframe in time from {times.min()} to {times.max()}")
rename=True
if rename:
    print(f"Renaming RAWS columns based on raws_metadata file")
for st in raws_dict:
    print("~"*75)
    print(st)
    nsteps = raws_dict[st]["RAWS"].shape[0]
    raws_dict[st]["RAWS"] = rr.time_intp_df(raws_dict[st]["RAWS"], times)
    raws_dict[st]["RAWS"] = pd.DataFrame(raws_dict[st]["RAWS"], columns = raws_dict[st]["RAWS"].columns)
    raws_dict[st]["times"] = times
    if raws_dict[st]["RAWS"].shape[0] != nsteps:
        raws_dict[st]["misc"] += " Interpolated data with numpy linear interpolation."
        print(f"    Original Dataframe time steps: {nsteps}")
        print(f"    Interpolated DataFrame time steps: {raws_dict[st]['RAWS'].shape[0]}")
        print(f"        interpolated {raws_dict[st]['RAWS'].shape[0] - nsteps} time steps")
    if rename:
        raws_dict[st]["units"] = rename_dict(raws_dict[st]["units"], raws_meta["rename_synoptic"])
        raws_dict[st]["RAWS"] = raws_dict[st]["RAWS"].rename(columns = raws_meta["rename_synoptic"])
        raws_dict[st]["loc"] = rename_dict(raws_dict[st]["loc"], raws_meta["rename_synoptic"])

In [None]:
raws_dict[st].keys()

In [None]:
raws_dict[st]["units"]

In [None]:
raws_dict[st]["loc"]

In [None]:
raws_dict[st]["misc"]

In [None]:
raws_dict[st]["RAWS"]

### Using Module Wrapper

The module function `build_raws_dict_api` combines the previous steps. The resulting dictionary should be the same as the one built above.
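
When comparing two such dictionaries, keep in mind that elementwise `==` treats `NaN` as unequal to itself, so data with missing values can compare as different even when identical. A minimal sketch of a NaN-aware comparison, assuming the weather data are pandas DataFrames stored under a `"RAWS"` key (the helper `raws_dicts_equal` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd


def raws_dicts_equal(d1, d2, key="RAWS"):
    """Check that two station dictionaries have the same station IDs
    and identical DataFrames under `key`.

    DataFrame.equals treats NaNs in the same location as equal.
    """
    if d1.keys() != d2.keys():
        return False
    return all(d1[st][key].equals(d2[st][key]) for st in d1)


# Toy data with a missing value
df = pd.DataFrame({"fm": [10.0, np.nan, 12.0]})
a = {"ST1": {"RAWS": df}}
b = {"ST1": {"RAWS": df.copy()}}

nan_aware = raws_dicts_equal(a, b)
elementwise = bool((a["ST1"]["RAWS"] == b["ST1"]["RAWS"]).all().all())
print(nan_aware, elementwise)
```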

In [None]:
raws_dict2 = rr.build_raws_dict_api(start, end, bbox, save_path = "../data/raws_test1.pkl")

In [None]:
# Compare dicts
np.all(raws_dict.keys() == raws_dict2.keys())

In [None]:
np.all(raws_dict["WPKS2"]["RAWS"] == raws_dict2["WPKS2"]["RAWS"])

## RAWS Stash

The stash is intended for older data that the free Synoptic token cannot return. However, the stash needs to be unzipped before use and may not contain the latest data. Additionally, the stash only includes 10-h dead FMC observations; saving the other sensor variables in the stash is a work in progress. As of Jan 2025, this process returns only dead FMC.

In [None]:
start = str2time('2023-01-01T00:00:00Z')
end = str2time('2023-01-01T05:00:00Z')

### Get stash file paths

Given a date range, `get_file_paths` returns a list of file paths to read from the stash. As before, we subtract an hour from the start and add an hour to the end to give the interpolation procedure endpoints outside the target time range. The file directories are arranged by year and Julian day (0-366). Each individual file covers a single day and contains all RAWS available in CONUS at that time, saved as a pickle file.
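
One consequence of the padding is that a request starting at midnight also needs the previous day's file. The sketch below lists the unique (year, day-of-year) pairs covered by a padded hourly range. It does not reproduce `get_file_paths` or assume any stash directory layout; the helper `days_needed` is hypothetical, and Python's `tm_yday` counts days starting from 1.

```python
from datetime import datetime, timedelta, timezone


def days_needed(start, end):
    """Return the sorted unique (year, day_of_year) pairs touched by
    hourly steps from start to end, inclusive.

    Hypothetical helper for illustration only.
    """
    days = set()
    t = start
    while t <= end:
        days.add((t.year, t.timetuple().tm_yday))
        t += timedelta(hours=1)
    return sorted(days)


start = datetime(2023, 1, 1, 0, tzinfo=timezone.utc)
end = datetime(2023, 1, 1, 5, tzinfo=timezone.utc)

# Pad by one hour on each side, as described above
pad = timedelta(hours=1)
days = days_needed(start - pad, end + pad)
print(days)
```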

In [None]:
times = time_range(start-relativedelta(hours=1), end+relativedelta(hours=1))

rr.get_file_paths(times)

### Build Dictionary

The process calls the `get_stations` function shown above (the one time where the API is used here), then loops through the files listed above and extracts data for the needed stations into a nested dictionary format that matches the format above.

In [None]:
import importlib
import ingest.RAWS
importlib.reload(ingest.RAWS)
import ingest.RAWS as rr 

In [None]:
start

In [None]:
end

In [None]:
raws_dict3 = rr.build_raws_dict_stash(start, end, bbox, save_path = "../data/raws_test2.pkl")

In [None]:
raws_dict3.keys()

In [None]:
raws_dict3["BRLW4"].keys()

In [None]:
raws_dict3["BRLW4"]["units"]

In [None]:
raws_dict3["BRLW4"]["loc"]

In [None]:
raws_dict3["BRLW4"]["RAWS"]

In [None]:
raws_dict3["BRLW4"]["misc"]