# Data Ingest of 10-h Fuel Moisture Content

This notebook demonstrates retrieval and filtering of 10-h dead fuel moisture content (FMC) data from Remote Automated Weather Stations (RAWS). Retrieval of 10-h FMC observations is done with the software package `SynopticPy` and a stash of RAWS data kept and maintained by the broader OpenWFM community. This notebook uses `SynopticPy` with a free token, so limits apply to the number of sensor hours that can be requested, and only records within the past year are freely available.

The module `ingest/retrieve_raws_api.py` has an executable section and is meant to be run from the command line within this project. Here, its functions are called individually to demonstrate what they do.

In automated processes, the time frame and spatial domain for data ingest are controlled by the configuration files `training_data_config.json` or `forecast_config.json`. In this notebook, we enter the time frame and spatial domain variables manually.

The main steps in the retrieval are:
* Use `synoptic.Metadata` to determine the RAWS with FMC data in the given spatial domain and time frame
* Use `synoptic.TimeSeries` to retrieve all available data that may be relevant to FMC modeling. *NOTE:* stations are selected only if they have FMC data, and any other available variables are then collected as a bonus. These data are used for exploratory purposes and quality control checks, but predictors for final modeling come from HRRR.

## References

For more information on the Python library API, see Brian Blaylock's `SynopticPy` [Python package](https://github.com/blaylockbk/SynopticPy).

For more information on available Synoptic RAWS variables, see the [Synoptic Data](https://demos.synopticdata.com/variables/index.html) documentation.

## Setup

In [1]:
# import matplotlib.pyplot as plt
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
sys.path.append('../src')
from utils import Dict, read_yml
# from utils import Dict, time_intp
import ingest.retrieve_raws_api as rfuncs

In [2]:
end = datetime.now(timezone.utc)
end = end.replace(minute=0, second=0, microsecond=0)
start = end - relativedelta(months=2)
bbox = [40, -105, 45, -100] # subset of the Rocky Mountain GACC region, in wrfxpy order [min_lat, min_lon, max_lat, max_lon]


print(f"Start Date of retrieval: {start}")
print(f"End Date of retrieval: {end}")
print(f"Spatial Domain: {bbox}")

Start Date of retrieval: 2024-10-17 21:00:00+00:00
End Date of retrieval: 2024-12-17 21:00:00+00:00
Spatial Domain: [40, -105, 45, -100]


## Station Metadata

*Note*: the bounding box format used in `wrfxpy` is `[min_lat, min_lon, max_lat, max_lon]`, but the bounding box format used by Synoptic is `[min_lon, min_lat, max_lon, max_lat]`. The code assumes the `wrfxpy` format and converts internally.

In [3]:
bbox_reordered = [bbox[1], bbox[0], bbox[3], bbox[2]]

In [4]:
bbox_reordered

[-105, 40, -100, 45]
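
The reordering above is easy to get wrong silently, since a swapped box is still four valid numbers. As a minimal sketch, the conversion could be wrapped in a small helper with range checks. The function name `to_synoptic_bbox` is hypothetical and is not part of `retrieve_raws_api` or `SynopticPy`.

```python
def to_synoptic_bbox(bbox):
    """Reorder [min_lat, min_lon, max_lat, max_lon] (wrfxpy order)
    into [min_lon, min_lat, max_lon, max_lat] (Synoptic order).

    Hypothetical helper for illustration only.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four elements")
    min_lat, min_lon, max_lat, max_lon = bbox
    # Range checks catch the most common mistake: passing a box that is
    # already in Synoptic order, which puts a longitude in a latitude slot.
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError(f"invalid latitude range: {min_lat}, {max_lat}")
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError(f"invalid longitude range: {min_lon}, {max_lon}")
    return [min_lon, min_lat, max_lon, max_lat]


print(to_synoptic_bbox([40, -105, 45, -100]))  # [-105, 40, -100, 45]
```

Passing the already-reordered box `[-105, 40, -100, 45]` to this helper raises a `ValueError`, because -105 is not a valid latitude.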

In [5]:
sts = synoptic.Metadata(
    bbox=bbox_reordered,
    vars=["fuel_moisture"], # Note we only want to include stations with FMC. Other "raws_vars" are bonus
    obrange=(start, end),
).df()

🚚💨 Speedy delivery from Synoptic's metadata service.
📦 Received data from 29 stations.


In [6]:
sts

id,stid,name,elevation,latitude,longitude,mnet_id,state,timezone,elev_dem,period_of_record_start,period_of_record_end,is_restricted,restricted_metadata,is_active
u32,str,str,f64,f64,f64,u32,str,str,f64,"datetime[μs, UTC]","datetime[μs, UTC]",bool,bool,bool
2438,"""BRLW4""","""BEAR LODGE""",5280.0,44.59722,-104.42806,2,"""WY""","""America/Denver""",5236.2,1998-07-28 00:00:00 UTC,2024-12-17 20:55:00 UTC,false,false,true
3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 20:23:00 UTC,false,false,true
3811,"""HRSN1""","""KINGS CANYON""",4080.0,42.72361,-102.97167,2,"""NE""","""America/Denver""",4124.0,2002-04-18 00:00:00 UTC,2024-12-17 20:22:00 UTC,false,false,true
3812,"""SBFN1""","""SCOTTS BLUFF""",4224.0,41.82944,-103.70806,2,"""NE""","""America/Denver""",4127.3,2002-04-18 00:00:00 UTC,2024-12-17 20:34:00 UTC,false,false,true
3815,"""DOHS2""","""BAKER PARK""",4674.0,43.97917,-103.425,2,"""SD""","""America/Denver""",4681.8,2002-04-18 00:00:00 UTC,2024-12-17 20:35:00 UTC,false,false,true
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
63602,"""MTRN1""","""MONTROSE""",3734.0,42.92297,-103.70964,2,"""NE""","""America/Denver""",3723.8,2017-11-30 20:26:00 UTC,2024-12-17 20:24:00 UTC,false,false,true
63604,"""MKVN1""","""MCKELVIE""",3060.0,42.6894,-101.12824,2,"""NE""","""America/Denver""",3044.6,2019-02-04 22:51:00 UTC,2024-12-17 20:49:00 UTC,false,false,true
89644,"""TT562""","""NORTH STERLING""",4066.0,40.78858,-103.26281,2,"""CO""","""America/Denver""",4071.5,2019-11-07 20:51:00 UTC,2024-12-17 20:45:00 UTC,false,false,true
89649,"""TT567""","""CROW CREEK""",4850.0,40.65013,-104.3375,2,"""CO""","""America/Denver""",4849.1,2020-04-28 20:12:00 UTC,2024-12-17 21:07:00 UTC,false,false,true


## Station Time Series

Time series of observations are retrieved for a single RAWS using the `SynopticPy` package. The data are then formatted by custom functions in the `retrieve_raws_api` module. We subtract one hour from the start time because most stations report some number of minutes after the hour: if you request data starting at 1:00, the API returns only observations after that time, so the temporal interpolation procedure described below would have to extrapolate at the end points. Shifting the start time back by 1 hour accounts for this, but if the start time is more than 1 year in the past, the API will truncate the request to 1 year.
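
The start-time padding and the one-year limit can be sketched together as below. This is only an illustration: the helper name `padded_start` is hypothetical, and the 365-day window is an assumed approximation of the free-tier limit, not a value taken from the Synoptic API.

```python
from datetime import datetime, timedelta, timezone


def padded_start(start, now, pad=timedelta(hours=1), max_lookback=timedelta(days=365)):
    """Shift `start` back by `pad`, but never earlier than `now - max_lookback`.

    Hypothetical helper; `max_lookback` is an assumed stand-in for the
    one-year limit on freely available records.
    """
    earliest_allowed = now - max_lookback
    return max(start - pad, earliest_allowed)


example_now = datetime(2024, 12, 17, 21, tzinfo=timezone.utc)
example_start = datetime(2024, 10, 17, 21, tzinfo=timezone.utc)
print(padded_start(example_start, example_now))  # 2024-10-17 20:00:00+00:00
```

When the padded start would fall outside the lookback window, the helper returns the window edge instead, so the first requested hour may still lack an observation before it.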

In [7]:
weather_vars = rfuncs.raws_vars_dict["raws_weather_vars"]
df_temp = synoptic.TimeSeries(
        stid="HSYN1",
        start=start-relativedelta(hours=1),
        end=end,
        vars=weather_vars,
        units = "metric"
    ).df()

df_temp

🚚💨 Speedy delivery from Synoptic's timeseries service.
📦 Received data from 1 stations.


date_time,variable,sensor_index,is_derived,value,units,id,stid,name,elevation,latitude,longitude,mnet_id,state,timezone,elev_dem,period_of_record_start,period_of_record_end,is_restricted,restricted_metadata,is_active
"datetime[μs, UTC]",str,u32,bool,f64,str,u32,str,str,f64,f64,f64,u32,str,str,f64,"datetime[μs, UTC]","datetime[μs, UTC]",bool,bool,bool
2024-10-17 20:23:00 UTC,"""air_temp""",1,false,25.556,"""Celsius""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-10-17 21:23:00 UTC,"""air_temp""",1,false,26.111,"""Celsius""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-10-17 22:23:00 UTC,"""air_temp""",1,false,24.444,"""Celsius""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-10-17 23:23:00 UTC,"""air_temp""",1,false,23.889,"""Celsius""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-10-18 00:23:00 UTC,"""air_temp""",1,false,23.333,"""Celsius""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2024-12-17 16:23:00 UTC,"""fuel_moisture""",1,false,9.0,"""gm""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-12-17 17:23:00 UTC,"""fuel_moisture""",1,false,9.1,"""gm""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-12-17 18:23:00 UTC,"""fuel_moisture""",1,false,9.0,"""gm""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true
2024-12-17 19:23:00 UTC,"""fuel_moisture""",1,false,9.0,"""gm""",3807,"""HSYN1""","""BESSEY""",2873.0,41.89722,-100.31056,2,"""NE""","""America/Chicago""",2841.2,2002-04-18 00:00:00 UTC,2024-12-17 21:23:00 UTC,false,false,true


In [8]:
dat, units = rfuncs.format_raws(df_temp)

Found 1465 FMC records
Converting RAWS air temp from C to K
Converting RAWS elevation from ft to meters


In [9]:
units

{'air_temp': 'Kelvin',
 'relative_humidity': '%',
 'precip_accum': 'Millimeters',
 'fuel_moisture': 'gm',
 'wind_speed': 'm/s',
 'solar_radiation': 'W/m**2',
 'wind_direction': 'Degrees',
 'elevation': 'm'}

In [None]:
dat

We then loop over the station IDs found in the previous step, retrieve all available data for each station, and then rename and pivot from long to wide.

*NOTE*: this process is not parallelized, as the same IP address is used for each request and parallelization may result in issues.
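
To illustrate what pivoting from long to wide means for this data, here is a minimal sketch using pandas on a toy table with the same `date_time`, `variable`, and `value` columns as the Synoptic output shown earlier. It is not the implementation of `rfuncs.format_raws`, whose internals are not shown in this notebook, and the fuel moisture values are made up for the example.

```python
import pandas as pd

# Toy long-format data: one row per (time, variable) pair.
# The fuel_moisture values are invented for illustration.
long_df = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2024-10-17 20:23", "2024-10-17 21:23",
         "2024-10-17 20:23", "2024-10-17 21:23"],
        utc=True,
    ),
    "variable": ["air_temp", "air_temp", "fuel_moisture", "fuel_moisture"],
    "value": [25.556, 26.111, 8.5, 8.7],
})

# Wide format: one row per time, one column per variable.
wide_df = (
    long_df.pivot(index="date_time", columns="variable", values="value")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "variable" axis label

print(wide_df)
```

The wide table has one row per observation time, which is the layout the temporal interpolation step later in this notebook works on.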

In [None]:
print(f"Attempting retrieval of RAWS from {start} to {end} within {bbox}")
print("~"*75)

raws_dict = {}

for st in sts['stid']:
    print("~"*50)
    print(f"Attempting retrieval of station {st}")
    df = synoptic.TimeSeries(
        stid=st,
        start=start-relativedelta(hours=1),
        end=end,
        vars=weather_vars,
        units = "metric"
    ).df()
    
    dat, units = rfuncs.format_raws(df)
    loc = rfuncs.get_static(dat)
    raws_dict[st] = {
        'RAWS': dat,
        'units': units,
        'loc': loc,
        'misc': "Data retrieved using `synoptic.TimeSeries` and formatted with custom functions within `ml_fmda` project."
    }

In [None]:
raws_dict.keys()

In [None]:
st = [*raws_dict.keys()][0]
raws_dict[st].keys()

In [None]:
raws_dict[st]['loc']

In [None]:
raws_dict[st]['units']

In [None]:
raws_dict[st]['misc']

## Fix Time and Interpolate

Synoptic may return RAWS data with missing hours or with observations that do not fall exactly on the hour. The missing hours are simply absent from the returned data, not marked by NaN. We fix that by filling in NaN for missing hours and interpolating to the exact hour. The resulting data should have regular hourly observations for every RAWS station.
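
The idea can be sketched with `np.interp` on a toy series. This is an illustration under stated assumptions, not the implementation of `rfuncs.time_intp_df`: the observation values are made up, reports arrive at 23 minutes past the hour, and the 22:23 report is missing.

```python
import numpy as np

# Toy observations at 20:23, 21:23, and 23:23 (the 22:23 report is absent).
obs_times = np.array(
    ["2024-10-17T20:23", "2024-10-17T21:23", "2024-10-17T23:23"],
    dtype="datetime64[s]",
)
obs_vals = np.array([10.0, 11.0, 13.0])

# Regular hourly grid we want values on.
hourly_times = np.array(
    ["2024-10-17T21:00", "2024-10-17T22:00", "2024-10-17T23:00"],
    dtype="datetime64[s]",
)

# np.interp needs numeric x values, so convert to seconds since the epoch.
obs_sec = obs_times.astype("int64")
hourly_sec = hourly_times.astype("int64")

# Linear interpolation onto the exact hours; the missing hour is filled too.
hourly_vals = np.interp(hourly_sec, obs_sec, obs_vals)
print(hourly_vals)
```

Note that `np.interp` holds the first or last observed value constant for target times outside the observed range, so a target hour with no observation before it is not truly interpolated.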

In [None]:
times = pl.datetime_range(
    start=start,
    end=end,
    interval="1h",
    time_zone = "UTC",
    eager=True
).alias("time")
# times = np.array([dt.strftime("%Y-%m-%dT%H:%M:%SZ") for dt in times.to_list()])
times = np.array(times.to_list())

In [None]:
df2 = rfuncs.time_intp_df(raws_dict["BRLW4"]["RAWS"], times)
df2

We now loop over all stations and run temporal interpolation. We also convert to pandas for easier pickle write.

In [None]:
print(f"Interpolating dataframe in time from {times.min()} to {times.max()}")
for st in raws_dict:
    print("~"*75)
    print(st)
    nsteps = raws_dict[st]["RAWS"].shape[0]
    raws_dict[st]["RAWS"] = rfuncs.time_intp_df(raws_dict[st]["RAWS"], times)
    raws_dict[st]["RAWS"] = pd.DataFrame(raws_dict[st]["RAWS"], columns = raws_dict[st]["RAWS"].columns)
    raws_dict[st]["times"] = times
    if raws_dict[st]["RAWS"].shape[0] != nsteps:
        raws_dict[st]["misc"] += " Interpolated data with numpy linear interpolation."
        print(f"    Original Dataframe time steps: {nsteps}")
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        print(f"    Interpolated DataFrame time steps: {raws_dict[st]['RAWS'].shape[0]}")
        print(f"        interpolated {raws_dict[st]['RAWS'].shape[0] - nsteps} time steps")

In [None]:
# import pickle
# with open("../data/raws_test.pkl", 'wb') as file:
#     pickle.dump(raws_dict, file)
# with open("../data/raws_test.pkl", "rb") as f:
#     dat = pickle.load(f)
# print(dat.keys())
# dat["BRLW4"]["RAWS"]