# Data Ingest of 10-h Fuel Moisture Content

This notebook demonstrates retrieval and filtering of 10-h dead fuel moisture content (FMC) data from Remote Automatic Weather Stations (RAWS). Retrieval of 10-h FMC observations is done with the software package `SynopticPy` and a stash of RAWS data kept and maintained by the broader OpenWFM community. This notebook demonstrates use of `SynopticPy` with a free token, which limits the number of sensor hours that can be requested. Only records from the past year are freely available.

The module `ingest/retrieve_raws_api.py` has an executable section and is intended to be run from the command line within this project. Here, its functions are called individually to demonstrate what each one does.

In automated processes, the time frame and spatial domain for data ingest are controlled by the configuration files `training_data_config.json` or `forecast_config.json`.

The main steps in the retrieval are:
* Use `synoptic.Metadata` to determine the RAWS with FMC data in the given spatial domain and time frame.
* Use `synoptic.TimeSeries` to retrieve all available data that may be relevant to FMC modeling. *NOTE:* stations are selected only if they have FMC data, and any other available variables are then collected as a bonus. These data are used for exploratory purposes and quality control checks, but predictors for final modeling come from HRRR.
* Format data and convert units.
* Identify missing data and fill it using linear interpolation from numpy.

The module has a main wrapper function `build_raws_dict` that puts all the steps together. In this notebook, we demonstrate the individual steps with the module functions, then run the main wrapper function at the end and check that the results match.

## References

For more info on the Python library API, see Brian Blaylock's `SynopticPy` [Python package](https://github.com/blaylockbk/SynopticPy).

For more info on available Synoptic RAWS variables, see the [Synoptic Data](https://demos.synopticdata.com/variables/index.html) documentation.

## Setup

In [None]:
# import matplotlib.pyplot as plt
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
sys.path.append('../src')
from utils import Dict, read_yml, read_pkl, str2time, rename_dict
import ingest.retrieve_raws_api as rfuncs

In [None]:
raws_meta = read_yml("../etc/variable_metadata/raws_metadata.yaml")

with open("../etc/training_data_config.json", "r") as json_file:
    config = json.load(json_file)   
    config = Dict(config)

In [None]:
config

In [None]:
# End result should be the same as this...
raws_dict = rfuncs.build_raws_dict(config)

## Station Metadata

We use `SynopticPy` to get a list of all RAWS stations within the bounding box that report fuel moisture data in the given time period.

*Note*: the bounding box format used in `wrfxpy` is `[min_lat, min_lon, max_lat, max_lon]`, whereas the bounding box format used by Synoptic is `[min_lon, min_lat, max_lon, max_lat]`. The code will assume the `wrfxpy` format and convert internally.
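
The reordering between the two conventions can be sketched as a small standalone helper. The function name `wrfxpy_to_synoptic_bbox` and the example coordinates are hypothetical and are not part of the `retrieve_raws_api` module; the sketch only illustrates the index swap described above, with a basic sanity check added as an assumption.

```python
def wrfxpy_to_synoptic_bbox(bbox):
    """Reorder [min_lat, min_lon, max_lat, max_lon] to [min_lon, min_lat, max_lon, max_lat].

    Hypothetical helper for illustration only.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four elements")
    min_lat, min_lon, max_lat, max_lon = bbox
    # Sanity check (an added assumption): minimums should not exceed maximums
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("bbox minimums must not exceed maximums")
    return [min_lon, min_lat, max_lon, max_lat]


# Made-up example box in the wrfxpy ordering
example_bbox = [40.0, -111.0, 45.0, -104.5]
synoptic_bbox = wrfxpy_to_synoptic_bbox(example_bbox)
print(synoptic_bbox)
```

Unpacking into named variables, rather than indexing by position, makes it harder to confuse latitude and longitude when switching between the two formats.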

In [None]:
start = str2time(config.start_time)
end = str2time(config.end_time)
bbox = config.bbox
bbox_reordered = [bbox[1], bbox[0], bbox[3], bbox[2]]

In [None]:
sts = rfuncs.get_stations(bbox_reordered)

print(sts["stid"])

## Station Weather Data Time Series

Time series of observations are retrieved for a single RAWS using the `SynopticPy` package. The data are then formatted by custom functions in the `retrieve_raws_api` module. We subtract one hour from the start time because most stations report some number of minutes after the hour, so a request starting at 1:00 returns only data after that time. Without the shift, the temporal interpolation procedure described below would have to extrapolate at the end points. Shifting the start time by 1 hour accounts for this, but if the start time is more than 1 year in the past the API will truncate to 1 year. The module has a metadata file with a list of all RAWS weather variables relevant to FMC modeling.
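
As a rough illustration of the one-hour shift and the one-year limit, the start time the request effectively covers can be estimated as below. The helper `effective_start` is hypothetical, and the 365-day window is an assumption standing in for "one year". The truncation itself is applied by the API; this sketch only mimics it on the client side.

```python
from datetime import datetime, timedelta, timezone


def effective_start(start, now, pad=timedelta(hours=1), max_age=timedelta(days=365)):
    """Estimate the earliest time a padded request would cover.

    Hypothetical helper: `max_age` approximates the one-year limit of the free token.
    """
    padded = start - pad            # shift back so the first target hour is bracketed by data
    oldest_allowed = now - max_age  # assumed cutoff for freely available records
    return max(padded, oldest_allowed)


now = datetime(2025, 6, 1, tzinfo=timezone.utc)

# Recent request: the one-hour pad applies in full
recent = effective_start(datetime(2025, 5, 1, tzinfo=timezone.utc), now)

# Request older than the assumed cutoff: the start is clamped
old = effective_start(datetime(2023, 1, 1, tzinfo=timezone.utc), now)

print(recent, old)
```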

The data are returned in "long" format, where each weather variable has its own row. We restructure the data into "wide" format with the module function `format_raws` so that a single row corresponds to one time, and the columns correspond to different data variables. Additionally, this function converts units and returns a dictionary of all units for the variables.
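
The long-to-wide restructuring can be illustrated on a toy table. This is a minimal sketch using pandas, not the implementation of `format_raws`, which may work differently. The column names `date_time`, `variable`, and `value` and all numbers are made up for the example.

```python
import pandas as pd

# Toy "long" table: one row per (time, variable) pair
long_df = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2024-01-01 00:02", "2024-01-01 00:02",
         "2024-01-01 01:02", "2024-01-01 01:02"],
        utc=True,
    ),
    "variable": ["fuel_moisture", "air_temp", "fuel_moisture", "air_temp"],
    "value": [11.2, 3.5, 11.0, 3.1],
})

# Pivot to "wide": one row per time, one column per variable
wide_df = long_df.pivot(index="date_time", columns="variable", values="value").reset_index()
wide_df.columns.name = None  # drop the leftover "variable" label on the column index

print(wide_df)
```

After the pivot, each observation time appears once, and variables that a station does not report at a given time would show up as missing values in that row.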

In [None]:
weather_vars = rfuncs.raws_meta["raws_weather_vars"]
df_temp = synoptic.TimeSeries(
        stid="HSYN1",
        start=start-relativedelta(hours=1),
        end=end,
        vars=weather_vars,
        units = "metric"
    ).df()

df_temp

In [None]:
dat, units = rfuncs.format_raws(df_temp)

In [None]:
units

In [None]:
dat

We then loop over the station IDs found in the previous step and retrieve all available data and then rename and pivot from long to wide. The loop generates a dictionary for each RAWS station with keys for weather data and other metadata.

*NOTE*: this process is not parallelized, as the same IP address is used for each request and parallelization may result in issues.

In [None]:
print(f"Attempting retrieval of RAWS from {start} to {end} within {bbox}")
print("~"*75)

raws_dict = {}

for st in sts["stid"]:
    print("~"*50)
    print(f"Attempting retrieval of station {st}")
    try:
        df = synoptic.TimeSeries(
            stid=st,
            start=start-relativedelta(hours=1),
            end=end,
            vars=weather_vars,
            units = "metric"
        ).df()
    
        dat, units = rfuncs.format_raws(df)
        loc = rfuncs.get_static(sts, st)
        raws_dict[st] = {
            'RAWS': dat,
            'units': units,
            'loc': loc,
            'misc': "Data retrieved using `synoptic.TimeSeries` and formatted with custom functions within `ml_fmda` project."
        }
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
raws_dict.keys()

In [None]:
st = [*raws_dict.keys()][0]
raws_dict[st].keys()

## Fix Time, Interpolate, and Calculate Rain

Synoptic may return RAWS data that has missing hours or is returned not exactly on the hour. The missing hours are simply absent in the return data from Synoptic, not marked by NaN. We fix that by filling in NaN for missing hours and interpolating to the exact hour. The resulting data should have regular hourly observations for every RAWS station.
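
The idea of interpolating irregular observations onto exact hours can be sketched with `numpy.interp`. This is a minimal illustration under stated assumptions, not the implementation of `time_intp_df`; the helper name `interp_to_hours` and the observation values are hypothetical.

```python
from datetime import datetime, timezone

import numpy as np


def interp_to_hours(obs_times, obs_values, target_times):
    """Linearly interpolate irregular observations onto target times.

    Hypothetical helper. Note that np.interp does not extrapolate: targets
    outside the observed range take the nearest end-point value.
    """
    x_obs = np.array([t.timestamp() for t in obs_times])     # seconds since epoch
    x_new = np.array([t.timestamp() for t in target_times])
    return np.interp(x_new, x_obs, np.asarray(obs_values, dtype=float))


utc = timezone.utc
# Observations at half past the hour, with the 02:30 record absent
obs_times = [
    datetime(2024, 1, 1, 0, 30, tzinfo=utc),
    datetime(2024, 1, 1, 1, 30, tzinfo=utc),
    datetime(2024, 1, 1, 3, 30, tzinfo=utc),
]
obs_values = [10.0, 12.0, 16.0]

# Regular hourly grid to interpolate onto
target_times = [datetime(2024, 1, 1, h, tzinfo=utc) for h in (1, 2, 3)]

fm_hourly = interp_to_hours(obs_times, obs_values, target_times)
print(fm_hourly)  # values at 01:00, 02:00, 03:00
```

Because the missing hour is simply absent from the input, no explicit NaN handling is needed in this sketch: the gap is bridged by the straight line between its neighbors.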

Also, this is a good place in the code to rename variables. Various data sources have different variable names, so we standardize with naming conventions from the metadata files.
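
Renaming with a mapping can be sketched as follows. The helper `rename_keys` is a hypothetical stand-in and may differ from the project's `rename_dict`. The mapping and units shown are illustrative and are not the contents of `raws_metadata.yaml`.

```python
def rename_keys(d, mapping):
    """Return a copy of `d` with keys renamed according to `mapping`.

    Hypothetical helper: keys not present in `mapping` are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in d.items()}


# Illustrative units dictionary keyed by source variable names
units = {"air_temp": "Celsius", "relative_humidity": "%", "wind_speed": "m/s"}

# Illustrative mapping from source names to standardized names
rename_map = {"air_temp": "temp", "relative_humidity": "rh"}

renamed = rename_keys(units, rename_map)
print(renamed)
```

Applying the same mapping to the data columns, the units dictionary, and the location metadata keeps all three consistent with each other.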

In [None]:
times = pl.datetime_range(
    start=start,
    end=end,
    interval="1h",
    time_zone = "UTC",
    eager=True
).alias("time")
# times = np.array([dt.strftime("%Y-%m-%dT%H:%M:%SZ") for dt in times.to_list()])
times = np.array(times.to_list())

In [None]:
df2 = rfuncs.time_intp_df(raws_dict["BRLW4"]["RAWS"], times)
df2

We now loop over all stations and run temporal interpolation. We also convert to pandas for easier pickle write.

In [None]:
print(f"Interpolating dataframe in time from {times.min()} to {times.max()}")
rename=True
if rename:
    print("Renaming RAWS columns based on raws_metadata file")
for st in raws_dict:
    print("~"*75)
    print(st)
    nsteps = raws_dict[st]["RAWS"].shape[0]
    raws_dict[st]["RAWS"] = rfuncs.time_intp_df(raws_dict[st]["RAWS"], times)
    raws_dict[st]["RAWS"] = pd.DataFrame(raws_dict[st]["RAWS"], columns = raws_dict[st]["RAWS"].columns)
    raws_dict[st]["times"] = times
    if raws_dict[st]["RAWS"].shape[0] != nsteps:
        raws_dict[st]["misc"] += " Interpolated data with numpy linear interpolation."
        print(f"    Original Dataframe time steps: {nsteps}")
        # Single quotes inside the f-strings keep this valid on Python versions before 3.12
        print(f"    Interpolated DataFrame time steps: {raws_dict[st]['RAWS'].shape[0]}")
        print(f"        interpolated {raws_dict[st]['RAWS'].shape[0] - nsteps} time steps")
    if rename:
        raws_dict[st]["units"] = rename_dict(raws_dict[st]["units"], raws_meta["rename_synoptic"])
        raws_dict[st]["RAWS"] = raws_dict[st]["RAWS"].rename(columns = raws_meta["rename_synoptic"])
        raws_dict[st]["loc"] = rename_dict(raws_dict[st]["loc"], raws_meta["rename_synoptic"])

In [None]:
raws_dict[st]["units"]

In [None]:
raws_dict["BRLW4"].keys()

In [None]:
raws_dict["BRLW4"]["RAWS"]

In [None]:
raws_dict["BRLW4"]["loc"]

In [None]:
raws_dict["BRLW4"]["units"]