# Data Ingest of 10-h Fuel Moisture Content

This notebook demonstrates retrieval and filtering of 10-h dead fuel moisture content (FMC) data from Remote Automated Weather Stations (RAWS). Retrieval of 10-h FMC observations is done with the software package `SynopticPy` and a stash of RAWS data kept and maintained by the broader OpenWFM community. This notebook uses `SynopticPy` with a free token, which limits the number of sensor hours that can be requested and gives access only to records from the past year.

The main steps in the retrieval are:
* Use `synoptic.Metadata` to determine the RAWS with FMC data in the given spatial domain and time frame
* Use `synoptic.TimeSeries` to retrieve all available data that may be relevant to FMC modeling. *NOTE:* stations are selected only if they have FMC data; any other available variables are collected as a bonus. These data are used for exploratory purposes and quality control checks, but predictors for final modeling come from HRRR.

For more info on the Python library API, see Brian Blaylock's `SynopticPy` [Python package](https://github.com/blaylockbk/SynopticPy).

For more info on available Synoptic RAWS variables, see the [Synoptic Data](https://demos.synopticdata.com/variables/index.html) documentation.

## Setup

In [None]:
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
sys.path.append('../src')
from utils import Dict, time_intp

A configuration file is used to control data ingest. Automated processes use the file `training_data_config.json` or `forecast_config.json`. In this tutorial, we will build the configuration manually.

In [None]:
end = datetime.now(timezone.utc)
end = end.replace(minute=0, second=0, microsecond=0)
start = end - relativedelta(months=6)

print(f"Start Date of retrieval: {start}")
print(f"End Date of retrieval: {end}")

In [None]:
config = Dict({
    'start_time': start, # String as YYYY-MM-DD_HH:mm:ss OR datetime object
    'end_time': end,
    'bbox': [40, -105, 45, -100], # [min_lat, min_lon, max_lat, max_lon]
    'raws_weather_vars': ["air_temp", "relative_humidity", "precip_accum", "fuel_moisture", "wind_speed", "solar_radiation", "pressure", "soil_moisture", "soil_temp", "snow_depth", "snow_accum", "wind_direction"],
    'raws_static_vars': ["stid", "latitude", "longitude", "elevation", "name", "state", "id"]
})

config
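The automated configuration files mentioned above are JSON, while the config built here holds `datetime` objects, which `json` cannot serialize directly. The following is a minimal sketch of a round trip between the two forms. It assumes the `YYYY-MM-DD_HH:mm:ss` string form noted in the config comment and UTC times; the helper names `config_to_json` and `config_from_json` are hypothetical, and the actual layout of `training_data_config.json` may differ.

```python
import json
from datetime import datetime, timezone

# Assumed string form, taken from the comment in the config cell (YYYY-MM-DD_HH:mm:ss)
TIME_FMT = "%Y-%m-%d_%H:%M:%S"

def config_to_json(cfg):
    """Serialize a config dictionary to JSON text, writing datetimes as strings."""
    def encode(obj):
        # json.dumps calls this only for objects it cannot serialize itself
        if isinstance(obj, datetime):
            return obj.strftime(TIME_FMT)
        raise TypeError(f"Cannot serialize object of type {type(obj).__name__}")
    return json.dumps(cfg, default=encode, indent=4)

def config_from_json(text):
    """Parse JSON text and convert the time fields back to UTC datetimes."""
    cfg = json.loads(text)
    for key in ("start_time", "end_time"):
        parsed = datetime.strptime(cfg[key], TIME_FMT)
        cfg[key] = parsed.replace(tzinfo=timezone.utc)  # assumption: times are UTC
    return cfg

example_cfg = {
    "start_time": datetime(2024, 1, 1, 0, tzinfo=timezone.utc),
    "end_time": datetime(2024, 7, 1, 0, tzinfo=timezone.utc),
    "bbox": [40, -105, 45, -100],  # [min_lat, min_lon, max_lat, max_lon]
}

text = config_to_json(example_cfg)
restored = config_from_json(text)
```

Storing times as strings keeps the file readable and editable by hand, at the cost of an explicit parsing step when the file is loaded.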

## Station Metadata

*Note*: the bounding box format used in `wrfxpy` is `[min_lat, min_lon, max_lat, max_lon]`, whereas Synoptic expects `[min_lon, min_lat, max_lon, max_lat]`, so the config bounding box is reordered before the request.

In [None]:
bbox = config.bbox
bbox_reordered = [bbox[1], bbox[0], bbox[3], bbox[2]]

In [None]:
sts = synoptic.Metadata(
    bbox=bbox_reordered,
    vars=["fuel_moisture"], # Note we only want to include stations with FMC. Other "raws_vars" are bonus
    obrange=(start, end),
).df()

In [None]:
sts

## Station Time Series

We loop over the station IDs found in the previous step, retrieve all available data for each station, and then format and clean it.

*NOTE*: this process is not parallelized. Every request comes from the same IP address, so parallel requests may cause issues.

In [None]:
# name_mapping = {
#     "air_temp":"temp", 
#     "fuel_moisture":"fm", 
#     "relative_humidity":"rh", 
#     "precip_accum":"rain",
#     "solar_radiation":"solar", 
#     "wind_speed":"wind", 
#     "precip_accum":"precip_accum", 
#     "soil_moisture":"soil_moisture",
# }

In [None]:
def format_raws(df, static_vars, weather_vars):
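    """
    Format the long-format output of synoptic.TimeSeries into a wide dataframe.

    Args:
        df: Polars dataframe from synoptic.TimeSeries(...).df(), one row per observation.
        static_vars: List of station metadata columns to keep alongside date_time.
        weather_vars: List of weather variable names to pivot into columns.

    Returns:
        Tuple (dat, units): the pivoted dataframe and a dictionary of units by variable.
    """

    assert "fuel_moisture" in df["variable"], "fuel_moisture not detected in input dataframe"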
    units = {} # stores units for variables
    
    
    for var in weather_vars:
        if var in df['variable']:
            df_temp = df.filter(df['variable'] == var)
            unit = df_temp['units'].unique()
            if len(unit) != 1:
                raise ValueError(f"Variable {var} has multiple values for units")
            units[var] = unit[0]
    
    dat = df.filter(pl.col("variable").is_in(weather_vars))
    dat = dat.pivot(
        values="value",
        index=["date_time"]+static_vars,
        on="variable"
    )

    print(f"Found {dat.shape[0]} FMC records")
    
    # Fix column units
    if "air_temp" in dat.columns and units['air_temp'] == "Celsius":
        print("Converting RAWS air temp from C to K")
        units['air_temp'] = "Kelvin"
        dat = dat.with_columns(
                (pl.col("air_temp")+273.15).alias("air_temp")
            )
        
    if 'elevation' in static_vars: # convert ft to meters
        print("Converting RAWS elevation from ft to meters")
        # loc['elevation'] = loc['elevation'] * 0.3048
        dat = dat.with_columns(
                (pl.col("elevation") * 0.3048).alias("elevation")
            )
        units['elevation'] = "m"    
        
        
    return dat, units

In [None]:
df_temp = synoptic.TimeSeries(
        stid="HSYN1",
        start=start,
        end=end,
        vars=config.raws_weather_vars,
        units = "metric"
    ).df()

df_temp

In [None]:
df_temp["date_time"].min()

In [None]:
df_temp["date_time"].max()

In [None]:
start

In [None]:
end

In [None]:
dat, units = format_raws(df_temp, static_vars = config.raws_static_vars, weather_vars = config.raws_weather_vars)

In [None]:
units

In [None]:
dat

In [None]:
def get_static(df, static_vars):
    """
    Given dataframe of timeseries observations from RAWS station, get dictionary of static info, such as identifiers and physical attributes of station.
    
    Args:
        df: Input dataframe with timeseries observations.
        static_vars: List of column names to extract static information from.
    
    Returns:
        A dictionary called "loc" containing the unique value for each column in static_vars.
    
    Raises:
        ValueError: If any column in static_vars has more than one unique value in the dataframe.
    """
    
    loc = {}
    for col in static_vars:
        if col in df.columns:
            unique_values = df[col].unique()
            if len(unique_values) == 1:
                loc[col] = unique_values[0]
            else:
                raise ValueError(f"Column '{col}' has more than one unique value: {unique_values}")
        else:
            raise KeyError(f"Column '{col}' not found in the dataframe.")
    return loc

In [None]:
print(f"Attempting retrieval of RAWS from {start} to {end} within {bbox}")
print("~"*75)

raws_dict = {}

for st in sts['stid']:
    print("~"*50)
    print(f"Attempting retrieval of station {st}")
    df = synoptic.TimeSeries(
        stid=st,
        start=start,
        end=end,
        vars=config.raws_weather_vars,
        units = "metric"
    ).df()
    
    dat, units = format_raws(df, static_vars = config.raws_static_vars, weather_vars = config.raws_weather_vars)
    loc = get_static(dat, config.raws_static_vars)
    raws_dict[st] = {
        'RAWS': dat,
        'units': units,
        'loc': loc,
        'misc': "Data retrieved using `synoptic.TimeSeries` and formatted with custom functions within `ml_fmda` project."
    }
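The loop above stops at the first station whose request or formatting raises an exception. The following is a minimal sketch of a sequential loop that records failures and continues, with an optional pause between requests. The names `retrieve_sequentially`, `fetch`, and `fake_fetch` are hypothetical; in this notebook, `fetch` would wrap the `synoptic.TimeSeries` call together with `format_raws` and `get_static`.

```python
import time

def retrieve_sequentially(station_ids, fetch, pause_seconds=0.0):
    """Call fetch(stid) for each station in order, collecting results and failures.

    fetch is any callable that takes a station ID and returns that station's data.
    """
    results, failures = {}, {}
    for stid in station_ids:
        try:
            results[stid] = fetch(stid)
        except Exception as exc:
            # Record the failure and move on to the next station
            failures[stid] = str(exc)
        if pause_seconds > 0:
            time.sleep(pause_seconds)  # optional spacing between requests
    return results, failures

# Stand-in for a real request, used only to illustrate the control flow
def fake_fetch(stid):
    if stid == "BAD01":
        raise ValueError("no data returned")
    return {"stid": stid, "n_records": 3}

results, failures = retrieve_sequentially(["AAA01", "BAD01", "CCC01"], fake_fetch)
```

Keeping the failures in a separate dictionary makes it easy to report or retry the stations that did not return data.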

In [None]:
raws_dict.keys()

In [None]:
st = [*raws_dict.keys()][0]
raws_dict[st].keys()

In [None]:
raws_dict[st]['loc']

In [None]:
raws_dict[st]['units']

## Fix Time and Interpolate

Synoptic may return RAWS data with missing hours or with observations that do not fall exactly on the hour. Missing hours are simply absent from the returned data rather than marked with NaN. We fix this by filling in NaN for missing hours and interpolating to the exact hour, so that every RAWS station ends up with regular hourly observations.
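To illustrate the idea on a small scale, here is a minimal sketch of linear interpolation onto exact hours with `numpy.interp`. The helper `interp_to_hours` is hypothetical and is not the project's `time_intp` from `utils`, which may treat NaN values and the edges of the time range differently. The sketch assumes observation times are sorted in increasing order.

```python
from datetime import datetime, timedelta, timezone
import numpy as np

def interp_to_hours(obs_times, obs_values, target_times):
    """Linearly interpolate observations onto target times (illustrative helper)."""
    # np.interp works on floats, so convert datetimes to seconds since the epoch
    x_obs = np.array([t.timestamp() for t in obs_times])
    x_new = np.array([t.timestamp() for t in target_times])
    y_obs = np.asarray(obs_values, dtype=float)
    # Drop NaN observations so they do not contaminate neighboring hours
    keep = ~np.isnan(y_obs)
    # left/right=NaN: do not extrapolate beyond the observed time range
    return np.interp(x_new, x_obs[keep], y_obs[keep], left=np.nan, right=np.nan)

t0 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
# Reports at 00:30, 01:30, and 03:30; the 02:30 report is absent
obs_times = [t0 + timedelta(minutes=m) for m in (30, 90, 210)]
obs_values = [10.0, 12.0, 16.0]
# Regular hourly grid from 00:00 through 04:00
target_times = [t0 + timedelta(hours=h) for h in range(5)]

fm_hourly = interp_to_hours(obs_times, obs_values, target_times)
```

In this example the hours 01:00, 02:00, and 03:00 fall between observations and receive interpolated values, while 00:00 and 04:00 lie outside the observed range and are returned as NaN.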

In [None]:
times = pl.datetime_range(
    start=start,
    end=end,
    interval="1h",
    time_zone = "UTC",
    eager=True
).alias("time")
# times = np.array([dt.strftime("%Y-%m-%dT%H:%M:%SZ") for dt in times.to_list()])
times = np.array(times.to_list())

In [None]:
def time_intp_df(df, target_times, static_cols, time_cols):
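    """
    Interpolate time-dependent columns to the target times and repeat static columns.

    Args:
        df: Dataframe of observations for one station, with a "date_time" column.
        target_times: Array of datetimes to interpolate to.
        static_cols: Columns that are constant for the station; the first value is repeated.
        time_cols: Columns that vary in time; interpolated with time_intp.

    Returns:
        Dataframe with one row per target time and the same column order as df.
    """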

    print(f"Interpolating dataframe in time from {target_times.min()} to {target_times.max()}")
    print(f"    Original Dataframe shape: {df.shape}")

    # Get raw datetime values as numpy array
    time_raws = np.array(df["date_time"].to_list())    

    # Interpolate time dynamic columns only for columns that exist in the dataframe
    weather_data = {
        var: time_intp(
            time_raws, 
            df[var].to_numpy(), 
            target_times
        ) for var in time_cols if var in df.columns
    }
    # Create a Polars DataFrame from the interpolated results
    weather_df = pl.DataFrame(weather_data)
    weather_df = weather_df.with_columns(pl.Series("date_time", target_times))

    # Expand only for columns that exist in the dataframe
    nrow = weather_df.shape[0]
    static_data = {
        var: np.repeat(df[var].to_numpy()[0], nrow)
        for var in static_cols if var in df.columns
    }
    static_df = pl.DataFrame(static_data)  
    
    # Combine interpolated weather data and expanded static variables
    result_df = pl.concat([weather_df, static_df], how="horizontal")
    result_df = result_df.select(df.columns) # reorder columns to match original

    print(f"    Interpolated DataFrame shape: {result_df.shape}")
    print(f"        interpolated {result_df.shape[0] - df.shape[0]} time steps")
    
    return result_df

In [None]:
df2 = time_intp_df(raws_dict["BRLW4"]["RAWS"], times, static_cols = config.raws_static_vars, time_cols = config.raws_weather_vars)
df2

In [None]:
for st in raws_dict:
    print("~"*75)
    print(st)
    raws_dict[st]["RAWS"] = time_intp_df(raws_dict[st]["RAWS"], times, static_cols = config.raws_static_vars, time_cols = config.raws_weather_vars)
    raws_dict[st]["times"] = times
    raws_dict[st]["misc"] += " Interpolated data with numpy linear interpolation."