# Create zonal statistics and point extractions for comparing CONUS404 and reference datasets

Author: Andrew Laws alaws@usgs.gov

<img src='../../../doc/assets/Eval_Analysis.svg' width=600>

Now that the data has been prepared, it is time to compute zonal statistics and perform point extractions. 

<details>
  <summary>Guide to pre-requisites and learning outcomes...&lt;click to expand&gt;</summary>
  
  <table>
    <tr>
      <td>Pre-Requisites
      <td>To get the most out of this notebook, you should already have an understanding of these topics: 
        <ul>
        <li>the following summary statistics: mean, median, standard deviation, bias, mean absolute error (MAE), root mean squared error (RMSE), Pearson and Spearman correlation coefficients, and percent bias.
        <li>the HUC (Hydrologic Unit Code) system of identification of drainage basins used by the USGS, in particular the HUC6 designation used here.
        </ul>
    <tr>
      <td>Expected Results
      <td>At the end of this notebook, you will produce: 
        <ul>
        <li>Tables of descriptive statistics for the Delaware River Basin for each HUC6 designation within the basin, for forcing and reference datasets of precipitation, temperature, and net radiation.
        <li>Further processed datasets for subsequent use in a separate visualization notebook
        </ul>
  </table>
</details>

## Introduction

Forcing data can be evaluated either directly, by comparing it with observational or reference data, or indirectly. An example of an indirect evaluation is to use multiple forcing datasets to drive a hydrological model and compare simulated outputs with observations (e.g., simulated streamflow with gaged streamflow) to assess the influence of different forcing data on hydrologic model outputs. Because indirect approaches are computationally expensive, this notebook demonstrates a direct evaluation of forcing data against multiple reference datasets. A direct evaluation consists of calculating descriptive and comparative statistics between forcing and reference datasets for several variables (precipitation, temperature, and net radiation).

### How does this notebook evaluate forcing data?
First, data from gridded datasets (forcing: CONUS404; reference: PRISM, CERES-EBAF) are spatially summarized by computing area-weighted means over HUC6 spatial units within the Delaware River Basin. Next, annual means are computed over consistent periods of record. From these, some commonly-used descriptive statistics are calculated and tabulated. Similarly, annual means of point (station-based) reference data (GHCN, CRN) and annual means of forcing data extracted from nearest (to station) grid-points are computed, and descriptive statistics calculated and tabulated. For further evaluation and visualization, data processed within the notebook are output for subsequent use by a separate visualization notebook (see graphic above).
    
The following statistics will be calculated for comparative analysis between CONUS404 and each reference dataset:

<ul>
    <li>mean</li>
    <li>median</li>
    <li>standard deviation (stdev)</li>
    <li>bias</li>
    <li>mean absolute error (mae)</li>
    <li>root-mean square error (rmse)</li>
    <li>Pearson's correlation</li>
    <li>Spearman's correlation</li>
    <li>percent bias (pbias)</li>
</ul>
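
As a rough guide to what the comparative statistics in this list measure, the sketch below computes them with NumPy for two short, made-up series. The function name `comparison_stats` and the sample values are illustrative assumptions. The notebook itself relies on functions from the HyTEST metrics notebooks, whose exact definitions (for example, the sign convention of percent bias) may differ from this sketch.

```python
import numpy as np


def comparison_stats(sim, obs):
    """Return common comparison statistics for simulated vs. observed values."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    diff = sim - obs

    def ranks(values):
        # rank positions (0..n-1); this simple version assumes no tied values
        return np.argsort(np.argsort(values))

    return {
        "bias": diff.mean(),
        "mae": np.abs(diff).mean(),
        "rmse": np.sqrt((diff**2).mean()),
        "pearson": np.corrcoef(sim, obs)[0, 1],
        "spearman": np.corrcoef(ranks(sim), ranks(obs))[0, 1],
        "pbias": 100.0 * diff.sum() / obs.sum(),
    }


sim = [10.0, 12.0, 15.0, 11.0]  # hypothetical forcing values
obs = [9.0, 13.0, 14.0, 10.0]  # hypothetical reference values
stats = comparison_stats(sim, obs)
```

Here bias keeps the sign of the differences while MAE and RMSE do not, which is why a dataset can have a small bias and still a large RMSE.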


### Data References
- [CONUS404 (**CON**tiguous United States for **40** years at **4**-km resolution)](https://doi.org/10.5066/P9PHPK4F)
- [PRISM (**P**arameter-elevation **R**egressions on **I**ndependent **S**lopes **M**odel)](https://prism.oregonstate.edu/)
- [CERES-EBAF (**C**louds and **E**arth's **R**adiant **E**nergy **S**ystems - **E**nergy **B**alanced **A**nd **F**illed)](https://ceres.larc.nasa.gov/data/)
- [GHCN (**G**lobal **H**istorical **C**limate **N**etwork)](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
- [CRN (**C**limate **R**eference **N**etwork)](https://www.ncei.noaa.gov/products/land-based-station/us-climate-reference-network)


In [None]:
# library imports
import math
import warnings

import cf_xarray
import dask
import geopandas as gpd
import hvplot.xarray  # noqa: F401
import hvplot.pandas  # noqa: F401
import intake
import numpy as np
import pandas as pd
import rioxarray  # noqa: F401  (registers the .rio accessor used to read the CRS)
import sparse
import xarray as xr
from dask.distributed import Client, LocalCluster
from pygeohydro import WBD
from shapely.geometry import Polygon

warnings.filterwarnings("ignore")

We will use the standard suite and additional statistical metrics for our comparative statistics.

In [None]:
# run script for available functions from the standard suite
%run ../../../evaluation/Metrics_StdSuite_v1.ipynb

# run script for additional functions that will be used
%run ../../../evaluation/Metrics_Misc_v1.ipynb

Using the `intake` catalog, we will load CONUS404 data for the Delaware River Basin.

In [None]:
# data
# connect to HyTEST catalog
url = "https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml"
cat = intake.open_catalog(url)

# access tutorial catalog
conus404_drb_cat = cat["conus404-drb-eval-tutorial-catalog"]
list(conus404_drb_cat)

## **Start a Dask client using an appropriate Dask Cluster** 
This is an optional step, but can speed up data loading significantly, especially when accessing data from the cloud.

### Set up your client on your local PC or on HPC like this:

In [None]:
# check for existing Dask cluster
if "client" in locals():
    print("Shutting down existing Dask cluster.")
    cluster.close()
    client.close()

cluster = LocalCluster()
client = Client(cluster)

print(
    f"The link to the Dask dashboard is {client.dashboard_link}. If on HPC, this may not be available."
)

Workflow outline:
1. Read in the prepared dataset
2. Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset
3. Make a data mask with the HUC6 boundaries to calculate zonal statistics
4. Compute zonal statistics with data mask and prepared data

Once all calculations are done: 

5. Combine each reference with benchmark into single dataset
6. Export gridded data zonal statistics
<br>

## **Compute zonal statistics for gridded datasets**

In the last tutorial, we prepared three gridded datasets: CONUS404 (benchmark), PRISM (reference), and CERES-EBAF (reference). The goal of this section is to compute [zonal statistics](https://gisgeography.com/zonal-statistics/) for each HUC6 zone in the Delaware River Basin (DRB), using the [conservative regridding method put forth by Ryan Abernathey](https://discourse.pangeo.io/t/conservative-region-aggregation-with-xarray-geopandas-and-sparse/2715) to regrid and perform an area-weighted analysis.

Dataset outline:
<ol>
    <li>Read in the prepared dataset</li>
    <li>Compute bounding bands for grid cells then use these to create polygons in area-preserving CRS</li>
    <li>Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset</li>
    <li>Overlay the dataset polygons over the HUC6 boundaries and create spatial weights matrices</li>
    <li>Perform matrix multiplication between the prepared dataset and the spatial weights matrices</li>
</ol>

The following two functions will be used for regridding each dataset. An explanation of what they do will be provided when they are applied.

In [None]:
def bounds_to_poly(x_bounds, y_bounds) -> Polygon:
    """Return a polygon based on the x (longitude) and
    y (latitude) bounding band DataArrays.
    """
    return Polygon(
        [
            (x_bounds[0], y_bounds[0]),
            (x_bounds[0], y_bounds[1]),
            (x_bounds[1], y_bounds[1]),
            (x_bounds[1], y_bounds[0]),
        ]
    )


def apply_weights_matmul_sparse(weights, data):
    """Apply weights in a sparse matrix to data and regrid."""
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # n = number of time steps; k = ny * nx (flattened grid cells)
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))
    weights_shape = weights.shape
    # m = number of target regions (HUC6 units)
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))

    # (n, k) @ (k, m) -> (n, m): one weighted mean per time step and region
    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()
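
To see what the reshape-and-multiply step in `apply_weights_matmul_sparse` amounts to, the following sketch repeats it with small dense NumPy arrays instead of sparse ones. The grid size, values, and weights are made up for illustration, and dense arrays are used only to keep the example self-contained.

```python
import numpy as np

# toy data: 2 time steps on a 2 x 2 grid, dimensions (time, y, x)
data = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0]],
        [[10.0, 20.0], [30.0, 40.0]],
    ]
)

# toy weights, dimensions (y, x, region): region 0 covers the first row
# equally, region 1 covers the second row with unequal fractions
weights = np.zeros((2, 2, 2))
weights[0, :, 0] = [0.5, 0.5]
weights[1, :, 1] = [0.25, 0.75]

n_time = data.shape[0]
n_cells = data.shape[1] * data.shape[2]
n_regions = weights.shape[2]

# flatten the spatial dimensions, then multiply: (time, cells) @ (cells, regions)
regional_means = data.reshape(n_time, n_cells) @ weights.reshape(n_cells, n_regions)

# the same result written as an explicit weighted sum over y and x
check = np.einsum("tyx,yxr->tr", data, weights)
```

Because the weights for each region sum to 1, each entry of `regional_means` is an area-weighted mean for one time step and one region.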

Each dataset is read through the catalog from an [Open Storage Network](https://www.openstoragenetwork.org/) bucket, which is read-only. The cell below sets the coordinate names used throughout the regridding steps.

In [None]:
# coordinate names in the datasets: x corresponds to longitude, y to latitude

x = "x"
y = "y"

**CONUS404 zonal statistics**

#### 1. Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset

In [None]:
# bring in HUC6 boundaries found in the DRB
drb_wbd = WBD(layer="huc6", outfields=["huc6", "name"])
drb_gdf = drb_wbd.byids("huc6", ["020401", "020402"])

In [None]:
# view the dataframe
drb_gdf.head()

In [None]:
# area preserving coordinate reference system
crs_area = "ESRI:53034"

# reproject the HUC6 boundaries to the area-preserving CRS
drb_gdf = drb_gdf.to_crs(crs_area)

In [None]:
# visualize
drb_gdf.hvplot(line_color="orange", color="purple", line_width=2.5)

#### 2. Read CONUS404 data, previously processed over the DRB

In [None]:
conus404_drb_cat["conus404-drb-OSN"].kwargs["decode_coords"] = "all"
conus404_drb_cat["conus404-drb-OSN"].kwargs

In [None]:
# open dataset
c404_drb = conus404_drb_cat["conus404-drb-OSN"].to_dask()

# crs
c404_crs = c404_drb.rio.crs.to_proj4()

c404_drb

In [None]:
# visualize (uncomment to view)
# c404_drb.PREC_ACC_NC.hvplot(x="x", y="y", rasterize=True)

#### 3. Compute bounding bands for grid cells and use them to create polygons in area-preserving CRS

Create the grid of `c404_drb` using any of the variables

In [None]:
# set var
c404_var = "TK"

# drop unneeded variable and coordinates
c404_grid = c404_drb[[c404_var]].drop(["time", "lon", "lat", c404_var]).reset_coords()
c404_grid

And create bounding bands then stack into points

In [None]:
# add bounds
c404_grid = c404_grid.cf.add_bounds(x)
c404_grid = c404_grid.cf.add_bounds(y)

# stack
c404_points = c404_grid.stack(point=(y, x))
c404_points
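
The idea behind bounding bands can be sketched without `cf_xarray`: for evenly spaced cell centers, each cell's lower and upper bounds sit half a grid step on either side of its center. The helper name `centers_to_bounds` and the coordinates below are hypothetical, and this is an illustration under the even-spacing assumption rather than a description of how `add_bounds` is implemented.

```python
import numpy as np


def centers_to_bounds(centers):
    """Return an (n, 2) array of [lower, upper] bounds for evenly spaced centers."""
    centers = np.asarray(centers, dtype=float)
    half_step = np.diff(centers).mean() / 2.0
    return np.column_stack([centers - half_step, centers + half_step])


x_centers = [0.0, 4000.0, 8000.0]  # hypothetical 4-km grid, in meters
x_bounds = centers_to_bounds(x_centers)
```

Pairing the x bounds and y bounds of a cell gives the four corners that `bounds_to_poly` turns into a polygon.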

Next, use `xr.apply_ufunc` to apply the `bounds_to_poly` function defined above to the _c404_points_ Dataset.

In [None]:
c404_boxes = xr.apply_ufunc(
    bounds_to_poly,
    c404_points.x_bounds,
    c404_points.y_bounds,
    input_core_dims=[("bounds",), ("bounds",)],
    output_dtypes=[np.dtype("O")],
    vectorize=True,
)
c404_boxes

Create `gpd.GeoDataFrame` from boxes

In [None]:
c404_grid_df = gpd.GeoDataFrame(
    data={"geometry": c404_boxes.values, "y": c404_boxes[y], "x": c404_boxes[x]},
    index=c404_boxes.indexes["point"],
    crs=c404_crs,
)
c404_grid_df

In [None]:
# visualize (uncomment to view)
# c404_grid_df.hvplot(line_color="red", color="white", line_width=0.05)

#### 4. Overlay the dataset polygons we just created over the HUC6 boundaries and create spatial weights matrices

In [None]:
# convert DRB to conus404 crs
c404_drb_gdf = drb_gdf.to_crs(c404_crs)

# perform overlay
c404_overlay = c404_grid_df.overlay(c404_drb_gdf, keep_geom_type=True)
c404_overlay.head()

In [None]:
# plot overlay for single HUC6
c404_overlay[c404_overlay.huc6 == "020402"].geometry.plot(edgecolor="k")

Compute grid cell fractions. 

A cell fraction is the area of the intersection between a grid cell and a HUC6 polygon divided by the total intersected area for that HUC6, so the fractions for each HUC6 sum to 1.

In [None]:
c404_grid_cell_fraction = c404_overlay.geometry.area.groupby(
    c404_overlay.huc6
).transform(lambda x: x / x.sum())
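
The `groupby(...).transform(...)` pattern used for the cell fractions can be tried on a small table. The sketch below uses made-up intersection areas in place of the overlay geometry; the column names `area` and `fraction` are assumptions for this example only.

```python
import pandas as pd

# hypothetical intersection areas (km^2) between grid cells and two HUC6 units
overlay = pd.DataFrame(
    {
        "huc6": ["020401", "020401", "020401", "020402", "020402"],
        "area": [16.0, 16.0, 8.0, 12.0, 4.0],
    }
)

# divide each intersection area by the total intersected area of its HUC6
overlay["fraction"] = overlay.groupby("huc6")["area"].transform(lambda a: a / a.sum())

# the fractions within each HUC6 sum to 1
fraction_totals = overlay.groupby("huc6")["fraction"].sum()
```

Unlike an aggregation, `transform` returns one value per input row, which is what allows the fractions to be aligned with the overlay rows afterward.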

Create a sparse DataArray of weights (for an in-depth description of sparse data, see [this article](https://www.techopedia.com/definition/9480/sparse-array#:~:text=17%20May%2C%202017-,What%20Does%20Sparse%20Array%20Mean%3F,array%20in%20digital%20data%20handling.))

In [None]:
c404_multi_index = c404_overlay.set_index([y, x, "huc6"]).index
c404_df_weights = pd.DataFrame(
    {"weights": c404_grid_cell_fraction.values}, index=c404_multi_index
)

# create xarray dataset
c404_ds_weights = xr.Dataset(c404_df_weights)

# generate sparse data array
c404_weights_sparse = c404_ds_weights.unstack(sparse=True, fill_value=0.0).weights

#### 5. Perform matrix multiplication between the prepared dataset and the spatial weights matrices

Matrix multiplication across each DataArray. We do this for precipitation (PREC_ACC_NC), net radiation (RNET), and temperature (TK).

In [None]:
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    # precipitation
    c404_precip_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["PREC_ACC_NC"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )

    # net radiation
    c404_rnet_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["RNET"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )

    # temperature
    c404_tk_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["TK"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )

Merge DataArrays into Dataset

In [None]:
c404_regridded = xr.Dataset(
    {
        "PREC_NC_ACC": c404_precip_regridded,
        "RNET": c404_rnet_regridded,
        "TK": c404_tk_regridded,
    }
)
c404_regridded = c404_regridded.drop("crs")
c404_regridded.attrs = c404_drb.attrs
c404_regridded

Convert to DataFrame. We now have a time series for each variable and each HUC6 within the DRB.

In [None]:
c404_df = c404_regridded.to_dataframe()
c404_df

Let's clean up our data. We will reset the index and trim the dates so that they show only the year and month, dropping the day.

In [None]:
# reset index
c404_zonal_stats = c404_df.reset_index(drop=False)

# convert time to string and remove days
c404_zonal_stats["time"] = c404_zonal_stats["time"].astype(str).str[:-3]

# drop 1979 as it only had three months of data
c404_zonal_stats = c404_zonal_stats[
    ~c404_zonal_stats["time"].str.contains("1979")
]

c404_zonal_stats

The zonal stats that were just calculated took many steps and, depending on the computational power of your environment, considerable time to create. It could be a good idea to export this intermediate data in case you need to come back to this analysis later.

In [None]:
# This is only an example
# c404_zonal_stats.to_parquet("./file/path/to/conus404_drb_zonal_stats.parquet", index=False)

**PRISM zonal statistics**

Now, in the next cell, we'll run through the same steps to compute zonal statistics for the two PRISM variables, temperature (TK) and precipitation (PREC_ACC_NC).

In [None]:
conus404_drb_cat["prism-drb-OSN"].kwargs["decode_coords"] = "all"
prism_drb = conus404_drb_cat["prism-drb-OSN"].to_dask()
prism_drb = prism_drb.chunk(dict(x=-1))
prism_drb

In [None]:
# open dataset
conus404_drb_cat["prism-drb-OSN"].kwargs["decode_coords"] = "all"
prism_drb = conus404_drb_cat["prism-drb-OSN"].to_dask()
prism_drb = prism_drb.chunk(dict(x=-1))

# prism crs
prism_crs = 4269  # EPSG:4269 = NAD83

# create the grid of prism_drb using any of the variables
prism_var = "TK"

# drop unneeded variable and coordinates
prism_grid = prism_drb[[prism_var]].drop(["time", prism_var]).reset_coords().load()


# add bounds
prism_grid = prism_grid.cf.add_bounds(x)
prism_grid = prism_grid.cf.add_bounds(y)

# stack
prism_points = prism_grid.stack(point=(y, x))

# apply the bounds_to_poly function
prism_boxes = xr.apply_ufunc(
    bounds_to_poly,
    prism_points.x_bounds,
    prism_points.y_bounds,
    input_core_dims=[("bounds",), ("bounds",)],
    output_dtypes=[np.dtype("O")],
    vectorize=True,
)

# create geodataframe from boxes
prism_grid_df = gpd.GeoDataFrame(
    data={"geometry": prism_boxes.values, "y": prism_boxes[y], "x": prism_boxes[x]},
    index=prism_boxes.indexes["point"],
    crs=prism_crs,
)

# convert DRB to the PRISM crs
prism_drb_gdf = drb_gdf.to_crs(epsg=prism_crs)

# overlay the two grids
prism_overlay = prism_grid_df.overlay(prism_drb_gdf, keep_geom_type=True)

# grid cell fractions
prism_grid_cell_fraction = prism_overlay.geometry.area.groupby(
    prism_overlay.huc6
).transform(lambda x: x / x.sum())

# create sparse dataarray
prism_multi_index = prism_overlay.set_index([y, x, "huc6"]).index
prism_df_weights = pd.DataFrame(
    {"weights": prism_grid_cell_fraction.values}, index=prism_multi_index
)

prism_ds_weights = xr.Dataset(prism_df_weights)

prism_weights_sparse = prism_ds_weights.unstack(sparse=True, fill_value=0.0).weights

# Matrix multiplication across each DataArray
with dask.config.set(
    **{"array.slicing.split_large_chunks": False, "allow_rechunk": True}
):
    prism_precip_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        prism_weights_sparse,
        prism_drb["PREC_ACC_NC"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )

    prism_tk_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        prism_weights_sparse,
        prism_drb["TK"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )
# merge DataArrays into Dataset
prism_regridded = xr.Dataset(
    {"PREC_NC_ACC": prism_precip_regridded, "TK": prism_tk_regridded}
)
prism_regridded.attrs = prism_drb.attrs

# Convert to DataFrame
prism_df = prism_regridded.load().to_dataframe()
prism_df.head()

Let's clean up the PRISM data as we have done with the CONUS404 data. We will reset the index and clean up the date formatting.

In [None]:
# reset index and add time back
prism_zonal_stats = prism_df.reset_index(drop=False)

# convert time to string and remove days
prism_zonal_stats["time"] = prism_zonal_stats["time"].astype(str).str[:-3]

prism_zonal_stats.head()

Example export of zonal stats.

In [None]:
# This is only an example
# prism_zonal_stats.to_parquet("./file/path/to/prism_drb_zonal_stats.parquet", index=False) #noqa : E501

### Descriptive statistics

We will now compute descriptive statistics between the previously computed CONUS404 and PRISM zonal statistics. The statistics that will be calculated are mean, median, standard deviation, bias, MAE, RMSE, Pearson's correlation, Spearman's r, and percent bias. 

The overall process will look like this:
1. Merge zonal stats together on HUC6 and time values
2. Resample monthly values to annual values (sums for precipitation, means for temperature)
3. Calculate each statistic 
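
The three steps above can be sketched on a small made-up table. The column name `PREC`, the suffixes, and the values are assumptions for illustration, and the annual aggregation here groups by calendar year rather than using `resample`, which gives the same annual totals for this toy data.

```python
import pandas as pd

# hypothetical monthly zonal values for one HUC6 from two sources
months = ["2000-01", "2000-02", "2001-01", "2001-02"]
reference = pd.DataFrame(
    {"huc6": "020401", "time": months, "PREC": [50.0, 60.0, 70.0, 80.0]}
)
forcing = pd.DataFrame(
    {"huc6": "020401", "time": months, "PREC": [55.0, 65.0, 60.0, 90.0]}
)

# 1. merge on HUC6 and time; suffixes distinguish the two sources
merged = reference.merge(forcing, on=["huc6", "time"], suffixes=("_ref", "_forcing"))

# 2. aggregate monthly values to annual totals by calendar year
merged["time"] = pd.to_datetime(merged["time"], format="%Y-%m")
annual = merged.groupby(merged["time"].dt.year)[["PREC_ref", "PREC_forcing"]].sum()

# 3. calculate a statistic, here the bias of the annual totals
bias = (annual["PREC_forcing"] - annual["PREC_ref"]).mean()
```

With more than one HUC6 in the table, adding `huc6` to the grouping keys would keep the annual values separate for each unit.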

In [None]:
# Merge the PRISM and CONUS404 zonals stats together based on the HUC6 code and time
prism_c404_zonal = prism_zonal_stats.merge(
    c404_zonal_stats,
    left_on=["huc6", "time"],
    right_on=["huc6", "time"],
    suffixes=["_prism", "_c404"],
)

# drop RNET
prism_c404_zonal.drop("RNET", axis=1, inplace=True)

prism_c404_zonal.head()

In [None]:
# convert time column to datetime type
prism_c404_zonal["time"] = pd.to_datetime(prism_c404_zonal["time"], format="%Y-%m")
prism_c404_zonal.head()

In [None]:
# resample to annual values; only `time` is used for grouping, so rows from all HUC6 units in a year are aggregated together
prism_c404_yearly = prism_c404_zonal.resample("1Y", on="time").agg(
    {
        "PREC_NC_ACC_c404": "sum",
        "PREC_NC_ACC_prism": "sum",
        "TK_c404": "mean",
        "TK_prism": "mean",
    }
)

In [None]:
prism_c404_yearly.reset_index(drop=False, inplace=True)
prism_c404_yearly.head()

Here we calculate mean, median, and standard deviation of yearly data.

In [None]:
# mean, median, standard deviation
prism_c404_mean = prism_c404_yearly.mean()
prism_c404_median = prism_c404_yearly.median()
prism_c404_stdev = prism_c404_yearly.std()

# create dataframe
prism_c404_stats = pd.DataFrame(
    {
        "annual_mean": prism_c404_mean,
        "median": prism_c404_median,
        "stdev": prism_c404_stdev,
    }
).T.drop("time", axis=1)

# reset index and rename
prism_c404_stats = prism_c404_stats.reset_index(drop=False).rename(
    {"index": "stat"}, axis=1
)  # noqa : E501

prism_c404_stats

Now we calculate the bias

In [None]:
# bias
prism_c404_stats_annual_mean = prism_c404_stats.loc[
    prism_c404_stats["stat"] == "annual_mean"
]  # noqa : E501
prism_c404_bias_precip = float(
    prism_c404_stats_annual_mean["PREC_NC_ACC_c404"]
    - prism_c404_stats_annual_mean["PREC_NC_ACC_prism"]
)  # noqa : E501
prism_c404_bias_tk = float(
    prism_c404_stats_annual_mean["TK_c404"] - prism_c404_stats_annual_mean["TK_prism"]
)  # noqa : E501

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "bias",
    prism_c404_bias_precip,
    None,
    prism_c404_bias_tk,
    None,
]  # noqa : E501

prism_c404_stats

MAE and RMSE are then calculated.

In [None]:
# MAE
prism_c404_mae_precip = sum(
    abs(prism_c404_yearly["PREC_NC_ACC_c404"] - prism_c404_yearly["PREC_NC_ACC_prism"])
) / len(prism_c404_yearly)  # noqa : E501
prism_c404_mae_tk = sum(
    abs(prism_c404_yearly["TK_c404"] - prism_c404_yearly["TK_prism"])
) / len(prism_c404_yearly)  # noqa : E501

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "MAE",
    prism_c404_mae_precip,
    None,
    prism_c404_mae_tk,
    None,
]  # noqa : E501

prism_c404_stats

In [None]:
# RMSE
prism_c404_rmse_precip = math.sqrt(
    np.square(
        np.subtract(
            prism_c404_yearly["PREC_NC_ACC_c404"],
            prism_c404_yearly["PREC_NC_ACC_prism"],
        )
    ).mean()
)  # noqa : E501
prism_c404_rmse_tk = math.sqrt(
    np.square(
        np.subtract(prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"])
    ).mean()
)  # noqa : E501

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "RMSE",
    prism_c404_rmse_precip,
    None,
    prism_c404_rmse_tk,
    None,
]  # noqa : E501

prism_c404_stats

In [None]:
# Pearsons correlation
prism_c404_pearson_precip = pearson_r(
    prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"]
)  # noqa : E501
prism_c404_pearson_tk = pearson_r(
    prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"]
)  # noqa : E501

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "Pearson",
    prism_c404_pearson_precip,
    None,
    prism_c404_pearson_tk,
    None,
]  # noqa : E501

prism_c404_stats

In [None]:
# Spearman's correlation
prism_c404_spearman_precip = spearman_r(
    prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"]
)  # noqa : E501
prism_c404_spearman_tk = spearman_r(
    prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"]
)  # noqa : E501

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "Spearman",
    prism_c404_spearman_precip,
    None,
    prism_c404_spearman_tk,
    None,
]  # noqa : E501

prism_c404_stats

In [None]:
# percent bias
prism_c404_pbias_precip = pbias(
    prism_c404_yearly["PREC_NC_ACC_prism"], prism_c404_yearly["PREC_NC_ACC_c404"]
)  # noqa : E501
prism_c404_pbias_tk = pbias(prism_c404_yearly["TK_prism"], prism_c404_yearly["TK_c404"])

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = [
    "pbias",
    prism_c404_pbias_precip,
    None,
    prism_c404_pbias_tk,
    None,
]  # noqa : E501

prism_c404_stats

Example export of descriptive stats

In [None]:
# prism_c404_stats.to_parquet("./file/path/to/c404_prism_drb_descriptive_stats.parquet", index=False) #noqa : E501

**CERES-EBAF net radiation zonal statistics**

Now, in the next cell, we'll run through the same steps to compute zonal statistics for the single CERES-EBAF variable, net radiation (RNET).

In [None]:
# open dataset
conus404_drb_cat["ceres-drb-OSN"].kwargs["decode_coords"] = "all"
ceres_drb = conus404_drb_cat["ceres-drb-OSN"].to_dask()

# crs
ceres_crs = 4326

# create the grid of ceres_drb using any of the variables
ceres_var = "RNET"

# drop unneeded variable and coordinates
ceres_grid = ceres_drb[[ceres_var]].drop(["time", ceres_var]).reset_coords().load()


# add bounds
ceres_grid = ceres_grid.cf.add_bounds(x)
ceres_grid = ceres_grid.cf.add_bounds(y)

# stack
ceres_points = ceres_grid.stack(point=(y, x))

# apply the bounds_to_poly function
ceres_boxes = xr.apply_ufunc(
    bounds_to_poly,
    ceres_points.x_bounds,
    ceres_points.y_bounds,
    input_core_dims=[("bounds",), ("bounds",)],
    output_dtypes=[np.dtype("O")],
    vectorize=True,
)

# create geodataframe from boxes
ceres_grid_df = gpd.GeoDataFrame(
    data={"geometry": ceres_boxes.values, "y": ceres_boxes[y], "x": ceres_boxes[x]},
    index=ceres_boxes.indexes["point"],
    crs=ceres_crs,
)

# convert DRB to the CERES-EBAF crs
ceres_drb_gdf = drb_gdf.to_crs(epsg=ceres_crs)

# overlay the two grids
ceres_overlay = ceres_grid_df.overlay(ceres_drb_gdf, keep_geom_type=True)

# grid cell fractions
ceres_grid_cell_fraction = ceres_overlay.geometry.area.groupby(
    ceres_overlay.huc6
).transform(lambda x: x / x.sum())

# create sparse dataarray
ceres_multi_index = ceres_overlay.set_index([y, x, "huc6"]).index
ceres_df_weights = pd.DataFrame(
    {"weights": ceres_grid_cell_fraction.values}, index=ceres_multi_index
)

ceres_ds_weights = xr.Dataset(ceres_df_weights)

ceres_weights_sparse = ceres_ds_weights.unstack(sparse=True, fill_value=0.0).weights

# Matrix multiplication across each DataArray
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    ceres_rnet_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        ceres_weights_sparse,
        ceres_drb["RNET"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))],
    )

# merge DataArrays into Dataset
ceres_regridded = xr.Dataset({"RNET": ceres_rnet_regridded})
ceres_regridded.attrs = ceres_drb.attrs

# Convert to DataFrame
ceres_df = ceres_regridded.load().to_dataframe()
ceres_df.head()

In [None]:
# reset index
ceres_zonal_stats = ceres_df.reset_index(drop=False)

# drop spatial_ref
ceres_zonal_stats.drop("spatial_ref", axis=1, inplace=True)

# convert time to string and drop day
ceres_zonal_stats["time"] = ceres_zonal_stats["time"].astype(str).str[:-3]

ceres_zonal_stats.head()

Example export of zonal stats.

In [None]:
# This is only an example
# ceres_zonal_stats.to_parquet("./file/path/to/ceres_drb_zonal_stats.parquet", index=False) #noqa : E501

### Descriptive statistics

We will now compute descriptive statistics between the previously computed CONUS404 and CERES-EBAF zonal statistics. This is the same overall process that was done with the precipitation and temperature PRISM zonal stats.

In [None]:
# Merge the CERES-EBAF and CONUS404 zonal stats together based on the HUC6 code and time
ceres_c404_zonal = ceres_zonal_stats.merge(
    c404_zonal_stats,
    left_on=["huc6", "time"],
    right_on=["huc6", "time"],
    suffixes=["_ceres", "_c404"],
)

# drop precipitation and temperature
ceres_c404_zonal.drop(["PREC_NC_ACC", "TK"], axis=1, inplace=True)

ceres_c404_zonal.head()

In [None]:
# convert time column to datetime type
ceres_c404_zonal["time"] = pd.to_datetime(ceres_c404_zonal["time"], format="%Y-%m")
ceres_c404_zonal.head()

In [None]:
# resample to annual values; only `time` is used for grouping, so rows from all HUC6 units in a year are aggregated together
ceres_c404_yearly = ceres_c404_zonal.resample("1Y", on="time").agg(
    {"RNET_c404": "sum", "RNET_ceres": "sum"}
)
ceres_c404_yearly.reset_index(drop=False, inplace=True)
ceres_c404_yearly.head()

In [None]:
# mean, median, standard deviation
ceres_c404_mean = ceres_c404_yearly.mean()
ceres_c404_median = ceres_c404_yearly.median()
ceres_c404_stdev = ceres_c404_yearly.std()

# create dataframe
ceres_c404_stats = pd.DataFrame(
    {
        "annual_mean": ceres_c404_mean,
        "median": ceres_c404_median,
        "stdev": ceres_c404_stdev,
    }
).T.drop("time", axis=1)

# reset index and rename
ceres_c404_stats = ceres_c404_stats.reset_index(drop=False).rename(
    {"index": "stat"}, axis=1
)  # noqa : E501

ceres_c404_stats

In [None]:
# bias
ceres_c404_stats_annual_mean = ceres_c404_stats.loc[
    ceres_c404_stats["stat"] == "annual_mean"
]  # noqa : E501
ceres_c404_bias_rnet = float(
    ceres_c404_stats_annual_mean["RNET_c404"]
    - ceres_c404_stats_annual_mean["RNET_ceres"]
)  # noqa : E501

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["bias", ceres_c404_bias_rnet, None]  # noqa : E501

# MAE
ceres_c404_mae_rnet = sum(
    abs(ceres_c404_yearly["RNET_c404"] - ceres_c404_yearly["RNET_ceres"])
) / len(ceres_c404_yearly)  # noqa : E501

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["MAE", ceres_c404_mae_rnet, None]  # noqa : E501

# RMSE
ceres_c404_rmse_rnet = math.sqrt(
    np.square(
        np.subtract(ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"])
    ).mean()
)  # noqa : E501

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["RMSE", ceres_c404_rmse_rnet, None]  # noqa : E501

# Pearsons correlation
ceres_c404_pearson_rnet = pearson_r(
    ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"]
)  # noqa : E501

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = [
    "Pearson",
    ceres_c404_pearson_rnet,
    None,
]  # noqa : E501

# Spearman's correlation
ceres_c404_spearman_rnet = spearman_r(
    ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"]
)  # noqa : E501

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = [
    "Spearman",
    ceres_c404_spearman_rnet,
    None,
]  # noqa : E501

# percent bias
ceres_c404_pbias_rnet = pbias(
    ceres_c404_yearly["RNET_ceres"], ceres_c404_yearly["RNET_c404"]
)

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = [
    "pbias",
    ceres_c404_pbias_rnet,
    None,
]  # noqa : E501

ceres_c404_stats

Example export of descriptive stats

In [None]:
# ceres_c404_stats.to_parquet("./file/path/to/c404_ceres_drb_descriptive_stats.parquet", index=False) #noqa : E501

**Extract gridded values to points**

The goal of this section is to extract values from CONUS404 where they overlap spatially and temporally with station data. This concept is described in an article about the ESRI tool [Extract Values to Points](https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-analyst/extract-values-to-points.htm). After extracting the values, the same descriptive statistics used to compare the gridded datasets will be run.

Process outline:
1. Read in the prepared dataset
2. Use latitude and longitude of each point to extract data from matching CONUS404 grid cell
<br>

**Climate Reference Network point extraction**

In [None]:
crn_drb_df = conus404_drb_cat["crn-drb-OSN"].read()

# create geodataframe
crn_drb = gpd.GeoDataFrame(
    crn_drb_df,
    crs=4326,
    geometry=gpd.points_from_xy(crn_drb_df.LONGITUDE, crn_drb_df.LATITUDE),
)

crn_drb.rename(
    {"DATE": "time", "TK": "TK_crn", "PREC_ACC_NC": "PREC_ACC_NC_crn"},
    axis=1,
    inplace=True,
)

crn_drb.head()

Get the coordinates of the CRN station and use them to index the CONUS404 DRB dataset.

In [None]:
# isolate single row and transform to c404_drb crs
crn_coords_gdf = crn_drb.iloc[[0]].to_crs(c404_crs)

# extract lat/long values
crn_lat = crn_coords_gdf.iloc[0]["geometry"].y
crn_lon = crn_coords_gdf.iloc[0]["geometry"].x

# get min and max time
crn_time_min = crn_drb["time"].min()
crn_time_max = crn_drb["time"].max()

# convert time to str
crn_drb["time"] = crn_drb["time"].astype(str).str[:-3]

# subset c404_drb to lat/long using nearest
c404_crn_sub = c404_drb.sel(x=crn_lon, y=crn_lat, method="nearest")

# slice to time-steps of crn_drb
c404_crn_sub = c404_crn_sub.sel(time=slice(crn_time_min, crn_time_max))

c404_crn_sub

Convert subset to dataframe and reorganize columns

In [None]:
c404_crn_sub_df = c404_crn_sub.to_dataframe().reset_index(drop=False)

# trim columns
c404_crn_sub_df = c404_crn_sub_df[["time", "TK", "PREC_ACC_NC"]]

# rename columns
c404_crn_sub_df.rename(
    {"TK": "TK_c404", "PREC_ACC_NC": "PREC_ACC_NC_c404"}, axis=1, inplace=True
)

# trim time
c404_crn_sub_df["time"] = c404_crn_sub_df["time"].astype(str).str[:-3]

c404_crn_sub_df

Combine CONUS404 subset with CRN data
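
The merge relies on both tables sharing a year-month string in `time`. The following minimal sketch shows why that shared key matters, using two made-up monthly tables whose day stamps differ. The table names, column names, and values are hypothetical, and it is only an assumption that day stamps could differ between real sources.

```python
import pandas as pd

# Hypothetical monthly station record stamped on the first of each month.
station = pd.DataFrame(
    {
        "time": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]),
        "TK_station": [272.1, 273.4, 277.0],
    }
)

# Hypothetical monthly model record stamped mid-month and covering other months.
model = pd.DataFrame(
    {
        "time": pd.to_datetime(["2000-02-15", "2000-03-15", "2000-04-15"]),
        "TK_model": [273.0, 276.2, 281.5],
    }
)

# Reduce both timestamps to a shared "YYYY-MM" key so day stamps cannot block the join.
station["time"] = station["time"].dt.strftime("%Y-%m")
model["time"] = model["time"].dt.strftime("%Y-%m")

# An inner merge (the pandas default) keeps only months present in both tables.
paired = station.merge(model, on="time")
print(paired)
```

Only the months found in both tables survive the merge, so it is worth checking the row count of the result against the station record.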

In [None]:
crn_c404_point = crn_drb.merge(c404_crn_sub_df, on="time").reset_index(drop=False)

# drop columns
crn_c404_point.drop(
    ["index", "LATITUDE", "LONGITUDE", "ID", "geometry"], axis=1, inplace=True
)  # noqa : E501

crn_c404_point.head()

In [None]:
# convert time column to datetime type
crn_c404_point["time"] = pd.to_datetime(crn_c404_point["time"], format="%Y-%m")

Example of exporting point data

In [None]:
# crn_c404_point.to_parquet("./file/path/to/c404_crn_drb_point_values.parquet", index=False)

### Descriptive statistics

We will now compute descriptive statistics between the point-extracted CONUS404 and CRN data.
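
For reference, the error metrics computed in the next cell can be written as small NumPy functions. This is a sketch under stated assumptions: the `*_sketch` names are hypothetical, the inputs are made up, and the percent-bias formula shown (`100 * sum(sim - obs) / sum(obs)`) is one common convention that may differ in sign or argument order from the `pbias` helper defined earlier in this notebook.

```python
import numpy as np


def _diff(sim, obs):
    """Element-wise difference (simulated minus observed) as a float array."""
    return np.asarray(sim, dtype=float) - np.asarray(obs, dtype=float)


def bias_sketch(sim, obs):
    """Mean error; equals mean(sim) - mean(obs) for equal-length inputs."""
    return float(_diff(sim, obs).mean())


def mae_sketch(sim, obs):
    """Mean absolute error."""
    return float(np.abs(_diff(sim, obs)).mean())


def rmse_sketch(sim, obs):
    """Root mean squared error."""
    return float(np.sqrt(np.square(_diff(sim, obs)).mean()))


def pbias_sketch(sim, obs):
    """Percent bias under the assumed convention 100 * sum(sim - obs) / sum(obs)."""
    return float(100.0 * _diff(sim, obs).sum() / np.asarray(obs, dtype=float).sum())


# Made-up yearly values for a simulated and an observed series.
sim = [10.0, 12.0, 14.0]
obs = [9.0, 12.0, 16.0]

for name, func in [
    ("bias", bias_sketch),
    ("MAE", mae_sketch),
    ("RMSE", rmse_sketch),
    ("pbias", pbias_sketch),
]:
    print(name, round(func(sim, obs), 4))
```

Bias and percent bias keep the sign of the error, so positive and negative differences can cancel. MAE and RMSE cannot cancel, and RMSE weights large differences more heavily.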

In [None]:
# resample to yearly values (precipitation totals, temperature means)
crn_c404_yearly = crn_c404_point.resample("1Y", on="time").agg(
    {
        "PREC_ACC_NC_c404": "sum",
        "PREC_ACC_NC_crn": "sum",
        "TK_c404": "mean",
        "TK_crn": "mean",
    }
)
crn_c404_yearly.reset_index(drop=False, inplace=True)

# mean, median, standard deviation
crn_c404_mean = crn_c404_yearly.mean()
crn_c404_median = crn_c404_yearly.median()
crn_c404_stdev = crn_c404_yearly.std()

# create dataframe
crn_c404_stats = pd.DataFrame(
    {"annual_mean": crn_c404_mean, "median": crn_c404_median, "stdev": crn_c404_stdev}
).T.drop("time", axis=1)  # noqa : E501

# reset index and rename
crn_c404_stats = crn_c404_stats.reset_index(drop=False).rename(
    {"index": "stat"}, axis=1
)  # noqa : E501

# bias
crn_c404_stats_annual_mean = crn_c404_stats.loc[crn_c404_stats["stat"] == "annual_mean"]  # noqa : E501
crn_c404_bias_precip = float(
    crn_c404_stats_annual_mean["PREC_ACC_NC_c404"]
    - crn_c404_stats_annual_mean["PREC_ACC_NC_crn"]
)  # noqa : E501
crn_c404_bias_tk = float(
    crn_c404_stats_annual_mean["TK_c404"] - crn_c404_stats_annual_mean["TK_crn"]
)  # noqa : E501
# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "bias",
    crn_c404_bias_precip,
    np.nan,
    crn_c404_bias_tk,
    np.nan,
]  # noqa : E501

# MAE
crn_c404_mae_precip = sum(
    abs(crn_c404_yearly["PREC_ACC_NC_c404"] - crn_c404_yearly["PREC_ACC_NC_crn"])
) / len(crn_c404_yearly)  # noqa : E501
crn_c404_mae_tk = sum(
    abs(crn_c404_yearly["TK_c404"] - crn_c404_yearly["TK_crn"])
) / len(crn_c404_yearly)  # noqa : E501
# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "MAE",
    crn_c404_mae_precip,
    np.nan,
    crn_c404_mae_tk,
    np.nan,
]  # noqa : E501

# RMSE
crn_c404_rmse_precip = math.sqrt(
    np.square(
        np.subtract(
            crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"]
        )
    ).mean()
)  # noqa : E501
crn_c404_rmse_tk = math.sqrt(
    np.square(np.subtract(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])).mean()
)  # noqa : E501

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "RMSE",
    crn_c404_rmse_precip,
    np.nan,
    crn_c404_rmse_tk,
    np.nan,
]  # noqa : E501

# Pearson's correlation
crn_c404_pearson_precip = pearson_r(
    crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"]
)  # noqa : E501
crn_c404_pearson_tk = pearson_r(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])  # noqa : E501

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "pearson",
    crn_c404_pearson_precip,
    np.nan,
    crn_c404_pearson_tk,
    np.nan,
]  # noqa : E501

# Spearman's correlation
crn_c404_spearman_precip = spearman_r(
    crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"]
)  # noqa : E501
crn_c404_spearman_tk = spearman_r(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])  # noqa : E501

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "spearman",
    crn_c404_spearman_precip,
    np.nan,
    crn_c404_spearman_tk,
    np.nan,
]  # noqa : E501

# percent bias
crn_c404_pbias_precip = pbias(
    crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"]
)  # noqa : E501
crn_c404_pbias_tk = pbias(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = [
    "pbias",
    crn_c404_pbias_precip,
    np.nan,
    crn_c404_pbias_tk,
    np.nan,
]  # noqa : E501

crn_c404_stats

Example export of descriptive stats

In [None]:
# crn_c404_stats.to_parquet("./file/path/to/c404_crn_drb_descriptive_stats.parquet", index=False) #noqa : E501

**Historical Climatology Network (HCN) point extraction**

Unlike the CRN data, which came from a single station, the HCN data comes from multiple stations. Extracting the matching CONUS404 values therefore requires multiple sets of geographic coordinates.
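
With several stations, the selection has to return one value per station rather than every combination of station rows and columns. xarray does this pointwise (vectorized) indexing when the indexers passed to `.sel` are `DataArray`s that share a dimension, as `target_lon` and `target_lat` do below with `ID`. The NumPy sketch here illustrates the distinction on a made-up grid; all arrays and values are hypothetical.

```python
import numpy as np

# Hypothetical grid axes (meters) and a synthetic field with shape (y, x).
x = np.array([0.0, 4000.0, 8000.0, 12000.0])
y = np.array([0.0, 4000.0, 8000.0])
field = np.arange(12, dtype=float).reshape(3, 4)

# Two hypothetical stations, already in the grid's coordinate system.
station_x = np.array([5100.0, 11000.0])
station_y = np.array([7000.0, 500.0])

# Nearest column/row index for every station at once via broadcasting.
ix = np.abs(x[None, :] - station_x[:, None]).argmin(axis=1)
iy = np.abs(y[None, :] - station_y[:, None]).argmin(axis=1)

# Pointwise indexing pairs iy[k] with ix[k]: one value per station.
pointwise = field[iy, ix]

# Outer (orthogonal) indexing returns every row/column combination instead.
outer = field[np.ix_(iy, ix)]

print(pointwise)    # one value per station
print(outer.shape)  # (stations, stations) block, usually not what is wanted
```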

In [None]:
# read in dataset
hcn_drb_df = conus404_drb_cat["hcn-drb-OSN"].read()

# rename columns
hcn_drb_df.rename(
    {"DATE": "time", "TK": "TK_hcn", "PREC_ACC_NC": "PREC_ACC_NC_hcn"},
    axis=1,
    inplace=True,
)

# convert the time field (formerly DATE) to a year-month string
hcn_drb_df["time"] = hcn_drb_df["time"].astype(str).str[:-3]

# hcn_drb_df.head()

Get a DataFrame of the station IDs, latitudes, and longitudes to use for extracting data

In [None]:
hcn_stations = hcn_drb_df.copy().drop(["time", "TK_hcn", "PREC_ACC_NC_hcn"], axis=1)
hcn_stations["LONGITUDE"] = pd.to_numeric(hcn_stations["LONGITUDE"])
hcn_stations["LATITUDE"] = pd.to_numeric(hcn_stations["LATITUDE"])

hcn_stations = hcn_stations.groupby("ID").mean().reset_index(drop=False)
# hcn_stations

Create a GeoDataFrame to convert the lat and long to the coordinate system of CONUS404

In [None]:
hcn_stations_gdf = gpd.GeoDataFrame(
    hcn_stations,
    crs=4326,
    geometry=gpd.points_from_xy(hcn_stations.LONGITUDE, hcn_stations.LATITUDE),
)

# transform to c404_drb crs
hcn_stations_gdf = hcn_stations_gdf.to_crs(c404_crs)

# extract lat/long values and give them the ID of the stations
target_lon = xr.DataArray(
    hcn_stations_gdf["geometry"].x.to_numpy(),
    dims="ID",
    coords=dict(ID=hcn_stations_gdf.ID),
)  # noqa : E501
target_lat = xr.DataArray(
    hcn_stations_gdf["geometry"].y.to_numpy(),
    dims="ID",
    coords=dict(ID=hcn_stations_gdf.ID),
)  # noqa : E501

Subset `c404_drb` to time period of HCN

In [None]:
# time min/max
hcn_time_min = hcn_drb_df["time"].min()
hcn_time_max = hcn_drb_df["time"].max()

# slice c404 to HCN time and drop unused vars
c404_hcn_timesub = c404_drb.sel(time=slice(hcn_time_min, hcn_time_max)).drop_vars(
    "RNET"
)  # noqa : E501


# rename data_vars names
name_dict = dict(TK="TK_c404", PREC_ACC_NC="PREC_ACC_NC_c404")
c404_hcn_timesub = c404_hcn_timesub.rename(name_dict)

Extract data from points

In [None]:
# subset to points
c404_stations_sub = c404_hcn_timesub.sel(
    y=target_lat, x=target_lon, method="nearest"
).compute()  # noqa : E501

# convert to a dataframe and reset the index
c404_stations_df = c404_stations_sub.to_dataframe().reset_index(drop=False)

Merge the extracted CONUS404 values with the HCN observations using the station ID and time

In [None]:
# prepare df to merge back to hcn_drb_df
# start by subsetting columns (copy so the assignment below does not act on a view)
c404_stations_df = c404_stations_df[
    ["ID", "time", "TK_c404", "PREC_ACC_NC_c404"]
].copy()

# convert time to string
c404_stations_df["time"] = c404_stations_df["time"].dt.strftime("%Y-%m")

# merge into hcn_drb_df
hcn_c404_point = pd.merge(hcn_drb_df, c404_stations_df, on=["ID", "time"])

# convert time column to datetime type
hcn_c404_point["time"] = pd.to_datetime(hcn_c404_point["time"], format="%Y-%m")

hcn_c404_point.head()

Example of exporting point data

In [None]:
# hcn_c404_point.to_parquet("./file/path/to/c404_hcn_drb_point_values.parquet", index=False)

### Descriptive statistics

We will now compute descriptive statistics between the point-extracted CONUS404 and HCN data.
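
The yearly aggregation in the following cell sums precipitation and averages temperature separately for each station. A minimal sketch of that pattern on made-up data is shown here. It groups on the calendar year taken from the timestamp rather than using `resample`, which is a substitution made for this illustration; the column names and values are hypothetical.

```python
import pandas as pd

# Made-up monthly records for two hypothetical stations spanning two years.
monthly = pd.DataFrame(
    {
        "ID": ["A"] * 4 + ["B"] * 4,
        "time": pd.to_datetime(
            ["2000-11-01", "2000-12-01", "2001-01-01", "2001-02-01"] * 2
        ),
        "precip": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
        "temp": [280.0, 276.0, 272.0, 274.0, 282.0, 278.0, 274.0, 276.0],
    }
)

# Group by station and calendar year, then apply a different reducer per column:
# precipitation accumulates over the year, temperature is averaged.
yearly = monthly.groupby(
    ["ID", monthly["time"].dt.year.rename("year")]
).agg({"precip": "sum", "temp": "mean"})

print(yearly)
```

Partial years, such as the two-month "years" in this toy table, give sums that are not comparable with full-year totals, so incomplete years deserve a check before the statistics are interpreted.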

In [None]:
# yearly values per station (precipitation totals, temperature means)
hcn_c404_yearly = (
    hcn_c404_point.groupby("ID")
    .resample("1Y", on="time")
    .agg(
        {
            "PREC_ACC_NC_c404": "sum",
            "PREC_ACC_NC_hcn": "sum",
            "TK_c404": "mean",
            "TK_hcn": "mean",
        }
    )
)

hcn_c404_yearly = hcn_c404_yearly.dropna(axis=0)

In [None]:
# mean
hcn_c404_mean = hcn_c404_yearly.groupby("ID").mean()
hcn_c404_mean["stat"] = "mean"

# median
hcn_c404_median = hcn_c404_yearly.groupby("ID").median()
hcn_c404_median["stat"] = "median"

# stdev
hcn_c404_std = hcn_c404_yearly.groupby("ID").std()
hcn_c404_std["stat"] = "stdev"

# summary stats
hcn_c404_summary = (
    pd.concat([hcn_c404_mean, hcn_c404_median, hcn_c404_std])
    .sort_index()
    .reset_index(drop=False)
)

# reorder columns to make the table more readable
hcn_c404_columns = [
    "ID",
    "stat",
    "TK_c404",
    "TK_hcn",
    "PREC_ACC_NC_c404",
    "PREC_ACC_NC_hcn",
]  # noqa : E501
hcn_c404_summary = hcn_c404_summary[hcn_c404_columns]

hcn_c404_summary

Next, we will compute goodness-of-fit statistics for each station and then create a dataframe of all the stats.

Notice that many of the intermediate `pd.Series` objects are assigned a `.name` (e.g., `hcn_c404_bias_PREC_ACC_NC.name = "PREC_ACC_NC_c404"`). The name becomes the column name when the `pd.Series` is later transformed into a `pd.DataFrame`. Several `pd.Series` are deliberately given the same name so that their columns match those in `hcn_c404_summary` above, which lets us concatenate the two `pd.DataFrame`s cleanly.
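
The effect of setting `.name` can be seen in a small sketch with made-up data. The frame, its column names, and the values are hypothetical; the per-station MAE is computed column-wise here only to keep the example short.

```python
import pandas as pd

# Made-up yearly values for two hypothetical stations.
yearly = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "B"],
        "model": [10.0, 14.0, 5.0, 9.0],
        "obs": [9.0, 13.0, 8.0, 9.0],
    }
)

# Per-station mean absolute error; the resulting Series has no name yet.
abs_err = (yearly["model"] - yearly["obs"]).abs()
mae_by_station = abs_err.groupby(yearly["ID"]).mean()

# Without a name, to_frame() would label the value column 0.
mae_by_station.name = "model"

# The Series name becomes the column name; the index becomes the ID column.
mae_df = mae_by_station.to_frame().reset_index(drop=False).assign(stat="MAE")
print(mae_df)
```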

In [None]:
# create dataframes for each goodness of fit stat

# functions
def dataframe_transform(data: pd.Series) -> pd.DataFrame:
    """Transform a Series to DataFrame."""
    df = data.to_frame().reset_index(drop=False)
    if "level_1" in df.columns:
        df = df.drop("level_1", axis=1)
    return df


def create_stat_df(ldata: pd.Series, rdata: pd.Series, stat: str) -> pd.DataFrame:
    """Merge two per-station series on ID and label the rows with the stat name."""
    # custom transform function
    left_data_df = dataframe_transform(ldata)
    right_data_df = dataframe_transform(rdata)

    # merge data on ID
    df = pd.merge(left_data_df, right_data_df, on="ID").assign(stat=stat)
    return df


# calculate stats

# bias

hcn_c404_stats_annual_mean = hcn_c404_summary.loc[hcn_c404_summary["stat"] == "mean"]
hcn_c404_bias_PREC_ACC_NC = hcn_c404_stats_annual_mean.groupby("ID").apply(
    lambda x: bias(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_bias_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_bias_TK = hcn_c404_stats_annual_mean.groupby("ID").apply(
    lambda x: bias(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_bias_TK.name = "TK_c404"

hcn_c404_bias = create_stat_df(hcn_c404_bias_PREC_ACC_NC, hcn_c404_bias_TK, "bias")  # noqa : E501

# MAE
hcn_c404_mae_PREC_ACC_NC = hcn_c404_yearly.groupby("ID").apply(
    lambda x: mae(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_mae_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_mae_TK = hcn_c404_yearly.groupby("ID").apply(
    lambda x: mae(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_mae_TK.name = "TK_c404"

hcn_c404_mae = create_stat_df(hcn_c404_mae_PREC_ACC_NC, hcn_c404_mae_TK, "MAE")  # noqa : E501

# RMSE
hcn_c404_rmse_PREC_ACC_NC = hcn_c404_yearly.groupby("ID").apply(
    lambda x: rmse(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_rmse_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_rmse_TK = hcn_c404_yearly.groupby("ID").apply(
    lambda x: rmse(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_rmse_TK.name = "TK_c404"

hcn_c404_rmse = create_stat_df(hcn_c404_rmse_PREC_ACC_NC, hcn_c404_rmse_TK, "RMSE")  # noqa : E501

# Pearson r
hcn_c404_pearson_PREC_ACC_NC = hcn_c404_yearly.groupby("ID").apply(
    lambda x: pearson_r(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_pearson_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_pearson_TK = hcn_c404_yearly.groupby("ID").apply(
    lambda x: pearson_r(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_pearson_TK.name = "TK_c404"

hcn_c404_pearson = create_stat_df(
    hcn_c404_pearson_PREC_ACC_NC, hcn_c404_pearson_TK, "pearson"
)  # noqa : E501

# Spearman r
hcn_c404_spearman_PREC_ACC_NC = hcn_c404_yearly.groupby("ID").apply(
    lambda x: spearman_r(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_spearman_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_spearman_TK = hcn_c404_yearly.groupby("ID").apply(
    lambda x: spearman_r(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_spearman_TK.name = "TK_c404"

hcn_c404_spearman = create_stat_df(
    hcn_c404_spearman_PREC_ACC_NC, hcn_c404_spearman_TK, "spearman"
)  # noqa : E501

# percent bias
hcn_c404_pbias_PREC_ACC_NC = hcn_c404_yearly.groupby("ID").apply(
    lambda x: pbias(x["PREC_ACC_NC_c404"], x["PREC_ACC_NC_hcn"])
)  # noqa : E501
hcn_c404_pbias_PREC_ACC_NC.name = "PREC_ACC_NC_c404"
hcn_c404_pbias_TK = hcn_c404_yearly.groupby("ID").apply(
    lambda x: pbias(x["TK_c404"], x["TK_hcn"])
)  # noqa : E501
hcn_c404_pbias_TK.name = "TK_c404"

hcn_c404_pbias = create_stat_df(hcn_c404_pbias_PREC_ACC_NC, hcn_c404_pbias_TK, "pbias")  # noqa : E501

# create dataframe of all goodness of fit stats
hcn_gof_df = pd.concat(
    [
        hcn_c404_bias,
        hcn_c404_mae,
        hcn_c404_rmse,
        hcn_c404_pearson,
        hcn_c404_spearman,
        hcn_c404_pbias,
    ]
)  # noqa : E501
hcn_gof_df

Concatenate the summary statistics and goodness-of-fit statistics, then sort by ID
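
Because the goodness-of-fit rows carry values only in the `_c404` columns, concatenating them with the summary rows leaves `NaN` in the reference columns. The sketch below shows this behavior with made-up values. It borrows the `TK_c404` and `TK_hcn` column names for readability, and the secondary sort on `stat` is an addition for this illustration.

```python
import pandas as pd

# Made-up summary rows: values exist for both the model and the reference column.
summary = pd.DataFrame(
    {
        "ID": ["A", "B"],
        "stat": ["mean", "mean"],
        "TK_c404": [280.0, 281.0],
        "TK_hcn": [279.5, 281.4],
    }
)

# Made-up goodness-of-fit rows: a single value per station, stored under TK_c404.
gof = pd.DataFrame(
    {
        "ID": ["A", "B"],
        "stat": ["bias", "bias"],
        "TK_c404": [0.5, -0.4],
    }
)

# concat aligns on column names and fills columns missing from a frame with NaN.
combined = (
    pd.concat([summary, gof], ignore_index=True)
    .sort_values(by=["ID", "stat"])
    .reset_index(drop=True)
)
print(combined)
```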

In [None]:
def concat_sort(df_list: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate dataframes and sort the result by ID."""
    df = pd.concat(df_list)
    df = df.sort_values(by="ID")
    return df


hcn_c404_stats = concat_sort([hcn_c404_summary, hcn_gof_df])
hcn_c404_stats

Example export of descriptive stats

In [None]:
# hcn_c404_stats.to_parquet("./file/path/to/c404_hcn_drb_descriptive_stats.parquet", index=False) #noqa : E501

#### Shut down the client and cluster

In [None]:
client.close()
cluster.shutdown()

## Wrapping Up

You have now calculated the statistics needed to start making decisions about CONUS404 and its suitability as a forcing dataset. You did this by comparing CONUS404 against other gridded datasets (PRISM, CERES-EBAF) and against two sources of ground-truth station data (HCN, CRN).

The next notebook will walk you through different visualizations of these results, including maps comparing the gridded and point data, boxplots, and histograms.