# Create zonal statistics and point extractions for comparing CONUS404 and reference datasets

<img src='../../../doc/assets/Eval_Analysis.svg' width=600>

Now that the data has been prepared, it is time to compute zonal statistics and perform point extractions. 

<details>
  <summary>Guide to pre-requisites and learning outcomes...&lt;click to expand&gt;</summary>
  
  <table>
    <tr>
      <td>Pre-Requisites
      <td>To get the most out of this notebook, you should already have an understanding of these topics: 
        <ul>
        <li>pre-req one
        <li>pre-req two
        </ul>
    <tr>
      <td>Expected Results
      <td>At the end of this notebook, you should be able to: 
        <ul>
        <li>outcome one
        <li>outcome two
        </ul>
  </table>
</details>

In [None]:
# library imports
import cf_xarray
import dask
import fsspec 
import geopandas as gpd
import hvplot.xarray
import intake
import math
import numpy as np
import pandas as pd
import pygeohydro
import rioxarray  # registers the .rio accessor used below to read the CONUS404 CRS
import sparse 
import warnings
import xarray as xr

from shapely.geometry import Polygon

warnings.filterwarnings('ignore')

# run script for available functions
%run ../../../model_evaluation/Metrics_StdSuite_v1.ipynb

# data
# connect to HyTEST catalog
url = 'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml'
cat = intake.open_catalog(url)

# access tutorial catalog
conus404_drb_cat = cat["conus404-drb-eval-tutorial-catalog"]
list(conus404_drb_cat)

# Update to helper function after repo consolidation
## **Start a Dask client using an appropriate Dask Cluster** 
This is an optional step, but can speed up data loading significantly, especially when accessing data from the cloud.

In [None]:
def configure_cluster(machine):
    ''' Helper function to configure cluster
    '''
    if machine == 'denali':
        from dask.distributed import LocalCluster, Client
        cluster = LocalCluster(threads_per_worker=1)
        client = Client(cluster)
    
    elif machine == 'tallgrass':
        from dask.distributed import Client
        from dask_jobqueue import SLURMCluster
        cluster = SLURMCluster(queue='cpu', cores=1, interface='ib0',
                               job_extra=['--nodes=1', '--ntasks-per-node=1', '--cpus-per-task=1'],
                               memory='6GB')
        cluster.adapt(maximum_jobs=30)
        client = Client(cluster)
        
    elif machine == 'local':
        import os
        import warnings
        from dask.distributed import LocalCluster, Client
        warnings.warn("Running locally can result in costly data transfers!\n")
        n_cores = os.cpu_count() # set to match your machine
        cluster = LocalCluster(threads_per_worker=n_cores)
        client = Client(cluster)
        
    elif machine in ['esip-qhub-gateway-v0.4']:   
        import sys, os
        sys.path.append(os.path.join(os.environ['HOME'],'shared','users','lib'))
        import ebdpy as ebd
        aws_profile = 'esip-qhub'  
        ebd.set_credentials(profile=aws_profile)

        aws_region = 'us-west-2'
        endpoint = f's3.{aws_region}.amazonaws.com'
        ebd.set_credentials(profile=aws_profile, region=aws_region, endpoint=endpoint)
        worker_max = 30
        client,cluster = ebd.start_dask_cluster(profile=aws_profile, worker_max=worker_max, 
                                              region=aws_region, use_existing_cluster=True,
                                              adaptive_scaling=True, wait_for_cluster=False, 
                                              worker_profile='Medium Worker', propagate_env=True)
        
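    else:
        raise ValueError(f"Unknown machine '{machine}'; expected 'denali', 'tallgrass', 'local', or 'esip-qhub-gateway-v0.4'")

    return client, cluster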

### Setup your cluster

#### QHub...
The cell below is active as written; comment it out if you are running on HPC instead.

In [None]:
# set machine
machine = 'esip-qhub-gateway-v0.4'

# use configure cluster helper function to setup dask
client, cluster = configure_cluster(machine)

#### or HPC
Uncomment the lines that begin with a single `#` to run (lines that begin with `##` are ordinary comments).

In [None]:
## set machine
# import os
# machine = os.environ['SLURM_CLUSTER_NAME']

## use configure_cluster helper function to setup dask
# client, cluster = configure_cluster(machine)

### Connect to catalog of tutorial datasets

Workflow outline:
1. Read in the prepared dataset
2. Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset
3. Make a data mask with the HUC6 boundaries to calculate zonal statistics
4. Compute zonal statistics with data mask and prepared data

Once all calculations are done: 

5. Combine each reference with benchmark into single dataset
6. Export gridded data zonal statistics
<br>


## **Compute zonal statistics for gridded datasets**

In the last tutorial, we prepared three gridded datasets: CONUS404 (benchmark), PRISM (reference), and CERES-EBAF (reference). The goal of this section is to compute [zonal statistics](https://gisgeography.com/zonal-statistics/) for each HUC6 zone in the Delaware River Basin (DRB) by using the [conservative regridding method described by Ryan Abernathey](https://discourse.pangeo.io/t/conservative-region-aggregation-with-xarray-geopandas-and-sparse/2715) to regrid and perform an area-weighted analysis.

Dataset outline:
<ol>
    <li>Read in the prepared dataset</li>
    <li>Compute bounding bands for latitude and longitude (if necessary) then use these to create polygons in area-preserving CRS</li>
    <li>Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset</li>
    <li>Overlay the dataset polygons over the HUC6 boundaries and create spatial weights matrices</li>
    <li>Perform matrix multiplication between the prepared dataset and the spatial weights matrices</li>
    <li>Perform zonal statistics</li>
</ol>
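
The matrix multiplication step in the outline above can be illustrated with a small NumPy sketch. The grid values, the weights, and the variable names here are made up for illustration and are not taken from the tutorial datasets. The sketch assumes each zone's weights sum to 1 over the grid, so the product is an area-weighted mean per zone and time step.

```python
import numpy as np

# Toy data with dimensions (time, y, x); values are illustrative only
data = np.array([
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
])

# Toy weights with dimensions (y, x, zone); each zone's weights sum to 1
weights = np.zeros((2, 2, 2))
weights[0, 0, 0] = 0.5   # zone 0 draws half its area from cell (0, 0)
weights[0, 1, 0] = 0.5   # and half from cell (0, 1)
weights[1, 0, 1] = 0.25  # zone 1 draws a quarter from cell (1, 0)
weights[1, 1, 1] = 0.75  # and three quarters from cell (1, 1)

n_time = data.shape[0]
n_cells = data.shape[1] * data.shape[2]
n_zones = weights.shape[2]

# Flatten the spatial dimensions, then multiply: (time, cells) @ (cells, zones)
zonal_mean = data.reshape(n_time, n_cells) @ weights.reshape(n_cells, n_zones)
print(zonal_mean)
```

Each row of `zonal_mean` is a time step and each column is a zone. The notebook does the same thing with a sparse weights array, because most grid cells contribute nothing to a given zone.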

The following two functions are used to regrid each dataset. Review them now; how they work is explained where they are applied.

In [None]:
def bounds_to_poly(x_bounds, y_bounds):
    """Return a polygon based on the x (longitude) and y (latitude) bounding band DataArrays"""
    return Polygon([
        (x_bounds[0], y_bounds[0]),
        (x_bounds[0], y_bounds[1]),
        (x_bounds[1], y_bounds[1]),
        (x_bounds[1], y_bounds[0])
    ])

def apply_weights_matmul_sparse(weights, data):
    """Apply weights stored in a sparse array to the data and regrid"""
    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # k = nlat * nlon
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))
    weights_shape = weights.shape
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))

    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()

The following `fsspec.filesystem` will be used to read each dataset from a read-only [Open Storage Network](https://www.openstoragenetwork.org/) bucket.

In [None]:
fs_read = fsspec.filesystem('s3', anon=True, skip_instance_cache=True,
                            client_kwargs={'endpoint_url': 'https://renc.osn.xsede.org'})

x = "x"
y = "y"

Set up the geometries for the DRB

In [None]:
# bring in HUC6 boundaries found in the DRB
drb_gdf = pygeohydro.WBD("huc6", outfields=["huc6", "name"]).byids("huc6", ["020401", "020402"])

# area preserving crs
crs_area = "ESRI:53034"

# reproject to the area-preserving CRS
drb_gdf = drb_gdf.to_crs(crs_area)

#visualize
# drb_gdf.plot(edgecolor="orange", facecolor="purple", linewidth=2.5)

**CONUS404 zonal statistics**

In [None]:
# open dataset
c404_drb = conus404_drb_cat['conus404-drb-OSN'].to_dask()

# crs
c404_crs = c404_drb.rio.crs.to_proj4()

# c404_drb

In [None]:
# c404_drb.PREC_ACC_NC.hvplot(x="x", y="y", rasterize=True)

Create the grid of c404_drb using any of the variables

In [None]:
# set vars
c404_var = "TK"

# drop unneeded variable and coordinates
c404_grid = c404_drb[[c404_var]].drop(['time', 'lon', 'lat', c404_var]).reset_coords().load() #load in Richs code
c404_grid

And create bounding bands then stack into points
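
As a rough illustration of what a bounding band is, the sketch below derives lower and upper cell edges from cell-center coordinates. It is not the `cf_xarray` implementation: the helper name `centers_to_bounds` is hypothetical, and the sketch assumes the centers are uniformly spaced.

```python
import numpy as np

def centers_to_bounds(centers):
    """Return an (n, 2) array of [lower, upper] edges for uniformly spaced centers."""
    centers = np.asarray(centers, dtype=float)
    half_width = np.diff(centers).mean() / 2  # assumption: uniform spacing
    return np.stack([centers - half_width, centers + half_width], axis=1)

# Illustrative projected x coordinates in meters (made-up values)
x_centers = [0.0, 1000.0, 2000.0]
x_bounds = centers_to_bounds(x_centers)
print(x_bounds)
```

Pairing one row of x bounds with one row of y bounds gives the four corners that `bounds_to_poly` turns into a grid-cell polygon.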

In [None]:
# add bounds
c404_grid = c404_grid.cf.add_bounds(x)
c404_grid = c404_grid.cf.add_bounds(y)

# stack
c404_points = c404_grid.stack(point=(y,x))
c404_points

Next, use the `xarray.apply_ufunc` function to apply the `bounds_to_poly` function above to the _c404_points_ Dataset.

In [None]:
c404_boxes = xr.apply_ufunc(
    bounds_to_poly,
    c404_points.x_bounds,
    c404_points.y_bounds,
    input_core_dims=[("bounds",),  ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
c404_boxes

Create geodataframe from boxes

In [None]:
c404_grid_df= gpd.GeoDataFrame(
    data={"geometry": c404_boxes.values, "y": c404_boxes[y], "x": c404_boxes[x]},
    index=c404_boxes.indexes["point"],
    crs=c404_crs
)
c404_grid_df

In [None]:
# c404_grid_df.plot(edgecolor="red", facecolor="white", linewidth=0.05)

Overlay the two grids
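
Conceptually, the overlay intersects every grid-cell polygon with every zone polygon and keeps the non-empty pieces. The sketch below mimics that with axis-aligned rectangles in plain Python. It is a simplified stand-in for the GeoPandas overlay, and the helper `overlap_area`, the cell layout, and the zone rectangle are all hypothetical.

```python
def overlap_area(a, b):
    """Area shared by two rectangles given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

# Four unit grid cells keyed by (y, x) index, plus one rectangular "zone"
cells = {
    (0, 0): (0.0, 0.0, 1.0, 1.0),
    (0, 1): (1.0, 0.0, 2.0, 1.0),
    (1, 0): (0.0, 1.0, 1.0, 2.0),
    (1, 1): (1.0, 1.0, 2.0, 2.0),
}
zone = (0.5, 0.5, 2.0, 1.5)

# Keep only the cells that actually intersect the zone
pieces = {}
for key, cell in cells.items():
    area = overlap_area(cell, zone)
    if area > 0:
        pieces[key] = area

print(pieces)
```

The intersected areas add up to the area of the zone itself, which is what makes it possible to turn them into weights.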

In [None]:
# convert DRB to conus404 crs
c404_drb_gdf = drb_gdf.to_crs(c404_crs)

# perform overlay
c404_overlay = c404_grid_df.overlay(c404_drb_gdf, keep_geom_type=True)
c404_overlay.head()

In [None]:
# plot overlay for single HUC6
# c404_overlay[c404_overlay.huc6 == "020402"].geometry.plot(edgecolor='k')

Grid cell fractions
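
The fraction for each intersected piece is its area divided by the total intersected area of its zone, so the fractions within a zone sum to 1. A minimal pandas sketch of the same `groupby(...).transform(...)` pattern, using made-up areas and zone labels:

```python
import pandas as pd

# Made-up intersected areas for two zones ("A" and "B")
overlay = pd.DataFrame({
    "huc6": ["A", "A", "B", "B", "B"],
    "area": [30.0, 10.0, 5.0, 5.0, 10.0],
})

# Divide each piece by the total area of its own zone
fraction = overlay["area"].groupby(overlay["huc6"]).transform(lambda a: a / a.sum())
print(fraction.tolist())

# Sanity check: fractions sum to 1 within every zone
print(fraction.groupby(overlay["huc6"]).sum())
```

Because `transform` returns a result aligned with the original rows, the fractions can be attached back to the overlay pieces without any reindexing.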

In [None]:
c404_grid_cell_fraction = c404_overlay.geometry.area.groupby(c404_overlay.huc6).transform(lambda x: x / x.sum())

Sparse DataArray

In [None]:
c404_multi_index = c404_overlay.set_index([y, x, "huc6"]).index
c404_df_weights = pd.DataFrame({"weights": c404_grid_cell_fraction.values}, index=c404_multi_index)

c404_ds_weights = xr.Dataset(c404_df_weights)

c404_weights_sparse = c404_ds_weights.unstack(sparse=True, fill_value=0.).weights

Matrix multiplication across each DataArray

In [None]:
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    c404_precip_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["PREC_ACC_NC"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )

    c404_rnet_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["RNET"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )

    c404_tk_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        c404_weights_sparse,
        c404_drb["TK"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )

Merge DataArrays into Dataset

In [None]:
c404_regridded = xr.Dataset({"PREC_NC_ACC":c404_precip_regridded, "RNET":c404_rnet_regridded, "TK": c404_tk_regridded})
c404_regridded = c404_regridded.drop("crs")
c404_regridded.attrs = c404_drb.attrs
c404_regridded

Convert to DataFrame

In [None]:
c404_df = c404_regridded.load().to_dataframe()
c404_df

In [None]:
# reset index
c404_zonal_stats = c404_df.reset_index(drop=False)
c404_zonal_stats["time"] = c404_zonal_stats["time"].astype(str).str[:-3]
c404_zonal_stats

**PRISM zonal statistics**

PRISM has two variables: TK and PREC_ACC_NC

In [None]:
# open dataset
prism_drb = conus404_drb_cat['prism-drb-OSN'].to_dask()

# prism crs
prism_crs = 4269

# create the grid of prism_drb using any of the variables
prism_var = "TK"

# drop unneeded variable and coordinates
prism_grid = prism_drb[[prism_var]].drop(
    ['time', prism_var]).reset_coords().load()


# add bounds
prism_grid = prism_grid.cf.add_bounds(x)
prism_grid = prism_grid.cf.add_bounds(y)

# stack
prism_points = prism_grid.stack(point=(y,x))

# apply bounds_to_poly to build a polygon for each grid cell
prism_boxes = xr.apply_ufunc(
    bounds_to_poly,
    prism_points.x_bounds,
    prism_points.y_bounds,
    input_core_dims=[("bounds",),  ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)

# create geodataframe from boxes
prism_grid_df= gpd.GeoDataFrame(
    data={"geometry": prism_boxes.values, "y": prism_boxes[y], "x": prism_boxes[x]},
    index=prism_boxes.indexes["point"],
    crs=prism_crs
)

# convert DRB to PRISM crs
prism_drb_gdf = drb_gdf.to_crs(epsg=prism_crs)

# overlay the two grids
prism_overlay = prism_grid_df.overlay(prism_drb_gdf, keep_geom_type=True)

# grid cell fractions
prism_grid_cell_fraction = prism_overlay.geometry.area.groupby(prism_overlay.huc6).transform(lambda x: x / x.sum())

# create sparse dataarray
prism_multi_index = prism_overlay.set_index([y, x, "huc6"]).index
prism_df_weights = pd.DataFrame({"weights": prism_grid_cell_fraction.values}, index=prism_multi_index)

prism_ds_weights = xr.Dataset(prism_df_weights)

prism_weights_sparse = prism_ds_weights.unstack(sparse=True, fill_value=0.).weights

# Matrix multiplication across each DataArray
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    prism_precip_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        prism_weights_sparse,
        prism_drb["PREC_ACC_NC"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )

    prism_tk_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        prism_weights_sparse,
        prism_drb["TK"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )
# merge DataArrays into Dataset
prism_regridded = xr.Dataset({"PREC_NC_ACC":prism_precip_regridded, "TK": prism_tk_regridded})
prism_regridded.attrs = prism_drb.attrs

# Convert to DataFrame
prism_df = prism_regridded.load().to_dataframe()
prism_df.head()

In [None]:
# reset index and add time back
prism_zonal_stats = prism_df.reset_index(drop=False)
prism_zonal_stats["time"] = prism_zonal_stats["time"].astype(str).str[:-3]
prism_zonal_stats.head()

In [None]:
# Merge the PRISM and CONUS404 zonal stats together based on the HUC6 code and time
prism_c404_zonal = prism_zonal_stats.merge(c404_zonal_stats, left_on=['huc6', 'time'], right_on=['huc6', 'time'], suffixes=["_prism", "_c404"])

#drop RNET
prism_c404_zonal.drop("RNET", axis=1, inplace=True)

prism_c404_zonal.head()

In [None]:
# convert time column to datetime type
prism_c404_zonal["time"] = pd.to_datetime(prism_c404_zonal["time"], format="%Y-%m")
prism_c404_zonal.head()

Summary statistics
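
The summary statistics start from yearly means of the monthly values. The sketch below shows that step on made-up monthly temperatures: the year-month strings are parsed to datetimes and then averaged by calendar year. It groups on the year with `groupby` rather than `resample`, which is an assumption made here to keep the example independent of pandas frequency aliases; for complete years the grouping is the same.

```python
import pandas as pd

# Made-up monthly values; column names follow the notebook's suffix convention
monthly = pd.DataFrame({
    "time": ["2000-01", "2000-02", "2001-01", "2001-02"],
    "TK_c404": [270.0, 272.0, 271.0, 275.0],
    "TK_prism": [271.0, 273.0, 270.0, 274.0],
})

# Parse the year-month strings into datetimes
monthly["time"] = pd.to_datetime(monthly["time"], format="%Y-%m")

# Average the numeric columns within each calendar year
yearly = monthly.groupby(monthly["time"].dt.year)[["TK_c404", "TK_prism"]].mean()
print(yearly)
```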

In [None]:
prism_c404_yearly = prism_c404_zonal.resample("1Y", on="time").mean()
prism_c404_yearly.reset_index(drop=False, inplace=True)
prism_c404_yearly.head()

In [None]:
# mean, median, standard deviation
prism_c404_mean = prism_c404_yearly.mean()
prism_c404_median = prism_c404_yearly.median()
prism_c404_stdev = prism_c404_yearly.std()

#create dataframe
prism_c404_stats = pd.DataFrame({"annual_mean": prism_c404_mean, "median": prism_c404_median, "stdev": prism_c404_stdev}).T.drop("time", axis=1)

# reset index and rename
prism_c404_stats = prism_c404_stats.reset_index(drop=False).rename({"index":"stat"}, axis=1)

prism_c404_stats

In [None]:
# bias
prism_c404_stats_annual_mean = prism_c404_stats.loc[prism_c404_stats['stat'] == "annual_mean"]
prism_c404_bias_precip = float(prism_c404_stats_annual_mean["PREC_NC_ACC_c404"] - prism_c404_stats_annual_mean["PREC_NC_ACC_prism"])
prism_c404_bias_tk = float(prism_c404_stats_annual_mean["TK_c404"] - prism_c404_stats_annual_mean["TK_prism"])

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["bias", prism_c404_bias_precip, None, prism_c404_bias_tk, None]

prism_c404_stats

In [None]:
# MAE
prism_c404_mae_precip = sum(abs(prism_c404_yearly["PREC_NC_ACC_c404"] - prism_c404_yearly["PREC_NC_ACC_prism"]))/len(prism_c404_yearly)
prism_c404_mae_tk = sum(abs(prism_c404_yearly["TK_c404"] - prism_c404_yearly["TK_prism"]))/len(prism_c404_yearly)

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["MAE", prism_c404_mae_precip, None, prism_c404_mae_tk, None]

prism_c404_stats

In [None]:
# RMSE
prism_c404_rmse_precip = math.sqrt(np.square(np.subtract(prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"])).mean())
prism_c404_rmse_tk = math.sqrt(np.square(np.subtract(prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"])).mean())

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["RMSE", prism_c404_rmse_precip, None, prism_c404_rmse_tk, None]

prism_c404_stats

In [None]:
%run ../../../model_evaluation/Metrics_StdSuite_v1.ipynb

In [None]:
# Pearson's correlation
prism_c404_pearson_precip = pearson_r(prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"])
prism_c404_pearson_tk = pearson_r(prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"])

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["Pearson", prism_c404_pearson_precip, None, prism_c404_pearson_tk, None]

prism_c404_stats

In [None]:
# Spearman's correlation
prism_c404_spearman_precip = spearman_r(prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"])
prism_c404_spearman_tk = spearman_r(prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"])

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["Spearman", prism_c404_spearman_precip, None, prism_c404_spearman_tk, None]

prism_c404_stats

In [None]:
# percent bias
prism_c404_pbias_precip = pbias(prism_c404_yearly["PREC_NC_ACC_c404"], prism_c404_yearly["PREC_NC_ACC_prism"])
prism_c404_pbias_tk = pbias(prism_c404_yearly["TK_c404"], prism_c404_yearly["TK_prism"])

# add stat to bottom of dataframe
prism_c404_stats.loc[len(prism_c404_stats.index)] = ["pbias", prism_c404_pbias_precip, None, prism_c404_pbias_tk, None]

prism_c404_stats

**CERES-EBAF zonal statistics**

CERES-EBAF has a single variable: RNET

In [None]:
# open dataset
ceres_drb = conus404_drb_cat['ceres-drb-OSN'].to_dask()

# crs
ceres_crs = 4326

# create the grid of ceres_drb using any of the variables
ceres_var = "RNET"

# drop unneeded variable and coordinates
ceres_grid = ceres_drb[[ceres_var]].drop(
    ['time', ceres_var]).reset_coords().load()


# add bounds
ceres_grid = ceres_grid.cf.add_bounds(x)
ceres_grid = ceres_grid.cf.add_bounds(y)

# stack
ceres_points = ceres_grid.stack(point=(y, x))

# apply bounds_to_poly to build a polygon for each grid cell
ceres_boxes = xr.apply_ufunc(
    bounds_to_poly,
    ceres_points.x_bounds,
    ceres_points.y_bounds,
    input_core_dims=[("bounds",),  ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)

# create geodataframe from boxes
ceres_grid_df = gpd.GeoDataFrame(
    data={"geometry": ceres_boxes.values,
          "y": ceres_boxes[y], "x": ceres_boxes[x]},
    index=ceres_boxes.indexes["point"],
    crs=ceres_crs
)

# convert DRB to CERES-EBAF crs
ceres_drb_gdf = drb_gdf.to_crs(epsg=ceres_crs)

# overlay the two grids
ceres_overlay = ceres_grid_df.overlay(ceres_drb_gdf, keep_geom_type=True)

# grid cell fractions
ceres_grid_cell_fraction = ceres_overlay.geometry.area.groupby(
    ceres_overlay.huc6).transform(lambda x: x / x.sum())

# create sparse dataarray
ceres_multi_index = ceres_overlay.set_index([y, x, "huc6"]).index
ceres_df_weights = pd.DataFrame(
    {"weights": ceres_grid_cell_fraction.values}, index=ceres_multi_index)

ceres_ds_weights = xr.Dataset(ceres_df_weights)

ceres_weights_sparse = ceres_ds_weights.unstack(
    sparse=True, fill_value=0.).weights

# Matrix multiplication across each DataArray
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ceres_rnet_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        ceres_weights_sparse,
        ceres_drb["RNET"],
        join="left",
        input_core_dims=[["y", "x", "huc6"], ["y", "x"]],
        output_core_dims=[["huc6"]],
        dask="parallelized",
        meta=[np.ndarray((0,))]
    )

# merge DataArrays into Dataset
ceres_regridded = xr.Dataset({"RNET": ceres_rnet_regridded})
ceres_regridded.attrs = ceres_drb.attrs

# Convert to DataFrame
ceres_df = ceres_regridded.load().to_dataframe()
ceres_df.head()

In [None]:
# reset index 
ceres_zonal_stats = ceres_df.reset_index(drop=False)
ceres_zonal_stats["time"] = ceres_zonal_stats["time"].astype(str).str[:-3]
ceres_zonal_stats.head()

In [None]:
# Merge the CERES-EBAF and CONUS404 zonal stats together based on the HUC6 code and time
ceres_c404_zonal = ceres_zonal_stats.merge(c404_zonal_stats, left_on=['huc6', 'time'], right_on=['huc6', 'time'], suffixes=["_ceres", "_c404"])

# drop the CONUS404 precipitation and temperature columns (not available in CERES-EBAF)
ceres_c404_zonal.drop(["PREC_NC_ACC", "TK"], axis=1, inplace=True)

ceres_c404_zonal.head()

In [None]:
# convert time column to datetime type
ceres_c404_zonal["time"] = pd.to_datetime(ceres_c404_zonal["time"], format="%Y-%m")
ceres_c404_zonal.head()

Summary statistics
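
The bias, MAE, and RMSE computed in the cells below all come from the yearly differences between the benchmark and the reference. A minimal sketch in plain Python, using made-up yearly values (the numbers and variable names are illustrative only):

```python
import math

# Made-up yearly means for the benchmark (c404) and a reference dataset
c404 = [2.0, 4.0, 6.0]
ref = [1.0, 5.0, 4.0]

diffs = [c - r for c, r in zip(c404, ref)]

# Bias: difference of the means (equal to the mean of the differences)
bias = sum(c404) / len(c404) - sum(ref) / len(ref)

# Mean absolute error: average magnitude of the differences
mae = sum(abs(d) for d in diffs) / len(diffs)

# Root mean square error: penalizes large differences more heavily
rmse = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))

print(bias, mae, rmse)
```

Positive and negative differences cancel in the bias but not in the MAE or RMSE, which is why a small bias can coexist with a large error.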

In [None]:
ceres_c404_yearly = ceres_c404_zonal.resample("1Y", on="time").mean()
ceres_c404_yearly.reset_index(drop=False, inplace=True)
ceres_c404_yearly.head()

In [None]:
# mean, median, standard deviation
ceres_c404_mean = ceres_c404_yearly.mean()
ceres_c404_median = ceres_c404_yearly.median()
ceres_c404_stdev = ceres_c404_yearly.std()

#create dataframe
ceres_c404_stats = pd.DataFrame({"annual_mean": ceres_c404_mean, "median": ceres_c404_median, "stdev": ceres_c404_stdev}).T.drop("time", axis=1)

# reset index and rename
ceres_c404_stats = ceres_c404_stats.reset_index(drop=False).rename({"index":"stat"}, axis=1)

ceres_c404_stats

In [None]:
# bias
ceres_c404_stats_annual_mean = ceres_c404_stats.loc[ceres_c404_stats['stat'] == "annual_mean"]
ceres_c404_bias_rnet = float(ceres_c404_stats_annual_mean["RNET_c404"] - ceres_c404_stats_annual_mean["RNET_ceres"])

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["bias", ceres_c404_bias_rnet, None]

# MAE
ceres_c404_mae_rnet = sum(abs(ceres_c404_yearly["RNET_c404"] - ceres_c404_yearly["RNET_ceres"]))/len(ceres_c404_yearly)

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["MAE", ceres_c404_mae_rnet, None]

# RMSE
ceres_c404_rmse_rnet = math.sqrt(np.square(np.subtract(ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"])).mean())

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["RMSE", ceres_c404_rmse_rnet, None]

# Pearson's correlation
ceres_c404_pearson_rnet = pearson_r(ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"])

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["Pearson", ceres_c404_pearson_rnet, None]

# Spearman's correlation
ceres_c404_spearman_rnet = spearman_r(ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"])

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["Spearman", ceres_c404_spearman_rnet, None]

# percent bias
ceres_c404_pbias_rnet = pbias(ceres_c404_yearly["RNET_c404"], ceres_c404_yearly["RNET_ceres"])

# add stat to bottom of dataframe
ceres_c404_stats.loc[len(ceres_c404_stats.index)] = ["pbias", ceres_c404_pbias_rnet, None]

ceres_c404_stats

## **Extract gridded values to points**

The goal of this section is to extract values from CONUS404 where they intersect with station data. This process is described in the documentation for the Esri tool [Extract Values to Points](https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-analyst/extract-values-to-points.htm). This tabular data will then be exported for use in the next notebook, **CONUS404 Analysis**.

Dataset outline:
1. Read in the prepared dataset
2. Extract data from overlapping pixel at same time step as point
<br>
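
Extracting a gridded value at a point amounts to finding the grid cell whose center is closest to the point along each axis, which is what `.sel(..., method="nearest")` does in the cells below. The NumPy sketch here shows the idea on a made-up grid; the helper `nearest_value` and all coordinates are hypothetical.

```python
import numpy as np

# Made-up cell-center coordinates (projected meters) and a (y, x) grid of values
x = np.array([0.0, 1000.0, 2000.0, 3000.0])
y = np.array([0.0, 1000.0, 2000.0])
values = np.arange(12.0).reshape(3, 4)

def nearest_value(px, py):
    """Return the grid value at the cell center closest to the point (px, py)."""
    ix = int(np.abs(x - px).argmin())
    iy = int(np.abs(y - py).argmin())
    return values[iy, ix]

# A point near x=2000, y=0 falls in the cell at row 0, column 2
station_value = nearest_value(1900.0, 400.0)
print(station_value)
```

The station coordinates must be in the same coordinate reference system as the grid before the lookup, which is why the notebook reprojects the station points to the CONUS404 CRS first.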

**Climate Reference Network point extraction**

In [None]:
crn_drb_df = conus404_drb_cat['crn-drb-OSN'].read()

# create geodataframe
crn_drb = gpd.GeoDataFrame(crn_drb_df, crs=4326,
                       geometry=gpd.points_from_xy(crn_drb_df.LONGITUDE, 
                                                         crn_drb_df.LATITUDE))

# modify date field
crn_drb["DATE"] = crn_drb["DATE"].astype(str).str[:-3]

crn_drb.rename({"DATE": "time",
                "TK": "TK_crn", 
                "RNET": "RNET_crn", 
                "PREC_ACC_NC": "PREC_ACC_NC_crn"},
                  axis=1, inplace=True)

crn_drb.head()

Get the coordinates from crn_drb with which to index c404_drb

In [None]:
# isolate single row and transform to c404_drb crs
crn_coords_gdf = crn_drb.iloc[[0]].to_crs(c404_crs)

# extract lat/long values
crn_lat = crn_coords_gdf.iloc[0]["geometry"].y
crn_lon = crn_coords_gdf.iloc[0]["geometry"].x

# time
crn_time_min = crn_drb["time"].min()
crn_time_max = crn_drb["time"].max()
crn_time_min, crn_time_max

# subset c404_drb to lat/long using nearest
c404_crn_sub = c404_drb.sel(x=crn_lon, y=crn_lat, method="nearest")

# slice to time-steps of crn_drb
c404_crn_sub = c404_crn_sub.sel(time=slice(crn_time_min, crn_time_max))

c404_crn_sub

Convert subset to dataframe and reorganize columns

In [None]:
c404_crn_sub_df = c404_crn_sub.to_dataframe().reset_index(drop=False)

# trim columns
c404_crn_sub_df = c404_crn_sub_df[["time", "TK", "RNET", "PREC_ACC_NC"]]

# rename columns
c404_crn_sub_df.rename({"TK": "TK_c404", 
                    "RNET": "RNET_c404", 
                    "PREC_ACC_NC": "PREC_ACC_NC_c404"},
                  axis=1, inplace=True)

# trim time
c404_crn_sub_df["time"] = c404_crn_sub_df["time"].astype(str).str[:-3]

c404_crn_sub_df

Combine CONUS404 subset with CRN data

In [None]:
crn_c404_point = crn_drb.merge(c404_crn_sub_df, on="time").reset_index(drop=False)

# drop columns
crn_c404_point.drop(["index", "LATITUDE", "LONGITUDE", "ID", "geometry"], axis=1, inplace=True)

crn_c404_point.head()

In [None]:
# convert time column to datetime type
crn_c404_point["time"] = pd.to_datetime(crn_c404_point["time"], format="%Y-%m")

Summary statistics
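
The correlation statistics below rely on `pearson_r` and `spearman_r` from the metrics notebook, whose implementations are not shown here. As an independent sketch, and not necessarily how those helpers are written, Pearson's correlation can be computed with `np.corrcoef`, and Spearman's correlation is Pearson's correlation applied to the ranks. The values are made up.

```python
import numpy as np
import pandas as pd

# Made-up yearly values: the reference grows nonlinearly with the benchmark
c404 = pd.Series([1.0, 2.0, 3.0, 4.0])
ref = pd.Series([1.0, 4.0, 9.0, 16.0])

# Pearson: strength of the linear relationship
pearson = float(np.corrcoef(c404, ref)[0, 1])

# Spearman: Pearson correlation of the ranks, so any monotonic relationship scores 1
spearman = float(np.corrcoef(c404.rank(), ref.rank())[0, 1])

print(pearson, spearman)
```

Here the relationship is monotonic but not linear, so Spearman's correlation is 1 while Pearson's correlation is slightly lower.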

In [None]:
# resample to yearly means
crn_c404_yearly = crn_c404_point.resample("1Y", on="time").mean()
crn_c404_yearly.reset_index(drop=False, inplace=True)

# mean, median, standard deviation
crn_c404_mean = crn_c404_yearly.mean()
crn_c404_median = crn_c404_yearly.median()
crn_c404_stdev = crn_c404_yearly.std()

#create dataframe
crn_c404_stats = pd.DataFrame({"annual_mean": crn_c404_mean, "median": crn_c404_median, "stdev": crn_c404_stdev}).T.drop("time", axis=1)

# reset index and rename
crn_c404_stats = crn_c404_stats.reset_index(drop=False).rename({"index":"stat"}, axis=1)

# bias
crn_c404_stats_annual_mean = crn_c404_stats.loc[crn_c404_stats['stat'] == "annual_mean"]
crn_c404_bias_precip = float(crn_c404_stats_annual_mean["PREC_ACC_NC_c404"] - crn_c404_stats_annual_mean["PREC_ACC_NC_crn"])
crn_c404_bias_rnet = float(crn_c404_stats_annual_mean["RNET_c404"] - crn_c404_stats_annual_mean["RNET_crn"])
crn_c404_bias_tk = float(crn_c404_stats_annual_mean["TK_c404"] - crn_c404_stats_annual_mean["TK_crn"])

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["bias", crn_c404_bias_precip, None, crn_c404_bias_rnet, None, crn_c404_bias_tk, None]

# MAE
crn_c404_mae_precip = sum(abs(crn_c404_yearly["PREC_ACC_NC_c404"] - crn_c404_yearly["PREC_ACC_NC_crn"]))/len(crn_c404_yearly)
crn_c404_mae_rnet = sum(abs(crn_c404_yearly["RNET_c404"] - crn_c404_yearly["RNET_crn"]))/len(crn_c404_yearly)
crn_c404_mae_tk = sum(abs(crn_c404_yearly["TK_c404"] - crn_c404_yearly["TK_crn"]))/len(crn_c404_yearly)
# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["MAE", crn_c404_mae_precip, None, crn_c404_mae_rnet, None, crn_c404_mae_tk, None]

# RMSE
crn_c404_rmse_precip = math.sqrt(np.square(np.subtract(crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"])).mean())
crn_c404_rmse_rnet = math.sqrt(np.square(np.subtract(crn_c404_yearly["RNET_c404"], crn_c404_yearly["RNET_crn"])).mean())
crn_c404_rmse_tk = math.sqrt(np.square(np.subtract(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])).mean())

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["RMSE", crn_c404_rmse_precip, None, crn_c404_rmse_rnet, None, crn_c404_rmse_tk, None]

# Pearson's correlation
crn_c404_pearson_precip = pearson_r(crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"])
crn_c404_pearson_rnet = pearson_r(crn_c404_yearly["RNET_c404"], crn_c404_yearly["RNET_crn"])
crn_c404_pearson_tk = pearson_r(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["pearson", crn_c404_pearson_precip, None, crn_c404_pearson_rnet, None, crn_c404_pearson_tk, None]

# Spearman's correlation
crn_c404_spearman_precip = spearman_r(crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"])
crn_c404_spearman_rnet = spearman_r(crn_c404_yearly["RNET_c404"], crn_c404_yearly["RNET_crn"])
crn_c404_spearman_tk = spearman_r(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["spearman", crn_c404_spearman_precip, None, crn_c404_spearman_rnet, None, crn_c404_spearman_tk, None]

# percent bias
crn_c404_pbias_precip = pbias(crn_c404_yearly["PREC_ACC_NC_c404"], crn_c404_yearly["PREC_ACC_NC_crn"])
crn_c404_pbias_rnet = pbias(crn_c404_yearly["RNET_c404"], crn_c404_yearly["RNET_crn"])
crn_c404_pbias_tk = pbias(crn_c404_yearly["TK_c404"], crn_c404_yearly["TK_crn"])

# add stat to bottom of dataframe
crn_c404_stats.loc[len(crn_c404_stats.index)] = ["pbias", crn_c404_pbias_precip, None, crn_c404_pbias_rnet, None, crn_c404_pbias_tk, None]

crn_c404_stats

Export dataset

In [None]:
# crn_c404_point.to_parquet("s3://nhgf-development/workspace/tutorial/CONUS404/crn_c404_point.parquet")

**Historical Climatology Network (HCN) point extraction**

The HCN data differ from the CRN data: HCN observations come from multiple stations, whereas the CRN observations come from a single station. Extracting the matching CONUS404 values therefore requires multiple sets of geographic coordinates.

In [None]:
# read in dataset
hcn_drb_df = conus404_drb_cat['hcn-drb-OSN'].read()

#rename columns
hcn_drb_df.rename({"DATE": "time",
                "TK": "TK_hcn",  
                "PREC_ACC_NC": "PREC_ACC_NC_hcn"},
                  axis=1, inplace=True)

# trim the time field to year-month (YYYY-MM)
hcn_drb_df["time"] = hcn_drb_df["time"].astype(str).str[:-3]

hcn_drb_df.head()

Get a DataFrame of the station IDs, lats, and longs to use for extracting data

In [None]:
hcn_stations = hcn_drb_df.copy().drop(["time", "TK_hcn", "PREC_ACC_NC_hcn"], axis=1)
hcn_stations["LONGITUDE"] = pd.to_numeric(hcn_stations["LONGITUDE"])
hcn_stations["LATITUDE"] = pd.to_numeric(hcn_stations["LATITUDE"])

hcn_stations = hcn_stations.groupby('ID').mean().reset_index(drop=False)
# hcn_stations

Create a GeoDataFrame to convert the lat and long to the coordinate system of CONUS404

In [None]:
hcn_stations_gdf = gpd.GeoDataFrame(hcn_stations, crs=4326,
                       geometry=gpd.points_from_xy(hcn_stations.LONGITUDE, 
                                                         hcn_stations.LATITUDE))

# transform to c404_drb crs
hcn_stations_gdf = hcn_stations_gdf.to_crs(c404_crs)

# extract the projected x/y coordinates
hcn_stations_gdf["y"] = hcn_stations_gdf["geometry"].y
hcn_stations_gdf["x"] = hcn_stations_gdf["geometry"].x

#drop lat/lon/geo
hcn_stations_df = hcn_stations_gdf.drop(["LATITUDE", "LONGITUDE", "geometry"], axis=1)

Subset c404_drb to time period of HCN

In [None]:
# time min/max
hcn_time_min = hcn_drb_df["time"].min()
hcn_time_max = hcn_drb_df["time"].max()

# slice c404 to HCN time
c404_hcn_timesub = c404_drb.sel(time=slice(hcn_time_min, hcn_time_max))

Use each row of the station DataFrame to extract the nearest grid cell from c404_drb

In [None]:
# list of extracted data
c404_hcn_subs = []

for index, data in hcn_stations_df.iterrows():
    c404_hcn_sub_step = c404_hcn_timesub.sel(x=data.x, y=data.y, method="nearest").to_dataframe()
    c404_hcn_sub_step["ID"] = data.ID
    c404_hcn_subs.append(c404_hcn_sub_step)

# concat list of extracted data into single Dataframe
c404_hcn_sub = pd.concat(c404_hcn_subs)

#reset index
c404_hcn_sub.reset_index(drop=False, inplace=True)

# drop columns
c404_hcn_sub.drop(["RNET", "lon", "lat", "y", "x", "crs"], axis=1, inplace=True)

# rename columns
c404_hcn_sub.rename({"TK":"TK_c404",
                    "PREC_ACC_NC": "PREC_ACC_NC_c404"},
                   axis=1, inplace=True)

# trim time
c404_hcn_sub["time"] = c404_hcn_sub["time"].astype(str).str[:-3]

# c404_hcn_sub
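
The loop above selects one grid cell per station and concatenates the results. As an alternative sketch, xarray can select all stations in one call when the `x` and `y` indexers are `DataArray` objects that share a dimension. The grid, variable values, and station table below are hypothetical stand-ins for `c404_hcn_timesub` and `hcn_stations_df`:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3 x 4 projected grid with two monthly time steps
ds = xr.Dataset(
    {"TK": (("time", "y", "x"), np.arange(24, dtype=float).reshape(2, 3, 4))},
    coords={
        "time": pd.date_range("2000-01-01", periods=2, freq="MS"),
        "y": [0.0, 4000.0, 8000.0],
        "x": [0.0, 4000.0, 8000.0, 12000.0],
    },
)

# Hypothetical station table, already in the grid's coordinate system
stations = pd.DataFrame(
    {"ID": ["A", "B"], "x": [3900.0, 11000.0], "y": [100.0, 7000.0]}
)

# Indexers that share the "ID" dimension trigger pointwise selection
ids = stations["ID"].to_numpy()
xs = xr.DataArray(stations["x"].to_numpy(), dims="ID", coords={"ID": ids})
ys = xr.DataArray(stations["y"].to_numpy(), dims="ID", coords={"ID": ids})
points = ds.sel(x=xs, y=ys, method="nearest")

# Long table with one row per station and time step
extracted = points["TK"].to_dataframe().reset_index()
print(extracted[["ID", "time", "TK"]])
```

The result carries the station `ID` as a coordinate, so no per-station bookkeeping is needed before converting to a DataFrame.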

Merge the CONUS404 extractions with the HCN observations using the station ID and time

In [None]:
hcn_c404_point = hcn_drb_df.merge(c404_hcn_sub, left_on=["ID", "time"], right_on=["ID", "time"])

# drop columns
hcn_c404_point.drop(["LATITUDE", "LONGITUDE"], axis=1, inplace=True)

hcn_c404_point.head()
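
The merge above is an inner join by default, so any station-month missing from either table is dropped without notice. A small sketch with hypothetical values shows how an outer merge with `indicator=True` can be used to audit how many rows matched:

```python
import pandas as pd

# Hypothetical station observations and model extractions
obs = pd.DataFrame({
    "ID": ["A", "A", "B"],
    "time": ["2000-01", "2000-02", "2000-01"],
    "TK_hcn": [272.0, 274.5, 270.1],
})
model = pd.DataFrame({
    "ID": ["A", "A", "B"],
    "time": ["2000-01", "2000-03", "2000-01"],
    "TK_c404": [271.4, 279.0, 269.8],
})

# Default inner merge: only station-months present in both tables survive
matched = obs.merge(model, on=["ID", "time"])

# Outer merge with an indicator column shows what the inner merge dropped
audit = obs.merge(model, on=["ID", "time"], how="outer", indicator=True)
counts = audit["_merge"].value_counts()

print(len(matched))
print(counts.to_dict())
```

A large `left_only` or `right_only` count usually points to mismatched time keys or station IDs rather than genuinely missing data.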

In [None]:
# convert time column to datetime type
hcn_c404_point["time"] = pd.to_datetime(hcn_c404_point["time"], format="%Y-%m")

Summary statistics

In [None]:
# resample to yearly means (pooled across stations); drop the non-numeric ID column first
hcn_c404_yearly = hcn_c404_point.drop(columns="ID").resample("1Y", on="time").mean()
hcn_c404_yearly.reset_index(drop=False, inplace=True)

# mean, median, standard deviation
hcn_c404_mean = hcn_c404_yearly.mean()
hcn_c404_median = hcn_c404_yearly.median()
hcn_c404_stdev = hcn_c404_yearly.std()

#create dataframe
hcn_c404_stats = pd.DataFrame({"annual_mean": hcn_c404_mean, "median": hcn_c404_median, "stdev": hcn_c404_stdev}).T.drop("time", axis=1)

# reset index and rename
hcn_c404_stats = hcn_c404_stats.reset_index(drop=False).rename({"index":"stat"}, axis=1)

# bias
hcn_c404_stats_annual_mean = hcn_c404_stats.loc[hcn_c404_stats['stat'] == "annual_mean"]
hcn_c404_bias_precip = float((hcn_c404_stats_annual_mean["PREC_ACC_NC_c404"] - hcn_c404_stats_annual_mean["PREC_ACC_NC_hcn"]).iloc[0])
hcn_c404_bias_tk = float((hcn_c404_stats_annual_mean["TK_c404"] - hcn_c404_stats_annual_mean["TK_hcn"]).iloc[0])

# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["bias", hcn_c404_bias_precip, None, hcn_c404_bias_tk, None]

# MAE
hcn_c404_mae_precip = sum(abs(hcn_c404_yearly["PREC_ACC_NC_c404"] - hcn_c404_yearly["PREC_ACC_NC_hcn"]))/len(hcn_c404_yearly)
hcn_c404_mae_tk = sum(abs(hcn_c404_yearly["TK_c404"] - hcn_c404_yearly["TK_hcn"]))/len(hcn_c404_yearly)
# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["MAE", hcn_c404_mae_precip, None, hcn_c404_mae_tk, None]

# RMSE
hcn_c404_rmse_precip = math.sqrt(np.square(np.subtract(hcn_c404_yearly["PREC_ACC_NC_c404"], hcn_c404_yearly["PREC_ACC_NC_hcn"])).mean())
hcn_c404_rmse_tk = math.sqrt(np.square(np.subtract(hcn_c404_yearly["TK_c404"], hcn_c404_yearly["TK_hcn"])).mean())

# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["RMSE", hcn_c404_rmse_precip, None, hcn_c404_rmse_tk, None]

# Pearson's correlation
hcn_c404_pearson_precip = pearson_r(hcn_c404_yearly["PREC_ACC_NC_c404"], hcn_c404_yearly["PREC_ACC_NC_hcn"])
hcn_c404_pearson_tk = pearson_r(hcn_c404_yearly["TK_c404"], hcn_c404_yearly["TK_hcn"])

# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["pearson", hcn_c404_pearson_precip, None, hcn_c404_pearson_tk, None]

# Spearman's correlation
hcn_c404_spearman_precip = spearman_r(hcn_c404_yearly["PREC_ACC_NC_c404"], hcn_c404_yearly["PREC_ACC_NC_hcn"])
hcn_c404_spearman_tk = spearman_r(hcn_c404_yearly["TK_c404"], hcn_c404_yearly["TK_hcn"])

# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["spearman", hcn_c404_spearman_precip, None, hcn_c404_spearman_tk, None]

# percent bias
hcn_c404_pbias_precip = pbias(hcn_c404_yearly["PREC_ACC_NC_c404"], hcn_c404_yearly["PREC_ACC_NC_hcn"])
hcn_c404_pbias_tk = pbias(hcn_c404_yearly["TK_c404"], hcn_c404_yearly["TK_hcn"])

# add stat to bottom of dataframe
hcn_c404_stats.loc[len(hcn_c404_stats.index)] = ["pbias", hcn_c404_pbias_precip, None, hcn_c404_pbias_tk, None]

hcn_c404_stats
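
For reference, the error metrics in the table can be written as short NumPy functions. This is a sketch of common textbook definitions, not the code from the helper notebook: `pbias`, `pearson_r`, and `spearman_r` are defined there and may use different conventions. In particular, the sign and normalization of `percent_bias` below are an assumption, and all function names and values here are hypothetical.

```python
import numpy as np


def bias(sim, obs):
    """Difference of means: mean(sim) - mean(obs)."""
    return float(np.mean(sim) - np.mean(obs))


def mae(sim, obs):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(sim) - np.asarray(obs))))


def rmse(sim, obs):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(sim) - np.asarray(obs)) ** 2)))


def percent_bias(sim, obs):
    """Assumed convention: 100 * sum(sim - obs) / sum(obs)."""
    sim, obs = np.asarray(sim), np.asarray(obs)
    return float(100.0 * np.sum(sim - obs) / np.sum(obs))


# Hypothetical yearly means for a simulated and an observed series
sim = np.array([10.0, 12.0, 17.0])
obs = np.array([8.0, 12.0, 16.0])

print(bias(sim, obs), mae(sim, obs), rmse(sim, obs), percent_bias(sim, obs))
```

RMSE is never smaller than MAE for the same data because squaring gives more weight to large errors, so a wide gap between the two suggests a few years dominate the error.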

Export the dataset

In [None]:
# hcn_c404_point.to_parquet("s3://nhgf-development/workspace/tutorial/CONUS404/hcn_c404_point.parquet")

Shut down the client and cluster

In [None]:
client.close(); cluster.shutdown()

# Next: CONUS404 Visualization notebook

With the zonal statistics and point extractions complete, the next step is to visualize the results in the CONUS404 Visualization notebook.

In [None]:
# # Last code cell of the notebook
# import watermark.watermark as watermark
# print(watermark(iversions=True, python=True, machine=True, globals_=globals()))