# Create zonal statistics and point extractions for comparing CONUS404 and reference datasets

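This notebook computes zonal statistics for three gridded datasets (CONUS404, PRISM, and CERES-EBAF) over the HUC6 zones of the Delaware River Basin, and sets up the extraction of CONUS404 values at station locations. The resulting tables are exported for use in the next notebook, **CONUS404 Analysis**.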

<details>
  <summary>Guide to pre-requisites and learning outcomes...&lt;click to expand&gt;</summary>
  
  <table>
    <tr>
      <td>Pre-Requisites
      <td>To get the most out of this notebook, you should already have an understanding of these topics: 
        <ul>
        <li>pre-req one
        <li>pre-req two
        </ul>
    <tr>
      <td>Expected Results
      <td>At the end of this notebook, you should be able to: 
        <ul>
        <li>outcome one
        <li>outcome two
        </ul>
  </table>
</details>

In [None]:
# library imports
import fsspec #testing
import hvplot.xarray #testing
import intake #testing
import os #testing
import warnings #testing
import rioxarray #testing
import dask #testing
import metpy #testing
import calendar #testing

from shapely.geometry import Polygon #testing
from dask.distributed import LocalCluster, Client #testing
from pygeohydro import pygeohydro #testing
from fsspec.implementations.ftp import FTPFileSystem #testing
from holoviews.streams import PolyEdit, PolyDraw #testing
from geocube.api.core import make_geocube #testing

import xarray as xr #testing
import geopandas as gpd #testing
import pandas as pd #testing
import geoviews as gv #testing
import dask.dataframe as dd #testing
import numpy as np #testing

warnings.filterwarnings('ignore')

# Update to helper function after repo consolidation
## **Start a Dask client using an appropriate Dask Cluster** 
This is an optional step, but can speed up data loading significantly, especially when accessing data from the cloud.

In [None]:
def configure_cluster(machine):
    ''' Helper function to configure cluster
    '''
    if machine == 'denali':
        from dask.distributed import LocalCluster, Client
        cluster = LocalCluster(threads_per_worker=1)
        client = Client(cluster)
    
    elif machine == 'tallgrass':
        from dask.distributed import Client
        from dask_jobqueue import SLURMCluster
        cluster = SLURMCluster(queue='cpu', cores=1, interface='ib0',
                               job_extra=['--nodes=1', '--ntasks-per-node=1', '--cpus-per-task=1'],
                               memory='6GB')
        cluster.adapt(maximum_jobs=30)
        client = Client(cluster)
        
    elif machine == 'local':
        import os
        import warnings
        from dask.distributed import LocalCluster, Client
        warnings.warn("Running locally can result in costly data transfers!\n")
        n_cores = os.cpu_count() # set to match your machine
        cluster = LocalCluster(threads_per_worker=n_cores)
        client = Client(cluster)
        
    elif machine in ['esip-qhub-gateway-v0.4']:   
        import sys, os
        sys.path.append(os.path.join(os.environ['HOME'],'shared','users','lib'))
        import ebdpy as ebd
        aws_profile = 'nhgf-development'
        ebd.set_credentials(profile=aws_profile)

        aws_region = 'us-west-2'
        endpoint = f's3.{aws_region}.amazonaws.com'
        ebd.set_credentials(profile=aws_profile, region=aws_region, endpoint=endpoint)
        worker_max = 30
        client,cluster = ebd.start_dask_cluster(profile=aws_profile, worker_max=worker_max, 
                                              region=aws_region, use_existing_cluster=True,
                                              adaptive_scaling=False, wait_for_cluster=False, 
                                              worker_profile='Medium Worker', propagate_env=True)
        
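    else:
        # fail early instead of hitting an UnboundLocalError at the return below
        raise ValueError(f"Unrecognized machine: {machine!r}")

    return client, cluster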

### Setup your cluster

#### QHub...
Uncomment any code lines that are commented out with a single `#` to run this cell.

In [None]:
# set machine
machine = 'esip-qhub-gateway-v0.4'

# use configure cluster helper function to setup dask
client, cluster = configure_cluster(machine)

#### or HPC
Uncomment any code lines that are commented out with a single `#` to run this cell.

In [None]:
## set machine
# machine = os.environ['SLURM_CLUSTER_NAME']

## use configure_cluster helper function to setup dask
# client, cluster = configure_cluster(machine)

## **Compute zonal statistics for gridded datasets**

In the last tutorial, we prepared three gridded datasets: CONUS404 (benchmark), PRISM (reference), and CERES-EBAF (reference). The goal of this section is to compute [zonal statistics](https://gisgeography.com/zonal-statistics/) for each HUC6 zone in the Delaware River Basin (DRB) at each time step in the data. This tabular data will then be exported for use in the next notebook, **CONUS404 Analysis**.

Dataset outline:
1. Read in the prepared dataset
2. Read in the HUC6 boundaries and transform to same coordinate reference system as prepared dataset
3. Make a data mask with the HUC6 boundaries to calculate zonal statistics
4. Compute zonal statistics with data mask and prepared data

Once all calculations are done: 

5. Combine each reference with benchmark into single dataset
6. Export gridded data zonal statistics
<br>
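
The masking and zonal-mean steps in the outline can be illustrated on a toy grid. The sketch below uses only NumPy and made-up values; the array names (`data`, `zones`) and the helper `zonal_mean` are hypothetical and are not part of this notebook, which relies on `make_geocube` and xarray's `groupby` for the same idea.

```python
import numpy as np

def zonal_mean(data, zones):
    """Mean of `data` (time, y, x) within each zone of `zones` (y, x), per time step.

    Cells whose zone is NaN (outside every polygon) are ignored.
    Returns a dict mapping zone id -> 1-D array of length n_time.
    """
    result = {}
    for zone_id in np.unique(zones[~np.isnan(zones)]):
        mask = zones == zone_id  # boolean (y, x) mask for this zone
        # boolean indexing over the last two axes gives shape (time, n_cells_in_zone)
        result[int(zone_id)] = data[:, mask].mean(axis=1)
    return result

# toy grid: 2 time steps on a 2 x 3 grid (made-up values)
data = np.array([
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]],
    [[10.0, 20.0, 30.0],
     [40.0, 50.0, 60.0]],
])

# zone mask: two zones plus one cell that falls outside both
zones = np.array([
    [20401.0, 20401.0, 20402.0],
    [20401.0, np.nan, 20402.0],
])

means = zonal_mean(data, zones)
print(means)
```

Each zone ends up with one mean per time step, which is the shape of the tables built in the rest of this section.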

**CONUS404 zonal statistics**

In [None]:
# url to c404_drb
c404_drb_url = 's3://nhgf-development/workspace/tutorial/CONUS404/c404_drb.nc'

fs = fsspec.filesystem("s3", anon=False, requester_pays=True, skip_instance_cache=True)

# open dataset
c404_drb = xr.open_dataset(fs.open(c404_drb_url), decode_coords="all")

# set crs
c404_crs = c404_drb.rio.crs.to_proj4()

c404_drb

Read in HUC6 boundaries

In [None]:
# bring in HUC6 boundaries found in the DRB
c404_drb_gdf = pygeohydro.WBD("huc6", outfields=["huc6", "name"]).byids("huc6", ["020401", "020402"])

# set CRS to match c404_drb
c404_drb_gdf = c404_drb_gdf.to_crs(c404_crs)

#visualize
# c404_drb_gdf.plot(edgecolor="orange", facecolor="purple", linewidth=2.5)

Testing geocube

Create datamask and build new dataset

In [None]:
# convert huc6 field to int, as this works best for the following steps
# note: the int cast drops leading zeros (e.g. "020401" becomes 20401),
# so the codes need to be padded back to 6 digits later
c404_drb_gdf["huc6"] = c404_drb_gdf["huc6"].astype(int)

In [None]:
# c404_drb.rio.write_crs(c404_crs, inplace=True) 

# create an output grid
c404_out_grid = make_geocube(
    vector_data = c404_drb_gdf,
    measurements=["huc6"],
    like=c404_drb
)

# add data arrays to the grid
c404_out_grid["RNET"] = (c404_drb.RNET.dims, c404_drb.RNET.values, 
                         c404_drb.RNET.attrs, c404_drb.RNET.encoding)

c404_out_grid["TK"] = (c404_drb.TK.dims, c404_drb.TK.values,
                         c404_drb.TK.attrs, c404_drb.TK.encoding)

c404_out_grid["PREC_ACC_NC"] = (c404_drb.PREC_ACC_NC.dims, c404_drb.PREC_ACC_NC.values,
                         c404_drb.PREC_ACC_NC.attrs, c404_drb.PREC_ACC_NC.encoding)


Group data arrays by HUC6 code

In [None]:
c404_grouped = c404_out_grid.drop_vars("spatial_ref").groupby(c404_out_grid.huc6)

Calculate the mean of each variable within each HUC6 zone

In [None]:
c404_grid_mean = c404_grouped.mean().rename({"RNET": "c404_RNET_mean", "TK": "c404_TK_mean", 
                                       "PREC_ACC_NC": "c404_PREC_ACC_NC_mean", "time":"time_index"})

Convert to a dataframe

In [None]:
c404_zonal_stats = c404_grid_mean.to_dataframe().drop("spatial_ref", axis=1)
c404_zonal_stats.head(4)

The time has been replaced by the position index from the *c404_drb* time coordinate. Ungroup the data and add the time value from the index to the dataframe.

In [None]:
c404_zonal_stats = c404_zonal_stats.reset_index(drop=False)

c404_zonal_stats["time"] = c404_drb.coords["time"][c404_zonal_stats["time_index"].values]

c404_zonal_stats.drop("time_index", axis=1, inplace=True)

c404_zonal_stats["time"] = c404_zonal_stats["time"].astype(str).str[:-3]
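
The lookup from positional index back to timestamp can be tried on a small table. This is a minimal sketch with made-up values: `times` and `stats` are hypothetical stand-ins for the dataset's time coordinate and the zonal statistics table, and the `%Y-%m` label format is an assumption about the intended result (the notebook itself trims the string form with `.str[:-3]`).

```python
import numpy as np
import pandas as pd

# hypothetical time coordinate with three monthly steps
times = np.array(["2000-01-01", "2000-02-01", "2000-03-01"], dtype="datetime64[ns]")

# zonal statistics keyed by zone and positional time index (made-up values)
stats = pd.DataFrame({
    "huc6": [20401, 20401, 20401, 20402, 20402, 20402],
    "time_index": [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# integer-array indexing: look up the timestamp at each position
stats["time"] = times[stats["time_index"].to_numpy()]
stats = stats.drop(columns="time_index")

# explicit year-month label (assumed format) for use as a merge key
stats["time"] = stats["time"].dt.strftime("%Y-%m")
print(stats)
```

Formatting with `strftime` states the intended label directly, whereas slicing the string form depends on how the timestamps happen to be rendered.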

Reset huc6 back to a string type of length 6

In [None]:
c404_zonal_stats["huc6"] = c404_zonal_stats["huc6"].astype(int).astype(str).str.zfill(6) # pad with leading zeros so every value has length 6
c404_zonal_stats
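
The round trip between string and integer HUC codes is easy to check in isolation. This minimal sketch uses a throwaway pandas Series (not the notebook's data) to show that the integer cast drops the leading zero and that `str.zfill(6)` restores it.

```python
import pandas as pd

huc6 = pd.Series(["020401", "020402"])

as_int = huc6.astype(int)                    # the leading zero is lost here
restored = as_int.astype(str).str.zfill(6)   # left-pad with zeros to width 6

print(as_int.tolist())
print(restored.tolist())
```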

**PRISM zonal statistics**

PRISM has two variables: TK and PREC_ACC_NC

In [None]:
# url to prism_drb
prism_drb_url = 's3://nhgf-development/workspace/tutorial/CONUS404/prism_drb.nc'

fs = fsspec.filesystem("s3", anon=False, requester_pays=True, skip_instance_cache=True)

# open dataset
prism_drb = xr.open_dataset(fs.open(prism_drb_url), decode_coords="all")

# set crs
prism_crs = prism_drb.rio.crs.to_proj4()

# bring in HUC6 boundaries found in the DRB
prism_drb_gdf = pygeohydro.WBD("huc6", outfields=["huc6", "name"]).byids("huc6", ["020401", "020402"])

# set CRS to match prism_drb
prism_drb_gdf = prism_drb_gdf.to_crs(prism_crs)

# convert huc6 field to int as this works best for the following steps
prism_drb_gdf["huc6"] = prism_drb_gdf["huc6"].astype(int) #note: this may drop the # of digits from 6 to less depending on how many zeroes 
                                                            #  there were, may need to pad back to 6 digits later

# create an output grid
prism_out_grid = make_geocube(
    vector_data = prism_drb_gdf,
    measurements=["huc6"],
    like=prism_drb
)

# add data arrays to the grid
prism_out_grid["TK"] = (prism_drb.TK.dims, prism_drb.TK.values,
                         prism_drb.TK.attrs, prism_drb.TK.encoding)

prism_out_grid["PREC_ACC_NC"] = (prism_drb.PREC_ACC_NC.dims, prism_drb.PREC_ACC_NC.values,
                         prism_drb.PREC_ACC_NC.attrs, prism_drb.PREC_ACC_NC.encoding)

# groupby
prism_grouped = prism_out_grid.drop_vars("spatial_ref").groupby(prism_out_grid.huc6)

# Calculate the mean variables
prism_grid_mean = prism_grouped.mean().rename({"TK": "prism_TK_mean", 
                                       "PREC_ACC_NC": "prism_PREC_ACC_NC_mean", "time":"time_index"})

#convert to a dataframe
prism_zonal_stats = prism_grid_mean.to_dataframe().drop("spatial_ref", axis=1)

# reset index and add time back
prism_zonal_stats = prism_zonal_stats.reset_index(drop=False)
prism_zonal_stats["time"] = prism_drb.coords["time"][prism_zonal_stats["time_index"].values]
prism_zonal_stats.drop("time_index", axis=1, inplace=True)
prism_zonal_stats["time"] = prism_zonal_stats["time"].astype(str).str[:-3]

# change huc6 to string and pad with zeros
prism_zonal_stats["huc6"] = prism_zonal_stats["huc6"].astype(int).astype(str).str.zfill(6) # pad with leading zeros so every value has length 6

prism_zonal_stats

Merge the PRISM and CONUS404 zonal statistics together based on the HUC6 code and time

In [None]:
prism_c404_zonal = prism_zonal_stats.merge(c404_zonal_stats, on=['huc6', 'time'])
prism_c404_zonal.head()
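
Because `merge` defaults to an inner join, only the (HUC6, time) pairs present in both tables survive. The sketch below shows this on two small made-up tables; the frame and column names (`reference`, `benchmark`, `ref_TK_mean`, `bench_TK_mean`) are hypothetical, and `validate="one_to_one"` is an optional check that each key pair occurs at most once per table.

```python
import pandas as pd

reference = pd.DataFrame({
    "huc6": ["020401", "020401", "020402"],
    "time": ["2000-01", "2000-02", "2000-01"],
    "ref_TK_mean": [271.0, 273.5, 272.0],
})
benchmark = pd.DataFrame({
    "huc6": ["020401", "020402", "020402"],
    "time": ["2000-01", "2000-01", "2000-02"],
    "bench_TK_mean": [270.5, 271.8, 274.1],
})

# inner join (the default): keep rows whose (huc6, time) pair is in both tables
merged = reference.merge(benchmark, on=["huc6", "time"], validate="one_to_one")
print(merged)
```

Comparing `len(merged)` with the lengths of the inputs is a quick way to notice time steps or zones that did not line up.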

We don't need the CONUS404 RNET value so we'll drop that column before exporting the data

In [None]:
prism_c404_zonal.drop("c404_RNET_mean", axis=1, inplace=True)

Export the data

In [None]:
prism_c404_zonal.to_parquet("s3://nhgf-development/workspace/tutorial/CONUS404/prism_c404_zonal.parquet")

**CERES-EBAF zonal statistics**

CERES-EBAF has a single variable: RNET

In [None]:
# url to ceres_drb
ceres_drb_url = 's3://nhgf-development/workspace/tutorial/CONUS404/ceres_drb.nc'

fs = fsspec.filesystem("s3", anon=False, requester_pays=True, skip_instance_cache=True)

# open dataset
ceres_drb = xr.open_dataset(fs.open(ceres_drb_url), decode_coords="all")

# set crs
ceres_crs = ceres_drb.rio.crs.to_proj4()

# bring in HUC6 boundaries found in the DRB
ceres_drb_gdf = pygeohydro.WBD("huc6", outfields=["huc6", "name"]).byids("huc6", ["020401", "020402"])

# set CRS to match ceres_drb
ceres_drb_gdf = ceres_drb_gdf.to_crs(ceres_crs)

# convert huc6 field to int as this works best for the following steps
ceres_drb_gdf["huc6"] = ceres_drb_gdf["huc6"].astype(int) #note: this may drop the # of digits from 6 to less depending on how many zeroes 
                                                            #  there were, may need to pad back to 6 digits later
    
# create an output grid
ceres_out_grid = make_geocube(
    vector_data = ceres_drb_gdf,
    measurements=["huc6"],
    like=ceres_drb
)

# add data arrays to the grid
ceres_out_grid["RNET"] = (ceres_drb.RNET.dims, ceres_drb.RNET.values,
                         ceres_drb.RNET.attrs, ceres_drb.RNET.encoding)

# groupby
ceres_grouped = ceres_out_grid.drop_vars("spatial_ref").groupby(ceres_out_grid.huc6)

# Calculate the mean variables
ceres_grid_mean = ceres_grouped.mean().rename({"RNET": "ceres_RNET_mean", "time":"time_index"})

#convert to a dataframe
ceres_zonal_stats = ceres_grid_mean.to_dataframe().drop("spatial_ref", axis=1)

# reset index and add time back
ceres_zonal_stats = ceres_zonal_stats.reset_index(drop=False)
ceres_zonal_stats["time"] = ceres_drb.coords["time"][ceres_zonal_stats["time_index"].values]
ceres_zonal_stats.drop("time_index", axis=1, inplace=True)
ceres_zonal_stats["time"] = ceres_zonal_stats["time"].astype(str).str[:-3]

# change huc6 to string and pad with zeros
ceres_zonal_stats["huc6"] = ceres_zonal_stats["huc6"].astype(int).astype(str).str.zfill(6) # pad with leading zeros so every value has length 6

ceres_zonal_stats

Merge the CERES-EBAF and CONUS404 zonal statistics together based on the HUC6 code and time

In [None]:
ceres_c404_zonal = ceres_zonal_stats.merge(c404_zonal_stats, on=['huc6', 'time'])
ceres_c404_zonal.head()

We don't need the CONUS404 TK and PREC_ACC_NC values so we'll drop these columns before exporting the data

In [None]:
ceres_c404_zonal.drop(["c404_TK_mean", "c404_PREC_ACC_NC_mean"], axis=1, inplace=True)

In [None]:
ceres_c404_zonal

Export the data

In [None]:
ceres_c404_zonal.to_parquet("s3://nhgf-development/workspace/tutorial/CONUS404/ceres_c404_zonal.parquet")

## **Extract gridded values to points**

The goal of this section is to extract values from CONUS404 where they intersect with station data. This process is described in the documentation for the Esri tool [Extract Values to Points](https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-analyst/extract-values-to-points.htm). The resulting tabular data will then be exported for use in the next notebook, **CONUS404 Analysis**.

Dataset outline:
1. Read in the prepared dataset
2. Extract data from overlapping pixel at same time step as point
<br>
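
The extraction step in the outline can be sketched with NumPy alone. The function `extract_at_points` and all inputs below are hypothetical and made up for illustration. The sketch assumes a regular grid described by cell-center coordinates, points already expressed in the grid's coordinate reference system, and time labels that match exactly.

```python
import numpy as np

def extract_at_points(grid, xs, ys, times, px, py, pt):
    """Return grid values at the cell nearest each point, at the point's time step.

    grid   : array of shape (time, y, x)
    xs, ys : 1-D arrays of cell-center coordinates
    times  : sequence of time labels for the first axis
    px, py, pt : per-point x, y and time label
    """
    # index of the nearest cell center along each spatial axis
    ix = np.abs(xs[None, :] - np.asarray(px)[:, None]).argmin(axis=1)
    iy = np.abs(ys[None, :] - np.asarray(py)[:, None]).argmin(axis=1)
    # position of each point's time label on the time axis
    lookup = {t: i for i, t in enumerate(times)}
    it = np.array([lookup[t] for t in pt])
    # one value per point, paired element-wise across the three index arrays
    return grid[it, iy, ix]

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([10.0, 11.0])
times = ["2000-01", "2000-02"]
grid = np.arange(12, dtype=float).reshape(2, 2, 3)  # made-up values

values = extract_at_points(
    grid, xs, ys, times,
    px=[0.1, 1.9], py=[10.9, 10.2], pt=["2000-01", "2000-02"],
)
print(values)
```

On a regular grid, the nearest cell center is the cell that contains the point, which matches the "overlapping pixel" described in the outline.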

**Climate Reference Network point extraction**

In [None]:
fs = fsspec.filesystem("s3", anon=False, requester_pays=True, skip_instance_cache=True)

crn_drb_df = pd.read_parquet(fs.open("s3://nhgf-development/workspace/tutorial/CONUS404/crn_drb.parquet"))

In [None]:
crn_drb = gpd.GeoDataFrame(crn_drb_df, 
                       geometry=gpd.points_from_xy(crn_drb_df.LONGITUDE, 
                                                         crn_drb_df.LATITUDE))

# modify date field
crn_drb["DATE"] = crn_drb["DATE"].astype(str).str[:-3]

crn_drb.head()

Make sure all files have been created. There should be:
1. c404_drb.nc
2. ceres_drb.nc
3. crn_drb.parquet
4. hcn_drb.parquet
5. prism_drb.nc

In [None]:
fs = fsspec.filesystem("s3", anon=False, requester_pays=True, skip_instance_cache=True)

fs.ls("s3://nhgf-development/workspace/tutorial/CONUS404", detail=True)
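
Rather than reading the listing by eye, the check can be automated. This is a minimal sketch: `missing_files` is a hypothetical helper, and the `listing` shown is a made-up example shaped like the list of dicts that `fs.ls(..., detail=True)` returns (each with a `"name"` key).

```python
import posixpath

EXPECTED = {
    "c404_drb.nc",
    "ceres_drb.nc",
    "crn_drb.parquet",
    "hcn_drb.parquet",
    "prism_drb.nc",
}

def missing_files(listing, expected=EXPECTED):
    """Return the expected file names that are absent from an fsspec-style listing."""
    present = {posixpath.basename(entry["name"]) for entry in listing}
    return sorted(set(expected) - present)

# hypothetical listing, for illustration only
prefix = "nhgf-development/workspace/tutorial/CONUS404"
listing = [
    {"name": f"{prefix}/c404_drb.nc", "type": "file"},
    {"name": f"{prefix}/ceres_drb.nc", "type": "file"},
    {"name": f"{prefix}/crn_drb.parquet", "type": "file"},
    {"name": f"{prefix}/prism_drb.nc", "type": "file"},
]

print(missing_files(listing))
```

In the notebook, the result of the `fs.ls` call above would take the place of the made-up `listing`.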

In [None]:
# # Last code cell of the notebook
# import watermark.watermark as watermark
# print(watermark(iversions=True, python=True, machine=True, globals_=globals()))

In [None]:
client.close(); cluster.shutdown()