# Lesson: Data Exploration

## About 
This notebook shows a user how to load data using the HyTEST `intake` catalog and `dask`, explore that data using `xarray`, and plot that data using `hvplot`.

Authors: Sydney Foks, Gene Trantham, Andrew Laws, Tim Hodson, and Rich Signell

First, we must load two crucial libraries, `intake` and `xarray`.

In [None]:
# load libraries
import intake
import xarray as xr

## using `intake`
The HyTEST catalog is structured to be compatible with the Python `intake` [package](https://intake.readthedocs.io/en/latest/index.html) and facilitates reading the data into this notebook (and others in this training course). 

The intake catalog is stored as a yaml file, which is easy to parse using other programming languages (even if there is no equivalent to the `intake` package in that programming language). For an in-depth tutorial, please see the [Pangeo intake tutorial](http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/intake.html). Intake is ideal for us in HyTEST because if we change where a dataset is stored, we only have to update it in one place (the catalog) rather than in every notebook that references the data. To read more about the HyTEST intake catalogs, please view the [hytest repo](https://github.com/hytest-org/hytest/tree/main/dataset_catalog).
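To make the yaml structure concrete, here is a minimal sketch of what a catalog entry can look like and how it can be parsed without `intake`. The entry name, bucket, and description are hypothetical and are not taken from the HyTEST catalog; the sketch also assumes the PyYAML package is installed.

```python
import yaml  # PyYAML

# Hypothetical catalog text; the entry name, bucket, and description are made up.
catalog_text = """
sources:
  example-streamflow-cloud:
    driver: zarr
    description: Hypothetical streamflow dataset stored in the cloud
    args:
      urlpath: s3://hypothetical-bucket/example-streamflow.zarr
      consolidated: true
      storage_options:
        requester_pays: false
"""

catalog = yaml.safe_load(catalog_text)

# Every dataset is listed under the top-level "sources" key.
dataset_names = list(catalog["sources"])
entry = catalog["sources"]["example-streamflow-cloud"]

# The storage location is recorded once, here, so notebooks never hard-code it.
urlpath = entry["args"]["urlpath"]
print(dataset_names)
print(urlpath)
```

If the data moved to a new location, only the `urlpath` line in the yaml would change; a notebook that asks for the dataset by name would stay the same.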

##### Channeling our Pangeo concepts, we will open a cloud native dataset using `intake` since we are working in a cloud computing environment.

In [None]:
# open the hytest data intake catalog
hytest_cat = intake.open_catalog(r"https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")

# list all the datasets in the catalog
list(hytest_cat)

We see some acronyms of modeling applications (e.g., 'nwm', 'nhm', 'conus404') appended with 'cloud' or 'onprem'; this suffix designates the storage location of the data. To view the full filepaths and URLs behind each data source, please see the yaml file on the [hytest repo](https://github.com/hytest-org/hytest/blob/main/dataset_catalog/hytest_intake_catalog.yml).

In the intake catalog, you'll see references to additional catalogs. We call these nested catalogs and they are ideal for housing data with multiple types of calibration schemes or for data that pertains to a course or specific tutorial. 

In [None]:
# examining nested catalogs (example)
nested_cat = hytest_cat['nhm-v1.0-daymet-catalog']
list(nested_cat)

For this tutorial we will choose one dataset, the National Water Model version 2.1, which contains streamflow and also velocity, as we will see in a moment.

In [None]:
# choose a dataset from the above list
dataset = "nwm21-streamflow-usgs-gages-cloud"

In [None]:
# and view the metadata
hytest_cat[dataset]

In some cases, `requester_pays` will be set to `true`. If so, you will need to set up your AWS (Amazon Web Services) credentials to load the data from S3 object storage. Please see this [notebook](https://github.com/hytest-org/hytest/blob/main/environment_set_up/Help_AWS_Credentials.ipynb) for assistance. The good news is that `requester_pays` is set to `false` for this particular dataset, so we can continue without AWS credentials.

## using `dask`

To load this data, we will start a parallel cluster using the Python package `dask` (in-depth tutorial [here](http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html)). Dask parallelism makes use of 'clusters' of workers, each of which is given some task to do. Much like inviting your friends to help you move, having more workers usually gets the job done faster. Dask also allows for lazy operations, meaning a dataset is not loaded into memory (RAM) until you ask for it.
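As a small illustration of lazy evaluation, the sketch below builds a `dask` array, describes a calculation on it, and only runs the work when `compute()` is called. The array shape and chunk sizes are arbitrary values chosen for the example.

```python
import dask.array as da

# Describe a 4000 x 4000 array of ones split into 16 chunks; nothing is computed yet.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# This line only records a task graph describing the work to be done.
lazy_mean = (x * 2).mean()
print(type(lazy_mean))

# compute() executes the graph and returns a concrete value.
result = lazy_mean.compute()
print(result)
```

Each chunk can be handled by a different worker, which is why chunked data and clusters pair well.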

Cluster configurations vary widely, depending on the task and the hardware available on the compute platform you are using. Dask is extremely useful when loading large amounts of data into the notebook and speeds up data loading significantly, especially when accessing data from the cloud. 

For a tutorial on `dask` bag, see [here](https://github.com/hytest-org/hytest/blob/main/essential_reading/Parallel_Dask.ipynb).

In [None]:
# load libraries
import logging
import os

Note: users need to set up AWS credentials before initializing a cluster, because the workers need write access.

The commands in the cell below are specific to cloud computing. HyTEST also has helper scripts to assist with [cluster initialization](https://github.com/hytest-org/hytest/tree/main/environment_set_up); when running the notebooks in the main [HyTEST repo](https://github.com/hytest-org/hytest), a user can instead run the command `%run ../environment_set_up/Start_Dask_Cluster_Nebari.ipynb`. See also the other 'Start_Dask_Cluster...ipynb' files.

##### initialize cluster

In [None]:
try:
    from dask_gateway import Gateway
except ImportError:
    logging.error(
        "Unable to import Dask Gateway.  Are you running in a cloud compute environment?\n"
    )
    raise
os.environ["DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION"] = "1.0"

gateway = Gateway()
_options = gateway.cluster_options()
_options.conda_environment = (
    "users/users-pangeo"  ##<< this is the conda environment we use on nebari.
)
_options.profile = "Medium Worker"
_env_to_add = {}
aws_env_vars = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
]
for _e in aws_env_vars:
    if _e in os.environ:
        _env_to_add[_e] = os.environ[_e]
_options.environment_vars = _env_to_add
cluster = gateway.new_cluster(_options)  ##<< create cluster via the dask gateway
cluster.adapt(minimum=2, maximum=30)  ##<< Sets scaling parameters.

client = cluster.get_client()

print(
    "The 'cluster' object can be used to adjust cluster behavior, e.g. 'cluster.adapt(minimum=10)'"
)
print(
    "The 'client' object can be used to directly interact with the cluster, e.g. 'client.submit(func)'"
)
print(f"The link to view the client dashboard is:\n>  {client.dashboard_link}")

The above link is important for visualizing the Cluster Map and Task Stream for the cluster that we just initialized. 
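If Dask Gateway is not available (for example, on a laptop), one possible substitute is a small local cluster from `dask.distributed`. This is a sketch of a different technique than the gateway cluster above, not the HyTEST setup; the worker counts and the `square` function are arbitrary choices for the example.

```python
from dask.distributed import Client, LocalCluster


def square(value):
    """Toy task used to confirm the cluster can run work."""
    return value ** 2


# Assumption: an in-process cluster with the dashboard disabled stands in for
# the Dask Gateway cluster. The context managers shut everything down afterward.
with LocalCluster(
    n_workers=1,
    threads_per_worker=2,
    processes=False,
    dashboard_address=None,
) as local_cluster:
    with Client(local_cluster) as local_client:
        # submit() sends one task to the cluster and returns a future.
        future = local_client.submit(square, 12)
        squared = future.result()

print(squared)
```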

### load dataset with `dask` and `xarray`
We are now going to call our dataset from our intake catalog and load it lazily with `dask`.

In [None]:
%%time
ds = hytest_cat[dataset].to_dask()

In [None]:
# let's view this dataset
# type(ds)
ds

From examining the xarray dataset above, we have dimensions of 7994 gage_ids and 367,439 time slices. 

We also have two data variables (streamflow and velocity), along with coordinates of elevation, gage_id, latitude, longitude, and stream order. The dimensions of the streamflow and velocity variables are time and gage_id.
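To experiment with this layout without touching the cloud data, a small stand-in dataset with the same structure can be built by hand. Everything in this sketch is made up (gage IDs, values, and coordinates); only the variable and coordinate names mirror the description above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in data: 3 gages and 48 hourly time steps.
time = pd.date_range("2000-01-01", periods=48, freq="h")
gage_ids = ["USGS-00000001", "USGS-00000002", "USGS-00000003"]
rng = np.random.default_rng(0)

example = xr.Dataset(
    data_vars={
        # Both data variables share the (time, gage_id) dimensions.
        "streamflow": (("time", "gage_id"), rng.random((48, 3))),
        "velocity": (("time", "gage_id"), rng.random((48, 3))),
    },
    coords={
        "time": time,
        "gage_id": gage_ids,
        # These coordinates describe each gage, so they vary only along gage_id.
        "latitude": ("gage_id", [45.5, 44.8, 46.1]),
        "longitude": ("gage_id", [-68.3, -69.0, -67.9]),
        "elevation": ("gage_id", [120.0, 95.0, 210.0]),
        "order": ("gage_id", [4, 2, 3]),
    },
)

print(dict(example.sizes))
print(list(example.data_vars))
```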

So what is the timestep in this dataset? You can use the three disk symbol near the `time` coordinate to examine the values or you can call them out explicitly.

In [None]:
ds['time']
#ds['time.month']
#ds['time.year']

We see that our timesteps are hourly, and that our metadata lacks any information about the timezone. This is a good example of why it's important to retain metadata from your source data.
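The timestep can also be checked programmatically instead of by eye. The sketch below uses a made-up hourly time coordinate; with the real dataset, `ds['time']` would take the place of `times`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical hourly time coordinate standing in for ds['time'].
times = xr.DataArray(
    pd.date_range("2000-01-01", periods=72, freq="h"),
    dims="time",
    name="time",
)

# Differences between consecutive time stamps; a regular series has one unique value.
steps = np.unique(np.diff(times.values))
print(steps.astype("timedelta64[h]"))
```

More than one unique step would suggest gaps or irregular sampling in the record.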

##### We can use the `sel` function to select data by coordinate label, such as a date string or a gage ID.

In [None]:
## select a year of data for all gages in the dataset
ds.sel(time = '2005')

## select only streamflow for all gages for only 2005
#ds.streamflow.sel(time = '2005')

##### We can use `isel` to select indices (index select) within the array or matrix.

In [None]:
# select the first gage id in the dataset using the isel function (Python indices start at 0)
ds.isel(gage_id = 0)

In [None]:
# select streamflow for the first gage id in the dataset using the isel function (index 0)
ds.streamflow.isel(gage_id = 0)

##### Coordinates that are not directly a dimension of any of the variables have to be called out explicitly to examine the data. So how do we examine the latitude/longitude of a gage?

In [None]:
## traditional indexing:
# ds.gage_id[0].latitude.values

## using isel:
ds.isel(gage_id = 0).latitude.values

## using sel:
#ds.sel(gage_id = "USGS-01030500").latitude.values
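The difference between label-based `sel` and position-based `isel` is easy to verify on a tiny dataset. The gage IDs and values in this sketch are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: 2 hourly time steps and 3 gages.
example = xr.Dataset(
    {"streamflow": (("time", "gage_id"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": np.array(["2005-01-01T00", "2005-01-01T01"], dtype="datetime64[ns]"),
        "gage_id": ["USGS-A", "USGS-B", "USGS-C"],
        "latitude": ("gage_id", [45.5, 44.8, 46.1]),
    },
)

# isel selects by integer position; position 0 is the first gage.
by_position = example.isel(gage_id=0)

# sel selects by coordinate label and returns the same gage here.
by_label = example.sel(gage_id="USGS-A")

# latitude is not a dimension, so it is read from the selected gage.
first_lat = float(by_position.latitude.values)

# A partial date string selects every time step within that year.
year_2005 = example.sel(time="2005")

print(by_position.gage_id.item(), first_lat)
```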

##### Question for user: What's the stream order of the first gage in our dataset? Order is a coordinate. 

In [None]:
# answer is:
# ds.isel(gage_id = 0).order.values

##### Censoring data, checking for NaNs, Infs
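One possible way to count missing and infinite values and then mask them is sketched below. The streamflow values are made up, and masking with `where` is only one of several reasonable approaches to censoring.

```python
import numpy as np
import xarray as xr

# Hypothetical streamflow series containing one NaN and one infinite value.
flow = xr.DataArray(
    [1.0, np.nan, 3.0, np.inf, 5.0],
    dims="time",
    name="streamflow",
)

# Count missing (NaN) and infinite values separately.
n_nan = int(flow.isnull().sum().item())
n_inf = int(np.isinf(flow).sum().item())

# Keep only finite values; everything else becomes NaN.
cleaned = flow.where(np.isfinite(flow))

# xarray skips NaN by default when reducing floating-point data.
clean_mean = float(cleaned.mean().item())

print(n_nan, n_inf, clean_mean)
```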

##### Let's use `dask` to average streamflow for the first gage in our dataset (USGS-01030350)

Use `sel` to find the first gage and add `mean` to average over the time dimension.

In [None]:
ds0 = ds['streamflow'].sel(gage_id = 'USGS-01030350').mean('time')
ds0.compute().values

##### Let's use `dask` to average streamflow and velocity for the first 100 gages in the dataset (total n = 7994). 

We can view the workers performing tasks in real-time using the link that was initialized and supplied to us when we set up our cluster. 

The task stream is a view of which tasks have been running on each thread of each worker. Each row visible in the task stream subwindow is a thread, and each rectangle represents an individual task.
The cluster map is showing the data exchange between nodes.

This next cell will take some time.

In [None]:
ds1 = ds.isel(gage_id=slice(0,100)).mean('time').compute()

In [None]:
ds1

##### We now have one mean streamflow and velocity value for each of the 100 gages in the dataset! But what if we only wanted the mean of the hourly values from 2000 through 2005?

In [None]:
# end the slice at 23:00 so the last day of 2005 is fully included
ds2 = ds.sel(time=slice('2000-01-01 00:00','2005-12-31 23:00')).isel(gage_id=slice(0,100)).mean("time").compute()

In [None]:
ds2

##### What if we want annual sums from 2000 to 2005 for the first 100 gages in the dataset?

In [None]:
# Answer:

##### What about monthly sums? How would the command change?

In [None]:
# Answer:
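One possible answer to the two questions above is sketched here on a small made-up series, where every hourly value is 1 so the sums are easy to check. With the real dataset, the same `groupby` and `resample` calls would be applied after selecting the years and gages of interest.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical hourly series of ones for two gages covering 2000 and 2001.
time = pd.date_range("2000-01-01 00:00", "2001-12-31 23:00", freq="h")
flow = xr.DataArray(
    np.ones((time.size, 2)),
    dims=("time", "gage_id"),
    coords={"time": time, "gage_id": ["USGS-A", "USGS-B"]},
    name="streamflow",
)

# Annual sums: group the time steps by calendar year, then sum each group.
annual = flow.groupby("time.year").sum("time")

# Monthly sums: resample to month-start ("MS") bins, then sum each bin.
monthly = flow.resample(time="MS").sum()

print(annual.sel(gage_id="USGS-A").values)
print(monthly.sizes["time"])
```

Going from annual to monthly only changes how the time steps are grouped; the reduction (`sum`) stays the same.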

##### Now let's use a bounding box to grab gages of interest, then let's calculate annual sums for five years for each of the gages in the region.
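A bounding-box selection can be sketched by building a boolean mask from the latitude and longitude coordinates. The gage IDs, coordinates, and box limits below are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with four gages at made-up locations.
example = xr.Dataset(
    {"streamflow": (("time", "gage_id"), np.ones((4, 4)))},
    coords={
        "gage_id": ["USGS-A", "USGS-B", "USGS-C", "USGS-D"],
        "latitude": ("gage_id", [45.5, 38.2, 44.9, 30.1]),
        "longitude": ("gage_id", [-68.3, -77.0, -69.5, -97.7]),
    },
)

# Hypothetical bounding box limits in decimal degrees.
lat_min, lat_max = 44.0, 46.0
lon_min, lon_max = -70.0, -68.0

# True for every gage that falls inside the box.
inside = (
    (example.latitude >= lat_min)
    & (example.latitude <= lat_max)
    & (example.longitude >= lon_min)
    & (example.longitude <= lon_max)
)

# Convert the mask to integer positions and select those gages.
subset = example.isel(gage_id=np.flatnonzero(inside.values))
selected = subset.gage_id.values.tolist()
print(selected)
```

Annual sums for the gages in the box could then be calculated on `subset` in the same way as for any other selection.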

##### Gages of interest could also be selected with a shapefile (for example, one from ScienceBase); the conus404 data access notebooks may be a useful reference for that approach.

## using `hvplot`, plot streamflow!

We will see more with regards to the `hvplot` Python package and its capabilities in the next segment of the tutorial, but for now we wanted to show how one might plot a histogram and hydrograph from a national model.

In [None]:
# import relevant libraries
import hvplot.xarray

In [None]:
# ds.plot style first
# hvplot style next, just show one gage

In [None]:
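import dask.array as da

# ds3 is not defined earlier in this notebook; as an assumption,
# use 2000 through 2005 of data for a single gage
ds3 = ds.sel(gage_id='USGS-01030350', time=slice('2000-01-01 00:00','2005-12-31 23:00'))
ds3['logQ'] = da.log10(ds3.streamflow)
ds3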

In [None]:
ds3.logQ.hvplot.hist(bins = 50)
#ds.hvplot.hist(y=streamflow,bins = 50, rasterize = True)

In [None]:
#ds3.dask.visualize()

In [None]:
# monthly timeseries
ds2

Let's load our streamflow into memory. For tutorial purposes, we will use 2000 through 2005 for a single gage.

In [None]:
# select one gage for 2000 through 2005, then load the values into memory
ds2 = ds.sel(gage_id='USGS-01030350', time=slice('2000-01-01 00:00','2005-12-31 23:00')).load()
ds2

In [None]:
ds2.streamflow.plot()

In [None]:
import hvplot.xarray
ds2.streamflow.hvplot(x='time', grid = True)

Consider setting `rasterize=True` when plotting more than about 100 x 200 values. It is good for maps and similar plots because it avoids exhausting memory.

##### When working in the cloud, it's important to shut down all clusters so the resources are available for others.

In [None]:
client.close()
cluster.shutdown()

##### Segue into next section:
- In this notebook, we covered very basic ways to explore data with `dask`, `xarray`, and `hvplot`.
- In the next notebook, we focus on more advanced plotting with `hvplot` and `panel`, both packages supported by the Pangeo platform.

The End. Thanks!