# High Water Marks from USGS STN
The United States Geological Survey (USGS) maintains a database of flood event data known as the [Short-Term Network (STN)](https://stn.wim.usgs.gov/stnweb/#/). This database has a convenient [web front-end](https://stn.wim.usgs.gov/FEV/) and also a [RESTful API](). This notebook reviews some of the capabilities available for high water marks (HWMs), including retrieving data dictionaries, retrieving all available data by type, and making filtered queries. It also highlights some of the limitations and errors in the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.markers as mmarkers
import matplotlib.lines as mlines

from pygeohydro.helpers import get_us_states
from pygeohydro import STNFloodEventData

After importing, we start by retrieving all of the HWM data available in the database as a GeoDataFrame.

In [None]:
hwm_all = STNFloodEventData.get_all_data("hwms", as_list=False, async_retriever_kwargs={"disable": True, "max_workers": 6})
hwm_all.head()

In [None]:
print("There are {} HWMs in the database.".format(len(hwm_all)))

For an interactive map, we can call the `explore` method on the full HWM dataset. There are at least 34,000 HWMs in the STN database, scattered throughout the country. Note that outliers are possible: the data is collected by people and is prone to error.

In [None]:
hwm_all.explore(
    marker_kwds={"radius": 2},
    style_kwds={"stroke": False},
)

Next, we illustrate a filtered query against the same HWM data. First, we list the parameters available for querying using the `STNFloodEventData.hwms_query_params` attribute.

In [None]:
STNFloodEventData.hwms_query_params

In [None]:
hwm_filtered = STNFloodEventData.get_filtered_data(
    "hwms",
    crs="ESRI:102003",
    async_retriever_kwargs={"disable": True},
    query_params={"States": "SC,NC"},
)
hwm_filtered.head()
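
If query parameters are assembled programmatically, it can help to check the names before sending a request. The following is a minimal sketch, not part of pygeohydro: `build_query_params` is a hypothetical helper, the `allowed` list (including the name `"County"`) is an assumed stand-in for `STNFloodEventData.hwms_query_params`, and joining multiple values with commas is inferred from the `"SC,NC"` form used above.

```python
def build_query_params(allowed, **filters):
    """Validate filter names and join list values with commas (hypothetical helper)."""
    unknown = sorted(set(filters) - set(allowed))
    if unknown:
        raise ValueError(f"Unsupported query parameters: {unknown}")
    params = {}
    for name, value in filters.items():
        # Accept either a ready-made string or an iterable of strings
        params[name] = value if isinstance(value, str) else ",".join(value)
    return params


# Assumed stand-in for STNFloodEventData.hwms_query_params
allowed = ["States", "County"]
params = build_query_params(allowed, States=["SC", "NC"])
print(params)  # {'States': 'SC,NC'}
```

In the notebook, the resulting dictionary could be passed as `query_params`, with `STNFloodEventData.hwms_query_params` supplied as `allowed`.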

We can also retrieve the data dictionary for HWMs. The `as_dict` argument returns the data as a dictionary, but we prefer the default pandas DataFrame for this example. As before, keyword arguments are passed to the async retriever, here to disable caching.

In [None]:
hwm_dd = STNFloodEventData.data_dictionary("hwms", as_dict=False, async_retriever_kwargs={"disable": True})
hwm_dd.head()

Note that the schemas for the three requests (all data, filtered data, and data dictionaries) don't necessarily agree.

In [None]:
# Check whether all three sources expose exactly the same set of field names
print(f"Do all three sources share the same fields?: {set(hwm_all.columns) == set(hwm_filtered.columns) == set(hwm_dd['Field'])}")

# List the field names side by side
pd.concat([
    pd.Series(hwm_all.columns, name="All HWM Fields"),
    pd.Series(hwm_filtered.columns, name="Filtered HWM Fields"),
    pd.Series(hwm_dd["Field"], name="HWM Data Dictionary Fields")
], axis=1)

While many of the differences can be inferred, some discrepancies could lead to columns with ambiguous information. The USGS is working on an updated RESTful API that should address this. These differences are also present in the other data types: "instruments", "peaks", and "sites".
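
To see exactly which fields each source lacks, a set difference is more direct than scanning a table by eye. The sketch below is illustrative only: `schema_diff` is a hypothetical helper, and the column lists are made-up placeholders rather than the actual STN schemas.

```python
def schema_diff(schemas):
    """Map each source name to the fields it lacks but another source has."""
    all_fields = set().union(*schemas.values())
    return {name: all_fields - set(fields) for name, fields in schemas.items()}


# Placeholder column lists (hypothetical, not the real STN schemas)
schemas = {
    "all": ["hwm_id", "latitude_dd", "longitude_dd", "elev_ft"],
    "filtered": ["hwm_id", "latitude", "longitude", "latitude_dd", "longitude_dd", "eventName"],
    "dictionary": ["hwm_id", "latitude_dd", "longitude_dd", "elev_ft", "site_latitude"],
}

missing = schema_diff(schemas)
for name, fields in missing.items():
    print(f"{name} is missing: {sorted(fields)}")
```

With the objects in this notebook, the same helper could be called with `hwm_all.columns`, `hwm_filtered.columns`, and `hwm_dd["Field"]` as the three values.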

Now we will plot some of the HWMs. First, we retrieve the state boundaries and reproject them, along with the filtered HWMs, to the geographic EPSG:4326 CRS.

In [None]:
carolina_lines = get_us_states(["NC", "SC"]).to_crs("EPSG:4326")
hwm_filtered = hwm_filtered.to_crs("EPSG:4326")

In [None]:
fig, ax = plt.subplots(figsize=(6, 4), dpi=200)

event_names = hwm_filtered.loc[:,'eventName'].unique()
markers = dict(zip(event_names, mmarkers.MarkerStyle.filled_markers[:len(event_names)]))

ax.set_title("HWMs - Height Above Ground (ft)", fontsize=9)
ax.set_xlabel("Longitude (deg)", fontsize=8)
ax.set_ylabel("Latitude (deg)", fontsize=8)

ax.tick_params(axis='both', which='major', labelsize=8)

# Use a fixed 0-10 ft color range so every event shares one color scale
for i, (event_name, data) in enumerate(hwm_filtered.groupby('eventName')):
    data.plot(
        ax=ax,
        column="height_above_gnd",
        alpha=0.7,
        legend=(i == 0),  # draw the colorbar only once
        markersize=3,
        marker=markers[event_name],
        vmin=0,
        vmax=10,
    )

# Create a list of Line2D objects to use for the legend
legend_elements = [mlines.Line2D([0], [0], color='black', marker=markers[event_name], linestyle='None') for event_name in event_names]

# Add the legend to the plot
ax.legend(legend_elements, event_names, loc='lower right', title='Event Names', bbox_to_anchor=(1, 0), prop={'size': 5})

colorbar = plt.gcf().get_axes()[-1]
colorbar.tick_params(labelsize=8)

carolina_lines.plot(ax=ax, facecolor="none", edgecolor="black", linewidth=0.2)

plt.show()

### Data Quality Issues

Inspecting the figure above reveals an HWM in the Atlantic Ocean. Selecting that point gives the following information about the outlier. We display only a few of the fields that may contain the problem.

In [None]:
outlier = hwm_filtered.loc[hwm_filtered.latitude < 31,:].squeeze()
print(
    outlier.loc[
        [
            "siteDescription",
            "waterbody",
            "stateName",
            "countyName",
            "latitude_dd",
            "longitude_dd",
            "site_latitude",
            "site_longitude",
            "height_above_gnd"
        ]
    ]
)

Inspecting the fields above reveals that this potential outlier should be in Georgetown County, which is on the coast of South Carolina just south of Myrtle Beach. Additionally, the fields show two different entries each for latitude and longitude. We look at the definitions of the two latitude fields below.

In [None]:
print(f"'site_latitude' : {hwm_dd.loc[hwm_dd.loc[:,'Field'] == 'site_latitude','Definition'].iloc[0]}")
print(f"'latitude_dd' : {hwm_dd.loc[hwm_dd.loc[:,'Field'] == 'latitude_dd','Definition'].iloc[0]}")

From this, we can say that the 'site_latitude' field gives the horizontal location of the common water surface, while the 'latitude_dd' field gives that of the HWM itself. This distinction explains why the two fields are expected to differ. Nevertheless, it is all but impossible that this HWM was collected so far from the Pee Dee and Waccamaw Rivers, so the recorded location is likely a typo. Keep in mind that this service is fed with data entered by real people, who are liable to make simple mistakes. We advise inspecting your data for inconsistencies before using it for analysis.
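
One way to screen for this kind of error is to measure how far each HWM lies from its site and flag unusually large separations. The following is a rough sketch under stated assumptions: the records are invented for illustration (they are not STN data), `flag_suspect` is a hypothetical helper, and the 5 km threshold is arbitrary and would need tuning for real use. The distance uses the standard haversine formula on a spherical Earth.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def flag_suspect(records, max_km=5.0):
    """Return records whose HWM lies more than max_km from its site (assumed threshold)."""
    return [
        rec
        for rec in records
        if haversine_km(
            rec["latitude_dd"], rec["longitude_dd"],
            rec["site_latitude"], rec["site_longitude"],
        ) > max_km
    ]


# Invented example records using the field names discussed above
records = [
    {"hwm_id": 1, "latitude_dd": 33.371, "longitude_dd": -79.281,
     "site_latitude": 33.370, "site_longitude": -79.280},
    {"hwm_id": 2, "latitude_dd": 30.370, "longitude_dd": -79.280,
     "site_latitude": 33.370, "site_longitude": -79.280},
]

suspects = flag_suspect(records)
print([rec["hwm_id"] for rec in suspects])  # [2]
```

A flagged record is only a candidate for review. Since the site and HWM coordinates are expected to differ somewhat, the threshold should be chosen with the local geography in mind.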