# Access and cite point observation data

To launch this notebook interactively in a Jupyter notebook-like browser interface, please click the "Launch Binder" button below. Note that Binder may take several minutes to launch.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/hydroframe/subsettools-binder/HEAD?labpath=hf_hydrodata/point/example_get_data.ipynb)

This notebook walks through example functionality for accessing and citing point observation data and site-level metadata using hf_hydrodata's `get_point_data`, `get_point_metadata`, and `get_citations` functions. Please see the full [point module](https://hf-hydrodata.readthedocs.io) documentation for information on what data is available, our data collection process, and new features we are working on! Our [Metadata Description](https://hf-hydrodata.readthedocs.io/en/latest/point_data/metadata_definitions.html) page itemizes the fields returned by `get_point_metadata`.

In [1]:
# Import packages (pandas is used later to merge site lists with metadata)
import pandas as pd
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata, get_citations

In [None]:
# You need to register on https://hydrogen.princeton.edu/pin 
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")

## Define input parameters

Note that `get_point_data` and `get_point_metadata` require the parameters `dataset`, `variable`, `temporal_resolution`, and `aggregation` (plus `depth_level` when requesting soil moisture data). Please see [the documentation](https://hf-hydrodata.readthedocs.io/en/latest/available_data.html) for information about what point observation datasets are available and the parameters used to query them.

The [hf_hydrodata API Reference](https://hf-hydrodata.readthedocs.io/en/latest/hf_hydrodata.point.html) includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if `state` and `site_ids` are both supplied, for example, then only sites within `site_ids` that are *also* in `state` will be returned.
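
To make the cumulative behavior concrete, here is a minimal pandas sketch that mimics it on a small hand-made site table. The table contents and the `filter_sites` helper are illustrative assumptions, not part of hf_hydrodata, and this is not how the package implements its filtering.

```python
import pandas as pd

# Hypothetical site table (illustrative values only, not real metadata)
sites = pd.DataFrame({
    "site_id": ["A001", "A002", "B001", "B002"],
    "state": ["NJ", "NJ", "PA", "PA"],
})

def filter_sites(sites, state=None, site_ids=None):
    """Apply every supplied filter, so the result satisfies all of them."""
    mask = pd.Series(True, index=sites.index)
    if state is not None:
        mask &= sites["state"] == state
    if site_ids is not None:
        mask &= sites["site_id"].isin(site_ids)
    return sites[mask]

# "B001" is in the requested list but not in NJ, so only "A001" survives
selected = filter_sites(sites, state="NJ", site_ids=["A001", "B001"])
print(selected["site_id"].tolist())
```

Each additional filter can only narrow the result, which is why combining `state` with `site_ids` returns the intersection rather than the union.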

## Example 1: Specify a date range and geographic bounding box

In this example, a specific start and end date are provided, along with a geographic domain. Start and end dates, if provided, must be in 'YYYY-MM-DD' format. If a start date is not provided, data is returned from the earliest date available. Likewise, if an end date is not provided, data is returned through the most recent date available.
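
If dates come from user input, it can help to check the format before sending a request. The sketch below uses only the Python standard library; the `is_valid_date` helper is hypothetical, and its strictness about zero-padding is an assumption rather than documented hf_hydrodata behavior.

```python
from datetime import datetime

def is_valid_date(text):
    """Return True if text is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts unpadded values such as "2002-1-5";
    # comparing against the ISO form rejects them
    return parsed.date().isoformat() == text

print(is_valid_date("2002-01-05"))  # True
print(is_valid_date("01/05/2002"))  # False: wrong separators and field order
print(is_valid_date("2002-02-30"))  # False: not a real calendar date
```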

Geographic domain specifications, if provided, can take the form of latitude and/or longitude bounds, a two-letter state postal code (`state='NJ'`), a specific list of site IDs (see Example 2 below), or a shapefile (see the example notebook "[How To Filter Sites by USGS HUC Boundaries](https://hf-hydrodata.readthedocs.io/en/latest/point_data/examples/example_shapefile.html)"). If no geographic restriction is included, sites from the entire continental United States are returned. In many cases, this exceeds the single-request limit of 1 GB per user. Add geographic and/or date filters as needed to keep requests within this limit.
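
One way to stay under a per-request size limit is to split a long date range into smaller windows and request them one at a time. The following is a minimal sketch of that idea using only the standard library. The `yearly_windows` helper is hypothetical, and whether one year per request is small enough depends on how many sites the other filters select.

```python
from datetime import date

def yearly_windows(date_start, date_end):
    """Split an inclusive YYYY-MM-DD range into per-calendar-year (start, end) pairs."""
    start = date.fromisoformat(date_start)
    end = date.fromisoformat(date_end)
    if start > end:
        raise ValueError("date_start must not be after date_end")
    windows = []
    for year in range(start.year, end.year + 1):
        # Clip each calendar year to the requested range
        window_start = max(start, date(year, 1, 1))
        window_end = min(end, date(year, 12, 31))
        windows.append((window_start.isoformat(), window_end.isoformat()))
    return windows

windows = yearly_windows("2002-06-15", "2004-03-01")
for window_start, window_end in windows:
    print(window_start, window_end)
```

Each pair could then be passed as `date_start` and `date_end` in a separate `get_point_data` call, with the resulting DataFrames combined afterward (for example with `pd.concat`).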

In [2]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box. 

# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))

# View first five records
data_df.head(5)

Unnamed: 0,date,01011000,01013500,01015800,01017000,01017550,01018000,01019000,01027200,01029200,...,01046500,01129200,01010000,01010070,01010500,01014000,01018500,01021000,04264331,04294300
0,2002-01-01,9.7069,13.8104,12.9048,21.3099,0.013301,,3.0847,1.98666,2.43663,...,46.129,23.9984,11.9143,1.48292,24.055,61.411,9.1126,21.9042,6084.5,0.2547
1,2002-01-02,9.5371,13.4142,12.0558,20.0364,0.012169,,3.0564,1.91874,2.39135,...,46.695,23.8286,11.6879,1.415,23.489,59.713,9.0277,21.9042,6056.2,0.2547
2,2002-01-03,9.339,13.0746,11.5181,19.0742,0.011886,,3.0281,1.88195,2.36305,...,46.978,23.8286,11.5181,1.3584,23.0645,58.581,8.9145,21.9042,6084.5,0.2547
3,2002-01-04,9.1692,12.6501,11.0936,26.4322,0.01132,,3.0564,1.83667,2.3489,...,51.506,23.6305,11.2917,1.31312,22.64,57.449,8.8579,21.9042,6056.2,0.2547
4,2002-01-05,8.9994,12.2822,10.6691,25.187,0.010754,,3.0281,1.79139,2.3206,...,37.639,23.6022,11.0936,1.27633,22.2155,56.317,8.7447,21.9042,5546.8,0.283


In [3]:
# Request site-level metadata for these sites (using the same filters)
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))

# View first five records
metadata_df.head(5)

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1011000,"Allagash River near Allagash, Maine",stream gauge,USGS,ME,47.069722,-69.079444,1910-07-01,2023-11-30,34028,...,,1010002,,,4210,2783,3186.844,Non-ref,187.0,1478.0
1,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-12-01,36507,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0
2,1015800,"Aroostook River near Masardis, Maine",stream gauge,USGS,ME,46.523056,-68.371667,1957-09-14,2023-12-01,24185,...,,1010004,,,4276,2747,2313.755,Non-ref,166.0,892.0
3,1017000,"Aroostook River at Washburn, Maine",stream gauge,USGS,ME,46.777222,-68.157222,1930-08-01,2023-12-01,34091,...,,1010004,,,4281,2773,4278.907,Non-ref,131.0,1654.0
4,1017550,"Williams Brook at Phair, Maine",stream gauge,USGS,ME,46.628056,-67.953056,1999-11-01,2023-12-01,8797,...,,1010005,,,4300,2762,10.0323,Ref,176.0,3.82


In [4]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")

'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'

## Example 2: Specify a site ID or list of site IDs without a time restriction

Instead of latitude/longitude bounds, you can request data for a specific stream gauge or groundwater well by site ID, with or without a date range. Below, daily streamflow data is returned for a single site and then for a list of sites. These examples have no time restriction, so the full record available for each site is returned.

In [5]:
# Request point observations data for a single site
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")

# View first five rows
print("First five records: ")
print(data.head(5))

# View final five rows 
print("\n Final five records: ")
print(data.tail(5))

First five records: 
         date  01013500
0  1903-07-29   21.5646
1  1903-07-30   21.5646
2  1903-07-31   21.5646
3  1903-08-01   19.2723
4  1903-08-02   18.1686

 Final five records: 
             date  01013500
36502  2023-11-27    30.281
36503  2023-11-28    31.413
36504  2023-11-29    30.564
36505  2023-11-30    30.281
36506  2023-12-01    29.715


In [6]:
# Request the metadata for that site
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")
metadata.head()

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-12-01,36507,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0


In [7]:
# Request point observations data for multiple sites
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", 
                      site_ids=["01013500", "01011000", "01029500"])

# View first five rows
print("First five records: ")
print(data.head(5))

# View final five rows 
print("\n Final five records: ")
print(data.tail(5))

First five records: 
         date  01011000  01013500  01029500
0  1902-10-01       NaN       NaN    19.810
1  1902-10-02       NaN       NaN    19.810
2  1902-10-03       NaN       NaN    19.810
3  1902-10-04       NaN       NaN    18.678
4  1902-10-05       NaN       NaN    17.546

 Final five records: 
             date  01011000  01013500  01029500
44252  2023-11-27       NaN    30.281    41.035
44253  2023-11-28       NaN    31.413       NaN
44254  2023-11-29       NaN    30.564       NaN
44255  2023-11-30       NaN    30.281       NaN
44256  2023-12-01       NaN    29.715       NaN
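
In the multi-site output above, each site is a column and a site's value is `NaN` on dates outside its own record. A long layout (one row per date and site) is sometimes easier to summarize. The sketch below shows one way to reshape such a frame with pandas; the `wide` frame and its values are hand-made for illustration and are not real observations.

```python
import numpy as np
import pandas as pd

# Hand-made frame in the same wide layout: a 'date' column plus one column per site
wide = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "site_a": [np.nan, 1.5, 1.7],
    "site_b": [2.0, 2.1, np.nan],
})
wide["date"] = pd.to_datetime(wide["date"])

# One row per (date, site) pair, dropping the NaN placeholders
long = wide.melt(id_vars="date", var_name="site_id", value_name="streamflow")
long = long.dropna(subset=["streamflow"])

# First and last date with an observation for each site
coverage = long.groupby("site_id")["date"].agg(["min", "max"])
print(coverage)
```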


In [8]:
# Request the site-level attributes for those sites
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", 
                              site_ids=["01013500", "01011000", "01029500"])
metadata.head()

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1011000,"Allagash River near Allagash, Maine",stream gauge,USGS,ME,47.069722,-69.079444,1910-07-01,2023-11-30,34028,...,,1010002,,,4210,2783,3186.844,Non-ref,187.0,1478.0
1,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-12-01,36507,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0
2,1029500,"East Branch Penobscot River at Grindstone, Maine",stream gauge,USGS,ME,45.730278,-68.589444,1902-10-01,2023-12-01,37315,...,,1020002,,,4293,2656,2816.295,Non-ref,93.0,837.0


In [9]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")

'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'

## Example 3: Add a restriction on the minimum number of observations per site within a requested time range

The parameter `min_num_obs` allows the user to further specify that a site must have a minimum number of observations within the specified time range (if one is provided).

The example below returns only sites that have valid streamflow data for every day of the requested calendar year (2005, which has 365 days).
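
To illustrate what a minimum-observation threshold means for data in this wide layout, here is a small local sketch that counts non-null values per site and keeps the sites that meet the threshold. The `wide` frame is hand-made, and this is only an illustration of the concept, not the way hf_hydrodata evaluates `min_num_obs`.

```python
import numpy as np
import pandas as pd

# Hand-made wide frame: a 'date' column plus one column per site (illustrative values)
wide = pd.DataFrame({
    "date": ["2005-01-01", "2005-01-02", "2005-01-03", "2005-01-04"],
    "site_a": [1.0, 1.1, 1.2, 1.3],
    "site_b": [2.0, np.nan, 2.2, np.nan],
})

min_num_obs = 4

# count() ignores NaN, so this is the number of observations per site
counts = wide.drop(columns="date").count()
kept_sites = counts[counts >= min_num_obs].index.tolist()
print(kept_sites)
```

Here `site_b` has only two observations in the four-day window, so it is dropped while `site_a` is kept.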

In [10]:
# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2005-01-01", date_end="2005-12-31", 
                         state="CO",
                         min_num_obs=365)

# View first five records
data_df.head(5)

Unnamed: 0,date,06614800,06620000,06701500,06701900,06707500,06708800,06709000,06709530,06710150,...,382628104493700,382629104493000,383619104520401,383637104531301,383944104474201,384037104472001,384047104510301,384048104504901,384220104503701,391504106225200
0,2005-01-01,0.013584,3.5375,1.9244,2.19325,5.2921,0.163574,0.52072,0.55751,0.0566,...,0.0,0.0,0.002547,0.0,0.0,0.021508,0.0,0.00849,0.0,0.004245
1,2005-01-02,0.013301,3.396,1.9244,2.14514,5.2072,0.144896,0.4811,0.5377,0.052355,...,0.0,0.0,0.002547,0.0,0.0,0.024621,0.0,0.008207,0.0,0.004245
2,2005-01-03,0.013301,3.3111,1.9244,2.1508,5.1506,0.128765,0.49525,0.50374,0.058015,...,0.0,0.0,0.002547,0.0,0.0,0.023772,0.0,0.007924,0.0,0.004245
3,2005-01-04,0.013301,3.396,1.9244,2.1508,5.0091,0.119992,0.4811,0.4811,0.051506,...,0.0,0.0,0.002547,0.0,0.0,0.02547,0.0,0.007924,0.0,0.004245
4,2005-01-05,0.013301,3.396,1.9244,2.23853,4.1035,0.139236,0.41601,0.50374,0.046412,...,0.0,0.0,0.002547,0.0,0.0,0.022923,0.0,0.007924,0.0,0.004245


In [11]:
# NOTE: Metadata access does not support the `min_num_obs` filter because it does not inspect the data contents for the sliced date range.
# Metadata access only filters on overall data availability to be within the specified range.

# The following is an example workflow for obtaining metadata for only those sites that 
# additionally satisfy the `min_num_obs` filter
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2005-01-01", date_end="2005-12-31", 
                                 state="CO")

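# Every column of data_df other than 'date' is a site ID that passed the `min_num_obs` filter
site_ids = [col for col in data_df.columns if col != 'date']
filtered_site_list = pd.DataFrame(data=site_ids, columns=['site_id'])

# Left merge keeps exactly one metadata row per site in the filtered list
filtered_metadata_df = pd.merge(filtered_site_list, metadata_df, on='site_id', how='left')
assert len(filtered_metadata_df) == len(site_ids)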

# View first five records
filtered_metadata_df.head()

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,6614800,"MICHIGAN RIVER NEAR CAMERON PASS, CO",stream gauge,USGS,CO,40.496094,-105.865012,1973-10-01,2023-12-01,18322,...,,10180001,1054.0,818.0,1481.0,1764.0,4.0284,Ref,3188.0,1.54
1,6620000,"NORTH PLATTE RIVER NEAR NORTHGATE, CO",stream gauge,USGS,CO,40.936639,-106.339194,1904-06-01,2023-12-01,39782,...,,10180001,1020.0,870.0,1448.0,1817.0,3702.637,Non-ref,2388.0,1431.0
2,6701500,"SOUTH PLATTE RIVER BELOW CHEESMAN LAKE, CO",stream gauge,USGS,CO,39.209157,-105.267773,1924-10-01,2007-09-29,29217,...,,10190002,1091.0,671.0,,,4557.068,Non-ref,2081.0,1752.0
3,6701900,SOUTH PLATTE RIVER BLW BRUSH CRK NEAR TRUMBULL...,stream gauge,USGS,CO,39.25999,-105.221938,2002-07-19,2023-12-01,7792,...,,10190002,,,1523.0,1627.0,5252.557,Non-ref,1990.0,2028.0
4,6707500,"SOUTH PLATTE RIVER AT SOUTH PLATTE, CO",stream gauge,USGS,CO,39.409156,-105.16999,1896-01-01,2007-09-29,32959,...,,10190002,,,,,6689.03,Non-ref,1901.0,2579.0


In [12]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")

'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'