# Accessing and citing point observations through `hf_hydrodata.point`

This notebook provides a walk-through of some example functionality for accessing and citing point observations data via the `hf_hydrodata.point` module. Please see the full [point module](https://maurice.princeton.edu/hydroframe/docs/point_data/index.html) documentation for information on what data is available, our data collection process, and new features we are working on!

In [1]:
# Import packages
import pandas as pd
from hf_hydrodata.gridded import register_api_pin
from hf_hydrodata.point import get_data, get_metadata, get_citations

In [None]:
# You need to register on https://hydrogen.princeton.edu/pin 
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")

## Define input parameters

Note that the only mandatory parameters are `data_source`, `variable`, `temporal_resolution`, and `aggregation`
(and `depth_level` if asking for soil moisture data). Please see the documentation for information about the optional filtering parameters that are available. Those parameters work cumulatively, so if `state` and `site_ids` are both supplied then only sites within `site_ids` that are also in `state` will be returned.
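
As a rough illustration of what "cumulatively" means here, the sketch below applies two filters to a small, made-up site table with pandas. The table contents and the helper `filter_sites` are hypothetical and are not part of `hf_hydrodata`; the only point is that each additional filter narrows the result (a logical AND).

```python
import pandas as pd

# Hypothetical site table; IDs and states are illustrative only.
sites = pd.DataFrame({
    "site_id": ["A1", "A2", "B1", "B2"],
    "state": ["NJ", "NJ", "ME", "ME"],
})

def filter_sites(df, state=None, site_ids=None):
    """Apply each supplied filter in turn, so filters combine with AND."""
    mask = pd.Series(True, index=df.index)
    if state is not None:
        mask &= df["state"] == state
    if site_ids is not None:
        mask &= df["site_id"].isin(site_ids)
    return df[mask]

# 'B1' is in the requested list but not in NJ, so only 'A1' survives both filters.
both = filter_sites(sites, state="NJ", site_ids=["A1", "B1"])
print(both["site_id"].tolist())
```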

In [2]:
# Let's explore daily streamflow data. We'll use this throughout several examples below.
data_source = 'usgs_nwis'
variable = 'streamflow'
temporal_resolution = 'daily'
aggregation = 'average'

## Example 1: Specify a date range and geographic bounding box

In this example, a specific start and end date are provided, along with a geographic domain. Start and end dates, if provided, must be in 'YYYY-MM-DD' format. If no start date is provided, data is returned from the earliest date available. Likewise, if no end date is provided, data is returned through the most recent date available.
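
If the dates come from user input, it can help to check the format before making a request. The sketch below is one possible check using only the Python standard library; `is_valid_date` is a hypothetical helper, not part of `hf_hydrodata`, and this notebook does not state whether the library performs its own validation.

```python
import re
from datetime import datetime

def is_valid_date(text):
    """Return True if text is a real calendar date written as YYYY-MM-DD."""
    # Require zero-padded fields first, since strptime alone also accepts '2002-1-5'.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # Well-formed string, but not a real date (for example February 30).
        return False
    return True

for candidate in ["2002-01-01", "01/05/2002", "2002-02-30"]:
    print(candidate, is_valid_date(candidate))
```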

Geographic domain specifications, if provided, can be in the form of latitude and/or longitude bounds, a two-letter state postal code (`state='NJ'`), or a specific list of site IDs (see Example 2 below). If no geographic restriction is included, sites from the entire continental United States will be returned (note that this might take some time).
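
To make the bounding-box idea concrete, the sketch below tests whether a single coordinate pair falls inside a latitude/longitude box. The helper `in_bounding_box` is hypothetical, and treating each range as an inclusive `(min, max)` pair is an assumption made for illustration; see the module documentation for the exact behavior of `latitude_range` and `longitude_range`.

```python
def in_bounding_box(lat, lon, latitude_range, longitude_range):
    """Return True if (lat, lon) lies inside the box; bounds assumed inclusive."""
    lat_min, lat_max = latitude_range
    lon_min, lon_max = longitude_range
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

# Coordinates of site 01011000 (Allagash River, Maine) from the Example 1 metadata output.
inside = in_bounding_box(47.069722, -69.079444, (45, 50), (-75, -50))

# Coordinates of site 01367800 (Papakating Creek, NJ) from the Example 3 metadata output.
outside = in_bounding_box(41.162778, -74.675278, (45, 50), (-75, -50))

print(inside, outside)
```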

Note that the more the data is filtered, the more quickly the final data will be available. While there are no explicit bounds on how many sites can be requested, getting data for a small geographic region over a short period will naturally take less time than getting all of the sites in the US for an entire water year.

In [3]:
date_start = '2002-01-01'
date_end = '2002-01-05'
latitude_range = (45, 50)
longitude_range = (-75, -50)

In [4]:
# Get data
data_df = get_data(data_source, variable, temporal_resolution, aggregation, 
                  date_start=date_start,
                  date_end=date_end,
                  latitude_range=latitude_range,
                  longitude_range=longitude_range)
data_df.head(5)

Unnamed: 0,date,01011000,01013500,01015800,01017000,01017550,01018000,01019000,01027200,01029200,...,01046500,01129200,01010000,01010070,01010500,01014000,01018500,01021000,04264331,04294300
0,2002-01-01,9.7069,13.8104,12.9048,21.3099,0.013301,,3.0847,1.98666,2.43663,...,46.129,23.9984,11.9143,1.48292,24.055,61.411,9.1126,21.9042,6084.5,0.2547
1,2002-01-02,9.5371,13.4142,12.0558,20.0364,0.012169,,3.0564,1.91874,2.39135,...,46.695,23.8286,11.6879,1.415,23.489,59.713,9.0277,21.9042,6056.2,0.2547
2,2002-01-03,9.339,13.0746,11.5181,19.0742,0.011886,,3.0281,1.88195,2.36305,...,46.978,23.8286,11.5181,1.3584,23.0645,58.581,8.9145,21.9042,6084.5,0.2547
3,2002-01-04,9.1692,12.6501,11.0936,26.4322,0.01132,,3.0564,1.83667,2.3489,...,51.506,23.6305,11.2917,1.31312,22.64,57.449,8.8579,21.9042,6056.2,0.2547
4,2002-01-05,8.9994,12.2822,10.6691,25.187,0.010754,,3.0281,1.79139,2.3206,...,37.639,23.6022,11.0936,1.27633,22.2155,56.317,8.7447,21.9042,5546.8,0.283


In [5]:
# Get metadata for these sites
metadata_df = get_metadata(data_source, variable, temporal_resolution, aggregation, 
                          date_start=date_start,
                          date_end=date_end,
                          latitude_range=latitude_range,
                          longitude_range=longitude_range)
metadata_df.head(5)

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1011000,"Allagash River near Allagash, Maine",stream gauge,USGS,ME,47.069722,-69.079444,1910-07-01,2023-11-03,34001,...,,1010002,,,4210,2783,3186.844,Non-ref,187.0,1478.0
1,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-11-03,36479,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0
2,1015800,"Aroostook River near Masardis, Maine",stream gauge,USGS,ME,46.523056,-68.371667,1957-09-14,2023-11-03,24157,...,,1010004,,,4276,2747,2313.755,Non-ref,166.0,892.0
3,1017000,"Aroostook River at Washburn, Maine",stream gauge,USGS,ME,46.777222,-68.157222,1930-08-01,2023-11-03,34063,...,,1010004,,,4281,2773,4278.907,Non-ref,131.0,1654.0
4,1017550,"Williams Brook at Phair, Maine",stream gauge,USGS,ME,46.628056,-67.953056,1999-11-01,2023-11-03,8769,...,,1010005,,,4300,2762,10.0323,Ref,176.0,3.82


In [6]:
# See how to cite the use of this data
get_citations(data_source, variable, temporal_resolution, aggregation)

Most U.S. Geological Survey (USGS) information resides in Public Domain 
              and may be used without restriction, though they do ask that proper credit be given.
              An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey"
              Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs


## Example 2: Specify a site ID or list of site IDs without a time restriction

Instead of latitude/longitude bounds, data for a specific stream gauge or groundwater well can be returned with or without a date bound. Below, daily streamflow data is returned for a single site and then a select list of sites. There is no time restriction in these examples, so all data available in-house is included.

In [7]:
# Data and metadata for a single site
get_data(data_source, variable, temporal_resolution, aggregation, site_ids=['01013500'])

Unnamed: 0,date,01013500
0,1903-07-29,21.5646
1,1903-07-30,21.5646
2,1903-07-31,21.5646
3,1903-08-01,19.2723
4,1903-08-02,18.1686
...,...,...
36474,2023-10-30,47.5440
36475,2023-10-31,46.9780
36476,2023-11-01,45.8460
36477,2023-11-02,44.4310


In [8]:
get_metadata(data_source, variable, temporal_resolution, aggregation, site_ids=['01013500'])

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-11-03,36479,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0


In [9]:
# Data and metadata for multiple sites
get_data(data_source, variable, temporal_resolution, aggregation, site_ids=['01013500', '01011000', '01029500'])

Unnamed: 0,date,01011000,01013500,01029500
0,1902-10-01,,,19.810
1,1902-10-02,,,19.810
2,1902-10-03,,,19.810
3,1902-10-04,,,18.678
4,1902-10-05,,,17.546
...,...,...,...,...
44224,2023-10-30,48.676,47.544,75.561
44225,2023-10-31,46.695,46.978,71.882
44226,2023-11-01,44.148,45.846,44.148
44227,2023-11-02,41.318,44.431,36.790


In [10]:
get_metadata(data_source, variable, temporal_resolution, aggregation, site_ids=['01013500', '01011000', '01029500'])

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,1011000,"Allagash River near Allagash, Maine",stream gauge,USGS,ME,47.069722,-69.079444,1910-07-01,2023-11-03,34001,...,,1010002,,,4210,2783,3186.844,Non-ref,187.0,1478.0
1,1013500,"Fish River near Fort Kent, Maine",stream gauge,USGS,ME,47.2375,-68.582778,1903-07-29,2023-11-03,36479,...,,1010003,,,4237,2810,2252.696,Ref,157.0,873.0
2,1029500,"East Branch Penobscot River at Grindstone, Maine",stream gauge,USGS,ME,45.730278,-68.589444,1902-10-01,2023-11-03,37287,...,,1020002,,,4293,2656,2816.295,Non-ref,93.0,837.0


In [11]:
# See how to cite the use of this data
get_citations(data_source, variable, temporal_resolution, aggregation)

Most U.S. Geological Survey (USGS) information resides in Public Domain 
              and may be used without restriction, though they do ask that proper credit be given.
              An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey"
              Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs


## Example 3: Add a restriction on the minimum number of observations per site within a requested time range

The parameter `min_num_obs` allows the user to further specify that a site must have a minimum number of observations within the specified time range (if one is provided).

The example below ensures that only sites that have valid streamflow data for every day of the calendar year requested get returned.
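
For intuition, a similar check can be approximated locally on any wide table that has one column per site. The sketch below uses a made-up table and assumes missing observations appear as `NaN`; it is not a description of how `hf_hydrodata` implements `min_num_obs`.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per day, one column per site.
toy_df = pd.DataFrame({
    "date": ["2005-01-01", "2005-01-02", "2005-01-03"],
    "site_a": [1.0, 1.1, 1.2],
    "site_b": [2.0, np.nan, 2.2],
})

min_num_obs = 3

# DataFrame.count() returns the number of non-NaN values in each column.
obs_counts = toy_df.drop(columns="date").count()

# Keep only the sites that meet the threshold.
kept_sites = obs_counts[obs_counts >= min_num_obs].index.tolist()
print(kept_sites)
```

Because 2005 is not a leap year, a threshold of 365 corresponds to one observation for every day of that year; a leap year would need 366.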

In [12]:
data_df = get_data(data_source, variable, temporal_resolution, aggregation, 
                   date_start='2005-01-01', date_end='2005-12-31',
                   state='NJ', 
                   min_num_obs=365
                   )
data_df.head(5)

Unnamed: 0,date,01367800,01377000,01377370,01377500,01378500,01379000,01379500,01379530,01379773,...,01467150,01477120,01482500,01380450,01387000,01408900,0140940810,01410225,01460440,01475001
0,2005-01-01,0.849,2.547,0.38205,0.68203,0.63675,2.13665,3.2828,0.166121,0.5377,...,0.47544,0.88013,0.39337,6.7637,1.83667,2.28664,0.281302,0.193289,3.6507,0.3113
1,2005-01-02,0.849,2.5187,0.36224,0.66505,0.63675,2.07722,3.396,0.142066,0.566,...,0.4528,0.84334,0.35941,6.3958,1.01597,2.26117,0.259228,0.179705,3.679,0.3396
2,2005-01-03,0.849,2.58662,0.53204,0.80372,0.64241,1.99798,3.3111,0.138104,0.566,...,0.44997,0.83485,0.40469,6.3392,0.90843,2.26117,0.27168,0.164706,3.6507,0.3396
3,2005-01-04,1.981,3.3677,1.51122,1.42066,0.56317,2.65454,4.1318,0.36507,0.7924,...,0.51506,0.86315,0.40469,12.2256,3.1696,2.26117,0.277623,0.113766,3.7356,0.3396
4,2005-01-05,1.415,3.0281,0.69901,0.87447,6.6222,2.8866,4.7544,0.165555,0.7924,...,1.57914,1.33576,0.63392,11.603,4.3582,2.49323,0.33677,0.106974,3.679,0.3396


In [13]:
# Metadata access does not support the `min_num_obs` filter.
# The following is an example workflow for obtaining metadata for only those sites that 
# additionally satisfy the `min_num_obs` filter
metadata_df = get_metadata(data_source, variable, temporal_resolution, aggregation, 
                   date_start='2005-01-01', date_end='2005-12-31',
                   state='NJ')

c = list(data_df.columns)
c.remove('date')
filtered_site_list = pd.DataFrame(data=c, columns=['site_id'])
filtered_metadata_df = pd.merge(filtered_site_list, metadata_df, on='site_id', how='left')
assert len(filtered_metadata_df) == data_df.shape[1]-1

filtered_metadata_df

Unnamed: 0,site_id,site_name,site_type,agency,state,latitude,longitude,first_date_data_available,last_date_data_available,record_count,...,doi,huc8,conus1_x,conus1_y,conus2_x,conus2_y,gagesii_drainage_area,gagesii_class,gagesii_site_elevation,usgs_drainage_area
0,01367800,Papakating Creek at Pellettown NJ,stream gauge,USGS,NJ,41.162778,-74.675278,2003-09-16,2014-09-29,4028,...,,02020007,,,,,42.8715,Non-ref,132.0,15.80
1,01377000,Hackensack River at Rivervale NJ,stream gauge,USGS,NJ,40.999167,-73.989167,1941-10-01,2023-11-04,29979,...,,02030103,,,4051,2043,146.5065,Non-ref,8.0,58.00
2,01377370,Pascack Brook at Park Ridge NJ,stream gauge,USGS,NJ,41.036667,-74.039167,2004-04-01,2023-11-04,6928,...,,02030103,,,4047,2046,35.2422,Non-ref,36.0,13.40
3,01377500,Pascack Brook at Westwood NJ,stream gauge,USGS,NJ,40.992778,-74.021111,1934-10-01,2023-11-04,32542,...,,02030103,,,4051,2042,73.6038,Non-ref,8.0,29.60
4,01378500,Hackensack River at New Milford NJ,stream gauge,USGS,NJ,40.948333,-74.026667,1921-10-01,2023-11-04,37290,...,,02030103,,,4052,2034,300.3462,Non-ref,1.0,113.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,01408900,Cedar Creek at Western Blvd near Lanoka Harbor NJ,stream gauge,USGS,NJ,39.879167,-74.190556,1932-07-07,2023-11-04,17302,...,,02040301,,,4075,1923,,,,49.90
97,0140940810,Pump Branch near Elm NJ,stream gauge,USGS,NJ,39.695833,-74.825000,2004-10-01,2006-10-02,732,...,,02040301,,,,,,,,10.80
98,01410225,Morses Mill Stream at Port Republic NJ,stream gauge,USGS,NJ,39.506389,-74.505556,2004-10-01,2007-09-29,1094,...,,02040301,,,,,,,,8.25
99,01460440,Delaware and Raritan Canal at Port Mercer NJ,stream gauge,USGS,NJ,40.304444,-74.685000,1989-10-23,2023-11-04,12149,...,,02040105,,,,,,,,
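
The same filtering can also be expressed with a boolean mask instead of a merge. The sketch below uses small made-up stand-ins for `data_df` and `metadata_df` so that it runs on its own; the `toy_` names and values are hypothetical. It assumes site IDs are stored as strings in both tables, so that IDs with leading zeros match exactly.

```python
import pandas as pd

# Hypothetical stand-ins for data_df and metadata_df; values are illustrative.
toy_data_df = pd.DataFrame({
    "date": ["2005-01-01", "2005-01-02"],
    "0001": [0.8, 0.9],
    "0003": [2.5, 2.6],
})
toy_metadata_df = pd.DataFrame({
    "site_id": ["0001", "0002", "0003"],
    "state": ["NJ", "NJ", "NJ"],
})

# Every column except 'date' is a site ID.
site_ids = [col for col in toy_data_df.columns if col != "date"]

# Keep only the metadata rows whose site_id appears among the data columns.
toy_filtered = toy_metadata_df[toy_metadata_df["site_id"].isin(site_ids)]
toy_filtered = toy_filtered.reset_index(drop=True)
print(toy_filtered["site_id"].tolist())
```

Unlike the left merge, this keeps the row order of the metadata table rather than the column order of the data table, and it silently drops any site that has no metadata row instead of leaving a row of missing values.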


In [14]:
# See how to cite the use of this data
get_citations(data_source, variable, temporal_resolution, aggregation)

Most U.S. Geological Survey (USGS) information resides in Public Domain 
              and may be used without restriction, though they do ask that proper credit be given.
              An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey"
              Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs
