# Access NWIS with the USGS dataretrieval package

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mrahnis/nb-streamgage/blob/main/Streamgage-01--Access-NWIS-with-dataretrieval.ipynb)

## The USGS dataretrieval package

The dataretrieval package retrieves data through the USGS National Water Information System (NWIS) API. It can return longer time series than the NWIS web page offers. The dataretrieval git repository is here: https://github.com/USGS-python/dataretrieval


In [1]:
# if using the regular Colab runtime install dataretrieval
!pip install dataretrieval --quiet --exists-action i

## Preliminaries

In [2]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import dataretrieval.nwis as nwis

In [3]:
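# USGS site numbers mapped to short labels; the numbers are strings so leading zeros are kept
gages = {'01576516':'east branch',
         '015765185':'west branch',
         '015765195':'mainstem',
         '01576521':'mainstem-historical'}

# site used in the rest of this notebook
gage = '015765195'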
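
The site numbers above are quoted strings because their leading zeros are significant; an integer such as `15765195` would not identify the same site. A minimal sketch of a sanity check follows. It assumes USGS site numbers are 8 to 15 digits long, and `is_valid_site_no` is a hypothetical helper, not part of dataretrieval.

```python
import re

# Assumption: a USGS site number is a string of 8 to 15 digits.
SITE_NO_PATTERN = re.compile(r"[0-9]{8,15}")


def is_valid_site_no(site_no):
    """Return True if site_no is a string made of 8 to 15 digits."""
    return isinstance(site_no, str) and SITE_NO_PATTERN.fullmatch(site_no) is not None


gages = {'01576516': 'east branch',
         '015765185': 'west branch',
         '015765195': 'mainstem',
         '01576521': 'mainstem-historical'}

# Collect any keys that do not look like site numbers.
invalid = [site_no for site_no in gages if not is_valid_site_no(site_no)]
print(invalid)  # []
```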

## Get Site Info

In [4]:
site_info, md = nwis.get_info(sites=gage)
site_info

|   | agency_cd | site_no | station_nm | site_tp_cd | lat_va | long_va | dec_lat_va | dec_long_va | coord_meth_cd | coord_acy_cd | ... | local_time_fg | reliability_cd | gw_file_cd | nat_aqfr_cd | aqfr_cd | aqfr_type_cd | well_depth_va | hole_depth_va | depth_src_cd | project_no |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 15765195 | Big Spring Run near Mylin Corners, PA | ST | 395945.37 | 761550.54 | 39.995936 | -76.264039 | N | S | ... | Y |  |  |  |  |  |  |  |  | 2476DFS |


In [5]:
site_stats, md = nwis.get_stats(sites=gage)
site_stats

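|   | agency_cd | site_no | parameter_cd | ts_id | loc_web_ds | month_nu | day_nu | begin_yr | end_yr | count_nu | ... | mean_va | p05_va | p10_va | p20_va | p25_va | p50_va | p75_va | p80_va | p90_va | p95_va |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 015765195 | 10 | 170026 |  | 1 | 1 | 2013 | 2022 | 10 | ... | 7.4 |  | 3.2 | 5.8 | 6.2 | 7.4 | 8.8 | 9.3 | 11.0 |  |
| 1 | USGS | 015765195 | 10 | 170026 |  | 1 | 2 | 2013 | 2022 | 10 | ... | 7.1 |  | 3.4 | 5.9 | 6.0 | 6.9 | 8.3 | 9.0 | 10.7 |  |
| 2 | USGS | 015765195 | 10 | 170026 |  | 1 | 3 | 2013 | 2022 | 10 | ... | 6.7 |  | 3.1 | 4.1 | 5.1 | 7.2 | 8.2 | 8.9 | 9.3 |  |
| 3 | USGS | 015765195 | 10 | 170026 |  | 1 | 4 | 2013 | 2022 | 10 | ... | 6.7 |  | 3.4 | 4.3 | 5.5 | 6.8 | 8.3 | 8.6 | 9.8 |  |
| 4 | USGS | 015765195 | 10 | 170026 |  | 1 | 5 | 2013 | 2022 | 10 | ... | 6.2 |  | 1.9 | 4.6 | 4.6 | 6.1 | 8.0 | 8.5 | 9.3 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1459 | USGS | 015765195 | 63680 | 214327 |  | 12 | 27 | 2017 | 2022 | 6 | ... | 6.4 |  |  | 1.1 | 1.4 | 4.2 | 10.0 | 15.0 |  |  |
| 1460 | USGS | 015765195 | 63680 | 214327 |  | 12 | 28 | 2017 | 2022 | 6 | ... | 12.0 |  |  | 1.3 | 1.4 | 4.2 | 28.0 | 33.0 |  |  |
| 1461 | USGS | 015765195 | 63680 | 214327 |  | 12 | 29 | 2017 | 2022 | 6 | ... | 4.3 |  |  | 1.5 | 2.0 | 4.2 | 6.4 | 7.5 |  |  |
| 1462 | USGS | 015765195 | 63680 | 214327 |  | 12 | 30 | 2017 | 2022 | 5 | ... | 5.6 |  |  | 1.5 | 1.7 | 4.5 | 10.0 | 12.0 |  |  |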


## Reading our data

Next we request a record for the selected gage with `get_record`. The `service='iv'` argument selects the instantaneous values service, and `start` and `end` are quoted date strings that bound the request.

At the end of the notebook we save the result as parquet because it has some advantages over a CSV file:

- the file size is smaller
- it is a binary format that reads quickly, whereas CSV is text that needs to be parsed
- parquet preserves the index, including datetime indexes

In [6]:
start = '2017-12-31'
end = '2018-01-01'
df = nwis.get_record(sites=gage, service='iv', start=start, end=end)
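
The request above covers only two days. For a much longer period, one option is to split the date range into smaller windows and request each one in turn. The sketch below shows only the date arithmetic, using the standard library. `date_chunks` and its `max_days` parameter are hypothetical names, and whether a given NWIS service actually needs chunked requests is an assumption, not something this notebook establishes.

```python
from datetime import date, timedelta


def date_chunks(start, end, max_days=365):
    """Yield (chunk_start, chunk_end) ISO date strings covering start..end inclusive."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        # Each chunk spans at most max_days calendar days, both ends included.
        chunk_end = min(current + timedelta(days=max_days - 1), last)
        yield current.isoformat(), chunk_end.isoformat()
        current = chunk_end + timedelta(days=1)


chunks = list(date_chunks('2017-01-01', '2018-12-31', max_days=365))
print(chunks)
# [('2017-01-01', '2017-12-31'), ('2018-01-01', '2018-12-31')]
```

Each pair could then be passed as `start` and `end` to `nwis.get_record`, with the resulting DataFrames combined using `pd.concat`.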

In [7]:
df.head()

| datetime | 00010 | 00010_cd | site_no | 00060 | 00060_cd | 00065 | 00065_cd | 00095 | 00095_cd | 63680 | 63680_cd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-31 00:00:00-05:00 | 3.1 | A | 15765195 | 1.05 | A | 3.09 | A | 853.0 | A | 12.5 | A |
| 2017-12-31 00:15:00-05:00 | 3.1 | A | 15765195 | 1.05 | A | 3.09 | A | 848.0 | A | 12.6 | A |
| 2017-12-31 00:30:00-05:00 | 3.0 | A | 15765195 | 1.05 | A | 3.09 | A | 848.0 | A | 12.5 | A |
| 2017-12-31 00:45:00-05:00 | 3.0 | A | 15765195 | 1.05 | A | 3.09 | A | 851.0 | A | 12.6 | A |
| 2017-12-31 01:00:00-05:00 | 3.0 | A | 15765195 | 1.05 | A | 3.09 | A | 854.0 | A | 10.9 | A |


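The data columns of `df` are named by NWIS parameter code, and each has a companion `_cd` column holding its qualification code. The NWIS codes included here stand for:
- 00010 : Temperature in degrees Celsius
- 00060 : Discharge
- 00065 : Gage height
- 00095 : Specific conductance
- 63680 : Turbidity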
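
Numeric codes are hard to read in plots and tables, so it can help to build a rename mapping from them. The following is a minimal sketch in plain Python. `PARAMETER_NAMES` and `readable_columns` are hypothetical names, and the readable labels are illustrative choices rather than anything defined by NWIS or dataretrieval.

```python
# Illustrative labels for some of the parameter codes; any other code is left as-is.
PARAMETER_NAMES = {
    '00010': 'temperature_degC',
    '00060': 'discharge',
    '63680': 'turbidity',
}


def readable_columns(columns):
    """Map columns such as '00060' and '00060_cd' to readable names."""
    mapping = {}
    for col in columns:
        # Split '00060_cd' into the code and its '_cd' suffix, if present.
        code, sep, suffix = col.partition('_')
        if code in PARAMETER_NAMES:
            mapping[col] = PARAMETER_NAMES[code] + sep + suffix
    return mapping


columns = ['00010', '00010_cd', 'site_no', '00060', '00060_cd',
           '00065', '00065_cd', '00095', '00095_cd', '63680', '63680_cd']
rename_map = readable_columns(columns)
print(rename_map)
```

The mapping can be applied with `df.rename(columns=readable_columns(df.columns))`, which leaves unlisted columns such as `site_no` unchanged.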

We can call `describe` on the DataFrame to obtain summary statistics for each parameter.

In [8]:
df.describe()

|   | 00010 | 00060 | 00065 | 00095 | 63680 |
|---|---|---|---|---|---|
| count | 191.0 | 192.0 | 192.0 | 190.0 | 174.0 |
| mean | 2.942932 | 1.007344 | 3.083698 | 873.805263 | 15.704023 |
| std | 0.955835 | 0.040153 | 0.006171 | 40.989339 | 8.862204 |
| min | 1.5 | 0.93 | 3.07 | 831.0 | 3.1 |
| 25% | 2.1 | 0.98 | 3.08 | 848.0 | 9.125 |
| 50% | 2.8 | 0.98 | 3.08 | 859.0 | 14.65 |
| 75% | 3.6 | 1.05 | 3.09 | 882.5 | 20.475 |
| max | 5.3 | 1.05 | 3.09 | 981.0 | 49.5 |
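
The `count` row reports non-null values, so comparing it with the number of samples expected in the request window shows how many readings are missing for each parameter. The sketch below assumes a 15-minute sampling interval and an inclusive end date; the counts are copied from the `describe` output above.

```python
from datetime import datetime, timedelta

start = '2017-12-31'
end = '2018-01-01'

# Assumption: readings arrive every 15 minutes.
interval = timedelta(minutes=15)

# Assumption: the end date is inclusive, so the window runs to the midnight after `end`.
window_start = datetime.fromisoformat(start)
window_end = datetime.fromisoformat(end) + timedelta(days=1)
expected = (window_end - window_start) // interval

# Non-null counts copied from the describe() output.
counts = {'00010': 191, '00060': 192, '00065': 192, '00095': 190, '63680': 174}
missing = {code: expected - n for code, n in counts.items()}

print(expected)  # 192
print(missing)   # {'00010': 1, '00060': 0, '00065': 0, '00095': 2, '63680': 18}
```

With a DataFrame in hand, `df.isna().sum()` gives the same per-column gap counts directly.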


In [9]:
df.index

DatetimeIndex(['2017-12-31 00:00:00-05:00', '2017-12-31 00:15:00-05:00',
               '2017-12-31 00:30:00-05:00', '2017-12-31 00:45:00-05:00',
               '2017-12-31 01:00:00-05:00', '2017-12-31 01:15:00-05:00',
               '2017-12-31 01:30:00-05:00', '2017-12-31 01:45:00-05:00',
               '2017-12-31 02:00:00-05:00', '2017-12-31 02:15:00-05:00',
               ...
               '2018-01-01 21:30:00-05:00', '2018-01-01 21:45:00-05:00',
               '2018-01-01 22:00:00-05:00', '2018-01-01 22:15:00-05:00',
               '2018-01-01 22:30:00-05:00', '2018-01-01 22:45:00-05:00',
               '2018-01-01 23:00:00-05:00', '2018-01-01 23:15:00-05:00',
               '2018-01-01 23:30:00-05:00', '2018-01-01 23:45:00-05:00'],
              dtype='datetime64[ns, pytz.FixedOffset(-300)]', name='datetime', length=192, freq=None)

## Save As Parquet

Saving a DataFrame in Parquet format has some advantages over saving to CSV. Parquet files tend to be smaller on disk and faster to read. Parquet will maintain your data types so you do not need to specify dtypes or parse datetime strings on re-reading the file.

In [10]:
df.to_parquet('nwis_{}_{}_{}.parquet'.format(gage, start, end), index=True)
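
To illustrate the point about data types, the sketch below writes one record to CSV text and reads it back using only the standard library. The record is modeled on the first row of `df.head()` and is for illustration only. Every value comes back as a string and has to be parsed again by hand.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# One typed record: a timezone-aware timestamp, a float, and a site number string.
tz = timezone(timedelta(hours=-5))
row = {
    'datetime': datetime(2017, 12, 31, 0, 0, tzinfo=tz),
    '00060': 1.05,
    'site_no': '015765195',
}

# Write the record to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: CSV carries no type information, so every field is now text.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
print({key: type(value).__name__ for key, value in restored.items()})

# Recovering the original types requires explicit parsing.
restored_time = datetime.fromisoformat(restored['datetime'])
restored_flow = float(restored['00060'])
```

Reading the Parquet file written above with `pd.read_parquet` should avoid this manual step, provided a Parquet engine such as pyarrow is installed.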