### Retrieve State Usage Data

Downloads State Water Use data for 2000/2005/2010 and creates a formatted Physical Water Use Supply table. 

Original data are from the USGS state water programs. Sample URL (Louisiana):
`https://waterdata.usgs.gov/la/nwis/water_use?format=rdb&rdb_compression=value&wu_area=County&wu_year=2000%2C2005%2C2010&wu_county=ALL&wu_category=ALL&wu_county_nms=--ALL%2BCounties--&wu_category_nms=--ALL%2BCategories--`
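
* To see which query parameters the sample URL carries, it can be split with the standard library. This is a minimal sketch assuming Python 3 (`urllib.parse`); the variable names `sample_url`, `params` and `state_code` are illustrative and not part of the notebook.

```python
from urllib.parse import urlsplit, parse_qs

# The sample URL from the text above, split across lines for readability
sample_url = ('https://waterdata.usgs.gov/la/nwis/water_use?format=rdb'
              '&rdb_compression=value&wu_area=County&wu_year=2000%2C2005%2C2010'
              '&wu_county=ALL&wu_category=ALL'
              '&wu_county_nms=--ALL%2BCounties--'
              '&wu_category_nms=--ALL%2BCategories--')

parts = urlsplit(sample_url)
params = parse_qs(parts.query)          # percent-escapes such as %2C are decoded
state_code = parts.path.split('/')[1]   # first path segment holds the state code

print(state_code)
for name, values in params.items():
    print(name, '=', values[0])
```

The state code sits in the path, while the years, area type and category filters are ordinary query parameters.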

##### Workflow:
* Construct the URL and retrieve the data into a pandas data frame
* Melt/gather the usage columns into row values under the column name 'Group'
* Remove rows with no usage data (identified by not having "Mgal" in the 'Group' name)
---

In [1]:
#Import modules
import sys, os, urllib
from shutil import copyfile
import pandas as pd
import numpy as np
from openpyxl import load_workbook

In [2]:
#Specify the state and year to process
state = 'tx' #Texas
year = 2005

In [3]:
#Get the input filenames
templateFilename = '../Data/Templates/StatePSUT_Template.xlsx'
remapFilename = '../Data/RemapTables/StatePSUTLookup.csv'
outFilename = '../Data/StateData/{0}_{1}.xlsx'.format(state,year)

In [4]:
#Copy the template to the output filename
copyfile(src=templateFilename,dst=outFilename)

In [5]:
#Get the remap table and load as a dataframe
dfRemap = pd.read_csv(remapFilename,dtype='str',index_col="Index")

* Output data will be named with the state code and year (e.g. `la_2010.csv`) and will be saved in the StateData subfolder of the Data directory. The folder is created if it does not already exist.

In [6]:
#Set the output file location to the Data/StateData folder (relative to this file's location)
outFolder = '../Data/StateData/'
if not os.path.exists(outFolder):
    os.mkdir(outFolder)
#Join with os.path.join to avoid a doubled path separator
outFN = os.path.join(outFolder, '{0}_{1}.csv'.format(state,year))
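
* The folder check and filename construction can also be wrapped in a small helper. This is a sketch assuming Python 3, where `os.makedirs` accepts `exist_ok`; the function name `build_output_path` is hypothetical, and a temporary directory stands in for the project's Data folder.

```python
import os
import tempfile

def build_output_path(out_folder, state, year):
    """Create out_folder (and any missing parents) and return the output file path."""
    os.makedirs(out_folder, exist_ok=True)   # no error if the folder already exists
    return os.path.join(out_folder, '{0}_{1}.csv'.format(state, year))

# Demonstrate in a throwaway directory rather than the real ../Data folder
base = tempfile.mkdtemp()
out_path = build_output_path(os.path.join(base, 'Data', 'StateData'), 'la', 2010)
print(os.path.basename(out_path))
```

Unlike `os.mkdir`, `os.makedirs` also creates missing parent folders, so the helper works even when the Data directory itself is absent.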

* Here we construct the URL for data retrieval and then pull the online data into an in-memory data frame named `dfRaw`.

In [7]:
#Set the data URL path and parameters and construct the url
path = 'https://waterdata.usgs.gov/{}/nwis/water_use?'.format(state)
values = {'format':'rdb',
         'rdb_compression':'value',
         'wu_area':'County',
         'wu_year': year,
         'wu_county':'ALL',
         'wu_county_nms':'--ALL+Counties--',
         'wu_category_nms':'--ALL+Categories--'
        }
from urllib.parse import urlencode #Python 3 location of urlencode (Python 2 used urllib.urlencode)
url = path + urlencode(values)

In [8]:
#Pull data in using the URL and remove the 2nd row of headers
dfRaw = pd.read_table(url,comment='#',header=[0,1],na_values='-')
dfRaw.columns = dfRaw.columns.droplevel(level=1)
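
* The read step relies on the response having comment lines that start with `#`, followed by two header rows (column names, then a format row). The sketch below reproduces that on a made-up, in-memory sample so it runs without network access; the sample text, including the format-row values, is an assumption for illustration and not real USGS output.

```python
import io
import pandas as pd

# Made-up two-row sample in a tab-delimited, two-header-row layout
rdb_text = (
    "# comment lines start with a hash\n"
    "state_cd\tcounty_nm\tyear\tPublic Supply total, in Mgal/d\n"
    "2s\t20s\t4s\t12s\n"
    "48\tCounty A\t2005\t5.97\n"
    "48\tCounty B\t2005\t-\n"
)

# Same options as the notebook cell: skip comments, read both header rows, treat '-' as missing
df = pd.read_table(io.StringIO(rdb_text), comment='#', header=[0, 1], na_values='-')
df.columns = df.columns.droplevel(level=1)   # discard the format row

print(df.columns.tolist())
print(df.shape)
```

Reading both header rows and then dropping the second keeps the format row from being parsed as a data record.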

In [9]:
#CHECK: Display a sample of the retrieved data
dfRaw.head()

Unnamed: 0,state_cd,state_name,county_cd,county_nm,year,"Total Population total population of area, in thousands","Public Supply population served by groundwater, in thousands","Public Supply population served by surface water, in thousands","Public Supply total population served, in thousands","Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d",...,Hydroelectric Power total offstream surface-water withdrawals in Mgal/d,"Hydroelectric Power power generated by instream use, in gigawatt-hours","Hydroelectric Power power generated by offstream use, in gigawatt-hours","Hydroelectric Power total power generated, in gigawatt-hours",Hydroelectric Power number of instream facilities,Hydroelectric Power number of offstream facilities,Hydroelectric Power total number of facilities,"Wastewater Treatment returns by public wastewater facilities, in Mgal/d",Wastewater Treatment number of public wastewater facilities,"Wastewater Treatment reclaimed wastewater released by wastewater facilities, in Mgal/d"
0,48,Texas,1,Anderson County,2005,56.408,,,34.12,5.97,...,,,,,,,,,,
1,48,Texas,3,Andrews County,2005,12.748,,,9.49,2.39,...,,,,,,,,,,
2,48,Texas,5,Angelina County,2005,81.557,,,76.68,12.07,...,,,,,,,,,,
3,48,Texas,7,Aransas County,2005,24.64,,,21.75,0.12,...,,,,,,,,,,
4,48,Texas,9,Archer County,2005,9.095,,,8.58,0.0,...,,,,,,,,,,


* With the data now held locally, we reformat the table so that each of the usage (and other) columns is "[melted](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html)" into a single column: the original column name is stored in one new field and its value in another. This facilitates subsequent analysis, which includes:
 * Removing rows (formerly columns) that don't report volume, e.g. population served values, and
 * Ensuring that volume data is a floating point number.
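
* A toy version of the melt, filter and type-conversion steps, using a made-up two-county frame (the county names are placeholders, not real data):

```python
import pandas as pd

# Wide table: one row per county, one column per reported quantity
wide = pd.DataFrame({
    'county_nm': ['A County', 'B County'],
    'year': [2005, 2005],
    'Public Supply total population served, in thousands': [34.12, 9.49],
    'Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d': [5.97, 2.39],
})

# Melt: every non-id column becomes a (Group, MGal) pair on its own row
tidy = pd.melt(wide, id_vars=['county_nm', 'year'], var_name='Group', value_name='MGal')

# Keep only rows whose original column name reports a volume
volumes = tidy[tidy['Group'].str.contains('Mgal')].copy()
volumes['MGal'] = volumes['MGal'].astype(float)

print(tidy.shape)
print(volumes.shape)
```

Two counties with two measurement columns melt to four rows; the population rows are then filtered out, leaving two.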

In [10]:
#Tidy the data: transform so data in each usage column become row values with a new column listing the usage type
rowHeadings = ['county_cd', 'county_nm', 'state_cd', 'state_name', 'year']
dfTidy = pd.melt(dfRaw,id_vars=rowHeadings,value_name='MGal',var_name='Group')
print("Data transformed from {0} rows/columns to {1} rows/columns".format(dfRaw.shape, dfTidy.shape))

Data transformed from (254, 281) rows/columns to (70104, 7) rows/columns


In [11]:
#Remove rows that don't have volume data (i.e. keep only rows whose 'Group' name contains 'Mgal')
dfTidy = dfTidy[dfTidy['Group'].str.contains('Mgal')].copy(deep=True)
dfTidy.shape

(59690, 7)

In [12]:
#Change the type of the MGal column to float (np.float was removed from recent NumPy releases)
dfTidy['MGal'] = dfTidy['MGal'].astype(float)

In [13]:
#CHECK: Show the structure of the 'tidied' data frame
dfTidy.head()

Unnamed: 0,county_cd,county_nm,state_cd,state_name,year,Group,MGal
1016,1,Anderson County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,5.97
1017,3,Andrews County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,2.39
1018,5,Angelina County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,12.07
1019,7,Aransas County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,0.12
1020,9,Archer County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,0.0


* Next, we translate the labels in the Group field into their components:
 * useClass: the 

In [14]:
#Join the remap table
dfAll = pd.merge(dfTidy,dfRemap,how='inner',left_on="Group",right_on="Group")
dfAll.head()

Unnamed: 0,county_cd,county_nm,state_cd,state_name,year,Group,MGal,Column1,Row1,Column2,Row2,Column3,Row3
0,1,Anderson County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,5.97,N,31,,,,
1,3,Andrews County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,2.39,N,31,,,,
2,5,Angelina County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,12.07,N,31,,,,
3,7,Aransas County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,0.12,N,31,,,,
4,9,Archer County,48,Texas,2005,Public Supply self-supplied groundwater withdr...,0.0,N,31,,,,
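
* Because the join uses `how='inner'`, any `Group` label missing from the remap table is dropped without warning. One way to check for such labels is a left join with `indicator=True`. The sketch below uses made-up labels and a two-row stand-in for the lookup table.

```python
import pandas as pd

tidy = pd.DataFrame({
    'Group': ['use A, in Mgal/d', 'use B, in Mgal/d', 'use C, in Mgal/d'],
    'MGal': [1.0, 2.0, 3.0],
})
# Stand-in for the remap table: 'use C' has no entry
remap = pd.DataFrame({
    'Group': ['use A, in Mgal/d', 'use B, in Mgal/d'],
    'Column1': ['N', 'P'],
    'Row1': ['31', '32'],
})

# A left join keeps every tidy row; the _merge column records where each row matched
checked = pd.merge(tidy, remap, how='left', on='Group', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Group'].tolist()

print(unmatched)   # → ['use C, in Mgal/d']
```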


In [17]:
#Open the spreadsheet template
wb = load_workbook(filename=outFilename)
ws = wb['Template']
ws.title = str(year)

In [18]:
#Loop through the first set of column/row targets and insert values into the Excel spreadsheet
dfRound1 = dfAll.groupby(['Column1','Row1'])['MGal'].sum()
dfRound1.fillna(value="n/a",inplace=True)
#The group keys come back in groupby order: (column letter, row number)
for (column,row), value in dfRound1.items():
    #Build the cell address (e.g. 'N31') and set the value in the workbook
    cell = str(column) + str(row)
    ws[cell] = value
#Save the workbook
wb.save(outFilename)

In [19]:
#Loop through the second set of column/row targets and insert values into the Excel spreadsheet
dfRound2 = dfAll.groupby(['Column2','Row2'])['MGal'].sum()
dfRound2.fillna(value="n/a",inplace=True)
#The group keys come back in groupby order: (column letter, row number)
for (column,row), value in dfRound2.items():
    #Build the cell address and set the value in the workbook
    cell = str(column) + str(row)
    ws[cell] = value
#Save the workbook
wb.save(outFilename)

In [20]:
#Loop through the third set of column/row targets and insert values into the Excel spreadsheet
dfRound3 = dfAll.groupby(['Column3','Row3'])['MGal'].sum()
dfRound3.fillna(value="n/a",inplace=True)
#The group keys come back in groupby order: (column letter, row number)
for (column,row), value in dfRound3.items():
    #Build the cell address and set the value in the workbook
    cell = str(column) + str(row)
    ws[cell] = value
#Save the workbook
wb.save(outFilename)
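
* The three passes above differ only in the column-name suffix, so they could be folded into one loop. The sketch below shows the idea on a made-up frame and collects cell addresses in a dictionary (`cells`, a hypothetical name) instead of writing to a workbook. It relies on pandas' default of dropping groups whose keys are missing, which is why rows without a second target cell are skipped.

```python
import pandas as pd

# Made-up data: only the first row has a second target cell
df_all = pd.DataFrame({
    'MGal':    [5.5, 2.25, 1.0],
    'Column1': ['N', 'N', 'P'],
    'Row1':    ['31', '31', '32'],
    'Column2': ['Q', None, None],
    'Row2':    ['40', None, None],
})

cells = {}
for n in (1, 2):
    keys = ['Column{}'.format(n), 'Row{}'.format(n)]
    totals = df_all.groupby(keys)['MGal'].sum()   # rows with missing keys are dropped
    for (column, row), value in totals.items():
        cells[str(column) + str(row)] = value     # e.g. 'N31'

print(cells)
```

In the notebook, the assignment inside the loop would be `ws[cell] = value`, followed by a single `wb.save(outFilename)` after the loop.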