# Fetch water volume and sand mass from Open-FF
### Version 1 - Aug 2022

This notebook fetches the most recent Open-FF data set, optionally filters it, and exports the result to a CSV file.

Note that FracFocus latitude and longitude values can be reported in different projections; Open-FF converts them all to WGS84
and checks that each location falls within the state and county indicated by the APINumber (see the fields **loc_within_state** and **loc_within_county**).

## Step 1: Get data into Colab
Run the cells below to import the most recent Open-FF data.

In [None]:
# This cell downloads some support code that is used to pull together the data set.  
!git clone https://github.com/gwallison/colab-support.git &>/dev/null;

# now run the code that defines the routine
%run colab-support/get_dataframe.py

In [None]:
# get_dataset pulls together a set of CSV files from a Google storage site, then merges them.
#  Result: df is a dataframe with all records (though not ALL fields).
# By default, the data set is prefiltered to only the "standard set"; that is, non-duplicates.
df = get_dataset()

# if you want to see what fields are in df, uncomment the following line
# df.columns

In [None]:
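import pandas as pd  # needed for pd.merge below; harmless if the support script already imported it

# The following code extracts just the water and sand data, and the associated metadata.
# Each UploadKey identifies one disclosure; its metadata repeats on every row, so first() keeps one copy.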
gb1 = df.groupby('UploadKey',as_index=False)[['TotalBaseWaterVolume','date','OperatorName','bgOperatorName',
                                              'StateName','CountyName','APINumber','Latitude','Longitude',
                                              'bgStateName','bgCountyName','bgLatitude','bgLongitude',
                                              'loc_within_county','loc_within_state']].first()
gb1['api10'] = gb1.APINumber.str[:10]
gb1['year'] = gb1.date.dt.year
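# CAS number 14808-60-7 is quartz (crystalline silica), used here to identify sand records
cond = df.bgCAS=='14808-60-7'
# total sand mass per disclosure, stored in a column named sandMass
gb2 = df[cond].groupby('UploadKey',as_index=False)['calcMass'].sum().rename({'calcMass':'sandMass'},axis=1)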
sand_water = pd.merge(gb1,gb2,on='UploadKey',how='left')
sand_water.drop('UploadKey',axis=1,inplace=True)
print(f'Number of unfiltered records: {len(sand_water)}') 
# show the first few lines
sand_water.head()
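
The extraction cell above combines two groupings with a left merge. The sketch below repeats that pattern on a tiny made-up table so the result is easy to inspect. The `toy` frame and all of its values are hypothetical; only the column names and the sand CAS number come from the notebook. The second CAS number, 7732-18-5, is water.

```python
import pandas as pd

# Toy stand-in for df: one row per ingredient, several rows per disclosure.
# All values are invented for illustration only.
toy = pd.DataFrame({
    'UploadKey': ['a', 'a', 'b', 'b', 'c'],
    'TotalBaseWaterVolume': [1000.0, 1000.0, 2000.0, 2000.0, 500.0],
    'bgCAS': ['14808-60-7', '7732-18-5', '14808-60-7', '14808-60-7', '7732-18-5'],
    'calcMass': [50.0, 8000.0, 30.0, 20.0, 4000.0],
})

# Disclosure-level fields repeat on every row, so first() recovers one copy each.
meta = toy.groupby('UploadKey', as_index=False)[['TotalBaseWaterVolume']].first()

# Sum the mass of the sand rows within each disclosure.
sand = (toy[toy.bgCAS == '14808-60-7']
        .groupby('UploadKey', as_index=False)['calcMass'].sum()
        .rename(columns={'calcMass': 'sandMass'}))

# how='left' keeps disclosure 'c', which has no sand rows; its sandMass is NaN.
toy_sand_water = pd.merge(meta, sand, on='UploadKey', how='left')
print(toy_sand_water)
```

Because the merge is a left join, disclosures with no sand records stay in the table with a missing `sandMass` rather than being dropped. This is why Step 2 offers a filter for missing values.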

## Step 2: Filter the data set, if desired
Uncomment the code lines below and edit them where you want to filter, then run the cells.

(To uncomment, remove the '#' at the beginning of the line.)

In [None]:
## by year
# sand_water = sand_water[sand_water.year.isin([2021,2020])]

In [None]:
## by state
## using bgStateName: all lower case
# sand_water = sand_water[sand_water.bgStateName.isin(['ohio','west virginia'])]

In [None]:
## by Operator
## using bgOperatorName: all lower case, see the Open_FF_Companies Index in the catalog
# sand_water = sand_water[sand_water.bgOperatorName=='pioneer']

In [None]:
## remove records that don't have water or sand
## (the sand column is named sandMass, with a capital M)
# cond1 = sand_water.sandMass.notna()
# cond2 = sand_water.TotalBaseWaterVolume.notna()
## uncomment only ONE of the next two lines
# sand_water = sand_water[cond1 | cond2] # keep if either field has data
# sand_water = sand_water[cond1 & cond2] # keep only if BOTH fields have data
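
To see how the two options in the cell above differ, here is a small sketch on made-up data. The `toy` frame and its values are hypothetical; `|` keeps a row when at least one field is present, and `&` keeps it only when both are.

```python
import numpy as np
import pandas as pd

# Four invented records covering every combination of present/missing values.
toy = pd.DataFrame({
    'sandMass': [100.0, np.nan, 250.0, np.nan],
    'TotalBaseWaterVolume': [5000.0, 7000.0, np.nan, np.nan],
})

has_sand = toy.sandMass.notna()
has_water = toy.TotalBaseWaterVolume.notna()

either = toy[has_sand | has_water]  # drops only the row missing both fields
both = toy[has_sand & has_water]    # keeps only the fully populated row

print(len(toy), len(either), len(both))  # 4 3 1
```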

In [None]:
print(f'Number of filtered records: {len(sand_water)}')

## Step 3: Export to a CSV file
Once you run this cell, the output file should be available in Colab's File panel for downloading.

sandMass is in pounds; TotalBaseWaterVolume is in gallons.

In [None]:
import datetime
today = datetime.date.today().isoformat()  # e.g. '2022-08-15'
sand_water.to_csv(f'sand_water_{today}.csv')
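
If metric units or a sand-to-water ratio are more convenient, they can be derived from the exported columns. The sketch below is one possible approach rather than part of Open-FF: it uses made-up values, assumes the gallons are US liquid gallons, and the new column names (`water_m3`, `sand_kg`, `sand_lb_per_gal`) are hypothetical.

```python
import pandas as pd

# Conversion factors; both are exact by definition.
GALLONS_TO_CUBIC_METERS = 0.003785411784  # assumes US liquid gallons
POUNDS_TO_KILOGRAMS = 0.45359237          # avoirdupois pound

# Invented records standing in for sand_water.
toy = pd.DataFrame({
    'TotalBaseWaterVolume': [1_000_000.0, 250_000.0],  # gallons
    'sandMass': [2_000_000.0, 100_000.0],              # pounds
})

toy['water_m3'] = toy.TotalBaseWaterVolume * GALLONS_TO_CUBIC_METERS
toy['sand_kg'] = toy.sandMass * POUNDS_TO_KILOGRAMS
# Pounds of sand per gallon of base water; NaN wherever either input is missing.
toy['sand_lb_per_gal'] = toy.sandMass / toy.TotalBaseWaterVolume

print(toy)
```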