# Monthly Evaporation Data for Each Year
We will obtain monthly evaporation data for every year and every block group. The result will then be written to the input file as a time series.

#### Imports

In [1]:
import os
import sys
import glob
import pandas as pd
from tqdm import tqdm

#### Download the NetCDF

In [1]:
import wget

In [11]:
url = 'ftp://ftp.cdc.noaa.gov/Datasets/NARR/Monthlies/monolevel/evap.mon.mean.nc'

wget.download(url)

'evap.mon.mean.nc'

#### Convert NetCDF to GeoTIFF

In [12]:
from osgeo import gdal

In [14]:
projection_5070 = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=1,1,-1,0,0,0,0 +units=m +no_defs'
warp_options = gdal.WarpOptions(options = '-t_srs \"' + projection_5070 + '\" -of GTiff')

ds = gdal.Warp('./evap.mon.mean.geotiff', 'NETCDF:./evap.mon.mean.nc:evap', options=warp_options)
ds = None

#### Extract the Bands of the GeoTIFF

The NetCDF data starts recording from January 1979 (Band 1). We only want the data from January 1981 to December 2014.

In [23]:
out_dir = './geotiffs/geotiffs/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)

In [19]:
start_year = 1981
start_month = 1
end_year = 2014
end_month = 12

start_band = (start_year - 1979) * 12 + start_month
end_band = (end_year - 1979) * 12 + end_month
print('Start:', start_band)
print('End:', end_band)

Start: 25
End: 432


In [29]:
in_ds = gdal.Open('./evap.mon.mean.geotiff')

for band in tqdm(range(start_band, end_band+1)):
    outfile = out_dir + str(band) + '.geotiff'
    
    translate_options = gdal.TranslateOptions(options='-b ' + str(band) + ' -of GTiff')
    gdal.Translate(outfile, in_ds, options=translate_options)

100%|██████████| 408/408 [00:25<00:00, 16.30it/s]


Note that some block groups lie in the ocean. Their evaporation values would be incorrect, because the ocean has higher evaporation rates than land.

<img src="https://i.imgur.com/F6UIyls.png">

#### Masking the GeoTIFFs
To solve this issue, we will mask the evaporation data using the land mask provided by NARR NCEP.

##### Download the Land Mask File

In [30]:
url = 'ftp://ftp.cdc.noaa.gov/Datasets/NARR/time_invariant/land.nc'
wget.download(url)

'land.nc'

##### Project the Land Mask

In [32]:
projection_5070 = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=1,1,-1,0,0,0,0 +units=m +no_defs'
warp_options = gdal.WarpOptions(options = '-t_srs \"' + projection_5070 + '\" -of GTiff')

ds = gdal.Warp('./land.geotiff', 'NETCDF:./land.nc:land', options=warp_options)
ds = None

##### Import gdal_calc.py
This will be used to mask the files.


In [35]:
gdal_path = '/home/matas/anaconda3/envs/swmm/bin/'
sys.path.insert(0, gdal_path)
import gdal_calc

In [100]:
out_dir = './geotiffs/masked_geotiffs/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)

In [None]:
land_mask = './land.geotiff'

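for geotiff in glob.glob('./geotiffs/geotiffs/*.geotiff'):
    # The file name (without directory or extension) is the band number
    band_number = geotiff[geotiff.rfind('/')+1:geotiff.rfind('.')]
    print(band_number)
    outfile = out_dir + band_number + '.geotiff'
    
    # Multiply by the land mask; masked pixels become 0, which is flagged as NoData
    gdal_calc.Calc('A*B', A=geotiff, B=land_mask, outfile=outfile, format='GTiff', NoDataValue=0)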
    

Now, the block groups that were affected have NoData, since we specified NoData = 0.
<img src="https://i.imgur.com/povTQzI.png">
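
To illustrate what the `A*B` expression does, here is a small NumPy sketch with made-up values. The arrays are hypothetical, and the land mask is assumed to be 1 over land and 0 over water.

```python
import numpy as np

# Hypothetical 2x3 evaporation raster; the right-hand column is over the ocean
evap = np.array([[0.8, 1.2, 3.5],
                 [0.9, 1.1, 3.9]])

# Assumed land mask: 1 over land, 0 over water
land = np.array([[1, 1, 0],
                 [1, 1, 0]])

NODATA = 0

# Equivalent of the 'A*B' calculation: ocean pixels are multiplied by 0
masked = evap * land

# Pixels equal to the NoData value are the ones treated as missing
is_nodata = masked == NODATA

print(masked)
print(is_nodata)
```

One side effect of using 0 as the NoData value is that a land pixel whose evaporation is exactly 0 would also be treated as missing.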

We will fill these NoData pixels by interpolating from neighboring pixels that have data, which extends the valid raster by one pixel in each direction.

#### Data Interpolation

We will interpolate the data in a 3x3 grid around the target pixel. We will only modify a pixel if it has no data, and if it has at least one neighbor with data.

Neighbors directly adjacent to the target pixel (North, East, South, West) will receive a weight of 1.

Neighbors diagonally adjacent to the target pixel (NW, NE, SE, SW) will receive a weight of $\dfrac{1}{\sqrt{2}}$, since they are farther away. Note that each pixel in this dataset covers roughly 32 km $\times$ 32 km.
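
As a minimal sketch of this weighting on a toy 3x3 grid, the fill value for a NoData center pixel could be computed as follows. The helper name `fill_pixel` and the sample values are hypothetical, and the sketch assumes the result is normalized by the sum of the weights, which is one common way to form a weighted mean.

```python
import math
import numpy as np

DIAGONAL_WEIGHT = 1 / math.sqrt(2)  # diagonal neighbors are farther away


def fill_pixel(grid, valid, i, j):
    """Weighted mean of the valid 3x3 neighbors of (i, j), or None if there are none."""
    total = 0.0
    weight_sum = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            if (di, dj) == (0, 0):
                continue  # skip the target pixel itself
            if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]):
                continue  # skip neighbors outside the raster
            if not valid[r, c]:
                continue  # skip neighbors without data
            weight = DIAGONAL_WEIGHT if di != 0 and dj != 0 else 1.0
            total += weight * grid[r, c]
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Hypothetical raster where 0 marks NoData; the center pixel is missing
grid = np.array([[0.0, 2.0, 0.0],
                 [0.0, 0.0, 4.0],
                 [0.0, 0.0, 6.0]])
valid = grid != 0

filled = fill_pixel(grid, valid, 1, 1)
print(filled)
```

Here the center pixel has two directly adjacent neighbors with data (2.0 and 4.0) and one diagonal neighbor (6.0), so the diagonal value contributes less to the result than the other two.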

In [102]:
out_dir = './geotiffs/masked_extended_geotiffs/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)

In [None]:
import rasterio as rio
import numpy as np

DIAGONAL_WEIGHT = 1 / np.sqrt(2)  # diagonal neighbors are farther away

for geotiff in glob.glob('./geotiffs/masked_geotiffs/*.geotiff'):
    name = geotiff[geotiff.rfind('/')+1:geotiff.rfind('.')]
    print(name)
    with rio.open(geotiff) as raster:
        data = raster.read(1)
        nodata = raster.nodata
        profile = raster.profile
    
    rows, cols = data.shape
    missing = data == nodata  # True where the pixel has no data
    output = np.copy(data)
    
    for i in range(rows):
        for j in range(cols):
            if not missing[i][j]:
                continue
            total = 0
            weight_sum = 0
            for x in range(-1, 2):
                for y in range(-1, 2):
                    r, c = i + x, j + y
                    # Skip neighbors outside the raster (negative indices would wrap around)
                    if r < 0 or r >= rows or c < 0 or c >= cols:
                        continue
                    if missing[r][c]:
                        continue
                    weight = DIAGONAL_WEIGHT if x != 0 and y != 0 else 1
                    total += weight * data[r][c]
                    weight_sum += weight
            # Only fill the pixel if at least one neighbor has data;
            # dividing by the sum of weights gives a weighted mean
            if weight_sum > 0:
                output[i][j] = total / weight_sum
    
    with rio.open(out_dir + name + '.geotiff', 'w', **profile) as dst:
        dst.write(output, 1)
    

#### Interpolation Result
<img src="https://i.imgur.com/gZSiuIs.png">

#### Zonal Stats
We will now load the block group shapefile and use rasterstats' zonal_stats function to calculate the mean evaporation for each block group.

Note that this will take a long time for each file because there are ~150,000 block groups!

In [107]:
import rasterstats as rs
import geopandas as gpd
import pandas as pd
import persistqueue

shape_file = '../../../data/input_file_data/SelectBG_all_land_BGID_final.gpkg'
shape_frame = gpd.read_file(shape_file)
shape_frame['GEOID10'] = shape_frame['GEOID10'].astype(str)
output_stats = ['mean']

In [108]:
out_dir = './evap/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)

Making ./evap/


#### Persistent Queue

This is a disk-based queue which will save progress in case of a program crash. The files are inserted using q.put() and retrieved with q.get(). 

Once the file is processed we can remove it from the queue using q.ack(). If the program were to crash between q.get() and q.ack(), the file would return to the queue and be processed during a later run.

In [None]:
q = persistqueue.UniqueAckQ('./queue/', auto_commit=True, multithreading=True)

In [1]:
def worker():
    while True:
        file = q.get(block=True)
        band = file[file.rfind('/')+1:file.rfind('.')]
        print('Started', band)
        stats = rs.zonal_stats(shape_frame, file, stats=output_stats, all_touched=True)
        frame = pd.DataFrame.from_dict(stats)
        frame = frame.join(shape_frame['GEOID10'])
        frame.to_pickle(out_dir + band + '.pkl')
        print('Finished', band)
        q.ack(file)

#### Multiprocessing
We will create 8 worker processes. You can change this number to match the number of CPU cores on your computer.

In [None]:
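from multiprocessing import Process

if q.size == 0:
    print('Queue is empty, adding files.')
    # Read from the directory the interpolated GeoTIFFs were written to
    for file in glob.glob('./geotiffs/masked_extended_geotiffs/*.geotiff'):
        q.put(file)

for i in range(8):
    p = Process(target=worker)
    p.start()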

#### Formatting the Output, Converting Units

We will load each month's pickle file and format them by month and year.

Also, we will convert from millimeters to inches.

##### Function to Convert Bands to Datetime Index

This function will convert the 1-indexed NARR bands (1 = January 1979, 2 = February 1979, etc.) to a datetime index which will be used as our columns in the pandas dataframe.

In [76]:
import calendar
import datetime
def narr_band_to_datetime(band_number):
    month = band_number % 12
    if month == 0:
        month = 12
    year = 1979 + (band_number-1) // 12
    return datetime.datetime(year, month, 1)
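
As a quick sanity check of the band arithmetic, the sketch below converts in both directions and verifies the round trip over the bands used here. The helper names `datetime_to_narr_band` and `narr_band_to_year_month` are hypothetical and are not part of the notebook; the second one uses `divmod` as an alternative to the modulo-and-special-case approach.

```python
import datetime

FIRST_YEAR = 1979  # band 1 is January 1979


def datetime_to_narr_band(year, month):
    """Return the 1-indexed NARR band for a given year and month."""
    return (year - FIRST_YEAR) * 12 + month


def narr_band_to_year_month(band_number):
    """Return (year, month) for a 1-indexed NARR band."""
    years, month_index = divmod(band_number - 1, 12)
    return FIRST_YEAR + years, month_index + 1


for band in (1, 12, 25, 432):
    year, month = narr_band_to_year_month(band)
    print(band, datetime.datetime(year, month, 1).strftime('%Y-%m'))

# Every band in the extracted range should survive a round trip
for band in range(25, 433):
    assert datetime_to_narr_band(*narr_band_to_year_month(band)) == band
```

Bands 25 and 432 map to January 1981 and December 2014, matching the start and end bands computed earlier.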

In [None]:
files = glob.glob('./evap/*.pkl')

# Create the initial dataframe using the first file in the list
name = files[0][files[0].rfind('/')+1:files[0].rfind('.')]
frame = pd.read_pickle(files[0])
frame.set_index('GEOID10', drop=True, inplace=True)
frame.rename(columns={'mean':narr_band_to_datetime(int(name))}, inplace=True)


# Add onto the initial dataframe using the rest of the files
for file in files[1:]:
    name = file[file.rfind('/')+1:file.rfind('.')]
    subframe = pd.read_pickle(file)
    subframe.set_index('GEOID10', drop=True, inplace=True)
    subframe.rename(columns={'mean':narr_band_to_datetime(int(name))}, inplace=True)
    frame = frame.join(subframe)
    
    
print(frame)
frame.to_pickle('./evaporation.pkl')

##### Sort the Columns, Convert to Inches

The conversion for inches is as follows:

inches / day (SWMM) = mm(NARR) * 8 * 0.0393701

We multiply by 8 because NARR accumulates evaporation over 3-hourly intervals, and there are 8 such intervals per day.

1 millimeter = 0.0393701 inches
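
As a worked example of this conversion, the sketch below applies it to a single made-up value. The function name `narr_mm_to_inches_per_day` and the input of 0.5 mm are hypothetical.

```python
MM_TO_INCHES = 0.0393701  # 1 millimeter in inches
INTERVALS_PER_DAY = 8     # 3-hourly accumulation intervals per day


def narr_mm_to_inches_per_day(mm_per_interval):
    """Convert a 3-hourly accumulation in mm to inches per day."""
    return mm_per_interval * INTERVALS_PER_DAY * MM_TO_INCHES


# A hypothetical mean accumulation of 0.5 mm per 3-hour interval
# corresponds to 4 mm per day, or roughly 0.157 inches per day
example = narr_mm_to_inches_per_day(0.5)
print(round(example, 3))
```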

In [None]:
frame = pd.read_pickle('./evaporation.pkl')
frame = frame.sort_index(axis='columns')  # Sort by Year and Month
frame = frame.apply(lambda x : x * 8 * 0.0393701)  # Convert to inches
frame.to_pickle('./evaporation_converted.pkl')
print(frame)

#### Compare New Data to Old Data

In [5]:
new_frame = pd.read_pickle('./evaporation_converted.pkl')
new_data = new_frame.loc['010010202002']
new_data = new_data.groupby(new_data.index.month).mean()

old_frame = pd.read_pickle('../../../data/input_file_data/evaporation_converted.pkl')
old_data = old_frame.loc['010010202002']

print('New Data:\n' + str(new_data))
print('Old Data:\n' + str(old_data))

old_data = old_data.reset_index(drop=True)
print('Mean Difference:', str((new_data - old_data).mean()))
print('Max Difference:', str((new_data - old_data).max()))
print('Min Difference:', str((new_data - old_data).min()))

New Data:
1     0.045161
2     0.068187
3     0.106026
4     0.157164
5     0.185812
6     0.193588
7     0.176125
8     0.150855
9     0.125562
10    0.088352
11    0.054638
12    0.041084
Name: 010010202002, dtype: float64
Old Data:
1981-01-01    0.036854
1981-02-01    0.064368
1981-03-01    0.104568
1981-04-01    0.169002
1981-05-01    0.172171
                ...   
2014-08-01    0.164870
2014-09-01    0.118315
2014-10-01    0.090916
2014-11-01    0.044849
2014-12-01    0.037901
Name: 010010202002, Length: 408, dtype: float64
Mean Difference: 0.003139681828711702
Max Difference: 0.04936140216635991
Min Difference: -0.06297576178434318
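
Because `new_data` is indexed by month number (1 to 12) while `old_data` has one row per date, the subtraction in the cell above matches rows by integer label after `reset_index(drop=True)`, not by calendar month. A possible alternative, sketched here with synthetic series, is to reduce both series to monthly means before taking the difference. The values and the names `old_series`, `new_series`, `old_climatology`, and `new_climatology` are hypothetical, and the sketch assumes both series carry a DatetimeIndex.

```python
import pandas as pd

# Synthetic monthly series covering two years (values are made up)
dates = pd.date_range('1981-01-01', '1982-12-01', freq='MS')
old_series = pd.Series([float(d.month) for d in dates], index=dates)
new_series = old_series + 0.5

# Average each calendar month across years, giving an index of 1..12 for both
old_climatology = old_series.groupby(old_series.index.month).mean()
new_climatology = new_series.groupby(new_series.index.month).mean()

# Both sides now share the same month labels, so the subtraction lines up
difference = new_climatology - old_climatology
print(difference.describe())
```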


#### Results

We see that the mean difference between the old and new data is very small (about 0.003 inches per day for block group 010010202002). We have successfully obtained monthly evaporation data.