# WNV Model - NARR Precipitation Data Preprocessing
In this notebook we will download daily precipitation and surface air temperature data from the [NARR daily dataset](https://www.esrl.noaa.gov/psd/data/gridded/data.narr.html).

We will then convert this data to county-level pandas DataFrames using the `zonal_stats` function from rasterstats.
## Imports

In [None]:
# Catch only ImportError so that other failures are not silently hidden
try:
    import wget  # Downloading Data
except ImportError:
    print('wget not found, please run \'pip install wget\'.')

try:
    from osgeo import gdal  # GDAL
except ImportError:
    print('GDAL not found, please run \'conda install -c conda-forge gdal\'.')

try:
    import rasterio as rio  # Modifying Raster datasets
except ImportError:
    print('Rasterio not found, please run \'pip install rasterio\'.')

try:
    import geopandas as gpd  # Reading the county shapefile
except ImportError:
    print('geopandas not found, please run \'conda install geopandas\'.')

try:
    import rasterstats as rs  # Zonal statistics
except ImportError:
    print('rasterstats not found, please run \'pip install rasterstats\'.')

import pandas as pd
import numpy as np
import calendar
import sys, os
import glob
import matplotlib.pyplot as plt

## 1) Downloading the Precipitation Data

First, we will download the data. We will use the [wget](https://pypi.org/project/wget/) module to do this.
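
Before downloading anything, it can help to list which yearly files are still missing. The following is a minimal sketch, not part of the original workflow: the helper name `pending_downloads` is hypothetical, and it only builds the URL and path pairs that could then be passed to `wget.download`.

```python
import os

def pending_downloads(base_url, download_dir, variable, years, overwrite=False):
    """Return (url, local_path) pairs for yearly files that still need downloading."""
    pending = []
    for year in years:
        file_name = f'{variable}.{year}.nc'
        local_path = os.path.join(download_dir, file_name)
        # Skip files that already exist unless overwrite is requested
        if overwrite or not os.path.exists(local_path):
            pending.append((base_url + file_name, local_path))
    return pending

# Hypothetical usage: nothing is downloaded here, we only count the missing files
targets = pending_downloads(
    'ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/',
    './yearly_netcdf_data/', 'apcp', range(1999, 2016))
print(len(targets), 'files to download')
```

Each pair could then be fetched with `wget.download(url, local_path)`.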


#### Specify Download Directory
Note that this path is relative to the Jupyter notebook file.

In [None]:
download_dir = './yearly_netcdf_data/'
if not os.path.exists(download_dir):
    print('Directory not found, attempting to create it')
    os.mkdir(download_dir)

#### Download Data

In [None]:
overwrite = False  # Whether to replace existing files or not

url = 'ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/'

start_year = 1999
end_year = 2015
years = list(range(start_year, end_year + 1))

for year in years:
    print(year)
    file_name = 'apcp.' + str(year) + '.nc'

    # Only download files that are missing, unless overwrite is set
    if overwrite or not os.path.exists(download_dir + file_name):
        wget.download(url + file_name, download_dir + file_name)

## 2) Converting the NetCDFs to GeoTiff

We will convert each year's NetCDF file to a multi-banded GeoTIFF file, where each band represents one day of that year.

This will require the use of GDAL.
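
Because each band represents one day, the number of bands in a yearly file should match the number of days in that year. The following is a small sketch for checking this; the helper name `expected_band_count` is hypothetical, and it assumes the yearly file contains exactly one time step per calendar day.

```python
import calendar

def expected_band_count(year):
    """Number of daily bands a yearly file should contain (366 in leap years)."""
    return 366 if calendar.isleap(year) else 365

for year in (1999, 2000, 2004, 2015):
    print(year, expected_band_count(year))
```

The result can be compared against `RasterCount` of a GeoTIFF opened with `gdal.Open`.
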
#### Selecting the Files

In [None]:
yearly_netcdf = glob.glob(download_dir + '/*.nc')
print('Successfully found files' if len(yearly_netcdf) == 17 else 'Failed to find files')

#### Defining the Projection
The WNV Model's county shapefile uses the EPSG:4326 coordinate reference system (WGS 84 longitude/latitude), so we will use this for our GeoTIFF extraction.

In [None]:
proj = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'

### Warping the NetCDFs to our Projection
First, we will warp each NetCDF to our specified projection using the [gdalwarp utility](https://gdal.org/programs/gdalwarp.html).

#### Setting the Output Directory

In [None]:
output_directory = './yearly_geotiff_data/'
if not os.path.exists(output_directory):
    print('Directory not found, attempting to create it')
    os.mkdir(output_directory)

#### Warping

In [None]:
overwrite = False  # Overwrite existing GeoTIFFs or not


# The warp options specify our target projection and output type (GeoTIFF)
warp_options = gdal.WarpOptions(options='-t_srs \"' + proj + '\" -of GTiff')
subdataset = 'apcp'

for file in yearly_netcdf:
    name = os.path.splitext(os.path.basename(file))[0]  # e.g. 'apcp.1999'
    print('Processing:', name)
    if os.path.exists(output_directory + name + '.geotiff') and not overwrite:
        pass
    else:
        ds = gdal.Warp(srcDSOrSrcDSTab='NETCDF:' + file + ':' + subdataset, destNameOrDestDS=output_directory + name + '.geotiff', options=warp_options)
        ds = None # Flush the file cache

## 3) Preparing the Land Mask
Since some counties lie on the edge of the continent, their data mixes with the ocean data. Therefore, we need to apply the land mask provided by NARR on each year's GeoTIFF that we just created. 
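
The masking itself is an element-wise multiplication of each daily grid by the land mask. The following is a minimal sketch on toy arrays (the values are made up for illustration), using 0 as the no-data value as in the rest of this notebook.

```python
import numpy as np

# Toy 3x3 precipitation grid and a land mask (1 = land, 0 = ocean)
precip = np.array([[2.0, 5.0, 1.0],
                   [0.0, 3.0, 4.0],
                   [6.0, 7.0, 8.0]])
land = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]])

nodata = 0
masked = precip * land  # ocean pixels become 0

# Treat 0 as "no data" when computing statistics
valid = np.ma.masked_equal(masked, nodata)
print(valid.count(), 'pixels contribute to the mean')
print(valid.mean())
```

Note that in this sketch a dry land pixel (value 0) is indistinguishable from a masked ocean pixel, so it is also excluded from the mean.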

In [None]:
if not os.path.exists('./land.nc'):
    print('File not found, attempting download.')
    download_url = 'ftp://ftp.cdc.noaa.gov/Datasets/NARR/time_invariant/land.nc'
    wget.download(download_url, './land.nc')
    
mask_file = glob.glob('./land.nc')
print('Successfully found land mask' if len(mask_file)==1 else 'Failed to find land mask')
mask_file = mask_file[0]

### Projecting the Land Mask
In order for the land mask's values to line up with our projected GeoTIFF files, we need to project it in a similar fashion to the GeoTIFF files.

In [None]:
proj = '+proj=longlat +datum=WGS84 +no_defs'
subdataset = 'land'
output_file = './land_mask_geotiff'

# Quote the PROJ string so that it is read as a single -t_srs argument
warp_options = gdal.WarpOptions(options='-t_srs \"' + proj + '\" -of GTiff')
ds = gdal.Warp(srcDSOrSrcDSTab='NETCDF:' + mask_file + ':' + subdataset, destNameOrDestDS=output_file + '.geotiff', options=warp_options)
ds = None  # Close the dataset so the GeoTIFF is flushed to disk
print('Successfully Warped Land Mask (' + output_file + ')')

### Extending the Land Mask

Since some of the counties lie outside the land mask's region, we need to extend the land mask by one pixel in each direction. This will ensure proper data coverage once we mask the geotiffs.

#### Results of Extension Function
<img src="https://i.imgur.com/dJqJeJS.png" width="300">



#### Extension Function

A numpy masked array gives us a boolean mask that marks which pixels have no data. We can then fill each no-data pixel that has at least one neighbor with data.

In this case, the data is a 1 or a 0, which represents land or not land.
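
The one-pixel extension can also be sketched without explicit pixel loops. This is a minimal illustration on a toy array, not the notebook's own implementation: the function name `extend_mask` is hypothetical, and it assumes the mask holds only the values 1 (land) and 0 (no data).

```python
import numpy as np

def extend_mask(land, nodata=0):
    """Fill each no-data pixel that has at least one 8-connected neighbor with data."""
    has_data = land != nodata
    # Pad with False so edge pixels never borrow from the opposite side
    padded = np.pad(has_data, 1, mode='constant', constant_values=False)
    rows, cols = land.shape
    neighbor_has_data = np.zeros_like(has_data)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            neighbor_has_data |= padded[1 + dx:1 + dx + rows, 1 + dy:1 + dy + cols]
    extended = land.copy()
    # All land pixels carry the value 1, so filled pixels receive 1 as well
    extended[~has_data & neighbor_has_data] = 1
    return extended

land = np.zeros((5, 5), dtype=int)
land[2, 2] = 1  # a single land pixel in the center
extended = extend_mask(land)
print(extended)
```

Padding the array avoids the wrap-around that negative indices would otherwise cause at the raster edges.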

In [None]:
raster = rio.open('./land_mask_geotiff.geotiff')
data = raster.read(1)

rows, cols = data.shape
nodata = 0

data = np.ma.masked_equal(data, nodata)
mask = np.ma.getmaskarray(data)  # True where the pixel has no data
output_data = data.filled(nodata)

for i in range(rows):
    for j in range(cols):
        if mask[i, j]:  # If there's no data
            for x in range(-1, 2):
                for y in range(-1, 2):
                    ni, nj = i + x, j + y
                    # Skip neighbors outside the raster (negative indices would wrap around)
                    if 0 <= ni < rows and 0 <= nj < cols and not mask[ni, nj]:
                        output_data[i, j] = data[ni, nj]

with rio.open('./extended_land_mask.geotiff', 'w', **raster.profile) as dst:
    dst.write(output_data, 1)
raster.close()

## 4) Applying the Land Mask

#### Defining path to gdal_calc.py
Please manually define your GDAL install path. If you're using Anaconda, it should be in your environment's bin folder.
We will use this path to find gdal_calc.py, a script that is included with GDAL.

In [None]:
gdal_path = '/home/matas/anaconda3/envs/swmm/bin/'
sys.path.insert(0, gdal_path)

In [None]:
try:
    import gdal_calc
except ImportError:
    print('gdal_calc not found, please specify the path to this file in the cell above')

#### Masking the Yearly GeoTIFF Data

We will take each year's GeoTIFF file and iterate over all of its bands to create one masked GeoTIFF image per day (365, or 366 in leap years).

In [None]:
yearly_geotiff = glob.glob('./yearly_geotiff_data/*.geotiff')
print('Successfully found all files' if len(yearly_geotiff) == 17 else 'Failed to find files')

output_directory = './masked_extended_geotiffs/'
if not os.path.exists(output_directory):
    print('Making', output_directory)
    os.mkdir(output_directory)

In [None]:
overwrite = True

mask_file = './extended_land_mask.geotiff'
# Check that the file actually exists (a non-empty string is always truthy)
if not os.path.exists(mask_file):
    raise FileNotFoundError('Failed to find mask file: ' + mask_file)
else:
    print('Found mask file.')

for geotiff in yearly_geotiff:
    
    # Get number of bands to iterate over
    raster = gdal.Open(geotiff)
    band_count = raster.RasterCount
    
    # Specify the year
    year = geotiff[geotiff.rfind('/')+6:geotiff.rfind('.')]
    print(year)
    year_dir = output_directory + year + '/'
    if not os.path.exists(year_dir):
        print('Making', year_dir)
        os.makedirs(year_dir)
    
    # Loop through the year's daily bands (365, or 366 in leap years)
    for band in range(1, band_count+1):
        outfile = year_dir + str(band) + '_masked_extended.geotiff'
        if not os.path.exists(outfile) or overwrite:
            print('Creating', outfile)
            gdal_calc.Calc('A*B', A=geotiff, B=mask_file, A_band=band, outfile=outfile, format='GTiff', NoDataValue=0)
    

## 5) Extracting the Data

We will use rasterstats' zonal_stats function to extract the data from each day into individual .csv files.

In [None]:
output_directory = './county_data/output_csv/'
if not os.path.exists(output_directory):
    print('Making', output_directory)
    os.makedirs(output_directory)  # Also creates the parent './county_data/' if needed

#### Load the Shape File

In [None]:
shape_frame = gpd.read_file('./urban_counties/urban_counties_wnv_4326.shp')
shape = shape_frame.to_crs('+proj=longlat +datum=WGS84 +no_defs')
shape['GEOID'] = shape['GEOID'].astype(str)  # Convert to string to keep the leading zeroes

#### Extract the Data!

In [None]:
output_columns = ['count', 'mean']

for year in range(1999, 2016):
    year_dir = output_directory + str(year) + '/'
    if not os.path.exists(year_dir):
        print('Making', year_dir)
        os.mkdir(year_dir)

    # Run the extraction whether or not the year directory already existed
    year_files = sorted(glob.glob('./masked_extended_geotiffs/' + str(year) + '/*.geotiff'))

    for file in year_files:
        name = file[file.rfind('/')+1:file.rfind('.')]
        stats = rs.zonal_stats(shape, file, stats=output_columns, all_touched=True)
        frame = pd.DataFrame.from_dict(stats)

        frame = frame.join(shape['GEOID'])  # Add the GEOIDs to the zonal_stats frame
        frame = frame.join(shape['NAME'])
        frame.to_csv(year_dir + name + '.csv')

        print(year_dir + name + '.csv')

## 6) Processing the Data

We will combine each year's daily CSV files into a single pandas DataFrame with one column per day.
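
The daily CSV files are named after their band number, so a plain alphabetical sort would place `10` before `2`. The following sketch shows one way to recover the day number from a file name; the helper name `day_from_path` is hypothetical and assumes names such as `17_masked_extended.csv` or `17.csv`.

```python
import os

def day_from_path(path):
    """Extract the day number from names like '17_masked_extended.csv' or '17.csv'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem.split('_')[0])

paths = ['./county_data/output_csv/1999/10_masked_extended.csv',
         './county_data/output_csv/1999/2_masked_extended.csv',
         './air_temperature/output/1999/1.csv']

# Sort numerically by day rather than alphabetically by name
print(sorted(paths, key=day_from_path))
```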

In [None]:
overwrite = False

out_dir = './county_data/dataframes/all_data/'
os.makedirs(out_dir, exist_ok=True)

for year in range(1999,2016):
    data_files = glob.glob('./county_data/output_csv/' + str(year) + '/*.csv')
    if len(data_files) < 365:
        raise FileNotFoundError('Failed to find files for ' + str(year) + '!')

    frame = None
    for file in data_files:
        # File names look like '<day>_masked_extended.csv'; keep only the day number
        name = os.path.basename(file).split('_')[0]
        sub_frame = pd.read_csv(file).drop('Unnamed: 0', axis=1)
        # Rename by label rather than by position, so the column order does not matter
        sub_frame = sub_frame.rename(columns={'mean': name})

        if frame is None:
            # The first file also provides the NAME, GEOID and count columns
            frame = sub_frame[['NAME', 'GEOID', 'count', name]]
        else:
            frame = frame.join(sub_frame[name])

    # Reorder the columns (1,2,3, ... ,364, 365)
    data_columns = frame.iloc[:,3:].copy()
    data_columns.columns = data_columns.columns.astype(int)
    data_columns = data_columns.sort_index(axis=1)

    # Convert data from millimeters to inches
    data_columns = data_columns / 25.4
    frame = frame.iloc[:,:2].join(data_columns)

    # Replace NaN values with 0
    frame = frame.fillna(value=0)

    # Save the frame to disk
    frame.to_pickle(out_dir + str(year) + '.pkl')

# 7) Daily Air Temperature Data

We will repeat the above process with the air temperature data for each county, so that we can filter out snowfall in our data analysis.

## Download the Data

In [None]:
download_dir = './air_temperature/netcdfs/'
if not os.path.exists(download_dir):
    print('Making', download_dir)
    os.makedirs(download_dir)


In [None]:
overwrite = False

download_url = 'ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/'
dataset = 'air.sfc'
for year in range(1999, 2016):
    name = dataset + '.' + str(year) + '.nc'
    url = download_url + name
    if not os.path.exists(download_dir + name) or overwrite:
        wget.download(url, download_dir + name)
    else:
        print('Skipping', name)


## Convert and Project the NetCDF Data to GeoTIFF

In [None]:
overwrite = False

output_dir = './air_temperature/geotiffs/'
os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists
proj = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
warp_options = gdal.WarpOptions(options='-t_srs \"' + proj + '\" -of GTiff')
subdataset = 'air'


files = glob.glob('./air_temperature/netcdfs/*.nc')
print('Found files' if len(files) == 17 else 'Some files may be missing!')

for file in sorted(files):
    name = file[file.rfind('/')+1:file.rfind('.')]
    if not os.path.exists(output_dir + name + '.geotiff') or overwrite:
        print(name)
        ds = gdal.Warp(srcDSOrSrcDSTab='NETCDF:' + file + ':' + subdataset, destNameOrDestDS=output_dir + name + '.geotiff', options=warp_options)
        ds = None # Flush the file cache
    else:
        print('Skipping', name)

## Mask and Extract Each Band of the GeoTIFF
This assumes that you have already created **and extended** the land mask in step 3.

#### Import Gdal_Calc
Please manually define your GDAL install path. If you're using Anaconda, it should be in your environment's bin folder.
We will use this path to find gdal_calc.py, a script that is included with GDAL.

In [None]:
gdal_path = '/home/matas/anaconda3/envs/swmm/bin/'
sys.path.insert(0, gdal_path)

In [None]:
try:
    import gdal_calc
except ImportError:
    print('gdal_calc not found, please specify the path to this file in the cell above')

In [None]:
overwrite = False

output_dir = './air_temperature/daily_geotiffs/'

# Select the Projected GeoTIFFs
files = glob.glob('./air_temperature/geotiffs/*.geotiff')
print('Found all files' if len(files) == 17 else ('Failed to find files!'))

# Select the Mask File
mask_file = glob.glob('./extended_land_mask.geotiff')
if len(mask_file) == 1:
    mask_file = mask_file[0]
    print('Found mask file!')
else:
    raise FileNotFoundError('Failed to find mask file.')
    

for file in sorted(files):
    name = file[file.rfind('/')+1:file.rfind('.')]
    year = name[name.rfind('.')+1:]
    
    year_dir = output_dir + str(year) + '/'
    if not os.path.exists(year_dir):
        print('Making', year_dir)
        os.makedirs(year_dir)
    
    # Get number of bands to iterate over
    raster = gdal.Open(file)
    band_count = raster.RasterCount
 
    # Loop through the year's daily bands (365, or 366 in leap years)
    for band in range(1, band_count+1):
        outfile = year_dir + str(band) + '.geotiff'
        if not os.path.exists(outfile) or overwrite:
            print('Creating', outfile)
            gdal_calc.Calc('A*B', A=file, B=mask_file, A_band=band, outfile=outfile, format='GTiff', NoDataValue=0)
    

## Zonal Stats

We will now extract the GeoTIFF data for each county, day, and year.

The following code will take a while to run.

In [None]:
import rasterstats as rs
import pandas as pd
import geopandas as gpd
import glob
import os

print('Started')
shape_frame = gpd.read_file('./urban_counties/urban_counties_wnv_4326.gpkg')
shape_frame['GEOID'] = shape_frame['GEOID'].astype(str)

print(shape_frame.crs)
output = ['mean', 'count']


for year in range(1999,2016):
    year_geotiffs = glob.glob('./air_temperature/daily_geotiffs/' + str(year) + '/*.geotiff')

    year_dir = './air_temperature/output/' + str(year) + '/' 
    if not os.path.exists(year_dir):
        os.makedirs(year_dir)

    for file in year_geotiffs:
        name = file[file.rfind('/')+1:file.rfind('.')]
        print(year, '-', name)
        stats = rs.zonal_stats(shape_frame, file, stats=output, all_touched=True)
        frame = pd.DataFrame.from_dict(stats)
        frame = frame.join(shape_frame['GEOID'])
        frame = frame.join(shape_frame['NAME'])
        frame.to_csv(year_dir + name + '.csv')


## Converting the CSVs to DataFrames

In [None]:
overwrite = False

out_dir = './air_temperature/dataframes/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)

for year in range(1999, 2016):
    files = glob.glob('./air_temperature/output/' + str(year) + '/*.csv')
    
    # Create the dataframe with the first file
    frame = pd.read_csv(files[0])
    frame = frame.drop(['Unnamed: 0', 'count'], axis=1)
    frame = frame.set_index('GEOID', drop=True)

    # Change 'mean' to day number
    day = files[0][files[0].rfind('/')+1:files[0].rfind('.')]
    frame = frame[['NAME', 'mean']].rename(columns={'mean': day})

    for file in files[1:]:
        day = file[file.rfind('/')+1:file.rfind('.')]

        sub_frame = pd.read_csv(file)
        sub_frame = sub_frame.set_index('GEOID', drop=True)

        # Rename by label rather than by position, so the column order in the CSV does not matter
        sub_frame = sub_frame.rename(columns={'mean': day})

        frame = frame.join(sub_frame[day])
       
    
    data_columns = frame.iloc[:,1:]
    data_columns.columns = data_columns.columns.astype(int)
    data_columns = data_columns.sort_index(axis=1)
    
    
    frame = frame.iloc[:,:1].join(data_columns)
    
    if not os.path.exists(out_dir + str(year) + '.pkl') or overwrite:
        frame.to_pickle(out_dir + str(year) + '.pkl')
        print('Writing', str(year) + '.pkl')
    else:
        print('Skipping', str(year) + '.pkl')

## 8) Selecting the Data from June 1 to August 1
This will be our summer data.
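
The column positions for June 1 and August 1 shift by one day in leap years. The following sketch derives them from the calendar instead of hard-coding them. The helper name `summer_column_slice` is hypothetical, and it assumes the frame starts with two non-data columns (`NAME` and `GEOID`) followed by one column per day in order.

```python
from datetime import date

def summer_column_slice(year, leading_columns=2):
    """Positional slice covering June 1 through August 1, inclusive."""
    start_doy = date(year, 6, 1).timetuple().tm_yday
    end_doy = date(year, 8, 1).timetuple().tm_yday
    # Day d sits at position d - 1 + leading_columns
    offset = leading_columns - 1
    return slice(start_doy + offset, end_doy + offset + 1)

for year in (1999, 2000):
    s = summer_column_slice(year)
    print(year, s.start, s.stop)
```

The slice could then be used as `frame.iloc[:, summer_column_slice(year)]`.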

In [None]:
output_dir = './county_data/dataframes/summer_data/'
if not os.path.exists(output_dir):
    print('Making', output_dir)
    os.makedirs(output_dir)  # Also creates missing parent directories

In [None]:
overwrite = True

for year in range(1999,2016):
    # Read the yearly frames that were saved in step 6
    frame = pd.read_pickle('./county_data/dataframes/all_data/' + str(year) + '.pkl')
    
    # LEAP YEAR: Our data is offset by one day
    if len(frame.columns) == 368:
        # Select Jun 1 to Aug 1
        # Day of Year Calendar: https://nsidc.org/data/tools/doy_calendar.html
        data = frame.iloc[:,154:216]
    else:
        data = frame.iloc[:,153:215]
        
    frame = frame.iloc[:,:2].join(data)
    frame.set_index('GEOID', drop=True, inplace=True)
    
    if not os.path.exists(output_dir + str(year) + '.pkl') or overwrite:
        print('Making', output_dir + str(year) + '.pkl')
        frame.to_pickle(output_dir + str(year) + '.pkl')
    

## 9) Filtering the Data by Temperature

Since we now consider precipitation over the entire year, we need to filter out the days on which the air temperature is at or below 0 °C, because precipitation on those days likely falls as snow.
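
The filtering idea can be sketched on toy arrays. This is a simplified numpy illustration with made-up values, not the notebook's pandas implementation: rainfall is kept where the temperature is above 273.15 K and replaced with NaN otherwise.

```python
import numpy as np

FREEZING_K = 273.15  # 0 degrees C in kelvin

# Toy data: 2 counties x 3 days
rainfall = np.array([[0.10, 0.00, 0.25],
                     [0.40, 0.05, 0.00]])
temperature = np.array([[280.0, 270.0, 273.15],
                        [265.0, 290.0, 275.0]])

# Keep rainfall only where the temperature is above freezing
filtered = np.where(temperature > FREEZING_K, rainfall, np.nan)
print(filtered)
```

Unlike the cell below, this sketch also sets dry days at or below freezing to NaN.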

In [None]:
overwrite = False

out_dir = './county_data/dataframes/filtered_freezing_data/'
if not os.path.exists(out_dir):
    print('Making', out_dir)
    os.makedirs(out_dir)


for year in range(1999,2016):
    rainfall_data = pd.read_pickle('./county_data/dataframes/all_data/' + str(year) + '.pkl').set_index('GEOID', drop=True)
    rainfall = rainfall_data.iloc[:,1:]
    temperature = pd.read_pickle('./air_temperature/dataframes/' + str(year) + '.pkl').iloc[:,1:]
    
    above_freezing = temperature > 273.15  # True if temperature > 0C, False otherwise
    
    sign = above_freezing.applymap(lambda x: 1 if x else -1)  # convert True to 1, False to -1
    
    data = rainfall * sign
    data = data.applymap(lambda x: np.nan if x < 0 else x)  # convert freezing temperatures to NaN
    data = data.applymap(lambda x: abs(x))  # to turn the -0.00 values to 0.00
    
    
    frame = rainfall_data.iloc[:,0].to_frame().join(data)
    
    if not os.path.exists(out_dir + str(year) + '.pkl') or overwrite:
        frame.to_pickle(out_dir + str(year) + '.pkl')
    else:
        print('Skipping', year)