# CMIP5 Data Processing

The purpose of this notebook is to process the downscaled CMIP5 collection of climate change simulations for historical moments. The data set extends over:

- 1/1/1951 through 12/31/2099

The 30-year focus period is:

- 1/1/2011 through 12/31/2040

CMIP5 model results have been downscaled to the NLDAS-2 grid.

## Imports and Parameters

In [1]:
# this tells Jupyter to embed matplotlib plots in the notebook
%matplotlib notebook

In [2]:
import numpy as np
import pandas as pd
import datetime as dt
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib
import shapely as sp
from matplotlib.collections import PatchCollection
from matplotlib.lines import Line2D
from shapely.geometry import Point
from shapely.geometry import Polygon
from IPython.display import display, HTML
import os
from copy import deepcopy
import pyodbc
import sqlalchemy

Custom python code module

In [3]:
import DBA_DClimComp as DBAD

Output directory

In [4]:
OUT_DIR = r'\\augustine.space.swri.edu\jdrive\Groundwater\R8937_Stochastic_CC_Recharge\Da' \
          r'ta\JNotes\Processed\CMIP5'

All of the CMIP5 data have been placed in a database.

For precipitation we use a wet-day threshold. A day counts as wet only when its precipitation depth equals or exceeds this threshold.

In [5]:
WD_THRESH = 0.2   # in mm
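The threshold test is inclusive, so a day with exactly 0.2 mm counts as wet. A minimal sketch of that rule follows; the helper name `is_wet` and the sample depths are illustrative assumptions and are not part of the notebook or of `DBA_DClimComp`.

```python
# Same threshold value as the notebook cell above (mm)
WD_THRESH = 0.2


def is_wet(depth_mm, thresh=WD_THRESH):
    """Hypothetical helper: True when the daily depth meets or exceeds the threshold."""
    return depth_mm >= thresh


# Made-up daily depths in mm, chosen to straddle the threshold
sample_depths = [0.0, 0.19, 0.2, 3.4]
wet_flags = [is_wet(d) for d in sample_depths]
num_wet = sum(wet_flags)
```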

## Processing and Aggregation to Monthly and Annual

Our first processing step for precipitation is to aggregate to monthly and calendar year annual values. These should be output to RData structures for seasonal and harmonic analysis in R, spreadsheets for manual examination, and Python pickle files for later loading into Jupyter notebooks.

Make the connection to the DB using a SQL Alchemy engine object

In [6]:
engine = sqlalchemy.create_engine( DBAD.DSN_STRING )

Acquire a Pandas DataFrame for our CMIP5 grid definition

In [7]:
GridSQL = DBAD.createSQLCMIP5Grid()
GridDF = pd.read_sql( GridSQL, engine, index_col=DBAD.FIELDN_ID )
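The `index_col` argument tells `pd.read_sql` to use a result column as the DataFrame index, which is why `Id` does not appear in the column list. The sketch below illustrates this with an in-memory SQLite database standing in for the project database; the table `demo_grid` and its columns are made up for illustration and do not reflect the real schema.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the project database (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_grid (Id INTEGER PRIMARY KEY, grid_row INTEGER, grid_col INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_grid VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 0, 1)],
)
conn.commit()

# index_col moves the Id column out of the columns and into the index
demo_grid_df = pd.read_sql(
    "SELECT Id, grid_row, grid_col FROM demo_grid ORDER BY Id",
    conn,
    index_col="Id",
)
conn.close()
```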

In [8]:
display( HTML( GridDF.to_html() ) )

Id,Row,Col,utm_x,utm_y,DS_Method,DS_Resolution
1,0,0,273143.21875,3282992.75,LOCA,1/16th degree
2,0,1,279194.40625,3282871.75,LOCA,1/16th degree
3,0,2,285245.4375,3282754.25,LOCA,1/16th degree
4,0,3,291296.34375,3282639.75,LOCA,1/16th degree
5,0,4,297347.15625,3282528.75,LOCA,1/16th degree
6,0,5,303397.8125,3282420.75,LOCA,1/16th degree
7,0,6,309448.34375,3282316.25,LOCA,1/16th degree
8,0,7,315498.78125,3282215.0,LOCA,1/16th degree
9,0,8,321549.09375,3282117.0,LOCA,1/16th degree
10,0,9,327599.28125,3282022.25,LOCA,1/16th degree


In [9]:
GridCols = list( GridDF.columns )
GridCols

['Row', 'Col', 'utm_x', 'utm_y', 'DS_Method', 'DS_Resolution']

Acquire a Pandas DataFrame for CMIP5 model definition

In [10]:
ModelSQL = DBAD.createSQLCMIP5Model()
ModelDF = pd.read_sql( ModelSQL, engine, index_col=DBAD.FIELDN_ID )

In [11]:
display( HTML( ModelDF.to_html() ) )

Id,Result_Ind,ICRunNum,Scenario_ID,CMIP5_Model
1,0,1,rcp45,access1-0
2,1,1,rcp45,access1-3
3,2,1,rcp45,bcc-csm1-1
4,3,1,rcp45,bcc-csm1-1-m
5,4,1,rcp45,canesm2
6,5,6,rcp45,ccsm4
7,6,1,rcp45,cesm1-bgc
8,7,1,rcp45,cesm1-cam5
9,8,1,rcp45,cmcc-cm
10,9,1,rcp45,cnrm-cm5


In [12]:
ModelCols = list( ModelDF.columns )
ModelCols

['Result_Ind', 'ICRunNum', 'Scenario_ID', 'CMIP5_Model']

Set the start and end dates, then loop over every model and grid cell combination: read the daily precipitation values for the specified time interval and generate monthly and calendar-year sums.

Save the contiguous wet and dry sequence lists in dictionaries keyed by a joint model and grid index.
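The monthly and calendar-year sums can be checked on a small synthetic series. The sketch below uses a constant 1 mm/day over two years, so each sum equals the number of days in the period. The data are made up; the `"YS"` alias is used here for year-start resampling, and older pandas versions also accept `"AS"` for the same frequency.

```python
import pandas as pd

# Synthetic daily series: 1 mm every day for 2011 and 2012 (2012 is a leap year)
days = pd.date_range("2011-01-01", "2012-12-31", freq="D")
daily_df = pd.DataFrame({"Precip_mmpd": 1.0}, index=days)

# Sum to month-start and year-start bins
monthly_df = daily_df.resample("MS").sum()
annual_df = daily_df.resample("YS").sum()

jan_2011 = monthly_df.loc[pd.Timestamp("2011-01-01"), "Precip_mmpd"]
feb_2012 = monthly_df.loc[pd.Timestamp("2012-02-01"), "Precip_mmpd"]
year_2012 = annual_df.loc[pd.Timestamp("2012-01-01"), "Precip_mmpd"]
```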

In [13]:
START_DT = dt.datetime( 2011, 1, 1, 0, 0, 0 )
END_DT = dt.datetime( 2040, 12, 31, 0, 0, 0 )

In [14]:
DryDict = dict()
WetDict = dict()

In [15]:
NumGPts = len( GridDF )
NumMods = len( ModelDF )
NumMods, NumGPts

(196, 210)

In [16]:
for jJ in range(1, (NumMods + 1)):
    ModInd = jJ
    for iI in range(1, (NumGPts + 1)):
        GridInd = iI
        GridUTMX = float( GridDF.at[iI, GridCols[2]] )
        GridUTMY = float( GridDF.at[iI, GridCols[3]] )
        # make our joint index
        JntInd = "M%d_%d" % (ModInd, GridInd)
        # now are ready to get our precipitation values
        PreSQL = DBAD.createSQLCMIP5Pre( START_DT, END_DT, iI, jJ )
        PreDF = pd.read_sql( PreSQL, engine, index_col=DBAD.FIELDN_STRDT, 
                             parse_dates=[DBAD.FIELDN_STRDT] )
        if len( PreDF ) < 1:
            #print("No values for %s" % JntInd)
            continue
        PreDF.index.name = DBAD.FIELDN_DT
        PreDF.index = PreDF.index.tz_convert( None )
        # now do the resample 
        MonDF = PreDF.resample( 'MS', axis=0, closed='left', label='left' ).sum()
        AnnDF = PreDF.resample( 'AS', axis=0, closed='left', label='left' ).sum()
        # change the column names
        MonDF.columns = ["Precip_mm"]
        AnnDF.columns = ["Precip_mm"]
        # now make our appends
        GMonDF = MonDF.copy()
        GMonDF.columns = [JntInd]
        GAnnDF = AnnDF.copy()
        GAnnDF.columns = [JntInd]
        # now check where we are
        if (iI == 1) and (jJ == 1):
            AllMonDF = GMonDF.copy()
            AllAnnDF = GAnnDF.copy()
        else:
            AllMonDF = AllMonDF.merge( GMonDF, how='inner', left_index=True, right_index=True)
            AllAnnDF = AllAnnDF.merge( GAnnDF, how='inner', left_index=True, right_index=True)
        # the resampling is done, so now go through the daily values and count the
        #  contiguous wet days and contiguous dry days. Also track the start date of
        #  each contiguous series and, for wet series, the total depth and the daily
        #  depths within the series.
        cNumDays = len( PreDF )
        inWet = False
        inDry = False
        cWetCnt = 0
        cDryCnt = 0
        DryList = list()
        WetList = list()
        for dD in range( cNumDays ):
            cTSInd = PreDF.index[dD]
            cDT = dt.datetime( cTSInd.year, cTSInd.month, cTSInd.day )
            if dD == 0:
                cWStartDT = cDT
                cDStartDT = cDT
            cPDepth = float( PreDF.at[cTSInd,'Precip_mmpd'] )
            if cPDepth >= WD_THRESH:
                # this is the wet day case
                if inWet:
                    cWetCnt += 1
                    totPrecip += cPDepth
                    dayPreL.append( cPDepth )
                else:
                    inWet = True
                    inDry = False
                    cWStartDT = cDT
                    cWetCnt = 1
                    dayPreL = [ cPDepth ]
                    totPrecip = cPDepth
                    if dD > 0:
                        DryList.append( [ cDStartDT, cDryCnt ] )
                        cDryCnt = 0
            else:
                # this is the dry day case
                if inDry:
                    cDryCnt += 1
                else:
                    inWet = False
                    inDry = True
                    cDStartDT = cDT
                    cDryCnt = 1
                    if dD > 0:
                        WetList.append( [ cWStartDT, cWetCnt, totPrecip, dayPreL ] )
                        cWetCnt = 0
                        totPrecip = 0.0
                        dayPreL = list()
            # end of outer depth if
        # end of the day for
        # check for the last entry
        if inWet:
            WetList.append( [ cWStartDT, cWetCnt, totPrecip, dayPreL ] )
        else:
            DryList.append( [ cDStartDT, cDryCnt ] )
        # add our state analysis lists to our dictionaries
        DryDict[JntInd] = DryList
        WetDict[JntInd] = WetList
    # end of inner for loop
# end of outer for loop
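The loop above grows `AllMonDF` and `AllAnnDF` by one inner merge per model and grid combination, and it only initializes them when the first combination returns data. One possible alternative, sketched here with made-up values and column names, is to collect the single-column frames in a list and join them once with `pd.concat`. With unique indices, an inner join along `axis=1` keeps the dates common to all pieces, which is intended to match the effect of the successive inner merges.

```python
import pandas as pd

# Shared month-start index for the toy example
idx = pd.date_range("2011-01-01", periods=4, freq="MS")

# Collect one single-column frame per (model, grid) label; values are made up
pieces = []
for label, scale in [("M1_1", 1.0), ("M1_2", 2.0), ("M2_1", 3.0)]:
    piece = pd.DataFrame({label: [scale * v for v in (1.0, 2.0, 3.0, 4.0)]}, index=idx)
    pieces.append(piece)

# One join at the end instead of one merge per column
all_mon_demo = pd.concat(pieces, axis=1, join="inner")
```

Because nothing is merged until the list is complete, no special case is needed for the first combination.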

We are now ready to output our various items. Do the pickle files first.

In [17]:
#display( HTML( AllMonDF.head().to_html() )) 

In [18]:
#len( AllMonDF.columns.tolist() ), len( AllMonDF )

In [19]:
#len( AllAnnDF.columns.tolist() ), len( AllAnnDF )

In [20]:
MonPCKF = os.path.normpath( os.path.join( OUT_DIR, "AllMonth_2011-2040.pickle" ) )
AllMonDF.to_pickle( MonPCKF )
AnnPCKF = os.path.normpath( os.path.join( OUT_DIR, "AllYears_2011-2040.pickle" ) )
AllAnnDF.to_pickle( AnnPCKF )
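A later notebook can reload these files with `pd.read_pickle`. The sketch below round-trips a small made-up DataFrame through a temporary directory rather than `OUT_DIR`, so the file name and values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the monthly aggregate table
idx = pd.date_range("2011-01-01", periods=3, freq="MS")
demo_df = pd.DataFrame({"M1_1": [10.0, 0.0, 5.5]}, index=idx)

# Write to a temporary location and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.normpath(os.path.join(tmp_dir, "demo.pickle"))
    demo_df.to_pickle(demo_path)
    reloaded_df = pd.read_pickle(demo_path)
```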

Next use the feather library for R compatibility

In [21]:
FAllMonDF = AllMonDF.copy()
FAllAnnDF = AllAnnDF.copy()
FAllMonDF = FAllMonDF.reset_index()
FAllAnnDF = FAllAnnDF.reset_index()

In [22]:
MonFeatherF = os.path.normpath( os.path.join( OUT_DIR, "AllMonth_2011-2040.feather" ) )
FAllMonDF.to_feather( MonFeatherF )
AnnFeatherF = os.path.normpath( os.path.join( OUT_DIR, "AllYears_2011-2040.feather" ) )
FAllAnnDF.to_feather( AnnFeatherF )

Finally, output to a spreadsheet. Transpose the DataFrames first so that they are easier to process in Excel.

In [23]:
AllMonDF = AllMonDF.transpose()
AllAnnDF = AllAnnDF.transpose()

In [24]:
OutXLSX = os.path.normpath( os.path.join( OUT_DIR, "Precip_Agg_2011-2040.xlsx" ) )
with pd.ExcelWriter(OutXLSX) as writer:
    GridDF.to_excel( writer, sheet_name="Grid_Metadata", na_rep=str(np.nan),
                     index=True, index_label="Id" )
    AllMonDF.to_excel( writer, sheet_name="Monthly", na_rep=str(np.nan),
                       index=True, index_label="Model_and_Grid" )
    AllAnnDF.to_excel( writer, sheet_name="Annual", na_rep=str(np.nan),
                       index=True, index_label="Model_and_Grid" )
# end of with and write output

In [25]:
# delete some data frames to help with memory issues
del AllMonDF
del AllAnnDF
del FAllAnnDF
del FAllMonDF

## Wet and Dry Days

While the monthly and annual aggregation was being completed, we also collated the contiguous wet and dry day counts. These are stored in dictionaries and need to be processed into DataFrames so that we can work with them further.
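The spell bookkeeping done inside the aggregation loop can also be expressed with `itertools.groupby`, which groups consecutive days that share the same wet or dry state. This is a compact sketch rather than the notebook's own code: the function name `spell_runs` and the sample data are hypothetical, while the list layouts mirror the entries stored in `WetDict` and `DryDict`.

```python
import datetime as dt
from itertools import groupby


def spell_runs(dates, depths, thresh=0.2):
    """Hypothetical helper: split a daily series into contiguous wet and dry runs.

    Wet entries are [start date, day count, total depth, list of daily depths];
    dry entries are [start date, day count].
    """
    wet_runs, dry_runs = [], []
    paired = zip(dates, depths)
    for is_wet, grp in groupby(paired, key=lambda p: p[1] >= thresh):
        grp = list(grp)
        start = grp[0][0]
        if is_wet:
            vals = [depth for _, depth in grp]
            wet_runs.append([start, len(grp), sum(vals), vals])
        else:
            dry_runs.append([start, len(grp)])
    return wet_runs, dry_runs


# Made-up six-day series in mm
demo_dates = [dt.datetime(2011, 1, 1) + dt.timedelta(days=i) for i in range(6)]
demo_depths = [0.0, 0.5, 1.0, 0.1, 0.0, 0.3]
wet_runs, dry_runs = spell_runs(demo_dates, demo_depths)
```

The wet and dry counts together should add up to the number of days in the series, which is a useful sanity check on either implementation.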

Determine the maximum number of wet days

In [26]:
AllKeys = sorted( WetDict.keys() )
len( AllKeys )

16296

In [27]:
AllKeys == sorted( DryDict.keys() )

True

In [28]:
MaxWetDays = 0
TotWetSeqs = 0
for tKey in AllKeys:
    TotWetSeqs = TotWetSeqs + len( WetDict[tKey] )
    # default=0 guards against a model and grid combination with no wet sequences
    NewWetDays = max( (x[1] for x in WetDict[tKey]), default=0 )
    if NewWetDays > MaxWetDays:
        MaxWetDays = NewWetDays
# end of for
MaxWetDays, TotWetSeqs

(59, 22545607)
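These two numbers give a rough lower bound on memory use for the wide wet-day layout built further down, where every wet sequence gets a zero-padded `float32` value for each of the `MaxWetDays` day columns. The arithmetic below counts only those day columns; the identifier, date, and count columns, plus any temporary copies made during concatenation, would add to it. This may help explain the `MemoryError` reported later in the notebook, although the actual cause was not diagnosed here.

```python
# Values reported by the cell above
TOT_WET_SEQS = 22_545_607
MAX_WET_DAYS = 59
BYTES_PER_FLOAT32 = 4

# Size of the zero-padded Day_1..Day_59 block alone
wide_bytes = TOT_WET_SEQS * MAX_WET_DAYS * BYTES_PER_FLOAT32
wide_gib = wide_bytes / 2**30
```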

Go through our dictionaries and create DataFrames for each grid cell and then concatenate these all together.

In [29]:
DDFList = list()
WDFList = list()

In [30]:
for tKey in AllKeys:
    tDryList = DryDict[tKey]
    dNEnts = len( tDryList )
    DataDict = { "MGrid_Id" : [ tKey for x in range(dNEnts) ],
                 "Year" : [x[0].year for x in tDryList],
                 "Month" : [x[0].month for x in tDryList],
                 "Day" : [x[0].day for x in tDryList],
                 "Dry_Count" : [x[1] for x in tDryList], }
    tDryDF = pd.DataFrame( data=DataDict )
    #tDryDF["Month"] = tDryDF.apply( lambda row: ExIntMonth( row["Start_Date"] ), axis=1 )
    DDFList.append( tDryDF )
    tWetList = WetDict[tKey]
    wNEnts = len( tWetList )
    WDaysArray = np.zeros( (wNEnts, MaxWetDays), dtype=np.float32 )
    # fill in the wet days array
    for iI in range(wNEnts):
        wdsList = tWetList[iI][3]
        cNDays = len( wdsList )
        for jJ in range( cNDays ):
            cdDep = wdsList[jJ]
            WDaysArray[iI, jJ] = cdDep
        # end of days for
    # end of rows for
    # now can create our DataFrame
    DataDict = { "MGrid_Id" : [ tKey for x in range(wNEnts) ],
                 "Year" : [x[0].year for x in tWetList],
                 "Month" : [x[0].month for x in tWetList],
                 "Day" : [x[0].day for x in tWetList],
                 "Wet_Count" : [x[1] for x in tWetList], 
                 "Total_Depth" : [x[2] for x in tWetList], }
    for dD in range(1, (MaxWetDays + 1)):
        DayLabel = "Day_%d" % dD
        DataDict[DayLabel] = WDaysArray[:, (dD-1)]
    # end of day label for
    tWetDF = pd.DataFrame( data=DataDict )
    #tWetDF["Month"] = tWetDF.apply( lambda row: ExIntMonth( row["Start_Date"] ), axis=1 )
    WDFList.append( tWetDF )
# end of outer for

In [31]:
WetDayDF = pd.concat( WDFList, ignore_index=True )
DryDayDF = pd.concat( DDFList, ignore_index=True )

MemoryError: 
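One possible way to reduce the memory footprint is a long layout with one row per wet day, which avoids the zero padding out to `MaxWetDays` columns. The sketch below shows the conversion on a made-up wet list with the same entry structure as `WetDict`; the column names are assumptions, and whether the full data set would then fit in memory on a given machine has not been tested.

```python
import datetime as dt

import pandas as pd

# Made-up wet sequences: [start date, day count, total depth, daily depths]
demo_wet_list = [
    [dt.datetime(2011, 1, 3), 2, 1.5, [0.5, 1.0]],
    [dt.datetime(2011, 1, 9), 1, 0.3, [0.3]],
]

# One row per wet day instead of one padded row per sequence
rows = []
for seq_id, (start, count, total, depths) in enumerate(demo_wet_list):
    for day_num, depth in enumerate(depths, start=1):
        rows.append(("M1_1", seq_id, start, day_num, depth))

long_wet_df = pd.DataFrame(
    rows, columns=["MGrid_Id", "Seq_Id", "Start_Date", "Day_Num", "Depth_mm"]
)
long_wet_df["Depth_mm"] = long_wet_df["Depth_mm"].astype("float32")
```

Sequence totals and counts can be recovered from this layout with a `groupby` on `MGrid_Id` and `Seq_Id`.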

In [None]:
len( WetDayDF ), len( DryDayDF )

In [None]:
display( HTML( WetDayDF.head().to_html() ) )

In [None]:
display( HTML( WetDayDF.tail().to_html() ) )

In [None]:
display( HTML( DryDayDF.head().to_html() ) )

In [None]:
DryDayDF['Dry_Count'].sum() + WetDayDF['Wet_Count'].sum()

Now output to pickle files in case we need to reload.

In [None]:
DryPCKF = os.path.normpath( os.path.join( OUT_DIR, "DryDays_2011-2040.pickle" ) )
DryDayDF.to_pickle( DryPCKF )
WetPCKF = os.path.normpath( os.path.join( OUT_DIR, "WetDays_2011-2040.pickle" ) )
WetDayDF.to_pickle( WetPCKF )

And output to feather format for integration with R.

In [None]:
DryFeatherF = os.path.normpath( os.path.join( OUT_DIR, "DryDays_2011-2040.feather" ) )
DryDayDF.to_feather( DryFeatherF )
WetFeatherF = os.path.normpath( os.path.join( OUT_DIR, "WetDays_2011-2040.feather" ) )
WetDayDF.to_feather( WetFeatherF )