# Python Learn by Doing: Climate Change Indicators

Developed By: Dr. Kerrie Geil, Mississippi State University

Date: January 2024

Requirements: list space, RAM, and package requirements

Link: notebook available to download at 

<u> Description </u>

This notebook helps the learner build intermediate python programming skills through data query, manipulation, analysis, and visualization. Learning will be centered around computing ETCCDI climate change indices and determining the statistical significance of change. The notebook is aimed at learners who already have some knowledge of programming and statistics. 

<u> Summary of Contents </u>

put an outline of tasks/skills here

-----

# Collection of Useful Links

- [documentation page for every version of Python](https://www.python.org/doc/versions/)
- every version of Python also includes a tutorial e.g. [The Python Tutorial v3.12.3](https://docs.python.org/release/3.12.3/tutorial/index.html)
- [documentation on climate change indicators from the Expert Team on Climate Change Detection and Indices (ETCCDI)](https://etccdi.pacificclimate.org/index.shtml)
- [account registration page](https://cds.climate.copernicus.eu/user/register?destination=%2Fcdsapp%23!%2Fhome) for the [climate data store](https://cds.climate.copernicus.eu/cdsapp#!/home) where you can find tons of free research quality climate data
- [xarray documentation](https://docs.xarray.dev/en/stable/) which includes the api reference, getting started guide, user guide, and developer info
- [kerrie's github repo](https://github.com/kerriegeil/MSU_py_training) is the current location where this notebook lives and receives updates; this may eventually move to an MSU enterprise repo
- [temporary link to datasets]() used in this notebook, we're still working on a permanent solution for hosting the data and notebooks but data should be available at this link until 1 June 2024

# Introduction to Climate Change Indicators

Climate change indicators (also known as climate change indices) are quantitative measures of some aspect of the climate that can be tracked over time and used to robustly assess climate changes. Good climate change indicators use high quality observational data that is consistent over time (or has been adjusted/homogenized in time <sup>1</sup>) and incorporate statistical methods, if necessary, that reduce the impact of possible data inconsistencies. 

In this notebook we'll look at a few climate change indicators developed by the [Expert Team on Climate Change Detection and Indices (ETCCDI)](https://etccdi.pacificclimate.org/index.shtml), which is a sub-project of [Climate and Ocean Variability, Predictability and Change (CLIVAR)](https://www.clivar.org/), one of the core projects of the [World Climate Research Programme (WCRP)](https://www.wcrp-climate.org/).

The ETCCDI developed a set of [27 core indices](https://etccdi.pacificclimate.org/list_27_indices.shtml) using temperature and precipitation observational data. We'll calculate 6 of these:
- Monthly Maximum Value of Daily Minimum Temperature (TNx)
- Annual Total Precip Amount Over 99th Percentile of Wet Days (R99pTOT)
- Monthly Maximum Consecutive 5-day Precipitation (Rx5day)
- Maximum Length of Consecutive Dry Days (CDD)
- Growing Season Length (GSL)
- Warm Spell Duration Index (WSDI)

**Disclaimer:** This notebook is intended for python programming learning only. The data quality checking and calculation of ETCCDI climate change indices in this notebook may differ from the ETCCDI published instructions for simplicity and/or relevance to our learning goals. Learners wanting to compute the indices according to the exact ETCCDI instructions should consult the [ETCCDI documentation](https://etccdi.pacificclimate.org/index.shtml). The documentation suggests using the [RClimDex software package](https://github.com/ECCC-CDAS/RClimDex.git) written in R to calculate ETCCDI climate change indices. Another option would be to use pre-calculated indices based on multiple gridded datasets available at [climdex.org](https://www.climdex.org/), where you can also find a similar software package for calculating the indices on a dataset of your choice.   


Footnotes:
1. Time-dimension inhomogeneities in meteorological observations are due to things like station relocation and changes in instrumentation. For more info, ETCCDI has compiled examples of these inhomogeneities in [Classic Examples of Inhomogeneities in Climate Datasets](https://etccdi.pacificclimate.org/docs/Classic_Examples.pdf)



# Obtaining the climate data

For the climate change indices covered in this notebook we will need the following data over many years:

variable abbrev. | description | frequency | units 
---|---|---|---
tmin | minimum surface air temperature | daily | C 
tmax | maximum surface air temperature | daily | C 
prcp | accumulated precipitation | daily | mm/day 

We'll use the AgERA5 dataset developed by the [European Centre for Medium-Range Weather Forecasts (ECMWF)](https://www.ecmwf.int/) available from the [Copernicus Climate Data Store (CDS)](https://cds.climate.copernicus.eu/cdsapp#!/dataset/sis-agrometeorological-indicators?tab=overview). A zipped archive of the data files is temporarily available during the workshop at [data.zip](). If you haven't already, download and unzip the data into the same directory where you have this notebook saved. 

The files in data.zip have been pre-processed to save time during the workshop. The pre-processing steps that have been completed for you include:
1) Using the Climate Data Store (CDS) API to download AgERA5 precipitation and 2m min and max temperature. You need a free CDS account and to install the cdsapi python package as well as python dask to do this. The data downloads as one .tar.gz file per year per variable.
2) Unpacking the .tar.gz files. Each g-zipped archive unpacks to one netcdf file per day per variable (many files).
3) Converting temperature units from K to C
4) Converting longitude coordinates from 0 to 360 to -180 to 180
5) Consolidating all the daily files into one single netcdf file per variable that contains all times.

We won't cover this process in further detail here, but if you are interested, the scripts used to pre-process the data are available on Kerrie's github. Steps 1-2 are performed in [get_AgERA5_daily_parallel.py]() and steps 3-5 are performed in [prep_AgERA5_daily.ipynb]().
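For a feel of what pre-processing steps 3 and 4 involve, here is a minimal sketch on a tiny synthetic array. It is not taken from the actual pre-processing scripts; the variable names (`t2m_k`, `t2m_c`) and the values are made up for illustration.

```python
import numpy as np
import xarray as xr

# synthetic temperature in kelvin on a coarse 0..360 longitude grid (made-up values)
t2m_k = xr.DataArray(
    np.array([273.15, 283.15, 293.15, 303.15]),
    dims="lon",
    coords={"lon": [0.0, 90.0, 180.0, 270.0]},
    attrs={"units": "K"},
)

# step 3: kelvin to degrees Celsius
# set the units attribute explicitly so it matches the new values
t2m_c = t2m_k - 273.15
t2m_c.attrs["units"] = "C"

# step 4: wrap longitudes from 0..360 to -180..180,
# then sort so the longitude coordinate increases monotonically again
lon_wrapped = ((t2m_c.lon + 180) % 360) - 180
t2m_c = t2m_c.assign_coords(lon=lon_wrapped).sortby("lon")

print(t2m_c.lon.values)
print(t2m_c.values)
```

Sorting after the wrap matters because label-based slicing (for example `.sel(lon=slice(-100, -80))`) expects a monotonic coordinate.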

# Importing Python Packages and Defining Your Workspace


In [1]:
# importing all the python packages we will need here

import os
from urllib.request import urlretrieve
import xarray as xr
import numpy as np
import pandas as pd

import numpy.testing as npt
import warnings

import matplotlib.pyplot as plt
from collections import OrderedDict
# import gzip
# import shutil

# import pandas as pd

In [2]:
# learners need to update these paths to reflect locations on their own computer/workspace

# path to your working directory (where this notebook is on your computer)
work_dir = r'C://Users/kerrie/Documents/01_LocalCode/repos/MSU_py_training/learn_by_doing/climate_change_indicators/' 
# work_dir = r'C://Users/kerrie.WIN/Documents/code/MSU_py_training/learn_by_doing/climate_change_indicators/' 

# path to where you'll download and store the data files
data_dir = r'C://Users/kerrie/Documents/02_LocalData/tutorials/AgERA5_daily/'
# data_dir=r'C://Users/kerrie.WIN/Documents/data/AgERA5/'

# path to write output files and figures (be sure you have write privileges to this location)
output_dir = r'C://Users/kerrie/Documents/01_LocalCode/repos/MSU_py_training/learn_by_doing/climate_change_indicators/outputs/'
# output_dir = r'C://Users/kerrie.WIN/Documents/code/MSU_py_training/learn_by_doing/climate_change_indicators/outputs/'


# create directories if they don't exist already
# if not os.path.exists(work_dir):
#     os.makedirs(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

In [3]:
filenames = [data_dir+'prcp_AgERA5_Starkville_Daily_1979-2023.nc',
            data_dir+'tmax_AgERA5_Starkville_Daily_1979-2023.nc',
            data_dir+'tmin_AgERA5_Starkville_Daily_1979-2023.nc']

# Introduction to python xarray 

"Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw Numpy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience"

   -[xarray documentation](https://docs.xarray.dev/en/stable)

This package is most powerful when working with multidimensional data in netcdf format and especially with netcdfs that are written using Climate and Forecast Metadata Conventions ([CF Metadata Conventions](https://cfconventions.org)). Xarray can also handle zarr, tiff, csv, hdf, and grib files but may require additional dependencies to be installed. 

Xarray builds upon Numpy, Pandas, Scipy, netCDF4, Matplotlib and more. It also integrates well with Dask for parallel computing. Xarray is under active community development and pushes new updates approximately monthly.

The core data structures used by xarray are called the **DataArray** and the **Dataset**. DataArrays are N-dimensional arrays of a single data variable with dimension, coordinate, and attribute labels. Datasets contain one or more DataArrays that share one or more dimensions and coordinates. Each variable in a Dataset has its own attributes and the Dataset itself can have its own attributes as well (which come from the file attributes in each netcdf). 
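To make the DataArray idea concrete before we open a real file, here is a small sketch that builds one by hand. The values and the name `demo` are invented for illustration; the dimension names mirror the ones we will see in the precipitation file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# four days of made-up precipitation at a single grid point
time = pd.date_range("2000-01-01", periods=4, freq="D")
demo = xr.DataArray(
    np.array([[[0.0]], [[2.5]], [[0.0]], [[10.2]]]),  # shape (time=4, lat=1, lon=1)
    dims=("time", "lat", "lon"),                        # dimension labels
    coords={"time": time, "lat": [33.5], "lon": [-88.8]},  # coordinate labels
    attrs={"units": "mm d-1"},                          # attribute labels
    name="prcp",
)

print(demo.dims)
print(demo.shape)
print(demo.attrs["units"])
```

The array of numbers is ordinary numpy underneath; everything else (dims, coords, attrs, name) is the labeling that xarray adds on top.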

Details of all xarray functions (including what parameters to include as function inputs and what each function returns) can be found in the [xarray API reference](https://docs.xarray.dev/en/stable/api.html). Xarray has pretty great documentation with usage examples; definitely check their [getting started](https://docs.xarray.dev/en/stable/getting-started-guide/index.html) and [user guide](https://docs.xarray.dev/en/stable/user-guide/index.html) documentation for help as you are learning. If you are stuck on something, stack overflow and xarray's issue documentation on github are also useful. I personally often end up at those sites from google searches "python xarray how to ___". 

Let's read in precipitation data from a netcdf file and look at some xarray capabilities.


    

In [7]:
# the netcdf files we're using in this notebook have 1 data variable per file
# read data into a DataArray structure

pr = xr.open_dataarray(filenames[0])
pr

With xarray, when we print a variable, instead of getting the values of that variable what we get (usually) is a view of all the metadata labels that are attached to the variable. The information above shows us that our pr data is the daily total precipitation aggregated from midnight to midnight local time each day, has units of mm per day, and is called 'prcp' in the netcdf file.  

We can also see that the data has 3 dimensions (time, lat, lon), the length of each dimension, and that each dimension is a "coordinate"; coordinates are essentially additional labels. Click on the paper and data stack icons to the right of each coordinate. Using the paper icon, you can see that each coordinate has its own attributes (standard_name, units, etc.). Using the data stack icon, you can see that each coordinate is also an array of values, similar to an index in Pandas. The beauty of coordinates is that they allow us to easily select a subset of the data variable using labels that correspond to the coordinate values. 

Definitions for xarray terminology such as DataArray, Dataset, variable, dimension, coordinate, attribute can all be found in xarray's user guide on the [xarray terminology page](https://docs.xarray.dev/en/stable/user-guide/terminology.html).

Let's start looking at how to access labels and select/manipulate our data using labels. 

In [9]:
# access the coordinate information (labels, values, attributes) attached to a variable
# same can be done with lon and time
# this is the same info we can see above by clicking on the coordinate icons
pr.lat

In [11]:
# attributes are always stored as dictionaries which contain items in key:value pairs
# access all the attributes attached to a variable or a coordinate with .attrs

# see dictionary of attributes attached to the variable pr
print(pr.attrs)

# see dictionary of attributes attached to the coordinate lat
pr.lat.attrs

{'long_name': 'Total precipitation (00-00LT)', 'units': 'mm d-1', 'temporal_aggregation': 'Sum 00-00LT', 'standard_name': 'precipitation'}


{'standard_name': 'latitude',
 'long_name': 'latitude',
 'units': 'degrees_north',
 'axis': 'Y'}

Dictionaries are not xarray-specific; they are a built-in Python data structure. See the Python tutorial sections on [data structures](https://docs.python.org/3/tutorial/datastructures.html#) and [dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).

In [13]:
# since attributes are stored in dictionaries we can use dictionary syntax to access the items 

# list all the pr attribute names (keys) present in the dictionary
print(pr.attrs.keys())

# access the value of the pr 'units' attribute 
print(pr.attrs['units'])

# access the value of the latitude 'units' attribute
pr.lat.attrs['units']

dict_keys(['long_name', 'units', 'temporal_aggregation', 'standard_name'])
mm d-1


'degrees_north'

Now let's start selecting data from the pr array based on its time coordinate labels.

This particular data consists of only 1 data point at lat=33.5, lon=-88.8 (Starkville, MS). We don't need these singleton dimensions so let's remove them. Notice below the singleton dimensions are removed from the variable, but the lat/lon coordinate metadata sticks around in case we need it later (this would be useful, for example, if we wanted to plot the lat/lon location on a map).

In [14]:
# remove singleton dimensions
pr=pr.squeeze()
pr

In [15]:
# select data for 1 May 1982
pr.sel(time='1982-05-01')

In [16]:
# select data for all of 2020
pr.sel(time=slice('2020-01-01','2020-12-31'))

In [19]:
# what if I want to see the actual data value on a particular day, not just the metadata?
# .data on an xarray object will return the underlying numpy array of data without labels
pr.sel(time='1982-05-01').data

array(0.27938843, dtype=float32)

In [20]:
# add data from multiple days together
# sum all precip in 2020
pr.sel(time=slice('2020-01-01','2020-12-31')).sum('time')

What if we want to select data from the array based on some sort of criteria? .where is a great solution

In [22]:
# selecting with .where
# find pr > 3 inches/day (3 inches/day * 2.54 cm/inch * 10 mm/cm = 76.2 mm/day)

# select pr where greater than 76.2 and drop all other days from the array
pr.where(pr>76.2,drop=True)

In [23]:
# if you don't set drop=True xarray keeps the data dimensions 
# and automatically fills the days that don't meet the criteria with nan
pr.where(pr>76.2)

In [24]:
# do the same thing but return the dates that meet the criteria, not the pr values
times_heavy=pr.where(pr>76.2,drop=True).time
times_heavy

In [25]:
# you can also select from a DataArray using a list or array of labels
# select from pr with an array of time labels
pr.sel(time=times_heavy)
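The difference between `.where` with and without `drop=True` is easiest to see on a handful of values. The sketch below uses five made-up daily totals (the name `demo` is hypothetical) and the same 3 inch/day threshold as above.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2021-01-01", periods=5, freq="D")
demo = xr.DataArray([0.0, 80.0, 12.5, 100.0, 3.0], dims="time", coords={"time": time})

# 3 inches/day * 2.54 cm/inch * 10 mm/cm = 76.2 mm/day
threshold_mm = 3 * 2.54 * 10

# without drop: same length as the input, days that fail the test become nan
masked = demo.where(demo > threshold_mm)

# with drop=True: only the days that pass the test remain
dropped = demo.where(demo > threshold_mm, drop=True)

print(masked.values)
print(dropped.values)
print(dropped.time.values)
```

Keeping the full length (no drop) is handy when the result has to line up with other arrays on the same time axis; dropping is handy when you only want the list of events.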

If the time coordinate consists of datetime objects like we have here, you can easily resample or group the data if needed.

In [26]:
# resample to annual sum 
# and make the time coordinate label the last day of the year 
# (YE = year end)
pr_annual = pr.resample(time='YE').sum()
pr_annual

In this case some of our variable attributes may no longer make sense. If we want, we can update them

In [27]:
# modifying attributes

# replace an attribute value
pr_annual.attrs['units']='mm/year'

# delete a whole attribute
del pr_annual.attrs['temporal_aggregation']

# knock the last 10 characters off an attribute
pr_annual.attrs['long_name']=pr_annual.attrs['long_name'][:-10]

# look at our new attributes
pr_annual

In [28]:
# another resample example
# resample to seasonal (3-month) sums
# 'QS' means make the time label for each season at the start of the season
# 'JAN' means make the seasonal aggregates JFM, AMJ, JAS, OND
pr.resample(time='QS-JAN').sum()
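The frequency string controls where the bins start. As a sketch on synthetic data (1 mm on every day of 2001, so each sum is just the number of days in the bin), the example below compares monthly bins with `'QS-DEC'`, which anchors the quarters on December and so gives the meteorological seasons DJF, MAM, JJA, SON. The name `demo` is hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 2001 is not a leap year, so there are 365 daily values
time = pd.date_range("2001-01-01", "2001-12-31", freq="D")
demo = xr.DataArray(np.ones(time.size), dims="time", coords={"time": time})

# monthly sums, labeled at the start of each month
monthly = demo.resample(time="MS").sum()

# seasonal sums with quarters starting in Dec, Mar, Jun, Sep
seasonal = demo.resample(time="QS-DEC").sum()

print(monthly.values)
print(seasonal.time.values)
print(seasonal.values)
```

Notice that the first and last seasonal bins are incomplete: the first contains only Jan and Feb 2001, and the last contains only Dec 2001. With real data you would usually want to discard or flag partial seasons like these before interpreting them.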

.resample and .groupby can be used to easily calculate anomalies

We make monthly pr anomalies by calculating monthly mean pr minus the long term mean pr for each month (sometimes referred to as the climatology).

This means, for example, Jan 1979 mean pr - Jan (1979-2023) mean pr, Feb 1979 mean pr - Feb (1979-2023) mean pr, and so on for each monthly data point  


In [None]:
# first we need the monthly mean pr values, use .resample
pr_monthly=pr.resample(time='MS').mean()
pr_monthly

We'll use .groupby for the rest of the calculation. Before we proceed, let's take a look at .groupby. If you are familiar with Pandas, this will look very similar since xarray .groupby is based on Pandas .groupby

In [None]:
# .groupby returns an object that can be iterated over in the form (label,grouped data) pairs
pr_monthly.groupby(pr_monthly.time.dt.month)

If we think in terms of xarray dimensions and coordinates, this is similar to getting a new dimension named month that is also a coordinate with values/labels 1-12.

In [None]:
# how to access the group labels
pr_monthly.groupby(pr_monthly.time.dt.month).groups.keys()

In [None]:
# how to access the array of data assigned to a group label
# the following shows the data associated with month group 5
label=5
pr_monthly.groupby(pr_monthly.time.dt.month)[label]

If you look at all the time labels (data stack icon for the time coordinate) you'll see this group is all the May data values in the pr_monthly timeseries. 45 years worth of May values. We can calculate the long term May mean pr by taking the mean of this group of data

In [None]:
pr_monthly.groupby(pr_monthly.time.dt.month)[label].mean()

We can calculate the long term mean for each month of the year by excluding the [label]. The value of the 5th number in the returned array below should be the same as above. Notice also below the new dimension and coordinate named 'month' and its coordinate labels 1-12. 

In [None]:
pr_monthly.groupby(pr_monthly.time.dt.month).mean()

These new metadata labels from .groupby are what makes .groupby so "smart" when we actually compute the anomalies. We'll be taking each monthly mean pr value in the timeseries and subtracting the appropriate long term mean value, all without any loops to slow us down.

In [None]:
# calculate anomalies
# the first term is 12 groups (months) each containing 45 data values (1 value per year)
# the second term is 12 groups (months) each containing 1 value (long term monthly mean) 

# based on the labels present, for each group .groupby is smart enough to broadcast 
# the 1 value in the second term to each of the 45 values in the first term
# in other words, the appropriate long term mean value is subtracted from every monthly data value in the time series
# no loops (or nested loops) needed

pr_anom= pr_monthly.groupby(pr_monthly.time.dt.month) - pr_monthly.groupby(pr_monthly.time.dt.month).mean()
pr_anom
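If the broadcasting in the anomaly calculation feels like magic, it can help to try it on a series where the answer is known in advance. In this sketch the synthetic monthly value is the month number plus a yearly offset of -0.5, 0, or +0.5, so the climatology should be 1 to 12 and the anomalies should be exactly the yearly offsets. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# three years of monthly values: a seasonal cycle plus a constant offset per year
time = pd.date_range("2000-01-01", "2002-12-01", freq="MS")
values = time.month.values + (time.year.values - 2001) * 0.5
demo = xr.DataArray(values.astype(float), dims="time", coords={"time": time})

# long term mean for each calendar month (12 values)
clim = demo.groupby("time.month").mean()

# subtract the matching calendar-month mean from every value in the series
anom = demo.groupby("time.month") - clim

print(clim.values)
print(anom.sel(time="2000").values)
print(anom.sel(time="2002").values)
```

Here `"time.month"` is a string shortcut for grouping by `demo.time.dt.month`.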

In [None]:
# take a quick look at our pr daily timeseries and our monthly pr anomalies
fig=plt.figure(figsize=(15,2))
pr.plot()
plt.title('daily pr at Starkville, MS')
plt.show()

fig=plt.figure(figsize=(15,2))
pr_anom.plot()
plt.axhline(y=0,color='grey',linestyle='dashed',linewidth=0.5)
plt.title('monthly mean pr anomaly at Starkville, MS')
plt.ylabel('pr (mm/day)')
plt.show()

Thus far we've been working with data in xarray's DataArray structure. Sometimes there are multiple variables in a netcdf file as well as file attributes (different from variable or coordinate attributes). In this situation you would read the data file into an xarray Dataset (instead of a DataArray). This doesn't really apply to us here since our data files only have 1 variable each, but we'll go through an example of how to read data into a Dataset below.

In [None]:
# .open_dataset instead of .open_dataarray
ds = xr.open_dataset(filenames[0])
ds

Notice you can now see the file attributes 'history' and 'NCO' (sometimes also called global attributes). If this file had more than one variable they would appear in the Data variables section under prcp. Similar to how you can access coordinate information from a data variable with varname.coordname, you can access the data variables in a dataset with dataset.varname like below.

In [None]:
pr=ds.prcp
pr
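Since our files hold only one variable each, here is a sketch of what a multi-variable Dataset looks like, built by hand from two synthetic DataArrays. The names `demo_ds`, `tmax`, `tmin` and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=3, freq="D")
tmax = xr.DataArray([15.0, 17.5, 12.0], dims="time", coords={"time": time}, attrs={"units": "C"})
tmin = xr.DataArray([5.0, 8.5, 1.0], dims="time", coords={"time": time}, attrs={"units": "C"})

# a Dataset is a dictionary-like container of DataArrays that share coordinates
# the Dataset-level attrs play the role of file (global) attributes
demo_ds = xr.Dataset({"tmax": tmax, "tmin": tmin}, attrs={"title": "synthetic example"})

# dictionary-style and attribute-style access return the same DataArray
daily_range = demo_ds["tmax"] - demo_ds.tmin

print(list(demo_ds.data_vars))
print(daily_range.values)
```

Dictionary-style access (`demo_ds["tmax"]`) is the safer habit, because attribute-style access fails when a variable name collides with an existing method or property name.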

# Data Cleaning

### ETCCDI-suggested data cleaning / quality control

The minimum quality control procedures suggested by ETCCDI are as follows.

Replace data value with NaN for:
- user-defined missing values (e.g. -9999 --> NaN)
- daily precip values less than 0
- daily max temperature less than daily minimum temperature
- daily temperature greater than 70C (158F) or less than -70C (-94F)
- leap days (i.e. Feb 29th)
- impossible dates (e.g. 32nd March, 12th June 2042)
- non-numeric values
- daily temperature outliers (e.g. 3-5 times the standard deviation from the mean value for each calendar day)


Addressing each of these items below...

In [None]:
# first let's start with a fresh read of all the data we'll be using
pr = xr.open_dataarray(filenames[0])
tx = xr.open_dataarray(filenames[1])
tn = xr.open_dataarray(filenames[2])

pr=pr.squeeze()
tx=tx.squeeze()
tn=tn.squeeze()
pr

#### nan for user-defined missing values (e.g. -9999 --> nan)

xr.open_dataarray (like xr.open_dataset) does this for you. 


Notice in the variable attributes above there is no _FillValue=-9999., which is the value stored for missing data in the netcdf. This is because xarray automatically replaces the _FillValue with nan when it reads the file.
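The fill-value handling can be reproduced without a file. This sketch builds an undecoded, in-memory Dataset that still carries a `_FillValue` attribute and then applies `xr.decode_cf`, which performs the same CF masking that the open functions apply by default. The dataset `raw` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# an "undecoded" variable: missing data is stored as -9999 and flagged via _FillValue
raw = xr.Dataset(
    {
        "prcp": (
            "time",
            np.array([0.5, -9999.0, 2.0], dtype="float32"),
            {"_FillValue": np.float32(-9999.0), "units": "mm d-1"},
        )
    }
)

# apply CF decoding: values equal to _FillValue become nan
decoded = xr.decode_cf(raw)

print(decoded["prcp"].values)
print(decoded["prcp"].attrs)      # _FillValue is no longer listed here
print(decoded["prcp"].encoding)   # it is kept here so it can be written back out
```

This is why the fill value does not show up among the variable attributes after reading: xarray moves it from `.attrs` to `.encoding`.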

#### nan for daily precip values less than 0

In [None]:
# are there any negatives?
(pr<0).data.sum()

#### nan for daily max temperature less than daily minimum temperature

In [None]:
# is tx ever less than tn?
(tx<tn).data.sum()

In [None]:
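# # where tx<tn fill both tx and tn with nan
# # (compute the mask once, before either array is modified,
# #  otherwise tn would be compared against the already-filled tx)
# bad=tx<tn
# tx=xr.where(bad,np.nan,tx)
# tn=xr.where(bad,np.nan,tn)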

In [None]:
# # is tx ever less than tn now?
# (tn>tx).data.sum()

#### nan for daily temperature greater than 70C (158F) or less than -70C (-94F)

In [None]:
# is tx>70C, tx<-70C, tn>70C, or tn<-70C?
((tx>70)|(tx<-70)).data.sum(), ((tn>70)|(tn<-70)).data.sum()
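Our data happens to pass these checks, so nothing actually gets replaced. To see what the replacement would look like when values do fail, here is a sketch on synthetic data with one bad value planted for each check. The `_demo` and `_qc` names are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=4, freq="D")
pr_demo = xr.DataArray([0.0, -1.0, 5.0, 2.0], dims="time", coords={"time": time})      # day 2: negative precip
tx_demo = xr.DataArray([20.0, 21.0, 5.0, 85.0], dims="time", coords={"time": time})    # day 3: tx<tn, day 4: tx>70
tn_demo = xr.DataArray([10.0, 11.0, 9.0, 12.0], dims="time", coords={"time": time})

# negative precipitation becomes nan
pr_qc = pr_demo.where(pr_demo >= 0)

# tx < tn: compute the mask once, then blank both variables on those days
bad_pair = tx_demo < tn_demo
tx_qc = tx_demo.where(~bad_pair)
tn_qc = tn_demo.where(~bad_pair)

# temperatures outside -70C..70C become nan
tx_qc = tx_qc.where((tx_qc >= -70) & (tx_qc <= 70))
tn_qc = tn_qc.where((tn_qc >= -70) & (tn_qc <= 70))

print(pr_qc.values)
print(tx_qc.values)
print(tn_qc.values)
```

`.where(condition)` keeps values where the condition is True and fills the rest with nan, so each line reads as "keep the data where it is valid".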

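#### leap days (i.e. Feb 29th)

Here we fill the leap days with nan rather than dropping them, so the time dimension keeps its full length. Later, when we calculate indices, we'll drop the leap days from the precipitation data instead.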


In [None]:
# first let's double check that the time dimension is the same for 
# all of our data arrays
assert list(pr.time.data)==list(tx.time.data), 'pr.time and tx.time are not equal'
assert list(pr.time.data)==list(tn.time.data), 'pr.time and tn.time are not equal'

# another way to do the same thing without having to convert data structure is with numpy.testing
npt.assert_array_equal(pr.time,tx.time,'pr.time and tx.time are not equal')
npt.assert_array_equal(pr.time,tn.time,'pr.time and tn.time are not equal')

In [None]:
# # find all the leap days
# leapdays=pr.time[(pr.time.dt.day==29) & (pr.time.dt.month==2)]
# leapdays

In [None]:
# find the indexes to all the leap days
leap_ind=np.where((pr.time.dt.day==29) & (pr.time.dt.month==2))[0]
leap_ind

In [None]:
# # drop the leap days from the data arrays
# pr=pr.drop_sel(time=leapdays)
# tx=tx.drop_sel(time=leapdays)
# tn=tn.drop_sel(time=leapdays)
# len(pr)

# fill with nan
# pr=pr.where((pr.time.dt.day==29) & (pr.time.dt.month==2),np.nan,pr)
pr[leap_ind]=np.nan
tx[leap_ind]=np.nan
tn[leap_ind]=np.nan
pr[leap_ind]


#### nan for impossible dates (e.g. 32nd March, 12th June 2042)

This data has datetimes for the time dimension. If there were impossible dates, xarray would have had a problem at the open_dataarray statement. So we know there are no impossible dates present.

There could be dates missing, but we can check that just by looking at the length of the time dimension. We have 45 years of daily data, and the leap days were filled with nan rather than dropped, so they still count: 45 years * 365 days + 11 leap days = 16436 days

In [None]:
pr.shape, tx.shape, tn.shape
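The expected length of a complete daily record can be worked out with pandas alone. This sketch counts the days in 1979-2023 with and without leap days, which gives the two lengths to compare against depending on whether leap days are kept (as nan) or dropped.

```python
import pandas as pd

# every calendar day from 1979 through 2023
time = pd.date_range("1979-01-01", "2023-12-31", freq="D")

# leap days are the dates that fall on Feb 29
is_leap_day = (time.month == 2) & (time.day == 29)

n_total = time.size               # length if leap days are kept
n_leap = int(is_leap_day.sum())   # number of leap days in the record
n_noleap = n_total - n_leap       # length if leap days are dropped

print(n_total, n_leap, n_noleap)  # 16436 11 16425
```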

#### nan for non-numeric values

Similarly, use of netcdf and xarray ensures that there are no non-numeric values. Each variable in the data file is of one data type (e.g. float) and if there were a non-float value present there would have been an error already. We can be assured the data we've read is all float

In [None]:
pr.dtype, tx.dtype, tn.dtype

#### daily temperature outliers (e.g. 3-5 times the standard deviation from the mean value for each calendar day)

In [None]:
# find the time-mean for each day of the year
tx_daily_mean=tx.groupby(tx.time.dt.dayofyear).mean('time')
tn_daily_mean=tn.groupby(tn.time.dt.dayofyear).mean('time')
tx_daily_mean

In [None]:
# find the standard deviation for each day of the year
# .std throws a runtime warning about degrees of freedom because of 
# nan in the data so we suppress the warnings here

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="Degrees of freedom <= 0 for slice")
    tx_stddev=tx.groupby(tx.time.dt.dayofyear).std('time')
    tn_stddev=tn.groupby(tn.time.dt.dayofyear).std('time')
tx_stddev

In [None]:
# define daily outlier temperature as exceeding the mean +/- 5 times standard deviation
tx_outlier_upper, tx_outlier_lower=(tx_daily_mean+tx_stddev*5), (tx_daily_mean-tx_stddev*5)
tn_outlier_upper, tn_outlier_lower=(tn_daily_mean+tn_stddev*5), (tn_daily_mean-tn_stddev*5)
tx_outlier_upper

In [None]:
print('tx',(tx.groupby(tx.time.dt.dayofyear)>tx_outlier_upper).data.sum(), (tx.groupby(tx.time.dt.dayofyear)<tx_outlier_lower).data.sum())
print('tn',(tn.groupby(tn.time.dt.dayofyear)>tn_outlier_upper).data.sum(), (tn.groupby(tn.time.dt.dayofyear)<tn_outlier_lower).data.sum())
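Our data has no outliers by this definition, so here is a sketch that plants one in synthetic data and checks that the test finds it. Note that this sketch swaps in a slightly different technique from the cells above: it looks up each day's calendar-day mean and standard deviation with `.sel` and computes a z-score, rather than comparing a groupby object to the thresholds. The synthetic series, the spike size, and all names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 45 years of synthetic daily tmax: a seasonal cycle plus random noise
rng = np.random.default_rng(0)
time = pd.date_range("1979-01-01", "2023-12-31", freq="D")
values = 20 + 10 * np.sin(2 * np.pi * (time.dayofyear.values - 100) / 365.25)
values = values + rng.normal(0, 2, time.size)

# plant one unrealistic spike
spike_date = "2000-07-15"
values[time == pd.Timestamp(spike_date)] += 50
tx_demo = xr.DataArray(values, dims="time", coords={"time": time})

# mean and standard deviation for each calendar day
clim_mean = tx_demo.groupby("time.dayofyear").mean()
clim_std = tx_demo.groupby("time.dayofyear").std()

# look up the calendar-day statistics that belong to every time step
doy = tx_demo.time.dt.dayofyear
zscore = (tx_demo - clim_mean.sel(dayofyear=doy)) / clim_std.sel(dayofyear=doy)

# flag days more than n_std standard deviations from their calendar-day mean
n_std = 5
flagged = zscore.time.values[(np.abs(zscore) > n_std).values]
print(flagged)
```

One thing to keep in mind when choosing the multiplier: with n values per calendar day and the default (population) standard deviation, a single value can never sit more than sqrt(n-1) standard deviations from the mean. For 45 years that ceiling is about 6.6, so a threshold of 5 is attainable, but with short records a large multiplier can make the test impossible to fail.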

#### let's also look at how many missing values we have per month

In [None]:
# a function that sums the number of nans in each month of data
def get_nans_per_month(data_in):
    month_groups=pd.MultiIndex.from_arrays([data_in.time['time.year'].data,data_in.time['time.month'].data])
    data_in.coords['month_groups']=('time',month_groups)    
    nancount=data_in.isnull().groupby('month_groups').sum()
    return nancount

In [None]:
pr_nan_per_month=get_nans_per_month(pr.copy())
tx_nan_per_month=get_nans_per_month(tx.copy())
tn_nan_per_month=get_nans_per_month(tn.copy())
# pr_nan_per_month

In [None]:
# create datetimes for the x axis
time_months=pd.date_range(tx.time.data[0],tx.time.data[-1],freq='MS')

# plot
fig=plt.figure(figsize=(10,4))
ax=fig.add_subplot(311)
plt.plot(time_months,pr_nan_per_month)
plt.title('prcp, number of nans per month')

ax=fig.add_subplot(312)
plt.plot(time_months,tx_nan_per_month)
plt.title('tmax, number of nans per month')

ax=fig.add_subplot(313)
plt.plot(time_months,tn_nan_per_month)
plt.title('tmin, number of nans per month')

plt.tight_layout()
plt.show()
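Because the time coordinate holds datetimes, counting nans per month can also be sketched with `.resample`, which avoids building a MultiIndex by hand. This is an alternative to the function above, not a replacement for it, and the synthetic series `demo` is made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# three months of synthetic daily data with three planted gaps
time = pd.date_range("2000-01-01", "2000-03-31", freq="D")
values = np.arange(time.size, dtype=float)
values[[0, 1, 40]] = np.nan   # Jan 1, Jan 2, and Feb 10
demo = xr.DataArray(values, dims="time", coords={"time": time})

# isnull gives True/False per day; summing within each month counts the True values
nan_per_month = demo.isnull().resample(time="MS").sum()

print(nan_per_month.time.values)
print(nan_per_month.values)
```

A side benefit is that the result already carries a monthly datetime coordinate, so it can be plotted directly with `nan_per_month.plot()`.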

# Calculate climate change indicators

### Monthly Maximum Value of Daily Minimum Temperature (TNx)

- max(each month of daily minimum temperature values)

Here we are inputting daily data and pulling out 1 value per month.

In [None]:
# this is similar to how we found nans per month

# create an index for every month in the timeseries
month_groups=pd.MultiIndex.from_arrays([tn.time['time.year'].data,tn.time['time.month'].data])

# add the month_groups index as a new coordinate (labels)
# the month_groups coordinate will allow us to groupby months in the next step 
# each day of data in any particular month will be associated with a label in the month_groups coordinate
tn.coords['month_groups']=('time',month_groups)    
tn

In [None]:
# now groupby month and find the maximum value of each month
# this should return 1 value for every month, 45 years x 12 months = length 540
TNx=tn.groupby('month_groups').max()
TNx

In [None]:
# using the monthly datetimes that we created earlier (time_months) for the x axis values, plot TNx

# plot
fig=plt.figure(figsize=(15,2))
plt.plot(time_months,TNx) # plt.plot is from matplotlib.pyplot
plt.title('Monthly Maximum Value of Daily Minimum Temperature (TNx)')
plt.ylabel('degrees C')
plt.show()

### Annual Total Precip Amount Over 99th Percentile on Wet Days (R99pTOT)

- annually, the sum of precipitation when precipitation is > 99th percentile of wet day precipitation in the base period 1981-2010
- where a wet day is precipitation >= 1mm

Here we first use daily data during the base period to determine the 99th percentile of wet day precipitation. Then for each year of daily data we determine if each day exceeds the threshold and calculate an annual sum of precip on days that exceed the threshold. 

In [None]:
# find all the leap days by label
leapdays=pr.time[(pr.time.dt.day==29) & (pr.time.dt.month==2)]
leapdays

In [None]:
# drop leapdays from the data
pr_noleap=pr.drop_sel(time=leapdays)

len(pr_noleap),len(pr)

In [None]:
# slice to only base years
pr_baseyrs=pr_noleap.sel(time=slice('1981','2010'))
pr_baseyrs

In [None]:
# find 99th percentile of wet day precipitation
pr_99w=xr.where(pr_baseyrs>=1.0,pr_baseyrs,np.nan).quantile(0.99)
pr_99w

In [None]:
# fill with nan where pr doesn't meet the criteria
pr_noleap_copy=pr_noleap.copy()
pr_noleap_copy=xr.where(pr_noleap_copy>pr_99w,pr_noleap_copy,np.nan)

# then sum over each year
R99pTOT=pr_noleap_copy.groupby(pr_noleap_copy.time.dt.year).sum()
R99pTOT

In [None]:
# plot
fig=plt.figure(figsize=(15,2))
plt.plot(R99pTOT.year,R99pTOT)
plt.title('Annual Total Precip Amount Over 99th Percentile on Wet Days (R99pTOT)')
plt.ylabel('mm/year')
plt.show()
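The R99pTOT steps can be checked by hand on a small synthetic series. In this sketch the "base period" is a single year with wet-day amounts of 1, 2, ..., 100 mm, so the 99th percentile of wet days works out to 99.01 mm with the default linear interpolation. The series and all names are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# two non-leap years of synthetic daily precipitation, mostly dry
time = pd.date_range("2001-01-01", "2002-12-31", freq="D")
values = np.zeros(time.size)
values[:100] = np.arange(1, 101)         # 2001 (base year): wet days of 1, 2, ..., 100 mm
values[400:403] = [150.0, 120.0, 50.0]   # 2002: two days above the threshold, one below
demo = xr.DataArray(values, dims="time", coords={"time": time})

# 99th percentile of wet days (>= 1 mm) in the base period
# dry days are set to nan, and quantile skips nan for float data
base = demo.sel(time=slice("2001", "2001"))
wet_q99 = base.where(base >= 1.0).quantile(0.99)

# keep only the days above the threshold, then sum within each year
heavy = demo.where(demo > wet_q99)
r99ptot = heavy.groupby("time.year").sum()

print(float(wet_q99))
print(r99ptot.values)
```

Only the 100 mm day exceeds the threshold in 2001, and the 150 mm and 120 mm days exceed it in 2002, so the annual totals are 100 and 270 mm. Leaving out the wet-day mask would change the answer a lot here, because most days are dry and would drag the percentile toward zero.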

### Monthly Maximum Consecutive 5-day Precipitation (Rx5day)

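- max(5-day rolling mean precipitation within each month)

Here we are inputting daily data, for each month calculating the mean precipitation amount for each 5-day window of data values, then choosing the maximum 5-day window value for each month. Note that ETCCDI defines Rx5day using the 5-day precipitation total; the 5-day mean used here is simply that total divided by 5.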

In [None]:
# we'll again use the pr timeseries without leapdays
# we want to group by each month
# we'll assign an index as a new coordinate for this 
month_groups_noleap=pd.MultiIndex.from_arrays([pr_noleap.time['time.year'].data,pr_noleap.time['time.month'].data])
pr_noleap.coords['month_groups_noleap']=('time',month_groups_noleap) 
pr_noleap

In [None]:
# let's take a look at how to use .groupby
# see what happens if we group by month_groups
# there should be 45 years x 12 months = 540 groups of data
# .groupby returns an object that can be iterated over in the form (label,group_array_of_data) pairs
pr_noleap.groupby(pr_noleap.month_groups_noleap)


In [None]:
# how to access the group labels
pr_noleap.groupby(pr_noleap.month_groups_noleap).groups.keys()

In [None]:
# how to access the array indexes assigned to a group label
label=(1979,5)
pr_noleap.groupby(pr_noleap.month_groups_noleap).groups[label]

In [None]:
# how to access the array data assigned to a group label
pr_noleap.groupby(pr_noleap.month_groups_noleap)[label]

In [None]:
# now we want to find the maximum value of the 5-day rolling mean in each month
# let's test with one month first
window_len = 5  # days

# using rolling to divide the month of data up into 5-day windows
# then take the mean of each 5-day window if all 5 days in the window have finite data values (not nan)
# then find the maximum value of the above
# the result should be a single value
pr_noleap.groupby(pr_noleap.month_groups_noleap)[label].rolling(time=window_len,min_periods=window_len).mean().max()
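To see exactly what `.rolling` with `min_periods` returns, here is a sketch on seven made-up daily values. It also shows the rolling sum next to the rolling mean, since the two differ only by a factor of the window length. The name `demo` is hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-06-01", periods=7, freq="D")
demo = xr.DataArray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], dims="time", coords={"time": time})

window_len = 5  # days

# each window ends on the labeled day and reaches back window_len days
# with min_periods=window_len, windows with fewer than 5 valid values give nan
roll = demo.rolling(time=window_len, min_periods=window_len)
roll_mean = roll.mean()
roll_sum = roll.sum()

print(roll_mean.values)         # the first 4 entries are nan (incomplete windows)
print(float(roll_mean.max()))   # largest 5-day mean
print(float(roll_sum.max()))    # largest 5-day total = largest 5-day mean * 5
```

Because the first `window_len - 1` days of each group have incomplete windows, applying the rolling calculation month by month means that windows spanning a month boundary are never evaluated.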


In [None]:
# now do the calculation for all months using a for loop
results=[]  # empty list

# loop through each label,data pair in the groupby object
for label,data_group in pr_noleap.groupby(pr_noleap.month_groups_noleap):
    # append the result for each month to our results list
    results.append(data_group.rolling(time=window_len,min_periods=window_len).mean().max())

# the result should be 1 value for every month (45 years x 12 months = length 540)
print(len(results))

# look at the first item in the list
results[0]

In [None]:
# even though each monthly calculation returns a single value
# that value is returned as an xarray DataArray with some metadata attached
# to get all our results back into a single object we can concatenate the list of xr.DataArrays

Rx5day=xr.concat(results,dim='time') # concat on a new dimension called time
Rx5day

In [None]:
# make the new time dimension a coordinate (supply time labels)

# first make an array of datetimes
print(pr_noleap.time.data[0],pr_noleap.time.data[-1]) # first and last time in string format

time_months=pd.date_range(pr_noleap.time.data[0],pr_noleap.time.data[-1],freq='MS') # full array of datetimes

print(time_months[0:4])

# assign the time coordinate labels
Rx5day.coords['time']=('time',time_months)    

# while we're at it, assign some variable attributes
# if we use xarray plotting, some of this will show up automagically on the plot
var_atts={'standard_name':'precipitation','units':'mm/day','description':'monthly maximum of monthly 5-day rolling mean prcp'}
Rx5day.attrs=var_atts

Rx5day

In [None]:
# plot
# use xarray plotting (which is based on matplotlib) by calling .plot on the xr.DataArray object 
fig=plt.figure(figsize=(15,2))
Rx5day.plot() 
plt.title('Monthly Maximum Consecutive 5-day Precipitation (Rx5day)')
plt.show()
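
The same group-then-roll idea can also be tried on a tiny made-up series. The sketch below uses pandas rather than xarray, and the data values and the helper name `max_rolling_mean` are invented for illustration. Because the rolling windows are built inside each month, a wet period at the very end of February cannot borrow days from March.

```python
import numpy as np
import pandas as pd

# made-up daily precipitation (mm/day) for January and February 2001
time = pd.date_range("2001-01-01", "2001-02-28", freq="D")
pr = pd.Series(0.0, index=time)
pr["2001-01-10":"2001-01-14"] = 10.0   # 5 wet days in a row in January
pr["2001-02-27":"2001-02-28"] = 20.0   # 2 wet days at the very end of February

def max_rolling_mean(group, window=5):
    # hypothetical helper: largest mean over complete windows within one group
    return group.rolling(window, min_periods=window).mean().max()

# group by (year, month), then roll inside each group
rx5day = pr.groupby([pr.index.year, pr.index.month]).apply(max_rolling_mean)
print(rx5day)
```

The result is indexed by (year, month) pairs, much like the `month_groups_noleap` labels used above.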

### Maximum Length of Consecutive Dry Days (CDD)

- annually, the maximum length of consecutive days where precipitation is < 1mm
- not looking at dry spells that span over multiple years, just cutting off the search at the end of each year

Here we are inputting daily data, determining whether each day falls under the precipitation threshold, and finding the longest period of consecutive days each year that meets the threshold requirement. 
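
Before working with the real data, the counting logic can be sketched on a few made-up values. This is a minimal illustration using NumPy only, and the precipitation numbers are invented.

```python
import numpy as np

# made-up daily precipitation (mm/day) for one very short "year"
pr = np.array([0.0, 0.2, 5.0, 0.0, 0.5, 0.9, 0.0, 3.0, np.nan, 0.0])
threshold = 1.0  # mm/day

# comparisons with nan are False, so a missing day is treated as not dry
dry = pr < threshold

longest = 0
current = 0
for is_dry in dry:
    current = current + 1 if is_dry else 0   # extend or reset the current dry spell
    longest = max(longest, current)          # remember the longest spell so far

print(longest)  # 4
```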

In [None]:
# we'll use pr_noleap again 
pr_noleap

In [None]:
# make a mask where 1=dry and 0=wet
threshold = 1  # mm/day
data_mask = xr.where(pr_noleap<threshold,1,0)
data_mask

In [None]:
# group the data by year
data_mask_grouped=data_mask.groupby(data_mask.time.dt.year)
data_mask_grouped

In [None]:
# use nested for loops to find the longest stretch of dry days in each year

CDD=[] # empty list to store results

# loop through each year
for year,data1yr in data_mask_grouped:
    
    counter=0
    longest_counter=0

    # loop through each day
    for iday in range(len(data1yr)):

        # if it's a dry day, increment the counter
        if data1yr[iday]==1: counter+=1

        # keep track of the longest consecutive amount of dry days
        if counter>longest_counter: longest_counter=counter

        # if it's not a dry day (0 or nan), start the counter over at 0
        if data1yr[iday]!=1: counter=0

    # add this year's longest dry spell to the results list
    CDD.append(longest_counter)
    longest_counter=0
    counter=0

# look at the results for first 5 years
CDD[0:5]

In [None]:
# annual datetimes for x axis values
time_annual=pd.date_range(tx.time.data[0],tx.time.data[-1],freq='YS')

# plot
fig=plt.figure(figsize=(15,2))
plt.plot(time_annual,CDD)
plt.title('Maximum Length of Consecutive Dry Days (CDD)')
plt.ylabel('days')
plt.show()

### Growing Season Length (GSL)

- annually, growing season starts on the first day of the first six consecutive day period where daily mean temperature is > 5C
- annually, growing season ends on the first day after 1 July of the first six consecutive day period where daily mean temperature is < 5C

Here we are inputting daily data, pulling out 2 dates per year, and calculating the number of days between the two dates.
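
A rough sketch of the two searches on made-up data is shown below. It uses NumPy only, the temperature values are invented, the helper `first_run_start` is hypothetical, and an arbitrary `cutoff` index stands in for July 1.

```python
import numpy as np

threshold = 5.0   # degrees C
window_len = 6    # consecutive days

# made-up daily mean temperatures for a 30-day "year"
t_mean = np.array([2, 3, 6, 7, 4, 6, 7, 8, 9, 7, 6, 8, 9, 10, 12,
                   11, 9, 8, 6, 4, 6, 3, 2, 1, 0, 1, 2, 6, 1, 0], dtype=float)
cutoff = 14       # stand-in for the index of July 1

def first_run_start(mask, window_len, after=-1):
    """First index later than `after` that begins window_len consecutive True days."""
    # the 'valid' convolution gives each window's sum, indexed by the window's first day
    window_sums = np.convolve(mask.astype(int), np.ones(window_len, dtype=int), mode="valid")
    starts = np.flatnonzero(window_sums == window_len)
    starts = starts[starts > after]
    return int(starts[0]) if starts.size else None

gs_start = first_run_start(t_mean > threshold, window_len)
gs_end = first_run_start(t_mean < threshold, window_len, after=cutoff)
print(gs_start, gs_end, gs_end - gs_start)
```

Unlike the rolling windows used in the cells that follow, these window sums are indexed by the first day of each window, so no index shift is needed here.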


In [None]:
threshold = 5   # degrees C
window_len = 6  # consecutive days

# calculate mean temperature
t_mean=(tn+tx)/2

In [None]:
# to find the start and end of the growing season
# we will need to "roll" through time.
# in order for the leap days (that we filled with nan)
# to not mess us up, we'll need to drop those days from the data

# leapdays=t_mean.time[(t_mean.time.dt.day==29) & (t_mean.time.dt.month==2)]
t_mean_noleap=t_mean.drop_sel(time=leapdays)
len(leapdays),len(t_mean_noleap),len(t_mean)

In [None]:
# we'll want to group by years to find the start and end of the growing season for each year
# in this case we don't need to assign a new coordinate to use groupby
# we can use .dt on xr.DataArrays of datetimes to select/subset/group (xarray .dt operates the same as pandas .dt)
t_mean_noleap.groupby(t_mean_noleap.time.dt.year)

In [None]:
# how to access the group labels
# t_mean_noleap.groupby(t_mean_noleap.time.dt.year).groups.keys()

In [None]:
# how to access the array indexes assigned to a group label
# testyear=1979
# t_mean_noleap.groupby(t_mean_noleap.time.dt.year).groups[testyear]

In [None]:
# how to access the array data assigned to a group label
# data_1yr=t_mean_noleap[t_mean_noleap.groupby(t_mean_noleap.time.dt.year).groups[testyear]]
year=1979
data_1yr=t_mean_noleap.groupby(t_mean_noleap.time.dt.year)[year]
data_1yr

In [None]:
# make a mask for where daily temperature is greater than 5C
data_mask=xr.where(data_1yr>threshold,1,0)
data_mask

In [None]:
# separate the timeseries into windows of length 6
# the first window has the first value of the data_mask (index 0) in the last position of the window 
# plus the 5 preceding values of the data_mask, which in this case are nan because we're at the beginning of the data_mask array
# the 6th window (index 5) should be equal to the first 6 values of data_mask (no nans)

# create the windows
windows=data_mask.rolling(time=window_len,center=False).construct('window')

# print the window with the first 6 values of data_mask
print(windows.isel(time=window_len-1))

# print array info
windows


In [None]:
# find the sum of each window
# this will tell us how many days per window are over the 5C threshold
# ignore windows that contain any nans
windows.sum('window',min_count=window_len)

In [None]:
# find the indexes of each 6-day window where all days were over the 5C threshold
# np.where returns a tuple in this case where the resulting array is 
# in the first index of the tuple. this is why we use the [0] to pull the array from the tuple
np.where(windows.sum('window')==window_len)[0]


In [None]:
# now take the first value of the result above
# this is the first window where the requirement was met (6 days above 5C)
# the windows are indexed as their last day, i.e. the 5th index window contains index days 0,1,2,3,4,5 
# the start of the growing season is the first day of the first 6-day period meeting the 5C requirement
# so to get the index of the first day in the window we subtract 5
gs_start_ind=np.where(windows.sum('window')==window_len)[0][0] - (window_len-1)
gs_start_ind

In [None]:
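# now let's search for the end of the growing season
# we know to only look after July 1, what index is that?
# (1979 is the first year of data, so this position in the full array is also the index within the year)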
label=str(year)+'-07-01'
minval=t_mean_noleap.indexes['time'].get_loc(label)
minval

In [None]:
# now do similar steps to find the end of the growing season

# 0/1 mask for where temperature is less than 5C
data_mask=xr.where(data_1yr<threshold,1,0)
data_mask

In [None]:
# split up into 6-day windows
windows=data_mask.rolling(time=window_len,center=False).construct('window')
windows

In [None]:
# get the index to all the windows where temperature is always less than 5C
np.where(windows.sum('window')==window_len)[0]

In [None]:
# get the index to all the windows where temperature is lt 5C
# but remember the index is to the last date in the window and we want the index of the first date
# so we subtract 5 and save the result as an array
possible_inds=np.where(windows.sum('window')==window_len)[0] - (window_len-1)
possible_inds

In [None]:
# subset possible_inds to only indexes that correspond to days after July 1
# and take the first value in that result
gs_end_ind=possible_inds[possible_inds>minval][0]
gs_end_ind

In [None]:
# functions to do what we did above for each year
# we'll put these functions in a loop below

def get_gs_start(data_1yr):
    mask=xr.where(data_1yr>threshold,1,0)
    windows=mask.rolling(time=window_len,center=False).construct('window')
    ind=np.where(windows.sum('window')==window_len)[0][0] - (window_len-1)
    return ind

def get_gs_end(data_1yr):
    mask=xr.where(data_1yr<threshold,1,0)
    windows=mask.rolling(time=window_len,center=False).construct('window')
    # sometimes it may be warm through the end of the year
    # in these cases we would end up with an error if no windows meet the <5C requirement
    # try/except works to pass in the last day of the year as the end of the growing season in these cases
    try:
        # subtract (window_len-1) to get the index of the first day in each window, as we did above
        possible_inds=np.where(windows.sum('window')==window_len)[0] - (window_len-1)
        ind=possible_inds[possible_inds>minval][0]
    except IndexError:
        ind=364 # index of the last day of year
    return ind

In [None]:
# group the data by year and loop through each year's worth of data
# to find the index of the start and end of the growing season 

# create empty lists for storing results
gs_start_list=[]
gs_end_list=[]

# loop through years of data
# .groupby returns an object that can be iterated over in the form (label,group_array_of_data) pairs
# here "label" are the years
# and "group" are a year's worth of data values
for label,group in t_mean_noleap.groupby(t_mean_noleap.time.dt.year):
    # call our functions and append the result to our lists
    gs_start_list.append(get_gs_start(group))
    gs_end_list.append(get_gs_end(group))

# look at the first 5 values of each list
gs_start_list[0:5], gs_end_list[0:5]

In [None]:
# double check our work
assert all(x>=0 for x in gs_start_list), "negative values in gs_start_list"
assert all(x<=364 for x in gs_start_list), "values>364 in gs_start_list"
assert all(x>=minval for x in gs_end_list), f"values<{minval} in gs_end_list"
assert all(x<=364 for x in gs_end_list), "values>364 in gs_end_list"

In [None]:
# calculate the growing season length for each year
# this is called a "list comprehension"
# it executes a for loop and returns results inside a list
GSL = [end_ind-start_ind for end_ind,start_ind in zip(gs_end_list,gs_start_list)] 

# # below is identical to above
# GSL=[]
# for end_ind,start_ind in zip(gs_end_list,gs_start_list):
#     GSL.append(end_ind-start_ind)

# check out the first 5 values
GSL[0:5]

In [None]:
# using annual datetimes that we've already created for x axis values

# plot
fig=plt.figure(figsize=(15,2))
plt.plot(time_annual,GSL)
plt.title('Growing Season Length (GSL)')

fig=plt.figure(figsize=(15,2))
plt.plot(time_annual,gs_start_list)
plt.title('Start of Growing Season DOY')

fig=plt.figure(figsize=(15,2))
plt.plot(time_annual,gs_end_list)
plt.title('End of Growing Season DOY')

### Warm Spell Duration Index (WSDI)

- 6 consecutive days of hot maximum temperatures
- hot temperature threshold defined as > 90th percentile of maximum temperature for each calendar day using a centered 5-day window in the base period 1981-2010
- warm spells that contain dates for multiple years are assigned to the year when the spell ends

Here we first use daily data during the base period to determine the daily 90th percentile temperature threshold. Then using all years of daily data we decide whether each calendar day exceeds the hot threshold, then find occurrences where the threshold is exceeded for at least 6 consecutive days (this is a warm spell), then sum the number of days annually in the warm spells.

Notice that this is not the same as finding dangerous heat waves with respect to human health because it is based on a temperature threshold for each calendar day. This means that the WSDI will include winter warm spells where the temperature exceeds the 90th percentile of winter daily temperature, which would likely be a comfortable temperature.
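
The calendar-day threshold can be sketched on made-up data with NumPy alone. In this sketch every year is identical and the "temperature" is simply the day index, so the answer is easy to check by hand. Wrap-padding the series is a simplification for the example only, and all names are invented.

```python
import numpy as np

n_years, n_days, half = 3, 365, 2   # half = days on either side of the center day

# made-up data: each year is identical and the value equals the day index (0-364)
daily = np.tile(np.arange(n_days, dtype=float), n_years)

# pad two days on each end (wrapping around) so the edge windows are complete
padded = np.concatenate([daily[-half:], daily, daily[:half]])

# column k holds the value (k - half) days away from each day
windows = np.stack([padded[k:k + daily.size] for k in range(2 * half + 1)], axis=1)

# regroup as (day of year, year, window position) and pool the last two axes
pooled = windows.reshape(n_years, n_days, -1).transpose(1, 0, 2).reshape(n_days, -1)

# one threshold per calendar day from 5 days x 3 years = 15 pooled values
threshold90 = np.percentile(pooled, 90, axis=1)
print(threshold90.shape, threshold90[10])
```

For day index 10 the pooled values are 8 through 12, three times each, so the 90th percentile is 12.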

In [None]:
# first let's find the 90th percentile temperature for each calendar day (using a centered 5 day window)
# this means that to determine the 90th percentile temperature for a given day we need 
# that day's temperature in each year as well as the temperature for 2 days before and 2 days after in each year

# start by building a day-of-year label (1-365) for every day of the 30 base years
# plus 2 extra days on each end so that the 5-day windows at the edges of the base period are complete

n_baseyrs=30
# base_first='1981'
# base_last='2010'

day_first=1
day_last=365
doy=list(np.arange(day_first,day_last+1))*n_baseyrs
doy= [364,365] + doy + [1,2]

len(doy), doy[0:9]


In [None]:
tx_noleap_baseyrs=tx.drop_sel(time=leapdays).sel(time=slice('1980-12-30','2011-01-02'))
tx_noleap_baseyrs

In [None]:
tx_noleap_baseyrs.coords['doy_noleap']=('time',doy)
tx_noleap_baseyrs

In [None]:
tx_windows=tx_noleap_baseyrs.rolling(time=5,center=True).construct('window')
tx_windows


In [None]:
# drop the windows centered on the extra dates
tx_windows=tx_windows.drop_sel(time=['1980-12-30','1980-12-31','2011-01-01','2011-01-02'])
tx_windows

In [None]:
# now groupby our doy index 'doy_noleap'
# each group will contain the temperature for a single doy of every year plus the two days before and two days after
# in other words, each group is the 5-day centered window for a given doy for all years 
# 5 days * 30 years = 150 data values in each group
tx_grouped=tx_windows.groupby(tx_windows.doy_noleap)

# let's look at what is in a data group for doy 15
tx_grouped[15]

In [None]:
# now find the 90th percentile of data values in each group
# we should end up with 1 value for each doy of the year (excluding leap days)
threshold90=tx_grouped.quantile(0.9,dim=['time','window'])
threshold90

In [None]:
# prep tx for comparison to threshold90
# this time use all data (don't subset in time)
nyears=45
doy=list(np.arange(day_first,day_last+1))*nyears
tx_noleap=tx.drop_sel(time=leapdays)
tx_noleap.coords['doy_noleap']=('time',doy)
tx_noleap


In [None]:
# determine which days exceed threshold90
# these are the hot days
tx_hot_mask = tx_noleap.groupby(tx_noleap.doy_noleap) > threshold90
tx_hot_mask

In [None]:
# how many True days and how many False?
ntrue=tx_hot_mask.sum()
nfalse=len(tx_hot_mask)-ntrue
ntrue,nfalse

In [None]:
# test some datetime arithmetic: add one day to the date at index 6
print(tx_hot_mask.time.isel(time=6).data + np.timedelta64(1,'D'))

# window_len is still set from the growing season section
window_len

In [None]:
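# test building the indexes and dates for a window that ends at index i
i=6
print(np.arange(i-(window_len-1),i+1))  # indexes of the days in the window

date=tx_hot_mask.time.isel(time=i).data  # last date in the window
print(date)
dates=pd.date_range(date-np.timedelta64(window_len-1,'D'),date-np.timedelta64(0,'D'))
dates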

In [None]:
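# we'll loop in time to identify warm spells and collect the indexes of all days that fall in a warm spell
count=0
hot_inds=[]
hot_dates=[]

for i,value in enumerate(tx_hot_mask):
    if value: count=count+1 # if True, increment the counter
    else: count=0           # if False, start the counter over at 0

    # once we have at least window_len hot days in a row, save the indexes of the last window_len days
    if count>=window_len:
        inds=np.arange(i-(window_len-1),i+1)
        hot_inds.extend(inds)

        # optionally save the dates as well as the indexes
        # date=tx_hot_mask.time.isel(time=i).data
        # dates=pd.date_range(date-np.timedelta64(window_len-1,'D'),date-np.timedelta64(0,'D'))
        # hot_dates.extend(dates)

# indexes from overlapping windows are repeated, we'll remove the duplicates next
len(hot_inds)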

In [None]:
hot_inds=np.unique(hot_inds)
len(hot_inds)

In [None]:
hot_dates=np.unique(hot_dates)
len(hot_dates)

In [None]:
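# now we need to know which year each warm spell event takes place in
# warm spell days are counted for the year when the spell ends
# first, find the start and end index of each warm spell event
ind1=None
event_inds=[]
for i,value in enumerate(hot_inds[:-1]):
    if ind1 is None:
        ind1=value

    # if the next index is not the next day, the current event ends here
    if hot_inds[i+1]!=value+1:
        ind2=value
        event_inds.append((ind1,ind2)) # append a tuple
        ind1=hot_inds[i+1]

# the loop above never closes the final event, so add it here
if len(hot_inds)>0:
    if ind1 is None: ind1=hot_inds[-1]
    event_inds.append((ind1,hot_inds[-1]))

print(len(event_inds))
event_inds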

In [None]:
tx_hot_mask.isel(time=slice(544,566))

In [None]:
tx_hot_mask.time.isel(time=event_inds[0][0]).dt.year.data


In [None]:
# do we have any warm spells that span over multiple years?
spell_year=[]
count=0
for startstop in event_inds:
    year_start=tx_hot_mask.time.isel(time=startstop[0]).dt.year.data
    year_end = tx_hot_mask.time.isel(time=startstop[1]).dt.year.data
    if year_start==year_end:
        spell_year.append(year_start)
    else:
        spell_year.append(year_end)
        count+=1
print(count)
spell_year

In [None]:
# count days in warm spells per year
WSDI=[]
for data_year in np.arange(1979,2023+1):
    day_count=0
    for i,event_year in enumerate(spell_year):
        if event_year==data_year:
            ndays=event_inds[i][1]-event_inds[i][0]+1
        else:
            ndays=0    
        day_count=day_count+ndays
    WSDI.append(day_count)
len(WSDI)

In [None]:
# using annual datetimes that we've already created for x axis values
# plot
fig=plt.figure(figsize=(15,2))
plt.plot(time_annual,WSDI)
plt.title('Warm Spell Duration Index (WSDI)')
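
The spell-finding and year-assignment steps above can also be sketched compactly on made-up data. This is an illustration using NumPy only. The helper `find_runs` and the 20-day example are invented and are not part of the notebook's workflow.

```python
import numpy as np

window_len = 6

# made-up example: 20 days, the first 10 in year 2000 and the last 10 in year 2001
years = np.repeat([2000, 2001], 10)
hot = np.zeros(20, dtype=bool)
hot[2:5] = True     # 3 hot days in a row: too short to be a warm spell
hot[6:13] = True    # 7 hot days in a row that cross the year boundary

def find_runs(mask):
    """Return inclusive (start, stop) index pairs for each run of True values."""
    padded = np.concatenate([[0], np.asarray(mask, dtype=int), [0]])
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)       # False -> True transitions
    stops = np.flatnonzero(change == -1) - 1   # last True before a True -> False transition
    return list(zip(starts.tolist(), stops.tolist()))

# keep runs of at least window_len days and credit them to the year in which they end
spell_days = {}
for start, stop in find_runs(hot):
    length = stop - start + 1
    if length >= window_len:
        end_year = int(years[stop])
        spell_days[end_year] = spell_days.get(end_year, 0) + length

print(spell_days)  # {2001: 7}
```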

# Plotting linear trends

In [None]:
# first we need to clean up our variable's coordinate labels
# this is because the polyfit function we will use doesn't like coordinates like our multi-index "month_groups"

TNx

In [None]:
TNx=TNx.drop_vars(['month_groups', 'time_level_0', 'time_level_1']) # delete the junk
TNx=TNx.rename({'month_groups':'time'})  # rename the dimension
TNx=TNx.assign_coords({'time':time_months})  # assign new coordinate labels to time dim

# nice and clean
TNx

In [None]:
# generate least squares linear regression coefficients (deg=1 fits a straight line)
coefs=TNx.polyfit(dim='time',deg=1)

# generate the x,y points of the linear regression line
regline=xr.polyval(TNx.time,coefs)

# see our handiwork
regline

In [None]:
# plot the linear regression over the timeseries TNx
fig=plt.figure(figsize=(15,2))
plt.plot(time_months,TNx) # plt.plot is from matplotlib.pyplot
regline.polyfit_coefficients.plot(linestyle='--')
plt.title('Monthly Warmest Night (TNx) with linear trend')
plt.ylabel('degrees C')
plt.show()
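
As a quick sanity check on what a degree-1 polynomial fit returns, here is a minimal NumPy sketch on made-up points that lie exactly on a line. The data are invented, and `np.polyfit` is used here in place of the xarray method.

```python
import numpy as np

# made-up points that lie exactly on the line y = 2 + 0.5*x
x = np.arange(10, dtype=float)
y = 2.0 + 0.5 * x

# deg=1 fits a straight line, coefficients are returned highest power first
slope, intercept = np.polyfit(x, y, deg=1)

# evaluate the fitted line at the original x values
fitted = np.polyval([slope, intercept], x)
print(slope, intercept)
```

A fit with `deg=2` would add a curvature term and would no longer be a linear trend.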

Let's plot the seasonal mean TNx with the linear trend for each season.

In [None]:
TNx_seasonal=TNx.resample(time='QS-DEC').mean()
TNx_seasonal

In [None]:
# every 4th value belongs to the same season
# drop the first and last DJF values because they are means of incomplete seasons
TNx_DJF=TNx_seasonal[0::4][1:-1]
TNx_MAM=TNx_seasonal[1::4]
TNx_JJA=TNx_seasonal[2::4]
TNx_SON=TNx_seasonal[3::4]

# deg=1 fits a straight line (linear trend) to each season
coefs=TNx_DJF.polyfit(dim='time',deg=1)
reg_DJF=xr.polyval(TNx_DJF.time,coefs)

coefs=TNx_MAM.polyfit(dim='time',deg=1)
reg_MAM=xr.polyval(TNx_MAM.time,coefs)

coefs=TNx_JJA.polyfit(dim='time',deg=1)
reg_JJA=xr.polyval(TNx_JJA.time,coefs)

coefs=TNx_SON.polyfit(dim='time',deg=1)
reg_SON=xr.polyval(TNx_SON.time,coefs)
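
To see why the first and last winter values are trimmed, it can help to resample a tiny made-up monthly series. The sketch below uses pandas instead of xarray and invented values (each value is simply its month number).

```python
import pandas as pd

# made-up monthly series for 2000-2001 where each value is its month number
time = pd.date_range("2000-01-01", "2001-12-01", freq="MS")
monthly = pd.Series(time.month.to_numpy(dtype=float), index=time)

# quarters that start in December: DJF, MAM, JJA, SON
seasonal = monthly.resample("QS-DEC").mean()

# the first bin holds only Jan-Feb 2000 and the last bin holds only Dec 2001
print(seasonal.index[0].date(), seasonal.iloc[0])
print(seasonal.index[-1].date(), seasonal.iloc[-1])

# every 4th value is the same season, then trim the two incomplete winters
djf_complete = seasonal.iloc[0::4].iloc[1:-1]
print(djf_complete)
```

Only the winter labeled 2000-12-01 (December 2000 through February 2001) is built from three full months in this example.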

In [None]:
# plot each seasonal timeseries of TNx with its linear regression line
fig=plt.figure(figsize=(15,6))
fig.add_subplot(411)
plt.plot(TNx_DJF.time,TNx_DJF)
plt.ylim(TNx_seasonal.min(),TNx_seasonal.max())
reg_DJF.polyfit_coefficients.plot(linestyle='--')
plt.title('Winter Mean Warmest Night')
plt.ylabel('degrees C')

fig.add_subplot(412)
plt.plot(TNx_MAM.time,TNx_MAM)
plt.ylim(TNx_seasonal.min(),TNx_seasonal.max())
reg_MAM.polyfit_coefficients.plot(linestyle='--')
plt.title('Spring Mean Warmest Night')
plt.ylabel('degrees C')

fig.add_subplot(413)
plt.plot(TNx_JJA.time,TNx_JJA)
plt.ylim(TNx_seasonal.min(),TNx_seasonal.max())
reg_JJA.polyfit_coefficients.plot(linestyle='--')
plt.title('Summer Mean Warmest Night')
plt.ylabel('degrees C')

fig.add_subplot(414)
plt.plot(TNx_SON.time,TNx_SON)
plt.ylim(TNx_seasonal.min(),TNx_seasonal.max())
reg_SON.polyfit_coefficients.plot(linestyle='--')
plt.title('Fall Mean Warmest Night')
plt.ylabel('degrees C')

plt.tight_layout()
plt.show()

### Are the Changes In Value of These Indices Over Time Statistically Significant?

In [None]:
# test if trend is statistically different from zero
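
One possible approach, sketched here with NumPy only and made-up data, is a permutation test: shuffle the values in time many times and ask how often a shuffled series produces a slope at least as large as the observed one. This is an illustrative stand-in rather than the notebook's chosen method. It assumes the values are exchangeable in time (no autocorrelation), and all names and numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up annual index values: a trend of 0.1 per year plus random noise
years = np.arange(1979, 2024)
x = (years - years[0]).astype(float)
values = 0.1 * x + rng.normal(0.0, 1.0, size=x.size)

def slope(x, y):
    # slope of the least squares straight line through (x, y)
    return np.polyfit(x, y, 1)[0]

observed = slope(x, values)

# slopes we would see if the values had no relationship with time
n_perm = 2000
perm_slopes = np.array([slope(x, rng.permutation(values)) for _ in range(n_perm)])

# two-sided p-value: fraction of shuffled slopes at least as extreme as the observed slope
p_value = (np.sum(np.abs(perm_slopes) >= abs(observed)) + 1) / (n_perm + 1)
print(observed, p_value)
```

A small p-value (for example, below 0.05) suggests the trend is unlikely to be zero under these assumptions.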

### Computing Climate Change Indices on Gridded Data

In [None]:
# download/unzip data

In [None]:
# repeat one of the above analyses

In [None]:
# visualize

# Your Turn!

### Choose one of three coding mini-projects below to complete on your own and prepare to share your findings


**Option 1 (easiest):** Calculate the monthly mean daily temperature range (DTR) and create a figure showing the DTR timeseries. 

&emsp;Hints:
- Use daily tmax and tmin data
- Calculate the daily temperature range as tmax-tmin
- For each month, find the mean of the range values you calculated in the previous step
- Plot your timeseries of monthly values. Include axis labels and a title. 

<br>
<br>

**Option 2 (moderate):** Calculate the cold spell duration index (CSDI) at the xx station and create a figure showing the CSDI timeseries. Extra: see if you can determine whether the change in the CSDI is statistically significant.

&emsp;Hints:
- Use daily tmin data
- Find the daily 10th percentile temperature using a centered 5-day window over the base period 1961-1990
- Using all data years, determine if each day falls below the threshold (looking for days with tmin < threshold)
- Identify cold spells as periods of at least 6 consecutive days when tmin is below the threshold
- Count how many total cold spell days there are annually (remember each cold spell is assigned to the year when the spell ends)
- Plot the timeseries of annual values. Include axis labels and a title.
- Extra Step: Determine statistical significance of the trend line (linear regression) or the difference in means between two 30-year periods (1941-1970) and (1991-2020).

<br>
<br>


**Option 3 (hardest):** Use a gridded dataset to compute the annual growing season length (GSL) at each grid cell. Then, calculate the trend in GSL at each grid cell and also determine whether each trend is statistically significant. Present your results in a figure that shows the GSL trend for each grid cell (on a map) and include an indication of whether each grid cell value is statistically significant.

&emsp;Hints:
- Use gridded daily tmax and tmin data
- Calculate daily mean temperature
- Use the same process we showed previously to determine the annual start/end of the growing season and find the annual GSL, except this time do the calculations at each grid cell.
- Calculate the trend (linear regression) in annual GSL at each grid cell.
- Determine if each trend is statistically significant.
- Plot the map of trend values and indicate significance at each grid cell with hatching or some other visual indicator. Include a title and legend.


# might need to replace one of the above with
 Option: reproduce any of the indices in this notebook using gridded data for Mississippi

 Option: use dask to speed up computation on gridded data



In [None]:
# peek at the answer figure for option 1

In [None]:
# peek at the answer figure for option 2

In [None]:
# peek at the answer figure for option 3

Don't forget to create answer codes for these and put them in the repo. Direct learners to answers after the work-on-your-own session.