# Python Learn by Doing: Climate Change Indicators

Developed By: Dr. Kerrie Geil, Mississippi State University

Date: January 2024

Requirements: list space, RAM, and package requirements

Link: notebook available to download at 

<u> Description </u>

This notebook helps the learner build intermediate python programming skills through data query, manipulation, analysis, and visualization. Learning will be centered around obtaining climate data, computing climate change indices, and determining the statistical significance of change. The notebook is aimed at learners who already have some knowledge of programming and statistics. 

<u> Summary of Contents </u>

put an outline of tasks/skills here

-----

### Introduction to Climate Change Indicators

Put a description of what they are

Include a bunch of links

Spell out which ones we will be computing

Selection of ETCCDI Climate Extremes Indices
- Monthly Maximum Value of Daily Minimum Temperature (TNx)
- Growing Season Length (GSL)
- Warm Spell Duration Index (WSDI)
- Monthly Maximum Consecutive 5-day Precipitation (Rx5day)
- Maximum Length of Consecutive Dry Days (CDD)
- Annual Total Precip Amount Over 99th Percentile of Wet Days (R99pTOT)

**Disclaimer:** This notebook is intended for python programming learning only. The data quality checking and calculation of ETCCDI climate change indices in this notebook may differ slightly from the ETCCDI published instructions for simplicity and/or relevance to our learning goals. Learners wanting to compute the indices according to the exact ETCCDI instructions should consult their [documentation](https://etccdi.pacificclimate.org/index.shtml) and/or use the [RClimDex software package](https://github.com/ECCC-CDAS/RClimDex.git) written in R to calculate ETCCDI climate change indices. The indices calculated from multiple gridded datasets are also available from [climdex.org](https://www.climdex.org/), which also offers a similar software package for calculating the indices on a dataset of your choice.   


For the climate change indices covered in this notebook we will need the following observational data over many data years:

variable abbrev. | description | frequency | units
---|---|---|---
tmin | minimum surface air temperature | daily | C 
tmax | maximum surface air temperature | daily | C
prcp | accumulated precipitation | daily | mm

### Importing Python Packages and Defining Your Workspace


In [1]:
# importing all the python packages we will need here

import numpy as np
import matplotlib.pyplot as plt
from urllib.request import urlretrieve
import os
import gzip
import shutil

import pandas as pd

In [2]:
# learners need to update these paths to reflect locations on their own computer/workspace

# path to your working directory (where this notebook is on your computer)
work_dir = r'C://Users/kerrie/Documents/01_LocalCode/repos/MSU_py_training/learn_by_doing/climate_change_indicators/' 
# work_dir = r'C://Users/kerrie.WIN/Documents/code/MSU_py_training/learn_by_doing/climate_change_indicators/' 

# path to where you'll download and store the data files
data_dir = r'C://Users/kerrie/Documents/02_LocalData/tutorials/learn_by_doing/climate_change_indicators/' 
# data_dir = r'C://Users/kerrie.WIN/Documents/data/tutorials/learn_by_doing/climate_change_indicators/' 

# path to write output files and figures
output_dir = r'C://Users/kerrie/Documents/01_LocalCode/repos/MSU_py_training/learn_by_doing/climate_change_indicators/outputs/'
# output_dir = r'C://Users/kerrie.WIN/Documents/code/MSU_py_training/learn_by_doing/climate_change_indicators/outputs/'


# create directories if they don't exist already
os.makedirs(data_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

### Obtaining the Data

Describe the data requirements (importance of time dimension standardization and missing data) 

Warnings against performing climate change analyses on just any dataset (example PRISM)

Warnings about high resolution spatial data (much of it is interpolated, high res not always better)

Why we choose to use certain datasets

Links to each dataset's webpage

In [None]:
url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt'
filename = data_dir+'ghcnd-inventory.txt'

# urlretrieve(url, filename)

In [None]:
# load metadata file
colnames=['ID','LAT','LON','VAR','START','END']
df = pd.read_csv(filename,sep=r'\s+', names=colnames)  # raw string avoids an invalid-escape warning for \s
df

In [None]:
# subset to United States Coop Network stations
df=df.loc[df['ID'].str.startswith('USC')]  # station IDs begin with country code (US) + network code (C)
df

In [None]:
# subset to variables we want
df =df[df.VAR.isin(['TMIN','TMAX','PRCP'])]
df

In [None]:
# subset to approx. Mississippi (rectangular bounding box)
df = df[(df.LON<-88.0978)&(df.LON>-91.6650)&(df.LAT>30.1739)&(df.LAT<34.9960)]
df

In [None]:
# subset to stations with many data years
df=df[(df.START<1920) & (df.END>2020)]
df

In [None]:
# subset to stations that have all three variables
df=df.groupby('ID').filter(lambda x: len(x)==3)
df

In [None]:
df

In [None]:
df['NYEARS']=df.END-df.START+1
subTdf=df.loc[df.VAR=='TMAX']

In [None]:
df_long=df[df.NYEARS>=100]
df_long=df_long.groupby('ID').filter(lambda x: len(x)==3)
df_long

In [None]:
df_long=df_long[(df_long.END>=2020)&(df_long.START<=1920)]
df_long

In [None]:
df_long.ID.unique()

In [None]:
df_tx=df_long[df_long.VAR=='TMAX']

In [None]:
# import matplotlib.pyplot as plt
df_tx=df[df.VAR=='TMAX']
plt.scatter(x=df_tx['LON'], y=df_tx['LAT'])
plt.show()

In [None]:
df_tx.sort_values(['LON','LAT'],axis=0)

In [None]:
# MSU is USC00228374
# Poplarville Exp Stn is USC00227128

describe any steps taken prior to here to decide on the station etc.

In [None]:
# download/unzip temperature data

url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/USC00228374.csv.gz'
filename = data_dir+'USC00228374.csv.gz'
# urlretrieve(url, filename)

# with gzip.open(filename, 'rb') as f_in:
#     with open(filename[:-3], 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)

url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/readme-by_station.txt'
filename = data_dir+'readme-by_station.txt'
# urlretrieve(url, filename) 


url = 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/readme.txt'
filename = data_dir+'readme.txt'
# urlretrieve(url, filename) 

In [43]:
filename = data_dir+'USC00228374.csv'
colnames=['ID','YYYYMMDD','ELEMENT','DATA_VALUE','M_FLAG','Q_FLAG','S_FLAG','OBS_TIME']
data_types={'ID':'string','YYYYMMDD':'string','ELEMENT':'string','DATA_VALUE':'float','M_FLAG':'string','Q_FLAG':'string','S_FLAG':'string','OBS_TIME':'string'}
na_values=[-9999]
df = pd.read_csv(filename, names=colnames,dtype=data_types,na_values=na_values)
df


| | ID | YYYYMMDD | ELEMENT | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|---|
| 0 | USC00228374 | 18910901 | TMAX | 300.0 | | | 6 | |
| 1 | USC00228374 | 18910902 | TMAX | 306.0 | | | 6 | |
| 2 | USC00228374 | 18910903 | TMAX | 267.0 | | | 6 | |
| 3 | USC00228374 | 18910904 | TMAX | 233.0 | | | 6 | |
| 4 | USC00228374 | 18910905 | TMAX | 233.0 | | | 6 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 368245 | USC00228374 | 20240108 | PRCP | 0.0 | | | H | 0700 |
| 368246 | USC00228374 | 20240109 | PRCP | 541.0 | | | H | 0700 |
| 368247 | USC00228374 | 20240110 | PRCP | 10.0 | | | H | 0700 |
| 368248 | USC00228374 | 20240111 | PRCP | 0.0 | | | H | 0700 |


### Manipulate TMAX, TMIN, & PRCP into a useful format 

- 1D arrays
- data points for every day between start and end dates (consecutive daily timeseries)
- dimensional metadata attached (each data point is associated with metadata that includes the observation date)

In [13]:
df_tx=df[df.ELEMENT=='TMAX'].reset_index(drop=True)
df_tn=df[df.ELEMENT=='TMIN'].reset_index(drop=True)
df_pr=df[df.ELEMENT=='PRCP'].reset_index(drop=True)

df_tn

| | ID | YYYYMMDD | ELEMENT | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|---|
| 0 | USC00228374 | 18910901 | TMIN | 194.0 | | | 6 | |
| 1 | USC00228374 | 18910902 | TMIN | 183.0 | | | 6 | |
| 2 | USC00228374 | 18910903 | TMIN | 117.0 | | | 6 | |
| 3 | USC00228374 | 18910904 | TMIN | 117.0 | | | 6 | |
| 4 | USC00228374 | 18910905 | TMIN | 128.0 | | | 6 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45729 | USC00228374 | 20240108 | TMIN | 17.0 | | | H | 0700 |
| 45730 | USC00228374 | 20240109 | TMIN | 28.0 | | | H | 0700 |
| 45731 | USC00228374 | 20240110 | TMIN | -17.0 | | | H | 0700 |
| 45732 | USC00228374 | 20240111 | TMIN | -6.0 | | | H | 0700 |


In [14]:
# what are the start/end dates for each variable?

df_tx.YYYYMMDD.iloc[0],df_tn.YYYYMMDD.iloc[0],df_pr.YYYYMMDD.iloc[0],df_tx.YYYYMMDD.iloc[-1],df_tn.YYYYMMDD.iloc[-1],df_pr.YYYYMMDD.iloc[-1]

# print(df_tx.DATA_VALUE.isna().sum()) # how many missing values are there? 


('18910901', '18910901', '18910901', '20240113', '20240113', '20240113')

start date = 1891-09-01, end date = 2024-01-13

That's 122 days remaining in 1891 + 132 full years from 1892 through 2023 (132*365 days + 32 leap days) + 13 days in 2024 = 48,347 days 

If there were a data record for every day between the start and end dates, each dataframe would have 48,347 rows (but they don't!)

We need to fill in the missing dates in order to create arrays with a time dimension in consecutive days.
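The day count above can be double-checked with a few lines of standard-library Python. This is only a sketch of the arithmetic; the variable names are made up for illustration and are not used elsewhere in the notebook.

```python
from datetime import date
import calendar

start, end = date(1891, 9, 1), date(2024, 1, 13)

# inclusive day count taken straight from the calendar
n_days = (end - start).days + 1

# the same count built up piece by piece, following the text
days_left_1891 = (date(1891, 12, 31) - start).days + 1    # 1 Sep through 31 Dec 1891
full_years = range(1892, 2024)                             # 1892 through 2023
leap_days = sum(calendar.isleap(y) for y in full_years)    # 1900 is not a leap year
days_2024 = end.day                                        # 1 Jan through 13 Jan 2024
by_hand = days_left_1891 + len(full_years) * 365 + leap_days + days_2024

print(n_days, by_hand)
```

Both numbers should agree with the length of the `pd.date_range` created in the next cell.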

In [15]:
# create a datetime index of consecutive dates between the start and end dates, length should be 48,347 
dates=pd.date_range('1891-09-01','2024-01-13')
dates

DatetimeIndex(['1891-09-01', '1891-09-02', '1891-09-03', '1891-09-04',
               '1891-09-05', '1891-09-06', '1891-09-07', '1891-09-08',
               '1891-09-09', '1891-09-10',
               ...
               '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07',
               '2024-01-08', '2024-01-09', '2024-01-10', '2024-01-11',
               '2024-01-12', '2024-01-13'],
              dtype='datetime64[ns]', length=48347, freq='D')

In [16]:
# replace the previous index of integer values with datetime values
df_tx.index=pd.DatetimeIndex(df_tx.YYYYMMDD,name='index')
df_tn.index=pd.DatetimeIndex(df_tn.YYYYMMDD,name='index')
df_pr.index=pd.DatetimeIndex(df_pr.YYYYMMDD,name='index')
df_tx

| index | ID | YYYYMMDD | ELEMENT | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|---|
| 1891-09-01 | USC00228374 | 18910901 | TMAX | 300.0 | | | 6 | |
| 1891-09-02 | USC00228374 | 18910902 | TMAX | 306.0 | | | 6 | |
| 1891-09-03 | USC00228374 | 18910903 | TMAX | 267.0 | | | 6 | |
| 1891-09-04 | USC00228374 | 18910904 | TMAX | 233.0 | | | 6 | |
| 1891-09-05 | USC00228374 | 18910905 | TMAX | 233.0 | | | 6 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-01-08 | USC00228374 | 20240108 | TMAX | 133.0 | | | H | 0700 |
| 2024-01-09 | USC00228374 | 20240109 | TMAX | 133.0 | | | H | 0700 |
| 2024-01-10 | USC00228374 | 20240110 | TMAX | 94.0 | | | H | 0700 |
| 2024-01-11 | USC00228374 | 20240111 | TMAX | 133.0 | | | H | 0700 |


In [17]:
# reindex the dataframe using the datetime Index of consecutive dates
# any dates that weren't present before will be added and the columns filled with NaN
df_tx=df_tx.reindex(dates)
df_tn=df_tn.reindex(dates)
df_pr=df_pr.reindex(dates)

df_tn

| | ID | YYYYMMDD | ELEMENT | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|---|
| 1891-09-01 | USC00228374 | 18910901 | TMIN | 194.0 | | | 6 | |
| 1891-09-02 | USC00228374 | 18910902 | TMIN | 183.0 | | | 6 | |
| 1891-09-03 | USC00228374 | 18910903 | TMIN | 117.0 | | | 6 | |
| 1891-09-04 | USC00228374 | 18910904 | TMIN | 117.0 | | | 6 | |
| 1891-09-05 | USC00228374 | 18910905 | TMIN | 128.0 | | | 6 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-01-09 | USC00228374 | 20240109 | TMIN | 28.0 | | | H | 0700 |
| 2024-01-10 | USC00228374 | 20240110 | TMIN | -17.0 | | | H | 0700 |
| 2024-01-11 | USC00228374 | 20240111 | TMIN | -6.0 | | | H | 0700 |
| 2024-01-12 | | | | | | | | |


In [20]:
# do 1 data cleaning item before converting to array
# replace data value with nan anywhere there's a quality flag

print('tmax',df_tx.Q_FLAG.value_counts())   # how much of the data has quality flags?
print('tmin',df_tn.Q_FLAG.value_counts())   # how much of the data has quality flags?
print('prcp',df_pr.Q_FLAG.value_counts())   # how much of the data has quality flags?

print('----------------')

print('tmax nans',df_tx.DATA_VALUE.isna().sum()) # how many missing values are there? 
print('tmin nans',df_tn.DATA_VALUE.isna().sum()) # how many missing values are there? 
print('prcp nans',df_pr.DATA_VALUE.isna().sum()) # how many missing values are there? 

print('----------------')

df_tx.loc[~pd.isnull(df_tx.Q_FLAG),['DATA_VALUE']]=np.nan  # replace data value with nan anywhere there's a quality flag
df_tn.loc[~pd.isnull(df_tn.Q_FLAG),['DATA_VALUE']]=np.nan  # replace data value with nan anywhere there's a quality flag
df_pr.loc[~pd.isnull(df_pr.Q_FLAG),['DATA_VALUE']]=np.nan  # replace data value with nan anywhere there's a quality flag

print('tmax nans',df_tx.DATA_VALUE.isna().sum()) # how many missing values are there after applying the quality flags?
print('tmin nans',df_tn.DATA_VALUE.isna().sum()) # how many missing values are there after applying the quality flags?
print('prcp nans',df_pr.DATA_VALUE.isna().sum()) # how many missing values are there after applying the quality flags?

tmax Q_FLAG
I    133
S     15
R      4
G      2
O      1
Name: count, dtype: Int64
tmin Q_FLAG
I    455
S     22
R      5
G      1
Name: count, dtype: Int64
prcp Q_FLAG
K    15
Name: count, dtype: Int64
----------------
tmax nans 2782
tmin nans 2613
prcp nans 3176
----------------
----------------
tmax nans 2937
tmin nans 3096
prcp nans 3191


In [None]:
# download/unzip precipitation data

### Data Cleaning / Quality Control

The minimum quality control procedures suggested by ETCCDI are as follows.

Replace data value with NaN for:
- user-defined missing values (e.g. -9999 --> NaN)
- daily precip values less than 0
- daily max temperature less than daily minimum temperature
- daily temperature greater than 70C (158F) or less than -70C (-94F)
- leap days (i.e. Feb 29th)
- impossible dates (e.g. 32nd March, 12th June 2042)
- non-numeric values
- daily temperature outliers (e.g. values more than 3-5 standard deviations from the mean value for each calendar day)



In [21]:
# replace user-defined missing values with nan

# we already did this when we read the data file (na_values=[-9999])
# verify that there aren't any -9999
(df_tx.DATA_VALUE==-9999).sum(),(df_tn.DATA_VALUE==-9999).sum(),(df_pr.DATA_VALUE==-9999).sum()  

(0, 0, 0)

In [22]:
# replace negative precip values

# first see if there are any negatives
(df_pr.DATA_VALUE<0).sum() 

0

In [23]:
# nan where daily tmax is less than daily tmin

# first see if there are any tmax<tmin
(df_tx.DATA_VALUE < df_tn.DATA_VALUE).sum()

0

In [25]:
# nan where tmin or tmax <-70C or >+70C

# first we need to adjust the units to C (GHCN-Daily stores temperature in tenths of degrees C)
df_tx['DATA_VALUE']=df_tx['DATA_VALUE']/10.
df_tn['DATA_VALUE']=df_tn['DATA_VALUE']/10.

In [29]:
# now test

(df_tx.DATA_VALUE>70).sum(),(df_tx.DATA_VALUE<-70).sum(),(df_tn.DATA_VALUE>70).sum(),(df_tn.DATA_VALUE<-70).sum()

(0, 0, 0, 0)

In [41]:
# drop leap days

df_tx=df_tx[~((df_tx.index.month==2)&(df_tx.index.day==29))]
df_tn=df_tn[~((df_tn.index.month==2)&(df_tn.index.day==29))]
df_pr=df_pr[~((df_pr.index.month==2)&(df_pr.index.day==29))]

df_tx

| | ID | YYYYMMDD | ELEMENT | DATA_VALUE | M_FLAG | Q_FLAG | S_FLAG | OBS_TIME |
|---|---|---|---|---|---|---|---|---|
| 1891-09-01 | USC00228374 | 18910901 | TMAX | 30.0 | | | 6 | |
| 1891-09-02 | USC00228374 | 18910902 | TMAX | 30.6 | | | 6 | |
| 1891-09-03 | USC00228374 | 18910903 | TMAX | 26.7 | | | 6 | |
| 1891-09-04 | USC00228374 | 18910904 | TMAX | 23.3 | | | 6 | |
| 1891-09-05 | USC00228374 | 18910905 | TMAX | 23.3 | | | 6 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-01-09 | USC00228374 | 20240109 | TMAX | 13.3 | | | H | 0700 |
| 2024-01-10 | USC00228374 | 20240110 | TMAX | 9.4 | | | H | 0700 |
| 2024-01-11 | USC00228374 | 20240111 | TMAX | 13.3 | | | H | 0700 |
| 2024-01-12 | | | | | | | | |


In [None]:
# impossible dates

# we already indirectly tested for this when we created the DatetimeIndex from the YYYYMMDD column
# if there had been impossible dates, there would have been an error thrown

In [None]:
# non numeric values

# we already indirectly tested for this when we read in the data file with pd.read_csv
# since we set the data type for the DATA_VALUE column to 'float', a non-numeric value in that column would have thrown an error

In [None]:
# daily temperature outliers
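
One possible way to screen for daily temperature outliers is sketched below. It runs on synthetic data rather than the station dataframes, and the cutoff `n_sd = 4` is an assumption picked from the suggested 3-5 range. The mean and standard deviation are computed per calendar day across all years, and values too far from their calendar-day mean are replaced with NaN.

```python
import numpy as np
import pandas as pd

# synthetic daily temperatures (deg C) with a seasonal cycle; NOT station data
rng = np.random.default_rng(0)
dates = pd.date_range('1961-01-01', '2020-12-31')
doy = dates.dayofyear.to_numpy()
seasonal = 18 + 10 * np.sin(2 * np.pi * (doy - 100) / 365.25)
temps = pd.Series(seasonal + rng.normal(0, 3, len(dates)), index=dates)

bad_day = pd.Timestamp('2005-07-15')
temps.loc[bad_day] = 60.0        # plant one obviously bad value

n_sd = 4                         # assumed cutoff, somewhere in the 3-5 range

# mean and standard deviation for each calendar day (month, day) across all years
cal_day = [temps.index.month, temps.index.day]
clim_mean = temps.groupby(cal_day).transform('mean')
clim_std = temps.groupby(cal_day).transform('std')

# flag values far from their calendar-day mean and replace them with NaN
is_outlier = (temps - clim_mean).abs() > n_sd * clim_std
temps_clean = temps.mask(is_outlier)

print('number of flagged values:', is_outlier.sum())
```

Note that the planted value itself inflates the standard deviation for its calendar day, so with only a few decades of data a very strict cutoff can fail to flag even a large error.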



### Monthly Maximum Value of Daily Minimum Temperature (TNx)

- max(each month of daily minimum temperature values)

Here we are inputting daily data and pulling out 1 value per month.
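
A minimal sketch of this calculation, using a short synthetic tmin series in place of the station data (the series and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# synthetic daily minimum temperature (deg C) for three months; NOT station data
dates = pd.date_range('2000-01-01', '2000-03-31')
rng = np.random.default_rng(1)
tmin = pd.Series(rng.normal(5, 4, len(dates)), index=dates, name='tmin')
tmin.iloc[10] = np.nan      # a missing day, which max() skips by default

# TNx: group the daily values by calendar month and keep the largest one
tnx = tmin.resample('MS').max()   # 'MS' labels each month by its first day
print(tnx)
```

The result has one value per month, indexed by the first day of that month.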

### Growing Season Length (GSL)

- annually, the growing season starts on the first day of the first period of six consecutive days where daily mean temperature is > 5C
- annually, the growing season ends on the first day of the first period of six consecutive days after 1 July where daily mean temperature is < 5C

Here we are inputting daily data, pulling out 2 dates per year, and calculating the number of days between the two dates.
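
The two rules above could be sketched as follows for a single year. The helper names `first_run_start` and `growing_season_length` are hypothetical, and the sketch assumes a gap-free daily index (a calendar with leap days removed would need extra care), searches for the season end from 1 July onward, and returns NaN when no start or end is found.

```python
import numpy as np
import pandas as pd

def first_run_start(condition, run_length=6):
    """Date on which the first run of `run_length` consecutive True days begins (None if no run)."""
    # the rolling sum of a True/False series equals run_length only when the whole window is True
    full_window = condition.astype(int).rolling(run_length).sum() == run_length
    if not full_window.any():
        return None
    last_day = full_window.idxmax()                       # last day of the first qualifying window
    return last_day - pd.Timedelta(days=run_length - 1)   # step back to the window's first day

def growing_season_length(tmean_year, threshold=5.0):
    """Days between growing season start and end for one calendar year of daily mean temperature."""
    year = tmean_year.index[0].year
    start = first_run_start(tmean_year > threshold)
    second_half = tmean_year.loc[pd.Timestamp(year=year, month=7, day=1):]
    end = first_run_start(second_half < threshold)
    if start is None or end is None:
        return np.nan     # how to treat years without a clear start or end is left open here
    return (end - start).days

# synthetic year: cold until 31 March, warm 1 April through 31 October, cold afterwards
dates = pd.date_range('2001-01-01', '2001-12-31')
tmean = pd.Series(0.0, index=dates)
tmean.loc['2001-04-01':'2001-10-31'] = 10.0
tmean.loc['2001-02-10':'2001-02-12'] = 10.0   # 3-day warm blip, too short to start the season

gsl = growing_season_length(tmean)
print(gsl)   # days from 1 April to 1 November
```

For a multi-year series, the same function could be applied year by year with `tmean.groupby(tmean.index.year).apply(growing_season_length)`.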


### Warm Spell Duration Index (WSDI)

- 6 consecutive days of hot temperatures
- hot temperature threshold defined as > 90th percentile temperature for each calendar day using a centered 5-day window in the base period 1961-1990
- warm spells that contain dates for multiple years are assigned to the year when the spell ends

Here we first use daily data during the base period to determine the daily 90th percentile temperature threshold. Then using all years of daily data we decide whether each calendar day exceeds the hot threshold, then find occurrences where the threshold is exceeded for at least 6 consecutive days (this is a warm spell), then sum the number of days annually in the warm spells.

Notice that this is not the same as finding dangerous heat waves with respect to human health because it is based on a temperature threshold for each calendar day. This means that the WSDI will include winter warm spells where the temperature exceeds the 90th percentile of winter daily temperature, which would likely be a comfortable temperature.
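
The steps described above could be sketched roughly as follows. The data are synthetic and deliberately simple (a repeating pattern with three planted hot periods), daily maximum temperature is assumed as the input, and the percentile is computed with NumPy's default linear interpolation, which may not match the ETCCDI procedure exactly.

```python
import numpy as np
import pandas as pd

# synthetic daily tmax (deg C) on a 365-day calendar (leap days dropped); NOT station data
dates = pd.date_range('1961-01-01', '2000-12-31')
dates = dates[~((dates.month == 2) & (dates.day == 29))]
day_of_year = np.arange(len(dates)) % 365
tmax = pd.Series(20.0 + day_of_year % 5 + dates.year.to_numpy() % 2, index=dates)
tmax.loc['1996-07-10':'1996-07-16'] = 40.0   # 7-day warm spell
tmax.loc['1997-12-29':'1998-01-04'] = 40.0   # 7-day warm spell that ends in 1998
tmax.loc['1999-03-01':'1999-03-03'] = 40.0   # only 3 days, too short to count

# step 1: 90th percentile for each calendar day, pooling a centered 5-day window
# from every base-period year (shift is positional, so neighbours are adjacent days)
base = tmax.loc['1961':'1990']
pooled = pd.concat([base.shift(k) for k in (-2, -1, 0, 1, 2)], axis=1)
thresh = pooled.groupby([pooled.index.month, pooled.index.day]).apply(
    lambda g: np.nanpercentile(g.to_numpy(), 90))
lookup = thresh.to_dict()                    # keys are (month, day) tuples
thresh_daily = pd.Series(
    [lookup[(m, d)] for m, d in zip(tmax.index.month, tmax.index.day)], index=tmax.index)

# step 2: which days exceed their calendar-day threshold?
hot = tmax > thresh_daily

# step 3: label runs of consecutive hot/not-hot days and keep hot runs of at least 6 days
run_id = (hot != hot.shift()).cumsum()
run_len = hot.groupby(run_id).transform('size')
in_spell = hot & (run_len >= 6)

# step 4: assign every day of a run to the year in which the run ends, then count per year
years = pd.Series(tmax.index.year, index=tmax.index)
end_year = years.groupby(run_id).transform('last')
wsdi = in_spell.groupby(end_year).sum()

print(wsdi[wsdi > 0])
```

In this synthetic series the isolated hot days and the 3-day period do not count, and the spell that straddles New Year is credited entirely to the year in which it ends.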

### Monthly Maximum Consecutive 5-day Precipitation (Rx5day)

- max(5-day rolling total precipitation within each month)

Here we are inputting daily data, calculating the total precipitation amount for each 5-day window of data values, then choosing the maximum 5-day window value for each month.
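
A short sketch of the rolling-window idea on synthetic data follows. It uses the 5-day running total (the ETCCDI definition is based on the 5-day precipitation amount); dividing by 5 would give the running mean and select the same windows. Assigning each window to the month of its last day is an assumption made here for simplicity.

```python
import numpy as np
import pandas as pd

# synthetic daily precipitation (mm) for two months; NOT station data
dates = pd.date_range('2010-01-01', '2010-02-28')
prcp = pd.Series(0.0, index=dates)
prcp.loc['2010-01-10':'2010-01-12'] = [10.0, 20.0, 5.0]
prcp.loc['2010-02-03':'2010-02-04'] = [8.0, 2.0]

# 5-day running total; each value is labelled with the last day of its window
# (windows containing a NaN give NaN because min_periods defaults to the window size)
total_5day = prcp.rolling(5).sum()

# Rx5day: the largest 5-day total in each month
rx5day = total_5day.resample('MS').max()
print(rx5day)
```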


### Maximum Length of Consecutive Dry Days (CDD)

- annually, during the growing season (using mean start and mean end)
- maximum length of consecutive days where precipitation is < 1mm

Here we are inputting daily data, subsetting to data during the growing season, determining whether each day falls under the precipitation threshold, and finding the longest period of consecutive days each year that meets the threshold requirement. 
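
The run-length logic could be sketched as below. The growing season window of 1 April through 31 October is a placeholder assumption (the notebook intends to use the mean start and end found in the GSL step), and `longest_dry_run` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def longest_dry_run(prcp, threshold=1.0):
    """Length of the longest run of consecutive days with precipitation < threshold (mm)."""
    dry = prcp < threshold                    # NaN compares False, so a missing day breaks a run
    run_id = (dry != dry.shift()).cumsum()    # new id every time the dry/wet status changes
    run_len = dry.groupby(run_id).sum()       # dry days in each run (0 for wet runs)
    return int(run_len.max())

# synthetic daily precipitation (mm); NOT station data
dates = pd.date_range('2015-01-01', '2016-12-31')
prcp = pd.Series(5.0, index=dates)             # wet every day, except ...
prcp.loc['2015-06-01':'2015-06-20'] = 0.0      # a 20-day dry spell in 2015
prcp.loc['2016-08-05':'2016-08-11'] = 0.5      # a 7-day dry spell in 2016
prcp.loc['2016-01-01':'2016-02-15'] = 0.0      # a winter dry spell outside the growing season

# keep only growing season days (placeholder window: 1 April through 31 October)
month_day = dates.month * 100 + dates.day
in_season = (month_day >= 401) & (month_day <= 1031)
season = prcp[in_season]

# CDD: longest dry run within each year's growing season
cdd = season.groupby(season.index.year).apply(longest_dry_run)
print(cdd)
```

Because the data are subset before the runs are measured, the long winter dry spell in 2016 does not affect that year's value.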


### Annual Total Precip Amount Over 99th Percentile on Wet Days (R99pTOT)

- annually, the sum of precipitation when precipitation is > 99th percentile of wet day precipitation in the base period 1961-1990
- where a wet day is precipitation >= 1mm

Here we first use daily data during the base period to determine the 99th percentile of wet day precipitation. Then for each year of daily data we determine if each day exceeds the threshold and calculate an annual sum of precip on days that exceed the threshold. 
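
These steps might look roughly like the following on synthetic data. The precipitation series is randomly generated for illustration only, and `quantile` uses pandas' default linear interpolation, which may differ slightly from the ETCCDI percentile method.

```python
import numpy as np
import pandas as pd

# synthetic daily precipitation (mm): about 30% of days are wet; NOT station data
rng = np.random.default_rng(42)
dates = pd.date_range('1961-01-01', '2000-12-31')
is_wet = rng.random(len(dates)) < 0.3
prcp = pd.Series(np.where(is_wet, rng.gamma(0.8, 10.0, len(dates)), 0.0), index=dates)

# step 1: 99th percentile of wet-day (>= 1 mm) precipitation in the base period
base = prcp.loc['1961':'1990']
wet_base = base[base >= 1.0]
p99 = wet_base.quantile(0.99)

# step 2: keep precipitation only on days above the threshold (other days count as 0)
very_wet = prcp.where(prcp > p99, 0.0)

# step 3: annual totals
r99ptot = very_wet.groupby(very_wet.index.year).sum()
print(round(p99, 1))
print(r99ptot.head())
```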

### Are the Changes In Value of These Indices Over Time Statistically Significant?

### Computing Climate Change Indices on Gridded Data

In [None]:
# download/unzip data

In [None]:
# repeat one of the above analyses

In [None]:
# visualize

# Your Turn!

### Choose one of three coding mini-projects below to complete on your own and prepare to share your findings


**Option 1 (easiest):** Calculate the monthly mean daily temperature range (DTR) at the xx station and create a figure showing the DTR timeseries. 

&emsp;Hints:
- Use daily tmax and tmin data
- Calculate the daily temperature range as tmax-tmin
- For each month, find the mean of the range values you calculated in the previous step
- Plot your timeseries of monthly values. Include axis labels and a title. 

<br>
<br>

**Option 2 (moderate):** Calculate the cold spell duration index (CSDI) at the xx station and create a figure showing the CSDI timeseries. Extra: see if you can determine whether the change in the CSDI is statistically significant.

&emsp;Hints:
- Use daily tmin data
- Find the daily 10th percentile temperature using a centered 5-day window over the base period 1961-1990
- Using all data years, determine whether each day falls below the threshold (looking for days with tmin < threshold)
- Identify cold spells as periods of at least 6 consecutive days when tmin is below the threshold
- Count how many total cold spell days there are annually (remember each cold spell is assigned to the year when the spell ends)
- Plot the timeseries of annual values. Include axis labels and a title.
- Extra Step: Determine statistical significance of the trend line (linear regression) or the difference in means between two 30-year periods (1941-1970) and (1991-2020).

<br>
<br>


**Option 3 (hardest):** Use a gridded dataset to compute the annual growing season length (GSL) at each grid cell. Then, calculate the trend in GSL at each grid cell and also determine whether each trend is statistically significant. Present your results in a figure that shows the GSL trend for each grid cell (on a map) and include an indication of whether each grid cell value is statistically significant.

&emsp;Hints:
- Use gridded daily tmax and tmin data
- Calculate daily mean temperature
- Use the same process we showed previously to determine the annual start/end of the growing season and find the annual GSL, except this time do the calculations at each grid cell.
- Calculate the trend (linear regression) in annual GSL at each grid cell.
- Determine if each trend is statistically significant.
- Plot the map of trend values and indicate significance at each grid cell with hatching or some other visual indicator. Include a title and legend.



In [None]:
# peek at the answer figure for option 1

In [None]:
# peek at the answer figure for option 2

In [None]:
# peek at the answer figure for option 3

Don't forget to create answer codes for these and put them in the repo. Direct learners to answers after the work-on-your-own session.