# Week 3 In-Class Activities:

## Hypothesis Testing with Global Climate Model Output

Similar to last week, we are going to take a look at the output from a global climate model, the Community Earth System Model (CESM). The data are **globally and annually averaged surface temperature** for a pre-industrial integration (GHGs, etc. fixed at year 1850 values) and a set of 30 individual integrations from 1920-2100.

This activity is adapted from Prof. Jennifer Kay's Objective Data Analysis course at CU Boulder. It was originally coded by Prof. Jennifer Kay (CU Boulder) with input from Elizabeth Maroon (CU ATOC/CIRES Postdoc, 2018).

First, we import the packages that we will need. Make sure that you have installed the netCDF4 package.

In [17]:
#import packages
import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl
mpl.rc('font',size=12) # set default font size and weight for plots
from netCDF4 import Dataset # we need this package to read in the GCM data

Next, we read in the GCM data. The GCM data is stored in a file type called "netCDF". This is a very common climate data file type. To learn more about this file type, check out the following [link](https://climatedataguide.ucar.edu/climate-data-tools-and-analysis/netcdf-overview#:~:text=NetCDF%20). 

You likely have these files downloaded already, but if not download the first file [here](https://github.com/kls2177/ccia_files/blob/master/TS_timeseries_cesmle_1850.nc?raw=true) and the second file [here](https://github.com/kls2177/ccia_files/blob/master/TS_timeseries_cesmle_1920_2100.nc?raw=true).

In [3]:
# Read in the data from netcdf files
# These data are global annual mean surface temperatures from the CESM Large Ensemble (LENS) Project.

# Read in 20th and 21st century integrations. There are 30 individual integrations of 181 years each.
fname1="TS_timeseries_cesmle_1920_2100.nc" # filename (assumes the file is in your working directory)
nc1 = Dataset(fname1) # read in file
TS_lens = nc1.variables["gts_ann_allcesmle"][:] # Extract variable we want. Note: TS_lens is a numpy array
year = nc1.variables["year"][:] # Extract another variable

Let's remind ourselves of how this data is structured. Print out the shape of TS_lens. You should see an array with dimensions that correspond to 30 integrations of 181 years each.

In [4]:
# check that you have the right dimensions to your data
TS_lens.shape

(30, 181)

In [5]:
# Read in the pre-industrial control integration. There are 1801 years of this integration.
fname2="TS_timeseries_cesmle_1850.nc"
nc2 = Dataset(fname2)
TS_PI = nc2.variables["gts_annual"][:]

Again, print out the shape of TS_PI. You should have an array with dimensions that correspond to 1801 years.

In [6]:
# check that you have the right dimensions to your data
TS_PI.shape

(1801,)

## I. Hypothesis Testing

If you recall from last week, the pre-industrial control integration displays a fairly stable climate with the wiggles showing the natural variability within the model. The 20th and 21st century integrations, on the other hand, show considerable warming.

Here, we are going to do the following:

- test the hypothesis that each of two samples, one ensemble member over the years 1980-2005 and one ensemble member over the years 2075-2100, is significantly different from the population mean, i.e., the mean of the pre-industrial control run,
- test the hypothesis that two samples of 30 ensemble members each are significantly different from one another,
- calculate the confidence intervals on the two samples.

We are going to explore the above using **both the z-statistic and the t-statistic**.

## II. Difference Between Sample and Population

## **STEP 1:**

The first step is to compute the population statistics: the mean and standard deviation. Recall that we are going to use the pre-industrial control integration to establish our population statistics.

This week, we can just stick to units of Kelvin for simplicity.

**Reflection Question:**\
Before we get started, clearly state the *null hypothesis* and *alternate hypothesis* that we are testing in this case.

In [1]:
# Mean pre-industrial surface temperature 

avg_TS_PI = 
print(avg_TS_PI)

# Standard deviation of pre-industrial surface temperature 

std_TS_PI = 
print(std_TS_PI)

## **STEP 2:**

Next we need our sample statistics. Let's slice our data again to get two **1-member ensembles** for each time period, but this time, don't take the 26-year average - keep all 26 years:

- the end of the 20th century: 1980-2005
- the end of the 21st century: 2075-2100
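
One way to slice by calendar year is a boolean mask built from the year axis. The sketch below uses synthetic stand-in arrays (`demo_year` and `demo_TS` are hypothetical names, not the CESM data) with the same layout as `TS_lens`, so treat it as an illustration of the indexing pattern rather than the expected answer.

```python
import numpy as np

# Synthetic stand-ins (assumptions): 30 members x 181 years, 1920-2100
rng = np.random.default_rng(0)
demo_year = np.arange(1920, 2101)
demo_TS = 287.0 + rng.normal(0.0, 0.1, size=(30, demo_year.size))

# Boolean mask that is True for the years 1980-2005 inclusive
mask = (demo_year >= 1980) & (demo_year <= 2005)

# First ensemble member (row 0), keeping every year in the window
demo_sample = demo_TS[0, mask]
print(demo_sample.shape)
```

Masking on the year values avoids having to work out integer index offsets by hand.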

In [2]:
# Extract the 1980-2005 time period for the first ensemble member. You should end up with a 26 element array

TS_19802005_1mem = 
print(TS_19802005_1mem.shape)

# Extract the 2075-2100 time period for the first ensemble member. You should end up with a 26 element array
TS_20752100_1mem = 

These are our "samples". We now need to compute our sample mean and sample standard deviations for each.
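
When computing a sample standard deviation with NumPy, note that `np.std` divides by N by default, while `ddof=1` divides by N-1. Which convention this course expects is an assumption to check against the courseware; the sketch below, on made-up numbers, only shows how the two differ.

```python
import numpy as np

# Made-up temperatures in Kelvin (illustrative only)
demo_sample = np.array([287.1, 287.3, 286.9, 287.4, 287.0])

demo_mean = np.mean(demo_sample)

# Default ddof=0: divide by N (population-style estimate)
std_pop_style = np.std(demo_sample)

# ddof=1: divide by N-1 (unbiased sample variance)
std_sample_style = np.std(demo_sample, ddof=1)

print(demo_mean, std_pop_style, std_sample_style)
```

For small N the difference between the two is noticeable; for N in the thousands it is negligible.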

In [3]:
# Sample means for each time period

avg_19802005 = 
avg_20752100 = 

# Sample standard deviations for each time period

std_19802005 = 
std_20752100 = 

## **STEP 3:**

Now, let's compute the z-statistics and t-statistics comparing our population to our samples. We will start with the 1980-2005 time period sample.
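
As a rough sketch of the two statistics, the example below uses synthetic data (all `demo_*` names and the numbers are hypothetical). It assumes the z-statistic scales by the population standard deviation and the t-statistic by the sample standard deviation computed with `ddof=1`; the courseware may write the t-statistic in an equivalent form with N-1 under the square root.

```python
import numpy as np

# Synthetic "population" and "sample" (assumptions, not CESM output)
rng = np.random.default_rng(2)
demo_pop = rng.normal(287.0, 0.1, size=1801)
demo_sample = rng.normal(287.5, 0.1, size=26)

mu = demo_pop.mean()          # population mean
sigma = demo_pop.std()        # population standard deviation
N_demo = demo_sample.size     # sample size
xbar = demo_sample.mean()     # sample mean
s = demo_sample.std(ddof=1)   # sample standard deviation (N-1 in denominator)

# z uses the population spread, t uses the sample spread
z_demo = (xbar - mu) / (sigma / np.sqrt(N_demo))
t_demo = (xbar - mu) / (s / np.sqrt(N_demo))
print(z_demo, t_demo)
```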

**Reflection Question:**\
Given our sample size, which is the most appropriate test?

In [4]:
# Specify a value for N and compute the z-statistic for the 1980-2005 time period sample
N = 

# z-statistic
z_19802005 = 
#print(z_19802005)

# t-statistic
t_19802005 = 
#print(t_19802005)

Once we have our z-statistic, we can refer to our z-table for the standard normal distribution to get the probability of getting a z-statistic of this value or greater.
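
For reference, SciPy also provides the survival function `sf`, which returns the upper-tail probability directly. The sketch below, with illustrative values of z and t chosen here, shows that it agrees with `1 - cdf` for moderate values and keeps more precision when the statistic is very large.

```python
import scipy.stats as st

z_demo = 2.0      # illustrative z-statistic
t_demo = 2.0      # illustrative t-statistic
dof_demo = 25     # degrees of freedom, N-1 for N = 26

# Upper-tail probability two ways
p_from_cdf = 1 - st.norm.cdf(z_demo)
p_from_sf = st.norm.sf(z_demo)

# Same idea for the t-distribution, which has heavier tails
pt_from_sf = st.t.sf(t_demo, dof_demo)
print(p_from_cdf, p_from_sf, pt_from_sf)

# For a very large statistic, 1 - cdf rounds to 0.0 in double precision,
# while sf still returns a tiny positive number
print(1 - st.norm.cdf(10.0), st.norm.sf(10.0))
```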

In [6]:
# use the python z-table to find the probability of getting a z-statistic with this value or greater.

import scipy.stats as st

pz_19802005 = 1 - st.norm.cdf(z_19802005) #note that we need to add the 1 - here because st.norm.cdf computes the 
                                         #probability based on the area under the curve to the left of z 
                                         #as in Fig. 8 in the courseware
print(pz_19802005)


pt_19802005 = 1 - st.t.cdf(t_19802005,N-1)
print(pt_19802005)

Wow! These are very low probabilities - essentially zero. This means that our sample is significantly different from our population, even for the more conservative t-test.

**Reflection Question:**\
If we got such a large z-statistic/t-statistic and low probabilities when we compared the 1980-2005 time period sample to the pre-industrial population, what do you think will happen when we repeat the above for the 2075-2100 time period sample?

Let's find out. Repeat the above for the 2075-2100 time period sample.

In [7]:
# Specify a value for N and compute the z-statistic and t-statistic for the 2075-2100 time period sample
N = 

# z-statistic
z_20752100 = 
print(z_20752100)

# t-statistic
t_20752100 = 
print(t_20752100)

In [8]:
# use the python z-table to find the probability of getting a z-statistic with this value or greater.

pz_20752100 = 1 - st.norm.cdf(z_20752100) #note that we need to add the 1 - here because st.norm.cdf computes the 
                                         #probability based on the area under the curve to the left of z 
                                         #as in Fig. 8 in the courseware
print(pz_20752100)


pt_20752100 = 1 - st.t.cdf(t_20752100,N-1)
print(pt_20752100)

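An even larger z-statistic, and an essentially zero probability of drawing a sample mean this far from the population mean if the null hypothesis (that the sample mean and population mean are the same) were true. We can therefore reject the null hypothesis.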

## III. Difference Between Two Samples

Now, we will revisit the 30-member samples from last week but we will look at two different time periods: **2031-2040** and **2061-2070** - near-term and medium-term future climate change time periods. Let's slice our data up again like we did last week.
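
A possible pattern for the slicing is sketched below on synthetic data shaped like `TS_lens` (`demo_year` and `demo_TS` are hypothetical stand-ins): select the years with a boolean mask, then average along the time axis so that one value per ensemble member remains.

```python
import numpy as np

# Synthetic stand-ins (assumptions): 30 members x 181 years, 1920-2100
rng = np.random.default_rng(1)
demo_year = np.arange(1920, 2101)
demo_TS = 288.0 + rng.normal(0.0, 0.1, size=(30, demo_year.size))

# Select the decade 2031-2040 inclusive
decade = (demo_year >= 2031) & (demo_year <= 2040)

# Average over time (axis=1), leaving one mean per ensemble member
demo_decadal_mean = demo_TS[:, decade].mean(axis=1)
print(demo_decadal_mean.shape)
```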

In [9]:
# Calculate the mean for the 2031-2040 time period for all 30 ensemble members
avg_TS_20312040_30mem = 
print(avg_TS_20312040_30mem.shape)

# Calculate the mean for the 2061-2070 time period for all 30 ensemble members
avg_TS_20612070_30mem = 

We can use a z-statistic or t-statistic to assess whether or not these two samples are significantly different. 
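
As a hedged illustration of a two-sample statistic, the sketch below divides the difference of the sample means by the combined standard error, using unpooled sample variances. This is an assumption about the form of the test; the courseware may use a pooled variance instead, although with equal sample sizes the pooled statistic takes the same value. The data are synthetic and the `demo_*` names are hypothetical.

```python
import numpy as np

# Two synthetic 30-member samples (assumptions, not CESM output)
rng = np.random.default_rng(3)
demo_a = rng.normal(288.8, 0.1, size=30)
demo_b = rng.normal(289.6, 0.1, size=30)

N_a, N_b = demo_a.size, demo_b.size

# Difference of the sample means
diff = demo_b.mean() - demo_a.mean()

# Standard error of the difference from the two sample variances
se = np.sqrt(demo_a.var(ddof=1) / N_a + demo_b.var(ddof=1) / N_b)

stat_demo = diff / se
print(stat_demo)
```

Whether this number is read as a z-statistic or a t-statistic depends on which distribution it is compared against.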

## STEP 1:

First, we will need the sample statistics. This time the sample statistics are going to be somewhat different.

In [10]:
# Compute sample means and sample standard deviations

avg_20312040_30mem = 
avg_20612070_30mem = 

std_20312040_30mem = 
std_20612070_30mem = 

## STEP 2:

In [11]:
# Specify a value for N and compute the z-statistic and t-statistic
N = 

# z-statistic
z_30mem = 
print(z_30mem)

# t-statistic
t_30mem = 
print(t_30mem)

In [12]:
# use the python z-table to find the probability of getting a z-statistic with this value or greater.

pz_30mem = 1 - st.norm.cdf(z_30mem) #note that we need to add the 1 - here because st.norm.cdf computes the 
                                         #probability based on the area under the curve to the left of z 
                                         #as in Fig. 8 in the courseware
print(pz_30mem)


pt_30mem = 1 - st.t.cdf(t_30mem,N-1)
print(pt_30mem)

## IV. Confidence Intervals on Samples

Finally, let's compute the **99\%** confidence intervals on the samples in Section III. 

We can make a bar plot showing these confidence intervals on the sample means. Take a look at the following [example](https://problemsolvingwithpython.com/06-Plotting-with-Matplotlib/06.07-Error-Bars/) for some tips.

**Note:** To add error bars/confidence intervals to a bar plot, we only need the $\pm$ part - we do not need to include the sample mean.
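
A minimal sketch of the $\pm$ part for a 99% confidence interval is shown below, assuming a two-sided interval based on the t-distribution with N-1 degrees of freedom and a sample standard deviation computed with `ddof=1`. The sample is synthetic and the `demo` names are hypothetical.

```python
import numpy as np
import scipy.stats as st

# Synthetic 30-member sample (assumption, not CESM output)
rng = np.random.default_rng(4)
demo_sample = rng.normal(289.0, 0.1, size=30)
N_demo = demo_sample.size

conf_level = 0.99

# Two-sided critical value: leave (1 - conf_level)/2 in each tail
t_crit_demo = st.t.ppf(1 - (1 - conf_level) / 2, N_demo - 1)

# Half-width of the interval (the "plus or minus" part only)
half_width = t_crit_demo * demo_sample.std(ddof=1) / np.sqrt(N_demo)
print(t_crit_demo, half_width)
```

The half-width is what `yerr` expects in `plt.bar`, since Matplotlib draws the error bar symmetrically about each bar height.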

In [13]:
# find two-sided t-statistic for 99% confidence and v = N-1
t_crit = 
print(t_crit)

In [14]:
# calculate confidence intervals (don't include the mean)
ci_20312040 = 
ci_20612070 = 
print(ci_20312040)

In [65]:
# prepare bar plot
labels = ['Near-Term (2031-2040)', 'Medium-Term (2061-2070)']
x_pos = np.arange(len(labels))
means = [avg_20312040_30mem,avg_20612070_30mem]
error = [ci_20312040,ci_20612070]

In [18]:
# plot bar plot
plt.figure(figsize=(8,5))
plt.bar(labels, means,yerr=error,align='center',alpha=0.5,ecolor='black',capsize=10)
plt.ylabel('Temperature (Kelvin)')
plt.xticks(x_pos)
plt.title('Near-Term and Medium-Term Global Climate Change')
plt.ylim(288.5,290.5)

# Save the figure and show
plt.tight_layout()
plt.savefig('bar_plot_with_error_bars.png')
plt.show()

**Reflection Question:**\
These confidence intervals are very small. What does this say about our estimate of the mean near-term and medium-term temperatures?

## V. Bonus Section

Let's imagine that we only had 3 ensemble members of our model (the typical number that climate modelling centres were running for the last IPCC report). 

Repeat Sections III and IV but using only the first 3 ensemble members. How do you think your results will change? See if you are right.
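
One ingredient you can explore before touching the model data is how the sample size alone affects a confidence interval. The sketch below assumes a 99% two-sided interval from the t-distribution and compares the factor that multiplies the sample standard deviation for 3 and 30 members; `n_members` and `scale` are hypothetical names.

```python
import numpy as np
import scipy.stats as st

conf_level = 0.99  # assumed confidence level, matching Section IV

scale = {}
for n_members in (3, 30):
    # Two-sided critical t-value with n_members - 1 degrees of freedom
    t_crit_demo = st.t.ppf(1 - (1 - conf_level) / 2, n_members - 1)
    # Factor multiplying the sample standard deviation in the half-width
    scale[n_members] = t_crit_demo / np.sqrt(n_members)
    print(n_members, t_crit_demo, scale[n_members])
```

The sample standard deviation itself will also differ between the 3-member and 30-member cases, so this factor is only part of the picture.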