# Lab 7: Inference and Global Climate Change 

By the end of this lab, you should know how to:

1. Test whether observed data appears to be a random sample from a distribution.
2. Analyze a natural experiment.
3. Implement and interpret a sign test.
4. Create a function to run a general hypothesis test.
5. Analyze visualizations and draw conclusions from them.

In [None]:
name = ...

In [None]:
## import statements
# These lines load the tests. 
from gofer.ok import check

import numpy as np
from datascience import *
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')

## Overview 

Climate change usually refers to the general trend of warming temperatures globally.  Along with these increasing temperatures, unusual shifts in weather activity such as hurricanes, storms, and winds are also usually classified under climate change.  While the climate can shift due to natural occurrences, scientists have found that human activities have contributed to the warming trend.  One possible explanation for the warming is increased solar activity; however, scientists have found that solar activity has not generally increased during the period when temperature has increased.  

<img src='solar_temp.jpg' width="500" height="340">

Links: [NASA](https://climate.nasa.gov/causes/) [Canada](https://www.canada.ca/en/environment-climate-change/services/climate-change/causes.html)

### Data 

While there are several different metrics we could analyze to make inferences about overall trends in global climate change, for simplicity's sake we will focus on land temperature across different countries.  The original table came from this [database](https://github.com/OpenFloodAI/Climate-Change-Datasets); however, it has been reformatted to make the downstream analyses easier.  There are 15 columns: the year (`year`), the country (`country`), the average temperature of that country in that year (`avg`), and one column for each month of that year with that month's temperature.  

In [None]:
temps = Table.read_table('temp_per_country.csv')
temps

### Data Exploration
Let's explore this data a bit.  We will start by getting a list of all of the countries. 

In [None]:
np.unique(temps['country'])

Notice that some entries are continents, not countries, and there are a few duplicates such as Netherlands and Netherlands (Europe). Cleaning the data is virtually always the first step.

Let's remove them.

In [None]:
# Use the where() method in a loop to remove unwanted rows
non_countries_and_dups = [
    "Africa",
    "Denmark (Europe)",
    "Europe",
    "France (Europe)",
    "French Southern And Antarctic Lands",
    "Kingman Reef",
    "Netherlands (Europe)",
    "North America",
    "Oceania",
]
for country in non_countries_and_dups:
    temps = temps.where('country', are.not_equal_to(country))
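
The loop above filters the table once per unwanted name. As a rough illustration of the same idea on plain NumPy arrays, a single boolean mask built with `np.isin` can drop every unwanted label in one step. The `labels` and `values` arrays below are made-up toy data, not the lab's table.

```python
import numpy as np

# Toy data (hypothetical): labels mixing countries, a continent, and a duplicate
labels = np.array(['Brazil', 'Europe', 'Finland', 'Netherlands (Europe)', 'Netherlands'])
values = np.array([24.9, 8.4, 1.7, 9.6, 9.6])

unwanted = ['Europe', 'Netherlands (Europe)']

# True for rows to keep: the label is NOT in the unwanted list
keep = ~np.isin(labels, unwanted)

clean_labels = labels[keep]
clean_values = values[keep]
print(clean_labels)
```

Both approaches keep exactly the rows whose label is not in the list; the mask simply does the comparison for all names at once.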

## Part 1: Basic Hypothesis Testing

### How has the average temperature changed between 1850 and 2000?
Has the Earth warmed? If we look at the temperature changes for all of the countries, what will we see?

Let's investigate...

In [None]:
T_1850 = temps.where('year', 1850).select('year', 'country', 'avg')
T_2000 = temps.where('year', 2000).select('year', 'country', 'avg')

temp_change = T_1850.join('country', T_2000, 'country').relabel('avg', 'T1').relabel('avg_2', 'T2')
temp_change = temp_change.with_columns('T_diff', temp_change.column('T2') - temp_change.column('T1'))

temp_change.show(3)

### <font color=blue> **Question 1.** </font>
Calculate the average temperature change across all countries in the data set.

In [None]:
mean_diff = ...

In [None]:
check('tests/q1.py')

### <font color=blue> **Question 2.** </font>
Make a histogram of the temperature changes. Then, in a markdown cell below the histogram, describe the distribution. Do you think these changes could be random? Why or why not?

### <font color=blue> **Question 3.** </font>
Let's test whether the change in mean temperature between 1850 and 2000 is statistically significant. In the cell below, formulate a hypothesis and a null hypothesis.

#### Hypothesis:
.....

#### Null Hypothesis:
.....

## Hypothesis testing
**We are going to test the null hypothesis two ways:<br>**
First, we will use a standard paired t-test taught in traditional statistics classes.<br>
Second, we will use a simulation approach taught in this class, an approach we will use from now on.

## What is a "paired t-test?"
The <i>t</i>-test was developed by a chemist, William Gosset, working for the Guinness brewery in 1908. He didn't want competitors to know the statistical methods he developed were being used at the brewery for quality control, so he published his papers under the pen name "Student," hence the test became known as the Student t-test.

What Gosset showed was that:
* **IF** two samples are independent
* and **IF** the samples are random
* and **IF** both samples come from populations with a normal distribution
* and **IF** both populations have approximately the same standard deviation
* **THEN** we can calculate the following t-statistic

  $$ t = \frac{\bar{x_1} - \bar{x_2}}{SE} $$
  
which is the difference between the means of the two samples divided by the standard error of that difference. For two independent samples, the squared standard errors of the two sample means add:

$$ SE^2 = (SE)_1^2 + (SE)_2^2 $$

What is the "standard error" of a sample mean? It is the sample standard deviation divided by the square root of the number of observations:

$$ SE = \frac{s}{\sqrt{n}} $$

The standard error tells you how much the sample mean would vary if you were to repeat a study using new samples from the same population. Notice that the more data you have in your sample, the smaller the standard error, so the less uncertainty you have in your sample mean.
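
To make the effect of sample size concrete, here is a minimal sketch that computes the standard error for a small and a large sample drawn from the same made-up population. The helper `standard_error`, the population, and the seed are assumptions for illustration; the sketch uses `ddof=1` (the sample standard deviation), whereas `np.std` defaults to `ddof=0`.

```python
import numpy as np

def standard_error(sample):
    """Sample standard deviation divided by sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    # ddof=1 gives the sample standard deviation (divides by n - 1)
    return np.std(sample, ddof=1) / np.sqrt(len(sample))

rng = np.random.default_rng(0)
population = rng.normal(loc=10, scale=2, size=100_000)  # made-up population

se_small = standard_error(rng.choice(population, size=25))
se_large = standard_error(rng.choice(population, size=2500))
print(se_small, se_large)
```

With 100 times as many observations, the standard error should come out roughly 10 times smaller, since it scales with one over the square root of n.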

You can think of it this way: the signal in your data is the difference between the two means -- before and after treatment -- or in our case, between two different time periods. The noise is the variation of the data around the means (standard deviation). If the signal is large compared with the noise, we can reject the null hypothesis, but if the difference between the means is small compared with the scatter of the data around the means, the distributions will blur together and we cannot reject the null hypothesis that the observations could be the result of random variation.

A **paired t-test** is a variation of the t-test where we are looking at paired data -- measurements made before and after some experiment. For example, patients before and after treatment, or in our case, the annual temperature before and after 150 years of industrialization. 

In this case, the null hypothesis is that the average difference between past and present temperatures is zero. So the t-value becomes:

$$ t = \frac{\bar{x}_{diff}}{SE} $$

Look at this closely. The paired t-value measures how far the average paired differences are from zero relative to the standard error. 
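
As a small worked example of the paired formula, the sketch below computes the t-value for five made-up before/after pairs. The arrays are invented for illustration, and `ddof=1` is an assumption of this sketch (it gives the sample standard deviation, which differs slightly from the `np.std` default).

```python
import numpy as np

before = np.array([10.0, 12.0, 9.0, 11.0, 13.0])   # made-up "before" values
after = np.array([11.0, 14.0, 10.0, 13.0, 14.0])   # made-up "after" values

diffs = after - before                   # paired differences
n = len(diffs)
se = np.std(diffs, ddof=1) / np.sqrt(n)  # standard error of the mean difference
t_paired = np.mean(diffs) / se           # mean difference measured in standard errors
print(t_paired)
```

Every pair increased here, so the mean difference is several standard errors away from zero and the t-value is large.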

In [None]:
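# Two-sample (unpaired) quantities, shown for comparison only
s1 = np.std(temp_change.column('T1'))
s2 = np.std(temp_change.column('T2'))
s_pooled = np.sqrt((s1**2 + s2**2) / 2)   # pooled standard deviation of the two samples
dof = 2 * temp_change.num_rows - 2        # degrees of freedom printed below

# Paired quantities: statistics of the per-country temperature differences
mean_diff = np.mean(temp_change.column('T_diff'))
s = np.std(temp_change.column('T_diff'))  # standard deviation of the differences
n = temp_change.num_rows
std_error = s / np.sqrt(n)                # standard error of the mean difference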

print(f'The mean temperature change is: {mean_diff:.2f}')
print(f'The standard deviation of the temperature differences is: {s:.3f}')
print(f'The standard error is: {std_error:.4f}')
print(f'The degrees of freedom is: {dof}')

### <font color=blue> **Question 4.** </font>
Calculate the t-value.

In [None]:
t = ...
print("The t value is:", t)

In [None]:
check('tests/q4.py')

### **Find the p-value** 
This is where the magic occurs in traditional statistics classes. 
    
**Magical Degrees of freedom:** You need to know the "degrees of freedom," which is the number of data points in your samples minus the number of statistics you have already calculated from the data. (Note: for large data sets the number of degrees of freedom is nearly equal to the number of observations.) We estimated one mean for each of the two samples, so above we subtracted 2 from the total number of observations.

**Magical p-values:**
Now that you have a t-statistic and the degrees of freedom, you can use [an online calculator](https://www.statology.org/t-score-p-value-calculator/) to find the p-value and enter it below, or you can look up values in a table such as [Students-t-table-one-tailed-two-tailed](https://postimg.cc/RNrSSGdv). If the t-statistic is greater than or equal to a critical value in the table, the p-value is at most the p listed for that column. For example, with 10 degrees of freedom and a t-statistic of 2.30, we exceed the critical value of 2.23 but fall short of 3.17, so the p-value is less than 0.05 but greater than 0.01. When the degrees of freedom go beyond the bottom of the table, the t-distribution is nearly equal to the normal distribution and we use the values in the row with degrees of freedom labelled $\infty$.

<br><center><b>Critical Values of <i>t</i></b></center>
<center>See: <a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm">NIST</a></center>

|$\nu$<br>degrees of freedom|95%<br>p = 0.05|99%<br>p = 0.01|
|:-:|:--|:--|
|2|4.303|9.92|
|3|3.18|5.84|
|4|2.78|4.60|
|5|2.57|4.03|
|6|2.45|3.71|
|7|2.36|3.50|
|8|2.31|3.36|
|9|2.26|3.25|
|10|2.23|3.17|
|15|2.13|2.95|
|20|2.09|2.85|
|30|2.04|2.75|
|$\infty$|1.96|2.58|
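
The table lookup described above can be sketched as a small function. The dictionary copies the critical values from the table; the function name `p_bracket` and the rule for choosing a row (the largest tabulated row that does not exceed the degrees of freedom, and the $\infty$ row beyond 30) are assumptions made for this illustration.

```python
# Critical values copied from the table above: df -> (t at p = 0.05, t at p = 0.01)
CRITICAL_T = {
    2: (4.303, 9.92), 3: (3.18, 5.84), 4: (2.78, 4.60), 5: (2.57, 4.03),
    6: (2.45, 3.71), 7: (2.36, 3.50), 8: (2.31, 3.36), 9: (2.26, 3.25),
    10: (2.23, 3.17), 15: (2.13, 2.95), 20: (2.09, 2.85), 30: (2.04, 2.75),
    float('inf'): (1.96, 2.58),
}

def p_bracket(t, dof):
    """Return a rough bracket for the p-value using the table rows."""
    if dof < 2:
        raise ValueError("the table starts at 2 degrees of freedom")
    if dof > 30:
        row = float('inf')  # past the bottom of the table: use the infinity row
    else:
        # Largest tabulated row not exceeding dof (slightly conservative)
        row = max(d for d in CRITICAL_T if d <= dof)
    crit_05, crit_01 = CRITICAL_T[row]
    t = abs(t)
    if t >= crit_01:
        return "p <= 0.01"
    if t >= crit_05:
        return "0.01 < p <= 0.05"
    return "p > 0.05"

print(p_bracket(2.30, 10))
```

For the example in the text (t = 2.30 with 10 degrees of freedom), the function reports that p lies between 0.01 and 0.05.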




### <font color=blue> **Question 5.** </font>
The p-value is the probability of seeing a temperature increase at least as large as the one observed if the null hypothesis were true. Obtain an estimate of this value from the table above or the linked tables above. Based on the p-value, do we reject or fail to reject the null hypothesis?  Explain in the cell below the check of the p-value.

In [None]:
p = ...

In [None]:
check('tests/q5.py')

<font color='blue'>**Can we reject the null hypothesis? Why or why not?**</font>

## Simulation Approach
Now we will use the second approach to finding a p-value. No formulas, no tables of critical values, no magic, just simulation. 

We have paired temperatures, but under the null hypothesis it should make no difference which of the two temperatures for each country we assign to 1850 and which to 2000, because the null hypothesis assumes no global warming. This is analogous to testing a drug. Under the null hypothesis it would not matter which patient we assign to the control group and which to the treatment group because the null hypothesis is that the drug has no effect.

To simulate this, for each country we randomly assign one of the two temperatures to 1850 and the other to 2000. We then calculate the mean temperature difference. We do this over and over, saving the mean temperature difference each time. This way we build up a distribution of mean temperature differences assuming no global warming. Once we have that distribution, we calculate the p-value as the fraction of the simulated mean differences that are as large or larger than the observed mean difference. If this happened one in a hundred times, we'd say the p-value was 1%.

To randomly change the order of T1 and T2, we can use np.random.shuffle(), a function that randomly shuffles the order of an array of numbers, though in this case each array will have just two values: the average temperature in 1850 and the average temperature in 2000.

In [None]:
def shuffle_diff(T1, T2):
    '''
    This function takes two values and returns randomly either:
    (T1 - T2) or (T2 - T1) with equal probability.
    '''
    paired_temperatures = make_array(T1, T2)
    np.random.shuffle(paired_temperatures)
    return paired_temperatures[0] - paired_temperatures[1]

In [None]:
# Test the function
T1 = 1
T2 = 2
for _ in np.arange(10):
    print(shuffle_diff(T1, T2))

In [None]:
# Test applying the function to our data table
temp_change_simulate = temp_change.with_columns('T_diff_sim', temp_change.apply(shuffle_diff, 'T1', 'T2'))
temp_change_simulate.show(10)

### <font color=blue> **Question 5b.** </font>
Explain why in the table above the sign of T_diff and T_diff_sim is sometimes the same and sometimes reversed.

In [None]:
# Simulate doing this over and over.
# Each iteration we shuffle the order of the paired temperatures.
# Store each simulated mean of all the temperature differences in the list: "simulated_diffs"
num_simulations = 10000
simulated_diffs = []
for _ in np.arange(num_simulations):
    temp_change_simulate = temp_change.with_columns('T_diff_sim', temp_change.apply(shuffle_diff, 'T1', 'T2'))
    simulated_diffs.append(np.mean(temp_change_simulate.column('T_diff_sim'))) 
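
Swapping T1 and T2 for one country simply flips the sign of that country's difference, so the shuffle above can also be sketched as multiplying each observed difference by a random +1 or -1. The version below is a vectorized illustration on made-up differences; `toy_diffs`, `rng`, and the seed are assumptions for the example, not the lab's data.

```python
import numpy as np

rng = np.random.default_rng(42)

toy_diffs = np.array([0.8, 1.1, 0.5, 1.4, 0.9, 0.7, 1.2, 1.0])  # made-up T2 - T1 values
num_simulations = 10_000

# One row per simulation, one column per pair; each entry is +1 or -1 with equal chance
signs = rng.choice([-1, 1], size=(num_simulations, len(toy_diffs)))

# Flip signs and average across pairs to get one simulated mean difference per row
sim_means = (signs * toy_diffs).mean(axis=1)

print(sim_means.shape, sim_means.mean())
```

Because every sign is equally likely, the simulated means are centered near zero, which is what the null hypothesis of no change predicts.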

### <font color=blue> **Question 6.** </font>
Make a histogram of simulated_diffs. *Hint*: You can either use matplotlib commands directly, or you can put the simulated differences into a table and use the .hist() method.

### <font color=blue> **Question 7.** </font>
The p-value is the fraction of temperature differences simulated under the null hypothesis that are greater than or equal to the observed average temperature change between 1850 and 2000.

Calculate the p-value.

In [None]:
p_val_simulation = ...
print(f"P-value from simulation: {p_val_simulation}")    

In [None]:
check('tests/q7.py')

## Null hypothesis rejected!
The histogram below shows the distribution of mean temperature changes under the null hypothesis. The blue triangle shows the actual mean change in temperature. We see that the actual average change in temperature from 1850 to 2000 lies well beyond the distribution of differences we simulated assuming no global warming. Again, the p-value is virtually zero, so we reject the null hypothesis. Yes, the planet is warming.

In [None]:
plt.hist(simulated_diffs, bins=20)
plt.plot(mean_diff, 0, '^', markersize=20);

### <font color=blue> **Question 8.** </font>
**<font color=blue> Which approach do you prefer? </font>**
    
**Do you wish this class taught you a bunch of statistical equations, the underlying assumptions and rules for when to use each test, or do you prefer to learn a little Python and test hypotheses by running simulations?** 
    
Learn the equations or learn to code? Which do you prefer and why? We are really curious.<br>
Use the markdown cell below for your answer. 

To see another example of hypothesis testing using simulation, [read this chapter](https://inferentialthinking.com/chapters/12/3/Deflategate.html?highlight=t+testz)
in your textbook about the football controversy known as "Deflategate."

## Part 2: Testing a trend

### <font color=blue> **Question 9.** </font>
The cell below creates a pivot table with years as the rows and each country as a new column. We use the 'avg' column which contains the  average annual temperature. 

In [None]:
pivotTable = temps.select('year', 'country', 'avg').pivot('country', 'year', 'avg', sum)
pivotTable

### <font color=blue> **Select two countries of your choice to study** </font>
Select two countries from our dataset and draw a line plot of the changes in temperature over time.  You only want to graph the years that have data for both of your countries of interest. (Hint: Use `.select` to choose the appropriate columns. You may want to use `where` and `are.above()` to select the years with data.)  There is no autocheck for this question, as you may all have different answers depending on the countries you pick.  

In [None]:
yourCountries = ...
yourCountries.show(5)

In [None]:
yourCountries = pivotTable.select('year','Finland','Brazil').where('year',are.above(1850))
yourCountries.show(5)

In [None]:
yourCountries.plot('year')
plt.ylabel(r'Temperature [$^\circ$C]')  # raw string so that \c is not treated as an escape
plt.xticks(np.arange(1850, 2025, 25))
plt.show()

### <font color=blue> **Question 9. Discussion** </font>
In this markdown cell, explain an observation you see from the figure you generated.


...

### <font color=blue> **Question 10.** </font>
Null and alternative hypothesis.  This time we will look at the overall trend rather than the difference between starting and ending temperatures.

Based on our preliminary figures and what we know about creating good hypotheses, set the null and alternative hypothesis below:  


-  Hypothesis: The temperatures are trending upward.
-  Null hypothesis:   ... 


To test the null hypothesis we're interested in identifying whether the temperature increased or decreased in each time period.  
Temperatures vary widely across countries and years, presumably due to the vast array of differences among the climates and human intervention. Rather than attempting to analyze the temperatures themselves, here we will restrict our analysis to whether or not temperatures increased or decreased over certain time spans. We will not concern ourselves with how much temperatures increased or decreased; only the direction of the changes - whether they increased or decreased.

The np.diff function takes an array of values and computes the differences between adjacent items of a list or array as such:

    [item 1 - item 0 , item 2 - item 1 , item 3 - item 2, ...]

Instead, we may wish to compute the difference between items that are two positions apart. For example, given a 5-element array, we may want:

    [item 2 - item 0 , item 3 - item 1 , item 4 - item 2]

The diff_n function below computes this result.

In [None]:
def diff_n(values, n):
    '''
    Parameters:
    values is an array of numbers
    n is the offset (how far apart the numbers are in the array)
    
    Example: 
    If values = [2, 6, 8, 9, 15] and n = 2,
    the function will subtract values that are 2 apart: (8 - 2), (9 - 6) and (15 - 9) 
    '''
    return np.array(values)[n:] - np.array(values)[:-n]

diff_n(make_array(2, 6, 8, 9, 15), 2)
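
As a quick check of how the offset works, the sketch below repeats the `diff_n` definition (so the block runs on its own) and compares an offset of 1 with NumPy's built-in `np.diff`. The input list is the same example used in the docstring.

```python
import numpy as np

def diff_n(values, n):
    """Differences between items that are n positions apart."""
    return np.array(values)[n:] - np.array(values)[:-n]

values = [2, 6, 8, 9, 15]

print(diff_n(values, 1))   # adjacent differences
print(np.diff(values))     # the same result from NumPy's built-in
print(diff_n(values, 3))   # items three positions apart
```

Note that `n` should be at least 1: with `n = 0` the slice `[:-0]` is empty, so the subtraction does not give a meaningful result.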

### <font color=blue> **Question 11.** </font> 
Implement the function changes that takes an array of temperatures for a country, ordered by increasing year. For all two-year periods (e.g., from 1960 to 1962), it computes and returns the number of increases minus the number of decreases.

For example, the array r = make_array(10, 7, 12, 9, 13, 9, 11) contains three increases (10 to 12, 7 to 9, and 12 to 13), one decrease (13 to 11), and one change that is neither an increase nor a decrease (9 to 9). Therefore, changes(r) would return 2, the difference between three increases and one decrease.

Hint: Consider using the `diff_n` function together with boolean comparisons and `np.count_nonzero`: count the elements of the `diff_n` result that represent increases, and separately count those that represent decreases. Recall that Python counts `True` as 1 and `False` as 0, so counting non-zeros counts up all of the values that are `True`.
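
To illustrate the counting trick on its own, here is a minimal sketch using a made-up array of readings that is unrelated to the temperature data. Comparing an array with a number gives a boolean array, and `np.count_nonzero` counts its `True` entries.

```python
import numpy as np

readings = np.array([0.0, 3.2, 0.0, 1.1, 7.5, 0.0])  # made-up values

is_positive = readings > 0                 # boolean array: True where the reading is positive
num_positive = np.count_nonzero(is_positive)
num_large = np.count_nonzero(readings > 5) # the comparison can also be written inline
print(is_positive, num_positive, num_large)
```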

In [None]:
def changes(array, years = 2):
    "Return the number of increases minus the number of decreases after two years."
    ...

In [None]:
# renumber the test
check('tests/q11.py')

### <font color=blue> **Question 12.** </font>
Assign changes_by_country to a table with one row per country that has two columns: the Country name and the Temperature changes statistic computed across all years in our data set for that country. It may be useful to split this process into two steps.   The final table's first 2 rows should look like this:

|country    |avg changes| 
|-----------|-----------| 
|Afghanistan|18         | 
|Africa     |8          |

Hint: You can use a `group` method to apply your `changes` function to each column in the original data set while grouping on each country. See this example from Olympic data below:

**Note** This temperature dataset has a few peculiarities, such as including Africa in the `country` column.

In [None]:
NORUSA = Table.read_table('NORUSA.csv')
NORUSA_NUMBERS = NORUSA.group(['Year','Team']) # Number of athletes per year
NORUSA_NUMBERS

#### Now compute the increases - decreases for the winter olympics for each team

The code below groups by 'Team' across all the years of the Olympics to give the following table.

|Team| Year changes | count changes|
|----|---|---|
|Norway|20|10|
|United States|20|18|

Apply this concept to create the table showing net change for each country.

In [None]:
NORUSA_NUMBERS.group('Team',changes)
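
Conceptually, grouping with a function collects the values that share a label and then summarizes each collection. The pure-Python sketch below shows that idea on made-up data; `group_apply` and `net_rise` are hypothetical helpers written for this illustration and are not how the `datascience` library implements `group`.

```python
from collections import defaultdict

def group_apply(keys, values, func):
    """Group values by key (keeping their order) and apply func to each group."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: func(group) for key, group in groups.items()}

# Made-up data: a team label and a yearly count for each row
teams = ['A', 'B', 'A', 'B', 'A', 'B']
counts = [5, 9, 7, 8, 6, 10]

def net_rise(group):
    """Hypothetical summary: last value minus first value in the group."""
    return group[-1] - group[0]

print(group_apply(teams, counts, net_rise))
```

Any function that turns an array of values into a single number can play the role of `net_rise` here, which is why a function like `changes` can be passed to `group`.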

In [None]:
changes_by_country = ...
changes_by_country

In [None]:
# Need to renumber test
check('tests/q12.py')

### <font color=blue> **Question 13.** </font>
Assign test_stat to the total increases minus the total decreases for all two-year periods and all countries in our data set. For example, if the temperature in Albania went up 23 times and fell 17 times, the total change for Albania would be 6. We want the total value for all the countries together.

In [None]:
test_stat = ...
print('Total increases minus total decreases, across all countries and years:', test_stat)

In [None]:
# Need to renumber test
check('tests/q13.py')

"More increases than decreases," one person exclaims, "Temperatures tend to go up across two-year periods. What dire times we live in."

"Not so fast," another person replies, "Even if temperatures just moved up and down uniformly at random, there would be some difference between the increases and decreases. There were a lot of countries and a lot of years, so there were many chances for changes to happen. If country temperature increase and decrease at random with equal probability, perhaps this difference was simply due to chance!"

Based on the null hypothesis above that country temperatures increase and decrease by chance, we can simulate our test statistic.  Our test statistic should depend only on whether temperature increased or decreased, not on the size of any change. Thus we choose:

    Test Statistic: The number of increases minus the number of decreases

The cell below samples increases and decreases at random from a uniform distribution 100 times. The final column of the resulting table gives the number of increases and decreases that resulted from sampling in this way. Using sample_from_distribution is faster than using sample followed by group to compute the same result.

In [None]:
uniform = Table().with_columns(
    "Change", make_array('Increase', 'Decrease'),
    "Chance", make_array(0.5,        0.5))
uniform.sample_from_distribution('Chance', 100)
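
For comparison, a similar draw can be sketched directly in NumPy. This is an alternative illustration rather than what `sample_from_distribution` does internally; the generator `rng` and its seed are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

num_draws = 100
# Counts of [Increase, Decrease] in num_draws draws with equal chances
counts = rng.multinomial(num_draws, [0.5, 0.5])
num_increase, num_decrease = counts
print(num_increase, num_decrease)
```

The two counts always add up to the number of draws, and on most runs they land fairly close to 50 each.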

### <font color=blue> **Question 14.** </font>
Complete the simulation below, which samples num_changes increases/decreases at random many times and forms an empirical distribution of your test statistic under the null hypothesis that increases and decreases are equally probable. Your job is to:

- fill in the function simulate_under_null, which simulates a single sample under the null hypothesis, and
- fill in its argument when it's called below.

As a hint, num_changes should be approximately the number of countries times the number of time comparisons (you can find the number of year comparisons by using diff_n()).  

In [None]:
def simulate_under_null(num_chances_to_change):
    """Simulates some number changing several times, with an equal
    chance to increase or decrease.  Returns the value of your
    test statistic for these simulated changes.
    
    num_chances_to_change is the number of times the number changes.
    """
    uniform = Table().with_columns(
        "Change", make_array('Increase', 'Decrease'),
        "Chance", make_array(0.5,        0.5))
    sample = uniform.sample_from_distribution(..., ...)
    
    ... 
    
    return ...


In [None]:
def empirical_distribution(tbl):
    num_changes = ...
    samples = make_array()
    for i in np.arange(10000):
        samples = np.append(samples, simulate_under_null(...)) 
    # Bins span the full simulated range, including negative values
    Table().with_column('Test statistic under null', samples).hist(bins=np.arange(min(samples), max(samples) + 2, 2))
    return samples

In [None]:
samples = empirical_distribution(...) 

In [None]:
# Need to renumber test
check('tests/q14.py')

### <font color=blue> **Question 15.** </font>
Complete the analysis as follows:

1. Compute a P-value. (Hint: you can use np.count_nonzero())
2. Using a 5% P-value cutoff, draw a conclusion about the null and alternative hypotheses.
3. Describe your findings using simple, non-technical language. What does your analysis tell you about temperature changes over time? What can you claim about causation from your statistical analysis?

**P-value:** ... 


### <font color=blue> **Question 16 Discussion:** </font>
What is your conclusion about the hypotheses?



**Findings:** ...

In [None]:
pvalue = ...
pvalue

### <font color=blue> **Question 17.** </font>

#### Summary
You have tested for global warming two ways:

1. You compared temperatures in 1850 with 2000.  You found that the observed increase was too large to have been random.
2. You looked at all differences in temperature two years apart over a 35 year period, and found that there were far more increases than decreases, again, too many to be explained by random fluctuations.

In both cases, you rejected the null hypothesis as improbable.

<font color=blue>**Think back to the Olympic mini-project challenge question. Use what you have just learned to formulate a new strategy.**</font>

**Challenge Question:** Does the host country team have an advantage? Use a markdown cell to create a strategy to address this question. 

### <font color=blue> **Question 18.** </font>
At the end of each lab, please include a reflection. 
* How did this lab go? 
* Have you used a t-test before in another class? Do you understand how to apply this test?
* Were there questions you found especially challenging you would like your instructor to review in class? 
* How long did the lab take you to complete?

Share your feedback so we can continue to improve this class!

**In the markdown cell below this one write your reflection on this lab.**

...

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
checks = [1, 4, 5, 7, 11, 12, 13, 14]
total = len(checks)
for x in checks:
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/total)))

In [None]:
print("Nice work ",name, user)
import time;
localtime = time.asctime( time.localtime(time.time()) )
print("Submitted @ ", localtime)