# Unit 6.2 

We use monthly mean data from station observations that are part of the GHCN database. 
The code is an extension of the code developed for the confidence interval calculations (see unit 5). The main purpose of this coding activity is to conduct a formal t-test and find stations and months for which we can reject the null hypothesis: 

#### H0: Average temperatures have __NOT__ changed over time!
 (more specifically: __There is no difference between temperature climatologies from 1951-1980 and 1991-2020.__)
 
The alternative hypothesis is that they have changed over time and thus the mean temperatures are different. We have good, physically based expectations that the recent decades have higher averages than the earlier decades. However, in this test we will use the two-tailed t-test, which makes no assumption about the direction of the change.
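
For reference, the test for one calendar month can be written in symbols. The notation is introduced here for illustration only: $\mu_1$ and $\mu_2$ are the true mean temperatures of 1951-1980 and 1991-2020, $\bar{x}_1$ and $\bar{x}_2$ the sample means, $s_1$ and $s_2$ the sample standard deviations, and $n_1 = n_2 = 30$ the number of years in each period.

$$
H_0: \mu_1 = \mu_2 \qquad \text{versus} \qquad H_A: \mu_1 \neq \mu_2
$$

$$
t = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$

This is Welch's form of the statistic, which does not assume equal variances in the two periods. Taking the difference as later minus earlier is a choice made here, so that a positive $t$ indicates warming. The two-tailed p-value is the probability, under $H_0$, of a value at least as large as $|t|$ in either tail of the t-distribution.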

Using the monthly mean data, we are in a position to explore in which season the data give the strongest support for a significant temperature change.

### Required data files and Python files:

#### (1) Python module "support.py"

In this notebook you will notice that we do not define the function for downloading the data from the ACIS server. 

Instead, we can separate the function definitions from our Notebook and import the functions with the same syntax that we use to import packages like _numpy_ or _scipy.stats_. This importing of Python code from separate files is known as import of _modules_. (The file is pure Python code and must have the extension *.py*.) 

Our module file is called _support.py_ (see GitHub repository unit6, download the file, and upload it here into directory *unit6*).

**Download the file support.py from GitHub (see folder unit8) and upload it into the same folder where you have this notebook file. The file must be named _support.py_ !**

Note: Packages are more complex, consisting of entire folders and subfolders with Python code. So modules are much simpler to maintain and a good first start to get your useful functions organized.
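
As a minimal sketch of how a module import works, the code below writes a tiny module file into a temporary folder and imports it. The file name `mini_support.py` and the function `celsius_to_fahrenheit` are made up for this illustration and are not part of _support.py_. The temporary folder only keeps the example self-contained. In this course the module file simply sits next to the notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny, hypothetical module file (it stands in for support.py)
module_dir = Path(tempfile.mkdtemp())
(module_dir / "mini_support.py").write_text(
    '"""Hypothetical helper module, for illustration only."""\n'
    "\n"
    "def celsius_to_fahrenheit(temp_c):\n"
    '    """Convert a temperature from deg C to deg F."""\n'
    "    return temp_c * 9.0 / 5.0 + 32.0\n"
)

# Python looks for modules in the folders listed in sys.path.
# The notebook folder is searched by default; the temporary folder is added here.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# Same syntax as "import support as spt"
import mini_support as ms

print(ms.__doc__)                         # the module doc-string
print(ms.celsius_to_fahrenheit(100.0))    # prints 212.0
```

The alias (`ms` here, `spt` in this notebook) is then used as a prefix for every function defined in the module file.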


#### (2) ghcnd_stations_NY.txt

In this notebook we read a list of station id numbers (and additional 'metadata' information) from a text file that has all of the NY stations listed. Many stations do not have complete time series over the whole time period, so those are discarded.

**Download the file _ghcnd_stations_NY.txt_ from GitHub (see folder data) and upload it into the same folder where you have this notebook file, or put it into the data folder that you have already used throughout the course.** Remember that you may have to update the path name in the code below when you read this text file. By default, the file is expected in the folder _../data/_ (a folder named _data_ in the parent directory).






### 1.2 Importing all packages and our own module

Then we check what we imported and how the functions work.

In [None]:
# Python code convention is that standard packages are imported first
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from scipy import stats

In [None]:
# Import of local modules (second after the standard package import) 
import support as spt

# new: use function help() to see the content of the imported module and the doc-string information
#help(spt)

### Reading the file with the NY station data information (metadata)

In [None]:
# Get station list from local file
df_stations = spt.read_station_list(sid_code="USW")
print("number of stations starting with 'USW' (Airport stations)", len(df_stations['sid']))

In [None]:
# take a look at the data in DataFrame df_stations
df_stations

### 1.3 Preparing the data for statistical calculations

<P style="background-color:lightgreen;color:black;font-size:130%">
<BR>
Check which stations have complete data, and  form a final list of stations to use.
<BR>
<BR>
</P>


**The code below is one 'basic' level approach to deal with storing all station data
that have complete time series without any missing values.**

Here we use a simple approach: 
- Access all the station data from the ACIS server. 
- For each station, the downloaded data is checked for missing values.
- If data are complete for the whole period (1951 to 2020, see variable year1, year2 above), then the station metadata information (station id, latitude, longitude, elevation, and full station name) is appended to new lists.

Once we have screened our station data we can do our calculation and plotting for any of those stations by selecting one of the station id names.
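
The screening steps above can be sketched as follows. This is only a sketch under stated assumptions: instead of downloading from the ACIS server, it builds synthetic DataFrames with the columns `year` and `avgt` (the names used in the query code of this notebook). The station ids `FAKE0001` and `FAKE0002` and the helper functions are hypothetical.

```python
import numpy as np
import pandas as pd


def has_complete_data(df, startyear, endyear, column="avgt"):
    """Return True if all 12 months of every year in the range have a value."""
    sub = df[(df["year"] >= startyear) & (df["year"] <= endyear)]
    n_expected = 12 * (endyear - startyear + 1)
    return len(sub) == n_expected and not sub[column].isna().any()


def make_synthetic_station(seed, n_missing=0):
    """Build a made-up monthly series for 1951-2020 (not real GHCN data)."""
    rng = np.random.default_rng(seed)
    years = np.repeat(np.arange(1951, 2021), 12)
    avgt = rng.normal(loc=50.0, scale=10.0, size=years.size)
    avgt[:n_missing] = np.nan  # optionally blank out the first few months
    return pd.DataFrame({"year": years, "avgt": avgt})


# Hypothetical station ids mapped to synthetic data
stations = {
    "FAKE0001": make_synthetic_station(seed=1),
    "FAKE0002": make_synthetic_station(seed=2, n_missing=3),
}

# Keep only the ids whose time series is complete for the whole period
complete_ids = [sid for sid, df in stations.items()
                if has_complete_data(df, 1951, 2020)]
print(complete_ids)  # prints ['FAKE0001']
```

With real data, the dictionary would be replaced by a loop over the station ids in the station list, downloading one station at a time.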

In [None]:
#"USW00094728":  example station id for NYC Central Park
# "USW00014735":  Albany airport
df=spt.get_stationdata_monthly("USW00014735","avgt",startyear=1950,endyear=2020)


In [None]:
df

In [None]:
# 30 year climatology year range

print("Check if all monthly data exist (no nan values) between 1951-1980")
dfq = df.query("year >= 1951 and year <= 1980")
# check if data have missing values (nan)
y = dfq['avgt'].values  # get numpy array with the temperature data
isnan = np.isnan(y)  # checks all data for np.nan values, returns an array of True and False

# np.any returns True if one or more values in the array are True
if np.any(isnan): 
    print(">>> Selected station has missing values in the selected time period <<<")
else:
    print("+++ Selected station has complete data in the selected time period. Good! +++")
    




<P style="background-color:purple;color:gold;font-size:130%">
<BR>
Task 1: 
Select the data for the two 30-year climate periods, 1951-1980 and 1991-2020, and check that both 30-year periods have complete data.
<BR>
<BR>
</P>
    


<P style="background-color:purple;color:gold;font-size:130%">
<BR>
Task 2: 
Calculate the mean, the standard deviation, and the standard error of the mean for each month (see the unit 5 confidence interval calculations for a similar code solution).
<BR>
<BR>
</P>
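
One possible way to organize this calculation is sketched below. It assumes the 30 years of one period have been arranged into an array with one row per year and one column per month. The array here is filled with synthetic numbers, not station data, and all variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for one 30-year period: rows = years, columns = months
seasonal_cycle = 48.0 - 24.0 * np.cos(2.0 * np.pi * np.arange(12) / 12.0)
temps = seasonal_cycle + rng.normal(0.0, 3.0, size=(30, 12))

n_years = temps.shape[0]
monthly_mean = temps.mean(axis=0)          # average over the years
monthly_std = temps.std(axis=0, ddof=1)    # ddof=1: sample standard deviation
monthly_sem = monthly_std / np.sqrt(n_years)

# scipy.stats.sem gives the same standard error (it also uses ddof=1 by default)
print("matches stats.sem:", np.allclose(monthly_sem, stats.sem(temps, axis=0)))

for m in range(12):
    print(f"month {m + 1:2d}: mean={monthly_mean[m]:6.2f} "
          f"std={monthly_std[m]:5.2f} sem={monthly_sem[m]:5.2f}")
```

Using `axis=0` computes all 12 months in one call, which avoids a loop over the months.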
    


<P style="background-color:purple;color:gold;font-size:130%">
<BR>
Task 3: 
Calculate the approximate 95% confidence intervals using 2 * standard error of the mean 
(or the more exact 95% confidence intervals that we calculated in unit 5).
<BR>
<BR>
</P>
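
The two interval versions can be compared with a short sketch. The sample is synthetic (30 made-up values standing in for one calendar month), and the "exact" interval assumes the data are approximately normally distributed, so that the t-distribution with n-1 degrees of freedom applies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=70.0, scale=3.0, size=30)  # made-up monthly means

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)

# Approximate 95% confidence interval: mean +/- 2 standard errors
ci_approx = (mean - 2.0 * sem, mean + 2.0 * sem)

# More exact interval: critical value from the t-distribution (n-1 degrees of freedom)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_exact = (mean - t_crit * sem, mean + t_crit * sem)

print(f"critical t value: {t_crit:.3f}")
print(f"approximate CI: {ci_approx[0]:.2f} to {ci_approx[1]:.2f}")
print(f"exact CI:       {ci_exact[0]:.2f} to {ci_exact[1]:.2f}")
```

For 30 values the critical t value is slightly larger than 2, so the exact interval is slightly wider than the approximate one.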

<P style="background-color:purple;color:gold;font-size:130%">
<BR>
Task 4: Present the climatologies in a graph or two that make it easy to see the differences between the two periods, and how much the confidence intervals overlap.
<BR>
<BR>
</P>
    
- Tip: Check the function plt.errorbar (see unit 5 confidence interval notebook solution guide)
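
A minimal sketch of such a graph is shown below, using synthetic data for both periods (the 1.5 degree offset of the later period is made up). It calls `ax.errorbar`, the Axes-method form of `plt.errorbar`, and shifts the two series slightly along the x-axis so that the error bars do not hide each other. The layout choices are only one option.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
months = np.arange(1, 13)

# Synthetic stand-ins for the two periods: rows = years, columns = months
seasonal_cycle = 48.0 - 24.0 * np.cos(2.0 * np.pi * np.arange(12) / 12.0)
early = seasonal_cycle + rng.normal(0.0, 3.0, size=(30, 12))
late = seasonal_cycle + 1.5 + rng.normal(0.0, 3.0, size=(30, 12))


def mean_and_halfwidth(data):
    """Monthly mean and approximate 95% CI half-width (2 standard errors)."""
    mean = data.mean(axis=0)
    sem = data.std(axis=0, ddof=1) / np.sqrt(data.shape[0])
    return mean, 2.0 * sem


fig, ax = plt.subplots(figsize=(8, 4))
for data, label, shift in [(early, "1951-1980", -0.1), (late, "1991-2020", 0.1)]:
    mean, half = mean_and_halfwidth(data)
    ax.errorbar(months + shift, mean, yerr=half, fmt="o", capsize=3, label=label)

ax.set_xticks(months)
ax.set_xlabel("Month")
ax.set_ylabel("Monthly mean temperature (synthetic)")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell. Plotting the difference of the two climatologies in a second panel is another way to make small differences visible.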

In [None]:
#help(plt.errorbar)

<P style="background-color:purple;color:gold;font-size:130%">
<BR>
Task 5: Calculate the t-test with the help of the function stats.ttest_ind. Choose a winter month and a summer month (or can you figure out how to apply the t-test to all months?). Where do you expect to reject the null hypothesis and obtain the smallest p-value?

<BR>
<BR>
</P>
    
Tip: Check the help text or search online for examples of how to apply the SciPy function *stats.ttest_ind*
(see the link in the references below).

Print out the essential information: 
- the difference in the mean, 
- the t-statistic, the p-value, 
- the test decision, 
- and interpretation of the sign of the difference (sign of the t-statistic).
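
A sketch of the call and the printout is given below. It uses synthetic samples, so the numbers say nothing about any real station. The significance level `alpha = 0.05` and the choice of Welch's form (`equal_var=False`) are assumptions made for this illustration. Passing the later period first makes a positive t-statistic mean "later period warmer".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)

# Synthetic stand-ins for one calendar month in the two 30-year periods
early = 25.0 + rng.normal(0.0, 4.0, size=30)
late = 27.0 + rng.normal(0.0, 4.0, size=30)

alpha = 0.05  # assumed significance level for the two-tailed test
diff = late.mean() - early.mean()
t_stat, p_value = stats.ttest_ind(late, early, equal_var=False)

decision = "reject H0" if p_value < alpha else "cannot reject H0"
direction = "warmer" if t_stat > 0 else "colder"

print(f"difference in the mean: {diff:.2f}")
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
print(f"test decision at alpha={alpha}: {decision}")
print(f"sign of t: the later period is {direction} in this sample")

# All months at once: arrays with rows = years, columns = months, test along axis=0
early_all = 50.0 + rng.normal(0.0, 4.0, size=(30, 12))
late_all = 51.0 + rng.normal(0.0, 4.0, size=(30, 12))
t_all, p_all = stats.ttest_ind(late_all, early_all, axis=0, equal_var=False)
print("p-values for the 12 months:", np.round(p_all, 3))
```

The `axis=0` form returns one t-statistic and one p-value per month, which makes it easy to find the month with the smallest p-value.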



In [None]:
#help(stats.ttest_ind)

## 3 Summary and conclusion

Some comments/remarks here.


---
### References

- [Introduction to import of modules](https://www.programiz.com/python-programming/modules)
- [Function ttest_ind from scipy.stats](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
- [Welch's form of the t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (ttest_ind supports calculating this test statistic when we set the keyword parameter _equal_var=False_)