# Week 3 Pre-Class Assessment

## Part I: Confidence Intervals using UTSC Weather Station Data

Building on what we covered in Week 2's pre-class assessment, we will continue our examination of the hourly surface pressure data from the UTSC Weather Station. First, we import the packages that we will need:

In [16]:
#  Import packages
import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl
mpl.rc('font',size=16) #set default font size and weight for plots

Next, we load the hourly UTSC Weather Station surface pressure data from 2015-2018.

In [17]:
#  Load data:
#  UTSC surface pressure data for the years 2015-2018 in kPa. 
#  Data are collected hourly.
filename = 'UTSC_P_20152018.csv'
PS = np.genfromtxt(filename, delimiter = ',')

Recall that the mean of our data is:

In [4]:
avg_PS = 
print(avg_PS)

Now, let's calculate the 95% confidence intervals on this mean. 

First, we need to calculate the sample standard deviation. We can do this using the np.std() function, but we need to specify the change to our degrees of freedom. 

The np.std() function assumes by default that the number of degrees of freedom (dof) is simply the sample size, N. However, for the sample standard deviation the dof is N-1, because we have used up one dof estimating the sample mean.
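To see what changing the degrees of freedom does in practice, here is a minimal sketch on a small synthetic array. The values and the variable names (`sample`, `sd_pop`, `sd_samp`) are made up for illustration and are not the UTSC data.

```python
import numpy as np

# Synthetic sample (hypothetical values, not the UTSC data)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = sample.size

# Default behaviour: ddof=0, so the sum of squared deviations is divided by N
sd_pop = np.std(sample)

# Sample standard deviation: ddof=1, so the divisor becomes N - 1
sd_samp = np.std(sample, ddof=1)

# The two estimates differ only by the factor sqrt(N / (N - 1))
ratio = sd_samp / sd_pop
print(sd_pop, sd_samp)
print(ratio, np.sqrt(N / (N - 1)))
```

Try evaluating the factor `np.sqrt(N / (N - 1))` for a few different values of N to see how it behaves as the sample grows.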

In [3]:
#  Calculate the sample standard deviation
#  Use np.std(PS,ddof=?), where ddof = "delta degrees of freedom", 
#   such that dof = N - ddof

s_PS = np.std()
print(s_PS)

Note how similar the sample standard deviation is to the standard deviation we calculated last week. Why?

Now, we need to calculate the test statistic that we will use to compute the two-sided 95% confidence intervals, assuming a particular sampling distribution.
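As a reminder of where the test statistic enters, a two-sided confidence interval on a sample mean is commonly written in the form below. This is a sketch of the standard textbook expression; the symbols $\bar{x}$, $s$, $N$, $\alpha$ and $q$ are notation chosen here and may differ from the lecture notes.

$$
\bar{x} \pm q_{1-\alpha/2}\,\frac{s}{\sqrt{N}}
$$

Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $N$ is the sample size, and $q_{1-\alpha/2}$ is the critical value of the assumed sampling distribution at cumulative probability $1-\alpha/2$, where $1-\alpha$ is the confidence level.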

To calculate test statistics, we need to import an extra Python package:

In [20]:
import scipy.stats as stats

Do we use a normal (z-) distribution or a t-distribution?

Note the corresponding scipy.stats functions are `stats.norm.ppf()` or `stats.t.ppf()`.

Compare the confidence intervals for both the z- and t- test statistics. Are they similar or different? Why?
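The two `ppf()` functions can be tried out on a case that is deliberately different from the one in this assessment. The sketch below assumes a 99% two-sided interval and a hypothetical sample size of 10; the names `confidence`, `N_example`, `z_example` and `t_example` are illustrative only.

```python
import scipy.stats as stats

# Hypothetical settings for illustration: 99% two-sided interval, N = 10
confidence = 0.99
alpha = 1.0 - confidence
N_example = 10

# ppf() takes a cumulative probability (a fraction, not a percent).
# For a two-sided interval, alpha is split evenly between the two tails.
p = 1.0 - alpha / 2.0

# Critical value from the standard normal distribution
z_example = stats.norm.ppf(p)

# Critical value from the t-distribution, which also needs
# the degrees of freedom
t_example = stats.t.ppf(p, df=N_example - 1)

print(z_example, t_example)
```

Re-running the sketch with a much larger `N_example` is a quick way to explore how the two critical values relate to each other.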

In [5]:
#  Calculate z-statistic
#  Enter the probability that you are interested in
#   (in fraction, not percent)

z_stat = stats.norm.ppf()
print(z_stat)

Use your probability tables to check that this is correct.

In [6]:
#  Calculate t-statistic

#  Enter the probability that you are interested in
#   (in fraction, not percent). 
#  What other key piece of information do you have to provide to 
#   this function?

t_stat = stats.t.ppf()     
print(t_stat)    

In [30]:
#  Calculate upper and lower confidence intervals using the z-statistic

Upper_CI_z = 
Lower_CI_z = 

In [7]:
print(Upper_CI_z,Lower_CI_z)

In [32]:
# Repeat using t_statistic

Upper_CI_t = 
Lower_CI_t = 

In [8]:
print(Upper_CI_t,Lower_CI_t)

Let's plot a histogram of the data. Add vertical lines for the mean, the mean $\pm$ one standard deviation, and the upper and lower confidence intervals (computed using the t-statistic).
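If you have not drawn vertical lines on a histogram before, the following sketch shows one possible approach using `axvline()` on synthetic, normally distributed values. The data, colours and labels are placeholders, and the `matplotlib.use("Agg")` line is only there so the sketch runs without a display; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
from matplotlib import pyplot as plt

# Synthetic data (hypothetical, not the UTSC pressure record)
rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=1.0, size=500)

center = np.mean(values)
spread = np.std(values, ddof=1)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(values, bins=30, color="lightgray", edgecolor="k")

# Each axvline() call draws one vertical line spanning the full height
ax.axvline(center, color="k", label="mean")
ax.axvline(center - spread, color="b", linestyle="--", label="mean - 1 s.d.")
ax.axvline(center + spread, color="b", linestyle="--", label="mean + 1 s.d.")

ax.set_xlabel("value (synthetic units)")
ax.set_ylabel("count")
ax.legend()
```

Additional lines, such as confidence limits, can be added with further `axvline()` calls in the same way.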

In [2]:
# plot histogram
plt.figure(figsize=(10,5))

#fill in here...

**Reflection Question:**\
Why are the confidence intervals so much smaller than the standard deviation?

## Part II: Simple Regression

In Week 3, we will discuss linear regression. As an introduction, let's take a quick look at preparing data for regression analysis.

To begin, let's load in some data. Here we will examine the relationship between globally-averaged temperature anomalies and CO2 concentration. These data cover 1979-2010.

In [11]:
filename = "witt.csv"
data = np.genfromtxt(filename, dtype='str', delimiter=',').T  #  transposed so that each data component is its own array
data = data.astype('float')

In [20]:
#  GISTEMP temperature anomalies in hundredths of a degree C
T = data[1]
print(T)

In [21]:
#  CO2 concentrations (in ppm)
CO2 = data[2]
print(CO2)

Now, let's plot the data in a scatter plot to see if there appears to be a linear relationship between the two variables. 

It doesn't really matter which variable we choose to be on the x-axis or the y-axis when we initially plot our data, but how we build our regression model will depend on the hypothesis we're testing about the relationship between the two variables, i.e. which variable is the predictor and which is the predictand.

In [12]:
#  Plot the data using plt.scatter(). Google this to see how to use it.
plt.scatter()

# add title, labels, etc.

We will discuss in lecture how to compute the slope and intercept of the best-fit line, but for now, we will simply use Python tools to help us.

In [23]:
#  Compute the regression coefficients using np.polyfit() 
#   (take a look back at the section on resampling in the courseware)

a = np.polyfit()

In [13]:
print(a)

There should be two values in a: one is the slope and the other is the intercept. Which is which?
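One way to check the ordering for yourself is to fit a line whose slope and intercept you already know. The sketch below uses a synthetic, noise-free line; `known_slope`, `known_intercept`, `x_demo` and `y_demo` are hypothetical names and values chosen for illustration.

```python
import numpy as np

# Synthetic, noise-free line with known coefficients (hypothetical values)
known_slope = 2.0
known_intercept = 5.0
x_demo = np.arange(10.0)
y_demo = known_slope * x_demo + known_intercept

# A degree-1 polynomial fit returns two coefficients
coeffs = np.polyfit(x_demo, y_demo, 1)
print(coeffs)

# np.polyval() evaluates a polynomial using the same coefficient
# ordering that np.polyfit() returns, so it should reproduce y_demo
y_check = np.polyval(coeffs, x_demo)
print(np.allclose(y_check, y_demo))
```

Compare the printed coefficients with the known values to work out which position holds the slope.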

Now we can use the slope and intercept to compute our best-fit line.

In [25]:
# Compute the best-fit line (use y = slope*x + intercept)

y_fit = 

Add this line to the scatter plot. 

Be sure to now choose your x- and y-axes based on a hypothesis about the relationship between the data. 

Properly label your plot and save it as a .png file.
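The plotting and saving steps can be sketched as follows on synthetic data. Everything here is a placeholder: the data, the axis labels and the output file name `example_fit.png` are assumptions for illustration, and the `matplotlib.use("Agg")` line should be left out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
from matplotlib import pyplot as plt

# Synthetic noisy line (hypothetical data, not the temperature/CO2 record)
rng = np.random.default_rng(1)
x_demo = np.linspace(0.0, 10.0, 40)
y_demo = 3.0 * x_demo + 1.0 + rng.normal(scale=2.0, size=x_demo.size)

# Fit a straight line and evaluate it at the x positions
coeffs = np.polyfit(x_demo, y_demo, 1)
y_line = np.polyval(coeffs, x_demo)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x_demo, y_demo, label="synthetic data")
ax.plot(x_demo, y_line, color="r", label="best-fit line")
ax.set_xlabel("predictor (synthetic units)")
ax.set_ylabel("predictand (synthetic units)")
ax.set_title("Synthetic example")
ax.legend()

# Save the figure; the file extension selects the output format
out_name = "example_fit.png"  # hypothetical file name
fig.savefig(out_name, dpi=150, bbox_inches="tight")
```

For your own plot, replace the synthetic arrays with the variables you chose as predictor and predictand, and label the axes with their names and units.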

In [14]:
# Replot the scatter plot and add the best-fit line.

plt.scatter()
plt.plot()

# add title, labels, etc.

plt.savefig()