# Module 1 Tutorial

Many open-source libraries (collections of functions) have been developed for Python, and we will make use of several of them in this course.

The first one is called NumPy and you can find the documentation [here](https://numpy.org/). It is one of the most widely used libraries for scientific computing in Python. The second library we will use is a module from SciPy called scipy.stats ([scipy.stats documentation](https://docs.scipy.org/doc/scipy/reference/stats.html)), and the third is Pandas, a library for handling database-like structures, for which you can find the documentation at this link: [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html).

We import the libraries with the following statement:

In [1]:
import numpy
from scipy import stats
import pandas

Now we will start building our toolbox with some simple tools to describe our data: 

## Confidence Intervals and Descriptive Statistics

In module 1 of the course the first thing that is covered is confidence intervals. As we only have access to samples of data, we assume that neither the population mean nor the population standard deviation is known, and we work with point estimates: the sample mean and the sample standard deviation. From the sample standard deviation we obtain the standard error of the mean, which is the sample standard deviation divided by $\sqrt{n}$, where $n$ is the sample size.

To build a confidence interval we must specify a confidence level and provide the sample of our data. 

Below is a simple function to obtain the confidence interval of your sample.

In [2]:
def get_confidence_interval(data, confidence=0.95):
    """ Determines the confidence interval for a given set of data, 
        assuming the population standard deviation is not known.

    Args:  # 'arguments', or inputs to the function
        data (single-column or list): The data
        confidence (float): The confidence level on which to produce the interval.

    Returns:
        c_interval (tuple): The confidence interval on the given data (lower, upper).
    """

    n = len(data)  # determines the sample size
    m = numpy.mean(data)  # obtains mean of the sample

    se = stats.sem(data)  # obtains standard error of the sample

    c_interval = stats.t.interval(confidence, n-1, m, se)  # determines the confidence interval
    return c_interval  # which is of the form (lower bound, upper bound)

We can walk through the function above:
The name of the function is *get_confidence_interval* and it takes two arguments: the first is the sample for which you want to calculate the confidence interval, and the second is the desired confidence level. The second argument is optional and defaults to 95% if not specified, which is a very commonly used confidence level.

Inside the function we first obtain *n*, the sample size. Then we calculate the sample mean using the numpy.mean function ([numpy.mean documentation](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)), and the standard error of the sample mean with the scipy.stats.sem function ([scipy.stats.sem documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html)).
Finally, we calculate the confidence interval using the scipy.stats.t.interval function ([scipy.stats.t documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html)). This function needs the desired confidence level, the degrees of freedom, the sample mean, and the standard error in order to calculate the lower and upper bounds of the confidence interval.
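
In formula form, the interval computed by the function is the standard $t$-based interval. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), and $\alpha = 1 - \text{confidence}$ are notation introduced here for illustration:

$$\bar{x} \pm t_{\alpha/2,\, n-1} \, \frac{s}{\sqrt{n}}$$

The term $s/\sqrt{n}$ is the standard error that scipy.stats.sem returns, and $t_{\alpha/2,\, n-1}$ is the critical value of the $t$ distribution with $n-1$ degrees of freedom.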

Let us illustrate this function with Example 12.6 from the course book. In this example the population parameters $\mu$ and $\sigma$ are both unknown. The sample data is given as {1, 0, 5} and the question asks for the 99% confidence interval for $\mu$, assuming a normally distributed population.

This is easily calculated using the function we defined above:

In [3]:
sample_data = [1, 0, 5]
get_confidence_interval(sample_data, confidence=0.99)

(-13.160448409591625, 17.160448409591623)
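
As a cross-check, the same interval can be reproduced by hand from the sample mean, the standard error, and the critical $t$ value. The following is a minimal sketch rather than part of the toolbox; the variable names are chosen for illustration, and it assumes that stats.t.ppf (the inverse cumulative distribution function) is used to look up the critical value:

```python
import math

import numpy
from scipy import stats

sample_data = [1, 0, 5]
confidence = 0.99

n = len(sample_data)
mean = numpy.mean(sample_data)
s = numpy.std(sample_data, ddof=1)  # sample standard deviation (n - 1 in the denominator)
se = s / math.sqrt(n)  # standard error of the mean

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

manual_interval = (mean - t_crit * se, mean + t_crit * se)
print(manual_interval)
```

The printed bounds should agree with the output of get_confidence_interval above, up to floating-point rounding.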

Another tool that can help us understand our data is provided by the pandas library. The .describe() method produces a statistical description of our sample. In order to call this method, however, our data needs to be in a pandas.Series or pandas.DataFrame object.
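
As a minimal sketch of that requirement, a plain Python list can be wrapped in a pandas.Series before calling .describe(). The variable names below are chosen for illustration, and the data is the sample from the confidence interval example:

```python
import pandas

# Wrap a plain list in a Series so that .describe() becomes available
sample_series = pandas.Series([1, 0, 5])
series_statistics = sample_series.describe()
print(series_statistics)
```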

Let's say we have some data stored in two columns labeled "Time" and "Temperature":

In [4]:
sample_dataframe = pandas.DataFrame(
    {
        "Time": [8, 10, 12, 14, 16, 18, 20],
        "Temperature": [14, 15, 17, 19, 19, 18, 16]
    }
)

The DataFrame looks like this:

In [5]:
sample_dataframe

| | Time | Temperature |
|---|---|---|
| 0 | 8 | 14 |
| 1 | 10 | 15 |
| 2 | 12 | 17 |
| 3 | 14 | 19 |
| 4 | 16 | 19 |
| 5 | 18 | 18 |
| 6 | 20 | 16 |


We can now request a statistical description of each column of our dataset. Usually the description of the dependent variable (here, Temperature) is the most useful:

In [6]:
sample_statistics = sample_dataframe.describe()
sample_statistics

| | Time | Temperature |
|---|---|---|
| count | 7.0 | 7.0 |
| mean | 14.0 | 16.857143 |
| std | 4.320494 | 1.9518 |
| min | 8.0 | 14.0 |
| 25% | 11.0 | 15.5 |
| 50% | 14.0 | 17.0 |
| 75% | 17.0 | 18.5 |
| max | 20.0 | 19.0 |


What we see returned is the sample size, the mean of our sample, the standard deviation (which is not of great use, can you explain why?), the minimum, the maximum, and different percentiles. We can access the different information from each column by name or by position:

In [7]:
print(sample_statistics["Time"]["mean"])
print(sample_statistics["Time"].iloc[1])  # by position; plain [1] on a labelled index is deprecated in recent pandas

14.0
14.0
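
The same value can also be read with a single label-based lookup. The sketch below rebuilds the DataFrame so that it runs on its own, and it uses the pandas .loc indexer (row label first, then column label); the variable names are illustrative:

```python
import pandas

sample_dataframe = pandas.DataFrame(
    {
        "Time": [8, 10, 12, 14, 16, 18, 20],
        "Temperature": [14, 15, 17, 19, 19, 18, 16],
    }
)
sample_statistics = sample_dataframe.describe()

# One lookup by row label and column label
mean_time = sample_statistics.loc["mean", "Time"]

# Several statistics of one column at once
temperature_summary = sample_statistics.loc[["mean", "std"], "Temperature"]

print(mean_time)
print(temperature_summary)
```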


We can now move on to the next part of module 1.



## Hypothesis Testing
Let's start by constructing a set of sample data to test our functions with. We will use the time and temperature data from before and simply add another day to the data.

In [8]:
column1 = pandas.Series([8, 10, 12, 14, 16, 18, 20])
column2 = pandas.Series([14, 15, 17, 19, 19, 18, 16])
column3 = pandas.Series([15, 15, 18, 19, 21, 20, 18])
sample_dataframe = pandas.DataFrame({"Time": column1, "Temperature Day 1": column2, "Temperature Day 2": column3})
sample_dataframe

| | Time | Temperature Day 1 | Temperature Day 2 |
|---|---|---|---|
| 0 | 8 | 14 | 15 |
| 1 | 10 | 15 | 15 |
| 2 | 12 | 17 | 18 |
| 3 | 14 | 19 | 19 |
| 4 | 16 | 19 | 21 |
| 5 | 18 | 18 | 20 |
| 6 | 20 | 16 | 18 |


The question could be: are the mean temperatures of the two days statistically different? We can write this as a hypothesis test:

$H_0 : \mu_1 - \mu_2 = 0$\
$H_1 : \mu_1 - \mu_2 \neq 0$

The independent samples $t$-test can be used to test this hypothesis. However, an underlying assumption of the standard independent samples $t$-test is that the two populations being compared have equal variances.
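
For reference, under the equal-variance assumption the test statistic is usually written as follows, where $\bar{x}_i$, $s_i^2$, and $n_i$ denote the mean, variance, and size of sample $i$ (notation introduced here for illustration, not taken from the course book):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}$$

The statistic is compared against a $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. The pooled variance $s_p^2$ is only meaningful if the two population variances are equal, which is why the assumption has to be checked.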

The test for equal variances can be written as another hypothesis test. A commonly used test for this is the Levene test:

$H_0 : \sigma^2_1 - \sigma^2_2 = 0$\
$H_1 : \sigma^2_1 - \sigma^2_2 \neq 0$

So let's add to our toolbox again:

In [9]:
def Levene_Test(data_group1, data_group2):
    """Runs the Levene test for equal variances on two sample groups."""
    Levene_result = stats.levene(data_group1, data_group2)  # (statistic, pvalue)
    return Levene_result

The Levene test takes the data from each sample group and returns a test statistic (approximately $F$-distributed under $H_0$) and a $p$ value. Depending on your desired significance level, you can then reject or fail to reject the null hypothesis of equal variances.

In [10]:
Levene_Test(sample_dataframe["Temperature Day 1"], sample_dataframe["Temperature Day 2"])

LeveneResult(statistic=0.04999999999999995, pvalue=0.8268247157865076)

If our significance level, $\alpha$, is 0.05 (a common value), we can observe that in this case the $p$ value is larger than $\alpha$, so we fail to reject the null hypothesis (loosely, we "accept" it) and treat the two samples as having equal variance.
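
The comparison of the $p$ value with $\alpha$ can also be written out in code. This is a minimal sketch that calls scipy.stats.levene directly on the two temperature columns so that it runs on its own; the variable names and the printed messages are illustrative:

```python
import pandas
from scipy import stats

day1 = pandas.Series([14, 15, 17, 19, 19, 18, 16])
day2 = pandas.Series([15, 15, 18, 19, 21, 20, 18])

alpha = 0.05  # significance level

# The result unpacks into the test statistic and the p value
statistic, p_value = stats.levene(day1, day2)

if p_value > alpha:
    print("Fail to reject H0: no evidence of unequal variances")
else:
    print("Reject H0: the variances appear to differ")
```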

We now want to address the question of equal means. Let's add the $t$-test to our toolbox:

In [11]:
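def t_test(data_group1, data_group2, confidence=0.95):
    """Runs an independent samples t-test, using the Levene test to set equal_var."""
    alpha = 1 - confidence  # significance level

    # Levene p value above alpha: treat the two variances as equal
    equal_variance = bool(Levene_Test(data_group1, data_group2)[1] > alpha)

    t, p = stats.ttest_ind(data_group1, data_group2, equal_var=equal_variance)

    # The decision is returned as a string ("True" or "False")
    accept_H0 = "False"
    if p > alpha:
        accept_H0 = "True"

    return (t, p, accept_H0)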

Our function to perform the $t$-test is called "t_test" and it takes three possible inputs:
1. The data of the first column (or group), which corresponds to $\mu_1$ in the hypothesis test,
2. The data of the second column (or group), which corresponds to $\mu_2$ in the hypothesis test, and finally,
3. The desired confidence level, which defaults to 95% if not specified.

Inside the function, the confidence level is used to determine the $\alpha$ value which is the significance level for the $t$-test.\
Then, the Levene test (which we discussed previously) is run to determine if the two groups have equal variance or not. This is done because the function that performs the $t$-test, [scipy.stats.ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) (written "stats.ttest_ind" in our code), needs this as an input; it modifies the calculations based on whether or not the two groups have equal variance.

So, after the Levene test we calculate the $t$ value and $p$ value of the $t$-test. The inputs to "stats.ttest_ind" are the data for the first group, the data for the second group, and the equal-variance flag derived from the Levene test.

Finally, we check if $p$ is larger than our desired significance level.

Let us illustrate this for our previous temperature dataset:

In [12]:
t_test(sample_dataframe["Temperature Day 1"],sample_dataframe["Temperature Day 2"],confidence=0.95)

(-0.9999999999999994, 0.33704905795358453, 'True')

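The outputs from our function are the $t$ value, the $p$ value, and whether or not we accept (that is, fail to reject) the null hypothesis. Here the $p$ value of about 0.34 is larger than $\alpha = 0.05$, so the data give no evidence that the mean temperatures of the two days differ.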
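
To see what the equal-variance flag changes, the two variants of scipy.stats.ttest_ind can be compared on the same data. This is a minimal sketch with illustrative variable names; equal_var=False selects the Welch $t$-test, which does not pool the variances:

```python
import pandas
from scipy import stats

day1 = pandas.Series([14, 15, 17, 19, 19, 18, 16])
day2 = pandas.Series([15, 15, 18, 19, 21, 20, 18])

# Pooled-variance t-test (assumes equal variances)
t_student, p_student = stats.ttest_ind(day1, day2, equal_var=True)

# Welch t-test (does not assume equal variances)
t_welch, p_welch = stats.ttest_ind(day1, day2, equal_var=False)

print(t_student, p_student)
print(t_welch, p_welch)
```

Because both samples have the same size, the two $t$ values coincide here. Only the degrees of freedom, and therefore the $p$ value, differ slightly between the two variants.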