# Tutorial: Statistical Populations and Correlations

#### GEOL 455/855, Week 4 Exercise, Pt. 1
#### Dr. Lynne Elkins

### Part 1: T-Tests

To start, reproduce the example from the lecture, which had a data set with the systolic blood pressure measurements for 14 patients.

Your goal: to determine whether the population mean is *less than 165.*

In [None]:
# Enter the sample data set (measurements for the 14 patients):
systolicbp=[183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]

In [None]:
# Enter the hypothesized population mean to test against:
bpmean = 165

To do this analysis, you will need to import some statistical data packages. There are a few options, but one of the easiest to use is `stats` in the larger `scipy` package.

In [None]:
# This line imports just the stats part of the scipy package:
from scipy import stats

You will need to calculate a few values to run your test. First, write out (in Markdown) your null and alternative hypotheses:

Null hypothesis H0: 

Alternative hypothesis H1: 

The next cell sets up a calculation using a function from the `stats` package called `ttest_1samp`. As its name suggests, the function conducts a one-sample t-test. It takes two arguments, the data set (here, `systolicbp`) and the hypothesized population mean (`bpmean`), and reports two results, a test statistic and a two-tailed p-value. The syntax is:

```
t_value,p_value=stats.ttest_1samp(dataset,mean)
```
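
To see what the function computes, it can help to reproduce the t-statistic by hand. The sketch below uses a small hypothetical sample (not the lecture data) and made-up names `sample` and `mu0`; it assumes the standard one-sample formula, t = (sample mean - hypothesized mean) / (s / sqrt(n)).

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized mean (illustration only)
sample = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9]
mu0 = 12.5

# SciPy returns the t-statistic and the two-tailed p-value
t_value, p_value = stats.ttest_1samp(sample, mu0)

# The same t-statistic by hand: (sample mean - mu0) / (s / sqrt(n)),
# where s is the sample standard deviation (ddof=1)
n = len(sample)
t_manual = (np.mean(sample) - mu0) / (np.std(sample, ddof=1) / np.sqrt(n))

print("t from SciPy:", t_value)
print("t by hand:   ", t_manual)
print("two-tailed p:", p_value)
```

The two t-values should agree, which is a quick check that you are passing the arguments in the right order.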

Set this up in the next cell, write additional lines to print the t-value and p-value results to the screen, and run the cell. What happens?

In [None]:
# Set up the t-test calculation here:


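There is an additional correction to take into account before you interpret these results: the alternative hypothesis (if written correctly!) is focused on values less than 165, and not those greater than 165. This is called a "one-tailed" hypothesis. By default, `ttest_1samp` reports a two-tailed p-value. When the t-value falls on the side predicted by the alternative hypothesis (here, a negative t-value, meaning the sample mean is below 165), the one-tailed p-value is half of the two-tailed p-value you calculated above. Calculate a new p-value just for your one-tailed hypothesis below: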
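
As a cross-check on the halving rule, the sketch below compares the halved two-tailed p-value with the result of asking SciPy for a one-tailed test directly. It uses a hypothetical sample (not the lecture data) and assumes SciPy 1.6 or newer, where `ttest_1samp` accepts the `alternative` keyword.

```python
from scipy import stats

# Hypothetical sample whose mean lies below the hypothesized value
sample = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9]
mu0 = 12.5

# Default behavior: two-tailed test
t_value, p_two = stats.ttest_1samp(sample, mu0)

# Halving is valid here only because t is negative ("less than" side)
p_half = p_two / 2

# Assumes SciPy >= 1.6: request the one-tailed test directly
t_less, p_less = stats.ttest_1samp(sample, mu0, alternative='less')

print("t-value:", t_value)
print("halved two-tailed p:", p_half)
print("alternative='less' p:", p_less)
```

If the t-value had been positive instead, the "less than" p-value would be 1 minus half of the two-tailed value, so it is worth checking the sign of t before halving.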

In [None]:
# Calculate the one-tailed p-value for your alternative hypothesis, and print it to the screen:


Now write a few lines of code to compare that result to a threshold of 0.05 and print the appropriate conclusion to the screen. If your p-value is less than 0.05, you have rejected the null hypothesis H0 in favor of your alternative hypothesis, H1. (Tip for writing better code: Can you set this up so that the text output matches the outcome in either case? An `if`/`else` statement would work well for that!)
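
One possible shape for that comparison is sketched below. The names `alpha`, `p_example`, and `conclusion` are hypothetical, and the p-value is made up for illustration; substitute your own one-tailed result.

```python
# Hypothetical values for illustration; use your own results in the exercise
alpha = 0.05
p_example = 0.20

# Pick the message that matches the outcome, so the output is always correct
if p_example < alpha:
    conclusion = "Reject H0 in favor of H1."
else:
    conclusion = "Fail to reject H0."

print(f"p = {p_example:.3f}: {conclusion}")
```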

In [None]:
# Compare and interpret the p-value result:


#### Paired t-tests

Now consider the problem (and hints) below, and see if you can set up a simple test with your classmates.

In this problem, a group of students was given a math test two times, once before and then after a round of tutoring lessons. Was the tutoring beneficial overall?

Scores on the 1st test: 23 20 19 21 18 20 18 17 23 16 19

Scores on the 2nd test: 24 19 22 18 20 22 20 20 23 20 18

**Your task:** Write hypotheses and design a test to identify whether the tutoring was beneficial or not.

**Tips:** Design your test so that the result (and interpretation) will print automatically to the screen. That way you could reuse your script given a different input data set.
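
**Hint:** `stats` includes a paired (related-samples) test, `ttest_rel`. The sketch below uses hypothetical before/after values (not the test scores above) and shows that a paired t-test is equivalent to a one-sample t-test on the differences against a mean of 0. Note that the p-value it returns is two-tailed by default.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements for the same five subjects
before = np.array([70, 65, 80, 72, 68])
after = np.array([74, 66, 85, 71, 73])

# Paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(after, before)

# Equivalent: one-sample t-test on the differences against a mean of 0
t_diff, p_diff = stats.ttest_1samp(after - before, 0)

print("paired t-test:      ", t_paired, p_paired)
print("test on differences:", t_diff, p_diff)
```

The order of the two arguments sets the sign of the t-value, so decide which direction corresponds to "tutoring helped" before interpreting a one-tailed result.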

### Part 2: Correlations

Another package that has a lot of statistical routines and functions in Python is NumPy. We will use that here so you can get some practice with it, since it is a very useful package in data science.

In [None]:
# The line below imports the NumPy package, and gives it a convenient nickname.
# This is a good approach for packages you expect to use many times, so you 
# don't have to keep typing them out.
import numpy as np

First up: Pearson correlation coefficients! NumPy has a routine called `np.corrcoef()` that produces a matrix of Pearson coefficients. To use it, you need a pair of arrays that you will compare to see if they are statistically correlated. We will set that up in the next cell using ranges and arrays:

In [None]:
# Use NumPy's arange function to create an array of integers from 10 to 19:
x = np.arange(10,20)
x

In [None]:
# Create a second array of the same size, with arbitrary numbers (feel free to change these!):
y = np.array([2,1,4,5,8,12,18,25,96,48])
y

Here is how the function works: the following cell creates a correlation coefficient matrix by comparing x and y. The values on the diagonal of the matrix (upper left, lower right) are equal to 1, because they show the relationship between x and x and between y and y. The other values give the degree of correlation between x and y *and* between y and x (these are always equal, because the Pearson coefficient is symmetric).

In [None]:
# Calculate Pearson correlation coefficients:

r = np.corrcoef(x,y)
r
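
Since the result is a 2x2 matrix, you will often want just the single off-diagonal number. A minimal sketch, using hypothetical arrays `a` and `b` with a perfect linear relationship, might look like this:

```python
import numpy as np

# Hypothetical arrays with a perfect linear relationship (b = 2*a + 1)
a = np.array([1, 2, 3, 4, 5])
b = np.array([3, 5, 7, 9, 11])

r_matrix = np.corrcoef(a, b)

# Off-diagonal entries hold the a-b correlation; [0, 1] equals [1, 0]
r_ab = r_matrix[0, 1]

print(r_matrix)
print("Pearson r:", r_ab)
```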

SciPy's `stats` module also includes correlation functions, which can be set up almost the same way as the NumPy method above (that is, using NumPy arrays). Run some tests for Pearson's r, Spearman's rho, and Kendall's tau correlations using the same two arrays (x and y) and the syntax below:

```
pearsonr()
spearmanr()
kendalltau()

corr,p_value = stats.function(x,y)

```
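
To illustrate how the three coefficients can differ, the sketch below runs all three functions on hypothetical data (not the x and y arrays above) that increase steadily but not linearly. The names `a` and `b` are made up for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical data: always increasing, but not linear (b = a**3)
a = np.arange(1, 9)
b = a ** 3

# Each function returns a coefficient and a p-value
r, p_r = stats.pearsonr(a, b)
rho, p_rho = stats.spearmanr(a, b)
tau, p_tau = stats.kendalltau(a, b)

print("Pearson r:     ", r, p_r)
print("Spearman's rho:", rho, p_rho)
print("Kendall's tau: ", tau, p_tau)
```

Because the rank-based coefficients (rho and tau) only care about ordering, they report a perfect relationship here, while Pearson's r falls below 1 because the relationship is not a straight line.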

In [None]:
# Calculate Pearson's r values using the stats routine:


In [None]:
# Calculate Spearman's rank coefficients:


In [None]:
# Calculate Kendall's tau:


What was different? Well, for one thing, these functions return just two values rather than a matrix: the correlation coefficient (first number) and the p-value (second number).

#### Pandas methods

What if you have imported data using a dataframe (a type of data table), or want to save them that way for easy printing or use in other functions? It turns out that the Pandas package is also good for calculating statistics, and has some similar routines built in. In this case, there is a single command for all correlation calculations, which you can feed an argument to specify which type ('pearson', 'spearman', 'kendall') you want to use. You can even feed in another outside function of your own if you like, making this a very versatile method!

But first, you need a couple of (indexed) pandas Series to work with. A Series is a single labeled column, the building block of a dataframe:

In [None]:
import pandas as pd

In [None]:
# Set up a series of x values from 10 to 19, but this time as a pandas Series (like a single dataframe column):
x = pd.Series(range(10,20))
x

In [None]:
# Set up your y values as a pandas Series built from a list of numbers:
y = pd.Series([2,1,4,5,8,12,18,25,96,48])
y

It turns out that Pearson's r is the default correlation coefficient provided by the `.corr()` method, so if you call it on one Series and pass the other with no additional arguments, that's what you will get:

In [None]:
x.corr(y)

In [None]:
y.corr(x)

But you can also modify the method by adding the argument "method='spearman'" inside the parentheses. Try it below and see what you get, both for Spearman's rank and Kendall's tau! (If you aren't sure how to do this, just be bold: try typing it and running it as a test! After all, if you make an error in the syntax, it just won't run, and then you can try again.)
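
For reference, here is a sketch of the `method` argument on hypothetical Series (not the x and y above), including a user-supplied function. The helper `r_squared` is made up for illustration; the assumption is that a callable passed as `method` receives two 1-D arrays and returns a single number.

```python
import numpy as np
import pandas as pd

# Hypothetical series: always increasing, but not linear
a = pd.Series([1, 2, 3, 4, 5, 6])
b = pd.Series([1, 4, 9, 16, 25, 36])

pearson = a.corr(b)                      # default method is 'pearson'
spearman = a.corr(b, method='spearman')  # rank-based coefficient

# Hypothetical custom function: the squared Pearson coefficient
def r_squared(u, v):
    return np.corrcoef(u, v)[0, 1] ** 2

custom = a.corr(b, method=r_squared)

print("Pearson: ", pearson)
print("Spearman:", spearman)
print("Custom:  ", custom)
```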

In [None]:
# Calculate Spearman's rank:


In [None]:
# Calculate Kendall's tau:


How do the results compare to your calculations using the other packages?

#### Further reading

Today's exercise was adapted from a free Real Python tutorial that you can complete online! That tutorial contains further examples and implementations of correlation calculation methods. I encourage you to keep working through that tutorial, either in class or on your own, until you are comfortable with these methods:

https://realpython.com/numpy-scipy-pandas-correlation-python/