# Lab | Inferential statistics
Jorge Castro DAPT NOV2021

### Instructions

1. It is assumed that the mean systolic blood pressure is `μ = 120 mm Hg`. In the Honolulu Heart Study, a sample of `n = 100` people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

   - Set up the hypothesis test.
   - Write down all the steps followed for setting up the test.
   - Calculate the test statistic by hand and also code it in Python. It should be 4.76190. We will take a look at how to make decisions based on this calculated value.

2. If you finished the previous question, please go through the code for `principal_component_analysis_example` provided in the `files_for_lab` folder.

### Hypothesis test
#### Recap of concepts from Hypothesis Testing

* Null hypothesis: Denoted with H0, a null hypothesis is an assumption that the population average is identical to a specific value. The typical notation is μ = μ0, where μ refers to the population mean and μ0 refers to the hypothesized value.

* Alternative hypothesis: An alternative hypothesis is the opposite of the null hypothesis: it states that the population mean is not equal to (or is greater or smaller than) the hypothesized value. We compare it with the null hypothesis to decide whether or not to reject the null hypothesis. We denote the alternative hypothesis with H1 or Ha.

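* Significance level: Denoted by α, this is the probability of rejecting the null hypothesis when it is actually true. It is chosen before running the test and sets how strong the evidence must be before we reject the null hypothesis.

* Test statistic: A value computed from the sample data (for example a z or t statistic) that measures how far the sample result lies from what the null hypothesis predicts. Once we have determined the type of hypothesis test and checked that its assumptions are met, we use this value to decide whether or not to reject the null hypothesis.

* p-value: The probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one calculated. We reject the null hypothesis when the p-value is smaller than the significance level.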

* Step 1: Define the null hypothesis - This is our assumption about the population: we assume that the mean systolic blood pressure of the population our sample comes from is identical to that of the regular population. It is denoted by H0 and in this case H0: μ = 120.

* Step 2: Define the alternative hypothesis - This states what holds if our assumption is not true. It is denoted by Ha and in this case Ha: μ ≠ 120.

* Step 3: Determine whether it is a one-tailed or a two-tailed test. A two-tailed test is used when the alternative hypothesis allows the mean to be either greater or smaller than the hypothesized value; a one-tailed test is used when only one direction (< or >) is of interest. In this case we are checking whether the mean systolic blood pressure is different from the mean of the regular population, so we will consider it a two-tailed test.

* Step 4: Decide on a test statistic based on the information available. We will assume that the data are normally distributed. We have 100 observations, but the population variance is not known (we only have the sample standard deviation), so we will use a t-test. This test is based on the t-distribution, which with 99 degrees of freedom is very close to a normal distribution.

If the population variance is not known or the sample has fewer than 30 observations, we use a t-test. The t-test is based on Student's t-distribution, which is very similar to a standard normal distribution except that it has heavier tails, especially for small samples.
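To illustrate how the t-distribution compares with the standard normal distribution, the sketch below prints two-tailed critical values for a few degrees of freedom. The helper name `two_tailed_critical_values`, the level `alpha=0.05` and the chosen `dfs` are assumptions made for this illustration only.

```python
from scipy import stats


def two_tailed_critical_values(alpha, dfs):
    """Return the two-tailed z critical value and the t critical value per df."""
    q = 1 - alpha / 2  # upper quantile that leaves alpha/2 in each tail
    z_crit = stats.norm.ppf(q)
    t_crits = {df: stats.t.ppf(q, df=df) for df in dfs}
    return z_crit, t_crits


# Illustrative degrees of freedom: small, rule-of-thumb boundary, this lab (n - 1)
z_crit, t_crits = two_tailed_critical_values(alpha=0.05, dfs=[5, 29, 99])

print(f"z critical value: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"t critical value (df={df}): {t_crit:.3f}")
```

The t critical values are larger than the z value and shrink towards it as the degrees of freedom grow, which is why the choice between the two tests matters most for small samples.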

* Step 5: Level of significance: This defines the rejection region (critical region). It is the probability of making the wrong decision when the null hypothesis is true, that is, of rejecting it by mistake. We choose 0.05, which is the most common value. It is denoted by the Greek letter α. In the medical field a stricter level such as 0.01 is often used.

* Step 6: Calculate the test statistic based on the given information.

* Step 7: Check the table to determine the critical value.

For a z-test the critical values are fixed for a given confidence level.
For a t-test they also depend on the degrees of freedom (df), which is sample_size - 1.

* Step 8: Make conclusions:

If the test statistic falls in the critical region, then we reject the null hypothesis.
If the test statistic falls outside the critical region, then we fail to reject the null hypothesis.
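For the numbers in this lab, steps 6 to 8 can be written out as follows. This is a worked sketch: the symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation) and $t_{0.025,\,99}$ (two-tailed critical value for α = 0.05 and df = 99) are notation introduced here, and the critical value is rounded to three decimals.

```latex
\begin{align*}
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
   = \frac{130.1 - 120}{21.21 / \sqrt{100}}
   = \frac{10.1}{2.121}
   \approx 4.762 \\
|t| &= 4.762 > t_{0.025,\,99} \approx 1.984
   \quad\Rightarrow\quad \text{reject } H_0
\end{align*}
```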

In [6]:
# Manual calculation
import math

import scipy.stats

sample_mean = 130.1
pop_mean = 120
sample_std = 21.21
n = 100

# t statistic: distance between the sample mean and the hypothesized mean,
# measured in standard errors
statistic = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
print("Calculated t =", round(statistic, 3))

# Find the t critical value. Two-tailed test, alpha = 0.05
critical_value = scipy.stats.t.ppf(q=1 - 0.05 / 2, df=n - 1)

# Reject H0 when |t| exceeds the critical value
if abs(statistic) > abs(critical_value):
    print("We reject H0")
else:
    print("We cannot reject H0")

Calculated t = 4.762
We reject H0


In [7]:
def academic_t_test(sample_mean, sample_std, pop_mean, sample_size, alpha=.05, tails="Two"):
    import math
    import scipy.stats

    # Use the sample size passed as an argument, not the global variable n
    statistic = (sample_mean - pop_mean)/(sample_std/math.sqrt(sample_size))
    df = sample_size - 1

    # Find the t critical value and apply the matching rejection rule
    # Two-tailed test: reject when |t| exceeds the critical value
    if tails=="Two":
        critical_value=scipy.stats.t.ppf(q=1-alpha/2,df=df)
        reject = abs(statistic) > critical_value
    # One-tailed tests: reject only in the direction of the alternative
    elif tails=="Left":
        critical_value=scipy.stats.t.ppf(q=alpha,df=df)
        reject = statistic < critical_value
    elif tails=="Right":
        critical_value=scipy.stats.t.ppf(q=1-alpha,df=df)
        reject = statistic > critical_value
    else:
        return 'Please, indicate "Left", "Right" or "Two" in the "tails" argument'

    conclusion = "We reject H0" if reject else "We cannot reject H0"

    results=[round(statistic,3),round(critical_value,3),conclusion]

    return results

In [8]:
academic_t_test(sample_mean, sample_std, pop_mean, 100)


[4.762, 1.984, 'We reject H0']
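The recap also mentions the p-value as a way to reach the decision. As a minimal sketch, the same conclusion can be checked by comparing a two-tailed p-value with α instead of comparing the statistic with a critical value. The variable names `t_stat` and `p_value` are introduced here for illustration; `scipy.stats.t.sf` is the survival function (1 - CDF) of the t-distribution.

```python
import math

from scipy import stats

# Values from the lab statement
sample_mean, pop_mean, sample_std, n = 130.1, 120, 21.21, 100
alpha = 0.05

t_stat = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))

# Two-tailed p-value: probability under H0 of a |t| at least this large
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.3f}, p-value = {p_value:.2e}")
print("We reject H0" if p_value < alpha else "We cannot reject H0")
```

Both approaches are equivalent for a given α: the p-value is below α exactly when the test statistic falls in the critical region.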