### Lab | Inferential statistics

#### It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

##### Set up the hypothesis test.

H0: μ = 120 \
H1: μ ≠ 120

##### Write down all the steps followed for setting up the test.

1. Define the null hypothesis
2. Define the alternate hypothesis
3. Determine the level of significance (alpha)
4. Find the critical value
5. Calculate the t statistic and the p-value
6. Reject or fail to reject the null hypothesis

##### Calculate the test statistic by hand and also code it in Python. It should be 4.76190. We will take a look at how to make decisions based on this calculated value.
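
Worked by hand, the calculation looks like this. Here the symbols are labels chosen for this sketch: x̄ is the sample mean, s the sample standard deviation (used because the population value is unknown), and μ₀ the hypothesized mean.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  = \frac{130.1 - 120}{21.21 / \sqrt{100}}
  = \frac{10.1}{2.121}
  \approx 4.7619
```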

In [1]:
import math

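sample_mean = 130.1
pop_mean = 120
sample_std = 21.21  # standard deviation of the sample, not of the population
n = 100

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
statistic = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))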
statistic

4.761904761904759

In [2]:
from scipy.stats import t
p_value = t.sf(abs(statistic), n-1) * 2
p_value

6.562701817208617e-06

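Yes. The p-value of about 6.6e-06 is far below any conventional significance level (0.05, 0.01, 0.001), so we reject H0: μ = 120 and conclude that this group's mean systolic blood pressure differs significantly from that of the regular population.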
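
The same decision can also be reached through the critical value (step 4 of the list above) rather than the p-value. The following is a minimal sketch, assuming a significance level of `alpha = 0.05`, which the lab text does not fix; the names `alpha`, `critical_value` and `reject_h0` are illustrative only.

```python
from scipy.stats import t

alpha = 0.05                     # assumed significance level (not given in the lab)
n = 100                          # sample size from the problem statement
statistic = 4.761904761904759    # t statistic computed in the cell above

# Two-sided critical value with n - 1 degrees of freedom
critical_value = t.ppf(1 - alpha / 2, n - 1)

# Reject H0 when the statistic falls outside [-critical_value, critical_value]
reject_h0 = abs(statistic) > critical_value
print(critical_value, reject_h0)
```

With 99 degrees of freedom the critical value is close to 2, so a statistic of about 4.76 lies well inside the rejection region.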

### Lab | Inferential statistics - T-test & P-value

#### This is another simple example of a two-sample t test (pooled, for when the variances are equal), but this time the test is one-sided.

##### In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t test are met, do the data provide sufficient evidence that one machine is better than the other?

In [3]:
import pandas as pd

df = pd.read_csv('machine.txt', delimiter='\t', encoding='utf-16')
df

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [5]:
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
df

Unnamed: 0,new_machine,____old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [7]:
df = df.rename(columns={'____old_machine': 'old_machine'})
df

Unnamed: 0,new_machine,old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1
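
The two renaming steps above can be collapsed into one by stripping whitespace before lowercasing. This is a sketch only: the frame `raw` is a hypothetical stand-in for the loaded file, and the four leading spaces in its second header are an assumption inferred from the `____old_machine` name seen above.

```python
import pandas as pd

# Hypothetical stand-in for the raw file; the leading spaces are assumed
raw = pd.DataFrame({
    'New machine': [42.1, 41.0],
    '    Old machine': [42.7, 43.6],
})

clean = raw.copy()
# Strip surrounding whitespace first, then lowercase and replace inner spaces
clean.columns = clean.columns.str.strip().str.lower().str.replace(' ', '_')
print(list(clean.columns))
```

Stripping first means no separate `rename` call is needed to remove the leading underscores.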


In [14]:
# New Machine
new_samples = len(df['new_machine'])
new_mean = df['new_machine'].mean()
new_std = df['new_machine'].std()


# Old machine
old_samples = len(df['old_machine'])
old_mean = df['old_machine'].mean()
old_std = df['old_machine'].std()

Defining hypotheses

H0: μ_new >= μ_old (the new machine is not faster) \
H1: μ_new < μ_old (the new machine takes less time on average)
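
The pooled two-sample t statistic for the recorded times can be computed by hand with the standard library. This is a minimal sketch, assuming equal variances as the exercise states; the list names and `t_pooled` are illustrative, and the values are copied from the table above.

```python
import math
import statistics

# Packing times in seconds, copied from the table above
new_times = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_times = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

n_new, n_old = len(new_times), len(old_times)
var_new = statistics.variance(new_times)   # sample variance (n - 1 denominator)
var_old = statistics.variance(old_times)

# Pooled variance: variances weighted by their degrees of freedom
dof = n_new + n_old - 2
pooled_var = ((n_new - 1) * var_new + (n_old - 1) * var_old) / dof

# Standard error of the difference in means, then the t statistic
std_error = math.sqrt(pooled_var * (1 / n_new + 1 / n_old))
t_pooled = (statistics.mean(new_times) - statistics.mean(old_times)) / std_error
print(t_pooled, dof)
```

A negative statistic here means the new machine's mean time is lower than the old machine's.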

In [15]:
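from scipy.stats import ttest_ind, norm

# Draw synthetic samples from normal distributions with each machine's
# sample mean and standard deviation; the values change on every run
new_machine = norm.rvs(loc=new_mean, scale=new_std, size=new_samples)
old_machine = norm.rvs(loc=old_mean, scale=old_std, size=old_samples)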

In [16]:
new_machine

array([41.73495162, 42.47729281, 42.24412794, 41.57437166, 43.64009793,
       42.31021923, 41.55146059, 43.00503103, 41.04365992, 42.53631933])

In [17]:
old_machine

array([41.83856076, 42.90284547, 44.06219468, 44.06904421, 43.20719195,
       43.52012736, 43.12418567, 42.84411102, 43.25519862, 44.31450299])

In [19]:
ttest_ind(new_machine, old_machine)

Ttest_indResult(statistic=-3.2928004458750357, pvalue=0.004045074686054998)

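We reject the null hypothesis at the 5% significance level. The t statistic is negative (the new machine's mean time is lower), and the two-sided p-value of 0.004 corresponds to a one-sided p-value of about 0.002 < 0.05, so the data support the claim that the new machine packs faster than the old machine. Because the test was run on samples drawn with `norm.rvs`, the exact figures vary from run to run.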
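
As a cross-check, the test can be run directly on the recorded times instead of on simulated draws, which makes the result reproducible. This is a sketch under stated assumptions: it relies on the `alternative` argument of `ttest_ind`, available in SciPy 1.6 and later, and the list names are illustrative.

```python
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table above
new_times = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_times = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# equal_var=True gives the pooled test; alternative='less' tests
# H1: mean(new_times) < mean(old_times), i.e. the new machine is faster
result = ttest_ind(new_times, old_times, equal_var=True, alternative='less')
print(result.statistic, result.pvalue)
```

Passing `alternative='less'` returns the one-sided p-value directly, so there is no need to halve a two-sided value by hand.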