# Statistics-5 (Basic Concepts of Hypothesis Testing)

# One Sample Z Test

Performed to test a claim about a population mean when the population standard deviation is known.
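The calculation can be wrapped in a small helper. This is a minimal sketch, not a library API: the function name `one_sample_z_test` and its `alternative` argument are hypothetical, chosen here for illustration. The usage line plugs in the numbers from Example-1 below.

```python
import numpy as np
from scipy import stats


def one_sample_z_test(x_bar, mu0, sigma, n, alternative="greater"):
    """Return (z, p_value) for a one-sample z test with known sigma.

    Hypothetical helper for illustration only.
    """
    se = sigma / np.sqrt(n)          # standard error of the sample mean
    z = (x_bar - mu0) / se           # standardized distance from mu0
    if alternative == "greater":     # H1: mu > mu0
        p_value = stats.norm.sf(z)
    elif alternative == "less":      # H1: mu < mu0
        p_value = stats.norm.cdf(z)
    elif alternative == "two-sided": # H1: mu != mu0
        p_value = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Values taken from Example-1 (beach water lead level)
z, p_value = one_sample_z_test(x_bar=10.5, mu0=10, sigma=1.5, n=40)
print(z, p_value)
```

`stats.norm.sf(z)` is the survival function, equal to `1 - stats.norm.cdf(z)` but numerically more accurate for large `z`.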

## Example-1

- Suppose that a beach is safe for swimming if the mean level of lead in the water does not exceed 10.0 parts per million (μ0 = 10.0).
- We assume each measurement Xi ~ N(μ, σ = 1.5).
- Water safety is assessed by taking 40 water samples and computing the z test statistic from their mean.
- Sample mean = 10.5
- α = 0.05

In [1]:
from scipy import stats
import numpy as np

In [2]:
#H0: mu = 10
#H1: mu > 10

In [3]:
x_bar = 10.5
n = 40
sigma = 1.5
mu = 10 

Calculate the test statistic

In [4]:
z = (x_bar - mu) / (sigma/np.sqrt(n))
z

2.1081851067789197

Calculate the p-value

In [5]:
p_value = 1 - stats.norm.cdf(z)
p_value

0.017507490509831247

In [6]:
alpha = 0.05

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
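The same decision can be reached with the critical-value approach instead of the p-value. The sketch below assumes the right-tailed setup of this example (H1: mu > 10); the names `z_crit` and `reject` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Inputs from Example-1
x_bar, mu, sigma, n, alpha = 10.5, 10, 1.5, 40, 0.05

z = (x_bar - mu) / (sigma / np.sqrt(n))

# Right-tailed critical value: the point with alpha probability above it
z_crit = stats.norm.ppf(1 - alpha)   # about 1.645 for alpha = 0.05

# Reject H0 when the test statistic falls in the rejection region z > z_crit
reject = z > z_crit
print('z = {:.3f}, critical value = {:.3f}, reject H0: {}'.format(z, z_crit, reject))
```

Comparing `z` with `z_crit` and comparing `p_value` with `alpha` always give the same decision; they are two views of the same rule.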


## Example-2

- A department store manager determines that a new billing system will be cost-effective only if the mean monthly account is more than 170 dollars.
- A random sample of 400 monthly accounts is drawn, for which the sample mean is 178 dollars. 
- The accounts are approximately normally distributed with a standard deviation of 65 dollars.


- Can we conclude that the new system will be cost-effective?

In [7]:
# H0: mu = 170
# H1: mu > 170

In [8]:
x_bar =  178
n = 400
sigma =  65 
mu =  170

Calculate the test statistic

In [9]:
z = (x_bar - mu) / (sigma/np.sqrt(n))
z

2.4615384615384617

In [10]:
# Standard error of the sample mean
sigma/np.sqrt(n)

3.25

In [11]:
p_value = 1 - stats.norm.cdf(z)
p_value

0.006917128192854505

In [12]:
# Same p-value computed directly from the sampling distribution of the mean, N(170, 3.25)
1 - stats.norm.cdf(178, 170, 3.25)

0.006917128192854505

In [13]:
alpha = 0.05

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
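A one-sided confidence bound gives a complementary view of this result. The following is a sketch under the same assumptions as the example (known sigma = 65, right-tailed test at alpha = 0.05); the variable name `lower_bound` is introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Inputs from Example-2
x_bar, mu, sigma, n, alpha = 178, 170, 65, 400, 0.05

se = sigma / np.sqrt(n)              # 3.25, as computed above
z_crit = stats.norm.ppf(1 - alpha)   # about 1.645

# 95% lower confidence bound for the population mean
lower_bound = x_bar - z_crit * se

# The right-tailed test rejects H0 exactly when the bound lies above mu
print('Lower bound: {:.2f}, exceeds {}: {}'.format(lower_bound, mu, lower_bound > mu))
```

Because the lower bound stays above 170 dollars, it agrees with the rejection of the null hypothesis.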


# One Sample t Test

## Example-1

- Bon Air ELEM has 1000 students. The principal of the school thinks that the average IQ of students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20 randomly selected students. 
- Among the sampled students, the average IQ is 108 with a standard deviation of 10. 
- Based on these results, is there enough evidence to reject the principal's hypothesis at α = 0.01?

In [14]:
# H0: mu >= 110 (the principal's claim)
# H1: mu < 110
x_bar = 108
n = 20
s = 10
mu = 110
alpha = 0.01

Calculate the test statistic

In [15]:
t = (x_bar - mu) / (s/np.sqrt(n))
t

-0.8944271909999159

In [16]:
# Left-tailed p-value: P(T <= t) with n - 1 degrees of freedom
p_value = stats.t.cdf(t, df=n-1)
p_value

0.1911420676837155

In [17]:
if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.01 level of significance, we fail to reject the null hypothesis.
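The critical-value approach leads to the same conclusion. This sketch assumes the left-tailed test used above (H1: mu < 110); `t_crit` and `reject` are names introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Inputs from the IQ example
x_bar, mu, s, n, alpha = 108, 110, 10, 20, 0.01

t = (x_bar - mu) / (s / np.sqrt(n))

# Left-tailed critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(alpha, df=n - 1)   # about -2.54 for alpha = 0.01, df = 19

# Reject H0 only when t falls below the critical value
reject = t < t_crit
print('t = {:.3f}, critical value = {:.3f}, reject H0: {}'.format(t, t_crit, reject))
```

The test statistic is not far enough into the left tail, so the sample mean of 108 is consistent with ordinary sampling variation around 110.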


### From a dataset

In [18]:
#pip install statsmodels

In [19]:
import statsmodels.api as sm

In [20]:
df = sm.datasets.get_rdataset(dataname = "Pima.tr", package = "MASS")
df.keys()

dict_keys(['data', '__doc__', 'package', 'title', 'from_cache'])

In [21]:
print(df.__doc__)

.. container::

   Pima.tr R Documentation

   .. rubric:: Diabetes in Pima Indian Women
      :name: diabetes-in-pima-indian-women

   .. rubric:: Description
      :name: description

   A population of women who were at least 21 years old, of Pima Indian
   heritage and living near Phoenix, Arizona, was tested for diabetes
   according to World Health Organization criteria. The data were
   collected by the US National Institute of Diabetes and Digestive and
   Kidney Diseases. We used the 532 complete records after dropping the
   (mainly missing) data on serum insulin.

   .. rubric:: Usage
      :name: usage

   ::

      Pima.tr
      Pima.tr2
      Pima.te

   .. rubric:: Format
      :name: format

   These data frames contains the following columns:

   ``npreg``
      number of pregnancies.

   ``glu``
      plasma glucose concentration in an oral glucose tolerance test.

   ``bp``
      diastolic blood pressure (mm Hg).

   ``skin``
      triceps skin fold thickness (mm).



In [22]:
df.data

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age,type
0,5,86,68,28,30.2,0.364,24,No
1,7,195,70,33,25.1,0.163,55,Yes
2,5,77,82,41,35.8,0.156,35,No
3,0,165,76,43,47.9,0.259,26,No
4,0,107,60,25,26.4,0.133,23,No
...,...,...,...,...,...,...,...,...
195,2,141,58,34,25.4,0.699,24,No
196,7,129,68,49,38.5,0.439,43,Yes
197,0,106,70,37,39.4,0.605,22,No
198,1,118,58,36,33.3,0.261,23,No


In [23]:
df = df.data

In [24]:
df.head()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age,type
0,5,86,68,28,30.2,0.364,24,No
1,7,195,70,33,25.1,0.163,55,Yes
2,5,77,82,41,35.8,0.156,35,No
3,0,165,76,43,47.9,0.259,26,No
4,0,107,60,25,26.4,0.133,23,No


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   npreg   200 non-null    int64  
 1   glu     200 non-null    int64  
 2   bp      200 non-null    int64  
 3   skin    200 non-null    int64  
 4   bmi     200 non-null    float64
 5   ped     200 non-null    float64
 6   age     200 non-null    int64  
 7   type    200 non-null    object 
dtypes: float64(2), int64(5), object(1)
memory usage: 12.6+ KB


In [26]:
df.describe()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,3.57,123.97,71.26,29.215,32.31,0.460765,32.11
std,3.366268,31.667225,11.479604,11.724594,6.130212,0.307225,10.975436
min,0.0,56.0,38.0,7.0,18.2,0.085,21.0
25%,1.0,100.0,64.0,20.75,27.575,0.2535,23.0
50%,2.0,120.5,70.0,29.0,32.8,0.3725,28.0
75%,6.0,144.0,78.0,36.0,36.5,0.616,39.25
max,14.0,199.0,110.0,99.0,47.9,2.288,63.0


In [27]:
# Suppose we hypothesize that the population mean BMI among Pima Indian women is above 30,
# since the sample mean is x_bar = 32.31.

In [28]:
# bmi mean:
# H0: mu = 30
# H1: mu > 30

In [29]:
df.bmi.mean()

32.31

In [30]:
# sample size (n) = 200
# sample std (s)  = 6.13
# sample mean (xbar) = 32.31

In [31]:
onesample = stats.ttest_1samp(df.bmi, 30, alternative='greater')
onesample

Ttest_1sampResult(statistic=5.329070841262502, pvalue=1.3307205153727868e-07)

In [32]:
onesample.statistic

5.329070841262502

In [33]:
# One-sided p-value by hand, with df = n - 1 = 199
1 - stats.t.cdf(onesample.statistic, 199)

1.3307205160018043e-07

In [34]:
#help(stats.ttest_1samp)

In [35]:
# Default is alternative='two-sided', so the p-value is twice the one-sided value
stats.ttest_1samp(df.bmi, 30)

Ttest_1sampResult(statistic=5.329070841262502, pvalue=2.6614410307455736e-07)
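The statistic reported by `ttest_1samp` can be reproduced from the summary values alone. This sketch uses the rounded figures from `df.describe()` (mean 32.31, std 6.130212, n 200), so the result matches the output above only up to rounding; the variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics for bmi, copied from df.describe()
x_bar, s, n, mu = 32.31, 6.130212, 200, 30

# t statistic from the one-sample formula
t_stat = (x_bar - mu) / (s / np.sqrt(n))

# One-sided (greater) p-value via the survival function, df = n - 1
p_one_sided = stats.t.sf(t_stat, df=n - 1)

# The two-sided p-value doubles the tail area beyond |t|
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_one_sided, p_two_sided)
```

Since the t distribution is symmetric, the two-sided p-value is exactly twice the one-sided one whenever the statistic lies in the direction of the one-sided alternative, which explains the relationship between the two `ttest_1samp` results.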