<a href="https://clarusway.com/contact-us/"><img align="center" src="https://i.ibb.co/B43qn24/officially-licensed-logo.png" alt="Open in Clarusway LMS" width="110" height="200" title="This notebook is licensed by Clarusway IT training school. Please contact the authorized persons about the conditions under which you can use or share."></a>

# [One Sample Z-Test](https://www.analystsoft.com/en/products/statplus/content/help/analysis_basic_statistics_one_sample_z-test.html)

The one-sample z-test is used when we want to know ``whether the difference between a sample mean and a population mean is large enough to be statistically significant``, that is, whether it is unlikely to have occurred by chance. The test is considered robust to violations of normality and is usually applied to relatively ``large samples (N > 30)`` or ``when the population variance is known``; otherwise, consider using a t-test.

**``Assumptions:``**
1. Mean and variance of the population are known.

2. The test statistic follows normal distribution.

**[One Sample Z-Test: Definition, Formula, and Example](https://www.statology.org/one-sample-z-test/)**<br>
**[One-Sample Z-Test](http://psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/zone.htm#:~:text=The%20one%2Dsample%20Z%20test%20compares%20a%20sample%20to%20a,central%20tendency%20and%20variability%2Fdispersion.)**<br>
**[One Sample Z-Test: How to Run One](https://www.statisticshowto.com/one-sample-z-test/)**

**``Performed when the population mean and standard deviation are known.``**

## Example-1

- Suppose that a beach is safe to swim if the mean level of lead in the water is ``10.0 (μ0)`` parts/million.  
- We assume Xi ~ N (μ, ``σ=1.5``)
- Water safety is going to be determined by taking 40 water samples and using the test statistic. 
- ``Sample mean=10.5``
- ``α=0.05``

In [35]:
# Importing related libraries

# import statsmodels.api as sm
import scipy.stats as stats
from math import sqrt
import numpy as np

In [36]:
# Writing the given variables

x_bar = 10.5  # sample mean 
n = 40  # number of samples
sigma = 1.5  # sd of population
mu = 10  # hypothesized population mean (mu0)

**Stating the null (H0) and alternative hypothesis (Ha or H1):**

<img src=https://i.ibb.co/4Sggp8f/one-two-tailed.png width="500" height="200">

**Calculating the test statistic:**

<img src=https://i.ibb.co/2ynY3cY/z-formula1.png width="600" height="200">

In [37]:
z = (x_bar-mu) / (sigma/sqrt(n))
z

2.1081851067789197

**Calculating the p-value:**

<img src=https://i.ibb.co/jRcqVnx/rejection-areas.png width="900" height="200">

**[Source](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module7-T-tests/PH717-Module7-T-tests4.html)**

In [38]:
p_value = 1 - stats.norm.cdf(z)
# p_value = 1 - stats.norm.cdf(2.1081851067789197)
p_value

0.017507490509831247

<img src=https://i.ibb.co/MNgLbWR/one-tailed.png width="500" height="200">

<img src=https://i.ibb.co/ChYm8Rg/rejection-acceptance-areas.png width="500" height="200">

**Making a decision:**

In [39]:
alpha = 0.05

if p_value < alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
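
The same decision can also be reached by comparing the test statistic with a critical value instead of comparing the p-value with α. The cell below is a minimal sketch of that comparison for this example, assuming a right-tailed test (Ha: μ > 10); the names `z_critical`, `reject_by_critical` and `reject_by_p_value` are illustrative and do not appear elsewhere in the notebook.

```python
from math import sqrt
import scipy.stats as stats

# Values given in Example-1
x_bar, mu, sigma, n, alpha = 10.5, 10, 1.5, 40, 0.05

z = (x_bar - mu) / (sigma / sqrt(n))

# Upper-tail cut-off of the standard normal distribution (assumes Ha: mu > 10)
z_critical = stats.norm.ppf(1 - alpha)

# Two equivalent decision rules for a right-tailed test
reject_by_critical = z > z_critical
reject_by_p_value = stats.norm.sf(z) < alpha  # sf(z) is 1 - cdf(z)

print(round(z, 3), round(z_critical, 3), reject_by_critical, reject_by_p_value)
```

Both rules should always agree, because `z > z_critical` holds exactly when the upper-tail area beyond `z` is smaller than α.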


## Example-2

- A department store manager determines that a new billing system will be cost-effective only if the ``mean`` monthly account is more than ``170 dollars``.
- A ``random sample of 400`` monthly accounts is drawn, for which the ``sample mean`` is ``178 dollars``.
- The accounts are ``approximately normally distributed`` with a ``standard deviation of 65 dollars``.
- Can we conclude that the new system will be cost-effective?

**Stating the null (H0) and alternative hypothesis (Ha or H1):**

- H0: μ = 170 (the mean monthly account is not more than 170 dollars)
- Ha: μ > 170 (the new billing system is cost-effective)

**Calculating the test statistic:**

In [40]:
# Writing the given variables

mu = 170

x_bar = 178
n = 400

sigma = 65

alpha = 0.05

In [41]:
z = (x_bar - mu) / (sigma/np.sqrt(n))
z

2.4615384615384617

In [42]:
# Standard error (standard deviation of the sampling distribution of the sample mean)

sigma/np.sqrt(n)

3.25

**Calculating the p-value:**

In [43]:
p_value = 1 - stats.norm.cdf(z)

In [44]:
1 - stats.norm.cdf(178, 170, 3.25)

0.006917128192854505

**Making a decision:**

In [45]:
if p_value < alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
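
The steps used in both examples (test statistic, then p-value) can be wrapped in a small function. What follows is only a sketch: `one_sample_z_test` and its `alternative` argument are hypothetical names introduced here for illustration, not part of SciPy. It uses `stats.norm.sf`, the survival function, which is equivalent to `1 - stats.norm.cdf`.

```python
from math import sqrt
import scipy.stats as stats


def one_sample_z_test(x_bar, mu0, sigma, n, alternative="greater"):
    """Hypothetical helper: z statistic and p-value from summary statistics.

    Assumes the population standard deviation `sigma` is known.
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))
    if alternative == "greater":      # Ha: mu > mu0
        p_value = stats.norm.sf(z)
    elif alternative == "less":       # Ha: mu < mu0
        p_value = stats.norm.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0
        p_value = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Example-2: mu0 = 170, x_bar = 178, sigma = 65, n = 400
z, p_value = one_sample_z_test(178, 170, 65, 400)
print(z, p_value)
```

With the Example-2 inputs this should reproduce the statistic and p-value computed step by step above.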


### Z-test Exercise for Students

``1,500 women`` followed the Atkins diet for a month. A ``random sample of 43 women`` gained an ``average of 6.7 pounds``. Test the hypothesis that the average weight gain per woman for the month was ``over 5 pounds``. The ``standard deviation`` for all women in the group was ``7.1``.

**Stating the null (H0) and alternative hypothesis (Ha or H1):**

**Calculating the test statistic:**

**Calculating the p-value:**

**Making a decision:**

# [One Sample t-Test](https://vitalflux.com/one-sample-t-test-formula-examples/)

In statistics, the t-test is often used when a researcher wants to know ``whether there is a significant difference between the mean of a sample and the mean of a population``, or ``whether there is a significant difference between the means of two different groups``, but full population information is ``NOT`` available.

**[One-Sample t-Test](http://psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/tone.htm)**<br>
**[One Sample T Test - Easily Explained w/ 5+ Examples!](https://calcworkshop.com/hypothesis-test/one-sample-t-test/)**<br>
**[One Sample T Test – Clearly Explained with Examples | ML+](https://www.machinelearningplus.com/statistics/one-sample-t-test/)**<br>
**[How do you find the t-test statistic?](https://www.omnicalculator.com/statistics/t-test)**

<img src=https://i.ibb.co/nsxqsbY/t-formula.png width="600" height="200">

**``Performed when the population standard deviation is unknown.``**

## Example-1

- Bon Air ELEM has 1000 students. The principal of the school thinks that the ``average IQ`` of students at Bon Air is ``at least 110``. To prove her point, she administers an IQ test to ``20 randomly selected students``. 
- Among the sampled students, the ``average IQ`` is ``108`` with a ``standard deviation of 10``. 
- Based on these results, should the principal accept or reject her original hypothesis? ``α=0.01``

In [46]:
# Writing the given variables

x_bar = 108  # sample mean 
n = 20  # number of students
s = 10  # sd of sample
mu = 110  # hypothesized population mean
alpha = 0.01

**Calculating the test statistic:**

In [47]:
t = (x_bar - mu)/(s/sqrt(n))
t

-0.8944271909999159

In [48]:
# Left-tailed test (Ha: mu < 110), so the p-value is the area to the left of t
p_value = stats.t.cdf(t, df=n-1)
p_value

0.1911420676837155

**Making a decision:**

In [49]:
if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.01 level of significance, we fail to reject the null hypothesis.


### Bonus

A critical t or critical z value is a ``“cut-off point”`` on the t or z distribution. The t-distribution is typically used for inference about a population mean when the sample size is small and the population variance is unknown; the z-distribution (standard normal) is used when the sample size is large or the population variance is known. The test statistic is compared with the critical value to decide whether to reject or fail to reject the null hypothesis.

In [None]:
# How to calculate the critical t-value (lower tail, for a left-tailed test)

stats.t.ppf(alpha, df=n-1)

When making your decision, you can compare either:
- ``t0 vs tcritical`` (or ``z0 vs zcritical``), or
- ``p-value vs alpha value (α)``

**[Hypothesis testing and p-values](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/more-significance-testing-videos/v/hypothesis-testing-and-p-values)**
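
As a rough illustration that the two comparisons lead to the same decision, the sketch below applies both to the IQ example (x̄ = 108, μ = 110, s = 10, n = 20, α = 0.01), assuming a left-tailed test. The variable names `t_critical`, `reject_by_critical` and `reject_by_p_value` are illustrative only.

```python
from math import sqrt
import scipy.stats as stats

# Values from the one-sample t-test example (IQ scores)
x_bar, mu, s, n, alpha = 108, 110, 10, 20, 0.01

t = (x_bar - mu) / (s / sqrt(n))

# Lower-tail cut-off of the t-distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(alpha, df=n - 1)

# Two equivalent decision rules for a left-tailed test (Ha: mu < 110)
reject_by_critical = t < t_critical
reject_by_p_value = stats.t.cdf(t, df=n - 1) < alpha

print(round(t, 3), round(t_critical, 3), reject_by_critical, reject_by_p_value)
```

Here the statistic does not fall below the cut-off, which matches the earlier conclusion that we fail to reject the null hypothesis.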

### T-test Exercise for Students

Imagine a company wants to test the claim that their batteries last ``more than 40 hours``. A simple ``random sample of 15 batteries`` yielded a ``mean of 44.9 hours``, with a ``standard deviation of 8.9 hours``. Test this claim using a ``significance level of 0.05``.

**Stating the null (H0) and alternative hypothesis (Ha or H1):**

**Calculating the test statistic:**

**Calculating the p-value:**

**Making a decision:**

### Summary

<img src=https://i.ibb.co/pKK7xVk/Z-Test-vs-T-Test.png width="600" height="200">

### scipy.stats

In [50]:
# pip install statsmodels

In [51]:
import statsmodels.api as sm

In [52]:
df = sm.datasets.get_rdataset(dataname = "Pima.tr", package = "MASS")
df.keys()

dict_keys(['data', '__doc__', 'package', 'title', 'from_cache'])

In [53]:
print(df.__doc__)

.. container::

   Pima.tr R Documentation

   .. rubric:: Diabetes in Pima Indian Women
      :name: diabetes-in-pima-indian-women

   .. rubric:: Description
      :name: description

   A population of women who were at least 21 years old, of Pima Indian
   heritage and living near Phoenix, Arizona, was tested for diabetes
   according to World Health Organization criteria. The data were
   collected by the US National Institute of Diabetes and Digestive and
   Kidney Diseases. We used the 532 complete records after dropping the
   (mainly missing) data on serum insulin.

   .. rubric:: Usage
      :name: usage

   ::

      Pima.tr
      Pima.tr2
      Pima.te

   .. rubric:: Format
      :name: format

   These data frames contains the following columns:

   ``npreg``
      number of pregnancies.

   ``glu``
      plasma glucose concentration in an oral glucose tolerance test.

   ``bp``
      diastolic blood pressure (mm Hg).

   ``skin``
      triceps skin fold thickness (mm).



In [54]:
df.data

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age,type
0,5,86,68,28,30.2,0.364,24,No
1,7,195,70,33,25.1,0.163,55,Yes
2,5,77,82,41,35.8,0.156,35,No
3,0,165,76,43,47.9,0.259,26,No
4,0,107,60,25,26.4,0.133,23,No
...,...,...,...,...,...,...,...,...
195,2,141,58,34,25.4,0.699,24,No
196,7,129,68,49,38.5,0.439,43,Yes
197,0,106,70,37,39.4,0.605,22,No
198,1,118,58,36,33.3,0.261,23,No


In [55]:
df = df.data

In [56]:
df.head()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age,type
0,5,86,68,28,30.2,0.364,24,No
1,7,195,70,33,25.1,0.163,55,Yes
2,5,77,82,41,35.8,0.156,35,No
3,0,165,76,43,47.9,0.259,26,No
4,0,107,60,25,26.4,0.133,23,No


In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   npreg   200 non-null    int64  
 1   glu     200 non-null    int64  
 2   bp      200 non-null    int64  
 3   skin    200 non-null    int64  
 4   bmi     200 non-null    float64
 5   ped     200 non-null    float64
 6   age     200 non-null    int64  
 7   type    200 non-null    object 
dtypes: float64(2), int64(5), object(1)
memory usage: 12.6+ KB


In [58]:
df.describe()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,3.57,123.97,71.26,29.215,32.31,0.460765,32.11
std,3.366268,31.667225,11.479604,11.724594,6.130212,0.307225,10.975436
min,0.0,56.0,38.0,7.0,18.2,0.085,21.0
25%,1.0,100.0,64.0,20.75,27.575,0.2535,23.0
50%,2.0,120.5,70.0,29.0,32.8,0.3725,28.0
75%,6.0,144.0,78.0,36.0,36.5,0.616,39.25
max,14.0,199.0,110.0,99.0,47.9,2.288,63.0


In [59]:
# Suppose we hypothesize that the population mean of bmi among Pima Indian women is above 30,
# because the sample mean is x_bar = 32.3.

In [60]:
# bmi mean:
# H0: mu = 30
# Ha: mu > 30

In [61]:
df.bmi.mean()

32.30999999999998

In [62]:
# sample size = 200
# sample std = 6.13
# sample mean = 32.3

In [63]:
onesample = stats.ttest_1samp(df["bmi"], 30)
onesample

Ttest_1sampResult(statistic=5.329070841262502, pvalue=2.6614410307455736e-07)
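
As a sanity check, the statistic reported by `stats.ttest_1samp` can be reproduced by hand with the formula from the earlier t-test example. The sketch below uses the rounded summary values from `df.describe()` (mean 32.31, std 6.130212, n 200), so small rounding differences from the output above are expected; `p_two_sided` is an illustrative name.

```python
from math import sqrt
import scipy.stats as stats

# Summary statistics of the bmi column, copied from df.describe()
x_bar, s, n = 32.31, 6.130212, 200
mu0 = 30  # hypothesized population mean

t = (x_bar - mu0) / (s / sqrt(n))

# Two-sided p-value, the default reported by stats.ttest_1samp
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 1)

print(t, p_two_sided)
```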

In [64]:
# help(stats.ttest_1samp)

In [65]:
# ttest_1samp is two-sided by default. For a one-sided test, halve the p-value
# (valid here because the statistic is positive, i.e. in the direction of Ha: mu > 30).
onesample.pvalue/2

1.3307205153727868e-07

In [66]:
alpha = 0.05

if onesample.pvalue/2<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
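
Instead of halving the two-sided p-value by hand, recent SciPy versions (1.6 and later) let you request a one-sided test directly through the `alternative` argument of `stats.ttest_1samp`. The sketch below uses synthetic data as a stand-in for the bmi column, so the numbers are illustrative only and are not results for the Pima dataset.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the bmi column (assumption: NOT the real Pima data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=32, scale=6, size=200)

two_sided = stats.ttest_1samp(sample, 30)
one_sided = stats.ttest_1samp(sample, 30, alternative="greater")  # Ha: mu > 30

# When the statistic is positive, the one-sided p-value is half the two-sided one
print(two_sided.statistic, two_sided.pvalue / 2, one_sided.pvalue)
```

Using `alternative="greater"` avoids the main pitfall of manual halving: if the statistic were negative, `pvalue / 2` would understate the one-sided p-value for Ha: μ > 30, whereas the built-in option handles the direction for you.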

