In [22]:
import pandas as pd
import numpy as np


data = pd.read_csv('data/data.csv', sep = ',')

# <font color=green>Parametric tests</font>
***

When a test makes assumptions about the distribution of the population from which the sample is drawn (for example, that the sample mean is normally distributed), we are working with **parametric tests**.

## <font color = green>Two-tailed test </font>
***

## <font color='red'> Problem </font>

The company **Suco Bom** produces **fruit juices in 500 ml packages**. Its production process is almost entirely automated, and the packages are filled by a machine that sometimes drifts out of adjustment, overfilling or underfilling them. When the average volume falls below 500 ml, the company worries about losing sales and having problems with the inspection agencies. When the average volume exceeds 500 ml, the company worries about losses in the production process.

**Suco Bom**'s quality control department periodically draws **samples of 50 packages** to monitor the production process. For each sample, a **hypothesis test** is carried out to assess whether the machine is out of adjustment. The quality control team adopts a **significance level of 5%**.

Suppose now that a **sample of 50 packages** was selected and that the **observed sample mean was 503.24 ml**. **Is this sample mean sufficiently different from 500 ml for us to reject, at the 5% significance level, the hypothesis that the process mean is 500 ml?**

The **two-tailed test** is widely used in **quality tests**, such as the one presented in our problem above. Another example is the evaluation of parts that must fit together perfectly (nuts and bolts, keys and locks).

![Two-tailed test](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img005.png)

---

### Problem Data

In [23]:
sample = [509, 505, 495, 510, 496, 509, 497, 502, 503, 505, 
           501, 505, 510, 505, 504, 497, 506, 506, 508, 505, 
           497, 504, 500, 498, 506, 496, 508, 497, 503, 501, 
           503, 506, 499, 498, 509, 507, 503, 499, 509, 495, 
           502, 505, 504, 509, 508, 501, 505, 497, 508, 507]

In [24]:
sample = pd.DataFrame(sample, columns=['Sample'])
sample.head()

|   | Sample |
|---|--------|
| 0 | 509 |
| 1 | 505 |
| 2 | 495 |
| 3 | 510 |
| 4 | 496 |


In [25]:
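# Select the column by name; positional [0] on a label-indexed Series is deprecated in recent pandas
sample_average = sample['Sample'].mean()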
sample_average

503.24

In [26]:
# pandas uses ddof=1 by default, i.e. the sample standard deviation s
sample_std = sample['Sample'].std()
sample_std

4.483803050527347
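The value above is the sample standard deviation (divisor $n - 1$), which is what pandas computes by default. NumPy's `np.std` defaults to the population formula (divisor $n$), so the two libraries disagree unless `ddof=1` is passed explicitly. A minimal sketch of the difference, using the same 50 measurements (the names `values`, `s_sample` and `s_population` are illustrative and not part of the original notebook):

```python
import numpy as np

# The same 50 volume measurements (ml) used in the problem
values = [509, 505, 495, 510, 496, 509, 497, 502, 503, 505,
          501, 505, 510, 505, 504, 497, 506, 506, 508, 505,
          497, 504, 500, 498, 506, 496, 508, 497, 503, 501,
          503, 506, 499, 498, 509, 507, 503, 499, 509, 495,
          502, 505, 504, 509, 508, 501, 505, 497, 508, 507]

# ddof=1 divides by n - 1 (sample standard deviation, matches pandas .std())
s_sample = np.std(values, ddof=1)

# ddof=0 (NumPy's default) divides by n (population standard deviation)
s_population = np.std(values, ddof=0)

print(s_sample, s_population)
```

The test statistic used later in this notebook relies on the `ddof=1` version.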

In [27]:
mean = 500
significance = 0.05
confidence = 1 - significance
n = 50

### **Step 1** - formulation of hypotheses $H_0$ and $H_1$

#### <font color='red'> Remember, the null hypothesis always contains the equality claim </font>

### $H_0: \mu = 500$

### $H_1: \mu \neq 500$

### **Step 2** - choosing the appropriate sampling distribution
<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img003.png' width=70%>

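### Is the sample size greater than 30?
#### Answer: Yes

### Is the population standard deviation known?
#### Answer: No

Since $n \geq 30$ and $\sigma$ is unknown, we use the normal distribution, with the sample standard deviation $s$ standing in for $\sigma$.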

### **Step 3** - setting the significance level of the test ($\alpha$)

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html

In [28]:
from scipy.stats import norm

In [29]:
probability = (0.5 + (confidence / 2))
probability

0.975

### Get $z_{\alpha/2}$

In [30]:
z_alpha_2 = norm.ppf(probability)
z_alpha_2

1.959963984540054
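In a two-tailed test, $\alpha$ is split equally between the two tails, so there are two critical values, $-z_{\alpha/2}$ and $+z_{\alpha/2}$. The sketch below cross-checks the value above in three equivalent ways with `scipy.stats.norm`; the variable names are illustrative only.

```python
from scipy.stats import norm

significance = 0.05

# Upper critical value: leaves alpha/2 of probability in the right tail
z_upper = norm.ppf(1 - significance / 2)

# Same point through the inverse survival function
z_upper_isf = norm.isf(significance / 2)

# Lower critical value: leaves alpha/2 in the left tail (symmetric to z_upper)
z_lower = norm.ppf(significance / 2)

print(z_lower, z_upper, z_upper_isf)
```

The acceptance region is the interval between `z_lower` and `z_upper`; the two rejection regions lie outside it.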

![Acceptance region](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img006.png)

---

### **Step 4** - calculating the test statistic and comparing it with the acceptance and rejection regions of the test

# $$z = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}$$

In [31]:
z = (sample_average - mean) / (sample_std / np.sqrt(n))
z

5.109559775991877
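To make the arithmetic explicit, the same statistic can be rebuilt from the summary values reported above, using only the standard library. The names `x_bar`, `mu_0`, `standard_error` and `z_check` are illustrative.

```python
import math

# Summary values taken from the cells above
x_bar = 503.24             # sample mean (ml)
mu_0 = 500                 # mean under H0 (ml)
s = 4.483803050527347      # sample standard deviation (ml)
n = 50                     # sample size

# Standard error of the mean: s / sqrt(n)
standard_error = s / math.sqrt(n)

# Distance between x_bar and mu_0, measured in standard errors
z_check = (x_bar - mu_0) / standard_error

print(standard_error, z_check)
```

The sample mean lies about 5.11 standard errors above 500 ml, well beyond the critical value of about 1.96.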

![Test statistic](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img007.png)

---

### **Step 5** - acceptance or rejection of the null hypothesis

<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img013.png' width=90%>

### <font color='red'>Critical value criterion</font>

> ### Two-tailed test
> ### Reject $H_0$ if $z \leq -z_{\alpha / 2}$ or if $z \geq z_{\alpha / 2}$

In [32]:
z <= -z_alpha_2

False

In [33]:
z >= z_alpha_2

True
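The two comparisons above can be folded into a single check on $|z|$. The helper below is a hypothetical convenience function (not part of the original notebook) that sketches the critical value rule for a two-tailed test.

```python
from scipy.stats import norm

def reject_two_tailed(z_stat, alpha=0.05):
    """Return True when z_stat falls in either rejection region."""
    z_crit = norm.ppf(1 - alpha / 2)
    # |z| >= z_crit covers both z <= -z_crit and z >= z_crit
    return bool(abs(z_stat) >= z_crit)

print(reject_two_tailed(5.109559775991877))  # statistic from this problem
print(reject_two_tailed(-2.5))               # lower-tail rejection
print(reject_two_tailed(1.0))                # inside the acceptance region
```

Using the absolute value avoids having to write, and possibly mistype, the two one-sided comparisons separately.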

### <font color = 'green'> Conclusion: Since $z \approx 5.11 \geq z_{\alpha/2} \approx 1.96$, the test statistic falls in the rejection region and we reject $H_0$ at the 5% significance level: the sample mean $\bar{x}$ differs significantly from 500 ml. In this case, the machine that fills the packages must be adjusted. </font>

### <font color='red'>$p$-value criterion</font>

> ### Two-tailed test
> ### Reject $H_0$ if the $p$-value $\leq \alpha$

In [34]:
# Two-tailed p-value: twice the tail area beyond |z| (abs keeps it valid for negative z)
p_value = 2 * (1 - norm.cdf(abs(z)))
p_value

3.2291031715203644e-07

In [35]:
p_value <= significance

True

In [37]:
# Alternative: the survival function, sf(z) = 1 - cdf(z), gives the upper-tail area directly
p_value = 2 * norm.sf(abs(z))
p_value

3.229103172445718e-07

In [38]:
p_value <= significance

True
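The two p-values above differ slightly in their last digits because `1 - norm.cdf(z)` subtracts two numbers that are almost equal, which loses precision. For more extreme statistics the effect becomes visible. The sketch below uses a hypothetical helper, `two_tailed_p_value`, and an artificial statistic of 10 purely for illustration.

```python
from scipy.stats import norm

def two_tailed_p_value(z_stat):
    """Two-tailed p-value for a z statistic, computed with the survival function."""
    return float(2 * norm.sf(abs(z_stat)))

z_extreme = 10.0  # artificial value, chosen only to expose the rounding issue

# cdf(10) rounds to exactly 1.0 in double precision, so the difference collapses to 0
p_via_cdf = 2 * (1 - norm.cdf(z_extreme))

# sf computes the tail area directly and keeps the tiny probability
p_via_sf = two_tailed_p_value(z_extreme)

print(p_via_cdf, p_via_sf)
```

For the statistic in this problem both approaches lead to the same decision; `norm.sf` is simply the more robust choice when the tail probability is very small.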

In [39]:
from statsmodels.stats.weightstats import ztest

In [40]:
ztest(x1 = sample, value = mean)

(array([5.10955978]), array([3.22910317e-07]))

In [41]:
from statsmodels.stats.weightstats import DescrStatsW

In [42]:
test = DescrStatsW(sample)

In [45]:
z, p_value = test.ztest_mean(value = mean)
print(z[0])
print(p_value[0])

5.109559775991874
3.2291031724457596e-07
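Both statsmodels calls reproduce the values obtained step by step above. As a cross-check, the whole procedure can be condensed into a few lines of NumPy and SciPy. The function `one_sample_ztest` below is a hypothetical helper written for this illustration; it assumes the sample standard deviation (`ddof=1`) as the estimate of $\sigma$, which is consistent with the matching results above.

```python
import numpy as np
from scipy.stats import norm

def one_sample_ztest(values, mu_0):
    """Two-tailed one-sample z test using s (ddof=1) as the estimate of sigma."""
    values = np.asarray(values, dtype=float)
    std_error = values.std(ddof=1) / np.sqrt(values.size)
    z_stat = (values.mean() - mu_0) / std_error
    p = 2 * norm.sf(abs(z_stat))
    return float(z_stat), float(p)

# The same 50 volume measurements (ml) used in the problem
values = [509, 505, 495, 510, 496, 509, 497, 502, 503, 505,
          501, 505, 510, 505, 504, 497, 506, 506, 508, 505,
          497, 504, 500, 498, 506, 496, 508, 497, 503, 501,
          503, 506, 499, 498, 509, 507, 503, 499, 509, 495,
          502, 505, 504, 509, 508, 501, 505, 497, 508, 507]

z_stat, p = one_sample_ztest(values, 500)
print(z_stat, p)
print(p <= 0.05)  # decision at the 5% significance level
```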
