## Importing libraries and dataset

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv('data/data.csv', sep=',')

data.head()

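| Unnamed: 0 | UF | Sex | Age | Color | Years of study | Income | Height |
|---|---|---|---|---|---|---|---|
| 0 | 11 | 0 | 23 | 8 | 12 | 800 | 1.603808 |
| 1 | 11 | 1 | 23 | 2 | 12 | 1150 | 1.73979 |
| 2 | 11 | 1 | 35 | 8 | 15 | 880 | 1.760444 |
| 3 | 11 | 0 | 46 | 2 | 6 | 3500 | 1.783158 |
| 4 | 11 | 1 | 47 | 8 | 9 | 150 | 1.690631 |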


## <font color = green> Tests for Two Samples </font>
***

## <font color = 'red'> Problem </font>


Our dataset contains the income of heads of households, taken from the 2015 National Household Sample Survey (PNAD). A well-known problem in our country is income inequality, especially between men and women.

Two random samples, one of **500 men** and one of **500 women**, were drawn from our dataset. To check for this inequality, **test the equality of means** between the two samples at a **significance level of 1%**.

---

Hypothesis tests can also be used to compare two different samples. In this type of test, we want to decide whether a parameter of one sample (here, the mean) differs from that of the other.

### Selection of samples

In [5]:
men = data.query('Sex == 0').sample(500, random_state=101)['Income']

In [7]:
women = data.query('Sex != 0').sample(500, random_state=101)['Income']

### Problem data

In [12]:
men_sample_average = men.mean() 
men_sample_average

2142.608

In [13]:
women_sample_average = women.mean() 
women_sample_average

1357.528

In [14]:
men_sample_std = men.std()
men_sample_std

2548.050802499875

In [15]:
women_sample_std = women.std()  
women_sample_std

1569.9011907484578

In [17]:
significance = 0.01
confidence = 1 - significance
n_men = 500
n_women = 500
d_0 = 0  # hypothesized difference between the two means under H0

---

### **Step 1** - Formulation of hypotheses $H_0$ and $H_1$

#### <font color = 'red'> Remember, the null hypothesis always contains the equality claim </font>

### $\mu_1\Rightarrow $ Average income of male household heads
### $\mu_2\Rightarrow $ Average income of female household heads


$
\begin{cases}
H_0: \mu_1 \leq \mu_2\\
H_1: \mu_1 > \mu_2
\end{cases}
$

### or

 $
\begin{cases}
H_0: \mu_1 -\mu_2 \leq 0\\
H_1: \mu_1 -\mu_2 > 0
\end{cases}
$

### **Step 2** - Choose the appropriate sampling distribution
<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img003.png' width = 70%>

### <font color = 'red'> Important note </font>
> In tests involving two samples using the Student's $ t $ table, the number of degrees of freedom will always be equal to $ n_1 + n_2 - 2 $

### Is the sample size greater than 30?
#### Answer: Yes

### Is the population standard deviation known?
#### Answer: No
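
With samples this large, the Student's $t$ critical value with $n_1 + n_2 - 2$ degrees of freedom is very close to the standard normal one, which is why the normal distribution is a reasonable choice here. The following is a minimal sketch of that comparison, assuming the sample sizes and the significance level from the problem statement; the names `dof`, `z_crit` and `t_crit` are illustrative and not part of the original notebook.

```python
from scipy.stats import norm, t

alpha = 0.01                    # significance level from the problem statement
n_1, n_2 = 500, 500             # sample sizes from the problem statement
dof = n_1 + n_2 - 2             # degrees of freedom for the two-sample t table

z_crit = norm.ppf(1 - alpha)    # one-tailed critical value, standard normal
t_crit = t.ppf(1 - alpha, dof)  # one-tailed critical value, Student's t

print(f"z critical: {z_crit:.4f}")
print(f"t critical: {t_crit:.4f} (dof = {dof})")
```

The two critical values differ only in the third decimal place, so the decision in this problem would be the same with either distribution.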

---



### **Step 3** - Setting the significance level of the test ($\alpha$)

In [19]:
probability = confidence
probability 

0.99

In [21]:
from scipy.stats import norm

z_alpha = norm.ppf(probability)
z_alpha.round(2)

2.33

![Acceptance region](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img011.png)

---

### **Step 4** - Calculation of the test statistic and comparison with the acceptance and rejection regions of the test

# $$ z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

In [26]:
numerator = (men_sample_average - women_sample_average) - d_0

denominator = np.sqrt((men_sample_std ** 2 /n_men) + (women_sample_std ** 2/n_women))

z = numerator/denominator
z

5.86562005776475
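
The same calculation can be wrapped in a small helper that works directly from summary statistics. This is only a sketch: the function name `two_sample_z` is hypothetical, and the men's standard deviation of about 2548.05 is an assumption, back-computed from the reported $z$ and the other values shown above.

```python
import math

def two_sample_z(mean_1, std_1, n_1, mean_2, std_2, n_2, d_0=0.0):
    """z statistic for the difference between two independent sample means."""
    # Standard error of the difference, using each sample's own variance
    standard_error = math.sqrt(std_1 ** 2 / n_1 + std_2 ** 2 / n_2)
    return ((mean_1 - mean_2) - d_0) / standard_error

z_check = two_sample_z(
    mean_1=2142.608, std_1=2548.05, n_1=500,              # men (std_1 is an assumed, back-computed value)
    mean_2=1357.528, std_2=1569.9011907484578, n_2=500,   # women
)
print(round(z_check, 2))
```

Keeping `d_0` as a parameter makes it easy to test a hypothesized difference other than zero.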

![Test statistic](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img012.png)

---

### **Step 5** - Acceptance or rejection of the null hypothesis

<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img014.png' width=90%>

### <font color = 'red'> Critical value criterion </font>

> ### One-tail test
> ### Reject $ H_0 $ if $ z \geq z _ {\alpha} $

In [27]:
z >= z_alpha

True

### <font color = 'green'> Conclusion: At the 1% significance level (99% confidence), we reject $ H_0 $; that is, we conclude that the average income of male household heads is higher than the average income of female household heads. This supports the claim of income inequality between the sexes. </font>

### <font color='red'>$p$ value criterion</font>

> ### One-tail test
> ### Reject $H_0$ if the $p$-value $\leq\alpha$

https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.DescrStatsW.html

https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.CompareMeans.ttest_ind.html

In [35]:
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans

In [36]:
test_men = DescrStatsW(men)

In [37]:
test_women = DescrStatsW(women)

In [47]:
test_with_descrStatsW = test_men.get_compare(test_women)

In [48]:
z, p_value = test_with_descrStatsW.ztest_ind(alternative='larger', value = 0)
p_value

2.2372867859458255e-09

In [43]:
p_value <= significance

True

In [49]:
test_with_compareMeans = CompareMeans(test_men, test_women)

In [50]:
z, p_value = test_with_compareMeans.ztest_ind(alternative='larger', value = 0)
p_value

2.2372867859458255e-09

In [51]:
p_value <= significance

True
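
As a cross-check, the upper-tail $p$-value can also be computed directly from the test statistic with `scipy.stats.norm.sf`, without going through `statsmodels`. This is a minimal sketch, assuming the $z$ value obtained in Step 4; the names `z_stat`, `p_upper` and `reject_h0` are illustrative.

```python
from scipy.stats import norm

z_stat = 5.86562005776475   # test statistic from Step 4
alpha = 0.01                # significance level

# Upper-tail probability P(Z >= z_stat) under the standard normal
p_upper = norm.sf(z_stat)

# Decision rule for the one-tail test: reject H0 if p-value <= alpha
reject_h0 = p_upper <= alpha
print(p_upper, reject_h0)
```

`norm.sf` is used instead of `1 - norm.cdf(z_stat)` because it keeps precision for probabilities this far out in the tail.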