### Problem

A new treatment to help people stop smoking is being tested on a group of **35 volunteer patients**. For each patient, the number of cigarettes consumed per day was recorded before and after the treatment. Assuming a **95% confidence level**, is it possible to conclude that the smoking habit of the group changed after the new treatment was applied?

## <font color = green>Wilcoxon test </font>
### Comparison of two populations - dependent samples
***

Used to compare two related (paired) samples. It applies when you want to test the difference between two conditions, that is, when the same element is measured twice.

### Problem data

In [55]:
smoke = {
    'Before': [39, 25, 24, 50, 13, 52, 21, 29, 10, 22, 50, 15, 36, 39, 52, 48, 24, 15, 40, 41, 17, 12, 21, 49, 14, 55, 46, 22, 28, 23, 37, 17, 31, 49, 49],
    'After': [16, 8, 12, 0, 14, 16, 13, 12, 19, 17, 17, 2, 15, 10, 20, 13, 0, 4, 16, 18, 16, 16, 9, 9, 18, 4, 17, 0, 11, 14, 0, 19, 2, 9, 6]
}
significance = 0.05
confidence = 1 - significance
n = 35

In [56]:
import pandas as pd

smoke = pd.DataFrame(smoke)
smoke.head()

Unnamed: 0,Before,After
0,39,16
1,25,8
2,24,12
3,50,0
4,13,14


In [57]:
average_before = smoke['Before'].mean()
average_before

31.857142857142858

In [58]:
average_after = smoke['After'].mean()
average_after

11.2

### **Step 1** - formulation of hypotheses $H_0$ and $H_1$

#### <font color = 'red'> Remember, the null hypothesis always contains the equality claim </font>

### $H_0: \mu_{before} = \mu_{after}$

### $H_1: \mu_{before} \neq \mu_{after}$

---

### **Step 2** - choose the appropriate sample distribution

### Is the sample size larger than 20?
#### Ans.: Yes, so the normal approximation ($Z$) is used



---

### **Step 3** - fixing the test significance ($\alpha$)

### Getting $z_{\alpha / 2} $

In [59]:
probability = (0.5 + (confidence / 2))
probability

0.975

In [60]:
from scipy.stats import norm

z_alpha_2 = norm.ppf(probability)
z_alpha_2.round(2)

1.96
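The probability passed to `norm.ppf` above, `0.5 + confidence / 2`, is the same quantity as $1 - \alpha / 2$. A minimal sketch of that equivalence, with a one-sided critical value added only for comparison (the variable names are illustrative, not part of the notebook):

```python
from scipy.stats import norm

significance = 0.05
confidence = 1 - significance

# Two equivalent ways of writing the upper-tail probability for a two-sided test
prob_a = 0.5 + confidence / 2
prob_b = 1 - significance / 2

z_two_sided = norm.ppf(prob_b)            # critical value for a two-sided test
z_one_sided = norm.ppf(1 - significance)  # critical value for a one-sided test

print(round(z_two_sided, 2), round(z_one_sided, 2))
```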

![Acceptance Region](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img006.png)

---

### **Step 4** - calculation of the test statistic and verification of this value with the test acceptance and rejection areas

# $$ Z = \frac {T - \mu_T} {\sigma_T} $$

Where

## $T$ = smaller of the sums of ranks with the same sign

# $$ \mu_T = \frac {n (n + 1)} {4} $$
# $$ \sigma_T = \sqrt {\frac {n (n + 1) (2n + 1)} {24}} $$
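The three formulas above can be collected into one helper. This is only a sketch: the function name `wilcoxon_normal_approx` is hypothetical, ranks come from `scipy.stats.rankdata` (which averages tied ranks), and dropping zero differences is an assumption that the notebook itself does not need because this dataset has none.

```python
import numpy as np
from scipy.stats import rankdata


def wilcoxon_normal_approx(before, after):
    """Return (T, Z) for the signed-rank test via the normal approximation."""
    diff = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    diff = diff[diff != 0]              # assumption: zero differences are discarded
    n = len(diff)

    ranks = rankdata(np.abs(diff))      # tied |diff| values share their average rank
    rank_sum_pos = ranks[diff > 0].sum()
    rank_sum_neg = ranks[diff < 0].sum()
    T = min(rank_sum_pos, rank_sum_neg)

    mu_T = n * (n + 1) / 4
    sigma_T = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return T, (T - mu_T) / sigma_T
```

Calling it with the `Before` and `After` columns should give the same $T$ and $Z$ that the step-by-step table construction produces.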

### Building the table with the ranks (stored in the `Post` column)

In [61]:
smoke

Unnamed: 0,Before,After
0,39,16
1,25,8
2,24,12
3,50,0
4,13,14
5,52,16
6,21,13
7,29,12
8,10,19
9,22,17


In [62]:
smoke['Diff'] = smoke['After'] - smoke['Before']
smoke

Unnamed: 0,Before,After,Diff
0,39,16,-23
1,25,8,-17
2,24,12,-12
3,50,0,-50
4,13,14,1
5,52,16,-36
6,21,13,-8
7,29,12,-17
8,10,19,9
9,22,17,-5


In [63]:
smoke['|Diff|'] = smoke['Diff'].abs()
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|
0,39,16,-23,23
1,25,8,-17,17
2,24,12,-12,12
3,50,0,-50,50
4,13,14,1,1


In [64]:
smoke.sort_values(by = '|Diff|', inplace=True)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|
4,13,14,1,1
20,17,16,-1,1
31,17,19,2,2
21,12,16,4,4
24,14,18,4,4


In [65]:
smoke['Post'] = range(1, len(smoke) + 1)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|,Post
4,13,14,1,1,1
20,17,16,-1,1,2
31,17,19,2,2,3
21,12,16,4,4,4
24,14,18,4,4,5


In [66]:
post = smoke[['|Diff|', 'Post']].groupby(['|Diff|']).mean()
post.head()

Unnamed: 0_level_0,Post
|Diff|,Unnamed: 1_level_1
1,1.5
2,3.0
4,4.5
5,6.0
8,7.0


In [67]:
post.reset_index(inplace=True)
post

Unnamed: 0,|Diff|,Post
0,1,1.5
1,2,3.0
2,4,4.5
3,5,6.0
4,8,7.0
5,9,8.5
6,11,10.0
7,12,11.5
8,13,13.0
9,17,15.0


In [68]:
smoke.drop(['Post'], axis=1, inplace=True)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|
4,13,14,1,1
20,17,16,-1,1
31,17,19,2,2
21,12,16,4,4
24,14,18,4,4


In [69]:
smoke = smoke.merge(post, left_on='|Diff|', right_on='|Diff|', how = 'left')
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|,Post
0,13,14,1,1,1.5
1,17,16,-1,1,1.5
2,17,19,2,2,3.0
3,12,16,4,4,4.5
4,14,18,4,4,4.5


In [70]:
smoke['Post (+)'] = smoke.apply(lambda x: x['Post'] if x['Diff'] > 0 else 0, axis = 1)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|,Post,Post (+)
0,13,14,1,1,1.5,1.5
1,17,16,-1,1,1.5,0.0
2,17,19,2,2,3.0,3.0
3,12,16,4,4,4.5,4.5
4,14,18,4,4,4.5,4.5


In [71]:
smoke['Post (-)'] = smoke.apply(lambda x: x['Post'] if x['Diff'] < 0 else 0, axis = 1)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|,Post,Post (+),Post (-)
0,13,14,1,1,1.5,1.5,0.0
1,17,16,-1,1,1.5,0.0,1.5
2,17,19,2,2,3.0,3.0,0.0
3,12,16,4,4,4.5,4.5,0.0
4,14,18,4,4,4.5,4.5,0.0


In [72]:
smoke.drop(['Post'], axis = 1, inplace=True)
smoke.head()

Unnamed: 0,Before,After,Diff,|Diff|,Post (+),Post (-)
0,13,14,1,1,1.5,0.0
1,17,16,-1,1,0.0,1.5
2,17,19,2,2,3.0,0.0
3,12,16,4,4,4.5,0.0
4,14,18,4,4,4.5,0.0
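The sort, `groupby` and `merge` steps above assign each tied `|Diff|` the mean of the positions it occupies. As a hedged alternative, pandas can produce the same average ranks in one call with `Series.rank(method='average')`. The sketch below rebuilds the data from the problem statement; the names `df`, `rank_pos` and `rank_neg` are illustrative only.

```python
import pandas as pd

before = [39, 25, 24, 50, 13, 52, 21, 29, 10, 22, 50, 15, 36, 39, 52, 48, 24, 15,
          40, 41, 17, 12, 21, 49, 14, 55, 46, 22, 28, 23, 37, 17, 31, 49, 49]
after = [16, 8, 12, 0, 14, 16, 13, 12, 19, 17, 17, 2, 15, 10, 20, 13, 0, 4,
         16, 18, 16, 16, 9, 9, 18, 4, 17, 0, 11, 14, 0, 19, 2, 9, 6]

df = pd.DataFrame({'Before': before, 'After': after})
df['Diff'] = df['After'] - df['Before']

# Average ranks of the absolute differences (ties share the mean rank)
df['Rank'] = df['Diff'].abs().rank(method='average')

# Sum the ranks separately for positive and negative differences
rank_pos = df.loc[df['Diff'] > 0, 'Rank'].sum()
rank_neg = df.loc[df['Diff'] < 0, 'Rank'].sum()

print(rank_pos, rank_neg, min(rank_pos, rank_neg))
```

The two sums always add up to $n(n+1)/2$ when there are no zero differences, which is a quick sanity check on the ranking.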


### Get $ T $

## $T$ = smaller of the sums of ranks with the same sign

In [73]:
T = min(smoke['Post (+)'].sum(), smoke['Post (-)'].sum())
T

22.0

### Get $\mu_T$

# $$\mu_T = \frac{n(n+1)}{4}$$


In [74]:
mu_T = (n * (n + 1)) / 4
mu_T

315.0

### Get $\sigma_T$

# $$\sigma_T = \sqrt{\frac{n(n + 1)(2n + 1)}{24}}$$

In [75]:
import numpy as np

sigma_T = np.sqrt((n * (n + 1) * ((2 * n) + 1)) / 24)
sigma_T

61.053255441458646

### Get $Z_{test}$

# $$Z = \frac{T - \mu_T}{\sigma_T}$$

In [76]:
Z = (T - mu_T) / sigma_T
Z

-4.799088891843698
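As a hand check, the same numbers follow from substituting $n = 35$ and $T = 22$ (both taken from the cells above) into the formulas:

$$\mu_T = \frac{35 \times 36}{4} = 315, \qquad \sigma_T = \sqrt{\frac{35 \times 36 \times 71}{24}} = \sqrt{3727.5} \approx 61.05$$

$$Z = \frac{22 - 315}{61.05} \approx -4.80$$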

![Test statistic](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img021.png)

---

### **Step 5** - Acceptance or rejection of the null hypothesis

<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img022.png' width='80%'>

### <font color='red'>Critical value criterion</font>

> ### Reject $H_0$ if $Z \leq -z_{\alpha / 2}$ or if  $Z \geq z_{\alpha / 2}$

In [77]:
Z <= -z_alpha_2

True

In [78]:
Z >= z_alpha_2

False

### <font color = 'green'> Conclusion: We reject the hypothesis that there is no difference between the two measurements, that is, the number of cigarettes smoked per day differs before and after treatment. Since the average number of cigarettes smoked per day fell from 31.86 before the treatment to 11.2 after it, we can conclude that the treatment produced a satisfactory result. </font>

### <font color = 'red'> $ p $ value criterion </font>

> ### Reject $ H_0 $ if $ p \leq \alpha $

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html

In [79]:
from scipy.stats import wilcoxon

In [80]:
T, p_value = wilcoxon(smoke['Before'], smoke['After'])
print(T, p_value)

22.0 1.584310018505865e-06


In [81]:
p_value <= significance

True
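The $p$ value can also be approximated directly from the $Z$ statistic of Step 4, as twice the lower-tail area of the standard normal distribution. This is a sketch under the plain normal approximation used in this notebook; `scipy.stats.wilcoxon` may apply adjustments such as a tie correction, so the two values are expected to be close rather than identical.

```python
from scipy.stats import norm

significance = 0.05
Z = -4.799088891843698  # test statistic computed in Step 4

# Two-sided p value: area in both tails beyond |Z|
p_two_sided = 2 * norm.cdf(-abs(Z))

print(p_two_sided, p_two_sided <= significance)
```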

---

### Problem 

In our dataset we have the income of heads of households obtained from the National Household Sample Survey - PNAD in 2015. A well-known problem in our country concerns income inequality, especially between men and women.

Two random samples, one of **6 men** and the other of **8 women**, were selected from our dataset. To verify this inequality, **test the equality of means** between these two samples at a **5% significance level**.

## <font color = green> Mann-Whitney test </font>
### Comparison of two populations - independent samples
***

### Sample selection

In [82]:
import pandas as pd

data = pd.read_csv('data/data.csv', sep=',')

In [83]:
women = data.query('Sex == 1 and Income > 0').sample(n= 8, random_state=101)['Income']

In [84]:
men = data.query('Sex == 0 and Income > 0').sample(n= 6, random_state=101)['Income']

### Problem data

In [85]:
women_sample_average = women.mean()
women_sample_average

1090.75

In [86]:
men_sample_average = men.mean()
men_sample_average

1341.6666666666667

In [87]:
significance = 0.05
confidence = 1 - significance
n_1 = len(men) #n_1 is always the smaller set
n_2 = len(women)

### **Step 1** - formulation of hypotheses $H_0$ and $H_1$

#### <font color = 'red'> Remember, the null hypothesis always contains the equality claim </font>

### $\mu_w \Rightarrow$ Average income of female household heads
### $\mu_m \Rightarrow$ Average income of male household heads

$
\begin{cases}
H_0: \mu_w = \mu_m \\
H_1: \mu_w < \mu_m
\end{cases}
$

---

### **Step 2** - choose the appropriate sample distribution

**Student's $t$** distribution should be chosen, since nothing is said about the population distribution, the population standard deviation is unknown, and the number of elements investigated is less than 30.

---

### **Step 3** - fixing the test significance ($\alpha$)

### Get $t_{\alpha}$

In [88]:
degrees_of_freedom = n_1 + n_2 - 2
degrees_of_freedom

12

In [89]:
from scipy.stats import t as t_student


t_alpha = t_student.ppf(significance, degrees_of_freedom)
t_alpha.round(2)

-1.78
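Because `t_student.ppf(significance, ...)` returns the lower-tail quantile, the critical value above is already negative. A short sketch of the symmetry between the lower and upper critical values (the names `t_lower` and `t_upper` are illustrative):

```python
from scipy.stats import t as t_student

significance = 0.05
degrees_of_freedom = 12  # n_1 + n_2 - 2 with n_1 = 6 and n_2 = 8

t_lower = t_student.ppf(significance, degrees_of_freedom)      # left-tail critical value
t_upper = t_student.ppf(1 - significance, degrees_of_freedom)  # right-tail critical value

# The t distribution is symmetric around zero, so t_lower == -t_upper
print(round(t_lower, 2), round(t_upper, 2))
```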

![Acceptance region](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img023.png)

---

### **Step 4** - calculation of the test statistic and verification of this value with the test acceptance and rejection areas
## 1. Define the n's:
### $n_1$ = number of elements in the smaller group
### $n_2$ = number of elements in the larger group
---
## 2. Get the sum of ranks
### $R_1$ = sum of ranks in the group $n_1$
### $R_2$ = sum of ranks in the group $n_2$
---
## 3. Get the statistics
# $$ u_1 = n_1 \times n_2 + \frac {n_1 \times (n_1 + 1)} {2} - R_1 $$
# $$ u_2 = n_1 \times n_2 + \frac {n_2 \times (n_2 + 1)} {2} - R_2 $$
---
## 4. Select the smallest U
# $$ u = \min(u_1, u_2) $$
---
## 5. Get the test statistic
# $$ Z = \frac {u - \mu {(u)}} {\sigma {(u)}} $$

Where

# $$ \mu{(u)} = \frac{n_1 \times n_2} {2} $$
# $$ \sigma {(u)} = \sqrt{\frac {n_1 \times n_2 \times (n_1 + n_2 + 1)} {12}} $$
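Steps 1 to 5 above can be condensed into a single helper. This is a minimal sketch, not the notebook's own procedure: the function name `mann_whitney_z` is hypothetical, and the ranks are taken from `scipy.stats.rankdata`, which assigns tied values their average rank.

```python
import numpy as np
from scipy.stats import rankdata


def mann_whitney_z(smaller_group, larger_group):
    """Return (u, Z) following steps 1-5, with n_1 the smaller sample."""
    n_1, n_2 = len(smaller_group), len(larger_group)

    # Rank all observations together; ties receive the average rank
    ranks = rankdata(np.concatenate([smaller_group, larger_group]))
    R_1 = ranks[:n_1].sum()   # rank sum of the smaller group
    R_2 = ranks[n_1:].sum()   # rank sum of the larger group

    u_1 = n_1 * n_2 + n_1 * (n_1 + 1) / 2 - R_1
    u_2 = n_1 * n_2 + n_2 * (n_2 + 1) / 2 - R_2
    u = min(u_1, u_2)

    mu_u = n_1 * n_2 / 2
    sigma_u = np.sqrt(n_1 * n_2 * (n_1 + n_2 + 1) / 12)
    return u, (u - mu_u) / sigma_u
```

A useful check on any result is that $u_1 + u_2 = n_1 \times n_2$.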

### Getting the ranks (stored in the `Post` column)

In [90]:
M = pd.DataFrame(men)
M['Sex'] = 'Men'
M.head()

Unnamed: 0,Income,Sex
67872,1200,Men
30211,2000,Men
64406,850,Men
26519,800,Men
61540,2000,Men


In [91]:
W = pd.DataFrame(women)
W['Sex'] = 'Women'
W.head()

Unnamed: 0,Income,Sex
6251,1100,Women
34764,400,Women
40596,788,Women
11303,4300,Women
22733,250,Women


In [92]:
sex = pd.concat([M, W])  # DataFrame.append was removed in pandas 2.0
sex.reset_index(inplace=True, drop=True)
sex.head()

Unnamed: 0,Income,Sex
0,1200,Men
1,2000,Men
2,850,Men
3,800,Men
4,2000,Men


In [93]:
sex.sort_values(by = 'Income', inplace=True)
sex.head()

Unnamed: 0,Income,Sex
10,250,Women
7,400,Women
11,400,Women
12,700,Women
8,788,Women


In [94]:
sex['Post'] = range(1, len(sex) + 1)
sex.head()

Unnamed: 0,Income,Sex,Post
10,250,Women,1
7,400,Women,2
11,400,Women,3
12,700,Women,4
8,788,Women,5


In [95]:
post = sex[['Income', 'Post']].groupby(['Income']).mean()
post

Unnamed: 0_level_0,Post
Income,Unnamed: 1_level_1
250,1.0
400,2.5
700,4.0
788,5.5
800,7.0
850,8.0
1100,9.0
1200,10.5
2000,12.5
4300,14.0


In [96]:
post.reset_index(inplace=True)
post.head()

Unnamed: 0,Income,Post
0,250,1.0
1,400,2.5
2,700,4.0
3,788,5.5
4,800,7.0


In [97]:
sex.drop(['Post'], axis = 1, inplace=True)
sex.head()

Unnamed: 0,Income,Sex
10,250,Women
7,400,Women
11,400,Women
12,700,Women
8,788,Women


In [98]:
sex = sex.merge(post, left_on='Income', right_on='Income', how = 'left')
sex

Unnamed: 0,Income,Sex,Post
0,250,Women,1.0
1,400,Women,2.5
2,400,Women,2.5
3,700,Women,4.0
4,788,Women,5.5
5,788,Women,5.5
6,800,Men,7.0
7,850,Men,8.0
8,1100,Women,9.0
9,1200,Men,10.5


### Getting $ R $

### $R_1$ = sum of ranks in the group $n_1$
### $R_2$ = sum of ranks in the group $n_2$

In [99]:
Temp = sex[['Sex', 'Post']].groupby('Sex').sum()
Temp

Unnamed: 0_level_0,Post
Sex,Unnamed: 1_level_1
Men,61.0
Women,44.0


In [100]:
R_1 = Temp.loc['Men', 'Post']
R_1

61.0

In [101]:
R_2 = Temp.loc['Women', 'Post']
R_2

44.0

### Get $u$

# $$u_1 = n_1 \times n_2 + \frac{n_1 \times (n_1 + 1)}{2} - R_1$$
# $$u_2 = n_1 \times n_2 + \frac{n_2 \times (n_2 + 1)}{2} - R_2$$

# $$u = \min(u_1, u_2)$$


In [102]:
u_1 = n_1 * n_2 + ((n_1 * (n_1 + 1)) / (2)) - R_1
u_1

8.0

In [103]:
u_2 = n_1 * n_2 + ((n_2 * (n_2 + 1)) / (2)) - R_2
u_2

40.0

In [104]:
u = min(u_1, u_2)
u

8.0

### Get $\mu{(u)}$

# $$\mu{(u)} = \frac{n_1 \times n_2}{2}$$

In [105]:
mu_u = (n_1 * n_2) / 2
mu_u

24.0

### Get $\sigma{(u)}$

# $$\sigma{(u)} = \sqrt{\frac{n_1 \times n_2 \times (n_1 + n_2 + 1)}{12}}$$

In [109]:
sigma_u = np.sqrt((n_1 * n_2 * (n_1 + n_2+ 1)) / 12)
sigma_u

7.745966692414834

### Get $Z$

# $$Z = \frac{u - \mu{(u)}}{\sigma{(u)}}$$

In [111]:
Z = (u - mu_u)/sigma_u 
Z.round(2)

-2.07
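As a hand check, substituting $n_1 = 6$, $n_2 = 8$, $R_1 = 61$ and $R_2 = 44$ (values from the cells above) gives:

$$u_1 = 6 \times 8 + \frac{6 \times 7}{2} - 61 = 8, \qquad u_2 = 6 \times 8 + \frac{8 \times 9}{2} - 44 = 40$$

$$Z = \frac{8 - 24}{\sqrt{\frac{6 \times 8 \times 15}{12}}} = \frac{-16}{\sqrt{60}} \approx -2.07$$

Note that $u_1 + u_2 = 48 = n_1 \times n_2$, as expected.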

![Test statistic](https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img024.png)

---

### **Step 5** - Acceptance or rejection of the null hypothesis

<img src='https://caelum-online-public.s3.amazonaws.com/1229-estatistica-parte3/01/img025.png' width='80%'>

### <font color='red'>Critical value criterion</font>

> ### Reject $H_0$ if $Z \leq t_{\alpha}$ (here $t_{\alpha} = -1.78$ is already the lower-tail critical value)

In [113]:
Z <= t_alpha

True

### <font color = 'green'> Conclusion: We reject the hypothesis that there is no difference between the groups, that is, we conclude that the average income of female heads of household is lower than the average income of male heads of household, which supports the claim of income inequality between the sexes. </font>

### <font color='red'>$p$ value criterion</font>

> ### Reject $H_0$ if value $p\leq\alpha$

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

In [115]:
from scipy.stats import mannwhitneyu

In [116]:
u, p_value = mannwhitneyu(women, men, alternative='less')

In [117]:
p_value <= significance

True
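For readers without access to `data/data.csv`, the test can be rerun on the 14 incomes themselves. The lists below are an assumption: they are reconstructed from the rank tables and sample outputs printed earlier in this notebook, not read from the file.

```python
from scipy.stats import mannwhitneyu

significance = 0.05

# Assumption: incomes reconstructed from the outputs shown in this notebook
men = [800, 850, 1200, 1200, 2000, 2000]
women = [250, 400, 400, 700, 788, 788, 1100, 4300]

# alternative='less' tests whether women's incomes tend to be lower than men's
u_stat, p_value = mannwhitneyu(women, men, alternative='less')

print(u_stat, p_value, p_value <= significance)
```

The returned statistic refers to the first sample passed in (`women`), which is why the argument order matters together with `alternative='less'`.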

---