# STATISTICAL INFERENCE: FOURTH SESSION

## Hypothesis Testing: Two populations

In the previous session, we studied hypothesis tests for one population. In this session, we will use the same commands and see how they must be modified to perform hypothesis tests with two populations.


In [2]:
# read txt file

Data <-read.table(file="heights.txt", header=TRUE, dec=",", sep="\t")

# For the test of the variances in this session, we will need to install and load the following package:
# install.packages("EnvStats")
# library(EnvStats)



## Ratio of variances with two independent populations

### Hypothesis test of the ratio of variances of the height of women and men

We will perform the following test:

$$H_0:\ \sigma_1^2 \geq \sigma_2^2$$
$$H_1:\ \sigma_1^2 < \sigma_2^2$$

We consider a significance level of 10%

In [3]:

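# logical vector marking the rows that correspond to women
select_women<-Data$SEX=="female"
# heights of the women
women <-Data$HEIGHT[select_women]
# logical vector marking the rows that correspond to men
select_men<-Data$SEX=="male"
# heights of the men
men <-Data$HEIGHT[select_men]
# left-tailed F test: H1 states that var(women)/var(men) < 1
var.test(women,men,ratio=1,alternative="less")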


	F test to compare two variances

data:  women and men
F = 0.88431, num df = 81, denom df = 88, p-value = 0.288
alternative hypothesis: true ratio of variances is less than 1
95 percent confidence interval:
 0.000000 1.269746
sample estimates:
ratio of variances 
         0.8843092 


#### Explanation:

The value of the test statistic is 0.88431, with 81 and 88 degrees of freedom, and the p-value is 0.288. Since the p-value is larger than 0.1, we cannot reject the null hypothesis that the variance of the women's heights is greater than or equal to that of the men's heights. In particular, the data are compatible with equal variances.
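To see where these numbers come from, the following sketch recomputes the statistic and the left-tailed p-value by hand and compares them with `var.test`. The vectors `x` and `y` are simulated and purely hypothetical (they are not the `heights.txt` sample); the statistic is the ratio of sample variances $F = s_1^2 / s_2^2$, with $n_1 - 1$ and $n_2 - 1$ degrees of freedom.

```r
# Simulated samples (hypothetical; not the data in heights.txt)
set.seed(1)
x <- rnorm(30, mean = 165, sd = 6)
y <- rnorm(40, mean = 175, sd = 7)

# F statistic: ratio of the two sample variances
f_stat <- var(x) / var(y)
df1 <- length(x) - 1   # numerator degrees of freedom
df2 <- length(y) - 1   # denominator degrees of freedom

# Left-tailed p-value: probability of an F value at most f_stat
p_manual <- pf(f_stat, df1, df2)

# The same test with var.test
res <- var.test(x, y, ratio = 1, alternative = "less")
print(c(manual = p_manual, var.test = res$p.value))
```

Both values should coincide, which shows that `alternative="less"` simply reads the p-value from the left tail of the F distribution.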

#### Observations:

* To perform the following test

$$H_0:\ \sigma_1^2 \geq 2 \sigma_2^2$$
$$H_1:\ \sigma_1^2 < 2 \sigma_2^2$$

we must put ratio=2 instead of ratio=1.

* For the two-sided test, we must put alternative="two.sided"
* For the right-tailed test, we must put alternative="greater"
* Note that, as in the previous session, the argument conf.level does not appear. The reason is the same.
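The effect of the arguments `ratio` and `alternative` listed above can be checked with a small sketch. The data are simulated and hypothetical. It relies on two facts about `var.test`: the reported statistic is the ratio of sample variances divided by `ratio`, and the two-sided p-value is twice the smaller of the two tail probabilities.

```r
# Simulated samples (hypothetical)
set.seed(2)
x <- rnorm(25, sd = 3)
y <- rnorm(35, sd = 2)

# Same data, two different hypothesised ratios
res_ratio1 <- var.test(x, y, ratio = 1, alternative = "less")
res_ratio2 <- var.test(x, y, ratio = 2, alternative = "less")

# With ratio = 2 the statistic is half of the one obtained with ratio = 1
print(c(ratio1 = unname(res_ratio1$statistic),
        ratio2 = unname(res_ratio2$statistic)))

# Two-sided test: p-value is twice the smaller tail probability
res_two <- var.test(x, y, ratio = 1, alternative = "two.sided")
p_left  <- res_ratio1$p.value
p_two_manual <- 2 * min(p_left, 1 - p_left)
print(c(manual = p_two_manual, var.test = res_two$p.value))
```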

In [5]:
# The following test is the same as the previous one. Do you see why?
var.test(men,women,ratio=1, alternative="greater")


	F test to compare two variances

data:  men and women
F = 1.1308, num df = 88, denom df = 81, p-value = 0.288
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
 0.7875588       Inf
sample estimates:
ratio of variances 
          1.130826 
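One way to see why the two calls are equivalent: swapping the samples inverts the statistic ($1/0.88431 \approx 1.1308$) and exchanges the degrees of freedom, so the left tail of one F distribution becomes the right tail of the other. The sketch below checks this on simulated, hypothetical data.

```r
# Simulated samples (hypothetical)
set.seed(3)
a <- rnorm(40, sd = 5)
b <- rnorm(50, sd = 6)

# Left-tailed test on (a, b) and right-tailed test on (b, a)
res_less    <- var.test(a, b, ratio = 1, alternative = "less")
res_greater <- var.test(b, a, ratio = 1, alternative = "greater")

# The statistics are reciprocals of each other ...
print(c(unname(res_less$statistic), 1 / unname(res_greater$statistic)))

# ... and the p-values coincide
print(c(res_less$p.value, res_greater$p.value))
```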


## Comparison of the mean of two independent populations

### Hypothesis test of the difference of the height of women and men

We consider the following problem: 

$$H_0:\ \mu_{WOMEN} \leq \mu_{MEN}$$
$$H_1:\ \mu_{WOMEN} > \mu_{MEN}$$

We consider a significance level of 5%

In [6]:
t.test(women,men,mu=0,alternative="greater",var.equal=TRUE)


	Two Sample t-test

data:  women and men
t = -4.0665, df = 169, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -2.790305       Inf
sample estimates:
mean of x mean of y 
 165.7805  167.7640 


#### Explanation:

The point estimate of the mean height of women: 165.7805

The point estimate of the mean height of men: 167.7640

The value of the test statistic is -4.0665, with 169 degrees of freedom, and the p-value is reported as 1. Since the p-value is larger than 0.05, we cannot reject the null hypothesis that the mean height of women is less than or equal to the mean height of men.
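As a sketch of how the statistic, the degrees of freedom and the p-value are obtained when `var.equal=TRUE`, the code below computes the pooled-variance t statistic by hand and compares it with `t.test`. The data are simulated and hypothetical. The pooled variance is $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ and the statistic is $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}$ with $n_1+n_2-2$ degrees of freedom.

```r
# Simulated samples (hypothetical; not the data in heights.txt)
set.seed(4)
x <- rnorm(20, mean = 165, sd = 6)
y <- rnorm(25, mean = 168, sd = 6)
nx <- length(x)
ny <- length(y)

# Pooled variance and t statistic
sp2    <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
t_stat <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

# Right-tailed p-value for H1: mu_x - mu_y > 0
p_manual <- pt(t_stat, df = nx + ny - 2, lower.tail = FALSE)

# The same test with t.test
res <- t.test(x, y, mu = 0, alternative = "greater", var.equal = TRUE)
print(c(manual = p_manual, t.test = res$p.value))
```

When the sample mean of `x` is below that of `y`, the statistic is negative and the right-tailed p-value is close to 1, which is the situation in the output above.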


#### Observations:

* If we want to perform the following test:

$$H_0:\ \mu_{WOMEN} - \mu_{MEN} \leq 40$$
$$H_1:\ \mu_{WOMEN} - \mu_{MEN} > 40 $$

we must put mu=40 instead of mu=0.

* For the left-tailed test, we must put alternative="less"
* For the two-sided test, we must put alternative="two.sided"
* Note that we have put var.equal=TRUE, because the previous test did not reject the equality of variances. We put var.equal=FALSE when we cannot assume that $\sigma_1$ and $\sigma_2$ are equal.
* Note that, as in the previous session, we are not using the argument conf.level. The reason is the same.
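The options above can be compared side by side. The sketch below uses simulated, hypothetical data with clearly different spreads. With `var.equal=FALSE`, `t.test` runs Welch's test, whose degrees of freedom follow the Welch-Satterthwaite approximation and are usually not an integer; the argument `mu` only shifts the numerator of the statistic.

```r
# Simulated samples with unequal variances (hypothetical)
set.seed(5)
x <- rnorm(30, mean = 170, sd = 4)
y <- rnorm(45, mean = 168, sd = 9)
nx <- length(x)
ny <- length(y)

# Pooled test versus Welch test on the same data
res_pooled <- t.test(x, y, mu = 0, alternative = "greater", var.equal = TRUE)
res_welch  <- t.test(x, y, mu = 0, alternative = "greater", var.equal = FALSE)

# Welch-Satterthwaite degrees of freedom computed by hand
vx <- var(x) / nx
vy <- var(y) / ny
df_welch <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
print(c(pooled = unname(res_pooled$parameter), welch = df_welch))

# Testing H0: mu_x - mu_y <= 1 subtracts 1 from the difference of means
res_mu    <- t.test(x, y, mu = 1, alternative = "greater", var.equal = FALSE)
t_shifted <- (mean(x) - mean(y) - 1) / sqrt(vx + vy)
print(c(manual = t_shifted, t.test = unname(res_mu$statistic)))
```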

In [7]:
# The following test is the same as the previous one. Do you see why?
t.test(men,women,mu=0,alternative="less",var.equal=TRUE)


	Two Sample t-test

data:  men and women
t = 4.0665, df = 169, p-value = 1
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.790305
sample estimates:
mean of x mean of y 
 167.7640  165.7805 


## Hypothesis Test of Paired Data

### Difference of the mean height of fathers and mothers

Let D=FATHER-MOTHER (that is, the height of the father minus the height of the mother). We will test whether the mean of D is positive, i.e., whether the mean height of the fathers minus the mean height of the mothers is positive:

$$H_0:\ \mu_D \leq 0 $$
$$H_1:\ \mu_D > 0$$

We consider a significance level of 1%

In [8]:
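father<-Data$FATHER
# heights of the fathers (one value per row of Data)
mother<-Data$MOTHER
# heights of the mothers, in the same row order as father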
t.test(father,mother, alternative="greater")


	Welch Two Sample t-test

data:  father and mother
t = 23.914, df = 316.34, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 10.52428      Inf
sample estimates:
mean of x mean of y 
 172.1871  160.8830 


#### Explanation:

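The value of the test statistic is 23.914, with 316.34 degrees of freedom, and the p-value is smaller than 2.2e-16. Since the p-value is smaller than 0.01, we reject the null hypothesis and conclude that the mean height of the fathers is larger than that of the mothers at a significance level of 1%. Note that the call above does not set paired=TRUE, so R has run a Welch two-sample t-test (as the output header and the non-integer degrees of freedom show) rather than a paired test on D. For a paired analysis, the argument paired=TRUE must be added, which changes the statistic and the degrees of freedom.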


#### Observations: 
* To perform the following test:

$$H_0:\ \mu_D  \leq 10$$
$$H_1:\ \mu_D > 10 $$ 

we must put mu=10 (the default is mu=0).

* For the left-tailed test, we must put alternative="less"
* For the two-sided test, we must put alternative="two.sided"
* Note that, as in the previous session, we are not using the argument conf.level. The reason is the same.
* Note that, unlike in the previous section, we are not using the argument var.equal: a paired test works with the single sample of differences D, so only one variance is involved.
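A minimal sketch of a paired test follows. The vectors `father_sim` and `mother_sim` are simulated and hypothetical (they are not the FATHER and MOTHER columns of `heights.txt`). With `paired=TRUE`, `t.test` is equivalent to a one-sample t-test on the differences D, with $n - 1$ degrees of freedom, where $n$ is the number of pairs.

```r
# Simulated paired heights (hypothetical): each father is matched to one mother
set.seed(6)
mother_sim <- rnorm(50, mean = 161, sd = 6)
father_sim <- mother_sim + rnorm(50, mean = 11, sd = 5)

# Paired test of H0: mu_D <= 0 against H1: mu_D > 0
res_paired <- t.test(father_sim, mother_sim, paired = TRUE,
                     alternative = "greater")

# Equivalent one-sample test on the differences D
d <- father_sim - mother_sim
res_diff <- t.test(d, mu = 0, alternative = "greater")

print(c(paired = unname(res_paired$statistic),
        one_sample = unname(res_diff$statistic)))
print(res_paired$parameter)   # degrees of freedom: number of pairs minus 1
```

If `paired=TRUE` is omitted, the same two vectors are treated as independent samples and a Welch two-sample test is run instead.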

## Comparison of proportions (success probability) of two independent populations

### Hypothesis testing of the difference of the proportion of being from Alava between men and women

We will test whether the proportion of women from Alava ($p_1$) is larger than the proportion of men from Alava ($p_2$).

$$H_0:\ p_1 \leq p_2 $$
$$H_1:\ p_1 > p_2$$

We consider a significance level of 2%

In [10]:
# we count how many women and men are from Alava
Data.alava <- subset(Data,BIRTHPLACE==1)
women.alava <- subset(Data.alava,SEX=="female")
men.alava <- subset(Data.alava,SEX=="male")
n_women_alava <- length(women.alava$HEIGHT)
# 
n_men_alava <- length(men.alava$HEIGHT)
# we count the number of women and men
Data.women <- subset(Data,SEX=="female")
n_women <- length(Data.women$HEIGHT)
Data.men <- subset(Data,SEX=="male")
n_men <- length(Data.men$HEIGHT)
# we perform the test
prop.test(c(n_women_alava,n_men_alava),c(n_women,n_men),alternative="greater")



	2-sample test for equality of proportions with continuity correction

data:  c(n_women_alava, n_men_alava) out of c(n_women, n_men)
X-squared = 0.43037, df = 1, p-value = 0.2559
alternative hypothesis: greater
95 percent confidence interval:
 -0.06365059  1.00000000
sample estimates:
   prop 1    prop 2 
0.2560976 0.2022472 


#### Explanation:

The value of the test statistic is 0.43037 and the p-value is 0.2559. Since the p-value is larger than 0.02, we cannot reject the null hypothesis that the proportion of women from Alava is less than or equal to the proportion of men from Alava. In particular, we cannot reject the equality of the two proportions.
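The link between the X-squared statistic and the one-sided p-value can be sketched as follows. The counts used here (21 of 82 women and 18 of 89 men) are not stated in the text: they are inferred from the sample proportions and the degrees of freedom printed in the outputs above, so treat them as an assumption. For two groups, the one-sided p-value reported by `prop.test` is the normal tail probability of $z = \pm\sqrt{X^2}$, where the sign is that of the difference of the sample proportions.

```r
# Counts inferred from the printed outputs (assumption, see the text above)
x <- c(21, 18)   # women and men from Alava
n <- c(82, 89)   # total number of women and men

res <- prop.test(x, n, alternative = "greater")

# Signed square root of the X-squared statistic
delta <- x[1] / n[1] - x[2] / n[2]
z <- sign(delta) * sqrt(unname(res$statistic))

# Right-tailed normal probability, to compare with the reported p-value
p_manual <- pnorm(z, lower.tail = FALSE)
print(c(manual = p_manual, prop.test = res$p.value))
```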


#### Observations:
* For the left-tailed test, we must put alternative="less"
* For the two-sided test, we must put alternative="two.sided"
* Note that, as in the previous session, we have not used the argument conf.level. The reason is the same.