# STATISTICAL INFERENCE: FIFTH SESSION

## Analysis of Variance

Let's consider the database in the file heights_AOV.xlsx, which is available at eGela. It is the same database as in the previous sessions, but it contains a new variable, called CHILD, with the height of the sons or daughters.

Our goal is to determine, by means of the Analysis of Variance, whether there is a significant difference in the mean height of the children between Alava, Vizcaya and Guipuzcoa.

In [1]:
# read the xlsx file
library("readxl")

Data <-read_excel("heights_AOV.xlsx")

str(Data)

# we see that BIRTHPLACE is stored as a numeric variable. We need this variable to be qualitative (a factor).
# Therefore we create the variable BIRTHPLACE.F

data.birthplace<-Data$BIRTHPLACE
#
Data$BIRTHPLACE.F <- factor(data.birthplace, levels=c(1,2,3), 
                              labels=c("Alava", "Vizcaya", "Guipuzcoa"))


tibble [171 × 8] (S3: tbl_df/tbl/data.frame)
 $ ID        : num [1:171] 1 2 3 4 5 6 7 8 9 10 ...
 $ FATHER    : num [1:171] 174 177 173 174 160 167 171 174 175 174 ...
 $ MOTHER    : num [1:171] 156 159 161 156 165 157 162 158 162 161 ...
 $ SEX       : chr [1:171] "mujer" "hombre" "hombre" "hombre" ...
 $ BIRTHPLACE: num [1:171] 2 2 1 2 1 3 1 3 3 1 ...
 $ HEIGHTS   : num [1:171] 165 170 168 167 162 163 165 168 168 167 ...
 $ WEIGHTS   : num [1:171] 65 67 51 69 54 61 65 76 67 69 ...
 $ CHILD     : num [1:171] 165 170 153 167 147 163 150 168 168 152 ...


In [2]:
# we observe that BIRTHPLACE.F is a variable of type factor
str(Data)

tibble [171 × 9] (S3: tbl_df/tbl/data.frame)
 $ ID          : num [1:171] 1 2 3 4 5 6 7 8 9 10 ...
 $ FATHER      : num [1:171] 174 177 173 174 160 167 171 174 175 174 ...
 $ MOTHER      : num [1:171] 156 159 161 156 165 157 162 158 162 161 ...
 $ SEX         : chr [1:171] "mujer" "hombre" "hombre" "hombre" ...
 $ BIRTHPLACE  : num [1:171] 2 2 1 2 1 3 1 3 3 1 ...
 $ HEIGHTS     : num [1:171] 165 170 168 167 162 163 165 168 168 167 ...
 $ WEIGHTS     : num [1:171] 65 67 51 69 54 61 65 76 67 69 ...
 $ CHILD       : num [1:171] 165 170 153 167 147 163 150 168 168 152 ...
 $ BIRTHPLACE.F: Factor w/ 3 levels "Alava","Vizcaya",..: 2 2 1 2 1 3 1 3 3 1 ...
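Before testing anything, it can help to look at the group sizes and group means. The following is a minimal sketch that uses only the first ten rows visible in the `str(Data)` output above; the data frame name `toy` is an assumption introduced for illustration, and the same two calls can be applied to the full `Data` object.

```r
# First ten rows of BIRTHPLACE and CHILD, copied from the str(Data) output above
toy <- data.frame(
  BIRTHPLACE = c(2, 2, 1, 2, 1, 3, 1, 3, 3, 1),
  CHILD      = c(165, 170, 153, 167, 147, 163, 150, 168, 168, 152)
)

# Same recoding as in the notebook: numeric codes become a labelled factor
toy$BIRTHPLACE.F <- factor(toy$BIRTHPLACE, levels = c(1, 2, 3),
                           labels = c("Alava", "Vizcaya", "Guipuzcoa"))

# Number of observations per birthplace
counts <- table(toy$BIRTHPLACE.F)
counts

# Mean height of the children per birthplace
means <- tapply(toy$CHILD, toy$BIRTHPLACE.F, mean)
means
```

Ten rows are far too few to draw conclusions, but the same summary on the full data gives a first impression of which groups might differ.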


### Test the normality of the heights of the kids of Alava, Vizcaya and Guipuzcoa

Consider a significance level of 5%

In [5]:
# We start with those of Alava
data.alava<-Data$BIRTHPLACE.F=="Alava"
#
shapiro.test(Data$CHILD[data.alava])
# the p-value is larger than 0.05, therefore we cannot reject normality


	Shapiro-Wilk normality test

data:  Data$CHILD[data.alava]
W = 0.95762, p-value = 0.1485


In [6]:
# We focus on Vizcaya
data.vizcaya<-Data$BIRTHPLACE.F=="Vizcaya"
#
shapiro.test(Data$CHILD[data.vizcaya])
# the p-value is larger than 0.05, therefore we cannot reject normality


	Shapiro-Wilk normality test

data:  Data$CHILD[data.vizcaya]
W = 0.98345, p-value = 0.373


In [9]:
# We finally focus on Guipuzcoa
data.guipuzkoa<-Data$BIRTHPLACE.F=="Guipuzcoa"
#
shapiro.test(Data$CHILD[data.guipuzkoa])
# the p-value is larger than 0.05, therefore we cannot reject normality


	Shapiro-Wilk normality test

data:  Data$CHILD[data.guipuzkoa]
W = 0.98188, p-value = 0.6335


#### Explanation

In all three cases, the p-value is larger than the significance level, so we cannot reject normality in any of the groups.

This means that normality is not an obstacle to performing the Analysis of Variance.
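The three Shapiro-Wilk tests above can also be run in a single call with `tapply`. The sketch below is self-contained and therefore uses simulated heights; the data frame `sim`, the group sizes and the means and standard deviations are assumptions made up for illustration, not values from heights_AOV.xlsx.

```r
# Simulated data standing in for heights_AOV.xlsx (all numbers are hypothetical)
set.seed(123)
group.sizes <- c(40, 70, 61)
sim <- data.frame(
  CHILD = c(rnorm(group.sizes[1], mean = 150, sd = 3),
            rnorm(group.sizes[2], mean = 165, sd = 3),
            rnorm(group.sizes[3], mean = 165, sd = 3)),
  BIRTHPLACE.F = factor(rep(c("Alava", "Vizcaya", "Guipuzcoa"), times = group.sizes),
                        levels = c("Alava", "Vizcaya", "Guipuzcoa"))
)

# One Shapiro-Wilk p-value per birthplace
p.values <- tapply(sim$CHILD, sim$BIRTHPLACE.F,
                   function(x) shapiro.test(x)$p.value)
p.values

# TRUE where normality cannot be rejected at the 5% level
p.values > 0.05
```

With the real data, replacing `sim` by `Data` should give the same three p-values as the separate cells above.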

### Test the homogeneity of the variances of the heights of the kids of Alava, Vizcaya and Guipuzcoa

Consider a significance level of 2%

In [13]:
# We will use the command leveneTest from the car package, which must be loaded (and installed the first time)
#install.packages("car")
library("car")

data.birthplaceF<-Data$BIRTHPLACE.F
#
leveneTest(Data$CHILD, data.birthplaceF, center=mean) 

       Df  F value    Pr(>F)
group   2 1.005729 0.3679694
      168


#### Explanation:

The value of the test statistic of Levene's test is 1.005729, with 2 and 168 degrees of freedom, and the p-value is 0.3679694.

#### Conclusion:

As the p-value is larger than 0.02, we cannot reject the homogeneity of variances between the heights of the kids of Alava, Vizcaya and Guipuzcoa.

As a result, inequality of variances is not an obstacle to performing the Analysis of Variance.
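To see what Levene's test with `center=mean` measures, note that it amounts to a one-way Analysis of Variance on the absolute deviations of each observation from its own group mean. The sketch below reproduces that computation with base R only; the simulated data frame `sim` and its group sizes are hypothetical and only stand in for the real file.

```r
# Simulated data standing in for heights_AOV.xlsx (all numbers are hypothetical)
set.seed(1)
group.sizes <- c(40, 70, 61)
sim <- data.frame(
  CHILD = c(rnorm(group.sizes[1], mean = 150, sd = 3),
            rnorm(group.sizes[2], mean = 165, sd = 3),
            rnorm(group.sizes[3], mean = 165, sd = 3)),
  BIRTHPLACE.F = factor(rep(c("Alava", "Vizcaya", "Guipuzcoa"), times = group.sizes),
                        levels = c("Alava", "Vizcaya", "Guipuzcoa"))
)

# Group mean attached to every observation (ave uses mean by default)
group.means <- ave(sim$CHILD, sim$BIRTHPLACE.F)

# Absolute deviation of each height from the mean of its own group
abs.dev <- abs(sim$CHILD - group.means)

# One-way ANOVA on the absolute deviations: this is Levene's test with center = mean
levene.by.hand <- anova(lm(abs.dev ~ sim$BIRTHPLACE.F))
levene.by.hand
```

A large F value here would indicate that the typical spread around the group mean differs between birthplaces.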

### Analysis of Variance

According to the previous tests, the assumptions required to perform the Analysis of Variance seem to hold.

We consider a significance level of 5%.
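For reference, writing $\mu_A$, $\mu_V$ and $\mu_G$ for the mean heights of the kids of Alava, Vizcaya and Guipuzcoa, the hypotheses of the one-way Analysis of Variance can be stated as:

$$H_0: \mu_{A} = \mu_{V} = \mu_{G}$$
$$H_1: \text{at least two of the means differ}$$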

In [12]:
Anova <-  aov(CHILD ~ BIRTHPLACE.F,data=Data)
summary(Anova)

              Df Sum Sq Mean Sq F value Pr(>F)    
BIRTHPLACE.F   2   6705    3353     299 <2e-16 ***
Residuals    168   1884      11                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#### Explanation:

The value of the test statistic is 299 and the p-value is smaller than 2e-16. Since the p-value is smaller than the significance level (in this case 0.05), we reject the equality of means and conclude that the mean height differs depending on the birthplace.
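The F value in the table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. As a quick check, the sketch below recomputes it from the rounded figures printed in the ANOVA table, so the result only agrees approximately with the reported value; the variable names are introduced here for illustration.

```r
# Values read from the summary(Anova) table above (rounded in the printout)
ss.between <- 6705   # Sum Sq for BIRTHPLACE.F
df.between <- 2
ss.within  <- 1884   # Sum Sq for Residuals
df.within  <- 168

# Mean squares: sum of squares divided by degrees of freedom
ms.between <- ss.between / df.between
ms.within  <- ss.within / df.within

# F statistic and its upper-tail p-value under the F(2, 168) distribution
f.value <- ms.between / ms.within
p.value <- pf(f.value, df.between, df.within, lower.tail = FALSE)

f.value
p.value
```

Note that dividing the rounded mean squares shown in the table (3353 and 11) would give a slightly different number, because the residual mean square is closer to 11.2 than to 11.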

### Multiple Comparisons

We know that the mean height of the kids is not the same in all the birthplaces under consideration. Now we go beyond this result and aim to know which means are different. To this purpose, we use the analysis of multiple comparisons. There are different methods, but the most used ones are Scheffe and Tukey. We will use the Tukey method.

Let us consider a significance level of 5%.

In [15]:
# load the following package (and install it if you are using it for the first time)
library(multcomp)
comp_Tukey <- glht(Anova, linfct = mcp(BIRTHPLACE.F = "Tukey"))
summary(comp_Tukey)


	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = CHILD ~ BIRTHPLACE.F, data = Data)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)    
Vizcaya - Alava == 0     14.94528    0.65134  22.945   <1e-06 ***
Guipuzcoa - Alava == 0   14.88821    0.71537  20.812   <1e-06 ***
Guipuzcoa - Vizcaya == 0 -0.05707    0.60083  -0.095    0.995    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)


#### Explanation:

There are three different tests:

* The first hypothesis test contrasts the difference between the mean heights of the kids of Vizcaya ($\mu_V$) and of Alava ($\mu_A$), that is, 

$$H_0: \mu_{V}  =  \mu_{A}$$
$$H_1: \mu_{V}\neq \mu_{A}$$

For this test, we see that the value of the statistic is 22.945 and the adjusted p-value is smaller than 1e-06. Therefore, $\mu_{V}\neq \mu_{A}$ with a significance level of 5%.

* The second test contrasts the difference between the mean heights of the kids of Guipuzcoa ($\mu_G$) and of Alava, that is, 

$$H_0: \mu_{G}  =  \mu_{A}$$
$$H_1: \mu_{G}\neq \mu_{A}$$

For this test, we see that the value of the statistic is 20.812 and the adjusted p-value is smaller than 1e-06. Therefore, $\mu_{G}\neq \mu_{A}$ with a significance level of 5%.

* The third test contrasts the difference between the mean heights of the kids of Guipuzcoa and of Vizcaya, that is, 

$$H_0: \mu_{G}  =  \mu_{V}$$
$$H_1: \mu_{G}\neq \mu_{V}$$

For this test, the value of the statistic is -0.095 and the adjusted p-value is 0.995. Therefore, we cannot conclude that $\mu_{G}\neq \mu_{V}$ with a significance level of 5%. 

#### Conclusion:

The mean height of the kids of Alava is different from the mean height of the kids of Guipuzcoa and from the mean height of the kids of Vizcaya. However, we cannot conclude that there is a significant difference between the mean heights of the kids of Vizcaya and Guipuzcoa.
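As a cross-check that does not need the multcomp package, base R provides `TukeyHSD`, which gives a comparable set of pairwise comparisons with adjusted p-values. The sketch below is only an illustration on simulated data: the data frame `sim`, its group sizes and its means are hypothetical, and the results for the real file may differ slightly from those of `glht`.

```r
# Simulated data standing in for heights_AOV.xlsx (all numbers are hypothetical)
set.seed(42)
group.sizes <- c(40, 70, 61)
sim <- data.frame(
  CHILD = c(rnorm(group.sizes[1], mean = 150, sd = 3),
            rnorm(group.sizes[2], mean = 165, sd = 3),
            rnorm(group.sizes[3], mean = 165, sd = 3)),
  BIRTHPLACE.F = factor(rep(c("Alava", "Vizcaya", "Guipuzcoa"), times = group.sizes),
                        levels = c("Alava", "Vizcaya", "Guipuzcoa"))
)

# Fit the one-way ANOVA and run Tukey's pairwise comparisons with base R
fit <- aov(CHILD ~ BIRTHPLACE.F, data = sim)
tukey.base <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair: difference of means, confidence limits and adjusted p-value
tukey.base$BIRTHPLACE.F
```

Each row is labelled with the pair of levels being compared, in the same order as the Tukey contrasts above.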


## Confidence Intervals

We now want to know more about the differences between the three populations. To find out which of the means is larger, we use confidence intervals.

In [16]:
confint(comp_Tukey)


	 Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = CHILD ~ BIRTHPLACE.F, data = Data)

Quantile = 2.3617
95% family-wise confidence level
 

Linear Hypotheses:
                         Estimate lwr      upr     
Vizcaya - Alava == 0     14.94528 13.40701 16.48355
Guipuzcoa - Alava == 0   14.88821 13.19871 16.57770
Guipuzcoa - Vizcaya == 0 -0.05707 -1.47604  1.36190


#### Explanation:

We have computed the following three confidence intervals:
$$I_{\mu_V-\mu_A}^{0.95}=(13.40701,16.48355)$$
$$I_{\mu_G-\mu_A}^{0.95}=(13.19871,16.57770)$$
$$I_{\mu_G-\mu_V}^{0.95}=(-1.47604,1.36190)$$

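The first two intervals lie entirely above zero, so from the first CI we conclude that $\mu_A<\mu_V$ and from the second one that $\mu_A<\mu_G$. The third interval contains zero, so it does not tell us which of $\mu_G$ and $\mu_V$ is larger.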

Therefore, we conclude that the mean height of the kids of Alava is smaller than the mean height of the kids of Vizcaya and Guipuzcoa.
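Each simultaneous interval has the form estimate ± quantile × standard error. As a minimal check, the sketch below rebuilds the three intervals from the estimates and standard errors of the Tukey output and the quantile 2.3617 reported by `confint`; the variable names are introduced here for illustration, and small differences in the last digit come from rounding in the printed values.

```r
# Estimates and standard errors copied from summary(comp_Tukey)
estimate  <- c("Vizcaya - Alava"     = 14.94528,
               "Guipuzcoa - Alava"   = 14.88821,
               "Guipuzcoa - Vizcaya" = -0.05707)
std.error <- c(0.65134, 0.71537, 0.60083)

# Quantile reported by confint(comp_Tukey) for the 95% family-wise level
q.tukey <- 2.3617

# Lower and upper limits: estimate minus/plus quantile times standard error
lwr <- estimate - q.tukey * std.error
upr <- estimate + q.tukey * std.error

round(cbind(estimate, lwr, upr), 5)

# An interval that excludes zero indicates a significant difference
(lwr > 0) | (upr < 0)
```

The quantile 2.3617 is larger than the usual 1.96 because the intervals must hold jointly for the three comparisons.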