## Analysis of Variance (ANOVA)

In this notebook, we explore Analysis of Variance (ANOVA), a statistical procedure for comparing the means of two or more populations. Essentially, we wish to understand whether the populations differ significantly from each other by comparing their means.

Previously, we prepared a series of distribution plots for a single numerical feature, where each plot shows the values of that feature for one level of a categorical feature. Here, we use ANOVA to evaluate statistically what we see in those plots.

In [None]:
source('src/load_data-02.r')
source('src/multiplot.r')

In [None]:
dim(housing_df)

In [None]:
head(housing_df)

In [None]:
count_empty_total()

In [None]:
attach(housing_df)

### One-way ANOVA

One-way ANOVA is perhaps the simplest ANOVA technique and handles a special case of this problem: testing for equal group means when the groups are defined by a single categorical feature. The idea is essentially this:

1. Identify a numerical feature for analysis (often the target feature)
2. Split that numerical feature into groups using a categorical feature
3. Run a one-way ANOVA on these groups
    - If the test does not reject the hypothesis that the means are equal for all groups, then this categorical feature may be less relevant for predicting the numerical feature in question
    - If the test indicates that the means are not equal for all groups, then this categorical feature may be important for predicting the numerical feature in question
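
The steps above can be sketched on a small made-up dataset. Everything in this cell is hypothetical and for illustration only: the `toy` data frame and its columns `price`, `quality`, and `season` are not part of the housing data. The sketch assumes the conventional 0.05 threshold.

```r
# Hypothetical toy data (not the housing data): one numerical feature
# and two categorical features.
toy <- data.frame(
  price   = c(10, 12, 11, 13, 20, 22, 21, 23, 30, 29, 31, 32),
  quality = factor(rep(c("Fa", "TA", "Gd"), each = 4)),
  season  = factor(rep(c("s1", "s2", "s3", "s4"), times = 3))
)

# Steps 1-2: split the numerical feature into groups and inspect group means
sapply(split(toy$price, toy$quality), mean)
sapply(split(toy$price, toy$season), mean)

# Step 3: run a one-way test for each categorical feature
p_quality <- oneway.test(price ~ quality, data = toy)$p.value
p_season  <- oneway.test(price ~ season,  data = toy)$p.value

# Compare each p-value with the conventional 0.05 threshold
c(quality = p_quality < 0.05, season = p_season < 0.05)
```

In this toy data the `quality` groups have well-separated means, while the `season` groups have nearly identical means and large within-group spread, so only `quality` should fall below the threshold.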

In a one-way ANOVA, the null hypothesis is that the mean responses are equal for all groups. The alternative hypothesis is that the mean responses are not equal for all groups, that is, at least one group mean differs from the others. Recall that, for any statistical test, the common convention is that if the $p$-value of the test is less than 0.05, then the null hypothesis can be rejected.
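
For reference, the hypotheses and the classical (equal-variance) one-way $F$ statistic can be written as below. The notation is introduced here and does not come from the notebook: $k$ groups, $n_i$ observations in group $i$, $N = \sum_i n_i$ observations in total, group means $\bar{y}_i$, and overall mean $\bar{y}$.

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j$$

$$F = \frac{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 / (k-1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 / (N-k)}$$

The numerator measures variation between group means and the denominator measures variation within groups. Under $H_0$, with normally distributed responses and equal group variances, $F$ follows an $F$ distribution with $k-1$ and $N-k$ degrees of freedom, and a large value of $F$ leads to a small $p$-value.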

**A $p$-value greater than 0.05 does not mean that the null hypothesis should be accepted; it means only that the data do not provide enough evidence to reject it.**

In [None]:
multiplot(hist_with_kde_numerical_by_category(SalePrice,MoSold),
          hist_with_kde_numerical_by_category(SalePrice,ExterQual), 
          cols = 2)

#### Month Sold

Consider the null hypothesis:

$$H_0: \text{the mean response is equal for all groups}$$

In [None]:
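# Helper returning the mean and standard deviation of a numeric vector
meansd = function(x) c(mean=mean(x), sd=sd(x))
# Mean and standard deviation of SalePrice within each month of sale
by(SalePrice, MoSold, FUN=meansd)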

In [None]:
oneway.test(SalePrice ~ MoSold)

##### This test shows that we CANNOT reject the null hypothesis

#### Exterior Quality

Consider the null hypothesis:

$$H_0: \text{the mean response is equal for all groups}$$

In [None]:
by(SalePrice, ExterQual, FUN=meansd)

In [None]:
oneway.test(SalePrice ~ ExterQual)

##### This test shows that we CAN reject the null hypothesis
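
A note on the test used above: by default, `oneway.test` does not assume equal variances across groups (it applies Welch's correction), whereas classical one-way ANOVA does. The sketch below, on a hypothetical toy dataset (`toy`, `y`, and `g` are made-up names, not part of the housing data), shows how the two variants might be compared and how the equal-variance version lines up with `aov`.

```r
# Hypothetical toy data with clearly unequal group variances
toy <- data.frame(
  y = c(10, 11, 12, 11, 15, 25, 20, 30, 18, 19, 20, 21),
  g = factor(rep(c("a", "b", "c"), each = 4))
)

# Default: Welch's test, which does not assume equal variances
welch <- oneway.test(y ~ g, data = toy)

# Classical one-way ANOVA: assume equal variances across groups
classic <- oneway.test(y ~ g, data = toy, var.equal = TRUE)

# aov() fits the same classical model; its F statistic should match `classic`
aov_F <- summary(aov(y ~ g, data = toy))[[1]][["F value"]][1]

# Side-by-side F statistics from the three calls
c(welch   = unname(welch$statistic),
  classic = unname(classic$statistic),
  aov     = aov_F)
```

When group variances look very different in the distribution plots, the Welch default is generally the safer choice; when they look similar, the two variants tend to give close results.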