# Statistical tests with R

##### Guilherme Gimenez Jr

## Agenda

1. **About Hypothesis Testing**
2. **Importing our data**
    1. **Setting up hypothesis**
    2. **Jumping right into it**

### 1. About Hypothesis Testing

Building a hypothesis is one of the first steps in the analysis of experiments; it is all about **answering a question**. Of course, before building a hypothesis it is always a good idea to conduct an Exploratory Data Analysis (EDA) to get more insight into what information is available and what lies within the data.

#### What is a hypothesis?

Well, a hypothesis can be thought of as **an educated guess** about something in the data. It **must be testable**, either by an experiment or by observation.

When we are proposing a hypothesis we should **write a statement**:

> [...] **if** and **then** [...] are necessary in formalized hypothesis. [...] In a formalized hypothesis, a tentative relationship is stated. [...] Formalized hypotheses contain two variables. One is "independent" and the other is "dependent". The independent variable is the one you, the "scientist" control, and the dependent variable is the one that you observe and/or measure the results. [California State University, Bakersfield](https://www.csub.edu/~ddodenhoff/Bio100/Bio100sp04/formattingahypothesis.htm)

A very nice example of a well-written hypothesis:

- **If** skin cancer is **related** to ultraviolet light, **then** people with high exposure to UV light will have a higher frequency of skin cancer.

Notice that we have included the **dependent variable**, _skin cancer_, the **independent variable**, _UV light_, and the expected result of an experiment, _higher frequency of skin cancer_.

#### What is Hypothesis Testing?

In simple words, we are **testing how likely our results are to have happened by chance** (as opposed to being meaningful). To do this we need a **null hypothesis $H_{0}$** and an alternative hypothesis $H_{1}$; we will either reject $H_{0}$ or fail to reject it.

This is where the well-known _p-value_ comes in: to determine the statistical significance of the results we look at the _p-value_. If it is less than or equal to a _chosen threshold_ ($\alpha$), there is evidence against the null hypothesis. Different fields use different thresholds when performing hypothesis testing; in our case we will use $\alpha=0.05$.

> Under the null hypothesis, a parameter of interest is set to a particular value, typically zero, which represents the "no effect" relative to the effect the research is testing for. [Too Big to Fail: Large Samples and the p-value Problem](https://www.researchgate.net/publication/270504262_Too_Big_to_Fail_Large_Samples_and_the_p-Value_Problem)
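As a minimal sketch of the decision rule described above, the snippet below compares a p-value against $\alpha=0.05$. The coin-flip data (62 heads in 100 flips) is hypothetical and only serves as an illustration; `binom.test` is the exact binomial test from base R.

```r
# Threshold chosen for this tutorial
alpha <- 0.05

# Hypothetical data (an assumption for illustration): 62 heads in 100 flips.
# H0: the coin is fair (probability of heads = 0.5)
result <- binom.test(x = 62, n = 100, p = 0.5)

# The p-value of the test
result$p.value

# Decision rule: reject H0 when the p-value is at most alpha
reject_h0 <- result$p.value <= alpha
reject_h0
```

Here `reject_h0` is `TRUE` when there is evidence against the null hypothesis at the chosen threshold.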

The following table shows the possible outcomes:

|||Actual Validity of $H_{0}$|Actual Validity of $H_{0}$|
|-|-|-|-|
|||**$H_{0}$ is true**|**$H_{0}$ is false**|
|**Decision Made**|**Fail to reject $H_{0}$**|True Negative|False Negative (Type II error)|
|**Decision Made**|**Reject $H_{0}$**|False Positive (Type I error)|True Positive|
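To make the Type I error in the table more concrete, here is a small simulation sketch. It assumes two groups drawn from the same normal distribution (so $H_{0}$ is true by construction) and uses `t.test` purely as a convenient example test; the sample size and number of simulations are arbitrary choices.

```r
set.seed(42)   # make the simulation reproducible
alpha <- 0.05
n_sim <- 2000  # number of simulated experiments (arbitrary choice)

# Both groups come from the same distribution, so H0 is true in every run
p_values <- replicate(n_sim, t.test(rnorm(30), rnorm(30))$p.value)

# Share of experiments where we wrongly reject H0 (false positives)
type1_rate <- mean(p_values <= alpha)
type1_rate
```

The share of false positives should land close to $\alpha$, which is exactly what the threshold is meant to control.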

The _p-value_ is _very slippery terrain_ and must be handled with caution. There are some known problems with hypothesis testing, such as:

- A very large number of observations can produce significant _p-values_ even when the effect is too small to have any practical significance
- Selective reporting and _p-hacking_ are issues that arise from the heavy use of p-values
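The first problem can be sketched with `prop.test`. The counts below are made up for illustration: in both scenarios the two groups differ by the same single percentage point (51% versus 50%), and only the sample size changes.

```r
# Hypothetical counts (assumptions for illustration only):
# same difference in proportions (0.51 vs 0.50), different sample sizes

# 1,000 observations per group
small_sample <- prop.test(x = c(510, 500), n = c(1000, 1000))

# 100,000 observations per group
large_sample <- prop.test(x = c(51000, 50000), n = c(100000, 100000))

small_sample$p.value  # not significant at alpha = 0.05
large_sample$p.value  # significant at alpha = 0.05, same tiny effect
```

The effect size is identical in both cases, so a small p-value on its own says nothing about whether the difference matters in practice.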

#### Types of statistical tests 

The appropriate statistical test for the data depends on **the number and type of variables** that will be included in the analysis.

There are [several tables](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/) that can help when choosing the right statistical test to perform. Here we are going to see statistical tests when dealing with **independent groups**.

|Nature of Dependent Variables|Test|
|-|-|

### 2. Importing our data

To perform our statistical tests we will use some datasets that are already available in R (with our vanilla installation).

- **esoph** - Smoking, Alcohol And (O)Esophageal Cancer: Data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France.

In [None]:
# Let's see which datasets are already available from our vanilla R installation
data()

#### Categorical Testing - esoph

This [dataset](https://rdrr.io/r/datasets/esoph.html) contains records for 88 age/alcohol/tobacco combinations. It has the following variables:

|Variable|Description|Values|
|-|-|-|
|agegp|Age Group|25--34 years|
|||35--44|
|||45--54|
|||55--64|
|||65--74|
|||75+|
|alcgp|Alcohol consumption|0--39 gm/day|
|||40--79|
|||80--119|
|||120+|
|tobgp|Tobacco Consumption|0--9 gm/day|
|||10--19|
|||20--29|
|||30+|
|ncontrols|Number of controls||
|ncases|Number of cases||

In [None]:
head(esoph)
# If this doesn't work on RStudio, try running this command:
# data(esoph)

Let's explore our data a little bit. We need to understand our variable types and their distributions...

In [None]:
# Getting statistical information about each variable
summary(esoph)

OK! The levels of age, alcohol consumption and tobacco consumption appear with very similar frequencies in our dataset. Also, instead of continuous values we have categorical ones.
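A quick way to confirm the variable types is sketched below, using only base R functions on the built-in `esoph` data.

```r
# Structure of the dataset: variable types and factor levels
str(esoph)

# The three grouping variables are ordered factors (categorical with an order)
is.ordered(esoph$agegp)
levels(esoph$agegp)

# Number of rows (age/alcohol/tobacco combinations) in each age group
table(esoph$agegp)
```

The fact that the groups are ordered factors matters later, when we test for a trend across age groups.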

With this information we can already shortlist some tests for our data: **tests for categorical data, such as Fisher's exact test and the chi-squared test**.

But we don't have any hypothesis yet.

##### Building our hypothesis

> **If** age is related to (o)esophageal cancer, **then** as age increases, so does the frequency of cases.

###### Choosing our test

This is a nice hypothesis: we are basically testing for evidence that age may be related to (o)esophageal cancer. Because we have more than two groups in our hypothesis test (6 age groups), we will use a test called 'equality of proportions'.

In [None]:
# Let's build a contingency table for our hypothesis:
# total number of cases and controls in each age group
xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph)

In [None]:
# Let's visualize the proportion of cases, ncases / (ncases + ncontrols), in each age group
boxplot(esoph$ncases / (esoph$ncases + esoph$ncontrols) ~ esoph$agegp)

Just by looking at this data, can we see evidence for our hypothesis?

R provides a very nice interface for performing tests related to **proportions**: the `prop.test` function.

In [None]:
# This opens the help page, where you can find more information about the test
?prop.test

In [None]:
# To perform our test we need two vectors:
#   1. "Successes": the total number of cases in each age group
#   2. "Trials": the total number of observations (cases + controls) in each age group

# tapply applies a function to the values of a vector, split by a grouping factor;
# think of it as a group-by in this context

case <- tapply(esoph$ncases, esoph$agegp, sum)
total <- tapply(esoph$ncontrols + esoph$ncases, esoph$agegp, sum)
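Before running the test, it can help to look at the two vectors and at the proportion of cases they imply. The sketch below rebuilds `case` and `total` so it can run on its own; `prop_cases` is a name introduced here for illustration.

```r
# Total number of cases and of observations (cases + controls) per age group
case <- tapply(esoph$ncases, esoph$agegp, sum)
total <- tapply(esoph$ncontrols + esoph$ncases, esoph$agegp, sum)

# Proportion of cases in each age group
prop_cases <- case / total
round(prop_cases, 3)
```

These are the proportions that the test will compare across the six age groups.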

In [None]:
# Finally, let's perform our test: we pass to the function
# the vector with the number of cases and the vector with the number of trials
prop.test(x=case, n=total)

Let's analyze our results:

**Our p-value is less than our alpha ($2.224 \times 10^{-13}<0.05$), so we reject the null hypothesis**. In this case:

- $H_{0}$ The proportion of cases is the same in each age group.
- $H_{1}$ The proportion of cases is **not** the same in each age group.

OK, we have just conducted our first statistical test but... well, we still don't have evidence to support our hypothesis that there is a linear trend between age group and (o)esophageal cancer.

In [None]:
prop.trend.test(case, total)

Again, **our p-value is less than our alpha ($4.136 \times 10^{-14}<0.05$), so we reject the null hypothesis**. In this case:

- $H_{0}$ There is **no linear trend** in the proportion of cases across age groups.
- $H_{1}$ There is a **linear trend** in the proportion of cases across age groups.

Keep in mind that **this test can only be used with an ordinal variable**. In our case, each successive category corresponds to an older age group.
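The ordering enters the test through the `score` argument of `prop.trend.test`, which defaults to `1, 2, ..., k`. The sketch below makes that explicit; the age midpoints are an assumption made here for illustration (in particular, 79.5 for the open-ended 75+ group).

```r
case <- tapply(esoph$ncases, esoph$agegp, sum)
total <- tapply(esoph$ncontrols + esoph$ncases, esoph$agegp, sum)

# Default scores are 1, 2, ..., 6: one step per age group
default_fit <- prop.trend.test(case, total)

# Hypothetical scores: rough midpoints of each age group (an assumption)
midpoints <- c(29.5, 39.5, 49.5, 59.5, 69.5, 79.5)
midpoint_fit <- prop.trend.test(case, total, score = midpoints)

default_fit$statistic
midpoint_fit$statistic
```

Because these midpoints are equally spaced, they are just a rescaling of the default scores and the statistic does not change. Unequally spaced scores would change the result, which is why the ordering and spacing of the categories should be meaningful.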

#### Hands-On

What about these hypotheses?

> **If** alcgp is related to (o)esophageal cancer, **then** as alcgp increases, so does the frequency of cases.

> **If** tobgp is related to (o)esophageal cancer, **then** as tobgp increases, so does the frequency of cases.

What did you learn about these hypotheses?

Ok, Ok! We've seen how to test hypotheses about comparing proportions. But what if we want to check whether different categorical variables are independent? We would then have not a 1 x n matrix, but rather an m x n matrix.

##### Building our hypothesis

> **If** alcgp _and_ tobgp are dependent with respect to cancer status, **then** the interaction between alcgp and tobgp affects the frequency of cases.

##### Choosing our statistical test

Previously we saw a special case of the chi-squared test called 'equality of proportions' (sometimes referred to as a z-test), also known as a one-way chi-squared test. Here we can use either Fisher's Exact Test or the Chi-Squared Test. Because Fisher's test is exact, it is typically used on small samples.
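The link between 'equality of proportions' and the chi-squared test can be checked directly. The sketch below rebuilds the vectors from the age-group example and compares `prop.test` with `chisq.test` applied to a two-column table of cases and controls; warnings are suppressed only to keep the output short.

```r
case <- tapply(esoph$ncases, esoph$agegp, sum)
total <- tapply(esoph$ncontrols + esoph$ncases, esoph$agegp, sum)

# Equality of proportions across the six age groups
pt <- suppressWarnings(prop.test(x = case, n = total))

# Same data as a 6 x 2 table: cases and controls per age group
cases_controls <- cbind(cases = as.vector(case),
                        controls = as.vector(total - case))
ct <- suppressWarnings(chisq.test(cases_controls))

# Both calls report the same X-squared statistic
pt$statistic
ct$statistic
```

With more than two groups neither function applies a continuity correction, so the two statistics agree.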

In [None]:
# First we need to aggregate our data into an m x n matrix,
# where m is the number of categories of variable A and n is
# the number of categories of variable B

# This is also referred to as a contingency table or an r x c table

table_1 <- tapply(esoph$ncases, list(esoph$tobgp, esoph$alcgp), sum)

In [None]:
table_1
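As an alternative sketch, the same table can be built with the formula interface of `xtabs`, which some readers may find easier to read than `tapply` with a list of factors. `table_2` is a name introduced here for illustration.

```r
# Same aggregation as above, repeated so this cell runs on its own
table_1 <- tapply(esoph$ncases, list(esoph$tobgp, esoph$alcgp), sum)

# Formula interface: sum ncases for each tobgp (rows) x alcgp (columns) cell
table_2 <- xtabs(ncases ~ tobgp + alcgp, data = esoph)
table_2

# Both approaches give the same counts
all(as.vector(table_1) == as.vector(table_2))
```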

In [None]:
# Some cells have really small counts (expected counts below 5), which makes
# R raise a warning that the chi-squared approximation may be incorrect,
# so the X-squared p-value should not be fully trusted

chisq.test(table_1)

In [None]:
# Fisher's test can be a little tricky, especially for really large datasets
# and when you have values that are too large or too small (< 5); in such cases
# R may complain about the workspace (it controls the memory available to the network algorithm)
# and you must increase its value (default is 2e5), which will also increase execution time.

fisher.test(table_1, workspace=2e6)
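When the chi-squared warning appears, two follow-ups may help: inspecting the expected counts that trigger it, and asking `chisq.test` for a Monte Carlo p-value through its `simulate.p.value` argument. This is a sketch of one possible workaround, not a replacement for the tests above; the number of replicates `B` is an arbitrary choice.

```r
table_1 <- tapply(esoph$ncases, list(esoph$tobgp, esoph$alcgp), sum)

# Expected counts under independence; values below 5 trigger the warning
expected <- suppressWarnings(chisq.test(table_1))$expected
round(expected, 2)

# Monte Carlo p-value: does not rely on the chi-squared approximation
set.seed(42)  # simulated p-values vary from run to run without a seed
sim_test <- chisq.test(table_1, simulate.p.value = TRUE, B = 10000)
sim_test$p.value
```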

OK, let's analyze our results:

**Our p-value is greater than our alpha ($0.5966>0.05$), so we fail to reject the null hypothesis**. Note that this does not mean we accept the alternative hypothesis, nor that the null hypothesis has been proven. In this case:

- $H_{0}$ alcgp and tobgp **are independent**.
- $H_{1}$ alcgp and tobgp **are not independent**.

Therefore, we found no evidence that alcgp and tobgp are dependent (that is, no evidence of a relationship or interaction) with respect to the number of cases.

#### Hands-On

What about these hypotheses?

> **If** alcgp and age are dependent with respect to cancer status, **then** the interaction between alcgp and age affects the frequency of cases.

> **If** tobgp and age are dependent with respect to cancer status, **then** the interaction between tobgp and age affects the frequency of cases.

What did you learn about these hypotheses?

### Summing Up

We've seen a little bit of hypothesis testing with categorical variables:

|Function|Tests|Input|Comments|
|-|-|-|-|
|`prop.test`|equality of one or more proportions|vectors of successes and trials|accurate for large datasets|
|`chisq.test`|$\ge2$ proportions, independence in an m x n table|matrix or contingency table|expected frequencies should be at least 5 for accuracy|
|`fisher.test`|$\ge2$ proportions, independence in an m x n table|matrix or contingency table|suited to small samples; may need a larger workspace and more time on large datasets|