# Hypothesis Testing

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#Sample-Mean-and-Population-Mean---I" data-toc-modified-id="Sample-Mean-and-Population-Mean---I-2">Sample Mean and Population Mean - I</a></span></li><li><span><a href="#Sample-Mean-and-Population-Mean---II" data-toc-modified-id="Sample-Mean-and-Population-Mean---II-3">Sample Mean and Population Mean - II</a></span></li><li><span><a href="#Hypothesis-Formulation" data-toc-modified-id="Hypothesis-Formulation-4">Hypothesis Formulation</a></span></li><li><span><a href="#Type-I-and-Type-II-Errors" data-toc-modified-id="Type-I-and-Type-II-Errors-5">Type I and Type II Errors</a></span></li><li><span><a href="#P-Values" data-toc-modified-id="P-Values-6">P-Values</a></span></li><li><span><a href="#Significance-Level" data-toc-modified-id="Significance-Level-7">Significance Level</a></span></li><li><span><a href="#One-Sample-T-Test" data-toc-modified-id="One-Sample-T-Test-8">One Sample T-Test</a></span></li><li><span><a href="#Two-Sample-T-Test" data-toc-modified-id="Two-Sample-T-Test-9">Two Sample T-Test</a></span></li><li><span><a href="#ANOVA" data-toc-modified-id="ANOVA-10">ANOVA</a></span></li><li><span><a href="#Assumptions-of-Numerical-Hypothesis-Tests" data-toc-modified-id="Assumptions-of-Numerical-Hypothesis-Tests-11">Assumptions of Numerical Hypothesis Tests</a></span></li></ul></div>

## Introduction

Statistical hypothesis testing is a process that allows you to evaluate if a change or difference seen in a dataset is “real”, or if it’s just a result of random fluctuation in the data.

It provides a framework for evaluating how confident one can be in making conclusions based on data.

Some instances where this might come up include:

* a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?<br><br>
* a product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

## Sample Mean and Population Mean - I

A `sample` is a subset of an entire `population` (for example, all the oak trees in a park). The mean of a sample is a `sample mean` and it is an estimate of the `population mean`.

For a population, the mean is a constant value no matter how many times it’s recalculated. For a sample, the mean depends on exactly which members of the population are selected. From a sample mean, we can then estimate the mean of the population as a whole. There are three main reasons we might use sampling:

* data on the entire population is not available
* data on the entire population is available, but it is so large that it is unfeasible to analyze
* meaningful answers to questions can be found faster with sampling
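As a rough illustration of how a sample mean changes from sample to sample, the sketch below builds a made-up population of oak tree heights and draws two random samples from it. The numbers, the name `oak_heights`, and the sample size of 30 are assumptions chosen for demonstration only.

```r
# Hypothetical population: heights (in meters) of 1,000 oak trees in a park.
set.seed(42)  # fixed seed so the sketch is reproducible
oak_heights <- rnorm(1000, mean = 20, sd = 3)

# The population mean is a single fixed number.
population_mean <- mean(oak_heights)

# Two different random samples of 30 trees each.
sample_a <- sample(oak_heights, size = 30)
sample_b <- sample(oak_heights, size = 30)

# Each sample mean estimates the population mean, but the estimates differ.
c(population = population_mean,
  sample_a   = mean(sample_a),
  sample_b   = mean(sample_b))
```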

## Sample Mean and Population Mean - II

`Sampling error` is the difference between a sample statistic, such as the sample mean, and the true population value. It arises because a sample is never a perfect representation of the population it comes from.

If the sample selection is poor, the sample mean can end up far from the population mean.

One way to mitigate this risk is to take a larger sample. The mean of a larger random sample tends to be closer to the population mean, which reduces the sampling error.
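The effect of sample size can be sketched with a small simulation. Everything here (the simulated `population`, the helper `sample_means()`, and the sample sizes 10 and 100) is a hypothetical setup, not data from this lesson.

```r
set.seed(1)
population <- rnorm(10000, mean = 50, sd = 10)  # made-up population

# Draw many samples of size n and record the mean of each one.
sample_means <- function(n, repeats = 1000) {
  replicate(repeats, mean(sample(population, size = n)))
}

means_small <- sample_means(10)
means_large <- sample_means(100)

# Spread of the sample means around the population mean:
sd(means_small)  # wider spread with small samples
sd(means_large)  # narrower spread with large samples
```

The means of the larger samples cluster more tightly around the population mean, which is what a smaller sampling error looks like in practice.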

## Hypothesis Formulation

* Begin the statistical hypothesis testing process by defining a hypothesis, or an assumption about your population that you want to test
* A hypothesis can be written in words, but can also be explained in terms of the sample and population means
* When constructing hypotheses for a hypothesis test, we formulate a null hypothesis
* A null hypothesis states that there is no difference between the populations you are comparing, and it implies that any difference seen in the sample data is due to sampling error
* For example, to compare the time users spend on different versions of a homepage, the null hypothesis might be:
    * "The average time spent on homepage A is the same as the average time spent on homepage B."
* It could also be restated in terms of population mean:
    * "The population mean of time spent on homepage A is the same as the population mean of time spent on homepage B."
* After collecting some sample data on how users interact with each homepage, you can run a hypothesis test on that data to decide whether the null hypothesis can be rejected (i.e. there is a difference in time spent on homepage A and B) or cannot be rejected. A hypothesis test never proves that the null hypothesis is true; it only tells you whether the data provide enough evidence against it.
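In symbols, writing $\mu_A$ and $\mu_B$ for the population mean time spent on homepage A and homepage B (this notation is an assumption introduced here for illustration), the null hypothesis above can be sketched as follows, together with the two-sided alternative it is usually weighed against:

$$
H_0 : \mu_A = \mu_B \qquad \text{versus} \qquad H_1 : \mu_A \neq \mu_B
$$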


## Type I and Type II Errors

In statistical hypothesis testing, there are two types of error.

* A Type I error occurs when a hypothesis test finds a difference or relationship that does not actually exist. This error is sometimes called a “false positive” and occurs when the null hypothesis is rejected even though it is true.<br><br>
* A Type II error occurs when a hypothesis test fails to find a difference or relationship that actually exists. This error is referred to as a “false negative” and occurs when the null hypothesis is not rejected even though it is false.
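One way to get a feel for Type I errors is to simulate many experiments in which the null hypothesis is true by construction and count how often a test rejects it anyway. The sketch below relies on `t.test()` and a `0.05` threshold, both introduced later in this document; the group sizes, means, and standard deviations are made-up values.

```r
set.seed(123)
alpha <- 0.05          # rejection threshold (assumed)
n_experiments <- 2000

# Both groups come from the SAME population, so the null hypothesis is true
# and every rejection is a Type I error (false positive).
p_values <- replicate(n_experiments, {
  group_a <- rnorm(30, mean = 100, sd = 15)
  group_b <- rnorm(30, mean = 100, sd = 15)
  t.test(group_a, group_b)$p.value
})

type_1_rate <- mean(p_values < alpha)
type_1_rate  # expected to land near alpha
```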

## P-Values

What result does a hypothesis test actually return, and how can you interpret it?

A hypothesis test returns a few numeric measures, one of which is the `p-value`. 

P-values help you decide whether your data are consistent with the null hypothesis. In this context, a p-value is the probability that, assuming the null hypothesis is true, you would see a difference in the sample means at least as large as the one you observed.

**Example:**

You gather `10` green and `10` red apples to compare their weights. The green apples average `150` grams in weight, and the red apples average `160` grams in weight.

A hypothesis test to see if there is a significant difference in the weight of green and red apples returns a p-value of `0.2`.

How can this p-value be interpreted?

**Ans:** If green and red apples truly had the same average weight, there would be a 20% chance of seeing a difference of at least `10` grams between the sample averages purely because of random sampling.

## Significance Level

While the p-value indicates how compatible the data are with the null hypothesis, it does not by itself tell you whether you should reject the null hypothesis.

To make this decision, you need to choose a threshold such that any p-value below it will result in rejecting the null hypothesis. This threshold is known as the `significance level`.

A higher significance level is more likely to give a false positive, as it makes it “easier” to state that there is a difference in the populations of your data when such a difference might not actually exist.

It is an industry standard to set a significance level of `0.05` or less, meaning that you accept a chance of `5%` or less of rejecting a null hypothesis that is actually true.
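The decision rule can be written out in a few lines of R. This is a minimal sketch that plugs in the p-value of `0.2` from the apple example in the P-Values section; the variable names are illustrative.

```r
alpha <- 0.05    # significance level, chosen before looking at the data
p_value <- 0.2   # p-value from the apple example

if (p_value < alpha) {
  decision <- "reject the null hypothesis"
} else {
  decision <- "fail to reject the null hypothesis"
}
decision
```

Because `0.2` is not below `0.05`, the apple data do not provide enough evidence to reject the null hypothesis.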

## One Sample T-Test

A product manager hypothesizes the average age of visitors to BuyPie.com is 30. In the past hour, the website had 100 visitors and the average age was 31. Are the visitors older than expected? Or is this just the result of chance (sampling error) and a small sample size?

We can test this using a One Sample T-Test. A One Sample T-Test compares a sample mean to a hypothetical population mean. It answers the question “If the sample came from a distribution with the desired mean, how likely is a sample mean at least this far from it?”

The first step is formulating a null hypothesis, which again is the hypothesis that there is no difference between the populations you are comparing. The second population in a One Sample T-Test is the hypothetical population you choose. The null hypothesis that this test examines can be phrased as follows: `"The set of samples belongs to a population with the target mean".`

One result of a One Sample T-Test will be a p-value, which tells you whether or not you can reject this null hypothesis. If the p-value you receive is less than your significance level, normally `0.05`, you can reject the null hypothesis and state that there is a significant difference.

R has a function called `t.test()` in the `stats` package which can perform a One Sample T-Test.

```r
results <- t.test(sample_distribution, mu = expected_mean)
```

* `sample_distribution` is the sample of values that were collected
* `mu` is an argument indicating the desired mean of the hypothetical population
* `expected_mean` is the value of the desired mean

`t.test()` will return, among other information, a p-value: the probability of seeing a sample mean at least this far from the specified mean if the sample really came from a distribution with that mean.

**Example:**

1. A small dataset called `ages` represents the ages of visitors to BuyPie.com in the past hour. Calculate the mean of `ages`.

```r
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
ages
```

```r
ages_mean <- mean(ages)
ages_mean
```

2. Use the `t.test()` function with `ages` to see what p-value the experiment returns for this distribution, where we expect the mean to be `30`.

```r
results <- t.test(ages, mu = 30)
results
```

```
	One Sample t-test

data:  ages
t = 0.59738, df = 13, p-value = 0.5605
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 27.38359 34.61641
sample estimates:
mean of x 
       31 
```
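To turn this output into a decision, the p-value can be pulled out of the result object and compared with a significance level. The sketch below repeats the `ages` data so it runs on its own; the choice of `0.05` for `alpha` is an assumption following the convention described earlier.

```r
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
results <- t.test(ages, mu = 30)

alpha <- 0.05                # assumed significance level
p_value <- results$p.value   # about 0.56 for this data
p_value < alpha              # FALSE: not enough evidence to reject
```

Since the p-value is well above `0.05`, this sample gives no reason to reject the null hypothesis that the mean visitor age is `30`.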


## Two Sample T-Test

A Two Sample T-Test compares two sets of data, which are both approximately normally distributed.

You are given two distributions representing the time spent per visitor to BuyPie.com last week, `week_1`, and the time spent per visitor to BuyPie.com this week, `week_2`.

Did the average time spent per visitor change (i.e. was there a statistically significant bump in user time on the site)? Or is this just part of natural fluctuations?

One way of testing whether this difference is significant is by using a Two Sample T-Test. 

`The null hypothesis, in this case, is that the two distributions have the same mean`.

**1. Find the means of these two distributions**.

```r
week_1 <- c(23.90507, 26.67632, 27.27434, 24.25757, 32.40423, 39.56919, 23.07010,
29.82068, 27.59434, 28.05640, 27.06757, 30.41193, 25.71359, 24.94295,
28.23124, 24.95338, 18.51232, 27.46235, 28.38017, 13.91206, 29.02616,
26.90747, 22.86777, 24.89383, 25.96948, 26.86870, 20.72676, 27.35988,
20.68409, 21.19846, 16.25801, 23.92518, 24.47923, 29.47051, 27.28425,
26.93339, 28.61027, 18.88377, 33.65469, 25.69470, 20.98291, 22.69700,
28.60279, 21.36000, 30.77685, 20.83416, 23.79367, 19.75567, 29.54421,
20.14331)
week_1_mean <- mean(week_1)
week_1_mean
```

```r
week_2 <- c(18.63432, 31.28788, 34.96798, 21.81678, 28.21620, 39.39314, 35.52223,
27.54222, 33.64395, 25.31674, 28.81392, 30.73580, 26.37242, 26.09456,
26.34073, 19.42196, 32.58798, 24.84002, 28.93348, 20.43668, 22.72496,
32.31728, 35.38431, 29.66710, 24.53513, 30.91406, 19.56118, 24.90817,
30.13164, 31.47466, 27.77684, 16.51307, 35.07702, 31.74818, 36.36053,
27.70501, 29.49870, 27.65575, 37.18504, 25.16055, 29.26554, 38.22163,
28.92102, 24.82154, 38.30155, 34.76021, 22.26869, 28.82594, 32.00975,
36.46438)
week_2_mean <- mean(week_2)
week_2_mean
```

**2. Find the standard deviations of these two distributions**. 

```r
week_1_sd <- sd(week_1)
week_1_sd
```

```r
week_2_sd <- sd(week_2)
week_2_sd
```

**3. Run a Two Sample T-Test using the t.test() function**.

```r
results <- t.test(week_1, week_2)
results
```

```
	Welch Two Sample t-test

data:  week_1 and week_2
t = -3.5109, df = 94.554, p-value = 0.0006863
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.594299 -1.552719
sample estimates:
mean of x mean of y 
 25.44806  29.02157 
```
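The header of this output reads "Welch Two Sample t-test" because `t.test()` defaults to `var.equal = FALSE`, which does not assume the two groups share a variance. The sketch below compares that default with the pooled-variance version on made-up data; `group_x`, `group_y`, and their parameters are hypothetical and are not the `week_1` and `week_2` values.

```r
set.seed(7)
group_x <- rnorm(20, mean = 25, sd = 5)   # hypothetical sample 1
group_y <- rnorm(20, mean = 29, sd = 5)   # hypothetical sample 2

welch  <- t.test(group_x, group_y)                    # default: var.equal = FALSE
pooled <- t.test(group_x, group_y, var.equal = TRUE)  # assumes equal variances

welch$p.value
pooled$p.value
pooled$parameter  # degrees of freedom: 20 + 20 - 2 = 38
```

When the two sample standard deviations are similar, the two versions usually give very close p-values.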


## ANOVA

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of `0.05` is to use ANOVA. 
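To see why running several separate t-tests does not preserve that error probability, consider three groups compared pairwise. The arithmetic below is a rough sketch that treats the tests as independent, which is a simplifying assumption.

```r
alpha <- 0.05
n_groups <- 3
n_pairwise_tests <- choose(n_groups, 2)   # 3 pairwise comparisons

# Chance of at least one false positive across all pairwise tests,
# treating the tests as independent (a simplifying assumption).
familywise_error <- 1 - (1 - alpha)^n_pairwise_tests
familywise_error  # about 0.143
```

Under that assumption, the overall chance of at least one Type I error rises from `0.05` to roughly `0.14`, whereas a single ANOVA keeps it at `0.05`.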

**ANOVA (Analysis of Variance)** tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

The `stats` package function `aov()` is used to perform ANOVA on multiple datasets. `aov()` takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame `df_scores` as follows:

![img.png](attachment:img.png)

Then run an ANOVA test as follows:

```r
results <- aov(score ~ group, data = df_scores)
```

**Note:** `score ~ group` indicates the relationship you want to analyze (i.e. how each `group`, or major, relates to `score` on the video game)

To retrieve the p-value from the results of calling `aov()`:

```r
summary(results)
```
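Putting these steps together, a self-contained sketch might look like the following. The scores are simulated and the group means are invented, so this `df_scores` only mimics the shape of the data frame described above (one `group` column and one `score` column).

```r
set.seed(2024)

# Hypothetical data: 10 simulated video game scores per major.
df_scores <- data.frame(
  group = factor(rep(c("math major", "writing major", "psychology major"),
                     each = 10)),
  score = c(rnorm(10, mean = 80, sd = 8),
            rnorm(10, mean = 75, sd = 8),
            rnorm(10, mean = 78, sd = 8))
)

results <- aov(score ~ group, data = df_scores)

# summary() returns a list whose first element is the ANOVA table;
# the p-value for `group` sits in the "Pr(>F)" column, first row.
anova_table <- summary(results)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```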

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. 

If you reject this null hypothesis (if the p-value is less than `0.05`), you can say you are reasonably confident that at least one of the populations has a different mean score from the others.

After using only ANOVA, however, you can’t make any conclusions on which two populations have a significant difference.

## Assumptions of Numerical Hypothesis Tests

**1. The samples should each be normally distributed.**

For example, imagine you have three datasets, each representing a day of traffic data in three different cities. Each dataset is independent, as traffic in one city should not impact traffic in another city. However, it is unlikely that each dataset is normally distributed. In fact, each dataset probably has two distinct peaks, one at the morning rush hour and one during the evening rush hour. In this scenario, using a numerical hypothesis test would be inappropriate.

**2. The population standard deviations of the groups should be equal.**

For ANOVA and Two Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means.

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is “close enough” to 1. “Close enough” may differ in different contexts, but generally staying within `10%` should suffice.
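The ratio check described above can be sketched as a small helper. The function name `sd_ratio_close()`, its `tolerance` argument, and the example vectors are all hypothetical, and the `10%` tolerance is the rule of thumb from the text rather than a formal test.

```r
# Hypothetical helper: TRUE when the ratio of the two sample standard
# deviations is within `tolerance` of 1.
sd_ratio_close <- function(x, y, tolerance = 0.10) {
  ratio <- sd(x) / sd(y)
  abs(ratio - 1) <= tolerance
}

a      <- c(10, 12, 14, 16, 18)
b      <- c(21, 23, 25, 27, 29)   # same spread as `a`, shifted up
c_wide <- c(0, 10, 20, 30, 40)    # much wider spread than `a`

sd_ratio_close(a, b)       # TRUE: the ratio is 1
sd_ratio_close(a, c_wide)  # FALSE: the ratio is 0.2
```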

**3. The samples must be independent.**

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where it would seem the samples are not independent:

* the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen
* a group of patients’ blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.