### Types of Hypothesis Test
When we compare datasets, we often need a principled way to decide whether they are significantly different from each other.

In this lesson, you will learn how we can use hypothesis testing to answer this question. There are several types of hypothesis tests for the various scenarios you may encounter. Luckily, SciPy has built-in functions that perform all of these tests for us, normally in just one line of code.

For numerical data, we will cover:

- One Sample T-Tests
- Two Sample T-Tests
- ANOVA
- Tukey Tests

For categorical data, we will cover:

- Binomial Tests
- Chi Square

After this lesson, you will have a wide range of tools in your arsenal for deciding whether the patterns you see in data are statistically significant.

### 1 Sample T-Testing
Let's imagine the fictional business BuyPie, which sends ingredients for pies to your household, so that you can make them from scratch. Suppose that a product manager wants the average age of visitors to BuyPie.com to be 30. In the past hour, the website had 100 visitors and the average age was 31. Are the visitors too old? Or is this just the result of chance and a small sample size?

We can test this using a univariate (one-sample) T-test. A univariate T-test compares a sample mean to a hypothetical population mean. It answers the question "If the population really had the desired mean, how likely would we be to see a sample mean at least this far from it?"

When we conduct a hypothesis test, we want to first create a null hypothesis, which is a prediction that there is no significant difference. The null hypothesis that this test examines can be phrased as such: "The set of samples belongs to a population with the target mean".

The result of the 1 Sample T-Test is a p-value, which tells us whether or not we can reject this null hypothesis. By convention, if we receive a p-value of less than 0.05, we reject the null hypothesis and state that there is a significant difference.

SciPy has a function called ttest_1samp, which performs a 1 Sample T-Test for you.

ttest_1samp requires two inputs, a distribution of values and an expected mean:

tstat, pval = ttest_1samp(example_distribution, expected_mean)
print(pval)

It returns two outputs: the t-statistic (which we won't cover in this course) and the p-value, which tells us how likely a sample mean at least this far from the expected mean would be if the sample really came from a distribution with that mean.
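As a minimal, self-contained sketch of the call above: the data here is simulated and purely hypothetical (100 ages drawn from a normal distribution centred on 35, not real BuyPie data), and the names `rng`, `alpha`, and `reject_null` are introduced only for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical data: 100 simulated visitor ages centred on 35 (an assumption,
# chosen so the sample clearly differs from the target mean of 30).
rng = np.random.default_rng(seed=0)
example_distribution = rng.normal(loc=35, scale=5, size=100)
expected_mean = 30

# ttest_1samp returns the t-statistic first and the p-value second.
tstat, pval = ttest_1samp(example_distribution, expected_mean)

# Apply the 0.05 threshold described in this lesson.
alpha = 0.05
reject_null = pval < alpha

print(pval)
print(reject_null)
```

Because the simulated ages sit far from 30 relative to their spread, the p-value comes out well below 0.05 and the null hypothesis is rejected.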

We have provided a small dataset called ages, representing the ages of visitors to BuyPie.com in the past hour.

First, print out ages to the console and examine the numbers.

In [2]:
from scipy.stats import ttest_1samp
import numpy as np

ages = np.array([32., 34., 29., 29., 22., 39., 38., 37., 38., 36., 30., 26., 22., 22.])
print(ages)

[32. 34. 29. 29. 22. 39. 38. 37. 38. 36. 30. 26. 22. 22.]


Even with a small dataset like this, it is hard to make judgments from just looking at the numbers.

To understand the data better, let's look at the mean. Calculate the mean of ages using np.mean. Store it in a variable called ages_mean and print it out.

In [3]:
ages_mean = np.mean(ages)
print(ages_mean)

31.0


Use ttest_1samp with ages to see what p-value the experiment returns for this distribution, where we expect the mean to be 30.

Store the p-value in a variable called pval. Remember that it is the second output of the ttest_1samp function. We don't use the first output, the t-statistic, so you can store it in a variable with whatever name you'd like.

In [4]:
tstat, pval = ttest_1samp(ages, 30.)
print(pval)

0.5605155888171379
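
To turn this p-value into a decision, one possible sketch applies the 0.05 threshold from earlier in the lesson. The variable names `alpha` and `conclusion` are illustrative and not part of the exercise.

```python
import numpy as np
from scipy.stats import ttest_1samp

ages = np.array([32., 34., 29., 29., 22., 39., 38., 37., 38., 36., 30., 26., 22., 22.])

# Same test as in the exercise: is the mean age consistent with a target of 30?
tstat, pval = ttest_1samp(ages, 30.)

# Compare the p-value with the conventional 0.05 threshold.
alpha = 0.05
if pval < alpha:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(conclusion)
```

With a p-value of about 0.56, we fail to reject the null hypothesis: a sample mean of 31 is well within what chance alone could produce if the true mean were 30. This does not prove the true mean is 30; it only means this sample gives no significant evidence against it.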
