#### ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Lesson 2.04 | Statistical Inference

## Introduction
We can organize most of statistics into two large sub-fields: **descriptive** statistics and **inferential** statistics.

- **Descriptive statistics** focuses on summarizing, describing, and understanding data we observe.
- **Inferential statistics** focuses on generalizing results from a sample to a larger population.

For today, we're going to focus almost entirely on inferential statistics. We'll still rely on descriptive stats, but now we're taking the descriptive statistics and generalizing them!

Our goal is to calculate sample statistics and then rely on properties of a random sample (and perhaps additional assumptions) to be able to make inferences that we generalize to the larger **population of interest**.

We will also rely on the library scipy (`scipy.stats`) for much of our work today, so let's import it now `as stats`.

import scipy.stats as stats

### Populations

Most data science problems have to do with studying **populations** in some form or another.
- All DSI students currently enrolled at GA.
- All nations in the European Union.
- All microwaves constructed in my factory this year.
- All hurricanes to enter the Gulf of Mexico in the past decade.
- All people who will vote in the 2020 election.

Often, our goal will be to "learn about a population."
- When we say "learn," we generally want to:
    - understand how a population behaves (e.g., look at the distribution of sleep) or
    - understand why it behaves the way it does (e.g., what causes sleep to take on certain values).
    
When we describe populations, we're often interested in particular measurements of this population. For example, we might be interested in the true mean hours of sleep or the true standard deviation of hours of sleep each night among DSI students.
- These measurements are called **parameters**.

<details><summary>
Why can't we just measure the population directly to get the true values of these parameters?
</summary>
```
- Time
- Money
- Accessibility
- Privacy
```
</details>

As a result, we study populations by taking subsets of these populations called **samples**.

When we take a sample, we want to learn about the parameter. We usually calculate some value on the sample, called a **statistic**, and use it to make inferences about the corresponding **parameter**. That is how we learn about our population.
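As a minimal sketch of the parameter/statistic distinction, the snippet below builds a made-up population of sleep hours (the values 7.0 and 1.5 are assumptions chosen for illustration, and NumPy is an extra dependency), then compares the population mean with the mean of one random sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: nightly hours of sleep for 10,000 people.
population = rng.normal(loc=7.0, scale=1.5, size=10_000)

# Parameter: calculated on the entire population (usually unknown in practice).
mu = population.mean()

# Sample: a random subset of the population.
sample = rng.choice(population, size=50, replace=False)

# Statistic: the same calculation, but on the sample only.
x_bar = sample.mean()

print(f"parameter (population mean): {mu:.3f}")
print(f"statistic (sample mean):     {x_bar:.3f}")
```

The two numbers should be close but not identical. Inference is about quantifying that gap.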

Let's tie these terms together with a visual.



## Central Limit Theorem
Normality underlies many of the inferential techniques that we seek to use. It is important for us to determine when Normality is a condition we've met.

Consider the variable $X$, measured on some population. We can take a sample of size $n$ from this population and find the mean of that sample. Let's call this $\bar{x}_1$. We can take another sample from this population, also of size $n$, and find the mean of that sample. Let's call this $\bar{x}_2$. We can do this over and over until we've calculated the mean of every possible sample of size $n$. If we plotted every sample mean $\bar{x}$ on a histogram, we would have plotted every possible value of $\bar{x}$ and how frequently we observe each of these values. Thus, we get another distribution! This is called "the sampling distribution of $\bar{X}$."

**The Central Limit Theorem states that, as $n \rightarrow \infty$, the sampling distribution of $\bar{X}$ approaches a Normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.**

- If $X \sim N(\mu,\sigma)$, then $\bar{X}$ is exactly $N\left(\mu,\frac{\sigma}{\sqrt{n}}\right)$.
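- If $X$ is not Normally distributed (or its distribution is unknown), then $\bar{X}$ is approximately $N\left(\mu,\frac{\sigma}{\sqrt{n}}\right)$ when the sample size $n$ is large enough. A common rule of thumb is $n \geq 30$, though heavily skewed populations may need more.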
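The Central Limit Theorem can be checked with a small simulation. This is only a sketch: instead of enumerating every possible sample, it draws many random samples from a right-skewed exponential population (an assumed example with $\mu = 2$ and $\sigma = 2$) and looks at the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 30               # size of each sample
n_samples = 20_000   # number of repeated samples (stand-in for "every possible sample")

# An exponential distribution with scale=2 is right-skewed, with mu = 2 and sigma = 2.
mu, sigma = 2.0, 2.0
samples = rng.exponential(scale=2.0, size=(n_samples, n))

# One sample mean per row: an approximation of the sampling distribution of the mean.
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())        # should be close to mu
print("sd of sample means:  ", sample_means.std(ddof=1))   # should be close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))
```

Plotting `sample_means` as a histogram should show a roughly bell-shaped curve even though the population itself is skewed.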

#### Why do we care?
If $\bar{X}$ is Normally distributed, then we know how $\bar{X}$ behaves and that the sample mean was drawn from a Normal distribution. We can then use the sample mean to conduct inference on the population mean!

## Inference
There are two main types of questions we ask when conducting inference on parameters.
- What is a range of likely values for my parameter?
- Is this a likely value for my parameter?

---

### Range of Likely Parameter Values: Confidence Intervals
Suppose you are playing some video game like Call of Duty or Halo. If you had an enemy who was invisible, would you rather use a rocket launcher or a sniper rifle?
- In the context of statistics, the true parameter value is unknown (invisible). Rather than identify one value of the parameter, it's helpful to identify a range of likely values for the parameter.

**A confidence interval describes a set of likely values for the parameter based on a statistic.** Confidence intervals will be centered at our "best guess" and then include a margin of error. 
- The technical term for this "best guess" is a **point estimate**.
- The margin of error is built from the **standard error**, which is the standard deviation of a statistic, scaled by a multiplier.

Thus, the structure of a confidence interval will be:

$$[\text{point estimate}] \pm [\text{multiplier}]\times[\text{standard error}]$$

Suppose I want to learn about my population mean.
- We use our sample mean as our point estimate for our population mean.

#### How do we find the "margin of error" for our population mean?
Because we know that $\bar{X}$ is Normally distributed with standard deviation $\frac{\sigma}{\sqrt{n}}$, we have an estimate of the variability of sample means from sample to sample. As such, our "standard error" will be $\frac{\sigma}{\sqrt{n}}$.

Similarly, because we know that $\bar{X}$ is Normally distributed, our "multiplier" for our standard error will come from the Normal distribution.

- If we want to be 90% confident that the true mean lies in our confidence interval, our multiplier should be 1.645.
- If we want to be 95% confident that the true mean lies in our confidence interval, our multiplier should be 1.96.
- If we want to be 99% confident that the true mean lies in our confidence interval, our multiplier should be 2.576.

We call 90%, 95%, and 99% the *confidence level*.
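These multipliers do not need to be memorized. As a quick sketch, they can be recovered from the standard Normal distribution with `stats.norm.ppf`, leaving half of the leftover probability in each tail (the dictionary name `multipliers` is just for illustration).

```python
import scipy.stats as stats

multipliers = {}
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    # Two-sided interval: leave alpha / 2 in each tail of the standard Normal.
    multipliers[level] = stats.norm.ppf(1 - alpha / 2)
    print(f"{level:.0%} confidence -> multiplier {multipliers[level]:.3f}")
```

The 99% value is about 2.5758, which is why it shows up as either 2.575 or 2.576 depending on how it is rounded.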

Putting it all together, our $z$-based confidence interval is $\bar{x} \pm z \times \frac{\sigma}{\sqrt{n}}$.
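Here is a small sketch of that formula in code. The summary statistics (`x_bar`, `sigma`, `n`) are hypothetical values chosen for illustration, and $\sigma$ is assumed to be known.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical summary statistics.
x_bar = 4.0    # sample mean (point estimate)
sigma = 1.5    # population standard deviation, assumed known
n = 36         # sample size
level = 0.95   # confidence level

std_err = sigma / np.sqrt(n)               # standard error of the sample mean
z = stats.norm.ppf(1 - (1 - level) / 2)    # multiplier from the standard Normal

lower = x_bar - z * std_err
upper = x_bar + z * std_err
print(f"{level:.0%} CI: ({lower:.2f}, {upper:.2f})")
```

Calling `stats.norm.interval(level, loc=x_bar, scale=std_err)` should return the same two endpoints.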

### Interpretation
Suppose a 95% confidence interval for the mean number of burritos Matt eats in a week is $(2.5,5.5)$. There are two ways to interpret this.
- We are 95% **confident** that the true mean number of burritos Matt eats in a week is between 2.5 and 5.5.
- If we pulled 100 samples and constructed confidence intervals in the same manner, we expect that about 95 of the intervals would contain the true mean number of burritos Matt eats in a week.
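The repeated-sampling interpretation can be illustrated with a simulation. This sketch assumes a Normal population with made-up values for $\mu$ and $\sigma$, builds one 95% interval per simulated sample, and counts how often the interval contains the true mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

mu, sigma, n = 4.0, 1.5, 36    # hypothetical population values and sample size
n_intervals = 10_000
z = stats.norm.ppf(0.975)      # multiplier for 95% confidence

# Draw many samples and build one interval per sample.
samples = rng.normal(mu, sigma, size=(n_intervals, n))
x_bars = samples.mean(axis=1)
margin = z * sigma / np.sqrt(n)

# An interval "covers" when the true mean lies between its endpoints.
covered = (x_bars - margin <= mu) & (mu <= x_bars + margin)
coverage = covered.mean()
print(f"share of intervals containing mu: {coverage:.3f}")
```

The share should land near 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure.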

**Check:** What is the point estimate for this confidence interval? What is the multiplier? What is the standard error?

Recall the two main types of questions we ask when conducting inference on parameters. We now turn to the second.
- What is a range of likely values for my parameter?
- Is this a likely value for my parameter?

---

### Is This a Likely Value for my Parameter: Hypothesis Tests
A hypothesis test is a way to learn more about a parameter of interest. It is an inversion of a confidence interval: a two-sided test at significance level $\alpha$ rejects a hypothesized value exactly when that value falls outside the corresponding $(1-\alpha)$ confidence interval, so the two approaches always agree. A confidence interval conveys more information, but knowing how to interpret a hypothesis test is still important.
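The link between the two can be sketched in a few lines. With hypothetical summary statistics and a known $\sigma$, the snippet below checks several candidate values (called `mu_0` here) and compares "the two-sided $z$-test rejects at 0.05" with "the value lies outside the 95% confidence interval".

```python
import numpy as np
import scipy.stats as stats

x_bar, sigma, n = 4.6, 1.5, 36   # hypothetical summary statistics
alpha = 0.05
std_err = sigma / np.sqrt(n)
z_crit = stats.norm.ppf(1 - alpha / 2)
lower, upper = x_bar - z_crit * std_err, x_bar + z_crit * std_err

def two_sided_p(mu_0):
    """Two-sided p-value for H0: mu = mu_0 using a z-test."""
    z = (x_bar - mu_0) / std_err
    return 2 * stats.norm.sf(abs(z))

results = {}
for mu_0 in (4.0, 4.2, 4.5, 5.0, 5.2):
    reject = bool(two_sided_p(mu_0) < alpha)
    outside = not (lower <= mu_0 <= upper)
    results[mu_0] = (reject, outside)
    print(f"mu_0={mu_0}: reject H0={reject}, outside CI={outside}")
```

For every candidate value, the two columns should match.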

Summary: We are going to come up with two hypotheses, called the null and alternative hypotheses. We'll assume the null hypothesis is true, then gather evidence and measure how consistent that evidence is with the null hypothesis.
- If there's a lot of evidence against the null hypothesis, we'll reject it in favor of the alternative hypothesis.
- If there's not a lot of evidence against the null hypothesis, we'll say that we can't conclude the null hypothesis is false.

Five steps to hypothesis testing:
1. Construct a null hypothesis that you seek to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your point estimate.
4. Calculate your test statistic.
5. Find your $p$-value and make a conclusion.

#### Step 1: Construct a null hypothesis that you seek to contradict and its complement, the alternative hypothesis.

Your hypotheses should be **mutually exclusive** and **collectively exhaustive**.
- Mutually exclusive: There is no overlap between the two hypotheses.
- Collectively exhaustive: The two hypotheses together explain everything that could possibly occur.

Suppose you want to show that the mean number of burritos I eat in a week is not equal to 4. Then your null hypothesis ($H_0$) and alternative ($H_A$) hypothesis are:

- $H_0: \mu = 4$
- $H_A: \mu \neq 4$

**Note**: Your alternative hypothesis should always be what you want to show. Your null hypothesis should be everything else!

If you want to show that the mean number of burritos I eat in a week is **less than** 4, then your hypotheses are:

- $H_0: \mu \geq 4$
- $H_A: \mu < 4$

Your hypotheses must be mutually exclusive and include all possible values of the parameter. **What you want to show should be in your alternative hypothesis.**

**Check:** Suppose you wanted to show that the mean number of burritos I eat in a week is **more than** 4. What are the hypotheses?

#### Step 2: Specify a level of significance.
You may have heard "alpha equals point oh-five!" before. A lower case alpha ($\alpha$) is used to denote the level of significance (or, more cryptically, the Type I error rate).
- In a technical sense, $\alpha$ is the probability of rejecting the null hypothesis given that the null hypothesis is actually true. 
- In a more practical sense, $\alpha$ determines how much evidence we need before we're going to conclude that the null hypothesis is false.

Standard levels for $\alpha$ are 0.01, 0.05, and 0.10. A higher alpha level means that you are likelier to reject your null hypothesis, but this also makes it likelier that you **improperly** reject your null hypothesis. 

The most common level is 0.05, but check your field's standards!
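To see what $\alpha$ means as a probability, here is a simulation sketch in which the null hypothesis is true by construction (all numeric values are hypothetical). Each simulated experiment runs a two-sided $z$-test, and we count how often the null is rejected anyway.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)

mu_0, sigma, n = 4.0, 1.5, 36   # hypothetical values; H0 (mu = mu_0) is true here
alpha = 0.05
n_experiments = 20_000

# Each row is one experiment run in a world where mu really equals mu_0.
samples = rng.normal(mu_0, sigma, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
p_values = 2 * stats.norm.sf(np.abs(z))

# Rejecting a true null hypothesis is a Type I error.
type_i_rate = (p_values < alpha).mean()
print(f"share of true nulls rejected: {type_i_rate:.3f}")
```

The rejection rate should sit close to $\alpha$. Raising `alpha` to 0.10 would roughly double it.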

#### Step 3: Calculate your point estimate.
In this case, your point estimate will simply be your sample mean - this should be easy!

#### Step 4: Calculate your test statistic.
In this case, your test statistic will be $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, where $\mu_0$ is the value of $\mu$ specified in the null hypothesis - should still be easy!
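As a worked sketch with hypothetical numbers, writing `mu_0` for the value of $\mu$ stated in the null hypothesis and assuming $\sigma$ is known:

```python
import numpy as np

x_bar = 4.6    # hypothetical sample mean (the point estimate from Step 3)
mu_0 = 4.0     # value of mu under the null hypothesis
sigma = 1.5    # population standard deviation, assumed known
n = 36         # sample size

std_err = sigma / np.sqrt(n)     # 1.5 / 6 = 0.25
z = (x_bar - mu_0) / std_err     # number of standard errors between x_bar and mu_0
print(f"z = {z:.2f}")
```

Here the sample mean sits 2.4 standard errors above the hypothesized mean.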

#### Step 5: Find your $p$-value and make a conclusion.
The definition of $p$-value is "the probability that, given a re-run of your experiment and assuming the null hypothesis is true, you get a test statistic that is as extreme or more extreme than the test statistic you just received." Within the context of what we just did in Step 4, your $p$-value is the probability that a re-run of the experiment would yield a $z$-score that is as extreme or more extreme than the one you just got, if the null hypothesis were true.

This is a measure of how extreme our experiment's results are. Just as a $z$-score quantifies how far an observation is from its expected value (a.k.a. population mean), here we're quantifying how far our sample statistic (a summary of our sample observations) is from its expected value under the null hypothesis by seeing how many standard errors (standard deviations) separate the two.

- If that $p$-value is less than your pre-determined $\alpha$, then you can reject your null hypothesis and conclude that the evidence supports your alternative hypothesis.
- If that $p$-value is more than your pre-determined $\alpha$, then you **fail to reject** your null hypothesis and **cannot conclude that either the null or the alternative is correct**.
- If that $p$-value is equal to your pre-determined $\alpha$, then your results are inconclusive. Either start a new study or treat the result as a failure to reject your null hypothesis.
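A minimal sketch of Step 5, assuming a hypothetical test statistic of $z = 2.4$. Which tail (or tails) counts as "extreme" depends on the alternative hypothesis, so the snippet shows all three cases before making a decision for the two-sided one.

```python
import scipy.stats as stats

z = 2.4        # hypothetical test statistic from Step 4
alpha = 0.05   # significance level chosen in Step 2

# Two-sided alternative (mu != mu_0): both tails count as "extreme".
p_two_sided = 2 * stats.norm.sf(abs(z))

# One-sided alternatives use a single tail.
p_greater = stats.norm.sf(z)    # alternative: mu > mu_0
p_less = stats.norm.cdf(z)      # alternative: mu < mu_0

decision = "reject H0" if p_two_sided < alpha else "fail to reject H0"
print(f"two-sided p = {p_two_sided:.4f} -> {decision}")
print(f"one-sided p (greater) = {p_greater:.4f}, one-sided p (less) = {p_less:.4f}")
```

`stats.norm.sf` is the survival function, $1 - \text{CDF}$, which gives the upper-tail area directly.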


#### Note on Nonparametric Statistics
Thus far, our inference has been **parametric.** That is, we have assumed a certain distribution for our data. However, there are alternatives in the case where we cannot assume a particular distribution for our data. When we make no assumptions about the distribution of our data, we call our methods **nonparametric**. For nearly every parametric test, there is a nonparametric analog available.

One example of a nonparametric test is the [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test. See the [SciPy K-S Test Documentation](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html).
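As a brief sketch of how this looks with `scipy.stats.kstest`, the snippet below tests two simulated samples against a standard Normal distribution. Both samples are made up for illustration: one really is standard Normal, the other is exponential.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)

normal_data = rng.normal(size=500)        # drawn from a standard Normal
skewed_data = rng.exponential(size=500)   # right-skewed and strictly positive

# H0 for each test: the sample comes from a standard Normal distribution.
res_normal = stats.kstest(normal_data, "norm")
res_skewed = stats.kstest(skewed_data, "norm")

print(f"normal sample: D = {res_normal.statistic:.3f}, p = {res_normal.pvalue:.3f}")
print(f"skewed sample: D = {res_skewed.statistic:.3f}, p = {res_skewed.pvalue:.3g}")
```

The statistic `D` is the largest gap between the sample's empirical CDF and the reference CDF, so the exponential sample should produce a much larger `D` and a far smaller $p$-value.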