# Lab 13 - Simulations and hypotheses

We will start this lab by simulating some data using the [NumPy](http://www.numpy.org) package.  We import this package with the code `import numpy as np`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## 13.1 Simulating Data

#### Smokers and non-smokers in New York City

The smoking rate in New York City was 11.6% in 2016.  In theory, this means that if we picked 100 New Yorkers at random, we would expect 11.6 of them to smoke on average.  But what happens in practice if we select 100 random New Yorkers?  We are going to simulate this scenario using code.

First we define our population:

In [None]:
population = ["Smoker","Non-smoker"]
pop_prob = [0.116,1-0.116]

This code defines our population as having two groups `Smoker` and `Non-smoker`, with the probabilities 0.116 and 1-0.116 = 0.884, respectively.  

We can generate a random sample of 100 people from our population with the code `np.random.choice(population,p=pop_prob,size=100)`.  Type and run it below.

The code has simulated 100 people and labeled each as a `Smoker` or `Non-smoker` according to the probabilities we gave.  Try re-running this code.  What happens?
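As a rough illustration of why the output changes between runs, the sketch below draws two samples and compares them, then repeats the comparison after fixing NumPy's random seed.  The seed value 42 and the variable names are arbitrary choices made for this example, not part of the lab.

```python
import numpy as np

population = ["Smoker", "Non-smoker"]
pop_prob = [0.116, 1 - 0.116]

# Two draws without fixing a seed will almost always differ somewhere.
first = np.random.choice(population, p=pop_prob, size=100)
second = np.random.choice(population, p=pop_prob, size=100)
print(np.array_equal(first, second))

# Resetting the seed to the same value (42 is arbitrary) before each
# draw makes the two samples identical.
np.random.seed(42)
seeded_first = np.random.choice(population, p=pop_prob, size=100)
np.random.seed(42)
seeded_second = np.random.choice(population, p=pop_prob, size=100)
print(np.array_equal(seeded_first, seeded_second))
```

Fixing a seed is handy when you want someone else to reproduce your exact sample; leave it out when you want to see the variation from run to run.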

Also notice that the word `array` appears at the beginning of the output.  An array is similar to a list or a Series, in that it holds many different values of the same type (integers, strings, etc.). This particular array is an `ndarray`, which is one of the kinds of arrays produced by `NumPy`.  We have to convert the array into a Pandas Series before we can use any Pandas function on it.

Next we want to count the number of smokers.  To do this, we:

1. assign this array to a variable
2. convert the array into a Pandas Series using `pd.Series(array_variable_name)`
3. run `value_counts()` on the Series to count the number of smokers and non-smokers

Can you figure out how to write this code?
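If you get stuck, here is one possible way to write the three steps.  The variable names `sample` and `sample_series` are our own choices for this sketch; any names will do.

```python
import numpy as np
import pandas as pd

population = ["Smoker", "Non-smoker"]
pop_prob = [0.116, 1 - 0.116]

# Step 1: assign the simulated array to a variable
sample = np.random.choice(population, p=pop_prob, size=100)

# Step 2: convert the NumPy array into a Pandas Series
sample_series = pd.Series(sample)

# Step 3: count how many times each label appears
counts = sample_series.value_counts()
print(counts)
```

The two counts always add up to 100, but the split between `Smoker` and `Non-smoker` changes each time the sample is redrawn.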

## 13.2 Multiple samples

To understand the variation in the number of smokers, we will take a bunch of random samples and make a histogram of the counts of smokers in each sample.  To do this, we will need to extract the number of smokers from the `value_counts()` output.

Save the output of `value_counts()` above in the variable `counts`. 

To compute the number of smokers in `counts`, use the code `counts["Smoker"]`:
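One caveat: `counts["Smoker"]` raises a `KeyError` if a sample happens to contain no smokers at all.  That is very unlikely with 100 people, but it can happen with small samples.  The sketch below uses a made-up five-person sample to show a safer lookup with the Series method `get`, which returns a fallback value when the label is missing.

```python
import pandas as pd

# A hypothetical tiny sample that happens to contain no smokers
no_smoker_sample = pd.Series(["Non-smoker"] * 5)
counts = no_smoker_sample.value_counts()

# counts["Smoker"] would raise a KeyError here because the label is
# missing from the index.  get() returns the fallback value 0 instead.
num_smokers = counts.get("Smoker", 0)
print(num_smokers)  # 0
```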

In Lab 11, we sampled multiple times from our dataset and made a histogram of the mean, median, or variance of the samples.  In this lab, we'll simulate multiple samples and make a histogram of the number of smokers in each sample.

The pseudo-code for how to do this is:

<code>create a new list for the counts
loop 200 times:
        simulate a new sample
        count the number of smokers in the sample
        add (append) this smoker count to your list
turn the list into a Pandas series and plot a histogram from it</code>

Note that "count the number of smokers in the sample" will likely be two lines of code.

<details> <summary>Pattern:</summary>
<code>new_list = []
for i in range(num_times_to_loop):
    sample = np.random.choice(population,p=pop_prob,size=sample_size)
    counts = pd.Series(sample).value_counts()
    num_in_category = counts["category_to_count"]
    new_list.append(num_in_category)
pd.Series(new_list).hist()
</code>
</details>
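For comparison with your own attempt, here is one way the pattern might be filled in for the smoking example.  It is a sketch rather than the only correct answer: the names `smoker_counts` and `smoker_series` are our own, and it prints summary statistics in place of the plot.

```python
import numpy as np
import pandas as pd

population = ["Smoker", "Non-smoker"]
pop_prob = [0.116, 1 - 0.116]

smoker_counts = []
for i in range(200):
    # simulate a new sample of 100 people
    sample = np.random.choice(population, p=pop_prob, size=100)
    # count the number of smokers in the sample
    counts = pd.Series(sample).value_counts()
    # get() falls back to 0 in the rare case that a sample has no smokers
    smoker_counts.append(counts.get("Smoker", 0))

smoker_series = pd.Series(smoker_counts)
print(smoker_series.describe())
```

Calling `smoker_series.hist()` afterwards draws the histogram in the notebook.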

What do you notice about the histogram?  Run your code again.  How does the histogram change (or not change)?

What happens when you increase the number of simulations?

How likely is it that a sample has 12 smokers?  20 smokers?
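One way to put rough numbers on these questions is to compute the fraction of simulated samples that contain exactly 12 (or exactly 20) smokers.  The sketch below is an assumption-laden shortcut rather than the lab's method: it replaces the loop by passing a tuple to `size`, so that every row of the result is one sample of 100 people, and it uses 5,000 samples (our choice) to make the estimates steadier.

```python
import numpy as np

population = ["Smoker", "Non-smoker"]
pop_prob = [0.116, 1 - 0.116]

num_samples = 5000  # our choice; more samples give steadier estimates

# Each row is one simulated sample of 100 people.
samples = np.random.choice(population, p=pop_prob, size=(num_samples, 100))

# Count the smokers in each row.
smoker_counts = (samples == "Smoker").sum(axis=1)

# Fraction of samples with exactly 12 smokers, and with exactly 20.
frac_12 = (smoker_counts == 12).mean()
frac_20 = (smoker_counts == 20).mean()
print("Estimated chance of 12 smokers:", frac_12)
print("Estimated chance of 20 smokers:", frac_20)
```

These fractions are estimates, so they shift a little on every run; compare them with the heights of the bars in your histogram.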

## 13.3 Election candidate polling

Suppose the true support for a candidate in an election is 53%.  A pollster samples 100 people. How much variation is there in the support for the candidate?  Generate 250 samples of size 100 and plot a histogram of the number of people who would vote for the candidate in each sample.

How likely is it that a sample has only 50 people who will vote for the candidate?  What about 40 people?

What happens if you increase the number of people polled?  Does the amount of variation change?

When you increase the number of people polled, how likely is it that 50\% of those people vote for the candidate?  What about 40\%?
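To compare polls of different sizes on the same scale, it can help to look at the share of supporters in each sample rather than the raw count.  The sketch below does this for polls of 100 and 1,000 people and reports the standard deviation of the simulated shares.  The labels `Supporter` and `Non-supporter`, the poll size of 1,000, and the vectorized `size=(250, sample_size)` shortcut are all assumptions made for this example.

```python
import numpy as np

voters = ["Supporter", "Non-supporter"]  # hypothetical labels
voter_prob = [0.53, 1 - 0.53]

spread_by_size = {}
for sample_size in [100, 1000]:
    # 250 simulated polls, one per row
    samples = np.random.choice(voters, p=voter_prob, size=(250, sample_size))
    # share of supporters in each poll
    support_share = (samples == "Supporter").mean(axis=1)
    # the standard deviation measures how much the share varies across polls
    spread_by_size[sample_size] = support_share.std()
    print(sample_size, "people polled, spread of share:", spread_by_size[sample_size])
```

A smaller spread means that the polls cluster more tightly around the true support of 53%.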

## 13.4 Hypotheses

A *statistical hypothesis* is a specific, testable assumption about a population that is either true or false.  For example, "the smoking rate in New York City has declined since 2016" is a statistical hypothesis.  We know from above that the smoking rate in 2016 was 11.6\%.

If the proportion of smokers in New York City today is $p$, how would we translate the hypothesis that "the smoking rate in New York City has declined since 2016" into a mathematical statement?

<details> <summary>Answer:</summary>
$p < 0.116$
</details>

The statement we are trying to show becomes our *alternative hypothesis*: $H_A: p < 0.116$.  The *null hypothesis* covers the remaining possibilities that the smoking rate has stayed the same or increased: $H_0: p \geq 0.116$.

For the election polling scenario above, suppose the next month we have a hypothesis that the support for the candidate has changed (either increased or decreased).  Let $p$ be the proportion of people who will vote for the candidate now.

What is our alternative hypothesis?

<details> <summary>Answer:</summary>
$H_A: p \neq 0.53$
</details>
<br>
What is our null hypothesis?

<details> <summary>Answer:</summary>
$H_0: p = 0.53$
</details>