**Name:** 

**Section:** 

**Date:**

# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button on the ribbon underneath the name of the notebook that looks like ▶| or hold down `Shift` + `Return`.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("election.ipynb")

Before you begin the activity, run the code cell below to import all the libraries and modules needed.

In [None]:
import numpy as np
from scipy import special
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

## Introduction

The outcome of the 2016 US presidential election took many people and many pollsters by surprise. In this assignment we will carry out a post mortem simulation study in an attempt to understand what happened. Doing such an analysis is especially important given how often federal elections happen. 

According to [whitehouse.gov](https://www.whitehouse.gov/about-the-white-house/our-government/elections-and-voting/#:~:text=Federal%20elections%20occur%20every%20two,is%20held%20every%20fourth%20year.) 

> "Federal elections occur every two years, on the first Tuesday after the first Monday in November. Every member of the House of Representatives and about one-third of the Senate is up for reelection in any given election year. A presidential election is held every fourth year." 

> "Federal elections are administered by state and local governments, and the specifics of how elections are conducted differ from state to state. The Constitution and laws of the United States grant States wide latitude in how they administer elections."

# Random Sample

In Pennsylvania, 6,165,478 people voted in the 2016 Presidential election.
Trump received 48.18% of the vote and Clinton received 47.46%.
This doesn't add up to 100% because other candidates received votes.
All together these other candidates received 100% - 48.18% - 47.46% = 4.36% of the vote.

The table below displays the counts and proportions.


|   Voted for   |  Trump|    Clinton|    Other|
|-----------|-----------|-----------|---------|
| Probability      |   0.4818   | 0.4746  |   0.0436 |
| Number of people | 2,970,733  | 2,926,441 | 268,304 |
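
As a quick sanity check, the proportions can be recomputed from the counts in the table. The sketch below uses plain Python with illustrative variable names (`counts`, `total`, `proportions`) that are not part of the assignment. Note that the "Other" share comes out as 0.0435 when computed directly from the counts; the table's 0.0436 is what you get by subtracting the two rounded percentages from 100%.

```python
# Vote counts from the table above (variable names are illustrative).
counts = {"Trump": 2_970_733, "Clinton": 2_926_441, "Other": 268_304}
total = sum(counts.values())  # should equal the 6,165,478 PA voters

# Convert each count to a proportion of all votes cast.
proportions = {name: n / total for name, n in counts.items()}
for name, p in proportions.items():
    print(f"{name}: {p:.4f}")
print("Total voters:", total)
```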

**Question 1.** Suppose we pick a simple random sample of 20 of the 6,165,478 Pennsylvania (PA) voters. In the sample, let $N_T$ be the number of Trump voters, $N_C$ the number of Clinton voters, and $N_O$ the number of "other" voters. $N_T$, $N_C$, and $N_O$ are random: they depend on how the sample comes out. In the Foundations of Data Science course we called such quantities "statistics".

Pick the correct option: 

$N_T + N_C + N_O$ is equal to

(a) 3

(b) 20

(c) 6,165,478

(d) a random quantity

**Note:** If your answer is (a), put  

```
ans_q1 = 'a'
```

for the purpose of grading.

In [None]:
ans_q1 = ...

In [None]:
grader.check("q1")

**Question 2.** Pick the correct option.

A simple random sample of 20 PA voters is like a sample drawn at random with replacement, because

(a) that's the definition of "simple random sample"

(b) there are only 3 categories of voters, which is small in comparison to 20

(c) there are only 20 people in the sample, which is small in comparison to the total number of PA voters

(d) all PA voters are equally likely to be selected

**Note:** If your answer is (a), put  

```
ans_q2 = 'a'
``` 

for the purpose of grading. 

In [None]:
ans_q2 = ...

In [None]:
grader.check("q2")

# Multinomial Probability

Earlier this semester we learned that the binomial distribution allows one to compute the probability of obtaining a given number of binary outcomes. For example, it can be used to compute the probability of getting 6 heads out of 10 coin flips. The flip of a coin is a binary outcome because it only has two possible outcomes: heads and tails. 
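
The coin-flip example can be worked out directly. The sketch below assumes a fair coin (probability 0.5 of heads) and uses only Python's standard library; the variable names are illustrative.

```python
from math import comb

n_flips = 10    # number of coin flips
n_heads = 6     # number of heads we are asking about
p_heads = 0.5   # assumption: the coin is fair

# Binomial probability: C(n, k) * p^k * (1 - p)^(n - k)
prob_six_heads = (
    comb(n_flips, n_heads)
    * p_heads**n_heads
    * (1 - p_heads)**(n_flips - n_heads)
)
print(prob_six_heads)  # 210 / 1024 = 0.205078125
```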

The multinomial distribution is used to compute the probabilities in situations in which there are more than two possible outcomes. For example, suppose that two students played numerous games of [rock, paper, scissors](https://youtu.be/ND4fd6yScBM) and it was determined that the probability Student A would win is 0.40, the probability Student B would win is 0.35, and the probability the game would end in a draw is 0.25. The multinomial distribution can be used to answer questions such as: 

> *"If these two students played 10 games, what is the probability that Student A would win 5 games, Student B would win 3 games, and the remaining 2 games would end in a draw?"* 

The following formula gives the probability of obtaining a specific set of outcomes when there are three possible outcomes for each event 

$$p=\frac{n!}{n_1! \cdot n_2! \cdot n_3!}p_1^{n_1} \cdot p_2^{n_2} \cdot p_3^{n_3}$$

where

- $p$ is the probability,

- $n$ is the total number of events,

- $n_1$ is the number of times outcome 1 occurs,

- $n_2$ is the number of times outcome 2 occurs,

- $n_3$ is the number of times outcome 3 occurs,

- $p_1$ is the probability of outcome 1

- $p_2$ is the probability of outcome 2, 

- and $p_3$ is the probability of outcome 3.
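
To see the formula in action, here is a minimal sketch that evaluates it for a small, made-up case: 3 events with one occurrence of each outcome and probabilities 0.5, 0.3, and 0.2. The helper name `multinomial_probability` is hypothetical and is not used elsewhere in this notebook.

```python
from math import factorial

def multinomial_probability(counts, probs):
    """Probability of observing `counts` when each event has outcome probabilities `probs`."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! * n_2! * ...), computed with exact integers.
    coefficient = factorial(n)
    for k in counts:
        coefficient //= factorial(k)
    # Multiply by p_1^n_1 * p_2^n_2 * ...
    probability = float(coefficient)
    for k, p in zip(counts, probs):
        probability *= p**k
    return probability

# Made-up example: 3!/(1! * 1! * 1!) * 0.5 * 0.3 * 0.2, which is 0.18 up to rounding.
example = multinomial_probability([1, 1, 1], [0.5, 0.3, 0.2])
print(example)
```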

**Question 3.** Suppose that two students play 10 games of rock, paper, scissors, where the probability that Student A wins a game is 0.40, the probability that Student B wins is 0.35, and the probability that the game ends in a draw is 0.25. Consider the outcome in which Student A wins 5 games, Student B wins 3 games, and the remaining 2 games end in a draw. Assign the correct value to each variable listed in the cell below.

In [None]:
# Number of games
n = ...

# Number of games won by Student A
n1 = ...

# Number of games won by Student B
n2 = ...

# Number of games drawn
n3 = ...

# Probability Student A wins
p1 = ...

# Probability Student B wins
p2 = ...

# Probability of a draw
p3 = ...

In [None]:
grader.check("q3")

**Question 4.** What is the probability that Student A wins 5 games, Student B wins 3 games, and the remaining 2 games end in a draw?

**Hint:** Let's use the `special.factorial` function to compute the factorials.

In [None]:
# Parentheses let the expression span several lines without backslashes.
ans_q4 = (
    special.factorial(...)
    / (special.factorial(...) * special.factorial(...) * special.factorial(n3))
    * p1**n1 * p2**n2 * p3**n3
)
ans_q4

In [None]:
grader.check("q4")

For the sample defined in **Question 1.**, the probability that the sample contains $t$ Trump voters, $c$ Clinton voters and $o$ "other" voters is denoted by $P(N_T = t, N_C = c, N_O = o)$, where $t$, $c$, and $o$ can be any three non-negative integers that add up to 20 (the number in the simple random sample). 

<!-- BEGIN QUESTION -->

**Question 5.**  Define a function `prob_sample_counts` that takes any three non-negative integers $t$, $c$, and $o$, and returns $P(N_T = t, N_C = c, N_O = o)$. 

**Hints:**

- The probability is 0 for some choices of the arguments, and your function should return 0 in those cases. For example, the number of sampled voters can't be negative.

- Remember that **Question 2.** implies you can use results for sampling with replacement.
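
To illustrate why the with-replacement results are a reasonable stand-in here, the sketch below compares the exact without-replacement probability of one particular sample (11 Trump, 8 Clinton, 1 other) with the with-replacement (multinomial) value. It uses the vote counts from the Pennsylvania table and only the standard library; the variable names are illustrative, and the proportions are taken as counts divided by the total rather than the rounded percentages.

```python
from math import comb, factorial

# Population counts from the Pennsylvania table.
pop = {"trump": 2_970_733, "clinton": 2_926_441, "other": 268_304}
N = sum(pop.values())
t, c, o = 11, 8, 1   # one example sample of 20 voters
n = t + c + o

# Without replacement: count the ways to choose each group, divided by
# the number of ways to choose any 20 voters.
p_without = (
    comb(pop["trump"], t) * comb(pop["clinton"], c) * comb(pop["other"], o)
) / comb(N, n)

# With replacement: the multinomial formula.
coefficient = factorial(n) // (factorial(t) * factorial(c) * factorial(o))
p_with = (
    coefficient
    * (pop["trump"] / N)**t
    * (pop["clinton"] / N)**c
    * (pop["other"] / N)**o
)

print(p_without, p_with)
print("relative difference:", abs(p_without - p_with) / p_with)
```

The two values agree to several significant digits because 20 is tiny compared with the number of PA voters.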

In [None]:
def prob_sample_counts(t, c, o):
    """
    Input
    -----
    t: number of votes for Trump
    c: number of votes for Clinton
    o: number of votes for Other
    
    Return
    ------
    The probability of getting the sample t, c, o
    """
    n = ...
    n1 = t
    n2 = c
    n3 = o
    p1 = 0.4818
    p2 = 0.4746
    p3 = 0.0436
    return special.factorial(n)/(special.factorial(n1)*special.factorial(n2)*special.factorial(n3)) * p1**n1 * p2**n2 * p3**n3
    

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 6.** Check that your function determines a probability distribution by summing all the terms. For this test we will assume that the number of voters in total is 10.

**Hints:** 

- You will need to iterate through certain values of $c$ and $t$ to compute the sum of all probabilities.

- To see how you can iterate through multiple values (i.e. $c$ and $t$) click [here](https://www.w3schools.com/python/gloss_python_for_nested.asp).

**Note:** Due to floating point errors you may see `1.0000000000000004`. We will ignore this amount of loss of significance for the purpose of this assignment.
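
The same kind of rounding shows up in much simpler sums. The sketch below is independent of `prob_sample_counts`: it adds ten copies of 0.1 and shows two standard-library ways to deal with the tiny discrepancy, `math.isclose` for comparisons and `math.fsum` for error-compensated summation.

```python
import math

terms = [0.1] * 10   # ten terms that "should" add up to exactly 1

naive_total = sum(terms)        # ordinary left-to-right floating-point sum
exact_total = math.fsum(terms)  # fsum compensates for intermediate rounding

print(naive_total == 1.0)               # False with IEEE 754 doubles
print(math.isclose(naive_total, 1.0))   # True
print(exact_total == 1.0)               # True
```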

In [None]:
ans_q6 = 0
for c in range(0, ...):
    for t in range(0, ...):
        if (c + t <= ...):
            o = ... - c - t
            ans_q6 = ans_q6 + prob_sample_counts(t, c, o)
    
ans_q6

<!-- END QUESTION -->

Let's use `prob_sample_counts` to find the chance that the sample consists of 11 Trump voters, 8 Clinton voters, and 1 *"other"* voter.

In [None]:
prob_sample_counts(t=11, c=8, o=1)

## Election Polling

Political polling is a type of public opinion polling that can represent a snapshot of public opinion at a particular moment in time. Voter opinion shifts from week to week, even day to day, as candidates battle it out on the campaign trail.

Polls usually start with a "horse-race" question, where respondents are asked whom they would vote for in a head-to-head race if the election were tomorrow: Candidate A or Candidate B. The survey begins with this question so that the respondent is not influenced by any of the other questions asked in the survey. Other questions help assess how likely it is that the respondent will vote. Demographic questions about age, education, and sex are asked so that the findings can be adjusted if one group appears over-represented in the sample. To contact people, pollsters typically use [random digit dialing](https://en.wikipedia.org/wiki/Random_digit_dialing).

If we're trying to predict the results of the Clinton vs. Trump presidential race, what is the population of interest? 

All eligible voters in the USA are the population of interest. In this presidential race, it means people who satisfy the following requirements (according to [usa.gov](https://www.usa.gov/who-can-vote)):

- Are a U.S. citizen (some areas allow non-citizens to vote in local elections only) 

- Meet your state’s residency requirements 

- Are 18 years old on or before Election Day (some areas allow 16-year-olds to vote in local elections only) 

- Are registered to vote by your state’s voter registration deadline.

## How might the sampling frame differ from the population?

After the fact, many experts have studied the 2016 election results. For example, according to the American Association for Public Opinion Research (AAPOR), predictions made before the election were flawed for three key reasons:

1. Those sampled were not representative of the voting population, e.g., some said that there was an over-representation of college graduates in some poll samples. 

1. Voters changed their preferences a few days before the election.

1. Voters kept their support for Trump to themselves (hidden from the pollsters).

In **Simulation Study of the Sampling Error**,

- we will carry out a study of the sampling error when there is no bias. In other words, we will try to compute the chance that we get the election result wrong even if we collect our sample in a manner that is completely correct. In this case, any failure of our prediction is due entirely to random chance.

In **Simulation Study of Selection Bias**,

- we will carry out a study of the sampling error when there is bias of the first type from the list above. In other words, we will try to compute the chance that we get the election result wrong if we have a small systematic bias. In this case, any failure of our prediction is due to a combination of random chance and our bias.

**Note:** To learn more about sampling frames, sampling error and sampling bias click the link to [read  the transcript from a conversation I had with ChatGPT](https://chat.openai.com/share/2fb7089a-1e81-4de2-ae9e-2d148b20fed1).

**Question 7.** Why can't we assess the impact of the other two biases (voters changing preference and voters hiding their preference)? Discuss this with a neighbor.

## How large was the sampling error?

In some states the race was very close, and it may have been simply sampling error, i.e., random chance that the majority of the voters chosen for the sample voted for Clinton.

A 2 or 3-point polling error in Trump’s favor (typical error historically) would likely be enough to tip the Electoral College to him.

One year after the 2016 election, Nate Silver wrote in [**The Media Has A Probability Problem**](https://fivethirtyeight.com/features/the-media-has-a-probability-problem/) that "the media’s demand for certainty -- and its lack of statistical rigor -- is a bad match for our complex world." FiveThirtyEight forecast that Clinton had about a 70 percent chance of winning.  

We will first carry out a simulation study to assess the impact of the sampling error on the predictions.


## The Electoral College

The US president is chosen by the Electoral College, not by the popular vote. Each state is allotted a certain number of electoral college votes, as a function of its population.
Whoever wins in the state gets all of the electoral college votes for that state.

There are 538 electoral college votes (hence the name of Nate Silver's site, [FiveThirtyEight](https://abcnews.go.com/538)).

Pollsters correctly predicted the election outcome in 46 of the 50 states. For these 46 states Trump received 231 and Clinton received 232 electoral college votes. The remaining 4 states accounted for a total of 75 votes, and whichever candidate received the majority of the electoral college votes in these states would win the election. 

These states were Florida, Michigan, Pennsylvania, and Wisconsin.

|State |Electoral College Votes|
| --- | --- |
|Florida | 29 |
|Michigan | 16 |
|Pennsylvania | 20 |
|Wisconsin | 10|

For Donald Trump to win the election, he had to win either:

* Florida + one (or more) other states
* Michigan, Pennsylvania, and Wisconsin
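
The two winning conditions above can be checked by brute force. The sketch below enumerates every combination of the four states and keeps those that bring Trump's 231 votes to at least 270, the majority of 538 (the 270 threshold is an assumption added here; it is not stated in the text above). Variable names are illustrative.

```python
from itertools import combinations

# Electoral votes in the four undecided states (from the table above).
electoral_votes = {"florida": 29, "michigan": 16, "pennsylvania": 20, "wisconsin": 10}
trump_base = 231     # votes from the 46 correctly predicted states
votes_to_win = 270   # assumption: a majority of the 538 electoral votes

winning_sets = []
for size in range(len(electoral_votes) + 1):
    for subset in combinations(electoral_votes, size):
        if trump_base + sum(electoral_votes[s] for s in subset) >= votes_to_win:
            winning_sets.append(set(subset))

for subset in winning_sets:
    print(sorted(subset))
```

Every winning combination either contains Florida plus at least one other state, or consists of Michigan, Pennsylvania, and Wisconsin together.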


The electoral margins were very narrow in these four states, as seen below:


|State | Trump |   Clinton | Total Voters |
| --- | --- |  --- |  --- |
|Florida | 49.02 | 47.82 | 9,419,886  | 
|Michigan | 47.50 | 47.27  |  4,799,284|
|Pennsylvania | 48.18 | 47.46 |  6,165,478|
|Wisconsin | 47.22 | 46.45  |  2,976,150|

Those narrow electoral margins can make it hard to predict the outcome given the sample sizes that the polls used. 
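
To put a rough number on this, the sketch below computes the standard error of the Trump-minus-Clinton margin for a single unbiased poll in Pennsylvania. It assumes a poll of 1,500 voters (a hypothetical but typical size) and treats the sample as multinomial, in which case the variance of the difference in sample proportions is $(p_T + p_C - (p_T - p_C)^2)/n$.

```python
from math import sqrt

p_trump, p_clinton = 0.4818, 0.4746   # Pennsylvania vote shares from the table
n = 1500                              # assumed poll size

margin = p_trump - p_clinton
# Variance of (Trump share - Clinton share) in a multinomial sample of size n.
variance = (p_trump + p_clinton - margin**2) / n
standard_error = sqrt(variance)

print(f"true margin:    {margin:.4f}")
print(f"standard error: {standard_error:.4f}")
```

Under these assumptions the standard error (about 2.5 percentage points) is several times larger than the true margin (0.72 points), so a single poll of this size can easily point to the wrong candidate.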

## Simulation Study of the Sampling Error

Now that we know how people actually voted, we can carry out a simulation study that imitates the polling. Our ultimate goal in this problem is to understand the chance that we will incorrectly call the election for Hillary Clinton even if our sample was collected with absolutely no bias.

For your convenience, the results of the vote in the four pivotal states are repeated below:

|State | Trump |   Clinton | Total Voters |
| --- | --- |  --- |  --- |
|Florida | 49.02 | 47.82 | 9,419,886  | 
|Michigan | 47.50 | 47.27  |  4,799,284|
|Pennsylvania | 48.18 | 47.46 |  6,165,478|
|Wisconsin | 47.22 | 46.45  |  2,976,150|

**Question 8.** Using the table above, create a function named `draw_state_sample(N, state)` that draws a random sample of `N` voters from the given state, based on the election results, and returns how many of them voted for each candidate. The result is returned as a list (or array) of three counts, where the first element is the number of Trump votes, the second element is the number of Clinton votes, and the third is the number of Other votes.

For example, `draw_state_sample(1500, "florida")` could return `[727, 692, 81]`. You may assume that the state name is given in all lower case.

**Hint:** `np.random.multinomial` can be used to draw the counts for each category.
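
As a small illustration of what a multinomial draw looks like, the sketch below splits 1,000 draws across three categories with made-up probabilities. It uses NumPy's seeded `default_rng` generator so the result is repeatable; its `multinomial` method takes the same first two arguments as `np.random.multinomial`. The names `example_probs` and `example_counts` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded generator so the draw is repeatable

# Made-up three-way split, not taken from any state in the table.
example_probs = [0.5, 0.3, 0.2]
example_counts = rng.multinomial(1000, example_probs)

print(example_counts)         # three counts, one per category
print(example_counts.sum())   # the counts always add up to the number of draws
```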

**Note:** Make sure that your function is working before you continue. You can check with your instructor.

In [None]:
states = {
    "florida": np.array([49.02, 47.82, 3.16])/100,
    "michigan": np.array([47.50, 47.27, 5.23])/100,
    "pennsylvania": np.array([48.18, 47.46, 4.36])/100,
    "wisconsin": np.array([47.22, 46.45, 6.33])/100
}

def draw_state_sample(N, state):  
    """
    Input
    -----
    N: The number of voters from the state
    state: The state Florida, Michigan, Pennsylvania, or Wisconsin
    
    Return
    ------
    Returns a list where the first item is the number of Trump votes, the 
    second item is the number of Clinton votes, and the third item is the number
    of votes for other.
    """
    state_values = states[state.lower()]
    simulated_votes = np.random.multinomial(N, state_values)
    return simulated_votes

draw_state_sample(1500, 'florida')

**Question 9.** Now, create a function named `trump_advantage` that takes in a sample of votes (like the one returned by `draw_state_sample`) and returns the difference in the proportion of votes between Trump and Clinton. 

For example `trump_advantage([100, 60, 40])` would return `0.2`, since Trump had 50% of the votes in this sample and Clinton had 30%.

**Note:** Make sure that your function is working before you continue. You can check with your instructor.

In [None]:
def trump_advantage(voter_sample):
    """
    Input
    -----
    voter_sample: The list from the draw_state_sample function.
    
    Return
    ------
    Returns the difference in the proportion of votes between Trump and Clinton.
    """
    total = sum(...)
    return (voter_sample[...] - voter_sample[...])/total
    
trump_advantage([100, 60, 40])

Now let's simulate Trump's advantage across 100,000 simple random samples of 1500 voters for the state of Pennsylvania and store the results of each simulation in a list called `simulations`.

In [None]:
simulations = []
for i in range(100000):
    simulations.append(trump_advantage(draw_state_sample(1500, "Pennsylvania")))

In [None]:
simulations[:10]
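
The same simulation can also be written without a Python loop. The sketch below is one possible vectorized variant, not the approach the assignment requires: it passes `size=100_000` to the generator's `multinomial` method to draw all samples at once and then computes the advantage for every row. The names `samples` and `advantages` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
pa_probs = np.array([48.18, 47.46, 4.36]) / 100   # Pennsylvania shares from the table

# One row per simulated poll; columns are Trump, Clinton, and Other counts.
samples = rng.multinomial(1500, pa_probs, size=100_000)
advantages = (samples[:, 0] - samples[:, 1]) / samples.sum(axis=1)

print(advantages[:10])
print(advantages.mean())   # should land near the true margin of 0.0072
```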

Make a histogram of the sampling distribution of Trump's proportion advantage in Pennsylvania.

In [None]:
min_val, max_val = min(simulations), max(simulations)

bins = np.arange(-0.1, 0.15, .01)
plt.hist(simulations, bins=bins, edgecolor='white', density=True)

plt.title("Trump Advantage %")
plt.xlabel("Percent of Advantage")

plt.show();

**Question 10.** What does the histogram reveal about Trump's advantage based on our simulation? Discuss your thoughts with a neighbor.

**Question 11.** Now write a function named `trump_wins(N)` that creates a sample of $N$ voters for each of the four crucial states (Florida, Michigan, Pennsylvania, and Wisconsin) and returns 1 if Trump is predicted to win based on these samples and 0 if Trump is predicted to lose.

Recall that for Trump to win the election, he must either:

- Win the state of Florida and 1 or more other states

- Win Michigan, Pennsylvania, and Wisconsin

**Hint:** It would be helpful to use your `trump_advantage` function to complete this part of the assignment.

In [None]:
def trump_wins(N):
    """
    Input
    -----
    N: The number of voters sampled from each of Florida, Michigan, Pennsylvania, and Wisconsin.
    
    Return
    ------
    Returns 1 if Trump is predicted to win based on these samples and 
    0 if Trump is predicted to lose.
    """
    wins = []
    
    for name, state_values in states.items():
        advantage = trump_advantage(draw_state_sample(N, name))
        if advantage > 0:
            wins.append(name)
    if len(wins) > 1 and 'florida' in wins:
        return 1
    elif len(wins) > 2:
        return 1
    return 0

trump_wins(10000)

**Question 12.** If we repeat 100,000 simulations of the election (i.e. we call `trump_wins(1500)` 100,000 times) what proportion of these simulations predict a Trump victory?

**Note:** This number estimates the chance that a given set of samples will correctly predict Trump's victory when the samples are collected with absolutely no bias.

In [None]:
wins = np.array([])

for _ in range(100000):
    outcome = trump_wins(1500)
    wins = np.append(wins, outcome)

proportion_trump_wins = np.average(wins)
proportion_trump_wins

We have just completed our study of sampling error, and found how our predictions might look if there was no bias in our sampling process. Essentially, we assumed that the people surveyed didn't change their minds, didn't hide who they voted for, and were representative of those who voted on election day.

## Simulation Study of Selection Bias

According to an article by Grotenhuis, Subramanian, Nieuwenhuis, Pelzer and Eisinga entitled [Better poll sampling would have cast more doubt on the potential for Hillary Clinton to win the 2016 election](https://blogs.lse.ac.uk/usappblog/2018/02/01/better-poll-sampling-would-have-cast-more-doubt-on-the-potential-for-hillary-clinton-to-win-the-2016-election/#Author):

> "In a perfect world, polls sample from the population of voters, who would state their political preference perfectly clearly and then vote accordingly."

That's the simulation study that we just performed. 

It's difficult to control for every source of selection bias. And, it's not possible to control for some of the other sources of bias. Next we investigate the effect of small sampling bias on the polling results in these four battleground states. Throughout this problem, we'll examine the impacts of a 0.5 percent bias in favor of Clinton in each state. Such a bias has been suggested because highly educated voters tend to be more willing to participate in polls.
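
Before simulating, it helps to see what this bias does to the expected margins. The sketch below moves 0.5 percentage points from Trump to Clinton in each state, which shifts the Trump-minus-Clinton margin by a full point, and lists the states where the expected lead changes sign. Variable names are illustrative.

```python
# Percent of the vote for (Trump, Clinton) in each state, from the table above.
results = {
    "florida": (49.02, 47.82),
    "michigan": (47.50, 47.27),
    "pennsylvania": (48.18, 47.46),
    "wisconsin": (47.22, 46.45),
}
bias = 0.5  # percentage points moved from Trump to Clinton

flipped = []
for state, (trump, clinton) in results.items():
    true_margin = trump - clinton
    biased_margin = (trump - bias) - (clinton + bias)
    print(f"{state:>12}: true {true_margin:+.2f}, biased {biased_margin:+.2f}")
    if true_margin > 0 and biased_margin < 0:
        flipped.append(state)

print("Expected lead flips in:", flipped)
```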

**Question 13.** In Pennsylvania, Clinton received 47.46 percent of the votes and Trump 48.18 percent. Increase the percentage of Clinton voters and correspondingly decrease the percentage of Trump voters by the `percentage_point_bias` of 0.5. Then simulate Trump's advantage across 100,000 simple random samples of 1500 voters for the state of **Pennsylvania** and store the results of each simulation in a list called `biased_simulations`.

In [None]:
percentage_point_bias = 0.5

states_biased = {
        "florida": np.array([49.02 - percentage_point_bias, 47.82 + percentage_point_bias, 3.16])/100,
        "michigan": np.array([47.50 - percentage_point_bias, 47.27 + percentage_point_bias, 5.23])/100,
        "pennsylvania": np.array([48.18 - percentage_point_bias, 47.46 + percentage_point_bias, 4.36])/100,
        "wisconsin": np.array([47.22 - percentage_point_bias, 46.45 + percentage_point_bias, 6.33])/100
}

def draw_biased_state_sample(N, state):
    """
    Input
    -----
    N: The number of voters to sample from the state.
    state: One of Florida, Michigan, Pennsylvania, or Wisconsin.
    
    
    Return
    ------
    Returns a list where the first item is the number of Trump votes, the 
    second item is the number of Clinton votes, and the third item is the number
    of votes for other.
    """
    state_values = states_biased[state.lower()]
    simulated_votes = np.random.multinomial(N, state_values)
    return simulated_votes

draw_biased_state_sample(1500, 'florida')

Now let's simulate Trump's advantage across 100,000 simple random samples of 1500 voters for the state of Pennsylvania using the biased proportions and store the results of each simulation in a list called `biased_simulations`.

In [None]:
biased_simulations = []
for i in range(100000):
    biased_simulations.append(trump_advantage(draw_biased_state_sample(1500, "Pennsylvania")))

In [None]:
biased_simulations[:10]

Make a histogram of the sampling distribution of Trump's proportion advantage in Pennsylvania.

In [None]:
min_val, max_val = min(biased_simulations), max(biased_simulations)

bins = np.arange(-0.1, 0.15, .01)
plt.hist(biased_simulations, bins=bins, edgecolor='white', density=True)

plt.title("Trump Advantage Biased %")
plt.xlabel("Percent of Advantage")

plt.show();

**Question 14.** What does the histogram reveal about Trump's advantage based on our simulation? Discuss your thoughts with a neighbor.

**Question 15.** Write 2-3 sentences comparing the histograms you created (Trump Advantage % and Trump Advantage Biased %). Discuss your thoughts with a neighbor.

**Question 16.** Now write a function named `trump_wins_biased(N)` that creates a biased sample of $N$ voters for each of the four crucial states (Florida, Michigan, Pennsylvania, and Wisconsin) and returns 1 if Trump is predicted to win based on these samples and 0 if Trump is predicted to lose.

Recall that for Trump to win the election, he must either:

- Win the state of Florida and 1 or more other states

- Win Michigan, Pennsylvania, and Wisconsin

**Hint:** It would be helpful to use your `trump_advantage` function to complete this part of the assignment.

In [None]:
def trump_wins_biased(N):
    """
    Input
    -----
    N: The number of voters sampled from each of Florida, Michigan, Pennsylvania, and Wisconsin.
    
    Return
    ------
    Returns 1 if Trump is predicted to win based on these samples and 
    0 if Trump is predicted to lose.
    """
    wins_biased = []
    
    for name, state_values in states_biased.items():
        advantage = trump_advantage(draw_biased_state_sample(N, name))
        if advantage > 0:
            wins_biased.append(name)
    if len(wins_biased) > 1 and 'florida' in wins_biased:
        return 1
    elif len(wins_biased) > 2:
        return 1
    return 0

trump_wins_biased(10000)

Now perform 100,000 simulations of all four states and return the proportion of these simulations that result in a Trump victory. This is the same fraction that you computed in **Question 12**, but now using your biased samples.

**Note:** This number represents the chance that a sample biased by 0.5 percentage points in Hillary Clinton's favor (a 1-point shift in the margin between the two candidates) will correctly predict Trump's victory. You should observe that the chance is significantly lower than with an unbiased sample (i.e., your answer in **Question 12**).

In [None]:
wins = np.array([])

for _ in range(100000):
    outcome = trump_wins_biased(1500)
    wins = np.append(wins, outcome)

proportion_trump_wins_biased = np.average(wins)
proportion_trump_wins_biased

## Would increasing the sample size have helped?

Try a sample size of 5,000 and run 100,000 simulations of a sample with replacement. What proportion of the 100,000 times is Trump predicted to win the election in the unbiased setting? In the biased setting?

Save your answers as `high_sample_size_unbiased_proportion_trump` and `high_sample_size_biased_proportion_trump`.

In [None]:
high_sample_size_unbiased_proportion_trump = np.average(np.array([trump_wins(5000) for i in range(100000)]))
high_sample_size_biased_proportion_trump = np.average(np.array([trump_wins_biased(5000) for i in range(100000)]))
print('The simulated probability of Trump winning the election (unbiased with 5000 sampled voters):', high_sample_size_unbiased_proportion_trump)
print('The simulated probability of Trump winning the election (biased with 5000 sampled voters):', high_sample_size_biased_proportion_trump)

**Question 17.** What do your observations from the previous question say about the impact of sample size on the sampling error and on the bias? Discuss your thoughts with a neighbor.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)