# Preliminaries

In [1]:
# Don't change this cell; just run it.
import numpy as np  # The array library.

# The random number generator
rng = np.random.default_rng()

# Plotting
import matplotlib.pyplot as plt

# Fancy plots
plt.style.use('fivethirtyeight')

# The OKpy testing system.
from client.api.notebook import Notebook
ok = Notebook('brexit_proportions.ok')

# Brexit proportions

Remember the [Brexit referendum](/permutation/population_permutation)?

On that page, we found that the survey company working for Hansard ended up
with 774 respondents who said they voted Remain, and 541 who said they voted
Leave.  In terms of proportions of Remain voters in the survey, that is:

In [2]:
survey_n_remain = 774
survey_n_leave = 541
survey_n_total = survey_n_remain + survey_n_leave
survey_prop_remain = survey_n_remain / survey_n_total
survey_prop_remain

It is odd that the survey company found more Remain than Leave voters, given
that the [final UK-wide
result](https://www.electoralcommission.org.uk/who-we-are-and-what-we-do/elections-and-referendums/past-elections-and-referendums/eu-referendum/results-and-turnout-eu-referendum)
of the referendum was 16,141,241 voting Remain and 17,410,742 voting Leave.
This gives a final UK-wide proportion of remain votes, to all votes cast,
of:

In [3]:
uk_n_remain = 16141241
uk_n_leave = 17410742
uk_n_total = uk_n_remain + uk_n_leave
uk_prop_remain = uk_n_remain / uk_n_total
uk_prop_remain
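For reference, the two proportions can also be shown as rounded percentages. This is a small sketch that repeats the counts from the cells above so that it runs on its own; the names `survey_pct_remain` and `uk_pct_remain` are introduced here for illustration only.

```python
# Counts repeated from the cells above.
survey_n_remain = 774
survey_n_leave = 541
uk_n_remain = 16141241
uk_n_leave = 17410742

# Proportions of Remain voters, as percentages rounded to 1 decimal place.
# (survey_pct_remain and uk_pct_remain are illustrative names.)
survey_pct_remain = round(
    100 * survey_n_remain / (survey_n_remain + survey_n_leave), 1)
uk_pct_remain = round(
    100 * uk_n_remain / (uk_n_remain + uk_n_leave), 1)
print(survey_pct_remain, uk_pct_remain)
```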

Let's say you are Hansard, and the survey company has given you the data.   You tell them:

> I'm worried that your survey may not be representative of the voting
> population, because your survey has 58.9% Remain voters, but the UK had 48.1%
> Remain voters.

And they reply:

> Oh, that's just sampling error.

Explain what the survey company means, when they say "that's just
sampling error".

*Write your answer here, replacing this text.*

## Reply to the survey company

Now that you know about sampling error, you can reply to the survey company.

We can reply by following the recipe in the [inference page](/iteration/inference).

Here are the steps from that page.

* Find the **data**.
* Calculate some **measure of interest** from the data.
* Make a simple (null-world) model of the world to offer as an explanation of
  the data.
* **Simulate the data** many times using the null-world.
* For each simulation **calculate the measure of interest**.  Call these the
  **simulated measures**.
* Use the **simulated measures** to build up the **sampling distribution**.
* Compare the **observed measure** to the **sampling distribution**, to see
  whether it represents a rare or common event, given the model.
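To make the recipe concrete, here is a minimal sketch of the steps applied to a made-up example that is not part of the Brexit data: 60 heads observed in 100 tosses of a coin, with a null-world model in which the coin is fair. The numbers and the names `n_tosses`, `observed_prop` and `prop_as_extreme` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical toy data: 60 heads observed in 100 tosses of a coin.
n_tosses = 100
observed_prop = 60 / n_tosses  # The observed measure of interest.

# Null-world model: the coin is fair, so each toss is heads with p=0.5.
n_trials = 10000
simulated_props = np.zeros(n_trials)
for i in np.arange(n_trials):
    # Simulate the data once in the null world (1 = heads, 0 = tails).
    tosses = rng.choice([1, 0], p=[0.5, 0.5], size=n_tosses)
    # Calculate the measure of interest for this simulation.
    simulated_props[i] = np.count_nonzero(tosses) / n_tosses

# The simulated measures make up the sampling distribution.  Compare the
# observed measure to it, by finding the proportion of simulated measures
# at least as large as the observed one.
prop_as_extreme = np.count_nonzero(simulated_props >= observed_prop) / n_trials
print(prop_as_extreme)
```

A small value of `prop_as_extreme` means the observed measure is a rare event in the null world; a large value means it is a common one.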

The **data** are as follows: in the survey sample, there were 774 Remain
voters and 541 Leave voters.  In the UK population, 48.1% voted Remain; this is
the value of `uk_prop_remain` above.

Let's say our **measure of interest** is the proportion of Remain voters in the
survey — `survey_prop_remain` above.

Now, *describe* (in words) your null-world simple model.  Consider looking
again at the inference page linked above, if you need more information.

*Write your answer here, replacing this text.*

The next step is to simulate one trial in this world.  Here is a cell for you
to do that.  Set `simulated_prop_remain` to be the proportion of Remain voters
in one simulated survey from that world.

In [4]:
voters = rng.choice([1, 0], p=[uk_prop_remain, 1 - uk_prop_remain],
                    size=survey_n_total)
simulated_prop_remain = np.count_nonzero(voters) / survey_n_total
# Show the result
simulated_prop_remain

In [5]:
_ = ok.grade('q_simulated_prop_remain')

Now that you have worked out how to get your measure from one trial, run 10000
trials, repeating this procedure, and storing the results in
`simulated_props`.

In [6]:
simulated_props = np.zeros(10000)

for i in np.arange(10000):
    voters = rng.choice([1, 0], p=[uk_prop_remain, 1 - uk_prop_remain],
                        size=survey_n_total)
    simulated_prop_remain = np.count_nonzero(voters) / survey_n_total
    simulated_props[i] = simulated_prop_remain

# Show the first 10 results of the simulation
simulated_props[:10]

In [7]:
_ = ok.grade('q_simulated_props')
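The loop above can also be sketched without an explicit loop, because the number of Remain voters in each simulated survey follows a binomial distribution. This is an alternative under that assumption, not the method the exercise asks for; `simulated_counts` and `simulated_props_vec` are illustrative names, and the values are repeated from the cells above so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng()

# Values repeated from the cells above.
survey_n_total = 774 + 541
uk_prop_remain = 16141241 / (16141241 + 17410742)

# Each draw is the number of Remain voters in one simulated survey of
# survey_n_total voters, where each voter is Remain with probability
# uk_prop_remain.
n_trials = 10000
simulated_counts = rng.binomial(survey_n_total, uk_prop_remain, size=n_trials)

# Convert counts to proportions, one per simulated survey.
simulated_props_vec = simulated_counts / survey_n_total
print(simulated_props_vec[:10])
```

The resulting array plays the same role as `simulated_props`, although the individual values will differ because the draws are random.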

Finally, calculate the estimated p value: the proportion of simulated proportions that were greater than or equal to the observed proportion in the survey.  Store it in `estimated_p`.

In [8]:
estimated_p = np.count_nonzero(simulated_props >= survey_prop_remain) / 10000
# Show the result
estimated_p

In [9]:
_ = ok.grade('q_estimated_p')
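As a rough cross-check on the simulation, the spread of the sampling distribution can be estimated with the standard formula for the standard deviation of a sample proportion, $\sqrt{p (1 - p) / n}$. The sketch below applies that formula under the null-world model; it is a supplement to the simulation rather than part of the exercise, and `null_std` and `n_stds_above` are illustrative names.

```python
import numpy as np

# Values repeated from the cells above.
survey_n_total = 774 + 541
survey_prop_remain = 774 / survey_n_total
uk_prop_remain = 16141241 / (16141241 + 17410742)

# Standard deviation of a sample proportion when each of the
# survey_n_total voters is Remain with probability uk_prop_remain.
null_std = np.sqrt(uk_prop_remain * (1 - uk_prop_remain) / survey_n_total)

# How many of these standard deviations the observed survey proportion
# lies above the UK-wide proportion.
n_stds_above = (survey_prop_remain - uk_prop_remain) / null_std
print(null_std, n_stds_above)
```

The larger `n_stds_above` is, the less plausible sampling error alone becomes as an explanation for the survey result.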

Now that the results are in, give a convincing reply to the survey company on whether the difference in proportions can really be explained by sampling error.

*Write your answer here, replacing this text.*

## Done.

Congratulations, you're done with the assignment!  Be sure to:

- **Run all the tests** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.

In [10]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]