### Key concepts
1. Hypothesis testing
2. Polls and sampling
3. Margin of error
4. T-tests and T-statistics
5. Election simulation

In [0]:
import math
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import numpy.random
import pandas as pd
import random
import scipy.stats

def coin_flip(p=0.5):
    val = random.random()
    if val < p:
        return 1.0
    else:
        return 0.0

def coin_flips(n=10, p=0.5):
    rand = numpy.random.rand(n)
    return rand[(rand < p)].size / n

def random_avg(n=10):
    res = 0
    for i in range(n):
        res += 10 * random.random()
    return res / n

def simulate(f, runs=10000):
    return pd.DataFrame({'Val': [f() for i in range(runs)]})

def poll_distribution(n, p):
    res = 0
    for i in range(10000):
        res += coin_flips(n, p)
    return res / 10000

def histogram(df, bins=10):
    if isinstance(df, pd.Series):
        df = pd.DataFrame({df.name: df})
    counts = df.assign(Count=1)
    col_name = df.columns[0]
    counts = counts.groupby(col_name).count()
    df.plot.hist(bins=bins, xlim=(0, 1))
    plt.show()

In [0]:
# This could fail, that's ok.
from ipywidgets import interact

## Hypothesis Testing 👩‍🔬

To convert data into information, we use the scientific method. 
1. We have a **theory**
2. We propose a **hypothesis** that should be true if our theory is true
3. We think of data as an **experiment** to test our hypothesis

Most hypotheses we test are stated as negatives, called null hypotheses, and we try to use the data to reject the null hypothesis. For example:

- Data does *not* deviate from a particular distribution
- Two variables are *not* related to each other

We can never categorically disprove a null hypothesis. We can only say that the data we observed would be highly unlikely if it were true.

In the following exercise, we will test whether a sequence of coin flips comes from a biased coin by showing that it is highly unlikely to come from a fair coin.

Let's say I got 55% heads out of 100 flips. How likely is that scenario with a fair coin?
What if I got 55% heads out of 1000 flips?

We can test that by running many simulations of a fair coin and counting the fraction of simulations that yielded at least 55% heads.

In [0]:
def test_estimate_out_of_bounds(num_samples, data_val, true_p=0.5, num_tests=10000):
    success = 0
    for i in range(num_tests):
        sample = coin_flips(num_samples, true_p)
        if data_val >= true_p:
            # If this simulation yields at least as many heads as our data,
            # then this distribution could have produced this data
            if sample >= data_val:
                success += 1
        elif data_val <= true_p:
            if sample <= data_val:
                success += 1
    return success / num_tests

In [0]:
test_estimate_out_of_bounds(100, 0.55)
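
As a cross-check on the simulation, the same tail probability can be computed exactly from the binomial distribution. The sketch below is an aside, not part of the original notebook: `exact_tail` is a hypothetical helper name, and it assumes the flips are independent with the same probability of heads.

```python
import scipy.stats

def exact_tail(num_flips, num_heads, p=0.5):
    # P(X >= num_heads) for X ~ Binomial(num_flips, p).
    # sf(k) returns P(X > k), so we pass num_heads - 1.
    return scipy.stats.binom.sf(num_heads - 1, num_flips, p)

p_100 = exact_tail(100, 55)     # 55% heads out of 100 flips
p_1000 = exact_tail(1000, 550)  # 55% heads out of 1000 flips
print(p_100, p_1000)
```

With a fair coin, 55% heads is unremarkable in 100 flips (roughly an 18% chance) but very unlikely in 1000 flips (below 0.1%). The simulated values should land close to these numbers.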

## TRY THIS
Write a function to repeatedly flip a biased coin.
How many flips do you need before you can say that the chance of those flips coming from a fair coin is less than 1%?

In [0]:
# return the number of coin flips you need to make before it becomes highly unlikely (< 1%)
# that those coin flips came from an unbiased coin.
# NOTE: Do the test every 100 coin flips, because testing in this way is expensive.
def detect_bias(bias=0.55):
    num_heads = 0
    num_flips = 0
    while True:
        # flip the biased coin
        num_heads += ...
        num_flips += 1
        if num_flips % 100 == 0:
            likelihood = test_estimate_out_of_bounds(..., ...)
            if likelihood < 0.01:
                return num_flips

In [0]:
detect_bias(0.55)

In [0]:
detect_bias(0.53)

In [0]:
detect_bias(0.51)

The more coin flips you do, the tighter the range of probabilities that could have generated those flips.
Also, the closer two distributions are, the more samples you need to show that they are different.
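
The tightening can be made concrete with the standard deviation of the fraction of heads, which is `sqrt(p * (1 - p) / n)` for `n` independent flips. This is a minimal sketch under that independence assumption; `proportion_std` is a hypothetical helper, not a function from the notebook.

```python
import math

def proportion_std(n, p):
    # Standard deviation of the fraction of heads in n flips
    # of a coin whose probability of heads is p.
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000):
    print(n, round(proportion_std(n, 0.5), 4))
```

The spread shrinks in proportion to `1 / sqrt(n)`, so 100 times as many flips gives a range only 10 times tighter. That is why telling a 0.51 coin apart from a fair one takes far more flips than telling a 0.55 coin apart.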


In [0]:
@interact
def h(flips=(50, 10000), bias=(0.0, 1.0)):
    histogram(simulate(lambda: coin_flips(flips, bias)))

In [0]:
histogram(simulate(lambda: coin_flips(1000, 0.5)))

In [0]:
histogram(simulate(lambda: coin_flips(10000, 0.5)))

In [0]:
histogram(simulate(lambda: coin_flips(100, 0.55)))

In [0]:
histogram(simulate(lambda: coin_flips(1000, 0.55)))

In [0]:
histogram(simulate(lambda: coin_flips(10000, 0.55)))

In [2]:
!git clone https://<user>:<password>@github.com/haroldfox/ts-stuy-2019

Cloning into 'ts-stuy-2019'...
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (135/135), done.
remote: Total 144 (delta 44), reused 58 (delta 5), pack-reused 0
Receiving objects: 100% (144/144), 12.24 MiB | 5.57 MiB/s, done.
Resolving deltas: 100% (44/44), done.


In [0]:
polls = pd.read_csv('ts-stuy-2019/datasets/presidential_polls.csv')
polls = polls[pd.notnull(polls['samplesize'])]
polls['samplesize'] = polls['samplesize'].astype(np.int64)
electoral_college = pd.read_csv('ts-stuy-2019/datasets/electoral-college-votes.csv', names=['state', 'ElectoralVotes'])
polls['Total'] = polls['rawpoll_clinton'] + polls['rawpoll_trump']
polls['AdjTotal'] = polls['adjpoll_clinton'] + polls['adjpoll_trump']
polls['Clinton'] = polls['rawpoll_clinton'] / polls['Total']
polls['AdjClinton'] = polls['adjpoll_clinton'] / polls['AdjTotal']
polls['AdjTrump'] = polls['adjpoll_trump'] / polls['AdjTotal']
polls['Trump'] = polls['rawpoll_trump'] / polls['Total']
polls = polls[(~polls['state'].str.contains('CD'))]
polls = polls[['pollster', 'grade', 'state', 'Clinton', 'Trump', 'AdjClinton', 'AdjTrump', 'samplesize']]
state_polls = polls[polls['state'] != 'U.S.'].groupby('state', as_index=False).first()
state_polls = pd.merge(state_polls, electoral_college, on='state')
national_polls = polls[polls['state'] == 'U.S.'].head()

A poll is like taking a sample of data from the true distribution. We imagine every American votes randomly according to some true probability, `true_p`. A poll asks a small number of those Americans what their vote will be, and we use that sample to estimate `true_p`.
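
The idea can be sketched in a few lines. This is an illustration only: `simulate_poll` is a hypothetical helper, and the `true_p` value of 0.51 is made up rather than taken from any election result.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def simulate_poll(sample_size, true_p):
    # Each respondent independently answers "Clinton" with probability true_p.
    votes = rng.random(sample_size) < true_p
    # The poll value is the fraction of respondents who answered "Clinton".
    return votes.mean()

true_p = 0.51  # hypothetical value, for illustration only
estimates = [simulate_poll(1000, true_p) for _ in range(5)]
print(estimates)
```

Each simulated poll gives a slightly different estimate of the same `true_p`, which is the variation the rest of this section tries to quantify.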

In [5]:
national_polls

Unnamed: 0,pollster,grade,state,Clinton,Trump,AdjClinton,AdjTrump,samplesize
0,Google Consumer Surveys,B,U.S.,0.518004,0.481996,0.510636,0.489364,24316
1,ABC News/Washington Post,A+,U.S.,0.494505,0.505495,0.491859,0.508141,1128
4,Pew Research Center,B+,U.S.,0.534884,0.465116,0.517813,0.482187,2120
5,Fox News/Anderson Robbins Research/Shaw & Comp...,A,U.S.,0.517647,0.482353,0.513715,0.486285,1221
6,IBD/TIPP,A-,U.S.,0.505096,0.494904,0.514804,0.485196,1018


In [0]:
state_polls

## TRY This

Take a look at the national polls. 
Now look up the official national vote counts for Clinton and Trump.
How accurate were those polls, given the true probability? Use AdjClinton as the poll value.

In [0]:
# return the likelihood of a particular poll result given the true election result.
def test_poll(sample_size, poll_value, true_result):
    return test_estimate_out_of_bounds(..., ..., ...)

## Margin of error
What's important about a poll is not just its percentage but also its margin of error: the range of true percentages that would be consistent with it. More precisely, the true percentage lies within the margin of error of the poll's value with high probability, generally 95%.
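
A common shortcut, not used elsewhere in this notebook, is the normal approximation: the margin of error is about 1.96 standard errors on either side of the poll value. The sketch below assumes a simple random sample; `approx_margin_of_error` is a hypothetical helper, and the inputs are the ABC News/Washington Post row of `national_polls`.

```python
import math

def approx_margin_of_error(sample_size, poll_value, z=1.96):
    # Standard error of a sample proportion.
    std_err = math.sqrt(poll_value * (1 - poll_value) / sample_size)
    # z = 1.96 covers about 95% of a normal distribution (two-sided).
    return z * std_err

moe = approx_margin_of_error(1128, 0.491859)
print(round(moe, 4))
```

The result is a half-width, the "plus or minus" figure usually quoted with a poll. The simulation-based exercise that follows uses a one-sided 5% cutoff, so its numbers need not match this formula exactly.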

## TRY This
Look at the national polls.
Given the poll's percentage, what range of true results would be consistent with that poll?
Plug a range of different values into test_poll and choose the highest and lowest values that have likelihood > 5%.

Is the actual election result within the margin of error of the poll?

In [0]:
# Use test_poll above and try different values of the true_result parameter.
# Start at the poll_value itself and increment it by 0.001 until the
# likelihood that it is consistent with the true result is < 5%.
# Return twice the distance from poll_value to that cutoff, i.e. the full
# width of the range of acceptable values.
def margin_of_error(sample_size, poll_value):
    diff = 0
    while True:
        likelihood = test_estimate_out_of_bounds(..., ..., ...)
        if likelihood < 0.05:
            return diff * 2
        diff += 0.001

## T-stats and t-tests
We don't have to use simulations to test whether our data matches some hypothetical normal distribution.

Instead, we can compute something called a t-statistic. The t-statistic itself follows a distribution, the t-distribution. When the computed t-statistic is an outlier value, then we can confidently reject our null hypothesis. Achieving that particular t-statistic by chance was too unlikely.
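
For a poll, each response can be coded as $x_i = 1$ (Clinton) or $x_i = 0$ (Trump). Under that assumption, and writing $\hat{p}$ for the poll value, $p_0$ for the hypothesized true probability, and $n$ for the sample size (symbols introduced here for illustration), the one-sample t-statistic is:

```latex
t = \frac{\hat{p} - p_0}{s / \sqrt{n}},
\qquad
s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)
    = \frac{n\,\hat{p}\,(1-\hat{p})}{n-1}
```

The last step uses $x_i^2 = x_i$ for 0/1 data, so both sums equal $n\hat{p}$. Under the null hypothesis, $t$ is compared against a t-distribution with $n - 1$ degrees of freedom.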

In [0]:
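def t_stat(sample_size, poll_value, true_p):
    # Each response is 0 or 1, so the sum of the responses equals the sum of their squares.
    sum_poll_values = poll_value * sample_size
    # Sum of squared deviations from the sample mean.
    s = sum_poll_values - (sum_poll_values * sum_poll_values) / sample_size
    # Sample standard deviation (n - 1 in the denominator).
    s = math.sqrt(s / (sample_size - 1))
    # Distance from the hypothesized true_p, in units of standard error.
    return (poll_value - true_p) / (s / math.sqrt(sample_size))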

In [0]:
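def t_test_poll(sample_size, poll_value, true_p):
    t = t_stat(sample_size, poll_value, true_p)
    # One-sided tail probability of a t-statistic at least this extreme,
    # with sample_size - 1 degrees of freedom.
    likelihood = scipy.stats.t.sf(abs(t), sample_size - 1)
    return likelihood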
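
One way to sanity-check these two functions is to build the 0/1 responses explicitly and hand them to `scipy.stats.ttest_1samp`. The sketch below repeats `t_stat` so that it runs on its own; the 55-out-of-100 sample is an invented example, not poll data.

```python
import math
import numpy as np
import scipy.stats

def t_stat(sample_size, poll_value, true_p):
    # Same formula as the notebook's t_stat, repeated so this block is self-contained.
    sum_poll_values = poll_value * sample_size
    s = sum_poll_values - (sum_poll_values * sum_poll_values) / sample_size
    s = math.sqrt(s / (sample_size - 1))
    return (poll_value - true_p) / (s / math.sqrt(sample_size))

# 100 responses: 55 ones and 45 zeros, i.e. a poll value of 0.55.
responses = np.array([1.0] * 55 + [0.0] * 45)
result = scipy.stats.ttest_1samp(responses, 0.5)

t_formula = t_stat(100, 0.55, 0.5)
one_sided_p = scipy.stats.t.sf(abs(t_formula), 99)
print(t_formula, result.statistic)
print(one_sided_p, result.pvalue / 2)
```

`ttest_1samp` reports a two-sided p-value by default, so half of it should agree with the one-sided value that `t_test_poll` returns.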

## TRY This
Calculate the margin of error again, this time using t_test_poll instead of test_poll.

In [0]:
def t_test_margin_of_error(sample_size, poll_value):
    diff = 0
    while True:
        likelihood = t_test_poll(..., ..., ...)
        if likelihood < 0.05:
            return diff * 2
        diff += 0.001

## Election simulation 🗳️🇺🇸

The USA uses the electoral college system, which is a lot more complex than a simple universal vote. Let's look at state polls to see what the electoral college estimates were. 

In [0]:
# given a poll value and a sample size, we estimate the chance of Clinton winning as the 
# likelihood that the true probability for a state is at least 50%
def chance_win(sample_size, poll_value):
    t = t_stat(sample_size, poll_value, 0.5)
    return scipy.stats.t.cdf(t, sample_size - 1)
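
To get a feel for how `chance_win` behaves, it helps to call it with the same poll value and different sample sizes. The block below repeats the two functions so it runs on its own; the 52% poll value is a made-up input, not a row from `state_polls`.

```python
import math
import scipy.stats

def t_stat(sample_size, poll_value, true_p):
    # Same formula as the notebook's t_stat.
    sum_poll_values = poll_value * sample_size
    s = sum_poll_values - (sum_poll_values * sum_poll_values) / sample_size
    s = math.sqrt(s / (sample_size - 1))
    return (poll_value - true_p) / (s / math.sqrt(sample_size))

def chance_win(sample_size, poll_value):
    # Same formula as the notebook's chance_win.
    t = t_stat(sample_size, poll_value, 0.5)
    return scipy.stats.t.cdf(t, sample_size - 1)

p_large = chance_win(1000, 0.52)  # hypothetical 52% poll of 1000 people
p_small = chance_win(100, 0.52)   # same poll value, only 100 people
print(p_large, p_small)
```

The same 52% lead translates into a much more confident win probability when the sample is large, which is worth keeping in mind when reading the `WinProb` column.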

In [0]:
state_polls['WinProb'] = state_polls.apply(lambda row: chance_win(row['samplesize'], row['AdjClinton']), axis=1)

In [0]:
# swing states
state_polls[(state_polls['WinProb'] >= 0.01) & (state_polls['WinProb'] <= 0.99)]

## Exercises
1. Run a simulation of 10000 elections. For each simulation, for each state, simulate whether Clinton wins the state according to the WinProb column. If Clinton gets at least 270 electoral votes, she wins. How many simulated elections does Clinton win in this way?
Does this seem accurate? Look at state_polls. What seems wrong?

2. What happens if we increase our uncertainty about winning by lowering the effective sample size to 100?

3. Investigate the idea of a systematic polling bias (Nate Silver uses this idea in his predictions). We're not sure what the direction of the bias may be, but it affects every poll. Run a simulation of 10,000 elections as in exercise #1. For each simulation, sample a poll-bias value from a normal distribution, then for each state, subtract this poll bias from the poll value, calculate the new probability of winning, and simulate the state's outcome based on this probability. With the systematic polling bias term, what is the probability of Clinton or Trump winning?