# Week 02 \\\ Examples for lecture
## Irregularities in Absentee Voting in PA?

## Load Libraries

In [None]:
# for reading json files
import json

# numerical libraries
import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats is available
import pystan

# pandas!
import pandas as pd

# plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%pylab inline

In [None]:
sns.set(style="white")

## A function to print a long string nicely

In [None]:
def print_info(info,wpl=12):
    """
    nicely print a long paragraph
    """
    
    long_info = info.split()
    # round up so a trailing partial line of words is not dropped
    num_lines = -(-len(long_info) // wpl)

    # break up the long string into lines of at most wpl words and print each
    for i in range(num_lines):
        chunk = long_info[wpl*i:wpl*(i+1)]
        print(' '.join(chunk))

In [None]:
def print_vars(var_dict):
    """
    nicely print the information about each variable
    """
    # what's the longest variable name?
    max_len = 0
    for k in var_dict.keys():
        if len(k) > max_len:
            max_len = len(k)
    
    for k in var_dict.keys():
        len_k = len(k)
        print(str(k) + ' '*(max_len - len_k + 1) + ' :::  ' + var_dict[k])

## Class Example 1: Absentee Ballots

Background information about this example is in [this New York Times article](https://www.nytimes.com/1994/04/11/us/probability-experts-may-decide-pennsylvania-vote.html). Jackman presents this example in his _Bayesian Analysis for the Social Sciences_ in Example 2.13 on pages 87-92 and Example 2.14 on pages 95-98. The exercise provides an opportunity to talk about how to construct a random variable, priors, and a likelihood. In addition, this is a real-world example where a judge had to make a decision about an election outcome, so it further underscores our point that we need insights from noisy data to inform our choices.

### Read in data

In [None]:
# read json file into a dictionary
# (the with-block closes the file automatically)
with open('data/absentee_data.json', 'r') as f:
    json_data = json.load(f)

In [None]:
# what's the source?
print(json_data['source'])

In [None]:
# where can i get these data?
print(json_data['url'])

In [None]:
# print some info about the dataset
print_info(json_data['info'])

In [None]:
# what variables are in the dataset?
print_vars(json_data['vars'])

In [None]:
# just give it to me in a dataframe
data = pd.DataFrame(json_data['data'])
data

### What is our question?

> In November 1993 Pennsylvania conducted elections for its state legislature. The result in the Senate election in the 2nd district (based in Philadelphia) was challenged in court, and ultimately overturned. The Democratic candidate won 19,127 of the votes cast by voting machine, while the Republican won 19,691 votes cast by voting machine, giving the Republican a lead of 564 votes. However, the Democrat won 1,396 absentee ballots, while the Republican won just 371, more than offsetting the Republican lead based on the votes recorded by machines on election day.
> The Republican candidate sued, claiming that many of the absentee ballots were fraudulent. The judge solicited expert analysis from Orley Ashenfelter, an economist at Princeton University, who examined the relationship between absentee vote margins and machine vote margins in 21 previous Pennsylvania Senate elections in seven districts in the Philadelphia area over the preceding decade.

Suppose instead that we are providing expert analysis. Should we advise the judge to throw out the election outcome, which would initiate a costly redo of the election and precipitate criminal charges against the Democratic candidate?

In [None]:
# here is the row of data in question
data.loc[['199302']]

### What is our random variable of interest?

Let $i = 1, \ldots, 21$ index the previous decade of elections.

To get us thinking:
* We want to know how unusual it is for the Democratic candidate to win 79 percent of the absentee ballots.
* Unusual with respect to what? Past machine shares? Past absentee shares?
* Was it a really good year for Democrats?

#### What's usual for machine ballots?

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(data['machdem'][:-1]/(data['machdem'][:-1]+data['machrep'][:-1]))
plt.axvline(0.4927353289710959,lw=3,color='black')
plt.text(0.51,0.50,'Disputed 1993',family='serif',size=12)
plt.text(0.51,0.43,'Outcome',family='serif',size=12)
plt.title('Empirical PDF of Percentage of Machine Ballots for Democrats',family='serif',size=14)
plt.xlim(0,1)
plt.xlabel('Percentage of Votes won by Democrats',family='serif',size=12)
plt.ylabel('Density',family='serif',size=12);

#### What's usual for absentee ballots?

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(data['absdem'][:-1]/(data['absdem'][:-1]+data['absrep'][:-1]),color='red')
plt.axvline(0.7900396151669496,lw=3,color='black')
plt.text(0.58,0.38,'Disputed 1993',family='serif',size=12)
plt.text(0.58,0.25,'Outcome',family='serif',size=12)
plt.title('Empirical PDF of Percentage of Absentee Ballots for Democrats',family='serif',size=14)
plt.xlim(0,1)
plt.xlabel('Percentage of Votes won by Democrats',family='serif',size=12)
plt.ylabel('Density',family='serif',size=12);

#### How have percentages won by Democrats varied over time?

In [None]:
# compute the percent won by Democrats over all elections
data['prcnt_dem_abs'] = 100*(data['absdem']/(data['absdem']+data['absrep']))
data['prcnt_dem_mch'] = 100*(data['machdem']/(data['machdem']+data['machrep']))

# compute percentiles
lft = data.groupby('year').quantile(0.25)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_low','prcnt_dem_mch':'mch_low'})
mid = data.groupby('year').quantile(0.50)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_mid','prcnt_dem_mch':'mch_mid'})
rght = data.groupby('year').quantile(0.75)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_hgh','prcnt_dem_mch':'mch_hgh'})

# and merge together
m1 = pd.merge(left=lft,right=mid,left_on='year',right_on='year')
m2 = pd.merge(left=m1,right=rght,left_on='year',right_on='year')

In [None]:
plt.figure(figsize=(12,6))

# ranges
plt.fill_between(m2.index,m2.mch_low,m2.mch_hgh,alpha=0.2,color='sandybrown')
plt.fill_between(m2.index,m2.abs_low,m2.abs_hgh,alpha=0.2,color='indigo')

# middle of ranges
plt.plot(m2.mch_mid,color='sandybrown',lw=4,label='machine ballots')
plt.plot(m2.abs_mid,color='indigo',lw=4,label='absentee ballots')

# labels
plt.title('Percent of votes won by Democrats, 1982-1992',family='serif',size=14)
plt.ylabel('Percentage won by Democrats',family='serif',size=12)
plt.legend();

#### The previous graph shows something interesting, so let's look at a scatter plot.

In [None]:
# now let's plot all the relationships between machine and absentee votes
plt.figure(figsize=(8,8))
plt.scatter(data['machdem']/(data['machdem']+data['machrep']),data['absdem']/(data['absdem']+data['absrep']),color='mediumorchid')
plt.text(0.50,0.80,'Disputed 1993 Outcome',family='serif',size=12)
plt.plot(0.4927353289710959,0.7900396151669496, marker='o', markersize=8, color="indigo")
plt.title('Percentage of Machine versus Absentee Ballots won by Democrats',family='serif',size=14)
plt.xlim(0.3,1)
plt.ylim(0.3,1)
plt.xlabel('Machine',family='serif',size=12)
plt.ylabel('Absentee',family='serif',size=12);

#### Now, let's translate this to a random variable

Let our random variable be $y_i = a_i - m_i$, where $a_i$ is the Democratic percentage of the two-party vote cast via absentee ballot, $m_i$ is the Democratic percentage of the two-party vote cast via machine ballot, and $y_i$ is the difference between the two.
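
As a concrete instance of this definition, the sketch below computes $a$, $m$, and $y$ for the disputed 1993 election from the vote counts quoted above. The helper name `dem_share` is ours for illustration and is not part of the dataset or the notebook.

```python
def dem_share(dem_votes, rep_votes):
    """Democratic percentage of the two-party vote."""
    return 100 * dem_votes / (dem_votes + rep_votes)

# vote counts for the disputed 1993 election, as quoted above
a_1993 = dem_share(1396, 371)      # absentee ballots
m_1993 = dem_share(19127, 19691)   # machine ballots
y_1993 = a_1993 - m_1993           # our random variable for 1993

print(round(a_1993, 2), round(m_1993, 2), round(y_1993, 2))
```

The difference comes out at roughly 29.7 percentage points, which is the value we will ask the model about at the end.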

### Model for $y_i$

To a Bayesian, a model is a likelihood and a prior.

#### Likelihood
We will use a normal likelihood for this random variable:
$y_i \sim \textrm{Normal}(\mu,\sigma)$

Why a normal likelihood? The variable is continuous and takes values in $(-100,100)$, so the Normal is a reasonable first choice for the epistemological and ontological reasons discussed previously.

#### Priors
We need to put priors on the two parameters of the Normal distribution:
* the mean, $\mu$, and 
* the standard deviation, $\sigma$. 

What should we use for the prior of $\mu$?
* What would we expect the mean difference between absentee and machine percentages to be? 
* It has to be between $(-100,100)$.
* Do we expect there to be differences in the use of absentee ballots by Democrats and Republicans?
* Are Democratic-leaning districts better at turning out absentee voters?

How much do we think the difference varies across elections, i.e., how large is $\sigma$?
* How often will the difference fall within plus or minus $\tau$?


In [None]:
# compute our random variable
data['y'] = data['prcnt_dem_abs'] - data['prcnt_dem_mch'] 

# setup figure
plt.figure(figsize=(10,5))

# draw plot
sns.kdeplot(data.y[:-1],shade=True,color='tomato')
sns.rugplot(data.y[:-1], color='tomato', linewidth=2);
sns.despine()
plt.title('PDF of our random variable',family='serif',size=14)
plt.xlabel('Percent Absentee minus Percent Machine',family='serif',size=12)
plt.xlim(-50,50);

### Building our model for $y$

#### Setting the prior for $\mu$

In [None]:
# compute the pdf of our prior for mu
x_mu = np.linspace(-100,100,400)
y_mu = sp.stats.norm.pdf(x_mu, loc=0, scale=20)

In [None]:
# plot the prior for **mu**

# setup figure
plt.figure(figsize=(10,5))

# make the figure
plt.plot(x_mu,y_mu,lw=3,color='slategray')
plt.fill_between(x_mu,y_mu,alpha=0.25,color='slategray')
sns.despine()

# label the figure
plt.xlabel('prior for $\mu$',family='serif',size=12)
plt.ylabel('density',family='serif',size=12);
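
To connect this prior back to the constraint that $\mu$ must lie in $(-100,100)$, here is a small check of how much mass a $\textrm{Normal}(0, 20)$ prior places inside that range, along with its central 95 percent interval. This is a sketch using `scipy.stats` directly; the variable names are ours.

```python
from scipy import stats

# the prior for mu used above: Normal with mean 0 and standard deviation 20
prior_mu = stats.norm(loc=0, scale=20)

# prior probability that mu falls inside the feasible range (-100, 100)
mass_feasible = prior_mu.cdf(100) - prior_mu.cdf(-100)

# central 95 percent prior interval for mu
lo, hi = prior_mu.ppf([0.025, 0.975])

print(mass_feasible, lo, hi)
```

Nearly all of the prior mass sits inside the feasible range, and the central 95 percent interval spans roughly plus or minus 39 percentage points.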

#### Setting the prior for $\sigma$

For the standard deviation we need a distribution that generates values greater than 0. What is some intuition for formulating prior information on how much a quantity varies? One way is to ask yourself: within what range of plus or minus $\tau$ will 50 percent of the values lie? For a distribution centered at zero, picking $\tau$ is equivalent to specifying the 75th percentile of the dependent variable.

Below is a function that will give you the standard deviation of a zero-mean normal distribution such that 75 percent of the mass is below $\tau$.

In [None]:
def sigma_given_tau(tau):
    """
    give this function what you think tau is,
    and it will tell you what the standard
    deviation should be so that 75 percent of
    the values are less than tau
    """
    return tau / sp.stats.norm.ppf(0.75)

In [None]:
# suppose tau is +/-3
sigma_given_tau(3)

In [None]:
# suppose tau is +/-10
sigma_given_tau(10)
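
As a sanity check on this mapping, the sketch below recomputes $\sigma$ for both values of $\tau$ and confirms that a zero-mean normal with that standard deviation puts 50 percent of its mass between $-\tau$ and $\tau$. The function is redefined here only so the cell runs on its own.

```python
from scipy import stats

def sigma_given_tau(tau):
    # same formula as the function defined above
    return tau / stats.norm.ppf(0.75)

for tau in (3, 10):
    sigma = sigma_given_tau(tau)
    dist = stats.norm(loc=0, scale=sigma)
    # mass between -tau and +tau; should be one half by construction
    mass = dist.cdf(tau) - dist.cdf(-tau)
    print(tau, round(sigma, 3), round(mass, 3))
```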

To get a handle on what the distribution of $y$ looks like under these assumptions, we can plot the PDF.

In [None]:
# first, compute the pdf of our random variable given our selection of tau
x_rv = np.linspace(-100,100,400)
y_rv_one = sp.stats.norm.pdf(x_rv, loc=0, scale=4.447806)
y_rv_two = sp.stats.norm.pdf(x_rv, loc=0, scale=14.82602)

In [None]:
# setup figure
plt.figure(figsize=(10,5))

# make the figure
plt.axvline(-3,lw=2,color='silver')
plt.axvline(3,lw=2,color='silver')
plt.plot(x_rv,y_rv_one,lw=3,color='cornflowerblue',label='tau = 3')
plt.fill_between(x_rv,y_rv_one,alpha=0.25,color='cornflowerblue')
plt.xlim(-15,15)
sns.despine()

# label the figure
plt.title('What does our data look like if 50% of observations are between $\pm 3$?',family='serif',size=14)
plt.xlabel('y',family='serif',size=12)
plt.ylabel('density',family='serif',size=12);

Now let's put our two values of $\tau$ side by side to see what they look like.

In [None]:
# setup figure
plt.figure(figsize=(12,6))

# make the figure
plt.plot(x_rv,y_rv_one,lw=3,color='cornflowerblue',label='tau = 3')
plt.fill_between(x_rv,y_rv_one,alpha=0.25,color='cornflowerblue')

plt.plot(x_rv,y_rv_two,lw=3,color='seagreen',label='tau = 10')
plt.fill_between(x_rv,y_rv_two,alpha=0.25,color='seagreen')

sns.despine()

# label the figure
plt.title(r'What does our data look like if 50% of observations are between $\pm \tau$?',family='serif',size=14)
plt.xlabel('y',family='serif',size=12)
plt.ylabel('density',family='serif',size=12)
plt.legend();

Now we can translate our two chosen values of $\tau$ into a gamma prior for $\sigma$. The gamma distribution has non-negative support, and we can pick its parameters so that the mean lies between 3 and 10. The parameters in the ``scipy.stats`` version of the gamma distribution and the corresponding PDF are below.

In [None]:
# compute the pdf of our prior for sigma
x_sig = np.linspace(0,30,300)
y_sig = sp.stats.gamma.pdf(x_sig, a=3.25, loc=0, scale=2)

In [None]:
# plot the prior for **sigma**

# setup figure
plt.figure(figsize=(10,5))

# make the figure
plt.plot(x_sig,y_sig,lw=3,color='cornflowerblue')
plt.fill_between(x_sig,y_sig,alpha=0.25,color='cornflowerblue')
sns.despine()

# label the figure
plt.xlabel('prior for $\sigma$',family='serif',size=12)
plt.ylabel('density',family='serif',size=12);

This gamma prior gives us values greater than zero, has a mean of 6.5 (shape times scale, $3.25 \times 2$), and concentrates most of its weight between 3 and 10. However, it still allows our model to find large values of $\sigma$ if needed; it does not rule them out (Cromwell).
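
These properties can be checked numerically. The sketch below uses the same `scipy.stats.gamma` parameterization as the plot above (shape `a`, `scale`) and reports the prior mean, the mode, the mass between 3 and 10, and the mass left above 20. The cutoff of 20 is our arbitrary choice for illustrating the tail.

```python
from scipy import stats

# same shape and scale as the prior plotted above
a, scale = 3.25, 2
prior_sigma = stats.gamma(a=a, loc=0, scale=scale)

prior_mean = prior_sigma.mean()        # equals a * scale
prior_mode = (a - 1) * scale           # closed form, valid because a > 1
mass_3_10 = prior_sigma.cdf(10) - prior_sigma.cdf(3)
tail_20 = prior_sigma.sf(20)           # prior mass above sigma = 20

print(prior_mean, prior_mode, mass_3_10, tail_20)
```

The tail mass above 20 is small but strictly positive, which is the sense in which the prior does not rule large values out.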

>[Cromwell's rule](https://en.wikipedia.org/wiki/Cromwell's_rule), named by statistician Dennis Lindley, states that the use of prior probabilities of 1 ("the event will definitely occur") or 0 ("the event will definitely not occur") should be avoided, except when applied to statements that are logically true or false, such as 2+2 equaling 4 or 5.

>The reference is to Oliver Cromwell, who wrote to the General Assembly of the Church of Scotland on 5 August 1650, including a phrase that has become well known and frequently quoted:

>I beseech you, in the bowels of Christ, think it possible that you may be mistaken.

>If the prior probability assigned to a hypothesis is 0 or 1, then, by Bayes' theorem, the posterior probability (probability of the hypothesis, given the evidence) is forced to be 0 or 1 as well; no evidence, no matter how strong, could have any influence.

#### The Bayesian Two-Step: simulating data from our priors

We can now do the Bayesian Two-Step by simulating values from our priors, then sampling from our likelihood to generate data. This shows us how our priors interact with the likelihood. These interactions can be complex, but the Two-Step can handle it. 

Questions to think about as you look at these plots:
1. Does our model generate plausible data? 
2. Does our model completely fail by only generating outrageous data?
3. Does our model allow for extreme states of the world?

#### Implement Bayes' Two-Step in Python

In [None]:
# setup simulation
n_sim = 500
prior_data = np.zeros(shape=(n_sim,2))
pr_pred_draws = {}
col_names = ['prior_mu','prior_sigma']
n_pr_data = len(data) - 1 # number of observed data points excluding the last one

# generate data
for i in range(n_sim):
    
    # Step One: 
    # simulate from prior distributions
    prior_data[i,0] = sp.stats.norm.rvs(loc=0, scale=20,size=1)[0] # simulate prior mean
    prior_data[i,1] = sp.stats.gamma.rvs(a=3.25, loc=0, scale=2,size=1)[0] # simulate prior standard deviation
    
    # Step Two: 
    # for the generated prior values,
    # sample from our likelihood
    pr_pred_draws[i] = sp.stats.norm.rvs(loc=prior_data[i,0], scale=prior_data[i,1],size=n_pr_data)
   
y_prior = pd.DataFrame(prior_data,columns=col_names)
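
One way to put a number on the "outrageous data" question is to count how many simulated datasets contain a value outside the feasible range $(-100,100)$. The sketch below is a vectorized variant of the Two-Step using NumPy's random generator rather than the loop above; the seed and the variable names are our own choices, so the draws will not match the loop's draws exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_sim, n_obs = 500, 21

# Step One: draw parameters from the priors
mu = rng.normal(loc=0, scale=20, size=n_sim)
sigma = rng.gamma(shape=3.25, scale=2, size=n_sim)

# Step Two: draw one dataset of n_obs points for each parameter draw
y_sim = rng.normal(loc=mu[:, None], scale=sigma[:, None], size=(n_sim, n_obs))

# share of simulated datasets with at least one impossible value
frac_impossible = np.mean((np.abs(y_sim) >= 100).any(axis=1))

# 1st and 99th percentiles of all simulated observations
p01, p99 = np.percentile(y_sim, [1, 99])

print(frac_impossible, p01, p99)
```

If the share of impossible datasets were large, that would be a signal to tighten the priors before fitting anything.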

#### Implement Bayes' Two-Step in Stan

In [None]:
# what's in this stan model anyway?
with open('absentee_prior.stan', 'r') as f:
    file_contents = f.read()
print(file_contents)

In [None]:
# compile Stan model
pm = pystan.StanModel(file="absentee_prior.stan")

In [None]:
# conduct MCMC using Stan
pr_draws = pm.sampling(iter=1000, chains=1)

In [None]:
# give us a dictionary containing prior draws for each parameter in the model
pr_pd = pr_draws.extract(permuted=True)

In [None]:
pr_pd.keys()

#### Compare Stan & Python priors

In [None]:
# plot the joint distribution of our priors generated by Stan
sns.jointplot(pr_pd['mu'], pr_pd['sigma'], kind='kde', color='slategray')
sns.despine();

Since we only have two parameters, we can plot the joint distribution of our priors.

In [None]:
# plot the joint distribution of our priors generated by Python
sns.jointplot(y_prior.prior_mu, y_prior.prior_sigma, kind='kde', color='slategray')
sns.despine();

#### Prior predictive Distribution

Now we plot simulated data from our prior distributions against our observed data.

In [None]:
plt.figure(figsize=(10,10))

nscttr = 101
#colors = sns.diverging_palette(10, 220, sep=80, n=nscttr)
colors = sns.diverging_palette(145, 280, s=85, l=25, n=nscttr)

c = 0
for i in np.random.choice(range(n_sim),nscttr,replace=False):
    plt.scatter(pr_pred_draws[i],data.y[:-1]+sp.stats.norm.rvs(0,0.5,size=21),color=colors[c],s=10)
    c += 1

# make it pretty
plt.axvline(0,color='silver')
plt.axhline(0,color='silver')
plt.xlim(-75,75)
plt.ylim(-40,20)
sns.despine()

# label the figure
plt.title('Prior Predictive Distribution',family='serif',size=14)
plt.xlabel('data sampled from prior',family='serif',size=12)
plt.ylabel('observed data',family='serif',size=12);

The above graph shows us what kinds of datasets our model _can_ produce. We want it to be able to produce things that look like the data as well as extreme versions of it. Follow the clump of points at $y = -21$: these are all possibilities our model could generate.

#### Prior Flipbook

Now we look at a flip book of possible datasets from our prior, not against the data, but as a histogram.

In [None]:
# setup figure
plt.figure(figsize=(15,7.5))

nflips = 21
colors = sns.diverging_palette(10, 220, sep=80, n=nflips)

# plot a few randomly selected datasets generated from our prior
c = 0
for i in np.random.choice(range(n_sim),nflips,replace=False):
    sns.kdeplot(pr_pred_draws[i],color=colors[c],shade=True)
    c += 1
    
# make it pretty
#plt.axvline(0,color='silver')
plt.xlim(-75,75)
sns.despine()

# label the figure
plt.title('Flipbook of data generated from prior',family='serif',size=14)
plt.xlabel('y',family='serif',size=12)
plt.ylabel('density',family='serif',size=12);

### Estimating our model

We will finally (!) estimate this model using MCMC and Stan. First we inspect our model, then we compile and conduct sampling. 

In [None]:
# what's in this stan model anyway?
with open('absentee.stan', 'r') as f:
    file_contents = f.read()
print(file_contents)

In [None]:
# compile Stan model
sm = pystan.StanModel(file="absentee.stan")

In [None]:
# build dataset for Stan
stan_data = {'N':len(data)-1,'y':data.y[:-1]}

In [None]:
# conduct MCMC using Stan
fit = sm.sampling(data=stan_data, iter=4000, chains=4)

In [None]:
# print some pretty summary stats about the posterior
print(fit)

In [None]:
# give us a dictionary containing posterior draws for each parameter in the model
ps = fit.extract(permuted=True)

In [None]:
ps.keys()

#### Visualizing the posterior

In [None]:
# Initialise the figure and a subplot axes.
num_rows = 4
num_cols = 2
fig, ax = plt.subplots(num_rows, num_cols, figsize=(16, 16))

# overall title
fig.suptitle('Graphical depictions of the posterior',y=1.025,fontsize=18,fontfamily='serif')

# ___ROW ONE___
# **trace plots**
ax[0,0].plot(ps['mu'],lw=1,alpha=0.75,color='cornflowerblue')
ax[0,0].set_title('Trace plot of $\mu$',fontsize=12,fontfamily='serif')
ax[0,1].plot(ps['sigma'],lw=1,alpha=0.75,color='cornflowerblue')
ax[0,1].set_title('Trace plot of $\sigma$',fontsize=12,fontfamily='serif')

# ___ROW TWO___
# **joint distribution** of parameters
sns.kdeplot(y_prior.prior_mu, y_prior.prior_sigma,
            color='cornflowerblue', shade=True, shade_lowest=False,ax=ax[1,0])
sns.kdeplot(ps['mu'], ps['sigma'],
            color="salmon", shade=True, shade_lowest=False,ax=ax[1,1])

ax[1,0].set_title('Prior joint distribution of parameters',fontsize=12,fontfamily='serif')
ax[1,0].set_xlim(-30,30)
ax[1,0].set_ylim(0,20)
ax[1,1].set_title('Posterior joint distribution of parameters',fontsize=12,fontfamily='serif')
ax[1,1].set_xlim(-30,30)
ax[1,1].set_ylim(0,20)

# ___ROW THREE___
# prior and posterior of **mu**
ax[2,0].plot(x_mu,y_mu,lw=3,color='cornflowerblue')
ax[2,0].fill_between(x_mu,y_mu,alpha=0.25,color='cornflowerblue')

sns.kdeplot(ps['mu'],shade=True, lw=3,color='salmon',shade_lowest=False,ax=ax[2,1])

ax[2,0].set_title('Prior distribution of $\mu$',fontsize=12,fontfamily='serif')
ax[2,0].set_xlim(-50,50)
ax[2,1].set_title('Posterior distribution of $\mu$',fontsize=12,fontfamily='serif')
ax[2,1].set_xlim(-50,50)

# ___ROW FOUR___
# prior and posterior of **sigma**
ax[3,0].plot(x_sig,y_sig,lw=3,color='cornflowerblue')
ax[3,0].fill_between(x_sig,y_sig,alpha=0.25,color='cornflowerblue')

sns.kdeplot(ps['sigma'],shade=True, lw=3,color='salmon',shade_lowest=False,ax=ax[3,1])

ax[3,0].set_title('Prior distribution of $\sigma$',fontsize=12,fontfamily='serif')
ax[3,0].set_xlim(0,20)
ax[3,1].set_title('Posterior distribution of $\sigma$',fontsize=12,fontfamily='serif')
ax[3,1].set_xlim(0,20)


# make the plot prettier
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10,10))

nscttr = 101
#colors = sns.diverging_palette(10, 220, sep=80, n=nscttr)
colors = sns.diverging_palette(145, 280, s=85, l=25, n=nscttr)

c = 0
for i in np.random.choice(range(n_sim),nscttr,replace=False):
    # randomly sample 21 values from the posterior
    y_post = np.random.choice(ps['y_sim'],len(data.y[:-1]),replace=False)
    
    # then plot against observations
    plt.scatter(y_post,data.y[:-1]+sp.stats.norm.rvs(0,0.5,size=21),color=colors[c],s=10)
    c += 1

# make it pretty
plt.axvline(0,color='silver')
plt.axhline(0,color='silver')
plt.xlim(-30,20)
plt.ylim(-30,20)
sns.despine()

# label the figure
plt.title('Posterior Predictive Distribution',family='serif',size=14)
plt.xlabel('data sampled from posterior',family='serif',size=12)
plt.ylabel('observed data',family='serif',size=12);

The above graphical check isn't that relevant here, because our model isn't designed to predict specific points. The main focus of the model is to uncover the distribution of the random variable of interest. To get at that, let's look at the empirical CDF of the data versus the posterior.

Here is a function to compute the empirical CDF of any data series.

In [None]:
def ecdf(data):
    """ Compute ECDF """
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n+1) / n
    return(x,y)
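
To see what this function returns, here is a tiny usage example on made-up numbers (not the election data). The function is repeated so the cell runs on its own.

```python
import numpy as np

def ecdf(data):
    """Compute the ECDF (same logic as the function above)."""
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y

# toy input, chosen only to illustrate the output
x, y = ecdf(np.array([3.0, -1.0, 2.0, 2.0]))
print(x)  # sorted values: -1, 2, 2, 3
print(y)  # cumulative proportions: 0.25, 0.5, 0.75, 1.0
```

Each `y` value is the share of observations less than or equal to the matching `x` value, so the last entry is always 1.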

In [None]:
# compute the empirical CDF of the data and posterior
x_data, y_data = ecdf(data.y[:-1])
x_post, y_post = ecdf(ps['y_sim'])

In [None]:
plt.figure(figsize=(10,5))

plt.plot(x_data,y_data,lw=2,color='cornflowerblue',label='data')
plt.plot(x_post,y_post,lw=2,color='salmon',label='posterior')

# make it pretty
plt.axvline(0,color='silver')
plt.xlim(-30,20)
sns.despine()

# label the figure
plt.title('Empirical CDF: Posterior versus Data',family='serif',size=14)
plt.xlabel('$y$',family='serif',size=12)
plt.ylabel('probability',family='serif',size=12);

A slightly more interesting way to do this is to randomly sample 21 data points from the posterior without replacement and plot many CDFs to get a sense of how much variability is in our model.

In [None]:
plt.figure(figsize=(10,5))

# plot a bunch of ECDFs from the posterior
for i in range(300):
    # randomly sample 21 values from the posterior
    y_post_samp = np.random.choice(ps['y_sim'],len(data.y[:-1]),replace=False)
    
    # compute ECDF
    x_post, y_post = ecdf(y_post_samp)
    
    # then plot 
    plt.plot(x_post,y_post,lw=0.25,alpha=0.25,color='salmon',label='posterior')

# plot the CDF of the data
plt.plot(x_data,y_data,lw=4,color='cornflowerblue',label='data')
    
# make it pretty
plt.xlim(-30,20)
plt.ylim(0,1)
sns.despine()

# label the figure
plt.title('Empirical CDF: Posterior versus Data',family='serif',size=14)
plt.xlabel('$y$',family='serif',size=12)
plt.ylabel('probability',family='serif',size=12);

The model does a decent job of capturing the variability of the observed data. There are notable departures between the posterior and data from $y = -5$ to $y = 4$, which corresponds to a region where we have few observations. 

### Now, let's answer the question

How unusual is it for the Democratic share of absentee ballots to exceed the Democratic share of machine ballots by 29.7 percentage points? 

We can compute this from our posterior. What is the probability that we observe $y \geq 29.7$?

$P(y \geq 29.7 | \mathbf{y})$

In [None]:
# this isn't observed in our posterior, so Pr(y >= 29.7) = 0 !
y_post = pd.DataFrame(ps['y_sim'],columns=['y_sim'])
y_post[y_post.y_sim >= 29.7]

Since this is a rare event, we can compute the probability directly from our posterior samples using the CDF of the normal distribution.
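
In symbols, write $\Phi$ for the standard normal CDF and $(\mu^{(s)}, \sigma^{(s)})$, $s = 1, \ldots, S$, for the posterior draws (this notation is ours). Under the normal likelihood, the tail probability is the normal tail averaged over the posterior, which the draws approximate:

$P(y \geq 29.7 | \mathbf{y}) = \int\!\!\int \left[1 - \Phi\left(\frac{29.7 - \mu}{\sigma}\right)\right] p(\mu, \sigma | \mathbf{y}) \, d\mu \, d\sigma \approx \frac{1}{S} \sum_{s=1}^{S} \left[1 - \Phi\left(\frac{29.7 - \mu^{(s)}}{\sigma^{(s)}}\right)\right]$

Each bracketed term is one entry of `prob_1993`. Looking at the whole distribution of those entries, rather than only their average, shows how uncertain we are about the tail probability itself.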

In [None]:
# tail probability P(y >= 29.7) under each posterior draw of (mu, sigma);
# norm.sf is the survival function, 1 - cdf, evaluated for all draws at once
prob_1993 = sp.stats.norm.sf(29.7, loc=ps['mu'], scale=ps['sigma'])

In [None]:
plt.figure(figsize=(10,4))
sns.kdeplot(prob_1993,shade=True)
sns.despine()

# label the figure
plt.title(r'PDF of the Computed Probability of $y \geq 29.7$',family='serif',size=14)
plt.xlabel(r'$P(y \geq 29.7)$',family='serif',size=12)
plt.ylabel('density',family='serif',size=12)
plt.xlim(0,0.002);

So what do we tell the judge about the plausibility of observing a 29.7 percentage point difference between the percentage of absentee ballots won by Democrats and the percentage of machine ballots won by Democrats?

How would we write this up?