# Week 02
# Examples for lecture and lab

## Load Libraries

In [None]:
# for reading json files
import json

# numerical libraries
import numpy as np
import scipy as sp
import pystan

# pandas!
import pandas as pd

# plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%matplotlib inline

## Make a nice machine-readable dataset

### Absentee Ballot Data

In [None]:
absentee_df = pd.read_csv('absentee.csv')

In [None]:
absentee_df.columns

In [None]:
# pandas added a column for the index in the CSV
absentee_df.drop('Unnamed: 0', axis=1, inplace=True)

# fix the year column
absentee_df['year'] = absentee_df['year'] + 1900

# make a key that uniquely identifies each row: year + '0' + district, e.g. 199302
absentee_df['yrdt'] = (absentee_df['year'].astype(str) + '0' + absentee_df['district'].astype(str)).astype(int)

# and make it the index
absentee_df.set_index('yrdt',inplace=True)

# and output the dataframe to a dictionary
absentee_data_dict = absentee_df.to_dict()

In [None]:
# define the metadata for the dataset
absentee_dict = {'info':'In November 1993, the state of Pennsylvania conducted elections for its state legislature. The result in the Senate election in the 2nd district (based in Philadelphia) was challenged in court, and ultimately overturned. The Democratic candidate won 19,127 of the votes cast by voting machine, while the Republican won 19,691 votes cast by voting machine, giving the Republican a lead of 564 votes. However, the Democrat won 1,396 absentee ballots, while the Republican won just 371 absentee ballots, more than offsetting the Republican lead based on the votes recorded by machines on election day. The Republican candidate sued, claiming that many of the absentee ballots were fraudulent. The judge in the case solicited expert analysis from Orley Ashenfelter, an economist at Princeton University. Ashenfelter examined the relationship between absentee vote margins and machine vote margins in 21 previous Pennsylvania Senate elections in seven districts in the Philadelphia area over the preceding decade.',
                 'source':'Ashenfelter, Orley. 1994. Report on Expected Absentee Ballots. Typescript. Department of Economics, Princeton University.',
                 'url':'https://CRAN.R-project.org/package=pscl',
                 'vars':{'year':'a numeric vector, year of election, 19xx',
                         'district':'a numeric vector, Pennsylvania State Senate district',
                         'absdem':'a numeric vector, absentee ballots cast for the Democratic candidate',
                         'absrep':'a numeric vector, absentee ballots cast for the Republican candidate',
                         'machdem':'a numeric vector, votes cast on voting machines for the Democratic candidate',
                         'machrep':'a numeric vector, votes cast on voting machines for the Republican candidate',
                         'dabs':'a numeric vector, Democratic margin among absentee ballots',
                         'dmach':'a numeric vector, Democratic margin among ballots cast on voting machines'
                         }
                }

In [None]:
# store data with metadata
absentee_dict['data'] = absentee_data_dict

# do we have all the info in the dictionary?
absentee_dict.keys()

In [None]:
# write the dictionary (data plus metadata) to a JSON file;
# the with block closes the file on exit, so no explicit close() is needed
with open('absentee_data.json', 'w') as fp:
    json.dump(absentee_dict, fp)
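
One thing to keep in mind about this round trip: JSON object keys are always strings, so the integer `yrdt` index comes back as a string key once the file is read again. Here is a minimal sketch with a made-up one-entry dictionary. The nested layout mirrors what `to_dict()` produces, and the single value is the Democratic absentee count quoted in the info text.

```python
import json

# toy stand-in for absentee_df.to_dict(): {column: {index: value}}
toy = {'absdem': {199302: 1396}}

# serialize and parse back, as json.dump / json.load would do through a file
round_trip = json.loads(json.dumps(toy))

print(list(toy['absdem'].keys()))         # integer key before writing
print(list(round_trip['absdem'].keys()))  # string key after reading
```

This is why row lookups on the reloaded data use string labels such as `'199302'`.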

### Math score data

In [None]:
math_data = pd.read_csv('hsb.csv')

In [None]:
# rename column
math_data.rename(columns={'Unnamed: 0':'student_id'},inplace=True)

# and make it the index
math_data.set_index('student_id',inplace=True)

# and output the dataframe to a dictionary
math_data_dict = math_data.to_dict()

In [None]:
math_dict = {'info':'The data file used for this presentation is from the 1982 High School and Beyond Survey and is used extensively in Hierarchical Linear Models by Raudenbush and Bryk. It consists of 7,185 students nested in 160 schools. ',
             'source':'High School & Beyond (HS&B) is a nationally representative, longitudinal study of 10th and 12th graders in 1980. Follow-up surveys conducted throughout their postsecondary years. Surveys of students, teachers, and parents of sampled students. https://nces.ed.gov/surveys/hsb/',
             'url':'https://CRAN.R-project.org/package=merTools',
             'vars':{'schid':'a numeric vector, 160 unique values',
                     'mathach':'a numeric vector for the performance on a standardized math assessment',
                     'female':'a numeric vector coded 0 for male and 1 for female',
                     'ses':'a numeric measure of student socio-economic status',
                     'minority':'a numeric vector coded 0 for white and 1 for non-white students',
                     'schtype':'a numeric vector coded 0 for public and 1 for private schools',
                     'meanses':'a numeric, the average SES for each school in the data set',
                     'size':'a numeric for the number of students in the school'
                    }
            }

In [None]:
# store data with metadata
math_dict['data'] = math_data_dict

# do we have all the info in the dictionary?
math_dict.keys()

In [None]:
# write the dictionary (data plus metadata) to a JSON file;
# the with block closes the file on exit, so no explicit close() is needed
with open('math_data.json', 'w') as fp:
    json.dump(math_dict, fp)

### Rock the Vote

In [None]:
rtv_data = pd.read_csv('rock_the_vote.csv')

In [None]:
# rename column
rtv_data.rename(columns={'Unnamed: 0':'cable_system_id'},inplace=True)

# and make it the index
rtv_data.set_index('cable_system_id',inplace=True)

# and output the dataframe to a dictionary
rtv_data_dict = rtv_data.to_dict()

In [None]:
rtv_dict = {'info':'Voter turnout data spanning 85 cable TV systems, randomly allocated to a voter mobilization experiment targeting 18-19 year olds with "Rock the Vote" television advertisements. Green and Vavreck (2008) implemented a cluster-randomized experimental design in assessing the effects of a voter mobilization treatment in the 2004 U.S. Presidential election. The clusters in this design are geographic areas served by a single cable television system. So as to facilitate analysis, the researchers restricted their attention to small cable systems whose reach is limited to a single zip code. Further, since the experiment was fielded during the last week of the presidential election, the researchers restricted their search to cable systems that were not in the 16 hotly-contested “battleground” states (as designated by the Los Angeles Times).',
            'source':'Green, Donald P. and Lynn Vavreck. 2008. Analysis of Cluster-Randomized Experiments: A Comparison of Alternative Estimation Approaches. Political Analysis 16:138-152.',
            'url':'https://CRAN.R-project.org/package=pscl',
            'vars':{'strata':'numeric, experimental strata',
                    'treated':'numeric, 1 if a treated cable system, 0 otherwise',
                    'r':'numeric, number of 18 and 19 year olds turning out',
                    'n':'numeric, number of 18 and 19 year olds registered',
                    'p':'numeric, proportion of 18 and 19 year olds turning out',
                    'treatedIndex':'numeric, a counter indexing the 42 treated units'
                   }
           }

In [None]:
# store data with metadata
rtv_dict['data'] = rtv_data_dict

# do we have all the info in the dictionary?
rtv_dict.keys()

In [None]:
# write the dictionary (data plus metadata) to a JSON file;
# the with block closes the file on exit, so no explicit close() is needed
with open('rock_the_vote.json', 'w') as fp:
    json.dump(rtv_dict, fp)

### English Premier League

In [None]:
epl_data = pd.read_csv('epl_scores_data.csv')

In [None]:
# rename column
epl_data.rename(columns={'Unnamed: 0':'id'},inplace=True)

# and make it the index
epl_data.set_index('id',inplace=True)

# rename columns 
rename_dict = {}
cols = list(epl_data.columns)

for c in cols:
    rename_dict[c] = c.replace('epl_data.','')
    
epl_data.rename(columns=rename_dict,inplace=True)

# and output the dataframe to a dictionary
epl_data_dict = epl_data.to_dict()

In [None]:
epl_data.head()

In [None]:
epl_dict = {'info':'This is data from the 2015/2016 season of the English Premier League, which consists of 20 teams. Each two teams play two games with each other (home and away games). There are 38 weeks and 380 games in each season. We model the score difference (home team goals − away team goals) in each match.',
            'source':'https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/02_kharratzadeh',
            'url':'football-data.co.uk',
            'vars':{'home_team':'numeric index of the home team',
                    'away_team':'numeric index of the away team',
                    'home_goals':'goals scored by home team',
                    'away_goals':'goals scored by away team',
                    'score_diff':'home_goals minus away_goals',
                    'home_week':'index of the week of the home team',
                    'away_week':'index of the week of the away team'
                   }
           }

In [None]:
epl_teams = pd.read_csv('epl_team_data.csv')

In [None]:
# rename column
epl_teams.rename(columns={'Unnamed: 0':'id'},inplace=True)

# and make it the index
epl_teams.set_index('id',inplace=True)

# rename columns 
rename_dict = {}
cols = list(epl_teams.columns)

for c in cols:
    rename_dict[c] = c.replace('epl_data.','')
    
epl_teams.rename(columns=rename_dict,inplace=True)

# and output the dataframe to a dictionary
epl_teams_dict = epl_teams.to_dict()

In [None]:
epl_teams.head()

In [None]:
# store data with metadata
epl_dict['data'] = epl_data_dict
epl_dict['teams'] = epl_teams_dict

# do we have all the info in the dictionary?
epl_dict.keys()

In [None]:
# write the dictionary (data plus metadata) to a JSON file;
# the with block closes the file on exit, so no explicit close() is needed
with open('epl_data.json', 'w') as fp:
    json.dump(epl_dict, fp)

## A function to print a long string nicely

In [None]:
def print_info(info,wpl=12):
    """
    nicely print a long paragraph, wpl words per line
    """
    words = info.split()

    # step through the words wpl at a time; stepping by wpl (rather than
    # rounding the number of lines) keeps a short final line from being dropped
    for start in range(0, len(words), wpl):
        print(' '.join(words[start:start + wpl]))

In [None]:
def print_vars(var_dict):
    """
    nicely print the information about each variable
    """
    # what's the longest variable name?
    max_len = max((len(k) for k in var_dict), default=0)

    # pad each name to the same width so the descriptions line up
    for k, desc in var_dict.items():
        print(str(k).ljust(max_len + 1) + ' :::  ' + desc)

## Class Example 1: Absentee Ballots

Background information about this example is in [this New York Times article](https://www.nytimes.com/1994/04/11/us/probability-experts-may-decide-pennsylvania-vote.html). Jackman presents this example in his _Bayesian Analysis for the Social Sciences_ in Example 2.13 on pages 87-92 and Example 2.14 on pages 95-98. The exercise provides an opportunity to talk about how to construct a random variable, priors, and likelihood. In addition, this is a real-world example where a judge had to make a decision about an election outcome, so it further underscores our point that we need insights from noisy data to inform our choices.

### Read in data

In [None]:
# read json file into a dictionary; the with block closes the file on exit
with open('data/absentee_data.json', 'r') as f:
    json_data = json.load(f)

In [None]:
# what's the source?
print(json_data['source'])

In [None]:
# where can i get these data?
print(json_data['url'])

In [None]:
# print some info about the dataset
print_info(json_data['info'])

In [None]:
# what variables are in the dataset?
print_vars(json_data['vars'])

In [None]:
# just give it to me in a dataframe
data = pd.DataFrame(json_data['data'])
data

### What is our question?

> In November 1993 Pennsylvania conducted elections for its state legislature. The result in the Senate election in the 2nd district (based in Philadelphia) was challenged in court, and ultimately overturned. The Democratic candidate won 19,127 of the votes cast by voting machine, while the Republican won 19,691 votes cast by voting machine, giving the Republican a lead of 564 votes. However, the Democrat won 1,396 absentee ballots, while the Republican won just 371, more than offsetting the Republican lead based on the votes recorded by machines on election day.
> The Republican candidate sued, claiming that many of the absentee ballots were fraudulent. The judge solicited expert analysis from Orley Ashenfelter, an economist at Princeton University, who examined the relationship between absentee vote margins and machine vote margins in 21 previous Pennsylvania Senate elections in seven districts in the Philadelphia area over the preceding decade.

Suppose instead that we are providing expert analysis. Should we advise the judge to throw out the election outcome, which would initiate a costly redo of the election and precipitate criminal charges against the Democratic candidate?

In [None]:
# here is the row of data in question
data.loc[['199302']]

### What is our random variable of interest?

Let $i = 1, \ldots, 21$ index the elections of the previous decade.

To get us thinking:
* We want to know how unusual it is for the Democratic candidate to win 79 percent of the absentee ballots.
* Unusual with respect to what? Past machine shares? Past absentee shares?
* Was it a really good year for Democrats?

#### What's usual for machine ballots?

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(data['machdem'][:-1]/(data['machdem'][:-1]+data['machrep'][:-1]))
plt.axvline(0.4927353289710959,lw=3,color='black')
plt.text(0.51,0.50,'Disputed 1993',family='serif',size=12)
plt.text(0.51,0.43,'Outcome',family='serif',size=12)
plt.title('Empirical PDF of Percentage of Machine Ballots for Democrats',family='serif',size=14)
plt.xlim(0,1)
plt.xlabel('Percentage of Votes won by Democrats',family='serif',size=12)
plt.ylabel('Density',family='serif',size=12);

#### What's usual for absentee ballots?

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(data['absdem'][:-1]/(data['absdem'][:-1]+data['absrep'][:-1]),color='red')
plt.axvline(0.7900396151669496,lw=3,color='black')
plt.text(0.58,0.38,'Disputed 1993',family='serif',size=12)
plt.text(0.58,0.25,'Outcome',family='serif',size=12)
plt.title('Empirical PDF of Percentage of Absentee Ballots for Democrats',family='serif',size=14)
plt.xlim(0,1)
plt.xlabel('Percentage of Votes won by Democrats',family='serif',size=12)
plt.ylabel('Density',family='serif',size=12);

#### How have percentages won by Democrats varied over time?

In [None]:
# compute the percent won by Democrats over all elections
data['prcnt_dem_abs'] = data['absdem']/(data['absdem']+data['absrep'])
data['prcnt_dem_mch'] = data['machdem']/(data['machdem']+data['machrep'])

# compute percentiles
lft = data.groupby('year').quantile(0.25)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_low','prcnt_dem_mch':'mch_low'})
mid = data.groupby('year').quantile(0.50)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_mid','prcnt_dem_mch':'mch_mid'})
rght = data.groupby('year').quantile(0.75)[['prcnt_dem_abs','prcnt_dem_mch']].rename(columns={'prcnt_dem_abs':'abs_hgh','prcnt_dem_mch':'mch_hgh'})

# and merge together
m1 = pd.merge(left=lft,right=mid,left_on='year',right_on='year')
m2 = pd.merge(left=m1,right=rght,left_on='year',right_on='year')

In [None]:
plt.figure(figsize=(12,6))

# ranges
plt.fill_between(m2.index,m2.abs_low,m2.abs_hgh,alpha=0.2,color='indigo')
plt.fill_between(m2.index,m2.mch_low,m2.mch_hgh,alpha=0.2,color='silver')

# middle of ranges
plt.plot(m2.abs_mid,color='indigo',lw=4,label='absentee ballots')
plt.plot(m2.mch_mid,color='silver',lw=4,label='machine ballots')

# labels
plt.title('Percent of votes won by Democrats, 1982-1992',family='serif',size=14)
plt.ylabel('Percentage won by Democrats',family='serif',size=12)
plt.legend();

#### The previous graph shows something interesting, so let's look at a scatter plot.

In [None]:
# now let's plot all the relationships between machine and absentee votes
plt.figure(figsize=(8,8))
plt.scatter(data['machdem']/(data['machdem']+data['machrep']),data['absdem']/(data['absdem']+data['absrep']),color='plum')
plt.text(0.50,0.80,'Disputed 1993 Outcome',family='serif',size=12)
plt.plot(0.4927353289710959,0.7900396151669496, marker='o', markersize=8, color="indigo")
plt.title('Percentage of Machine versus Absentee Ballots won by Democrats',family='serif',size=14)
plt.xlim(0.3,1)
plt.ylim(0.3,1)
plt.xlabel('Machine',family='serif',size=12)
plt.ylabel('Absentee',family='serif',size=12);

#### Now, let's translate this to a random variable

Let our random variable be $y_i = a_i - m_i$, where $a_i$ is the Democratic percentage of the two-party vote cast via absentee ballot, $m_i$ is the Democratic percentage of the two-party vote cast via machine ballot, and $y_i$ is the difference between the two.
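
As a concrete instance, the definition can be applied to the disputed 1993 election using only the vote counts quoted earlier. The helper name `dem_share` is illustrative and not part of the original notebook; shares are put on the percentage scale to match the definition.

```python
def dem_share(dem_votes, rep_votes):
    """Democratic percentage of the two-party vote."""
    return 100.0 * dem_votes / (dem_votes + rep_votes)

# disputed 1993 election, 2nd district (counts quoted in the text)
a_1993 = dem_share(1396, 371)      # absentee ballots
m_1993 = dem_share(19127, 19691)   # machine ballots
y_1993 = a_1993 - m_1993           # absentee minus machine

print(round(a_1993, 2), round(m_1993, 2), round(y_1993, 2))
```

The Democrat took about 79 percent of absentee ballots against about 49 percent of machine ballots, a gap of roughly 30 percentage points.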

### Model for $y_i$

To a Bayesian, a model is a likelihood and a prior.

#### Likelihood
We will use a normal likelihood for this random variable:
$y_i \sim \textrm{Normal}(\mu,\sigma)$

Why a normal likelihood? The variable is continuous and takes values in $(-100,100)$. 
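
To make the likelihood concrete, the log-likelihood of a set of differences under $\textrm{Normal}(\mu,\sigma)$ is a sum of log densities. Below is a minimal sketch using only the standard library. The function name `normal_loglik` and the toy values are illustrative, and $\sigma$ is treated as a standard deviation.

```python
import math

def normal_loglik(y, mu, sigma):
    """Sum of Normal(mu, sigma) log densities over the observations in y,
    with sigma the standard deviation."""
    if sigma <= 0:
        raise ValueError('sigma must be positive')
    n = len(y)
    sq_dev = sum((yi - mu) ** 2 for yi in y)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - sq_dev / (2 * sigma ** 2)

# illustrative values only, not the election data
y_toy = [-5.0, 0.0, 5.0]
print(normal_loglik(y_toy, mu=0.0, sigma=5.0))
print(normal_loglik(y_toy, mu=20.0, sigma=5.0))  # lower: mu is far from the data
```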

#### Priors
We need to put priors on the two parameters of the Normal distribution: the mean, $\mu$, and the standard deviation, $\sigma$ (equivalently, the variance, $\sigma^2$). 

What should we use for the prior of $\mu$?
* What would we expect the mean difference between absentee and machine percentages to be? 
* It has to be between $(-100,100)$.
* Do we expect there to be differences in the use of absentee ballots by Democrats and Republicans?
* Are Democratic-leaning districts better at turning out absentee voters?

How much do we think the difference varies over elections? That is, what is the variance?
* How often will the mean be between plus or minus $\tau$?
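
One way to turn an answer to the $\pm\tau$ question into a prior is sketched below. It assumes a zero-centered normal prior on $\mu$, which is a modeling choice and not something fixed by the text; the helper name and the values $\tau = 25$ and 95 percent are hypothetical.

```python
from statistics import NormalDist

def prior_sd_for_interval(tau, prob):
    """Standard deviation s such that a Normal(0, s) prior puts
    probability `prob` on the interval (-tau, tau)."""
    z = NormalDist().inv_cdf(0.5 + prob / 2.0)
    return tau / z

# hypothetical belief: 95% sure the mean difference is within 25 points of zero
s = prior_sd_for_interval(tau=25.0, prob=0.95)
prior = NormalDist(mu=0.0, sigma=s)

# the second number recovers the stated probability as a check
print(s, prior.cdf(25.0) - prior.cdf(-25.0))
```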


In [None]:
# plot the prior

In [None]:
# talk about Cromwell's rule, how does our posterior look if we completely rule out differences greater than 25?
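
A small grid approximation can illustrate Cromwell's rule: wherever the prior is exactly zero, the posterior stays zero no matter what the data say. This sketch treats $\sigma$ as known to stay short, uses made-up data, and compares a flat prior over the whole grid with one that rules out $|\mu| > 25$. All names and values are illustrative.

```python
import math

def grid_posterior(y, sigma, grid, prior_weights):
    """Grid approximation to the posterior of mu for Normal(mu, sigma) data,
    with sigma treated as known to keep the sketch short."""
    log_like = [sum(-0.5 * ((yi - mu) / sigma) ** 2 for yi in y) for mu in grid]
    top = max(log_like)  # subtract the max before exponentiating, for stability
    unnorm = [w * math.exp(ll - top) for w, ll in zip(prior_weights, log_like)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mass_outside(grid, post, limit):
    """Posterior mass on grid points with |mu| > limit."""
    return sum(p for mu, p in zip(grid, post) if abs(mu) > limit)

grid = [g / 2.0 for g in range(-200, 201)]                 # mu from -100 to 100
wide = [1.0 for _ in grid]                                 # flat over the grid
capped = [1.0 if abs(mu) <= 25 else 0.0 for mu in grid]    # rules out |mu| > 25

y_fake = [28.0, 31.0, 35.0, 30.0]   # made-up data centred near 31
post_wide = grid_posterior(y_fake, 5.0, grid, wide)
post_capped = grid_posterior(y_fake, 5.0, grid, capped)

print(mass_outside(grid, post_wide, 25), mass_outside(grid, post_capped, 25))
```

With the capped prior the posterior piles up against the boundary at 25 even though the made-up data point to a mean near 31.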

In [None]:
# implement a prior predictive check
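
A prior predictive check simulates data from the priors alone and asks whether the simulated differences look plausible, for example whether they stay inside $(-100, 100)$. The sketch below assumes $\mu \sim \textrm{Normal}(0, 12.75)$ and a half-normal prior on $\sigma$ with scale 10. Both scales are hypothetical choices, not values from the original notebook.

```python
import random

def prior_predictive(n_sims, n_obs, mu_sd, sigma_scale, seed=0):
    """Draw (mu, sigma) from the priors, then simulate n_obs differences
    from Normal(mu, sigma) for each draw."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        mu = rng.gauss(0.0, mu_sd)                 # mu ~ Normal(0, mu_sd)
        sigma = abs(rng.gauss(0.0, sigma_scale))   # sigma ~ half-Normal(sigma_scale)
        sims.append([rng.gauss(mu, sigma) for _ in range(n_obs)])
    return sims

# hypothetical prior scales; 21 matches the number of past elections
sims = prior_predictive(n_sims=1000, n_obs=21, mu_sd=12.75, sigma_scale=10.0)
flat = [y for sim in sims for y in sim]

# share of simulated differences that fall outside the possible range
share_impossible = sum(abs(y) >= 100 for y in flat) / len(flat)
print(share_impossible)
```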

In [None]:
# fit the model using Stan
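
A minimal sketch of the fit, assuming the PyStan 2 interface that matches the `import pystan` at the top of the notebook. The prior scales in the Stan program are the same hypothetical choices one might settle on after the prior checks; they are assumptions, not the notebook's own values.

```python
# Stan program: normal likelihood with hypothetical priors on mu and sigma
normal_model_code = """
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 12.75);    // assumed prior scale
  sigma ~ normal(0, 10);    // half-normal through the lower bound
  y ~ normal(mu, sigma);
}
"""

def fit_normal_model(y, iters=2000, chains=4, seed=1):
    """Compile and sample with the PyStan 2 interface."""
    import pystan  # imported here so the sketch loads without PyStan installed
    stan_data = {'N': len(y), 'y': list(y)}
    model = pystan.StanModel(model_code=normal_model_code)
    return model.sampling(data=stan_data, iter=iters, chains=chains, seed=seed)
```

Calling `fit = fit_normal_model(y)` and then `fit.extract()` would give a dictionary of posterior draws keyed by `'mu'` and `'sigma'`.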

In [None]:
# do a few graphical posterior predictive checks

In [None]:
# what should our decision be? and how would we write this?
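
One quantity that could inform the decision is the posterior predictive probability of an absentee-minus-machine gap at least as large as the disputed one (about 29.73 points, from the counts quoted earlier). The sketch below takes posterior draws as input. The draws used here are placeholders generated for illustration, not output fitted to the real data.

```python
import random

def prob_gap_at_least(threshold, mu_draws, sigma_draws, seed=0):
    """Posterior predictive probability that a new election's
    absentee-minus-machine gap is at least `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for mu, sigma in zip(mu_draws, sigma_draws):
        y_new = rng.gauss(mu, sigma)   # one predictive draw per posterior draw
        hits += y_new >= threshold
    return hits / len(mu_draws)

# placeholder draws standing in for sampler output; not fitted to the real data
fake_rng = random.Random(1)
mu_draws = [fake_rng.gauss(0.0, 2.0) for _ in range(4000)]
sigma_draws = [abs(fake_rng.gauss(10.0, 1.0)) for _ in range(4000)]

print(prob_gap_at_least(29.73, mu_draws, sigma_draws))
```

A very small probability would say the disputed gap is unusual relative to past elections; it would not by itself establish fraud.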

## Class Example 2: Rock the Vote

Jackman presents this example in his _Bayesian Analysis for the Social Sciences_ in Example 7.9 on pages 355-362. The exercise provides an opportunity to estimate a binomial dependent variable and sets us up to talk about this example later when we talk about multi-level models. It is also a great opportunity to dive into Bayesian modeling in the context of a field experiment.

>Prior to the presidential election in November 2004, we assembled a nationwide list of cable systems that covered only a single zip code. Small cable TV systems are a fertile source of experimental data for social scientists because their small size makes them inexpensive and conducive to large-N randomized studies. In order to test the televised messages in an environment that would not be dominated by other election-related advertisements, we removed all cable systems in 16 states that the Los Angeles Times classified as presidential battlegrounds (closely contested states). We then excluded any systems that had no time available in prime time during the week before the election or that cost more than 15 dollars per 30-second advertisement on the USA television network. We excluded all systems in Mississippi because its voter file is very difficult to obtain. This left 85 cable systems for randomization.

>Random assignment of the cable systems took place as follows. Each system was matched with one or two other systems in the same state according to its past turnout rate in presidential elections. This procedure resulted in 40 strata containing the 85 cable systems. After sorting the list of 85 cable systems by strata and then by a random number, the first cable system in each stratum was assigned to the treatment condition, the others to control.

>People living within the treatment systems saw two different 30-second advertisements produced by Rock the Vote. Both advertisements used the same format. The first dealt with the draft and the second, with education. In the draft advertisement, a young couple dancing at a party is talking about the man’s new job. He is very excited to be working in promotions and hopes to start his own firm in 6 months. The woman interrupts him and says, “That’s if you don’t get drafted.” The man is puzzled. She clarifies, “Drafted, for the war?” He responds, “Would they do that?” The advertisement closes with everyone at the party looking into the camera and the words, “It’s up to you” on the screen. The voiceover says, “The Draft. One of the issues that will be decided this November. Remember to vote on November 2nd.” The closing image is of the Rock the Vote logo on a black screen.

>The second Rock the Vote advertisement dealt with education. A young man arrives at work with news that he has been accepted to college. His colleagues congratulate him and one of them asks, “Books, room, board, tuition ... how can you pay for all of that?” The advertisement closes with everyone looking out at the camera and the words, “It’s up to you” written on the screen. The voiceover is similar to the one above but with education substituted for draft. We showed both advertisements equally in all cable systems.

>Each cable system comprises several thousand voters, and the entire data set encompasses approximately 850,000 registered voters. Of special interest are the 23,869 voters who are 18 and 19 years of age, for whom this election represents the first federal election in which they are eligible to vote and to whom these ads were specifically addressed. The methodological question is what is the most efficient and reliable way to analyze these data? This question was particularly compelling since our previous mass-media turnout experiments suggested the effects of treatment were likely to be small in magnitude, but not zero (Vavreck and Green 2006).

In [None]:
# explore the question

In [None]:
# what is the model?
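
One simple starting point, offered as an assumption and not as the model from the text: treat the turnout count in a cable system as $r \sim \textrm{Binomial}(n, p)$ with a $\textrm{Beta}(a, b)$ prior on $p$, so the posterior is $\textrm{Beta}(a + r, b + n - r)$. The counts below are made up, and the helper names are illustrative.

```python
def beta_binomial_update(r, n, a=1.0, b=1.0):
    """Posterior Beta(a + r, b + n - r) for a turnout rate p, given
    r voters out of n registered and a Beta(a, b) prior."""
    if not 0 <= r <= n:
        raise ValueError('need 0 <= r <= n')
    return a + r, b + (n - r)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# made-up counts for a single cable system, not taken from the data set
a_post, b_post = beta_binomial_update(r=60, n=200)
print(a_post, b_post, beta_mean(a_post, b_post))
```

This treats each system separately and ignores the strata, which is part of what the later multi-level treatment would address.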

In [None]:
# prior checks

In [None]:
# model estimation

In [None]:
# graphical posterior checks

In [None]:
# what is the conclusion? how would we communicate this to people?

## Lab Example 1: Math Scores

Jackman presents this example in his _Bayesian Analysis for the Social Sciences_ in Example 7.6 on pages 323-328. The exercise allows the students to begin the lab with a straightforward continuous dependent variable and also sets us up to bring up this example later when we talk about multi-level models.

>The 1982 High School and Beyond Survey is a nationally representative sample of US public and Catholic schools, covering 7185 students in 160 schools. The chief outcome of interest is a standardized measure of math ability, with a mean of $12.75$ and interquartile range $[7.28, 18.32]$.

In [None]:
# explore the question

In [None]:
# what is the model?

In [None]:
# prior checks

In [None]:
# model estimation

In [None]:
# graphical posterior checks

In [None]:
# what is the conclusion? how would we communicate this to people?

## Lab Example 2: English Premier League Scores

This is an [example](https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/02_kharratzadeh) presented at the 2017 Stan conference. While the model presented is somewhat complicated, there are lots of ways to model these data simply in order to become familiar with the process. The thought is to make this one more open-ended and just see what people come up with. One of the students is _very interested_ in English Premier League soccer, so I also thought this would help keep people engaged.

>In this case study, we provide a hierarchical Bayesian model for the English Premier League in the season of 2015/2016. The league consists of 20 teams and each two teams play two games with each other (home and away games). So, in total, there are 38 weeks, and 380 games. We model the score difference (home team goals $-$ away team goals) in each match. The main parameters of the model are the teams' abilities which is assumed to vary over the course of the 38 weeks. The initial abilities are determined by performance in the previous season plus some variation. Please see the next section for more details.
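
The counts in the description can be checked with a little arithmetic: 20 teams give $20 \times 19 = 380$ ordered home/away pairings, and each team plays $2 \times 19 = 38$ matches. A short sketch follows; the variable names are illustrative.

```python
from itertools import permutations

n_teams = 20

# every ordered (home, away) pair is one match, so each pair of teams meets twice
fixtures = list(permutations(range(1, n_teams + 1), 2))
n_games = len(fixtures)

# each team plays every other team home and away, one match per week
n_weeks = 2 * (n_teams - 1)
games_per_week = n_games // n_weeks

print(n_games, n_weeks, games_per_week)  # 380 38 10
```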

In [None]:
# explore the question

In [None]:
# what is the model?

In [None]:
# prior checks

In [None]:
# model estimation

In [None]:
# graphical posterior checks

In [None]:
# what is the conclusion? how would we communicate this to people?