# Bootstrap
Bootstrapping is a popular technique for estimating a parameter or statistic from limited data. In a sense, bootstrapping itself has parameters to choose. How much data does one need? How many bootstrap replicates are required to achieve a good estimate of the target statistic? Using sample data from the 2020 Presidential election, we will attempt to estimate the vote share in four of the closest battleground states: Georgia, Pennsylvania, Michigan, and Arizona.
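
Before turning to the election data, here is a minimal sketch of the idea on synthetic data. The sample, the seed, the replicate count, and the helper name `bootstrap_means` are all assumptions made for illustration; none of them come from the analysis that follows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic sample (made-up data): 1 = vote for the candidate, 0 = otherwise.
sample = rng.binomial(1, 0.5, size=200)

def bootstrap_means(x, n_replicates, rng):
    """Resample x with replacement and return the mean of each replicate."""
    idx = rng.integers(0, len(x), size=(n_replicates, len(x)))
    return x[idx].mean(axis=1)

reps = bootstrap_means(sample, 1000, rng)
estimate = reps.mean()
# Percentile interval: the middle 95% of the replicate means.
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
print(f"estimate={estimate:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Both the sample size (200 here) and the number of replicates (1000 here) affect the width and stability of the interval, which is why they are worth treating as choices to investigate.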

In [52]:
import pandas as pd
import numpy as np


## The Data
From the original dataset, we will build a dictionary of samples for each state, one per sample size. Each sample is drawn by randomly choosing indices into the original dataset.

In [53]:
# Get the dataset containing the population.
# A raw string keeps the Windows path separators from being read as escape sequences.
vote = pd.read_csv(r'11-Python Statistics in EDA\11.1.3-Inference and Modeling II\countypres_2000-2020.csv')
vote.head()

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode
0,2000,ALABAMA,AL,AUTAUGA,1001.0,PRESIDENT,AL GORE,DEMOCRAT,4942.0,17208.0,20191203,TOTAL
1,2000,ALABAMA,AL,AUTAUGA,1001.0,PRESIDENT,GEORGE W. BUSH,REPUBLICAN,11993.0,17208.0,20191203,TOTAL
2,2000,ALABAMA,AL,AUTAUGA,1001.0,PRESIDENT,RALPH NADER,GREEN,160.0,17208.0,20191203,TOTAL
3,2000,ALABAMA,AL,AUTAUGA,1001.0,PRESIDENT,OTHER,OTHER,113.0,17208.0,20191203,TOTAL
4,2000,ALABAMA,AL,BALDWIN,1003.0,PRESIDENT,AL GORE,DEMOCRAT,13997.0,56480.0,20191203,TOTAL


In [54]:
# Get the 2020 presidential election votes for Biden in the swing states.
swing = ['GEORGIA', 'PENNSYLVANIA', 'MICHIGAN', 'ARIZONA', 'WISCONSIN',
         'MINNESOTA', 'COLORADO', 'NORTH CAROLINA', 'OHIO', 'FLORIDA']
biden = vote[(vote['year'] == 2020) & (vote['office'] == 'PRESIDENT')
             & (vote['candidate'] == 'JOSEPH R BIDEN JR') & (vote['state'].isin(swing))]
biden.head()


Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode
50930,2020,ARIZONA,AZ,APACHE,4001.0,PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,16460.0,35172.0,20210622,EARLY VOTE
50931,2020,ARIZONA,AZ,APACHE,4001.0,PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,6539.0,35172.0,20210622,ELECTION DAY
50932,2020,ARIZONA,AZ,APACHE,4001.0,PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,294.0,35172.0,20210622,PROVISIONAL
50945,2020,ARIZONA,AZ,COCHISE,4003.0,PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,21563.0,60442.0,20210622,EARLY VOTE
50946,2020,ARIZONA,AZ,COCHISE,4003.0,PRESIDENT,JOSEPH R BIDEN JR,DEMOCRAT,1495.0,60442.0,20210622,ELECTION DAY


In [55]:
# Summarize by swing state.
# Note: 'n_counties' counts rows and 'totalvotes' is summed over rows, so states
# that report one row per voting mode (e.g. ARIZONA) are counted once per mode.
biden_summary = (
    biden.groupby(['state', 'candidate'])
    .agg(candidatevotes=('candidatevotes', 'sum'),
         totalvotes=('totalvotes', 'sum'),
         n_counties=('county_name', 'count'))
    .reset_index()
)
biden_summary['result'] = biden_summary['candidatevotes'] / biden_summary['totalvotes']
biden_summary



Unnamed: 0,state,candidate,candidatevotes,totalvotes,n_counties,result
0,ARIZONA,JOSEPH R BIDEN JR,1672143.0,10155882.0,45,0.164648
1,COLORADO,JOSEPH R BIDEN JR,1804352.0,3256980.0,64,0.553995
2,FLORIDA,JOSEPH R BIDEN JR,5297045.0,11067456.0,67,0.478615
3,GEORGIA,JOSEPH R BIDEN JR,2474507.0,19993928.0,636,0.123763
4,MICHIGAN,JOSEPH R BIDEN JR,2804040.0,5539302.0,83,0.506208
5,MINNESOTA,JOSEPH R BIDEN JR,1717077.0,3277171.0,87,0.523951
6,NORTH CAROLINA,JOSEPH R BIDEN JR,2684292.0,22099208.0,400,0.121466
7,PENNSYLVANIA,JOSEPH R BIDEN JR,3458229.0,6915283.0,67,0.500085
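
Some of these results look too low to be vote shares (ARIZONA at 0.164648, for example). A likely cause is that states reporting one row per voting mode repeat the county-wide `totalvotes` on every row, so summing it inflates the denominator. GEORGIA's 636 rows, four times its 159 counties, point the same way. The sketch below shows one possible correction: collapse the modes to one row per county first, then aggregate to the state. It assumes `totalvotes` is the county total repeated per mode, and it uses only the five ARIZONA rows shown earlier, so the numbers it produces are illustrative and not a state result.

```python
import pandas as pd

# Rows copied from the biden.head() output; COCHISE shows only two of its modes.
rows = pd.DataFrame({
    'state': ['ARIZONA'] * 5,
    'county_fips': [4001.0, 4001.0, 4001.0, 4003.0, 4003.0],
    'candidatevotes': [16460.0, 6539.0, 294.0, 21563.0, 1495.0],
    'totalvotes': [35172.0, 35172.0, 35172.0, 60442.0, 60442.0],
})

# Step 1: collapse voting modes to one row per county.
# Assumption: totalvotes is the county-wide total, repeated on each mode row.
per_county = (
    rows.groupby(['state', 'county_fips'])
    .agg(candidatevotes=('candidatevotes', 'sum'),
         totalvotes=('totalvotes', 'first'))
    .reset_index()
)

# Step 2: aggregate the counties to the state level.
per_state = per_county.groupby('state').agg(
    candidatevotes=('candidatevotes', 'sum'),
    totalvotes=('totalvotes', 'sum'),
    n_counties=('county_fips', 'nunique'),
)
per_state['result'] = per_state['candidatevotes'] / per_state['totalvotes']

# For comparison: the share obtained by summing totalvotes over every row.
naive_result = rows['candidatevotes'].sum() / rows['totalvotes'].sum()
print(per_state)
print(f"row-wise share: {naive_result:.4f}")
```

On these rows the row-wise share is less than half of the per-county share, which matches the pattern in the summary table.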


In [56]:
cols = ['state', 'candidatevotes', 'totalvotes']
biden = biden[cols]
biden


Unnamed: 0,state,candidatevotes,totalvotes
50930,ARIZONA,16460.0,35172.0
50931,ARIZONA,6539.0,35172.0
50932,ARIZONA,294.0,35172.0
50945,ARIZONA,21563.0,60442.0
50946,ARIZONA,1495.0,60442.0
...,...,...,...
63948,PENNSYLVANIA,45088.0,118478.0
63951,PENNSYLVANIA,9191.0,28089.0
63954,PENNSYLVANIA,72129.0,204697.0
63957,PENNSYLVANIA,4704.0,14858.0


In [57]:
# Get the number of counties recorded for a state in the summary table.
def get_ncounties(data, state):
    return data[data['state'] == state]['n_counties'].values[0]

get_ncounties(biden_summary, 'ARIZONA')



45

In [58]:
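# Randomly choose n row indices for a state from the original data.
# np.random.choice samples with replacement by default, so indices can repeat.
def get_indices(data, state, n):
    indices = data[data['state'] == state].index
    idx = np.random.choice(indices, size=n)
    return idx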
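
Because the draw is unseeded, the samples change on every run. If reproducibility matters, one option is a seeded variant like the sketch below. The name `get_indices_seeded`, the `seed` and `replace` parameters, and the toy frame are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def get_indices_seeded(data, state, n, seed=None, replace=True):
    """Hypothetical variant of get_indices with an explicit seed and replace flag."""
    rng = np.random.default_rng(seed)
    indices = data[data['state'] == state].index.to_numpy()
    return rng.choice(indices, size=n, replace=replace)

# Toy frame standing in for `biden` (values are made up for illustration).
toy = pd.DataFrame(
    {'state': ['ARIZONA'] * 6 + ['GEORGIA'] * 4,
     'candidatevotes': np.arange(10, dtype=float)},
    index=range(100, 110),
)

# The same seed gives the same draw; replace=False rules out repeated rows.
idx_a = get_indices_seeded(toy, 'ARIZONA', n=4, seed=0, replace=False)
idx_b = get_indices_seeded(toy, 'ARIZONA', n=4, seed=0, replace=False)
print(idx_a, idx_b)
```

Whether the initial sample should be drawn with or without replacement is a design choice; the notebook's `get_indices` keeps NumPy's default of sampling with replacement.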

In [61]:
# Draw one sample per state at sizes of 10%, 15%, ..., 50% of the state's row count.
biden_sample = {}
for state in biden_summary['state'].values:
    biden_sample[state] = {}
    n_counties = get_ncounties(biden_summary, state)
    for p in range(10, 55, 5):
        idx = get_indices(biden, state, n=int(n_counties * p / 100))
        # Store the sampled rows, keyed by the sample percentage.
        biden_sample[state][str(p)] = biden.loc[idx]

biden_sample

{'ARIZONA': {'10':          state  candidatevotes  totalvotes
  50960  ARIZONA         39430.0     73272.0
  50946  ARIZONA          1495.0     60442.0
  51035  ARIZONA        978457.0   2068144.0
  50976  ARIZONA          1400.0     27662.0,
  '15':          state  candidatevotes  totalvotes
  51066  ARIZONA          6425.0     51767.0
  50932  ARIZONA           294.0     35172.0
  50961  ARIZONA          4963.0     73272.0
  51005  ARIZONA           996.0      3685.0
  51080  ARIZONA        282457.0    520397.0
  51005  ARIZONA           996.0      3685.0,
  '20':          state  candidatevotes  totalvotes
  50946  ARIZONA          1495.0     60442.0
  51110  ARIZONA         10859.0     19556.0
  51051  ARIZONA          2943.0    104668.0
  51127  ARIZONA            72.0    143221.0
  50991  ARIZONA          1055.0     14995.0
  51051  ARIZONA          2943.0    104668.0
  51096  ARIZONA          9812.0    184974.0
  50961  ARIZONA          4963.0     73272.0
  50960  ARIZONA        