Data QA Case: ``mggg-states``
=============================

Below are the steps involved in performing automated data quality checks on ``mggg-states`` data. 

*Note:* the automated checks are not exhaustive; further manual checks are still required.

Step 0. Setup
----------------

In [1]:
# !pip3 install numpy
# !pip3 install pandas
# !pip3 install geopandas
# !pip3 install wikipedia

# !pip3 install git+https://github.com/KeiferC/gdutils.git

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import json
import wikipedia
import os

import gdutils.datamine as dm
import gdutils.dataqa as dq
import gdutils.extract as et

from typing import Any, List, Tuple, Dict, Hashable, Union, NoReturn

Step 1. Data collection
---------------------------

In [3]:
state_names = [
    'Alabama', 'Alaska','Arizona', 'Arkansas', 'California', 
    'Colorado', 'Connecticut', 'Delaware',  'Florida', 'Georgia', 
    'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 
    'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 
    'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 
    'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 
    'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 
    'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 
    'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 
    'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
state_abbreviations = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 
    'CO', 'CT', 'DE', 'FL', 'GA', 
    'HI', 'ID', 'IL', 'IN', 'IA', 
    'KS', 'KY', 'LA', 'ME', 'MD', 
    'MA', 'MI', 'MN', 'MS', 'MO', 
    'MT', 'NE', 'NV', 'NH', 'NJ', 
    'NM', 'NY', 'NC', 'ND', 'OH', 
    'OK', 'OR', 'PA', 'RI', 'SC', 
    'SD', 'TN', 'TX', 'UT', 'VT', 
    'VA', 'WA', 'WV', 'WI', 'WY']

states = list(zip(state_names, state_abbreviations))

__Step 1.1.__ Gather ``mggg-states`` data

In [4]:
# dm.clone_gh_repos(account='mggg-states', account_type='orgs', 
#                   outpath=os.path.join('qafiles', 'mggg'))
#     # this will take some time to complete

In [5]:
mggg_gdfs = {}

for filepath in dm.list_files_of_type('.zip', os.path.join('qafiles', 'mggg')):
    mggg_gdfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()
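
The loop above keys each table by its file name with the last four characters (``.zip``) sliced off. A minimal sketch of the same idea using ``os.path.splitext``, which does not depend on the extension length; ``dataset_key`` and the example file names are hypothetical and not part of ``gdutils``:

```python
import os


def dataset_key(filepath: str) -> str:
    """Return the file name without its directory or final extension."""
    return os.path.splitext(os.path.basename(filepath))[0]


# Hypothetical file names, for illustration only
print(dataset_key(os.path.join('qafiles', 'mggg', 'AK-shapefiles.zip')))
print(dataset_key(os.path.join('qafiles', 'medsl', 'precinct_2018.zip')))
```

For ``.zip`` files this yields the same keys as the ``[:-4]`` slice.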

__Step 1.2.__ Gather MEDSL data for comparison purposes

In [None]:
# print('{:27} : {}'.format('Repo Name', 'Repo URL'))
# print('------------------------------------------------------------------')
# for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
#     print("{:27} : {}".format(repo, url))

In [7]:
medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
               '2018-elections-official'] # constituency-level 2018 election results
    
# dm.clone_gh_repos(account='MEDSL', account_type='orgs', repos=medsl_repos, 
#                   outpath=os.path.join('qafiles', 'medsl'))
#     # this will take some time to complete

In [8]:
medsl_dfs = {}

for filepath in dm.list_files_of_type('.zip', os.path.join('qafiles', 'medsl')):
    medsl_dfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()

__Step 1.3.__ Gather Wikipedia data for comparison purposes

In [9]:
# Generate wikipedia page titles
pres_election = [('PRES', 'United States presidential election')]
fed_elections = [('SEN', 'United States Senate election')]
               # ('USH', 'United States House of Representatives election')]
election_years_to_check = [2016, 2018]

def generate_key(yr, ekey):
    return ekey + str(yr % 100)

def generate_title(yr, etype):
    return str(yr) + ' ' + etype

wiki_titles = []
for yr in election_years_to_check:
    if yr % 4 == 0:  # presidential election year
        wiki_titles.append((generate_key(yr, pres_election[0][0]),
                            generate_title(yr, pres_election[0][1])))

    # one (key, title) pair per federal election type and state
    for ekey, etype in fed_elections:
        for st, st_abv in states:
            wiki_titles.append((generate_key(yr, ekey) + '_' + st_abv,
                                generate_title(yr, etype) + ' in ' + st))
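
For reference, the key and title format produced by the cell above can be sketched as two standalone functions. This is an illustration only: ``make_wiki_key`` and ``make_wiki_title`` are hypothetical names, and the key is zero-padded here, which differs from ``str(yr % 100)`` for years ending in 00 through 09.

```python
def make_wiki_key(year: int, election_key: str, state_abv: str = '') -> str:
    """Build a key such as 'SEN16_AL' (or 'PRES16' when no state is given)."""
    key = '{}{:02d}'.format(election_key, year % 100)
    return key + '_' + state_abv if state_abv else key


def make_wiki_title(year: int, election_type: str, state: str = '') -> str:
    """Build a page title such as '2016 United States Senate election in Alabama'."""
    title = '{} {}'.format(year, election_type)
    return title + ' in ' + state if state else title


print(make_wiki_key(2016, 'SEN', 'AL'))
print(make_wiki_title(2016, 'United States Senate election', 'Alabama'))
```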

In [10]:
# Gather wikipedia page URLs
wiki_urls = {}
for wiki_title in wiki_titles:
    key, title = wiki_title
    try:
        wiki_urls[key] = (title, wikipedia.page(title=title).url)
    except Exception as e:
        continue

In [None]:
# Print retrieved page URLs
# ^ necessary for verifying URL - election mapping
# Wikipedia API tries to find best match, not exact match
for wiki_key in wiki_urls:
    title, url = wiki_urls[wiki_key]
    print('{:10} : {}\n\t{}'.format(wiki_key, title, url))
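
Because the lookup returns a best match, a requested title can resolve to a different page (the displayed tables below include some dated 2020, for example). One possible way to flag such cases automatically is to compare the requested title with the last path segment of the returned URL. This is a sketch under the assumption that page URLs have the form ``.../wiki/Page_title``; ``url_matches_title`` is a hypothetical helper, and pages that redirect would still need a manual look.

```python
from urllib.parse import unquote, urlparse


def _normalize(text: str) -> str:
    """Treat underscores as spaces and ignore case and outer whitespace."""
    return text.replace('_', ' ').strip().casefold()


def url_matches_title(requested_title: str, url: str) -> bool:
    """Return True if the URL's final path segment equals the requested title."""
    slug = unquote(urlparse(url).path.rsplit('/', 1)[-1])
    return _normalize(slug) == _normalize(requested_title)


# With the wiki_urls dict built above, mismatches could be listed like this:
# mismatches = {key: pair for key, pair in wiki_urls.items()
#               if not url_matches_title(*pair)}
```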

In [None]:
# Gather wikipedia tabular election results
wiki_tables = {}
for wiki_key in wiki_urls:
    try:
        wiki_tables[wiki_key] = pd.read_html(wiki_urls[wiki_key][1])
    except Exception as e:
        print("Unable to gather Wikipedia tabular data:", e)

In [13]:
# Display wikipedia tabular election data
for wiki in wiki_tables:
    print('================================================')
    print('Wiki: {} '.format(wiki))
    print('================================================')
    
    for i in range(len(wiki_tables[wiki])):
        print('TABLE {}: ############################\n{}\n\n\n'.format(
                i, wiki_tables[wiki][i].head()))

Wiki: SEN16_AL 
TABLE 0: ############################
                                                   0  \
0                                                NaN   
1                     ← 2010 November 8, 2016 2022 →   
2                                             ← 2010   
3  Nominee Richard Shelby Ron Crumpton Party Repu...   
4                                                NaN   

                                                   1       2   3  
0                                                NaN     NaN NaN  
1                     ← 2010 November 8, 2016 2022 →     NaN NaN  
2                                   November 8, 2016  2022 → NaN  
3  Nominee Richard Shelby Ron Crumpton Party Repu...     NaN NaN  
4                                                NaN     NaN NaN  



TABLE 1: ############################
        0                 1       2
0  ← 2010  November 8, 2016  2022 →



TABLE 2: ############################
              0               1             2   3
0  


TABLE 1: ############################
        0                 1       2
0  ← 2010  November 8, 2016  2022 →



TABLE 2: ############################
              0              1                2
0           NaN            NaN              NaN
1     Candidate  Kamala Harris  Loretta Sanchez
2         Party     Democratic       Democratic
3  Popular vote        7542753          4701417
4    Percentage          61.6%            38.4%



TABLE 3: ############################
                                                   0  \
0  U.S. senator before election Barbara Boxer Dem...   

                                               1  
0  Elected U.S. Senator Kamala Harris Democratic  



TABLE 4: ############################
                             Elections in California
0  Federal government U.S. President 1852 1856 18...
1                                     U.S. President
2  1852 1856 1860 1864 1868 1872 1876 1880 1884 1...
3                                        U.S. Senat

TABLE 20: ############################
  Party      Party.1                          Candidate    Votes       %  \
0   NaN   Democratic  Richard Blumenthal (incumbent)[a]  1008714  63.19%   
1   NaN   Republican                         Dan Carter   552621  34.62%   
2   NaN  Libertarian                       Richard Lion    18190   1.14%   
3   NaN        Green                    Jeffery Russell    16713   1.05%   
4   NaN  Independent             Andrew Rule (write-in)       26   0.00%   

        ±  
0  +8.03%  
1  -8.60%  
2     NaN  
3     NaN  
4     NaN  



TABLE 21: ############################
  vte(2015 ←) 2016 United States elections (→ 2017)  \
0                                     U.S.President   
1                                        U.S.Senate   
2                                         U.S.House   
3                                         Governors   
4                                            Mayors   

  vte(2015 ←) 2016 United States elections (→ 2017).1  
0  


TABLE 7: ############################
          1998
0  Amendment 2



TABLE 8: ############################
               Mayoral elections
0  2010 (special) 2012 2016 2020



TABLE 9: ############################
  Party     Party.1                 Candidate   Votes       %
0   NaN  Democratic  Brian Schatz (incumbent)  162891  86.17%
1   NaN  Democratic        Makani Christensen   11898   6.29%
2   NaN  Democratic           Miles Shiratori    8620   4.56%
3   NaN  Democratic              Arturo Reyes    3819   2.02%
4   NaN  Democratic          Tutz Honeychurch    1815   0.96%



TABLE 10: ############################
         Party      Party.1         Candidate  Votes        %
0          NaN   Republican      John Carroll  26747   74.58%
1          NaN   Republican      John P. Roco   3956   11.03%
2          NaN   Republican  Karla Gottschalk   3045    8.49%
3          NaN   Republican   Eddie Pirkowski   2114    5.89%
4  Total votes  Total votes       Total votes  35862  100.0

TABLE 23: ############################
           Poll source Date(s)administered  Samplesize Margin oferror  \
0  Bellwether Research     May 11–15, 2016         600         ± 4.0%   
1           WTHR/Howey   April 18–21, 2016         500         ± 4.3%   

  ToddYoung (R) BaronHill (D) Undecided  
0           36%           22%       30%  
1           48%           30%       22%  



TABLE 24: ############################
  Poll source Date(s)administered  Samplesize Margin oferror  \
0  WTHR/Howey   April 18–21, 2016         500         ± 4.0%   

  MarlinStutzman (R) BaronHill (D) Undecided  
0                39%           36%       25%  



TABLE 25: ############################
         Party      Party.1                         Candidate    Votes  \
0          NaN   Republican                        Todd Young  1423991   
1          NaN   Democratic                         Evan Bayh  1158947   
2          NaN  Libertarian                      Lucy Brenton   149481   
3          N


TABLE 19: ############################
                      Pollsource          Date(s)administered  Samplesize  \
0  Anzalone Liszt Grove Research  August 29–September 1, 2016         605   
1                      SurveyUSA              March 4–8, 2016         600   

  Margin oferror JohnNeelyKennedy (R) CarolineFayard (D) Undecided  
0         ± 4.0%                  49%                38%       13%  
1         ± 4.1%                  54%                34%       12%  



TABLE 20: ############################
                      Pollsource          Date(s)administered  Samplesize  \
0  Anzalone Liszt Grove Research  August 29–September 1, 2016         605   

  Margin oferror DavidDuke (R) CarolineFayard (D) Undecided  
0         ± 4.0%           15%                64%       21%  



TABLE 21: ############################
  Pollsource Date(s)administered  Samplesize Margin oferror  \
0  SurveyUSA     March 4–8, 2016         600         ± 4.1%   

  CharlesBoustany (R) JohnNeely


TABLE 13: ############################
                                Hypothetical polling           Unnamed: 1  \
0  Poll source Date(s)administered Samplesize Mar...                  NaN   
1                                        Poll source  Date(s)administered   
2                           Remington Research Group         January 2015   
3                           Remington Research Group   February 2–3, 2015   

   Unnamed: 2      Unnamed: 3 Unnamed: 4   Unnamed: 5 Unnamed: 6 Unnamed: 7  
0         NaN             NaN        NaN          NaN        NaN        NaN  
1  Samplesize  Margin oferror   RoyBlunt  JohnBrunner      Other  Undecided  
2        1355             ± ?        60%          40%          —          —  
3         747            3.6%        50%          19%          —        32%  



TABLE 14: ############################
                Poll source Date(s)administered  Samplesize Margin oferror  \
0  Remington Research Group        January 2015        1355     


TABLE 19: ############################
  Poll source Date(s)administered  Samplesize Margin oferror JimRubens (R)  \
0    WMUR/UNH  August 20–28, 2016         433         ± 4.7%           27%   
1    WMUR/UNH     July 9–18, 2016         469         ± 4.2%           30%   
2    WMUR/UNH    April 7–17, 2016         553         ± 4.2%           30%   

  MaggieHassan (D) Other Undecided  
0              51%    8%       14%  
1              48%    6%       16%  
2              46%     —       24%  



TABLE 20: ############################
             Poll source Date(s)administered  Samplesize Margin oferror  \
0  Public Policy Polling    April 9–13, 2015         747         ± 3.6%   

  OvideLamontagne (R) MaggieHassan (D) Undecided  
0                 35%              54%       11%  



TABLE 21: ############################
             Poll source Date(s)administered  Samplesize Margin oferror  \
0  Public Policy Polling    April 9–13, 2015         747         ± 3.6%   

  OvideLamo


TABLE 14: ############################
  Party         Party.1                Candidate   Votes       %       ±
0   NaN      Republican  John Hoeven (incumbent)  268788  78.48%  +2.40%
1   NaN  Democratic-NPL          Eliot Glassheim   58116  16.97%  -5.20%
2   NaN     Libertarian         Robert Marquette   10556   3.08%  +1.45%
3   NaN     Independent           James Germalic    4675   1.36%     NaN
4   NaN             NaN                Write-ins     366   0.11%     NaN



TABLE 15: ############################
  vte(2015 ←) 2016 United States elections (→ 2017)  \
0                                     U.S.President   
1                                        U.S.Senate   
2                                         U.S.House   
3                                         Governors   
4                                            Mayors   

  vte(2015 ←) 2016 United States elections (→ 2017).1  
0  Alabama Alaska American Samoa Arizona Arkansas...   
1  Alabama Alaska Arizona Arkansas Ca


TABLE 37: ############################
             Poll source Date(s)administered  Samplesize Margin oferror  \
0  Public Policy Polling     May 21–24, 2015         799         ± 3.5%   

  PatToomey (R) SethWilliams (D) Other Undecided  
0           44%              33%     —       23%  



TABLE 38: ############################
         Party          Party.1               Candidate            Votes  \
0          NaN       Republican  Pat Toomey (incumbent)          2951702   
1          NaN       Democratic           Katie McGinty          2865012   
2          NaN      Libertarian  Edward T. Clifford III           235142   
3  Total votes      Total votes             Total votes      '6,051,856'   
4          NaN  Republican hold         Republican hold  Republican hold   

                 %                ±  
0           48.77%           -2.24%  
1           47.34%           -1.65%  
2            3.89%              NaN  
3         '100.0%'              NaN  
4  Republican hold

TABLE 34: ############################
         Party      Party.1       Candidate  Votes        %
0          NaN  Libertarian    Wes Benedict    NaN      NaN
1          NaN  Libertarian  Kerry McKennon    NaN      NaN
2  Total votes  Total votes     Total votes    NaN  100.00%



TABLE 35: ############################
                           Source   Ranking          As of
0  The Cook Political Report[114]  Likely R  July 23, 2020
1           Inside Elections[115]    Lean R  July 10, 2020
2      Sabato's Crystal Ball[116]  Likely R   July 9, 2020
3                  Daily Kos[117]  Likely R  July 22, 2020
4                   Politico[118]    Lean R   July 6, 2020



TABLE 36: ############################
                                     John Cornyn (R)
0  Federal officials Donald Trump, President of t...



TABLE 37: ############################
                                        MJ Hegar (D)
0  Federal officials Catherine Cortez Masto, U.S....



TABLE 38: ################


TABLE 16: ############################
                    Poll source Date(s)administered  Samplesize[a]  \
0  Public Policy Polling (D)[A]    June 13–14, 2017            762   

  Marginof error Shelley Moore Capito (R) GenericDemocrat Undecided  
0         ± 3.4%                      48%             35%       17%  



TABLE 17: ############################
         Party      Party.1                         Candidate  Votes        %  \
0          NaN   Republican  Shelley Moore Capito (incumbent)    NaN      NaN   
1          NaN   Democratic             Paula Jean Swearengin    NaN      NaN   
2          NaN  Independent                    Franklin Riley    NaN      NaN   
3  Total votes  Total votes                       Total votes    NaN  100.00%   

    ±  
0 NaN  
1 NaN  
2 NaN  
3 NaN  



TABLE 18: ############################
  vte(2019 ←) 2020 United States elections (→ 2021)  \
0                                     U.S.President   
1                                      


TABLE 33: ############################
                                   James Bradley (R)
0  Individuals Carl DeMaio, former San Diego city...



TABLE 34: ############################
                                       Erin Cruz (R)
0  Individuals Marco Gutierrez, co-founder of Lat...



TABLE 35: ############################
                                  Patrick Little (R)
0  Politicians David Duke, white nationalist and ...



TABLE 36: ############################
                            Derrick Michael Reid (L)
0  Organizations Libertarian Party of California[...



TABLE 37: ############################
                     John Thompson Parker (PFP)
0  Organizations Green Party of California[115]



TABLE 38: ############################
  Campaign finance reports as of May 16, 2018                 \
                                    Candidate Total receipts   
0                        Dianne Feinstein (D)     $9,953,612   
1                           Kevin de L


TABLE 11: ############################
  Party     Party.1           Candidate  Votes       %
0   NaN  Republican          Ron Curtis   6370  23.73%
1   NaN  Republican   Consuelo Anderson   5172  19.26%
2   NaN  Republican   Robert C. Helsham   3988  14.85%
3   NaN  Republican     Thomas E. White   3657  13.62%
4   NaN  Republican  Rocky De La Fuente   3065  11.42%



TABLE 12: ############################
         Party      Party.1             Candidate  Votes       %
0          NaN  Nonpartisan  Arturo Pacheco Reyes    441  38.02%
1          NaN  Nonpartisan       Charles Haverty    416  35.86%
2          NaN  Nonpartisan   Matthew K. Maertens    303  26.12%
3  Total votes  Total votes           Total votes   1160    100%



TABLE 13: ############################
                          Source   Ranking             As of
0  The Cook Political Report[20]   Solid D  October 26, 2018
1           Inside Elections[21]   Solid D  November 1, 2018
2      Sabato's Crystal Ball[22]    Sa

TABLE 29: ############################
         Party      Party.1                     Candidate    Votes        %  \
0          NaN   Democratic  Elizabeth Warren (incumbent)  1633371   60.34%   
1          NaN   Republican                   Geoff Diehl   979210   36.17%   
2          NaN  Independent               Shiva Ayyadurai    91710    3.39%   
3          NaN     Write-in                      Write-in     2799    0.10%   
4  Total votes  Total votes                   Total votes  2707090  100.00%   

         ±  
0   +6.60%  
1  -10.02%  
2      NaN  
3      NaN  
4      NaN  



TABLE 30: ############################
                                 vteElizabeth Warren  \
0  U.S. Senator from Massachusetts (2013–present)...   
1                                           Politics   
2                                          Campaigns   
3                                             Family   
4                                              Books   

                             


TABLE 12: ############################
         Party      Party.1                 Candidate   Votes       %
0          NaN   Republican  Roger Wicker (incumbent)  130118  82.79%
1          NaN   Republican          Richard Boyanton   27052  17.21%
2  Total votes  Total votes               Total votes  157170    100%



TABLE 13: ############################
                                       Jensen Bohren
0  Organizations Jackpine Radicals[30] Vote STEM[...



TABLE 14: ############################
                                         David Baria
0  Organizations End Citizens United[34] Mississi...



TABLE 15: ############################
         Poll source Date(s)administered  Samplesize DavidBaria JensenBohren  \
0  Triumph Campaigns   April 10–11, 2018         446         7%           4%   

  OmeriaScott HowardSherman Undecided  
0          9%            2%       79%  



TABLE 16: ############################
                                Hypothetical polling       


TABLE 16: ############################
                           Poll source   Date(s)administered  Samplesize  \
0                         DFM Research   October 23–27, 2018         683   
1             Grassroots Targeting (R)        Late June 2018        1000   
2  Meeting Street Research (R-Fischer)   January 24–28, 2018         500   
3   Public Policy Polling (D-Raybould)  November 10–12, 2017        1190   

  Marginof error DebFischer (R) JaneRaybould (D) Undecided  
0         ± 3.8%            54%              39%        7%  
1         ± 3.1%            63%              28%         –  
2              –            51%              34%         –  
3         ± 2.8%            42%              31%       27%  



TABLE 17: ############################
                                Hypothetical polling           Unnamed: 1  \
0  Poll source Date(s)administered Samplesize Mar...                  NaN   
1                                        Poll source  Date(s)administered   
2


TABLE 11: ############################
  Mayoral elections
0    2009 2013 2017



TABLE 12: ############################
                              Kirsten Gillibrand (D)
0  Individuals Amy Schumer, actress[11] Organizat...



TABLE 13: ############################
                                    Chele Farley (R)
0  U.S. President Donald Trump, 45th President of...



TABLE 14: ############################
                          Source    Ranking               As of
0  The Cook Political Report[47]    Solid D  September 29, 2017
1           Inside Elections[48]    Solid D  September 28, 2018
2      Sabato's Crystal Ball[49]     Safe D  September 27, 2017
3                   Fox News[50]  Likely D†        July 9, 2018
4                        CNN[51]    Solid D       July 12, 2018



TABLE 15: ############################
                                 Poll source            Date(s)administered  \
0                               Research Co.             November 1–3, 2018  


TABLE 8: ############################
                                Hypothetical polling           Unnamed: 1  \
0  Poll source Date(s)administered Samplesize Mar...                  NaN   
1                                        Poll source  Date(s)administered   
2                        ALG Research (D-Whitehouse)       May 7–14, 2018   

   Unnamed: 2      Unnamed: 3         Unnamed: 4     Unnamed: 5 Unnamed: 6  
0         NaN             NaN                NaN            NaN        NaN  
1  Samplesize  Marginof error  SheldonWhitehouse  LincolnChafee  Undecided  
2         329          ± 5.5%                72%            14%        14%  



TABLE 9: ############################
                   Poll source Date(s)administered  Samplesize Marginof error  \
0  ALG Research (D-Whitehouse)      May 7–14, 2018         329         ± 5.5%   

  SheldonWhitehouse LincolnChafee Undecided  
0               72%           14%       14%  



TABLE 10: ############################
      


TABLE 28: ############################
     Poll source      Date(s)administered  Samplesize Marginof error  \
0  JMC Analytics  March 18–March 20, 2017         625         ± 3.9%   

  OrrinHatch (R) EvanMcMullin (I) GenericDemocrat Other Undecided  
0            29%              33%             11%   10%       17%  



TABLE 29: ############################
              Poll source            Date(s)administered  Samplesize  \
0  Dan Jones & Associates  August 30 – September 5, 2017         608   

  Marginof error ChrisStewart (R) JennyWilson (D) Undecided  
0         ± 4.0%              34%             30%       36%  



TABLE 30: ############################
              Poll source            Date(s)administered  Samplesize  \
0  Dan Jones & Associates  August 30 – September 5, 2017         608   

  Marginof error MattHolland (R) JennyWilson (D) Undecided  
0         ± 4.0%             23%             30%       47%  



TABLE 31: ############################
  Party          


TABLE 1: ############################
        0                 1       2
0  ← 2012  November 6, 2018  2024 →



TABLE 2: ############################
              0              1            2   3
0           NaN            NaN          NaN NaN
1       Nominee  Tammy Baldwin  Leah Vukmir NaN
2         Party     Democratic   Republican NaN
3  Popular vote        1472914      1184885 NaN
4    Percentage          55.4%        44.5% NaN



TABLE 3: ############################
                                                   0  \
0  U.S. senator before election Tammy Baldwin Dem...   

                                               1  
0  Elected U.S. Senator Tammy Baldwin Democratic  



TABLE 4: ############################
                              Elections in Wisconsin
0  Federal government Presidential elections 1848...
1                             Presidential elections
2  1848 1852 1856 1860 1864 1868 1872 1876 1880 1...
3                             Presidential primarie
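
The tables above mix plain numbers with quoted or formatted strings (for example ``'6,051,856'`` and ``63.19%``) and with non-numeric rows such as ``Republican hold``. A minimal sketch of how such cells might be normalized before comparison; ``parse_votes`` and ``parse_percent`` are hypothetical helpers, not part of ``gdutils``:

```python
import math
import re
from typing import Optional


def parse_votes(value) -> Optional[int]:
    """Convert a vote cell to int; return None for NaN or non-numeric text."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return None if math.isnan(value) else int(value)
    digits = re.sub(r'\D', '', str(value))  # drop quotes, commas, spaces
    return int(digits) if digits else None


def parse_percent(value) -> Optional[float]:
    """Convert a cell such as '63.19%' to float; return None if not numeric."""
    text = str(value).strip().strip("'").rstrip('%')
    try:
        number = float(text)
    except ValueError:
        return None
    return None if math.isnan(number) else number
```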

In [14]:
# Manually gather found tabular election results
wiki_dfs = {}

# wiki_pres16_by_state = [{state_abv : table} for _______] # TODO

# has to be done manually because every wikipedia page is different

# wiki_dfs['PRES16_AL'] = wiki_pres16_by_state['AL']
# wiki_dfs['PRES16_AK'] = wiki_pres16_by_state['AK']
# wiki_dfs['PRES16_AZ'] = wiki_pres16_by_state['AZ']
# wiki_dfs['PRES16_AR'] = wiki_pres16_by_state['AR']
# wiki_dfs['PRES16_CA'] = wiki_pres16_by_state['CA']
# wiki_dfs['PRES16_CO'] = wiki_pres16_by_state['CO']
# wiki_dfs['PRES16_CT'] = wiki_pres16_by_state['CT']
# wiki_dfs['PRES16_DE'] = wiki_pres16_by_state['DE']
# wiki_dfs['PRES16_FL'] = wiki_pres16_by_state['FL']
# wiki_dfs['PRES16_GA'] = wiki_pres16_by_state['GA']
# wiki_dfs['PRES16_HI'] = wiki_pres16_by_state['HI']
# wiki_dfs['PRES16_ID'] = wiki_pres16_by_state['ID']
# wiki_dfs['PRES16_IL'] = wiki_pres16_by_state['IL']
# wiki_dfs['PRES16_IN'] = wiki_pres16_by_state['IN']
# wiki_dfs['PRES16_IA'] = wiki_pres16_by_state['IA']
# wiki_dfs['PRES16_KS'] = wiki_pres16_by_state['KS']
# wiki_dfs['PRES16_KY'] = wiki_pres16_by_state['KY']
# wiki_dfs['PRES16_LA'] = wiki_pres16_by_state['LA']
# wiki_dfs['PRES16_ME'] = wiki_pres16_by_state['ME']
# wiki_dfs['PRES16_MD'] = wiki_pres16_by_state['MD']
# wiki_dfs['PRES16_MA'] = wiki_pres16_by_state['MA']
# wiki_dfs['PRES16_MI'] = wiki_pres16_by_state['MI']
# wiki_dfs['PRES16_MN'] = wiki_pres16_by_state['MN']
# wiki_dfs['PRES16_MS'] = wiki_pres16_by_state['MS']
# wiki_dfs['PRES16_MO'] = wiki_pres16_by_state['MO']
# wiki_dfs['PRES16_MT'] = wiki_pres16_by_state['MT']
# wiki_dfs['PRES16_NE'] = wiki_pres16_by_state['NE']
# wiki_dfs['PRES16_NV'] = wiki_pres16_by_state['NV']
# wiki_dfs['PRES16_NH'] = wiki_pres16_by_state['NH']
# wiki_dfs['PRES16_NJ'] = wiki_pres16_by_state['NJ']
# wiki_dfs['PRES16_NM'] = wiki_pres16_by_state['NM']
# wiki_dfs['PRES16_NY'] = wiki_pres16_by_state['NY']
# wiki_dfs['PRES16_NC'] = wiki_pres16_by_state['NC']
# wiki_dfs['PRES16_ND'] = wiki_pres16_by_state['ND']
# wiki_dfs['PRES16_OH'] = wiki_pres16_by_state['OH']
# wiki_dfs['PRES16_OK'] = wiki_pres16_by_state['OK']
# wiki_dfs['PRES16_OR'] = wiki_pres16_by_state['OR']
# wiki_dfs['PRES16_PA'] = wiki_pres16_by_state['PA']
# wiki_dfs['PRES16_RI'] = wiki_pres16_by_state['RI']
# wiki_dfs['PRES16_SC'] = wiki_pres16_by_state['SC']
# wiki_dfs['PRES16_SD'] = wiki_pres16_by_state['SD']
# wiki_dfs['PRES16_TN'] = wiki_pres16_by_state['TN']
# wiki_dfs['PRES16_TX'] = wiki_pres16_by_state['TX']
# wiki_dfs['PRES16_UT'] = wiki_pres16_by_state['UT']
# wiki_dfs['PRES16_VT'] = wiki_pres16_by_state['VT']
# wiki_dfs['PRES16_VA'] = wiki_pres16_by_state['VA']
# wiki_dfs['PRES16_WA'] = wiki_pres16_by_state['WA']
# wiki_dfs['PRES16_WV'] = wiki_pres16_by_state['WV']
# wiki_dfs['PRES16_WI'] = wiki_pres16_by_state['WI']
# wiki_dfs['PRES16_WY'] = wiki_pres16_by_state['WY']

wiki_dfs['SEN16_AL'] = wiki_tables['SEN16_AL'][19]
wiki_dfs['SEN16_AK'] = wiki_tables['SEN16_AK'][20]
wiki_dfs['SEN16_AZ'] = wiki_tables['SEN16_AZ'][45]
wiki_dfs['SEN16_AR'] = wiki_tables['SEN16_AR'][16]
wiki_dfs['SEN16_CA'] = wiki_tables['SEN16_CA'][53]
wiki_dfs['SEN16_CO'] = wiki_tables['SEN16_CO'][25]
wiki_dfs['SEN16_CT'] = wiki_tables['SEN16_CT'][20]
wiki_dfs['SEN16_FL'] = wiki_tables['SEN16_FL'][64]
wiki_dfs['SEN16_GA'] = wiki_tables['SEN16_GA'][16]
wiki_dfs['SEN16_HI'] = wiki_tables['SEN16_HI'][18]
wiki_dfs['SEN16_ID'] = wiki_tables['SEN16_ID'][15]
wiki_dfs['SEN16_IL'] = wiki_tables['SEN16_IL'][29]
wiki_dfs['SEN16_IN'] = wiki_tables['SEN16_IN'][25]
wiki_dfs['SEN16_IA'] = wiki_tables['SEN16_IA'][20]
wiki_dfs['SEN16_KS'] = wiki_tables['SEN16_KS'][17]
wiki_dfs['SEN16_KY'] = wiki_tables['SEN16_KY'][22]
wiki_dfs['SEN16_LA'] = wiki_tables['SEN16_LA'][24]
wiki_dfs['SEN16_MD'] = wiki_tables['SEN16_MD'][29]
wiki_dfs['SEN16_MO'] = wiki_tables['SEN16_MO'][24]
wiki_dfs['SEN16_NV'] = wiki_tables['SEN16_NV'][32]
wiki_dfs['SEN16_NH'] = wiki_tables['SEN16_NH'][23]
wiki_dfs['SEN16_NY'] = wiki_tables['SEN16_NY'][15]
wiki_dfs['SEN16_NC'] = wiki_tables['SEN16_NC'][42]
wiki_dfs['SEN16_ND'] = wiki_tables['SEN16_ND'][14]
wiki_dfs['SEN16_OH'] = wiki_tables['SEN16_OH'][29]
wiki_dfs['SEN16_OK'] = wiki_tables['SEN16_OK'][12]
wiki_dfs['SEN16_OR'] = wiki_tables['SEN16_OR'][14]
wiki_dfs['SEN16_PA'] = wiki_tables['SEN16_PA'][38]
wiki_dfs['SEN16_SC'] = wiki_tables['SEN16_SC'][16]
wiki_dfs['SEN16_SD'] = wiki_tables['SEN16_SD'][9]
wiki_dfs['SEN16_UT'] = wiki_tables['SEN16_UT'][19]
wiki_dfs['SEN16_VT'] = wiki_tables['SEN16_VT'][12]
wiki_dfs['SEN16_WA'] = wiki_tables['SEN16_WA'][15]
wiki_dfs['SEN16_WI'] = wiki_tables['SEN16_WI'][21]

wiki_dfs['SEN18_AZ'] = wiki_tables['SEN18_AZ'][40]
# wiki_dfs['SEN18_CA'] = wiki_tables['SEN18_CA'][]
# wiki_dfs['SEN18_CT'] = wiki_tables['SEN18_CT'][]
# wiki_dfs['SEN18_DE'] = wiki_tables['SEN18_DE'][]
# wiki_dfs['SEN18_FL'] = wiki_tables['SEN18_FL'][]
# wiki_dfs['SEN18_HI'] = wiki_tables['SEN18_HI'][]
# wiki_dfs['SEN18_IN'] = wiki_tables['SEN18_IN'][]
# wiki_dfs['SEN18_ME'] = wiki_tables['SEN18_ME'][]
# wiki_dfs['SEN18_MD'] = wiki_tables['SEN18_MD'][]
# wiki_dfs['SEN18_MA'] = wiki_tables['SEN18_MA'][]
# wiki_dfs['SEN18_MI'] = wiki_tables['SEN18_MI'][]
# wiki_dfs['SEN18_MN'] = wiki_tables['SEN18_MN'][]
# wiki_dfs['SEN18_MS'] = wiki_tables['SEN18_MS'][]
# wiki_dfs['SEN18_MO'] = wiki_tables['SEN18_MO'][]
# wiki_dfs['SEN18_MT'] = wiki_tables['SEN18_MT'][]
# wiki_dfs['SEN18_NE'] = wiki_tables['SEN18_NE'][]
# wiki_dfs['SEN18_NV'] = wiki_tables['SEN18_NV'][]
# wiki_dfs['SEN18_NJ'] = wiki_tables['SEN18_NJ'][]
# wiki_dfs['SEN18_NM'] = wiki_tables['SEN18_NM'][]
# wiki_dfs['SEN18_NY'] = wiki_tables['SEN18_NY'][]
# wiki_dfs['SEN18_ND'] = wiki_tables['SEN18_ND'][]
# wiki_dfs['SEN18_OH'] = wiki_tables['SEN18_OH'][]
# wiki_dfs['SEN18_PA'] = wiki_tables['SEN18_PA'][]
# wiki_dfs['SEN18_RI'] = wiki_tables['SEN18_RI'][]
# wiki_dfs['SEN18_TN'] = wiki_tables['SEN18_TN'][]
# wiki_dfs['SEN18_TX'] = wiki_tables['SEN18_TX'][]
# wiki_dfs['SEN18_UT'] = wiki_tables['SEN18_UT'][]
# wiki_dfs['SEN18_VT'] = wiki_tables['SEN18_VT'][]
# wiki_dfs['SEN18_WA'] = wiki_tables['SEN18_WA'][]
# wiki_dfs['SEN18_WV'] = wiki_tables['SEN18_WV'][]
# wiki_dfs['SEN18_WI'] = wiki_tables['SEN18_WI'][]
# wiki_dfs['SEN18_WY'] = wiki_tables['SEN18_WY'][]

[Coming Soon] __Step 1.4.__ Gather Ballotpedia data for comparison purposes

*Note:* Depending on the response to the API access request, this step may need to be done manually.

Step 2. Data wrangling
---------------------------

__Step 2.1.__ Wrangle MEDSL data

In [15]:
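def pivot_medsl(medsl_dfs_dict: Dict[str, Union[pd.DataFrame, gpd.GeoDataFrame]]
        ) -> None: # Modifies medsl_dfs_dict in place
    """
    Pivots each MEDSL table in place so that each row is a precinct and each
    column is an 'office party' combination containing vote counts.

    Note: duplicate (precinct, office, party) entries are combined with
    pivot_table's default aggfunc, which is the mean.
    """
    for df in medsl_dfs_dict:
        medsl_pvt = medsl_dfs_dict[df].pivot_table(index='precinct',
                                                   columns=['office', 'party'],
                                                   values='votes')
        # Flatten the (office, party) MultiIndex into 'office party' names
        medsl_pvt.columns = [' '.join(col).strip() for col in medsl_pvt.columns.values]
        medsl_dfs_dict[df] = et.ExtractTable(medsl_pvt).extract()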

In [None]:
# View available data
for df in medsl_dfs:
    print('--------{}--------'.format(df))
    print(medsl_dfs[df].head())

In [None]:
# Extract relevant MEDSL election data
medsl_18_dfs = {}
medsl_pres16_dfs = {}
medsl_sen16_dfs = {}
medsl_ush16_dfs = {}

for state in states:
    st, st_abv = state
    try:
        medsl_18_dfs[st] = et.ExtractTable(medsl_dfs['precinct_2018'], 
                                           column='state', value=st).extract()
    except Exception as e:
        print('Missing data in medsl_18:', e)
    
    try:
        medsl_pres16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-president'], 
                                               column='state', value=st).extract()
    except Exception as e:
        print('Missing data in medsl_pres16:', e)
    
    try:
        medsl_sen16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-senate'], 
                                               column='state', value=st).extract()
    except Exception as e:
        print('Missing data in medsl_sen16:', e)
    
    try:
        medsl_ush16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-house'], 
                                               column='state', value=st).extract()
    except Exception as e:
        print('Missing data in medsl_ush16:', e)

        
pivot_medsl(medsl_18_dfs)
pivot_medsl(medsl_pres16_dfs)
pivot_medsl(medsl_sen16_dfs)
pivot_medsl(medsl_ush16_dfs)
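
The pivot performed in ``pivot_medsl`` can be illustrated on toy data. The sketch below is a minimal example with made-up rows, not actual MEDSL records. It shows how the ``(office, party)`` column pairs are flattened into single names, and how ``pivot_table``'s default aggregation (the mean) differs from ``aggfunc='sum'`` when a precinct has several rows for the same office and party. Whether the mean or the sum is appropriate for the real MEDSL tables depends on how those tables are structured, which should be checked manually.

```python
import pandas as pd

# Made-up rows: precinct P1 has two 'democrat' rows for the same office
toy = pd.DataFrame({
    'precinct': ['P1', 'P1', 'P1', 'P2', 'P2'],
    'office': ['US Senate'] * 5,
    'party': ['democrat', 'democrat', 'republican', 'democrat', 'republican'],
    'votes': [100, 50, 80, 30, 70],
})


def flatten_columns(pvt: pd.DataFrame) -> pd.DataFrame:
    """Join (office, party) column tuples into 'office party' strings."""
    out = pvt.copy()
    out.columns = [' '.join(col).strip() for col in out.columns.values]
    return out


# Default aggregation: duplicate entries are averaged
mean_pvt = flatten_columns(toy.pivot_table(
    index='precinct', columns=['office', 'party'], values='votes'))

# Explicit sum: duplicate entries are added together
sum_pvt = flatten_columns(toy.pivot_table(
    index='precinct', columns=['office', 'party'], values='votes',
    aggfunc='sum'))

print(mean_pvt)
print(sum_pvt)
```

In this toy case, the P1 ``US Senate democrat`` cell is 75 under the default aggregation and 150 under ``aggfunc='sum'``.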

Step 3. Data standardization check of ``mggg-states``
---------------------------------------------------------------

__Step 3.1.__ Generate naming conventions

In [18]:
with open('naming_convention.json') as json_file:
    standards_raw = json.load(json_file)
    
offices = dm.get_keys_by_category(standards_raw, 'offices')
parties = dm.get_keys_by_category(standards_raw, 'parties')
counts = dm.get_keys_by_category(standards_raw, 'counts')
others = dm.get_keys_by_category(standards_raw, 
            ['geographies', 'demographics', 'districts', 'other'])

elections = [office + format(year, '02') + party 
                 for office in offices
                 for year in range(0, 21)
                 for party in parties 
                 if not (office == 'PRES' and year % 4 != 0)]

counts = [count + format(year, '02') for count in counts 
                                     for year in range(0, 20)]

standards = elections + counts + others
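
To make the generated names concrete, here is a minimal sketch of the same list comprehension using hypothetical stand-in values for the office and party keys. The real keys come from ``naming_convention.json``, whose contents are not shown here.

```python
# Hypothetical stand-ins for the keys read from naming_convention.json
toy_offices = ['PRES', 'SEN']
toy_parties = ['D', 'R']

# Same pattern as above: office + two-digit year + party,
# with presidential columns only allowed in years divisible by 4
toy_elections = [office + format(year, '02') + party
                 for office in toy_offices
                 for year in range(0, 21)
                 for party in toy_parties
                 if not (office == 'PRES' and year % 4 != 0)]

print(toy_elections[:4])
print('PRES16D' in toy_elections, 'PRES18D' in toy_elections)
```

``format(year, '02')`` zero-pads single-digit years, so the year 4 becomes ``'04'``.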

__Step 3.2.__ Check ``mggg-states`` data compliance with naming conventions

In [19]:
naming_check = {}

for gdf in mggg_gdfs:
    naming_check[gdf] = dq.compare_column_names(mggg_gdfs[gdf], standards)

In [None]:
matched_columns = {}

# Print results
for gdf in naming_check:
    print('=========================================')
    print('Dataset: {}'.format(gdf))
    print('=========================================')
    
    (matches, diffs) = naming_check[gdf]
    matched_columns[gdf] = matches
    
    diffs = list(diffs)
    diffs.sort()
    
    print('Discrepancies from naming convention:')
    print(diffs)
    print()
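
For readers without ``gdutils`` installed, the idea behind the check can be sketched with plain sets. The helper below is hypothetical and is not the ``dq.compare_column_names`` implementation. It only assumes, based on how the results are unpacked above, that the check yields a pair of matching and non-matching column names.

```python
from typing import Iterable, Set, Tuple


def compare_names_sketch(columns: Iterable[str],
                         standards: Iterable[str]
        ) -> Tuple[Set[str], Set[str]]:
    """
    Hypothetical helper: returns (matches, diffs), where matches are column
    names found in the standards and diffs are column names that are not.
    """
    cols = set(columns)
    allowed = set(standards)
    return cols & allowed, cols - allowed


# Made-up column names for illustration
toy_matches, toy_diffs = compare_names_sketch(
    ['PRES16D', 'PRES16R', 'pres_16_other', 'geometry'],
    ['PRES16D', 'PRES16R', 'geometry'])

print(sorted(toy_matches))
print(sorted(toy_diffs))
```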

Step 4. Compare ``mggg-states`` data with external sources
----------------------------------------------------------

In [21]:
available_mggg_states = []
for state in states:
    state_name, state_abv = state
    
    # messy name matching because file naming isn't standardized
    mggg_gdf_names = [gdf_name for gdf_name in list(mggg_gdfs) 
                               if gdf_name.startswith(state_abv.lower() + '_') or
                                  gdf_name.startswith(state_abv + '_') or
                                  gdf_name.startswith(state_name.lower() + '_') or
                                  gdf_name.startswith(state_name.upper() + '_') or
                                  gdf_name.startswith(state_name + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_').lower() + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_').upper() + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_') + '_')]
                      
    available_mggg_states.append((state_name, mggg_gdf_names))
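
The chain of ``startswith`` checks above can be written more compactly, since ``str.startswith`` accepts a tuple of prefixes. The sketch below uses a hypothetical helper, ``state_prefixes``, that builds the same eight prefix variants. The dataset names in the example are made up.

```python
from typing import Tuple


def state_prefixes(state_name: str, state_abv: str) -> Tuple[str, ...]:
    """
    Hypothetical helper: builds the prefix variants checked above
    (abbreviation and full name, in several cases, each followed by '_').
    """
    underscored = state_name.replace(' ', '_')
    variants = {state_abv, state_abv.lower()}
    for name in (state_name, underscored):
        variants.update({name, name.lower(), name.upper()})
    return tuple(sorted(variant + '_' for variant in variants))


# Made-up dataset names for illustration
toy_names = ['ma_precincts', 'maine_2018', 'new_hampshire_vtds', 'NH_towns']

ma_matches = [name for name in toy_names
              if name.startswith(state_prefixes('Massachusetts', 'MA'))]
nh_matches = [name for name in toy_names
              if name.startswith(state_prefixes('New Hampshire', 'NH'))]

print(ma_matches)
print(nh_matches)
```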

In [None]:
# View available columns
for state in available_mggg_states:
    st, gdf_names = state
    if gdf_names:
        print('======== {} ========'.format(st))
        for name in gdf_names:
            print('{} --------'.format(name))
            print(sorted(list(matched_columns[name])))
            print()
        print()

__Step 4.1.__ Compare against MEDSL

In [23]:
# Generate Naming Convention Translations between MGGG and MEDSL
pres16_cols = [
    ('PRES16D', 'US President democratic'), 
    ('PRES16G', 'US President green'), 
    ('PRES16L', 'US President libertarian'), 
    ('PRES16R', 'US President republican')
]

sen16_cols = [
    ('SEN16D', 'US Senate democrat'),
    ('SEN16G', 'US Senate green'),
    ('SEN16L', 'US Senate libertarian'),
    ('SEN16R', 'US Senate republican')
]

ush16_cols = [
    ('USH16D', 'US House democratic'),
    ('USH16G', 'US House green'),
    ('USH16L', 'US House libertarian'),
    ('USH16R', 'US House republican')
]

fed18_cols = [
    ('SEN18D', 'US Senate democratic'),
    ('SEN18G', 'US Senate green'),
    ('SEN18L', 'US Senate libertarian'),
    ('SEN18R', 'US Senate republican'),
    ('USH18D', 'US House democrat'),
    ('USH18G', 'US House green'),
    ('USH18L', 'US House libertarian'),
    ('USH18R', 'US House republican')
]

In [None]:
def bulk_compare(bulk_results: Dict[str, Dict[str, List[Tuple[Hashable, Any]]]], 
                 st: str, 
                 mggg_names: List[str], 
                 medsls: Dict[str, pd.DataFrame], 
                 cols: List[Tuple[str, str]]
        ) -> None: # Updates bulk_results in place
    """
    Updates bulk_results in place. bulk_results is a dict containing state names 
    as keys and a value of a dict containing mggg_gdf names as keys and the 
    results of dq.compare_column_sums as values.
    
    """
    if st not in medsls:
        return

    mggg_results = bulk_results.setdefault(st, {})
    for mggg_name in mggg_names:
        # Only compare column pairs that exist in both datasets
        x = []
        y = []
        for mggg_col, medsl_col in cols:
            if (mggg_col in list(mggg_gdfs[mggg_name]) and 
                medsl_col in list(medsls[st])):
                x.append(mggg_col)
                y.append(medsl_col)

        # Keep results from earlier calls; otherwise start with an empty list
        mggg_results.setdefault(mggg_name, [])
        if x:
            try:
                # append results
                mggg_results[mggg_name] += dq.compare_column_sums(
                                                mggg_gdfs[mggg_name], 
                                                medsls[st], x, y)
            except Exception as e:
                print("Unable to compare {} and {} in {}. {}".format(
                            x, y, mggg_name, e))
            

results = {}

for state in available_mggg_states:
    st, mggg_names = state
    
    if mggg_names:
        bulk_compare(results, st, mggg_names, medsl_pres16_dfs, pres16_cols)
        bulk_compare(results, st, mggg_names, medsl_sen16_dfs, sen16_cols)
        bulk_compare(results, st, mggg_names, medsl_ush16_dfs, ush16_cols)
        bulk_compare(results, st, mggg_names, medsl_18_dfs, fed18_cols)

In [None]:
# Print comparison results
print("============================================================================")
print("Results of State-level Aggregation comparisons between mggg-states and MEDSL")
print("============================================================================")
print()
print('{:37} : {}'.format('mggg-states column [vs] MEDSL column', 'difference in sums'))
print('----------------------------------------------------------------------------')

for st in results:
    print("######## {} ########".format(st))
    for mggg_name in results[st]:
        print("{} ========".format(mggg_name))
        if results[st][mggg_name]:
            for col_v_col, diff in results[st][mggg_name]:
                print('{:37} : {}'.format(col_v_col, diff))
        else:
            print("No comparable columns found.")
        print()
    print()
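
As a rough illustration of what a state-level sum comparison involves, the sketch below compares column totals between two toy tables. It is a hypothetical stand-in, not the ``dq.compare_column_sums`` implementation. The label format and the use of an absolute difference are assumptions made for the example.

```python
import pandas as pd
from typing import List, Tuple


def compare_sums_sketch(left: pd.DataFrame,
                        right: pd.DataFrame,
                        left_cols: List[str],
                        right_cols: List[str]
        ) -> List[Tuple[str, float]]:
    """
    Hypothetical helper: pairs up columns by position and returns a list of
    (label, absolute difference between the two column sums).
    """
    results = []
    for l_col, r_col in zip(left_cols, right_cols):
        diff = abs(left[l_col].sum() - right[r_col].sum())
        results.append(('{} [vs] {}'.format(l_col, r_col), diff))
    return results


# Made-up vote counts for illustration
toy_mggg = pd.DataFrame({'PRES16D': [10, 20], 'PRES16R': [5, 15]})
toy_medsl = pd.DataFrame({'US President democratic': [12.0, 20.0],
                          'US President republican': [5.0, 15.0]})

toy_comparison = compare_sums_sketch(
    toy_mggg, toy_medsl,
    ['PRES16D', 'PRES16R'],
    ['US President democratic', 'US President republican'])

for label, diff in toy_comparison:
    print('{:37} : {}'.format(label, diff))
```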


__Don't panic just yet -- potential explanations for differences:__

- MGGG data may have columns with mixed datatypes: some columns contain strings where they should contain numbers
- Some shapefiles use VTDs and towns, which may differ from precincts
- Undercounts may be the result of excluding absentee ballots in the count
- Sums should be spot-checked manually in addition to the automated checks
- The comparisons are conducted on the intersection between mggg-states and MEDSL datasets and are limited to 2016 and 2018 data
- MEDSL float values indicate proration -- to investigate
- I also haven't rigorously examined the MEDSL data :P
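
One way to follow up on the mixed-datatype explanation is to look for values that cannot be read as numbers. The sketch below is a minimal example on made-up data, using ``pd.to_numeric`` with ``errors='coerce'``. The helper name ``non_numeric_report`` is hypothetical.

```python
import pandas as pd
from typing import Any, Dict, List

# Made-up data: PRES16R is stored as strings, one of which is not a number
toy = pd.DataFrame({'PRES16D': [120, 85, 40],
                    'PRES16R': ['98', '110', 'n/a']})


def non_numeric_report(df: pd.DataFrame) -> Dict[str, List[Any]]:
    """
    Hypothetical helper: maps each column name to the values in that column
    that cannot be converted to a number. Columns that convert cleanly are
    omitted.
    """
    report = {}
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # Values that became NaN only because of the conversion
        bad = converted.isna() & df[col].notna()
        if bad.any():
            report[col] = df.loc[bad, col].tolist()
    return report


toy_report = non_numeric_report(toy)
print(toy_report)

# Sum of the values that could be converted (unconvertible values are skipped)
coerced_sum = pd.to_numeric(toy['PRES16R'], errors='coerce').sum()
print(coerced_sum)
```

Coercing and summing silently drops the unconvertible values, so the report should be reviewed before trusting the coerced sum.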

[Coming Soon] __Step 4.2.__ Compare against Wikipedia

[Coming Soon] __Step 4.3.__ Compare against Ballotpedia

__Step 4.4.__ Data summation for manual checks against external sources

In [None]:
# Print MGGG summation results
print("============================================================================")
print("Results of State-level Aggregations - mggg-states")
print("============================================================================")
print()
print('{:11} : {:10}\t\t{}'.format('mggg-states column', 'sum', 'dtype'))
print('----------------------------------------------------------------------------')

for st, gdf_names in available_mggg_states:
    if gdf_names:
        print('######## {} ########'.format(st))
        for gdf_name in gdf_names:
            print("{} ========".format(gdf_name))

            cols_to_sum = [col for col in matched_columns[gdf_name] if col != 'geometry']
            col_sums = dq.sum_column_values(mggg_gdfs[gdf_name], cols_to_sum)

            for result in col_sums:
                col_name, col_sum = result
                if col_name in matched_columns[gdf_name]:
                    if isinstance(col_sum, str):
                        col_sum = "{} is a str".format(col_name)
                    print('{:11} : {:10}\t\t{}'.format(col_name, col_sum, str(type(col_sum))))
            print('\n\n')
    

In [None]:
# Print MEDSL summation results
def print_medsl_sums(st: str, 
                     medsl_name: str, 
                     medsls: Dict[str, Union[pd.DataFrame, gpd.GeoDataFrame]]
        ) -> None:
    if st in medsls:
        print("######## {} : {} ########".format(st, medsl_name))
        
        cols_to_sum = [col for col in list(medsls[st]) if col != 'geometry']
        try:
            col_sums = dq.sum_column_values(medsls[st], cols_to_sum)
        except Exception as e:
            # Report the failure instead of reusing stale or undefined sums
            print("Unable to sum columns: {}".format(e))
            col_sums = []
        
        for result in col_sums:
            col_name, col_sum = result
            print('{:65} : {}\t{}'.format(col_name, col_sum, str(type(col_sum))))

        print('\n\n')

        
print("============================================================================")
print("Results of State-level Aggregations - MEDSL")
print("============================================================================")
print()
print('{:65} : {}\t{}'.format('MEDSL column', 'sum', 'dtype'))
print('----------------------------------------------------------------------------')

for st, st_abv in states:
    print_medsl_sums(st, 'MEDSL 2018', medsl_18_dfs)
    print_medsl_sums(st, 'MEDSL PRES16', medsl_pres16_dfs)
    print_medsl_sums(st, 'MEDSL SEN16', medsl_sen16_dfs)
    print_medsl_sums(st, 'MEDSL USH16', medsl_ush16_dfs)
    

Step 5. Check topological soundness of ``mggg-states`` data
-----------------------------------------------------------------------

__Step 5.1.__ Check for empty geometries

In [28]:
topological_warnings = []
for gdf in mggg_gdfs:
    if any(mggg_gdfs[gdf]['geometry'].isna()):
        topological_warnings.append('{} has missing geometries.'.format(gdf))
        
    if any(mggg_gdfs[gdf]['geometry'].is_empty):
        topological_warnings.append('{} has empty geometries.'.format(gdf))

if not topological_warnings:
    print("No missing or empty geometries.")
else:
    for msg in topological_warnings:
        print(msg)

No missing or empty geometries.


Step 6. Cleanup
-------------------

In [29]:
# # Remove cloned repos
# dm.remove_repos('qafiles/')

In [30]:
# # Delete output dump directory
# !echo y | rm -rf ./qafiles

In [31]:
# # Uninstall installed python packages
# !echo y | pip3 uninstall numpy
# !echo y | pip3 uninstall pandas
# !echo y | pip3 uninstall geopandas
# !echo y | pip3 uninstall wikipedia

# !echo y | pip3 uninstall gdutils

In [32]:
# # Reset Jupyter Notebook IPython Kernel
# from IPython.core.display import HTML
# HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Next Steps
-------------

__Data Standardization__

- Manually evaluate column naming discrepancies to determine if changes are needed
- Manually evaluate column datatypes to determine if changes are needed

__Data Comparison__

- Manually investigate large differences found through comparing ``mggg-states`` data with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate? etc.) 
- For more accurate comparisons, compare ``mggg-states`` data with the data on each state's Secretary of State website

__Topological Soundness__

- Manually examine shapefiles for gaps and overlaps. *Note:* although gaps and overlaps are not necessarily indicators of inaccurate data (because some counties have precinct islands), they do mean that the data cannot be used for chain runs. 
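
As a starting point for the overlap check, the sketch below flags pairs of geometries whose intersection has non-zero area. It is a minimal illustration on toy ``shapely`` boxes, not part of ``gdutils``, and the helper name ``find_overlaps`` is hypothetical. The pairwise loop is quadratic in the number of geometries, so a spatial index would likely be needed for real precinct shapefiles. Gaps are not covered by this sketch.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

from shapely.geometry import box
from shapely.geometry.base import BaseGeometry


def find_overlaps(geoms: Sequence[BaseGeometry],
                  tol: float = 1e-9
        ) -> List[Tuple[int, int, float]]:
    """
    Hypothetical helper: returns (i, j, area) for every pair of geometries
    whose intersection area exceeds tol. Geometries that only share an edge
    or a point are not reported.
    """
    overlaps = []
    for (i, a), (j, b) in combinations(enumerate(geoms), 2):
        area = a.intersection(b).area
        if area > tol:
            overlaps.append((i, j, area))
    return overlaps


# Toy geometries: 0 and 1 share an edge; 2 overlaps both of them
toy_geoms = [box(0, 0, 1, 1), box(1, 0, 2, 1), box(0.5, 0.5, 1.5, 1.5)]
toy_overlaps = find_overlaps(toy_geoms)

for i, j, area in toy_overlaps:
    print('geometries {} and {} overlap (area {})'.format(i, j, area))
```

With a GeoDataFrame, the same helper could be called on ``list(gdf.geometry)``.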

__Data Documentation__

- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide scripts used and/or discuss the data wrangling/processing process?