Data QA Case: ``mggg-states``
=============================

Below are the steps involved in performing automated data quality checks on ``mggg-states`` data. Specifically, this notebook compares the vote counts in ``mggg-states`` with those in MEDSL's and Wikipedia's datasets for the following elections:

- 2016 United States presidential election
- 2016 United States Senate elections
- 2016 United States House of Representatives elections
- 2017 United States Senate elections (special elections)
- 2018 United States Senate elections
- 2018 United States House of Representatives elections

*Note:* The automated checks are not exhaustive; further manual checks are required.

Timestamp: 02:00 pm ET, 14 August 2020

Step 0. Setup
----------------

In [1]:
# Install useful Python packages

!pip3 install numpy
!pip3 install pandas
!pip3 install geopandas
!pip3 install wikipedia

!pip3 install git+https://github.com/mggg/gdutils.git

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import json # for parsing a json file
import wikipedia # unofficial Wikipedia package (wrapper of MediaWiki API)
import os # for ensuring file traversal works regardless of operating system
import re # for complex string pattern matching

import gdutils.datamine as dm # data-mining module from gdutils
import gdutils.dataqa as dq   # data QA module from gdutils
import gdutils.extract as et  # table extraction module from gdutils

from typing import Any, List, Tuple, Dict, Hashable, Union, NoReturn

Step 1. Collect data
---------------------------

In [3]:
state_names = [
    'Alabama', 'Alaska','Arizona', 'Arkansas', 'California', 
    'Colorado', 'Connecticut', 'Delaware',  'Florida', 'Georgia', 
    'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 
    'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 
    'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 
    'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 
    'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 
    'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 
    'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 
    'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

state_abbreviations = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 
    'CO', 'CT', 'DE', 'FL', 'GA', 
    'HI', 'ID', 'IL', 'IN', 'IA', 
    'KS', 'KY', 'LA', 'ME', 'MD', 
    'MA', 'MI', 'MN', 'MS', 'MO', 
    'MT', 'NE', 'NV', 'NH', 'NJ', 
    'NM', 'NY', 'NC', 'ND', 'OH', 
    'OK', 'OR', 'PA', 'RI', 'SC', 
    'SD', 'TN', 'TX', 'UT', 'VT', 
    'VA', 'WA', 'WV', 'WI', 'WY']

states = list(zip(state_names, state_abbreviations))

__Step 1.1.__ Gather ``mggg-states`` data.

In [4]:
# this will take some time to complete
dm.clone_gh_repos(account='mggg-states', account_type='orgs', 
                  outpath=os.path.join('src', 'mggg'))

In [5]:
mggg_gdfs = {}

# stores extracted GeoDataFrames in a dictionary where the keys are the names 
# of the files (minus the '.zip' extension) from which they were extracted
for filepath in dm.list_files_of_type('.zip', os.path.join('src', 'mggg')):
    mggg_gdfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()
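
The keys above are built by slicing the last four characters off each filename, which only works for four-character extensions such as `.zip`. As a minimal sketch, `os.path.splitext` from the standard library gives the same key without depending on the extension's length. The helper name `dataset_key` and the second sample path are illustrative and not part of the notebook.

```python
import os

def dataset_key(filepath: str) -> str:
    """Return the filename without its directory or final extension."""
    return os.path.splitext(os.path.basename(filepath))[0]

# Sample paths, used only to show the behaviour
print(dataset_key('src/mggg/GA_precincts.zip'))
print(dataset_key('src/mggg/example_towns.geojson'))
```

For `.zip` files this gives the same key as the `[:-4]` slice.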

__Step 1.2.__ Gather MEDSL data for comparison purposes.

In [6]:
# Print available MEDSL data to select applicable datasets

print('{:27} : {}'.format('Repo Name', 'Repo URL'))
print('------------------------------------------------------------------')
for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
    print("{:27} : {}".format(repo, url))

Repo Name                   : Repo URL
------------------------------------------------------------------
elections                   : https://github.com/MEDSL/elections.git
official-precinct-returns   : https://github.com/MEDSL/official-precinct-returns.git
primaries                   : https://github.com/MEDSL/primaries.git
data-management             : https://github.com/MEDSL/data-management.git
election-scrapers           : https://github.com/MEDSL/election-scrapers.git
medslcleaner                : https://github.com/MEDSL/medslcleaner.git
precinct-shapefiles         : https://github.com/MEDSL/precinct-shapefiles.git
documentation               : https://github.com/MEDSL/documentation.git
elections-performance-index : https://github.com/MEDSL/elections-performance-index.git
constituency-returns        : https://github.com/MEDSL/constituency-returns.git
state-returns               : https://github.com/MEDSL/state-returns.git
county-returns              : https://github.com/MEDSL/

In [7]:
medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
               '2018-elections-official']   # constituency-level 2018 election results

# this will take some time to complete
dm.clone_gh_repos(account='MEDSL', account_type='orgs', repos=medsl_repos, 
                  outpath=os.path.join('src', 'medsl'))

In [8]:
medsl_dfs = {}

# stores extracted GeoDataFrames in a dictionary where the keys are the filenames from
# which the GeoDataFrames were found
for filepath in dm.list_files_of_type('.zip', os.path.join('src', 'medsl')):
    medsl_dfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()

__Step 1.3.__ Gather Wikipedia data for comparison purposes.

*Note:* The Wikipedia dataset was compiled from tables scraped from Wikipedia pages. You can review the scraping and wrangling notebook [here](https://github.com/mggg/mggg-states-qa/blob/main/src/wikipedia-election-data-mining.ipynb).

In [9]:
wiki_df = et.read_file(os.path.join('src', 'wiki', 'wiki_states.csv')).extract()

Step 2. Wrangle data
---------------------------

__Step 2.1.__ Wrangle MEDSL data.

In [10]:
def pivot_medsl(medsl_dfs_dict: Dict[str, Union[pd.DataFrame, gpd.GeoDataFrame]]
        ) -> None: # updates medsl_dfs_dict in place
    """
    Given a dictionary of MEDSL (Geo)DataFrames, replaces each one, in place, 
    with a pivoted DataFrame where the columns are the election and party and 
    the values are the votes for every precinct.
    
    """
    for df in medsl_dfs_dict:
        # Note: pivot_table aggregates rows that share the same precinct, 
        # office and party using its default aggfunc, the mean
        medsl_pvt = medsl_dfs_dict[df].pivot_table(index='precinct',
                                                   columns=['office', 'party'],
                                                   values='votes')
        medsl_pvt.columns = [' '.join(col).strip() for col in medsl_pvt.columns.values]
        medsl_dfs_dict[df] = et.ExtractTable(medsl_pvt).extract()
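
pandas' `DataFrame.pivot_table` aggregates rows that share the same index and column values, and its default `aggfunc` is the mean. If a precinct appears more than once for the same office and party, the pivot in `pivot_medsl` stores the average of those rows rather than their total. That is one possible explanation for the fractional values in the MEDSL comparison output further down, though it has not been verified against the MEDSL files here. The following is a minimal, pandas-free sketch of the effect; the rows and the `pivot` helper are hypothetical and only mimic the aggregation step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: (precinct, office, party, votes).
# Precinct 'P1' appears twice for the same office and party, for example
# because votes are reported separately by voting mode.
rows = [
    ('P1', 'US Senate', 'democrat', 100),
    ('P1', 'US Senate', 'democrat', 50),
    ('P2', 'US Senate', 'democrat', 30),
]

def pivot(rows, aggregate):
    """Group votes by (precinct, 'office party') and aggregate duplicates."""
    cells = defaultdict(list)
    for precinct, office, party, votes in rows:
        cells[(precinct, '{} {}'.format(office, party))].append(votes)
    return {key: aggregate(values) for key, values in cells.items()}

mean_total = sum(pivot(rows, mean).values())  # mimics the pandas default
sum_total = sum(pivot(rows, sum).values())    # mimics aggfunc='sum'

print(mean_total, sum_total)
```

With these rows the two totals differ, so it may be worth checking the MEDSL tables for repeated precinct rows before interpreting the state-level differences.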

In [11]:
# Pivot and extract relevant MEDSL election data

medsl_18_dfs = {}
medsl_pres16_dfs = {}
medsl_sen16_dfs = {}
medsl_ush16_dfs = {}

for state in states:
    st, st_abv = state
    try:
        medsl_18_dfs[st] = et.ExtractTable(medsl_dfs['precinct_2018'], 
                                           column='state', value=st).extract()
    except Exception as e:
        pass # print('Missing State in medsl_18:', e) 
             # uncomment if want to view limits of MEDSL data
    
    try:
        medsl_pres16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-president'], 
                                               column='state', value=st).extract()
    except Exception as e:
        pass # print('Missing State in medsl_pres16:', e)
    
    try:
        medsl_sen16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-senate'], 
                                              column='state', value=st).extract()
    except Exception as e:
        pass # print('Missing State in medsl_sen16:', e)
    
    try:
        medsl_ush16_dfs[st] = et.ExtractTable(medsl_dfs['2016-precinct-house'], 
                                              column='state', value=st).extract()
    except Exception as e:
        pass # print('Missing State in medsl_ush16:', e)

        
pivot_medsl(medsl_18_dfs)
pivot_medsl(medsl_pres16_dfs)
pivot_medsl(medsl_sen16_dfs)
pivot_medsl(medsl_ush16_dfs)

__Step 2.2.__ Wrangle Wikipedia data.

In [12]:
wiki_df = wiki_df.drop(columns='geometry')

Step 3. Check naming convention compliance
------------------------------------------------------

__Step 3.1.__ Generate naming conventions.

In [13]:
with open(os.path.join('src', 'naming_convention.json')) as json_file:
    standards_raw = json.load(json_file)
    
offices = dm.get_keys_by_category(standards_raw, 'offices')
parties = dm.get_keys_by_category(standards_raw, 'parties')
counts  = dm.get_keys_by_category(standards_raw, 'counts')
others  = dm.get_keys_by_category(standards_raw, ['geographies', 
                                                  'demographics', 
                                                  'districts', 
                                                  'other'])

elections = [office + format(year, '02') + party 
             for office in offices
             for year in range(0, 21)
             for party in parties 
             if not (office == 'PRES' and year % 4 != 0)]

counts    = [count + format(year, '02') 
             for count in counts 
             for year in range(0, 20)]

standards = elections + counts + others
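
To make the generated names concrete, here is a small sketch of the same comprehension run on shortened category lists. The two `offices` and `parties` values below are placeholders; the real lists come from `naming_convention.json`.

```python
# Placeholder category lists; the notebook reads the real ones from
# naming_convention.json
offices = ['PRES', 'SEN']
parties = ['D', 'R']

# Same construction as above: office + two-digit year + party, skipping
# presidential columns for years that are not multiples of four
elections = [office + format(year, '02') + party
             for office in offices
             for year in range(0, 21)
             for party in parties
             if not (office == 'PRES' and year % 4 != 0)]

print(elections[:4])
print('PRES16D' in elections, 'PRES18D' in elections, 'SEN18R' in elections)
```

The year filter accepts a column such as `PRES16D` and leaves `PRES18D` out of the standard.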

__Step 3.2.__ Check ``mggg-states`` data compliance with naming conventions.

In [14]:
naming_check = {}

for gdf in mggg_gdfs:
    naming_check[gdf] = dq.compare_column_names(mggg_gdfs[gdf], standards)
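
Judging from how the result is unpacked in the next cell, each check yields a pair: the columns that match the convention and the columns that do not. The sketch below shows one way such a split could be computed with plain sets. It is not the gdutils implementation, and `split_by_convention` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple

def split_by_convention(columns: Iterable[str],
                        standards: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Split column names into (matching, non-matching) sets."""
    columns, standards = set(columns), set(standards)
    return columns & standards, columns - standards

# Sample column names taken from the kinds of names seen in this notebook
matches, diffs = split_by_convention(['PRES16D', 'VTDID', 'TOTPOP'],
                                     ['PRES16D', 'TOTPOP', 'SEN16D'])
print(sorted(matches), sorted(diffs))
```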

In [15]:
# Print and store results of naming convention check

# dictionary with mggg GeoDataFrame names as keys (names of files from
# which dataset was gathered), and a set of columns that fit the 
# naming convention as values
matched_columns = {}

for gdf in naming_check:
    print('=========================================')
    print('Dataset: {}'.format(gdf))
    print('=========================================')
    
    (matches, diffs) = naming_check[gdf]
    matched_columns[gdf] = matches
    
    diffs = list(diffs)
    diffs.sort()
    
    print('Discrepancies from naming convention:')
    print(diffs)
    print()

Dataset: MN12
Discrepancies from naming convention:
['CA1NO12', 'CA1YES12', 'CA2NO12', 'CA2YES12', 'CONGDIST', 'COUNTYFIPS', 'COUNTYNAME', 'CTU_TYPE', 'CTYCOMDIST', 'JUDDIST', 'MCDCODE', 'MCDNAME', 'MNLEGDIST', 'MNSENDIST', 'PCTCODE', 'PCTNAME', 'VTD']

Dataset: MN16
Discrepancies from naming convention:
['CONGDIST', 'COUNTYFIPS', 'COUNTYNAME', 'CTU_TYPE', 'CTYCOMDIST', 'JUDDIST', 'MCDCODE', 'MCDNAME', 'MNLEGDIST', 'MNSENDIST', 'PCTCODE', 'PCTNAME', 'VTDID']

Dataset: MN14
Discrepancies from naming convention:
['AG14LM', 'AUD14LC', 'CONGDIST', 'COUNTYFIPS', 'COUNTYNAME', 'CTU_TYPE', 'CTYCOMDIST', 'JUDDIST', 'MCDCODE', 'MCDNAME', 'MNLEGDIST', 'MNSENDIST', 'PCTCODE', 'PCTNAME', 'VTDID']

Dataset: MN12_18
Discrepancies from naming convention:
['AG14LM', 'AG18LC', 'AUD14LC', 'AUD18LM', 'CA1NO12', 'CA1YES12', 'CA2NO12', 'CA2YES12', 'CONGDIST', 'COUNTYFIPS', 'COUNTYNAME', 'CTU_TYPE', 'CTYCOMDIST', 'GOV18LC', 'JUDDIST', 'MCDCODE', 'MCDNAME', 'MNLEGDIST', 'MNSENDIST', 'PCTCODE', 'PCTNAME', 'SE

Step 4. Compare ``mggg-states`` against external sources
----------------------------------------------------------------------

In [16]:
# Categorize mggg-states dataframes by the names of their States

# a list of tuples of a state name (left) and a list of the names of 
# that state's GeoDataFrames (right)
available_mggg_states = []

for state in states:
    state_name, state_abv = state
    
    # messy name matching because file naming isn't standardized
    mggg_gdf_names = [gdf_name for gdf_name in list(mggg_gdfs) 
                               if gdf_name.startswith(state_abv.lower() + '_') or
                                  gdf_name.startswith(state_abv + '_') or
                                  gdf_name.startswith(state_name.lower() + '_') or
                                  gdf_name.startswith(state_name.upper() + '_') or
                                  gdf_name.startswith(state_name + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_').lower() + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_').upper() + '_') or
                                  gdf_name.startswith(
                                      state_name.replace(' ', '_') + '_')]
                      
    available_mggg_states.append((state_name, mggg_gdf_names))
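
The chain of `startswith` checks can also be written as one case-insensitive prefix test. The sketch below is slightly more permissive than the cell above, because it accepts any mix of upper and lower case; `matches_state` is a hypothetical helper and `north_carolina_VTD` is a made-up dataset name.

```python
def matches_state(dataset_name: str, state_name: str, state_abv: str) -> bool:
    """Case-insensitive check that a dataset name starts with '<state>_'."""
    name = dataset_name.lower()
    prefixes = (state_abv.lower() + '_',
                state_name.lower() + '_',
                state_name.lower().replace(' ', '_') + '_')
    return name.startswith(prefixes)  # str.startswith accepts a tuple

datasets = ['GA_precincts', 'MA_no_islands_12_16', 'north_carolina_VTD', 'MN12']
print([d for d in datasets if matches_state(d, 'Massachusetts', 'MA')])
print([d for d in datasets if matches_state(d, 'Minnesota', 'MN')])
```

Under either rule, a name such as `MN12` is not assigned to any state, because no underscore follows the abbreviation. Datasets named that way need to be matched by hand.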

__Step 4.1.__ Compare against MEDSL.

In [17]:
# Generate Naming Convention Translations between MGGG and MEDSL
# Has to be done manually because some MEDSL data don't use the 
# same naming convention

pres16_cols = [
    ('PRES16D', 'US President democratic'),
    ('PRES16D', 'US President democrat'),
    ('PRES16G', 'US President green'), 
    ('PRES16L', 'US President libertarian'), 
    ('PRES16R', 'US President republican')
]

sen16_cols = [
    ('SEN16D', 'US Senate democratic'),
    ('SEN16D', 'US Senate democrat'),
    ('SEN16G', 'US Senate green'),
    ('SEN16L', 'US Senate libertarian'),
    ('SEN16R', 'US Senate republican')
]

ush16_cols = [
    ('USH16D', 'US House democratic'),
    ('USH16D', 'US House democrat'),
    ('USH16G', 'US House green'),
    ('USH16L', 'US House libertarian'),
    ('USH16R', 'US House republican')
]

fed18_cols = [
    ('SEN18D', 'US Senate democratic'),
    ('SEN18D', 'US Senate democrat'),
    ('SEN18G', 'US Senate green'),
    ('SEN18L', 'US Senate libertarian'),
    ('SEN18R', 'US Senate republican'),
    ('USH18D', 'US House democratic'),
    ('USH18D', 'US House democrat'),
    ('USH18G', 'US House green'),
    ('USH18L', 'US House libertarian'),
    ('USH18R', 'US House republican')
]

In [18]:
def bulk_compare(bulk_results: Dict[str, Dict[str, List[Tuple[Hashable, Any]]]], 
                 st: str, 
                 mggg_names: List[str], 
                 medsls: Dict[str, pd.DataFrame],
                 mggg_medsl_cols: List[Tuple[str, str]]
        ) -> None: # updates bulk_results in place
    """
    Updates, in place, a dict containing state names as keys and a value of a 
    dict containing mggg_gdf names as keys and the results of 
    dq.compare_column_sums as values.
    
    """
    if st not in medsls:
        return
    
    mggg_results = bulk_results.get(st, {})
    
    # Ignore MEDSL columns that are entirely empty for this state
    medsl_st_df = medsls[st].dropna(axis=1, how='all')
    
    for mggg_name in mggg_names:
        # Generate comparable columns. A separate name is used so that the 
        # filter for one GeoDataFrame does not narrow the candidate pairs 
        # for the next one.
        comparable = [tup for tup in mggg_medsl_cols
                          if tup[0] in list(mggg_gdfs[mggg_name].columns) and
                             tup[1] in list(medsl_st_df.columns)]
        
        mggg_cols  = [tup[0] for tup in comparable]
        medsl_cols = [tup[1] for tup in comparable]
        
        if mggg_cols:
            try:
                # append to any results from earlier calls
                mggg_results[mggg_name] = mggg_results.get(mggg_name, []) + \
                    dq.compare_column_sums(mggg_gdfs[mggg_name], medsl_st_df, 
                                           mggg_cols, medsl_cols)
            except Exception as e:
                print("Unable to compare {} and {} in {}.\n{}\n".format(
                      mggg_cols, medsl_cols, mggg_name, e))
        else:
            # keep results from earlier calls instead of overwriting them
            mggg_results.setdefault(mggg_name, [])
        
        bulk_results[st] = mggg_results

In [19]:
# Compare mggg-states data with MEDSL data

# a dictionary with state names as keys and, as values, dicts that map 
# each mggg-states dataframe name to its list of column comparison 
# results
results = {}

# Note: a "ufunc 'subtract' did not contain a loop with signature matching types"
# error suggests that the columns contain non-numerical data, which should be 
# manually investigated
for state in available_mggg_states:
    st, mggg_names = state
    
    if mggg_names:
        bulk_compare(results, st, mggg_names, medsl_pres16_dfs, pres16_cols)
        bulk_compare(results, st, mggg_names, medsl_sen16_dfs, sen16_cols)
        bulk_compare(results, st, mggg_names, medsl_ush16_dfs, ush16_cols)
        bulk_compare(results, st, mggg_names, medsl_18_dfs, fed18_cols)
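
Each comparison pairs an ``mggg-states`` column with an external column and reports the difference between their state-wide sums. The sketch below shows the idea on plain dictionaries. It assumes the difference is the first sum minus the second and that the label uses the `[vs]` format seen in the printed results. It is not the gdutils implementation, and `compare_sums` and the vote values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# column name -> values; a stand-in for a DataFrame
Table = Dict[str, Sequence[float]]

def compare_sums(left: Table, right: Table,
                 left_cols: List[str], right_cols: List[str]
        ) -> List[Tuple[str, float]]:
    """Pair columns by position and report sum(left) - sum(right)."""
    return [('{} [vs] {}'.format(lcol, rcol), sum(left[lcol]) - sum(right[rcol]))
            for lcol, rcol in zip(left_cols, right_cols)]

# Hypothetical precinct-level vote counts
mggg = {'SEN18D': [120, 80], 'SEN18R': [90, 110]}
medsl = {'US Senate democrat': [200], 'US Senate republican': [150, 60]}

comparison = compare_sums(mggg, medsl,
                          ['SEN18D', 'SEN18R'],
                          ['US Senate democrat', 'US Senate republican'])

for label, diff in comparison:
    print('{:37} : {}'.format(label, diff))
```

A difference of zero means the two sources agree at the state level; it does not show that individual precincts agree.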

In [20]:
# Print comparison results

print("============================================================================")
print("Results of State-level Aggregation comparisons between mggg-states and MEDSL")
print("============================================================================")
print()
print('{:37} : {}'.format('mggg-states column [vs] MEDSL column', 'difference in sums'))
print('----------------------------------------------------------------------------')

for st in results:
    print("######## {} ########".format(st))
    
    for mggg_name in results[st]:
        print("{} ========".format(mggg_name))
        
        if results[st][mggg_name]:
            for col_v_col, diff in results[st][mggg_name]:
                print('{:37} : {}'.format(col_v_col, diff))
        else:
            print("No comparable columns found.")
            
        print()
    print()

Results of State-level Aggregation comparisons between mggg-states and MEDSL

mggg-states column [vs] MEDSL column  : difference in sums
----------------------------------------------------------------------------
######## Alaska ########
USH18D [vs] US House democratic       : -47288.5
USH18R [vs] US House republican       : -46390.5


######## Arizona ########
SEN18D [vs] US Senate democrat        : 0.0
SEN18R [vs] US Senate republican      : 0.0
USH18D [vs] US House democrat         : 0.0
USH18G [vs] US House green            : -3672.0
USH18R [vs] US House republican       : 0.0


######## Colorado ########
USH18D [vs] US House democratic       : 374836.0
USH18R [vs] US House republican       : 0.0


######## Connecticut ########
SEN18D [vs] US Senate democrat        : 61543.904761904734
SEN18R [vs] US Senate republican      : 43311.511904761894
USH18D [vs] US House democrat         : 63366.86904761905
USH18R [vs] US House republican       : 40299.34523809521


######## Delaware ###

__Step 4.2.__ Compare against Wikipedia.

In [21]:
# Compare mggg-states data with Wikipedia data

# a dictionary with state names as keys and, as values, dicts that map 
# each mggg-states dataframe name to its list of column comparison 
# results
results = {}

# Note: a "ufunc 'subtract' did not contain a loop with signature matching types"
# error suggests that the columns contain non-numerical data, which should be 
# manually investigated
wiki_by_state = wiki_df.set_index('STATE')

for state in available_mggg_states:
    st, mggg_names = state
    
    # Wikipedia rows are keyed by upper-case state name
    wiki_st_df = wiki_by_state.loc[[st.upper()]]
    wiki_st_df = wiki_st_df.dropna(axis=1)

    for mggg_name in mggg_names:
        common_cols = list(set(mggg_gdfs[mggg_name].columns).intersection(
                               set(wiki_st_df.columns)))
        common_cols.sort()
        
        if common_cols:
            try:
                # store the results per dataframe so that several dataframes 
                # of the same state do not overwrite each other
                results.setdefault(st, {})[mggg_name] = \
                    dq.compare_column_sums(mggg_gdfs[mggg_name], wiki_st_df, 
                                           common_cols, common_cols)
            except Exception as e:
                print("Unable to compare {} in {}. {}\n".format(
                       common_cols, mggg_name, e))

Unable to compare ['PRES16D', 'PRES16L', 'PRES16R', 'SEN16D', 'SEN16L', 'SEN16R'] in GA_precincts. ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U8310'), dtype('<U8310')) -> dtype('<U8310')

Unable to compare ['PRES16D', 'PRES16R'] in MA_no_islands_12_16. ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U7939'), dtype('<U7939')) -> dtype('<U7939')

Unable to compare ['PRES16D', 'PRES16G', 'PRES16L', 'PRES16R', 'SEN16D', 'SEN16R'] in VT_towns. ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U802'), dtype('<U802')) -> dtype('<U802')
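
The errors above suggest that the listed columns hold text rather than numbers. As a first step in the manual investigation, it can help to list the values that cannot be read as numbers at all. The sketch below does this in plain Python; `non_numeric_values` and the sample column contents are hypothetical. In pandas, `pd.to_numeric(series, errors='coerce')` serves a similar purpose by turning such values into NaN.

```python
from typing import Any, Iterable, List

def non_numeric_values(values: Iterable[Any]) -> List[Any]:
    """Return the values that cannot be converted to a float."""
    bad = []
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad

# Hypothetical column contents
columns = {'PRES16D': [120, '85', 'N/A', None],
           'PRES16R': [90, 110, 75, 60]}

report = {name: non_numeric_values(vals) for name, vals in columns.items()}
print(report)
```

Values such as `'85'` convert cleanly and only need a type cast, whereas entries such as `'N/A'` need a decision about what they should become.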



In [22]:
# Print comparison results

print("================================================================================")
print("Results of State-level Aggregation comparisons between mggg-states and Wikipedia")
print("================================================================================")
print()
print('{:37} : {}'.format('mggg-states column [vs] Wikipedia', 'difference in sums'))
print('----------------------------------------------------------------------------')

for st in results:
    print("######## {} ########".format(st))
    
    for mggg_name in results[st]:
        print("{} ========".format(mggg_name))
        
        if results[st][mggg_name]:
            for col_v_col, diff in results[st][mggg_name]:
                print('{:37} : {}'.format(col_v_col, diff))
        else:
            print("No comparable columns found.")
            
        print()
    print()

Results of State-level Aggregation comparisons between mggg-states and Wikipedia

mggg-states column [vs] Wikipedia     : difference in sums
----------------------------------------------------------------------------
######## Alaska ########
PRES16D [vs] PRES16D                  : -47357.0
PRES16G [vs] PRES16G                  : -1953.0
PRES16L [vs] PRES16L                  : -6721.0
PRES16R [vs] PRES16R                  : -59930.0
SEN16D [vs] SEN16D                    : -16598.0
SEN16L [vs] SEN16L                    : -30057.0
SEN16R [vs] SEN16R                    : -50295.0
USH16D [vs] USH16D                    : -43745.0
USH16L [vs] USH16L                    : -11670.0
USH16R [vs] USH16R                    : -54386.0
USH18R [vs] USH18R                    : -46394.0


######## Arizona ########
SEN18D [vs] SEN18D                    : 0.0
SEN18R [vs] SEN18R                    : 0.0


######## Connecticut ########
SEN18D [vs] SEN18D                    : -37894.0
SEN18R [vs] SEN18R     

Step 5. Check topological soundness of ``mggg-states`` data
-----------------------------------------------------------------------

__Step 5.1.__ Check for empty or missing geometries.

In [23]:
missing_warnings = ['{} has missing geometries.'.format(gdf) 
                    for gdf in mggg_gdfs 
                    if dq.has_missing_geometries(mggg_gdfs[gdf])]

empty_warnings   = ['{} has empty geometries.'.format(gdf) 
                    for gdf in mggg_gdfs 
                    if dq.has_empty_geometries(mggg_gdfs[gdf])] 

topological_warnings = missing_warnings + empty_warnings

if not topological_warnings:
    print("No missing or empty geometries.")
else:
    for msg in topological_warnings:
        print(msg)

No missing or empty geometries.


Step 6. Sum datasets for manual checks
------------------------------------------------

In [24]:
# Print MGGG summation results

print("============================================================================")
print("Results of State-level Aggregations - mggg-states")
print("============================================================================")
print()
print('{:11} : {:10}\t\t{}'.format('mggg-states column', 'sum', 'dtype'))
print('----------------------------------------------------------------------------')

for st, gdf_names in available_mggg_states:
    if gdf_names:
        print('######## {} ########'.format(st))
        
        for gdf_name in gdf_names:
            print("{} ========".format(gdf_name))

            cols_to_sum = [col for col in matched_columns[gdf_name] if col != 'geometry']
            col_sums = dq.sum_column_values(mggg_gdfs[gdf_name], cols_to_sum)

            for result in col_sums:
                col_name, col_sum = result
                
                if col_name in matched_columns[gdf_name]:
                    if isinstance(col_sum, str):
                        col_sum = "{} is a str".format(col_name)
                        
                    print('{:11} : {:10}\t\t{}'.format(col_name, col_sum, str(type(col_sum))))
                    
            print('\n\n')
    

Results of State-level Aggregations - mggg-states

mggg-states column : sum       		dtype
----------------------------------------------------------------------------
######## Alaska ########
SEN16R      :      87854		<class 'numpy.int64'>
USH18D      :      83855		<class 'numpy.int64'>
GOV18R      :     100372		<class 'numpy.int64'>
SEN16D      :      19602		<class 'numpy.int64'>
WVAP        :     368895		<class 'numpy.int64'>
AMINVAP     :      70630		<class 'numpy.int64'>
USH16D      :      67274		<class 'numpy.int64'>
USH14D      :      77004		<class 'numpy.int64'>
PRES16D     :      69097		<class 'numpy.int64'>
USH14R      :     102464		<class 'numpy.int64'>
USH16R      :     100702		<class 'numpy.int64'>
GOV18D      :      80954		<class 'numpy.int64'>
USH16L      :      20100		<class 'numpy.int64'>
TOTPOP      :     710231		<class 'numpy.int64'>
NHPIVAP     :       4599		<class 'numpy.int64'>
SEN16L      :      60768		<class 'numpy.int64'>
2MOREVAP    :      25525		<class 'numpy.

NH_ASIAN    : 212925.2124698109		<class 'numpy.float64'>
H_NHPI      : 296.0000000000018		<class 'numpy.float64'>
NH_AMIN     : 55402.59933935141		<class 'numpy.float64'>
WVAP        : 3460873.0691697192		<class 'numpy.float64'>
AMINVAP     : 37998.07340062703		<class 'numpy.float64'>
NH_2MORE    : 103139.49077833648		<class 'numpy.float64'>
TOTPOP      : 5301294.334104656		<class 'numpy.float64'>
NHPIVAP     : 1349.8952913794365		<class 'numpy.float64'>
2MOREVAP    : 45048.93551823909		<class 'numpy.float64'>
HISP        : 250169.2818913463		<class 'numpy.float64'>
H_BLACK     : 5270.945363978433		<class 'numpy.float64'>
ASIANVAP    : 145968.6550469736		<class 'numpy.float64'>
HVAP        : 148784.58259148896		<class 'numpy.float64'>
NH_WHITE    : 4402722.342295367		<class 'numpy.float64'>
H_AMIN      : 5493.8442450820385		<class 'numpy.float64'>
BVAP        : 174683.71417634306		<class 'numpy.float64'>
NH_OTHER    : 5946.748510220511		<class 'numpy.float64'>
NH_BLACK    : 269129.7635

GOV18I      :      53392		<class 'numpy.int64'>
GOV16L      :      45191		<class 'numpy.int64'>
NH_ASIAN    :     138838		<class 'numpy.int64'>
SEN16R      :     651106		<class 'numpy.int64'>
H_NHPI      :        582		<class 'numpy.int64'>
NH_AMIN     :      42093		<class 'numpy.int64'>
USH18D      :    1061412		<class 'numpy.int64'>
SOS16R      :     892667		<class 'numpy.int64'>
GOV18R      :     814988		<class 'numpy.int64'>
SEN16D      :    1105119		<class 'numpy.int64'>
GOV16R      :     844372		<class 'numpy.int64'>
SOS16D      :     814089		<class 'numpy.int64'>
WVAP        :    2432193		<class 'numpy.int64'>
AMINVAP     :      31264		<class 'numpy.int64'>
USH16D      :    1026851		<class 'numpy.int64'>
GOV16D      :     984922		<class 'numpy.int64'>
AG16D       :    1011761		<class 'numpy.int64'>
PRES16D     :    1002106		<class 'numpy.int64'>
NH_2MORE    :     109205		<class 'numpy.int64'>
USH16R      :     809048		<class 'numpy.int64'>
GOV18D      :     934498		<class 'numpy.

In [25]:
# Print MEDSL summation results

def print_medsl_sums(st: str, 
                     medsl_name: str, 
                     medsls: Dict[str, Union[pd.DataFrame, gpd.GeoDataFrame]]
        ) -> None:
    """
    Prints the column sums of the MEDSL (Geo)DataFrame for the given state, 
    if the state is present in medsls.
    
    """
    if st in medsls:
        print("######## {} : {} ########".format(st, medsl_name))
        
        cols_to_sum = [col for col in list(medsls[st]) if col != 'geometry']
        
        try:
            col_sums = dq.sum_column_values(medsls[st], cols_to_sum)
        except Exception as e:
            # stop here so that col_sums is never undefined (or stale) below
            print("Unable to sum columns for {}. {}\n".format(st, e))
            return
        
        for result in col_sums:
            col_name, col_sum = result
            print('{:65} : {}\t{}'.format(col_name, col_sum, str(type(col_sum))))

        print('\n\n')

        
print("============================================================================")
print("Results of State-level Aggregations - MEDSL")
print("============================================================================")
print()
print('{:65} : {}\t{}'.format('MEDSL column', 'sum', 'dtype'))
print('----------------------------------------------------------------------------')

for st, st_abv in states:
    print_medsl_sums(st, 'MEDSL 2018', medsl_18_dfs)
    print_medsl_sums(st, 'MEDSL PRES16', medsl_pres16_dfs)
    print_medsl_sums(st, 'MEDSL SEN16', medsl_sen16_dfs)
    print_medsl_sums(st, 'MEDSL USH16', medsl_ush16_dfs)
    

Results of State-level Aggregations - MEDSL

MEDSL column                                                      : sum	dtype
----------------------------------------------------------------------------
######## Alabama : MEDSL 2018 ########
Agriculture Commissioner republican                               : 1040671.3510752688	<class 'numpy.float64'>
Associate Justice of the Supreme Court, Place 1 republican        : 1057149.3731182795	<class 'numpy.float64'>
Associate Justice of the Supreme Court, Place 2 republican        : 1047919.7946236558	<class 'numpy.float64'>
Associate Justice of the Supreme Court, Place 3 republican        : 1045524.8575268816	<class 'numpy.float64'>
Associate Justice of the Supreme Court, Place 4 democratic        : 638420.6086021506	<class 'numpy.float64'>
Associate Justice of the Supreme Court, Place 4 republican        : 975778.9483870969	<class 'numpy.float64'>
Attorney General democratic                                       : 679282.7467741936	<class 'num

Agua Fria Unified High Sschool District Board Member nonpartisan  : 13688.0	<class 'numpy.float64'>
Apache Levy nonpartisan                                           : 12080.0	<class 'numpy.float64'>
Attorney General democrat                                         : 1120726.0	<class 'numpy.float64'>
Attorney General nonpartisan                                      : 1792.0	<class 'numpy.float64'>
Attorney General republican                                       : 1201398.0	<class 'numpy.float64'>
Balsz Elementary School District Question nonpartisan             : 3209.0	<class 'numpy.float64'>
Benson Unified School District Question nonpartisan               : 1665.5	<class 'numpy.float64'>
Board Member Altar Valley Elementary School District nonpartisan  : 551.3333333333333	<class 'numpy.float64'>
Board Member Apache Junction Unified School District nonpartisan  : 5744.8	<class 'numpy.float64'>
Board Member Arizona City Fire District nonpartisan               : 901.0	<class 'numpy.fl

Glendale Elementary School District Question 1 nonpartisan        : 8943.5	<class 'numpy.float64'>
Glendale Elementary School District Question 2 nonpartisan        : 8913.0	<class 'numpy.float64'>
Glendale Union High School District Question nonpartisan          : 39711.5	<class 'numpy.float64'>
Governing Board Altar Valley Elementary School District nonpartisan : 562.6000000000001	<class 'numpy.float64'>
Governing Board Baboquivari Unified School District nonpartisan   : 542.5714285714286	<class 'numpy.float64'>
Governing Board Marana Unified School District nonpartisan        : 13046.75	<class 'numpy.float64'>
Governing Board Tucson Unified School District nonpartisan        : 41729.0	<class 'numpy.float64'>
Governor democrat                                                 : 994341.0	<class 'numpy.float64'>
Governor green                                                    : 50962.0	<class 'numpy.float64'>
Governor nonpartisan                                              : 1268.0	<cl

Attorney General democrat                                         : 7287858.183333334	<class 'numpy.float64'>
Attorney General republican                                       : 4168709.4000000004	<class 'numpy.float64'>
Board of Equalization democrat                                    : 6811314.399999999	<class 'numpy.float64'>
Board of Equalization republican                                  : 4307390.4	<class 'numpy.float64'>
Governor democrat                                                 : 7222048.116666667	<class 'numpy.float64'>
Governor republican                                               : 4428960.583333334	<class 'numpy.float64'>
Insurance Commissioner democrat                                   : 5797573.983333333	<class 'numpy.float64'>
Insurance Commissioner independent                                : 5139135.166666667	<class 'numpy.float64'>
Lieutenant Governor democrat                                      : 4891210.675	<class 'numpy.float64'>
Secretary of State demo

Alachua Soil and Water Conservation District: Group 3             : 12427.5	<class 'numpy.float64'>
Alachua Soil and Water Conservation District: Group 3 nonpartisan : 44699.25	<class 'numpy.float64'>
Amendment No. 10: State and Local Government Structure and Operation : 172580.0746031746	<class 'numpy.float64'>
Amendment No. 10: State and Local Government Structure and Operation nonpartisan : 2154098.3839285714	<class 'numpy.float64'>
Amendment No. 11: Property Rights; Removal of Obsolete Provision; Criminal Statutes : 211322.9396825397	<class 'numpy.float64'>
Amendment No. 11: Property Rights; Removal of Obsolete Provision; Criminal Statutes nonpartisan : 2115355.018849206	<class 'numpy.float64'>
Amendment No. 12: Lobbying and Abuse of Office by Public Officers : 155937.77996031745	<class 'numpy.float64'>
Amendment No. 12: Lobbying and Abuse of Office by Public Officers nonpartisan : 2170666.003571429	<class 'numpy.float64'>
Amendment No. 13: Ends Dog Racing                          

10th Circuit - Keith Vacancy republican                           : 1837.0	<class 'numpy.float64'>
10th Circuit - Retain Gilfillan nonpartisan                       : 42094.5	<class 'numpy.float64'>
10th Circuit - Retain Gorman nonpartisan                          : 42361.5	<class 'numpy.float64'>
10th Circuit - Retain Lyons nonpartisan                           : 43145.5	<class 'numpy.float64'>
11th Circuit - Fitzgerald Vacancy republican                      : 51374.0	<class 'numpy.float64'>
11th Circuit - Retain Foley nonpartisan                           : 48294.5	<class 'numpy.float64'>
11th Circuit - Retain Lawrence nonpartisan                        : 48479.5	<class 'numpy.float64'>
12th Circuit - Policandriotes Vacancy democrat                    : 126161.0	<class 'numpy.float64'>
12th Circuit - Policandriotes Vacancy republican                  : 114870.0	<class 'numpy.float64'>
12th Circuit - Rozak Vacancy democrat                             : 129097.0	<class 'numpy.float64'

Attorney General republican                                       : 453574.98905529955	<class 'numpy.float64'>
Auditor of Public Accounts democrat                               : 254041.2459421403	<class 'numpy.float64'>
Auditor of Public Accounts republican                             : 330846.59229390684	<class 'numpy.float64'>
Governor and Lt Governor democrat                                 : 259087.38687916024	<class 'numpy.float64'>
Governor and Lt Governor republican                               : 357901.8799923195	<class 'numpy.float64'>
Public Service Commission democrat                                : 129882.41666666666	<class 'numpy.float64'>
Public Service Commission republican                              : 170276.83333333334	<class 'numpy.float64'>
Secretary of State democrat                                       : 239366.0918330773	<class 'numpy.float64'>
Secretary of State republican                                     : 353597.21182795695	<class 'numpy.float64'>
Stat

Alderman Boonton democrat                                         : 1767.0	<class 'numpy.float64'>
Alderman Boonton republican                                       : 1677.0	<class 'numpy.float64'>
Alderman Dover democrat                                           : 2814.0	<class 'numpy.float64'>
B.H. town coun. 3 yr term (2) democrat                            : 3338.0	<class 'numpy.float64'>
B.H. town coun. 3 yr term (2) other                               : 5.0	<class 'numpy.float64'>
B.H. town coun. 3 yr term (2) republican                          : 3092.5	<class 'numpy.float64'>
Barnegat Light Council-1yr Unexpired republican                   : 269.0	<class 'numpy.float64'>
Barnegat Light Mayor republican                                   : 290.0	<class 'numpy.float64'>
Barnegat Light Member of Council republican                       : 263.0	<class 'numpy.float64'>
Barnegat Member of the Township Committee democrat                : 4022.0	<class 'numpy.float64'>
Barnegat Member 

Alamance County Board Of Commissioners democrat                   : 6204.875	<class 'numpy.float64'>
Alamance County Board Of Commissioners republican                 : 7060.0	<class 'numpy.float64'>
Alamance County Clerk Of Superior Court democrat                  : 6319.5	<class 'numpy.float64'>
Alamance County Clerk Of Superior Court republican                : 7713.75	<class 'numpy.float64'>
Alamance County Sheriff republican                                : 9546.0	<class 'numpy.float64'>
Alexander County Board Of Commissioners democrat                  : 957.375	<class 'numpy.float64'>
Alexander County Board Of Commissioners republican                : 2372.416666666667	<class 'numpy.float64'>
Alexander County Clerk Of Superior Court democrat                 : 2736.5	<class 'numpy.float64'>
Alexander County Register Of Deeds democrat                       : 974.75	<class 'numpy.float64'>
Alexander County Register Of Deeds republican                     : 2452.25	<class 'numpy.floa

1. Rhode Island School Buildings - $250,000,000 non-partisan      : 88444.0	<class 'numpy.float64'>
10. Casino Gaming Revenue Town of Tiverton non-partisan           : 1563.0	<class 'numpy.float64'>
10. Charter Amendment Regarding Diversity of Languages non-partisan : 582.5	<class 'numpy.float64'>
10. Debt Limit. Amends: Article VII. Department of Fin. non-partisan : 1173.25	<class 'numpy.float64'>
10. Hiring Legal Counsel For The Town non-partisan                : 3109.0	<class 'numpy.float64'>
10. Improvement And Replacement of Traffic Control non-partisan   : 3858.0	<class 'numpy.float64'>
10. Town of Middletown non-partisan                               : 1424.25	<class 'numpy.float64'>
11. Asset Management Commission. Amends: Article XV. non-partisan : 1149.5	<class 'numpy.float64'>
11. Charter Amendment Regarding Recall of City Council non-partisan : 574.0	<class 'numpy.float64'>
11. Financial Town Referendum Town of Tiverton non-partisan       : 1518.75	<class 'numpy.float64'>
1

Acc Trustee no party                                              : 1106.2857142857142	<class 'numpy.float64'>
Councilmember no party                                            : Councilmember no party    40072.343463
Councilmember no party     5240.484848
dtype: float64	<class 'pandas.core.series.Series'>
Mayor no party                                                    : 54456.158916900094	<class 'numpy.float64'>
Alderman no party                                                 : 1811.8166666666666	<class 'numpy.float64'>
Alderperson Ward 1 no party                                       : 112.5	<class 'numpy.float64'>
Alderperson Ward 2 no party                                       : 381.0	<class 'numpy.float64'>
Attorney General democrat                                         : 1921669.833744411	<class 'numpy.float64'>
Attorney General libertarian                                      : 93001.36984790914	<class 'numpy.float64'>
Attorney General republican                           

Adams County Assessor republican                                  : 3507.0	<class 'numpy.float64'>
Adams County Auditor republican                                   : 3527.0	<class 'numpy.float64'>
Adams County Clerk republican                                     : 3429.0	<class 'numpy.float64'>
Adams County Commissioner republican                              : 1914.0	<class 'numpy.float64'>
Adams County Prosecutor republican                                : 3334.0	<class 'numpy.float64'>
Adams County Sheriff republican                                   : 3424.0	<class 'numpy.float64'>
Adams County Treasurer republican                                 : 1837.5	<class 'numpy.float64'>
Benton County Assessor republican                                 : 55770.0	<class 'numpy.float64'>
Benton County Auditor republican                                  : 56574.0	<class 'numpy.float64'>
Benton County Clerk republican                                    : 55205.0	<class 'numpy.float64'>
Benton 

In [26]:
# Print Wikipedia election data

print("============================================================================")
print("Results of State-level Aggregations - Wikipedia")
print("============================================================================")
print()
print('{:11} : {:10}\t\t{}'.format('Wikipedia column', 'sum', 'dtype'))
print('----------------------------------------------------------------------------')

for i in range(len(wiki_df)):
    wiki_st_df = wiki_df.iloc[[i]]  # single-row DataFrame for this state
    print("######## {} ########".format(wiki_st_df['STATE'].iloc[0]))

    for col in wiki_st_df:
        val = wiki_st_df[col].iloc[0]
        # Skip the state label and any missing (None/NaN) values
        if col != 'STATE' and pd.notna(val):
            print('{:11} : {:10}\t\t{}'.format(col, val, str(type(val))))
    print()

Results of State-level Aggregations - Wikipedia

Wikipedia column : sum       		dtype
----------------------------------------------------------------------------
######## ALABAMA ########
PRES16D     :   729547.0		<class 'numpy.float64'>
PRES16G     :     9391.0		<class 'numpy.float64'>
PRES16L     :    44467.0		<class 'numpy.float64'>
PRES16R     :  1318255.0		<class 'numpy.float64'>
SEN16D      :   748709.0		<class 'numpy.float64'>
SEN16R      :  1335104.0		<class 'numpy.float64'>
SEN17D      :   673896.0		<class 'numpy.float64'>
SEN17R      :   651972.0		<class 'numpy.float64'>

######## ALASKA ########
PRES16D     :   116454.0		<class 'numpy.float64'>
PRES16G     :     5735.0		<class 'numpy.float64'>
PRES16L     :    18725.0		<class 'numpy.float64'>
PRES16R     :   163387.0		<class 'numpy.float64'>
SEN16D      :    36200.0		<class 'numpy.float64'>
SEN16L      :    90825.0		<class 'numpy.float64'>
SEN16R      :   138149.0		<class 'numpy.float64'>
USH16D      :   111019.0		<class 'n

Step 7. Cleanup
-------------------

In [27]:
# Remove cloned repos

dm.remove_repos('src/')

In [28]:
# Uninstall the Python packages installed in Step 0
# (-y answers the confirmation prompt automatically)

!pip3 uninstall -y numpy
!pip3 uninstall -y pandas
!pip3 uninstall -y geopandas
!pip3 uninstall -y wikipedia

!pip3 uninstall -y gdutils

In [29]:
# Reset Jupyter Notebook IPython Kernel

from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Next Steps
-------------

__Data Standardization__

- Manually evaluate column naming discrepancies to determine if changes are needed
- Manually evaluate column datatypes to determine if changes are needed
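
As a first pass before the manual review, column names can be screened against an expected pattern. The sketch below assumes a naming convention inferred from the Wikipedia columns printed above (an office prefix, a two-digit year, and a party letter, as in ``PRES16D``). The prefix list, the regular expression, and the helper name ``nonconforming_columns`` are illustrative assumptions and are not part of ``gdutils``.

```python
import re

# Assumed convention: <OFFICE><YY><PARTY>, e.g. PRES16D, SEN17R, USH16D.
# The office prefixes are only those visible in the output above.
ELECTION_COL = re.compile(r'^(PRES|SEN|USH)\d{2}[A-Z]$')

def nonconforming_columns(columns, ignore=('STATE',)):
    """Return the column names that do not match the assumed pattern."""
    return [c for c in columns
            if c not in ignore and not ELECTION_COL.match(c)]

# Hypothetical column list containing two deliberately malformed names
cols = ['STATE', 'PRES16D', 'PRES16R', 'SEN16D', 'sen16r', 'USH2016D']
print(nonconforming_columns(cols))  # ['sen16r', 'USH2016D']
```

Any name the screen reports still needs a human decision: it may be a genuine discrepancy or simply an office the assumed pattern does not cover.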

__Data Comparison__

- Manually investigate large differences found when comparing ``mggg-states`` data with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate?)
- For overcounts, check how the votes are counted. For example, a ``USH__D`` count may include votes for all Democratic candidates, whereas external sources may count only the main Democratic candidate
- For more accurate comparisons, compare ``mggg-states`` data with the figures published on each state's Secretary of State website
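
To decide which differences deserve a manual look, one option is to rank columns by their relative difference from the external source. The following is a minimal sketch under stated assumptions: the helper name ``flag_large_differences``, the 5% threshold, and the sample counts are all hypothetical, and plain dictionaries stand in for the DataFrames used earlier in the notebook.

```python
def flag_large_differences(mggg, external, threshold=0.05):
    """Return {column: relative difference} for shared columns whose
    counts differ by more than `threshold` (relative to `external`).
    A positive value means mggg-states counts more votes (an overcount)."""
    flagged = {}
    for col in sorted(set(mggg) & set(external)):
        ext = external[col]
        if ext == 0:
            continue  # relative difference is undefined
        rel = (mggg[col] - ext) / ext
        if abs(rel) > threshold:
            flagged[col] = rel
    return flagged

# Made-up counts for illustration only (not real election results)
mggg_counts = {'USH16D': 1200.0, 'USH16R': 1000.0, 'SEN16D': 500.0}
external_counts = {'USH16D': 1000.0, 'USH16R': 1010.0, 'SEN16D': 500.0}

flagged = flag_large_differences(mggg_counts, external_counts)
for col, rel in flagged.items():
    print('{:11} : {:+.1%}'.format(col, rel))
```

In this toy input only ``USH16D`` is flagged, with a positive difference, which is the kind of overcount the second bullet describes.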

__Topological Soundness__

- Manually examine shapefiles for gaps and overlaps
- *Note:* although gaps and overlaps do not necessarily indicate inaccurate data (some counties have precinct islands), they do mean that the data cannot be used for chain runs
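
A rough automated screen can narrow down where to look before the manual examination. The sketch below assumes ``shapely`` is available (it is a dependency of ``geopandas``); the helper names ``find_overlaps`` and ``interior_gap_area`` are hypothetical, and the toy rectangles stand in for precinct geometries. It only detects gaps that are fully enclosed by other polygons, so legitimate holes and precinct islands will also show up and still need a manual check.

```python
from shapely.geometry import Polygon, box
from shapely.ops import unary_union

def find_overlaps(polys, min_area=1e-9):
    """Return (i, j, area) for each pair of polygons whose interiors overlap."""
    overlaps = []
    for i, a in enumerate(polys):
        for j in range(i + 1, len(polys)):
            area = a.intersection(polys[j]).area
            if area > min_area:  # shared edges have zero area and are ignored
                overlaps.append((i, j, area))
    return overlaps

def interior_gap_area(polys):
    """Total area of holes enclosed by the union of all polygons."""
    union = unary_union(polys)
    parts = union.geoms if union.geom_type == 'MultiPolygon' else [union]
    return sum(Polygon(ring.coords).area
               for part in parts for ring in part.interiors)

# Toy layout: four rectangles around a 1x1 hole, plus one overlapping rectangle
precincts = [
    box(0, 0, 3, 1), box(0, 2, 3, 3), box(0, 1, 1, 2), box(2, 1, 3, 2),
    box(2.5, 0, 3.5, 1),  # overlaps the first rectangle
]
overlaps = find_overlaps(precincts)
gap = interior_gap_area(precincts)
print(overlaps, gap)
```

With a GeoDataFrame, ``list(gdf.geometry)`` could be passed in place of ``precincts``. The pairwise loop is quadratic in the number of polygons, so a spatial index would be preferable for full precinct files.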

__Data Documentation__

- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide the scripts used and/or describe the data wrangling/processing steps?