Data QA Case: ``mggg-states``
=============================

Below are the steps involved in performing automated data quality checks on ``mggg-states`` data. 

*Note:* The automated checks are not exhaustive; further manual checks are required.

Step 0. Setup
----------------

In [None]:
# !pip3 install numpy
# !pip3 install pandas
# !pip3 install geopandas
# !pip3 install wikipedia

# !pip3 install git+https://github.com/KeiferC/gdutils.git

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import json
import wikipedia
import os

import gdutils.datamine as dm
import gdutils.dataqa as dq
import gdutils.extract as et

Step 1. Data collection
---------------------------

__Step 1.1.__ Gather ``mggg-states`` data

In [None]:
# dm.clone_gh_repos(account='mggg-states', account_type='orgs', outpath=os.path.join('qafiles', 'mggg'))
#     # this will take some time to complete

In [None]:
mggg_gdfs = {}

for filepath in dm.list_files_of_type('.zip', os.path.join('qafiles', 'mggg')):
    mggg_gdfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()

__Step 1.2.__ Gather MEDSL data for comparison purposes

In [None]:
print('{:27} : {}'.format('Repo Name', 'Repo URL'))
print('------------------------------------------------------------------')
for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
    print("{:27} : {}".format(repo, url))

In [None]:
medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
               '2018-elections-official'] # constituency-level 2018 election results

# dm.clone_gh_repos(account='MEDSL', account_type='orgs', repos=medsl_repos, outpath=os.path.join('qafiles', 'medsl'))
#     # this will take some time to complete

In [None]:
medsl_dfs = {}

# Note: will likely throw warnings because MEDSL data contains multiple datatypes in some columns
for filepath in dm.list_files_of_type('.zip', os.path.join('qafiles', 'medsl')):
    medsl_dfs[os.path.basename(filepath)[:-4]] = et.read_file(filepath).extract()

In [None]:
medsl_dfs

__Step 1.3.__ Gather Wikipedia data for comparison purposes

In [None]:
# Generate wikipedia page titles
states = ['Alabama', 'Alaska','Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 
          'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 
          'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 
          'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 
          'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 
          'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 
          'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

wiki_pres_elections = ['United States presidential election']
wiki_fed_elections  = ['United States Senate election']
                      #'United States House of Representatives election']
election_years_to_check = [2016, 2018]

def generate_title(yr, etype):
    return str(yr) + ' ' + etype

wiki_titles = []
for yr in election_years_to_check:
    # Presidential elections only occur in years divisible by 4
    if yr % 4 == 0:
        wiki_titles.append(generate_title(yr, wiki_pres_elections[0]))

    # One page per federal election type per state
    wiki_titles.extend(generate_title(yr, etype) + ' in ' + st
                       for etype in wiki_fed_elections
                       for st in states)

In [None]:
# Gather wikipedia page URLs
wiki_urls = {}
for page_title in wiki_titles:
    try:
        wiki_urls[page_title] = wikipedia.page(title=page_title).url
    except Exception:
        # Skip titles with no retrievable page (e.g. a state with no Senate race that year)
        continue

In [None]:
# Print retrieved page URLs
for wiki in wiki_urls:
    print('{:65}\n\t{}'.format(wiki, wiki_urls[wiki]))

In [None]:
# Gather wikipedia tabular election results
wiki_tables = {}
for wiki in wiki_urls:
    try:
        wiki_tables[wiki] = pd.read_html(wiki_urls[wiki])
    except Exception as e:
        print("Unable to gather Wikipedia tabular data:", e)

In [None]:
# Display wikipedia tabular election data
for wiki in wiki_tables:
    print('================================================')
    print('Wiki: {} '.format(wiki))
    print('================================================')
    
    for i, table in enumerate(wiki_tables[wiki]):
        print('TABLE {}: ############################\n{}\n\n\n'.format(i, table.head()))

In [None]:
# Manually gather found tabular election results

# Generate shorthand keys for wikipedia tabular data
pres_election_short = ['PRES']
fed_elections_short = ['SEN'] # , 'USH']
st_abvs = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 
           'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 
           'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

def generate_short_title(st_abv, yr, etype):
    # Zero-pad the two-digit year to match the naming convention (e.g. 'PRES08_AL')
    return etype + format(yr % 100, '02') + '_' + st_abv

wiki_titles_short = []
for yr in election_years_to_check:
    # Presidential elections only occur in years divisible by 4
    if yr % 4 == 0:
        wiki_titles_short.extend(generate_short_title(abv, yr, pres_election_short[0])
                                 for abv in st_abvs)

    wiki_titles_short.extend(generate_short_title(abv, yr, etype)
                             for etype in fed_elections_short
                             for abv in st_abvs)

In [None]:
wiki_dfs = {}

# wiki_pres16_by_state = [{state_abv : table} for _______] # TODO

# has to be done manually because every wikipedia page is different

# wiki_dfs['PRES16_AL'] = wiki_pres16_by_state['AL']
# wiki_dfs['PRES16_AK'] = wiki_pres16_by_state['AK']
# wiki_dfs['PRES16_AZ'] = wiki_pres16_by_state['AZ']
# wiki_dfs['PRES16_AR'] = wiki_pres16_by_state['AR']
# wiki_dfs['PRES16_CA'] = wiki_pres16_by_state['CA']
# wiki_dfs['PRES16_CO'] = wiki_pres16_by_state['CO']
# wiki_dfs['PRES16_CT'] = wiki_pres16_by_state['CT']
# wiki_dfs['PRES16_DE'] = wiki_pres16_by_state['DE']
# wiki_dfs['PRES16_FL'] = wiki_pres16_by_state['FL']
# wiki_dfs['PRES16_GA'] = wiki_pres16_by_state['GA']
# wiki_dfs['PRES16_HI'] = wiki_pres16_by_state['HI']
# wiki_dfs['PRES16_ID'] = wiki_pres16_by_state['ID']
# wiki_dfs['PRES16_IL'] = wiki_pres16_by_state['IL']
# wiki_dfs['PRES16_IN'] = wiki_pres16_by_state['IN']
# wiki_dfs['PRES16_IA'] = wiki_pres16_by_state['IA']
# wiki_dfs['PRES16_KS'] = wiki_pres16_by_state['KS']
# wiki_dfs['PRES16_KY'] = wiki_pres16_by_state['KY']
# wiki_dfs['PRES16_LA'] = wiki_pres16_by_state['LA']
# wiki_dfs['PRES16_ME'] = wiki_pres16_by_state['ME']
# wiki_dfs['PRES16_MD'] = wiki_pres16_by_state['MD']
# wiki_dfs['PRES16_MA'] = wiki_pres16_by_state['MA']
# wiki_dfs['PRES16_MI'] = wiki_pres16_by_state['MI']
# wiki_dfs['PRES16_MN'] = wiki_pres16_by_state['MN']
# wiki_dfs['PRES16_MS'] = wiki_pres16_by_state['MS']
# wiki_dfs['PRES16_MO'] = wiki_pres16_by_state['MO']
# wiki_dfs['PRES16_MT'] = wiki_pres16_by_state['MT']
# wiki_dfs['PRES16_NE'] = wiki_pres16_by_state['NE']
# wiki_dfs['PRES16_NV'] = wiki_pres16_by_state['NV']
# wiki_dfs['PRES16_NH'] = wiki_pres16_by_state['NH']
# wiki_dfs['PRES16_NJ'] = wiki_pres16_by_state['NJ']
# wiki_dfs['PRES16_NM'] = wiki_pres16_by_state['NM']
# wiki_dfs['PRES16_NY'] = wiki_pres16_by_state['NY']
# wiki_dfs['PRES16_NC'] = wiki_pres16_by_state['NC']
# wiki_dfs['PRES16_ND'] = wiki_pres16_by_state['ND']
# wiki_dfs['PRES16_OH'] = wiki_pres16_by_state['OH']
# wiki_dfs['PRES16_OK'] = wiki_pres16_by_state['OK']
# wiki_dfs['PRES16_OR'] = wiki_pres16_by_state['OR']
# wiki_dfs['PRES16_PA'] = wiki_pres16_by_state['PA']
# wiki_dfs['PRES16_RI'] = wiki_pres16_by_state['RI']
# wiki_dfs['PRES16_SC'] = wiki_pres16_by_state['SC']
# wiki_dfs['PRES16_SD'] = wiki_pres16_by_state['SD']
# wiki_dfs['PRES16_TN'] = wiki_pres16_by_state['TN']
# wiki_dfs['PRES16_TX'] = wiki_pres16_by_state['TX']
# wiki_dfs['PRES16_UT'] = wiki_pres16_by_state['UT']
# wiki_dfs['PRES16_VT'] = wiki_pres16_by_state['VT']
# wiki_dfs['PRES16_VA'] = wiki_pres16_by_state['VA']
# wiki_dfs['PRES16_WA'] = wiki_pres16_by_state['WA']
# wiki_dfs['PRES16_WV'] = wiki_pres16_by_state['WV']
# wiki_dfs['PRES16_WI'] = wiki_pres16_by_state['WI']
# wiki_dfs['PRES16_WY'] = wiki_pres16_by_state['WY']

wiki_dfs['SEN16_AL'] = wiki_tables['2016 United States Senate election in Alabama'][19]
wiki_dfs['SEN16_AK'] = wiki_tables['2016 United States Senate election in Alaska'][20]
wiki_dfs['SEN16_AZ'] = wiki_tables['2016 United States Senate election in Arizona'][40]
# wiki_dfs['SEN16_AR'] =
# wiki_dfs['SEN16_CA'] =
# wiki_dfs['SEN16_CO'] = 
# wiki_dfs['SEN16_CT'] = 
# wiki_dfs['SEN16_DE'] = 
# wiki_dfs['SEN16_FL'] = 
# wiki_dfs['SEN16_GA'] = 
# wiki_dfs['SEN16_HI'] = 
# wiki_dfs['SEN16_ID'] = 
# wiki_dfs['SEN16_IL'] = 
# wiki_dfs['SEN16_IN'] = 
# wiki_dfs['SEN16_IA'] = 
# wiki_dfs['SEN16_KS'] = 
# wiki_dfs['SEN16_KY'] = 
# wiki_dfs['SEN16_LA'] = 
# wiki_dfs['SEN16_ME'] = 
# wiki_dfs['SEN16_MD'] = 
# wiki_dfs['SEN16_MA'] = 
# wiki_dfs['SEN16_MI'] = 
# wiki_dfs['SEN16_MN'] = 
# wiki_dfs['SEN16_MS'] = 
# wiki_dfs['SEN16_MO'] = 
# wiki_dfs['SEN16_MT'] = 
# wiki_dfs['SEN16_NE'] = 
# wiki_dfs['SEN16_NV'] = 
# wiki_dfs['SEN16_NH'] = 
# wiki_dfs['SEN16_NJ'] = 
# wiki_dfs['SEN16_NM'] = 
# wiki_dfs['SEN16_NY'] = 
# wiki_dfs['SEN16_NC'] = 
# wiki_dfs['SEN16_ND'] = 
# wiki_dfs['SEN16_OH'] = 
# wiki_dfs['SEN16_OK'] = 
# wiki_dfs['SEN16_OR'] = 
# wiki_dfs['SEN16_PA'] = 
# wiki_dfs['SEN16_RI'] = 
# wiki_dfs['SEN16_SC'] = 
# wiki_dfs['SEN16_SD'] = 
# wiki_dfs['SEN16_TN'] = 
# wiki_dfs['SEN16_TX'] = 
# wiki_dfs['SEN16_UT'] = 
# wiki_dfs['SEN16_VT'] = 
# wiki_dfs['SEN16_VA'] = 
# wiki_dfs['SEN16_WA'] = 
# wiki_dfs['SEN16_WV'] = 
# wiki_dfs['SEN16_WI'] = 
# wiki_dfs['SEN16_WY'] = 

wiki_dfs['SEN18_AZ'] = wiki_tables['2018 United States Senate election in Arizona'][40]
# wiki_dfs['SEN18_AR'] = 
# wiki_dfs['SEN18_CA'] = 
# wiki_dfs['SEN18_CO'] = 
# wiki_dfs['SEN18_CT'] = 
# wiki_dfs['SEN18_DE'] = 
# wiki_dfs['SEN18_FL'] = 
# wiki_dfs['SEN18_GA'] = 
# wiki_dfs['SEN18_HI'] = 
# wiki_dfs['SEN18_ID'] = 
# wiki_dfs['SEN18_IL'] = 
# wiki_dfs['SEN18_IN'] = 
# wiki_dfs['SEN18_IA'] = 
# wiki_dfs['SEN18_KS'] = 
# wiki_dfs['SEN18_KY'] = 
# wiki_dfs['SEN18_LA'] = 
# wiki_dfs['SEN18_ME'] = 
# wiki_dfs['SEN18_MD'] = 
# wiki_dfs['SEN18_MA'] = 
# wiki_dfs['SEN18_MI'] = 
# wiki_dfs['SEN18_MN'] = 
# wiki_dfs['SEN18_MS'] = 
# wiki_dfs['SEN18_MO'] = 
# wiki_dfs['SEN18_MT'] = 
# wiki_dfs['SEN18_NE'] = 
# wiki_dfs['SEN18_NV'] = 
# wiki_dfs['SEN18_NH'] = 
# wiki_dfs['SEN18_NJ'] = 
# wiki_dfs['SEN18_NM'] = 
# wiki_dfs['SEN18_NY'] = 
# wiki_dfs['SEN18_NC'] = 
# wiki_dfs['SEN18_ND'] = 
# wiki_dfs['SEN18_OH'] = 
# wiki_dfs['SEN18_OK'] = 
# wiki_dfs['SEN18_OR'] = 
# wiki_dfs['SEN18_PA'] = 
# wiki_dfs['SEN18_RI'] = 
# wiki_dfs['SEN18_SC'] = 
# wiki_dfs['SEN18_SD'] = 
# wiki_dfs['SEN18_TN'] = 
# wiki_dfs['SEN18_TX'] = 
# wiki_dfs['SEN18_UT'] = 
# wiki_dfs['SEN18_VT'] = 
# wiki_dfs['SEN18_VA'] = 
# wiki_dfs['SEN18_WA'] = 
# wiki_dfs['SEN18_WV'] = 
# wiki_dfs['SEN18_WI'] = 
# wiki_dfs['SEN18_WY'] = 

wiki_dfs
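
The manual assignments above pair each Wikipedia page title with a shorthand key such as ``SEN16_AL``. As a minimal sketch, the key could be derived from the title instead of being typed by hand. The helper name ``short_key`` and the two lookup dicts are hypothetical, and the sketch assumes every title follows the pattern "<year> <election type> in <state>".

```python
# Hypothetical helper: derive a shorthand key (e.g. 'SEN16_AL') from a
# Wikipedia page title of the form '<year> <election type> in <state>'.
ELECTION_CODES = {
    'United States presidential election': 'PRES',
    'United States Senate election': 'SEN',
}
# Only three states shown; extend to all fifty as needed
STATE_CODES = {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ'}


def short_key(title):
    # Split off the leading year, then separate election type from state
    year, rest = title.split(' ', 1)
    etype, _, state = rest.partition(' in ')
    return '{}{:02d}_{}'.format(ELECTION_CODES[etype], int(year) % 100,
                                STATE_CODES[state])


print(short_key('2016 United States Senate election in Alabama'))  # SEN16_AL
```

The table index for each page would still need to be chosen by hand, since every Wikipedia page lays out its tables differently.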

[Coming Soon] __Step 1.4.__ Gather Ballotpedia data for comparison purposes

*Note:* Depending on the response to the API access request, this step may need to be done manually.

Step 2. Data wrangling
---------------------------

__Step 2.1.__ Wrangle MEDSL data

Step 3. Data standardization check of ``mggg-states``
---------------------------------------------------------------

__Step 3.1.__ Generate naming conventions

In [None]:
with open('naming_convention.json') as json_file:
    standards_raw = json.load(json_file)
    
offices = dm.get_keys_by_category(standards_raw, 'offices')
parties = dm.get_keys_by_category(standards_raw, 'parties')
counts = dm.get_keys_by_category(standards_raw, 'counts')
others = dm.get_keys_by_category(standards_raw, 
            ['geographies', 'demographics', 'districts', 'other'])

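# Election columns: office + two-digit year (00-20) + party,
# with presidential columns only in years divisible by 4
elections = [office + format(year, '02') + party 
                 for office in offices
                 for year in range(0, 21)
                 for party in parties 
                 if not (office == 'PRES' and year % 4 != 0)]

# Count columns: count name + two-digit year (00-19)
counts = [count + format(year, '02') for count in counts 
                                     for year in range(0, 20)]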

standards = elections + counts + others
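
The contents of ``naming_convention.json`` are not shown here, so as a rough illustration, the sketch below runs the same generation logic on a hypothetical miniature set of categories. The office, party, and count codes are stand-ins and may not match the real convention.

```python
# Hypothetical stand-ins for the categories read from naming_convention.json
offices = ['PRES', 'SEN']
parties = ['D', 'R']
count_codes = ['TOTPOP']
years = range(16, 19)  # two-digit years 16, 17, 18

# Office + two-digit year + party; presidential columns only in years divisible by 4
elections = [office + format(year, '02') + party
             for office in offices
             for year in years
             for party in parties
             if not (office == 'PRES' and year % 4 != 0)]

# Count name + two-digit year
counts = [code + format(year, '02') for code in count_codes for year in years]

print(elections)
# ['PRES16D', 'PRES16R', 'SEN16D', 'SEN16R', 'SEN17D', 'SEN17R', 'SEN18D', 'SEN18R']
print(counts)
# ['TOTPOP16', 'TOTPOP17', 'TOTPOP18']
```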

__Step 3.2.__ Check ``mggg-states`` data compliance with naming conventions

In [None]:
naming_check = {}

for gdf in mggg_gdfs:
    naming_check[gdf] = dq.compare_column_names(mggg_gdfs[gdf], standards)

In [None]:
matched_columns = {}

# Print results
for gdf in naming_check:
    print('=========================================')
    print('Dataset: {}'.format(gdf))
    print('=========================================')
    
    (matches, diffs) = naming_check[gdf]
    matched_columns[gdf] = matches
    
    diffs = sorted(diffs)
    
    print('Discrepancies from naming convention:')
    print(diffs)
    print()
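
The cell above unpacks each result as a ``(matches, diffs)`` pair. A minimal sketch of that kind of check, assuming it reduces to a set intersection and a set difference over column names, is shown below. This is not the ``gdutils`` implementation; the function name ``split_columns`` and the sample column names are hypothetical.

```python
def split_columns(columns, standards):
    """Return (matches, diffs): columns found in the standard, and columns that are not."""
    columns, standards = set(columns), set(standards)
    return columns & standards, columns - standards


# Hypothetical dataset columns checked against a hypothetical standard
matches, diffs = split_columns(['PRES16D', 'PRES16R', 'tot_pop', 'geometry'],
                               ['PRES16D', 'PRES16R', 'TOTPOP', 'geometry'])
print(sorted(matches))  # ['PRES16D', 'PRES16R', 'geometry']
print(sorted(diffs))    # ['tot_pop']
```

A column listed in ``diffs`` is not necessarily wrong; it only means the name needs a manual look.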

Step 4. Compare ``mggg-states`` data with external sources
----------------------------------------------------------

__Step 4.1.__ Compare against MEDSL

__Step 4.2.__ Compare against Wikipedia

__Step 4.3.__ Compare against Ballotpedia

Step 5. Check topological soundness of ``mggg-states`` data
-----------------------------------------------------------------------

__Step 5.1.__ Check for empty geometries

Step 6. Cleanup
-------------------

In [None]:
# # Remove cloned repos
# dm.remove_repos('qafiles/')

In [None]:
# # Delete output dump directory
# !echo y | rm -rf ./qafiles

In [None]:
# # Uninstall installed python packages
# !echo y | pip3 uninstall numpy
# !echo y | pip3 uninstall pandas
# !echo y | pip3 uninstall geopandas
# !echo y | pip3 uninstall wikipedia

# !echo y | pip3 uninstall gdutils

In [None]:
# # Reset Jupyter Notebook IPython Kernel
# from IPython.core.display import HTML
# HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Next Steps
-------------

__Data Standardization__

- Manually evaluate column naming discrepancies to determine if changes are needed
- Manually evaluate column datatypes to determine if changes are needed

__Data Comparison__

- Manually investigate large differences found when comparing ``mggg-states`` data with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate?)
- For more accurate comparisons, compare ``mggg-states`` data with the data published on each state's Secretary of State website
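
One way to surface the large differences mentioned above is to compute the relative difference between two sources for each election column and flag anything over a threshold. The sketch below uses plain dicts of statewide totals; the function name ``flag_large_diffs``, the 1% threshold, and all vote counts are made up for illustration.

```python
def flag_large_diffs(ours, reference, threshold=0.01):
    """Return {column: relative difference} for shared columns that differ by more than threshold."""
    flagged = {}
    for column in sorted(set(ours) & set(reference)):
        ref = reference[column]
        if ref == 0:
            continue  # relative difference is undefined; inspect manually
        rel_diff = abs(ours[column] - ref) / ref
        if rel_diff > threshold:
            flagged[column] = rel_diff
    return flagged


# Made-up statewide totals, for illustration only
ours = {'SEN16D': 1000, 'SEN16R': 2000}
reference = {'SEN16D': 1000, 'SEN16R': 2100}
print(flag_large_diffs(ours, reference))  # flags only 'SEN16R' (about 4.8% off)
```

The threshold is a judgment call: small gaps may come from provisional or absentee ballots, while large ones usually point to a wrangling or aggregation problem.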

__Topological Soundness__

- Manually examine shapefiles for gaps and overlaps. *Note:* although gaps and overlaps are not necessarily indicators of inaccurate data (because some counties have precinct islands), they do mean that the data cannot be used for chain runs.

__Data Documentation__

- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide scripts used and/or discuss the data wrangling/processing process?