Data QA Case: ``mggg-states``
=============================

Below are the steps involved in performing automated data quality checks on ``mggg-states`` data. 

*Note:* the automated checks are not exhaustive; further manual checks are required.

Step 0. Setup
----------------

In [None]:
# !pip3 install numpy
# !pip3 install pandas
# !pip3 install geopandas
# !pip3 install wikipedia-api

# !pip3 install git+https://github.com/KeiferC/gdutils.git

In [4]:
import numpy as np
import pandas as pd
import geopandas as gpd
import json
import wikipediaapi

import gdutils.datamine as dm
import gdutils.dataqa as dq
import gdutils.extract as et

Step 1. Data collection
---------------------------

__Step 1.1.__ Gather ``mggg-states`` data

In [5]:
# This will take some time to complete.
# dm.clone_gh_repos(account='mggg-states', account_type='orgs', outpath='qafiles/')
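
Once the clone completes, it can help to confirm what actually landed on disk before running any checks. The sketch below uses only the standard library to list candidate data files under ``qafiles/``. The helper name ``find_spatial_files`` is hypothetical (it is not part of ``gdutils``), and the assumption that the repositories ship their data as ``.shp`` or ``.zip`` files should be verified against the cloned contents.

```python
from pathlib import Path


def find_spatial_files(root, suffixes=('.shp', '.zip')):
    """Return sorted paths under `root` whose suffix is in `suffixes`.

    Hypothetical helper for illustration; not part of gdutils.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing cloned yet (or wrong path): report no files
        return []
    return sorted(p for p in root.rglob('*')
                  if p.is_file() and p.suffix.lower() in suffixes)


# Prints nothing if 'qafiles/' does not exist yet
for path in find_spatial_files('qafiles/'):
    print(path)
```

An empty listing is a quick signal that the clone step was skipped or wrote to a different ``outpath``.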

__Step 1.2.__ Gather Wikipedia data for comparison purposes

__Step 1.3.__ Gather MEDSL data for comparison purposes

- 1.3.1. Find what data is available on GitHub

In [11]:
print('{:27} : {}'.format('Repo Name', 'Repo URL'))
print('------------------------------------------------------------------')
for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
    print("{:27} : {}".format(repo, url))

Repo Name                   : Repo URL
------------------------------------------------------------------
elections                   : https://github.com/MEDSL/elections.git
official-precinct-returns   : https://github.com/MEDSL/official-precinct-returns.git
primaries                   : https://github.com/MEDSL/primaries.git
data-management             : https://github.com/MEDSL/data-management.git
election-scrapers           : https://github.com/MEDSL/election-scrapers.git
medslcleaner                : https://github.com/MEDSL/medslcleaner.git
precinct-shapefiles         : https://github.com/MEDSL/precinct-shapefiles.git
documentation               : https://github.com/MEDSL/documentation.git
elections-performance-index : https://github.com/MEDSL/elections-performance-index.git
constituency-returns        : https://github.com/MEDSL/constituency-returns.git
state-returns               : https://github.com/MEDSL/state-returns.git
county-returns              : https://github.com/MEDSL/

- 1.3.2. Manually download data from https://electionlab.mit.edu/data

[Coming Soon] __Step 1.4.__ Gather Ballotpedia data for comparison purposes

*Note:* Depending on the response to the API access request, this step may need to be done manually.

Step 2. Data standardization check of ``mggg-states``
---------------------------------------------------------------

__Step 2.1.__ Generate naming conventions

__Step 2.2.__ Check ``mggg-states`` data compliance with naming conventions
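
As a rough illustration of what a compliance check could look like, the sketch below tests column names against a regular expression. The pattern (office abbreviation, two-digit year, party letter, e.g. ``PRES16D``) is an assumed convention for the sake of the example, and ``noncompliant_columns`` is a hypothetical helper, not a ``gdutils`` function. The actual convention should come from the previous step.

```python
import re

# Assumed convention, for illustration only: office abbreviation,
# two-digit year, party letter (e.g. PRES16D).
ELECTION_COLUMN = re.compile(r'^(PRES|SEN|USH|GOV)\d{2}[DR]$')


def noncompliant_columns(columns, pattern=ELECTION_COLUMN, ignore=()):
    """Return names that neither match `pattern` nor appear in `ignore`."""
    ignored = set(ignore)
    return [c for c in columns if c not in ignored and not pattern.match(c)]


columns = ['PRES16D', 'PRES16R', 'pres16_dem', 'SEN2016D', 'geometry']
print(noncompliant_columns(columns, ignore=['geometry']))
# prints ['pres16_dem', 'SEN2016D']
```

Columns returned here are candidates for manual review rather than confirmed errors, since some non-election columns may legitimately follow a different scheme.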

Step 3. Compare ``mggg-states`` data with external sources
----------------------------------------------------------

__Step 3.1.__ Compare against MEDSL
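
One possible shape for such a comparison, assuming both sources have already been aggregated to totals keyed by the same column labels, is sketched below with ``pandas``. The function name ``compare_totals``, the 5% threshold, and the numbers are all made up for illustration.

```python
import pandas as pd


def compare_totals(ours, theirs, threshold=0.05):
    """Compare two Series of vote totals that share index labels.

    Returns a DataFrame with absolute and relative differences, plus a
    flag for rows whose relative difference exceeds `threshold`.
    """
    table = pd.concat([ours.rename('ours'), theirs.rename('theirs')],
                      axis=1, join='inner')
    table['abs_diff'] = (table['ours'] - table['theirs']).abs()
    table['rel_diff'] = table['abs_diff'] / table['theirs']
    table['flag'] = table['rel_diff'] > threshold
    return table


# Made-up numbers, for illustration only
mggg = pd.Series({'PRES16D': 1000, 'PRES16R': 1200})
medsl = pd.Series({'PRES16D': 1010, 'PRES16R': 1500})
print(compare_totals(mggg, medsl))
```

Flagged rows only indicate where to look; the cause (for example, absentee ballots reported separately) still has to be investigated by hand.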

__Step 3.2.__ Compare against Wikipedia

__Step 3.3.__ Compare against Ballotpedia

Step 4. Check topological soundness of ``mggg-states`` data
-----------------------------------------------------------------------

__Step 4.1.__ Check for empty geometries
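
A minimal sketch of this check using ``geopandas`` directly is shown below. It treats both missing (``None``) and empty geometries as problems; ``find_empty_geometries`` is a hypothetical helper and the toy data stands in for a real precinct shapefile.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def find_empty_geometries(gdf):
    """Return the rows of `gdf` whose geometry is missing or empty."""
    mask = gdf.geometry.isna() | gdf.geometry.is_empty
    return gdf[mask]


# Toy data for illustration: B is an empty polygon, C has no geometry
gdf = gpd.GeoDataFrame(
    {'precinct': ['A', 'B', 'C']},
    geometry=[Point(0, 0), Polygon(), None],
)
print(find_empty_geometries(gdf)['precinct'].tolist())
```

For real data, the ``GeoDataFrame`` would typically come from ``gpd.read_file(path)`` on one of the cloned shapefiles.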

Step 5. Cleanup
-------------------

In [None]:
# # Remove cloned repos
# dm.remove_repos('qafiles/')

In [None]:
# # Delete output dump directory
# !echo y | rm -rf ./qafiles

In [None]:
# # Uninstall installed python packages
# !echo y | pip3 uninstall numpy
# !echo y | pip3 uninstall pandas
# !echo y | pip3 uninstall geopandas
# !echo y | pip3 uninstall wikipedia-api

# !echo y | pip3 uninstall gdutils

In [None]:
# # Reset Jupyter Notebook IPython Kernel
# from IPython.core.display import HTML
# HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Next Steps
-------------

__Data Standardization__

- Manually evaluate column naming discrepancies to determine if changes are needed
- Manually evaluate column datatypes to determine if changes are needed
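
To narrow down which columns need a manual datatype review, a small ``pandas`` check along the following lines may help. The helper ``non_numeric_columns`` and the toy column names are assumptions made for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(df, columns):
    """Return the names in `columns` that exist in `df` but are not numeric."""
    return [c for c in columns
            if c in df.columns and not is_numeric_dtype(df[c])]


# Toy data: PRES16R holds strings, as can happen after a CSV import
df = pd.DataFrame({'PRES16D': [10, 20], 'PRES16R': ['15', '25']})
print(non_numeric_columns(df, ['PRES16D', 'PRES16R']))
# prints ['PRES16R']
```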

__Data Comparison__

- Manually investigate large differences found when comparing ``mggg-states`` data with external sources (e.g., Are absentee ballots counted? Are the precinct counts accurate?)
- For more accurate comparisons, compare ``mggg-states`` data with the data published on each state's Secretary of State website

__Topological Soundness__

- Manually examine shapefiles for gaps and overlaps. *Note:* gaps and overlaps do not necessarily indicate inaccurate data (some counties have precinct islands), but they do mean that the data cannot be used for chain runs.
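
A first pass at locating gaps and overlaps can be automated before the manual inspection. The sketch below uses ``shapely`` on a plain list of polygons; both helper names are hypothetical. It only detects gaps that are fully enclosed by other polygons, and the pairwise overlap test is quadratic, so a spatial index would be preferable for large precinct files.

```python
from itertools import combinations

from shapely.geometry import box
from shapely.ops import unary_union


def find_overlaps(polygons, min_area=0.0):
    """Return index pairs whose intersection area exceeds `min_area`."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(polygons), 2)
            if a.intersection(b).area > min_area]


def count_enclosed_gaps(polygons):
    """Count holes in the union of `polygons` (fully enclosed gaps only)."""
    union = unary_union(polygons)
    if union.geom_type == 'Polygon':
        parts = [union]
    else:
        parts = [g for g in union.geoms if g.geom_type == 'Polygon']
    return sum(len(part.interiors) for part in parts)


# Toy data: four rectangles forming a ring around a 1x1 hole
ring = [box(0, 0, 3, 1), box(0, 2, 3, 3), box(0, 1, 1, 2), box(2, 1, 3, 2)]
print(find_overlaps(ring), count_enclosed_gaps(ring))
```

Raising ``min_area`` slightly above zero is one way to ignore sliver overlaps caused by floating-point noise, at the cost of possibly hiding small real ones.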

__Data Documentation__

- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide scripts used and/or discuss the data wrangling/processing process?
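
The questions above ultimately need a human reader, but a crude keyword scan can help triage which READMEs to read first. The sketch below is a heuristic under stated assumptions: the keyword patterns and the helper name ``readme_coverage`` are invented for this example, and a match only means a topic is mentioned, not that it is adequately documented.

```python
import re

# Assumed keyword patterns, one per documentation question
CHECKS = {
    'sources': r'\bsources?\b',
    'aggregation': r'\b(dis)?aggregat',
    'caveats': r'\b(caveats?|discrepanc)',
    'processing': r'\b(scripts?|process)',
}


def readme_coverage(text, checks=CHECKS):
    """Map each check name to whether its pattern appears in `text`."""
    return {name: bool(re.search(pattern, text, re.IGNORECASE))
            for name, pattern in checks.items()}


sample = ("## Sources\n"
          "Election results from the county clerk. "
          "Precinct data were disaggregated to blocks.")
print(readme_coverage(sample))
```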