## Processing the Source Data
The Georgia Election Data Jupyter Notebook uses a SQLite database (via Python's `sqlite3` module) to store the entire corpus of Georgia election data. Any other SQL-based database could be used instead; SQLite is used here to avoid the costs (AWS, GCS, etc.) associated with running a live SQL database server.

### Prerequisites
1. Python >= 3.6
2. CLI version of Git
3. OSX/Linux folder structure
4. 3 GB of space to clone the repository and generate the database
5. Tested with 20 GB of memory; may work with 16 GB
6. Packages required: tqdm
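
The software prerequisites above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook; the helper name `check_prerequisites` is hypothetical, and it only covers the Python version, the Git CLI, and the required packages (disk space and memory are not checked).

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 6), packages=("tqdm",)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Compare the running interpreter against the minimum version
    if sys.version_info < min_python:
        problems.append("Python version is older than " + ".".join(map(str, min_python)))
    # The repo clone step shells out to git, so it must be on the PATH
    if shutil.which("git") is None:
        problems.append("git CLI not found on PATH")
    # find_spec returns None when a top-level package cannot be imported
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: " + name)
    return problems


print(check_prerequisites() or "All prerequisite checks passed")
```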

### Source data location

A copy of the full source data files is maintained in this repo so that the conversion of the original fixed-width files into a SQLite database can be reproduced.

Alternatively, you can download the files directly from the State of Georgia website. 

In [1]:
# Required libraries
import os
import sqlite3
from tqdm import tqdm
from zipfile import ZipFile
from glob import glob

### How to access source files from State of Georgia -- OPTIONAL

(Directions are functional as of August 2019)

#### 2013-2019
Step 1: Go to https://elections.sos.ga.gov/Elections/voterhistory.do
Step 2: Select the election year, and then download the `Full Year File`
Step 3: Download each of the ZIP files into a folder accessible to your Python instance

#### 1996-2012
Step 1: Go to https://elections.sos.ga.gov/Elections/voterhistoryprevious.do
Step 2: Download each year's zip file into a folder accessible to your Python instance

In [2]:
# Sets the location of the "original" data, 
# meaning the data from the Secretary of State's
# web site

original_data_folder = '/original_data_20190709'

In [3]:
# If this notebook was not loaded as part of a repository,
# the following settings will be used to save the cloned
# repository

working_root_folder = '/tmp/ga_elect'

In [4]:
def repo_check_clone(force=False):

    """
    This function checks if the notebook is running from a cloned folder, or
    if the notebook was downloaded individually from the repo.
    
    If the notebook is running in a repo, the folder of the repo will be returned.
    
    If not, an option will be provided to re-clone the repo.
    
    """

    # Start from the notebook-level default. Using a separate local name avoids the
    # UnboundLocalError that assigning to working_root_folder inside this function
    # would otherwise trigger in the clone branch.
    root_folder = working_root_folder

    if os.path.exists(os.getcwd() + original_data_folder) and not force:
        print("Notebook appears to be running within a cloned repo. Skipping repo clone.")
        root_folder = os.getcwd()
        print(f"Setting working root folder to {root_folder}")
    else:
        print("Notebook does not appear to be running within a cloned repo.")
        input_res = input("Type Y to clone the repo. Note: This is a 750MB download.")
        if input_res == 'Y':
            print(f"Creating {root_folder}")
            try:
                os.mkdir(root_folder)
                print("Folder created")
                !git clone git@github.com:solidgose/ga_elect.git {root_folder + '/.'}
            except FileExistsError:
                print("The folder already exists. Please empty the folder if you wish to recreate")
            except Exception as e:
                print("Other error", e)
    return root_folder
    

In [5]:
# Check to see if the notebook is running inside of a repository.
# If not, clone the repository.
working_root_folder = repo_check_clone(force=False)

Notebook appears to be running within a cloned repo. Skipping repo clone.
Setting working root folder to /home/michaelhandelman/ga_election_code/ga_elect


The following are utility functions used for ETL purposes of the original source data.

In [6]:
def parse_function_2012_prior(in_line):

    in_line = in_line.decode('utf-8')
    
    county_no = in_line[0:3]
    reg_no = in_line[3:11]
    election_date = in_line[11:19]
    election_type = in_line[19:22].strip()
    party = in_line[22:23].strip()
    absentee = in_line[23:24]
    
    # dates for this time series are presented as month-day-year (MMDDYYYY);
    # we standardize them to year-month-day (YYYY-MM-DD) to reduce
    # confusion and match the date format of the 2013 and onwards data
    #print(election_date)
    election_date = election_date[4:] + '-' + election_date[0:2] + '-' + election_date[2:4]
    
    # standardize party indicator from primary election records: single char
    # D -> Democrat
    # R -> Republican
    # N -> Non-Partisan
    
    if party == 'NP':
        party = 'N'
    elif party != 'D' and party != 'R' and party != 'N':
        party = ""


    if absentee == 'Y':
        absentee = True
    else:
        absentee = False
        
    return (0,county_no, reg_no, election_date,election_type, party, absentee,None,None) 

In [32]:
# 2013 and later: YYYYMMDD -> YYYY-MM-DD
test_date = '20140203'
test_date = test_date[0:4] + '-' + test_date[4:6] + '-' + test_date[6:]
print(test_date)

# 2012 and earlier: MMDDYYYY -> YYYY-MM-DD
election_date = '02032014'
election_date = election_date[4:] + '-' + election_date[0:2] + '-' + election_date[2:4]
print(election_date)


2014-02-03
2014-02-03


In [7]:
def parse_function_2013_post(in_line):

    # Decode the raw bytes as UTF-8 text before slicing out the fields
    in_line = in_line.decode('utf-8')
    
    county_no = in_line[0:3].strip()
    reg_no = in_line[3:11].strip()
    election_date = in_line[11:19].strip()
    election_type = in_line[19:22].strip()
    party= in_line[22:24].strip()
    absentee =in_line[24:25].strip()
    provisional = in_line[25:26].strip()
    supplemental = in_line[26:27].strip()
    
    # the 2013+ files store the date as YYYYMMDD; insert dashes to get YYYY-MM-DD
    election_date = election_date[0:4] + '-' + election_date[4:6] + '-' + election_date[6:]
    
    # standardize party indicator from primary election records: single char
    # D -> Democrat
    # R -> Republican
    # N -> Non-Partisan
    
    if party == 'NP':
        party = 'N'
    elif party != 'D' and party != 'R' and party != 'N':
        party = None

    # Convert absentee flags to True/False

    if absentee == 'Y':
        absentee = True
    else:
        absentee = False

    # Convert provisional flags to True/False

    if provisional == 'Y':
        provisional = True
    else:
        provisional = False        

    # Convert supplemental flags to True/False
    
    if supplemental == 'Y':
        supplemental = True
    else:
        supplemental = False       
    # Column order must match the table: ..., absentee, supplemental, provisional
    return (1, county_no, reg_no, election_date, election_type, party, absentee, supplemental, provisional)
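
The fixed-width slicing used by the two parse functions can also be expressed as a table of column offsets. The following is a minimal sketch, not the notebook's own implementation: the offsets are copied from `parse_function_2013_post`, while the names `LAYOUT_2013_POST` and `slice_fixed_width` and the sample record are hypothetical and exist only for illustration.

```python
# Field name -> (start, end) offsets, copied from parse_function_2013_post
LAYOUT_2013_POST = {
    "county_no": (0, 3),
    "reg_no": (3, 11),
    "election_date": (11, 19),
    "election_type": (19, 22),
    "party": (22, 24),
    "absentee": (24, 25),
    "provisional": (25, 26),
    "supplemental": (26, 27),
}


def slice_fixed_width(raw_line, layout):
    """Decode one raw line and cut it into named, whitespace-stripped fields."""
    text = raw_line.decode("utf-8")
    return {name: text[start:end].strip() for name, (start, end) in layout.items()}


# Hypothetical record, built field by field so the column widths are visible
sample = b"001" b"12345678" b"20140203" b"001" b"NP" b"Y" b"N" b"N" b"\n"
record = slice_fixed_width(sample, LAYOUT_2013_POST)
print(record)
```

Keeping the offsets in one dictionary makes it easier to compare the layout against the file specification, at the cost of a small amount of per-line overhead.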
  

In [8]:
def convert_fwf_to_sqlitedb(source_files_loc=None,dest_db_loc=None):
    
    """
    This function takes the source data files and builds a sqlite database.
    Rows are parsed into plain Python tuples in lieu of pandas dataframes because
    the memory footprint and CPU overhead of holding 78 million records in
    dataframes is unnecessary for this load.
    """
    
    # first, find all available files in the source location
    
    list_of_files = glob(source_files_loc + '/*')
    list_of_files.sort()
    print(f"Found {len(list_of_files)} to process.")
    
    range_known_files = set(range(1996,2020))
    range_found_files = set()
    
    
    print(f"Creating database in the {dest_db_loc} folder")
    
    for f in tqdm(list_of_files, desc="Individual file processing progress", position=0):
        # the year is the file name without its extension, e.g. 1996.zip -> 1996
        range_found_files.add(int(os.path.basename(f).split('.')[0]))
   
    #print(range_known_files)
    #print(range_found_files)

    proceed = False
    
    if len(range_known_files - range_found_files) == 0:
        print('All Files Found')
        proceed = True
    
    else:
        print("The following files were expected but not found:")
        print(list(range_known_files-range_found_files))
        input_response = input("There are missing files. Type Y to proceed.")
        if input_response == 'Y':
            proceed = True
    
    
    if proceed:
            
        db = sqlite3.connect(dest_db_loc)
        c = db.cursor()
        c.execute('drop table if exists ga_elect_data;')
        db.commit()
        c = db.cursor()
        c.execute('''
                   CREATE TABLE ga_elect_data(
                   vintage INT,
                   county_no TEXT,
                   reg_no TEXT, 
                   election_date TEXT,
                   election_type TEXT, 
                   party TEXT,
                   absentee BOOLEAN,
                   supplemental BOOLEAN,
                   provisional BOOLEAN) 
                ''')
        db.commit()
        
        # now load the files
        
        pre_2012_range = set(range(1996,2013))
        post_2013_range = set(range(2013,2020))
        
        for f in tqdm(list_of_files):
            year = int(os.path.basename(f).split(".")[0])

            # Open the archive once and read its first member
            with ZipFile(f) as zf, zf.open(zf.namelist()[0]) as ff:
                data = ff.readlines()

            if year in pre_2012_range:
                parsed_data = list(map(parse_function_2012_prior, data))
            elif year in post_2013_range:
                parsed_data = list(map(parse_function_2013_post, data))
            else:
                # Skip unexpected files instead of re-inserting the previous file's rows
                print(f"Skipping {os.path.basename(f)}: year is outside the known range")
                continue

            print(f"Loading {os.path.basename(f)} into the sqlite database")
            c = db.cursor()
            c.executemany("INSERT INTO ga_elect_data VALUES (?,?, ?, ?, ?, ?, ?, ?, ?)", parsed_data)
            db.commit()
            print(f"Loaded {os.path.basename(f)} into the sqlite database")

        # Close the connection only when one was actually opened
        db.close()
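
The loader above reads every line of a file into memory before inserting it. If memory is tight, one possible alternative is to stream rows into `executemany` from a generator, since `sqlite3` accepts any iterable of parameter tuples. The following is a sketch under stated assumptions, not the notebook's method: `iter_rows`, `toy_parse`, the `demo` table, and the in-memory sample archive are all hypothetical.

```python
import io
import sqlite3
from zipfile import ZipFile


def iter_rows(zip_source, parse):
    """Yield one parsed tuple per line without materialising the whole file."""
    with ZipFile(zip_source) as zf, zf.open(zf.namelist()[0]) as member:
        for raw_line in member:
            yield parse(raw_line)


def toy_parse(raw_line):
    """Hypothetical two-field parser: county number and registration number."""
    text = raw_line.decode("utf-8")
    return (text[0:3], text[3:11])


# Build a tiny archive in memory so the example needs no files on disk
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("sample.txt", "00112345678\n00287654321\n")
buffer.seek(0)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo(county_no TEXT, reg_no TEXT)")
# executemany consumes the generator row by row
db.executemany("INSERT INTO demo VALUES (?, ?)", iter_rows(buffer, toy_parse))
db.commit()
row_count = db.execute("SELECT count(*) FROM demo").fetchone()[0]
print(row_count)
db.close()
```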
    

### Generating the database

The following code generates the sqlite database from the source data. This process should take 5-10 minutes to complete, depending on available computing resources.

In [None]:
# Set the path for outputting the database file

original_data_full_path = working_root_folder + original_data_folder
database_out_path = working_root_folder +  '/database/'

# the sqlite database will be saved here
database_name = 'ga_elect2.db'

# Check first to see if this file exists
# If so, we want to give the option to delete it
# since it may be locked or corrupted

print(database_out_path + database_name)
is_database_exists = os.path.exists(database_out_path + database_name)

confirm = None

if is_database_exists:
    confirm = input("The current database name still exists. Type Y if you want to re-create the database")

# Build the database if it is missing, or if the user asked to re-create it
if not is_database_exists or confirm == 'Y':
    convert_fwf_to_sqlitedb(original_data_full_path, database_out_path + database_name)
else:
    print("Keeping existing database")

/home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db
The current database name still exists. Type Y if you want to re-create the databaseY


Individual file processing progress: 100%|██████████| 24/24 [00:00<00:00, 54648.91it/s]

Found 24 to process.
Creating database in the /home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db folder
All Files Found



  0%|          | 0/24 [00:00<?, ?it/s]

Loading 1996.zip into the sqlite database


  4%|▍         | 1/24 [00:34<13:09, 34.31s/it]

Loaded 1996.zip into the sqlite database
Loading 1997.zip into the sqlite database


  8%|▊         | 2/24 [00:42<09:42, 26.50s/it]

Loaded 1997.zip into the sqlite database
Loading 1998.zip into the sqlite database


 12%|█▎        | 3/24 [01:09<09:18, 26.59s/it]

Loaded 1998.zip into the sqlite database
Loading 1999.zip into the sqlite database


 17%|█▋        | 4/24 [01:14<06:40, 20.04s/it]

Loaded 1999.zip into the sqlite database
Loading 2000.zip into the sqlite database


 21%|██        | 5/24 [01:52<08:04, 25.50s/it]

Loaded 2000.zip into the sqlite database
Loading 2001.zip into the sqlite database


 25%|██▌       | 6/24 [01:58<05:56, 19.78s/it]

Loaded 2001.zip into the sqlite database
Loading 2002.zip into the sqlite database


 29%|██▉       | 7/24 [02:27<06:22, 22.49s/it]

Loaded 2002.zip into the sqlite database
Loading 2003.zip into the sqlite database


 33%|███▎      | 8/24 [02:32<04:37, 17.33s/it]

Loaded 2003.zip into the sqlite database
Loading 2004.zip into the sqlite database


 38%|███▊      | 9/24 [03:22<06:43, 26.89s/it]

Loaded 2004.zip into the sqlite database
Loading 2005.zip into the sqlite database


 42%|████▏     | 10/24 [03:29<04:53, 20.95s/it]

Loaded 2005.zip into the sqlite database
Loading 2006.zip into the sqlite database


 46%|████▌     | 11/24 [03:57<05:00, 23.11s/it]

Loaded 2006.zip into the sqlite database
Loading 2007.zip into the sqlite database


 50%|█████     | 12/24 [04:03<03:37, 18.13s/it]

Loaded 2007.zip into the sqlite database
Loading 2008.zip into the sqlite database


 54%|█████▍    | 13/24 [05:16<06:20, 34.57s/it]

Loaded 2008.zip into the sqlite database
Loading 2009.zip into the sqlite database


 58%|█████▊    | 14/24 [05:23<04:20, 26.08s/it]

Loaded 2009.zip into the sqlite database
Loading 2010.zip into the sqlite database


 62%|██████▎   | 15/24 [06:00<04:26, 29.62s/it]

Loaded 2010.zip into the sqlite database
Loading 2011.zip into the sqlite database


 67%|██████▋   | 16/24 [06:11<03:10, 23.87s/it]

Loaded 2011.zip into the sqlite database
Loading 2012.zip into the sqlite database


 71%|███████   | 17/24 [07:06<03:52, 33.28s/it]

Loaded 2012.zip into the sqlite database
Loading 2013.zip into the sqlite database


 75%|███████▌  | 18/24 [07:13<02:32, 25.47s/it]

Loaded 2013.zip into the sqlite database
Loading 2014.zip into the sqlite database


 79%|███████▉  | 19/24 [07:50<02:24, 28.81s/it]

Loaded 2014.zip into the sqlite database
Loading 2015.zip into the sqlite database


 83%|████████▎ | 20/24 [07:56<01:27, 21.98s/it]

Loaded 2015.zip into the sqlite database
Loading 2016.zip into the sqlite database


 88%|████████▊ | 21/24 [09:03<01:46, 35.38s/it]

Loaded 2016.zip into the sqlite database
Loading 2017.zip into the sqlite database


 92%|█████████▏| 22/24 [09:16<00:57, 28.68s/it]

Loaded 2017.zip into the sqlite database
Loading 2018.zip into the sqlite database


In [10]:
# Test: confirm the expected number of rows was loaded

db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()
c.execute('''SELECT count(*) FROM ga_elect_data''')
assert c.fetchall()[0][0] == 76453445
db.close()
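
Beyond the total row count, a per-vintage count is a quick way to confirm that both file formats were loaded. The following is a minimal sketch that uses an in-memory database with the same schema as `ga_elect_data`; the three rows are hypothetical and are only there to make the example self-contained.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE ga_elect_data(
        vintage INT, county_no TEXT, reg_no TEXT, election_date TEXT,
        election_type TEXT, party TEXT, absentee BOOLEAN,
        supplemental BOOLEAN, provisional BOOLEAN)
""")

# Hypothetical rows: vintage 0 = 2012 and earlier layout, vintage 1 = 2013 and later layout
rows = [
    (0, "001", "00000001", "1996-11-05", "003", "", False, None, None),
    (0, "001", "00000002", "2012-11-06", "003", "", True, None, None),
    (1, "002", "00000003", "2014-11-04", "003", None, False, False, False),
]
db.executemany("INSERT INTO ga_elect_data VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Count rows per vintage to see how many records came from each file layout
counts = db.execute(
    "SELECT vintage, count(*) FROM ga_elect_data GROUP BY vintage ORDER BY vintage"
).fetchall()
print(counts)
db.close()
```

Against the real database, the same `GROUP BY` query would be run on the connection opened in the test cell above.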


## Adding helper tables

One helper table is also added to the database. It maps each `election_type` code to a classification indicating whether the code reflects a _primary_ election, a _primary runoff_ election, a _general_ election, or another type of vote.

This mapping is needed because the `election_type` coding either 1) has not been consistent from election to election or 2) gained a new code whenever a ballot combined two types of elections, for example a general election added to a recall election.

The repo includes a file, `election_type_mapping.csv`, that contains a manually assigned mapping from each election type code to the classification described above.



In [11]:
print(f"Opening {database_out_path + database_name}")
db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()

manual_classification_file_loc = working_root_folder + '/election_type_mapping.csv'

if os.path.exists(manual_classification_file_loc):
    print("found helper data here:" + manual_classification_file_loc)

    
# load the file into a list of tuples

with open(manual_classification_file_loc, 'r') as helper_file:
    input_str = helper_file.readlines()
input_str_list = [tuple(l.strip().split(',')) for l in input_str]
# remove header
input_str_list = input_str_list[1:]

c.execute('drop table if exists ga_elect_manual_classification;')
db.commit()
c = db.cursor()
c.execute('''
           CREATE TABLE ga_elect_manual_classification(
           election_type_index INT,
           election_type TEXT,
           election_type_description TEXT, 
           manual_classification TEXT)
           ''')
db.commit()
print("Table created")

c = db.cursor()
c.executemany("INSERT INTO ga_elect_manual_classification VALUES (?,?, ?, ?)", input_str_list)
db.commit()
print("Data uploaded")

db.close()

Opening /home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db
found helper data here:/home/michaelhandelman/ga_election_code/ga_elect/election_type_mapping.csv
Table created
Data uploaded
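
Splitting each line on commas works as long as no field contains a comma. If a description ever does, the standard library `csv` module handles quoted fields correctly. The following is a sketch only: the header names follow the table columns above, while the data row is hypothetical and not taken from `election_type_mapping.csv`.

```python
import csv
import io

# Hypothetical file content; the description contains a comma inside quotes
sample_csv = io.StringIO(
    "election_type_index,election_type,election_type_description,manual_classification\n"
    '1,001,"General Election, Statewide",general\n'
)

reader = csv.reader(sample_csv)
next(reader)  # skip the header row
rows = [tuple(row) for row in reader]
print(rows)
```

With a real file, `io.StringIO(...)` would be replaced by `open(path, newline="")`, which is how the `csv` documentation recommends opening files.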


Confirming that this new table was created

In [12]:
# Test: confirm the helper table holds the expected number of rows

db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()
c.execute('''SELECT count(*) FROM ga_elect_manual_classification''')
assert c.fetchall()[0][0] == 37
db.close()
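
With both tables in place, the helper table can be joined to the vote records on `election_type`. The following is a minimal sketch using an in-memory database; the vote table is reduced to three columns, and the codes `AAA` and `BBB` and the classification values are hypothetical placeholders rather than values from the real mapping file.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ga_elect_data(reg_no TEXT, election_date TEXT, election_type TEXT);
    CREATE TABLE ga_elect_manual_classification(
        election_type_index INT, election_type TEXT,
        election_type_description TEXT, manual_classification TEXT);
""")

# Hypothetical vote records
db.executemany("INSERT INTO ga_elect_data VALUES (?, ?, ?)", [
    ("00000001", "2014-05-20", "AAA"),
    ("00000001", "2014-11-04", "BBB"),
    ("00000002", "2014-11-04", "BBB"),
])
# Hypothetical mapping rows
db.executemany("INSERT INTO ga_elect_manual_classification VALUES (?, ?, ?, ?)", [
    (1, "AAA", "hypothetical primary code", "primary"),
    (2, "BBB", "hypothetical general code", "general"),
])

# Count votes per manual classification by joining on the election type code
by_class = db.execute("""
    SELECT m.manual_classification, count(*)
    FROM ga_elect_data AS d
    JOIN ga_elect_manual_classification AS m
      ON d.election_type = m.election_type
    GROUP BY m.manual_classification
    ORDER BY m.manual_classification
""").fetchall()
print(by_class)
db.close()
```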

