# Loading Georgia Election Source Data

The notebooks in this repo require a processed, cleaned, and normalized version of the source election data from the State of Georgia's Office of Secretary of State. This notebook, and an associated python script, will create a sqlite3-based database that subsequent notebooks will use as a data source.

### Why sqlite3?
This library is used to permit running the Jupyter notebooks on a laptop or Google Colab-type environment, without the need for purchasing or establishing a Google Cloud Compute (GCS) or AWS Resource. Future versions of this notebook may include the option to use paid-database resources on one or both of these services.

### Prerequisites
1. Anaconda Python is used to create the environment needed to operate this notebook, with an environment defined in `environment.yml`. Anaconda Python may be [downloaded without cost](https://www.anaconda.com/distribution/#download-section) for Linux, Windows, or macOS.


2. This repo should be cloned from GitHub; the notebook may not operate as expected if individual notebooks are downloaded without supporting files.

In [2]:
# Required libraries
import os
import sqlite3
from tqdm import tqdm
from zipfile import ZipFile
from glob import glob

## Configuration variables

In [3]:
source_data_loc = 'source_data'
data_vintage_loc = '20190709'

working_directory = None
working_directory = '/home/michael/git_repos/georgia_election_data'

first_years_data = 1996
last_years_data = 2019

database_location = 'processed_data'
dest_db_name = 'election_data.db'
dest_db_table_name = 'gaelect'

batch_size = int(1e6)

### About the Source Data

Included in this repo are the source ZIP files obtained from the State of Georgia Secretary of State in August 2019. _These files are provided without assertions to accuracy_. To re-acquire or download the source files, you may follow the following instructions.

#### 2013-2019
Step 1: Go to https://elections.sos.ga.gov/Elections/voterhistory.do
Step 2: Select the election year, and then download the `Full Year File`
Step 3: Download each of the ZIP files into a folder accessible to your Python instance

#### 1996-2012
Step 1: Go to https://elections.sos.ga.gov/Elections/voterhistoryprevious.do
Step 2: Download each year's zip file into a folder accessible to your Python instance

### Confirming the existence and integrity of the source data

The following cells indicate configuration parameters that describe the location and composition of the source data.

In [4]:
print(f'Current Working Directory is {os.getcwd()}')

if working_directory is None:
    working_directory = os.getcwd()

if working_directory.split('/')[-1] != 'georgia_election_data':
    print("Change the new_working_directory folder to indicate the root folder of this notebook")
        
else:    
    list_of_known_years_election_data = [str(a) + '.zip' for a in list(range(first_years_data,last_years_data+1))]

found_files = sorted(os.listdir(working_directory + f'/{source_data_loc}' + f'/{data_vintage_loc}'))
matches = found_files == list_of_known_years_election_data
if matches:
    print("Found all expected data.")
else:
    print(f"Expected data not found. Missing files are {list(set(list_of_known_years_election_data)-set(found_files))}")

Current Working Directory is /home/michael/git_repos/georgia_election_data/etl
Found all expected data.


## ETL Loading Functions

These functions load the two available vintages of Georgia Voter Data: data for years 2012 and prior, and data for years 2013 and subsequent.

In [5]:
def parse_function_2012_prior(in_line):

    in_line = in_line.decode('utf-8')
    
    county_no = in_line[0:3]
    reg_no = in_line[3:11]
    election_date = in_line[11:19]
    election_type = in_line[19:22].strip()
    party = in_line[22:23].strip()
    absentee = in_line[23:24]
    
    # dates for this time series are presented as day-month-year
    # we will standardize these dates to year-month-day to reduce
    # confusion and match the date format of the 2013 and onwards data
    #print(election_date)
    election_date = election_date[4:] + '-' + election_date[0:2] + '-' + election_date[2:4]
    
    # standardize party indicator from primary election records: single char
    # D -> Democrat
    # R -> Republican
    # N -> Non-Partisan
    
    if party == 'NP':
        party = 'N'
    elif party != 'D' and party != 'R' and party != 'N':
        party = ""


    if absentee == 'Y':
        absentee = True
    else:
        absentee = False
        
    return (0,county_no, reg_no, election_date,election_type, party, absentee,None,None) 

In [6]:
def parse_function_2013_post(in_line):

    # Convert all strings to utf-8 for international standarization purposes
    in_line = in_line.decode('utf-8')
    
    county_no = in_line[0:3].strip()
    reg_no = in_line[3:11].strip()
    election_date = in_line[11:19].strip()
    election_type = in_line[19:22].strip()
    party= in_line[22:24].strip()
    absentee =in_line[24:25].strip()
    provisional = in_line[25:26].strip()
    supplemental = in_line[26:27].strip()
    
    election_date = election_date[0:4] + '-' + election_date[4:6] + '-' + election_date[6:]    
    # the election date is already in YYYY-MM-DD, so no additional action needed
    
    # standardize party indicator from primary election records: single char
    # D -> Democrat
    # R -> Republican
    # N -> Non-Partisan
    
    if party == 'NP':
        party = 'N'
    elif party != 'D' and party != 'R' and party != 'N':
        party = None

    # Convert absentee flags to True/False

    if absentee == 'Y':
        absentee = True
    else:
        absentee = False

    # Convert provisional flags to True/False

    if provisional == 'Y':
        provisional = True
    else:
        provisional = False        

    # Convert supplemental flags to True/False
    
    if supplemental == 'Y':
        supplemental = True
    else:
        supplemental = False       
    return (1,county_no, reg_no, election_date,election_type, party, absentee,supplemental,absentee)
  

In [7]:
def check_if_elections_db_exists(full_path_to_db):
    found = False
    # first, check to see if the database file itself exists
    if os.path.exists(full_path_to_db):
        db = sqlite3.connect(full_path_to_db)
        c = db.cursor()
        c.execute(f'SELECT name FROM sqlite_master WHERE type=\'table\' AND name={dest_db_table_name}')
    else:
        return found
    
    

In [8]:
def get_n_lines_iterator(source_file):
    with ZipFile(source_file).open(ZipFile(source_file).namelist()[0]) as file:
        for i in file:
            yield i
            
#list_of_files = glob(working_directory + f'/{source_data_loc}' + f'/{data_vintage_loc}' + '/*')
#source_file = list_of_files[1]
#lines_required = batch_size

# get number of lines in text file



In [11]:
def create_etl_batches(source_file):
    with ZipFile(source_file).open(ZipFile(source_file).namelist()[0]) as f:
        line_count = sum(1 for _ in f)
        print(f'Datafile {ZipFile(source_file).namelist()[0]} contains {line_count} records')
        in_range = list(range(line_count))
        out_batch = [in_range[i * batch_size:(i + 1) * batch_size] for i in range((len(in_range) + batch_size - 1) // batch_size )] 
        return({'source_file':source_file,
                'line_count':line_count,
                'batches':[(min(b), max(b)) for b in out_batch]
        })

In [12]:
list_of_files = glob(working_directory + f'/{source_data_loc}' + f'/{data_vintage_loc}' + '/*')
list_of_files.sort()
print(f"Found {len(list_of_files)} to process.")
ret_batches = list(map(create_etl_batches, list_of_files))

Found 24 to process.
Datafile Voter History 1996.txt contains 4644200 records
Datafile Voter History 1997.txt contains 959435 records
Datafile Voter History 1998.txt contains 3553259 records
Datafile Voter History 1999.txt contains 539933 records
Datafile Voter History 2000.txt contains 5046847 records
Datafile Voter History 2001.txt contains 708042 records
Datafile Voter History 2002.txt contains 3731165 records
Datafile Voter History 2003.txt contains 602137 records
Datafile Voter History 2004.txt contains 6399634 records
Datafile Voter History 2005.txt contains 651522 records
Datafile Voter History 2006.txt contains 3796362 records
Datafile Voter History 2007.txt contains 752749 records
Datafile Voter History 2008.txt contains 9628482 records
Datafile Voter History 2009.txt contains 560700 records
Datafile Voter History 2010.txt contains 4841793 records
Datafile Voter History 2011.txt contains 877001 records
Datafile Voter History 2012.txt contains 7036094 records
Datafile 2013.TXT 

In [19]:
number_of_records = sum([r['batches'][-1][1] for r in ret_batches])
number_of_batches = sum([len(r['batches']) for r in ret_batches])
print(f'There are {number_of_records} total voting records and {number_of_batches} batches in the data sets')

There are 76453421 total voting records and 87 batches in the data sets


In [20]:
print(ret_batches[0])

{'source_file': '/home/michael/git_repos/georgia_election_data/source_data/20190709/1996.zip', 'line_count': 4644200, 'batches': [(0, 999999), (1000000, 1999999), (2000000, 2999999), (3000000, 3999999), (4000000, 4644199)]}


In [35]:
def convert_fwf_to_sqlitedb(file_dict):
    
    """
    This function takes the source data files and builds a sqlite database.
    CSVs are being used in lieu of pandas dataframes for the source data because
    the memory footprint and CPU overhead of 78 million records requires an
    un-needed quantity of those resources.
    """
    full_path_to_db = f'{working_directory}/{database_location}/{dest_db_name}'

    
    db = sqlite3.connect(full_path_to_db)

    # now load the files

    pre_2012_range = set(range(1996,2013))
    post_2013_range = set(range(2013,2020))

    cur_file_name = file_dict['source_file']
    
    # collect the records
    gen = get_n_lines_iterator(cur_file_name)
        
    for cur_range in tqdm(file_dict['batches'], desc=f'Processing batch in {os.path.basename(cur_file_name).split(".")[0]}'):
        start_i, end_i = cur_range
        cur_record_list = [next(gen) for r in range(start_i, end_i)]
        if int(os.path.basename(cur_file_name).split(".")[0]) in pre_2012_range:
            parsed_data = list(map(parse_function_2012_prior,cur_record_list))

        elif int(os.path.basename(cur_file_name).split(".")[0]) in post_2013_range:
            parsed_data = list(map(parse_function_2013_post,cur_record_list))

        #print(f"Loading {os.path.basename(f)} into the sqlite database")
        c = db.cursor()
        c.executemany(f"INSERT INTO {dest_db_table_name} VALUES (?,?, ?, ?, ?, ?, ?, ?, ?)", parsed_data)
        db.commit()
        #print(f"Loaded {os.path.basename(f)} into the sqlite database")

    db.close()
    

In [20]:
    c = db.cursor()
    c.execute(f'''
               CREATE TABLE {dest_db_table_name}(
               vintage INT,
               county_no TEXT,
               reg_no TEXT, 
               election_date TEXT,
               election_type TEXT, 
               party TEXT,
               absentee BOOLEAN,
               supplemental BOOLEAN,
               provisional BOOLEAN) 
            ''')
    db.commit()


In [36]:
convert_fwf_to_sqlitedb(ret_batches[0])

Processing batch in 1996:   0%|          | 0/5 [00:00<?, ?it/s]

Cur record list is 999999





OperationalError: database is locked

In [36]:
   if drop_table:
        c = db.cursor()
        c.execute(f'drop table if exists {dest_db_table_name};')
        db.commit()
    

### Generating the database

The following code generates the sqlite database from the source data. This process should take 5-10 minutes to complete, depending on available computing resources.

In [None]:
# Set the path for outputting the database file

original_data_full_path = working_root_folder + original_data_folder
database_out_path = working_root_folder +  '/database/'

# the sqlite database will be saved here
database_name = 'ga_elect2.db'

# Check first to see if this file exists
# If so, we want to give the option to delete it
# since it may be locked or corrupted

print(database_out_path + database_name)
is_database_exists = os.path.exists(database_out_path + database_name)

confirm = None 

if is_database_exists:
    confirm = input("The curent database name still exists. Type Y if you want to re-create the database")

if confirm == 'Y':    
    convert_fwf_to_sqlitedb(original_data_full_path, database_out_path + database_name)
else:
    print("Keeping existing database")

/home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db
The curent database name still exists. Type Y if you want to re-create the databaseY


Individual file processing progress: 100%|██████████| 24/24 [00:00<00:00, 54648.91it/s]

Found 24 to process.
Creating database in the /home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db folder
All Files Found



  0%|          | 0/24 [00:00<?, ?it/s]

Loading 1996.zip into the sqlite database


  4%|▍         | 1/24 [00:34<13:09, 34.31s/it]

Loaded 1996.zip into the sqlite database
Loading 1997.zip into the sqlite database


  8%|▊         | 2/24 [00:42<09:42, 26.50s/it]

Loaded 1997.zip into the sqlite database
Loading 1998.zip into the sqlite database


 12%|█▎        | 3/24 [01:09<09:18, 26.59s/it]

Loaded 1998.zip into the sqlite database
Loading 1999.zip into the sqlite database


 17%|█▋        | 4/24 [01:14<06:40, 20.04s/it]

Loaded 1999.zip into the sqlite database
Loading 2000.zip into the sqlite database


 21%|██        | 5/24 [01:52<08:04, 25.50s/it]

Loaded 2000.zip into the sqlite database
Loading 2001.zip into the sqlite database


 25%|██▌       | 6/24 [01:58<05:56, 19.78s/it]

Loaded 2001.zip into the sqlite database
Loading 2002.zip into the sqlite database


 29%|██▉       | 7/24 [02:27<06:22, 22.49s/it]

Loaded 2002.zip into the sqlite database
Loading 2003.zip into the sqlite database


 33%|███▎      | 8/24 [02:32<04:37, 17.33s/it]

Loaded 2003.zip into the sqlite database
Loading 2004.zip into the sqlite database


 38%|███▊      | 9/24 [03:22<06:43, 26.89s/it]

Loaded 2004.zip into the sqlite database
Loading 2005.zip into the sqlite database


 42%|████▏     | 10/24 [03:29<04:53, 20.95s/it]

Loaded 2005.zip into the sqlite database
Loading 2006.zip into the sqlite database


 46%|████▌     | 11/24 [03:57<05:00, 23.11s/it]

Loaded 2006.zip into the sqlite database
Loading 2007.zip into the sqlite database


 50%|█████     | 12/24 [04:03<03:37, 18.13s/it]

Loaded 2007.zip into the sqlite database
Loading 2008.zip into the sqlite database


 54%|█████▍    | 13/24 [05:16<06:20, 34.57s/it]

Loaded 2008.zip into the sqlite database
Loading 2009.zip into the sqlite database


 58%|█████▊    | 14/24 [05:23<04:20, 26.08s/it]

Loaded 2009.zip into the sqlite database
Loading 2010.zip into the sqlite database


 62%|██████▎   | 15/24 [06:00<04:26, 29.62s/it]

Loaded 2010.zip into the sqlite database
Loading 2011.zip into the sqlite database


 67%|██████▋   | 16/24 [06:11<03:10, 23.87s/it]

Loaded 2011.zip into the sqlite database
Loading 2012.zip into the sqlite database


 71%|███████   | 17/24 [07:06<03:52, 33.28s/it]

Loaded 2012.zip into the sqlite database
Loading 2013.zip into the sqlite database


 75%|███████▌  | 18/24 [07:13<02:32, 25.47s/it]

Loaded 2013.zip into the sqlite database
Loading 2014.zip into the sqlite database


 79%|███████▉  | 19/24 [07:50<02:24, 28.81s/it]

Loaded 2014.zip into the sqlite database
Loading 2015.zip into the sqlite database


 83%|████████▎ | 20/24 [07:56<01:27, 21.98s/it]

Loaded 2015.zip into the sqlite database
Loading 2016.zip into the sqlite database


 88%|████████▊ | 21/24 [09:03<01:46, 35.38s/it]

Loaded 2016.zip into the sqlite database
Loading 2017.zip into the sqlite database


 92%|█████████▏| 22/24 [09:16<00:57, 28.68s/it]

Loaded 2017.zip into the sqlite database
Loading 2018.zip into the sqlite database


In [10]:
# Test Function/

db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()
c.execute('''SELECT count(*) FROM ga_elect_data''')
assert(c.fetchall()[0][0]) == 76453445
db.close()


## Adding helper tables

There is one helper table that will also be added to the database. This helper table maps the election_type field to indicate if the field reflects a _primary_ election, a _primary runoff_ election, a _general_ election, or another type of vote.

This mapping is needed because the election_type coding has either 1) not been consistent from election to election or 2) a new election_type code was created when a ballot combined two types of elections. For example, a general election added to a recall election.

The repo includes a file `election_type_mapping.csv` that contains a manual classification between the election types and the above manual classification.



In [11]:
print(f"Opening {database_out_path + database_name}")
db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()

manual_classification_file_loc = working_root_folder + '/election_type_mapping.csv'

if os.path.exists(manual_classification_file_loc):
    print("found helper data here:" + manual_classification_file_loc)

    
# load the file into a list of tuples

input_str = open(manual_classification_file_loc, 'r').readlines()
input_str_list = [tuple(l.strip().split(',')) for l in input_str]
# remove header
input_str_list = input_str_list[1:]

c.execute('drop table if exists ga_elect_manual_classification;')
db.commit()
c = db.cursor()
c.execute('''
           CREATE TABLE ga_elect_manual_classification(
           election_type_index INT,
           election_type TEXT,
           election_type_description TEXT, 
           manual_classification TEXT)
           ''')
db.commit()
print("Table created")

c = db.cursor()
c.executemany("INSERT INTO ga_elect_manual_classification VALUES (?,?, ?, ?)", input_str_list)
db.commit()
print("Data uploaded")

db.close()

Opening /home/michaelhandelman/ga_election_code/ga_elect/database/ga_elect2.db
found helper data here:/home/michaelhandelman/ga_election_code/ga_elect/election_type_mapping.csv
Table created
Data uploaded


Confirming that this new table was created

In [12]:
# Test Function/

db = sqlite3.connect(database_out_path + database_name)
c = db.cursor()
c.execute('''SELECT count(*) FROM ga_elect_manual_classification''')
assert(c.fetchall()[0][0] == 37)

