# Texas Legislative Council Comprehensive Election Dataset ETL

## Notes
Voting Tabulation Districts (VTDs), the census geographic equivalent of county election precincts, are created for the purpose of relating 2020 Census population data to election precinct data. VTDs can differ from actual election precincts because precincts do not always follow census geography.

When a precinct consists of two noncontiguous pieces, each piece is stored as a suffixed VTD in the database. For example, if precinct 0001 had two noncontiguous areas, the corresponding VTDs would be VTD 0001A and VTD 0001B. If a 2022 general election precinct does not match any census geography, it is consolidated with an adjacent precinct and given that precinct's corresponding VTD number.
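
The suffix convention described above can be undone when results need to be rolled back up to the original precinct. The following is a minimal sketch, assuming that a VTD label is a run of digits optionally followed by a single letter; `split_vtd` is a hypothetical helper and is not part of the source datasets, and real VTD labels may not all follow this pattern.

```python
import re

# Assumed label shape: digits followed by at most one letter suffix (e.g. 0001A).
_VTD_PATTERN = re.compile(r'^(\d+)([A-Z]?)$')

def split_vtd(vtd):
    """Split a VTD label into (base precinct number, suffix); suffix is '' if absent."""
    match = _VTD_PATTERN.match(vtd.strip().upper())
    if match is None:
        raise ValueError(f'unrecognized VTD label: {vtd!r}')
    return match.group(1), match.group(2)

# Both pieces of the example precinct map back to the same base number.
parts = [split_vtd(label) for label in ['0001A', '0001B', '0002']]
```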

GIS users can join election datasets to the general election VTDs using the common field `VTDKEY`.

## Sources
[Comprehensive Election Datasets](https://data.capitol.texas.gov/dataset/comprehensive-election-datasets-compressed-format) 
- 2022 General VTDs Election Data .zip
    - https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/b9ebdbdb-3e31-4c98-b158-0e2993b05efc/download/2022-general-vtds-election-data.zip
- 2022 Primary VTDs Election Data .zip
    - https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/3d870fce-c4ea-4412-ae9f-ef90e0a25233/download/2022-primary-vtds-election-data.zip

[2022 General Election VTDs](https://data.capitol.texas.gov/dataset/vtds)
- 2022 General Election VTDs Shapefiles .zip
    - https://data.capitol.texas.gov/dataset/4d8298d0-d176-4c19-b174-42837027b73e/resource/037e1de6-a862-49de-ae31-ae609e214972/download/vtds_22g.zip

## User Input

This section can be edited by the user of this notebook to change certain settings:
- initializing run
- county of reference (in FIPS code)
- year
- election type
- debug mode

In [34]:
initializing = True             # if initial run, initializes databases, etc.
county_or_counties = [29]     # county or counties of reference
year = '22'                     # can be `20` or `22`
election_type = 'g'             # can be `g` for general or `p` for primary
debug = True                   # if debug is true, database writes are disabled
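
Because later cells depend on these settings, it can help to check them before any files are downloaded. The sketch below is one possible approach, not part of the original notebook: `validate_settings` is a hypothetical helper, and the allowed values are taken only from the comments in the settings cell above.

```python
# Allowed values mirror the comments in the settings cell.
ALLOWED_YEARS = {'20', '22'}
ALLOWED_ELECTION_TYPES = {'g', 'p'}

def validate_settings(county_or_counties, year, election_type):
    """Return a list of problems found in the settings (empty if there are none)."""
    problems = []
    if not county_or_counties:
        problems.append('county_or_counties must list at least one county')
    elif not all(isinstance(c, int) and c > 0 for c in county_or_counties):
        problems.append('county codes must be positive integers')
    if year not in ALLOWED_YEARS:
        problems.append(f'year must be one of {sorted(ALLOWED_YEARS)}')
    if election_type not in ALLOWED_ELECTION_TYPES:
        problems.append(f'election_type must be one of {sorted(ALLOWED_ELECTION_TYPES)}')
    return problems

ok = validate_settings([29], '22', 'g')   # settings as used in this notebook
bad = validate_settings([], 22, 'x')      # empty list, integer year, unknown type
```

Returning a list rather than raising on the first error lets the user see every misconfigured setting at once.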

## Pre-ETL
Checks whether the required files are available locally and downloads them from the source if they are not.

In [35]:
import os
import requests
import zipfile
import geopandas as gpd
import pandas as pd
import sqlalchemy

In [36]:
# checks for Shapefiles
shapefiles_dirpath = f'data/geospatial_files/shapefiles/texas_legislative_council/20{year}/'
shapefiles_filepath = f'vtds_{year}{election_type}'
shapefiles_url = f'https://data.capitol.texas.gov/dataset/4d8298d0-d176-4c19-b174-42837027b73e/resource/037e1de6-a862-49de-ae31-ae609e214972/download/vtds_{year}{election_type}.zip'

# downloads and extracts the shapefiles only if the directory does not exist yet
if not os.path.exists(shapefiles_dirpath):
    os.makedirs(shapefiles_dirpath)
    response = requests.get(shapefiles_url)
    response.raise_for_status()  # stop early on a failed download
    with open(shapefiles_dirpath + shapefiles_filepath, 'wb') as f:
        f.write(response.content)

    # extracts .zip file contents (the `with` block closes the archive)
    with zipfile.ZipFile(shapefiles_dirpath + shapefiles_filepath, mode='r') as archive:
        archive.extractall(path=shapefiles_dirpath)

    # deletes everything except the .shp, .shx, and .dbf files
    for file in os.listdir(shapefiles_dirpath):
        if not file.endswith(('.shp', '.shx', '.dbf')):
            os.remove(shapefiles_dirpath + file)
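
The cell above triggers a download only when the directory itself is missing, so an existing but empty directory is treated as complete. A stricter check could look at the directory contents instead. The following is a minimal sketch of that idea; `has_required_files` is a hypothetical helper, and the file names used to exercise it are placeholders.

```python
import os
import tempfile

def has_required_files(dirpath, extensions):
    """Return True if dirpath exists and holds at least one file for every extension."""
    if not os.path.isdir(dirpath):
        return False
    found = {os.path.splitext(name)[1].lower() for name in os.listdir(dirpath)}
    return all(ext.lower() in found for ext in extensions)

required = ['.shp', '.shx', '.dbf']
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_required_files(tmp, required)        # directory exists but is empty
    for name in ['VTDS_22G.shp', 'VTDS_22G.shx', 'VTDS_22G.dbf']:
        open(os.path.join(tmp, name), 'wb').close()         # create placeholder files
    complete_result = has_required_files(tmp, required)
missing_result = has_required_files(tmp, required)          # directory has been removed
```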

In [37]:
# checks for datasets
dataset_dirpath = '../data/datasets/texas_legislative_council/'
dataset_filepath = f'{year}{election_type}_vtds_election_data'
# only the 2022 general election download URL is configured here
if year == '22' and election_type == 'g':
    election_data_url = 'https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/b9ebdbdb-3e31-4c98-b158-0e2993b05efc/download/2022-general-vtds-election-data.zip'

# downloads and extracts the datasets only if the directory does not exist yet
if not os.path.exists(dataset_dirpath):
    os.makedirs(dataset_dirpath)
    response = requests.get(election_data_url)
    response.raise_for_status()  # stop early on a failed download
    with open(dataset_dirpath + dataset_filepath, 'wb') as f:
        f.write(response.content)

    # extracts .zip file contents (the `with` block closes the archive)
    with zipfile.ZipFile(dataset_dirpath + dataset_filepath, 'r') as archive:
        archive.extractall(path=dataset_dirpath)

    # deletes everything except the .csv and .txt files
    for file in os.listdir(dataset_dirpath):
        if not file.endswith(('.csv', '.txt')):
            os.remove(os.path.join(dataset_dirpath, file))
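
The cells above extract the whole archive and then delete the files that are not needed. An alternative is to extract only the wanted members in the first place, using the `members` argument of `zipfile.ZipFile.extractall`. The sketch below illustrates this on a small in-memory archive; `extract_by_extension` is a hypothetical helper and the archive contents are placeholders.

```python
import io
import os
import tempfile
import zipfile

def extract_by_extension(zip_source, dest_dir, extensions):
    """Extract only members whose names end with one of `extensions`; return their names."""
    wanted = tuple(ext.lower() for ext in extensions)
    with zipfile.ZipFile(zip_source) as archive:
        members = [m for m in archive.namelist() if m.lower().endswith(wanted)]
        archive.extractall(path=dest_dir, members=members)
    return members

# Build a throwaway archive in memory with one wanted and one unwanted file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('returns.csv', 'County,Votes\n')
    archive.writestr('readme.pdf', 'placeholder')
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_by_extension(buffer, tmp, ['.csv', '.txt'])
    remaining = sorted(os.listdir(tmp))
```

With this approach no cleanup pass is needed, because the unwanted files never reach the disk.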

## Extract

Extracts data from the local files.

### Shapefile Extract

In [38]:
# creates GeoDataFrame from Shapefiles
gdf = gpd.read_file(shapefiles_dirpath + shapefiles_filepath.upper() + '.shp').set_crs(epsg=3395)
gdf.head()

Unnamed: 0,CNTY,COLOR,VTD,CNTYKEY,VTDKEY,CNTYVTD,Shape_area,Shape_len,geometry
0,1,6,1,1,1.0,10001,5666216.0,15288.088777,"POLYGON ((1413960.808 1073012.816, 1413971.571..."
1,1,1,2,1,2.0,10002,256212900.0,94434.420881,"POLYGON ((1420165.429 1066385.798, 1420251.968..."
2,1,3,3,1,3.0,10003,70722280.0,55660.372406,"POLYGON ((1416579.790 1072023.104, 1416744.635..."
3,1,7,4,1,4.0,10004,241066200.0,91319.549282,"POLYGON ((1435674.876 1074608.545, 1435714.039..."
4,1,6,5,1,5.0,10005,168985400.0,86937.648556,"POLYGON ((1436888.342 1072498.734, 1436911.364..."


### Election Returns & Voter Registration and Voter Turnout Extract

In [39]:
# loads voting results dataset into DataFrame
if election_type == 'g':    
    election_returns_df = pd.read_csv(f'{dataset_dirpath}20{year}_General_Election_Returns.csv')
    voter_registration_and_turnout_df = pd.read_csv(f'{dataset_dirpath}20{year}_General_Election_VRTO.csv')

In [40]:
# previews the first five rows of the election returns DataFrame
election_returns_df.head()

Unnamed: 0,County,FIPS,VTD,cntyvtd,vtdkeyvalue,Office,Name,Party,Incumbent,Votes
0,Anderson,1,1,10001,1,Governor,Abbott,R,Y,610
1,Anderson,1,2,10002,2,Governor,Abbott,R,Y,1165
2,Anderson,1,3,10003,3,Governor,Abbott,R,Y,573
3,Anderson,1,4,10004,4,Governor,Abbott,R,Y,808
4,Anderson,1,5,10005,5,Governor,Abbott,R,Y,163


In [41]:
# previews the first five rows of the voter registration and turnout DataFrame
voter_registration_and_turnout_df.head()

Unnamed: 0,County,FIPS,VTD,CNTYVTD,vtdkey,TotalPop,TotalVR,SpanishSurnamePercent,TotalTO
0,Anderson,1,1,10001,1,3153,1834,10.2,828
1,Anderson,1,2,10002,2,3811,2697,3.7,1317
2,Anderson,1,3,10003,3,1925,1228,6.0,658
3,Anderson,1,4,10004,4,2306,1610,3.0,884
4,Anderson,1,5,10005,5,405,286,1.7,173


## Transform

Filters the DataFrames and the GeoDataFrame to the target geography, then adds the `geometry` column from the GeoDataFrame to both the voter registration and turnout DataFrame and the election returns DataFrame, so that each dataset carries its own geometry.
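
The two steps described above can be sketched on toy data. This is an illustration only: the tables below are small stand-ins with placeholder strings in place of real geometries, `isin` is one way to filter on a list that may hold several counties, and the `validate` argument is an addition that is not used in the notebook's own cells.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables; geometries are placeholder strings.
shapes = pd.DataFrame({
    'CNTY': [1, 29, 29],
    'VTDKEY': [1.0, 251.0, 252.0],   # float keys, as in the shapefile preview
    'geometry': ['POLYGON A', 'POLYGON B', 'POLYGON C'],
})
turnout = pd.DataFrame({
    'FIPS': [1, 29, 29],
    'vtdkey': [1, 251, 252],         # integer keys, as in the CSV preview
    'TotalTO': [828, 528, 287],
})
counties = [29]

# Step 1: keep only the target counties.
shapes_subset = shapes[shapes['CNTY'].isin(counties)]
turnout_subset = turnout[turnout['FIPS'].isin(counties)]

# Step 2: attach the geometry column by matching the VTD keys.
merged = pd.merge(
    turnout_subset,
    shapes_subset[['VTDKEY', 'geometry']],
    left_on='vtdkey',
    right_on='VTDKEY',
    validate='one_to_one',           # raises if a key is duplicated on either side
).drop(columns='VTDKEY')
```

For the election returns table, where each VTD appears once per candidate, `validate='many_to_one'` would be the matching choice.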

In [42]:
# extract target geography from GeoDataFrame (keeps every county listed in `county_or_counties`)
parsed_gdf = gdf[gdf['CNTY'].isin(county_or_counties)]
parsed_gdf.head()

Unnamed: 0,CNTY,COLOR,VTD,CNTYKEY,VTDKEY,CNTYVTD,Shape_area,Shape_len,geometry
248,29,7,1002,15,252.0,291002,393145.3,2767.912893,"POLYGON ((1127317.181 808422.762, 1127254.531 ..."
249,29,1,1003,15,253.0,291003,2204971.0,6923.592336,"POLYGON ((1145785.821 806705.122, 1145781.822 ..."
250,29,2,1004,15,254.0,291004,1650327.0,5641.228836,"POLYGON ((1146409.144 803755.848, 1146384.195 ..."
251,29,1,1005,15,255.0,291005,847841.4,4453.396699,"POLYGON ((1144737.072 804975.860, 1144739.074 ..."
252,29,3,1006,15,256.0,291006,1550623.0,5644.273424,"POLYGON ((1144238.746 804758.981, 1144258.132 ..."


In [43]:
# parse voter registration and turnout DataFrame to include only target geographies
parsed_voter_registration_and_turnout_df = voter_registration_and_turnout_df[voter_registration_and_turnout_df['FIPS'].isin(county_or_counties)]
parsed_voter_registration_and_turnout_df.head()

Unnamed: 0,County,FIPS,VTD,CNTYVTD,vtdkey,TotalPop,TotalVR,SpanishSurnamePercent,TotalTO
250,Bexar,29,1001,291001,251,1026,934,47.8,528
251,Bexar,29,1002,291002,252,1039,679,56.4,287
252,Bexar,29,1003,291003,253,3399,2237,59.8,927
253,Bexar,29,1004,291004,254,1689,1041,77.0,359
254,Bexar,29,1005,291005,255,1516,844,84.1,254


In [44]:
# merge GeoDataFrame `geometry` columns with voter registration and turnout DataFrame
voter_registration_and_turnout_gdf = pd.merge(parsed_voter_registration_and_turnout_df, parsed_gdf[['VTDKEY', 'geometry']], left_on='vtdkey', right_on='VTDKEY')
voter_registration_and_turnout_gdf.drop(columns='VTDKEY', inplace=True)
voter_registration_and_turnout_gdf.head()

Unnamed: 0,County,FIPS,VTD,CNTYVTD,vtdkey,TotalPop,TotalVR,SpanishSurnamePercent,TotalTO,geometry
0,Bexar,29,1001,291001,251,1026,934,47.8,528,"POLYGON ((1145716.223 806977.750, 1145699.040 ..."
1,Bexar,29,1002,291002,252,1039,679,56.4,287,"POLYGON ((1127317.181 808422.762, 1127254.531 ..."
2,Bexar,29,1003,291003,253,3399,2237,59.8,927,"POLYGON ((1145785.821 806705.122, 1145781.822 ..."
3,Bexar,29,1004,291004,254,1689,1041,77.0,359,"POLYGON ((1146409.144 803755.848, 1146384.195 ..."
4,Bexar,29,1005,291005,255,1516,844,84.1,254,"POLYGON ((1144737.072 804975.860, 1144739.074 ..."


In [45]:
# parse election returns DataFrame to include only target geographies
parsed_election_returns_df = election_returns_df[election_returns_df['FIPS'].isin(county_or_counties)]
parsed_election_returns_df.head()

Unnamed: 0,County,FIPS,VTD,cntyvtd,vtdkeyvalue,Office,Name,Party,Incumbent,Votes
10450,Bexar,29,1001,291001,251,U.S. Rep 35,Casar,D,N,416
10451,Bexar,29,1002,291002,252,U.S. Rep 20,Castro,D,Y,185
10452,Bexar,29,1003,291003,253,U.S. Rep 35,Casar,D,N,718
10453,Bexar,29,1004,291004,254,U.S. Rep 20,Castro,D,Y,274
10454,Bexar,29,1005,291005,255,U.S. Rep 20,Castro,D,Y,196


In [46]:
# merge GeoDataFrame `geometry` column with election returns DataFrame
election_returns_gdf = pd.merge(parsed_election_returns_df, parsed_gdf[['VTDKEY', 'geometry']], left_on='vtdkeyvalue', right_on='VTDKEY')
election_returns_gdf.drop(columns='VTDKEY', inplace=True)
election_returns_gdf.rename(columns={'vtdkeyvalue': 'vtdkey'}, inplace=True)
election_returns_gdf.head()

Unnamed: 0,County,FIPS,VTD,cntyvtd,vtdkey,Office,Name,Party,Incumbent,Votes,geometry
0,Bexar,29,1001,291001,251,U.S. Rep 35,Casar,D,N,416,"POLYGON ((1145716.223 806977.750, 1145699.040 ..."
1,Bexar,29,1002,291002,252,U.S. Rep 20,Castro,D,Y,185,"POLYGON ((1127317.181 808422.762, 1127254.531 ..."
2,Bexar,29,1003,291003,253,U.S. Rep 35,Casar,D,N,718,"POLYGON ((1145785.821 806705.122, 1145781.822 ..."
3,Bexar,29,1004,291004,254,U.S. Rep 20,Castro,D,Y,274,"POLYGON ((1146409.144 803755.848, 1146384.195 ..."
4,Bexar,29,1005,291005,255,U.S. Rep 20,Castro,D,Y,196,"POLYGON ((1144737.072 804975.860, 1144739.074 ..."


## Load

The following code loads the DataFrame/GeoDataFrame into the database for future analysis. 

In [47]:
# builds the path to the local SQLite database
database_dirpath = '../data/databases'
election_dataset_db_filepath = os.path.join(database_dirpath, f'texas_legislative_council_election_dataset_20{year}.db')

# uses the `initializing` flag from the user input cell to decide whether to replace table contents or append to them
replace_or_append = 'replace' if initializing else 'append'

# creates connection to SQLite database
sql_engine = sqlalchemy.create_engine('sqlite:///' + election_dataset_db_filepath)

In [48]:
# writes election returns to database (skipped in debug mode)
if not debug:
    # converts geometries to text so they can be written to SQLite
    election_returns_gdf['geometry'] = election_returns_gdf['geometry'].astype(str)
    print('transformation complete: beginning database write')
    election_returns_gdf.to_sql('election_returns', sql_engine, if_exists=replace_or_append)

In [49]:
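# writes voter registration and turnout to database (skipped in debug mode)
if not debug:
    # converts geometries to text so they can be written to SQLite
    voter_registration_and_turnout_gdf['geometry'] = voter_registration_and_turnout_gdf['geometry'].astype(str)
    voter_registration_and_turnout_gdf.to_sql('voter_registration_and_turnout', sql_engine, if_exists=replace_or_append)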
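
The writes above pass either `'replace'` or `'append'` to the `if_exists` argument of `DataFrame.to_sql`. The difference can be seen in a small sketch that uses an in-memory SQLite database through the standard `sqlite3` module instead of the notebook's SQLAlchemy engine; the rows and the `row_count` helper are illustrative only.

```python
import sqlite3
import pandas as pd

# Toy rows standing in for the election returns table.
rows = pd.DataFrame({'vtdkey': [251, 252], 'Votes': [416, 185]})

conn = sqlite3.connect(':memory:')  # throwaway in-memory database

def row_count(connection):
    """Count the rows currently stored in the election_returns table."""
    return connection.execute('SELECT COUNT(*) FROM election_returns').fetchone()[0]

# 'replace' drops any existing table and writes the rows fresh.
rows.to_sql('election_returns', conn, if_exists='replace', index=False)
count_after_replace = row_count(conn)

# 'append' keeps the existing rows and adds the new ones after them.
rows.to_sql('election_returns', conn, if_exists='append', index=False)
count_after_append = row_count(conn)

# A second 'replace' discards everything written so far.
rows.to_sql('election_returns', conn, if_exists='replace', index=False)
count_after_second_replace = row_count(conn)

conn.close()
```

One consequence is that re-running the notebook for the same county with `initializing = False` appends the same rows again, so duplicates need to be handled separately.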