# Texas Legislative Council Comprehensive Election Dataset ETL

## Notes
Voting Tabulation Districts (VTDs), the census geographic equivalent of county election precincts, are created for the purpose of relating 2020 Census population data to election precinct data. VTDs can differ from actual election precincts because precincts do not always follow census geography.

When a precinct consists of two noncontiguous pieces, each piece is stored as a suffixed VTD in the database. For example, if precinct 0001 had two noncontiguous areas, the corresponding VTDs would be VTD 0001A and VTD 0001B. If a 2022 general election precinct does not match any census geography, it is consolidated with an adjacent precinct and given that precinct's corresponding VTD number.
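
As a small illustration of the suffix convention, the sketch below splits a VTD label into its base precinct number and optional suffix. The helper name `split_vtd` and the assumption that a suffix is a single trailing uppercase letter are illustrative only; the source gives just the `0001A`/`0001B` example.

```python
import re

# Assumption: a VTD label is a run of digits optionally followed by one
# uppercase letter (e.g. "0001A"). The real data may use other patterns.
_VTD_PATTERN = re.compile(r'^(\d+)([A-Z]?)$')


def split_vtd(vtd):
    """Return (base_precinct, suffix) for a VTD label; suffix is '' if absent."""
    match = _VTD_PATTERN.match(vtd.strip())
    if match is None:
        raise ValueError(f'unrecognized VTD label: {vtd!r}')
    return match.group(1), match.group(2)


# Both pieces of a noncontiguous precinct map back to the same base number.
pieces = [split_vtd(label) for label in ('0001A', '0001B', '0002')]
print(pieces)
```

Grouping on the first element of each tuple would bring the pieces of a split precinct back together.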

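GIS users can join election datasets to the general election VTDs on the common field `VTDKEY` (the previews in this notebook show it as `vtdkey` in the voter registration and turnout file and `vtdkeyvalue` in the election returns file).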

## Sources
[Comprehensive Election Datasets](https://data.capitol.texas.gov/dataset/comprehensive-election-datasets-compressed-format) 
- 2022 General VTDs Election Data.zip
    - https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/b9ebdbdb-3e31-4c98-b158-0e2993b05efc/download/2022-general-vtds-election-data.zip
- 2022 Primary VTDs Election Data.zip
    - https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/3d870fce-c4ea-4412-ae9f-ef90e0a25233/download/2022-primary-vtds-election-data.zip

[2022 General Election VTDs](https://data.capitol.texas.gov/dataset/vtds)
- 2022 General Election VTDs Shapefiles.zip
    - https://data.capitol.texas.gov/dataset/4d8298d0-d176-4c19-b174-42837027b73e/resource/037e1de6-a862-49de-ae31-ae609e214972/download/vtds_22g.zip

## User Input

This section can be edited by the user of this notebook to change certain settings:
- initializing run
- county of reference (in FIPS code)
- year
- election type
- debug mode

In [1]:
county_or_counties = [453]     # county or counties of reference
year = '22'                     # can be `20` or `22`
election_type = 'g'             # can be `g` for general or `p` for primary
debug = False                   # if debug is true, database writes are disabled

## Pre-ETL
Checks whether the required files are available locally and downloads them from the source if they are not.

In [2]:
import os
import requests
import zipfile
import geopandas as gpd
import pandas as pd
import sqlalchemy

In [3]:
# builds local paths and the download URL for the VTD shapefiles
shapefiles_dirpath = f'../data/geospatial_files/shapefiles/texas_legislative_council/20{year}/'
shapefiles_filepath = f'vtds_{year}{election_type}'
shapefiles_url = f'https://data.capitol.texas.gov/dataset/4d8298d0-d176-4c19-b174-42837027b73e/resource/037e1de6-a862-49de-ae31-ae609e214972/download/vtds_{year}{election_type}.zip'

# downloads and extracts the shapefiles if the target directory does not exist yet
if not os.path.exists(shapefiles_dirpath):
    os.makedirs(shapefiles_dirpath)
    response = requests.get(shapefiles_url)
    response.raise_for_status()  # stop on a failed download instead of saving an error page
    with open(shapefiles_dirpath + shapefiles_filepath, 'wb') as f:
        f.write(response.content)

    # extracts .zip file contents (the context manager closes the archive)
    with zipfile.ZipFile(shapefiles_dirpath + shapefiles_filepath, mode='r') as archive:
        archive.extractall(path=shapefiles_dirpath)

    # deletes everything except the .shp, .shx, and .dbf files (including the downloaded archive)
    for file in os.listdir(shapefiles_dirpath):
        if not file.endswith(('.shp', '.shx', '.dbf')):
            os.remove(shapefiles_dirpath + file)
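
An alternative to extracting everything and then deleting the unwanted files is to extract only the members that are needed. A minimal sketch using the standard library follows; the helper name `extract_members` is hypothetical, and the demo archive is built in a temporary directory with placeholder contents, not downloaded from the source.

```python
import os
import tempfile
import zipfile


def extract_members(zip_path, dest_dir, keep_exts=('.shp', '.shx', '.dbf')):
    """Extract only archive members whose names end with one of keep_exts."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(keep_exts):
                archive.extract(name, path=dest_dir)
                extracted.append(name)
    return sorted(extracted)


# Demo with a throwaway archive containing placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        for name in ('VTDS_22G.shp', 'VTDS_22G.shx', 'VTDS_22G.dbf', 'VTDS_22G.xml'):
            archive.writestr(name, b'placeholder')
    out_dir = os.path.join(tmp, 'out')
    kept = extract_members(demo_zip, out_dir)
    on_disk = sorted(os.listdir(out_dir))

print(kept)
```

If the archive ships a `.prj` file, adding `'.prj'` to `keep_exts` would also retain the projection definition alongside the geometry files.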

In [4]:
# builds local paths and selects the download URL for the election datasets
dataset_dirpath = '../data/datasets/texas_legislative_council/'
dataset_filepath = f'{year}{election_type}_vtds_election_data'
if year == '22' and election_type == 'g':
    election_data_url = 'https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/b9ebdbdb-3e31-4c98-b158-0e2993b05efc/download/2022-general-vtds-election-data.zip'
else:
    election_data_url = None  # no URL is configured here for other years or election types

# downloads and extracts the datasets if the target directory does not exist yet
if not os.path.exists(dataset_dirpath):
    if election_data_url is None:
        raise ValueError(f'no download URL configured for year={year!r}, election_type={election_type!r}')
    os.makedirs(dataset_dirpath)
    response = requests.get(election_data_url)
    response.raise_for_status()  # stop on a failed download instead of saving an error page
    with open(dataset_dirpath + dataset_filepath, 'wb') as f:
        f.write(response.content)

    # extracts .zip file contents (the context manager closes the archive)
    with zipfile.ZipFile(dataset_dirpath + dataset_filepath, 'r') as archive:
        archive.extractall(path=dataset_dirpath)

    # deletes everything except the .csv and .txt files (including the downloaded archive)
    for file in os.listdir(dataset_dirpath):
        if not file.endswith(('.csv', '.txt')):
            os.remove(os.path.join(dataset_dirpath, file))
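
Since each year and election type has its own download URL, one option is a lookup table keyed on both settings. The sketch below uses the two election data URLs listed under Sources; the names `ELECTION_DATA_URLS` and `get_election_data_url` are illustrative and not part of the notebook.

```python
# Keys are (year, election_type) pairs as used in the User Input cell.
# Only the two URLs listed under Sources are included here.
ELECTION_DATA_URLS = {
    ('22', 'g'): 'https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/b9ebdbdb-3e31-4c98-b158-0e2993b05efc/download/2022-general-vtds-election-data.zip',
    ('22', 'p'): 'https://data.capitol.texas.gov/dataset/35b16aee-0bb0-4866-b1ec-859f1f044241/resource/3d870fce-c4ea-4412-ae9f-ef90e0a25233/download/2022-primary-vtds-election-data.zip',
}


def get_election_data_url(year, election_type):
    """Return the configured URL, or raise ValueError for an unknown combination."""
    try:
        return ELECTION_DATA_URLS[(year, election_type)]
    except KeyError:
        raise ValueError(
            f'no URL configured for year={year!r}, election_type={election_type!r}'
        ) from None


print(get_election_data_url('22', 'g'))
```

Failing early with a clear message avoids reaching the download step with an undefined URL.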

## Extract

Extracts data from local files

### Shapefile Extract

In [5]:
# creates GeoDataFrame from Shapefiles
gdf = gpd.read_file(shapefiles_dirpath + shapefiles_filepath.upper() + '.shp').set_crs(epsg=3395)
gdf.head()

Unnamed: 0,CNTY,COLOR,VTD,CNTYKEY,VTDKEY,CNTYVTD,Shape_area,Shape_len,geometry
0,1,6,1,1,1.0,10001,5666216.0,15288.088777,"POLYGON ((1413960.808 1073012.816, 1413971.571..."
1,1,1,2,1,2.0,10002,256212900.0,94434.420881,"POLYGON ((1420165.429 1066385.798, 1420251.968..."
2,1,3,3,1,3.0,10003,70722280.0,55660.372406,"POLYGON ((1416579.790 1072023.104, 1416744.635..."
3,1,7,4,1,4.0,10004,241066200.0,91319.549282,"POLYGON ((1435674.876 1074608.545, 1435714.039..."
4,1,6,5,1,5.0,10005,168985400.0,86937.648556,"POLYGON ((1436888.342 1072498.734, 1436911.364..."


### Election Returns & Voter Registration and Voter Turnout Extract

In [6]:
# loads voting results dataset into DataFrame
if election_type == 'g':    
    election_returns_df = pd.read_csv(f'{dataset_dirpath}20{year}_General_Election_Returns.csv')
    voter_registration_and_turnout_df = pd.read_csv(f'{dataset_dirpath}20{year}_General_Election_VRTO.csv')

In [7]:
# previews the first five rows of the election returns (Anderson County VTDs in this run)
election_returns_df.head()

Unnamed: 0,County,FIPS,VTD,cntyvtd,vtdkeyvalue,Office,Name,Party,Incumbent,Votes
0,Anderson,1,1,10001,1,Governor,Abbott,R,Y,610
1,Anderson,1,2,10002,2,Governor,Abbott,R,Y,1165
2,Anderson,1,3,10003,3,Governor,Abbott,R,Y,573
3,Anderson,1,4,10004,4,Governor,Abbott,R,Y,808
4,Anderson,1,5,10005,5,Governor,Abbott,R,Y,163


In [8]:
# previews the first five rows of the voter registration and turnout totals (Anderson County VTDs in this run)
voter_registration_and_turnout_df.head()

Unnamed: 0,County,FIPS,VTD,CNTYVTD,vtdkey,TotalPop,TotalVR,SpanishSurnamePercent,TotalTO
0,Anderson,1,1,10001,1,3153,1834,10.2,828
1,Anderson,1,2,10002,2,3811,2697,3.7,1317
2,Anderson,1,3,10003,3,1925,1228,6.0,658
3,Anderson,1,4,10004,4,2306,1610,3.0,884
4,Anderson,1,5,10005,5,405,286,1.7,173


## Transform

Parse the target geography from the DataFrames and the GeoDataFrame, then add the `geometry` column from the GeoDataFrame to both the voter registration and turnout DataFrame and the election returns DataFrame, creating one GeoDataFrame per dataset. The cells in this section cover the parsing step.

In [9]:
# extracts target geography from GeoDataFrame (keeps every county listed in county_or_counties)
parsed_gdf = gdf[gdf['CNTY'].isin(county_or_counties)]
parsed_gdf.head()

Unnamed: 0,CNTY,COLOR,VTD,CNTYKEY,VTDKEY,CNTYVTD,Shape_area,Shape_len,geometry
8387,453,6,100,227,8706.0,4530100,1912838.0,5485.476939,"POLYGON ((1218820.256 904027.602, 1218843.617 ..."
8388,453,7,101,227,8707.0,4530101,1911649.0,6631.245395,"POLYGON ((1219720.633 902394.500, 1219646.994 ..."
8389,453,1,102,227,8708.0,4530102,1226771.0,4999.409181,"POLYGON ((1219034.652 903490.680, 1218936.068 ..."
8390,453,4,103,227,8709.0,4530103,1375887.0,6490.054312,"POLYGON ((1219432.868 904857.233, 1219459.578 ..."
8391,453,3,104,227,8710.0,4530104,1892998.0,6469.607264,"POLYGON ((1220118.996 905118.975, 1220132.937 ..."


In [10]:
# parses voter registration and turnout DataFrame to include only target geographies
parsed_voter_registration_and_turnout_df = voter_registration_and_turnout_df[voter_registration_and_turnout_df['FIPS'].isin(county_or_counties)]
parsed_voter_registration_and_turnout_df.head()

Unnamed: 0,County,FIPS,VTD,CNTYVTD,vtdkey,TotalPop,TotalVR,SpanishSurnamePercent,TotalTO
8705,Travis,453,100,4530100,8706,3863,3377,14.9,1690
8706,Travis,453,101,4530101,8707,4278,3266,19.1,1397
8707,Travis,453,102,4530102,8708,3485,3387,11.1,1804
8708,Travis,453,103,4530103,8709,2903,2338,10.7,1207
8709,Travis,453,104,4530104,8710,3739,3522,7.9,2147


In [11]:
# parses election returns DataFrame to include only target geographies
parsed_election_returns_df = election_returns_df[election_returns_df['FIPS'].isin(county_or_counties)]
parsed_election_returns_df.head()

Unnamed: 0,County,FIPS,VTD,cntyvtd,vtdkeyvalue,Office,Name,Party,Incumbent,Votes
635664,Travis,453,100,4530100,8706,U.S. Rep 35,Casar,D,N,1415
635665,Travis,453,101,4530101,8707,U.S. Rep 35,Casar,D,N,1220
635666,Travis,453,102,4530102,8708,U.S. Rep 35,Casar,D,N,1601
635667,Travis,453,103,4530103,8709,U.S. Rep 35,Casar,D,N,1091
635668,Travis,453,104,4530104,8710,U.S. Rep 37,Doggett,D,Y,1976
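
The join described at the top of this section is not shown in the cells above. A minimal sketch with pandas follows, using a few rows copied from the previews; placeholder strings stand in for real geometries so the example runs without the shapefiles. Treating `VTDKEY` and `vtdkey` as the same key, and casting the float key to an integer before merging, are assumptions based on the previews.

```python
import pandas as pd

# Toy stand-ins for the GeoDataFrame and the VRTO DataFrame.
geo = pd.DataFrame({
    'CNTY': [453, 453, 1],
    'VTDKEY': [8706.0, 8707.0, 1.0],
    'geometry': ['POLYGON A', 'POLYGON B', 'POLYGON C'],  # placeholders
})
vrto = pd.DataFrame({
    'FIPS': [453, 453, 1],
    'vtdkey': [8706, 8707, 1],
    'TotalVR': [3377, 3266, 1834],
    'TotalTO': [1690, 1397, 828],
})

# Keep only the counties of interest in both tables.
counties = [453]
geo_sel = geo[geo['CNTY'].isin(counties)].copy()
vrto_sel = vrto[vrto['FIPS'].isin(counties)].copy()

# Align the key name and dtype before merging.
geo_sel['VTDKEY'] = geo_sel['VTDKEY'].astype('int64')
vrto_sel = vrto_sel.rename(columns={'vtdkey': 'VTDKEY'})

# One VRTO row per VTD, so validate the merge as one-to-one.
merged = vrto_sel.merge(
    geo_sel[['VTDKEY', 'geometry']],
    on='VTDKEY',
    how='left',
    validate='one_to_one',
)
print(merged)
```

With real geometries, wrapping the result in `gpd.GeoDataFrame(merged, geometry='geometry')` should restore the geospatial type. The election returns file has several rows per VTD (one per candidate), so the equivalent merge on `vtdkeyvalue` would be many-to-one.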


## Load

The following code loads the parsed DataFrames into the database for future analysis.

In [12]:
# builds the path to the local SQLite database file
database_dirpath = '../data/databases/texas_legislative_council'
os.makedirs(database_dirpath, exist_ok=True)  # SQLite does not create missing parent directories
election_dataset_db_filepath = os.path.join(database_dirpath, f'texas_legislative_council_election_dataset_20{year}.db')

# creates connection to SQLite database
sql_engine = sqlalchemy.create_engine('sqlite:///' + election_dataset_db_filepath)

In [13]:
# writes election returns to database (skipped in debug mode)
if not debug:
    parsed_election_returns_df.to_sql('election_returns', sql_engine, if_exists='append')

In [14]:
# writes voter registration and turnout to database (skipped in debug mode)
if not debug:
    parsed_voter_registration_and_turnout_df.to_sql('voter_registration_and_turnout', sql_engine, if_exists='append')
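
Note that `if_exists='append'` adds rows on every run, so executing the load cells twice stores each row twice. The sketch below demonstrates this with an in-memory SQLite database and a toy table whose values are copied from the preview above, then shows `if_exists='replace'` as one possible alternative. Whether replacing is appropriate is an open design question: it would not suit a table meant to accumulate several counties across runs.

```python
import sqlite3

import pandas as pd

# Toy rows standing in for the parsed VRTO DataFrame.
df = pd.DataFrame({'FIPS': [453, 453], 'VTD': [100, 101], 'TotalTO': [1690, 1397]})
table = 'voter_registration_and_turnout'
conn = sqlite3.connect(':memory:')


def count_rows(connection, table_name):
    """Return the number of rows currently stored in table_name."""
    return connection.execute(f'SELECT COUNT(*) FROM {table_name}').fetchone()[0]


# Appending twice duplicates the data.
df.to_sql(table, conn, if_exists='append', index=False)
df.to_sql(table, conn, if_exists='append', index=False)
rows_after_append = count_rows(conn, table)

# Replacing drops the existing table and writes the rows once.
df.to_sql(table, conn, if_exists='replace', index=False)
rows_after_replace = count_rows(conn, table)

conn.close()
print(rows_after_append, rows_after_replace)
```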