# Cleaning Appendix

Importing libraries:

In [19]:
import numpy as np
import pandas as pd
import duckdb

Reading in our amended CSV, casting the Boolean Presidential indicator column to ints (for our model), and keeping only the columns we still need after the phase 2 EDA.

In [20]:
demographics = pd.read_csv('data/election_demographics_amended.csv', index_col=False)

demographics['Presidential'] = demographics['Presidential'].astype(int)

demographics = demographics[['State','Year','TotalBallots','PercentVotingEligibleVotes',
                             'PercentBachelors','Income','PercentWhite','AverageAge','Presidential']]

We are going to convert the scale of the data columns below:

- PercentVotingEligibleVotes: from a 0-1 scale to a 0-100 scale (multiplying by 100)
- PercentWhite: from a 0-1 scale to a 0-100 scale (multiplying by 100)
- Income: dividing this by 1000 to indicate income in thousands of dollars

These changes just make our model coefficients more readable, since the predictors end up on more comparable scales. No information is lost: each transformation is a linear rescaling, so only the units of the coefficients change.

In [21]:
demographics['PercentVotingEligibleVotes'] = demographics['PercentVotingEligibleVotes'] * 100
demographics['PercentWhite'] = demographics['PercentWhite'] * 100
demographics['Income'] = demographics['Income'] / 1000
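
As a quick illustration of why rescaling a predictor loses no information, here is a minimal sketch on made-up data (the arrays x and y and the helper fit_line are hypothetical and not part of our dataset or model). It fits the same least-squares line with a predictor on a 0-1 scale and on a 0-100 scale:

```python
import numpy as np

# Made-up data for illustration only; not drawn from the election dataset.
rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.9, size=50)               # a proportion on a 0-1 scale
y = 2.0 + 5.0 * x + rng.normal(0, 0.1, size=50)  # a synthetic response


def fit_line(feature, target):
    """Ordinary least squares for target = intercept + slope * feature."""
    design = np.column_stack([np.ones_like(feature), feature])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef  # (intercept, slope) and fitted values


coef_prop, pred_prop = fit_line(x, y)        # predictor on the 0-1 scale
coef_pct, pred_pct = fit_line(x * 100, y)    # same predictor as a percentage

print("slope ratio:", coef_prop[1] / coef_pct[1])
print("same fitted values:", np.allclose(pred_prop, pred_pct))
```

The slope on the percentage scale should come out about 100 times smaller, while the intercept and the fitted values stay the same up to floating-point error.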

Creating 2 new dataframes separated by presidential years and midterm years:


In [22]:
demographics_presidential = duckdb.sql("""SELECT State,Year,PercentVotingEligibleVotes FROM demographics WHERE Presidential=1""").df()
demographics_midterm = duckdb.sql("""SELECT State,Year,PercentVotingEligibleVotes FROM demographics WHERE Presidential=0""").df()
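
For readers less familiar with DuckDB, the same split can be sketched in plain pandas. This is an assumed equivalent on a small made-up frame (toy and its values are hypothetical), not the code we ran:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the real data.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "Year": [2016, 2018, 2016, 2018],
    "PercentVotingEligibleVotes": [60.0, 45.0, 58.0, None],
    "Presidential": [1, 0, 1, 0],
})

cols = ["State", "Year", "PercentVotingEligibleVotes"]

# A boolean mask plays the role of the WHERE clause and the column list the
# role of SELECT. reset_index(drop=True) gives each result a fresh 0..n-1 index.
toy_presidential = toy.loc[toy["Presidential"] == 1, cols].reset_index(drop=True)
toy_midterm = toy.loc[toy["Presidential"] == 0, cols].reset_index(drop=True)

print(toy_presidential)
print(toy_midterm)
```

As in the SQL version, rows with missing values are still present at this point; they are only removed by a later dropna.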

Dropping rows with NAs in any column (since all columns will eventually be used for modeling), then exporting the 3 updated dataframes to CSVs:

In [23]:
demographics = demographics.dropna()
demographics_presidential = demographics_presidential.dropna()
demographics_midterm = demographics_midterm.dropna()

demographics.to_csv('data/election_demographics_final.csv')
demographics_presidential.to_csv('data/election_demographics_presidential.csv')
demographics_midterm.to_csv('data/election_demographics_midterm.csv')
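
One thing to keep in mind when reading these files back: to_csv writes the row index as an unnamed first column by default, and after dropna the row labels can have gaps. The sketch below uses a made-up frame (toy is hypothetical) and in-memory buffers instead of files to show the difference between the default and index=False:

```python
import io

import pandas as pd

# Hypothetical toy frame; the middle row is removed by dropna, leaving labels 0 and 2.
toy = pd.DataFrame({"State": ["A", "B", "C"], "Income": [55.0, None, 61.5]}).dropna()

with_index = io.StringIO()
toy.to_csv(with_index)                  # default: index=True

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # row labels are not written

lines_with = with_index.getvalue().splitlines()
lines_without = without_index.getvalue().splitlines()

print(lines_with)
print(lines_without)
```

With the default, the header starts with an empty column name and each row starts with its original label, so a later read_csv call either needs index_col=0 or will see an extra unnamed column.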