The purpose of this analysis is to establish a functional relational database using bulk election data, build a machine learning model capable of accurately predicting election results, create dashboards accessible to the public, and explore the distribution of funding across parties, candidates, and the United States. We will then present the insights to a group of peers in a 12-minute presentation.
Team "Analyze This!" has discussed our communication protocols and agreed to the following:
- We will be using Slack as our primary communication tool, using our own channel.
- We will meet no less than twice a week over Zoom to discuss the progress on our project.
- We will post a message into our Slack channel to discuss any emerging Pull Requests and to seek team approvals.
- We will communicate in Slack any difficult life circumstances that may prevent us from completing a task so that the team can jump in to assist.
US Federal Campaign Finance data 1990-2016 https://www.kaggle.com/datasets/jeegarmaru/campaign-contributions-19902016
• Overview - Machine Learning will be applied to this data set to address our main problem statement: Predict a winner in a political race based on selected features from the data. We will be testing and comparing different models to see which performs the best in terms of accuracy and best fit.
MODELS:
- Supervised:
•Random Forest Classifier
•Logistic Regression
•Neural Network (likely too complex for this situation).
- Unsupervised:
•K-Means Clustering
PREPROCESSING: Clean the data and encode categorical variables. Bucketing rare values may be necessary. Apply PCA before clustering, and scale/standardize numeric features. Join the candidates table to the pacs and individual_contributions tables so that features from multiple tables are included. We may also decide to select only the most recent years, 2000-2016, both to limit the effect of inflation on dollar amounts and to keep the most relevant data.
FEATURES: c.party • c.dist_id_run_for • c.CRPICO • c.NOPACS • c.raised_from_pacs • c.raised_from_individuals • c.raised_total • c.raised_unitemized • p.pacid • p.Amount • p.type
TARGET: c.result - 1:W, 0:L
RESULT: Hopefully, our supervised models will provide us with accurate predictions for election results. Our unsupervised approach may help us group candidates based on fiscal activity and better understand the politics and power dynamics at play.
In this project we will be using a data set from Kaggle that contains campaign finance data starting in 1990. The data is originally sourced from OpenSecrets, a reputable nonpartisan, independent, nonprofit organization that has been in operation since 1996. Through Kaggle we have access to several data sets, including candidates, backer information, committee information, PAC information, and more, which we will tie together via common keys with SQL. We will do this through inner joins, creating new tables with the information needed so that we can analyze election results based on financial support.
We cleaned this data by removing duplicate rows and nulls, converting dates to datetime format, and selecting only the relevant columns.
We used PostgreSQL to create the connections throughout our data and put together a cohesive and clear picture. When connecting the tables, we noticed that the candidate ID alone was not unique enough: many candidates ran in more than one election cycle and received donations from more than one source, which produced a many-to-many relationship. To create a one-to-many relationship and attribute donations to the correct election, we created a new key that combines the candidate ID with the year the candidate ran. This ensured that contributions were matched to the correct year and that the data was portrayed correctly.
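The composite key described above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table names, the column names (`cid`, `cycle`, `cand_cycle_key`), and the rows are hypothetical and are not taken from the project schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical candidates table: one row per candidate per election cycle.
cur.execute("CREATE TABLE candidates (cid TEXT, cycle INTEGER, result TEXT)")
cur.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [("N001", 2012, "W"), ("N001", 2014, "L"), ("N002", 2014, "W")],
)

# Hypothetical contributions table: many rows per candidate per cycle.
cur.execute("CREATE TABLE contributions (cid TEXT, cycle INTEGER, amount REAL)")
cur.executemany(
    "INSERT INTO contributions VALUES (?, ?, ?)",
    [("N001", 2012, 500.0), ("N001", 2012, 250.0),
     ("N001", 2014, 100.0), ("N002", 2014, 900.0)],
)

# Build the composite key (candidate ID + cycle) on both sides of the join,
# so each contribution matches exactly one candidate-cycle row.
query = """
SELECT c.cid || '_' || c.cycle AS cand_cycle_key,
       c.result,
       SUM(k.amount) AS raised_total
FROM candidates AS c
INNER JOIN contributions AS k
        ON c.cid || '_' || c.cycle = k.cid || '_' || k.cycle
GROUP BY cand_cycle_key, c.result
ORDER BY cand_cycle_key
"""
rows = cur.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Joining on the candidate ID alone would attach every contribution for N001 to both of that candidate's cycles. With the combined key, the 2012 and 2014 totals stay separate.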
Encode categorical variables with OneHotEncoder()
Split data to train/test groups
Standardize Data with StandardScaler()
Train/Test - analyze accuracy score
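The four steps above could look roughly like the sketch below. It assumes scikit-learn and pandas are installed and runs on synthetic data; the column names are borrowed from the feature list, but the values, the toy target rule, and the choice to wrap the steps in a `Pipeline` are assumptions for illustration and not necessarily how the project notebook is organized.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the real features come from the joined tables.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "party": rng.choice(["D", "R", "I"], size=n),
    "raised_from_pacs": rng.uniform(0, 1e6, size=n),
    "raised_from_individuals": rng.uniform(0, 1e6, size=n),
})
df["raised_total"] = df["raised_from_pacs"] + df["raised_from_individuals"]
# Toy target (assumption): 1 (W) when total funding is above the median, else 0 (L).
df["result"] = (df["raised_total"] > df["raised_total"].median()).astype(int)

X = df.drop(columns="result")
y = df["result"]

categorical = ["party"]
numeric = ["raised_from_pacs", "raised_from_individuals", "raised_total"]

# One-hot encode categorical columns and standardize numeric columns.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Split first, then fit: the encoder and scaler only ever see training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Putting the encoder and scaler inside the pipeline keeps test-set statistics out of the fitted transformers, which is one common way to avoid leakage when comparing models.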
Raw Data
Data downloaded from the Federal Election Commission.
Columns were selected, formatted, cleaned, and combined to calculate columns resembling the structure of the training data, then preprocessed and fed to the RFC model.
Prediction Results - first 5 rows
Seat Requirement Differentials
- difference in expected seats and predicted winning seats
Adjusted to fulfill State seat requirements
68 candidates were adjusted: 40 from L to W (those with the highest raised_total) and 28 from W to L (those with the lowest raised_total), in an attempt to satisfy seat requirements. This was an overall improvement, but the result is not exact. The full list of adjusted politicians is in ML_modeling_v1.ipynb.
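One way to express this kind of adjustment is sketched below. It is a simplified illustration, not the logic from ML_modeling_v1.ipynb: the function name `adjust_to_seats`, the dictionary fields, and the sample rows are hypothetical, and it assumes a single seat count per state.

```python
from collections import defaultdict


def adjust_to_seats(candidates, seats_by_state):
    """Return a copy of candidates with 'pred' flipped so that each state's
    predicted winners match its seat count where possible."""
    adjusted = [dict(c) for c in candidates]  # leave the input untouched
    by_state = defaultdict(list)
    for c in adjusted:
        by_state[c["state"]].append(c)

    for state, group in by_state.items():
        seats = seats_by_state.get(state)
        if seats is None:
            continue  # no seat requirement known for this state
        winners = [c for c in group if c["pred"] == 1]
        losers = [c for c in group if c["pred"] == 0]
        gap = seats - len(winners)
        if gap > 0:
            # Too few winners: promote the best-funded predicted losers.
            losers.sort(key=lambda c: c["raised_total"], reverse=True)
            for c in losers[:gap]:
                c["pred"] = 1
        elif gap < 0:
            # Too many winners: demote the least-funded predicted winners.
            winners.sort(key=lambda c: c["raised_total"])
            for c in winners[:-gap]:
                c["pred"] = 0
    return adjusted


# Hypothetical example: each state has one seat.
candidates = [
    {"name": "A", "state": "XX", "raised_total": 900, "pred": 0},
    {"name": "B", "state": "XX", "raised_total": 100, "pred": 0},
    {"name": "C", "state": "YY", "raised_total": 500, "pred": 1},
    {"name": "D", "state": "YY", "raised_total": 50, "pred": 1},
]
adjusted = adjust_to_seats(candidates, {"XX": 1, "YY": 1})
print([(c["name"], c["pred"]) for c in adjusted])
```

In this toy case, candidate A is promoted because state XX had no predicted winner, and candidate D is demoted because state YY had one winner too many. A state with fewer candidates than seats would still fall short, which is one reason an adjustment like this may not be exact.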
Seat Requirement Differentials - after adjustment
Feature Selection - added State
Dimensionality Reduction - Principal Component Analysis (PCA)
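As a rough sketch of how PCA can feed the K-Means clustering mentioned earlier, the example below reduces standardized numeric features to two components and then clusters them. It assumes scikit-learn is installed; the synthetic data, the two components, and the three clusters are arbitrary choices for illustration and do not reflect the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric funding features: 150 candidates x 6 columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 6))

# Standardize first so no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components (assumed count).
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Cluster candidates in the reduced space (assumed k = 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cluster sizes:", np.bincount(labels))
```

The explained variance ratio shows how much of the original spread the retained components keep, which helps when deciding how many components to use before clustering.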
Web Page - Link
Page 1 - Prediction Results + 2022 Candidate Dashboard
Our initial hypothesis was that funding sources would influence election outcomes. We found that they do; however, incumbency status was perhaps the greatest indicator of a successful campaign. All in all, this was an insightful analysis of election data. With a 93% accuracy score, we can be fairly confident in the predictions. Our dashboards show the distribution of funding across the United States, each party's presence in the races and their resources, how the distribution of capital changes over time, and more. We invite you to explore our website, peer into the crystal ball, and return after the votes are counted to compare our predictions with reality. Will your candidate win? Only time will tell...
Lora Leonida
Neekoh Tablate
Ryan Knauff
Cayli Swartz
Marshall Miley