loraleonida/Campaign-Finance-Analysis

 
 


Analyze This! Campaign Finance Analysis

Overview

The purpose of this analysis is to establish a functional relational database from bulk election data, build a machine learning model capable of accurately predicting election results, create dashboards accessible to the public, and explore the distribution of funding across parties, candidates, and the United States. We will then present the insights to a group of peers in a 12-minute presentation.

Outline

Communication Protocols

Team "Analyze This!" has discussed our communication protocols and agreed to the following:

  • We will use Slack as our primary communication tool, in our own channel.
  • We will meet at least twice a week over Zoom to discuss progress on the project.
  • We will post a message in our Slack channel to discuss any new Pull Requests and to seek team approvals.
  • We will communicate in Slack any difficult life circumstances that may prevent us from completing a task, so that the team can step in to assist.

Machine Learning Model

US Federal Campaign Finance data 1990-2016 https://www.kaggle.com/datasets/jeegarmaru/campaign-contributions-19902016

• Overview - Machine learning will be applied to this data set to address our main problem statement: predict the winner of a political race from selected features of the data. We will test and compare different models to see which performs best in terms of accuracy and fit.

MODELS:

- Supervised:
	• Random Forest Classifier
	• Logistic Regression
	• Neural Network (likely too complex for this situation)

- Unsupervised:
	• K-Means Clustering

PREPROCESSING: Cleaning and encoding categorical variables. Bucketing rare values may be necessary. PCA for clustering. Scaling/standardizing. Joining the candidates table to the pacs and individual_contributions tables so as to include features from multiple tables. We may also restrict the data to the most recent years, 2000-2016, both because inflation makes older dollar amounts hard to compare and because recent cycles are the most relevant.
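As an illustration of the "bucketing rare values" step, here is a minimal pandas sketch. The helper name `bucket_rare`, the threshold, and the toy party codes are assumptions made for this example, not taken from the project notebooks.

```python
import pandas as pd

def bucket_rare(series, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values as they are; map rare ones to the shared bucket.
    return series.where(~series.isin(rare), other)

# Hypothetical party codes: "L" and "G" each appear only once.
party = pd.Series(["D", "R", "D", "R", "L", "G", "D"])
bucketed = bucket_rare(party, min_count=2)
print(bucketed.tolist())
```

Bucketing before one-hot encoding keeps the number of encoded columns small when a categorical column has a long tail of infrequent values.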

FEATURES: c.party • c.dist_id_run_for • c.CRPICO • c.NOPACS • c.raised_from_pacs • c.raised_from_individuals • c.raised_total • c.raised_unitemized • p.pacid • p.Amount • p.type

TARGET: c.result (1 = W, 0 = L)

RESULT: We hope our supervised models will give accurate predictions of election results. Our unsupervised approach may help us group candidates by fiscal activity and better understand the politics and power play.

Database

In this project we use a Kaggle data set containing campaign finance data from 1990 onward. The data is originally sourced from the website OpenSecrets, a reputable nonpartisan, independent, nonprofit organization that has been in operation since 1996. Through Kaggle we have access to several data sets, including candidates, backer information, committee information, PAC information, and more, which we will tie together via common keys in SQL. We will do this with inner joins, creating new tables with the information needed to analyze election results based on financial support.

We cleaned this data by removing duplicate rows and nulls, converting dates to datetime format, and selecting only the relevant columns.
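The cleaning steps just described could look roughly like the pandas sketch below. The column names and rows are hypothetical stand-ins, not the real Kaggle schema.

```python
import pandas as pd

# Hypothetical toy rows; the real tables have different columns.
raw = pd.DataFrame({
    "cand_id": ["N001", "N001", "N002", None],
    "date": ["2016-01-15", "2016-01-15", "2016-03-02", "2016-04-01"],
    "amount": [500, 500, 250, 100],
    "memo": ["a", "a", "b", "c"],
})

clean = (
    raw.drop_duplicates()                      # remove exact duplicate rows
       .dropna(subset=["cand_id"])             # drop rows missing the key
       .assign(date=lambda df: pd.to_datetime(df["date"]))  # parse dates
       [["cand_id", "date", "amount"]]         # keep only relevant columns
       .reset_index(drop=True)
)
print(clean)
```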

We used PostgreSQL to create the connections throughout our data and put together a cohesive and clear picture. When connecting the tables, we noticed that the candidate ID alone was not unique enough: many candidates ran in more than one election cycle and received donations from more than one source, which produced a many-to-many relationship. To create a one-to-many relationship and attribute donations to the correct election, we created a new key combining the candidate ID and the year the candidate ran. This ensured that contributions were matched to the correct year and that the data was portrayed correctly.
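The project builds this key in PostgreSQL; the same idea is sketched here in pandas so it can be run without a database. The table contents, column names, and the name `cand_cycle_key` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: candidate N001 ran in two cycles.
candidates = pd.DataFrame({
    "cand_id": ["N001", "N001", "N002"],
    "cycle": [2014, 2016, 2016],
    "result": ["L", "W", "L"],
})
contributions = pd.DataFrame({
    "cand_id": ["N001", "N001", "N001", "N002"],
    "cycle": [2014, 2016, 2016, 2016],
    "amount": [100, 200, 300, 50],
})

def add_key(df):
    # Composite key: candidate ID plus the year of the run.
    return df.assign(cand_cycle_key=df["cand_id"] + "_" + df["cycle"].astype(str))

candidates = add_key(candidates)
contributions = add_key(contributions)

# validate="many_to_one" raises if the key is not unique on the candidate side.
joined = contributions.merge(
    candidates[["cand_cycle_key", "result"]],
    on="cand_cycle_key", how="inner", validate="many_to_one",
)
totals = joined.groupby("cand_cycle_key")["amount"].sum()
print(totals)
```

Joining on `cand_id` alone would attach N001's 2014 donation to the 2016 race as well; the composite key keeps each contribution with a single candidate-year row.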


SQL Database

Entity Relationship Diagram [figure: QuickDBD export]

PgAdmin4 - Postgres Table [screenshot]


Machine Learning Model - Random Forest Classifier

Feature Selection [screenshot]

Encode categorical variables with OneHotEncoder() [screenshot]

Split data to train/test groups [screenshot]

Standardize Data with StandardScaler() [screenshot]
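The split and standardize steps can be sketched together as follows. The data is synthetic and the `random_state` and `stratify` choices are assumptions; the notebook's actual settings are visible only in the screenshots.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the win/loss target.
rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=250.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Default test_size is 0.25; stratify keeps the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```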

Train/Test - analyze accuracy score [screenshot]
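A minimal sketch of training a Random Forest Classifier and scoring it on held-out data is shown below. The data and the rule generating the target are synthetic assumptions, so the printed accuracy says nothing about the project's real score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
raised_total = rng.lognormal(mean=12.0, sigma=1.0, size=n)  # synthetic dollars
incumbent = rng.integers(0, 2, size=n)                      # hypothetical flag

# Synthetic rule for illustration only: incumbents and top fundraisers win.
won = ((incumbent == 1) | (raised_total > np.quantile(raised_total, 0.8))).astype(int)

X = np.column_stack([raised_total, incumbent])
X_train, X_test, y_train, y_test = train_test_split(
    X, won, random_state=1, stratify=won
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```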


Predict 2022 Election

Raw Data
Data downloaded from the Federal Election Commission. [screenshot]

Columns were selected, formatted, cleaned, and combined to calculate columns resembling the structure of the training data, then preprocessed and fed to the Random Forest Classifier (RFC) model.

[screenshot]

Prediction Results - first 5 rows [screenshot]

Number of Wins vs. Losses [screenshot]

Seat Requirement Differentials
- difference between expected seats and predicted winning seats [figure: 2022 Prediction Results diff]

Adjusted to fulfill State seat requirements
68 candidates were adjusted to better satisfy seat requirements: 40 from L to W (those with the highest raised_total) and 28 from W to L (those with the lowest raised_total). This is an overall improvement, but the result is still not exact. The full list of adjusted politicians is in ML_modeling_v1.ipynb.
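One way such an adjustment could be expressed is sketched below. This is an assumption about the general idea, not the logic in ML_modeling_v1.ipynb; the function name `adjust_to_seats`, the column names, and the toy data are all hypothetical.

```python
import pandas as pd

def adjust_to_seats(preds, seats):
    """Flip predictions per state so predicted wins match available seats."""
    out = preds.copy()
    for state, group in preds.groupby("state"):
        diff = seats[state] - int(group["pred"].sum())
        if diff > 0:
            # Too few winners: promote the best-funded predicted losers.
            flip = group[group["pred"] == 0].nlargest(diff, "raised_total").index
            out.loc[flip, "pred"] = 1
        elif diff < 0:
            # Too many winners: demote the least-funded predicted winners.
            flip = group[group["pred"] == 1].nsmallest(-diff, "raised_total").index
            out.loc[flip, "pred"] = 0
    return out

# Hypothetical predictions: state X has no winner, state Y has two.
preds = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "state": ["X", "X", "X", "Y", "Y"],
    "raised_total": [300, 200, 100, 50, 80],
    "pred": [0, 0, 0, 1, 1],
})
seats = {"X": 1, "Y": 1}

adjusted = adjust_to_seats(preds, seats)
print(adjusted)
```

In this toy case the counts match exactly after adjustment. Real data can still fall short, for example when a state has fewer candidates in the data than seats to fill.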

Seat Requirement Differentials - after adjustment [figure: after adjustments]


KMeans Clustering

Feature Selection - added State [screenshot]

Dimensionality Reduction - Principal Component Analysis (PCA) [screenshot]
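A minimal sketch of the PCA step on synthetic data follows. The choice of three components is an assumption suggested by the 3D cluster graph; the notebook's actual setting is not visible here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded candidate features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute shows how much of the total variance each retained component captures, which helps judge whether three components are enough.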

Elbow Curve, n=5 [screenshot]
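The values behind an elbow curve can be computed as in the sketch below. The data is synthetic, generated with five blob centers to mirror the n=5 choice, and the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-feature data with five underlying groups.
X, _ = make_blobs(n_samples=300, centers=5, n_features=3, random_state=0)

ks = list(range(1, 11))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia is the within-cluster sum of squared distances to centroids.
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the elbow curve; the "elbow" is the k after which adding clusters yields only small reductions in inertia.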

Clustering 3D Graph [figure: K-Means clustering 3D graph]


Web Page - Link

Page 1 - Prediction Results + 2022 Candidate Dashboard [2 screenshots]

Page 2 - Past Data Dashboard [4 screenshots]


Conclusion

Our initial hypothesis was that funding sources would influence election outcomes. We found that they do; however, incumbency status was perhaps the greatest indicator of a successful campaign. All in all, this was an insightful analysis of election data. With a 93% accuracy score, we can be fairly confident in the predictions. Our dashboards show the distribution of funding across the United States, each party's presence in the races and its resources, how the distribution of funding changes over time, and more. We invite you to explore our website, peer into the crystal ball, and return after the votes are counted to compare our predictions with reality. Will your candidate win? Only time will tell...


Contributors

Lora Leonida
Neekoh Tablate
Ryan Knauff
Cayli Swartz
Marshall Miley

Languages

  • Jupyter Notebook 90.4%
  • JavaScript 9.6%