Using your scraped data, investigate different relationships between candidates and the amount of money they raised. Here are some suggestions to get you started, but feel free to pose your own questions or do additional exploration:

a. How often does the candidate who raised more money win a race?

b. How often does the candidate who spent more money win a race?

c. Does the difference between either money raised or money spent seem to influence the likelihood of a candidate winning a race?

d. How often does the incumbent candidate win a race?

e. Can you detect any relationship between amount of money raised and the incumbent status of a candidate?

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
from scipy.stats import pearsonr
from scipy.stats import spearmanr
import statsmodels.formula.api as smf
from scipy.stats import pointbiserialr
from statsmodels.formula.api import glm
from statsmodels.stats.outliers_influence import variance_inflation_factor 
import statsmodels.api as sm
from patsy import dmatrix 
import matplotlib.pyplot as plt

In [2]:
scraped_data = pd.read_csv('../data/scraped_data.csv')
scraped_data.drop('Unnamed: 0', axis=1, inplace=True)
scraped_data

Unnamed: 0,State,District,Name,Party,Incumbent,Winner,Vote Percentage,Raised,Spent
0,AL,1,Jerry Carl,R,False,True,64.9,"$1,971,321","$1,859,349"
1,AL,1,James Averhart,D,False,False,35.0,"$80,095","$78,973"
2,AL,2,Barry Moore,R,False,True,65.3,"$650,807","$669,368"
3,AL,2,Phyllis Harvey-Hall,D,False,False,34.6,"$56,050","$55,988"
4,AL,3,Mike D Rogers,R,True,True,67.5,"$1,193,111","$1,218,564"
...,...,...,...,...,...,...,...,...,...
803,WI,7,Tricia Zunker,D,False,False,39.2,"$1,261,957","$1,232,690"
804,WI,8,Mike Gallagher,R,True,True,64.0,"$3,202,905","$2,841,801"
805,WI,8,Amanda Stuck,D,False,False,36.0,"$416,978","$399,916"
806,WY,1,Liz Cheney,R,True,True,68.6,"$3,003,883","$3,060,167"


In [3]:
scraped_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808 entries, 0 to 807
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   State            808 non-null    object 
 1   District         808 non-null    int64  
 2   Name             808 non-null    object 
 3   Party            808 non-null    object 
 4   Incumbent        808 non-null    bool   
 5   Winner           808 non-null    bool   
 6   Vote Percentage  808 non-null    float64
 7   Raised           808 non-null    object 
 8   Spent            808 non-null    object 
dtypes: bool(2), float64(1), int64(1), object(5)
memory usage: 45.9+ KB


In [4]:
columns_to_convert = ['Raised', 'Spent']
for col in columns_to_convert:
    scraped_data[col] = scraped_data[col].str.replace('[$,]', '', regex=True)
    scraped_data[col] = scraped_data[col].astype(float)

In [5]:
scraped_data.head(5)

Unnamed: 0,State,District,Name,Party,Incumbent,Winner,Vote Percentage,Raised,Spent
0,AL,1,Jerry Carl,R,False,True,64.9,1971321.0,1859349.0
1,AL,1,James Averhart,D,False,False,35.0,80095.0,78973.0
2,AL,2,Barry Moore,R,False,True,65.3,650807.0,669368.0
3,AL,2,Phyllis Harvey-Hall,D,False,False,34.6,56050.0,55988.0
4,AL,3,Mike D Rogers,R,True,True,67.5,1193111.0,1218564.0


- Group the scraped data by state and district, then identify who raised the most money in each race along with the actual winner of the race.
- Compare the candidate who raised the most money or spent the most money with the actual winner of each race. How often do they match?
- Calculate the percentage: divide the number of races where the candidate who raised the most money also won by the total number of races.
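
The steps above can be sketched on a small toy table. This is a minimal illustration, not the notebook's actual computation: the rows are made-up values, and the `Top Raiser` column name is hypothetical. It also assumes every race has exactly one row flagged as the winner, so the share is taken over winners rather than over districts.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the scraped dataset)
races = pd.DataFrame({
    "State": ["AA", "AA", "AA", "AA", "BB", "BB"],
    "District": [1, 1, 2, 2, 1, 1],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Winner": [True, False, False, True, True, False],
    "Raised": [500.0, 100.0, 900.0, 300.0, 250.0, 200.0],
})

# Flag the top fundraiser within each (State, District) race
race_max = races.groupby(["State", "District"])["Raised"].transform("max")
races["Top Raiser"] = races["Raised"] == race_max

# Among winners, the mean of a Boolean column is the share of races
# in which the winner was also the top fundraiser
winners = races[races["Winner"]]
share_top_raiser_won = winners["Top Raiser"].mean() * 100
print(f"{share_top_raiser_won:.1f}%")  # prints 66.7%
```

Taking the mean over the winners' rows keeps the denominator equal to the number of races that actually have a recorded winner, which may differ from the number of districts if some races are missing one.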

In [6]:
scraped_data['Max Raised Candidate'] = scraped_data.groupby(['State','District'])['Raised'].transform(lambda x: x == x.max())
scraped_data['Max Spent Candidate'] = scraped_data.groupby(['State','District'])['Spent'].transform(lambda x: x == x.max())


Two new Boolean columns flag whether each candidate raised the most money and spent the most money in their race.

In [7]:
scraped_data.head(5)

Unnamed: 0,State,District,Name,Party,Incumbent,Winner,Vote Percentage,Raised,Spent,Max Raised Candidate,Max Spent Candidate
0,AL,1,Jerry Carl,R,False,True,64.9,1971321.0,1859349.0,True,True
1,AL,1,James Averhart,D,False,False,35.0,80095.0,78973.0,False,False
2,AL,2,Barry Moore,R,False,True,65.3,650807.0,669368.0,True,True
3,AL,2,Phyllis Harvey-Hall,D,False,False,34.6,56050.0,55988.0,False,False
4,AL,3,Mike D Rogers,R,True,True,67.5,1193111.0,1218564.0,True,True


In [8]:
winners_data = scraped_data[scraped_data['Winner'] == True]
winners_data.head(5)

Unnamed: 0,State,District,Name,Party,Incumbent,Winner,Vote Percentage,Raised,Spent,Max Raised Candidate,Max Spent Candidate
0,AL,1,Jerry Carl,R,False,True,64.9,1971321.0,1859349.0,True,True
2,AL,2,Barry Moore,R,False,True,65.3,650807.0,669368.0,True,True
4,AL,3,Mike D Rogers,R,True,True,67.5,1193111.0,1218564.0,True,True
6,AL,4,Robert B Aderholt,R,True,True,82.5,1255076.0,1323812.0,True,True
7,AL,5,Mo Brooks,R,True,True,95.8,655365.0,210045.0,True,True


In [9]:
total_districts = scraped_data.groupby(['State', 'District']).size().count()
total_districts

430

In [10]:
# Find how often the candidate who raised the most money also won
match_win_raised = winners_data.groupby('Max Raised Candidate').size()
match_win_spent = winners_data.groupby('Max Spent Candidate').size()

# Compare this to the total number of districts in the dataset
total_districts = scraped_data.groupby(['State', 'District']).size().count()

# Calculate the percentage by dividing the number of matches by the total number of districts
percent_match_raised = (match_win_raised / total_districts) * 100
percent_match_spent = (match_win_spent / total_districts) * 100

print(percent_match_raised)
print('---------')
print(percent_match_spent)

Max Raised Candidate
False    10.930233
True     88.604651
dtype: float64
---------
Max Spent Candidate
False    11.627907
True     87.906977
dtype: float64


**a. How often does the candidate who raised more money win a race?**
In about 88.6% of the 430 districts in the dataset.

**b. How often does the candidate who spent more money win a race?**
In about 87.9% of the 430 districts in the dataset.

In [11]:
scraped_data['Money_Difference'] = scraped_data['Raised'] - scraped_data['Spent']
scraped_data.head(5)

Unnamed: 0,State,District,Name,Party,Incumbent,Winner,Vote Percentage,Raised,Spent,Max Raised Candidate,Max Spent Candidate,Money_Difference
0,AL,1,Jerry Carl,R,False,True,64.9,1971321.0,1859349.0,True,True,111972.0
1,AL,1,James Averhart,D,False,False,35.0,80095.0,78973.0,False,False,1122.0
2,AL,2,Barry Moore,R,False,True,65.3,650807.0,669368.0,True,True,-18561.0
3,AL,2,Phyllis Harvey-Hall,D,False,False,34.6,56050.0,55988.0,False,False,62.0
4,AL,3,Mike D Rogers,R,True,True,67.5,1193111.0,1218564.0,True,True,-25453.0


In [12]:
correlation = scraped_data['Money_Difference'].corr(scraped_data['Winner'].astype(int))
correlation

0.2248804227596295

In [13]:
# Calculate the Pearson correlation coefficient
correlation_coefficient, p_value = pearsonr(scraped_data['Money_Difference'], scraped_data['Winner'])

# Print the correlation coefficient
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")

Pearson Correlation Coefficient: 0.22488042275962905


In [14]:
# Calculate the Spearman rank correlation coefficient
correlation_coefficient, p_value = spearmanr(scraped_data['Money_Difference'], scraped_data['Winner'])

# Print the correlation coefficient
print(f"Spearman Rank Correlation Coefficient: {correlation_coefficient}")

Spearman Rank Correlation Coefficient: 0.44718226589616233


**c. Does the difference between either money raised or money spent seem to influence the likelihood of a candidate winning a race?**

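`Money_Difference` here is each candidate's money raised minus money spent. Its correlation with winning is weak. The Pearson coefficient (0.225) indicates only a weak positive linear relationship. The Spearman rank coefficient (0.447) is higher, which points to a moderate monotonic association, but it is still far from 1. These coefficients describe an association only; they do not show that the difference causes a candidate to win.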
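
The `Money_Difference` column above is a candidate's own `Raised` minus `Spent`. Another reading of the question is the gap between candidates in the same race. The sketch below illustrates that alternative under stated assumptions: the rows are made-up toy values, and `Raised Gap` is a hypothetical column name defined as a candidate's `Raised` minus the average `Raised` in their race. It uses `pointbiserialr`, which for a binary variable is equivalent to the Pearson correlation.

```python
import pandas as pd
from scipy.stats import pointbiserialr

# Toy data (hypothetical values, not taken from the scraped dataset)
races = pd.DataFrame({
    "State": ["AA", "AA", "AA", "AA", "BB", "BB", "BB", "BB"],
    "District": [1, 1, 2, 2, 1, 1, 2, 2],
    "Winner": [True, False, False, True, True, False, True, False],
    "Raised": [500.0, 100.0, 900.0, 300.0, 250.0, 200.0, 800.0, 100.0],
})

# Gap between each candidate's receipts and the average in their race
race_mean = races.groupby(["State", "District"])["Raised"].transform("mean")
races["Raised Gap"] = races["Raised"] - race_mean

# Point-biserial correlation between winning (0/1) and the within-race gap
r, p_value = pointbiserialr(races["Winner"].astype(int), races["Raised Gap"])
print(f"Point-biserial r: {r:.3f} (p = {p_value:.3f})")
```

Centering on the race average removes differences in how expensive each district is, so the coefficient reflects standing within a race rather than overall fundraising totals. The same pattern would apply to `Spent`.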

**d. How often does the incumbent candidate win a race?**

In [15]:
# Find how often the incumbent candidate wins a race
match_win_incumbent = winners_data.groupby('Incumbent').size()

# Calculate the percentage by dividing the number of matches by the total number of districts
percent_incumbent_win = (match_win_incumbent / total_districts) * 100

# Print the incumbent result (not percent_match_raised from the earlier cell)
print(percent_incumbent_win)


**e. Can you detect any relationship between amount of money raised and the incumbent status of a candidate?**


In [16]:
correlation = scraped_data['Raised'].corr(scraped_data['Incumbent'].astype(int))
correlation


0.23580865297654205

In [17]:
# Calculate the Pearson correlation coefficient
correlation_coefficient, p_value = pearsonr(scraped_data['Raised'], scraped_data['Incumbent'])

# Print the correlation coefficient
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")

Pearson Correlation Coefficient: 0.23580865297654183


In [18]:
# Calculate the Spearman rank correlation coefficient
correlation_coefficient, p_value = spearmanr(scraped_data['Raised'], scraped_data['Incumbent'])

# Print the correlation coefficient
print(f"Spearman Rank Correlation Coefficient: {correlation_coefficient}")

Spearman Rank Correlation Coefficient: 0.4178104520722119
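
Both coefficients above are positive (Pearson about 0.236, Spearman about 0.418), which suggests a weak-to-moderate tendency for incumbents to raise more money. A group summary can make that kind of relationship easier to read than a single coefficient. The following is a minimal sketch on made-up toy values, not the scraped data, and the variable names are illustrative.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the scraped dataset)
candidates = pd.DataFrame({
    "Incumbent": [True, False, True, False, False, True],
    "Raised": [1200.0, 80.0, 650.0, 56.0, 400.0, 3000.0],
})

# Summarize fundraising separately for incumbents and challengers
summary = candidates.groupby("Incumbent")["Raised"].agg(["count", "median", "mean"])
print(summary)

# The median is less sensitive than the mean to a few very large totals
medians = candidates.groupby("Incumbent")["Raised"].median().to_dict()
print(medians)
```

Fundraising totals tend to be heavily skewed, so comparing medians alongside means is one reasonable way to check whether a few large campaigns drive the difference.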
