# Who Voted and Why? A Data Analysis of Socioeconomic Factors in the 2020 Election

**Datasets to be used:**
The [**Election Results Dataset**](https://www.kaggle.com/datasets/phindolin/election-results-dataset/data) contains county-level election results along with key demographic and economic indicators.
Two files from this dataset will be used in this project:
1. Election Result Data
2. Voters Demographic Data by FIPS Code

The dataset includes the following key fields:

1. Education Levels: Percentage of individuals with Bachelor’s Degrees or higher, Associate’s Degrees, High School Graduates, and those without a high school diploma.
2. Unemployment Rate: The percentage of unemployed individuals in each county.
3. Voting Data: The number of votes for Democratic, Republican, and other candidates in each county.

These features allow us to analyze whether factors like education or job status are related to how people voted.

**Analysis Questions:**
Elections are more than just numbers; they reflect the social and economic conditions of a nation. The 2020 U.S. Presidential Election was no exception, and in this project, I dive into three key questions.

1. How does the unemployment rate correlate with voting patterns?
2. What is the relationship between education levels and voting preferences?
3. Do younger voters have a preference when it comes to voting?

**Columns that will be used:**
1. Name of the State (state_name)
2. State Abbreviation (ST_ABBR)
3. Year
4. Total Votes for Democrat (DEMOCRAT)
5. Total Votes for Republican (REPUBLICAN)
6. Young Voters with Bachelor Degree or higher (EP_Age18_24_BachOrHigher)
7. Young Voters with Associate Degree (EP_Age18_24_AssDeg)
8. Young Voters with High School Degree (EP_Age18_24_HSGrad)
9. Young Voters without a High School Diploma (EP_Age18_24_NoHS)
10. Mature Voters with Bachelor Degree or higher (EP_Age24plus_BachOrHigher)
11. Mature Voters with Associate Degree (EP_Age24plus_AssDeg)
12. Mature Voters with High School Degree (EP_Age24plus_HSGrad)
13. Mature Voters without a High School Diploma (EP_Age24plus_NoHS, EP_NOHSDP)
14. FIPS (county_fips)
15. Unemployment Rate (EP_UNEMP)

**Columns to be used to merge/join them:**
1. Election Result Data (county_fips)
2. Voters Demographic Data by FIPS Code (FIPS)

**Hypotheses:**
1. Counties with higher unemployment rates are more likely to support the Democratic candidate. Economic hardship is often linked to political shifts, and we anticipate that areas with higher unemployment might lean Democratic due to policies focused on job creation and social safety nets.

2. States with higher education levels will favor the Democratic candidate, while those with lower education levels will favor the Republican candidate. Higher education levels are often associated with more liberal viewpoints, while lower education levels have historically been linked to Republican-leaning counties.

3. Younger voters will favor the Democratic candidate.

With these hypotheses in mind, I will move forward with the analysis. <br>
Let's load the data first!

In [20]:
import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

In [21]:
# Load the dataset
!pip install kagglehub[pandas-datasets]
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import kagglehub
from kagglehub import KaggleDatasetAdapter

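# Open Election Result Database from Kaggle API
# Note: judging by its name, this file holds 2024 county-level results; the 2020
# vote counts analyzed below come from the DEMOCRAT/REPUBLICAN columns of the
# demographic file.
file_path = "2024_US_County_Level_Presidential_Results.csv"

result = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "phindolin/election-results-dataset",
    file_path,
)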

display(result)






Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff
0,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632
1,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.786490,0.204699,0.581791
2,Alabama,1005,Barbour County,5578,4120,9766,1458.0,0.571165,0.421872,0.149293
3,Alabama,1007,Bibb County,7563,1617,9230,5946.0,0.819393,0.175190,0.644204
4,Alabama,1009,Blount County,25271,2569,28024,22702.0,0.901763,0.091671,0.810091
...,...,...,...,...,...,...,...,...,...,...
3155,Wyoming,56037,Sweetwater County,12541,3731,16569,8810.0,0.756895,0.225180,0.531716
3156,Wyoming,56039,Teton County,4134,8748,13077,-4614.0,0.316128,0.668961,-0.352833
3157,Wyoming,56041,Uinta County,7282,1561,8984,5721.0,0.810552,0.173753,0.636799
3158,Wyoming,56043,Washakie County,3125,656,3841,2469.0,0.813590,0.170789,0.642801


In [22]:
# Open Voters Demographic Database from Kaggle API
file_path = "output.csv"

# Load the latest version
demographic = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "phindolin/election-results-dataset",
    file_path,
)

display(demographic)

Unnamed: 0,FIPS,EP_POV150,EP_AGE17,EP_AGE65,EP_Age18_24_AssDeg,EP_Age18_24_BachOrHigher,EP_Age18_24_HSGrad,EP_Age18_24_NoHS,EP_Age24plus_AssDeg,EP_Age24plus_BachOrHigher,...,Median_Bracket,STATE,ST_ABBR,COUNTY,LOCATION,POPCHANGE,Year,DEMOCRAT,OTHER,REPUBLICAN
0,1001,10.587460,26.5350,12.3250,37.70,5.20,34.90,22.15,29.00,21.75,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Decreasing,2012,6363.0,190.0,17379.0
1,1003,12.210226,22.8925,17.0025,37.35,7.10,35.05,20.55,31.30,27.25,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Decreasing,2012,18424.0,898.0,66016.0
2,1005,25.003023,21.8325,14.5300,27.55,0.90,41.65,29.95,24.35,14.00,...,17.0,ALABAMA,AL,Barbour,"barbour county, alabama",Increasing,2012,5912.0,47.0,5550.0
3,1007,12.639007,22.4250,13.0600,28.90,2.90,43.05,25.15,23.90,9.55,...,17.0,ALABAMA,AL,Bibb,"bibb county, alabama",Decreasing,2012,2202.0,86.0,6132.0
4,1009,13.371317,24.4825,15.0150,38.95,0.80,37.45,22.80,26.45,12.50,...,27.0,ALABAMA,AL,Blount,"blount county, alabama",Decreasing,2012,2970.0,279.0,20757.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9464,56037,15.000000,26.4500,11.4500,46.90,4.20,38.30,10.65,37.40,21.60,...,22.0,WYOMING,WY,Sweetwater,"sweetwater county, wyoming",Increasing,2020,3823.0,646.0,12229.0
9465,56039,9.500000,18.5500,14.2500,59.10,16.55,15.65,8.70,23.95,56.25,...,22.0,WYOMING,WY,Teton,"teton county, wyoming",Decreasing,2020,9848.0,598.0,4341.0
9466,56041,15.350000,29.1500,13.1000,39.00,3.30,38.80,18.85,36.85,17.25,...,22.0,WYOMING,WY,Uinta,"uinta county, wyoming",Increasing,2020,1591.0,372.0,7496.0
9467,56043,15.800000,23.8500,21.1500,39.65,3.20,31.60,25.60,38.55,22.65,...,22.0,WYOMING,WY,Washakie,"washakie county, wyoming",Increasing,2020,651.0,136.0,3245.0


I will merge those two datasets. The first dataset, Election Result, includes county-level voting data, vote differences, and FIPS codes. The FIPS (Federal Information Processing Standards) code is a unique numerical identifier assigned to each county in the U.S., ensuring consistency across datasets.

The second dataset, Voters Demographic dataset, contains age and education data, along with FIPS codes. The FIPS code will serve as the unique identifier to merge both datasets accurately. I will rename the FIPS column name so the columns can match.

In [23]:
# Rename Column
demographic = demographic.rename(columns={"FIPS": "county_fips"})
demographic.head(5)

Unnamed: 0,county_fips,EP_POV150,EP_AGE17,EP_AGE65,EP_Age18_24_AssDeg,EP_Age18_24_BachOrHigher,EP_Age18_24_HSGrad,EP_Age18_24_NoHS,EP_Age24plus_AssDeg,EP_Age24plus_BachOrHigher,...,Median_Bracket,STATE,ST_ABBR,COUNTY,LOCATION,POPCHANGE,Year,DEMOCRAT,OTHER,REPUBLICAN
0,1001,10.58746,26.535,12.325,37.7,5.2,34.9,22.15,29.0,21.75,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Decreasing,2012,6363.0,190.0,17379.0
1,1003,12.210226,22.8925,17.0025,37.35,7.1,35.05,20.55,31.3,27.25,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Decreasing,2012,18424.0,898.0,66016.0
2,1005,25.003023,21.8325,14.53,27.55,0.9,41.65,29.95,24.35,14.0,...,17.0,ALABAMA,AL,Barbour,"barbour county, alabama",Increasing,2012,5912.0,47.0,5550.0
3,1007,12.639007,22.425,13.06,28.9,2.9,43.05,25.15,23.9,9.55,...,17.0,ALABAMA,AL,Bibb,"bibb county, alabama",Decreasing,2012,2202.0,86.0,6132.0
4,1009,13.371317,24.4825,15.015,38.95,0.8,37.45,22.8,26.45,12.5,...,27.0,ALABAMA,AL,Blount,"blount county, alabama",Decreasing,2012,2970.0,279.0,20757.0


In [24]:
# Merge the Data based on unique ID (county fips)
result_demo = pd.merge(result, demographic, on="county_fips")
result_demo.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff,...,Median_Bracket,STATE,ST_ABBR,COUNTY,LOCATION,POPCHANGE,Year,DEMOCRAT,OTHER,REPUBLICAN
0,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Decreasing,2012,6363.0,190.0,17379.0
1,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Decreasing,2016,5936.0,865.0,18172.0
2,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Increasing,2020,7503.0,429.0,19838.0
3,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Decreasing,2012,18424.0,898.0,66016.0
4,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Increasing,2016,18458.0,3874.0,72883.0
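
The merge above is one-to-many: the demographic table carries one row per county per election year (2012, 2016, 2020), so each county in the results table appears several times after the join. A minimal sketch of how that assumption could be checked, using the `validate` and `indicator` options of `pd.merge`, is shown below. The toy frames and their values are illustrative stand-ins, not the real data, and the sketch uses a left join (the notebook uses the default inner join) so that unmatched counties stay visible.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values, not the real data).
result = pd.DataFrame(
    {
        "county_fips": [1001, 1003, 2013],
        "county_name": ["Autauga County", "Baldwin County", "Hypothetical County"],
    }
)
demographic = pd.DataFrame(
    {
        "county_fips": [1001, 1001, 1001, 1003, 1003, 1003],
        "Year": [2012, 2016, 2020, 2012, 2016, 2020],
    }
)

# validate="one_to_many" raises a MergeError if county_fips repeats in `result`.
# indicator=True adds a `_merge` column recording where each row came from.
checked = pd.merge(
    result,
    demographic,
    on="county_fips",
    how="left",
    validate="one_to_many",
    indicator=True,
)

# Counties present in the results file but missing from the demographic file.
unmatched = checked.loc[checked["_merge"] == "left_only", "county_fips"].tolist()
print(unmatched)
```

With the default inner join, counties listed in `unmatched` would be dropped silently, which is one way a county or state can end up with incomplete data later in the analysis.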


In [25]:
result_demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9349 entries, 0 to 9348
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   state_name                  9349 non-null   object 
 1   county_fips                 9349 non-null   int64  
 2   county_name                 9349 non-null   object 
 3   votes_gop                   9349 non-null   int64  
 4   votes_dem                   9349 non-null   int64  
 5   total_votes                 9349 non-null   int64  
 6   diff                        9349 non-null   float64
 7   per_gop                     9349 non-null   float64
 8   per_dem                     9349 non-null   float64
 9   per_point_diff              9349 non-null   float64
 10  EP_POV150                   9349 non-null   float64
 11  EP_AGE17                    9349 non-null   float64
 12  EP_AGE65                    9349 non-null   float64
 13  EP_Age18_24_AssDeg          9341 

Now, we are ready to test the hypotheses!

---

Before we begin the hypothesis tests, let's group the data by state name, state abbreviation, and year, and sum the Democratic and Republican votes. This step simplifies the data we are going to use for the analysis.

In [26]:
# Group data by state, year, and votes
votes_state = result_demo.groupby(["state_name", "ST_ABBR", "Year"])[
    ["DEMOCRAT", "REPUBLICAN"]
].sum()
votes_state.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,DEMOCRAT,REPUBLICAN
state_name,ST_ABBR,Year,Unnamed: 3_level_1,Unnamed: 4_level_1
Alabama,AL,2012,795696.0,1255925.0
Alabama,AL,2016,729547.0,1318250.0
Alabama,AL,2020,849624.0,1441170.0
Alaska,AK,2012,0.0,0.0
Alaska,AK,2016,0.0,0.0


To see how the votes were spread across the states, I am going to focus on one election period. In this case, I will use the 2020 election data to test the hypotheses. First, we are going to look at the votes for each party in each state in the USA.

In [27]:
# Filter data for the year 2020
votes_state = votes_state.reset_index()
votes_2020 = votes_state[votes_state["Year"] == 2020]
votes_2020.head()

Unnamed: 0,state_name,ST_ABBR,Year,DEMOCRAT,REPUBLICAN
2,Alabama,AL,2020,849624.0,1441170.0
5,Alaska,AK,2020,0.0,0.0
8,Arizona,AZ,2020,1672143.0,1661686.0
11,Arkansas,AR,2020,423932.0,760647.0
14,California,CA,2020,11110250.0,6006429.0


In [28]:
# Calculate total votes per party
total_democrat = votes_2020["DEMOCRAT"].sum()
total_republican = votes_2020["REPUBLICAN"].sum()

# Create the aggregate bar chart
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Bar(
        y=["Democrat"],  # Y-axis represents the party
        x=[total_democrat],
        name="Democrat",
        orientation="h",
    )
)

fig.add_trace(
    go.Bar(y=["Republican"], x=[total_republican], name="Republican", orientation="h")
)

# layout
fig.update_layout(
    title="Total Democrat and Republican Votes (2020)",
    xaxis_title="Votes",
    yaxis_title="Party",
    barmode="group",  # Display bars side-by-side
    xaxis_tickformat=",.0f",  # Format x-axis ticks to show whole numbers
    width=800,
    height=400,
)

fig.show()

In [29]:
# Create the bar plot
fig = px.bar(
    votes_2020,
    x="ST_ABBR",
    y=["DEMOCRAT", "REPUBLICAN"],
    title="Aggregate Democrat and Republican Votes by State (2020)",
    labels={"ST_ABBR": "State", "value": "Number of Votes", "variable": "Party"},
    barmode="group",
)
fig.show()


The total votes graph shows that the Democratic candidate secured a higher number of total votes compared to the Republican candidate. This aligns with the official election results, where the Democratic candidate won the popular vote. However, because the U.S. uses the Electoral College system, winning the popular vote does not necessarily guarantee victory in the election. The margin between the two parties indicates a competitive race, with a substantial number of votes for both candidates.

The second graph provides a state-by-state breakdown of Democratic and Republican votes. While most states have recorded votes for both parties, certain states, such as DC (District of Columbia), VT (Vermont), and RI (Rhode Island), appear to show absolute dominance for the Democratic candidate, which is consistent with historical voting patterns. Some states, like WY (Wyoming) and ND (North Dakota), lean heavily Republican. Interestingly, some states, such as AK (Alaska), show zero recorded votes in the dataset, which could be due to data availability issues or missing records. Another possibility is that some states were excluded due to incomplete county-level reporting in the dataset used for this visualization.
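
One possible explanation for the zero totals, offered here as an assumption rather than a confirmed diagnosis, is that the county rows for those states hold missing values: a pandas group sum skips NaN and returns 0.0 when nothing is left to add. The sketch below reproduces that behavior on a toy frame (the Alabama figures are the two 2020 county values shown earlier; the Alaska NaN values are hypothetical) and shows how `min_count=1` keeps such groups as NaN instead.

```python
import numpy as np
import pandas as pd

# Toy county-level frame: the AK values are assumed to be missing (NaN).
county = pd.DataFrame(
    {
        "ST_ABBR": ["AK", "AK", "AL", "AL"],
        "DEMOCRAT": [np.nan, np.nan, 7503.0, 24578.0],
    }
)

# Default behavior: NaN values are skipped, so an all-NaN group sums to 0.0.
default_sum = county.groupby("ST_ABBR")["DEMOCRAT"].sum()

# min_count=1 requires at least one non-missing value, otherwise the result is NaN.
strict_sum = county.groupby("ST_ABBR")["DEMOCRAT"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Keeping such states as NaN would make them easier to tell apart from real ties or real zeros, for example in the `Dominant_Party` logic used for the map.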

However, these two bar charts do not clearly reveal voting patterns across states. A choropleth map would provide a better visualization of which party is dominant in each state. Additionally, analyzing the top 10 list can offer further insights into voting trends. So, let's dig deeper into this.

In [30]:
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning
votes_2020 = votes_2020.copy()

# Create a new column indicating the dominant party in each state
votes_2020["Dominant_Party"] = "Democrat"
votes_2020.loc[votes_2020["REPUBLICAN"] > votes_2020["DEMOCRAT"], "Dominant_Party"] = (
    "Republican"
)
votes_2020.loc[votes_2020["REPUBLICAN"] == votes_2020["DEMOCRAT"], "Dominant_Party"] = (
    "None"
)


# Create the choropleth map
fig = px.choropleth(
    votes_2020,
    locations="ST_ABBR",
    locationmode="USA-states",
    color="Dominant_Party",
    scope="usa",
    color_discrete_map={"Democrat": "blue", "Republican": "red", "None": "black"},
    hover_name="state_name",
    hover_data=["DEMOCRAT", "REPUBLICAN"],
    title="Dominant Party by State in 2020 US Presidential Election",
)

fig.show()



In [31]:
# Create top10 list
top10 = votes_2020.sort_values(by="DEMOCRAT", ascending=False).head(10)

# Create a new column representing the difference in votes
top10["Difference"] = abs(top10["DEMOCRAT"] - top10["REPUBLICAN"])

# Sort the DataFrame by the 'Difference' column and display the top 10 rows
top10.sort_values(by="Difference", ascending=False).head(10)

Unnamed: 0,state_name,ST_ABBR,Year,DEMOCRAT,REPUBLICAN,Dominant_Party,Difference
14,California,CA,2020,11110250.0,6006429.0,Democrat,5103821.0
95,New York,NY,2020,5230985.0,3244798.0,Democrat,1986187.0
38,Illinois,IL,2020,3471915.0,2446891.0,Democrat,1025024.0
89,New Jersey,NJ,2020,2608335.0,1883274.0,Democrat,725061.0
128,Texas,TX,2020,5259126.0,5890347.0,Republican,631221.0
104,Ohio,OH,2020,2679165.0,3154834.0,Republican,475669.0
26,Florida,FL,2020,5297045.0,5668731.0,Republican,371686.0
65,Michigan,MI,2020,2804040.0,2649852.0,Democrat,154188.0
113,Pennsylvania,PA,2020,3458229.0,3377674.0,Democrat,80555.0
98,North Carolina,NC,2020,2684292.0,2758773.0,Republican,74481.0


From the analysis, we can see that, among the ten states with the most Democratic votes, the largest margins favoring the Democratic party are in California, New York, Illinois, and New Jersey.
California has the largest margin, with over 5.1 million more votes for Democrats than Republicans, followed by New York with nearly 2 million. States like Michigan and Pennsylvania, with very close margins, highlight the importance of battleground states in determining election outcomes.
A small shift in voter turnout or preferences in these states could change the overall election result.
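
Absolute vote differences favor large states, so a relative margin can make the battleground point easier to see. The sketch below is one way to compute it, using four rows copied from the table above; the `Margin_Pct` column name and the choice of the two-party total as the denominator are assumptions made for illustration.

```python
import pandas as pd

# Four rows taken from the top-10 table above.
states = pd.DataFrame(
    {
        "ST_ABBR": ["CA", "TX", "PA", "NC"],
        "DEMOCRAT": [11110250.0, 5259126.0, 3458229.0, 2684292.0],
        "REPUBLICAN": [6006429.0, 5890347.0, 3377674.0, 2758773.0],
    }
)

# Margin as a percentage of the two-party vote (assumed denominator).
two_party = states["DEMOCRAT"] + states["REPUBLICAN"]
states["Margin_Pct"] = (
    (states["DEMOCRAT"] - states["REPUBLICAN"]).abs() / two_party * 100
)

# Smallest relative margins first: these are the closest races.
closest = states.sort_values("Margin_Pct").reset_index(drop=True)
print(closest[["ST_ABBR", "Margin_Pct"]].round(2))
```

Sorting by the relative margin puts Pennsylvania and North Carolina ahead of Texas and California, which matches the intuition that close states matter most regardless of their size.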

*Hypothesis Test #1*

**Counties with higher unemployment rates are more likely to support the Democratic candidate**

Next, I will use a scatter plot to examine the relationship between the unemployment rate and voting patterns.
But first, let's remove the anomalous data, that is, the rows with an unemployment rate < 0.

In [32]:
# Filter for 2020 data and remove anomalies (Unemployment rate < 0)
unemp = result_demo[
    (result_demo["Year"] == 2020) & (result_demo["EP_UNEMP"] >= 0)
].copy()
unemp.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff,...,Median_Bracket,STATE,ST_ABBR,COUNTY,LOCATION,POPCHANGE,Year,DEMOCRAT,OTHER,REPUBLICAN
2,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Increasing,2020,7503.0,429.0,19838.0
5,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Increasing,2020,24578.0,1557.0,83544.0
8,Alabama,1005,Barbour County,5578,4120,9766,1458.0,0.571165,0.421872,0.149293,...,17.0,ALABAMA,AL,Barbour,"barbour county, alabama",Decreasing,2020,4816.0,80.0,5622.0
11,Alabama,1007,Bibb County,7563,1617,9230,5946.0,0.819393,0.17519,0.644204,...,27.0,ALABAMA,AL,Bibb,"bibb county, alabama",Decreasing,2020,1986.0,84.0,7525.0
14,Alabama,1009,Blount County,25271,2569,28024,22702.0,0.901763,0.091671,0.810091,...,27.0,ALABAMA,AL,Blount,"blount county, alabama",Increasing,2020,2640.0,237.0,24711.0


In [33]:
# Calculate the total votes
unemp["Total_Votes"] = unemp["DEMOCRAT"] + unemp["REPUBLICAN"] + unemp["OTHER"]

# Calculate the Democrat vote share
unemp["Democrat_Vote_Share"] = (unemp["DEMOCRAT"] / unemp["Total_Votes"]) * 100

# Create the scatter plot
fig = px.scatter(
    unemp,
    x="EP_UNEMP",
    y="Democrat_Vote_Share",
    color="Democrat_Vote_Share",
    hover_data=["state_name", "county_name"],
    title="Unemployment Rate vs. Democrat Vote Share (2020)",
    labels={
        "EP_UNEMP": "Unemployment Rate",
        "Democrat_Vote_Share": "Democrat Vote Share (%)",
    },
)

fig.show()

**Result:**

The plot does not reveal a clear correlation between the unemployment rate and Democratic vote share. The Democratic vote share is primarily concentrated in counties with 4-7% unemployment, which aligns with the national average unemployment rate in the U.S.

Additionally, voting patterns for the Democratic nominee do not consistently correspond with higher unemployment rates across counties. Therefore, the data do not support the hypothesis that counties with higher unemployment rates are more likely to vote for the Democratic nominee.
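
The visual reading above could be backed by a number. The sketch below is one possible check, not part of the original analysis: it computes the Pearson correlation and a rank-based (Spearman-style) correlation between two columns. The helper name `correlation_summary` is hypothetical, and the sample values are made up for illustration only.

```python
import pandas as pd


def correlation_summary(df, x, y):
    """Return Pearson and rank-based correlations for two columns."""
    clean = df[[x, y]].dropna()
    pearson = clean[x].corr(clean[y])
    # Spearman correlation is the Pearson correlation of the ranks.
    spearman = clean[x].rank().corr(clean[y].rank())
    return {"pearson": pearson, "spearman": spearman, "n": len(clean)}


# Made-up sample values, only to show the call.
sample = pd.DataFrame(
    {
        "EP_UNEMP": [3.1, 4.5, 5.0, 6.2, 7.8, 9.0],
        "Democrat_Vote_Share": [55.0, 22.4, 61.3, 30.1, 45.8, 27.0],
    }
)
summary = correlation_summary(sample, "EP_UNEMP", "Democrat_Vote_Share")
print(summary)
```

Applied to the `unemp` frame from the cells above, a coefficient close to zero would agree with the scatter plot, while a clearly positive value would call for a second look.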

---

*Hypothesis Test #2*

**States with higher education levels will favor the Democratic candidate, while those with lower education levels will favor the Republican candidate.**

Education levels have long been linked to political preferences, and in this test, I examine whether higher education correlates with Democratic support while lower education favors Republican votes in the 2020 election.

Focusing on young adults (18-24) and the mature population (25+), I categorize education levels into higher education (Bachelor’s degree or higher) and lower education (Associate’s degree, high school diploma, or less).

In [34]:
# Filter data for the year 2020
educ = result_demo[result_demo["Year"] == 2020].copy()
educ.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff,...,Median_Bracket,STATE,ST_ABBR,COUNTY,LOCATION,POPCHANGE,Year,DEMOCRAT,OTHER,REPUBLICAN
2,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,27.0,ALABAMA,AL,Autauga,"autauga county, alabama",Increasing,2020,7503.0,429.0,19838.0
5,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791,...,27.0,ALABAMA,AL,Baldwin,"baldwin county, alabama",Increasing,2020,24578.0,1557.0,83544.0
8,Alabama,1005,Barbour County,5578,4120,9766,1458.0,0.571165,0.421872,0.149293,...,17.0,ALABAMA,AL,Barbour,"barbour county, alabama",Decreasing,2020,4816.0,80.0,5622.0
11,Alabama,1007,Bibb County,7563,1617,9230,5946.0,0.819393,0.17519,0.644204,...,27.0,ALABAMA,AL,Bibb,"bibb county, alabama",Decreasing,2020,1986.0,84.0,7525.0
14,Alabama,1009,Blount County,25271,2569,28024,22702.0,0.901763,0.091671,0.810091,...,27.0,ALABAMA,AL,Blount,"blount county, alabama",Increasing,2020,2640.0,237.0,24711.0


I applied a new data filter because the previous filter for the unemployment analysis removed entries with an unemployment rate below zero. Since this analysis focuses on education levels, I want to reintegrate the previously excluded data.
Now, let's define higher and lower education by aggregating the relevant columns. Since the data is already expressed as percentages within each age group, we only need to categorize it into young and mature populations based on their levels of higher and lower education.

In [35]:
# Define higher and lower education columns
young_higher_education_cols = ["EP_Age18_24_BachOrHigher"]
young_lower_education_cols = [
    "EP_Age18_24_AssDeg",
    "EP_Age18_24_HSGrad",
    "EP_Age18_24_NoHS",
]
mature_higher_education_cols = ["EP_Age24plus_BachOrHigher"]
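mature_lower_education_cols = [
    "EP_Age24plus_AssDeg",
    "EP_Age24plus_HSGrad",
    "EP_Age24plus_NoHS",
    # NOTE: EP_NOHSDP may overlap with EP_Age24plus_NoHS; with it included,
    # the mature higher + lower sums exceed 100 for some counties.
    "EP_NOHSDP",
]

# Sum the higher and lower education percentages for each county row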
educ["Young_Higher_Education"] = educ[young_higher_education_cols].sum(axis=1)
educ["Young_Lower_Education"] = educ[young_lower_education_cols].sum(axis=1)
educ["Mature_Higher_Education"] = educ[mature_higher_education_cols].sum(axis=1)
educ["Mature_Lower_Education"] = educ[mature_lower_education_cols].sum(axis=1)

# Group by state and calculate the mean of higher and lower education percentages
state_education = (
    educ.groupby("state_name")
    .agg(
        {
            "Young_Higher_Education": "mean",
            "Young_Lower_Education": "mean",
            "Mature_Higher_Education": "mean",
            "Mature_Lower_Education": "mean",
        }
    )
    .sort_values("Young_Higher_Education", ascending=False)
)

state_education.head(10)


Unnamed: 0_level_0,Young_Higher_Education,Young_Lower_Education,Mature_Higher_Education,Mature_Lower_Education
state_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
District of Columbia,23.65,76.35,58.7,50.1
Massachusetts,16.357143,83.664286,41.671429,66.603571
New Jersey,16.1,83.902381,37.47619,72.397619
Rhode Island,15.13,84.85,40.81,67.86
North Dakota,13.150943,86.851887,22.84434,86.388679
New Hampshire,11.965,88.03,33.925,73.48
New York,11.365323,88.633065,28.479839,82.273387
Maryland,11.320833,88.685417,32.65625,77.3375
Nebraska,10.107527,89.889785,22.753763,85.463441
Vermont,9.796429,90.196429,34.442857,73.282143


In [36]:
# Combine the young and mature bachelor's-or-higher percentages and remove anomalies
# NOTE: this adds two age-group percentages, so it is a combined index
# rather than a true population share.
educ["BachOrHigher"] = educ["Young_Higher_Education"] + educ["Mature_Higher_Education"]
educ = educ[educ["BachOrHigher"] <= 100].copy()

# Calculate the total votes
educ["Total_Votes"] = educ["DEMOCRAT"] + educ["REPUBLICAN"] + educ["OTHER"]

# Calculate the Democrat vote share
educ["Democrat_Vote_Share"] = (educ["DEMOCRAT"] / educ["Total_Votes"]) * 100

# Create the scatter plot
fig = px.scatter(
    educ,
    x="BachOrHigher",
    y="Democrat_Vote_Share",
    color="Democrat_Vote_Share",
    hover_data=["state_name", "county_name"],
    title="Bachelor or Higher Degree and Democrat Vote Share (2020)",
    labels={
        "BachOrHigher": "People with Bachelor Degree or Higher",
        "Democrat_Vote_Share": "Democrat Vote Share (%)",
    },
)

fig.show()

**Result**

There appears to be a positive correlation between education level and Democratic vote share.
Counties with higher percentages of bachelor’s degrees (right side of the chart) generally have higher Democratic vote shares (upper side of the chart).

The leftmost area of the chart (counties with lower percentages of bachelor's degrees, below ~20%) has many purple dots, indicating low Democratic vote shares (below 40%).

On the right side of the chart (above ~40% bachelor's degree share), there are more yellow and orange dots, showing that counties with a higher proportion of college-educated residents tend to vote more Democratic.
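
To put a rough size on the positive relationship described above, a straight line can be fitted to the points. The sketch below uses `numpy.polyfit` for an ordinary least-squares line; the helper name `fit_line` is hypothetical, and the five sample rows are the Alabama counties shown in this notebook, far too few to draw conclusions from.

```python
import numpy as np
import pandas as pd


def fit_line(df, x, y):
    """Least-squares line y = slope * x + intercept for two columns."""
    clean = df[[x, y]].dropna()
    slope, intercept = np.polyfit(clean[x].to_numpy(), clean[y].to_numpy(), deg=1)
    return slope, intercept


# Five Alabama counties from the notebook output (illustration only).
sample = pd.DataFrame(
    {
        "BachOrHigher": [39.2, 40.3, 15.55, 12.55, 16.0],
        "Democrat_Vote_Share": [27.018365, 22.40903, 45.788173, 20.69828, 9.569378],
    }
)
slope, intercept = fit_line(sample, "BachOrHigher", "Democrat_Vote_Share")
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Run on the full `educ` frame, the slope would estimate how many points of Democratic vote share go with each additional point of `BachOrHigher`, which is easier to report than colors on a chart.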

---


*Hypothesis Test #3*

**Younger voters will favor the Democratic candidate.**

*Since we have already defined the data for young and mature voters, we will now use that definition to proceed with the next steps. However, instead of separating them based on education level, we will merge both age groups into a combined dataset for further analysis.*

In [37]:
# Build a proxy for the young and mature voter shares
# NOTE: these are sums of education percentage columns, not head counts
educ["Total_Voters"] = (
    educ["Young_Higher_Education"]
    + educ["Young_Lower_Education"]
    + educ["Mature_Higher_Education"]
    + educ["Mature_Lower_Education"]
)
educ["Young_Voters"] = (
    educ["Young_Higher_Education"] + educ["Young_Lower_Education"]
) / educ["Total_Voters"]
educ["Mature_Voters"] = (
    educ["Mature_Higher_Education"] + educ["Mature_Lower_Education"]
) / educ["Total_Voters"]

educ.head()

Unnamed: 0,state_name,county_fips,county_name,votes_gop,votes_dem,total_votes,diff,per_gop,per_dem,per_point_diff,...,Young_Higher_Education,Young_Lower_Education,Mature_Higher_Education,Mature_Lower_Education,BachOrHigher,Total_Votes,Democrat_Vote_Share,Total_Voters,Young_Voters,Mature_Voters
2,Alabama,1001,Autauga County,20447,7429,28139,13018.0,0.726643,0.264011,0.462632,...,11.2,88.8,28.0,83.3,39.2,27770.0,27.018365,211.3,0.473261,0.526739
5,Alabama,1003,Baldwin County,95144,24763,120973,70381.0,0.78649,0.204699,0.581791,...,8.7,91.25,31.6,78.05,40.3,109679.0,22.40903,209.6,0.476861,0.523139
8,Alabama,1005,Barbour County,5578,4120,9766,1458.0,0.571165,0.421872,0.149293,...,3.65,96.35,11.9,114.25,15.55,10518.0,45.788173,226.15,0.442184,0.557816
11,Alabama,1007,Bibb County,7563,1617,9230,5946.0,0.819393,0.17519,0.644204,...,1.15,98.85,11.4,106.55,12.55,9595.0,20.69828,217.95,0.458821,0.541179
14,Alabama,1009,Blount County,25271,2569,28024,22702.0,0.901763,0.091671,0.810091,...,3.05,97.0,12.95,105.55,16.0,27588.0,9.569378,218.55,0.45779,0.54221


In [38]:
# Create the scatter plot
fig = px.scatter(
    educ,
    x="Young_Voters",
    y="Democrat_Vote_Share",
    color="Democrat_Vote_Share",
    hover_data=["state_name", "county_name"],
    title="Young Voters and Democrat Vote Share (2020)",
    labels={
        "Young_Voters": "People who are 18-24 years old",
        "Democrat_Vote_Share": "Democrat Vote Share (%)",
    },
)

fig.show()

**Result**
1. The data points are widely scattered, showing no clear linear relationship between the proportion of young voters and the Democratic vote share. Counties with a higher proportion of young voters (0.46 to 0.50) do not always show higher Democratic vote shares; some still have low Democratic support (purple/blue dots). This suggests that a higher share of young voters does not necessarily translate into stronger Democratic support.
2. The majority of counties have a young voter share between 0.42 and 0.48. Within this range, Democratic vote shares vary widely, indicating that other factors (such as state-level political culture, urban-rural divides, and turnout rates) influence voting outcomes more than age distribution alone.
3. Counties with high Democratic vote shares (above 60%) appear across different levels of young voter share. This suggests that while younger voters tend to lean Democratic, their share in a county is not the sole determinant of voting outcomes.
4. There are counties with lower young voter proportions that still show low Democratic vote shares, reinforcing the idea that age alone does not drive election results.

The plot suggests that while young voters are often associated with Democratic support, their proportion alone does not strongly predict voting outcomes. Other factors, such as regional political leanings, voter turnout, and demographic diversity, likely play a much greater role in shaping election results. Note also that `Young_Voters` is computed from education percentage columns rather than population counts, so it is only a rough proxy for the share of 18-24 year olds in a county. To further refine the analysis, it would be useful to have a better definition for this young age group. The United States Census Bureau, for instance, defines young adults as those between the ages of 18 and 32.
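
When a scatter plot is this noisy, grouping counties into bins can make any trend easier to judge. The sketch below is one possible approach, not part of the original analysis: it splits counties into quantile bins of the young voter share with `pd.qcut` and compares the mean Democratic vote share per bin. The helper name `share_by_bin` is hypothetical and the sample values are made up.

```python
import pandas as pd


def share_by_bin(df, x, y, bins=4):
    """Count rows and average y within quantile bins of x."""
    clean = df[[x, y]].dropna()
    # duplicates="drop" avoids an error when many rows share the same x value.
    labels = pd.qcut(clean[x], q=bins, duplicates="drop")
    return clean.groupby(labels, observed=True)[y].agg(["count", "mean"])


# Made-up sample values, only to show the call.
sample = pd.DataFrame(
    {
        "Young_Voters": [0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49],
        "Democrat_Vote_Share": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    }
)
table = share_by_bin(sample, "Young_Voters", "Democrat_Vote_Share")
print(table)
```

If the bin means for the real `educ` data stay roughly flat from the lowest to the highest bin, that would support the reading that the young voter share is a weak predictor here.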

**Conclusion:**

This analysis sheds light on how key demographic and economic factors related to voting in the 2020 election. Political preferences are shaped by many factors, and the three tests gave different results:

1. Higher education levels positively correlate with Democratic support.
2. Unemployment rates show no clear relationship with Democratic vote share at the county level.
3. The share of young voters in a county does not strongly predict Democratic vote share, even though younger voters are often associated with Democratic support.

By using Python for data analysis, we gain a deeper understanding of electoral dynamics and the societal forces that drive them. As data science continues to evolve, its role in political analysis will only become more significant, helping us interpret past elections and anticipate future trends.

---