# 21 — US County: Mass Shooting Rate vs Population

Scatter plot with a fitted trend line, exploring how mass shooting rates vary with
population size across roughly 100 of the largest US counties.

**Caveat:** Mass shootings are rare events. Rates are computed over a 5-year window
(2019-2023) but remain subject to small-count volatility.
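
As a rough illustration of that volatility, the sketch below converts an event count into a rate per 100K residents per year and attaches an exact Poisson (Garwood) interval. The helper name `annualized_rate_with_ci` and the example county (3 events, population 1,000,000) are hypothetical and are not taken from the dataset. How the `mass_shooting_rate` column was actually derived is not shown in this notebook, so treat the formula as an assumption.

```python
from scipy import stats


def annualized_rate_with_ci(events, population, years=5, alpha=0.05):
    """Rate per 100K residents per year, with an exact Poisson (Garwood) interval.

    Hypothetical helper for illustration; not necessarily how the
    notebook's mass_shooting_rate column was computed.
    """
    # Converts a raw count over the window into events per 100K per year.
    scale = 100_000 / (population * years)
    # Exact Poisson limits on the count, via the chi-square relationship.
    lower = 0.0 if events == 0 else stats.chi2.ppf(alpha / 2, 2 * events) / 2
    upper = stats.chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / 2
    return events * scale, lower * scale, upper * scale


# Made-up example: 3 events in 5 years in a county of 1,000,000 people.
rate, ci_low, ci_high = annualized_rate_with_ci(events=3, population=1_000_000)
print(f"rate = {rate:.3f} per 100K/yr, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With so few events the interval is wide relative to the point estimate, which is the small-count volatility the caveat refers to.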

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats
from pathlib import Path

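DATA_DIR = Path('../data/processed')
# Read FIPS codes as strings so leading zeros are preserved.
df = pd.read_csv(DATA_DIR / 'merged_us_mass_shooting_data.csv', dtype={'fips': str})

# Keep only counties that have both a population and a mass shooting rate.
plot_df = df.dropna(subset=['population', 'mass_shooting_rate']).copy()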
print(f"Counties with both population and mass shooting data: {len(plot_df)}")

Counties with both population and mass shooting data: 101


## Scatter Plot — Mass Shooting Rate vs Population (log-log)

In [2]:
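# Floor rates at 0.001 before taking log10 so that zero rates stay finite.
# Any floored values still enter the regression below.
plot_df['log_ms_rate'] = np.log10(plot_df['mass_shooting_rate'].clip(lower=0.001))
plot_df['log_population'] = np.log10(plot_df['population'])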

slope, intercept, r_value, p_value, std_err = stats.linregress(
    plot_df['log_population'], plot_df['log_ms_rate']
)
r_squared = r_value ** 2

print("Linear regression (log10 population vs log10 mass shooting rate):")
print(f"  R\u00b2 = {r_squared:.4f}")
print(f"  p-value = {p_value:.2e}")
print(f"  slope = {slope:.4f}")

Linear regression (log10 population vs log10 mass shooting rate):
  R² = 0.0795
  p-value = 4.29e-03
  slope = -0.3187
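
One way to read the fitted slope: on log-log axes, a slope of b means the fitted rate scales as population raised to the power b. The sketch below plugs in the slope printed above; the helper name `rate_multiplier` is hypothetical. Because R² is only about 0.08, these multipliers describe the fitted line and say little about any individual county.

```python
def rate_multiplier(slope, population_factor):
    """Multiplicative change in the fitted rate when population is scaled by population_factor."""
    return population_factor ** slope


slope = -0.3187  # value printed by the regression cell above
for factor in (2, 10):
    print(f"population x{factor}: fitted rate x{rate_multiplier(slope, factor):.2f}")
```

Under this fit, doubling population corresponds to a fitted rate about 0.80 times as large, and a tenfold increase to about 0.48 times as large.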


In [3]:
fig = px.scatter(
    plot_df,
    x='population',
    y='mass_shooting_rate',
    color='region',
    hover_name='county_name',
    hover_data={'population': ':,.0f', 'mass_shooting_rate': ':.2f', 'state': True, 'region': True},
    log_x=True,
    log_y=True,
    title=f'Mass Shooting Rate vs Population — US Counties (R\u00b2={r_squared:.3f}, p={p_value:.2e})',
    labels={
        'population': 'Population (log scale)',
        'mass_shooting_rate': 'Mass Shooting Rate per 100K/yr (log scale)',
        'region': 'Region',
    },
)

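# Evaluate the fit on an even grid in log10(population), then map both axes
# back to the original scale so the line matches the log-log plot.
x_range = np.linspace(plot_df['log_population'].min(), plot_df['log_population'].max(), 100)
y_trend = 10 ** (slope * x_range + intercept)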
fig.add_trace(go.Scatter(
    x=10 ** x_range, y=y_trend,
    mode='lines',
    name=f'Trend (R\u00b2={r_squared:.3f})',
    line=dict(color='red', dash='dash', width=2),
))

fig.update_layout(template='plotly_white', height=600)
fig.show()