# America’s Political Mood, in Five Key Insights  
### Full Reproducible Analysis – Jupyter Workflow

This notebook documents the full analytical workflow behind the article **“America’s Political Mood, in Five Key Insights”**, based on polling data from the Echo Insights **Verified Voter Omnibus**.

We will:

1. Load and inspect the cleaned polling file (`polls_long_tidy (final).csv`).
2. Prepare and reshape the data to link:
   - **QRightDirection** – *“Would you say things in the United States are headed in the right direction, or is the country off on the wrong track?”*
   - **QIssues - Combined** – *“If you had to choose just one, which would you say is the biggest issue facing the country today?”*
3. Build the main visualizations used in the story.
4. Run supporting **statistical analysis**, including:
   - Descriptive statistics for key variables
   - Correlations between mood and issue priorities
   - A simple regression of pessimism on key issues
   - Clustering into “Four Americas” personas (KMeans)
5. Explain each step with detailed comments and transparent reasoning.

> With the exception of explicit total columns, the values in this dataset are **percentages** (0–100 scale).

## 1. Imports and basic setup

In [None]:
# Core data science stack
import pandas as pd
import numpy as np

# Visualization
import plotly.express as px

# Clustering and preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Statistical analysis
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Utilities
from pathlib import Path

# Configure pandas display options for easier debugging / inspection
pd.set_option("display.max_columns", 60)
pd.set_option("display.width", 120)

## 2. Load the polling data

We assume the cleaned file is stored as:

`polls_long_tidy (final).csv`

This file is in *long* (tidy) format, with one row per combination of:

- poll wave (`period`)
- question (`question_id`)
- demographic category and subcategory (`demo_category`, `demo_subcategory`)
- response option (`response_option_norm`)
- numeric value (`value_num`), usually a percentage share

In [None]:
# Path to the CSV. Adjust if the file lives in a different folder.
data_path = Path("polls_long_tidy (final).csv")

# Load the CSV. The original file uses ';' as a separator.
df = pd.read_csv(data_path, sep=';')

# Quick look at the structure
df.head()
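
Before going further, it may help to confirm that the columns used by the later cells are actually present. The following is a minimal sketch, not part of the original workflow: the `missing_columns` helper and the inline two-row sample are hypothetical stand-ins for the real file, and the column list is assumed from the names this notebook uses (including `sheet`, which appears in the cleaning step).

```python
import io
import pandas as pd

# Columns the later cells rely on (names taken from this notebook).
REQUIRED_COLUMNS = [
    "period", "sheet", "question_id", "demo_category",
    "demo_subcategory", "response_option_norm", "value_num",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Hypothetical two-row sample in the same ';'-separated layout (made-up values).
sample_csv = io.StringIO(
    "period;sheet;question_id;demo_category;demo_subcategory;"
    "response_option_norm;value_num\n"
    "2025-01;S1;QRightDirection;Age;18-29;Right direction;31\n"
    "2025-01;S1;QRightDirection;Age;18-29;Wrong track;58\n"
)
sample = pd.read_csv(sample_csv, sep=";")

print(missing_columns(sample))                              # nothing missing
print(missing_columns(sample.drop(columns=["value_num"])))  # reports value_num
```

On the real data, `missing_columns(df)` returning an empty list would suggest the file matches the layout described above.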

## 3. Basic cleaning and filtering

We apply a few key steps:

1. **Filter valid periods** – keep rows where `period` looks like `'YYYY-MM'` (e.g. `2025-01`).  
2. **Sort and forward-fill `demo_category`** – some rows may omit the category if it repeats; we restore it by forward-filling within each `(period, sheet, question_id)` group.  
3. **Normalize response option labels** – strip whitespace and lowercase the `response_option_norm` column for reliable matching and pivoting.

In [None]:
# Keep only rows where `period` is of the form YYYY-MM
valid_df = df[df['period'].astype(str).str.match(r'^\d{4}-\d{2}$', na=False)].copy()

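# Stable sort on the grouping keys only, so rows keep their original order
# within each block. Sorting by demo_category itself would push missing
# values to the end of a block and make the forward-fill pick the wrong label.
valid_df = valid_df.sort_values(
    ['period', 'sheet', 'question_id'], kind='stable'
)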

# Forward-fill demo_category within each (period, sheet, question_id) block
valid_df['demo_category'] = (
    valid_df
    .groupby(['period', 'sheet', 'question_id'])['demo_category']
    .ffill()
)

# Normalize response option labels.
# Note: astype(str) turns a missing label into the literal string 'nan'.
valid_df['response_option_norm'] = (
    valid_df['response_option_norm']
    .astype(str)
    .str.strip()
    .str.lower()
)

valid_df.head()
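
To see what the period filter and the grouped forward-fill do, here is a toy illustration. The rows are made up; only the column names come from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "period": ["2025-01", "2025-01", "2025-01", "2025-02", "Total"],
    "sheet": ["S1"] * 5,
    "question_id": ["QRightDirection"] * 5,
    "demo_category": ["Age", None, None, None, "Age"],
    "demo_subcategory": ["18-29", "30-44", "45-64", "18-29", "18-29"],
})

# Step 1: keep only rows whose period looks like YYYY-MM.
toy = toy[toy["period"].astype(str).str.match(r"^\d{4}-\d{2}$", na=False)].copy()

# Step 2: forward-fill within each (period, sheet, question_id) block.
toy["demo_category"] = (
    toy.groupby(["period", "sheet", "question_id"])["demo_category"].ffill()
)

print(toy[["period", "demo_category", "demo_subcategory"]])
```

The `Total` row is dropped by the period filter, and the two blank categories in `2025-01` are filled with `Age`. The `2025-02` row stays missing because no earlier value exists inside its own block, which is a reminder that the fill never crosses group boundaries.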

## 4. Focus on the two key questions

We subset the data to the two questions of interest:

- `QRightDirection` – mood about the direction of the country  
- `QIssues - Combined` – main issue facing the country

Then we reshape them from *long* to *wide* so each row corresponds to a demographic slice in a given period, with columns for each response option (e.g. `right direction`, `wrong track`, `cost of living`, `the state of democracy`, etc.).

In [None]:
# Keep only the two focal questions
q_subset = valid_df[valid_df['question_id'].isin(['QRightDirection', 'QIssues - Combined'])].copy()

# Split into two separate frames for clarity
rd = q_subset[q_subset['question_id'] == 'QRightDirection'].copy()
issues = q_subset[q_subset['question_id'] == 'QIssues - Combined'].copy()

# Sanity check the unique response options for each
print("QRightDirection options:", sorted(rd['response_option_norm'].unique()))
print("\nQIssues - Combined options (first 15):", sorted(issues['response_option_norm'].unique())[:15])

## 5. Pivot responses to wide format

We pivot each question separately so that:

- Index = (`period`, `demo_category`, `demo_subcategory`)  
- Columns = `response_option_norm` (e.g. `right direction`, `wrong track`, `cost of living`)  
- Values = `value_num` (percentage or total)

Then we merge the two wide tables to obtain a single dataset where each row is a demographic slice per wave with both:

- Sentiment columns from `QRightDirection`
- Issue-priority columns from `QIssues - Combined`

In [None]:
# Pivot QRightDirection responses
rd_pivot = (
    rd.pivot_table(
        index=['period', 'demo_category', 'demo_subcategory'],
        columns='response_option_norm',
        values='value_num',
        aggfunc='first'  # each combination should be unique
    )
    .reset_index()
)
rd_pivot.columns.name = None

# Pivot QIssues - Combined responses
issues_pivot = (
    issues.pivot_table(
        index=['period', 'demo_category', 'demo_subcategory'],
        columns='response_option_norm',
        values='value_num',
        aggfunc='first'
    )
    .reset_index()
)
issues_pivot.columns.name = None

# Merge into a single wide dataset
merged = pd.merge(
    rd_pivot,
    issues_pivot,
    on=['period', 'demo_category', 'demo_subcategory'],
    how='inner',
    suffixes=('_rd', '_issue')  # if any overlap in names
)

merged.head()
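
The pivot-and-merge step can be tried on a single made-up slice. This is a sketch under stated assumptions: the values are invented, `to_wide` is a hypothetical helper, and `validate="one_to_one"` is an optional pandas safeguard that the original cell does not use.

```python
import pandas as pd

keys = ["period", "demo_category", "demo_subcategory"]

# Hypothetical long-format rows for one demographic slice.
long_rows = pd.DataFrame({
    "period": ["2025-01"] * 4,
    "demo_category": ["Age"] * 4,
    "demo_subcategory": ["18-29"] * 4,
    "question_id": ["QRightDirection", "QRightDirection",
                    "QIssues - Combined", "QIssues - Combined"],
    "response_option_norm": ["right direction", "wrong track",
                             "cost of living", "the state of democracy"],
    "value_num": [30.0, 60.0, 25.0, 15.0],
})

def to_wide(frame, question):
    """Pivot one question so each response option becomes a column."""
    wide = (
        frame[frame["question_id"] == question]
        .pivot_table(index=keys, columns="response_option_norm",
                     values="value_num", aggfunc="first")
        .reset_index()
    )
    wide.columns.name = None
    return wide

toy_merged = pd.merge(
    to_wide(long_rows, "QRightDirection"),
    to_wide(long_rows, "QIssues - Combined"),
    on=keys,
    how="inner",
    validate="one_to_one",  # raises MergeError if a slice appears twice
)
print(toy_merged)
```

The four long rows collapse into one wide row with a column per response option. The `validate` argument is a cheap way to catch duplicated slices, which `aggfunc='first'` would otherwise hide.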

> ### Note on scales
> With the exception of explicit `total` columns (if present), each numeric field in `merged` represents a **percentage** (share of respondents) for that demographic group and period.  
> We keep them on the 0–100 percentage scale; no rescaling is done here unless specified.

## 6. Descriptive statistics for key variables

Before visualizing or modeling, we compute basic descriptive statistics for the main variables used in the story:

- **Mood**: `right direction`, `wrong track`  
- **Issues**: `cost of living`, `the state of democracy`

This helps sanity-check ranges and detect anomalies (e.g. values out of 0–100, excessive missingness).

In [None]:
key_cols = ['right direction', 'wrong track', 'cost of living', 'the state of democracy']

# Keep the key columns and drop rows where all of them are missing
desc_df = merged[key_cols].dropna(how='all').copy()

desc_stats = desc_df.describe().T  # transpose for a more readable layout
desc_stats
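
The sanity checks mentioned above (values outside 0–100, excessive missingness) can also be made explicit. The sketch below uses a hypothetical helper, `percentage_report`, on made-up values; it is one possible way to run the check, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def percentage_report(frame, columns):
    """Summarise missingness and out-of-range values for percentage columns."""
    subset = frame[columns]
    # Comparisons with NaN are False, so missing values are not counted here.
    out_of_range = (subset < 0) | (subset > 100)
    return pd.DataFrame({
        "missing_share": subset.isna().mean(),
        "n_out_of_range": out_of_range.sum(),
    })

# Made-up values: one missing entry and one impossible percentage.
toy = pd.DataFrame({
    "wrong track": [55.0, 62.0, np.nan, 140.0],
    "cost of living": [20.0, 25.0, 30.0, 35.0],
})
report = percentage_report(toy, ["wrong track", "cost of living"])
print(report)
```

On the real data the call would be `percentage_report(merged, key_cols)`; any non-zero `n_out_of_range` would point to a `total` row or a parsing problem that slipped through.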

## 7. Correlation analysis – mood vs key issues

To quantify the relationships that the scatter plots in sections 8 and 9 show only visually, we compute Pearson correlation coefficients between:

- `wrong track` and `the state of democracy`
- `wrong track` and `cost of living`
- `right direction` and the same two issues

This provides a compact numerical summary of how strongly each issue is associated with pessimism or optimism at the group level.

In [None]:
# Compute a simple correlation matrix over the key columns
corr_matrix = desc_df.corr(method='pearson')
corr_matrix

In [None]:
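# For more explicit reporting, we can extract specific correlations and p-values.
# Note: the p-values assume independent rows. Demographic slices overlap
# (the same respondents appear in several of them), so read them as rough guides.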

pairs = [
    ('wrong track', 'the state of democracy'),
    ('wrong track', 'cost of living'),
    ('right direction', 'the state of democracy'),
    ('right direction', 'cost of living'),
]

for x, y in pairs:
    # Drop rows with missing values in either variable
    sub = desc_df[[x, y]].dropna()
    r, p = stats.pearsonr(sub[x], sub[y])
    print(f"Correlation between '{x}' and '{y}': r = {r:.3f}, p = {p:.3e}, n = {len(sub)}")
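
Because the pooled correlations mix several poll waves, a useful cross-check is to compute the same correlation within each wave. The sketch below is an illustration on made-up numbers; `correlation_by_wave` and its `min_rows` threshold are hypothetical.

```python
import pandas as pd

def correlation_by_wave(frame, x, y, wave_col="period", min_rows=3):
    """Pearson r between x and y within each wave; sparse waves are skipped."""
    results = {}
    for wave, group in frame.groupby(wave_col):
        pair = group[[x, y]].dropna()
        if len(pair) >= min_rows:
            results[wave] = pair[x].corr(pair[y])
    return pd.Series(results, name=f"r({x}, {y})")

# Made-up data: the relationship is positive in one wave, negative in the other.
toy = pd.DataFrame({
    "period": ["2025-01"] * 3 + ["2025-02"] * 3,
    "the state of democracy": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "wrong track": [50.0, 60.0, 70.0, 70.0, 60.0, 50.0],
})
by_wave = correlation_by_wave(toy, "the state of democracy", "wrong track")
pooled = toy["the state of democracy"].corr(toy["wrong track"])
print(by_wave)
print("pooled r:", pooled)
```

In this deliberately extreme example the pooled correlation is zero even though each wave shows a perfect linear relationship, which is why a per-wave view is worth a look before reading too much into a single pooled coefficient.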

## 8. Visualization 1 – Democratic anxiety vs. “wrong track” sentiment

This figure explores whether concern about democracy is associated with believing the country is on the wrong track.

- **x-axis**: share selecting *“the state of democracy”* as the top issue  
- **y-axis**: share saying the country is on the *“wrong track”*  
- **Color**: poll wave (`period`)  
- **Points**: demographic slices (`demo_category` + `demo_subcategory` for each wave)

If the relationship is strong and positive, most points will lie along an upward slope: more democratic anxiety → more pessimism about direction.

In [None]:
# Build the scatter plot: democratic concern vs 'wrong track'
fig1 = px.scatter(
    merged,
    x='the state of democracy',
    y='wrong track',
    color='period',
    hover_data=['demo_category', 'demo_subcategory'],
    title="Democratic Anxiety and the 'Wrong Track' Mood",
    labels={
        'the state of democracy': "Share selecting 'State of Democracy' (%)",
        'wrong track': "Share saying 'Wrong Track' (%)",
        'period': "Poll wave"
    }
)
fig1

## 9. Visualization 2 – Cost of living vs optimism

This chart investigates whether concern about the cost of living is a clean predictor of optimism or pessimism.

- **x-axis**: share selecting *“cost of living”* as the top issue  
- **y-axis**: share saying the country is headed in the *“right direction”*  
- **Color**: poll wave (`period`)  
- **Points**: demographic slices per wave

If inflation were the dominant driver of mood, we would expect a strong negative slope: more cost-of-living concern → lower optimism. In practice, the pattern is more scattered, suggesting inflation is more of a shared backdrop than a clean separator between camps.

In [None]:
# Build the scatter plot: cost of living vs 'right direction'
fig2 = px.scatter(
    merged,
    x='cost of living',
    y='right direction',
    color='period',
    hover_data=['demo_category', 'demo_subcategory'],
    title="Cost of Living Worries vs Optimism",
    labels={
        'cost of living': "Share selecting 'Cost of Living' (%)",
        'right direction': "Share saying 'Right Direction' (%)",
        'period': "Poll wave"
    }
)
fig2

## 10. Visualization 3 – National mood over time

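To summarize the evolution of sentiment, we compute the **average “wrong track” share by period** across all demographic slices. This is an unweighted mean over slices, not a population estimate: demographic categories with many subgroups contribute more rows and therefore carry more weight.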

- **x-axis**: `period` (poll wave, e.g. `2025-01`)  
- **y-axis**: mean share saying the country is on the *“wrong track”*

We are primarily looking for the direction of the trend (flat, improving, or worsening), not for precise point estimates.

In [None]:
# Aggregate average 'wrong track' by period
trend = (
    merged
    .groupby('period', as_index=False)['wrong track']
    .mean()
    .sort_values('period')
)

fig3 = px.line(
    trend,
    x='period',
    y='wrong track',
    markers=True,
    title="A Slow Drift Into National Pessimism",
    labels={
        'period': "Poll wave",
        'wrong track': "Average 'Wrong Track' share (%)"
    }
)
fig3
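
Since the average above pools every slice, one way to check that the trend is not an artifact of how the slices are mixed is to compute it separately per demographic category. A minimal sketch on made-up values follows; the category and subgroup names are hypothetical.

```python
import pandas as pd

# Made-up slices for two waves and two demographic categories.
toy = pd.DataFrame({
    "period": ["2025-01"] * 4 + ["2025-02"] * 4,
    "demo_category": ["Age", "Age", "Party", "Party"] * 2,
    "demo_subcategory": ["18-29", "65+", "Dem", "Rep"] * 2,
    "wrong track": [60.0, 50.0, 70.0, 40.0, 64.0, 54.0, 72.0, 46.0],
})

# Mean 'wrong track' share per wave, computed within each category.
trend_by_category = (
    toy.groupby(["period", "demo_category"], as_index=False)["wrong track"]
    .mean()
    .sort_values(["demo_category", "period"])
    .reset_index(drop=True)
)
print(trend_by_category)
```

Passing such a table to `px.line` with `color='demo_category'` would draw one line per category; if the lines move together, the pooled trend is less likely to be driven by a single breakdown.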

## 11. Regression – How much does democracy vs cost of living explain pessimism?

To move beyond pairwise correlations, we run a simple **linear regression** where:

- **Outcome**: `wrong track` (share saying the country is on the wrong track)  
- **Predictors**:
  - `the state of democracy` (share citing democracy as the top issue)
  - `cost of living` (share citing cost of living as the top issue)

This is not a causal model; its goal is descriptive:

> How strongly, and in what direction, are these issues associated with pessimism after controlling for each other?

In [None]:
# Build a modeling DataFrame with only the needed columns and drop missing values
reg_df = merged[['wrong track', 'the state of democracy', 'cost of living']].dropna()

# Statsmodels OLS: wrong_track ~ democracy + cost_of_living
# Q("...") lets the formula reference column names that contain spaces.
model = smf.ols(
    formula='Q("wrong track") ~ Q("the state of democracy") + Q("cost of living")',
    data=reg_df
).fit()

model.summary()
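
To make the coefficients easier to read, the same kind of model can be fitted by ordinary least squares on synthetic data with known answers. This is a sketch, not the notebook's results: the coefficients 40, 0.8 and 0.1 are arbitrary values chosen for the illustration, and the data are random.

```python
import numpy as np

rng = np.random.default_rng(0)
democracy = rng.uniform(5, 30, size=50)
cost_of_living = rng.uniform(10, 40, size=50)

# Synthetic outcome built from known coefficients (no noise), so the fit
# should recover them: intercept 40, democracy 0.8, cost of living 0.1.
wrong_track = 40 + 0.8 * democracy + 0.1 * cost_of_living

# Design matrix: a column of ones for the intercept, then the two predictors.
X = np.column_stack([np.ones_like(democracy), democracy, cost_of_living])
coef, *_ = np.linalg.lstsq(X, wrong_track, rcond=None)

intercept, b_democracy, b_cost = coef
print(f"intercept={intercept:.2f}, democracy={b_democracy:.2f}, cost={b_cost:.2f}")
```

Because outcome and predictors are all on the 0–100 scale, a democracy coefficient of 0.8 would mean that a slice with one more percentage point citing democracy has, on average, a 0.8-point higher “wrong track” share, holding the cost-of-living share fixed.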

## 12. Clustering – Identifying “Four Americas” personas

To group similar demographic slices together, we apply KMeans clustering to a set of features capturing:

- Issue priorities (from `QIssues - Combined`)  
- Mood (`right direction`, `wrong track`)

### 12.1 Feature selection and scaling

We:

1. Select issue columns from `issues_pivot` (excluding identifiers like `period`, `demo_category`, etc.).  
2. Combine them with `wrong track` and `right direction`.  
3. Replace missing values with 0 (interpreted as “not selected / not prioritized” in that slice).  
4. Standardize all features using `StandardScaler` so each has mean 0 and variance 1 before clustering.

In [None]:
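# Identify issue columns from the issues pivot (exclude identifiers and non-features).
# Columns renamed by the merge suffixes are skipped so the selection below cannot fail.
non_features = {'period', 'demo_category', 'demo_subcategory', 'total', 'unsure'}
issue_cols = [
    c for c in issues_pivot.columns
    if c not in non_features and c in merged.columns
]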

# Combine issue shares with sentiment metrics
feature_cols = issue_cols + ['wrong track', 'right direction']

features = merged[feature_cols].fillna(0)

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Fit KMeans with 4 clusters (the "Four Americas")
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
merged['cluster'] = kmeans.fit_predict(X_scaled)

merged[['period', 'demo_category', 'demo_subcategory', 'cluster']].head()
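
The choice of four clusters is set by the story's framing. One common way to check whether the data support a given number of clusters is the silhouette score; the sketch below runs it on synthetic points that stand in for `X_scaled`, so the outcome says nothing about the polling data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for X_scaled: four tight, well-separated groups.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X_toy = np.vstack([c + rng.normal(scale=0.3, size=(25, 2)) for c in centers])

# Fit KMeans for several candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

Replacing `X_toy` with `X_scaled` would show whether `k=4` scores noticeably better than its neighbours; a flat profile would suggest the four personas are a narrative convenience rather than a sharp structure in the data.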

### 12.2 Cluster profile visualization

To interpret the clusters, we summarize each one with a small set of intuitive indicators:

- Average **“wrong track”** share  
- Average **“right direction”** share  
- Average share selecting **“the state of democracy”** as the top issue  
- Average share selecting **“cost of living”** as the top issue

We then plot them as grouped bars per cluster to visualize the profile of each persona.

In [None]:
metrics = ['wrong track', 'right direction', 'the state of democracy', 'cost of living']

# Ensure metrics exist
missing_metrics = [m for m in metrics if m not in merged.columns]
print("Missing metrics:", missing_metrics)

cluster_summary = (
    merged
    .groupby('cluster')[metrics]
    .mean()
    .reset_index()
)

cluster_long = cluster_summary.melt(
    id_vars='cluster',
    value_vars=metrics,
    var_name='metric',
    value_name='value'
)

fig4 = px.bar(
    cluster_long,
    x='cluster',
    y='value',
    color='metric',
    barmode='group',
    title="The Four Americas: Mood and Issues by Cluster",
    labels={
        'cluster': "Cluster ID",
        'value': "Average share (%)",
        'metric': "Indicator"
    }
)
fig4

### 12.3 Interpreting the clusters (conceptual)

While the exact numeric cutoffs depend on the data, and the cluster IDs assigned by KMeans are arbitrary, the pattern typically looks like:

- **Cluster with lowest “wrong track” and relatively high “right direction”**  
  → *System-OK Optimists* – concerned about issues but still confident the system works.

- **Cluster with mid-level “wrong track” and mixed issue priorities**  
  → *Strained Middle* – feels the squeeze, not fully in either camp.

- **Cluster with high “wrong track” and elevated democracy concern**  
  → *System Skeptics* – worry less about prices as such and more about institutions.

- **Cluster with the highest “wrong track” and intense focus on democracy**  
  → *Crisis Viewers* – believe something is fundamentally broken in the system.

These labels are interpretive overlays; the clustering itself is purely driven by the patterns in the polling data.
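
Since the cluster IDs carry no inherent order, the persona names have to be attached after the fact. The sketch below shows one simple rule, ranking clusters by their mean “wrong track” share. It is a simplification: it ignores the issue-priority part of the descriptions above, `label_clusters` is a hypothetical helper, and the cluster means are made up.

```python
import pandas as pd

# Persona names from the list above, ordered from least to most pessimistic.
PERSONAS = ["System-OK Optimists", "Strained Middle",
            "System Skeptics", "Crisis Viewers"]

def label_clusters(summary, mood_col="wrong track"):
    """Map cluster IDs to persona names by ascending mean 'wrong track' share."""
    ordered_ids = summary.sort_values(mood_col)["cluster"].tolist()
    return dict(zip(ordered_ids, PERSONAS))

# Hypothetical cluster means; KMeans IDs carry no inherent order.
toy_summary = pd.DataFrame({
    "cluster": [0, 1, 2, 3],
    "wrong track": [68.0, 45.0, 80.0, 57.0],
})
persona_by_cluster = label_clusters(toy_summary)
print(persona_by_cluster)
```

With the real `cluster_summary`, the resulting dictionary could be applied via `merged['cluster'].map(persona_by_cluster)`, after checking that the democracy and cost-of-living averages agree with the label each cluster receives.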

## 13. Summary and next steps

This notebook provides the full, code-first backbone for the article:

- It connects raw crosstabbed polling data to:
  - Cleaned, merged analytic tables  
  - Visualizations used in the narrative  
  - Quantitative checks via correlation and regression  
  - Cluster-based personas used to describe emerging “Four Americas”

Possible extensions:

- Add further predictors (e.g. party ID, age group) to the regression models.  
- Explore alternative clustering methods (e.g. Gaussian Mixture Models, hierarchical clustering).  
- Quantify uncertainty more explicitly with bootstrapped confidence intervals for key statistics.
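
As a starting point for the last extension, here is a minimal sketch of a percentile bootstrap for a Pearson correlation. The function name, defaults, and synthetic data are assumptions for illustration. Note that resampling rows treats demographic slices as independent, which they are not, so intervals obtained this way on the real data would likely be too narrow.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling (x, y) pairs together."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    draws = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        draws[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic, positively related data (not polling results).
rng = np.random.default_rng(1)
x_toy = rng.uniform(5, 30, size=80)
y_toy = 40 + 0.8 * x_toy + rng.normal(scale=5, size=80)

low, high = bootstrap_corr_ci(x_toy, y_toy)
print(f"95% bootstrap CI for r: [{low:.2f}, {high:.2f}]")
```

For the notebook's data, the inputs would be two columns of `desc_df` after dropping rows with missing values in either one.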