# Hypothesis Testing – Economic Stress and Digital Behavior

This notebook implements the hypothesis-testing part of the project:

> **Economic Stress and Digital Behavior: How Economic Changes Shape Social Media Usage in Europe**

We use two processed panel datasets:

- `panel_annual_final.csv` (preferred) or `panel_annual.csv` (fallback) – country × year panel with:
  - `sm_participation` – Eurostat social-media participation (% of individuals)
  - `inflation`, `unemployment` – annual macro indicators
  - `cci` – consumer confidence index (annual average)
  - `stress` – combined economic stress index (z-scores of inflation and unemployment minus the z-score of consumer confidence)
  - `d_sm` – year-to-year change in social-media participation (within country)
- `panel_monthly_h3.csv` – country × year × month panel for H3 with:
  - unemployment rate, HICP index
  - platform shares (Facebook, Instagram, YouTube, LinkedIn)
  - `ent_share`, `prof_share` – entertainment vs professional categories
  - monthly changes `d_unemp`, `d_infl`, `d_ent`, `d_prof`
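
As a rough illustration of how the two derived annual columns could be built, the sketch below computes a stress index and `d_sm` on a tiny made-up panel. The numbers are invented, and the equal-weight sum of pooled z-scores (with `ddof=0`) is an assumption; the actual preprocessing may weight or standardize the components differently. The `geo` and `year` column names follow the ones used later in this notebook.

```python
import pandas as pd

# Made-up values for illustration only (not real Eurostat data)
toy = pd.DataFrame({
    "geo": ["AT", "AT", "AT", "BE", "BE", "BE"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "inflation": [1.5, 1.4, 2.8, 1.2, 0.4, 3.2],
    "unemployment": [4.8, 6.0, 6.2, 5.5, 5.8, 6.3],
    "cci": [-2.0, -10.0, -5.0, -4.0, -12.0, -3.0],
    "sm_participation": [56.0, 60.0, 62.0, 72.0, 76.0, 79.0],
})

def zscore(s):
    """Standardize a Series over all rows (pooled across countries; an assumption)."""
    return (s - s.mean()) / s.std(ddof=0)

# Assumed construction: higher inflation/unemployment raise stress, higher confidence lowers it
toy["stress"] = zscore(toy["inflation"]) + zscore(toy["unemployment"]) - zscore(toy["cci"])

# Year-to-year change in participation, computed separately for each country
toy = toy.sort_values(["geo", "year"])
toy["d_sm"] = toy.groupby("geo")["sm_participation"].diff()

print(toy[["geo", "year", "stress", "d_sm"]])
```

The first year of each country has no previous observation, so its `d_sm` is missing and drops out of any analysis that uses the change.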

We test the following hypotheses:

- **H1:** Higher economic stress is associated with higher social-media participation.
- **H2:** When consumer confidence is low, the effect of economic stress on social-media participation is stronger.
- **H3:** During periods of economic stress, platform use shifts:
    - rising unemployment is associated with increases in professional platforms (e.g., LinkedIn),
    - rising inflation is associated with increases in entertainment platforms (e.g., Facebook, Instagram, YouTube).

In [None]:
# ------------------------------------------------------------
# Imports and paths
# ------------------------------------------------------------
import pandas as pd
import numpy as np
from pathlib import Path
import scipy.stats as st
import statsmodels.formula.api as smf
from IPython.display import display  # implicit in Jupyter; imported explicitly for clarity

# Project paths
PROJECT_ROOT = Path.home() / "Desktop" / "DSA210"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"

print("Using processed data from:", DATA_PROCESSED)

# ------------------------------------------------------------
# Load annual panel (panel_a)
# ------------------------------------------------------------
annual_candidates = ["panel_annual_final.csv", "panel_annual.csv"]
panel_a = None
for fname in annual_candidates:
    path = DATA_PROCESSED / fname
    if path.exists():
        # Read first, then report, so the message is only printed on success
        panel_a = pd.read_csv(path)
        print("Loaded annual panel from:", fname)
        break

if panel_a is None:
    raise FileNotFoundError("No annual panel file found. Expected one of: " + ", ".join(annual_candidates))

# ------------------------------------------------------------
# Load monthly panel for H3 (panel_m)
# ------------------------------------------------------------
monthly_path = DATA_PROCESSED / "panel_monthly_h3.csv"
if not monthly_path.exists():
    raise FileNotFoundError("Monthly panel 'panel_monthly_h3.csv' not found in data/processed/.")

panel_m = pd.read_csv(monthly_path)
print("Monthly panel shape:", panel_m.shape)

display(panel_a.head())
display(panel_m.head())

## H1 – Economic Stress and Social-Media Participation

**Hypothesis**

- **H1:** Years with higher economic stress have higher social-media participation.
- **H0:** There is no systematic difference in participation between low-stress and high-stress years.

We use the combined **economic stress index** (`stress`) constructed from inflation, unemployment, and (negatively) consumer confidence.

Two complementary approaches are used:

1. **Group comparison (Welch two-sample t-test)**  
   Compare mean `sm_participation` between:
   - low-stress years (bottom 50% of the stress distribution), and  
   - high-stress years (top 50%).

2. **Linear association (correlation with participation change)**  
   Check whether the stress index is correlated with the within-country year-to-year change in participation (`d_sm`), pooling all country-years.
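
To make the first approach concrete, here is a minimal sketch of Welch's t-statistic computed by hand and checked against `scipy.stats.ttest_ind`. The helper `welch_t` and the two toy samples are hypothetical and only illustrate the formula; they are not project data.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t-statistic, Welch-Satterthwaite df, and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n)
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Made-up participation values (%), for illustration only
high_toy = [62.0, 58.5, 71.0, 66.0, 69.5]
low_toy = [55.0, 60.0, 57.5, 52.0, 63.0, 59.0]

t_manual, df_manual, p_manual = welch_t(high_toy, low_toy)
t_scipy, p_scipy = stats.ttest_ind(high_toy, low_toy, equal_var=False)

print(f"manual: t = {t_manual:.3f}, df = {df_manual:.2f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```

The p-value here is two-sided, matching the default of `ttest_ind`. Because H1 is directional, the sign of the t-statistic should be read alongside the p-value.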

In [None]:
# ------------------------------------------------------------
# H1 – Welch t-test: low vs high stress years
# ------------------------------------------------------------

# Keep observations with non-missing stress and participation
df_h1 = panel_a.dropna(subset=["stress", "sm_participation"]).copy()

# Median split (qcut with q=2) into low- vs high-stress observations
df_h1["stress_group"] = pd.qcut(df_h1["stress"], 2, labels=["Low stress", "High stress"])

low  = df_h1[df_h1["stress_group"] == "Low stress"]["sm_participation"]
high = df_h1[df_h1["stress_group"] == "High stress"]["sm_participation"]

# Welch's t-test (unequal variances); two-sided by default, so the sign of t gives the direction
tstat, pval = st.ttest_ind(high, low, equal_var=False)

print("H1 – Welch t-test (High vs Low stress)")
print(f"t-statistic = {tstat:.3f}")
print(f"p-value     = {pval:.4f}")
print(f"Mean SM participation – Low stress  : {low.mean():.2f}")
print(f"Mean SM participation – High stress : {high.mean():.2f}")
print(f"N_low = {len(low)}, N_high = {len(high)}")

# ------------------------------------------------------------
# H1 – correlation between stress and ΔSM (within countries)
# ------------------------------------------------------------

# Recompute the year-to-year change if the column is missing (assumes `geo` and `year` columns)
if "d_sm" not in panel_a.columns:
    panel_a = panel_a.sort_values(["geo", "year"])
    panel_a["d_sm"] = panel_a.groupby("geo")["sm_participation"].diff()

df_corr = panel_a.dropna(subset=["stress", "d_sm"]).copy()
r_h1, p_h1 = st.pearsonr(df_corr["stress"], df_corr["d_sm"])

print("\nH1 – Pearson correlation between stress and ΔSM (yearly change)")
print(f"r = {r_h1:.3f}, p = {p_h1:.4f}, N = {len(df_corr)}")

## H2 – Moderating Role of Consumer Confidence

**Hypothesis**

- **H2:** When consumer confidence is low, the effect of economic stress on social-media participation is stronger.
- **H0:** Consumer confidence does not moderate the relationship between stress and participation.

We estimate the following linear regression model:

$$
\text{sm\_participation}_{c,t} = \beta_0 + \beta_1\,\text{stress}_{c,t} + \beta_2\,\text{cci}_{c,t} + \beta_3\,(\text{stress}_{c,t} \times \text{cci}_{c,t}) + \varepsilon_{c,t}
$$

Where:

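- β1 is the effect of economic stress when `cci` equals 0
- β2 is the effect of consumer confidence when `stress` equals 0
- β3 measures how the effect of stress changes with consumer confidence; given a positive stress effect, H2 implies β3 < 0 (the effect grows as confidence falls)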
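
In an interaction model, the effect of stress at a given confidence level is β1 + β3 · cci. The sketch below illustrates this on simulated data with known coefficients. It uses a plain NumPy least-squares fit as a dependency-light stand-in for the `statsmodels` fit in this notebook. The coefficient values, sample size, and the helper `stress_slope` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
stress = rng.normal(0.0, 1.0, n)
cci = rng.normal(-5.0, 5.0, n)

# Simulated outcome with known coefficients: b0=50, b1=2.0, b2=0.3, b3=-0.4
y = 50 + 2.0 * stress + 0.3 * cci - 0.4 * stress * cci + rng.normal(0.0, 1.0, n)

# Design matrix: intercept, stress, cci, stress x cci
X = np.column_stack([np.ones(n), stress, cci, stress * cci])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

def stress_slope(cci_value):
    """Marginal effect of stress at a given confidence level: b1 + b3 * cci."""
    return b1 + b3 * cci_value

for level in (-10.0, 0.0):
    print(f"cci = {level:6.1f} -> slope of stress = {stress_slope(level):.2f}")
```

With a negative interaction coefficient, the estimated slope of stress is steeper at low confidence than at neutral confidence, which is the pattern H2 predicts.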

In [None]:
# ------------------------------------------------------------
# H2 – OLS regression with interaction: sm_participation ~ stress * cci
# ------------------------------------------------------------

df_h2 = panel_a.dropna(subset=["stress", "cci", "sm_participation"]).copy()

model_h2 = smf.ols("sm_participation ~ stress * cci", data=df_h2).fit()

print(model_h2.summary())

# Extract key coefficients
coef = model_h2.params
pvals = model_h2.pvalues

print("\nKey coefficients:")
print(f"stress      coef = {coef['stress']:.3f}, p = {pvals['stress']:.4f}")
print(f"cci         coef = {coef['cci']:.3f}, p = {pvals['cci']:.4f}")
print(f"stress:cci  coef = {coef['stress:cci']:.3f}, p = {pvals['stress:cci']:.4f}")

## H3 – Platform Shifts Under Economic Stress

**Hypothesis**

- **H3-1:** Increases in unemployment are associated with increases in professional-platform share (e.g., LinkedIn).
- **H3-2:** Increases in inflation are associated with increases in entertainment-platform share (e.g., Facebook, Instagram, YouTube).
- **H0:** Short-run changes in unemployment and inflation are not systematically related to changes in platform categories.

We use the monthly panel `panel_monthly_h3.csv` and focus on month-to-month changes within each country:

- `d_unemp` – monthly change in unemployment rate  
- `d_infl` – monthly change in HICP index  
- `d_prof` – monthly change in professional platform share  
- `d_ent` – monthly change in entertainment platform share  

For each pair we compute a Pearson correlation after dropping rows with missing values.
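
The sketch below shows one way month-to-month changes could be computed within each country before correlating them. The toy values are invented, and the `geo`, `year`, `month`, and `unemp` column names are assumptions about the monthly panel's layout. It also shows why differencing must not cross country boundaries.

```python
import numpy as np
import pandas as pd

# Made-up monthly values for two countries (illustration only)
toy_m = pd.DataFrame({
    "geo": ["DE"] * 4 + ["FR"] * 4,
    "year": [2022] * 8,
    "month": [1, 2, 3, 4, 1, 2, 3, 4],
    "unemp": [3.0, 3.1, 3.3, 3.2, 7.4, 7.3, 7.5, 7.8],
    "prof_share": [2.0, 2.1, 2.4, 2.2, 1.5, 1.4, 1.6, 2.0],
})
toy_m = toy_m.sort_values(["geo", "year", "month"])

# Differences taken within each country: the first month of a country is NaN
toy_m["d_unemp"] = toy_m.groupby("geo")["unemp"].diff()
toy_m["d_prof"] = toy_m.groupby("geo")["prof_share"].diff()

# A naive diff would subtract the last DE value from the first FR value
naive = toy_m["unemp"].diff()
print("Naive diff at first FR row:", round(naive.iloc[4], 2))
print("Grouped diff at first FR row:", toy_m["d_unemp"].iloc[4])

# Pearson r on rows where both changes are available
pairs = toy_m[["d_unemp", "d_prof"]].dropna()
r = np.corrcoef(pairs["d_unemp"], pairs["d_prof"])[0, 1]
print(f"r = {r:.3f}, n = {len(pairs)}")
```

The grouped difference leaves the first month of each country missing, so those rows are excluded from the correlation rather than contributing a spurious jump between countries.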

In [None]:
# ------------------------------------------------------------
# H3 – Pearson correlations for monthly changes
# ------------------------------------------------------------

def safe_corr(df, x, y, label):
    """Compute Pearson r between columns x and y, dropping rows with missing values."""
    sub = df[[x, y]].dropna()
    n = len(sub)
    if n < 2:
        print(f"{label}: not enough overlapping non-missing observations (n={n}).")
        return np.nan, np.nan, n
    r, p = st.pearsonr(sub[x], sub[y])
    print(f"{label}: r = {r:.3f}, p = {p:.3g}, n = {n}")
    return r, p, n

print("Monthly panel loaded for H3.")
print("Columns:", panel_m.columns.tolist())
print("Shape  :", panel_m.shape)

print("\n==============================")
print("H3-1: ΔUnemployment → ΔProfessional share")
print("==============================")
r1, p1, n1 = safe_corr(panel_m, "d_unemp", "d_prof",
                       "H3-1 ΔUnemployment vs ΔProfessional share")

print("\n==============================")
print("H3-2: ΔInflation → ΔEntertainment share")
print("==============================")
r2, p2, n2 = safe_corr(panel_m, "d_infl", "d_ent",
                       "H3-2 ΔInflation vs ΔEntertainment share")

print("\nSummary:")
# Format consistently with safe_corr's own output (NaN prints as 'nan')
print(f"H3-1: r = {r1:.3f}, p = {p1:.3g}, n = {n1}")
print(f"H3-2: r = {r2:.3f}, p = {p2:.3g}, n = {n2}")