# Task 3: Event Impact Modeling

This notebook loads the enriched processed dataset, builds the event–indicator association matrix, applies temporal impact modeling (lag + linear accumulation), and validates against the Telebirr case study.

In [None]:
import sys
sys.path.append("..")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_loading import load_processed_enriched
from src.impact_model import (
    merge_event_impacts,
    build_event_indicator_matrix,
    compute_numeric_impact,
    apply_event_impacts_over_time,
)

# Load from processed (not raw). If missing, run: python scripts/build_processed_enriched.py
try:
    data, events, impact_links = load_processed_enriched("../data/processed/ethiopia_fi_enriched.xlsx")
except FileNotFoundError:
    import subprocess
    subprocess.run([sys.executable, "scripts/build_processed_enriched.py"], cwd="..", check=True)
    data, events, impact_links = load_processed_enriched("../data/processed/ethiopia_fi_enriched.xlsx")
print("Data shape:", data.shape, "Events:", len(events), "Impact links:", len(impact_links))
data.head(2)

In [None]:
# Merge impact_links with events using parent_id
merged = impact_links.merge(
    events[["record_id", "category", "period_start", "source_name"]],
    left_on="parent_id",
    right_on="record_id",
    how="left",
    suffixes=("_impact", "_event"),
)
merged = compute_numeric_impact(merged)
merged[["parent_id", "indicator_code", "impact_direction", "impact_magnitude", "lag_months", "period_start"]].head(10)

In [None]:
# Build matrix: rows = events, columns = indicators, values = signed impact
matrix = build_event_indicator_matrix(merged)
print("Event–Indicator Matrix (table):")
display(matrix)

### Modeling choices

- **Direction and magnitude:** Impact direction (positive/negative or increase/decrease) is mapped to a sign (+1 or -1). Magnitude is either numeric (e.g. percentage points) or categorical (low/medium/high), mapped to 0.5, 1.5, and 3.0 respectively. The **signed impact** is direction × magnitude.
- **Additivity:** Impacts are **additive**. In the matrix, multiple links for the same (event, indicator) pair are summed into one cell; in the temporal model, the signed impacts of all events affecting an indicator are summed.
- **Multiple events per indicator:** The pivot uses `aggfunc="sum"`, so repeated event–indicator links are summed. This assumes effects combine additively rather than multiplicatively, which is a simplifying assumption given data constraints.
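
The choices above can be illustrated with a minimal sketch on toy data. The mapping dictionaries, the `signed_impact` helper, and the link values below are assumptions made for illustration; the actual `compute_numeric_impact` and `build_event_indicator_matrix` in `src/impact_model.py` may differ in detail.

```python
import pandas as pd

# Hypothetical mappings that mirror the description above.
DIRECTION_SIGN = {"positive": 1, "increase": 1, "negative": -1, "decrease": -1}
MAGNITUDE_LEVEL = {"low": 0.5, "medium": 1.5, "high": 3.0}

def signed_impact(direction, magnitude):
    """Return direction sign times magnitude (numeric or low/medium/high)."""
    sign = DIRECTION_SIGN[str(direction).strip().lower()]
    if isinstance(magnitude, str):
        size = MAGNITUDE_LEVEL[magnitude.strip().lower()]
    else:
        size = float(magnitude)
    return sign * size

# Toy links (made-up values): EVT_A has two links to the same indicator.
links = pd.DataFrame({
    "parent_id": ["EVT_A", "EVT_A", "EVT_B"],
    "indicator_code": ["ACC_MM_ACCOUNT"] * 3,
    "impact_direction": ["increase", "negative", "positive"],
    "impact_magnitude": ["high", "low", 2.0],
})
links["signed_impact"] = [
    signed_impact(d, m)
    for d, m in zip(links["impact_direction"], links["impact_magnitude"])
]

# Rows = events, columns = indicators; duplicate links are summed per cell.
toy_matrix = links.pivot_table(
    index="parent_id",
    columns="indicator_code",
    values="signed_impact",
    aggfunc="sum",
    fill_value=0.0,
)
print(toy_matrix)
```

In this toy case the two `EVT_A` links (+3.0 and -0.5) collapse into a single cell, which is the additive behaviour that `aggfunc="sum"` implies.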

In [None]:
# Heatmap centered at 0
fig, ax = plt.subplots(figsize=(10, 6))
if matrix.size > 0:
    vmax = matrix.abs().max().max() or 1
    sns.heatmap(matrix, annot=True, fmt=".2f", center=0, cmap="RdBu_r", vmin=-vmax, vmax=vmax, ax=ax)
    ax.set_title("Event x Indicator Impact Matrix (signed impact)")
else:
    ax.text(0.5, 0.5, "No event-indicator links", ha="center", va="center")
plt.tight_layout()
plt.show()

In [None]:
# Get observations for one indicator (e.g. ACC_MM_ACCOUNT or first available)
obs = data[data["record_type"] == "observation"].copy()
# Non-null indicator codes; an empty Series if the column is absent
codes = obs["indicator_code"].dropna() if "indicator_code" in obs.columns else pd.Series(dtype=object)
if "ACC_MM_ACCOUNT" in codes.values:
    indicator_code_sel = "ACC_MM_ACCOUNT"
else:
    # Fall back to the first available code; None if there are no codes at all
    indicator_code_sel = codes.iloc[0] if not codes.empty else None
if indicator_code_sel:
    ind_df = obs[obs["indicator_code"] == indicator_code_sel][["observation_date", "value_numeric", "indicator_code"]].dropna(subset=["value_numeric"]).copy()
else:
    ind_df = obs[["observation_date", "value_numeric"]].dropna(subset=["value_numeric"]).head(20).copy()
    ind_df["indicator_code"] = "indicator"
ind_df["observation_date"] = pd.to_datetime(ind_df["observation_date"])
ind_df = ind_df.sort_values("observation_date")
impacted = apply_event_impacts_over_time(ind_df, merged)
impacted[["observation_date", "value_numeric", "impact_addition", "value_impacted"]].head(10)

In [None]:
# Visualize: baseline vs impacted trajectory
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(impacted["observation_date"], impacted["value_numeric"], marker="o", label="Baseline (observed)", color="C0")
ax.plot(impacted["observation_date"], impacted["value_impacted"], marker="s", label="With event impacts", color="C1", linestyle="--")
ax.set_title(f"Baseline vs impacted trajectory ({indicator_code_sel or 'indicator'})")
ax.set_ylabel("Value")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Telebirr validation: predicted vs observed change
obs_acc = data[(data["record_type"] == "observation") & (data["indicator_code"] == "ACC_MM_ACCOUNT")].copy()
if obs_acc.empty:
    obs_acc = data[data["record_type"] == "observation"].copy()
    obs_acc["indicator_code"] = obs_acc.get("indicator_code", "ACC_MM_ACCOUNT")
obs_acc["observation_date"] = pd.to_datetime(obs_acc["observation_date"])
obs_acc = obs_acc.sort_values("observation_date")
observed_2021 = 4.7
observed_2024 = 9.45
observed_change = observed_2024 - observed_2021

# Modeled: apply event impacts and compare 2021 vs 2024
acc_ts = obs_acc[obs_acc["indicator_code"] == "ACC_MM_ACCOUNT"][["observation_date", "value_numeric", "indicator_code"]].dropna()
if acc_ts.empty:
    acc_ts = pd.DataFrame({"observation_date": [pd.Timestamp("2021-12-31"), pd.Timestamp("2024-12-31")], "value_numeric": [observed_2021, observed_2024], "indicator_code": "ACC_MM_ACCOUNT"})
acc_impacted = apply_event_impacts_over_time(acc_ts, merged)
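# Predicted change: growth in the modeled impact addition between the first and last observation.
# Differencing value_impacted instead would fold the observed baseline change into the prediction
# (assuming value_impacted = value_numeric + impact_addition).
predicted_change = acc_impacted["impact_addition"].iloc[-1] - acc_impacted["impact_addition"].iloc[0] if len(acc_impacted) >= 2 else (acc_impacted["impact_addition"].max() - acc_impacted["impact_addition"].min())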

print("Telebirr (ACC_MM_ACCOUNT) validation:")
print(f"  Observed 2021: {observed_2021}%")
print(f"  Observed 2024: {observed_2024}%")
print(f"  Observed change: {observed_change:.2f} pp")
print(f"  Model predicted change (impact addition): {predicted_change:.2f} pp")
print(f"  Gap: {observed_change - predicted_change:.2f} pp")

### Explaining the gap

- **Confounding policies:** NDPS and other initiatives (e.g. interoperability) also pushed adoption, so the observed change cannot be attributed to Telebirr alone; the model captures only the part assigned to it through impact links.
- **Data frequency:** With only a few observation points and a three-year window (2021 to 2024), the timing of the effect (lag) can only be resolved coarsely.
- **Adoption inertia:** Real adoption may ramp non-linearly (slow then fast); our linear accumulation is a simplification.

## Methodology

- **Approach:** We build an Event × Indicator signed impact matrix from analyst-defined impact links (direction and magnitude). Temporal impact is applied with a **lag** (effect starts after `lag_months`) and **linear accumulation** over a fixed duration (e.g. 36 months), with **no decay**.
- **Functional form:** Additive effect on the indicator level; cumulative impact = sum over events of (months_since_effect_start / duration) × signed_impact, capped at total impact per event.
- **Appropriateness:** Given sparse time series and few events, we avoid complex dynamics (e.g. decay, saturation) and keep the model interpretable and auditable.
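
As a rough illustration of the functional form above, the sketch below ramps each event's signed impact linearly from 0 to its full value over `duration_months`, starting once the lag has elapsed, and then sums across events. The function names, the whole-month resolution, and the example dates and magnitudes are assumptions for illustration only; `apply_event_impacts_over_time` may be implemented differently.

```python
import pandas as pd

def months_between(start, end):
    """Whole calendar months from start to end (negative if end is earlier)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def cumulative_impact(obs_date, event_start, lag_months, signed_impact,
                      duration_months=36):
    """Lagged linear ramp with no decay, capped at the event's total impact."""
    effect_start = event_start + pd.DateOffset(months=lag_months)
    elapsed = months_between(effect_start, obs_date)
    # Fraction of the ramp completed, clamped to [0, 1]
    fraction = min(max(elapsed / duration_months, 0.0), 1.0)
    return fraction * signed_impact

def total_impact(obs_date, events):
    """Additive combination: sum the ramped impact of every event."""
    return sum(cumulative_impact(obs_date, **event) for event in events)

# Illustrative events (made-up dates and magnitudes, not project data)
example_events = [
    {"event_start": pd.Timestamp("2021-05-01"), "lag_months": 6, "signed_impact": 3.0},
    {"event_start": pd.Timestamp("2023-08-01"), "lag_months": 3, "signed_impact": 1.5},
]

for date in ["2021-06-01", "2023-05-01", "2030-01-01"]:
    print(date, total_impact(pd.Timestamp(date), example_events))
```

Before the lag has passed the contribution is zero, halfway through the ramp it is half the signed impact, and after `duration_months` it stays at the full value, which is what "no decay" means here.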

## Assumptions

- **Linearity:** Impact accumulates linearly over the chosen duration (no saturation within that window).
- **Additivity:** Multiple events affecting the same indicator add (no interaction terms).
- **No decay:** Effect does not fade over time within the horizon (explicit assumption).
- **Proxy use of comparable countries:** Magnitude and lag for Telebirr are informed by Kenya/Tanzania mobile money evidence; Ethiopia’s context may differ.

## Confidence & Uncertainty

- **High confidence:** Telebirr → ACC_MM_ACCOUNT (positive, well-documented launch; comparable Kenya evidence).
- **Medium confidence:** M-Pesa/Safaricom entry effects (more recent, with less post-entry data).
- **Low confidence:** Policy events with unclear timing or magnitude; indicators with a single source.