
# Cholera Early‑Warning → SMS Alert (Demo Notebook)

This notebook shows **how a predictive model can trigger SMS alerts** for cholera risk in Nigerian regions using:
1) **Model creation** on historical-like (mocked) data
2) **Loading + scoring** with near real-time weather forecasts (mocked)

> Why this matters:
- **Sparse diagnoses**: confirmed cholera diagnoses are sporadic and lag reality, so a *probabilistic early-warning* helps detect risk *before* official case confirmations.
- **Supports contact tracing**: timely risk flags can route outreach and hygiene messaging to likely hotspots, accelerating case finding and **breaking transmission chains**.



> **Known limitations (to be transparent in the repo):**
- **Reliant on user reporting / SMS access:** outreach assumes people own phones and opt-in; network coverage and literacy vary.
- **Changing weather patterns:** climate regime shifts and anomalous seasons can change the relationships the model learned between weather and outbreaks; models need **ongoing retraining**.
- **Sparse local sub-regional data:** many covariates are coarse; sub‑LGA variability (informal settlements, borehole failures) is under-captured → **uncertainty** remains.


In [None]:

# Core imports
import json  # used below to save and reload the feature list
import numpy as np
import pandas as pd
from pathlib import Path

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve, RocCurveDisplay
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import joblib

# Plotting (follow org conventions: matplotlib only, single chart per cell, no explicit colors)
import matplotlib.pyplot as plt

# Reproducibility
rng = np.random.default_rng(42)
DATA_DIR = Path('/mnt/data')
DATA_DIR.mkdir(parents=True, exist_ok=True)



## 1) Mock Historical Data (Training Set)

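We create a toy weekly panel at the **state** level with these features (per‑capita where noted):
- `rainfall_mm` (weekly forecast/estimate; we also create a one-week lag term)
- `flood_risk` (index from 0 to 1; we also create a one-week lag term)
- `water_access_pc` (people with clean-water access per 1k population)
- `health_facilities_pc` (health facilities per 1k population)
- **Label:** `outbreak_next_2w` (1 if an outbreak emerges within the next two weeks). In this mock the label is sampled from the same week's features, so it only stands in for a properly forward-shifted target.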

> In production, replace this block by joining official case line‑lists (NCDC), ERA5 rainfall/flood indices, WASH coverage, and facility counts, all normalized by population and aligned to **region × week**.
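
The join described above could look roughly like the following sketch. The three tables and their column names (`confirmed_cases`, `population`, `health_facilities`) are hypothetical stand-ins, not an official NCDC or registry schema; the point is only the region × week alignment and the per-1k normalization.

```python
import pandas as pd

# Hypothetical source tables; column names are illustrative assumptions.
cases = pd.DataFrame({
    "state": ["State_01", "State_01", "State_02"],
    "week": pd.to_datetime(["2023-01-01", "2023-01-08", "2023-01-01"]),
    "confirmed_cases": [3, 0, 1],
})
rainfall = pd.DataFrame({
    "state": ["State_01", "State_01", "State_02", "State_02"],
    "week": pd.to_datetime(["2023-01-01", "2023-01-08", "2023-01-01", "2023-01-08"]),
    "rainfall_mm": [72.0, 41.5, 88.0, 60.2],
})
registry = pd.DataFrame({
    "state": ["State_01", "State_02"],
    "population": [4_000_000, 2_500_000],
    "health_facilities": [1200, 500],
})

# Weather defines the full region x week grid; cases are left-joined so that
# weeks without a reported case become 0 instead of dropping out of the panel.
panel = rainfall.merge(cases, on=["state", "week"], how="left", validate="one_to_one")
panel["confirmed_cases"] = panel["confirmed_cases"].fillna(0).astype(int)

# Static covariates join on region only and are normalized per 1k people.
panel = panel.merge(registry, on="state", how="left", validate="many_to_one")
panel["health_facilities_pc"] = panel["health_facilities"] / panel["population"] * 1000

print(panel[["state", "week", "rainfall_mm", "confirmed_cases", "health_facilities_pc"]])
```

The `validate` arguments make pandas raise if a source table has duplicate keys, which is a cheap guard against silently inflating the panel.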


In [None]:

# Create mock regions and weeks
states = [f'State_{i:02d}' for i in range(1, 20)]  # 19 states for demo
weeks = pd.date_range('2023-01-01', periods=60, freq='W')  # ~60 weeks

rows = []
for s in states:
    base_water = rng.uniform(20, 80)  # per 1k
    base_fac = rng.uniform(0.1, 0.6)  # per 1k
    flood_profile = rng.uniform(0.2, 0.8)
    for w in weeks:
        rainfall = max(0, rng.normal(60, 30))  # mm/week
        flood = np.clip(flood_profile + 0.003 * (rainfall - 60) + rng.normal(0, 0.05), 0, 1)
        water_access = np.clip(base_water + rng.normal(0, 5), 0, 100)
        health_fac_pc = np.clip(base_fac + rng.normal(0, 0.05), 0, 2)

        # Latent risk: more rainfall & flood -> higher risk; better water & facilities -> lower risk
        logit = -2.2 + 0.012*rainfall + 1.3*flood - 0.015*water_access - 0.8*health_fac_pc
        p = 1/(1 + np.exp(-logit))
        outbreak = rng.binomial(1, p)

        rows.append(dict(
            state=s, week=w, rainfall_mm=rainfall, flood_risk=flood,
            water_access_pc=water_access, health_facilities_pc=health_fac_pc,
            outbreak_next_2w=outbreak
        ))

df = pd.DataFrame(rows).sort_values(['state','week']).reset_index(drop=True)

# Simple feature lags
df['rainfall_mm_lag1'] = df.groupby('state')['rainfall_mm'].shift(1).fillna(df['rainfall_mm'].median())
df['flood_risk_lag1'] = df.groupby('state')['flood_risk'].shift(1).fillna(df['flood_risk'].median())

df.head()



## 2) Train a Predictive Model

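We use a small **Gradient Boosting** classifier inside a pipeline, then **calibrate** its probabilities (isotonic regression, 3-fold cross-validation) so that the action thresholds operate on usable risk estimates. The standardization step is not required by tree ensembles, which are insensitive to feature scale; it is kept so the pipeline can be swapped for a scale-sensitive model without other changes.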
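
One way to check whether calibration actually helps is to compare the Brier score (lower is better) of the raw and the calibrated model on held-out data. The sketch below uses synthetic data from `make_classification` rather than the notebook's panel, so the numbers say nothing about cholera; it only illustrates the comparison.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 20% positives).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=3
).fit(X_tr, y_tr)

briers = {}
for name, clf in [("raw", raw), ("isotonic", calibrated)]:
    p = clf.predict_proba(X_te)[:, 1]
    briers[name] = brier_score_loss(y_te, p)
    # Reliability curve: observed outbreak fraction vs. mean predicted risk per bin.
    frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=5, strategy="quantile")
    print(f"{name}: Brier={briers[name]:.4f}, bins={len(frac_pos)}")
```

Isotonic calibration can overfit when positives are scarce, so on small panels it is worth comparing against `method="sigmoid"` as well.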


In [None]:

features = ['rainfall_mm','flood_risk','water_access_pc','health_facilities_pc',
            'rainfall_mm_lag1','flood_risk_lag1']
target = 'outbreak_next_2w'

X = df[features].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7, stratify=y)

# Pipeline: scale -> model
base_model = Pipeline([
    ('scaler', StandardScaler()),
    ('gb', GradientBoostingClassifier(random_state=7))
])

# Calibrated probabilities
cal_model = CalibratedClassifierCV(base_model, method='isotonic', cv=3)
cal_model.fit(X_train, y_train)

pred_proba = cal_model.predict_proba(X_test)[:,1]
roc = roc_auc_score(y_test, pred_proba)
ap = average_precision_score(y_test, pred_proba)

print(f'ROC AUC: {roc:.3f} | Average Precision: {ap:.3f}')



### Choose Action Thresholds

We’ll adopt the hackathon defaults:
- **High risk ≥ 0.70** → trigger **public SMS** + notify officials
- **Medium ≥ 0.40 and < 0.70** → notify facilities; no mass SMS
- **Low < 0.40** → routine surveillance

Below we visualize the ROC just to sanity‑check separability (for production, add cost‑sensitive analysis).
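
As a starting point for the cost-sensitive analysis mentioned above, the sketch below scans a grid of thresholds and picks the one with the lowest total cost on held-out predictions. The function name `pick_threshold` and the cost weights are illustrative assumptions; real weights would have to come from health authorities.

```python
import numpy as np

def pick_threshold(y_true, scores, cost_false_alarm=1.0, cost_missed=5.0):
    """Return (threshold, cost) minimizing total cost over a fixed grid."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    grid = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    costs = []
    for t in grid:
        flagged = scores >= t
        false_alarms = np.sum(flagged & (y_true == 0))
        missed = np.sum(~flagged & (y_true == 1))
        costs.append(cost_false_alarm * false_alarms + cost_missed * missed)
    best = int(np.argmin(costs))  # first minimum, i.e. the lowest such threshold
    return float(grid[best]), float(costs[best])

# Toy held-out labels and predicted risks (made up for illustration).
y_holdout = [0, 0, 0, 1, 0, 1, 1, 1]
p_holdout = [0.12, 0.22, 0.37, 0.42, 0.57, 0.62, 0.81, 0.93]

threshold, cost = pick_threshold(y_holdout, p_holdout)
print(f"threshold={threshold:.2f}, total cost={cost:.1f}")
```

Because a missed outbreak is weighted five times a false alarm here, the chosen threshold sits low enough to catch every positive in the toy data at the price of one false alarm.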


In [None]:

RocCurveDisplay.from_predictions(y_test, pred_proba)
plt.title("ROC (holdout)")
plt.show()



### Persist the Model

We save the calibrated model and the feature list for reuse in the scoring step.


In [None]:

MODEL_PATH = DATA_DIR / 'cholera_model.joblib'
FEATURES_PATH = DATA_DIR / 'cholera_features.json'

joblib.dump(cal_model, MODEL_PATH)
FEATURES_PATH.write_text(json.dumps(features, indent=2))

print(f"Saved model → {MODEL_PATH}")
print(f"Saved features → {FEATURES_PATH}")



## 3) Real‑Time Scoring with Forecasts

Assume a weekly job pulls **weather forecasts** and merges with static covariates (flood risk baseline, WASH access, facilities per‑capita). Here we mock a single scoring week for a subset of states.

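> In production, replace the CSV creation below with your pipeline that ingests a weather forecast API plus registry tables. ERA5 is a historical reanalysis, so it suits building training features rather than supplying the forecast week.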


In [None]:

# Create a mock forecast CSV for the coming week
scoring_states = states[:8]
score_rows = []
future_week = pd.to_datetime('2025-09-28')  # example next week

for s in scoring_states:
    # Pull last week's values from df to create lags
    hist = df[df['state']==s].iloc[-1]
    rainfall_forecast = max(0, rng.normal(70, 25))  # wetter week

    flood_forecast = float(np.clip(hist['flood_risk'] + 0.004*(rainfall_forecast - hist['rainfall_mm']) + rng.normal(0,0.04), 0, 1))

    score_rows.append(dict(
        state=s,
        week=future_week,
        rainfall_mm=rainfall_forecast,
        flood_risk=flood_forecast,
        water_access_pc=float(np.clip(hist['water_access_pc'] + rng.normal(0,2), 0, 100)),
        health_facilities_pc=float(hist['health_facilities_pc']),
        rainfall_mm_lag1=float(hist['rainfall_mm']),
        flood_risk_lag1=float(hist['flood_risk'])
    ))

score_df = pd.DataFrame(score_rows)
forecast_csv = DATA_DIR / 'weekly_forecast_input.csv'
score_df.to_csv(forecast_csv, index=False)
print(f"Wrote: {forecast_csv}")
score_df.head()



### Load Model & Score

We produce **risk probabilities** and map them into **action tiers** that the backend can consume.
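
Before scoring, it can help to verify that the incoming file really carries the features the model expects. The helper below is a hypothetical sketch (`select_features` is not part of the notebook); it checks presence, numeric dtype, and missing values, and returns the columns in the saved order.

```python
import pandas as pd

def select_features(frame: pd.DataFrame, expected: list) -> pd.DataFrame:
    """Return frame[expected] after basic schema checks; raise ValueError otherwise."""
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    X = frame[expected]  # enforces the training-time column order
    non_numeric = [c for c in expected if not pd.api.types.is_numeric_dtype(X[c])]
    if non_numeric:
        raise ValueError(f"non-numeric feature columns: {non_numeric}")
    with_nans = [c for c in expected if X[c].isna().any()]
    if with_nans:
        raise ValueError(f"feature columns with missing values: {with_nans}")
    return X

expected = ["rainfall_mm", "flood_risk"]
good = pd.DataFrame({"state": ["State_01"], "flood_risk": [0.4], "rainfall_mm": [70.0]})

X_ok = select_features(good, expected)
print(list(X_ok.columns))

try:
    select_features(good.drop(columns=["flood_risk"]), expected)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Failing loudly here is preferable to scoring a week with a silently reordered or incomplete feature set.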


In [None]:

# Load artifacts
model = joblib.load(MODEL_PATH)
feat_list = json.loads(FEATURES_PATH.read_text())

incoming = pd.read_csv(forecast_csv, parse_dates=['week'])

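# Predict probabilities. The model was fitted on a NumPy array, so pass values
# (in the saved column order) to avoid scikit-learn's feature-name warning.
probs = model.predict_proba(incoming[feat_list].to_numpy())[:,1]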
incoming['risk_prob'] = probs

def to_action(p):
    if p >= 0.70:
        return 'HIGH: trigger_public_sms & notify_officials'
    if p >= 0.40:
        return 'MEDIUM: notify_facilities & monitor'
    return 'LOW: routine_surveillance'

incoming['action'] = incoming['risk_prob'].apply(to_action)
incoming[['state','week','risk_prob','action']]



### Backend Hook (Pseudo‑Trigger)

In production, the scored dataframe from the previous cell would be iterated and each row passed to your **SMS/alert service**. We stub a `send_alert()` function to show the interface you can wire to Twilio/Vonage/etc.
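
A production publisher would need to avoid sending the same alert twice if the weekly job is re-run. The sketch below shows one possible shape for that, using an in-memory set of deduplication keys and a simple per-run cap. The `AlertDispatcher` class, its key format, and the cap are all assumptions for illustration; a real system would keep the keys in a persistent store.

```python
import hashlib

class AlertDispatcher:
    """Wraps a publish callable with deduplication and a per-run send cap."""

    def __init__(self, publish, max_per_run=100):
        self._publish = publish
        self._sent_keys = set()  # in production: a database or cache, not memory
        self._max_per_run = max_per_run
        self._count = 0

    @staticmethod
    def alert_key(region, week, action):
        # One alert per region, week, and tier; the probability is deliberately
        # excluded so a re-score of the same week does not re-send.
        tier = action.split(":", 1)[0]
        return hashlib.sha256(f"{region}|{week}|{tier}".encode()).hexdigest()

    def dispatch(self, region, week, prob, action):
        """Send once per key; return True if sent, False if skipped."""
        key = self.alert_key(region, week, action)
        if key in self._sent_keys or self._count >= self._max_per_run:
            return False
        self._publish(region, week, prob, action)
        self._sent_keys.add(key)
        self._count += 1
        return True

outbox = []
dispatcher = AlertDispatcher(
    publish=lambda region, week, prob, action: outbox.append((region, week, action))
)
action = "HIGH: trigger_public_sms & notify_officials"
first = dispatcher.dispatch("State_01", "2025-09-28", 0.82, action)
second = dispatcher.dispatch("State_01", "2025-09-28", 0.85, action)  # duplicate
print(first, second, len(outbox))
```

The `publish` argument has the same signature as `send_alert()`, so the stub can be passed in directly while the real provider integration is pending.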


In [None]:

def send_alert(region: str, week: str, prob: float, action: str):
    """Placeholder for your real SMS/notification publisher.
    In prod, enforce idempotency + rate limiting, and log delivery + response.
    """
    print(f"[ALERT] {region} | {week} | risk={prob:.2f} → {action}")
    # TODO: integrate with your messaging bus / SMS provider

# Fire only HIGH and MEDIUM actions
to_fire = incoming[incoming['action'].str.startswith(('HIGH','MEDIUM'))]
for _, row in to_fire.iterrows():
    send_alert(row['state'], row['week'].date().isoformat(), row['risk_prob'], row['action'])



### Export Scored Output

Downstream services can pick up a simple CSV with `state, week, risk_prob, action`.


In [None]:

scored_csv = DATA_DIR / 'weekly_scored_output.csv'
incoming[['state','week','risk_prob','action']].to_csv(scored_csv, index=False)
print(f"Wrote: {scored_csv}")
incoming[['state','week','risk_prob','action']]



## Next Steps (productionizing)
- Replace mocked joins with your **ETL** that aligns *region × week* across sources; validate population normalizations.
- Add **drift monitoring** and **scheduled retraining** (e.g., monthly) as new case data arrives.
- Implement **cost-aware thresholds** co-designed with health authorities (false alarm vs missed detection trade-offs).
- Log **alert outcomes** (deliveries, clinic presentations) to improve the model via closed-loop learning.
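
For the drift-monitoring step, one commonly used summary is the population stability index (PSI), which compares how a feature is distributed in the training data versus a recent scoring window. The sketch below is a minimal NumPy version under assumed settings (10 quantile bins, a small floor to avoid log of zero); the alerting level for PSI would need to be agreed separately.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins of the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so values outside the reference range
    # fall into the first or last bin instead of being dropped.
    ref_idx = np.searchsorted(edges[1:-1], reference, side="right")
    cur_idx = np.searchsorted(edges[1:-1], current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current)
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

demo_rng = np.random.default_rng(0)
train_rain = demo_rng.normal(60, 30, size=2000)    # mimics training rainfall
wetter_rain = demo_rng.normal(90, 30, size=2000)   # mimics a much wetter season

psi_same = population_stability_index(train_rain, train_rain)
psi_shifted = population_stability_index(train_rain, wetter_rain)
print(f"PSI same={psi_same:.4f}, shifted={psi_shifted:.4f}")
```

Running this per feature on each weekly scoring batch gives a simple signal for when retraining should be brought forward.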
