Automated pipeline for collecting structured protest event data from African news sources. Discovers articles via GDELT, BBC Monitoring, and World News API, extracts structured events using an Azure-hosted LLM, and writes research-ready JSONL/CSV to Azure Data Lake Storage Gen2 on a daily cron schedule.
| Item | Value |
|---|---|
| Codebook | v2.3 (Halterman & Keith 2025, Type III stipulative definitions) |
| Target geography | Nigeria (NG), South Africa (ZA), Uganda (UG), Algeria (DZ) |
| LLM backend | Azure AI Foundry — `--model` sets the deployment name (default: `gpt-4.1`) |
| Production schedule | Daily at 06:00 UTC via Azure Container Apps Job (`pea-daily`) |
Articles flow through the stages below in the acquire step. Stages 1–2 run once and are shared across all active domains; stages 2.5–5 run independently per domain, so an article can qualify as both a protest event and a drone event.
┌─────────────────────────────────────────────────────────────────────┐
│ SHARED (all domains) │
│ │
│ 1a. GDELT DOC 2.0 1b. BBC Monitoring 1c. World News │
│ One query per country Optional: --source bbc API (optional) │
│ FIPS sourcecountry or --source both/all --source worldnews│
│ filter + keyword tags Civil_unrest topic or --source all │
│ ISO3 country codes │
│ │
│ 1d. File / ADLS input 2. Scraper │
│ --source file newspaper3k + requests fallback │
│ Pre-scraped CSV/JSONL user-agent rotation │
│ from local or ADLS path (skipped if text pre-populated) │
└───────────────────────────────────────┬─────────────────────────────┘
│ article text
┌─────────────────────┼──────────────────────┐
▼ ▼ ▼
protest domain drone domain (future domains)
┌──────────────────────────────────────────────────────────────────────┐
│ PER DOMAIN │
│ │
│ 2.5 Relevance filter 3. Translation │
│ DeBERTa NLI, threshold 0.30 langdetect + Google Translate │
│ (keyword fallback if offline) Native langs skip translation │
│ batched inference, size 32 │
│ │ │
│ ▼ │
│ 4. LLM Extraction 4.5 Geocoding │
│ Azure AI Foundry Nominatim OSM (free, no key) │
│ Codebook v2.3 in SYSTEM_PROMPT venue → city → region → country │
│ Few-shot examples in USER_PROMPT disk-cached per location string │
│ Prompt caching: ~36% cost saving up to 4 parallel workers │
│ │ │
│ ▼ │
│ 5. Storage │
│ data/raw/<domain>/ JSONL + CSV + run summary + dead-letter file │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 2 — PROCESS --stage process │
│ │
│ data/raw/<domain>/all_events.jsonl │
│ ├─► Geography filter (remove off-target countries) │
│ ├─► Deduplication (country + city fuzzy ±3 days + event type │
│ │ + TF-IDF claims similarity ≥0.20; null-city safe) │
│ ├─► LLM re-verification of medium/low confidence events │
│ └─► Quality control → data/processed/ │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 3 — PREDICT --stage predict │
│ │
│ data/processed/events_consolidated.jsonl │
│ ├─► Prevalence estimates per country + event type │
│ ├─► Prediction-Powered Inference (Angelopoulos et al. 2023) │
│ │ corrects for LLM misclassification in confidence intervals │
│ └─► data/predictions/ │
└──────────────────────────────────────────────────────────────────────┘
Code changes flow from a git push to a running scheduled job with no manual steps after initial setup.
git push origin main
│
▼
GitHub Actions (.github/workflows/docker.yml)
├── Build pea-pipeline image (Dockerfile)
├── Build pea-dashboard image (Dockerfile.web)
├── Push both images to ACR with :latest and :<sha> tags
└── Update pea-daily and pea-backfill jobs to the new image
│
▼
Azure Container Registry (ACR)
pea-pipeline:latest pea-pipeline:<sha>
│
├──────────────────────────────────────┐
▼ ▼
pea-daily pea-backfill
Trigger: cron 0 6 * * * (06:00 UTC) Trigger: manual
2 CPU / 4 GB RAM 4 CPU / 8 GB RAM
Replica timeout: 4 h Replica timeout: 24 h
Retry limit: 2 Retry limit: 1
(pass --args at trigger time)
Command:
--stage all
--countries NG,ZA,UG,DZ
--days 2
--resume
--upload-to abfss://pea-data/runs
│
▼
Azure Data Lake Storage Gen2 (abfss://pea-data/runs)
events_{run_id}.jsonl summary_{run_id}.json failures_{run_id}.jsonl
│
▼
Azure Monitor — scheduled query alert
Fires when Log Analytics sees "Pipeline failed"
Notifies: ALERT_EMAIL via action group
Secrets (Foundry API key, optional BBC credentials) are stored in Azure Key Vault and injected at runtime via a user-assigned managed identity — no credentials in environment variables on the Container Apps Job.
cp .env.example .env   # then fill in your Azure credentials

AZURE_FOUNDRY_API_KEY= # required
AZURE_OPENAI_ENDPOINT= # required e.g. https://<resource>.openai.azure.com/openai/v1
AZURE_STORAGE_CONNECTION_STRING= # optional needed for --upload-to abfss://...
AZURE_STORAGE_ACCOUNT_URL= # optional DFS endpoint (managed identity)
BBC_MONITORING_USER_NAME= # optional needed for --source bbc or both
BBC_MONITORING_USER_PASSWORD= # optional
WORLDNEWS_API_KEY= # optional needed for --source worldnews or all
python -m venv venv && source venv/bin/activate
pip install -r requirements-core.txt

# Minimal — South Africa, last 7 days
python -m src.acquisition.pipeline --countries ZA --days 7
# Multi-country, 30-day window
python -m src.acquisition.pipeline \
--model gpt-4.1 --countries NG,ZA,UG,DZ --days 30 --max-articles 200
# Multi-domain: scrape once, extract protest AND drone events
python -m src.acquisition.pipeline \
--domains protest,drone --countries NG,ZA,UG,DZ --days 7
# Full pipeline — acquire → process → predict
python -m src.acquisition.pipeline --stage all --countries ZA --days 30
# Use World News API as the source (requires WORLDNEWS_API_KEY)
python -m src.acquisition.pipeline --source worldnews --countries ZA --days 7
# Ingest a pre-scraped CSV file (skips scraper stage for these articles)
python -m src.acquisition.pipeline --source file --file-path /data/articles.csv --countries ZA
# Ingest pre-scraped articles from ADLS Gen2
python -m src.acquisition.pipeline --source file \
--file-path abfss://my-filesystem/input/articles.csv --countries ZA
# Resume after a crash (skips URLs already in checkpoint.txt)
python -m src.acquisition.pipeline --resume
# Upload outputs to ADLS Gen2 after run
python -m src.acquisition.pipeline --upload-to abfss://my-filesystem/pea/runs

Output is written to `data/raw/<domain>/` (domain defaults to `protest`).
| File | Stage | Contents |
|---|---|---|
| `data/raw/<domain>/events_{run_id}.jsonl` | acquire | Extracted events — primary output |
| `data/raw/<domain>/events_{run_id}.csv` | acquire | Same events, spreadsheet-friendly |
| `data/raw/<domain>/summary_{run_id}.json` | acquire | Run metadata: counts by country, type, turmoil level |
| `data/raw/<domain>/failures_{run_id}.jsonl` | acquire | Articles that failed all retries (dead-letter) |
| `data/raw/<domain>/all_events.jsonl` | acquire | Cumulative append across all runs |
| `data/raw/<domain>/checkpoint.txt` | acquire | Processed URLs — used by `--resume` |
| `data/processed/events_consolidated.jsonl` | process | Deduplicated, quality-controlled events |
| `data/processed/quality_report.json` | process | Schema validity + confidence distribution |
| `data/processed/duplicates_log.jsonl` | process | Audit trail of removed duplicates |
| `data/predictions/prevalence_estimates.json` | predict | PPI prevalence by event type and country |
| `data/predictions/confidence_breakdown.json` | predict | High/medium/low confidence distribution |
Run the two provisioning scripts once. After that, every push to main updates the running jobs automatically via GitHub Actions.
az login
az account set --subscription <your-subscription-id>
chmod +x infra/setup.sh && ./infra/setup.sh

setup.sh creates a resource group (pea-rg), an Azure Container Registry, and an ADLS Gen2 storage account (hierarchical namespace enabled) with a pea-outputs filesystem. It prints all values needed for the next steps.
In your repo: Settings → Secrets and variables → Actions
| Secret | Where to get it |
|---|---|
| `ACR_LOGIN_SERVER` | Printed by setup.sh |
| `ACR_USERNAME` | Printed by setup.sh |
| `ACR_PASSWORD` | Printed by setup.sh |
| `AZURE_CREDENTIALS` | `az ad sp create-for-rbac --sdk-auth` output |
| `AZURE_RESOURCE_GROUP` | `pea-rg` (default) |
Then push to main. GitHub Actions builds both Docker images and pushes them to ACR.
export ACR_NAME=<from-setup-output>
export STORAGE_ACCOUNT=<from-setup-output>
export ADLS_FILESYSTEM=pea-outputs # matches ADLS_FILESYSTEM in setup.sh
export AZURE_FOUNDRY_API_KEY=<your-foundry-key>
export ALERT_EMAIL=<your-email>
chmod +x infra/deploy.sh && ./infra/deploy.sh

deploy.sh creates:
| Resource | Purpose |
|---|---|
| Log Analytics workspace + Container Apps Environment | Runtime environment |
| Key Vault | Stores Foundry API key (+ optional BBC credentials) |
| User-assigned managed identity | Keyless auth to ACR, Key Vault, Blob Storage |
| `pea-daily` Container Apps Job | Cron `0 6 * * *` (06:00 UTC), 2 CPU / 4 GB; runs `--stage all --countries NG,ZA,UG,DZ --days 2 --resume --upload-to abfss://<filesystem>/runs` |
| `pea-backfill` Container Apps Job | Manual trigger, 4 CPU / 8 GB — pass `--args` at trigger time |
| Azure Monitor alert | Emails ALERT_EMAIL when "Pipeline failed" appears in logs |
# Trigger pea-daily immediately (outside its schedule)
az containerapp job start --name pea-daily --resource-group pea-rg
# Run a historical backfill
az containerapp job start --name pea-backfill --resource-group pea-rg \
--args "--stage" "all" "--countries" "NG,ZA,UG,DZ" \
"--backfill-from" "2024-01-01" "--backfill-to" "2024-12-31" \
"--backfill-window-days" "30" "--workers" "8" \
"--upload-to" "abfss://pea-data/backfill"
# Watch execution logs
az containerapp job execution list \
--name pea-daily --resource-group pea-rg --output table
# Deploy a code change — just push to main; GitHub Actions handles the rest
git push origin main

| Item | Required for |
|---|---|
| Azure Container Registry + GitHub Secrets | Docker CI workflow |
| Azure Storage Account (ADLS Gen2, HNS enabled) | `--upload-to abfss://...` |
| ACLED API token | Recall validation (`acled_validator.py` not yet built) |
| GLOCON data access | `glocon_validator.py` (applied 2026-04-05) |
src/acquisition/gdelt_discovery.py queries the GDELT DOC 2.0 API.
- **Per-country queries with FIPS codes.** The GDELT `sourcecountry` filter requires FIPS 10-4 codes. The pipeline runs one query per country (ISO2 → FIPS via `ISO2_TO_FIPS`) and merges results by URL.
- **Keyword-based relevance tagging.** After GDELT returns results, articles are tagged for relevance by checking titles and URLs against protest signals loaded from `configs/keywords.yaml`. All articles are passed to the next stage; the tag (`title_match`, `url_match`, `gdelt_theme`) is used for diagnostic logging downstream.
- **Per-country fallback.** If GDELT returns nothing for a given country + FIPS code, the pipeline retries without `sourcecountry` and injects the country name as a keyword.
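The per-country query construction and fallback described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `build_query`, `COUNTRY_NAMES`, and the parameter set are assumptions; only the `sourcecountry` filter, the ISO2 → FIPS mapping, and the fallback behaviour come from the text.

```python
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

# Hypothetical subset of the pipeline's ISO2 -> FIPS 10-4 mapping; note that
# FIPS codes differ from ISO2 (Nigeria is NI, South Africa is SF, Algeria is AG).
ISO2_TO_FIPS = {"NG": "NI", "ZA": "SF", "UG": "UG", "DZ": "AG"}
COUNTRY_NAMES = {"NG": "Nigeria", "ZA": "South Africa", "UG": "Uganda", "DZ": "Algeria"}

def build_query(iso2: str, keywords: str, timespan: str = "7d", fallback: bool = False) -> str:
    """One GDELT DOC 2.0 query URL per country. With fallback=True (used when
    a country + FIPS query returned nothing), drop sourcecountry and inject
    the country name as a keyword instead."""
    if fallback:
        query = f'{keywords} "{COUNTRY_NAMES[iso2]}"'
    else:
        query = f"{keywords} sourcecountry:{ISO2_TO_FIPS[iso2]}"
    params = {"query": query, "mode": "ArtList", "format": "json", "timespan": timespan}
    return f"{GDELT_DOC_API}?{urlencode(params)}"
```

Results from the per-country queries would then be merged by URL before scraping.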
Keywords are managed in configs/keywords.yaml:
- `protest_themes` — GDELT GKG themes used in the API query
- `protest_signals` — 39 multilingual title keywords (English, Arabic, French, Spanish, Indonesian)
- `url_signals` — URL substring fallback
src/acquisition/bbc_discovery.py queries BBC Monitoring using the Civil_unrest topic filter and ISO3 country codes. Requires BBC_MONITORING_USER_NAME and BBC_MONITORING_USER_PASSWORD.
Enable with --source bbc or --source both. When --source both, results are deduplicated by URL before scraping.
src/acquisition/worldnews_discovery.py queries the worldnewsapi.com search-news endpoint. One paginated query per country. Requires WORLDNEWS_API_KEY.
Enable with --source worldnews (standalone) or --source all (GDELT + BBC + World News). Results are deduplicated by URL with the other sources.
Rate limits: 60 req/min, 1 concurrent, 50 points/day on the free tier — suitable for targeted daily runs.
src/acquisition/file_discovery.py ingests pre-scraped articles from a local or ADLS Gen2 file, bypassing the scraper stage.
Use when: articles are already collected and stored as CSV/JSON (e.g., from a media monitoring service or internal corpus).
Enable with --source file --file-path <path>. Supported path formats:
- Local: `/path/to/articles.csv`, `./data/articles.jsonl`
- ADLS Gen2: `abfss://filesystem/path/to/articles.csv`
Required columns: url, title, text, date (YYYY-MM-DD), country (ISO2).
Since text is pre-populated, the scraper skips the network fetch for these articles. They proceed through the relevance filter, translation, and extraction stages normally.
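A sketch of the column validation this ingest path implies — fail fast on a malformed file rather than partway through a run. The function name and the exact checks are assumptions; the required-column list comes from the text above.

```python
import csv

REQUIRED_COLUMNS = {"url", "title", "text", "date", "country"}

def load_prescraped(fileobj) -> list[dict]:
    """Fail fast if the pre-scraped CSV lacks a required column, then keep
    only rows that actually carry a URL and text (so the scraper's network
    fetch can be skipped for them)."""
    reader = csv.DictReader(fileobj)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return [row for row in reader if row["url"] and row["text"]]
```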
src/acquisition/scraper.py fetches full article text.
- Primary: `newspaper3k`
- Fallback: `requests` + `BeautifulSoup`
- User-agent rotation to reduce bot detection
- Known paywall domains (NYT, FT, Reuters, etc.) are skipped gracefully
- Parallel scraping — up to 16 concurrent workers (configurable via `--scrape-workers`), with per-host politeness delays so no single domain is hit more than once per second
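The per-host politeness delay could be implemented along these lines (an illustrative sketch — the class name and structure are assumptions, not the scraper's actual code):

```python
import threading
import time
from urllib.parse import urlparse

class PerHostThrottle:
    """Ensure no single host is fetched more than once per `min_interval`
    seconds, even with many concurrent scrape workers."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last: dict[str, float] = {}   # host -> time its slot is taken until
        self._lock = threading.Lock()

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        with self._lock:
            now = time.monotonic()
            ready_at = self._last.get(host, 0.0) + self.min_interval
            delay = max(0.0, ready_at - now)
            self._last[host] = now + delay  # reserve the next slot for this host
        if delay:
            time.sleep(delay)
```

Each worker would call `throttle.wait(url)` immediately before fetching; the lock keeps slot reservation atomic across threads.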
src/acquisition/relevance_filter.py rejects off-topic articles before they reach the LLM, reducing cost and noise.
- Model: `cross-encoder/nli-deberta-v3-small` (184 MB, CPU-only)
- Default threshold: `0.30` — conservative, prioritises recall over precision
- Domain-aware: the hypothesis text changes per domain (`protest` vs `drone`)
- Fallback: keyword matching if `transformers` is unavailable
- In multi-domain mode, each domain runs its own filter independently against the shared scraped corpus
- Batched scoring — all articles scored in a single HuggingFace pipeline call (batch size 32 by default, `--relevance-batch-size`). Substantially faster than per-article inference on a CPU.
- Rejected articles are logged with their `_relevance_score` for threshold calibration
- Raise to `0.50` after GLOCON/ACLED validation confirms filter accuracy
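The keyword-fallback path (used when `transformers` is unavailable) reduces to something like this sketch. The function name and signal list are illustrative; the thresholding and `_relevance_score` logging behaviour come from the list above.

```python
# Illustrative subset of the multilingual protest signals in configs/keywords.yaml
SIGNALS = ["protest", "strike", "manifestation", "مظاهرة"]

def filter_articles(articles, threshold=0.30, signals=SIGNALS):
    """Keyword-fallback relevance pass: score 1.0 on any title hit, 0.0
    otherwise. Rejected articles keep their score for later threshold
    calibration. (The real filter scores NLI entailment in batches of 32.)"""
    kept, rejected = [], []
    for art in articles:
        score = 1.0 if any(s.lower() in art["title"].lower() for s in signals) else 0.0
        scored = {**art, "_relevance_score": score}
        (kept if score >= threshold else rejected).append(scored)
    return kept, rejected
```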
src/acquisition/translator.py detects language with langdetect and translates to English via Google Translate (free tier).
Languages handled natively (en, es, fr, pt, ar, sw, hi, ur, id, tl, bn, ha, yo, ig, am) skip translation to preserve fidelity and save time.
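The translation gate is simple enough to sketch directly. Here `detect` stands in for `langdetect.detect`, and the field names are assumptions; the native-language skip-list is the one above.

```python
# The 15 natively-handled languages listed above
NATIVE_LANGS = {"en", "es", "fr", "pt", "ar", "sw", "hi", "ur",
                "id", "tl", "bn", "ha", "yo", "ig", "am"}

def route_for_translation(article: dict, detect) -> dict:
    """Detect the article language; natively-handled languages pass through
    untouched to preserve fidelity, everything else is flagged for
    translation (the real code calls Google Translate at that point)."""
    lang = detect(article["text"])
    routed = {**article, "source_language": lang}
    if lang not in NATIVE_LANGS:
        routed["needs_translation"] = True
    return routed
```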
src/acquisition/extractor.py is the core of the pipeline.
LLM backend: Azure AI Foundry only. The --model flag sets the deployment name in your Azure AI Foundry project (default: gpt-4.1).
Two-step disqualifier gate:
1. The `SYSTEM_PROMPT` opens with an explicit list of non-protest article types to reject immediately (AU/SADC summits, election results, sports, parliamentary sessions, etc.). The model returns `[]` without extracting.
2. If the article passes the gate, minimum criteria are applied: at least one collective actor + one claim/grievance + one action.
Codebook-in-prompt architecture (v2.3):
The full codebook is injected into the system prompt at import time via _build_codebook_context(). This loads the appropriate domain codebook YAML and formats all event type definitions — including positive examples, boundary negatives, and decision rules. ~22k tokens. This approach yields +8–15 F1 points over brief label descriptions (Halterman & Keith 2025; arXiv:2502.16377).
| Config file | Domain |
|---|---|
| `configs/protest_codebook.yaml` | protest — 8 protest event types |
| `configs/drone_events_codebook.yaml` | drone — drone/UAV incident types |
Few-shot examples:
Gold-standard examples from the domain examples YAML are prepended to every user prompt:
- `configs/extraction_examples.yaml` (protest domain)
- `configs/drone_extraction_examples.yaml` (drone domain)
Examples are selected from a two-tier pool:
| Tier | YAML tag | Behaviour |
|---|---|---|
| Pinned | `pinned: true` | Always injected — the original handwritten curriculum examples, never evicted |
| Rotatable | (no tag) | Annotator-promoted corrections that rotate in by random sample each run |
The number of examples per run is controlled by --examples-sample-n (default 5). With only the 5 original pinned examples and no promoted ones, this is identical to the previous behaviour. As the promoted pool grows through the annotation workflow, raising --examples-sample-n above 5 causes promoted examples to start rotating in.
The random seed is resolved once per run from time.time_ns() and held fixed across all articles in that run. This keeps the per-run prompt prefix identical across every article, which is required for Azure prompt caching to hit reliably.
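The two-tier selection with a per-run seed can be sketched like this (names and signature are illustrative; the pinned-always, seeded-sample, and `time.time_ns()` behaviours come from the text above):

```python
import random
import time

def select_examples(examples, sample_n=5, seed=None):
    """Pinned examples are always included; remaining slots up to sample_n
    are filled by a seeded random sample of promoted (rotatable) examples.
    The seed is resolved once per run so every article in that run sees an
    identical prompt prefix — a prerequisite for Azure prompt cache hits."""
    if seed is None:
        seed = time.time_ns()   # resolved once per run, then held fixed
    pinned = [e for e in examples if e.get("pinned")]
    rotatable = [e for e in examples if not e.get("pinned")]
    slots = max(0, sample_n - len(pinned))
    rng = random.Random(seed)
    return pinned + rng.sample(rotatable, min(slots, len(rotatable)))
```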
Prompt caching: The system prompt prefix (~29k tokens) is identical across every article in a run. Azure caches it automatically for gpt-4.1 (>1024 token prefix). Cached tokens are billed at 50% of the input rate — approximately 36% overall cost reduction.
Concurrent workers: Use --workers N for parallel extraction during backfill runs. All workers share one system prompt so prompt caching is maximised. Use --rpm-limit to stay under your Azure quota (default: 450 RPM, ~10% headroom under a 500 RPM deployment limit).
Checkpoint / resume: Each successfully extracted article URL is appended to checkpoint.txt. Pass --resume to skip already-processed URLs after a crash.
Dead-letter file: Articles that fail all extraction retries are written to failures_{run_id}.jsonl.
src/acquisition/geocoder.py converts location fields to latitude/longitude using the Nominatim OSM API (free, no API key required).
Geocoding is attempted from most to least specific:
1. `venue + city + country` → `geo_accuracy: venue`
2. `city + country` → `geo_accuracy: city`
3. `region + country` → `geo_accuracy: region`
4. `country` → `geo_accuracy: country`
Nominatim rate limit (1 req/sec) is enforced across all workers. Results are cached to disk (data/cache/geocode_cache.json) so repeated location strings (e.g. "Johannesburg, South Africa") are resolved without a network round-trip. Skip geocoding entirely with --no-geocode.
Use --geocode-workers to parallelise dispatch; the default is 4 concurrent workers with shared rate-limit enforcement.
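The cascade plus cache can be sketched as follows. `lookup` stands in for a Nominatim query returning `(lat, lon)` or `None`, and `cache` for the on-disk JSON cache; the function name is an assumption, while the fallback order and caching-by-location-string come from the text above.

```python
def geocode_cascade(event: dict, lookup, cache: dict) -> dict:
    """Try location strings from most to least specific; cache every lookup
    so repeated strings never hit the network again. (The real code also
    enforces Nominatim's 1 req/sec limit across workers.)"""
    levels = [
        ("venue",   [event.get("venue"), event.get("city"), event.get("country")]),
        ("city",    [event.get("city"), event.get("country")]),
        ("region",  [event.get("region"), event.get("country")]),
        ("country", [event.get("country")]),
    ]
    for accuracy, parts in levels:
        if not all(parts):          # skip levels with missing components
            continue
        key = ", ".join(parts)
        if key not in cache:
            cache[key] = lookup(key)
        if cache[key]:
            lat, lon = cache[key]
            return {**event, "latitude": lat, "longitude": lon, "geo_accuracy": accuracy}
    return {**event, "latitude": None, "longitude": None, "geo_accuracy": None}
```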
src/acquisition/storage.py writes JSONL, CSV, and a run summary JSON. It also derives a turmoil_level field (high/medium/low) from event type and state response severity.
Optional cloud upload via --upload-to:
abfss://filesystem/prefix— Azure Data Lake Storage Gen2 (AZURE_STORAGE_CONNECTION_STRINGor managed identity viaAZURE_STORAGE_ACCOUNT_URLset to the DFS endpointhttps://<account>.dfs.core.windows.net)s3://bucket/prefix— AWS S3 (boto3)
src/acquisition/processing.py reads data/raw/<domain>/all_events.jsonl and produces a clean dataset.
- Geography filter — removes events outside the target country list
- Deduplication — same country + city fuzzy match ≥0.70 + date ±3 days + event type; city matching is only enforced when both cities are non-null (null-city safe); TF-IDF cosine similarity on claims ≥0.20 prevents same-city/same-day events with different demands from merging; `claims_similarity` recorded in `duplicates_log.jsonl`
- LLM re-verification — borderline medium/low confidence events re-examined with chain-of-thought prompting
- Quality control — schema validity checks and confidence distribution report
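The deduplication predicate can be sketched as below. This is illustrative: `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the module uses, and `claims_sim` is assumed to be the TF-IDF cosine similarity computed elsewhere; the thresholds and null-city rule come from the list above.

```python
from datetime import date
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, claims_sim: float) -> bool:
    """True when two events look like the same real-world event:
    same country + event type, dates within ±3 days, fuzzy city match
    >= 0.70 (only when both cities are present), claims similarity >= 0.20."""
    if a["country"] != b["country"] or a["event_type"] != b["event_type"]:
        return False
    delta = date.fromisoformat(a["event_date"]) - date.fromisoformat(b["event_date"])
    if abs(delta.days) > 3:
        return False
    # Null-city safe: only enforce city matching when both cities are non-null
    if a.get("city") and b.get("city"):
        if SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio() < 0.70:
            return False
    # Different demands on the same day in the same city stay separate
    return claims_sim >= 0.20
```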
src/acquisition/predictions.py applies Prediction-Powered Inference (Angelopoulos et al. 2023) to generate statistically valid prevalence estimates that account for LLM misclassification rates. Raw averaging of LLM predictions propagates classification error into the point estimate and is methodologically incorrect.
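The core PPI point estimate is compact enough to show. A minimal sketch, assuming a binary prevalence target: the naive mean of LLM predictions over the unlabeled pool plus a rectifier measured on the human-labeled subset (the actual module also constructs the corrected confidence intervals).

```python
def ppi_prevalence(preds_unlabeled, preds_labeled, gold_labeled):
    """Prediction-powered point estimate (Angelopoulos et al. 2023):
    naive mean over unlabeled predictions + mean(gold - prediction) on the
    labeled subset, which debiases the LLM's misclassification."""
    naive = sum(preds_unlabeled) / len(preds_unlabeled)
    rectifier = sum(y - f for y, f in zip(gold_labeled, preds_labeled)) / len(preds_labeled)
    return naive + rectifier
```

With no systematic error on the labeled subset the rectifier is zero and the estimate reduces to the naive mean.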
The --domains flag enables processing multiple event codebooks in a single invocation. Articles are scraped and translated once, then routed through each domain's relevance filter and extractor independently.
Shared: Discovery → Scraping → Translation
│
┌─────────┴──────────┐
▼ ▼
protest domain drone domain
2.5 Filter 2.5 Filter
4. Extraction 4. Extraction
4.5 Geocoding 4.5 Geocoding
5. Storage 5. Storage
data/raw/protest/ data/raw/drone/
An article can qualify for multiple domains — for example, a protest dispersed with a surveillance drone passes both relevance filters.
Supported domains:
| Domain | Codebook | Default query | --domains support |
|---|---|---|---|
| `protest` | `configs/protest_codebook.yaml` | protest demonstration strike rally march | Yes |
| `drone` | `configs/drone_events_codebook.yaml` | drone UAV airstrike unmanned aircraft | Yes |
| `violent_extremism` | `configs/violent_extremism_codebook.yaml` | — | Pending (use `--codebook` / `--examples` for now) |
Use --codebook and --examples to supply a custom codebook YAML when running a single domain. These flags are not supported in multi-domain mode.
Use --backfill-from / --backfill-to to run the pipeline over a historical date range, processing one window at a time. This bypasses GDELT's 1-year timespan ceiling by issuing one date-bounded query per window.
python -m src.acquisition.pipeline \
--domains protest \
--countries ZA,NG,UG,DZ \
--backfill-from 2024-01-01 \
--backfill-to 2024-12-31 \
--backfill-window-days 30 \
--workers 8 \
--rpm-limit 450 \
--upload-to abfss://my-filesystem/pea/backfill

- `--backfill-window-days` (default 30) controls how many days each GDELT query spans
- `--workers` parallelises extraction within each window; prompt caching is preserved because all workers share the same system prompt
- `--resume` skips URLs already in `checkpoint.txt`, making restarts safe
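The windowing logic implied by these flags can be sketched as follows (the function name and signature are assumptions; only the window semantics come from the flags):

```python
from datetime import date, timedelta

def backfill_windows(start: str, end: str, window_days: int = 30):
    """Yield consecutive (window_start, window_end) ISO-date pairs covering
    [start, end] — one date-bounded GDELT query per window, which is how
    the pipeline works around GDELT's 1-year timespan ceiling."""
    cur = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    while cur <= stop:
        win_end = min(cur + timedelta(days=window_days - 1), stop)
        yield cur.isoformat(), win_end.isoformat()
        cur = win_end + timedelta(days=1)
```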
Each extracted event is a JSON object with the following fields:
| Field | Type | Description |
|---|---|---|
| `event_date` | string \| null | ISO date of event (not article) |
| `country` | string | Country name |
| `city` | string \| null | City or settlement |
| `region` | string \| null | Province, state, or region |
| `venue` | string \| null | Specific location (stadium, university, etc.) |
| `location_notes` | string \| null | Additional location detail |
| `latitude` | float \| null | From geocoder |
| `longitude` | float \| null | From geocoder |
| `geo_accuracy` | string \| null | `venue`, `city`, `region`, or `country` |
| `event_type` | string | One of the codebook types (see below) |
| `organizer` | string \| null | Organising entity, or `unknown` if WhatsApp-mobilised |
| `participant_groups` | array | Groups involved (workers, students, residents, etc.) |
| `claims` | array | Demands or grievances |
| `crowd_size` | string \| null | Reported figure or range |
| `duration` | string \| null | Event duration |
| `state_response` | string \| null | Most severe state action (see vocabulary below) |
| `state_actors` | array | Police, military, etc. |
| `arrests` | string \| null | Number arrested |
| `fatalities` | string \| null | Number killed |
| `injuries` | string \| null | Number injured |
| `outcome` | string \| null | `ongoing`, `dispersed`, `partial_concession`, `full_concession`, `no_concession` |
| `outcome_notes` | string \| null | What was conceded or what happened |
| `article_title` | string \| null | Source article title |
| `article_url` | string \| null | Source article URL |
| `article_date` | string \| null | Article publication date |
| `source_country` | string \| null | Country of the news source |
| `source_language` | string | BCP-47 language code |
| `confidence` | string | `high`, `medium`, or `low` |
| Code | Description |
|---|---|
| `demonstration_march` | Organised march or outdoor gathering (permit or not) |
| `strike_boycott` | Work stoppage, general strike, consumer boycott |
| `occupation_seizure` | Sit-in, building occupation, road blockade |
| `confrontation` | Physical confrontation with police or opposing group; burning tyres/vehicles |
| `petition_signature` | Formal petition, open letter, or signature campaign |
| `vigil` | Candlelight vigil or symbolic silent gathering |
| `hunger_strike` | Individual or collective food refusal |
| `riot` | Widespread looting or violence against persons |
Standard: none, monitoring, dispersal, teargas, water_cannon, rubber_bullets, live_ammunition, arrests, ban, curfew
Extended (from criminalisation of protest literature): legal_criminalisation, anti_terrorism_designation, organisational_dissolution, non_association_bail
| File | Purpose |
|---|---|
| configs/protest_codebook.yaml | Codebook v2.3 — 8 protest event type definitions with positive/negative examples, decision rules, non-event disqualifiers, African context notes, edge cases, state response vocabulary, confidence guidance |
| configs/drone_events_codebook.yaml | Drone/UAV event codebook — parallel structure to protest codebook |
| configs/violent_extremism_codebook.yaml | Violent extremism codebook v1.0 — GTD-aligned attack type taxonomy (assassination, bombing, armed assault, etc.); mutually exclusive with the protest codebook. Use with --codebook / --examples until the domain is wired into DOMAIN_CONFIGS. |
| configs/extraction_examples.yaml | Gold-standard few-shot examples for protest extraction (5 pinned; promoted entries appended by annotation workflow) |
| configs/drone_extraction_examples.yaml | Few-shot examples for drone extraction |
| configs/violent_extremism_extraction_examples.yaml | Few-shot examples for the violent extremism domain (4 pinned entries covering bombing, armed assault, kidnapping, ambiguous criminal/VE cases) |
| configs/keywords.yaml | GDELT GKG themes, protest signal keywords (39, multilingual), URL signals |
| configs/countries.yaml | Single source of truth for 48 countries (34 Africa target + 14 others) — ISO2, ISO3, FIPS, name, aliases. Loaded by src/constants.py; powers GDELT queries, BBC Monitoring, and the web dashboard. |
| File | When to use |
|---|---|
| configs/extraction_examples_NEW_template.yaml | Add new few-shot examples for codebook v2.4. Contains worked examples and stubs with TODO markers. Quality bar: real articles, full schema, one new type or country per entry, boundary_case field required. Paste finished entries into configs/extraction_examples.yaml with pinned: true. |
| configs/protest_codebook_v24_additions_template.yaml | Six targeted decision-rule changes to apply to protest_codebook.yaml to produce v2.4: operational threshold for riot vs confrontation, vigil distinction, civic-space modifiers, and others. Apply each change and bump metadata.version to "2.4". |
docker build -t pea .
docker run --env-file .env pea --countries ZA --days 7

The multi-stage Dockerfile uses requirements-core.txt (pipeline dependencies only). ML packages (torch, transformers) are in requirements.txt for local development.

> **Note:** `torch` in `requirements-core.txt` currently pulls the CUDA build. For production, pin to a CPU wheel URL in the Dockerfile to avoid a 2 GB image layer.
Label Studio is used to review LLM-extracted events, correct them, and feed the corrections back into the extraction prompt as new few-shot examples. Each annotation session directly improves the quality of the next pipeline run — the loop is now fully closed.
Goal: 200+ reviewed gold pairs to unlock QLoRA fine-tuning. The annotation workflow builds this dataset incrementally, one batch at a time.
Label Studio provides a structured review interface where every extracted event is presented alongside its source article. The annotator answers four targeted questions per task, and all corrections are exported as structured JSON that the import script can consume automatically.
The interface is pre-configured for this pipeline via src/annotation/labeling_config.xml — you do not need to design the interface yourself.
# Start Label Studio
docker compose -f docker-compose.annotation.yml up -d
# Opens at http://localhost:8080

1. Go to http://localhost:8080 and create an account
2. Create a new project — name it "PEA Protest Events"
3. Settings → Labeling Interface → Code tab
4. Paste the full contents of src/annotation/labeling_config.xml
5. Save
The labeling interface shows the source article on the left and the LLM's extracted event JSON on the right.
# Step 1 — Export the highest-priority events from the latest pipeline run
python -m src.annotation.export_for_annotation \
--events data/raw/protest/all_events.jsonl \
--output data/annotation/tasks_$(date +%Y%m%d).json \
--max-tasks 50 \
--tiers 1,2

# Step 2 — In Label Studio:
# Import → upload the tasks JSON file
# Annotate each task (target: ~2 min per task)
# Export → JSON → save to data/annotation/label_studio_export.json
# Step 3 — Import corrections back and promote top corrections into the
# few-shot examples pool so the next run benefits immediately
python -m src.annotation.import_annotations \
--annotations data/annotation/label_studio_export.json \
--output-dir data/annotation/ \
--promote-to-examples 3

The `--promote-to-examples 3` flag is the key step that closes the loop. See Promotion below.
Each Label Studio task presents the source article text alongside the LLM-extracted JSON. You answer four questions:
1. Is this a genuine protest event?
The most important check. If the article is not about a protest event at all (election result, parliamentary debate, crime report, sports), mark it as a false positive. These events are filtered out of reviewed_events.jsonl and excluded from training data entirely — they are the most damaging failures because they silently inflate event counts.
2. Is the event type correct?
Check the event_type field against the eight codebook types. The most common errors are:
- `confrontation` vs `riot` (riot requires looting or violence against persons; burning tyres is confrontation)
- `demonstration_march` vs `occupation_seizure` (sit-ins and road blockades are occupations)
- `petition_signature` extracted from an article that describes a march that also presented a petition (classify by the primary action)
If the type is wrong, select the correct one from the dropdown. Type corrections are ranked highest for promotion.
3. Is the confidence level appropriate?
high = article explicitly describes all three minimum criteria (collective actor + claim/grievance + action).
medium = one criterion is implied or uncertain.
low = substantial ambiguity — the article might be describing a planned or rumoured event.
Correct this if the LLM under- or over-stated certainty.
4. Are there specific extraction errors?
Use the free-text field to flag field-level problems: wrong city, missing organiser, incorrect crowd size, misidentified state response, etc. These are recorded in annotation_stats.json and inform threshold calibration, but the field-level correction itself is not currently auto-applied — it informs manual codebook improvement.
# Tier 1 + 2 (recommended for most batches)
python -m src.annotation.export_for_annotation \
--events data/raw/protest/all_events.jsonl \
--output data/annotation/tasks.json \
--max-tasks 50 \
--tiers 1,2
# Tier 3 only (spot-check precision on high-confidence events)
python -m src.annotation.export_for_annotation \
--events data/raw/protest/all_events.jsonl \
--output data/annotation/spot_check.json \
--max-tasks 20 \
--tiers 3

| Tier | Condition | Why annotate |
|---|---|---|
| 1 | Low confidence + high relevance score | Uncertain but the relevance filter says it's probably real — highest misclassification risk, most training value per hour |
| 2 | Medium confidence | Borderline — most F1 improvement per annotation hour |
| 3 (10% spot-check) | High confidence | Precision monitoring — catch systematic over-extraction without annotating every event |
After annotation, run import_annotations with --promote-to-examples N. This appends up to N annotator-corrected events to configs/extraction_examples.yaml as new few-shot entries, ranked by likely teaching value:
- Type-corrected events first — the LLM had the wrong event type; the correction directly teaches the boundary the model is getting wrong
- Extraction-error events second — the annotator flagged field-level problems; useful for schema fidelity
- Longest article text third — all else equal, richer context makes a better example
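The ranking above maps naturally onto a sort key. A sketch, assuming hypothetical field names (`type_corrected`, `had_extraction_errors`, `article_text`) on each candidate entry:

```python
def promotion_rank_key(entry: dict):
    """Sort key mirroring the promotion ranking: type-corrected events
    first, then events with flagged extraction errors, then longer
    articles. `not` flips the booleans so True-valued flags sort first."""
    return (
        not entry.get("type_corrected", False),
        not entry.get("had_extraction_errors", False),
        -len(entry.get("article_text", "")),
    )

# Usage: candidates sorted ascending by this key are in promotion order.
# top_n = sorted(candidates, key=promotion_rank_key)[:n]
```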
Each promoted entry is tagged with full provenance:
provenance:
source: label_studio
task_id: https://www.example.com/article-url
annotator_id: user@example.com
date_promoted: 2026-04-22T18:52:57
type_corrected: true
  had_extraction_errors: false

De-duplication is by `task_id` (article URL), so re-running promotion after a second annotation batch never adds duplicates.
Recommended cadence: promote 2–5 examples per batch. More than that and the pool grows faster than it rotates through the prompt, so individual corrections take longer to reach the model.
The examples file contains two tiers of entries:
| YAML tag | Behaviour |
|---|---|
| `pinned: true` | Always injected into every run — the five original handwritten examples |
| (no tag) | Promoted corrections; rotated in by random sample across runs |
The --examples-sample-n flag (default 5) sets the total number of examples per run. With only pinned examples and no promoted ones, this preserves the previous behaviour exactly.
Once the promoted pool grows, raise --examples-sample-n above 5 to start rotating promoted examples into the prompt:
# Inject 5 pinned + up to 3 promoted examples per run
python -m src.acquisition.pipeline \
--countries ZA --days 7 \
  --examples-sample-n 8

The rotation seed is resolved once per run and held fixed for all articles. This is important: every article in a given run sees the same example set, which keeps the system-prompt prefix byte-for-byte identical across the run and preserves the Azure prompt cache hit rate.
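The run-stable rotation can be sketched as follows. This is an illustrative version, not the actual `_build_few_shot_examples()`:

```python
import random

def build_example_set(pinned: list, promoted: list,
                      sample_n: int, run_seed: int) -> list:
    """Assemble the few-shot example set for one pipeline run.

    The RNG is seeded once per run, so every article in the run sees
    the same examples and the system-prompt prefix stays byte-for-byte
    identical (preserving the Azure prompt cache hit rate). Sketch
    only -- the real extractor may differ in detail.
    """
    slots = max(0, sample_n - len(pinned))      # pinned always included
    rng = random.Random(run_seed)               # fixed for the whole run
    rotated = rng.sample(promoted, min(slots, len(promoted)))
    return list(pinned) + rotated
```

With an empty promoted pool and the default `sample_n=5`, this degenerates to the five pinned examples, matching the documented behaviour.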
All written to data/annotation/:
| File | Contents |
|---|---|
| `reviewed_events.jsonl` | All reviewed events with human corrections applied |
| `training_data.jsonl` | Gold (article text → corrected JSON) pairs for fine-tuning |
| `annotation_stats.json` | False positive rate, type correction rate, running pair count toward 200 |
annotation_stats.json is the main health indicator for the annotation effort. Monitor false_positive_rate (should be < 5% at high confidence), type_correction_rate (high rates indicate codebook boundary ambiguity), and training_pairs (target: 200 before QLoRA fine-tuning).
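The monitoring rules above can be wired into a small health check. The stats-file field names are assumptions about its layout; the thresholds follow the guidance in this section:

```python
import json
from pathlib import Path

def annotation_health(stats_path: str) -> list[str]:
    """Flag annotation-quality signals from annotation_stats.json.

    Field names (false_positive_rate, type_correction_rate,
    training_pairs) are assumed; adjust to the actual file layout.
    """
    stats = json.loads(Path(stats_path).read_text())
    messages = []
    if stats.get("false_positive_rate", 0.0) >= 0.05:
        messages.append("FP rate >= 5% at high confidence")
    if stats.get("type_correction_rate", 0.0) > 0.20:
        messages.append("high type-correction rate: check codebook boundaries")
    if stats.get("training_pairs", 0) >= 200:
        messages.append("200 training pairs reached: ready for QLoRA fine-tuning")
    return messages
```

The 20% type-correction alarm is an assumed threshold; the < 5% FP target and the 200-pair goal come from the text above.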
Pipeline run
│
▼
data/raw/protest/all_events.jsonl
│
├──► export_for_annotation (tier 1+2 priority)
│ │
│ ▼
│ Label Studio
│ (annotate: FP? type? confidence? errors?)
│ │
│ ▼
│ label_studio_export.json
│ │
│ ▼
│ import_annotations --promote-to-examples 3
│ │
│ ├──► data/annotation/reviewed_events.jsonl
│ ├──► data/annotation/training_data.jsonl → future QLoRA fine-tuning
│ ├──► data/annotation/annotation_stats.json
│ │
│ └──► configs/extraction_examples.yaml ◄──────────────────┐
│ (promoted corrections appended) │
│ │
└──► Next pipeline run │
│ │
└──► extractor.py: _build_few_shot_examples() │
pinned (always) + rotatable sample ──────────────┘
(--examples-sample-n controls pool size)
Each annotation session improves the few-shot pool that the next extraction run draws from. The improvement compounds: better examples → fewer type errors → fewer tier-1/2 events exported for annotation → faster future batches.
tests/fixtures/test_set_v1.json is a 20-article ground-truth evaluation set for measuring pipeline quality before and after codebook or prompt changes.
Structure: 5 articles per target country (NG, ZA, UG, DZ). Each country block contains:
| Slot | Category | What to expect from the pipeline |
|---|---|---|
| `_01`, `_02` | clear_positive | High-confidence event with named organiser, explicit claims, no ambiguity — pipeline should return ≥1 event |
| `_03`, `_04` | borderline | Ambiguous framing, implied criteria, or mixed signals — tests confidence calibration |
| `_05` | non_event | Article about an election result, parliamentary session, court ruling, or economic report — pipeline should return [] |
Status: The fixture ships as a stub — article.text, article.url, and ground_truth.events fields are "TODO". Fill each slot with a real article following the _meta.instructions in the file, hand-coding ground_truth against protest_codebook.yaml before running the pipeline.
Evaluation dimensions tracked:
| Metric | Definition |
|---|---|
| `extraction_rate` | % of clear_positive articles where the pipeline returns ≥1 event |
| `rejection_rate` | % of non_event articles where the pipeline correctly returns [] |
| `type_accuracy` | % of extracted events where event_type matches the ground truth |
| `false_negative_rate` | % of ground-truth events missing entirely from pipeline output |
| `confidence_calibration` | Mean pipeline confidence for each category — clear_positive should be highest |
Codebook version: v2.4. If you complete the test set before applying the v2.4 template changes, code against v2.3 and record the discrepancy in coding_notes.
Benchmarks pipeline outputs against human-coded gold-standard datasets. All validators are automated — no manual annotation required.
| Dataset | What it validates | Status |
|---|---|---|
| CEHA | Relevance filter F1 on African conflict text | Available — CEHA/data/CEHA_dataset.csv |
| CASE 2021 Task 2 | Relevance filter + event type classification | Available — CASE2021/task2/test_dataset/ |
| GLOCON GSC | Full end-to-end recall (event matching) | Pending data access (applied 2026-04-05) |
| ACLED | Full end-to-end recall (multi-country) | Pending API token |
Conflict Events in the Horn of Africa (Bai et al. 2025, COLING). 500 expert-annotated event descriptions from ACLED and GDELT covering Sudan, Ethiopia, Somalia, South Sudan, Uganda, and Kenya. Binary relevance labels (Yes/No) with a held-out test split of 250 items.
Use this to calibrate the relevance filter threshold. CEHA event types (tribal conflict, religious conflict, SGBV, climate security) do not map to PEA's protest typology — do not use it for event type validation.
# Clone the dataset (once)
git clone https://github.com/dataminr-ai/CEHA CEHA
# Evaluate on the held-out test split (250 items)
python -m src.validation.ceha_validator \
--ceha-csv CEHA/data/CEHA_dataset.csv \
--split test \
--output data/validation/ceha_report.json
# Sweep thresholds to find the best F1 operating point
python -m src.validation.ceha_validator \
--ceha-csv CEHA/data/CEHA_dataset.csv \
--sweep-thresholds \
--output data/validation/ceha_sweep.json
# Keyword-fallback only (no model download required)
python -m src.validation.ceha_validator \
--ceha-csv CEHA/data/CEHA_dataset.csv \
--no-model \
  --output data/validation/ceha_report_keyword.json

Interpreting results: Bai et al. report ~0.70 F1 for zero-shot LLMs. The DeBERTa NLI filter is not fine-tuned on African conflict text, so the gap to this baseline indicates how much fine-tuning would help. Recall matters more than precision here — the pipeline is designed for high recall at the default threshold of 0.30.
The report breaks down F1 by country, by source (ACLED vs GDELT), and by CEHA event type (ethnic/communal, religious, SGBV, climate security). The threshold_sweep output maps each threshold to its F1 — use this to choose the operating point before raising --relevance-threshold on production runs.
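The threshold sweep reduces to computing F1 at each candidate cut-off and taking the argmax. A minimal sketch of that computation (not the validator's actual code):

```python
def sweep_thresholds(scores: list[float], labels: list[int],
                     thresholds: list[float]) -> tuple[dict, float]:
    """Map each relevance threshold to its F1 and return the best one.

    scores: per-item relevance scores in [0, 1]
    labels: binary gold relevance labels (1 = relevant)
    """
    def f1_at(t: float) -> float:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    sweep = {t: f1_at(t) for t in thresholds}
    best = max(sweep, key=sweep.get)   # operating point with highest F1
    return sweep, best
```
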
CASE 2021 Shared Task 2 (Hürriyetoğlu et al. 2021, ACL-IJCNLP). 1,019 event snippets with 30 fine-grained event type labels. The test set is fully public — no access request required.
Five SubTypes map directly to PEA protest types: PEACE_PROTEST, VIOL_DEMONSTR, FORCE_AGAINST_PROTEST, PROTEST_WITH_INTER, and MOB_VIOL (172 protest-relevant items out of 1,019 total). Non-protest types (air strikes, natural disasters, armed clashes, etc.) serve as negative examples for the relevance classifier.
# Clone the dataset (once)
git clone https://github.com/emerging-welfare/case-2021-shared-task CASE2021
# Relevance mode — offline, no API key needed
# Tests whether the relevance filter correctly identifies protest snippets
python -m src.validation.case2021_validator \
--case-tsv CASE2021/task2/test_dataset/test_set_final_release_with_labels.tsv \
--mode relevance \
--output data/validation/case2021_relevance_report.json
# Extraction mode — tests event_type classification on the 172 protest snippets
# Requires LLM API key
python -m src.validation.case2021_validator \
--case-tsv CASE2021/task2/test_dataset/test_set_final_release_with_labels.tsv \
--mode extraction \
--provider azure \
--llm-model gpt-4o-mini \
  --output data/validation/case2021_extraction_report.json

CASE SubType → PEA event_type crosswalk:
| CASE SubType | PEA event_type |
|---|---|
| PEACE_PROTEST | demonstration_march |
| VIOL_DEMONSTR | riot |
| PROTEST_WITH_INTER | confrontation |
| FORCE_AGAINST_PROTEST | demonstration_march |
| MOB_VIOL | riot |
Interpreting results — relevance mode: F1 measures binary protest/not-protest discrimination across all 1,019 snippets. The by_subtype field shows recall separately for each protest SubType and false-positive rates for non-protest types. A high false-positive rate on ARMED_CLASH or ATTACK (adjacent violence types) is expected and acceptable; a high rate on AGREEMENT or NATURAL_DISASTER indicates the filter is too permissive.
Interpreting results — extraction mode: Accuracy is measured on the 172 protest-relevant snippets only. Snippets are short (one or two sentences) which is not the pipeline's typical input — expect lower accuracy here than on full articles.
Status: Data access application submitted 2026-04-05. Access pending.
The gold standard for this pipeline. The South Africa English subset provides token-level protest event annotation with IAA Krippendorff α = 0.35–0.60 by argument type, directly setting the human ceiling for confidence threshold calibration.
# Download dataset (requires approval from the emerging-welfare team)
git clone https://github.com/emerging-welfare/glocon-dataset ~/datasets/glocon
# Basic run
python -m src.validation.glocon_validator \
--glocon-dir ~/datasets/glocon/data/south_africa/english \
--pea-events data/processed/events_consolidated.jsonl \
--start-date 2024-01-01 --end-date 2024-12-31 \
--countries ZA \
--output data/validation/recall_report_glocon.json
# Diagnose misses if recall < 60%
python -m src.validation.glocon_validator \
--glocon-dir ~/datasets/glocon/data/south_africa/english \
--pea-events data/processed/events_consolidated.jsonl \
--start-date 2024-01-01 --end-date 2024-12-31 \
--countries ZA --show-misses \
--output data/validation/recall_report_glocon_misses.json
# High-confidence events only
python -m src.validation.glocon_validator \
--glocon-dir ~/datasets/glocon/data/south_africa/english \
--pea-events data/processed/events_consolidated.jsonl \
--min-confidence high \
  --output data/validation/recall_report_glocon_highconf.json

A PEA event is counted as matching a GLOCON event when all four criteria are met: same country, date within ±3 days, location fuzzy similarity ≥ 0.60, and event type in the same broad category (protest / strike / riot). The --show-misses flag prints each unmatched GLOCON event alongside the nearest PEA candidate and the specific reason it failed to match (date_too_far, location_mismatch, type_mismatch).
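The four criteria can be sketched as a predicate. This is illustrative: `difflib.SequenceMatcher` stands in for the validator's actual fuzzy matcher, and the type-to-category mapping is an assumed partial version:

```python
from datetime import date
from difflib import SequenceMatcher

# Assumed (partial) event_type -> broad category mapping.
BROAD_CATEGORY = {
    "demonstration_march": "protest",
    "confrontation": "protest",
    "strike": "strike",
    "riot": "riot",
}

def events_match(pea: dict, glocon: dict) -> bool:
    """True when all four GLOCON matching criteria hold: same country,
    date within +/-3 days, location similarity >= 0.60, and event type
    in the same broad category."""
    if pea["country"] != glocon["country"]:
        return False
    if abs((pea["date"] - glocon["date"]).days) > 3:
        return False
    sim = SequenceMatcher(None, pea["location"].lower(),
                          glocon["location"].lower()).ratio()
    if sim < 0.60:
        return False
    return (BROAD_CATEGORY.get(pea["event_type"])
            == BROAD_CATEGORY.get(glocon["event_type"]))
```

Failing criteria map directly onto the miss reasons that --show-misses reports (date_too_far, location_mismatch, type_mismatch).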
Recall targets:
| Recall | Interpretation |
|---|---|
| ≥ 60% | Acceptable for a GDELT-sourced pipeline |
| 40–60% | Investigate systematic misses by type and country |
| < 40% | Diagnose stage by stage: GDELT discovery → scraper → relevance filter → LLM |
Status: `acled_validator.py` not yet built — blocked on obtaining an ACLED API token. Register at acleddata.com for a free researcher token.
Pipeline run → data/raw/protest/all_events.jsonl
│
┌─────────────┴──────────────┐
│ │
export_for_annotation Stage 2 processing
(tier 1+2 priority) │
│ events_consolidated.jsonl
Label Studio │
(annotate) ┌──────────┴─────────────────────────┐
│ │ │
import_annotations glocon_validator ceha_validator
--promote-to- (pending access) case2021_validator
examples recall vs relevance filter F1
│ GLOCON gold event type accuracy
│
├─► training_data.jsonl (→ fine-tuning)
└─► configs/extraction_examples.yaml (→ next run's few-shot pool)
python -m src.acquisition.pipeline [OPTIONS]
Discovery & scope:
--query TEXT Keywords for GDELT/World News API (space-separated)
--countries TEXT ISO2 codes, comma-separated [default: NG,ZA,UG,DZ]
--days INT Lookback window in days [default: 7]
--max-articles INT Article cap [default: 50]
--source TEXT gdelt | bbc | worldnews | file | both | all [default: gdelt]
both = gdelt + bbc; all = gdelt + bbc + worldnews
--file-path TEXT Path to pre-scraped articles file (required for --source file)
Local: /path/to/articles.csv
ADLS: abfss://filesystem/path/to/articles.csv
Required columns: url, title, text, date, country
Domain & codebook:
--domains TEXT Comma-separated codebook domains [default: protest]
Use 'protest,drone' for multi-domain mode
--codebook PATH Custom codebook YAML (single-domain only)
--examples PATH Custom examples YAML (single-domain only)
LLM backend (Azure AI Foundry only):
--provider TEXT azure [only supported value]
--model TEXT Deployment name [default: gpt-4.1]
--api-key TEXT Override AZURE_FOUNDRY_API_KEY env var
Pipeline control:
--stage TEXT acquire | process | predict | all [default: acquire]
--no-translate Skip translation step
--no-geocode Skip Nominatim geocoding
--resume Skip already-processed URLs (reads checkpoint.txt)
--relevance-threshold FLOAT Minimum NLI score to pass to LLM [default: 0.30]
Few-shot examples:
--examples-sample-n INT Total examples injected per run [default: 5]
Pinned examples always included; remaining slots
filled by run-stable random sample from the
promoted pool. Raise above 5 once annotation
promotion has grown the pool.
Concurrency & rate limiting:
--workers INT Concurrent extraction workers [default: 4]
--rpm-limit INT Azure OpenAI RPM ceiling [default: 450]
--scrape-workers INT Parallel scrape workers [default: 16]
--geocode-workers INT Parallel geocode workers [default: 4]
--relevance-batch-size INT NLI inference batch size [default: 32]
Historical backfill:
--backfill-from DATE Start date: YYYY-MM-DD
--backfill-to DATE End date: YYYY-MM-DD [default: today]
--backfill-window-days INT Days per GDELT query window [default: 30]
Output & storage:
--output-dir PATH Output directory [default: data/raw/]
--upload-to TEXT abfss://filesystem/prefix (ADLS Gen2) or s3://bucket/prefix
This pipeline implements the Halterman & Keith (2025) framework for LLM-based protest event coding:
- Type III (Stipulative) definitions — codebook definitions are precise enough that reasonable annotators would reach the same decision. Vague definitions cannot be compensated for by larger models.
- Codebook-in-prompt — full annotation guidelines injected into the system prompt. Validated to yield +8–15 F1 points over brief label descriptions.
- Prediction-Powered Inference (Angelopoulos et al. 2023) — prevalence estimates account for LLM misclassification rates with valid confidence intervals. Raw averaging of LLM predictions is methodologically incorrect.
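The core PPI point estimate is simple: the mean LLM prediction over the full corpus, debiased by the mean rectifier (gold minus prediction) on the human-labeled subset. A minimal sketch of that estimator, under the assumption that predictions and labels are event-level prevalence indicators (the full method also yields valid confidence intervals, omitted here):

```python
def ppi_prevalence(preds_unlabeled: list[float],
                   preds_labeled: list[float],
                   gold_labels: list[float]) -> float:
    """Prediction-powered prevalence point estimate
    (Angelopoulos et al. 2023), sketch only.

    preds_unlabeled: LLM predictions on the full (unlabeled) corpus
    preds_labeled:   LLM predictions on the human-annotated subset
    gold_labels:     human labels on that same subset
    """
    mean_pred = sum(preds_unlabeled) / len(preds_unlabeled)
    # Rectifier: average bias of the LLM on the labeled subset.
    rectifier = sum(y - f for y, f in zip(gold_labels, preds_labeled)) \
        / len(preds_labeled)
    return mean_pred + rectifier
```

This is exactly why raw averaging of LLM predictions is methodologically incorrect: without the rectifier term, any systematic over- or under-prediction propagates straight into the prevalence estimate.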