# 🧾 Committee Metadata Normalisation & Integration Pipeline

This notebook processes scraped meeting metadata and aligns it with a canonical list of council committees.

It handles noisy, inconsistent, or legacy committee names using:
- Custom slugification (`slugify`) to generate consistent committee IDs
- Forced mapping of known problematic names (`force_map_committee_name`)
- Alias resolution (`committee_alias_map.csv`) for renamed or merged committees
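
As a rough illustration of alias resolution, the sketch below parses an in-memory CSV with the two columns the loader reads (`alias_committee_id`, `canonical_committee_id`) and redirects a slug through it. The sample rows and the helper names `load_alias_map` and `resolve` are assumptions made for this example; they are not taken from the real `committee_alias_map.csv`.

```python
import csv
import io

# Hypothetical alias rows; the real file is committee_alias_map.csv.
SAMPLE_ALIAS_CSV = """alias_committee_id,canonical_committee_id
adult-social-care-and-health-cabinet,adult-social-care-cabinet
social-care-and-public-health-cabinet,adult-social-care-cabinet
"""

def load_alias_map(handle):
    """Build {alias_id: canonical_id} from a CSV file-like object."""
    return {
        row["alias_committee_id"]: row["canonical_committee_id"]
        for row in csv.DictReader(handle)
    }

def resolve(slug, alias_map):
    """Return the canonical ID for a slug, or the slug itself if it has no alias."""
    return alias_map.get(slug, slug)

alias_map = load_alias_map(io.StringIO(SAMPLE_ALIAS_CSV))
print(resolve("adult-social-care-and-health-cabinet", alias_map))
print(resolve("access-joint", alias_map))
```

A slug with no alias row passes through unchanged, which is why the lookup uses `dict.get` with the slug as its own default.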

---

## 🔁 Workflow Summary

### 1. **Load Meeting Metadata**
- Read `meetings_metadata.jsonl`
- Clean and normalise `committee_name` using forced mapping (`force_map_committee_name`)
- Generate `committee_id` using `slugify()`
- Redirect via alias map if needed
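
To make the ID generation concrete, the snippet below repeats the steps of the notebook's `slugify` helper and applies it to a few names. The first two names and their IDs appear in the notebook's own output; the third name is a made-up example that shows how possessives are folded.

```python
import re

def slugify(name):
    """Lowercase, drop 'committee', strip punctuation, hyphenate."""
    name = name.lower().replace("’", "'")      # normalise curly apostrophes
    name = re.sub(r"'s\b", "s", name)           # "children's" -> "childrens"
    name = re.sub(r"\bcommittee\b", "", name)   # drop the word "committee"
    name = re.sub(r"[^\w\s]", "", name)         # remove remaining punctuation
    name = re.sub(r"\s+", "-", name.strip())    # whitespace runs -> one hyphen
    return re.sub(r"-+", "-", name).strip("-")

examples = [
    "ACCESS Joint Committee",
    "Kent & Medway Economic Partnership (KMEP)",
    "Children’s Services Committee",  # hypothetical name
]
for raw in examples:
    print(f"{raw!r} -> {slugify(raw)!r}")
```

Because the word "committee" is removed before hyphenation, names that differ only by that word map to the same ID.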

### 2. **Aggregate Committee Stats**
- Group meetings by `committee_id`
- Count total meetings
- Record first and last meeting dates
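
The aggregation can be sketched with the standard library alone. The meeting pairs below are toy data. Since meeting dates are ISO 8601 strings (`YYYY-MM-DD`), `min` and `max` on the raw strings already give the chronologically first and last dates.

```python
from collections import defaultdict

# Toy (committee_id, meeting_date) pairs, not real data.
meetings = [
    ("access-joint", "2017-07-31"),
    ("forum", "2014-06-23"),
    ("access-joint", "2025-03-24"),
    ("access-joint", "2019-11-05"),
]

dates_by_committee = defaultdict(list)
for committee_id, meeting_date in meetings:
    dates_by_committee[committee_id].append(meeting_date)

summary = {
    cid: {
        "first_meeting_date": min(dates),   # string comparison is chronological for ISO dates
        "last_meeting_date": max(dates),
        "meeting_count": len(dates),
    }
    for cid, dates in dates_by_committee.items()
}
print(summary["access-joint"])
```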

### 3. **Load Canonical Metadata**
- Read the canonical `../data/metadata/committees.jsonl` (produced by HTML parsing)
- Merge in meeting stats where committee already exists
- Preserve canonical metadata fields: `aliases`, `status`, etc.
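
A minimal sketch of the merge, assuming both sides are dictionaries keyed by `committee_id`. `dict.update` overwrites only the keys present in the stats record, so canonical fields such as `status` and `aliases` survive. The records are illustrative.

```python
# Illustrative canonical entry and meeting stats (not the full data).
canonical = {
    "adult-social-care-cabinet": {
        "committee_id": "adult-social-care-cabinet",
        "canonical_name": "Adult Social Care Cabinet Committee",
        "status": "active",
        "aliases": ["Adult Social Care and Health Cabinet Committee"],
    }
}
stats = [
    {"committee_id": "adult-social-care-cabinet", "meeting_count": 68},
    {"committee_id": "access-joint", "meeting_count": 37},
]

for record in stats:
    cid = record["committee_id"]
    if cid in canonical:
        canonical[cid].update(record)   # add stats, keep canonical fields
    else:
        canonical[cid] = dict(record)   # committee seen only in meeting data

print(sorted(canonical))
```

Committees that enter through the `else` branch carry no `status` or `aliases`, which is what the next step picks up as extras.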

### 4. **Identify Extras**
- Detect committees that appear in the meeting data but not in the canonical metadata
- Store them in `extras_df` for manual inspection
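
Detecting extras amounts to a membership test against the canonical IDs. The sketch below uses plain dictionaries instead of DataFrames; `find_extras` is a hypothetical helper name and the records are toy data.

```python
def find_extras(merged_records, canonical_ids):
    """Return records whose committee_id is not canonical, fewest meetings first."""
    extras = [r for r in merged_records if r["committee_id"] not in canonical_ids]
    return sorted(extras, key=lambda r: r["meeting_count"])

# Toy records: one canonical committee and three that exist only in meeting data.
merged = [
    {"committee_id": "access-joint", "meeting_count": 37},
    {"committee_id": "appeals", "meeting_count": 0},
    {"committee_id": "kent-medway-economic-partnership-kmep", "meeting_count": 5},
    {"committee_id": "forum", "meeting_count": 36},
]
canonical_ids = {"appeals", "ashford-local-board"}

extras = find_extras(merged, canonical_ids)
print([r["committee_id"] for r in extras])
```

Sorting by meeting count puts the thinnest histories first, which tends to be where scraping noise shows up during manual review.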

### 5. **Promote Valid Extras**
- If a missing committee has valid meeting history:
  - Assign `"active"` if the last meeting falls within the past 12 months (365 days)
  - Otherwise mark it as `"inactive"` and store the last meeting date as `date_inactivated`
- Construct `new_entries_df` for potential inclusion in metadata
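
The status rule can be sketched as a small pure function. The name `classify`, the `window_days` parameter, and the fixed reference date are assumptions made so the example is reproducible; the notebook itself uses `datetime.today()` and a 365-day window with a strict comparison.

```python
from datetime import date, timedelta

def classify(last_meeting_date, today, window_days=365):
    """Return (status, date_inactivated) for an ISO date string."""
    last = date.fromisoformat(last_meeting_date)
    cutoff = today - timedelta(days=window_days)
    if last > cutoff:                      # strictly newer than the cutoff
        return "active", None
    return "inactive", last_meeting_date   # inactive since its last meeting

today = date(2025, 7, 15)  # fixed reference date for the example
print(classify("2025-03-24", today))
print(classify("2015-03-23", today))
```

A meeting held exactly 365 days before the reference date falls on the cutoff and is therefore classed as inactive.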

---

## 📦 Outputs
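- Final merged metadata: `../data/committees/committees.jsonl`
- Unmatched entries for QA: `extras_df`, saved as `extras_for_manual_review.csv`
- Metadata-ready promoted entries: `new_entries_df`, saved as `promoted_committees.jsonl` and appended to `../data/metadata/committees.jsonl`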

In [64]:
import json
import re
from collections import defaultdict
from pathlib import Path

# === CONFIGURATION ===
INPUT_PATH = Path("../data/meetings/meetings_metadata.jsonl")
OUTPUT_PATH = Path("../data/committees/committees.jsonl")
LEGACY_PATH = Path("../data/metadata/committees.jsonl")  # canonical metadata, called "legacy" in this cell
ALIAS_PATH = Path("../data/references/committee_alias_map.csv")

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

# === SLUGIFY ===
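def slugify(name):
    """Turn a committee name into a lowercase, hyphen-separated ID."""
    name = name.lower()
    name = name.replace("’", "'")  # normalise curly apostrophes
    name = re.sub(r"'s\b", "s", name)  # fold possessives: "children's" -> "childrens"
    name = re.sub(r"\bcommittee\b", "", name)  # drop every standalone "committee"
    name = re.sub(r"[^\w\s]", "", name)  # strip remaining punctuation
    name = re.sub(r"\s+", "-", name.strip())  # whitespace runs -> hyphens
    name = re.sub(r"-+", "-", name).strip("-")  # collapse and trim hyphens
    return name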

def force_map_committee_name(name):
    """Return a canonical name for known noisy variants, else the name unchanged.

    Patterns are matched as case-insensitive substrings in the order listed;
    the first match wins.
    """
    patterns = {
        "county council": "County Council",
        "kent and medway police and crime panel": "Kent and Medway Police and Crime Panel",
        "transport cabinet committee": "Environment & Transport Cabinet Committee",
        "sacre": "Standing Advisory Council on Religious Education (SACRE)",
        "scrutiny committee": "Scrutiny Committee",
        "forum": "Forum",
        "appeal panel": "Regulation Committee Appeal Panel (Transport)",
        "governor appointments panel": "Governor Appointments Panel",
        # Add more patterns below if needed
    }

    name_lower = name.lower()
    for pattern, canonical in patterns.items():
        if pattern in name_lower:
            return canonical
    return name  # fallback: keep original

# === Load alias map ===
alias_map = {}
if ALIAS_PATH.exists():
    import csv
    with open(ALIAS_PATH, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            alias_id = row["alias_committee_id"]
            canonical_id = row["canonical_committee_id"]
            alias_map[alias_id] = canonical_id

# === Load meetings and group by committee ===
committees = defaultdict(list)

with open(INPUT_PATH, "r", encoding="utf-8") as f:
    for line in f:
        try:
            obj = json.loads(line)
            if obj.get("committee_name") and obj.get("meeting_date"):
                # Step 1: remap noisy names based on known patterns
                forced_name = force_map_committee_name(obj["committee_name"])

                # Step 2: generate slug from cleaned name
                raw_slug = slugify(forced_name)

                # Step 3: apply alias redirection if exists
                canonical_id = alias_map.get(raw_slug, raw_slug)

                # Step 4: track committee occurrence
                committees[canonical_id].append({
                    "name": forced_name,
                    "date": obj["meeting_date"]
                })
        except (json.JSONDecodeError, AttributeError):
            continue  # skip malformed lines and records that are not JSON objects

# === Aggregate committee summaries ===
records = []
for cid, meetings in committees.items():
    # ISO 8601 date strings sort chronologically, so min/max need no parsing
    dates = [m["date"] for m in meetings if m["date"]]
    if not dates:
        continue  # skip committees with no dated meetings
    first, last = min(dates), max(dates)

    records.append({
        "committee_id": cid,
        "committee_name": meetings[0]["name"],  # First encountered name
        "first_meeting_date": first,
        "last_meeting_date": last,
        "meeting_count": len(dates)
    })

# === Load legacy committee metadata ===
legacy_data = {}
if LEGACY_PATH.exists():
    with open(LEGACY_PATH, "r", encoding="utf-8") as f:
        for line in f:
            try:
                entry = json.loads(line)
                if "committee_id" in entry:
                    legacy_data[entry["committee_id"]] = entry
            except (json.JSONDecodeError, TypeError):
                continue  # skip malformed lines and non-object entries

# === Merge meeting stats into legacy records ===
for r in records:
    cid = r["committee_id"]
    if cid in legacy_data:
        legacy_data[cid].update(r)
    else:
        legacy_data[cid] = r

# === Write output ===
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for r in sorted(legacy_data.values(), key=lambda x: x["committee_id"]):
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f"✅ Saved {len(legacy_data)} committees to {OUTPUT_PATH}")

✅ Saved 153 committees to ../data/committees/committees.jsonl


In [65]:
import pandas as pd

# Convert merged legacy_data dict to DataFrame
merged_committees_df = pd.DataFrame(legacy_data.values())
merged_committees_df = merged_committees_df.sort_values(by="committee_id").reset_index(drop=True)
merged_committees_df.head()

| | committee_id | canonical_name | status | date_inactivated | aliases | council_code | committee_name | first_meeting_date | last_meeting_date | meeting_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | access-joint | | | | | | ACCESS Joint Committee | 2017-07-31 | 2025-03-24 | 37.0 |
| 1 | adult-social-care-cabinet | Adult Social Care Cabinet Committee | active | | [Adult Social Care and Health Cabinet Committee, Adult Social Care and Public Health Policy Overview and Scrutiny Committee, Children's Social Care and Health Cabinet Committee, Social Care & Community Health POC, Social Care and Public Health Cabinet Committee, Adult Social Care and Health Cabinet Committee, Adult Social Care and Public Health Policy Overview and Scrutiny Committee, Adult Social Services Policy Overview and Scrutiny Committee] | kent_cc | Adult Social Care and Health Cabinet Committee | 2015-01-15 | 2025-07-08 | 68.0 |
| 2 | appeals | Appeals Committee | inactive | 31/12/2005 | [] | kent_cc | | | | |
| 3 | ashford-central-forum | Ashford Central Forum | inactive | 31/10/2010 | [] | kent_cc | | | | |
| 4 | ashford-local-board | Ashford Local Board | inactive | 30/06/2016 | [] | kent_cc | | | | |


In [66]:
import pandas as pd
import json

# Load the merged output written by the first cell
output_path = "../data/committees/committees.jsonl"
output_df = pd.read_json(output_path, lines=True)

# Load canonical metadata
canonical_path = "../data/metadata/committees.jsonl"
canonical_df = pd.read_json(canonical_path, lines=True)

# Extract known IDs
known_ids = set(canonical_df["committee_id"])
actual_ids = set(output_df["committee_id"])

# Find mismatches
unexpected_ids = actual_ids - known_ids

# Filter and inspect
extras_df = output_df[output_df["committee_id"].isin(unexpected_ids)].sort_values("meeting_count")
extras_df

| | committee_id | committee_name | first_meeting_date | last_meeting_date | meeting_count | canonical_name | status | date_inactivated | aliases | council_code |
|---|---|---|---|---|---|---|---|---|---|---|
| 62 | kent-medway-economic-partnership-kmep | Kent & Medway Economic Partnership (KMEP) | 2014-09-08 | 2015-03-23 | 5.0 | | | | | |
| 35 | forum | Forum | 2014-06-23 | 2017-03-16 | 36.0 | | | | | |
| 0 | access-joint | ACCESS Joint Committee | 2017-07-31 | 2025-03-24 | 37.0 | | | | | |
| 137 | standing-advisory-council-on-religious-education-sacre | Standing Advisory Council on Religious Education (SACRE) | 2015-03-10 | 2025-06-10 | 37.0 | | | | | |


In [68]:
from datetime import datetime, timedelta

today = datetime.today()
cutoff = today - timedelta(days=365)

# extras_df comes from the previous cell
final_extras = []

for _, row in extras_df.iterrows():
    last_date = pd.to_datetime(row["last_meeting_date"])
    
    # Active only if a valid last meeting date falls after the 365-day cutoff
    is_active = pd.notnull(last_date) and last_date > cutoff

    final_extras.append({
        "committee_id": row["committee_id"],
        "canonical_name": row["committee_name"],
        "status": "active" if is_active else "inactive",
        # Kept as the ISO string from the meeting data; existing canonical entries use DD/MM/YYYY
        "date_inactivated": None if is_active else row["last_meeting_date"],
        "aliases": [],
        "council_code": "kent_cc",
        "first_meeting_date": row["first_meeting_date"],
        "last_meeting_date": row["last_meeting_date"],
        "meeting_count": row["meeting_count"]
    })

# Collect the promoted entries into a DataFrame
new_entries_df = pd.DataFrame(final_extras)

In [69]:
extras_df.to_csv("../data/committees/extras_for_manual_review.csv", index=False)
new_entries_df.to_json("../data/committees/promoted_committees.jsonl", orient="records", lines=True)

In [70]:
# Load the canonical metadata
canonical_path = "../data/metadata/committees.jsonl"
canonical_df = pd.read_json(canonical_path, lines=True)

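# Append promoted extras; keep the first row per ID so rerunning this cell does not duplicate entries
updated_df = (
    pd.concat([canonical_df, new_entries_df], ignore_index=True)
    .drop_duplicates(subset="committee_id", keep="first")
)

# Save it back to the metadata JSONL
updated_df.to_json(canonical_path, orient="records", lines=True, force_ascii=False)

print(f"✅ Appended {len(updated_df) - len(canonical_df)} promoted committees to {canonical_path}")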

✅ Appended 4 promoted committees to ../data/metadata/committees.jsonl
