# Setting up a committee structure for a new council

This notebook processes and cleans up a council's committee list. This step is required before we start scraping meetings and matching them to the correct committee names.

---

### 🧭 Workflow Overview

1. **Download the raw HTML file**  
   - Go to the council’s meetings page (e.g. *Browse Meetings*).
   - Save the full HTML of the committee list (often includes both active and historic entries).
   - Example: https://democracy.kent.gov.uk:9071/ieDocHome.aspx?XXR=0&Year=-1&Page=1&Categories=-14759&EB=F&

2. **Parse the HTML and extract committee entries**  
   - Use BeautifulSoup to extract all `<li>` elements inside the main `<div class="mgList">`.
   - Extract each committee name and check if the text includes a phrase like `decommissioned DD/MM/YYYY`.

3. **Build a raw committee dataframe**  
   - For each committee, extract:
     - `canonical_name` (the visible name)
     - `committee_id` (slugified name)
     - `status`: `active` or `inactive` (based on presence of `decommissioned`)
     - `date_inactivated`: if applicable
     - `council_code`: e.g. `"kent_cc"`
     - `aliases`: initially empty list

4. **Search and alias-mapping**  
   - Define keywords (e.g. `"social care"`, `"children"`) to explore historical committee variants.
   - Use `search_committees_by_keyword()` to inspect all matching entries.
   - Use `auto_map_aliases_by_keyword()` to map **inactive** committee names to a **single active** canonical one.

5. **Manually discard irrelevant committees**  
   - Define terms (e.g. `"forum"`, `"local"`, `"Maidstone"`) for committees that should be removed entirely.
   - Use `committees_df` and `inactive_committees` to prune out these entries.

6. **Maintain an alias map**  
   - Store alias relationships in a central `alias_map_df`.
   - Append to `../data/references/committee_alias_map.csv` for persistence.
   - Ensure removed committees are no longer in the final list.

7. **Inject aliases into canonical records**  
   - Populate each active committee’s `aliases` list based on the reverse lookup from `alias_map_df`.

8. **Save the cleaned, structured output**  
   - Export the final `committees_df` as a JSONL file, alongside the alias map CSV, for downstream use:
     ```
     ../data/metadata/committees.jsonl
     ../data/references/committee_alias_map.csv
     ```
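
The status check in steps 2 and 3 can be sketched with the standard library alone. This is a minimal illustration, assuming the phrase format `decommissioned DD/MM/YYYY` described above; `parse_status` and the sample entry texts are hypothetical, and unlike the notebook's regex this version ignores case and rejects impossible dates.

```python
import re
from datetime import datetime

# Assumed phrase format, taken from step 2: "decommissioned DD/MM/YYYY".
DECOMMISSIONED_RE = re.compile(r"decommissioned\s+(\d{2}/\d{2}/\d{4})", re.IGNORECASE)


def parse_status(entry_text):
    """Return (status, date_inactivated) for the visible text of one list entry."""
    match = DECOMMISSIONED_RE.search(entry_text)
    if not match:
        return "active", None
    date_text = match.group(1)
    # Raises ValueError for impossible dates such as 31/02/2011.
    datetime.strptime(date_text, "%d/%m/%Y")
    return "inactive", date_text


# Hypothetical entry texts, for illustration only.
print(parse_status("Cabinet"))
print(parse_status("Appeals Committee (decommissioned 31/12/2005)"))
```

Validating the date at extraction time means a malformed entry fails here rather than later, when meetings are matched against inactivation dates.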

## Use of GPT for quick results

Alternatively, to extract the data quickly, we can use ChatGPT. Feed it the saved HTML page together with the following prompt, which can be adapted for other councils:

---

### 🧾 Prompt: Extract Structured Committee Data from HTML

You are an expert in web scraping and data extraction. You are given a full HTML file from a UK local government Modern.Gov-based website that lists council committees.

Your task is to extract a structured list of all committees and output a clean .jsonl file, where each line is a JSON object matching the following schema:

{
  "committee_id": "slug-version-of-canonical-name",
  "canonical_name": "Full human-readable committee name",
  "status": "active" or "inactive",
  "date_inactivated": "DD/MM/YYYY" or null,
  "aliases": [],
  "council_code": "short-code-of-council"
}

Extraction Rules:

- Parse all committee entries under sections such as:
  - Full Council
  - Executive (Cabinet)
  - Committees
  - Advisory Boards
  - Working Groups or Panels
- For status, set to "inactive" only if the name includes a phrase like "(Ceased DATE)". Otherwise, use "active".
- Extract the inactivation date if present, and convert it to "DD/MM/YYYY" format.
- Use this as your slugify function:

      import re

      def simple_slugify(text):
          text = text.lower()
          text = text.replace("’", "'")  # Normalize smart apostrophes
          text = re.sub(r"'s\b", "s", text)  # Turn possessives into plain 's' (e.g. "Children's" → "childrens")
          text = re.sub(r"\bcommittee\b", "", text)  # Remove the word 'committee' anywhere
          text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
          text = re.sub(r"\s+", "-", text.strip())  # Replace spaces with hyphens
          text = re.sub(r"-+", "-", text)  # Collapse multiple hyphens
          return text.strip("-")

- The aliases field should be an empty list [] for now.
- Set "council_code" to a fixed short code, e.g. "tunbridge_wells_bc" or "kent_cc".

Output Format:

- Return the full data as .jsonl (JSON Lines), with one JSON object per line.
- Make the file downloadable as committees_tunbridge_wells.jsonl.
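
Output produced this way is worth checking before it is saved alongside the notebook's own results. The sketch below is one possible validator for the schema in the prompt; `validate_jsonl` and the sample lines are hypothetical, and the checks cover only the fields and formats stated in the prompt.

```python
import json
import re

REQUIRED_KEYS = {"committee_id", "canonical_name", "status",
                 "date_inactivated", "aliases", "council_code"}
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples; empty means nothing was flagged."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((line_no, f"missing keys: {sorted(missing)}"))
            continue
        if record["status"] not in ("active", "inactive"):
            problems.append((line_no, f"unexpected status: {record['status']!r}"))
        date = record["date_inactivated"]
        if date is not None and not (isinstance(date, str) and DATE_RE.fullmatch(date)):
            problems.append((line_no, f"date not in DD/MM/YYYY format: {date!r}"))
        if not isinstance(record["aliases"], list):
            problems.append((line_no, "aliases is not a list"))
    return problems


# Hypothetical sample: the first line is complete, the second lacks most keys.
good_line = json.dumps({
    "committee_id": "cabinet", "canonical_name": "Cabinet", "status": "active",
    "date_inactivated": None, "aliases": [], "council_code": "kent_cc",
})
sample = good_line + "\n" + '{"committee_id": "appeals"}'
print(validate_jsonl(sample))
```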


In [173]:
import re
import pandas as pd
from bs4 import BeautifulSoup
import os

# Configuration
INPUT_HTML_FILE = "../data/references/Browse Meetings, 0.html"
OUTPUT_JSONL_FILE = "../data/metadata/committees.jsonl"
ALIAS_CSV_PATH = "../data/references/committee_alias_map.csv"
COUNCIL_CODE = "kent_cc"

# Simple slugify fallback
def simple_slugify(text):
    text = text.lower()
    text = text.replace("’", "'")  # Normalize smart apostrophes
    text = re.sub(r"'s\b", "s", text)  # Turn possessives into plain 's' (e.g. "Children's" → "childrens")
    text = re.sub(r"\bcommittee\b", "", text)  # Remove the word 'committee' anywhere
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    text = re.sub(r"\s+", "-", text.strip())  # Replace spaces with hyphens
    text = re.sub(r"-+", "-", text)  # Collapse multiple hyphens
    return text.strip("-")

# Load HTML
with open(INPUT_HTML_FILE, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

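committee_div = soup.find("div", class_="mgList")
if committee_div is None:
    # Fail early with a clear message if the saved page has a different layout
    raise ValueError(f"No <div class='mgList'> found in {INPUT_HTML_FILE}")
committee_items = committee_div.find_all("li")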

committee_data = []

# Extract committee data
for li in committee_items:
    link = li.find("a")
    if not link:
        continue

    name = link.get_text(strip=True)
    full_text = li.get_text(separator=' ', strip=True)
    match = re.search(r'decommissioned\s+(\d{2}/\d{2}/\d{4})', full_text)

    if match:
        status = "inactive"
        date_inactivated = match.group(1)
    else:
        status = "active"
        date_inactivated = None

    committee_data.append({
        "committee_id": simple_slugify(name),
        "canonical_name": name,
        "status": status,
        "date_inactivated": date_inactivated,
        "aliases": [],
        "council_code": COUNCIL_CODE
    })

# Search for committees whose name contains a keyword (case-insensitive).
# regex=False treats the keyword as literal text, so characters such as "(" are safe.
def search_committees_by_keyword(df, keyword):
    keyword_lower = keyword.lower()
    matches = df[df["canonical_name"].str.lower().str.contains(keyword_lower, regex=False)].copy()
    return matches[["committee_id", "canonical_name", "status"]].reset_index(drop=True)

def auto_map_aliases_by_keyword(df, keyword):
    global alias_map_df, committees_df

    matches = df[df["canonical_name"].str.lower().str.contains(keyword.lower())].copy()
    active_rows = matches[matches["status"] == "active"]
    inactive_rows = matches[matches["status"] == "inactive"]

    if len(active_rows) != 1:
        print(f"⚠️ Found {len(active_rows)} active committees for keyword '{keyword}'. Please refine your keyword.")
        return

    canonical_id = active_rows.iloc[0]["committee_id"]

    new_aliases = inactive_rows.apply(
        lambda row: {
            "alias_committee_id": row["committee_id"],
            "alias_name": row["canonical_name"],
            "canonical_committee_id": canonical_id
        }, axis=1
    ).tolist()

    # Append to alias map
    alias_map_df = pd.concat([alias_map_df, pd.DataFrame(new_aliases)], ignore_index=True)

    print(f"✅ Added {len(new_aliases)} aliases pointing to '{canonical_id}'.")


# Create dataframe
committees_df = pd.DataFrame(committee_data)
committees_df.head(20)

| | committee_id | canonical_name | status | date_inactivated | aliases | council_code |
|---|---|---|---|---|---|---|
| 0 | adult-social-care-and-health-cabinet | Adult Social Care and Health Cabinet Committee | inactive | 16/05/2017 | [] | kent_cc |
| 1 | adult-social-care-and-public-health-policy-ove... | Adult Social Care and Public Health Policy Ove... | inactive | 31/03/2012 | [] | kent_cc |
| 2 | adult-social-care-cabinet | Adult Social Care Cabinet Committee | active | | [] | kent_cc |
| 3 | adult-social-services-policy-overview-and-scru... | Adult Social Services Policy Overview and Scru... | inactive | 05/04/2011 | [] | kent_cc |
| 4 | appeals | Appeals Committee | inactive | 31/12/2005 | [] | kent_cc |
| 5 | ashford-central-forum | Ashford Central Forum | inactive | 31/10/2010 | [] | kent_cc |
| 6 | ashford-local-board | Ashford Local Board | inactive | 30/06/2016 | [] | kent_cc |
| 7 | ashford-rural-south-forum | Ashford Rural South Forum | inactive | 28/02/2011 | [] | kent_cc |
| 8 | ashford-rural-west-forum | Ashford Rural West Forum | inactive | 21/06/2011 | [] | kent_cc |
| 9 | audit | Audit Committee | inactive | 31/12/2004 | [] | kent_cc |
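
Because `simple_slugify` drops the word "committee" and all punctuation, two different names can end up with the same `committee_id`. The sketch below repeats the function so it runs on its own and adds a hypothetical helper, `find_slug_collisions`, to flag such cases; the name "Audit" is invented for illustration.

```python
import re
from collections import defaultdict


def simple_slugify(text):
    # Copy of the notebook's function, repeated so this block is self-contained
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub(r"'s\b", "s", text)
    text = re.sub(r"\bcommittee\b", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", "-", text.strip())
    text = re.sub(r"-+", "-", text)
    return text.strip("-")


def find_slug_collisions(names):
    """Group names by slug and return only the slugs shared by several names."""
    by_slug = defaultdict(list)
    for name in names:
        by_slug[simple_slugify(name)].append(name)
    return {slug: group for slug, group in by_slug.items() if len(group) > 1}


names = [
    "Children's, Young People and Education Cabinet Committee",
    "Environment & Transport Cabinet Committee",
    "Audit Committee",
    "Audit",  # Hypothetical second entry that collides with "Audit Committee"
]
print(simple_slugify(names[0]))     # childrens-young-people-and-education-cabinet
print(simple_slugify(names[1]))     # environment-transport-cabinet
print(find_slug_collisions(names))  # {'audit': ['Audit Committee', 'Audit']}
```

Running a check like this before alias mapping matters because the later cells filter by `committee_id`, so a collision would remove or remap more rows than intended.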


In [174]:
committees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   committee_id      179 non-null    object
 1   canonical_name    179 non-null    object
 2   status            179 non-null    object
 3   date_inactivated  147 non-null    object
 4   aliases           179 non-null    object
 5   council_code      179 non-null    object
dtypes: object(6)
memory usage: 8.5+ KB


In [175]:
# Count active and inactive
num_active = committees_df[committees_df["status"] == "active"].shape[0]
num_inactive = committees_df[committees_df["status"] == "inactive"].shape[0]

print(f"✅ Active committees: {num_active}")
print(f"❌ Inactive committees: {num_inactive}")

# List all active committee names
active_committees = committees_df[committees_df["status"] == "active"]["canonical_name"].tolist()

print("\nList of active committees:")
for name in active_committees:
    print(f" - {name}")


✅ Active committees: 32
❌ Inactive committees: 147

List of active committees:
 - Adult Social Care Cabinet Committee
 - Cabinet
 - Children's, Young People and Education Cabinet Committee
 - County Council
 - Electoral and Boundary Review Committee
 - Environment & Transport Cabinet Committee
 - Governance and Audit Committee
 - Growth, Economic Development and Communities Cabinet Committee
 - Health Overview and Scrutiny Committee
 - Health Reform and Public Health Cabinet Committee
 - Kent and Medway NHS Joint Overview and Scrutiny Committee
 - Kent and Medway Police and Crime Panel
 - Kent and Medway Police and Crime Panel - Complaints Sub-Committee
 - Kent Community Safety Partnership
 - Kent Flood Risk and Water Management Committee
 - Kent Health and Wellbeing Board
 - Kent Utilities Engagement Sub-Committee
 - Member Development Sub-Committee
 - Pension Fund Committee
 - Personnel Committee
 - Personnel Committee - Member Appointment Panel
 - Planning Applications Committee
 - 

In [176]:
# List all inactive committee names
inactive_committees = committees_df[committees_df["status"] == "inactive"]["canonical_name"].tolist()

print("\nList of inactive committees:")
for name in inactive_committees:
    print(f" - {name}")



List of inactive committees:
 - Adult Social Care and Health Cabinet Committee
 - Adult Social Care and Public Health Policy Overview and Scrutiny Committee
 - Adult Social Services Policy Overview and Scrutiny Committee
 - Appeals Committee
 - Ashford Central Forum
 - Ashford Local Board
 - Ashford Rural South Forum
 - Ashford Rural West Forum
 - Audit Committee
 - Bexley and Kent Urgent Care Review Joint Health Overview and Scrutiny Committee
 - Budget IMG
 - Business Consultation
 - Cabinet Scrutiny Committee
 - Canterbury Area Member Panel
 - Canterbury Local Board
 - Canterbury Rural Area Member Panel - North and South
 - Central Ashford Neighbourhood Forum
 - Children, Families & Education - Learning and Development Policy Overview and Scrutiny Committee
 - Children, Families & Education - Resources and Infrastructure Policy Overview and Scrutiny Committee
 - Children, Families & Education - Vulnerable Children and Partnerships Policy Overview and Scrutiny Committee
 - Children,

## Steps to identify aliases (only needed when setting up a new council)

### Look up terms

In [None]:
# Search terms: only the last uncommented assignment takes effect,
# so comment out or reorder these lines to explore a different term.
search_term = "social care"
search_term = "children"
search_term = "environment"
search_term = "econom"
search_term = "health overview"
search_term = "audit"
search_term = "adult social"
search_term = "appointment"
search_term = "communities"

search_committees_by_keyword(committees_df, search_term)

#### Remove without creating aliases

In [152]:
# ✅ COMPLETE REMOVAL WITHOUT ALIAS MAPPING

# Define all search terms to remove
removal_terms = [
    "forum",
    "local",
    "Canterbury",
    "Select Committee - ",
    "Whitstable",
    "Tonbridge",
    "Maidstone",
    "East Kent",
    "Malling",
    "Herne Bay"
]

# Rebuild inactive_committees fresh from current state
inactive_committees = committees_df[committees_df["status"] == "inactive"].copy()

# Collect all matching committee_ids across all search terms
ids_to_drop = set()
for term in removal_terms:
    matches = search_committees_by_keyword(committees_df, term)
    ids_to_drop.update(matches["committee_id"].tolist())

# Apply removal from both dataframes
committees_df = committees_df[~committees_df["committee_id"].isin(ids_to_drop)].reset_index(drop=True)
inactive_committees = inactive_committees[~inactive_committees["committee_id"].isin(ids_to_drop)].reset_index(drop=True)
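
Short terms such as `"local"` match any name that contains them, including active committees, so it can help to preview the matches before dropping anything. The following is a dry-run sketch using plain dictionaries rather than the dataframe; `preview_removals` is a hypothetical helper, it uses literal substring matching, and "Local Pension Board" is an invented entry.

```python
def preview_removals(committees, removal_terms):
    """Map each committee name that would be removed to its status and matching terms."""
    preview = {}
    for committee in committees:
        name_lower = committee["canonical_name"].lower()
        hits = [term for term in removal_terms if term.lower() in name_lower]
        if hits:
            preview[committee["canonical_name"]] = {
                "status": committee["status"],
                "matched_terms": hits,
            }
    return preview


committees = [
    {"canonical_name": "Ashford Central Forum", "status": "inactive"},
    {"canonical_name": "Ashford Local Board", "status": "inactive"},
    {"canonical_name": "Cabinet", "status": "active"},
    {"canonical_name": "Local Pension Board", "status": "active"},  # Hypothetical entry
]
preview = preview_removals(committees, ["forum", "local", "Canterbury"])

# Active committees caught by a removal term deserve a manual look first
active_hits = [name for name, info in preview.items() if info["status"] == "active"]
print(preview)
print(active_hits)
```

With the real data, the same input could be obtained from `committees_df.to_dict("records")`.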


#### Remove with aliases

In [None]:
# We'll store alias mappings in this dataframe structure for now
alias_map_df = pd.DataFrame(columns=["alias_committee_id", "alias_name", "canonical_committee_id"])
auto_map_aliases_by_keyword(committees_df, search_term)

In [None]:
# Remove any committees from committees_df that were added as aliases
alias_ids_to_remove = set(alias_map_df["alias_committee_id"])
committees_df = committees_df[~committees_df["committee_id"].isin(alias_ids_to_remove)].reset_index(drop=True)

# Optionally, recreate an inactive_committees dataframe if you're tracking those separately
inactive_committees = committees_df[committees_df["status"] == "inactive"].copy()

committees_df

In [None]:
alias_map_df

### Assign aliases

In [177]:
# ✅ Bulk alias mapping and cleanup cycle

# Define confirmed aliasable search terms
search_terms = [
    "social care",
    "children",
    "environment",
    "econom",
    "health overview",
    "audit",
    "adult social",
    "appointment",
    "communities"
]

# Reset alias map if needed
alias_map_df = pd.DataFrame(columns=["alias_committee_id", "alias_name", "canonical_committee_id"])

# Loop through each term, map aliases, and clean them out of committee list
for term in search_terms:
    matches = committees_df[committees_df["canonical_name"].str.lower().str.contains(term.lower())]
    active_rows = matches[matches["status"] == "active"]
    inactive_rows = matches[matches["status"] == "inactive"]

    if len(active_rows) != 1:
        print(f"⚠️ '{term}': found {len(active_rows)} active matches — skipping.")
        continue

    canonical_id = active_rows.iloc[0]["committee_id"]

    new_aliases = inactive_rows.apply(
        lambda row: {
            "alias_committee_id": row["committee_id"],
            "alias_name": row["canonical_name"],
            "canonical_committee_id": canonical_id
        }, axis=1
    ).tolist()

    alias_map_df = pd.concat([alias_map_df, pd.DataFrame(new_aliases)], ignore_index=True)
    print(f"✅ '{term}': mapped {len(new_aliases)} aliases → {canonical_id}")

# Remove aliases from main committee lists
alias_ids_to_remove = set(alias_map_df["alias_committee_id"])
committees_df = committees_df[~committees_df["committee_id"].isin(alias_ids_to_remove)].reset_index(drop=True)
inactive_committees = committees_df[committees_df["status"] == "inactive"].copy()


✅ 'social care': mapped 5 aliases → adult-social-care-cabinet
✅ 'children': mapped 9 aliases → childrens-young-people-and-education-cabinet
✅ 'environment': mapped 3 aliases → environment-transport-cabinet
✅ 'econom': mapped 2 aliases → growth-economic-development-and-communities-cabinet
✅ 'health overview': mapped 2 aliases → health-overview-and-scrutiny
✅ 'audit': mapped 2 aliases → governance-and-audit
✅ 'adult social': mapped 3 aliases → adult-social-care-cabinet
✅ 'appointment': mapped 2 aliases → personnel-member-appointment-panel
✅ 'communities': mapped 5 aliases → growth-economic-development-and-communities-cabinet


In [178]:
alias_map_df

| | alias_committee_id | alias_name | canonical_committee_id |
|---|---|---|---|
| 0 | adult-social-care-and-health-cabinet | Adult Social Care and Health Cabinet Committee | adult-social-care-cabinet |
| 1 | adult-social-care-and-public-health-policy-ove... | Adult Social Care and Public Health Policy Ove... | adult-social-care-cabinet |
| 2 | childrens-social-care-and-health-cabinet | Children's Social Care and Health Cabinet Comm... | adult-social-care-cabinet |
| 3 | social-care-community-health-poc | Social Care & Community Health POC | adult-social-care-cabinet |
| 4 | social-care-and-public-health-cabinet | Social Care and Public Health Cabinet Committee | adult-social-care-cabinet |
| 5 | children-families-education-learning-and-devel... | Children, Families & Education - Learning and ... | childrens-young-people-and-education-cabinet |
| 6 | children-families-education-resources-and-infr... | Children, Families & Education - Resources and... | childrens-young-people-and-education-cabinet |
| 7 | children-families-education-vulnerable-childre... | Children, Families & Education - Vulnerable Ch... | childrens-young-people-and-education-cabinet |
| 8 | children-families-and-education-policy-overview | Children, Families and Education Policy Overvi... | childrens-young-people-and-education-cabinet |
| 9 | children-families-and-educational-achievement-... | Children, Families and Educational Achievement... | childrens-young-people-and-education-cabinet |
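
Overlapping search terms can add the same alias more than once, or point one alias at two different canonical committees. In the run above, `'social care'` and `'adult social'` both resolve to `adult-social-care-cabinet`, and row 2 maps a children's committee to the adult social care one. The sketch below shows one way to surface both problems. The helper names are hypothetical, and the second and fourth sample rows are assumptions about what overlapping terms would produce, not copied output.

```python
from collections import defaultdict


def drop_duplicate_rows(alias_rows):
    """Keep the first occurrence of each (alias, canonical) pair."""
    seen = set()
    unique = []
    for row in alias_rows:
        key = (row["alias_committee_id"], row["canonical_committee_id"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def find_alias_conflicts(alias_rows):
    """Return aliases that point at more than one canonical committee."""
    targets = defaultdict(set)
    for row in alias_rows:
        targets[row["alias_committee_id"]].add(row["canonical_committee_id"])
    return {alias: sorted(ids) for alias, ids in targets.items() if len(ids) > 1}


# Only the two ID columns are needed for these checks
rows = [
    {"alias_committee_id": "adult-social-care-and-health-cabinet",
     "canonical_committee_id": "adult-social-care-cabinet"},
    {"alias_committee_id": "adult-social-care-and-health-cabinet",
     "canonical_committee_id": "adult-social-care-cabinet"},  # Assumed duplicate
    {"alias_committee_id": "childrens-social-care-and-health-cabinet",
     "canonical_committee_id": "adult-social-care-cabinet"},
    {"alias_committee_id": "childrens-social-care-and-health-cabinet",
     "canonical_committee_id": "childrens-young-people-and-education-cabinet"},  # Assumed
]
unique_rows = drop_duplicate_rows(rows)
conflicts = find_alias_conflicts(unique_rows)
print(len(unique_rows))
print(conflicts)
```

Conflicts are reported rather than resolved automatically, since choosing the right canonical committee is a judgement about the council's history.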


### Save the alias map permanently

In [None]:
#alias_map_df.to_csv(ALIAS_CSV_PATH, index=False)
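
The workflow overview describes appending to the alias CSV, whereas `to_csv` as written above would overwrite it. A minimal append sketch using only the standard library follows; `append_alias_rows` is a hypothetical helper, and the demo writes to a temporary directory rather than the real `ALIAS_CSV_PATH`.

```python
import csv
import os
import tempfile

FIELDNAMES = ["alias_committee_id", "alias_name", "canonical_committee_id"]


def append_alias_rows(csv_path, rows):
    """Append rows to the alias CSV, skipping alias IDs already stored there."""
    existing = set()
    file_exists = os.path.exists(csv_path) and os.path.getsize(csv_path) > 0
    if file_exists:
        with open(csv_path, newline="", encoding="utf-8") as f:
            existing = {row["alias_committee_id"] for row in csv.DictReader(f)}

    new_rows = []
    for row in rows:
        if row["alias_committee_id"] not in existing:
            existing.add(row["alias_committee_id"])
            new_rows.append(row)

    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if not file_exists:
            writer.writeheader()  # Header is written once, for a new file only
        writer.writerows(new_rows)
    return len(new_rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "committee_alias_map.csv")
    row = {
        "alias_committee_id": "adult-social-care-and-health-cabinet",
        "alias_name": "Adult Social Care and Health Cabinet Committee",
        "canonical_committee_id": "adult-social-care-cabinet",
    }
    first = append_alias_rows(demo_path, [row])
    second = append_alias_rows(demo_path, [row])  # Same row again is skipped
    with open(demo_path, encoding="utf-8") as f:
        line_count = len(f.read().splitlines())
    print(first, second, line_count)
```

With the dataframe, the `rows` argument could be `alias_map_df.to_dict("records")`.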

## Enrich committees_df with backward references

In [180]:
from collections import defaultdict

# Build a mapping from canonical_id → list of aliases
aliases_by_canonical = defaultdict(list)
for _, row in alias_map_df.iterrows():
    aliases_by_canonical[row["canonical_committee_id"]].append(row["alias_name"])

# Inject aliases into the main committees_df
committees_df["aliases"] = committees_df.apply(
    lambda row: aliases_by_canonical.get(row["committee_id"], []),
    axis=1
)

In [181]:
import re

# Strip the word "committee" from the committee_id field
committees_df["committee_id"] = committees_df["committee_id"].apply(
    lambda cid: re.sub(r'-committee\b', '', cid)
)
committees_df.head(30)

| | committee_id | canonical_name | status | date_inactivated | aliases | council_code |
|---|---|---|---|---|---|---|
| 0 | adult-social-care-cabinet | Adult Social Care Cabinet Committee | active | | [Adult Social Care and Health Cabinet Committe... | kent_cc |
| 1 | appeals | Appeals Committee | inactive | 31/12/2005 | [] | kent_cc |
| 2 | ashford-central-forum | Ashford Central Forum | inactive | 31/10/2010 | [] | kent_cc |
| 3 | ashford-local-board | Ashford Local Board | inactive | 30/06/2016 | [] | kent_cc |
| 4 | ashford-rural-south-forum | Ashford Rural South Forum | inactive | 28/02/2011 | [] | kent_cc |
| 5 | ashford-rural-west-forum | Ashford Rural West Forum | inactive | 21/06/2011 | [] | kent_cc |
| 6 | budget-img | Budget IMG | inactive | 01/01/2008 | [] | kent_cc |
| 7 | business-consultation | Business Consultation | inactive | 31/12/2005 | [] | kent_cc |
| 8 | cabinet | Cabinet | active | | [] | kent_cc |
| 9 | cabinet-scrutiny | Cabinet Scrutiny Committee | inactive | 30/05/2009 | [] | kent_cc |


In [None]:
# Save committees_df to JSONL
#committees_df.to_json(OUTPUT_JSONL_FILE, orient="records", lines=True, force_ascii=False)
OUTPUT_JSONL_FILE

'../data/metadata/committees.jsonl'
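
For the downstream matching step mentioned at the top of the notebook, the saved JSONL can be turned into a lookup from any known name, canonical or alias, to its `committee_id`. This is only a sketch of one possible consumer; `build_name_lookup` is a hypothetical helper and the sample record is abbreviated to a single alias.

```python
import json


def build_name_lookup(jsonl_text):
    """Map lower-cased canonical names and aliases to their committee_id."""
    lookup = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # Ignore blank lines
        record = json.loads(line)
        for name in [record["canonical_name"], *record["aliases"]]:
            lookup[name.strip().lower()] = record["committee_id"]
    return lookup


# Sample record shaped like the notebook's output, with one alias only
sample_jsonl = json.dumps({
    "committee_id": "adult-social-care-cabinet",
    "canonical_name": "Adult Social Care Cabinet Committee",
    "status": "active",
    "date_inactivated": None,
    "aliases": ["Adult Social Care and Health Cabinet Committee"],
    "council_code": "kent_cc",
})
lookup = build_name_lookup(sample_jsonl)
print(lookup["adult social care and health cabinet committee"])
```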