# Deputyship data extraction and stop flow forecast notebook

This notebook pulls monthly snapshots of deputyship cases from the Sirius database, turns those snapshots into a set of flow measures, then produces a forecast of future active caseload by age group. The forecast method used here is a stop flow recurrence, which updates the number of active cases using entries and terminations observed one year earlier.

The notebook is written so a new analyst can run it from top to bottom. The first code cell installs a small set of dependencies into an isolated location so the rest of the notebook can import them without changing the shared environment. The later run cells write outputs into the output folder, including raw monthly extracts, summary tables, a combined historical and forecast table, charts, and an insights markdown file.

Before you run the extraction steps you need access to the opg_sirius_prod database through pydbtools. If you do not have that access, the notebook will fail at the first query and you will need to request the relevant permissions.

A practical note about outputs: the run sections call a helper that deletes all files inside the output folder before writing new results. If you want to keep earlier runs, copy the folder elsewhere first.


## Environment and dependency set up

This cell installs the extra Python libraries that the rest of the notebook relies on. It does it in a way that avoids breaking the shared Python environment.

The approach is to create a fresh folder on disk and tell pip to install packages into that folder. The folder is then added to Python's import search path (sys.path) so imports work immediately.

This is useful on managed notebook platforms because a normal install can try to upgrade or remove packages that other projects depend on. When that happens you can get errors about locked files or dependency conflicts. By isolating installs into a new folder, we avoid touching the base environment.

The helper functions in this cell do three things.

First they choose a base directory for installs. You can override it with the NB_PIP_TARGET environment variable. If you do not set that variable, the code creates a folder under /tmp whose name includes the current Unix time in seconds.

Second they activate a chosen install folder by inserting it at the front of sys.path. Python imports from the first matching entry on sys.path, so putting the new folder first means the newly installed packages are the ones that get imported.

Third they run pip in a mode that installs only the named packages into the target folder and avoids pulling in and upgrading large shared dependencies. This reduces the chance of conflicts with core libraries such as numpy and botocore.

At the end of the cell, the notebook imports the installed packages and prints a version for each so you can see that the environment is ready for the rest of the notebook. dataengineeringutils3 is installed but is not included in this import check.
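The sys.path precedence described above can be checked on its own, without pip. The sketch below is illustrative only: the module name nb_demo_pkg and the temporary folder are made up for the demonstration and are not part of the notebook.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway folder holding a tiny module (hypothetical name).
demo_dir = tempfile.mkdtemp(prefix="pythonlibs_demo_")
with open(os.path.join(demo_dir, "nb_demo_pkg.py"), "w") as fh:
    fh.write("__version__ = '0.0.1'\n")

# Put the folder first on sys.path, as activate_site does.
sys.path.insert(0, demo_dir)
importlib.invalidate_caches()

import nb_demo_pkg

# __file__ shows which sys.path entry satisfied the import.
loaded_from = os.path.dirname(os.path.abspath(nb_demo_pkg.__file__))
print("nb_demo_pkg version:", nb_demo_pkg.__version__)
print("nb_demo_pkg loaded from:", loaded_from)
```

The same check on a real package (printing its `__file__`) is a quick way to confirm that an import came from the isolated folder rather than from the base environment.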


In [None]:
# --- Fresh, isolated installs to avoid "device or resource busy" + avoid dependency conflicts ---
import sys
import os
import time
import subprocess
import importlib

def _new_target_base():
    # prefer env override; else unique temp dir
    base = os.environ.get("NB_PIP_TARGET")
    if not base:
        base = f"/tmp/pythonlibs_{int(time.time())}"
    os.makedirs(base, exist_ok=True)
    return base

_ISO_BASE = _new_target_base()      # e.g., /tmp/pythonlibs_1737500000
_SYS_PATH_SEEN = set(sys.path)

def activate_site(path: str):
    """Put a site-packages dir at the very front of sys.path (once)."""
    if path in _SYS_PATH_SEEN:
        return path
    sys.path.insert(0, path)
    _SYS_PATH_SEEN.add(path)
    return path

def pip_install_isolated(*pkgs, no_deps: bool = False, target: str | None = None):
    """
    Install pkgs into a brand-new site dir so nothing gets removed/overwritten.
    Use no_deps=True to avoid pulling conflicting core deps (botocore/wrapt/numpy).
    Returns the target path.
    """
    target = target or os.path.join(_ISO_BASE, f"site_{len(_SYS_PATH_SEEN)}")
    os.makedirs(target, exist_ok=True)
    activate_site(target)

    cmd = [sys.executable, "-m", "pip", "install",
           "--upgrade", "--no-cache-dir", "--disable-pip-version-check",
           "--target", target]
    if no_deps:
        cmd.append("--no-deps")
    cmd.extend(pkgs)
    print(">", " ".join(cmd))
    subprocess.check_call(cmd)
    importlib.invalidate_caches()
    print("Installed into:", target)
    return target

print("Isolated install base:", _ISO_BASE)

# Install each package without its dependencies (the base env supplies boto/numpy/etc.)

pip_install_isolated("arrow_pd_parser", no_deps=True)
pip_install_isolated("mojap_metadata", no_deps=True)
pip_install_isolated("sql_metadata", no_deps=True)
pip_install_isolated("sqlparse", no_deps=True)
pip_install_isolated("pydbtools", no_deps=True)
pip_install_isolated("awswrangler", no_deps=True)
pip_install_isolated("dataengineeringutils3", no_deps=True)

# Sanity check
import sql_metadata
print("sql_metadata:", getattr(sql_metadata, "__version__", "unknown"))

import sqlparse
print("sqlparse:", getattr(sqlparse, "__version__", "unknown"))

import pydbtools
print("pydbtools:", getattr(pydbtools, "__version__", "unknown"))

import arrow_pd_parser
print("arrow_pd_parser:", getattr(arrow_pd_parser, "__version__", "unknown"))

import mojap_metadata
print("mojap_metadata:", getattr(mojap_metadata, "__version__", "unknown"))

import awswrangler
print("awswrangler:", getattr(awswrangler, "__version__", "unknown"))




## Core imports and small utilities used throughout the notebook

This cell loads the Python libraries used for data access, data manipulation, forecasting, and plotting. Most of the work later in the notebook uses pandas for tabular data, numpy for numeric work, matplotlib for charts, and pydbtools for running SQL and returning results as a DataFrame.

It also sets notebook level behaviour.

Warnings from statsmodels about divide by zero in information criteria are suppressed because they can appear when a time series is flat and they distract from the main output. Logging is disabled so that the notebook prints only the outputs we care about when running end to end.

After imports, the cell defines a few helper functions that keep the rest of the notebook readable.

parse_month converts a month string into a datetime object representing the first day of that month. It is used so that functions receive consistent date inputs.

clear_directory removes everything inside a given folder. It checks that the path is a directory first. This is used to ensure the output folder contains only the results from the current run.

generate_month_list creates a list of monthly datetime points from a start month to an end month. It uses relativedelta to move forward one month at a time. This gives a simple and explicit loop over reporting months.

last_day_of_month converts a datetime to the last calendar day of its month, formatted as a string. The data extraction queries work on month end snapshots, so we use the last day as the snapshot date.

There is also a small placeholder definition of fetch_cases_for_date. It illustrates the intended pattern of running a SQL query through pydbtools. The full version is defined in the data extraction cell and replaces this placeholder. generate_month_list and last_day_of_month are also redefined in that cell, without the logging calls.
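To make the date helpers concrete, here is a small standalone sketch that combines the month loop and the month-end conversion. It is not the notebook's code: month_end_dates is a hypothetical helper, and it steps through months with plain integer arithmetic instead of relativedelta so that it needs only the standard library.

```python
import calendar
from datetime import datetime


def month_end_dates(start_month: str, end_month: str) -> list[str]:
    """Return month-end dates (YYYY-MM-DD) from start_month to end_month inclusive."""
    start = datetime.strptime(start_month, "%Y-%m")
    end = datetime.strptime(end_month, "%Y-%m")
    if start > end:
        raise ValueError(f"Start month ({start_month}) is after end month ({end_month})")

    dates = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        dates.append(f"{year:04d}-{month:02d}-{last_day:02d}")
        # Step forward one month, rolling over into the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return dates


snapshot_dates = month_end_dates("2023-12", "2024-03")
print(snapshot_dates)
# ['2023-12-31', '2024-01-31', '2024-02-29', '2024-03-31']
```

The range crosses a year boundary and a leap-year February, which are the two cases where hand-written month arithmetic most often goes wrong.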


In [None]:
import os
import calendar
import shutil
import logging
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pydbtools
from dateutil.relativedelta import relativedelta

# Suppress statsmodels AIC/BIC divide-by-zero runtime warnings
warnings.filterwarnings("ignore", message=".*divide by zero encountered in log.*")

# Logging is disabled for end-to-end runs. To debug, comment out logging.disable
# and uncomment the basicConfig line below.
#logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logging.disable(logging.CRITICAL)

def parse_month(month_str: str) -> datetime:
    """Parse 'YYYY-MM' to datetime."""
    result = datetime.strptime(month_str.strip().strip("'\""), "%Y-%m")
    logging.debug(f"parse_month: parsed '{month_str}' to {result}")
    return result


# def clear_directory(path):
#     logging.info(f"clear_directory: clearing path {path}")
#     for filename in os.listdir(path):
#         file_path = os.path.join(path, filename)
#         try:
#             if os.path.isfile(file_path) or os.path.islink(file_path):
#                 os.unlink(file_path)
#                 logging.debug(f"Deleted file {file_path}")
#             elif os.path.isdir(file_path):
#                 shutil.rmtree(file_path)
#                 logging.debug(f"Deleted directory {file_path}")
#         except Exception as e:
#             logging.error(f"Failed to delete {file_path}. Reason: {e}")
def clear_directory(path: str):
    """Delete all files and subfolders inside `path`; do nothing if it is not a directory."""
    if not os.path.isdir(path):
        logging.debug(f"clear_directory: '{path}' is not a directory, skipping.")
        return
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            logging.warning(f"Failed to delete {file_path}: {e}")

def fetch_cases_for_date(run_date: str) -> pd.DataFrame:
    logging.info(f"fetch_cases_for_date: fetching data for {run_date}")
    query = "..."  # trimmed for brevity
    df = pydbtools.read_sql_query(query)
    logging.info(f"fetch_cases_for_date: returned {len(df)} rows for {run_date}")
    return df


def generate_month_list(start_month: str, end_month: str):
    logging.info(f"generate_month_list: from {start_month} to {end_month}")
    start_dt = parse_month(start_month)
    end_dt = parse_month(end_month)
    if start_dt > end_dt:
        raise ValueError(f"Start month ({start_month}) is after end month ({end_month})")
    months = []
    current = start_dt
    while current <= end_dt:
        months.append(current)
        logging.debug(f"Added month {current}")
        current += relativedelta(months=1)
    logging.info(f"Generated {len(months)} months")
    return months


def last_day_of_month(dt: datetime) -> str:
    day = calendar.monthrange(dt.year, dt.month)[1]
    result = dt.replace(day=day).strftime("%Y-%m-%d")
    logging.debug(f"last_day_of_month: for {dt} result {result}")
    return result

## Data extraction from Sirius and export of monthly snapshots

This cell defines the functions that actually pull deputyship case data from the database and write it to disk as month end snapshots.

The central function is fetch_cases_for_date. It takes a snapshot date as a string in YYYY-MM-DD format and runs a SQL query through pydbtools. The query is a point-in-time extract: every table reference is filtered to the same glueexporteddate. This matters because Sirius data in this environment is stored as daily exports, so using one export date throughout keeps the extract internally consistent.

Inside the query, the common table expression active_fee_reductions finds fee reduction records that are active on the snapshot date. It does two filters that matter.

It keeps only fee reductions whose start date is on or before the snapshot date and whose end date is on or after the snapshot date. That gives the reductions that apply on that day.

It also chooses the latest record per finance client by taking the maximum id within the active window. That avoids duplicate fee reduction rows if there have been updates.

The main SELECT then joins persons to cases for the snapshot date and keeps cases whose order status is OPEN, ACTIVE or DUPLICATE. It pulls fields that are used later for grouping and for age based modelling, including case number, supervision level, risk score, order subtype, order status, fee reduction type, and date of birth.

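Age is calculated as an approximate whole number of years. The logic is:

age in years equals the number of days between date of birth and the createddate on the persons table, divided by 365.25, then rounded to the nearest integer. The 365.25 factor accounts for leap years on average. If the computed value is negative, it is set to zero. Because the reference point is createddate rather than the snapshot date, this is the age when the person record was created, not the age on the snapshot date.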
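The age rule can be mirrored in a few lines of Python, which makes it easy to try on individual dates. This is a sketch of the arithmetic only; approx_age_in_years is a hypothetical helper, and the rounding is written as floor(x + 0.5) on the assumption that the SQL ROUND rounds halves upwards for positive values.

```python
import math
from datetime import date


def approx_age_in_years(dob: date, reference: date) -> int:
    """Approximate age: days between dob and reference / 365.25, rounded, floored at 0."""
    years = (reference - dob).days / 365.25
    if math.floor(years) < 0:
        # Reference date is before the date of birth: clamp to zero
        return 0
    return math.floor(years + 0.5)


print(approx_age_in_years(date(1950, 6, 15), date(2020, 6, 15)))  # 70
print(approx_age_in_years(date(2000, 1, 1), date(2000, 9, 1)))    # 1
print(approx_age_in_years(date(2021, 1, 1), date(2020, 1, 1)))    # 0
```

The second call shows a consequence of rounding rather than truncating: someone eight months old is assigned age 1.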

The cell also defines export_monthly_reports. This function takes a first month and a last month, builds the list of months in between, then for each month:

1. It converts the month to its last calendar day and uses that as the snapshot date.
2. It calls fetch_cases_for_date to pull the data for that snapshot.
3. It adds a month tag column so we can later combine months into one table.
4. It writes a csv file into a month specific folder under the output directory.
5. It writes the same data into an Excel workbook as one sheet per month.

After looping over all months, it concatenates the monthly DataFrames into one combined table and writes a combined csv for the full date range. It also produces a small summary table that counts total rows as a proxy for orders, and unique case numbers as a proxy for unique clients, grouped by year plus an overall row for the whole period. Because the combined table stacks month end snapshots, an order that appears in several months contributes one row per month to the total row count.

These exports give you an auditable trail: you can inspect individual month extracts, and you can also work with the combined table for modelling steps.
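The summary step can be tried on a toy table to see how the two measures differ. The case numbers below are made up, and the column names are assumed to match the combined table.

```python
import pandas as pd

# Toy combined table: two month-end snapshots stacked together (made-up case numbers)
combined = pd.DataFrame({
    "month": ["2023-12", "2023-12", "2024-01", "2024-01", "2024-01"],
    "casenumber": ["A1", "B2", "A1", "B2", "C3"],
})
combined["year"] = pd.to_datetime(combined["month"], format="%Y-%m").dt.year

# Per-year row counts and unique case numbers
annual = (
    combined.groupby("year")
    .agg(total_orders=("casenumber", "size"), total_people=("casenumber", "nunique"))
    .reset_index()
)

# One extra row covering the whole period
overall = pd.DataFrame([{
    "year": "all",
    "total_orders": len(combined),
    "total_people": combined["casenumber"].nunique(),
}])

summary = pd.concat([annual, overall], ignore_index=True)
print(summary.to_string(index=False))
```

In the overall row, total_orders is 5 while total_people is 3, because A1 and B2 appear in both snapshots. The row count grows with the number of months in the range, so it is best read alongside the unique count.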


In [None]:

def fetch_cases_for_date(run_date: str) -> pd.DataFrame:
    """
    Fetch all cases & their fee reductions for the given run_date (YYYY-MM-DD)
    using pydbtools.read_sql_query, which returns a pandas DataFrame.
    """
    query = f"""
    WITH active_fee_reductions AS (
      SELECT
        fc.client_id,
        SUBSTRING(fr.type,1,1) || LOWER(SUBSTRING(fr.type,2)) AS type,
        DATE(fr.startdate) AS startdate,
        DATE(fr.enddate)   AS enddate,
        fc.payment_method
      FROM opg_sirius_prod.fee_reduction fr
      JOIN opg_sirius_prod.finance_client fc
        ON fc.id = fr.finance_client_id
       AND fc.glueexporteddate = DATE('{run_date}')
      JOIN (
        SELECT
          MAX(id)           AS id,
          finance_client_id
        FROM opg_sirius_prod.fee_reduction
        WHERE enddate           >= DATE('{run_date}')
          AND startdate         <= DATE('{run_date}')
          AND deleted            = FALSE
          AND glueexporteddate   = DATE('{run_date}')
        GROUP BY finance_client_id
      ) latest ON latest.id = fr.id
      WHERE fr.glueexporteddate = DATE('{run_date}')
    )
    SELECT
      c.glueexporteddate,
      c.caserecnumber            AS casenumber,
      c.uid                      AS siriusid,
      (
        SELECT supervisionlevel
        FROM opg_sirius_prod.supervision_level_log sll
        WHERE sll.order_id         = c.id
          AND sll.glueexporteddate = DATE('{run_date}')
        ORDER BY sll.appliesfrom DESC
        LIMIT 1
      ) AS casesupervisionlevel,
      p.risk_score               AS CREC,
      c.casesubtype              AS orderType,
      c.orderdate                AS ordermadedate,
      c.orderstatus              AS orderStatus,
      afr.type                   AS feereductiontype,
      p.dob,
      CASE
        WHEN FLOOR(DATE_DIFF('day', p.dob, p.createddate) / 365.25) < 0 THEN 0
        ELSE ROUND(DATE_DIFF('day', p.dob, p.createddate) / 365.25)
      END AS age_in_years
    FROM opg_sirius_prod.persons p
    JOIN opg_sirius_prod.cases c
      ON p.id                   = c.client_id
     AND c.glueexporteddate     = DATE('{run_date}')
    LEFT JOIN active_fee_reductions afr
      ON afr.client_id          = p.id
    WHERE c.orderstatus IN ('OPEN','ACTIVE','DUPLICATE')
      AND p.glueexporteddate     = DATE('{run_date}')
    ORDER BY c.orderdate;
    """
    return pydbtools.read_sql_query(query)


def generate_month_list(start_month: str, end_month: str):
    """
    Return a list of datetime objects for each month-start
    from start_month to end_month inclusive.
    """
    start_dt = parse_month(start_month)
    end_dt = parse_month(end_month)
    if start_dt > end_dt:
        raise ValueError(f"Start month ({start_month}) is after end month ({end_month})")

    months = []
    current = start_dt
    while current <= end_dt:
        months.append(current)
        current += relativedelta(months=1)
    return months

def last_day_of_month(dt: datetime) -> str:
    """Return the last day of dt's month as 'YYYY-MM-DD'."""
    day = calendar.monthrange(dt.year, dt.month)[1]
    return dt.replace(day=day).strftime("%Y-%m-%d")
        
def export_monthly_reports(first_month: str, last_month: str, output_base="output") -> tuple[pd.DataFrame, pd.DataFrame]:
    # Clean inputs
    clean_first = first_month.strip().strip("'\"")
    clean_last = last_month.strip().strip("'\"")

    # Generate all months
    months = generate_month_list(clean_first, clean_last)
    if not months:
        print("No months in range; nothing to do.")
        return pd.DataFrame(), pd.DataFrame()

    # Prepare output directory
    os.makedirs(output_base, exist_ok=True)
    # **Clear the output directory, not the Excel filepath**
    clear_directory(output_base)

    excel_filename = f"cases_{clean_first}_to_{clean_last}.xlsx"
    excel_path = os.path.join(output_base, excel_filename)

    # List to accumulate each month's DataFrame
    all_months = []

    # Create Excel workbook and write each month's sheet
    with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
        for dt in months:
            month_tag = dt.strftime("%Y-%m")
            run_date = last_day_of_month(dt)

            # Fetch data for this month-end
            df = fetch_cases_for_date(run_date)

            # Tag the DataFrame with its month, then collect it
            df["month"] = month_tag
            all_months.append(df)

            # Save CSV for this month
            month_folder = os.path.join(output_base, month_tag)
            os.makedirs(month_folder, exist_ok=True)
            csv_path = os.path.join(month_folder, f"cases_{month_tag}.csv")
            df.to_csv(csv_path, index=False)

            # Add to Excel workbook
            df.to_excel(writer, sheet_name=month_tag, index=False)

            print(f"→ Saved CSV for {month_tag}: {csv_path}")
    # The workbook is written to disk when the `with` block exits
    print(f"→ Combined Excel workbook saved at: {excel_path}")

    # After all sheets are written, concatenate & export one big CSV
    if all_months:
        combined_df = pd.concat(all_months, ignore_index=True)
        combined_csv_path = os.path.join(
            output_base,
            f"all_cases_{clean_first}_to_{clean_last}.csv"
        )
        combined_df.to_csv(combined_csv_path, index=False)
        print(f"→ Combined CSV for all months saved at: {combined_csv_path}")
    else:
        combined_df = pd.DataFrame()

    # ---- NEW SUMMARY SECTION ----
    if not combined_df.empty:
        # extract year from the month tag
        combined_df["year"] = pd.to_datetime(combined_df["month"], format="%Y-%m").dt.year

        # annual summary
        annual = (
            combined_df
            .groupby("year")
            .agg(
                total_orders=("casenumber", "size"),
                total_people=("casenumber", "nunique")
            )
            .reset_index()
        )

        # overall summary across entire period
        overall = pd.DataFrame([{
            "year": "all",
            "total_orders": combined_df.shape[0],
            "total_people": combined_df["casenumber"].nunique()
        }])

        # combine for easy comparison
        summary_df = pd.concat([annual, overall], ignore_index=True)

        # print to console
        print("\n=== Orders & Unique-People Summary ===")
        print(summary_df.to_string(index=False))
    else:
        summary_df = pd.DataFrame()
        print("No data to summarise.")

    return combined_df, summary_df

## Monthly active cases by order type

This cell defines calculate_monthly_active_cases. The intention is to create a month by month view of how many active cases exist, broken down by order type.

The function takes four inputs.

df is the source DataFrame containing case level rows.
first_month and last_month define the month range to report.
output_base is the folder where the function writes its csv output.

The function builds a list of months, then loops through that list. For each month it creates a month tag and prepares a DataFrame called df_active. The aggregation step groups by month and order type and counts unique case numbers. Using nunique here means a person with multiple rows is counted once within the group.

After all months are processed, the per month summaries are concatenated into a single result table and written to a csv file. The function then creates a yearly summary from the collected active data. It reports two measures per year.

order_count is the number of rows, which is a proxy for number of orders.
unique_cases is the number of unique case numbers, which is a proxy for number of distinct clients.

A note for anyone maintaining this function: the current implementation takes df_active from df as a whole, without filtering it down to the relevant month. The commented code shows an earlier idea of fetching or filtering per month, but that logic is not active. For the monthly outputs to be meaningful, df needs to be filtered to the current month inside the loop, or the commented fetch needs to be re-enabled in a future refactor.
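One possible shape for the per-month filter is sketched below. It assumes df already carries a month column in YYYY-MM format, as the combined table from export_monthly_reports does. The order type values and case numbers are made up.

```python
import pandas as pd

# Toy stand-in for the combined extract (made-up values)
df = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "ordertype": ["pfa", "pfa", "pfa", "hw", "pfa", "hw"],
    "casenumber": ["A1", "A1", "D4", "B2", "A1", "C3"],
})

summaries = []
for month_tag in ["2024-01", "2024-02"]:
    # Keep only this month's rows; copy so the caller's frame is left untouched
    df_month = df[df["month"] == month_tag].copy()
    summary = (
        df_month.groupby(["month", "ordertype"])["casenumber"]
        .nunique()
        .reset_index(name="active_case_count")
    )
    summaries.append(summary)

result = pd.concat(summaries, ignore_index=True)
print(result.to_string(index=False))
```

Case A1 has two pfa rows in 2024-01 but is counted once, so that group reports 2 rather than 3.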


In [None]:
def calculate_monthly_active_cases(
    df: pd.DataFrame,
    first_month: str,
    last_month: str,
    output_base="output"
) -> tuple[list[pd.DataFrame], pd.DataFrame, pd.DataFrame]:
    """
    For each month between first_month and last_month (inclusive), tag df with
    the month and aggregate unique casenumber counts by ordertype. The per-month
    fetch and the ACTIVE filter are currently commented out, so df is used as passed.
    Writes a CSV 'monthly_active_cases_<first>_to_<last>.csv' under output_base,
    prints yearly order & people counts, and returns:
      - active_data: list of the per-month tagged DataFrames
      - result_df: monthly/ordertype aggregates
      - summary_df: yearly order & unique-case counts
    """
    # Clean inputs
    clean_first = first_month.strip().strip("'\"")
    clean_last = last_month.strip().strip("'\"")

    # Generate all months
    months = generate_month_list(clean_first, clean_last)
    if not months:
        print("No months in range; nothing to do.")
        return [], pd.DataFrame(), pd.DataFrame()

    # Prepare output directory
    os.makedirs(output_base, exist_ok=True)

    # Lists to collect each month's summary and raw active data
    summaries = []
    active_data = []

    for dt in months:
        month_tag = dt.strftime("%Y-%m")
        run_date = last_day_of_month(dt)

        
        # Fetch and filter to ACTIVE
        #df = fetch_cases_for_date(run_date)
        #df = combined_df
        
        #df_active = df[df["orderstatus"] == "ACTIVE"].copy()
        # Copy so the caller's df is not mutated and each month keeps its own tag
        df_active = df.copy()
        df_active["month"] = month_tag

        # Accumulate raw active rows for yearly summary
        active_data.append(df_active)

        # Aggregate unique casenumbers per orderType
        if df_active.empty:
            summaries.append(
                pd.DataFrame([{
                    "month": month_tag,
                    "ordertype": None,  # same column name as the groupby branch
                    "active_case_count": 0
                }])
            )
        else:
            summary = (
                df_active
                .groupby(["month", "ordertype"], observed=False)["casenumber"]
                .nunique()
                .reset_index(name="active_case_count")
            )
            summaries.append(summary)

        print(f"→ Aggregated ACTIVE cases for {month_tag}")

    # Combine monthly summaries
    result_df = pd.concat(summaries, ignore_index=True)

    # Write out CSV
    out_csv = os.path.join(
        output_base,
        f"monthly_active_cases_{clean_first}_to_{clean_last}.csv"
    )
    result_df.to_csv(out_csv, index=False)
    print(f"→ Monthly ACTIVE cases CSV saved at: {out_csv}")

    # ---- NEW YEARLY SUMMARY ----
    if active_data:
        combined_active = pd.concat(active_data, ignore_index=True)
        combined_active["year"] = pd.to_datetime(
            combined_active["month"], format="%Y-%m"
        ).dt.year

        # Total orders per year
        orders_year = (
            combined_active
            .groupby("year")
            .size()
            .reset_index(name="order_count")
        )

        # Unique cases (people) per year
        people_year = (
            combined_active
            .groupby("year")["casenumber"]
            .nunique()
            .reset_index(name="unique_cases")
        )

        summary_df = orders_year.merge(people_year, on="year")

        print("\n=== Yearly Active Orders & Unique Cases ===")
        print(summary_df.to_string(index=False))
    else:
        summary_df = pd.DataFrame()
        print("No ACTIVE data to summarise.")

    return active_data, result_df, summary_df


## Monthly flow of active cases

This cell defines calculate_monthly_flow. It creates a simple movement table that answers, for each month, how many people are active, how many appear for the first time compared with the previous month, and how many disappear compared with the previous month.

The logic relies on treating the set of active case numbers in each month as a mathematical set.

For a given month m, let A(m) be the set of active case numbers in that month.

The number active is |A(m)|, meaning the size of the set.

The people who entered in month m are A(m) minus A(m−1). In set notation that is A(m) ∖ A(m−1).

The people who exited in month m are A(m−1) minus A(m). In set notation that is A(m−1) ∖ A(m).

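The function builds A(m) for each month by querying the database at month end, then performs the set arithmetic above. It runs one query per month rather than reusing the exported snapshots. Because a set holds each case number once, a case with several order rows in a snapshot is still counted once.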

The first month in the range has no prior month in the snapshots, so its entered value would equal the full active set and its exited value would be zero. The code removes that first record before saving outputs, so the csv focuses on genuine month to month movement.

After building the monthly table, the function creates a yearly summary. It unions the monthly entered sets and exited sets within a year to count unique people who entered and exited at any point in that year, and it unions the monthly active sets to estimate unique active clients within the year.

The outputs are written to a csv file in the output folder and also printed for quick inspection.
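The set arithmetic can be seen in isolation with three hand-made snapshots. The case numbers are invented, and the first month is skipped in the loop instead of being dropped afterwards, which gives the same table.

```python
# Made-up month-end snapshots: month tag -> set of active case numbers
snapshots = {
    "2024-01": {"A1", "B2", "C3"},
    "2024-02": {"B2", "C3", "D4"},
    "2024-03": {"C3", "D4", "E5", "F6"},
}

flow_records = []
prev_tag = None
for tag in sorted(snapshots):  # YYYY-MM tags sort chronologically
    current = snapshots[tag]
    if prev_tag is not None:   # the first month has nothing to compare against
        prev = snapshots[prev_tag]
        flow_records.append({
            "month": tag,
            "active_count": len(current),
            "entered": len(current - prev),  # in this month, not in the last
            "exited": len(prev - current),   # in the last month, not in this
        })
    prev_tag = tag

for record in flow_records:
    print(record)

# Unique cases seen at any point across the three months
unique_active = len(set().union(*snapshots.values()))
print("unique active:", unique_active)
```

A useful check on any flow table built this way is that each month's active count equals the previous month's active count plus entered minus exited. Here that gives 3 + 1 - 1 = 3 for February and 3 + 2 - 1 = 4 for March.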


In [None]:
def calculate_monthly_flow(
    #df: pd.DataFrame,
    first_month: str,
    last_month: str,
    output_base="output"
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    For each month from first_month to last_month (inclusive),
    snapshot the set of active casenumbers, then compare to the prior month
    to count how many entered, exited, and how many were active.
    Writes 'monthly_flow_<first>_to_<last>.csv' under output_base,
    prints yearly summaries, and returns (flow_df, summary_df).
    """
    # Clean inputs
    clean_first = first_month.strip().strip("'\"")
    clean_last  = last_month.strip().strip("'\"")

    # Generate month list
    months = generate_month_list(clean_first, clean_last)
    if not months:
        print("No months in range; nothing to do.")
        return pd.DataFrame(), pd.DataFrame()

    # Snapshot active casenumbers per month
    snapshots = {}
    for dt in months:
        tag = dt.strftime("%Y-%m")

        # load data frame
        df = fetch_cases_for_date(last_day_of_month(dt))
        
        snapshots[tag] = set(df["casenumber"].unique())
        print(f"→ Snapshot for {tag}: {len(snapshots[tag])} active cases")

    # Build flow records
    flow_records = []
    entered_sets = {}
    exited_sets  = {}
    prev_tag = None

    for tag in sorted(snapshots):
        current = snapshots[tag]
        active_cnt = len(current)

        if prev_tag is None:
            entered = current
            exited  = set()
        else:
            prev = snapshots[prev_tag]
            entered = current - prev
            exited  = prev - current

        entered_sets[tag] = entered
        exited_sets[tag]  = exited

        flow_records.append({
            "month":        tag,
            "active_count": active_cnt,
            "entered":      len(entered),
            "exited":       len(exited)
        })
        prev_tag = tag

    # Create DataFrame & save CSV
    flow_df = pd.DataFrame(flow_records)

    # Remove the first record: with no prior month, entered is the whole active set and exited is 0
    flow_df = flow_df.iloc[1:].copy()
    os.makedirs(output_base, exist_ok=True)
    out_csv = os.path.join(output_base, f"monthly_flow_{clean_first}_to_{clean_last}.csv")
    flow_df.to_csv(out_csv, index=False)
    print(f"→ Monthly flow CSV saved at: {out_csv}")

    # Yearly summary
    flow_df["year"] = pd.to_datetime(flow_df["month"], format="%Y-%m").dt.year
    summary_records = []

    for year, group in flow_df.groupby("year"):
        months_in_year = group["month"].tolist()
        total_entered = group["entered"].sum()
        total_exited  = group["exited"].sum()
        total_active  = group["active_count"].sum()

        unique_entered = len(set().union(*(entered_sets[m] for m in months_in_year)))
        unique_exited  = len(set().union(*(exited_sets[m]  for m in months_in_year)))
        unique_active  = len(set().union(*(snapshots[m]     for m in months_in_year)))

        summary_records.append({
            "year":            year,
            #"entered_orders":  total_entered,
            "entered_people":  unique_entered,
            #"exited_orders":   total_exited,
            "exited_people":   unique_exited,
            #"active_orders":   total_active,
            "active_clients":  unique_active
        })

    summary_df = pd.DataFrame(summary_records)
    print("\n=== Yearly Flow & Active Summary ===")
    print(summary_df.to_string(index=False))

    return flow_df, summary_df


## Combining historical and forecast tables

This small helper function exists to stitch two tables with the same column structure into one combined table.

In this project we produce a historical table of age specific metrics from the database, and a forecast table of the same metrics from the stop flow model. Concatenating them with pandas concat creates one continuous time series that can be saved, plotted, and analysed in one go.

ignore_index is set to True so the combined table gets a fresh sequential index rather than keeping the original indexes from the two inputs.
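The effect of ignore_index is easiest to see side by side. The two small tables below are made-up stand-ins for the historical and forecast tables, and their column names are illustrative.

```python
import pandas as pd

# Made-up stand-ins with the same column structure
historical = pd.DataFrame({
    "month": ["2024-11", "2024-12"],
    "age_group": ["80", "80"],
    "active": [120, 118],
})
forecast = pd.DataFrame({
    "month": ["2025-01"],
    "age_group": ["80"],
    "active": [117],
})

stacked = pd.concat([historical, forecast])                      # keeps original indexes
combined = pd.concat([historical, forecast], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())   # [0, 1, 0]
print(combined.index.tolist())  # [0, 1, 2]
```

Without ignore_index the label 0 appears twice, so a lookup such as stacked.loc[0] returns two rows. The fresh index avoids that.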


In [None]:
# Append current and forecasted tables
def get_combined_age_deputyship_table(tbl1, tbl2):
    combined = pd.concat(
        [tbl1, tbl2],
        ignore_index=True
    )
    return combined

## Year on year flows and age specific entry and termination rates

This cell defines calculate_yearonyear_flows_and_age_rates. It is the main feature engineering step that turns raw case extracts into the quantities used by the forecasting model.

There are two connected ideas in this function.

The first is a year on year view of flow. Instead of comparing month m with month m−1, the code compares month m with the same month one year earlier, m−12. That gives a seasonal comparison, which is useful when you expect strong calendar effects.

For each month tag, the function builds two snapshots.

One snapshot is the set of active case numbers at the end of the current month.
The other snapshot is the set of active case numbers at the end of the same month one year earlier.

From these it calculates:

active_count_current equals the size of the current set.
active_count_previous equals the size of the previous year set.
entered equals the number of case numbers that are in the current set but not in the previous year set.
exited equals the number of case numbers that are in the previous year set but not in the current set.

Mathematically, with A(m) as the current set and A(m−12) as the previous year set:

entered(m) = |A(m) ∖ A(m−12)|
exited(m)  = |A(m−12) ∖ A(m)|
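
The set arithmetic can be sketched with plain Python sets. The case numbers below are made up for illustration.

```python
# Hypothetical case numbers, for illustration only
active_prev = {"C1", "C2", "C3", "C4"}  # A(m-12): active one year earlier
active_cur = {"C3", "C4", "C5"}         # A(m): active at the current month end

entered = active_cur - active_prev      # in the current set but not the previous one
exited = active_prev - active_cur       # in the previous set but not the current one

print(sorted(entered))  # ['C5']
print(sorted(exited))   # ['C1', 'C2']

# The two flows reconcile the two snapshot sizes
assert len(active_cur) == len(active_prev) + len(entered) - len(exited)
```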

The second idea is age specific rates. The code takes the people who entered, the people who exited, and the base population from the previous year snapshot, and assigns them to age groups. Age groups are built as single year bins from 0 up to 106. Any missing or out of range ages are labelled as Unknown.
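
The binning can be sketched with pd.cut as below. Left closed single year bins covering ages 0 to 106 need edges 0 through 107, because n bins need n + 1 edges. The sample ages are invented, and this is an illustration of the technique rather than a copy of the notebook helper.

```python
import pandas as pd

edges = list(range(0, 108))            # 0, 1, ..., 107 -> bins [0,1), ..., [106,107)
labels = [str(a) for a in edges[:-1]]  # '0' ... '106'

# Illustrative ages: two in range, the top age, one too high, one negative, one missing
ages = pd.Series([0, 45.7, 106, 107, -1, None])

groups = pd.cut(ages, bins=edges, labels=labels, right=False)
# Missing and out of range values come back as NaN; relabel them as 'Unknown'
groups = groups.astype(object).where(groups.notna(), "Unknown")

print(groups.tolist())  # ['0', '45', '106', 'Unknown', 'Unknown', 'Unknown']
```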

For each month and each age group g it then computes:

active(g) is the number of unique case numbers in the base population for that age group.
entered(g) is the number of unique case numbers that entered for that age group.
terminations(g) is the number of unique case numbers that exited for that age group.

Termination rate is defined as terminations(g) divided by active(g) when active(g) is greater than zero, rounded to four decimal places. When active(g) is zero the rate is set to 0. Retention rate is defined as 1 minus termination rate.
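
A minimal sketch of the two rates for one age group follows. The function name termination_and_retention is hypothetical, and rounding the retention rate to four decimal places is an assumption made here to keep the printed values tidy.

```python
def termination_and_retention(terminations: int, active: int) -> tuple[float, float]:
    """Return (termination_rate, retention_rate) for one age group."""
    # An empty base population gets a rate of 0 rather than a division error
    rate = round(terminations / active, 4) if active else 0.0
    return rate, round(1 - rate, 4)

print(termination_and_retention(3, 40))  # (0.075, 0.925)
print(termination_and_retention(0, 0))   # (0.0, 1.0)
```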

These rates are the core inputs to the stop flow forecast later, because the stop flow recurrence needs a view of how many people typically enter and leave by age.

Unknown age handling is optional. If redistribute_unknown_age is True, the function redistributes Unknown rows across the concrete age groups. The redistribution uses a proportional integer allocation method.

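1. It counts unique known cases by age group and converts those counts into proportions. If there are no known cases, it splits the Unknown rows uniformly across age groups.
2. It multiplies those proportions by the total number of Unknown rows to get target fractional allocations.
3. It takes the integer floor of each target, then distributes the remaining units to the groups with the largest fractional remainders, breaking ties by label. This is the largest remainder method, often called Hamilton apportionment.
4. It assigns the Unknown rows deterministically by sorted row index so that repeated runs give the same result.

The entered, exited, and base tables are each imputed separately using their own age composition, so the same Unknown case can be placed in different age groups in different tables.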
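
The allocation step can be sketched in plain Python as below. The function name hamilton_allocate and the counts are illustrative, and the sketch assumes at least one age group is supplied.

```python
import math

def hamilton_allocate(known_counts: dict[str, int], unknown_total: int) -> dict[str, int]:
    """Split unknown_total whole units across labels in proportion to known_counts."""
    labels = list(known_counts)
    total_known = sum(known_counts.values())
    if total_known == 0:
        # No information to weight by, so share uniformly
        raw = {lbl: unknown_total / len(labels) for lbl in labels}
    else:
        raw = {lbl: known_counts[lbl] / total_known * unknown_total for lbl in labels}

    # Start from the floor of each fractional target
    alloc = {lbl: math.floor(raw[lbl]) for lbl in labels}
    remainder = unknown_total - sum(alloc.values())

    # Hand out the leftover units by largest fractional part, ties broken by label
    order = sorted(labels, key=lambda lbl: (-(raw[lbl] - alloc[lbl]), lbl))
    for lbl in order[:remainder]:
        alloc[lbl] += 1
    return alloc

allocation = hamilton_allocate({"70": 50, "71": 30, "72": 20}, 7)
print(allocation)  # {'70': 4, '71': 2, '72': 1}
```

The targets here are 3.5, 2.1, and 1.4, so the floors sum to 6 and the single leftover unit goes to age 70, which has the largest fractional part. The allocations always sum to the number of Unknown rows.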

The function saves two detailed csv outputs into the output folder: one for the month level flow table and one for the month by age table of rates. It also returns four DataFrames so downstream cells can work in memory.

flows_df contains the month level year on year flow measures.
ages_df contains the month by age metrics including termination and retention rates.
summary_df contains year level unique people counts based on unions of the monthly sets.
monthly_summary_df contains month level people counts derived from the cached sets.
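
The difference between a year level union and a sum of monthly counts can be seen in a small sketch. The month tags and case numbers are invented for illustration.

```python
# Hypothetical cached entrant sets keyed by month tag
entered_sets = {
    "2025-01": {"C1", "C2"},
    "2025-02": {"C2", "C3"},
}

# Union across the year's months counts each case once
unique_entrants = len(set().union(*entered_sets.values()))

# Summing monthly sizes counts C2 twice
summed_entrants = sum(len(s) for s in entered_sets.values())

print(unique_entrants, summed_entrants)  # 3 4
```

Because each monthly set is itself a year on year comparison, the same case can appear in several consecutive months, which is why the yearly table uses unions rather than sums.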


In [None]:
def calculate_yearonyear_flows_and_age_rates(
    first_month: str,
    last_month: str,
    output_base: str = "output",
    redistribute_unknown_age: bool = False,
    age_bins: tuple = None,
    age_labels: tuple = None
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Calculate year-on-year flows (entries/exits) and age-specific rates, with optional
    imputation of missing ages ("Unknown") via proportional redistribution into
    integer age groups. When redistribution is enabled, each Unknown row is reassigned
    to a concrete integer age group using integer allocations (Hamilton apportionment),
    and the 'age_group' value for those rows is updated accordingly.
    """

    # --- Local imports for numerical helpers (keeps dependency scope clear)
    import numpy as np

    # --- Logging the operation and key parameters for traceability
    logging.info(
        f"Calculating year-on-year flows and age rates from {first_month} to {last_month}, "
        f"redistribute_unknown_age={redistribute_unknown_age}"
    )

    # --- Ensure the output directory exists to avoid file write errors later
    os.makedirs(output_base, exist_ok=True)

    # --- Establish age bins and labels (default is 0–106 inclusive as integer groups)
    # If caller didn't supply custom bins/labels, build the defaults.
    if age_bins is None or age_labels is None:
        age_bins = list(range(0, 108))                     # Edges 0..107 -> bins [0,1), [1,2), ..., [106,107)
        age_labels = [str(a) for a in age_bins[:-1]]       # String labels '0'...'106' (no "Unknown")

    # --- Make sure we work with labels that do NOT include 'Unknown' for final outputs
    # If a user passed labels that contain 'Unknown', remove it from the cut-labels.
    labels_for_cut = [lbl for lbl in age_labels if lbl != "Unknown"]  # Used by pd.cut
    final_age_labels = labels_for_cut[:]                               # Final index order for outputs

    # --- Helper: cut ages into groups and set 'Unknown' for out-of-range/missing ages
    def _assign_age_groups_inplace(df: pd.DataFrame) -> None:
        """Add/overwrite df['age_group'] using bins/labels; out-of-range/missing -> 'Unknown'."""
        # Compute categorical age groups for known ages using left-closed bins
        df["age_group"] = pd.cut(
            df["age_in_years"],               # Source age column (assumed present)
            bins=age_bins,                    # Integer bin edges (e.g., [0,1), [1,2), ...)
            labels=labels_for_cut,            # Only concrete integer labels
            right=False,                      # Include left edge, exclude right edge
            include_lowest=True               # Include the lowest bound
        )
        # Convert NaN categories (missing/out-of-range) into the literal string "Unknown"
        df["age_group"] = df["age_group"].astype(object).where(df["age_group"].notna(), "Unknown")

    # --- Helper: proportional integer allocation (Hamilton apportionment)
    def _proportional_integer_allocation(known_counts: pd.Series, unknown_total: int) -> pd.Series:
        """
        Allocate 'unknown_total' integer units across index labels of 'known_counts'
        proportionally to known_counts (or uniformly if all zeros).
        Returns a Series of integer allocations indexed like known_counts.
        """
        # Ensure index order matches final labels and fill missing with zero
        known_counts = known_counts.reindex(final_age_labels, fill_value=0)

        # Sum the known counts to derive proportions
        total_known = known_counts.sum()

        # If no known info exists, split uniformly; else use proportional shares
        if total_known == 0:
            # Uniform shares across all age groups
            raw = pd.Series(np.full(len(final_age_labels), unknown_total / max(len(final_age_labels), 1.0)),
                            index=final_age_labels, dtype=float)
        else:
            # Proportional shares: each group's fraction times unknown_total
            raw = (known_counts / total_known) * unknown_total

        # Base integer allocations via floor
        base = np.floor(raw).astype(int)

        # Remaining units to distribute due to flooring
        remainder = int(unknown_total - base.sum())

        # Fractional remainders for Hamilton method
        frac = raw - base

        # Deterministic tie-break: sort by fractional part desc, then by label asc
        order = sorted(final_age_labels, key=lambda x: (-frac.loc[x], x))

        # Distribute one-by-one to the top 'remainder' labels
        for i in range(remainder):
            base.loc[order[i]] += 1

        # Return allocations as a Series aligned to final_age_labels
        return base.reindex(final_age_labels).astype(int)

    # --- Helper: impute unknown ages row-wise, deterministically, in-place
    def _impute_unknowns_inplace(df: pd.DataFrame, id_col: str = "casenumber") -> None:
        """
        For a given df that already has 'age_group' with some 'Unknown',
        reassign 'Unknown' rows to concrete integer age groups.

        Allocation weights are derived from the df's own composition using
        unique {id_col} counts by known age group. The assignment to rows is
        deterministic (sorted index order) to ensure reproducibility.
        """
        # Identify which rows are currently Unknown
        unknown_idx = df.index[df["age_group"] == "Unknown"]
        # If nothing to impute, bail early
        if len(unknown_idx) == 0:
            return

        # Build weights using unique case counts by known age group
        # (drop Unknown to avoid circularity)
        known_unique = (
            df.loc[df["age_group"] != "Unknown", ["age_group", id_col]]
              .drop_duplicates()
              .groupby("age_group", observed=False)[id_col]
              .nunique()
              .reindex(final_age_labels, fill_value=0)
        )

        # Compute integer allocations across age groups for the Unknown total
        allocations = _proportional_integer_allocation(known_unique, len(unknown_idx))

        # Deterministic row assignment: sort unknown indices so results are stable
        unknown_idx_sorted = sorted(unknown_idx.tolist())

        # Pointer into the unknown index list as we assign chunks
        cursor = 0

        # Assign each block of Unknown rows to its allocated age group
        for lbl in final_age_labels:
            k = int(allocations.get(lbl, 0))     # How many Unknown rows to assign to this label
            if k > 0:
                take = unknown_idx_sorted[cursor: cursor + k]  # Slice next k rows
                df.loc[take, "age_group"] = lbl                # Set their age_group to the label
                cursor += k                                    # Advance the pointer

        # Safety: if any Unknowns remain due to edge cases, place them in the smallest label
        if (df["age_group"] == "Unknown").any():
            leftovers = df.index[df["age_group"] == "Unknown"]
            fallback = final_age_labels[0] if final_age_labels else "0"
            df.loc[leftovers, "age_group"] = fallback

    # --- Containers to accumulate per-month analytics (flows and age-rate details)
    flow_records = []         # List of dicts: overall monthly counts (active, entered, exited)
    age_rate_records = []     # List of dicts: per-month x age-group counts and rates
    snapshots_cur = {}        # Dict: month tag -> set of active casenumbers (current month)
    snapshots_prev = {}       # Dict: month tag -> set of active casenumbers (prev-year same month)
    entered_sets = {}         # Dict: month tag -> set of casenumbers entered this month
    exited_sets = {}          # Dict: month tag -> set of casenumbers exited this month

    # --- Iterate each month in the requested window
    for dt in generate_month_list(first_month, last_month):
        # Compute the month exactly one year earlier for YoY comparisons
        prev_dt = dt - relativedelta(years=1)

        # Skip early months that don't have a prior-year comparison within window
        if prev_dt < parse_month(first_month):
            continue

        # Create a YYYY-MM tag for logging and indexing
        tag = dt.strftime("%Y-%m")
        logging.info(f"Processing month {tag}")

        # Fetch snapshots of active cases at month-end for current and prior-year month
        df_cur = fetch_cases_for_date(last_day_of_month(dt))
        df_prev = fetch_cases_for_date(last_day_of_month(prev_dt))

        # Convert to sets of IDs for fast set arithmetic
        set_cur = set(df_cur["casenumber"])
        set_prev = set(df_prev["casenumber"])

        # Persist these snapshots for later summaries
        snapshots_cur[tag] = set_cur
        snapshots_prev[tag] = set_prev

        # Entrants are in current but not in previous; exits are the opposite
        entered = set_cur - set_prev
        exited = set_prev - set_cur

        # Cache the entrant/exit sets for people-level yearly/monthly summaries
        entered_sets[tag] = entered
        exited_sets[tag] = exited

        # Record high-level flow counts for this month
        flow_records.append({
            "month":                 tag,
            "active_count_current":  len(set_cur),
            "active_count_previous": len(set_prev),
            "entered":               len(entered),
            "exited":                len(exited)
        })

        # Build three DataFrames for age analysis:
        #  - df_term: those who exited (from last year's snapshot)
        #  - df_in:   those who entered (into this year's snapshot)
        #  - df_base: the base population (last year's snapshot)
        df_term = df_prev[df_prev["casenumber"].isin(exited)].copy()
        df_in   = df_cur[df_cur["casenumber"].isin(entered)].copy()
        df_base = df_prev.copy()

        # Assign initial age groups with "Unknown" for missing/out-of-range
        _assign_age_groups_inplace(df_term)
        _assign_age_groups_inplace(df_in)
        _assign_age_groups_inplace(df_base)

        # Optionally impute Unknown ages by redistributing them into integer groups
        if redistribute_unknown_age:
            _impute_unknowns_inplace(df_term)   # Replace 'Unknown' with concrete age_group
            _impute_unknowns_inplace(df_in)     # Replace 'Unknown' with concrete age_group
            _impute_unknowns_inplace(df_base)   # Replace 'Unknown' with concrete age_group

        # --- Diagnostics (can be converted to logging.debug if preferred)
        # print("All records in base:", len(df_base))
        # print("Records with age_group assigned (incl. imputed):", df_base['age_group'].notna().sum())

        # --- Compute counts by integer age group (final_age_labels), filling missing with zeros
        # People entered per age group (unique casenumbers)
        in_counts = (
            df_in.groupby("age_group", observed=False)["casenumber"]
                 .nunique()
                 .reindex(final_age_labels, fill_value=0)
        )
        # People exited per age group (unique casenumbers)
        term_counts = (
            df_term.groupby("age_group", observed=False)["casenumber"]
                  .nunique()
                  .reindex(final_age_labels, fill_value=0)
        )
        # Active people in base per age group (unique casenumbers)
        base_counts = (
            df_base.groupby("age_group", observed=False)["casenumber"]
                  .nunique()
                  .reindex(final_age_labels, fill_value=0)
        )

        # Orders (row counts) by age group — useful if multiple rows per person exist
        in_order_counts = (
            df_in.groupby("age_group", observed=False)["casenumber"]
                 .count()
                 .reindex(final_age_labels, fill_value=0)
        )
        term_order_counts = (
            df_term.groupby("age_group", observed=False)["casenumber"]
                  .count()
                  .reindex(final_age_labels, fill_value=0)
        )
        order_counts = (
            df_base.groupby("age_group", observed=False)["casenumber"]
                  .count()
                  .reindex(final_age_labels, fill_value=0)
        )

        # --- Build age-rate rows for this month across all integer age groups
        for grp in final_age_labels:
            active      = int(base_counts[grp])                 # Active unique people in base
            orders_age  = int(order_counts[grp])                # Active orders (rows) in base
            clients_age = active                                # Alias kept for continuity
            term        = int(term_counts[grp])                 # Exits (unique people)
            ent         = int(in_counts[grp])                   # Entries (unique people)
            rate        = round(term / active, 4) if active else 0.0  # Termination rate (0 when the base is empty)
            retention   = round(1 - rate, 4)                    # Retention (1 - termination)

            # Append a fully specified record for this (month, age_group)
            age_rate_records.append({
                "month":              tag,
                "age_group":          grp,
                "active_count":       active,
                "active_orders_age":  orders_age,
                "active_clients_age": clients_age,
                "entered":            ent,
                "terminations":       term,
                "termination_rate":   rate,
                "retention_rate":     retention
            })

    # --- Convert accumulated lists to DataFrames for downstream use
    flows_df = pd.DataFrame(flow_records)       # Month-level flows
    ages_df  = pd.DataFrame(age_rate_records)   # Month x age-group metrics

    # --- Persist the outputs for reproducibility/auditing
    flows_df.to_csv(
        os.path.join(output_base, f"yearonyear_flows_{first_month}_to_{last_month}.csv"),
        index=False
    )
    ages_df.to_csv(
        os.path.join(output_base, f"termination_and_entry_rates_by_age_{first_month}_to_{last_month}.csv"),
        index=False
    )

    # --- Yearly people-level summary (unique people across months per year)
    flows_df["year"] = pd.to_datetime(flows_df["month"], format="%Y-%m").dt.year  # Extract calendar year
    summary_records = []                                                           # Collector for yearly rows

    # Iterate each year and union people across the year's months (entered/exited/active)
    for year, grp in flows_df.groupby("year"):
        months_in_year = grp["month"].tolist()                                     # Months in this year
        entered_people = len(set().union(*(entered_sets[m] for m in months_in_year)))  # Unique entrants
        exited_people  = len(set().union(*(exited_sets[m]  for m in months_in_year)))  # Unique exits
        active_clients = len(set().union(*(snapshots_cur[m] for m in months_in_year))) # Unique active

        # Append the yearly summary row
        summary_records.append({
            "year":           year,
            "entered_people": entered_people,
            "exited_people":  exited_people,
            "active_clients": active_clients
        })

    # Materialize yearly summary table
    summary_df = pd.DataFrame(summary_records)

    # --- Console print for a quick glance (can swap to logging.info if preferred)
    print("\n=== Yearly Summary: Orders & Clients ===")
    print(summary_df.to_string(index=False))

    # --- Monthly people-level summary using the cached sets
    monthly_records = []  # Collector for monthly rows

    for _, row in flows_df.iterrows():
        m = row["month"]                                # Month tag
        monthly_records.append({
            "month":           m,
            "entered_people":  len(entered_sets[m]),    # Unique entrants that month
            "exited_people":   len(exited_sets[m]),     # Unique exits that month
            "active_clients":  len(snapshots_cur[m])    # Unique active that month
        })

    # Materialise monthly summary table
    monthly_summary_df = pd.DataFrame(monthly_records)

    # Console print for a quick glance
    print("\n=== Monthly Summary: Orders & Clients ===")
    print(monthly_summary_df.to_string(index=False))

    # --- Final log to indicate successful completion
    logging.info("Completed calculation of year-on-year flows and age rates")

    # --- Return the four primary outputs: flows, age metrics, yearly and monthly people summaries
    return flows_df, ages_df, summary_df, monthly_summary_df


## Run block for extraction and feature engineering

This cell is a runnable orchestration block. It sets the date window for the extraction, creates or clears the output folder, then calls the main functions defined earlier to produce the datasets used for forecasting.

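The start_month and end_month values define the month end snapshots that will be pulled from Sirius. The output_base variable controls where all files are written. As written, calculate_yearonyear_flows_and_age_rates skips any month whose same month one year earlier falls before start_month, so the window must span more than twelve months to produce any flow rows, and only months from start_month plus twelve onwards appear in the flow and age tables.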
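
To see which months of a window end up with a year on year comparison, the sketch below reproduces the skip rule with the standard library only. The helper names month_range and months_with_prior_year are hypothetical stand-ins and are not the notebook's own generate_month_list or parse_month.

```python
from datetime import date

def month_range(first: str, last: str) -> list[date]:
    """First day of every month from first to last, both given as 'YYYY-MM'."""
    y, m = map(int, first.split("-"))
    end_y, end_m = map(int, last.split("-"))
    months = []
    while (y, m) <= (end_y, end_m):
        months.append(date(y, m, 1))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return months

def months_with_prior_year(first: str, last: str) -> list[str]:
    """Month tags whose same month one year earlier is inside the window."""
    start = month_range(first, first)[0]
    return [
        d.strftime("%Y-%m")
        for d in month_range(first, last)
        if d.replace(year=d.year - 1) >= start
    ]

print(months_with_prior_year("2024-12", "2025-12"))       # ['2025-12']
print(len(months_with_prior_year("2024-01", "2025-12")))  # 12
```

Under this rule a thirteen month window yields a single month of flows, so a window of at least twenty four months is needed if a full year of lagged flows is wanted.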

The cell runs three main steps.

First it calls export_monthly_reports to pull raw data for each month end, save a per month csv, and build a combined table across the full period.

Second it calls calculate_monthly_active_cases to create an aggregated view of active cases by order type. This provides a quick check on volumes and composition.

Third it calls calculate_yearonyear_flows_and_age_rates with redistribute_unknown_age set to True. This produces final_df, which holds the year on year flow measures, and ages_df, which holds the age specific active counts, entry counts, termination counts, and rates.

These two DataFrames are the main hand off into the forecasting step. Later cells rely on final_df and ages_df being present in memory.

Because this is a notebook, __name__ is usually "__main__" when the cell runs. That means this block will execute when you run the cell, and it will also execute if you run all cells in order. Keep in mind that it clears the output directory, so do not point output_base at a folder that contains anything you need to keep.


In [None]:
# Running the Deputyship forecasting model

if __name__ == "__main__":
    start_year = 2022
    end_year = 2025
    start_month = "2024-12"
    end_month = "2025-12"
    output_base = "output"

    # Prepare output directory
    os.makedirs(output_base, exist_ok=True)
    # Clear the output directory (this deletes every file already in it)
    clear_directory(output_base)
    
    combined_df, summary_df = export_monthly_reports(start_month, end_month)
    print(summary_df)   # a bare expression inside an if block is not displayed, so print explicitly
    print(combined_df)
    
    active_df, monthly_df, yearly_summary = calculate_monthly_active_cases(combined_df, start_month, end_month, output_base=output_base)
    print(yearly_summary)
    print(monthly_df)
    print(active_df)
    

    # Calculate historical flows and age rates
    # Note: this reassigns summary_df to the yearly people summary
    final_df, ages_df, summary_df, monthly_summary_df = calculate_yearonyear_flows_and_age_rates(
         start_month, end_month,
         output_base=output_base,
         redistribute_unknown_age=True)
    
    print(summary_df)
    print(monthly_summary_df)
    print(ages_df)
    print(final_df)

## Quick inspection of the flow output

This cell simply displays final_df in the notebook output.

final_df is the month level table produced by calculate_yearonyear_flows_and_age_rates. Looking at it here is a simple sanity check that the extraction ran successfully and that month tags and counts look reasonable before moving on to forecasting.


In [None]:
final_df

## Quick inspection of the age specific rate table

This cell displays ages_df.

ages_df is the detailed month by age table that contains active counts, entry counts, termination counts, and derived rates such as termination_rate and retention_rate. The stop flow forecast uses these values, so it is worth checking that the month tags are in the expected YYYY-MM form and that age groups look as expected.


In [None]:
ages_df

## Stop flow forecast function

This cell defines stop_flow_forecast, the forecasting method used in this notebook.

The stop flow idea is to update the number of active cases using a simple accounting identity:

active(t) = active(t−1) + entered(t−12) − terminations(t−12)

The identity says that the active caseload at time t equals the previous month's active caseload plus new people entering, minus people leaving. The seasonal assumption in this implementation is that entries and terminations for month t behave like the entries and terminations observed in the same calendar month one year earlier. That is why the function looks up flows at a 12 month lag.

The function uses the historical data as follows.

It takes ages_df as input. ages_df contains active_count for each age group and month, plus entered and terminations for each age group and month.

It identifies the last historical month in the input, then uses the active counts in that month as the starting state for the forecast.

It creates a list of forecast months, then loops over them. For each forecast month and each age group it:

1. Reads the previous month forecast active count for that age group.
2. Looks up the entered and terminations values from the lag month one year earlier in the historical table.
3. Applies the recurrence active(t) = active(t−1) + entered − terminations, with a floor at zero so the forecast never goes negative.
4. Stores the result and updates the previous month state ready for the next month.
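
One step of the recurrence can be sketched in plain Python. The function name stop_flow_step and the numbers are illustrative, and lagged flows are passed in as a simple dictionary rather than looked up in a DataFrame.

```python
def stop_flow_step(
    prev_active: dict[str, int],
    lagged_flows: dict[str, tuple[int, int]],
) -> dict[str, int]:
    """Apply active(t) = max(0, active(t-1) + entered(t-12) - terminations(t-12)) per age group."""
    next_active = {}
    for age, active in prev_active.items():
        # Missing lagged data is treated as zero entries and zero terminations
        entered, terminations = lagged_flows.get(age, (0, 0))
        next_active[age] = max(0, active + entered - terminations)
    return next_active

state = {"70": 100, "71": 5, "72": 30}
lagged = {"70": (12, 7), "71": (0, 9)}  # (entered, terminations) from one year earlier

state = stop_flow_step(state, lagged)
print(state)  # {'70': 105, '71': 0, '72': 30}
```

Age 71 shows the floor at zero, and age 72 shows that a group with no lagged record simply carries its previous value forward.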

The function returns four tables.

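per_age_df contains the forecast by month and age group.
monthly contains month level totals of active orders and active clients across age groups.
yearly sums the monthly client totals within each calendar year of the forecast horizon. Because the inputs are month end stock counts, this is a sum of monthly stocks rather than a count of distinct cases.
df is a copy of the input ages_df, with the month column converted to datetime, returned for reference.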

This function is intentionally simple. Its value is in producing a transparent baseline forecast that preserves seasonality, rather than in fitting a complex statistical model.


In [None]:
def stop_flow_forecast(
    ages_df: pd.DataFrame,
    periods: int = 12
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Stop-flow forecast:
      active_t = active_{t-1} + entered_{t-12} - terminations_{t-12}, floored at zero

    Returns:
      - per_age_df          : month, age_group, active_forecast,
                              active_orders_age_fc, active_clients_age_fc
      - monthly_summary_df  : month, total_active_orders, total_active_clients, year
      - yearly_summary_df   : year, yearly_active_clients
      - base_df             : copy of the input ages_df with 'month' parsed to datetime
    """
    # Make a copy of the input data so we don't change the original
    df = ages_df.copy()

    # Ensure 'month' is a column; if it's an index, reset it, otherwise error
    if 'month' not in df.columns:
        if 'month' in df.index.names:
            df = df.reset_index()
        else:
            raise ValueError("Input ages_df must have a 'month' column or index level")

    # Convert 'month' column to date format, just to be sure
    df['month'] = pd.to_datetime(df['month'])

    # Find the last (most recent) month in the historical data
    last_hist = df['month'].max()

    # Make a list of months to forecast (e.g. next 12 months)
    fc_months = [last_hist + relativedelta(months=i) for i in range(1, periods+1)]

    # Get the data for the last historical month
    base = df[df['month']==last_hist]
    
    
    # Store the active counts for each age group from the last month as a starting point
    prev_active      = base.set_index('age_group')['active_count'].to_dict()
    prev_orders      = base.set_index('age_group')['active_orders_age'].to_dict()
    prev_clients     = base.set_index('age_group')['active_clients_age'].to_dict()

    records = []  # This will store forecast results for each month and age group

    # For each month in the forecast period
    for m in fc_months:
        print(f"month: {m}")
        # Forecast months use historical flows only; no database query is made here
        
        # Find the matching month from 12 months ago (for stop-flow calculation)
        lag = m - relativedelta(years=1)
        # For each age group
        for age in df['age_group'].unique():
            #print(f"age: {age}")
            # Get previous forecasted counts, or 0 if not found
            a_prev     = prev_active.get(age, 0)
            #print(f"1.a_prev: {a_prev}")
            o_prev     = prev_orders.get(age, 0)
            c_prev     = prev_clients.get(age, 0)
            #print(f"c_prev: {c_prev}")
            
            # Find the data for this age group from 12 months ago (if it exists)
            row        = df[(df['month']==lag)&(df['age_group']==age)]
            # Get 'entered' and 'terminations' values; use 0 if missing
            entered    = int(row['entered'     ].iloc[0]) if not row.empty else 0
            #print(f"entered: {entered}")
            
            term       = int(row['terminations'].iloc[0]) if not row.empty else 0
            #print(f"term: {term}")
            
            # Calculate new forecast: previous + entered - terminated (but not below zero)
            a_fc = max(0, a_prev + entered - term)
            #print(f"a_fc: {a_fc}")
            o_fc = max(0, o_prev + entered - term)
            c_fc = max(0, c_prev + entered - term)
            #print(f"c_fc: {c_fc}")
            
            # Store the result for this month and age group
            records.append({
                'month':                 m,
                'age_group':             age,
                'active_forecast':       a_fc, 
                'active_orders_age_fc':  o_fc,
                'active_clients_age_fc': c_fc
            })

            # Update the previous values for the next month in the loop
            prev_active[age]  = a_fc
            prev_orders[age]  = o_fc
            prev_clients[age] = c_fc

    # Convert all forecast records into a DataFrame (table)
    per_age_df = pd.DataFrame(records)

    # Make a summary table for each forecast month (total across ages)
    monthly = (
        per_age_df
        .groupby('month')
        .agg(
            total_active_orders=('active_orders_age_fc', 'sum'),
            total_active_clients=('active_clients_age_fc', 'sum')
            #total_active=('active_forecast', 'sum')
        )
        .reset_index()
    )
    # print("\n=== Monthly Stop‐Flow Summary ===")
    # print(monthly.to_string(index=False))

    # Add a 'year' column for yearly summary
    monthly['year'] = monthly['month'].dt.year

    # Make a summary table for each year (totals across months)
    yearly = (
        monthly
        .groupby('year')
        .agg(
            #yearly_active_orders = ('total_active_orders', 'sum'),   # (Commented out)
            yearly_active_clients= ('total_active_clients','sum')
        )
        .reset_index()
    )
    # print("\n=== Yearly Stop‐Flow Summary ===")
    # print(yearly.to_string(index=False))

    # Return all results and the original input data
    return per_age_df, monthly, yearly, df


## Visualisation and insight generation

The next code cell defines a helper function that turns the combined historical and forecast table into charts and a short insights note.

The input to the function is a DataFrame that contains at least a month column, an age column, and columns that represent active caseload counts for clients and orders. The function is defensive about naming. If the exact column names are not present it looks for common alternatives and renames them into a standard form so the plotting code can stay simple.

The function also standardises dates. Month values are converted to pandas datetime so the x axis in charts behaves correctly. If you pass hist_last_month, the function draws a vertical marker at that month to show where the historical data ends and the forecast begins.

Several visual outputs are produced in the output folder.

A totals line chart shows total active clients and total active orders over time. It also adds a rough uncertainty band using a Poisson style approximation for count data. For a monthly count x the function computes a 95 percent interval as:

lower = max(0, x − z √(phi x))
upper = x + z √(phi x)

where z is 1.96 and phi is a dispersion factor. When phi is 1 this matches the basic Poisson square root variance assumption. Larger values of phi widen the band to reflect over dispersion.
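
The band formula can be sketched as a small function. The name poisson_band is hypothetical and the count of 400 is an invented example.

```python
import math

def poisson_band(x: float, phi: float = 1.0, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval x +/- z * sqrt(phi * x), with the lower bound floored at zero."""
    half_width = z * math.sqrt(phi * x)
    return max(0.0, x - half_width), x + half_width

lower, upper = poisson_band(400)               # sqrt(400) = 20, so roughly 400 +/- 39.2
wide_lower, wide_upper = poisson_band(400, phi=4.0)  # phi = 4 doubles the half width
small_lower, small_upper = poisson_band(1)     # lower bound is clipped to 0

print(round(lower, 1), round(upper, 1))
print(round(wide_lower, 1), round(wide_upper, 1))
print(small_lower)
```

Since the half width grows with the square root of phi, quadrupling the dispersion factor doubles the width of the band.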

A stacked area chart groups ages into five year bands and shows how the composition of the caseload changes over time.

A heatmap shows active clients by five year age band and month, which helps to spot cohort patterns and seasonal effects.
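
Grouping single year ages into five year bands can be sketched as below. The band label format such as 65-69, the function name five_year_band, and the counts are assumptions for illustration. The notebook function may label its bands differently.

```python
from collections import defaultdict

def five_year_band(age_label: str) -> str:
    """Map a single year age label such as '67' to a band label such as '65-69'."""
    if not age_label.isdigit():
        return "Unknown"
    start = int(age_label) // 5 * 5
    return f"{start}-{start + 4}"

# Hypothetical active client counts by single year of age for one month
counts_by_age = {"65": 10, "67": 5, "70": 8, "Unknown": 2}

counts_by_band: dict[str, int] = defaultdict(int)
for age, count in counts_by_age.items():
    counts_by_band[five_year_band(age)] += count

print(dict(counts_by_band))  # {'65-69': 15, '70-74': 8, 'Unknown': 2}
```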

A top movers chart compares the first and last month in the horizon and highlights which ages contribute most to the change in caseload.
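
A top movers ranking can be sketched as the change between the first and last month per age, sorted by absolute size. Ranking by absolute change is an assumption about how contribution is measured, and the counts are invented.

```python
# Hypothetical active client counts by age in the first and last month of the horizon
first_month = {"70": 100, "71": 80, "72": 60}
last_month = {"70": 90, "71": 95, "72": 61}

# Change per age, treating an age missing from either month as zero
ages = sorted(set(first_month) | set(last_month))
changes = {age: last_month.get(age, 0) - first_month.get(age, 0) for age in ages}

top_k = 2
top_movers = sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

print(top_movers)  # [('71', 15), ('70', -10)]
```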

A ratio chart shows orders per 100 clients over time, which can flag shifts in how many orders exist per person.

In addition to charts, the function writes an insights markdown file summarising horizon start and end, total change in clients and orders, peak months, and the biggest age group increases and decreases.

The function returns a small dictionary of headline metrics so the caller can log or store the key results programmatically.


In [None]:
# -----------------------------------------------
# Visualisation & Insight Analysis
# -----------------------------------------------
def visualize_and_analyze_deputyship_forecasts(
    combined_df: pd.DataFrame,
    output_dir: str = "output",
    hist_last_month: "pd.Timestamp|str|None" = None,
    top_k: int = 8,
    axis_start_month: "pd.Timestamp|str|None" = None,
) -> dict:
    """
    Visualize historical + forecasted active caseloads (clients & orders) and produce key insights.

    Inputs
    ------
    combined_df : DataFrame with at least:
        - 'month' (datetime or string)
        - 'age' (string or int age group)
        - 'active_caseloads_clients' (or fallbacks: 'active_clients_age', 'active_clients_age_fc')
        - 'active_caseloads_orders'  (or fallbacks: 'active_orders_age',  'active_orders_age_fc')
    output_dir : where to save PNGs and insights markdown.
    hist_last_month : last historical month (for a vertical cutoff line). If None, no cutoff line is drawn.
    top_k : how many ages to show in the top-movers charts and in the top-movers share insight.
    axis_start_month : first month included in the totals, ratio and stacked area charts
        (the heatmap uses all months). If None, the first month with a non-zero total is used.

    Returns
    -------
    A small dict of computed summary metrics for programmatic use.
    """

    os.makedirs(output_dir, exist_ok=True)

    df = combined_df.copy()

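    # --- Ensure 'month' is datetime (handles '2025-07' and 'Jul-25' style strings)
    if not np.issubdtype(df['month'].dtype, np.datetime64):
        # Try the explicit 'Jul-25' format first: generic parsing may read it as
        # 25 July of the current year rather than July 2025.
        try:
            df['month'] = pd.to_datetime(df['month'], format="%b-%y")
        except (ValueError, TypeError):
            # Not in 'Jul-25' form, so fall back to generic parsing (e.g. '2025-07')
            df['month'] = pd.to_datetime(df['month'])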

    # --- Standardise column names via safe coalescing (guards against earlier rename differences)
    def coalesce_col(frame, candidates, new_name):
        for c in candidates:
            if c in frame.columns:
                frame.rename(columns={c: new_name}, inplace=True)
                return new_name
        # if none exist, create an empty numeric column
        frame[new_name] = 0
        return new_name

    clients_col = coalesce_col(
        df,
        ["active_caseloads_clients", "active_clients_age", "active_clients_age_fc"],
        "active_caseloads_clients"
    )
    orders_col = coalesce_col(
        df,
        ["active_caseloads_orders", "active_orders_age", "active_orders_age_fc"],
        "active_caseloads_orders"
    )

    # --- Parse or infer axis_start_month ---
    if axis_start_month is not None and not isinstance(axis_start_month, pd.Timestamp):
        axis_start_month = pd.to_datetime(axis_start_month)

    if axis_start_month is None:
        # Auto-start at first non-zero month to avoid leading zeros
        totals_all = (
            df.groupby('month')[[clients_col, orders_col]]
              .sum(min_count=1)
              .fillna(0)
        )
        nz = totals_all.sum(axis=1) > 0
        axis_start_month = nz.index[nz.argmax()] if nz.any() else df['month'].min()

    
    # --- Tidy 'age' to string and keep a stable order
    if 'age' not in df.columns:
        # Fallback if upstream still calls the column 'age_group'
        if 'age_group' in df.columns:
            df.rename(columns={'age_group': 'age'}, inplace=True)
        else:
            raise ValueError("Expected an 'age' (or 'age_group') column in combined_df.")
    df['age'] = df['age'].astype(str)

    # --- Sort for plotting
    df = df.sort_values(['month', 'age']).reset_index(drop=True)

    # --- Identify last historical month (line on charts) if provided as str
    if hist_last_month is not None and not isinstance(hist_last_month, pd.Timestamp):
        hist_last_month = pd.to_datetime(hist_last_month)

    # # ========== 1) Total caseloads over time (line) ==========
    # totals = (
    #     df.groupby('month', as_index=False)
    #       .agg(total_clients=(clients_col, 'sum'),
    #            total_orders=(orders_col, 'sum'))
    # )

    # # Plot
    # plt.figure(figsize=(11, 6))
    # plt.plot(totals['month'], totals['total_clients'], label='Total active clients')
    # if df[orders_col].sum() > 0:
    #     plt.plot(totals['month'], totals['total_orders'], label='Total active orders')
    # if hist_last_month is not None:
    #     plt.axvline(hist_last_month, linestyle='--', linewidth=1, label='Last historical month')
    # plt.title('Active caseloads over time: clients vs orders')
    # plt.xlabel('Month'); plt.ylabel('Count'); plt.legend(); plt.tight_layout()
    # path_total = os.path.join(output_dir, "01_totals_clients_orders.png")
    # plt.savefig(path_total, dpi=180); plt.close()

    # ========== 1) Total caseloads over time (line + 95% CIs) ==========
    
    totals = (
        df.groupby('month', as_index=False)
          .agg(total_clients=(clients_col, 'sum'),
               total_orders=(orders_col, 'sum'))
    )
    
    # Start the x-axis at axis_start_month, which was parsed or inferred above
    totals = totals[totals['month'] >= axis_start_month]
    
    # def _poisson_ci(series: pd.Series, z: float = 1.96):
    #     """95% CI for counts via Poisson approx: x ± z*sqrt(x), floored at 0."""
    #     x = series.to_numpy(dtype=float)
    #     sd = np.sqrt(np.clip(x, 0, None))
    #     lower = np.maximum(0, x - z * sd)
    #     upper = x + z * sd
    #     return lower, upper

    def _poisson_ci(series: pd.Series, z: float = 1.96, phi: float = 1.0):
        """Approximate 95% interval for counts: x ± z*sqrt(phi*x), floored at 0."""
        x = series.to_numpy(dtype=float)
        sd = np.sqrt(phi * np.clip(x, 0, None))  # phi=1 → Poisson; phi>1 → over-dispersed
        lower = np.maximum(0, x - z * sd)
        upper = x + z * sd
        return lower, upper

    # Clients CI
    c_lo, c_hi = _poisson_ci(totals['total_clients'])
    
    # Orders CI (only if any orders exist)
    has_orders = (df[orders_col].sum() > 0)
    if has_orders:
        o_lo, o_hi = _poisson_ci(totals['total_orders'])
    
    plt.figure(figsize=(11, 6))
    
    # Plot CIs first so lines sit on top
    plt.fill_between(totals['month'], c_lo, c_hi, alpha=0.2, label='95% CI (clients)', zorder=1)
    if has_orders:
        plt.fill_between(totals['month'], o_lo, o_hi, alpha=0.15, label='95% CI (orders)', zorder=1)
    
    # Now the lines
    plt.plot(totals['month'], totals['total_clients'], label='Total active clients', zorder=2)
    if has_orders:
        plt.plot(totals['month'], totals['total_orders'], label='Total active orders', zorder=2)
    
    # Historical cutoff marker
    if hist_last_month is not None:
        plt.axvline(hist_last_month, linestyle='--', linewidth=1, label='Last historical month')
    
    plt.title('Active caseloads over time: clients vs orders (with 95% CIs)')
    plt.xlabel('Month'); plt.ylabel('Count'); plt.legend(); plt.tight_layout()
    
    path_total = os.path.join(output_dir, "01_totals_clients_orders.png")
    plt.savefig(path_total, dpi=180); plt.close()

    # # ========== 2) Stacked area by age (clients, top_k) ==========
    # # pick top_k age groups by average presence across period
    # top_ages = (
    #     df.groupby('age', as_index=False)[clients_col].mean()
    #       .sort_values(clients_col, ascending=False)['age']
    #       .head(top_k)
    #       .tolist()
    # )
    # df_area = df.copy()
    # df_area['age_area'] = np.where(df_area['age'].isin(top_ages), df_area['age'], 'Other')

    # area_wide = (
    #     df_area.groupby(['month', 'age_area'], as_index=False)[clients_col].sum()
    #            .pivot(index='month', columns='age_area', values=clients_col)
    #            .fillna(0)
    # )

    # plt.figure(figsize=(11, 6))
    # # stack in deterministic order: top ages (descending by latest), then Other if present
    # ordered_cols = [c for c in top_ages if c in area_wide.columns]
    # if 'Other' in area_wide.columns:
    #     ordered_cols = ordered_cols + ['Other']
    # plt.stackplot(area_wide.index, area_wide[ordered_cols].T, labels=ordered_cols)
    # if hist_last_month is not None:
    #     plt.axvline(hist_last_month, linestyle='--', linewidth=1, label='Last historical month')
    # plt.title(f'Active clients by age (stacked), top {top_k} groups')
    # plt.xlabel('Month'); plt.ylabel('Active clients'); plt.legend(loc='upper left'); plt.tight_layout()
    # path_area = os.path.join(output_dir, "02_clients_stacked_area_by_age.png")
    # plt.savefig(path_area, dpi=180); plt.close()


    # ===== 2) Stacked area by age (clients) — 5-year bands from 0, axis starts at start_month, CI band =====
    # axis_start_month was already parsed (or inferred from the data) earlier in the function
    
    # Filter from axis_start_month forward to avoid plotting pre-start zeros
    df_band = df.copy()
    if axis_start_month is not None:
        df_band = df_band[df_band['month'] >= axis_start_month]
    
    # Parse ages to integers (robust to labels like "70-74" by taking the first number)
    def _to_int_age(x):
        s = str(x)
        num = ''.join(ch for ch in s.split('-')[0] if ch.isdigit())
        try:
            return int(num)
        except Exception:
            return np.nan
    
    df_band['age_int'] = df_band['age'].apply(_to_int_age)
    df_band = df_band[df_band['age_int'].notna()].copy()
    df_band['age_int'] = df_band['age_int'].astype(int)
    
    # Map to 5-year bands (START AT 0), cap high ages so the final band is 105–109 (covers 105+)
    def _band_label(a, width=5, cap=109):
        a = max(0, min(int(a), cap))
        lo = (a // width) * width
        hi = lo + width - 1
        return f"{lo:02d}-{hi:02d}"
    
    df_band['age_band'] = df_band['age_int'].apply(_band_label)
    
    # Build a COMPLETE ordered list of bands from 0 up to the cap (ensures we start at 00-04)
    _width = 5
    _cap   = 109
    full_bands = [f"{lo:02d}-{lo+_width-1:02d}" for lo in range(0, _cap + 1, _width)]  # 00-04, 05-09, ..., 105-109
    
    # Wide table: month x 5-year band (sum clients per band and month), then reindex to include all bands from 0
    area_wide = (
        df_band.groupby(['month', 'age_band'], as_index=False)[clients_col].sum()
               .pivot(index='month', columns='age_band', values=clients_col)
               .reindex(columns=full_bands, fill_value=0)   # <-- force presence from 00-04 upward
               .fillna(0)
    )
    
    plt.figure(figsize=(11, 6))
    # Plot in strict ascending band order from 00-04 upwards
    plt.stackplot(area_wide.index, area_wide[full_bands].T, labels=full_bands)
    
    # Vertical line for last historical month if provided
    if hist_last_month is not None:
        plt.axvline(hist_last_month, linestyle='--', linewidth=1, label='Last historical month')
    
    # ---- Uncertainty band (95% Poisson CI) around the total clients series ----
    totals_band = area_wide.sum(axis=1)
    lower, upper = _poisson_ci(totals_band)  # same interval helper as the totals chart (phi = 1)
    plt.fill_between(area_wide.index, lower, upper, alpha=0.2, label='95% CI (total)')
    
    plt.title('Active clients by age (5-year bands from 0)')
    plt.xlabel('Month'); plt.ylabel('Active clients')
    plt.legend(loc='upper left', ncol=3)  # more columns so the legend fits many bands
    plt.tight_layout()
    
    path_area = os.path.join(output_dir, "02_clients_stacked_area_5yr_bands_from0.png")
    plt.savefig(path_area, dpi=180); plt.close()


    # ========== 3) Heatmap (clients) — 5-year age bands x month ==========

    # 1) Parse 'age' to an integer, reusing _to_int_age defined for the stacked area chart
    #    (handles "70" or "70-74" by taking the left number)
    
    df_heat = df.copy()
    df_heat['age_int'] = df_heat['age'].apply(_to_int_age)
    df_heat = df_heat[df_heat['age_int'].notna()].copy()
    df_heat['age_int'] = df_heat['age_int'].astype(int)
    
    # 2) Map to 5-year bands, reusing _band_label defined for the stacked area chart
    #    (cap at 109 so the final band is 105-109 and catches 105+)
    
    df_heat['age_band'] = df_heat['age_int'].apply(_band_label)
    
    # 3) Build band x month matrix (sum of clients per band/month)
    heat = (
        df_heat.pivot_table(index='age_band', columns='month', values=clients_col, aggfunc='sum')
               .fillna(0)
    )
    
    # 4) Sort age bands numerically by their lower bound
    heat = heat.reindex(sorted(heat.index, key=lambda s: int(s.split('-')[0])))
    
    # 5) Plot
    plt.figure(figsize=(12, 7))
    plt.imshow(heat.values, aspect='auto', interpolation='nearest')
    plt.colorbar(label='Active clients')
    
    # y-axis: band labels
    plt.yticks(ticks=np.arange(len(heat.index)), labels=heat.index)
    
    # x-axis: month labels (downsample to ~12 ticks for readability)
    x_idx = np.arange(len(heat.columns))
    step = max(1, len(heat.columns)//12)
    plt.xticks(
        ticks=x_idx[::step],
        labels=[m.strftime('%Y-%m') for m in heat.columns][::step],
        rotation=45, ha='right'
    )
    
    plt.title('Heatmap: Active clients by 5-year age band and month')
    plt.tight_layout()
    path_heat = os.path.join(output_dir, "03_heatmap_clients_ageband_month.png")
    plt.savefig(path_heat, dpi=180); plt.close()


    # ========== 4) Top movers across forecast horizon (delta by age) ==========
    # Determine anchor months for delta
    # If we have a historical cutoff, compare last hist vs last overall; else earliest vs latest.
    if hist_last_month is not None and (df['month'] <= hist_last_month).any():
        m0 = df.loc[df['month'] <= hist_last_month, 'month'].max()
    else:
        m0 = df['month'].min()
    m1 = df['month'].max()

    # snap0 = df[df['month'] == m0].groupby('age', as_index=False)[clients_col, orders_col].sum()
    # snap1 = df[df['month'] == m1].groupby('age', as_index=False)[clients_col, orders_col].sum()

    cols_for_delta = list(dict.fromkeys([clients_col, orders_col]))  # de-dup, keep order
    snap0 = df[df['month'] == m0].groupby('age', as_index=False)[cols_for_delta].sum()
    snap1 = df[df['month'] == m1].groupby('age', as_index=False)[cols_for_delta].sum()

    delta = snap1.merge(snap0, on='age', suffixes=('_end', '_start'), how='outer').fillna(0)
    delta['delta_clients'] = delta[f'{clients_col}_end'] - delta[f'{clients_col}_start']
    delta['delta_orders']  = delta[f'{orders_col}_end']  - delta[f'{orders_col}_start']

    # Top increases & decreases for clients
    inc_clients = delta.sort_values('delta_clients', ascending=False).head(top_k)
    dec_clients = delta.sort_values('delta_clients', ascending=True).head(top_k)

    # Plot clients movers (bar)
    plt.figure(figsize=(11, 6))
    plt.bar(inc_clients['age'], inc_clients['delta_clients'], label='Increases')
    plt.bar(dec_clients['age'], dec_clients['delta_clients'], label='Decreases')
    plt.title(f'Top movers by age (clients): {m0:%Y-%m} → {m1:%Y-%m}')
    plt.xlabel('Age'); plt.ylabel('Δ Active clients'); plt.legend(); plt.tight_layout()
    path_movers_clients = os.path.join(output_dir, "04_top_movers_clients.png")
    plt.savefig(path_movers_clients, dpi=180); plt.close()

    # If orders present, do the same
    if df[orders_col].sum() > 0:
        inc_orders = delta.sort_values('delta_orders', ascending=False).head(top_k)
        dec_orders = delta.sort_values('delta_orders', ascending=True).head(top_k)

        plt.figure(figsize=(11, 6))
        plt.bar(inc_orders['age'], inc_orders['delta_orders'], label='Increases')
        plt.bar(dec_orders['age'], dec_orders['delta_orders'], label='Decreases')
        plt.title(f'Top movers by age (orders): {m0:%Y-%m} → {m1:%Y-%m}')
        plt.xlabel('Age'); plt.ylabel('Δ Active orders'); plt.legend(); plt.tight_layout()
        path_movers_orders = os.path.join(output_dir, "05_top_movers_orders.png")
        plt.savefig(path_movers_orders, dpi=180); plt.close()
    else:
        path_movers_orders = None

    # ========== 5) Ratios & peaks ==========
    # Ratio: orders per 100 clients (where clients > 0)
    totals['orders_per_100_clients'] = np.where(
        totals['total_clients'] > 0,
        totals['total_orders'] * 100.0 / totals['total_clients'],
        np.nan
    )

    plt.figure(figsize=(11, 5))
    plt.plot(totals['month'], totals['orders_per_100_clients'])
    if hist_last_month is not None:
        plt.axvline(hist_last_month, linestyle='--', linewidth=1, label='Last historical month')
    plt.title('Orders per 100 clients (level & trend)')
    plt.xlabel('Month'); plt.ylabel('Orders per 100 clients') 
    if hist_last_month is not None: plt.legend()
    plt.tight_layout()
    path_ratio = os.path.join(output_dir, "06_orders_per_100_clients.png")
    plt.savefig(path_ratio, dpi=180); plt.close()

    # Peak months
    peak_clients_idx = totals['total_clients'].idxmax()
    peak_orders_idx  = totals['total_orders'].idxmax() if df[orders_col].sum() > 0 else None
    peak_clients_month = totals.loc[peak_clients_idx, 'month']
    peak_orders_month  = totals.loc[peak_orders_idx,  'month'] if peak_orders_idx is not None else None

    # ========== 6) Insights (markdown) ==========
    hist_total_clients = totals.loc[totals['month'] == m0, 'total_clients'].sum() if (totals['month'] == m0).any() else np.nan
    end_total_clients  = totals.loc[totals['month'] == m1, 'total_clients'].sum()
    hist_total_orders  = totals.loc[totals['month'] == m0, 'total_orders' ].sum() if (totals['month'] == m0).any() else np.nan
    end_total_orders   = totals.loc[totals['month'] == m1, 'total_orders' ].sum()

    abs_change_clients = end_total_clients - hist_total_clients if pd.notna(hist_total_clients) else np.nan
    pct_change_clients = (abs_change_clients / hist_total_clients * 100.0) if pd.notna(hist_total_clients) and hist_total_clients else np.nan

    abs_change_orders  = end_total_orders - hist_total_orders if pd.notna(hist_total_orders) else np.nan
    pct_change_orders  = (abs_change_orders / hist_total_orders * 100.0) if pd.notna(hist_total_orders) and hist_total_orders else np.nan

    # Contribution of top movers (clients)
    movers_clients = delta[['age', 'delta_clients']].sort_values('delta_clients', ascending=False)
    pos_sum = movers_clients[movers_clients['delta_clients'] > 0]['delta_clients'].sum()
    top_contrib = movers_clients.head(top_k)['delta_clients'].sum()
    share_topk = (top_contrib / pos_sum * 100.0) if pos_sum else np.nan

    insights_lines = [
        "# Deputyship caseload forecast — key insights",
        f"- **Horizon compared:** {m0:%Y-%m} → {m1:%Y-%m}",
        f"- **Total clients:** {int(end_total_clients):,} at end; change = {int(abs_change_clients):,} ({pct_change_clients:0.1f}%)" if pd.notna(pct_change_clients) else f"- **Total clients (end):** {int(end_total_clients):,}",
        f"- **Total orders:** {int(end_total_orders):,} at end; change = {int(abs_change_orders):,} ({pct_change_orders:0.1f}%)" if pd.notna(pct_change_orders) else f"- **Total orders (end):** {int(end_total_orders):,}",
        f"- **Peak clients month:** {peak_clients_month:%Y-%m} (value: {int(totals.loc[peak_clients_idx,'total_clients']):,})",
    ]
    if peak_orders_month is not None:
        insights_lines.append(f"- **Peak orders month:** {peak_orders_month:%Y-%m} (value: {int(totals.loc[peak_orders_idx,'total_orders']):,})")
    if pd.notna(share_topk):
        insights_lines.append(f"- **Top {top_k} age groups account for ~{share_topk:0.1f}% of the positive change in clients.**")

    # List the top 5 client growers & decliners
    top_incr = movers_clients.head(5)
    top_decl = movers_clients.tail(5).sort_values('delta_clients')
    # Use a signed format so a negative value among the top 5 is not printed as "+-N"
    insights_lines.append("\n**Top 5 age increases (clients):** " + ", ".join(f"{a} ({int(d):+,})" for a, d in zip(top_incr['age'], top_incr['delta_clients'])))
    insights_lines.append("**Top 5 age decreases (clients):** " + ", ".join(f"{a} ({int(d):,})" for a, d in zip(top_decl['age'], top_decl['delta_clients'])))

    # Write markdown
    insights_path = os.path.join(output_dir, "00_forecast_insights.md")
    with open(insights_path, "w", encoding="utf-8") as f:
        f.write("\n".join(insights_lines))

    # Console echo for quick read
    print("\n".join(insights_lines))
    print(f"\nSaved charts:\n- {path_total}\n- {path_area}\n- {path_heat}\n- {path_movers_clients}\n"
          f"{'- ' + path_movers_orders if path_movers_orders else ''}\n- {path_ratio}\nInsights → {insights_path}")

    # Return a small metrics dict if you want to log/store programmatically
    return {
        "horizon_start": m0,
        "horizon_end": m1,
        "end_total_clients": int(end_total_clients),
        "end_total_orders":  int(end_total_orders),
        "abs_change_clients": int(abs_change_clients) if pd.notna(abs_change_clients) else None,
        "pct_change_clients": float(pct_change_clients) if pd.notna(pct_change_clients) else None,
        "abs_change_orders":  int(abs_change_orders) if pd.notna(abs_change_orders) else None,
        "pct_change_orders":  float(pct_change_orders) if pd.notna(pct_change_orders) else None,
        "peak_clients_month": peak_clients_month,
        "peak_orders_month":  peak_orders_month
    }




## Run block for forecasting, final outputs, and charts

The next code cell runs the forecasting and reporting steps that sit on top of the extracted data.

It assumes that ages_df has already been created by the earlier extraction and feature engineering run. If you have not run that section yet, the commented extraction lines show what needs to be executed first.

The first action is to run stop_flow_forecast. That produces a forecast by month and age group for the chosen number of periods.

The cell then prepares two tables with a consistent schema.

One table is the historical age specific table built from ages_df. Columns are renamed into a reporting friendly naming convention such as active_caseloads_clients and new_deputyships.

The other table is the forecast table built from per_age_df. Its columns are renamed into the same reporting convention.

These two tables are concatenated into one combined table, then a subset of the most important columns is selected. Month is formatted into a short label like Jan-25 for readability in the saved table.
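
The combine-and-format step can be sketched on toy data as below. The notebook itself uses the helper get_combined_age_deputyship_table, whose internals are not shown in this section, so the plain `pd.concat` call and the demo frames here are assumptions for illustration.

```python
import pandas as pd

# Made-up historical and forecast rows that already share one schema
hist = pd.DataFrame({
    "month": pd.to_datetime(["2025-11-01", "2025-12-01"]),
    "age": ["70", "70"],
    "active_caseloads_clients": [100, 102],
})
fc = pd.DataFrame({
    "month": pd.to_datetime(["2026-01-01"]),
    "age": ["70"],
    "active_caseloads_clients": [104],
})

combined = pd.concat([hist, fc], ignore_index=True)

# Take an explicit copy before editing so pandas does not warn about writing to a slice
report = combined[["month", "age", "active_caseloads_clients"]].copy()
report["month"] = report["month"].dt.strftime("%b-%y")

# The short labels parse back to month starts when the format is given explicitly
parsed = pd.to_datetime(report["month"], format="%b-%y")
```

Passing the explicit format when reading the labels back avoids any ambiguity about whether the two digits are a day or a year.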

The combined historical and forecast table is saved as a csv file in the output folder. This file is the main handover artifact for downstream users.

Finally, the cell calls the visualisation and insight function. It saves charts and an insights markdown file into the output folder. The hist_last_month argument controls where the function draws the boundary between historical and forecast data, and axis_start_month is used to avoid leading empty months on the x axis.


In [None]:
# Running the Deputyship forecasting model

if __name__ == "__main__":
    start_year = 2022
    end_year = 2025
    start_month = "2024-12"
    end_month = "2025-12"
    output_base="output"

    # The data extraction section below has already been run earlier in the notebook.
    # If necessary, uncomment the lines below:
    
    ## ------------------------------ Data Extraction and Engineering ----------------------------------##
    # # Prepare output directory
    # os.makedirs(output_base, exist_ok=True)
    # # **Clear the output directory, not the Excel filepath**
    # clear_directory(output_base)
    
    # combined_df, summary_df = export_monthly_reports(start_month, end_month)
    # summary_df
    # print(combined_df)
    
    # active_df, monthly_df, yearly_summary = calculate_monthly_active_cases(combined_df, start_month, end_month, output_base="output")
    # yearly_summary
    # print(monthly_df)
    # print(active_df)
    

    # # Calculate historical flows and age rates
    # final_df, ages_df, summary_df, monthly_summary_df = calculate_yearonyear_flows_and_age_rates(
    #      start_month, end_month,
    #      redistribute_unknown_age=True)
    
    # print(summary_df)
    # print(monthly_summary_df)
    # print(ages_df)
    # print(final_df)
    ## --------------------------End of Data Extraction and Engineering-------------------------------------## 

    
    # Compute the stop-flow forecast for 12 periods
    per_age_df, monthly, yearly, df = stop_flow_forecast(ages_df, periods=12)    
    
    print(per_age_df)
    print(monthly)
    print(yearly)
    print(df)

    # Combine the historical data and forecasts
    current_age_specific_deputyship_agg = ages_df.copy()
    current_age_specific_deputyship_agg = current_age_specific_deputyship_agg.rename(
        columns={
            'age_group': 'age',
            #'active_count': 'active_caseloads',
            'entered': 'new_deputyships',
            'terminations': 'terminated',
            'active_clients_age': 'active_caseloads_clients',
            'active_orders_age':  'active_caseloads_orders'

        }
    )
    
    
    forecasted_age_specific_deputyship_agg = per_age_df.copy()
    forecasted_age_specific_deputyship_agg = forecasted_age_specific_deputyship_agg.rename(
        columns={
            'age_group': 'age',
            #'active_forecast': 'active_caseloads',
            'active_clients_age_fc': 'active_caseloads_clients',
            'active_orders_age_fc': 'active_caseloads_orders'
        }
    )
    #forecasted_age_specific_deputyship_agg['month'] = pd.to_datetime(forecasted_age_specific_deputyship_agg['month'], format='%Y-%m')
    #current_age_specific_deputyship_agg['month'] = pd.to_datetime(forecasted_age_specific_deputyship_agg['month'], format='%Y-%m')
    # Final forecast and actuals
    combined_table = get_combined_age_deputyship_table(current_age_specific_deputyship_agg, forecasted_age_specific_deputyship_agg)
    # Ensure 'month' is datetime
    #combined_table['month'] = pd.to_datetime(combined_table['month'], format='%Y-%m')
    # Take an explicit copy so the month assignment below does not trigger SettingWithCopyWarning
    final_deputyship_historical_forecasts = combined_table[['month', 'age', 'active_caseloads_clients', 'active_caseloads_orders', 'new_deputyships', 'terminated']].copy()
    final_deputyship_historical_forecasts['month'] = pd.to_datetime(final_deputyship_historical_forecasts['month']).dt.strftime("%b-%y")
    
    # Save in CSV
    final_deputyship_historical_forecasts.to_csv(os.path.join(output_base, f"final_deputyship_historical_forecasts_{start_year}_{end_year}.csv"))
    print(final_deputyship_historical_forecasts)


    # Use end_month as the last historical month to draw a vertical line on charts
    _ = visualize_and_analyze_deputyship_forecasts(
        final_deputyship_historical_forecasts,
        output_dir=output_base,
        hist_last_month=pd.to_datetime(end_month),
        axis_start_month=start_month,   # trims leading empty months from the x-axis
        top_k=10
    )

## Optional plotting snippet kept as a reference

This cell contains an older plotting approach that is currently commented out. It shows how to pivot the per age forecast table into a wide format with one column per age group and then plot each age group as its own line.

It can be useful as a quick exploratory plot when you want to focus on a small number of ages, but it is not part of the main reporting pipeline because the dedicated visualisation function produces a fuller set of charts and a consistent set of saved files.
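
The pivot described above can be sketched on a toy forecast table. The column names follow the commented snippet, but the demo values are made up, and the plotting call is left out so the example runs without a display.

```python
import pandas as pd

# Made-up per-age forecast rows in long format
per_age_demo = pd.DataFrame({
    "month": ["2026-01", "2026-01", "2026-02", "2026-02"],
    "age_group": [70, 71, 70, 71],
    "active_forecast": [100.0, 80.0, 101.0, 79.0],
})

# Wide format: one row per month, one column per age group
wide = per_age_demo.pivot(index="month", columns="age_group", values="active_forecast")
wide.index = pd.to_datetime(wide.index)

# Each column is one line on the chart; the label text matches the legend in the snippet
lines = {f"{int(age)} yrs": series for age, series in wide.items()}
```

Each entry in `lines` could then be passed to a matplotlib `ax.plot(series.index, series.values, label=label)` call, as in the commented cell.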


In [None]:
# Visualisation: Plotting age-specific active caseloads, termination rate, and new deputyships over time

# # Active Caseloads by Age Group
# active_pivot = combined_table.pivot(index='month', columns='age', values='active_caseloads')
    
# # Pivot for plotting: month on x‐axis, each age a line
# pivot = per_age_df.pivot(index='month', columns='age_group', values='active_forecast')
# pivot.index = pd.to_datetime(pivot.index)
    
# # Plot
# fig, ax = plt.subplots(figsize=(10, 6))
# for age, series in pivot.items():
#     ax.plot(
#         series.index, series.values,
#         label=f"{int(age)} yrs",
#         marker='o',
#         linewidth=2,
#         alpha=0.8
#     )

## Placeholder cell

This final cell is empty. It can be used for ad hoc checks, quick plots, or scratch calculations when you are validating a run. It is safe to leave it empty for normal notebook execution.
