# Investigations Backlog – Annotated Build Notebook
**Date:** 2025-10-27

This version of the notebook is automatically annotated with:
- Line-by-line comments in code cells to explain what each statement is doing.
- Brief summaries before each code cell describing the main purpose.
- Pointers to comprehensive documentation: see `README_Investigations_Backlog_Documentation.md` in the same folder for the full end-to-end description (data engineering, predictive modelling, and Bayesian analysis approach).

> Original source notebook: `Build_Investigator_Daily_from_Raw_JL.ipynb`.


> **Original note:**
>
> # Build Investigator Daily Panel (from OPG raw extract)
> This notebook mirrors the script flow.

### general processing


In [None]:

#!python -m venv .venv && . .venv/bin/activate


# Data

## Jake note regarding linking investigation data to LPA and staff data
I’ve re-added the investigator names as requested. I’ve also added the LPA/Deputyship ID, so only the donor’s name and DOB are removed. The password remains “backlog”.

On the analytical platform, the LPA number is stored as ‘UID’ in the cases table in the opg_sirius_prod database. The investigations database stores these IDs with hyphens, but once the hyphens are removed the data can be joined with the data on the AP. Donor names can effectively be re-accessed there, and other key variables such as the LPA registration dates can be retrieved (these are not stored in the investigations database but are significant for inbounds).

I have also added the FTE of the EO/AO investigators to this sheet. You’ll notice that some members of staff who were previously EOs (and are now HEOs) are not on the staff list. The staff list is in a constant state of flux with incoming cohorts and natural attrition, so if any projections relating to resource levels are made, I’d strongly recommend that I send a definitive list as of a specific date so there’s a clear point of reference. In the temporary backlog model, I manually review the list each month, with placeholders for the incoming cohorts, but for the more sophisticated model you may come up with a better solution. It’s something to discuss in next week’s meeting, I’m sure.
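
The hyphen-stripping join described in the note above could look roughly like the sketch below. Only the `UID` field and the `cases` table are named in the note; the column names `lpa_id` and `registration_date` and all sample values are made up for illustration, and the real `UID` may be stored as a number rather than a string.

```python
import pandas as pd

# Hypothetical extract from the investigations database (IDs contain hyphens).
investigations = pd.DataFrame({
    "lpa_id": ["7000-1234-5678", "7000-8765-4321"],
    "investigator": ["A", "B"],
})

# Hypothetical slice of the cases table on the analytical platform.
cases = pd.DataFrame({
    "uid": ["700012345678", "700099990000"],
    "registration_date": ["2021-03-01", "2022-07-15"],
})

# Remove hyphens and stray whitespace so both keys share one format.
investigations["uid"] = (
    investigations["lpa_id"].astype(str).str.replace("-", "", regex=False).str.strip()
)

# A left join keeps every investigation; the indicator column shows which rows matched.
linked = investigations.merge(cases, on="uid", how="left", indicator=True)
print(linked[["lpa_id", "uid", "registration_date", "_merge"]])
```

The `indicator=True` flag adds a `_merge` column, which makes unmatched IDs easy to count before any donor-level fields are relied on.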

## Peter’s interpretation of cases that have left allocation
One question I do have: while maintaining anonymity, can the records of closed cases be linked to individual investigators? As you know, the key problem we are trying to investigate is how changes in staff volumes will impact OPG’s ability to reduce the backlog, so we really need to understand the variation in workloads assigned to individuals.

From the analytical point of view, cases that are closed and cases sent to court for legal review can be treated the same.
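
One way to treat the two outcomes as a single "exit" is sketched below. It borrows the column names `staff_id`, `dt_close` and `dt_legal_req_1` used later in this notebook; the `dt_exit` column, the choice of the earliest date, and the sample rows are assumptions for illustration, not part of the agreed method.

```python
import pandas as pd

# Made-up cases for two pseudonymised investigators.
cases = pd.DataFrame({
    "staff_id": ["S1", "S1", "S2", "S2"],
    "dt_close": pd.to_datetime(["2025-01-10", None, None, None]),
    "dt_legal_req_1": pd.to_datetime([None, "2025-01-12", "2025-02-01", None]),
})

# One exit date per case: whichever of closure / legal review request came first.
# Rows where both dates are missing stay NaT, meaning the case is still open.
cases["dt_exit"] = cases[["dt_close", "dt_legal_req_1"]].min(axis=1)

# Number of exited cases per investigator, a first look at workload variation.
exits_per_staff = cases.dropna(subset=["dt_exit"]).groupby("staff_id").size()
print(exits_per_staff)
```

Because only the hashed `staff_id` is used, the per-investigator counts can be compared without exposing names.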


### imports and environment setup


In [None]:
# Import libraries/modules for use below
from pathlib import Path
import pandas as pd, numpy as np
import re, hashlib

# Configure paths
# Path to the raw investigation data
RAW_PATH = Path('data/raw/raw.csv')
# Path to the output/processed investigation data
OUT_DIR = Path('data/out'); OUT_DIR.mkdir(parents=True, exist_ok=True)
# Print if the path exists
print(RAW_PATH.exists(), OUT_DIR)


### imports and environment setup, date parsing


In [None]:
# -----------------------------
# 🧹 DATA PRE-PROCESSING SECTION
# -----------------------------

import re
import hashlib
import pandas as pd

# Define a set of string patterns that represent missing or null values.
# These strings will be treated as equivalent to NaN during cleaning.
NULL_STRINGS = {
    '', 'na', 'n/a', 'none', 'null', '-', '--', 'unknown',
    'not completed', 'not complete', 'tbc', 'n\\a'
}


def normalise_col(c: str) -> str:
    """
    Normalize a column name for consistency.

    This function cleans up and standardizes column names by:
    - Converting to lowercase
    - Removing leading/trailing whitespace
    - Replacing multiple spaces with a single space

    Parameters
    ----------
    c : str
        The original column name.

    Returns
    -------
    str
        A cleaned and standardized version of the column name.
    """
    # Convert to string, remove extra spaces, and make lowercase.
    return re.sub(r'\s+', ' ', str(c).strip().lower())


def parse_date_series(s: pd.Series) -> pd.Series:
    """
    Parse and clean a pandas Series of date strings.

    This function:
    - Handles various date formats
    - Converts known null strings to NaT
    - Removes ordinal suffixes (e.g., '1st', '2nd', '3rd')
    - Fixes known typos
    - Uses robust pandas date parsing with fallback strategies

    Parameters
    ----------
    s : pd.Series
        A pandas Series containing raw date values.

    Returns
    -------
    pd.Series
        A pandas Series of datetime64[ns] values with cleaned and parsed dates.
    """

    def _p(x):
        """Internal helper to parse a single date entry."""
        import pandas as pd

        # Return NaT if missing
        if pd.isna(x):
            return pd.NaT

        # Convert to lowercase string
        xs = str(x).strip().lower()

        # Return NaT if in known null string set
        if xs in NULL_STRINGS:
            return pd.NaT

        # Clean up common errors and ordinal suffixes
        xs = re.sub(r'(\d{1,2})(st|nd|rd|th)', r'\1', xs).replace('legel', 'legal')

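        # Try strict parsing first; if that raises, retry with errors='coerce'
        # so unparseable values become NaT instead of stopping the run.
        # (`infer_datetime_format` is deprecated in pandas 2.x and has no effect there.)
        try:
            return pd.to_datetime(xs, dayfirst=True, errors='raise')
        except Exception:
            return pd.to_datetime(xs, dayfirst=True, errors='coerce')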

    # Apply the parser to each element of the Series
    return s.apply(_p)


def hash_id(t: str) -> str:
    """
    Generate a short, anonymized hash-based identifier.

    Creates a pseudonymized ID for text entries using SHA1 hashing.
    Empty or missing values return an empty string.

    Parameters
    ----------
    t : str
        The input text value (e.g., name, case number).

    Returns
    -------
    str
        An anonymized hash string prefixed with 'S', e.g., 'S1a2b3c4d'.
    """
    # Return empty string for null or blank input
    if pd.isna(t) or str(t).strip() == '':
        return ''

    # Create SHA1 hash and take first 8 characters for compact ID
    return 'S' + hashlib.sha1(str(t).encode('utf-8')).hexdigest()[:8]


def month_to_season(m: int) -> str:
    """
    Convert a numeric month into a season name.

    Parameters
    ----------
    m : int
        Month number (1–12).

    Returns
    -------
    str
        The season corresponding to the month ('winter', 'spring', 'summer', or 'autumn').

    Examples
    --------
    >>> month_to_season(4)
    'spring'
    >>> month_to_season(10)
    'autumn'
    """
    # Map month numbers to their respective seasons
    return {
        12: 'winter', 1: 'winter', 2: 'winter',
        3: 'spring', 4: 'spring', 5: 'spring',
        6: 'summer', 7: 'summer', 8: 'summer',
        9: 'autumn', 10: 'autumn', 11: 'autumn'
    }[int(m)]


def is_term_month(m: int) -> int:
    """
    Flag whether a month counts as a 'term' month.

    In the current logic, August (month 8) is the only excluded month and returns 0.
    All other months return 1, representing active/valid months.

    Parameters
    ----------
    m : int
        Month number (1–12).

    Returns
    -------
    int
        0 if the month is August, else 1.
    """
    # Return binary flag based on month value
    return 0 if int(m) == 8 else 1
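
The date-cleaning steps in `parse_date_series` (null strings, ordinal suffixes, day-first parsing) can be tried on a handful of values with the cut-down sketch below. `parse_one` and the shortened `NULLS` set are stand-ins written for this example, not the notebook's own helpers, and the sample strings are invented.

```python
import re
import pandas as pd

NULLS = {"", "n/a", "tbc", "unknown"}  # abbreviated stand-in for NULL_STRINGS

def parse_one(x):
    """Parse one raw date value day-first, returning NaT when it is unusable."""
    if pd.isna(x):
        return pd.NaT
    xs = str(x).strip().lower()
    if xs in NULLS:
        return pd.NaT
    # Strip ordinal suffixes: "3rd march 2025" -> "3 march 2025"
    xs = re.sub(r"(\d{1,2})(st|nd|rd|th)", r"\1", xs)
    # errors="coerce" turns anything unparseable into NaT
    return pd.to_datetime(xs, dayfirst=True, errors="coerce")

raw = pd.Series(["03/04/2025", "3rd March 2025", "TBC", None, "not a date"])
parsed = raw.apply(parse_one)
print(parsed)
```

With `dayfirst=True`, `03/04/2025` is read as 3 April 2025 rather than 4 March, which matters for UK-style extracts.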


### imports and environment setup, data loading, joining/merging datasets, aggregation/grouping, pivot/reshape, data cleaning, sorting, feature engineering, exporting outputs


In [None]:
# -------------------------------------
# 🧩 DATA LOADING AND FEATURE ENGINEERING
# -------------------------------------

from pathlib import Path
import pandas as pd
import numpy as np
import re
import hashlib

# -------------------------------------------------------------
# Function: load_raw()
# -------------------------------------------------------------
def load_raw(p: Path, force_encoding: str | None = None):
    """
    Load a CSV or Excel file into a pandas DataFrame with robust encoding handling.

    This function attempts to open and read raw data files safely, even when
    character encodings vary or are unknown. It tries multiple encodings in order
    until one succeeds.

    Parameters
    ----------
    p : Path
        Path to the input file.
    force_encoding : str, optional
        If provided, forces the use of a specific encoding.

    Returns
    -------
    tuple
        (df, colmap)
        df : pd.DataFrame
            Cleaned dataframe containing the raw data.
        colmap : dict
            Mapping of normalized column names (lowercased, trimmed) to original column headers.

    Raises
    ------
    FileNotFoundError
        If the file path does not exist.
    RuntimeError
        If all encoding attempts fail.
    """
    import io

    # Check file existence
    if not p.exists():
        raise FileNotFoundError(p)

    # Excel files typically do not have encoding issues
    if p.suffix.lower() in (".xlsx", ".xls"):
        df = pd.read_excel(p, dtype=str)
    else:
        tried = []
        # Build list of encodings to try
        encodings_to_try = (
            [force_encoding] if force_encoding else
            ["utf-8-sig", "cp1252", "latin1", "iso-8859-1", "utf-16", "utf-16le", "utf-16be"]
        )

        df = None
        last_err = None

        # Try to read using multiple encodings
        for enc in encodings_to_try:
            try:
                df = pd.read_csv(
                    p, dtype=str, sep=None, engine="python",
                    encoding=enc, encoding_errors="strict"
                )
                break
            except UnicodeDecodeError as e:
                tried.append(enc)
                last_err = e
            except Exception as e:
                # Continue trying other encodings
                tried.append(enc)
                last_err = e

        # Fallback: attempt to decode with cp1252 and replace bad bytes
        if df is None:
            try:
                df = pd.read_csv(
                    p, dtype=str, sep=None, engine="python",
                    encoding="cp1252", encoding_errors="replace"
                )
                print(f"[load_raw] WARNING: used cp1252 with replacement after failed encodings: {tried}")
            except Exception as e:
                raise RuntimeError(
                    f"Failed to read CSV. Tried encodings {tried}. Last error: {last_err}"
                ) from e

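    # Strip whitespace from all string values.
    # Series.map is applied column by column because DataFrame.applymap is
    # deprecated from pandas 2.1 (it was renamed DataFrame.map).
    df = df.apply(lambda s: s.map(lambda x: x.strip() if isinstance(x, str) else x))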

    # Create mapping of normalized column names → original names
    colmap = {re.sub(r"\s+", " ", str(c).strip().lower()): c for c in df.columns}

    return df, colmap


# -------------------------------------------------------------
# Function: col()
# -------------------------------------------------------------
def col(df: pd.DataFrame, colmap: dict, name: str) -> pd.Series:
    """
    Retrieve a column from a DataFrame by fuzzy name matching.

    This function normalises the requested column name and searches the column map
    for an exact or partial match. Returns a Series of NaNs if not found.

    Parameters
    ----------
    df : pd.DataFrame
        The source DataFrame.
    colmap : dict
        Mapping of normalised column names to original names.
    name : str
        Column name to look up.

    Returns
    -------
    pd.Series
        The column data if found, otherwise a Series of NaN values.
    """
    k = normalise_col(name)

    # Exact match first
    if k in colmap:
        return df[colmap[k]]

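    # Partial match fallback: returns the first header that contains the key
    # or is contained in it, so short names such as 'ID' can match an
    # unintended column when no exact match exists.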
    for kk, v in colmap.items():
        if k in kk or kk in k:
            return df[v]

    # Default: return an all-NaN column aligned to df's index
    return pd.Series(np.nan, index=df.index)


# -------------------------------------------------------------
# Function: engineer()
# -------------------------------------------------------------
def engineer(df: pd.DataFrame, colmap: dict) -> pd.DataFrame:
    """
    Engineer standardised and typed columns from raw investigation data.

    This function extracts and converts the key variables such as case IDs, investigators,
    FTEs, and multiple date columns from the raw file using reusable helper functions.

    Parameters
    ----------
    df : pd.DataFrame
        Raw dataframe from load_raw().
    colmap : dict
        Column name mapping from load_raw().

    Returns
    -------
    pd.DataFrame
        Cleaned and feature-engineered dataframe ready for downstream modeling.
    """
    out = pd.DataFrame({
        'case_id': col(df, colmap, 'ID'),
        'investigator': col(df, colmap, 'Investigator'),
        'team': col(df, colmap, 'Team'),
        'fte': pd.to_numeric(col(df, colmap, 'Investigator FTE'), errors='coerce')
    })

    # Parse and standardize all relevant date columns
    out['dt_received_inv'] = parse_date_series(col(df, colmap, 'Date Received in Investigations'))
    out['dt_alloc_invest'] = parse_date_series(col(df, colmap, 'Date allocated to current investigator'))
    out['dt_alloc_team'] = parse_date_series(col(df, colmap, 'Date allocated to team'))
    out['dt_pg_signoff'] = parse_date_series(col(df, colmap, 'PG Sign off date'))
    out['dt_close'] = parse_date_series(col(df, colmap, 'Closure Date'))
    out['dt_legal_req_1'] = parse_date_series(col(df, colmap, 'Date of Legal Review Request 1'))
    out['dt_legal_rej_1'] = parse_date_series(col(df, colmap, 'Date Legal Rejects 1'))
    out['dt_legal_req_2'] = parse_date_series(col(df, colmap, 'Date of Legal Review Request 2'))
    out['dt_legal_rej_2'] = parse_date_series(col(df, colmap, 'Date Legal Rejects 2'))
    out['dt_legal_req_3'] = parse_date_series(col(df, colmap, 'Date of Legel Review Request 3'))
    out['dt_legal_approval'] = parse_date_series(col(df, colmap, 'Legal Approval Date'))
    out['dt_date_of_order'] = parse_date_series(col(df, colmap, 'Date Of Order'))
    out['dt_flagged'] = parse_date_series(col(df, colmap, 'Flagged Date'))

    # Fill missing FTEs with 1.0, hash investigator names for anonymization, and add placeholders
    out['fte'] = out['fte'].fillna(1.0)
    out['staff_id'] = out['investigator'].apply(hash_id)
    out['role'] = ''

    return out


# -------------------------------------------------------------
# Function: date_horizon()
# -------------------------------------------------------------
def date_horizon(typed: pd.DataFrame, pad_days: int = 14):
    """
    Determine the overall start and end date horizon of the dataset.

    Combines several date columns to find the earliest and latest dates,
    applying a configurable padding period at the end.

    Parameters
    ----------
    typed : pd.DataFrame
        Feature-engineered dataset with standardized date columns.
    pad_days : int, default=14
        Number of days to extend the end horizon.

    Returns
    -------
    tuple of pd.Timestamp
        (start, end) normalized date range.

    Examples
    --------
    >>> import pandas as pd
    >>> from datetime import datetime
    >>> df = pd.DataFrame({
    ...     'dt_received_inv': [pd.Timestamp('2025-01-05'), pd.NaT],
    ...     'dt_alloc_invest': [pd.NaT, pd.Timestamp('2025-01-10')],
    ...     'dt_alloc_team': [pd.NaT, pd.NaT],
    ...     'dt_close': [pd.NaT, pd.Timestamp('2025-02-01')],
    ...     'dt_pg_signoff': [pd.NaT, pd.NaT],
    ...     'dt_date_of_order': [pd.NaT, pd.NaT],
    ... })
    >>> s, e = date_horizon(df, pad_days=7)
    >>> isinstance(s, pd.Timestamp) and isinstance(e, pd.Timestamp)
    True
    >>> (e - s).days >= (pd.Timestamp('2025-02-01') - pd.Timestamp('2025-01-05')).days
    True
    """
    start = pd.concat([typed['dt_received_inv'], typed['dt_alloc_invest'], typed['dt_alloc_team']]).min()
    end = pd.concat([typed['dt_close'], typed['dt_pg_signoff'], typed['dt_date_of_order']]).max()

    if pd.isna(start):
        start = pd.Timestamp.today().normalize() - pd.Timedelta(days=30)
    if pd.isna(end):
        end = pd.Timestamp.today().normalize()

    end = end + pd.Timedelta(days=pad_days)
    return start.normalize(), end.normalize()


def build_event_log(typed: pd.DataFrame) -> pd.DataFrame:
    """
    Construct an event log from feature-engineered investigation data.

    For each case, this function creates dated event records (e.g., new case pickup,
    legal requests/approvals, court orders) at the staff-day level.

    Parameters
    ----------
    typed : pd.DataFrame
        Must include:
        ['staff_id','team','fte','case_id',
         'dt_alloc_invest','dt_legal_req_1','dt_legal_req_2','dt_legal_req_3',
         'dt_legal_approval','dt_date_of_order'].

    Returns
    -------
    pd.DataFrame
        ['date','staff_id','team','fte','case_id','event','meta'].

    Examples
    --------
    >>> import pandas as pd
    >>> typed = pd.DataFrame({
    ...     'staff_id':['S1'], 'team':['A'], 'fte':[1.0], 'case_id':['C1'],
    ...     'dt_alloc_invest':[pd.Timestamp('2025-01-10')],
    ...     'dt_legal_req_1':[pd.NaT], 'dt_legal_req_2':[pd.NaT], 'dt_legal_req_3':[pd.NaT],
    ...     'dt_legal_approval':[pd.Timestamp('2025-01-20')],
    ...     'dt_date_of_order':[pd.NaT],
    ... })
    >>> ev = build_event_log(typed)
    >>> sorted(ev['event'].unique().tolist())
    ['legal_approval', 'newcase']
    >>> set(ev.columns) >= {'date','staff_id','team','fte','case_id','event','meta'}
    True
    """
    rec = []
    for _, r in typed.iterrows():
        sid, team, fte, cid = r['staff_id'], r['team'], r['fte'], r['case_id']

        def add(dt, etype):
            if pd.isna(dt):
                return
            rec.append({
                'date': dt.normalize(),
                'staff_id': sid,
                'team': team,
                'fte': fte,
                'case_id': cid,
                'event': etype,
                'meta': ''
            })

        add(r['dt_alloc_invest'], 'newcase')
        add(r['dt_legal_req_1'], 'legal_request')
        add(r['dt_legal_req_2'], 'legal_request')
        add(r['dt_legal_req_3'], 'legal_request')
        add(r['dt_legal_approval'], 'legal_approval')
        add(r['dt_date_of_order'], 'court_order')

    ev = pd.DataFrame.from_records(rec)
    return ev if not ev.empty else pd.DataFrame(
        columns=['date', 'staff_id', 'team', 'fte', 'case_id', 'event', 'meta']
    )


def build_wip_series(typed: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
    """
    Build a Work-In-Progress (WIP) daily series per staff member.

    A case is in WIP from allocation until its closure date; if that is missing,
    until its PG sign-off date; if both are missing, until the horizon end.

    Parameters
    ----------
    typed : pd.DataFrame
        ['staff_id','team','dt_alloc_invest','dt_close','dt_pg_signoff'].
    start : pd.Timestamp
    end : pd.Timestamp

    Returns
    -------
    pd.DataFrame
        ['date','staff_id','team','wip'].

    Examples
    --------
    >>> import pandas as pd
    >>> typed = pd.DataFrame({
    ...     'staff_id':['S1','S1'], 'team':['A','A'],
    ...     'dt_alloc_invest':[pd.Timestamp('2025-01-02'), pd.Timestamp('2025-01-05')],
    ...     'dt_close':[pd.Timestamp('2025-01-03'), pd.NaT],
    ...     'dt_pg_signoff':[pd.NaT, pd.Timestamp('2025-01-07')],
    ... })
    >>> wip = build_wip_series(typed, pd.Timestamp('2025-01-01'), pd.Timestamp('2025-01-10'))
    >>> set(wip.columns) == {'date','staff_id','team','wip'}
    True
    >>> wip['wip'].ge(0).all()
    True
    """
    end_dt = typed['dt_close'].fillna(typed['dt_pg_signoff']).fillna(end)
    intervals = pd.DataFrame({
        'staff_id': typed['staff_id'],
        'team': typed['team'],
        'start': typed['dt_alloc_invest'],
        'end': end_dt
    }).dropna()

    deltas = []
    for _, r in intervals.iterrows():
        s = r['start'].normalize()
        e = r['end'].normalize()
        if s > end or e < start:
            continue
        s = max(s, start)
        e = min(e, end)
        deltas.append((r['staff_id'], r['team'], s, 1))
        deltas.append((r['staff_id'], r['team'], e + pd.Timedelta(days=1), -1))

    if not deltas:
        return pd.DataFrame(columns=['date', 'staff_id', 'team', 'wip'])

    deltas = pd.DataFrame(deltas, columns=['staff_id', 'team', 'date', 'delta'])
    all_dates = pd.DataFrame({'date': pd.date_range(start, end, freq='D')})

    rows = []
    for (sid, team), g in deltas.groupby(['staff_id', 'team']):
        gg = g.groupby('date', as_index=False)['delta'].sum()
        grid = all_dates.merge(gg, on='date', how='left').fillna({'delta': 0})
        grid['wip'] = grid['delta'].cumsum()
        grid['staff_id'] = sid
        grid['team'] = team
        rows.append(grid[['date', 'staff_id', 'team', 'wip']])

    return pd.concat(rows, ignore_index=True) if rows else pd.DataFrame(
        columns=['date', 'staff_id', 'team', 'wip']
    )


def build_backlog_series(typed: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp) -> pd.DataFrame:
    """
    Build a daily backlog series: cumulative cases received minus cumulative cases allocated.

    Parameters
    ----------
    typed : pd.DataFrame
        ['dt_received_inv','dt_alloc_invest'].
    start : pd.Timestamp
    end : pd.Timestamp

    Returns
    -------
    pd.DataFrame
        ['date','backlog_available'].

    Examples
    --------
    >>> import pandas as pd
    >>> typed = pd.DataFrame({
    ...     'dt_received_inv':[pd.Timestamp('2025-01-01'), pd.Timestamp('2025-01-03')],
    ...     'dt_alloc_invest':[pd.Timestamp('2025-01-02'), pd.NaT],
    ... })
    >>> start, end = pd.Timestamp('2025-01-01'), pd.Timestamp('2025-01-05')
    >>> backlog = build_backlog_series(typed, start, end)
    >>> list(backlog.columns)
    ['date', 'backlog_available']
    >>> int(backlog['backlog_available'].iloc[-1])  # 2 received, 1 allocated -> 1
    1
    """
    accepted = (
        typed[['dt_received_inv']]
        .dropna()
        .assign(date=lambda d: d['dt_received_inv'].dt.normalize())['date']
        .value_counts()
        .sort_index()
    )
    allocated = (
        typed[['dt_alloc_invest']]
        .dropna()
        .assign(date=lambda d: d['dt_alloc_invest'].dt.normalize())['date']
        .value_counts()
        .sort_index()
    )

    idx = pd.date_range(start, end, freq='D')
    acc = accepted.reindex(idx, fill_value=0).cumsum()
    allo = allocated.reindex(idx, fill_value=0).cumsum()
    backlog = (acc - allo).rename('backlog_available').to_frame()
    backlog.index.name = 'date'
    return backlog.reset_index()
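
The partial-match fallback in `col()` above returns the first header that matches, which can hide a wrong column choice for short names such as `ID`. A stricter lookup might look like the sketch below. `find_column` and `normalise` are hypothetical names, the headers are invented, and raising on ambiguity is one possible design, not the notebook's behaviour.

```python
import re
from typing import Optional

import pandas as pd

def normalise(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", str(name).strip().lower())

def find_column(df: pd.DataFrame, name: str) -> Optional[str]:
    """Return the header matching `name`, or None; raise if a partial match is ambiguous."""
    key = normalise(name)
    lookup = {normalise(c): c for c in df.columns}
    if key in lookup:                       # exact match on the normalised name
        return lookup[key]
    hits = [orig for norm, orig in lookup.items() if key in norm]
    if len(hits) > 1:                       # refuse to guess between candidates
        raise KeyError(f"{name!r} is ambiguous: {hits}")
    return hits[0] if hits else None

df = pd.DataFrame(columns=["Case ID", "Deputyship ID", "Investigator ", "Closure  Date"])

print(find_column(df, "investigator"))   # exact match after normalisation
print(find_column(df, "Team"))           # no match at all
try:
    find_column(df, "ID")
except KeyError as err:
    print("ambiguous:", err)
```

Unlike `col()`, this sketch only checks whether the requested name is contained in a header, not the reverse, and it returns the header name rather than the column data.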





In [None]:
import pandas as pd

# --- fabricate a tiny typed dataset ---
typed = pd.DataFrame({
    'case_id': ['C1','C2'],
    'investigator': ['Alice','Bob'],
    'team': ['T1','T1'],
    'role': ['',''],
    'fte': [1.0, 0.8],
    'staff_id': ['S1','S2'],

    # key dates
    'dt_received_inv': [pd.Timestamp('2025-01-01'), pd.Timestamp('2025-01-02')],
    'dt_alloc_invest': [pd.Timestamp('2025-01-02'), pd.Timestamp('2025-01-03')],
    'dt_alloc_team': [pd.NaT, pd.NaT],
    'dt_pg_signoff': [pd.NaT, pd.Timestamp('2025-01-08')],
    'dt_close': [pd.Timestamp('2025-01-06'), pd.NaT],

    # events
    'dt_legal_req_1': [pd.NaT, pd.Timestamp('2025-01-04')],
    'dt_legal_req_2': [pd.NaT, pd.NaT],
    'dt_legal_req_3': [pd.NaT, pd.NaT],
    'dt_legal_approval': [pd.NaT, pd.NaT],
    'dt_date_of_order': [pd.NaT, pd.NaT],
    'dt_flagged': [pd.NaT, pd.NaT],
})

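# --- run horizon, events, wip, and panel ---
# NOTE: build_daily_panel is defined in the next cell; run that cell first,
# otherwise the call below raises NameError.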
start, end = date_horizon(typed, pad_days=3)
daily, backlog, events = build_daily_panel(typed, start, end)

print("Start/End:", start.date(), end.date())
print("Daily shape:", daily.shape)
print("Backlog shape:", backlog.shape)
print("Events shape:", events.shape)

print("\nDaily head:\n", daily.head())
print("\nBacklog tail:\n", backlog.tail())
print("\nEvents:\n", events.sort_values(['date','staff_id','event']))
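
The WIP values expected for the two fabricated cases above can be checked by hand with the same +1/-1 technique that `build_wip_series` uses. The sketch below is a compact, vectorised variant written for this check; it skips the clipping of intervals to the horizon, and the interval end dates are filled in directly from the closure and PG sign-off dates of the sample data.

```python
import pandas as pd

# Horizon produced by date_horizon(typed, pad_days=3) for the sample data.
start, end = pd.Timestamp("2025-01-01"), pd.Timestamp("2025-01-11")

intervals = pd.DataFrame({
    "staff_id": ["S1", "S2"],
    "start": pd.to_datetime(["2025-01-02", "2025-01-03"]),  # allocation dates
    "end": pd.to_datetime(["2025-01-06", "2025-01-08"]),    # closure / PG sign-off
})

# +1 on the day a case is allocated, -1 on the day after it leaves WIP.
opens = intervals.assign(date=intervals["start"], delta=1)
closes = intervals.assign(date=intervals["end"] + pd.Timedelta(days=1), delta=-1)
deltas = pd.concat([opens, closes])[["staff_id", "date", "delta"]]

# Sum deltas per staff-day, lay them on the full calendar, then accumulate.
calendar = pd.date_range(start, end, freq="D")
wip = (
    deltas.pivot_table(index="date", columns="staff_id", values="delta", aggfunc="sum")
    .reindex(calendar, fill_value=0)
    .fillna(0)
    .cumsum()
    .astype(int)
)
print(wip)
```

Under these assumptions S1 carries one case from 2 to 6 January and S2 carries one case from 3 to 8 January, with zero WIP on every other day of the horizon.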


In [None]:

# Define a reusable function to load raw data
def load_raw(p: Path, force_encoding: str | None = None):
    """
    Load CSV/XLSX with robust encoding handling.
    - If force_encoding is given, use it.
    - Otherwise try common encodings in order and fall back to a safe decode.
    Returns: (df, colmap)
    """
    # Import the modules used inside this function
    import pandas as pd, re

    # Fail early if the file does not exist
    if not p.exists():
        raise FileNotFoundError(p)

    # Excel files are not affected by CSV encoding issues
    if p.suffix.lower() in (".xlsx", ".xls"):
        df = pd.read_excel(p, dtype=str)
    else:
        tried = []
        encodings_to_try = (
            [force_encoding] if force_encoding else
            ["utf-8-sig", "cp1252", "latin1", "iso-8859-1", "utf-16", "utf-16le", "utf-16be"]
        )

        df = None
        last_err = None
        # Try each encoding in turn and stop at the first one that reads cleanly
        for enc in encodings_to_try:
            try:
                df = pd.read_csv(p, dtype=str, sep=None, engine="python",
                                 encoding=enc, encoding_errors="strict")
                break
            except Exception as e:
                # Decode errors and other parse errors (separator/quotes):
                # record the attempt and keep trying other encodings
                tried.append(enc)
                last_err = e

        # Last resort: decode with cp1252 but *replace* bad bytes
        if df is None:
            try:
                df = pd.read_csv(p, dtype=str, sep=None, engine="python",
                                 encoding="cp1252", encoding_errors="replace")
                print(f"[load_raw] WARNING: used cp1252 with replacement after failed encodings: {tried}")
            except Exception as e:
                raise RuntimeError(
                    f"Failed to read CSV. Tried encodings {tried}. Last error: {last_err}"
                ) from e

    # Trim whitespace across all string values (Series.map per column,
    # since DataFrame.applymap is deprecated from pandas 2.1)
    df = df.apply(lambda s: s.map(lambda x: x.strip() if isinstance(x, str) else x))
    # Map normalised column names to the original headers
    colmap = {re.sub(r"\s+", " ", str(c).strip().lower()): c for c in df.columns}
    return df, colmap


# Define a reusable function to look up a column by (fuzzy) name
def col(df, colmap, name):
    import numpy as np
    k = normalise_col(name)
    # Exact match on the normalised name
    if k in colmap:
        return df[colmap[k]]
    # Partial match fallback: first header that contains k or is contained in it
    for kk, v in colmap.items():
        if k in kk or kk in k:
            return df[v]
    # Not found: return an all-NaN column aligned to df's index
    return pd.Series(np.nan, index=df.index)


# Define a reusable function to build the typed columns
def engineer(df, colmap):
# Import libraries/modules for use below
    import pandas as pd
# Use pandas functionality to rename the most important column to the corresponding variables
    out=pd.DataFrame({'case_id':col(df,colmap,'ID'),
                      'investigator':col(df,colmap,'Investigator'),'team':col(df,colmap,'Team'),
                      'fte':pd.to_numeric(col(df,colmap,'Investigator FTE'), errors='coerce')})
    out['dt_received_inv']=parse_date_series(col(df,colmap,'Date Received in Investigations'))
    out['dt_alloc_invest']=parse_date_series(col(df,colmap,'Date allocated to current investigator'))
    out['dt_alloc_team']=parse_date_series(col(df,colmap,'Date allocated to team'))
    out['dt_pg_signoff']=parse_date_series(col(df,colmap,'PG Sign off date'))
    out['dt_close']=parse_date_series(col(df,colmap,'Closure Date'))
    out['dt_legal_req_1']=parse_date_series(col(df,colmap,'Date of Legal Review Request 1'))
    out['dt_legal_rej_1']=parse_date_series(col(df,colmap,'Date Legal Rejects 1'))
    out['dt_legal_req_2']=parse_date_series(col(df,colmap,'Date of Legal Review Request 2'))
    out['dt_legal_rej_2']=parse_date_series(col(df,colmap,'Date Legal Rejects 2'))
    out['dt_legal_req_3']=parse_date_series(col(df,colmap,'Date of Legel Review Request 3'))
    out['dt_legal_approval']=parse_date_series(col(df,colmap,'Legal Approval Date'))
    out['dt_date_of_order']=parse_date_series(col(df,colmap,'Date Of Order'))
    out['dt_flagged']=parse_date_series(col(df,colmap,'Flagged Date'))
    # Default missing FTE to 1.0, hash investigator names, and add an empty role placeholder
    out['fte'] = out['fte'].fillna(1.0)
    out['staff_id'] = out['investigator'].apply(hash_id)
    out['role'] = ''
    return out

    
# Define a reusable function to define the horizon
def date_horizon(typed, pad_days: int = 14):
    import pandas as pd
    # Concatenate the received/allocation date columns; the minimum is the start date
    start = pd.concat([typed['dt_received_inv'], typed['dt_alloc_invest'], typed['dt_alloc_team']]).min()
    # Concatenate the closure/sign-off/order date columns; the maximum is the end date
    end = pd.concat([typed['dt_close'], typed['dt_pg_signoff'], typed['dt_date_of_order']]).max()
    # Fall back to a 30-day window ending today when no dates are available
    if pd.isna(start):
        start = pd.Timestamp.today().normalize() - pd.Timedelta(days=30)
    if pd.isna(end):
        end = pd.Timestamp.today().normalize()
    # Pad the end of the horizon
    end = end + pd.Timedelta(days=pad_days)
    return start.normalize(), end.normalize()

    
# Define a reusable function to build event log
def build_event_log(typed):
    import pandas as pd
    rec = []
    # One pass per case row
    for _, r in typed.iterrows():
        sid, team, fte, cid = r['staff_id'], r['team'], r['fte'], r['case_id']

        # Append one event record unless the date is missing
        def add(dt, etype):
            if pd.isna(dt):
                return
            rec.append({'date': dt.normalize(), 'staff_id': sid, 'team': team, 'fte': fte,
                        'case_id': cid, 'event': etype, 'meta': ''})

        add(r['dt_alloc_invest'], 'newcase')
        add(r['dt_legal_req_1'], 'legal_request')
        add(r['dt_legal_req_2'], 'legal_request')
        add(r['dt_legal_req_3'], 'legal_request')
        add(r['dt_legal_approval'], 'legal_approval')
        add(r['dt_date_of_order'], 'court_order')

    # Build the event table; return an empty frame with the expected columns if there are no events
    ev = pd.DataFrame.from_records(rec)
    return ev if not ev.empty else pd.DataFrame(columns=['date','staff_id','team','fte','case_id','event','meta'])


# Define a reusable function to build wip
def build_wip_series(typed,start,end):
# Import libraries/modules for use below
    import pandas as pd
# Fill missing values with a default
    end_dt=typed['dt_close'].fillna(typed['dt_pg_signoff']).fillna(end)
# Drop rows with missing values
    intervals=pd.DataFrame({'staff_id':typed['staff_id'],'team':typed['team'],'start':typed['dt_alloc_invest'],'end':end_dt}).dropna()

    deltas=[]
# Loop over a sequence
    for _,r in intervals.iterrows():
        s=r['start'].normalize(); e=r['end'].normalize()
        if s>end or e<start: continue
        s=max(s,start); e=min(e,end)
        deltas.append((r['staff_id'],r['team'],s,1)); deltas.append((r['staff_id'],r['team'],e+pd.Timedelta(days=1),-1))
    if not deltas: return pd.DataFrame(columns=['date','staff_id','team','wip'])
    deltas=pd.DataFrame(deltas, columns=['staff_id','team','date','delta'])
    all_dates=pd.DataFrame({'date':pd.date_range(start,end,freq='D')})

    rows=[]
# Group rows and compute aggregations
    for (sid,team),g in deltas.groupby(['staff_id','team']):
# Group rows and compute aggregations
        gg=g.groupby('date', as_index=False)['delta'].sum()
# Fill missing values with a default
        grid=all_dates.merge(gg,on='date', how='left').fillna({'delta':0})
        grid['wip']=grid['delta'].cumsum(); grid['staff_id']=sid; grid['team']=team; rows.append(grid[['date','staff_id','team','wip']])
    return pd.concat(rows, ignore_index=True) if rows else pd.DataFrame(columns=['date','staff_id','team','wip'])

    
# Define a reusable function to build the backlog series
def build_backlog_series(typed, start, end):
    """Daily unallocated backlog: cumulative cases received minus cumulative cases allocated."""
    import pandas as pd

    def daily_counts(col):
        # Count cases per calendar day for one date column, ignoring missing dates
        return typed[col].dropna().dt.normalize().value_counts().sort_index()

    idx = pd.date_range(start, end, freq='D')
    # Dates outside [start, end] are dropped by reindex, so both running totals begin at `start`
    acc = daily_counts('dt_received_inv').reindex(idx, fill_value=0).cumsum()
    allo = daily_counts('dt_alloc_invest').reindex(idx, fill_value=0).cumsum()
    backlog = (acc - allo).rename('backlog_available').to_frame()
    backlog.index.name = 'date'
    return backlog.reset_index()


# Define a reusable function to build daily panel
def build_daily_panel(typed: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp):
    """Combine WIP, event log, and calendar features. Returns: (daily, backlog, events)."""
    ev = build_event_log(typed)
    wip = build_wip_series(typed, start, end)
    backlog = build_backlog_series(typed, start, end)

# Base grid: all staff x all dates
    staff = typed[["staff_id","team","role","fte"]].drop_duplicates()
    dates = pd.DataFrame({"date": pd.date_range(start, end, freq="D")})
# Combine tables by key columns
    grid = dates.assign(key=1).merge(staff.assign(key=1), on="key").drop(columns="key")

# Merge WIP
# Fill missing values with a default
    grid = grid.merge(wip, on=["date","staff_id","team"], how="left").fillna({"wip":0})

# Event flags
# Conditional branch
    if not ev.empty:
        ev_flags = (
# Create or transform columns
            ev.assign(flag=1)
              .pivot_table(index=["date","staff_id"], columns="event", values="flag", aggfunc="max")
# Reset index to turn group keys into columns
              .reset_index().rename_axis(None, axis=1)
        )
# Combine tables by key columns
        grid = grid.merge(ev_flags, on=["date","staff_id"], how="left")
# Loop over a sequence
    for c in ["newcase","legal_request","legal_approval","court_order"]:
# Conditional branch
        if c not in grid:
            grid[c] = 0
# Fallback branch
        else:
# Cast column(s) to a specific dtype
            grid[c] = grid[c].fillna(0).astype(int)

# --- SAFE time_since_last_pickup (no index mismatch) ---
# Sort rows by specified columns
    grid = grid.sort_values(["staff_id","date"])
# Group rows and compute aggregations
    grp = grid.groupby("staff_id", sort=False)
    runs = grp["newcase"].transform(lambda s: (s == 1).cumsum())
# Group rows and compute aggregations
    grid["time_since_last_pickup"] = grid.groupby([grid["staff_id"], runs]).cumcount()
    mask_no_pickups = grp["newcase"].transform("sum") == 0
# Select/assign rows/columns by label/position
    grid.loc[mask_no_pickups, "time_since_last_pickup"] = 99

# Calendar
    grid["dow"] = grid["date"].dt.day_name().str[:3]
    grid["season"] = grid["date"].dt.month.map(month_to_season)
# Cast column(s) to a specific dtype
    grid["term_flag"] = grid["date"].dt.month.map(is_term_month).astype(int)
    grid["bank_holiday"] = 0

# New starters
    first_alloc = (
# Drop rows with missing values
        typed.dropna(subset=["dt_alloc_invest"])
# Group rows and compute aggregations
             .groupby("staff_id")["dt_alloc_invest"].min()
# Rename columns for clarity/consistency
             .rename("first_alloc")
    )
    
# Combine tables by key columns
    grid = grid.merge(first_alloc, on="staff_id", how="left")
    grid["weeks_since_start"] = (
        (grid["date"] - grid["first_alloc"]).dt.days // 7
# Cast column(s) to a specific dtype
    ).fillna(0).clip(lower=0).astype(int)
# Cast column(s) to a specific dtype
    grid["is_new_starter"] = (grid["weeks_since_start"] < 4).astype(int)

# Default flags
    grid["mentoring_flag"] = 0
    grid["trainee_flag"] = 0

# Backlog (same for all staff/day)
# Fill missing values with a default
    grid = grid.merge(backlog, on="date", how="left").fillna({"backlog_available":0})

# Final columns
# Cast column(s) to a specific dtype
    grid["event_newcase"] = grid["newcase"].astype(int)
# Cast column(s) to a specific dtype
    grid["event_legal"]   = ((grid["legal_request"] + grid["legal_approval"]) > 0).astype(int)
# Cast column(s) to a specific dtype
    grid["event_court"]   = grid["court_order"].astype(int)
    grid = grid.drop(columns=["newcase","legal_request","legal_approval","court_order","first_alloc"])

    cols = ["date","staff_id","team","role","fte",
            "is_new_starter","weeks_since_start",
            "wip","time_since_last_pickup",
            "mentoring_flag","trainee_flag",
            "backlog_available","term_flag","season","dow","bank_holiday",
            "event_newcase","event_legal","event_court"]
# Reset index to turn group keys into columns
    daily = grid[cols].sort_values(["staff_id","date"]).reset_index(drop=True)

# <-- IMPORTANT: return the frames
# Return a value from a function
    return daily, backlog, ev


### data loading, exporting outputs


In [None]:
# Load a CSV file into a DataFrame
df_test = pd.read_csv(RAW_PATH, dtype=str, sep=None, engine="python", encoding="cp1252")

df_test.head()

df_raw, colmap = load_raw(RAW_PATH)
# print("df_raw:", df_raw)
# print("colmap:", colmap)

typed = engineer(df_raw, colmap)
# Print the engineered (typed) table
print("typed:", typed)

start, end = date_horizon(typed, 14)
# Print the date horizon used for the daily panel
print("start:", start)
print("end:", end)

daily, backlog, events = build_daily_panel(typed, start, end)

# (optional) save to disk
# Save a DataFrame to CSV
daily.to_csv(OUT_DIR / "investigator_daily.csv", index=False)
# Save a DataFrame to CSV
backlog.to_csv(OUT_DIR / "backlog_series.csv", index=False)
# Save a DataFrame to CSV
events.to_csv(OUT_DIR / "event_log.csv", index=False)

# # Print a message or value
# print(f"{len(daily):,} daily rows")
# # Print a message or value
# print("Date range:", daily["date"].min().date(), "→", daily["date"].max().date())
# # Print a message or value
# print("Investigators:", daily["staff_id"].nunique())
# # Print a message or value
# print("Total new case events:", int(daily["event_newcase"].sum()))


> # Stage 2 extension: historical "investigated so far" + 90-day daily predictions
> Assumes the Stage-2 output exists at `data/out/investigator_daily.csv`.
> "Investigated" here means daily pickups (`event_newcase`); swap to a different event if needed.
> 
> - This stage builds the historical time series of how many cases were investigated (interpreted as new case pickups, `event_newcase`) per investigator, role, and team, including the cumulative (“so far”) curves.
> 
> - It fits simple Gamma–Poisson posteriors and produces 90-day daily predictions (mean and 5–95% credible interval) for each investigator, role, and team.
> 
> - It saves six CSVs into `data/out/` so we can join/plot later.
> 
> **For “investigated” = completed rather than picked up, just change the column used from event_newcase to the right completion flag (e.g., if you track completions per day, swap it in where noted).**
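
The Gamma–Poisson step described above can be sketched for a single entity as follows. This is a minimal illustration, not the notebook's implementation: the counts (`y_total = 12`, `T = 60`) are made-up numbers, and only the prior values and the Negative Binomial parameterisation are taken from the code cell in this section.

```python
from scipy.stats import nbinom

# Hypothetical history: 12 pickups observed over 60 days for one investigator.
y_total, T = 12, 60

# Weak Gamma(alpha0, beta0) prior in the rate parameterisation.
alpha0, beta0 = 1.0, 1.0
alpha_post = alpha0 + y_total   # posterior shape after seeing the data
beta_post = beta0 + T           # posterior rate after seeing the data

# One-day-ahead predictive: Negative Binomial with r = alpha_post, p = beta_post / (beta_post + 1).
r_nb = alpha_post
p_nb = beta_post / (beta_post + 1.0)

pred_mean = nbinom.mean(r_nb, p_nb)               # equals alpha_post / beta_post
p05, p95 = nbinom.ppf([0.05, 0.95], r_nb, p_nb)   # 5% and 95% predictive quantiles

print(f"mean={pred_mean:.3f}, p05={p05:.0f}, p95={p95:.0f}")
```

Because daily pickup rates are low, the predictive quantiles are small integers. The mean is usually the more informative column for planning, with the quantiles indicating how lumpy a single day can be.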

### imports and environment setup, data loading, joining/merging datasets, aggregation/grouping, data cleaning, sorting, feature engineering, exporting outputs, prediction/forecasting


In [None]:

# === Stage 2 extension: historical "investigated so far" + 90-day daily predictions

# Assumes Stage-2 output exists at data/out/investigator_daily.csv

# "Investigated" here = daily pickups (event_newcase). Swap to a different event if needed.


from pathlib import Path

import pandas as pd
from scipy.stats import nbinom  # Negative Binomial for Gamma–Poisson posterior predictive


OUT = Path("data/out")

OUT.mkdir(parents=True, exist_ok=True)

daily_path = OUT / "investigator_daily.csv"

# ---------- Load ----------
# Load a CSV file into a DataFrame
daily = pd.read_csv(daily_path, parse_dates=["date"])

# If you want "investigated" to mean something else, swap this column:

target_col = "event_newcase"   # <--- change if needed (e.g., 'event_court' or a completion flag)

# Cast column(s) to a specific dtype
daily[target_col] = pd.to_numeric(daily[target_col], errors="coerce").fillna(0).astype(int)
# Fill missing values with a default
daily["team"] = daily.get("team", pd.Series(index=daily.index)).fillna("Unknown")
# Fill missing values with a default
daily["role"] = daily.get("role", pd.Series(index=daily.index)).fillna("Unknown")

last_date = daily["date"].max()

# =====================================================================

# 1) HISTORICAL: daily counts + cumulative ("so far") per entity

# =====================================================================


# Investigator
# Group rows and compute aggregations
hist_inv = (daily.groupby(["date","staff_id","team","role"], as_index=False)[target_col]

                 .sum()
# Rename columns for clarity/consistency
                 .rename(columns={target_col:"daily_pickups"}))
# Sort rows by specified columns
hist_inv = hist_inv.sort_values(["staff_id","date"])
# Group rows and compute aggregations
hist_inv["cum_pickups"] = hist_inv.groupby("staff_id")["daily_pickups"].cumsum()
# Save a DataFrame to CSV
hist_inv.to_csv(OUT / "hist_pickups_investigator.csv", index=False)


# Role
# Group rows and compute aggregations
hist_role = (daily.groupby(["date","role"], as_index=False)[target_col]

                  .sum()
# Rename columns for clarity/consistency
                  .rename(columns={target_col:"daily_pickups"}))
# Sort rows by specified columns
hist_role = hist_role.sort_values(["role","date"])
# Group rows and compute aggregations
hist_role["cum_pickups"] = hist_role.groupby("role")["daily_pickups"].cumsum()
# Save a DataFrame to CSV
hist_role.to_csv(OUT / "hist_pickups_role.csv", index=False)


# Team
# Group rows and compute aggregations
hist_team = (daily.groupby(["date","team"], as_index=False)[target_col]

                  .sum()
# Rename columns for clarity/consistency
                  .rename(columns={target_col:"daily_pickups"}))
# Sort rows by specified columns
hist_team = hist_team.sort_values(["team","date"])
# Group rows and compute aggregations
hist_team["cum_pickups"] = hist_team.groupby("team")["daily_pickups"].cumsum()
# Save a DataFrame to CSV
hist_team.to_csv(OUT / "hist_pickups_team.csv", index=False)

# =====================================================================

# 2) PREDICTIONS: 90-day daily counts per entity (Gamma–Poisson)
#    Posterior (rate-param Gamma prior α0=1, β0=1):
#      For a single day ahead, y ~ NegBinom(r=α_post, p=β_post/(β_post+1)),
#      E[y] = α_post / β_post, with 5–95% credible interval from NB quantiles.
# =====================================================================

# Define a reusable function
def posterior_by_key(daily_df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    # Aggregate to per-day counts for the entity
# Group rows and compute aggregations
    g_daily = (daily_df.groupby(key_cols + ["date"], as_index=False)[target_col]

                      .sum()
# Rename columns for clarity/consistency
                      .rename(columns={target_col:"y"}))
    # Total counts and exposure days (T = # unique dates observed for that entity)
# Group rows and compute aggregations
    g_total = (g_daily.groupby(key_cols, as_index=False)
# Apply aggregation(s) to grouped data
                      .agg(y_total=("y","sum"),

                           T=("date","nunique")))

    # Weak prior
    alpha0, beta0 = 1.0, 1.0

    g_total["alpha_post"] = alpha0 + g_total["y_total"]

    g_total["beta_post"]  = beta0 + g_total["T"]

    # Negative Binomial params for 1-day-ahead predictive:
    # In scipy: nbinom(n=r, p) has mean = r*(1-p)/p. Choose p = β/(β+1) → mean = α/β

    g_total["p_nb"] = g_total["beta_post"] / (g_total["beta_post"] + 1.0)

    g_total["r_nb"] = g_total["alpha_post"]

    # Daily expected value and 90% credible interval

    g_total["mean"] = g_total["r_nb"] * (1 - g_total["p_nb"]) / g_total["p_nb"]

    g_total["p05"]  = nbinom.ppf(0.05, n=g_total["r_nb"], p=g_total["p_nb"])

    g_total["p95"]  = nbinom.ppf(0.95, n=g_total["r_nb"], p=g_total["p_nb"])
# Return a value from a function
    return g_total[key_cols + ["mean","p05","p95"]]


H = 90
# Use pandas functionality
future_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=H, freq="D")

# Investigator predictions
post_inv = posterior_by_key(daily, ["staff_id"])

# Join convenient labels (first observed team/role for each staff)

first_map = daily[["staff_id","team","role"]].drop_duplicates("staff_id")
# Combine tables by key columns
post_inv = post_inv.merge(first_map, on="staff_id", how="left")


# Create or transform columns
f_inv = (post_inv.assign(key=1)
# Combine tables by key columns
                .merge(pd.DataFrame({"date":future_dates, "key":1}), on="key")

                .drop(columns="key"))[["date","staff_id","team","role","mean","p05","p95"]]
# Save a DataFrame to CSV
f_inv.to_csv(OUT / "forecast_pickups_investigator.csv", index=False)


# Role predictions
post_role = posterior_by_key(daily, ["role"])
# Create or transform columns
f_role = (post_role.assign(key=1)
# Combine tables by key columns
          .merge(pd.DataFrame({"date":future_dates, "key":1}), on="key")

          .drop(columns="key"))[["date","role","mean","p05","p95"]]
# Save a DataFrame to CSV
f_role.to_csv(OUT / "forecast_pickups_role.csv", index=False)


# Team predictions
post_team = posterior_by_key(daily, ["team"])
# Create or transform columns
f_team = (post_team.assign(key=1)
# Combine tables by key columns
          .merge(pd.DataFrame({"date":future_dates, "key":1}), on="key")

          .drop(columns="key"))[["date","team","mean","p05","p95"]]
# Save a DataFrame to CSV
f_team.to_csv(OUT / "forecast_pickups_team.csv", index=False)

# Print
print("Saved:\n -", OUT / "hist_pickups_investigator.csv",

      "\n -", OUT / "hist_pickups_role.csv",

      "\n -", OUT / "hist_pickups_team.csv",

      "\n -", OUT / "forecast_pickups_investigator.csv",

      "\n -", OUT / "forecast_pickups_role.csv",

      "\n -", OUT / "forecast_pickups_team.csv")



> # Bayesian Stage-3 implementation 
> This stage works directly off the outputs already generated (`data/out/investigator_daily.csv` and `data/out/backlog_series.csv`). 
> It does two things:
> 1. Backlog forecasting (next 90 days) via a conjugate Bayesian linear model on daily backlog deltas with an AR(1) feature, weekday effects, and annual seasonality. It returns full predictive uncertainty.
> 
> 2. Per-investigator pickup rates via Gamma–Poisson posteriors (an independent posterior for each investigator under a shared weak prior, not a hierarchical model), including 7-day and 28-day expected pickups.

> **Original note:**
>
> # Stage 3: Bayesian predictive models (Backlog + Investigator pickups)
> 
> The backlog model is a conjugate Bayesian regression on the daily change in backlog, with:
> 1. AR(1) feature via yesterday’s delta,
> 2. Weekday effects,
> 3. Annual seasonality (sin/cos).
> 
> It produces proper predictive uncertainty and naturally handles short histories (weak priors).
> 
> If you want team-level or role-level pickup posteriors, copy the per-investigator block and replace groupby("staff_id") with groupby("team") or any other cohort, with the same Gamma–Poisson math.
> 
> If you have bank holiday flags in your daily panel, you can add them as another regressor column in X (just remember to include it consistently when building make_x_row for the forecast).
> 
> ## This model generates the following outputs
> 1. Backlog forecaster — predicts how many cases will be waiting on each of the next 90 days.
> 2. Pickup-rate estimator — estimates how often each investigator typically picks up a new case, and how many they’re likely to pick up over the next 1–4 weeks.
> 
> ## Inputs
> 1. The daily backlog totals from Stage-2 (backlog_series.csv).
> 2. The per-investigator daily table from Stage-2 (investigator_daily.csv) which includes the daily “new case picked up” flag.
> 
> 
> ## 1) Backlog forecaster (daily totals)
> Instead of predicting the raw backlog level directly, the code predicts the daily change in backlog (today’s backlog minus yesterday’s).
> That change tends to be:
> - a bit like yesterday’s change (momentum),
> - slightly different on different weekdays (e.g., Mondays vs Fridays),
> - and nudged by time-of-year patterns (seasonality).
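
The three ingredients listed above (momentum, weekday, seasonality) map onto regression columns roughly as in the sketch below. The backlog values are made up for illustration. The column names follow the code cell later in this section, but treat the snippet as a toy reconstruction rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-week backlog series (values are made up for illustration).
dates = pd.date_range("2025-01-06", periods=21, freq="D")
toy = pd.DataFrame({"date": dates, "backlog_available": 100 + np.arange(21) % 5})

# Daily change: today's backlog minus yesterday's.
toy["delta"] = toy["backlog_available"].diff()
# Momentum feature: yesterday's change.
toy["lag_delta"] = toy["delta"].shift(1)
toy = toy.dropna(subset=["delta", "lag_delta"]).reset_index(drop=True)

# Weekday dummies (Monday = 0 is the dropped reference level).
dow = pd.get_dummies(toy["date"].dt.dayofweek, prefix="dow", drop_first=True).astype(float)

# Annual seasonality as a sin/cos pair with a period of about 365.25 days.
doy = toy["date"].dt.dayofyear.to_numpy(float)
toy["sin_annual"] = np.sin(2 * np.pi * doy / 365.25)
toy["cos_annual"] = np.cos(2 * np.pi * doy / 365.25)

# Design matrix: intercept, lagged change, seasonal pair, weekday dummies.
X = pd.concat(
    [pd.Series(1.0, index=toy.index, name="intercept"),
     toy[["lag_delta", "sin_annual", "cos_annual"]],
     dow],
    axis=1,
)
print(X.shape)
print(list(X.columns))
```

Two rows are lost at the start of the series (one to `diff`, one to `shift`). Any row built for forecasting has to reproduce the same column order.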
> 
> ### Bayesian
> - We start with very weak, generic expectations (“priors”), look at the data, and update our beliefs to a “posterior.” 
> - Then we simulate many possible futures consistent with what we learned. This gives not just a single forecast, but a spread (best guess + uncertainty bands).
> 
> ### What the code actually does
> - Builds a simple recipe for daily change: **today’s change ≈ intercept + (yesterday’s change) + weekday effect + seasonal wiggle + random noise**
> 
> - Fits that recipe with a Bayes method that’s efficient (a conjugate prior). This gives us a clean way to learn from the data and quantify uncertainty.
> 
> - Simulates thousands of future paths day-by-day: each new day’s change depends on the previous simulated day’s change (so it keeps momentum).
> 
> - Converts those simulated changes back into backlog levels, and clips at zero (no negative backlog).
> 
> - Summarises the simulations for each future date as:
>     - mean/median (central forecasts),
>     - p05/p95 (a 90% “credible interval”),
>     - p20/p80 (a tighter middle band).
> Example: **If p05 = 120 and p95 = 180 on a date, the model is saying “given history and patterns, there’s about a 90% chance backlog will be between 120 and 180 that day.”**
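
The fit, simulate, and summarise steps above can be condensed into a short sketch. Several things here are assumptions made for brevity: the history is simulated noise, the design uses only an intercept and the lagged change (no weekday or seasonal terms), and the forward simulation is vectorised across posterior draws instead of looping over them.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(0)

# Hypothetical history of daily backlog changes (simulated, not real data).
delta = rng.normal(0.5, 2.0, size=200)
y = delta[1:]
X = np.column_stack([np.ones(len(y)), delta[:-1]])   # intercept + yesterday's change

# Normal-Inverse-Gamma prior: beta | sigma^2 ~ N(m0, sigma^2 V0), sigma^2 ~ InvGamma(a0, b0).
n, p = X.shape
m0, V0inv = np.zeros(p), np.eye(p) * 1e-6            # V0 = 1e6 * I, a very weak prior
a0, b0 = 2.0, float(np.var(y))

# Closed-form conjugate update.
Vn_inv = X.T @ X + V0inv
Vn = np.linalg.inv(Vn_inv)
mn = Vn @ (V0inv @ m0 + X.T @ y)
an = a0 + n / 2.0
bn = b0 + 0.5 * (y @ y + m0 @ V0inv @ m0 - mn @ Vn_inv @ mn)

# Draw (sigma^2, beta) pairs from the posterior.
S, H, last_backlog = 2000, 30, 150.0
sigma2 = invgamma.rvs(a=an, scale=bn, size=S, random_state=rng)
chol = np.linalg.cholesky(Vn)
beta = mn + np.sqrt(sigma2)[:, None] * (rng.standard_normal((S, p)) @ chol.T)

# Roll every draw forward H days; each day's change feeds the next day's lag.
paths = np.empty((S, H))
lag = np.full(S, delta[-1])
for h in range(H):
    lag = beta[:, 0] + beta[:, 1] * lag + rng.normal(0.0, np.sqrt(sigma2))
    paths[:, h] = lag

# Convert changes back to levels, clip at zero, and summarise per future day.
levels = np.clip(last_backlog + np.cumsum(paths, axis=1), 0, None)
p05, p50, p95 = np.quantile(levels, [0.05, 0.5, 0.95], axis=0)
print(p05[-1], p50[-1], p95[-1])
```

The bands widen with the horizon because each path accumulates its own noise. That is why the 90-day end of the forecast should be read as a range, not a point.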
> 
> 
> ## 2) Investigator pickup rates (how often investigators take new cases)
> Each investigator’s daily pickup count is treated as a “counting process” (like number of arrivals per day). 
> Some people pick up more, some less, and some have sparse histories. We want fair estimates that stabilise when data is thin.
> 
> ### Gamma–Poisson
> - We assume daily pickups follow a Poisson process (a common, simple model for counts).
> - We put a Gamma prior on each person’s true underlying daily rate (how often they pick up).
> - Combining those gives a neat closed-form update (no heavy computation): you get a posterior for each person’s rate that blends their data with a little stabilising prior.
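
A small worked example of the closed-form update may help. It assumes the same weak `Gamma(1, 1)` prior used in the code cell of this section. The two investigators and their counts are hypothetical, and `rate_posterior` is an illustrative helper, not a function from the notebook.

```python
from scipy.stats import gamma

alpha0, beta0 = 1.0, 1.0   # weak prior (rate parameterisation); prior mean = alpha0 / beta0 = 1

def rate_posterior(y_total, days):
    """Return (mean, p05, p95) of the Gamma posterior for a daily pickup rate."""
    a = alpha0 + y_total          # posterior shape
    b = beta0 + days              # posterior rate
    lo, hi = gamma.ppf([0.05, 0.95], a=a, scale=1.0 / b)
    return a / b, lo, hi

# Two hypothetical investigators with the same raw rate (0.2 per day) but different histories.
sparse = rate_posterior(y_total=1, days=5)
rich = rate_posterior(y_total=40, days=200)

print("sparse history:", sparse)
print("rich history:  ", rich)
```

With only five days of data the estimate is pulled noticeably toward the prior mean and carries a wide interval. With 200 days the posterior mean sits close to the raw rate and the interval is much narrower.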
> 
> ### What the code outputs
> For each investigator:
> 1. A posterior mean daily pickup rate (our best estimate),
> 2. A credible range (p05–p95) to show uncertainty,
> 3. Expected pickups over the next 7 and 28 days (rate × days).
> 
> Example of usability: 
> **Rank investigators by posterior mean (or lower-bound like p05 for conservative planning) to understand expected intake capacity in the short term.**
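
The ranking idea can be sketched as below. The table is hypothetical: it reuses the column names written to `investigator_pickup_posterior.csv`, but the three rows are invented to show how the two orderings can differ.

```python
import pandas as pd
from scipy.stats import gamma

# Hypothetical posterior table (invented values; column names follow the Stage-3 output).
post = pd.DataFrame({
    "staff_id": ["A", "B", "C"],
    "alpha_post": [3.0, 31.0, 7.0],    # 1 + total pickups
    "beta_post": [6.0, 101.0, 61.0],   # 1 + days observed
})
post["rate_mean"] = post["alpha_post"] / post["beta_post"]
post["rate_p05"] = gamma.ppf(0.05, a=post["alpha_post"], scale=1.0 / post["beta_post"])

# Rank by the central estimate, then by the conservative lower bound.
by_mean = post.sort_values("rate_mean", ascending=False)["staff_id"].tolist()
by_p05 = post.sort_values("rate_p05", ascending=False)["staff_id"].tolist()

print("by mean:", by_mean)
print("by p05: ", by_p05)
```

Investigator A has the highest posterior mean but only five days of history, so the lower bound drops below that of B, whose rate is lower but far better established. Ranking on `rate_p05` therefore favours well-evidenced capacity.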
> 
> Outputs:
> 1. backlog_forecast_bayes.csv: one row per future day with mean, median, p05, p20, p80, p95.
> 2. investigator_pickup_posterior.csv: one row per investigator with posterior rate and 7-/28-day expectations.
> 
> ## Assumptions (and what to tweak)
> 1. Momentum matters: tomorrow’s change tends to resemble yesterday’s.
> 2. Weekdays differ: e.g., fewer allocations on weekends.
> 3. Seasonality: simple annual pattern (sine/cosine); can add school terms or fiscal periods.
> 4. Counts are Poisson: good first pass; if pickups bunch up or are capped, consider a richer model later.
> 5. Data quality: the forecast inherits any biases or gaps; adding bank holidays (already optional in your notebook) helps.
> 
> ## Future development
> 1. Add bank holiday and term time flags as extra predictors in the backlog model.
> 2. Estimate pickup rates by team/role (swap the group-by key).
> 3. Move to a hierarchical pickup model (shares strength across investigators/teams) if data per person is very sparse.
> 
> ### Why this is useful
> 1. We can get actionable ranges, not just a single number—great for planning under uncertainty.
> 2. The pickup posteriors turn noisy daily events into a stable, comparable measure of capacity.
> 3. It’s all fast and transparent, so you can iterate quickly as new data arrives.

### imports and environment setup, data loading, aggregation/grouping, data cleaning, sorting, exporting outputs, prediction/forecasting


In [None]:

# Import libraries/modules for use below
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import invgamma, gamma as gamma_dist

# ---- Locations ----
BASE = Path("data")

OUT  = BASE / "out"

OUT.mkdir(parents=True, exist_ok=True)

daily_path   = OUT / "investigator_daily.csv"

backlog_path = OUT / "backlog_series.csv"

# ---- Load outputs from Stage-2 ----
# Load a CSV file into a DataFrame
daily   = pd.read_csv(daily_path, parse_dates=["date"])
# Load a CSV file into a DataFrame
backlog = pd.read_csv(backlog_path, parse_dates=["date"]).sort_values("date").reset_index(drop=True)

# ---- Build daily delta series for backlog ----

backlog["delta"] = backlog["backlog_available"].diff()
# Drop rows with missing values
backlog = backlog.dropna(subset=["delta"]).reset_index(drop=True)

# Design matrix for a conjugate Bayesian linear model:

# y_t = delta_t ~ N(X_t beta, sigma^2), with X_t = [1, lag_delta, sin, cos, DOW dummies]

df = backlog.copy()

df["lag_delta"] = df["delta"].shift(1)
# Drop rows with missing values
df = df.dropna(subset=["lag_delta"]).reset_index(drop=True)

# Weekday effects (Mon=0..Sun=6), drop_first to avoid dummy trap

df["dow"] = df["date"].dt.dayofweek
# Use pandas functionality
dow_dummies = pd.get_dummies(df["dow"], prefix="dow", drop_first=True)

# Annual seasonality with sin/cos (period ~ 365.25)
# Cast column(s) to a specific dtype
day_of_year = df["date"].dt.dayofyear.astype(float)
# Use NumPy for numeric operations
df["sin_annual"] = np.sin(2 * np.pi * day_of_year / 365.25)
# Use NumPy for numeric operations
df["cos_annual"] = np.cos(2 * np.pi * day_of_year / 365.25)

# Use pandas functionality
X = pd.concat([
    pd.Series(1.0, index=df.index, name="intercept"),

    df[["lag_delta", "sin_annual", "cos_annual"]],

    dow_dummies

], axis=1)

y = df["delta"].to_numpy(float)

X_mat = X.to_numpy(float)

# ---- Conjugate Normal–Inverse-Gamma posterior ----

# Prior: beta|sigma^2 ~ N(m0, sigma^2 V0),  sigma^2 ~ InvGamma(a0, b0)
n, p = X_mat.shape
# Use NumPy for numeric operations
m0   = np.zeros(p)
V0   = np.eye(p) * 1e6         # weakly-informative

a0   = 2.0
# Use NumPy for numeric operations
yvar = float(np.var(y)) if np.isfinite(np.var(y)) and np.var(y) > 0 else 1.0
b0   = yvar * (a0 - 1)


XtX    = X_mat.T @ X_mat
V0inv  = np.linalg.inv(V0)
Vn_inv = XtX + V0inv                      # posterior precision of beta (scaled by 1/sigma^2)
Vn     = np.linalg.inv(Vn_inv)
mn     = Vn @ (V0inv @ m0 + X_mat.T @ y)  # posterior mean of beta
an     = a0 + n/2.0
bn     = b0 + 0.5*(y @ y + m0 @ V0inv @ m0 - mn @ Vn_inv @ mn)

# ---- Posterior predictive: forward simulate next H days with AR(1) lag ----

H = 90         # forecast horizon (days)
S = 4000       # posterior draws

# Select/assign rows/columns by label/position
last_delta   = float(df.iloc[-1]["delta"])
# Select/assign rows/columns by label/position
last_backlog = float(backlog.iloc[-1]["backlog_available"])
# Select/assign rows/columns by label/position
last_date    = df.iloc[-1]["date"]

# Use pandas functionality
future_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=H, freq="D")
future_dow   = future_dates.dayofweek
# Use NumPy for numeric operations
future_sin   = np.sin(2 * np.pi * future_dates.dayofyear / 365.25)
# Use NumPy for numeric operations
future_cos   = np.cos(2 * np.pi * future_dates.dayofyear / 365.25)
dow_cols     = [c for c in X.columns if c.startswith("dow_")]

# Define a reusable function
def make_x_row(lag_delta_val, idx):

    # Build X* in the same column order as training X
    dow = int(future_dow[idx])
# Use NumPy for numeric operations
    dd = np.zeros(len(dow_cols))
# Loop over a sequence
    for j, c in enumerate(dow_cols):
# Try a block of code that may raise errors
        try:
            target = int(c.split("_")[1])  # 'dow_3' -> 3
# Handle errors from the try block
        except Exception:
            target = None

        dd[j] = 1.0 if (target is not None and dow == target) else 0.0
# Use NumPy for numeric operations
    return np.concatenate(([1.0, lag_delta_val, future_sin[idx], future_cos[idx]], dd))

# Use NumPy for numeric operations
rng = np.random.default_rng(2025)

# Robust Cholesky (add tiny jitter if near-singular)
# Use NumPy for numeric operations
evals = np.linalg.eigvals(Vn)
# Conditional branch
if np.min(np.real(evals)) < 1e-12:
# Use NumPy for numeric operations
    Vn = Vn + np.eye(p) * 1e-10
# Use NumPy for numeric operations
L = np.linalg.cholesky(Vn)

# Sample (sigma^2, beta) from posterior

sigma2 = invgamma.rvs(a=an, scale=bn, size=S, random_state=rng)

z      = rng.standard_normal((S, p))
# Use NumPy for numeric operations
beta   = mn + np.sqrt(sigma2)[:, None] * (z @ L.T)

# Simulate daily deltas forward with AR lag in X
# Use NumPy for numeric operations
delta_draws = np.zeros((S, H))
# Loop over a sequence
for s in range(S):

    lag = last_delta
# Use NumPy for numeric operations
    bs  = beta[s]; sig = np.sqrt(sigma2[s])
# Loop over a sequence
    for h in range(H):

        xh = make_x_row(lag, h)

        mean_h = float(xh @ bs)

        delta_h = mean_h + rng.normal(0.0, sig)

        delta_draws[s, h] = delta_h

        lag = delta_h

# Transform to backlog levels; clip at zero
# Use NumPy for numeric operations
backlog_paths = last_backlog + np.cumsum(delta_draws, axis=1)
# Use NumPy for numeric operations
backlog_paths = np.clip(backlog_paths, 0, None)

# Summaries

q = [0.05, 0.2, 0.5, 0.8, 0.95]
# Use NumPy for numeric operations
Q = np.quantile(backlog_paths, q, axis=0).T
# Use pandas functionality
forecast_df = pd.DataFrame({

    "date":  future_dates,

    "mean":  backlog_paths.mean(axis=0),

    "median":Q[:, 2],

    "p05":   Q[:, 0],

    "p20":   Q[:, 1],

    "p80":   Q[:, 3],

    "p95":   Q[:, 4],

})
# Save a DataFrame to CSV
forecast_df.to_csv(OUT / "backlog_forecast_bayes.csv", index=False)

# ---- Per-investigator pickup rates: Gamma–Poisson posteriors ----

# For each staff_id, y_i ~ Poisson(theta_i * T_i) with daily exposure T_i (days).

# Prior theta_i ~ Gamma(alpha0, beta0) (rate parameterization) => posterior Gamma(alpha0 + y, beta0 + T).

di = daily.copy()
# Cast column(s) to a specific dtype
di["event_newcase"] = pd.to_numeric(di["event_newcase"], errors="coerce").fillna(0).astype(int)

# Group rows and compute aggregations
per_staff = (di.groupby("staff_id", as_index=False)
# Apply aggregation(s) to grouped data
               .agg(y_total=("event_newcase","sum"),

                    days=("date","nunique")))


alpha0, beta0 = 1.0, 1.0

per_staff["alpha_post"] = alpha0 + per_staff["y_total"]

per_staff["beta_post"]  = beta0 + per_staff["days"]

# Posterior summaries for daily rate theta_i

per_staff["rate_mean"]   = per_staff["alpha_post"] / per_staff["beta_post"]

per_staff["rate_median"] = gamma_dist.ppf(0.5, a=per_staff["alpha_post"], scale=1.0/per_staff["beta_post"])

per_staff["rate_p05"]    = gamma_dist.ppf(0.05, a=per_staff["alpha_post"], scale=1.0/per_staff["beta_post"])

per_staff["rate_p95"]    = gamma_dist.ppf(0.95, a=per_staff["alpha_post"], scale=1.0/per_staff["beta_post"])

# Expected pickups in next horizons

per_staff["exp_7d_mean"]  = per_staff["rate_mean"] * 7.0

per_staff["exp_28d_mean"] = per_staff["rate_mean"] * 28.0

# Save a DataFrame to CSV
per_staff.to_csv(OUT / "investigator_pickup_posterior.csv", index=False)

# Print a message or value
print("Done.")
# Print a message or value
print("Saved:", OUT / "backlog_forecast_bayes.csv")
# Print a message or value
print("Saved:", OUT / "investigator_pickup_posterior.csv")





# Stage 3 — Bayesian Forecasting (Ready-to-Run)
**Added:** 2025-10-27 10:45:02Z (UTC)

This section fits a **hierarchical Bayesian model** (PyMC) to predict **daily investigated cases** for the next **90 days** for each **investigator, role, and team**.
It is designed to work with the daily dataset built earlier in this notebook. If the dataset is not found in memory, you can point the loader to a CSV.

**What you'll get:**
- Posterior predictive draws and summary statistics per investigator × team × role × day (next 90 days).
- Aggregations to team-level, role-level, and org-level totals.
- A quick plot of the org-level total forecast.

> Tip: First run the data build section above so the in-memory DataFrame is available for immediate modelling.


In [None]:
# Optional: install dependencies if needed (uncomment if missing).
# %pip install pymc bambi arviz holidays


## Configuration & Dataset Discovery

In [None]:
# ---- Config ----
INPUT_CSV = '/mnt/data/investigator_daily.csv'  # Set to a file if you prefer to load from disk
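COUNT_COL_CANDIDATES = ['cases_investigated','investigated','num_investigated','completed_cases','cases_completed',
                        'event_newcase']  # 'event_newcase' is the daily pickup flag in the panel built above
DATE_COL_CANDIDATES = ['date','activity_date','day']
INVESTIGATOR_COL_KEYS = ['investigator','assignee','user','staff_id']  # 'staff_id' identifies investigators in the panel built above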
TEAM_COL_KEYS = ['team','squad']
ROLE_COL_KEYS = ['role','grade']
MAX_TRAIN_ROWS = None  # e.g., 250_000 to subsample for faster initial runs

import pandas as pd
import numpy as np
from pathlib import Path

def _find_df_in_globals():
    """Heuristically find a pandas DataFrame with the expected columns in the current namespace."""
    candidates = []
    for name, obj in globals().items():
        if isinstance(obj, pd.DataFrame):
            cols_lower = [c.lower() for c in obj.columns]
            has_date = any(c in cols_lower for c in DATE_COL_CANDIDATES)
            has_count = any(c in cols_lower for c in COUNT_COL_CANDIDATES)
            has_inv = any(any(k in c for k in INVESTIGATOR_COL_KEYS) for c in cols_lower)
            has_team = any(any(k in c for k in TEAM_COL_KEYS) for c in cols_lower)
            has_role = any(any(k in c for k in ROLE_COL_KEYS) for c in cols_lower)
            if has_date and has_count and has_inv and has_team and has_role:
                candidates.append((name, obj))
    return candidates[0][1] if candidates else None

def _standardise_columns(df):
    df = df.copy()
    lower_map = {c: c.lower() for c in df.columns}
    df.rename(columns=lower_map, inplace=True)
    # Date column
    date_col = next((c for c in DATE_COL_CANDIDATES if c in df.columns), None)
    assert date_col is not None, 'No date column found.'
    df['date'] = pd.to_datetime(df[date_col]).dt.tz_localize(None)
    # Count column
    count_col = next((c for c in COUNT_COL_CANDIDATES if c in df.columns), None)
    assert count_col is not None, 'No count column found.'
    df['y'] = pd.to_numeric(df[count_col], errors='coerce').fillna(0).astype(int)
    # Investigator/Team/Role
    def _first_col_containing(keys):
        for c in df.columns:
            for k in keys:
                if k in c:
                    return c
        return None
    inv_col = _first_col_containing(INVESTIGATOR_COL_KEYS)
    team_col = _first_col_containing(TEAM_COL_KEYS)
    role_col = _first_col_containing(ROLE_COL_KEYS)
    assert inv_col and team_col and role_col, 'Missing investigator/team/role columns.'
    df['investigator'] = df[inv_col].astype(str)
    df['team'] = df[team_col].astype(str)
    df['role'] = df[role_col].astype(str)
    # Keep only needed columns
    keep = ['date','investigator','team','role','y']
    df = df[keep].sort_values('date')
    # Ensure non-negative counts
    df['y'] = df['y'].clip(lower=0)
    return df

# Try in-memory first
df0 = _find_df_in_globals()
if df0 is None:
    p = Path(INPUT_CSV)
    if p.exists():
        df0 = pd.read_csv(p)
        print(f'Loaded dataset from {p}')
    else:
        raise FileNotFoundError('No suitable DataFrame found in memory and INPUT_CSV does not exist. Update INPUT_CSV or run the build cells above.')
else:
    print('Using dataset found in memory.')

df = _standardise_columns(df0)
if MAX_TRAIN_ROWS is not None and len(df) > MAX_TRAIN_ROWS:
    df = df.sample(MAX_TRAIN_ROWS, random_state=42).sort_values('date')
print(df.head())
print(df.dtypes)
print('Training rows:', len(df))


## Feature Engineering (Calendar & Encodings)

In [None]:
import pandas as pd
import numpy as np
try:
    import holidays
    _has_holidays = True
except Exception:
    _has_holidays = False

# Day-of-week and holiday flag (England & Wales if available)
df = df.copy()
df['dow'] = df['date'].dt.dayofweek.astype(int)
if _has_holidays:
    years = range(df['date'].dt.year.min(), df['date'].dt.year.max() + 3)
    uk = holidays.country_holidays('GB', subdiv='ENG', years=years)
    # Test each calendar date directly against the holiday calendar
    df['is_holiday'] = df['date'].dt.date.map(lambda d: d in uk).astype(int)
else:
    df['is_holiday'] = 0

# Encode categories to integer indices.
# pd.factorize returns (codes, uniques): one integer code per row, plus the sorted unique labels.
inv_codes, inv_levels = pd.factorize(df['investigator'], sort=True)
team_codes, team_levels = pd.factorize(df['team'], sort=True)
role_codes, role_levels = pd.factorize(df['role'], sort=True)

df['inv_idx'] = inv_codes.astype(int)
df['team_idx'] = team_codes.astype(int)
df['role_idx'] = role_codes.astype(int)

# The number of distinct levels sizes the random-effect vectors in the model
n_inv = len(inv_levels)
n_team = len(team_levels)
n_role = len(role_levels)
print({'n_inv': n_inv, 'n_team': n_team, 'n_role': n_role})
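
The integer encoding above relies on `pd.factorize` returning a pair: the per-row codes first, then the unique labels. A minimal sketch with made-up investigator names (not taken from the dataset) shows which half is which:

```python
import pandas as pd

# Hypothetical names, for illustration only
names = pd.Series(['Casey', 'Avery', 'Casey', 'Blake'])
codes, levels = pd.factorize(names, sort=True)

print(codes.tolist())   # one integer per row: [2, 0, 2, 1]
print(list(levels))     # sorted unique labels: ['Avery', 'Blake', 'Casey']
print(len(levels))      # number of distinct investigators: 3
```

The codes are what index into a random-effect vector; the length of the unique labels is what sizes that vector.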


## Fit Hierarchical Negative Binomial (PyMC)

### Note — Why Bayesian here (PyMC)

**For data scientists (math/stats):**
- Likelihood: $y_i \sim \text{NegBin}(\mu_i, \alpha)$ with log link $\log \mu_i = X_i\beta + b_{\text{inv}[i]} + b_{\text{team}[i]} + b_{\text{role}[i]} + \dots$.
- We infer the full posterior $p(\beta, b, \sigma, \alpha \mid y) \propto \prod_i p(y_i \mid \mu_i, \alpha)\, p(\beta)\, p(b \mid \sigma)\, p(\sigma)\, p(\alpha)$ using NUTS (HMC), where $\sigma$ collects the random-intercept scales.
- Random intercepts yield **partial pooling**, stabilising estimates for sparse investigators/teams and reducing overfitting.
- Priors act as **regularisation**; posterior predictive checks (PPC) assess calibration and overdispersion.

**For non-experts (plain English):**
- Shares information across people/teams so small groups don't swing wildly.
- Gives **ranges** (credible intervals) rather than a single number—better for planning under uncertainty.
- Learns weekday and holiday patterns, and its estimates update each time the model is refitted on new data.

See the README section **“Why Poisson–Gamma (Negative Binomial) for daily case counts?”** in `README_Investigations_Backlog_Documentation.md` for the reasoning behind the count likelihood and overdispersion.
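
As a rough illustration of the Poisson-Gamma view of the Negative Binomial, the sketch below draws a Gamma-distributed rate for each observation and then a Poisson count given that rate. The values of `mu` and `alpha` are arbitrary choices for the demonstration, not fitted estimates. Under the `(mu, alpha)` parameterisation used by PyMC, the resulting counts should have mean close to `mu` and variance close to `mu + mu**2 / alpha`.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, alpha = 4.0, 2.0   # illustrative values only
n = 200_000

# Gamma rate with mean mu and shape alpha, then a Poisson count given each rate
rates = rng.gamma(shape=alpha, scale=mu / alpha, size=n)
y = rng.poisson(rates)

print('sample mean:', y.mean())                  # close to mu
print('sample variance:', y.var())               # well above the mean
print('mu + mu^2/alpha:', mu + mu**2 / alpha)    # theoretical variance
```

A plain Poisson model would force the variance to equal the mean; the extra `mu**2 / alpha` term is the overdispersion the model estimates.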

In [None]:
import pymc as pm
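try:
    import pytensor.tensor as at  # tensor backend for PyMC >= 5
except ImportError:
    import aesara.tensor as at    # tensor backend for PyMC 4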
import numpy as np

RANDOM_SEED = 123
DRAWS = 1000   # Increase for production
TUNE = 1000
TARGET_ACCEPT = 0.9

# Build model with MutableData so we can switch to future covariates later
# (recent PyMC 5 releases deprecate pm.MutableData in favour of pm.Data)
with pm.Model() as model:
    inv_idx_data = pm.MutableData('inv_idx', df['inv_idx'].values)
    team_idx_data = pm.MutableData('team_idx', df['team_idx'].values)
    role_idx_data = pm.MutableData('role_idx', df['role_idx'].values)
    dow_data = pm.MutableData('dow', df['dow'].values)
    hol_data = pm.MutableData('hol', df['is_holiday'].values)
    y_obs = df['y'].values

    # Hyperpriors for random intercepts
    sigma_inv = pm.HalfNormal('sigma_inv', 0.5)
    sigma_team = pm.HalfNormal('sigma_team', 0.5)
    sigma_role = pm.HalfNormal('sigma_role', 0.5)

    z_inv = pm.Normal('z_inv', 0, 1, shape=n_inv)
    z_team = pm.Normal('z_team', 0, 1, shape=n_team)
    z_role = pm.Normal('z_role', 0, 1, shape=n_role)

    inv_eff = pm.Deterministic('inv_eff', z_inv * sigma_inv)
    team_eff = pm.Deterministic('team_eff', z_team * sigma_team)
    role_eff = pm.Deterministic('role_eff', z_role * sigma_role)

    # Fixed effects
    beta_intercept = pm.Normal('beta_intercept', 0, 2)
    beta_dow = pm.Normal('beta_dow', 0, 0.5, shape=7)
    beta_hol = pm.Normal('beta_hol', 0, 0.5)

    # Linear predictor
    mu_lin = (
        beta_intercept
        + inv_eff[inv_idx_data]
        + team_eff[team_idx_data]
        + role_eff[role_idx_data]
        + beta_dow[dow_data]
        + beta_hol * hol_data
    )
    lam = pm.Deterministic('lam', at.exp(mu_lin))

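    # Dispersion for the Negative Binomial: Var(y) = mu + mu**2 / alpha,
    # so smaller alpha means more overdispersion
    alpha = pm.HalfNormal('alpha', 1.0)

    # shape=lam.shape ties the likelihood's length to the data containers, so
    # pm.set_data can later swap in a future design with a different number of rows
    y = pm.NegativeBinomial('y', mu=lam, alpha=alpha, observed=y_obs, shape=lam.shape)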

    trace = pm.sample(DRAWS, tune=TUNE, target_accept=TARGET_ACCEPT, chains=4, random_seed=RANDOM_SEED)

    # In-sample posterior predictive checks
    ppc_insample = pm.sample_posterior_predictive(trace, var_names=['y'])
print('Model fit complete.')


## Forecast Next 90 Days

In [None]:
from datetime import timedelta
import pandas as pd
import numpy as np
from pathlib import Path

OUT_DIR = Path('/mnt/data/forecasts')
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Build future calendar (next 90 days)
last_day = df['date'].max()
future_dates = pd.date_range(last_day + pd.Timedelta(days=1), periods=90, freq='D')

# Use all observed investigator/team/role combinations
units = df[['investigator','team','role','inv_idx','team_idx','role_idx']].drop_duplicates()
future = units.assign(key=1).merge(pd.DataFrame({'date': future_dates, 'key':1}), on='key').drop('key', axis=1)

# Add features to future
future['dow'] = future['date'].dt.dayofweek.astype(int)
if 'is_holiday' in df.columns and df['is_holiday'].max() in [0,1]:
    # Recompute holidays for the new date range if possible
    try:
        import holidays
        years = range(future['date'].dt.year.min(), future['date'].dt.year.max() + 1)
        uk = holidays.country_holidays('GB', subdiv='ENG', years=years)
        # Test each calendar date directly against the holiday calendar
        future['is_holiday'] = future['date'].dt.date.map(lambda d: d in uk).astype(int)
    except Exception:
        future['is_holiday'] = 0
else:
    future['is_holiday'] = 0

# Switch the model's data to the future design
with model:
    pm.set_data({
        'inv_idx': future['inv_idx'].values,
        'team_idx': future['team_idx'].values,
        'role_idx': future['role_idx'].values,
        'dow': future['dow'].values,
        'hol': future['is_holiday'].values,
    })
    ppc_future = pm.sample_posterior_predictive(trace, var_names=['y'])

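# Summarise posterior predictive for each row
# sample_posterior_predictive returns InferenceData; 'y' has dims (chain, draw, N)
draws = ppc_future.posterior_predictive['y'].values
if draws.ndim == 3:
    # Flatten chains and draws into a single sample axis: (chains*draws, N)
    draws = draws.reshape((-1, draws.shape[-1]))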
means = draws.mean(axis=0)
medians = np.median(draws, axis=0)
low90 = np.quantile(draws, 0.05, axis=0)
high90 = np.quantile(draws, 0.95, axis=0)
low50 = np.quantile(draws, 0.25, axis=0)
high50 = np.quantile(draws, 0.75, axis=0)

future_out = future.copy()
future_out['pred_mean'] = means
future_out['pred_median'] = medians
future_out['pred_p05'] = low90
future_out['pred_p95'] = high90
future_out['pred_p25'] = low50
future_out['pred_p75'] = high50

# Save investigator-level forecasts
inv_path = OUT_DIR / 'investigator_daily_forecast_90d.csv'
future_out.to_csv(inv_path, index=False)
print(f'Saved investigator-level forecasts to: {inv_path}')

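# Aggregations: team-level, role-level, and org-level.
# Quantiles are not additive, so sum the posterior draws within each group
# first and summarise the group totals afterwards.
def _aggregate_forecast(frame, draws, by):
    rows = []
    # .indices maps each group key to the row positions (= draw columns) in that group
    groups = frame.reset_index(drop=True).groupby(by).indices
    for key, pos in groups.items():
        total = draws[:, pos].sum(axis=1)
        key = key if isinstance(key, tuple) else (key,)
        rows.append(dict(zip(by, key),
                         pred_mean=total.mean(),
                         pred_median=np.median(total),
                         pred_p05=np.quantile(total, 0.05),
                         pred_p95=np.quantile(total, 0.95),
                         pred_p25=np.quantile(total, 0.25),
                         pred_p75=np.quantile(total, 0.75)))
    return pd.DataFrame(rows).sort_values(by).reset_index(drop=True)

team_out = _aggregate_forecast(future_out, draws, ['date', 'team'])
role_out = _aggregate_forecast(future_out, draws, ['date', 'role'])
org_out = _aggregate_forecast(future_out, draws, ['date'])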

team_path = OUT_DIR / 'team_daily_forecast_90d.csv'
role_path = OUT_DIR / 'role_daily_forecast_90d.csv'
org_path = OUT_DIR / 'org_daily_forecast_90d.csv'
team_out.to_csv(team_path, index=False)
role_out.to_csv(role_path, index=False)
org_out.to_csv(org_path, index=False)
print(f'Saved team-level forecasts to: {team_path}')
print(f'Saved role-level forecasts to: {role_path}')
print(f'Saved org-level forecasts to: {org_path}')
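
When rolling investigator forecasts up to team, role, or organisation level, the interval bounds need care: means add up across investigators, but quantiles do not. The sketch below uses synthetic Poisson draws as a stand-in for posterior predictive samples (the rate of 3 and the group of 10 investigators are arbitrary assumptions) and compares the two ways of building an upper bound for a group total.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in draws: 4000 samples for 10 investigators on a single day
draws = rng.poisson(3.0, size=(4000, 10))

# Per-investigator 95th percentiles, then summed
sum_of_p95 = np.quantile(draws, 0.95, axis=0).sum()
# Group total within each draw, then the 95th percentile of those totals
p95_of_sum = np.quantile(draws.sum(axis=1), 0.95)

print('sum of per-investigator p95:', sum_of_p95)
print('p95 of the group total:', p95_of_sum)
```

Summing the individual upper bounds assumes every investigator has an unusually busy day at the same time, so it overstates the upper bound for the group. Summing within each draw first keeps the bound honest.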


## Quick Visual: Organisation-wide Forecast (Totals)

In [None]:
import matplotlib.pyplot as plt
from pathlib import Path

org_path = Path('/mnt/data/forecasts/org_daily_forecast_90d.csv')
org = pd.read_csv(org_path, parse_dates=['date'])
plt.figure()
plt.plot(org['date'], org['pred_mean'], label='Forecast mean (next 90 days)')
plt.title('Organisation-wide investigated cases: 90-day forecast')
plt.xlabel('Date')
plt.ylabel('Cases')
plt.legend()
plt.tight_layout()
plt.show()

print('Plot displayed. Image not saved by default to keep the notebook tidy.')


### Notes & Tips
- To speed up first runs, lower `DRAWS`/`TUNE` or set `MAX_TRAIN_ROWS` to a smaller number.
- For richer structure, extend the model with time trends, seasonal splines, or random slopes.
- If you prefer formulas, you can port this to **Bambi** with `y ~ 1 + dow + is_holiday + (1|investigator) + (1|team) + (1|role)` and `family='negativebinomial'`.


# Alternative: Bambi (Formula Interface)
**Added:** 2025-10-27 10:48:20Z (UTC)

This section mirrors the PyMC approach using **Bambi**, a high-level formula interface
built on top of PyMC. It fits a **Negative Binomial** model with random intercepts for
**investigator**, **team**, and **role**, plus **day-of-week** and **holiday** effects.

Formula used:

```
y ~ 1 + dow + is_holiday + (1|investigator) + (1|team) + (1|role)
```

Outputs:
- 90-day daily forecasts for each observed investigator/team/role combination.
- Aggregations to team/role/org.
- If supported by your Bambi version, posterior predictive intervals; otherwise, mean forecasts.


In [None]:
# Optional: install if missing (uncomment if needed)
# %pip install bambi arviz


## Prepare Dataset for Bambi

In [None]:
import pandas as pd
from pathlib import Path

# Expect df with columns: date, y, dow, is_holiday, investigator, team, role
# If df not present, try to load from the same CSV used above (or adjust INPUT_CSV)
if 'df' not in globals():
    try:
        INPUT_CSV
    except NameError:
        INPUT_CSV = '/mnt/data/investigator_daily.csv'
    p = Path(INPUT_CSV)
    if p.exists():
        df = pd.read_csv(p)
        print(f'Loaded dataset from {p}')
    else:
        raise FileNotFoundError('Expected DataFrame `df` not found in memory and INPUT_CSV does not exist. Please run the build cells above or update INPUT_CSV.')

# Standardise expected columns (if needed)
df = df.copy()
lc = {c: c.lower() for c in df.columns}
df.rename(columns=lc, inplace=True)
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'])
if 'dow' not in df.columns:
    df['dow'] = df['date'].dt.dayofweek.astype(int)
if 'is_holiday' not in df.columns:
    df['is_holiday'] = 0
if 'y' not in df.columns:
    # Heuristic to find a count column
    for c in ['cases_investigated','investigated','num_investigated','completed_cases','cases_completed']:
        if c in df.columns:
            df['y'] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int)
            break
    assert 'y' in df.columns, 'No count column detected; please add column y.'
# Ensure categorical variables are categorical (Bambi handles strings too, but categories are explicit)
for col in ['investigator','team','role']:
    df[col] = df[col].astype('category')
print(df[['date','investigator','team','role','y','dow','is_holiday']].head())


## Fit Bambi Negative Binomial

### Note — Why Bayesian here (Bambi)

Bambi uses the same Bayesian engine (PyMC) with a formula interface.

**For data scientists (math/stats):**
- Model: `y ~ 1 + dow + is_holiday + (1|investigator) + (1|team) + (1|role)` with `family='negativebinomial'`.
- Hierarchical random intercepts implement partial pooling; priors regularise parameters; NUTS samples the joint posterior.
- Use PPC and coverage of credible intervals to validate fit.

**For non-experts (plain English):**
- Same benefits as the PyMC block, but simpler syntax—handy for quick iteration and explainability.
- Produces forecast **ranges** you can plan around.

For the choice of the Negative Binomial (Poisson–Gamma) count model, see `README_Investigations_Backlog_Documentation.md` → *Why Poisson–Gamma (Negative Binomial) for daily case counts?*
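
One way to check interval coverage is to count how often held-out observations fall inside the central predictive interval. The sketch below is a minimal version on synthetic data: the helper name `interval_coverage`, the Poisson rates, and the array sizes are all assumptions for illustration, and the Poisson draws merely stand in for posterior predictive samples.

```python
import numpy as np

def interval_coverage(y_true, draws, level=0.90):
    """Share of observations inside the central `level` predictive interval."""
    tail = (1.0 - level) / 2.0
    lo = np.quantile(draws, tail, axis=0)
    hi = np.quantile(draws, 1.0 - tail, axis=0)
    return float(np.mean((y_true >= lo) & (y_true <= hi)))

rng = np.random.default_rng(0)
mu = rng.uniform(1.0, 6.0, size=500)         # synthetic daily rates
y_true = rng.poisson(mu)                     # synthetic held-out counts
draws = rng.poisson(mu, size=(2000, 500))    # stand-in predictive draws, shape (samples, N)

cov = interval_coverage(y_true, draws)
print(f'empirical 90% coverage: {cov:.2f}')
```

For count data the empirical figure tends to sit at or a little above the nominal level because the intervals are discrete. Coverage well below nominal suggests the intervals are too narrow.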

In [None]:
import bambi as bmb

RANDOM_SEED = 123
DRAWS = 1000   # Increase for production
TUNE = 1000
TARGET_ACCEPT = 0.9

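# `dow` is stored as an integer, so this formula fits one linear slope across 0-6.
# Wrap it as C(dow) to estimate a separate effect per weekday, as the PyMC model does.
formula = 'y ~ 1 + dow + is_holiday + (1|investigator) + (1|team) + (1|role)'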
model_bmb = bmb.Model(formula, df, family='negativebinomial')
idata_bmb = model_bmb.fit(draws=DRAWS, tune=TUNE, target_accept=TARGET_ACCEPT, chains=4, random_seed=RANDOM_SEED)
print('Bambi model fit complete.')


## Forecast Next 90 Days with Bambi

In [None]:
from datetime import timedelta
import numpy as np
import pandas as pd
from pathlib import Path

OUT_DIR = Path('/mnt/data/forecasts')
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Build future calendar (next 90 days)
last_day = df['date'].max()
future_dates = pd.date_range(last_day + pd.Timedelta(days=1), periods=90, freq='D')

# All observed unit combos
units = df[['investigator','team','role']].drop_duplicates()
future_bmb = units.assign(key=1).merge(pd.DataFrame({'date': future_dates, 'key':1}), on='key').drop('key', axis=1)
future_bmb['dow'] = future_bmb['date'].dt.dayofweek.astype(int)
if 'is_holiday' in df.columns and df['is_holiday'].max() in [0,1]:
    try:
        import holidays
        years = range(future_bmb['date'].dt.year.min(), future_bmb['date'].dt.year.max() + 1)
        uk = holidays.country_holidays('GB', subdiv='ENG', years=years)
        # Test each calendar date directly against the holiday calendar
        future_bmb['is_holiday'] = future_bmb['date'].dt.date.map(lambda d: d in uk).astype(int)
    except Exception:
        future_bmb['is_holiday'] = 0
else:
    future_bmb['is_holiday'] = 0

# Ensure categorical types align with training
for col in ['investigator','team','role']:
    future_bmb[col] = future_bmb[col].astype('category')
    # align categories with training df
    future_bmb[col] = future_bmb[col].cat.set_categories(df[col].cat.categories)

def _summarise_pps(draws_array):
    # draws_array expected shape: (samples, N)
    means = draws_array.mean(axis=0)
    medians = np.median(draws_array, axis=0)
    p05 = np.quantile(draws_array, 0.05, axis=0)
    p95 = np.quantile(draws_array, 0.95, axis=0)
    p25 = np.quantile(draws_array, 0.25, axis=0)
    p75 = np.quantile(draws_array, 0.75, axis=0)
    return means, medians, p05, p95, p25, p75

pred_cols = ['pred_mean','pred_median','pred_p05','pred_p95','pred_p25','pred_p75']
future_out_bmb = future_bmb.copy()

try:
    # Preferred: posterior predictive samples. predict() modifies idata in place and
    # returns None by default, so request a new InferenceData with inplace=False.
    # Newer Bambi releases may expect kind='response' instead of kind='pps'.
    idata_pps = model_bmb.predict(idata_bmb, data=future_bmb, kind='pps', inplace=False)
    pps = idata_pps.posterior_predictive['y']
    # Flatten (chain, draw, obs) to (samples, N); `...` keeps whatever the obs dim is called
    arr = pps.stack(sample=('chain', 'draw')).transpose('sample', ...).values
    m, md, p05, p95, p25, p75 = _summarise_pps(arr)
    future_out_bmb['pred_mean'] = m
    future_out_bmb['pred_median'] = md
    future_out_bmb['pred_p05'] = p05
    future_out_bmb['pred_p95'] = p95
    future_out_bmb['pred_p25'] = p25
    future_out_bmb['pred_p75'] = p75
    print('Used posterior predictive samples from Bambi for intervals.')
except Exception as e:
    print('Falling back to mean predictions only (intervals unavailable):', e)
    idata_mean = model_bmb.predict(idata_bmb, data=future_bmb, kind='mean', inplace=False)
    # The mean lives in the posterior group; its name depends on the Bambi version
    # (assumed here: 'y_mean' in older releases, 'mu' in newer ones)
    mean_name = 'y_mean' if 'y_mean' in idata_mean.posterior else 'mu'
    future_out_bmb['pred_mean'] = idata_mean.posterior[mean_name].mean(dim=('chain', 'draw')).values
    # Leave interval columns as NaN to signal they were not computed
    for c in pred_cols[1:]:
        future_out_bmb[c] = np.nan

# Save investigator-level forecasts (Bambi)
inv_path = OUT_DIR / 'investigator_daily_forecast_90d_bambi.csv'
future_out_bmb.to_csv(inv_path, index=False)
print(f'Saved investigator-level forecasts (Bambi) to: {inv_path}')

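# Aggregations
# Only pred_mean is additive across investigators. The summed median and quantile
# columns are not quantiles of the group total; treat them as rough guides only.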
cols = ['pred_mean','pred_median','pred_p05','pred_p95','pred_p25','pred_p75']
team_out_bmb = (future_out_bmb.groupby(['date','team'], as_index=False)[cols].sum(min_count=1))
role_out_bmb = (future_out_bmb.groupby(['date','role'], as_index=False)[cols].sum(min_count=1))
org_out_bmb = (future_out_bmb.groupby(['date'], as_index=False)[cols].sum(min_count=1))

team_path = OUT_DIR / 'team_daily_forecast_90d_bambi.csv'
role_path = OUT_DIR / 'role_daily_forecast_90d_bambi.csv'
org_path = OUT_DIR / 'org_daily_forecast_90d_bambi.csv'
team_out_bmb.to_csv(team_path, index=False)
role_out_bmb.to_csv(role_path, index=False)
org_out_bmb.to_csv(org_path, index=False)
print(f'Saved team-level forecasts (Bambi) to: {team_path}')
print(f'Saved role-level forecasts (Bambi) to: {role_path}')
print(f'Saved org-level forecasts (Bambi) to: {org_path}')


## Quick Visual: Organisation-wide Forecast (Bambi)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

org_path = Path('/mnt/data/forecasts/org_daily_forecast_90d_bambi.csv')
org = pd.read_csv(org_path, parse_dates=['date'])
plt.figure()
plt.plot(org['date'], org['pred_mean'], label='Bambi forecast mean (next 90 days)')
plt.title('Organisation-wide investigated cases: 90-day forecast (Bambi)')
plt.xlabel('Date')
plt.ylabel('Cases')
plt.legend()
plt.tight_layout()
plt.show()
print('Plot displayed.')
