# Predictive modelling (New Cases) 
The priority is the work that supports predicting new case starts for investigations, not predicting the backlog. In the short term they are looking for a tool that helps them:
1. predict how many new case starts there will be,
2. predict how many new investigations they can start as staff numbers increase or decrease,
3. work out what staffing levels would be required to reach a given increase in the number of cases investigated.

The initial plan discussed is to do this with a (dynamic) micro-simulation model.

Forecasting the number of cases they will continue to receive (the work that might be added to the backlog) is being picked up separately. This work asks: within that backlog (GPEN), how quickly can cases be picked up and investigation started? That is what "new case starts" means here.

A new case start is not the time taken to allocate a case to a team. It is the time at which the investigator actually starts investigating it.

- Uses the investigations database (with fields like Case Type, Date Received in OPG, Date allocated to current investigator, Status, Weighting, etc.) to build:
    - start_date: the calendar date the investigation really starts.
    - **wait_to_start**: how many days each case waited from “concern received” to “investigation started”.
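
A minimal sketch of deriving these two fields. It assumes the export has columns named `date_received_opg` and `date_allocated_investigator` (the names used in the demo config later in this notebook; the real database fields may be named differently) and that the allocation date is treated as the investigation start. The helper name `add_start_fields` is hypothetical.

```python
import pandas as pd


def add_start_fields(
    cases: pd.DataFrame,
    received_col: str = "date_received_opg",             # assumed column name
    allocated_col: str = "date_allocated_investigator",  # assumed column name
) -> pd.DataFrame:
    """Return a copy of cases with start_date and wait_to_start (in days)."""
    df = cases.copy()
    # Coerce unparseable values to NaT instead of raising
    received = pd.to_datetime(df[received_col], errors="coerce").dt.normalize()
    allocated = pd.to_datetime(df[allocated_col], errors="coerce").dt.normalize()

    # Assumption: the investigation starts on the allocation date
    df["start_date"] = allocated
    # Cases not yet allocated keep a missing wait (NaN)
    df["wait_to_start"] = (allocated - received).dt.days
    return df


example = pd.DataFrame({
    "date_received_opg": ["2024-01-01", "2024-01-05", "2024-01-10"],
    "date_allocated_investigator": ["2024-01-08", "2024-02-04", None],
})
example_out = add_start_fields(example)
print(example_out[["start_date", "wait_to_start"]])
```

Cases still sitting in the backlog have no allocation date, so their `wait_to_start` stays missing rather than being counted as zero.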

- Focuses on new investigations started (when they actually move into “investigation phase”, i.e. when they’re allocated to an investigator):
    - New case starts for investigators (i.e. time from one allocation to the next),
    - Per investigator, with attention to staff type/FTE pattern (full-time vs 0.5, etc.)
    - Not **Simon**’s long lags from allocation → legal review / court application (a small subset of cases, with lags of months or years).

- Models daily number of new investigations started by case type as a function of staff FTE (or similar capacity measure).

- Lets you run “what-if” scenarios:
    - “If we increase staff FTE to X,
    - how many new investigations can we start per day by case type?”
    - or “If we want to start Y cases per day, how much FTE do we need?”.
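
The two what-if questions above can be sketched with simple rate arithmetic. This is a rough illustration, not the planned simulation: it assumes each investigator in a staff band starts, on average, one case every `mean_gap` days, so contributes `1 / mean_gap` starts per day. The mean gaps below are hypothetical placeholders, headcount per band stands in for FTE, and the function names are invented for this example.

```python
from math import ceil
from typing import Mapping

# Hypothetical mean gaps (days between allocations) per staff band;
# replace with values estimated from the real allocation data.
MEAN_GAP_DAYS = {"FT_0_8_to_1_0": 5.0, "PT_0_5_to_0_8": 10.0}


def expected_starts_per_day(
    headcount: Mapping[str, int],
    mean_gap_days: Mapping[str, float] = MEAN_GAP_DAYS,
) -> float:
    """Expected new starts per day: each investigator adds 1 / mean_gap."""
    return sum(n / mean_gap_days[band] for band, n in headcount.items())


def investigators_needed(
    target_per_day: float,
    band: str,
    mean_gap_days: Mapping[str, float] = MEAN_GAP_DAYS,
) -> int:
    """Smallest headcount in one band whose expected starts/day meets the target."""
    # Small tolerance so exact products are not rounded up by float noise
    return ceil(target_per_day * mean_gap_days[band] - 1e-9)


current = {"FT_0_8_to_1_0": 10, "PT_0_5_to_0_8": 4}
starts_now = expected_starts_per_day(current)
ft_needed = investigators_needed(3.0, "FT_0_8_to_1_0")
print(f"Expected starts per day with current staff: {starts_now:.2f}")
print(f"Full-time investigators needed for 3 starts per day: {ft_needed}")
```

The mean gap is used here rather than the median because the long-run rate of starts per investigator is one over the mean gap. Splitting the result by case type would need an additional assumption about the case-type mix.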

- **Phase 1 “simple” model code: distribution of time intervals between allocations, by staff type (and optionally case type).**

- Plugs into a dynamic simulation later.

- We have a backlog (GPEN). Cases sit there until an investigator actually starts investigating them.

- Operationally, “**a new case start**” is:
    - The day a case is allocated to an investigator and they start working on it
    - (i.e. when it leaves backlog and enters someone’s caseload).

- For each investigator, over time, we see a sequence of allocations:
    - … -> case allocated on 2024-01-03 -> next case allocated on 2024-01-07 -> …

- The key quantity to model is:
    - For a given investigator (with a given FTE pattern / staff type),
    - what is the distribution of time between one allocation and the next?

- In a **dynamic simulation** that runs day by day:
    - At each day t+1, for each investigator we need the probability they pick up a new case on that day.
    - In the simplest version, this can come directly from the empirical distribution of time intervals from one allocation to the next for the same investigator
    - (e.g. full-time staff typically get a new case every ~4–6 days; 0.5 FTE every ~8–10 days, etc.).
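
One possible way to turn an empirical gap distribution into a daily pick-up probability is a discrete hazard: the probability that the gap ends on day d, given that it has lasted at least d days. The sketch below uses made-up gap values and hypothetical function names. It also assumes the investigator was allocated a case the day before the simulation starts.

```python
import numpy as np


def empirical_hazard(gap_days) -> np.ndarray:
    """hazard[d] = P(gap == d | gap >= d) for d = 0..max observed gap."""
    gaps = np.asarray(gap_days, dtype=int)
    counts = np.bincount(gaps)              # counts[d] = number of gaps equal to d
    at_risk = counts[::-1].cumsum()[::-1]   # at_risk[d] = number of gaps >= d
    return counts / at_risk                 # at_risk is >= 1 up to the max gap


def simulate_starts(hazard: np.ndarray, n_days: int, rng) -> np.ndarray:
    """Day-by-day simulation of new case starts (0/1) for one investigator."""
    starts = np.zeros(n_days, dtype=int)
    days_since_last = 0  # assumption: a case was allocated the day before day 1
    for t in range(n_days):
        days_since_last += 1
        d = min(days_since_last, len(hazard) - 1)
        if rng.random() < hazard[d]:
            starts[t] = 1
            days_since_last = 0
    return starts


# Made-up gaps for one staff type (illustration only)
observed_gaps = [4, 5, 5, 6, 6, 6, 7, 8]
hazard = empirical_hazard(observed_gaps)
rng = np.random.default_rng(0)
starts = simulate_starts(hazard, n_days=365, rng=rng)
print("Simulated new case starts in a year:", starts.sum())
```

Because the hazard at the largest observed gap is 1, simulated gaps never exceed the longest gap seen in the data. A fitted parametric distribution would be one way to relax that.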

## Phase 1 simple model: “gap between allocations” by staff type
- Data we need, at minimum, per case:
    - investigator_id – who it was allocated to
    - date_allocated_to_current_investigator – when that investigator started it
    - fte or some staff type indicator (e.g. full_time, part_time_0_5, etc.)

- Optionally:
    - case_type
    - closure_date (for more advanced models later)
    - weighting (case complexity)
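
Before computing gaps it may help to check that an extract actually carries these fields. A small sketch, using the column names listed above; the helper name `check_case_columns` is hypothetical.

```python
import pandas as pd

REQUIRED_COLS = ["investigator_id", "date_allocated_to_current_investigator", "fte"]
OPTIONAL_COLS = ["case_type", "closure_date", "weighting"]


def check_case_columns(cases: pd.DataFrame) -> list:
    """Raise if a required column is missing; return the optional ones present."""
    missing = [c for c in REQUIRED_COLS if c not in cases.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return [c for c in OPTIONAL_COLS if c in cases.columns]


demo_cases = pd.DataFrame({
    "investigator_id": [1, 1, 2],
    "date_allocated_to_current_investigator": ["2024-01-03", "2024-01-07", "2024-01-04"],
    "fte": [1.0, 1.0, 0.5],
    "case_type": ["LPA", "Deputyship", "LPA"],
})
optional_present = check_case_columns(demo_cases)
print("Optional columns available:", optional_present)
```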

In [None]:
# Compute & summarise time between allocations

# 1. Helper: create FTE bands (staff type)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# --- Step 1 – Import EDA classes --- 
from eda_opg import EDAConfig, OPGInvestigationEDA



def add_fte_band(cases: pd.DataFrame,
                 fte_col: str = "fte",
                 band_col: str = "fte_band") -> pd.DataFrame:
    """
    Create a simple categorical staff-type band based on FTE.

    Example bands:
      - 'FT_0_8_to_1_0'    : mostly full-time
      - 'PT_0_5_to_0_8'    : mid part-time
      - 'PT_lt_0_5'        : small part-time
      - 'Unknown_FTE'      : missing / uncategorised
    """
    df = cases.copy()

    def band(f):
        if pd.isna(f):
            return "Unknown_FTE"
        if f >= 0.8:
            return "FT_0_8_to_1_0"
        if f >= 0.5:
            return "PT_0_5_to_0_8"
        if f > 0:
            return "PT_lt_0_5"
        return "Unknown_FTE"

    df[band_col] = df[fte_col].apply(band)
    return df


# 2. Compute gaps between allocations per investigator

def compute_allocation_gaps(
    cases: pd.DataFrame,
    investigator_col: str = "investigator_id",
    alloc_date_col: str = "date_allocated_to_current_investigator",
    staff_type_col: str = "fte_band",
    case_type_col: str = "case_type",
) -> pd.DataFrame:
    """
    For each investigator, compute the number of days between successive allocations.

    Returns one row per allocation *after the first* for each investigator:
      - investigator_id
      - alloc_date (date of the *current* allocation)
      - prev_alloc_date (previous allocation date)
      - gap_days (days between prev and current)
      - staff_type (e.g. fte_band)
      - case_type (of the current case, optional)
    """
    df = cases.copy()

    # Ensure allocation date is datetime
    df[alloc_date_col] = pd.to_datetime(df[alloc_date_col])

    # Sort by investigator and allocation date
    df = df.sort_values([investigator_col, alloc_date_col])

    # Compute previous allocation date per investigator
    df["prev_alloc_date"] = df.groupby(investigator_col)[alloc_date_col].shift(1)

    # Gap in days between allocations
    df["gap_days"] = (df[alloc_date_col] - df["prev_alloc_date"]).dt.days

    # Drop the first allocation for each investigator (no previous)
    df = df.dropna(subset=["gap_days"]).copy()

    # Drop zero-day gaps (several cases allocated on the same day). After sorting,
    # negative gaps cannot occur. Relax this filter if same-day allocations are genuine.
    df = df[df["gap_days"] > 0]

    # Keep only relevant columns
    keep_cols = [
        investigator_col,
        alloc_date_col,
        "prev_alloc_date",
        "gap_days",
    ]
    if staff_type_col in df.columns:
        keep_cols.append(staff_type_col)
    if case_type_col in df.columns:
        keep_cols.append(case_type_col)

    return df[keep_cols].reset_index(drop=True)

# 3. Summarise gap distributions by staff type (and optionally case type)

def summarise_gaps(
    gaps: pd.DataFrame,
    staff_type_col: str = "fte_band",
    case_type_col: str = "case_type",
    by_case_type: bool = False,
) -> pd.DataFrame:
    """
    Summarise the distribution of gap_days between allocations.

    If by_case_type is False:
        summary is by staff_type only.
    If True:
        summary is by (staff_type, case_type).
    """
    group_cols = [staff_type_col]
    if by_case_type and (case_type_col in gaps.columns):
        group_cols.append(case_type_col)

    def q25(x): return np.percentile(x, 25)
    def q75(x): return np.percentile(x, 75)

    summary = (
        gaps.groupby(group_cols)["gap_days"]
        .agg(
            count="count",
            mean="mean",
            median="median",
            p25=q25,
            p75=q75,
            min="min",
            max="max",
        )
        .reset_index()
        .sort_values(group_cols)
    )

    return summary


# 4. Example end-to-end usage

# --- Step 2. Load the investigations data ---
if __name__ == "__main__":
    # OLD: reading from a CSV
    # cases = pd.read_csv("investigations.csv")
    # cases = cases.rename({...})

    # --- Step 3 – Create a small synthetic dataset for demonstration (from demo_eda.py) ---
    rng = np.random.default_rng(42)
    n = 2000

    start = pd.Timestamp("2024-01-01")
    recv_dates = start + pd.to_timedelta(rng.integers(0, 300, size=n), unit="D")

    alloc_delays = rng.integers(1, 31, size=n)
    allocated_mask = rng.random(size=n) < 0.85
    alloc_dates = pd.Series(recv_dates) + pd.to_timedelta(alloc_delays, unit="D")
    alloc_dates = alloc_dates.where(allocated_mask, pd.NaT)

    signoff_delays = rng.integers(20, 121, size=n)
    so_mask = rng.random(size=n) < 0.70
    signoff_dates = pd.Series(recv_dates) + pd.to_timedelta(signoff_delays, unit="D")
    signoff_dates = signoff_dates.where(so_mask, pd.NaT)

    case_types = rng.choice(["LPA", "Deputyship", "Other"], size=n, p=[0.6, 0.3, 0.1])
    risk_band = rng.choice(["Low", "Medium", "High"], size=n, p=[0.5, 0.35, 0.15])
    teams = rng.choice(["Team A", "Team B", "Team C"], size=n, p=[0.4, 0.4, 0.2])
    region = rng.choice(["North", "Midlands", "South"], size=n)

    investigators_on_duty = rng.integers(8, 20, size=n)
    allocations = rng.integers(0, 25, size=n)
    backlog = np.maximum(0, 500 + rng.normal(0, 60, size=n).astype(int))

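    # Days from receipt to allocation; unallocated cases (NaT) are treated as 0 days.
    # (np.nan_to_num leaves timedelta arrays unchanged, so missing values are filled explicitly.)
    alloc_wait_days = (
        (alloc_dates - pd.Series(recv_dates)).dt.days.fillna(0).to_numpy(dtype=float)
    )
    base_logit = -3.0 + 0.02 * alloc_wait_days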
    risk_bump = np.select([risk_band == "High", risk_band == "Medium"], [1.2, 0.4], default=0.0)
    logit = base_logit + risk_bump
    # Clip logits to a reasonable range
    logit_clipped = np.clip(logit, -20, 20)
    prob = 1 / (1 + np.exp(-logit_clipped))
    legal_review = (rng.random(size=n) < prob).astype(int)

    df = pd.DataFrame({
        "id": np.arange(1, n + 1),
        "date_received_opg": recv_dates,
        "date_allocated_investigator": alloc_dates,
        "date_pg_signoff": signoff_dates,
        "case_type": case_types,
        "risk_band": risk_band,
        "team": teams,
        "region": region,
        "investigators_on_duty": investigators_on_duty,
        "allocations": allocations,
        "backlog": backlog,
        "legal_review": legal_review,
    })

    # --- Step 4 – Configure columns, instantiate the EDA toolkit and get the engineered table ---
    cfg = EDAConfig(
        id_col="id",
        date_received="date_received_opg",
        date_allocated="date_allocated_investigator",
        date_signed_off="date_pg_signoff",
        target_col="legal_review",
        numeric_cols=[
            "days_to_allocate",  # NOTE: eda will create this
            "days_to_signoff",   # NOTE: eda will create this
            "investigators_on_duty",
            "allocations",
            "backlog",
        ],
        categorical_cols=["case_type", "risk_band", "team", "region"],
        time_index_col="date_received_opg",
        team_col="team",
        risk_col="risk_band",
        case_type_col="case_type",
    )

    eda = OPGInvestigationEDA(df, cfg)

    # This is your “cases” table for the gap code:
    cases = eda.df.copy()


    # --- Step 5 – Add synthetic investigator_id and fte to cases (for demo only) ---
    # In real data, replace this with a merge from Staff Master.
    n_investigators = 40
    cases["investigator_id"] = rng.integers(1, n_investigators + 1, size=len(cases))

    # Random FTE: mixture of FT and PT patterns
    cases["fte"] = rng.choice(
        [1.0, 0.8, 0.6, 0.5],
        size=len(cases),
        p=[0.4, 0.3, 0.2, 0.1],
    )



    # --- Step 6 – Add staff-type bands (based on FTE) ---
    cases = add_fte_band(cases, fte_col="fte", band_col="fte_band")
    
    # --- 3. Compute gaps between allocations per investigator ---
    gaps = compute_allocation_gaps(
        cases,
        investigator_col="investigator_id",               # we just created this
        alloc_date_col="date_allocated_investigator",     # from demo_eda / config
        staff_type_col="fte_band",
        case_type_col="case_type",
    )

    # --- 4. Summarise by staff type only (simplest model) ---
    summary_by_staff = summarise_gaps(
        gaps,
        staff_type_col="fte_band",
        by_case_type=False,
    )
    print("Gap distribution between allocations by staff type:")
    print(summary_by_staff.to_string(index=False))


    # --- 5. Optionally summarise by staff type AND case type ---
    summary_by_staff_case = summarise_gaps(
        gaps,
        staff_type_col="fte_band",
        case_type_col="case_type",
        by_case_type=True,
    )
    print("\nGap distribution between allocations by staff type & case type:")
    print("\nNote: “Gap” is: number of calendar days between two consecutive allocations to the same investigator.")
    print(summary_by_staff_case.head(20).to_string(index=False))


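# Median gaps copied by hand from the staff-type summary table printed above;
# update these values whenever the summary is re-run on different data.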
fte_bands = ["FT 0.8–1.0", "PT 0.5–0.8"]
median_gaps = [6.0, 5.0]  # median days between allocations

plt.figure(figsize=(6, 4))
plt.bar(fte_bands, median_gaps)
plt.ylabel("Typical gap between new cases (days)")
plt.title("Median days between new case starts by staff type")
plt.tight_layout()
plt.show()



# How to read a summary row (these example figures differ from the values plotted above):
# For FT_0_8_to_1_0 staff: median gap = 4 days (p25=3, p75=7)
# For PT_0_5_to_0_8 staff: median gap = 7 days (p25=5, p75=10)

# Case types
case_types = ["Deputyship", "LPA", "Other"]

# Median gaps from your detailed table
median_ft = [5.0, 6.0, 5.0]  # FT_0_8_to_1_0
median_pt = [5.0, 5.0, 5.0]  # PT_0_5_to_0_8

x = np.arange(len(case_types))  # positions
width = 0.35  # width of each bar

plt.figure(figsize=(8, 4))
plt.bar(x - width/2, median_ft, width, label="FT 0.8–1.0")
plt.bar(x + width/2, median_pt, width, label="PT 0.5–0.8")

plt.xticks(x, case_types)
plt.ylabel("Typical gap between new cases (days)")
plt.title("Median days between new case starts by staff type and case type")
plt.legend()
plt.tight_layout()
plt.show()

