How can we explain the drivers and factors behind the step reduction in investigation cases relating to Lasting Powers of Attorney (LPAs) handled by the Office of the Public Guardian (OPG)?
To answer this, we linked the Investigation and LPA datasets, producing a pandas DataFrame in Python with the following variables (columns): linked_df[['uid','donor_id','lpa_reg_date','lpa_status','lpa_rec_date','poa_type','unique_id','case_no','client_donor_dob','case_type','concern_type','date_received_in_opg','status','mojap_extract_date','poa_case_type','casesubtype','poa_rec_to_invest_rec','year_concluded','link_id','uid_to_link']].


1. How can we investigate whether the age distribution of donors at the time of investigation (derived from the donor's date of birth, 'client_donor_dob', and 'date_received_in_opg') changed before and after the pandemic, and whether that might explain the step reduction in LPA investigations?
2. How can we show whether changes in 'casesubtype' and 'case_type' influenced this step reduction in LPA investigations?
3. Investigate whether:
    - The downward trend in the investigation rate for Health and Welfare cases, and for investigations covering both Health and Welfare AND Property and Finance concerns, has been gradual since 2016 rather than a step reduction associated with the pandemic. That said, the rate of investigations, particularly for Health and Welfare cases, levelled off after the pandemic.
    - There isn't any evidence of a gradual decline in Finance and Property cases; instead they show the pattern discussed before, a sustained stepped reduction following the pandemic.
    - The gradual decline in Health and Welfare and combined concerns from 2016 is interesting because it coincides with what I believe was an operational decision at that time to remove the triage process for LPA investigations. This had the immediate effect of increasing the number of concerns accepted for investigation, which can be seen in the attached charts, followed by a gradual decline.


In [None]:
import warnings
warnings.filterwarnings('ignore')

# Upgrade the core libraries before importing them
!pip install --upgrade pandas numpy
import numpy as np
import pandas as pd

# print(np.__version__)
# print(np.__path__)

# Load the linked Investigation / LPA dataset
linked_df = pd.read_csv('inv_linked_lpa_data.csv')
linked_df

1. Investigating Age Distribution Changes Pre- and Post-Pandemic
To check whether changes in donor age at the time of investigation contributed to the reduction:

Techniques:
Descriptive Statistics & Visualization: Calculate mean, median, and IQR of donor age pre- and post-pandemic.

Kernel Density Estimation (KDE) & Histograms: Compare the age distributions before and after the pandemic.

Kolmogorov-Smirnov (KS) Test: Check if the distribution of ages significantly changed.

Causal Inference (Difference-in-Differences - DiD): Compare the change in mean age before and after the pandemic against the change in a comparison group or control period.
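
The descriptive-statistics step above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual result: the `demo` frame and the `age_summary` helper are hypothetical, and the 1 March 2020 cutoff is the same assumption used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for linked_df (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "date_received_in_opg": pd.to_datetime("2018-01-01")
    + pd.to_timedelta(rng.integers(0, 365 * 5, size=n), unit="D"),
    "donor_age_at_investigation": rng.normal(80, 9, size=n).clip(18, 105),
})

def age_summary(df, cutoff="2020-03-01"):
    """Count, mean, median and IQR of donor age before and after `cutoff`."""
    period = np.where(df["date_received_in_opg"] < pd.Timestamp(cutoff), "pre", "post")
    grouped = df.groupby(period)["donor_age_at_investigation"]
    out = grouped.agg(n="count", mean="mean", median="median")
    # Interquartile range: 75th percentile minus 25th percentile
    out["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return out

summary = age_summary(demo)
print(summary.round(2))
```

Calling `age_summary(linked_df)` after the age column has been computed would give the same table for the real data; rows with a missing age are ignored by the aggregations.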

# Overall proportion 

In [None]:
# This cell produces:
# - Kernel Density Estimation (KDE) plots of donor age at investigation before and after the pandemic.
# - Bar charts of the number and proportion of investigations by age group before and after the pandemic.
# - A two-sample Kolmogorov-Smirnov test comparing the two age distributions.

!pip install --upgrade scipy matplotlib seaborn
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

# Convert to datetime
linked_df['client_donor_dob'] = pd.to_datetime(linked_df['client_donor_dob'], errors='coerce', dayfirst=True)
linked_df['date_received_in_opg'] = pd.to_datetime(linked_df['date_received_in_opg'])

# Calculate donor age at investigation
linked_df['donor_age_at_investigation'] = (linked_df['date_received_in_opg'] - linked_df['client_donor_dob']).dt.days / 365.25

# Define age groups
bins = [0, 18, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
linked_df['age_group'] = pd.cut(linked_df['donor_age_at_investigation'], bins=bins, labels=labels, right=False)

# Split pre- and post-pandemic (assuming March 2020 as pandemic start)
pre_pandemic = linked_df[linked_df['date_received_in_opg'] < '2020-03-01']
post_pandemic = linked_df[linked_df['date_received_in_opg'] >= '2020-03-01']

# Calculate overall proportion of investigations in each age group
age_group_counts = linked_df['age_group'].value_counts(normalize=True).sort_index()

# Calculate proportion of investigations in each age group for pre- and post-pandemic periods
pre_proportion = pre_pandemic['age_group'].value_counts(normalize=True).sort_index()
post_proportion = post_pandemic['age_group'].value_counts(normalize=True).sort_index()

# Plot KDE for age distribution
plt.figure(figsize=(10, 5))
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(pre_pandemic['donor_age_at_investigation'], label='Pre-Pandemic', fill=True)
sns.kdeplot(post_pandemic['donor_age_at_investigation'], label='Post-Pandemic', fill=True)
plt.legend()
plt.title('Age Distribution of Donors at Investigation')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

# Bar chart for absolute number of investigations by age group
plt.figure(figsize=(12, 6))
pre_counts = pre_pandemic['age_group'].value_counts().sort_index()
post_counts = post_pandemic['age_group'].value_counts().sort_index()
bar_width = 0.35
index = range(len(labels))

plt.bar(index, pre_counts, bar_width, label='Pre-Pandemic')
plt.bar([i + bar_width for i in index], post_counts, bar_width, label='Post-Pandemic')

plt.xlabel('Age Group')
plt.ylabel('Number of Investigations')
plt.title('Investigation Demand by Age Group Before and After Pandemic')
plt.xticks([i + bar_width / 2 for i in index], labels)
plt.legend()
plt.show()

# Bar chart for overall proportion of investigation demands by age group
plt.figure(figsize=(10, 5))
age_group_counts.plot(kind='bar', color='skyblue')
plt.xlabel('Age Group')
plt.ylabel('Proportion of Investigations')
plt.title('Overall Proportion of Investigation Demands by Age Group')
plt.show()

# Side-by-side bar chart for proportion of investigations pre- vs post-pandemic
plt.figure(figsize=(12, 6))
bar_width = 0.35
index = range(len(labels))

plt.bar(index, pre_proportion, bar_width, label='Pre-Pandemic', color='blue', alpha=0.6)
plt.bar([i + bar_width for i in index], post_proportion, bar_width, label='Post-Pandemic', color='red', alpha=0.6)

plt.xlabel('Age Group')
plt.ylabel('Proportion of Investigations')
plt.title('Proportion of Investigations by Age Group: Pre vs Post Pandemic')
plt.xticks([i + bar_width / 2 for i in index], labels)
plt.legend()
plt.show()

# Perform KS test (drop missing ages, which arise when a date of birth could not be parsed)
ks_stat, p_value = ks_2samp(
    pre_pandemic['donor_age_at_investigation'].dropna(),
    post_pandemic['donor_age_at_investigation'].dropna())
print(f"KS Statistic: {ks_stat}, P-Value: {p_value}")


1. Investigating Age Distribution Changes Before and After the Pandemic
To investigate how the age distribution of donors at the time of investigation changed before and after the pandemic, you can use the following steps:

Techniques:
Descriptive Statistics: Calculate summary statistics (mean, median, standard deviation) for the age of donors before and after the pandemic.
Visualization: Use histograms, box plots, and density plots to visualize the age distribution.
Hypothesis Testing: Perform statistical tests (e.g., t-test, Mann-Whitney U test) to determine if there are significant differences in age distribution before and after the pandemic.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, wilcoxon

# Convert to datetime before using the .dt accessor
linked_df['date_received_in_opg'] = pd.to_datetime(linked_df['date_received_in_opg'])
linked_df['client_donor_dob'] = pd.to_datetime(linked_df['client_donor_dob'], errors='coerce', dayfirst=True)
linked_df['year_received'] = linked_df['date_received_in_opg'].dt.year

# Donor age (in years) when the concern was received by OPG
linked_df['age_at_investigation'] = (
    linked_df['date_received_in_opg'] - linked_df['client_donor_dob']).dt.days / 365.25

# Split data into before and after pandemic
before_pandemic = linked_df[(linked_df['date_received_in_opg'] < '2020-01-01') & (linked_df['date_received_in_opg'] >= '2018-01-01')]
after_pandemic = linked_df[(linked_df['date_received_in_opg'] >= '2023-01-01') & (linked_df['date_received_in_opg'] < '2025-01-01')]

# Descriptive statistics
print(f"Pre-pandemic: {before_pandemic['age_at_investigation'].describe()}")
print(f"Post-pandemic: {after_pandemic['age_at_investigation'].describe()}")

# Visualization
plt.figure(figsize=(10, 6))
sns.histplot(before_pandemic['age_at_investigation'], color='blue', label='Before Pandemic', kde=True)
sns.histplot(after_pandemic['age_at_investigation'], color='red', label='After Pandemic', kde=True)
plt.legend()
plt.title('Age Distribution of Donors at Investigation')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Hypothesis testing (drop missing ages so the tests do not return NaN)
from scipy.stats import mannwhitneyu

ages_before = before_pandemic['age_at_investigation'].dropna()
ages_after = after_pandemic['age_at_investigation'].dropna()

# Welch's t-test: compares the means without assuming equal variances
t_stat, p_value = ttest_ind(ages_before, ages_after, equal_var=False)
print(f'T-test: t_stat={t_stat}, p_value={p_value}')

# Mann-Whitney U test: non-parametric alternative for two independent samples.
# (The Wilcoxon signed-rank test needs paired observations, which these are not.)
u_stat, u_p_value = mannwhitneyu(ages_before, ages_after, alternative='two-sided')
print(f"Mann-Whitney U Test: U={u_stat}, p={u_p_value}")

# If p-value < 0.05, the difference is statistically significant.

2. Analysing Changes in Case Types ('casesubtype' and 'case_type')
To assess whether shifts in case types contributed to the reduction:

Techniques:
Time Series Analysis: Visualizing case types over time.

Chi-Square Test: Checking if case distributions changed pre- and post-pandemic.

Logistic Regression: Modelling how the odds of a case being received post-pandemic differ by case type.
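
A chi-square p-value says whether the case mix changed, not by how much. One common effect size is Cramér's V, sketched below as an assumption-labelled illustration: the `cramers_v` helper and the counts in `example` are hypothetical and are not taken from the OPG data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    table = np.asarray(table)
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * k)))

# Hypothetical counts (rows: case subtype, columns: pre / post pandemic)
example = pd.DataFrame(
    {"pre": [500, 4500], "post": [300, 1700]},
    index=["hw", "pfa"],
)
chi2_stat, p, dof, _ = chi2_contingency(example, correction=False)
print(f"chi2={chi2_stat:.2f}, p={p:.3g}, Cramér's V={cramers_v(example):.3f}")
```

With large samples a very small p-value can sit alongside a small V, so reporting both gives a fairer picture of how much the mix shifted.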

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt

# Aggregate case types by year
linked_df['year_received'] = linked_df['date_received_in_opg'].dt.year

# Assuming linked_df is your DataFrame
linked_df['age_at_investigation'] = (pd.to_datetime(
    linked_df['date_received_in_opg']) - pd.to_datetime(
    linked_df['client_donor_dob'])).dt.days / 365.25

case_type_counts = linked_df.groupby(['year_received', 'concern_type']).size().unstack()

# Plot trends
case_type_counts.plot(kind='line', figsize=(20,10), title="Investigation_Concern_Type_for_LPA_Trends_Over_Time")
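# Save the figure (create the output folder first if it does not exist)
import os
os.makedirs('images', exist_ok=True)
plt.savefig('images/Investigation_Concern_Type_for_LPA_Trends_Over_Time.png')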
# Show the plot
plt.show()


# Aggregate case types by month
linked_df['month_received'] = linked_df['date_received_in_opg'].dt.to_period('M')

case_type_counts_monthly = linked_df.groupby(['month_received', 'concern_type']).size().unstack()

# Plot trends
case_type_counts_monthly.plot(kind='line', figsize=(20,10), title="Investigation Concern Type for LPA Trends Over Time (Monthly)")
# Save the figure
plt.savefig('images/Investigation_Concern_Type_for_LPA_Trends_Over_Time_Monthly.png')
# Show the plot
plt.show()
# Chi-square test
pre_post_pivot = linked_df.pivot_table(
    index='concern_type', 
    columns=linked_df['date_received_in_opg'] >= '2020-03-01', 
    aggfunc='size', fill_value=0)

chi2, p, dof, ex = chi2_contingency(pre_post_pivot)
print(f"Chi-Square Statistic: {chi2}, P-Value: {p}")

# Logistic regression: does the concern type predict whether a case was received post-pandemic?
linked_df['post_pandemic'] = (linked_df['date_received_in_opg'] >= '2020-03-01').astype(int)

model = smf.logit("post_pandemic ~ C(concern_type)", data=linked_df).fit()
# model = smf.logit("post_pandemic ~ C(case_type) + C(casesubtype)", data=linked_df).fit()
print(model.summary())

# Interpreting the chi-square test
# The chi-square test checks for an association between two categorical variables, here the
# case category and whether the case was received before or after the pandemic (post_pandemic).
# If the p-value is low, the case type proportions changed.
# Note: the figures quoted below refer to casesubtype (see the coefficient name
# C(casesubtype)[T.pfa]), i.e. the casesubtype version of this analysis.
# Chi-Square Statistic (48.08) and P-Value (4.09e-12): the association is statistically
# significant, so the distribution of case subtypes differs before and after the pandemic.
# The size of the statistic depends on the sample size, so on its own it does not measure
# how strong the association is.

# Interpreting the logistic regression
# The model estimates the log-odds that a case was received post-pandemic, given its subtype.
# It does not estimate how likely a case is to be investigated; that would need LPAs
# without investigations as a comparison group.
# Intercept: 2.9569 (p-value < 0.0001). This is the log-odds of being post-pandemic for the
# baseline subtype: odds of about 19 to 1 (exp(2.9569)), or a probability of about 0.95.
# C(casesubtype)[T.pfa]: -0.7498 (p-value < 0.0001). This is an odds ratio of about 0.47
# (exp(-0.7498)): the odds that a pfa case was received post-pandemic are roughly half
# those of the baseline subtype.

# Summary
# The chi-square test shows a significant change in case subtype distribution post-pandemic.
# The logistic regression indicates that pfa cases make up a smaller share of post-pandemic
# cases than of pre-pandemic cases, relative to the baseline subtype.

2. Analysing Changes in Case Types ('casesubtype' and 'case_type')
Refinement: NLP & Clustering
If 'concern_type' contains free-text descriptions, NLP-based topic modeling can group similar concerns over time.

Clustering (K-Means, DBSCAN) can group case subtypes to reveal patterns.

2. Showing Influence of 'casesubtype' and 'case_type' Changes
To show how different 'casesubtype' and 'case_type' changes influenced the step reduction in investigation cases, you can use:

Techniques:
Categorical Analysis: Analyze the frequency and distribution of different case subtypes and types before and after the pandemic.
Chi-Square Test: Perform chi-square tests to determine if there are significant differences in the distribution of case subtypes and types.
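
The categorical analysis above can be sketched as a comparison of each subtype's share before and after the cutoff. This is a minimal illustration: the rows in `demo` are hypothetical, and the 1 March 2020 cutoff is the assumption used throughout this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for linked_df
demo = pd.DataFrame({
    "casesubtype": ["pfa", "pfa", "hw", "pfa", "hw", "pfa", "pfa", "hw"],
    "date_received_in_opg": pd.to_datetime([
        "2018-05-01", "2019-02-11", "2019-07-30", "2019-11-02",
        "2020-06-15", "2021-01-20", "2022-03-03", "2023-09-09",
    ]),
})

# Label each case as received before or after the assumed pandemic start
period = (demo["date_received_in_opg"] >= "2020-03-01").map({False: "pre", True: "post"})

# Absolute counts and within-period shares for each subtype
counts = pd.crosstab(demo["casesubtype"], period)[["pre", "post"]]
shares = pd.crosstab(demo["casesubtype"], period, normalize="columns")[["pre", "post"]]

# Change in each subtype's share of all cases (post minus pre)
change = (shares["post"] - shares["pre"]).rename("share_change")
print(counts, shares.round(2), change.round(2), sep="\n\n")
```

Shares describe how the composition changed; the counts are still needed to see which subtype accounts for the fall in volume.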

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Split data into before and after pandemic
before_pandemic = linked_df[linked_df['date_received_in_opg'] < '2020-03-01']
after_pandemic = linked_df[linked_df['date_received_in_opg'] >= '2020-03-01']

# Frequency distribution
concern_type_counts_before = before_pandemic['concern_type'].value_counts()
concern_type_counts_after = after_pandemic['concern_type'].value_counts()

case_subtype_counts_before = before_pandemic['casesubtype'].value_counts()
case_subtype_counts_after = after_pandemic['casesubtype'].value_counts()

case_type_counts_before = before_pandemic['poa_case_type'].value_counts()
case_type_counts_after = after_pandemic['poa_case_type'].value_counts()

# Ratio of Financial to Health and Welfare cases (casesubtype uses the codes 'pfa' and 'hw')
print(f"Pre-pandemic: {case_subtype_counts_before}")
print(f"Pre-pandemic (Financial/Health and Welfare): {round(case_subtype_counts_before['pfa']/case_subtype_counts_before['hw'],2)}")

# Visualization
plt.figure(figsize=(12, 6))
sns.barplot(x=concern_type_counts_before.index, 
            y=concern_type_counts_before.values, 
            color='blue', label='Before Pandemic')
plt.legend()
plt.title('Concern Type Distribution Before Pandemic')
plt.xlabel('Concern Type')
plt.ylabel('Volume')
plt.xticks(rotation=90)
plt.show()

print(f"Post-pandemic: {case_subtype_counts_after}")
print(f"Post-pandemic (Financial/Health and Welfare): {round(case_subtype_counts_after['pfa']/case_subtype_counts_after['hw'],2)}")
plt.figure(figsize=(12, 6))
sns.barplot(x=concern_type_counts_after.index, y=concern_type_counts_after.values, 
            color='red', label='After Pandemic')
plt.legend()
plt.title('Concern Type Distribution After Pandemic')
plt.xlabel('Concern Type')
plt.ylabel('Volume')
plt.xticks(rotation=90)
plt.show()

plt.figure(figsize=(12, 6))
sns.barplot(x=case_type_counts_before.index, y=case_type_counts_before.values, 
            color='blue', label='Before Pandemic')
plt.legend()
plt.title('Case Type Distribution Before Pandemic')
plt.xlabel('Case Type')
plt.ylabel('Volume')
plt.xticks(rotation=90)
plt.show()

plt.figure(figsize=(12, 6))
sns.barplot(x=case_type_counts_after.index, y=case_type_counts_after.values, 
            color='red', label='After Pandemic')
plt.legend()
plt.title('Case Type Distribution After Pandemic')
plt.xlabel('Case Type')
plt.ylabel('Volume')
plt.xticks(rotation=90)
plt.show()

# Chi-square test
contingency_table_subtype = pd.crosstab(linked_df['casesubtype'], 
                                        linked_df['date_received_in_opg'] >= '2020-03-01')
chi2_stat_subtype, p_val_subtype, dof_subtype, ex_subtype = chi2_contingency(contingency_table_subtype)
print(f'Chi-square test for case subtype: chi2_stat={chi2_stat_subtype}, p_value={p_val_subtype}')

contingency_table_type = pd.crosstab(linked_df['poa_case_type'], 
                                     linked_df['date_received_in_opg'] >= '2020-03-01')
chi2_stat_type, p_val_type, dof_type, ex_type = chi2_contingency(contingency_table_type)
print(f'Chi-square test for case type: chi2_stat={chi2_stat_type}, p_value={p_val_type}')

In [None]:
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate case types by year
linked_df['year_received'] = linked_df['date_received_in_opg'].dt.year
case_type_counts = linked_df.groupby(['year_received', 'casesubtype']).size().unstack()

# Plot trends
plt.figure(figsize=(12, 6))
sns.lineplot(data=case_type_counts, palette="tab10")
plt.title("Case Type Trends Over Time")
plt.xlabel("Year Received")
plt.ylabel("Number of Cases")
plt.legend(title="Case Subtype")
plt.grid(True)
plt.show()

# Chi-square test
pre_post_pivot = linked_df.pivot_table(
    index='casesubtype', 
    columns=linked_df['date_received_in_opg'] >= '2020-03-01', 
    aggfunc='size', fill_value=0)

chi2, p, dof, ex = chi2_contingency(pre_post_pivot)
print(f"Chi-Square Statistic: {chi2}, P-Value: {p}")

# Bar plot to show differences before and after pandemic
pre_post_counts = linked_df.groupby(['casesubtype', linked_df['date_received_in_opg'] >= '2020-03-01']).size().unstack()
pre_post_counts.columns = ['Before Pandemic', 'After Pandemic']

# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure first
ax = pre_post_counts[['Before Pandemic', 'After Pandemic']].plot(
    kind='bar', figsize=(12, 6), color=['skyblue', 'salmon'], edgecolor='black')
plt.title("Case Subtype Counts Before and After Pandemic")
plt.xlabel("Case Subtype")
plt.ylabel("Number of Cases")
plt.xticks(rotation=0)
plt.legend(title="Period")
plt.grid(True)

# # Annotate bars
# for idx, row in pre_post_counts.loc[['pfa', 'hw']].iterrows():
#     for col, value in row.items():
#         plt.text(idx, value + 50, f'{value}', ha='center', va='bottom')

plt.show()

# Logistic regression: does the case subtype predict whether a case was received post-pandemic?
linked_df['post_pandemic'] = (linked_df['date_received_in_opg'] >= '2020-03-01').astype(int)
model = smf.logit("post_pandemic ~ C(casesubtype)", data=linked_df).fit()
print(model.summary())

3. Investigating Gradual vs. Step Decline in Investigation Cases (2016 Onwards)
To determine if the decline was gradual (since 2016) or a sharp step drop post-pandemic:

Techniques:
Time Series Decomposition: Breaking down long-term trends and seasonality.

Change Point Detection: Identifying structural breaks in investigation rates.

Interrupted Time Series Analysis (ITSA): Evaluating the impact of pandemic on investigation trends.
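
An interrupted time series is usually fitted as a segmented regression on counts per period. The sketch below uses synthetic monthly counts with a built-in step drop; the column names (`n_cases`, `time`, `post`, `time_since`) and the March 2020 break are assumptions made for illustration, not the notebook's results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic monthly counts (hypothetical): flat trend, then a step drop in March 2020
rng = np.random.default_rng(1)
months = pd.date_range("2016-01-01", "2023-12-01", freq="MS")
ts = pd.DataFrame({"month": months})
ts["time"] = np.arange(len(ts))                                # months since the start
ts["post"] = (ts["month"] >= pd.Timestamp("2020-03-01")).astype(int)
break_time = int(ts.loc[ts["post"] == 1, "time"].min())
ts["time_since"] = (ts["time"] - break_time).clip(lower=0)     # months since the break
ts["n_cases"] = 300 - 80 * ts["post"] + rng.normal(0, 5, len(ts))

# n_cases = b0 + b1*time + b2*post + b3*time_since
#   b1: pre-existing slope (a gradual decline shows up here)
#   b2: immediate level change at the break (a step reduction shows up here)
#   b3: change in slope after the break
itsa = smf.ols("n_cases ~ time + post + time_since", data=ts).fit()
print(itsa.params.round(2))
```

On real data the same model could be fitted separately for each case subtype: a clearly negative `time` coefficient with a small `post` coefficient points to a gradual decline, while the reverse points to a step change.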

In [None]:
!pip install ruptures
from statsmodels.tsa.seasonal import seasonal_decompose
import ruptures as rpt
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

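# Aggregate investigation counts by year and by month
investigations_by_year = linked_df.groupby('year_received').size()
investigations_by_month = (
    linked_df.groupby(linked_df['date_received_in_opg'].dt.to_period('M')).size()
    .to_timestamp()
    .asfreq('MS', fill_value=0)
)

# Time Series Decomposition
# Annual data has no within-year seasonality to estimate (period=1 adds little beyond the
# raw series), so decompose the monthly counts with a 12-month period instead.
decomposed = seasonal_decompose(investigations_by_month, model='additive', period=12)
decomposed.plot()
plt.show()

# Change Point Detection
# jump=1 lets PELT consider every year as a candidate break; the ruptures default (jump=5)
# skips most candidates in a series this short. The penalty may also need lowering.
algo = rpt.Pelt(model="rbf", min_size=2, jump=1).fit(investigations_by_year.values.reshape(-1, 1))
breakpoints = algo.predict(pen=10)
print("Change Points Detected at Years:", 
      [list(investigations_by_year.index)[bp] for bp in breakpoints[:-1]])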

# ITSA - Comparing Pre/Post Pandemic Trends
linked_df['post_pandemic'] = (linked_df['year_received'] >= 2020).astype(int)
linked_df['casesubtype_numeric'] = linked_df['casesubtype'].astype('category').cat.codes
model = smf.ols("casesubtype_numeric ~ year_received + post_pandemic", data=linked_df).fit()
print(model.summary())

# Change point detection helps confirm whether the drop was gradual or sudden.
# No change points were detected in the run reported here. With only a handful of annual
# observations this is weak evidence: PELT has little power on such a short series, so the
# result does not rule out a step change.

# Interpreting the regression above
# Note: the outcome is the category code of casesubtype, so this model describes how the mix
# of subtypes shifts over time, not how the number of investigations changes. A true ITSA
# would regress counts per period on time, a post-pandemic indicator and time since the break.
# R-squared (0.010): only 1% of the variance in casesubtype_numeric is explained by the model,
# so the model does not fit the data well.
# Intercept (43.8936): the expected value of casesubtype_numeric when year_received and
# post_pandemic are both zero (t = 13.594, p < 0.0001). Year zero is far outside the data,
# so the intercept has no practical interpretation.
# year_received (-0.0213): the average casesubtype_numeric value falls slightly each year
# (t = -13.302, p < 0.0001).
# post_pandemic (0.0253): cases received post-pandemic have a slightly higher
# casesubtype_numeric value (t = 2.555, p = 0.011).
# Omnibus (10529.441) and Jarque-Bera (54999.577): the residuals are not normally distributed,
# which is expected when the outcome is a category code rather than a continuous measure.
# Durbin-Watson (1.960): close to 2, so there is no sign of autocorrelation in the residuals.
# Condition Number (3.05e+06): a large value signals numerical ill-conditioning. It likely
# reflects in part the scale of year_received (values around 2020); centring the year
# (e.g. year_received - 2016) would reduce it.
# Overall, while the model shows some statistically significant relationships, the low
# R-squared value and the choice of outcome mean it is not a reliable basis for conclusions
# about investigation volumes. A model fitted to counts per period is needed to separate a
# gradual decline from a step change.

Findings to Look for:
Age Analysis: If older donors were investigated less post-pandemic, it may explain some of the decline.

Case Type Shifts: A significant drop in certain cases (e.g., Health & Welfare) might suggest policy changes.

Trend Analysis: If Finance & Property cases show a step decline while Health & Welfare cases decline gradually, it supports the operational decision hypothesis.

3. Investigating Gradual vs. Step Decline in Investigation Cases (2016 Onwards)
Refinement: Structural Breaks & Granger Causality
To determine if the step reduction aligns with operational decisions, we can:

Detect breakpoints in investigation rates using Bayesian Change Point Detection.

Apply Granger Causality to test whether the timing of operational changes helps predict later changes in case volumes (this shows predictive precedence, not proof of cause).

In [None]:
# # A. Bayesian Change Point Detection

# import numpy as np
# import pymc3 as pm
# import arviz as az

# # Convert investigation counts to numpy array
# y = investigations_by_year.values

# # Define Model
# with pm.Model():
#     tau = pm.DiscreteUniform("tau", lower=0, upper=len(y)-1)
#     mu1 = pm.Normal("mu1", mu=np.mean(y[:len(y)//2]), sigma=np.std(y))
#     mu2 = pm.Normal("mu2", mu=np.mean(y[len(y)//2:]), sigma=np.std(y))
    
#     # Likelihood
#     idx = np.arange(len(y))
#     mu = pm.math.switch(tau > idx, mu1, mu2)
#     obs = pm.Normal("obs", mu=mu, sigma=np.std(y), observed=y)

#     trace = pm.sample(2000, return_inferencedata=True)

# # Plot Posterior Distribution of Breakpoint
# az.plot_posterior(trace, var_names=["tau"])

# # If the posterior distribution of tau (change point) aligns with 2016 or the pandemic, it suggests a structural change.

In [None]:
# Inspect the Granger input table (run this after the cell below, which defines df_granger)
df_granger_noinx = df_granger.reset_index()
df_granger_noinx  # .columns
df_granger_noinx.index
len(df_granger)

In [None]:
from statsmodels.tsa.stattools import grangercausalitytests
import pandas as pd

# Convert casesubtype to numeric codes
linked_df['casesubtype_numeric'] = linked_df['casesubtype'].astype('category').cat.codes

# Creating a DataFrame with lagged operational decisions
df_granger = linked_df[['year_received', 'casesubtype_numeric']].pivot_table(
    index='year_received', columns='casesubtype_numeric', aggfunc='size', fill_value=0)

# Add operational decision variable
df_granger['triage_removed'] = (df_granger.index >= 2016).astype(int)

# Create a rolling mean to make 'triage_removed' dynamic
df_granger['triage_removed_smooth'] = df_granger['triage_removed'].rolling(window=3, min_periods=1).mean()

# Check for constant columns and remove them
df_granger = df_granger.loc[:, df_granger.nunique() > 1]
print("Columns after filtering constant values:", df_granger.columns)

# Ensure 'year_received' is the index and formatted correctly
df_granger.index = pd.to_datetime(df_granger.index, format='%Y')

# Determine maximum allowable lag
max_lag = min(2, len(df_granger) - 1)
print("Max lag:", max_lag)

# Select variables for Granger causality test
target_variable = [col for col in df_granger.columns if col not in ['triage_removed', 'triage_removed_smooth']][0]
columns_for_test = [target_variable, 'triage_removed_smooth']
print("Selected columns for Granger test:", columns_for_test)

# Run Granger causality test
grangercausalitytests(df_granger[columns_for_test], maxlag=max_lag, verbose=True)

# The Granger causality test checks whether past values of one time series (triage_removed_smooth)
# help predict another time series (target_variable) beyond that series' own past values.
# It tests predictive precedence, not causation in the everyday sense.
# Note: target_variable is the yearly count for the first case subtype column only.
# Each test result contains the following:
# ssr_ftest (F-test for joint significance): tests if lagged values significantly improve predictions.
# ssr_chi2test (Chi-square test): the same comparison using a large-sample chi-square approximation.
# lrtest (Likelihood ratio test): compares model fits with and without the lagged predictor.
# params_ftest: F-test that all coefficients on the lagged predictor are zero.
# Each metric has:
# A statistic value (larger values give stronger evidence against "no predictive value").
# A p-value (lower means stronger statistical significance).
# Degrees of freedom (df) used in calculations.

# Lag 1 (1-year lag)
# F-test (ssr_ftest): F = 17.34, p = 0.0059 (statistically significant)
# Chi-square test (ssr_chi2test): χ2 = 26.00, p = 3.41e-7 (highly significant)
# Likelihood ratio test (lrtest): χ2 = 12.22, p = 0.00047 (significant)
# At lag 1, triage_removed_smooth Granger-causes target_variable: all p-values are below 0.01.

# Lag 2 (2-year lag)
# F-test (ssr_ftest): F = 7.62, p = 0.0667 (borderline significance)
# Chi-square test (ssr_chi2test): χ2 = 40.64, p = 1.50e-9 (very significant)
# Likelihood ratio test (lrtest): χ2 = 14.44, p = 0.00073 (significant)
# At lag 2 the evidence is weaker: the F-test p-value (0.0667) is above 0.05, although the
# chi-square and likelihood ratio tests remain significant. Those two tests rely on
# large-sample approximations, so with a short annual series the F-test is generally the
# more reliable guide.

# Summary
# Lagged values of the triage indicator help predict investigation counts one year ahead
# (Lag 1, p < 0.01). The Lag 2 result is weaker.
# This is consistent with the removal of triage in 2016 being followed by a change in
# investigation volumes, but it does not establish that removing triage caused the reduction:
# the series is short, and triage_removed_smooth is a smoothed step indicator, so any change
# around 2016 would produce a similar result.

# Next Steps
# Consider a Vector Autoregression (VAR) model to quantify dynamic relationships.
# Test additional lags (3+ years) to see if effects persist.
# Explore causal inference techniques (DAGs, synthetic controls) to validate findings.

3. Investigating Trends in Investigation Rates
To investigate trends in investigation rates for different case types and subtypes, you can use:

Techniques:
Time Series Analysis: Analyze the trends over time using line plots and statistical tests.
Regression Analysis: Use regression models to identify trends and changes over time.
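
One simple way to check for a gradual decline that predates the pandemic is to fit a linear trend to the pre-pandemic years only. The sketch below uses hypothetical yearly counts; the `pre_break_trend` helper, the numbers, and the choice of 2019 as the last pre-pandemic year are all assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical yearly counts: 'hw' declines gradually, 'pfa' drops in a single step
yearly = pd.DataFrame(
    {"hw": [120, 108, 101, 90, 80, 78, 77, 76],
     "pfa": [900, 905, 895, 900, 600, 605, 598, 602]},
    index=pd.Index(range(2016, 2024), name="year"),
)

def pre_break_trend(series, last_pre_year=2019):
    """Slope (cases per year) and p-value of a linear trend up to `last_pre_year`."""
    pre = series.loc[:last_pre_year]  # label-based slice, inclusive of last_pre_year
    fit = linregress(pre.index.to_numpy(dtype=float), pre.to_numpy(dtype=float))
    return fit.slope, fit.pvalue

for name in yearly.columns:
    slope, p_value = pre_break_trend(yearly[name])
    print(f"{name}: pre-2020 slope = {slope:.1f} cases/year (p = {p_value:.3f})")
```

With only four pre-pandemic points the p-values carry little weight, so monthly or quarterly counts would give a more reliable trend estimate.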

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose

# Group by year and case type/subtype
yearly_counts = linked_df.groupby(['year_concluded', 'case_type']).size().unstack().fillna(0)
yearly_counts_subtype = linked_df.groupby(['year_concluded', 'casesubtype']).size().unstack().fillna(0)

# Visualization
# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure first
yearly_counts.plot(kind='line', marker='o', figsize=(12, 6))
plt.title('Yearly Investigation Counts by Case Type')
plt.xlabel('Year')
plt.ylabel('Number of Investigations')
plt.legend(title='Case Type')
plt.show()

yearly_counts_subtype.plot(kind='line', marker='o', figsize=(12, 6))
plt.title('Yearly Investigation Counts by Case Subtype')
plt.xlabel('Year')
plt.ylabel('Number of Investigations')
plt.legend(title='Case Subtype')
plt.show()

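# Time series decomposition
# Note: with annual data and period=1 there is no seasonal component to estimate,
# so the decomposition adds little beyond the raw series.
for case_type in yearly_counts.columns:
    result = seasonal_decompose(yearly_counts[case_type], model='additive', period=1)
    fig = result.plot()
    fig.suptitle(f'Time Series Decomposition for {case_type}')
    plt.show()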

# Regression analysis: linear trend in yearly counts for each case type
import statsmodels.api as sm

X = yearly_counts.index.values.reshape(-1, 1)
for case_type in yearly_counts.columns:
    y = yearly_counts[case_type].values
    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(f'Regression analysis for {case_type}:')
    print(model.summary())

Next Steps
Deep Dive into Outliers

If we find an anomaly in 2016 or 2020, investigate if it aligns with internal reports or policy changes.

Interactive Dashboards

Use Plotly or Dash to create an interactive visualization for stakeholders.

Predictive Modeling for Future Investigations

Use XGBoost or Random Forest to predict the number of investigations based on past trends.

A dashboard would help OPG stakeholders visualize the trends, analyze case distributions, and explore the findings interactively. A Plotly Dash application could include:

Age Distribution Analysis: a histogram/KDE plot comparing pre- and post-pandemic donor ages.

Case Type Trends Over Time: a time-series line chart showing case subtypes from 2016 onward.

Investigation Volume & Structural Changes: a bar chart with annotations marking key events (e.g., triage removal in 2016, pandemic in 2020).

Case Type Breakdown (Pie Chart): a comparison of case distributions over different periods.

Changepoint Detection & Forecasting: structural break analysis visualized dynamically.

The Plotly Dash app below covers the first four of these: donor age distribution before and after the pandemic, case type trends over time, investigation volume changes, and the case type breakdown.

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)


# !pip install --upgrade dash
# !pip install --upgrade plotly

# # Uninstall the current version of typing_extensions
# !pip uninstall typing-extensions -y

# # Install the latest version of typing_extensions
# !pip install typing-extensions --upgrade

# !pip install jupyter-dash --upgrade

# !pip install ipywidgets

# !jupyter nbextension list
# !jupyter labextension install @jupyter-widgets/jupyterlab-manager
# !jupyter labextension install plotlywidget
# !jupyter labextension install jupyterlab-dash
# !jupyter nbextension install --sys-prefix --py jupyter_dash
# !jupyter nbextension enable --sys-prefix --py jupyter_dash

!jupyter nbextension install --py jupyter_dash --sys-prefix
!jupyter nbextension enable --py jupyter_dash --sys-prefix


from jupyter_dash import JupyterDash
from dash import dcc, html
import plotly.express as px
import pandas as pd

# Ensure linked_df exists
if 'linked_df' not in locals():
    raise ValueError("linked_df is not defined. Load the dataset before running the dashboard.")

# Convert to datetime
linked_df['date_received_in_opg'] = pd.to_datetime(linked_df['date_received_in_opg'])
linked_df['client_donor_dob'] = pd.to_datetime(linked_df['client_donor_dob'], errors='coerce', dayfirst=True)

# Extract year
linked_df['year_received'] = linked_df['date_received_in_opg'].dt.year

# Calculate donor age at investigation
linked_df['donor_age_at_investigation'] = (linked_df['date_received_in_opg'] - linked_df['client_donor_dob']).dt.days / 365.25

# Aggregate data
case_type_trend = linked_df.groupby(['year_received', 'casesubtype']).size().reset_index(name='count')

# Create Dash App
app = JupyterDash(__name__)  # Use JupyterDash for Jupyter Lab compatibility

app.layout = html.Div([
    html.H1("OPG Investigation Dashboard"),
    
    # Age Distribution Comparison
    dcc.Graph(
        figure=px.histogram(
            linked_df, 
            x='donor_age_at_investigation', 
            color=linked_df['year_received'].apply(lambda x: 'Post-2020' if x >= 2020 else 'Pre-2020'),
            nbins=50, 
            title="Donor Age Distribution: Pre vs. Post Pandemic"
        )
    ),
    
    # Case Type Trends
    dcc.Graph(
        figure=px.line(
            case_type_trend, 
            x='year_received', 
            y='count', 
            color='casesubtype',
            title="Trends in Case Types Over Time"
        )
    ),
    
    # Investigation Volume with Key Events
    dcc.Graph(
        figure=px.bar(
            linked_df.groupby('year_received').size().reset_index(name='count'), 
            x='year_received', 
            y='count',
            title="Annual Investigation Volume"
        )
    ),
    
    # Case Type Breakdown (Pie Chart)
    dcc.Graph(
        figure=px.pie(
            linked_df, 
            names='casesubtype', 
            title="Case Type Distribution"
        )
    )
])

# Run app inside Jupyter Lab; JupyterDash takes the display mode through run_server
app.run_server(mode='inline')  # , port=8051)  # Try different ports (8052, 8053, etc.)
# app.run_server(mode='external')  # Try this if the inline view does not display

Possible additional features: changepoint detection visualizations or predictive modeling.

In [None]:
from jupyter_dash import JupyterDash
from dash import dcc, html

app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("Hello Dash!"),
    dcc.Graph()
])

app.run_server(mode='inline')


1. Investigating Age Distribution Changes Pre- and Post-Pandemic
Refinement: Bayesian Analysis & Causal Inference
Rather than just comparing distributions, we can:

Use a Bayesian model to estimate how much the mean donor age changed.

Apply Causal Impact Analysis (Google's CausalImpact package) to check if the change was due to the pandemic.
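
One lightweight way to get a posterior for the change in mean donor age without installing a probabilistic programming library is the Bayesian bootstrap. This is a swapped-in technique, not the PyMC model attempted later in this notebook. The sketch below uses synthetic ages, and the helper name `bayesian_bootstrap_mean` is hypothetical; with real data, `age_pre` and `age_post` would come from `donor_age_at_investigation` split at 2020.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic donor ages (illustrative only): the post-pandemic mean is set 4 years higher.
age_pre = rng.normal(80, 9, size=400)
age_post = rng.normal(84, 9, size=300)

def bayesian_bootstrap_mean(x, n_draws, rng):
    """Posterior draws of the mean using Dirichlet(1, ..., 1) weights on the observations."""
    weights = rng.dirichlet(np.ones(len(x)), size=n_draws)  # shape (n_draws, len(x))
    return weights @ x

draws_pre = bayesian_bootstrap_mean(age_pre, 4000, rng)
draws_post = bayesian_bootstrap_mean(age_post, 4000, rng)
diff = draws_post - draws_pre  # posterior draws of (post mean - pre mean)

lo, hi = np.percentile(diff, [2.5, 97.5])
prob_increase = (diff > 0).mean()
print(f"Posterior mean difference: {diff.mean():.2f} years")
print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")
print(f"P(post mean > pre mean): {prob_increase:.3f}")
```

The output is a distribution for the age difference rather than a single p-value. It describes how much the mean changed, not why it changed.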

In [None]:
# A. Topic Modeling for Case Subtypes

# Upgrade scikit-learn
!pip install --upgrade scikit-learn
#import sklearn


# Upgrade numpy
!pip install --upgrade numpy

# Converts text data into a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer 
# Applies Latent Dirichlet Allocation (LDA) for topic modeling.
from sklearn.decomposition import LatentDirichletAllocation

# TF-IDF vectorizer: transforms the text data (concern_type) into a matrix of TF-IDF features.
# TF-IDF stands for Term Frequency-Inverse Document Frequency, which highlights the words that are most distinctive for each document.
# Stop words: common words like "the" and "and" are removed to focus on more meaningful words.
# Convert concern types into a TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(linked_df['concern_type'].astype(str))

# Creates an LDA model with 5 topics (n_components=5). LDA is a generative probabilistic 
# model that assumes each document is a mixture of topics and each topic is a mixture of words.
# Apply LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
# The model is fitted to the TF-IDF matrix and transforms it into topic distributions.
topics = lda.fit_transform(X)

# For each topic, the top words are displayed. topic.argsort()[-5:] 
# sorts the words by their importance in the topic and selects the top 5.
# Display top words in each topic
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}: {[vectorizer.get_feature_names_out()[j] for j in topic.argsort()[-5:]]}")

# Topic modeling can surface patterns in concern types that may have changed pre/post-pandemic
# and show how different types of concern have evolved over time.
# The results show the top words for each topic. A 'nan' token among them comes from missing
# concern_type values, which astype(str) turns into the literal string 'nan'.
# Drop or fill missing values before vectorizing to keep 'nan' out of the topics.

# B. Clustering Subtypes Over Time

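import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# Encode concern types as integer category codes.
# Note: the codes are arbitrary labels, so distances between them carry no meaning for K-Means.
linked_df['concern_type_encoded'] = linked_df['concern_type'].astype('category').cat.codes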

print(linked_df['concern_type_encoded'])

# K-Means Clustering
kmeans = KMeans(n_clusters=20, random_state=42)
linked_df['cluster'] = kmeans.fit_predict(
    linked_df[['concern_type_encoded', 'year_received']])

# Visualize Changes in Clusters Over Time
sns.lineplot(data=linked_df, x='year_received', y='cluster', hue='concern_type', marker='o')
plt.title('Concern Type Investigation Clustering Over Time')
plt.show()

# If certain clusters disappear or emerge post-pandemic, they could explain the step reduction.
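
A more direct check on whether the mix of concern types shifted is to compare each type's share of investigations before and after 2020. The sketch below is a minimal example on synthetic rows; the column names mirror `linked_df`, but the concern type labels and counts are assumptions for illustration only.

```python
import pandas as pd

# Synthetic rows (illustrative only); the concern_type labels are hypothetical.
df = pd.DataFrame({
    "year_received": [2018, 2018, 2019, 2019, 2019, 2021, 2021, 2022, 2022, 2022],
    "concern_type": ["Financial", "Financial", "Health and Welfare", "Financial", "Both",
                     "Financial", "Health and Welfare", "Financial", "Both", "Health and Welfare"],
})

# Label each investigation as pre- or post-pandemic using the same 2020 cut-off as above.
df["period"] = df["year_received"].map(lambda y: "post" if y >= 2020 else "pre")

# Share of each concern type within each period (each column sums to 1).
shares = pd.crosstab(df["concern_type"], df["period"], normalize="columns")

# Put the periods in chronological order and add the change in share.
summary = shares[["pre", "post"]].assign(change=shares["post"] - shares["pre"])
print(summary.round(2))
```

Shares show changes in composition only. A concern type can keep the same share while its absolute count falls, so the counts should be read alongside this table.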

In [None]:
# # Suppress warnings from statsmodels
# import warnings
# warnings.filterwarnings("ignore")

# !pip install --upgrade pymc3
# #!python -m pip install --upgrade pip
# #!pip uninstall pymc3
# #!pip install git+https://github.com/pymc-devs/pymc3
# # !git clone https://github.com/pymc-devs/pymc3
# # !cd pymc3
# # !pip install -r requirements.txt
# # !python setup.py install
# # !python setup.py develop
# # !python -m pip install --upgrade pip
# #!pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
# import pymc3 as pm
# import arviz as az
# from scipy.stats import kstest
# print(f"Running on PyMC v{pm.__version__}")

In [None]:
# # A. Bayesian Estimation of Age Differences
# #!pip install pymc3

# # Define Bayesian Model
# with pm.Model():
#     mu_pre = pm.Normal("mu_pre", mu=70, sigma=10)  # Prior for Pre-Pandemic Age Mean
#     mu_post = pm.Normal("mu_post", mu=70, sigma=10)  # Prior for Post-Pandemic Age Mean
#     sigma = pm.HalfNormal("sigma", sigma=10)

#     # Likelihood
#     age_pre = pm.Normal("age_pre", mu=mu_pre, sigma=sigma, observed=pre_pandemic['donor_age_at_investigation'])
#     age_post = pm.Normal("age_post", mu=mu_post, sigma=sigma, observed=post_pandemic['donor_age_at_investigation'])

#     # Difference
#     diff = pm.Deterministic("diff", mu_post - mu_pre)

#     trace = pm.sample(2000, return_inferencedata=True)

# # Plot Posterior Distribution of Age Difference
# import arviz as az
# az.plot_posterior(trace, var_names=["diff"])

# # This approach gives a probability distribution of the change in age, rather than a simple p-value.

# # If most of the posterior distribution is negative, it means donor age at investigation decreased.

In [None]:
# !pip install causalimpact
# from causalimpact import CausalImpact
# import pandas as pd

# # Ensure proper datetime conversion
# linked_df['year_received'] = pd.to_datetime(linked_df['date_received_in_opg']).dt.year

# # Compute mean donor age per year
# ts_data = linked_df.groupby('year_received')['donor_age_at_investigation'].mean()

# # Ensure DataFrame format
# ts_data = ts_data.to_frame(name="value")

# # Convert index to datetime format
# ts_data.index = pd.to_datetime(ts_data.index, format='%Y')

# # Drop missing values
# ts_data.dropna(inplace=True)

# # Define pre/post intervention periods as timestamps so they match the DatetimeIndex
# pre_period = [ts_data.index.min(), pd.Timestamp('2019-01-01')]
# post_period = [pd.Timestamp('2020-01-01'), ts_data.index.max()]

# print(pre_period)
# print(post_period)
# print(len(ts_data))
# # Ensure there are enough data points
# if len(ts_data) < 12:
#     print("🚨 Not enough data points for CausalImpact. Try alternative methods.")
# else:
#     # Run CausalImpact
#     impact = CausalImpact(data=ts_data, pre_period=pre_period, post_period=post_period)

#     # Check if the model ran successfully
#     if impact.inferences is not None:
#         impact.plot()
#         print(impact.summary())
#         print(impact.summary(output='report'))
#     else:
#         print("🚨 CausalImpact failed. Check data formatting.")

# # With only about 6 annual data points, CausalImpact is unlikely to work reliably because:

# # Bayesian structural time series (BSTS) models generally need a longer series (12+ observations is the threshold used above).

# # With just 6 years, there is too little data to estimate a meaningful impact.

# from causalimpact import CausalImpact
# import pandas as pd

# # Ensure proper datetime conversion
# linked_df['date_received'] = pd.to_datetime(linked_df['date_received_in_opg'])#.dt.strftime('%Y-%m').astype(str)

# # Compute mean donor age per month
# ts_data = linked_df.groupby(linked_df['date_received'].dt.to_period('M'))['donor_age_at_investigation'].mean()
# # ts_data = linked_df.groupby(linked_df['date_received'])['donor_age_at_investigation'].mean()

# # Ensure DataFrame format
# ts_data = ts_data.to_frame(name="value")

# # Convert index to datetime format
# ts_data.index = ts_data.index.to_timestamp()

# # Drop missing values
# ts_data.dropna(inplace=True)

# #linked_df['date_received'].loc[1:1][1]

# # Define pre/post intervention periods based on actual months in index
# # pre_period = [ts_data.index.min().strftime('%Y-%m'), '2019-12']
# # post_period = ['2020-01', ts_data.index.max().strftime('%Y-%m')]
# # pre_period = [ts_data.index.min(), '2019-12']
# # post_period = ['2020-01', ts_data.index.max()]
# pre_period = [ts_data.index.min(), pd.Timestamp('2019-12-01')]
# post_period = [pd.Timestamp('2020-01-01'), ts_data.index.max()]

# print(pre_period)
# print(post_period)
# print(len(ts_data))
# print(ts_data.head(10))
# data = ts_data  # keep the DatetimeIndex so it matches the Timestamp periods above

# # Ensure there are enough data points
# if len(ts_data) < 12:
#     print("🚨 Not enough data points for CausalImpact. Try alternative methods.")
# else:
#     # Run CausalImpact
#     impact = CausalImpact(data=data, pre_period=pre_period, post_period=post_period)

#     # Check if the model ran successfully
#     if impact.inferences is not None:
#         impact.plot()
#         print(impact.summary())
#         print(impact.summary(output='report'))
#     else:
#         print("🚨 CausalImpact failed. Check data formatting.")

In [None]:
import pandas as pd
# First, uninstall the current versions
!pip uninstall numpy statsmodels -y

# Then, install compatible versions (restart the kernel afterwards so the reinstalled packages are picked up)
!pip install numpy==1.26.4 statsmodels
#!pip install --upgrade numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Ensure proper datetime conversion
linked_df['year_received'] = pd.to_datetime(linked_df['date_received_in_opg']).dt.year

# Compute mean donor age per year
ts_data = linked_df.groupby('year_received')['donor_age_at_investigation'].mean()

# Create a dummy variable: 1 if received in 2020 or later, 0 if before
linked_df['post_policy'] = (linked_df['year_received'] >= 2020).astype(int)

# Dummy variable for treatment: 1 for Health & Welfare cases ('hw'), 0 for all other cases
linked_df['treated'] = (linked_df['casesubtype'] == 'hw').astype(int)

# Interaction term (DiD effect)
linked_df['post_treated'] = linked_df['post_policy'] * linked_df['treated']

# Run a DiD regression
model = smf.ols("donor_age_at_investigation ~ post_policy + treated + post_treated", data=linked_df).fit()
print(model.summary())

# If the post_treated coefficient is statistically significant, it suggests that the change in
# donor age at investigation from 2020 onward differed between Health & Welfare cases and other cases.
# Dependent Variable: donor_age_at_investigation
# R-squared: 0.006
# This indicates that the model explains only 0.6% of the variance in the dependent variable. 
# Essentially, the model has very low explanatory power.
# Adjusted R-squared: 0.006
# Similar to R-squared, it adjusts for the number of predictors in the model. Here, it also indicates very low explanatory power.
# F-statistic: 39.50
# This tests the overall significance of the model. A higher F-statistic suggests that the model is statistically significant.
# Prob (F-statistic): 1.96e-25
# The p-value associated with the F-statistic. Since it is much less than 0.05, the model is statistically significant.
# Coefficients:
# Intercept: 81.6097
# This is the average donor age at investigation when both dummies are zero, i.e. for non-Health & Welfare cases received before 2020.
# post_policy: 0.3935
# This coefficient represents the change in donor age at investigation from 2020 onward for non-Health & Welfare cases. The p-value (0.239) indicates that this
# change is not statistically significant.
# treated: -4.7951
# This coefficient represents the pre-2020 difference in donor age at investigation between Health & Welfare cases and all other cases.
# The p-value (0.000) indicates that this difference is statistically significant.
# post_treated: 1.6825
# This interaction term is the DiD estimate: the additional change from 2020 onward for Health & Welfare cases relative to other cases. The p-value (0.182) indicates that it is not statistically significant.
# Diagnostic Tests:
# Omnibus: 8707.441
# This tests for the normality of residuals. A high value indicates non-normality.
# Durbin-Watson: 2.006
# This tests for autocorrelation in residuals. A value close to 2 suggests no autocorrelation.
# Jarque-Bera (JB): 51834.606
# This also tests for normality. A high value indicates non-normality.
# Skew: -2.133
# This indicates the distribution of residuals is left-skewed.
# Kurtosis: 9.839
# This indicates the distribution of residuals has heavy tails (leptokurtic).
# Conclusion:
# The model is statistically significant overall, but it explains very little of the variance in donor age at investigation.
# The coefficient for treated is significant, suggesting that Health & Welfare cases have a lower average donor age at
# investigation than other cases. However, the coefficients for post_policy and post_treated are not significant,
# indicating no detectable change in donor age at investigation from 2020 onward, either overall or specifically for Health & Welfare cases.
# The diagnostics suggest non-normality in residuals, which might affect the reliability of the p-values.
# Consider additional variables or different model specifications to improve explanatory power.
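
The regression above models donor age. To test the step-versus-gradual question on investigation volumes directly, one alternative specification is a segmented (interrupted time series) regression on monthly counts. The sketch below is a minimal version fitted with ordinary least squares on synthetic data; the series, the break month, and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly counts (illustrative only): flat level, then a 30-case step drop at month 48.
n_months, break_month = 84, 48
t = np.arange(n_months)
post = (t >= break_month).astype(float)
counts = 100 - 30 * post + rng.normal(0, 2, size=n_months)

# Design matrix: intercept, pre-existing trend, level step at the break,
# and change in slope after the break.
X = np.column_stack([np.ones(n_months), t, post, post * (t - break_month)])
coef, *_ = np.linalg.lstsq(X, counts, rcond=None)
intercept, trend, step, slope_change = coef

print(f"Pre-break trend per month: {trend:.2f}")
print(f"Step change at break:      {step:.2f}")
print(f"Slope change after break:  {slope_change:.2f}")
```

A large `step` with a near-zero `trend` points to a stepped reduction, while a clearly negative `trend` with a small `step` points to a gradual decline. Standard errors are needed before drawing conclusions, and the same design matrix can be passed to statsmodels OLS to obtain them.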

In [None]:
# !pip install pydbtools
# !pip install awswrangler
# !pip install pandas --upgrade
# (time is part of the Python standard library, so no install is needed)
# !pip install numpy --upgrade
# !pip install jinja2
# !pip install lovely_logger 

# import pydbtools
# import awswrangler as wr
# import model_stages.stage1_get_invest_data as stage1
# import model_stages.stage2_prepare_invests_for_merging as stage2
# import model_stages.stage3_merge_with_POA_data as stage3


# ## Config

# date_cols = ['date_received_in_investigations', 'date_allocated_to_team',
#                   'date_allocated_to_current_investigator', 'pg_sign_off_date',
#                   'closure_date',
#                   'date_received_in_opg', 'legal_approval_date_clean',
#                  'receiptdate', 'registrationdate']

# cols_to_keep = ['unique_id', 
#                 # 'client_donor_title',
#                 # 'client_donor_forename', 'client_donor_surname', 'client_donor_dob',
#                 'case_type', 'concern_type', 'status', 'sub_status',
#                 'date_received_in_opg', 'multiple_id', 'lead_case',
#                 'days_to_pg_sign_off', 'closure_date', 'case_uid', 'receiptdate', 'registrationdate',
#                 'case_status', 'poa_case_type', 'casesubtype', 'poas_involved',
#                 'order_involved', 'legal_approval_date_clean', 'invest_concluded',
#                 'year_concluded', 'case_type_agg']
# first_date = '01/01/2019'
# last_date = '31/12/2024'

# min_LPA_receiptdate = '2017-01-01'
# max_LPA_receiptdate = '2024-12-31'
# sample_size = '1000000'


# ## Read/Link Data

# # Read Investigations Data
# df_inv_data = stage1.main(date_cols, first_date, last_date, cols_to_keep)

# # Expand Investigations Data
# expanded_df = stage2.main(df_inv_data)

# # Link to LPA Data
# linked_df = stage3.main(expanded_df, 
#                         min_LPA_receiptdate, 
#                         max_LPA_receiptdate, 
#                         sample_size)

# # print(linked_df)

# linked_Investigation_LPA_data = linked_df
# linked_Investigation_LPA_data.to_csv('linked_Investigation_LPA_data.csv', index=False)

# linked_df

# inv = linked_df.loc[linked_df['investigator'].notnull()]
# inv.to_csv('linked_Investigation_LPA_inv.csv', index=False)

# inv_linked_lpa = linked_df[['uid','donor_id','lpa_reg_date','lpa_status','lpa_rec_date','poa_type','unique_id','case_no','client_donor_dob','case_type','concern_type','date_received_in_opg','status','mojap_extract_date','poa_case_type','casesubtype','poa_rec_to_invest_rec','year_concluded','link_id','uid_to_link']]
# inv_linked_lpa
# inv_linked_lpa_data = inv_linked_lpa
# inv_linked_lpa_data.to_csv('inv_linked_lpa_data.csv', index=False)

# pydbtools.read_sql_query("SELECT DISTINCT(mojap_extract_date) FROM opg_investigations_prod.investigations ORDER BY mojap_extract_date DESC")
# lpa_dashboard = pydbtools.read_sql_query("SELECT * FROM sirius_derived.opg_lpa_dashboard LIMIT 5")