This notebook contains my analysis of the Rotterdam algorithm. Initially, I follow the steps prescribed in the lighthouse report repo. 

First, I follow the steps in HowToRun. 

Step 1: Choose which features to test on. There are 315 features in total. I start with only the feature gender, indicated by persoon_geslacht_vrouw (1=female). 

Step 2: Create an ini.config file. Each experiment must have a corresponding config file that defines the author, experiment name, features and their specific values, business rules, and bins. I created: conf\archetypes\vrouw.ini, by filling in the template. 

This ini file determines what data will be used, what variables will be tested and how the output will be structured. 

Step 3: The next step is running the data generator notebook: this produces data\03_experiment_input\20251023_jv_persoon_geslacht_vrouw.csv. This is the data that I will use as the input for the experiment. 
The result is a copy of the original synthetic data with twice the number of rows: first the original rows, then the same rows with only the persoon_geslacht_vrouw variable flipped. This enables testing conditional statistical parity (counterfactual fairness). Impossible combinations are excluded in this step by checking the business rules. Then I run notebooks\analyze.R, which takes my input file and applies the Rotterdam algorithm to it. The script generates data\04_experiment_output\20251023_jv_persoon_geslacht_vrouw.csv, which has two new columns, 'Ja' and 'Nee'. Their values are each other's complement, and 'Ja' is the predicted risk score. The script uses the real Rotterdam model, saved in caret format in data\01_raw\20220929_finale_model.rds.
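
The row-doubling idea can be sketched in pandas as follows. This is only an illustration under my own assumptions, not the Lighthouse data generator: the helper `add_flipped_copy`, the `is_flipped` column and the toy frame are hypothetical, and the business-rule checks are left out.

```python
import pandas as pd

def add_flipped_copy(df, column):
    """Return the original rows followed by a copy with binary `column` flipped.

    Hypothetical helper for illustration; business-rule checks are omitted.
    """
    original = df.copy()
    original["is_flipped"] = False

    flipped = df.copy()
    flipped[column] = 1 - flipped[column]  # 0 -> 1 and 1 -> 0
    flipped["is_flipped"] = True

    # Original rows first, then the counterfactual rows
    return pd.concat([original, flipped], ignore_index=True)

toy = pd.DataFrame({"persoon_geslacht_vrouw": [0, 1, 1], "age": [30, 45, 52]})
doubled = add_flipped_copy(toy, "persoon_geslacht_vrouw")
print(doubled)
```

Every other column stays identical between a row and its flipped copy, so any change in the risk score between the two can only come from the flipped variable.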

The cut-off threshold is calculated such that the top 10% is above the threshold.
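
A threshold of this kind can be derived as the 90th percentile of the scores. The sketch below uses made-up scores from a beta distribution (an assumption, purely for illustration); it is not how the R script computes its threshold, only the same idea in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(8, 5, size=10_000)  # toy risk scores, not real model output

# The 90th percentile is the score that 10% of the cases exceed
threshold = np.quantile(scores, 0.90)
share_above = (scores > threshold).mean()

print(threshold, share_above)
```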

After adding the risk score, this script runs an analysis. One hot encoded variables and binning variables are handled appropriately, by creating an IV column for all rows, containing a string indicating that case's group membership. Between groups, it runs a pairwise t test and tests statistical parity.

Instead of using this R script for analysis, I will only use this to generate the prediction scores. Then, I will use the data generated this way to run my own experiments in AIF360. 




The R script generates these plots: 
[plots](../results/statistical_parity/archetypes/persoon_geslacht_vrouw/plot.pdf)
This shows that in each decile women and non-women are represented approximately equally. 

In [3]:
import pandas as pd

df = pd.read_csv("C:/Users/jetve/Documents/suspicion_machine/results/statistical_parity/archetypes/persoon_geslacht_vrouw/sum_stats.csv")
df


Unnamed: 0.1,Unnamed: 0,IV,variable,n,mean,sd,share_above_td_threshold,share_above_rw_threshold,perc_misrep_td_threshold,perc_misrep_rw_threshold,data_type
0,1,IV: \r\npersoon_geslacht_vrouw: 0\r\n,Ja,6542,0.611,0.086,0.101345,0.633293,0.134515,53.329257,synth
1,2,IV: \r\npersoon_geslacht_vrouw: 1\r\n,Ja,6103,0.613,0.084,0.09864,0.649844,-0.135999,54.984434,synth
2,3,IV: \r\npersoon_geslacht_vrouw: 0\r\n,Ja,12508,0.609,0.085,0.09314,0.631116,-0.685961,53.111609,synth_conditional
3,4,IV: \r\npersoon_geslacht_vrouw: 1\r\n,Ja,12508,0.615,0.085,0.110249,0.652542,1.024944,55.254237,synth_conditional


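The following t-test shows no significant difference between women and men in the synthetic data (p = 0.0664). In the conditional synthetic data (synth_conditional), however, the difference is highly significant (p = 1e-07). 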

In [4]:
import pandas as pd

df = pd.read_csv("C:/Users/jetve/Documents/suspicion_machine/results/statistical_parity/archetypes/persoon_geslacht_vrouw/t_test.csv")
df


Unnamed: 0.1,Unnamed: 0,.y.,group1,group2,n1,n2,p,p.signif,p.adj,p.adj.signif,data_type
0,1,Ja,IV: \r\npersoon_geslacht_vrouw: 0\r\n,IV: \r\npersoon_geslacht_vrouw: 1\r\n,6542,6103,0.0664,ns,0.0664,ns,synth
1,2,Ja,IV: \r\npersoon_geslacht_vrouw: 0\r\n,IV: \r\npersoon_geslacht_vrouw: 1\r\n,12508,12508,1e-07,****,1e-07,****,synth_conditional


Now I will run my own fairness tests on this dataset using AIF360. First I load the dataset

In [5]:
df = pd.read_csv("20251023_jv_persoon_geslacht_vrouw.csv")
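# keep only the "synth" rows (this drops the synth_conditional rows);
# .copy() gives an independent DataFrame, so adding columns later does not raise a SettingWithCopyWarning
df_synth = df[df['data_type'] == "synth"].copy()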


Since AIF360 works with binary labels, we need to translate the risk score into a binary variable. We do this using a threshold. Lighthouse has defined this as 0.7125668 for the synthetic data, so that 10% of cases are above the threshold.

In [6]:
# add binary prediction variable
threshold = 0.7125668 
df_synth['binary_pred']= (df_synth['Ja']> threshold).astype(int)

# check that this threshold really puts about 10% of the cases above it
print((df_synth["Ja"] > threshold).mean())

df_synth.head()

0.10003954132068012




Unnamed: 0,"relatie""""",adres_aantal_brp_adres,adres_aantal_verschillende_wijken,adres_aantal_verzendadres,adres_aantal_woonadres_handmatig,adres_dagen_op_adres,adres_recentst_onderdeel_rdam,adres_recentste_buurt_groot_ijsselmonde,adres_recentste_buurt_nieuwe_westen,adres_recentste_buurt_other,...,typering_hist_sector_zorg,typering_ind,typering_indicatie_geheime_gegevens,typering_other,typering_transport__logistiek___tuinbouw,typering_zorg__schoonmaak___welzijn,data_type,Ja,Nee,binary_pred
0,1,6,3,1,0,1012,1,0,0,1,...,0,0,0,0,0,0,synth,0.600234,0.399766,0
1,2,4,2,1,0,5268,1,0,0,0,...,0,1,0,1,0,0,synth,0.748566,0.251434,1
2,3,4,2,0,1,1820,1,0,0,1,...,0,0,0,0,0,0,synth,0.781536,0.218464,1
3,4,3,2,0,1,9056,1,0,0,0,...,0,1,0,0,0,0,synth,0.61335,0.38665,0
4,5,3,3,0,2,5246,1,0,0,1,...,0,1,0,0,0,0,synth,0.688989,0.311011,0


In [7]:
# load aif packages

import sys
sys.path.insert(1, "../")  

import numpy as np
np.random.seed(0)

from aif360.datasets import StandardDataset
from aif360.metrics import BinaryLabelDatasetMetric
from IPython.display import Markdown, display



Check which features have the highest relative importance, and include those as well.

In [8]:
import pandas as pd

df = pd.read_csv("features.csv", encoding="latin-1")

df_sorted = df.sort_values("relative_importance", ascending=False)

top20 = df_sorted.head(20)

for i, row in top20.iterrows():
    print(f"{row['feature']}: {row['relative_importance']:.4f}")


persoon_leeftijd_bij_onderzoek: 100.0000
relatie_overig_actueel_vorm__kostendeler: 35.4857
contacten_onderwerp_no_show: 27.8748
competentie_vakdeskundigheid_toepassen: 25.1066
contacten_onderwerp_overleg_met_inkomen: 24.1014
adres_dagen_op_adres: 23.3291
relatie_overig_kostendeler: 20.7489
pla_historie_ontwikkeling: 19.1113
instrument_ladder_huidig_activering: 17.5048
contacten_soort_afgelopenjaar_document__uitgaand_: 15.4138
persoonlijke_eigenschappen_dagen_sinds_taaleis: 15.1538
persoonlijke_eigenschappen_dagen_sinds_opvoer: 14.9285
ontheffing_dagen_hist_vanwege_uw_medische_omstandigheden: 13.8170
relatie_partner_totaal_dagen_partner: 12.4425
pla_hist_pla_categorie_doelstelling_16: 11.9347
relatie_kind_leeftijd_verschil_ouder_eerste_kind: 11.1659
relatie_kind_huidige_aantal: 10.8480
instrument_ladder_historie_activering: 10.7604
persoon_geslacht_vrouw: 9.9040
afspraak_inspanningsperiode: 9.3556


In [9]:
# keep only the relevant columns:
# relevant features and their interpretation determined from the Lighthouse methodology,
# plus the features with the highest relative importance (64 columns in total)
df_aif = df_synth[["adres_aantal_verschillende_wijken","adres_recentste_plaats_other", "adres_recentste_plaats_rotterdam",
                   "adres_recentste_buurt_other","adres_recentste_wijk_delfshaven","contacten_onderwerp_boolean_no_show", 
                   "contacten_onderwerp_boolean_taaleis___voldoet","contacten_onderwerp_boolean_ziek__of_afmelding","beschikbaarheid_huidig_afwijkend_wegens_medische_omstandigheden", 
                   "beschikbaarheid_recent_afwijkend_wegens_medische_omstandigheden","beschikbaarheid_recent_afwijkend_wegens_sociaal_maatschappelijke_situatie",
                   "beschikbaarheid_aantal_historie_afwijkend_wegens_medische_omstandigheden","beschikbaarheid_aantal_historie_afwijkend_wegens_sociaal_maatschappelijke_situatie",
                   "relatie_partner_huidige_partner___partner__gehuwd_", "relatie_partner_totaal_dagen_partner",
                    "belemmering_financiele_problemen", "belemmering_niet_computervaardig","belemmering_psychische_problemen",
                    "belemmering_dagen_financiele_problemen","belemmering_dagen_lichamelijke_problematiek",
                    "belemmering_dagen_psychische_problemen", "typering_hist_inburgeringsbehoeftig","persoon_leeftijd_bij_onderzoek", "persoon_geslacht_vrouw",
                    "persoonlijke_eigenschappen_spreektaal","persoonlijke_eigenschappen_taaleis_schrijfv_ok","persoonlijke_eigenschappen_dagen_sinds_taaleis",
                    "persoonlijke_eigenschappen_nl_spreken1", "persoonlijke_eigenschappen_nl_spreken2", "persoonlijke_eigenschappen_nl_spreken3", "persoonlijke_eigenschappen_nl_lezen3",
                    "persoonlijke_eigenschappen_nl_lezen4", "persoonlijke_eigenschappen_nl_schrijven0","persoonlijke_eigenschappen_nl_schrijven1","persoonlijke_eigenschappen_nl_schrijven2",
                    "persoonlijke_eigenschappen_nl_schrijven3","persoonlijke_eigenschappen_nl_begrijpen3", "relatie_kind_huidige_aantal", 
                    "relatie_kind_heeft_kinderen", "relatie_kind_basisschool_kind", "relatie_kind_jongvolwassen", "relatie_kind_tiener", "relatie_kind_volwassen",
                    "relatie_overig_historie_vorm__andere_inwonende","persoonlijke_eigenschappen_uitstroom_verw_vlgs_km","relatie_kind_leeftijd_verschil_ouder_eerste_kind",
                    "belemmering_hist_taal",
                    'persoonlijke_eigenschappen_spreektaal_anders', "relatie_overig_actueel_vorm__kostendeler",
                    "contacten_onderwerp_no_show",
                    "competentie_vakdeskundigheid_toepassen",
                    "contacten_onderwerp_overleg_met_inkomen",
                    "adres_dagen_op_adres",
                    "relatie_overig_kostendeler",
                    "pla_historie_ontwikkeling",
                    "instrument_ladder_huidig_activering",
                    "contacten_soort_afgelopenjaar_document__uitgaand_",
                    "persoonlijke_eigenschappen_dagen_sinds_opvoer",
                    "ontheffing_dagen_hist_vanwege_uw_medische_omstandigheden",
                    "pla_hist_pla_categorie_doelstelling_16",
                    "instrument_ladder_historie_activering",
                    "afspraak_inspanningsperiode",
                      "Ja", "binary_pred"]]
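df_aif = df_aif.copy()  # explicit copy, so new columns can be added later without a SettingWithCopyWarning
df_aif.head()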



Unnamed: 0,adres_aantal_verschillende_wijken,adres_recentste_plaats_other,adres_recentste_plaats_rotterdam,adres_recentste_buurt_other,adres_recentste_wijk_delfshaven,contacten_onderwerp_boolean_no_show,contacten_onderwerp_boolean_taaleis___voldoet,contacten_onderwerp_boolean_ziek__of_afmelding,beschikbaarheid_huidig_afwijkend_wegens_medische_omstandigheden,beschikbaarheid_recent_afwijkend_wegens_medische_omstandigheden,...,pla_historie_ontwikkeling,instrument_ladder_huidig_activering,contacten_soort_afgelopenjaar_document__uitgaand_,persoonlijke_eigenschappen_dagen_sinds_opvoer,ontheffing_dagen_hist_vanwege_uw_medische_omstandigheden,pla_hist_pla_categorie_doelstelling_16,instrument_ladder_historie_activering,afspraak_inspanningsperiode,Ja,binary_pred
0,3,0,1,1,1,0,0,1,0,0,...,1,1,3,442,1125,1,1,1,0.600234,0
1,2,0,1,0,0,0,0,1,0,1,...,0,0,10,11,1718,0,2,1,0.748566,1
2,2,0,1,1,0,0,1,1,0,0,...,0,0,7,188,1740,0,1,1,0.781536,1
3,2,0,1,0,0,1,0,0,0,0,...,1,1,3,741,1199,1,2,1,0.61335,0
4,3,0,1,1,0,0,0,1,0,1,...,1,0,6,225,929,1,3,1,0.688989,0


In [10]:
protected_attributes = ['persoon_geslacht_vrouw', 'adres_recentste_wijk_delfshaven','typering_hist_inburgeringsbehoeftig']

In [11]:
# load the data as a StandardDataset and define protected attributes and privileged classes
# I start with only women as the unprivileged class, and then gradually add attributes
# to see how the EDF value changes

dataset_orig = StandardDataset(df=df_aif.copy(),
                               label_name='binary_pred',
                               favorable_classes=[0], # 1 means high risk
                               protected_attribute_names=protected_attributes,
                               privileged_classes=[[0],[0],[0]]) 


In [12]:
# define privileged and unprivileged groups

groups = {
    "gender": {
        "priv": [{'persoon_geslacht_vrouw': 0}],
        "unpriv": [{'persoon_geslacht_vrouw': 1}]
    },
    "taaleis" : {
        "priv": [{"contacten_onderwerp_boolean_taaleis___voldoet": 1}],
        "unpriv": [{"contacten_onderwerp_boolean_taaleis___voldoet": 0}]
    },
    "delfshaven": {
        "priv": [{'adres_recentste_wijk_delfshaven': 0}],
        "unpriv": [{'adres_recentste_wijk_delfshaven': 1}]
    },
    "inburgeringsbehoeftig": {
        "priv" : [{'typering_hist_inburgeringsbehoeftig': 0}],
        "unpriv" : [{'typering_hist_inburgeringsbehoeftig': 1}]
    }
}


In [13]:
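def fairness_metrics(dataset, group):
    """Return formatted strings with the SP for `group` and the smoothed EDF."""
    metric = BinaryLabelDatasetMetric(
        dataset,
        unprivileged_groups=groups[group]["unpriv"],
        privileged_groups=groups[group]["priv"]
    )

    # The smoothed EDF is computed over the intersections of all protected
    # attributes in `dataset`, so it does not depend on `group`.
    # Note: the braces below build a set of strings (there are no keys),
    # so the order in which the two entries print is not guaranteed.
    return {
        f"SP {group} = %f" % metric.mean_difference(),
        "EDF = %f" % metric.smoothed_empirical_differential_fairness()
    }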


In [14]:
print(fairness_metrics(dataset_orig, "gender"),
fairness_metrics(dataset_orig, "inburgeringsbehoeftig"),
fairness_metrics(dataset_orig, "delfshaven"))


{'EDF = 1.411612', 'SP gender = 0.002705'} {'EDF = 1.411612', 'SP inburgeringsbehoeftig = 0.030479'} {'EDF = 1.411612', 'SP delfshaven = -0.014980'}
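
As a cross-check on the SP values, the quantity behind `mean_difference()` can be written out by hand: the rate of favorable outcomes in the unprivileged group minus that rate in the privileged group. The sketch below is a minimal pandas version on a toy frame. The helper name `spd_by_hand` and the toy data are hypothetical; the favorable label is 0 (not flagged), as in the StandardDataset above.

```python
import pandas as pd

def spd_by_hand(df, label, attr, unpriv_value, priv_value, favorable=0):
    """P(label == favorable | unprivileged) - P(label == favorable | privileged)."""
    is_favorable = df[label] == favorable
    rate_unpriv = is_favorable[df[attr] == unpriv_value].mean()
    rate_priv = is_favorable[df[attr] == priv_value].mean()
    return rate_unpriv - rate_priv

toy = pd.DataFrame({
    "persoon_geslacht_vrouw": [0, 0, 0, 0, 1, 1, 1, 1],
    "binary_pred":            [0, 0, 0, 1, 0, 0, 1, 1],
})

# Men (privileged): 3 of 4 not flagged; women (unprivileged): 2 of 4 not flagged
spd = spd_by_hand(toy, "binary_pred", "persoon_geslacht_vrouw",
                  unpriv_value=1, priv_value=0)
print(spd)
```

A negative value means the unprivileged group receives the favorable outcome less often; a value near 0, such as the 0.0027 for gender above, means the two rates are nearly equal.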


With the code above I can compute SP per group, as well as the global EDF. Now we will look at conditional SP.

First, transform a number of variables to categorical.

In [15]:
import numpy as np

# transform age to a categorical variable
# define bins and labels (the last bin is open-ended: 60 and older, despite its "60-69" label)
bins = [18, 30, 40, 50, 60, np.inf]  
labels = [
    "18-29",
    "30-39",
    "40-49",
    "50-59",
    "60-69"
]

df_aif["leeftijd_cat"] = pd.cut(
    df_aif["persoon_leeftijd_bij_onderzoek"],
    bins=bins,
    labels=labels,
    right=False  # intervals include the lower bound and exclude the upper bound
)


# transform days of financial problems to a categorical variable
# define bins and labels
bins = [-np.inf, 0, 699, np.inf]
labels = ["0", "<700", "700+"]
df_aif["fin_cat"] = pd.cut(
    df_aif["belemmering_dagen_financiele_problemen"],
    bins=bins,
    labels=labels,
    right=True
)

# transform partner days to a categorical variable
# define bins and labels
# only one case has 0 in the synthetic dataset, so it is merged into <360
bins = [-np.inf,  359, 719, np.inf]
labels = ["<360", "<720", "720+"]
df_aif["partner_cat"] = pd.cut(
    df_aif["relatie_partner_totaal_dagen_partner"],
    bins=bins,
    labels=labels,
    right=True
)
# transform basisschool kind to cat variable
df_aif["basisschool_cat"] = np.where(
    df_aif["relatie_kind_basisschool_kind"] == 0,
    "0",
    "1+"
)

# transform aantal kinderen to cat variable
df_aif["kind_cat"] = np.where(
    df_aif["relatie_kind_huidige_aantal"] == 0,
    "0",
    "1+"
)



print("Financiële problemen:")
print(df_aif["fin_cat"].value_counts(dropna=False))
print("\n")

print("Partner dagen:")
print(df_aif["partner_cat"].value_counts(dropna=False))
print("\n")

print("Basisschool-kind categorie:")
print(df_aif["basisschool_cat"].value_counts(dropna=False))

print("kind categorie:")
print(df_aif["kind_cat"].value_counts(dropna=False))



Financiële problemen:
fin_cat
700+    5846
0       5630
<700    1169
Name: count, dtype: int64


Partner dagen:
partner_cat
720+    9185
<360    1784
<720    1676
Name: count, dtype: int64


Basisschool-kind categorie:
basisschool_cat
0     8292
1+    4353
Name: count, dtype: int64
kind categorie:
kind_cat
1+    8202
0     4443
Name: count, dtype: int64



In [16]:
import pandas as pd
from aif360.sklearn.metrics import selection_rate, statistical_parity_difference

def conditional_statistical_parity(df, pred_col, A, C, pos_label=1, min_size=30):
    """
    Compute Conditional Statistical Parity:
    P(Yhat | A=a_i, C=c) - P(Yhat | A=a_j, C=c)
    for each level of C.
    
    Parameters
    ----------
    df : pd.DataFrame
    pred_col : str
        Column with predicted labels (binary_pred)
    A : str
        Protected attribute (binary or categorical)
    C : str
        Conditioning variable
    pos_label : int
        Label considered 'selected' / 'negative' (usually 1 for risk scores)
    min_size : int
        Minimum group size per C level to be included
    
    Returns
    -------
    results : pd.DataFrame
        One row per value of C, with per-A-group selection rates and SPD.
    """
    
    results = []

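    # observed=True skips empty categories when C or A is a pandas Categorical
    for c_val, df_c in df.groupby(C, observed=True):

        # Skip tiny strata
        if len(df_c) < min_size:
            continue

        # selection rate per A group
        sel_rates = (
            df_c.groupby(A, observed=True)[pred_col]
                .apply(lambda x: (x == pos_label).mean())
                .to_dict()
        )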

        # compute SPD within this stratum (max difference between A groups)
        if len(sel_rates) > 1:
            spd = max(sel_rates.values()) - min(sel_rates.values())
        else:
            spd = None
        
        results.append({
            "C_value": c_val,
            "n_in_stratum": len(df_c),
            "selection_rates": sel_rates,
            "conditional_SPD": spd,
        })

    return pd.DataFrame(results)


In [17]:
# define the protected attribute A and the variable C to condition on
A = "persoon_geslacht_vrouw"
C = "leeftijd_cat"

In [18]:
# alternative: swap the roles of A and C (running this cell overrides the previous one)
A = "leeftijd_cat"
C = "persoon_geslacht_vrouw"

In [19]:
res = conditional_statistical_parity(
    df=df_aif,
    pred_col="binary_pred",
    A=A,
    C=C,
    pos_label=1
)

print(res)

   C_value  n_in_stratum                                    selection_rates  \
0        0          6542  {'18-29': 0.2976190476190476, '30-39': 0.22890...   
1        1          6103  {'18-29': 0.27601809954751133, '30-39': 0.1796...   

   conditional_SPD  
0         0.263858  
1         0.232478  




So we see a large increase in the EDF metric after adding only one extra protected attribute. The AIF360 EDF metric only gives the final output. I want insight into how this value arises, so that we can see which intersections of protected groups the model treats most unfairly. Below, I reimplement the smoothed EDF function from AIF360 with additional output: which groups have the 'worst' differences, and what the group sizes are. 

In [20]:
import numpy as np
import pandas as pd

def smoothed_base_rates_df(df, label, protected, concentration=1.0):
    """
    Replicates AIF360's _smoothed_base_rates,
    but returns a DataFrame with the groups and their smoothed base rates (sbr).
    """
    # counts per (intersectional) group;
    # observed=True keeps empty category combinations from showing up as groups of size 0
    counts = (
        df.groupby(protected, observed=True)[label]
          .agg(['sum', 'count'])
          .reset_index()
          .rename(columns={'sum': 'n_pos', 'count': 'n_total'})
    )

    K = df[label].nunique()  # usually 2
    alpha = concentration / K

    counts["sbr"] = (counts["n_pos"] + alpha) / (counts["n_total"] + concentration)

    return counts  # contains: protected columns, n_pos, n_total, sbr

def smoothed_edf_explainer(df, label, protected, concentration=1.0):
    # 1) smoothed base rates per group
    groups = smoothed_base_rates_df(df, label, protected, concentration)

    rows = []
    for i, gi in groups.iterrows():
        for j, gj in groups.iterrows():
            if i == j:
                continue

            p_i = gi["sbr"]
            p_j = gj["sbr"]

            # pos_ratio and neg_ratio as in AIF360
            pos = abs(np.log(p_i) - np.log(p_j))
            neg = abs(np.log(1 - p_i) - np.log(1 - p_j))

            pair_edf = max(pos, neg)

            rows.append({
                "group_i": tuple(gi[p] for p in protected),
                "group_j": tuple(gj[p] for p in protected),
                "sbr_i": p_i,
                "sbr_j": p_j,
                "n_total_i": gi["n_total"],
                "n_pos_i": gi["n_pos"],
                "n_total_j": gj["n_total"],
                "n_pos_j": gj["n_pos"],
                "pos_ratio": pos,
                "neg_ratio": neg,
                "pair_edf": pair_edf,
            })

    pair_df = pd.DataFrame(rows)

    # 2) global EDF = maximum over all pairs
    EDF = pair_df["pair_edf"].max()

    # 3) the pairs that produce this EDF
    worst_pairs = pair_df[pair_df["pair_edf"] == EDF]

    return EDF, groups, pair_df, worst_pairs



In [21]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=['persoon_geslacht_vrouw','typering_hist_inburgeringsbehoeftig'],
    concentration=1.0
)

print("Globale EDF:", EDF)
#print("\nSmoothed base rates per groep:")
#print(groups)

#print("\nAlle pairwise ratios:")
#print(pair_df.sort_values("pair_edf", ascending=False))

print("\nParen die de EDF bepalen:")
print(worst_pairs)


Globale EDF: 0.24486633350154463

Paren die de EDF bepalen:
      group_i     group_j     sbr_i     sbr_j  n_total_i  n_pos_i  n_total_j  \
0  (0.0, 0.0)  (0.0, 1.0)  0.101615  0.079545     6499.0    660.0       43.0   
2  (0.0, 0.0)  (1.0, 1.0)  0.101615  0.079545     6499.0    660.0       43.0   
3  (0.0, 1.0)  (0.0, 0.0)  0.079545  0.101615       43.0      3.0     6499.0   
9  (1.0, 1.0)  (0.0, 0.0)  0.079545  0.101615       43.0      3.0     6499.0   

   n_pos_j  pos_ratio  neg_ratio  pair_edf  
0      3.0   0.244866   0.024269  0.244866  
2      3.0   0.244866   0.024269  0.244866  
3    660.0   0.244866   0.024269  0.244866  
9    660.0   0.244866   0.024269  0.244866  


This shows that the groups driving this result are small: 43 cases, of which 3 have a positive label. I am not 100% sure how to interpret the inburgeringsbehoeftig variable, so I set it aside and first focus on language. 
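
The worst pair in the table can be reproduced by hand from its counts. The sketch below only redoes the arithmetic of the smoothed base rates and the log-ratio. The variable names are my own, and the last lines add a what-if (one extra positive case in the small group) to show how sensitive a group of 43 is.

```python
import math

concentration = 1.0
num_classes = 2                      # binary label
alpha = concentration / num_classes  # pseudo-count of 0.5 per class

# Counts copied from the worst pair: 660 of 6499 versus 3 of 43
sbr_large = (660 + alpha) / (6499 + concentration)
sbr_small = (3 + alpha) / (43 + concentration)

pos_ratio = abs(math.log(sbr_large) - math.log(sbr_small))
neg_ratio = abs(math.log(1 - sbr_large) - math.log(1 - sbr_small))
pair_edf = max(pos_ratio, neg_ratio)
print(round(sbr_large, 6), round(sbr_small, 6), round(pair_edf, 6))

# What-if: 4 instead of 3 positive cases in the small group
sbr_small_plus_one = (4 + alpha) / (43 + concentration)
pair_edf_plus_one = max(
    abs(math.log(sbr_large) - math.log(sbr_small_plus_one)),
    abs(math.log(1 - sbr_large) - math.log(1 - sbr_small_plus_one)),
)
print(round(pair_edf_plus_one, 6))
```

The first print matches the sbr_i, sbr_j and pair_edf columns in the table. The what-if value is far smaller, which illustrates that an EDF driven by a group of this size can hinge on a single case.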

In [22]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=['persoon_geslacht_vrouw',"relatie_kind_heeft_kinderen","contacten_onderwerp_boolean_taaleis___voldoet","adres_recentste_wijk_delfshaven","contacten_onderwerp_boolean_ziek__of_afmelding"],
    concentration=1.0
)

print("Globale EDF:", EDF)
print("\nParen die de EDF bepalen:")
print(worst_pairs)


Globale EDF: 1.612116884121157

Paren die de EDF bepalen:
                       group_i                    group_j     sbr_i     sbr_j  \
360  (0.0, 1.0, 0.0, 1.0, 1.0)  (1.0, 0.0, 1.0, 0.0, 0.0)  0.237705  0.047414   
631  (1.0, 0.0, 1.0, 0.0, 0.0)  (0.0, 1.0, 0.0, 1.0, 1.0)  0.047414  0.237705   

     n_total_i  n_pos_i  n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
360       60.0     14.0      347.0     16.0   1.612117   0.222847  1.612117  
631      347.0     16.0       60.0     14.0   1.612117   0.222847  1.612117  


In [23]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=['persoon_geslacht_vrouw',"relatie_kind_heeft_kinderen","contacten_onderwerp_boolean_taaleis___voldoet","belemmering_financiele_problemen"],
    concentration=1.0
)

print("Globale EDF:", EDF)
print("\nParen die de EDF bepalen:")
print(worst_pairs)


Globale EDF: 1.1809078589952136

Paren die de EDF bepalen:
                  group_i               group_j     sbr_i     sbr_j  \
84   (0.0, 1.0, 0.0, 1.0)  (1.0, 0.0, 1.0, 0.0)  0.200246  0.061475   
155  (1.0, 0.0, 1.0, 0.0)  (0.0, 1.0, 0.0, 1.0)  0.061475  0.200246   

     n_total_i  n_pos_i  n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
84       406.0     81.0      609.0     37.0   1.180908   0.160005  1.180908  
155      609.0     37.0      406.0     81.0   1.180908   0.160005  1.180908  


In [24]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=['persoon_geslacht_vrouw',"relatie_kind_heeft_kinderen","contacten_onderwerp_boolean_taaleis___voldoet","belemmering_psychische_problemen"],
    concentration=1.0
)

print("Globale EDF:", EDF)
print("\nParen die de EDF bepalen:")
print(worst_pairs)

Globale EDF: 1.0250759084133456

Paren die de EDF bepalen:
                  group_i               group_j     sbr_i     sbr_j  \
70   (0.0, 1.0, 0.0, 0.0)  (1.0, 0.0, 1.0, 1.0)  0.167238  0.060000   
169  (1.0, 0.0, 1.0, 1.0)  (0.0, 1.0, 0.0, 0.0)  0.060000  0.167238   

     n_total_i  n_pos_i  n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
70       582.0     97.0      424.0     25.0   1.025076   0.121132  1.025076  
169      424.0     25.0      582.0     97.0   1.025076   0.121132  1.025076  


I will now test based on the archetypes that Lighthouse used. I should first transform some of the continuous variables to use them with EDF. 

In [25]:
# this is the max archetype used by lighthouse, so all archetypes combined.
# variable_list = [[{"persoon_geslacht_vrouw":["1"]}],[{"relatie_partner_totaal_dagen_partner":["720"]}],
# [{"relatie_kind_huidige_aantal":["2"]}],[{"relatie_kind_heeft_kinderen":["1"]}],[{"relatie_kind_leeftijd_verschil_ouder_eerste_kind":["20"]}], 
# [{"relatie_kind_basisschool_kind":["2"]}], [{"relatie_overig_historie_vorm__andere_inwonende":["3"]}],
# [{"persoonlijke_eigenschappen_taaleis_voldaan":["0"]}],[{"persoonlijke_eigenschappen_dagen_sinds_taaleis":["0"]}], 
# [{"persoonlijke_eigenschappen_uitstroom_verw_vlgs_km":["1"]}],[{"adres_recentste_wijk_delfshaven":["1"]}], 
# [{"adres_recentste_wijk_stadscentru":["0"]}],[{"adres_recentste_wijk_charlois":["0"]}], [{"adres_recentste_wijk_feijenoord":["0"]}], 
# [{"adres_recentste_wijk_ijsselmonde":["0"]}], [{"adres_recentste_wijk_kralingen_c":["0"]}], [{"adres_recentste_wijk_noord":["0"]}],
#  [{"adres_recentste_wijk_other":["0"]}], [{"adres_recentste_wijk_prins_alexa":["0"]}], [{"adres_recentste_buurt_nieuwe_westen":["0"]}],
#  [{"adres_recentste_buurt_other":["1"]}], [{"adres_recentste_buurt_groot_ijsselmonde":["0"]}], [{"adres_recentste_buurt_oude_noorden":["0"]}], 
# [{"adres_recentste_buurt_vreewijk":["0"]}], [{"adres_recentst_onderdeel_rdam":["1"]}], [{"adres_recentste_plaats_rotterdam":["1"]}], 
# [{"adres_recentste_plaats_other":["0"]}]]

Financially Struggling Single Mother: "We made her a woman, a mother of two children, who had been in a long term relationship, and has been struggling financially"
Let's transform children, relationship and financial difficulties variables and then compute EDF for this archetype. 

In [26]:
# this is how the mother archetype is defined
# variable_list = [[{"persoon_geslacht_vrouw":["ALL"]}], [{"relatie_partner_totaal_dagen_partner":["0", "720"]}], [{"relatie_kind_huidige_aantal":["0", "2"]}], 
#[{"relatie_kind_heeft_kinderen":["ALL"]}],[{"relatie_kind_leeftijd_verschil_ouder_eerste_kind":["0", "20"]}], [{"relatie_kind_basisschool_kind":["0", "2"]}],
# [{"belemmering_dagen_financiele_problemen":["0","700"]}]]

In [27]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=['persoon_geslacht_vrouw',"basisschool_cat","fin_cat","partner_cat"],
    concentration=1.0
)

print("Globale EDF:", EDF)
print("\nParen die de EDF bepalen:")
print(worst_pairs)



Globale EDF: 2.1362462407067007

Paren die de EDF bepalen:
                 group_i              group_j     sbr_i     sbr_j  n_total_i  \
546  (0, 1+, 700+, <360)   (1, 0, <700, <720)  0.215278  0.025424         71   
785   (1, 0, <700, <720)  (0, 1+, 700+, <360)  0.025424  0.215278         58   

     n_pos_i  n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
546       15         58        1   2.136246   0.216673  2.136246  
785        1         71       15   2.136246   0.216673  2.136246  


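How to interpret this? Investigating the archetype leads to a different finding: men with primary-school-age children, long financial struggles (700+ days) and a short relationship (<360 days) are even more vulnerable. They have the highest smoothed rate of being flagged (0.215, n = 71). The lowest rate (0.025, n = 58) belongs to women without primary-school-age children, with shorter financial problems (<700 days) and a mid-length relationship (<720 days). 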

The next archetype is that of a migrant worker. They define this in different ways, first by defining it as living in the Delfshaven district with 3 roommates. According to statistics from Rotterdam, nearly 70 percent of residents of the Delfshaven district are of migrant background.

In [28]:
# variable_list = [[{"relatie_overig_historie_vorm__andere_inwonende":["0","3"]}],[{"adres_recentste_wijk_delfshaven":["ALL"]}, 
# {"adres_recentste_wijk_stadscentru":["ALL"]}],[{"adres_recentste_wijk_charlois":["0"]}], [{"adres_recentste_wijk_feijenoord":["0"]}], 
# [{"adres_recentste_wijk_ijsselmonde":["0"]}], [{"adres_recentste_wijk_kralingen_c":["0"]}], [{"adres_recentste_wijk_noord":["0"]}],
#  [{"adres_recentste_wijk_other":["0"]}], [{"adres_recentste_wijk_prins_alexa":["0"]}], [{"adres_recentste_buurt_nieuwe_westen":["0"]}], 
# [{"adres_recentste_buurt_other":["1"]}], [{"adres_recentste_buurt_groot_ijsselmonde":["0"]}], [{"adres_recentste_buurt_oude_noorden":["0"]}],
#  [{"adres_recentste_buurt_vreewijk":["0"]}], [{"adres_recentst_onderdeel_rdam":["1"]}], [{"adres_recentste_plaats_rotterdam":["1"]}],
#  [{"adres_recentste_plaats_other":["0"]}]]


In [29]:
df_aif["relatie_overig_historie_vorm__andere_inwonende"].value_counts(dropna=False)


relatie_overig_historie_vorm__andere_inwonende
1    6019
0    5959
2     652
3      15
Name: count, dtype: int64

In [30]:
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=["relatie_overig_historie_vorm__andere_inwonende","adres_recentste_wijk_delfshaven"],
    concentration=1.0
)

print("Globale EDF:", EDF)
print("\nParen die de EDF bepalen:")
print(worst_pairs)

Globale EDF: 1.4605861033926117

Paren die de EDF bepalen:
       group_i     group_j    sbr_i    sbr_j  n_total_i  n_pos_i  n_total_j  \
6   (0.0, 0.0)  (3.0, 1.0)  0.06963  0.30000     5162.0    359.0        4.0   
49  (3.0, 1.0)  (0.0, 0.0)  0.30000  0.06963        4.0      1.0     5162.0   

    n_pos_j  pos_ratio  neg_ratio  pair_edf  
6       1.0   1.460586   0.284502  1.460586  
49    359.0   1.460586   0.284502  1.460586  


There are only 15 cases with 3 or more housemates, and only 4 of them are in Delfshaven, so this result is not very reliable. On to the other archetypes.
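
One way to keep such tiny groups from dominating the metric is to drop groups below a minimum size before taking the maximum. The sketch below is a hypothetical variant, not part of AIF360 (which applies no size filter): the function name `edf_with_min_size`, the default `min_n=30`, the column `housemates` and the toy data are all assumptions for illustration.

```python
import itertools
import math

import pandas as pd

def edf_with_min_size(df, label, protected, min_n=30, concentration=1.0):
    """Smoothed EDF over intersectional groups with at least `min_n` members."""
    counts = (
        df.groupby(protected, observed=True)[label]
          .agg(n_pos="sum", n_total="count")
    )
    counts = counts[counts["n_total"] >= min_n]

    alpha = concentration / 2  # binary label assumed
    sbr = (counts["n_pos"] + alpha) / (counts["n_total"] + concentration)

    # Maximum absolute log-ratio over all pairs of remaining groups
    edf = 0.0
    for p_i, p_j in itertools.combinations(sbr, 2):
        pos = abs(math.log(p_i) - math.log(p_j))
        neg = abs(math.log(1 - p_i) - math.log(1 - p_j))
        edf = max(edf, pos, neg)
    return edf, counts.assign(sbr=sbr)

# Toy data: two groups of 40 and one group of only 4
toy = pd.DataFrame({
    "housemates":  [0] * 40 + [3] * 4 + [1] * 40,
    "binary_pred": [1] * 4 + [0] * 36 + [1] * 1 + [0] * 3 + [1] * 8 + [0] * 32,
})

edf_all, _ = edf_with_min_size(toy, "binary_pred", ["housemates"], min_n=1)
edf_big, kept = edf_with_min_size(toy, "binary_pred", ["housemates"], min_n=30)
print(edf_all, edf_big)
print(kept)
```

In the toy data the unfiltered EDF is driven by the group of 4, and the filtered value is lower. The cut-off itself is a judgment call; reporting both values alongside the group sizes is probably more informative than either number alone.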

Next, Lighthouse looks at the language requirement and caseworkers' estimates of how quickly the person would find a job. They still include the number of roommates, which I exclude from now on because this is a small group and its results are already shown above.  

In [31]:
#variable_list = [[{"relatie_overig_historie_vorm__andere_inwonende":["3"]}],[{"persoonlijke_eigenschappen_taaleis_voldaan":["0","1"]}],
# [{"persoonlijke_eigenschappen_dagen_sinds_taaleis":["0","720"]}],[{"persoonlijke_eigenschappen_uitstroom_verw_vlgs_km":["1", "5"]}],
# [{"adres_recentste_wijk_delfshaven":["1"]}], [{"adres_recentste_wijk_stadscentru":["0"]}],[{"adres_recentste_wijk_charlois":["0"]}], 
# [{"adres_recentste_wijk_feijenoord":["0"]}], [{"adres_recentste_wijk_ijsselmonde":["0"]}], [{"adres_recentste_wijk_kralingen_c":["0"]}], 
# [{"adres_recentste_wijk_noord":["0"]}], [{"adres_recentste_wijk_other":["0"]}], [{"adres_recentste_wijk_prins_alexa":["0"]}],
#  [{"adres_recentste_buurt_nieuwe_westen":["0"]}], [{"adres_recentste_buurt_other":["1"]}], [{"adres_recentste_buurt_groot_ijsselmonde":["0"]}], 
# [{"adres_recentste_buurt_oude_noorden":["0"]}], [{"adres_recentste_buurt_vreewijk":["0"]}], [{"adres_recentst_onderdeel_rdam":["1"]}],
#  [{"adres_recentste_plaats_rotterdam":["1"]}], [{"adres_recentste_plaats_other":["0"]}]]

# Convert days since the language requirement into a categorical variable
# Define bins and labels: (-inf, 0] -> "0", (0, 719] -> "<720", (719, inf) -> "720+"
bins = [-np.inf,  0, 719, np.inf]
labels = ["0", "<720", "720+"]
df_aif["dagen_taaleis_cat"] = pd.cut(
    df_aif["persoonlijke_eigenschappen_dagen_sinds_taaleis"],
    bins=bins,
    labels=labels,
    right=True)

df_aif["dagen_taaleis_cat"].value_counts(dropna=False)


dagen_taaleis_cat
<720    10429
720+     2210
0           6
Name: count, dtype: int64

Only 6 cases have 0 days since the language requirement. How is that possible? Should this variable be left out of the analysis?

In [32]:
# only looking at language and delfshaven
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=["contacten_onderwerp_boolean_taaleis___voldoet","adres_recentste_wijk_delfshaven"],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)

Global EDF: 0.37013033861369804

Pairs that determine the EDF:
       group_i     group_j     sbr_i     sbr_j  n_total_i  n_pos_i  n_total_j  \
5   (0.0, 1.0)  (1.0, 1.0)  0.123403  0.085227     1251.0    154.0      439.0   
10  (1.0, 1.0)  (0.0, 1.0)  0.085227  0.123403      439.0     37.0     1251.0   

    n_pos_j  pos_ratio  neg_ratio  pair_edf  
5      37.0    0.37013   0.042628   0.37013  
10    154.0    0.37013   0.042628   0.37013  


In [33]:
# only looking at language and delfshaven
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=["adres_recentste_wijk_delfshaven","persoonlijke_eigenschappen_taaleis_schrijfv_ok"],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)

Global EDF: 0.6242054948920592

Pairs that determine the EDF:
      group_i     group_j     sbr_i     sbr_j  n_total_i  n_pos_i  n_total_j  \
4  (0.0, 1.0)  (1.0, 0.0)  0.073198  0.136643     4965.0    363.0      976.0   
7  (1.0, 0.0)  (0.0, 1.0)  0.136643  0.073198      976.0    133.0     4965.0   

   n_pos_j  pos_ratio  neg_ratio  pair_edf  
4    133.0   0.624205   0.070912  0.624205  
7    363.0   0.624205   0.070912  0.624205  


In [34]:
# now we also include the variable that describes the case worker's estimate of whether the person will find a job quickly
EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=["contacten_onderwerp_boolean_taaleis___voldoet","adres_recentste_wijk_delfshaven","persoonlijke_eigenschappen_uitstroom_verw_vlgs_km"],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)

Global EDF: 1.96123611953354

Pairs that determine the EDF:
             group_i          group_j     sbr_i     sbr_j  n_total_i  n_pos_i  \
236  (0.0, 1.0, 1.0)  (1.0, 1.0, 0.0)  0.500000  0.070342       10.0      5.0   
575  (1.0, 1.0, 0.0)  (0.0, 1.0, 1.0)  0.070342  0.500000      262.0     18.0   

     n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
236      262.0     18.0   1.961236   0.620208  1.961236  
575       10.0      5.0   1.961236   0.620208  1.961236  


Actually, I don't know how to interpret the difference between 0 and 1 for "persoonlijke_eigenschappen_uitstroom_verw_vlgs_km". The same goes for the comments. So I only look at the archetype elements that I know how to interpret.



1. The max archetype

[
    "persoon_geslacht_vrouw",
    "relatie_partner_totaal_dagen_partner",
    "relatie_kind_huidige_aantal",
    "relatie_kind_heeft_kinderen",
    "relatie_kind_leeftijd_verschil_ouder_eerste_kind",
    "relatie_kind_basisschool_kind",
    "relatie_overig_historie_vorm__andere_inwonende", > few cases with many roommates
    "persoonlijke_eigenschappen_taaleis_voldaan",
    "persoonlijke_eigenschappen_dagen_sinds_taaleis",
    "persoonlijke_eigenschappen_uitstroom_verw_vlgs_km",
    "adres_recentste_wijk_delfshaven",
    "adres_recentste_wijk_stadscentru",
    "adres_recentste_wijk_charlois",
    "adres_recentste_wijk_feijenoord",
    "adres_recentste_wijk_ijsselmonde",
    "adres_recentste_wijk_kralingen_c",
    "adres_recentste_wijk_noord",
    "adres_recentste_wijk_other",
    "adres_recentste_wijk_prins_alexa",
    "adres_recentste_buurt_nieuwe_westen",
    "adres_recentste_buurt_other",
    "adres_recentste_buurt_groot_ijsselmonde",
    "adres_recentste_buurt_oude_noorden",
    "adres_recentste_buurt_vreewijk",
    "adres_recentst_onderdeel_rdam",
    "adres_recentste_plaats_rotterdam",
    "adres_recentste_plaats_other"
]


In [35]:
df_aif["rel_kind_age_gap_cat"] = df_aif["relatie_kind_leeftijd_verschil_ouder_eerste_kind"].astype(float)

df_aif["rel_kind_age_gap_cat"] = pd.cut(
    df_aif["rel_kind_age_gap_cat"],
    bins=[-float("inf"), 3, float("inf")],
    labels=["≤3", ">3"]
)

df_aif["rel_kind_age_gap_cat"].value_counts()





rel_kind_age_gap_cat
>3    10966
≤3     1679
Name: count, dtype: int64

In [36]:
# The max archetype

EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=["persoon_geslacht_vrouw",
               "partner_cat",
               "kind_cat",      # other child variables are no longer needed
               "rel_kind_age_gap_cat",
               "basisschool_cat",
               "contacten_onderwerp_boolean_taaleis___voldoet",
               "dagen_taaleis_cat",
               "adres_recentste_wijk_delfshaven"
               ],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)


Global EDF: 4.774912960575186

Pairs that determine the EDF:
                                  group_i                            group_j  \
39900   (0, <360, 1+, ≤3, 1+, 1, <720, 1)    (0, 720+, 0, >3, 0, 1, 720+, 0)   
130019    (0, 720+, 0, >3, 0, 1, 720+, 0)  (0, <360, 1+, ≤3, 1+, 1, <720, 1)   
130355    (0, 720+, 0, >3, 0, 1, 720+, 0)   (1, <720, 0, ≤3, 1+, 1, 720+, 0)   
130428    (0, 720+, 0, >3, 0, 1, 720+, 0)  (1, <720, 1+, >3, 1+, 1, 720+, 1)   
130501    (0, 720+, 0, >3, 0, 1, 720+, 0)      (1, 720+, 1+, >3, 0, 0, 0, 0)   
130502    (0, 720+, 0, >3, 0, 1, 720+, 0)      (1, 720+, 1+, >3, 0, 0, 0, 1)   
233676   (1, <720, 0, ≤3, 1+, 1, 720+, 0)    (0, 720+, 0, >3, 0, 1, 720+, 0)   
275651  (1, <720, 1+, >3, 1+, 1, 720+, 1)    (0, 720+, 0, >3, 0, 1, 720+, 0)   
317626      (1, 720+, 1+, >3, 0, 0, 0, 0)    (0, 720+, 0, >3, 0, 1, 720+, 0)   
318201      (1, 720+, 1+, >3, 0, 0, 0, 1)    (0, 720+, 0, >3, 0, 1, 720+, 0)   

           sbr_i     sbr_j  n_total_i  n_pos_i  n_total_j  n_

2. Migrant worker - language 
[
    "relatie_overig_historie_vorm__andere_inwonende", > this yields very few groups; very few cases have more than 2 roommates.
    "persoonlijke_eigenschappen_taaleis_voldaan",
    "persoonlijke_eigenschappen_dagen_sinds_taaleis",
    "persoonlijke_eigenschappen_uitstroom_verw_vlgs_km",
    "adres_recentste_wijk_delfshaven",
    "adres_recentste_wijk_stadscentru",
    "adres_recentste_wijk_charlois",
    "adres_recentste_wijk_feijenoord",
    "adres_recentste_wijk_ijsselmonde",
    "adres_recentste_wijk_kralingen_c",
    "adres_recentste_wijk_noord",
    "adres_recentste_wijk_other",
    "adres_recentste_wijk_prins_alexa",
    "adres_recentste_buurt_nieuwe_westen",
    "adres_recentste_buurt_other",
    "adres_recentste_buurt_groot_ijsselmonde",
    "adres_recentste_buurt_oude_noorden",
    "adres_recentste_buurt_vreewijk",
    "adres_recentst_onderdeel_rdam",
    "adres_recentste_plaats_rotterdam",
    "adres_recentste_plaats_other"
]



In [37]:
# The migrant-language archetype

EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=[ "contacten_onderwerp_boolean_taaleis___voldoet",
               "dagen_taaleis_cat",
               "adres_recentste_wijk_delfshaven"
               ],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)

Global EDF: 3.427590193018562

Pairs that determine the EDF:
         group_i       group_j     sbr_i     sbr_j  n_total_i  n_pos_i  \
14     (0, 0, 1)  (0, 720+, 0)  0.750000  0.024349          1        1   
45  (0, 720+, 0)     (0, 0, 1)  0.024349  0.750000        882       21   

    n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
14        882       21    3.42759   1.361644   3.42759  
45          1        1    3.42759   1.361644   3.42759  




3. Mother

[
    "persoon_geslacht_vrouw",
    "relatie_partner_totaal_dagen_partner",
    "relatie_kind_huidige_aantal",
    "relatie_kind_heeft_kinderen",
    "relatie_kind_leeftijd_verschil_ouder_eerste_kind",
    "relatie_kind_basisschool_kind",
    "belemmering_dagen_financiele_problemen",
    "persoon_leeftijd_bij_onderzoek"
]


In [38]:
# The mother archetype

EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=[ "persoon_geslacht_vrouw",
                "partner_cat",
                "kind_cat",
                "rel_kind_age_gap_cat",
                "basisschool_cat",
                "fin_cat",
                "leeftijd_cat"
               ],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)


Global EDF: 5.174264717878057

Pairs that determine the EDF:
                               group_i                         group_j  \
99135   (0, <720, 0, ≤3, 1+, 0, 40-49)   (1, 720+, 0, >3, 0, 0, 50-59)   
455264   (1, 720+, 0, >3, 0, 0, 50-59)  (0, <720, 0, ≤3, 1+, 0, 40-49)   

           sbr_i     sbr_j  n_total_i  n_pos_i  n_total_j  n_pos_j  pos_ratio  \
99135   0.833333  0.004717          2        2        105        0   5.174265   
455264  0.004717  0.833333        105        0          2        2   5.174265   

        neg_ratio  pair_edf  
99135    1.787031  5.174265  
455264   1.787031  5.174265  


4. The bad Dutch archetype

[
    "persoonlijke_eigenschappen_spreektaal",
    "persoonlijke_eigenschappen_taaleis_voldaan",
    "persoonlijke_eigenschappen_dagen_sinds_taaleis",
    "afspraak_afgelopen_jaar_ontheffing_taaleis",
    "contacten_onderwerp_beoordelen_taaleis",
    "afspraak_verzenden_beschikking_i_v_m__niet_voldoen_aan_wet_taaleis",
    "persoonlijke_eigenschappen_taaleis_schrijfv_ok",
    "persoonlijke_eigenschappen_spreektaal_anders",
    "contacten_onderwerp_boolean_beoordelen_taaleis",
    "contacten_onderwerp_boolean_taaleis___voldoet",
    "belemmering_hist_taal",
    "persoonlijke_eigenschappen_nl_spreken1",
    "persoonlijke_eigenschappen_nl_spreken2",
    "persoonlijke_eigenschappen_nl_spreken3",
    "persoonlijke_eigenschappen_nl_begrijpen3",
    "persoonlijke_eigenschappen_nl_schrijven0",
    "persoonlijke_eigenschappen_nl_schrijven1",
    "persoonlijke_eigenschappen_nl_schrijven2",
    "persoonlijke_eigenschappen_nl_schrijven3",
    "persoonlijke_eigenschappen_nl_schrijvenfalse",
    "persoonlijke_eigenschappen_nl_lezen3",
    "persoonlijke_eigenschappen_nl_lezen4"
]


In [48]:
# The bad_dutch archetype

EDF, groups, pair_df, worst_pairs = smoothed_edf_explainer(
    df_aif,
    label="binary_pred",
    protected=[ "contacten_onderwerp_boolean_taaleis___voldoet",
               #"dagen_taaleis_cat",
               "persoonlijke_eigenschappen_taaleis_schrijfv_ok",  # combining variables that describe the same concept yields very small groups, so pick one
               "belemmering_hist_taal"],
    concentration=1.0
)

print("Global EDF:", EDF)
print("\nPairs that determine the EDF:")
print(worst_pairs)


== EDF between large groups (≥ 5.0% per group) ==
EDF_large = 0.457
            group_i          group_j  pair_edf  n_total_i  n_total_j  \
5   (0.0, 0.0, 0.0)  (1.0, 1.0, 0.0)  0.457302     5869.0     2678.0   
42  (1.0, 1.0, 0.0)  (0.0, 0.0, 0.0)  0.457302     2678.0     5869.0   

     share_i   share_j  
5   0.464136  0.211783  
42  0.211783  0.464136  
Global EDF: 1.340265866753188

Pairs that determine the EDF:
            group_i          group_j     sbr_i     sbr_j  n_total_i  n_pos_i  \
24  (0.0, 1.0, 1.0)  (1.0, 0.0, 0.0)  0.042857  0.163717      174.0      7.0   
31  (1.0, 0.0, 0.0)  (0.0, 1.0, 1.0)  0.163717  0.042857      564.0     92.0   

    n_total_j  n_pos_j  pos_ratio  neg_ratio  pair_edf  
24      564.0     92.0   1.340266   0.134985  1.340266  
31      174.0      7.0   1.340266   0.134985  1.340266  


Alternative version of the function that also reports which large groups show the biggest difference.

In [40]:
import numpy as np
import pandas as pd

def smoothed_base_rates_df(df, label, protected, concentration=1.0):
    """
    Replicates _smoothed_base_rates from AIF360,
    but returns a DataFrame with the groups and their smoothed base rate (sbr).
    """
    counts = (
        df.groupby(protected)[label]
          .agg(['sum', 'count'])
          .reset_index()
          .rename(columns={'sum': 'n_pos', 'count': 'n_total'})
    )

    K = df[label].nunique()  # usually 2
    alpha = concentration / K

    counts["sbr"] = (counts["n_pos"] + alpha) / (counts["n_total"] + concentration)

    return counts  # contains: protected cols, n_pos, n_total, sbr


def smoothed_edf_explainer(df, label, protected, concentration=1.0):
    # 1) smoothed base rates per group
    groups = smoothed_base_rates_df(df, label, protected, concentration)

    rows = []
    for i, gi in groups.iterrows():
        for j, gj in groups.iterrows():
            if i == j:
                continue

            p_i = gi["sbr"]
            p_j = gj["sbr"]

            # pos_ratio and neg_ratio as in AIF360
            pos = abs(np.log(p_i) - np.log(p_j))
            neg = abs(np.log(1 - p_i) - np.log(1 - p_j))

            pair_edf = max(pos, neg)

            rows.append({
                "group_i": tuple(gi[p] for p in protected),
                "group_j": tuple(gj[p] for p in protected),
                "sbr_i": p_i,
                "sbr_j": p_j,
                "n_total_i": gi["n_total"],
                "n_pos_i": gi["n_pos"],
                "n_total_j": gj["n_total"],
                "n_pos_j": gj["n_pos"],
                "pos_ratio": pos,
                "neg_ratio": neg,
                "pair_edf": pair_edf,
            })

    pair_df = pd.DataFrame(rows)

    # 2) global EDF = max over all pairs
    EDF = pair_df["pair_edf"].max()
    worst_pairs = pair_df[pair_df["pair_edf"] == EDF]

    # ---- NEW: also look at large groups only ----
    N = len(df)
    pair_df["share_i"] = pair_df["n_total_i"] / N
    pair_df["share_j"] = pair_df["n_total_j"] / N

    # choose the threshold yourself; here e.g. 5% of all rows per group
    min_group_frac = 0.05

    large_pairs = pair_df[
        (pair_df["share_i"] >= min_group_frac) &
        (pair_df["share_j"] >= min_group_frac)
    ]

    if not large_pairs.empty:
        EDF_large = large_pairs["pair_edf"].max()
        worst_large_pairs = large_pairs[large_pairs["pair_edf"] == EDF_large]

        print(f"\n== EDF between large groups (≥ {min_group_frac*100:.1f}% per group) ==")
        print(f"EDF_large = {EDF_large:.3f}")
        print(
            worst_large_pairs[
                ["group_i", "group_j", "pair_edf",
                 "n_total_i", "n_total_j", "share_i", "share_j"]
            ]
            .sort_values("pair_edf", ascending=False)
            .head(10)
        )
    else:
        print("\nNo pairs found in which both groups are larger than the chosen threshold.")

    return EDF, groups, pair_df, worst_pairs
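
The double `iterrows` loop above is quadratic in the number of groups, which becomes slow for the larger archetypes. If only the EDF value is needed, and not the table of pairs, it can be read off directly: the largest pairwise absolute difference of log-rates equals the maximum minus the minimum. Below is a minimal sketch of that shortcut; `edf_from_groups` and its `min_n` argument (an absolute minimum group size, unlike the 5% share used above) are hypothetical names of mine, and the toy counts are copied from earlier output only for illustration.

```python
import numpy as np
import pandas as pd

def edf_from_groups(groups, min_n=0):
    """EDF over groups with at least min_n rows; expects columns 'sbr' and 'n_total'."""
    kept = groups[groups["n_total"] >= min_n]
    if len(kept) < 2:
        return float("nan")  # no pair to compare
    sbr = kept["sbr"].to_numpy(dtype=float)
    log_p = np.log(sbr)      # log of the positive rate
    log_q = np.log1p(-sbr)   # log of the negative rate, log(1 - sbr)
    # max over pairs of |log p_i - log p_j| is simply max - min
    return float(max(log_p.max() - log_p.min(), log_q.max() - log_q.min()))

# Toy group table with the same smoothing as above (concentration=1.0, K=2)
toy = pd.DataFrame({"n_total": [5162, 4, 1251], "n_pos": [359, 1, 154]})
toy["sbr"] = (toy["n_pos"] + 0.5) / (toy["n_total"] + 1.0)

edf_all = edf_from_groups(toy)            # includes the group of 4
edf_big = edf_from_groups(toy, min_n=30)  # drops the group of 4
print(edf_all, edf_big)
```

This does not replace the explainer, because it does not tell which pair is responsible, but it is a quick way to check how much the EDF drops once tiny groups are excluded.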


Next, I compute statistical parity for intersectional groups.
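
To make explicit what a one-vs-rest statistical parity difference measures, here is a minimal pandas-only sketch on toy data. The function `one_vs_rest_spd` and the toy columns `sex`, `kids` and `pred` are hypothetical and not part of AIF360. I take the difference as the rate in the group minus the rate in everyone else; this sign convention is my assumption and may be the opposite of what the library returns.

```python
import pandas as pd

def one_vs_rest_spd(df, pred_col, prot, pos_label=1):
    """Selection rate of each intersectional group minus the rate of all other rows."""
    tmp = df[prot].copy()
    tmp["_sel"] = (df[pred_col] == pos_label).astype(int)

    stats = (
        tmp.groupby(prot)["_sel"]
           .agg(n_pos="sum", group_size="count")
           .reset_index()
    )
    # Counts for the complement of each group
    rest_size = len(tmp) - stats["group_size"]
    rest_pos = tmp["_sel"].sum() - stats["n_pos"]

    stats["rate_group"] = stats["n_pos"] / stats["group_size"]
    stats["rate_rest"] = rest_pos / rest_size
    stats["spd"] = stats["rate_group"] - stats["rate_rest"]

    # Largest absolute difference first
    return stats.sort_values("spd", key=lambda s: s.abs(), ascending=False)

toy = pd.DataFrame({
    "sex":  [0, 0, 0, 0, 1, 1, 1, 1],
    "kids": [0, 0, 1, 1, 0, 0, 1, 1],
    "pred": [0, 0, 0, 1, 0, 0, 1, 1],
})
spd_table = one_vs_rest_spd(toy, pred_col="pred", prot=["sex", "kids"])
print(spd_table)
```

Keeping `group_size` next to each difference makes it easy to see whether a large value comes from a group that is big enough to trust.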

In [41]:
from aif360.sklearn.metrics import (
    statistical_parity_difference,
    one_vs_rest,
)

prot = [
    "persoon_geslacht_vrouw",
    "kind_cat",
    "contacten_onderwerp_boolean_taaleis___voldoet",
    #"adres_recentste_wijk_delfshaven",
    "persoonlijke_eigenschappen_taaleis_schrijfv_ok"
]

# y: labels with the protected attributes in the index
y = df_aif.set_index(prot)["binary_pred"]

vals, groups = one_vs_rest(
    func=statistical_parity_difference,
    y_true=y,
    prot_attr=prot,
    return_groups=True,
    pos_label=0,
)

sp_intersection = pd.DataFrame({
    "group": groups,
    "spd": vals,
})

# unpack the group tuples into separate columns
sp_intersection[prot] = pd.DataFrame(
    sp_intersection["group"].tolist(),
    index=sp_intersection.index
)

sp_intersection = sp_intersection.drop(columns="group")

# ---- NEW: add group size ----
group_sizes = (
    df_aif
    .groupby(prot)
    .size()
    .reset_index(name="group_size")
)

sp_intersection = sp_intersection.merge(group_sizes, on=prot, how="left")

# sort by largest absolute SPD
sp_intersection = sp_intersection.reindex(
    sp_intersection["spd"].abs().sort_values(ascending=False).index
)

print(sp_intersection.head(10))


         spd  persoon_geslacht_vrouw kind_cat  \
14  0.098127                       1       1+   
11 -0.080794                       1        0   
4   0.075593                       0       1+   
10  0.072708                       1        0   
6   0.065833                       0       1+   
1  -0.058039                       0        0   
9  -0.053801                       1        0   
3  -0.053078                       0        0   
0  -0.048106                       0        0   
12  0.044046                       1       1+   

    contacten_onderwerp_boolean_taaleis___voldoet  \
14                                              1   
11                                              1   
4                                               0   
10                                              1   
6                                               1   
1                                               0   
9                                               0   
3                                   

Conditional statistical parity

In [43]:
import pandas as pd
from aif360.sklearn.metrics import selection_rate, statistical_parity_difference

def conditional_statistical_parity(df, pred_col, A, C, pos_label=1, min_size=30):
    """
    Compute Conditional Statistical Parity:
    P(Yhat | A=a_i, C=c) - P(Yhat | A=a_j, C=c)
    for each level of C.
    
    Parameters
    ----------
    df : pd.DataFrame
    pred_col : str
        Column with predicted labels (binary_pred)
    A : str
        Protected attribute (binary or categorical)
    C : str
        Conditioning variable
    pos_label : int
        Label considered 'selected' / 'negative' (usually 1 for risk scores)
    min_size : int
        Minimum group size per C level to be included
    
    Returns
    -------
    results : pd.DataFrame
        One row per value of C, with per-A-group selection rates and SPD.
    """
    
    results = []

    for c_val, df_c in df.groupby(C):

        # Skip tiny strata
        if len(df_c) < min_size:
            continue

        # selection rate per A group
        sel_rates = (
            df_c.groupby(A)[pred_col]
                .apply(lambda x: (x == pos_label).mean())
                .to_dict()
        )

        # compute SPD within this stratum (max difference between A groups)
        if len(sel_rates) > 1:
            spd = max(sel_rates.values()) - min(sel_rates.values())
        else:
            spd = None
        
        results.append({
            "C_value": c_val,
            "n_in_stratum": len(df_c),
            "selection_rates": sel_rates,
            "conditional_SPD": spd,
        })

    return pd.DataFrame(results)
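
The same per-stratum comparison can also be written as one grouped table, which I find easier to scan when A has more than two levels. This is a minimal sketch on toy data, assuming the same definition as above (largest minus smallest selection rate within each level of C). The name `csp_table` and the toy columns are hypothetical, and unlike the function above it does not skip small strata.

```python
import pandas as pd

def csp_table(df, pred_col, A, C, pos_label=1):
    """Rows: levels of C. Columns: levels of A plus the within-stratum rate gap."""
    selected = (df[pred_col] == pos_label).astype(float)
    # Mean of the 0/1 indicator per (C, A) cell is the selection rate
    rates = selected.groupby([df[C], df[A]]).mean().unstack(A)
    # Right-hand side is evaluated before the new column is added
    rates["conditional_SPD"] = rates.max(axis=1) - rates.min(axis=1)
    return rates

toy = pd.DataFrame({
    "C":    [1, 1, 1, 1, 2, 2, 2, 2],
    "A":    [0, 0, 1, 1, 0, 0, 1, 1],
    "pred": [1, 0, 1, 0, 0, 0, 1, 1],
})
table = csp_table(toy, pred_col="pred", A="A", C="C")
print(table)
```

In the toy data the two A groups have equal rates in stratum 1 and opposite rates in stratum 2, so the gap is zero in the first row and one in the second.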


In [44]:
res = conditional_statistical_parity(
    df=df_aif,
    pred_col="binary_pred",
    A="persoon_geslacht_vrouw",
    C="adres_aantal_verschillende_wijken",  # e.g.
    pos_label=1
)

print(res)


   C_value  n_in_stratum                                   selection_rates  \
0        1          3533  {0: 0.07726980038634901, 1: 0.07727272727272727}   
1        2          5485   {0: 0.09507042253521127, 1: 0.0948960302457467}   
2        3          2636  {0: 0.12247557003257328, 1: 0.12079927338782924}   
3        4           814               {0: 0.12753036437246965, 1: 0.1625}   
4        5           155  {0: 0.17307692307692307, 1: 0.21568627450980393}   

   conditional_SPD  
0         0.000003  
1         0.000174  
2         0.001676  
3         0.034970  
4         0.042609  


In [45]:
res = conditional_statistical_parity(
    df=df_aif,
    pred_col="binary_pred",
    A="contacten_onderwerp_boolean_taaleis___voldoet",
    C="adres_aantal_verschillende_wijken",  # e.g.
    pos_label=1
)

print(res)


   C_value  n_in_stratum                                   selection_rates  \
0        1          3533  {0: 0.08226007478188617, 1: 0.06660746003552398}   
1        2          5485  {0: 0.09735849056603774, 1: 0.08874172185430464}   
2        3          2636              {0: 0.12675350701402804, 1: 0.10625}   
3        4           814  {0: 0.13446676970633695, 1: 0.16766467065868262}   
4        5           155  {0: 0.18181818181818182, 1: 0.20588235294117646}   

   conditional_SPD  
0         0.015653  
1         0.008617  
2         0.020504  
3         0.033198  
4         0.024064  


In [46]:
res = conditional_statistical_parity(
    df=df_aif,
    pred_col="binary_pred",
    A="adres_recentste_wijk_delfshaven",
    C="adres_aantal_verschillende_wijken",  # e.g.
    pos_label=1
)

print(res)


   C_value  n_in_stratum                                   selection_rates  \
0        1          3533  {0: 0.07687120701281187, 1: 0.07936507936507936}   
1        2          5485  {0: 0.09127234490010515, 1: 0.11917808219178082}   
2        3          2636    {0: 0.11944803794739112, 1: 0.138801261829653}   
3        4           814  {0: 0.13811420982735723, 1: 0.18032786885245902}   
4        5           155  {0: 0.18439716312056736, 1: 0.21428571428571427}   

   conditional_SPD  
0         0.002494  
1         0.027906  
2         0.019353  
3         0.042214  
4         0.029889  


In [47]:


res = conditional_statistical_parity(
    df=df_aif,
    pred_col="binary_pred",
    A="persoon_geslacht_vrouw",
    C="belemmering_financiele_problemen",  # e.g.
    pos_label=1
)

print(res)


   C_value  n_in_stratum                                   selection_rates  \
0        0          8690  {0: 0.08493771234428087, 1: 0.08046783625730994}   
1        1          3955  {0: 0.13540197461212977, 1: 0.14113785557986872}   

   conditional_SPD  
0         0.004470  
1         0.005736  
