Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# [WIP] Exploring ML model-environment interactions

NOTE: This colab is a work in progress for an upcoming paper.

In this notebook, we adopt a **causal framework** to explore the impact of **model specification** choices on **algorithmic fairness over time**. Model specification refers to a series of choices that one makes when developing a predictive model, including: 

1. **Variable operationalization**:
    - Given a problem or predictive task stated in natural language, how can we map semantic concepts to features with known types that we can extract, collect, or approximate? 
    - Note that these mappings may be set-valued; for this reason, steps (1) and (2) are closely intertwined. 
2. **Variable selection**: 
    - Inclusion/exclusion of independent and dependent variable(s); identification of proxy variables.
3. **Functional form selection**:
    - What parametric or distributional assumptions can we make about how the dependent variable is related to each of the independent variable(s), given what we know or hypothesize about the data-generating process? 
    
As a motivating example, we consider the algorithmic decision-making task outlined in [Obermeyer et al. 2019](https://www.ftc.gov/system/files/documents/public_events/1548288/privacycon-2020-ziad_obermeyer.pdf). In this paper, the authors review a machine learning pipeline that has been deployed by multiple health systems to select a subset of patients to participate in care management programs, which have been empirically demonstrated to improve clinical outcomes and reduce costs. As the authors state in the paper's introduction, the underlying question when considering such constrained resource allocation tasks is how to select the subset of patients that will *derive greatest marginal benefit from participation*, relative to non-participation, subject to budget constraints. However, this type of intervention effect can be difficult to estimate in the absence of randomized controlled trial data. For this reason, the model's developers begin their model specification task by assuming that **need for care** is a suitable proxy for **marginal benefit associated with program participation**.

- $f: \texttt{marginal_benefit_of_program} \rightarrow \texttt{need_for_care}$

They proceed to identify three ways in which $\texttt{need_for_care}_t$ might be operationalized, represented by the set-valued function $g$ below:

- $g: \texttt{need_for_care} \rightarrow \{\texttt{cost}, \ \texttt{avoidable_cost}, \ \texttt{num_active_chronic_conditions}\}$

When it comes to each predictive model's **feature space**, the developers seek to select variables that represent each patient's $\texttt{sociodemographic_characteristics}$ and $\texttt{medical_history}_{0:t-1}$. These operationalization mappings are formalized below:

- $h: \texttt{sociodemographic_characteristics} \rightarrow \{ \texttt{gender, age_bucket, insurance_type, ...}\} \setminus \{\texttt{race}\}$

Note here that $\texttt{gender}$ is assumed to be both binary and time-invariant, and that age at timestep $t-1$ is used when mapping patients to discretized $\texttt{age_bucket}$s. Additionally, note that the model developers make the deliberate decision to exclude $\texttt{race}$, presumably in an effort to ensure "fairness through unawareness" (i.e., exclusion during training of the sensitive attribute(s) believed to be associated with disparate treatment).

- $j: \texttt{medical_history} \rightarrow \{\texttt{diagnoses, procedure codes, medications, costs_incurred}\}$

Note here that the operationalization of $\texttt{medical_history}$ is informed by the availability of longitudinal claims data containing these features. While real-world claims records will typically contain multiple years of observations for each patient, the synthetic dataset that the authors make publicly available contains only two timesteps; that is, features observed at time $t-1$ are used to predict the outcome variable(s) of interest at time $t$. Additionally, we note that many other operationalization choices may be possible, provided that corresponding datasets exist (e.g., medication adherence, exercise, vital signs, or patient-reported symptoms might be available through sensor data and/or mobile apps).
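
The mappings $f$, $g$, $h$, and $j$ described above can also be written down as plain data structures, which makes each specification choice explicit and easy to vary. The following is a minimal sketch rather than the authors' code: the dictionary layout and the helper `feature_space` are hypothetical, and the sociodemographic set lists only the variables named in the text.

```python
# Hypothetical encoding of the operationalization mappings; names follow the text above.
f = {"marginal_benefit_of_program": "need_for_care"}
g = {"need_for_care": {"cost", "avoidable_cost", "num_active_chronic_conditions"}}
h = {"sociodemographic_characteristics":
     {"gender", "age_bucket", "insurance_type", "race"} - {"race"}}
j = {"medical_history":
     {"diagnoses", "procedure_codes", "medications", "costs_incurred"}}


def feature_space(*mappings):
    """Union of all operationalized variables across the given mappings."""
    features = set()
    for mapping in mappings:
        for variables in mapping.values():
            features |= variables
    return features


# Features are observed at t-1; candidate labels are the operationalizations
# of the proxy concept (need for care) at t.
features_tm1 = feature_space(h, j)
candidate_labels_t = g[f["marginal_benefit_of_program"]]

print(sorted(features_tm1))
print(sorted(candidate_labels_t))
```

Swapping a different set into `g` or `h` is then a one-line change, which mirrors how each specification choice can be varied independently.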
 
$\texttt{LASSO}: (h \cup j)_{t-1} \rightarrow (g \circ f)_t$

Here, LASSO (or, more generally, regularization) serves as a form of variable selection given a high-dimensional and potentially sparse feature space.
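
To illustrate why an L1 penalty acts as variable selection, the sketch below fits a LASSO objective by cyclic coordinate descent on synthetic data. It is an illustration only: the notebook itself relies on scikit-learn's `LassoCV`, which chooses the penalty by cross-validation, whereas here the penalty `alpha=0.5`, the data-generating process, and the function names are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly zero."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ w
    for _ in range(n_iter):
        for k in range(p):
            # Correlation of feature k with the residual that ignores feature k's own contribution.
            rho = X[:, k] @ (residual + X[:, k] * w[k]) / n
            w_new = soft_threshold(rho, alpha) / col_sq[k]
            residual += X[:, k] * (w[k] - w_new)
            w[k] = w_new
    return w


# Synthetic data (an assumption for this example): only features 0 and 1 drive the outcome.
rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [k for k in range(p) if w[k] != 0.0]
print("selected feature indices:", selected)
print("coefficients:", np.round(w, 3))
```

The soft-thresholding step is what sets weakly correlated coefficients to exactly zero, and it also shrinks the retained coefficients toward zero relative to their unpenalized values.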

In [None]:
import os
import networkx as nx
import numpy as np
import pandas as pd
from typing import Optional
from itertools import product,chain
import causalnex
from math import ceil
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
DATA_DIR = os.path.join("..", "data")
df = pd.read_csv(os.path.join(DATA_DIR, "data_new.csv"))

In [None]:
# Select columns corresponding to predictive model features 
# - Taken from https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/code/model/features.py
# - Slightly modified syntax but preserves original functionality

def get_dem_features(df: pd.DataFrame, prefix: str = 'dem_', excl_race:bool=False) -> [str]:
    """Select sociodemographic features; 
        use excl_race flag to determine whether to keep or exclude race"""
    if excl_race:
        return [c for c in df.columns if c[:len(prefix)] == prefix and 'race' not in c]
    else:
        return [c for c in df.columns if c[:len(prefix)] == prefix]

def get_comorbidity_features(df: pd.DataFrame) -> [str]:
    """Select features related to patients' comorbidities at time t-1"""
    comorbidity_sum = 'gagne_sum_tm1'
    suffix_elixhauser = '_elixhauser_tm1'
    suffix_romano = '_romano_tm1'
    
    return [c for c in df.columns if c == comorbidity_sum or 
             suffix_elixhauser in c or suffix_romano in c]

def get_cost_features(df: pd.DataFrame, prefix='cost_') -> [str]:
    """Select features related to patients' incurred costs at time t-1;
        exclude features related to costs at time t"""
    return [c for c in df.columns if prefix == c[:len(prefix)] 
            and c not in ['cost_t', 'cost_avoidable_t']]

def get_lab_features(df: pd.DataFrame) -> [str]:
    """Select features related to patients' lab results at time t-1"""
    suffix_labs_counts = '_tests_tm1'
    suffix_labs_low = '-low_tm1'
    suffix_labs_high = '-high_tm1'
    suffix_labs_normal = '-normal_tm1'
    
    return [c for c in df.columns if np.any([suffix_labs_counts in c, suffix_labs_low in c, 
                                         suffix_labs_high in c, suffix_labs_normal in c])]

def get_med_features(df: pd.DataFrame, prefix='lasix_') -> [str]:
    """Select features related to patients' medications at time t-1.
        Note that the prefix they use returns only lasix meds, which are diuretics"""
    return [c for c in df.columns if c[:len(prefix)]==prefix]

def get_all_features(df: pd.DataFrame, verbose:bool=True) -> [str]:
    """Return a list of features representing the union over all available feature selection functions"""
    return list(chain(*[f(df) for f in [get_dem_features, get_comorbidity_features, 
                                          get_cost_features,get_lab_features, get_med_features]]))
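
The helpers above all follow the same pattern: pick columns by a naming convention, then drop the ones that must not enter the feature space. A minimal standalone sketch of that pattern is shown below on an empty toy frame. The helper `columns_with_prefix` is hypothetical (it is not part of the original repository), and the toy frame only reuses column names that appear elsewhere in this notebook.

```python
import pandas as pd


def columns_with_prefix(df, prefix, drop_if_contains=None, drop_names=()):
    """Return columns starting with `prefix`, minus substring and exact-name exclusions."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    if drop_if_contains is not None:
        cols = [c for c in cols if drop_if_contains not in c]
    return [c for c in cols if c not in drop_names]


# Toy frame with no rows; only the column names matter for selection.
toy = pd.DataFrame(columns=[
    "dem_female", "dem_race_black", "dem_age_band_18-24_tm1",
    "cost_emergency_tm1", "cost_t", "cost_avoidable_t", "gagne_sum_tm1",
])

# Sociodemographic features with race excluded.
dem_no_race = columns_with_prefix(toy, "dem_", drop_if_contains="race")

# Cost features at t-1 only: the year-t outcomes must not leak into the features.
cost_tm1 = columns_with_prefix(toy, "cost_", drop_names=("cost_t", "cost_avoidable_t"))

print(dem_no_race)
print(cost_tm1)
```

Excluding `cost_t` and `cost_avoidable_t` by exact name matters because they share the `cost_` prefix with legitimate t-1 features but are prediction targets.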
    



In [None]:
def round_to_nearest_hundred(x) -> int:
    """Round x up to a multiple of 100. Note: despite the name, this uses ceil, so 101 -> 200."""
    return int(ceil(x / 100.0)) * 100

def one_hot_encode_mean_score(df: pd.DataFrame, var: str, low_if_lt: float, high_if_gt: float) -> (pd.DataFrame, [str]):
    var_name = var.replace("_t", "")
    if var_name[:3] == 'ldl':
        var_name = var_name.replace("_mean", "-mean")
    
    df_t = pd.DataFrame()
    df_t['{}-low'.format(var_name)] = df[var].apply(lambda x: x < low_if_lt)
    df_t['{}-normal'.format(var_name)] = df[var].apply(lambda x: low_if_lt <= x <= high_if_gt)
    df_t['{}-high'.format(var_name)] = df[var].apply(lambda x: x > high_if_gt)

    assert np.max(np.sum(df_t, axis=1)) == 1
    
    df_tm1 = df[["{}-{}_tm1".format(x[0], x[1]) for x in  list(product([var_name], ["low", "normal", "high"]))]]
    df_tm1.columns = df_t.columns

    return pd.concat([df_tm1, df_t]), df_t.columns

def discretize_mean_score(df: pd.DataFrame, var:str, low_if_lt: float, high_if_gt: float) -> (pd.DataFrame, [str]):
    var_name = var.replace("_t", "")
    if var_name[:3] == 'ldl':
        var_name = var_name.replace("_mean", "-mean")
    
    df_t = pd.DataFrame()
    df_t[var_name] = df[var].apply(lambda x: "low" if x < low_if_lt else "normal" if low_if_lt <= x <= high_if_gt else "high" if x > high_if_gt else "unobs")
    
    # Copy the slice so that the column assignment below does not write into a view of `df`.
    df_tm1 = df[["{}-{}_tm1".format(x[0], x[1]) for x in  list(product([var_name], ["low", "normal", "high"]))]].copy()
    df_tm1.loc[:,var_name] = df_tm1.apply(lambda row: "low" if row["{}-low_tm1".format(var_name)] == 1 
                                          else "normal" if  row["{}-normal_tm1".format(var_name)] == 1 
                                          else "high" if row["{}-high_tm1".format(var_name)] == 1
                                          else "unobs",axis=1)
    
    merged_df = pd.concat([df_tm1[[var_name]], df_t])
    # Assign the column directly so that the categorical dtype is actually applied.
    merged_df[var_name] = merged_df[var_name].astype("category")
    
    return merged_df, merged_df.columns


def get_time_series_for_outcome_vars(df: pd.DataFrame, var:str) -> (pd.DataFrame, [str]):
    
    # Construct a time series for each of the outcome variables considered in the paper
    # (see https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/data/data_dictionary.md).

    # risk_score_t: Commercial algorithmic risk score prediction for cost in year t, formed using data from year t-1.
    #    risk_score_tm1 is NOT computable because we do not have (input) data for year t-2.

    # program_enrolled_t: Indicator for whether patient-year was enrolled in the program. For models, this is a function of the model and the percentile cutoff. In their data, this is the observed enrollment (i.e., based on the original algorithm; see pg. 6 of the paper).
    #    program_enrolled_tm1 is NOT computable because we don't have (input) data for year t-2. Note: we might choose to default to 0?
    # TODO: decide how to handle these two variables; until then, fail loudly instead of silently returning None.
    if var in ['risk_score_t', 'program_enrolled_t']:
        raise NotImplementedError("A time series for `{}` cannot be constructed without data for year t-2.".format(var))
    
    # gagne_sum_t: Total number of active chronic illnesses. gagne_sum_tm1 exists in the data and does not need to be computed.
    elif var == 'gagne_sum_t':
        return pd.concat([df['gagne_sum_tm1'], df[var]]), ['gagne_sum']
    
    # Cost_t: Total medical expenditures, rounded to the nearest 100. cost_tm1 IS computable by summing over all costs incurred in year t-1 and rounding to the nearest 100.
    elif var == 'cost_t':
        cost_features = get_cost_features(df, prefix='cost_')
        return pd.concat([df[cost_features].sum(axis=1).apply(lambda x: round_to_nearest_hundred(x)), df[var]]), ['cost']
    
    # Cost_avoidable_t: Total avoidable (emergency + inpatient) medical expenditures, rounded to nearest 100. 
    #     cost_avoidable_tm1 IS computable by sum('cost_emergency_tm1', 'cost_ip_medical_tm1', 'cost_ip_surgical_tm1') for year t-1 and rounding to the nearest 100.
    elif var == 'cost_avoidable_t':
        return pd.concat([df[['cost_emergency_tm1', 'cost_ip_medical_tm1', 'cost_ip_surgical_tm1']].sum(axis=1).apply(lambda x: round_to_nearest_hundred(x)), df[var]]), ['cost_avoidable']
    
    # Mean systolic blood pressure in year t. We don't have `bps_mean_tm1` but we do have an indicator variable `hypertension_elixhauser_tm1` for hypertension at t-1. 
    #    Per CDC guidance (https://www.cdc.gov/bloodpressure/about.htm), a person is considered to have high blood pressure (hypertension) w/ systolic blood pressure >= 130 mm Hg.
    #    So, we can binarize and concatenate.
    elif var == 'bps_mean_t':
        # Cast the boolean threshold result to int so that both timesteps share a 0/1 encoding.
        return pd.concat([df['hypertension_elixhauser_tm1'], (df[var] >= 130.0).astype(int)]), ['hypertension_elixhauser']
    
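    # Mean HbA1c in year t. We don't have `ghba1c_mean_tm1`, but we do have low/normal/high indicator columns at t-1, so we discretize the year-t mean to match.
    #    Let x = `mean GHbA1c test result`; then: low := x < 4; normal := 4 <= x <= 5.7; high := x > 5.7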
    elif var == 'ghba1c_mean_t':
        return discretize_mean_score(df=df, var=var, low_if_lt=4, high_if_gt=5.7)
    
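    # Mean hematocrit test result in year t. We don't have `hct_mean_tm1`, but we do have low/normal/high indicator columns at t-1, so we discretize the year-t mean to match.
    #    Let x = `mean hct test result`; then: low := x < 35.5; normal := 35.5 <= x <= 48.6; high := x > 48.6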
    elif var == 'hct_mean_t': 
        return discretize_mean_score(df=df, var=var, low_if_lt=35.5, high_if_gt=48.6)
    
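    # Mean creatinine test result in year t. We don't have `cre_mean_tm1`, but we do have low/normal/high indicator columns at t-1, so we discretize the year-t mean to match.
    #    Let x = `mean cre test result`; then: low := x < 0.84; normal := 0.84 <= x <= 1.21; high := x > 1.21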
    elif var == 'cre_mean_t':
        return discretize_mean_score(df=df, var=var, low_if_lt=0.84, high_if_gt=1.21)
    
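    # Mean LDL (low-density lipoprotein cholesterol) test result in year t. We don't have `ldl_mean_tm1`, but we do have low/normal/high indicator columns at t-1, so we discretize the year-t mean to match.
    #    Let x = `mean LDL test result`; then: low := x < 50; normal := 50 <= x <= 99; high := x > 99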
    elif var in ['ldl_mean_t']:
        return discretize_mean_score(df=df, var=var, low_if_lt=50, high_if_gt=99)
    
    else:
        raise ValueError("Variable `{}` is not currently supported. Supported variables include: gagne_sum_t, cost_t, cost_avoidable_t, bps_mean_t, ghba1c_mean_t, hct_mean_t, cre_mean_t, ldl_mean_t.".format(var))
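
The thresholding rule used for the lab values above (low if below a lower bound, high if above an upper bound, normal in between, and unobserved when the value is missing) can be sketched in vectorized form as follows. This is an illustrative alternative to the row-wise `apply` calls, not a drop-in replacement: the function name `discretize` is hypothetical, and the HbA1c bounds are the ones quoted in the comments above.

```python
import numpy as np
import pandas as pd


def discretize(values, low_if_lt, high_if_gt):
    """Map a numeric Series to 'low' / 'normal' / 'high', or 'unobs' when missing."""
    x = values.to_numpy(dtype=float)
    conditions = [
        x < low_if_lt,
        (x >= low_if_lt) & (x <= high_if_gt),
        x > high_if_gt,
    ]
    # Comparisons against NaN are all False, so missing values fall through to the default.
    labels = np.select(conditions, ["low", "normal", "high"], default="unobs")
    return pd.Series(labels, index=values.index)


ghba1c = pd.Series([3.5, 4.0, 5.7, 6.1, np.nan])
labels = discretize(ghba1c, low_if_lt=4, high_if_gt=5.7)
print(labels.tolist())
# ['low', 'normal', 'normal', 'high', 'unobs']
```

Both bounds are inclusive on the "normal" side, so a value exactly equal to either threshold is labeled normal.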

In [None]:
def create_long_df_obermeyer(wide_df: pd.DataFrame, t_start:int=0,outcome_vars: [str] = ['gagne_sum_t', 'cost_t', 'cost_avoidable_t', 'bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t'] ):
    
    df = pd.DataFrame()
    df['idx'] = wide_df.index
    df['race'] = wide_df['race'].astype("category")
    df['gender'] = wide_df['dem_female'].apply(lambda x: 'f' if x == 1 else 'm').astype("category")
    df['age_bucket'] = wide_df.apply(lambda x: '18-24' if x['dem_age_band_18-24_tm1']==1
                                     else '25-34' if x['dem_age_band_25-34_tm1']==1
                                     else '35-44' if x['dem_age_band_35-44_tm1']==1
                                     else '45-54' if x['dem_age_band_45-54_tm1']==1
                                     else '55-64' if x['dem_age_band_55-64_tm1']==1
                                     else '65-74' if x['dem_age_band_65-74_tm1']==1
                                     else 'geq_75' if x['dem_age_band_75+_tm1']==1
                                     else 'missing',axis=1).astype("category")
    df.loc[:,'timestep'] = t_start
    
    df_t = df.copy()
    df_t.loc[:,'timestep'] = t_start + 1
    long_df = pd.concat([df,df_t], axis=0)
    
    cols_df = pd.DataFrame()
    colnames = []
    
    for v in outcome_vars:
        time_series_col, cols = get_time_series_for_outcome_vars(df=wide_df, var=v)
        cols_df = pd.concat([cols_df, time_series_col], axis=1)
        colnames.extend(cols)

    cols_df.columns = colnames
    long_df = pd.concat([long_df, cols_df],axis=1)
    return long_df
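
The reshaping performed above stacks the t-1 observation and the t observation of each variable into one column, indexed by patient and timestep. A minimal sketch of that wide-to-long step for a single variable is given below. The helper `wide_to_long` and the toy values are assumptions made for illustration; only the `gagne_sum_tm1` / `gagne_sum_t` column names come from the dataset.

```python
import pandas as pd

# Toy wide-format data: one row per patient, one column per timestep.
wide = pd.DataFrame({"gagne_sum_tm1": [0, 2, 5], "gagne_sum_t": [1, 2, 4]})


def wide_to_long(wide_df, stem, t_start=0):
    """Stack `<stem>_tm1` and `<stem>_t` into one column with an explicit timestep."""
    frames = []
    for offset, suffix in enumerate(["_tm1", "_t"]):
        frames.append(pd.DataFrame({
            "idx": wide_df.index,
            "timestep": t_start + offset,
            stem: wide_df[stem + suffix].to_numpy(),
        }))
    # ignore_index=True gives the long frame a unique row index.
    return pd.concat(frames, ignore_index=True)


long_df = wide_to_long(wide, "gagne_sum")
print(long_df)
```

Using `ignore_index=True` avoids duplicate row labels in the long frame, which otherwise make later column-wise concatenation dependent on the two frames having identical indexes.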

In [None]:
# For reproducibility, imported from https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/code/model/util.py

"""
Utility functions.
"""
import pandas as pd
import numpy as np
import os
import git


def convert_to_log(df, col_name):
    """Convert column to log space.

    Defining log as log10(x + EPSILON) to avoid taking the log of zero.

    Parameters
    ----------
    df : pd.DataFrame
        Data dataframe.
    col_name : str
        Name of column in df to convert to log.

    Returns
    -------
    np.ndarray
        Values of column in log space

    """
    # This is to avoid taking np.log10 of zero, which evaluates to -inf
    EPSILON = 1
    return np.log10(df[col_name].values + EPSILON)


def convert_to_percentile(df, col_name):
    """Convert column to percentile.

    Parameters
    ----------
    df : pd.DataFrame
        Data dataframe.
    col_name : str
        Name of column in df to convert to percentile.

    Returns
    -------
    pd.Series
        Column converted to percentile from 1 to 100

    """
    return pd.qcut(df[col_name].rank(method='first'), 100,
                   labels=range(1, 101))


def get_git_dir():
    """Get directory where git repo is saved.

    Returns
    -------
    str
        Full path of git repo home.

    """
    repo = git.Repo('.', search_parent_directories=True)
    return repo.working_tree_dir


def create_dir(*args):
    """Create directory if it does not exist.

    Parameters
    ----------
    *args : str
        Path components, joined with `os.path.join`.

    Returns
    -------
    str
        Full path of directory.

    """
    fullpath = os.path.join(*args)

    # if path does not exist, create it
    if not os.path.exists(fullpath):
        os.makedirs(fullpath)

    return fullpath
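
The two transformations defined above, `log10(x + 1)` and rank-based percentiles, can be checked on a small example. The sketch below uses quartiles instead of 100 bins so the result can be read by eye. The toy cost values are assumptions, and the calls are written inline rather than through `convert_to_log` and `convert_to_percentile` so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Toy cost values (hypothetical), including a zero and ties.
toy = pd.DataFrame({"cost_t": [0, 9, 99, 99, 999, 999, 9999, 99999]})

# log10(x + 1): a zero cost maps to 0 instead of -inf.
log_cost = np.log10(toy["cost_t"].to_numpy() + 1)

# Ranking with method='first' breaks ties by order of appearance, so every row has a
# distinct rank and the quantile bin edges are unique even when values repeat.
ranks = toy["cost_t"].rank(method="first")
quartile = pd.qcut(ranks, 4, labels=[1, 2, 3, 4])

print(np.round(log_cost, 3))
print(quartile.astype(int).tolist())
```

One consequence of breaking ties by position is that rows with identical values can land in adjacent bins; in this toy example the tied rows happen to share a bin.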


In [None]:
# For reproducibility, imported from https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/code/model/model.py

"""
Functions for training model.
"""
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt


def split_by_id(df, id_field='ptid', frac_train=.6):
    """Split the df by id_field into train/holdout deterministically.

    Parameters
    ----------
    df : pd.DataFrame
        Data dataframe.
    id_field : str
        Split df by this column (e.g. 'ptid').
    frac_train : float
        Fraction assigned to train. (1 - frac_train) assigned to holdout.

    Returns
    -------
    pd.DataFrame
        Data dataframe with additional column 'split' indicating train/holdout

    """
    ptid = np.sort(df[id_field].unique())
    print("Splitting {:,} unique {}".format(len(ptid), id_field))

    # deterministic split
    rs = np.random.RandomState(0)
    perm_idx = rs.permutation(len(ptid))
    num_train = int(frac_train*len(ptid))

    # obtain train/holdout
    train_idx = perm_idx[:num_train]
    holdout_idx  = perm_idx[num_train:]
    ptid_train = ptid[train_idx]
    ptid_holdout  = ptid[holdout_idx]
    print(" ...splitting by patient: {:,} train, {:,} holdout ".format(
      len(ptid_train), len(holdout_idx)))

    # make dictionaries
    train_dict = {p: "train" for p in ptid_train}
    holdout_dict  = {p: "holdout"  for p in ptid_holdout}
    split_dict = {**train_dict, **holdout_dict}

    # add train/holdout split to each
    split = []
    for e in df[id_field]:
        split.append(split_dict[e])
    df['split'] = split

    return df


def get_split_predictions(df, split):
    """Get predictions for split (train/holdout).

    Parameters
    ----------
    df : pd.DataFrame
        Data dataframe.
    split : str
        Name of split (e.g. 'holdout')

    Returns
    -------
    pd.DataFrame
        Subset of df with value split.

    """
    pred_split_df = df[df['split'] == split]
    pred_split_df = pred_split_df.drop(columns=['split'])
    return pred_split_df


def build_formulas(y_col, outcomes):
    """Build regression formulas for each outcome (y) ~ y_col predictor (x).

    Parameters
    ----------
    y_col : str
        Algorithm training label.
    outcomes : list
        All outcomes of interest.

    Returns
    -------
    list
        List of all regression formulas.

    """
    if 'risk_score' in y_col:
        predictors = ['risk_score_t']
    else:
        predictors = ['{}_hat'.format(y_col)]

    # build all y ~ x formulas
    all_formulas = []
    for y in outcomes:
        for x in predictors:
            formula = '{} ~ {}'.format(y, x)
            all_formulas.append(formula)
    return all_formulas


def get_r2_df(df, formulas):
    """Fit one OLS regression per formula and collect the R^2 values.

    Parameters
    ----------
    df : pd.DataFrame
        Holdout dataframe.
    formulas : list
        List of regression formulas.

    Returns
    -------
    pd.DataFrame
        DataFrame of formula (y ~ x), holdout_r2, holdout_obs.

    """
    import statsmodels.formula.api as smf
    r2_list = []

    # run all OLS regressions
    for formula in formulas:
        model = smf.ols(formula, data=df)
        results = model.fit()
        r2_dict = {'formula (y ~ x)': formula,
                   'holdout_r2': results.rsquared,
                   'holdout_obs': results.nobs}
        r2_list.append(r2_dict)
    return pd.DataFrame(r2_list)


def train_lasso(train_df, holdout_df,
                x_column_names,
                y_col,
                outcomes,
                n_folds=10,
                include_race=False,
                plot=False,
                output_dir=None):
    """Train LASSO model and get predictions for holdout.

    Parameters
    ----------
    train_df : pd.DataFrame
        Train dataframe.
    holdout_df : pd.DataFrame
        Holdout dataframe.
    x_column_names : list
        List of column names to use as features.
    y_col : str
        Name of y column (label) to predict.
    outcomes : list
        All labels (Y) to predict.
    n_folds : int
        Number of folds for cross validation.
    include_race : bool
        Whether to include the race variable as a feature (X).
    plot : bool
        Whether to save the mean square error (MSE) plots.
    output_dir : str
        Path where to save results.

    Returns
    -------
    r2_df : pd.DataFrame
        DataFrame of formula (y ~ x), holdout_r2, holdout_obs.
    pred_df : pd.DataFrame
        DataFrame of all predictions (train and holdout).
    lasso_coef_df : pd.DataFrame
        DataFrame of lasso coefficients.

    """
    if not include_race:
        # remove the race variable
        x_cols = [x for x in x_column_names if x != 'race']
    else:
        # include the race variable
        if 'race' not in x_column_names:
            x_cols = x_column_names + ['race']
        else:
            x_cols = x_column_names

    # split X and y
    train_X = train_df[x_cols]
    train_y = train_df[y_col]

    # define cross validation (CV) generator
    # separate at the patient level
    from sklearn.model_selection import GroupKFold
    group_kfold = GroupKFold(n_splits=n_folds)
    # for the synthetic data, we split at the observation level ('index')
    group_kfold_generator = group_kfold.split(train_X, train_y,
                                              groups=train_df['index'])
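    # train lasso cv model
    # NOTE: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this call requires an older scikit-learn release.
    from sklearn.linear_model import LassoCV
    n_alphas = 100
    lasso_cv = LassoCV(
                       n_alphas=n_alphas,
                       cv=group_kfold_generator,
                       random_state=0,
                       max_iter=10000,
                       fit_intercept=True,
                       normalize=True)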
    lasso_cv.fit(train_X, train_y)
    alpha = lasso_cv.alpha_
    train_r2 = lasso_cv.score(train_X, train_y)
    train_nobs = len(train_X)

    # plot
    if plot:
        plt.figure()
        alphas = lasso_cv.alphas_

        for i in range(n_folds):
            plt.plot(alphas, lasso_cv.mse_path_[:, i], ':', label='fold {}'.format(i))
        plt.plot(alphas, lasso_cv.mse_path_.mean(axis=-1), 'k',
                 label='Average across the folds', linewidth=2)
        plt.axvline(lasso_cv.alpha_, linestyle='--', color='k',
                    label='alpha: CV estimate')

        plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

        plt.xlabel(r'$\alpha$')
        plt.ylabel('MSE')
        plt.title('Mean square error (MSE) on each fold predicting {}'.format(y_col))
        plt.xscale('log')

        if include_race:
            filename = 'model_lasso_{}_race.png'.format(y_col)
        else:
            filename = 'model_lasso_{}.png'.format(y_col)
        output_dir = create_dir(output_dir)
        output_filepath = os.path.join(output_dir, filename)
        plt.savefig(output_filepath, bbox_inches='tight', dpi=500)

    # lasso coefficients
    coef_col_name = '{}_race_coef'.format(y_col) if include_race else '{}_coef'.format(y_col)
    lasso_coef_df = pd.DataFrame({coef_col_name: lasso_cv.coef_}, index=train_X.columns)

    # number of lasso features
    original_features = len(x_cols)
    n_features = len(lasso_coef_df)

    def predictions_df(x_vals, y_col, split):
        """Predict y for x_vals and convert the predictions to percentiles.

        Parameters
        ----------
        x_vals : pd.DataFrame
            DataFrame of all X values.
        y_col : str
            Name of y column (label) to predict.
        split : str
            Name of split (e.g. 'holdout').

        Returns
        -------
        pd.DataFrame
            DataFrame with 'y_hat' (prediction), 'y_hat_percentile', 'split'

        """
        y_hat = lasso_cv.predict(x_vals)
        y_hat_col = '{}_hat'.format(y_col)
        y_hat_df = pd.DataFrame(y_hat, columns=[y_hat_col])
        y_hat_percentile = convert_to_percentile(y_hat_df, y_hat_col)

        # include column for y_hat percentile
        y_hat_percentile_df = pd.DataFrame(y_hat_percentile)
        y_hat_percentile_df.columns = ['{}_hat_percentile'.format(y_col)]

        pred_df = pd.concat([y_hat_df, y_hat_percentile_df], axis=1)
        pred_df['split'] = split

        return pred_df

    # predict in train
    train_df_pred = predictions_df(train_X, y_col, 'train')

    # predict in holdout
    holdout_X = holdout_df[x_cols]
    holdout_df_pred = predictions_df(holdout_X, y_col, 'holdout')

    # predictions
    pred_df = pd.concat([train_df_pred, holdout_df_pred])

    # r2
    holdout_Y_pred = pd.concat([holdout_df[outcomes], holdout_df_pred], axis=1)
    formulas = build_formulas(y_col, outcomes)
    r2_df = get_r2_df(holdout_Y_pred, formulas)

    return r2_df, pred_df, lasso_coef_df
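
The train/holdout assignment in `split_by_id` is deterministic because the IDs are sorted before a seeded permutation is applied. The sketch below isolates that idea. The function `split_ids` and the toy ID range are hypothetical; the seed of 0 and the 0.6 default mirror the values used above.

```python
import numpy as np


def split_ids(ids, frac_train=0.6, seed=0):
    """Deterministically split unique IDs into train and holdout arrays."""
    ids = np.sort(np.unique(ids))
    perm = np.random.RandomState(seed).permutation(len(ids))
    num_train = int(frac_train * len(ids))
    return ids[perm[:num_train]], ids[perm[num_train:]]


patient_ids = np.arange(100, 110)
train_ids, holdout_ids = split_ids(patient_ids)

# Reversing the input order does not change the split, because IDs are sorted first.
train_again, holdout_again = split_ids(patient_ids[::-1])

print("train:", train_ids)
print("holdout:", holdout_ids)
```

Splitting on IDs rather than rows keeps all observations of one patient on the same side of the split, which matters when a patient contributes more than one row.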


In [None]:
#### For reproducibility, imported from https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/code/model/main.py

"""
Main script to train lasso model and save predictions.
"""
import pandas as pd
import numpy as np
import os



def load_data_df():
    """Load data dataframe.

    Returns
    -------
    pd.DataFrame
        DataFrame to use for analysis.

    """
    # define filepath
    #git_dir = get_git_dir()
    #data_fp = os.path.join(git_dir, 'data', 'data_new.csv')
    data_fp = os.path.join(DATA_DIR, "data_new.csv")

    # load df
    data_df = pd.read_csv(data_fp)

    # the patient identifier was removed from this dataset, so expose the row index as an 'index' column to use as the ID
    data_df = data_df.reset_index()
    return data_df


def get_Y_x_df(df, verbose):
    """Get dataframe with relevant x and Y columns.

    Parameters
    ----------
    df : pd.DataFrame
        Data dataframe.
    verbose : bool
        Print statistics of features.

    Returns
    -------
    all_Y_x_df : pd.DataFrame
        Dataframe with x (features) and y (labels) columns
    x_column_names : list
        List of all x column names (features).
    Y_predictors : list
        All labels (Y) to predict.

    """
    # cohort columns
    cohort_cols = ['index']

    # features (x)
    x_column_names = get_all_features(df, verbose)

    # include log columns
    df['log_cost_t'] = convert_to_log(df, 'cost_t')
    df['log_cost_avoidable_t'] = convert_to_log(df, 'cost_avoidable_t')

    # labels (Y) to predict
    Y_predictors = ['log_cost_t', 'gagne_sum_t', 'log_cost_avoidable_t']

    # redefine 'race' variable as indicator
    df['dem_race_black'] = np.where(df['race'] == 'black', 1, 0)

    # additional metrics used for table 2 and table 3
    table_metrics = ['dem_race_black', 'risk_score_t', 'program_enrolled_t',
                     'cost_t', 'cost_avoidable_t']

    # combine all features together -- this forms the Y_x df
    all_Y_x_df = df[cohort_cols + x_column_names + Y_predictors + table_metrics].copy()

    return all_Y_x_df, x_column_names, Y_predictors


def main():
    # load data
    data_df = load_data_df()

    # subset to relevant columns
    all_Y_x_df, x_column_names, Y_predictors = get_Y_x_df(data_df, verbose=True)

    # assign to 2/3 train, 1/3 holdout
    all_Y_x_df = split_by_id(all_Y_x_df, id_field='index',
                                   frac_train=.67)

    # define train, holdout
    # reset_index for pd.concat() along column
    train_df = all_Y_x_df[all_Y_x_df['split'] == 'train'].reset_index(drop=True)
    holdout_df = all_Y_x_df[all_Y_x_df['split'] == 'holdout'].reset_index(drop=True)

    # define output dir to save results
    #git_dir = util.get_git_dir()
    OUTPUT_DIR = create_dir(os.path.join(DATA_DIR, 'results'))

    # define parameters
    include_race = False
    n_folds = 10
    save_plot = False
    save_r2 = True

    # train model with Y = 'log_cost_t'
    log_cost_r2_df, \
    pred_log_cost_df, \
    log_cost_lasso_coef_df = train_lasso(train_df,
                                               holdout_df,
                                               x_column_names,
                                               y_col='log_cost_t',
                                               outcomes=Y_predictors,
                                               n_folds=n_folds,
                                               include_race=include_race,
                                               plot=save_plot,
                                               output_dir=OUTPUT_DIR)

    # train model with Y = 'gagne_sum_t'
    gagne_sum_t_r2_df, \
    pred_gagne_sum_t_df, \
    gagne_sum_t_lasso_coef_df = train_lasso(train_df,
                                                  holdout_df,
                                                  x_column_names,
                                                  y_col='gagne_sum_t',
                                                  outcomes=Y_predictors,
                                                  n_folds=n_folds,
                                                  include_race=include_race,
                                                  plot=save_plot,
                                                  output_dir=OUTPUT_DIR)

    # train model with Y = 'log_cost_avoidable_t'
    log_cost_avoidable_r2_df, \
    pred_log_cost_avoidable_df, \
    log_cost_avoidable_lasso_coef_df = train_lasso(train_df,
                                                         holdout_df,
                                                         x_column_names,
                                                         y_col='log_cost_avoidable_t',
                                                         outcomes=Y_predictors,
                                                         n_folds=n_folds,
                                                         include_race=include_race,
                                                         plot=save_plot,
                                                         output_dir=OUTPUT_DIR)

    if save_r2:
        formulas = build_formulas('risk_score_t', outcomes=Y_predictors)
        risk_score_r2_df = get_r2_df(holdout_df, formulas)

        r2_df = pd.concat([risk_score_r2_df,
                           log_cost_r2_df,
                           gagne_sum_t_r2_df,
                           log_cost_avoidable_r2_df])

        # save r2 file CSV
        if include_race:
            filename = 'model_r2_race.csv'
        else:
            filename = 'model_r2.csv'
        output_filepath = os.path.join(OUTPUT_DIR, filename)
        print('...writing to {}'.format(output_filepath))
        r2_df.to_csv(output_filepath, index=False)

    def get_split_predictions(df, split):
        pred_split_df = df[df['split'] == split]
        pred_split_df = pred_split_df.drop(columns=['split'])
        return pred_split_df

    # get holdout predictions
    holdout_log_cost_df = get_split_predictions(pred_log_cost_df,
                                                split='holdout')
    holdout_gagne_sum_t_df = get_split_predictions(pred_gagne_sum_t_df,
                                                   split='holdout')
    holdout_log_cost_avoidable_df = get_split_predictions(pred_log_cost_avoidable_df,
                                                          split='holdout')

    holdout_pred_df = pd.concat([holdout_df, holdout_log_cost_df,
                                 holdout_gagne_sum_t_df,
                                 holdout_log_cost_avoidable_df], axis=1)
    
    print(holdout_pred_df.columns, "log_cost_t" in holdout_pred_df.columns)

    holdout_pred_df_subset = holdout_pred_df[['index', 'split', 'dem_race_black',
                                              'gagne_sum_t',
                                              'cost_t', 'log_cost_t', 'cost_avoidable_t', 'log_cost_avoidable_t',
                                              'program_enrolled_t',
                                              'risk_score_t', #ytrue
                                              'log_cost_t_hat', 'log_cost_t_hat_percentile', #yhat_a
                                              'gagne_sum_t_hat', 'gagne_sum_t_hat_percentile', #yhat_b
                                              'log_cost_avoidable_t_hat', 'log_cost_avoidable_t_hat_percentile']].copy()  #yhat_c

    # add risk_score_percentile column
    holdout_pred_df_subset['risk_score_t_percentile'] = \
        convert_to_percentile(holdout_pred_df_subset, 'risk_score_t')

    # save to CSV
    if include_race:
        filename = 'model_lasso_predictors_race.csv'
    else:
        filename = 'model_lasso_predictors.csv'
    output_filepath = os.path.join(OUTPUT_DIR, filename)
    print('...HOLDOUT PREDICTIONS saved to {}'.format(output_filepath))
    holdout_pred_df_subset.to_csv(output_filepath, index=False)
    #print(holdout_pred_df_subset.head())
    return holdout_pred_df_subset
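
The `*_percentile` columns above come from `convert_to_percentile`, whose body is not shown in this section. As a rough sketch only, the function below is a hypothetical stand-in, not the notebook's helper. It assumes the helper ranks values within a column on a 0-100 scale, which would be consistent with the later division by 100; the name `to_percentile_sketch`, the flooring, and the cap at 99 are all assumptions.

```python
import numpy as np
import pandas as pd


def to_percentile_sketch(df: pd.DataFrame, col: str) -> pd.Series:
    """Hypothetical stand-in for convert_to_percentile (an assumption)."""
    # rank(pct=True) gives fractional ranks in (0, 1]; ties share the average rank.
    pct = df[col].rank(pct=True, method="average")
    # Scale to 0-100, floor into integer bins, and cap the top bin at 99 (assumed convention).
    return np.floor(pct * 100).clip(upper=99)


scores = pd.DataFrame({"risk_score_t": [0.1, 0.4, 0.2, 0.9]})
scores["risk_score_t_percentile"] = to_percentile_sketch(scores, "risk_score_t")
```

Whether the real helper floors, rounds, or caps the top bin changes which patients sit exactly on a threshold, so this is worth checking against the actual implementation.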


In [None]:
def build_model_long_df(long_df: pd.DataFrame,
                        hdf: pd.DataFrame,
                        t_start: int = 0,
                        ref_threshold: float = 0.55,
                        enroll_threshold: float = 0.97,
                        logged_dvs: tuple = ('cost_t', 'cost_avoidable_t')):
    
    mdf = pd.DataFrame()
    holdout_ldf = pd.merge(hdf[['index']], long_df[long_df.timestep==t_start], left_on='index', right_on='idx', how='inner')
    
    for model in ['lasso_log_cost_t', 'lasso_log_cost_avoidable_t', 'lasso_gagne_sum_t']:

        dv = model.split("lasso_")[1]
        log_dv = dv.replace("log_","") in logged_dvs
        
        temp = pd.DataFrame()
        temp['idx'] = hdf['index'].copy()
        temp['split'] = hdf['split'].copy()
        temp['model_name'] = model
        temp['ref_threshold'] = ref_threshold # todo: make these options from lists/maybe model-specific 
        temp['enroll_threshold'] = enroll_threshold
        temp['dv'] = model if model == "status_quo" else dv
        temp['timestep']= t_start
        # true outcome at t_start, on the log scale when the model's DV is logged
        base_col = dv.replace("log_", "").replace("_t", "")
        holdout_t0 = holdout_ldf[holdout_ldf['timestep'] == t_start]
        temp['ytrue'] = (convert_to_log(holdout_t0, base_col).copy() if log_dv
                         else holdout_t0[base_col].copy())
        #temp['log_ytrue'] = convert_to_log(holdout_ldf[holdout_ldf['timestep'] == t_start], dv.replace("log_", "").replace("_t", "")).copy() if log_dv  else np.nan
        temp['yhat'] = 0 #no yhat at t=0
        #temp['log_yhat'] = np.nan #no yhat at t=0
        temp['log_dv_flag'] = log_dv
        temp['yhat_percentile'] = 0 #no yhat at t=0
        temp['decision'] = "none" # no decision at t= 0 
        temp['program_enrolled'] = "none"
        temp['sq_vs_decision'] = "none" # no decision at t=0

        temp_t1 = temp.copy()
        temp_t1.loc[:,'timestep'] += 1
        temp_t1['ytrue'] = hdf[dv].copy()
        #temp_t1['log_ytrue'] = np.nan if dv not in logged_dvs else hdf["{}".format(dv)]
        temp_t1['yhat'] = hdf["{}_hat".format(dv)]
        #temp_t1['log_yhat'] = np.nan if  dv.replace("log_","") not in logged_dvs else hdf["{}_hat".format(dv)]
        temp_t1['log_dv_flag'] = log_dv
        temp_t1['yhat_percentile'] = hdf["{}_hat_percentile".format(dv)].astype(float) /100
        # NOTE: a NaN percentile fails both comparisons and therefore falls through to "enrolled"
        temp_t1['decision'] = temp_t1['yhat_percentile'].apply(lambda x: "none" if x < ref_threshold
                                                               else "referred" if ref_threshold <= x < enroll_threshold
                                                               else "enrolled")
        
        temp_t1['program_enrolled'] = hdf['program_enrolled_t'].apply(lambda x: "enrolled" if x == 1 else "not enrolled")
        
        temp_t1['sq_vs_decision'] = hdf['program_enrolled_t'].apply(lambda x: "enrolled" if x == 1 else "none/ref") + temp_t1['decision'].apply(lambda x: "_{}".format(x))
        
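        # append this model's t_start and t_start + 1 rows to the running result
        # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
        mdf = pd.concat([mdf, temp, temp_t1])

    # merge against the long_df argument rather than the notebook-level `ldf`
    long_df_for_graph = pd.merge(long_df, mdf, on=['idx', 'timestep'], how='inner')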
    return mdf, long_df_for_graph
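
The decision rule inside `build_model_long_df` maps a predicted percentile to "none", "referred", or "enrolled" using two cut-offs. The following is a minimal vectorized sketch of the same rule, not the notebook's implementation. The function name `assign_decision` and the explicit "missing" label for NaN percentiles are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd


def assign_decision(pct: pd.Series,
                    ref_threshold: float = 0.55,
                    enroll_threshold: float = 0.97) -> pd.Series:
    """Map percentiles in [0, 1] to a decision label; NaN becomes 'missing'."""
    conditions = [
        pct.isna().to_numpy(),
        (pct < ref_threshold).to_numpy(),
        (pct < enroll_threshold).to_numpy(),
    ]
    choices = ["missing", "none", "referred"]
    # np.select uses the first condition that matches, so the order matters.
    labels = np.select(conditions, choices, default="enrolled")
    return pd.Series(labels, index=pct.index)


example = pd.Series([0.10, 0.55, 0.96, 0.97, np.nan])
decisions = assign_decision(example)
```

Handling NaN as its own label keeps patients without a prediction from being silently grouped with the enrolled set.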
    

In [None]:
ldf = create_long_df_obermeyer(wide_df = df.copy())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tm1.loc[:,var_name] = df_tm1.apply(lambda row: "low" if row["{}-low_tm1".format(var_name)] == 1
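
The warning above is raised when values are assigned into a DataFrame that pandas regards as a slice of another frame. Since `create_long_df_obermeyer` is not shown in this section, the sketch below uses toy data with hypothetical column names (`bps-low_tm1`, `bps`). It illustrates one common remedy: take an explicit `.copy()` of the subset before assigning to it.

```python
import pandas as pd

wide = pd.DataFrame({
    "timestep": [0, 0, 1],
    "bps-low_tm1": [1, 0, 0],  # hypothetical indicator column
})

# An explicit copy makes the subset an independent frame, so the assignment
# below cannot be mistaken for a write into `wide`.
df_tm1 = wide[wide["timestep"] == 0].copy()
df_tm1["bps"] = df_tm1["bps-low_tm1"].map({1: "low", 0: "not low"})
```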

In [None]:
hdf = main()
_, long_df_for_graph = build_model_long_df(long_df=ldf, hdf=hdf)

Splitting 48,784 unique index
 ...splitting by patient: 32,685 train, 16,099 holdout 
...writing to ../data/results/model_r2.csv
Index(['index', 'dem_female', 'dem_age_band_18-24_tm1',
       'dem_age_band_25-34_tm1', 'dem_age_band_35-44_tm1',
       'dem_age_band_45-54_tm1', 'dem_age_band_55-64_tm1',
       'dem_age_band_65-74_tm1', 'dem_age_band_75+_tm1',
       'alcohol_elixhauser_tm1',
       ...
       'program_enrolled_t', 'cost_t', 'cost_avoidable_t', 'split',
       'log_cost_t_hat', 'log_cost_t_hat_percentile', 'gagne_sum_t_hat',
       'gagne_sum_t_hat_percentile', 'log_cost_avoidable_t_hat',
       'log_cost_avoidable_t_hat_percentile'],
      dtype='object', length=165) True
...HOLDOUT PREDICTIONS saved to ../data/results/model_lasso_predictors.csv


In [None]:
struct_data = long_df_for_graph.copy()
struct_data = struct_data.drop(['idx'],axis=1)
non_numeric_columns = list(struct_data.select_dtypes(exclude=[np.number]).columns)

In [None]:


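from sklearn.preprocessing import LabelEncoder

# NOTE: the same encoder is refit for every column, so only the last column's
# classes_ remain available on `le` after the loop
le = LabelEncoder()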

for col in non_numeric_columns:
    struct_data[col] = le.fit_transform(struct_data[col])

# display only: set_index returns a new frame, so struct_data itself is unchanged
struct_data.set_index('timestep')

timestep,race,gender,age_bucket,gagne_sum,cost,cost_avoidable,hypertension_elixhauser,ghba1c_mean,hct_mean,cre_mean,...,ref_threshold,enroll_threshold,dv,ytrue,yhat,log_dv_flag,yhat_percentile,decision,program_enrolled,sq_vs_decision
0,1,0,1,0,0.0,0.0,0,3,3,3,...,0.55,0.97,2,0.000000,0.000000,1,0.00,1,1,3
0,1,0,1,0,0.0,0.0,0,3,3,3,...,0.55,0.97,1,0.000000,0.000000,1,0.00,1,1,3
0,1,0,1,0,0.0,0.0,0,3,3,3,...,0.55,0.97,0,0.000000,0.000000,0,0.00,1,1,3
0,1,0,5,2,15300.0,9300.0,1,0,1,2,...,0.55,0.97,2,4.184720,0.000000,1,0.00,1,1,3
0,1,0,5,2,15300.0,9300.0,1,0,1,2,...,0.55,0.97,1,3.968530,0.000000,1,0.00,1,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,1,0,4,3,24200.0,0.0,0,3,3,3,...,0.55,0.97,1,0.000000,0.928810,1,0.70,2,2,6
1,1,0,4,3,24200.0,0.0,0,3,3,3,...,0.55,0.97,0,3.000000,1.814359,0,0.74,2,2,6
1,1,1,1,0,1700.0,0.0,0,3,3,3,...,0.55,0.97,2,3.230704,2.822738,1,0.03,1,2,5
1,1,1,1,0,1700.0,0.0,0,3,3,3,...,0.55,0.97,1,0.000000,0.499221,1,0.21,1,2,5
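
The integer codes in the table above can only be read back if the code-to-label mapping for each column is kept. The sketch below is a pandas-only alternative to `LabelEncoder`, not the notebook's method. It stores one lookup per column. The toy frame and the names `code_to_label` and `encoded` are assumptions. Categories are sorted, which gives the same ordering as `LabelEncoder` for string labels.

```python
import pandas as pd

toy = pd.DataFrame({
    "decision": ["none", "referred", "enrolled", "none"],
    "race": ["white", "black", "white", "black"],
})

code_to_label = {}  # per-column lookup so that codes can be decoded later
encoded = toy.copy()
for col in toy.select_dtypes(exclude="number").columns:
    categorical = toy[col].astype("category")
    # categories are sorted, so code 0 is the alphabetically first label
    code_to_label[col] = dict(enumerate(categorical.cat.categories))
    encoded[col] = categorical.cat.codes
```

With the lookup in hand, `code_to_label["decision"][1]` recovers the label behind code 1 when reading edges in the learned graph.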


In [None]:
from causalnex.structure.notears import from_pandas
from causalnex.structure.dynotears import from_pandas_dynamic

sm = from_pandas_dynamic(struct_data, p=1)
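
As far as I understand the causalnex documentation, `from_pandas_dynamic` reads the DataFrame index as the time index and also accepts a list of DataFrames, one per realisation of the same process. Treat both points as assumptions to verify. Under them, a panel like this one might be reshaped into one short series per patient before structure learning. The sketch uses pandas only, with toy values and a hypothetical `panel` frame, and leaves the causalnex call as a comment.

```python
import pandas as pd

panel = pd.DataFrame({
    "idx": [1, 1, 2, 2],
    "timestep": [0, 1, 0, 1],
    "cost": [0.0, 1700.0, 15300.0, 24200.0],
    "gagne_sum": [0, 0, 2, 3],
})

# One frame per patient, indexed by consecutive integer timesteps.
series_per_patient = [
    group.drop(columns="idx").set_index("timestep").sort_index()
    for _, group in panel.groupby("idx")
]

# Assumed usage (not executed here):
# sm = from_pandas_dynamic(series_per_patient, p=1)
```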



In [None]:
from causalnex.plots import plot_structure, NODE_STYLE, EDGE_STYLE
from IPython.display import Image

sm.remove_edges_below_threshold(0.8)

viz = plot_structure(
    sm,
    graph_attributes={"scale": "0.5"},
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK,
    prog='fdp',
)
Image(viz.draw(format='png'))