# Obermeyer et-xAI

In this Jupyter notebook, I (Pablo Suárez Reyero) present all the code needed for the individual project (worth $60\%$ of the grade) of the course *Explainable and Ethical Artificial Intelligence for Engineering* (18-fi-2130-vl) at Technische Universität Darmstadt.

## Library imports
Below are all the libraries used to implement this project. TODO: add a `requirements.txt`.

In [21]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Reading the raw data

In [12]:
df = pd.read_csv("data_new.csv")
print(df.head())

   risk_score_t  program_enrolled_t  cost_t  cost_avoidable_t  bps_mean_t  \
0      1.987430                   0  1200.0               0.0         NaN   
1      7.677934                   0  2600.0               0.0       119.0   
2      0.407678                   0   500.0               0.0         NaN   
3      0.798369                   0  1300.0               0.0       117.0   
4     17.513165                   0  1100.0               0.0       116.0   

   ghba1c_mean_t  hct_mean_t  cre_mean_t  ldl_mean_t   race  ...  \
0            5.4         NaN    1.110000       194.0  white  ...   
1            5.5        40.4    0.860000        93.0  white  ...   
2            NaN         NaN         NaN         NaN  white  ...   
3            NaN         NaN         NaN         NaN  white  ...   
4            NaN        34.1    1.303333        53.0  white  ...   

   trig_min-high_tm1  trig_min-normal_tm1  trig_mean-low_tm1  \
0                  0                    0                  0   


## Data inspection


In the following cell we inspect the basics of the data, aiming to find out:
- The number of rows and columns (attributes).
- Whether there are missing values. If so $\rightarrow$ either fill them or delete the affected rows. Since the dataset is not that large ($\sim 49000$ rows), we will not delete rows; instead we replace all `NaN` values with the column mean or median.
- Whether there are outliers. If so $\rightarrow$ we will remove them, as they may distort training and model validation.
- The variable types.
- The maximum and minimum values.
- Basic statistics.
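
The missing-value and outlier steps listed above could look roughly like the sketch below. It is only an illustration on a toy frame: the column names `biomarker_a` and `biomarker_b`, the helper names, and the choice of the 1.5 × IQR rule for flagging outliers are assumptions, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two numeric columns of df
toy = pd.DataFrame({
    "biomarker_a": [120.0, np.nan, 118.0, 122.0, 300.0],
    "biomarker_b": [100.0, 95.0, np.nan, 105.0, 98.0],
})


def impute_median(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where NaNs in numeric columns are replaced by the column median."""
    medians = frame.select_dtypes(include="number").median()
    return frame.fillna(medians)


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


filled = impute_median(toy)
outliers_a = iqr_outlier_mask(filled["biomarker_a"])
print(filled)
print(outliers_a)
```

The median is used here because it is less sensitive to extreme values than the mean. Whether flagged rows should really be dropped is a separate judgment call, since very large values in cost data may be genuine rather than erroneous.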


In [17]:
# Data basics
print(df.shape)
print(df.info())
print(df.describe())

num_rows = len(df)
print(f"Number of rows: {num_rows}")
print(f"\nVariable types:\n{df.dtypes}")
print(f"\nColumn names:\n{df.columns.tolist()}")

# Summary statistics for the numeric columns
numeric_cols = df.select_dtypes(include="number")
print("\nStatistics for numeric columns:")
print(f"Mean:\n{numeric_cols.mean()}")
print(f"\nStandard deviation:\n{numeric_cols.std()}")
print(f"\nMedian:\n{numeric_cols.median()}")

# Extreme values (min and max)
print(f"\nMinimum values:\n{numeric_cols.min()}")
print(f"\nMaximum values:\n{numeric_cols.max()}")

# Columns that contain at least one NaN value
cols_with_nan = df.columns[df.isna().any()].tolist()
print(f"Columns with NaN values:\n{cols_with_nan}")

(48784, 160)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48784 entries, 0 to 48783
Columns: 160 entries, risk_score_t to gagne_sum_t
dtypes: float64(22), int64(137), object(1)
memory usage: 59.6+ MB
None
       risk_score_t  program_enrolled_t         cost_t  cost_avoidable_t  \
count  48784.000000        48784.000000   48784.000000      48784.000000   
mean       4.393692            0.009265    7659.716300       2434.722450   
std        5.519582            0.095811   17989.921192      12058.341779   
min        0.000000            0.000000       0.000000          0.000000   
25%        1.443859            0.000000    1200.000000          0.000000   
50%        2.887719            0.000000    2800.000000          0.000000   
75%        5.350773            0.000000    6600.000000        100.000000   
max      100.000000            1.000000  550500.000000     642700.000000   

         bps_mean_t  ghba1c_mean_t    hct_mean_t    cre_mean_t    ldl_mean_t  \
count  38116.000000   132

## Controlling Temporal Data Leakage in the Model

In this notebook, I focus on predicting an outcome at time *t* using only the information that would have been available at *t‑1*. Any variable measured at *t* cannot be used as a predictor, because doing so introduces future information into the model. This type of leakage inflates performance and prevents the model from being usable in a real setting.

In practice, at time *t‑1* I do not have access to `cost_t`, `gagne_sum_t`, `risk_score_t`, or any other outcome recorded at *t*. The model must therefore be trained under the same constraints it will face when deployed.

To maintain this consistency, I review each variable and keep only those that would genuinely be known at prediction time.

---

## Variables Removed

I remove the following from the predictor set:

- `risk_score_t`, since it is already the output of another model.
- `cost_t`, which is the target I aim to predict.
- All variables ending in `_t`, because they represent outcomes at time *t*.
- Any constructed target or any feature derived from future information.

---

## Variables Retained

I keep only information available at *t‑1*, such as:

- Allowed demographic attributes.
- Biomarkers with the suffix `_tm1`.
- Comorbidities with the suffix `_tm1`.
- Costs with the suffix `_tm1`.
- Utilization variables with the suffix `_tm1`.

These reflect the patient’s state before the prediction moment and do not introduce future information.

---

## Summary

The model must be trained under the same temporal constraints it will face in real use. For that reason, I restrict the predictors to variables available at *t‑1* and remove anything that encodes information from *t*. This prevents data leakage and ensures that the estimated performance is realistic and reproducible.
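
As a minimal sketch of this rule, the suffix-based split can be paired with an explicit check that fails loudly if a time-*t* column ever reaches the predictor set. The toy frame, the column `cost_hypothetical_tm1`, and the helpers `split_by_time` and `assert_no_leakage` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy frame mimicking the naming scheme of the dataset
toy = pd.DataFrame({
    "cost_t": [1200.0, 2600.0],
    "risk_score_t": [1.98, 7.67],
    "gagne_sum_t": [0, 3],
    "cost_hypothetical_tm1": [900.0, 2100.0],
    "race": ["white", "white"],
})


def split_by_time(frame: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (time-t columns, time t-1 columns) based on the column suffix."""
    t_cols = [c for c in frame.columns if c.endswith("_t")]
    tm1_cols = [c for c in frame.columns if c.endswith("_tm1")]
    return t_cols, tm1_cols


def assert_no_leakage(predictors: list[str], forbidden: list[str]) -> None:
    """Raise ValueError if any forbidden (time-t) column is among the predictors."""
    leaked = sorted(set(predictors) & set(forbidden))
    if leaked:
        raise ValueError(f"Time-t columns used as predictors: {leaked}")


t_cols, tm1_cols = split_by_time(toy)
assert_no_leakage(tm1_cols, t_cols)  # passes silently
print("Time t:", t_cols)
print("Time t-1:", tm1_cols)
```

Note that a column without either suffix, such as `race`, ends up in neither list, so it has to be added or excluded deliberately.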


In [18]:
# All time t variables (future outcomes)
t_columns = [col for col in df.columns if col.endswith('_t')]

# All time t-1 predictors
tm1_columns = [col for col in df.columns if col.endswith('_tm1')]

print("Time t columns:", t_columns)
print("Time t-1 columns:", tm1_columns)

Time t columns: ['risk_score_t', 'program_enrolled_t', 'cost_t', 'cost_avoidable_t', 'bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t', 'gagne_sum_t']
Time t-1 columns: ['dem_age_band_18-24_tm1', 'dem_age_band_25-34_tm1', 'dem_age_band_35-44_tm1', 'dem_age_band_45-54_tm1', 'dem_age_band_55-64_tm1', 'dem_age_band_65-74_tm1', 'dem_age_band_75+_tm1', 'alcohol_elixhauser_tm1', 'anemia_elixhauser_tm1', 'arrhythmia_elixhauser_tm1', 'arthritis_elixhauser_tm1', 'bloodlossanemia_elixhauser_tm1', 'coagulopathy_elixhauser_tm1', 'compdiabetes_elixhauser_tm1', 'depression_elixhauser_tm1', 'drugabuse_elixhauser_tm1', 'electrolytes_elixhauser_tm1', 'hypertension_elixhauser_tm1', 'hypothyroid_elixhauser_tm1', 'liver_elixhauser_tm1', 'neurodegen_elixhauser_tm1', 'obesity_elixhauser_tm1', 'paralysis_elixhauser_tm1', 'psychosis_elixhauser_tm1', 'pulmcirc_elixhauser_tm1', 'pvd_elixhauser_tm1', 'renal_elixhauser_tm1', 'uncompdiabetes_elixhauser_tm1', 'valvulardz_elixhauser_tm1', 'wtlo

Now, let us define the targets for both the regression and the classification tasks.

In [19]:
# regression target
y_reg = df['cost_t']

# classification target (example using Gagne index)
threshold = df['gagne_sum_t'].quantile(0.75)
df['high_need'] = (df['gagne_sum_t'] >= threshold).astype(int)
y_clf = df['high_need']
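
One detail of the quantile-based label is worth checking. When the score takes only a few integer values, `>=` at the 75th percentile can mark noticeably more than 25% of patients as positive, because every patient tied with the threshold is included. The sketch below shows this on hypothetical toy counts standing in for `gagne_sum_t`; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical comorbidity counts (not taken from the real data)
toy = pd.DataFrame({"gagne_sum_t": [0, 0, 0, 1, 1, 2, 2, 2, 3, 7]})

# Same construction as in the cell above, applied to the toy data
threshold = toy["gagne_sum_t"].quantile(0.75)
toy["high_need"] = (toy["gagne_sum_t"] >= threshold).astype(int)

# Fraction of rows labelled as high need
positive_share = toy["high_need"].mean()
print(f"threshold = {threshold}, positive share = {positive_share}")
```

Here the threshold is 2.0 and half of the toy rows are labelled positive, so it may be worth printing `y_clf.mean()` on the real data before relying on the class balance.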

We are now ready to define the feature matrix. Selecting columns by suffix is the most efficient approach here, as I do not have to drop `risk_score_t` or `cost_t` manually. Both variables end with the suffix `_t`, and the feature selection keeps only columns ending in `_tm1`. By restricting the predictors to `_tm1` variables, any column representing information from time *t* is automatically excluded.

This approach is safer and more reliable than removing individual variables by hand, since it prevents accidental inclusion of future information and keeps the preprocessing consistent across datasets.


In [20]:
X = df[tm1_columns].copy()
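
A possible next step, assuming a simple hold-out evaluation is intended (`train_test_split` is imported at the top but not yet used), is to split the features and both targets in one call so that the rows stay aligned. The toy data, the feature names, the 80/20 ratio, and the stratification on the classification label are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X, y_reg and y_clf
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "feature_a_tm1": rng.normal(size=20),
    "feature_b_tm1": rng.normal(size=20),
})
y_reg_toy = pd.Series(rng.gamma(2.0, 1000.0, size=20), name="cost_t")
y_clf_toy = pd.Series([0] * 15 + [1] * 5, name="high_need")

# Splitting all three objects together keeps the same rows in each partition;
# stratify preserves the share of positive labels in train and test.
(X_train, X_test,
 y_reg_train, y_reg_test,
 y_clf_train, y_clf_test) = train_test_split(
    X_toy, y_reg_toy, y_clf_toy,
    test_size=0.2, random_state=42, stratify=y_clf_toy,
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible across runs of the notebook.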