In [57]:
import dalex as dx
import xgboost

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import platform
print(f'Python {platform.python_version()}')

{package.__name__: package.__version__ for package in [dx, xgboost, sklearn, pd, np]}

Python 3.11.5


{'dalex': '1.6.0',
 'xgboost': '2.0.1',
 'sklearn': '1.3.2',
 'pandas': '2.1.2',
 'numpy': '1.26.1'}

## Task A

PDP:

$g^1_{PD}(z) = E_{X_2} (z + x_2)^2 = \int_{-1}^1 (z + y)^2 / 2 \, dy = \int_{-1}^1 (z^2 + 2zy + y^2) / 2 \, dy = z^2 + 1/3$

ME:

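$g^1_{MP}(z) = E_{X_2|x_1 = z} (z + x_2)^2 = (z + z)^2 = 4z^2$

Here $x_2 = z$ given $x_1 = z$, so the conditional expectation reduces to a single point.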
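
Both closed forms can be checked numerically. The sketch below assumes the model is $f(x_1, x_2) = (x_1 + x_2)^2$ with $x_2$ uniform on $[-1, 1]$ (both read off the integral above) and that $x_2 = x_1$ under the conditional distribution (inferred from the result $4z^2$). The names `f`, `pd_profile` and `me_profile` are illustrative only.

```python
import numpy as np

def f(x1, x2):
    """Model assumed from the formulas above: f(x1, x2) = (x1 + x2)^2."""
    return (x1 + x2) ** 2

# Dense, symmetric grid on [-1, 1]; its average approximates E over Uniform[-1, 1].
x2_grid = np.linspace(-1.0, 1.0, 200_001)

def pd_profile(z):
    # PD: average over the marginal distribution of x2, ignoring any dependence on x1.
    return f(z, x2_grid).mean()

def me_profile(z):
    # ME: average over x2 | x1 = z; under the assumed full dependence
    # the conditional distribution is a point mass at x2 = z.
    return f(z, z)

for z in (-0.5, 0.0, 0.8):
    print(f"z={z:+.1f}  PD={pd_profile(z):.4f} vs {z**2 + 1/3:.4f}  "
          f"ME={me_profile(z):.4f} vs {4 * z**2:.4f}")
```

The grid average agrees with $z^2 + 1/3$ up to discretization error, and the ME value matches $4z^2$ exactly.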

## Task B

### 0. For the selected data set, train at least one tree-based ensemble model, e.g. random forest, gbdt, xgboost.

In [58]:
!wget -nc https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/churn.csv

File ‘churn.csv’ already there; not retrieving.



In [59]:
dataset = pd.read_csv('churn.csv', index_col=0)
dataset.head()

Unnamed: 0,total_day_minutes,total_day_charge,total_eve_minutes,total_eve_charge,total_night_minutes,total_night_charge,total_intl_minutes,total_intl_charge,TARGET
0,265.1,45.07,197.4,16.78,244.7,11.01,10.0,2.7,0
1,161.6,27.47,195.5,16.62,254.4,11.45,13.7,3.7,0
2,243.4,41.38,121.2,10.3,162.6,7.32,12.2,3.29,0
3,299.4,50.9,61.9,5.26,196.9,8.86,6.6,1.78,0
4,166.7,28.34,148.3,12.61,186.9,8.41,10.1,2.73,0


In [60]:
X = dataset.drop(columns='TARGET')
y = dataset.TARGET

In [61]:
model = xgboost.XGBClassifier()
model.fit(X, y)

### 1. Calculate the predictions for some selected observations.

In [62]:
observation1Index = 0
observation2Index = 1
observation3Index = 5

In [63]:
observation1 = X.iloc[observation1Index]
observation1

total_day_minutes      265.10
total_day_charge        45.07
total_eve_minutes      197.40
total_eve_charge        16.78
total_night_minutes    244.70
total_night_charge      11.01
total_intl_minutes      10.00
total_intl_charge        2.70
Name: 0, dtype: float64

In [64]:
observation2 = X.iloc[observation2Index]
observation2

total_day_minutes      161.60
total_day_charge        27.47
total_eve_minutes      195.50
total_eve_charge        16.62
total_night_minutes    254.40
total_night_charge      11.45
total_intl_minutes      13.70
total_intl_charge        3.70
Name: 1, dtype: float64

In [65]:
observation3 = X.iloc[observation3Index]
observation3

total_day_minutes      223.40
total_day_charge        37.98
total_eve_minutes      220.60
total_eve_charge        18.75
total_night_minutes    203.90
total_night_charge       9.18
total_intl_minutes       6.30
total_intl_charge        1.70
Name: 5, dtype: float64

In [66]:
probabilities = model.predict_proba([observation1, observation2, observation3])

print(f'The probability of 0 in the first observation is {probabilities[0][0]}')
print(f'The probability of 0 in the second observation is {probabilities[1][0]}')
print(f'The probability of 0 in the third observation is {probabilities[2][0]}')

The probability of 0 in the first observation is 0.837304949760437
The probability of 0 in the second observation is 0.866310715675354
The probability of 0 in the third observation is 0.8529270887374878
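
The values printed above are class-0 probabilities, while the prediction function used for the profiles in the next step returns `predict_proba(df)[:, 1]`, the class-1 probability. For a binary classifier the two sum to one, so the corresponding class-1 values can be recovered by hand. A minimal sketch, using the printed numbers rounded to four decimals:

```python
# Class-0 probabilities copied from the output above, rounded to four decimals.
p_class0 = [0.8373, 0.8663, 0.8529]

# Binary classifier: P(class 1) = 1 - P(class 0).
p_class1 = [round(1.0 - p, 4) for p in p_class0]
print(p_class1)
```

These are the values at which each Ceteris Paribus profile should pass through the observed point.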


### 2. Then, calculate the what-if explanations of these predictions using Ceteris Paribus profiles (also called What-if plots), e.g. in Python: AIX360, Alibi, dalex, PDPbox; in R: pdp, DALEX, ALEplot.

In [67]:
def pf_xgboost_classifier_categorical(model, df):
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_categorical)

Preparation of a new explainer is initiated

  -> data              : 5000 rows 8 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 5000 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x7f31f4d179c0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.000151, mean = 0.142, max = 0.998
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.712, mean = -0.000114, max = 0.932
  -> model_info        : package xgboost

A new explainer has been created!


In [68]:
cp1 = explainer.predict_profile(new_observation=observation1)
cp1.plot(variables=['total_intl_minutes', 'total_night_charge'])

Calculating ceteris paribus: 100%|██████████| 8/8 [00:00<00:00, 215.27it/s]


I selected two variables with significant effects: "total international minutes" and "total night charge". For this observation, a change (especially an increase) in total international minutes would raise the predicted probability of the positive class, while total night charge appears mostly detrimental to a positive prediction.

In [69]:
cp2 = explainer.predict_profile(new_observation=observation2)
cp2.plot(variables=['total_intl_minutes', 'total_night_charge'])

Calculating ceteris paribus: 100%|██████████| 8/8 [00:00<00:00, 308.91it/s]


For this observation, total international minutes barely matters, while a decrease in total night charge would substantially increase the predicted probability of the positive class.

In [70]:
cp3 = explainer.predict_profile(new_observation=observation3)
cp3.plot(variables=['total_intl_minutes', 'total_night_charge'])

Calculating ceteris paribus: 100%|██████████| 8/8 [00:00<00:00, 408.26it/s]


For this observation, the predicted probability of the positive class increases with a slight reduction or a large increase in total international minutes, while total night charge has barely any effect.

### 3. Find two observations in the data set, such that they have different CP profiles. For example, model predictions are increasing with age for one observation and decreasing with age for another one.

The second and third observations above have different CP profiles for total night charge: in the second observation the prediction rises as total night charge increases, whereas in the third it rises (though only slightly) as total night charge decreases. The CP profiles for total international minutes also differ markedly between these two observations.
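
To make concrete how two observations can have opposite CP profiles for the same variable, here is a minimal sketch of a Ceteris Paribus computation done by hand. It does not use the notebook's model: `toy_predict` is a hypothetical stand-in with an interaction between two features, and `ceteris_paribus` is an illustrative helper, not part of dalex.

```python
import numpy as np

def toy_predict(X):
    """Hypothetical stand-in for a fitted model: sigmoid of an interaction term x1 * x2."""
    X = np.asarray(X, dtype=float)
    return 1.0 / (1.0 + np.exp(-X[:, 0] * X[:, 1]))

def ceteris_paribus(predict, observation, feature_idx, grid):
    """Vary one feature over `grid`, keeping all other features at their observed values."""
    rows = np.tile(np.asarray(observation, dtype=float), (len(grid), 1))
    rows[:, feature_idx] = grid
    return predict(rows)

grid = np.linspace(-2.0, 2.0, 9)
cp_a = ceteris_paribus(toy_predict, [0.0, 1.0], feature_idx=0, grid=grid)   # second feature > 0
cp_b = ceteris_paribus(toy_predict, [0.0, -1.0], feature_idx=0, grid=grid)  # second feature < 0

print(np.all(np.diff(cp_a) > 0))  # profile rises with the first feature
print(np.all(np.diff(cp_b) < 0))  # profile falls with the first feature
```

The direction of the profile flips with the sign of the other feature, which is the kind of interaction that makes CP profiles differ between observations.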

### 4. Compare CP, which is a local explanation, with PDP, which is a global explanation.

In [72]:
pdp = explainer.model_profile()
pdp.plot(variables=['total_intl_minutes', 'total_night_charge'])

The PDPs resemble the CP profiles seen above: on average, model predictions increase as total international minutes increases and as total night charge decreases.
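
The link between the local and the global view is that a partial dependence profile is the pointwise average of the Ceteris Paribus profiles of many observations. The sketch below shows this on synthetic data, assuming a hypothetical model `toy_predict` with $f(x_1, x_2) = x_1^2 + x_2$. The helper `partial_dependence` is illustrative and not the dalex implementation.

```python
import numpy as np

def toy_predict(X):
    """Hypothetical stand-in for a fitted model: f(x1, x2) = x1^2 + x2."""
    X = np.asarray(X, dtype=float)
    return X[:, 0] ** 2 + X[:, 1]

def partial_dependence(predict, X, feature_idx, grid):
    """Average the model output over all rows with one feature forced to each grid value."""
    X = np.asarray(X, dtype=float)
    profile = np.empty(len(grid))
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # one CP step applied to every row
        profile[i] = predict(X_mod).mean()     # mean of the CP profiles at this grid point
    return profile

rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(500, 2))  # synthetic data, not the churn set
grid = np.linspace(-2.0, 2.0, 9)

pdp_toy = partial_dependence(toy_predict, X_toy, feature_idx=0, grid=grid)
# For this additive model the PDP equals grid^2 plus the mean of the second feature.
print(np.allclose(pdp_toy, grid ** 2 + X_toy[:, 1].mean()))
```

Because averaging can cancel opposing local trends, a PDP that looks similar to a few CP profiles does not guarantee that every observation behaves that way.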

### 5. Compare PDP between at least two different models.

In [74]:
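from sklearn.ensemble import RandomForestClassifier

# Fit a second tree-based ensemble on the same data for comparison.
randomForest = RandomForestClassifier()
randomForest.fit(X, y)
# Despite its name, the prediction wrapper works for any model with predict_proba.
rfExplainer = dx.Explainer(randomForest, X, y, predict_function=pf_xgboost_classifier_categorical)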

Preparation of a new explainer is initiated

  -> data              : 5000 rows 8 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 5000 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x7f31f4d179c0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.144, max = 0.99
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.41, mean = -0.00257, max = 0.49
  -> model_info        : package sklearn

A new explainer has been created!


In [75]:
rfpdp = rfExplainer.model_profile()
rfpdp.plot(variables=['total_intl_minutes', 'total_night_charge'])

Calculating ceteris paribus: 100%|██████████| 8/8 [00:00<00:00,  9.59it/s]


For the random forest classifier, the PDP for total international minutes is similar, but the PDP for total night charge is very different: instead of a decreasing slope, the plot is U-shaped.
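
A visual comparison like this can be backed by a simple numeric one when both profiles are evaluated on the same grid. The sketch below uses made-up curves (not the notebook's PDP values) to show one way to tell a decreasing profile from a U-shaped one and to measure how far two profiles are apart. `describe_shape` is a hypothetical helper.

```python
import numpy as np

def describe_shape(grid, profile):
    """Crude shape label based on where the profile attains its minimum."""
    i_min = int(np.argmin(profile))
    if i_min == len(grid) - 1:
        return "minimum at the right edge (overall decreasing)"
    if i_min == 0:
        return "minimum at the left edge (overall increasing)"
    return f"interior minimum near {grid[i_min]:.2f} (U-shaped)"

grid = np.linspace(0.0, 20.0, 21)
# Made-up curves for illustration only.
pdp_model_a = 0.30 - 0.01 * grid                  # steadily decreasing
pdp_model_b = 0.10 + 0.002 * (grid - 10.0) ** 2   # U-shaped

print(describe_shape(grid, pdp_model_a))
print(describe_shape(grid, pdp_model_b))
# Mean absolute gap between the two profiles over the shared grid.
print(float(np.mean(np.abs(pdp_model_a - pdp_model_b))))
```

The argmin-based label is deliberately simple. Noisy profiles from real models would likely need smoothing before such a rule is reliable.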