# Homework 6

The aim of Homework 6 is to get acquainted with the concept of Permutation-based Variable Importance (PVI). It is a global explanation method that measures the influence of a particular variable by re-evaluating the model of interest after that variable's values have been randomly permuted.
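One way to write this down, as a sketch in notation that the homework itself does not define: let $s$ be the score function, $f$ the fitted model, $(X, y)$ the evaluation data, and $X^{(j,k)}$ a copy of $X$ whose column $j$ has been shuffled in the $k$-th of $K$ repetitions. The importance of feature $j$ is then the average drop in score:

```latex
\mathrm{PVI}_j = s\bigl(f, X, y\bigr) - \frac{1}{K} \sum_{k=1}^{K} s\bigl(f, X^{(j,k)}, y\bigr)
```

A value close to zero, or slightly negative, means that shuffling feature $j$ did not hurt the score.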

To follow up on previous conclusions, we will reuse the models and dataset from Homework 5. That is, the baseline tree-based model will be the random forest, and the other models are a multilayer perceptron, logistic regression and a simple decision tree with limited depth. We will use the *churn* dataset, in which the task is to predict whether a particular client of a telephone company churned or not.

## Subtask 0.

We begin by training a random forest (RF) model on a single train/test split with a 3:1 ratio (scikit-learn's default `test_size` of 0.25, as used in the Appendix). We evaluate all models using ROC AUC and PR AUC to take the class imbalance into account and to aggregate the performance over different thresholds. The RF model achieves a ROC AUC of about 0.71 and a PR AUC of about 0.40. These values indicate that, in general, it is a challenging predictive task.

In [165]:
subtask_zero('random_forest')

random_forest
roc_auc: 0.709708604965522
pr_auc: 0.3987325603160147
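The PR AUC above is computed with scikit-learn's `average_precision_score` (see the Appendix). As a rough illustration of how this metric aggregates over thresholds, here is a minimal, dependency-free sketch. The function name and the toy data are hypothetical, and the sketch assumes there are no tied scores; with ties, the scikit-learn implementation groups equal scores into one threshold.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores."""
    # Rank examples from the highest to the lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the threshold that admits exactly `rank` examples.
            total += hits / rank
    # Each positive contributes an equal recall step of 1 / n_pos.
    return total / n_pos


# Hypothetical toy data: 2 churners among 6 clients.
y_true = [0, 1, 0, 0, 1, 0]
y_score = [0.10, 0.90, 0.80, 0.30, 0.60, 0.20]
ap = average_precision(y_true, y_score)
print(round(ap, 3))
```

In this toy case the two positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2.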


## Subtask 1.

We continue by calculating PVIs for the RF model. To properly address the problem of class imbalance, we calculate PVIs using PR AUC as the score function. Baseline (no permutations) performance is, as indicated previously, around 0.4. The following rows show, for each feature, the mean and standard deviation (over 10 iterations) of the difference in performance between the baseline and the model evaluated with the given feature permuted. Clearly, the RF model makes great use of the *total_day_minutes* variable, which is in line with the findings from previous homeworks. Very close to it is the *total_day_charge* feature. These two variables are the most important for the model. The evening features have a much smaller effect, and for the night and international features the drop is within one standard deviation of zero.

In [166]:
subtask_one()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006
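The numbers above come from scikit-learn's `permutation_importance` (see the Appendix). To make the procedure concrete, here is a minimal from-scratch sketch that uses only the standard library. It is an illustration under simplifying assumptions, not the code used for the results: the toy data, the fixed-rule `toy_model` and the helper names are hypothetical, and accuracy stands in for PR AUC to keep the example short.

```python
import random
from statistics import mean, pstdev


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return mean(1.0 if model(row) == label else 0.0 for row, label in zip(rows, labels))


def permutation_importance_sketch(model, rows, labels, score, n_repeats=10, seed=0):
    """Return the baseline score and a (mean, std) score drop per feature."""
    rng = random.Random(seed)
    baseline = score(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j; all other columns and the labels stay fixed.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - score(model, permuted, labels))
        importances.append((mean(drops), pstdev(drops)))
    return baseline, importances


# Hypothetical toy setup: the label depends on feature 0 only.
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_model(row):
    # A fixed rule standing in for a fitted classifier; it ignores feature 1.
    return 1 if row[0] > 0.5 else 0


baseline, importances = permutation_importance_sketch(toy_model, rows, labels, accuracy)
print(f"baseline: {baseline:.3f}")
for j, (drop_mean, drop_std) in enumerate(importances):
    print(f"feature {j}: {drop_mean:.3f} +/- {drop_std:.3f}")
```

Because the toy model never reads feature 1, shuffling it leaves every prediction unchanged and its importance is exactly zero, while shuffling feature 0 breaks the link between inputs and labels.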


## Subtask 2.

Next, we train additional models: a simple decision tree (DT) with a maximum depth of 5 (the `max_depth` set in the Appendix), a multilayer perceptron (MLP) and logistic regression (LR). Their respective performance measures are included below. Interestingly, MLP performs on par with or even slightly better than the RF model, LR achieves the lowest ROC AUC with a competitive PR AUC, while DT achieves the highest ROC AUC but does not handle the class imbalance well and scores the lowest in terms of PR AUC.

In [167]:
subtask_zero()

random_forest
roc_auc: 0.709708604965522
pr_auc: 0.3987325603160147


decision_tree
roc_auc: 0.7540452520689676
pr_auc: 0.35753437522730497


mlp
roc_auc: 0.7151811736791975
pr_auc: 0.4207226693592752


logistic_regression
roc_auc: 0.642420820286433
pr_auc: 0.3811247088126789




The PVIs for the above models show some interesting patterns. First, the MLP model, which achieves the best PR AUC, seems to be the only one that effectively makes use of more than two variables by exploiting the information from *total_day_charge*, *total_day_minutes* and *total_intl_charge*. LR and RF seem to rely on the same two features as the most important, but the linearity of LR is probably its main limitation. The DT model seems to focus almost exclusively on *total_day_charge*, which suggests that there is great benefit in using multiple trees, as in the RF model.

In [168]:
subtask_two()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006

PVI for decision_tree

baseline: 0.358

total_day_charge: 0.199 +/- 0.008
total_eve_minutes: 0.049 +/- 0.004
total_day_minutes: 0.028 +/- 0.002
total_eve_charge: 0.028 +/- 0.013
total_intl_minutes: 0.022 +/- 0.005
total_night_charge: 0.017 +/- 0.007
total_intl_charge: 0.009 +/- 0.003
total_night_minutes: -0.005 +/- 0.004

PVI for mlp

baseline: 0.421

total_day_charge: 0.250 +/- 0.015
total_day_minutes: 0.169 +/- 0.013
total_intl_charge: 0.136 +/- 0.012
total_eve_minutes: 0.095 +/- 0.016
total_night_minutes: 0.061 +/- 0.016
total_intl_minutes: 0.058 +/- 0.015
total_night_charge: 0.056 +/- 0.019
total_eve_charge: 0.004 +/- 0.018

PVI for logistic_regression

baseline:

## Subtask 3.

We finish with a comparison of three different methods of calculating feature importance in a tree-based model: PVI, scikit-learn's *feature_importances_* attribute of RF, which measures importance as the accumulated impurity decrease within each tree, and the TreeSHAP algorithm from the *shap* package. We include the results below. While the values provided by the methods are not directly comparable, we can compare the feature rankings they induce. These match almost exactly: the differences are confined to swaps within a pair of closely related features. PVI reverses the order of (*total_night_charge*, *total_night_minutes*), TreeSHAP reverses (*total_day_minutes*, *total_day_charge*), and the impurity-based ranking places *total_intl_charge* just above *total_intl_minutes*; in each case the two values are nearly equal. The fact that these explanations match so closely suggests that we might be close to understanding how the model works, and that the limitations of each method might not materially affect the explanation in this specific case.

In [169]:
subtask_three()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006

scikit-learn feature_importance_

total_day_minutes: 0.199
total_day_charge: 0.182
total_eve_minutes: 0.125
total_eve_charge: 0.124
total_night_minutes: 0.107
total_night_charge: 0.106
total_intl_minutes: 0.078
total_intl_charge: 0.080

TreeSHAP feature importance

total_day_minutes: 0.035
total_day_charge: 0.036
total_eve_minutes: 0.022
total_eve_charge: 0.020
total_night_minutes: 0.013
total_night_charge: 0.013
total_intl_minutes: 0.013
total_intl_charge: 0.011
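To make the ranking comparison explicit, the sketch below sorts the three sets of importances reported above and lists the positions at which two rankings disagree. The values are copied from the outputs (rounded to three decimals); the helper names are hypothetical. Note that the rounded TreeSHAP values contain ties at 0.013, which this sketch resolves by keeping the input order.

```python
# Importances copied from the outputs reported above (rounded to 3 decimals).
pvi = {
    'total_day_minutes': 0.110, 'total_day_charge': 0.108,
    'total_eve_minutes': 0.036, 'total_eve_charge': 0.030,
    'total_night_charge': 0.010, 'total_night_minutes': 0.003,
    'total_intl_minutes': 0.002, 'total_intl_charge': -0.003,
}
impurity = {
    'total_day_minutes': 0.199, 'total_day_charge': 0.182,
    'total_eve_minutes': 0.125, 'total_eve_charge': 0.124,
    'total_night_minutes': 0.107, 'total_night_charge': 0.106,
    'total_intl_minutes': 0.078, 'total_intl_charge': 0.080,
}
tree_shap = {
    'total_day_minutes': 0.035, 'total_day_charge': 0.036,
    'total_eve_minutes': 0.022, 'total_eve_charge': 0.020,
    'total_night_minutes': 0.013, 'total_night_charge': 0.013,
    'total_intl_minutes': 0.013, 'total_intl_charge': 0.011,
}


def ranking(importances):
    """Feature names from most to least important (ties keep input order)."""
    return sorted(importances, key=importances.get, reverse=True)


def disagreements(rank_a, rank_b):
    """Positions at which two rankings list different features."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(rank_a, rank_b)) if a != b]


rankings = {
    'PVI': ranking(pvi),
    'impurity': ranking(impurity),
    'TreeSHAP': ranking(tree_shap),
}
for first, second in [('PVI', 'impurity'), ('PVI', 'TreeSHAP'), ('impurity', 'TreeSHAP')]:
    print(first, 'vs', second, disagreements(rankings[first], rankings[second]))
```

Every disagreement is a swap inside a minutes/charge pair for the same time of day; no feature moves across pairs.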


# Appendix

In [120]:
import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

In [121]:
PATH_DATASET = 'churn.csv'
DATASET = pd.read_csv(PATH_DATASET, index_col = 0)

In [122]:
SEED = 0

In [123]:
METRICS = {
    'roc_auc': roc_auc_score,
    'pr_auc': average_precision_score}
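# Scorer used for PVI: average precision (PR AUC) computed from predicted probabilities.
# Note: recent scikit-learn releases deprecate `needs_proba` in favor of
# `response_method = 'predict_proba'`.
SCORING = make_scorer(
    average_precision_score, 
    greater_is_better = True, 
    needs_proba = True)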

In [124]:
MODELS = {
    'random_forest': RandomForestClassifier(random_state = SEED),
    'decision_tree': DecisionTreeClassifier(random_state = SEED, max_depth = 5),
    'mlp': Pipeline([
        ('standard_scaler', StandardScaler()),
        ('mlp', MLPClassifier((32, 32), 'relu', random_state = SEED, max_iter = 1000))]),
    'logistic_regression': Pipeline([
        ('standard_scaler', StandardScaler()),
        ('logistic_regression', LogisticRegression(random_state = SEED))]),}

In [125]:
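def get_train_test_split():
    # Features are all columns but the last; the last column is the target.
    x, y = DATASET.iloc[:, :-1], DATASET.iloc[:, -1]
    # No test_size is passed, so scikit-learn's default of 0.25 applies (a 3:1 split).
    return train_test_split(x, y, random_state = SEED)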

def train_models():
    x_train, x_test, y_train, y_test = get_train_test_split()
    results = {model_name: {} for model_name in MODELS.keys()}
    trained_models = {}
    for model_name, model in MODELS.items():
        model.fit(x_train, y_train)
        trained_models[model_name] = model
        y_pred = model.predict_proba(x_test)[:, -1]
        for metric_name, metric in METRICS.items():
            results[model_name][metric_name] = metric(y_test, y_pred)
    print('Finished')
    return trained_models, results

In [126]:
trained_models, test_metrics = train_models()
x_train, x_test, y_train, y_test = get_train_test_split()
feature_names = x_test.columns.tolist()

Finished


In [127]:
def subtask_zero(model_name = None):
    if model_name is None:
        for model_name, metrics in test_metrics.items():
            print(model_name)
            for metric_name, metric_v in metrics.items():
                print(f'{metric_name}: {metric_v}')
            print('\n')
    else:
        print(model_name)
        for metric_name, metric_v in test_metrics[model_name].items():
            print(f'{metric_name}: {metric_v}')

subtask_zero('random_forest')

random_forest
roc_auc: 0.709708604965522
pr_auc: 0.3987325603160147


In [128]:
def get_pvi(model_name):
    print(f'\nPVI for {model_name}\n')
    clf = trained_models[model_name]
    pvi = permutation_importance(
        clf, x_test, y_test, n_repeats = 10, random_state = SEED, scoring = SCORING)
    print(f"baseline: {test_metrics[model_name]['pr_auc']:.3f}\n")
    for i in pvi.importances_mean.argsort()[::-1]:
        print(f"{feature_names[i]}: {pvi.importances_mean[i]:.3f} +/- {pvi.importances_std[i]:.3f}")

def subtask_one():
    get_pvi('random_forest')

subtask_one()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006


In [129]:
def subtask_two():
    for model_name in trained_models.keys():
        get_pvi(model_name)

subtask_zero()

random_forest
roc_auc: 0.709708604965522
pr_auc: 0.3987325603160147


decision_tree
roc_auc: 0.7540452520689676
pr_auc: 0.35753437522730497


mlp
roc_auc: 0.7151811736791975
pr_auc: 0.4207226693592752


logistic_regression
roc_auc: 0.642420820286433
pr_auc: 0.3811247088126789




In [130]:
subtask_two()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006

PVI for decision_tree

baseline: 0.358

total_day_charge: 0.199 +/- 0.008
total_eve_minutes: 0.049 +/- 0.004
total_day_minutes: 0.028 +/- 0.002
total_eve_charge: 0.028 +/- 0.013
total_intl_minutes: 0.022 +/- 0.005
total_night_charge: 0.017 +/- 0.007
total_intl_charge: 0.009 +/- 0.003
total_night_minutes: -0.005 +/- 0.004

PVI for mlp

baseline: 0.421

total_day_charge: 0.250 +/- 0.015
total_day_minutes: 0.169 +/- 0.013
total_intl_charge: 0.136 +/- 0.012
total_eve_minutes: 0.095 +/- 0.016
total_night_minutes: 0.061 +/- 0.016
total_intl_minutes: 0.058 +/- 0.015
total_night_charge: 0.056 +/- 0.019
total_eve_charge: 0.004 +/- 0.018

PVI for logistic_regression

baseline:

In [164]:
def subtask_three():
    subtask_one()

    print('\nscikit-learn feature_importance_\n')
    imps = trained_models['random_forest'].feature_importances_
    for f_name, f_val in zip(feature_names, imps):
        print(f'{f_name}: {f_val:.3f}')

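    print('\nTreeSHAP feature importance\n')
    shap_vals = shap.TreeExplainer(
        trained_models['random_forest'], 
        data = x_test, model_output = 'probability').shap_values(x_test)
    # Global importance: mean absolute SHAP value per feature for the positive class.
    # This indexing assumes shap_values returns a list with one array per class;
    # newer shap releases return a single 3-D array instead (use shap_vals[:, :, 1] there).
    shap_imps = np.abs(shap_vals[1]).mean(0)
    for f_name, f_imp in zip(feature_names, shap_imps):
        print(f'{f_name}: {f_imp:.3f}')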

subtask_three()


PVI for random_forest

baseline: 0.399

total_day_minutes: 0.110 +/- 0.016
total_day_charge: 0.108 +/- 0.018
total_eve_minutes: 0.036 +/- 0.015
total_eve_charge: 0.030 +/- 0.010
total_night_charge: 0.010 +/- 0.013
total_night_minutes: 0.003 +/- 0.010
total_intl_minutes: 0.002 +/- 0.007
total_intl_charge: -0.003 +/- 0.006

scikit-learn feature_importance_

total_day_minutes: 0.199
total_day_charge: 0.182
total_eve_minutes: 0.125
total_eve_charge: 0.124
total_night_minutes: 0.107
total_night_charge: 0.106
total_intl_minutes: 0.078
total_intl_charge: 0.080

TreeSHAP feature importance

total_day_minutes: 0.035
total_day_charge: 0.036
total_eve_minutes: 0.022
total_eve_charge: 0.020
total_night_minutes: 0.013
total_night_charge: 0.013
total_intl_minutes: 0.013
total_intl_charge: 0.011
