In this mini-lecture, we begin our study of interpretable machine learning models.

In [None]:
# !pip install raiwidgets --user
# !pip install azureml-core
# !pip install azureml-interpret

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os 
import pprint
import sklearn as sk
from collections import defaultdict

from scipy.stats import spearmanr
from scipy.cluster import hierarchy

from raiwidgets import ExplanationDashboard

from sklearn import datasets
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

from interpret.ext.blackbox import MimicExplainer
from interpret.ext.blackbox import PFIExplainer
from interpret.ext.glassbox import LGBMExplainableModel
from interpret.ext.glassbox import LinearExplainableModel
from interpret.ext.glassbox import SGDExplainableModel
from interpret.ext.glassbox import DecisionTreeExplainableModel

%matplotlib inline

In [None]:
path="C:\\Users\\gao\\GAO_Jupyter_Notebook\\Datasets"
os.chdir(path)

#path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\Introduction to Data Science Using Python\\datasets"
#os.chdir(path)

### I. Permutation Feature Importance

**Permutation feature importance (PFI)** measures the increase in the model's prediction error after we permute a feature's values, which breaks the relationship between the feature and the true outcome. A feature is "important" if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is "unimportant" if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction. The permutation feature importance measurement was introduced by Breiman (2001) for random forests, but that version is model-specific (to random forests). Based on this idea, Fisher, Rudin, and Dominici (2018) proposed a model-agnostic version of the feature importance and called it model reliance (denoted by MRL in subsequent passages). They also introduced more advanced ideas about feature importance, for example a (model-specific) version that takes into account that many prediction models may predict the data well. Although very fast for computing global explanations, PFI does not support local explanations and can be inaccurate when there are feature interactions.

Here is the permutation feature importance algorithm based on Fisher, Rudin, and Dominici (2018):

   1. Train a model $y=m(x)+\epsilon$, with feature matrix $X$, target vector $y$, error measure (loss) $l(y,\hat{y})$ (think of an example of linear model where $y=X\beta + \epsilon$ with $m(X)=X\beta$, and we estimate the model by $y=X\hat{\beta}+\hat{\epsilon}$ so that $\hat{\epsilon}=y-\hat{y}=y-X\hat{\beta}$, with $\widehat{m(X)}=X\hat{\beta}=\hat{y}$).
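   2. Estimate the original model error $\epsilon = l(y, \hat{y})$ (e.g. mean squared error). From here on, $\epsilon$ denotes this scalar error under the loss $l$, not the noise term from step 1.
   3. For each feature $k=1,2,\ldots,K$: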
       - Generate feature matrix $\tilde{X}$ by permuting feature $k$ in the data $X$. This breaks the association between feature $k$ and true outcome $y$.
       - Estimate error $\tilde{\epsilon}=l(y,\widehat{m(\tilde{X})})$ based on the predictions of the permuted data.
       - Calculate permutation feature importance $PFI(k)= \frac{\tilde{\epsilon}}{\epsilon}$. Alternatively, the difference can be used: $PFI(k)= \tilde{\epsilon}-\epsilon$. Notice that a positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are comparable across different problems.
   4. Sort features by descending PFI.
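
The four steps above can be sketched in a few lines of NumPy and scikit-learn. This is a minimal illustration rather than a reference implementation: the synthetic data, the linear model, and the helper name `permutation_feature_importance` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)                  # step 1


def permutation_feature_importance(model, X, y, loss, rng):
    """Hypothetical helper: return (ratio, difference) PFI arrays per feature."""
    base_error = loss(y, model.predict(X))            # step 2
    ratios, diffs = [], []
    for k in range(X.shape[1]):                       # step 3
        X_perm = X.copy()
        # Shuffle column k only, breaking its link with y
        X_perm[:, k] = rng.permutation(X_perm[:, k])
        perm_error = loss(y, model.predict(X_perm))
        ratios.append(perm_error / base_error)
        diffs.append(perm_error - base_error)
    return np.array(ratios), np.array(diffs)


pfi_ratio, pfi_diff = permutation_feature_importance(
    model, X, y, mean_squared_error, rng
)
ranking = np.argsort(pfi_ratio)[::-1]                 # step 4
for k in ranking:
    print(f"feature {k}: ratio={pfi_ratio[k]:.2f}, difference={pfi_diff[k]:.2f}")
```

A ratio close to 1 (or a difference close to 0) indicates that the model barely uses the feature.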

Fisher, Rudin, and Dominici (2018) suggest in their paper to split the dataset in half and swap the values of feature $k$ of the two halves instead of permuting feature $k$. This is exactly the same as permuting feature $k$, if you think about it. If you want a more accurate estimate, you can estimate the error of permuting feature $k$ by pairing each instance with the value of feature $k$ of each other instance (except with itself). This gives you a dataset of size $n(n-1)$ to estimate the permutation error, and it takes a large amount of computation time. 
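
Both estimation strategies from the paragraph above can be sketched as follows. The data and the linear model are assumptions chosen only to keep the example small; the point is the construction of the half-swap dataset and of the $n(n-1)$ paired dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=n)    # only feature 0 matters
model = LinearRegression().fit(X, y)
k = 0                                                 # feature under study

base_error = mean_squared_error(y, model.predict(X))

# (a) Split the data in half and swap feature k between the two halves.
half = n // 2
X_swap = X.copy()
X_swap[:half, k] = X[half:2 * half, k]
X_swap[half:2 * half, k] = X[:half, k]
swap_error = mean_squared_error(y, model.predict(X_swap))

# (b) Pair every instance i with feature k taken from every other instance j != i.
i_idx, j_idx = np.nonzero(~np.eye(n, dtype=bool))     # all ordered pairs, i != j
X_pairs = X[i_idx].copy()
X_pairs[:, k] = X[j_idx, k]
pairs_error = mean_squared_error(y[i_idx], model.predict(X_pairs))

print(f"baseline MSE:  {base_error:.3f}")
print(f"half-swap MSE: {swap_error:.3f}")
print(f"all-pairs MSE: {pairs_error:.3f} (from {len(i_idx)} rows)")
```

The all-pairs variant removes the randomness of a single shuffle, at the cost of predicting on $n(n-1)$ rows instead of $n$.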

The next question we need to ask is: should we compute the importance on training or test data? The simple answer is: it depends. And there is no right answer here. Answering the question about training or test data touches the fundamental question of what feature importance is. The best way to understand the difference between feature importance based on training vs. based on test data is an "extreme" example. Say we trained a support vector machine to predict a continuous, random target outcome given 50 random features (200 instances). By "random" we mean that the target outcome is independent of the 50 features. This is like predicting tomorrow's temperature given the latest lottery numbers. If the model "learns" any relationships, then it basically overfits. And in fact, the SVM is very likely to overfit on the training data to some extent. The mean absolute error (MAE) for the training data is 0.29 and for the test data 0.82, which is close to the error of the best possible model that always predicts the mean outcome of 0 (MAE of 0.78). In other words, the SVM model is garbage. What values for the feature importance would you expect for the 50 features of this overfitted SVM? Should it be zero because none of the features contribute to improved performance on unseen test data? Or should the importances reflect how much the model depends on each of the features, regardless of whether the learned relationships generalize to unseen data? In the end, we need to decide whether we want to know how much the model relies on each feature for making predictions (implying that we should use the training data) or how much the feature contributes to the performance of the model on unseen data (implying that we should use the test data).
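
The thought experiment above can be reproduced approximately with scikit-learn. The sketch below is not the original experiment: the random seed, the 50/50 split, and the `SVR` settings (`C=10.0`, default RBF kernel) are assumptions, so the MAE values will differ from the ones quoted in the text.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # 50 random features, 200 instances
y = rng.normal(size=200)             # target is independent of X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
svm = SVR(C=10.0).fit(X_tr, y_tr)    # flexible enough to overfit pure noise

train_mae = mean_absolute_error(y_tr, svm.predict(X_tr))
test_mae = mean_absolute_error(y_te, svm.predict(X_te))

# With this scoring, an importance is the increase in MAE after permutation.
imp_train = permutation_importance(
    svm, X_tr, y_tr, scoring="neg_mean_absolute_error",
    n_repeats=5, random_state=0,
).importances_mean
imp_test = permutation_importance(
    svm, X_te, y_te, scoring="neg_mean_absolute_error",
    n_repeats=5, random_state=0,
).importances_mean

print(f"MAE train: {train_mae:.2f}, MAE test: {test_mae:.2f}")
print(f"mean importance on training data: {imp_train.mean():.4f}")
print(f"mean importance on test data:     {imp_test.mean():.4f}")
```

On the training data the importances reflect what the overfitted model relies on; on the test data they should hover near zero, because none of the features helps on unseen data.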

Let's see an example below. Using the bike-sharing example we discussed in previous lectures, we see that 'temp' (temperature) is the most important feature, whereas 'holiday' is the least important one:

In [None]:
from IPython.display import Image
Image("PFI example.PNG", height=400, width=400)

Here are the advantages of PFI:

   - It has a nice interpretation: feature importance is the increase in model error when the feature's information is destroyed.
   - Feature importance provides a highly compressed, global insight into the model's behavior.
   - The importance measure automatically takes into account all interactions with other features. By permuting the feature you also destroy the interaction effects with other features. This means that the permutation feature importance takes into account both the main feature effect and the interaction effects on model performance.
   - Permutation feature importance does not require retraining the model. Some other methods suggest deleting a feature, retraining the model and then comparing the model error. Since the retraining of a machine learning model can take a long time, "only" permuting a feature can save a lot of time. Importance methods that retrain the model with a subset of features appear intuitive at first glance, but the model with the reduced data is meaningless for the feature importance. We are interested in the feature importance of a fixed model. Retraining with a reduced dataset creates a different model than the one we are interested in. 


Here are the disadvantages of PFI:

   - The importance measure automatically takes into account all interactions with other features. This is also a disadvantage because the importance of the interaction between two features is included in the importance measurements of both features. This means that the feature importances do not add up to the total drop in performance, but the sum is larger. Only when there is no interaction between the features, as in a linear model, do the importances add up approximately.
   - Permutation feature importance is linked to the error of the model. This is not inherently bad, but in some cases not what you need. In some cases, you might prefer to know how much the model's output varies for a feature without considering what it means for performance. For example, you want to find out how robust your model's output is when someone manipulates the features. In this case, you would not be interested in how much the model performance decreases when a feature is permuted, but how much of the model's output variance is explained by each feature. Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it does not overfit).
   - The permutation feature importance depends on shuffling the feature, which adds randomness to the measurement. When the permutation is repeated, the results might vary greatly. Repeating the permutation and averaging the importance measures over repetitions stabilizes the measure, but increases the time of computation.
   - If features are correlated, the permutation feature importance can be biased by unrealistic data instances. The problem is the same as with partial dependence plots: The permutation of features produces unlikely data instances when two or more features are correlated. When they are positively correlated (like height and weight of a person) and I shuffle one of the features, I create new instances that are unlikely or even physically impossible (2 meter person weighing 30 kg for example), yet I use these new instances to measure the importance. In other words, for the permutation feature importance of a correlated feature, we consider how much the model performance decreases when we exchange the feature with values we would never observe in reality. 
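
The randomness introduced by shuffling, and the stabilizing effect of averaging over repetitions, can be illustrated with a small experiment. This is a sketch under assumptions: the diabetes data, the ridge model, the choice of the 'bmi' column, and the helper name `importance_of_feature` are picked only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

X, y = load_diabetes(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1e-2).fit(X_tr, y_tr)


def importance_of_feature(feature, n_repeats, seed):
    """Hypothetical helper: mean importance of one feature for a given seed."""
    result = permutation_importance(
        model, X_val, y_val, n_repeats=n_repeats, random_state=seed
    )
    return result.importances_mean[feature]


feature = 2          # column index of 'bmi' in the diabetes data
seeds = range(20)

# One shuffle per estimate vs. the average of 30 shuffles per estimate
single = np.array([importance_of_feature(feature, 1, s) for s in seeds])
averaged = np.array([importance_of_feature(feature, 30, s) for s in seeds])

print(f"std across seeds, 1 repeat:   {single.std():.4f}")
print(f"std across seeds, 30 repeats: {averaged.std():.4f}")
```

The spread across seeds shrinks when more repetitions are averaged, while the computation time grows roughly in proportion to the number of repetitions.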

Now let's see some code examples of PFI. There are many packages that can perform PFI analysis. Here are the main ones:

   - 'scikit-learn'
   - 'eli5': allows us to visualize and debug various ML models using unified API
   - 'azureml-interpret' (with 'azureml-core')

In the 'scikit-learn' package, permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature. The permutation_importance() function calculates the feature importance of estimators for a given dataset. The _n\_repeats_ parameter sets the number of times a feature is randomly shuffled, and the function returns a sample of feature importances.

Below, we use the diabetes dataset to understand the relationship between the covariates and the response. Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of $n = 442$ diabetes patients, together with the response of interest, a quantitative measure of disease progression one year after baseline. Here we will employ the ridge regression model:

In [None]:
diabetes = load_diabetes(as_frame=True)
print(type(diabetes))
type(diabetes.data)

In [None]:
diabetes.data.head()

In [None]:
diabetes.target.head()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(diabetes.data, diabetes.target, random_state=0)

In [None]:
model = Ridge(alpha=1e-2).fit(X_train, y_train)
model.score(X_val, y_val) # this is R-squared by default

In [None]:
r = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        print(f"{diabetes.feature_names[i]:<8}"
              f"{r.importances_mean[i]:.3f}"
              f" +/- {r.importances_std[i]:.3f}")

Note that the importance values for the top features represent a large fraction of the reference score of 0.356.

There is a relation between PFI and impurity-based importance in trees. Tree-based models provide an alternative measure of feature importances based on the mean decrease in impurity (MDI). Impurity is quantified by the splitting criterion of the decision trees (Gini, entropy or MSE). However, this method can give high importance to features that may not be predictive on unseen data when the model is overfitting. Permutation-based feature importance, on the other hand, avoids this issue, since it can be computed on unseen data. Furthermore, impurity-based feature importance for trees is strongly biased and favors high cardinality features (typically numerical features) over low cardinality features such as binary features or categorical variables with a small number of possible categories.

Permutation-based feature importances do not exhibit such a bias. Additionally, the permutation feature importance may be computed with any performance metric on the model predictions and can be used to analyze any model class (not just tree-based models).
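
The contrast between MDI and permutation importance can be made concrete by appending a pure-noise column to a real dataset. This is a sketch under assumptions: the breast cancer data, the random seeds, and the appended column are chosen only for illustration, and the exact numbers depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Append a continuous (high-cardinality) column that is unrelated to the target.
X = np.column_stack([X, rng.normal(size=X.shape[0])])
noise_idx = X.shape[1] - 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI is derived from the training data only.
mdi_noise = forest.feature_importances_[noise_idx]

# Permutation importance is evaluated on held-out data.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_noise = perm.importances_mean[noise_idx]

print(f"MDI of the noise column:                    {mdi_noise:.4f}")
print(f"permutation importance of the noise column: {perm_noise:.4f}")
```

Because fully grown trees will eventually split on the noise column, MDI assigns it a strictly positive importance, whereas its permutation importance on held-out data should be close to zero.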

When two features are correlated and one of the features is permuted, the model will still have access to the feature through its correlated feature. This will result in a lower importance value for both features, where they might actually be important. One way to handle this is to cluster features that are correlated and only keep one feature from each cluster. This strategy is explored in the following example. In this example, we compute the permutation importance on the Wisconsin breast cancer dataset using permutation_importance(). The RandomForestClassifier() can easily get about 97% accuracy on a test dataset. Because this dataset contains multicollinear features, the permutation importance will show that none of the features are important. One approach to handling multicollinearity is by performing hierarchical clustering on the features' Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster. Let's first train the model and evaluate the model on the test data:

In [None]:
data = load_breast_cancer(as_frame=True) # training a random forest on the breast cancer dataset and evaluate its accuracy on a test set
X, y = data.data, data.target
print(type(X), type(y))
X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("Accuracy on test data: {:.2f}".format(clf.score(X_test, y_test)))

Now we plot the tree-based feature importance and the permutation importance. The permutation importance plot shows that permuting a feature drops the accuracy by at most 0.012, which would suggest that none of the features are important. This is in contradiction with the high test accuracy computed above: some feature must be important. The permutation importance is calculated on the training set to show how much the model relies on each feature during training:

In [None]:
result = permutation_importance(clf, X_train, y_train, n_repeats=10,
                                random_state=42)
perm_sorted_idx = result.importances_mean.argsort()

tree_importance_sorted_idx = np.argsort(clf.feature_importances_)
tree_indices = np.arange(0, len(clf.feature_importances_)) + 0.5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
ax1.barh(tree_indices,
         clf.feature_importances_[tree_importance_sorted_idx], height=0.7)
ax1.set_yticks(tree_indices)
ax1.set_yticklabels(data.feature_names[tree_importance_sorted_idx])
ax1.set_ylim((0, len(clf.feature_importances_)))
ax2.boxplot(result.importances[perm_sorted_idx].T, vert=False,
            labels=data.feature_names[perm_sorted_idx])
fig.tight_layout()
plt.show()

When features are collinear, permuting one feature will have little effect on the model's performance because it can get the same information from a correlated feature. One way to handle multicollinear features is by performing hierarchical clustering on the Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster. First, we plot a heatmap of the correlated features:

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
corr = spearmanr(X).correlation
corr_linkage = hierarchy.ward(corr)
dendro = hierarchy.dendrogram(corr_linkage, labels=data.feature_names.tolist(), ax=ax1, leaf_rotation=90)
dendro_idx = np.arange(0, len(dendro['ivl']))

ax2.imshow(corr[dendro['leaves'], :][:, dendro['leaves']])
ax2.set_xticks(dendro_idx)
ax2.set_yticks(dendro_idx)
ax2.set_xticklabels(dendro['ivl'], rotation='vertical')
ax2.set_yticklabels(dendro['ivl'])
fig.tight_layout()
plt.show()

Next, we manually pick a threshold by visual inspection of the dendrogram to group our features into clusters and choose a feature from each cluster to keep, select those features from our dataset, and train a new random forest. The test accuracy of the new random forest did not change much compared to the random forest trained on the complete dataset:

In [None]:
cluster_ids = hierarchy.fcluster(corr_linkage, 1, criterion='distance')
cluster_id_to_feature_ids = defaultdict(list)
for idx, cluster_id in enumerate(cluster_ids):
    cluster_id_to_feature_ids[cluster_id].append(idx)
selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
selected_features

In [None]:
X_train_sel = X_train.iloc[:, selected_features]
X_test_sel = X_test.iloc[:, selected_features]
clf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
clf_sel.fit(X_train_sel, y_train)
print("Accuracy on test data with features removed: {:.2f}".format(clf_sel.score(X_test_sel, y_test)))
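
To see in isolation why correlated features share (and thus lose) permutation importance, consider a synthetic sketch. Everything here is an assumption made for illustration: a latent signal `z`, two noisy copies `x1` and `x2` of it, and the helper name `importance_of_x1`. The importance of `x1` is compared between a forest that sees only `x1` and a forest that sees both copies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 600
z = rng.normal(size=n)                          # latent signal
x1 = z + rng.normal(scale=0.1, size=n)          # noisy copy 1
x2 = z + rng.normal(scale=0.1, size=n)          # noisy copy 2, highly correlated with x1
y = z + rng.normal(scale=0.1, size=n)

X_alone = x1.reshape(-1, 1)
X_both = np.column_stack([x1, x2])
train, test = slice(0, 400), slice(400, None)


def importance_of_x1(X):
    """Hypothetical helper: permutation importance (drop in R^2) of column 0."""
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X[train], y[train])
    result = permutation_importance(
        forest, X[test], y[test], n_repeats=10, random_state=0
    )
    return result.importances_mean[0]


imp_alone = importance_of_x1(X_alone)
imp_both = importance_of_x1(X_both)

print(f"importance of x1 when used alone:        {imp_alone:.3f}")
print(f"importance of x1 with its correlated x2: {imp_both:.3f}")
```

When `x2` is available, the forest can still recover most of the signal after `x1` is shuffled, so the measured importance of `x1` drops even though `x1` is just as informative as before.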

PFI has become one of the most useful tools to interpret ML models. Besides the standard 'scikit-learn' library, there are other libraries that can do this. Here we will illustrate another example using Microsoft Azure's package 'azureml-interpret'. The package should be installed together with the 'azureml-core' package. Here, let's use the Boston housing dataset to illustrate the usage of Microsoft Azure's libraries:

In [None]:
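# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
boston_data = load_boston()
print(boston_data.DESCR)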

Let's do the train-test split and train a gradient boosting model (as of June 2021, there is a package conflict between TensorFlow and PFI, so we will use a gradient boosting model for now rather than a neural network model):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_data.data, boston_data.target, test_size=0.2, random_state=0)

In [None]:
reg = GradientBoostingRegressor(n_estimators=100, max_depth=4, learning_rate=0.01, loss='huber', random_state=1)
model = reg.fit(X_train, y_train)

We now have a 'model' object from the fit() method. Let's use PFI for our analysis. PFIExplainer() takes many arguments. For example, the parameter _metric_ can be used to pass a metric name or function to evaluate the permutation. Note that if a metric function is provided, a higher value must be better; otherwise, we need to take the negative of the function or set the parameter _is_error_metric_ to 'True'. The default metrics are the following:
 
   - F1 score for binary classification
   - F1 score with micro average for multiclass classification
   - MAE for regression

In [None]:
explainer = PFIExplainer(model, features=boston_data.feature_names)
global_explanation = explainer.explain_global(X_test, true_labels=y_test)
print('global importance rank:', '\n')
pprint.pprint(global_explanation.get_feature_importance_dict())

In [None]:
ExplanationDashboard(global_explanation, model, dataset=X_test, true_y=y_test)

### II. Global Surrogate Model

A global surrogate model is an interpretable model that is trained to approximate the predictions of a black box model. We can draw conclusions about the black box model by interpreting the surrogate model. Surrogate models are also used in engineering: If an outcome of interest is expensive, time-consuming or otherwise difficult to measure (e.g. because it comes from a complex computer simulation), a cheap and fast surrogate model of the outcome can be used instead. The difference between the surrogate models used in engineering and in interpretable machine learning is that the underlying model is a machine learning model (not a simulation) and that the surrogate model must be interpretable. The purpose of (interpretable) surrogate models is to approximate the predictions of the underlying model as accurately as possible and to be interpretable at the same time. The idea of surrogate models can be found under different names: Approximation model, metamodel, response surface model, emulator etc. 

There is actually not much theory needed to understand surrogate models. We want to approximate our black box prediction function $m(.)$ as closely as possible with the surrogate model prediction function $g(.)$, under the constraint that the surrogate model is interpretable. Usually, $g(.)$ is a linear model or a decision tree. Here are the steps to obtain a surrogate model:

   1. Select a dataset $X$. This can be the same dataset that was used for training the black box model or a new dataset from the same distribution. You could even select a subset of the data or a grid of points, depending on your application.
   2. For the selected dataset $X$, get the predictions of the black box model.
   3. Select an interpretable model type (linear model, decision tree, ...).
   4. Train the interpretable model on the dataset $X$ and its predictions $\hat{y}$. Congratulations! You now have a surrogate model.
   5. Measure how well the surrogate model replicates the predictions of the black box model (e.g. using $R^{2}$)
   6. If the model proxies the predictions well, interpret the surrogate model.
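
The six steps above can be sketched with scikit-learn alone. This is a minimal illustration under assumptions: the diabetes dataset, a gradient boosting regressor as the black box, and a depth-3 decision tree as the surrogate are choices made for the example, not part of the method.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.metrics import r2_score

data = load_diabetes()
X, y = data.data, data.target                        # step 1: select a dataset

# The "black box" whose behavior we want to explain
black_box = GradientBoostingRegressor(random_state=0).fit(X, y)
y_hat = black_box.predict(X)                         # step 2: black box predictions

# step 3: choose an interpretable model type (a shallow tree here)
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)

# step 4: fit the surrogate to the black box predictions, not to the true y
surrogate.fit(X, y_hat)

# step 5: how well does the surrogate replicate the black box?
fidelity = r2_score(y_hat, surrogate.predict(X))
print(f"R^2 of surrogate vs. black box predictions: {fidelity:.3f}")

# step 6: if the fidelity is acceptable, read off the surrogate's rules
print(export_text(surrogate, feature_names=list(data.feature_names)))
```

Note that the $R^{2}$ here compares the surrogate with the black box predictions, so it measures fidelity to the black box rather than accuracy on the true outcome.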

The global surrogate model method comes with both advantages and disadvantages. For example, the surrogate model method is flexible: any interpretable model can be used. This also means that you can exchange not only the interpretable model, but also the underlying black box model. Suppose you create some complex model and explain it to different teams in your company. One team is familiar with linear models, the other team can understand decision trees. You can train two surrogate models (linear model and decision tree) for the original black box model and offer two kinds of explanations. If you find a better performing black box model, you do not have to change your method of interpretation, because you can use the same class of surrogate models.

We can measure how close the surrogate model is to the black box model. Let us assume we are not very close, but close enough. It could happen that the interpretable model is very close for one subset of the dataset, but widely divergent for another subset. In this case the interpretation for the simple model would not be equally good for all data points.
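
The subset problem described above can be demonstrated with a toy example. The function `black_box` below is hypothetical: it is linear for negative inputs and oscillates strongly for non-negative inputs, so a single linear surrogate is expected to fit one region far better than the other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def black_box(x):
    """Hypothetical black box: linear for x < 0, oscillating for x >= 0."""
    return 2.0 * x + np.where(x >= 0, 3.0 * np.sin(10.0 * x), 0.0)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
X = x.reshape(-1, 1)
y_hat = black_box(x)                       # black box predictions

# One global linear surrogate fitted on the whole dataset
surrogate = LinearRegression().fit(X, y_hat)
pred = surrogate.predict(X)

# Fidelity measured overall and separately on two subsets
left, right = x < 0, x >= 0
r2_all = r2_score(y_hat, pred)
r2_left = r2_score(y_hat[left], pred[left])
r2_right = r2_score(y_hat[right], pred[right])

print(f"R^2 overall:        {r2_all:.3f}")
print(f"R^2 on x < 0:       {r2_left:.3f}")
print(f"R^2 on x >= 0:      {r2_right:.3f}")
```

A single overall fidelity number can therefore hide regions where the surrogate's explanation should not be trusted, which is a reason to check fidelity on meaningful subsets as well.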

One Python package that builds on the basic version of the global surrogate model is the 'smt' package, which is still under development. Another option is 'MimicExplainer' from Microsoft Azure's interpretability libraries. As of June 2021, Microsoft Azure has made a lot of progress in ML interpretability and ML fairness. Official documentation can be found online here:

   - https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability
   
Let's use the Boston housing dataset to illustrate the usage of Microsoft Azure's libraries again:

In [None]:
boston_data = load_boston()

Next let's do the train-test split and then train the model using gradient boosting (regression):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_data.data, boston_data.target, test_size=0.2, random_state=0)

reg = GradientBoostingRegressor(n_estimators=100, max_depth=4, learning_rate=0.1, loss='huber', random_state=1)
model = reg.fit(X_train, y_train)

We now have a 'model' object from the fit() method. Next, let's use the global surrogate model to find a global explanation. MimicExplainer() takes a few arguments. The argument _augment_data_ is optional; if 'True', it oversamples the initialization examples to improve how well the surrogate model fits the original model. This is useful for high-dimensional data where the number of rows is less than the number of columns. The argument _max_num_of_augmentations_ is optional and defines the maximum number of times we can increase the input data size. The _LGBMExplainableModel_ can be replaced with _LinearExplainableModel_, _SGDExplainableModel_, or _DecisionTreeExplainableModel_:

In [None]:
explainer = MimicExplainer(model, X_train, LGBMExplainableModel, 
                           augment_data=True, max_num_of_augmentations=10, 
                           features=boston_data.feature_names)

In [None]:
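# Explain with the surrogate; otherwise global_explanation still holds the PFI result
global_explanation = explainer.explain_global(X_test)
print('global importance rank:', '\n')
pprint.pprint(global_explanation.get_feature_importance_dict())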

In [None]:
ExplanationDashboard(global_explanation, model, dataset=X_test, true_y=y_test)

### References: 

#### Interpretable AI:
   - https://christophm.github.io/interpretable-ml-book/
   - https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability

#### Permutation Feature Importance
   - Breiman, Leo. "Random Forests." Machine Learning 45 (1). Springer: 5-32 (2001).
   - Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "Model Class Reliance: Variable importance measures for any machine learning model class, from the 'Rashomon' perspective." http://arxiv.org/abs/1801.01489 (2018).
   - https://docs.microsoft.com/en-us/archive/blogs/machinelearning/permutation-feature-importance
   - https://scikit-learn.org/stable/modules/permutation_importance.html
   - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
   - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer
   - https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py
   
#### Global Surrogate Model   
   - https://santiagof.medium.com/model-interpretability-making-your-model-confesses-surrogate-models-3dbf72bee8e
   - https://smt.readthedocs.io/en/stable/
   - https://github.com/SMTorg/smt
   - https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb