# Interpretability with Explainable Boosting Machine (EBM)


## Overview of Tutorial
This notebook is Part 1 of a four-part workshop that demonstrates how to use [InterpretML](https://interpret.ml) and [Fairlearn](https://fairlearn.org) (and their integrations with Azure Machine Learning) to better understand and analyze models. The workshop consists of the following parts:

- Part 1: [Interpretability with glassbox models (EBM)](https://github.com/microsoft/ResponsibleAI-Airlift/blob/main/Interpret/EBM/Interpretable%20Classification%20Methods.ipynb) (this notebook)
- Part 2: [Explain blackbox models with SHAP (and upload explanations to Azure Machine Learning)](https://github.com/microsoft/ResponsibleAI-Airlift/blob/main/Interpret/SHAP/explain-model-SHAP.ipynb)
- Part 3: [Run Interpretability on Azure Machine Learning](https://github.com/microsoft/ResponsibleAI-Airlift/blob/main/Interpret/SHAP/explain-model-Azure.ipynb)
- Part 4: [Model fairness assessment and unfairness mitigation](https://github.com/microsoft/ResponsibleAI-Airlift/blob/main/Fairness/AI-fairness-Census.ipynb)

## Introduction

EBM is an interpretable model developed at Microsoft Research. It uses modern machine learning techniques such as bagging, gradient boosting, and automatic interaction detection to breathe new life into traditional GAMs (Generalized Additive Models). As a result, EBMs are often as accurate as state-of-the-art techniques like random forests and gradient boosted trees. Unlike those blackbox models, however, EBMs produce lossless explanations and can be edited by domain experts.
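
For reference, the additive structure that EBMs inherit from GAMs is usually written as shown below. The symbols are conventional rather than defined elsewhere in this notebook: $g$ is the link function (the logit for classification), $\beta_0$ is the intercept, each $f_j$ is a shape function of a single feature, and each $f_{ij}$ is a pairwise interaction term.

$$
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j)} f_{ij}(x_i, x_j)
$$

Because each $f_j$ depends on one feature only, it can be plotted directly, which is what makes the model's explanations readable term by term.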


This notebook showcases how to train an EBM classification model and explore its explanations.

Problem: Adult Census Income (Predict whether income exceeds $50K/yr based on census data)



![](./images/EBM.png)



## What Is Machine Learning Interpretability?
Interpretability is the ability to explain why your model made the predictions it did. The Azure Machine Learning service offers various interpretability features to help accomplish this task. These features include:

- Feature importance values for both raw and engineered features.
- Interpretability on real-world datasets at scale, during training and inference.
- Interactive visualizations to aid you in the discovery of patterns in data and explanations at training time.

Interpreting your model accurately allows you to:

- Use the insights to debug your model.
- Validate that the model's behavior matches your objectives.
- Check the model for bias.
- Build trust with your customers and stakeholders.

## Install Required Packages

[InterpretML](https://github.com/interpretml) is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof. With this package, you can train interpretable glassbox models and explain blackbox systems. InterpretML helps you understand your model's global behavior, or understand the reasons behind individual predictions.

In [None]:
%pip install interpret

Output (abridged): pip collected `interpret` 0.2.1 and `interpret-core` 0.2.1 together with their dependencies (for example `dash`, `psutil`, and `scikit-image`); the rest of the log is omitted.

After installing the packages, restart the kernel, then close and reopen the notebook so that the new packages are picked up.
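
After the restart, a quick check like the one below can confirm that the packages are visible to the kernel. This is a minimal sketch: `missing_packages` is a hypothetical helper, and the list of names is an assumption based on the imports used later in this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found by the importer."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported directly in this notebook (assumed list; extend as needed).
required = ["interpret", "pandas", "sklearn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`importlib.util.find_spec` only locates a package without importing it, so the check is cheap and has no side effects.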

## Set up a classification experiment

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
# df = df.sample(frac=0.1, random_state=1)
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
# Encode the response as 0 (<=50K) or 1 (>50K); raw values carry a leading space
y = df[label].apply(lambda x: 0 if x == " <=50K" else 1)

seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
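
The label comparison above depends on the leading space in the raw values, because fields in the file are separated by a comma followed by a space. The sketch below shows a whitespace-tolerant alternative; `encode_income` is a hypothetical helper, not part of the original notebook, and unlike the lambda above it raises on unexpected labels instead of mapping them to 1.

```python
def encode_income(label: str) -> int:
    """Map an Adult income label to 0 (<=50K) or 1 (>50K), ignoring padding."""
    cleaned = label.strip()
    if cleaned == "<=50K":
        return 0
    if cleaned == ">50K":
        return 1
    raise ValueError(f"Unexpected income label: {label!r}")


print(encode_income(" <=50K"))
print(encode_income(">50K "))
```

With pandas this could be applied as `df[label].apply(encode_income)`. Passing `skipinitialspace=True` to `pd.read_csv` is another option, since it removes the padding at load time.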

## Explore the dataset

In [5]:
from interpret import show
from interpret.data import ClassHistogram

hist = ClassHistogram().explain_data(X_train, y_train, name='Train Data')
show(hist)

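If this cell fails with `ModuleNotFoundError: No module named 'dash'`, the visualization dependencies are not available to the kernel. Rerun the install cell above and restart the kernel.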

## Train the Explainable Boosting Machine (EBM)

In [4]:
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=seed)
ebm.fit(X_train, y_train)  # Works on both pandas DataFrames and NumPy arrays

ExplainableBoostingClassifier(feature_names=['Age', 'WorkClass', 'fnlwgt',
                                             'Education', 'EducationNum',
                                             'MaritalStatus', 'Occupation',
                                             'Relationship', 'Race', 'Gender',
                                             'CapitalGain', 'CapitalLoss',
                                             'HoursPerWeek', 'NativeCountry'],
                              feature_types=['continuous', 'categorical',
                                             'continuous', 'categorical',
                                             'continuous', 'categorical',
                                             'categorical', 'categorical',
                                             'categorical', 'categorical',
                                             'continuous', 'continuous',
                                             'continuous', 'categorical'],
                      

## Global Explanations: What the model learned overall

In [5]:
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)
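
One common way to summarize an additive model globally is to rank each term by the mean absolute value of its contribution across the data. The plain-Python sketch below illustrates that idea only. The contribution values are made up, `mean_abs_importance` is a hypothetical name, and this is an assumption about how a summary ranking can be read rather than a statement of InterpretML's exact computation.

```python
def mean_abs_importance(contributions):
    """Rank features by mean absolute contribution.

    contributions: dict mapping feature name -> list of per-sample scores.
    Returns (feature, importance) pairs sorted from most to least important.
    """
    importances = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in contributions.items()
    }
    return sorted(importances.items(), key=lambda item: item[1], reverse=True)


# Made-up per-sample contributions (log-odds units) for three features.
toy_contributions = {
    "Age": [0.5, -0.5, 1.0, -1.0],
    "CapitalGain": [2.0, 0.0, 0.0, -2.0],
    "Race": [0.25, -0.25, 0.25, -0.25],
}

for name, importance in mean_abs_importance(toy_contributions):
    print(f"{name}: {importance:.2f}")
```

Taking absolute values matters: a feature that pushes some predictions up and others down by the same amount would otherwise average out to zero and look unimportant.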

## Local Explanations: How an individual prediction was made

In [None]:
ebm_local = ebm.explain_local(X_test[:5], y_test[:5], name='EBM')
show(ebm_local)
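
A local explanation of an additive classifier can be read as a sum: the intercept plus each term's score gives the log-odds, and the logistic function turns that into a probability. The sketch below shows the arithmetic with made-up numbers; `predict_from_contributions` is a hypothetical helper and the scores are not taken from the trained model.

```python
import math


def predict_from_contributions(intercept, contributions):
    """Combine additive term scores into a probability for the positive class."""
    log_odds = intercept + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up term scores for one hypothetical person (log-odds units).
intercept = -1.0
contributions = {"Age": 0.4, "EducationNum": 0.9, "HoursPerWeek": -0.3}

probability = predict_from_contributions(intercept, contributions)
print(f"P(income > 50K) = {probability:.3f}")
```

Each bar in a local explanation plot corresponds to one entry of `contributions` here, so the plot accounts for the whole prediction rather than approximating it.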

## Evaluate EBM performance

In [6]:
from interpret.perf import ROC

ebm_perf = ROC(ebm.predict_proba).explain_perf(X_test, y_test, name='EBM')
show(ebm_perf)
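
The area under the ROC curve (AUC) can be interpreted as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on made-up data. It is meant for illustration only, since its cost grows with the number of positive-negative pairs, and `roc_auc` is a hypothetical helper rather than the function InterpretML uses.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities, made up for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for comparing the models later in this notebook.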

## Let's test out a few other Explainable Models

In [8]:
from interpret.glassbox import LogisticRegression, ClassificationTree

# We have to transform categorical variables to use Logistic Regression and Decision Tree
X_enc = pd.get_dummies(X, prefix_sep='.')
feature_names = list(X_enc.columns)
X_train_enc, X_test_enc, y_train, y_test = train_test_split(X_enc, y, test_size=0.20, random_state=seed)

lr = LogisticRegression(random_state=seed, feature_names=feature_names, penalty='l1', solver='liblinear')
lr.fit(X_train_enc, y_train)

tree = ClassificationTree()
tree.fit(X_train_enc, y_train)

<interpret.glassbox.decisiontree.ClassificationTree at 0x7fb8b63701d0>
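
The `pd.get_dummies` call above replaces each categorical column with one indicator column per category, named `<column>.<category>` because of `prefix_sep='.'`. The plain-Python sketch below mimics that naming scheme on two made-up rows. `one_hot_rows` is a hypothetical helper that only illustrates the idea; it is not a replacement for the pandas function.

```python
def one_hot_rows(rows, categorical_columns, prefix_sep="."):
    """Expand the given categorical columns of dict rows into 0/1 indicators."""
    # Collect the sorted set of categories seen in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categories}
        for col, values in categories.items():
            for value in values:
                new_row[f"{col}{prefix_sep}{value}"] = int(row[col] == value)
        encoded.append(new_row)
    return encoded


rows = [
    {"Age": 39, "WorkClass": "Private"},
    {"Age": 50, "WorkClass": "State-gov"},
]
for encoded_row in one_hot_rows(rows, ["WorkClass"]):
    print(encoded_row)
```

The EBM did not need this step because it handles categorical columns directly, as the `feature_types` in its fitted representation show.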

## Compare performance using the Dashboard

In [9]:
lr_perf = ROC(lr.predict_proba).explain_perf(X_test_enc, y_test, name='Logistic Regression')
tree_perf = ROC(tree.predict_proba).explain_perf(X_test_enc, y_test, name='Classification Tree')

show(lr_perf)
show(tree_perf)
show(ebm_perf)

## Glassbox: All of our models have global and local explanations

In [10]:
lr_global = lr.explain_global(name='Logistic Regression')
tree_global = tree.explain_global(name='Classification Tree')

show(lr_global)
show(tree_global)
show(ebm_global)

## Dashboard: look at everything at once

In [11]:
# Do everything in one shot with the InterpretML Dashboard by passing a list into show

show([hist, lr_global, lr_perf, tree_global, tree_perf, ebm_global, ebm_perf], share_tables=True)