Regression Example Machine Learning Dashboard
For this tutorial, let us use the ‘Diabetes’ toy dataset from the sklearn library to build a Machine Learning dashboard for a regression problem.
We need the Pandas library for the DataFrame, while the ExplainerDashboard and Dash Bootstrap libraries are used to build the dashboard. The sklearn library provides the toy dataset, the train-test split, and the RandomForestRegressor used to train the model in this regression example.

In [1]:
#Importing Libraries & Packages
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from explainerdashboard import RegressionExplainer, ExplainerDashboard

Importing the dataset

In [2]:
#Import the Diabetes Dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
#print the dataset
data

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

Loading the dataset
We now load the dataset into the variables X and y as Pandas DataFrames. X holds the features and y holds the target values.

In [4]:
#create a DataFrame from the dataset
X = pd.DataFrame(data.data, columns=data.feature_names)
#Printing first five rows of the DataFrame
X.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [6]:
#Load target values in y
y = pd.DataFrame(data.target, columns=["target"])
y.head()

,target
0,151.0
1,75.0
2,141.0
3,206.0
4,135.0
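
As an aside, the two loading steps above can probably be collapsed into one call. The following is a minimal sketch, assuming scikit-learn 0.23 or newer (where `as_frame` is available); the names `X_alt` and `y_alt` are only illustrative and are not used elsewhere in this tutorial. Note that here the target comes back as a Pandas Series rather than a one-column DataFrame.

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns (features, target) instead of a Bunch;
# as_frame=True returns Pandas objects instead of NumPy arrays.
X_alt, y_alt = load_diabetes(return_X_y=True, as_frame=True)

print(X_alt.shape)          # number of rows and feature columns
print(list(X_alt.columns))  # feature names
print(y_alt.head())         # first five target values (a Series)
```

Because `y_alt` is already one-dimensional, it could be passed to a model's `fit` method without any reshaping.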


Now our data is ready, and we can split it and train the model using RandomForestRegressor.

Splitting the dataset
Let us split the dataset in an 80–20 ratio using the train_test_split function from sklearn.

In [7]:
#Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(353, 10) (353, 1) (89, 10) (89, 1)
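
The shapes printed above can be checked by hand. The sketch below reproduces the arithmetic, assuming (as scikit-learn does for a fractional `test_size`, to the best of my knowledge) that the test count is rounded up and the remainder goes to the training set. The total of 442 rows is simply 353 + 89 from the output above.

```python
import math

n_samples = 442   # total rows: 353 training rows + 89 test rows
test_size = 0.2   # the fraction passed to train_test_split

# 0.2 * 442 = 88.4, which is rounded up to a whole number of test rows
n_test = math.ceil(test_size * n_samples)
# everything that is not in the test set is used for training
n_train = n_samples - n_test

print(n_train, n_test)
```

So the split is only approximately 80–20, because 442 is not divisible by 5.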


Training the model
We can now train the model with RandomForestRegressor, using arbitrarily chosen values for the number of estimators and the maximum tree depth. You can also try different values, or use XGBoost instead, and compare the results.

In [8]:
#Training the model
model = RandomForestRegressor(n_estimators=50, max_depth=5)
model.fit(X_train, y_train.values.ravel())
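
To compare different settings as suggested above, one option is to score each candidate on the held-out test set. This is only a sketch: the candidate values (10, 50, 200), the choice of R² as the metric, and the fixed `random_state=0` are my own assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load as NumPy arrays; y is already 1d here, so no ravel() is needed
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

scores = {}
for n_estimators in (10, 50, 200):  # illustrative candidate values
    candidate = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=5, random_state=0
    )
    candidate.fit(X_train, y_train)
    # R^2 on the test set: higher is better, 1.0 is a perfect fit
    scores[n_estimators] = r2_score(y_test, candidate.predict(X_test))

for n_estimators, score in scores.items():
    print(f"n_estimators={n_estimators}: R^2 = {score:.3f}")
```

With a test set of only 89 rows, small differences between the scores should not be over-interpreted.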

Note: We call ‘ravel()’ to convert ‘y_train’ to a 1d array in this step. Flattening the column vector y avoids the DataConversionWarning that RandomForestRegressor would otherwise raise.
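
To see what the reshaping in the note does, here is a small sketch on a made-up three-row DataFrame (the name `y_frame` is only illustrative; the values are the first three targets shown earlier).

```python
import pandas as pd

# A one-column DataFrame, like y in this tutorial
y_frame = pd.DataFrame({"target": [151.0, 75.0, 141.0]})

column_vector = y_frame.values   # 2d array with a single column
flat = y_frame.values.ravel()    # flattened to a 1d array

print(column_vector.shape)
print(flat.shape)
```

The first shape has two dimensions (rows and one column), while the second has only one, which is the form the regressor expects for its target.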

Setting up the Dashboard instance using the trained model

In [11]:
explainer = RegressionExplainer(model, X_test, y_test)
#Start the Dashboard
db = ExplainerDashboard(explainer,title="Diabetes Prediction",whatif=False)
#Running the app on a local port 3050
db.run(port=3050)

Changing class type to RandomForestRegressionExplainer...
Generating self.shap_explainer = shap.TreeExplainer(model)
Building ExplainerDashboard..
Detected notebook environment, consider setting mode='external', mode='inline' or mode='jupyterlab' to keep the notebook interactive while the dashboard is running...
Generating layout...
Calculating shap values...
Calculating predictions...
Calculating residuals...
Calculating absolute residuals...
Calculating shap interaction values...
Reminder: TreeShap computational complexity is O(TLD^2), where T is the number of trees, L is the maximum number of leaves in any tree and D the maximal depth of any tree. So reducing these will speed up the calculation.
Calculating dependencies...
Calculating importances...
Calculating ShadowDecTree for each individual decision tree...
Reminder: you can store the explainer (including calculated dependencies) with explainer.dump('explainer.joblib') and reload with e.g. ClassifierExplainer.from_file('explaine

ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=3050): Max retries exceeded with url: /_alive_4a0c3984-3ced-4e8b-803b-879df7091de3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001EBBA4FB100>: Failed to establish a new connection: [WinError 10049] The requested address is not valid in its context'))

Insights from the Dashboard
With this dashboard, we can get insights such as:

SHAP values, which indicate how each individual feature affects the prediction
Permutation importances, which show how much the model performance deteriorates when a feature is shuffled
For a regression model using XGBoost or RandomForestRegressor, as in this tutorial, we can visualize the individual decision trees, whereas for classifier models we can get the confusion matrix, ROC-AUC curves, etc. to understand the model's decisions better.
What-If analysis (if enabled when starting the dashboard; it is turned off here with whatif=False), which helps us understand how the model's behavior changes when we modify the features or parts of the data. It also allows us to compare different models.
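
As a rough illustration of the permutation importance idea from the list above, the sketch below uses scikit-learn's own `permutation_importance` function. This is not necessarily how the dashboard computes its values internally; the `n_repeats=10` and `random_state=0` settings are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature column 10 times and record how much the
# model's score (R^2 for regressors) drops on the test set.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Sort features from the largest mean score drop to the smallest
ranked = sorted(
    zip(X.columns, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, drop in ranked:
    print(f"{name}: {drop:.3f}")
```

A feature whose shuffling causes a large drop in the score is one the model relies on heavily; values near zero suggest the model barely uses that feature.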
However, it also helps to have a basic understanding of the above plots and their parameters to make sense of the insights from such a machine learning dashboard. For detailed information on the theory behind this tutorial, I would recommend the book ‘Interpretable Machine Learning’ by Christoph Molnar.