
Merge pull request #82 from oegedijk/dev
version 0.3.2 is ready!
oegedijk committed Feb 25, 2021
2 parents c72305f + 2c50926 commit 3884b8e
Showing 29 changed files with 890 additions and 261 deletions.
26 changes: 22 additions & 4 deletions README.md
@@ -107,6 +107,7 @@ model.fit(X_train, y_train)
explainer = ClassifierExplainer(model, X_test, y_test,
                cats=['Deck', 'Embarked',
                      {'Gender': ['Sex_male', 'Sex_female', 'Sex_nan']}],
                cats_notencoded={'Embarked': 'Stowaway'}, # defaults to 'NOT_ENCODED'
                descriptions=feature_descriptions, # defaults to None
                labels=['Not survived', 'Survived'], # defaults to ['0', '1', etc]
                idxs=test_names, # defaults to X.index
@@ -116,7 +117,7 @@ explainer = ClassifierExplainer(model, X_test, y_test,

db = ExplainerDashboard(explainer,
                        title="Titanic Explainer", # defaults to "Model Explainer"
                        whatif=False, # you can switch off tabs with bools
                        shap_interaction=False, # you can switch off tabs with bools
                        )
db.run(port=8050)
```
@@ -184,6 +185,11 @@ There are a few tricks to make this less painful:
number of trees, `L` is the maximum number of leaves in any tree and
`D` the maximal depth of any tree. So reducing the number of leaves or average
depth in the decision tree can really speed up SHAP calculations.
4. Plotting only a random sample of points. When you have a lot of observations,
simply rendering the plots may get slow as well. You can pass the `plot_sample`
parameter to render only a random sample of observations (a different sample
each time) in the various scatter plots in the dashboard. E.g.:
`ExplainerDashboard(explainer, plot_sample=1000).run()`

## Launching from within a notebook

@@ -345,9 +351,12 @@ ExplainerDashboard(explainer,
                pdp_col='Fare', # initial pdp feature
                cutoff=0.8, # cutoff for classification plots
                round=2, # rounding to apply to floats
                show_metrics=['accuracy', 'f1', custom_metric], # only show certain metrics
                plot_sample=1000, # only display 1000 random markers in scatter plots
                )
```
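
Here `custom_metric` can be any function of the form `metric_func(y_true, y_pred)` (a minimal sketch, using the example metric from the docs):

```python
import numpy as np

def custom_metric(y_true, y_pred):
    # example custom metric: difference between mean label and mean prediction
    return np.mean(y_true) - np.mean(y_pred)
```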


### Designing your own layout

All the components in the dashboard are modular and re-usable, which means that
@@ -456,8 +465,10 @@ or with waitress (also works on Windows):

When you deploy a dashboard with a dataset with a large number of rows (`n`) and columns (`m`),
the memory usage of the dashboard can be substantial. You can check the (approximate)
memory usage with `explainer.memory_usage()`. In order to reduce the memory
footprint there are a number of things you can do:
memory usage with `explainer.memory_usage()`. (As a side note: if you have lots
of rows, you probably want to set the `plot_sample` parameter as well.)

In order to reduce the memory footprint there are a number of things you can do:

1. Not including the shap interaction tab: shap interaction values have shape (`n*m*m`),
so can take a substantial amount of memory.
@@ -480,7 +491,14 @@ footprint there are a number of things you can do:
and `index` as argument and returns the observed outcome `y` for
that index.
- with `explainer.set_index_list_func()` you can set a function
that returns a list of available indexes that can be queried.
that returns a list of available indexes that can be queried. Only gets
called upon start of the dashboard.

If you have a very large number of indexes and the user is able to look
them up elsewhere, you can also replace the index dropdowns with a simple free
text field with `index_dropdown=False`. Only valid indexes (i.e. those in the
`get_index_list()` list) get propagated to other components by default, but
this can be overridden with `index_check=False`.
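
For example (a minimal sketch, assuming a fitted `explainer`):

```python
ExplainerDashboard(explainer,
                   index_dropdown=False, # free text field instead of dropdowns
                   index_check=False,    # forward input even when index not found
                   ).run()
```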

Important: these functions can be called multiple times by multiple independent
components, so it is probably best to implement some kind of caching functionality.
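
For example, a minimal sketch that caches the index list with `functools.lru_cache` (here `fetch_index_names` is a hypothetical stand-in for your own database query):

```python
from functools import lru_cache

def fetch_index_names():
    # hypothetical: replace with e.g. a SQL query against your database
    return ["Passenger_1", "Passenger_2", "Passenger_3"]

@lru_cache(maxsize=1)
def cached_index_list():
    # cached, so repeated calls from independent components only query once
    return fetch_index_names()

explainer.set_index_list_func(cached_index_list)
```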
62 changes: 62 additions & 0 deletions RELEASE_NOTES.md
@@ -1,5 +1,67 @@
# Release Notes

## Version 0.3.2:

Highlights:
- Control what metrics to show or use your own custom metrics using `show_metrics`
- Set the naming for onehot features with all `0`s with `cats_notencoded`
- Speed up plots by displaying only a random sample of markers in scatter plots with `plot_sample`.
- Make index selection a free text field with `index_dropdown=False`

### New Features
- new parameter `show_metrics` for `explainer.metrics()`, `ClassifierModelSummaryComponent`
and `RegressionModelSummaryComponent`:
- pass a list of metrics and only display those metrics in that order
- you can also pass custom scoring functions as long as they
are of the form `metric_func(y_true, y_pred)`: `show_metrics=[metric_func]`
- For `ClassifierExplainer`, what is passed to the custom metric function
depends on whether the function takes the additional parameters `cutoff`
and `pos_label`. If these are not arguments, then `y_true=self.y_binary(pos_label)`
and `y_pred=np.where(self.pred_probas(pos_label)>cutoff, 1, 0)`.
Otherwise the raw `self.y` and `self.pred_probas` are passed for the
custom metric function to do something with.
- custom functions are also stored to `dashboard.yaml` and imported upon
loading with `ExplainerDashboard.from_config()`
- new parameter `cats_notencoded`: a dict to indicate how to name the value
of a onehotencoded feature when all onehot columns equal 0. Defaults
to `'NOT_ENCODED'`, but can be adjusted with this parameter. E.g.
`cats_notencoded=dict(Deck="Deck not known")`.
- new parameter `plot_sample` to only plot a random sample in the various
scatter plots. When you have a large dataset, this may significantly
speed up various plots without sacrificing much in expressiveness:
`ExplainerDashboard(explainer, plot_sample=1000).run()`
- new parameter `index_dropdown=False` will replace the index dropdowns with a
free text field. This can be useful when you have a lot of potential indexes,
and the user is expected to know the index string.
Input will be checked for validity with `explainer.index_exists(index)`,
and the field indicates when the input index does not exist. Invalid indexes
will not be forwarded to other components, unless you also set `index_check=False`.
- adds mean absolute percentage error to the regression metrics. If it is too
large, a warning will be printed. It can be excluded with the new `show_metrics`
parameter.
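
Taken together, a minimal sketch of the new parameters (assuming a fitted `explainer` and a `custom_metric(y_true, y_pred)` function):

```python
ExplainerDashboard(explainer,
                   show_metrics=['accuracy', custom_metric], # only these metrics, in this order
                   cats_notencoded=dict(Deck="Deck not known"), # label for all-zero onehot rows
                   plot_sample=1000, # scatter plots render 1000 randomly sampled markers
                   index_dropdown=False, # free text index field instead of dropdowns
                   ).run()
```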

### Bug Fixes
- `get_classification_df` added to `ClassificationComponent` dependencies.

### Improvements
- accepting a single-column `pd.DataFrame` for `y`, and automatically converting
it to a `pd.Series`
- if the WhatIf `FeatureInputComponent` detects the presence of missing onehot features
(i.e. rows where all columns of the onehotencoded feature equal 0), it
adds `'NOT_ENCODED'` or the matching value from `cats_notencoded` to the
dropdown options.
- Generating the `name` parameter for `ExplainerComponents` for which no
name is given is now done with a deterministic process instead of a random
`uuid`. This should help with scaling custom dashboards across cluster
deployments. Also drops the `shortuuid` dependency.
- `ExplainerDashboard` now prints out the local ip address when starting the dashboard.
- `get_index_list()` is only called once, upon starting the dashboard.

### Other Changes

## Version 0.3.1:
This version is mostly about pre-calculating and optimizing the classifier statistics
components. Those components should now be much more responsive with large datasets.
13 changes: 6 additions & 7 deletions TODO.md
@@ -1,10 +1,10 @@

# TODO


## Bugs:

## Plots:
- add sample_size parameter for shap dependence plot with large dataset
- make plot background transparent?
- Only use ScatterGl above a certain cutoff
- separate standard shap plots for shap_interaction plots
@@ -20,11 +20,8 @@
- new method?

### Regression plots:
- add plot_sample parameter

## Explainers:
- add show_metrics parameter to ``metrics`` and ``ModelSummaryComponent``.
- add metrics, classification_df, roc_auc_curve, pr_auc_curve, etc. to calculate_properties()
- pass n_jobs to pdp_isolate
- add ExtraTrees and GradientBoostingClassifier to tree visualizers
- add plain language explanations
@@ -38,6 +35,7 @@


## Dashboard:
- make poweredby right align
- more flexible instantiate_component:
- no explainer needed (if explainer component detected, pass otherwise ignore)
- add TablePopout
@@ -59,9 +57,11 @@


### Components
- autodetect when uuid names get rendered and issue a warning

- add predictions list to whatif composite:
- https://github.com/oegedijk/explainerdashboard/issues/85
- add circular callbacks to cutoff - cutoff percentile
- Add side-by-side option to cutoff selector component
- add `index_dropdown=True` parameter. Alternative: free entry input.
- add filter to index selector using pattern matching callbacks:
- https://dash.plotly.com/pattern-matching-callbacks
- add querystring method to ExplainerComponents
@@ -95,7 +95,6 @@
- to explainer class methods
- to explainer_methods
- to explainer_plots
- Add pydata video when it comes online (January 4th)


## Library level:
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -60,7 +60,7 @@
autodoc_mock_imports = ['matplotlib', 'np', 'dash', 'dash_bootstrap_components',
'dash_html_components', 'dash_table', 'dash_core_components',
'dtreeviz', 'numpy', 'pandas', 'pd',
'sklearn', 'shap', 'plotly', 'shortuuid',
'sklearn', 'shap', 'plotly',
'joblib', 'dash_auth', 'jupyter_dash', 'oyaml', 'click',
'flask', 'flask_simplelogin', 'werkzeug']

53 changes: 53 additions & 0 deletions docs/source/custom.rst
@@ -84,6 +84,7 @@ cats toggle will be hidden on every component that has one::

ExplainerDashboard(explainer,
no_permutations=True, # do not show or calculate permutation importances
hide_popout=True, # hide the 'popout' button for each graph
hide_poweredby=True, # hide the 'powered by: explainerdashboard' footer
hide_popout=True, # hide the 'popout' button from each graph
hide_depth=True, # hide the depth (no of features) dropdown
@@ -126,9 +127,61 @@ Some examples of useful parameters to pass::
pdp_col='Fare', # initial pdp feature
cutoff=0.8, # cutoff for classification plots
round=2, # round floats to 2 digits
show_metrics=['accuracy', 'f1', custom_metric], # only show certain metrics
plot_sample=1000, # only display 1000 random markers in scatter plots
)

Using custom metrics
====================

By default the dashboard shows a number of metrics for classifiers (accuracy, etc)
and regression models (R-squared, etc). You can control which metrics are shown
and in what order by passing ``show_metrics``::

ExplainerDashboard(explainer, show_metrics=['accuracy', 'f1', 'recall']).run()

However, you can also define custom metric functions yourself, as long as they
take ``y_true`` and ``y_pred`` as parameters::

def custom_metric(y_true, y_pred):
    return np.mean(y_true) - np.mean(y_pred)

ExplainerDashboard(explainer, show_metrics=['accuracy', custom_metric]).run()

For ``ClassifierExplainer``, ``y_true`` and ``y_pred`` will have already been
calculated as arrays of ``1``'s and ``0``'s, depending on the ``pos_label`` and
``cutoff`` that was passed to ``explainer.metrics()``. However, if you take
``pos_label`` and ``cutoff`` as parameters to the custom metric function, then
you will get the unprocessed raw labels and ``pred_probas``. So, for example,
you could calculate a total cost over the confusion matrix as a custom metric.
Then the following metrics would all work and give equivalent results::

from sklearn.metrics import confusion_matrix

def cost_metric(y_true, y_pred):
    cost_matrix = np.array([[10, -50], [-20, 10]])
    cm = confusion_matrix(y_true, y_pred)
    return (cost_matrix * cm).sum()

def cost_metric2(y_true, y_pred, cutoff):
    return cost_metric(y_true, np.where(y_pred > cutoff, 1, 0))

def cost_metric3(y_true, y_pred, pos_label):
    return cost_metric(np.where(y_true==pos_label, 1, 0), y_pred[:, pos_label])

def cost_metric4(y_true, y_pred, cutoff, pos_label):
    return cost_metric(np.where(y_true==pos_label, 1, 0),
                       np.where(y_pred[:, pos_label] > cutoff, 1, 0))

explainer.metrics(show_metrics=[cost_metric, cost_metric2, cost_metric3, cost_metric4])

.. note::
When storing a dashboard with ``ExplainerDashboard.to_yaml()``, the custom metric
functions will be stored to the ``.yaml`` file with a reference to their name
and module. So when loading the dashboard with ``from_config()`` you have to
make sure the metric function can be found under the same name in the same
module (which could be ``__main__``), otherwise the dashboard will fail to load.
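
For example (a minimal sketch; the filenames are illustrative)::

    # store the dashboard config; dump_explainer also stores the explainer itself
    db = ExplainerDashboard(explainer, show_metrics=['accuracy', custom_metric])
    db.to_yaml("dashboard.yaml", explainerfile="explainer.joblib", dump_explainer=True)

    # in a new process: custom_metric must be importable under the same
    # name from the same module for the dashboard to load
    db2 = ExplainerDashboard.from_config("dashboard.yaml")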

Building custom layout
======================

18 changes: 17 additions & 1 deletion docs/source/explainers.rst
@@ -97,6 +97,21 @@ You can now use these categorical features directly as input for plotting method
``explainer.plot_dependence("Deck")``, which will now generate violin plots
instead of the default scatter plots.

cats_notencoded
---------------

When you have onehotencoded a categorical feature, you may have dropped some columns
during feature selection, or there may be new categories in the test set that were
not encoded as columns in the training set. In such cases all columns in your onehot
encoding may be equal to ``0`` for some rows. By default the value assigned to the
aggregated feature for such cases is ``'NOT_ENCODED'``, but this can be overridden
with the ``cats_notencoded`` parameter::

ClassifierExplainer(model, X, y,
    cats=[{'Gender': ['Sex_male', 'Sex_female']}, 'Deck', 'Embarked'],
    cats_notencoded={'Gender': 'Gender Other', 'Deck': 'Unknown Deck', 'Embarked': 'Stowaway'})



idxs
----

@@ -123,7 +138,7 @@ but you can also pass it explicitly, e.g.: ``index_name="Passenger"``.
descriptions
------------

``descriptions`` can be passed as a dictionary of descriptions for each variable.
``descriptions`` can be passed as a dictionary of descriptions for each feature.
In order to be explanatory, you often have to explain the meaning of the features
themselves (especially if the naming is not obvious).
Passing the dict along to descriptions will show hover-over tooltips for the
@@ -136,6 +151,7 @@ the ``cats`` parameter, you can also give descriptions of these groups, e.g::
'Gender': 'Gender of the passenger',
'Fare': 'The price of the ticket paid for by the passenger',
'Deck': 'The deck of the cabin of the passenger',
'Age': 'Age of the passenger in years'
})


