Merge pull request #72 from oegedijk/dev

Dev: version 0.3
oegedijk · Jan 27, 2021 · f58767a
2 parents e1e6254 + 056c80e
commit f58767a
Showing 46 changed files with 58,526 additions and 99,783 deletions.
diff --git a/.gitignore b/.gitignore
@@ -129,11 +129,9 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
-catboost_info/learn_error.tsv
-catboost_info/time_left.tsv
-catboost_info/learn/events.out.tfevents
-.vscode/settings.json
+catboost_info/*
 .vscode/settings.json
+
 scratch_notebook.ipynb
 scratch_import.py
 show_and_tell_draft.md
@@ -166,4 +164,5 @@ dashboard1.yaml
 dashboard2.yaml
 users.yaml
 users.json
+store_test.csv
 
diff --git a/.vscode/settings.json b/.vscode/settings.json
diff --git a/README.md b/README.md
@@ -22,7 +22,7 @@ a single [ExplainerHub](https://explainerdashboard.readthedocs.io/en/latest/hub.
 
  Examples deployed at: [titanicexplainer.herokuapp.com](http://titanicexplainer.herokuapp.com), 
  detailed documentation at [explainerdashboard.readthedocs.io](http://explainerdashboard.readthedocs.io), 
- example notebook on how to launch dashboard for different models [here](https://github.com/oegedijk/explainerdashboard/blob/master/dashboard_examples.ipynb), and an example notebook on how to interact with the explainer object [here](https://github.com/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb).
+ example notebook on how to launch dashboard for different models [here](notebooks/dashboard_examples.ipynb), and an example notebook on how to interact with the explainer object [here](notebooks/explainer_examples.ipynb).
 
  Works with `scikit-learn`, `xgboost`, `catboost`, `lightgbm` and others.
 
@@ -300,7 +300,6 @@ cats toggle will be hidden on every component that has one:
 ```python
 ExplainerDashboard(explainer, 
                     no_permutations=True, # do not show or calculate permutation importances
-                    hide_cats=True, # hide the group cats toggles
                     hide_depth=True, # hide the depth (no of features) dropdown
                     hide_sort=True, # hide sort type dropdown in contributions graph/table
                     hide_orientation=True, # hide orientation dropdown in contributions graph/table
@@ -336,9 +335,9 @@ ExplainerDashboard(explainer,
                     col='Fare', # initial feature in shap graphs
                     color_col='Age', # color feature in shap dependence graph
                     interact_col='Age', # interaction feature in shap interaction
-                    cats=False, # do not group categorical onehot features
                     depth=5, # only show top 5 features
                     sort = 'low-to-high', # sort features from lowest shap to highest in contributions graph/table
+                    cats_topx=3, # show only the top 3 categories for categorical features
                     cats_sort='alphabet', # sort categorical features alphabetically
                     orientation='horizontal', # horizontal bars in contributions graph
                     index='Rugg, Miss. Emily', # initial index to display
@@ -364,12 +363,12 @@ a few toggles:
 from explainerdashboard.custom import *
 
 class CustomDashboard(ExplainerComponent):
-    def __init__(self, explainer, **kwargs):
+    def __init__(self, explainer, name=None):
         super().__init__(explainer, title="Custom Dashboard")
-        self.confusion = ConfusionMatrixComponent(explainer,
+        self.confusion = ConfusionMatrixComponent(explainer, name=self.name+"cm",
                             hide_selector=True, hide_percentage=True,
                             cutoff=0.75)
-        self.contrib = ShapContributionsGraphComponent(explainer,
+        self.contrib = ShapContributionsGraphComponent(explainer, name=self.name+"contrib",
                             hide_selector=True, hide_cats=True, 
                             hide_depth=True, hide_sort=True,
                             index='Rugg, Miss. Emily')
@@ -452,17 +451,51 @@ or with waitress (also works on Windows):
     $ waitress-serve dashboard:app
 ```
 
-
+### Minimizing memory usage
+
+When you deploy a dashboard with a dataset with a large number of rows (`n`) and columns (`m`),
+the memory usage of the dashboard can be substantial. You can check the (approximate)
+memory usage with `explainer.memory_usage()`. In order to reduce the memory
+footprint there are a number of things you can do:
+
+1. Not including the shap interaction tab: shap interaction values have shape (`n*m*m`),
+    so they can take a substantial amount of memory.
+2. Setting a lower precision. By default shap values are stored as `'float64'`,
+    but you can store them as `'float32'` instead and save half the space:
+    ```ClassifierExplainer(model, X_test, y_test, precision='float32')```. You 
+    can also set a lower precision on your `X_test` dataset yourself, of course.
+3. For multiclass classifiers, `ClassifierExplainer` by default calculates
+    shap values for all classes. If you're only interested in a single class
+    you can drop the other shap values: `explainer.keep_shap_pos_label_only(pos_label)`
+4. Storing data externally. You can for example only store a subset of 10,000 rows in
+    the explainer itself (enough to generate importance and dependence plots),
+    and store the rest of your millions of rows of input data in an external file 
+    or database:
+    - with `explainer.set_X_row_func()` you can set a function that takes 
+        an `index` as argument and returns a single row dataframe with model
+        compatible input data for that index. This function can include a query
+        to a database or a file read. 
+    - with `explainer.set_y_func()` you can set a function that takes 
+        an `index` as argument and returns the observed outcome `y` for
+        that index.
+    - with `explainer.set_index_list_func()` you can set a function 
+        that returns a list of available indexes that can be queried.
+
+    Important: these functions can be called multiple times by multiple independent
+    components, so it is probably best to implement some kind of caching.
+    The functions you pass can also be methods, so you have access to all of the
+    internals of the explainer.
+
 
 ## Documentation
 
 Documentation can be found at [explainerdashboard.readthedocs.io](https://explainerdashboard.readthedocs.io/en/latest/).
 
-Example notebook on how to launch dashboards for different model types here: [dashboard_examples.ipynb](https://github.com/oegedijk/explainerdashboard/blob/master/dashboard_examples.ipynb).
+Example notebook on how to launch dashboards for different model types here: [dashboard_examples.ipynb](notebooks/dashboard_examples.ipynb).
 
-Example notebook on how to interact with the explainer object here: [explainer_examples.ipynb](https://github.com/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb).
+Example notebook on how to interact with the explainer object here: [explainer_examples.ipynb](notebooks/explainer_examples.ipynb).
 
-Example notebook on how to design a custom dashboard: [custom_examples.ipynb](https://github.com/oegedijk/explainerdashboard/blob/master/custom_examples.ipynb).
+Example notebook on how to design a custom dashboard: [custom_examples.ipynb](notebooks/custom_examples.ipynb).
 
 
 

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -1,5 +1,122 @@
 # Release Notes
 
+
+## 0.3.0:
+This is a major release and comes with lots of breaking changes to the lower level 
+`ClassifierExplainer` and `RegressionExplainer` API. The higher level `ExplainerComponent` and `ExplainerDashboard` API has not been
+changed however, except for the deprecation of the `cats` and `hide_cats` parameters.
+
+Explainers generated with version `explainerdashboard <= 0.2.20.1` will not work 
+with this version, so if you have stored explainers to disk you either have to 
+rebuild them with this new version, or downgrade back to `explainerdashboard==0.2.20.1`! 
+(hope you pinned your dependencies in production! ;)
+
+Main motivation for these breaking changes was to improve memory usage of the
+dashboards, especially in production. This led to the deprecation of the
+dual cats grouped/not grouped functionality of the dashboard. Once I had committed
+to that breaking change, I decided to clean up the entire API and do all the 
+needed breaking changes at once. 
+
+
+### Breaking Changes
+- onehot encoded features are now merged by default. This means that the `cats=True`
+    parameter has been removed from all explainer methods, and the `group cats` 
+    toggle has been removed from all `ExplainerComponents`. This saves both
+    on code complexity and memory usage. If you wish to see the individual
+    contributions of onehot encoded columns, simply don't pass them to the 
+    `cats` parameter upon construction.
+- Deprecated explainer attributes:
+    - `BaseExplainer`:
+        - `self.shap_values_cats`
+        - `self.shap_interaction_values_cats`
+        - `permutation_importances_cats`
+        - `self.get_dfs()`
+        - `formatted_contrib_df()`
+        - `self.to_sql()`
+        - `self.check_cats()` 
+        - `equivalent_col`
+    - `ClassifierExplainer`:
+        - `get_prop_for_label`
+
+- Naming changes to attributes:
+    - `BaseExplainer`:
+        - `importances_df()` -> `get_importances_df()`
+        - `feature_permutations_df()` -> `get_feature_permutations_df()`
+        - `get_int_idx(index)` -> `get_idx(index)`
+        - `contrib_df()` -> `get_contrib_df()` *
+        - `contrib_summary_df()` -> `self.get_summary_contrib_df()` *
+        - `interaction_df()` -> `get_interactions_df()` *
+        - `shap_values` -> `get_shap_values_df`
+        - `plot_shap_contributions()` -> `plot_contributions()`
+        - `plot_shap_summary()` -> `plot_importances_detailed()`
+        - `plot_shap_dependence()` -> `plot_dependence()`
+        - `plot_shap_interaction()` -> `plot_interaction()`
+        - `plot_shap_interaction_summary()` -> `plot_interactions_detailed()`
+        - `plot_interactions()` -> `plot_interactions_importance()`
+        - `n_features()` -> `n_features`
+        - `shap_top_interaction()` -> `top_shap_interactions` 
+        - `shap_interaction_values_by_col()` -> `shap_interactions_values_for_col()`
+    - `ClassifierExplainer`:
+        - `self.pred_probas` -> `self.pred_probas()`
+        - `precision_df()` -> `get_precision_df()` *
+        - `lift_curve_df()` -> `get_liftcurve_df()` *
+    - `RandomForestExplainer`/`XGBExplainer`:
+        - `decision_trees` -> `shadow_trees`
+        - `decisiontree_df()` -> `get_decisionpath_df()`
+        - `decisiontree_summary_df()` -> `get_decisionpath_summary_df()`
+        - `decision_path_file()` -> `decisiontree_file()`
+        - `decision_path()` -> `decisiontree()`
+        - `decision_path_encoded()` -> `decisiontree_encoded()`
+
+### New Features
+- new `Explainer` parameter `precision`: defaults to `'float64'`. Can be set to
+    `'float32'` to save on memory usage: `ClassifierExplainer(model, X, y, precision='float32')`
+- new `memory_usage()` method to show which internal attributes take the most memory.
+- for multiclass classifiers: `keep_shap_pos_label_only(pos_label)` method:
+    - drops shap values and shap interactions for all labels except `pos_label`
+    - this should significantly reduce memory usage for multi class classification
+        models.
+    - not needed for binary classifiers.
+- added `get_index_list()`, `get_X_row(index)`, and `get_y(index)` methods.
+    - these can be overridden with `.set_index_list_func()`, `.set_X_row_func()`
+        and `.set_y_func()`.
+    - by overriding these functions you can for example sample observations 
+        from a database or other external storage instead of from `X_test`, `y_test`.
+- added `Popout` buttons to all the major graphs that open a large modal
+    showing just the graph. This makes it easier to focus on a particular
+    graph without distraction from the rest of the dashboard and all its toggles.
+- added `max_cat_colors` parameter to `plot_importances_detailed`, `plot_dependence` and `plot_interactions_detailed`
+    - prevents plotting getting slow with categorical features with many categories.
+    - defaults to `5`
+    - can be set as `**kwarg` to `ExplainerDashboard`
+- adds category limits and sorting to `RegressionVsCol` component
+- adds property `X_merged` that gives a dataframe with the onehot columns merged.
+
+### Bug Fixes
+- shap dependence: when no point cloud, do not highlight!
+- Fixed bug with calculating contributions plot/table for whatif component,
+    when InputFeatures had not fully loaded, resulting in shap error.
+
+### Improvements
+- saving `X.copy()`, instead of using a reference to `X`
+    - this would result in more memory usage in development
+        though, so you can `del X_test` to save memory.
+- `ClassifierExplainer` only stores shap (interaction) values for the positive
+    class: shap values for the negative class are generated on the fly
+    by multiplying with `-1`.
+- encoding onehot columns as `np.int8` saving memory usage
+- encoding categorical features as `pd.category` saving memory usage
+- added base `TreeExplainer` class that `RandomForestExplainer` and `XGBExplainer` both derive from
+    - will make it easier to extend tree explainers to other models in the future
+        - e.g. catboost and lightgbm
+- got rid of the callable properties (that were there to ensure backward compatibility),
+    and replaced them with regular methods.
+
+### Other Changes
+-
+-
+
 ## 0.2.20.1:
 
 

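A back-of-the-envelope sketch of the memory arithmetic behind the 0.3.0 notes above, assuming shap values are stored as an `n x m` array and shap interaction values as an `n x m x m` array per label. The helper `estimate_shap_bytes` is hypothetical and not part of the library; the `memory_usage()` method mentioned in the notes reports the real figures.

```python
# Bytes per value for the two precisions mentioned in the release notes.
ITEMSIZE = {"float64": 8, "float32": 4}


def estimate_shap_bytes(n_rows, n_cols, precision="float64", n_labels=1):
    """Rough size in bytes of stored shap arrays (hypothetical helper)."""
    itemsize = ITEMSIZE[precision]
    return {
        # one value per row and column
        "shap_values": n_rows * n_cols * itemsize * n_labels,
        # one value per row and per pair of columns
        "shap_interaction_values": n_rows * n_cols * n_cols * itemsize * n_labels,
    }


full = estimate_shap_bytes(10_000, 50, precision="float64")
half = estimate_shap_bytes(10_000, 50, precision="float32")
print(full["shap_interaction_values"] / 1e6)  # 200.0 (MB)
print(half["shap_interaction_values"] / 1e6)  # 100.0 (MB)

# Keeping a single label instead of three cuts the estimate to a third.
three = estimate_shap_bytes(10_000, 50, n_labels=3)
print(three["shap_interaction_values"] // full["shap_interaction_values"])  # 3
```

Under these assumptions the interaction values dominate: with 10,000 rows and 50 columns they take 50 times the space of the plain shap values, which is why dropping the interaction tab, halving the precision, or keeping only one label each has a visible effect.
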
diff --git a/TODO.md b/TODO.md
@@ -1,16 +1,10 @@
 
 # TODO:
 
-## Bugs:
-- dash contributions reload bug: Exception: Additivity check failed in TreeExplainer!
-- shap dependence: when no point cloud, do not highlight!
-
-## Layout:
-- Find a proper frontender to help :)
+## Version 0.3:
+- check InlineExplainer 
 
-## dfs:
-- wrap shap values in pd.DataFrames?
-- wrap predictions in pd.Series?
+## Bugs:
 
 ## Plots:
 - make plot background transparent?
@@ -21,10 +15,6 @@
     - https://community.plotly.com/t/announcing-plotly-py-4-12-horizontal-and-vertical-lines-and-rectangles/46783
 - add some of these:
     https://towardsdatascience.com/introducing-shap-decision-plots-52ed3b4a1cba
-- shap dependence plot, sort categorical features by:
-    - alphabet
-    - number of obs
-    - mean abs shap
 
 ### Classifier plots:
 - move predicted and actual to outer layer of ConfusionMatrixComponent
@@ -36,23 +26,15 @@
 ### Regression plots:
 
 
-
 ## Explainers:
-- add get_X_row() and get_index_list() methods, and implement it throughout the dashboard.
-- minimize pd.DataFrame and np.array size:
-    - astype(float16), pd.category, etc
 - pass n_jobs to pdp_isolate
-- add option drop non-cats
 - add ExtraTrees and GradientBoostingClassifier to tree visualizers
 - add plain language explanations
     - could add a parameter `in_words=True` to the `explainer.plot_*` functions, in which 
         case instead of a plot the function returns a verbal description of the 
         relationship in the plot.
     - Then add an "in words" button to the components, that shows a popup with
         the verbal explanation.
-- rename RandomForestExplainer and XGBExplainer methods into something more logical
-    - Breaking change!
-
 
 ## notebooks:
 
@@ -85,16 +67,13 @@
 - add pos_label_name property to PosLabelConnector search
 - add "number of indexes" indicator to RandomIndexComponents for current restrictions
 - set equivalent_col when toggling cats in dependence/interactions
-
-- add width/height to components
 - whatif:
     - Add a constraints function to whatif component:
         - tests if current feature input is allowed
         - gives specific feedback when constraint broken
         - could build WhatIfComponentException for this?
     - Add sliders option to what if component
 
-
 ## Methods:
 - add support for SamplingExplainer, PartitionExplainer, PermutationExplainer, AdditiveExplainer
 - add support for LimeTabularExplainer:
@@ -110,13 +89,17 @@
 - write tests for explainer_plots
 
 ## Docs:
+- add memory savings to docs:
+    - memory_usage()
+    - keep_shap_pos_label_only()
+    - set_X_row_func, etc
 - add cats_topx cats_sort to docs
 - add hide_wizard and wizard to docs
 - add hide_poweredby to docs
 - add Docker deploy example (from issue)
 - document register_components no longer necessary
 - add new whatif parameters to README and docs
-- add section to README on storing and loading explainer/dashboard from file/config
+- add section to docs and README on storing and loading explainer/dashboard from file/config
 
 - retake screenshots of components as cards
 - Add type hints:
@@ -130,7 +113,6 @@
 ## Library level:
 - Make example heroku deployment repo
 - Make example heroku ExplainerHub repo
-- hide (prefix '_') to non-public API class methods
 - submit pull request to shap with broken test for 
     https://github.com/slundberg/shap/issues/723
 
diff --git a/docs/source/cli.rst b/docs/source/cli.rst
@@ -1,5 +1,5 @@
-``explainerdashboard`` CLI
-**************************
+explainerdashboard CLI
+**********************
 
 The library comes with an ``explainerdashboard`` command line tool (CLI) that
 you can use to build and run explainerdashboards from your terminal. 
@@ -23,7 +23,8 @@ from the command line by running::
 
     $ explainerdashboard run explainer.joblib
 
-Or to run on specific port, not launch a browser or show help::
+The CLI uses the ``waitress`` web server by default to run your dashboard.
+To run on a specific port, skip launching a browser, or show help::
 
     $ explainerdashboard run explainer.joblib --port 8051
     $ explainerdashboard run explainer.joblib --no-browser