diff --git a/.github/workflows/explainerdashboard.yml b/.github/workflows/explainerdashboard.yml
index cff7d0c..c56c899 100644
--- a/.github/workflows/explainerdashboard.yml
+++ b/.github/workflows/explainerdashboard.yml
@@ -41,4 +41,4 @@ jobs:
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
       run: |
-        pytest
+        pytest -k "not selenium"
diff --git a/README.md b/README.md
index de074ab..0f8ade6 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@
 that explains the workings of a (scikit-learn compatible) machine learning model.
 The dashboard provides interactive plots on model performance, feature importances,
 feature contributions to individual predictions, "what if" analysis,
-partial dependence plots, SHAP (interaction) values, visualisation of individual
+partial dependence plots, SHAP (interaction) values, visualization of individual
 decision trees, etc.
 
 You can also interactively explore components of the dashboard in a
@@ -218,23 +218,25 @@ There are a few tricks to make this less painful:
    values can be very slow to calculate, and often are not needed for analysis.
    For permutation importances you can set the `n_jobs` parameter to speed up
    the calculation in parallel.
-2. Storing the explainer. The calculated properties are only calculated once
+2. Calculating approximate shap values. You can pass `approximate=True` on to the shap
+   library by passing `shap_kwargs=dict(approximate=True)` to the explainer initialization.
+3. Storing the explainer. The calculated properties are only calculated once
    for each instance, however each time when you instantiate a new explainer
    instance they will have to be recalculated. You can store them with
    `explainer.dump("explainer.joblib")` and load with e.g.
    `ClassifierExplainer.from_file("explainer.joblib")`. All calculated properties
    are stored along with the explainer.
-3. Using a smaller (test) dataset, or using smaller decision trees.
+4. Using a smaller (test) dataset, or using smaller decision trees.
    TreeShap computational complexity is `O(TLD^2)`, where `T` is the number of trees,
    `L` is the maximum number of leaves in any tree
    and `D` the maximal depth of any tree. So reducing the number of leaves or average
    depth in the decision tree can really speed up SHAP calculations.
-4. Pre-computing shap values. Perhaps you already have calculated the shap values
+5. Pre-computing shap values. Perhaps you already have calculated the shap values
    somewhere, or you can calculate them off on a giant cluster somewhere, or your
    model supports [GPU generated shap values](https://github.com/rapidsai/gputreeshap).
    You can simply add these pre-calculated shap values to the explainer with
    `explainer.set_shap_values()` and `explainer.set_shap_interaction_values()` methods.
-5. Plotting only a random sample of points. When you have a lots of observations,
+6. Plotting only a random sample of points. When you have a lot of observations,
    simply rendering the plots may get slow as well. You can pass the `plot_sample`
    parameter to render a (different each time) random sample of observations
    for the various scatter plots in the dashboard. E.g.:
@@ -536,7 +538,7 @@ In order to reduce the memory footprint there are a number of things you can do:
 2. Setting a lower precision. By default shap values are stored as `'float64'`,
     but you can store them as `'float32'` instead and save half the space:
     ```ClassifierExplainer(model, X_test, y_test, precision='float32')```. You
-    can also set a lower precision on your `X_test` dataset yourself ofcourse.
+    can also set a lower precision on your `X_test` dataset yourself of course.
 3. For multi class classifier, by default `ClassifierExplainer` calculates
     shap values for all classes. If you're only interested in a single class
     you can drop the other shap values: `explainer.keep_shap_pos_label_only(pos_label)`
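For reference, a minimal usage sketch (not part of the diff) of how the speed and memory tricks touched by this README change could be combined. It assumes the `explainerdashboard` API as quoted in the README hunks above (`ClassifierExplainer`, `ExplainerDashboard`, `shap_kwargs`, `precision`, `dump`/`from_file`, `plot_sample`); `model`, `X_test`, `y_test` and the sample size of 1000 are placeholders:

```python
# Hypothetical sketch based on the README tricks above; `model`, `X_test` and
# `y_test` stand in for a fitted scikit-learn compatible classifier and its test set.
from explainerdashboard import ClassifierExplainer, ExplainerDashboard

explainer = ClassifierExplainer(
    model, X_test, y_test,
    shap_kwargs=dict(approximate=True),  # forward approximate=True to shap (new README item 2)
    precision='float32',                 # store shap values at half the default precision
)

# Cache the calculated properties so they are not recomputed on the next run:
explainer.dump("explainer.joblib")
explainer = ClassifierExplainer.from_file("explainer.joblib")

# Render scatter plots on a random sample of observations to keep the dashboard responsive:
ExplainerDashboard(explainer, plot_sample=1000).run()
```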