
Feature request: sklearn.pipeline.Pipeline model type #11

Closed
raybellwaves opened this issue Nov 11, 2020 · 4 comments

Comments

@raybellwaves
Contributor

Whenever possible I try to use sklearn's Pipeline to log my transformations.

It would be great if explainerdashboard could work with these https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

@raybellwaves
Contributor Author

May need to use something like this:

https://stackoverflow.com/a/28837740/6046019

@oegedijk
Owner

Oh, interesting, I will have to have a look at how that would work. The problem is that I need to pass the final model and the final input features to the shap explainer.

But what could work is to put the data through the whole pipeline except the final model and store that as the input data, then take out the final model and store it, and then compute the shap values. Could work!

@oegedijk
Owner

So I could build something like this into the Explainer to support pipelines: take all steps except the last one and use them to transform the input X, then take the final step of the pipeline and extract the model.

However, this makes the strong assumption that the columns of the transformed X are the same as the input X.columns. That is not true in general (e.g. a OneHotEncoder adds additional columns). And in general sklearn transformers output numpy arrays instead of dataframes, which makes it tricky to assign column names...

Any idea on how best to handle this in order to support Pipelines?

import pandas as pd

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from explainerdashboard import ClassifierExplainer

X, y = load_iris(return_X_y=True, as_frame=True)

pipe = Pipeline(steps=[
    ('standardscale', StandardScaler()),
    ('model', RandomForestClassifier())]).fit(X, y)

def split_pipeline(pipeline, X):
    # Run X through every step except the final estimator; the slice
    # pipeline[:-1] keeps the already-fitted transformers. Assumes the
    # transformers neither add, drop nor reorder columns.
    X_transformed = pd.DataFrame(
        pipeline[:-1].transform(X), columns=X.columns, index=X.index)
    # The fitted model is the estimator in the last step.
    model = pipeline.steps[-1][1]
    return X_transformed, model

Xt, model = split_pipeline(pipe, X)
model.predict(Xt)

explainer = ClassifierExplainer(model, Xt, y)

@oegedijk
Owner

So getting the feature names of the transformed dataframe seems to be a still-unresolved issue in sklearn (although multiple SLEPs have been proposed to deal with it).

For now I have added support for Pipelines as long as they do not add, remove or reorder any columns of the input dataframe (next release). Once sklearn.Pipeline supports a proper Pipeline.get_feature_names() method, explainerdashboard will pick it up automatically. (Which means you can also monkeypatch it in already if you want to use more complicated Pipelines that generate new columns, such as those involving OneHotEncoder.)
