
Feature request: sklearn.pipeline.Pipeline model type #11

Closed
raybellwaves opened this issue Nov 11, 2020 · 4 comments

Comments

@raybellwaves
Contributor

Whenever possible I try to use sklearn's Pipeline to log my transformations.

It would be great if explainerdashboard could work with these https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

@raybellwaves
Contributor Author

May need to use something like this:

https://stackoverflow.com/a/28837740/6046019

@oegedijk
Owner

Oh, interesting, I will have to have a look at how that would work. The problem is that I need to pass the final model and the final input features to the shap explainer.

But what could work is to put the data through the whole pipeline except the final model and store that as the input data, then take out the final model and store it, and then compute the shap values. Could work!

@oegedijk
Owner

So I could build something like this into the Explainer to support pipelines: take all steps except the last one and use them to transform the input X, then take the final step of the pipeline and extract the model.

However, this makes the strong assumption that the columns of the transformed X are the same as the input X.columns. That is not true in general (e.g. a OneHotEncoder adds additional columns). And in general sklearn transformers output numpy arrays instead of dataframes, which makes it tricky to assign column names...

Any idea on how best to handle this in order to support Pipelines?

import pandas as pd

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from explainerdashboard import ClassifierExplainer

X, y = load_iris(return_X_y=True, as_frame=True)

pipe = Pipeline(steps=[
    ('standardscale', StandardScaler()),
    ('model', RandomForestClassifier())]).fit(X, y)

def split_pipeline(pipeline, X):
    # Run X through every step except the final estimator; the slice
    # pipeline[:-1] keeps the already-fitted transformers. Assumes the
    # transformers neither add, drop nor reorder columns.
    X_transformed = pd.DataFrame(
        pipeline[:-1].transform(X), columns=X.columns, index=X.index)
    # The fitted model is the estimator in the last step.
    model = pipeline.steps[-1][1]
    return X_transformed, model

Xt, model = split_pipeline(pipe, X)
model.predict(Xt)

explainer = ClassifierExplainer(model, Xt, y)

@oegedijk
Owner

So getting the feature names of the transformed dataframe seems to be a still-unresolved issue in sklearn (although multiple SLEPs have been proposed to deal with it).

For now I have added support for Pipelines as long as they do not add, remove or reorder any columns of the input dataframe (next release). Once sklearn.Pipeline supports a proper Pipeline.get_feature_names() method, explainerdashboard will pick it up automatically. (Which means you can also monkeypatch it in already if you want to use more complicated Pipelines that generate new columns, such as those involving OneHotEncoder.)
