
Problem with 'drop' column transformer #1

Closed
carvalhomb opened this issue Oct 9, 2020 · 2 comments

Comments

@carvalhomb

(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:


import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'b': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
        'c': [123, 145, 100, np.nan, np.nan, 150],
        'd': [10, np.nan, 30, np.nan, np.nan, 20],
        'e': [14, np.nan, 29, np.nan, 52, 45],
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)

new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))

Currently, the above code will output the following:

Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

But columns c, d, and e (the numerical_features) should have been dropped.

If I alter line 91 of feature_importance.py (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:

if transformer_name == 'remainder' and transformer == 'drop':

to:

if transformer == 'drop':

then I get the expected result:

Shape new dataframe: (6, 9)
Length new cols: 9
New cols: ['a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

I'm not sure, though, how this change affects the rest of the class's functionality.

@kylegilde
Owner

Thanks for pointing that out! I didn't know you could drop columns that way. I added that change to the new version I just pushed, which also has some more attributes to help you understand what the pipeline did.

@kylegilde
Owner

great, thanks!
