
Problem with 'drop' column transformer #1

Closed
carvalhomb opened this issue Oct 9, 2020 · 2 comments

Comments

@carvalhomb

(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:


import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'b': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
        'c': [123, 145, 100, np.nan, np.nan, 150],
        'd': [10, np.nan, 30, np.nan, np.nan, 20],
        'e': [14, np.nan, 29, np.nan, 52, 45],
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)

new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))

Currently, the above code will output the following:

Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

But columns c, d, and e (the numerical_features) should have been dropped.

If I alter line 91 of feature_importance.py (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:

if transformer_name == 'remainder' and transformer == 'drop':

to:

if transformer == 'drop':

then I get the expected result:

Shape new dataframe: (6, 9)
Length new cols: 9
New cols: ['a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

I'm not sure, though, how this change affects the rest of the class's functionality.

@kylegilde
Owner

Thanks for pointing that out! I didn't know you could drop columns that way. I added that change to the new version I just pushed, which also has some more attributes to help you understand what the pipeline did.

@kylegilde
Owner

great, thanks!
