(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_importance as fi
# np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0
data = {'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'b': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
        'c': [123, 145, 100, np.nan, np.nan, 150],
        'd': [10, np.nan, 30, np.nan, np.nan, 20],
        'e': [14, np.nan, 29, np.nan, 52, 45],
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']
# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])
# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])
# Fit the preprocessor
fitted_pp = preproc.fit(df)
# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))
# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))
Currently, the above code will output the following:
Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']
But cols c, d and e (numerical_features) should have been dropped.
Thanks for pointing that out! I didn't know you can drop columns in that way. I added that change to the new version I just pushed, which also has some more attributes to help understand what the pipeline did.
If I alter line 91 (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:
to:
then I get the expected result:
Not sure how this change affects other functionalities of the class.
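To make the idea concrete, here is a minimal sketch of the kind of check involved. It is not the code from feature_importance.py: the helper name `names_of_kept_columns` is hypothetical, and it only assumes that a fitted ColumnTransformer exposes `transformers_` as a list of (name, transformer, columns) tuples in which a dropped entry carries the string 'drop' as its transformer.

```python
def names_of_kept_columns(transformers):
    """Return the input column names of every entry not marked 'drop'.

    `transformers` is a list of (name, transformer, columns) tuples, i.e.
    the shape of ColumnTransformer.transformers_ after fitting.
    Hypothetical helper for illustration only.
    """
    kept = []
    for name, transformer, columns in transformers:
        # A dropped entry stores the literal string 'drop' in place of
        # an estimator, so its columns never reach the output.
        if isinstance(transformer, str) and transformer == 'drop':
            continue
        kept.extend(columns)
    return kept


# Stand-in for column_transformer.transformers_ from the example above;
# object() takes the place of the fitted categorical pipeline.
example = [
    ('num', 'drop', ['c', 'd', 'e']),
    ('cat', object(), ['a', 'b']),
]
print(names_of_kept_columns(example))  # ['a', 'b']
```

With the fitted pipeline from the example, the same function could presumably be called on `fitted_pp.named_steps['column_transformer'].transformers_`. Note that it returns the input column names only; expanding 'a' and 'b' into their one-hot names is a separate step.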