
feature importance [enhancement] #27

Closed
mglowacki100 opened this issue Sep 4, 2019 · 8 comments

@mglowacki100

It'd be nice to have 'feature importance' exposed in the same way as in sklearn.

@mglowacki100
Author

It can be done with the rfpimp library and by monkey-patching AutoML with a scikit-learn style score method (a quick temporary fix).
Here is sample code:

# https://github.com/parrt/random-forest-importances
# !pip install rfpimp
import rfpimp

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from supervised.automl import AutoML

df = pd.read_csv(...)

df_train, df_test = train_test_split(df, test_size=0.20)

X_train, y_train = df_train.drop('Target', axis=1), df_train['Target']
X_test, y_test = df_test.drop('Target', axis=1), df_test['Target']

# add a random column as a baseline to compare importances against
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

print('training')
automl = AutoML(...)
automl.fit(X_train, y_train)

# monkey-patch AutoML with a scikit-learn style score() so rfpimp can call it
def score(self, X, y, sample_weight=None):
    return accuracy_score(y, self.predict(X)['label'], sample_weight=sample_weight)

setattr(AutoML, 'score', score)

print('feature importance')
imp = rfpimp.importances(automl, X_test, y_test)  # permutation importance
viz = rfpimp.plot_importances(imp)
viz.view()

@pplonski pplonski added this to To do in mljar-supervised Oct 22, 2019
@pplonski pplonski added the enhancement New feature or request label Apr 8, 2020
@pplonski pplonski self-assigned this Apr 8, 2020
@pplonski pplonski added this to the version 0.2.0 milestone Apr 8, 2020
@pplonski pplonski modified the milestones: version 0.2.0, version 0.3.0 Apr 16, 2020
@pplonski pplonski moved this from To do to In progress in mljar-supervised Apr 24, 2020
@pplonski pplonski pinned this issue Apr 24, 2020
@pplonski
Contributor

pplonski commented Apr 24, 2020

sklearn supports feature importance: https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html

I will use their implementation. (I need to add predict_proba to the algorithms interface just to be compatible with the sklearn interface.)
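
For reference, a minimal sketch of calling sklearn's permutation_importance on an already fitted estimator; the model, X_test and y_test names below are placeholders, not the actual mljar-supervised integration:

import pandas as pd
from sklearn.inspection import permutation_importance

# `model` needs fit/predict; passing an explicit scorer avoids requiring a score() method
result = permutation_importance(
    model, X_test, y_test,
    scoring="accuracy",  # assumption: a classification task
    n_repeats=5,
    random_state=42,
)

importances = pd.Series(result.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False))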

pplonski added a commit that referenced this issue Apr 27, 2020
pplonski added a commit that referenced this issue Apr 27, 2020
@pplonski
Contributor

For each fold, the feature importance is computed with the permutation method. The plot displays the top 25 features. All importance values are saved to a CSV file.

An example of the plot:
[image: per-fold permutation feature importance plot]
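
A rough sketch of reading that CSV back and plotting the 25 strongest features; the file name and column layout here are illustrative, not the exact artifacts mljar-supervised writes:

import matplotlib.pyplot as plt
import pandas as pd

# illustrative file name; the real CSV layout is discussed further down in this thread
imp = pd.read_csv("feature_importance.csv", index_col=0)

# average importance across folds/learners and keep the 25 strongest features
top25 = imp.mean(axis=1).sort_values(ascending=False).head(25)

top25.sort_values().plot(kind="barh", figsize=(6, 8))
plt.xlabel("mean permutation importance")
plt.tight_layout()
plt.show()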

@pplonski pplonski unpinned this issue Apr 27, 2020
@pplonski pplonski moved this from In progress to Done in mljar-supervised May 4, 2020
@Tonywhitemin

Hi pplonski,
Sorry to keep bothering you...
I tried to understand the "features_scores_threshold_2.5.csv" file, shown in the following table.
Some questions are listed below, could you help?

  1. Do the numbers under learnerX indicate the importance of the feature? How are they calculated?
  2. How can I tell which feature corresponds to each column?
  3. How is the "counter" value calculated?
  4. It seems that the features in this CSV file may include the golden features, right?

[image: excerpt of the features_scores_threshold_2.5.csv table]

@pplonski
Contributor

pplonski commented Jun 2, 2022

@Tonywhitemin the feature importance is computed for each learner, using the permutation method. Each learner has a vector with an importance value for each feature. The columns in the CSV are joined on features; the first row is the first feature from the dataset.

When computing the importance, a random feature is injected into the dataset. The counter keeps track of how many times a feature had lower importance than the random feature.
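
A tiny sketch of what that counter logic amounts to, assuming an importance table with one column per learner and a row for the injected random feature (all names and values below are illustrative):

import pandas as pd

# illustrative importance table: rows = features, columns = learners
imp = pd.DataFrame(
    {"learner_1": [0.30, 0.02, 0.10, 0.05],
     "learner_2": [0.25, 0.01, 0.12, 0.04]},
    index=["age", "zip_code", "income", "random_feature"],
)

random_row = imp.loc["random_feature"]

# counter: in how many learners a feature scored below the random feature
counter = imp.drop(index="random_feature").lt(random_row, axis="columns").sum(axis=1)
print(counter)  # age 0, zip_code 2, income 0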

@Tonywhitemin

@pplonski Thanks for your reply!
As you said, "The counter keeps track of how many times a feature had lower importance than the random feature."
But can we see the random feature's importance values in this CSV file? (Or which row holds the random feature's values?)

By the way, it seems that the features in this CSV file may include some of the golden features, is that correct?
Thank you!

@pplonski
Contributor

pplonski commented Jun 2, 2022

@Tonywhitemin you will need to check that in the code ... I don't remember all the details, sorry!

@Tonywhitemin

Hi @pplonski,
Please don't say sorry, you've helped a lot, I really appreciate your help!
Have a nice day!
