Binning method for continuous values #23

Closed · marcbllv opened this issue Sep 8, 2016 · 4 comments
marcbllv (Contributor) commented Sep 8, 2016

Hello,
First, thank you for your nice repository, it's really interesting to work with!

I have a question about the binning process (I hope this is the right place to ask?):
Why did you choose to bin the continuous values using quartiles and not another splitting criterion?

I tested lime on several datasets, and in many cases the interesting points are extreme values and outliers. This binning doesn't handle them: it usually puts all three boundaries in the area where the points of the majority class lie. If, for instance, more than 75% of the values are zero, LIME will output the same contribution for any positive value, providing no information beyond "this feature is not zero".
So I was wondering whether supervised binning would be useful here. I've tested entropy-based binning (still with 4 bins), and it tends to "explore" the extreme values a bit more. The bins no longer have the same size, though, and can probably be very unbalanced.
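To make the failure mode concrete, here is a minimal sketch (illustrative only, not LIME's actual discretizer code) of how the quartile boundaries collapse when a feature is mostly zeros:

```python
import numpy as np

# 80% zeros, 20% positive values: every quartile boundary lands at zero.
rng = np.random.RandomState(0)
feature = np.concatenate([np.zeros(80), rng.uniform(1, 100, 20)])

boundaries = np.percentile(feature, [25, 50, 75])
print(boundaries)  # [0. 0. 0.]

# Assigning bins against these boundaries: zeros fall in bin 0, every
# positive value in bin 3, and the two middle bins stay empty -- the
# discretization only says "this feature is not zero".
bins = np.searchsorted(boundaries, feature, side='left')
print(np.unique(bins))  # [0 3]
```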

If you have any thoughts about this, I'd be glad to read them!
Marc

marcotcr (Owner) commented Sep 9, 2016

Hello,
I did quartiles just because it was quick and easy to implement. I've also had it happen to me that most points lie in one quartile, so the discretization is not so good.

Whether or not we want to discretize is a question that I've been pondering for a while. Anyway, to answer your question: if we are discretizing, I think smarter discretization methods would probably be helpful.

marcbllv reopened this Sep 15, 2016

marcbllv (Contributor, Author) commented Sep 15, 2016

Hi Marco,

I've implemented an entropy-based discretizer using scikit-learn's entropy-criterion decision trees, here: https://github.com/marcbllv/lime, if you ever have time and would like to try it.

The discretizers are in a separate file, with an abstract base class and child classes implementing the specific binnings. I mostly reused your code for the base class. Labels now need to be passed to the explainer to use supervised binning, but omitting them (the default is None) is fine when keeping the QuartileDiscretizer, so backward compatibility is preserved.

I've tested it on classification (not regression yet) and it seems to give nice results, as it explores the extreme values more than the quartiles do.
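For reference, here is a hedged sketch of the idea (the function name `entropy_bins` is illustrative and the actual code in the fork may differ): fit a shallow entropy-criterion decision tree on a single feature and reuse its internal split thresholds as bin boundaries.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def entropy_bins(feature, labels, n_bins=4, random_state=0):
    """Pick up to n_bins - 1 boundaries from an entropy decision tree."""
    tree = DecisionTreeClassifier(criterion='entropy',
                                  max_leaf_nodes=n_bins,
                                  random_state=random_state)
    tree.fit(feature.reshape(-1, 1), labels)
    # Internal nodes split on our single feature (index 0); leaves are
    # marked with -2 in tree_.feature, so this keeps only real thresholds.
    thresholds = tree.tree_.threshold[tree.tree_.feature == 0]
    return np.sort(thresholds)

# With the mostly-zero feature from the example above, the boundaries move
# to where the label actually changes instead of all sitting at zero.
rng = np.random.RandomState(0)
x = np.concatenate([np.zeros(80), rng.uniform(1, 100, 20)])
y = (x > 50).astype(int)
print(entropy_bins(x, y))
```

Note that the tree may stop splitting early when nodes are pure, so it can return fewer than `n_bins - 1` thresholds, and the resulting bins are generally unequal in size, as mentioned above.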

I don't know whether this needs to be integrated into your own repo? I mostly did it for testing purposes on my side, but if you're interested, let me know :-)

marcotcr (Owner) commented

I like how you've used the Decision Tree to do the discretization. I think it would perhaps make more sense if the discretization did not depend on the labels, but only on the data itself.
Anyway, I'm a bit slammed for the next few weeks, but as soon as I have some time I will try out your discretization. If it seems to make more sense for a few datasets, we should merge it into LIME. If you have particular examples where the quartile discretizer is bad and the entropy one is good, send them to me : ).
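For what a label-free discretizer along those lines might look like, here is one sketch (my own illustration, with a hypothetical `kmeans_bins` helper, not anything proposed in the thread): cluster the feature values with 1-D k-means and cut bins at the midpoints between sorted cluster centers, so boundaries follow the data's density rather than the labels.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bins(feature, n_bins=4, random_state=0):
    """Choose n_bins - 1 boundaries from the data alone, without labels."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=random_state)
    km.fit(feature.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    # Cut halfway between adjacent cluster centers.
    return (centers[:-1] + centers[1:]) / 2
```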

Thanks,

marcotcr (Owner) commented Oct 9, 2016

@mrkaiser included this in a recent pull request, so closing this issue.

marcotcr closed this as completed Oct 9, 2016