Binning method for continuous values #23
Comments
Hello, Whether or not we want to discretize is a question that I've been pondering for a while. Anyway, to answer your question: if we are discretizing, I think smarter discretization methods would probably be helpful.
Hi Marco, I've implemented an entropy-based discretizer, using scikit-learn decision trees with the entropy criterion, here: https://github.com/marcbllv/lime if you ever have time and would like to try it. The discretizers are in a separate file, with an abstract base class and child classes where the specific binnings are implemented. I mostly reused your code for the base class. The labels now need to be passed to the explainer to use supervised binning, but not passing them is fine if you keep the QuartileDiscretizer (the default is None), so it should remain backward compatible. I've tested it on classification (not regression yet) and it seems to give nice results, as it explores the extreme values more than the quartile binning does. I don't know whether this should be integrated into your own repo? I mostly did it for testing purposes on my side, but if you're interested, let me know :-)
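For illustration, here is a minimal sketch of the idea (the helper name and usage are hypothetical, not the actual code from the fork above): fit a shallow decision tree with the entropy criterion on a single feature against the labels, and use its internal split thresholds as the bin boundaries.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def entropy_bins(feature, labels, max_bins=4):
    """Sketch of supervised, entropy-based binning for one continuous
    feature: a shallow decision tree is fit on that feature alone, and
    its split thresholds become the bin boundaries. Hypothetical helper,
    not the implementation in the linked fork."""
    # Capping the leaf count at max_bins yields at most max_bins - 1
    # internal nodes, i.e. at most max_bins - 1 cut points.
    tree = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=max_bins)
    tree.fit(np.asarray(feature).reshape(-1, 1), labels)
    # Internal nodes are those with a left child; leaves have children_left == -1.
    t = tree.tree_
    return np.sort(t.threshold[t.children_left != -1])

# Toy usage: labels concentrated in the tail pull the cut points toward
# the extreme values, unlike fixed quartiles.
rng = np.random.RandomState(0)
x = rng.exponential(size=1000)
y = (x > 2.5).astype(int)
print(entropy_bins(x, y))  # boundaries cluster near the tail; feed them to np.digitize
```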
I like how you've used the Decision Tree to do the discretization. I think it would perhaps make more sense if the discretization did not depend on the labels, but only on the data itself. Thanks,
@mrkaiser included this in a recent pull request, so closing this issue.
Hello,
First, thank you for your nice repository, it's really interesting to work with!
I have a question about the binning process (I hope this is the right place to ask?):
Why did you choose to bin the continuous values using quartiles and not another splitting criterion?
I tested lime on several datasets, and in many cases the interesting points are extreme values and outliers. But this binning doesn't handle them: it usually puts all three boundaries in the region where the points of the majority class lie. If there are a lot of zeros, for instance (more than 75%), LIME will output the same contribution for any positive value, thus providing no information except "this feature is not zero".
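To make this concrete, here is a small illustration, assuming the boundaries come from np.percentile (which is how I understand the QuartileDiscretizer computes them): once more than 75% of the values are zero, all three boundaries collapse to zero and every positive value lands in the same bin.

```python
import numpy as np

# A feature where more than 75% of the values are zero.
rng = np.random.RandomState(0)
x = np.concatenate([np.zeros(800), rng.uniform(1, 100, size=200)])

# Quartile boundaries: all three collapse to zero.
boundaries = np.percentile(x, [25, 50, 75])
print(boundaries)  # [0. 0. 0.]

# Every positive value falls into the same (last) bin, so the explanation
# can only say "this feature is not zero", regardless of magnitude.
print(np.digitize([0.5, 10.0, 99.0], boundaries))  # [3 3 3]
```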
So I was wondering whether supervised binning would be useful here. I've tested an entropy-based binning (still with 4 bins), and it tends to "explore" the extreme values a bit more. But the bins no longer have the same size, and they can probably be very unbalanced.
If you have any thoughts about this, I'd be glad to read them!
Marc