Binning method for continuous values #23
Comments
Hello, Whether or not we want to discretize is a question that I've been pondering for a while. Anyway, to answer your question: if we are discretizing, I think smarter discretization methods would probably be helpful.
Hi Marco, I've implemented an entropy-based discretizer, using scikit-learn decision trees with the entropy criterion, here: https://github.com/marcbllv/lime if you ever have time and would like to try it. The discretizers are in a separate file, with an abstract base class and child classes where the specific binnings are implemented. I mostly reused your code for the base class. The labels now need to be passed to the explainer to use supervised binning, but not passing them is fine if you keep the QuartileDiscretizer (the default is None), so it should remain backward compatible. I've tested it on classification (not regression yet) and it seems to give nice results, as it explores the extreme values more than the quartile binning does. I don't know whether this should be integrated into your own repo? I mostly did it for testing purposes on my side, but if you're interested, let me know :-)
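For illustration, here is a minimal sketch of the idea (the helper name and usage are hypothetical, not the actual code from the fork above): fit a shallow decision tree with the entropy criterion on a single feature against the labels, and use its internal split thresholds as the bin boundaries.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def entropy_bins(feature, labels, max_bins=4):
    """Sketch of supervised, entropy-based binning for one continuous
    feature: a shallow decision tree is fit on that feature alone, and
    its split thresholds become the bin boundaries. Hypothetical helper,
    not the implementation in the linked fork."""
    # Capping the leaf count at max_bins yields at most max_bins - 1
    # internal nodes, i.e. at most max_bins - 1 cut points.
    tree = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=max_bins)
    tree.fit(np.asarray(feature).reshape(-1, 1), labels)
    # Internal nodes are those with a left child; leaves have children_left == -1.
    t = tree.tree_
    return np.sort(t.threshold[t.children_left != -1])

# Toy usage: labels concentrated in the tail pull the cut points toward
# the extreme values, unlike fixed quartiles.
rng = np.random.RandomState(0)
x = rng.exponential(size=1000)
y = (x > 2.5).astype(int)
print(entropy_bins(x, y))  # boundaries cluster near the tail; feed them to np.digitize
```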
I like how you've used the Decision Tree to do the discretization. I think it would perhaps make more sense if the discretization did not depend on the labels, but only on the data itself. Thanks,
@mrkaiser included this in a recent pull request, so closing this issue.
Hello,
First, thank you for your nice repository, it's really interesting to work with!
I have a question about the binning process (I hope this is the right place to ask?):
Why did you choose to bin the continuous values using quartiles and not another splitting criterion?
I tested lime on several datasets, and in many cases the interesting points are extreme values and outliers. But this binning doesn't handle them: it usually puts all three boundaries in the region where the points of the majority class lie. If there are a lot of zeros, for instance (more than 75%), LIME will output the same contribution for any positive value, thus providing no information except "this feature is not zero".
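To make this concrete, here is a small illustration, assuming the boundaries come from np.percentile (which is how I understand the QuartileDiscretizer computes them): once more than 75% of the values are zero, all three boundaries collapse to zero and every positive value lands in the same bin.

```python
import numpy as np

# A feature where more than 75% of the values are zero.
rng = np.random.RandomState(0)
x = np.concatenate([np.zeros(800), rng.uniform(1, 100, size=200)])

# Quartile boundaries: all three collapse to zero.
boundaries = np.percentile(x, [25, 50, 75])
print(boundaries)  # [0. 0. 0.]

# Every positive value falls into the same (last) bin, so the explanation
# can only say "this feature is not zero", regardless of magnitude.
print(np.digitize([0.5, 10.0, 99.0], boundaries))  # [3 3 3]
```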
So I was wondering whether supervised binning would be useful here. I've tested an entropy-based binning (still with 4 bins), and it tends to "explore" the extreme values a bit more. But the bins no longer have the same size, and they can probably be very unbalanced.
If you have any thoughts about this, I'd be glad to read them!
Marc