[FEATURE] Bayesian Kernel Density Classifier #279
Comments
I am open to this actually and am happy to hear you've considered this project. @MBrouns thoughts? There are a few thoughts on my side.
@koaning very cool! kk.
By docs I mean a short thing in a jupyter notebook. If you look at, for example, our fairness documentation or our meta model stuff you might get an impression of what we're going for. One possibility is to demonstrate how it is different from a GMM approach and how the choice of kernel influences the prediction. It would also be nice if the doc-string explains what bandwidth does. Since you're open to helping maintain the feature I certainly wouldn't mind seeing a PR for this, but I'll check with @MBrouns in person tomorrow to make sure he's onboard. One thing to keep an eye on is the way we test. We have loads of standard tests and these things fail unless …
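To make the "what bandwidth does" point concrete, here is a small sketch of my own (not from the thread): the same data fit with a narrow and a wide bandwidth, evaluated on a grid. The narrow fit hugs the two clusters and assigns far less density between them.

```python
# Sketch: how bandwidth changes a KernelDensity fit. A small bandwidth
# tracks individual points; a large one smooths across clusters.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Two well-separated 1D clusters around -2 and +2.
X = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)]).reshape(-1, 1)

grid = np.linspace(-4, 4, 9).reshape(-1, 1)
narrow = KernelDensity(bandwidth=0.1).fit(X).score_samples(grid)
wide = KernelDensity(bandwidth=2.0).fit(X).score_samples(grid)

# Between the modes (around x = 0) the narrow bandwidth gives a much
# lower log-density than the wide one.
mid = len(grid) // 2
print(narrow[mid], wide[mid])
```

In a classifier this matters directly: the log-densities above are exactly what `score_samples` feeds into the class posteriors, so bandwidth shifts the decision boundary.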
Another thought popped in. Could the use of the prior be something that can be turned off or set manually? There are moments when a prior from domain knowledge is better than what the dataset can provide. There's an experimental meta-estimator with this idea here.
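The manual-prior idea can be sketched in a few lines. This is a hypothetical helper of mine, not sklego's meta-estimator, and the `priors` parameter name is an assumption: if the user passes priors, use them; otherwise fall back to class frequencies.

```python
# Hypothetical sketch: let the user override the empirical class prior
# with domain knowledge. `priors` is an assumed parameter name.
import numpy as np

def posterior(log_likelihoods, y_train, priors=None):
    """Combine per-class log-likelihoods with a prior via Bayes' rule.

    log_likelihoods: array of shape (n_samples, n_classes)
    priors: optional array of shape (n_classes,); if None, use the
            class frequencies observed in y_train.
    """
    classes, counts = np.unique(y_train, return_counts=True)
    if priors is None:
        priors = counts / counts.sum()
    logp = log_likelihoods + np.log(priors)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))  # stabilized softmax
    return p / p.sum(axis=1, keepdims=True)

y_train = np.array([0, 0, 0, 1])   # empirical prior: [0.75, 0.25]
ll = np.zeros((2, 2))              # equal likelihoods for both classes
print(posterior(ll, y_train))                      # follows empirical prior
print(posterior(ll, y_train, priors=[0.5, 0.5]))   # manual override wins
```

With equal likelihoods the posterior reduces to whatever prior is in effect, which makes the override easy to verify.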
I think I'm open to this as well. I would like to see a notebook, though, comparing it to the GMM version, as I'm having a hard time reasoning through the differences in behavior atm. Also, there are some sklearn-specific things that you currently don't adhere to. I've written down my thoughts on this here: https://www.mbrouns.com/posts/2020-01-26-sklearns-init/. It might be good to read through that and adjust the implementation accordingly.
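For context, the convention that linked post is about can be sketched briefly (my example, with a made-up `MeanClassifier`): `__init__` only stores hyperparameters verbatim, and everything learned from data gets a trailing underscore and is set in `fit`. This is what keeps `clone()` and `get_params()`/`set_params()` working.

```python
# Sketch of the sklearn __init__ convention: no validation or computation
# in __init__; learned state (trailing underscore) is set only in fit().
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class MeanClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage      # stored verbatim, nothing else

    def fit(self, X, y):
        self.classes_ = np.unique(y)    # learned state: trailing underscore
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Assign each sample to the class with the nearest mean.
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

est = MeanClassifier(shrinkage=0.1)
# clone() rebuilds the estimator from get_params() alone, which only works
# because __init__ stored the hyperparameters unchanged.
assert clone(est).get_params() == {"shrinkage": 0.1}
```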
Haven't heard back on this. No rush, just wanted to check :)
Whoops, sorry. https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G I never noticed how bad the scaling is on datasets larger than the academic ones, and that has actually been a big thing. I've used it on 2D species distribution data (I'm trying to see if I'm allowed to use that dataset for the notebook).
Your notebook states: … for training the GMM. I don't know how sklearn does it internally, but I wonder how the EM algorithm has that bound. Got a reference for it? Mostly out of my own curiosity.
I don't know precisely for sklearn either, but I figured that since the most expensive part of the expectation step is a dot product, computing the expectation would be O(p^3) (and apparently ~p^2.3 with fast matrix multiplication). It should also scale linearly with the number of clusters. (Not an academic-level citation, I know, but https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra)
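A small sketch of where those costs come from (my illustration, checked against scipy rather than sklearn's internals): each Gaussian log-density in the E-step needs a Cholesky factorization of the p×p covariance, which is O(p^3) per component, plus a triangular solve over all samples, which is O(n·p^2).

```python
# Sketch: the linear algebra behind one component's E-step log-densities.
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
mu = rng.normal(size=p)
A = rng.normal(size=(p, p))
cov = A @ A.T + p * np.eye(p)          # symmetric positive definite covariance

L = np.linalg.cholesky(cov)            # O(p^3), once per component
z = solve_triangular(L, (X - mu).T, lower=True)   # O(n * p^2)
logdet = 2 * np.log(np.diag(L)).sum()
# log N(x | mu, cov) = -0.5 * (p*log(2*pi) + log|cov| + Mahalanobis^2)
logpdf = -0.5 * (p * np.log(2 * np.pi) + logdet + (z ** 2).sum(axis=0))

ref = multivariate_normal(mu, cov).logpdf(X)
assert np.allclose(logpdf, ref)        # matches scipy's reference implementation
```

So the per-iteration cost is roughly O(k·(p^3 + n·p^2)) for k components, consistent with the "cubic in p, linear in clusters" reasoning above.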
Strange. I copied your notebook and seem to get a significantly different result at the bottom. The colab can be found here. I seem to get roughly the same scores across all approaches. I increased the granularity of the plot a bit.
That is extremely strange. I've requested access to your colab notebook, but I'll rerun mine to see what is going on. This makes more sense anyway, because when I was rerunning it again and again I couldn't figure out why everything was losing performance at very large sample sizes.
EM is typically initialized stochastically, so there's a risk of getting stuck in a bad local optimum. That's my two cents. I didn't see an access request. Maybe this link?
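The local-optimum point is easy to see in practice. A sketch of mine: run sklearn's `GaussianMixture` with `n_init=1` under different seeds and compare the attained lower bounds; this is why the estimator exposes `n_init` to rerun EM and keep the best result.

```python
# Sketch: EM's attained log-likelihood depends on its random initialization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Three clear 1D clusters; a bad init can still merge or split them.
X = np.concatenate([rng.normal(-5, 1, (200, 1)),
                    rng.normal(0, 1, (200, 1)),
                    rng.normal(5, 1, (200, 1))])

bounds = [GaussianMixture(n_components=3, n_init=1, random_state=s)
          .fit(X).lower_bound_ for s in range(5)]
best = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(X).lower_bound_
print(bounds, best)
```

Comparing the five single-init bounds shows the run-to-run variation; `n_init=5` simply keeps the best of several restarts.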
That link worked. While that could explain the GMM results I observed, it couldn't explain the KDE results. Update: I was able to get results similar to yours on a fresh start of my notebook this time. I guess it was a strange Heisenbug.
https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G I added a species example. It turns out sklearn has a species dataset, so you can see a realistic expected use case.
At the end all algorithms seem to drop again for large n… maybe increase the number of iterations to be sure it converges properly?
I finally found the real bug causing the problem. It's fixed now, though. The problem was that a line effectively like the following was run when subsetting the data.
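The actual offending line isn't preserved in this transcript. A hypothetical reconstruction of the general shape of such a subsetting bug (my illustration, not the real code): taking rows of `X` with one index and rows of `y` with another silently misaligns features and labels, which degrades scores without raising any error.

```python
# Hypothetical illustration of a subsetting bug: X and y subset with
# different indices are no longer row-aligned.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)          # label perfectly determined by X

idx = rng.permutation(len(X))[:100]
X_sub = X[idx]
y_sub = y[:100]                        # BUG: rows no longer match X_sub
y_sub_ok = y[idx]                      # Fix: use the same index for both

# The aligned labels still match the rule; the misaligned ones mostly don't.
assert (y_sub_ok == (X_sub[:, 0] > 0).astype(int)).all()
```

Bugs of this shape tend to only show up as gradual score decay at certain sample sizes, which matches the "everything drops for large n" symptom discussed above.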
Also, @MBrouns, which parts of the code specifically does the class not adhere to? PS: I do know I have to add all the fancy …
It's quite subtle, but in the …
@arose13 Feel free to start working on an implementation and PR. As for the docs, it would be preferable if you could condense them, mostly in terms of compute time: we might want to auto-generate these docs, and we prefer the docs to build in mere minutes. As for tests, feel free to look at how we test the …
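An assumption on my part: the "standard tests" mentioned earlier most likely refer to sklearn's generic estimator checks, which verify the API contract (cloneability, fitted-attribute conventions, input validation, and so on). A minimal sketch using a stock sklearn estimator that passes them:

```python
# Sketch: sklearn's generic estimator checks. check_estimator raises if an
# estimator violates the sklearn API contract, and returns quietly if not.
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.estimator_checks import check_estimator

check_estimator(GaussianNB())
print("GaussianNB passes the standard estimator checks")
```

Running the same call on the `BayesianKernelDensityClassifier` below would surface exactly the kind of `__init__`-convention issues discussed earlier in the thread.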
Quick review on implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """
    NOTE: This class is still being developed to make it work well with all
    the sklearn components.

    Bayesian classifier that uses kernel density estimations to generate
    the joint distribution.

    Parameters:
    - bandwidth: float, passed to each per-class scikit-learn KernelDensity
    - kernel: kernel name for scikit-learn KernelDensity
    """

    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        self.classes_, self.models_, self.priors_logp_ = [None] * 3
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        # One KDE per class, fit on that class's rows only.
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [
            KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
            for x_subset in training_sets
        ]
        # Log of the empirical class frequencies.
        self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0])
                             for x_subset in training_sets]
        return self

    def predict_proba(self, X):
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
```

Looking at this implementation: …
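One numerical caveat worth noting (my observation, not raised in the thread): `predict_proba` above exponentiates raw log-densities, which underflows to 0/0 for points far from every class. Normalizing in log space with `logsumexp` is a drop-in fix:

```python
# Sketch: log-space normalization avoids the underflow in the naive
# exp-then-normalize version of predict_proba.
import numpy as np
from scipy.special import logsumexp

def safe_proba(logp):
    """logp: (n_samples, n_classes) unnormalized log-posteriors."""
    return np.exp(logp - logsumexp(logp, axis=1, keepdims=True))

logp = np.array([[-1000.0, -1001.0]])       # far from both classes
naive = np.exp(logp)                        # underflows to [0.0, 0.0]
print(naive / naive.sum(1, keepdims=True))  # nan, nan
print(safe_proba(logp))                     # ~[0.731, 0.269]
```

Subtracting the per-row `logsumexp` before exponentiating keeps the ratio between classes exact while bringing the values into a representable range.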
Okay cool, I'll open a work-in-progress pull request. I'm not sure where to put it, actually. I was going to put it in the … I could also make a new submodule called neighbours.
And now for something completely different: @arose13, if there are any lessons you've learned while implementing this feature, feel free to add them here. This library offers a poem with words of wisdom that will appear once you type … When you make a meaningful contribution to our library, you're also invited to add something to the poem.
lol! That's very cute. I'm not much of a poet, so I'll have to think very hard about this one ^_^
This is now merged. |
The prior is $P(y=0)$. I primarily use it for spatial problems.
It is similar to the GMM Classifier with only 2 caveats I can think of.
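To make the GMM comparison concrete, here is a sketch of mine of the shared generative-classifier recipe: per-class density model plus log-prior, with only the density estimator swapped between `KernelDensity` and `GaussianMixture`. The helper names are my own, not the library's API.

```python
# Sketch: KDE-based and GMM-based classifiers share the same Bayes-rule
# skeleton; only the per-class density estimator differs.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(1)
# Two well-separated 2D classes.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_generative(density_factory, X, y):
    classes = np.unique(y)
    models = [density_factory().fit(X[y == c]) for c in classes]
    priors = np.log([np.mean(y == c) for c in classes])
    return classes, models, priors

def predict(classes, models, priors, X):
    logp = np.array([m.score_samples(X) for m in models]).T + priors
    return classes[logp.argmax(axis=1)]

for factory in (lambda: KernelDensity(bandwidth=0.5),
                lambda: GaussianMixture(n_components=1, random_state=0)):
    c, m, p = fit_generative(factory, X, y)
    # Both should be near-perfect on this well-separated toy data.
    print((predict(c, m, p, X) == y).mean())
```

The practical difference is the density model's bias: the GMM imposes a parametric (mixture-of-Gaussians) shape per class, while the KDE follows the data nonparametrically at the cost of bandwidth tuning and slower scoring on large n.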