
[FEATURE] Bayesian Kernel Density Classifier #279

Closed
arose13 opened this issue Jan 28, 2020 · 26 comments
Labels
enhancement New feature or request

Comments

@arose13 (Contributor) commented Jan 28, 2020

I've been using this Bayesian kernel density classifier for a few years, and I thought I should move it out of my poorly organized project and into this one.

The prior is the class probability $P(y=c)$, estimated from the class proportions in the training data. I primarily use it for spatial problems.

It is similar to the GMM classifier, with only two caveats I can think of:

  • Hyperparameters are easier to decide on.
  • Scaling is worse, which I believe is due to the KDE part scaling linearly with the sample size.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity


# noinspection PyPep8Naming
class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """
    Bayesian classifier that uses kernel density estimates to model the per-class likelihoods.

    Parameters:
        - bandwidth: float, the kernel bandwidth passed to scikit-learn's KernelDensity
        - kernel: str, the kernel name passed to scikit-learn's KernelDensity
    """
    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        self.classes_, self.models_, self.priors_logp_ = [None] * 3
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        # Fit one KDE per class on that class's subset of the training data
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
                        for x_subset in training_sets]
        # Log-priors are the log class proportions in the training data
        self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets]
        return self

    def predict_proba(self, X):
        # Per-class log-likelihoods, shape (n_samples, n_classes)
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        # posterior = likelihood * prior (up to normalization), then normalize rows
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
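
For context, a quick usage sketch on a toy dataset (the make_moons data here is just an illustration, not from the original issue):

import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = BayesianKernelDensityClassifier(bandwidth=0.3).fit(X, y)
print(clf.predict_proba(X[:5]))  # each row sums to 1
print(clf.predict(X[:5]))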
arose13 added the enhancement (New feature or request) label Jan 28, 2020
@koaning (Owner) commented Jan 28, 2020

I am actually open to this, and I'm happy to hear you've considered this project. @MBrouns, thoughts? A few thoughts from my side:

  • Do you have an example for the docs that can help defend this approach? It is similar to GMM, so it would be nice to demonstrate the difference and where it might be better.
  • Does it make sense to also offer an outlier detector with this approach?
  • Could you rephrase what you mean by the prior?
  • Is this feature something you might be willing to help maintain?

@arose13 (Contributor, Author) commented Jan 28, 2020

@koaning very cool! kk.

  • By docs, do you mean a paper, or just a Jupyter notebook or something like that?
  • It is possible in principle to use it as an outlier detector, especially since your other outlier detectors don't mind using arbitrary likelihood thresholds for calling something an outlier.
  • The prior in this case is the probability of each class, estimated as its proportion in the training dataset. So, for example, if the probability of someone having breast cancer is 2%, then the prior for the breast cancer class is 0.02 and the prior for the benign class is 0.98. These priors adjust the posterior (see the formula below).
  • This feature is something I'd have no problem maintaining.
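
For concreteness, this is just Bayes' rule with per-class KDEs as the likelihoods (a sketch of the math implied by the code above, not a quote from the thread):

$$P(y = c \mid x) = \frac{\hat{f}_c(x)\,P(y=c)}{\sum_k \hat{f}_k(x)\,P(y=k)}$$

where $\hat{f}_c$ is the kernel density estimate fitted on the training samples of class $c$.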

@koaning (Owner) commented Jan 28, 2020

By docs I mean a short thing in a Jupyter notebook. If you look at, for example, our fairness documentation or our meta-model stuff, you might get an impression of what we're going for. One possibility is to demonstrate how it differs from a GMM approach and how the choice of kernel influences the prediction. It would also be nice if the docstring explained what bandwidth does.

Since you're open to helping maintain the feature, I certainly wouldn't mind seeing a PR for this, but I'll check with @MBrouns in person tomorrow to make sure he's on board.

One thing to keep an eye on is the way we test. We have loads of standard tests, and these fail unless check_X_y is used in .fit() and check_is_fitted is used in .predict_proba(). You should see plenty of examples in our library; it's a scikit-learn internals thing. It would also be good to add simple tests that use easy/obvious datasets. A minimal sketch of the expected pattern follows.
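
Here is that pattern in a toy estimator (the MeanClassifier itself is hypothetical, just to show where the validators go):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted


class MeanClassifier(BaseEstimator, ClassifierMixin):
    """Toy nearest-class-mean classifier showing the validation hooks the tests expect."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)  # validate shapes and dtypes up front
        self.classes_ = np.sort(np.unique(y))
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        check_is_fitted(self, ["classes_", "means_"])  # raise NotFittedError before touching X
        X = check_array(X)
        dists = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]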

@koaning (Owner) commented Jan 29, 2020

Another thought popped up. Could the use of the prior be something that can be turned off or set manually? There are moments when a prior from domain knowledge is better than what the dataset can provide.

There's an experimental meta-estimator with this idea here.
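
One possible shape for that, as a sketch (the priors argument and _resolve_log_priors helper are hypothetical names, not from the thread):

import numpy as np


def _resolve_log_priors(y, classes, priors=None):
    """Hypothetical helper: use user-supplied priors when given, else class frequencies."""
    if priors is None:
        counts = np.array([np.sum(y == c) for c in classes])
        return np.log(counts / counts.sum())
    priors = np.asarray(priors, dtype=float)
    if priors.shape != (len(classes),) or not np.isclose(priors.sum(), 1.0):
        raise ValueError("priors must have one entry per class and sum to 1")
    return np.log(priors)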

@MBrouns (Collaborator) commented Jan 29, 2020

I think I'm open to this as well. I would like to see a notebook, though, comparing it to the GMM version, as I'm having a hard time reasoning through the differences in behavior at the moment.

Also, there are some sklearn-specific things that you currently don't adhere to. I've written down my thoughts on this here: https://www.mbrouns.com/posts/2020-01-26-sklearns-init/. It might be good to read through that and adjust the implementation accordingly.

@arose13 (Contributor, Author) commented Jan 29, 2020

@koaning 100% the prior could be made into any arbitrary function and set manually.

@MBrouns and @koaning I'll whip up a notebook by the end of the day today. The differences are subtle but clear depending on the situation, so I'll make sure to demonstrate that.

@koaning (Owner) commented Feb 2, 2020

Haven't heard back on this. No rush, just wanted to check :)

@arose13 (Contributor, Author) commented Feb 2, 2020

Whoops, sorry.

https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G

I never noticed how bad the scaling is on anything larger than academic datasets, and that actually has been a big issue. I've used it on 2D species distribution data (I'm checking whether I'm allowed to use that dataset for the notebook).

@koaning (Owner) commented Feb 2, 2020

Your notebook states:

The time complexity in fact is $O(p^3)$, where $p$ is the number of dimensions.

for training the GMM. I don't know how sklearn does it internally, but I wonder how the EM algorithm has that bound. Got a reference for it? Mostly out of my own curiosity.

@arose13 (Contributor, Author) commented Feb 2, 2020

I don't know precisely for sklearn either, but I figured that since the most expensive operation in the expectation step is a matrix product, computing the expectation would be $O(p^3)$ (and apparently about $O(p^{2.37})$ with fast matrix multiplication). It should also scale linearly with the number of clusters.

(Not an academic-level citation, I know, but https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra)

@koaning (Owner) commented Feb 2, 2020

Strange. I copied your notebook and seem to have a significantly different result at the bottom. The colab can be found here. I seem to get roughly the same scores across all approaches. I increased the granularity of the plot a bit.

[image: plot showing roughly the same scores across all approaches]

@arose13 (Contributor, Author) commented Feb 2, 2020

That is extremely strange. I've requested access to your Colab notebook, but I'll rerun mine to see what is going on. This makes more sense anyway, because when I was rerunning it again and again I couldn't figure out why everything was losing performance at very large sample sizes.

@koaning (Owner) commented Feb 2, 2020

EM is typically stochastic, so there's a risk of getting stuck in a bad local optimum. That's my two cents.

I didn't see an access request. Maybe this link?

@arose13 (Contributor, Author) commented Feb 2, 2020

That link worked. While that could explain the GMM results I observed, it wouldn't explain the KDE results.

Update: I was able to get results similar to yours on a fresh start of my notebook this time. I guess it was a strange Heisenbug.

@arose13 (Contributor, Author) commented Feb 5, 2020

https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G

I added a species example. It turns out sklearn ships a species distribution dataset, so you can see a realistic expected use case.

@koaning (Owner) commented Feb 5, 2020 via email

@arose13 (Contributor, Author) commented Feb 5, 2020

I finally found the real bug causing the problem. It's fixed now, though.

The problem was that, when subsetting the data, a line effectively like the following was run: x_subset = x[:len(x)]. In other words, every "subset" was actually the full dataset.
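
A one-liner shows why that slice is a no-op:

import numpy as np

x = np.arange(5)
print(x[:len(x)])  # [0 1 2 3 4] -- the "subset" is the entire array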

@arose13 (Contributor, Author) commented Feb 5, 2020

Also, @MBrouns, which part of the code specifically does not adhere to sklearn style? I assume something is wrong with the __init__ method, but after comparing it to a few sklego classes I don't see what you're seeing.

PS: I do know I have to add all the fancy check_array() calls and whatnot.

@MBrouns (Collaborator) commented Feb 5, 2020

It's quite subtle, but in the __init__, the classes_, models_, and priors_logp_ attributes are initialized. This causes check_is_fitted to think that the model has already been fitted (because there exist attributes with a trailing underscore on the class).
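
A minimal sketch of that failure mode (the Broken class is hypothetical, just to illustrate):

from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted


class Broken(BaseEstimator):
    def __init__(self):
        self.models_ = None  # trailing-underscore attribute set in __init__ ...

    def fit(self, X, y):
        return self  # never called in this demo


# ... so this passes silently even though fit() was never called, because
# check_is_fitted treats any attribute ending in "_" as evidence of fitting.
check_is_fitted(Broken())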

@koaning (Owner) commented Feb 6, 2020

@arose13 Feel free to start working on an implementation and PR.

As for the docs, it would be preferable if you could condense them, mostly in terms of compute time: we may want to auto-generate these docs, and we prefer the docs to build in mere minutes.

As for tests, feel free to look at how we test the GMM-style classifiers. If you do something similar, it's fine.

@koaning (Owner) commented Feb 6, 2020

Quick review of the implementation:

class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """ 
    NOTE: This class is still being developed to make it work well with all the sklearn components.

    Bayesian Classifier that uses Kernel Density Estimations to generate the joint distribution
    Parameters:
        - bandwidth: float
        - kernel: for scikit learn KernelDensity
    """
    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        self.classes_, self.models_, self.priors_logp_ = [None] * 3
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
                        for x_subset in training_sets]

        self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets]
        return self

    def predict_proba(self, X):
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]

Looking at this implementation:

  • Nothing you set in __init__ should have a trailing underscore.
  • Everything you set in __init__ should be validated (perhaps raising ValueError) in the .fit() method.
  • All attributes added in the .fit() method need to end in an underscore (like self.models_); note that anything added in .fit() should not also be initialized to None in __init__.
  • The attributes added in .fit() need to be checked for existence in the predict methods with check_is_fitted (see the corrected sketch below).
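
Putting those review points together, a corrected version might look like this (an illustrative sketch, not the final merged implementation):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        # only store hyperparameters here; no trailing-underscore attributes
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        if self.bandwidth <= 0:
            raise ValueError("bandwidth must be positive")
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [
            KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(subset)
            for subset in training_sets
        ]
        self.priors_logp_ = np.log([subset.shape[0] / X.shape[0] for subset in training_sets])
        return self

    def predict_proba(self, X):
        check_is_fitted(self, ["classes_", "models_", "priors_logp_"])
        X = check_array(X)
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]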

@arose13 (Contributor, Author) commented Feb 6, 2020

Okay cool, I'll open a work-in-progress pull request. I'm not sure where to put it, actually. I was going to put it in the naive_bayes submodule, since similar classifiers are there, but neither this class nor the ones already there implement a truly 'naive' Bayes. But that's small potatoes.

I could also make a new submodule called neighbours.

@koaning (Owner) commented Feb 6, 2020

#283

@koaning (Owner) commented Feb 7, 2020

And now for something completely different:

@arose13 if there are any lessons you've learned while implementing this feature, feel free to add them here:

#106

This library offers a poem with words of wisdom that appears once you run:

from sklego import this

When you make a meaningful contribution to our library, you're also invited to add something to the poem.

@arose13 (Contributor, Author) commented Feb 7, 2020

lol! That's very cute. I'm not much of a poet, so I'll have to think very hard about this one ^_^

@koaning (Owner) commented Feb 14, 2020

This is now merged.

koaning closed this as completed Feb 14, 2020