
sklearn's multivariate imputer #147

Closed
mcreixell opened this issue May 12, 2020 · 12 comments
@mcreixell
Collaborator

@aarmey What do you think of this approach to handle incomplete data?

https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation
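For reference, the linked approach looks roughly like this (toy data for illustration; note that `IterativeImputer` is still experimental in sklearn, so the `enable_iterative_imputer` import is required):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries (NaN marks missingness)
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature with missing values is modeled as a function of the others,
# iterating until the estimates stabilize
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

assert not np.isnan(X_filled).any()  # every missing entry gets an estimate
```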

@aarmey
Member

aarmey commented May 12, 2020

For imputation, you have to be able to make a pretty good guess about the value of the missing data. I'm not sure we can do that.

Is our current approach a problem?

@mcreixell
Collaborator Author

I see. Our current approach using pom definitely works, but I feel like sklearn's GMM method is more reliable in general, so I was wondering if there's an easy way to handle missingness within sklearn. Sometimes pom struggles to fill all clusters, so I have to keep refitting with a `while len(set(labels)) < ncl:` loop, and it does a worse job of separating clusters than sklearn's method:

[Screenshots comparing cluster separation: pom vs. sklearn]
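A rough sketch of the refit-until-no-empty-cluster workaround described above, using sklearn's `GaussianMixture` as a stand-in for pom's mixture model (the data, `ncl`, and the reseeding scheme are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic clusters in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 3, 6)])

ncl = 3
labels = GaussianMixture(n_components=ncl, random_state=0).fit_predict(X)

# If some component ends up with no assigned points, refit with a new seed
seed = 1
while len(set(labels)) < ncl:
    labels = GaussianMixture(n_components=ncl, random_state=seed).fit_predict(X)
    seed += 1
```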

@aarmey
Member

aarmey commented May 13, 2020

Not sure I follow where pom's method would differ from sklearn's; do you know?

@mcreixell
Collaborator Author

I think the main difference is that pom lets you specify the distribution of each component from this list (https://pomegranate.readthedocs.io/en/latest/Distributions.html), whereas sklearn by default uses a multivariate Gaussian distribution. With pom we can also use a multivariate Gaussian, but for some reason it struggles to fit when the number of clusters is larger than 4 or 5, so I'm using a normal distribution, which seems to work better. I don't know whether that's what makes the two different, but it's the only difference I can come up with. I don't think this is critical at this point. I'll try to get some results from the CPTAC data set for next week to see how pom's method handles it.

@aarmey
Member

aarmey commented May 13, 2020

I agree that's probably the difference. What do you mean by it struggles to fit?

@mcreixell
Collaborator Author

I mean that some of the clusters come back empty. I found a related issue on GitHub a while ago, and the developer recommended using fewer clusters when this happens.

@aarmey
Member

aarmey commented May 13, 2020

Huh... but this doesn't happen with sklearn, when using the presumably matching distribution?

@mcreixell
Collaborator Author

mcreixell commented May 13, 2020

Nope, sklearn never returns empty clusters regardless of the number of clusters used.

@mcreixell
Collaborator Author

I just remembered, they also say this in pom's documentation:

"""
Use MultivariateGaussianDistribution when you want the full correlation matrix within the feature vector. When you want a strict diagonal correlation (i.e no correlation or “independent”), this is achieved using IndependentComponentsDistribution with NormalDistribution for each feature. There is no implementation of spherical or other variations of correlation.
"""

We usually set sklearn's `covariance_type` to `"diag"`, so this should make the two methods the same, I think...
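A quick way to see what `"diag"` means on the sklearn side: each component stores only per-feature variances (no off-diagonal correlations), which matches fitting an independent normal per feature. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)

# With "diag", covariances_ holds one variance per feature per component:
assert gmm.covariances_.shape == (2, 3)  # (n_components, n_features)
# Compare: covariance_type="full" would store (2, 3, 3) full matrices.
```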

@aarmey
Member

aarmey commented May 13, 2020

Right... `covariance_type="diag"` should be the same as `IndependentComponentsDistribution` with a `NormalDistribution` per feature.

@aarmey
Member

aarmey commented Jun 12, 2020

I'm pretty sure this is resolved.

@mcreixell
Collaborator Author

@aarmey we were close here!
