
sklearn's multivariate imputer #147

Closed
mcreixell opened this issue May 12, 2020 · 12 comments
@mcreixell
Collaborator

@aarmey What do you think of this approach to handle incomplete data?

https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation
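For reference, the linked approach looks roughly like this (toy data for illustration; note that `IterativeImputer` is still experimental in sklearn, so the `enable_iterative_imputer` import is required):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries (NaN marks missingness)
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature with missing values is modeled as a function of the others,
# iterating until the estimates stabilize
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

assert not np.isnan(X_filled).any()  # every missing entry gets an estimate
```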

@aarmey
Member

aarmey commented May 12, 2020

For imputation, you have to be able to make a pretty good guess about the value of the missing data. I'm not sure we can do that.

Is our current approach a problem?

@mcreixell
Collaborator Author

I see. Our current approach using pom definitely works, but I feel like sklearn's GMM method is more reliable in general, so I was wondering if there's an easy way to handle missingness within sklearn. Sometimes pom struggles to fill all clusters, so I have to keep refitting with a `while len(set(labels)) < ncl:` loop, and it does a worse job of separating clusters than sklearn's method:

[Screenshots comparing cluster separation: pom vs. sklearn]
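A rough sketch of the refit-until-no-empty-cluster workaround described above, using sklearn's `GaussianMixture` as a stand-in for pom's mixture model (the data, `ncl`, and the reseeding scheme are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic clusters in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 3, 6)])

ncl = 3
labels = GaussianMixture(n_components=ncl, random_state=0).fit_predict(X)

# If some component ends up with no assigned points, refit with a new seed
seed = 1
while len(set(labels)) < ncl:
    labels = GaussianMixture(n_components=ncl, random_state=seed).fit_predict(X)
    seed += 1
```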

@aarmey
Member

aarmey commented May 13, 2020

Not sure I follow where pom's method would differ from sklearn's; do you know?

@mcreixell
Collaborator Author

I think the main difference is that pom lets you specify the distribution of each component from this list (https://pomegranate.readthedocs.io/en/latest/Distributions.html), whereas sklearn by default uses a multivariate Gaussian distribution. With pom we can also use a multivariate Gaussian, but for some reason it struggles to fit when the number of clusters is larger than 4 or 5, so I'm using a normal distribution, which seems to work better. I don't know whether that's what makes the two different, but it's the only difference I can come up with. I don't think this is critical at this point. I'll try to get some results from the CPTAC data set for next week to see how pom's method handles it.

@aarmey
Member

aarmey commented May 13, 2020

I agree that's probably the difference. What do you mean by it struggles to fit?

@mcreixell
Collaborator Author

I mean that some of the clusters come back empty. I found a related issue on GitHub a while ago, and the developer recommended using fewer clusters when this happens.

@aarmey
Member

aarmey commented May 13, 2020

Huh... but this doesn't happen with sklearn, when using the presumably matching distribution?

@mcreixell
Collaborator Author

mcreixell commented May 13, 2020

Nope, sklearn never returns empty clusters regardless of the number of clusters used.

@mcreixell
Collaborator Author

I just remembered, they also say this in pom's documentation:

"""
Use MultivariateGaussianDistribution when you want the full correlation matrix within the feature vector. When you want a strict diagonal correlation (i.e no correlation or “independent”), this is achieved using IndependentComponentsDistribution with NormalDistribution for each feature. There is no implementation of spherical or other variations of correlation.
"""

We usually set sklearn's `covariance_type` to `"diag"`, so this should make the two methods the same, I think...
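A quick way to see what `"diag"` means on the sklearn side: each component stores only per-feature variances (no off-diagonal correlations), which matches fitting an independent normal per feature. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)

# With "diag", covariances_ holds one variance per feature per component:
assert gmm.covariances_.shape == (2, 3)  # (n_components, n_features)
# Compare: covariance_type="full" would store (2, 3, 3) full matrices.
```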

@aarmey
Member

aarmey commented May 13, 2020

Right... `covariance_type="diag"` should be the same as `IndependentComponentsDistribution` with a `NormalDistribution` per feature.

@aarmey
Member

aarmey commented Jun 12, 2020

I'm pretty sure this is resolved.

@mcreixell
Collaborator Author

@aarmey we were close here!
