
Question about sample_vMF #17

Closed · dmortem opened this issue Dec 24, 2018 · 8 comments

@dmortem commented Dec 24, 2018

Hi,
@jasonlaska Thanks for your code! I learned a lot from it!

Recently I ran into a problem: when I used 'VonMisesFisherMixture' to estimate a distribution over sequence data and then used 'sample_vMF' to produce pseudo samples, I found that all the pseudo samples follow the same trend as the real ones, but their Y values are always smaller than those of the real samples.

Later, I created a list of 10 identical sequences, each with the values [0.95, 0.9, 0.85, 0.8, ..., 0.05, 0]. In my opinion, when I produce pseudo samples from the vMF distribution, I should get the same sequence back. However, I got [3.82291703e-01, 3.62182865e-01, 3.42056000e-01, 3.21941382e-01, 3.01821386e-01, 2.81705993e-01, 2.61568550e-01, 2.41452144e-01, 2.21317166e-01, 2.01214795e-01, 1.81081612e-01, 1.60967944e-01, 1.40855783e-01, 1.20740353e-01, 1.00600080e-01, 8.04905383e-02, 6.03600387e-02, 4.02586200e-02, 2.01175857e-02, -8.79339154e-06], which is still a straight line, but each value is smaller (that is, 0.95 -> 0.382). Why is that, and how can I solve it?

Thank you!

@jasonlaska (Owner)

Can you show me how you're using this sequence in the function call to sample_vMF? The signature takes a mean direction mu (a vector with the dimension of the data), a concentration kappa, and a number of samples. Each output sample will have unit norm, which may be why you are seeing that scaling effect.
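
For example, a quick check like this (a minimal sketch with an arbitrary unit-norm mu and kappa; only sample_vMF itself is from this repo) shows that every returned sample lands on the unit sphere:

import numpy as np
from spherecluster import sample_vMF

mu = np.ones(19) / np.sqrt(19.)         # an arbitrary unit-norm mean direction
samples = sample_vMF(mu, 50., 3)        # kappa = 50, draw 3 samples
print(np.linalg.norm(samples, axis=1))  # each norm is ~1.0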

@dmortem (Author) commented Dec 25, 2018

Here is my source code:

import numpy as np
from spherecluster import VonMisesFisherMixture, sample_vMF

# 10 identical 19-dimensional sequences
x = [i / 20. for i in range(19)]
x_list = np.array([np.array(x) for _ in range(10)])

vmf_soft = VonMisesFisherMixture(n_clusters=1, n_jobs=15, posterior_type='hard')
vmf_soft.fit(x_list)
center = vmf_soft.cluster_centers_[0]
kappa = vmf_soft.concentrations_[0]

# draw 2 pseudo samples from the fitted vMF distribution
fake_sample = np.array(sample_vMF(center, kappa, 2))

Thanks a lot!

@jasonlaska (Owner) commented Dec 26, 2018

Hi @dmortem,

First, I'd like to validate that this is what you intended: you're generating a set of identical 19-dimensional points, so we should expect the clustering algorithm to produce a center identical to these points (up to normalization) and a very high kappa, so that most samples are near that point.

The main thing I notice here is that the input points are not unit normalized (i.e., they are not on the surface of the sphere). If we normalize the rows, we get the following values for each row:

>>> x_list[0] / np.linalg.norm(x_list[0])
array([0.        , 0.02177518, 0.04355036, 0.06532553, 0.08710071,
       0.10887589, 0.13065107, 0.15242624, 0.17420142, 0.1959766 ,
       0.21775178, 0.23952696, 0.26130213, 0.28307731, 0.30485249,
       0.32662767, 0.34840285, 0.37017802, 0.3919532 ])

Since the input provided is not normalized, the algorithm will automatically project these points to their closest point on the sphere (i.e., l2-normalize them) as part of the clustering. Inspecting the center and kappa from your code confirms this:

>>> center
array([0.        , 0.02177518, 0.04355036, 0.06532553, 0.08710071,
       0.10887589, 0.13065107, 0.15242624, 0.17420142, 0.1959766 ,
       0.21775178, 0.23952696, 0.26130213, 0.28307731, 0.30485249,
       0.32662767, 0.34840285, 0.37017802, 0.3919532 ])

>>> kappa
10000000000.0

Thus, drawing any number of samples with this mu and kappa results in vectors that are very close to center:

>>> samples = sample_vMF(center, kappa, 5)
>>> samples
array([[ 6.32807952e-06,  2.17864995e-02,  4.35500737e-02,
         6.53322527e-02,  8.71095727e-02,  1.08881644e-01,
         1.30639007e-01,  1.52421854e-01,  1.74200326e-01,
         1.95993355e-01,  2.17752592e-01,  2.39536626e-01,
         2.61298267e-01,  2.83075639e-01,  3.04853585e-01,
         3.26616624e-01,  3.48406769e-01,  3.70171007e-01,
         3.91954664e-01],
       [-2.55597079e-06,  2.17727691e-02,  4.35469153e-02,
         6.53037181e-02,  8.71155660e-02,  1.08857345e-01,
         1.30663982e-01,  1.52433263e-01,  1.74222724e-01,
         1.95972540e-01,  2.17760572e-01,  2.39528979e-01,
         2.61289916e-01,  2.83078353e-01,  3.04848925e-01,
         3.26616358e-01,  3.48405330e-01,  3.70185607e-01,
         3.91948825e-01],
       [-1.36665318e-06,  2.17944338e-02,  4.35683507e-02,
         6.53144059e-02,  8.71042671e-02,  1.08872207e-01,
         1.30636564e-01,  1.52429533e-01,  1.74200980e-01,
         1.95979053e-01,  2.17759354e-01,  2.39538467e-01,
         2.61295634e-01,  2.83088949e-01,  3.04848135e-01,
         3.26637761e-01,  3.48401665e-01,  3.70155995e-01,
         3.91956256e-01],
       [ 1.45076901e-05,  2.17576482e-02,  4.35492214e-02,
         6.53258495e-02,  8.70920150e-02,  1.08878649e-01,
         1.30659238e-01,  1.52415545e-01,  1.74194117e-01,
         1.95975390e-01,  2.17747226e-01,  2.39530462e-01,
         2.61307605e-01,  2.83069683e-01,  3.04841559e-01,
         3.26632971e-01,  3.48404188e-01,  3.70177861e-01,
         3.91965990e-01],
       [-2.12690749e-08,  2.17772530e-02,  4.35430105e-02,
         6.53248000e-02,  8.71232600e-02,  1.08865881e-01,
         1.30640309e-01,  1.52433249e-01,  1.74201411e-01,
         1.95978082e-01,  2.17733089e-01,  2.39521882e-01,
         2.61299886e-01,  2.83091667e-01,  3.04850095e-01,
         3.26631264e-01,  3.48415265e-01,  3.70162670e-01,
         3.91958856e-01]])

>>> for sample in samples:
...     print(np.linalg.norm(center - sample))
...
3.276563405953557e-05
4.696966559956132e-05
4.536780825124043e-05
3.567424819392873e-05
4.314138629688178e-05

In summary: the inputs you provided are not normalized, so the resulting cluster center is the normalized input.
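
If you want pseudo samples at the original magnitude rather than on the unit sphere, one option (just a sketch, not something the library does for you; it reuses x_list, center, and kappa from your snippet above) is to record the row norms before fitting and rescale the unit-norm samples afterward:

scale = np.linalg.norm(x_list, axis=1).mean()         # all rows are identical here, so the mean is exact
fake_samples = sample_vMF(center, kappa, 2) * scale   # back on the original scale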

I hope this helps.

@dmortem (Author) commented Jan 3, 2019

That really helps! Thanks a lot!

@dmortem (Author) commented Jan 3, 2019

When I tried to set 'n_clusters=K' with K larger than 1 in 'VonMisesFisherMixture', I hit the error 'VMF scaling denominator was inf.'. In #10, you said 'That exception is raised when the dimension is too large.', but my data is only 10×19, which is not that large. How can I solve this?

@jasonlaska (Owner)

Yeah, it's an outstanding issue. It has to do with values becoming too large in some iterations, which I think is due to long vectors (since they are normalized to 1). Let me know if you figure it out.

@dmortem (Author) commented Jan 3, 2019

OK, I will try running more experiments on it.

@itamar-dw

Overflow when kappa or mu is large can be dealt with by noting that you compute the log-likelihood, not the likelihood. For example, log(exp(kappa * X.dot(mu).T)) = kappa * X.dot(mu).T, and I think the same trick can be used for the normalization constant. For example, when the dimension is 3 the normalizer has a closed form, so I've modified the code in _vmf_method by adding:

if n_features == 3:
    # closed-form normalizer for p = 3: C_3(k) = k / (4 * pi * sinh(k))
    log_norm = np.log(kappa) - np.log(2. * np.pi) - kappa - np.log(1. - np.exp(-2. * kappa))
    log_density = kappa * X.dot(mu).T
    return log_norm + log_density

I believe this can be done for general dimension, but I didn't verify it.
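
For general dimension, one way this might go (again unverified; it leans on scipy.special.ive, the exponentially scaled Bessel function, so that log I_v(kappa) = kappa + log(ive(v, kappa)) stays finite for large kappa; log_vmf_density is a hypothetical helper, not part of the package):

import numpy as np
from scipy.special import ive  # ive(v, k) = iv(v, k) * exp(-k)

def log_vmf_density(X, mu, kappa):
    # hypothetical helper: vMF normalizer C_p(k) = k^(p/2 - 1) / ((2*pi)^(p/2) * I_{p/2-1}(k))
    p = mu.shape[0]
    v = p / 2. - 1.
    # log I_v(kappa) = kappa + log(ive(v, kappa)) avoids evaluating the huge I_v(kappa) directly
    log_norm = v * np.log(kappa) - (p / 2.) * np.log(2. * np.pi) - (kappa + np.log(ive(v, kappa)))
    return log_norm + kappa * X.dot(mu).T

For p == 3 this reduces to the closed form above.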
