
Hidden Markov model with Gaussian emissions not training properly #400

Closed

joshdorrington opened this issue Mar 19, 2018 · 18 comments

@joshdorrington

I am trying to fit a time series of a chaotic system with a 3-state hidden Markov model with Gaussian emissions. I format the data as a NumPy array called input_data and run the code below:

```python
In [1]: input_data.shape
Out[1]: (1, 200001, 6)
In [2]: model = HiddenMarkovModel()
In [3]: gaussmodel = model.from_samples(MultivariateGaussianDistribution, n_components=3, X=input_data)
```

Unfortunately, the returned states of the model are not accurate: all components of the Gaussian means are nearly identical, with strictly descending absolute values:

```python
In [4]: gaussmodel.states[1].distribution.parameters[0]
Out[4]:
[-0.09880035589313062,
 -0.0979069486147086,
 -0.09692864738743179,
 -0.09590506750933378,
 -0.09484837637195158,
 -0.09376959974296883]
In [5]: gaussmodel.states[0].distribution.parameters[0]
Out[5]:
[0.25390719830239683,
 0.2529250438880774,
 0.25183304658880706,
 0.25062083419641734,
 0.24927706774681227,
 0.24778986004254214]
```

The data set has no such symmetry, and in fact these mean values are in some cases entirely outside the range of values present in the data, so I do not think it is a problem with the k-means initialisation of the states.

@jmschrei
Owner

Howdy

It would be easiest for me to debug if you could provide the data as well. Can you also check the version of pomegranate that you have? Taking a look at the results when you toggle verbose=True might help as well.

@joshdorrington
Author

I'm on 0.7.3.
Will try and reduce the data a bit for upload, it is quite a large file at present.

@jmschrei
Owner

Can you also try upgrading to 0.9.0?

@joshdorrington
Author

The data file is attached:
np_input_file.txt

The error persists unchanged with 0.9.0:

```python
data = np.fromfile("np_input_file.txt").reshape([1, 200001, 6])
model = HiddenMarkovModel()
Gaussmodel = model.from_samples(MultivariateGaussianDistribution, n_components=3, X=data)

Gaussmodel.states[1].distribution.parameters[0]
```

We see that when the data and the regime centroids are projected onto the 0th and 3rd axes, the centroids bear no relation to the regions where the points lie:

[image: scatter of the data on axes 0 and 3 with the fitted pomegranate centroids]

@joshdorrington
Author

Oh, and just to show that the data itself isn't the issue, this is a similar graph produced using hmmlearn's implementation of a Gaussian HMM:

[image: the same plot produced with hmmlearn's Gaussian HMM]

@jkleckner
Contributor

@joshdorrington Those are beautiful plots that accomplish Edward Tufte's goals of self-explanation. Could you include a pointer to the code for both of them?

@jmschrei
Owner

I agree, those plots are super nice. I'll take a look at it soon; I'm currently travelling. If you need an immediate solution, can you try the code below? There might be some simple typos, but the gist should be the same as hmm.from_samples.

```python
from pomegranate import *
from sklearn.cluster import KMeans
import numpy as np

data = np.fromfile("np_input_file.txt").reshape([1, 200001, 6])

cov = np.eye(6)
mu = KMeans(3).fit_predict(data)

s1 = State(MultivariateGaussianDistribution(mu[0], cov))
s2 = State(MultivariateGaussianDistribution(mu[1], cov))
s3 = State(MultivariateGaussianDistribution(mu[2], cov))

model = HiddenMarkovModel()
model.add_states(s1, s2, s3)
model.add_transition(model.start, s1, 0.33)
model.add_transition(model.start, s2, 0.33)
model.add_transition(model.start, s3, 0.34)

model.add_transition(s1, s1, 0.33)
model.add_transition(s1, s2, 0.33)
model.add_transition(s1, s3, 0.34)

model.add_transition(s2, s1, 0.33)
model.add_transition(s2, s2, 0.33)
model.add_transition(s2, s3, 0.34)

model.add_transition(s3, s1, 0.33)
model.add_transition(s3, s2, 0.33)
model.add_transition(s3, s3, 0.34)

model.fit(data, verbose=True)
```

@joshdorrington
Author

@jkleckner, thanks for your kind comment. Sorry, I'm not exactly clear what you'd like me to link to: the plotting code, or the state-fitting code?

I implemented a slight variant of your proposed code @jmschrei, just to tidy up a couple of little bugs, but it's basically the same:

```python
from pomegranate import *
from sklearn.cluster import KMeans
import numpy as np

data = np.fromfile("np_input_file.txt").reshape([1, 200001, 6])
cov = np.eye(6)
kmodel = KMeans(3).fit(data[0, :, :])
mu = kmodel.cluster_centers_

s1 = State(MultivariateGaussianDistribution(mu[0], cov))
s2 = State(MultivariateGaussianDistribution(mu[1], cov))
s3 = State(MultivariateGaussianDistribution(mu[2], cov))
model = HiddenMarkovModel()
model.add_states(s1, s2, s3)
model.add_transition(model.start, s1, 0.33)
model.add_transition(model.start, s2, 0.33)
model.add_transition(model.start, s3, 0.34)

model.add_transition(s1, s1, 0.33)
model.add_transition(s1, s2, 0.33)
model.add_transition(s1, s3, 0.34)

model.add_transition(s2, s1, 0.33)
model.add_transition(s2, s2, 0.33)
model.add_transition(s2, s3, 0.34)

model.add_transition(s3, s1, 0.33)
model.add_transition(s3, s2, 0.33)
model.add_transition(s3, s3, 0.34)

model.bake()
model.fit(data, verbose=True)
```

And this returns a full output of:

[1] Improvement: 4486744.222106556      Time (s): 0.7452
[2] Improvement: 253413.92377016693     Time (s): 0.7567
[3] Improvement: 528234.9305049619      Time (s): 0.7647
[4] Improvement: 354598.88527076365     Time (s): 0.7574
[5] Improvement: 206177.16622723825     Time (s): 0.7627
[6] Improvement: 109022.87481217086     Time (s): 0.7647
[7] Improvement: 77593.92290376872      Time (s): 0.7642
[8] Improvement: 50681.91781912651      Time (s): 0.7657
[9] Improvement: 32392.17777218949      Time (s): 0.7647
[10] Improvement: 18563.571588578634    Time (s): 0.7627
[11] Improvement: 12440.86556516774     Time (s): 0.7657
[12] Improvement: 9567.817236023955     Time (s): 0.7642
[13] Improvement: 7484.057423806749     Time (s): 0.7657
[14] Improvement: 6353.9542276170105    Time (s): 0.7713
[15] Improvement: 5694.35950438492      Time (s): 0.7709
[16] Improvement: 4795.899998524226     Time (s): 0.7747
[17] Improvement: 4107.538776298054     Time (s): 0.7802
[18] Improvement: 4062.0021453266963    Time (s): 0.773
[19] Improvement: 4661.232641399838     Time (s): 0.7697
[20] Improvement: 5629.863652213477     Time (s): 0.7692
[21] Improvement: 8139.677873382345     Time (s): 0.7762
[22] Improvement: 10800.85995469708     Time (s): 0.7867
[23] Improvement: 10285.470221821219    Time (s): 0.7847
[24] Improvement: 7690.012382719666     Time (s): 0.7761
[25] Improvement: 5123.2117690509185    Time (s): 0.7787
[26] Improvement: 3463.128731983714     Time (s): 0.7737
[27] Improvement: 2543.4822256304324    Time (s): 0.7712
[28] Improvement: 1966.4555519074202    Time (s): 0.7802
[29] Improvement: 1665.5144956689328    Time (s): 0.7712
[30] Improvement: 1500.730795876123     Time (s): 0.7721
[31] Improvement: 1302.8138508787379    Time (s): 0.7747
[32] Improvement: 1177.0327271195129    Time (s): 0.7737
[33] Improvement: 1042.6559759248048    Time (s): 0.7762
[34] Improvement: 959.4716712161899     Time (s): 0.7812
[35] Improvement: 903.9185708174482     Time (s): 0.7787
[36] Improvement: 858.5212775124237     Time (s): 0.7848
[37] Improvement: 777.4658112591133     Time (s): 0.7812
[38] Improvement: 698.238952703774      Time (s): 0.7817
[39] Improvement: 590.0614401493222     Time (s): 0.777
[40] Improvement: 602.8234331635758     Time (s): 0.7807
[41] Improvement: 543.2950761094689     Time (s): 0.7867
[42] Improvement: 446.0494925547391     Time (s): 0.7732
[43] Improvement: 364.80554305482656    Time (s): 0.7732
[44] Improvement: 306.95581830106676    Time (s): 0.7732
[45] Improvement: 254.04788305331022    Time (s): 0.7742
[46] Improvement: 219.48022534698248    Time (s): 0.7656
[47] Improvement: 166.90107309632003    Time (s): 0.7721
[48] Improvement: 163.44407942518592    Time (s): 0.7682
[49] Improvement: 134.6664313301444     Time (s): 0.7627
[50] Improvement: 118.20883244741708    Time (s): 0.7677
[51] Improvement: 84.43263314757496     Time (s): 0.7667
[52] Improvement: 53.964324260130525    Time (s): 0.7672
[53] Improvement: 39.15444684121758     Time (s): 0.7717
[54] Improvement: 47.30971831921488     Time (s): 0.7778
[55] Improvement: 26.459519919008017    Time (s): 0.7653
[56] Improvement: 11.759501944296062    Time (s): 0.7652
[57] Improvement: 14.247651782818139    Time (s): 0.768
[58] Improvement: 3.5007481994107366    Time (s): 0.7688
[59] Improvement: -8.499886938370764    Time (s): 0.7667
Total Training Improvement: 6247302.914771961
Total Training Time (s): 46.2738
Out[69]: 6247302.914771961

However, when plotted, the same problem persists. Here we see that the k-means centres were actually pretty close to where the hmmlearn HMM ended up:

[image: the same projection, showing the k-means centres and the final pomegranate state means]

@jkleckner
Contributor

@joshdorrington It would be helpful to see both the fit and the plot. If they are too big to put inline, you could always create a gist. When the pomegranate fit is working, it would also be useful to see the two side by side for comparison. Note that there is a notebook example in [1] that does this, where their speeds are compared.

[1] https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_0_pomegranate_overview.ipynb

@jkleckner
Contributor

@joshdorrington, could it be that you have a mismatched expectation of which state indices correspond between the models? When I run it, the numbers in the plots (code would be better) suggest that indices 0 and 3 might correspond to what you have. Note that the k-means process will, under some circumstances, produce a non-deterministic ordering of the clusters.

```python
print(model.states[0].distribution.mu)  # [ 0.83899543  0.24577457 -0.19747712 -0.4214513  -0.23748714  0.23345555]
print(model.states[1].distribution.mu)  # [ 0.79304567  0.2437491  -0.23705874 -0.39506542 -0.20489905  0.26616131]
print(model.states[2].distribution.mu)  # [ 0.92149593  0.12998673 -0.04736718 -0.61579304 -0.13607195  0.06758548]

print(model.states[0].distribution.mu[0], model.states[0].distribution.mu[3])  # (0.92149591399201758, -0.61579319540642929)
print(model.states[1].distribution.mu[0], model.states[1].distribution.mu[3])  # (0.7930455929589999, -0.39506531301105141)
print(model.states[2].distribution.mu[0], model.states[2].distribution.mu[3])  # (0.83899550866569073, -0.42145130878958492)
```
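Since the fitted states can come back in a different order on each run, comparing runs by state index alone is fragile. As a sketch of one way to pair up states between two fits (not code from this thread; `match_states` and the toy mean vectors are illustrative), you can brute-force the permutation of mean vectors with the smallest total distance:

```python
import itertools
import numpy as np

def match_states(means_a, means_b):
    """Return the permutation p minimising the total distance between
    means_a[i] and means_b[p[i]] (brute force; fine for a few states)."""
    means_a, means_b = np.asarray(means_a), np.asarray(means_b)
    best, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(means_b))):
        cost = sum(np.linalg.norm(means_a[i] - means_b[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

# Two fits that found the same three means, in a different order.
run1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
run2 = [[2.1, 1.9], [0.1, -0.1], [1.0, 0.9]]
print(match_states(run1, run2))  # (1, 2, 0)
```

With the states matched this way, the means from two runs can be compared component by component regardless of the order k-means happened to produce.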

@joshdorrington
Author

@jkleckner I have made a gist with my full fitting and plotting included for both programs:
https://gist.github.com/joshdorrington/2913269ec89932f9b68b5fe4c91cdf5c

As hmmlearn still has no Python 3 compatibility, they are separate files. Your results do look right; unfortunately I can't reproduce them! I don't think the indices are the problem, as I don't make any assumptions about which states are which. I only filter on a subspace of the data; presumably the order of the dimensions in the fitted states will always correspond to the data.

@jkleckner
Contributor

@joshdorrington In trying to reproduce this, my own build environment is now not working. I have been using conda, which has been quite reliable for me, but I think you have hit something incompatible with current revisions. What package manager do you use, and can you upload the package versions? Specifically, `conda env export` or `pip freeze`.

@joshdorrington
Author

joshdorrington commented Mar 22, 2018

I also use Anaconda, here is the full list of packages and versions:
condasetup1.txt

Key details are conda v4.5.0, Python v3.6.0 and pomegranate v0.9.0

@jmschrei
Owner

jmschrei commented Mar 22, 2018

When I run the first script, I think I'm getting the answers that you'd expect.

```python
from pomegranate import *
import numpy as np

data = np.fromfile("np_input_file.txt").reshape([1, 200001, 6])

Gaussmodel = HiddenMarkovModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=data, verbose=True)

print Gaussmodel.states[0].distribution.parameters[0]  # [0.7826816683883048, 0.24117942443206677, -0.26013475663367996, -0.3822771140018437, -0.1924771863914723, 0.2848611308798533]
print Gaussmodel.states[1].distribution.parameters[0]  # [0.9011424396640376, 0.18619544314987144, -0.0415208854330756, -0.5671348934410917, -0.22473978132127376, 0.07526019711117092]
print Gaussmodel.states[2].distribution.parameters[0]  # [0.8524057518122062, 0.20865981112786372, -0.2155342911203444, -0.44643304424702357, -0.17002630608375832, 0.24277978252249166]
```

I noticed a bug in your gist: you reshaped your data to be 200001 by 6, which is why you might have got LinAlgErrors or NaN improvements. When I comment that line out, I get the same improvement scores as before, and the following plot (after adding a `plt.show()` call at the end):

[image: resulting plot from the corrected script]

I am on pomegranate v0.9.0, though I am running py2.7 as well.

@jmschrei
Owner

I just ran the same code on python 3.6 and got the same results as above. Can you try commenting out the reshape command and running again?
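To illustrate the shape issue with numpy alone (a toy sketch; the array here is a small stand-in for the real data, not the file from this thread): from_samples is being given a collection of sequences, shaped (n_sequences, sequence_length, n_dimensions), so dropping the leading sequence axis changes what each row means even though the values are identical:

```python
import numpy as np

raw = np.arange(12.0)        # stand-in for the values read with np.fromfile

seqs = raw.reshape(1, 2, 6)  # one sequence of 2 timesteps in 6 dimensions
flat = raw.reshape(2, 6)     # same values, but the sequence axis is gone

print(seqs.shape)                    # (1, 2, 6)
print(flat.shape)                    # (2, 6)
print(np.array_equal(seqs[0], flat)) # True: identical contents, different meaning
```

This is why commenting out the extra reshape in the gist restores the expected behaviour: the 3-D form preserves the single-sequence structure the rest of the script assumes.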

@joshdorrington
Author

I have been plunged into a new level of confusion, as, like you, I was suddenly seeing correct states fitted with the same code I was using before. I have now identified what seems to be the problem, but I have absolutely no idea why.

If I import the data into Python from a .csv with pandas.read_csv, the training doesn't work, but if I import it from a binary file using np.fromfile, it does. I present the test code I have used to check this below. The data, when passed to pomegranate, is in the same format, in the same type of array, and contains the same values, so I see no reason why this should happen:

the csv data file (suffixed .txt for GitHub reasons):
pd_input_file.txt

the comparison gist: https://gist.github.com/joshdorrington/e5a1fc3c651ee6b2d5812f31e13903c4

When I run this gist I get incorrect results for pd_gaussmodel, as posted before, and correct results for np_gaussmodel.

@jmschrei
Owner

jmschrei commented Mar 23, 2018

The bug is due to a difference in memory layout between numpy arrays and pandas data frames. Since dataframes store data contiguously by column, not by row, the underlying buffer is essentially a transposed numpy array. When you extract the values, instead of numpy creating a new array where each row is contiguous, it just returns a view that has been transposed. For numpy operations this is fine, because they all use the appropriate strides whether the array is transposed or not. You can see the actual difference between the two arrays below:

```python
print pd_data.strides  # (1600008, 8, 1600008)
print np_data.strides  # (9600048, 48, 8)
```

If you add in `pd_data = pd_data.copy()` you'll get the same results as with the numpy loading.

I should do more error checking to ensure the array is row-contiguous before training starts, to stop this type of situation from arising in the future.

tl;dr The underlying Cython code behind pomegranate assumes the input is row-contiguous, like most numpy arrays, not column-contiguous, like pandas dataframes.
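The layout difference and the copy() fix can be reproduced with numpy alone. In this sketch (illustrative values, not the thread's data), `np.asfortranarray` stands in for the column-contiguous buffer a dataframe hands back, and `np.ascontiguousarray` plays the role of the `.copy()` workaround:

```python
import numpy as np

row_major = np.arange(12.0).reshape(3, 4)  # C order, like np.fromfile output
col_major = np.asfortranarray(row_major)   # same values, column-contiguous

print(row_major.flags['C_CONTIGUOUS'])     # True
print(col_major.flags['C_CONTIGUOUS'])     # False
print(row_major.strides, col_major.strides)  # the strides differ

fixed = np.ascontiguousarray(col_major)    # force a row-contiguous copy
print(fixed.flags['C_CONTIGUOUS'])         # True
print(np.array_equal(fixed, row_major))    # True: values are unchanged
```

A check on `X.flags['C_CONTIGUOUS']` (or an `np.ascontiguousarray` call) at the top of training is the kind of guard that would catch this class of bug before it silently corrupts the fit.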

@joshdorrington
Author

@jmschrei thanks very much, I'm glad we could get to the bottom of this!
