
User / item embeddings Nan with large training set #130

Closed
lesterlitch opened this issue Dec 6, 2016 · 23 comments

Comments

@lesterlitch

Apologies if this is user error, but I appear to be getting NaN embeddings from LightFM and I'm unsure what I could have done wrong. I followed the documentation, and have raised the issue on SO.

Basically I have a large dataset where pure collaborative filtering works fine, but when user / item features are provided, the model produces NaN embeddings.

http://stackoverflow.com/questions/40967226/lightfm-user-item-producing-nan-embeddings

@maciejkula
Collaborator

What are the values in the feature matrices? Have you tried normalizing them (for example to be between 0 and 1)?
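(For reference, a minimal sketch of one way to do this row-wise with sklearn's normalize, which accepts scipy CSR matrices directly; the variable names below are placeholders, not from the original report.)

from sklearn.preprocessing import normalize

# L1-normalize each row of a CSR feature matrix so that every user's /
# item's feature weights sum to 1 and therefore lie between 0 and 1.
user_features = normalize(raw_user_features, axis=1, norm='l1')
item_features = normalize(raw_item_features, axis=1, norm='l1')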

@lesterlitch
Author

lesterlitch commented Dec 7, 2016

Thanks for the response!

I have, yes, using the sklearn normalize function, i.e. normalize(raw_member_features_matrix, axis=1, norm='l1')

Here are the first rows from the item and user feature csr matrices:

items_features[0,items_features[0].nonzero()[1]].todense()

matrix([[ 0.2189845 ,  0.18301879,  0.29823944,  0.19250721,  0.62113589,
          0.28761694,  0.4387733 ,  0.15228976,  0.32908452]], dtype=float32)

members_features[0,members_features[0].nonzero()[1]].todense()

matrix([[ 0.01500955,  0.00687691,  0.00488463,  0.03807613,  0.01714612,
          0.06524359,  0.01370857,  0.0203032 ,  0.0091073 ,  0.01899276,
          0.0170573 ,  0.03180252,  0.03951597,  0.03765749,  0.02067481,
          0.00863998,  0.03003284,  0.010614  ,  0.01699004,  0.02135187,
          0.02568188,  0.02606232,  0.01938645,  0.06161183,  0.0126634 ,
          0.01294042,  0.00720311,  0.030777  ,  0.01884086,  0.01178526,
          0.05592889,  0.02763181,  0.00907691,  0.01116292,  0.01343661,
          0.01717991,  0.01464464,  0.00726902,  0.01353738,  0.00541887,
          0.01728139,  0.01083446,  0.04138919,  0.01978991,  0.05642271,
          0.00835726]], dtype=float32)

@maciejkula
Collaborator

Just for my understanding: what does your indexing items_features[0,items_features[0].nonzero()[1]] accomplish?

Are you expecting these matrices to be dense?

@lesterlitch
Author

lesterlitch commented Dec 7, 2016

That's right. What I'm trying to do with that is go from items_features (a csr matrix of shape (n_items, n_features)) to a dense matrix of the values in the first row, just to show that the non-zero values are normalized between 0 and 1.

items_features[0] gives:

<1x2790 sparse matrix of type '<type 'numpy.float32'>'
with 9 stored elements in Compressed Sparse Row format>
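(A quicker whole-matrix version of the same check, for reference: the stored values of a CSR matrix live in its .data attribute, so the range can be verified without densifying anything. Variable names follow the thread.)

# .data holds every stored (non-zero) value, covering all rows at once.
print(items_features.data.min(), items_features.data.max())
print(members_features.data.min(), members_features.data.max())
assert items_features.data.max() <= 1.0 and members_features.data.max() <= 1.0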

@maciejkula
Collaborator

Cool, I understand.

Can you try reducing the learning rate and/or reducing the scale of the nonzero items even further?

@maciejkula
Collaborator

You could try turning off regularization as well to try to narrow the problem down.
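(A rough sketch of what both suggestions look like together; the hyperparameter values are illustrative, not recommendations.)

from lightfm import LightFM

# Reduced learning rate (the default is 0.05) and no L2 penalty on
# either the user or the item embeddings.
model = LightFM(no_components=30,
                learning_rate=0.001,
                user_alpha=0.0,
                item_alpha=0.0,
                loss='warp')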

@lesterlitch
Author

Cool, will try those and come back to you. Cheers.

@maciejkula
Collaborator

maciejkula commented Dec 7, 2016

You're also using a lot of parallelism: this may cause problems if a lot of your users or items have the same features.

Let me know what you find!
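(In context this refers to the num_threads argument of fit; a single-threaded run looks like the sketch below, assuming the model and matrices named elsewhere in the thread. The epoch count is illustrative.)

# Fit single-threaded to rule out concurrent-update collisions when
# many users or items share identical feature rows.
model.fit(interaction_matrix,
          user_features=members_features,
          item_features=items_features,
          epochs=30,
          num_threads=1)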

@maciejkula
Collaborator

Any luck?

@lesterlitch
Author

Unfortunately not. I tried reducing the scale further (0-0.1), reducing the learning rate (several values), removing regularisation and running with only 4 threads. Strangely, I can get a result with either only user features or only item features, but not both. I'm not sure if this is a factor, but I have many more users than items (around 10x).

@maciejkula
Collaborator

Can you try the newest version (1.12)? It has numerical stability improvements which may resolve your problem.

@lesterlitch
Author

lesterlitch commented Feb 7, 2017

In the new version I get:

ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate.

The learning rate is 0.001 and I have tried values down to 0.00001. I have normalized the features to between 0 and 1, but have also tried 0-0.1 and 0-0.01.

My datasets look like:

items_features
<39267x2790 sparse matrix of type '<type 'numpy.float32'>'
with 335801 stored elements in Compressed Sparse Row format>

members_features
<305803x2790 sparse matrix of type '<type 'numpy.float32'>'
with 14772846 stored elements in Compressed Sparse Row format>

interaction_matrix
<305803x39267 sparse matrix of type '<type 'numpy.float32'>'
with 3767965 stored elements in COOrdinate format>
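(For reference, these shapes line up the way LightFM expects; a small consistency check using the variable names from the listing above.)

n_users, n_items = interaction_matrix.shape  # (305803, 39267)

# LightFM expects one feature row per user and one per item.
assert members_features.shape[0] == n_users
assert items_features.shape[0] == n_items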

@maciejkula
Collaborator

Hmm. I may have to have a look at your code and data. Can you email me at lightfm@zoho.com?

@maciejkula
Collaborator

It would be best if you could reproduce the problem using synthetic data (or a subset of your data that you don't mind sharing).

@maciejkula
Collaborator

Is this still a problem? I'd really like to help if it is!

@lesterlitch
Author

lesterlitch commented Apr 4, 2017

Just had a chance to revisit this.

When I recreated my matrices with random floats and ints, with the same value scale and the same sparseness / shapes, I didn't encounter the same problem.

After investigation I discovered a bunch of empty rows in my member / item features. It seems the model can handle a few, but in my case there were 700 or so, and that was enough to push the parameters to infinity.

Is this expected behavior?
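(For anyone hitting the same thing, a sketch of one way to locate all-zero feature rows in a CSR matrix; getnnz is a standard scipy.sparse method, and the variable names follow the thread.)

import numpy as np

# Rows with no stored entries have a non-zero count of zero.
empty_users = np.where(members_features.getnnz(axis=1) == 0)[0]
empty_items = np.where(items_features.getnnz(axis=1) == 0)[0]
print(len(empty_users), "users and", len(empty_items), "items have no features")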

@maciejkula
Collaborator

No, this shouldn't be the case. My first suspicion was that I don't zero the representation buffers, but they are zeroed.

@maciejkula
Collaborator

If you can construct a minimal test case that manifests this problem, I would be happy to have a look and solve this.

@lesterlitch
Author

Hi, sorry I was so slow on this. I've done a bunch more testing and found that when using very sparse features for users and items, the learning rate needs to be very small to prevent divergence. This was more of an issue previously, possibly because of the numerical stability issues you mentioned? Anyway, after upgrading and retesting, I can get the model to fit by adjusting the learning rate. Thanks!
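(A quick post-fit sanity check for divergence, for reference; user_embeddings and item_embeddings are the fitted LightFM attributes.)

import numpy as np

# Every learned parameter should be finite after a successful fit.
assert np.isfinite(model.user_embeddings).all()
assert np.isfinite(model.item_embeddings).all()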

@maciejkula
Collaborator

No worries, glad to see you found a solution.

@kewlcoder

kewlcoder commented Sep 28, 2018

Hi Maciej,
I am getting the same error - "ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate."

I have tried all these values for the learning rate - [0.05, 0.025, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001] - but it still gives the same error.
Also, it only occurs when I add the item_features. It works fine with interactions + user_features data.

Please help!

@dwy904

dwy904 commented Oct 19, 2018

Why did I get all zeros in both of the embedding matrices [user and item]?

@ness001

ness001 commented Jun 4, 2020

Me too. Even with an extremely small learning rate, the error still pops up. I can't figure out why.
