
Adding dtype to LDAModel to speed it up #1656

Merged
merged 21 commits into piskvorky:develop on Nov 14, 2017

Conversation

@xelez (Contributor) commented Oct 26, 2017

Started implementing #1576

Current state:

  • added dtype to LdaModel
  • added asserts about dtype everywhere, to be sure I haven't missed any conversion
  • I probably need help handling load/save; maybe a link to docs about how it works?

I also need to somehow rewrite test asserts like this:

self.assertTrue(all(model.alpha == np.array([0.3, 0.3])))

because model.alpha is now converted to float32 (or whatever dtype), while np.array's default dtype is float64. Use np.allclose, maybe?
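Something along these lines might work (the tolerance here is just illustrative):

import numpy as np
# compare the model's float32 alpha against a float64 literal, up to tolerance
np.testing.assert_allclose(model.alpha, np.array([0.3, 0.3]), rtol=1e-5)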

And I'm not sure where to discuss things, here or in the issue.

@piskvorky (Owner) commented Oct 26, 2017

Good idea with the asserts!

I don't think save/load need any special handling at all. save() just saves the object, and then load() loads it back (using the same types as when the object was saved).

Maybe the only tricky part is how to handle backward compatibility: should loading models saved before this change still work?

I'd say yes. We need to test this explicitly: save an "old" model, then load it using your new code with dtypes and asserts, make sure everything continues to work as expected.

The other compatibility direction (load new model in old code) is not necessary.

@xelez (Contributor Author) commented Oct 26, 2017

Do I need to do anything to make the code save my new dtype field as well?

And yes, there is a compatibility problem, as the tests showed me. To handle it, I'll need to set dtype to float64 if it's not present in the saved model. I'll need some time to wrap my head around the code in the load/save methods.

@menshikh-iv (Contributor) commented Oct 26, 2017

Nice @xelez, my suggestions about it:

  1. The default dtype must be np.float32.
  2. As I remember, you don't need to do anything else to save your dtype field (it's all handled automatically). But you need to modify load for old models (which lack this field).
  3. About backward compatibility (see the sketch below):
  • add a check whether the model is old or new (i.e. whether dtype is defined); if it's defined, nothing else is needed (the model is new)
  • if not, check which dtype was used for the matrices and fill up the new instance with that dtype. Also, add a warning about it ("dtype isn't loaded, we'll use ...")

P/S similar task #1319
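A minimal sketch of what that load override could look like (this mirrors the pattern that ended up in the diff further down; the logging wording is illustrative):

@classmethod
def load(cls, fname, *args, **kwargs):
    result = super(LdaModel, cls).load(fname, *args, **kwargs)
    # dtype can be absent in models saved before this change
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # numpy's implicit default at the time
        logging.info("dtype was not set in saved %s file %s, assuming np.float64",
                     result.__class__.__name__, fname)
    return result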

@xelez (Contributor Author) commented Oct 28, 2017

Implemented setting of dtype for old models.

Two things left to do:

  • modify tests to work with float32
  • clean up the asserts and TODOs that I've added

@xelez (Contributor Author) commented Oct 28, 2017

Looked at the failing tests:

I'll also need to modify AuthorTopicModel, and maybe other classes based on LdaModel.

@xelez (Contributor Author) commented Oct 28, 2017

Fixed tests and quick-fixed AuthorTopicModel.

 * replace assert with docstring comment
 * add test to check that it really saves dtype for different inputs
@@ -354,25 +373,25 @@ def init_dir_prior(self, prior, name):

 if isinstance(prior, six.string_types):
     if prior == 'symmetric':
-        logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)
         init_prior = np.asarray([1.0 / self.num_topics for i in xrange(prior_shape)])
+        logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)  # TODO: prior_shape?
@xelez (Contributor Author):
I have a feeling that it should be

"using symmetric %s at %s", name, 1.0 / self.num_topics

Am I right? @menshikh-iv

@menshikh-iv (Contributor):

Yes @xelez, you are correct!

@@ -600,19 +600,13 @@ def jaccard_distance(set1, set2):
def dirichlet_expectation(alpha):
"""
For a vector `theta~Dir(alpha)`, compute `E[log(theta)]`.

+    Saves dtype of the argument.
@piskvorky (Owner):

What does this comment mean? Looks out of place.

@xelez (Contributor Author):

I wanted to add a note that this function returns an np.array with the same dtype as the input alpha. Well, probably it's not really needed.

@piskvorky (Owner) commented Oct 29, 2017:

If that's the intent, it's not really apparent from the text "Saves dtype of the argument".
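A doctest-style note would make the intent explicit, e.g. (illustrative sketch):

>>> alpha = np.random.rand(5).astype(np.float32)
>>> dirichlet_expectation(alpha).dtype
dtype('float32')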

@xelez xelez changed the title [WIP] Adding dtype to LDAModel to speed it up Adding dtype to LDAModel to speed it up Nov 1, 2017
@xelez (Contributor Author) commented Nov 1, 2017

@piskvorky, @menshikh-iv I think I've finished.

@@ -1231,7 +1230,7 @@ def load(cls, fname, *args, **kwargs):

 # the same goes for dtype (except it was added later)
 if not hasattr(result, 'dtype'):
     result.dtype = np.float64  # float64 was used before as default in numpy
+    logging.warning("dtype was not set, so using np.float64")
@piskvorky (Owner):

A more concrete message, please. When reading this warning, users will be left scratching their heads: set where? Why? What does this mean to me?

How about "dtype not set in saved %s file %s, assuming np.float64" % (result.__class__.__name__, fname)?
And only log at INFO or even DEBUG level, since it's an expected state when loading an old model, nothing out of the ordinary.

Question: isn't it better to infer the dtype from the loaded object? Can it ever happen that it's something else, not np.float64?

@xelez (Contributor Author):

Fixed the message; decided the info level suits better.

About inferring: it's not clear how to do it. Infer from LdaState.eta and LdaState.sstats? We'd then need to test that both are np.float64, so that it's safe to assume we don't lose precision when setting dtype to np.float64 and that np.float32 is not enough.

Anyway, imagine a situation where some of the ndarrays somehow have different dtypes, e.g. some np.float32 and some np.float64. The right dtype is still np.float64.
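If we did want to infer, numpy's promotion rules would give exactly that answer; an illustrative sketch:

import numpy as np
# mixed float32/float64 state arrays promote to float64
inferred = np.result_type(result.state.eta, result.state.sstats)  # -> dtype('float64')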

@@ -538,7 +538,8 @@ def suggested_lda_model(self):
The num_topics is m_T (default is 150) so as to preserve the matrice shapes when we assign alpha and beta.
"""
 alpha, beta = self.hdp_to_lda()
-ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word, random_state=self.random_state)
+ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word,
+                         random_state=self.random_state, dtype=np.float64)
@piskvorky (Owner):

Code style: no vertical indent.

@xelez (Contributor Author):

fixed

 def __init__(self, eta, shape, dtype=np.float32):
     self.eta = eta.astype(dtype, copy=False)
-    self.sstats = np.zeros(shape)
+    self.sstats = np.zeros(shape, dtype)
@piskvorky (Owner):

Using positional arguments can lead to subtle bugs with numpy. Better to use explicit names for keyword parameters: dtype=dtype.
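A classic instance of the trap, for illustration:

import numpy as np
np.zeros((2, 3), 'F')               # surprise: 'F' parses as a dtype code -> complex64 array
np.zeros((2, 3), order='F')         # what was probably meant: Fortran-ordered float64
np.zeros((2, 3), dtype=np.float32)  # explicit keyword leaves no ambiguity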

@xelez (Contributor Author):

fixed

@@ -244,7 +245,8 @@ def lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods,
vocab_len = self.vocab_len
bound = 0.0

-lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word)
+lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word,
+                        dtype=np.float64)
@piskvorky (Owner):

Code style: no vertical indent.

@xelez (Contributor Author):

fixed

@@ -419,7 +421,8 @@ def __getitem__(self, doc):
"""
Similar to the LdaModel __getitem__ function, it returns topic proportions of a document passed.
"""
-lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word)
+lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word,
+                              dtype=np.float64)
@piskvorky (Owner):

Code style: no vertical indent.

@xelez (Contributor Author):

fixed

# Check if `dtype` is set after main pickle load
# if not, then it's an old model and we should set it to default `np.float64`
if not hasattr(result, 'dtype'):
    result.dtype = np.float64  # float64 was used before as default in numpy
@menshikh-iv (Contributor):

Old LDA used float64, really?

@xelez (Contributor Author):

Pretty much everything is using float64, because it's the default dtype when creating numpy arrays.
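For example:

>>> import numpy as np
>>> np.zeros(3).dtype
dtype('float64')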

@@ -130,7 +130,8 @@ def __init__(self, corpus=None, time_slice=None, id2word=None, alphas=0.01, num_
if initialize == 'gensim':
lda_model = ldamodel.LdaModel(
corpus, id2word=self.id2word, num_topics=self.num_topics,
-            passes=passes, alpha=self.alphas, random_state=random_state
+            passes=passes, alpha=self.alphas, random_state=random_state,
+            dtype=np.float64
@menshikh-iv (Contributor):

Maybe it would be a good idea to change the default behaviour (to float32)?
CC @piskvorky @xelez

@xelez (Contributor Author):

Not now; LdaSeqModel would require modifications similar to those I made in LdaModel to handle dtype properly.

@@ -48,6 +48,7 @@ def test_get_topics(self):
vocab_size = len(self.model.id2word)
for topic in topics:
self.assertTrue(isinstance(topic, np.ndarray))
-        self.assertEqual(topic.dtype, np.float64)
+        # Note: started moving to np.float32 as default
+        # self.assertEqual(topic.dtype, np.float64)
@menshikh-iv (Contributor):

Need to re-enable this and switch it to float32.

@xelez (Contributor Author):

This will break other topic models then.



class TestMatUtils(unittest.TestCase):
def test_dirichlet_expectation_keeps_precision(self):
@menshikh-iv (Contributor):

I don't think that making a new file is a good idea. Please move these tests to the LDA tests class.

@menshikh-iv (Contributor):

Also, please add tests for loading an old model with the new code, for all models that you changed.

@piskvorky (Owner) commented Nov 1, 2017:

Test not just loading the old models, but also using them.

The asserts that we newly sprinkled into the code may trigger errors in various places, if something is wrong.
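A rough sketch of such a test (the saved-model file name and the corpus fixture are illustrative; datapath is gensim's test helper from gensim.test.utils):

from gensim.test.utils import datapath

def test_load_old_model(self):
    # a model saved before the dtype change was introduced
    model = LdaModel.load(datapath('old_lda_model'))
    self.assertEqual(model.dtype, np.float64)  # old models should default to float64
    # exercise the model, not just the loading path
    model[self.corpus[0]]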

@xelez (Contributor Author):

@menshikh-iv That test doesn't involve LDA at all, so that's the wrong place for it. I haven't found any file that tests matutils functions, so I think a separate file is not that bad.

@xelez (Contributor Author) commented Nov 6, 2017

@piskvorky, @menshikh-iv see the latest commit for backwards-compatibility tests.

By the way, I think it's a good idea to remove my asserts before merging. They were used mostly during testing, to ensure that I hadn't missed any place where dtype needs to be added. That way we definitely won't break old code or models.

@xelez (Contributor Author) commented Nov 7, 2017

By the way, why are .npy files ignored in .gitignore?

@menshikh-iv (Contributor) left a review comment:

Great @xelez, please fix the last remarks. LGTM for me, wdyt @piskvorky?

@@ -820,6 +856,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True):

# add a little random jitter, to randomize results around the same alpha
sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
+# random_state.rand returns float64, but converting back to dtype won't speed up anything
@menshikh-iv (Contributor):

Maybe .astype (for consistency only) ?

@xelez (Contributor Author):

Consistency vs one additional array copy. I'm not sure :)
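For reference, the consistent variant would be something like this (a sketch; it costs one extra copy whenever dtype != float64):

sort_alpha = (self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))).astype(self.dtype)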

# dtype could be absent in old models
if not hasattr(result, 'dtype'):
    result.dtype = np.float64  # float64 was implicitly used before (cause it's default in numpy)
    logging.info("dtype was not set in saved %s file %s, assuming np.float64", result.__class__.__name__, fname)
@menshikh-iv (Contributor):

Maybe warn?

@xelez (Contributor Author):

See #1656 (comment) for discussion. Although it's an expected state when loading an old model, maybe a warning is still a good thing.

"""
if len(alpha.shape) == 1:
result = psi(alpha) - psi(np.sum(alpha))
else:
result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
return result.astype(alpha.dtype) # keep the same precision as input
return result
@menshikh-iv (Contributor):

Please keep the astype on the return, because:

np.float32 -> np.float32
np.float64 -> np.float64

but

np.float16 -> np.float32
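That is, scipy's psi has no float16 loop, so half precision gets upcast; a quick check:

>>> import numpy as np
>>> from scipy.special import psi
>>> psi(np.float16(1.0)).dtype
dtype('float32')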

@xelez (Contributor Author):

Oh, my bad, you're right!
Then the tests that I added in the separate file aren't needed.

@@ -242,7 +243,7 @@ def testGetDocumentTopics(self):
self.assertTrue(isinstance(topic, list))
for k, v in topic:
self.assertTrue(isinstance(k, int))
-                self.assertTrue(isinstance(v, float))
+                self.assertTrue(np.issubdtype(v, float))
@menshikh-iv (Contributor):

A simple isinstance is better here (and everywhere).

@xelez (Contributor Author) commented Nov 7, 2017:

A simple isinstance fails, because np.float32 is not an instance of float.
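For example:

>>> import numpy as np
>>> isinstance(np.float32(0.5), float)
False
>>> isinstance(np.float64(0.5), float)  # np.float64 subclasses Python's float; np.float32 doesn't
True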


class TestMatUtils(unittest.TestCase):
def test_dirichlet_expectation_keeps_precision(self):
for dtype in (np.float32, np.float64, np.complex64, np.complex128):
@menshikh-iv (Contributor):

Add np.float16 and you'll see a problem.

@menshikh-iv menshikh-iv merged commit 82c394a into piskvorky:develop Nov 14, 2017
@menshikh-iv (Contributor)

Thanks a lot @xelez, nice work 👍

@piskvorky (Owner)

Great feature!

@xelez how would you summarize it, for laypeople who just want "the gist"? The title says Adding dtype to LDAModel to speed it up -- what was the final speed-up?

@menshikh-iv (Contributor) commented Nov 15, 2017

@piskvorky with LdaMulticore (80k words, 100 topics) it's ~20% faster; I checked it yesterday.

@piskvorky (Owner)

That's neat :) Let's make sure this information makes it into the release notes / tweets etc.

@rmalouf (Contributor) commented Nov 18, 2017

There are a couple of places in ldamodel.py where 1e-100 gets added to phinorm to avoid division by zero. That constant will need to be adjusted depending on the precision being used.

@piskvorky (Owner)

Good point. I'd hope this would be caught by the unit tests though -- @menshikh-iv ?

@menshikh-iv (Contributor)

@piskvorky this isn't caught by the unit tests, because the + 1e-100 operation doesn't change the original dtype.
@rmalouf nice catch! Thanks for your remark :+1:
@xelez can you fix this? (We need to scale 1e-100 according to the arrays' dtype, see https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L476 and https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L487 in LDA, plus the same fixes in the other topic models.)
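One possible shape of the fix, as a sketch (using np.finfo; the variable names follow the LDA inference code, and the exact epsilon choice is open):

import numpy as np
# smallest positive normal number for the model's dtype, so the guard
# term doesn't underflow to zero the way 1e-100 does in float32
eps = np.finfo(self.dtype).tiny
phinorm = np.dot(expElogthetad, expElogbetad) + eps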

@piskvorky (Owner) commented Nov 20, 2017

The unit tests should catch the operation of "add epsilon" not working, which (I presume) leads to some issues.

In other words, if the unit tests pass, what is the problem?

@menshikh-iv (Contributor)

For types smaller than float64, it can produce problems with zeros: if phinorm is zero -> division by zero -> nan.

In [1]: import numpy as np
In [2]: np.float64(1e-100)
Out[2]: 1e-100

In [3]: np.float32(1e-100)
Out[3]: 0.0

To me, it looks like a medium-severity bug.

@piskvorky (Owner) commented Nov 20, 2017

That's not my question. My question is: do unit tests catch it?

If not, is it an issue with the unit tests (=> update unit tests), or with the algorithm (=> update gensim code)?

If yes, how come we didn't discover the bug earlier?

@menshikh-iv (Contributor)

@piskvorky the unit tests don't catch it.

@piskvorky (Owner) commented Nov 20, 2017

Then the unit tests should be improved, as part of the solution here -- so that we catch similar bugs automatically in the future.

@menshikh-iv (Contributor)

The problem here is not in the tests at all; it's generally impossible to catch this bug in this code with unit tests. Perhaps we'd need to change the LDA code itself, but I don't think that's a good idea.

@piskvorky (Owner)

I don't understand -- if there's no way to catch a bug, then there is no bug.

@rmalouf (Contributor) commented Nov 20, 2017

It's definitely test-for-able in principle: I noticed the bug because I started getting division-by-zero errors in a processing pipeline that used to work. I don't know how to create a minimal corpus that triggers it, though.

@menshikh-iv (Contributor)

Summarizing: this is a bug, we need to fix it.

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
* Add dtype to LdaModel, assert about it everywhere

* Implement loading of old models without dtype

* Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.

* Fix AuthorTopicModel

* Fix matutils.dirichlet_expectation

 * replace assert with docstring comment
 * add test to check that it really saves dtype for different inputs

* Change default to np.float32 and cleanup

* Fix wrong logging message

* Remove out-of-place comment

* Cleanup PEP8

* Add dtype to sklearn LdaTransformer

* Set precision explicitly in lda model converters

* Add dtype to LdaMulticore

* Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel

* Cleanup asserts and fix another place calculating in float64

* Fix import

* Fix remarks by piskvorky

* Add backward compatibility tests

* Add missing .npy files

* Fix dirichlet_expectation not working with np.float16

* Fix path to saved model