The topic distribution for all documents is similar #1
Hello! Thank you for reporting, but this is not enough information to give any feedback. Please provide information on your corpus (total number of documents and words), the model parameters and fitting process (number of iterations), and the package version.
The corpus is Chinese text split on spaces (18,374,731 items processed). Version: 0.5.10. Thank you for your help!
This is the output of test_btm.py when I run it on the SearchSnippets.txt corpus. It shows p_zd, the documents × topics probability matrix.
Running 20 iterations may lead to such results; that is simply not enough for the model to converge. My recent experiments show that model perplexity stabilizes somewhere around 500 iterations. But even with such a small number of iterations I cannot replicate this result. Could you please give the full code you are using and also pass
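Perplexity here is the standard held-out measure: a document's word likelihood is the mixture of topic-word probabilities weighted by that document's topic distribution, and perplexity is the exponentiated negative mean log-likelihood per word. A minimal numpy sketch of that definition (illustrative only, not bitermplus's exact implementation; all names are made up):

```python
import numpy as np

def perplexity(p_wz, p_zd, docs):
    """Perplexity of `docs` given topic-word probabilities p_wz (T x W)
    and document-topic probabilities p_zd (D x T).
    Each doc is a list of word ids."""
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        # p(w | d) = sum_t p(t | d) * p(w | t)
        p_wd = p_zd[d] @ p_wz  # shape (W,)
        log_likelihood += np.sum(np.log(p_wd[doc]))
        n_words += len(doc)
    return np.exp(-log_likelihood / n_words)

# Sanity check: a uniform model over 4 words has perplexity = vocab size.
p_wz = np.full((2, 4), 0.25)   # 2 topics, 4 words
p_zd = np.full((3, 2), 0.5)    # 3 documents, 2 topics
docs = [[0, 1], [2], [3, 0, 1]]
print(perplexity(p_wz, p_zd, docs))  # 4.0
```

Tracking this value while increasing the iteration count is one way to see when the sampler has converged.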
```python
# Imports reconstructed for completeness; the original snippet omitted them.
import logging
import pickle as pkl
import unittest
from gzip import open as gzip_open

import numpy as np
import bitermplus as btm

LOGGER = logging.getLogger(__name__)


class TestBTM(unittest.TestCase):
    # Plotting tests
    def test_btm_class(self):
        with gzip_open('../dataset/SearchSnippets.txt.gz', 'rb') as file:
            texts = file.readlines()
        X, vocab = btm.get_words_freqs(texts)
        docs_vec = btm.get_vectorized_docs(X)
        biterms = btm.get_biterms(X)
        LOGGER.info('Modeling started')
        model = btm.BTM(X, vocab, T=8, W=vocab.size, M=20, alpha=50/8, beta=0.01)
        # t1 = time.time()
        model.fit(biterms, seed=12345, iterations=20)
        # t2 = time.time()
        # LOGGER.info(t2 - t1)
        self.assertIsInstance(model.matrix_topics_words_, np.ndarray)
        self.assertTupleEqual(model.matrix_topics_words_.shape, (8, vocab.size))
        LOGGER.info('Modeling finished')
        LOGGER.info('Inference started')
        p_zd = model.transform(docs_vec)
        print("sum_b", p_zd)
        LOGGER.info('Inference "sum_b" finished')
        p_zd = model.transform(docs_vec, infer_type='sum_w')
        print("sum_w", p_zd)
        LOGGER.info('Inference "sum_w" finished')
        p_zd = model.transform(docs_vec, infer_type='mix')
        print("mix", p_zd)
        LOGGER.info('Inference "mix" finished')
        LOGGER.info('Perplexity started')
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        self.assertIsInstance(perplexity, float)
        self.assertNotEqual(perplexity, 0.)
        LOGGER.info('Perplexity finished')
        LOGGER.info('Coherence started')
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        self.assertIsInstance(coherence, np.ndarray)
        self.assertGreater(coherence.shape[0], 0)
        LOGGER.info('Coherence finished')
        LOGGER.info('Model saving/loading started')
        with open('model.pickle', 'wb') as file:
            self.assertIsNone(pkl.dump(model, file))
        with open('model.pickle', 'rb') as file:
            self.assertIsInstance(pkl.load(file), btm._btm.BTM)
        LOGGER.info('Model saving/loading finished')


if __name__ == '__main__':
    unittest.main()
```

I ran tests/test_btm.py directly and printed the results of p_zd for the different infer_type values, without making any changes to the code.

```
sum_b
[[9.99989362e-01 1.51973980e-06 1.51973980e-06 ... 1.51973980e-06
  1.51973980e-06 1.51973980e-06]
 [9.99885135e-01 1.64092341e-05 1.64092341e-05 ... 1.64092341e-05
  1.64092341e-05 1.64092341e-05]
 [9.99941434e-01 8.36655753e-06 8.36655753e-06 ... 8.36655753e-06
  8.36655753e-06 8.36655753e-06]
 ...
 [9.99962688e-01 5.33031226e-06 5.33031226e-06 ... 5.33031226e-06
  5.33031226e-06 5.33031226e-06]
 [9.99977867e-01 3.16187601e-06 3.16187601e-06 ... 3.16187601e-06
  3.16187601e-06 3.16187601e-06]
 [9.99899798e-01 1.43145008e-05 1.43145008e-05 ... 1.43145008e-05
  1.43145008e-05 1.43145008e-05]]

sum_w
[[9.99971071e-01 4.13271496e-06 4.13271496e-06 ... 4.13271496e-06
  4.13271496e-06 4.13271496e-06]
 [9.99908825e-01 1.30250357e-05 1.30250357e-05 ... 1.30250357e-05
  1.30250357e-05 1.30250357e-05]
 [9.99932596e-01 9.62911447e-06 9.62911447e-06 ... 9.62911447e-06
  9.62911447e-06 9.62911447e-06]
 ...
 [9.99949473e-01 7.21817072e-06 7.21817072e-06 ... 7.21817072e-06
  7.21817072e-06 7.21817072e-06]
 [9.99959500e-01 5.78571297e-06 5.78571297e-06 ... 5.78571297e-06
  5.78571297e-06 5.78571297e-06]
 [9.99910889e-01 1.27301811e-05 1.27301811e-05 ... 1.27301811e-05
  1.27301811e-05 1.27301811e-05]]

mix
[[9.99999971e-01 4.08965415e-09 4.08965415e-09 ... 4.08965415e-09
  4.08965415e-09 4.08965415e-09]
 [9.99998995e-01 1.43548690e-07 1.43548690e-07 ... 1.43548690e-07
  1.43548690e-07 1.43548690e-07]
 [9.99999967e-01 4.68500846e-09 4.68500846e-09 ... 4.68500846e-09
  4.68500846e-09 4.68500846e-09]
 ...
 [1.00000000e+00 3.58952802e-13 3.58952802e-13 ... 3.58952802e-13
  3.58952802e-13 3.58952802e-13]
 [1.00000000e+00 3.92627489e-13 3.92627489e-13 ... 3.92627489e-13
  3.92627489e-13 3.92627489e-13]
 [9.99999686e-01 4.48075858e-08 4.48075858e-08 ... 4.48075858e-08
  4.48075858e-08 4.48075858e-08]]
```
Still, I cannot replicate your results; I am getting more sensible values with the same code. Please post the output of
iterations=500 gives the same result.
I have managed to reproduce this bug under macOS and Windows, but the model is fitted correctly under Linux. I will try to figure out the cause.
My OS is Win10. |
I have found the cause: the random number generator was broken under Windows and macOS (but not under Linux). A new version of the package will be released shortly.
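For context on how an RNG can behave differently per platform: C's `rand()` has an implementation-defined `RAND_MAX` (only 32767 under MSVC, versus 2³¹−1 with glibc), so native extensions that rely on it can produce different, or even broken, sampling streams across operating systems. NumPy's `Generator`, by contrast, is fully specified, so the same seed yields the same stream on every platform. This is a general illustration of portable seeding, not necessarily the package's actual fix:

```python
import numpy as np

# PCG64 (the default bit generator) is fully specified: two generators
# with the same seed produce identical draws on any OS or architecture.
a = np.random.default_rng(12345).integers(0, 100, size=5)
b = np.random.default_rng(12345).integers(0, 100, size=5)
print(np.array_equal(a, b))  # True

# Different seeds diverge, as expected.
c = np.random.default_rng(54321).integers(0, 100, size=5)
print(np.array_equal(a, c))
```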
Thank you again for drawing attention to this problem! It is now fixed, and the new release is available on PyPI.
```
topic
[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]
```