The topic distribution is similar for all documents #1

JennieGerhardt · 2021-04-02T02:05:57Z

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]

maximtrp · 2021-04-02T05:44:17Z

Hello! Thank you for reporting, but this is not enough information to give any feedback. Please provide details on your corpus (total number of documents and words), the model parameters, the fitting process (number of iterations), and the package version.
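The details requested above can be gathered in one go. The following is a minimal sketch using only the standard library; the function name and the default package list are assumptions chosen for illustration, not something the package provides.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("bitermplus", "numpy", "scipy", "Cython")):
    """Collect OS, Python, and package versions for a bug report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Including the operating system in the report is cheap, and platform differences are easy to overlook when a result cannot be replicated.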

JennieGerhardt · 2021-04-03T02:42:48Z

The corpus is Chinese text split with spaces.
Model parameters: T=i, W=vocab.size, M=20, alpha=50/i, beta=0.01

100%|██████████| 18374731/18374731 [00:05<00:00, 3271476.08it/s]
100%|██████████| 20/20 [01:12<00:00, 3.64s/it]
100%|██████████| 45388/45388 [00:03<00:00, 12323.28it/s]

Version: 0.5.10

Thank you for your help!

JennieGerhardt · 2021-04-03T02:52:58Z

This is the output of test_btm.py when I run it on the SearchSnippets.txt corpus:
100%|██████████| 641202/641202 [00:00<00:00, 2503526.78it/s]
100%|██████████| 20/20 [00:02<00:00, 7.64it/s]
100%|██████████| 12295/12295 [00:00<00:00, 158025.11it/s]

This is the resulting p_zd (documents vs. topics probability matrix):
[[9.99999967e-01 4.72269092e-09 4.72269092e-09 ... 4.72269092e-09
  4.72269092e-09 4.72269092e-09]
 [9.99999130e-01 1.24323721e-07 1.24323721e-07 ... 1.24323721e-07
  1.24323721e-07 1.24323721e-07]
 [9.99999968e-01 4.62112701e-09 4.62112701e-09 ... 4.62112701e-09
  4.62112701e-09 4.62112701e-09]
 ...
 [1.00000000e+00 2.54964359e-13 2.54964359e-13 ... 2.54964359e-13
  2.54964359e-13 2.54964359e-13]
 [1.00000000e+00 3.10026664e-13 3.10026664e-13 ... 3.10026664e-13
  3.10026664e-13 3.10026664e-13]
 [9.99999712e-01 4.11230259e-08 4.11230259e-08 ... 4.11230259e-08
  4.11230259e-08 4.11230259e-08]]
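The matrices posted so far share one pattern: every document puts nearly all of its probability on topic 0. A quick way to quantify that is to count how often each topic wins the per-document argmax. The helper below is a hypothetical sketch (its name is not part of bitermplus), and the sample rows are rounded from the first output in this thread purely for illustration.

```python
from collections import Counter


def dominant_topic_share(p_zd):
    """Return (topic_index, share) for the most frequent argmax topic.

    p_zd is any documents x topics matrix (nested lists or a NumPy array).
    """
    winners = Counter()
    for row in p_zd:
        probs = list(row)
        # Index of the largest probability in this document's row.
        winners[max(range(len(probs)), key=probs.__getitem__)] += 1
    topic, count = winners.most_common(1)[0]
    return topic, count / sum(winners.values())


# Rows shaped like the output above: topic 0 takes almost all the mass.
collapsed = [
    [0.99999875, 3.1e-07, 3.1e-07, 3.1e-07, 3.1e-07],
    [0.99999990, 2.4e-08, 2.4e-08, 2.4e-08, 2.4e-08],
    [0.99999926, 1.8e-07, 1.8e-07, 1.8e-07, 1.8e-07],
]
print(dominant_topic_share(collapsed))  # (0, 1.0)
```

A share of 1.0 on a real corpus means every document was assigned the same topic, which points to a fitting problem rather than to a property of the data.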

maximtrp · 2021-04-03T06:48:55Z

Running 20 iterations may lead to such results. This is simply not enough for the model to converge. My recent experiments show that model perplexity stabilizes somewhere around 500 iterations.

But even with such a small number of iterations I cannot replicate this result. Could you please post the full code you are using, and also pass a seed value to the model's fit method?
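The remark about perplexity stabilizing can be checked numerically by recording perplexity at increasing iteration counts and testing for a plateau. The helper below is a sketch under assumptions: the function name, the window size, and the 1% tolerance are arbitrary choices, and the sample values are made-up numbers, not measurements from this package.

```python
def has_stabilized(values, window=3, rel_tol=0.01):
    """True if the last `window` relative changes are all below rel_tol."""
    if len(values) < window + 1:
        return False
    recent = values[-(window + 1):]
    return all(
        abs(b - a) <= rel_tol * abs(a)
        for a, b in zip(recent, recent[1:])
    )


# Synthetic perplexity readings, for illustration only.
readings = [900.0, 600.0, 450.0, 400.0, 398.0, 397.0, 396.5]
print(has_stabilized(readings[:4]))  # False: still dropping quickly
print(has_stabilized(readings))      # True: last three changes are under 1%
```

In practice the readings would come from calling btm.perplexity on models fitted with, say, 20, 50, 100, 200 and 500 iterations and the same seed.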

JennieGerhardt · 2021-04-03T07:36:33Z

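import logging
import pickle as pkl
import unittest
from gzip import open as gzip_open

import numpy as np
import bitermplus as btm

LOGGER = logging.getLogger(__name__)


class TestBTM(unittest.TestCase):

    # End-to-end test: fitting, inference, perplexity, coherence, pickling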
    def test_btm_class(self):
        with gzip_open('../dataset/SearchSnippets.txt.gz', 'rb') as file:
            texts = file.readlines()

        X, vocab = btm.get_words_freqs(texts)
        docs_vec = btm.get_vectorized_docs(X)
        biterms = btm.get_biterms(X)

        LOGGER.info('Modeling started')
        model = btm.BTM(X, vocab, T=8, W=vocab.size, M=20, alpha=50/8, beta=0.01)
        # t1 = time.time()
        model.fit(biterms, seed=12345, iterations=20)
        # t2 = time.time()
        # LOGGER.info(t2 - t1)
        self.assertIsInstance(model.matrix_topics_words_, np.ndarray)
        self.assertTupleEqual(model.matrix_topics_words_.shape, (8, vocab.size))
        LOGGER.info('Modeling finished')

        LOGGER.info('Inference started')
        p_zd = model.transform(docs_vec)
        print("sum_b",p_zd)
        LOGGER.info('Inference "sum_b" finished')
        p_zd = model.transform(docs_vec, infer_type='sum_w')
        print("sum_w",p_zd)
        LOGGER.info('Inference "sum_w" finished')
        p_zd = model.transform(docs_vec, infer_type='mix')
        print("mix",p_zd)
        LOGGER.info('Inference "mix" finished')

        LOGGER.info('Perplexity started')
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        self.assertIsInstance(perplexity, float)
        self.assertNotEqual(perplexity, 0.)
        LOGGER.info('Perplexity finished')

        LOGGER.info('Coherence started')
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        self.assertIsInstance(coherence, np.ndarray)
        self.assertGreater(coherence.shape[0], 0)
        LOGGER.info('Coherence finished')

        LOGGER.info('Model saving/loading started')
        with open('model.pickle', 'wb') as file:
            self.assertIsNone(pkl.dump(model, file))
        with open('model.pickle', 'rb') as file:
            self.assertIsInstance(pkl.load(file), btm._btm.BTM)
        LOGGER.info('Model saving/loading finished')

if __name__ == '__main__':
    unittest.main()

I ran tests/test_btm.py directly and printed the results of p_zd for different infer_types, without making any changes to the code.
100%|██████████| 641202/641202 [00:00<00:00, 3105825.15it/s]
100%|██████████| 20/20 [00:02<00:00, 8.80it/s]
100%|██████████| 12295/12295 [00:00<00:00, 54307.48it/s]

sum_b

[[9.99989362e-01 1.51973980e-06 1.51973980e-06 ... 1.51973980e-06
  1.51973980e-06 1.51973980e-06]
 [9.99885135e-01 1.64092341e-05 1.64092341e-05 ... 1.64092341e-05
  1.64092341e-05 1.64092341e-05]
 [9.99941434e-01 8.36655753e-06 8.36655753e-06 ... 8.36655753e-06
  8.36655753e-06 8.36655753e-06]
 ...
 [9.99962688e-01 5.33031226e-06 5.33031226e-06 ... 5.33031226e-06
  5.33031226e-06 5.33031226e-06]
 [9.99977867e-01 3.16187601e-06 3.16187601e-06 ... 3.16187601e-06
  3.16187601e-06 3.16187601e-06]
 [9.99899798e-01 1.43145008e-05 1.43145008e-05 ... 1.43145008e-05
  1.43145008e-05 1.43145008e-05]]

sum_w

100%|██████████| 12295/12295 [00:00<00:00, 88027.49it/s]

[[9.99971071e-01 4.13271496e-06 4.13271496e-06 ... 4.13271496e-06
  4.13271496e-06 4.13271496e-06]
 [9.99908825e-01 1.30250357e-05 1.30250357e-05 ... 1.30250357e-05
  1.30250357e-05 1.30250357e-05]
 [9.99932596e-01 9.62911447e-06 9.62911447e-06 ... 9.62911447e-06
  9.62911447e-06 9.62911447e-06]
 ...
 [9.99949473e-01 7.21817072e-06 7.21817072e-06 ... 7.21817072e-06
  7.21817072e-06 7.21817072e-06]
 [9.99959500e-01 5.78571297e-06 5.78571297e-06 ... 5.78571297e-06
  5.78571297e-06 5.78571297e-06]
 [9.99910889e-01 1.27301811e-05 1.27301811e-05 ... 1.27301811e-05
  1.27301811e-05 1.27301811e-05]]

mix

100%|██████████| 12295/12295 [00:00<00:00, 181185.33it/s]

[[9.99999971e-01 4.08965415e-09 4.08965415e-09 ... 4.08965415e-09
  4.08965415e-09 4.08965415e-09]
 [9.99998995e-01 1.43548690e-07 1.43548690e-07 ... 1.43548690e-07
  1.43548690e-07 1.43548690e-07]
 [9.99999967e-01 4.68500846e-09 4.68500846e-09 ... 4.68500846e-09
  4.68500846e-09 4.68500846e-09]
 ...
 [1.00000000e+00 3.58952802e-13 3.58952802e-13 ... 3.58952802e-13
  3.58952802e-13 3.58952802e-13]
 [1.00000000e+00 3.92627489e-13 3.92627489e-13 ... 3.92627489e-13
  3.92627489e-13 3.92627489e-13]
 [9.99999686e-01 4.48075858e-08 4.48075858e-08 ... 4.48075858e-08
  4.48075858e-08 4.48075858e-08]]

maximtrp · 2021-04-03T08:42:39Z

I still cannot replicate your results. I am getting more sensible values with the same code. Please post the output of pip list and the results obtained with 200 iterations (not 20).

JennieGerhardt · 2021-04-03T08:48:58Z

Package                            Version
---------------------------------- ------------
-atplotlib                         2.2.3
-illow                             5.2.0
absl-py                            0.10.0
alabaster                          0.7.11
anaconda-client                    1.7.2
anaconda-navigator                 1.9.2
anaconda-project                   0.8.2
appdirs                            1.4.3
Appium-Python-Client               1.0.2
asn1crypto                         0.24.0
astor                              0.8.1
astroid                            2.0.4
astropy                            3.0.4
astunparse                         1.6.3
atomicwrites                       1.2.1
attrs                              18.2.0
Automat                            0.7.0
Babel                              2.6.0
backcall                           0.1.0
backports.shutil-get-terminal-size 1.0.0
beautifulsoup4                     4.6.3
bert-serving                       0.0.1
bert-serving-client                1.10.0
bert-serving-server                1.10.0
bert-tensorflow                    1.0.4
bibtexparser                       1.2.0
bitarray                           0.8.3
bitermplus                         0.5.10
bkcharts                           0.2
blaze                              0.11.3
bleach                             2.1.4
bokeh                              0.13.0
boto                               2.49.0
boto3                              1.10.48
botocore                           1.13.48
Bottleneck                         1.2.1
cachetools                         4.1.1
certifi                            2020.11.8
cffi                               1.11.5
chardet                            3.0.4
click                              6.7
cloudpickle                        0.5.5
clyent                             1.2.2
cmake                              3.18.4.post1
colorama                           0.3.9
comtypes                           1.1.7
conda                              4.8.3
conda-build                        3.15.1
conda-package-handling             1.7.0
constantly                         15.1.0
contextlib2                        0.5.5
cryptography                       3.0
cssselect                          1.1.0
cycler                             0.10.0
Cython                             0.29.14
cytoolz                            0.9.0.1
dask                               0.19.1
datashape                          0.5.4
decorator                          4.3.0
defusedxml                         0.5.0
distributed                        1.23.1
docutils                           0.14
docx                               0.2.4
emoji                              0.6.0
entrypoints                        0.2.3
et-xmlfile                         1.0.1
fake-useragent                     0.1.11
Faker                              4.1.1
fastcache                          1.0.2
filelock                           3.0.8
Flask                              1.0.2
Flask-Cors                         3.0.6
funcy                              1.15
future                             0.18.2
gast                               0.3.3
gensim                             3.8.3
gevent                             1.3.6
glob2                              0.6
google-api-core                    1.23.0
google-auth                        1.23.0
google-auth-oauthlib               0.4.1
google-cloud-language              2.0.0
google-pasta                       0.2.0
googleapis-common-protos           1.52.0
GPUtil                             1.4.0
greenlet                           0.4.15
grpcio                             1.33.2
h5py                               2.10.0
heapdict                           1.0.0
html5lib                           1.0.1
hyperlink                          18.0.0
idna                               2.10
imageio                            2.4.1
imagesize                          1.1.0
importlib-metadata                 1.7.0
incremental                        17.5.0
ipykernel                          4.10.0
ipython                            6.5.0
ipython-genutils                   0.2.0
ipywidgets                         7.4.1
isort                              4.3.4
itsdangerous                       0.24
jdcal                              1.4
jedi                               0.12.1
jieba                              0.42.1
Jinja2                             2.10
jmespath                           0.10.0
joblib                             0.16.0
jsonschema                         2.6.0
jupyter                            1.0.0
jupyter-client                     5.2.3
jupyter-console                    5.2.0
jupyter-core                       4.4.0
jupyterlab                         0.34.9
jupyterlab-launcher                0.13.1
Keras                              2.2.4
Keras-Applications                 1.0.8
keras-bert                         0.82.0
keras-embed-sim                    0.8.0
keras-layer-normalization          0.14.0
keras-multi-head                   0.27.0
keras-pos-embd                     0.11.0
keras-position-wise-feed-forward   0.6.0
Keras-Preprocessing                1.1.2
keras-self-attention               0.46.0
keras-transformer                  0.38.0
keyring                            13.2.1
kiwisolver                         1.0.1
lazy-object-proxy                  1.3.1
libcst                             0.3.13
llvmlite                           0.24.0
locket                             0.2.0
lxml                               4.2.5
Markdown                           3.2.2
MarkupSafe                         1.0
matplotlib                         3.3.2
mccabe                             0.6.1
menuinst                           1.4.14
mglearn                            0.1.9
mistune                            0.8.3
mkl-fft                            1.0.4
mkl-random                         1.0.1
mmdnn                              0.1.3
mock                               4.0.2
more-itertools                     4.3.0
mouse                              0.7.1
move                               0.1.3
mpmath                             1.0.0
msgpack                            0.5.6
MulticoreTSNE                      0.1
multipledispatch                   0.6.0
mypy-extensions                    0.4.3
mysqlclient                        2.0.1
navigator-updater                  0.2.1
nbconvert                          5.4.0
nbformat                           4.4.0
networkx                           2.1
nltk                               3.5
nose                               1.3.7
notebook                           5.6.0
numba                              0.39.0
numexpr                            2.6.8
numpy                              1.18.5
numpydoc                           0.8.0
oauthlib                           3.1.0
odo                                0.5.1
olefile                            0.46
openpyxl                           2.5.6
opt-einsum                         3.3.0
packaging                          17.1
pandas                             0.23.4
pandocfilters                      1.4.2
parsel                             1.5.2
parso                              0.3.1
partd                              0.3.8
path.py                            11.1.0
pathlib2                           2.3.2
patsy                              0.5.0
pep8                               1.7.1
pickleshare                        0.7.4
Pillow                             7.2.0
pip                                20.2.2
pkginfo                            1.4.2
pluggy                             0.7.1
ply                                3.11
prometheus-client                  0.3.1
prompt-toolkit                     1.0.15
proto-plus                         1.11.0
protobuf                           3.13.0
psutil                             5.4.7
py                                 1.6.0
pyasn1                             0.4.8
pyasn1-modules                     0.2.8
pycodestyle                        2.4.0
pycosat                            0.6.3
pycparser                          2.18
pycrypto                           2.6.1
pycurl                             7.43.0.5
PyDispatcher                       2.0.5
pyflakes                           2.0.0
Pygments                           2.2.0
PyHamcrest                         2.0.2
pyLDAvis                           2.1.2
pylint                             2.1.1
pymongo                            3.11.0
PyMouse                            1.0
PyMySQL                            0.10.0
pyodbc                             4.0.24
pyOpenSSL                          19.1.0
pyparsing                          2.2.0
PySocks                            1.6.8
pytest                             3.8.0
pytest-arraydiff                   0.2
pytest-astropy                     0.4.0
pytest-doctestplus                 0.1.3
pytest-openfiles                   0.3.0
pytest-remotedata                  0.3.0
pytest-runner                      5.2
python-dateutil                    2.7.3
python-docx                        0.8.10
pytorch-pretrained-bert            0.6.2
pytz                               2020.4
PyWavelets                         1.0.0
pywin32                            223
pywinpty                           0.5.4
PyYAML                             5.3.1
pyzmq                              17.1.2
qt5reactor                         0.6.3
QtAwesome                          0.4.4
qtconsole                          4.4.1
QtPy                               1.5.0
queuelib                           1.5.0
redis                              3.5.3
regex                              2020.7.14
requests                           2.25.0
requests-oauthlib                  1.3.0
rope                               0.11.0
rsa                                4.6
ruamel-yaml                        0.15.46
s3transfer                         0.2.1
scikit-image                       0.14.0
scikit-learn                       0.19.2
scipy                              1.4.1
Scrapy                             1.6.0
seaborn                            0.9.0
selenium                           3.141.0
Send2Trash                         1.5.0
service-identity                   17.0.0
setuptools                         50.3.2
simplegeneric                      0.8.1
singledispatch                     3.4.0.3
six                                1.15.0
smart-open                         2.1.1
snowballstemmer                    1.2.1
snownlp                            0.12.3
sortedcollections                  1.0.1
sortedcontainers                   2.0.5
Sphinx                             1.7.9
sphinxcontrib-websupport           1.1.0
spyder                             3.3.1
spyder-kernels                     0.2.6
SQLAlchemy                         1.2.11
statsmodels                        0.9.0
stop-words                         2018.7.23
sympy                              1.1.1
tables                             3.4.4
tblib                              1.3.2
tensorboard                        1.14.0
tensorboard-plugin-wit             1.7.0
tensorflow-estimator               1.14.0
tensorflow-gpu                     1.14.0
termcolor                          1.1.0
terminado                          0.8.1
testpath                           0.3.1
text-unidecode                     1.3
toolz                              0.9.0
torch                              1.6.0+cu101
torchvision                        0.7.0+cu101
tornado                            5.1
tqdm                               4.26.0
traitlets                          4.3.2
Twisted                            18.7.0
typing-extensions                  3.7.4.3
typing-inspect                     0.6.0
unicodecsv                         0.14.1
urllib3                            1.25.10
w3lib                              1.21.0
wcwidth                            0.1.7
webencodings                       0.5.1
Werkzeug                           0.14.1
wheel                              0.31.1
widgetsnbextension                 3.4.1
win-inet-pton                      1.0.1
win-unicode-console                0.5
wincertstore                       0.2
wrapt                              1.12.1
xlrd                               1.1.0
XlsxWriter                         1.1.0
xlwings                            0.11.8
xlwt                               1.3.0
zict                               0.1.3
zipp                               3.1.0
zope.interface                     4.5.0

Running 500 iterations gives the same result.

maximtrp · 2021-04-03T09:45:03Z

I have managed to reproduce this bug under macOS and Windows, but the model is fitted correctly under Linux. I will try to figure out the cause.

JennieGerhardt · 2021-04-03T10:30:11Z

My OS is Win10.
I created a new virtual environment and updated the packages, but it still doesn't work.

maximtrp · 2021-04-03T11:56:28Z

I have found the cause: the random number generator was broken under Windows and macOS (but not under Linux). A new version of the package will be released shortly.
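The thread does not say what exactly was wrong with the generator, so the following is only a hypothetical illustration of how a faulty uniform generator can produce the symptom reported here. It is not bitermplus code: the sampler and the "stuck" generator are assumptions. With inverse-CDF sampling, a uniform draw that is always zero selects the first topic every time.

```python
import random


def sample_topic(weights, uniform):
    """Inverse-CDF draw: first index whose cumulative weight exceeds u."""
    u = uniform() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if u < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


weights = [1.0] * 8  # eight equally likely topics

healthy = random.Random(12345)


def stuck():
    """Hypothetical broken generator: always returns 0."""
    return 0.0


healthy_draws = {sample_topic(weights, healthy.random) for _ in range(1000)}
stuck_draws = {sample_topic(weights, stuck) for _ in range(1000)}

print(sorted(healthy_draws))  # several different topics
print(sorted(stuck_draws))    # [0]
```

If every assignment during sampling lands on topic 0, the fitted model has nothing to distinguish the other topics, which would be consistent with the near-identical rows shown earlier.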

maximtrp · 2021-04-04T08:59:56Z

Thank you again for drawing attention to this problem! It is now fixed, and the new release is available on PyPI.

maximtrp self-assigned this Apr 3, 2021

maximtrp added the bug, good first issue, and help wanted labels Apr 3, 2021

maximtrp closed this as completed Apr 4, 2021
