the topic distribution for all doc is similar #1

Closed
JennieGerhardt opened this issue Apr 2, 2021 · 11 comments

Comments

@JennieGerhardt

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07  3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08  2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07  1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07  2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10  3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07  3.92884503e-07]

@maximtrp
Owner

maximtrp commented Apr 2, 2021

Hello! Thank you for reporting, but this is not enough information to give any feedback. Please provide details of your corpus (total number of documents and words), the model parameters, the fitting process (number of iterations), and the package version.

@JennieGerhardt
Author

The corpus is Chinese text split with spaces.
Model parameters: T=i, W=vocab.size, M=20, alpha=50/i, beta=0.01

100%|██████████| 18374731/18374731 [00:05<00:00, 3271476.08it/s]
100%|██████████| 20/20 [01:12<00:00, 3.64s/it]
100%|██████████| 45388/45388 [00:03<00:00, 12323.28it/s]

Version: 0.5.10

Thank you for your help!

@JennieGerhardt
Author

This is the output of test_btm.py when I run it on the SearchSnippets.txt corpus:
100%|██████████| 641202/641202 [00:00<00:00, 2503526.78it/s]
100%|██████████| 20/20 [00:02<00:00, 7.64it/s]
100%|██████████| 12295/12295 [00:00<00:00, 158025.11it/s]

This is the resulting p_zd (documents vs. topics probability matrix):
[[9.99999967e-01 4.72269092e-09 4.72269092e-09 ... 4.72269092e-09
  4.72269092e-09 4.72269092e-09]
 [9.99999130e-01 1.24323721e-07 1.24323721e-07 ... 1.24323721e-07
  1.24323721e-07 1.24323721e-07]
 [9.99999968e-01 4.62112701e-09 4.62112701e-09 ... 4.62112701e-09
  4.62112701e-09 4.62112701e-09]
 ...
 [1.00000000e+00 2.54964359e-13 2.54964359e-13 ... 2.54964359e-13
  2.54964359e-13 2.54964359e-13]
 [1.00000000e+00 3.10026664e-13 3.10026664e-13 ... 3.10026664e-13
  3.10026664e-13 3.10026664e-13]
 [9.99999712e-01 4.11230259e-08 4.11230259e-08 ... 4.11230259e-08
  4.11230259e-08 4.11230259e-08]]
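A quick way to confirm this symptom on any documents vs. topics matrix is to count how many rows put nearly all of their mass on one topic, and whether that topic is the same for every document. The following is only a minimal NumPy sketch: the helper name `describe_collapse` and the 0.99 threshold are assumptions made for illustration and are not part of bitermplus.

```python
import numpy as np


def describe_collapse(p_zd, threshold=0.99):
    """Summarize how concentrated a documents-vs-topics matrix is.

    p_zd: array of shape (n_docs, n_topics) whose rows sum to about 1.
    threshold: hypothetical cut-off for calling a row "collapsed".
    """
    p_zd = np.asarray(p_zd, dtype=float)
    top_topic = p_zd.argmax(axis=1)  # dominant topic of each document
    top_prob = p_zd.max(axis=1)      # probability of that dominant topic
    collapsed = top_prob >= threshold
    return {
        "share_collapsed": float(collapsed.mean()),
        "distinct_top_topics": int(np.unique(top_topic).size),
    }


# Two rows copied from the first matrix posted in this issue.
sample = np.array([
    [9.99998750e-01, 3.12592152e-07, 3.12592152e-07, 3.12592152e-07, 3.12592152e-07],
    [9.99999903e-01, 2.43742411e-08, 2.43742411e-08, 2.43742411e-08, 2.43742411e-08],
])
print(describe_collapse(sample))
```

A share of collapsed rows close to 1 together with a single distinct dominant topic is the pattern reported here; a fitted model would normally show several distinct dominant topics across documents.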

@maximtrp
Owner

maximtrp commented Apr 3, 2021

Running 20 iterations may lead to such results. This is simply not enough for the model to converge. My recent experiments show that model perplexity stabilizes somewhere around 500 iterations.

But even with such a small number of iterations, I cannot replicate this result. Could you please post the full code you are using and also pass a seed value to the model's fit method?
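To judge whether a run has gone on long enough, one option is to record perplexity at regular intervals and check whether the recent values have stopped changing. The sketch below is a plain-Python helper under stated assumptions: the name `has_plateaued`, the window of 3, the 1% tolerance, and the sample readings are all hypothetical and are not measurements from this issue.

```python
def has_plateaued(values, window=3, rel_tol=0.01):
    """Return True if each of the last `window` relative changes is within rel_tol."""
    if len(values) < window + 1:
        return False  # not enough readings to judge yet
    recent = values[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if abs(curr - prev) / abs(prev) > rel_tol:
            return False
    return True


# Made-up perplexity readings, e.g. one per 100 iterations.
readings = [950.0, 700.0, 610.0, 604.0, 603.0, 602.5]
print(has_plateaued(readings[:4]))  # early readings still changing a lot
print(has_plateaued(readings))      # later readings change by under 1%
```

The tolerance and window are judgment calls; a stricter tolerance simply asks for more iterations before declaring convergence.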

@JennieGerhardt
Author

JennieGerhardt commented Apr 3, 2021

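# Imports omitted from the original post; the LOGGER setup is assumed.
import logging
import pickle as pkl
import unittest
from gzip import open as gzip_open

import numpy as np
import bitermplus as btm

LOGGER = logging.getLogger(__name__)


class TestBTM(unittest.TestCase):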

    # Plotting tests
    def test_btm_class(self):
        with gzip_open('../dataset/SearchSnippets.txt.gz', 'rb') as file:
            texts = file.readlines()

        X, vocab = btm.get_words_freqs(texts)
        docs_vec = btm.get_vectorized_docs(X)
        biterms = btm.get_biterms(X)

        LOGGER.info('Modeling started')
        model = btm.BTM(X, vocab, T=8, W=vocab.size, M=20, alpha=50/8, beta=0.01)
        # t1 = time.time()
        model.fit(biterms, seed=12345, iterations=20)
        # t2 = time.time()
        # LOGGER.info(t2 - t1)
        self.assertIsInstance(model.matrix_topics_words_, np.ndarray)
        self.assertTupleEqual(model.matrix_topics_words_.shape, (8, vocab.size))
        LOGGER.info('Modeling finished')

        LOGGER.info('Inference started')
        p_zd = model.transform(docs_vec)
        print("sum_b",p_zd)
        LOGGER.info('Inference "sum_b" finished')
        p_zd = model.transform(docs_vec, infer_type='sum_w')
        print("sum_w",p_zd)
        LOGGER.info('Inference "sum_w" finished')
        p_zd = model.transform(docs_vec, infer_type='mix')
        print("mix",p_zd)
        LOGGER.info('Inference "mix" finished')

        LOGGER.info('Perplexity started')
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        self.assertIsInstance(perplexity, float)
        self.assertNotEqual(perplexity, 0.)
        LOGGER.info('Perplexity finished')

        LOGGER.info('Coherence started')
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        self.assertIsInstance(coherence, np.ndarray)
        self.assertGreater(coherence.shape[0], 0)
        LOGGER.info('Coherence finished')

        LOGGER.info('Model saving/loading started')
        with open('model.pickle', 'wb') as file:
            self.assertIsNone(pkl.dump(model, file))
        with open('model.pickle', 'rb') as file:
            self.assertIsInstance(pkl.load(file), btm._btm.BTM)
        LOGGER.info('Model saving/loading finished')

if __name__ == '__main__':
    unittest.main()

I ran tests/test_btm.py directly and printed the results of p_zd for different infer_types, without making any changes to the code.
100%|██████████| 641202/641202 [00:00<00:00, 3105825.15it/s]
100%|██████████| 20/20 [00:02<00:00, 8.80it/s]
100%|██████████| 12295/12295 [00:00<00:00, 54307.48it/s]

sum_b

[[9.99989362e-01 1.51973980e-06 1.51973980e-06 ... 1.51973980e-06
  1.51973980e-06 1.51973980e-06]
 [9.99885135e-01 1.64092341e-05 1.64092341e-05 ... 1.64092341e-05
  1.64092341e-05 1.64092341e-05]
 [9.99941434e-01 8.36655753e-06 8.36655753e-06 ... 8.36655753e-06
  8.36655753e-06 8.36655753e-06]
 ...
 [9.99962688e-01 5.33031226e-06 5.33031226e-06 ... 5.33031226e-06
  5.33031226e-06 5.33031226e-06]
 [9.99977867e-01 3.16187601e-06 3.16187601e-06 ... 3.16187601e-06
  3.16187601e-06 3.16187601e-06]
 [9.99899798e-01 1.43145008e-05 1.43145008e-05 ... 1.43145008e-05
  1.43145008e-05 1.43145008e-05]]

sum_w

100%|██████████| 12295/12295 [00:00<00:00, 88027.49it/s]

[[9.99971071e-01 4.13271496e-06 4.13271496e-06 ... 4.13271496e-06
  4.13271496e-06 4.13271496e-06]
 [9.99908825e-01 1.30250357e-05 1.30250357e-05 ... 1.30250357e-05
  1.30250357e-05 1.30250357e-05]
 [9.99932596e-01 9.62911447e-06 9.62911447e-06 ... 9.62911447e-06
  9.62911447e-06 9.62911447e-06]
 ...
 [9.99949473e-01 7.21817072e-06 7.21817072e-06 ... 7.21817072e-06
  7.21817072e-06 7.21817072e-06]
 [9.99959500e-01 5.78571297e-06 5.78571297e-06 ... 5.78571297e-06
  5.78571297e-06 5.78571297e-06]
 [9.99910889e-01 1.27301811e-05 1.27301811e-05 ... 1.27301811e-05
  1.27301811e-05 1.27301811e-05]]

mix

100%|██████████| 12295/12295 [00:00<00:00, 181185.33it/s]

[[9.99999971e-01 4.08965415e-09 4.08965415e-09 ... 4.08965415e-09
  4.08965415e-09 4.08965415e-09]
 [9.99998995e-01 1.43548690e-07 1.43548690e-07 ... 1.43548690e-07
  1.43548690e-07 1.43548690e-07]
 [9.99999967e-01 4.68500846e-09 4.68500846e-09 ... 4.68500846e-09
  4.68500846e-09 4.68500846e-09]
 ...
 [1.00000000e+00 3.58952802e-13 3.58952802e-13 ... 3.58952802e-13
  3.58952802e-13 3.58952802e-13]
 [1.00000000e+00 3.92627489e-13 3.92627489e-13 ... 3.92627489e-13
  3.92627489e-13 3.92627489e-13]
 [9.99999686e-01 4.48075858e-08 4.48075858e-08 ... 4.48075858e-08
  4.48075858e-08 4.48075858e-08]]

@maximtrp
Owner

maximtrp commented Apr 3, 2021

I still cannot replicate your results. I am getting more sensible values with the same code. Please post the output of pip list and the results obtained with 200 iterations (not 20).
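Since `pip list` shows package versions but not the operating system or Python version, a short report covering all three can make a bug like this easier to compare across machines. This is a minimal sketch using only the standard library; the function name `environment_report` and the default package list are assumptions chosen for this example.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("bitermplus", "numpy", "scipy", "Cython")):
    """Collect OS, Python, and selected package versions for a bug report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting this output alongside the full `pip list` keeps the report short while still naming the platform.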

@maximtrp maximtrp self-assigned this Apr 3, 2021
@maximtrp maximtrp added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed labels Apr 3, 2021
@JennieGerhardt
Author

JennieGerhardt commented Apr 3, 2021

Package                            Version
---------------------------------- ------------
-atplotlib                         2.2.3
-illow                             5.2.0
absl-py                            0.10.0
alabaster                          0.7.11
anaconda-client                    1.7.2
anaconda-navigator                 1.9.2
anaconda-project                   0.8.2
appdirs                            1.4.3
Appium-Python-Client               1.0.2
asn1crypto                         0.24.0
astor                              0.8.1
astroid                            2.0.4
astropy                            3.0.4
astunparse                         1.6.3
atomicwrites                       1.2.1
attrs                              18.2.0
Automat                            0.7.0
Babel                              2.6.0
backcall                           0.1.0
backports.shutil-get-terminal-size 1.0.0
beautifulsoup4                     4.6.3
bert-serving                       0.0.1
bert-serving-client                1.10.0
bert-serving-server                1.10.0
bert-tensorflow                    1.0.4
bibtexparser                       1.2.0
bitarray                           0.8.3
bitermplus                         0.5.10
bkcharts                           0.2
blaze                              0.11.3
bleach                             2.1.4
bokeh                              0.13.0
boto                               2.49.0
boto3                              1.10.48
botocore                           1.13.48
Bottleneck                         1.2.1
cachetools                         4.1.1
certifi                            2020.11.8
cffi                               1.11.5
chardet                            3.0.4
click                              6.7
cloudpickle                        0.5.5
clyent                             1.2.2
cmake                              3.18.4.post1
colorama                           0.3.9
comtypes                           1.1.7
conda                              4.8.3
conda-build                        3.15.1
conda-package-handling             1.7.0
constantly                         15.1.0
contextlib2                        0.5.5
cryptography                       3.0
cssselect                          1.1.0
cycler                             0.10.0
Cython                             0.29.14
cytoolz                            0.9.0.1
dask                               0.19.1
datashape                          0.5.4
decorator                          4.3.0
defusedxml                         0.5.0
distributed                        1.23.1
docutils                           0.14
docx                               0.2.4
emoji                              0.6.0
entrypoints                        0.2.3
et-xmlfile                         1.0.1
fake-useragent                     0.1.11
Faker                              4.1.1
fastcache                          1.0.2
filelock                           3.0.8
Flask                              1.0.2
Flask-Cors                         3.0.6
funcy                              1.15
future                             0.18.2
gast                               0.3.3
gensim                             3.8.3
gevent                             1.3.6
glob2                              0.6
google-api-core                    1.23.0
google-auth                        1.23.0
google-auth-oauthlib               0.4.1
google-cloud-language              2.0.0
google-pasta                       0.2.0
googleapis-common-protos           1.52.0
GPUtil                             1.4.0
greenlet                           0.4.15
grpcio                             1.33.2
h5py                               2.10.0
heapdict                           1.0.0
html5lib                           1.0.1
hyperlink                          18.0.0
idna                               2.10
imageio                            2.4.1
imagesize                          1.1.0
importlib-metadata                 1.7.0
incremental                        17.5.0
ipykernel                          4.10.0
ipython                            6.5.0
ipython-genutils                   0.2.0
ipywidgets                         7.4.1
isort                              4.3.4
itsdangerous                       0.24
jdcal                              1.4
jedi                               0.12.1
jieba                              0.42.1
Jinja2                             2.10
jmespath                           0.10.0
joblib                             0.16.0
jsonschema                         2.6.0
jupyter                            1.0.0
jupyter-client                     5.2.3
jupyter-console                    5.2.0
jupyter-core                       4.4.0
jupyterlab                         0.34.9
jupyterlab-launcher                0.13.1
Keras                              2.2.4
Keras-Applications                 1.0.8
keras-bert                         0.82.0
keras-embed-sim                    0.8.0
keras-layer-normalization          0.14.0
keras-multi-head                   0.27.0
keras-pos-embd                     0.11.0
keras-position-wise-feed-forward   0.6.0
Keras-Preprocessing                1.1.2
keras-self-attention               0.46.0
keras-transformer                  0.38.0
keyring                            13.2.1
kiwisolver                         1.0.1
lazy-object-proxy                  1.3.1
libcst                             0.3.13
llvmlite                           0.24.0
locket                             0.2.0
lxml                               4.2.5
Markdown                           3.2.2
MarkupSafe                         1.0
matplotlib                         3.3.2
mccabe                             0.6.1
menuinst                           1.4.14
mglearn                            0.1.9
mistune                            0.8.3
mkl-fft                            1.0.4
mkl-random                         1.0.1
mmdnn                              0.1.3
mock                               4.0.2
more-itertools                     4.3.0
mouse                              0.7.1
move                               0.1.3
mpmath                             1.0.0
msgpack                            0.5.6
MulticoreTSNE                      0.1
multipledispatch                   0.6.0
mypy-extensions                    0.4.3
mysqlclient                        2.0.1
navigator-updater                  0.2.1
nbconvert                          5.4.0
nbformat                           4.4.0
networkx                           2.1
nltk                               3.5
nose                               1.3.7
notebook                           5.6.0
numba                              0.39.0
numexpr                            2.6.8
numpy                              1.18.5
numpydoc                           0.8.0
oauthlib                           3.1.0
odo                                0.5.1
olefile                            0.46
openpyxl                           2.5.6
opt-einsum                         3.3.0
packaging                          17.1
pandas                             0.23.4
pandocfilters                      1.4.2
parsel                             1.5.2
parso                              0.3.1
partd                              0.3.8
path.py                            11.1.0
pathlib2                           2.3.2
patsy                              0.5.0
pep8                               1.7.1
pickleshare                        0.7.4
Pillow                             7.2.0
pip                                20.2.2
pkginfo                            1.4.2
pluggy                             0.7.1
ply                                3.11
prometheus-client                  0.3.1
prompt-toolkit                     1.0.15
proto-plus                         1.11.0
protobuf                           3.13.0
psutil                             5.4.7
py                                 1.6.0
pyasn1                             0.4.8
pyasn1-modules                     0.2.8
pycodestyle                        2.4.0
pycosat                            0.6.3
pycparser                          2.18
pycrypto                           2.6.1
pycurl                             7.43.0.5
PyDispatcher                       2.0.5
pyflakes                           2.0.0
Pygments                           2.2.0
PyHamcrest                         2.0.2
pyLDAvis                           2.1.2
pylint                             2.1.1
pymongo                            3.11.0
PyMouse                            1.0
PyMySQL                            0.10.0
pyodbc                             4.0.24
pyOpenSSL                          19.1.0
pyparsing                          2.2.0
PySocks                            1.6.8
pytest                             3.8.0
pytest-arraydiff                   0.2
pytest-astropy                     0.4.0
pytest-doctestplus                 0.1.3
pytest-openfiles                   0.3.0
pytest-remotedata                  0.3.0
pytest-runner                      5.2
python-dateutil                    2.7.3
python-docx                        0.8.10
pytorch-pretrained-bert            0.6.2
pytz                               2020.4
PyWavelets                         1.0.0
pywin32                            223
pywinpty                           0.5.4
PyYAML                             5.3.1
pyzmq                              17.1.2
qt5reactor                         0.6.3
QtAwesome                          0.4.4
qtconsole                          4.4.1
QtPy                               1.5.0
queuelib                           1.5.0
redis                              3.5.3
regex                              2020.7.14
requests                           2.25.0
requests-oauthlib                  1.3.0
rope                               0.11.0
rsa                                4.6
ruamel-yaml                        0.15.46
s3transfer                         0.2.1
scikit-image                       0.14.0
scikit-learn                       0.19.2
scipy                              1.4.1
Scrapy                             1.6.0
seaborn                            0.9.0
selenium                           3.141.0
Send2Trash                         1.5.0
service-identity                   17.0.0
setuptools                         50.3.2
simplegeneric                      0.8.1
singledispatch                     3.4.0.3
six                                1.15.0
smart-open                         2.1.1
snowballstemmer                    1.2.1
snownlp                            0.12.3
sortedcollections                  1.0.1
sortedcontainers                   2.0.5
Sphinx                             1.7.9
sphinxcontrib-websupport           1.1.0
spyder                             3.3.1
spyder-kernels                     0.2.6
SQLAlchemy                         1.2.11
statsmodels                        0.9.0
stop-words                         2018.7.23
sympy                              1.1.1
tables                             3.4.4
tblib                              1.3.2
tensorboard                        1.14.0
tensorboard-plugin-wit             1.7.0
tensorflow-estimator               1.14.0
tensorflow-gpu                     1.14.0
termcolor                          1.1.0
terminado                          0.8.1
testpath                           0.3.1
text-unidecode                     1.3
toolz                              0.9.0
torch                              1.6.0+cu101
torchvision                        0.7.0+cu101
tornado                            5.1
tqdm                               4.26.0
traitlets                          4.3.2
Twisted                            18.7.0
typing-extensions                  3.7.4.3
typing-inspect                     0.6.0
unicodecsv                         0.14.1
urllib3                            1.25.10
w3lib                              1.21.0
wcwidth                            0.1.7
webencodings                       0.5.1
Werkzeug                           0.14.1
wheel                              0.31.1
widgetsnbextension                 3.4.1
win-inet-pton                      1.0.1
win-unicode-console                0.5
wincertstore                       0.2
wrapt                              1.12.1
xlrd                               1.1.0
XlsxWriter                         1.1.0
xlwings                            0.11.8
xlwt                               1.3.0
zict                               0.1.3
zipp                               3.1.0
zope.interface                     4.5.0

Running with iterations=500 gives the same result.

@maximtrp
Owner

maximtrp commented Apr 3, 2021

I have managed to reproduce this bug under macOS and Windows, but the model is fitted correctly under Linux. I will try to figure out the cause.

@JennieGerhardt
Author

My OS is Win10.
I created a new virtual environment and updated the packages, but it still doesn't work.

@maximtrp
Owner

maximtrp commented Apr 3, 2021

I have found the cause: the random number generator was broken under Windows and macOS (but not under Linux). A new version of the package will be released shortly.
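The comment above does not say how the generator was broken. Purely as an illustration, the sketch below shows how a faulty uniform generator can send every sample to the first topic: if draws that should cover [0, 1) are stuck near zero (here because an integer with a small range is divided by a much larger constant), inverse-CDF sampling always picks index 0. This is a hypothetical failure mode simulated in Python, not code from the bitermplus source, and every name and constant in it is an assumption.

```python
import random


def sample_topic(probs, u):
    """Pick a topic index by inverting the cumulative distribution at u in [0, 1)."""
    cumulative = 0.0
    for topic, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return topic
    return len(probs) - 1  # guard against rounding when u is close to 1


def count_assignments(uniform, probs, draws=10_000):
    """Draw `draws` topics using the given uniform generator and count them."""
    counts = [0] * len(probs)
    for _ in range(draws):
        counts[sample_topic(probs, uniform())] += 1
    return counts


rng = random.Random(12345)
probs = [0.125] * 8  # eight equally likely topics


def healthy():
    # Draws cover the whole interval [0, 1).
    return rng.random()


def broken():
    # Hypothetical bug: a 15-bit integer divided by a 31-bit maximum,
    # so every draw is below 32768 / 2**31 (about 1.5e-5).
    return rng.randrange(32768) / 2**31


print("healthy:", count_assignments(healthy, probs))
print("broken: ", count_assignments(broken, probs))
```

With the broken draws, every assignment lands on topic 0, which resembles the matrices posted earlier in this thread, where one column holds nearly all of the probability mass.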

@maximtrp
Owner

maximtrp commented Apr 4, 2021

Thank you again for drawing attention to this problem! It is now fixed, and the new release is available on PyPI.

@maximtrp maximtrp closed this as completed Apr 4, 2021
2 participants