Segfault running run-core-concepts-py against first lines of Shakespeare's sonnets #3101

Open
jorendorff opened this issue Apr 3, 2021 · 4 comments

@jorendorff

(Hi! Thanks for putting so much effort into the tutorials! Gensim looks amazing and I'm looking forward to experimenting more. Happy to help debug--anything you need. I know Python, CPython, and C/C++ pretty well, and I can drive GDB.)

run_core_concepts.py crashes with "Segmentation fault (core dumped)" when I replace text_corpus with a list of the first lines of Shakespeare's sonnets.

Steps/code/corpus to reproduce

  1. Go to https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html, scroll to the bottom, click "Download Python source code". Move the downloaded script into a new, empty directory.

  2. cd to that directory, then run python3 -m venv venv; source venv/bin/activate; pip install gensim.

  3. Note that python run_core_concepts.py now works fine, up until the end where it tries to import matplotlib.

  4. Now edit run_core_concepts.py and replace the text_corpus with this:

    text_corpus = [
        'From fairest creatures we desire increase,',
        'When forty winters shall besiege thy brow,',
        'Look in thy glass and tell the face thou viewest',
        'Unthrifty loveliness, why dost thou spend',
        'Those hours, that with gentle work did frame',
        "Then let not winter's ragged hand deface,",
        'Lo! in the orient when the gracious light',
        "Music to hear, why hear'st thou music sadly?",
        "Is it for fear to wet a widow's eye,",
        "For shame deny that thou bear'st love to any,",
        "As fast as thou shalt wane, so fast thou grow'st",
        'When I do count the clock that tells the time,',
        'O! that you were your self; but, love, you are',
        'Not from the stars do I my judgement pluck;',
        'When I consider every thing that grows',
        'But wherefore do not you a mightier way',
        'Who will believe my verse in time to come,',
        "Shall I compare thee to a summer's day?",
    ]

  5. Note that python run_core_concepts.py now crashes; the output ends with:

     [(0, 1), (7, 1), (11, 1), (12, 1), (14, 1)],
     [(3, 1), (6, 1), (12, 1)],
     [(7, 1), (11, 1), (13, 1)],
     [(14, 1)],
     [(1, 1), (12, 1)]]
    []
    Segmentation fault (core dumped)

The crash occurs on the line:

sims = index[tfidf[query_bow]]

The output of index.lifecycle_events here is:

[{'msg': 'calculated IDF weights for 18 documents and 15 features (39 matrix non-zeros)', 'datetime': '2021-04-02T19:33:36.428329', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.8.0-48-generic-x86_64-with-glibc2.29', 'event': 'initialize'}]

Versions

Linux-5.8.0-48-generic-x86_64-with-glibc2.29
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
/home/jorendorff/play/gensim-issue/venv/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
gensim 4.0.1
FAST_VERSION 1

@piskvorky
Owner

piskvorky commented Apr 3, 2021

The index in that tutorial has a hardwired number of features: num_features=12.

When you change your TFIDF corpus, you'll need to adapt its number of features too. That's the cause of the crash (technically, scipy.sparse, the library that Gensim uses for sparse matrix manipulation, doesn't check array bounds and segfaults).
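To illustrate the missing bounds check, a guard along these lines could be run on the bag-of-words corpus before the index is built. This is a minimal pure-Python sketch and not part of Gensim or scipy.sparse: the helper name `find_out_of_range_ids` is hypothetical, and the sample rows are the last five documents printed in the report above.

```python
def find_out_of_range_ids(bow_corpus, num_features):
    """Return (doc_index, term_id) pairs whose term id is outside [0, num_features).

    Hypothetical helper: bow_corpus is a list of documents, each a list of
    (term_id, count) pairs, as printed by the tutorial script.
    """
    return [
        (doc_index, term_id)
        for doc_index, doc in enumerate(bow_corpus)
        for term_id, _count in doc
        if not 0 <= term_id < num_features
    ]


# Last five documents from the crash report's output.
corpus_tail = [
    [(0, 1), (7, 1), (11, 1), (12, 1), (14, 1)],
    [(3, 1), (6, 1), (12, 1)],
    [(7, 1), (11, 1), (13, 1)],
    [(14, 1)],
    [(1, 1), (12, 1)],
]

# With the tutorial's hardwired num_features=12, ids 12, 13 and 14 do not fit.
offenders = find_out_of_range_ids(corpus_tail, num_features=12)
if offenders:
    print(f"{len(offenders)} term ids exceed num_features=12: {offenders}")
```

Raising a `ValueError` when the list is non-empty would turn the segfault into an ordinary Python exception, assuming the out-of-range ids come from the indexed documents, as they do here.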

Your log says calculated IDF weights for 18 documents and 15 features (39 matrix non-zeros) so you could do num_features=15. But IIRC you can use num_features=max(tfidf.dfs) + 1 more generally – so you don't have to update the number every time.
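The idea behind `max(tfidf.dfs) + 1` is that `max()` over a mapping returns its largest key, here the largest term id, and ids start at 0. The same value can be derived from the bag-of-words corpus directly. The sketch below is pure Python; the helper name `infer_num_features` is an assumption for illustration, and the sample rows are the tail of the corpus from the report.

```python
def infer_num_features(bow_corpus):
    """Smallest feature count that covers every term id in bow_corpus.

    Hypothetical helper: each document is a list of (term_id, count) pairs.
    An empty corpus yields 0.
    """
    largest_id = max(
        (term_id for doc in bow_corpus for term_id, _count in doc),
        default=-1,
    )
    return largest_id + 1


# Last five documents from the crash report's output.
corpus_tail = [
    [(0, 1), (7, 1), (11, 1), (12, 1), (14, 1)],
    [(3, 1), (6, 1), (12, 1)],
    [(7, 1), (11, 1), (13, 1)],
    [(14, 1)],
    [(1, 1), (12, 1)],
]

num_features = infer_num_features(corpus_tail)
print(num_features)  # 15, matching the "15 features" in the log line
```

The result would then replace the literal 12 passed as `num_features` when the tutorial builds its similarity index.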

Let me know if that helped and we can close this ticket.

@jorendorff
Author

Yes, that change fixes the crash.

@mpenkov
Collaborator

mpenkov commented Apr 3, 2021

@piskvorky Perhaps in the tutorial code, we can set the number of features dynamically, based on the dictionary size? There's no need to hard code it, and it will prevent other people from tripping over a similar problem.
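A rough sketch of sizing by the dictionary, using a plain dict as a stand-in for the tutorial's dictionary object. The helper `build_token_ids` is hypothetical, and the filtering (lowercase, split on whitespace, drop words seen only once) is an assumption about the tutorial's preprocessing rather than a copy of it. With Gensim itself, the analogous value would presumably be `len(dictionary)`.

```python
from collections import Counter


def build_token_ids(lines, stoplist=frozenset()):
    """Map each kept token to a consecutive integer id.

    Hypothetical stand-in for a dictionary: ids are handed out in order of
    first appearance, so every id is smaller than len(token2id).
    """
    tokenized = [
        [word for word in line.lower().split() if word not in stoplist]
        for line in lines
    ]
    counts = Counter(word for doc in tokenized for word in doc)
    token2id = {}
    for doc in tokenized:
        for word in doc:
            if counts[word] > 1:  # keep only words seen more than once
                token2id.setdefault(word, len(token2id))
    return token2id


lines = [
    'When forty winters shall besiege thy brow,',
    'Look in thy glass and tell the face thou viewest',
    'Unthrifty loveliness, why dost thou spend',
]
token2id = build_token_ids(lines)

# Derive the feature count from the vocabulary instead of hard-coding it.
num_features = len(token2id)
print(token2id, num_features)
```

Because the count follows the vocabulary, swapping in a different corpus changes `num_features` automatically.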

@piskvorky
Owner

piskvorky commented Apr 3, 2021

Definitely. @jorendorff can you submit a PR? Thanks.
