Segfault running run-core-concepts-py against first lines of Shakespeare's sonnets #3101

jorendorff · 2021-04-03T00:40:08Z

(Hi! Thanks for putting so much effort into the tutorials! Gensim looks amazing and I'm looking forward to experimenting more. Happy to help debug--anything you need. I know Python, CPython, and C/C++ pretty well, and I can drive GDB.)

run_core_concepts.py crashes with "Segmentation fault (core dumped)" when I replace text_corpus with a list of the first lines of Shakespeare's sonnets.

Steps/code/corpus to reproduce

Go to https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html, scroll to the bottom, click "Download Python source code". Move the downloaded script into a new, empty directory.
cd to that directory, then run: python3 -m venv venv; source venv/bin/activate; pip install gensim.
Note that python run_core_concepts.py now works fine, up until the end where it tries to import matplotlib.

Now edit run_core_concepts.py and replace the text_corpus with this:

text_corpus = [
    'From fairest creatures we desire increase,',
    'When forty winters shall besiege thy brow,',
    'Look in thy glass and tell the face thou viewest',
    'Unthrifty loveliness, why dost thou spend',
    'Those hours, that with gentle work did frame',
    "Then let not winter's ragged hand deface,",
    'Lo! in the orient when the gracious light',
    "Music to hear, why hear'st thou music sadly?",
    "Is it for fear to wet a widow's eye,",
    "For shame deny that thou bear'st love to any,",
    "As fast as thou shalt wane, so fast thou grow'st",
    'When I do count the clock that tells the time,',
    'O! that you were your self; but, love, you are',
    'Not from the stars do I my judgement pluck;',
    'When I consider every thing that grows',
    'But wherefore do not you a mightier way',
    'Who will believe my verse in time to come,',
    "Shall I compare thee to a summer's day?",
]

Note that python run_core_concepts.py now crashes; the output ends with

 [(0, 1), (7, 1), (11, 1), (12, 1), (14, 1)],
 [(3, 1), (6, 1), (12, 1)],
 [(7, 1), (11, 1), (13, 1)],
 [(14, 1)],
 [(1, 1), (12, 1)]]
[]
Segmentation fault (core dumped)

The crash occurs on the line:

sims = index[tfidf[query_bow]]

The output of index.lifecycle_events here is:

[{'msg': 'calculated IDF weights for 18 documents and 15 features (39 matrix non-zeros)', 'datetime': '2021-04-02T19:33:36.428329', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.8.0-48-generic-x86_64-with-glibc2.29', 'event': 'initialize'}]

Versions

Linux-5.8.0-48-generic-x86_64-with-glibc2.29
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
/home/jorendorff/play/gensim-issue/venv/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
gensim 4.0.1
FAST_VERSION 1


piskvorky · 2021-04-03T07:54:16Z

The index in that tutorial has a hardwired number of features: num_features=12.

When you change your TFIDF corpus, you'll need to adapt its number of features too. That's the cause of the crash (technically, scipy.sparse, the library that Gensim uses for sparse matrix manipulation, doesn't check array bounds and segfaults).

Your log says calculated IDF weights for 18 documents and 15 features (39 matrix non-zeros) so you could do num_features=15. But IIRC you can use num_features=max(tfidf.dfs) + 1 more generally – so you don't have to update the number every time.
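For anyone adapting the tutorial, here is a minimal sketch of a pre-flight check in plain Python. It assumes the corpus is in bag-of-words form (lists of (term_id, count) pairs, as in the output above). The helper name required_num_features is hypothetical and not part of Gensim.

```python
def required_num_features(bow_corpus):
    """Return the smallest num_features that covers every term id.

    bow_corpus is an iterable of documents, each a list of
    (term_id, count) pairs. Hypothetical helper, not part of Gensim.
    """
    max_id = -1
    for doc in bow_corpus:
        for term_id, _count in doc:
            max_id = max(max_id, term_id)
    return max_id + 1


# The last five documents from the output in the report (ids go up to 14).
sample = [
    [(0, 1), (7, 1), (11, 1), (12, 1), (14, 1)],
    [(3, 1), (6, 1), (12, 1)],
    [(7, 1), (11, 1), (13, 1)],
    [(14, 1)],
    [(1, 1), (12, 1)],
]

needed = required_num_features(sample)
hardwired = 12  # the value hard-coded in the tutorial
if needed > hardwired:
    print(f"num_features={hardwired} is too small; need at least {needed}")
```

Running a check like this before building the index turns the out-of-bounds case into a readable message instead of a crash.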

Let me know if that helped and we can close this ticket.

jorendorff · 2021-04-03T11:33:22Z

Yes, that change fixes the crash.

mpenkov · 2021-04-03T11:58:48Z

@piskvorky Perhaps in the tutorial code, we can set the number of features dynamically, based on the dictionary size? There's no need to hard code it, and it will prevent other people from tripping over a similar problem.

piskvorky · 2021-04-03T12:54:11Z

Definitely. @jorendorff can you submit a PR? Thanks.

Utopiah mentioned this issue on Aug 1, 2022: dynamic number of features for tf-idf #3373 (merged)
