
Gensim Doc2Vec model Segmentation Faulting for Large Corpus #2894

Closed
mohsin-ashraf opened this issue Jul 24, 2020 · 31 comments · Fixed by #2907

Comments

@mohsin-ashraf

mohsin-ashraf commented Jul 24, 2020

Problem description

What are you trying to achieve?
I was trying to train a doc2vec model on a corpus of 10M (10 million) documents, each roughly ~5,000 words long on average. The idea was to build a semantic search index over these documents using the doc2vec model.

What is the expected result?
I expected it to complete successfully, as it did when I tested on a smaller dataset. On a smaller dataset of 100K documents it worked fine, and I was able to do basic benchmarking of the search index, which passed the criteria.

What are you seeing instead?
When I started training on the 10M dataset, the training of the doc2vec model stopped after building the vocabulary and resulted in a segmentation fault.

Steps/code/corpus to reproduce

Include full tracebacks, logs, and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").
Here is the link to the example to reproduce it. Besides the libraries listed below, it uses RandomWords (unfortunately I could not set up a virtual env due to some issues).

Attached is the logging file.
logging_progress.log

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
NumPy 1.19.0
SciPy 1.5.1
gensim 3.8.3
FAST_VERSION 1

Here is the Google group thread for a detailed discussion.

@mohsin-ashraf
Author

To pin down the exact number above which it always segfaults and below which it always works, I am running some other experiments as well. For now, it works fine for 5M documents and gives the expected results.

@gojomo
Collaborator

gojomo commented Jul 24, 2020

Thanks for the effort to create a reproducible example with random data! But, what is the significance of list_of_lengths.pickle? Can the code be updated to not use an opaque pickled object for necessary length parameters?

@mohsin-ashraf
Author

I tried my best to reproduce the exact scenario that I had. The list_of_lengths.pickle contains the lengths of the documents in my corpus, although we could eliminate it by using Python's random module. What do you suggest? I'll update the code as needed.

@gojomo
Collaborator

gojomo commented Jul 24, 2020

If it's a simple list of int document lengths, a file with one number per line should work as well. (And, if simply making every doc the same length works to reproduce, that'd be just as good.) This data, even with the RandomWords as texts, creates the crash for you?

@mohsin-ashraf
Author

a file with one number per line should work as well

I'll convert the pickle to a text file with one number per line.

This data, even with the RandomWords as texts, creates the crash for you?

Yes, it creates the segmentation fault.

if simply making every doc the same length works to reproduce, that'd be just as good

Will check this as well.

@gojomo
Collaborator

gojomo commented Jul 24, 2020

Another fairly quick test worth running: for a corpus_file.txt that triggers the problem, if you try the more generic method of specifying a corpus – a Python iterable, such as one that's created by the LineSentence utility class on that same corpus_file.txt – does it crash at the same place? (It might be far slower that way, especially w/ 30 workers – but if it only crashes with the corpus_file-style specification it points at different code paths, perhaps with unwise implementation limits, as the culprit.)

@mohsin-ashraf
Author

Updated the repository to replace the pickle with a text file.

a file with one number per line should work as well

I'll convert the pickle to a text file with one number per line.
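For reference, the conversion itself is only a few lines (a minimal sketch; the output filename list_of_lengths.txt is illustrative, and the pickle is assumed to hold a plain list of ints):

import pickle

with open('list_of_lengths.pickle', 'rb') as f:
    lengths = pickle.load(f)

# write one document length per line, as suggested above
with open('list_of_lengths.txt', 'w') as f:
    for n in lengths:
        f.write(f'{n}\n')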

@gojomo
Collaborator

gojomo commented Jul 24, 2020

Thanks! These are word counts for the docs, right? I see that of 10,000,000 docs, about 781K are over 10,000 words. While this should be accepted OK by gensim (& certainly not create a crash), just FYI: there is an internal implementation limit where words past the 10,000th of a text are silently ignored. In order for longer docs to be considered by Doc2Vec, they'd need to be broken into 10K-long docs which then share the same document tag (which actually isn't yet possible in the corpus_file mode).
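For the plain-iterable corpus mode, that splitting could look roughly like this (a minimal sketch, not an existing gensim helper; split_long_doc and CHUNK_SIZE are hypothetical names):

from gensim.models.doc2vec import TaggedDocument

CHUNK_SIZE = 10000  # gensim's internal per-text limit mentioned above

def split_long_doc(words, tag, chunk_size=CHUNK_SIZE):
    # yield sub-documents of at most chunk_size words, all sharing the same tag
    for start in range(0, len(words), chunk_size):
        yield TaggedDocument(words=words[start:start + chunk_size], tags=[tag])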

@mohsin-ashraf
Author

Thanks for letting me know about some of the internal details of Doc2Vec. I have just checked the 100 longest documents in my corpus; they contain 200K+ tokens, and some even contain 3M+ tokens. Could that be a problem? Following is the table of lengths of the 50 longest documents.

Lengths (in tokens) of the 50 longest documents:
316691
316703
316742
316773
316783
316797
316817
316823
316850
316865
316865
316929
316929
316929
317139
317195
317195
317307
317733
318162
344057
351887
356643
356643
363108
363271
363271
363271
373338
385525
388338
388382
388594
388732
388923
397950
448824
448824
450107
455986
467019
485819
535723
652092
659184
659184
659184
749399
2535337
3523532

@gojomo
Collaborator

gojomo commented Jul 24, 2020

It shouldn't cause a crash; it will mean only the 1st 10K tokens of those docs will be used for training. (It might be a factor involved in the crash, I'm not sure.)

@gojomo
Collaborator

gojomo commented Jul 24, 2020

When you run this code, how big is the corpus_file.txt created by the 1st 15 lines of your reproduce_error.py script? (My rough estimate is somewhere above 300GB.) How big is your true-data corpus_file.txt?

When you report that a 5M-line variant works, is that with half your real data, or half the RandomWords data, or both? (In any 5M line cases you've run, how large are the corpus_file.txt files involved?)

@gojomo
Collaborator

gojomo commented Jul 24, 2020

Also, note: if you can .save() the model after the .build_vocab() step, then it may be possible to just use a .load() to restore the model to the state right before a .train() triggers the fault - much quicker than repeating the vocab-scan each time. See the PR I made vs your error repo for quickie (untested) example of roughly what I mean.

With such a quick-reproduce recipe, we would then want to try:

(1) the non-corpus-file path. instead of:

model.train(corpus_file=corpus_path, total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)

...try...

from gensim.models.doc2vec import TaggedLineDocument
model.train(documents=TaggedLineDocument(corpus_path), total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)

If this starts training without the same instant fault in the corpus_file= variant, as I suspect it will, we'll know the problem is specific to the corpus_file code. (No need to let the training complete.)

(2) getting a core-dump of the crash, & opening it in gdb to view the call stack(s) at the moment of the fault, which might also closely identify whatever bug/limit is being mishandled. (The essence is: (a) ensure the environment is set to dump a core in case of segfault, usually via a ulimit adjustment; (b) run gdb -c COREFILENAME; (c) run gdb commands to inspect state, most importantly for our purposes: thread apply all bt (show traces of all threads).)
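For the .save()-after-.build_vocab() checkpoint idea above, a minimal (untested) sketch, with corpus_path and the model parameters as placeholders taken from the snippets in this thread:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=300, min_count=2, workers=30)
model.build_vocab(corpus_file=corpus_path)   # the slow vocab scan, done once
model.save('d2v_after_vocab.model')          # checkpoint just before training

# later, in a fresh process: restore and jump straight to the step that faults
model = Doc2Vec.load('d2v_after_vocab.model')
model.train(corpus_file=corpus_path, total_examples=model.corpus_count,
            total_words=model.corpus_total_words, epochs=model.epochs)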

@mohsin-ashraf
Author

When you run this code, how big is the corpus_file.txt created by the 1st 15 lines of your reproduce_error.py script? (My rough estimate is somewhere above 300GB.) How big is your true-data corpus_file.txt?

It's 315GB in size.

When you report that a 5M-line variant works, is that with half your real data, or half the RandomWords data, or both? (In any 5M line cases you've run, how large are the corpus_file.txt files involved?)

It is half of my real data. The exact size of the corpus file is 157GB.

@mohsin-ashraf
Author

Also, note: if you can .save() the model after the .build_vocab() step, then it may be possible to just use a .load() to restore the model to the state right before a .train() triggers the fault - much quicker than repeating the vocab-scan each time. See the PR I made vs your error repo for quickie (untested) example of roughly what I mean.

With such a quick-reproduce recipe, we would then want to try:

(1) the non-corpus-file path. instead of:

model.train(corpus_file=corpus_path, total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)

...try...

from gensim.models.doc2vec import TaggedLineDocument
model.train(documents=TaggedLineDocument(corpus_path), total_examples=model.corpus_count,total_words=model.corpus_total_words, epochs=model.epochs)

If this starts training without the same instant fault in the corpus_file= variant, as I suspect it will, we'll know the problem is specific to the corpus_file code. (No need to let the training complete.)

(2) getting a core-dump of the crash, & opening it in gdb to view the call stack(s) at the moment of the fault, which might also closely identify whatever bug/limit is being mishandled. (The essence is: (a) ensure the environment is set to dump a core in case of segfault, usually via a ulimit adjustment; (b) run gdb -c COREFILENAME; (c) run gdb commands to inspect state, most importantly for our purposes: thread apply all bt (show traces of all threads).)

Thanks for that! I am currently working on finding the exact number above which we always get a segmentation fault and below which training always starts successfully. I am pretty close to the number (using binary search).

As for the other options, like TaggedLineDocument, I'll give them a separate look.

@mohsin-ashraf
Author

I have finally found the exact number above which the doc2vec model always gives a segmentation fault, and at or below which it always starts training (although I did not let the training process complete). 7158293 is the exact number at which and below which the doc2vec model starts training successfully, whereas increasing the number by even one gives the segmentation fault. I used the synthetic dataset with documents of only 100 tokens each to speed up the process.

@piskvorky
Owner

Can you post the full log (at least INFO level) from that run?

@gojomo
Collaborator

gojomo commented Jul 27, 2020

Good to hear of your progress, & that's a major clue, as 7158293 * 300 dimensions = 2,147,487,900, suspiciously close to 2^31 (2,147,483,648).

That's strongly suggestive that the problem is some misuse of a signed 32-bit int where a wider int type should be used, and indexing overflow is causing the crash.

Have you been able to verify my theory that training would get past that quick crash if the .train() is called with a TaggedLineDocument corpus passed as documents instead of a filename passed as corpus_file? (This could re-use a model on the crashing file that was .save()d after .build_vocab(), if you happen to have created one.)

(Another too-narrow-type problem, though one that's only caused missed training & not a segfault, is #2679. All the cython code should get a scan for potential use of signed/unsigned 32-bit ints where 64 bits would be required for the very-large, > 2GB/4GB array-indexing that's increasingly common in *2Vec models.)
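A quick sanity check of that arithmetic (just restating the numbers above in code):

doc_count = 7158293              # the threshold found above
vector_size = 300
print(doc_count * vector_size)   # 2147487900
print(2 ** 31)                   # 2147483648 -- the product just exceeds the signed 32-bit range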

@piskvorky
Owner

piskvorky commented Jul 27, 2020

Thanks. I wanted to eyeball the log in order to spot any suspicious numbers (signs of overflow), but @gojomo's observation above is already a good smoking gun.

If you want this resolved quickly the best option might be to check for potential int32-vs-int64 variable problems yourself. It shouldn't be too hard, the file is here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec_corpusfile.pyx (look for ints where there should be long long).

@mohsin-ashraf
Author

Should I make a PR after updating the code?

@piskvorky
Owner

piskvorky commented Jul 27, 2020

Of course :)

mohsin-ashraf added a commit to mohsin-ashraf/gensim that referenced this issue Jul 27, 2020
mohsin-ashraf added a commit to mohsin-ashraf/gensim that referenced this issue Jul 27, 2020
@mohsin-ashraf
Author

My PR failed the CircleCI checks. Could you take a look at it here and let me know what I'm doing wrong? I wasn't able to install the test dependencies, and wasn't able to run the tox commands.

@mohsin-ashraf
Author

mohsin-ashraf commented Jul 27, 2020

Good to hear of your progress, & that's a major clue, as 7158293 * 300 dimensions = 2,147,487,900, suspiciously close to 2^31 (2,147,483,648).

That's strongly suggestive that the problem is some misuse of a signed 32-bit int where a wider int type should be used, and indexing overflow is causing the crash.

Have you been able to verify my theory that training would get past that quick crash if the .train() is called with a TaggedLineDocument corpus passed as documents instead of a filename passed as corpus_file? (This could re-use a model on the crashing file that was .save()d after .build_vocab(), if you happen to have created one.)

(Another too-narrow-type problem, though one that's only caused missed training & not a segfault, is #2679. All the cython code should get a scan for potential use of signed/unsigned 32-bit ints where 64 bits would be required for the very-large, > 2GB/4GB array-indexing that's increasingly common in *2Vec models.)

Using TaggedLineDocument did not trigger any error for the larger dataset! I did not complete the training job, though.

@gojomo
Collaborator

gojomo commented Jul 27, 2020

If you want this resolved quickly the best option might be to check for potential int32-vs-int64 variable problems yourself. It shouldn't be too hard, the file is here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec_corpusfile.pyx (look for ints where there should be long long).

But note:

  • the real issue might be in code shared with word2vec_inner.pyx or doc2vec_inner.pyx or word2vec_corpusfile.pyx, or definitions in any of the matching .pxd files;
  • the offending type might have some name more like np.uint32 than just int (even though I'm not sure np.uint32 itself, which should go to 2^32, could be the problem), or be some other alias obscuring the raw types involved, or be hidden in some other library function used incorrectly;
  • most int usages there are probably OK, especially if in pure Python/Cython code, and a simple replace of all int refs with long long refs (as in your PR) is likely to break other things.

If you're able to do the gdb-related steps – saving a core from an actual fault, and using gdb to show the backtraces of all threads at the moment of the error – that may point more directly to a single function, or few lines, where the wrong types are being used, or an oversized Python int is shoved to a narrower type where it overflows. And when trying fixes, or adding debugging code, you'll need to be able to set up your whole local install for rebuilding the compiled shared libraries – which will require C/C++ build tools, and the cython package, & an explicit 'build' step after editing the files – in order to run the code locally (either via the whole unit-test suite or just in your custom code trigger).

@mohsin-ashraf
Author

Here is just a quick run of gdb; the result for the segmentation fault is shown in the image below.
Screenshot from 2020-07-28 09-37-30

Let me know if it's helpful, or if there are any further instructions.

@gojomo
Collaborator

gojomo commented Jul 28, 2020

The textual output of the gdb command thread apply all bt will be most informative. (After that, if the offending thread is obvious, some more inspection of its local variables, in the crashing frame and perhaps a few frames up, may also be helpful... but the bt backtraces first.)

@mohsin-ashraf
Author

mohsin-ashraf commented Jul 28, 2020

(gdb) bt
#0  0x00007fff7880bb30 in saxpy_kernel_16 ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#1  0x00007fff7880bd4f in saxpy_k_NEHALEM ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#2  0x00007fff783042cb in saxpy_ ()
   from /home/mohsin/.local/lib/python3.6/site-packages/scipy/spatial/../../scipy.libs/libopenblasp-r0-085ca80a.3.9.so
#3  0x00007fff1f773912 in ?? ()
   from /home/mohsin/.local/lib/python3.6/site-packages/gensim/models/doc2vec_corpusfile.cpython-36m-x86_64-linux-gnu.so
#4  0x00007fff1f77459f in ?? ()
   from /home/mohsin/.local/lib/python3.6/site-packages/gensim/models/doc2vec_corpusfile.cpython-36m-x86_64-linux-gnu.so
#5  0x000000000050ac25 in ?? ()
#6  0x000000000050d390 in _PyEval_EvalFrameDefault ()
#7  0x0000000000508245 in ?? ()
#8  0x0000000000509642 in _PyFunction_FastCallDict ()
#9  0x0000000000595311 in ?? ()
#10 0x00000000005a067e in PyObject_Call ()
#11 0x000000000050d966 in _PyEval_EvalFrameDefault ()
#12 0x0000000000508245 in ?? ()
#13 0x0000000000509642 in _PyFunction_FastCallDict ()
#14 0x0000000000595311 in ?? ()
#15 0x00000000005a067e in PyObject_Call ()
#16 0x000000000050d966 in _PyEval_EvalFrameDefault ()
#17 0x0000000000509d48 in ?? ()
#18 0x000000000050aa7d in ?? ()
#19 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#20 0x0000000000509d48 in ?? ()
#21 0x000000000050aa7d in ?? ()
#22 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#23 0x0000000000509455 in _PyFunction_FastCallDict ()
#24 0x0000000000595311 in ?? ()
#25 0x00000000005a067e in PyObject_Call ()
#26 0x00000000005e1b72 in ?? ()
#27 0x0000000000631f44 in ?? ()
#28 0x00007ffff77cc6db in start_thread (arg=0x7ffc97fff700) at pthread_create.c:463
#29 0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The full logging file for thread apply all bt is given below.
gdb.txt

@gojomo
Collaborator

gojomo commented Jul 28, 2020

thread apply all bt is the command needed to get all the thread backtraces. (But, from your one thread, it looks like important symbols might be missing, so this may not be as helpful as I'd hoped until/unless that's fixed, and I don't know the fix offhand.)

@mohsin-ashraf
Author

thread apply all bt is the command needed to get all the thread backtraces. (But, from your one thread, it looks like important symbols might be missing, so this may not be as helpful as I'd hoped until/unless that's fixed, and I don't know the fix offhand.)

Just updated my comment above

@mohsin-ashraf
Author

When will this issue be resolved in gensim?

@gojomo
Collaborator

gojomo commented Jul 29, 2020

Re: the backtrace(s):

I think the one thread backtrace you've highlighted should be the one where the segfault occurred, though I believe I've occasionally seen cases where the 'current thread' in a core is something else. (And often, the thread/frame that "steps on" some misguided data isn't the one that caused the problem, via some more subtle error arbitrarily earlier.)

Having symbols & line-numbers in the trace would make it more useful, but I'm not sure what (probably minor) steps you'd have to take to get those. (It might be enabled via installing some gdb extra, or using the cygdb gdb-wrapper that comes with cython, or maybe even just using a py-bt command instead of bt.)

However, looking at just the filenames, it seems the segfault actually occurs inside scipy code, which the code in doc2vec_corpusfile (in frame #3) likely calls with unwise/already-corrupted parameters. (When we have symbols, debugger-inspecting that frame, and/or adding extra sanity-checking/logging around that offending line, will likely provide the next key clue to the real problem.)

Re: when fixed?

You've provided a clear map to reproducing & what's likely involved (a signed 32-bit int overflow), and I suspect the recipe to trigger can be made even smaller/faster. (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.) That will make it easier for myself or others to investigate further. But I'm not sure when there'll be time for that, or it will succeed in finding a fix, or when an official release with the fix will happen.

In the meantime, workarounds could include: (1) using the non-corpus_file method of specifying the corpus - which, if you were successfully using ~30 threads, may slow your training by a factor of 3 or more, but should complete without segfault. (2) training only on some representative subset of docs, and/or with lower dimensions, making sure doc_count * vector_size < 2^31 - but then inferring vectors, outside of training, for any excess documents.
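For workaround (2), a rough sketch of the idea (subset_docs and excess_docs are hypothetical stand-ins for however the corpus gets split; each document is a list of tokens):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

vector_size = 300
max_trainable = (2 ** 31) // vector_size    # keep doc_count * vector_size under 2^31

subset_docs = [['some', 'tokens'], ['more', 'tokens']]   # placeholder data
excess_docs = [['left', 'over', 'tokens']]

tagged = [TaggedDocument(words=doc, tags=[i])
          for i, doc in enumerate(subset_docs[:max_trainable])]
model = Doc2Vec(tagged, vector_size=vector_size, min_count=1, epochs=5)

# infer (rather than train) vectors for the documents beyond the limit
excess_vectors = [model.infer_vector(doc) for doc in excess_docs]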

@gojomo
Collaborator

gojomo commented Jul 29, 2020

I wrote:

You've provided a clear map to reproducing & what's likely involved (a signed 32-bit int overflow), and I suspect the recipe to trigger can be made even smaller/faster. (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.)

Confirmed that with a test file created by...

with open('400klines', 'w') as f:
    for _ in range(400000):
        f.write('a\n')

...the following is enough to trigger a fault...

from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(corpus_file='400klines', min_count=1, vector_size=6000)

Further, it may be sufficient to change the 3 lines in doc2vec_corpusfile.pyx that read...

 cdef int _doc_tag = start_doctag

...to...

 cdef long long _doc_tag = start_doctag

That avoids the crash in the tiny test case above.
