
Word2Vec does not run faster with more workers, due to sentence length #1509

Closed · carbonz0 opened this issue Jul 27, 2017 · 30 comments
Labels: difficulty medium, feature

carbonz0 commented Jul 27, 2017

Description

Word2Vec does not run faster with more workers, and this appears to be caused by sentence length. When I use the raw text8 data, multiple cores work fine. But my corpus is short text, where a single line contains only a few words; when I randomly split the text8 data into multiple lines (e.g. only 3-8 words per line), I found that adding more workers no longer helps.

Versions

Linux-3.10.0-229.7.2.el7.x86_64-x86_64-with-centos-7.1.1503-Core
('Python', '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.3.0')
('FAST_VERSION', 1)

gojomo (Collaborator) commented Jul 27, 2017

When you gave it giant lines, you may have seen deceptive speed indicators: due to implementation limits in the optimized code, texts over 10,000 tokens are truncated to 10,000 tokens – with the rest ignored. (Throwing away lots of data can make things look very fast!)

When you give it small lines, it is internally batching them together for efficiency (though not in such a way that any context windows overlap the breaks you've supplied). But the rates you're seeing are probably a more accurate estimate of what it takes to train on all supplied words.

The inability of the code to fully utilize all cores, or even increase throughput, with more than 3-16 workers (no matter how many cores are available) is a known limitation, mostly due to the 'global interpreter lock' single-threading imposed on Python code, and perhaps somewhat due to the current architecture of a single corpus-reading thread handing work out to multiple worker threads. (Though, that 2nd factor can be minimized if your corpus iterable is relatively efficient, such as by working with data only in RAM or from fast volumes.)

See related discussion in issues like #1486, #1291, #532, & #336.
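
(For anyone hitting that truncation: a minimal workaround, sketched below, is to pre-split over-long texts before handing them to the model. It assumes the cap is exposed as the module constant MAX_WORDS_IN_BATCH; the hard-coded 10000 is a fallback, and the corpus/model names are placeholders.)

    from gensim.models import word2vec

    # Any single text longer than this many tokens is silently truncated by the
    # optimized training routines (10,000 at the time of writing).
    MAX_LEN = getattr(word2vec, "MAX_WORDS_IN_BATCH", 10000)

    def split_long_texts(corpus, max_len=MAX_LEN):
        """Yield texts unchanged, except that over-long texts are split into
        consecutive chunks of at most max_len tokens, so no words are dropped."""
        for tokens in corpus:
            if len(tokens) <= max_len:
                yield tokens
            else:
                for i in range(0, len(tokens), max_len):
                    yield tokens[i:i + max_len]

    # Usage with a hypothetical in-memory corpus of token lists:
    # sentences = list(split_long_texts(my_token_lists))
    # model = word2vec.Word2Vec(sentences, workers=4)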

@carbonz0 (Author)

Thanks very much for your reply.
I have been investigating this problem for several days and have read the discussions in the issues above and in the Google group, but my problem seems different from those previous ones, including #157.

From the code in word2vec.py I can see that the words are internally batched when I supply small lines, but the training speed and core usage are still different.

When I ran the raw text8 with 10 workers, I got the debug log snippet below: each queued job holds batch-size=10000 words in 1 sentence, and the speed is 806140 words/s.
Meanwhile I could see multiple cores active in htop.

2017-07-27 11:43:03,572 : DEBUG : queueing job #3354 (10000 words, 1 sentences) at alpha 0.01518
2017-07-27 11:43:03,572 : DEBUG : queueing job #3355 (10000 words, 1 sentences) at alpha 0.01518
2017-07-27 11:43:03,572 : INFO : PROGRESS: at 38.38% examples, 806140 words/s, in_qsize 60, out_qsize 1
2017-07-27 11:43:03,586 : DEBUG : queueing job #3356 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,588 : DEBUG : queueing job #3357 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,589 : DEBUG : queueing job #3358 (10000 words, 1 sentences) at alpha 0.01517
2017-07-27 11:43:03,617 : DEBUG : queueing job #3359 (10000 words, 1 sentences) at alpha 0.01517

But when I split text8 into multiple lines, the speed is only about 169806 words/s, as the log below shows.

2017-07-27 14:09:02,062 : DEBUG : queueing job #758 (999 words, 385 sentences) at alpha 0.02478
2017-07-27 14:09:02,065 : DEBUG : queueing job #759 (994 words, 356 sentences) at alpha 0.02478
2017-07-27 14:09:02,068 : DEBUG : queueing job #760 (1000 words, 358 sentences) at alpha 0.02478
2017-07-27 14:09:02,068 : INFO : PROGRESS: at 0.82% examples, 169806 words/s, in_qsize 40, out_qsize 0
2017-07-27 14:09:02,071 : DEBUG : queueing job #761 (998 words, 351 sentences) at alpha 0.02478
2017-07-27 14:09:02,077 : DEBUG : queueing job #762 (997 words, 371 sentences) at alpha 0.02478
2017-07-27 14:09:02,081 : DEBUG : queueing job #763 (997 words, 352 sentences) at alpha 0.02478
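
(For anyone reproducing this: log lines like the above come from enabling Python logging before training, roughly as follows; INFO gives the periodic PROGRESS lines, DEBUG additionally shows each queued job.)

    import logging

    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.DEBUG,
    )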

gojomo (Collaborator) commented Jul 27, 2017

Though it won't account for the full difference, note that rates measured a bit deeper into training, or over the full training, are more stable than rates at the very beginning, before all threads are active and CPU caches are warm.

How many cores does your system have?

Which batch-size(s) are you tuning?

Are you splitting the lines in-memory on-the-fly, or once to a separate line-broken file on disk?

Are you sure you didn't do something else, in the 2nd case, to force smaller (1000-word) training batches? (The default of 10000 would mean those words-totals per job should be a lot closer to 10000.)

(It's tough to be sure of everything that varies between your tests without the full code.)

carbonz0 (Author) commented Jul 28, 2017

Thank you for your patience; I didn't report these details clearly, sorry about that.

I cleaned up my test code and put it in the gist below, with no parameter tuning, so you can see the differences:
https://gist.github.com/knighter/57a3ce26114b071c2287c84c355dfec5

comparison in short:

text8, 1 worker:

2017-07-28 11:45:42,161 : INFO : PROGRESS: at 49.36% examples, 131211 words/s, in_qsize 2, out_qsize 0

text8, 20 workers:

2017-07-28 11:50:29,296 : INFO : PROGRESS: at 49.84% examples, 875161 words/s, in_qsize 39, out_qsize 0

text8_split, 1 worker:

2017-07-28 11:54:01,222 : INFO : PROGRESS: at 49.74% examples, 224895 words/s, in_qsize 1, out_qsize 0

text8_split, 20 workers:

2017-07-28 11:59:06,013 : INFO : PROGRESS: at 49.75% examples, 239028 words/s, in_qsize 0, out_qsize 0
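
(The actual code is in the gist linked above; in outline it amounts to something like the following, with text8 and text8_split as local files. Treat this as a paraphrase of the setup, not the gist itself.)

    import logging
    from gensim.models.word2vec import Word2Vec, LineSentence

    logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                        level=logging.INFO)

    # Identical settings everywhere; only the input file and the worker count vary.
    for path in ("text8", "text8_split"):
        for workers in (1, 20):
            Word2Vec(LineSentence(path), workers=workers)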

piskvorky (Owner) commented Jul 29, 2017

Thanks for the detailed report! That helps a lot.

The in_qsize 0, out_qsize 0 really does indicate the workers are starved for input, in the "only ~3 words per document" case of text8_split.

In other words, even the almost trivial loops here and here seem to become the bottleneck with super short documents.

The fact that the 1-worker text8_split is faster than non-split text8 is probably due to the fact it's doing much less training -- you set window=5, but with documents of only ~3 words, the longer contexts never materialize.

I really don't know what we could do about this -- we're hitting the limits of Python itself here. Perhaps the most general solution would be to change the main API of gensim from "user supplies a stream of input data" to "user supplies multiple streams of data". It's fully backward compatible (one stream is just a special case of many streams), but would allow us to parallelize more easily without as much batching, data shuffling etc. Basically advise users to split their large data into multiple parts, Spark/Hadoop-style. Applies to all algos (LDA, LSI, word2vec...). @menshikh-iv @gojomo thoughts?
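
(Purely to illustrate the shape of that proposal: something like the hypothetical input_streams parameter below, which is not an existing gensim API, just a sketch of the idea.)

    from gensim.models.word2vec import LineSentence, Word2Vec

    # Hypothetical "many streams" API: the user supplies several independent
    # iterables (e.g. one per on-disk shard) instead of one big stream, and each
    # worker consumes its own stream without funneling through a shared queue.
    streams = [
        LineSentence("corpus_part_000.txt"),
        LineSentence("corpus_part_001.txt"),
        LineSentence("corpus_part_002.txt"),
    ]
    # model = Word2Vec(input_streams=streams, workers=len(streams))  # hypothetical parameter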

gojomo (Collaborator) commented Jul 31, 2017

Yes, the reason the 1-thread split run is faster is almost certainly that with skip-gram and window=5, having many short sentences means much less training is happening, because windows are truncated at sentence ends.

IO may still be a factor, depending on your volume type and the behavior of LineSentence. Running the test from a list-of-lists-of-tokens in memory will better isolate that factor from the remaining Python-multithreading and single-batching-master-thread issues.

Also, maximum throughput for the small-sentences case might be reached with a worker-count between 1 and 20 - the contention of a larger number of threads for the single Python GIL may be a factor in starving the master thread.

@piskvorky Yes I think a shared abstraction for helping models open multiple non-contending (and ideally non-overlapping) streams into the corpus would be a good way to increase throughput. (The word2vec.c and fasttext.cc implementations just let every thread open their own handle into a different starting-offset of the file, and continue cycling through until the desired total number of examples is read. Because of thread overtaking issues there's no guarantee some parts of the file aren't read more times than others... but it probably doesn't matter in practice that their training samples aren't exactly N passes over each example.)
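
(A rough Python rendering of that word2vec.c-style scheme, purely illustrative: each worker gets its own byte range of the file and skips ahead to the next newline before reading.)

    import os

    def byte_ranges(path, n_workers):
        """Split a file into n_workers roughly equal byte ranges, one per worker,
        mimicking the per-thread file handles in word2vec.c / fastText."""
        size = os.path.getsize(path)
        step = size // n_workers
        return [(i * step, size if i == n_workers - 1 else (i + 1) * step)
                for i in range(n_workers)]

    def read_shard(path, start, end):
        """Yield whitespace-tokenized lines from [start, end)."""
        with open(path, "rb") as fin:
            fin.seek(start)
            if start:
                fin.readline()  # discard the partial line we landed in
            while fin.tell() < end:
                line = fin.readline()
                if not line:
                    break
                yield line.decode("utf8").split()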

piskvorky (Owner) commented Aug 1, 2017

@gojomo Yes, working off RAM (lists) will help isolate the issue, but I don't think either the number of threads or the IO is a factor here. @Knighter is comparing two identical setups, on identical data (the same number of workers, same IO...). The only difference here is the sentence length. One setup is starved, one isn't.

Regarding multiple streams: gensim would be agnostic as to where the streams come from. Seeking to different places in a single file is one option; reading from multiple separate files (possibly separate filesystems) another. In practice, I suspect most people simply use a local FS, so that's our target use-case to optimize.

gojomo (Collaborator) commented Aug 1, 2017

@piskvorky If LineSentence is less efficient at reading the many-lined input, that might contribute to the starvation going from 20 workers (unsplit) to 20 workers (split). The concatenation of small examples into larger batches may be relevant – but that was added because it greatly increased multithreaded parallelism in tests, by having long noGIL blocks, compared to small-examples-without-batching – at least with 3-8 threads.

Perhaps either of these processes – LineSentence IO or batching – gets starved for cycles when more threads are all waiting for the GIL. (That is: the trivial loops mean many more time-slicing events/overhead and context-switching.) Is there a 'withGIL' to force a block of related statements to not be open to normal interpreter GIL-sharing?

menshikh-iv added the labels 'feature' and 'difficulty medium' on Oct 2, 2017
@shuxiaobo

@gojomo I ran into the same question. Will a sentence longer than 10000 tokens be cut to 10000, with the rest of the data discarded during training? I don't see this behavior declared anywhere in the API documentation.

piskvorky (Owner) commented May 14, 2019

Yes, there's still a hard limit on sentence length (= max effective number of words in a document).

Btw the performance issues around GIL were solved in #2127, closing this ticket. See the tutorial at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb

CC @mpenkov @gojomo – let's link to that tutorial straight from the API docs. I had trouble locating it myself, it's pretty well hidden. The API docs only mention "You may use corpus_file instead of sentences to get performance boost", which is unnecessarily vague IMO. We can easily provide more docs and insight than that.
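
(In code, the file-based mode amounts to something like the sketch below; corpus_file is a real parameter of recent gensim versions, but check the docs of your version for the exact signature and remaining options.)

    from gensim.models.word2vec import Word2Vec

    # corpus.txt must be in LineSentence format: one sentence per line,
    # whitespace-separated tokens, stored as a plain-text file on local disk.
    model = Word2Vec(corpus_file="corpus.txt", workers=8)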

mpenkov (Collaborator) commented May 14, 2019

OK, I'll deal with that as part of the documentation refactoring

mpenkov self-assigned this May 14, 2019
@shuxiaobo

Hello @mpenkov, I still have a question. Does the batch size affect the maximum length of a sentence? If I set the batch size to 128, will the maximum sentence length be 10000 or 128?

gojomo (Collaborator) commented May 15, 2019

@shuxiaobo batch_size has no effect on the 10000-token limit for individual texts.
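
(Concretely, the job-batching setting and the per-text cap are independent. The sketch below uses the batch_words parameter and the MAX_WORDS_IN_BATCH constant as they exist in gensim's word2vec module; verify the names against the docs for your version.)

    from gensim.models.word2vec import Word2Vec, MAX_WORDS_IN_BATCH

    print(MAX_WORDS_IN_BATCH)  # 10000: the per-text cap, untouched by batching settings

    sentences = [["some", "tokenized", "text"], ["another", "short", "text"]]
    # batch_words only controls how many words are grouped into one job per worker
    # thread; each individual text is still truncated at 10,000 tokens regardless.
    model = Word2Vec(sentences, batch_words=128, min_count=1)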

Witiko added a commit to MIR-MU/pine that referenced this issue Jan 24, 2021
ciropom commented Mar 21, 2023

> Yes, there's still a hard limit on sentence length (= max effective number of words in a document).
>
> Btw the performance issues around GIL were solved in #2127, closing this ticket. See the tutorial at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb
>
> CC @mpenkov @gojomo – let's link to that tutorial straight from the API docs. I had trouble locating it myself, it's pretty well hidden. The API docs only mention "You may use corpus_file instead of sentences to get performance boost", which is unnecessarily vague IMO. We can easily provide more docs and insight than that.

I think the corpus_file implementation doesn't solve this issue.
In my use case, I have billions of sentences (around 41 million scientific papers) and I can't simply serialize them to LineSentence.

The ability to scale linearly with the number of cores matters most when you have a lot of data, but that's exactly the situation in which you can't use LineSentence.

I was wondering: why not use multiprocessing.Process instead of threading.Thread? That would allow linear scaling, since you fork the Python binary itself, whereas with Thread all the computation relies on the same Python instance. Maybe it's something related to the Cython bindings?

@piskvorky (Owner)

> I can't simply serialize them to LineSentence.

Why not?

> but that's exactly the situation in which you can't use LineSentence.

Why not?

ciropom commented Mar 21, 2023

Because, if I understand correctly, having my corpus in LineSentence format implies serializing the whole corpus to disk.

That is something I cannot afford to do, because I don't have enough space: the corpus is very big (and growing), and I'm actually streaming it from a remote data source.

piskvorky (Owner) commented Mar 21, 2023

ciropom commented Mar 21, 2023

Ah OK, nice.
I don't think that's mentioned in the documentation, but it makes a big difference.
Thank you, I'll try it.

ciropom commented Mar 21, 2023

LineSentence is streamed, which lets you keep RAM usage constant, but the space on disk will grow as you process sentences, and that will eventually make you run out of hard-drive space.

piskvorky (Owner) commented Mar 21, 2023

That is true. Where do you pull your documents from for processing, if they don't even fit on disk?

Gensim aims at RAM-independence (corpora larger than RAM), but not disk independence (corpora larger than hard disk).

ciropom commented Mar 21, 2023

I understand.
I pull my documents from a remote Solr instance that contains ~41 million bio-medical scientific papers.

My pre-processing prunes a lot of words (common English words, for instance) but also produces different versions of the same sentences (the sentence as-is, and the sentence with n-grams taken from NER annotators and the NER annotations replaced with ontology IDs).

I can't estimate the size of this precisely, but it is certainly not less than 500 GB.
Maybe I can afford that much disk space.
Another option would be to use a Linux FIFO to circumvent this.
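
(For what it's worth, dumping a token stream into LineSentence format is just one sentence per line with space-separated tokens, and it compresses well; the sketch below writes gzip, which the iterable LineSentence reader can usually consume directly, while the corpus_file fast path, as far as I know, expects an uncompressed plain-text file.)

    import gzip

    def write_linesentence(token_stream, path="corpus.txt.gz"):
        """Serialize an iterable of token lists to LineSentence format:
        one sentence per line, tokens separated by single spaces."""
        with gzip.open(path, "wt", encoding="utf8") as out:
            for tokens in token_stream:
                out.write(" ".join(tokens) + "\n")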

piskvorky (Owner) commented Mar 21, 2023

I suspect that if you're pulling documents "live" from Solr, and preprocessing them on the fly, then training is not your bottleneck.

I would be surprised if you needed more than one Gensim worker to saturate such a pipeline. In other words, the speed of procuring the training data will be the bottleneck, not the training itself.

ciropom commented Mar 21, 2023

The preprocessing pipeline involves multiple parallel loaders and multiple parallel pre-processors.
I can confirm that without training, in the scenario of #400 (comment), I have all cores at 100% usage 100% of the time and my pre-processing is very quick.

Things change when I train: all my sentences go into a queue that gensim pulls from through the iterator interface.
I'm experimenting to find the right number of loaders/pre-processors/gensim workers.

But so far the best performance I have got is 25k words/s, which is very low.
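
(For context, the pattern described, producer processes feeding one queue with an iterable wrapper on the gensim side, looks roughly like this; fetch_and_preprocess is a hypothetical stand-in for the actual Solr loading and NLP pipeline.)

    import multiprocessing as mp

    SENTINEL = None

    def producer(doc_ids, out_queue):
        # Stand-in for the real loader/pre-processor workers.
        for doc_id in doc_ids:
            tokens = fetch_and_preprocess(doc_id)  # hypothetical helper
            out_queue.put(tokens)
        out_queue.put(SENTINEL)

    class QueueCorpus:
        """Single-pass iterable over tokenized sentences arriving on a queue.
        Word2Vec/FastText need several passes (vocab scan plus every epoch), so
        the producers must be restarted for each pass; the single consuming
        thread is one reason this pattern tends to bottleneck."""
        def __init__(self, queue, n_producers):
            self.queue = queue
            self.n_producers = n_producers

        def __iter__(self):
            finished = 0
            while finished < self.n_producers:
                item = self.queue.get()
                if item is SENTINEL:
                    finished += 1
                else:
                    yield item

    # Usage sketch:
    # queue = mp.Queue(maxsize=10000)
    # workers = [mp.Process(target=producer, args=(chunk, queue)) for chunk in id_chunks]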

ciropom commented Mar 21, 2023

Another option would be training in batches of documents.
I could serialize a batch of 200,000 documents to a file and train epoch 1 with gensim.
In the meantime I would serialize another 200,000 documents, and at the end of epoch 1,
train epoch 2 on the new batch and remove the old one.

Would this result in the same model, or is training where each epoch corresponds to a different training set a problem?

piskvorky (Owner) commented Mar 21, 2023

Yeah 25k words/s sounds low… especially if you have a large corpus (billions of tokens) to chew through.

To be clear – is it the initial vocabulary scan (1st corpus pass) that is slow, or also the actual training = subsequent corpus training passes?

If your preprocessing consumes 100% CPU 100% of the time, that indicates to me that it is indeed the bottleneck. I don't see any reason why multiple workers would only get to 25k words/seconds otherwise. Did you time the corpus iteration? No training, no gensim, just go through the input corpus iterator: for doc in corpus: pass.

About epochs: This seems too involved and specific, and we're getting off topic, sorry.

ciropom commented Mar 22, 2023

I used this code to benchmark, where sentences is my iterator passed to FastText.train()

    import time

    start = time.time()
    words = 0
    sent_n = 0
    for sentence in sentences:  # `sentences` is the same iterator handed to gensim
        words += len(sentence)
        sent_n += 1
        if sent_n % 10000 == 0:
            print("Speed: %.0f words/s" % (words / (time.time() - start),))

with 6 pre-processors, 4 loaders: ~103k words/s (peak 150k words/s)
with 8 pre-processors, 6 loaders: ~63k words/s

Faster, but still slower than I expected.
Putting everything into a single queue that is read sequentially is detrimental to performance.

piskvorky (Owner) commented Mar 22, 2023

Thanks, that's good progress.

Yes, 150k words/s is still slow. IIRC fasttext can do >1m words/s easily, on a single powerful machine.

Still doesn't explain the drop from 150k to 25k. There is some overhead for sure but I don't see why it should be that large, with enough workers and enough spare CPU (vs 100% CPU already used for the on-the-fly preprocessing…).

> Putting everything into a single queue that is read sequentially is detrimental to performance.

Yes – that's why there's the corpus_file mode, which parallelizes better but works off of a file on disk. So 1) no CPU contention between loading & preprocessing vs training, and 2) more efficient parallelization without serializing into a single queue.

ciropom commented Mar 22, 2023

I'll use corpus_file. If it doesn't fit disk, I will train in chunks. Thank you for your guidance.

gojomo (Collaborator) commented Mar 23, 2023

Optimizing specific usage scenarios could be better discussed on the project discussion list, than an old closed bug-tracking issue. But, some observations:

  • If your preprocessing outside of Gensim saturates multiple cores, it's likely a big contributor to the bottleneck during the corpus_iterator mode. To get the max throughput from that mode, the Gensim master thread will ideally be doing nothing but IO from a fast source, and dirt-simple (space-based) tokenization - & it won't be waiting on any other local or remote busy-processes, either. Streaming from a plain text file on a fast local volume is best for that. But then also, if you can serialize your corpus to a single plain-text file on a local fast volume, then it becomes somewhat easier to use the corpus_file mode to truly saturate all cores. (Of course, the corpus_file mode's reliance on a single file may still be a headache if your corpus is in the tens to hundreds of GB.)

  • with 4TB SSDs only $200-$400, text corpora that are too big to fit on a fast local volume may be overkill for Word2Vec/etc purposes, too. (IIRC, even the GoogleNews vectors, which I believe Google parallel-trained via a proprietary word2vec process a decade ago, were only "100 billion" words of news articles - less than 1TB of data.) Thinning the training data to a representative subset may give you just-as-good word-vectors with far fewer time/storage headaches. Or similarly: with mega-sized corpora, a very aggressive (smaller) 'sample' parameter can discard more of the plentiful words, for faster training & better results on less-frequent words.

  • Trying to use your own batching is usually a bad idea: complicated, error-prone, deviates from optimal SGD over full diversity of training set.

  • I'm unsure what's meant by a description like "8 preprocessors, 6 loaders" regarding your own corpus iterator outside Gensim. But if that separate, not-at-all-involving-Gensim code benchmarks slower with more 'preprocessors' & 'loaders', there may be other thread-contention & IO/network/remote-system bottlenecks contributing to the problem.

  • Regarding the possibility of Process-based parallelism - it could be considered, but it adds other interprocess-communication overheads & cross-platform complexities/failure-modes. Since other optimizations of the corpus_iterable or corpus_file approaches have generally been enough for most users, adding yet another independent mode (that probably could just barely match corpus_file throughput in the cases where that works) hasn't been attractive.

ciropom commented Mar 24, 2023

I fully agree with you on all points.
I thought (wrongly) that dumping everything to disk wasn't feasible, but it turns out it is.
The reason is that I was used to dumping documents in a structured JSON format, which takes far more space than the simple LineSentence format, especially given that I remove stopwords in the pre-processing phase.

In general, "separation of concerns" is a best practice in software development and should always be encouraged.
