[WIP GSOC 2018]: Multistream API, Part 1 #2048
Conversation
I've benchmarked the current word2vec, doc2vec and fastText implementations. Hardware specs: 16 × Intel Xeon 2.30GHz CPUs, 60 GB RAM.

**Word2Vec results**

Up to 4 workers, everything is okay. But increasing the number of workers to 8, 10, 12, 14 exposes a worker-starvation problem. So, a multistream API could help word2vec solve this scalability issue.

**Doc2Vec results**

Unfortunately, I don't see the worker-starvation problem here, because the avg queue size metric increases with the number of workers. I think that for doc2vec the main problem is the CPU-bound code, which is not well optimized. P.S. I tried to reduce the CPU-bound computation for doc2vec and re-ran the benchmark.

**FastText results**

The situation here is "better" (for me, because the multistream API will be helpful) than in doc2vec. We also see that the avg queue size drops almost to zero at some point, there is no linear performance increase, and the CPUs are not fully utilized.
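The starvation pattern described above can be reproduced with a toy single-producer/many-consumers setup (a standalone sketch, not gensim's actual code; all names here are illustrative): one reader thread fills a job queue while N workers drain it, and when the workers consume faster than the single reader can produce, the sampled queue size falls toward zero and workers idle.

```python
import queue
import threading

def run(n_workers, n_jobs=200):
    jobs = queue.Queue(maxsize=64)
    sizes = []   # queue sizes sampled by workers, a proxy for "avg queue size"
    done = []

    def producer():
        for i in range(n_jobs):
            jobs.put(i)
        for _ in range(n_workers):
            jobs.put(None)   # poison pills to stop the workers

    def worker():
        while True:
            sizes.append(jobs.qsize())
            item = jobs.get()
            if item is None:
                return
            done.append(item)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(done), sum(sizes) / len(sizes)

if __name__ == "__main__":
    processed, avg_q = run(n_workers=4)
    print(processed, round(avg_q, 2))
```

With more workers, the average sampled queue size shrinks, which is exactly the metric the benchmarks above track.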
@persiyanov great start! Note that multistream is primarily meant to help with the dictionary-building phase (before any training epochs). The word2vec/doc2vec/fastText training is already heavily optimized and parallelized, although multistream should help there too, especially with many cores. But it's the dictionary building that is completely single-threaded and slow. That doc2vec behaves differently from word2vec is surprising. It's nearly the same algorithm, with the same optimizations (I believe even the same portions of code). CC @gojomo.
That's the first time I hear about the "dictionary-building phase" problem and that multistream is supposed to solve it. I think that for large datasets and many epochs, the time spent building the vocabulary is much less than the time spent training. So, optimizing the training phase is more important.
Gensim users report that for their datasets, word2vec vocab building takes a lot of time. I don't remember the exact percentage, but IIRC I saw numbers like 20–40% of the overall training time. How much was it in your tests above? Sorry about not explicitly pointing out vocab building as an important beneficiary of the multistream API; that was clearly an omission. On the other hand, with everything else in place, parallelizing the vocab phase seems almost trivial (build a separate vocab for each stream, then merge them at the end; no communication needed). So I'm not terribly worried about it.
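The "build per-stream vocabs, then merge" idea can be sketched with `collections.Counter` standing in for gensim's vocabulary structures (illustrative only, not the PR's code): each stream is scanned independently, which is the part that parallelizes, and the partial counts are merged at the end with no inter-worker communication.

```python
from collections import Counter
from multiprocessing.pool import ThreadPool

def build_vocab(stream):
    """Count word frequencies over one stream of tokenized sentences."""
    counts = Counter()
    for sentence in stream:
        counts.update(sentence)
    return counts

def multistream_vocab(streams, workers=4):
    """Build a partial vocab per stream in parallel, then merge."""
    with ThreadPool(workers) as pool:
        partials = pool.map(build_vocab, streams)
    merged = Counter()
    for part in partials:
        merged.update(part)
    return merged

streams = [
    [["the", "cat"], ["the", "dog"]],
    [["the", "bird"]],
]
vocab = multistream_vocab(streams)
print(vocab["the"])  # → 3
```

In a real implementation the per-stream pass would run in separate processes to sidestep the GIL; the merge step stays the same.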
In my experiments, the vocabulary stage took only ~2 minutes. Here are the vocabulary specs:
Thanks. That's between 10–50% of the time of a single epoch, right? That's in line with what I remember. The more epochs, the lower this number becomes, of course (although with large corpora, sometimes there's only one epoch). Cutting this number down through parallelization should be an easy win.
For benchmarks, it'd help to:
I'm a bit surprised by the difference in throughput (and job-queue lengths) between what should be very similar w2v/d2v setups (W2V CBOW with a 10-word window and D2V DM with a 10-word window vary only by the inclusion of one extra doc-vec in the context/corrections), but I'd have to dig deep to understand why it's happening.
Sent2Vec results:
- Time spent building the Sent2Vec vocab is ~1 hour, which is quite slow.
Doc2Vec cProfile
Word2Vec profile for comparison
Proportions: I see that in doc2vec the function
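Profiles like the ones posted above can be gathered with the standard-library `cProfile`/`pstats` pair; `train_model` below is a hypothetical placeholder for the gensim training call being profiled, not an actual gensim function.

```python
import cProfile
import io
import pstats

def train_model():
    # placeholder workload standing in for the real model.train(...) call
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
train_model()
profiler.disable()

# Sort by cumulative time and keep the top 10 entries, the view used
# when comparing where doc2vec and word2vec spend their time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```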
I've written a report on the last two weeks in a blog post: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html
@persiyanov nice! Did you tweet this in English? I'd like to retweet your post, for people who are following the GSoC progress.
@piskvorky I didn't have a Twitter account until today... https://twitter.com/dpersiyanov/status/1001157238441037829
The last optimization has finally resulted in linear performance scaling (2x faster than Mikolov's word2vec). Here is the table:
P.S. Mikolov's word2vec benchmark is here
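One way a multistream reader can avoid the single-producer bottleneck behind those numbers is to split the corpus file into roughly equal byte ranges, snapped to newline boundaries, so each worker reads its own region independently. This is a sketch under assumptions, not necessarily the PR's actual implementation; all names are illustrative.

```python
import os
import tempfile

def stream_offsets(path, n_streams):
    """Return n_streams+1 byte offsets partitioning the file on line boundaries."""
    total = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_streams):
            f.seek(i * total // n_streams)
            f.readline()             # advance to the next newline boundary
            offsets.append(f.tell())
    offsets.append(total)
    return offsets

# usage: each (start, end) pair defines one worker's private region
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
    tmp.write("".join(f"sentence {i}\n" for i in range(1000)))
offs = stream_offsets(tmp.name, 4)
print(list(zip(offs[:-1], offs[1:])))
```

Because the ranges never split a line, each worker can tokenize its region without coordinating with the others.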
Great work @persiyanov!
Those are great numbers! But from a quick glance at the Cython changes, it looks like the old ability to provide texts as lists-of-tokens may have been removed?
Continued in #2127
This is a PR for my GSoC project.