
[WIP GSOC 2018]: Multistream API, Part 1 #2048

Closed

Conversation

persiyanov
Contributor

This is a PR for my GSoC project.

@persiyanov
Contributor Author

I've benchmarked the current word2vec, doc2vec and fastText implementations.

Hardware specs: 16 x Intel Xeon 2.30 GHz CPUs, 60 GB RAM
Data: a ~1.2 GB slice of English Wikipedia from here; each model instance was trained for one epoch.
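For reference, here is roughly how the per-core "Avg CPU loads" numbers below can be collected; this is a minimal sketch assuming psutil for sampling, not the actual benchmark harness used:

import threading

import psutil  # assumed dependency; my real harness is not shown here

def sample_cpu_loads(stop_event, samples, interval=1.0):
    """Append one per-core CPU utilization reading every `interval` seconds."""
    while not stop_event.is_set():
        # percpu=True returns a list with one utilization percentage per logical core
        samples.append(psutil.cpu_percent(interval=interval, percpu=True))

samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_cpu_loads, args=(stop, samples))
sampler.start()
# ... train the model here ...
stop.set()
sampler.join()
# average load per core over the whole run (the "Avg CPU loads" rows below)
avg_loads = [round(sum(core) / len(samples), 2) for core in zip(*samples)]
print(avg_loads)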

Word2Vec Results

----- MODEL "full-word2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 1182.58386683 sec.
* Avg queue size: 1.48076923077 elems.
* Processing speed: 153350.576721 words/sec
* Avg CPU loads: 0.06, 0.02, 0.05, 99.40, 1.68, 0.25, 2.48, 0.11, 0.14, 0.03, 0.02, 0.00, 0.00, 0.01, 0.00, 0.10

----- MODEL "full-word2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 313.03660202 sec.
* Avg queue size: 7.09708737864 elems.
* Processing speed: 579322.695268 words/sec
* Avg CPU loads: 0.13, 0.06, 1.03, 0.16, 0.91, 16.28, 0.01, 0.01, 94.11, 22.10, 72.40, 0.40, 0.00, 0.00, 93.96, 93.67

----- MODEL "full-word2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 277.351661921 sec.
* Avg queue size: 0.0255474452555 elems.
* Processing speed: 653863.581506 words/sec
* Avg CPU loads: 25.71, 29.21, 19.85, 24.07, 38.67, 42.76, 27.28, 37.72, 29.91, 29.09, 35.91, 35.46, 21.26, 13.55, 29.03, 17.07

----- MODEL "full-word2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 275.248829842 sec.
* Avg queue size: 0.0404411764706 elems.
* Processing speed: 658857.998068 words/sec
* Avg CPU loads: 25.60, 26.90, 27.83, 21.12, 22.92, 29.67, 35.28, 40.76, 34.92, 33.96, 32.69, 33.75, 34.16, 26.19, 18.26, 12.84

----- MODEL "full-word2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 285.958873987 sec.
* Avg queue size: 0.0247349823322 elems.
* Processing speed: 634182.830109 words/sec
* Avg CPU loads: 23.00, 24.52, 27.39, 26.22, 29.77, 29.84, 37.62, 39.08, 32.67, 32.38, 30.09, 29.49, 27.07, 24.23, 22.69, 13.93

----- MODEL "full-word2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 288.641264915 sec.
* Avg queue size: 0.0175438596491 elems.
* Processing speed: 628288.079506 words/sec
* Avg CPU loads: 22.97, 21.53, 27.90, 26.05, 31.30, 30.67, 35.71, 36.23, 34.16, 30.63, 30.53, 32.36, 27.86, 21.09, 22.16, 21.02

Up to 4 workers everything is okay:
• approx. linear speedup going from 1 to 4 workers
• ~4 CPUs are fully utilized

But increasing the number of workers to 8, 10, 12 or 14 reveals a worker-starvation problem:
• avg queue size is almost zero
• processing speed doesn't increase linearly, it plateaus
• CPUs are not fully utilized (each CPU is at ~20-40%)

So, a multistream API could help word2vec solve this scalability issue.
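To make the starvation concrete: in the current design a single Python thread produces jobs for all workers, so once there are more than a few consumers it can no longer keep the shared queue filled. A simplified sketch of that pattern (illustrative only, not the actual gensim code; the multistream idea replaces the shared producer with one input stream per worker):

from queue import Queue
from threading import Thread

NUM_WORKERS = 8
job_queue = Queue(maxsize=2 * NUM_WORKERS)

def job_producer(batches):
    # One thread reads, tokenizes and batches the corpus for ALL workers.
    # With many workers this loop becomes the bottleneck -- hence the
    # near-zero "avg queue size" in the 8/10/12/14-worker runs above.
    for batch in batches:
        job_queue.put(batch)
    for _ in range(NUM_WORKERS):
        job_queue.put(None)  # sentinel: tell each worker to stop

def worker_loop(train_batch):
    while True:
        job = job_queue.get()
        if job is None:
            break
        train_batch(job)  # in gensim this is Cython code that releases the GIL

batches = [["a", "tokenized", "sentence"]] * 1000
producer = Thread(target=job_producer, args=(batches,))
workers = [Thread(target=worker_loop, args=(len,)) for _ in range(NUM_WORKERS)]
producer.start()
for w in workers:
    w.start()
for w in workers:
    w.join()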

Doc2Vec Results

----- MODEL "full-doc2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 1383.67663002 sec.
* Avg queue size: 1.58158682635 elems.
* Processing speed: 133080.756013 words/sec
* Avg CPU loads: 0.01, 98.21, 0.41, 1.42, 2.76, 0.10, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01

----- MODEL "full-doc2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 649.453327179 sec.
* Avg queue size: 7.55924170616 elems.
* Processing speed: 283530.223488 words/sec
* Avg CPU loads: 6.85, 38.09, 10.82, 14.12, 9.71, 2.30, 12.36, 12.65, 9.65, 12.70, 23.39, 17.30, 34.89, 19.81, 10.53, 27.11

----- MODEL "full-doc2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 852.938983917 sec.
* Avg queue size: 15.4957369062 elems.
* Processing speed: 215888.833166 words/sec
* Avg CPU loads: 7.84, 9.28, 13.51, 8.76, 12.57, 14.47, 16.64, 17.81, 17.22, 21.99, 14.87, 20.35, 19.53, 19.46, 15.66, 16.24

----- MODEL "full-doc2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 880.833570957 sec.
* Avg queue size: 19.4769775679 elems.
* Processing speed: 209053.271891 words/sec
* Avg CPU loads: 13.89, 13.32, 12.32, 13.91, 14.64, 16.48, 14.22, 14.48, 17.50, 14.94, 16.86, 15.50, 15.46, 14.08, 16.15, 16.62

----- MODEL "full-doc2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 891.707653999 sec.
* Avg queue size: 23.4707259953 elems.
* Processing speed: 206503.589124 words/sec
* Avg CPU loads: 13.66, 14.25, 15.48, 16.64, 16.50, 15.96, 15.49, 15.55, 14.67, 15.54, 15.35, 14.36, 14.05, 13.92, 15.07, 13.53

----- MODEL "full-doc2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 897.130183935 sec.
* Avg queue size: 27.4576074332 elems.
* Processing speed: 205255.330048 words/sec
* Avg CPU loads: 15.05, 14.75, 14.23, 15.38, 14.75, 14.31, 14.97, 14.92, 15.33, 14.73, 15.51, 15.23, 14.54, 14.24, 15.15, 15.02

Unfortunately, I don't see a worker-starvation problem here, because the avg queue size metric grows with the number of workers. I think that for doc2vec the main problem is CPU-bound code that is not well optimized.

P.S. I tried reducing the CPU-bound computation for doc2vec by re-running the benchmark with window size = 3: no change, I saw the same picture as above.

FastText Results

----- MODEL "full-fasttext-window-10-workers-01-size-300" RESULTS -----
* Total time: 6285.61437321 sec.
* Avg queue size: 1.46123298033 elems.
* Processing speed: 28851.581919 words/sec
* Avg CPU loads: 16.27, 0.12, 0.31, 83.63, 0.15, 0.04, 0.02, 0.41, 0.06, 0.02, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00

----- MODEL "full-fasttext-window-10-workers-04-size-300" RESULTS -----
* Total time: 1778.50108886 sec.
* Avg queue size: 7.38855421687 elems.
* Processing speed: 101967.814997 words/sec
* Avg CPU loads: 0.34, 0.02, 0.02, 98.24, 81.53, 98.08, 0.03, 0.02, 0.59, 0.39, 86.10, 0.04, 12.10, 0.00, 1.64, 17.41

----- MODEL "full-fasttext-window-10-workers-08-size-300" RESULTS -----
* Total time: 1118.67161107 sec.
* Avg queue size: 14.3133208255 elems.
* Processing speed: 162112.411905 words/sec
* Avg CPU loads: 36.36, 2.76, 39.65, 93.53, 75.64, 93.34, 93.55, 56.12, 58.07, 93.65, 54.47, 0.80, 18.46, 1.95, 0.05, 37.99

----- MODEL "full-fasttext-window-10-workers-10-size-300" RESULTS -----
* Total time: 1139.1305759 sec.
* Avg queue size: 0.664527956004 elems.
* Processing speed: 159200.315431 words/sec
* Avg CPU loads: 54.72, 52.11, 53.93, 54.07, 53.97, 56.72, 60.23, 64.16, 32.59, 36.32, 34.62, 33.19, 34.32, 31.69, 26.28, 21.80

----- MODEL "full-fasttext-window-10-workers-12-size-300" RESULTS -----
* Total time: 1150.42088914 sec.
* Avg queue size: 0.0190389845875 elems.
* Processing speed: 157638.112027 words/sec
* Avg CPU loads: 51.05, 51.39, 53.22, 53.05, 55.50, 55.58, 58.20, 62.45, 33.91, 33.78, 33.06, 33.89, 30.84, 31.21, 29.19, 24.07

----- MODEL "full-fasttext-window-10-workers-14-size-300" RESULTS -----
* Total time: 1114.04786587 sec.
* Avg queue size: 0.0131208997188 elems.
* Processing speed: 162784.644678 words/sec
* Avg CPU loads: 50.24, 49.94, 54.69, 54.49, 54.75, 57.14, 57.28, 59.53, 39.44, 39.90, 34.53, 35.05, 33.85, 31.75, 32.52, 28.86

The situation here is "better" (for me, because the multistream API will be helpful here) than for doc2vec: the avg queue size drops almost to zero at some point, there is no linear performance increase, and CPUs are not fully utilized.

@piskvorky
Owner

@persiyanov great start! Note that multistream is primarily meant to help with the dictionary building phase (before any training epochs). The word2vec/doc2vec/fasttext training is already heavily optimized and parallelized, although multistream should help there too, especially with many cores. But it's the dictionary building that is completely single-threaded and slow.

That doc2vec behaves differently from word2vec is surprising. It's nearly the same algorithm, with the same optimizations (I believe even the same portions of code). CC @gojomo.

@persiyanov
Contributor Author

persiyanov commented May 17, 2018

@piskvorky @menshikh-iv

That's the first time I've heard about the "dictionary building phase" problem and that multistream is supposed to solve it:

  1. The project ideas page says nothing about problems with vocabulary building.
  2. I read these issues and didn't see any discussion of vocabulary building being slow.
  3. I submitted my proposal, which clearly doesn't aim to solve problems with vocabulary building.
  4. If we want to optimize the vocabulary building stage, I should have known about that before GSoC started.
  5. I'll do my best, but I can't promise that I'll complete the vocabulary optimization.

I think that for large datasets and many epochs, the time spent on vocabulary building is much smaller than the time spent on training. So optimizing the training phase is more important.

@piskvorky
Owner

piskvorky commented May 17, 2018

Gensim users report that for their datasets, word2vec vocab building takes a lot of time. I don't remember the exact percentage, but IIRC I saw numbers like 20-40% of overall training time. How much was it in your tests above?

Sorry about not explicitly pointing out vocab building as an important beneficiary of the multi-stream API. That was clearly an omission.

On the other hand, with everything else in place, parallelizing the vocab phase seems almost trivial (build a vocab for each stream separately, then merge them at the end; no communication needed). So I'm not terribly worried about it.
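A minimal sketch of that build-and-merge idea, using multiprocessing and collections.Counter (illustrative; the actual implementation in this PR may differ):

from collections import Counter
from multiprocessing import Pool


def count_stream(stream):
    """Build a local vocabulary (word -> count) for one input stream."""
    vocab = Counter()
    for sentence in stream:
        vocab.update(sentence)
    return vocab


def build_vocab_multistream(streams, processes=4):
    """Count each stream independently, then merge -- no communication needed."""
    with Pool(processes) as pool:
        partial_vocabs = pool.map(count_stream, streams)
    merged = Counter()
    for vocab in partial_vocabs:
        merged.update(vocab)
    return merged


# Example: two in-memory "streams" of tokenized sentences.
streams = [
    [["hello", "world"], ["hello", "gensim"]],
    [["multistream", "vocab"], ["hello", "vocab"]],
]
print(build_vocab_multistream(streams, processes=2))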

@persiyanov
Contributor Author

In my experiments, the vocabulary stage took only ~2 minutes. Here are the vocabulary specs:

2018-05-16 12:28:17,058 : INFO : collected 2805453 word types and 3000000 unique tags from a corpus of 3000000 examples and 185102290 words

@piskvorky
Owner

piskvorky commented May 17, 2018

Thanks. That's between 10-50% of the time of a single epoch, right? That's in line with what I remember.

The more epochs, the lower this number becomes, of course (although with large corpora, sometimes there's only one epoch). Cutting it down through parallelization should be an easy win.

@gojomo
Collaborator

gojomo commented May 17, 2018

For benchmarks, it'd help to:

  • print all training parameters with results – often the achievable parallelism is highly affected by parameters like window, negative, and size (for each, higher tends to get better parallelism, via longer spans inside noGIL sections, and thus less bottlenecking on the GIL areas)
  • include the default and commonly-used window size of 5
  • be sure none of the texts have more than 10k tokens (after which tokens are silently ignored, confounding true rates-of-progress in word-count)
  • clarify method of corpus iteration – are examples coming from RAM or disk? Is any complex tokenization still in the iterator? (Here the answers are "disk, just whitespace-breaking" but doing a full-RAM test might reveal other bottlenecks.)
  • perhaps, discover exactly the optimal number of threads for any given setup – as opposed to just "near 10" (w2v) or "near 4" (d2v)

I'm a bit surprised by the difference in throughput (and job queue lengths) between what should be very similar w2v/d2v setups (W2V CBOW with a 10-word window vs. D2V DM with a 10-word window differ only by the inclusion of one extra doc-vec in the context/corrections), but I would have to dig deep to understand why it's happening.
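Regarding the 10k-token point above: a quick pre-flight check over the corpus could look like this (the file name is hypothetical; 10000 is the limit gojomo mentions, after which extra tokens in a text are silently ignored):

MAX_TOKENS = 10000  # tokens beyond this in a single text are silently ignored during training

too_long = 0
with open("enwiki_slice.txt") as corpus:  # hypothetical corpus file, one text per line
    for lineno, line in enumerate(corpus, start=1):
        n_tokens = len(line.split())
        if n_tokens > MAX_TOKENS:
            too_long += 1
            print("line %d: %d tokens" % (lineno, n_tokens))
print("%d texts exceed %d tokens" % (too_long, MAX_TOKENS))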

@persiyanov
Contributor Author

Sent2Vec results.

• Time spent building the Sent2Vec vocab is ~1 hour, which is quite slow.

----- MODEL "sent2vec-sent2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 6578.89023399 sec.
* Avg queue size: 1.51812478215 elems.
* Processing speed: 6979.72307894 words/sec
* Avg CPU loads: 0.44, 20.40, 0.40, 0.43, 0.17, 0.19, 0.11, 0.74, 0.04, 0.22, 48.55, 0.76, 8.57, 0.12, 20.67, 0.63
----- MODEL "sent2vec-sent2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 4444.96238685 sec.
* Avg queue size: 7.34023178808 elems.
* Processing speed: 11041.0770056 words/sec
* Avg CPU loads: 3.96, 13.30, 13.14, 13.58, 13.44, 13.03, 11.26, 18.29, 18.61, 16.50, 22.31, 17.12, 16.85, 14.98, 12.31, 13.65
----- MODEL "sent2vec-sent2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 4417.36249804 sec.
* Avg queue size: 15.1806142575 elems.
* Processing speed: 11716.2800252 words/sec
* Avg CPU loads: 10.88, 15.93, 20.09, 18.46, 22.08, 22.11, 23.61, 26.96, 25.21, 19.50, 17.17, 18.37, 13.90, 14.34, 11.87, 9.78
----- MODEL "sent2vec-sent2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 4316.13834405 sec.
* Avg queue size: 16.5152118283 elems.
* Processing speed: 13038.0299504 words/sec
* Avg CPU loads: 17.60, 17.77, 19.84, 21.82, 22.61, 23.67, 25.17, 25.75, 18.60, 18.73, 17.93, 15.83, 14.14, 12.82, 10.99, 9.63
----- MODEL "sent2vec-sent2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 4363.98958588 sec.
* Avg queue size: 20.5866252822 elems.
* Processing speed: 13054.8526478 words/sec
* Avg CPU loads: 20.25, 21.76, 21.37, 22.37, 22.23, 23.33, 22.87, 24.25, 18.18, 16.55, 17.24, 16.35, 15.88, 14.92, 15.05, 13.79
----- MODEL "sent2vec-sent2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 4447.54199004 sec.
* Avg queue size: 24.3785576126 elems.
* Processing speed: 13574.7372673 words/sec
* Avg CPU loads: 22.33, 22.20, 22.53, 23.16, 22.85, 23.79, 23.53, 23.19, 17.71, 17.97, 17.55, 17.69, 17.72, 17.13, 16.82, 17.55

@persiyanov
Contributor Author

persiyanov commented May 21, 2018

Doc2Vec cProfile

name                                  ncall  tsub      ttot      tavg      
..hon2.7/threading.py:743 Thread.run  9      0.007869  2300.735  255.6372
..ny2vec.py:125 Doc2Vec._worker_loop  8      18.29677  2150.211  268.7763
..c2vec.py:441 Doc2Vec._do_train_job  18941  42.84655  2128.714  0.112387
..ec_inner.pyx:364 train_document_dm  300..  2001.621  2001.621  0.000667
..ls/doc2vec.py:277 Doc2Vec.__init__  1      0.000137  266.4669  266.4669
..doc2vec.py:702 Doc2Vec.build_vocab  1      0.022338  256.1062  256.1062
..vec.py:798 Doc2VecVocab.scan_vocab  1      84.28159  164.1206  164.1206
..py:963 TaggedLineDocument.__iter__  600..  34.31165  152.4012  0.000025
..y2vec.py:148 Doc2Vec._job_producer  1      12.15365  150.5162  150.5162
.. Doc2VecTrainables.prepare_weights  1      0.000014  68.05760  68.05760
...

Word2Vec profile for comparison

name                                  ncall  tsub      ttot      tavg      
..hon2.7/threading.py:743 Thread.run  9      0.005520  1288.751  143.1946
..y2vec.py:125 Word2Vec._worker_loop  8      15.08401  1171.058  146.3823
..2vec.py:536 Word2Vec._do_train_job  18963  0.354295  1145.082  0.060385
..vec_inner.pyx:404 train_batch_cbow  18963  1136.349  1136.349  0.059925
../word2vec.py:426 Word2Vec.__init__  1      0.000032  179.6019  179.6019
..e_any2vec.py:342 Word2Vec.__init__  1      0.000071  179.6018  179.6018
..ny2vec.py:504 Word2Vec.build_vocab  1      0.022748  171.3067  171.3067
..2vec.py:1071 LineSentence.__iter__  474..  41.34936  146.8659  0.000031
..c.py:1159 Word2VecVocab.scan_vocab  1      75.41324  135.3223  135.3223
..2vec.py:148 Word2Vec._job_producer  1      8.624359  117.6874  117.6874
...

The proportions Word2Vec._do_train_job / total word2vec time and Doc2Vec._do_train_job / total doc2vec time are roughly the same, so the problem is inside Doc2Vec._do_train_job itself.

I see that in doc2vec, train_document_dm is called separately for each document in a job. By contrast, in word2vec, train_batch_cbow is called once per job and processes all sentences in the batch in a single call. Could this be the source of the doc2vec timing problem? @gojomo @menshikh-iv

@gojomo
Collaborator

gojomo commented May 21, 2018

That the train_batch_cbow() cythonized function trains the entire batch inside one noGIL block could be a big contributor to the speedup, yes.
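A simplified sketch of the two call patterns being compared (signatures reduced to the essentials; the real functions live in word2vec_inner.pyx and doc2vec_inner.pyx):

def train_batch_cbow(model, sentences):
    """Stand-in for the Cython function: trains ALL sentences of a job
    inside one long nogil block."""

def train_document_dm(model, words, tags):
    """Stand-in for the Cython function: trains ONE document, then
    re-acquires the GIL before the next call."""

def word2vec_do_train_job(model, sentences):
    # one Cython call per job -> one long GIL-free region per job
    train_batch_cbow(model, sentences)

def doc2vec_do_train_job(model, documents):
    # one Cython call per document -> thousands of short GIL-free regions
    # per job, so worker threads contend on the GIL far more often
    for words, tags in documents:
        train_document_dm(model, words, tags)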

@persiyanov force-pushed the feature/gsoc-multistream-api-1 branch from 0d9dd83 to b9668ee on May 27, 2018 12:04
@persiyanov
Contributor Author

I've written a report on the last two weeks in a blog post: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html

@piskvorky
Owner

piskvorky commented May 28, 2018

@persiyanov nice! Did you tweet this in English? I'd like to retweet your post, for people who are following the GSoC progress.

@persiyanov
Contributor Author

@piskvorky I didn't have a twitter account until this day... https://twitter.com/dpersiyanov/status/1001157238441037829

@persiyanov
Contributor Author

The last optimization has finally resulted in linear scaling (2x faster than Mikolov's word2vec). Here is the table:

# workers | total time (sec) | processing speed (words/sec) | sum CPU load (%)
--------- | ---------------- | ---------------------------- | ----------------
        1 |          1023.93 |                    168408.17 |           104.00
        4 |           268.52 |                    642134.95 |           398.91
        8 |           134.32 |                   1283689.56 |           807.40
       10 |           125.04 |                   1378849.23 |          1010.21
       14 |            93.16 |                   1850972.69 |          1407.70

P.S. Mikolov's word2vec benchmark is here
P.P.S. All experiments related to multistream training are collected in one gist here
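For context, a hypothetical usage sketch of the multistream training mode benchmarked above; the input_streams parameter name and the pre-split corpus files are assumptions for illustration, not necessarily the final API:

from gensim.models.word2vec import LineSentence, Word2Vec

# The corpus is pre-split into one file per worker, so every worker
# consumes its own stream and no shared job producer is needed.
input_streams = [LineSentence("enwiki_part_%02d.txt" % i) for i in range(14)]

model = Word2Vec(
    input_streams=input_streams,  # illustrative parameter name
    size=300,
    window=10,
    workers=14,
)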

@jayantj
Contributor

jayantj commented Jun 30, 2018

Great work @persiyanov !

@gojomo
Collaborator

gojomo commented Jun 30, 2018

Those are great numbers! But, from a quick glance at the Cython changes, it looks like the old ability to provide texts as lists-of-tokens may have been removed?

@piskvorky
Owner

piskvorky commented Jun 30, 2018

@gojomo yes; see our recent #opensource Slack discussion here and here for alternatives and ideas. (The Slack chat format is more convenient, but I guess we should really discuss things in the open here on GitHub, not internally.)

@menshikh-iv
Contributor

Continued in #2127
