
[WIP GSOC 2018]: Multistream API, Part 1 #2048

Closed

Conversation

persiyanov
Contributor

This is a PR for my GSoC project.

@persiyanov
Contributor Author

I've benchmarked the current word2vec, doc2vec and fastText implementations.

Hardware specs: 16 x Intel Xeon 2.30 GHz CPUs, 60 GB RAM
Data: a ~1.2 GB slice of English Wikipedia from here; each model instance was trained for one epoch.
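For reference, here is roughly how the per-core "Avg CPU loads" numbers below can be collected; this is a minimal sketch assuming psutil for sampling, not the actual benchmark harness used:

import threading

import psutil  # assumed dependency; my real harness is not shown here

def sample_cpu_loads(stop_event, samples, interval=1.0):
    """Append one per-core CPU utilization reading every `interval` seconds."""
    while not stop_event.is_set():
        # percpu=True returns a list with one utilization percentage per logical core
        samples.append(psutil.cpu_percent(interval=interval, percpu=True))

samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_cpu_loads, args=(stop, samples))
sampler.start()
# ... train the model here ...
stop.set()
sampler.join()
# average load per core over the whole run (the "Avg CPU loads" rows below)
avg_loads = [round(sum(core) / len(samples), 2) for core in zip(*samples)]
print(avg_loads)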

Word2Vec Results

----- MODEL "full-word2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 1182.58386683 sec.
* Avg queue size: 1.48076923077 elems.
* Processing speed: 153350.576721 words/sec
* Avg CPU loads: 0.06, 0.02, 0.05, 99.40, 1.68, 0.25, 2.48, 0.11, 0.14, 0.03, 0.02, 0.00, 0.00, 0.01, 0.00, 0.10

----- MODEL "full-word2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 313.03660202 sec.
* Avg queue size: 7.09708737864 elems.
* Processing speed: 579322.695268 words/sec
* Avg CPU loads: 0.13, 0.06, 1.03, 0.16, 0.91, 16.28, 0.01, 0.01, 94.11, 22.10, 72.40, 0.40, 0.00, 0.00, 93.96, 93.67

----- MODEL "full-word2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 277.351661921 sec.
* Avg queue size: 0.0255474452555 elems.
* Processing speed: 653863.581506 words/sec
* Avg CPU loads: 25.71, 29.21, 19.85, 24.07, 38.67, 42.76, 27.28, 37.72, 29.91, 29.09, 35.91, 35.46, 21.26, 13.55, 29.03, 17.07

----- MODEL "full-word2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 275.248829842 sec.
* Avg queue size: 0.0404411764706 elems.
* Processing speed: 658857.998068 words/sec
* Avg CPU loads: 25.60, 26.90, 27.83, 21.12, 22.92, 29.67, 35.28, 40.76, 34.92, 33.96, 32.69, 33.75, 34.16, 26.19, 18.26, 12.84

----- MODEL "full-word2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 285.958873987 sec.
* Avg queue size: 0.0247349823322 elems.
* Processing speed: 634182.830109 words/sec
* Avg CPU loads: 23.00, 24.52, 27.39, 26.22, 29.77, 29.84, 37.62, 39.08, 32.67, 32.38, 30.09, 29.49, 27.07, 24.23, 22.69, 13.93

----- MODEL "full-word2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 288.641264915 sec.
* Avg queue size: 0.0175438596491 elems.
* Processing speed: 628288.079506 words/sec
* Avg CPU loads: 22.97, 21.53, 27.90, 26.05, 31.30, 30.67, 35.71, 36.23, 34.16, 30.63, 30.53, 32.36, 27.86, 21.09, 22.16, 21.02

Up to 4 workers everything is okay:
• approx. linear speedup going from 1 to 4 workers
• ~4 CPUs are fully utilized

But increasing the number of workers to 8, 10, 12 or 14 reveals a worker-starvation problem:
• avg queue size is almost zero
• processing speed doesn't increase linearly, it plateaus
• CPUs are not fully utilized (each CPU is at ~20-40%)

So, a multistream API could help word2vec solve this scalability issue.
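To make the starvation concrete: in the current design a single Python thread produces jobs for all workers, so once there are more than a few consumers it can no longer keep the shared queue filled. A simplified sketch of that pattern (illustrative only, not the actual gensim code; the multistream idea replaces the shared producer with one input stream per worker):

from queue import Queue
from threading import Thread

NUM_WORKERS = 8
job_queue = Queue(maxsize=2 * NUM_WORKERS)

def job_producer(batches):
    # One thread reads, tokenizes and batches the corpus for ALL workers.
    # With many workers this loop becomes the bottleneck -- hence the
    # near-zero "avg queue size" in the 8/10/12/14-worker runs above.
    for batch in batches:
        job_queue.put(batch)
    for _ in range(NUM_WORKERS):
        job_queue.put(None)  # sentinel: tell each worker to stop

def worker_loop(train_batch):
    while True:
        job = job_queue.get()
        if job is None:
            break
        train_batch(job)  # in gensim this is Cython code that releases the GIL

batches = [["a", "tokenized", "sentence"]] * 1000
producer = Thread(target=job_producer, args=(batches,))
workers = [Thread(target=worker_loop, args=(len,)) for _ in range(NUM_WORKERS)]
producer.start()
for w in workers:
    w.start()
for w in workers:
    w.join()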

Doc2Vec Results

----- MODEL "full-doc2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 1383.67663002 sec.
* Avg queue size: 1.58158682635 elems.
* Processing speed: 133080.756013 words/sec
* Avg CPU loads: 0.01, 98.21, 0.41, 1.42, 2.76, 0.10, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01

----- MODEL "full-doc2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 649.453327179 sec.
* Avg queue size: 7.55924170616 elems.
* Processing speed: 283530.223488 words/sec
* Avg CPU loads: 6.85, 38.09, 10.82, 14.12, 9.71, 2.30, 12.36, 12.65, 9.65, 12.70, 23.39, 17.30, 34.89, 19.81, 10.53, 27.11

----- MODEL "full-doc2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 852.938983917 sec.
* Avg queue size: 15.4957369062 elems.
* Processing speed: 215888.833166 words/sec
* Avg CPU loads: 7.84, 9.28, 13.51, 8.76, 12.57, 14.47, 16.64, 17.81, 17.22, 21.99, 14.87, 20.35, 19.53, 19.46, 15.66, 16.24

----- MODEL "full-doc2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 880.833570957 sec.
* Avg queue size: 19.4769775679 elems.
* Processing speed: 209053.271891 words/sec
* Avg CPU loads: 13.89, 13.32, 12.32, 13.91, 14.64, 16.48, 14.22, 14.48, 17.50, 14.94, 16.86, 15.50, 15.46, 14.08, 16.15, 16.62

----- MODEL "full-doc2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 891.707653999 sec.
* Avg queue size: 23.4707259953 elems.
* Processing speed: 206503.589124 words/sec
* Avg CPU loads: 13.66, 14.25, 15.48, 16.64, 16.50, 15.96, 15.49, 15.55, 14.67, 15.54, 15.35, 14.36, 14.05, 13.92, 15.07, 13.53

----- MODEL "full-doc2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 897.130183935 sec.
* Avg queue size: 27.4576074332 elems.
* Processing speed: 205255.330048 words/sec
* Avg CPU loads: 15.05, 14.75, 14.23, 15.38, 14.75, 14.31, 14.97, 14.92, 15.33, 14.73, 15.51, 15.23, 14.54, 14.24, 15.15, 15.02

Unfortunately, I don't see a worker-starvation problem here, because the avg queue size metric grows with the number of workers. I think that for doc2vec the main problem is CPU-bound code that is not well optimized.

P.S. I tried reducing the CPU-bound computation for doc2vec by re-running the benchmark with window size = 3: no change, I saw the same picture as above.

FastText Results

----- MODEL "full-fasttext-window-10-workers-01-size-300" RESULTS -----
* Total time: 6285.61437321 sec.
* Avg queue size: 1.46123298033 elems.
* Processing speed: 28851.581919 words/sec
* Avg CPU loads: 16.27, 0.12, 0.31, 83.63, 0.15, 0.04, 0.02, 0.41, 0.06, 0.02, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00

----- MODEL "full-fasttext-window-10-workers-04-size-300" RESULTS -----
* Total time: 1778.50108886 sec.
* Avg queue size: 7.38855421687 elems.
* Processing speed: 101967.814997 words/sec
* Avg CPU loads: 0.34, 0.02, 0.02, 98.24, 81.53, 98.08, 0.03, 0.02, 0.59, 0.39, 86.10, 0.04, 12.10, 0.00, 1.64, 17.41

----- MODEL "full-fasttext-window-10-workers-08-size-300" RESULTS -----
* Total time: 1118.67161107 sec.
* Avg queue size: 14.3133208255 elems.
* Processing speed: 162112.411905 words/sec
* Avg CPU loads: 36.36, 2.76, 39.65, 93.53, 75.64, 93.34, 93.55, 56.12, 58.07, 93.65, 54.47, 0.80, 18.46, 1.95, 0.05, 37.99

----- MODEL "full-fasttext-window-10-workers-10-size-300" RESULTS -----
* Total time: 1139.1305759 sec.
* Avg queue size: 0.664527956004 elems.
* Processing speed: 159200.315431 words/sec
* Avg CPU loads: 54.72, 52.11, 53.93, 54.07, 53.97, 56.72, 60.23, 64.16, 32.59, 36.32, 34.62, 33.19, 34.32, 31.69, 26.28, 21.80

----- MODEL "full-fasttext-window-10-workers-12-size-300" RESULTS -----
* Total time: 1150.42088914 sec.
* Avg queue size: 0.0190389845875 elems.
* Processing speed: 157638.112027 words/sec
* Avg CPU loads: 51.05, 51.39, 53.22, 53.05, 55.50, 55.58, 58.20, 62.45, 33.91, 33.78, 33.06, 33.89, 30.84, 31.21, 29.19, 24.07

----- MODEL "full-fasttext-window-10-workers-14-size-300" RESULTS -----
* Total time: 1114.04786587 sec.
* Avg queue size: 0.0131208997188 elems.
* Processing speed: 162784.644678 words/sec
* Avg CPU loads: 50.24, 49.94, 54.69, 54.49, 54.75, 57.14, 57.28, 59.53, 39.44, 39.90, 34.53, 35.05, 33.85, 31.75, 32.52, 28.86

The situation here is "better" (for me, because the multistream API will be helpful here) than for doc2vec: the avg queue size drops almost to zero at some point, there is no linear performance increase, and CPUs are not fully utilized.

@piskvorky
Owner

@persiyanov great start! Note that multistream is primarily meant to help with the dictionary building phase (before any training epochs). The word2vec/doc2vec/fasttext training is already heavily optimized and parallelized, although multistream should help there too, especially with many cores. But it's the dictionary building that is completely single-threaded and slow.

That doc2vec behaves differently from word2vec is surprising. It's nearly the same algorithm, with the same optimizations (I believe even the same portions of code). CC @gojomo.

@persiyanov
Contributor Author

persiyanov commented May 17, 2018

@piskvorky @menshikh-iv

That's the first time I've heard about the "dictionary building phase" problem and that multistream is supposed to solve it:

  1. The project ideas page says nothing about problems with vocabulary building.
  2. I read these issues and didn't see any discussion of vocabulary building being slow.
  3. I submitted my proposal, which clearly doesn't aim to solve problems with vocabulary building.
  4. If we want to optimize the vocabulary building stage, I should have known about that before GSoC started.
  5. I'll do my best, but I can't promise that I'll complete the vocabulary optimization.

I think that for large datasets and many epochs, the time spent on vocabulary building is much smaller than the time spent on training. So optimizing the training phase is more important.

@piskvorky
Owner

piskvorky commented May 17, 2018

Gensim users report that for their datasets, word2vec vocab building takes a lot of time. I don't remember the exact percentage, but IIRC I saw numbers like 20-40% of overall training time. How much was it in your tests above?

Sorry about not explicitly pointing out vocab building as an important beneficiary of the multi-stream API. That was clearly an omission.

On the other hand, with everything else in place, parallelizing the vocab phase seems almost trivial (build a vocab for each stream separately, then merge them at the end; no communication needed). So I'm not terribly worried about it.
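A minimal sketch of that build-and-merge idea, using multiprocessing and collections.Counter (illustrative; the actual implementation in this PR may differ):

from collections import Counter
from multiprocessing import Pool


def count_stream(stream):
    """Build a local vocabulary (word -> count) for one input stream."""
    vocab = Counter()
    for sentence in stream:
        vocab.update(sentence)
    return vocab


def build_vocab_multistream(streams, processes=4):
    """Count each stream independently, then merge -- no communication needed."""
    with Pool(processes) as pool:
        partial_vocabs = pool.map(count_stream, streams)
    merged = Counter()
    for vocab in partial_vocabs:
        merged.update(vocab)
    return merged


# Example: two in-memory "streams" of tokenized sentences.
streams = [
    [["hello", "world"], ["hello", "gensim"]],
    [["multistream", "vocab"], ["hello", "vocab"]],
]
print(build_vocab_multistream(streams, processes=2))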

@persiyanov
Contributor Author

In my experiments, the vocabulary stage took only ~2 minutes. Here are the vocabulary specs:

2018-05-16 12:28:17,058 : INFO : collected 2805453 word types and 3000000 unique tags from a corpus of 3000000 examples and 185102290 words

@piskvorky
Owner

piskvorky commented May 17, 2018

Thanks. That's between 10-50% of the time of a single epoch, right? That's in line with what I remember.

The more epochs, the lower this number becomes, of course (although with large corpora, sometimes there's only one epoch). Cutting it down through parallelization should be an easy win.

@gojomo
Collaborator

gojomo commented May 17, 2018

For benchmarks, it'd help to:

  • print all training parameters with results – often the achievable parallelism is highly affected by parameters like window, negative, and size (for each, higher tends to get better parallelism, via longer spans inside noGIL sections, and thus less bottlenecking on the GIL areas)
  • include the default and commonly-used window size of 5
  • be sure none of the texts have more than 10k tokens (after which tokens are silently ignored, confounding true rates-of-progress in word-count)
  • clarify method of corpus iteration – are examples coming from RAM or disk? Is any complex tokenization still in the iterator? (Here the answers are "disk, just whitespace-breaking" but doing a full-RAM test might reveal other bottlenecks.)
  • perhaps, discover exactly the optimal number of threads for any given setup – as opposed to just "near 10" (w2v) or "near 4" (d2v)

I'm a bit surprised by the difference in throughput (and job queue lengths) between what should be very similar w2v/d2v setups (W2V CBOW with a 10-word window vs. D2V DM with a 10-word window differ only by the inclusion of one extra doc-vec in the context/corrections), but I would have to dig deep to understand why it's happening.
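Regarding the 10k-token point above: a quick pre-flight check over the corpus could look like this (the file name is hypothetical; 10000 is the limit gojomo mentions, after which extra tokens in a text are silently ignored):

MAX_TOKENS = 10000  # tokens beyond this in a single text are silently ignored during training

too_long = 0
with open("enwiki_slice.txt") as corpus:  # hypothetical corpus file, one text per line
    for lineno, line in enumerate(corpus, start=1):
        n_tokens = len(line.split())
        if n_tokens > MAX_TOKENS:
            too_long += 1
            print("line %d: %d tokens" % (lineno, n_tokens))
print("%d texts exceed %d tokens" % (too_long, MAX_TOKENS))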

@persiyanov
Contributor Author

Sent2Vec results.

• Time spent building the Sent2Vec vocab is ~1 hour, which is quite slow.

----- MODEL "sent2vec-sent2vec-window-10-workers-01-size-300" RESULTS -----
* Total time: 6578.89023399 sec.
* Avg queue size: 1.51812478215 elems.
* Processing speed: 6979.72307894 words/sec
* Avg CPU loads: 0.44, 20.40, 0.40, 0.43, 0.17, 0.19, 0.11, 0.74, 0.04, 0.22, 48.55, 0.76, 8.57, 0.12, 20.67, 0.63
----- MODEL "sent2vec-sent2vec-window-10-workers-04-size-300" RESULTS -----
* Total time: 4444.96238685 sec.
* Avg queue size: 7.34023178808 elems.
* Processing speed: 11041.0770056 words/sec
* Avg CPU loads: 3.96, 13.30, 13.14, 13.58, 13.44, 13.03, 11.26, 18.29, 18.61, 16.50, 22.31, 17.12, 16.85, 14.98, 12.31, 13.65
----- MODEL "sent2vec-sent2vec-window-10-workers-08-size-300" RESULTS -----
* Total time: 4417.36249804 sec.
* Avg queue size: 15.1806142575 elems.
* Processing speed: 11716.2800252 words/sec
* Avg CPU loads: 10.88, 15.93, 20.09, 18.46, 22.08, 22.11, 23.61, 26.96, 25.21, 19.50, 17.17, 18.37, 13.90, 14.34, 11.87, 9.78
----- MODEL "sent2vec-sent2vec-window-10-workers-10-size-300" RESULTS -----
* Total time: 4316.13834405 sec.
* Avg queue size: 16.5152118283 elems.
* Processing speed: 13038.0299504 words/sec
* Avg CPU loads: 17.60, 17.77, 19.84, 21.82, 22.61, 23.67, 25.17, 25.75, 18.60, 18.73, 17.93, 15.83, 14.14, 12.82, 10.99, 9.63
----- MODEL "sent2vec-sent2vec-window-10-workers-12-size-300" RESULTS -----
* Total time: 4363.98958588 sec.
* Avg queue size: 20.5866252822 elems.
* Processing speed: 13054.8526478 words/sec
* Avg CPU loads: 20.25, 21.76, 21.37, 22.37, 22.23, 23.33, 22.87, 24.25, 18.18, 16.55, 17.24, 16.35, 15.88, 14.92, 15.05, 13.79
----- MODEL "sent2vec-sent2vec-window-10-workers-14-size-300" RESULTS -----
* Total time: 4447.54199004 sec.
* Avg queue size: 24.3785576126 elems.
* Processing speed: 13574.7372673 words/sec
* Avg CPU loads: 22.33, 22.20, 22.53, 23.16, 22.85, 23.79, 23.53, 23.19, 17.71, 17.97, 17.55, 17.69, 17.72, 17.13, 16.82, 17.55

@persiyanov
Contributor Author

persiyanov commented May 21, 2018

Doc2Vec cProfile

name                                  ncall  tsub      ttot      tavg      
..hon2.7/threading.py:743 Thread.run  9      0.007869  2300.735  255.6372
..ny2vec.py:125 Doc2Vec._worker_loop  8      18.29677  2150.211  268.7763
..c2vec.py:441 Doc2Vec._do_train_job  18941  42.84655  2128.714  0.112387
..ec_inner.pyx:364 train_document_dm  300..  2001.621  2001.621  0.000667
..ls/doc2vec.py:277 Doc2Vec.__init__  1      0.000137  266.4669  266.4669
..doc2vec.py:702 Doc2Vec.build_vocab  1      0.022338  256.1062  256.1062
..vec.py:798 Doc2VecVocab.scan_vocab  1      84.28159  164.1206  164.1206
..py:963 TaggedLineDocument.__iter__  600..  34.31165  152.4012  0.000025
..y2vec.py:148 Doc2Vec._job_producer  1      12.15365  150.5162  150.5162
.. Doc2VecTrainables.prepare_weights  1      0.000014  68.05760  68.05760
...

Word2Vec profile for comparison

name                                  ncall  tsub      ttot      tavg      
..hon2.7/threading.py:743 Thread.run  9      0.005520  1288.751  143.1946
..y2vec.py:125 Word2Vec._worker_loop  8      15.08401  1171.058  146.3823
..2vec.py:536 Word2Vec._do_train_job  18963  0.354295  1145.082  0.060385
..vec_inner.pyx:404 train_batch_cbow  18963  1136.349  1136.349  0.059925
../word2vec.py:426 Word2Vec.__init__  1      0.000032  179.6019  179.6019
..e_any2vec.py:342 Word2Vec.__init__  1      0.000071  179.6018  179.6018
..ny2vec.py:504 Word2Vec.build_vocab  1      0.022748  171.3067  171.3067
..2vec.py:1071 LineSentence.__iter__  474..  41.34936  146.8659  0.000031
..c.py:1159 Word2VecVocab.scan_vocab  1      75.41324  135.3223  135.3223
..2vec.py:148 Word2Vec._job_producer  1      8.624359  117.6874  117.6874
...

The proportions Word2Vec._do_train_job / total word2vec time and Doc2Vec._do_train_job / total doc2vec time are roughly the same, so the problem is inside Doc2Vec._do_train_job itself.

I see that in doc2vec, train_document_dm is called separately for each document in a job. By contrast, in word2vec, train_batch_cbow is called once per job and processes all sentences in the batch in a single call. Could this be the source of the doc2vec timing problem? @gojomo @menshikh-iv

@gojomo
Collaborator

gojomo commented May 21, 2018

That the train_batch_cbow() cythonized function trains the entire batch inside one noGIL block could be a big contributor to the speedup, yes.
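A simplified sketch of the two call patterns being compared (signatures reduced to the essentials; the real functions live in word2vec_inner.pyx and doc2vec_inner.pyx):

def train_batch_cbow(model, sentences):
    """Stand-in for the Cython function: trains ALL sentences of a job
    inside one long nogil block."""

def train_document_dm(model, words, tags):
    """Stand-in for the Cython function: trains ONE document, then
    re-acquires the GIL before the next call."""

def word2vec_do_train_job(model, sentences):
    # one Cython call per job -> one long GIL-free region per job
    train_batch_cbow(model, sentences)

def doc2vec_do_train_job(model, documents):
    # one Cython call per document -> thousands of short GIL-free regions
    # per job, so worker threads contend on the GIL far more often
    for words, tags in documents:
        train_document_dm(model, words, tags)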

@persiyanov force-pushed the feature/gsoc-multistream-api-1 branch from 0d9dd83 to b9668ee on May 27, 2018 12:04
@persiyanov
Contributor Author

I've written a report on the last two weeks in a blog post: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html

@piskvorky
Owner

piskvorky commented May 28, 2018

@persiyanov nice! Did you tweet this in English? I'd like to retweet your post, for people who are following the GSoC progress.

@persiyanov
Contributor Author

@piskvorky I didn't have a twitter account until this day... https://twitter.com/dpersiyanov/status/1001157238441037829

@persiyanov
Contributor Author

The last optimization has finally resulted in linear scaling (2x faster than Mikolov's word2vec). Here is the table:

# workers | total time (sec) | processing speed (words/sec) | sum CPU load (%)
--------- | ---------------- | ---------------------------- | ----------------
        1 |          1023.93 |                    168408.17 |           104.00
        4 |           268.52 |                    642134.95 |           398.91
        8 |           134.32 |                   1283689.56 |           807.40
       10 |           125.04 |                   1378849.23 |          1010.21
       14 |            93.16 |                   1850972.69 |          1407.70

P.S. Mikolov's word2vec benchmark is here
P.P.S. All experiments related to multistream training are collected in one gist here
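For context, a hypothetical usage sketch of the multistream training mode benchmarked above; the input_streams parameter name and the pre-split corpus files are assumptions for illustration, not necessarily the final API:

from gensim.models.word2vec import LineSentence, Word2Vec

# The corpus is pre-split into one file per worker, so every worker
# consumes its own stream and no shared job producer is needed.
input_streams = [LineSentence("enwiki_part_%02d.txt" % i) for i in range(14)]

model = Word2Vec(
    input_streams=input_streams,  # illustrative parameter name
    size=300,
    window=10,
    workers=14,
)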

@jayantj
Contributor

jayantj commented Jun 30, 2018

Great work @persiyanov !

@gojomo
Collaborator

gojomo commented Jun 30, 2018

Those are great numbers! But, from a quick glance at the Cython changes, it looks like the old ability to provide texts as lists-of-tokens may have been removed?

@piskvorky
Owner

piskvorky commented Jun 30, 2018

@gojomo yes; see our recent #opensource Slack discussion here and here for alternatives and ideas. (The Slack chat format is more convenient, but I guess we should really discuss things in the open here on GitHub, not internally.)

@menshikh-iv
Contributor

Continued in #2127
