FP16 speed benchmark #204

Closed · hanxiao opened this issue Jan 19, 2019 · 10 comments
Labels: discussion (Discuss some feature or code)

Comments

@hanxiao (Member) commented Jan 19, 2019

google-research/bert#378

Since version 1.7.5 of bert-as-service, a new option -fp16 has been added to the CLI. It loads a pretrained/fine-tuned BERT model (originally in FP32) and converts it to an FP16 frozen graph. As a result, the model size and memory footprint are reduced by 40% compared to the FP32 counterpart. On the other hand, the speedup depends heavily on the device, GPU driver, CUDA version, etc., and I currently don't have an environment in which to measure the speedup with -fp16.

If someone has a half-precision-enabled device, it would be really awesome if you could test the performance and report it here or in the bert-as-service issues. Thanks in advance! 🙇
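For anyone reporting numbers, here is a minimal sketch of a speed check; it assumes a server is already running (started with or without -fp16) and uses the BertClient API with placeholder ports.

```python
# Minimal sketch of a speed check, assuming a server is already running
# (e.g. bert-serving-start -model_dir /path/to/model -fp16); ports are placeholders.
import time
from bert_serving.client import BertClient

bc = BertClient(port=5555, port_out=5556)
sentences = ['hello world! this is a speed-test sentence.'] * 4096

start = time.perf_counter()
bc.encode(sentences)
elapsed = time.perf_counter() - start
print('encoded %d sentences in %.2fs, speed: %d samples/s'
      % (len(sentences), elapsed, len(sentences) / elapsed))
```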

@hanxiao added the "help wanted" (Extra attention is needed) and "discussion" (Discuss some feature or code) labels on Jan 19, 2019
@davidlenz commented

Would love to help. I am running a Titan V on Windows 10 with the following specs:

[screenshot: GPU driver and CUDA details]

I guess I just run benchmark.py and add the -fp16 flag to the argument list in line 16, i.e.

common = vars(get_args_parser().parse_args(['-model_dir', MODEL_DIR, '-port', str(PORT), '-port_out', str(PORT_OUT)]))

Is that correct?

Also, should the benchmark be run with the Chinese language model as in benchmark.py, or can I stick with the multilingual model I am currently running?

@hanxiao (Member, Author) commented Jan 21, 2019

Thanks a lot for your initiative! Yes, correct. Just add -fp16 to
https://github.com/hanxiao/bert-as-service/blob/6d57dddf2528751cfd8e8ea470c22df4e8651362/benchmark.py#L16
and it should work.

No, you don't need the Chinese BERT at all. You can use whatever BERT model you like; since the speed is measured at the sample level rather than the token level, changing the BERT language should not affect it.
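For reference, a sketch of what that line looks like with the flag appended (MODEL_DIR, PORT and PORT_OUT are defined earlier in the script):

```python
# benchmark.py, line 16 with '-fp16' appended (sketch; MODEL_DIR, PORT and
# PORT_OUT are module-level constants defined earlier in benchmark.py)
common = vars(get_args_parser().parse_args([
    '-model_dir', MODEL_DIR,
    '-port', str(PORT),
    '-port_out', str(PORT_OUT),
    '-fp16']))
```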

@davidlenz commented Jan 21, 2019

Thanks for the clarifications! The benchmark is currently running; however, I had to take some additional steps, which I outline below so we can determine whether I introduced any inconsistencies.

Why not just run benchmark.py as is?

Since bert-serving-start does not work for me, as described in #99, I used the fix suggested in #99 (comment), i.e. starting the server via python start-bert-as-service.py with the necessary flags.

So why is this a problem for benchmark.py?

The server is started from within benchmark.py, which did not work for me. I guess this is related to the previous point.

https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L70
https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L71
https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L94

Cool. How did you work around that?

I commented out the server start and shutdown procedures in the experimental loop of benchmark.py (see the lines above) and started the server once, as a one-time step, using

python start-bert-as-service.py -model_dir E:\bert\pretrained_model\multi_cased_L-12_H-768_A-12 -port 7779 -port_out 7780 -fp16

With the server running, I then proceeded to run benchmark.py, which is going well so far.
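Roughly, the setup looks like this (a sketch; the Python wrapper below is hypothetical, in practice I simply ran the command above once in a separate shell):

```python
# Hypothetical wrapper around the one-time server start described above;
# in practice the start-bert-as-service.py command was run once in a separate shell.
import subprocess

server = subprocess.Popen([
    'python', 'start-bert-as-service.py',
    '-model_dir', r'E:\bert\pretrained_model\multi_cased_L-12_H-768_A-12',
    '-port', '7779',
    '-port_out', '7780',
    '-fp16',
])

# ... run benchmark.py (with its in-loop server start/close commented out)
#     against ports 7779/7780, then shut the server down:
server.terminate()
```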

Any trouble caused by this?

Based on my (probably limited) understanding, the changes I made should not affect the benchmark results, right? If that is the case, I'll update this comment with the results once they are available. If this did introduce inconsistencies, please let me know what I missed and, if possible, how to work around it.
Yes, it did, since some of the benchmark arguments need to be passed to the server at startup in order to take effect!

Thanks! Also, for clarification, I used the version of benchmark.py from commit 9a5b015 but hard-coded the model_dir flag to where my model is located.

Edit: fixed formatting.
Edit2: fixed spelling and explained in more detail which version of benchmark.py was used.
Edit3: updated the shown lines of code from benchmark.py to reflect the correct commit (9a5b015)

Update

Based on a comparison of the obtained results with the results posted in the README, I am now pretty certain that the approach I took did skew the results (now that I think about it again, it makes a lot of sense, since parameters like pooling_layer have to be passed when the server starts). I guess I'll have to fix bert-serving-start first before I can deliver reliable results. I'll post the obtained results anyway for completeness.

Speed wrt. client_batch_size

| client_batch_size | seqs/s |
|---|---|
| 1 | 58 |
| 4 | 129 |
| 8 | 163 |
| 16 | 195 |
| 64 | 213 |
| 256 | 216 |
| 512 | 225 |
| 1024 | 229 |
| 2048 | 230 |
| 4096 | 232 |

Speed wrt. max_batch_size

| max_batch_size | seqs/s |
|---|---|
| 32 | 229 |
| 64 | 231 |
| 128 | 230 |
| 256 | 231 |
| 512 | 233 |

Speed wrt. max_seq_len

| max_seq_len | seqs/s |
|---|---|
| 20 | 231 |
| 40 | 231 |
| 80 | 225 |
| 160 | 232 |
| 320 | 230 |

Speed wrt. num_client

| num_client | seqs/s |
|---|---|
| 2 | 120 |
| 4 | 59 |
| 8 | 29 |
| 16 | 14 |
| 32 | 7 |

Speed wrt. pooling_layer

| pooling_layer | seqs/s |
|---|---|
| [-1] | 233 |
| [-2] | 233 |
| [-3] | 233 |
| [-4] | 233 |
| [-5] | 234 |
| [-6] | 233 |
| [-7] | 233 |
| [-8] | 233 |
| [-9] | 233 |
| [-10] | 233 |
| [-11] | 233 |
| [-12] | 233 |

@hanxiao (Member, Author) commented Jan 23, 2019

Thanks a lot for the detailed investigation. I improved benchmark.py in 1.7.7; it is now part of the CLI. After pip install -U, you can use bert-serving-benchmark to run the benchmark. It reuses the argparser defined for bert-serving-start and thus accepts everything bert-serving-start accepts. Details can be found here:
https://github.com/hanxiao/bert-as-service/blob/master/server/bert_serving/server/cli/__init__.py
and here:
https://github.com/hanxiao/bert-as-service/blob/master/server/bert_serving/server/helper.py#L176

btw, the broken CLI link on Windows (#194) is also fixed.

@hanxiao (Member, Author) commented Jan 23, 2019

Besides the benchmark script, python example/example1.py PORT PORT_OUT can also give you a quick overview of the speed. On my Tesla V100 (FP16 supported), here are the results:

FP32

encoding 717 strs in 0.41s, speed: 1729/s
encoding 7887 strs in 3.39s, speed: 2328/s
encoding 15057 strs in 6.39s, speed: 2357/s
encoding 22227 strs in 9.59s, speed: 2318/s
encoding 29397 strs in 12.92s, speed: 2275/s
encoding 36567 strs in 16.66s, speed: 2194/s
encoding 43737 strs in 20.97s, speed: 2086/s
encoding 50907 strs in 24.30s, speed: 2094/s
encoding 58077 strs in 27.79s, speed: 2089/s
encoding 65247 strs in 31.29s, speed: 2085/s
encoding 72417 strs in 34.79s, speed: 2081/s
encoding 79587 strs in 38.11s, speed: 2088/s
encoding 86757 strs in 41.65s, speed: 2082/s
encoding 93927 strs in 44.96s, speed: 2089/s
encoding 101097 strs in 48.21s, speed: 2096/s
encoding 108267 strs in 51.68s, speed: 2094/s
encoding 115437 strs in 55.16s, speed: 2092/s
encoding 122607 strs in 58.38s, speed: 2100/s
encoding 129777 strs in 62.14s, speed: 2088/s
encoding 136947 strs in 65.44s, speed: 2092/s

FP16 (bert-serving-start -fp16 -model_dir ...)

~/alg/bert-as-service# python example/example1.py 5570 5571
encoding 717 strs in 0.28s, speed: 2547/s
encoding 7887 strs in 2.88s, speed: 2739/s
encoding 15057 strs in 5.28s, speed: 2851/s
encoding 22227 strs in 7.75s, speed: 2868/s
encoding 29397 strs in 10.32s, speed: 2848/s
encoding 36567 strs in 13.03s, speed: 2805/s
encoding 43737 strs in 15.54s, speed: 2814/s
encoding 50907 strs in 17.75s, speed: 2867/s
encoding 58077 strs in 20.22s, speed: 2871/s
encoding 65247 strs in 23.05s, speed: 2831/s
encoding 72417 strs in 25.92s, speed: 2794/s
encoding 79587 strs in 28.31s, speed: 2811/s
encoding 86757 strs in 30.86s, speed: 2811/s
encoding 93927 strs in 33.30s, speed: 2820/s
encoding 101097 strs in 35.54s, speed: 2844/s
encoding 108267 strs in 37.38s, speed: 2896/s
encoding 115437 strs in 40.92s, speed: 2820/s
encoding 122607 strs in 43.48s, speed: 2819/s
encoding 129777 strs in 46.00s, speed: 2821/s
encoding 136947 strs in 47.51s, speed: 2882/s

So, ~1.4x speedup using -fp16.
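In spirit, that example just times BertClient.encode over batches of sentences; a rough sketch (not the actual example1.py) looks like this:

```python
# Rough sketch of a quick speed check in the spirit of example/example1.py
# (not the actual script); PORT and PORT_OUT must match the running server.
import sys
import time
from bert_serving.client import BertClient

port, port_out = int(sys.argv[1]), int(sys.argv[2])
bc = BertClient(port=port, port_out=port_out)

sentence = 'a quick brown fox jumps over the lazy dog'
for num in [512, 1024, 2048, 4096, 8192]:
    batch = [sentence] * num
    start = time.perf_counter()
    bc.encode(batch)
    elapsed = time.perf_counter() - start
    print('encoding %d strs in %.2fs, speed: %d/s'
          % (num, elapsed, num / elapsed))
```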

@hanxiao (Member, Author) commented Jan 23, 2019

Got some new results:

[image: new benchmark results]

@davidlenz commented

After updating to the latest release, the CLI works very well, thank you!

Benchmark Results with Titan V and -fp16

(see #204 (comment) for driver and CUDA details)

bert-serving-benchmark -model_dir "path_to_bert\multi_cased_L-12_H-768_A-12" -client_vocab_file "path_to_bert\multi_cased_L-12_H-768_A-12\vocab.txt" -fp16

The benchmark procedure took around 17 hours.

| client_batch_size | samples/s |
|---|---|
| 1 | 60 |
| 16 | 192 |
| 256 | 209 |
| 4096 | 231 |

| max_batch_size | samples/s |
|---|---|
| 8 | 194 |
| 32 | 226 |
| 128 | 232 |
| 512 | 231 |

| max_seq_len | samples/s |
|---|---|
| 32 | 250 |
| 64 | 216 |
| 128 | 178 |
| 256 | 124 |

| num_client | samples/s |
|---|---|
| 1 | 231 |
| 4 | 61 |
| 16 | 16 |
| 64 | 3 |

| pooling_layer | samples/s |
|---|---|
| [-1] | 209 |
| [-2] | 227 |
| [-3] | 250 |
| [-4] | 277 |
| [-5] | 313 |
| [-6] | 357 |
| [-7] | 417 |
| [-8] | 492 |
| [-9] | 600 |
| [-10] | 789 |
| [-11] | 1116 |
| [-12] | 1454 |

I'll update with FP32 results once they are available.

@hanxiao added the "working on it" (Currently working this issue) label on Jan 25, 2019
@davidlenz commented Jan 29, 2019

Benchmark Results with Titan V and FP32

After reviewing the results, I think I may need to rerun the -fp16 benchmark script, as it seems something was slowing down the process while it was running. Any thoughts?
Edit: the benchmark procedure took around 3 hours this time.

bert-serving-benchmark -model_dir "multi_cased_L-12_H-768_A-12" -client_vocab_file "multi_cased_L-12_H-768_A-12\vocab.txt"

| client_batch_size | samples/s |
|---|---|
| 1 | 75 |
| 16 | 599 |
| 256 | 956 |
| 4096 | 1523 |

| max_batch_size | samples/s |
|---|---|
| 8 | 667 |
| 32 | 1220 |
| 128 | 1402 |
| 512 | 1594 |

| max_seq_len | samples/s |
|---|---|
| 32 | 1223 |
| 64 | 671 |
| 128 | 342 |
| 256 | 164 |

| num_client | samples/s |
|---|---|
| 1 | 1538 |
| 4 | 469 |
| 16 | 124 |
| 64 | 30 |

| pooling_layer | samples/s |
|---|---|
| [-1] | 1422 |
| [-2] | 1533 |
| [-3] | 1620 |
| [-4] | 1679 |
| [-5] | 1694 |
| [-6] | 1724 |
| [-7] | 1741 |
| [-8] | 1771 |
| [-9] | 1797 |
| [-10] | 1822 |
| [-11] | 1830 |
| [-12] | 1848 |

@hanxiao (Member, Author) commented Jan 29, 2019

Hi @davidlenz, thanks for your time and effort on benchmarking this. Your result suggests that your GPU/driver may not support FP16 instructions. Before running another exhaustive benchmark, could you do a quick test using python example/example1.py PORT PORT_OUT? Namely:

  1. Start a server: bert-serving-start ...
  2. Test with python example/example1.py PORT PORT_OUT
  3. Start an FP16 server: bert-serving-start -fp16 ...
  4. Test again with python example/example1.py PORT PORT_OUT

and then copy-paste the client output here? Thanks a lot!
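If you want to rule out the card/driver independently of bert-as-service, a raw TensorFlow matmul timing along these lines (a sketch; matrix size and iteration count are arbitrary) should run noticeably faster in FP16 than in FP32 on a card with working half-precision support:

```python
# Standalone FP16 vs FP32 matmul timing (TF 1.x, as used by bert-as-service at the
# time); a sketch for sanity-checking the GPU/driver, independent of BERT itself.
import time
import tensorflow as tf

def bench(dtype, n=4096, iters=20):
    tf.reset_default_graph()
    a = tf.Variable(tf.cast(tf.random_normal([n, n]), dtype))
    b = tf.Variable(tf.cast(tf.random_normal([n, n]), dtype))
    # reduce to a scalar so we time the matmul, not the device-to-host transfer
    c = tf.reduce_sum(tf.matmul(a, b))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(c)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            sess.run(c)
        return (time.perf_counter() - start) / iters

print('fp32: %.4f s per matmul' % bench(tf.float32))
print('fp16: %.4f s per matmul' % bench(tf.float16))
```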

@davidlenz commented

Looks like you made the right guess. So I should probably update the drivers to enable FP16 instructions?
Here are the results:

fp32

encoding        512 sentences   0.34s   1492 samples/s   13859 tokens/s
encoding       1024 sentences   0.62s   1661 samples/s   15429 tokens/s
encoding       2048 sentences   1.20s   1702 samples/s   15812 tokens/s
encoding       4096 sentences   2.41s   1701 samples/s   15806 tokens/s
encoding       8192 sentences   4.81s   1703 samples/s   15819 tokens/s
encoding      16384 sentences   9.50s   1723 samples/s   16008 tokens/s
encoding      32768 sentences   19.07s  1717 samples/s   15955 tokens/s
encoding      65536 sentences   37.87s  1730 samples/s   16073 tokens/s
encoding     131072 sentences   76.13s  1721 samples/s   15989 tokens/s
encoding     262144 sentences   150.24s 1744 samples/s   16205 tokens/s

fp16

encoding        512 sentences   2.29s    223 samples/s    2080 tokens/s
encoding       1024 sentences   4.49s    228 samples/s    2118 tokens/s
encoding       2048 sentences   8.87s    230 samples/s    2143 tokens/s
encoding       4096 sentences   17.70s   231 samples/s    2149 tokens/s
encoding       8192 sentences   35.42s   231 samples/s    2147 tokens/s
encoding      16384 sentences   70.81s   231 samples/s    2148 tokens/s
encoding      32768 sentences   141.61s  231 samples/s    2148 tokens/s
encoding      65536 sentences   283.16s  231 samples/s    2149 tokens/s
encoding     131072 sentences   564.28s  232 samples/s    2157 tokens/s
encoding     262144 sentences   1134.71s         231 samples/s    2145 tokens/s

@hanxiao removed the "help wanted" and "working on it" labels on Feb 15, 2019
@hanxiao closed this as completed on Feb 15, 2019