FP16 speed benchmark #204

Closed · hanxiao opened this issue Jan 19, 2019 · 10 comments
Labels: discussion (Discuss some feature or code)

Comments

@hanxiao (Member) commented Jan 19, 2019

google-research/bert#378

Since version 1.7.5 of bert-as-service, a new option -fp16 has been added to the CLI. It loads a pretrained/fine-tuned BERT model (originally in FP32) and converts it to an FP16 frozen graph. As a result, the model size and memory footprint are reduced by 40% compared to the FP32 counterpart. On the other hand, the speedup depends heavily on the device, GPU driver, CUDA version, etc., and I currently don't have an environment in which to measure the speedup with -fp16.

If someone has a half-precision-enabled device, it would be really awesome if you could test the performance and report it here or in the bert-as-service issues. Thanks in advance! 🙇
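For anyone reporting numbers, here is a minimal sketch of a speed check; it assumes a server is already running (started with or without -fp16) and uses the BertClient API with placeholder ports.

```python
# Minimal sketch of a speed check, assuming a server is already running
# (e.g. bert-serving-start -model_dir /path/to/model -fp16); ports are placeholders.
import time
from bert_serving.client import BertClient

bc = BertClient(port=5555, port_out=5556)
sentences = ['hello world! this is a speed-test sentence.'] * 4096

start = time.perf_counter()
bc.encode(sentences)
elapsed = time.perf_counter() - start
print('encoded %d sentences in %.2fs, speed: %d samples/s'
      % (len(sentences), elapsed, len(sentences) / elapsed))
```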

@hanxiao added the "help wanted" (Extra attention is needed) and "discussion" (Discuss some feature or code) labels on Jan 19, 2019
@davidlenz commented

Would love to help. I am running a Titan V on Windows 10 with the following specs:

[screenshot: GPU driver and CUDA details]

I guess I just run benchmark.py and add the -fp16 flag to the argument list in line 16, i.e.

common = vars(get_args_parser().parse_args(['-model_dir', MODEL_DIR, '-port', str(PORT), '-port_out', str(PORT_OUT)]))

Is that correct?

Also, should the benchmark be run with the Chinese language model as in benchmark.py, or can I stick with the multilingual model I am currently running?

@hanxiao (Member, Author) commented Jan 21, 2019

Thanks a lot for your initiative! Yes, correct. Just add -fp16 to
https://github.com/hanxiao/bert-as-service/blob/6d57dddf2528751cfd8e8ea470c22df4e8651362/benchmark.py#L16
and it should work.

No, you don't need the Chinese BERT at all. You can use whatever BERT model you like; since the speed is measured at the sample level rather than the token level, changing the BERT language should not affect it.
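For reference, a sketch of what that line looks like with the flag appended (MODEL_DIR, PORT and PORT_OUT are defined earlier in the script):

```python
# benchmark.py, line 16 with '-fp16' appended (sketch; MODEL_DIR, PORT and
# PORT_OUT are module-level constants defined earlier in benchmark.py)
common = vars(get_args_parser().parse_args([
    '-model_dir', MODEL_DIR,
    '-port', str(PORT),
    '-port_out', str(PORT_OUT),
    '-fp16']))
```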

@davidlenz commented Jan 21, 2019

Thanks for the clarifications! The benchmark is currently running; however, I had to take some additional steps, which I outline below so we can determine whether I introduced any inconsistencies.

Why not just run benchmark.py as is?

Since bert-serving-start does not work for me, as described in #99, I used the fix suggested in #99 (comment), i.e. starting the server via python start-bert-as-service.py with the necessary flags.

So why is this a problem for benchmark.py?

The server is started from within benchmark.py, which did not work for me. I guess this is related to the previous point.

https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L70
https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L71
https://github.com/hanxiao/bert-as-service/blob/9a5b015c1b9d925d88769d93eb17704d5ddb8691/benchmark.py#L94

Cool. How did you work around that?

I commented out the server start and shutdown procedures in the experimental loop of benchmark.py (see the lines above) and started the server once, as a one-time step, using

python start-bert-as-service.py -model_dir E:\bert\pretrained_model\multi_cased_L-12_H-768_A-12 -port 7779 -port_out 7780 -fp16

With the server running, I then proceeded to run benchmark.py, which is going well so far.
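Roughly, the setup looks like this (a sketch; the Python wrapper below is hypothetical, in practice I simply ran the command above once in a separate shell):

```python
# Hypothetical wrapper around the one-time server start described above;
# in practice the start-bert-as-service.py command was run once in a separate shell.
import subprocess

server = subprocess.Popen([
    'python', 'start-bert-as-service.py',
    '-model_dir', r'E:\bert\pretrained_model\multi_cased_L-12_H-768_A-12',
    '-port', '7779',
    '-port_out', '7780',
    '-fp16',
])

# ... run benchmark.py (with its in-loop server start/close commented out)
#     against ports 7779/7780, then shut the server down:
server.terminate()
```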

Any trouble caused by this?

Based on my (probably limited) understanding, the changes I made should not affect the benchmark results, right? If that is the case, I'll update this comment with the results once they are available. If this did introduce inconsistencies, please let me know what I missed and, if possible, how to work around it.
Yes, it did, since some of the benchmark arguments need to be passed to the server at startup in order to take effect!

Thanks! Also, for clarification, I used the version of benchmark.py from commit 9a5b015 but hard-coded the model_dir flag to where my model is located.

Edit: fixed formatting.
Edit2: fixed spelling and explained in more detail which version of benchmark.py was used.
Edit3: updated the shown lines of code from benchmark.py to reflect the correct commit (9a5b015)

Update

Based on a comparison of the obtained results with the results posted in the README, I am now pretty certain that the approach I took did skew the results (now that I think about it again, it makes a lot of sense, since parameters like pooling_layer have to be passed when the server starts). I guess I'll have to fix bert-serving-start first before I can deliver reliable results. I'll post the obtained results anyway for completeness.

Speed wrt. client_batch_size

| client_batch_size | seqs/s |
|---|---|
| 1 | 58 |
| 4 | 129 |
| 8 | 163 |
| 16 | 195 |
| 64 | 213 |
| 256 | 216 |
| 512 | 225 |
| 1024 | 229 |
| 2048 | 230 |
| 4096 | 232 |

Speed wrt. max_batch_size

| max_batch_size | seqs/s |
|---|---|
| 32 | 229 |
| 64 | 231 |
| 128 | 230 |
| 256 | 231 |
| 512 | 233 |

Speed wrt. max_seq_len

| max_seq_len | seqs/s |
|---|---|
| 20 | 231 |
| 40 | 231 |
| 80 | 225 |
| 160 | 232 |
| 320 | 230 |

Speed wrt. num_client

| num_client | seqs/s |
|---|---|
| 2 | 120 |
| 4 | 59 |
| 8 | 29 |
| 16 | 14 |
| 32 | 7 |

Speed wrt. pooling_layer

| pooling_layer | seqs/s |
|---|---|
| [-1] | 233 |
| [-2] | 233 |
| [-3] | 233 |
| [-4] | 233 |
| [-5] | 234 |
| [-6] | 233 |
| [-7] | 233 |
| [-8] | 233 |
| [-9] | 233 |
| [-10] | 233 |
| [-11] | 233 |
| [-12] | 233 |

@hanxiao (Member, Author) commented Jan 23, 2019

Thanks a lot for the detailed investigation. I improved benchmark.py in 1.7.7; it is now part of the CLI. After pip install -U, you can use bert-serving-benchmark to run the benchmark. It reuses the argparser defined for bert-serving-start and thus accepts everything bert-serving-start accepts. Details can be found here:
https://github.com/hanxiao/bert-as-service/blob/master/server/bert_serving/server/cli/__init__.py
and here:
https://github.com/hanxiao/bert-as-service/blob/master/server/bert_serving/server/helper.py#L176

btw, the broken CLI link on Windows (#194) is also fixed.

@hanxiao (Member, Author) commented Jan 23, 2019

Besides the benchmark script, python example/example1.py PORT PORT_OUT can also give you a quick overview of the speed. On my Tesla V100 (FP16 supported), here are the results:

FP32

encoding 717 strs in 0.41s, speed: 1729/s
encoding 7887 strs in 3.39s, speed: 2328/s
encoding 15057 strs in 6.39s, speed: 2357/s
encoding 22227 strs in 9.59s, speed: 2318/s
encoding 29397 strs in 12.92s, speed: 2275/s
encoding 36567 strs in 16.66s, speed: 2194/s
encoding 43737 strs in 20.97s, speed: 2086/s
encoding 50907 strs in 24.30s, speed: 2094/s
encoding 58077 strs in 27.79s, speed: 2089/s
encoding 65247 strs in 31.29s, speed: 2085/s
encoding 72417 strs in 34.79s, speed: 2081/s
encoding 79587 strs in 38.11s, speed: 2088/s
encoding 86757 strs in 41.65s, speed: 2082/s
encoding 93927 strs in 44.96s, speed: 2089/s
encoding 101097 strs in 48.21s, speed: 2096/s
encoding 108267 strs in 51.68s, speed: 2094/s
encoding 115437 strs in 55.16s, speed: 2092/s
encoding 122607 strs in 58.38s, speed: 2100/s
encoding 129777 strs in 62.14s, speed: 2088/s
encoding 136947 strs in 65.44s, speed: 2092/s

FP16 (bert-serving-start -fp16 -model_dir ...)

~/alg/bert-as-service# python example/example1.py 5570 5571
encoding 717 strs in 0.28s, speed: 2547/s
encoding 7887 strs in 2.88s, speed: 2739/s
encoding 15057 strs in 5.28s, speed: 2851/s
encoding 22227 strs in 7.75s, speed: 2868/s
encoding 29397 strs in 10.32s, speed: 2848/s
encoding 36567 strs in 13.03s, speed: 2805/s
encoding 43737 strs in 15.54s, speed: 2814/s
encoding 50907 strs in 17.75s, speed: 2867/s
encoding 58077 strs in 20.22s, speed: 2871/s
encoding 65247 strs in 23.05s, speed: 2831/s
encoding 72417 strs in 25.92s, speed: 2794/s
encoding 79587 strs in 28.31s, speed: 2811/s
encoding 86757 strs in 30.86s, speed: 2811/s
encoding 93927 strs in 33.30s, speed: 2820/s
encoding 101097 strs in 35.54s, speed: 2844/s
encoding 108267 strs in 37.38s, speed: 2896/s
encoding 115437 strs in 40.92s, speed: 2820/s
encoding 122607 strs in 43.48s, speed: 2819/s
encoding 129777 strs in 46.00s, speed: 2821/s
encoding 136947 strs in 47.51s, speed: 2882/s

So, ~1.4x speedup using -fp16.
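In spirit, that example just times BertClient.encode over batches of sentences; a rough sketch (not the actual example1.py) looks like this:

```python
# Rough sketch of a quick speed check in the spirit of example/example1.py
# (not the actual script); PORT and PORT_OUT must match the running server.
import sys
import time
from bert_serving.client import BertClient

port, port_out = int(sys.argv[1]), int(sys.argv[2])
bc = BertClient(port=port, port_out=port_out)

sentence = 'a quick brown fox jumps over the lazy dog'
for num in [512, 1024, 2048, 4096, 8192]:
    batch = [sentence] * num
    start = time.perf_counter()
    bc.encode(batch)
    elapsed = time.perf_counter() - start
    print('encoding %d strs in %.2fs, speed: %d/s'
          % (num, elapsed, num / elapsed))
```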

@hanxiao (Member, Author) commented Jan 23, 2019

Got some new results:

[image: new benchmark results]

@davidlenz commented

After updating to the latest release, the CLI works very well, thank you!

Benchmark Results with Titan V and -fp16

(see #204 (comment) for driver and CUDA details)

bert-serving-benchmark -model_dir "path_to_bert\multi_cased_L-12_H-768_A-12" -client_vocab_file "path_to_bert\multi_cased_L-12_H-768_A-12\vocab.txt" -fp16

The benchmark procedure took around 17 hours.

| client_batch_size | samples/s |
|---|---|
| 1 | 60 |
| 16 | 192 |
| 256 | 209 |
| 4096 | 231 |

| max_batch_size | samples/s |
|---|---|
| 8 | 194 |
| 32 | 226 |
| 128 | 232 |
| 512 | 231 |

| max_seq_len | samples/s |
|---|---|
| 32 | 250 |
| 64 | 216 |
| 128 | 178 |
| 256 | 124 |

| num_client | samples/s |
|---|---|
| 1 | 231 |
| 4 | 61 |
| 16 | 16 |
| 64 | 3 |

| pooling_layer | samples/s |
|---|---|
| [-1] | 209 |
| [-2] | 227 |
| [-3] | 250 |
| [-4] | 277 |
| [-5] | 313 |
| [-6] | 357 |
| [-7] | 417 |
| [-8] | 492 |
| [-9] | 600 |
| [-10] | 789 |
| [-11] | 1116 |
| [-12] | 1454 |

I'll update with FP32 results once they are available.

@hanxiao added the "working on it" (Currently working this issue) label on Jan 25, 2019
@davidlenz commented Jan 29, 2019

Benchmark Results with Titan V and FP32

After reviewing the results, I think I may need to rerun the -fp16 benchmark script, as it seems something was slowing down the process while it was running. Any thoughts?
Edit: the benchmark procedure took around 3 hours this time.

bert-serving-benchmark -model_dir "multi_cased_L-12_H-768_A-12" -client_vocab_file "multi_cased_L-12_H-768_A-12\vocab.txt"

| client_batch_size | samples/s |
|---|---|
| 1 | 75 |
| 16 | 599 |
| 256 | 956 |
| 4096 | 1523 |

| max_batch_size | samples/s |
|---|---|
| 8 | 667 |
| 32 | 1220 |
| 128 | 1402 |
| 512 | 1594 |

| max_seq_len | samples/s |
|---|---|
| 32 | 1223 |
| 64 | 671 |
| 128 | 342 |
| 256 | 164 |

| num_client | samples/s |
|---|---|
| 1 | 1538 |
| 4 | 469 |
| 16 | 124 |
| 64 | 30 |

| pooling_layer | samples/s |
|---|---|
| [-1] | 1422 |
| [-2] | 1533 |
| [-3] | 1620 |
| [-4] | 1679 |
| [-5] | 1694 |
| [-6] | 1724 |
| [-7] | 1741 |
| [-8] | 1771 |
| [-9] | 1797 |
| [-10] | 1822 |
| [-11] | 1830 |
| [-12] | 1848 |

@hanxiao (Member, Author) commented Jan 29, 2019

Hi @davidlenz, thanks for your time and effort on benchmarking this. Your result suggests that your GPU/driver may not support FP16 instructions. Before running another exhaustive benchmark, could you do a quick test using python example/example1.py PORT PORT_OUT? Namely:

  1. Start a server: bert-serving-start ...
  2. Test with python example/example1.py PORT PORT_OUT
  3. Start an FP16 server: bert-serving-start -fp16 ...
  4. Test again with python example/example1.py PORT PORT_OUT

and then copy-paste the client output here? Thanks a lot!
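If you want to rule out the card/driver independently of bert-as-service, a raw TensorFlow matmul timing along these lines (a sketch; matrix size and iteration count are arbitrary) should run noticeably faster in FP16 than in FP32 on a card with working half-precision support:

```python
# Standalone FP16 vs FP32 matmul timing (TF 1.x, as used by bert-as-service at the
# time); a sketch for sanity-checking the GPU/driver, independent of BERT itself.
import time
import tensorflow as tf

def bench(dtype, n=4096, iters=20):
    tf.reset_default_graph()
    a = tf.Variable(tf.cast(tf.random_normal([n, n]), dtype))
    b = tf.Variable(tf.cast(tf.random_normal([n, n]), dtype))
    # reduce to a scalar so we time the matmul, not the device-to-host transfer
    c = tf.reduce_sum(tf.matmul(a, b))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(c)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            sess.run(c)
        return (time.perf_counter() - start) / iters

print('fp32: %.4f s per matmul' % bench(tf.float32))
print('fp16: %.4f s per matmul' % bench(tf.float16))
```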

@davidlenz commented

Looks like you made the right guess. So I should probably update the drivers to enable FP16 instructions?
Here are the results:

fp32

encoding        512 sentences   0.34s   1492 samples/s   13859 tokens/s
encoding       1024 sentences   0.62s   1661 samples/s   15429 tokens/s
encoding       2048 sentences   1.20s   1702 samples/s   15812 tokens/s
encoding       4096 sentences   2.41s   1701 samples/s   15806 tokens/s
encoding       8192 sentences   4.81s   1703 samples/s   15819 tokens/s
encoding      16384 sentences   9.50s   1723 samples/s   16008 tokens/s
encoding      32768 sentences   19.07s  1717 samples/s   15955 tokens/s
encoding      65536 sentences   37.87s  1730 samples/s   16073 tokens/s
encoding     131072 sentences   76.13s  1721 samples/s   15989 tokens/s
encoding     262144 sentences   150.24s 1744 samples/s   16205 tokens/s

fp16

encoding        512 sentences   2.29s    223 samples/s    2080 tokens/s
encoding       1024 sentences   4.49s    228 samples/s    2118 tokens/s
encoding       2048 sentences   8.87s    230 samples/s    2143 tokens/s
encoding       4096 sentences   17.70s   231 samples/s    2149 tokens/s
encoding       8192 sentences   35.42s   231 samples/s    2147 tokens/s
encoding      16384 sentences   70.81s   231 samples/s    2148 tokens/s
encoding      32768 sentences   141.61s  231 samples/s    2148 tokens/s
encoding      65536 sentences   283.16s  231 samples/s    2149 tokens/s
encoding     131072 sentences   564.28s  232 samples/s    2157 tokens/s
encoding     262144 sentences   1134.71s         231 samples/s    2145 tokens/s

@hanxiao removed the "help wanted" and "working on it" labels on Feb 15, 2019
@hanxiao closed this as completed on Feb 15, 2019