Optimize inference performance of ERNIE INT8 on CPU #275

Open

bingyanghuang opened this issue Nov 20, 2019 · 26 comments


bingyanghuang commented Nov 20, 2019

Currently, Paddle ERNIE FP32 inference performance on CPU is as below:
single thread: 251.464 ms
20 threads: 29.8818 ms
Our goal is to show that with real INT8 kernels, ERNIE can achieve a performance gain.
@Sand3r- @wojtuss Please update your benchmark progress here.
@wzzju @luotao1 Please track the status here.


wojtuss commented Nov 21, 2019

We are currently working with the following two models: ernie_quant.zip and fp32_model.tar.gz.
Results gathered with the optimized fp32_model model will be labeled FP32.
Results gathered with the fake quantized FP32 ernie_quant model will be labeled QAT FP32.
Results gathered with an optimized INT8 model obtained from ernie_quant model will be labeled QAT INT8.

Initial performance results, using the former unit test (for FP32, 4 inputs, test_ds dataset) and the latest unit test (for QAT FP32/INT8, 2 inputs, 1.8w.bs1 dataset) obtained from Baidu
(20 threads, bs 1, CLX 6248):
FP32:
Run 5010 samples, average latency: 55.7539 ms per sample.
Run 5009 samples, average latency [exclude 1 warmup steps]: 55.7415 ms per sample.
QAT FP32:
Run 2490 samples, average latency: 92.2313 ms per sample.
QAT INT8 (with mul):
Run 2490 samples, average latency: 47.1024 ms per sample.


wojtuss commented Nov 22, 2019

I have just noticed that the above results were mixed up. They have been updated now.


wojtuss commented Nov 22, 2019

FP32 results with @GaoWei8's fix (PaddlePaddle/Paddle#20972):
Run 5010 samples, average latency: 32.0582 ms per sample.

@bingyanghuang

Attached are the baseline profile results:

  • FP32 with Gaowei's padding patch
Event Calls Total Min. Max. Ave. Ratio.
fc 370666 127408 0.01446 20.14810 0.34373 0.75962
elementwise_add 190342 10271.7 0.03167 4.93884 0.05396 0.06124
transpose2 240432 7704.78 0.02098 3.71560 0.03205 0.04594
matmul 125225 7266.52 0.02186 3.96825 0.05803 0.04332
scale 70126 4823.14 0.00407 4.92560 0.06878 0.02876
layer_norm 125225 4184.69 0.02592 4.72700 0.03342 0.02495
softmax 60108 2565.96 0.03276 2.04443 0.04269 0.01530
reshape2 240432 1383.47 0.00393 2.92184 0.00575 0.00825
lookup_table 15027 1331.17 0.06888 1.23500 0.08859 0.00794
stack 5009 462.643 0.06707 0.30953 0.09236 0.00276
tanh 5009 181.429 0.02849 0.12309 0.03622 0.00108
slice 5009 68.8743 0.01152 0.02409 0.01375 0.00041
fetch 5009 40.985 0.00489 0.01910 0.00818 0.00024
feed 20036 33.6655 0.00066 0.01238 0.00168 0.00020
  • Real INT8 ERNIE profile
Event Calls Total Min. Max. Ave. Ratio.
elementwise_add 290080 48707.3 0.00604 7.29221 0.16791 0.39068
mul 191660 41987.3 0.02023 4.70104 0.21907 0.33678
reshape2 124320 9530.43 0.02449 0.37624 0.07666 0.07644
quantize 129500 5801.52 0.00967 2.96898 0.04480 0.04653
transpose2 124320 4886.13 0.01829 3.68880 0.03930 0.03919
matmul 64750 3845.23 0.02139 15.42340 0.05939 0.03084
scale 36260 2754.43 0.00546 0.14959 0.07596 0.02209
relu 31080 2548.26 0.05381 2.56348 0.08199 0.02044
layer_norm 64750 2127.01 0.02338 1.18454 0.03285 0.01706
softmax 31080 1192.7 0.02731 2.38116 0.03838 0.00957
lookup_table 7770 707.691 0.04808 0.23893 0.09108 0.00568
stack 2590 260.936 0.06343 0.27896 0.10075 0.00209
slice 7770 74.5761 0.00523 0.02923 0.00960 0.00060
tanh 2590 57.4421 0.01791 0.08974 0.02218 0.00046
unsqueeze2 5180 45.0293 0.00550 0.02471 0.00869 0.00036
fill_constant 10360 37.9978 0.00214 0.02271 0.00367 0.00030
fetch 2590 23.3609 0.00753 0.01779 0.00902 0.00019
range 2590 23.1647 0.00674 0.01581 0.00894 0.00019
expand 2590 18.4988 0.00649 0.01439 0.00714 0.00015
cast 5180 16.361 0.00223 0.01195 0.00316 0.00013
equal 2590 12.2261 0.00411 0.01321 0.00472 0.00010
feed 5180 11.6075 0.00127 0.00955 0.00224 0.00009
shape 2590 5.52088 0.00176 0.01129 0.00213 0.00004


bingyanghuang commented Nov 26, 2019

  1. The FP32 ERNIE model and the INT8 ERNIE model differ slightly in the number of inputs.

  2. The optimized INT8 latency is higher than the optimized FP32 latency because:

    • elementwise_add takes a lot of time after quantizing mul; an FC INT8 op should be introduced (with support for 3-dimensional FC inputs), possibly fusing the elementwise_add bias/residual (a small cross-check sketch follows this list).

    • Too much time is spent on reorders; the INT8 pipeline of the feed-forward part can be further optimized.

    • INT8 GEMM execution with MKL runs without padding (a unit test should benchmark MKL-DNN INT8 GEMM with the same dimensions as the FP32 GEMM).
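
As a cross-check of the first bullet, here is a minimal Python sketch (not taken from any repository) that normalizes the per-op totals from the two profiles above to a per-sample cost; the sample counts (~5009 for the FP32 run, ~2490 for the INT8 run) are assumptions based on the benchmark logs earlier in this thread:

# Hedged sketch: per-sample cost of the dominant ops from the two profiles above.
fp32_samples, int8_samples = 5009, 2490   # assumed from the benchmark logs above

fp32_totals_ms = {"fc": 127408.0, "elementwise_add": 10271.7}
int8_totals_ms = {"mul": 41987.3, "elementwise_add": 48707.3, "quantize": 5801.52}

for op, total in fp32_totals_ms.items():
    print(f"FP32 {op}: {total / fp32_samples:.2f} ms/sample")
for op, total in int8_totals_ms.items():
    print(f"INT8 {op}: {total / int8_samples:.2f} ms/sample")

# The quantized mul is cheaper per sample than the fused FP32 fc, but the unfused
# elementwise_add (bias/residual) grows roughly tenfold, as noted in the first bullet.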

@bingyanghuang

Schedule plan of ERNIE INT8 optimization:

ERNIE INT8 Task                                   Status           Plan to Finish   Risk         Developer
Accuracy check in UT                              WIP              12.06            Low          Asia
MKL-DNN inner product performance benchmark UT    benchmark done   12.06            Low          Wojtuss
3D support in FC INT8 kernel                      WIP              12.13            High         Michal
Fusion: FC INT8                                   Not started      12.13            Dependency   -
Fusion: FC+scale                                  WIP              12.13            Dependency   Danqing
Reshape2 + Transpose2 INT8 kernel verification    WIP              12.13            Dependency   Asia & Wojtuss
Overall performance & accuracy tuning             Not started      12.20            Dependency   -


bingyanghuang commented Dec 5, 2019

MKL-DNN 0.20 from Paddle (commit aef88b7c2)

20 threads
Dimension            128x768x768   128x3072x768   128x768x3072
u8s8s32s32           0.0545178     0.170797       0.129593
s8s8s32s32           0.102495      0.326905       0.320389
f32                  0.135169      0.419238       0.271828
f32 -> s8 increase   24%           22%            -18%

1 thread
Dimension            128x768x768   128x3072x768   128x768x3072
u8s8s32s32           0.21483       0.884561       0.904923
s8s8s32s32           0.638012      2.62521        2.57349
f32                  0.848266      3.6114         3.40055
f32 -> s8 increase   25%           27%            24%
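
The percentage rows above can be reproduced with a small, hypothetical helper (not part of any repository), assuming "f32 -> s8 increase" means the relative latency reduction of the s8s8s32s32 kernel versus the f32 kernel:

# Recompute the "f32 -> s8 increase" row from the 20-thread latencies above.
f32_ms = {"128x768x768": 0.135169, "128x3072x768": 0.419238, "128x768x3072": 0.271828}
s8_ms  = {"128x768x768": 0.102495, "128x3072x768": 0.326905, "128x768x3072": 0.320389}

for dims in f32_ms:
    gain = (f32_ms[dims] - s8_ms[dims]) / f32_ms[dims]
    print(f"{dims}: {gain:+.0%}")   # approximately +24%, +22%, -18%, matching the table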


wozna commented Dec 5, 2019

@luotao1
I'm working on the accuracy test for ernie2 right now.
I've run the test on:
ernie_origin_2_inputs.tar.gz - FP32 model with 2 inputs
data file - 1.8w.bs1
label file - label.xnli.dev that looks like 2 0 1 2 0 1 0 1 2 1 0 2 ...

I got outputs like the ones below:
--- iteration 204 ---
0.0957182 0.0770019 0.243675
--- iteration 205 ---
0.0957183 0.0770018 0.243675
--- iteration 206 ---
0.0957183 0.0770018 0.243675
--- iteration 207 ---
0.0957182 0.0770018 0.243675
--- iteration 208 ---
0.0957182 0.0770017 0.243675

My question is: am I using the correct file or is there any process for calculating accuracy?


bingyanghuang commented Dec 11, 2019

MKL-DNN master (v1.1), commit 1ee831fa6a2f802de1d399fe1de4e6cc629ad855

20 threads
problem descriptor   128x768x768   128x3072x768   128x768x3072
u8s8s32              0.0446907     0.162134       0.132759
s8s8s32              0.0455766     0.15978        0.13624
f32                  0.081141      0.283924       0.467588
f32 -> s8 increase   44%           44%            71%

1 thread
problem descriptor   128x768x768   128x3072x768   128x768x3072
u8s8s32              0.216017      0.925034       0.901553
s8s8s32              0.221654      0.947028       0.942317
f32                  0.742698      2.96747        6.87472
f32 -> s8 increase   70%           68%            86%


wojtuss commented Dec 15, 2019

Below are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with the following affinity settings:

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

on CLX 6248. INT8 tests were run with the memory_optimize_pass disabled, due to PaddlePaddle/Paddle#21492.

With the current develop branch (1fd1f06) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.10 ms,
FP32, 1 thread: 180.95 ms.

With additional commits from PRs applied, the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 23.73 ms (18.45% faster than FP32),
INT8, 1 thread: 84.32 ms (53.40% faster than FP32).
After enabling INT8 kernels of reshape2 and transpose2 (support already present in develop branch), the results were:
INT8, 20 threads: 20.47 ms (29.68% faster than FP32),
INT8, 1 thread: 80.64 ms (55.43% faster than FP32).

With additional optimizations possible to implement:

  • avoiding reorder nwc->ncw before layer_norm (here imitated by removing the reorder),
  • fusing scale ops into fc or into dequantize (here imitated by removing the scale ops),

the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 20.68 ms (28.93% faster than FP32),
INT8, 1 thread: 77.84 ms (56.99% faster than FP32).
With INT8 kernels also for reshape2 and transpose2, the latency was:
INT8, 20 threads: 17.39 ms (40.24% faster than FP32),
INT8, 1 thread: 75.66 ms (58.19% faster than FP32).

Other optimizations, like quantization of more operators and more fusions, are also being investigated.


wozna commented Dec 16, 2019

Let me repeat my question.
I'm checking accuracy on the trained ERNIE model with 2 inputs, but the outputs are still of float type.
They look like this:
--- iteration 0 --- -2.22873 -1.84079 3.08339 ...
And the label file - label.xnli.dev - consists of integers: 2 0 1 2 0 1 0 1 2 1 0 2 ...

Should the output labels have a floating-point type?


Meiyim commented Dec 16, 2019

The output float numbers are the logits of a 3-label classification task, and the label file consists of the class_id of the true label,
so you can evaluate the accuracy with the Python code below:

import sys
import numpy as np

# Each input line is "score0<TAB>score1<TAB>score2<TAB>label", as produced by `paste`.
total, acc = 0, 0
for line in sys.stdin:
    a, b, c, d = line.strip().split('\t')
    pred = np.array([float(a), float(b), float(c)]).argmax()  # predicted class = argmax of the 3 logits
    if pred == int(d):
        acc += 1
    total += 1
print(float(acc) / total)

run with:

paste score_3_columns label.xnli.dev | python eval.py


wozna commented Dec 16, 2019

@Meiyim, you are right. Thank you for the explanation.


wojtuss commented Dec 16, 2019

I have updated the results in #275 (comment) above. Results with additionally quantized reshape2 and transpose2 are added.


luotao1 commented Dec 18, 2019

FP32, 1 thread: 180.95 ms.

@bingyanghuang @wojtuss Could you tell us why updating MKL-DNN to 1.1 brings such a great improvement (250 ms -> 180 ms)? We really want to know what changed in MKL-DNN 1.1.


wojtuss commented Dec 19, 2019

@luotao1 ,
Today we found that, since our last benchmarks, there have been some changes in the code of the FC INT8 3D support PR (PaddlePaddle/Paddle#21746) which affected performance. After merging the PRs into the develop branch, there is a performance difference between our previous results and today's, both on CLX 6248 and on SKX 6148.
Here are results showing the difference on SKX 6148:

  • previous FP32
    1 thread: 183.92 ms
    20 threads: 31.09 ms
  • current FP32
    1 thread: 214.39 ms
    20 threads: 53.69 ms

We are investigating that.


wojtuss commented Dec 20, 2019

@luotao1 ,
The PR with FC INT8 3D support has been fixed and updated.


wojtuss commented Dec 20, 2019

The latest FP32 results for the current clean develop branch (25e765a4fe) on SKX 6148:

  • 4-dimensional input (fp32_model, test_ds):
    1 thread: 189.39 ms,
    20 threads: 30.20 ms.
  • 2-dimensional input (origin, 1.8w.bs1):
    1 thread: 192.31 ms,
    20 threads: 33.90 ms.

After merging the PRs PaddlePaddle/Paddle#21746 and PaddlePaddle/Paddle#21754, the FP32 results are basically the same:

  • 2-dimensional input (origin, 1.8w.bs1):
    1 thread: 194.45 ms,
    20 threads: 33.93 ms.


luotao1 commented Dec 21, 2019

@GaoWei8 Please double-check the FP32 performance.


GaoWei8 commented Dec 23, 2019

The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148:

4-dimensional input (fp32_model, test_ds):
1 thread: 249.714 ms
20 threads: 29.741 ms

@bingyanghuang

@GaoWei8 Could you please paste your profile log of the FP32 baseline in this issue? Let's try to align our baselines.


GaoWei8 commented Jan 2, 2020

@bingyanghuang
The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148
4-dimensional input (fp32_model, test_ds):
1 thread profile:

I0102 03:32:40.693799 12448 inference.cc:357] Run 5010 samples, average latency: 249.052 ms per sample.
I0102 03:32:40.693889 12448 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 249.05 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      1.10423e+06 0.016733    6.69891     2.97844     0.885518
thread0::elementwise_add    190380      43708.8     0.115389    0.955863    0.229587    0.0350516
thread0::matmul             125250      33400       0.025127    2.02142     0.266667    0.0267846
thread0::transpose2         240480      22726.5     0.069161    0.33388     0.0945046   0.0182251
thread0::layer_norm         125250      19296       0.138214    0.232502    0.15406     0.0154741
thread0::softmax            60120       14590.1     0.236914    0.48581     0.242683    0.0117003
thread0::scale              70140       4239.09     0.006753    0.116512    0.0604376   0.00339947
thread0::reshape2           240480      2274.57     0.0074      0.055957    0.00945847  0.00182406
thread0::lookup_table       15030       1274.28     0.076111    0.160232    0.0847821   0.00102188
thread0::stack              5010        646.453     0.121365    0.180305    0.129033    0.000518413
thread0::load               202         270.43      0.008871    169.244     1.33876     0.000216867
thread0::tanh               5010        146.669     0.026564    0.197547    0.0292752   0.000117618
thread0::slice              5010        85.3612     0.015276    0.045218    0.0170382   6.84541e-05
thread0::feed               20040       56.402      0.001333    0.011693    0.00281447  4.52307e-05
thread0::fetch              5010        43.0507     0.00684     0.043593    0.00859295  3.45238e-05

20 threads profile:

I0102 03:11:21.370021 12402 inference.cc:357] Run 5010 samples, average latency: 29.5513 ms per sample.
I0102 03:11:21.370100 12402 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 29.5477 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      108360      0.016532    18.4293     0.292281    0.73505
thread0::elementwise_add    190380      9398.93     0.027077    7.27942     0.0493693   0.0637567
thread0::matmul             125250      7594.48     0.021428    9.68409     0.0606346   0.0515164
thread0::transpose2         240480      7444.04     0.02215     7.14557     0.0309549   0.0504959
thread0::layer_norm         125250      4383.38     0.029262    6.79018     0.0349971   0.0297342
thread0::scale              70140       3930.57     0.00581     7.21213     0.0560389   0.0266626
thread0::softmax            60120       2166.03     0.029904    7.0274      0.0360284   0.014693
thread0::reshape2           240480      1896.58     0.006059    7.18181     0.00788662  0.0128652
thread0::lookup_table       15030       1283.8      0.070953    7.12537     0.0854156   0.0087085
thread0::stack              5010        376.056     0.063878    0.828854    0.075061    0.00255093
thread0::load               202         272.917     0.009124    171.089     1.35107     0.0018513
thread0::tanh               5010        136.909     0.024189    0.195245    0.0273272   0.00092871
thread0::slice              5010        81.6364     0.014465    0.043562    0.0162947   0.000553772
thread0::feed               20040       54.9063     0.00108     0.024591    0.00273984  0.000372451
thread0::fetch              5010        38.3997     0.00608     0.039843    0.00766461  0.00026048


bingyanghuang commented Jan 3, 2020

The latest FP32 results for the clean develop branch (c7b03d308c, Jan 2nd) on CLX 6248
4-dimensional input (fp32_model, test_ds):
1 thread profile:

I0103 04:47:21.063290 111775 inference.cc:354] Run 5010 samples, average latency: 186.189 ms per sample.
I0103 04:47:21.063355 111775 inference.cc:359] Run 5009 samples, average latency [exclude 1 warmup steps]: 186.18 ms per sample.

------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                       Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::fc                 370740      789200      0.012302    8.10634     2.12871     0.846926
thread0::elementwise_add    190380      67363.2     0.124292    11.9827     0.353835    0.0722904
thread0::matmul             125250      25282.5     0.021207    9.8986      0.201856    0.0271318
thread0::transpose2         240480      18951.5     0.061624    0.722954    0.078807    0.0203377
thread0::layer_norm         125250      14751.8     0.102778    0.470418    0.117778    0.0158308
thread0::softmax            60120       8870.11     0.138533    0.872575    0.14754     0.00951892
thread0::scale              70140       3878.18     0.004727    0.113928    0.055292    0.00416185
thread0::reshape2           240480      1665.39     0.005301    0.057911    0.00692528  0.00178721
thread0::lookup_table       15030       1000.76     0.054471    0.126544    0.0665843   0.00107396
thread0::stack              5010        470.501     0.083346    0.181878    0.0939125   0.000504916
thread0::load               202         151.379     0.007597    19.1899     0.7494      0.000162451
thread0::tanh               5010        116.446     0.020674    0.164315    0.0232427   0.000124964
thread0::slice              5010        63.1702     0.011075    0.037661    0.0126088   6.77908e-05
thread0::feed               20040       44.1006     0.000886    0.006938    0.00220063  4.73264e-05
thread0::fetch              5010        31.8417     0.00492     0.040041    0.00635563  3.41708e-05


wojtuss commented Jan 9, 2020

Here are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with the following affinity settings:

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

on CLX 6248.

With the current develop branch (ad0dfb1) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.73 ms,
FP32, 1 thread: 196.86 ms.

INT8 results with fc, reshape2 and transpose2 quantized:
INT8, 20 threads: 20.26 ms (31.8% faster than FP32),
INT8, 1 thread: 78.40 ms (60.2% faster than FP32).


wojtuss commented Jan 17, 2020

We have finally resolved the accuracy issue for Ernie INT8 run.

Background:
FC operators in the Ernie model have 2-dimensional weights and receive 3-dimensional inputs. Using the MKL-DNN inner product requires transforming the weights and inputs into proper 3-dimensional tensors. As long as the input is in NCHW format (for FP32), the transformation is possible and Ernie FP32 achieves good accuracy. However, when the input is in NHWC format (for INT8), transforming the input and the weights into proper tensors requires modifications of strides, which are not supported by the MKL-DNN inner product.

Solution:
The NHWC format of INT8 input data is beneficial for the conv2d INT8 kernel (because of the nature of conv2d's operations), but for FC the NCHW format remains optimal for INT8 as well. The solution is to add support for an NCHW INT8 output of the quantize operator and use it before FC INT8. This fixes the accuracy problem. Having applied the fix, we got 0.795582 accuracy for Ernie INT8 (on an SKX; on a CLX it will probably be even better; for the original QAT model, we got 0.79759 on a CLX). A PR with the fix and benchmark results will be prepared next week. A loose layout analogy is sketched below.
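
A loose numpy analogy (not the Paddle/MKL-DNN code; the actual transformation of FC inputs for the inner product differs in detail, and the shapes are assumed ERNIE-like) of why a strided layout forces an extra reorder: with the hidden dimension stored contiguously, flattening the 3-D activation into a dense GEMM input is a zero-copy view, whereas an NHWC-style INT8 layout leaves that view strided, so an explicit reorder (copy) must be inserted first.

import numpy as np

batch, seq, hidden = 1, 128, 768                       # assumed ERNIE-like shapes

# FP32-style ("NCHW"-like) layout: hidden is the contiguous axis, so the flattened
# GEMM input is just a view of the activation.
x = np.zeros((batch, seq, hidden), dtype=np.int8)
gemm_in = x.reshape(batch * seq, hidden)
assert gemm_in.flags["C_CONTIGUOUS"]                   # no copy required

# "NHWC"-style INT8 layout of the same logical tensor: hidden becomes a strided axis,
# so the flattened view is no longer dense and a reorder (copy) must be inserted.
x_nhwc = np.zeros((batch, hidden, seq), dtype=np.int8).transpose(0, 2, 1)
gemm_view = x_nhwc.reshape(batch * seq, hidden)
assert not gemm_view.flags["C_CONTIGUOUS"]             # strided, unusable as-is
gemm_reordered = np.ascontiguousarray(gemm_view)       # the extra reorder/copy step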


wojtuss commented Jan 30, 2020

With the fix for FC INT8 (PR PaddlePaddle/Paddle#22404, branch Sand3r-:mgallus/3d-fc-acc) we got the following results for Ernie on a CLX 6248 machine:

          Accuracy   Latency (ms), 1 thread   Latency (ms), 20 threads
FP32      0.802008   190.1                    31.2
INT8      0.799598   70.6                     17.3
QAT FP32  0.797590   -                        -
