Create a quantized EmbedLayerNorm for ORT. #8124
Conversation
mrry
left a comment
Comments on the op and kernel implementation only. Note that most of these are nits and/or optional.
The only important thing to fix is the lack of (runtime) shape validation on various tensor inputs, which could cause reads past the end of a buffer. (I guess they aren't writes past the end of a buffer, but still...)
        word_embedding_zero_point) +
    Dequantize<uint8_t>(input_position_embedding[i],
                        position_embedding_scale,
                        position_embedding_zero_point);
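For reference, the per-element `Dequantize` call quoted above follows the standard affine dequantization formula. A minimal sketch, assuming a simple scalar helper (the signature here is illustrative, not ORT's actual API):

```cpp
#include <cstdint>

// Illustrative sketch of affine dequantization (not ORT's actual helper):
// real_value = (quantized_value - zero_point) * scale
template <typename QType>
float Dequantize(QType value, float scale, QType zero_point) {
  return static_cast<float>(static_cast<int32_t>(value) -
                            static_cast<int32_t>(zero_point)) *
         scale;
}
```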
This is slow. To improve performance, you can use a table lookup, an approach we use for other operators.
Since word_embedding_scale and word_embedding_zero_point are constant, you can build a lookup table whose key is input_word_embedding[i] and whose value is (input_word_embedding[i] - word_embedding_zero_point) * word_embedding_scale.
Even when word_embedding_scale and word_embedding_zero_point are non-constant, you can still compute the table dynamically ahead of the main loop.
Furthermore, you can make the value (input_word_embedding[i] - word_embedding_zero_point) * word_embedding_scale + (input_segment_embedding[i] - segment_embedding_zero_point) * segment_embedding_scale.
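The suggestion above can be sketched as follows. Since a uint8_t input can only take 256 values, the dequantized result for each value can be precomputed once per (scale, zero_point) pair; the names here (`BuildDequantizeTable`) are illustrative, not an ORT API:

```cpp
#include <array>
#include <cstdint>

// Sketch of the table-lookup idea: a uint8_t has only 256 possible values,
// so precompute the dequantized result for each one. Illustrative only.
std::array<float, 256> BuildDequantizeTable(float scale, uint8_t zero_point) {
  std::array<float, 256> table{};
  for (int q = 0; q < 256; ++q) {
    table[q] = static_cast<float>(q - static_cast<int>(zero_point)) * scale;
  }
  return table;
}
```

The hot loop then replaces a subtract and a multiply per element with a single indexed load, e.g. `output[i] = word_table[input_word_embedding[i]] + position_table[input_position_embedding[i]];`.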
This is great! I didn't know these things existed. Mind if I do this as a fast-follow? This PR is getting rather large.
I was thinking that a follow-up PR using these techniques could serve as a canonical, well-documented example to help future contributors write these optimizations for quantized kernels.
    T cur_beta = Dequantize<uint8_t>(beta_data[i],
                                     layer_norm_bias_scale,
                                     layer_norm_bias_zero_point);
    output[i] = output[i] / e * cur_gamma + cur_beta;
Does quantizing gamma_data and beta_data pay off? They are quite small.
The models we see have a large layer size (768). I'd like to stick with them for now and use Prepack() to help with this in the future.
nkreeger
left a comment
Some updates here - mostly around the shape inference function and some other bits.
Still working through how to handle dual uint8_t and int8_t hybrid approaches.
nkreeger
left a comment
@mrry @yufenglee PTAL - updated and should address all comments for now.
Reduces memory overhead for a quantized graph containing an EmbedLayerNormalization fused Op. All weights and initializers are converted to uint8_t. The runtime simply dequantizes on the fly during batched/threaded execution. This reduces memory consumption on transformer models with large word embeddings, since the entire embedding is no longer expanded to float32 on every invoke. Additionally, the uint8_t operations are much faster (up to ~3x on my machine with a large word embedding and a hidden size of 768).

NOTES:
- … uint8, but this works well for now - potential for a future pass.
- … qembed_layer_norm.h/.cc and embed_layer_norm.h/.cc, but was running into too many issues with complicated template declarations (see all the commits on kreeger/qembed_layer_norm). Ideally, the guts of QEmbedLayerNorm are eventually fully quantized and the logic diverges more.