
Add RemovePadding and RestorePadding for BERT model #13701

Merged
tianleiwu merged 2 commits into main from tlwu/bert_pad on Nov 22, 2022
Conversation

@tianleiwu
Contributor

Description

Add two operators, RemovePadding and RestorePadding, based on the idea of Effective Transformer (https://github.com/bytedance/effective_transformer), to improve large-batch inference for the BERT model.

Motivation and Context

"output tensor with shape (total_tokens, hidden_size)",
"T")
.Output(1,
"token_offset",
Member


token_offset

Why do you need token_offset? It is redundant with cumulated_seq_len.


const auto& dims = input->Shape().GetDims();
if (dims.size() != 3) {
return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "Input 'input' is expected to have 3 dimensions, got ",
Member


'input'

nit: 0

// total_token_count: 1 + 2 + 4 = 7
// max_token_count: 4
// cumulated_token_count: 0, 1, 1+2, 1+2+4
__global__ void getTokenOffset(int* token_count_buffer,
Member


__global__ void getTokenOffset(int* token_count_buffer,

It can be implemented with cub::BlockScan. The kernel can be launched with Grid: 1, Block: batch. For the kernel:

  1. it uses cub::BlockScan to compute cumulated_token_count first.
  2. then each thread fills its token_offset.

Member


For offsets with idx > token_size, we don't actually need to fill them, because the restore step won't use them.

Contributor Author


__global__ void getTokenOffset(int* token_count_buffer,

It can be implemented with cub::BlockScan. The kernel can be launched with Grid: 1, Block: batch. For the kernel:

  1. it uses cub::BlockScan to compute cumulated_token_count first.
  2. then each thread fills its token_offset.

Good suggestion. There is a related TODO in the comments:
// TODO(tianleiwu): Use cub::DevicePartition::Flagged like BuildGlobalIndex in longformer_global_impl.cu
// to build token_offset when sequence length is large.
I could do it in another pull request later.

Contributor Author


For offsets with idx > token_size, we don't actually need to fill them, because the restore step won't use them.

The purpose is to fill zeros for those padded tokens (to make the result deterministic). Otherwise, we would need to fill the whole output with zeros first, then use another kernel to restore the non-padding tokens.

Another purpose is to keep the shape as (batch_size, sequence_length). Otherwise, we would need to pass these two values to the RestorePadding operator.

@tianleiwu tianleiwu merged commit 8b0e0f4 into main Nov 22, 2022
@tianleiwu tianleiwu deleted the tlwu/bert_pad branch November 22, 2022 18:00
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
Add two operators, RemovePadding and RestorePadding, based on the idea of Effective Transformer (https://github.com/bytedance/effective_transformer), to improve large-batch inference for the BERT model.