
Feed forward chunking #6024

Merged: 14 commits into huggingface:master on Aug 11, 2020

Conversation

Pradhy729 (Contributor)

Official PR for #5928

@Pradhy729 (Contributor, Author)

@patrickvonplaten - here's an initial implementation. My first step was to get the model working with chunked feed forward - and it works! I still need to run the benchmark tests to measure the memory savings.

However, I see a problem. The new architecture changes some of the nn.Module weight and bias parameter names, which would break loading existing pretrained weights from checkpoints.
For example:
bert.encoder.layer.0.intermediate.dense.weight --> becomes bert.encoder.layer.0.feed_forward.dense.dense.weight

See the failing tests for more details. Any thoughts/ideas on how to get around this?
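To illustrate the renaming, here is a minimal toy sketch of the nesting (not the PR's actual modules; the class names and sizes are assumptions): wrapping an existing submodule in a new parent module adds a level to the state_dict keys, so they no longer match old checkpoints.

import torch.nn as nn

class Intermediate(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(8, 8)  # hypothetical size

class OldLayer(nn.Module):
    # old layout: parameters registered as "intermediate.dense.*"
    def __init__(self):
        super().__init__()
        self.intermediate = Intermediate()

class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = Intermediate()

class NewLayer(nn.Module):
    # new layout: the extra wrapper adds a level, so keys become "feed_forward.dense.dense.*"
    def __init__(self):
        super().__init__()
        self.feed_forward = FeedForward()

print(list(OldLayer().state_dict()))  # ['intermediate.dense.weight', 'intermediate.dense.bias']
print(list(NewLayer().state_dict()))  # ['feed_forward.dense.dense.weight', 'feed_forward.dense.dense.bias']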

@@ -159,6 +159,7 @@ def __init__(self, **kwargs):
self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
self.bad_words_ids = kwargs.pop("bad_words_ids", None)
self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
self.chunk_size_feed_forward = kwargs.pop("chunk_size_feed_forward", 1)
Contributor (review comment)

This should be disabled by default -> so would be nice to set it to 0
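Presumably that just means flipping the default in the diff above, along these lines (0 meaning "do not chunk the feed forward"):

self.chunk_size_feed_forward = kwargs.pop("chunk_size_feed_forward", 0)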

Contributor (review comment)

And to make sure we don't break backward compatibility...


In modeling_bert.py, refer to #6972. Thanks.

@@ -408,8 +432,7 @@ def forward(
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:] # add cross attentions if we output attention weights

intermediate_output = self.intermediate(attention_output)
Contributor (review comment)

To solve the problem, my suggestion would be to wrap these two calls in a function forward_chunk that is part of this class (def forward_chunk(self, ...)) and call apply_chunking_to_forward(self.chunk_size_feed_forward, self.seq_len_dim, self.forward_chunk, attention_output) here.
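Roughly, the suggested shape of the change in BertLayer would be the following sketch (assuming the existing self.intermediate/self.output submodules; the merged code quoted further down uses the name feed_forward_chunk):

# in BertLayer.forward, replacing the direct intermediate/output calls
layer_output = apply_chunking_to_forward(
    self.chunk_size_feed_forward, self.seq_len_dim, self.feed_forward_chunk, attention_output
)

# new method on BertLayer: the two wrapped calls, applied to one chunk at a time
def feed_forward_chunk(self, attention_output):
    intermediate_output = self.intermediate(attention_output)
    layer_output = self.output(intermediate_output, attention_output)
    return layer_output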

Pradhy729 (Contributor, Author)

Don't think I quite follow what you mean here. Which two calls do you want to wrap?
Did you mean to have a forward_chunk function in the BertLayer class?

Pradhy729 (Contributor, Author)

Ok I fixed it based on your input - looks ok to me now.

Contributor (review comment)

Great, that's exactly what I meant :-)

Pradhy729 and others added 5 commits August 6, 2020 21:24
fix the shuffle agrument usage and the default (huggingface#6307)
This is an initial implementation to test applying feed forward chunking for BERT.
Will need additional modifications based on output and benchmark results.
codecov bot commented Aug 7, 2020

Codecov Report

Merging #6024 into master will decrease coverage by 0.31%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6024      +/-   ##
==========================================
- Coverage   79.44%   79.12%   -0.32%     
==========================================
  Files         148      148              
  Lines       27193    27198       +5     
==========================================
- Hits        21604    21521      -83     
- Misses       5589     5677      +88     
Impacted Files Coverage Δ
src/transformers/configuration_reformer.py 100.00% <ø> (ø)
src/transformers/configuration_utils.py 96.57% <100.00%> (+0.02%) ⬆️
src/transformers/modeling_bert.py 88.49% <100.00%> (+0.09%) ⬆️
src/transformers/modeling_tf_bert.py 69.31% <0.00%> (-26.18%) ⬇️
src/transformers/tokenization_roberta.py 76.71% <0.00%> (-21.92%) ⬇️
src/transformers/tokenization_utils_base.py 86.43% <0.00%> (-7.42%) ⬇️
src/transformers/tokenization_transfo_xl.py 38.73% <0.00%> (-3.76%) ⬇️
src/transformers/tokenization_utils_fast.py 92.14% <0.00%> (-2.15%) ⬇️
src/transformers/tokenization_openai.py 82.57% <0.00%> (-1.52%) ⬇️
src/transformers/tokenization_bert.py 91.07% <0.00%> (-0.45%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 175cd45...406d621.

patrickvonplaten changed the title from "[WIP] Feed forward chunking" to "Feed forward chunking" on Aug 8, 2020

@patrickvonplaten (Contributor)

Hey @Pradhy729, thanks a lot for continuing the PR. I made a couple of changes: fixed the docs and added tests for all models, though only the Reformer and Bert tests are enabled for now.

It would be great if @LysandreJik @sgugger @thomwolf @sshleifer could review.

This PR shows how feed_forward_chunking can be employed for all models. Feed forward chunking is explained, together with some benchmarking, here: https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers. It can give good memory improvements for certain model architectures.
For Bert, a test is added showing that the model gives identical results with and without chunking. The feature can easily be added to other models the same way it was done for BERT. There is no real drawback to implementing this, IMO.
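For intuition, here is a minimal toy sketch of the idea (my own illustration, not the helper added in this PR): because the feed forward acts position-wise, it can be applied to slices along the sequence dimension and the results concatenated, trading one large intermediate activation for several smaller ones while producing identical output.

import torch
import torch.nn as nn

def chunked_feed_forward(forward_fn, hidden_states, chunk_size, chunk_dim=1):
    # apply forward_fn to chunk_size-sized slices along chunk_dim and concatenate
    if chunk_size == 0:
        return forward_fn(hidden_states)
    chunks = hidden_states.split(chunk_size, dim=chunk_dim)
    return torch.cat([forward_fn(c) for c in chunks], dim=chunk_dim)

hidden, intermediate = 16, 64  # toy sizes
ff = nn.Sequential(nn.Linear(hidden, intermediate), nn.GELU(), nn.Linear(intermediate, hidden))
x = torch.randn(2, 8, hidden)  # (batch, seq_len, hidden)

full = ff(x)                                          # materializes a (2, 8, 64) intermediate
chunked = chunked_feed_forward(ff, x, chunk_size=2)   # intermediates are (2, 2, 64) at a time
assert torch.allclose(full, chunked, atol=1e-6)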

To-Do after review is positive:

  1. Add feed forward chunking to more models. @Pradhy729, feel free to add it to as many models as you want. The rest can also be added in a new PR or we open a "first good issue" for it.
  2. Add feed forward chunking for the language modeling loss. Chunking the feed forward layers in the attention block is often not that helpful for saving memory - it only matters when the model has very few attention heads. A more common bottleneck is the final word embedding layer: during training, the loss does not have to be computed in one huge batch (over the time dimension) but can be chunked the same way the feed forward layers are chunked here (see the sketch below). This is not really implemented even in Reformer yet and would definitely require a new PR.
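To make item 2 concrete, here is a toy sketch (my own illustration, not code from this PR) of chunking the LM loss over the time dimension so the full (batch, seq_len, vocab) logits tensor is never materialized at once:

import torch
import torch.nn as nn
import torch.nn.functional as F

def chunked_lm_loss(hidden_states, lm_head, labels, chunk_size):
    # project to the vocabulary and compute cross-entropy one time-chunk at a time
    losses = []
    for h, y in zip(hidden_states.split(chunk_size, dim=1), labels.split(chunk_size, dim=1)):
        logits = lm_head(h)  # (batch, chunk, vocab)
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"))
    return torch.stack(losses).sum() / labels.numel()

hidden, vocab = 16, 1000  # toy sizes
lm_head = nn.Linear(hidden, vocab)
hidden_states = torch.randn(2, 8, hidden)
labels = torch.randint(0, vocab, (2, 8))

full = F.cross_entropy(lm_head(hidden_states).reshape(-1, vocab), labels.reshape(-1))
assert torch.allclose(full, chunked_lm_loss(hidden_states, lm_head, labels, chunk_size=2), atol=1e-5)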

@Pradhy729 (Contributor, Author)

Great! Thanks @patrickvonplaten
Will wait for reviewers and start working on the others.

sgugger (Collaborator) left a comment

This is very useful work, thanks for tackling this!

LysandreJik (Member) left a comment

Very nice, love how it only requires a few lines of code now that apply_chunking_to_forward is created.

Comment on lines +421 to +423
layer_output = apply_chunking_to_forward(
self.chunk_size_feed_forward, self.seq_len_dim, self.feed_forward_chunk, attention_output
)
Member (review comment)

Very much a nitpick here, probably for future PRs, but this looks a lot like PyTorch's gradient checkpointing utility. That method takes the callable (the forward method) as its first positional argument, and I think it makes sense to have it this way here too.
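For reference, torch.utils.checkpoint.checkpoint puts the callable first; the suggestion would presumably turn the call quoted above into something like this (a hypothetical reordering, not what this PR merges):

layer_output = apply_chunking_to_forward(
    self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)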

Pradhy729 (Contributor, Author)

I can do this globally in the new PR where I add the chunking for other models. Let me know if you have concerns with that.

LysandreJik merged commit b25cec1 into huggingface:master on Aug 11, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020