Summarization Fine Tuning #4406

Closed
kevinlu1248 opened this issue May 17, 2020 · 79 comments
Labels
Discussion, wontfix

Comments

@kevinlu1248

kevinlu1248 commented May 17, 2020

❓ Questions & Help

Details

I tried using T5 and BART, but abstractive summarization on scientific texts does not seem to give the results I want, since I think both models were trained on news corpora. I have scraped all of the free PMC articles and am thinking about fine-tuning a seq2seq model on the articles and their abstracts to build an abstractive summarizer for scientific texts. This Medium article (https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8) gives a bit of an introduction to the approach but does not go into much detail, so I am wondering how to proceed.

I'm not really asking for help because I'm stuck; I just don't know how to approach this problem.

A link to original question on Stack Overflow:
https://stackoverflow.com/questions/61826443/train-custom-seq2seq-transformers-model

@patil-suraj
Contributor

patil-suraj commented May 18, 2020

The first thing you can try is to fine-tune T5/BART for summarization on your corpus and see how it performs.
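
A minimal sketch of a single fine-tuning step (the model name, lengths, and learning rate are placeholders, and argument names may differ slightly across transformers versions):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

article = "Full text of one PMC article ..."   # placeholder source document
abstract = "Its abstract ..."                  # placeholder summarization target

# Tokenize source and target; the tokenizer appends the eos token itself.
batch = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")
labels = tokenizer([abstract], max_length=256, truncation=True, return_tensors="pt")

# One optimization step; in practice wrap this in a DataLoader loop over your dataset.
loss = model(input_ids=batch["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=labels["input_ids"])[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()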

@kevinlu1248
Author

@patil-suraj where can I find a guide to this? I'm a bit confused by the documentation.

@patil-suraj
Contributor

Here's the official example, which fine-tunes BART on CNN/DM; you can just replace the CNN/DM dataset with your own summarization dataset.

@kevinlu1248
Author

@patil-suraj Thanks for the example. I'm wondering if there is a simpler way to get started, since I'm planning to train in a Kaggle notebook due to GPU constraints; otherwise I may need to copy-paste the entire examples folder into the notebook.

@sshleifer sshleifer changed the title Fine Tuning Summarization Fine Tuning May 20, 2020
@sshleifer sshleifer added the Discussion Discussion on a topic (keep it focused or open a new issue though) label May 20, 2020
@patil-suraj
Contributor

@kevinlu1248
This Colab shows how to fine-tune T5 with Lightning. It is just a self-contained version of the official example. You should be able to use the same Trainer; just replace the model with BART and use your own dataset.
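
Roughly, the Lightning wrapper boils down to something like this (a bare-bones sketch, not the notebook's exact code; it assumes each batch already contains tokenized input_ids, attention_mask and labels tensors, and a transformers version where passing labels makes the model return the loss):

import pytorch_lightning as pl
import torch
from transformers import T5ForConditionalGeneration

class SummarizationModule(pl.LightningModule):
    def __init__(self, model_name="t5-base", lr=3e-4):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # The model returns the cross-entropy loss as its first output when labels are passed.
        outputs = self.model(input_ids=batch["input_ids"],
                             attention_mask=batch["attention_mask"],
                             labels=batch["labels"])
        return outputs[0]

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)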

@kevinlu1248
Author

@patil-suraj Thanks, I'll look into it.

@sam-writer
Contributor

Here's the official example, which fine-tunes BART on CNN/DM; you can just replace the CNN/DM dataset with your own summarization dataset.

Hi @patil-suraj, I am following that example and have my data in that format, and I can see the process using GPU/CPU, but I can't get tensorboard working. Do you have any hints? I am happy to contribute to documentation once I get it working.

@patil-suraj
Contributor

patil-suraj commented May 23, 2020

@sam-qordoba Lightning handles logging itself, and by default the TensorBoard logs are saved in the lightning_logs directory. So you should be able to see the logs by passing lightning_logs as the logdir to the tensorboard command (i.e. tensorboard --logdir lightning_logs).

@sam-writer
Contributor

Thanks @patil-suraj

@sam-writer
Contributor

sam-writer commented May 30, 2020

Hey @patil-suraj, I had OOM issues on Colab, so I moved to a VM with 56 GB RAM, and the behaviour is the same as on Colab: memory usage grows until it uses up everything available (I even added 32 GB of swap, so it's a really impressive amount of memory usage) and I get locked out of the machine. The only time it writes to lightning_logs is right when it starts.

jupyter@pytorch-20200529-155153:~/lightning_logs$ tree
.
└── version_0
    ├── events.out.tfevents.1590794134.pytorch-20200529-155753.8733.0
    └── hparams.yaml

1 directory, 2 files

nvidia-smi looks like this:

jupyter@pytorch-20200529-155753:~$ nvidia-smi 
Sat May 30 00:07:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0    35W /  70W |   2579MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8733      C   /opt/conda/bin/python                       2569MiB |
+-----------------------------------------------------------------------------+

The cell trainer.fit(model) outputs the model definition, but no progress bar or anything else:


    | Name                                                                  | Type                       | Params
-----------------------------------------------------------------------------------------------------------------
0   | model                                                                 | T5ForConditionalGeneration | 222 M 
1   | model.shared                                                          | Embedding                  | 24 M  
2   | model.encoder                                                         | T5Stack                    | 109 M 
...
514 | model.decoder.block.11.layer.2.dropout                                | Dropout                    | 0     
515 | model.decoder.final_layer_norm                                        | T5LayerNorm                | 768   
516 | model.decoder.dropout                                                 | Dropout                    | 0     
517 | model.lm_head                                                         | Linear                     | 24 M  
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

Sorry to keep bothering you, but do you have any hints? It's hard to know what's going on because it doesn't seem to log anything.

@patil-suraj
Contributor

It shouldn't take that much memory. Did you try reducing the batch size?

Also, it seems that you are using fp16 here. I haven't tried it with fp16 yet.

tagging @sshleifer

@sam-writer
Contributor

OK, I tried fp16 as a "maybe this will use less memory" experiment; I will try without it. I tried a batch size of 4 and could go lower, I guess. Should I just double the learning rate each time I halve the batch size, or are other changes needed?

@alexgaskell10

Could somebody who has fine-tuned BART give me an estimate of how long it takes / how many epochs until convergence? Also any tricks to speed it up (weight freezing etc)?

One epoch takes c. 150 hours for my dataset, so I'm wondering how many I need...

@sshleifer
Contributor

sshleifer commented Jun 5, 2020

Sounds like you have a huge dataset?
It's tough to know exactly how many you will need, but for XSum and CNN most of the models I've trained have required 4-6 epochs to converge.
The original authors (https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md#4-fine-tuning-on-cnn-dm-summarization-task) say 15-20K steps.

I have had to go down to batch size=1 or 2 on some occasions.
You can use --gradient_accumulation_steps to keep the "effective" batch size (how many examples the model sees per optimizer update) consistent.
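
The pattern behind that flag, as a rough sketch (assuming model, optimizer and dataloader are already set up):

accumulation_steps = 8          # with a per-step batch size of 1, the effective batch size is 8
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch)[0] / accumulation_steps   # scale so the accumulated gradient is an average
    loss.backward()                                  # gradients add up across the accumulation window
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()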

@sam-qordoba is your Dataset/DataLoader putting all the examples in memory before training? That could be an issue on a large dataset.

@sshleifer
Contributor

sshleifer commented Jun 5, 2020

You can also freeze the BartForConditionalGeneration.model.encoder using the function below to reduce memory cost.

import torch.nn as nn

def freeze_part(model: nn.Module):
    # Turn off gradients for every parameter of the given sub-module
    for par in model.parameters():
        par.requires_grad = False
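
For example, assuming model is a BartForConditionalGeneration instance:

freeze_part(model.model.encoder)  # the encoder's parameters will no longer receive gradients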

You can also use val_check_interval in Lightning to check validation statistics more frequently, but unfortunately your checkpoints will still only be saved at the end of every epoch.

@alexgaskell10

@sshleifer thanks for coming back with this, all very helpful.

Yes, essentially I am just trying out BART for longer docs (arXiv/PubMed) as a baseline to compare more sophisticated methods against. This means the training set has 300k samples and only one sample fits on the GPU at once (12 GB, using a 1,024 input length).

Lots for me to play around with and see what works well. Thanks for your help.

@patil-suraj
Contributor

patil-suraj commented Jun 5, 2020

Yes, essentially I am just trying out BART for longer docs (arXiv/PubMed) as a baseline to compare more sophisticated methods against

@alexgaskell10 If you are interested in using BART for long documents then keep an eye here.
https://github.com/patil-suraj/longbart

I'm trying to convert BART to its long version using Longformer's sliding-window attention.

I've been able to replace the BART encoder's SelfAttention with LongformerSelfAttention with a 4096 max length. Now I'm working on adding gradient checkpointing to allow it to train on smaller GPUs. Hope to finish it soon.

Gradient checkpointing and fp16 with the 'O2' opt level should allow a larger batch size.
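
As a rough sketch of both ideas (not the longbart code; the apex import is commented out since it is only needed for the fp16 part):

import torch
from torch.utils.checkpoint import checkpoint
# from apex import amp   # NVIDIA apex, only needed for the fp16 'O2' part

def encode_with_checkpointing(encoder_layers, hidden_states):
    # Recompute each layer's activations during the backward pass instead of storing them,
    # trading extra compute for a much smaller activation-memory footprint.
    for layer in encoder_layers:
        hidden_states = checkpoint(layer, hidden_states)
    return hidden_states

# fp16 with the 'O2' opt level (assuming model and optimizer already exist):
# model, optimizer = amp.initialize(model, optimizer, opt_level="O2")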

@alexgaskell10

@patil-suraj thanks for this. Adapting BART for LongformerSelfAttention was actually something I was going to start looking into over the next couple of weeks. Thanks for sharing; I'll be sure to give it a go soon.

@virattt

virattt commented Jun 12, 2020

Hey @patil-suraj, any updates on your latest progress on LongBART? Thinking about diving into a similar block of work: expanding BART via Longformer

@patil-suraj
Contributor

Hi @virattt, I've been able to replace the BART encoder's self-attention with sliding-window attention. I also added gradient checkpointing in the encoder.

Gradient checkpointing in the decoder is not working, so I'm going to remove it for now. I will update the repo this weekend and put some instructions in the readme.

@virattt

virattt commented Jun 12, 2020

Sounds great, thanks @patil-suraj

@sshleifer
Contributor

Would love to hear LongBart experimental results whenever they are available!

@alexgaskell10

@sshleifer I have been playing around with LongBart recently and have some preliminary experimental results. This is using @patil-suraj 's longbart repo fine-tuned on the PubMed dataset using the hf summarization finetune.py script.

The best result so far is ROUGE-1 = 36.8 (for comparison, fine-tuning vanilla BART on PubMed and truncating articles at 1024 tokens I got 42.3 ROUGE-1). I have only run a few configs so far and will be running many more so I expect this to improve. Next steps:

  • I have only been using a 12 GB GPU so far, so I have frozen the embeddings and encoder (otherwise the model is too large). I have a much larger cluster I can move to, so I will start running trials on it soon, which will give more freedom to try different configs.
  • I am only fine-tuning at the moment. Might explore doing some pre-training although this may be too expensive.

Let me know if there is anything you would like to see and I'll try to schedule it in.

@patil-suraj
Contributor

Hi @alexgaskell10, did you use the code as is? I think we'll need to train the embeddings for a few epochs before we can freeze them.
However, without freezing the embeddings I ran into OOM halfway through the epoch, even with bart-base and 'O2' fp16 on a 16 GB V100.

@sshleifer do you have any idea why this might be happening? It went well until 60% of the first epoch, then OOM. Batch size was 1 and max_seq_len 4096.

@alexgaskell10 can you share more details: how many epochs, what batch size, fp16 or not?

@alexgaskell10

alexgaskell10 commented Jul 1, 2020

Yes, I used the code as is (with minor changes to integrate with the hf finetune.py script). I agree that the embeddings and encoder should not be frozen from the beginning, but I couldn't fit the model on my 12 GB GPU otherwise. Once I get set up on the cluster I'll try this.

More details on all my runs so far can be found in my wandb project. To answer your question: at most a couple of epochs so far, batch size between 4 and 16 depending on what fits, and no fp16 yet (haven't set it up but will do soon).

@patil-suraj
Contributor

Thanks @alexgaskell10, I think you'll be able to use bart-base with fp16 and a max 2048 seq len without freezing the embeddings on a 12 GB GPU.

@HHousen
Contributor

HHousen commented Jul 14, 2020

I ran the benchmark scripts for each version: python examples/benchmarking/run_benchmark.py --models allenai/longformer-base-4096 --training.

Latest master branch:

2020-07-14 18:21:34.487221: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 725/725 [00:00<00:00, 583kB/s]
1 / 1

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.031     
 allenai/longformer-base-4096        8               32            1.015     
 allenai/longformer-base-4096        8              128            1.037     
 allenai/longformer-base-4096        8              512            1.028     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2.0      
 allenai/longformer-base-4096        8               32            1.999     
 allenai/longformer-base-4096        8              128            2.103     
 allenai/longformer-base-4096        8              512            2.366     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             10001     
 allenai/longformer-base-4096        8               32            10155     
 allenai/longformer-base-4096        8              128            10207     
 allenai/longformer-base-4096        8              512            12559     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
- transformers_version: 3.0.2
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:30:48.341403
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

Version 2.11.0 (git checkout tags/v2.11.0):

2020-07-14 18:31:04.379166: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
1 / 1
======= INFERENCE - SPEED - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 0.356s
		allenai/longformer-base-4096/8/32: 0.359s
		allenai/longformer-base-4096/8/128: 0.364s
		allenai/longformer-base-4096/8/512: 0.367s
======= INFERENCE - MEMORY - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 8178 MB
		allenai/longformer-base-4096/8/32: 8170 MB
		allenai/longformer-base-4096/8/128: 8162 MB
		allenai/longformer-base-4096/8/512: 8162 MB
======= TRAIN - SPEED - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 0.357s
		allenai/longformer-base-4096/8/32: 0.359s
		allenai/longformer-base-4096/8/128: 0.363s
		allenai/longformer-base-4096/8/512: 0.366s
======= TRAIN - MEMORY - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 9320 MB
		allenai/longformer-base-4096/8/32: 9416 MB
		allenai/longformer-base-4096/8/128: 9514 MB
		allenai/longformer-base-4096/8/512: 11866 MB

======== ENVIRONMENT - INFORMATION ========
- transformers_version: 2.11.0
- framework: PyTorch
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:34:31.155709
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0

@HHousen
Contributor

HHousen commented Jul 14, 2020

I also tested the differences before and after d697b6c ([Longformer] Major Refactor (#5219)).

Training time changes:

Before --> After
1.323  --> 1.995
1.353  --> 2.016
1.416  --> 2.094
1.686  --> 2.378

Before d697b6c (at commit e0d58dd):

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             0.332     
 allenai/longformer-base-4096        8               32            0.342     
 allenai/longformer-base-4096        8              128             0.35     
 allenai/longformer-base-4096        8              512            0.357     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.323     
 allenai/longformer-base-4096        8               32            1.353     
 allenai/longformer-base-4096        8              128            1.416     
 allenai/longformer-base-4096        8              512            1.686     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              9617     
 allenai/longformer-base-4096        8               32             9771     
 allenai/longformer-base-4096        8              128             9823     
 allenai/longformer-base-4096        8              512            12175     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
- transformers_version: 3.0.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:59:34.297304
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

After d697b6c (at commit d697b6c):

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.028     
 allenai/longformer-base-4096        8               32             1.01     
 allenai/longformer-base-4096        8              128            1.013     
 allenai/longformer-base-4096        8              512            1.061     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.995     
 allenai/longformer-base-4096        8               32            2.016     
 allenai/longformer-base-4096        8              128            2.094     
 allenai/longformer-base-4096        8              512            2.378     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              9617     
 allenai/longformer-base-4096        8               32             9771     
 allenai/longformer-base-4096        8              128             9823     
 allenai/longformer-base-4096        8              512            12175     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
- transformers_version: 3.0.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 19:10:37.139177
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

@HHousen
Contributor

HHousen commented Jul 14, 2020

@patrickvonplaten @alexgaskell10 @ibeltagy The training time increased from tags/v2.11.0 (0.361s) to right before d697b6c (at commit e0d58dd) (1.445s) by 1.084s.

The training time increased from right before d697b6c (at commit e0d58dd) (1.445s) to directly after d697b6c (at commit d697b6c) (2.121s) by 0.676s.

I ran the benchmarks twice and got similar results both times.

@ibeltagy
Contributor

Nice finding, thanks @HHousen.

@patrickvonplaten, we can check the refactoring more carefully to find the reason for the second slowdown. Any thoughts on what could be the reason for the first one? It is a span of 270 commits!

@patrickvonplaten
Contributor

Thanks a lot for running the benchmark @HHousen !

Very interesting indeed! I will take a look tomorrow.
The benchmarking tools were changed quite significantly from 2.11 to 3.0.1, so I will run both Longformer versions (2.11 and master) with the same benchmarking tools tomorrow to make sure that the performance degradation is really due to changes in Longformer.

@HHousen
Contributor

HHousen commented Jul 15, 2020

@patrickvonplaten You're correct about the first training time increase. I tracked down the time change to commit fa0be6d. At 18a0150 (right before fa0be6d) the training time is about 0.35s. But at fa0be6d it's about 1.4s. So the first time increase can be safely ignored because it was caused by a change in the benchmark scripts.

The second time increase, caused by d697b6c, seems to be the main issue.

@patrickvonplaten
Contributor

patrickvonplaten commented Jul 15, 2020

@HHousen @ibeltagy,

I just ran the same benchmarking scripts on different versions and I can confirm that there is quite a drastic slow-down at master.
Here is the branch: https://github.com/huggingface/transformers/tree/benchmark_for_2_11 in case it's useful for you.

My results for master:

====================       INFERENCE - SPEED - RESULT       ====================                         
--------------------------------------------------------------------------------                         
          Model Name             Batch Size     Seq Length     Time in s                                                                                                                                           
--------------------------------------------------------------------------------                         
 allenai/longformer-base-4096        8               8             0.644                                 
 allenai/longformer-base-4096        8               32             0.64                                 
 allenai/longformer-base-4096        8              128             0.64                                 
 allenai/longformer-base-4096        8              512            0.637        
--------------------------------------------------------------------------------                         
Saving results to csv.                                                                                   
                                                                                                         
====================      INFERENCE - MEMORY - RESULT       ====================                         
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB                               
--------------------------------------------------------------------------------                         
 allenai/longformer-base-4096        8               8              2023                                 
 allenai/longformer-base-4096        8               32             2023                                 
 allenai/longformer-base-4096        8              128             2023                                 
 allenai/longformer-base-4096        8              512             2023                                 
--------------------------------------------------------------------------------                         
Saving results to csv.                                                                                   
                                                                                                         
====================        ENVIRONMENT INFORMATION         ====================                                                                                                                                   
- transformers_version: 3.0.2                                                                            
- framework: PyTorch                                                                                     
- use_torchscript: False                                                                                 
- framework_version: 1.5.0                                                                               
- python_version: 3.7.7                                                                                  
- system: Linux                                                                                                                                                                                                    
- cpu: x86_64                                                                                                                                                                                                      
- architecture: 64bit                                                                                    
- date: 2020-07-15                                  
- time: 18:30:00.426834                             
- fp16: False                                       
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32089                                 
- use_gpu: True                                     
- num_gpus: 1                                       
- gpu: TITAN RTX                                    
- gpu_ram_mb: 24217                                 
- gpu_power_watts: 280.0                            
- gpu_performance_state: 0
- use_tpu: False

results for 2.11.0:

====================       INFERENCE - SPEED - RESULT       ====================                                                                                                                                   
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             0.144       
 allenai/longformer-base-4096        8               32            0.144     
 allenai/longformer-base-4096        8              128            0.144     
 allenai/longformer-base-4096        8              512            0.145     
--------------------------------------------------------------------------------                                                                                                                                   
Saving results to csv.                                                                                   
                                                    
====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2023                                 
 allenai/longformer-base-4096        8               32             2023          
 allenai/longformer-base-4096        8              128             2023                                 
 allenai/longformer-base-4096        8              512             2023                                 
--------------------------------------------------------------------------------
Saving results to csv.                                                                                   
                                                                                                         
====================        ENVIRONMENT INFORMATION         ====================   
- transformers_version: 2.11.0                                                                           
- framework: PyTorch                                
- use_torchscript: False                                                                                 
- framework_version: 1.5.0                                                                               
- python_version: 3.7.7                                                                                  
- system: Linux                                                                                          
- cpu: x86_64                                                                                            
- architecture: 64bit                                                                                    
- date: 2020-07-15                                                                                       
- time: 18:39:00.315564                                                                                  
- fp16: False                                                                                            
- use_multiprocessing: True                                                                              
- only_pretrain_model: False                                                                             
- cpu_ram_mb: 32089                                                                                      
- use_gpu: True                                                                                          
- num_gpus: 1                                                                                            
- gpu: TITAN RTX                                                                                         
- gpu_ram_mb: 24217                                                                                      
- gpu_power_watts: 280.0                                                                                 
- gpu_performance_state: 0                                                                               
- use_tpu: False   

It was probably caused by me when I did the major Longformer refactoring; I will investigate more tomorrow!
Thanks a lot for pointing this out @HHousen, this is super useful.

I guess we should have tests that automatically check whether a PR causes a significant slowdown (also @sshleifer, @thomwolf, @mfuntowicz).

@patrickvonplaten
Contributor

OK, fixed it. @ibeltagy @HHousen, it would be great if you could try again on your end with the current version of master to make sure the inference speed is back to normal.

@HHousen
Contributor

HHousen commented Jul 16, 2020

@patrickvonplaten I ran the benchmark on master and the speeds do look to be normal again.

The training speeds are 1.328s, 1.378s, 1.457s, and 1.776s for sequences of length 8, 32, 128, 512 respectively, which is similar to the speeds before the major refactor at d697b6c.

Inference speeds are 0.326s, 0.343s, 0.348s, and 0.367s, which appear to be back to normal.

@HHousen
Contributor

HHousen commented Jul 16, 2020

@ibeltagy Do you plan to merge ibeltagy/transformers@longformer_encoder_decoder into huggingface/transformers@master soon to add gradient checkpointing to BART, or are you waiting for the final LongformerEncoderDecoder implementation to be completed?

@ibeltagy
Contributor

ibeltagy commented Jul 16, 2020

@HHousen, I had to disable certain features of the model here to implement gradient checkpointing, so merging it will require more work.
@LysandreJik started working on gradient checkpointing in this PR #5415 and he might have better ideas.

@WangHexie

@kevinlu1248
This Colab shows how to fine-tune T5 with Lightning. It is just a self-contained version of the official example. You should be able to use the same Trainer; just replace the model with BART and use your own dataset.

I modified this example to adapt it to BART, and used only "Positive </s>" as the target, but after training for an epoch the model outputs all 0s, tensor([[2, 0, 0, 2]]), decoded as ''.
What could be the reason the model fails on such a simple task?
I'm using a learning rate of 2e-5.

@patil-suraj
Contributor

@WangHexie, not sure. One suggestion: with BART you won't need to manually add </s> at the end, as the BART tokenizer automatically adds the eos token at the end of the text.

@WangHexie

@patil-suraj Thanks for the tip. These models' behaviour is quite different; the problem was solved by shifting the decoder input to the right manually.
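
The shift can be done with a helper along these lines (a sketch modeled on the shift_tokens_right helper in the transformers BART code at the time; it rotates the eos token to position 0 and shifts everything else one place to the right):

import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Build decoder_input_ids from the labels: move the final non-pad token (eos)
    # to the front and shift the rest of the sequence one position to the right.
    shifted = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    shifted[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    shifted[:, 1:] = input_ids[:, :-1]
    return shifted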

@ibeltagy
Contributor

ibeltagy commented Aug 6, 2020

@alexgaskell10, @HHousen, the query.reshape() solution is wrong. The code runs, but it is not doing the right thing; it should be query.transpose(0, 1). I just pushed a fix. This bug will affect all your results if you are using a batch size > 1.
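
A tiny illustration of why reshape is not equivalent to transpose here (illustrative only, not the library code):

import torch

seq_len, bsz, dim = 3, 2, 4
x = torch.arange(seq_len * bsz * dim).view(seq_len, bsz, dim)

wrong = x.reshape(bsz, seq_len, dim)   # reinterprets memory, mixing tokens from different examples
right = x.transpose(0, 1)              # swaps the axes, keeping each example's tokens together

print(torch.equal(wrong, right))       # False; they only coincide when bsz == 1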

@alexgaskell10

alexgaskell10 commented Aug 7, 2020

@ibeltagy @HHousen Thanks for the update; it is still not working well for me with bsz > 1. I think you also need to change attn_output = attn_output.contiguous().view(tgt_len, bsz, embed_dim) to attn_output = attn_output.transpose(0,1) in longformer/longformer_encoder_decoder.py, line 75.

@ibeltagy
Contributor

ibeltagy commented Aug 7, 2020

@alexgaskell10, you are right. Just pushed a fix for that one as well.

@alexgaskell10

@ibeltagy it still isn't working correctly for me (even at bsz=1). On some runs, and at random points during training, the training becomes corrupted as per the image below. I'm taking a look into this now but am not really sure where to start, as it only happens sometimes and at random points during training, so I haven't got much to work with. Any ideas?

[Screenshot 2020-08-12 at 15:45:19 showing the corrupted training run]

@ibeltagy
Contributor

ibeltagy commented Aug 12, 2020

  • What is the effective batch size (bsz x gradient accumulation x number of GPUs)? Make sure it is not very small; try at least 8 if not 32.

  • How does the learning curve look? Can you draw it next to the loss curve? Are you using warmup and decay? Try lowering the learning rate.

@alexgaskell10

alexgaskell10 commented Aug 12, 2020

Thanks for the suggestions, a couple of good thoughts. I have only been using a small bsz so far (< 4), so I think that is somewhere to start, alongside playing with the LR. Thanks!

  • I am not using warmup and decay. I don't think warmup is the issue, as the corruption rarely begins soon into training. Will try with decay though.
  • What are you referring to as the learning curve in this instance? The validation loss?

@ibeltagy
Contributor

ibeltagy commented Aug 12, 2020

Oh, sorry, I meant plotting the learning rate curve vs. steps.

@JessicaLopezEspejel

JessicaLopezEspejel commented Aug 18, 2020

Gradient checkpointing in the decoder is not working, so I'm going to remove it for now. I will update the repo this weekend and put some instructions in the readme.

Hello @patil-suraj
Have you made any progress on this work? I checked the repository but the README is still empty.

Can you help me please @alexgaskell10?

Thank you so much.

@morganmcg1
Contributor

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Just adding that I get the same error for AlbertForMaskedLM with albert-large-v2, batch size of 8, using version 3.1.0 (pytorch), and training with Trainer

It doesn't appear immediately, but a little way into the warm-up phase of the training.

@memray

memray commented Sep 30, 2020

I had the same problem with FunnelTransformer, but it seems to be resolved after I set WANDB_WATCH=false or disabled --fp16. You can try whether either works for you.
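
The environment variable can also be set from Python, before the Trainer / wandb run is created:

import os

# Must be set before training starts; WANDB_WATCH is the variable the wandb integration reads.
os.environ["WANDB_WATCH"] = "false"   # stop wandb from hooking model gradients/parameters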

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Just adding that I get the same error for AlbertForMaskedLM with albert-large-v2, batch size of 8, using version 3.1.0 (pytorch), and training with Trainer

It doesn't appear immediately, but a little way into the warm-up phase of the training.

@Amirosimani

@patil-suraj what is the best way to save the fine-tuned model so it can be reused with T5ForConditionalGeneration.from_pretrained()?
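
Is the usual save_pretrained / from_pretrained round trip the right approach? A sketch of what I mean (the directory name is a placeholder, and model / tokenizer are the fine-tuned objects):

from transformers import T5ForConditionalGeneration, T5Tokenizer

output_dir = "./t5-finetuned-summarizer"     # placeholder path
model.save_pretrained(output_dir)            # writes the config and model weights
tokenizer.save_pretrained(output_dir)        # writes the vocab / tokenizer files

# Later:
model = T5ForConditionalGeneration.from_pretrained(output_dir)
tokenizer = T5Tokenizer.from_pretrained(output_dir)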

@stale

stale bot commented Dec 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 12, 2020
@stale stale bot closed this as completed Dec 20, 2020