Summarization Fine Tuning #4406

kevinlu1248 · 2020-05-17T01:50:39Z

❓ Questions & Help

Details

I tried using T5 and BART, but their abstractive summarization of scientific texts does not give the results I want, I think because both were trained on news corpora. I have scraped all of the free PMC articles and am thinking about fine-tuning a seq2seq model on the articles and their abstracts to make an abstractive summarizer for scientific texts. This Medium article (https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8) provides a bit of an introduction but does not go into much detail, so I am wondering how to approach this.

I'm not stuck on a specific error; I just don't really know how to approach this problem.

A link to original question on Stack Overflow:
https://stackoverflow.com/questions/61826443/train-custom-seq2seq-transformers-model

patil-suraj · 2020-05-18T04:36:02Z

First thing you can try is fine-tune T5/BART for summarization on your corpus and see how it performs.

kevinlu1248 · 2020-05-18T04:49:00Z

@patil-suraj where can I find a guide to this? I'm a bit confused by the documentation.

patil-suraj · 2020-05-18T12:55:36Z

Here's the official example which fine-tunes BART on CNN/DM. You can just replace the CNN/DM dataset with your own summarization dataset.
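For anyone adapting that example to their own corpus, here is a minimal sketch of writing article/abstract pairs into line-aligned `<split>.source` / `<split>.target` files. The file naming is an assumption about how the summarization example expected its data around this time, so check the example's README for your version. `write_split` and the toy pairs are hypothetical.

```python
import tempfile
from pathlib import Path


def write_split(pairs, data_dir, split):
    """Write (article, abstract) pairs as two line-aligned text files.

    Line i of <split>.source holds an article and line i of <split>.target
    holds its abstract, so whitespace inside each text (including newlines)
    is collapsed to single spaces.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    src_path = data_dir / f"{split}.source"
    tgt_path = data_dir / f"{split}.target"
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for article, abstract in pairs:
            src.write(" ".join(article.split()) + "\n")
            tgt.write(" ".join(abstract.split()) + "\n")
    return src_path, tgt_path


# Hypothetical toy pairs; replace with (article body, abstract) from your corpus.
toy_pairs = [
    ("Background: ...\nMethods: ...", "We study ..."),
    ("Another article\nspanning two lines.", "Another abstract."),
]
with tempfile.TemporaryDirectory() as tmp:
    src_path, tgt_path = write_split(toy_pairs, tmp, "train")
    print(src_path.read_text(encoding="utf-8"))
```

Repeating the call with `"val"` and `"test"` would produce the other splits, assuming the example reads them from the same directory.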

kevinlu1248 · 2020-05-18T18:57:53Z

@patil-suraj Thanks for the example. I'm wondering if there is a simpler way to get started, since I'm planning on training in a Kaggle notebook due to GPU constraints. Otherwise I may need to copy-paste the entire folder into a Kaggle notebook.

patil-suraj · 2020-05-21T11:00:24Z

@kevinlu1248
This colab shows how to fine-tune T5 with lightning. It is just a self-contained version of the official example. You should be able to use the same Trainer; just replace the model with BART and use your own dataset.

kevinlu1248 · 2020-05-21T17:44:50Z

@patil-suraj Thanks, I'll look into it.

sam-writer · 2020-05-23T17:13:37Z

Here's the official example which fine-tunes BART on CNN/DM, you can just replace the cnn/dm dataset with your own summarization dataset.

Hi @patil-suraj, I am following that example and have my data in that format, and I can see the process using GPU/CPU, but I can't get tensorboard working. Do you have any hints? I am happy to contribute to documentation once I get it working.

patil-suraj · 2020-05-23T17:51:20Z

@sam-qordoba lightning handles logging itself, and by default the tensorboard logs are saved in the lightning_logs directory. So you should be able to see the logs by passing lightning_logs as the logdir to the tensorboard command.
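Before pointing TensorBoard at the directory (`tensorboard --logdir lightning_logs`), it can help to confirm that event files exist and are growing. A small sketch, assuming the default `version_N/events.out.tfevents.*` layout; `list_event_files` is a hypothetical helper, not part of Lightning.

```python
from pathlib import Path


def list_event_files(log_root="lightning_logs"):
    """Return (path, size_in_bytes) for each TensorBoard event file under log_root."""
    root = Path(log_root)
    if not root.is_dir():
        return []
    files = sorted(root.glob("version_*/events.out.tfevents.*"))
    return [(path, path.stat().st_size) for path in files]


for path, size in list_event_files():
    print(f"{path}  {size} bytes")
```

If the sizes do not change between two calls made a few minutes apart, nothing is being logged and the training loop has probably not reached its first logging step.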

sam-writer · 2020-05-23T18:41:56Z

Thanks @patil-suraj

sam-writer · 2020-05-30T00:14:36Z

Hey @patil-suraj, I had OOM issues on Colab, so moved to a VM with 56GB RAM, and the behaviour is the same as on Colab: memory usage grows, until it uses up everything available (I even added 32GB of swap, so, it's a really impressive amount of memory usage), until I get locked out of the machine... and the only time it writes to lightning_logs is right when it starts.

jupyter@pytorch-20200529-155153:~/lightning_logs$ tree
.
└── version_0
    ├── events.out.tfevents.1590794134.pytorch-20200529-155753.8733.0
    └── hparams.yaml

1 directory, 2 files

nvidia-smi looks like this:

jupyter@pytorch-20200529-155753:~$ nvidia-smi 
Sat May 30 00:07:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0    35W /  70W |   2579MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8733      C   /opt/conda/bin/python                       2569MiB |
+-----------------------------------------------------------------------------+

The cell trainer.fit(model) outputs the model definition, but no progress bar or anything else:


    | Name                                                                  | Type                       | Params
-----------------------------------------------------------------------------------------------------------------
0   | model                                                                 | T5ForConditionalGeneration | 222 M 
1   | model.shared                                                          | Embedding                  | 24 M  
2   | model.encoder                                                         | T5Stack                    | 109 M 
...
514 | model.decoder.block.11.layer.2.dropout                                | Dropout                    | 0     
515 | model.decoder.final_layer_norm                                        | T5LayerNorm                | 768   
516 | model.decoder.dropout                                                 | Dropout                    | 0     
517 | model.lm_head                                                         | Linear                     | 24 M  
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

Sorry to keep bothering you, but do you have any hints? It's hard to know what's going on because it doesn't seem to log anything.

patil-suraj · 2020-06-01T15:01:21Z

It shouldn't take that much memory. Did you try reducing the batch size?

It also seems that you are using fp16 here. I haven't tried it with fp16 yet.

tagging @sshleifer

sam-writer · 2020-06-01T17:18:57Z

Ok, I tried fp16 as a "maybe this will use less memory" experiment, I will try without. I tried batch size of 4, could go lower I guess. Should I just double the learning rate each time I halve the batch size, or are other changes needed?

alexgaskell10 · 2020-06-05T09:12:53Z

Could somebody who has fine-tuned BART give me an estimate of how long it takes / how many epochs until convergence? Also any tricks to speed it up (weight freezing etc)?

1 epoch takes c. 150 hrs for my dataset so wondering how many I need...

sshleifer · 2020-06-05T15:26:44Z

Sounds like you have a huge dataset?
It's tough to know exactly how many you will need, but for xsum and cnn most of the models I've seen have required 4-6 epochs to converge.
The original authors (https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md#4-fine-tuning-on-cnn-dm-summarization-task) say 15-20K steps.

I have had to go down to batch size=1 or 2 on some occasions.
You can use --gradient_accumulation_steps to keep the "effective" batch size (how many examples your model processes per optimizer step) consistent.
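To illustrate why accumulation keeps the effective batch size constant, here is a toy sketch in plain Python (no PyTorch) showing that gradients accumulated over micro-batches match the gradient of the full batch. The 1-D model, the data, and both function names are hypothetical, and the sketch assumes the batch size is divisible by the micro-batch size.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients the way gradient accumulation does.

    Each micro-batch gradient is divided by the number of accumulation steps,
    so the running sum equals the gradient of the full batch.
    """
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        total += grad_mse(w, micro_batch) / steps  # one "backward" per micro-batch
    return total  # the optimizer would step once, using this value


data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 7.0)]
full = grad_mse(0.5, data)              # one backward pass on a batch of 4
accum = accumulated_grad(0.5, data, 1)  # four backward passes on batches of 1
print(full, accum)
```

In this picture the effective batch size is the per-step batch size times the number of accumulation steps, which is why halving the former and doubling the latter leaves the update unchanged.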

@sam-qordoba is your Dataset/DataLoader putting all the examples in memory before training? That could be an issue on a large dataset.
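One way to avoid holding every example in memory is to index the file once and read a single line per lookup. A minimal sketch, assuming one example per line in a UTF-8 text file; `LazyLineDataset` is a hypothetical class that only implements the `__len__` / `__getitem__` protocol, and tokenization would happen per item or in the collate function.

```python
class LazyLineDataset:
    """Store byte offsets only; read one line from disk per __getitem__ call."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        offset = 0
        # Single pass over the file to record where each line starts.
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Raises IndexError for out-of-range indices, like a list would.
        start = self.offsets[idx]
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.readline().decode("utf-8").rstrip("\r\n")
```

Memory use then grows with the number of examples (one integer each) rather than with the total size of the text.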

sshleifer · 2020-06-05T15:28:40Z

You can also freeze the BartForConditionalGeneration.model.encoder using the function below to reduce memory cost.

from torch import nn

def freeze_part(model: nn.Module):
    # Stop gradient computation for every parameter in this module.
    for par in model.parameters():
        par.requires_grad = False

You can also use val_check_interval in lightning to check validation statistics more frequently, but unfortunately your checkpoints will still be saved at the end of every epoch.

alexgaskell10 · 2020-06-05T15:32:33Z

@sshleifer thanks for coming back with this- all very helpful.

Yes- essentially I am just trying out BART for longer docs (arXiv/PubMed) as a baseline to compare more sophisticated methods against. This means the training set has 300k samples and only 1 sample fits on the GPU at once (12GB, using 1,024 input length).
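A rough back-of-the-envelope estimate can be built from the numbers quoted in this thread (about 150 hours per epoch, 300k samples, and the 15-20K steps figure mentioned earlier). This is only a sketch: it assumes wall-clock time scales linearly with the number of samples processed, which is plausible when a single sample fits on the GPU, and the effective batch sizes below are hypothetical.

```python
def seconds_per_sample(epoch_hours, samples_per_epoch):
    """Average wall-clock seconds spent on one training sample."""
    return epoch_hours * 3600.0 / samples_per_epoch


def hours_for_steps(optimizer_steps, effective_batch_size, sec_per_sample):
    """Wall-clock hours needed to run the given number of optimizer steps."""
    return optimizer_steps * effective_batch_size * sec_per_sample / 3600.0


sps = seconds_per_sample(150, 300_000)  # figures quoted in this thread
for effective_bs in (1, 8, 32):         # hypothetical effective batch sizes
    hours = hours_for_steps(20_000, effective_bs, sps)
    print(f"effective batch {effective_bs:>2}: {hours:.0f} h for 20K steps")
```

The effective batch size used by the original authors is not stated in this thread, so the loop only shows how strongly the estimate depends on it.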

Lots for me to play around with and see what works well. Thanks for your help.

patil-suraj · 2020-06-05T15:43:31Z

Yes- essentially I am just trying out BART for longer docs (arXiv/PubMed) as a baseline to compare more sophisticated methods against

@alexgaskell10 If you are interested in using BART for long documents then keep an eye here.
https://github.com/patil-suraj/longbart

I'm trying to convert BART to its long version using Longformer's sliding-window attention.

I've been able to replace the BART encoder's SelfAttention with LongformerSelfAttention with 4096 max length. Now I'm working on adding gradient checkpointing to allow it to train on smaller GPUs. I hope to finish it soon.

Gradient checkpointing and fp16 with the 'O2' opt level should allow a larger batch size.

alexgaskell10 · 2020-06-05T15:48:58Z

@patil-suraj thanks for this- adapting BART for LongformerSelfAttention was actually something I was going to start looking into over the next couple of weeks. Thanks for sharing- I'll be sure to give it a go soon.

virattt · 2020-06-12T17:01:48Z

Hey @patil-suraj, any updates on your latest progress on LongBART? Thinking about diving into a similar block of work: expanding BART via Longformer

patil-suraj · 2020-06-12T17:12:13Z

Hi @virattt, I've been able to replace the BART encoder's self-attention with sliding-window attention. I also added gradient checkpointing in the encoder.

Gradient checkpointing in the decoder is not working, so I am going to remove it for now. I will update the repo this weekend and put some instructions in the README.

virattt · 2020-06-12T17:51:14Z

Sounds great, thanks @patil-suraj

sshleifer · 2020-06-12T19:24:19Z

Would love to hear LongBart experimental results whenever they are available!

alexgaskell10 · 2020-07-01T09:54:25Z

@sshleifer I have been playing around with LongBart recently and have some preliminary experimental results. This is using @patil-suraj 's longbart repo fine-tuned on the PubMed dataset using the hf summarization finetune.py script.

The best result so far is ROUGE-1 = 36.8 (for comparison, fine-tuning vanilla BART on PubMed and truncating articles at 1024 tokens I got 42.3 ROUGE-1). I have only run a few configs so far and will be running many more so I expect this to improve. Next steps:

- I have only been using a 12GB GPU so far, so I have frozen the embeddings and encoder; otherwise the model is too large. I have a much larger cluster I can move to, so I will start running trials on it soon, which will give more freedom to try different configs.
- I am only fine-tuning at the moment. I might explore doing some pre-training, although this may be too expensive.
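For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and the reference. The sketch below is a simplified illustration, not the scorer used for the numbers above: it uses lowercased whitespace tokens with no stemming, so its values will differ from an official ROUGE implementation. It also assumes the reported figures are F1 scores scaled to 0-100, which the thread does not state. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference summary and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat")
print(f"ROUGE-1 F1: {100 * score:.1f}")
```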

Let me know if there is anything you would like to see and I'll try to schedule it in.

patil-suraj · 2020-07-01T10:39:43Z

Hi @alexgaskell10, did you use the code as it is? I think we'll need to train the embeddings for a few epochs, then we can freeze them.
However, without freezing the embeddings I ran into OOM halfway through the epoch, even with bart-base with 'O2' fp16 on a 16GB V100.

@sshleifer do you have any ideas why this might be happening? It went well until 60% of the first epoch, then OOM. Batch size was 1 and max_seq_len was 4096.

@alexgaskell10 can you share more details, how many epochs, batch size, fp16 or not ?

alexgaskell10 · 2020-07-01T10:49:52Z

Yes, I used the code as is (with minor changes to integrate with the hf finetune.py script). I agree that the embeddings and encoder should not be frozen from the beginning, but I couldn't fit the model on my 12GB GPU otherwise. Once I get set up on the cluster I'll try this.

More details on all my runs so far can be found in my wandb project. To answer your question, max a couple epochs so far, batch size between 4 and 16 depending on what fits, not fp16 so far (haven't set up yet but will do soon).

patil-suraj · 2020-07-01T11:01:45Z

Thanks @alexgaskell10, I think you'll be able to use bart-base with fp16 and max 2048 seq len without freezing the embeddings on a 12GB GPU.

HHousen · 2020-07-14T18:42:50Z

I ran the benchmark scripts for each version: python examples/benchmarking/run_benchmark.py --models allenai/longformer-base-4096 --training.

Latest master branch:

2020-07-14 18:21:34.487221: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 725/725 [00:00<00:00, 583kB/s]
1 / 1

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.031     
 allenai/longformer-base-4096        8               32            1.015     
 allenai/longformer-base-4096        8              128            1.037     
 allenai/longformer-base-4096        8              512            1.028     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2.0      
 allenai/longformer-base-4096        8               32            1.999     
 allenai/longformer-base-4096        8              128            2.103     
 allenai/longformer-base-4096        8              512            2.366     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             10001     
 allenai/longformer-base-4096        8               32            10155     
 allenai/longformer-base-4096        8              128            10207     
 allenai/longformer-base-4096        8              512            12559     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
- transformers_version: 3.0.2
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:30:48.341403
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

Version 2.11.0 (git checkout tags/v2.11.0):

2020-07-14 18:31:04.379166: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
1 / 1
======= INFERENCE - SPEED - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 0.356s
		allenai/longformer-base-4096/8/32: 0.359s
		allenai/longformer-base-4096/8/128: 0.364s
		allenai/longformer-base-4096/8/512: 0.367s
======= INFERENCE - MEMORY - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 8178 MB
		allenai/longformer-base-4096/8/32: 8170 MB
		allenai/longformer-base-4096/8/128: 8162 MB
		allenai/longformer-base-4096/8/512: 8162 MB
======= TRAIN - SPEED - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 0.357s
		allenai/longformer-base-4096/8/32: 0.359s
		allenai/longformer-base-4096/8/128: 0.363s
		allenai/longformer-base-4096/8/512: 0.366s
======= TRAIN - MEMORY - RESULT =======
	======= MODEL CHECKPOINT: allenai/longformer-base-4096 =======
		allenai/longformer-base-4096/8/8: 9320 MB
		allenai/longformer-base-4096/8/32: 9416 MB
		allenai/longformer-base-4096/8/128: 9514 MB
		allenai/longformer-base-4096/8/512: 11866 MB

======== ENVIRONMENT - INFORMATION ========
- transformers_version: 2.11.0
- framework: PyTorch
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:34:31.155709
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0

HHousen · 2020-07-14T19:13:43Z

I also tested the differences before and after d697b6c ([Longformer] Major Refactor (#5219)).

Training time changes:

Before --> After
1.323  --> 1.995
1.353  --> 2.016
1.416  --> 2.094
1.686  --> 2.378

Before d697b6c (at commit e0d58dd):

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             0.332     
 allenai/longformer-base-4096        8               32            0.342     
 allenai/longformer-base-4096        8              128             0.35     
 allenai/longformer-base-4096        8              512            0.357     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.323     
 allenai/longformer-base-4096        8               32            1.353     
 allenai/longformer-base-4096        8              128            1.416     
 allenai/longformer-base-4096        8              512            1.686     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              9617     
 allenai/longformer-base-4096        8               32             9771     
 allenai/longformer-base-4096        8              128             9823     
 allenai/longformer-base-4096        8              512            12175     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
- transformers_version: 3.0.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 18:59:34.297304
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

After d697b6c (at commit d697b6c):

====================       INFERENCE - SPEED - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.028     
 allenai/longformer-base-4096        8               32             1.01     
 allenai/longformer-base-4096        8              128            1.013     
 allenai/longformer-base-4096        8              512            1.061     
--------------------------------------------------------------------------------

====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2117     
 allenai/longformer-base-4096        8               32             2117     
 allenai/longformer-base-4096        8              128             2117     
 allenai/longformer-base-4096        8              512             2117     
--------------------------------------------------------------------------------

====================        TRAIN - SPEED - RESULTS         ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             1.995     
 allenai/longformer-base-4096        8               32            2.016     
 allenai/longformer-base-4096        8              128            2.094     
 allenai/longformer-base-4096        8              512            2.378     
--------------------------------------------------------------------------------

====================        TRAIN - MEMORY - RESULTS        ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              9617     
 allenai/longformer-base-4096        8               32             9771     
 allenai/longformer-base-4096        8              128             9823     
 allenai/longformer-base-4096        8              512            12175     
--------------------------------------------------------------------------------

====================        ENVIRONMENT INFORMATION         ====================
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
- transformers_version: 3.0.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.1+cu101
- python_version: 3.6.9
- system: Linux
- cpu: x86_64
- architecture: 64bit
- date: 2020-07-14
- time: 19:10:37.139177
- fp16: False
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 13021
- use_gpu: True
- num_gpus: 1
- gpu: Tesla T4
- gpu_ram_mb: 15079
- gpu_power_watts: 70.0
- gpu_performance_state: 0
- use_tpu: False

HHousen · 2020-07-14T20:03:37Z

@patrickvonplaten @alexgaskell10 @ibeltagy The training time increased from tags/v2.11.0 (0.361s) to right before d697b6c (at commit e0d58dd) (1.445s) by 1.084s.

The training time increased from right before d697b6c (at commit e0d58dd) (1.445s) to directly after d697b6c (at commit d697b6c) (2.121s) by 0.676s.

I ran the benchmarks twice and got similar results both times.
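The averages quoted above appear to be means over the four sequence lengths in each training-speed table. A quick check under that assumption, with the times copied from the benchmark output in this thread:

```python
from statistics import mean

# Training times in seconds for sequence lengths 8, 32, 128, 512 (batch size 8).
train_times = {
    "v2.11.0": [0.357, 0.359, 0.363, 0.366],
    "e0d58dd (before refactor)": [1.323, 1.353, 1.416, 1.686],
    "d697b6c (after refactor)": [1.995, 2.016, 2.094, 2.378],
}

means = {name: mean(times) for name, times in train_times.items()}
names = list(means)
# Compare each version with the one benchmarked just before it.
for prev, curr in zip(names, names[1:]):
    delta = means[curr] - means[prev]
    ratio = means[curr] / means[prev]
    print(f"{prev} -> {curr}: +{delta:.3f}s ({ratio:.2f}x)")
```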

ibeltagy · 2020-07-14T20:41:03Z

nice finding. Thanks, @HHousen.

@patrickvonplaten, we can check the refactoring more carefully to find the reason for the second slowdown. Any thoughts on what could be the reason for the first one? It is a span of 270 commits!!

patrickvonplaten · 2020-07-14T22:19:08Z

Thanks a lot for running the benchmark @HHousen !

Very interesting indeed! I will take a look tomorrow.
The benchmarking tools were changed quite significantly from 2.11 to 3.0.1 => so I will run both Longformer versions (2.11 and master) with the same benchmarking tools tomorrow to make sure that the performance degradation is really due to changes in Longformer.

HHousen · 2020-07-15T00:48:38Z

@patrickvonplaten You're correct about the first training time increase. I tracked down the time change to commit fa0be6d. At 18a0150 (right before fa0be6d) the training time is about 0.35s. But at fa0be6d it's about 1.4s. So the first time increase can be safely ignored because it was caused by a change in the benchmark scripts.

The second time increase, caused by d697b6c, seems to be the main issue.

patrickvonplaten · 2020-07-15T18:45:31Z

@HHousen @ibeltagy,

I just ran the same benchmarking scripts on different versions and I can confirm that there is quite a drastic slow-down at master.
Here is the branch: https://github.com/huggingface/transformers/tree/benchmark_for_2_11 in case it's useful for you.

My results for master:

====================       INFERENCE - SPEED - RESULT       ====================                         
--------------------------------------------------------------------------------                         
          Model Name             Batch Size     Seq Length     Time in s                                                                                                                                           
--------------------------------------------------------------------------------                         
 allenai/longformer-base-4096        8               8             0.644                                 
 allenai/longformer-base-4096        8               32             0.64                                 
 allenai/longformer-base-4096        8              128             0.64                                 
 allenai/longformer-base-4096        8              512            0.637        
--------------------------------------------------------------------------------                         
Saving results to csv.                                                                                   
                                                                                                         
====================      INFERENCE - MEMORY - RESULT       ====================                         
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB                               
--------------------------------------------------------------------------------                         
 allenai/longformer-base-4096        8               8              2023                                 
 allenai/longformer-base-4096        8               32             2023                                 
 allenai/longformer-base-4096        8              128             2023                                 
 allenai/longformer-base-4096        8              512             2023                                 
--------------------------------------------------------------------------------                         
Saving results to csv.                                                                                   
                                                                                                         
====================        ENVIRONMENT INFORMATION         ====================                                                                                                                                   
- transformers_version: 3.0.2                                                                            
- framework: PyTorch                                                                                     
- use_torchscript: False                                                                                 
- framework_version: 1.5.0                                                                               
- python_version: 3.7.7                                                                                  
- system: Linux                                                                                                                                                                                                    
- cpu: x86_64                                                                                                                                                                                                      
- architecture: 64bit                                                                                    
- date: 2020-07-15                                  
- time: 18:30:00.426834                             
- fp16: False                                       
- use_multiprocessing: True
- only_pretrain_model: False
- cpu_ram_mb: 32089                                 
- use_gpu: True                                     
- num_gpus: 1                                       
- gpu: TITAN RTX                                    
- gpu_ram_mb: 24217                                 
- gpu_power_watts: 280.0                            
- gpu_performance_state: 0
- use_tpu: False

results for 2.11.0:

====================       INFERENCE - SPEED - RESULT       ====================                                                                                                                                   
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length     Time in s   
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8             0.144       
 allenai/longformer-base-4096        8               32            0.144     
 allenai/longformer-base-4096        8              128            0.144     
 allenai/longformer-base-4096        8              512            0.145     
--------------------------------------------------------------------------------                                                                                                                                   
Saving results to csv.                                                                                   
                                                    
====================      INFERENCE - MEMORY - RESULT       ====================
--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
 allenai/longformer-base-4096        8               8              2023                                 
 allenai/longformer-base-4096        8               32             2023          
 allenai/longformer-base-4096        8              128             2023                                 
 allenai/longformer-base-4096        8              512             2023                                 
--------------------------------------------------------------------------------
Saving results to csv.                                                                                   
                                                                                                         
====================        ENVIRONMENT INFORMATION         ====================   
- transformers_version: 2.11.0                                                                           
- framework: PyTorch                                
- use_torchscript: False                                                                                 
- framework_version: 1.5.0                                                                               
- python_version: 3.7.7                                                                                  
- system: Linux                                                                                          
- cpu: x86_64                                                                                            
- architecture: 64bit                                                                                    
- date: 2020-07-15                                                                                       
- time: 18:39:00.315564                                                                                  
- fp16: False                                                                                            
- use_multiprocessing: True                                                                              
- only_pretrain_model: False                                                                             
- cpu_ram_mb: 32089                                                                                      
- use_gpu: True                                                                                          
- num_gpus: 1                                                                                            
- gpu: TITAN RTX                                                                                         
- gpu_ram_mb: 24217                                                                                      
- gpu_power_watts: 280.0                                                                                 
- gpu_performance_state: 0                                                                               
- use_tpu: False

It was probably caused by me, when I did the major longformer refactoring... => will investigate more tomorrow!
Thanks a lot for pointing this out @HHousen - this is super useful.

I guess we should have tests that automatically check if the PR causes a significant slow down. (also @sshleifer , @thomwolf, @mfuntowicz )

patrickvonplaten · 2020-07-16T14:25:41Z

Ok fixed it. @ibeltagy @HHousen - it would be great if you can try again on your end with the current version of master to make sure the inference speed is back to normal.

HHousen · 2020-07-16T16:38:00Z

@patrickvonplaten I ran the benchmark on master and the speeds do look to be normal again.

The training speeds are 1.328s, 1.378s, 1.457s, and 1.776s for sequences of length 8, 32, 128, 512 respectively, which is similar to the speeds before the major refactor at d697b6c.

Inference speeds are 0.326s, 0.343s, 0.348s, and 0.367s, which appear to be back to normal.

HHousen · 2020-07-16T16:40:27Z

@ibeltagy Should you merge ibeltagy/transformers@longformer_encoder_decoder into huggingface/transformers@master yet to add gradient checkpointing to BART? Or are you waiting for the final LongformerEncoderDecoder implementation to be completed?

ibeltagy · 2020-07-16T17:15:02Z

@HHousen, I had to disable certain features of the model here to implement gradient checkpointing, so merging it will require more work.
@LysandreJik started working on gradient checkpointing in this PR #5415 and he might have better ideas.

WangHexie · 2020-08-02T13:49:15Z

@kevinlu1248
This colab shows how to fine-tune T5 with Lightning. This is just a self-contained version of the official example. You should be able to use the same Trainer; just replace the model with BART and use your own dataset.

I modified this example to work with BART, using only Positive </s> as the target, but after training for an epoch the model outputs all 0s: tensor([[2, 0, 0, 2]]), decoded as ''.
What could be the reason the model fails on such a simple task?
Using training_rate = 2*10^-5

patil-suraj · 2020-08-02T14:27:41Z

@WangHexie, not sure. One suggestion: with BART you won't need to manually add </s> at the end, as the BART tokenizer automatically adds the eos token at the end of the text.

WangHexie · 2020-08-02T17:00:12Z

@patil-suraj Thanks for the pointer. The two models behave quite differently; the problem was solved by manually shifting the decoder input to the right.
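
For readers unfamiliar with the fix mentioned above, here is a minimal sketch of what "shifting the decoder input to the right" means under teacher forcing. It uses plain Python lists instead of tensors, and the helper name `shift_tokens_right`, the start token choice, and the id 9064 are assumptions made for illustration, not the exact code used in this thread.

```python
from typing import List


def shift_tokens_right(labels: List[int], decoder_start_token_id: int) -> List[int]:
    """Build decoder inputs by prepending a start token and dropping the last label."""
    if not labels:
        return []
    return [decoder_start_token_id] + labels[:-1]


# Hypothetical ids for illustration: 0 = <s>, 2 = </s>, 9064 = a made-up id for "Positive".
labels = [0, 9064, 2]
decoder_input_ids = shift_tokens_right(labels, decoder_start_token_id=2)
print(decoder_input_ids)  # [2, 0, 9064]
```

At each position the decoder then sees the previous target token and is trained to predict the current one. If the decoder inputs are identical to the labels, the model can learn to copy its input rather than generate.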

ibeltagy · 2020-08-06T19:08:38Z

@alexgaskell10, @HHousen, the query.reshape() solution is wrong. The code runs but it is not doing the right thing. It should be query.transpose(0, 1). I just pushed a fix. This bug will affect all your results if you are using a batch size > 1.
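
To illustrate why a reshape can run without error yet give wrong results, here is a small sketch with nested Python lists standing in for a 2-D tensor. The `reshape` and `transpose` helpers are illustrative stand-ins written for this example, not the longformer code.

```python
from typing import List

Matrix = List[List[int]]


def reshape(rows: Matrix, n_rows: int, n_cols: int) -> Matrix:
    """Row-major reshape: flatten, then re-chunk. The element order is unchanged."""
    flat = [x for row in rows for x in row]
    assert len(flat) == n_rows * n_cols
    return [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


def transpose(rows: Matrix) -> Matrix:
    """Swap the two axes so that out[j][i] == rows[i][j]."""
    return [list(col) for col in zip(*rows)]


# Toy "query" of shape (bsz=2, seq_len=3); each entry is 10 * batch_index + position.
query = [[0, 1, 2],
         [10, 11, 12]]

print(reshape(query, 3, 2))  # [[0, 1], [2, 10], [11, 12]]  -> examples are mixed together
print(transpose(query))      # [[0, 10], [1, 11], [2, 12]]  -> one column per example
```

Both results have shape (3, 2), so downstream code accepts either. With a batch size of 1 the two operations coincide, which is consistent with the bug only affecting bsz > 1.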

alexgaskell10 · 2020-08-07T17:36:59Z

@ibeltagy @HHousen Thanks for the update, it still is not working well for me with bsz > 1. I think you also need to change attn_output = attn_output.contiguous().view(tgt_len, bsz, embed_dim) to attn_output = attn_output.transpose(0,1) in longformer/longformer_encoder_decoder.py, line 75.

ibeltagy · 2020-08-07T18:17:11Z

@alexgaskell10, you are right. Just pushed a fix for that one as well.

alexgaskell10 · 2020-08-12T14:50:27Z

@ibeltagy it still isn't working correctly for me (even at bsz=1). On some runs, at random points during training, the training becomes corrupted, as per the image below. I am looking into this now but am not really sure where to start: it only happens sometimes and at random points, so I haven't got much to work with. Any ideas?

ibeltagy · 2020-08-12T15:13:46Z

What is the effective batch size (bsz x gradient accumulation x number of GPUs)? Make sure it is not very small; try at least 8, if not 32.
What does the learning ~~curve~~ rate curve look like? Can you draw it next to the loss curve? Are you using warmup and decay? Try lowering the learning rate.
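
As a rough sketch of the two quantities asked about above: the effective batch size, and a learning rate schedule with linear warmup followed by linear decay. The function names and all numeric values are hypothetical, chosen only for illustration.

```python
def effective_batch_size(per_gpu_bsz: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_gpu_bsz * grad_accum_steps * num_gpus


def linear_warmup_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """LR rises linearly to base_lr over warmup_steps, then decays linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical setup: bsz 2 per GPU, 8 accumulation steps, 2 GPUs.
print(effective_batch_size(2, 8, 2))  # 32

# Learning rate at a few steps; plotting these against the loss curve
# shows whether a divergence lines up with a high learning rate.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, base_lr=2e-5, warmup_steps=100, total_steps=1000))
```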

alexgaskell10 · 2020-08-12T16:40:31Z

Thanks for the suggestions, a couple of good thoughts. I have only been using a small bsz so far (< 4), so I think that is somewhere to start, alongside playing with the LR. Thanks!

I am not using warmup and decay. I don't think warmup is the issue, as the corruption rarely begins early in training. I will try with decay, though.
What are you referring to as the learning curve in this instance? The validation loss?

ibeltagy · 2020-08-12T16:54:45Z

oh, sorry, I meant plotting learning rate curve vs. steps.

JessicaLopezEspejel · 2020-08-18T15:00:31Z

Gradient checkpointing in the decoder is not working so going to remove it for now. Will update the repo this weekend and will put some instructions in the readme.

Hello @patil-suraj
Have you made any progress on this work? I checked the repository, but the README is still empty.

Can you help me please @alexgaskell10?

Thank you so much.

morganmcg1 · 2020-09-24T07:32:46Z

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Just adding that I get the same error for AlbertForMaskedLM with albert-large-v2, batch size of 8, using version 3.1.0 (pytorch), and training with Trainer

It doesn't appear immediately, but a little way into the warm-up phase of the training.

memray · 2020-09-30T19:21:35Z

I had the same problem with FunnelTransformer, but it seems to be resolved after I set WANDB_WATCH=false or disabled --fp16. You can check whether that works for you.
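
One way to apply the first workaround from inside a script or notebook is sketched below. It assumes the Weights & Biases integration in transformers reads the WANDB_WATCH environment variable when the Trainer is set up; whether this resolves the error in a given setup has not been verified here.

```python
import os

# Assumption: this must run before the Trainer (and its W&B integration) is created,
# so that gradient/parameter watching is turned off for the whole run.
os.environ["WANDB_WATCH"] = "false"

print(os.environ["WANDB_WATCH"])
```

From a shell, the equivalent is to prefix the training command with WANDB_WATCH=false.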

Amirosimani · 2020-10-12T20:46:00Z

@patil-suraj what is the best way to save the fine-tuned model in order to reuse it with T5ForConditionalGeneration.from_pretrained()?
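
One common approach, sketched here rather than taken from this thread, is to call `save_pretrained` on both the model and the tokenizer and point `from_pretrained` at the same directory later. The helper name `save_for_reuse` and the directory name are hypothetical.

```python
import os


def save_for_reuse(model, tokenizer, output_dir: str) -> None:
    """Write the model and tokenizer files into one directory via save_pretrained."""
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


# Usage, assuming `model` is a fine-tuned T5ForConditionalGeneration and
# `tokenizer` is its T5Tokenizer ("t5-finetuned-pmc" is a hypothetical path):
#   save_for_reuse(model, tokenizer, "t5-finetuned-pmc")
#   model = T5ForConditionalGeneration.from_pretrained("t5-finetuned-pmc")
#   tokenizer = T5Tokenizer.from_pretrained("t5-finetuned-pmc")
```

If the model was trained inside a wrapper module (for example a Lightning module), the call should be made on the underlying transformers model held by that wrapper, not on the wrapper itself.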

stale · 2020-12-12T20:44:14Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sshleifer changed the title ~~Fine Tuning~~ Summarization Fine Tuning May 20, 2020

sshleifer added the Discussion label May 20, 2020

patil-suraj mentioned this issue Jun 5, 2020

🚀 [Feature Request] Add self-contained browsable examples/notebooks in the docs #4787

Closed

HHousen mentioned this issue Jun 21, 2020

Add support for encoder_hidden_states and encoder_attention_mask in modeling_longformer #5170

Closed

patrickvonplaten mentioned this issue Jul 16, 2020

[Longformer] fix longformer slow-down #5811

Merged

hannes89 mentioned this issue Aug 3, 2020

[WIP] LongformerEncoderDecoder allenai/longformer#28

Open

stale bot added the wontfix label Dec 12, 2020

stale bot closed this as completed Dec 20, 2020
