Deepspeed and T5-11B for multitask training #14531

Closed
tuhinjubcse opened this issue Nov 26, 2021 · 45 comments

tuhinjubcse commented Nov 26, 2021

Carrying on my conversation from #9996 (comment), @stas00.

I used run_translation.py and now my loss is 0.0 :( - this is probably doomed to fail.

{'loss': 7.2639, 'learning_rate': 0.001, 'epoch': 0.02}                                                                                                                                                     
  3%|████                                                                                                                                                            | 612/24128 [42:13<26:09:12,  4.00s/it]{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.04}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.06}                                                                                                                                                        
  8%|█████████████                                                                                                                                                | 1999/24128 [2:15:09<24:43:54,  4.02s/it][2021-11-25 22:01:13,181] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-25 22:01:13,181] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=7.902960485741644
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.08}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.1}    

Script

export BS=8;
PYTHONPATH=../../../src
USE_TF=0

deepspeed --num_gpus=4 ./run_translation.py \
        --model_name_or_path t5-11b \
        --output_dir /local/nlp/temp/poetryT5-11B_new \
        --evaluation_strategy=epoch \
        --do_train \
        --train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
        --save_strategy=epoch \
        --label_smoothing 0.1 \
        --learning_rate 1e-3 \
        --adafactor \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --num_train_epochs 1 \
        --per_device_train_batch_size $BS \
        --per_device_eval_batch_size $BS \
        --source_lang en \
        --target_lang en \
        --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json  \
        --fp16

Data format

{"translation": {"en1": "Write a poetic sentence about 'people'", "en2": "In this age what people used to call."}}
{"translation": {"en1": "Write a poetic sentence about 'tale'", "en2": "Where evening is empty, an unfinished tale."}}
{"translation": {"en1": "Write a poetic sentence that ends in a word which rhymes with 'planes'", "en2": "Now the blood freezes in the veins."}}
{"translation": {"en1": "Write a poetic sentence about 'Weighs his spread' and ending in 'behold'", "en2": "Weighs his spread wings, at leasure to behold."}}
{"translation": {"en1": "Write a poetic sentence about 'lips'", "en2": "Her dry lips were tightly closed up."}}

The preprocess_function in run_translation.py was modified as follows:

def preprocess_function(examples):
        inputs = [ex["en1"] for ex in examples["translation"]]
        targets = [ex["en2"] for ex in examples["translation"]]
        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and data_args.ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

stas00 commented Nov 26, 2021

I have a feeling that the issue is not in using deepspeed but somewhere else in your setup.

Let's take deepspeed out of the equation for a moment and try your setup on a single GPU with t5-large or even t5-small - make it work first so that it produces what you expect, albeit at lower quality.

Once this is working you can progress to a larger model size, and eventually just plug deepspeed in to work with t5-11b.

It'll also make your debugging process much easier, since it takes forever to even load t5-11b.

Always start small and simple, then progress to bigger and slightly more complex, and then big and complex.


tuhinjubcse commented Nov 26, 2021

@stas00 thanks, I tried t5-small with and without deepspeed and the loss was non-zero: it was around 3.6 and slowly decreasing. I removed label smoothing before training.

t5-small with deepspeed / t5-small without deepspeed


{'loss': 3.6752, 'learning_rate': 0.001, 'epoch': 0.02}                                                                                                                                                     
{'loss': 3.4976, 'learning_rate': 0.001, 'epoch': 0.04}                                                                                                                                                     
{'loss': 3.4253, 'learning_rate': 0.001, 'epoch': 0.06}                                                                                                                                                     
  8%|█████████████▎                                                                                                                                                  | 1999/24128 [08:14<1:25:00,  4.34it/s]
[2021-11-26 10:02:46,946] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=5, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 10:02:46,964] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=133.5718740278581
{'loss': 3.3788, 'learning_rate': 0.001, 'epoch': 0.08}                                                                                                                                                     
{'loss': 3.3362, 'learning_rate': 0.001, 'epoch': 0.1}                                                                                                                                                      
{'loss': 3.3234, 'learning_rate': 0.001, 'epoch': 0.12}                                                                                                                                                     
{'loss': 3.303, 'learning_rate': 0.001, 'epoch': 0.15}                                                                                                                                                      
 17%|██████████████████████████▌                                                                                                                                     | 3999/24128 [16:20<1:17:30,  4.33it/s]
[2021-11-26 10:10:53,519] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=8, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 10:10:53,566] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=134.3619251306713
{'loss': 3.2785, 'learning_rate': 0.001, 'epoch': 0.17}                                                                                                                                                     
{'loss': 3.2497, 'learning_rate': 0.001, 'epoch': 0.19}                                                                                                                                                     
{'loss': 3.238, 'learning_rate': 0.001, 'epoch': 0.21}                                                                                                                                                      
{'loss': 3.225, 'learning_rate': 0.001, 'epoch': 0.23}                                                                                                                                                      
 25%|████████████████████████████████████████▎                                                                                                                         | 5999/24128 [24:09<59:07,  5.11it/s]
[2021-11-26 10:18:42,146] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=12, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 10:18:42,209] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=136.2225860449825
{'loss': 3.2199, 'learning_rate': 0.001, 'epoch': 0.25}                                                                                                                                                     
{'loss': 3.2117, 'learning_rate': 0.001, 'epoch': 0.27}                                                                                                                                                     
{'loss': 3.1959, 'learning_rate': 0.001, 'epoch': 0.29}                                                                                                                                                     
{'loss': 3.179, 'learning_rate': 0.001, 'epoch': 0.31}                                                                                                                                                      
 33%|█████████████████████████████████████████████████████                                                                                                           | 7999/24128 [32:08<1:02:08,  4.33it/s]
[2021-11-26 10:26:40,925] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=14, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 10:26:40,956] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=136.46790814424403
{'loss': 3.1771, 'learning_rate': 0.001, 'epoch': 0.33}

I then started training T5-11B with deepspeed:

{'loss': 6.2645, 'learning_rate': 0.001, 'epoch': 0.02}                                                                                                                                                     
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.04}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.06}                                                                                                                                                        
  8%|█████████████                                                                                                                                                | 1999/24128 [1:52:09<20:27:23,  3.33s/it][2021-11-26 03:07:16,494] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 03:07:16,494] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.526738918021234
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.08}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.1}                                                                                                                                                         
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.12}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.15}                                                                                                                                                        
 17%|██████████████████████████                                                                                                                                   | 3999/24128 [3:43:02<18:36:58,  3.33s/it][2021-11-26 04:58:10,385] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 04:58:10,386] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.581176077667344
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.17}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.19}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.21}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.23}                                                                                                                                                        
 25%|███████████████████████████████████████                                                                                                                      | 5999/24128 [5:33:57<16:42:03,  3.32s/it][2021-11-26 06:49:04,614] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 06:49:04,614] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.599231332866195
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.25}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.27}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.29}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.31}                                                                                                                                                        
 33%|████████████████████████████████████████████████████                                                                                                         | 7999/24128 [7:24:52<14:51:53,  3.32s/it][2021-11-26 08:40:00,444] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-26 08:40:00,445] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.607671816549383
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.33}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.35}                                                                                                                                                        
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.37}


tuhinjubcse commented Nov 26, 2021

t5-large with deepspeed is back to zero LR:

Time to load utils op: 0.0009222030639648438 seconds
[INFO|trainer.py:1196] 2021-11-26 13:25:43,185 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-11-26 13:25:43,185 >>   Num examples = 772073
[INFO|trainer.py:1198] 2021-11-26 13:25:43,185 >>   Num Epochs = 1
[INFO|trainer.py:1199] 2021-11-26 13:25:43,185 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1200] 2021-11-26 13:25:43,186 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1201] 2021-11-26 13:25:43,186 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1202] 2021-11-26 13:25:43,186 >>   Total optimization steps = 24128
  2%|███▎                                                                                                                                                             | 500/24128 [03:08<2:47:02,  2.36it/s][WARNING|trainer_pt_utils.py:803] 2021-11-26 13:28:52,075 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:28:52,075 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:28:52,076 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:28:52,076 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0729, 'learning_rate': 0, 'epoch': 0.02}                                                                                                                                                         
  4%|██████▋                                                                                                                                                         | 1000/24128 [06:12<2:20:51,  2.74it/s][WARNING|trainer_pt_utils.py:803] 2021-11-26 13:31:56,044 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:31:56,044 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:31:56,045 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 13:31:56,045 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0765, 'learning_rate': 0, 'epoch': 0.04}                                          


stas00 commented Nov 26, 2021

So first, it's good to see that you have the non-DS setup working.

I don't understand this in your log 2 comments up. Is it with or without DS?

t5-small with deepspeed / t5-small without deepspeed

Re: last comment:

As the warning says, the optimizer hasn't started running, so it doesn't have an LR yet, and just returns 0.

So we need to figure out why the optimizer isn't running.

For example, you can edit the ds config file to remove the optimizer section, and the Trainer will then use Transformers' AdamW instead of DeepSpeed's.
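
Something along these lines should do it (a sketch; the path is the one from your command line):

import json

# drop the "optimizer" block so the HF Trainer creates its own optimizer instead of DeepSpeed's
path = "/home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json"
with open(path) as f:
    config = json.load(f)
config.pop("optimizer", None)
with open("ds_config_zero2_no_optim.json", "w") as f:
    json.dump(config, f, indent=4)
# then point --deepspeed at ds_config_zero2_no_optim.json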

Meanwhile, could you help me reproduce the issue on my side? Could you perhaps make a tarball with the data and your customizations, so that I can run the same setup as you do?


stas00 commented Nov 26, 2021

I also ran a sanity check with this and verified that in general things work correctly:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_train --label_smoothing 0.1 --learning_rate 1e-3 --logging_first_step --logging_steps 2 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --val_max_target_length 128  --max_train_samples 500 --deepspeed tests/deepspeed/ds_config_zero2.json
[...]

{'loss': 2.9952, 'learning_rate': 0.0, 'epoch': 0.06}
{'loss': 2.7144, 'learning_rate': 0.001, 'epoch': 0.12}
{'loss': 2.809, 'learning_rate': 0.001, 'epoch': 0.25}
{'loss': 2.4788, 'learning_rate': 0.001, 'epoch': 0.38}
{'loss': 2.2926, 'learning_rate': 0.001, 'epoch': 0.5}

So something is different about your setup.

@tuhinjubcse

You are trying t5-small in your sanity check. t5-small works for me too, with deepspeed as well as without it. It gives me zero loss for t5-11b with the same code. I am also using adafactor instead of adam, since I am trying to reproduce the same hyperparameters as T0pp.


stas00 commented Nov 26, 2021

Additionally, I have noticed you're using --adafactor, which until recently didn't have a way to access LR as it's an internal state. Some months back I added a hack to have it extract the LR, but it's not great.

So it's very likely this is related as well - e.g., try using the default ds_config w/ its optimizer section and removing --adafactor, and see if things are different.


stas00 commented Nov 26, 2021

You are trying t5-small in your sanity check. t5-small works for me too, with deepspeed as well as without it. It gives me zero loss for t5-11b with the same code. I am also using adafactor instead of adam, since I am trying to reproduce the same hyperparameters as T0pp.

Understood.

As I explained earlier, for some reason the optimizer isn't stepping in your t5-11b example.

So we need to figure out why that is.

You can also try the larger ones first - t5-base, t5-large

@tuhinjubcse

I removed adafactor. This is for t5-large.

My config:

 "fp16": {
       "enabled": true, 
       "loss_scale": 0, 
       "loss_scale_window": 1000, 
       "initial_scale_power": 16, 
       "hysteresis": 2, 
       "min_loss_scale": 1
   }, 
   "optimizer": {
       "type": "AdamW", 
       "params": {
           "lr": 0.001, 
           "betas": [0.9, 0.999], 
           "eps": 1e-06, 
           "weight_decay": 0.0
       }
   }, 
   "scheduler": {
       "type": "WarmupLR", 
       "params": {
           "warmup_min_lr": 0, 
           "warmup_max_lr": 0.001, 
           "warmup_num_steps": 0
       }
   }, 
   "zero_optimization": {
       "stage": 2, 
       "offload_optimizer": {
           "device": "cpu", 
           "pin_memory": true
       }, 
       "allgather_partitions": true, 
       "allgather_bucket_size": 2.000000e+08, 
       "overlap_comm": true, 
       "reduce_scatter": true, 
       "reduce_bucket_size": 2.000000e+08, 
       "contiguous_gradients": true
   }, 
   "train_batch_size": 32, 
   "train_micro_batch_size_per_gpu": 8, 
   "gradient_clipping": 1.0, 
   "steps_per_print": 2.000000e+03, 
   "wall_clock_breakdown": false
}

export BS=8;
PYTHONPATH=../../../src
USE_TF=0

deepspeed --num_gpus=4 ./run_translation.py \
        --model_name_or_path t5-large \
        --output_dir /local/nlp/temp/poetryT5-11B_new \
        --evaluation_strategy=epoch \
        --do_train \
        --train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
        --validation_file /home/tuhin.chakr/gpt3/poetrynew/val.json \
        --save_strategy=epoch \
        --learning_rate 1e-3 \
        --adam_eps 1e-06 \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --num_train_epochs 1 \
        --per_device_train_batch_size $BS \
        --per_device_eval_batch_size $BS \
        --source_lang en_XX \
        --target_lang en_XX \
        --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json  \
        --fp16
Time to load utils op: 0.002509593963623047 seconds
[INFO|trainer.py:1196] 2021-11-26 22:54:54,098 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-11-26 22:54:54,098 >>   Num examples = 772073
[INFO|trainer.py:1198] 2021-11-26 22:54:54,098 >>   Num Epochs = 1
[INFO|trainer.py:1199] 2021-11-26 22:54:54,098 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1200] 2021-11-26 22:54:54,098 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1201] 2021-11-26 22:54:54,098 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1202] 2021-11-26 22:54:54,098 >>   Total optimization steps = 24128
  2%|███▎                                                                                                                                                             | 500/24128 [03:10<2:36:00,  2.52it/s][WARNING|trainer_pt_utils.py:803] 2021-11-26 22:58:04,534 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 22:58:04,534 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 22:58:04,534 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 22:58:04,534 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0729, 'learning_rate': 0, 'epoch': 0.02}                                                                                                                                                         
  4%|██████▋                                                                                                                                                         | 1000/24128 [06:19<2:26:00,  2.64it/s][WARNING|trainer_pt_utils.py:803] 2021-11-26 23:01:13,601 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:01:13,601 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:01:13,601 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:01:13,601 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0765, 'learning_rate': 0, 'epoch': 0.04}                                                                                                                                                         
  6%|█████████▉                                                                                                                                                      | 1500/24128 [09:29<2:22:56,  2.64it/s][WARNING|trainer_pt_utils.py:803] 2021-11-26 23:04:23,358 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:04:23,358 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:04:23,358 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 23:04:23,358 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0185, 'learning_rate': 0, 'epoch': 0.06}  


stas00 commented Nov 27, 2021

In general, try to use "auto" values in the ds config file so that you don't have to sync them manually - the HF Trainer will set them up correctly for you on the fly.

But I don't see any fault with your config.

And you haven't tried t5-base, t5-large or t5-3b to see whether they work, and whether it's an issue specific to t5-11b.

Can you please send me a sample of the data you train with - if it's not for the public eye, let me know. It'd be easier to experiment directly rather than asking you to try this and that all the time.

And I suppose you have custom code - best to send me a tarball of the whole thing (custom script + data), so that I don't have to spend time sorting it out. Thanks.

P.S. I don't actually have access to an A100 at the moment, but I hope to sort it out on a smaller GPU.

@tuhinjubcse

I can't share it publicly on this thread, but I emailed you a zip file containing the code and data. I sent it to the email address listed here: https://stasosphere.com/


stas00 commented Nov 27, 2021

It's missing your custom run_translation.py.


tuhinjubcse commented Nov 27, 2021

I already made the changes in run_translation.py itself.

Check this function and you will see:

def preprocess_function(examples):
        inputs = [ex["en1"] for ex in examples["translation"]]
        targets = [ex["en2"] for ex in examples["translation"]]


stas00 commented Nov 27, 2021

Could you please try after applying this patch to deepspeed:

# patch.txt
diff --git a/deepspeed/runtime/zero/stage2.py b/deepspeed/runtime/zero/stage2.py
index b995e4d..8df4997 100755
--- a/deepspeed/runtime/zero/stage2.py
+++ b/deepspeed/runtime/zero/stage2.py
@@ -1622,6 +1622,14 @@ class FP16_DeepSpeedZeroOptimizer(object):
         prev_scale = self.loss_scale
         self._update_scale(self.overflow)
         if self.overflow:
+
+            if dist.get_rank() == 0:
+                logger.info(
+                    "[deepscale] OVERFLOW! Rank {} Skipping step. Attempted loss scale: {}, "
+                    "reducing to {}".format(dist.get_rank(),
+                                            prev_scale,
+                                            self.loss_scale))
+
             see_memory_usage('After overflow before clearing gradients')
             self.zero_grad()
             if self.cpu_offload:

Then install the patched DeepSpeed from source:

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git apply patch.txt
pip install -e .


This should now tell you if OVERFLOW happens and that's why it skips the step.

PR: microsoft/DeepSpeed#1593

@tuhinjubcse

Does this solve the issue? I think for t5-large I was getting 0 LR, whereas for T5-11B the loss was zero. I am just trying to understand.

@tuhinjubcse

[2021-11-27 00:36:54,800] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                 | 98/24128 [00:40<2:40:10,  2.50it/s][2021-11-27 00:36:55,194] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                 | 99/24128 [00:40<2:39:27,  2.51it/s][2021-11-27 00:36:55,588] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 100/24128 [00:40<2:39:03,  2.52it/s][2021-11-27 00:36:55,985] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 101/24128 [00:41<2:39:00,  2.52it/s][2021-11-27 00:36:56,390] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 102/24128 [00:41<2:39:53,  2.50it/s][2021-11-27 00:36:56,789] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 103/24128 [00:41<2:39:50,  2.51it/s][2021-11-27 00:36:57,210] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 104/24128 [00:42<2:42:25,  2.47it/s][2021-11-27 00:36:57,613] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 105/24128 [00:42<2:42:12,  2.47it/s][2021-11-27 00:36:58,024] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 106/24128 [00:43<2:42:48,  2.46it/s][2021-11-27 00:36:58,424] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 107/24128 [00:43<2:42:01,  2.47it/s][2021-11-27 00:36:58,826] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 108/24128 [00:44<2:41:43,  2.48it/s][2021-11-27 00:36:59,219] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 109/24128 [00:44<2:40:21,  2.50it/s][2021-11-27 00:36:59,621] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 110/24128 [00:44<2:40:35,  2.49it/s][2021-11-27 00:37:00,014] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 111/24128 [00:45<2:39:33,  2.51it/s][2021-11-27 00:37:00,407] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▋                                                                                                                                                                | 112/24128 [00:45<2:38:53,  2.52it/s][2021-11-27 00:37:00,805] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 113/24128 [00:46<2:38:58,  2.52it/s][2021-11-27 00:37:01,200] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 114/24128 [00:46<2:38:45,  2.52it/s][2021-11-27 00:37:01,596] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 115/24128 [00:46<2:38:37,  2.52it/s][2021-11-27 00:37:01,998] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 116/24128 [00:47<2:39:18,  2.51it/s][2021-11-27 00:37:02,421] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 117/24128 [00:47<2:42:16,  2.47it/s][2021-11-27 00:37:02,814] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 118/24128 [00:48<2:40:49,  2.49it/s][2021-11-27 00:37:03,210] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 119/24128 [00:48<2:40:06,  2.50it/s][2021-11-27 00:37:03,616] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  0%|▊                                                                                                                                                                | 120/24128 [00:48<2:40:49,  2.49it/s][2021-11-27 00:37:04,008] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 121/24128 [00:49<2:39:36,  2.51it/s][2021-11-27 00:37:04,404] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 122/24128 [00:49<2:39:15,  2.51it/s][2021-11-27 00:37:04,797] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 123/24128 [00:50<2:38:34,  2.52it/s][2021-11-27 00:37:05,193] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 124/24128 [00:50<2:38:33,  2.52it/s][2021-11-27 00:37:05,591] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 125/24128 [00:50<2:38:42,  2.52it/s][2021-11-27 00:37:05,989] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 126/24128 [00:51<2:38:51,  2.52it/s][2021-11-27 00:37:06,386] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 127/24128 [00:51<2:38:52,  2.52it/s][2021-11-27 00:37:06,782] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
  1%|▊                                                                                                                                                                | 128/24128 [00:51<2:38:41,  2.52it/s][2021-11-27 00:37:07,177] [INFO] [stage2.py:1628:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

Yes so I guess OVERFLOW is happening


stas00 commented Nov 27, 2021

No, it doesn't solve the issue - I just added diagnostic logging. It was already in zero3.py, so I ported it to zero2.py - I will submit a PR to DeepSpeed.

So why does it start with loss scale 1? E.g., when I run with t5-small I get:

(I also added --logging_steps 2 to the cmd args so you don't have to wait long to see the logs.)

  0%|                                                                                                                              | 0/96510 [00:00<?, ?it/s][2021-11-26 21:18:19,660] [INFO] [stage2.py:1627:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536
  0%|                                                                                                                   | 1/96510 [00:00<10:57:06,  2.45it/s][2021-11-26 21:18:19,753] [INFO] [stage2.py:1627:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 21:18:19,754 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 4.9414, 'learning_rate': 0, 'epoch': 0.0}
  0%|                                                                                                                   | 2/96510 [00:00<10:57:06,  2.45it/s][2021-11-26 21:18:19,848] [INFO] [stage2.py:1627:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
  0%|                                                                                                                    | 3/96510 [00:00<4:41:53,  5.71it/s][2021-11-26 21:18:19,940] [INFO] [stage2.py:1627:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[WARNING|trainer_pt_utils.py:803] 2021-11-26 21:18:19,941 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 5.4297, 'learning_rate': 0, 'epoch': 0.0}
  0%|  

In the ds config file:

        "initial_scale_power": 16,

which is 2**16, hence you can see that its first step on my t5-small setup is:

Attempted loss scale: 65536, reducing to 65536

well, it's actually a minor bug, but ignore it, as the next one does the right thing:

[2021-11-26 21:18:19,753] [INFO] [stage2.py:1627:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0

But in your case it behaves as if "initial_scale_power" were 0, which is 2**0 - yet the config you pasted has 16.
We need to figure out how it jumped to:

Attempted loss scale: 1, reducing to 1

instead of starting with 2**16.

So it gets an overflow while already at loss scale 1, and it can't go any lower from there.
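
To illustrate the mechanics with a toy sketch (not DeepSpeed's actual code): the scaler starts at 2**initial_scale_power and halves after every overflowing step, so if every step overflows it quickly bottoms out at min_loss_scale and stays there:

scale = 2.0 ** 16        # "initial_scale_power": 16
min_loss_scale = 1.0     # "min_loss_scale": 1
for step in range(30):
    overflow = True      # pretend every step overflows, as in the t5-large/t5-11b logs
    if overflow:
        scale = max(scale / 2, min_loss_scale)  # halve the scale and skip the optimizer step
print(scale)             # 1.0 - and once it is here, no optimizer step is ever taken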


stas00 commented Nov 27, 2021

I can reproduce your issue with t5-large, which is good - now I should be able to sort it out, or at least communicate the problem to the DeepSpeed team.

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

I hope to have time tomorrow to debug this.


stas00 commented Nov 27, 2021

zero3 does the right thing, starting at 65536, but it too goes down to 1 - it just steps the scale down one notch per step in a slightly different fashion.

If you want to experiment before I get a chance, the next step is to try t5-large w/o deepspeed, since at that size you don't need it.

And it fails too:

export BS=8; PYTHONPATH=src USE_TF=0 deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py         --model_name_or_path t5-large         --output_dir output_dir         --evaluation_strategy=epoch         --do_train         --train_file ../poetrynew/train.json         --validation_file ../poetrynew/val.json         --save_strategy=epoch         --learning_rate 1e-3         --adam_eps 1e-06         --overwrite_output_dir         --max_source_length 64         --max_target_length 64         --num_train_epochs 1         --per_device_train_batch_size $BS         --per_device_eval_batch_size $BS         --source_lang en_XX         --target_lang en_XX            --fp16 --logging_steps 2
[...]
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}                                                                                                          
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}                                                                                                          
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.0} 

So your issue is not with deepspeed, but with either your code or transformers.

(I kept the deepspeed launcher, but it's not actually running deepspeed.)


stas00 commented Nov 27, 2021

OK, your issue is --fp16. t5 and most other models pretrained in bf16 have huge issues with fp16 (search the Issues if you're curious).

bf16 has a much larger dynamic range than fp16, and models trained in the former often overflow on the first step under fp16 - e.g. mt5 overflows on the very first step even with a small model.
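
A quick way to see the difference (a toy sketch):

import torch

x = torch.tensor(1e5)        # a perfectly ordinary activation magnitude
print(x.to(torch.bfloat16))  # finite (~1e5): bf16 keeps fp32's exponent range
print(x.to(torch.float16))   # inf: fp16 tops out around 65504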

Removing --fp16 (and disabling it in deepspeed if you use the latter) fixes the problem.

But you want speed of course, so here is what you can do next:

  1. a workaround for overflow to continue using --fp16: [T5/MT5] resolve inf/nan under amp (mixed precision)  #10956 - works for some people
  2. a WIP --bf16 PR (since you're on A100) Support for Training with BF16 #13207
  3. finetune in fp32 - much slower on pre-Ampere cards, but pytorch allows you to enable TF32 on Ampere - so you should get speed somewhat closer to fp16 while using the normal fp32 mode.

For 3, make sure you use torch>=1.10 and enable:

torch.backends.cuda.matmul.allow_tf32 = True

https://pytorch.org/docs/master/notes/numerical_accuracy.html#tensorfloat-32-tf32-on-nvidia-ampere-devices

I recommend you try 3 first, then 2, and then 1.


stas00 commented Nov 27, 2021

And DS has recently added bf16 support:
https://www.deepspeed.ai/docs/config-json/#bfloat16-options

"bfloat16": {
   "enabled": true
 }

So that's option 4 to try with deepspeed - just replace the fp16 section with the above one and don't pass --fp16.

I think it currently only works with ZeRO stage 2.

@tuhinjubcse

Stas, you are amazing and I appreciate all the help and the fast turnaround. I am just trying to understand: if I use option 3 (fp32), won't it eventually give me OOM? I just wanted to let you know that my entire research question rests on the ability to finetune T5-11B, so unless that works, t5-large/small/3B doesn't really help me.

Just to be sure and consistent: I have 4 A100 GPUs, so could you tell me what would be the best way for me to use T5-11B? I am trying to reproduce https://arxiv.org/abs/2110.08207 and honestly it's been a bit difficult for me to get T5-11B to train :(

@tuhinjubcse

I got t5-large to work with fp32, but of course got OOM with T5-11B in fp32 under zero2, even at batch size 1. I'd appreciate any help here.

@tuhinjubcse

Option 4 gave me this:

File "./run_translation.py", line 622, in <module>
    main()
  File "./run_translation.py", line 539, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1317, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1857, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1889, in compute_loss
    outputs = model(**inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 1599, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1574, in forward
    encoder_outputs = self.encoder(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1004, in forward
    layer_outputs = layer_module(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 639, in forward
    self_attention_outputs = self.layer[0](
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 546, in forward
    attention_output = self.SelfAttention(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 472, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: expected scalar type Float but found BFloat16


stas00 commented Nov 27, 2021

The first step is to make things work w/o overflow, the second step is dealing with memory.

As bf16 is all new it will take some time to fully sort it out. You can try solution (1) as well - it might just work.

So was your fp32 OOM w/ or w/o deepspeed?

fp32 takes about the same amount of memory as fp16 mixed precision, because the latter still allocates 4 bytes per param for the fp32 master weights. So the latter saves some memory in some places, but uses more memory in others. fp16 AMP is really about an up to ~5x speedup, not about saving memory.

Here are the next things to try:

Experiment A. Try deepspeed with both fp16 and bf16 disabled and stage 2 (your current setup), and add this at the top of run_translation.py:

import torch
torch.backends.cuda.matmul.allow_tf32 = True

How does that fare?

Experiment B. Same as A, but use stage 3 in the config file, and ensure your cpu offload is enabled - the default config file from the docs will do.

I of course assume you're also using torch==1.10 and a fairly recent CUDA - at least 11.3.


Re: bf16 support in deepspeed - I haven't tried it myself yet, as it was literally just added. I will give it a try.


stas00 commented Nov 27, 2021

Additionally, I know you're trying to use Adafactor, but if nothing else works right away and you're in a hurry, one other thing to consider is the 8-bit AdamW optimizer from https://github.com/facebookresearch/bitsandbytes. It will save you 6 out of the 8 bytes per param that the optimizer states take. This is a huge memory saving, hence the suggestion.

Here is the per-param saving breakdown (a rough total is sketched right after the list):

  • fp32: from 16 (8+4+4) to 10 (2+4+4) bytes per param
  • fp16 or bf16 mixed precision: from 18 (8+4+4+2) to 12 (2+4+4+2) bytes per param
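
In absolute terms for an 11B-parameter model that is roughly (a back-of-envelope sketch that ignores ZeRO sharding across your 4 GPUs and activation/temporary memory):

params = 11e9
GiB = 2 ** 30
for label, bytes_per_param in [
    ("fp32 + AdamW", 16),
    ("fp32 + 8-bit AdamW", 10),
    ("fp16/bf16 mixed + AdamW", 18),
    ("fp16/bf16 mixed + 8-bit AdamW", 12),
]:
    print(f"{label:32s} ~{params * bytes_per_param / GiB:,.0f} GiB total")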

We are testing it (BNB) out right now at BigScience and so far it tracks the normal AdamW performance quality-wise.

The main issue with BNB is that it needs an Embed norm, which transformers models don't have at the moment. So we need to discuss this.


tuhinjubcse commented Nov 28, 2021

It turns out that with zero3 and fp32 it works. I was training it and it went OOM after 25% of training, so I reduced the batch size from 16 to 12. If it still fails I will fall back to 8. It's definitely taking more time, but at least it's working.

Time to load utils op: 0.0010249614715576172 seconds
[INFO|trainer.py:1196] 2021-11-27 22:54:13,786 >> ***** Running training *****
[INFO|trainer.py:1197] 2021-11-27 22:54:13,786 >>   Num examples = 772073
[INFO|trainer.py:1198] 2021-11-27 22:54:13,786 >>   Num Epochs = 1
[INFO|trainer.py:1199] 2021-11-27 22:54:13,786 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1200] 2021-11-27 22:54:13,786 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1201] 2021-11-27 22:54:13,786 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1202] 2021-11-27 22:54:13,786 >>   Total optimization steps = 12064
  2%|██▍                                                                                                                                                             | 182/12064 [23:29<24:13:05,  7.34s/it]


{'loss': 3.2285, 'learning_rate': 0.001, 'epoch': 0.04}                                                                                                                                                     
{'loss': 3.0005, 'learning_rate': 0.001, 'epoch': 0.08}                                                                                                                                                     
{'loss': 2.8807, 'learning_rate': 0.001, 'epoch': 0.12}                                                                                                                                                     
 17%|██████████████████████████                                                                                                                                   | 1999/12064 [4:02:17<20:06:29,  7.19s/it][2021-11-28 02:56:38,748] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2021-11-28 02:56:38,749] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=8.819555807358741
{'loss': 2.7952, 'learning_rate': 0.001, 'epoch': 0.17}                                                                                                                                                     
{'loss': 2.7062, 'learning_rate': 0.001, 'epoch': 0.21}                                                                                                                                                     
{'loss': 2.6237, 'learning_rate': 0.001, 'epoch': 0.25}                                                                                                                                                     
 25%|███████████████████████████████████████▏                                                                                                                     | 3010/12064 [6:04:11<18:12:50,  7.24s/it]Traceback (most recent call last):
  File "./run_translation.py", line 621, in <module>
    main()
  File "./run_translation.py", line 538, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1317, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1857, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1889, in compute_loss
    outputs = model(**inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 1599, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1611, in forward
    decoder_outputs = self.decoder(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1004, in forward
    layer_outputs = layer_module(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 665, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 580, in forward
    attention_output = self.EncDecAttention(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 518, in forward
    attn_output = self.o(attn_output)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
    result = hook(self, input)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1476, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1588, in pre_sub_module_forward_function
    self.param_coordinator.fetch_sub_module(sub_module)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 448, in fetch_sub_module
    self._all_gather(partitioned_params, async_op=False)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 525, in _all_gather
    handles = partitioned_params[0].all_gather(
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 595, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 704, in _all_gather
    ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 936, in _allgather_params_coalesced
    flat_tensor = torch.empty(tensor_size,
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 2; 39.59 GiB total capacity; 35.71 GiB already allocated; 56.94 MiB free; 36.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
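For reference, the allocator hint at the end of that error maps to an environment variable set before launching. This is only a sketch (the 128 MiB value is an illustration, not something tried in this thread) and complements, rather than replaces, lowering the batch size:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
deepspeed --num_gpus=4 ./run_translation.py ...   # same arguments as before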

@stas00
Contributor

stas00 commented Nov 29, 2021

That's progress.

Additionally, have you made sure that you have torch.backends.cuda.matmul.allow_tf32 = True and that you're using torch==1.10 and a fairly recent CUDA (at least cuda=11.3)? You haven't confirmed that.
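For reference, a minimal way to check these settings from Python (a sketch, not part of the original comment):

import torch

# TF32 matmuls on Ampere (A100) trade a tiny amount of precision for a large speedup
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True  # optionally also enable TF32 for cuDNN convolutions

print(torch.__version__)   # expect 1.10+
print(torch.version.cuda)  # expect 11.3+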

@tuhinjubcse
Author

Yes, done, and confirming:
torch.backends.cuda.matmul.allow_tf32 = True
torch==1.10
cuda=11.3

@stas00
Contributor

stas00 commented Nov 29, 2021

Watch this PR: microsoft/DeepSpeed#1453
As soon as it's merged you can have the speed back with bf16/ZeRO-3 under Deepspeed on Ampere.

I guess you can already try it now if you are in need.

@tuhinjubcse
Author

I can confirm I could train and evaluate using fp32 and zero3. It does take 28 hours, even using 4 GPUs.
Is there any way to make this faster? I am not entirely sure I understand your last comment; what should I change at my end to enable the PR?

@stas00
Contributor

stas00 commented Dec 1, 2021

Thank you for the confirmation, @tuhinjubcse, that it works just not very fast.

To be faster you want bf16-support, which is a work in progress.

The plan is as following:

  1. complete and merge: Support for Training with BF16 #13207 (mostly done, just tweaking docs)
  2. complete and merge: Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) microsoft/DeepSpeed#1453 (promised to be done soon - I have no control there)
  3. meanwhile I will start working on integrating 1 and 2 here: [Deepspeed] add support for bf16 mode #14569 - but I'm blocked by 2.

Once 3 is done, or at least once I have it working, you should be able to use bf16 w/ the Deepspeed/HF integration.

I will let you know once this happens.

@tuhinjubcse
Author

Many many thanks

@tuhinjubcse
Author

One thing I have been noticing is that my performance when using run_translation.py, which indirectly uses the Trainer, is significantly lower. In my earlier code, where I did not use the Trainer, my perplexity/loss was much better than what I am getting now. Are there any Trainer-specific hyperparameters that I might be missing?

This was my training code prior to deepspeed; you can see the train function:

https://github.com/tuhinjubcse/tuhinjubcse.github.io/blob/master/fine_tune_lm.py

@stas00
Contributor

stas00 commented Dec 3, 2021

But you're not using --fp16 now, which sometimes makes a huge difference, so it's not the code base that is different. And your original finetune script was using the Trainer; the script was rewritten, but it's the same Trainer.

That is not to say there is surely no regression here. We have been talking about adding speed regression tests, but so far we haven't gone beyond talking about it.

Once deepspeed releases bfloat16 support I think you should be back at a fast speed.

I will start working on the integration now, against the Deepspeed PR, now that we have completed --bf16 in transformers.

So perhaps you will have something to experiment with shortly. I will keep you posted.

@tuhinjubcse
Author

No, before finetune_trainer I was using something else, as you can see in the link above.
As an experiment I tried model.parallelize() with T5-3B, just to see what happens without deepspeed, and honestly it's surprising that the evaluation loss is lower for T5-3B with model.parallelize() than for T5-11B with deepspeed.

I would expect that since T5-11B is a bigger model it should give better performance anyway.

I will put together a comparative result for T5-3B using model.parallelize() and deepspeed. I am wondering if there is performance degradation with deepspeed.

@stas00
Contributor

stas00 commented Dec 4, 2021

Thank you for clarifying that you were talking about the script you used before finetune_trainer. I assumed it was finetune_trainer based on the name of your script, but I haven't read it as it's too long.

OK, so for you to understand what Deepspeed ZeRO does conceptually - it shards the tensors over multiple gpus and then at the point of calculation (forward/backward) it restores the tensors to their normal unsharded state, so the model doesn't see anything different - it has no idea ZeRO even exists. i.e. Deepspeed ZeRO itself doesn't change anything and can't make any difference to the math, and thus you should be getting an identical numerical output w/ or w/o ZeRO.
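A toy single-process illustration of that idea (plain PyTorch, not DeepSpeed code; the shapes and the number of shards are made up):

import torch

world_size = 4
full_weight = torch.randn(8, 8)

# each "rank" stores only a flat shard of the parameter
shards = list(torch.chunk(full_weight.flatten(), world_size))

# at forward/backward time the shards are all-gathered back into the original tensor
gathered = torch.cat(shards).view_as(full_weight)
assert torch.equal(gathered, full_weight)  # the model computes with exactly the same values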

Now, it's possible that you are using a different optimizer or lr scheduler when you're using Deepspeed, since it lets you use your own or provides its own, and they aren't identical most of the time. And you could be mismatching on whatever other hparams are involved. So when comparing such things you need to make sure you compare oranges to oranges.
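For example (a sketch, not the exact file used in this thread): if the DeepSpeed config contains explicit optimizer/scheduler blocks, those are what actually run, regardless of Trainer flags such as --adafactor, so both setups need to be checked against something like:

"optimizer": {
  "type": "AdamW",
  "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
},
"scheduler": {
  "type": "WarmupLR",
  "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" }
}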

Besides Deepspeed you have transformers which also changes over time and it could also have regressions.

Now, bigger models typically give better performance but usually they take much longer to get to the same range of loss.

Now that you understand these things, let's focus on how you could get better results faster, since we want the same thing.

@stas00
Contributor

stas00 commented Dec 5, 2021

Deepspeed/bf16/zero2 should work with #14569

Please let me know if you run into any problems if you choose to try that branch, and follow up directly in that PR's comments if you do.

To use bf16 you need 2 things:

  1. you just need to add --bf16 to your command line
  2. use a new config file that enables bf16. The PR includes a sample config file tests/deepspeed/ds_config_zero2_bf16.json.

On the deepspeed side I'm not sure if you just need deepspeed@master, or you actually need this branch: microsoft/DeepSpeed#1453 - I was testing with the latter.

zero3 doesn't seem to be ready on the deepspeed side. But it's all ready on the transformers side.

p.s. remember zero2 doesn't shard params, so it will be more memory demanding.

p.p.s. I think I need to do some tweaks to t5 models as well to save more memory for bf16 - I will have a look in the next few days.
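For illustration, assuming the sample file mirrors the existing zero2 config, the usage would look roughly like this (a sketch; defer to tests/deepspeed/ds_config_zero2_bf16.json in the PR for the exact contents):

deepspeed --num_gpus=4 ./run_translation.py ... --bf16 \
    --deepspeed tests/deepspeed/ds_config_zero2_bf16.json

with a bf16 block in the config in place of the fp16 one:

"bf16": {
  "enabled": true
}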

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 7, 2022
@alexcoca

alexcoca commented Mar 3, 2023

och

For future reference, I could help with LR logging from Adafactor, as I have successfully monitored it and used it for research. However, I think I was using a setting where the LR is not automatically inferred by the optimizer, which is what the Google folks actually do when optimising this. That said, I have only trained T5-base variants so far... Soon to try XXL :)

@alexcoca

alexcoca commented Mar 3, 2023

Stas you are amazing and I appreciate all the help and fast turnaround. I am just trying to understand: if I use OPTION 3 (fp32), won't it give me OOM eventually? I just wanted to let you know that my entire research question rests on the ability to finetune T5-11B, so unless that works, t5-large/small/3B doesn't really help me.

Just to be sure and consistent: I have 4 A100 GPUs, so can you tell me what would be the best way for me to use T5-11B? I am trying to reproduce (https://arxiv.org/abs/2110.08207) and honestly it's been a bit difficult for me to get T5-11B to train :(

Are these 40GB or 80 GB A100s?

@alexgshaw

alexgshaw commented Jul 6, 2023

I'm running into this exact same issue except with bf16 and llama 13b+ combo.

Turning off bf16 fixes it, but I then can't fit 65b onto my GPUs. Any idea why bf16 is causing problems?

@LeopoldACC

I'm running into this exact same issue except with bf16 and llama 13b+ combo.

Turning off bf16 fixes it, but I then can't fit 65b onto my GPUs. Any idea why bf16 is causing problems?

I also hit the same error with a ds_stage2/bf16 setup and the Baichuan-13B model. About "Turning off bf16 fixes it": does that mean using fp32, or fp16?

@alexgshaw

I ran into issues with fp16 as well, so I used fp32.

@bilguunchinzorigEPFL

For those who do not have enough GPU memory to train the model in full precision: I fixed this issue by decreasing "initial_scale_power" in the fp16 section of the deepspeed config from 16 to 2.
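For reference, that corresponds to something like this in the config (a sketch); initial_scale_power is the exponent of the initial dynamic loss scale, so this starts the scaler at 2^2 = 4 instead of 2^16 = 65536:

"fp16": {
  "enabled": true,
  "loss_scale": 0,
  "initial_scale_power": 2
}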
