Reproduction

CUDA_VISIBLE_DEVICES=0 python src/train.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /xxx/Llama-2-7b-hf \
    --finetuning_type lora \
    --template default \
    --dataset alpaca_gpt4_en \
    --cutoff_len 1024 \
    --learning_rate 0.0001 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 10000 \
    --warmup_ratio 0.01 \
    --val_size 0.1 \
    --per_device_eval_batch_size 16 \
    --evaluation_strategy steps \
    --eval_steps 5000 \
    --optim adamw_torch \
    --report_to wandb \
    --output_dir saves/llama2-7b-lora-baseline-qv-r8-rotate-noreplace/ \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target q_proj,v_proj \
    --plot_loss True \
    --load_best_model_at_end
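For reference, here is a minimal sketch of roughly what the LoRA-related flags above correspond to in a standalone PEFT setup. This is not LLaMA-Factory's internal code; it assumes that --lora_rank, --lora_alpha, --lora_dropout, and --lora_target are forwarded to an equivalent peft.LoraConfig.

```python
# Minimal sketch (assumption: LLaMA-Factory maps the CLI flags above to an
# equivalent peft.LoraConfig; this is not its internal implementation).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/xxx/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # --lora_rank 8
    lora_alpha=16,                        # --lora_alpha 16
    lora_dropout=0.0,                     # --lora_dropout 0
    target_modules=["q_proj", "v_proj"],  # --lora_target q_proj,v_proj
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```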
Expected behavior

max_grad_norm is supposed to be used for gradient clipping, but in my tests it does not seem to take effect: even with max_grad_norm set, gradient norm values larger than max_grad_norm still show up.
What is recorded here is the norm before clipping.
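That matches the behavior of PyTorch's clipping utility: torch.nn.utils.clip_grad_norm_ rescales the gradients in place so that their total norm is at most max_norm, but its return value is the total norm measured before clipping, and that returned value is typically what ends up in the training logs. A minimal, standalone sketch (not LLaMA-Factory code) illustrating the difference:

```python
import torch

# Toy model with deliberately inflated gradients.
model = torch.nn.Linear(16, 16)
loss = (model(torch.randn(4, 16)) ** 2).sum() * 100.0
loss.backward()

max_grad_norm = 1.0

# clip_grad_norm_ scales the gradients in place so their total norm is at most
# max_grad_norm, but it RETURNS the norm computed BEFORE clipping.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

# Recompute the total norm after clipping to confirm it respects the limit.
post_clip_norm = torch.norm(
    torch.stack([p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None])
)

print(f"pre-clip norm (the logged value):  {pre_clip_norm.item():.3f}")   # can exceed 1.0
print(f"post-clip norm (actual gradients): {post_clip_norm.item():.3f}")  # <= 1.0
```

So seeing logged gradient norms above max_grad_norm does not mean clipping was skipped; the clipped gradients are what the optimizer actually uses.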