Hi,
I am trying to run GRPO training by following the README, using exactly the same commands it gives:
(1) CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
(2) CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 7 \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
However, after the first update, every response from the model becomes "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", and I am not sure why.
Did I miss something, or has anyone else encountered the same problem?
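In case it helps with reproducing or narrowing this down, here is a minimal sketch of how such collapsed completions could be flagged automatically, for example to find the exact step at which they start. The helper names `is_degenerate` and `degenerate_fraction` and the `min_len` threshold are hypothetical and are not part of open-r1 or TRL; the check only looks at plain strings.

```python
from typing import Iterable


def is_degenerate(text: str, min_len: int = 8) -> bool:
    """Return True if the text is one character repeated at least min_len times.

    min_len is an arbitrary, assumed threshold so that short legitimate
    answers such as "!!" are not flagged.
    """
    stripped = text.strip()
    return len(stripped) >= min_len and len(set(stripped)) == 1


def degenerate_fraction(completions: Iterable[str]) -> float:
    """Return the share of completions that look collapsed (0.0 if empty)."""
    items = list(completions)
    if not items:
        return 0.0
    return sum(is_degenerate(c) for c in items) / len(items)


if __name__ == "__main__":
    # Hypothetical sample: one collapsed completion, one normal one.
    sample = ["!" * 39, "The answer is 42."]
    print(degenerate_fraction(sample))
```

Running this on the completions logged before and after the first update would show whether the collapse is total or only affects part of the batch.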
Thanks!