Responses are always "!!!!!!!!!!!!!!!!!!!!!!!!!" during GRPO training

Hi,

I am trying to run GRPO training as described in the README, using exactly the same commands:
(1) `CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
(2) `CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 7 \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml`

However, after the first update, every response from the model becomes "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", and I am not sure why.
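
In case it helps with reproducing this, here is a minimal sketch of how the collapsed outputs could be flagged automatically when inspecting logged completions. It is not part of open-r1 or TRL; the helper names (`dominant_char_ratio`, `is_degenerate`) and the 0.9 threshold are my own assumptions for illustration.

```python
from collections import Counter


def dominant_char_ratio(text: str) -> float:
    """Share of non-whitespace characters taken by the single most common one."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    most_common_count = Counter(chars).most_common(1)[0][1]
    return most_common_count / len(chars)


def is_degenerate(text: str, threshold: float = 0.9) -> bool:
    """Flag a completion that is almost entirely one repeated character.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return dominant_char_ratio(text) >= threshold


# Hypothetical completions: one collapsed output and one normal sentence.
completions = ["!" * 40, "The answer is 42."]
flags = [is_degenerate(c) for c in completions]
print(flags)
```

Running a check like this over the completions from each step would show at which update the outputs collapse.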

Did I miss something, or has anyone else encountered the same problem?

Thanks!
