Description
System Info
- transformers version: 4.44.0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.11.9
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.32.1
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: bf16
  - use_cpu: False
  - debug: True
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': True, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
  - dynamo_config: {'dynamo_backend': 'EAGER'}
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-80GB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder
- My own task or dataset (give details below)
Reproduction
In TRL's PPOv2Trainer TLDR example, run the default command:
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/ppo/ppo_tldr.py \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--total_episodes 1000000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--local_rollout_forward_batch_size 16 \
--non_eos_penalty \
--stop_token eos
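To inspect the logged values directly rather than only in the dashboard plots, here is a minimal sketch of my own (not part of the example script) for pulling the objective/entropy scalars out of the run. It assumes TensorBoard logging is enabled (e.g. via --report_to tensorboard) and that the event directory below is adjusted to wherever the run actually writes its event files:

# Sketch for checking how many logged objective/entropy values are negative.
# Assumes TensorBoard event files exist; the directory is a placeholder.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

event_dir = "models/minimal/ppo_tldr/runs"  # placeholder path, adjust to your run
acc = EventAccumulator(event_dir)
acc.Reload()

scalars = acc.Scalars("objective/entropy")  # tag name as documented by TRL
negative = [s for s in scalars if s.value < 0]
print(f"{len(negative)} of {len(scalars)} logged entropy values are negative")
print("first few negative values:", [(s.step, round(s.value, 3)) for s in negative[:5]])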
Expected behavior
The entropy of a discrete distribution (such as a language model's next-token distribution) must be non-negative. However, when I run the official example, the logged objective/entropy values can be negative.
I don't think I'm making a mistake, because the same negative entropy appears in the training curves shown in the official documentation; look early in training, at around 20k episodes.
The documentation describes objective/entropy as "The mean entropy of the policy, indicating the randomness of the actions chosen by the policy." If this is incorrect, and some other quantity is computed instead, then perhaps the documentation needs to be updated?