New example: EMPO2#524
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an EMPO² training example/integration for the env recipes, including new memory/embedding helper servers and trainer/daemon changes to support tips + off-policy updates.
Changes:
- Update `train_env_agent.py` defaults/workflow to run EMPO² (env `scienceworld2`, new algorithm config selection, start/stop helper processes).
- Add EMPO² helper servers (`server_bert.py`, `server_mem.py`) plus a new EMPO² agent implementation.
- Extend the env-VERL trainer/daemon to support tips-driven training modes, off-policy log-prob computation, and GRPO rollout-level advantage aggregation.
Reviewed changes
Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| contrib/recipes/envs/train_env_agent.py | Switch defaults to EMPO² recipe; adds process cleanup and EMPO² server startup/reset. |
| contrib/recipes/envs/prompt_builder.py | Aligns get_instruction_prompt() callsite with updated env API. |
| contrib/recipes/envs/empo2_server/server_mem.py | Adds memory store service for EMPO². |
| contrib/recipes/envs/empo2_server/server_bert.py | Adds embedding/key service for EMPO² via SentenceTransformer. |
| contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_7b_instruct.yaml | Adds GRPO config variant (7B). |
| contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_1.5b_instruct.yaml | Adds GRPO config variant (1.5B). |
| contrib/recipes/envs/config_verl/scienceworld/empo2_qwen_7b_instruct.yaml | Adds EMPO² config (tips/low-prob masking hooks). |
| contrib/recipes/envs/config_verl/alfworld/grpo_qwen_1.5b_instruct.yaml | Adds ALFWorld GRPO config variant. |
| contrib/recipes/envs/config_env/scienceworld2.yaml | Adds new recipe env config file used by EMPO² example. |
| contrib/recipes/envs/clean.sh | Adds cleanup script for Ray/AgentLightning/VLLM processes. |
| contrib/recipes/envs/add_instruction.py | Adds “tip” instruction and helpers to inject tips into chat prompts. |
| contrib/recipes/envs/README.md | Documents new algorithm names and EMPO² run instructions. |
| contrib/agentlightning/contrib/algorithm/env_verl/trainer.py | Adds tips/off-policy flow and GRPO rollout-level advantage path; adds low-prob masking hook. |
| contrib/agentlightning/contrib/algorithm/env_verl/daemon.py | Adds tip removal for off-policy log-prob computation; adds max_train_length + logging. |
| contrib/agentlightning/contrib/algorithm/env_verl/core_empo2.py | Adds tip-tag stripping + low-prob masking utility functions. |
| contrib/agentlightning/contrib/agent/env_agent.py | Records original vs executed action based on config. |
| contrib/agentlightning/contrib/agent/empo2_agent.py | Introduces EMPO² agent that queries memory servers and generates/stores tips. |
```python
for i in range(bsz):
    rid = rollout_index[i]
    rollout_score[rid] += float(token_level_rewards[i].sum().item())
    rollout_to_task[rid] = task_index[i]
```

Rollout scores are computed via `token_level_rewards[i].sum()` without applying `response_mask`. If `token_level_rewards` contains non-zero values in padded positions (or if shapes vary and padding occurs), this will skew rollout-level normalization. Use a masked sum (e.g., multiply by `response_mask[i]`, or sum only where mask == 1) so that only valid response tokens contribute.
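A masked-sum version could look like the sketch below. The function name and the dict-based accumulators are illustrative (the PR presumably uses pre-allocated tensors or defaultdicts); the shapes are assumed from the snippet above, i.e. `token_level_rewards` and `response_mask` both `(bsz, seq_len)`:

```python
import torch

def accumulate_rollout_scores(token_level_rewards, response_mask,
                              rollout_index, task_index):
    """Accumulate per-rollout scores, counting only valid response tokens.

    token_level_rewards, response_mask: (bsz, seq_len) tensors.
    rollout_index, task_index: (bsz,) index tensors.
    """
    rollout_score = {}
    rollout_to_task = {}
    bsz = token_level_rewards.size(0)
    for i in range(bsz):
        rid = int(rollout_index[i])
        # Masked sum: padded positions (mask == 0) contribute nothing.
        masked = (token_level_rewards[i] * response_mask[i]).sum().item()
        rollout_score[rid] = rollout_score.get(rid, 0.0) + float(masked)
        rollout_to_task[rid] = int(task_index[i])
    return rollout_score, rollout_to_task
```

With this change, a reward accidentally left in a padded position (mask 0) no longer leaks into the rollout-level mean/std used for advantage normalization.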
```python
subprocess.Popen(f"nohup python empo2_server/server_bert.py > logs/bert_{args.task_num}.log 2>&1 &", shell=True)
subprocess.Popen(f"nohup python empo2_server/server_mem.py > logs/mem_{args.task_num}.log 2>&1 &", shell=True)

NUM_MEMORY = 5
time.sleep(1)
reset_memory(NUM_MEMORY)
```

Starting long-running servers via `Popen(..., shell=True)` while also using `nohup` and `&` makes process management brittle (no PID tracking, unclear lifetime, harder to terminate reliably), and `time.sleep(1)` is a weak readiness check. Prefer starting the server processes directly (no `nohup`/`&`), capturing their PIDs, and replacing the fixed sleep with a simple health-check loop on ports 8000/8001 before calling `reset_memory()`.
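A minimal sketch of that suggestion; the helper names `start_server` and `wait_for_port` are illustrative, not part of the PR:

```python
import socket
import subprocess
import sys
import time

def start_server(script, log_path):
    """Launch a helper server directly (no shell, no nohup) and keep its handle."""
    log = open(log_path, "w")
    return subprocess.Popen([sys.executable, script],
                            stdout=log, stderr=subprocess.STDOUT)

def wait_for_port(port, host="127.0.0.1", timeout=30.0):
    """Poll until a TCP port accepts connections, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"server on port {port} did not become ready")
```

The caller can then keep the `Popen` handles, wait on both ports before `reset_memory(NUM_MEMORY)`, and call `proc.terminate()` on each handle during cleanup instead of pattern-matching process names in `clean.sh`.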
```python
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=num_works)
```

Binding the embedding service to `0.0.0.0` exposes it on all interfaces with no authentication. If this is intended only for local training, bind to `127.0.0.1` (or add an explicit `--host`/`--port` config defaulting to localhost) to reduce accidental exposure.
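One way to sketch the opt-in, assuming the servers keep their current `uvicorn` entry point (the `parse_args` helper is illustrative):

```python
import argparse

def parse_args(argv=None):
    """Bind options: loopback by default; 0.0.0.0 only on explicit opt-in."""
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="127.0.0.1",
                   help="bind address; pass 0.0.0.0 only to expose the API externally")
    p.add_argument("--port", type=int, default=8000)
    return p.parse_args(argv)

# In server_bert.py / server_mem.py the entry point would then become:
#   args = parse_args()
#   uvicorn.run(app, host=args.host, port=args.port, workers=num_works)
```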
```python
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001, workers=num_works)
```

Same concern as the BERT/key server: binding to `0.0.0.0` exposes an unauthenticated memory-store API externally. Default to `127.0.0.1` or require an explicit opt-in to listen on all interfaces.