Skip to content

New example: EMPO2#524

Merged
ultmaster merged 13 commits intomicrosoft:mainfrom
beanie00:feature/empo2-final
Apr 29, 2026
Merged

New example: EMPO2#524
ultmaster merged 13 commits intomicrosoft:mainfrom
beanie00:feature/empo2-final

Conversation

@beanie00
Copy link
Copy Markdown
Contributor

No description provided.

@beanie00 beanie00 force-pushed the feature/empo2-final branch 2 times, most recently from fcc121e to 5c38bb1 Compare April 28, 2026 20:48
Copilot AI review requested due to automatic review settings April 29, 2026 03:15
@beanie00 beanie00 force-pushed the feature/empo2-final branch from 5c38bb1 to 1929399 Compare April 29, 2026 03:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an EMPO² training example/integration for the env recipes, including new memory/embedding helper servers and trainer/daemon changes to support tips + off-policy updates.

Changes:

  • Update train_env_agent.py defaults/workflow to run EMPO² (env scienceworld2, new algorithm config selection, start/stop helper processes).
  • Add EMPO² helper servers (server_bert.py, server_mem.py) + new EMPO² agent implementation.
  • Extend env-VERL trainer/daemon to support tips-driven training modes, off-policy log-prob computation, and GRPO rollout-level advantage aggregation.

Reviewed changes

Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
contrib/recipes/envs/train_env_agent.py Switch defaults to EMPO² recipe; adds process cleanup and EMPO² server startup/reset.
contrib/recipes/envs/prompt_builder.py Aligns get_instruction_prompt() callsite with updated env API.
contrib/recipes/envs/empo2_server/server_mem.py Adds memory store service for EMPO².
contrib/recipes/envs/empo2_server/server_bert.py Adds embedding/key service for EMPO² via SentenceTransformer.
contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_7b_instruct.yaml Adds GRPO config variant (7B).
contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_1.5b_instruct.yaml Adds GRPO config variant (1.5B).
contrib/recipes/envs/config_verl/scienceworld/empo2_qwen_7b_instruct.yaml Adds EMPO² config (tips/low-prob masking hooks).
contrib/recipes/envs/config_verl/alfworld/grpo_qwen_1.5b_instruct.yaml Adds ALFWorld GRPO config variant.
contrib/recipes/envs/config_env/scienceworld2.yaml Adds new recipe env config file used by EMPO² example.
contrib/recipes/envs/clean.sh Adds cleanup script for Ray/AgentLightning/VLLM processes.
contrib/recipes/envs/add_instruction.py Adds “tip” instruction and helpers to inject tips into chat prompts.
contrib/recipes/envs/README.md Documents new algorithm names and EMPO² run instructions.
contrib/agentlightning/contrib/algorithm/env_verl/trainer.py Adds tips/off-policy flow and GRPO rollout-level advantage path; adds low-prob masking hook.
contrib/agentlightning/contrib/algorithm/env_verl/daemon.py Adds tip removal for off-policy log-prob computation; adds max_train_length + logging.
contrib/agentlightning/contrib/algorithm/env_verl/core_empo2.py Adds tip-tag stripping + low-prob masking utility functions.
contrib/agentlightning/contrib/agent/env_agent.py Records original vs executed action based on config.
contrib/agentlightning/contrib/agent/empo2_agent.py Introduces EMPO² agent that queries memory servers and generates/stores tips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread contrib/agentlightning/contrib/agent/empo2_agent.py
Comment thread contrib/agentlightning/contrib/agent/empo2_agent.py Outdated
Comment thread contrib/agentlightning/contrib/algorithm/env_verl/daemon.py
Comment thread contrib/agentlightning/contrib/algorithm/env_verl/daemon.py Outdated
Comment on lines +74 to +77
for i in range(bsz):
rid = rollout_index[i]
rollout_score[rid] += float(token_level_rewards[i].sum().item())
rollout_to_task[rid] = task_index[i]
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rollout scores are computed via token_level_rewards[i].sum() without applying response_mask. If token_level_rewards contains non-zero values in padded positions (or if shapes vary and padding occurs), this will skew rollout-level normalization. Use a masked sum (e.g., multiply by response_mask[i] or sum only where mask==1) to ensure only valid response tokens contribute.

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +108
subprocess.Popen(f"nohup python empo2_server/server_bert.py > logs/bert_{args.task_num}.log 2>&1 &", shell=True)
subprocess.Popen(f"nohup python empo2_server/server_mem.py > logs/mem_{args.task_num}.log 2>&1 &", shell=True)

NUM_MEMORY = 5
time.sleep(1)
reset_memory(NUM_MEMORY)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting long-running servers via Popen(..., shell=True) while also using nohup and & makes process management brittle (no PID tracking, unclear lifetime, harder to terminate reliably) and time.sleep(1) is a weak readiness check. Prefer starting the server processes directly (no nohup/&), capturing their PIDs, and replacing the fixed sleep with a simple health check loop on ports 8000/8001 before calling reset_memory().

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +34
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000, workers=num_works)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binding the embedding service to 0.0.0.0 exposes it on all interfaces with no authentication. If this is intended only for local training, bind to 127.0.0.1 (or add an explicit --host/--port config and default to localhost) to reduce accidental exposure.

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +74
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8001, workers=num_works)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as the BERT/key server: binding to 0.0.0.0 exposes an unauthenticated memory store API externally. Default to 127.0.0.1 or require an explicit opt-in to listen on all interfaces.

Copilot uses AI. Check for mistakes.
Comment thread contrib/recipes/envs/empo2_server/server_mem.py Outdated
Comment thread contrib/recipes/envs/add_instruction.py
@ultmaster ultmaster merged commit 0b40cb7 into microsoft:main Apr 29, 2026
24 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants