New example: EMPO2#524
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an EMPO² training example/integration for the env recipes, including new memory/embedding helper servers and trainer/daemon changes to support tips + off-policy updates.
Changes:
- Update `train_env_agent.py` defaults/workflow to run EMPO² (env `scienceworld2`, new algorithm config selection, start/stop helper processes).
- Add EMPO² helper servers (`server_bert.py`, `server_mem.py`) plus a new EMPO² agent implementation.
- Extend the env-VERL trainer/daemon to support tips-driven training modes, off-policy log-prob computation, and GRPO rollout-level advantage aggregation.
Reviewed changes
Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| contrib/recipes/envs/train_env_agent.py | Switch defaults to EMPO² recipe; adds process cleanup and EMPO² server startup/reset. |
| contrib/recipes/envs/prompt_builder.py | Aligns get_instruction_prompt() callsite with updated env API. |
| contrib/recipes/envs/empo2_server/server_mem.py | Adds memory store service for EMPO². |
| contrib/recipes/envs/empo2_server/server_bert.py | Adds embedding/key service for EMPO² via SentenceTransformer. |
| contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_7b_instruct.yaml | Adds GRPO config variant (7B). |
| contrib/recipes/envs/config_verl/scienceworld/grpo_qwen_1.5b_instruct.yaml | Adds GRPO config variant (1.5B). |
| contrib/recipes/envs/config_verl/scienceworld/empo2_qwen_7b_instruct.yaml | Adds EMPO² config (tips/low-prob masking hooks). |
| contrib/recipes/envs/config_verl/alfworld/grpo_qwen_1.5b_instruct.yaml | Adds ALFWorld GRPO config variant. |
| contrib/recipes/envs/config_env/scienceworld2.yaml | Adds new recipe env config file used by EMPO² example. |
| contrib/recipes/envs/clean.sh | Adds cleanup script for Ray/AgentLightning/VLLM processes. |
| contrib/recipes/envs/add_instruction.py | Adds “tip” instruction and helpers to inject tips into chat prompts. |
| contrib/recipes/envs/README.md | Documents new algorithm names and EMPO² run instructions. |
| contrib/agentlightning/contrib/algorithm/env_verl/trainer.py | Adds tips/off-policy flow and GRPO rollout-level advantage path; adds low-prob masking hook. |
| contrib/agentlightning/contrib/algorithm/env_verl/daemon.py | Adds tip removal for off-policy log-prob computation; adds max_train_length + logging. |
| contrib/agentlightning/contrib/algorithm/env_verl/core_empo2.py | Adds tip-tag stripping + low-prob masking utility functions. |
| contrib/agentlightning/contrib/agent/env_agent.py | Records original vs executed action based on config. |
| contrib/agentlightning/contrib/agent/empo2_agent.py | Introduces EMPO² agent that queries memory servers and generates/stores tips. |
```python
for i in range(bsz):
    rid = rollout_index[i]
    rollout_score[rid] += float(token_level_rewards[i].sum().item())
    rollout_to_task[rid] = task_index[i]
```

Rollout scores are computed via `token_level_rewards[i].sum()` without applying `response_mask`. If `token_level_rewards` contains non-zero values in padded positions (or if shapes vary and padding occurs), this will skew rollout-level normalization. Use a masked sum (e.g., multiply by `response_mask[i]`, or sum only where mask == 1) so that only valid response tokens contribute.
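A masked-sum version could look like the sketch below. The function name and the dict-based accumulators are illustrative (the PR presumably uses pre-allocated tensors or defaultdicts); the shapes are assumed from the snippet above, i.e. `token_level_rewards` and `response_mask` both `(bsz, seq_len)`:

```python
import torch

def accumulate_rollout_scores(token_level_rewards, response_mask,
                              rollout_index, task_index):
    """Accumulate per-rollout scores, counting only valid response tokens.

    token_level_rewards, response_mask: (bsz, seq_len) tensors.
    rollout_index, task_index: (bsz,) index tensors.
    """
    rollout_score = {}
    rollout_to_task = {}
    bsz = token_level_rewards.size(0)
    for i in range(bsz):
        rid = int(rollout_index[i])
        # Masked sum: padded positions (mask == 0) contribute nothing.
        masked = (token_level_rewards[i] * response_mask[i]).sum().item()
        rollout_score[rid] = rollout_score.get(rid, 0.0) + float(masked)
        rollout_to_task[rid] = int(task_index[i])
    return rollout_score, rollout_to_task
```

With this change, a reward accidentally left in a padded position (mask 0) no longer leaks into the rollout-level mean/std used for advantage normalization.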
```python
subprocess.Popen(f"nohup python empo2_server/server_bert.py > logs/bert_{args.task_num}.log 2>&1 &", shell=True)
subprocess.Popen(f"nohup python empo2_server/server_mem.py > logs/mem_{args.task_num}.log 2>&1 &", shell=True)

NUM_MEMORY = 5
time.sleep(1)
reset_memory(NUM_MEMORY)
```

Starting long-running servers via `Popen(..., shell=True)` while also using `nohup` and `&` makes process management brittle (no PID tracking, unclear lifetime, harder to terminate reliably), and `time.sleep(1)` is a weak readiness check. Prefer starting the server processes directly (no `nohup`/`&`), capturing their PIDs, and replacing the fixed sleep with a simple health-check loop on ports 8000/8001 before calling `reset_memory()`.
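A minimal sketch of that suggestion; the helper names `start_server` and `wait_for_port` are illustrative, not part of the PR:

```python
import socket
import subprocess
import sys
import time

def start_server(script, log_path):
    """Launch a helper server directly (no shell, no nohup) and keep its handle."""
    log = open(log_path, "w")
    return subprocess.Popen([sys.executable, script],
                            stdout=log, stderr=subprocess.STDOUT)

def wait_for_port(port, host="127.0.0.1", timeout=30.0):
    """Poll until a TCP port accepts connections, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"server on port {port} did not become ready")
```

The caller can then keep the `Popen` handles, wait on both ports before `reset_memory(NUM_MEMORY)`, and call `proc.terminate()` on each handle during cleanup instead of pattern-matching process names in `clean.sh`.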
```python
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=num_works)
```

Binding the embedding service to `0.0.0.0` exposes it on all interfaces with no authentication. If this is intended only for local training, bind to `127.0.0.1` (or add an explicit `--host`/`--port` config defaulting to localhost) to reduce accidental exposure.
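One way to sketch the opt-in, assuming the servers keep their current `uvicorn` entry point (the `parse_args` helper is illustrative):

```python
import argparse

def parse_args(argv=None):
    """Bind options: loopback by default; 0.0.0.0 only on explicit opt-in."""
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="127.0.0.1",
                   help="bind address; pass 0.0.0.0 only to expose the API externally")
    p.add_argument("--port", type=int, default=8000)
    return p.parse_args(argv)

# In server_bert.py / server_mem.py the entry point would then become:
#   args = parse_args()
#   uvicorn.run(app, host=args.host, port=args.port, workers=num_works)
```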
```python
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001, workers=num_works)
```

Same concern as the BERT/key server: binding to `0.0.0.0` exposes an unauthenticated memory-store API externally. Default to `127.0.0.1` or require an explicit opt-in to listen on all interfaces.