[diffusion]: video_creator -> diffusion_video [audio]: new agent audio_generator, support doubao_tts by tallate · Pull Request #835 · inclusionAI/AWorld

tallate · 2026-03-27T08:52:41Z

No description provided.

[cast]: optimize [skills]: optimizer, text2agent

[audio]: new agent audio_generator, support doubao_tts

gemini-code-assist

Code Review

This pull request introduces a new Audio Agent powered by Doubao TTS, renames the "video_creator" agent to "diffusion_video," and adds two new skills: "optimizer" for agent enhancement and "text2agent" for automated agent creation. Key technical changes include the implementation of the AudioAgent base class, the DoubaoTTSProvider, and a refactor of the SEARCH_REPLACE tool to use structured parameters instead of a JSON string. Review feedback identifies a logic error in the diffusion configuration migration block and a potential regression in token limit calculations where the fallback was changed to zero. Additionally, the reviewer noted significant code duplication between the audio and diffusion configuration logic and pointed out several copy-paste naming inconsistencies in the new audio agent implementation.

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-cli/src/aworld_cli/console.py

+            # Migrate from legacy models.diffusion
+            current_config['models']['diffusion'] = current_config['models'].get('diffusion') or {}
+            current_config['models'].pop('diffusion', None)


There appears to be a logic error in the migration block for the diffusion configuration. The code currently checks if 'diffusion' is not in current_config['models'], and if so, it attempts to get 'diffusion' from the same dictionary (which will be None), and then immediately removes it. This has no effect.

If the goal is to simply ensure the diffusion dictionary exists, this block should be simplified to match the pattern used for the new audio configuration.

Suggested change

-            # Migrate from legacy models.diffusion
-            current_config['models']['diffusion'] = current_config['models'].get('diffusion') or {}
-            current_config['models'].pop('diffusion', None)
+            current_config['models']['diffusion'] = {}
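
To illustrate why the migration block has no effect, here is a minimal sketch. It assumes the `if 'diffusion' not in ...` guard described in the comment above (the guard line is not part of the quoted diff), and the function names `migrate_as_written` and `ensure_section` are hypothetical, introduced only for this comparison.

```python
from typing import Any, Dict


def migrate_as_written(models: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror of the quoted block, with the guard the review describes (assumed)."""
    if 'diffusion' not in models:
        # .get() returns None here, so an empty dict is stored...
        models['diffusion'] = models.get('diffusion') or {}
        # ...and is removed again on the very next line.
        models.pop('diffusion', None)
    return models


def ensure_section(models: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified form, following the pattern used for models.audio."""
    if 'diffusion' not in models:
        models['diffusion'] = {}
    return models


print(migrate_as_written({}))  # {}
print(ensure_section({}))      # {'diffusion': {}}
```

Under that assumption, the block as written leaves the dictionary exactly as it found it, while the simplified form guarantees the key exists.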

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-cli/src/aworld_cli/core/context.py

                agent_stats.get("context_window_tokens", 0)
                if agent_stats
-                else stats.get("total_tokens", 0)
+                else 0


The logic for calculating total tokens has been changed. Previously, if agent_stats were not available for a given agent_name, it would fall back to using the session's total_tokens. The new logic falls back to 0.

Setting total to 0 may prevent context compression from being triggered when it's needed, as the condition total > limit would likely evaluate to false. This could lead to exceeding the context window limit unexpectedly.

I recommend reverting to the previous logic to ensure a more robust fallback.

Suggested change

-                else 0
+                else stats.get("total_tokens", 0)
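
The effect of the two fallbacks can be sketched as follows. The function names, the token counts, and the limit are hypothetical values chosen for illustration; only the two conditional expressions and the `total > limit` check come from the review comment.

```python
from typing import Any, Dict, Optional


def total_previous(agent_stats: Optional[Dict[str, Any]], stats: Dict[str, Any]) -> int:
    """Previous behavior: fall back to the session-wide total."""
    return (
        agent_stats.get("context_window_tokens", 0)
        if agent_stats
        else stats.get("total_tokens", 0)
    )


def total_new(agent_stats: Optional[Dict[str, Any]], stats: Dict[str, Any]) -> int:
    """New behavior in this PR: fall back to 0."""
    return agent_stats.get("context_window_tokens", 0) if agent_stats else 0


# Hypothetical session with no per-agent stats available.
stats = {"total_tokens": 150_000}
limit = 128_000

print(total_previous(None, stats) > limit)  # True: compression would trigger
print(total_new(None, stats) > limit)       # False: compression is skipped
```

When per-agent stats are present, both versions return the same value, so the difference only shows up in the fallback path.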

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-cli/src/aworld_cli/console.py

+        # Audio (models.audio -> AUDIO_* for audio agent)
+        self.console.print("\n[bold]Audio configuration[/bold] [dim](optional, for audio agent)[/dim]")
+        self.console.print("  [dim]Leave empty to use Media LLM or default LLM config above[/dim]\n")
+        if 'audio' not in current_config['models']:
+            current_config['models']['audio'] = {}
+        audio_cfg = current_config['models']['audio']
+
+        current_audio_api_key = audio_cfg.get('api_key', '')
+        if current_audio_api_key:
+            masked = current_audio_api_key[:8] + "..." if len(current_audio_api_key) > 8 else "***"
+            self.console.print(f"  [dim]Current AUDIO_API_KEY: {masked}[/dim]")
+        audio_api_key = Prompt.ask("  AUDIO_API_KEY", default=current_audio_api_key, password=True)
+        if audio_api_key:
+            audio_cfg['api_key'] = audio_api_key
+        else:
+            audio_cfg.pop('api_key', None)
+
+        current_audio_model = audio_cfg.get('model', '')
+        self.console.print("  [dim]e.g. claude-3-5-sonnet-20241022 · Enter to inherit from Media/default[/dim]")
+        audio_model = Prompt.ask("  AUDIO_MODEL_NAME", default=current_audio_model)
+        if audio_model:
+            audio_cfg['model'] = audio_model
+        else:
+            audio_cfg.pop('model', None)
+
+        current_audio_base_url = audio_cfg.get('base_url', '')
+        audio_base_url = Prompt.ask("  AUDIO_BASE_URL", default=current_audio_base_url)
+        if audio_base_url:
+            audio_cfg['base_url'] = audio_base_url
+        else:
+            audio_cfg.pop('base_url', None)
+
+        current_audio_provider = audio_cfg.get('provider', 'openai')
+        audio_provider = Prompt.ask("  AUDIO_PROVIDER", default=current_audio_provider)
+        if audio_provider:
+            audio_cfg['provider'] = audio_provider
+        else:
+            audio_cfg.pop('provider', None)
+
+        current_audio_temp = audio_cfg.get('temperature', 0.1)
+        audio_temp = Prompt.ask("  AUDIO_TEMPERATURE", default=str(current_audio_temp))
+        if audio_temp:
+            try:
+                audio_cfg['temperature'] = float(audio_temp)
+            except ValueError:
+                audio_cfg.pop('temperature', None)
+        else:
+            audio_cfg.pop('temperature', None)
+
+        if not audio_cfg:
+            current_config['models'].pop('audio', None)


The new "Audio configuration" section is almost identical to the "Diffusion configuration" section (lines 179-231). This significant code duplication makes the code harder to maintain.

Consider refactoring the logic for prompting and setting configuration values into a reusable helper function. This would apply to both the diffusion and audio configuration blocks.

A similar refactoring could be applied to the table display logic on lines 300-324, which also contains duplicated code for displaying the diffusion and audio configuration tables.
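
One possible shape for such a helper is sketched below. The name `prompt_optional` and its parameters are hypothetical. The `ask` argument is assumed to be a callable with the same keyword arguments the diff passes to `Prompt.ask` (`default`, `password`); it is injected so the sketch runs without a terminal.

```python
from typing import Any, Callable, Dict, Optional


def prompt_optional(
    cfg: Dict[str, Any],
    key: str,
    label: str,
    ask: Callable[..., str],
    default: Any = "",
    cast: Optional[Callable[[str], Any]] = None,
    password: bool = False,
) -> None:
    """Prompt for one optional value; store it, or drop the key if empty or invalid."""
    current = cfg.get(key, default)
    answer = ask(label, default=str(current), password=password)
    if not answer:
        cfg.pop(key, None)
        return
    if cast is None:
        cfg[key] = answer
        return
    try:
        cfg[key] = cast(answer)
    except ValueError:
        # Same behavior as the temperature branch in the diff: discard bad input.
        cfg.pop(key, None)


# Usage with a stand-in for Prompt.ask (canned answers instead of user input).
answers = {"AUDIO_MODEL_NAME": "", "AUDIO_TEMPERATURE": "0.3"}


def fake_ask(label: str, default: str = "", password: bool = False) -> str:
    return answers.get(label.strip(), default)


audio_cfg: Dict[str, Any] = {"model": "old-model"}
prompt_optional(audio_cfg, "model", "  AUDIO_MODEL_NAME", fake_ask)
prompt_optional(audio_cfg, "temperature", "  AUDIO_TEMPERATURE", fake_ask, default=0.1, cast=float)
print(audio_cfg)  # {'temperature': 0.3}
```

With a helper along these lines, the audio and diffusion blocks would each reduce to a handful of calls that differ only in key, label, and default.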

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-cli/src/aworld_cli/core/config.py

+def _apply_audio_models_config(models_config: Dict[str, Any]) -> None:
+    """
+    Apply models.audio config to AUDIO_* env vars for audio agent.
+    Priority: models.audio config > existing AUDIO_* env vars > LLM_*.
+    """
+    audio_cfg = models_config.get('audio')
+    audio_cfg = audio_cfg if isinstance(audio_cfg, dict) else {}
+    api_key = (audio_cfg.get('api_key') or '').strip()
+    model_name = (audio_cfg.get('model') or '').strip()
+    base_url = (audio_cfg.get('base_url') or '').strip()
+    provider = (audio_cfg.get('provider') or '').strip()
+    temperature = audio_cfg.get('temperature')
+
+    if not api_key:
+        api_key = (os.environ.get('AUDIO_API_KEY') or '').strip()
+    if not api_key:
+        api_key = (os.environ.get('LLM_API_KEY') or '').strip()
+    if not api_key:
+        for key in ('OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GEMINI_API_KEY'):
+            v = (os.environ.get(key) or '').strip()
+            if v:
+                api_key = v
+                if not provider and 'OPENAI' in key:
+                    provider = 'openai'
+                elif not provider and 'ANTHROPIC' in key:
+                    provider = 'anthropic'
+                elif not provider and 'GEMINI' in key:
+                    provider = 'gemini'
+                break
+    if not model_name:
+        model_name = (os.environ.get('AUDIO_MODEL_NAME') or '').strip()
+    if not model_name:
+        model_name = (os.environ.get('LLM_MODEL_NAME') or '').strip()
+    if not base_url:
+        base_url = (os.environ.get('AUDIO_BASE_URL') or '').strip()
+    if not base_url:
+        base_url = (os.environ.get('LLM_BASE_URL') or '').strip()
+    if not base_url:
+        for key in ('OPENAI_BASE_URL', 'ANTHROPIC_BASE_URL', 'GEMINI_BASE_URL'):
+            v = (os.environ.get(key) or '').strip()
+            if v:
+                base_url = v
+                break
+    if not provider:
+        provider = (os.environ.get('AUDIO_PROVIDER') or '').strip()
+    if not provider:
+        provider = 'openai'
+    if temperature is None:
+        env_temp = (os.environ.get('AUDIO_TEMPERATURE') or '').strip()
+        if env_temp:
+            temperature = float(env_temp)
+
+    if api_key:
+        os.environ['AUDIO_API_KEY'] = api_key
+    if model_name:
+        os.environ['AUDIO_MODEL_NAME'] = model_name
+    if base_url:
+        os.environ['AUDIO_BASE_URL'] = base_url
+    os.environ['AUDIO_PROVIDER'] = provider
+    if temperature is not None:
+        os.environ['AUDIO_TEMPERATURE'] = str(float(temperature))


The new function _apply_audio_models_config is very similar to _apply_diffusion_models_config. There is a large amount of duplicated code for resolving configuration values from the config dictionary, environment variables, and various fallbacks.

To improve maintainability, I recommend creating a single, parameterized helper function that can handle applying model configurations for different types (like diffusion and audio). For example, a function like _apply_specific_model_config(config_key: str, env_prefix: str, models_config: dict) could encapsulate the common logic.

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-cli/src/aworld_cli/inner_plugins/smllc/agents/audio/audio.py

+class PreMultiTaskVideoCreatorHook(PreLLMCallHook):
+    """Hook triggered before LLM execution. Used for monitoring, logging, etc. Should NOT modify input/output content."""
+
+    async def exec(self, message: Message, context: Context = None) -> Message:
+        if message.sender.startswith('audio'):
+            # Logging and monitoring only - do not modify content
+            pass
+        return message
+
+
+@HookFactory.register(name="post_audio_hook")
+class PostMultiTaskVideoCreatorHook(PostLLMCallHook):
+    """Hook triggered after LLM execution. Used for monitoring, logging, etc. Should NOT modify input/output content."""
+
+    async def exec(self, message: Message, context: Context = None) -> Message:
+        if message.sender.startswith('audio'):
+            # Logging and monitoring only - do not modify content
+            pass
+        return message
+
+
+class AudioCreatorAgent(AudioAgent):
+    """An agent specializing in creating, editing, and generating video content."""


There are several copy-paste errors from the video_creator agent in this new audio agent file.

The hook classes are named PreMultiTaskVideoCreatorHook and PostMultiTaskVideoCreatorHook. They should be renamed to reflect that they are for the audio agent (e.g., PreAudioCreatorHook).

The AudioCreatorAgent class docstring says it specializes in "video content". This should be updated to describe its audio-related purpose.

These inconsistencies make the code confusing and harder to maintain. Please update the names and docstrings to match the agent's actual function.

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-skills/optimizer/SKILL.md

+The `analysis_query` for this action **MUST** be a regular expression. Natural language queries are not supported and will fail.
+
+*   ✅ **Correct (Regex)**: `user_query=".*MyClass.*|.*my_function.*"`
+*   ❌ **Incorrect (Natural Language)**: `user_query="Find the MyClass class and the my_function function"`, `user_query=".*mcp_config\\.py."`, `user_query=".*"`


The documentation for CAST_ANALYSIS provides examples of incorrect regex queries: user_query=".*mcp_config\\.py." and user_query=".*".

While using an overly broad regex like .* might be undesirable, it is still a valid regex pattern. This example could be confusing for the agent or a human reader.

Consider clarifying why these patterns are considered "incorrect" in this context (e.g., they are too broad and may lead to performance issues or irrelevant results) or providing better examples of incorrect usage (e.g., non-regex natural language queries).
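
The distinction between "invalid" and "too broad" can be made concrete with a short sketch using Python's `re` module. The helper names and the probe strings are hypothetical. Note that the natural-language example also compiles as a regex, so a validity check alone cannot reject it.

```python
import re
from typing import Sequence


def is_valid_regex(query: str) -> bool:
    """True if the query compiles as a regular expression."""
    try:
        re.compile(query)
        return True
    except re.error:
        return False


def is_too_broad(query: str, probes: Sequence[str] = ("", "unrelated text")) -> bool:
    """Heuristic: a pattern that matches every unrelated probe filters nothing."""
    return all(re.search(query, probe) is not None for probe in probes)


print(is_valid_regex(".*"))                             # True
print(is_too_broad(".*"))                               # True
print(is_too_broad(".*MyClass.*|.*my_function.*"))      # False
print(is_valid_regex("Find the MyClass class and the my_function function"))  # True
```

This suggests the "incorrect" examples are better described by their effect (matching everything, or matching nothing useful) than by regex validity.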

gemini-code-assist · 2026-03-27T08:54:55Z

aworld-skills/text2agent/SKILL.md

+### **Step 7: MCP Server Dependency Check and Installation (MANDATORY)**
+**After successfully registering the agent, you MUST verify and prepare the operational environment for the newly created agent's tools (MCP servers).** The goal is to ensure all MCP servers can be launched without dependency errors. You will use your terminal tool to perform this check.
+
+7.1  **Identify Target Modules**: First, parse the newly created mcp_config.py to get a list of all MCP server module paths. Use the following command block exactly as written to extract the paths.
+
+
+        ```PYTHON_SCRIPT="
+            import sys, os
+            agents_path = os.path.expanduser('${AGENTS_PATH:-$HOME/.aworld/agents}')
+            agent_path = os.path.join(agents_path, '<agent_folder_name>')
+            if os.path.isdir(agent_path):
+                sys.path.insert(0, agent_path)
+            try:
+                from mcp_config import mcp_config
+                for server, config in mcp_config.get('mcpServers', {}).items():
+                    args = config.get('args', [])
+                    if '-m' in args:
+                        try:
+                            module_index = args.index('-m') + 1
+                            if module_index < len(args):
+                                print(args[module_index])
+                        except (ValueError, IndexError):
+                            pass
+            except (ImportError, ModuleNotFoundError):
+                # This handles cases where mcp_config.py doesn't exist or is empty.
+                # No output means no modules to check, which is a valid state.
+                pass
+            "
+            MODULE_PATHS=$(python -c "$PYTHON_SCRIPT")
+            echo "Modules to check: $MODULE_PATHS"
+(Reminder: You MUST replace <agent_folder_name> with the actual folder name from Step 2.)    ```
+
+7.2  **Iterate and Install Dependencies**: For each <module_path> identified in the $MODULE_PATHS list, you must perform the following check-and-install loop.
+*   **A. Attempt a Timed Launch:**: Execute the module using python -m but wrap it in a timeout command. This will attempt to start the server and kill it after 2 seconds. This is a "dry run" to trigger any ModuleNotFoundError.
+    timeout 2s python -m <module_path>
+*   **B. Analyze the Output**: Carefully inspect the stderr from the command's output. Your only concern is the specific error ModuleNotFoundError.
+    If stderr contains ModuleNotFoundError: No module named '<missing_package_name>': Proceed to C.
+    If the command completes (exits with code 0) or is killed by the timeout (exit code 124) WITHOUT a ModuleNotFoundError: The check for this module is considered SUCCESSFUL. You can move on to the next module in your list.
+    If any other error occurs: Ignore it for now. The goal of this step is solely to resolve Python package dependencies.
+*   **C. Install the Missing Package**: If a ModuleNotFoundError was detected, parse the <missing_package_name> from the error message and immediately install it using pip, with timeout 600.
+    pip install <missing_package_name>
+    7.3  **Repeat the Check**: After a successful installation, you MUST return to Step 7.1 and re-run the timeout 2s python -m <module_path> command for the SAME module. This is to verify the installation was successful and to check if the module has other, different dependencies that need to be installed. Continue this loop until the launch attempt for the current module no longer produces a ModuleNotFoundError.
+
+After this loop has been successfully completed for all modules in $MODULE_PATHS, the new agent's environment is confirmed to be ready.


The workflow for checking MCP server dependencies in Step 7 is quite complex and potentially fragile. It relies on a multi-line Python script embedded in a shell command, which can be difficult for an LLM to handle correctly, especially with placeholders like <agent_folder_name>.

The dependency check logic using timeout 2s python -m <module_path> is clever but might fail for reasons other than a ModuleNotFoundError (e.g., the module takes more than 2 seconds to initialize).

Consider simplifying this workflow or providing a more robust script or tool to handle dependency checking to improve the reliability of this skill.
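
As one possible direction, the check-and-parse part of Step 7.2 could live in a small script rather than in prose instructions. The sketch below is an assumption about what such a tool might look like; `extract_missing_module` and `probe_module` are hypothetical names, and it keeps the 2-second timed launch from the skill rather than replacing it.

```python
import re
import subprocess
import sys
from typing import Optional

_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def extract_missing_module(stderr: str) -> Optional[str]:
    """Return the module named in a ModuleNotFoundError message, if any."""
    match = _MISSING.search(stderr)
    return match.group(1) if match else None


def probe_module(module_path: str, timeout_s: float = 2.0) -> Optional[str]:
    """Launch `python -m <module_path>` briefly; report a missing module or None."""
    try:
        proc = subprocess.run(
            [sys.executable, "-m", module_path],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        stderr = proc.stderr
    except subprocess.TimeoutExpired as exc:
        # Reaching the timeout is treated as success unless stderr says otherwise.
        stderr = exc.stderr or ""
        if isinstance(stderr, bytes):
            stderr = stderr.decode(errors="replace")
    return extract_missing_module(stderr)


sample = "Traceback (most recent call last):\nModuleNotFoundError: No module named 'requests'\n"
print(extract_missing_module(sample))  # requests
```

A caller would loop on `probe_module` until it returns `None`, installing the reported name in between. The slow-startup concern still applies, and the importable module name does not always match the pip package name (for example, `yaml` is provided by `PyYAML`), so the install step would need a mapping or manual review.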


AWorldAgent added 4 commits March 25, 2026 23:06

[cli]: PPT Generation -> Code Generation

06207b2

[cast]: optimize [skills]: optimizer, text2agent

t checkout:audioMerge branch 'main' of https://github.com/inclusionAI…

33bf19e

…/AWorld

[diffusion]: video_creator -> diffusion_video

7771aab

[audio]: new agent audio_generator, support doubao_tts

[diffusion]: video_creator -> diffusion_video

f0beccb

[audio]: new agent audio_generator, support doubao_tts

gemini-code-assist bot reviewed Mar 27, 2026

[diffusion]: video_creator -> diffusion_video

bacbf02

[audio]: new agent audio_generator, support doubao_tts

tallate changed the title ~~Aworld audio~~ [diffusion]: video_creator -> diffusion_video [audio]: new agent audio_generator, support doubao_tts Mar 27, 2026

Merge branch 'main' of https://github.com/inclusionAI/AWorld into awo…

34f499e

…rld_audio

ZhuangCY approved these changes Mar 27, 2026

ZhuangCY merged commit 19a4032 into main Mar 27, 2026
1 check passed

ZhuangCY merged 6 commits into main from aworld_audio


2 participants
