[diffusion]: video_creator -> diffusion_video [audio]: new agent audio_generator, support doubao_tts #835

Merged
ZhuangCY merged 6 commits into main from aworld_audio
Mar 27, 2026
Conversation

@tallate
Collaborator

@tallate tallate commented Mar 27, 2026

No description provided.

AWorldAgent added 4 commits March 25, 2026 23:06
[cast]: optimize
[skills]: optimizer, text2agent
[audio]: new agent audio_generator, support doubao_tts
[audio]: new agent audio_generator, support doubao_tts
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new Audio Agent powered by Doubao TTS, renames the "video_creator" agent to "diffusion," and adds two new skills: "optimizer" for agent enhancement and "text2agent" for automated agent creation. Key technical changes include the implementation of the AudioAgent base class, the DoubaoTTSProvider, and a refactor of the SEARCH_REPLACE tool to use structured parameters instead of a JSON string. Review feedback identifies a logic error in the diffusion configuration migration block and a potential regression in token limit calculations where the fallback was changed to zero. Additionally, the reviewer noted significant code duplication between the audio and diffusion configuration logic and pointed out several copy-paste naming inconsistencies in the new audio agent implementation.

Comment on lines +183 to +185
# Migrate from legacy models.diffusion
current_config['models']['diffusion'] = current_config['models'].get('diffusion') or {}
current_config['models'].pop('diffusion', None)

high

There appears to be a logic error in the migration block for the diffusion configuration. The code currently checks if 'diffusion' is not in current_config['models'], and if so, it attempts to get 'diffusion' from the same dictionary (which will be None), and then immediately removes it. This has no effect.

If the goal is to simply ensure the diffusion dictionary exists, this block should be simplified to match the pattern used for the new audio configuration.

Suggested change
- # Migrate from legacy models.diffusion
- current_config['models']['diffusion'] = current_config['models'].get('diffusion') or {}
- current_config['models'].pop('diffusion', None)
+ current_config['models']['diffusion'] = {}
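
To make the point concrete, here is a minimal sketch of the flagged block. It assumes, following the reviewer's description, that the three lines sit under an `if 'diffusion' not in current_config['models']:` guard that is outside the quoted range. `legacy_migrate` and `ensure_section` are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def legacy_migrate(models: Dict[str, Any]) -> None:
    """Reproduce the flagged block, with the guard assumed from the review text."""
    if 'diffusion' not in models:
        # .get() returns None here, so this assigns an empty dict...
        models['diffusion'] = models.get('diffusion') or {}
        # ...which is removed again on the very next line.
        models.pop('diffusion', None)


def ensure_section(models: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Make sure models[key] exists and return it (same pattern as the audio block)."""
    return models.setdefault(key, {})


models: Dict[str, Any] = {}
legacy_migrate(models)
print('diffusion' in models)   # False: the key is still missing

ensure_section(models, 'diffusion')
print(models)                  # {'diffusion': {}}
```

Under that assumption, any later `current_config['models']['diffusion']` lookup would still hit a missing key after the flagged block runs, which is why the simpler form is preferable.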

  agent_stats.get("context_window_tokens", 0)
  if agent_stats
- else stats.get("total_tokens", 0)
+ else 0

high

The logic for calculating total tokens has been changed. Previously, if agent_stats were not available for a given agent_name, it would fall back to using the session's total_tokens. The new logic falls back to 0.

Setting total to 0 may prevent context compression from being triggered when it's needed, as the condition total > limit would likely evaluate to false. This could lead to exceeding the context window limit unexpectedly.

I recommend reverting to the previous logic to ensure a more robust fallback.

Suggested change
- else 0
+ else stats.get("total_tokens", 0)
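
A small sketch of the difference between the two fallbacks. The function names, the `fallback_to_session` switch, and the numbers are made up for illustration; only the `total > limit` comparison and the two dictionary keys come from the quoted code and comment.

```python
from typing import Any, Dict, Optional


def resolve_total(agent_stats: Optional[Dict[str, Any]],
                  stats: Dict[str, Any],
                  fallback_to_session: bool) -> int:
    """Pick the token count that is compared against the limit."""
    if agent_stats:
        return agent_stats.get("context_window_tokens", 0)
    # No per-agent stats: reuse the session total (old) or return 0 (new).
    return stats.get("total_tokens", 0) if fallback_to_session else 0


def needs_compression(total: int, limit: int) -> bool:
    """Compression is triggered only when the total exceeds the limit."""
    return total > limit


session_stats = {"total_tokens": 120_000}  # hypothetical session
limit = 100_000                            # hypothetical context limit

old_total = resolve_total(None, session_stats, fallback_to_session=True)
new_total = resolve_total(None, session_stats, fallback_to_session=False)
print(needs_compression(old_total, limit))  # True
print(needs_compression(new_total, limit))  # False
```

With the zero fallback, a session that is already over the limit is never compressed whenever per-agent stats are missing.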

Comment on lines +233 to +283
# Audio (models.audio -> AUDIO_* for audio agent)
self.console.print("\n[bold]Audio configuration[/bold] [dim](optional, for audio agent)[/dim]")
self.console.print(" [dim]Leave empty to use Media LLM or default LLM config above[/dim]\n")
if 'audio' not in current_config['models']:
    current_config['models']['audio'] = {}
audio_cfg = current_config['models']['audio']

current_audio_api_key = audio_cfg.get('api_key', '')
if current_audio_api_key:
    masked = current_audio_api_key[:8] + "..." if len(current_audio_api_key) > 8 else "***"
    self.console.print(f" [dim]Current AUDIO_API_KEY: {masked}[/dim]")
audio_api_key = Prompt.ask(" AUDIO_API_KEY", default=current_audio_api_key, password=True)
if audio_api_key:
    audio_cfg['api_key'] = audio_api_key
else:
    audio_cfg.pop('api_key', None)

current_audio_model = audio_cfg.get('model', '')
self.console.print(" [dim]e.g. claude-3-5-sonnet-20241022 · Enter to inherit from Media/default[/dim]")
audio_model = Prompt.ask(" AUDIO_MODEL_NAME", default=current_audio_model)
if audio_model:
    audio_cfg['model'] = audio_model
else:
    audio_cfg.pop('model', None)

current_audio_base_url = audio_cfg.get('base_url', '')
audio_base_url = Prompt.ask(" AUDIO_BASE_URL", default=current_audio_base_url)
if audio_base_url:
    audio_cfg['base_url'] = audio_base_url
else:
    audio_cfg.pop('base_url', None)

current_audio_provider = audio_cfg.get('provider', 'openai')
audio_provider = Prompt.ask(" AUDIO_PROVIDER", default=current_audio_provider)
if audio_provider:
    audio_cfg['provider'] = audio_provider
else:
    audio_cfg.pop('provider', None)

current_audio_temp = audio_cfg.get('temperature', 0.1)
audio_temp = Prompt.ask(" AUDIO_TEMPERATURE", default=str(current_audio_temp))
if audio_temp:
    try:
        audio_cfg['temperature'] = float(audio_temp)
    except ValueError:
        audio_cfg.pop('temperature', None)
else:
    audio_cfg.pop('temperature', None)

if not audio_cfg:
    current_config['models'].pop('audio', None)

medium

The new "Audio configuration" section is almost identical to the "Diffusion configuration" section (lines 179-231). This significant code duplication makes the code harder to maintain.

Consider refactoring the logic for prompting and setting configuration values into a reusable helper function. This would apply to both the diffusion and audio configuration blocks.

A similar refactoring could be applied to the table display logic on lines 300-324, which also contains duplicated code for displaying the diffusion and audio configuration tables.
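
One possible shape for such a helper, as a rough sketch rather than the project's actual code. `prompt_optional_field`, `prompt_model_section`, and the injected `ask` callable are hypothetical; in the real CLI, `ask` would presumably be `Prompt.ask` as used in the quoted block. The masked-key preview line is left out for brevity.

```python
from typing import Any, Callable, Dict, Optional

AskFn = Callable[..., str]


def prompt_optional_field(cfg: Dict[str, Any], key: str, label: str, ask: AskFn,
                          default: Any = '',
                          cast: Optional[Callable[[str], Any]] = None,
                          **ask_kwargs: Any) -> None:
    """Prompt for one optional value; store it if given, drop the key otherwise."""
    current = cfg.get(key, default)
    answer = ask(f" {label}", default=str(current), **ask_kwargs)
    if not answer:
        cfg.pop(key, None)
        return
    try:
        cfg[key] = cast(answer) if cast else answer
    except ValueError:
        # Same behavior as the quoted temperature branch: discard bad input.
        cfg.pop(key, None)


def prompt_model_section(models: Dict[str, Any], section: str, prefix: str,
                         ask: AskFn) -> None:
    """Run the shared prompts for one models.<section> block, e.g. 'audio'."""
    cfg = models.setdefault(section, {})
    prompt_optional_field(cfg, 'api_key', f"{prefix}_API_KEY", ask, password=True)
    prompt_optional_field(cfg, 'model', f"{prefix}_MODEL_NAME", ask)
    prompt_optional_field(cfg, 'base_url', f"{prefix}_BASE_URL", ask)
    prompt_optional_field(cfg, 'provider', f"{prefix}_PROVIDER", ask, default='openai')
    prompt_optional_field(cfg, 'temperature', f"{prefix}_TEMPERATURE", ask,
                          default=0.1, cast=float)
    if not cfg:
        # Nothing was configured: remove the empty section again.
        models.pop(section, None)
```

Passing the prompt function in as a parameter keeps the helper testable without a terminal, and the audio block would reduce to a single `prompt_model_section(current_config['models'], 'audio', 'AUDIO', Prompt.ask)` call.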

Comment on lines +348 to +408
def _apply_audio_models_config(models_config: Dict[str, Any]) -> None:
    """
    Apply models.audio config to AUDIO_* env vars for audio agent.
    Priority: models.audio config > existing AUDIO_* env vars > LLM_*.
    """
    audio_cfg = models_config.get('audio')
    audio_cfg = audio_cfg if isinstance(audio_cfg, dict) else {}
    api_key = (audio_cfg.get('api_key') or '').strip()
    model_name = (audio_cfg.get('model') or '').strip()
    base_url = (audio_cfg.get('base_url') or '').strip()
    provider = (audio_cfg.get('provider') or '').strip()
    temperature = audio_cfg.get('temperature')

    if not api_key:
        api_key = (os.environ.get('AUDIO_API_KEY') or '').strip()
    if not api_key:
        api_key = (os.environ.get('LLM_API_KEY') or '').strip()
    if not api_key:
        for key in ('OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GEMINI_API_KEY'):
            v = (os.environ.get(key) or '').strip()
            if v:
                api_key = v
                if not provider and 'OPENAI' in key:
                    provider = 'openai'
                elif not provider and 'ANTHROPIC' in key:
                    provider = 'anthropic'
                elif not provider and 'GEMINI' in key:
                    provider = 'gemini'
                break
    if not model_name:
        model_name = (os.environ.get('AUDIO_MODEL_NAME') or '').strip()
    if not model_name:
        model_name = (os.environ.get('LLM_MODEL_NAME') or '').strip()
    if not base_url:
        base_url = (os.environ.get('AUDIO_BASE_URL') or '').strip()
    if not base_url:
        base_url = (os.environ.get('LLM_BASE_URL') or '').strip()
    if not base_url:
        for key in ('OPENAI_BASE_URL', 'ANTHROPIC_BASE_URL', 'GEMINI_BASE_URL'):
            v = (os.environ.get(key) or '').strip()
            if v:
                base_url = v
                break
    if not provider:
        provider = (os.environ.get('AUDIO_PROVIDER') or '').strip()
    if not provider:
        provider = 'openai'
    if temperature is None:
        env_temp = (os.environ.get('AUDIO_TEMPERATURE') or '').strip()
        if env_temp:
            temperature = float(env_temp)

    if api_key:
        os.environ['AUDIO_API_KEY'] = api_key
    if model_name:
        os.environ['AUDIO_MODEL_NAME'] = model_name
    if base_url:
        os.environ['AUDIO_BASE_URL'] = base_url
    os.environ['AUDIO_PROVIDER'] = provider
    if temperature is not None:
        os.environ['AUDIO_TEMPERATURE'] = str(float(temperature))

medium

The new function _apply_audio_models_config is very similar to _apply_diffusion_models_config. There is a large amount of duplicated code for resolving configuration values from the config dictionary, environment variables, and various fallbacks.

To improve maintainability, I recommend creating a single, parameterized helper function that can handle applying model configurations for different types (like diffusion and audio). For example, a function like _apply_specific_model_config(config_key: str, env_prefix: str, models_config: dict) could encapsulate the common logic.

Comment on lines +20 to +42
class PreMultiTaskVideoCreatorHook(PreLLMCallHook):
    """Hook triggered before LLM execution. Used for monitoring, logging, etc. Should NOT modify input/output content."""

    async def exec(self, message: Message, context: Context = None) -> Message:
        if message.sender.startswith('audio'):
            # Logging and monitoring only - do not modify content
            pass
        return message


@HookFactory.register(name="post_audio_hook")
class PostMultiTaskVideoCreatorHook(PostLLMCallHook):
    """Hook triggered after LLM execution. Used for monitoring, logging, etc. Should NOT modify input/output content."""

    async def exec(self, message: Message, context: Context = None) -> Message:
        if message.sender.startswith('audio'):
            # Logging and monitoring only - do not modify content
            pass
        return message


class AudioCreatorAgent(AudioAgent):
    """An agent specializing in creating, editing, and generating video content."""

medium

There are several copy-paste errors from the video_creator agent in this new audio agent file.

  • The hook classes are named PreMultiTaskVideoCreatorHook and PostMultiTaskVideoCreatorHook. They should be renamed to reflect that they are for the audio agent (e.g., PreAudioCreatorHook).
  • The AudioCreatorAgent class docstring says it specializes in "video content". This should be updated to describe its audio-related purpose.

These inconsistencies make the code confusing and harder to maintain. Please update the names and docstrings to match the agent's actual function.

The `analysis_query` for this action **MUST** be a regular expression. Natural language queries are not supported and will fail.

* ✅ **Correct (Regex)**: `user_query=".*MyClass.*|.*my_function.*"`
* ❌ **Incorrect (Natural Language)**: `user_query="Find the MyClass class and the my_function function"`, `user_query=".*mcp_config\\.py."`, `user_query=".*"`

medium

The documentation for CAST_ANALYSIS provides examples of incorrect regex queries: user_query=".*mcp_config\\.py." and user_query=".*".

While using an overly broad regex like .* might be undesirable, it is still a valid regex pattern. This example could be confusing for the agent or a human reader.

Consider clarifying why these patterns are considered "incorrect" in this context (e.g., they are too broad and may lead to performance issues or irrelevant results) or providing better examples of incorrect usage (e.g., non-regex natural language queries).
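
The distinction between "invalid" and "unhelpful" patterns can be checked directly with Python's `re` module. This is only an illustration: it assumes the tool applies the query with search-style matching, which the quoted documentation does not state, and `is_valid_regex` and `matches_everything` are hypothetical helpers.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """True if the pattern compiles, regardless of how useful it is."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def matches_everything(pattern: str) -> bool:
    """Heuristic: a pattern that matches the empty string filters nothing."""
    return re.search(pattern, "") is not None


natural = "Find the MyClass class and the my_function function"
print(is_valid_regex(natural))   # True: plain words are a literal regex
print(is_valid_regex(".*"))      # True
print(matches_everything(".*"))  # True: too broad to narrow anything
print(matches_everything(".*MyClass.*|.*my_function.*"))  # False

# The trailing "." demands one more character after ".py".
print(re.search(r".*mcp_config\.py.", "mcp_config.py") is None)  # True
```

All of the "incorrect" examples compile, so the documentation's objection has to be about breadth or accidental mismatches rather than syntax, and stating that explicitly would address the ambiguity.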

Comment on lines +124 to +167
### **Step 7: MCP Server Dependency Check and Installation (MANDATORY)**
**After successfully registering the agent, you MUST verify and prepare the operational environment for the newly created agent's tools (MCP servers).** The goal is to ensure all MCP servers can be launched without dependency errors. You will use your terminal tool to perform this check.

7.1 **Identify Target Modules**: First, parse the newly created mcp_config.py to get a list of all MCP server module paths. Use the following command block exactly as written to extract the paths.


```bash
PYTHON_SCRIPT="
import sys, os
agents_path = os.path.expanduser('${AGENTS_PATH:-$HOME/.aworld/agents}')
agent_path = os.path.join(agents_path, '<agent_folder_name>')
if os.path.isdir(agent_path):
    sys.path.insert(0, agent_path)
    try:
        from mcp_config import mcp_config
        for server, config in mcp_config.get('mcpServers', {}).items():
            args = config.get('args', [])
            if '-m' in args:
                try:
                    module_index = args.index('-m') + 1
                    if module_index < len(args):
                        print(args[module_index])
                except (ValueError, IndexError):
                    pass
    except (ImportError, ModuleNotFoundError):
        # This handles cases where mcp_config.py doesn't exist or is empty.
        # No output means no modules to check, which is a valid state.
        pass
"
MODULE_PATHS=$(python -c "$PYTHON_SCRIPT")
echo "Modules to check: $MODULE_PATHS"
```
(Reminder: You MUST replace <agent_folder_name> with the actual folder name from Step 2.)

7.2 **Iterate and Install Dependencies**: For each <module_path> identified in the $MODULE_PATHS list, you must perform the following check-and-install loop.
* **A. Attempt a Timed Launch**: Execute the module using python -m but wrap it in a timeout command. This will attempt to start the server and kill it after 2 seconds. This is a "dry run" to trigger any ModuleNotFoundError.
  `timeout 2s python -m <module_path>`
* **B. Analyze the Output**: Carefully inspect the stderr from the command's output. Your only concern is the specific error ModuleNotFoundError.
  * If stderr contains ModuleNotFoundError: No module named '<missing_package_name>': Proceed to C.
  * If the command completes (exits with code 0) or is killed by the timeout (exit code 124) WITHOUT a ModuleNotFoundError: The check for this module is considered SUCCESSFUL. You can move on to the next module in your list.
  * If any other error occurs: Ignore it for now. The goal of this step is solely to resolve Python package dependencies.
* **C. Install the Missing Package**: If a ModuleNotFoundError was detected, parse the <missing_package_name> from the error message and immediately install it using pip, with timeout 600.
  `pip install <missing_package_name>`

7.3 **Repeat the Check**: After a successful installation, you MUST return to Step 7.1 and re-run the timeout 2s python -m <module_path> command for the SAME module. This is to verify the installation was successful and to check if the module has other, different dependencies that need to be installed. Continue this loop until the launch attempt for the current module no longer produces a ModuleNotFoundError.

After this loop has been successfully completed for all modules in $MODULE_PATHS, the new agent's environment is confirmed to be ready.

medium

The workflow for checking MCP server dependencies in Step 7 is quite complex and potentially fragile. It relies on a multi-line Python script embedded in a shell command, which can be difficult for an LLM to handle correctly, especially with placeholders like <agent_folder_name>.

The dependency check logic using timeout 2s python -m <module_path> is clever but might fail for reasons other than a ModuleNotFoundError (e.g., the module takes more than 2 seconds to initialize).

Consider simplifying this workflow or providing a more robust script or tool to handle dependency checking to improve the reliability of this skill.
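
As one illustration of what a sturdier check could look like, the probe-and-parse part of the loop can live in a small Python helper instead of shell text interpreted by the agent. This is a sketch under assumptions: `extract_missing_module` and `probe_module` are hypothetical names, and the 2-second default simply mirrors the skill text.

```python
import re
import subprocess
import sys
from typing import Optional

_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def extract_missing_module(stderr: str) -> Optional[str]:
    """Return the module named in the last ModuleNotFoundError line, if any."""
    found = _MISSING.findall(stderr)
    return found[-1] if found else None


def probe_module(module_path: str, timeout: float = 2.0) -> Optional[str]:
    """Launch `python -m module_path` briefly; return a missing module name or None."""
    try:
        proc = subprocess.run([sys.executable, "-m", module_path],
                              capture_output=True, text=True, timeout=timeout)
        stderr = proc.stderr or ""
    except subprocess.TimeoutExpired as exc:
        # Still running at the deadline: keep whatever stderr was captured so far.
        raw = exc.stderr or b""
        stderr = raw.decode(errors="replace") if isinstance(raw, bytes) else raw
    return extract_missing_module(stderr)
```

This does not remove the timing concern raised above, since a slow import can still hit the deadline before the error surfaces, so the timeout would likely need to be configurable. It also inherits one limitation of the original step: the module name in the error is not always the name of the package to install (for example, `yaml` is provided by `PyYAML`).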

[audio]: new agent audio_generator, support doubao_tts
@tallate changed the title from "Aworld audio" to "[diffusion]: video_creator -> diffusion_video [audio]: new agent audio_generator, support doubao_tts" on Mar 27, 2026
@ZhuangCY ZhuangCY merged commit 19a4032 into main Mar 27, 2026
1 check passed
2 participants