
Code Execution Plugin

Nicolas Bonamy edited this page Nov 11, 2025 · 2 revisions


Note: This feature is currently implemented but not yet enabled in Witsy.

Background

Anthropic recently published an article on code execution with MCP, demonstrating how to enable Claude to orchestrate multiple tool calls through a single "execute code" function. This approach can dramatically reduce token usage and improve reliability when chaining multiple operations together.

Witsy's Specific Challenges

Implementing code execution in Witsy presents unique challenges compared to a typical MCP server setup:

  1. Dynamic MCP servers: MCP servers in Witsy are user-configured and can be added or removed at runtime. This means we cannot write static "documentation" or examples for real code - the available tools and their schemas are completely dynamic.

  2. Desktop application constraints: Witsy is a desktop application that needs to remain lightweight and portable. Implementing a sandboxed execution environment (as Anthropic does) would require heavy dependencies like Docker, which conflicts with Witsy's goal of being an accessible, easy-to-install desktop app.

Given these constraints, Witsy's implementation takes a different approach focused on workflow orchestration rather than arbitrary code execution.

💡 Results: Despite these limitations, the initial implementation shows dramatic improvements - reducing token usage from 50,000+ tokens to ~5,000 tokens (a 10x reduction) for simple multi-step tasks. See the Example below for details.

Implementation

Source Code: src/plugins/code_exec.ts

Architecture

The CodeExecutionPlugin is implemented as a MultiToolPlugin that exposes only two tools to the model. All other available tools (MCP servers, plugins, etc.) are not directly accessible to the model - instead, they can only be invoked through these two tools:

  1. code_exec_get_tools_info: Retrieves detailed information about specific tools, including their parameters and descriptions. The tool's description includes a complete list of all available tools, allowing the model to discover what tools exist before requesting details.
  2. code_exec_run_program: Executes a workflow described as a JSON sequence of tool calls.

The plugin provides streaming status updates during workflow execution, yielding status messages as each step begins and completes (e.g., "Executing step 1: get_workspaces", "Completed step 1"). This allows the UI to show real-time progress to the user.
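This streaming pattern can be sketched with an async generator. The names below (executeWithStatus, StatusUpdate) are illustrative only, not the plugin's actual API:

```typescript
// Illustrative sketch: an async generator interleaves progress messages
// with tool execution so the UI can render them as they arrive.
type StatusUpdate =
  | { type: 'status'; text: string }
  | { type: 'result'; value: unknown };

async function* executeWithStatus(
  steps: { id: string; tool: string }[],
  callTool: (name: string) => Promise<unknown>,
): AsyncGenerator<StatusUpdate> {
  const results: unknown[] = [];
  for (let i = 0; i < steps.length; i++) {
    yield { type: 'status', text: `Executing step ${i + 1}: ${steps[i].tool}` };
    results.push(await callTool(steps[i].tool));
    yield { type: 'status', text: `Completed step ${i + 1}` };
  }
  // The final yield carries the last step's result back to the caller
  yield { type: 'result', value: results[results.length - 1] };
}
```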

System Instructions

Getting models to reliably use code execution requires detailed system instructions. The following instructions are provided to guide the model:


To accomplish tasks, you can use the run_program tool. This tool provides a list of other tools you can use. You can get details about these tools using get_tool_info.

When to Use run_program:

  • ALWAYS prefer run_program for workflows involving 2+ steps or when steps depend on each other
  • Use run_program proactively rather than making individual tool calls and then switching to it
  • Before writing a program: Use code_exec_get_tool_info to understand the exact tool names, required parameters, and argument structure for unfamiliar tools
  • Once you understand the required tools, write a complete program and call run_program directly

Do not write single-step programs to retrieve intermediate data: the goal is to streamline execution of complete workflows.

Workflow for Using run_program:

  1. Identify which tools you need for the task
  2. Use code_exec_get_tool_info with tool_name parameter to get details about each tool you're unfamiliar with
  3. Review the tool's parameters, required fields, and structure
  4. Write a complete multi-step program with proper tool names and arguments
  5. Execute with code_exec_run_program

Program Structure

The program must contain a steps array:

{
  "steps": [...]
}

Step Definition

Each step object must include:

  • id: A unique string identifier for this step (used for output references in later steps).
  • tool: The exact name of the tool to call (e.g., "search_internet", "asana_get_tasks___c9c8"). Use the complete tool name including any suffixes.
  • args: An object with key-value pairs representing the inputs for this tool as described in its specification.
  • Optionally, on_error: Controls workflow behavior if this step fails. Acceptable values are "continue" (ignore the error and proceed) or "abort" (stop execution). Default should be "abort" if not specified.

References Between Steps

  • When a step's argument value should be populated with the output of a previous step, use the reference syntax: {{step_id.result}}
  • You CAN access nested fields and array indices: {{step_id.result.field_name}} or {{step_id.result.0.field_name}}
    • Example: {{get_workspaces.result.0.gid}} to access the gid field of the first workspace
    • Example: {{get_user.result.gid}} to access the gid field directly
  • You can reference fields at any depth: {{step_id.result.data.items.0.id}}
  • When uncertain about structure, reference the whole result: {{step_id.result}}

General Style & Validity

  • The JSON must be valid (no comments, trailing commas, or extra fields).
  • Remember the root object contains only a steps array
  • Use clear and direct names for step IDs.
  • Use only tools and parameters exactly as specified.
  • For each tool, provide only the arguments that are required or documented.
  • Programs should be minimal, containing only as many steps as needed to accomplish the described objective.

Error Handling

  • If not specified, treat on_error as "abort".
  • When instructed, apply "on_error": "continue" to relevant steps to skip on failure.

Key Points:

  • Use code_exec_get_tool_info first to understand unfamiliar tools
  • Start with {"steps": [...]} (no program wrapper)
  • You can access nested fields in step results (e.g., .gid, .0.gid)
  • Use complete tool names including suffixes
  • Plan the full workflow before executing

These instructions emphasize:

  • Proactive usage: Encourage the model to use run_program for multi-step workflows rather than making individual tool calls
  • Discovery workflow: Get tool info first, then write the program
  • Clear syntax: Explicit examples of variable substitution patterns
  • Error guidance: How to handle failures and structure references

Workflow Model

Unlike Anthropic's implementation, which allows arbitrary TypeScript code execution and intermediate data processing (filtering, mapping, etc.), Witsy's implementation focuses on workflow orchestration. A workflow is defined as:

{
  "steps": [
    { "id": "step1", "tool": "tool_name", "args": { ... } },
    { "id": "step2", "tool": "tool_name", "args": { "param": "{{step1.result}}" } },
    // ...
  ]
}

Each step executes a tool and stores its result. Subsequent steps can reference previous results using template variables with the syntax {{step_id.path.to.value}}.
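The execution loop can be sketched as follows. The names (runProgram, fill, callTool) are hypothetical, and this version resolves only whole-value dot-path references; the actual logic in src/plugins/code_exec.ts handles more cases:

```typescript
// Hypothetical sketch of the sequential executor: run each step's tool,
// record the result under the step id, and fill {{step_id.path}} references
// in later steps' args before calling the tool.
type Step = { id: string; tool: string; args: unknown };

// Recursively substitute {{...}} template strings in an args structure.
// Only whole-value references with dot paths are handled in this sketch.
function fill(value: unknown, results: Record<string, unknown>): unknown {
  if (typeof value === 'string') {
    const m = value.match(/^\{\{([^}]+)\}\}$/);
    if (m) return m[1].split('.').reduce<any>((v, k) => v?.[k], results);
    return value;
  }
  if (Array.isArray(value)) return value.map((v) => fill(v, results));
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, fill(v, results)]),
    );
  }
  return value;
}

async function runProgram(
  steps: Step[],
  callTool: (tool: string, args: unknown) => Promise<unknown>,
): Promise<Record<string, unknown>> {
  const results: Record<string, unknown> = {};
  for (const step of steps) {
    results[step.id] = await callTool(step.tool, fill(step.args, results));
  }
  return results;
}
```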

Variable Substitution

The plugin supports sophisticated variable substitution:

  • Dot notation: {{step1.user.name}}
  • Array access: {{step1.items.0}} or {{step1.items[0]}}
  • Bracket notation with paths: {{step1.result[0].gid}}
  • Nested objects and arrays: Variables can be used in complex argument structures
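One way to support both notations is to normalize bracket indices to dot segments before traversal. A minimal sketch (resolvePath is an illustrative name, not the plugin's API):

```typescript
// Hypothetical sketch of resolving a reference path such as
// "step1.result[0].gid" or "step1.result.0.gid" against stored step results.
// Bracket indices are rewritten to dot segments, then the path is walked
// with optional chaining so missing fields yield undefined rather than throwing.
function resolvePath(root: Record<string, unknown>, ref: string): unknown {
  const segments = ref.replace(/\[(\d+)\]/g, '.$1').split('.').filter(Boolean);
  return segments.reduce<any>((value, key) => value?.[key], root);
}
```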

Error Handling and Model Guidance

Since it's impossible to predict the structure of objects returned by arbitrary MCP calls, the plugin provides JSON schemas in error messages to guide the model toward self-correction.

Schema-Based Error Messages

When a variable resolution error occurs (e.g., accessing a non-existent property or an out-of-bounds array index), the plugin returns the complete JSON schema of the actual result structure. This schema uses Witsy's lightweight "simple JSON schema" format (see Agents JSON format):

{
  "status": "string",
  "count": "number",
  "items": ["string"]
}

For example, if the model tries to access {{get_data.nonexistent}}, the error message includes:

Failed to resolve variables in "{"value":"{{get_data.nonexistent}}"}" for step "use_data".
Expected schema for "tool1":
{
  "status": "string",
  "count": "number",
  "items": []
}

This approach provides the model with the complete, accurate structure of the data, allowing it to correct its variable references on the next attempt.
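Deriving this simple schema from an arbitrary result can be sketched as follows (toSimpleSchema is a hypothetical name; per the examples above, arrays are described by their first element and primitives by their typeof name):

```typescript
// Hypothetical sketch of generating Witsy's "simple JSON schema" from a value:
// primitives map to their typeof name, arrays to a one-element schema array
// (empty if the array is empty), and objects map each key to its value's schema.
function toSimpleSchema(value: unknown): unknown {
  if (Array.isArray(value)) return value.length ? [toSimpleSchema(value[0])] : [];
  if (value === null) return 'null'; // assumption: nulls labeled as 'null'
  if (typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([k, v]) => [k, toSimpleSchema(v)],
      ),
    );
  }
  return typeof value; // 'string' | 'number' | 'boolean'
}
```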

Schema Persistence and Proactive Delivery

The plugin implements schema learning to reduce trial-and-error:

  1. Schema Generation: Every time a tool executes successfully, the plugin automatically generates a simple JSON schema from the actual result
  2. Persistent Storage: Schemas are saved to disk (code_exec.json in the app's userData directory) and persist across sessions
  3. Proactive Delivery: When the model calls code_exec_get_tools_info, the response includes result_schema fields for any tools that have been executed before

Example get_tools_info response:

{
  "tools_info": [
    {
      "name": "asana_get_tasks",
      "description": "Retrieves tasks from Asana...",
      "parameters": { ... },
      "result_schema": "{\"data\":[{\"gid\":\"string\",\"name\":\"string\"}]}"
    }
  ]
}

This means after the first successful execution of a tool, the model can see the expected output structure before writing programs that reference it, significantly reducing variable resolution errors.

Result Unwrapping

To simplify variable references, the plugin automatically unwraps common result wrappers:

  • If a tool returns { result: { ... } }, the result property is auto-unwrapped
  • If a tool returns { data: { ... } }, the data property is auto-unwrapped
  • JSON strings are automatically parsed

This means {{step1.items}} works whether the tool returns { result: { items: [...] } } or { items: [...] } directly.
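A sketch of this unwrapping, under the assumption that only a lone result or data key is peeled (unwrapResult is an illustrative name; the plugin's actual conditions may differ):

```typescript
// Hypothetical sketch of result unwrapping: parse JSON strings first, then
// peel a single "result" or "data" wrapper so references stay short.
function unwrapResult(raw: unknown): unknown {
  let value = raw;
  if (typeof value === 'string') {
    try { value = JSON.parse(value); } catch { /* keep plain strings as-is */ }
  }
  if (value && typeof value === 'object' && !Array.isArray(value)) {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj);
    // Assumption: unwrap only when the wrapper is the object's sole key
    if (keys.length === 1 && (keys[0] === 'result' || keys[0] === 'data')) {
      return obj[keys[0]];
    }
  }
  return value;
}
```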

Example: Asana Task Retrieval

Scenario

Retrieve all Asana tasks assigned to the current user and store them in long-term memory.

Token Comparison

  • Without code execution: >50,000 tokens (multiple back-and-forth exchanges, with each tool call requiring full context)
  • With code execution: ~5,000 tokens (two tool calls total)

Actual Execution

First call - Get tool information:

{
  "tool": "code_exec_get_tools_info",
  "parameters": {
    "tools_names": [
      "asana_list_workspaces___c9c8",
      "asana_get_user___c9c8",
      "asana_get_tasks___c9c8",
      "long_term_memory"
    ]
  }
}

Second call - Execute workflow:

{
  "tool": "code_exec_run_program",
  "parameters": {
    "program": {
      "steps": [
        {
          "id": "get_workspaces",
          "tool": "asana_list_workspaces___c9c8",
          "args": {}
        },
        {
          "id": "get_user",
          "tool": "asana_get_user___c9c8",
          "args": { "user_id": "me" }
        },
        {
          "id": "get_tasks",
          "tool": "asana_get_tasks___c9c8",
          "args": {
            "assignee": "{{get_user.result.gid}}",
            "workspace": "{{get_workspaces.result.0.gid}}",
            "limit": 100
          }
        },
        {
          "id": "save_to_memory",
          "tool": "long_term_memory",
          "args": {
            "action": "store",
            "content": [
              "User's Asana tasks retrieved on November 8, 2025: {{get_tasks.result}}"
            ]
          }
        }
      ]
    }
  }
}

The entire operation completes in two tool calls with full variable substitution between steps.

Reflections and Limitations

Scope Differences

Witsy's implementation is intentionally narrower than Anthropic's MCP code execution:

  • No arbitrary code execution or data processing between steps
  • No filtering, mapping, or transformation of intermediate results
  • No conditional logic or branching within workflows
  • Focus is purely on orchestrating sequential tool calls with variable passing

This trade-off was made to:

  1. Avoid the complexity and security concerns of sandboxed code execution
  2. Keep the implementation lightweight and maintainable
  3. Align with Witsy's architecture as a desktop application

Instruction Tuning Required

Getting models to reliably use code execution still requires prompt engineering:

  • Models need to be encouraged to use get_tools_info first to understand tool schemas
  • Some models struggle with the JSON syntax for complex workflows
  • Variable substitution patterns need to be clearly explained

The feature works best with frontier models (Claude Sonnet 4.5, etc.) that have strong instruction-following capabilities.

Error Detection Fragility

The current error detection has some weaknesses:

if (result.error || (typeof result === 'string' && result.toLowerCase().startsWith('error'))) {

This approach:

  • Only catches errors that start with "error" (case-insensitive)
  • Misses errors formatted as "Failed to..." or "Cannot..."
  • Relies on string matching rather than structured error formats

A more robust approach would require:

  • Standardized error formats from MCP servers (which we don't control)
  • Or more sophisticated heuristics for error detection
  • Or explicit error signaling from the plugin execution layer
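As a sketch of the latter two options combined, a check could prefer structured signals and broaden the string heuristic. The function name, the isError field, and the prefix list are assumptions, not the plugin's current code:

```typescript
// Hypothetical sketch of broader error detection than a bare "error" prefix
// check. Structured signals (error / isError fields) are preferred; string
// matching remains a heuristic fallback for servers we don't control.
function looksLikeError(result: unknown): boolean {
  if (result && typeof result === 'object') {
    const obj = result as Record<string, unknown>;
    return Boolean(obj.error) || obj.isError === true;
  }
  if (typeof result === 'string') {
    // Catches "Error: ...", "Failed to ...", "Cannot ...", "Unable to ..."
    return /^(error|failed to|cannot|unable to)\b/i.test(result.trim());
  }
  return false;
}
```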

Auto-Unwrapping Complexity

The automatic unwrapping of result and data properties, while convenient, adds cognitive overhead:

  • It's not always clear whether {{step1.items}} refers to step1.items or step1.result.items
  • The logic for determining when to unwrap has edge cases
  • Makes the variable resolution code more complex (~130 lines with multiple branching paths)

This is a trade-off between convenience for simple cases and explicitness for complex ones.

Future Enhancements

Planned improvements include:

  • Hardening variable substitution: Improve the robustness and reliability of the variable resolution logic to handle more edge cases gracefully
  • Better result visibility: Provide the model with summary information about what was accomplished at each step, rather than just a boolean success/failure indicator
  • Error handling control: Implement the on_error parameter described in system instructions to allow workflows to continue on failure or abort as needed
  • Workflow constraints: Define and enforce practical limits on workflow size (max steps), execution time (timeouts), and resource usage

Completed Enhancements:

  • Schema learning (implemented): The plugin now automatically captures JSON schemas from tool results and provides them in get_tools_info responses, dramatically reducing variable resolution errors

Despite these limitations, the code execution plugin provides significant value for multi-step operations, dramatically reducing token usage while maintaining the reliability of tool chaining.
