Skip to content

FEAT CodeAttack #1945

@u7k4rs6

Description

@u7k4rs6

Summary

CodeAttack (Ren et al., ACL 2024, arXiv:2403.07865, repo: https://github.com/renqibing/CodeAttack) reformulates a harmful natural-language query as a code-completion task. The query is encoded character-by-character into a data structure initialisation sequence (e.g., successive deque.appendleft() calls, list literals, or string concatenations), embedded inside a partial code template that declares a decode() stub and an output-filling skeleton, and the model is asked to complete the code. Because the harmful intent is expressed as a programming task rather than a natural-language request, safety training, which is applied predominantly to natural-language inputs - fails to trigger consistently. The authors demonstrate this against GPT-4, Claude, and Gemini with black-box API access only.

How this differs from CodeChameleon (already in PyRIT)

Both attacks use code, but the mechanism is opposite.

CodeChameleon CodeAttack
Encoding step Encrypts the prompt using a known algorithm (reverse, binary-tree, odd-even, length-sort) Encodes the query character-by-character into data-structure initialisation code
Decryption key Embeds a decrypt function in the output; model must decrypt and then answer No separate decryption key — the decode() stub is a function the model is asked to write
Core disguise Code as cipher: the encoding is the obfuscation Code as task: the completion framing is the obfuscation
Model instruction "Solve the encrypted problem using the decryption function" "Complete this code so that it initialises the output list with the relevant paragraphs"

CodeChameleon is already in pyrit/prompt_converter/codechameleon_converter.py. CodeAttack is not implemented anywhere in PyRIT.

Proposed design

Following the FlipConverter + FlipAttack two-class template.

CodeAttackConverter - pyrit/prompt_converter/code_attack_converter.py

Transforms an input prompt into a code template. The converter handles the encoding (wrapping each character/token of the prompt into the chosen data-structure operations) and the template rendering. Being a standalone converter, it composes freely with other converters via AttackConverterConfig.

Key parameters (grounded in the reference implementation's --prompt-type and template variants):

  • language (str): one of "python_stack", "python_list", "python_string", "cpp", "go" — selects which template family to use.
  • verbose (bool, default True): selects the _plus template variant, which adds a more detailed output-structure specification and produces longer, more complete responses.

Template YAML seed prompts live under pyrit/prompt_converter/seed_prompts/, one file per language/verbose combination, matching the existing codechameleon_converter.yaml convention.

CodeAttackAttack - pyrit/executor/attack/single_turn/code_attack.py

Subclasses PromptSendingAttack. In __init__, instantiates CodeAttackConverter and prepends it to _request_converters (same pattern as FlipAttack). In _setup_async, injects a system prompt that frames the session as a code-completion environment and stores it in context.prepended_conversation.

Parameters beyond those inherited from PromptSendingAttack:

  • language and verbose, forwarded to the converter.

No new scoring logic - callers supply a scorer via AttackScoringConfig as usual.

Scope

In scope:

  • CodeAttackConverter with language and verbose parameters and five YAML seed-prompt templates (python_stack, python_list, python_string, cpp, go)
  • CodeAttackAttack as a PromptSendingAttack subclass
  • Unit tests in tests/unit/prompt_converter/test_code_attack_converter.py and tests/unit/executor/attack/single_turn/test_code_attack.py, following the test_flip_converter.py / test_flip_attack.py patterns (MagicMock, AsyncMock, patch_central_database)
  • A jupytext notebook in doc/code/executor/attack/ demonstrating end-to-end usage

Out of scope:

  • Auto-selection of language based on target-model capability
  • Multimodal variants
  • The paper's --num-samples outer retry loop (orthogonal - covered by max_attempts_on_failure + a scorer in the existing framework)

I'd like to take this. Will hold off on writing any code until @romanlutz approves the design above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions