Summary
CodeAttack (Ren et al., ACL 2024, arXiv:2403.07865, repo: https://github.com/renqibing/CodeAttack) reformulates a harmful natural-language query as a code-completion task. The query is encoded character-by-character into a data-structure initialisation sequence (e.g., successive deque.appendleft() calls, list literals, or string concatenations) and embedded inside a partial code template that declares a decode() stub and an output-filling skeleton; the model is then asked to complete the code. Because the harmful intent is expressed as a programming task rather than a natural-language request, safety training, which is applied predominantly to natural-language inputs, fails to trigger consistently. The authors demonstrate this against GPT-4, Claude, and Gemini with black-box API access only.
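To make the encoding step concrete, here is a minimal sketch of the stack-style variant on a benign placeholder query. The helper name encode_as_stack, the variable name my_stack, and the per-character granularity are assumptions for illustration, not taken from the reference repo; the surrounding template (decode() stub and output skeleton) is deliberately left out.

```python
from collections import deque


def encode_as_stack(units, var_name="my_stack"):
    """Render one appendleft() line per unit (hypothetical helper)."""
    lines = [f"{var_name} = deque()"]
    # repr() quotes and escapes each unit so every emitted line is valid Python.
    lines.extend(f"{var_name}.appendleft({unit!r})" for unit in units)
    return "\n".join(lines)


query = "What is the capital of France?"  # benign placeholder query
snippet = encode_as_stack(list(query))    # assumption: one unit per character

# Round-trip check: run the generated code, then undo the appendleft() ordering.
namespace = {"deque": deque}
exec(snippet, namespace)
recovered = "".join(reversed(namespace["my_stack"]))
```

Because appendleft() stores the units in reverse, reversing the deque recovers the original text; this is the kind of logic the decode() stub leaves for the model to write.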
How this differs from CodeChameleon (already in PyRIT)
Both attacks use code, but their mechanisms are opposite.
| | CodeChameleon | CodeAttack |
|---|---|---|
| Encoding step | Encrypts the prompt using a known algorithm (reverse, binary-tree, odd-even, length-sort) | Encodes the query character-by-character into data-structure initialisation code |
| Decryption key | Embeds a decrypt function in the output; model must decrypt and then answer | No separate decryption key; the decode() stub is a function the model is asked to write |
| Core disguise | Code as cipher: the encoding is the obfuscation | Code as task: the completion framing is the obfuscation |
| Model instruction | "Solve the encrypted problem using the decryption function" | "Complete this code so that it initialises the output list with the relevant paragraphs" |
CodeChameleon is already in pyrit/prompt_converter/codechameleon_converter.py. CodeAttack is not implemented anywhere in PyRIT.
Proposed design
The design follows the FlipConverter + FlipAttack two-class template.
CodeAttackConverter - pyrit/prompt_converter/code_attack_converter.py
Transforms an input prompt into a code template. The converter handles the encoding (wrapping each character/token of the prompt into the chosen data-structure operations) and the template rendering. Being a standalone converter, it composes freely with other converters via AttackConverterConfig.
Key parameters (grounded in the reference implementation's --prompt-type and template variants):
- language (str): one of "python_stack", "python_list", "python_string", "cpp", "go". Selects which template family to use.
- verbose (bool, default True): selects the _plus template variant, which adds a more detailed output-structure specification and produces longer, more complete responses.
Template YAML seed prompts live under pyrit/prompt_converter/seed_prompts/, one file per language/verbose combination, matching the existing codechameleon_converter.yaml convention.
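As a rough sketch of how the two parameters could map onto template files: the code_attack_ filename prefix and the template_path helper below are hypothetical naming choices for illustration, not an existing PyRIT convention. Only the language values, the _plus suffix, and the directory come from the proposal above.

```python
from pathlib import Path

SUPPORTED_LANGUAGES = ("python_stack", "python_list", "python_string", "cpp", "go")

# Directory taken from the proposal; the filename scheme is an assumption.
SEED_PROMPT_DIR = Path("pyrit/prompt_converter/seed_prompts")


def template_path(language: str, verbose: bool = True) -> Path:
    """Resolve the YAML template for a language/verbose pair (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    suffix = "_plus" if verbose else ""
    return SEED_PROMPT_DIR / f"code_attack_{language}{suffix}.yaml"


default_template = template_path("python_stack")
plain_template = template_path("cpp", verbose=False)
```

Validating language up front means an invalid value surfaces as a ValueError at construction time instead of as a missing-file error when the template is loaded.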
CodeAttackAttack - pyrit/executor/attack/single_turn/code_attack.py
Subclasses PromptSendingAttack. In __init__, instantiates CodeAttackConverter and prepends it to _request_converters (same pattern as FlipAttack). In _setup_async, injects a system prompt that frames the session as a code-completion environment and stores it in context.prepended_conversation.
Parameters beyond those inherited from PromptSendingAttack:
language and verbose, forwarded to the converter.
No new scoring logic - callers supply a scorer via AttackScoringConfig as usual.
Scope
In scope:
- CodeAttackConverter with language and verbose parameters and five YAML seed-prompt templates (python_stack, python_list, python_string, cpp, go)
- CodeAttackAttack as a PromptSendingAttack subclass
- Unit tests in tests/unit/prompt_converter/test_code_attack_converter.py and tests/unit/executor/attack/single_turn/test_code_attack.py, following the test_flip_converter.py / test_flip_attack.py patterns (MagicMock, AsyncMock, patch_central_database)
- A jupytext notebook in doc/code/executor/attack/ demonstrating end-to-end usage
Out of scope:
- Auto-selection of language based on target-model capability
- Multimodal variants
- The paper's --num-samples outer retry loop (orthogonal - covered by max_attempts_on_failure + a scorer in the existing framework)
I'd like to take this. Will hold off on writing any code until @romanlutz approves the design above.