FEAT CodeAttack #1945

## Summary

CodeAttack (Ren et al., ACL 2024, [arXiv:2403.07865](https://arxiv.org/abs/2403.07865), repo: https://github.com/renqibing/CodeAttack) reformulates a harmful natural-language query as a code-completion task. The query is encoded character-by-character into a data-structure initialisation sequence (e.g., successive `deque.appendleft()` calls, list literals, or string concatenations) and embedded inside a partial code template that declares a `decode()` stub and an output-filling skeleton. The model is then asked to complete the code. Because the harmful intent is expressed as a programming task rather than a natural-language request, safety training, which is applied predominantly to natural-language inputs, fails to trigger consistently. The authors demonstrate this against GPT-4, Claude, and Gemini with black-box API access only.
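To make the encoding step concrete, here is a minimal sketch of only that step, run on a benign string. It assumes the character-level `deque.appendleft()` variant described above; the function name `encode_as_deque_ops`, the variable name `my_queue`, and the local `decode()` helper are illustrative and not taken from the reference implementation. In the attack itself `decode()` is left as a stub for the model; it is filled in here only to check the round trip.

```python
from collections import deque


def encode_as_deque_ops(query: str, var_name: str = "my_queue") -> str:
    """Render `query` as deque.appendleft() statements, one per character.

    Hypothetical helper; granularity and naming are assumptions.
    """
    lines = [f"{var_name} = deque()"]
    for ch in query:
        # repr() yields a valid Python string literal (quotes/backslashes escaped).
        lines.append(f"{var_name}.appendleft({ch!r})")
    return "\n".join(lines)


def decode(queue: deque) -> str:
    """appendleft() stores characters in reverse order, so read right-to-left."""
    return "".join(reversed(queue))


snippet = encode_as_deque_ops("hi there")

# Execute the generated initialisation code and confirm it decodes to the input.
namespace = {"deque": deque}
exec(snippet, namespace)
round_trip = decode(namespace["my_queue"])
```

The real converter would place `snippet` inside the selected template rather than executing it.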

## How this differs from CodeChameleon (already in PyRIT)

Both attacks wrap the prompt in code, but they hide the intent in opposite ways.

| | CodeChameleon | CodeAttack |
|---|---|---|
| Encoding step | Encrypts the prompt using a known algorithm (reverse, binary-tree, odd-even, length-sort) | Encodes the query character-by-character into data-structure initialisation code |
| Decryption key | Embeds a decrypt function in the converted prompt; the model must decrypt and then answer | No separate decryption key; the `decode()` stub is a function the model is asked to write |
| Core disguise | Code as cipher: the encoding is the obfuscation | Code as task: the completion framing is the obfuscation |
| Model instruction | "Solve the encrypted problem using the decryption function" | "Complete this code so that it initialises the output list with the relevant paragraphs" |

CodeChameleon is already in `pyrit/prompt_converter/codechameleon_converter.py`. CodeAttack is not implemented anywhere in PyRIT.

## Proposed design

The design follows the `FlipConverter` + `FlipAttack` two-class template.

**`CodeAttackConverter`** - `pyrit/prompt_converter/code_attack_converter.py`

Transforms an input prompt into a code template. The converter handles the encoding (wrapping each character/token of the prompt into the chosen data-structure operations) and the template rendering. Being a standalone converter, it composes freely with other converters via `AttackConverterConfig`.

Key parameters (grounded in the reference implementation's `--prompt-type` and template variants):
- `language` (`str`): one of `"python_stack"`, `"python_list"`, `"python_string"`, `"cpp"`, `"go"`; selects which template family to use.
- `verbose` (`bool`, default `True`): selects the `_plus` template variant, which adds a more detailed output-structure specification and produces longer, more complete responses.

Template YAML seed prompts live under `pyrit/prompt_converter/seed_prompts/`, one file per language/verbose combination, matching the existing `codechameleon_converter.yaml` convention.
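As a rough sketch of how the two parameters could map to a template file, assuming a `code_attack_<language>[_plus].yaml` naming scheme (the file names, `TemplateChoice`, and `SUPPORTED_LANGUAGES` are hypothetical; only the directory and the `_plus` suffix come from the description above):

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Values listed for the proposed `language` parameter.
SUPPORTED_LANGUAGES = ("python_stack", "python_list", "python_string", "cpp", "go")

# Directory named in this proposal; the file naming below is an assumption.
SEED_PROMPT_DIR = PurePosixPath("pyrit/prompt_converter/seed_prompts")


@dataclass(frozen=True)
class TemplateChoice:
    """Hypothetical holder for the converter's two parameters."""

    language: str
    verbose: bool = True

    def __post_init__(self) -> None:
        # Reject unknown template families early, before any file lookup.
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(
                f"language must be one of {SUPPORTED_LANGUAGES}, got {self.language!r}"
            )

    @property
    def template_path(self) -> PurePosixPath:
        # verbose=True selects the `_plus` variant of the template.
        suffix = "_plus" if self.verbose else ""
        return SEED_PROMPT_DIR / f"code_attack_{self.language}{suffix}.yaml"


default_path = TemplateChoice("go").template_path
```

Validating `language` in the constructor keeps a typo from surfacing later as a missing-file error.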

**`CodeAttackAttack`** - `pyrit/executor/attack/single_turn/code_attack.py`

Subclasses `PromptSendingAttack`. In `__init__`, instantiates `CodeAttackConverter` and prepends it to `_request_converters` (same pattern as `FlipAttack`). In `_setup_async`, injects a system prompt that frames the session as a code-completion environment and stores it in `context.prepended_conversation`.

Parameters beyond those inherited from `PromptSendingAttack`:
- `language` and `verbose`, forwarded to the converter.

No new scoring logic - callers supply a scorer via `AttackScoringConfig` as usual.
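The converter-prepending behaviour can be illustrated without PyRIT installed. The sketch below uses stand-in classes, not the real `PromptSendingAttack` or `CodeAttackConverter`; every name in it is hypothetical, and the placeholder converter only tags the text rather than rendering a template. It shows the intended ordering: the CodeAttack converter runs first, then any caller-supplied converters.

```python
from typing import Callable, List, Optional

Converter = Callable[[str], str]


class SendingAttackStandIn:
    """Stand-in for PromptSendingAttack: applies request converters in order."""

    def __init__(self, request_converters: Optional[List[Converter]] = None) -> None:
        self._request_converters: List[Converter] = list(request_converters or [])

    def render(self, objective: str) -> str:
        text = objective
        for convert in self._request_converters:
            text = convert(text)
        return text


class CodeAttackStandIn(SendingAttackStandIn):
    """Stand-in for the proposed CodeAttackAttack."""

    def __init__(
        self,
        *,
        language: str = "python_stack",
        verbose: bool = True,
        request_converters: Optional[List[Converter]] = None,
    ) -> None:
        super().__init__(request_converters)
        variant = f"{language}{'_plus' if verbose else ''}"

        def code_converter(text: str) -> str:
            # Placeholder for CodeAttackConverter: tags the text with the variant.
            return f"[{variant}] {text}"

        # Prepend so the code converter runs before caller-supplied converters.
        self._request_converters = [code_converter] + self._request_converters


attack = CodeAttackStandIn(language="go", verbose=False, request_converters=[str.upper])
rendered = attack.render("abc")
```

Because the converter is prepended, caller-supplied converters operate on the rendered template rather than on the raw objective.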

## Scope

**In scope:**
- `CodeAttackConverter` with `language` and `verbose` parameters and five YAML seed-prompt templates (python_stack, python_list, python_string, cpp, go)
- `CodeAttackAttack` as a `PromptSendingAttack` subclass
- Unit tests in `tests/unit/prompt_converter/test_code_attack_converter.py` and `tests/unit/executor/attack/single_turn/test_code_attack.py`, following the `test_flip_converter.py` / `test_flip_attack.py` patterns (`MagicMock`, `AsyncMock`, `patch_central_database`)
- A jupytext notebook in `doc/code/executor/attack/` demonstrating end-to-end usage

**Out of scope:**
- Auto-selection of `language` based on target-model capability
- Multimodal variants
- The paper's `--num-samples` outer retry loop (orthogonal - covered by `max_attempts_on_failure` + a scorer in the existing framework)

---

I'd like to take this. Will hold off on writing any code until @romanlutz approves the design above.

