Skip to content

feat: Add BijectionConverter and BijectionAttack (#1903)#1942

Open
sajisanchu1913-source wants to merge 12 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack
Open

feat: Add BijectionConverter and BijectionAttack (#1903)#1942
sajisanchu1913-source wants to merge 12 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack

Conversation

@sajisanchu1913-source
Copy link
Copy Markdown
Contributor

Summary

Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.

The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.

Changes

New Files

  • pyrit/prompt_converter/bijection_converter.py — generates random letter-to-letter mapping, encodes prompts, decodes responses
  • pyrit/executor/attack/single_turn/bijection_attack.py — runs full bijection attack with teaching phase
  • tests/unit/prompt_converter/test_bijection_converter.py — 11 unit tests for converter
  • tests/unit/executor/test_bijection_attack.py — 5 unit tests for attack
  • doc/code/executor/attack/bijection_attack.ipynb — usage notebook

Modified Files

  • pyrit/prompt_converter/__init__.py — registered BijectionConverter
  • pyrit/executor/attack/single_turn/__init__.py — registered BijectionAttack

How It Works

  1. BijectionConverter generates a random secret mapping (e.g. a→q, b→x...)
  2. BijectionAttack sends teaching messages to target AI to teach the mapping
  3. Harmful prompt is encoded and sent as TASK is '⟪encoded prompt⟫'
  4. Response is decoded using inverse mapping
  5. Decoded response is scored by the judge

Pattern Followed

  • BijectionConverter follows FlipConverter pattern
  • BijectionAttack follows FlipAttack pattern

Reference

sajisanchu1913-source and others added 12 commits May 28, 2026 17:14
- _RemoteDatasetLoader._fetch_zip_from_url:
  - keyword-only args (source, inner_files, cache)
  - streams download (requests stream=True + iter_content) to avoid
    double-buffering large archives
  - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
    cache=True; named temp file otherwise (cleaned up after parse)
  - validates each inner_files extension against FILE_TYPE_HANDLERS;
    raises ValueError with a member preview if an inner file is missing
  - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
    so the open ZipFile never escapes the worker thread
  - adds the missing import zipfile that broke the previous commit
- _MICDataset:
  - drops unused io / json / requests imports (helper handles them)
  - delegates download + parse to the helper; only owns the seed
    construction loop
  - guards non-string Q values (in addition to NaN moral values)
  - forwards cache from fetch_dataset_async to the helper
  - factors authors into AUTHORS class constant
- Tests:
  - test_moral_integrity_corpus_dataset.py: stops mocking requests.get
    directly; patches _fetch_zip_from_url to return parsed dicts so
    tests don't depend on the helper's internal shape
  - adds test_fetch_dataset_non_string_q and
    test_fetch_dataset_passes_cache_flag
  - hoists imports into the right groups so ruff I001 stops firing
  - removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
  happy path, on-disk caching (hits 1 network call across 2 fetches),
  cache=False does not persist, missing inner file raises ValueError,
  unsupported extension raises ValueError

Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav
  to prevent concurrent call collisions
- Wrap Azure upload in try/finally to ensure temp file is always
  deleted even when upload fails
- Add regression test to verify cleanup on upload failure

Fixes microsoft#1894
- Add BijectionConverter that generates random letter-to-letter mapping
- Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts
- Add unit tests for both converter and attack
- Add notebook demonstrating usage
- Update __init__.py files to register new classes

Based on arXiv:2410.01294 (Haize Labs bijection-learning)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FEAT Bijection

2 participants