Skip to content

REFACTOR: Shared image fetch helper for multimodal seed datasets#1776

Merged
romanlutz merged 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/refactor-image-fetch-helper
May 22, 2026
Merged

REFACTOR: Shared image fetch helper for multimodal seed datasets#1776
romanlutz merged 2 commits into
microsoft:mainfrom
romanlutz:romanlutz/refactor-image-fetch-helper

Conversation

@romanlutz
Copy link
Copy Markdown
Contributor

Summary

Follow-up to @jsong468's review comment on #1757: #1757 (comment) (out of scope for this PR).

Four multimodal seed-dataset loaders under pyrit/datasets/seed_datasets/remote/ duplicated the same download an image from a URL, cache it under seed-prompt-entries, return the local path logic. This PR extracts that logic into a single shared async helper.

What changed

Added pyrit/datasets/seed_datasets/remote/_image_cache.py with:

async def fetch_and_cache_image_async(
    *,
    filename: str,
    image_url: str | None = None,
    image_bytes: bytes | None = None,
    log_prefix: str = ""image-cache"",
    request_headers: Mapping[str, str] | None = None,
    request_timeout: float | None = None,
    follow_redirects: bool = False,
) -> str
  • Caller passes filename so each loader keeps its existing on-disk cache filename pattern (no user-visible cache invalidation).
  • image_bytes path supports loaders that have raw bytes in hand (e.g. PIL images from HuggingFace rows) without going through the network. This makes the helper a drop-in fit for the MSTS loader once FEAT: Add MSTS multimodal safety dataset loader #1757 lands.
  • request_headers/request_timeout/follow_redirects cover VLSU's custom UA + 2 s timeout + redirect needs without leaking a generic **httpx_kwargs through the public API.
  • Uses str(Path(results_path) / sub_dir / filename) for cross-OS-stable cache paths so the skip download if file exists fast-path fires correctly on Windows.
  • Raises on failure to match the existing try / except Exception -> log + skip convention used at every call site.
  • Preserves the existing RuntimeError message (Serializer memory is not properly configured) so existing pytest.raises(match=...) assertions keep passing.

Why a standalone module (rather than a base-class method or mixin)

  • Only 4 of ~30 remote dataset loaders touch images - adding the method to _RemoteDatasetLoader would bloat the base.
  • Standalone helper is easy to test in isolation.
  • No existing _helpers.py / _common.py in remote/, so the new module fits the prevailing one-helper-per-file pattern.

Migrated loaders

Loader Method Filename pattern (unchanged)
_HarmBenchMultimodalDataset _fetch_and_save_image_async harmbench_{behavior_id}.png
_VisualLeakBenchDataset _fetch_and_save_image_async visual_leak_bench_{example_id}.png
_VLSUMultimodalDataset _fetch_and_save_image_async ml_vlsu_{group_id}.png (custom headers/timeout/redirects preserved)
_ComicJailbreakDataset _fetch_template_async comic_jailbreak_{template_name}.png (template-name validation kept in the wrapper)

Each loader's existing wrapper method survives as a 1-3 line delegating call, so every existing patch.object(loader, "_fetch_and_save_image_async", ...) test mock keeps working untouched.

_VLGuardDataset is not migrated - it downloads a zip from HuggingFace Hub instead of an individual image URL, so the pattern doesn't apply.

Tests

  • New tests/unit/datasets/test_image_cache.py (8 tests) directly exercising the helper: cache hit skips network, URL fetch path writes response bytes, image_bytes path skips network entirely, ValueError on missing input, RuntimeError on misconfigured memory, exception propagation, custom headers/timeout/redirects forwarded to the HTTP client, and graceful fallback when the cache-existence check itself fails.
  • Existing loader tests adjusted in three small ways:
    • Patch target switched from <loader_module>.data_serializer_factory to _image_cache.data_serializer_factory (the loaders no longer import the factory directly).
    • expected_path assertions switched to str(Path(...)) for Windows/POSIX portability.
    • No test logic changes.

All 439 unit tests under tests/unit/datasets pass.

Verification

uv run ruff format pyrit tests doc            # 1127 files left unchanged
uv run ruff check pyrit tests doc              # All checks passed!
uv run -m ty check pyrit                       # No new diagnostics in changed files
uv run pytest tests/unit/datasets -v           # 439 passed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jsong468 jsong468 self-assigned this May 21, 2026
Copy link
Copy Markdown
Contributor

@jsong468 jsong468 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing!

@romanlutz romanlutz added this pull request to the merge queue May 22, 2026
Merged via the queue into microsoft:main with commit 79ae230 May 22, 2026
48 checks passed
@romanlutz romanlutz deleted the romanlutz/refactor-image-fetch-helper branch May 22, 2026 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants