FEAT Add JailbreakV_28k dataset from HF #1098
Conversation
romanlutz
left a comment
Thanks for getting started on this!
The integration test for datasets is missing, but I suspect it will require a custom one as the dataset is meant to be multimodal (see other comment).
romanlutz
left a comment
Great work! Two small adjustments and we're ready to merge.
@AdrGav941 a lot changed in datasets the last couple of weeks. We should have really tried to merge it before the changes but didn't quite get to it. Please let me know if you want to make the changes yourself or if we should make the change.
@romanlutz I'm happy to make the changes. I'm on vacation until the 19th but can get it working again when I get back!
No hurry 🙂
8c52bc6 to f117a58
…kV_28K_dataset' into add__HF_jailbreakV_28K_dataset
romanlutz
left a comment
Couple minor comments, otherwise this looks good to me. Just need to try it out once to make sure it works.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
86bed80 to 8ddc37a
- Use _validate_enums helper instead of bespoke validation
- Preserve source casing in per-seed harm_categories
- Use canonical empty-result error string
- Add unit tests (28K: 92% coverage, RedTeam-2K: 100% coverage)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tebook
- Add @luo2024jailbreakv BibTeX entry and cite key, reference from both
loader docstrings per dataset instructions
- Fix RedTeam-2K _HarmCategory.CHILD_ABUSE_CONTENT -> CHILD_ABUSE so the
filter actually matches the upstream policy value ('Child Abuse' in the
RedTeam-2K config, vs 'Child Abuse Content' in the 28K config)
- Regenerate 1_loading_datasets.ipynb so the get_all_dataset_names_async
output cell picks up jailbreakv_28k and jailbreakv_redteam_2k
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 16GB image set lives behind a Google Form -> Google Drive link and cannot be auto-fetched in CI. Mirror the existing PromptIntel pattern in test_all_datasets.py and skip the parameterized fetch when ~/JailBreakV_28K.zip isn't present. The text-only RedTeam-2K sibling fetches from the public HF metadata and continues to run unconditionally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
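The skip condition described in the commit message above could look roughly like the sketch below. It uses the standard library's unittest so that it runs without extra dependencies. That choice is an assumption, as are the names `archive_present` and `TestJailbreakVFetch`. The actual test_all_datasets.py may use a different mechanism.

```python
import unittest
from pathlib import Path

# Archive location named in the commit message.
JAILBREAKV_ZIP = Path.home() / "JailBreakV_28K.zip"


def archive_present(path: Path = JAILBREAKV_ZIP) -> bool:
    """Return True when the manually downloaded image archive exists locally."""
    return path.is_file()


class TestJailbreakVFetch(unittest.TestCase):
    # Hypothetical test case: skipped on machines (such as CI) without the archive.
    @unittest.skipUnless(archive_present(), f"{JAILBREAKV_ZIP} not found; skipping image-backed fetch")
    def test_fetch_with_images(self):
        # Placeholder body: a real test would call the dataset loader here.
        self.assertTrue(JAILBREAKV_ZIP.is_file())
```

A text-only sibling test would carry no such decorator and would run unconditionally.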
…K_dataset
# Conflicts:
# doc/bibliography.md
# doc/references.bib
# tests/end_to_end/test_all_datasets.py
…o 100%
- doc/references.bib: drop -28K from the BibTeX title so it renders cleanly
({JailBreakV} instead of {JailBreakV-28K}) and swap author order to match
the upstream README (Xiaoyu Guo before Chaowei Xiao).
- Mirror the corrected author order in both loaders' docstrings, SeedObjective,
and SeedPrompt authors lists.
- Add three unit tests to cover the previously-untested 9 lines in
jailbreakv_28k_dataset.py (zip-extract branch, empty image_path row,
_resolve_image_path exception fallback). Coverage now 100%.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>



Description
This PR adds support for the JailbreakV_28k dataset to PyRIT.
One notable departure from how other multimodal datasets are fetched: the images must be downloaded locally from a Google Drive link provided by the owners of the HF dataset. The share link to the zip file is in the function comments. The function does not work without this local download because a large number of images are missing from HF.
If the extracted directory is not present at the provided path, the loader unzips the archive. As of right now we do not use HF at all for image download, due to the large number of missing images, so the zip directory is a mandatory parameter.
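As a rough illustration of the unzip-if-missing behavior described above, here is a minimal sketch. The helper name `ensure_images_extracted`, its signature, and the choice to extract next to the archive are assumptions made for this example, not the loader's actual code.

```python
import zipfile
from pathlib import Path


def ensure_images_extracted(zip_path: Path) -> Path:
    """Return the directory holding the extracted images, unzipping on first use.

    Hypothetical helper: the name, signature, and extraction location are
    assumptions for illustration only.
    """
    zip_path = Path(zip_path).expanduser()
    # e.g. ~/JailBreakV_28K.zip -> ~/JailBreakV_28K
    extracted_dir = zip_path.with_suffix("")

    # Reuse an earlier extraction instead of unpacking the archive again.
    if extracted_dir.is_dir():
        return extracted_dir

    # The archive has to be downloaded manually, so fail with a clear message.
    if not zip_path.is_file():
        raise FileNotFoundError(f"Image archive not found: {zip_path}")

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extracted_dir)
    return extracted_dir
```

Checking for the extracted directory first means repeated calls do not unpack the large archive more than once.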
Addresses https://github.com/Azure/PyRIT/issues/1007
Changes Made:
Files Added/Modified:
Tests and Documentation