Fix multiline JSON extraction in exceptions helpers #1474

Merged
rlundeen2 merged 2 commits into microsoft:main from biefan:fix-multiline-json-extraction
Mar 16, 2026

biefan commented on Mar 16, 2026

Summary

This updates extract_json_from_string() to correctly extract multiline JSON payloads from larger strings.

Problem

extract_json_from_string() currently relies on a regex:

re.compile(r"\{.*\}|\[.*\]")

That has two problems for real model output:

  1. The . metacharacter does not match newlines by default (the pattern is compiled without re.DOTALL), so multiline JSON objects and arrays are not matched as a whole.
  2. When the outer JSON spans multiple lines but contains a nested single-line object, the regex can skip the outer payload and return the nested fragment instead.
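
The two failure modes above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not code from the PR: the pattern is the one quoted in the description, while the `response` string and the variable names are made up for illustration.

```python
import re

# The pattern quoted in the PR description (compiled without re.DOTALL).
OLD_PATTERN = re.compile(r"\{.*\}|\[.*\]")

# Hypothetical model output: explanatory text around a pretty-printed JSON body.
response = 'Here is the result:\n{\n  "a": {"b": 1},\n  "c": 2\n}\nDone.'

# The outer "{" is followed by a newline, so ".*" cannot reach the closing "}"
# and the search moves on to the nested single-line object instead.
match = OLD_PATTERN.search(response)
old_result = match.group(0) if match else None

print(old_result)  # {"b": 1}
```

The extracted fragment is valid JSON, which is what makes the bug easy to miss: downstream parsing succeeds, just on the wrong payload.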

In practice, this means helper code can extract the wrong JSON fragment from valid responses that include explanatory text plus a pretty-printed JSON body.

Fix

Replace the regex-based extraction with json.JSONDecoder().raw_decode() scanning logic:

  • scan for candidate { / [ starting positions
  • attempt to decode a JSON object or array from each candidate position
  • return the first complete decodable payload

This preserves existing behavior for single-line JSON while fixing multiline extraction and avoiding nested-fragment false matches.
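
The scanning approach described above might be sketched as follows. This is an illustration under stated assumptions rather than the merged implementation: `extract_json_sketch` is a hypothetical name, and returning `None` when nothing decodes is an assumption, since the PR text does not say what the real helper does in that case.

```python
import json
from typing import Optional


def extract_json_sketch(text: str) -> Optional[str]:
    """Return the first decodable JSON object or array found in text.

    Hypothetical stand-in for extract_json_from_string(); the fallback
    of returning None is an assumption made for this sketch.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        # Only "{" and "[" can open a JSON object or array.
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # returns (value, index just past the end of that value).
            _, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON from this position; try the next candidate.
            continue
        return text[start:end]
    return None


# Hypothetical model output with a pretty-printed body and a nested object.
response = 'Here is the result:\n{\n  "a": {"b": 1},\n  "c": 2\n}\nDone.'
extracted = extract_json_sketch(response)
print(json.loads(extracted))  # {'a': {'b': 1}, 'c': 2}
```

Because candidates are tried left to right and the decoder consumes a whole value, the outer object is found before any nested fragment, and single-line input such as `prefix [1, 2] suffix` still yields `[1, 2]`.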

Tests

Added a regression test covering a multiline JSON object embedded in surrounding text.

Validation command:

uv run --extra dev pytest tests/unit/exceptions/test_exceptions_helpers.py -q

Validation result:

31 passed in 0.04s

Copilot AI left a comment

Pull request overview

This PR improves JSON payload extraction from mixed model output strings by replacing a regex-based approach (which fails on multiline JSON and can select nested fragments) with a json.JSONDecoder().raw_decode() scan, and adds a regression test for the multiline/nested case.

Changes:

  • Replace regex JSON extraction with incremental decoding via json.JSONDecoder().raw_decode().
  • Ensure multiline JSON embedded in surrounding text is extracted correctly, without nested-fragment false matches.
  • Add a regression test for multiline object extraction.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pyrit/exceptions/exceptions_helpers.py | Reworks extract_json_from_string() to scan and decode JSON objects/arrays instead of using regex. |
| tests/unit/exceptions/test_exceptions_helpers.py | Adds a regression test covering multiline JSON with a nested single-line object. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@rlundeen2

Looks great, thanks for the PR @biefan!

rlundeen2 merged commit 9de7af2 into microsoft:main on Mar 16, 2026
34 of 35 checks passed
riyosha pushed a commit to riyosha/PyRIT that referenced this pull request Mar 24, 2026
Co-authored-by: rlundeen2 <137218279+rlundeen2@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>