Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent Illegal Look-Around for OneOf in JSONSchema #897

Merged
merged 2 commits into from
May 17, 2024

Conversation

lapp0
Copy link
Collaborator

@lapp0 lapp0 commented May 17, 2024

Fixes #823

This comment details the issues error: #823 (comment)

The reproduction code provided results in a json schema with OneOf[pets]:

class Model(BaseModel):
    pet: Union[Cat, Dog] = Field(..., discriminator='pet_type')

Before this PR: OneOf uses negative lookaheads to assert that only one schema member is included. This is illegal in interegular, more details available here: #456

After OneOf uses or-joined non-capturing groups which don't have the same issues with interegular.

@lapp0 lapp0 changed the title Issue 823 Fix Issue #823, Prevent Illegal Look-Around for OneOf in JSONSchema May 17, 2024
@rlouf rlouf added structured generation Linked to structured generation JSON labels May 17, 2024
@rlouf rlouf changed the title Fix Issue #823, Prevent Illegal Look-Around for OneOf in JSONSchema Prevent Illegal Look-Around for OneOf in JSONSchema May 17, 2024
@rlouf rlouf merged commit 499d19d into outlines-dev:main May 17, 2024
5 checks passed
@rlouf
Copy link
Member

rlouf commented May 17, 2024

Looks good, thank you for the fix!

other_subregexes_str = "|".join([f"{s}" for s in other_subregexes])
negative_lookahead = f"(?!.*({other_subregexes_str}))"
xor_patterns.append(f"({subregex}){negative_lookahead}")
xor_patterns = [f"(?:{subregex})" for subregex in subregexes]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lapp0 what is the difference bw oneOf and anyOf in this script? Both concat the subregex with |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also think additional conditions like just checking for ^ and just checking for $ needs to be added to this conditon?

if pattern[0] == "^" and pattern[-1] == "$":
return rf'(^"{pattern[1:-1]}"$)'

Copy link

@ekagra-ranjan ekagra-ranjan May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how about adding array along with object to the types to ignore here?

regexes = [
to_regex(resolver, {"type": t}, whitespace_pattern)
for t in instance_type
if t != "object"
Array like object requires many more parameters like items which is not provided in the list of types which is the reason why object is ignored I believe since properties cannot be mentioned when type is given as a list

Copy link
Collaborator Author

@lapp0 lapp0 May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ekagra-ranjan actually reviewing the test cases, I think anyOf is wrong. I'd appreciate a second set of eyes on this. Would you agree that

oneOf is correct (uses exclusive or), guarantees only one match

>>> print(re.fullmatch(r"((?:foo)|(?:bar))", "foo"))
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.fullmatch(r"((?:foo)|(?:bar))", "foobar"))
None

anyOf is incorrect, allows only one match (I think this regression was introduced in https://github.com/outlines-dev/outlines/pull/552/files)

>>> print(re.fullmatch(r"((foo)|(bar))", "foo"))
<re.Match object; span=(0, 3), match='foo'>
>>> print(re.fullmatch(r"((foo)|(bar))", "foobar"))
None

allOf is somewhat correct, requires generation of all schemas, but enforces an order which it shouldn't:

    elif "allOf" in instance:
        subregexes = [
            to_regex(resolver, t, whitespace_pattern) for t in instance["allOf"]
        ]
        subregexes_str = [f"{subregex}" for subregex in subregexes]
        return rf"({''.join(subregexes_str)})"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your other two comments, I appreciate your insights into json schema -> regex. It's definitely a work in progress and it's great that you're looking deep into it's behavior and identifying problems. I think it's out of scope with this PR. Could you create a separate issue so others can provide input as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
JSON structured generation Linked to structured generation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seems like passing a discriminated union crashes
3 participants