
Improved json grammar #3785

Closed
wants to merge 4 commits

Conversation

hughescr
Contributor

This PR includes #3782 #3783 #3784 but also completely removes all whitespace generation when applying format=json.

Whitespace is completely unnecessary in generated JSON - the spec allows it, but nothing requires us to emit it when we're generating rather than parsing. Since every token costs time, money, and CO2, removing all whitespace is highly recommended.

It will also fix a buglet where models could get "stuck" generating infinite whitespace, since the definition of ws places no limit on how long it can be.
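
For illustration, here is roughly what a whitespace-free grammar looks like. This is a minimal sketch that assumes ollama's format=json constraint is a llama.cpp-style GBNF grammar (like the json.gbnf that ships with llama.cpp); rule names and character classes here are illustrative, not the exact contents of this patch:

# Sketch only: an assumed whitespace-free JSON grammar, not the literal diff in this PR.
root   ::= object
value  ::= object | array | string | number | "true" | "false" | "null"
object ::= "{" ( string ":" value ("," string ":" value)* )? "}"
array  ::= "[" ( value ("," value)* )? "]"
string ::= "\"" ( [^"\\\x00-\x1F] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) )* "\""
number ::= "-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [-+]? [0-9]+)?

With no ws rule anywhere, the sampler can never wander into emitting spaces, tabs, or newlines, so the "infinite whitespace" failure mode is ruled out by construction.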

@not-nullptr

could we please have a contributor review this PR? json is currently unusable with the llama3 family of models, and i think this PR should fix it

@hughescr
Contributor Author

hughescr commented May 5, 2024

I think #3782 and #3784 are the bugfixes needed to fix llama3 (or any other model that likes generating bad JSON if unconstrained) -- one prevents output of malformed JSON strings (with unescaped characters in the middle of the strings); the other prevents endless whitespace at the end of the generated JSON. #3783 is more of a feature patch, allowing more flexibility in the JSON which is generated, so that you don't always require an object wrapper at the top level of the JSON - you can just have bare literal strings/numbers/arrays. That's sometimes good, but sometimes bad; I've actually seen llama3 generate worse JSON with #3783 than without it, because it'll do something like:

"{\"key\":\"value\"}"

i.e. it starts off with a literal string, then realizes it wants to emit an object, so you end up with a string that contains JSON...

I didn't include the whitespace change here as a separate PR, but it's a good performance boost: removing interior whitespace from the JSON means the LLM skips emitting all the whitespace characters that aren't needed.

My recommendation to whoever's deciding here would be to either accept this PR as-is (i.e. all the changes), or accept it and then back out #3783 if you think allowing non-objects at the top level of the JSON is too much.

@coder543

coder543 commented May 5, 2024

Personally, I think it is better to allow interior whitespace, but limit it to up to one new line and up to a small number of spaces between fields. Papers like “let’s think dot by dot” show that every token of output can help the models perform better, so if the models “want” more “time” to think, a little flexibility for the “comfort” of the model isn’t a terrible thing.

On the other hand, whitespace after the emitted JSON has already become syntactically valid is a total waste of time.

Regarding excessive flexibility on the top level type causing issues, I agree with that as well. I wish we could just supply either a custom grammar or a JSON Schema or something, but since we can’t… I wish we could at least specify whether we want an object or an array as the top level type. I do not see any possible reason to allow a string or a null as the top level type, but I could see some argument for a number or a Boolean. In the absence of being allowed to specify a top level type at all, I’d choose either to allow only an object, or to allow an object or an array.
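
For what it's worth, here is a sketch of that middle ground, again assuming a llama.cpp-style GBNF grammar (rule names and the exact whitespace budget are illustrative): bound interior whitespace to at most one newline plus a few spaces, and restrict the top level to an object or an array.

# Sketch only: bounded interior whitespace plus a restricted top-level type.
root   ::= object | array
ws     ::= "\n"? " "? " "? " "? " "?
object ::= "{" ws ( string ":" ws value ("," ws string ":" ws value)* )? ws "}"
array  ::= "[" ws ( value ("," ws value)* )? ws "]"
# string, number, and value as in the whitespace-free sketch earlier in the thread

Because neither root alternative ends in ws, there is still nowhere for trailing whitespace to go once the closing brace or bracket has been emitted.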

@coder543

coder543 commented May 5, 2024

I also wish we could get a maintainer to look at these changes… it has been a while. @jmorganca ?

@joliss

joliss commented May 16, 2024

It might be worth benchmarking whether or not the response quality suffers if you disallow whitespace.

It's conceivable that models might benefit from whitespace, either because the indentation helps them keep track of where they are in the object hierarchy, or because of the "let's think dot by dot" result that @coder543 mentions [1]. If that's the case, then perhaps there should be an option to toggle whitespace in the output.

But it's also conceivable that it makes little to no difference, and then we might as well default to never generating whitespace, to save on tokens.

I'll be away from my development machine for a while, but if I have anything to report in the future, I'll leave a comment here.

[1] Update: Reading the "Let's Think Dot by Dot" paper, their method requires bespoke model finetuning and doesn't work on existing models. So I don't think this is a concern here.

@not-nullptr

my reply isn't gonna be very constructive, but does it matter if quality is a tad worse? for any sort of real-time production use case, json mode is unusable. waiting 10+ seconds for a response because the LLM keeps outputting whitespace after the answer is not a feature, it's a bug. it would usually freeze, but there's a mechanism in ollama that cuts off generation if a character is repeated too many times (which also breaks the ollama JS library, since it throws an error if done is false on a non-streamed request).

@joliss

joliss commented May 16, 2024

I agree, and to be clear, my comment was only about the part where we remove all whitespace generation. Quoting the OP:

This PR includes #3782 #3783 #3784 but also completely removes all whitespace generation when applying format=json.

The changes in #3782 #3783 #3784 seem like unambiguous improvements, and I think they're all worth merging.

@not-nullptr

removing all whitespace seems like an unambiguous improvement too. i don't see why generation quality would be hindered, and every token is more CO2 (like OP said)

@coder543

@not-nullptr because of the Let’s Think Dot by Dot paper I linked? And in my previous reply, I was extremely clear that whitespace after the end of the valid JSON is a “total waste of time.” I’m not suggesting to keep that, at all. Why would anyone?

Regardless, I’d rather have these changes as-is than keep the current (broken) state of things, even if I believe forcibly removing all interior whitespace goes against the currently understood research on output quality.

@richardanaya

I'd offer that with the ability to override the grammar on API calls, as in PR #4525, we wouldn't have to debate which grammar to use for JSON.

@hughescr
Contributor Author

> It might be worth benchmarking whether or not the response quality suffers if you disallow whitespace.
>
> It's conceivable that models might benefit from whitespace, either because the indentation helps them keep track of where they are in the object hierarchy, or because of the "let's think dot by dot" result that @coder543 mentions [1]. If that's the case, then perhaps there should be an option to toggle whitespace in the output.
>
> But it's also conceivable that it makes little to no difference, and then we might as well default to never generating whitespace, to save on tokens.
>
> I'll be away from my development machine for a while, but if I have anything to report in the future, I'll leave a comment here.
>
> [1] Update: Reading the "Let's Think Dot by Dot" paper, their method requires bespoke model finetuning and doesn't work on existing models. So I don't think this is a concern here.

I have actually noticed that Phi-3 generates garbage JSON unless it's allowed to first output a single space and then begin the {... I've tried adjusting prompts to put the leading space in the reply and hope for continuation, but it just outputs another space and then the JSON. Constraining the output to not include that initial space basically completely cripples Phi-3.
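
A minimal sketch of the workaround described above, again assuming a llama.cpp-style GBNF grammar: permit a single optional space before the root object and nothing else, so a model like Phi-3 gets its leading space without reopening the door to unbounded whitespace.

# Sketch only: one optional leading space; everything below root stays whitespace-free.
root ::= " "? object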

@hughescr
Contributor Author

> my reply isn't gonna be very constructive, but does it matter if quality is a tad worse? for any sort of real-time production use case, json mode is unusable. waiting 10+ seconds for a response because the LLM keeps outputting whitespace after the answer is not a feature, it's a bug. it would usually freeze, but there's a mechanism in ollama that cuts off generation if a character is repeated too many times (which also breaks the ollama JS library, since it throws an error if done is false on a non-streamed request).

Trailing whitespace in particular ought to always be truncatable. There's NO REASON ever to generate it: the rest has already been generated, so JSON quality can't drop from removing trailing whitespace. As mentioned above, though, some models definitely suffer when leading or interior whitespace is forced out; how much will vary from model to model.

@coder543

Also, I just saw the edit from @joliss, but my conclusion was that the paper’s findings were broadly applicable. My reading was that they fine-tuned the final layer for a particular task to make the results clearer. But I could always be wrong.

To quote the paper’s abstract summary:

In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

@hughescr
Contributor Author

Phi3 is also super sensitive to interior whitespace, I guess. These runs were with a JSON grammar that allows ws before the root object but removes all other whitespace (like this PR, #3785, but allowing the initial space that Phi3 requires). As can be seen, this completely breaks the LLM, which just stops generating very early. When we don't force that tight JSON grammar, the model does just fine on its own (with a single trailing '\n'):

[craig@Megamind:~]$ ollama run phi3:medium-128k-instruct-q8_0 'Please return a JSON object representing an address book entry for some invented person' --format=json                                                                              [√][!21742]
 {"name": {"firstName":


[craig@Megamind:~]$ ollama run phi3:medium-128k-instruct-q8_0 'Please return a JSON object representing an address book entry for some invented person'                                                                                            [√][!21743]
 {
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "phone_number": "+1-555-0123456",
  "address": {
    "street": "123 Main St.",
    "city": "Metropolis",
    "state": "NY",
    "postal_code": "10001"
  },
  "notes": "Close friend and colleague. Enjoys hiking and photography."
}

...so probably this patch is not a good one in general, but #3782 and #3784 definitely are unambiguous improvements. With #3783 I've had occasional issues: some LLMs seem to like generating stringified JSON (i.e. JSON inside a string) with it, so if what you want is objects, you might often not get them with that patch, depending on the LLM.

I'm closing this PR since it doesn't seem ideal after using it with a variety of models for a while now. I'm leaving the others open, and I'll add some notes to #3783.

hughescr closed this May 28, 2024