Are we able to structure JSON output into a single line with just one whitespace? #908
Comments
Please pass |
I've been wondering if we should restrict the default pattern a little more, to accept a maximum of 4 spaces and one line break. That seems like a reasonable default that should cover most of what the model has seen during training.
My bike-shedding: typically with newlines there will be indentation involving 8, 12, or 16 spaces. We should set the default whitespace pattern to be |
Fair. We can give it a try and see if we still get complaints from users.
Where do you see that JSON training data follows the scheme of 4 spaces and one line break? I am asking about single-line (flat) JSON output in order to save tokens. My intuition is that well-formatted JSON is meant for humans, not LLMs; an LLM can handle a flat JSON structure in both input and output. A very long, potentially nested single-line JSON document can be difficult for a human to read, but the same is most probably not true for an LLM.
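To make the token-saving argument concrete, here is a minimal sketch that serializes the record from the example output in three ways using only the standard library. Character count is used as a rough stand-in for token count, which is an assumption: the real savings depend on the tokenizer.

```python
import json

# Sample record taken from the example output in this issue.
record = {"name": "Tim", "age": 25, "interest": "llm"}

# Human-oriented formatting: one line break plus 4 spaces per member.
pretty = json.dumps(record, indent=4)

# Single line with one space after each separator (json.dumps default).
flat = json.dumps(record)

# Single line with no optional whitespace at all.
compact = json.dumps(record, separators=(",", ":"))

# Character counts are only a proxy for tokens (tokenizer-dependent).
for label, text in [("pretty", pretty), ("flat", flat), ("compact", compact)]:
    print(f"{label:8s} {len(text):3d} chars")
```

All three strings parse back to the same object, so the extra whitespace carries no information for a consumer that only needs the data.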
Fixes #839 #908 #690 #450

## Problem

A major problem, especially with smaller language models, is the repetition problem. For example, let's say a model is generating json and must provide 12 space tokens for indentation in json output. Often a language model will assign a high probability to a 13th space token, and do the same for a 14th space, and then enter an infinite space generation loop.

This is a problem with NLG that has been known for half a decade, but only has mitigations (mirostat, repetition penalty, using hundreds of billions of weights, etc), no absolute solutions (except for **structured generation**).

## Solution

For structured json generation, we set a sane default whitespace pattern of `r"[ ]?"`. This removes all newlines and indentation. It disallows any syntactic whitespace beyond a single space separator.

Users can still set the argument `whitespace_pattern=` if they want different behavior.
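The effect of the `r"[ ]?"` default can be illustrated with a small standalone sketch. This is not the Outlines implementation: `flat_object_regex` is a hypothetical helper written for this illustration, and it only handles a flat object with fixed keys whose values are simple strings or integers. It shows how one whitespace pattern, placed at every point where JSON allows optional whitespace, decides which formattings are accepted.

```python
import re

# The default whitespace pattern quoted in the PR description:
# at most one space wherever JSON permits optional whitespace.
DEFAULT_WS = r"[ ]?"

def flat_object_regex(keys, whitespace_pattern=DEFAULT_WS):
    """Hypothetical helper (not part of Outlines): build a regex for a
    flat JSON object with the given keys, in order, whose values are
    either simple strings (no escapes) or integers."""
    ws = whitespace_pattern
    value = r'("[^"\\]*"|-?\d+)'
    members = (ws + "," + ws).join(
        rf'"{re.escape(key)}"{ws}:{ws}{value}' for key in keys
    )
    return re.compile(r"\{" + ws + members + ws + r"\}")

keys = ["name", "age", "interest"]
single_line = '{"name": "Tim", "age": 25, "interest": "llm"}'
indented = '{\n    "name": "Tim",\n    "age": 25,\n    "interest": "llm"\n}'

strict = flat_object_regex(keys)                 # default: r"[ ]?"
loose = flat_object_regex(keys, r"[\n ]*")       # any run of spaces/newlines

print(bool(strict.fullmatch(single_line)))  # single-line form is accepted
print(bool(strict.fullmatch(indented)))     # indentation is rejected
print(bool(loose.fullmatch(indented)))      # a looser pattern allows it
```

Under the strict pattern a 13th or 14th consecutive space can never be part of a valid match, so a constrained generator has no whitespace loop to fall into; a caller who wants indented output would have to opt in with a looser pattern.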
This was addressed by #916, closing for now.
Presentation of the new feature
Output JSON without wasting tokens on whitespace and line breaks.
Example output:
{"name" : "Tim" , "age" : 25 , "interest" : "llm" }
Where does it fit in Outlines?
Structured Generation
Are you willing to open a PR?
Yes.