Are we able to structure JSON output into a single line with just one whitespace? #908

timothylimyl · 2024-05-20T09:02:19Z

Presentation of the new feature

Output JSON without wasting tokens on whitespaces and linebreaks.

Example output: {"name" : "Tim" , "age" : 25 , "interest" : "llm" }

Where does it fit in Outlines?

Structured Generation

Are you willing to open a PR?

Yes.

lapp0 · 2024-05-20T09:59:52Z

Please pass whitespace_pattern=r'[ ]?' to outlines.generate.json()
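To illustrate what a whitespace pattern of `[ ]?` permits between JSON tokens, here is a minimal sketch using only Python's `re` module. The `object_regex` helper and its hand-written regex are hypothetical and much simpler than whatever Outlines actually generates from a schema; the only thing taken from this thread is the pattern itself.

```python
import re

WS = r"[ ]?"  # the whitespace pattern suggested in this thread


def object_regex(ws: str) -> re.Pattern:
    """Hypothetical hand-written regex for {"name": <string>, "age": <int>}.

    `ws` is inserted wherever JSON allows optional whitespace.
    This is an illustration only, not the regex Outlines builds.
    """
    string = r'"[^"\\]*"'            # simplified: no escape sequences
    integer = r"-?(0|[1-9][0-9]*)"
    return re.compile(
        r"\{" + ws
        + r'"name"' + ws + ":" + ws + string + ws + ","
        + ws + r'"age"' + ws + ":" + ws + integer + ws
        + r"\}"
    )


pattern = object_regex(WS)

# At most one space between tokens: accepted.
flat_ok = pattern.fullmatch('{"name": "Tim", "age": 25}') is not None
# No whitespace at all: also accepted, since the space is optional.
compact_ok = pattern.fullmatch('{"name":"Tim","age":25}') is not None
# Newlines and indentation: rejected.
indented_ok = pattern.fullmatch('{\n    "name": "Tim",\n    "age": 25\n}') is not None

print(flat_ok, compact_ok, indented_ok)
```

With this pattern every valid output fits on a single line, which is the behavior requested in the issue.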

rlouf · 2024-05-20T19:10:54Z

I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

lapp0 · 2024-05-20T20:07:11Z

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

My bike-shedding: with newlines there is typically indentation of 8, 12, or 16 spaces. We should set the default whitespace pattern to r'[ ]?' so that the JSON output is a single line.

rlouf · 2024-05-22T08:37:11Z

Fair. We can give it a try and see if we still get complaints from users.

timothylimyl · 2024-05-23T03:48:20Z

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

Where do you see that JSON in the training data follows a scheme of 4 whitespaces and one line break?

The reason I am asking about single-line (flat) JSON output is to save tokens. My intuition is that well-formatted JSON is meant for humans, not LLMs; an LLM can deal with a flat JSON structure in both input and output. A very long, potentially nested single-line JSON can be very difficult for a human to read, but the same is most probably not true for an LLM.
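As a rough illustration of the savings, the sketch below compares the same record serialized with indentation, on one line with single spaces, and fully compact, using only the standard `json` module. Character counts are only a proxy here: the actual token counts depend on the tokenizer, which this example does not model.

```python
import json

record = {"name": "Tim", "age": 25, "interest": "llm"}

# Human-oriented layout: newlines plus 4-space indentation.
pretty = json.dumps(record, indent=4)
# Single line with one space after ":" and "," (json.dumps defaults).
flat = json.dumps(record)
# Single line with no whitespace between tokens.
compact = json.dumps(record, separators=(",", ":"))

for label, text in [("pretty", pretty), ("flat", flat), ("compact", compact)]:
    print(f"{label:8s} {len(text):3d} chars")

# All three forms decode to the same data; only the layout differs.
same_data = json.loads(pretty) == json.loads(flat) == json.loads(compact) == record
```

The gap between the indented and flat forms grows with nesting depth, since every nested line repeats its indentation.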

Fixes #839 #908 #690 #450

## Problem

A major problem, especially with smaller language models, is the repetition problem. For example, let's say a model is generating json and must provide 12 space tokens for indentation in json output. Often a language model will assign a high probability to a 13th space token, and do the same for a 14th space, and then enter an infinite space generation loop.

This is a problem with NLG that has been known for half a decade, but only has mitigations (mirostat, repetition penalty, using hundreds of billions of weights, etc), no absolute solutions (except for **structured generation**).

## Solution

For structured json generation, we set a sane default whitespace pattern of `r"[ ]?"`. This removes all newlines and indentation. It disallows any syntactic whitespace beyond a single space separator.

Users can still set the argument `whitespace_pattern=` if they want different behavior.
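The difference between a bounded and an unbounded whitespace pattern can be sketched with plain `re`. The unbounded pattern `[\n ]*` below is an illustrative stand-in chosen for this example, not a claim about what the previous default was; `max_accepted_run` is a hypothetical helper.

```python
import re

# Illustrative stand-in for an unbounded whitespace pattern (an assumption).
UNBOUNDED = re.compile(r"[\n ]*")
# The bounded pattern described above: zero or one space.
BOUNDED = re.compile(r"[ ]?")


def max_accepted_run(pattern: re.Pattern, limit: int = 64) -> int:
    """Return the longest run of spaces (up to `limit`) the pattern fully matches."""
    longest = 0
    for n in range(limit + 1):
        if pattern.fullmatch(" " * n):
            longest = n
    return longest


# The unbounded pattern accepts every run we try, so a constrained model
# is always allowed to emit one more space.
unbounded_run = max_accepted_run(UNBOUNDED)
# The bounded pattern stops at a single space, so a whitespace loop is impossible.
bounded_run = max_accepted_run(BOUNDED)

print(unbounded_run, bounded_run)
```

Under the bounded pattern, the next token after a space must be a non-whitespace JSON token, which is what rules out the infinite space generation loop.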

rlouf · 2024-05-24T11:48:50Z

This was addressed by #916, closing for now.

timothylimyl · 2024-05-29T02:13:04Z

@rlouf @lapp0 Thanks for making the change.

lapp0 mentioned this issue May 23, 2024

Use less problematic whitespace token #916

Merged

rlouf closed this as completed May 24, 2024
