Are we able to structure JSON output into a single line with just one whitespace? #908

Closed
timothylimyl opened this issue May 20, 2024 · 7 comments

Comments

@timothylimyl

Presentation of the new feature

Output JSON without wasting tokens on whitespace and line breaks.

Example output: {"name" : "Tim" , "age" : 25 , "interest" : "llm" }

Where does it fit in Outlines?

Structured Generation

Are you willing to open a PR?

Yes.

@lapp0
Collaborator

lapp0 commented May 20, 2024

Please pass whitespace_pattern=r'[ ]?' to outlines.generate.json()
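To make the effect of that pattern concrete, here is a minimal standard-library sketch. It is not the regex Outlines actually builds from a schema; `flat_object_regex` is a hypothetical helper that hard-codes a two-field object and only illustrates how a pluggable whitespace pattern decides which layouts are accepted.

```python
import re


def flat_object_regex(whitespace_pattern: str) -> re.Pattern:
    """Hypothetical helper: regex for {"name": <string>, "age": <int>}.

    The whitespace pattern is spliced in wherever JSON permits
    insignificant whitespace between tokens.
    """
    ws = whitespace_pattern
    string = r'"[^"\\]*"'          # simplified JSON string, no escapes
    integer = r'(0|[1-9][0-9]*)'   # non-negative integer
    body = (
        r'\{' + ws
        + r'"name"' + ws + ':' + ws + string + ws + ','
        + ws + r'"age"' + ws + ':' + ws + integer + ws
        + r'\}'
    )
    return re.compile(body)


single_space = flat_object_regex(r'[ ]?')

one_line = '{"name": "Tim", "age": 25}'
indented = '{\n    "name": "Tim",\n    "age": 25\n}'

print(bool(single_space.fullmatch(one_line)))   # True
print(bool(single_space.fullmatch(indented)))   # False
```

With `[ ]?`, each gap between tokens holds at most one space, so newlines and indentation are rejected while the single-line form passes.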

@rlouf
Member

rlouf commented May 20, 2024

I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

@lapp0
Collaborator

lapp0 commented May 20, 2024

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

My bike-shedding: newlines typically come with indentation of 8, 12, or 16 spaces. We should set the default whitespace pattern to r'[ ]?' so that the JSON output is a single line.

@rlouf
Member

rlouf commented May 22, 2024

Fair. We can give it a try and see if we still get complaints from users.

@timothylimyl
Author

> I've been wondering if we should restrict the default pattern a little more to accept a maximum of 4 white spaces and one line break? That seems like a reasonable default that should cover most of what the model has seen during training.

Where do you see that JSON in the training data follows a scheme of 4 spaces and one line break?

The reason I am asking about single-line (flat) JSON output is to save tokens. My intuition is that well-formatted JSON is meant for humans, not LLMs; an LLM can deal with a flat JSON structure in both input and output. A single-line JSON document that is very long and potentially nested can be very difficult for a human to read, but the same is most probably not true for an LLM.
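A rough way to see the overhead is to count whitespace characters in different serializations of the example record, using only the standard library. Character counts are only a proxy here: the actual token savings depend on the tokenizer, which this sketch does not model.

```python
import json

record = {"name": "Tim", "age": 25, "interest": "llm"}

pretty = json.dumps(record, indent=4)                 # human-oriented layout
flat = json.dumps(record)                             # one line, single spaces
compact = json.dumps(record, separators=(",", ":"))   # no whitespace at all


def whitespace_chars(text: str) -> int:
    """Count spaces and newlines (none occur inside the values here)."""
    return sum(ch in " \n" for ch in text)


for label, text in [("pretty", pretty), ("flat", flat), ("compact", compact)]:
    print(f"{label:8} total={len(text):3} whitespace={whitespace_chars(text)}")
```

For this record the indented form spends 19 characters on whitespace, the single-line form 5, and the fully compact form none.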

rlouf pushed a commit that referenced this issue May 24, 2024
Fixes #839 #908 #690 #450

## Problem

A major problem, especially with smaller language models, is the
repetition problem.

For example, let's say a model is generating json and must provide 12
space tokens for indentation in json output. Often a language model will
assign a high probability to a 13th space token, and do the same for a
14th space, and then enter an infinite space generation loop.

This is a problem in NLG that has been known for half a decade, but
it only has mitigations (mirostat, repetition penalty, using hundreds of
billions of weights, etc.) and no absolute solutions (except for
**structured generation**).

## Solution

For structured json generation, we set a sane default whitespace pattern
of `r"[ ]?"`. This removes all newlines and indentation. It disallows
any syntactic whitespace beyond a single space separator.

Users can still set the argument `whitespace_pattern=` if they want
different behavior.
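
The repetition loop described above can be illustrated with a toy decoder. This is a sketch under loud assumptions: `generate` is a hypothetical stand-in for a model that always prefers a space whenever the constraint allows one, and `max_spaces` stands in for the whitespace pattern (`None` for an unbounded pattern such as `[ ]*`, `1` for `[ ]?`). It does not use Outlines or a real model.

```python
from typing import Optional


def generate(max_spaces: Optional[int], budget: int = 50) -> str:
    """Toy decoder that greedily emits spaces while the constraint permits.

    max_spaces=None models an unbounded whitespace pattern; an integer
    models a pattern that allows at most that many spaces in a row.
    """
    out = "{"
    spaces = 0
    while len(out) < budget:
        if max_spaces is None or spaces < max_spaces:
            out += " "       # the "model" keeps choosing a space
            spaces += 1
        else:
            out += '"'       # constraint forces the next structural character
            break
    return out


unbounded = generate(max_spaces=None)
bounded = generate(max_spaces=1)

print(repr(bounded))          # '{ "'
print(len(unbounded))         # 50: the whole budget went to spaces
```

With no bound, the toy decoder burns its entire budget on spaces; with at most one space allowed, it is forced to move on to the opening quote of the first key.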
@rlouf
Member

rlouf commented May 24, 2024

This was addressed by #916, closing for now.

@rlouf rlouf closed this as completed May 24, 2024
@timothylimyl
Author

@rlouf @lapp0 Thanks for making the change.
