Outlines guided generation #1539
Conversation
Force-pushed from a0c8b9a to d4de402
updates: JSON schemas are supported and can be used like:

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "info: david holtz like trees and has two cats. ",
    "parameters": {
        "max_new_tokens": 100,
        "grammar": {
            "$id": "https://example.com/person.schema.json",
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "title": "Person",
            "type": "object",
            "properties": {
                "firstName": {
                    "type": "string",
                    "description": "The person'\''s first name."
                },
                "lastName": {
                    "type": "string",
                    "description": "The person'\''s last name."
                },
                "hobby": {
                    "description": "The person'\''s hobby.",
                    "type": "string"
                },
                "numCats": {
                    "description": "The number of cats the person has.",
                    "type": "integer",
                    "minimum": 0
                }
            },
            "required": ["firstName", "lastName", "hobby", "numCats"]
        }
    }
}' | jq .
```

response

```json
{
  "generated_text": "{\"firstName\": \"david\", \"hobby\": \"trees\", \"lastName\": \"holtz\", \"numCats\": 2}"
}
```

regex strings are still supported as well:

```bash
curl --location 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
    "inputs": "name: david. email: ",
    "parameters": {
        "max_new_tokens": 20,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}'
```

```json
{
  "generated_text": "david@example.com.phone_number_1.1234567890.phone_"
}
```

notes: building the FSM from the grammar is very computationally expensive and is required for the first generation. Wait times can be ~10 seconds with complex grammars; this performance impact (along with other things) needs to be taken into account before adding this feature.
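Since the cost is per unique grammar, one possible mitigation (a sketch only, not what this PR implements; `compile_grammar_fsm` and its return value are hypothetical stand-ins for the outlines compilation step) is to memoize compilation by the grammar string:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def compile_grammar_fsm(grammar: str):
    # Stand-in for the expensive grammar -> token-FSM build
    # (the ~10s step for complex grammars). Memoizing on the grammar
    # string means each distinct grammar is compiled at most once per
    # process; repeated requests with the same grammar hit the cache.
    return {"grammar": grammar, "states": ...}  # placeholder FSM object

# First call pays the compilation cost; identical grammars are then free.
fsm = compile_grammar_fsm("[\\w-]+@([\\w-]+\\.)+[\\w-]+")
```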
Force-pushed from 370e47f to cadb0a9
```python
prefill_tokens_indices[
    out_start_index : out_end_index - 1
] = batch.input_ids[start_index + 1 : start_index + out_length]
prefill_tokens_indices[out_start_index : out_end_index - 1] = (
```
okay we're going to need to force lint everything with some standard instead of using each other's editor's default :)
ahh I totally agree we should standardize our formatters. Currently I'm using Black out of the box. Is there an existing config I should use?
Black is fine, we just need to enforce it repo-wide in the CI. (We need to pin a revision though; Black is not great at backward compatibility.)
Nice PR overall.
I've seen weird issues with the grammar leading to out-of-regex returns (I'm guessing it has to do with the llama hack).
```python
self.fsm_grammar_states[i] = self.grammar_processor.advance(
    next_ids[i].item(), self.fsm_grammar_states[i], self.grammars[i]
)
```
The whole point of `HeterogeneousProcessor` is to process all the tokens at once, without making any CPU calls. This `.item()` defeats the purpose.
I don't have any good ideas aside from moving this bit to the CPU loop in `causal_lm` / `flash_causal_lm`, which is probably error-prone.
@OlivierDehaene Do you have better suggestions?
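To make the concern concrete, here is a minimal illustrative sketch (the FSM advance is a made-up stand-in) of why per-token `.item()` calls hurt: each one forces a separate device-to-host synchronization, whereas a single batched transfer syncs once for the whole batch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
logits = torch.randn(4, 32000, device=device)
next_ids = logits.argmax(dim=-1)
states = [0] * next_ids.shape[0]

# Per-request loop: every .item() is its own GPU -> CPU copy + sync.
for i in range(next_ids.shape[0]):
    token = next_ids[i].item()
    states[i] = (states[i] + token) % 7  # stand-in for fsm.advance(...)

# Batched alternative: one transfer/sync for the whole batch.
tokens = next_ids.tolist()
states = [(s + t) % 7 for s, t in zip(states, tokens)]
```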
Seems to work great. We could also let server owners disable grammar as a simple way to keep complexity low and latency highly predictable.
```python
fsm = self.compile_fsm(grammars[i], self.tokenizer)
allowed_tokens = fsm.allowed_token_ids(fsm_grammar_states[i])
mask = torch.full((logits.shape[-1],), -math.inf, device=self.device)
```
We could generate that at the start of the loop and reuse it over and over, no?
Why is this resolved? Did you resolve it and forget to push, maybe?
@Narsil I think I may have misunderstood the original comment. I moved the FSM creation into `HeterogeneousGrammarLogitProcessor.__init__` and access `self.fsms[i]` in `HeterogeneousGrammarLogitProcessor.__call__`.
This should reduce the number of times the FSM is generated to the number of times `HeterogeneousNextTokenChooser` is initialized, instead of on each call.
Is there a different location I should move the FSM compilation to? Thank you!
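For readers following along, a minimal runnable sketch of the structure described above (the FSM class and compile step here are stubs, not the outlines API):

```python
from typing import List

class StubFSM:
    """Stub stand-in for an outlines-compiled grammar FSM."""
    def allowed_token_ids(self, state: int) -> List[int]:
        return [0, 1, 2]  # placeholder

def compile_fsm(grammar: str) -> StubFSM:
    # The expensive compilation step, paid once per request here.
    return StubFSM()

class HeterogeneousGrammarLogitProcessor:
    def __init__(self, grammars: List[str]):
        # Compile each request's FSM once, when the chooser is built...
        self.fsms = [compile_fsm(g) for g in grammars]

    def __call__(self, i: int, state: int) -> List[int]:
        # ...so each decode step is only a lookup, not a recompile.
        return self.fsms[i].allowed_token_ids(state)
```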
`mask` is being generated over and over. Reusing the mask seems better here (this kind of thing makes a difference, unfortunately).
Just create the mask once, and `fill_` it to reset its values.
To be fair, resetting the values of the mask is costly anyway, so maybe this is premature optimization, but allocation in a loop is a really easy one to fix.
For reference, I won 2ms/token on the mamba + CUDA graphs by moving the n_layers tensors into a single tensor so that the copies would be a single kernel launch (that's how bad launching any single op is).
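A minimal sketch of the suggested pattern (the vocab size and allowed-token list are made up for illustration):

```python
import math
import torch

vocab_size = 32000

# Allocate the mask once, outside the decode loop...
mask = torch.empty(vocab_size)

for _step in range(8):  # decode loop
    allowed = [1, 5, 42]  # stand-in for fsm.allowed_token_ids(state)
    # ...and reset it in place each step instead of re-allocating.
    mask.fill_(-math.inf)
    mask[allowed] = 0.0
    # logits += mask  # only grammar-allowed tokens keep their scores
```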
Force-pushed from cadb0a9 to fc689e0
```python
except json.JSONDecodeError:
    pass
```
Is this a try/except to support schemas that are already a regex? If so, can you add a comment?
It could be possible to move this compile to the router with PyO3 if this takes too much time.
yes it is, and I just added a comment in the latest commit. I agree we should move the compilation out of the server and into the router.
I like the idea of using PyO3 and it seems relatively straightforward (glanced at the docs). I'll start a new PR moving that logic as a follow-up.
While doing that, maybe we should ask for more explicitness from the users and force the grammar type to be specified.
That would avoid the try/except abuse and probably make error messages more readable. Wdyt?
{ 'grammar": {"type": "regex", "content": "..."}}
{ 'grammar": {"type": "json", "content": "..."}}
Also higher level API might be even better to expose to users: https://www.anyscale.com/blog/anyscale-endpoints-json-mode-and-function-calling-features
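As an illustration of the explicit-type idea, a server-side dispatch could look like the sketch below (the helper and the placeholder return value are hypothetical; note the final API later in this thread uses `value` rather than `content`):

```python
import json
import re

def parse_grammar(grammar: dict) -> str:
    """Dispatch on an explicit grammar type instead of try/except-ing
    JSONDecodeError, so errors name the actual problem."""
    kind = grammar.get("type")
    if kind == "regex":
        re.compile(grammar["value"])  # fail fast on an invalid pattern
        return grammar["value"]
    if kind == "json":
        # Stand-in: outlines derives a regex from the JSON schema;
        # here we only check the schema serializes cleanly.
        json.dumps(grammar["value"])
        return "<regex derived from schema>"  # placeholder
    raise ValueError(f"unknown grammar type: {kind!r}")

print(parse_grammar({"type": "regex", "value": "[a-z]+"}))
```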
It's okay to leave things for future PRs, but nothing that we tag in a revision can be removed later. So all the surface must be correct for us to do a release.
**update:** can be run with:

```bash
text-generation-launcher \
    --model-id HuggingFaceH4/zephyr-7b-beta \
    --grammar-support
```

Requests

With grammar support (**updated snippet**):

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}' | jq .
```

```json
{
  "generated_text": "{\n\"activity\": \"biking\",\n\"animals\": [\"puppy\",\"cat\",\"raccoon\"]\n , \"animals_seen\": 3,\n \"location\":\"park\"}"
}
```

Without grammar support: if support is not toggled on, then sending a grammar returns:

```json
{
  "error": "Input validation error: grammar is not supported",
  "error_type": "validation"
}
```
I would have done it the other way personally,
Nice :)
Have you verified if it works well with speculation?
Force-pushed from 0c9b22f to 63b0917
Force-pushed from 63b0917 to f0cdd9c
Now it should 🙂
great points! I've updated the PR to use
LGTM.
I think we can still improve further, let's do this in other PRs.
Did something change in the final implementation? I'm getting this error with this exact request:
@Narsil I got the same error with the demo schema.
Besides that, I also found some additional errors at the top of the FastAPI docs UI, like this: Resolver error at paths./generate.post.requestBody.content.application/json.schema.properties.parameters.properties.grammar.allOf.0.$ref
Hi @PawelFaron and @paulcx, thank you both for the feedback and for testing the new feature! In order to resolve the issues above, please note that the grammar can be of type `json`, for example:

```json
{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}
```

I hope this is helpful, please let me know if there are any other issues!
Thanks @drbh. The demo provided above works. Btw, would you please have a look at the errors?
Hi, I've tested this feature on orca-13b and llama-2-13b, both of which just generate
UPDATE
@Jason-CKY #1578 will fix this issue.
errors exist in #1578 as shown
This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs.

todo:

- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outlines-supported grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq
```

response

```json
{
    "generated_text": "david@example.com"
}
```

unguided request

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq
```

response

```json
{
    "generated_text": " email = 'david"
}
```