Outlines guided generation #1539
Conversation
Force-pushed from a0c8b9a to d4de402
updates: JSON schemas are supported and can be used like:

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "info: david holtz like trees and has two cats. ",
    "parameters": {
        "max_new_tokens": 100,
        "grammar": {
            "$id": "https://example.com/person.schema.json",
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "title": "Person",
            "type": "object",
            "properties": {
                "firstName": {
                    "type": "string",
                    "description": "The person'\''s first name."
                },
                "lastName": {
                    "type": "string",
                    "description": "The person'\''s last name."
                },
                "hobby": {
                    "description": "The person'\''s hobby.",
                    "type": "string"
                },
                "numCats": {
                    "description": "The number of cats the person has.",
                    "type": "integer",
                    "minimum": 0
                }
            },
            "required": ["firstName", "lastName", "hobby", "numCats"]
        }
    }
}' | jq .
```

response

```json
{
  "generated_text": "{\"firstName\": \"david\", \"hobby\": \"trees\", \"lastName\": \"holtz\", \"numCats\": 2}"
}
```

regex strings are still supported as well:

```bash
curl --location 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
    "inputs": "name: david. email: ",
    "parameters": {
        "max_new_tokens": 20,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}'
```

```json
{
  "generated_text": "david@example.com.phone_number_1.1234567890.phone_"
}
```

notes: building the FSM from the grammar is very computationally expensive and is required for the first generation. Wait times can be ~10 seconds with complex grammars; this performance impact (along with other things) needs to be taken into account before adding this feature.
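Since the cost is per unique grammar, one possible mitigation (a sketch only, not what this PR implements; `compile_grammar_fsm` and its return value are hypothetical stand-ins for the outlines compilation step) is to memoize compilation by the grammar string:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def compile_grammar_fsm(grammar: str):
    # Stand-in for the expensive grammar -> token-FSM build
    # (the ~10s step for complex grammars). Memoizing on the grammar
    # string means each distinct grammar is compiled at most once per
    # process; repeated requests with the same grammar hit the cache.
    return {"grammar": grammar, "states": ...}  # placeholder FSM object

# First call pays the compilation cost; identical grammars are then free.
fsm = compile_grammar_fsm("[\\w-]+@([\\w-]+\\.)+[\\w-]+")
```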
Force-pushed from 370e47f to cadb0a9
```python
prefill_tokens_indices[
    out_start_index : out_end_index - 1
] = batch.input_ids[start_index + 1 : start_index + out_length]
prefill_tokens_indices[out_start_index : out_end_index - 1] = (
```
okay we're going to need to force lint everything with some standard instead of using each other's editor's default :)
ahh I totally agree we should standardize our formatters. Currently I'm using Black out of the box. Is there an existing config I should use?
Black is fine, we just need to enforce it repo-wide in the CI. (We need to pin a revision though; Black is not great at backward compatibility.)
Nice PR overall.
I've seen weird issues with the grammar leading to out-of-regex returns (I'm guessing it has to do with the llama hack).
```python
self.fsm_grammar_states[i] = self.grammar_processor.advance(
    next_ids[i].item(), self.fsm_grammar_states[i], self.grammars[i]
)
```
The whole point of `HeterogeneousProcessor` is to process all the tokens at once, without making any CPU calls. This `.item()` defeats the purpose.
I don't have any good ideas aside from moving this bit to the CPU loop in `causal_lm` / `flash_causal_lm`, which is probably error-prone.
@OlivierDehaene Do you have better suggestions?
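To make the concern concrete, here is a minimal illustrative sketch (the FSM advance is a made-up stand-in) of why per-token `.item()` calls hurt: each one forces a separate device-to-host synchronization, whereas a single batched transfer syncs once for the whole batch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
logits = torch.randn(4, 32000, device=device)
next_ids = logits.argmax(dim=-1)
states = [0] * next_ids.shape[0]

# Per-request loop: every .item() is its own GPU -> CPU copy + sync.
for i in range(next_ids.shape[0]):
    token = next_ids[i].item()
    states[i] = (states[i] + token) % 7  # stand-in for fsm.advance(...)

# Batched alternative: one transfer/sync for the whole batch.
tokens = next_ids.tolist()
states = [(s + t) % 7 for s, t in zip(states, tokens)]
```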
Seems to work great. We could also let server owners disable grammar as a simple way to keep complexity low and latency highly predictable.
```python
fsm = self.compile_fsm(grammars[i], self.tokenizer)
allowed_tokens = fsm.allowed_token_ids(fsm_grammar_states[i])
mask = torch.full((logits.shape[-1],), -math.inf, device=self.device)
```
We could generate that at the start of the loop and reuse it over and over, no?
Why is this resolved? Did you resolve it and forget to push, maybe?
@Narsil I think I may have misunderstood the original comment. I moved the FSM creation into `HeterogeneousGrammarLogitProcessor.__init__` and access `self.fsms[i]` in `HeterogeneousGrammarLogitProcessor.__call__`.
This should reduce the number of times the FSM is generated to the number of times `HeterogeneousNextTokenChooser` is initialized, instead of on each call.
Is there a different location I should move the FSM compilation to? Thank you!
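For readers following along, a minimal runnable sketch of the structure described above (the FSM class and compile step here are stubs, not the outlines API):

```python
from typing import List

class StubFSM:
    """Stub stand-in for an outlines-compiled grammar FSM."""
    def allowed_token_ids(self, state: int) -> List[int]:
        return [0, 1, 2]  # placeholder

def compile_fsm(grammar: str) -> StubFSM:
    # The expensive compilation step, paid once per request here.
    return StubFSM()

class HeterogeneousGrammarLogitProcessor:
    def __init__(self, grammars: List[str]):
        # Compile each request's FSM once, when the chooser is built...
        self.fsms = [compile_fsm(g) for g in grammars]

    def __call__(self, i: int, state: int) -> List[int]:
        # ...so each decode step is only a lookup, not a recompile.
        return self.fsms[i].allowed_token_ids(state)
```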
`mask` is being generated over and over. Reusing the mask seems better here (this kind of thing makes a difference, unfortunately).
Just create the mask once, and `fill_` it to reset its values.
To be fair, resetting the values of the mask is costly anyway, so maybe this is premature optimization, but allocation in a loop is a really easy one to fix.
For reference, I won 2ms/token on the mamba + CUDA graphs by moving the n_layers tensors into a single tensor so that the copies would be a single kernel launch (that's how bad launching any single op is).
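A minimal sketch of the suggested pattern (the vocab size and allowed-token list are made up for illustration):

```python
import math
import torch

vocab_size = 32000

# Allocate the mask once, outside the decode loop...
mask = torch.empty(vocab_size)

for _step in range(8):  # decode loop
    allowed = [1, 5, 42]  # stand-in for fsm.allowed_token_ids(state)
    # ...and reset it in place each step instead of re-allocating.
    mask.fill_(-math.inf)
    mask[allowed] = 0.0
    # logits += mask  # only grammar-allowed tokens keep their scores
```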
Force-pushed from cadb0a9 to fc689e0
```python
except json.JSONDecodeError:
    pass
```
Is this a try/except to support schemas that are already a regex? If so, can you add a comment?
It could be possible to move this compile to the router with PyO3 if this takes too much time.
yes it is, and I just added a comment in the latest commit. I agree we should move the compilation out of the server and into the router.
I like the idea of using PyO3 and it seems relatively straightforward (glanced at the docs). I'll start a new PR moving that logic as a follow-up.
While doing that, maybe we should ask for more explicitness from the users and force the grammar type to be specified.
That would avoid the try/except abuse and probably make error messages more readable. Wdyt?
{ 'grammar": {"type": "regex", "content": "..."}}
{ 'grammar": {"type": "json", "content": "..."}}
Also higher level API might be even better to expose to users: https://www.anyscale.com/blog/anyscale-endpoints-json-mode-and-function-calling-features
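As an illustration of the explicit-type idea, a server-side dispatch could look like the sketch below (the helper and the placeholder return value are hypothetical; note the final API later in this thread uses `value` rather than `content`):

```python
import json
import re

def parse_grammar(grammar: dict) -> str:
    """Dispatch on an explicit grammar type instead of try/except-ing
    JSONDecodeError, so errors name the actual problem."""
    kind = grammar.get("type")
    if kind == "regex":
        re.compile(grammar["value"])  # fail fast on an invalid pattern
        return grammar["value"]
    if kind == "json":
        # Stand-in: outlines derives a regex from the JSON schema;
        # here we only check the schema serializes cleanly.
        json.dumps(grammar["value"])
        return "<regex derived from schema>"  # placeholder
    raise ValueError(f"unknown grammar type: {kind!r}")

print(parse_grammar({"type": "regex", "value": "[a-z]+"}))
```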
It's okay to leave things for future PRs, but nothing that we tag in a revision can be removed later. So all the surface must be correct for us to do a release.
**update:** can be run with:

```bash
text-generation-launcher \
    --model-id HuggingFaceH4/zephyr-7b-beta \
    --grammar-support
```

Requests

With grammar support (**updated snippet**):

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}' | jq .
```

```json
{
  "generated_text": "{\n\"activity\": \"biking\",\n\"animals\": [\"puppy\",\"cat\",\"raccoon\"]\n , \"animals_seen\": 3,\n \"location\":\"park\"}"
}
```

Without grammar support: if support is not toggled on, then sending a grammar returns:

```json
{
  "error": "Input validation error: grammar is not supported",
  "error_type": "validation"
}
```
I would have done it the other way personally,
Nice :)
Have you verified if it works well with speculation?
Force-pushed from 0c9b22f to 63b0917
Force-pushed from 63b0917 to f0cdd9c
Now it should 🙂
great points! I've updated the PR to use
LGTM.
I think we can still improve further, let's do this in other PRs.
Did something change in the final implementation? I'm getting this error with this exact request:
@Narsil I got the same error with the demo schema.
Besides that, I also found some additional errors at the top of the FastAPI docs UI, like this: Resolver error at paths./generate.post.requestBody.content.application/json.schema.properties.parameters.properties.grammar.allOf.0.$ref
Hi @PawelFaron and @paulcx, thank you both for the feedback and for testing the new feature! In order to resolve the issues above, please note that the grammar can be of type `json`, for example:

```json
{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}
```

I hope this is helpful, please let me know if there are any other issues!
Thanks @drbh. The demo provided above works. Btw, would you please have a look at the errors?
Hi, I've tested this feature on orca-13b and llama-2-13b, both of which just generate
UPDATE
@Jason-CKY #1578 will fix this issue.
errors exist in #1578 as shown
This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs.

todo:

- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outlines-supported grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq
```

response

```json
{
    "generated_text": "david@example.com"
}
```

unguided request

```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq
```

response

```json
{
    "generated_text": " email = 'david"
}
```