
feat: accept legacy request format and response #1527

Merged: 17 commits into main on Feb 29, 2024

Conversation

@drbh (Collaborator) commented Feb 2, 2024

This WIP PR adds support for the legacy OpenAI v1/completions API.

This should allow TGI to be a drop-in replacement for OpenAI when using tools that rely on the completions API.

Should fix: #1468
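
For illustration, a minimal sketch of how a tool built on the OpenAI Python client (v1+) could be pointed at a local TGI instance; the base_url, model name, and placeholder api_key are assumptions for a default local deployment, not something this PR prescribes:

from openai import OpenAI

# Point a standard OpenAI client at TGI's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="unused")

completion = client.completions.create(
    model="tgi",  # same placeholder model name as in the curl examples below
    prompt="What color is the sky?",
    max_tokens=10,
)
print(completion.choices[0].text)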

@spew commented Feb 2, 2024

To be most useful for code completions, this API would need to support FIM (fill-in-the-middle) models, which I believe means adding the suffix parameter to CompletionRequest.

https://platform.openai.com/docs/api-reference/fine-tuning/create#fine-tuning-create-suffix

Here's a way I could see this working (a sketch in code follows the list):

  1. Add a suffix parameter to CompletionRequest
  2. When a CompletionRequest comes in, see if the suffix field is present.
  3. If the suffix field is not present, continue on as normal (as in this PR).
  4. If the suffix field is present, check the model's tokenizer_config.json and look for a field named completionTemplate.
  5. If the completionTemplate field is present, it should contain a string that tells you how to format the prompt (the prefix) and the suffix for "fill in the middle". Use it to convert to a single string that can be sent to generate.
  6. If the completionTemplate field is not present, return a 400 error for bad request.
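
A rough, hypothetical sketch of that flow in Python; the completionTemplate field name, the {{ prompt }}/{{ suffix }} placeholder syntax, and the error message are taken from or assumed for the proposal above, not anything TGI implements:

def build_inputs(prompt: str, suffix: str | None, tokenizer_config: dict) -> str:
    """Hypothetical dispatch for the proposal above; not TGI's actual code."""
    if suffix is None:
        # Step 3: no suffix, continue as the plain completions path in this PR.
        return prompt
    # Step 4: look for the proposed completionTemplate field.
    template = tokenizer_config.get("completionTemplate")
    if template is None:
        # Step 6: no template defined for this model, reject the request.
        raise ValueError("400 Bad Request: model has no completionTemplate")
    # Step 5: fill the template with the prompt (prefix) and the suffix;
    # plain string substitution stands in for real template rendering.
    return template.replace("{{ prompt }}", prompt).replace("{{ suffix }}", suffix)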

@drbh marked this pull request as ready for review February 2, 2024 23:15
@drbh marked this pull request as draft February 2, 2024 23:17
@drbh (Collaborator, Author) commented Feb 6, 2024

Hey @spew, thanks for the details! I've reviewed the docs a bit more, and it seems like the suffix may simply be appended to the end of the generated text rather than follow a model-specific template. Would you be able to share an existing use case that requires a template rather than appending?

@drbh (Collaborator, Author) commented Feb 6, 2024

for reference:

example non-streaming request

curl -s localhost:3000/v1/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "prompt": "What color is the sky?",
  "max_tokens": 10,
  "seed": 0,
  "stream": false,
  "decoder_input_details": false,
  "suffix": "   hello world"
}' \
    -H 'Content-Type: application/json' | jq .

response

{
  "id": "",
  "object": "text_completion",
  "created": 1707233606,
  "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
  "system_fingerprint": "1.4.0-native",
  "choices": [
    {
      "index": 0,
      "text": " Blue.\" Well, let's break that down logically   hello world",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 10,
    "total_tokens": 16
  }
}

example streaming request

 curl -s localhost:3000/v1/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "prompt": "What color is the sky?",
  "max_tokens": 10,
  "seed": 0,
  "stream": true,
  "decoder_input_details": false,
  "suffix": "   hello world"
}' \
    -H 'Content-Type: application/json'

responses

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":" Blue","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":".\"","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":" Well","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":",","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":" let","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":"'s","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233722,"choices":[{"index":0,"text":" break","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233723,"choices":[{"index":0,"text":" that","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233723,"choices":[{"index":0,"text":" down","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

data:{"id":"","object":"text_completion","created":1707233723,"choices":[{"index":0,"text":" logically","logprobs":null,"finish_reason":""}],"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","system_fingerprint":"1.4.0-native"}

@spew commented Feb 6, 2024

Hi @drbh, the use case would be code completion extensions for IDEs, like GitHub Copilot. They do "fill in the middle" completion, which improves results. When a developer stops typing in their editor, the extension sends the code before the cursor as the prefix and the code after the cursor as the suffix. I'm 99% sure that on the OpenAI side they are not simply joining the strings but are instead intelligently feeding them into the model with the appropriate tokens marking the prefix and suffix. For example, given the following completion request:

{
  "prompt": "#!/usr/bin/env python3\n# print 1\nprint(\"1\")\n\n# print 2\n",
  "suffix": "# print 3\nprint(\"3\")"
}

If we were to feed this into a CodeLlama model, which has the special tokens <PRE>, <SUF>, and <MID>, then the final input string sent to generate should look something like this:

{
   "inputs": "<PRE> #!/usr/bin/env python3\n# print 1\nprint(\"1\")\n\n# print 2\n <SUF># print 3\nprint(\"3\") <MID>"
} 
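
A tiny helper that produces that string; the sentinel token spellings are taken from the example above and vary by model/tokenizer, so treat them as assumptions:

def build_codellama_fim(prefix: str, suffix: str) -> str:
    # Sentinel token spellings follow the example above; verify them
    # against the actual tokenizer before relying on this.
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

print(build_codellama_fim(
    '#!/usr/bin/env python3\n# print 1\nprint("1")\n\n# print 2\n',
    '# print 3\nprint("3")',
))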

@drbh force-pushed the support-legacy-completions-api branch from ca73ca3 to d95bbd8 on February 19, 2024 17:23
@drbh (Collaborator, Author) commented Feb 20, 2024

Updates**

This PR now reads the completion_template field from tokenizer_config.json and uses the template to format the prompt and suffix variables.

example tokenizer_config.json

    ....
    "spaces_between_special_tokens": false,
    "tokenizer_class": "LlamaTokenizer",
    "unk_token": "<unk>",
    "use_default_system_prompt": false,
    "completion_template": "<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>"
  }
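
For illustration, the template semantics can be mirrored with Jinja-style rendering; this Python/jinja2 snippet only shows how prompt and suffix get substituted (the actual implementation lives in the Rust router):

from jinja2 import Template

completion_template = Template("<PRE> {{ prompt }} <SUF> {{ suffix }} <MID>")
inputs = completion_template.render(
    prompt="What color is the sky?",
    suffix="before night falls",
)
print(inputs)  # <PRE> What color is the sky? <SUF> before night falls <MID>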

example request

curl -s localhost:3000/v1/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
  "model": "tgi",
  "prompt": "What color is the sky?",
  "max_tokens": 10,
  "seed": 0,
  "stream": false,
  "decoder_input_details": false,
  "suffix": "before night falls"
}' | jq .

response

{
  "id": "",
  "object": "text_completion",
  "created": 1708443875,
  "model": "HuggingFaceH4/zephyr-7b-beta",
  "system_fingerprint": "1.4.1-native",
  "choices": [
    {
      "index": 0,
      "text": " red, orange and pink like waves of fire,",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 10,
    "total_tokens": 31
  }
}

**Note: this PR is still a work in progress, and it's unclear if supporting completion_template is the best path forward. It would also be possible for the caller to construct this template themselves and avoid the custom template field; however, that diverges from OpenAI's API and needs to be considered.

@drbh force-pushed the support-legacy-completions-api branch from 64bd96e to 6a0f737 on February 20, 2024 18:15
@drbh marked this pull request as ready for review February 21, 2024 16:31
server/text_generation_server/models/__init__.py (outdated review thread, resolved)
router/src/infer.rs (outdated review thread, resolved)
@@ -88,6 +90,10 @@ impl Infer {
.chat_template
.map(|t| ChatTemplate::new(t, tokenizer_config.bos_token, tokenizer_config.eos_token));

let completion_template = tokenizer_config
.completion_template
Collaborator commented:

I don't think that's a standard amongst models:

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/main/tokenizer_config.json

AFAIK there's no need for a template...

Reply:

There is if we want the API to support Fill in the Middle: #1527 (comment)

Collaborator commented:

Except nothing in the HF ecosystem supports this. Therefore it doesn't exist.
And if we added such a thing we'd be committing to it, and this is not the appropriate place or project for it (we usually start implementing things like this in transformers first, since it has to be checked for much wider usability).

@spew commented Feb 22, 2024

The primary use case across HF would probably be this API.

Another, potentially larger, use case: there could be a utility method enabling "easy" token encoding, which would let developers send in a prefix & suffix and get back a list of tokens properly constructed with the FIM tokens (as a bonus it could also add the BOS and EOS tokens).

Alternatively, if we want to support FIM models with this API (we sort of do if we want it to be most useful), we could store the per-model completion template only in this project, not in tokenizer_config.json, and worry about moving it out of the project later. This would mean having a configuration file in this project that maps model -> completion template. I think there is some precedent for this, since this project has a lot of per-model loading code.

@drbh (Collaborator, Author) commented:

Hi @spew, thanks for the insights and feedback. We've given it some thought as a team and decided we're going to avoid supporting the completion template.

The tokenizer_config.json is defined/maintained outside of TGI, and adding TGI-specific/custom values there does not seem like the best path forward. Additionally, the template would only have defined how two strings, the prompt and the suffix, are concatenated, and this can be done by the caller with simple preprocessing. While we aim to align with existing tools/APIs, it does not make sense to add this complexity/overhead for a legacy API.
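
For reference, a hedged sketch of that caller-side preprocessing: build the model-specific FIM string yourself and send it as the plain prompt. The endpoint, model name, and CodeLlama-style sentinel tokens below are assumptions carried over from the examples earlier in this thread:

import requests

prefix = 'def fibonacci(n):\n    """Return the nth Fibonacci number."""\n'
suffix = "\n    return b"

# Build the model-specific FIM string client-side (CodeLlama-style tokens
# assumed here), then send it as an ordinary prompt with no suffix field.
prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

response = requests.post(
    "http://localhost:3000/v1/completions",
    json={"model": "tgi", "prompt": prompt, "max_tokens": 32, "stream": False},
    timeout=60,
)
print(response.json()["choices"][0]["text"])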

Really appreciate you bringing this up, though! Always good to see different angles and possibilities, thank you!

@spew commented Feb 26, 2024

Understood! It does mean this API will be less useful for FIM models, which means it will not be a drop-in replacement for some (possibly many?) tools built on the legacy OpenAI completions API. This is cumbersome when you don't control the source of those tools but want to move off of OpenAI to your own hosting.

IMO this defeats most of the value prop for the API -- but I don't have insight into all the internal reasoning.

Collaborator commented:

Just do the prefix/suffix before sending the prompts. It's much more generic anyway.

"but I don't have insight into all the internal reasoning."

You're more than welcome to fork this repo and maintain this yourself; we just don't have the bandwidth for that.
If you want to contribute such a thing to the HF ecosystem, try implementing it in transformers first.

As said, that is the place where the normalization for the configs is created and maintained across our ecosystem.

@spew commented Feb 29, 2024

Understood. It does seem like transformers-level support for FIM models would make things a lot easier and less error-prone, so maybe there's an opportunity to do something there.

@Narsil (Collaborator) left a review comment:

LGTM

@drbh merged commit 3dd7da2 into main on Feb 29, 2024
8 checks passed
@drbh deleted the support-legacy-completions-api branch on February 29, 2024 15:44
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
Linked issue: Support the legacy openai 'completions' API interface (#1468)