Fix chat completion and docs #358

Merged: 6 commits merged into predibase:main on Mar 27, 2024

Conversation

@GirinMan (Contributor) commented Mar 26, 2024

What does this PR do?

This PR introduces three improvements to the LoRAX project:

  1. Tokenization API: Inspired by HF TGI, this addition allows users to get tokenization results via a new /tokenize route in the API.

  2. Chat Completion Stream API Fix: The existing implementation diverged from the official OpenAI API: the final chunk in a stream contained explicit null "role" and "content" fields, which could trigger unexpected errors in some OpenAI API clients. With this fix, the last chunk (where is_stop is True) still treats "role" and "content" as null, but they are no longer serialized, so the final chunk's delta is correctly returned as an empty object (a minimal serde sketch of this approach follows the list below).

  • What the last chunk looks like (before)
{
  "id":"null",
  "object":"chat.completion.chunk",
  "created":0,
  "model":"null",
  "choices":[{
    "index":0,
    "delta":{
      "role":null,
      "content":null
    },
    "finish_reason":"stop"
  }]
}
  • What the last chunk looks like (after, equivalent to the OpenAI API's behavior)
{
  "id":"null",
  "object":"chat.completion.chunk",
  "created":0,
  "model":"null",
  "choices":[{
    "index":0,
    "delta":{},
    "finish_reason":"stop"
  }]
}
  3. Swagger Docs Improvement: Enhancements were made to the OpenAPI (Swagger) docs for better usability: the OpenAI-compatible API is now visible; the LoRAX standalone APIs, the OpenAI-compatible APIs, and the newly added Tokenizer are tagged for easier differentiation; and missing schemas were added to eliminate errors caused by schema unavailability. Before/after screenshots are provided for a clearer comparison.
  • Before this PR: (screenshot of the Swagger docs before the change)
  • After: (screenshot of the Swagger docs after the change)
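
To make the serialization change in item 2 concrete, here is a minimal, hypothetical serde sketch; the struct and field names are illustrative and not necessarily the actual LoRAX types. Fields that are None are simply skipped, so the final delta serializes as an empty object.

use serde::Serialize;

// Hypothetical delta struct: when both fields are None it serializes as "{}",
// matching the "after" chunk shown above.
#[derive(Serialize)]
struct ChatCompletionDelta {
    #[serde(skip_serializing_if = "Option::is_none")]
    role: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    content: Option<String>,
}

fn main() {
    let last_delta = ChatCompletionDelta { role: None, content: None };
    // Prints {} rather than {"role":null,"content":null}.
    println!("{}", serde_json::to_string(&last_delta).unwrap());
}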

These enhancements not only improve the project's functionality but also its usability and compatibility with existing OpenAI API clients.

Who can review?

@tgaddair
I have conducted several tests to ensure the proper functioning of these improvements and applied linting to adhere to coding style and conventions. Despite these precautions, there could still be issues within the code. Please feel free to review and suggest any changes you think might be necessary.

cc. @soeque1

@tgaddair (Contributor) left a comment

This is a great PR! Thanks @GirinMan for putting this together. I just had one question about the behavior of tokenizer.encode when add_special_tokens is false, which was a change made in this PR. I just wanted to make sure I understand the motivation and ramifications of this change. Thanks!

// Get the number of tokens in the input
let mut encoding = tokenizer
-    .encode(inputs.clone(), true)
+    .encode(inputs.clone(), false)
@tgaddair (Contributor) commented on the diff:

The documentation for add_special_tokens is pretty unclear, but is there a difference in the generated input lengths when this param is set to false? The one thing I would want to double-check here is that we don't under-count the number of tokens in the input during validation; otherwise we could exceed the max positional embeddings during inference and cause segfaults.

@GirinMan (Contributor, Author) replied:

That's right.
The default value of add_special_tokens was changed for a quick test during development, and I forgot to revert it. I'll fix it back to the original.

@soeque1 commented Mar 26, 2024

/tokenize

The original code was implemented with reference to TGI's tokenize commit (huggingface/text-generation-inference#1471)

The problem is that TGI's tokenize uses the tokenizers::Encoding obtained from Infer, and there are UTF-8 processing issues in get_offsets and get_ids.

It is likely due to the issues below:

huggingface/tokenizers#1201 (comment)
https://discuss.huggingface.co/t/token-offsets-in-rust-vs-python/37949

In fact, even in Hugging Face's tokenizers, the Python binding uses encode_char_offsets.

https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/tokenizer.rs#L973-L978

Therefore, I changed it to hold the tokenizer as shown below and use encode_char_offsets.

https://github.com/predibase/lorax/pull/358/files#diff-58724fd9e4e3f359bc071bec55c6b3f8149372efa3dc33981ada5bf60afee878R1288-R1303
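
For reference, here is a minimal sketch of what encoding with character-level offsets looks like with the tokenizers crate; this is illustrative only, not the exact code in the linked diff.

use tokenizers::Tokenizer;

// Encode with character-based offsets (as the Python binding does) to avoid
// the UTF-8 byte-offset issues described above. Returns (id, start, stop)
// triples where start/stop are character indices into the input.
fn char_offset_tokens(
    tokenizer: &Tokenizer,
    inputs: &str,
    add_special_tokens: bool,
) -> tokenizers::Result<Vec<(u32, usize, usize)>> {
    let encoding = tokenizer.encode_char_offsets(inputs, add_special_tokens)?;
    Ok(encoding
        .get_ids()
        .iter()
        .zip(encoding.get_offsets().iter())
        .map(|(&id, &(start, stop))| (id, start, stop))
        .collect())
}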

GirinMan marked this pull request as draft on March 26, 2024 at 13:00
@tgaddair (Contributor) left a comment

Thanks for the quick fix @GirinMan. The PR looks good to me.

I see you marked it as draft; were there other changes you wanted to make before landing?

@GirinMan (Contributor, Author) commented Mar 27, 2024

Actually, we have found some issues when handling non-ASCII Unicode characters such as Korean or Japanese text.
Some characters are split into several tokens by the BPE tokenizer, and the text returned for each token is not equal to the token corresponding to its ID.

For example, the single character ⑴, which commonly appears in Japanese corpora, is split by the NFKC normalizer into (, 1, and ).
You can see this example and how we fixed it below:

Behavior of the tokenize API before the fix

curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
}'
[
  { "id": 325, "text": "", "start": 0, "stop": 1 },
  { "id": 28740, "text": "", "start": 0, "stop": 1 },
  { "id": 28731, "text": "", "start": 0, "stop": 1 },
  { "id": 28705, "text": " ", "start": 1, "stop": 2 },
  { "id": 233, "text": "", "start": 2, "stop": 3 },
  { "id": 141, "text": "", "start": 2, "stop": 3 },
  { "id": 184, "text": "", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 },
  { "id": 233, "text": "", "start": 4, "stop": 5 },
  { "id": 171, "text": "", "start": 4, "stop": 5 },
  { "id": 172, "text": "", "start": 4, "stop": 5 },
  { "id": 29041, "text": "", "start": 5, "stop": 6 },
...

Issue cause and workaround

  • The text of each SimpleToken was previously sliced from the original input using the per-token character offsets returned by tokenizer.encode_char_offsets.
  • When NFKC normalization changes a token's text relative to the original (e.g., ⑴ -> (1)), the same character gets repeated, because the offsets still point into the original string.
  • Instead of slicing a SimpleToken's text out of the original input by offsets, we use tokenizer.id_to_token(id) to get the token text directly from the vocab (see the sketch after this list)
    • Before
      let text: String = input.chars().skip(start).take(stop - start).collect();

    • After the change
      let text: String = tokenizer.id_to_token(id).unwrap();
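
Putting the workaround together, here is a hedged sketch of building the response tokens from the vocabulary instead of from input slices; SimpleToken stands in for the response type and may not match the actual LoRAX struct.

use tokenizers::{Encoding, Tokenizer};

// Illustrative response item for the /tokenize route.
struct SimpleToken {
    id: u32,
    text: String,
    start: usize,
    stop: usize,
}

// Fill each token's text via a vocabulary lookup rather than by slicing the
// original input at (start, stop).
fn to_simple_tokens(tokenizer: &Tokenizer, encoding: &Encoding) -> Vec<SimpleToken> {
    encoding
        .get_ids()
        .iter()
        .zip(encoding.get_offsets().iter())
        .map(|(&id, &(start, stop))| SimpleToken {
            id,
            // Byte-fallback pieces come back as e.g. "<0xE6>".
            text: tokenizer.id_to_token(id).unwrap_or_default(),
            start,
            stop,
        })
        .collect()
}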

API response after the change

curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
}'
[
  { "id": 325, "text": "▁(", "start": 0, "stop": 1 },
  { "id": 28740, "text": "1", "start": 0, "stop": 1 },
  { "id": 28731, "text": ")", "start": 0, "stop": 1 },
  { "id": 28705, "text": "", "start": 1, "stop": 2 },
  { "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
  { "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
  { "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 },
  { "id": 233, "text": "<0xE6>", "start": 4, "stop": 5 },
  { "id": 171, "text": "<0xA8>", "start": 4, "stop": 5 },
  { "id": 172, "text": "<0xA9>", "start": 4, "stop": 5 },
  { "id": 29041, "text": "", "start": 5, "stop": 6 },
...

@GirinMan (Contributor, Author) commented:
And we can now truncate text to a desired token length based on the tokenize result.

  • After the change, text correctly returns the value of the token itself, but it is sometimes broken into bytes, which is cumbersome to reassemble with a simple concat.
  • This can be solved by using the start and stop indexes returned by the tokenize API, without a separate detokenize step.
  • For example, if you want to slice the original text to within 100 tokens, you can cut it just before the start index of the 101st token:
# Truncate the original text to 10 tokens or less
text = "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
tokens = [{"id":325,"text":"▁(","start":0,"stop":1},{"id":28740,"text":"1","start":0,"stop":1},{"id":28731,"text":")","start":0,"stop":1},{"id":28705,"text":"▁","start":1,"stop":2},{"id":233,"text":"<0xE6>","start":2,"stop":3},{"id":141,"text":"<0x8A>","start":2,"stop":3},{"id":184,"text":"<0xB5>","start":2,"stop":3},{"id":29162,"text":"当","start":3,"stop":4},{"id":233,"text":"<0xE6>","start":4,"stop":5},{"id":171,"text":"<0xA8>","start":4,"stop":5},{"id":172,"text":"<0xA9>","start":4,"stop":5},{"id":29041,"text":"、","start":5,"stop":6},{"id":235,"text":"<0xE8>","start":6,"stop":7},{"id":182,"text":"<0xB3>","start":6,"stop":7},{"id":173,"text":"<0xAA>","start":6,"stop":7},{"id":233,"text":"<0xE6>","start":7,"stop":8},{"id":171,"text":"<0xA8>","start":7,"stop":8},{"id":172,"text":"<0xA9>","start":7,"stop":8},{"id":29041,"text":"、","start":8,"stop":9},{"id":29596,"text":"先","start":9,"stop":10},{"id":29012,"text":"取","start":10,"stop":11},{"id":29631,"text":"特","start":11,"stop":12},{"id":233,"text":"<0xE6>","start":12,"stop":13},{"id":171,"text":"<0xA8>","start":12,"stop":13},{"id":172,"text":"<0xA9>","start":12,"stop":13},{"id":29965,"text":"及","start":13,"stop":14},{"id":31050,"text":"び","start":14,"stop":15},{"id":235,"text":"<0xE8>","start":15,"stop":16},{"id":182,"text":"<0xB3>","start":15,"stop":16},{"id":134,"text":"<0x83>","start":15,"stop":16},{"id":31626,"text":"借","start":16,"stop":17},{"id":233,"text":"<0xE6>","start":17,"stop":18},{"id":171,"text":"<0xA8>","start":17,"stop":18},{"id":172,"text":"<0xA9>","start":17,"stop":18},{"id":29041,"text":"、","start":18,"stop":19},{"id":29241,"text":"ま","start":19,"stop":20},{"id":29227,"text":"た","start":20,"stop":21},{"id":29277,"text":"は","start":21,"stop":22},{"id":29163,"text":"所","start":22,"stop":23},{"id":28998,"text":"有","start":23,"stop":24},{"id":233,"text":"<0xE6>","start":24,"stop":25},{"id":171,"text":"<0xA8>","start":24,"stop":25},{"id":172,"text":"<0xA9>","start":24,"stop":25},{"id":30210,"text":"留","start":25,"stop":26},{"id":29321,"text":"保","start":26,"stop":27},{"id":29041,"text":"、","start":27,"stop":28},{"id":29414,"text":"等","start":28,"stop":29},{"id":28993,"text":"の","start":29,"stop":30},{"id":231,"text":"<0xE4>","start":30,"stop":31},{"id":188,"text":"<0xB9>","start":30,"stop":31},{"id":156,"text":"<0x99>","start":30,"stop":31},{"id":28993,"text":"の","start":31,"stop":32},{"id":29474,"text":"完","start":32,"stop":33},{"id":29374,"text":"全","start":33,"stop":34},{"id":29270,"text":"な","start":34,"stop":35},{"id":29116,"text":"る","start":35,"stop":36},{"id":29163,"text":"所","start":36,"stop":37},{"id":28998,"text":"有","start":37,"stop":38},{"id":233,"text":"<0xE6>","start":38,"stop":39},{"id":171,"text":"<0xA8>","start":38,"stop":39},{"id":172,"text":"<0xA9>","start":38,"stop":39},{"id":28993,"text":"の","start":39,"stop":40},{"id":29037,"text":"行","start":40,"stop":41},{"id":29154,"text":"使","start":41,"stop":42},{"id":29078,"text":"を","start":42,"stop":43},{"id":232,"text":"<0xE5>","start":43,"stop":44},{"id":169,"text":"<0xA6>","start":43,"stop":44},{"id":171,"text":"<0xA8>","start":43,"stop":44},{"id":31967,"text":"げ","start":44,"stop":45},{"id":29116,"text":"る","start":45,"stop":46},{"id":29339,"text":"事","start":46,"stop":47},{"id":29418,"text":"情","start":47,"stop":48},{"id":29309,"text":"が","start":48,"stop":49},{"id":29270,"text":"な","start":49,"stop":50},{"id":29132,"text":"い","start":50,"stop":51},{"id":29543,"text":"こ","start":51,"stop":52},{"id":29316,"text":"と","start":52,"stop":53}]

# truncate to start of 10+1 tokens in case of byte-level splitting
# if 10th and 11th tokens are combined into a single character, discard 10th token
sliced_tokens = tokens[:10+1] 
end_idx = sliced_tokens[-1]["start"]
sliced_text = text[:end_idx]

print(text)
# ⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと

print(sliced_text)
# ⑴ 抵当
  • Tokenize result of the truncated text
curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当"
}'
  • The 9th, 10th, and 11th tokens (<0xE6> <0xA8> <0xA9>) make up the single character "権", so only 8 tokens' worth of text remains:
[
  { "id": 325, "text": "▁(", "start": 0, "stop": 1 },
  { "id": 28740, "text": "1", "start": 0, "stop": 1 },
  { "id": 28731, "text": ")", "start": 0, "stop": 1 },
  { "id": 28705, "text": "", "start": 1, "stop": 2 },
  { "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
  { "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
  { "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 }
]

@GirinMan (Contributor, Author) commented Mar 27, 2024

  • As you mentioned in Fix chat completion and docs #358 (comment), when we return tokenization results it is natural for the tokenizer to add the special tokens.
  • However, some people may want to see the result with add_special_tokens set to false, so we have added this as a parameter to the request.
  • The default value of add_special_tokens is true (when the field is not set or is null), and of course you can set it to false to get the tokenized result without special tokens added (a hedged sketch of such a request type follows the example below).

The changed request may look like:

curl -X 'POST' \
  'https://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "add_special_tokens": true,
  "inputs": "My name is Olivier and I"
}'
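
On the server side, such an optional flag could be modeled roughly as in the hedged sketch below (an illustrative request type, not necessarily the actual LoRAX struct): a missing or null add_special_tokens deserializes to None and is treated as true by the handler.

use serde::Deserialize;

// Illustrative request type for the /tokenize route.
#[derive(Deserialize)]
struct TokenizeRequest {
    inputs: String,
    // Missing or null -> None; the handler falls back to true.
    #[serde(default)]
    add_special_tokens: Option<bool>,
}

fn add_special_tokens(req: &TokenizeRequest) -> bool {
    req.add_special_tokens.unwrap_or(true)
}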

GirinMan marked this pull request as ready for review on March 27, 2024 at 02:55
@GirinMan (Contributor, Author) commented:
@tgaddair We found some issues related to this PR and pushed some commits to address them.

GirinMan requested a review from tgaddair on March 27, 2024 at 04:31
@tgaddair (Contributor) left a comment

Thanks for the very thorough explanation of the issue and the fix! This LGTM. I'll go ahead and land assuming tests finish successfully.

tgaddair merged commit 0b9117f into predibase:main on Mar 27, 2024 (1 check passed)