Fix chat completion and docs #358

Merged: 6 commits merged into predibase:main on Mar 27, 2024

Conversation

@GirinMan (Contributor) commented Mar 26, 2024

What does this PR do?

This PR introduces three improvements to the LoRAX project:

  1. Tokenization API: Inspired by HF TGI, this addition allows users to get tokenization results via a new /tokenize route in the API.

  2. Chat Completion Stream API Fix: The existing implementation diverged from the official OpenAI API: the final chunk in a stream contained explicit null "role" and "content" fields, which could trigger unexpected errors in some OpenAI API clients. With this fix, the last chunk (where is_stop is True) still treats "role" and "content" as null, but they are no longer serialized, so the final chunk's delta is correctly returned as an empty object (a minimal serde sketch of this approach follows the list below).

  • What the last chunk looks like (before)
{
  "id":"null",
  "object":"chat.completion.chunk",
  "created":0,
  "model":"null",
  "choices":[{
    "index":0,
    "delta":{
      "role":null,
      "content":null
    },
    "finish_reason":"stop"
  }]
}
  • What the last chunk looks like (after, equivalent to the OpenAI API's behavior)
{
  "id":"null",
  "object":"chat.completion.chunk",
  "created":0,
  "model":"null",
  "choices":[{
    "index":0,
    "delta":{},
    "finish_reason":"stop"
  }]
}
  3. Swagger Docs Improvement: Enhancements were made to the OpenAPI (Swagger) docs for better usability: the OpenAI-compatible API is now visible; the LoRAX standalone APIs, the OpenAI-compatible APIs, and the newly added Tokenizer are tagged for easier differentiation; and missing schemas were added to eliminate errors caused by schema unavailability. Before/after screenshots are provided for a clearer comparison.
  • Before this PR: (screenshot of the Swagger docs before the change)
  • After: (screenshot of the Swagger docs after the change)
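
To make the serialization change in item 2 concrete, here is a minimal, hypothetical serde sketch; the struct and field names are illustrative and not necessarily the actual LoRAX types. Fields that are None are simply skipped, so the final delta serializes as an empty object.

use serde::Serialize;

// Hypothetical delta struct: when both fields are None it serializes as "{}",
// matching the "after" chunk shown above.
#[derive(Serialize)]
struct ChatCompletionDelta {
    #[serde(skip_serializing_if = "Option::is_none")]
    role: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    content: Option<String>,
}

fn main() {
    let last_delta = ChatCompletionDelta { role: None, content: None };
    // Prints {} rather than {"role":null,"content":null}.
    println!("{}", serde_json::to_string(&last_delta).unwrap());
}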

These enhancements not only improve the project's functionality but also its usability and compatibility with existing OpenAI API clients.

Who can review?

@tgaddair
I have conducted several tests to ensure the proper functioning of these improvements and applied linting to adhere to coding style and conventions. Despite these precautions, there could still be issues within the code. Please feel free to review and suggest any changes you think might be necessary.

cc. @soeque1

@tgaddair (Contributor) left a comment

This is a great PR! Thanks @GirinMan for putting this together. I just had one question about the behavior of tokenizer.encode when add_special_tokens is false, which was a change made in this PR. I just wanted to make sure I understand the motivation and ramifications of this change. Thanks!

// Get the number of tokens in the input
let mut encoding = tokenizer
-    .encode(inputs.clone(), true)
+    .encode(inputs.clone(), false)
@tgaddair (Contributor) commented on the diff:

The documentation for add_special_tokens is pretty unclear, but is there a difference in the generated input lengths when this param is set to false? The one thing I would want to double-check here is that we don't under-count the number of tokens in the input during validation; otherwise we could exceed the max positional embeddings during inference and cause segfaults.

@GirinMan (Contributor, Author) replied:

That's right.
The default value of add_special_tokens was changed for a quick test during development, and I forgot to revert it. I'll fix it back to the original.

@soeque1 commented Mar 26, 2024

/tokenize

The original code was implemented with reference to TGI's tokenize commit (huggingface/text-generation-inference#1471)

The problem is that TGI's tokenize uses the tokenizers::Encoding obtained from Infer, and there are UTF-8 processing issues in get_offsets and get_ids.

It is likely due to the issues below:

huggingface/tokenizers#1201 (comment)
https://discuss.huggingface.co/t/token-offsets-in-rust-vs-python/37949

In fact, even in Hugging Face's tokenizers, the Python binding uses encode_char_offsets.

https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/tokenizer.rs#L973-L978

Therefore, I changed it to hold the tokenizer as shown below and use encode_char_offsets.

https://github.com/predibase/lorax/pull/358/files#diff-58724fd9e4e3f359bc071bec55c6b3f8149372efa3dc33981ada5bf60afee878R1288-R1303
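
For reference, here is a minimal sketch of what encoding with character-level offsets looks like with the tokenizers crate; this is illustrative only, not the exact code in the linked diff.

use tokenizers::Tokenizer;

// Encode with character-based offsets (as the Python binding does) to avoid
// the UTF-8 byte-offset issues described above. Returns (id, start, stop)
// triples where start/stop are character indices into the input.
fn char_offset_tokens(
    tokenizer: &Tokenizer,
    inputs: &str,
    add_special_tokens: bool,
) -> tokenizers::Result<Vec<(u32, usize, usize)>> {
    let encoding = tokenizer.encode_char_offsets(inputs, add_special_tokens)?;
    Ok(encoding
        .get_ids()
        .iter()
        .zip(encoding.get_offsets().iter())
        .map(|(&id, &(start, stop))| (id, start, stop))
        .collect())
}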

GirinMan marked this pull request as draft on March 26, 2024 at 13:00
@tgaddair (Contributor) left a comment

Thanks for the quick fix @GirinMan. The PR looks good to me.

I see you marked it as draft; were there other changes you wanted to make before landing?

@GirinMan (Contributor, Author) commented Mar 27, 2024

Actually, we have found some issues when handling non-ASCII Unicode characters such as Korean or Japanese text.
Some characters are split into several tokens by the BPE tokenizer, and the text returned for each token is not equal to the token corresponding to its ID.

For example, the single character ⑴, which commonly appears in Japanese corpora, is split by the NFKC normalizer into (, 1, and ).
You can see this example and how we fixed it below:

Behavior of the tokenize API before the fix

curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
}'
[
  { "id": 325, "text": "", "start": 0, "stop": 1 },
  { "id": 28740, "text": "", "start": 0, "stop": 1 },
  { "id": 28731, "text": "", "start": 0, "stop": 1 },
  { "id": 28705, "text": " ", "start": 1, "stop": 2 },
  { "id": 233, "text": "", "start": 2, "stop": 3 },
  { "id": 141, "text": "", "start": 2, "stop": 3 },
  { "id": 184, "text": "", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 },
  { "id": 233, "text": "", "start": 4, "stop": 5 },
  { "id": 171, "text": "", "start": 4, "stop": 5 },
  { "id": 172, "text": "", "start": 4, "stop": 5 },
  { "id": 29041, "text": "", "start": 5, "stop": 6 },
...

Issue cause and workaround

  • The text of each SimpleToken was previously sliced from the original input using the per-token character offsets returned by tokenizer.encode_char_offsets.
  • When NFKC normalization changes a token's text relative to the original (e.g., ⑴ -> (1)), the same character gets repeated, because the offsets still point into the original string.
  • Instead of slicing a SimpleToken's text out of the original input by offsets, we use tokenizer.id_to_token(id) to get the token text directly from the vocab (see the sketch after this list)
    • Before
      let text: String = input.chars().skip(start).take(stop - start).collect();

    • After the change
      let text: String = tokenizer.id_to_token(id).unwrap();
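
Putting the workaround together, here is a hedged sketch of building the response tokens from the vocabulary instead of from input slices; SimpleToken stands in for the response type and may not match the actual LoRAX struct.

use tokenizers::{Encoding, Tokenizer};

// Illustrative response item for the /tokenize route.
struct SimpleToken {
    id: u32,
    text: String,
    start: usize,
    stop: usize,
}

// Fill each token's text via a vocabulary lookup rather than by slicing the
// original input at (start, stop).
fn to_simple_tokens(tokenizer: &Tokenizer, encoding: &Encoding) -> Vec<SimpleToken> {
    encoding
        .get_ids()
        .iter()
        .zip(encoding.get_offsets().iter())
        .map(|(&id, &(start, stop))| SimpleToken {
            id,
            // Byte-fallback pieces come back as e.g. "<0xE6>".
            text: tokenizer.id_to_token(id).unwrap_or_default(),
            start,
            stop,
        })
        .collect()
}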

API response after the change

curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
}'
[
  { "id": 325, "text": "▁(", "start": 0, "stop": 1 },
  { "id": 28740, "text": "1", "start": 0, "stop": 1 },
  { "id": 28731, "text": ")", "start": 0, "stop": 1 },
  { "id": 28705, "text": "", "start": 1, "stop": 2 },
  { "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
  { "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
  { "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 },
  { "id": 233, "text": "<0xE6>", "start": 4, "stop": 5 },
  { "id": 171, "text": "<0xA8>", "start": 4, "stop": 5 },
  { "id": 172, "text": "<0xA9>", "start": 4, "stop": 5 },
  { "id": 29041, "text": "", "start": 5, "stop": 6 },
...

@GirinMan (Contributor, Author) commented:
And we can now truncate text to a desired token length based on the tokenize result.

  • After the change, text correctly returns the value of the token itself, but it is sometimes broken into bytes, which is cumbersome to reassemble with a simple concat.
  • This can be solved by using the start and stop indexes returned by the tokenize API, without a separate detokenize step.
  • For example, if you want to slice the original text to within 100 tokens, you can cut it just before the start index of the 101st token:
# Truncate the original text to 10 tokens or less
text = "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
tokens = [{"id":325,"text":"▁(","start":0,"stop":1},{"id":28740,"text":"1","start":0,"stop":1},{"id":28731,"text":")","start":0,"stop":1},{"id":28705,"text":"▁","start":1,"stop":2},{"id":233,"text":"<0xE6>","start":2,"stop":3},{"id":141,"text":"<0x8A>","start":2,"stop":3},{"id":184,"text":"<0xB5>","start":2,"stop":3},{"id":29162,"text":"当","start":3,"stop":4},{"id":233,"text":"<0xE6>","start":4,"stop":5},{"id":171,"text":"<0xA8>","start":4,"stop":5},{"id":172,"text":"<0xA9>","start":4,"stop":5},{"id":29041,"text":"、","start":5,"stop":6},{"id":235,"text":"<0xE8>","start":6,"stop":7},{"id":182,"text":"<0xB3>","start":6,"stop":7},{"id":173,"text":"<0xAA>","start":6,"stop":7},{"id":233,"text":"<0xE6>","start":7,"stop":8},{"id":171,"text":"<0xA8>","start":7,"stop":8},{"id":172,"text":"<0xA9>","start":7,"stop":8},{"id":29041,"text":"、","start":8,"stop":9},{"id":29596,"text":"先","start":9,"stop":10},{"id":29012,"text":"取","start":10,"stop":11},{"id":29631,"text":"特","start":11,"stop":12},{"id":233,"text":"<0xE6>","start":12,"stop":13},{"id":171,"text":"<0xA8>","start":12,"stop":13},{"id":172,"text":"<0xA9>","start":12,"stop":13},{"id":29965,"text":"及","start":13,"stop":14},{"id":31050,"text":"び","start":14,"stop":15},{"id":235,"text":"<0xE8>","start":15,"stop":16},{"id":182,"text":"<0xB3>","start":15,"stop":16},{"id":134,"text":"<0x83>","start":15,"stop":16},{"id":31626,"text":"借","start":16,"stop":17},{"id":233,"text":"<0xE6>","start":17,"stop":18},{"id":171,"text":"<0xA8>","start":17,"stop":18},{"id":172,"text":"<0xA9>","start":17,"stop":18},{"id":29041,"text":"、","start":18,"stop":19},{"id":29241,"text":"ま","start":19,"stop":20},{"id":29227,"text":"た","start":20,"stop":21},{"id":29277,"text":"は","start":21,"stop":22},{"id":29163,"text":"所","start":22,"stop":23},{"id":28998,"text":"有","start":23,"stop":24},{"id":233,"text":"<0xE6>","start":24,"stop":25},{"id":171,"text":"<0xA8>","start":24,"stop":25},{"id":172,"text":"<0xA9>","start":24,"stop":25},{"id":30210,"text":"留","start":25,"stop":26},{"id":29321,"text":"保","start":26,"stop":27},{"id":29041,"text":"、","start":27,"stop":28},{"id":29414,"text":"等","start":28,"stop":29},{"id":28993,"text":"の","start":29,"stop":30},{"id":231,"text":"<0xE4>","start":30,"stop":31},{"id":188,"text":"<0xB9>","start":30,"stop":31},{"id":156,"text":"<0x99>","start":30,"stop":31},{"id":28993,"text":"の","start":31,"stop":32},{"id":29474,"text":"完","start":32,"stop":33},{"id":29374,"text":"全","start":33,"stop":34},{"id":29270,"text":"な","start":34,"stop":35},{"id":29116,"text":"る","start":35,"stop":36},{"id":29163,"text":"所","start":36,"stop":37},{"id":28998,"text":"有","start":37,"stop":38},{"id":233,"text":"<0xE6>","start":38,"stop":39},{"id":171,"text":"<0xA8>","start":38,"stop":39},{"id":172,"text":"<0xA9>","start":38,"stop":39},{"id":28993,"text":"の","start":39,"stop":40},{"id":29037,"text":"行","start":40,"stop":41},{"id":29154,"text":"使","start":41,"stop":42},{"id":29078,"text":"を","start":42,"stop":43},{"id":232,"text":"<0xE5>","start":43,"stop":44},{"id":169,"text":"<0xA6>","start":43,"stop":44},{"id":171,"text":"<0xA8>","start":43,"stop":44},{"id":31967,"text":"げ","start":44,"stop":45},{"id":29116,"text":"る","start":45,"stop":46},{"id":29339,"text":"事","start":46,"stop":47},{"id":29418,"text":"情","start":47,"stop":48},{"id":29309,"text":"が","start":48,"stop":49},{"id":29270,"text":"な","start":49,"stop":50},{"id":29132,"text":"い","start":50,"stop":51},{"id":29543,"text":"こ","start":51,"stop":52},{"id":29316,"text":"と","start":52,"stop":53}]

# truncate to start of 10+1 tokens in case of byte-level splitting
# if 10th and 11th tokens are combined into a single character, discard 10th token
sliced_tokens = tokens[:10+1] 
end_idx = sliced_tokens[-1]["start"]
sliced_text = text[:end_idx]

print(text)
# ⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと

print(sliced_text)
# ⑴ 抵当
  • Tokenize result of the truncated text
curl -X 'POST' \
  'http://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "⑴ 抵当"
}'
  • The 9th, 10th, and 11th tokens (<0xE6> <0xA8> <0xA9>) make up the single character "権", so only 8 tokens' worth of text remains:
[
  { "id": 325, "text": "▁(", "start": 0, "stop": 1 },
  { "id": 28740, "text": "1", "start": 0, "stop": 1 },
  { "id": 28731, "text": ")", "start": 0, "stop": 1 },
  { "id": 28705, "text": "", "start": 1, "stop": 2 },
  { "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
  { "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
  { "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
  { "id": 29162, "text": "", "start": 3, "stop": 4 }
]

@GirinMan (Contributor, Author) commented Mar 27, 2024

  • As you mentioned in Fix chat completion and docs #358 (comment), when we return tokenization results it is natural for the tokenizer to add the special tokens.
  • However, some people may want to see the result with add_special_tokens set to false, so we have added this as a parameter to the request.
  • The default value of add_special_tokens is true (when the field is not set or is null), and of course you can set it to false to get the tokenized result without special tokens added (a hedged sketch of such a request type follows the example below).

The changed request may look like:

curl -X 'POST' \
  'https://localhost:8080/tokenize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "add_special_tokens": true,
  "inputs": "My name is Olivier and I"
}'
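
On the server side, such an optional flag could be modeled roughly as in the hedged sketch below (an illustrative request type, not necessarily the actual LoRAX struct): a missing or null add_special_tokens deserializes to None and is treated as true by the handler.

use serde::Deserialize;

// Illustrative request type for the /tokenize route.
#[derive(Deserialize)]
struct TokenizeRequest {
    inputs: String,
    // Missing or null -> None; the handler falls back to true.
    #[serde(default)]
    add_special_tokens: Option<bool>,
}

fn add_special_tokens(req: &TokenizeRequest) -> bool {
    req.add_special_tokens.unwrap_or(true)
}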

GirinMan marked this pull request as ready for review on March 27, 2024 at 02:55
@GirinMan (Contributor, Author) commented:
@tgaddair We found some issues related to this PR and pushed some commits to address them.

GirinMan requested a review from tgaddair on March 27, 2024 at 04:31
@tgaddair (Contributor) left a comment

Thanks for the very thorough explanation of the issue and the fix! This LGTM. I'll go ahead and land assuming tests finish successfully.

tgaddair merged commit 0b9117f into predibase:main on Mar 27, 2024 (1 check passed)