Fix chat completion and docs #358
Conversation
This is a great PR! Thanks @GirinMan for putting this together. I just had one question about the behavior of `tokenizer.encode` when `add_special_tokens` is `false`, which was a change made in this PR. I just wanted to make sure I understand the motivation and ramifications of this change. Thanks!
router/src/validation.rs (Outdated)

    // Get the number of tokens in the input
    let mut encoding = tokenizer
-       .encode(inputs.clone(), true)
+       .encode(inputs.clone(), false)
The documentation for `add_special_tokens` is pretty unclear, but is there a difference in the generated input lengths when this param is set to `false`? The one thing I would want to double-check here is that we don't under-count the number of tokens in the input during validation, otherwise we could exceed the max positional embeddings during inference and cause segfaults.
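To make the concern concrete, here is a minimal sketch of how the flag changes the reported length, assuming the `tokenizers` crate used directly and a placeholder `tokenizer.json` (not part of this PR):

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Placeholder path; any Llama-style tokenizer that defines a BOS special token
    // will show the difference.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // With add_special_tokens = true the post-processor prepends BOS (and possibly
    // appends EOS), so the reported length is typically larger than with false.
    let with_special = tokenizer.encode("Hello world", true)?;
    let without_special = tokenizer.encode("Hello world", false)?;

    println!(
        "with special tokens: {} ids, without: {} ids",
        with_special.get_ids().len(),
        without_special.get_ids().len()
    );
    Ok(())
}
```

If the counts differ, validating with `false` would under-count by exactly the number of special tokens added at inference time.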
That's right.
Due to a simple test during development, the default value of `add_special_tokens` was changed and I forgot to revert it. I'll fix it back to the original.
The original code was implemented with reference to TGI's tokenize commit (huggingface/text-generation-inference#1471). The problem is that TGI's tokenize uses the `tokenizers::Encoding` returned from `Infer`, and there are UTF-8 processing issues in `get_offsets` and `get_ids`. It is likely due to the issue below: huggingface/tokenizers#1201 (comment). In fact, even in Hugging Face's tokenizers, the Python binding uses `encode_char_offsets`: https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/tokenizer.rs#L973-L978. Therefore, I changed it to hold the tokenizer as shown below and use `encode_char_offsets`.
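A minimal sketch of the change described above (not the exact PR diff), assuming the router keeps a raw `tokenizers::Tokenizer` and using a hypothetical helper name:

```rust
use tokenizers::{Encoding, Tokenizer};

// Hypothetical helper for illustration: keep the raw `Tokenizer` and encode with
// character offsets so `get_offsets()` indexes characters of the input, not bytes.
fn tokenize_with_char_offsets(
    tokenizer: &Tokenizer,
    inputs: String,
    add_special_tokens: bool,
) -> tokenizers::Result<Vec<(u32, String, usize, usize)>> {
    let encoding: Encoding = tokenizer.encode_char_offsets(inputs, add_special_tokens)?;
    let tokens = encoding
        .get_ids()
        .iter()
        .zip(encoding.get_tokens().iter())
        .zip(encoding.get_offsets().iter())
        // (id, token text, char start, char stop) mirrors the /tokenize response fields
        .map(|((&id, token), &(start, stop))| (id, token.clone(), start, stop))
        .collect();
    Ok(tokens)
}
```

Compared to the plain `encode`, `encode_char_offsets` converts byte offsets to character offsets, which is also what the Python binding linked above does.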
Thanks for the quick fix @GirinMan. The PR looks good to me.
I see you marked it as a draft; were there other changes you wanted to make before landing?
Actually, we have found some issues when handling non-ASCII Unicode characters such as Korean or Japanese text. For example, the token text returned for a single character such as ⑴ or 抵 did not match the actual tokens.

Behavior of the tokenize API before the fix:
[
{ "id": 325, "text": "⑴", "start": 0, "stop": 1 },
{ "id": 28740, "text": "⑴", "start": 0, "stop": 1 },
{ "id": 28731, "text": "⑴", "start": 0, "stop": 1 },
{ "id": 28705, "text": " ", "start": 1, "stop": 2 },
{ "id": 233, "text": "抵", "start": 2, "stop": 3 },
{ "id": 141, "text": "抵", "start": 2, "stop": 3 },
{ "id": 184, "text": "抵", "start": 2, "stop": 3 },
{ "id": 29162, "text": "当", "start": 3, "stop": 4 },
{ "id": 233, "text": "権", "start": 4, "stop": 5 },
{ "id": 171, "text": "権", "start": 4, "stop": 5 },
{ "id": 172, "text": "権", "start": 4, "stop": 5 },
{ "id": 29041, "text": "、", "start": 5, "stop": 6 },
...

Issue cause and workaround
As noted above, the offsets in the default `Encoding` are byte-based, so `get_offsets`/`get_ids` misbehave on multi-byte UTF-8 text; the workaround is to encode with `encode_char_offsets` so the offsets are character-based.
API response after the change:
[
{ "id": 325, "text": "▁(", "start": 0, "stop": 1 },
{ "id": 28740, "text": "1", "start": 0, "stop": 1 },
{ "id": 28731, "text": ")", "start": 0, "stop": 1 },
{ "id": 28705, "text": "▁", "start": 1, "stop": 2 },
{ "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
{ "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
{ "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
{ "id": 29162, "text": "当", "start": 3, "stop": 4 },
{ "id": 233, "text": "<0xE6>", "start": 4, "stop": 5 },
{ "id": 171, "text": "<0xA8>", "start": 4, "stop": 5 },
{ "id": 172, "text": "<0xA9>", "start": 4, "stop": 5 },
{ "id": 29041, "text": "、", "start": 5, "stop": 6 },
...

And we can now truncate the text to a desired token length based on the tokenize result:
# Truncate the original text to 10 tokens or less
text = "⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと"
tokens = [{"id":325,"text":"▁(","start":0,"stop":1},{"id":28740,"text":"1","start":0,"stop":1},{"id":28731,"text":")","start":0,"stop":1},{"id":28705,"text":"▁","start":1,"stop":2},{"id":233,"text":"<0xE6>","start":2,"stop":3},{"id":141,"text":"<0x8A>","start":2,"stop":3},{"id":184,"text":"<0xB5>","start":2,"stop":3},{"id":29162,"text":"当","start":3,"stop":4},{"id":233,"text":"<0xE6>","start":4,"stop":5},{"id":171,"text":"<0xA8>","start":4,"stop":5},{"id":172,"text":"<0xA9>","start":4,"stop":5},{"id":29041,"text":"、","start":5,"stop":6},{"id":235,"text":"<0xE8>","start":6,"stop":7},{"id":182,"text":"<0xB3>","start":6,"stop":7},{"id":173,"text":"<0xAA>","start":6,"stop":7},{"id":233,"text":"<0xE6>","start":7,"stop":8},{"id":171,"text":"<0xA8>","start":7,"stop":8},{"id":172,"text":"<0xA9>","start":7,"stop":8},{"id":29041,"text":"、","start":8,"stop":9},{"id":29596,"text":"先","start":9,"stop":10},{"id":29012,"text":"取","start":10,"stop":11},{"id":29631,"text":"特","start":11,"stop":12},{"id":233,"text":"<0xE6>","start":12,"stop":13},{"id":171,"text":"<0xA8>","start":12,"stop":13},{"id":172,"text":"<0xA9>","start":12,"stop":13},{"id":29965,"text":"及","start":13,"stop":14},{"id":31050,"text":"び","start":14,"stop":15},{"id":235,"text":"<0xE8>","start":15,"stop":16},{"id":182,"text":"<0xB3>","start":15,"stop":16},{"id":134,"text":"<0x83>","start":15,"stop":16},{"id":31626,"text":"借","start":16,"stop":17},{"id":233,"text":"<0xE6>","start":17,"stop":18},{"id":171,"text":"<0xA8>","start":17,"stop":18},{"id":172,"text":"<0xA9>","start":17,"stop":18},{"id":29041,"text":"、","start":18,"stop":19},{"id":29241,"text":"ま","start":19,"stop":20},{"id":29227,"text":"た","start":20,"stop":21},{"id":29277,"text":"は","start":21,"stop":22},{"id":29163,"text":"所","start":22,"stop":23},{"id":28998,"text":"有","start":23,"stop":24},{"id":233,"text":"<0xE6>","start":24,"stop":25},{"id":171,"text":"<0xA8>","start":24,"stop":25},{"id":172,"text":"<0xA9>","start":24,"stop":25},{"id":30210,"text":"留","start":25,"stop":26},{"id":29321,"text":"保","start":26,"stop":27},{"id":29041,"text":"、","start":27,"stop":28},{"id":29414,"text":"等","start":28,"stop":29},{"id":28993,"text":"の","start":29,"stop":30},{"id":231,"text":"<0xE4>","start":30,"stop":31},{"id":188,"text":"<0xB9>","start":30,"stop":31},{"id":156,"text":"<0x99>","start":30,"stop":31},{"id":28993,"text":"の","start":31,"stop":32},{"id":29474,"text":"完","start":32,"stop":33},{"id":29374,"text":"全","start":33,"stop":34},{"id":29270,"text":"な","start":34,"stop":35},{"id":29116,"text":"る","start":35,"stop":36},{"id":29163,"text":"所","start":36,"stop":37},{"id":28998,"text":"有","start":37,"stop":38},{"id":233,"text":"<0xE6>","start":38,"stop":39},{"id":171,"text":"<0xA8>","start":38,"stop":39},{"id":172,"text":"<0xA9>","start":38,"stop":39},{"id":28993,"text":"の","start":39,"stop":40},{"id":29037,"text":"行","start":40,"stop":41},{"id":29154,"text":"使","start":41,"stop":42},{"id":29078,"text":"を","start":42,"stop":43},{"id":232,"text":"<0xE5>","start":43,"stop":44},{"id":169,"text":"<0xA6>","start":43,"stop":44},{"id":171,"text":"<0xA8>","start":43,"stop":44},{"id":31967,"text":"げ","start":44,"stop":45},{"id":29116,"text":"る","start":45,"stop":46},{"id":29339,"text":"事","start":46,"stop":47},{"id":29418,"text":"情","start":47,"stop":48},{"id":29309,"text":"が","start":48,"stop":49},{"id":29270,"text":"な","start":49,"stop":50},{"id":29132,"text":"い","start":50,"stop":51},{"id":29543,"text":"こ","start":51,"stop":52},{"id":29316,"text":"と","start":52,"stop":53}]
# take 10+1 tokens so a byte-level split at the boundary can be detected
# if the 10th and 11th tokens are bytes of the same character, the 10th token is discarded too
sliced_tokens = tokens[:10+1]
end_idx = sliced_tokens[-1]["start"]
sliced_text = text[:end_idx]
print(text)
# ⑴ 抵当権、質権、先取特権及び賃借権、または所有権留保、等の乙の完全なる所有権の行使を妨げる事情がないこと
print(sliced_text)
# ⑴ 抵当
Tokenize result of the truncated text:
[
{ "id": 325, "text": "▁(", "start": 0, "stop": 1 },
{ "id": 28740, "text": "1", "start": 0, "stop": 1 },
{ "id": 28731, "text": ")", "start": 0, "stop": 1 },
{ "id": 28705, "text": "▁", "start": 1, "stop": 2 },
{ "id": 233, "text": "<0xE6>", "start": 2, "stop": 3 },
{ "id": 141, "text": "<0x8A>", "start": 2, "stop": 3 },
{ "id": 184, "text": "<0xB5>", "start": 2, "stop": 3 },
{ "id": 29162, "text": "当", "start": 3, "stop": 4 }
]
The changed request may look like:
@tgaddair We have found some issues related to this PR and made some commits to address them.
Thanks for the very thorough explanation of the issue and the fix! This LGTM. I'll go ahead and land assuming tests finish successfully.
What does this PR do?
This PR introduces three improvements to the LoRAX project:

1. Tokenization API: Inspired by HF TGI, this addition allows users to get tokenization results via a new `/tokenize` route in the API.
2. Chat Completion Stream API fix: The existing implementation had a discrepancy with the official OpenAI API, where the final chunk in a stream erroneously contained null "role" and "content", leading to potential unexpected errors in some OpenAI API clients. This fix ensures that, while the last chunk (`is_stop` is True) still treats "role" and "content" as null, they are not serialized, allowing the final chunk's delta to correctly return as an empty object (see the sketch after this list).
3. Documentation fixes, as noted in the PR title.

These enhancements not only improve the project's functionality but also its usability and compatibility with existing OpenAI API clients.
Who can review?
@tgaddair
I have conducted several tests to ensure the proper functioning of these improvements and applied linting to adhere to coding style and conventions. Despite these precautions, there could still be issues within the code. Please feel free to review and suggest any changes you think might be necessary.
cc. @soeque1