Updated llm demo to report number of generated tokens in the last response #2373
Conversation
demos/python_demos/llm_text_generation/servable_stream/model.py (Outdated)
```diff
 def generate():
-    ov_model_exec.generate(**tokens, **generate_kwargs)
+    result = ov_model_exec.generate(**tokens, **generate_kwargs)
```
What's the reason we do not include special tokens? What are those special tokens?
In this case the special tokens are used for left/right padding. They are stripped to ensure accurate token counting when batch_size > 1.
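For illustration, a minimal sketch of that counting logic (assumed HF-style `input_ids` and a hypothetical `pad_token_id`, not the demo's exact code): padding tokens inflate the raw length of every padded sequence, so they are masked out before counting.

```python
import numpy as np

def count_real_tokens(input_ids: np.ndarray, pad_token_id: int) -> int:
    # input_ids: (batch_size, seq_len); shorter sequences are padded on the
    # left/right up to seq_len, and pad positions must not be counted.
    return int((input_ids != pad_token_id).sum())

ids = np.array([[0, 0, 5, 6], [7, 8, 9, 10]])  # batch of 2, left-padded with 0
print(count_real_tokens(ids, pad_token_id=0))  # -> 6
```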
include output changes in the readme
```diff
@@ -181,6 +181,7 @@ def generate():
     for partial_result in streamer:
         yield serialize_completions(batch_size, partial_result)
     t1.join()
+    token_count[0] -= len(tokens["input_ids"].flatten())
     yield [Tensor("token_count", np.array([token_count[0]], dtype=np.int32))]
```
why only the 1st element of token_count? Was it tested with bs>1 and variable response sizes?
The first element of token_count represents the final number of tokens. This snippet uses a list instead of a single variable because Python lists are mutable, so the nested function can update the shared value in place.
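A minimal sketch of that pattern (hypothetical names, not the demo's exact code): the one-element list acts as a shared mutable cell, so a callback running on another thread can update the count in place, whereas rebinding a plain int inside the nested function would merely shadow it unless declared `nonlocal`.

```python
from threading import Thread

def generate_with_count():
    token_count = [0]  # one-element list acting as a shared mutable cell

    def on_token(token_id):
        token_count[0] += 1  # mutates the shared cell in place

    worker = Thread(target=lambda: [on_token(t) for t in range(5)])
    worker.start()
    worker.join()
    return token_count[0]

print(generate_with_count())  # -> 5
```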
why are you re-creating a numpy array here? token_count is an array anyway
I currently do not understand the value in having token reporting fragmented per batch; it would be a simple change.
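If per-sequence reporting were wanted, a hedged sketch of what it might look like (assumed output shapes and a hypothetical `pad_token_id`, not code from the PR) is one count per batch item instead of a single aggregated number:

```python
import numpy as np

def per_sequence_counts(output_ids: np.ndarray, pad_token_id: int) -> np.ndarray:
    # output_ids: (batch_size, seq_len); count non-pad tokens per row
    return (output_ids != pad_token_id).sum(axis=1).astype(np.int32)
```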
```diff
-    return serialize_completions(batch_size, completions)
     t1.join()
+    return serialize_completions(batch_size, completions, token_count[0])
```
I'm not sure why you are working on a numpy array just to return a number in line 191. Then, in serialize_completions, you create a numpy array anyway: Tensor("token_count", np.array([token_count], dtype=np.int32)).
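A hedged sketch of the simplification suggested here (assuming `Tensor` is the servable's tensor type, as in the call quoted above): keep `token_count` a plain int end to end and create the numpy array only once, at serialization time.

```python
import numpy as np
from pyovms import Tensor  # assumed import, as used by the OVMS Python servables

def serialize_token_count(token_count: int) -> Tensor:
    # the single place where the count becomes a numpy array
    return Tensor("token_count", np.array([token_count], dtype=np.int32))
```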
@bstrzele force-pushed from 22b0135 to a899698, and later from 3c84f44 to 94f2cc8.
CVS-135176