
Updated llm demo to report number of generated tokens in the last response #2373

Merged
merged 5 commits into main from llm_token_reporting on Apr 16, 2024

Conversation

bstrzele (Collaborator):

CVS-135176

 def generate():
-    ov_model_exec.generate(**tokens, **generate_kwargs)
+    result = ov_model_exec.generate(**tokens, **generate_kwargs)
Collaborator:

What's the reason we do not include special tokens? What are those special tokens?

Collaborator (Author):

In this case the special tokens are used for left/right padding. They are stripped to ensure accurate token counting when batch_size > 1.
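
For illustration, a minimal sketch of that idea (names, shapes, and pad_token_id are assumptions, not the demo's actual code): with batch_size > 1 the shorter sequences are padded, so the padding tokens have to be excluded before counting.

```python
import numpy as np

def count_generated_tokens(output_ids, input_ids, pad_token_id):
    # Strip the prompt part, then count only non-padding tokens;
    # raw sequence lengths would over-count padded sequences in a batch.
    generated = output_ids[:, input_ids.shape[1]:]
    return int(np.count_nonzero(generated != pad_token_id))

# Example: batch of 2, the second sequence is right-padded with pad_token_id=0
prompts = np.array([[5, 6], [7, 8]])
outputs = np.array([[5, 6, 11, 12, 13], [7, 8, 21, 0, 0]])
print(count_generated_tokens(outputs, prompts, pad_token_id=0))  # 4
```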

@dtrawins (Collaborator):

Include the output changes in the README.

@bstrzele requested a review from @dtrawins on April 2, 2024, 07:26
@@ -181,6 +181,7 @@ def generate():
    for partial_result in streamer:
        yield serialize_completions(batch_size, partial_result)
    t1.join()
    token_count[0] -= len(tokens["input_ids"].flatten())
    yield [Tensor("token_count", np.array([token_count[0]], dtype=np.int32))]
Collaborator:

Why only the 1st element of token_count? Was it tested with bs > 1 and variable response sizes?

Collaborator (Author):

The first element of token_count represents the final number of tokens. This snippet uses a list instead of a single variable because Python lists are mutable, so the value can be updated in place from another scope.
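
A minimal sketch of that pattern (the callback and generation loop are illustrative assumptions, not the demo's code): the one-element list lets the streamer callback, which runs while the generation thread is alive, update a value the outer scope reads after t1.join().

```python
from threading import Thread

token_count = [0]  # a list, not an int, so the callback can mutate it in place

def on_new_token(token_id):
    token_count[0] += 1  # no `nonlocal`/`global` needed: the list is only mutated, never rebound

def fake_generate(callback):
    for token_id in range(7):  # stand-in for the real generation loop
        callback(token_id)

t1 = Thread(target=fake_generate, args=(on_new_token,))
t1.start()
t1.join()
print(token_count[0])  # 7
```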

Collaborator:

I think @dtrawins misunderstood the token_count variable - it is always a 1-element array so that it can be returned as a numpy array.

But @dtrawins makes a good point here - @bstrzele, should we track the token count per batch item or just the sum of tokens across all batches?

Collaborator:

Why are you re-creating the numpy array here? token_count is an array anyway.

Collaborator (Author):

I currently do not see the value in having token reporting split per batch item, but it would be a simple change.
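
For comparison, a sketch of what per-batch-item reporting could look like next to the current aggregate count (the array values and pad_token_id=0 are illustrative assumptions):

```python
import numpy as np

generated = np.array([[11, 12, 13], [21, 0, 0]])     # batch of 2, right-padded with 0
per_item = np.count_nonzero(generated != 0, axis=1)  # array([3, 1]) - tokens per batch item
aggregate = int(per_item.sum())                      # 4 - the single number reported today
print(per_item, aggregate)
```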


-    return serialize_completions(batch_size, completions)
+    t1.join()
+    return serialize_completions(batch_size, completions, token_count[0])
@dkalinowski (Collaborator), Apr 8, 2024:
I'm not sure why you are working on a numpy array just to return a number in line 191. Then, in serialize_completions, you create a numpy array anyway: Tensor("token_count", np.array([token_count], dtype=np.int32)). @bstrzele
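
A hedged sketch of what that suggestion could look like (Tensor and serialize_completions are simplified stand-ins, not the demo's actual implementations): keep the count as a plain int in generate() and wrap it into a numpy array only once, inside serialize_completions.

```python
import numpy as np

class Tensor:  # minimal stand-in for the servable's Tensor type
    def __init__(self, name, data):
        self.name, self.data = name, data

def serialize_completions(batch_size, completions, token_count=None):
    outputs = [Tensor("completion", np.array(completions, dtype=object))]
    if token_count is not None:
        # the only place the plain int is wrapped into a numpy array
        outputs.append(Tensor("token_count", np.array([token_count], dtype=np.int32)))
    return outputs
```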

@bstrzele force-pushed the llm_token_reporting branch 2 times, most recently from 22b0135 to a899698 on April 15, 2024, 13:56
@dtrawins requested reviews from @ngrozae and @mzegla on April 15, 2024, 14:05
@dtrawins merged commit 8435706 into main on Apr 16, 2024
5 checks passed

4 participants