Remove some more overhead from GPT3Tokenizer #675

stephentoub · 2023-04-26T21:01:33Z

Motivation and Context

Picks off some more low-hanging fruit to reduce overhead in GPT3Tokenizer.

Description

- We can both simplify and speed up the parsing of the vocab.bpe file.
- We can remove the SortedDictionary from BytePairEncoding. This also eliminates a full O(N) iteration over the dictionary on every iteration of the outer loop (as part of the Min() call).
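
To illustrate the second point, here is a minimal sketch of selecting the lowest-ranked pair in a single pass over the adjacent pairs, with no intermediate sorted collection and no separate Min() scan. It is not the code from this PR: the class name `BpeMergeSketch`, the method `FindBestPair`, and the shape of the `ranks` dictionary are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the actual GPT3Tokenizer implementation.
public static class BpeMergeSketch
{
    // Returns the adjacent pair with the lowest merge rank,
    // or null when no adjacent pair in the word has a rank.
    public static (string Left, string Right)? FindBestPair(
        IReadOnlyList<string> word,
        IReadOnlyDictionary<(string, string), int> ranks)
    {
        (string, string)? best = null;
        int bestRank = int.MaxValue;

        // One pass: track the minimum while enumerating pairs, instead of
        // storing every pair in a sorted collection and scanning it afterwards.
        for (int i = 0; i + 1 < word.Count; i++)
        {
            var pair = (word[i], word[i + 1]);
            if (ranks.TryGetValue(pair, out int rank) && rank < bestRank)
            {
                bestRank = rank;
                best = pair;
            }
        }

        return best;
    }
}
```

With a rank table such as `{ ("l", "l"): 0, ("h", "e"): 1 }` and the word `h e l l o`, the method picks `("l", "l")`, since it has the lowest rank. The outer merge loop would call this once per iteration, so each iteration costs one linear pass over the word and nothing more.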

Tests before and after: (images not preserved)

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
All unit tests pass, and I have added new tests where possible
I didn't break anyone 😄

### Motivation and Context

Some more low-hanging fruit reduction in GPT3Tokenizer.

### Description

- We can both simplify and make faster the parsing of the vocab.bpe file
- We can remove the SortedDictionary from BytePairEncoding, including a full O(N) iteration of the dictionary on every iteration of the outer loop (as part of the Min() call).

Co-authored-by: Devis Lucato <dluc@users.noreply.github.com>
Co-authored-by: Devis Lucato <devis@microsoft.com>

github-actions bot added the .NET label (Issue or Pull requests regarding .NET code) on Apr 26, 2023

dluc previously approved these changes Apr 27, 2023

dluc and others added 3 commits April 27, 2023 00:05

- Merge branch 'main' into gpttokenperf (df3656b)
- Merge branch 'main' into gpttokenperf (096c120)
- fix code style (d8fee83)

dluc dismissed their stale review via d8fee83 April 27, 2023 07:25

dluc approved these changes Apr 27, 2023

dluc merged commit 8fc9d7a into microsoft:main Apr 27, 2023
11 checks passed
