Improve performance of GPT3Tokenizer #579

stephentoub · 2023-04-21T16:32:42Z

Motivation and Context

Improve the performance of GPT3Tokenizer in both throughput and allocation.

Description

This PR focuses primarily on the code path taken once the BPE cache has been sufficiently warmed up. Much more can be done for the code path that warms the cache, but I've handled only the low-hanging fruit there for now. All of this can also be improved significantly once the project targets .NET Core instead of netstandard2.0.
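For readers unfamiliar with the distinction between the cold and warm paths, here is a minimal sketch of the general idea. It is not the code from this PR: the class name `BpeCacheSketch`, the `SlowMerge` helper, and the `ColdCalls` counter are all hypothetical, and the "merge" is a stand-in that just splits a word into characters rather than performing real byte-pair merging.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical illustration only; none of these names come from the PR.
public static class BpeCacheSketch
{
    // Maps a word to its already-computed pieces.
    private static readonly ConcurrentDictionary<string, string[]> s_cache =
        new ConcurrentDictionary<string, string[]>();

    // Counts how many times the expensive (cold) path actually ran.
    public static int ColdCalls;

    // Warm path: a dictionary lookup. Cold path: compute once, then cache.
    public static string[] Encode(string word)
    {
        return s_cache.GetOrAdd(word, SlowMerge);
    }

    // Stand-in for the expensive merge loop: splits the word into
    // single-character strings. A real tokenizer would merge byte pairs here.
    private static string[] SlowMerge(string word)
    {
        Interlocked.Increment(ref ColdCalls);
        var pieces = new string[word.Length];
        for (int i = 0; i < word.Length; i++)
        {
            pieces[i] = word[i].ToString();
        }
        return pieces;
    }
}
```

Under these assumptions, calling `BpeCacheSketch.Encode("hello")` twice runs `SlowMerge` only on the first call; the second call returns the cached array. A benchmark that encodes the same input repeatedly therefore measures almost entirely the warm path, which is the path this PR targets.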

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
All unit tests pass, and I have added new tests where possible
I didn't break anyone 😄

private string _input =
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
    "Blandit cursus risus at ultrices mi. Elementum facilisis leo vel fringilla est ullamcorper eget. Pellentesque sit amet porttitor " +
    "eget dolor morbi non arcu. Sed turpis tincidunt id aliquet. Sit amet luctus venenatis lectus magna fringilla urna. Eu turpis " +
    "egestas pretium aenean. Tempus quam pellentesque nec nam aliquam. Sagittis vitae et leo duis ut diam. Tempor orci eu lobortis " +
    "elementum. Placerat vestibulum lectus mauris ultrices eros in cursus. Tempus egestas sed sed risus pretium quam vulputate. Aliquam " +
    "faucibus purus in massa.";

[Benchmark(Baseline = true)]
public List<int> Old() => OldTokenizer.GPT3Tokenizer.Encode(_input);

[Benchmark]
public List<int> New() => NewTokenizer.GPT3Tokenizer.Encode(_input);

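| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------|---------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| Old | 50.82 us | 0.997 us | 2.014 us | 49.99 us | 1.00 | 0.00 | 72.22 KB | 1.00 |
| New | 26.52 us | 0.199 us | 0.155 us | 26.50 us | 0.50 | 0.02 | 31.55 KB | 0.44 |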

lemillermicrosoft

LGTM - Thanks for the good lessons.

lemillermicrosoft added the .NET, kernel, and PR: ready for review labels Apr 21, 2023

lemillermicrosoft approved these changes Apr 21, 2023

Merge branch 'main' into gpt3perf

45e9446

lemillermicrosoft added the PR: ready to merge label Apr 21, 2023

Merge branch 'main' into gpt3perf

36a4ab8

dluc approved these changes Apr 22, 2023

dluc merged commit 5c4cd3f into microsoft:main Apr 22, 2023
10 checks passed

stephentoub deleted the gpt3perf branch April 22, 2023 10:41
