.Net: cl100k_base tokenizer #2334
Labels: .NET (Issue or Pull requests regarding .NET code)
Comments
@dluc are you able to look into this?
github-merge-queue bot pushed a commit that referenced this issue on Sep 14, 2023:
### Motivation and Context

Closes #2679
Closes #2508
Closes #2334

This PR removes `GPT3Tokenizer` from the Semantic Kernel repository.

Alternatives:
- https://github.com/microsoft/Tokenizer
- https://github.com/dmitry-brazhenko/sharptoken

### Description

1. Removed tokenizer logic, tests and related files.
2. Updated documentation.

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
SOE-YoungS pushed a commit to SOE-YoungS/semantic-kernel that referenced this issue on Nov 1, 2023.
The existing `GPT3Tokenizer` class uses the p50k_base encoding. GPT-3.5, GPT-4, and text-embedding-ada-002 all rely on the cl100k_base encoding instead.
The table below is extracted from the OpenAI notebook on token counting.
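As a minimal sketch of what cl100k_base token counting could look like with one of the alternatives linked in the PR (here the third-party SharpToken package; the `GptEncoding` API and the example text are assumptions, not part of this issue):

```csharp
using SharpToken;

class TokenCountDemo
{
    static void Main()
    {
        // cl100k_base is the encoding used by GPT-3.5, GPT-4,
        // and text-embedding-ada-002 (p50k_base is not compatible).
        var encoding = GptEncoding.GetEncoding("cl100k_base");

        // Encode returns the list of token ids; its Count is the token count
        // you would budget against a model's context window.
        var tokens = encoding.Encode("Hello, Semantic Kernel!");
        System.Console.WriteLine($"Token count: {tokens.Count}");

        // Decoding the ids should round-trip back to the original text.
        System.Console.WriteLine(encoding.Decode(tokens));
    }
}
```

Microsoft's own https://github.com/microsoft/Tokenizer package exposes a similar encode/count surface and is the other alternative named above.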