.Net: cl100k_base tokenizer #2334

anthonypuppo · 2023-08-05T17:50:49Z

The existing GPT3Tokenizer class uses p50k_base encoding. GPT-3.5, GPT-4, and text-embedding-ada-002 all rely on cl100k_base encoding.

The table below is taken from the OpenAI notebook on token counting.

| Encoding name | OpenAI models |
| --- | --- |
| cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002 |
| p50k_base | Codex models, text-davinci-002, text-davinci-003 |
| r50k_base (or gpt2) | GPT-3 models like davinci |
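
To make the mapping in the table above concrete, here is a minimal lookup sketch in Python. It is an illustration only, not part of Semantic Kernel or any tokenizer library: the names `ENCODING_BY_MODEL`, `ENCODING_BY_PREFIX`, and `encoding_for_model` are hypothetical, only the models the table names explicitly are listed, and the prefix rule for versioned chat model names (for example `gpt-3.5-turbo-0613`) is an assumption that goes beyond the table.

```python
# Encoding names copied from the table; only explicitly named models are listed.
ENCODING_BY_MODEL = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}

# Assumption (not in the table): versioned chat model names share the
# encoding of their base model.
ENCODING_BY_PREFIX = {
    "gpt-4-": "cl100k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}


def encoding_for_model(model: str) -> str:
    """Return the encoding name for a model, or raise KeyError if unknown."""
    if model in ENCODING_BY_MODEL:
        return ENCODING_BY_MODEL[model]
    for prefix, encoding in ENCODING_BY_PREFIX.items():
        if model.startswith(prefix):
            return encoding
    raise KeyError(f"No known encoding for model {model!r}")
```

A caller that counts tokens with the wrong encoding gets counts that do not match what the model sees, so raising on unknown models is deliberately stricter than falling back to a default.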

anthonypuppo · 2023-08-05T18:05:31Z

SharpToken

nacharya1 · 2023-08-10T16:26:42Z

@dluc are you able to look into this?

### Motivation and Context

Closes #2679
Closes #2508
Closes #2334

This PR contains changes to remove `GPT3Tokenizer` from Semantic Kernel repository.

Alternatives:
- https://github.com/microsoft/Tokenizer
- https://github.com/dmitry-brazhenko/sharptoken

### Description

1. Removed tokenizer logic, tests and related files.
2. Updated documentation.

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄


shawncal added the .NET and triage labels Aug 5, 2023

nacharya1 removed the triage label Aug 10, 2023

nacharya1 assigned dluc Aug 10, 2023

nacharya1 added the kernel.core label Aug 10, 2023

anthonypuppo mentioned this issue Aug 23, 2023

Embed encoder.json/vocab.bpe in OpenAI connector assembly #1800

Closed

5 tasks

anthonypuppo mentioned this issue Sep 1, 2023

encoder.json/vocab.bpe show up in every project that uses SK #2679

Closed

dmytrostruk mentioned this issue Sep 13, 2023

.Net: Removed GPT3Tokenizer and related files #2809

Merged

4 tasks

anthonypuppo referenced this issue Sep 14, 2023

Added SharpToken to examples

7b1cc19

dmytrostruk closed this as completed in #2809 Sep 14, 2023
