encoder.json/vocab.bpe show up in every project that uses SK #2679

Closed
stephentoub opened this issue Sep 1, 2023 · 2 comments · Fixed by #2809

@stephentoub
Member

stephentoub commented Sep 1, 2023

Bringing in the Microsoft.SemanticKernel NuGet package causes encoder.json and vocab.bpe to show up in the consuming application's project (screenshot omitted), and they end up in the application's output directory (screenshot omitted), regardless of whether the app uses the tokenizer.

I'd opened #1800 to turn them into assembly resources instead, so that they'd simply be part of the assembly and not separate files, but it was closed due to a lack of a decision about what to do with it.
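
For illustration only (this is not the content of #1800): a minimal sketch of how data files are typically embedded in a .NET assembly through MSBuild instead of being shipped as loose content files. The `Tokenizers/Settings/` path and the `LogicalName` values are hypothetical; the actual file layout in the repository may differ.

```xml
<!-- Hypothetical fragment of the library's .csproj; the paths are assumptions. -->
<ItemGroup>
  <!-- Compile the data files into the assembly rather than packing them as content. -->
  <EmbeddedResource Include="Tokenizers/Settings/encoder.json" LogicalName="encoder.json" />
  <EmbeddedResource Include="Tokenizers/Settings/vocab.bpe" LogicalName="vocab.bpe" />
</ItemGroup>
```

Under this setup the library would read the data at run time with `Assembly.GetManifestResourceStream("encoder.json")`, and no extra files would be copied into the consuming project's output directory.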

shawncal added the triage label Sep 1, 2023
@anthonypuppo
Contributor

Echoing my comment on #1800

Bringing up SharpToken, as it would solve 1) embedding of the tokenization resource files, 2) extracting the tokenization logic to a separate package (maybe wrapped as an official SK package in the future), and 3) bug #2334.

@lemillermicrosoft
Member

Another option could be to move the tokenizer into an extension package for the OpenAI connector.

github-merge-queue bot pushed a commit that referenced this issue Sep 14, 2023
### Motivation and Context


Closes #2679
Closes #2508
Closes #2334

This PR removes `GPT3Tokenizer` from the Semantic Kernel
repository.

Alternatives:
https://github.com/microsoft/Tokenizer
https://github.com/dmitry-brazhenko/sharptoken

### Description

1. Removed tokenizer logic, tests and related files.
2. Updated documentation.

### Contribution Checklist


- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
SOE-YoungS pushed a commit to SOE-YoungS/semantic-kernel that referenced this issue Nov 1, 2023