Python: adding external tokenizer to python text_chunker #1388

gramhagen · 2023-06-08T21:46:14Z

Motivation and Context

Addresses issue #1387.
Text chunking should allow the use of an external tokenizer.

Description

Added pass-through of a token-counting function, defaulting to the existing _token_counter() method.
While fixing a type hint bug, I also made a few changes to clean up the code.
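
For readers unfamiliar with the pattern, the pass-through described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `split_words`, `default_token_counter`, and `word_counter` are hypothetical names, and the four-characters-per-token default is an assumed heuristic standing in for whatever `_token_counter()` actually does.

```python
from typing import Callable, List


def default_token_counter(text: str) -> int:
    """Assumed stand-in heuristic: roughly four characters per token."""
    return len(text) // 4


def split_words(
    text: str,
    max_tokens: int,
    token_counter: Callable[[str], int] = default_token_counter,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    A chunk is closed as soon as adding the next word would push its
    count, as measured by token_counter, above max_tokens. A single word
    that exceeds the limit on its own is still emitted as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and token_counter(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def word_counter(text: str) -> int:
    """Example external counter: one token per whitespace-separated word."""
    return len(text.split())


if __name__ == "__main__":
    # The caller swaps in its own counter without touching the chunker.
    print(split_words("one two three four five", 2, token_counter=word_counter))
```

Because the counter is just a `Callable[[str], int]`, a real tokenizer can be wrapped the same way, for example a lambda that returns the length of the encoded token list from whichever tokenizer library the caller uses.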

Future work: it would be nice to add chunk overlap functionality similar to langchain's TextSplitter.

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
All unit tests pass, and I have added new tests where possible
I didn't break anyone 😄

shawncal · 2023-07-08T04:46:11Z

@gramhagen Cool change! Thanks for the contribution.

Welcome to Semantic Kernel!

adding external tokenizer option to text_chunker and cleaning up code a bit

177a2e1

github-actions bot added the python label (Pull requests for the Python Semantic Kernel) Jun 8, 2023

alexchaomander requested review from mkarle and awharrison-28 June 8, 2023 22:23

lemillermicrosoft and others added 2 commits June 12, 2023 10:13

Merge branch 'main' into gramhagen/add_text_tokenizer

cbc409c

Running formatter

d03104b

mkarle approved these changes Jun 16, 2023

shawncal changed the title ~~adding external tokenizer to python text_chunker~~ Python: adding external tokenizer to python text_chunker Jun 29, 2023

Merge branch 'main' into gramhagen/add_text_tokenizer

7f84a50

shawncal requested a review from a team as a code owner July 8, 2023 04:42

shawncal added this pull request to the merge queue Jul 8, 2023

Merged via the queue into microsoft:main with commit 8527c58 Jul 8, 2023
19 checks passed
