
Commit 7b1cc19: Added SharpToken to examples
dmytrostruk committed Sep 14, 2023 · 1 parent 75e1b3b
Showing 3 changed files with 7 additions and 1 deletion.
1 change: 1 addition & 0 deletions dotnet/Directory.Packages.props
@@ -16,6 +16,7 @@
<PackageVersion Include="NRedisStack" Version="0.9.0" />
<PackageVersion Include="Pgvector" Version="0.1.3" />
<PackageVersion Include="Polly" Version="7.2.4" />
+ <PackageVersion Include="SharpToken" Version="1.2.12" />
<PackageVersion Include="System.Diagnostics.DiagnosticSource" Version="6.0.1" />
<PackageVersion Include="System.Linq.Async" Version="6.0.1" />
<PackageVersion Include="System.Text.Json" Version="6.0.8" />
6 changes: 5 additions & 1 deletion dotnet/samples/KernelSyntaxExamples/Example55_TextChunker.cs
@@ -4,6 +4,7 @@
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.SemanticKernel.Text;
+ using SharpToken;

// ReSharper disable once InconsistentNaming
public static class Example55_TextChunker
@@ -69,6 +70,9 @@ private static void WriteParagraphsToConsole(List<string> paragraphs)

private static int CustomTokenCounter(string input)
{
- return input.Length / 4;
+ var encoding = GptEncoding.GetEncoding("p50k_base");

anthonypuppo (Contributor) commented on Sep 14, 2023:

Can this be changed to use the cl100k encoding? Presumably almost all consumers will need that, since it's what gpt-3.5-turbo, gpt-4, etc. all use (see #2334).

dmytrostruk (Author, Member) commented on Sep 14, 2023:
@anthonypuppo Thank you for the feedback.
That's the reason I didn't want to use a concrete tokenizer library initially: specifically in the TextChunker example, it doesn't matter which tokenizer library, encoding name, or model is used. In all cases, TextChunker should behave the same way, on top of any kind of tokenization logic.

Yes, it makes sense to use the most popular models in examples, since that covers the most common user scenarios. But at the same time, users may think that using SharpToken or the cl100k encoding is a requirement rather than an option.

I updated the encoding and added some comments in these commits: 76d34e3, 795d30f.

Thanks again!
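The swap the thread discusses amounts to changing a single encoding-name string. A minimal sketch, assuming the SharpToken NuGet package (the helper name `CountTokens` and its default argument are invented for illustration):

```csharp
using SharpToken;

public static class TokenCounters
{
    // gpt-3.5-turbo and gpt-4 use "cl100k_base"; older completion models
    // such as text-davinci-003 use "p50k_base". Only the encoding name
    // changes between models — the counting logic stays identical.
    public static int CountTokens(string input, string encodingName = "cl100k_base")
    {
        var encoding = GptEncoding.GetEncoding(encodingName);
        return encoding.Encode(input).Count;
    }
}
```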

+ var tokens = encoding.Encode(input);
+
+ return tokens.Count;
}
}
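The point made in the thread above — that TextChunker works the same with any tokenization logic — holds because the counter is passed in as a delegate. A hedged sketch (assumes a `SplitPlainTextLines` overload that accepts a token counter, as Example55 uses; the sample text and the budget of 10 are invented):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SemanticKernel.Text;

public static class ChunkerDemo
{
    // The heuristic this commit removed: roughly four characters per token
    // for English text. No tokenizer library required.
    public static int NaiveTokenCounter(string input) => input.Length / 4;

    public static void Main()
    {
        // Split a hypothetical sample text with a per-line budget of 10 tokens.
        var lines = TextChunker.SplitPlainTextLines(
            "This is a small sample text used to demonstrate chunking.",
            10,
            NaiveTokenCounter);

        foreach (var line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

Because the counter is just a delegate, swapping in the SharpToken-backed `CustomTokenCounter` from the diff changes only the budget accounting, not the chunking logic.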
@@ -26,6 +26,7 @@
<PackageReference Include="Microsoft.Extensions.Logging.Console" />
<PackageReference Include="Newtonsoft.Json" />
<PackageReference Include="Polly" />
+ <PackageReference Include="SharpToken" />
<PackageReference Include="System.Linq.Async" />
</ItemGroup>
<ItemGroup>
