Say I have a very long string s and a limited budget of n tokens.
I want the longest possible substring of the original string, starting at index 0, whose token count does not exceed n.
For example:
s = "hello world, great to see you"
n = 2
then I would expect to get "hello world".
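import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import java.util.List;

final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
final Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
final int n = 2;
final String s = "hello world, great to see you!";
// Encode the whole string, then keep at most the first n tokens.
final List<Integer> encoded = enc.encode(s);
// Math.min guards against inputs that encode to fewer than n tokens.
final List<Integer> truncated = encoded.subList(0, Math.min(n, encoded.size()));
final String decoded = enc.decode(truncated);
System.out.println(decoded);
// prints: hello world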
Note that, depending on your input text, the decoded text can contain non-printable characters. This can happen when a multi-byte Unicode character (e.g. an emoji) is encoded as multiple tokens and the truncation falls in the middle of it. For example, for s = "I love 🍕" and n = 3, the tokens [40, 3021, 11410, 235, 243] are truncated to [40, 3021, 11410], where 40 corresponds to "I", 3021 corresponds to " love", and 11410 corresponds to a space followed by only the leading part of the 4-byte UTF-8 encoding of 🍕.
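One possible way to avoid the broken trailing character is to drop tokens from the end until the decoded text no longer ends in the Unicode replacement character U+FFFD. The following is only a sketch under stated assumptions: `SafeTruncate` and `truncateToTokens` are hypothetical names, not part of the library, and it assumes the decoder turns an incomplete UTF-8 sequence into U+FFFD (which is what Java's `new String(bytes, StandardCharsets.UTF_8)` does). The decoder is passed in as a function so the helper does not depend on a particular tokenizer version.

```java
import java.util.List;
import java.util.function.Function;

final class SafeTruncate {

    /**
     * Decodes at most maxTokens tokens from the start of the list.
     * If the decoded text ends in U+FFFD (assumed here to signal a
     * character that was cut in half), one more token is dropped and
     * the decode is retried, until the text ends cleanly.
     */
    static String truncateToTokens(List<Integer> tokens,
                                   int maxTokens,
                                   Function<List<Integer>, String> decode) {
        // Clamp the requested length to the valid range [0, tokens.size()].
        int end = Math.min(Math.max(maxTokens, 0), tokens.size());
        while (end > 0) {
            String text = decode.apply(tokens.subList(0, end));
            if (!text.endsWith("\uFFFD")) {
                return text;
            }
            end--; // the last token split a multi-byte character; back off
        }
        return "";
    }
}
```

With the snippet above, the call would look roughly like `SafeTruncate.truncateToTokens(encoded, n, enc::decode)`, assuming `decode` accepts a `List<Integer>` as it does there. One caveat of this approach: if the original text itself legitimately ends in U+FFFD at the cut point, the helper will drop more tokens than strictly necessary.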
Edit: Here is a visual explanation using a different encoding; truncating after 3 tokens would produce the same edge case: