Is it possible to get sliced string with limited token #3

Closed
jiangying000 opened this issue Apr 3, 2023 · 2 comments
Comments

jiangying000 commented Apr 3, 2023

Say I have a very long string s and a limited token budget n.

I want the longest possible substring of the original string, starting at index 0, whose token count does not exceed n.

For example:
s = "hello world, great to see you"
n = 2

Then I would probably get "hello world".

Contributor

tox-p commented Apr 3, 2023

Sure:

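import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import java.util.List;

final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
final Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

final int n = 2;
final String s = "hello world, great to see you!";

// Encode, keep at most the first n tokens, then decode back to text.
final List<Integer> encoded = enc.encode(s);
// Math.min avoids an IndexOutOfBoundsException when s has fewer than n tokens.
final List<Integer> truncated = encoded.subList(0, Math.min(n, encoded.size()));
final String decoded = enc.decode(truncated);
System.out.println(decoded);
// prints: hello world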

Note that, depending on your input text, the decoded text can contain non-printable characters. This can happen when a multi-byte Unicode character (e.g. an emoji) that maps to multiple tokens is cut off partway through by the truncation. For example, for s = "I love 🍕" and n = 3, the tokens [40, 3021, 11410, 235, 243] are truncated to [40, 3021, 11410], where 40 corresponds to "I", 3021 corresponds to "love", and 11410 corresponds to a space plus only the leading part of the 4-byte UTF-8 encoding of 🍕.

Edit: Here is a visual explanation using a different encoding. It would hit the same edge case if truncated after 3 tokens:
(image)
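
The edge case can be reproduced without any tokenizer, since it comes down to decoding an incomplete UTF-8 byte sequence. The following is only a minimal sketch using the Java standard library: the class and method names (TruncatedUtf8Demo, decodePrefix, stripTrailingReplacement) are made up for illustration and are not part of jtokkit, and cutting the raw bytes after 9 bytes stands in for cutting the token list in the middle of the emoji. It shows that the incomplete character decodes to the replacement character U+FFFD, and one possible way to drop it afterwards.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of jtokkit.
final class TruncatedUtf8Demo {

    // Decodes only the first byteCount bytes of the UTF-8 encoding of text.
    // This mimics a token truncation that ends in the middle of a character.
    static String decodePrefix(String text, int byteCount) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int end = Math.max(0, Math.min(byteCount, bytes.length));
        // Malformed input is replaced with U+FFFD by this String constructor.
        return new String(Arrays.copyOf(bytes, end), StandardCharsets.UTF_8);
    }

    // Removes replacement characters (U+FFFD) from the end of s.
    // Assumption: the original text does not itself end in U+FFFD.
    static String stripTrailingReplacement(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\uFFFD') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "I love " followed by the pizza emoji U+1F355 (a surrogate pair in Java).
        String s = "I love \uD83C\uDF55";
        // 7 bytes of ASCII plus the first 2 of the emoji's 4 UTF-8 bytes.
        String cut = decodePrefix(s, 9);
        System.out.println(cut.indexOf('\uFFFD') >= 0);
        System.out.println("[" + stripTrailingReplacement(cut) + "]");
    }
}
```

Stripping trailing U+FFFD is a blunt fix: it would also remove a replacement character that was genuinely present at the end of the input, so treat it as a starting point rather than a general solution.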

Author

jiangying000 commented Apr 3, 2023

This is excellent. Thank you!
