Return position to EncodingResult #80
Comments
I also need this feature, as I need to return the sequence of each token for some business scenarios, such as token splitting.
Hmm, I forgot that I wanted to get back to this issue after the performance optimizations are merged, sorry about that! I'm not opposed to adding this functionality. If you, @dimafa (or anyone who wants to pick this issue up), would kindly adapt the PR to the new code structure, I would gladly merge it.
After obtaining the encoded result, I add the tokens to a buffer one at a time, decode the buffer, and check whether the decoded text matches the input at the current position. This gives me the token sequence I want, along with the position of each token group in the string. My implementation, for reference:

List<Integer> encoded = encoding.encode(input).boxed();
List<Token> result = new ArrayList<>();
// Index into input where the next unmatched text starts
int bufferPoint = 0;
IntArrayList tokenCollect = new IntArrayList();
for (int i = 0; i < encoded.size(); i++) {
    // Add the next token and decode everything collected so far
    tokenCollect.add(encoded.get(i));
    String decodeResult = encoding.decode(tokenCollect);
    // If the decoded text does not match the input at bufferPoint, the collected tokens
    // do not yet decode to a complete piece of the input (for example, a multi-byte
    // character split across tokens), so more tokens are needed
    if (!input.startsWith(decodeResult, bufferPoint)) {
        continue;
    }
    // Match successful: the text occupies [bufferPoint, bufferPoint + decodeResult.length())
    result.add(new Token(decodeResult, tokenCollect.boxed()));
    // Move the pointer past the matched text and start a new token group
    bufferPoint += decodeResult.length();
    tokenCollect.clear();
}
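The incremental-decoding workaround above can be tried without the library. The following is a minimal sketch, not the library's API: `TokenSpans`, `Span`, `encode`, and `decode` are hypothetical names, and the toy encoding (one token per UTF-8 byte) is an assumption chosen only because it splits multi-byte characters across tokens, which is the case the `continue` branch handles. It also assumes the input itself contains no U+FFFD replacement character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TokenSpans {
    /** A token group and the half-open range [start, end) of the input it decodes to. */
    static final class Span {
        final int start;
        final int end;
        final List<Integer> tokens;

        Span(int start, int end, List<Integer> tokens) {
            this.start = start;
            this.end = end;
            this.tokens = tokens;
        }
    }

    /** Toy stand-in for a real encoding (assumption): one token per UTF-8 byte. */
    static List<Integer> encode(String text) {
        List<Integer> tokens = new ArrayList<>();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(b & 0xFF);
        }
        return tokens;
    }

    /** Incomplete byte sequences decode to U+FFFD, so they do not match the input. */
    static String decode(List<Integer> tokens) {
        byte[] bytes = new byte[tokens.size()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) tokens.get(i).intValue();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Groups tokens until they decode to text found at the current input position. */
    static List<Span> spans(String input) {
        List<Span> result = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        int pos = 0;
        for (int token : encode(input)) {
            pending.add(token);
            String decoded = decode(pending);
            if (!input.startsWith(decoded, pos)) {
                continue; // pending tokens end mid-character; keep collecting
            }
            result.add(new Span(pos, pos + decoded.length(), new ArrayList<>(pending)));
            pos += decoded.length();
            pending.clear();
        }
        return result;
    }
}
```

Calling `TokenSpans.spans("a\u00e9")` yields two spans: one token for `a`, and a two-token group for the accented character. Note that each step decodes the whole pending group again, so this is a workaround rather than an efficient replacement for positions reported by the encoder itself.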
I've created a PR for this issue, please check it out: #97
To help with string chunking, it would be very helpful to include the string position of the last token in the EncodingResult when encoding is requested with a given maxTokens.
This way, one could efficiently use the library to chunk a string based on a given number of tokens in each chunk.
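To illustrate the request, here is a rough sketch of the chunking loop that such a position would enable. Everything in it is hypothetical: `TokenChunker`, `Result`, `PositionalEncoder`, and `endPosition` are illustrative names rather than the library's API, and `toyEncode` is a stand-in tokenizer (one token per whitespace-delimited word) used only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenChunker {
    /** Hypothetical result shape: token count plus the index just past the last encoded character. */
    static final class Result {
        final int tokenCount;
        final int endPosition;

        Result(int tokenCount, int endPosition) {
            this.tokenCount = tokenCount;
            this.endPosition = endPosition;
        }
    }

    /** Hypothetical encoder interface that reports where encoding stopped. */
    interface PositionalEncoder {
        Result encode(String text, int maxTokens);
    }

    /** Toy encoder (assumption): each token is optional leading whitespace plus one word. */
    static Result toyEncode(String text, int maxTokens) {
        int pos = 0;
        int count = 0;
        while (pos < text.length() && count < maxTokens) {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            count++;
        }
        return new Result(count, pos);
    }

    /** Splits input into consecutive chunks of at most maxTokens tokens each. */
    static List<String> chunk(String input, int maxTokens, PositionalEncoder encoder) {
        if (maxTokens <= 0) {
            throw new IllegalArgumentException("maxTokens must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < input.length()) {
            Result r = encoder.encode(input.substring(offset), maxTokens);
            if (r.endPosition <= 0) {
                throw new IllegalStateException("encoder made no progress");
            }
            // The reported position tells us exactly where this chunk ends
            chunks.add(input.substring(offset, offset + r.endPosition));
            offset += r.endPosition;
        }
        return chunks;
    }
}
```

With the end position available, each chunk needs a single encode call and no decoding; `TokenChunker.chunk("a b c d e", 2, TokenChunker::toyEncode)` splits the input into `"a b"`, `" c d"`, and `" e"`.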