Return position to EncodingResult #80

dimafa · 2024-01-17T14:29:49Z

To help with string chunking, it would be very helpful to include the string position of the last token in the EncodingResult when encoding is requested with a given maxTokens.
That way, one could efficiently use the library to chunk a string into pieces that each contain a given number of tokens.
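
As a rough illustration of the request, the sketch below shows how chunking could look if the encoder reported the string position where it stopped. `PositionalEncoder`, `encodeUpTo`, `PartialEncoding`, and `WordEncoder` are hypothetical names invented for this example and are not part of the library; the whitespace-based encoder is only a stand-in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkingSketch {

    /** Hypothetical result: number of tokens produced and the index just past the last encoded character. */
    static final class PartialEncoding {
        final int tokenCount;
        final int endPosition;

        PartialEncoding(int tokenCount, int endPosition) {
            this.tokenCount = tokenCount;
            this.endPosition = endPosition;
        }
    }

    /** Hypothetical interface standing in for "encode with maxTokens and report the last position". */
    interface PositionalEncoder {
        PartialEncoding encodeUpTo(String text, int start, int maxTokens);
    }

    /** Toy stand-in tokenizer: one token per word, with leading whitespace attached to the word. */
    static final class WordEncoder implements PositionalEncoder {
        @Override
        public PartialEncoding encodeUpTo(String text, int start, int maxTokens) {
            int pos = start;
            int count = 0;
            while (pos < text.length() && count < maxTokens) {
                // Leading whitespace belongs to the token that follows it.
                while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                    pos++;
                }
                while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                    pos++;
                }
                count++;
            }
            return new PartialEncoding(count, pos);
        }
    }

    /** Splits text into consecutive chunks of at most maxTokens tokens each. */
    static List<String> chunk(String text, int maxTokens, PositionalEncoder encoder) {
        if (maxTokens <= 0) {
            throw new IllegalArgumentException("maxTokens must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            PartialEncoding part = encoder.encodeUpTo(text, start, maxTokens);
            if (part.endPosition <= start) {
                break; // Guard against an encoder that makes no progress.
            }
            chunks.add(text.substring(start, part.endPosition));
            start = part.endPosition;
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("one two three four five", 2, new WordEncoder()));
    }
}
```

Because each call resumes at the reported end position, the input is scanned only once, and concatenating the chunks reproduces the original string.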

wertycn · 2024-03-20T11:13:11Z

I also need this feature, as I need to return the sequence of each token for some business scenarios, such as token splitting.

tox-p · 2024-03-20T17:59:40Z

Hmm, I forgot that I wanted to get back to this issue after the performance optimizations are merged, sorry about that!

I'm not opposed to adding this functionality. If you, @dimafa (or anyone else who wants to pick this issue up), would kindly adapt the PR to the new code structure, I would gladly merge it.

wertycn · 2024-03-21T13:18:06Z

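As a workaround, after obtaining the encoded result I decode the tokens one at a time and check whether the decoded text matches the input at the current position. If it does not match, the tokens collected so far do not yet decode to a complete piece of the input, so I keep adding tokens until they do. This gives me the token sequence I want together with the position of each token group in the string. My implementation, for reference: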

import com.knuddels.jtokkit.api.IntArrayList;
import java.util.ArrayList;
import java.util.List;

List<Integer> encoded = encoding.encode(input).boxed();
List<Token> result = new ArrayList<>();

// Index in input where the next undecoded token group starts
int bufferPoint = 0;
IntArrayList tokenCollect = new IntArrayList();

for (int i = 0; i < encoded.size(); i++) {
    // Add the next token and decode everything collected so far
    tokenCollect.add(encoded.get(i));
    String decodeResult = encoding.decode(tokenCollect);
    // If the decoded text does not match the input at bufferPoint, the collected
    // tokens do not yet form a complete piece of text; more tokens are needed.
    // startsWith also avoids the out-of-bounds risk of substring near the end of input.
    if (!input.startsWith(decodeResult, bufferPoint)) {
        continue;
    }
    // Match: this token group covers [bufferPoint, bufferPoint + decodeResult.length())
    result.add(new Token(decodeResult, tokenCollect.boxed()));
    // Advance the pointer past the matched text and start a new group
    bufferPoint += decodeResult.length();
    tokenCollect.clear();
}
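
To show why the loop sometimes has to wait for more tokens before it can match, here is a self-contained sketch of the same accumulate-and-compare idea. It does not use the library: tokens are modeled as raw UTF-8 byte slices, and `Span`, `decode`, and `locate` are illustrative names chosen for this example. It assumes that a character split across two tokens decodes to a replacement character until both tokens have been collected.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenSpanSketch {

    /** Character range [start, end) in the input covered by a group of tokenCount tokens. */
    static final class Span {
        final int start;
        final int end;
        final int tokenCount;

        Span(int start, int end, int tokenCount) {
            this.start = start;
            this.end = end;
            this.tokenCount = tokenCount;
        }
    }

    /** Toy decode: concatenate the token bytes and decode as UTF-8 (malformed bytes become U+FFFD). */
    static String decode(List<byte[]> tokens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] token : tokens) {
            out.write(token, 0, token.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Collects tokens until their decoded text matches the input at the current position. */
    static List<Span> locate(String input, List<byte[]> tokens) {
        List<Span> spans = new ArrayList<>();
        List<byte[]> pending = new ArrayList<>();
        int position = 0;
        for (byte[] token : tokens) {
            pending.add(token);
            String decoded = decode(pending);
            if (!input.startsWith(decoded, position)) {
                continue; // Incomplete character so far; wait for the next token.
            }
            spans.add(new Span(position, position + decoded.length(), pending.size()));
            position += decoded.length();
            pending.clear();
        }
        return spans;
    }

    public static void main(String[] args) {
        byte[] eAcute = "\u00e9".getBytes(StandardCharsets.UTF_8); // two bytes in UTF-8
        List<byte[]> tokens = Arrays.asList(
                "a".getBytes(StandardCharsets.UTF_8),
                new byte[] {eAcute[0]},
                new byte[] {eAcute[1]},
                "b".getBytes(StandardCharsets.UTF_8));
        for (Span span : locate("a\u00e9b", tokens)) {
            System.out.println(span.start + ".." + span.end + " from " + span.tokenCount + " token(s)");
        }
    }
}
```

One limitation of this approach is that an input which itself contains U+FFFD could produce a premature match, so a position reported directly by the encoder would be more reliable.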

imsosleepy · 2024-05-15T08:18:50Z

I've created a PR for this issue; please check it out: #97

imsosleepy mentioned this issue May 15, 2024

Return position to EncodingResult (#80) #97

Open
