Return position to EncodingResult #80

Open
dimafa opened this issue Jan 17, 2024 · 4 comments


dimafa commented Jan 17, 2024

For string chunking, it would be very helpful if the EncodingResult included the string position of the last token when encoding is requested with a given maxTokens.
That way, the library could be used to efficiently split a string into chunks that each contain a given number of tokens.
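
To make the request concrete, here is a rough sketch of how a caller might chunk a string if a bounded encode also reported where its last token ends. Everything in it is hypothetical: `BoundedResult`, `encodeBounded`, and the toy whitespace tokenizer are stand-ins invented for illustration, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in the library.
final class ChunkingSketch {

    /** Hypothetical result: number of tokens produced and where the last one ends in the input. */
    static final class BoundedResult {
        final int tokenCount;
        final int endPosition; // exclusive char index into the input

        BoundedResult(int tokenCount, int endPosition) {
            this.tokenCount = tokenCount;
            this.endPosition = endPosition;
        }
    }

    /**
     * Stand-in for "encode with maxTokens". Toy rule: one token is any leading
     * spaces plus the following run of non-space characters.
     */
    static BoundedResult encodeBounded(String text, int start, int maxTokens) {
        int pos = start;
        int count = 0;
        while (pos < text.length() && count < maxTokens) {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;
            count++;
        }
        return new BoundedResult(count, pos);
    }

    /** Splits text into consecutive chunks of at most maxTokens tokens each. */
    static List<String> chunk(String text, int maxTokens) {
        if (maxTokens < 1) {
            throw new IllegalArgumentException("maxTokens must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            BoundedResult r = encodeBounded(text, start, maxTokens);
            // The reported end position is what lets the caller cut the string
            // without decoding tokens back into text.
            chunks.add(text.substring(start, r.endPosition));
            start = r.endPosition;
        }
        return chunks;
    }
}
```

With a real tokenizer only `encodeBounded` would change; the chunking loop relies on nothing but the returned end position.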


wertycn commented Mar 20, 2024

I also need this feature: for some business scenarios, such as token splitting, I need to return the character sequence of each token.

Contributor

tox-p commented Mar 20, 2024

Hmm, I forgot that I wanted to get back to this issue after the performance optimizations were merged, sorry about that!

I'm not opposed to adding this functionality. If you, @dimafa (or anyone who wants to pick this issue up), would kindly adapt the PR to the new code structure, I would gladly merge it.


wertycn commented Mar 21, 2024

After obtaining the encoded result, I add tokens one at a time, decode the tokens collected so far, and check whether the decoded text matches the input at the current position. This gives me the token groups I want, together with their positions in the string. My implementation is below for reference.

import com.knuddels.jtokkit.api.IntArrayList;
import java.util.ArrayList;
import java.util.List;

// Token is not shown in this comment; it is assumed to be a simple holder for the decoded text and its token ids.
List<Integer> encoded = encoding.encode(input).boxed();
List<Token> result = new ArrayList<>();

// Index into input where the next unmatched text starts
int bufferPoint = 0;
IntArrayList tokenCollect = new IntArrayList();

for (int i = 0; i < encoded.size(); i++) {
    // Decode all tokens collected since the last match
    tokenCollect.add(encoded.get(i));
    String decodeResult = encoding.decode(tokenCollect);
    // If the decoded text does not match the input at bufferPoint, the collected tokens end in the middle of a character, so more tokens are needed.
    // startsWith also avoids the StringIndexOutOfBoundsException that substring can throw near the end of the input.
    if (!input.startsWith(decodeResult, bufferPoint)) {
        continue;
    }
    // Match: the collected tokens cover string positions [bufferPoint, bufferPoint + decodeResult.length())
    result.add(new Token(decodeResult, tokenCollect.boxed()));
    bufferPoint += decodeResult.length();
    tokenCollect.clear();
}
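
For anyone who wants to try the decode-and-match idea without the library, here is a minimal self-contained sketch. It assumes a toy model in which each token is just a slice of the input's UTF-8 bytes; `TokenSpanSketch` and `spans` are hypothetical names, and a real encoding would go through its own decode call instead of `new String(...)`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: tokens are modelled as raw UTF-8 byte slices.
final class TokenSpanSketch {

    /**
     * Returns one {start, end} pair of char offsets (end exclusive) for each
     * group of tokens that decodes to whole characters of the input.
     */
    static List<int[]> spans(String input, List<byte[]> tokens) {
        List<int[]> spans = new ArrayList<>();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int pos = 0; // index into input where the next unmatched text starts
        for (byte[] token : tokens) {
            pending.write(token, 0, token.length);
            // Incomplete UTF-8 sequences decode to U+FFFD, which will not
            // match the input, so the group keeps growing.
            String decoded = new String(pending.toByteArray(), StandardCharsets.UTF_8);
            if (!input.startsWith(decoded, pos)) {
                continue;
            }
            spans.add(new int[] {pos, pos + decoded.length()});
            pos += decoded.length();
            pending.reset();
        }
        return spans;
    }
}
```

For the input "a\u00e9!" split into the four single-byte tokens 0x61, 0xC3, 0xA9, 0x21, the two middle tokens are merged into one span, giving [0,1), [1,2), [2,3). One caveat of this approach is that it decodes the pending group again for every token, so a position returned directly by the encoder would be cheaper.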

@imsosleepy

I've created a PR for this issue; please check it out: #97
