
Optimize byte pair merge for really big tokens (40x faster for a 2500 token word) #239

Open · wants to merge 2 commits into base: main

Commits on Feb 11, 2024

  1. Add test for encoding huge byte sequences

    Lőrinc committed Feb 11, 2024 · aeca532
    (See the test sketch after the commit list.)
  2. Add a _byte_pair_merge_large for worst-case scenarios

    We store the ranks in a sorted tree of sorted (or linked) trees.
    Getting the minimum rank is logarithmic, and each additional occurrence of that rank is retrieved in constant time.
    To know the previous and next indexes (and the corresponding ranks), we keep them in arrays whose keys are the indexes, updating both around each minimum found via the tree.
    We iterate duplicates without removing them from the tree one by one; when two occurrences are adjacent, the one consumed by the merge is skipped manually.
    (A sketch of this data structure follows the commit list.)
    Lőrinc committed Feb 11, 2024 · 5af8058
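
The second commit message describes a rank tree plus prev/next index arrays. Below is a minimal, hypothetical sketch of that idea, not the PR's actual code: the signature, the `Rank` alias, and the `HashMap<Vec<u8>, Rank>` vocabulary are assumptions standing in for tiktoken's internals. A `BTreeMap<Rank, BTreeSet<usize>>` plays the role of the "sorted tree of sorted trees", and `prev`/`next` arrays form a doubly linked list over token start positions.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type Rank = u32; // assumption; the real code may use a different integer type

/// Rank of the pair starting at token `i`, or None if there is no pair
/// to the right of `i` or the combined bytes are not in the vocabulary.
fn pair_rank(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>, next: &[usize], i: usize) -> Option<Rank> {
    let j = next[i]; // start of the right-hand token
    if j >= piece.len() {
        return None; // `i` is the last token
    }
    ranks.get(&piece[i..next[j]]).copied() // bytes of both tokens combined
}

/// Un-register one pair occurrence, dropping the rank's set when it empties.
fn remove_entry(tree: &mut BTreeMap<Rank, BTreeSet<usize>>, rank: Rank, i: usize) {
    if let Some(set) = tree.get_mut(&rank) {
        set.remove(&i);
        if set.is_empty() {
            tree.remove(&rank);
        }
    }
}

/// Merge loop sketch: O(log n) to pop the lowest rank, O(1) per duplicate
/// occurrence. Returns the surviving tokens as (start, end) byte ranges.
fn byte_pair_merge_large(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<(usize, usize)> {
    let n = piece.len();
    // Doubly linked list over token starts; `next[i] == n` marks the end.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect(); // prev[0] == usize::MAX
    let mut dead = vec![false; n]; // tokens consumed by a merge

    // The "sorted tree of sorted trees": rank -> ascending set of pair start indexes.
    let mut tree: BTreeMap<Rank, BTreeSet<usize>> = BTreeMap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(&r) = ranks.get(&piece[i..i + 2]) {
            tree.entry(r).or_default().insert(i);
        }
    }

    // Pop every occurrence of the minimum rank at once instead of one by one.
    while let Some((_, starts)) = tree.pop_first() {
        for i in starts {
            if dead[i] {
                continue; // adjacent occurrence consumed by the previous merge: skip manually
            }
            let j = next[i];
            // The pair starting at `j` and the pair ending at `i` both change
            // shape, so un-register their old ranks before touching the links.
            if let Some(r) = pair_rank(piece, ranks, &next, j) {
                remove_entry(&mut tree, r, j);
            }
            let p = prev[i];
            if p != usize::MAX {
                if let Some(r) = pair_rank(piece, ranks, &next, p) {
                    remove_entry(&mut tree, r, p);
                }
            }
            // Merge token `j` into token `i` by unlinking `j`.
            dead[j] = true;
            next[i] = next[j];
            if next[j] < n {
                prev[next[j]] = i;
            }
            // Re-register the two pairs that now surround the merged token.
            if p != usize::MAX {
                if let Some(r) = pair_rank(piece, ranks, &next, p) {
                    tree.entry(r).or_default().insert(p);
                }
            }
            if let Some(r) = pair_rank(piece, ranks, &next, i) {
                tree.entry(r).or_default().insert(i);
            }
        }
    }

    // Walk the surviving linked list into byte ranges.
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        out.push((i, next[i]));
        i = next[i];
    }
    out
}
```

Each merge performs a constant number of tree updates at O(log n) apiece, so a length-n piece costs O(n log n) overall, versus the quadratic scan-for-the-minimum loop that this PR targets for degenerate single-"word" inputs.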
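And a correspondingly hedged version of the huge-byte-sequence test from the first commit, written against the sketch above with a toy vocabulary (the real test presumably drives tiktoken's actual encoder and rank tables instead):

```rust
#[test]
fn encode_huge_byte_sequence() {
    // Toy vocabulary, an assumption for illustration: runs of 'a' merge in two steps.
    let mut ranks: HashMap<Vec<u8>, Rank> = HashMap::new();
    ranks.insert(b"a".to_vec(), 0);
    ranks.insert(b"aa".to_vec(), 1);
    ranks.insert(b"aaaa".to_vec(), 2);

    // One long "word" with no split points: the worst case for a quadratic
    // merge loop (the PR title's 2500-token word is this shape).
    let piece = vec![b'a'; 10_000];
    let tokens = byte_pair_merge_large(&piece, &ranks);

    // 10_000 bytes collapse to 2_500 four-byte tokens: all "aa" pairs merge
    // first (rank 1), then all "aaaa" pairs (rank 2), and no larger run is
    // in the vocabulary, so merging stops there.
    assert_eq!(tokens.len(), 2_500);
    assert!(tokens.iter().all(|&(s, e)| e - s == 4));
}
```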