
Optimize byte pair merge for really big tokens (40x faster for a 2500 token word) #239

Open · wants to merge 2 commits into base: main

Commits on Feb 11, 2024

  1. Add test for encoding huge byte sequences

    Lőrinc committed Feb 11, 2024 · aeca532
    (See the test sketch after the commit list.)
  2. Add a _byte_pair_merge_large for worst-case scenarios

    We store the ranks in a sorted tree of sorted (or linked) trees.
    Getting the minimum rank is logarithmic, and each additional occurrence of that rank is retrieved in constant time.
    To know the previous and next indexes (and the corresponding ranks), we keep them in arrays whose keys are the indexes, updating both around each minimum found via the tree.
    We iterate duplicates without removing them from the tree one by one; when two occurrences are adjacent, the one consumed by the merge is skipped manually.
    (A sketch of this data structure follows the commit list.)
    Lőrinc committed Feb 11, 2024 · 5af8058
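
The second commit message describes a rank tree plus prev/next index arrays. Below is a minimal, hypothetical sketch of that idea, not the PR's actual code: the signature, the `Rank` alias, and the `HashMap<Vec<u8>, Rank>` vocabulary are assumptions standing in for tiktoken's internals. A `BTreeMap<Rank, BTreeSet<usize>>` plays the role of the "sorted tree of sorted trees", and `prev`/`next` arrays form a doubly linked list over token start positions.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type Rank = u32; // assumption; the real code may use a different integer type

/// Rank of the pair starting at token `i`, or None if there is no pair
/// to the right of `i` or the combined bytes are not in the vocabulary.
fn pair_rank(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>, next: &[usize], i: usize) -> Option<Rank> {
    let j = next[i]; // start of the right-hand token
    if j >= piece.len() {
        return None; // `i` is the last token
    }
    ranks.get(&piece[i..next[j]]).copied() // bytes of both tokens combined
}

/// Un-register one pair occurrence, dropping the rank's set when it empties.
fn remove_entry(tree: &mut BTreeMap<Rank, BTreeSet<usize>>, rank: Rank, i: usize) {
    if let Some(set) = tree.get_mut(&rank) {
        set.remove(&i);
        if set.is_empty() {
            tree.remove(&rank);
        }
    }
}

/// Merge loop sketch: O(log n) to pop the lowest rank, O(1) per duplicate
/// occurrence. Returns the surviving tokens as (start, end) byte ranges.
fn byte_pair_merge_large(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<(usize, usize)> {
    let n = piece.len();
    // Doubly linked list over token starts; `next[i] == n` marks the end.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect(); // prev[0] == usize::MAX
    let mut dead = vec![false; n]; // tokens consumed by a merge

    // The "sorted tree of sorted trees": rank -> ascending set of pair start indexes.
    let mut tree: BTreeMap<Rank, BTreeSet<usize>> = BTreeMap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(&r) = ranks.get(&piece[i..i + 2]) {
            tree.entry(r).or_default().insert(i);
        }
    }

    // Pop every occurrence of the minimum rank at once instead of one by one.
    while let Some((_, starts)) = tree.pop_first() {
        for i in starts {
            if dead[i] {
                continue; // adjacent occurrence consumed by the previous merge: skip manually
            }
            let j = next[i];
            // The pair starting at `j` and the pair ending at `i` both change
            // shape, so un-register their old ranks before touching the links.
            if let Some(r) = pair_rank(piece, ranks, &next, j) {
                remove_entry(&mut tree, r, j);
            }
            let p = prev[i];
            if p != usize::MAX {
                if let Some(r) = pair_rank(piece, ranks, &next, p) {
                    remove_entry(&mut tree, r, p);
                }
            }
            // Merge token `j` into token `i` by unlinking `j`.
            dead[j] = true;
            next[i] = next[j];
            if next[j] < n {
                prev[next[j]] = i;
            }
            // Re-register the two pairs that now surround the merged token.
            if p != usize::MAX {
                if let Some(r) = pair_rank(piece, ranks, &next, p) {
                    tree.entry(r).or_default().insert(p);
                }
            }
            if let Some(r) = pair_rank(piece, ranks, &next, i) {
                tree.entry(r).or_default().insert(i);
            }
        }
    }

    // Walk the surviving linked list into byte ranges.
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        out.push((i, next[i]));
        i = next[i];
    }
    out
}
```

Each merge performs a constant number of tree updates at O(log n) apiece, so a length-n piece costs O(n log n) overall, versus the quadratic scan-for-the-minimum loop that this PR targets for degenerate single-"word" inputs.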
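And a correspondingly hedged version of the huge-byte-sequence test from the first commit, written against the sketch above with a toy vocabulary (the real test presumably drives tiktoken's actual encoder and rank tables instead):

```rust
#[test]
fn encode_huge_byte_sequence() {
    // Toy vocabulary, an assumption for illustration: runs of 'a' merge in two steps.
    let mut ranks: HashMap<Vec<u8>, Rank> = HashMap::new();
    ranks.insert(b"a".to_vec(), 0);
    ranks.insert(b"aa".to_vec(), 1);
    ranks.insert(b"aaaa".to_vec(), 2);

    // One long "word" with no split points: the worst case for a quadratic
    // merge loop (the PR title's 2500-token word is this shape).
    let piece = vec![b'a'; 10_000];
    let tokens = byte_pair_merge_large(&piece, &ranks);

    // 10_000 bytes collapse to 2_500 four-byte tokens: all "aa" pairs merge
    // first (rank 1), then all "aaaa" pairs (rank 2), and no larger run is
    // in the vocabulary, so merging stops there.
    assert_eq!(tokens.len(), 2_500);
    assert!(tokens.iter().all(|&(s, e)| e - s == 4));
}
```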