Conversation

@tmm1 tmm1 commented Oct 5, 2024

turns out this isn't much of a cache. just a memory leak in disguise.

> let {LlamaTokenizer} = await import('./src/transformers.js')
> var tok = await LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.4')
> tok.encode("very long user input ".repeat(10))
[
     1, 1407, 1472, 1404, 1881,  1407,
  1472, 1404, 1881, 1407, 1472,  1404,
  1881, 1407, 1472, 1404, 1881,  1407,
  1472, 1404, 1881, 1407, 1472,  1404,
  1881, 1407, 1472, 1404, 1881,  1407,
  1472, 1404, 1881, 1407, 1472,  1404,
  1881, 1407, 1472, 1404, 1881, 29871
]
> tok.model.cache
Map(1) {
  '▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁very▁long▁user▁input▁' => [
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁very', '▁long', '▁user', '▁input',
    '▁'
  ]
}
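
For illustration, a minimal sketch of the failure mode (assuming the same local transformers.js checkout as above, and that tok.model.cache is the Map shown in the output): every distinct input adds an entry keyed on the entire pre-tokenized string, so the cache grows linearly with unique inputs and nothing is ever evicted.

// Hypothetical repro loop; .size is a standard Map property.
let { LlamaTokenizer } = await import('./src/transformers.js');
let tok = await LlamaTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v0.4');
for (let i = 0; i < 1000; i++) {
  tok.encode(`unique user input number ${i}`);
}
console.log(tok.model.cache.size); // ~1000: one entry per unique input, never freed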

@xenova xenova (Collaborator) commented Nov 16, 2024

Hi there 👋 Indeed, this BPE cache is more useful for tokenizers that use a GPT-like pretokenizer, which splits the input into short, frequently repeated word-like chunks before BPE is applied:

import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2';

const tok = await AutoTokenizer.from_pretrained('Xenova/gpt-4');
tok.encode("very long user input ".repeat(10));
console.log(tok.model.cache);

[screenshot of the logged cache contents]

A similar note is present here:

https://github.com/huggingface/transformers/blob/13493215abceafc1653af88b045120014fb4c1fc/src/transformers/models/qwen2/tokenization_qwen2.py#L186-L190

However, it might be worth disabling it for LlamaTokenizer (and similar tokenizers), or defining a maximum input length, a maximum cache size, or an LRU eviction policy.
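
For example, a size-capped cache could look roughly like this (a sketch only, not the library's actual API; the names and thresholds below are hypothetical):

// Bounded BPE cache sketch, assuming entries are string -> string[] pairs
// as in the Map shown above. MAX_CACHE_SIZE and MAX_KEY_LENGTH are made up.
const MAX_CACHE_SIZE = 10000; // cap on the number of cached entries
const MAX_KEY_LENGTH = 256;   // don't cache very long pre-tokens at all

function cacheSet(cache, key, value) {
  if (key.length > MAX_KEY_LENGTH) return; // long inputs are rarely repeated verbatim
  if (cache.size >= MAX_CACHE_SIZE) {
    // A Map iterates in insertion order, so deleting the first key gives
    // simple FIFO eviction (a cheap stand-in for a true LRU).
    cache.delete(cache.keys().next().value);
  }
  cache.set(key, value);
}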
