LlamaTokenizer with use_fast=True and use_fast=False causing memory leak when used with multiprocessing / dataset.map(num_proc) #1495
Comments
Update: the following function does not seem to exhibit this behavior.

```python
import gc

from transformers import LlamaTokenizerFast


def tokenize(example, rank: int = 0):
    # global tokenizer_tinyllama
    gc.collect()
    # chat = [
    #     {"role": "user", "content": book},
    # ]
    # tokens = tokenizer_tinyllama.apply_chat_template(chat, tokenize=True)
    # if tokenizer_tinyllama is None:
    # Re-creating the tokenizer inside the worker on every call avoids the leak.
    tokenizer_tinyllama = LlamaTokenizerFast.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True
    )
    example["input_ids"] = tokenizer_tinyllama(example["content"], max_length=None)["input_ids"]
    example["n_tokens"] = len(example["input_ids"])
    example["content"] = None
    return example
```
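For reference, a minimal sketch of how that function can be wired into `dataset.map`; the dataset `ds`, the data file, and the `content` column are assumptions, not from the original report:

```python
from datasets import load_dataset

# Assumed input: a JSON-lines file with a "content" column.
ds = load_dataset("json", data_files="books.jsonl", split="train")
ds = ds.map(
    tokenize,
    with_rank=True,  # passes the worker index as the `rank` argument
    num_proc=16,     # one worker process per shard
)
```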
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
No, not stale!
I'm also encountering a similar issue with 0.19.1.
Opened a new issue with a more general reproduction; I believe this is a more common problem.
Same issue here.
Thanks all for these. Is the issue more with
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Not stale
When running a dataset.map with num_proc=16, I am unable to tokenize a ~45 GB dataset on a machine with >200 GB of RAM. The dataset consists of ~30,000 rows, each holding a string of 120-180k characters. Memory increases linearly until it hits the 200 GB maximum after only ~2,000 such iterations / 2,000 rows.

Other things I have tried (see the sketch after this list):
- Creating 16 tokenizers in global scope and accessing them via the `rank` parameter.
- Calling `gc.collect()` inside the mapped function.
- Changing `use_fast`, which makes the script more efficient: it now takes ~10k rows instead of 2k to go OOM.

Reproduction script
Env
OS: Ubuntu 22.04
PIP freeze