python version hangs on encode_batch when run in subprocess #187
Hey @kretes, thanks for bringing this up and thanks for the reproducible example. I ran the gist and sure enough I saw the same results. What I'm guessing is happening (and keep in mind I am no expert in multiprocessing, so I could be way off) is that when a tokenizer is used from the main process, the Rust code behind the scenes creates some resources that it needs for parallel processing (like a thread pool and I don't know what else - this stuff comes from https://github.com/rayon-rs/rayon and I'm not super familiar with it). Then when the process is forked, the child inherits those resources in a state it can't actually use, so calls that rely on them never return. So as long as you don't use the tokenizer in the main process before forking, it works fine:

```python
from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer

def encode():
    tok = ByteLevelBPETokenizer()
    print(tok.encode_batch(['ala']))
    print(tok.encode_batch(['ala', 'kot']))

p = Process(target=encode)
p.start()
p.join()
```
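For contrast, here is a minimal sketch (not from the original comment) of the pattern that does hang with the library versions discussed in this issue, assuming a platform where `multiprocessing` forks the process (e.g. Linux): the tokenizer is exercised in the parent before the child is started.

```python
from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer

def encode():
    tok = ByteLevelBPETokenizer()
    # Hangs here in the child: the forked process inherited the parent's
    # already-initialized parallel-processing state.
    print(tok.encode_batch(['ala', 'kot']))

# Using a tokenizer in the parent before forking is what triggers the deadlock.
ByteLevelBPETokenizer().encode_batch(['ala', 'kot'])

p = Process(target=encode)
p.start()
p.join()
```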
encode_batch hangs for me in Python even when not run as a subprocess.
@ivanjacobsec can you provide an example?
First of all, thank you for the great work - it is an inspiring project. I tried 2 approaches:
With both of them encode_batch hangs. Other issues I encountered:
produces this error:
produces this error:
I would really like to see a full working example in Python with a custom pre-tokenizer, normalizer and decoder.
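Not part of the original thread: a minimal sketch of a tokenizer assembled from explicitly chosen components, using the `normalizers`, `pre_tokenizers`, `decoders` and `trainers` modules of a recent tokenizers release (these module and class names come from the current library, not from this issue).

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers

# Build a tokenizer from custom components.
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# Train on a tiny in-memory corpus so the example is self-contained.
trainer = trainers.BpeTrainer(
    vocab_size=300,
    special_tokens=["<unk>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(
    ["ala has a cat", "the cat has ala", "this is a tiny corpus"],
    trainer=trainer,
)

encoding = tokenizer.encode("Ala has a cat")
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```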
I think I am having the same issue, so I am not going to make a new one. But this really puzzled me for a long time. Basically, when I test the tokenizer and then try to use it in a torch data loader, I get a freeze on the call to encode in the data loader. Here is a simple example:

```python
import torch
import tokenizers

# Just making a test tokenizer call, even in a new class,
# will ruin the tokenizer when used later in the torch data_loader.
# TOKENIZER = transformers.RobertaTokenizerFast(
TEST_TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json",
    merges_file=f"roberta-base/merges.txt",
    lowercase=True,
    add_prefix_space=True
)
test = TEST_TOKENIZER.encode("this is a test.")

TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json",
    merges_file=f"roberta-base/merges.txt",
    lowercase=True,
    add_prefix_space=True
)

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = TOKENIZER

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        # This call freezes when the test above has been run.
        data = self.tokenizer.encode(self.texts[item])
        return data

texts = ['This is another test from data_loader',
         'This is the second test from data_loader']

dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print(d)
```
I spent quite some time digging into this to try and understand what happens exactly. @epwalsh got it right: when the process gets forked, we end up with two (or more) different processes, with different memory spaces but the exact same content in memory (at the time of the fork). Now, this shouldn't be a problem in most cases, but sometimes the locks we use to support multi-threading contain some state they need to operate properly. When this state does not make sense in the new context (in the new process), the lock might become impossible to unlock (cf. this discussion on Stack Overflow for more info, along with this example). There are two different places that seem to cause such behavior:
Now, this is actually quite easy to fix, with only one rule to follow: do not use a tokenizer in the main process if you want to use it with multiprocessing. I think in most cases this is actually a best practice that should be followed regardless of this problem. By following this rule, the above snippet becomes:

```python
import torch
import tokenizers

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = tokenizers.ByteLevelBPETokenizer(
            vocab_file=f"roberta-base/vocab.json",
            merges_file=f"roberta-base/merges.txt",
            lowercase=True,
            add_prefix_space=True
        )

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        data = self.tokenizer.encode(self.texts[item])
        return data.ids

texts = [
    'This is another test from data_loader',
    'This is the second test from data_loader'
]

dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print(d)
```

Any feature of the tokenizers library can be used by following this setup.
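Not from the thread: a pure-Python sketch of the fork-plus-lock mechanism described above, assuming a platform where `multiprocessing` uses fork (e.g. Linux). A helper thread holds a lock at the moment of the fork; the child inherits a copy of the lock in its locked state, but not the thread that could release it, so the child blocks forever.

```python
import threading
import time
from multiprocessing import Process

lock = threading.Lock()

def hold_lock_for_a_while():
    with lock:
        time.sleep(2)  # keep the lock held while the fork below happens

def child():
    # The copied lock is "locked", but the thread holding it does not exist here.
    lock.acquire()
    print("child acquired the lock")  # never reached on fork-based platforms

if __name__ == "__main__":
    threading.Thread(target=hold_lock_for_a_while).start()
    time.sleep(0.5)                # make sure the lock is held at fork time
    p = Process(target=child)      # uses fork by default on Linux
    p.start()
    p.join(timeout=5)
    print("child still stuck:", p.is_alive())  # True: the child is deadlocked
    p.terminate()
```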
Thanks for the reply, but your code doesn't change the bug: you still get a freeze if you use any tokenizer outside the loader. This is not an issue in normal code if you know it can happen. I experienced this while working in a notebook and testing some code before running the loader, and a tokenizer threading issue was not the first thing that came to mind. I do not know Rust, so I cannot speak to getting this fixed. It seems like a major bug, but if you already know about it, then the workaround is easy.
I have the same bug, and only have it since upgrading from 0.0.5 to 0.0.7. Basically, whenever I use encode within a dataset and use multiple workers in the data loader, it deadlocks. It also only happens if there is more than one word in the sequence to be tokenized, so to me it seems that the tokenizer's internal multithreading is to blame. I am especially curious what changes you made from 0.0.5 to 0.0.7 that started to cause this. Is there a way to disable any internal multithreading/multiprocessing you are doing? This bug is causing me quite some issues.
@n1t0 Any idea if and when this will be fixed? Otherwise I need to go back to older versions or to other alternatives.
@psinger If you have an example showing that it works with an older version, I'd love to see it - it could help in the process of debugging. I'd love to be able to fix this; the thing is, I have no idea how to fix it right now. I found a huge number of articles online (a good example here) about the problems that arise when Python's fork-based multiprocessing is mixed with threads and locks.
So, I agree it seems like a huge bug, and I'd love to be able to fix it, but I don't see any reasonable way to do this right now. Any help would be greatly appreciated! 😃
@n1t0 I've been thinking a lot about this as well, because we'd like to be able to load and tokenize data in multiple processes in AllenNLP, but at the same time we need the tokenizers in the main process. One thought I had would be to just disable the use of rayon's parallel adapters based on an environment variable or something. To keep this from getting unwieldy, we could have some wrapper trait with our own implementation of the parallel iterator adapters. I might have time to look into this next week.
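Not part of epwalsh's comment: a rough Python sketch of the control flow being proposed. The real change would live in the Rust code and wrap rayon; the helper name here is made up, and the environment-variable name anticipates the one the library eventually adopted.

```python
import os
from multiprocessing.pool import ThreadPool  # stand-in for rayon's thread pool

def parallelism_disabled():
    # Variable name anticipates the one announced later in this thread.
    return os.environ.get("TOKENIZERS_PARALLELISM", "true").lower() in ("false", "0", "off")

def maybe_parallel_map(func, items):
    """Map `func` over `items`, serially whenever parallelism is disabled."""
    if parallelism_disabled():
        return [func(item) for item in items]  # serial fallback, safe across fork
    with ThreadPool() as pool:                 # parallel path, as before
        return pool.map(func, items)

# Hypothetical usage: the library would route its batch work through such a wrapper.
print(maybe_parallel_map(str.upper, ["ala", "kot"]))
```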
@n1t0 I too am unaware of what multiprocessing/multithreading is being done within the Rust routines. Is there any chance of an option to disable the multithreading within the tokenizer? I only encountered this bug after upgrading from 0.0.5 to 0.0.7; before that I could use any number of workers in the data loaders without any deadlocks.
Sounds good @epwalsh, that's a great idea! I suspect there might be some other roadblocks, but it should cover the largest source of lock usage. Plus, this is something that we will need anyway, for example to allow using custom Python parts (such as pre-tokenizers or normalizers implemented in Python).
Ok, I've got a solution. PR coming soon!
This should now be fixed with the latest changes. @psinger @BlueProphet, we added an environment variable that allows disabling the parallelism. And thank you very much @epwalsh for all your help on this!
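Not from the thread: a usage sketch of the fix, assuming the `TOKENIZERS_PARALLELISM` environment variable that the library documents for disabling its internal parallelism; the tokenizer arguments mirror the snippets above. With the variable set before the DataLoader forks its workers, the tokenizer can also be used in the main process without deadlocking them.

```python
import os

# Tell the Rust side not to use its thread pool; set this before any fork happens.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
import tokenizers

# Same files as in the snippets above; adjust the paths for your setup.
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file="roberta-base/vocab.json",
    merges_file="roberta-base/merges.txt",
    lowercase=True,
    add_prefix_space=True,
)
tokenizer.encode("a quick test in the main process")  # no longer poisons the workers

class Dataset:
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        return torch.tensor(tokenizer.encode(self.texts[item]).ids)

texts = ['This is another test from data_loader',
         'This is the second test from data_loader']
data_loader = torch.utils.data.DataLoader(Dataset(texts), batch_size=1, num_workers=2)
for batch in data_loader:
    print(batch)
```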
Hello,
When I run encode_batch with at least two texts in a list inside a Python subprocess, the call hangs.
This happens in real life when used inside PyTorch dataloaders with multiple workers.
A self-contained reproducible script is here: https://gist.github.com/kretes/1a51bb8b936fc4e6277f71931b886bed