
python version hangs on encode_batch when run in subprocess #187

Closed
kretes opened this issue Mar 4, 2020 · 15 comments

@kretes

kretes commented Mar 4, 2020

Hello,

When I run encode_batch with at least two texts in a list inside a Python subprocess, the call hangs.
This happens in real life when the tokenizer is used inside PyTorch DataLoaders with multiple workers.
A self-contained reproducible script is here: https://gist.github.com/kretes/1a51bb8b936fc4e6277f71931b886bed
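
In short, the failing pattern in the gist is roughly the following sketch (the tokenizer setup is simplified here; the full script is in the gist):

from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer()
# The tokenizer is used once in the parent process...
print(tok.encode_batch(['ala']))

def encode():
    # ...and then the same tokenizer never returns in the forked child
    # when given a batch of two or more texts.
    print(tok.encode_batch(['ala', 'kot']))

p = Process(target=encode)
p.start()
p.join()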

@epwalsh
Collaborator

epwalsh commented Mar 6, 2020

Hey @kretes, thanks for bringing this up and thanks for the reproducible example. I ran the gist and sure enough I saw the same results.

What I'm guessing is happening (keep in mind I'm no expert in multiprocessing, so I could be way off) is that when a tokenizer is used from the main process, the Rust code behind the scenes creates some resources it needs for parallel processing (like a thread pool; this comes from https://github.com/rayon-rs/rayon, which I'm not super familiar with). Then, when the process is forked and encode_batch is called, the child process needs to access those resources, but they are locked by the parent process.

So as long as you don't call encode or encode_batch until after forking, it will work:

from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer

def encode():
    # The tokenizer is created and used only in the child process, after the fork.
    tok = ByteLevelBPETokenizer()
    print(tok.encode_batch(['ala']))
    print(tok.encode_batch(['ala', 'kot']))

p = Process(target=encode)
p.start()
p.join()

@ivanjacobsec

encode_batch hangs for me in Python even when not run in a subprocess.

@epwalsh
Collaborator

epwalsh commented May 2, 2020

@ivanjacobsec can you provide an example?

@ivanjacobsec

ivanjacobsec commented May 3, 2020

First of all, thank you for the great work; it is an inspiring project.

I tried two approaches:

data_path = "../data/"
dest_path = '../data/models/'
model_name = 'my-tokenizer'

from pathlib import Path
from tokenizers import Tokenizer, pre_tokenizers, decoders, trainers
from tokenizers.models import BPE

paths = [str(x) for x in Path(data_path).glob("**/*.txt")]

tokenizer = Tokenizer(BPE.empty())

tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(MyPretokenizer())
tokenizer.decoder = decoders.Decoder.custom(MyPretokenizer())

trainer = trainers.BpeTrainer(special_tokens=["<mask>", "<pad>"])
tokenizer.train(trainer, paths)
tokenizer.model.save(dest_path, model_name)
print([encoding.ids for encoding in tokenizer.encode_batch(["string", "string string"])])
class MyByteLevelBPETokenizer(BaseTokenizer):
    """ MyByteLevelBPETokenizer

    Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model
    """

    def __init__(
        self,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        add_prefix_space: bool = False,
        lowercase: bool = False,
        dropout: Optional[float] = None,
        unicode_normalizer: Optional[str] = None,
        continuing_subword_prefix: Optional[str] = None,
        end_of_word_suffix: Optional[str] = None,
    ):
        if vocab_file is not None and merges_file is not None:
            tokenizer = Tokenizer(
                BPE.from_files(
                    vocab_file,
                    merges_file,
                    dropout=dropout,
                    continuing_subword_prefix=continuing_subword_prefix or "",
                    end_of_word_suffix=end_of_word_suffix or "",
                )
            )
        else:
            tokenizer = Tokenizer(BPE.empty())

        # Check for Unicode normalization first (before everything else)
        normalizers = []

        if unicode_normalizer:
            normalizers += [unicode_normalizer_from_str(unicode_normalizer)]

        if lowercase:
            normalizers += [Lowercase()]

        # Create the normalizer structure
        if len(normalizers) > 0:
            if len(normalizers) > 1:
                tokenizer.normalizer = Sequence(normalizers)
            else:
                tokenizer.normalizer = normalizers[0]

        tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(MyPretokenizer())
        tokenizer.decoder = decoders.Decoder.custom(MyPretokenizer())

        parameters = {
            "model": "ByteLevelBPE",
            "add_prefix_space": add_prefix_space,
            "lowercase": lowercase,
            "dropout": dropout,
            "unicode_normalizer": unicode_normalizer,
            "continuing_subword_prefix": continuing_subword_prefix,
            "end_of_word_suffix": end_of_word_suffix,
        }

        super().__init__(tokenizer, parameters)

    def train(
        self,
        files: Union[str, List[str]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        show_progress: bool = True,
        special_tokens: List[str] = [],
    ):
        """ Train the model using the given files """

        trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            special_tokens=special_tokens,
            min_frequency=min_frequency,
            show_progress=show_progress,
        )
        if isinstance(files, str):
            files = [files]
        self._tokenizer.train(trainer, files)

class MyPretokenizer(object):
    def __init__(self) -> None:
        pass

    def pre_tokenize(self, sequence: str) -> List[Tuple[str, Offsets]]:
        return [(l, (i, i)) for i, l in enumerate(list(sequence.strip()))]

    def decode(self, tokens):
        return "".join(tokens)
data_path = "../data/"
dest_path = '../data/models/'
model_name = 'my-tokenizer'
paths = [str(x) for x in Path(data_path).glob("**/*.txt")]
tokenizer = MyByteLevelBPETokenizer()
tokenizer.train(paths)
tokenizer.save(dest_path, model_name)
tokenizer.encode_batch(['string string', 'string string'])

With both of them, encode_batch hangs.
In both cases the merges.txt is empty. It is still not clear to me why, or where to look for that.
I am also not sure about the correct way to produce the offsets; I haven't had time to read the paper yet. If you could point me to some Python code where it is already done, that would be nice.
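
For what it's worth, my current guess is that offsets are (start, end) character spans with the end exclusive, so the character-level pre_tokenize would return something like the sketch below rather than the (i, i) spans above (unverified):

    def pre_tokenize(self, sequence: str) -> List[Tuple[str, Offsets]]:
        # Guess: (start, end) character indices with the end exclusive, so the
        # single character at position i gets the span (i, i + 1).
        return [(c, (i, i + 1)) for i, c in enumerate(sequence.strip())]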

Other issues I encountered:

tokenizer = Tokenizer(BPE.empty())
tokenizer.normalizer = Strip()

produces this error:

    tokenizer.normalizer = Strip()
TypeError

tokenizer = Tokenizer(BPE())

produces this error:

  tokenizer = Tokenizer(BPE())
TypeError: cannot create 'BPE' instances
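
From the snippets above, it seems the model currently has to be built with BPE.empty() or BPE.from_files(...) rather than BPE() directly, e.g.:

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Works with this version; calling BPE() directly raises the TypeError above.
tokenizer = Tokenizer(BPE.empty())
# Or, starting from an existing vocabulary (file names are only illustrative):
# tokenizer = Tokenizer(BPE.from_files("vocab.json", "merges.txt"))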

I would really like to see a full working example in Python with a custom pre-tokenizer, normalizer, and decoder.

@BlueProphet

BlueProphet commented May 28, 2020

I think I am having the same issue so I am not going to make a new one. But this really puzzled me for a long time.

Basically, when I test the tokenizer and then try to use it in a torch DataLoader, I get a freeze on the call to encode in the DataLoader.

Here is a simple example:

import torch
import tokenizers

#Just making a test tokenizer call even in a new class 
#will ruin the tokenizer when used later in torch data_loader
#TOKENIZER = transformers.RobertaTokenizerFast(
TEST_TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json", 
    merges_file=f"roberta-base/merges.txt", 
    lowercase=True,
    add_prefix_space=True
)

test = TEST_TOKENIZER.encode("this is a test.")

TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json", 
    merges_file=f"roberta-base/merges.txt", 
    lowercase=True,
    add_prefix_space=True
)

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = TOKENIZER
    
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        #This call freezes when test is run above.
        data = self.tokenizer.encode(self.texts[item])
        return data

texts = ['This is another test from data_loader', 
    'This is the second test from data_loader']
dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print (d)

@n1t0
Member

n1t0 commented May 29, 2020

I spent quite some time digging this to try and understand what happens exactly.

@epwalsh got it right: When the process gets forked, we end up with two (or more) different processes, with different memory spaces, but the exact same content in memory (at the time of the fork). Now, this shouldn't be a problem in most cases, but sometimes the locks we use to support multi-threading contain some state to operate properly. When this state does not make sense in the new context (in the new process), this lock might get impossible to unlock. (cf this discussion on stack overflow for more info, along with this example)

There are two different places that seem to cause such behavior:

  • The lazy_static used in pre_tokenizers/byte_level.rs uses a std::sync::Once type of lock that apparently doesn't support forking. This is what happens in the example shown by @BlueProphet. Even if we only use encode here, the pre-tokenizer gets stuck when trying to access the lazy static properties.
  • The encode_batch, train, and any other methods that make use of rayon to parallelize the workload.

Now, this is actually quite easy to work around, with only one rule to follow: do not use a tokenizer in the main process if you want to use one with multiprocessing. I think in most cases this is actually a best practice that should be followed regardless of this problem.

By following this rule, the above snippet will look like:

import torch
import tokenizers

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = tokenizers.ByteLevelBPETokenizer(
            vocab_file=f"roberta-base/vocab.json", 
            merges_file=f"roberta-base/merges.txt", 
            lowercase=True,
            add_prefix_space=True
        )
    
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        data = self.tokenizer.encode(self.texts[item])
        return data.ids

texts = [
    'This is another test from data_loader', 
    'This is the second test from data_loader'
]
dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print (d)

Any feature of the tokenizers can be used by following this setup.

@BlueProphet

Thanks for the reply.

But your code doesn't make the bug go away: you still get a freeze if you use any tokenizer outside the loader.

This is not an issue in normal code if you know it can happen. I ran into it while working in a notebook and testing some code before running the loader, and a tokenizer threading issue was not the first thing that came to mind.

I do not know Rust, so I cannot speak to getting this fixed. It seems like a major bug, but if you already know about it, the workaround is easy.

@psinger

psinger commented May 31, 2020

I have the same bug, and it only started after upgrading from 0.0.5 to 0.0.7. Basically, whenever I use encode within a Dataset and set num_workers>0 in a DataLoader, I end up in a deadlock. This appears to be a critical bug to me.

It also only happens if there is more than one word in the sequence to be tokenized. So it seems to me that tokenizers is doing some internal parallelization, which is also evident from observing CPU usage, and this interferes with any outside multiprocessing, e.g., at the DataLoader level.

I am especially curious what changed between 0.0.5 and 0.0.7 that started causing this. Is there a way to disable any internal multithreading/multiprocessing you are doing? This bug is causing me quite a few issues.

@psinger

psinger commented Jun 11, 2020

@n1t0 Any idea if and when this will be fixed? Otherwise I will need to go back to an older version or to other alternatives.

@n1t0
Member

n1t0 commented Jun 11, 2020

@psinger If you have an example showing that this works with an older version, I'd love to see it. It could help with debugging.

I'd love to be able to fix this; the thing is, I have no idea how to fix it right now. I found a huge number of articles online (a good example here) about the problems related to Python multiprocessing while using fork. Everybody seems to agree on one point: fork and multithreading don't mix well.
This is something that has to do with the way fork works and has nothing to do with any specific language. You can even find it in the Python documentation: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

So, I agree it seems like a huge bug, and I'd love to be able to fix it, but I don't see any reasonable way to do this right now. Any help would be greatly appreciated! 😃
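
As an aside, the same documentation page describes the "spawn" start method, which starts a fresh interpreter instead of inheriting the parent's memory. A rough sketch with multiprocessing directly (note that the DataLoader still uses fork by default on Linux):

import multiprocessing
from tokenizers.implementations import ByteLevelBPETokenizer

def encode():
    # The spawned child starts from scratch, so no locks are inherited.
    tok = ByteLevelBPETokenizer()
    print(tok.encode_batch(['ala', 'kot']))

if __name__ == "__main__":
    # Using a tokenizer in the parent first is fine here...
    parent_tok = ByteLevelBPETokenizer()
    print(parent_tok.encode_batch(['ala']))
    # ...because "spawn" starts a brand new interpreter for the child.
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=encode)
    p.start()
    p.join()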

@epwalsh
Collaborator

epwalsh commented Jun 12, 2020

@n1t0 I've been thinking a lot about this as well because we'd like to be able to load and tokenize data in multiple processes in AllenNLP, but at the same time need the tokenizers in the main process.

One thought I had would be to just disable using rayon adapters based on an environment variable or something. To keep this from getting unwieldy, we could have some wrapper trait with our own implementation of par_iter that would handle this logic, i.e. checking for the environment variable, then deciding whether to use rayon's par_iter or not.

I might have time to look into this next week.
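
In Python pseudocode, the gating I have in mind is roughly the sketch below (the environment variable name is only a placeholder; the real logic would live in the Rust crate around rayon's par_iter):

import os
from concurrent.futures import ThreadPoolExecutor

def maybe_parallel_map(func, items):
    # Placeholder variable name: when parallelism is disabled, fall back to a
    # plain serial map; otherwise fan the work out to a pool of workers
    # (the equivalent of rayon's par_iter on the Rust side).
    if os.environ.get("DISABLE_PARALLELISM", "").lower() in ("1", "true"):
        return [func(x) for x in items]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))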

@psinger

psinger commented Jun 13, 2020

@n1t0 I am not aware of what multiprocessing/multithreading is being done within the Rust routines. Is there a chance of adding an option to disable any multithreading within the tokenizer?

I only encountered this bug after upgrading from 0.0.5 to 0.0.7. Before that, I could use any number of workers in the data loaders without any deadlocks.

@n1t0
Member

n1t0 commented Jun 15, 2020

Sounds good @epwalsh! That's a great idea! I suspect there might be some other roadblocks, but it should cover the largest source of lock usage. Plus, this is something we will need anyway, for example to allow using custom Python parts (PreTokenizer, Normalizer, ...), where the GIL won't let us do multithreading anyway. Let me know if I can be of any help!

@epwalsh
Collaborator

epwalsh commented Jun 15, 2020

Ok, I've got a solution. PR coming soon!

@n1t0
Member

n1t0 commented Jun 29, 2020

This should now be fixed with the 0.8.0 version that was recently released.

@psinger @BlueProphet We added an environment variable that allows disabling the parallelism (TOKENIZERS_PARALLELISM=false), but even without it, the library should now detect that the process has been forked and prevent any deadlock. So this should avoid a lot of frustration in the future. Thank you for your patience!
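
For example, to opt out of the parallelism explicitly, set the variable before the tokenizer is first used:

import os

# Disable tokenizers' internal parallelism for this process; the variable can
# also be exported in the shell (TOKENIZERS_PARALLELISM=false) before launching.
os.environ["TOKENIZERS_PARALLELISM"] = "false"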

And thank you very much @epwalsh for all your help on this!
