
fit_generator using use_multiprocessing=True does not work on Windows 8.1 x64, python 3.5 #10842

Closed
erwinkendo opened this issue Aug 3, 2018 · 25 comments

@erwinkendo commented Aug 3, 2018

Dear Keras community

I have been using Keras successfully for many tasks.

After implementing a custom data generator using the Keras Sequence class, I tried setting use_multiprocessing=True in fit_generator with more than one worker (so data can be fed to my GPU).

Unfortunately, after testing this setup on three different machines, the code seems to work only on Linux (even with a different GPU).

  • Behaviour on my Windows 8.1 machine with Python 3.5 installed via conda: the program freezes and never starts loading data or training.
  • Behaviour on my Arch Linux laptop with the same software setup via conda: the program works, feeds the data from disk, uses multiple workers, and trains on the GPU.

Is this the expected behaviour on a Windows machine?

Kind regards,

@bjoernholzhauer

It is unclear to me whether this is expected, or whether it should only occur with a plain data generator but not with a Sequence. I also get

ValueError: Using a generator with use_multiprocessing=True is not supported on Windows (no marshalling of generators across process boundaries). Instead, use single thread/process or multithreading.

with both (generator or Sequence) when using use_multiprocessing=True in Keras 2.2.0 on Python 3.6.6 under Windows 10 64-bit. I am also not sure how to do multithreading in Keras instead (the error message suggests that might solve the problem, but I was unable to find information on exactly how one would do that).
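For reference, thread-based loading appears to be what workers > 1 with use_multiprocessing=False gives you. A minimal sketch (the Sequence subclass and the data/model names are illustrative, not from this thread):

import numpy as np
from keras.utils import Sequence

class MySequence(Sequence):  # hypothetical example Sequence
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # the batch index comes in as an argument, so there is no
        # shared iteration state to protect
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

# model.fit_generator(MySequence(x_train, y_train),
#                     workers=4,                  # 4 loader threads
#                     use_multiprocessing=False)  # threads, not processes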

@Dref360 (Contributor) commented Aug 7, 2018

We do mimic Windows in the unit tests by using the 'spawn' start method, so it "should" work.
I don't own a Windows machine, so I can't really help you.

@bjoernholzhauer commented Aug 7, 2018

@Dref360 Can you point me to some code I should try, which would be expected to work? Does one need to do anything specific to use the 'spawn' method?

@Dref360 (Contributor) commented Aug 7, 2018

Try running the tests in tests/keras/utils/data_utils.py

'spawn' (fork-exec) is the default start method on Windows; UNIX uses fork by default.
In those tests, we try both mechanisms.
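For anyone who wants to mimic the Windows behaviour on UNIX themselves, a minimal sketch of forcing the 'spawn' start method (plain multiprocessing API, nothing Keras-specific):

import multiprocessing as mp

def worker(q):
    q.put('hello from child')

if __name__ == '__main__':            # required under 'spawn'
    ctx = mp.get_context('spawn')     # the only start method Windows has
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())                    # -> 'hello from child'
    p.join()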

@bjoernholzhauer

I will test it from home on my proper system. I just realized that, with a Sequence generator at least, use_multiprocessing=False with workers=4 (or so) does actually do multithreading. I had missed that because my toy example did not spend enough time in the Sequence generator to make it obvious. use_multiprocessing=True of course still hangs the system.

@bjoernholzhauer commented Aug 18, 2018

@Dref360 Sorry, finally got around to trying it. What file/filename do I need to provide in the data_utils.py code? I assume some filename needs to go where it says __file__?
if __name__ == '__main__': pytest.main([__file__])
Or am I missing what the standard way of running these tests is?

@bjoernholzhauer

Any suggestions from anyone on how to test this on a Windows system? (I'm honestly not clear on what files are needed for the test script @Dref360 pointed to.)

@Dref360 (Contributor) commented Sep 16, 2018

pytest tests/keras/test_multiprocessing.py tests/keras/utils/

Follow CONTRIBUTING.md for your setup.

@loheden commented Oct 21, 2018

This still seems to be an issue. When use_multiprocessing=True, it just hangs and literally nothing happens. I am running it on Windows 10.

Setting workers to a number bigger than 1 improves the speed even with use_multiprocessing=False. Why does this setup improve things?

Additionally, I wonder whether one needs to make a generator class (inheriting Sequence) thread safe. Since I am not able to set use_multiprocessing to True, do I need to make my generator thread safe, and would a thread-safe version give the desired performance improvement?
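A Sequence should not need extra locking for reads: __getitem__ receives the batch index as an argument rather than advancing shared iterator state. Plain Python generators are the ones that need serializing under threads; a minimal sketch of the usual wrapper (threadsafe_iter is an illustrative helper, not a Keras API):

import threading

class threadsafe_iter:
    """Wrap an iterator so next() calls are serialized across threads."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:   # only one thread advances the generator at a time
            return next(self.it)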

I even asked a question related to this topic (regarding how things should work in Windows 10) on Stack Overflow:
https://stackoverflow.com/questions/52932406/is-the-class-generator-inheriting-sequence-thread-safe-in-keras-tensorflow
But no one has replied so far...

@ghost commented Jul 22, 2019

@Dref360 Couldn't the Keras team update keras/utils/data_utils.py to pass a regular multiprocessing.Lock() at Pool creation time, using the initializer kwarg? That would make the lock instance global in all the child workers. See this Stack Overflow answer: Python sharing a lock between processes.
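A minimal sketch of that pattern in plain multiprocessing (init_pool and task are illustrative names): the lock is inherited by each worker at pool creation instead of being pickled with every task.

import multiprocessing as mp

_LOCK = None

def init_pool(lock):
    global _LOCK              # becomes a module-level global in each worker
    _LOCK = lock

def task(i):
    with _LOCK:               # usable even though locks can't be pickled per-task
        print('worker got', i)

if __name__ == '__main__':
    lock = mp.Lock()
    with mp.Pool(2, initializer=init_pool, initargs=(lock,)) as pool:
        pool.map(task, range(4))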

@Dref360 (Contributor) commented Jul 22, 2019

What would be the purpose of this Lock?

@ghost commented Jul 22, 2019

@Dref360 To overcome the error on Windows: TypeError: can't pickle _thread.lock objects. (To be clear, that is the error I receive on Windows. With a global lock, Windows users might at least be able to use OrderedEnqueuer, through a generator derived from the Sequence class, to get multiprocessing.)

@ghost commented Jul 23, 2019

@Dref360 I created a generator class that extends tensorflow.python.keras.preprocessing.image.Iterator, implementing only __init__ and _get_batches_of_transformed_samples. The problem is that Iterator itself contains a threading.Lock():

def __init__(self, n, batch_size, shuffle, seed):
    ...
    self.lock = threading.Lock()
    ...

and uses it in its next() method to control index generation:

def next(self):
    """For python 2.x.

    # Returns
        The next batch.
    """
    with self.lock:
        index_array = next(self.index_generator)
    # The transformation of images is not under thread lock
    # so it can be done in parallel
    return self._get_batches_of_transformed_samples(index_array)

When I try to use my generator and pass it to fit_generator, I inevitably get the error TypeError: can't pickle _thread.lock objects. Thread locks are never actually picklable; on Linux, fork lets children inherit them, but Windows' spawn has to pickle everything sent to a child process.

My initial thought was to have Sequence hold a self.lock, then update init_pool and init_pool_generator in data_utils.py to accept a lock, changing the first lines to:

    global _SHARED_SEQUENCES, _LOCK
    _SHARED_SEQUENCES = seqs
    _LOCK = lk

and lastly update the _get_executor_init functions in the SequenceEnqueuer subclasses to add lk to initargs:

return lambda seqs: mp.Pool(workers,
                            initializer=init_pool,
                            initargs=(seqs, lk))

Then, per the above-mentioned Stack Overflow answer, Iterator would inherit its lock from Sequence, but I think we would run into the same problem, because seqs gets passed through initargs and would itself contain locks.

Also, FYI, a separate class that just holds a threading lock doesn't work either (i.e. having Iterator extend (Sequence, SomeLockContainerClass)).

@ghost commented Jul 23, 2019

@Dref360 Dumb question, but is there any way the thread locking could be moved to a function outside the Iterator class and have next() call it, similar to what was done with init_pool and seqs in data_utils.py? Or maybe turning _flow_index into a Queue of indices to be accessed across processes?
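A rough, non-Keras sketch of that second idea (all names illustrative): batch indices go into a multiprocessing.Queue, so no thread lock ever has to cross a process boundary.

import multiprocessing as mp

def worker(index_queue, result_queue):
    while True:
        idx = index_queue.get()
        if idx is None:                      # sentinel: no more work
            break
        result_queue.put((idx, idx * idx))   # stand-in for loading batch idx

if __name__ == '__main__':
    idx_q, res_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(idx_q, res_q)) for _ in range(2)]
    for p in procs:
        p.start()
    for i in range(6):                       # publish the epoch's batch indices
        idx_q.put(i)
    for _ in procs:                          # one sentinel per worker
        idx_q.put(None)
    results = [res_q.get() for _ in range(6)]
    for p in procs:
        p.join()
    print(sorted(results))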

@mchaniotakis

Same here! Any solutions?

@Dref360 (Contributor) commented Aug 6, 2019

PRs are welcome. I cannot work on this issue as I do not use Windows.

@txyugood commented Aug 6, 2019

I think the threading lock is unused when multiprocessing is used. Can I remove the threading lock in the Sequence class?

@Dref360 (Contributor) commented Aug 6, 2019 via email

@txyugood commented Aug 6, 2019

I'm sorry, I meant the threading lock in the Iterator class:
https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/iterator.py#L43

I mean that the multiprocessing module uses processes, not threads. So is the threading lock necessary inside a process? Can I remove it when I use multiprocessing on Windows?
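One untested idea of what that could look like (PicklableIterator is an illustrative name, and this assumes the lock is only needed for thread-based use): replace the unpicklable lock with a no-op context manager in a subclass, so instances can be pickled under 'spawn' while `with self.lock:` in next() keeps working.

import contextlib
from keras_preprocessing.image.iterator import Iterator

class PicklableIterator(Iterator):            # illustrative name
    def __init__(self, n, batch_size, shuffle, seed):
        super().__init__(n, batch_size, shuffle, seed)
        # picklable stand-in that still supports the 'with' statement;
        # a real subclass would also implement _get_batches_of_transformed_samples
        self.lock = contextlib.nullcontext()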

@Dref360 (Contributor) commented Aug 6, 2019 via email

@evrial commented Aug 23, 2019

Still no solution?

@ghost commented Aug 25, 2019

@evrial @txyugood @mchaniotakis I don't think there is a solution, unfortunately. TensorFlow has historically been built for Linux. The multiprocessing module is the best bet to get this working on Windows, but IMO it would take a complete rewrite of preprocessing.image.

@adam-grant-hendry commented Aug 28, 2020

I have a proposed "solution" that may interest others. It is not a direct fix, but I believe it is a useful workaround. Note that this comes from my experience with TensorFlow 1.15 (I have yet to use version 2). See also the Stack Overflow question Is the class generator (inheriting Sequence) thread safe in Keras/Tensorflow?

TL;DR

Install WSL version 2 on Windows, install TensorFlow in a Linux environment (e.g. Ubuntu) here, and then set use_multiprocessing to True to get this to work.

NOTE: The Windows Subsystem for Linux (WSL) version 2 is only available in Windows 10, version 1903, build 18362 or higher. Be sure to upgrade via Windows Update to get this to work.

Long Answer

For multiprocessing and multithreading (i.e. parallelism and concurrency), there are two operations to consider:

  • forking = a parent process creates a copy of itself (a child) that has an exact copy of all the memory segments it uses
  • spawning = a parent process starts a fresh child process that does not share the parent's memory; any state the child needs must be serialized (pickled) and sent to it

Linux supports forking, but Windows does not. Windows only supports spawning.

The reason Windows hangs with use_multiprocessing=True is that Python's multiprocessing module uses spawn on Windows. The parent cannot simply hand its memory to the child; everything the child needs must be pickled and sent across, and when that fails (as it does for generators and thread locks), the workers never get their data and the parent waits forever.

On Windows, use_multiprocessing=True is not supported with plain generators. If you've ever attempted to use a data generator there, you've probably seen an error like this:

ValueError: Using a generator with use_multiprocessing=True is not supported on Windows 
(no marshalling of generators across process boundaries). Instead, use single 
thread/process or multithreading.

Marshalling means "transforming the memory representation of an object into a data format suitable for transmission." The error is saying that, unlike Linux (which uses fork), Windows uses spawn with use_multiprocessing=True and cannot transfer the generator's state to the child process.

At this point, you may be asking yourself:

"Wait...What about the Python Global Interpreter Lock (GIL)?..If Python only allows one thread to run at a time, why does it even have the threading module and why do we care about this in Tensorflow??!"

The answer lies in the difference between CPU-bound tasks and I/O-bound tasks:

  • CPU-bound tasks = those whose speed is limited by computation (number crunching)
  • I/O-bound tasks = those that spend their time waiting for input or output from other processes (i.e. data transfer)

In programming, when we say two tasks are concurrent, we mean they can start, run, and complete in overlapping time. When we say they are parallel, we mean they are literally running at the same time.

So, the GIL prevents threads from running in parallel, but not concurrently. The reason this is important for Tensorflow is because concurrency is all about I/O operations (data transfer). A good dataflow pipeline in Tensorflow should try to be concurrent so that there's no lag time when data is being transferred to-and-from the CPU, GPU, and/or RAM and training finishes faster. (Rather than have a thread sit and wait until it gets data back from somewhere else, we can have it executing image preprocessing or something else until the data gets back.)


IMPORTANT ASIDE: The GIL exists in Python because everything in Python is an object. (This is why you can do "weird" things with "dunder/magic" methods, like (5).__add__(3) to get 8.)

NOTE: In the above, parentheses are needed around 5, since 5. would otherwise parse as a float literal.

Python handles memory and garbage collection by counting all references made to each object; when the count drops to 0, Python deletes the object. If two threads mutated the same reference count simultaneously, or one thread finished faster than another, you could get a race condition and objects would be deleted "randomly". We could instead put a lock on every object, but then we risk deadlocks.

Losing parallel thread execution was seen by Guido (and by myself, though that is certainly arguable) as a minor loss, because we still keep concurrent I/O operations, and tasks can still run in parallel on different CPU cores (i.e. multiprocessing). Hence, this is (one reason) why Python has both the threading and multiprocessing modules.


Now back to thread safety. When running concurrent/parallel tasks, you have to watch out for additional things. Two big ones are (a small demo follows this list):

  1. race conditions - the outcome depends on the unpredictable order and timing in which threads are scheduled, so results can differ from run to run (this is also why, e.g., we typically average timings over many runs with the timeit module).

  2. deadlock - to keep two threads from touching the same memory at the same time, we guard it with a lock or mutex (mutual exclusion) so other threads must wait. However, if two threads each hold a lock and each needs the other's lock to proceed, neither can continue, and the program hangs in what is known as a deadlock.
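A small self-contained demonstration of both ideas (illustrative, not TensorFlow code): without the lock, the unprotected read-modify-write on counter can interleave across threads and lose updates.

import threading

counter = 0
lock = threading.Lock()

def increment(use_lock, n=100_000):
    global counter
    for _ in range(n):
        if use_lock:
            with lock:
                counter += 1   # protected: one thread at a time
        else:
            counter += 1       # unprotected read-modify-write can interleave

threads = [threading.Thread(target=increment, args=(True,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # with the lock: always 400000; with use_lock=False: often less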

I bring this up because TensorFlow needs to pickle Python objects to hand them to worker processes. (Pickling means serializing objects and data into a byte stream so they can be stored or sent to another process.) The TensorFlow Iterator.__init__() method contains a threading.Lock():

def __init__(self, n, batch_size, shuffle, seed):
    ...
    self.lock = threading.Lock()
    ...

The problem is that Python cannot pickle thread lock objects, and on Windows, spawn must pickle everything it sends to a child (i.e. Windows cannot marshall thread locks to a child process).

If you try to use a generator and pass it to fit_generator, you will get the error

TypeError: can't pickle _thread.lock objects
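You can reproduce that underlying failure in two lines, independent of Keras (the exact message wording varies by Python version):

import pickle
import threading

pickle.dumps(threading.Lock())
# TypeError: can't pickle _thread.lock objects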

So, while use_multiprocessing=True works on Linux (where fork sidesteps pickling), it does not on Windows.

Solution: Around June 2020, Microsoft released version 2 of the Windows Subsystem for Linux (WSL). This was significant because it enables GPU hardware acceleration. Version 1 was "simply" a compatibility layer between Windows NT and Linux, whereas WSL 2 ships an actual Linux kernel. Thus, you can now install Linux on Windows, open a bash shell from the command prompt, and (most importantly) access hardware. It is therefore possible to install tensorflow-gpu on WSL 2, and you'll be able to use fork.

Thus, I recommend:

  1. Install WSL version 2 on Windows and add your desired Linux environment
  2. Install tensorflow-gpu in a virtual environment in the WSL Linux environment
  3. Retry use_multiprocessing=True to see if it works.

CAVEAT: I haven't tested this yet to verify that it works, but to the best of my limited knowledge, I believe it should.

@josuerocha commented Aug 29, 2021

I am still able to reproduce the issue on Python 3.9.0 and Tensorflow 2.6.0 on Windows 10.

I tried WSL 2, but the speedup relative to Windows 10 without multiprocessing was only about 20%.

Is there any alternative or solution today?

@rayjennings3rd

I am trying to get this same option to work on macOS Monterey. While other Python packages are able to use multiprocessing, I see no improvement at all with this option. I am running 8 cores, TensorFlow 2.6, and Keras Sequential models. I have:

workers=6,
use_multiprocessing=True

And there is zero difference between having this on or off.
