WIP: Data processing server #56

Merged
merged 10 commits from server into master on Mar 28, 2015

Conversation

bartvm commented Mar 27, 2015

As requested in #7. We already have some basic multi-processing, but we've run into issues with HDF5 files (#47), so I figured it was worth revisiting the idea of having a separate server that pushes data around through sockets (optimized for NumPy data).

As I'm kind of figuring out ZMQ and messaging frameworks as I go along, this is a very rough draft, and I might be doing it completely wrong. I'm basically abusing the extended request-reply pattern, making the broker act as a buffer. The idea is that the user creates their data stream in one file and calls start_server(data_stream) at the end of it. In the script with the main loop, the data stream is then given as ServerDataStream().
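
(For anyone unfamiliar with the pattern, a minimal sketch of a zguide-style extended request-reply broker looks roughly like the code below; the ports and the plain zmq.proxy forwarding are just illustrative, not necessarily what this PR ends up doing.)

import zmq

context = zmq.Context()

# Clients (e.g. the training script) connect to the frontend with REQ sockets.
frontend = context.socket(zmq.ROUTER)
frontend.bind("tcp://*:5559")

# The process that owns the data stream connects to the backend with a REP socket.
backend = context.socket(zmq.DEALER)
backend.bind("tcp://*:5560")

# Forward requests to the data producer and replies back to the clients.
# A buffering broker would additionally queue ready batches at this point.
zmq.proxy(frontend, backend)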

Long to-do list:

  • If the client dies, it can't reconnect to the server. If we want to allow this, we need something like the lazy pirate pattern I think.
  • Support for multiple sources.
  • Server needs to send stop iteration message to client
  • Documentation
  • Client interface (ServerDataStream)
  • Benchmarking
  • Python 3 compatibility (byte strings, buffer interface)
  • Checkpointing support
  • Iterator protocol support (e.g. go to next iteration)
  • Automate source discovery

Any feedback welcome. Like I said, making this up as I go along.

bartvm commented Mar 27, 2015

I've added a crude way of sending StopIteration messages, and also wrote a ServerDataStream class (which largely ignores the data stream protocol). This should work now:

# In one terminal, start the server:
python fuel/server.py

# In another process, connect a client:
from fuel.streams import ServerDataStream
data_stream = ServerDataStream(('features',))
it = data_stream.get_epoch_iterator()
next(it)

bartvm commented Mar 27, 2015

Woohoo! For my dogs vs. cats model, testing this on a single run, the time to read data fell from 76.84 seconds per epoch to 5.61 seconds. That means it went from being 38% of the training time to 4% of the training time :)

bartvm commented Mar 27, 2015

Whoever wants to have a look, this should be good for initial review (@dwf, @vdumoulin, @rizar)

vdumoulin commented:

I'm by no means acquainted with networking, but if you provide a minimal starting example I can "kick the tires" and tell you what I think.

bartvm commented Mar 27, 2015

# server.py
from fuel.datasets import MNIST
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme
mnist = MNIST('train')
data_stream = DataStream(
    mnist, iteration_scheme=SequentialScheme(1500, 500)
)
start_server(data_stream)

# train.py
from fuel.streams import ServerDataStream
data_stream = ServerDataStream(('features', 'targets'))
epoch = data_stream.get_epoch_iterator()
batch = next(epoch)
print(batch)

Running python server.py followed by python train.py should work now.

vdumoulin commented:

Thanks, I'll have a look at it.

vdumoulin commented:

I still haven't read your code in detail, but if I call python train.py twice after calling python server.py (which needs an import for start_server to work), it gets stuck on the second call.

bartvm commented Mar 27, 2015

Oops, forgot the start_server import.

I'm aware of the inability of the client to reconnect; it's the "lazy pirate pattern" problem I referred to in the description. Basically, ZMQ binds the processes together in the background somehow. If the client dies and tries to reconnect, its ID changes and it can't connect to the server anymore. The only way to fix this is by manually implementing a pattern where the client forcefully closes the socket, deregisters from the poll, and reconnects (see e.g. this code).

I was planning on fixing that in a later PR, if at all. It normally doesn't make sense to reconnect the client without restarting the server, since the client would start receiving data from the middle of an epoch.
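
(For reference, the reconnect logic would look roughly like this lazy-pirate-style sketch; the endpoint, timeout and retry count are made up for illustration.)

import zmq

REQUEST_TIMEOUT = 2500  # milliseconds
REQUEST_RETRIES = 3
SERVER_ENDPOINT = "tcp://localhost:5559"

context = zmq.Context()

def request(message):
    """Send ``message`` (bytes), recreating the socket if the server stalls."""
    for _ in range(REQUEST_RETRIES):
        client = context.socket(zmq.REQ)
        client.connect(SERVER_ENDPOINT)
        poller = zmq.Poller()
        poller.register(client, zmq.POLLIN)
        client.send(message)
        if poller.poll(REQUEST_TIMEOUT):
            reply = client.recv()
            client.close()
            return reply
        # No reply in time: deregister, forcefully close, and reconnect.
        poller.unregister(client)
        client.setsockopt(zmq.LINGER, 0)
        client.close()
    raise IOError("server seems to be offline")

# Usage: batch = request(b'next')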

vdumoulin commented:

Makes sense.



Inline review comment on the send_arrays helper:

def send_arrays(socket, arrays, flags=0, copy=True, track=False, stop=False):
    """Send a NumPy array using the buffer interface and some metadata.

A contributor commented:

singular, yet below you say we're sending a list of numpy arrays

bartvm (author) replied:

Right, should be plural.
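
(For context, the usual pyzmq recipe this kind of helper builds on sends a JSON metadata frame followed by the raw array buffers; a rough sketch of that idea, with a stop flag for end-of-epoch, not the PR's exact code — it assumes C-contiguous arrays.)

import zmq
import numpy


def send_arrays(socket, arrays, stop=False):
    """Send a list of NumPy arrays as a metadata frame plus one buffer each."""
    if stop:
        # Signal the end of the epoch instead of sending data.
        return socket.send_json({'stop': True})
    headers = [{'dtype': str(array.dtype), 'shape': array.shape}
               for array in arrays]
    socket.send_json(headers, zmq.SNDMORE)
    for array in arrays[:-1]:
        socket.send(array, zmq.SNDMORE, copy=False)
    return socket.send(arrays[-1], copy=False)


def recv_arrays(socket):
    """Receive arrays sent by ``send_arrays``; raise StopIteration on stop."""
    headers = socket.recv_json()
    if 'stop' in headers:
        raise StopIteration
    arrays = []
    for header in headers:
        frame = socket.recv(copy=False)
        array = numpy.frombuffer(frame.buffer, dtype=header['dtype'])
        arrays.append(array.reshape(header['shape']))
    return arrays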

bartvm commented Mar 28, 2015

Nothing I can do about the drop in coverage by the way. Coverage doesn't count lines run in subprocesses. It actually has an experimental flag for this in the latest alpha releases, but I can't get that to play nice with Coveralls (whose support for Coverage 4 is alpha level, so alpha level support for an alpha level release...). Give it a month or two and I guess we should be able to enable it.

bartvm commented Mar 28, 2015

As I said, I'm figuring this out as I go along: after reading the documentation a bit more, I realized I could switch to a push-pull pattern and use ZeroMQ's high-water mark to limit the queue instead, which turns out to be significantly faster. Reading time for my Dogs vs. Cats model (originally 76.84 s) went from 5.61 s to 1.9 s, about 3 times faster and now only 1.4% of the training time.
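
(Roughly, the new setup looks like the sketch below; the high-water mark value and port are just illustrative.)

import zmq

context = zmq.Context()

# Producer side (the process that owns the data stream): PUSH with a small
# high-water mark, so only a few batches are ever queued in memory.
sender = context.socket(zmq.PUSH)
sender.set_hwm(10)
sender.bind("tcp://*:5557")
# sender.send_pyobj(batch) blocks once the queue is full.

# Consumer side (the training script): PULL simply receives batches in order.
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://localhost:5557")
# batch = receiver.recv_pyobj()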

bartvm added a commit that referenced this pull request Mar 28, 2015
WIP: Data processing server
bartvm merged commit c36524a into master on Mar 28, 2015
bartvm deleted the server branch on September 22, 2015