When I perform a multipart/form-data upload of any large binary file in Flask, those uploads are very easily CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.
A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in werkzeug.formparser.MultiPartParser.parse_parts(). The reason this that the method parse_lines() yields a lot of very small chunks, sometimes even just single bytes:
# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
yield _cont, buf
buf = b''
So parse_parts() goes through a lot of small iterations (more than 2 million for a 100 MB file) processing single "lines", always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead slowing down those whole process and making it CPU bound very quickly.
A quick test shows that a speed-up is very easily possible by first collecting the data in a bytearray in parse_lines() and only yielding that data back into parse_parts() when self.buffer_size is exceeded. Something like this:
buf = b''
collect = bytearray()
for line in iterator:
if not line:
self.fail('unexpected end of stream')
if line[:2] == b'--':
terminator = line.rstrip()
if terminator in (next_part, last_part):
# yield remaining collected data
if collect:
yield _cont, collect
break
if transfer_encoding is not None:
if transfer_encoding == 'base64':
transfer_encoding = 'base64_codec'
try:
line = codecs.decode(line, transfer_encoding)
except Exception:
self.fail('could not decode transfer encoded chunk')
# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
collect += buf
buf = b''
# If the line ends with windows CRLF we write everything except
# the last two bytes. In all other cases however we write
# everything except the last byte. If it was a newline, that's
# fine, otherwise it does not matter because we will write it
# the next iteration. this ensures we do not write the
# final newline into the stream. That way we do not have to
# truncate the stream. However we do have to make sure that
# if something else than a newline is in there we write it
# out.
if line[-2:] == b'\r\n':
buf = b'\r\n'
cutoff = -2
else:
buf = line[-1:]
cutoff = -1
collect += line[:cutoff]
if len(collect) >= self.buffer_size:
yield _cont, collect
collect.clear()
This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, that's almost a 4X increase in performance. All tests are done on Windows (64-bit Python 3.4), I'm not sure if it's as much of a problem on Linux.
It's still mostly CPU bound, so I'm sure there is even more potential for optimization. I think I'll look into it when I find a bit more time.
When I perform a
multipart/form-dataupload of any large binary file in Flask, those uploads are very easily CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in
werkzeug.formparser.MultiPartParser.parse_parts(). The reason this that the methodparse_lines()yields a lot of very small chunks, sometimes even just single bytes:So
parse_parts()goes through a lot of small iterations (more than 2 million for a 100 MB file) processing single "lines", always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead slowing down those whole process and making it CPU bound very quickly.A quick test shows that a speed-up is very easily possible by first collecting the data in a
bytearrayinparse_lines()and only yielding that data back intoparse_parts()whenself.buffer_sizeis exceeded. Something like this:This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, that's almost a 4X increase in performance. All tests are done on Windows (64-bit Python 3.4), I'm not sure if it's as much of a problem on Linux.
It's still mostly CPU bound, so I'm sure there is even more potential for optimization. I think I'll look into it when I find a bit more time.