Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

werkzeug.formparser is really slow with large binary uploads #875

Closed
sekrause opened this issue Mar 3, 2016 · 33 comments · Fixed by #2017 or #2022
Closed

werkzeug.formparser is really slow with large binary uploads #875

sekrause opened this issue Mar 3, 2016 · 33 comments · Fixed by #2017 or #2022
Labels
Milestone

Comments

@sekrause
Copy link
Contributor

@sekrause sekrause commented Mar 3, 2016

When I perform a multipart/form-data upload of any large binary file in Flask, those uploads are very easily CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.

A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in werkzeug.formparser.MultiPartParser.parse_parts(). The reason this that the method parse_lines() yields a lot of very small chunks, sometimes even just single bytes:

# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
    yield _cont, buf
    buf = b''

So parse_parts() goes through a lot of small iterations (more than 2 million for a 100 MB file) processing single "lines", always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead slowing down those whole process and making it CPU bound very quickly.

A quick test shows that a speed-up is very easily possible by first collecting the data in a bytearray in parse_lines() and only yielding that data back into parse_parts() when self.buffer_size is exceeded. Something like this:

buf = b''
collect = bytearray()
for line in iterator:
    if not line:
        self.fail('unexpected end of stream')

    if line[:2] == b'--':
        terminator = line.rstrip()
        if terminator in (next_part, last_part):
            # yield remaining collected data
            if collect:
                yield _cont, collect
            break

    if transfer_encoding is not None:
        if transfer_encoding == 'base64':
            transfer_encoding = 'base64_codec'
        try:
            line = codecs.decode(line, transfer_encoding)
        except Exception:
            self.fail('could not decode transfer encoded chunk')

    # we have something in the buffer from the last iteration.
    # this is usually a newline delimiter.
    if buf:
        collect += buf
        buf = b''

    # If the line ends with windows CRLF we write everything except
    # the last two bytes.  In all other cases however we write
    # everything except the last byte.  If it was a newline, that's
    # fine, otherwise it does not matter because we will write it
    # the next iteration.  this ensures we do not write the
    # final newline into the stream.  That way we do not have to
    # truncate the stream.  However we do have to make sure that
    # if something else than a newline is in there we write it
    # out.
    if line[-2:] == b'\r\n':
        buf = b'\r\n'
        cutoff = -2
    else:
        buf = line[-1:]
        cutoff = -1

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        yield _cont, collect
        collect.clear()

This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, that's almost a 4X increase in performance. All tests are done on Windows (64-bit Python 3.4), I'm not sure if it's as much of a problem on Linux.

It's still mostly CPU bound, so I'm sure there is even more potential for optimization. I think I'll look into it when I find a bit more time.

@languanghao
Copy link

@languanghao languanghao commented Mar 18, 2016

I also have same problem, when I upload an iso file(200m), the first call to request.form will take 7s

@RonnyPfannschmidt
Copy link
Contributor

@RonnyPfannschmidt RonnyPfannschmidt commented Mar 18, 2016

2 things seem interesting for further optimization - experimenting with cython, and experimenting with interpreting the content-site headers for smarter mime message parsing

(no need to scan for lines if you know the content-length of a sub-message)

@lnielsen
Copy link
Contributor

@lnielsen lnielsen commented Aug 22, 2016

Just a quick note, that if you stream the file directly in the request body (i.e. no application/multipart-formdata), you completely bypass the form parser and read the file directly from request.stream.

@carbn
Copy link

@carbn carbn commented Oct 18, 2016

I have the same issue with slow upload speeds with multipart uploads when using jQuery-File-Upload's chunked upload method. When using small chunks (~10MB), the transfer speed jumps between 0 and 12MB/s while the network and server are fully capable of speeds over 50MB/s. The slowdown is caused by the cpu bound multipart parsing which takes about the same time as the actual upload. Sadly, using streaming uploads to bypass the multipart parsing is not really an option as I must support iOS devices that can't do streaming in the background.

The patch provided by @sekrause looks nice but doesn't work in python 2.7.

@cuibonobo
Copy link

@cuibonobo cuibonobo commented Oct 18, 2016

@carbn: I was able to get the patch to work in Python 2.7 by changing the last line to collect = bytearray(). This just creates a new bytearray instead of clearing the existing one.

@carbn
Copy link

@carbn carbn commented Oct 18, 2016

@cuibonobo: That's the first thing I changed but still had another error. I can't check the working patch at the moment, but IIRC the yields had to be changed from yield _cont, collect to yield _cont, str(collect). This allowed the code to be tested and the patch yielded about 30% increase in the multipart processing speed. It's a nice speedup, but the performance is still pretty bad.

@sekrause
Copy link
Contributor Author

@sekrause sekrause commented Oct 18, 2016

A little further investigation shows that werkzeug.wsgi.make_line_iter is already too much of a bottleneck to really be able to optimize parse_lines(). Look at this Python 3 test script:

import io
import time
from werkzeug.wsgi import make_line_iter

filename = 'test.bin' # Large binary file
lines = 0

# load a large binary file into memory
with open(filename, 'rb') as f:
    data = f.read()
    stream = io.BytesIO(data)
    filesize = len(data) / 2**20 # MB

start = time.perf_counter()
for _ in make_line_iter(stream):
    lines += 1
stop = time.perf_counter()
delta = stop - start

print('File size: %.2f MB' % filesize)
print('Time: %.1f seconds' % delta)
print('Read speed: %.2f MB/s' % (filesize / delta))
print('Number of lines yielded by make_line_iter: %d' % lines)

For a 923 MB video file with Python 3.5 the output look something like this on my laptop:

File size: 926.89 MB
Time: 20.6 seconds
Read speed: 44.97 MB/s
Number of lines yielded by make_line_iter: 7562905

So even if you apply my optimization above and optimize it further until perfection you'll still be limited to ~45 MB/s for large binary uploads simply because make_line_iter can't give you the data fast enough and you'll be doing 7.5 million iterations for 923 MB of data in your loop that checks for the boundary.

I guess the only great optimization will be to completely replace parse_lines() with something else. A possible solution that comes to mind is to read a reasonably large chunk of the stream into memory then use string.find() (or bytes.find() in Python 3) to check if the boundary is in the chunk. In Python find() is a highly optimized string search algorithm written in C, so that should give you some performance. You would just have to take care of the case where the boundary might be right between two chunks.

@sdizazzo
Copy link

@sdizazzo sdizazzo commented Jun 2, 2017

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

@davidism davidism added the bug label Jun 2, 2017
@lambdaq
Copy link

@lambdaq lambdaq commented Jun 20, 2017

I guess the only great optimization will be to completely replace parse_lines()

+1 for this.

I am writing a bridge to stream user's upload directly to S3 without any intermediate temp files, possibly with backpressure, and I find werkzeug and flask situation frustrating. You can't move data directly between two pipes.

@pallets pallets deleted a comment from huhuanming Jun 20, 2017
@davidism
Copy link
Member

@davidism davidism commented Jun 20, 2017

@lambdaq I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

@lnielsen
Copy link
Contributor

@lnielsen lnielsen commented Jun 20, 2017

@lambdaq Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

The only problem we had is the werkzeug form parser is eagerly checking content length against the allowed max content length before knowing if it should actually parse the request body.

This prevents you from setting max content length on normal form data, but also allow very large file uploads.

We fixed it by reordering the check the function a bit. Not sure if it makes sense to provide this upstream as some apps might rely on the existing behaviour.

@lambdaq
Copy link

@lambdaq lambdaq commented Jun 21, 2017

Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

Unfortunately not. It's just normal form uploads with multipart.

I'd be happy to review a patch changing the behavior.

I tried to hack werkzeug.wsgi.make_line_iter or parse_lines() using generators's send(), so we can signal _iter_basic_lines() to emit whole chunks instead of lines. It turns out not so easy.

Basically, the rabbit whole starts with 'itertools.chain' object has no attribute 'send'.... 😂

@ThiefMaster
Copy link
Member

@ThiefMaster ThiefMaster commented Jun 21, 2017

I wonder how much this code could be sped up using native speedups written in C (or Cython etc.). I think handling semi-large (a few 100 MB, but not huge as in many GB) files more efficiently is important without having to change how the app uses them (ie streaming them directly instead of buffering) - for many applications this would be overkill and is not absolutely necessary (actually, even the current somewhat slow performance is probably OK for them) but making things faster is always nice!

@lambdaq
Copy link

@lambdaq lambdaq commented Sep 30, 2017

Another possible solution is offload the multipart parsing job to nginx

https://www.nginx.com/resources/wiki/modules/upload/

@ThiefMaster
Copy link
Member

@ThiefMaster ThiefMaster commented Dec 14, 2017

Both repos look dead.

@mdemin914
Copy link

@mdemin914 mdemin914 commented Feb 1, 2018

so is there no known solution to this?

@lnielsen
Copy link
Contributor

@lnielsen lnielsen commented Feb 1, 2018

There's a workaround👆

@sdizazzo
Copy link

@sdizazzo sdizazzo commented Feb 2, 2018

Under uwsgi, we use it's built in chunked_read() function and parse the stream on our own as it comes in. It works 99% of the time, but it has a bug that I have yet to track down. See my earlier comment for an out-of-the box streaming form parser. Under python2 it was slow, so we rolled our own and it is fast. :)

@davidism
Copy link
Member

@davidism davidism commented Feb 2, 2018

Quoting from above:

I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

I don't really have time to work on this right now. If this is something that you are spending time on, please consider contributing a patch. Contributions are very welcome.

@siddhantgoel
Copy link

@siddhantgoel siddhantgoel commented Feb 3, 2018

@sdizazzo

but it has a bug that I have yet to track down

are you talking about streaming-form-data? if so, I'd love to know what the bug is.

@kneufeld
Copy link

@kneufeld kneufeld commented Jul 21, 2018

Our problem was that the slow form processing prevented concurrent request handling which caused nomad to think the process was hung and killed it.

My fix was to add a sleep(0) in werkzeug/formparser.py:MutlipartParser.parse_lines():

            for i, line in enumerate(iterator):
                if not line:
                    self.fail('unexpected end of stream')

                # give other greenlets a chance to run every 100 lines
                if i % 100 == 0:
                    time.sleep(0)

search for unexpected end of stream if you want to apply this patch.

@patrislav1
Copy link

@patrislav1 patrislav1 commented Oct 12, 2018

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

seconded.
this speeds up file uploads to my Flask app by more than factor 10

@ghost
Copy link

@ghost ghost commented Sep 10, 2019

@siddhantgoel
Thanks a lot for your fix with streaming-form-data. I can finally upload gigabyte sized files at good speed and without memory filling up!

@davidism
Copy link
Member

@davidism davidism commented Apr 12, 2020

See #1788 which discusses rewriting the parser to be sans-io. Based on the feedback here, I think that would address this issue too.

@sekrause
Copy link
Contributor Author

@sekrause sekrause commented Jan 26, 2021

@davidism I don't think this issue should be closed because the speed-up is negligible.

Below is a little test script to benchmark the multipart parser and to compare Werkzeug with streaming-form-data. Run it with:

  • bench.py werkzeug myfile
  • bench.py streaming-form-data myfile (only works the the stable Werkzeug version.)

These are my results with a 425 MB zip file on my laptop:

  • Old Werkzeug parser: 8200 ms.
  • New Werkzeug parser: 6250 ms.
  • streaming-form-data: 260 ms.

So the new parser is only about 25% faster than the old parser, but still more than an order of magnitude slower than a fast parser.

import argparse
import io
import time
from os.path import basename

from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import BaseTarget
from werkzeug.test import EnvironBuilder, run_wsgi_app

app = Flask(__name__)

class LengthTarget(BaseTarget):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.total = 0

    def on_data_received(self, chunk: bytes):
        self.total += len(chunk)

@app.route("/streaming-form-data", methods=['POST'])
def streaming_form_data_upload():
    target = LengthTarget()

    parser = StreamingFormDataParser(headers=request.headers)
    parser.register('file', target)

    while True:
        chunk = request.stream.read(131072)
        if not chunk:
            break
        parser.data_received(chunk)

    print(target.total)
    return 'done'

@app.route("/werkzeug", methods=['POST'])
def werkzeug_upload():
    file = request.files['file']
    stream = file.stream
    stream.seek(0, io.SEEK_END)
    print(stream.tell())
    return 'done'

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('parser', choices=['streaming-form-data', 'werkzeug'])
    parser.add_argument('file')
    args = parser.parse_args()
    with open(args.file, 'rb') as f:
        data = f.read()

    # Prepare the whole environment in advance so that this doesn't slow down the benchmark.
    e = EnvironBuilder(method='POST', path=f'/{args.parser}')
    e.files.add_file('file', io.BytesIO(data), basename(args.file))
    environ = e.get_environ()

    start = time.perf_counter()
    run_wsgi_app(app, environ)
    stop = time.perf_counter()
    delta = (stop - start) * 1000
    print(f'{delta:.1f} ms')

if __name__ == "__main__":
    main()

@davidism
Copy link
Member

@davidism davidism commented Jan 26, 2021

@sekrause hey, I really appreciate the detail you're providing. However, in the five years since you opened this issue, neither you or anyone else invested in seeing the issue fixed has actually submitted a fix.

I personally will not have the time to learn that library's implementation and identify how it can be applied to ours. Note that the library you're comparing to is implemented in C, so it's unlikely we'll every achieve the same speed. It's also already possible to use that library with Werkzeug when that speed is required. Perhaps someone could turn that into an extension library so it's more integrated as a Request subclass.

I'm happy to consider a PR that adds further improvements to the parser, but leaving this issue open so far doesn't seem to have resulted in that.

@siddhantgoel
Copy link

@siddhantgoel siddhantgoel commented Jan 27, 2021

Author of the other library here. I'm more than happy to review proposals/patches in case someone wants to provide an extension so it can work better with Werkzeug.

@sekrause
Copy link
Contributor Author

@sekrause sekrause commented Jan 30, 2021

@davidism So I looked into your current implementation to check where it's slow and I think it turns out that from here we can get another 10x speedup by adding less than 10 lines of code.

When uploading a large binary file most of the time is spent in the elif self.state == State.DATA clause of MultipartDecoder.next_event, about half in list(LINE_BREAK_RE.finditer(self.buffer)) and half in the remaining lines.

But we don't really need to look at all lines break. The trick is to offload as much work as possible to bytes.find() which is really fast.

When we execute self.buffer.find(boundary_end) and it returns that nothing has been found, we can be sure that the ending boundary is not in self.buffer[:-len(boundary_end)] and just return this data without looking at it any further. We need to keep the last len(boundary_end) bytes of the buffer for the next iteration in case the ending boundary is on the border between two chunks.

When uploading a large file almost all iterations of the loop can return immediately after self.buffer.find(boundary_end). Only when it actually seems like we have and ending bounary we fall back to the code which checks for the line breaks and with the regular expressions.

If you want to test it yourself add this to MultipartDecoder.__init__():

self.boundary_end = b'--' + boundary + b'--'

And then change the elif self.state == State.DATA clause of MultipartDecoder.next_event into this:

elif self.state == State.DATA:
    if len(self.buffer) <= len(self.boundary_end):
        event = NEED_DATA
    elif self.buffer.find(self.boundary_end) == -1:
        data = bytes(self.buffer[:-len(self.boundary_end)])
        del self.buffer[:-len(self.boundary_end)]
        event = Data(data=data, more_data=True)
    else:
        # Return up to the last line break as data, anything past
        # that line break could be a boundary - more data may be
        # required to know for sure.
        lines = list(LINE_BREAK_RE.finditer(self.buffer))
        if len(lines):
            data_length = del_index = lines[-1].start()
            match = self.boundary_re.search(self.buffer)
            if match is not None:
                if match.group(1).startswith(b"--"):
                    self.state = State.EPILOGUE
                else:
                    self.state = State.PART
                data_length = match.start()
                del_index = match.end()

            data = bytes(self.buffer[:data_length])
            del self.buffer[:del_index]
            more_data = match is None
            if data or not more_data:
                event = Data(data=data, more_data=more_data)

Everything after the else is the old code unchanged and the elif self.buffer.find(self.boundary_end) == -1 is the trick I described above. This change alone reduces the upload time of my 430 MB test from 7000 to 700 ms, a 10x speed-up!

The if len(self.buffer) <= len(self.boundary_end) was needed so that we don't get an infinite loop, not sure if it's correct.

What do you think?

@davidism
Copy link
Member

@davidism davidism commented Jan 30, 2021

Sounds interesting, can you make a pr?

@davidism davidism reopened this Jan 30, 2021
@pgjones
Copy link
Member

@pgjones pgjones commented Jan 30, 2021

Thanks @sekrause. I don't think exactly as you've written can be used (it ignores the complexity around CR and NL), however I think with some tweaks it works as #2022. Could you try that PR against your benchmarks?

@sekrause
Copy link
Contributor Author

@sekrause sekrause commented Jan 30, 2021

I don't think exactly as you've written can be used (it ignores the complexity around CR and NL)

I think we can make it work. The regular expression from MultipartDecoder.__init__() which checks for the boundary with all the complexity around CR and NL looks like this:

self.boundary_re = re.compile(
    br"%s--%s(--[^\S\n\r]*%s?|[^\S\n\r]*%s)"
    % (LINE_BREAK, boundary, LINE_BREAK, LINE_BREAK),
    re.MULTILINE,
)

So if buffer.find(b'--' + boundary) doesn't find a result the regular expression is also guaranteed to have no match because find() looks for a strict subset of the regular expression.

If find() does find something, we check again with the regular expression which handles all the complexity around CR and NL.

This additional precheck with find() seems to be worth it because when uploading a large file more than 99% of the iterations in the loop will end after find() and never have to recheck with the slower regular expression.

however I think with some tweaks it works as #2022. Could you try that PR against your benchmarks?

Your change is a ~2x speed-up, from 7000 ms to 3700 ms on my computer with my 430 MB test file. I've posted my benchmark program in #875 (comment) if you want to compare yourself.

@sekrause
Copy link
Contributor Author

@sekrause sekrause commented Jan 31, 2021

Final summary now that our changes have landed in Github master. Small benchmark uploading a file of 64 MB random data 10 times in a row and measuring the average request time on an Intel Core i7-8550U:

  • Werkzeug 1.0.1: 10 requests. Avg request time: 1390.9 ms. 46.0 MiB/s
  • Werkzeug Master: 10 requests. Avg request time: 91.3 ms. 701.0 MiB/s

With reasonable large files that's a 15x improvement (the difference is a little lower with small files because of the request overhead) and on a somewhat fast server CPU Werkzeug's multipart parser should now be able to saturate a gigabit ethernet link!

I'm happy with the result. :)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet