werkzeug.serving.DechunkedInput.read returns more than what was asked for #2021

Closed
ghost opened this issue Jan 27, 2021 · 10 comments · Fixed by #2075
ghost commented Jan 27, 2021

I believe this is a bug in werkzeug 1.0.1 and not Flask because the error is produced by request.stream.read(), which is a werkzeug.serving.DechunkedInput instance. Please let me know if this actually belongs to Flask, and I will report it there instead. Note that I have used Flask in these examples for simplicity and because this is where I first found the issue.

With Python 3.9 on Windows 10, given the following Flask app (pip install flask==1.1.2) as server.py:

from flask import Flask, request

SIZE = 320000
app = Flask('bug')

@app.route('/bug', methods=['POST'])
def bug():
    chunk = request.stream.read(SIZE)
    if len(chunk) > SIZE:
        raise ValueError(f'bug in {type(request.stream)}; read({SIZE}) returned {len(chunk)} bytes')

    return 'ok'

if __name__ == "__main__":
    app.run()

And the following client.py file (pip install aiohttp==3.7.3):

import asyncio
import aiohttp

URL = 'http://localhost:5000/bug'
EXTRA = 44
CHUNK = 4000
DELAY = 0.125

async def req():
    yield bytes(EXTRA)
    while True:
        yield bytes(CHUNK)
        await asyncio.sleep(DELAY)

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.post(URL, data=req()) as resp:
            async for line in resp.content:
                print(line)

asyncio.run(main())

Running python server.py followed by python client.py in another terminal will eventually lead to the following error:

[2021-01-27 16:20:20,419] ERROR in app: Exception on /bug [POST]
Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\flask\app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Python39\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Python39\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "C:\Python39\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "C:\Python39\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Python39\lib\site-packages\flask\app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:\server.py", line 10, in bug
    raise ValueError(f'bug in {type(request.stream)}; read({SIZE}) returned {len(chunk)} bytes')
ValueError: bug in <class 'werkzeug.serving.DechunkedInput'>; read(320000) returned 320044 bytes

I expected read to never return more than size, but it did. According to PEP 3333 - Input and Error Streams, read is defined on the input stream as follows (strong emphasis mine):

The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:

  1. The server is not required to read past the client's specified Content-Length, and should simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.

And the Python Library Reference for io.RawIOBase.read reads:

Read up to size bytes from the object and return them. As a convenience, if size is unspecified or -1, all bytes until EOF are returned. Otherwise, only one system call is ever made. Fewer than size bytes may be returned if the operating system call returns fewer than size bytes.

If 0 bytes are returned, and size was not 0, this indicates end of file. If the object is in non-blocking mode and no bytes are available, None is returned.

The default implementation defers to readall() and readinto().

Note that it says "Read up to size", not "Read at least size". Werkzeug's behaviour does not match what the documentation says it should be, which I think is a bug.

Because the default read implementation delegates to readinto, the buggy implementation is likely here, although I have not investigated further:

def readinto(self, buf: bytearray) -> int: # type: ignore
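To illustrate the mechanism being described: if a RawIOBase subclass's readinto writes past the end of the bytearray it is handed, slice assignment silently grows the bytearray, and the default read returns the oversized result. A minimal sketch of that mechanism (hypothetical code, not Werkzeug's actual implementation):

```python
import io

class OverfillingInput(io.RawIOBase):
    # Hypothetical stand-in for DechunkedInput: readinto writes 44 bytes
    # more than the buffer was sized for, growing the bytearray in place.
    def readable(self):
        return True

    def readinto(self, buf):
        data = b"X" * (len(buf) + 44)
        buf[0:len(data)] = data  # slice assignment grows the bytearray
        return len(data)

chunk = OverfillingInput().read(100)
print(len(chunk))  # 144 -- more than the 100 bytes requested
```

This reproduces the shape of the reported error (320044 bytes from a read(320000)) without any HTTP involved.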

ghost commented Jan 27, 2021

In fact, I don't understand why the implementation tries to fill the entirety of buf, based on the following condition:

while not self._done and read < len(buf):

This method is not readall! It's simply read! It should not try to fill buf in its entirety. It should read as much as it can once (in "one system call") and then return. It should only block to perform that first read, not repeatedly block to fill buf.
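The expected behaviour could be sketched like this (hypothetical code, not Werkzeug's): readinto performs at most one underlying read, capped at len(buf), and returns how much it actually wrote:

```python
import io

class SingleReadInput(io.RawIOBase):
    # Hypothetical readinto with the semantics argued for above:
    # one underlying read of at most len(buf) bytes, then return.
    def __init__(self, source):
        self._source = source  # any object with a read(size) method

    def readable(self):
        return True

    def readinto(self, buf):
        data = self._source.read(len(buf))  # never asks for more than fits
        buf[:len(data)] = data
        return len(data)

stream = SingleReadInput(io.BytesIO(b"E" * 44 + b"C" * 400))
first = stream.read(100)
print(len(first))  # 100 -- never more than requested
```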

davidism (Member) commented Jan 27, 2021

Is this actually causing an issue? DechunkedInput is only used by the development server in order to provide basic chunked encoding support. It also doesn't appear to be something we can control, as read is provided by io.RawIOBase. All we know is that read has given us a buffer to read some amount into. readinto reads until the buffer is full, it's RawIOBase.read that needs to maintain that buffer and only return the requested amount. This seems like a bug in RawIOBase.read.

ghost commented Jan 27, 2021

Is this actually causing an issue?

Yes, in my code I have a ring buffer which I want to fill as much as possible without writing so much that I start losing data. Because read returns more than what can fit in my ring buffer, I lose the first few bytes.

the size is not provided to us

I don't have a lot of domain knowledge here, but isn't the size len(buf)?

readinto is allowed to read however much it wants to populate the buffer

This is debatable. PEP 3333 says it should behave like read, and read says that, if a size is provided, only one system call is ever made. Yet, by reading that much more, readinto violates this.

it's RawIOBase.read that needs to maintain that buffer and only return the requested amount

How is it supposed to do that, if it needs to create a buffer which it can pass to readinto and then return that buffer (thus losing control of that buffer it created)?

This seems like a bug in RawIOBase.read

It might be, although it seems strange that it's (seemingly) creating a buffer larger than requested?

davidism commented Jan 27, 2021

With the information I have, I'm calling this a bug in RawIOBase.read. That's the thing in charge of only returning a given amount of data.

ghost commented Jan 27, 2021

RawIOBase.read is implemented as follows (unless I'm looking at the wrong thing):

    b = PyByteArray_FromStringAndSize(NULL, n);
    if (b == NULL)
        return NULL;

    res = PyObject_CallMethodObjArgs(self, _PyIO_str_readinto, b, NULL);
    if (res == NULL || res == Py_None) {
        Py_DECREF(b);
        return res;
    }

I very much doubt PyByteArray_FromStringAndSize is allocating more than it should.

        alloc = size + 1;
        new->ob_bytes = PyObject_Malloc(alloc);

davidism commented

We only know the size of the buffer, that's the only thing we can use to know how much to read.

ghost commented Jan 27, 2021

The following server.py code exhibits the same behaviour:

from flask import Flask, request

SIZE = 320000
app = Flask('bug')

@app.route('/bug', methods=['POST'])
def bug():
    chunk = bytearray(SIZE)
    n = request.stream.readinto(chunk)
    if n > SIZE:
        raise ValueError(f'bug in {type(request.stream)}; read({SIZE}) returned {n} bytes')

    return 'ok'

if __name__ == "__main__":
    app.run()

No read in sight, only readinto, which Werkzeug controls. This is equivalent to what RawIOBase.read is doing behind the scenes, according to my understanding of the C code.

To clarify, the problem is readinto is reading beyond the size of the provided buf.
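The growth is easy to demonstrate in isolation: assigning past the end of a bytearray via a slice silently enlarges it rather than raising, which is presumably how the extra bytes end up in the caller's buffer:

```python
buf = bytearray(8)       # caller pre-allocates 8 bytes
data = b"E" * 12         # but 12 bytes get written into it
buf[0:len(data)] = data  # slice assignment grows the bytearray
print(len(buf))          # 12, not 8
```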

davidism commented
Great, thanks for reducing this to a minimal reproducible example. Happy to consider a PR since it seems like you're the most knowledgeable about this right now. I'm unlikely to have time to address this before 2.0 otherwise.

A bytearray can grow beyond its pre-allocated size, the same as if no size was given. Seems there's an issue with how we're tracking the size we've read compared to the size of the buffer. That seems like a good place to start.
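One possible shape for that fix (a hypothetical sketch, not the actual change in #2075): clamp every copy to the remaining capacity of buf and carry unconsumed chunk data over to the next call:

```python
import io

class CappedChunkInput(io.RawIOBase):
    # Hypothetical dechunked-style input: readinto never writes past
    # len(buf); unconsumed chunk data is kept for the next call.
    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        written = 0
        while written < len(buf):
            if not self._leftover:
                if not self._chunks:
                    break
                self._leftover = self._chunks.pop(0)
            # copy at most the remaining capacity, keep the rest for later
            take = min(len(buf) - written, len(self._leftover))
            buf[written:written + take] = self._leftover[:take]
            self._leftover = self._leftover[take:]
            written += take
        return written

stream = CappedChunkInput([b"E" * 44, b"C" * 400])
first = stream.read(100)
rest = stream.read(1000)
print(len(first), len(rest))  # 100 344
```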

ghost commented Jan 27, 2021

It appears SIZE = 800, EXTRA = 44 and CHUNK = 400 are enough to trigger this issue (844 > 800). I used the following code to see what aiohttp was sending:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('localhost', 5000))
s.listen(1)
c, _ = s.accept()

buffer = b''
while len(buffer) < 4096:
    buffer += c.recv(4096)

print(buffer.decode())

...and adjusted client.py to send E for the extra part and C for the chunks to more easily distinguish the two. This is what the socket, and thus Flask, receives and what Werkzeug has to deal with (note that the line endings are \r\n, as used by HTTP):

POST /bug HTTP/1.1
Host: localhost:5000
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.9 aiohttp/3.7.3
Content-Type: application/octet-stream
Transfer-Encoding: chunked

2c
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
190
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
190
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
190
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
(...infinite other 190 CCCC...)

This should make for a good testcase.
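For a test, the framing above is straightforward to regenerate without aiohttp; each chunk is its hex-encoded length, CRLF, payload, CRLF. A hypothetical helper matching the capture ("2c" is 0x2c == 44, "190" is 0x190 == 400):

```python
def chunk(data: bytes) -> bytes:
    # HTTP/1.1 chunked framing: hex size, CRLF, payload, CRLF
    return format(len(data), "x").encode() + b"\r\n" + data + b"\r\n"

body = chunk(b"E" * 44) + chunk(b"C" * 400) * 2
print(body[:4])  # b'2c\r\n'
```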

Saif807380 (Contributor) commented
I'd like to work on this.

davidism added this to the 2.0.0 milestone Mar 30, 2021
github-actions bot locked as resolved and limited conversation to collaborators Apr 14, 2021