Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the Multipart parsing into a Sans-IO layer #2017

Merged
merged 2 commits into from
Jan 26, 2021

Conversation

pgjones
Copy link
Member

@pgjones pgjones commented Jan 24, 2021

This allows it to be used in async (ASGI) frameworks. It also
hopefully makes the code a little clearer to follow.

This removes the Content-Transfer-Encoding support as RFC7578
states it is deprecated and

Currently, no deployed implementations that send such bodies have
been discovered.

This requires the dataclasses backport, which I think is ok as
dataclasses are in the stdlib, and 3.6 has less than a year till EoL.

Checklist:

  • Add tests that demonstrate the correct behavior of the change. Tests should fail without the change.
  • Add or update relevant docs, in the docs folder and in code.
  • Add an entry in CHANGES.rst summarizing the change and linking to the issue.
  • Add .. versionchanged:: entries in any relevant code docs.
  • Run pre-commit hooks and fix any issues.
  • Run pytest and tox, no tests failed.

src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
) -> None:
self.buffer = bytearray()
self.complete = False
self.max_form_memory_size = max_form_memory_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we work on #1513, this probably isn't necessary any more. It's not intuitive, as it's tracking how much data is currently in the buffer, not how much has been received.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer us to fix #1513 in a separate change, rather than with this (will make this easier for me to manage and hopefully merge). Maybe with the body attribute idea I mentioned.

src/werkzeug/formparser.py Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
src/werkzeug/formparser.py Outdated Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
src/werkzeug/sansio/multipart.py Outdated Show resolved Hide resolved
@davidism
Copy link
Member

Nice work, the code is easy for me to follow and well commented. Had a bunch of questions and ideas as I was reviewing.


current_part: Union[Field, File]
for data in iterator:
parser.receive_data(data)
Copy link
Member

@davidism davidism Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this loop would be useful in the sansio class, as a higher level API that yields any available key, value pairs after sending data in. I'm imagining this method could then look something like this:

decoder = MultipartDecoder()

while True:
    data = stream.read(self.buffer_size)
    
    if not data:
        break  # raise if decoder not completed?

    decoder.receive_data(data)

    # iterator stops when complete or waiting for data
    for item in decoder:  # item = (key, value)
        if isinstance(item[1], str):
            fields.append(item)
        else:  # FileStorage
            files.append(item)

    # break if decoder completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a nicer API, however the file data needs to be streamed to disc as it is parsed (say for large files). This prevents it being yielded in a key, value pair as the disc access is IO and hence can not be SansIO.

This allows it to be used in async (ASGI) frameworks. It also
hopefully makes the code a little clearer to follow.

This removes the ``Content-Transfer-Encoding`` support as RFC7578
states it is deprecated and

    Currently, no deployed implementations that send such bodies have
    been discovered.

This removes the IE6 name fix as IE6 is no longer supported (either
directly or by Werkzeug).

This requires the dataclasses backport, which I think is ok as
dataclasses are in the stdlib, and 3.6 has less than a year till EoL.

The API is based on that successfully used by h11.
@davidism davidism merged commit 8bbc58e into pallets:master Jan 26, 2021
@pgjones
Copy link
Member Author

pgjones commented Jan 26, 2021

Thanks!

@pgjones pgjones deleted the multipart branch January 26, 2021 09:27
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor form parsing to be sans-io werkzeug.formparser is really slow with large binary uploads
2 participants